# Determening the characteristics of profitable mobile apps.

This project aims to determine the profile of mobile applications leading to increasing the number of users.
The results can be used in marketings purposes. 

Final goal is to find an app profile that will be potentially successful on both Google Play and App Store

The method used for determination is analysis of statistical data from Google Play (approximately 10.000 applications) and App Store (approximately 7000 applications)

## Introduction.

Our first step is to open files containing the datasets and extract dataset from them. 

In [1]:
from csv import reader

In [2]:
# Extracting android data
android_file_open = open('googleplaystore.csv')
android_file_read = reader(android_file_open)
android_dataset = list(android_file_read)

In [3]:
#Extracting apple data
apple_file_open = open('AppleStore.csv')
apple_file_read = reader(apple_file_open)
apple_dataset = list(apple_file_read)

### Initial look at the datasets. Android.

In this section we want to have an initial took at the Android dataset to determine the structure of the data.

In [4]:
print("The number of columns: ", len(android_dataset[0]))

The number of columns:  13


In [5]:
print("The dataset columns' names: " , android_dataset[0])

The dataset columns' names:  ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [6]:
print("The number of entries: ", len(android_dataset) - 1) #the first row contains the columns' names

The number of entries:  10841


In [46]:
print("Example of an entry: ", android_dataset[500])

Example of an entry:  ['stranger chat - anonymous chat', 'DATING', '3.5', '13202', '6.1M', '1,000,000+', 'Free', '0', 'Mature 17+', 'Dating', 'July 7, 2018', '2.4.1', '4.1 and up']


### Initial look at the datasets. Apple.

In this section we will determine the structure of the Apple dataset.

In [8]:
print("The number of columns: ", len(apple_dataset[0]))

The number of columns:  16


In [9]:
print("The dataset columns' names: " , apple_dataset[0])

The dataset columns' names:  ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [10]:
print("The number of entries: ", len(apple_dataset) - 1) #the first row contains the columns' names

The number of entries:  7197


In [48]:
print("Example of an entry: ", apple_dataset[1])

Example of an entry:  ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


## Data cleaning.

In this section we will prepare the data for analysis.
Preparation will consist of:
* Removing duplicates
* Removing invalid entries
* Removing non-free applications
* Removing non-English apps.

### Removing duplicates

Several entries for the same application may lead to statistical errors. So, we will check the datasets for duplicates and remove those. 

We start with the counting of the number of duplicates. We consider two entries *duplicates* if they represent the same application (both entries have the same application name)

In [12]:
# Takes the dataset in a form of list and the index of the column that contains the name.
# Returns the number of duplicates

def number_of_duplicates(dataset: list, name_index: int) -> int:
    uniques = set()
    duplicates = 0
    
    for row in dataset[1:]:
        app_name = row[name_index]
        if  app_name in uniques:
            duplicates += 1
        else:
            uniques.add(app_name)
        
    return duplicates

#### Technique

When we see the duplicate entries for one application, we would like to leave the entry with the maximum number of reviews because that's the newest one. 
To achieve that, we would create a dictionary called ```unique_entries``` the keys of which would be the unique names of the applications. Each value would be a diactionary of the number of reviews and the original row. 

We go through the dataset. When finding a duplicate of an entry for an application, we check if the number of reviews for in this entry larger than the saved one. If so, we update the saved data.
However, if we encounter an entry for an app for the first time, we just save its data alongside the number of reviews to the unique_entries dictionary. 

At the end we go through the dictionary, adding all its entries to the list of unique elements. The cleaning from the duplicates is finished

In [13]:
# Takes the dataset in a form of list,the index of the column that contains the name
# and the index of the column that contains the number of reviews.
# Returns the initial dataset without the duplicates. 
# The order of rows is not guaranteed.


def remove_duplicates(dataset: list, name_index: int, reviews_number_index: int) -> list:
    unique_entries = dict()

    for i in range(1, len(dataset)):
        if dataset[i][name_index] not in unique_entries:
            unique_entries[dataset[i][name_index]] = {'reviews': dataset[i][reviews_number_index], 'data': dataset[i]}
    else:
        if unique_entries[dataset[i][name_index]]['reviews'] < dataset[i][reviews_number_index]:
            unique_entries[dataset[i][name_index]] = {'reviews': dataset[i][reviews_number_index], 'data': dataset[i]}
            
    unique_list = list()
    unique_list.append(dataset[0])

    for app in unique_entries:
        unique_list.append(unique_entries[app]['data'])
        
    return unique_list

#### Android dataset 

Check how many duplicates the Android dataset contains.

In [14]:
duplicates = number_of_duplicates(android_dataset, 0)
print('The number of duplicates in Android dataset is: ', duplicates)

The number of duplicates in Android dataset is:  1181


Remove the duplicates of the Android dataset.
The name index is 0, the number of reviews column's index is 3.

In [15]:
unique_list_android = remove_duplicates(android_dataset, 0, 3)

In [16]:
# Number of unique items of the DS.
resulting_length = len(unique_list_android) - 1
print('The number of unique entries: ', resulting_length)


The number of unique entries:  9660


#### Apple dataset

Check if the Apple dataset contains duplicates. 

The two entries are duplicates if they have the same name of the applciation

In [17]:
duplicates = number_of_duplicates(apple_dataset, 1)
print('The number of duplicates in Apple dataset is: ', duplicates)

The number of duplicates in Apple dataset is:  2


We will use the same techniques of deleting duplicate rows for Apple dataset. 
The name column index is 1, the column index of the number of review is 5. 


In [18]:
unique_list_apple = remove_duplicates(apple_dataset, 1, 5)

In [19]:
# Number of unique items of the Apple DS.
resulting_length = len(unique_list_apple) - 1
print('The number of unique entries in Apple dataset: ', resulting_length)

The number of unique entries in Apple dataset:  7195


### Removing incorrect values.

#### Rows with wrong number of cells.

The one example of invalid rows is the rows with wrong number of cells. 
Android dataset contains 13 columns and each rows must have 13 cells. 
Apple dataset contains 16 columns, so each row must have the length of 16

In each dataset without duiplicates we check if each rows contains correct number of cells and remove invalid rows.

In [20]:
# Takes a dataset with heading row.
# Returns a dataset without rows with incorrect number of cells.

def check_rows_length(dataset: list) -> list:
    correct_rows = list()
    correct_rows.append(dataset[0])
    
    number_of_cols = len(dataset[0])
    
    for row in dataset[1:]:
        if len(row) == number_of_cols:
            correct_rows.append(row)
    
    return correct_rows

In [21]:
# Android
correct_rows_android = check_rows_length(unique_list_android)

In [22]:
removed_rows = len(unique_list_android) - len(correct_rows_android)
print('The number of removes entries: ', removed_rows)

The number of removes entries:  1


In [23]:
# Apple
correct_rows_apple = check_rows_length(unique_list_apple)

In [24]:
removed_rows = len(unique_list_apple) - len(correct_rows_apple)
print('The number of removes entries: ', removed_rows)

The number of removes entries:  0


#### Removing non-English apps.

Our analysis is concerned with applications for English-speaking audience. So, we need to remove applications not designed for them. 

To do that, we will remove all applications that contain not-english, not punctuational and not-digits in its name.

We start with a function detecting if a name given is owned by a foreign app.
It checks is the string contains more than three foreign symbols which could point to a foreign app.

In [32]:
def contains_foreign(text: str) -> bool:
    count = 0
    
    for char in text:
        if ord(char) > 127:
            count += 1
    
    return count > 3

In [33]:
def clear_foreign_names(dataset: list, name_column: int) -> list:
    cleaned_dataset = list()
    cleaned_dataset.append(dataset[0])
    for row in dataset[1:]:
        if not(contains_foreign(row[name_column])):
            cleaned_dataset.append(row)
    return cleaned_dataset

#### Android.
We will use these functions to clear the Android dataset from non-English apps.

In [43]:
final_android = clear_foreign_names(dataset = correct_rows_android, name_column = 0)

The number of deleted rows is:

In [44]:
#number of deleted rows.
print(len(correct_rows_android) - len(final_android))

45


#### Apple.
Now we use the same functions to clear the Apple dataset. The name of the app contains in the 1st column.

In [41]:
final_apple = clear_foreign_names(dataset = correct_rows_apple, name_column = 1)

The number of deleted rows is:

In [42]:
#number of deleted rows.
print(len(correct_rows_apple) - len(final_apple))

1014


### Removing not-free apps.
On this stage, we will remove non-free applications from the given datasets.

#### Android. 

In Android dataset, price colunm is the column number 7. When an app is free, it has a price of '0' in a form of a string.
Using this information, I will check all rows in the dataset and add all appropriate to a new dataset called ```free_android```

In [53]:
free_android = list()
free_android.append(final_android[0])
for row in final_android[1:]:
    if row[7] == '0':
        free_android.append(row)


The number of free apps in ```Android dataset``` is:

In [52]:
print(len(free_android))

8863


#### Apple
Analogically, the Apple dataset has column number 4 as a column containing the price. Also, free apps in the dataset has a value of this column as '0.0.'
I will conduct the same operations to fetch only free apps from the Apple dataset and consolidate them into ```free_apple``` dataset.

In [54]:
free_apple = list()
free_apple.append(final_apple[0])
for row in final_apple[1:]:
    if row[4] == '0.0':
        free_apple.append(row)


The number of free apps in ```Apple dataset``` is:

In [55]:
print(len(free_apple))

3221


## Data Analysis.

Since we are concerned of prediction of profitability for a potential app in both of the markets, we need to fild an app profile that leads to success of the application on both markets.


### Genre analysis.
Out first goal is to determine the genre of the application that has the highest rates on both Apple and Android markets. 

The next function works as a translator from a dataset to a frequency table based on the column with the given index.

In [91]:
def freq_table(dataset: list, index: int ) -> dict:
    total_number = len(dataset) - 1 #the first row is amissed as it's the headings row
    table = dict()
    
    for row in dataset[1:]:
        if row[index] not in table:
            table[row[index]] = 1
        else:
            table[row[index]] += 1
            
    for key in table:
        table[key] = round(table[key]/total_number*100, 2)
    return table


The following function displays the frequency table for a given dataset and a column index in a human-readable sorted way

In [92]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

#### Android frequency tables.

We will base our genre analysis for Android on columns 'Genres' and 'Category'.
Let's display frequency tables for them.

In [93]:
# The frequency table for the Category column.

display_table(free_android, 1)

FAMILY : 18.45
GAME : 9.87
TOOLS : 8.44
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.52
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.95
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.78
MAPS_AND_NAVIGATION : 1.4
EDUCATION : 1.29
FOOD_AND_DRINK : 1.24
ENTERTAINMENT : 1.13
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.84
WEATHER : 0.8
EVENTS : 0.71
ART_AND_DESIGN : 0.68
PARENTING : 0.65
COMICS : 0.62
BEAUTY : 0.6


In [94]:
# The frequency table for the Genres column.

display_table(free_android, 9)

Tools : 8.43
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.52
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.95
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.91
House & Home : 0.84
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Educational : 0.37
Board : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;

#### Apple frequency table.

We will base our genre analysis for Apple on the column "prime_genre".
Its frequency table is the following

In [95]:
display_table(free_apple, 11)

Games : 58.14
Entertainment : 7.89
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.52
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.34
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


The frequency tables we generated are only shows the number of  applications of each genres for both markets. 
However useful, they cannot on itself answer the question about their popularity. 
That's our next step.

### Most popular apps by genre on AppStore

In [105]:
genre_freq_table_apple = freq_table(free_apple, 11) # 11 is the 'prime_genre' column

for genre in genre_freq_table_apple:
    total_reviews = 0
    total_number = 0
    for row in free_apple[1:]:
        if row[11] == genre:
            total_reviews += int(row[5]) #rating_count_tot column
            total_number += 1
    average = total_reviews / total_number
    print(genre, average)


Social Networking 71548.34905660378
Photo & Video 28441.54375
Games 22812.92467948718
Music 57326.530303030304
Reference 74942.11111111111
Health & Fitness 23298.015384615384
Weather 52279.892857142855
Utilities 18684.456790123455
Travel 28243.8
Shopping 26919.690476190477
News 21248.023255813954
Navigation 86090.33333333333
Lifestyle 16485.764705882353
Entertainment 14029.830708661417
Food & Drink 33333.92307692308
Sports 23008.898550724636
Book 39758.5
Finance 31467.944444444445
Education 7003.983050847458
Productivity 21028.410714285714
Business 7491.117647058823
Catalogs 4004.0
Medical 612.0


According to this frequency table, the most popular applications are the applications for navigation and social networking. 

### Most popular apps by genre on PlayStore

To construct the most popular app for Android we will use the number of intallations. 


In [107]:
categories_freq_table_android = freq_table(free_android, 1) #categories column

for category in categories_freq_table_android:
    total_installations = 0
    len_category = 0
    
    for row in free_android[1:]:
        if row[1] == category:
            installations = row[5].replace(',','')
            installations = installations.replace('+','')
            total_installations += int(installations)
            len_category += 1
    average = total_installations/len_category
    
    print(category, average)



ART_AND_DESIGN 1905351.6666666667
AUTO_AND_VEHICLES 647317.8170731707
BEAUTY 513151.88679245283
BOOKS_AND_REFERENCE 8767811.894736841
BUSINESS 1712290.1474201474
COMICS 817657.2727272727
COMMUNICATION 38456119.167247385
DATING 854028.8303030303
EDUCATION 3082017.543859649
ENTERTAINMENT 21134600.0
EVENTS 253542.22222222222
FINANCE 1387692.475609756
FOOD_AND_DRINK 1924897.7363636363
HEALTH_AND_FITNESS 4188821.9853479853
HOUSE_AND_HOME 1313681.9054054054
LIBRARIES_AND_DEMO 638503.734939759
LIFESTYLE 1437816.2687861272
GAME 15837565.085714286
FAMILY 2691618.159021407
MEDICAL 120616.48717948717
SOCIAL 23253652.127118643
SHOPPING 7036877.311557789
PHOTOGRAPHY 17805627.643678162
SPORTS 3638640.1428571427
TRAVEL_AND_LOCAL 13984077.710144928
TOOLS 10695245.286096256
PERSONALIZATION 5201482.6122448975
PRODUCTIVITY 16787331.344927534
PARENTING 542603.6206896552
WEATHER 5074486.197183099
VIDEO_PLAYERS 24852732.40506329
NEWS_AND_MAGAZINES 9549178.467741935
MAPS_AND_NAVIGATION 4056941.7741935486


In [108]:
genre_freq_table_android = freq_table(free_android, 9) #categories column

for genre in genre_freq_table_android:
    total_installations = 0
    len_genre = 0
    
    for row in free_android[1:]:
        if row[9] == genre:
            installations = row[5].replace(',','')
            installations = installations.replace('+','')
            total_installations += int(installations)
            len_genre += 1
    average = total_installations/len_genre
    
    print(genre, average)

Art & Design 2122850.9433962265
Art & Design;Pretend Play 500000.0
Art & Design;Creativity 285000.0
Art & Design;Action & Adventure 100000.0
Auto & Vehicles 647317.8170731707
Beauty 513151.88679245283
Books & Reference 8767811.894736841
Business 1712290.1474201474
Comics 831873.1481481482
Comics;Creativity 50000.0
Communication 38456119.167247385
Dating 854028.8303030303
Education;Education 4759517.0
Education 540691.7721518987
Education;Creativity 2875000.0
Education;Music & Video 2033333.3333333333
Education;Action & Adventure 1000000.0
Education;Pretend Play 1800000.0
Education;Brain Games 5333333.333333333
Entertainment 5602792.775092937
Entertainment;Music & Video 6413333.333333333
Entertainment;Brain Games 3314285.714285714
Entertainment;Creativity 4000000.0
Events 253542.22222222222
Finance 1387692.475609756
Food & Drink 1924897.7363636363
Health & Fitness 4188821.9853479853
House & Home 1313681.9054054054
Libraries & Demo 638503.734939759
Lifestyle 1412998.3449275363
Lifestyle;

Based on the provided frequences alone, I can say that the most successful applications on Android are games and communication applications.

# Conclusion.

Although we cannot make a fully-supported decision about the profile of a succesful application  based on our analysis, we can say that the most popular applications across both platforms are communication applications and games.
That means that successful application can be a good quality spcial game connecting both of the aspects.

That information can be used for future analysis.