# Research on profiles of free mobile apps

## 1. Introduction

* Before entering the market with new mobile apps, it is worth analyzing data from the two most popular stores.
* The main goal is to determine which types of _free_ apps are likely to attract the most users on Google Play and the App Store.

Fortunately, there are two available datasets on Kaggle:
* [Apple Store dataset documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)
* [Google Play dataset documentation](https://www.kaggle.com/lava18/google-play-store-apps)

Let's open them and convert each one to a list of lists:

In [2]:
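from csv import reader

def csv_to_list(csv_file):
    # Read the whole CSV file into a list of rows (each row is a list of strings).
    # The files are assumed to be UTF-8 encoded, since app names contain non-ASCII characters.
    with open(csv_file, encoding='utf8', newline='') as opened_file:
        return list(reader(opened_file))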

ios = csv_to_list('./AppleStore.csv')
android = csv_to_list('./googleplaystore.csv')

Let's create a function for easy exploration of the datasets:

In [4]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # prints two blank lines after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [100]:
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


In [144]:
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16


## 2. Data cleaning

### 2.1 Remove headers, inaccurate data and duplicates

#### 2.1.1 Android

First, let's remove the header row so that it doesn't interfere with further cleaning:

In [70]:
android_clean = android[1:]

Second, one row is missing its `'Category'` value. [Kaggle's discussion about it.](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015)

Let's delete this row:

In [77]:
del android_clean[10472]
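
The index above comes from the Kaggle discussion. As a minimal sketch of how such a row could be located programmatically, assuming the missing value leaves the row shorter than the header, we can compare each row's length to the header's. The data below is made up, and `find_malformed_rows` is a hypothetical helper, not part of the original notebook:

```python
# Toy data standing in for the Google Play rows (values are made up).
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # 'Category' is missing, so the row is one field short
    ['App C', 'GAME', '3.9'],
]

def find_malformed_rows(header, rows):
    """Return the indices of rows whose length differs from the header's."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]

bad_indices = find_malformed_rows(header, rows)
print(bad_indices)

# Delete from the end so that the earlier indices stay valid.
for i in reversed(bad_indices):
    del rows[i]
```

Deleting in reverse order matters only when several rows are malformed, but it keeps the loop safe in that case.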

Some apps have multiple entries in the dataset.

In [79]:
unique_names = []
doubled_names = []

for row in android_clean:
    name = row[0]
    if name in unique_names:
        doubled_names.append(name)
    else:
        unique_names.append(name)
        
print('Number of doubled names: ', len(doubled_names))
print('Examples: ', doubled_names[:10])

Number of doubled names:  1181
Examples:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


For example, `Slack` appears three times:

In [80]:
for row in android_clean:
    name = row[0]
    if name == 'Slack':
        print(row)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


To pick the most recent record for each app, we could use the columns `'Reviews'`, `'Installs'`, `'Last Updated'`, `'Current Ver'` and `'Android Ver'`.

In the example above only `'Reviews'` differs between the records, so let's keep the record with the highest number of reviews.

First, let's find the maximum number of reviews for each app:

In [82]:
reviews_max = {}

for row in android_clean:
    name = row[0]
    n_reviews = int(row[3])
    if name not in reviews_max:
        reviews_max[name] = n_reviews
    elif reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

Let's check that every unique app name got a value:

In [83]:
len(reviews_max) == len(unique_names)

True

Second, using the information from `reviews_max`, let's create a new dataset `android_unique` without duplicate entries:

In [84]:
android_unique = []
already_added = set()  # a set keeps the membership check fast

for row in android_clean:
    name = row[0]
    n_reviews = int(row[3])
    # keep only the first row that carries the maximum number of reviews
    if name not in already_added and n_reviews == reviews_max[name]:
        android_unique.append(row)
        already_added.add(name)
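
The two passes above can also be folded into one. Below is a minimal sketch, assuming we only need the first row with the highest review count for each name. `keep_most_reviewed` is a hypothetical helper; the `Slack` review counts are taken from the example above, and the `Box` row is made up:

```python
rows = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
    ['Box', 'BUSINESS', '4.2', '159872'],   # made-up values
    ['Slack', 'BUSINESS', '4.4', '51507'],
]

def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = int(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or n_reviews > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

unique_rows = keep_most_reviewed(rows)
for row in unique_rows:
    print(row)
```

One difference from the two-pass version: the result is ordered by the first appearance of each name, not by the position of the row that was kept.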

#### 2.1.2 iOS

Remove the header for further data cleaning:

In [78]:
ios_clean = ios[1:]

Two app names appear more than once in the dataset: **Mannequin Challenge** and **VR Roller Coaster**.
However, the entries behind each name turned out to be different apps, so nothing is removed here. [Kaggle discussion](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion/90409)

### 2.2 Remove non-English apps

We want to focus only on the English-speaking market, so I decided to remove all non-English apps from the datasets.

To determine whether an app is English, I check its name:
1. A symbol is considered non-English if its Unicode code point is greater than 127 (outside the ASCII range).
1. An app is considered non-English if its name contains more than 3 such symbols.

In [51]:
def is_english_app(name):
    non_eng_symbols = 0
    for symbol in name:
        if ord(symbol) <= 127:
            continue
        elif non_eng_symbols < 3:
            non_eng_symbols += 1
        else:
            return False
    return True

def remove_non_eng(dataset, name_index):
    dataset_eng = []
    for row in dataset:
        name = row[name_index]
        if is_english_app(name):
            dataset_eng.append(row)
    return dataset_eng

android_eng = remove_non_eng(android_unique, 0)
print('Number of English apps in the Android dataset: ', len(android_eng))

ios_eng = remove_non_eng(ios_clean, 1)
print('Number of English apps in the iOS dataset: ', len(ios_eng))

### 2.3 Isolate the free apps

We want to build free applications and monetize them via in-app ads, so I'm going to keep only the apps whose price is equal to zero.

In [97]:
def remove_non_free(dataset, price_index):
    dataset_free = []
    for row in dataset:
        price = float(row[price_index].replace('$', ''))
        if price == 0.0:
            dataset_free.append(row)
    return dataset_free

android_free = remove_non_free(android_eng, 7)
print('Number of free apps in the Android dataset: ', len(android_free))

ios_free = remove_non_free(ios_eng, 4)
print('Number of free apps in the iOS dataset: ', len(ios_free))
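
The `replace('$', '')` call is there because the two datasets may format prices differently. A minimal sketch of the conversion, assuming prices appear either as plain numbers (`'0'`, `'0.0'`) or with a leading dollar sign (`'$4.99'`); `parse_price` and `is_free` are hypothetical helpers:

```python
def parse_price(raw):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    return float(raw.replace('$', '').strip())

def is_free(raw):
    """An app is free when its parsed price is exactly zero."""
    return parse_price(raw) == 0.0

for raw in ['0', '0.0', '$4.99', '$0.99']:
    print(repr(raw), '->', is_free(raw))
```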

## 3. Analysis

### 3.1 Introduction

The overall goal is to build a successful mobile application for Google Play and the App Store. That is why we need to identify the profiles of applications that are both widely represented and popular on the two platforms.

To find the most common genres, we can use the following columns:
- Android
  - `'Category'`
  - `'Genres'`
- iOS
  - `'prime_genre'`

### 3.2 Frequency for a genre

In [105]:
def freq_table(dataset, index):
    freq_value = {}
    
    for row in dataset:
        value = row[index]
        if value in freq_value:
            freq_value[value] += 1
        else:
            freq_value[value] = 1
            
    freq_value_percent = {}
    total_values = len(dataset)
    
    for key in freq_value:
        percentage = freq_value[key] / total_values * 100
        freq_value_percent[key] = percentage
        
    return freq_value_percent

A function for printing a frequency table from the most frequent value to the least frequent:

In [122]:
def display_table(table):
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
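
For comparison, the same two steps can be sketched with the standard library's `collections.Counter` and a `key` function for sorting. This is an alternative under the same assumptions, not the approach used in the rest of this notebook; `freq_table_percent` and `sorted_entries` are hypothetical names and the rows are made up:

```python
from collections import Counter

def freq_table_percent(rows, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}

def sorted_entries(table):
    """Return (key, value) pairs ordered from the highest value to the lowest."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

rows = [
    ['App A', 'GAME'],
    ['App B', 'GAME'],
    ['App C', 'TOOLS'],
    ['App D', 'FAMILY'],
]
for category, percent in sorted_entries(freq_table_percent(rows, 1)):
    print(category, ':', percent)
```

Sorting by value only means that tied entries keep their original order, whereas `display_table` sorts `(value, key)` tuples and therefore breaks ties by key.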

In [123]:
display_table(freq_table(android_free, 1))

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In [124]:
display_table(freq_table(android_free, 9))

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

In [125]:
display_table(freq_table(ios_free, 11))

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


**Important notice:**
The analysis above was performed on the cleaned datasets, which contain only *free* apps aimed at an *English-speaking audience*.

##### App Store

1. From the App Store genre frequency table, we see that more than half of the applications are games (\~58%). The second most common genre, Entertainment (\~8%), is closely related to games.
2. Many genres hold only a small share, from 1% to 5%.
3. In the top 10 genres, 7 are about entertainment.

The outcome: entertainment applications (mostly games) are far more numerous in the App Store than any others (by number of apps, not by number of users).


##### Google Play

`'Category'`:
1. The most common category here is `FAMILY` with \~19%, followed by `GAME` with \~10%.
2. However, only 3 of the top 10 categories are about entertainment.

`'Genres'`:
1. The most common genre is 'Tools' with \~8%, followed by 'Entertainment' with \~6%.
2. In the top 10 we see the same pattern: only 3 entertainment genres.
3. Games are split across many genres, each holding from \~0.01% to \~3%.

##### Outcomes

The general trend from the results above is that entertainment applications (mostly games) hold the biggest share of the apps on offer.

### 3.3 Popularity of the genres

Let's create frequency tables for iOS genres and Android categories:

In [110]:
android_category_freq = freq_table(android_free, 1)
ios_genre_freq = freq_table(ios_free, 11)

Let's find the average number of installs for each Android category:

In [127]:
android_category_installs = {}

for category in android_category_freq:
    total = 0
    len_category = 0
    for row in android_free:
        category_app = row[1]
        if category_app == category:
            installs_num = row[5]
            installs_num = installs_num.replace('+', '')
            installs_num = installs_num.replace(',', '')
            total += int(installs_num)
            len_category += 1
    avg_num_installs = total / len_category
    android_category_installs[category] = avg_num_installs

display_table(android_category_installs)

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3695641.8198090694
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315
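
The nested loop above walks through the whole dataset once per category. As a minimal sketch of the same calculation done in a single pass, assuming installs are formatted like `'10,000+'`, we can accumulate totals and counts per category. `parse_installs` and `average_installs_by_category` are hypothetical helpers and the rows are made up:

```python
from collections import defaultdict

def parse_installs(raw):
    """Convert an installs string such as '10,000+' to an int."""
    return int(raw.replace('+', '').replace(',', ''))

def average_installs_by_category(rows, category_index, installs_index):
    """Return {category: average installs}, reading every row only once."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        category = row[category_index]
        totals[category] += parse_installs(row[installs_index])
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}

rows = [
    ['App A', 'TOOLS', '10,000+'],
    ['App B', 'TOOLS', '500,000+'],
    ['App C', 'GAME', '1,000,000+'],
]
print(average_installs_by_category(rows, 1, 2))
```

Keep in mind that values such as `'10,000+'` are lower bounds, so these averages are rough estimates either way.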

The iOS dataset doesn't contain the number of installs, so I decided to use the total number of user ratings (`rating_count_tot`) as a proxy.
So, let's find the average number of user ratings for each genre:

In [126]:
ios_genre_user_rat = {}

for genre in ios_genre_freq:
    total = 0
    len_genre = 0
    for row in ios_free:
        genre_app = row[11]
        if genre_app == genre:
            user_ratings_number = float(row[5])
            total += user_ratings_number
            len_genre += 1
    avg_num_user_rating = total / len_genre
    ios_genre_user_rat[genre] = avg_num_user_rating
    
display_table(ios_genre_user_rat)

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


I'm not quite sure which apps are behind the `COMMUNICATION` category, so let's find out:

In [149]:
# Named differently from the `android_category_installs` dictionary above
# so that the function does not overwrite it.
def installs_by_app(category_name):
    android_app_installs = {}
    for row in android_free:
        category = row[1]
        name = row[0]
        installs_num = row[5]
        installs_num = installs_num.replace('+', '')
        installs_num = installs_num.replace(',', '')
        installs_num = int(installs_num)
        if category == category_name:
            android_app_installs[name] = installs_num
    return android_app_installs

android_category_communication = installs_by_app('COMMUNICATION')
display_table(android_category_communication)

WhatsApp Messenger : 1000000000
Skype - free IM & video calls : 1000000000
Messenger – Text and Video Chat for Free : 1000000000
Hangouts : 1000000000
Google Chrome: Fast & Secure : 1000000000
Gmail : 1000000000
imo free video calls and chat : 500000000
Viber Messenger : 500000000
UC Browser - Fast Download Private & Secure : 500000000
LINE: Free Calls & Messages : 500000000
Google Duo - High Quality Video Calls : 500000000
imo beta free calls and text : 100000000
Yahoo Mail – Stay Organized : 100000000
Who : 100000000
WeChat : 100000000
UC Browser Mini -Tiny Fast Private & Secure : 100000000
Truecaller: Caller ID, SMS spam blocking & Dialer : 100000000
Telegram : 100000000
Opera Mini - fast web browser : 100000000
Opera Browser: Fast and Secure : 100000000
Messenger Lite: Free Calls & Messages : 100000000
Kik : 100000000
KakaoTalk: Free Calls & Text : 100000000
GO SMS Pro - Messenger, Free Themes, Emoji : 100000000
Firefox Browser fast & private : 100000000
BBM - Free Calls & Messages

In [145]:
for row in ios_free:
    if row[1] == 'WhatsApp Messenger':
        print(row)

['310633997', 'WhatsApp Messenger', '135044096', 'USD', '0.0', '287589', '73088', '4.5', '4.5', '2.17.22', '4+', 'Social Networking', '12', '0', '35', '1']


So, at least for WhatsApp Messenger, an app from the Android `COMMUNICATION` category is listed under `Social Networking` on iOS, which suggests that the two groups largely overlap.

Overall, I found some similarities between the two top 10s:
- Android (`VIDEO_PLAYERS`, `PHOTOGRAPHY`) and iOS (`Photo & Video`)
- Android (`SOCIAL`, `COMMUNICATION`) and iOS (`Social Networking`)
- Android (`TRAVEL_AND_LOCAL`) and iOS (`Travel`)

## 4. Conclusion

For now, I'd recommend focusing on these three groups.

Recommendations for further analysis:
- Look deeper into each category: some may be dominated by apps from giant corporations that we don't want to compete with.
- Clean the datasets further, for example by removing extremely popular and extremely unpopular applications, and see which categories lead afterwards.
- Look for applications that combine several features from the groups found above.
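
As a starting point for the second recommendation, here is a minimal sketch of how an average changes once the giants are excluded. The cut-off of 100 million installs is an arbitrary assumption, `average_installs` is a hypothetical helper, and the rows are made up:

```python
def average_installs(rows, installs_index, max_installs=None):
    """Average installs, optionally ignoring apps at or above `max_installs`."""
    values = [row[installs_index] for row in rows]
    if max_installs is not None:
        values = [value for value in values if value < max_installs]
    # Guard against an empty selection after filtering.
    return sum(values) / len(values) if values else 0.0

rows = [
    ['Messenger X', 1_000_000_000],
    ['Chat Y', 500_000],
    ['Mail Z', 100_000],
]
print('All apps:', average_installs(rows, 1))
print('Without giants:', average_installs(rows, 1, max_installs=100_000_000))
```

In this toy example a single app with a billion installs lifts the average by roughly three orders of magnitude, which is the kind of distortion worth checking for in categories such as `COMMUNICATION`.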