# Determining what types of apps attract the most users

### This project is part of the DataQuest data science track.

#### The goal of this project is to analyze our current data to see what factors attract the most users. This information will allow our developers to prioritize the apps that are most likely to maximize revenue.

We will be exploring data from two sources: 
1. [This data](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv) collected in August 2018 about 10,000 Android apps from Google Play. [Data Description](https://www.kaggle.com/lava18/google-play-store-apps)
2. [This data](https://dq-content.s3.amazonaws.com/350/AppleStore.csv) collected in July 2017 about 7,000 iOS apps from the App Store. [Data Description](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

First let's open the two datasets.

In [1]:
from csv import reader

# Read each CSV into a list of rows; the first row of each list is the header.
# The with-blocks close the files once they have been read.
with open('googleplaystore.csv', encoding="utf8") as opened_file:
    android_data = list(reader(opened_file))

with open('AppleStore.csv', encoding="utf8") as opened_file:
    ios_data = list(reader(opened_file))
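Both files are read in the same way, so the pattern can be wrapped in a small helper. The following is only a sketch: `load_csv` is a hypothetical name that does not appear elsewhere in this notebook, and it assumes the files are UTF-8 encoded, as above.

```python
from csv import reader

def load_csv(path, encoding="utf8"):
    """Read a CSV file and return all rows (header included) as a list of lists."""
    # newline="" is what the csv module documentation recommends for file objects
    with open(path, encoding=encoding, newline="") as opened_file:
        return list(reader(opened_file))
```

With this helper, loading a dataset would be a single call such as `android_data = load_csv('googleplaystore.csv')`.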

The following function prints a slice of a dataset and, optionally, its number of rows and columns.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
#         print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Let's see what the data looks like and also get the size of each dataset.

In [3]:
print('ANDROID APPS\n')
explore_data(android_data, 0, 3, True)
print('\nIOS APPS\n')
explore_data(ios_data, 0, 3, True)

ANDROID APPS

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
Number of rows: 10842
Number of columns: 13

IOS APPS

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0

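According to [this discussion](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015), there is a problem with row 10472 in the Android data. That number does not count the header row, so in our list, which still includes the header, the row sits at index 10473. First, we look at this row to confirm the problem.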

In [4]:
print(android_data[10473])
print(len(android_data[10473]))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
12


There is indeed an issue here: the row has only 12 columns instead of 13. The `Category` value is missing, so every later value is shifted one column to the left. We will delete this row from the dataset.

In [5]:
# Run this cell only once: running it again would delete a different, valid row.
del android_data[10473]
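Rather than relying on a row number reported elsewhere, malformed rows can also be found by comparing each row's length with the header's. This is a minimal sketch: `find_malformed_rows` is a hypothetical helper, and the sample data is made up for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's length."""
    expected = len(dataset[0])  # the header defines the expected column count
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Made-up sample: the last row is missing one column.
sample = [
    ['App', 'Category', 'Rating'],
    ['Some app', 'TOOLS', '4.1'],
    ['Broken app', '1.9'],
]
print(find_malformed_rows(sample))
```

Running the same check on `android_data` before the deletion should report any row with the wrong number of columns.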

Next, let's remove the duplicate items since we do not want to count the same item multiple times in our analysis. 

First, let's show that there are duplicates in our lists.

In [6]:
def get_duplicates(in_list, name_index):
    app_names = []
    duplicate_names = []
    for item in in_list:
        app_name = item[name_index]
        if app_name in app_names and app_name not in duplicate_names:
            duplicate_names.append(app_name)
        if app_name not in app_names:
            app_names.append(app_name)
    return duplicate_names

def print_duplicates(in_list, duplicate_list, name_index, num_to_print=1):
    for i in range(num_to_print):
        app_name = duplicate_list[i]
        print('The following lists duplicate entries for the app named '+app_name)
        for row in in_list:
            if row[name_index] == app_name:
                print(row)
        print('\n')

android_duplicates = get_duplicates(android_data, 0)
print('Android Data has '+str(len(android_duplicates))+' apps with duplicate data')
print("Here's one example from the Android data:")
print_duplicates(android_data, android_duplicates, 0)
ios_duplicates = get_duplicates(ios_data, 1)
print('IOS Data has '+str(len(ios_duplicates))+' apps with duplicate data')
print("Here's one example from the IOS data:")
print_duplicates(ios_data, ios_duplicates, 1)

Android Data has 798 apps with duplicate data
Here's one example from the Android data:
The following lists duplicate entries for the app named Quick PDF Scanner + OCR FREE
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


IOS Data has 2 apps with duplicate data
Here's one example from the IOS data:
The following lists duplicate entries for the app named Mannequin Challenge
['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', 

One important factor that changes in these duplicate rows is the total number of reviews. Since this is presumably one of the high-impact criteria to determine which apps are the most liked, we will keep the rows with the highest total review count rather than randomly removing duplicate rows.

In [7]:
# To get this done, we will be using a dictionary approach. Here's the pseudo code: 
# 1. Create an empty dictionary 'max_review'
# 2. Iterate through the list of apps and 
#    a. If the name is not in the dictionary, add it as key and add the number of reviews as value.
#    b. If the name is already in, compare the review number to the current value. If the current value is 
#       smaller, update it with the new one. 
# 3. Now create two new lists: 'data_cleaned' to store the cleaned data and 'already_added' to store names 
#    of apps already added. 
# 4. Finally, go through the original dataset again and 
#    a. If the name of the app is not in 'already_added' and the review count is the same as in 'max_review'
#       i. Add the row to 'data_cleaned' and add the app name to 'already_added'
def remove_duplicates(in_list, name_index, review_count_index):
    # The first row is the header; keep it aside so it is not treated as an app.
    header = in_list[0]
    rows = in_list[1:]
    max_review = {} # Step 1
    for row in rows: # Step 2
        app_name = row[name_index]
        # Convert to a number: raw strings compare character by character ('967' > '80805' is True).
        review_count = float(row[review_count_index])
        if app_name in max_review: # Step 2.b
            if review_count > max_review[app_name]:
                max_review[app_name] = review_count
        else: # Step 2.a
            max_review[app_name] = review_count
    data_cleaned = [header] # Step 3
    already_added = [] # Step 3
    for row in rows: # Step 4
        app_name = row[name_index]
        review_count = float(row[review_count_index])
        if (app_name not in already_added) and (review_count == max_review[app_name]): # Step 4.a
            data_cleaned.append(row) # Step 4.a.i
            already_added.append(app_name)
    return data_cleaned

# Alternate way to do the same thing (although probably much less efficient since we go through the whole dataset many more times)
def delete_duplicates(in_list, duplicate_names, name_index, review_count_index):
    rows_to_delete = []
    for name in duplicate_names:
        duplicate_entries = []
        for rownumber in range(len(in_list)):
            if in_list[rownumber][name_index] == name:
                duplicate_entries.append((rownumber, int(in_list[rownumber][review_count_index])))
        # Sort once per app name, after all of its rows have been collected
        sorted_entries = sorted(duplicate_entries, key=lambda x: x[1], reverse=True)
        for row in range(len(sorted_entries)-1):
            rows_to_delete.append(sorted_entries[row+1][0])
    rows_to_delete = sorted(rows_to_delete,  reverse = True) # We do the reverse because we always want to delete largest index first. Otherwise indices change.
    cleaned_list = in_list.copy()
    for index in rows_to_delete:
        del cleaned_list[index]
    return cleaned_list

# Let's look at the run time for the two methods to generate the same data for android_data
from time import time
start = time()
android_data_cleaned = remove_duplicates(android_data, 0, 3)
end = time()
print('The final dataset has length ', len(android_data_cleaned))
print('Time to run "remove_duplicates" on android_data was', end-start, 's')

start = time()
android_data_cleaned = delete_duplicates(android_data, android_duplicates, 0, 3)
end = time()
print('The final dataset has length ', len(android_data_cleaned))
print('Time to run "delete_duplicates" on android_data was', end-start, 's')

# Let's repeat for ios_data to see if the results are consistent (time was already imported above)
start = time()
ios_data_cleaned = remove_duplicates(ios_data, 1, 5)
end = time()
print('The final dataset has length ', len(ios_data_cleaned))
print('Time to run "remove_duplicates" on ios_data was', end-start, 's')

start = time()
ios_data_cleaned = delete_duplicates(ios_data, ios_duplicates, 1, 5)
end = time()
print('The final dataset has length ', len(ios_data_cleaned))
print('Time to run "delete_duplicates" on ios_data was', end-start, 's')

The final dataset has length  9660
Time to run "remove_duplicates" on android_data was 0.5106644630432129 s
The final dataset has length  9660
Time to run "delete_duplicates" on android_data was 3.6092934608459473 s
The final dataset has length  7196
Time to run "remove_duplicates" on ios_data was 0.2592136859893799 s
The final dataset has length  7196
Time to run "delete_duplicates" on ios_data was 0.004986763000488281 s
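The dictionary approach can also store the winning row itself, which removes the need for a second pass. The following is a sketch, not the method used above: `keep_most_reviewed` is a hypothetical helper that assumes the header row has already been removed. It converts the counts to `int` because values read from a CSV are strings, and strings compare character by character.

```python
def keep_most_reviewed(rows, name_index, review_count_index):
    """Keep one row per app name: the row with the highest review count.

    `rows` must not include the header row.
    """
    best = {}  # app name -> row with the highest review count seen so far
    for row in rows:
        name = row[name_index]
        count = int(row[review_count_index])
        if name not in best or count > int(best[name][review_count_index]):
            best[name] = row
    return list(best.values())

# Made-up sample rows: [name, review count]
sample_rows = [
    ['Quick PDF Scanner', '80805'],
    ['Quick PDF Scanner', '80804'],
    ['Coloring book', '967'],
    ['Coloring book', '80805'],
]
print('967' > '80805')  # True: string comparison is alphabetical, not numeric
print(keep_most_reviewed(sample_rows, 0, 1))
```

Because dictionaries preserve insertion order, the result lists each app in the order in which it first appeared.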


Since we are interested in free apps for an English-speaking audience, we should remove apps that do not have English names. We can use character code points to help us here. The characters commonly used in English text (letters, digits and punctuation) all fall in the ASCII range, 0 to 127, so a character whose code point, as returned by `ord()`, is greater than 127 is outside that set.

In [8]:
# Let's write a helper function to check whether more than 3 characters map to a number > 127.
# If so, the name is probably not English; allowing up to 3 leaves wiggle room for characters like emojis and (TM).

def is_it_English(inputstring):
    badchar = 0
    for char in inputstring:
        if ord(char)>127:
            badchar += 1
    if badchar > 3:
        return False
    return True

# Testing the function
print(is_it_English('Instagram'))
print(is_it_English('Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠'))
print(is_it_English('Docs To Go‚Ñ¢ Free Office Suite'))
print(is_it_English('Instachat üòú'))

True
False
True
True
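The same check can be written more compactly, with the tolerance exposed as a parameter. This is a sketch under the same assumption as above (at most 3 non-ASCII characters are tolerated); `is_mostly_ascii` and `max_non_ascii` are hypothetical names.

```python
def is_mostly_ascii(text, max_non_ascii=3):
    """Return True if `text` has at most `max_non_ascii` characters outside ASCII."""
    non_ascii = sum(1 for char in text if ord(char) > 127)
    return non_ascii <= max_non_ascii

print(is_mostly_ascii('Instagram'))
print(is_mostly_ascii('Docs To Go™ Free Office Suite'))
print(is_mostly_ascii('Instachat 😜', max_non_ascii=0))
```

Setting `max_non_ascii=0` makes the filter strict, which would also reject names that contain a single emoji or trademark sign.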


Let's use this new function to loop through the two lists and remove the apps we do not want.

In [9]:
def remove_non_English(in_list, name_index):
    final_list = []
    for row in in_list:
        name = row[name_index]
        if is_it_English(name):
            final_list.append(row)
    return final_list

android_data_English_only = remove_non_English(android_data_cleaned, 0)
ios_data_English_only = remove_non_English(ios_data_cleaned, 1)
print('Final Android data with English apps only has ', len(android_data_English_only)-1, 'data points.')
print('Final IOS data with English apps only has ', len(ios_data_English_only)-1, 'data points.')

Final Android data with English apps only has  9614 data points.
Final IOS data with English apps only has  6181 data points.


Since our company is interested in free apps, let's further filter our results to include only free apps.

In [10]:
def remove_paid_apps(in_list, price_index):
    final_list = []
    final_list.append(in_list[0])
    for row in in_list[1:]:
        price = row[price_index]
        if price == '0' or price =='0.0':
            final_list.append(row)
    return final_list

android_data_free_English_only = remove_paid_apps(android_data_English_only, 7)
ios_data_free_English_only = remove_paid_apps(ios_data_English_only, 4)
print('Final Android data with free English apps only has ', len(android_data_free_English_only)-1, 'data points.')
print('Final IOS data with free English apps only has ', len(ios_data_free_English_only)-1, 'data points.')

Final Android data with free English apps only has  8864 data points.
Final IOS data with free English apps only has  3220 data points.
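Comparing against the exact strings `'0'` and `'0.0'` works for these two files, but a numeric comparison tolerates other spellings of zero. The sketch below assumes that paid prices may carry a leading `$` (for example `'$4.99'`); `is_free` is a hypothetical helper.

```python
def is_free(price):
    """Return True when a price string such as '0', '0.0' or '$0.00' means free."""
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Not a number at all (for example a header cell): treat as not free
        return False

print(is_free('0'), is_free('0.0'), is_free('$4.99'))
```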


# Most Common Apps by Genre

## Part One

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the `prime_genre` column of the App Store data set, and the `Genres` and `Category` columns of the Google Play data set.

## Part Two

We'll build two functions we can use to analyze the frequency tables:

- One function to generate frequency tables that show percentages
- Another function that we can use to display the percentages in a descending order


In [11]:
# Function to create a frequency table for the values in a column of interest.
# The frequencies are given in percentage.

def freq_table(in_list, col_index):
    total = len(in_list) - 1
    freqs = {}
    for row in in_list[1:]:
        colval = row[col_index]
        if colval in freqs:
            freqs[colval] += 1
        else:
            freqs[colval] = 1
    for key in freqs:
        freqs[key] = freqs[key]*100/total
    return freqs

# Function to display the values with the frequencies sorted. 

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
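For comparison, the standard library's `collections.Counter` can do the counting. This is a sketch rather than a replacement for the functions above: `freq_table_counter` is a hypothetical name, it expects rows without the header, and the sample rows are made up.

```python
from collections import Counter

def freq_table_counter(rows, col_index):
    """Percentage frequency table for one column; `rows` excludes the header."""
    counts = Counter(row[col_index] for row in rows)
    total = sum(counts.values())
    return {value: count * 100 / total for value, count in counts.items()}

# Made-up sample rows: [name, genre]
sample_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_table_counter(sample_rows, 1)
for genre, pct in sorted(table.items(), key=lambda kv: kv[1], reverse=True):
    print(genre, ':', pct)
```

Sorting the `(value, percentage)` pairs by percentage avoids having to build the reversed tuples used in `display_table`.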

In [12]:
# Frequency table for 'prime_genre' column of IOS data
display_table(ios_data_free_English_only, 11)

Games : 58.13664596273292
Entertainment : 7.888198757763975
Photo & Video : 4.968944099378882
Education : 3.6645962732919255
Social Networking : 3.2919254658385095
Shopping : 2.608695652173913
Utilities : 2.515527950310559
Sports : 2.142857142857143
Music : 2.049689440993789
Health & Fitness : 2.018633540372671
Productivity : 1.7391304347826086
Lifestyle : 1.5838509316770186
News : 1.3354037267080745
Travel : 1.2422360248447204
Finance : 1.1180124223602483
Weather : 0.8695652173913043
Food & Drink : 0.8074534161490683
Reference : 0.5590062111801242
Business : 0.5279503105590062
Book : 0.43478260869565216
Navigation : 0.18633540372670807
Medical : 0.18633540372670807
Catalogs : 0.12422360248447205


For free English iOS apps, Games represent the majority with 58% of the apps. Entertainment is a distant second at only 8%. At least for free English iOS apps, the main focus seems to be fun, whether that means games, the clear winner, or sports, music, food and drink, etc. More utilitarian genres (Education, Health & Fitness, Navigation, etc.) are clearly in the minority. Let's see whether the same can be said of free English Android apps.

In [13]:
# Frequency table for 'Genres' column of Android data
display_table(android_data_free_English_only, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.700361010830325
Medical : 3.5311371841155235
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.237815884476534
Action : 3.1024368231046933
Health & Fitness : 3.079873646209386
Photography : 2.9444945848375452
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.041967509025271
Dating : 1.861462093862816
Arcade : 1.8501805054151625
Video Players & Editors : 1.7712093862815885
Casual : 1.759927797833935
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418774
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075813

In [14]:
# Frequency table for 'Category' column of Android data
display_table(android_data_free_English_only, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.700361010830325
MEDICAL : 3.5311371841155235
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.237815884476534
HEALTH_AND_FITNESS : 3.079873646209386
PHOTOGRAPHY : 2.9444945848375452
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768953
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418774
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075813
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 0

For free English Android apps, games are not nearly as dominant as they are on iOS. Even if we treat `FAMILY` and `GAME` as interchangeable categories, together they represent only about 29% of the apps, compared to 58% for iOS. One interesting thing is that the `Genres` column has many more subcategories than the `Category` column. This may be useful when deciding what type of app to develop. We are not ready to make a recommendation yet, because we do not know the number of users for each type of app.

# Most Popular iOS Apps by Genre on the App Store

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

Let's start with calculating the average number of user ratings per app genre on the App Store. To do that, we'll need to do the following:

- Isolate the apps of each genre
- Add up the user ratings for the apps of that genre
- Divide the sum by the number of apps belonging to that genre (not by the total number of apps)


In [23]:
# First let's find the genres, then compute the average number of ratings for each one
genres_frequency = freq_table(ios_data_free_English_only, 11)
genres = genres_frequency.keys()
user_count_per_genre = {}
for genre in genres:
    ratings_total = 0
    genre_len = 0
    for row in ios_data_free_English_only[1:]:
        this_app_genre = row[11]
        if this_app_genre == genre:
            ratings_total += float(row[5])
            genre_len += 1
    user_count_per_genre[genre] = ratings_total/genre_len
# Now sort the data
sorted_user_count_per_genre = sorted(user_count_per_genre.items(), key = lambda kv: kv[1], reverse = True)
for item in sorted_user_count_per_genre:
    print('The '+item[0]+' app genre has an average of '+str(item[1])+' ratings.')
            

The Navigation app genre has an average of 86090.33333333333 ratings.
The Reference app genre has an average of 74942.11111111111 ratings.
The Social Networking app genre has an average of 71548.34905660378 ratings.
The Music app genre has an average of 57326.530303030304 ratings.
The Weather app genre has an average of 52279.892857142855 ratings.
The Book app genre has an average of 39758.5 ratings.
The Food & Drink app genre has an average of 33333.92307692308 ratings.
The Finance app genre has an average of 31467.944444444445 ratings.
The Photo & Video app genre has an average of 28441.54375 ratings.
The Travel app genre has an average of 28243.8 ratings.
The Shopping app genre has an average of 26919.690476190477 ratings.
The Health & Fitness app genre has an average of 23298.015384615384 ratings.
The Sports app genre has an average of 23008.898550724636 ratings.
The Games app genre has an average of 22812.92467948718 ratings.
The News app genre has an average of 21248.023255813954
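The three steps listed earlier (isolate, add up, divide) can also be done in a single pass over the data by keeping a running total and a running count per genre. This is a sketch with a hypothetical helper, `average_by_group`; it expects rows without the header, and the sample rows are made up.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Average of the `value_index` column for each distinct `group_index` value."""
    totals = defaultdict(float)  # group -> sum of values
    counts = defaultdict(int)    # group -> number of rows
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up sample rows: [name, genre, rating count]
sample_rows = [['A', 'Navigation', '10'], ['B', 'Navigation', '30'], ['C', 'Music', '5']]
print(average_by_group(sample_rows, 1, 2))
```

This visits every row once, whereas the loop above scans the whole dataset once per genre.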

Based on this data, Navigation apps tend to have the most user ratings, while Medical apps have the fewest. This seems to suggest that, for the most engagement, we should develop Navigation, Reference and Social Networking apps. Is this true? Let's look at the apps within these genres to get a little more insight.

In [24]:
for row in ios_data_free_English_only[1:]:
    if row[11] == 'Navigation':
        print(row[1]+':'+row[5])

Waze - GPS Navigation, Maps & Real-time Traffic:345046
Google Maps - Navigation & Transit:154911
Geocaching®:12811
CoPilot GPS – Car Navigation & Offline Maps:3582
ImmobilienScout24: Real Estate Search in Germany:187
Railway Route Search:5


In [26]:
for row in ios_data_free_English_only[1:]:
    if row[11] == 'Reference':
        print(row[1]+':'+row[5])

Bible:985920
Dictionary.com Dictionary & Thesaurus:200047
Dictionary.com Dictionary & Thesaurus for iPad:54175
Google Translate:26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran:18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition:17588
Merriam-Webster Dictionary:16849
Night Sky:12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE):8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools:4693
GUNS MODS for Minecraft PC Edition - Mods Tools:1497
Guides for Pokémon GO - Pokemon GO News and Cheats:826
WWDC:762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free:718
VPN Express:14
Real Bike Traffic Rider Virtual Reality Glasses:8
教えて!goo:0
Jishokun-Japanese English Dictionary & Translator:0


In [27]:
for row in ios_data_free_English_only[1:]:
    if row[11] == 'Social Networking':
        print(row[1]+':'+row[5])

Facebook:2974676
Pinterest:1061624
Skype for iPhone:373519
Messenger:351466
Tumblr:334293
WhatsApp Messenger:287589
Kik:260965
ooVoo – Free Video Call, Text and Voice:177501
TextNow - Unlimited Text + Calls:164963
Viber Messenger – Text & Call:164249
Followers - Social Analytics For Instagram:112778
MeetMe - Chat and Meet New People:97072
We Heart It - Fashion, wallpapers, quotes, tattoos:90414
InsTrack for Instagram - Analytics Plus More:85535
Tango - Free Video Call, Voice and Chat:75412
LinkedIn:71856
Match™ - #1 Dating App.:60659
Skype for iPad:60163
POF - Best Dating App for Conversations:52642
Timehop:49510
Find My Family, Friends & iPhone - Life360 Locator:43877
Whisper - Share, Express, Meet:39819
Hangouts:36404
LINE PLAY - Your Avatar World:34677
WeChat:34584
Badoo - Meet New People, Chat, Socialize.:34428
Followers + for Instagram - Follower Analytics:28633
GroupMe:28260
Marco Polo Video Walkie Talkie:27662
Miitomo:23965
SimSimi:23530
Grindr - Gay and same sex guys chat

It looks like the genres with the highest average number of ratings are driven mostly by a few apps: Waze and Google Maps for Navigation, Bible for Reference, and Facebook and Pinterest for Social Networking. We also do not want to go into an over-represented category. So maybe we should be looking at the Health & Fitness or Games genres instead.
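One way to see how much a few apps pull up an average is to compare the mean with the median. The sketch below uses the six Navigation rating counts printed earlier; the choice of the median and of a "top two share" as measures is an assumption made for illustration.

```python
from statistics import mean, median

# Rating counts of the six free English Navigation apps listed earlier
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

nav_mean = mean(navigation_ratings)      # about 86090.33, the genre average reported earlier
nav_median = median(navigation_ratings)  # 8196.5, the midpoint of the two middle values

# Share of all Navigation ratings that belongs to the two largest apps
top_two_share = sum(sorted(navigation_ratings, reverse=True)[:2]) / sum(navigation_ratings)

print(nav_mean, nav_median, round(top_two_share * 100, 1))
```

The mean is roughly ten times the median here, and the two largest apps hold about 97% of the ratings, which is consistent with the genre average being driven by Waze and Google Maps.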

In [29]:
for row in ios_data_free_English_only[1:]:
    if row[11] == 'Games':
        print(row[1]+':'+row[5])

Clash of Clans:2130805
Temple Run:1724546
Candy Crush Saga:961794
Angry Birds:824451
Subway Surfers:706110
Solitaire:679055
CSR Racing:677247
Crossy Road - Endless Arcade Hopper:669079
Injustice: Gods Among Us:612532
Hay Day:567344
PAC-MAN:508808
DragonVale:503230
Head Soccer:481564
Despicable Me: Minion Rush:464312
The Sims™ FreePlay:446880
Sonic Dash:418033
8 Ball Pool™:416736
Tiny Tower - Free City Building:414803
Jetpack Joyride:405647
Bike Race - Top Motorcycle Racing Games:405007
Kim Kardashian: Hollywood:397730
Trivia Crack:393469
WordBrain:391401
Sniper 3D Assassin: Shoot to Kill Gun Game:386521
Flow Free:373857
Geometry Dash Lite:370370
▻Sudoku:359832
Fruit Ninja®:327025
Pixel Gun 3D:301182
Temple Run 2:295211
My Horse:293857
Word Cookies!:287095
Dragon City Mobile:277268
The Simpsons™: Tapped Out:274501
Plants vs. Zombies™ 2:267394
Clash Royale:266921
Pokémon GO:257627
CSR Racing 2:257100
Star Wars™: Commander:253448
Boom Beach:241929
MARVEL Contest of Champions

In [30]:
for row in ios_data_free_English_only[1:]:
    if row[11] == 'Health & Fitness':
        print(row[1]+':'+row[5])

Calorie Counter & Diet Tracker by MyFitnessPal:507706
Lose It! – Weight Loss Program and Calorie Counter:373835
Weight Watchers:136833
Sleep Cycle alarm clock:104539
Fitbit:90496
Period Tracker Lite:53620
Nike+ Training Club - Workouts & Fitness Plans:33969
Plant Nanny - Water Reminder with Cute Plants:27421
Sworkit - Custom Workouts for Exercise & Fitness:16819
Clue Period Tracker: Period & Ovulation Tracker:13436
Headspace:12819
Fooducate - Lose Weight, Eat Healthy,Get Motivated:11875
Runtastic Running, Jogging and Walking Tracker:10298
WebMD for iPad:9142
8fit - Workouts, meal plans and personal trainer:8730
Garmin Connect™ Mobile:8341
Record by Under Armour, connects with UA HealthBox:7754
Fitstar Personal Trainer:7496
My Cycles Period and Ovulation Tracker:7469
Seven - 7 Minute Workout Training Challenge:6808
RUNNING for weight loss: workout & meal plans:6407
Lifesum – Inspiring healthy lifestyle app:5795
Waterlogged - Daily Hydration Tracker:5000
J&J Official 7 Minute Worko

The rating counts in these two genres look much more evenly spread, rather than being driven mainly by one or two apps. What about Android apps, though? Which categories are the most popular there?

## Most Popular Android Apps by Genre on Google Play

For the Google Play market, we actually have data about the number of installs (`Installs` column), so we should be able to get a clearer picture of genre popularity. However, the install numbers are not precise. Most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [31]:
display_table(android_data_free_English_only, 5)

1,000,000+ : 15.72653429602888
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.1985559566787
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.772111913357401
5,000+ : 4.512635379061372
10+ : 3.542418772563177
500+ : 3.2490974729241877
50,000,000+ : 2.3014440433212995
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.7897111913357401
1+ : 0.5076714801444043
500,000,000+ : 0.27075812274368233
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


This is problematic because we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we only want to get an idea of which app genres attract the most users, so we don't need perfect precision. We can treat each label as the actual number of installs: an app with 100,000+ installs is counted as having 100,000 installs, an app with 1,000,000+ installs as having 1,000,000 installs, and so on. Every value is therefore a lower bound.

To perform computations, however, we'll need to convert each install number to `float`, which means cleaning the strings by removing the commas and + signs. We'll do this directly in the loop below, where we also compute the average number of installs for each genre (category).

In [33]:
android_categories = freq_table(android_data_free_English_only, 1)
genres = android_categories.keys()
user_count_per_genre = {}
for genre in genres:
    install_total = 0
    genre_len = 0
    for row in android_data_free_English_only[1:]:
        this_app_genre = row[1]
        if this_app_genre == genre:
            num_installs = row[5]
            num_installs = num_installs.replace(',','')
            num_installs = num_installs.replace('+','')
            install_total += float(num_installs)
            genre_len += 1
    user_count_per_genre[genre] = install_total/genre_len
# Now sort the data
sorted_user_count_per_genre = sorted(user_count_per_genre.items(), key = lambda kv: kv[1], reverse = True)
for item in sorted_user_count_per_genre:
    print('The '+item[0]+' app genre has an average of '+str(item[1])+' installs.')

The COMMUNICATION app genre has an average of 38456119.167247385 installs.
The VIDEO_PLAYERS app genre has an average of 24727872.452830188 installs.
The SOCIAL app genre has an average of 23253652.127118643 installs.
The PHOTOGRAPHY app genre has an average of 17840110.40229885 installs.
The PRODUCTIVITY app genre has an average of 16787331.344927534 installs.
The GAME app genre has an average of 15588015.603248259 installs.
The TRAVEL_AND_LOCAL app genre has an average of 13984077.710144928 installs.
The ENTERTAINMENT app genre has an average of 11640705.88235294 installs.
The TOOLS app genre has an average of 10801391.298666667 installs.
The NEWS_AND_MAGAZINES app genre has an average of 9549178.467741935 installs.
The BOOKS_AND_REFERENCE app genre has an average of 8767811.894736841 installs.
The SHOPPING app genre has an average of 7036877.311557789 installs.
The PERSONALIZATION app genre has an average of 5201482.6122448975 installs.
The WEATHER app genre has an average of 5074486.197183099 i
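The string cleaning done inside the loop can be isolated in a small function, which makes it easy to test on its own. This is a sketch: `parse_installs` is a hypothetical helper name.

```python
def parse_installs(installs):
    """Convert an install label such as '1,000,000+' to a float lower bound."""
    return float(installs.replace(',', '').replace('+', ''))

print(parse_installs('1,000,000+'))  # 1000000.0
print(parse_installs('0'))           # 0.0
```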

How do our proposed categories of `GAME` and `HEALTH_AND_FITNESS` do here? Let's look specifically at the apps in these categories that fall in the three highest install brackets (100,000,000+, 500,000,000+ and 1,000,000,000+).

In [38]:
for row in android_data_free_English_only[1:]:
    if row[1] == 'GAME':
        if row[5] == '1,000,000,000+' or row[5]=='500,000,000+' or row[5]=='100,000,000+':
            print(row[0],':',row[5])

Sonic Dash : 100,000,000+
PAC-MAN : 100,000,000+
Roll the Ball® - slide puzzle : 100,000,000+
Piano Tiles 2™ : 100,000,000+
Pokémon GO : 100,000,000+
Extreme Car Driving Simulator : 100,000,000+
Trivia Crack : 100,000,000+
Angry Birds 2 : 100,000,000+
Candy Crush Saga : 500,000,000+
8 Ball Pool : 100,000,000+
Subway Surfers : 1,000,000,000+
Candy Crush Soda Saga : 100,000,000+
Clash Royale : 100,000,000+
Clash of Clans : 100,000,000+
Plants vs. Zombies FREE : 100,000,000+
Pou : 500,000,000+
Flow Free : 100,000,000+
My Talking Angela : 100,000,000+
slither.io : 100,000,000+
Cooking Fever : 100,000,000+
Yes day : 100,000,000+
Score! Hero : 100,000,000+
Dream League Soccer 2018 : 100,000,000+
My Talking Tom : 500,000,000+
Sniper 3D Gun Shooter: Free Shooting Games - FPS : 100,000,000+
Zombie Tsunami : 100,000,000+
Helix Jump : 100,000,000+
Crossy Road : 100,000,000+
Temple Run 2 : 500,000,000+
Talking Tom Gold Run : 100,000,000+
Agar.io : 100,000,000+
Bus Rush: Subway Edition : 100,00

In [39]:
for row in android_data_free_English_only[1:]:
    if row[1] == 'HEALTH_AND_FITNESS':
        if row[5] == '1,000,000,000+' or row[5]=='500,000,000+' or row[5]=='100,000,000+':
            print(row[0],':',row[5])

Period Tracker - Period Calendar Ovulation Tracker : 100,000,000+
Samsung Health : 500,000,000+


There are not many `HEALTH_AND_FITNESS` apps in the top install brackets, but `GAME` still looks like a good category to look into. At least with the imprecise numbers we have, its installs are spread across many apps rather than concentrated in one or two. It looks like a valid category to pursue.

## Conclusions

For this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

Our conclusion is to go into the Games category, since it is popular on both platforms and an app in it could be profitable in both markets. Because the Android data is more granular, we could narrow this down to a specific subcategory of games, but that is left for future work.