# Coming up with a lucrative Mobile App idea - Ryan Lacarne
based on 2018 Apple Store and Google Play app data.

In this data science project, I will be trying to deduce which mobile apps are profitable based on App Store and Google Play Store data. Using this information, I will come up with an app idea of my own.

## Setting up the project

First up, I will open the two datasets, ensuring they are in the right directory and can be called.

In [3]:
from csv import reader

# Open each file in a `with` block so it is closed as soon as it has been read.
with open('googleplaystore.csv', encoding="utf8") as opened_file:
    android = list(reader(opened_file))
android_header = android[0]      # I keep the header and body separate so either can be used at any time.
android = android[1:]

with open('AppleStore.csv', encoding="utf8") as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

Next, I create the function `explore_data`.

In [4]:
def explore_data(dataset, start, end, rows_and_columns=False): 
    dataset_slice = dataset[start:end] # the `dataset_slice` variable allows me to explore specific slices.
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:    # this if statement will print the number of rows and columns if set to True.
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Next, I print the header and the first few rows of the Android data set to check that my `explore_data` function is working.

In [5]:
print (android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


This tells me we currently have 10841 apps in the Android data set.

Next, I print the header and the first few rows of the Apple App Store data set to double-check that my `explore_data` function is working.

In [6]:
print (ios_header)
print ('\n')
explore_data (ios,0,3,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


This tells me we currently have 7197 apps in the Apple App store data set.

From reading the comments on the page where I downloaded the [dataset](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), I found out that one row is malformed: `android[10472]`, which is the app 'Life Made WI-Fi Touchscreen Photo Frame'. As the printout below shows, the row has 12 values instead of 13 because the "Category" value is missing, so every later value is shifted one column to the left. I don't want that, as it will surely give me errors down the line, so I will delete this row altogether.

In [7]:
print(android[10472])
del android[10472]
print (len(android))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10840


Therefore, the Android dataset we are working with now contains 10840 apps.
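
A row like this can also be found without relying on the Kaggle discussion, because its number of fields differs from the header's. The following is a minimal sketch on made-up rows; `find_malformed_rows` is a hypothetical helper name and is not part of the original notebook.

```python
def find_malformed_rows(rows, header):
    """Return (index, row) pairs whose field count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Toy data (invented for illustration): the second row is missing its category.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '3.9'],
]

malformed = find_malformed_rows(toy_rows, toy_header)
print(malformed)  # [(1, ['App B', '1.9'])]
```

Run on `android` and `android_header` before the `del` statement, a check like this would be expected to flag the row at index 10472.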

## Data Cleaning pt.1: Duplicates

Now that that's out of the way, I want to start the data cleaning process.
First, I'll deal with duplicates, because at this point I realized that some apps appear more than once. For example, in our Android dataset there are 4 entries named "Instagram".

In [8]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Therefore, I wrote a loop that goes through every app in the dataset and sorts the names into two lists: one for names that have already been seen (duplicates), and one for names seen for the first time (unique).

In [9]:
duplicate_apps_android = []
unique_apps_android = []

for app in android:
    name = app[0]
    if name in unique_apps_android:
        duplicate_apps_android.append(name)
    else:
        unique_apps_android.append(name)
print ('Number of duplicate apps:',len(duplicate_apps_android))
print ('Number of unique apps:', len(unique_apps_android))

Number of duplicate apps: 1181
Number of unique apps: 9659


I found that there were 9659 unique app names and 1181 duplicate entries. I want to get rid of these duplicates, but I don't want to delete them at random. What I will do instead is keep the entry with the highest number of reviews, as that one is most likely to be the most recent and therefore the most relevant to our search.

To do this, I will create a dictionary where each key is a unique app name and the corresponding value is the highest number of reviews recorded for that app. After that, I will use this dictionary to build a new data set with only one entry per app name: the entry with the highest number of reviews.


In [10]:
reviews_max_android= {}
for app in android:
    name = app[0]
    n_reviews_android = float(app[3])
    
    if name in reviews_max_android and reviews_max_android[name] < n_reviews_android:
        reviews_max_android[name]= n_reviews_android
    elif name not in reviews_max_android:
        reviews_max_android[name]= n_reviews_android

Now, let's use this dictionary on the Android dataset to remove the duplicates. As explained, we'll keep the entries with the highest number of reviews.

In [11]:
android_clean=[]
already_added=[]

for app in android:
    name = app[0]
    n_reviews_android = float(app[3])
    if (reviews_max_android[name] == n_reviews_android) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)


Next, we will check that this worked and that `android_clean` is truly free of duplicates. To do this, I will check its length, which I expect to match the length of `unique_apps_android` (9659), and also search for the name "Instagram" to confirm that it now appears only once in our dataset.

In [12]:
android_clean_len = len(android_clean)
print ('The number of apps in our cleaned Android dataset is:',android_clean_len,'.')
print('\n')
for app in android_clean:
    name = app[0]
    if name == 'Instagram':
        print(app)

The number of apps in our cleaned Android dataset is: 9659 .


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Alright, so that worked! 9659 values, Instagram appears only once, and the entry that remains is the one with the most reviews (66577446). Awesome. Now, let's do this again for the App Store.

In [13]:
duplicate_apps_ios = []
unique_apps_ios = []

for app in ios:
    name = app[0]
    if name in unique_apps_ios:
        duplicate_apps_ios.append(name)
    else:
        unique_apps_ios.append(name)
print ('Number of duplicate apps:',len(duplicate_apps_ios))
print ('Number of unique apps:', len(unique_apps_ios))

Number of duplicate apps: 0
Number of unique apps: 7197


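Running the first part of our code shows zero duplicates for the App Store. Note that this check compares column 0, which in this data set is the app `id` rather than the name (`track_name` is column 1), so strictly speaking it confirms that no `id` is repeated. On that basis, we do not need to go any further and may move on.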
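
Because `ios_header` lists `id` at index 0 and `track_name` at index 1, it can be worth running the duplicate check on either column. Below is a minimal sketch using `collections.Counter` from the standard library; `duplicate_values` is a hypothetical helper and the rows are invented for illustration.

```python
from collections import Counter


def duplicate_values(rows, index):
    """Map each value in column `index` that occurs more than once to its count."""
    counts = Counter(row[index] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


# Toy rows (invented): column 0 is an id, column 1 is a name.
toy_rows = [
    ['1', 'Alpha'],
    ['2', 'Beta'],
    ['3', 'Alpha'],
]

print(duplicate_values(toy_rows, 0))  # {}
print(duplicate_values(toy_rows, 1))  # {'Alpha': 2}
```

A `Counter` uses hashing, so this also avoids the repeated list scans that `name in unique_apps_ios` performs on every row.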

## Data Cleaning pt.2: Non-English Apps

Next, as a part of our data cleaning process, I'll be getting rid of apps that have non-English names.

I know that the characters commonly used in English text belong to the ASCII character set, whose code points run from 0 to 127. Therefore, I create a function that flags strings containing any character outside that range. That will enable me to isolate non-English apps.

In [14]:
def is_english(string):
    for character in (string):
        if ord(character) > 127:
            return False
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
False


This is a good start, and it works; however, an app whose name contains a single emoji or other non-ASCII character (such as ™) would also be dropped, and those apps aren't necessarily non-English. I don't want that, so I create a function that keeps apps with up to 3 non-ASCII characters and discards any with more than 3. This heuristic is imperfect, but sometimes we just have to make decisions like this to get the most accurate results we reasonably can.

In [15]:
def is_english (string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii+=1
    if non_ascii > 3:
        return False
    else:
        return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True


I'm happy with my new and improved `is_english` function. I'll run that through both of my datasets.
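
The cutoff of 3 is a judgment call, so it can be handy to make it a parameter and see how the result changes. This is only a sketch: `is_mostly_ascii` and its `max_non_ascii` parameter are hypothetical names, not part of the notebook.

```python
def is_mostly_ascii(text, max_non_ascii=3):
    """Return True if `text` has at most `max_non_ascii` characters with a code point above 127."""
    non_ascii = sum(1 for character in text if ord(character) > 127)
    return non_ascii <= max_non_ascii


print(is_mostly_ascii('Instachat 😜'))                   # True
print(is_mostly_ascii('Instachat 😜', max_non_ascii=0))  # False
print(is_mostly_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # False
```

With the default of 3 it behaves like the second `is_english` above; with `max_non_ascii=0` it behaves like the first, stricter version.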

In [16]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)
        
explore_data (android_english, 0 , 0, True)
print ('\n')
explore_data (ios_english, 0 , 0 , True)


Number of rows: 9614
Number of columns: 13


Number of rows: 6183
Number of columns: 16


Now we can see that for Android, we are left with 9614 apps (from 9659), and for the Apple App Store, we are left with 6183 apps (from 7197).

## Data Cleaning pt.3: Isolating the Free Apps

Next, I know that I want to create a free app, so I only want free apps to be a part of my analysis. I write a loop for each store that builds a new dataset containing only the free apps from our existing, already "cleaned twice" dataset. At this step I must remember to use the most current version of each dataset, and not the initial one, so that I don't lose the work I have done.

In [17]:
android_en_free = []
ios_en_free = []

for app in android_english:
    app_type = app[6]    # column 6 is 'Type', which holds 'Free' for free apps
    if app_type == 'Free':
        android_en_free.append(app)

for app in ios_english:
    price = float(app[4])
    if price == 0.0:
        ios_en_free.append(app)
        

android_final = android_en_free
ios_final = ios_en_free

print ('The final number of Android apps is:', len(android_final))
print ('\n')
print ('The final number of Apple Store apps is:',len(ios_final))



The final number of Android apps is: 8863


The final number of Apple Store apps is: 3222


That was fairly easy. We now know that we are working with 8863 apps for Android, and 3222 for iPhone. We are satisfied with the level of cleaning that we have performed on our apps and are ready to move on to some calculations for analysis.

## Data Analysis pt. 1: Genres and Categories

This is where the fun begins. (Yes, I just quoted Anakin Skywalker, sue me.)

Our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps. Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets.

In order to start narrowing down, I begin my analysis by getting a sense of the most common genres for each market. For this, I build a frequency table for the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set.

In [18]:
def freq_table (dataset,index): ## frequency table function with two inputs: the dataset and a column index
    table = {}   ## empty dictionary that will map each value to its count
    total = 0    ## running count of rows, initialized to zero
    
    for row in dataset: ## for each row in the dataset, which is a list of lists
        total += 1      ## add 1 to the total, which ends up as the total number of rows/datapoints
        value = row[index] ## the value in the chosen column for this row
        if value in table:
            table[value] +=1
        else:
            table [value] = 1
    
    table_percentages = {}
    for key in table: 
        percentage = (table[key]/total)*100
        table_percentages[key] = percentage
        
    return table_percentages

Next up, I create a function that will enable me to display this frequency table for datasets. 

In [19]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

Now, I will run my two new functions on my datasets, as well as a quick few lines of my code that will tell me how many genres/categories are in each data set.

In [20]:
freq_apple_genre = freq_table(ios_final, 11)
freq_apple_genre_len = len(freq_apple_genre)
print ('This dataset contains',freq_apple_genre_len,'genres/categories.')
display_table(ios_final, 11) 

This dataset contains 23 genres/categories.
Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


My initial thoughts when looking at the Apple App Store data:
- Games are heavily saturated, holding a 58 percent share.
- Entertainment apps are a distant second, holding an 8 percent share.
- Every genre after that holds 5 percent or less. The genres are so varied that it's hard at first glance to get a sense of some overarching trend in what is popular (besides Games).
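
For comparison, the same kind of percentage table can be built with `collections.Counter` from the standard library. This is a minimal sketch on invented rows, with `freq_percentages` as a hypothetical name; it is an alternative to `freq_table` and `display_table`, not a replacement for them.

```python
from collections import Counter


def freq_percentages(rows, index):
    """Return {value: percentage of rows} for column `index`, most common first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.most_common()}


# Toy rows (invented for illustration); column 1 holds the genre.
toy_rows = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Games'],
    ['App D', 'Music'],
]

result = freq_percentages(toy_rows, 1)
print(result)  # {'Games': 75.0, 'Music': 25.0}
```

`Counter.most_common()` already returns the values sorted by count, so no separate sorting step is needed.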

In [21]:
freq_android_category = freq_table(android_final, 1)
freq_android_category_len = len(freq_android_category)
print ('This dataset contains',freq_android_category_len,' genres/categories.')
display_table(android_final, 1) #category column for android

This dataset contains 33  genres/categories.
FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARE

In [22]:
freq_android_genre = freq_table(android_final, -4)
freq_android_genre_len = len(freq_android_genre)
print ('This dataset contains',freq_android_genre_len,' genres/categories.')
display_table(android_final, -4) #genre column for android

This dataset contains 114  genres/categories.
Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.936477490

My initial thoughts when looking at the Android Store data:
- The genre column is extremely granular, with 114 different genres. Therefore I will focus on the category column. 
- Family apps dominate the dataset with 19 percent, roughly double the share of the "GAME" category (about 10 percent). This shows us something quite different from the Apple App Store, and something could be said about the types of people who own iPhones vs the types of people who own Androids from this simple metric; but that doesn't really help us right now, as we are looking to create an app that would find success in both stores.

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps, yet dominated by the family genre. 

From this, I'm starting to form a good idea of the app market: the "Games" market is incredibly saturated, and just because it is dominant definitely does not mean that it is a promising market to try to break into, as there is a large amount of supply. Instead, I think I will focus on some of the less represented genres, as I believe there may be more unmet demand there due to the lower supply.

But first, let's do some further analysis on our datasets.

## Data Analysis pt. 2: Number of Users

In [37]:
genres_ios = freq_table(ios_final, 11)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[11]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)



Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


From this, I can see that Navigation apps have the highest average number of user ratings per app, at 86090, followed by Reference apps, at 74942. I want to know more about these genres and see which apps they are made up of, to judge how competitive they are and whether a few apps hold most of the ratings.
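
The averages above are computed by looping over the whole dataset once per genre. As a sketch of an alternative, the same averages can be collected in a single pass with two dictionaries. `average_by_group` is a hypothetical helper and the rows below are invented.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: mean of column `value_index`} using a single pass over `rows`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy rows (invented): column 0 is the genre, column 1 is a rating count.
toy_rows = [
    ['Navigation', '100'],
    ['Navigation', '300'],
    ['Reference', '50'],
]

averages = average_by_group(toy_rows, 0, 1)
# Print the groups from highest to lowest average.
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', avg)
# Navigation : 200.0
# Reference : 50.0
```

For this project the call would presumably be `average_by_group(ios_final, 11, 5)`, since `prime_genre` is column 11 and `rating_count_tot` is column 5.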

In [65]:
print ('Navigation apps in Apple Store:')
print ('\n')
for app in ios_final:
    if app[11] == 'Navigation':
        print (app[1], ':', app[5])

Navigation apps in Apple Store:


Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


As I could have easily expected, the "Navigation" genre is largely dominated by Waze and Google Maps, and if I'm honest I do not wish to try and compete with those two, as I lack the knowledge, capital, or infrastructure to do so. Maybe one day. But not today.

In [64]:
print ('Reference apps in Apple Store:')
print ('\n')
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Reference apps in Apple Store:


Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


In the "Reference" genre, we have a more granular spread, and this genre shows some promise. I see that by far, the top App is the "Bible" app, and not far from it, in fifth position, is "Muslim Pro". What is missing is the third monotheistic religion, Judaism. Perhaps there is an opportunity here to create a "Torah" app. But honestly, I'm absolutely not qualified to undertake such a project, so I will continue exploring other genres.

I'm very interested in the productivity genre, as I actually have an idea for a productivity app, and the genre has a promising average number of user ratings per app at 21028. Therefore I will break it down further and see what apps comprise it.

In [63]:
print ('Productivity apps in Apple Store:')
print ('\n')
for app in ios_final:
    if app[-5] == 'Productivity':
        print(app[1], ':', app[5])

Productivity apps in Apple Store:


Evernote - stay organized : 161065
Gmail - email by Google: secure, fast & organized : 135962
iTranslate - Language Translator & Dictionary : 123215
Yahoo Mail - Keeps You Organized! : 113709
Google Docs : 64259
Google Drive - free online storage : 59255
Dropbox : 49578
Microsoft Word : 47999
Microsoft OneNote : 39638
Microsoft Outlook - email and calendar : 32807
Hotspot Shield Free VPN Proxy & Wi-Fi Privacy : 32499
Documents 6 - File manager, PDF reader and browser : 29110
Google Sheets : 24602
Microsoft Excel : 24430
Inbox by Gmail : 21561
T-Mobile : 19977
Paper by FiftyThree - Sketch, Diagram, Take Notes : 18219
MyScript Calculator - Handwriting calculator : 16555
VPN Proxy Master - Unlimited WiFi security VPN : 13674
Microsoft OneDrive – File & photo cloud storage : 12797
Ever - Capture Your Memories : 12755
Speak & Translate － Voice and Text Translator : 12062
Tayasui Sketches : 11505
Drawing Desk - Draw, Paint, Doodle & Sketch board : 11040
Mi

The productivity genre is extremely granular, in that it has a large number of apps in it, but none of the top apps in there are actually direct competitors to my idea. They are all apps that relate to being more productive on your phone and having access to different services such as Google Docs, Gmail, etc., but aren't actually apps that can help boost your productivity. Now that I think about it, there really aren't that many apps that can help with that, except, seemingly, Evernote. It seems to be the only competitor in that space, and therefore I suspect there would be high demand for such an app.

Next up, I take a look at the most popular apps by Genre in the Android Play store, to see if I am on the right track in terms of my idea of building a productivity app.
Here, we already have installs as a column, as we are given the installs column in our dataset right off the bat.

In [59]:
display_table(android_final, 5) # the Installs column

1,000,000+ : 15.728308699086089
100,000+ : 11.55365000564143
10,000,000+ : 10.549475346947986
10,000+ : 10.199706645605326
1,000+ : 8.394448832223853
100+ : 6.916393997517771
5,000,000+ : 6.826131106848697
500,000+ : 5.562450637481666
50,000+ : 4.772650344127271
5,000+ : 4.513144533453684
10+ : 3.542818458761142
500+ : 3.2494640640866526
50,000,000+ : 2.3017037120613786
100,000,000+ : 2.1324607920568655
50+ : 1.9180864267178157
5+ : 0.7898002933543946
1+ : 0.5077287600135394
500,000,000+ : 0.270788672007221
1,000,000,000+ : 0.2256572266726842
0+ : 0.045131445334536835


However, upon printing this column with our `display_table` function, we get a distribution of install counts, not the categories they belong to. So, we need to write new code to show us this.
Additionally, if we want to perform computations on this data, we need to convert each install count to a float, and since these values are strings with commas and plus signs in them, we need to remove those characters first. We'll do this using the built-in string method `str.replace`.

In [60]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)



ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3697848.1731343283
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

Once we've done that, we have a clear picture of the average number of installs per category.
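
The listing above is printed in dictionary order, which makes it hard to see which categories lead. The sketch below parses the install labels and returns the averages sorted from highest to lowest. The helper names `parse_installs` and `average_installs_sorted` are hypothetical, and the rows are invented for illustration.

```python
def parse_installs(label):
    """Convert an install label such as '1,000,000+' to a float."""
    return float(label.replace(',', '').replace('+', ''))


def average_installs_sorted(rows, category_index, installs_index):
    """Return (category, average installs) pairs, highest average first."""
    totals = {}
    counts = {}
    for row in rows:
        category = row[category_index]
        totals[category] = totals.get(category, 0.0) + parse_installs(row[installs_index])
        counts[category] = counts.get(category, 0) + 1
    averages = [(category, totals[category] / counts[category]) for category in totals]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)


# Toy rows (invented): column 0 is the category, column 1 is the install label.
toy_rows = [
    ['TOOLS', '1,000+'],
    ['TOOLS', '3,000+'],
    ['GAME', '10,000+'],
]

ranked = average_installs_sorted(toy_rows, 0, 1)
print(ranked)  # [('GAME', 10000.0), ('TOOLS', 2000.0)]
```

Keep in mind that a label like `1,000+` only gives a lower bound on the real install count, so these averages are rough estimates rather than exact figures.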

As I know that I am interested in building a productivity app, I am going to dive in to those and see what I can understand from this dataset.

In [68]:
print ('Productivity apps in Android Store:')
print ('\n')
for app in android_final:
    if app[1] == 'PRODUCTIVITY':
        print(app[0], ':', app[5])

Productivity apps in Android Store:


Microsoft Word : 500,000,000+
All-In-One Toolbox: Cleaner, Booster, App Manager : 10,000,000+
AVG Cleaner – Speed, Battery & Memory Booster : 10,000,000+
QR Scanner & Barcode Scanner 2018 : 10,000,000+
Chrome Beta : 10,000,000+
Microsoft Outlook : 100,000,000+
Google PDF Viewer : 10,000,000+
My Claro Peru : 5,000,000+
Power Booster - Junk Cleaner & CPU Cooler & Boost : 1,000,000+
Google Assistant : 10,000,000+
Microsoft OneDrive : 100,000,000+
Calculator - unit converter : 50,000,000+
Microsoft OneNote : 100,000,000+
Metro name iD : 10,000,000+
Google Keep : 100,000,000+
Archos File Manager : 5,000,000+
ES File Explorer File Manager : 100,000,000+
ASUS SuperNote : 10,000,000+
HTC File Manager : 10,000,000+
MyMTN : 1,000,000+
Dropbox : 500,000,000+
ASUS Quick Memo : 10,000,000+
HTC Calendar : 10,000,000+
Google Docs : 100,000,000+
ASUS Calling Screen : 10,000,000+
lifebox : 5,000,000+
Yandex.Disk : 5,000,000+
Content Transfer : 5,000,000+
HTC Mail :

At first glance, just like in the App Store Productivity dataset, most of the apps that have large numbers of installs aren't actually apps that help you with productivity, but are actually tools that can be used to access different programs such as Dropbox, Google Docs, etc. Unfortunately, our results are still extremely granular, and quite hard to go through as it is so large. To see if there is another large competitor, I'm going to isolate only the apps with a substantial number of installs each (apps with 100,000,000 installs or more).

In [72]:
for app in android_final:
    if app[1] == 'PRODUCTIVITY' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])



Microsoft Word : 500,000,000+
Microsoft Outlook : 100,000,000+
Microsoft OneDrive : 100,000,000+
Microsoft OneNote : 100,000,000+
Google Keep : 100,000,000+
ES File Explorer File Manager : 100,000,000+
Dropbox : 500,000,000+
Google Docs : 100,000,000+
Microsoft PowerPoint : 100,000,000+
Samsung Notes : 100,000,000+
SwiftKey Keyboard : 100,000,000+
Google Drive : 1,000,000,000+
Adobe Acrobat Reader : 100,000,000+
Google Sheets : 100,000,000+
Microsoft Excel : 100,000,000+
WPS Office - Word, Docs, PDF, Note, Slide & Sheet : 100,000,000+
Google Slides : 100,000,000+
ColorNote Notepad Notes : 100,000,000+
Evernote – Organizer, Planner for Notes & Memos : 100,000,000+
Google Calendar : 500,000,000+
Cloud Print : 500,000,000+
CamScanner - Phone PDF Creator : 100,000,000+


Unsurprisingly, Evernote appears again as the only genuine app that promotes productivity, the others all being tools. I think so far, this convinces me that I want to build a productivity app that rivals Evernote.

## Conclusion

In this project, I cleaned and analyzed data from the App Store and the Android Play Store for mobile apps, with the goal of recommending an app profile that can be profitable in both markets.

I concluded that even though productivity seems to be a large genre in both markets, the main productivity apps are actually tools and services such as Dropbox and Gmail and not apps that can actually help you be more productive. I find that there would be huge demand for an app of this sort, and realized that there is definitely space in both markets as the only large app currently with such an offering is Evernote. 

Therefore, my recommendation at the end of this project would be for someone to get started on working on an app that can help people be more productive using notes, reminders, planners, and the like. 

Will that someone be me? I guess we'll see.