# Exploring Profitable Apps for Apple and Android

This is a guided project in the DataQuest Data Analyst in Python track.

Our company builds free apps for both Android and iOS, and our revenue comes from in-app ads. We therefore want to identify app profiles that are popular in both markets so that we can drive more traffic to our apps and maximize ad revenue. We will analyze data from both the Google Play Store and the Apple App Store to better understand which types of apps attract more users in each market.

## Opening and Exploring Data

Define a function named `explore_data` that displays the rows from `start` up to (but not including) `end` and, if the `rows_and_columns` parameter is set to `True`, reports how many rows and columns are in the `dataset`.

In [28]:
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    
    if rows_and_columns:
        print('Number of rows: ' + str(len(dataset)))
        print('Number of columns: ' + str(len(dataset[0])))

Import the two data sets into variables named apple and android.

In [29]:
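from csv import reader

# Read each file inside a `with` block so the file handle is closed afterwards.
with open('AppleStore.csv', encoding = 'utf8') as apple_file:
    apple = list(reader(apple_file))
apple_header = apple[0]
apple = apple[1:]

with open('googleplaystore.csv', encoding = 'utf8') as android_file:
    android = list(reader(android_file))
android_header = android[0]
android = android[1:]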

In [30]:
print(apple_header)
print('\n')
explore_data(apple, 0, 3, rows_and_columns = True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 7197
Number of columns: 17


Looking at the header of our Apple Store dataset, the columns that may be useful in our analysis are `'track_name'`, `'currency'`, `'price'`, `'rating_count_tot'`, `'rating_count_ver'`, and `'prime_genre'`. 

More details on the columns in this dataset can be found in its [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home). 


In [31]:
print(android_header)
print('\n')
explore_data(android, 0, 3, rows_and_columns = True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


It appears that the columns that will be useful in the Google Play Store dataset are `'App'`, `'Category'`, `'Installs'`, `'Type'`, `'Price'`, and `'Genres'`.

Further detail on the columns can be found in the dataset's [documentation](https://www.kaggle.com/lava18/google-play-store-apps/home).

## Deleting Wrong Data

We need to prepare our data so that we have no incorrect data and the data fits the purpose of our analysis. 

There is a [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) in the Google Play Store dataset's documentation regarding row 10472. Let's look at that row. 

In [32]:
print(android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


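It appears the `'Category'` value is missing from this entry, which shifts every later value one column to the left (the rating `'1.9'` sits where the category should be). I will simply remove the row from our dataset.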

In [33]:
del android[10472]  # run only once; rerunning would delete a different, valid row
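
A row like this can also be found programmatically by comparing each row's length with the header's. The sketch below runs on a small made-up sample rather than the real file, and `find_malformed_rows` is a hypothetical helper name, not part of the original project.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Illustrative sample only (not the real dataset): the second row lacks a category.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Some App', 'TOOLS', '4.1'],
    ['Broken App', '1.9'],
]

for index, row in find_malformed_rows(sample_rows, sample_header):
    print(index, row)
```

Calling the same helper on `android` and `android_header` before deleting anything would list every row that needs attention, not just the one mentioned in the discussion.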

## Removing Duplicates

According to another [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/67894), our Google Play dataset also has multiple entries for some apps. Instagram is one of those apps:

In [34]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Let's find how many duplicates there are and look at some examples.

In [35]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print("Number of duplicate apps: " + str(len(duplicate_apps)))
print('\n')
print('Examples of duplicate apps: ', duplicate_apps[:10])

Number of duplicate apps: 1181


Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


We could remove the duplicates randomly, but by looking at the Instagram examples above, we can see they have different numbers of reviews. I am going to keep the entry with the most reviews for each app with duplicates. We should expect to have 9,659 apps after removing the duplicates.

In [36]:
print('Expected length: ' + str(len(android) - len(duplicate_apps)))

Expected length: 9659


Now we need to go through and remove the duplicates so that the remaining entry is the one with the most reviews. We first create a dictionary named `reviews_max` and iterate through `android`. If the current entry's app is already in the dictionary and its number of reviews is larger than the stored value, we replace the stored value with the current entry's. If the current entry's app is not in the dictionary yet, we add it along with its number of reviews.

In [37]:
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max))

9659


Now we create two lists named `android_clean` and `already_added`. We iterate through the `android` dataset, and if the current entry's number of reviews equals the maximum we recorded for that app and the app is not already in `already_added`, we add that entry to our clean dataset. The `already_added` check matters because some duplicates share the same maximum number of reviews.

In [38]:
android_clean = []
already_added = []
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

print(len(android_clean))

9659
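
The two passes above can also be folded into one. The sketch below is an alternative, not the approach used in this notebook: it stores the best row seen so far for each name. `keep_most_reviewed` is a hypothetical helper, the Instagram review counts are taken from the output above, and the Box row is made up for illustration.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row seen with the highest review count."""
    best = {}  # app name -> row with the most reviews so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Shortened rows for illustration; the Box values are invented.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
deduplicated = keep_most_reviewed(sample)
print(len(deduplicated))
```

One difference to keep in mind: the result is ordered by the first appearance of each app name, whereas the two-pass version keeps each app at the position of its most-reviewed entry.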


## Removing Non-English Apps

We are only interested in apps directed towards an English-speaking audience, but some of the apps in our data are not in English. Using the `ord()` function on a character, we get a number that corresponds to that character. According to ASCII, the numbers corresponding to the characters commonly used in English text range from 0 to 127.

Since symbols like the trademark symbol and emojis fall outside of the ASCII range, we will only remove an app from our data if its name has more than three non-ASCII characters.

In [39]:
def checkEnglish(app):
    total = 0
    for c in app:
        if ord(c) > 127:
            total += 1
    if total > 3:
        return False
    else:
        return True

print(checkEnglish('Instagram'))
print(checkEnglish('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(checkEnglish('Docs To Go™ Free Office Suite'))
print(checkEnglish('Instachat 😜'))         

True
False
True
True


Using this filter, apps with more than three emojis and/or special symbols in their names will be removed, and some non-English apps may still make it past the filter. The filter is not perfect, but it is effective enough for our purposes.
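
To make the threshold easier to experiment with, the same rule can be written with the count separated from the decision. This is a minimal sketch; `count_non_ascii`, `is_probably_english`, and the `max_non_ascii` parameter are hypothetical names introduced here, with the default chosen to match `checkEnglish`.

```python
def count_non_ascii(text):
    """Count characters whose code point falls outside the ASCII range (0-127)."""
    return sum(1 for character in text if ord(character) > 127)


def is_probably_english(text, max_non_ascii=3):
    """Mirror the rule used above: allow up to `max_non_ascii` non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii


print(count_non_ascii('Docs To Go™ Free Office Suite'))  # 1
print(is_probably_english('Instachat 😜'))  # True
```

Lowering `max_non_ascii` makes the filter stricter at the cost of dropping more English apps whose names contain symbols or emojis.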

In [40]:
android_english = []
apple_english = []

for app in android_clean:
    name = app[0]
    if checkEnglish(name):
        android_english.append(app)

for app in apple:
    name = app[2]
    if checkEnglish(name):
        apple_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(apple_english, 0 , 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188

Our Android dataset now has 9614 entries and our Apple dataset has 6183.

## Removing Paid Apps

We are only interested in free apps, so we will remove all of the remaining paid apps.

In [41]:
android_free = []
apple_free = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_free.append(app)
        
for app in apple_english:
    price = float(app[5])
    if price == 0:
        apple_free.append(app)
        
print('Android: ' + str(len(android_free)))
print('Apple: ' + str(len(apple_free)))
    

Android: 8864
Apple: 3222


We have 8864 Android apps and 3222 Apple apps remaining.
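
The Android loop above compares the price to the exact string `'0'`, while the Apple loop converts to a number first. A slightly more defensive check could normalize both, as sketched below. This assumes paid prices may carry a leading `'$'`, which is an assumption about the data rather than something shown above, and `is_free` is a hypothetical helper.

```python
def is_free(price_text):
    """Return True when a price string represents zero, ignoring a leading '$'."""
    # Assumption: prices look like '0', '0.00', or '$4.99'.
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0


print(is_free('0'), is_free('0.00'), is_free('$4.99'))
```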

## Most Common Apps in each Genre

Our aim is to determine the kinds of apps that are most likely to attract more users since we generate revenue through ads in free apps. This means our revenue is heavily influenced by the number of people using our apps. 

The validation strategy for an app idea consists of three steps:
1. Build a minimal Android version and add it to the Google Play Store.
2. If there is a good response from users, continue development.
3. If the app is profitable after six months, port it to iOS and add it to the App Store. 

Because we need to find apps that will perform well on both the Google Play Store and the Apple App Store, we will look for app profiles that work well in both markets. 

We will begin by generating frequency tables to find the most common genres in each market. Upon inspection, it appears the `Category` and `Genres` columns in the Android dataset and the `prime_genre` column in the Apple dataset will be of interest.

We'll build two functions we can use to analyze frequency tables:
- One function to generate frequency tables that show percentages.
- Another we can use to display the percentages in a descending order. 

In [42]:
def freq_table(dataset, index):
    freq_dict = {} #initialize frequency table
    
    #build a count of the target data
    for app in dataset:
        target = app[index]
        if target in freq_dict:
            freq_dict[target] += 1
        else:
            freq_dict[target] = 1
    
    #convert to a percentage
    for key in freq_dict:
        freq_dict[key] = 100 * (freq_dict[key] / len(dataset))
        
    return freq_dict
        
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
        
display_table(apple_free, 12)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


We can see here that, in the `prime_genre` column of the free English Apple App Store apps, Games is by far the most common genre, taking up over half (58.16%) of the market. Next is Entertainment at almost 8%. Photo & Video is third at almost 5%, followed by Education at 3.66% and Social Networking at 3.29%. The market is dominated by genres geared towards fun (Games, Entertainment, Photo & Video, and Social Networking), with only one practical-purpose genre (Education) among the five most common.

Although fun apps may be the most numerous in the Apple App Store, that does not mean that those apps have large amounts of users. The demand may not be particularly high for the individual apps.
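
As an aside, the same kind of percentage table can be built with `collections.Counter` from the standard library. The sketch below is an alternative to `freq_table` and `display_table`, not a replacement for them; `freq_table_counter` is a hypothetical name and the sample rows are made up.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, 100 * count / total) for value, count in counts.most_common()]


# Made-up rows: three Games apps and one Weather app.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Games'], ['d', 'Weather']]
for genre, percentage in freq_table_counter(sample, 1):
    print(genre, ':', percentage)
```

Note that `most_common()` keeps tied values in the order they were first seen, so ties may print in a different order than with `display_table`.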

Next, we'll take a look at the Google Play Store's free English apps. 

In [43]:
display_table(android_free, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In the Google Play Store, we can immediately see the market is not nearly as dominated by fun apps as the App Store. Looking at the Family category, we can tell that it is mostly games and entertainment apps. Nonetheless, we can see that practical apps (Tools, Business, Productivity, etc.) are better represented in the Google Play Store than in the App Store.

The Genres column gives even more evidence of that fact:

In [44]:
display_table(android_free, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

Here we can see that the data is far more granular, with the most common genre (Tools) accounting for only 8.45% of the total market. We also see that practical genres continue to be well represented in the Google Play Store. However, the distinction between the `Category` and `Genres` columns is not clear, and we are more interested in the big picture, so we will focus on the `Category` column moving forward.

Now that we have an idea about which genres are most common in each market, we will want to determine what types of apps tend to have larger user bases. 

## Most Popular Apps by Genre on the App Store
We now want to determine which app profiles have large average user bases. We can determine this by finding the average installs for the apps. The Google Play Store dataset has an `Installs` column, but the App Store does not. As a workaround, we'll use the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column. 

In [45]:
genres_apple = freq_table(apple_free, 12)
genres_apple_list = []

for genre in genres_apple:
    total = 0
    len_genre = 0
    
    for app in apple_free:
        genre_app = app[12]
        if genre_app == genre:
            number_of_ratings = float(app[6])
            total += number_of_ratings
            len_genre += 1
            
    average_ratings = total / len_genre
    #I want it sorted, so I'll append it to a list, sort, then print.
    genres_apple_list.append((average_ratings, genre))

genres_apple_list = sorted(genres_apple_list, reverse = True)

for genre in genres_apple_list:
    print(genre[1], ':', genre[0])

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


We can see that Navigation has the highest average number of user ratings, at over 86,000. This is likely skewed by the popularity of Waze and Google Maps.
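
The averages above were computed by looping over every app once per genre. The same numbers can be obtained in a single pass by keeping running totals and counts, as in the sketch below. `average_by_group` is a hypothetical helper, and the sample rows are made up for illustration.

```python
def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the numeric column} using one pass over the rows."""
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up rows: [name, genre, number of ratings].
sample = [
    ['app1', 'Games', '100'],
    ['app2', 'Games', '300'],
    ['app3', 'Weather', '50'],
]
averages = average_by_group(sample, 1, 2)
print(averages)
```

With the real data, `average_by_group(apple_free, 12, 6)` would be the equivalent call, under the same column indices used in the cell above.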

I will write a function to explore our Apple dataset and print a list of apps in a given genre sorted by `rating_count_tot` in descending order, then use it to investigate some of the more popular genres.

In [46]:
def explore_apple_genre(dataset, target_genre):
    genre_list = [] #initialize an empty list
    #loop through the dataset and append any apps belonging to our target genre to the list
    for app in dataset:
        genre = app[12]
        if genre == target_genre:
            genre_list.append((int(app[6]), app[2]))
    genre_list = sorted(genre_list, reverse = True) #sort our list in descending order
    for app in genre_list:
        print(app[1], ':', app[0])
        
explore_apple_genre(apple_free, 'Navigation')

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


Waze and Google Maps have nearly half a million reviews between them while the small number of other apps do not have nearly as sizeable userbases. 

The genre with the next highest average amount of reviews is Reference:

In [47]:
explore_apple_genre(apple_free, 'Reference')

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


We see once again that the market is dominated by a few very popular apps (Bible and Dictionary.com Dictionary & Thesaurus). However, we could explore creating an app for popular public domain books (particularly religious ones), since we can see Bible is very popular and Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran has a decently sized userbase.

The next genre in order of average number of reviews is Social Networking:

In [48]:
explore_apple_genre(apple_free, 'Social Networking')

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

This genre has many more apps than the previous two we've looked at. The market is dominated by the likes of Facebook, Tumblr, and WhatsApp. By investigating this genre, we can see that dating apps, apps for meeting new people, and chat apps are potential app profiles to explore.

Other relatively popular genres include Music, Weather, Book, Finance, and Food & Drink. 
- Music - Outside of the dominant apps, it appears rare for an app to reach 10,000 or more reviews. 
- Weather - Likely not heavily used by its users; they probably just check occasionally for the forecast. This reduces the potential for ad revenue.
- Book - Dominated by a few very popular apps and all other apps struggle to gain a satisfactory amount of reviews.
- Finance - May require domain knowledge and regulatory considerations that are beyond the scope of our organization.
- Food & Drink - Tend to be tied to large popular brands like Starbucks, McDonald's, etc. and used for ordering food/drink.

My suggestion would be to focus on a Reference app for a popular public domain (potentially religious) text, since the App Store is heavily saturated with fun apps. A practical app may stand out from the crowd more easily.

## Most Popular Apps by Genre in the Google Play Store
Now we will look at the Google Play Store. In our `android_free` dataset, there is an `Installs` column we can use to get a more direct analysis of the userbase size. However, the install numbers themselves are not precise - they are placed in buckets (100+, 1,000+, 5,000+, etc.):

In [49]:
display_table(android_free, 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


Due to this, we don't know if an app with 100,000+ installs has 110,000 installs, 200,000 installs, or 350,000 installs. Since we're working with averages and just want a picture of the general userbase size, we don't need perfect precision. 
We are going to consider that an app with 100,000+ installs has 100,000 installs, and so on. Since the data in the Installs column are all strings, we will need to convert them to float in order to perform calculations. We will first need to remove the commas and the plus characters from the strings. 
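
As a quick illustration of that conversion, the sketch below turns a few of the bucket labels shown above into numbers. `parse_installs` is a hypothetical helper name; the logic is the same comma and plus-sign removal described here.

```python
def parse_installs(installs_text):
    """Convert a bucket label such as '100,000+' to its lower bound as a float."""
    return float(installs_text.replace(',', '').replace('+', ''))


for label in ['0', '0+', '5,000+', '1,000,000,000+']:
    print(label, '->', parse_installs(label))
```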

In [50]:
category_android = freq_table(android_free, 1)
category_android_list = []

for category in category_android:
    total = 0
    len_category = 0
    
    for app in android_free:
        category_app = app[1]
        if category_app == category:
            number_of_installs = app[5]
            number_of_installs = float(number_of_installs.replace(',', '').replace('+', ''))
            total += number_of_installs
            len_category += 1
    
    average_installs = total / len_category
    category_android_list.append((average_installs, category))
    
category_android_list = sorted(category_android_list, reverse = True)
for category in category_android_list:
    print(category[1], ':', category[0])

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3695641.8198090694
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315

We'll explore some of the more popular categories that may be of interest to us. We will try to focus on individual apps with 10,000,000 or more downloads.

Similar to above, I will write a function to explore each category in the Android dataset.

In [51]:
def explore_android_category(dataset, target_category):
    category_list = [] #initialize an empty list
    #loop through the dataset and append any apps belonging to our target category to the list
    for app in dataset:
        category = app[1]
        if category == target_category:
            number_of_installs = app[5]
            number_of_installs = int(number_of_installs.replace(',', '').replace('+', ''))
            category_list.append((number_of_installs, app[0]))
    category_list = sorted(category_list, reverse = True) #sort our list in descending order
    for app in category_list:
        print(app[1], ':', app[0])

explore_android_category(android_free, 'COMMUNICATION')

WhatsApp Messenger : 1000000000
Skype - free IM & video calls : 1000000000
Messenger – Text and Video Chat for Free : 1000000000
Hangouts : 1000000000
Google Chrome: Fast & Secure : 1000000000
Gmail : 1000000000
imo free video calls and chat : 500000000
Viber Messenger : 500000000
UC Browser - Fast Download Private & Secure : 500000000
LINE: Free Calls & Messages : 500000000
Google Duo - High Quality Video Calls : 500000000
imo beta free calls and text : 100000000
Yahoo Mail – Stay Organized : 100000000
Who : 100000000
WeChat : 100000000
UC Browser Mini -Tiny Fast Private & Secure : 100000000
Truecaller: Caller ID, SMS spam blocking & Dialer : 100000000
Telegram : 100000000
Opera Mini - fast web browser : 100000000
Opera Browser: Fast and Secure : 100000000
Messenger Lite: Free Calls & Messages : 100000000
Kik : 100000000
KakaoTalk: Free Calls & Text : 100000000
GO SMS Pro - Messenger, Free Themes, Emoji : 100000000
Firefox Browser fast & private : 100000000
BBM - Free Calls & Messages

In the Communication category, we can see there are a few apps with 500 million or more installs (WhatsApp Messenger, Hangouts, etc.). If we look past these at the other successful apps in this market, we can see a large variety of browsers (particularly with ad-blocking functionality), chat/call/video call apps, free Wi-Fi finders/connectors, VPN services, and email apps. However, chat/call/video call apps, email apps, and browsers would have to compete directly with apps like WhatsApp, Gmail, and Google Chrome, so we may want to avoid those markets if possible.

The next most popular category is Video Players:

In [52]:
explore_android_category(android_free, 'VIDEO_PLAYERS')

YouTube : 1000000000
Google Play Movies & TV : 1000000000
MX Player : 500000000
VivaVideo - Video Editor & Photo Movie : 100000000
VideoShow-Video Editor, Video Maker, Beauty Camera : 100000000
VLC for Android : 100000000
Motorola Gallery : 100000000
Motorola FM Radio : 100000000
Dubsmash : 100000000
Vote for : 50000000
Vigo Video : 50000000
VMate : 50000000
Samsung Video Library : 50000000
Ringdroid : 50000000
MiniMovie - Free Video and Slideshow Editor : 50000000
LIKE – Magic Video Maker & Community : 50000000
KineMaster – Pro Video Editor : 50000000
HD Video Downloader : 2018 Best video mate : 50000000
DU Recorder – Screen Recorder, Video Editor, Live : 50000000
video player for android : 10000000
iMediaShare – Photos & Music : 10000000
YouTube Studio : 10000000
Video Player All Format : 10000000
Video Downloader - for Instagram Repost App : 10000000
Video Downloader : 10000000
Ustream : 10000000
Quik – Free Video Editor for photos, clips, music : 10000000
PowerDirector Video Editor

Once again, we see a few highly dominant apps in the Video Player category with YouTube and Google Play Movies & TV garnering over 1 billion installs each. 

Investigating this list for viable app profiles, we can see that there is a good representation of video players, video editors, and video downloaders. Video players and editors may require more complicated domain knowledge in topics like video codecs, so it may be a good idea to avoid those markets. 

As we saw in the Apple dataset, Social apps are dominated by industry giants like Facebook and the Game apps category is going to be heavily saturated with little chance to stand out. A main concern we also have with some of the heavily popular categories is that there are outliers that heavily skew the average install numbers upward. If we remove some of the dominant apps in these categories, it may turn out they are not as popular as they initially appear. 
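
To see how strongly a single giant app can pull an average upward, consider the small sketch below. The install counts are made up, and `average_below` is a hypothetical helper; it only illustrates the effect and is not a result from our datasets.

```python
from statistics import mean, median


def average_below(installs, threshold):
    """Mean of the values strictly below `threshold`; None if nothing remains."""
    kept = [value for value in installs if value < threshold]
    return mean(kept) if kept else None


# Made-up install counts: one giant app and four mid-sized ones.
installs = [1_000_000_000, 10_000_000, 5_000_000, 1_000_000, 1_000_000]
print(mean(installs))                         # pulled up by the single giant app
print(median(installs))                       # much less sensitive to the outlier
print(average_below(installs, 100_000_000))   # average without the giant app
```

Comparing the mean with the median, or with the average after excluding apps above some install threshold, is one way to check whether a category is popular across the board or only because of a few outliers.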

Next, I'll take a look at the Books and Reference category, since we found a potential app profile of interest there in the Apple dataset:

In [53]:
explore_android_category(android_free, 'BOOKS_AND_REFERENCE')

Google Play Books : 1000000000
Wattpad 📖 Free Books : 100000000
Bible : 100000000
Audiobooks from Audible : 100000000
Amazon Kindle : 100000000
Wikipedia : 10000000
Spanish English Translator : 10000000
Quran for Android : 10000000
Oxford Dictionary of English : Free : 10000000
NOOK: Read eBooks & Magazines : 10000000
Moon+ Reader : 10000000
JW Library : 10000000
HTC Help : 10000000
FBReader: Favorite Book Reader : 10000000
English Hindi Dictionary : 10000000
English Dictionary - Offline : 10000000
Dictionary.com: Find Definitions for English Words : 10000000
Dictionary - Merriam-Webster : 10000000
Dictionary : 10000000
Cool Reader : 10000000
Aldiko Book Reader : 10000000
Al-Quran (Free) : 10000000
Al'Quran Bahasa Indonesia : 10000000
Al Quran Indonesia : 10000000
Read books online : 5000000
English to Hindi Dictionary : 5000000
Ebook Reader : 5000000
Dictionary - WordWeb : 5000000
Bible KJV : 5000000
Ancestry : 5000000
AlReader -any text book reader : 5000000
Al Quran : EAlim - Transl

In this category, we see a large number of dictionaries, book readers, and translators. Once again, there are popular apps for the Bible and the Quran. This leads me to believe that a reader app for a popular (potentially religious) public domain work could be a good app profile to explore, one that may work in both the Android and Apple markets. We may want to add functionality like word search, definitions, notes, bookmarks, and discussion forums to help the app stand out in the crowd.
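The grouping above was done by eye. As a rough cross-check, the sketch below tallies app names by keyword. The names are a subset copied from the Books and Reference output, and the `profiles` keyword groups and the `tally_profiles` helper are assumptions made for illustration; a name can count toward more than one profile, and names with no matching keyword are simply not counted.

```python
# A subset of app names copied from the BOOKS_AND_REFERENCE output above.
book_apps = [
    'Google Play Books',
    'Bible',
    'Amazon Kindle',
    'Wikipedia',
    'Spanish English Translator',
    'Quran for Android',
    'Oxford Dictionary of English : Free',
    'Moon+ Reader',
    'FBReader: Favorite Book Reader',
    'English Hindi Dictionary',
    'Dictionary - Merriam-Webster',
    'Cool Reader',
    'Aldiko Book Reader',
    'Al Quran Indonesia',
    'Bible KJV',
    'Ebook Reader',
]

# Hypothetical keyword groups; adjust these to refine the profiles.
profiles = {
    'dictionary': ['dictionary'],
    'reader': ['reader', 'kindle', 'books'],
    'translator': ['translator'],
    'religious text': ['bible', 'quran'],
}


def tally_profiles(names, profiles):
    """Count how many app names contain at least one keyword per profile."""
    counts = {profile: 0 for profile in profiles}
    for name in names:
        lowered = name.lower()
        for profile, keywords in profiles.items():
            if any(keyword in lowered for keyword in keywords):
                counts[profile] += 1
    return counts


profile_counts = tally_profiles(book_apps, profiles)
for profile, count in profile_counts.items():
    print(profile, ':', count)
```

Keyword matching is crude (it misses apps whose names do not describe what they do), so the counts are best read as a sanity check on the manual reading rather than a precise measurement.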

Another category of interest in the Apple dataset was Social apps due to the popularity of apps geared towards dating and meeting new people.

In [54]:
explore_android_category(android_free, 'SOCIAL')

Instagram : 1000000000
Google+ : 1000000000
Facebook : 1000000000
Snapchat : 500000000
Facebook Lite : 500000000
VK : 100000000
Tumblr : 100000000
Tik Tok - including musical.ly : 100000000
Tango - Live Video Broadcast : 100000000
Pinterest : 100000000
LinkedIn : 100000000
Badoo - Free Chat & Dating App : 100000000
BIGO LIVE - Live Stream : 100000000
ooVoo Video Calls, Messaging & Stories : 50000000
Zello PTT Walkie Talkie : 50000000
SKOUT - Meet, Chat, Go Live : 50000000
POF Free Dating App : 50000000
MeetMe: Chat & Meet New People : 50000000
textPlus: Free Text & Calls : 10000000
magicApp Calling & Messaging : 10000000
YouNow: Live Stream Video Chat : 10000000
We Heart It : 10000000
Waplog - Free Chat, Dating App, Meet Singles : 10000000
TextNow - free text + calls : 10000000
Text free - Free Text + Call : 10000000
Text Me: Text Free, Call Free, Second Phone Number : 10000000
Tapatalk - 100,000+ Forums : 10000000
Tagged - Meet, Chat & Dating : 10000000
SayHi Chat, Meet New People : 1

Once again, we can see a large number of dating and "meet new people" apps. A number of these apps are marketed towards the LGBT community as well, possibly because it may be difficult for members of these communities to find each other by chance in public. One way we may be able to stand out in this market is to create apps that appeal to certain niches (LGBT, religion-specific, etc.). Another way would be to add features that typical apps lack today.

## Conclusions

In this project, we analyzed data about Google Play Store and App Store apps with the goal of finding an app profile that could be profitable in both markets.

Although a number of app profiles looked promising in the individual markets, one that stood out as having potential in both is a book app for a popular, possibly religious, public domain work, with useful additional features built in.

Another potential app profile is dating/"meet new people" apps. These are popular in both markets, but we may need to add features that typical apps lack, or market them towards specific niches, in order to stand out from the crowd.