# Predicting Profitable Apps Based on User Reviews

In this project, our objective is to determine what makes an app attractive to potential users. We will be working as data analysts for an app development company that specializes in free apps distributed through the App Store and Google Play marketplaces.

Because our company only develops free apps, the overwhelming majority of its revenue comes from in-app advertisements. This implies that the most relevant metric when analyzing the apps' data is the number of users of each app. By analyzing the data with a specific focus on the number of users and the average ratings of apps by genre, we will offer useful insights to our company.

# Opening the Datasets
The iOS App Store and the Android Google Play Store have about 2 million and 2.1 million apps available for download, respectively. Analyzing both stores in their entirety would be expensive and time-consuming, and is beyond the scope of this project. However, we can still analyze a sample from each store.
* This is a sample of the original Google Play Store data set. It consists of data for about 10,000 apps and is available for download [here](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
* This is a sample of the original iOS App Store data set. It consists of data for about 7,000 apps and is available for download [here](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

We will begin by opening each of the samples in Python.

In [1]:
from csv import reader

# The App Store Marketplace
# encoding='utf8' assumes the file is UTF-8 encoded; the with block closes the file for us
with open('/Users/Tornyeli/Datasets/AppleStore.csv', encoding='utf8') as f_object:
    apple = list(reader(f_object))
apple_header = apple[0]
apple = apple[1:]

# The Google Play Marketplace
with open('/Users/Tornyeli/Datasets/googleplaystore.csv', encoding='utf8') as f_object:
    android = list(reader(f_object))
android_header = android[0]
android = android[1:]
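
Both files are read in exactly the same way, so the steps can be wrapped in a small helper. The following is only a sketch: the names `split_header` and `load_csv` are hypothetical, and UTF-8 is an assumed encoding. An in-memory file stands in for the real CSV files so that the example runs on its own.

```python
import csv
import io


def split_header(file_obj):
    """Read every CSV row from an open text file and split off the header row."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


def load_csv(path, encoding="utf8"):
    """Open a CSV file (assumed to be UTF-8) and return (header, rows)."""
    # newline="" is what the csv module documentation recommends for files passed to csv.reader
    with open(path, encoding=encoding, newline="") as file_obj:
        return split_header(file_obj)


# Hypothetical two-line file held in memory, used only to demonstrate the helper.
sample_file = io.StringIO("App,Category\nInstagram,SOCIAL\n")
header, rows = split_header(sample_file)
print(header)
print(rows)
```

With such a helper, each dataset could be loaded with a single call such as `load_csv('/Users/Tornyeli/Datasets/AppleStore.csv')`.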

In the interest of efficiency we'll create a function that helps us to explore the datasets, present the data in a way that is more comprehensible, and count the rows and columns of any dataset. 

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    """A function used to present segments of, or all of the data from a dataset"""
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
print(android_header)
print('\n')
explore_data(android, 1, 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


Our function shows that this data set has exactly 10,841 rows (app entries). Relevant columns include: 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.

Below we will explore the iOS dataset.

In [4]:
print(apple_header)
print('\n')
explore_data(apple, 1, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16


There are 7,197 iOS apps in this data set. Relevant columns include: 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'. Some of the column names are not very descriptive, but further explanation is provided in the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

# Deleting Duplicate/Incorrect Data
In the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) of the Google Play Store dataset there is a specific [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) covering an error in row 10472. To get a better picture of the error, we'll print the error row, a correct row, and the number of columns in the header, then search for any row whose length differs from the header's.

In [5]:
print(android[10472]) # error row

print(android[0],'\n') # correct row

print(len(android_header)) # number of columns in the header

# enumerate gives each row's index directly, so no second search with .index() is needed
for index, row in enumerate(android):
    if len(row) != len(android_header):
        print(row)
        print("\n")
        print("Index position is:", index)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

13
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Index position is: 10472


Row 10472 corresponds to the app Life Made WI-Fi Touchscreen Photo Frame and appears to have a rating of 19. This must be an error because the maximum rating for a Google Play app is 5. The discussion section attributes the error to a missing value in the 'Category' column, which shifts every later value one column to the left. Since we do not have the correct value to replace the missing one, we'll just delete the row.

In [6]:
print(len(android))
del android[10472]
print(len(android))

10841
10840
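
Deleting by a hard-coded index removes a different row each time the cell is re-run. A possible alternative, sketched here with a hypothetical helper `drop_malformed_rows` and a made-up miniature dataset, is to keep only the rows whose length matches the header. Running it twice gives the same result.

```python
def drop_malformed_rows(rows, n_columns):
    """Return only the rows that have exactly n_columns fields."""
    return [row for row in rows if len(row) == n_columns]


# Hypothetical miniature dataset: the second row is missing one field.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5'],
]

clean_rows = drop_malformed_rows(sample_rows, len(sample_header))
print(len(sample_rows), len(clean_rows))
```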


Both datasets contain apps that are listed multiple times. Take, for example, the Instagram entries in the Google Play dataset below.

In [7]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [8]:
duplicate_apps_ios = []
unique_apps_ios = []
for app in apple:
    name = app[1]
    if name in unique_apps_ios:
        duplicate_apps_ios.append(name)
    else:
        unique_apps_ios.append(name)

print('Number of duplicate apps:', len(duplicate_apps_ios))
print('\n')
print('Examples of duplicate apps:', duplicate_apps_ios[:15])        

Number of duplicate apps: 2


Examples of duplicate apps: ['Mannequin Challenge', 'VR Roller Coaster']


In [9]:
duplicate_apps_droid = []
unique_apps_droid = []
for app in android:
    name = app[0]
    if name in unique_apps_droid:
        duplicate_apps_droid.append(name)
    else:
        unique_apps_droid.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps_droid))
print('\n')
print('Examples of duplicate apps:', duplicate_apps_droid[:15])    

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


Counting the same app multiple times would be redundant and would likely skew our results. This leads us to our next question: on what basis will we remove duplicate entries? If we look at the multiple entries for the Instagram app displayed previously, we can see that the only column whose values differ between entries is the fourth column, which stores the number of reviews. The differences suggest that the data was collected at different times. So our basis for removing rows will be the number of reviews: we'll keep the entry with the most reviews, since it is likely the most recent and therefore the most reliable.

To do this we will create a dictionary that has key-value pairs of app name and review count respectively. Once we've finished creating the dictionary we'll use it to make a new dataset where each app only appears once with the highest number of reviews taken.

In [10]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In a previous code cell, we found 1,181 duplicate entries in the Google Play dataset, so our dictionary should contain 1,181 fewer items (key-value pairs) than the dataset has rows.

In [11]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


We'll use the reviews_max dictionary to build a duplicate-free list named android_clean. The if block adds an app to android_clean only if its number of reviews matches the quantity listed in the reviews_max dictionary and its name is not yet in the already_added list. Each app that is added also has its name recorded in already_added, so if several entries share the same maximum review count, only the first one is kept.

In [12]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
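
The two passes above can also be folded into one. The sketch below is one possible variant, not the method used in this notebook: the hypothetical helper `keep_most_reviewed` stores the best row seen so far for each name. The Instagram review counts come from the output above, while the shortened rows and the 'Box' values are made up for illustration.

```python
def keep_most_reviewed(rows, name_index, reviews_index):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Replace the stored row only when this one has strictly more reviews.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Shortened, partly hypothetical rows: name, category, rating, reviews.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

deduplicated = keep_most_reviewed(sample, 0, 3)
for row in deduplicated:
    print(row)
```

One difference worth noting: the result is ordered by each name's first appearance, not by the position of the row that was kept.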

We'll do a quick check using the function we defined earlier. If everything is working properly, we should get 9,659 for the number of rows.

In [13]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


# Removing Non-English Apps
Some of the apps in the Google Play Store dataset are made for non-English speakers. Rather than attempt to translate each of the non-English apps, we'll just remove them. To do this, we'll remove all apps whose names contain characters that are not commonly used in English text. We can consider numbers (0-9), punctuation marks, and common symbols (+, =, -, etc.) to be part of English text for this purpose. Each of these characters is encoded in the ASCII standard, so each has its own number, and those numbers range from 0 to 127. We can use this to our advantage by creating a function that checks whether a string contains any character with a number above 127.

The built-in function ord() returns the Unicode code point of a character, which is identical to the ASCII number for characters in the ASCII range, so we don't have to consult the ASCII table.
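
To see what ord() reports for individual characters, the short sketch below lists every character of a name whose code point lies above 127. The helper name `non_ascii_characters` is hypothetical and used only for illustration.

```python
def non_ascii_characters(text):
    """Return the characters in text whose code point lies outside ASCII (0-127)."""
    return [character for character in text if ord(character) > 127]


for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']:
    flagged = non_ascii_characters(name)
    # Show each offending character next to its code point.
    print(name, '->', [(character, ord(character)) for character in flagged])
```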

In [14]:
def language_checker(string):
    
    for character in string:
        if ord(character) > 127:
            return False
        
    return True

print(language_checker('Instagram'))
print(language_checker('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(language_checker('Docs To Go™ Free Office Suite'))
print(language_checker('Instachat 😜'))

True
False
False
False


The function's performance is still not satisfactory because the third and fourth test cases, which are English app names, are incorrectly flagged as non-English. Emojis and symbols like ™ fall outside the ASCII range (their code points are greater than 127), so they trigger a False result. To improve the function, we can require more than three such characters before returning False. Although it isn't a perfect solution, it will greatly decrease the number of English apps we wrongly discard.

In [15]:
def eng_checker(string):
    
    non_eng_chars = 0
    
    for character in string:
        if ord(character) > 127:
            non_eng_chars += 1
    
    if non_eng_chars > 3:
        return False
    
    else:
        return True

print(eng_checker('Docs To Go™ Free Office Suite'))
print(eng_checker('Instachat 😜'))

True
True


Now that we've improved our function's performance, we can use it to create a list of English apps for each marketplace.

In [16]:
eng_apps_droid = []
for app in android_clean:
    name = app[0]
    if eng_checker(name):
        eng_apps_droid.append(app)

eng_apps_ios = []
for row in apple:
    name_ios = row[1]
    if eng_checker(name_ios):
        eng_apps_ios.append(row)
        
explore_data(eng_apps_droid, 0, 3, True)
print('\n')
explore_data(eng_apps_ios, 0, 3, True)        

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

# Removing Paid Apps

As we mentioned in our introduction, our company only develops free apps with in-app advertisements being our main revenue source, so our final step in the data cleaning process is to isolate the free apps.

In [17]:
free_droid_apps = []
free_ios_apps = []

for app in eng_apps_droid:
    price = app[7]  # 'Price' column
    if price == '0':
        free_droid_apps.append(app)

for app in eng_apps_ios:
    price = app[4]  # 'price' column
    if price == '0.0':
        free_ios_apps.append(app)

In [18]:
print(len(free_droid_apps))
print(len(free_ios_apps))

8864
3222
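
The two stores write a price of zero differently ('0' and '0.0'), which is why the comparisons above use different strings. A more tolerant check could parse the price as a number. The sketch below uses a hypothetical helper `is_free` and assumes that paid prices may carry a leading dollar sign such as '$4.99'.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' means free."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable prices are treated as not free rather than guessed at.
        return False


for price in ['0', '0.0', '$4.99', 'Everyone']:
    print(repr(price), is_free(price))
```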


# Analyzing Apps by Genre

Our plan to maximize the user base of our apps is as follows:
1. Develop a basic Android app which we will upload to the Google Play Store.
2. Depending on how successful the app is according to our metrics (user count, reviews, etc.), we'll build upon the existing app.
3. Develop an iOS version of our now finished app which we will upload to the App Store.

Ideal apps will feature more general content instead of content specific to any one app marketplace. For starters we can take a look at which app genres are most common by creating a frequency table.

In [23]:
def freq_table(dataset, index):
    """Creates a frequency table for a column of a given data set"""
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

def display_table(dataset, index):
    """Displays the percentages of a frequency table in descending order"""
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
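
For comparison, the standard library's collections.Counter can do the counting and the descending sort in a few lines. This is only an alternative sketch, with a hypothetical function name `freq_table_counter` and made-up sample rows.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, most common value first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs in descending order of count.
    return {value: count / total * 100 for value, count in counts.most_common()}


# Hypothetical rows: app name and genre.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
percentages = freq_table_counter(sample, 1)
for genre, percentage in percentages.items():
    print(genre, ':', percentage)
```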

Now that we have functions that compute and display the percentage of apps in each category, we can begin our analysis. The first area we can take a look at is the prime_genre column of the App Store dataset.

In [24]:
display_table(free_ios_apps, -5) #Prime Genre

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


From the display table we can see that the overwhelming majority of free English apps in the App Store fall into the category of leisure (the Games, Entertainment, Photo & Video, and Social Networking genres constitute nearly 75% of them). However, this does not necessarily imply that non-leisure apps have smaller user bases. Leisure apps, especially games, could be less difficult or expensive to make, which would explain the large discrepancy. Now, for the free English Google Play Store apps, we'll take a look at the Category and Genres columns.

In [35]:
display_table(free_droid_apps, 1) # Category

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

The Google Play Store seems to have a more even distribution of apps among the different categories: its largest category (Family) holds roughly 19% of the apps, compared to about 58% for the App Store's largest genre (Games). Upon further examination, the distribution is less even than it first appears, because the Family category contains mostly children's games. Despite this, our initial impression still holds, given that the percentage of non-leisure apps is still far greater than in the App Store.

In [34]:
display_table(free_droid_apps, -4) # Genre

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

From the display table we can see that the main difference between the Genres and Category columns is that the Genres column is more specific. Since we are interested in the broader picture, we will omit this column from our analysis.

Now we'll find out which genres have the highest user counts to complete the picture.

# Predicting Genre Profitability Based on Avg User Count/Avg Rating 
## Sorting iOS Genres By Avg User Count/Avg Rating 

Since our goal is to discover which genres are most popular, the natural metric is the average number of installs for apps within each genre. The iOS dataset has no installs column, so we'll use the total number of user ratings (rating_count_tot) as a proxy and begin by determining the average rating count for each genre.

In [26]:
ios_genres = freq_table(free_ios_apps, -5)


for genre in ios_genres:
    total = 0
    len_genre = 0
    for app in free_ios_apps:
        genre_app = app[-5]
        if genre_app == genre:
            rating_count = float(app[5])
            total += rating_count
            len_genre += 1
    avg_rating_count = total / len_genre
    print(genre, ':', avg_rating_count)   

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


Here we can see the leading categories (in descending order) are:
   1. Navigation
   2. Reference
   3. Social Networking
   4. Music
   5. Weather
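
The ranking above was read off the printed averages by eye. As a sketch of how it could be produced directly, the hypothetical helper `average_by_group` below accumulates totals and counts in a single pass and returns the groups sorted by average. The four rating counts are taken from the outputs in this notebook, but the three-column row layout is made up for the example.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return (group, average) pairs sorted from highest to lowest average."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_index]] += float(row[value_index])
        counts[row[group_index]] += 1
    averages = {group: totals[group] / counts[group] for group in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Simplified rows: name, genre, rating count.
sample = [
    ['Waze', 'Navigation', '345046'],
    ['Google Maps', 'Navigation', '154911'],
    ['Pandora', 'Music', '1126879'],
    ['TIDAL', 'Music', '7398'],
]

ranking = average_by_group(sample, 1, 2)
for genre, average in ranking:
    print(genre, ':', average)
```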

Of course, we must take into account the possibility that a few apps are the source of most of the ratings in a given genre. This is the case for the Navigation and Social Networking genres, where Waze/Google Maps and Facebook/Pinterest have total rating counts well above the average rating counts of their respective genres.
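
One common way to check how strongly a few large apps pull on an average is to compare the mean with the median. The sketch below is illustrative only: the two largest values are the Waze and Google Maps rating counts, while the remaining four are made-up numbers standing in for smaller apps.

```python
from statistics import mean, median


def summarize(counts):
    """Return the mean and median of a list of rating counts."""
    return {'mean': mean(counts), 'median': median(counts)}


# First two values from the Navigation genre; the last four are hypothetical.
navigation_counts = [345046, 154911, 5000, 1200, 300, 50]

summary = summarize(navigation_counts)
print('mean  :', summary['mean'])
print('median:', summary['median'])
```

When the mean is far above the median, as in this sample, a handful of very popular apps dominates the genre average.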

In [30]:
for app in free_ios_apps:
    if app[11] == 'Navigation' and (float(app[5]) > 86090): # isolating apps with rating counts > average_rating_count
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911


In [33]:
for app in free_ios_apps:
    if app[11] == 'Music':
        print(app[1], ':', app[5]) # print name and number of ratings

Pandora - Music & Radio : 1126879
Spotify Music : 878563
Shazam - Discover music, artists, videos & lyrics : 402925
iHeartRadio – Free Music & Radio Stations : 293228
SoundCloud - Music & Audio : 135744
Magic Piano by Smule : 131695
Smule Sing! : 119316
TuneIn Radio - MLB NBA Audiobooks Podcasts Music : 110420
Amazon Music : 106235
SoundHound Song Search & Music Player : 82602
Sonos Controller : 48905
Bandsintown Concerts : 30845
Karaoke - Sing Karaoke, Unlimited Songs! : 28606
My Mixtapez Music : 26286
Sing Karaoke Songs Unlimited with StarMaker : 26227
Ringtones for iPhone & Ringtone Maker : 25403
Musi - Unlimited Music For YouTube : 25193
AutoRap by Smule : 18202
Spinrilla - Mixtapes For Free : 15053
Napster - Top Music & Radio : 14268
edjing Mix:DJ turntable to remix and scratch music : 13580
Free Music - MP3 Streamer & Playlist Manager Pro : 13443
Free Piano app by Yokee : 13016
Google Play Music : 10118
Certified Mixtapes - Hip Hop Albums & Mixtapes : 9975
TIDAL : 7398
YouTube Mu

In [31]:
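for app in free_ios_apps:
    # note: 74942 is the Reference average from the table above; the Social Networking average is about 71548
    if (app[11] == 'Social Networking') and (float(app[5]) > 74942):
        print(app[1], ':', app[5]) # print name and number of ratings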

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412


We can see from the list of leading genres by rating count that building the most profitable apps would require outside expertise or an existing infrastructure. For instance, for an app like Instagram we would need developers who can create a useful algorithm for sorting user data, and for navigation apps we would need access to an existing satellite network. However, one area we can capitalize on is social media analytics. Collecting data about an individual's social media profile and notifying them of any changes to it is the idea behind apps like "InsTrack for Instagram - Analytics Plus More" and "Followers - Social Analytics For Instagram", and both have total rating counts that exceed the average. These apps are a good fit for our company because they are based on data analysis, so our existing skill set will likely be sufficient to develop one.

## Sorting Google Play Store Category By Avg User Count/Avg Rating
Now we'll repeat the process for the Google Play Store dataset. This dataset has an Installs column, so we'll begin by calculating the average number of installs for each category.

In [36]:
droid_categories = freq_table(free_droid_apps, 1)
for category in droid_categories:
    total = 0
    len_category = 0
    for app in free_droid_apps:
        category_app = app[1]
        if category_app == category:
            no_installs = app[5]
            no_installs = no_installs.replace(',', '')
            no_installs = no_installs.replace('+', '')
            total += float(no_installs)
            len_category += 1
    
    avg_no_installs = total/len_category
    print(category, ':', avg_no_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

The leading categories for the Google Play Store dataset are as follows:
1. Communication
2. Video Players
3. Social
4. Photography
5. Productivity
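
The Installs values are labels such as '1,000,000+' and have to be cleaned before they can be averaged. The cleaning step used in the cell above can be isolated in a small helper. The name `parse_installs` is hypothetical, and the result is only a lower bound because of the trailing '+'.

```python
def parse_installs(installs_text):
    """Convert an installs label such as '1,000,000+' to an integer lower bound."""
    return int(installs_text.replace(',', '').replace('+', ''))


for label in ['1,000,000,000+', '500,000+', '0']:
    print(label, '->', parse_installs(label))
```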

Again, we can see that certain apps are skewing the averages for many of the leading categories. 

In [46]:
for app in free_droid_apps:
    installs = app[5].replace(',',"")
    installs_final = installs.replace('+',"")
    if app[1] == 'COMMUNICATION' and (float(installs_final) > 38456119):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Contacts : 50,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
free video calls and chat : 50,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
Dolphin Browser - Fast, Private & Adblock🐬 : 50,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Mail.Ru - Email App : 50,000,000+
Hangouts : 1,000,000,000+
Azar : 50,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure 

Apps like WhatsApp, Gmail, and Skype each have over a billion installs, so if we were to omit them from our analysis, the average number of installs would be considerably lower. Now let's take a look at the Video Players and Social categories to see if the pattern holds.
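
To get a feel for how much the largest apps move the average, one could recompute it while leaving out every app at or above some install threshold. The sketch below uses a hypothetical helper `average_installs` and made-up install counts that only mirror the tiers seen in the output above.

```python
def average_installs(install_counts, cap=None):
    """Average the counts, optionally ignoring apps with cap installs or more."""
    kept = [count for count in install_counts if cap is None or count < cap]
    return sum(kept) / len(kept) if kept else 0.0


# Hypothetical install counts for one category.
sample = [1_000_000_000, 1_000_000_000, 100_000_000, 50_000_000, 1_000_000, 100_000]

print('all apps         :', average_installs(sample))
print('under 100 million:', average_installs(sample, cap=100_000_000))
```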

In [56]:
for app in free_droid_apps:
    installs = app[5].replace(',',"")
    installs_final = installs.replace('+',"")
    if app[1] == 'VIDEO_PLAYERS' and (float(installs_final) > 24727872): 
        print(app[0], ':', app[5])

YouTube : 1,000,000,000+
Motorola Gallery : 100,000,000+
VLC for Android : 100,000,000+
Vote for : 50,000,000+
Vigo Video : 50,000,000+
Google Play Movies & TV : 1,000,000,000+
MiniMovie - Free Video and Slideshow Editor : 50,000,000+
Samsung Video Library : 50,000,000+
LIKE – Magic Video Maker & Community : 50,000,000+
MX Player : 500,000,000+
Dubsmash : 100,000,000+
DU Recorder – Screen Recorder, Video Editor, Live : 50,000,000+
KineMaster – Pro Video Editor : 50,000,000+
VMate : 50,000,000+
HD Video Downloader : 2018 Best video mate : 50,000,000+
VivaVideo - Video Editor & Photo Movie : 100,000,000+
VideoShow-Video Editor, Video Maker, Beauty Camera : 100,000,000+
Ringdroid : 50,000,000+
Motorola FM Radio : 100,000,000+


In [52]:
for app in free_droid_apps:
    installs = app[5].replace(',',"")
    installs_final = installs.replace('+',"")
    if app[1] == 'SOCIAL' and (float(installs_final) > 23253652):
        print(app[0], ':', app[5])

Facebook : 1,000,000,000+
Facebook Lite : 500,000,000+
Tumblr : 100,000,000+
Pinterest : 100,000,000+
Google+ : 1,000,000,000+
Badoo - Free Chat & Dating App : 100,000,000+
Tango - Live Video Broadcast : 100,000,000+
ooVoo Video Calls, Messaging & Stories : 50,000,000+
Instagram : 1,000,000,000+
Snapchat : 500,000,000+
MeetMe: Chat & Meet New People : 50,000,000+
LinkedIn : 100,000,000+
Zello PTT Walkie Talkie : 50,000,000+
POF Free Dating App : 50,000,000+
SKOUT - Meet, Chat, Go Live : 50,000,000+
Tik Tok - including musical.ly : 100,000,000+
BIGO LIVE - Live Stream : 100,000,000+
VK : 100,000,000+


We can see that the trend does hold, with billion+ install apps like YouTube and Google Play Movies & TV skewing the average for the Video Players category, while Facebook, Instagram, and Google+ do the same for the Social category. There are no social media analytics apps with more installs than the average for the Social category. However, this may be explained by the different way the Google Play Store organizes its apps. Nonetheless, the prevalence of social media apps in the leading categories of each dataset still supports our theory that social media analytics apps can attract a large number of users. Even if only 1% of the people who installed Instagram from the Google Play Store also installed an Instagram analytics app, that would still mean at least 10,000,000 installs.
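
The estimate above is simple arithmetic on the install label. As a quick check, with the 1% adoption rate being a purely hypothetical assumption:

```python
# Lower bound taken from Instagram's '1,000,000,000+' installs label.
instagram_installs_lower_bound = 1_000_000_000
# Hypothetical share of Instagram users who also install an analytics app.
adoption_percent = 1

estimated_installs = instagram_installs_lower_bound * adoption_percent // 100
print(f'{estimated_installs:,}')
```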

# Conclusion

Based on our analysis, we've come to the conclusion that a social media analytics app will likely be one of the most profitable for our company to develop. We are already using data analysis tools, so it should not be too difficult to apply these same tools to social media analytics. We would repeat much of the process we completed here and identify relevant metrics (changes in follower count, alerts about blocked or inactive accounts a user follows, and, if possible, when and by whom their profile has been viewed).