# Profitable App Profile for the App Store and Google Play Markets

The aim of this project is to find app profiles that are profitable on both the App Store and Google Play markets.

Our company builds free apps, and our main source of revenue is in-app ads, so revenue for any given app depends largely on how many people use it. Our goal for this project is to analyze data to help developers understand which kinds of apps are likely to attract the most users.


# Opening and Exploring Data


Google Play data set: https://www.kaggle.com/lava18/google-play-store-apps 

App Store data set: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps

First we will read in the data.

In [1]:
from csv import reader

#Opening Apple Store app file
ios_file = open('AppleStore.csv')
ios_readfile = reader(ios_file)
ios_appdata = list(ios_readfile)

#Opening Google Play Store app file
android_file = open('googleplaystore.csv')
android_readfile = reader(android_file)
android_appdata = list(android_readfile)
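
The file handles opened above are never closed. A minimal sketch of an alternative, assuming the CSV files are UTF-8 encoded: a small helper (the name `load_csv` is hypothetical and not part of this notebook) that reads a file inside a `with` block and returns the header separately from the data rows. The demonstration uses a temporary file with made-up rows rather than the Kaggle data.

```python
from csv import reader
import os
import tempfile

def load_csv(path, encoding='utf8'):
    """Read a CSV file and return (header, rows); the file is closed on exit."""
    with open(path, encoding=encoding, newline='') as f:
        data = list(reader(f))
    return data[0], data[1:]

# Small demonstration on a temporary file (made-up rows, not the Kaggle data)
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf8', newline='') as tmp:
    tmp.write('App,Category\nDemo App,TOOLS\n')
    demo_path = tmp.name

header, rows = load_csv(demo_path)
os.remove(demo_path)
print(header)
print(rows)
```

Keeping the header apart from the rows would also remove the need for the `[1:]` slices used in later cells.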


To make things easier, we will write a function that prints a chosen range of rows from a data set and, optionally, its number of rows and columns.

In [2]:

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
# Using explore function to print the first couple of rows of Apple Store data
explore_data(ios_appdata, 0, 3, rows_and_columns=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16


In [4]:
# Using explore function to print the first couple of rows of Google Play Store Data
explore_data(android_appdata, 0, 3, rows_and_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


# Data Cleaning
Removing non-English apps, and removing apps that are not free.

This discussion, https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015, reports an error in the Play Store data: the row at index 10473 is missing its category value and has an empty genre.

In [5]:
# Checking whether the row is malformed. It is: the category value is missing
explore_data(android_appdata, 10472, 10475)

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']




In [6]:
#Removing this row
del android_appdata[10473]
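
The row above was located thanks to a forum discussion. As a rough sketch of how such rows could be found programmatically, the hypothetical helper below (`find_malformed_rows` is not part of this notebook) flags every row whose number of columns differs from the header's. It only catches missing or extra columns, not wrong values, and the sample data is made up.

```python
def find_malformed_rows(dataset):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [i for i, row in enumerate(dataset) if len(row) != expected]

# Made-up sample: the row at index 2 is missing one column
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]
bad = find_malformed_rows(sample)
print(bad)

# Delete from the end so that earlier indices stay valid
for i in reversed(bad):
    del sample[i]
print(len(sample))
```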

# Removing Duplicate Entries

The two code cells below show that our data set contains duplicate apps. The Google Play Store data has 1181 duplicate entries. A few duplicate rows are printed to confirm.

In [7]:
duplicate_apps = []
unique_apps = []

for app in android_appdata[1:]:
    name = app[0]
    
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])


Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


In [8]:
for app in android_appdata[1:]:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


### Looking at the output above, the entries differ only in the number of reviews (the fourth column, 'Reviews'). Instead of removing duplicate entries at random, the criterion I will use is to keep the one entry per app with the highest number of reviews, since that entry should be the most recent.

In [9]:
reviews_max = {}

for app in android_appdata[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print('Length after removing duplicates\n')
print('Expected: ', (len(android_appdata) - 1) - len(duplicate_apps))
print('Actual: ', len(reviews_max))

Length after removing duplicates

Expected:  9659
Actual:  9659


### We will now clean the data by looping through our data set and keeping a row only if its number of reviews equals the maximum stored in the dictionary above. The expected length of our clean data is 9659.

In [10]:
android_clean = []
already_added = []

for app in android_appdata[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

print('Length of clean data: ',len(android_clean))

Length of clean data:  9659
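
The two passes above can also be folded into one. The sketch below is an alternative, not the method used in this notebook: it stores the best row seen so far for each app name in a dictionary, which also avoids the repeated `name not in already_added` list lookups. The function name and its parameters are hypothetical, and the sample rows are shortened for illustration (the 'Box' review count is made up).

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the one with the most reviews (first wins on ties)."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Shortened rows: name, category, rating, reviews
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
deduped = dedupe_by_max_reviews(sample)
print(len(deduped))
print(deduped[0])
```

Note that the resulting row order follows the first appearance of each name, which can differ from the order produced by the loop above.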


### Checking for duplicates in our iOS data. None were found.

In [11]:
duplicate_ios = []
unique_ios = []

for app in ios_appdata[1:]:
    app_id = app[0]  # avoid shadowing the built-in id()
    if app_id in unique_ios:
        duplicate_ios.append(app_id)
    else:
        unique_ios.append(app_id)

print('Duplicates found: ', len(duplicate_ios))

Duplicates found:  0


# Removing non-English Apps

To filter out non-English apps, we will check whether the characters in each app name fall within the ASCII range 0 - 127. So that we do not remove names that contain a few characters above 127 but are still English, such as 'Docs To Go™ Free Office Suite' and 'Instachat 😜', we will allow up to three characters to have a value over 127.

In [12]:
#Function to determine if a String is english
def is_english(a_string):
    n_non_eng = 0
    
    for char in a_string:
        if ord(char) > 127:
            n_non_eng += 1
    
    if n_non_eng > 3:
        return False
    else:
        return True

In [13]:
english_android_appdata = []
english_ios_appdata = []

n_non_eng_android = 0
n_non_eng_ios = 0

for app in android_clean:
    name = app[0]
    if is_english(name) == True:
        english_android_appdata.append(app)
    else:
        n_non_eng_android += 1

for app in ios_appdata[1:]:
    name = app[1]
    if is_english(name) == True:
        english_ios_appdata.append(app)
    else:
        n_non_eng_ios += 1
        

print('Android Non-English apps found: ', n_non_eng_android)
print('iOS Non-English apps found: ', n_non_eng_ios)
print('\n')
print('Android apps in new list: ', len(english_android_appdata))
print('iOS apps in new list: ', len(english_ios_appdata))

Android Non-English apps found:  45
iOS Non-English apps found:  1014


Android apps in new list:  9614
iOS apps in new list:  6183


# Isolating Free Apps

For this project we are interested only in free apps, since our main source of revenue is in-app ads. So far our data sets contain both free and non-free apps. We will now isolate the free apps.

In [14]:
android_final = []
ios_final = []

for app in english_android_appdata:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in english_ios_appdata:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)

print(len(android_final))
print(len(ios_final))

8864
3222


### After removing inaccurate data, duplicate entries, non-English apps, and non-free apps, we are left with 8864 Android apps and 3222 iOS apps.
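
The filter above compares price strings exactly ('0' for Android, '0.0' for iOS), so a free app written as '0.00' would be missed. A slightly more defensive sketch converts the price to a number first. The helper name `is_free` is hypothetical, and the handling of a leading '$' assumes that paid Play Store prices are formatted like '$4.99'.

```python
def is_free(price):
    """Return True when a price string such as '0', '0.0' or '$4.99' equals zero."""
    # Assumption: the only non-numeric character in a price is a leading '$'
    return float(price.replace('$', '').strip()) == 0.0

print(is_free('0'))
print(is_free('0.0'))
print(is_free('$4.99'))
```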

# Analysis

As mentioned in the introduction, our aim is to determine which kinds of apps attract the most users, since our revenue is based on in-app ads.

Because our end goal is to add an app on both Google Play and the App Store, we need to find app profiles successful on both markets.

We will begin the analysis by determining the most common genres for each market.

## Most Common Apps by Genre

For this we'll need to build frequency tables for a few columns in the data sets.

We will build two functions:
1. One function to generate frequency tables that show percentages
2. Another that displays the percentages in descending order

In [15]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for app in dataset:
        total += 1
        val = app[index]
        if val in table:
            table[val] +=1
        else:
            table[val] = 1
    
    table_percentage = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentage[key] = percentage
    
    return table_percentage    

In [16]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
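
For comparison, here is a sketch of the same idea built on `collections.Counter` from the standard library. It is an alternative to the two functions above, not a replacement used elsewhere in this notebook; the name `freq_table_counter` is hypothetical and the sample rows are made up.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, ordered from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.most_common()}

# Made-up sample rows: name, genre
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_table_counter(sample, 1)
for genre, pct in table.items():
    print(genre, ':', pct)
```

Because `most_common()` already returns the values in descending order of count, no separate sorting step is needed.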

Examining the frequency table for the prime_genre column of the App Store data:

In [17]:
display_table(ios_final, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


From the table above, one can see that the most common genres of free English apps on the App Store are Games (58.16%) and Entertainment (7.88%). The general impression is that free English apps are used primarily for fun (games, photo and video, social networking, sports, music, etc.) rather than for practical purposes (education, shopping, utilities, productivity, lifestyle).

However, fun apps being the most numerous does not guarantee that they have the greatest number of users.

Now we will examine the Category and Genres columns, respectively, for the Google Play Store.

In [18]:
display_table(android_final, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

The most common categories for the Play Store look significantly different. They include Family (18.91%), Game (9.72%), Tools (8.46%), Business (4.59%) and Lifestyle (3.90%). At first glance, the general impression is that free English apps on the Play Store are primarily designed for practical purposes (Family, Tools, Business, Productivity, Finance, Medical, etc.). However, if we investigate further, the Family category contains mostly games for kids.

Similar to the previous case above, the most common category of apps does not necessarily reflect the apps with the most users.

Even though the Family category consists mostly of games, the Play Store seems to contain more apps for practical purposes and thus a more balanced landscape. This can also be seen from the Genres column below.

In [19]:
display_table(android_final, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

At first glance, it is not clear what the main purpose of the Genres column is. It differs from the Category column in that it is more granular (it contains more distinct values). Because we are only after the general picture, we will stick to the Category column moving forward.

Up to this point we have found that the Apple App Store is dominated by apps that are designed for fun, whereas the Play Store contains a more balanced landscape of both fun and practical apps.

Next we will look to see what kind of apps have the most users.

## Most Popular Apps by Genre on the App Store

One way to find which genres are the most popular is to calculate the average number of installs for each app genre. For the Google Play data we have the column 'Installs', but this is missing for the App Store. As a workaround, we'll use the total number of user ratings found in the 'rating_count_tot' column. First we will look at the App Store data.

In [25]:
ios_genre_table = freq_table(ios_final, 11)  # prime_genre column (index 11, same as -5)

In [27]:
for genre in ios_genre_table:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[11]
        if genre_app == genre:
            ratings = float(app[5])
            total += ratings
            len_genre += 1
    avg_ratings = total / len_genre
    print(genre,':',avg_ratings)

News : 21248.023255813954
Photo & Video : 28441.54375
Entertainment : 14029.830708661417
Finance : 31467.944444444445
Book : 39758.5
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Reference : 74942.11111111111
Weather : 52279.892857142855
Games : 22788.6696905016
Catalogs : 4004.0
Shopping : 26919.690476190477
Medical : 612.0
Travel : 28243.8
Business : 7491.117647058823
Education : 7003.983050847458
Music : 57326.530303030304
Navigation : 86090.33333333333
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Social Networking : 71548.34905660378
Food & Drink : 33333.92307692308


On average, Navigation apps have the highest number of user ratings, followed by Reference, Social Networking, Music and Weather. The average for each genre could be skewed by a few giants such as Google Maps and Facebook. For a more accurate picture, we could remove these extremely popular apps. This level of detail will be left for later.
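
To illustrate how a single giant can distort an average, the sketch below compares the mean with the median on made-up rating counts (these numbers are hypothetical and do not come from the data set). The median is one possible robust alternative; it is not the measure used in this notebook.

```python
from statistics import mean, median

# Hypothetical rating counts for one genre: four small apps and one giant
rating_counts = [1200, 800, 2500, 1500, 2_000_000]

print('mean  :', mean(rating_counts))
print('median:', median(rating_counts))
```

With these numbers the mean is several hundred times larger than the median, even though four of the five apps have only a few thousand ratings.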

Now we will analyze the Google Play data.



## Most Popular Apps by Genre on Google Play

The install numbers are not precise. For example, we don't know whether an app listed with 100,000+ installs has 100,000, 200,000, or 350,000 installs. However, for our purposes we don't need precision with respect to the number of users; we only want to find out which app genres attract the most users.

In [29]:
display_table(android_final, 5) # the Installs column

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


To perform calculations on these numbers, we first need to remove the comma and plus characters. Then we will convert the strings to floats.
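
A minimal sketch of that cleaning step as a reusable helper, together with a single-pass average per category. Both function names are hypothetical, and the column indices default to the Category (1) and Installs (5) positions used in this notebook. Each parsed value is the lower bound of its install bucket, and the sample rows are made up.

```python
from collections import defaultdict

def parse_installs(installs):
    """Convert an install bucket such as '1,000,000+' to a float (its lower bound)."""
    return float(installs.replace(',', '').replace('+', ''))

def average_installs_by_category(rows, category_idx=1, installs_idx=5):
    """Return {category: average installs}, visiting every row exactly once."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        category = row[category_idx]
        totals[category] += parse_installs(row[installs_idx])
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}

# Made-up rows; only the Category and Installs columns are filled in
sample = [
    ['App A', 'TOOLS', '', '', '', '1,000+'],
    ['App B', 'TOOLS', '', '', '', '3,000+'],
    ['App C', 'GAME', '', '', '', '10,000+'],
]
averages = average_installs_by_category(sample)
print(averages)
```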

In [31]:
android_table = freq_table(android_final, 1)

for category in android_table:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category == category_app:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
    
    avg_installs = total / len_category
    print(category,':', avg_installs)

TRAVEL_AND_LOCAL : 13984077.710144928
DATING : 854028.8303030303
BUSINESS : 1712290.1474201474
MAPS_AND_NAVIGATION : 4056941.7741935486
PARENTING : 542603.6206896552
SPORTS : 3638640.1428571427
GAME : 15588015.603248259
AUTO_AND_VEHICLES : 647317.8170731707
VIDEO_PLAYERS : 24727872.452830188
HOUSE_AND_HOME : 1331540.5616438356
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
EDUCATION : 1833495.145631068
COMMUNICATION : 38456119.167247385
ART_AND_DESIGN : 1986335.0877192982
BEAUTY : 513151.88679245283
PHOTOGRAPHY : 17840110.40229885
FOOD_AND_DRINK : 1924897.7363636363
BOOKS_AND_REFERENCE : 8767811.894736841
HEALTH_AND_FITNESS : 4188821.9853479853
SHOPPING : 7036877.311557789
ENTERTAINMENT : 11640705.88235294
SOCIAL : 23253652.127118643
PRODUCTIVITY : 16787331.344927534
MEDICAL : 120550.61980830671
COMICS : 817657.2727272727
FAMILY : 3695641.8198090694
TOOLS : 10801391.298666667
WEATHER : 5074486.197183099
PERSONALIZATION : 5201482.6122448975
LIFESTYLE : 1437816.2687861272
NEWS_A

On average, communication apps have the most installs (38,456,119). This number is heavily skewed by a few very popular apps with 1,000,000,000+ installs, and by others with 500,000,000+ and 100,000,000+ installs (WhatsApp, Skype, Messenger, etc.).

In [32]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+' or app[5] == '500,000,000+' or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

If we removed all the communication apps with 100,000,000 or more installs, the average would probably be significantly reduced.
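
One way to check this is to recompute the average while skipping apps at or above a chosen install threshold. The sketch below is hypothetical (the function name, its parameters, and the sample rows are all assumptions for illustration) and has not been run against the real data.

```python
def average_installs_below(rows, category, threshold,
                           category_idx=1, installs_idx=5):
    """Average installs for `category`, ignoring apps at or above `threshold`."""
    values = []
    for row in rows:
        if row[category_idx] != category:
            continue
        n_installs = float(row[installs_idx].replace(',', '').replace('+', ''))
        if n_installs < threshold:
            values.append(n_installs)
    # Return 0.0 rather than dividing by zero when nothing is left
    return sum(values) / len(values) if values else 0.0

# Made-up rows; only the Category and Installs columns are filled in
sample = [
    ['Chat A', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Chat B', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['Chat C', 'COMMUNICATION', '', '', '', '5,000,000+'],
    ['Tool A', 'TOOLS', '', '', '', '500,000+'],
]
trimmed = average_installs_below(sample, 'COMMUNICATION', 100_000_000)
print(trimmed)
```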

Other runner-up categories, such as video players or social, contain apps that heavily skew the results as well (Facebook, Instagram, YouTube). The main concern is that these app genres may seem more popular than they actually are.

The books and reference category is fairly popular and would be interesting to explore in more depth. We found this genre to have some potential in the App Store, and our aim is to find a genre that is profitable on both the App Store and Google Play.