# Ad Analysis for Google Play and App Store

In this project we act as data analysts for a company that builds free-to-use apps for both the Android and iOS markets. Because the company makes primarily free apps, all of its revenue comes from in-app ads.

To help our 'team' make data-driven decisions, we will analyze app data to understand which kinds of apps and advertisements attract more users.

# Opening and Exploring Our Dataset

The Google Play Store and the Apple App Store each host on the order of 2 million apps. Analyzing all of them is impractical, so for this project I'll use two sample data sets of several thousand apps each:

* [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about roughly 10000 apps from the Google Play Store. This data was collected in August 2018. You can download the data set directly from [here.](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)
* [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps?select=AppleStore.csv) containing data about roughly 7000 apps from the Apple Store. This data was collected in July 2017. You can download the data set directly from [here.](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)


Since both files are CSV files, we first open and read them using reader from the csv module.


In [1]:
from csv import reader

# importing app store data as list of lists
app_file = open(r'C:\datasets\AppleStore.csv', encoding="utf8")  # raw string keeps the backslashes literal
read_app = reader(app_file)
apps_data = list(read_app)
apps_header = apps_data[0]
apps_data = apps_data[1:]

# importing google play store data as list of lists
goog_file = open(r'C:\datasets\googleplaystore.csv', encoding="utf8")  # raw string keeps the backslashes literal
read_goog = reader(goog_file)
goog_data = list(read_goog)
goog_header = goog_data[0]
goog_data = goog_data[1:]
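The loading cell leaves both file handles open. As a minimal sketch of the same step, assuming a hypothetical helper name `load_csv` that is not part of the original notebook, the read can be wrapped in a `with` block so the file is closed automatically. The demo writes a tiny throwaway CSV so it runs without the real data sets.

```python
from csv import reader
import os
import tempfile

def load_csv(path):
    """Return (header, rows) for a CSV file. Hypothetical helper name."""
    # newline='' is what the csv module documentation recommends for reader input
    with open(path, encoding='utf8', newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]

# Demo on a made-up two-line file, not on the real data sets.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample.csv')
    with open(sample_path, 'w', encoding='utf8', newline='') as f:
        f.write('App,Category\nInstagram,SOCIAL\n')
    header, rows = load_csv(sample_path)

print(header)
print(rows)
```

With the real files, the call would be `load_csv(r'C:\datasets\AppleStore.csv')`, assuming that path exists on the machine.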


After importing our data we'll explore it with the function below, which prints a slice of the data set between the indices we choose. The function can also report the number of rows and columns.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
# print the header row as well as a sample slice of the data
print(goog_header)
print('\n')
explore_data(goog_data,0,3,True)


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


This shows that our Google Play Store data set has 13 columns and 10841 entries.

In [3]:
print(apps_header)
print('\n')
explore_data(apps_data,0,3,True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 7197
Number of columns: 17


This shows that our Apple Store data set has 17 columns and 7197 entries.



# Deleting Incorrect Data

The [discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section for the Google Play Store data set includes [this](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) post, which reports that the row at index 10472 is incorrect.

First let's print the row at that index alongside the header and see what's wrong with the data.



In [4]:
#data we know is correct
print(goog_header)
print('\n')
#data to observe
print(goog_data[10472])


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


This row is missing its Category value, so every later value is shifted one column to the left (the Category slot holds '1.9', which is really the rating). We will delete this row from our data set.

In [5]:
print(len(goog_data))
del goog_data[10472]  # run this cell only once, or a second, valid row is deleted
print(len(goog_data))

10841
10840
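Instead of relying on a forum post, rows like this one can also be found by comparing each row's length with the header's length. The sketch below assumes a hypothetical helper name `find_malformed_rows` and uses made-up sample rows, not the real data set.

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose length differs from the header's."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]

# Made-up sample: the second row lacks its Category value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5'],
    ['Broken Frame App', '1.9'],
    ['Slack', 'BUSINESS', '4.4'],
]

bad = find_malformed_rows(sample_header, sample_rows)
print(bad)

# Delete from the end so the remaining indices stay valid.
for i in reversed(bad):
    del sample_rows[i]
print(len(sample_rows))
```

This only catches rows with a wrong number of fields. A row with the right length but wrong values would need a different check.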


# Duplicate Entries

After exploring the data set for a while, we find that several apps have duplicate entries in the Google Play Store data set.

In [6]:
for app in goog_data:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Instagram alone has 3 extra entries. Let's see how many other apps have duplicate entries.

In [7]:
duplicate_apps = []
unique_apps = []

for app in goog_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
    
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
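The check `name in unique_apps` scans a list on every iteration, which gets slow as the list grows. As an alternative sketch (not the notebook's method), `collections.Counter` from the standard library counts every name in one pass. The sample names below are for illustration only.

```python
from collections import Counter

# Illustrative sample of app names, with repeats.
sample_names = ['Box', 'Slack', 'Box', 'Instagram', 'Box', 'Slack']

name_counts = Counter(sample_names)

# Every occurrence beyond the first counts as a duplicate entry.
n_duplicates = sum(count - 1 for count in name_counts.values())
repeated = [name for name, count in name_counts.items() if count > 1]

print('Number of duplicate entries:', n_duplicates)
print('Apps with duplicates:', repeated)
```

On the real data the input would be `[app[0] for app in goog_data]`, and the duplicate count should match the one computed by the loop.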


To clean the data we should delete most of these duplicates, but we can't delete entries at random. Looking back at the Instagram example, the entries differ only in review count, which suggests the data was collected at different times.

For each app we will keep the entry with the most reviews and ignore the others, since the entry with the most reviews is most likely the most up-to-date one.

What to do:
* First we create a dictionary that stores app names as keys and the review count as the corresponding value.
* Then we iterate through the app list, and whenever the review count of the current entry exceeds the value stored in the dictionary, we replace the stored value.

In [8]:
reviews_dict = {}
for app in goog_data:
    name = app[0]
    reviews = float(app[3])
    if name in reviews_dict and reviews_dict[name] < reviews:  # keep the highest review count seen so far
        reviews_dict[name] = reviews
    elif name not in reviews_dict:
        reviews_dict[name] = reviews


After the loop above, the dictionary should hold exactly one entry per unique app. To test this we compare its length with the size of our data minus the number of duplicate entries (1181).

In [9]:
print("Expected length: {}".format(10840 - 1181))
print("Actual size: {}".format(len(reviews_dict)))

Expected length: 9659
Actual size: 9659


Now we will turn this into an actual clean data set with no duplicates. To do this we will:
* create a new list to hold the clean data
* loop through each row in the original data set and compare its review count to the maximum review count stored in the reviews dictionary
* append the app to the new list if its review count matches the dictionary value and its name has not been added yet (this handles duplicates that share the same maximum)
* skip the app otherwise


In [10]:
goog_clean = []
already_added = []
for app in goog_data:
    name = app[0]
    reviews = float(app[3])
    if (name not in already_added) and (reviews == reviews_dict[name]):
        goog_clean.append(app)
        already_added.append(name)
print(len(goog_clean))

9659
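The two steps above can also be folded into a single pass by storing the whole row, rather than just the review count, in the dictionary. This is only a sketch under that assumption. The helper name `keep_most_reviewed` is hypothetical, and the Slack row is made up for illustration, while the Instagram review counts come from the output shown earlier.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep, for each app name, the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        reviews = float(row[reviews_idx])
        # Strict '>' keeps the first row when several share the maximum.
        if name not in best or reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Slack', 'BUSINESS', '4.4', '51507'],       # made-up row
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

clean = keep_most_reviewed(sample)
for row in clean:
    print(row)
```

Because dictionaries preserve insertion order, apps come out in the order in which their names first appear.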


Now that the Google data set is cleaned, let's see if the Apple Store data set contains any duplicates.

In [11]:
duplicate_apps = []
unique_apps = []

for app in apps_data:
    name = app[1]  # column 1 is the unique app id; column 0 is only a row counter and can never repeat
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
    
print('Number of duplicate apps:', len(duplicate_apps))

Number of duplicate apps: 0


The Apple data set contains no duplicates.

# Removing Non-English Entries

Some apps in both data sets are not aimed at English-speaking audiences. Let's look at a few examples.

In [12]:
print(apps_data[814][2])
print(apps_data[6382][2])
print('\n')
print(goog_clean[7985][0])


搜狐新闻—新闻热点资讯掌上阅读软件
口袋全明星


Météo Algérie DZ


We are not interested in these non-English apps, so we need to remove them. One way to do this is to check which characters a title contains. The ord() function returns the number (code point) behind each character, and the characters commonly used in English text all fall in the ASCII range, 0 to 127. A character whose ord() value is above 127 is therefore not a standard English character.

Some English titles still include a few non-ASCII characters, such as emojis or the trademark sign, so we only remove a title if it contains more than three of them.

In [13]:
def is_english(word):
    ascii_over=0
    for character in word:
        if ord(character)>127:
            ascii_over+=1
    if ascii_over >3:
        return False
    else:
        return True

apps_english = []
goog_english = []

for app in apps_data:
    name = app[2]
    if is_english(name):
        apps_english.append(app)

for app in goog_clean:
    name = app[0]
    if is_english(name):
        goog_english.append(app)
explore_data(apps_english,0,3,True)
print('\n')
explore_data(goog_english,0,3,True)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 6183
Number of columns: 17


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free',

Now we are left with 6183 Apple Store apps and 9614 Google Play Store apps.
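To see how the three-character threshold behaves, here is a compact restatement of the filter, run on titles that appear earlier in this notebook. The parameter name `max_non_ascii` is an assumption added for illustration. The logic is meant to match is_english above.

```python
def is_english(name, max_non_ascii=3):
    """Heuristic: allow at most max_non_ascii characters outside ASCII."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

titles = ['PAC-MAN Premium', '口袋全明星', 'Météo Algérie DZ']
results = {title: is_english(title) for title in titles}

for title, kept in results.items():
    print(title, ':', kept)
```

Note that 'Météo Algérie DZ' has exactly three accented characters, so it passes the filter even though it is a French-language title. The heuristic removes titles written mostly in non-Latin scripts but is not a real language detector.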

# Retrieving Free Only Apps

Since our company only makes free apps, it makes sense to restrict the analysis to apps that are free. Below we isolate the free apps from both data sets.

In [14]:
goog_final = []
apps_final = []

for apps in apps_english:
    price = apps[5]
    if price == '0':
        apps_final.append(apps)

for apps in goog_english:
    price = apps[7]
    if price == '0':
        goog_final.append(apps)
        
print(len(apps_final))
print(len(goog_final))


3222
8862


We are left with 3222 Apple Store apps and 8862 Google Play Store apps. Now we can begin our analysis.
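The comparison `price == '0'` works for the files used here, but it silently drops free apps if a copy of the data stores the price as '0.0' or with a currency sign. A more tolerant check could look like the sketch below. The helper name `is_free` and the alternative price formats are assumptions, not something observed in this notebook's output.

```python
def is_free(price):
    """True when a price string such as '0', '0.0' or '$0.00' parses to zero."""
    return float(price.strip().lstrip('$')) == 0.0

# Illustrative price strings, not taken from the data sets.
for price in ['0', '0.0', '$0.00', '3.99', '$4.99']:
    print(repr(price), ':', is_free(price))
```

A value that is not a number at all would raise a ValueError here, which is arguably useful because it points at a malformed row.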

# Analysis of data

Now that we have clean data we can begin the analysis of the data sets. Since our company builds apps for both the Apple Store and the Google Play Store, we should look for columns that exist in both data sets.

## Genres

Let's start by writing a frequency table function that we can use for different columns in both data sets. We'll first use it to find the most common genres.

In [15]:
def freq_table(dataset, index):
    table_dict = {}
    total = 0
    for row in dataset:
        total+=1
        value = row[index]
        if value in table_dict:
            table_dict[value]+=1
        else:
            table_dict[value] = 1
    table_percent={}
    for key in table_dict:
        table_percent[key] = (table_dict[key]/total)*100
    return table_percent

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_value_tuple = (table[key], key)
        table_display.append(key_value_tuple)
        
    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
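
The same percentage table can be built more compactly with `collections.Counter` from the standard library. This is an alternative sketch, not the notebook's implementation. The function name `freq_table_counter` and the sample rows are made up for illustration.

```python
from collections import Counter

def freq_table_counter(rows, index):
    """Percentage of rows per distinct value in the given column."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}

# Made-up rows: (app name, genre)
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Weather'], ['d', 'Games']]
table = freq_table_counter(sample, 1)

# Sort by percentage, largest first, as display_table does.
for genre, pct in sorted(table.items(), key=lambda kv: kv[1], reverse=True):
    print(genre, ':', pct)
```

Sorting with a `key` function avoids having to build (value, key) tuples just to sort by the percentage.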


First we'll call the function to build a frequency table for the prime_genre column of the Apple Store data set.

In [16]:
display_table(apps_final, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


We can see that 'for fun' apps like games, entertainment, photo & video, and social networking make up more than 70% of the free English apps in the App Store, while more practical apps make up a much smaller share.

Based on the App Store data set alone we would be inclined to build a 'for fun' app. However, the fact that these apps make up the majority of the store does not mean they attract the most users.

Let's look at the Category column for the Google Play Store apps.

In [17]:
display_table(goog_final, 1)

FAMILY : 18.788083953960733
GAME : 9.636650868878357
TOOLS : 8.440532611148726
BUSINESS : 4.581358609794629
LIFESTYLE : 3.9043105393816293
PRODUCTIVITY : 3.8930264048747465
FINANCE : 3.7011961182577298
MEDICAL : 3.5319341006544795
SPORTS : 3.419092755585647
PERSONALIZATION : 3.3175355450236967
COMMUNICATION : 3.2498307379823967
HEALTH_AND_FITNESS : 3.0692845858722633
PHOTOGRAPHY : 2.945159106296547
NEWS_AND_MAGAZINES : 2.798465357707064
SOCIAL : 2.663055743624464
TRAVEL_AND_LOCAL : 2.335815842924848
SHOPPING : 2.2455427668697814
BOOKS_AND_REFERENCE : 2.143985556307831
DATING : 1.8618821936357481
VIDEO_PLAYERS : 1.782893252087565
MAPS_AND_NAVIGATION : 1.399232678853532
EDUCATION : 1.2525389302640486
FOOD_AND_DRINK : 1.2412547957571656
ENTERTAINMENT : 1.0381403746332656
LIBRARIES_AND_DEMO : 0.9365831640713158
AUTO_AND_VEHICLES : 0.9252990295644324
HOUSE_AND_HOME : 0.8350259535093659
WEATHER : 0.8011735499887158
EVENTS : 0.7109004739336493
ART_AND_DESIGN : 0.6770480704129994
PARENTING : 0

Based on the Play Store frequency table for categories, the landscape is much less dominated by a single type of app than in the Apple Store. There are far fewer apps designed to be 'for fun', and a large number are designed for practical purposes (tools, business, lifestyle, productivity, finance, etc.). After further investigation we learn that the family category mostly means games for kids.

There is also a Genres column in addition to the Category column shown above.

In [18]:
display_table(goog_final, -4)

Tools : 8.429248476641842
Entertainment : 6.070864364703228
Education : 5.348679756262695
Business : 4.581358609794629
Productivity : 3.8930264048747465
Lifestyle : 3.8930264048747465
Finance : 3.7011961182577298
Medical : 3.5319341006544795
Sports : 3.4642292936131795
Personalization : 3.3175355450236967
Communication : 3.2498307379823967
Action : 3.1031369893929135
Health & Fitness : 3.0692845858722633
Photography : 2.945159106296547
News & Magazines : 2.798465357707064
Social : 2.663055743624464
Travel & Local : 2.324531708417964
Shopping : 2.2455427668697814
Books & Reference : 2.143985556307831
Simulation : 2.0424283457458814
Dating : 1.8618821936357481
Arcade : 1.8618821936357481
Video Players & Editors : 1.782893252087565
Casual : 1.7490408485669149
Maps & Navigation : 1.399232678853532
Food & Drink : 1.2412547957571656
Puzzle : 1.128413450688332
Racing : 0.9930038366057323
Role Playing : 0.9365831640713158
Libraries & Demo : 0.9365831640713158
Auto & Vehicles : 0.92529902956443

The difference isn't entirely clear, but Genres appears to be a more fine-grained version of the Category column. For now let's focus on the bigger picture, so we'll use the Category column instead of the Genres column.

So far we've observed that 'for fun' apps dominate the App Store, while the Google Play Store has a much more balanced selection of practical and 'for fun' apps. Now let's try to find which types of apps have the most users.

## Most Popular Apps By Genre On The App Store

Finding the most popular apps on the Play Store should be easy because its data has an Installs column, but the App Store data has no such column. I'll use the total rating count column (rating_count_tot) as a proxy instead and calculate the average for each genre.


In [19]:
genre_apps = freq_table(apps_final, -5)

for genre in genre_apps:
    total = 0
    len_genre = 0
    for apps in apps_final:
        genre_app = apps[-5]
        if genre_app == genre:
            ratings = float(apps[6])
            total+=ratings
            len_genre+=1
    average_ratings = total/len_genre
    print(genre, ':', average_ratings)


Productivity : 21028.410714285714
Weather : 52279.892857142855
Shopping : 26919.690476190477
Reference : 74942.11111111111
Finance : 31467.944444444445
Music : 57326.530303030304
Utilities : 18684.456790123455
Travel : 28243.8
Social Networking : 71548.34905660378
Sports : 23008.898550724636
Health & Fitness : 23298.015384615384
Games : 22788.6696905016
Food & Drink : 33333.92307692308
News : 21248.023255813954
Book : 39758.5
Photo & Video : 28441.54375
Entertainment : 14029.830708661417
Business : 7491.117647058823
Lifestyle : 16485.764705882353
Education : 7003.983050847458
Navigation : 86090.33333333333
Medical : 612.0
Catalogs : 4004.0
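The nested loop above rescans the whole data set once per genre. As a sketch of an equivalent single-pass version (the helper name `average_by_group` is hypothetical), running totals and counts can be kept per genre in `defaultdict`s. The three sample rows reuse rating counts printed elsewhere in this notebook, with the columns reduced to name, genre and rating count.

```python
from collections import defaultdict

def average_by_group(rows, group_idx, value_idx):
    """Average of a numeric column per distinct value of a grouping column."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_idx]] += float(row[value_idx])
        counts[row[group_idx]] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Reduced sample rows: (name, genre, total rating count)
sample = [
    ['Waze', 'Navigation', '345046'],
    ['Google Maps', 'Navigation', '154911'],
    ['Kindle', 'Book', '252076'],
]

averages = average_by_group(sample, 1, 2)
for genre, avg in averages.items():
    print(genre, ':', avg)
```

On the real data the call would be `average_by_group(apps_final, -5, 6)`, assuming the column layout shown in the header above.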


Looking at the data above, one might think that Navigation is the best option because of its large average number of ratings. Let's look at the individual apps in that genre before drawing that conclusion.

In [20]:
for apps in apps_final:
    if apps[-5] == 'Navigation':
        print(apps[2],':', apps[6])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Geocaching® : 12811
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
CoPilot GPS – Car Navigation & Offline Maps : 3582
Google Maps - Navigation & Transit : 154911


We see here that Waze and Google Maps account for most of the ratings among navigation apps.

The same pattern appears for Social Networking and Music, where a few apps dominate the store and generate most of the ratings for their genre.
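One way to make this skew visible is to compare the mean with the median, which is much less sensitive to a few very large values. The sketch below uses the six Navigation rating counts printed above and the standard library `statistics` module.

```python
from statistics import mean, median

# Rating counts of the six free English Navigation apps listed above.
navigation_ratings = [345046, 12811, 187, 5, 3582, 154911]

nav_mean = mean(navigation_ratings)
nav_median = median(navigation_ratings)

# Share of all Navigation ratings that belong to Waze and Google Maps.
top_two_share = (345046 + 154911) / sum(navigation_ratings)

print('Mean:', nav_mean)
print('Median:', nav_median)
print('Share of top two apps:', round(top_two_share * 100, 1), '%')
```

The mean is about 86,090 while the median is 8,196.5, and the two largest apps hold roughly 97% of the ratings, so the genre average says little about a typical navigation app.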



In [21]:
for apps in apps_final:
    if apps[-5] == 'Reference':
        print(apps[2],':', apps[6])
print('\n\n')
for apps in apps_final:
    if apps[-5] == 'Book':
        print(apps[2],':', apps[6])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
Merriam-Webster Dictionary : 16849
Google Translate : 26786
Night Sky : 12122
WWDC : 762
Jishokun-Japanese English Dictionary & Translator : 0
教えて!goo : 0
VPN Express : 14
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Real Bike Traffic Rider Virtual Reality Glasses : 8



Kindle – Read eBooks, Magazines & Textbooks : 252076
OverDrive – Library eBooks and Audiobooks : 65450
Audible – audio 

While a Bible app and a dictionary app account for a large share of the Reference genre's ratings, several other apps also have large rating counts. Book apps also seem to be somewhat popular. Both genres look like a good target for our in-app ad strategy, as users of these kinds of apps tend to stay in the app for longer stretches of time.

One idea is a book-reading app: we could take a popular book and add features that make the app attractive to a wide range of users. Features could include translation into other languages and an embedded dictionary, so that users don't need to switch to other apps and spend more time in ours.

While 'for fun' apps might seem enticing, our data suggests that this part of the App Store market is very saturated, which means we might do better with a more practical app.

Other genres that seem popular include weather, book, food and drink, and finance. The book genre overlaps a bit with the app idea described above, but the other genres don't seem too interesting to us:

* Weather apps: people generally don't spend much time in-app, so the chances of making a profit from in-app ads are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.

* Food and drink: examples here include Starbucks, Dunkin' Donuts, McDonald's, etc. Making a popular food and drink app requires actual cooking and a delivery service, which is outside the scope of our company.

* Finance apps: these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.


## Most Popular Apps By Genre On The Google Play Store



In [22]:
display_table(goog_final, 5)

1,000,000+ : 15.752651771609116
100,000+ : 11.554953735048521
10,000,000+ : 10.505529225908374
10,000+ : 10.223425863236288
1,000+ : 8.406680207628076
100+ : 6.917174452719477
5,000,000+ : 6.8156172421575265
500,000+ : 5.574362446400361
50,000+ : 4.773188896411646
5,000+ : 4.513653802753328
10+ : 3.5432182351613632
500+ : 3.2498307379823967
50,000,000+ : 2.279395170390431
100,000,000+ : 2.1214172872940646
50+ : 1.9183028661701647
5+ : 0.7898894154818324
1+ : 0.5077860528097494
500,000,000+ : 0.2708192281651997
1,000,000,000+ : 0.22568269013766643
0+ : 0.045136538027533285
0 : 0.011284134506883321


Here we run into a small problem with the Installs column in the Play Store data: the values are not exact counts but open-ended brackets (100+, 1,000+, and so on). For now we will take the numbers at face value and treat 1,000,000+ as 1,000,000, 5,000+ as 5,000, etc.
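The conversion just described can be sketched as a small helper. The name `parse_installs` is hypothetical, and the inputs are bracket labels taken from the frequency table above.

```python
def parse_installs(installs):
    """Turn a bracket label such as '1,000,000+' into the float 1000000.0."""
    return float(installs.replace(',', '').replace('+', ''))

for label in ['1,000,000+', '5,000+', '0+', '0']:
    print(label, '->', parse_installs(label))
```

The result is a lower bound: an app in the 1,000,000+ bracket has at least one million installs, so averages built on these values understate the true numbers.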

In [23]:
categories_goog = freq_table(goog_final, 1)

for category in categories_goog:
    total = 0
    len_category= 0
    for apps in goog_final:
        category_app = apps[1]
        if category_app == category:
            installs = apps[5]
            installs = installs.replace(',','')
            installs = installs.replace('+','')
            total+=float(installs)
            len_category+=1
    average_installs = total/len_category  # these are install counts, not ratings
    print(category, ':', average_installs)

ART_AND_DESIGN : 1905351.6666666667
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1704192.3399014778
COMICS : 817657.2727272727
COMMUNICATION : 38326063.197916664
DATING : 854028.8303030303
EDUCATION : 3057207.207207207
ENTERTAINMENT : 19428913.04347826
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4167457.3602941176
HOUSE_AND_HOME : 1313681.9054054054
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 13006872.892271662
FAMILY : 4371709.123123123
MEDICAL : 107167.23322683707
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17805627.643678162
SPORTS : 4274688.722772277
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10695245.286096256
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16772838.591304347
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24790074.17721519
NEWS_AND_MA

Once again a few categories have their install average skewed by a handful of very popular apps. One example is the communication category, which includes apps such as WhatsApp, Facebook Messenger, Google Chrome, and Gmail, each with over 1 billion installs.

Let's filter the list so that extremely popular apps (those with 500 million installs or more) are ignored and do not skew our averages.


In [24]:
categories_goog = freq_table(goog_final, 1)

for category in categories_goog:
    total = 0
    len_category= 0
    for apps in goog_final:
        category_app = apps[1]
        if category_app == category:
            installs = apps[5]
            installs = installs.replace(',','')
            installs = installs.replace('+','')
            if float(installs) < 500000000.00:
                total+=float(installs)
                len_category+=1
    average_installs = total/len_category  # these are install counts, not ratings
    print(category, ':', average_installs)

ART_AND_DESIGN : 1905351.6666666667
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 3523197.1428571427
BUSINESS : 1704192.3399014778
COMICS : 817657.2727272727
COMMUNICATION : 9162116.249097472
DATING : 854028.8303030303
EDUCATION : 3057207.207207207
ENTERTAINMENT : 8653406.593406593
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 2337816.9815498153
HOUSE_AND_HOME : 1313681.9054054054
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 11276842.07746479
FAMILY : 3477073.219013237
MEDICAL : 107167.23322683707
SOCIAL : 6440960.614718615
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 14027956.98076923
SPORTS : 4274688.722772277
TRAVEL_AND_LOCAL : 4364410.175609756
TOOLS : 6064748.6172506735
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 8195968.570588236
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 9140849.806451613
NEWS_AND_MAGAZIN
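
The filtered and unfiltered averages differ only in the cap, so both can come from one function with an optional `cap` argument. The sketch below assumes a hypothetical helper name `average_installs` and uses made-up bracket labels. It also returns None instead of dividing by zero when every app in a group is above the cap.

```python
def average_installs(install_labels, cap=None):
    """Average install count, optionally ignoring values at or above cap."""
    values = [float(s.replace(',', '').replace('+', '')) for s in install_labels]
    if cap is not None:
        values = [v for v in values if v < cap]
    # Guard against an empty group after filtering.
    return sum(values) / len(values) if values else None

# Made-up sample: one giant app and three ordinary ones.
sample = ['1,000,000,000+', '10,000,000+', '500,000+', '100,000+']

unfiltered = average_installs(sample)
filtered = average_installs(sample, cap=500_000_000)

print('All apps:', unfiltered)
print('Below cap:', filtered)
```

In this sample a single app in the 1,000,000,000+ bracket lifts the average from about 3.5 million to over 250 million.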

Some categories that could be of interest to us include books and reference, and comics. Both create an environment where users stay in the app for extended periods of time, which would maximize our ad revenue.

For now let's look into the books and reference category, as it matches our Apple Store analysis from before, to see whether it's a valuable market for our company to enter.

In [28]:
# print books/reference and comics apps that have fewer than 500,000,000 downloads
for app in goog_final:
    in_category = app[1] == 'BOOKS_AND_REFERENCE' or app[1] == 'COMICS'
    too_popular = app[5] == '1,000,000,000+' or app[5] == '500,000,000+'
    if in_category and not too_popular:
        print(app[0], ':', app[5])

Wattpad 📖 Free Books : 100,000,000+
E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Amazon Kindle : 100,000,000+
Cool Reader : 10,000,000+
Dictionary - Merriam-Webster : 10,000,000+
NOOK: Read eBooks & Magazines : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Oxford Dictionary of English : Free : 10,000,000+
Offline: English to Tagalog Dictionary : 500,000+
Spanish English Translator : 10,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
NOOK App for NOOK Devices : 500,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,00
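
A filter that mixes `or` and `and` needs care, because Python's `and` binds more tightly than `or`: `A or B and not C` is read as `A or (B and not C)`. The sketch below contrasts the two readings for this kind of category filter. Both function names are made up for illustration.

```python
TOO_POPULAR = ('1,000,000,000+', '500,000,000+')

def unparenthesized(category, installs):
    # Parsed as: BOOKS_AND_REFERENCE or (COMICS and not too popular)
    return category == 'BOOKS_AND_REFERENCE' or category == 'COMICS' and not (
        installs in TOO_POPULAR)

def parenthesized(category, installs):
    # The install cap applies to both categories.
    in_category = category in ('BOOKS_AND_REFERENCE', 'COMICS')
    return in_category and installs not in TOO_POPULAR

row = ('BOOKS_AND_REFERENCE', '1,000,000,000+')
print(unparenthesized(*row))
print(parenthesized(*row))
```

Without parentheses, a books and reference app in the 1,000,000,000+ bracket still passes the filter, because the install cap is only attached to the COMICS test.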

This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.

We also notice there are quite a few apps built around the Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets.

However, it looks like the market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

## Conclusions

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.
