# Profitable App Profile for the App Store and Google Play Markets

The goal of this project is to use readily available data to understand which kinds of apps are likely to succeed and turn a profit in the App Store and Google Play markets. The findings will allow developers to build apps that match the data-driven, expected needs of consumers.
The analysis covers only free applications aimed at an English-speaking audience. Since these apps earn money through in-app advertising, the results should help increase ad revenue.

## Data and Exploration

Samples of the data from both markets, as of September 2018, can be found here:
* https://dq-content.s3.amazonaws.com/350/googleplaystore.csv 
* https://dq-content.s3.amazonaws.com/350/AppleStore.csv

To begin the analysis, both files will be opened and inspected.

In [3]:
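from csv import reader

# Google Play data; utf8 is given explicitly because app names contain non-ASCII characters
opened_google = open('googleplaystore.csv', encoding='utf8')
read_google = reader(opened_google)
google_data = list(read_google)
google_header = google_data[0]    # first row holds the column names
google_dataset = google_data[1:]  # remaining rows are the apps

# App Store data
opened_apple = open('AppleStore.csv', encoding='utf8')
read_apple = reader(opened_apple)
apple_data = list(read_apple)
apple_header = apple_data[0]
apple_dataset = apple_data[1:]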
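
The loading steps above can also be wrapped in a small helper so that each file is closed once it has been read. This is only a sketch: the name `load_csv` and its `encoding` default are our own choices for illustration, not part of the original notebook.

```python
import csv

def load_csv(path, encoding="utf8"):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # The with-block closes the file automatically; newline="" is what the
    # csv module documentation recommends for files handed to csv.reader.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]
```

With this helper, a call such as `google_header, google_dataset = load_csv('googleplaystore.csv')` would stand in for the five lines used per file above.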

After opening the files, we can print the header and the first four entries of each dataset to get a feel for the data.

In [4]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

explore_data(google_data, 0, 5, True)
explore_data(apple_data, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_cou

Looking at the data in both files, we can see that there are roughly 18,000 apps between the two datasets. Entries for popular apps such as "Facebook" and "Instagram" appear among them. Columns such as "Genres", "Reviews", and "Category" are all candidates for analysis. Before analyzing anything, we must clean the datasets and remove any errors, as these could cause issues later on.

## Cleaning the Data

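Discussion threads about these datasets report that the Google data contains a faulty entry. Printing it shows only 12 values instead of 13: the Category value is missing, so every later value sits one column to the left of where it belongs. The cells below patch the row by filling in its empty genre field and inserting a value at index 3 to restore the column count. Note that the patched row still carries '1.9' in the Category position, so it remains partly misaligned.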

In [5]:
print(google_data[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [6]:
google_data[10473][8] = 'Lifestyle'

In [7]:
google_data[10473].insert(3,'43')

In [8]:
print(google_data[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '43', '3.0M', '1,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'February 11, 2018', '1.0.19', '4.0 and up']
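
A row like the one above can be found without knowing its index by comparing each row's length with the header's length. The sketch below uses toy rows rather than the real file; `find_malformed_rows`, `sample_header`, and `sample_rows` are hypothetical names introduced only for illustration.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]

# Toy data standing in for the real dataset (values are made up).
sample_header = ["App", "Category", "Rating"]
sample_rows = [
    ["App A", "TOOLS", "4.1"],
    ["App B", "1.9"],          # Category missing, so the row is one value short
    ["App C", "GAME", "4.5"],
]

bad = find_malformed_rows(sample_header, sample_rows)
print(bad)
```

Applied to the notebook's data, the call would presumably be `find_malformed_rows(google_header, google_dataset)`, run before any manual patching.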


In [9]:
for app in google_dataset:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Another problem with the Google dataset is that it contains duplicate entries, as the four Instagram rows above show. We will remove these duplicates, but not at random. The fourth column holds the number of reviews, and a higher count suggests a more recent record, so for each app we keep only the entry with the highest number of reviews.

In [10]:
duplicated_apps = []
unique_apps = []

for app in google_dataset:
    name = app[0]
    if name in unique_apps:
        duplicated_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicated apps:', len(duplicated_apps))
print('\n')
print('Examples of duplicates:', duplicated_apps[:10])

Number of duplicated apps: 1181


Examples of duplicates: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


In [11]:
reviews_max = {}

for app in google_dataset:
    name = app[0]
    num_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < num_reviews:
        reviews_max[name] = num_reviews
    elif name not in reviews_max:
        reviews_max[name] = num_reviews
print(len(reviews_max))

9660


The code below will remove any applications that occur more than once and leave behind the version of the app with the highest number of reviews in our cleaned dataset.

In [12]:
clean_google_dataset = []
already_added = []

for app in google_dataset:
    name = app[0]
    num_reviews = float(app[3])
    if num_reviews == reviews_max[name] and name not in already_added:
        clean_google_dataset.append(app)
        already_added.append(name)

In [13]:
explore_data(clean_google_dataset, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9660
Number of columns: 13
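
The same "keep the entry with the most reviews" rule can also be written as a single pass over the data using a dictionary keyed by app name. This is a sketch of an alternative, not the method used above; `keep_most_reviewed` and `sample_rows` are hypothetical names, and the sample rows are the Instagram entries shown earlier, shortened to four columns.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict ">" means ties keep the row that was seen first.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample_rows = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Instagram", "SOCIAL", "4.5", "66509917"],
]

deduplicated = keep_most_reviewed(sample_rows)
print(deduplicated)
```

One difference to be aware of: the result is ordered by each name's first appearance, whereas the loop above orders rows by where the retained entry sits in the dataset.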


There are no duplicates in the App Store dataset, so we do not need to apply the same code to it. Instead, we move on to removing foreign-language apps from both datasets, as we are interested in an English-speaking audience.


In [14]:
def in_english(string):
    for char in string:
        if ord(char) > 127:
            return False
    return True

In [15]:
print(in_english('Instagram'))
print(in_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))


True
False


In [16]:
print(in_english('Docs To Go™ Free Office Suite'))
print(in_english('Instachat 😜'))

False
False
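
To see why the last two names are rejected, it helps to look at the code points that `ord` returns for individual characters. The snippet below is an illustrative sketch; `count_non_ascii` is a hypothetical helper that is not part of the notebook.

```python
# Standard ASCII covers code points 0 through 127; anything above is "non-ASCII".
for ch in ["a", "™", "😜", "爱"]:
    print(repr(ch), ord(ch), ord(ch) > 127)

def count_non_ascii(text):
    """Count the characters in `text` whose code point is above 127."""
    return sum(1 for ch in text if ord(ch) > 127)

print(count_non_ascii("Docs To Go™ Free Office Suite"))     # a single trademark sign
print(count_non_ascii("Instachat 😜"))                      # a single emoji
print(count_non_ascii("爱奇艺PPS -《欢乐颂2》电视剧热播"))   # many non-ASCII characters
```

A single symbol is enough to make the strict check fail, even though the rest of the name is plain English.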


The in_english function also rejects English app names that contain special characters such as ™ or emoji. To deal with this, we will tolerate up to three non-ASCII characters and reject a name only when it contains more than three.

In [17]:
def in_english(string):
    non_ascii = 0
    for char in string:
        if ord(char) > 127:
            non_ascii += 1
    if non_ascii > 3:
        return False
    else:
        return True

In [18]:
print(in_english('Docs To Go™ Free Office Suite'))
print(in_english('Instachat 😜'))

True
True


In [19]:
english_google_dataset = []
english_apple_dataset = []
for app in clean_google_dataset:
    name = app[0]
    if in_english(name):
        english_google_dataset.append(app)
for app in apple_dataset:
    name = app[1]
    if in_english(name):
        english_apple_dataset.append(app)

explore_data(english_google_dataset, 0, 3, True)
print('\n')
explore_data(english_apple_dataset, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9615
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

Now that the datasets have been cleaned, roughly 15,000 entries remain. From these, we must separate the free apps from the paid ones, as free apps are the target of our analysis.

In [20]:
free_google_appdata = []
free_apple_appdata = []
for app in english_google_dataset:
    cost = app[6]
    if cost == 'Free':
        free_google_appdata.append(app)
for app in english_apple_dataset :
    cost = app[4]
    if cost == '0.0':
        free_apple_appdata.append(app)

explore_data(free_google_appdata, 0, 3, True)
print('\n')
explore_data(free_apple_appdata, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

## Analysis

From this point on, we will build frequency tables to find what percentage of apps takes each value in any column we are interested in.

For example, if we look at the Category column, we can find out exactly what percentage of our apps are in the "Games" category. If "Games" has a result of 50, then half of the apps in our dataset belong to the "Games" category.

To minimize risk and the costs of the process we will follow this procedure:

1. Build a base version of the app, and add it to Google Play.
2. If the app is reviewed frequently, we take note of feedback and further improve the app.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

We'll begin the analysis by gaining a sense of what the Google Play store is made up of with the use of our frequency tables.

In [21]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
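
As a quick sanity check of the idea, the same percentages can be computed on a toy dataset where the answer is obvious. This is a sketch using `collections.Counter` from the standard library rather than the functions above; `percentage_table` and `toy_rows` are hypothetical names, and the helper assumes a non-empty dataset.

```python
from collections import Counter

def percentage_table(rows, index):
    """Map each value in column `index` to its share of the rows, in percent."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}

# Four made-up apps: three in "Games", one in "Music".
toy_rows = [["a", "Games"], ["b", "Games"], ["c", "Music"], ["d", "Games"]]

table = percentage_table(toy_rows, 1)
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ":", pct)
```

Three of the four toy apps are games, so "Games" comes out at 75.0 and "Music" at 25.0.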

In [22]:
display_table(free_apple_appdata, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


This table shows the distribution of the prime_genre column, which holds each application's primary genre. "Games" is the clear majority, and "Entertainment" is the second most common genre. Together they show that the market is densely populated with entertainment-oriented apps. This makes sense: the longer an individual spends on their device, the more opportunity there is for ad revenue.

It is worth noting that productivity-oriented applications make up roughly 10% of the market. These results, however, show neither the number of users per genre nor the demand for apps in each genre.

In [23]:
display_table(free_google_appdata, 1)

FAMILY : 18.896660649819495
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

Upon initial inspection, FAMILY is the leading category on the Google Play store. However, a look at the family section quickly reveals that the category consists mostly of games aimed at young children. Like the App Store, the Google Play store is therefore also dominated by games. In this case, however, the share of practical apps (tools, business, lifestyle, productivity, etc.) is considerably higher than in the App Store.

In [24]:
display_table(free_google_appdata, -4)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Lifestyle : 3.9034296028880866
Productivity : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.925090252707

Although the difference between the Genres and Category columns isn't particularly clear, Genres evidently offers a more fine-grained classification, which helps us distinguish between applications.

## Most Popular Apps by Genre on the Apple Store

In [25]:
genres_apple = freq_table(free_apple_appdata, -5)

for genres in genres_apple:
    total = 0
    len_genre = 0
    for app in free_apple_appdata:
        genre_app = app[-5]
        if genre_app == genres:
            num_ratings = float(app[5])
            total += num_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genres, ':', avg_n_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0
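
The averages above come from a nested loop that rescans the whole dataset once per genre. As a sketch of an alternative, the same per-genre averages can be accumulated in a single pass with two dictionaries. `average_by_key` and `toy_apps` are hypothetical names, and the toy rows are made up for illustration.

```python
from collections import defaultdict

def average_by_key(rows, key_index, value_index):
    """Average the numeric column `value_index` within each group of `key_index`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = row[key_index]
        totals[key] += float(row[value_index])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Made-up rows: name, number of ratings, genre.
toy_apps = [
    ["A", "10", "Games"],
    ["B", "30", "Games"],
    ["C", "5", "Music"],
]

averages = average_by_key(toy_apps, key_index=2, value_index=1)
print(averages)
```

For the App Store data, the equivalent call would presumably be `average_by_key(free_apple_appdata, -5, 5)`.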


In [26]:
for app in free_apple_appdata:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


On average, navigation, reference, social networking, and music apps have the most user ratings in the App Store, but each of these genres is led by a handful of giants.
The listing above shows that Waze and Google Maps account for nearly all navigation ratings. They pull the average upward and make this app profile look more popular than it actually is.
In the case of social networking, Facebook and Instagram heavily influence the average. For music, Spotify and Shazam do the same, and for reference apps the Bible and the dictionary (essentials to some) also skew the average heavily.
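
One way to quantify this skew is to compare the mean with the median, which is far less sensitive to a few very large values. The sketch below uses the six Navigation rating counts printed above; the variable names are our own, and using the median is an added illustration rather than part of the original analysis.

```python
from statistics import mean, median

# Rating counts of the six free Navigation apps listed above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

nav_mean = mean(navigation_ratings)      # matches the Navigation average shown earlier
nav_median = median(navigation_ratings)  # midpoint of the two middle values

print("mean:", nav_mean)
print("median:", nav_median)
```

The mean is about 86,090, while the median is 8,196.5, so the typical navigation app has roughly a tenth of the ratings that the average suggests.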

In [27]:
for app in free_apple_appdata:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])  

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


Game-related content also shows up in other genres, so games seem to be an integral part of the App Store. The listing above shows that within Reference there is a following for Minecraft-related applications. The Minecraft (an online multiplayer building game) community is very large, and catering to such a dedicated niche following may lead to great success.

Regardless, the data suggests that the Reference genre shows quite a bit of potential, as it has the second-highest average number of ratings. Given that the book, music, and entertainment genres are also popular, there may be an opportunity in adapting a novel into an application with supporting music and an embedded dictionary.

Other genres that seem popular include weather, food and drink, and finance. These are of less interest to us for the following reasons:

* Weather apps: Individuals spend very little time in these apps, so the resulting ad revenue may be very low. In addition, a weather app may require costly APIs.

* Food and drink: Although this category shows potential through food delivery and cooking instruction apps, it requires a fair bit of expertise in another field.

* Finance apps: Like food and drink, this category requires domain knowledge, which may make producing the application difficult.

Nonetheless, we will refine our understanding with an analysis of the Google Play market.

## Most Popular Apps by Genre on the Google Play Store

To determine which kinds of apps are most popular on the Google Play store, we will compute the average number of installs for each category in the dataset. One issue is that the Installs column does not hold exact numbers: its values are open-ended strings such as 1,000+ or 10,000+. To do any arithmetic on them, we must remove the commas and the plus sign and convert each value from a str to a float.
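
The conversion described above can be isolated in a small helper, sketched here under the assumption that every value consists only of digits, commas, and an optional trailing plus sign. The name `parse_installs` is hypothetical and does not appear in the notebook.

```python
def parse_installs(text):
    """Turn an Installs string such as '1,000,000+' into a float."""
    # Strip the thousands separators and the trailing plus sign, then convert.
    return float(text.replace(",", "").replace("+", ""))

print(parse_installs("1,000+"))
print(parse_installs("1,000,000,000+"))
```

Keep in mind that the result is a lower bound: an app listed as 1,000+ is treated as having exactly 1,000 installs.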

In [28]:
category_google = freq_table(free_google_appdata, 1)

for category in category_google:
    total = 0
    len_category = 0
    for app in free_google_appdata:
        category_app = app[1]
        if category_app == category:            
            num_installs = app[5]
            num_installs = num_installs.replace(',', '')
            num_installs = num_installs.replace('+', '')
            total += float(num_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3697848.1731343283
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

On average, communication apps have the most installs: 38,456,119. This number is heavily skewed upward by a few apps with over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts).

As with the App Store data, a few very large apps distort the average. To see how much, we recompute the average after removing all communication apps with 100 million or more installs.

In [29]:
under_100_m = []

for app in free_google_appdata:
    num_installs = app[5]
    num_installs = num_installs.replace(',', '')
    num_installs = num_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(num_installs) < 100000000):
        under_100_m.append(float(num_installs))
        
sum(under_100_m) / len(under_100_m)

3603485.3884615386
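
The same check can be repeated for other categories by turning the cell above into a function. This is a sketch with made-up rows; `average_installs_below` and `toy_google_rows` are hypothetical names, and the column positions default to those used in the Google dataset (Category at index 1, Installs at index 5).

```python
def average_installs_below(rows, category, cap, category_index=1, installs_index=5):
    """Average installs of `category` apps whose install count is below `cap`."""
    values = []
    for row in rows:
        installs = float(row[installs_index].replace(",", "").replace("+", ""))
        if row[category_index] == category and installs < cap:
            values.append(installs)
    # Return None instead of dividing by zero when no app qualifies.
    return sum(values) / len(values) if values else None

# Made-up rows with the same column layout as the Google dataset.
toy_google_rows = [
    ["Giant Chat", "COMMUNICATION", "4.4", "1", "1M", "1,000,000,000+"],
    ["Small Chat", "COMMUNICATION", "4.1", "1", "1M", "1,000+"],
    ["Tiny Chat", "COMMUNICATION", "3.9", "1", "1M", "3,000+"],
    ["Some Game", "GAME", "4.5", "1", "1M", "500,000+"],
]

capped_average = average_installs_below(toy_google_rows, "COMMUNICATION", 100_000_000)
print(capped_average)
```

Lowering or raising `cap` makes it easy to see how sensitive a category's average is to its largest apps.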

In [30]:
for app in free_google_appdata:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

There are still some very popular applications that skew the average, so let's take a look at those.

In [31]:
for app in free_google_appdata:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


It seems that the niche isn't overly populated by popular apps, showing that the market still has potential.

In [32]:
for app in free_google_appdata:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

This niche seems to be made up of ebook reading software along with various libraries and dictionaries. Copying either kind of app outright may not be a good idea, but incorporating elements of both could be a profitable model.

We also notice a large number of apps built around religious texts. An app that translates or explores details in these texts could be highly profitable, although profiting from religion may be ethically questionable.

The market is already saturated with resources such as libraries and dictionaries, so in order to be successful we must improve on user experience and practicality.

## Conclusions


To wrap up our analysis of the App Store and Google Play markets, we concluded that taking a popular book and turning it into an app could be profitable in both. Given that the market is saturated with ebooks, our application must offer more than the text itself in order to succeed. Possible improvements include an embedded dictionary for complex words, an integrated reading soundtrack, an audio-visual version of the book, and a connection to social media networks where readers can discuss the book.