# Profitable App Profiles for the App Store and Google Play Markets

The aim of this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps. Our job is to help our developers make data-driven decisions about the kinds of apps they build.

At our company, we only build apps that are free to download and install, so our main source of revenue is in-app ads. This means that revenue is heavily influenced by the number of users of our apps. Our goal with this project is therefore to analyze data and help our developers understand what kinds of apps are likely to attract more users.

## Opening and Exploring the Data

As of September 2018, there were approximately 2 million apps available on each of the App Store and Google Play, or roughly 4 million in total.

Collecting data on some 4 million apps would require significant resources, so we will analyze a sample instead. To avoid spending resources on collecting data ourselves, we found two datasets suitable for our purpose:
* A dataset [here](https://www.kaggle.com/lava18/google-play-store-apps) on Google Play Apps
* A dataset [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) on App Store Apps

We obtained both of these datasets from Kaggle, a reputable online data science community.

Let's start by opening the two datasets and then continue with exploring the data.

In [1]:
from csv import reader

#Google Play Dataset
opened_file = open('googleplaystore.csv', encoding="utf8")
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

#App Store Dataset
opened_file = open('AppleStore.csv', encoding="utf8")
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
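
The two loading steps follow the same pattern, so they could be wrapped in a small helper that also closes each file once it has been read. The sketch below is one possible way to do that. `load_dataset` is a hypothetical name that does not appear in the original notebook, and the helper assumes the first row of each CSV file is the header.

```python
from csv import reader


def load_dataset(path, encoding="utf8"):
    """Return (header, rows) for a CSV file whose first row is the header."""
    # The with-block closes the file even if reading fails part-way through.
    # newline="" is what the csv module documentation recommends for file objects.
    with open(path, encoding=encoding, newline="") as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]
```

With this helper, a call such as `android_header, android = load_dataset('googleplaystore.csv')` should produce the same two variables as the cell above.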

To make it easier to explore the two datasets, we will use a function named explore_data(). We can call it repeatedly to print rows in an easier-to-read format. It also has an option to show the number of rows and columns in a dataset.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
#Exploring Google Play

print(android_header)
print('\n')
explore_data(android, 0, 3, rows_and_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


We see that the Google Play dataset has 13 columns and 10,841 rows. At a quick peek, we can assume the 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres' columns will be useful in our analysis.

Now let's take a look at the App Store dataset.

In [4]:
# Exploring App Store

print(ios_header)
print('\n')
explore_data(ios, 0, 3, rows_and_columns=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


We see that the App Store dataset has 16 columns and 7,197 rows. It looks like the 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre' columns will be useful to us. Not all of the column names in this dataset are self-explanatory, but the column details are available in the [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

## Data Cleansing

Data rarely comes to us ready to analyze. First, we have to "clean" it. This generally includes detecting inaccurate data and correcting or removing it as well as detecting duplicates. This is the next step of our process.

One of the great things about Kaggle is that it has a discussion section for each dataset. It is always a good idea to go through the discussion and see whether anyone has found errors. Doing so, we found that this [post](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) outlines an error in row 10472. Let's confirm this.

### Deleting Wrong Data

In [5]:
# Print Column Names
print(android_header)

#Print Known Correct Row
print(android[1])

#Print Errored Row
print(android[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


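Row 10472 corresponds to the app *Life Made WI-Fi Touchscreen Photo Frame*. Compared with the correct row, it has only 12 values instead of 13: the 'Category' value is missing, so every later value is shifted one column to the left and the 'Rating' column holds 19. Google Play ratings have a maximum of 5, so this is clearly an error. It would cause our calculations to be incorrect, so we will delete this row.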
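
A shifted row like this one has fewer values than the header, so one way to look for similar problems is to compare each row's length with the header's length. The following is a minimal sketch of that check; `find_malformed_rows` is a hypothetical helper, and the sample rows are made up for illustration rather than taken from the real file.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Tiny illustrative data, not taken from the real file
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Good App', 'TOOLS', '4.2'],
    ['Shifted App', '19'],  # missing Category, so the values shift left
]
print(find_malformed_rows(sample_rows, sample_header))
```

Running `find_malformed_rows(android, android_header)` on the full dataset would list every row with a column count that does not match the header, which is a quick sanity check before deleting anything by index.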

In [6]:
#Don't run this more than once
del android[10472]

#Check
print(android[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


### Removing Duplicates

#### Part One

In our exploration of the Google Play dataset we found that some apps have duplicate entries. For example, Instagram has four entries.

In [7]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Let's look for more duplicate apps.

In [8]:
duplicate_apps = []
unique_apps = []

for app in android:
    name=app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Duplicates: ', len(duplicate_apps), '\n')
print('Duplicate examples: ', duplicate_apps[:15])

Duplicates:  1181 

Duplicate examples:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


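Above we:
* Created two empty lists
    * One to store the names of duplicate apps
    * One to store the names of unique apps
* Looped through the android dataset and, for each app:
    * Appended the name to duplicate_apps if it was already in unique_apps
    * Appended the name to unique_apps otherwise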

We don't want to count apps more than once when we analyze the data, so we need to remove the duplicates. Above we looked at the 'Instagram' rows, and the only real difference appears in the fourth position of each row, which corresponds to the number of reviews. The different numbers suggest the data was collected at different times. We can use this to build a criterion for keeping rows: the row with the highest number of reviews is most likely the most recent, so we will keep that row and remove the others.

To do that we will:
* Create a dictionary where each key is a unique name and the value is the highest number of reviews of that app
* Use the dictionary to create a new dataset with no duplicates

#### Part Two

Let's build the dictionary.

In [9]:
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

Earlier we found that there are 1,181 duplicates in the dataset. Because the dictionary holds one entry per unique app, its length should equal the length of the dataset minus 1,181. Let's check this.

In [10]:
print('Expected length: ', len(android)-1181)
print('Actual length: ', len(reviews_max))

Expected length:  9659
Actual length:  9659


Now we will use the reviews_max dictionary to remove the duplicates. In the code cell below, we:
* Start by creating two empty lists, android_clean and already_added
* Loop through the android dataset and for every iteration:
    * Isolate the name of the app and the number of reviews
    * Add the current row to the android_clean list and the app name to the already_added list if:
        * The number of reviews of the current app matches the number of reviews in the reviews_max dictionary
        * The name of the app is not already in the already_added list
            * This guards against duplicate entries that have the same number of reviews

In [11]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

Now let's check the android_clean dataset to make sure it went as expected.

In [13]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


Great. The number of rows matched what we determined it should be. 
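
The dictionary step and the filtering step can also be folded into a single pass by storing the whole row, rather than only the review count, for each app name. The sketch below shows that variant under the same criterion (keep the row with the most reviews). `deduplicate` is a hypothetical name, and the default column indices assume the Google Play layout shown earlier.

```python
def deduplicate(dataset, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earlier row when two rows tie on reviews
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())
```

Dictionary membership checks are also faster than the `name not in already_added` list lookup, which matters little at this size but helps on larger datasets. The rows come back ordered by the first appearance of each name, so their order may differ from android_clean even though the same rows should be kept.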

## Removing Non-English Apps

### Part One

Our company only makes apps directed toward English speakers, so we don't want to include non-English apps in our analysis. We will need to remove these.

We will remove apps whose names contain symbols not commonly used in English text. The characters commonly used in English text (the English letters, the digits 0-9, and common punctuation) all belong to the ASCII standard, in which each character has a corresponding number between 0 and 127. This is what will allow us to estimate whether an app name is in English or not.

We have built the function below, which checks an app name and tells us whether it contains non-ASCII characters. It uses the built-in ord() function, which returns the number (Unicode code point) corresponding to a character.

In [17]:
def is_english(string):
    
    for c in string:
        if ord(c) > 127:
            return False
        
    return True

In [18]:
#Testing
print(is_english('Instagram'))
print(is_english('爱奇艺PPS'))

True
False


The function seems to work fine, but some English app names use emojis and other symbols (such as ™) that fall outside the ASCII range. If we removed every app whose name contains such a character, we would lose valuable data. Below is an example.

In [20]:
print(is_english('Docs To Go™ Free Office Suite'))
print(ord('™'))

False
8482


### Part Two

To minimize data loss we will only remove an app if its name has more than three non-ASCII characters.

In [21]:
def is_english(string):
    non_ascii = 0
    
    for c in string:
        if ord(c) > 127:
            non_ascii += 1
        
    if non_ascii > 3:
        return False
    else:
        return True
        

In [22]:
print(is_english('Instachat 😜'))
print(is_english('Instagram'))
print(is_english('爱奇艺PPS'))
print(is_english('爱奇艺艺PPS'))

True
True
True
False
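
The thresholded check can also be written more compactly by counting the non-ASCII characters with a generator expression and exposing the threshold as a parameter. This is only a sketch: `count_non_ascii`, `is_probably_english`, and the `max_non_ascii` parameter are hypothetical names, and the default of 3 simply mirrors the cutoff chosen above.

```python
def count_non_ascii(text):
    """Count characters whose code point falls outside the ASCII range (0-127)."""
    return sum(1 for c in text if ord(c) > 127)


def is_probably_english(text, max_non_ascii=3):
    """Treat a name as English if it has at most max_non_ascii non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii


print(is_probably_english('Docs To Go™ Free Office Suite'))  # one non-ASCII character
print(is_probably_english('爱奇艺艺PPS'))  # four non-ASCII characters
```

Making the threshold a parameter keeps the cutoff visible at the call site and makes it easy to test how many apps a stricter or looser value would remove.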


This filter is not perfect, as a few non-English apps might get past it, but it will work well enough for our analysis.
Below we use the is_english() function to filter out the non-English apps from both datasets.

In [26]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)

In [27]:
explore_data(android_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


In [29]:
explore_data(ios_english, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16


We are left with 9,614 Android apps and 6,183 iOS Apps.

## Removing Paid Apps

Our company only builds apps that are free to download and install, and our main source of revenue is in-app ads. Our datasets contain both free and paid apps, so we will remove the paid apps from our analysis.

In [30]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)


for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)

print(len(android_final))
print(len(ios_final))

8864
3222


We are left with 8,864 Android apps and 3,222 iOS apps for our analysis.
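
The filter above relies on exact string matches ('0' for Google Play and '0.0' for the App Store). A slightly more defensive option is to convert the price to a number before comparing, so that variants such as '0.00' are treated the same way. The sketch below assumes, without having verified it against every row, that paid prices may carry a leading dollar sign; `is_free` is a hypothetical helper.

```python
def is_free(price_text):
    """Return True when a price string represents zero, e.g. '0', '0.0' or '$0.00'."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable prices are treated as not free rather than guessed at
        return False


print(is_free('0'), is_free('0.0'), is_free('$4.99'))
```

The same function could then be used for both datasets, for example `if is_free(app[7])` for Google Play and `if is_free(app[4])` for the App Store.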

## Most Common Apps by Genre
### Part One

As we mentioned earlier, our goal is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app and deploy it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after 6 months, we build an iOS version and deploy it to the App Store.

Because our end goal is to add an app to both the App Store and Google Play, we need to find app profiles that are successful in both markets.

Let's begin the analysis by looking at the most common genres in each market. To do this, we will build frequency tables for the prime_genre column of the App Store dataset and the Genres and Category columns of the Google Play dataset.

### Part Two

We will build two functions to analyze the frequency tables:
* One to generate frequency tables that show percentages
* One to display the percentages in descending order

In [31]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
        
    return table_percentages

In [32]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
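
For comparison, the standard library's `collections.Counter` can do the counting and the sorting in one step. The sketch below is an alternative to the two functions above, not a replacement the original analysis used. `freq_table_counter` is a hypothetical name and the sample rows are made up for illustration.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return (value, percentage) pairs sorted from most to least frequent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]


# Tiny illustrative rows, not taken from the real datasets
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for value, percentage in freq_table_counter(sample, 1):
    print(value, ':', percentage)
```

One difference to keep in mind: when two values have the same count, `most_common()` keeps them in the order they were first seen, whereas display_table() breaks ties by sorting on the key.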

### Part Three

Let's start with the frequency table for the prime_genre column of the App Store dataset.

In [33]:
display_table(ios_final, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


We can see that for free English apps, games make up more than half (58.16%). Entertainment apps are almost 8%, followed by photo and video apps which are close to 5%. Only 3.66% are education apps followed by social networking apps at 3.29%. After these genres the percentages are 2.6% and below. 

Our general impression is that the App Store (at least its free English apps) is dominated by apps designed for fun (games, entertainment, photo and video, social networking, music, etc.), while apps with practical purposes (education, shopping, productivity, lifestyle, etc.) are less common. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users.

Let's continue by examining the Genres and Category columns of the Google Play dataset.

In [36]:
#Category
display_table(android_final, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

We are seeing a big difference on Google Play. There are not that many "fun" apps, and a good number of the apps are practical (family, tools, business, lifestyle, etc.). However, if we investigate further, we can see that the family category (at nearly 19%) consists mostly of games for kids.

![title](googleplay.png)

However, even if we assumed all family apps are games and added the two percentages (games at 9.7% and family at 19%, for a total of 28.7%), practical apps would still be better represented on Google Play than on the App Store. This is confirmed by the frequency table for the Genres column.

In [35]:
#Genres
display_table(android_final, -4)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

The difference between the Genres and the Category columns is not completely clear, but we notice that the Genres column is much more granular than the Category column. For our analysis we are interested in the big picture, so moving forward we will work only with the Category column.

To recap, we found that the App Store is dominated by "fun" apps, while Google Play shows a more balanced mix of practical and fun apps. Now we will investigate what kinds of apps have the most users.

## Most Popular Apps by Genre on the App Store

One way to find out which genres have the most users is to calculate the average number of installs for each genre. The Google Play dataset has this information in the Installs column, but the App Store dataset does not. As a workaround, we will use the total number of user ratings, found in the rating_count_tot column, as a proxy.

Below, we calculate the average number of user ratings per genre on the App Store.

In [41]:
genres_ios =  freq_table(ios_final, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
            
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


On average, navigation apps have the highest number of user ratings. But if we dig deeper, we see that this figure is heavily influenced by Waze and Google Maps, which together have close to half a million user ratings.

In [42]:
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


The same pattern applies to social networking apps, where the average is heavily influenced by a few giants like Facebook, Pinterest, and Skype. It also applies to music apps, where big players like Pandora and Spotify heavily influence the average.

Our aim is to find popular genres, but navigation, social networking, and music apps might seem more popular than they really are. The average number of ratings appears to be skewed by a few apps that have hundreds of thousands of user ratings, while other apps are barely past the 10,000 mark. We could get a better picture by removing these extremely popular apps from each genre and recomputing the averages, but we will leave this level of detail for later.

Other genres that seem popular include weather, book, reference, food and drink, and finance. We are interested in looking further at the book and reference genres. The other genres don't quite interest us, for these reasons:
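
As a lighter-weight alternative to removing the largest apps by hand, the median is much less sensitive to a few extreme values than the mean. The sketch below swaps in the median for the average used above. It is not part of the original analysis, and `median_by_group` is a hypothetical helper. The sample values are the rating counts from the Navigation output above.

```python
from collections import defaultdict
from statistics import median


def median_by_group(dataset, group_index, value_index):
    """Return {group: median of the numeric column}, grouping rows in one pass."""
    values = defaultdict(list)
    for row in dataset:
        values[row[group_index]].append(float(row[value_index]))
    return {group: median(nums) for group, nums in values.items()}


# Rating counts copied from the Navigation output above
counts = ['345046', '154911', '12811', '3582', '187', '5']
rows = [['Navigation', count] for count in counts]
print(median_by_group(rows, 0, 1))  # {'Navigation': 8196.5}
```

For the six navigation apps the median is 8,196.5 ratings, compared with a mean of roughly 86,090, which illustrates how strongly Waze and Google Maps pull the average upward. On the full dataset the call would be `median_by_group(ios_final, -5, 5)`.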

* Weather apps - People generally don't spend much time in-app, and because our revenue comes from in-app ads, the chances of making a profit are low.
* Food and drink apps - Examples from this genre include Starbucks and McDonald's, so making a popular food and drink app seems to require actual cooking and a delivery service, which is outside the scope of our company.
* Finance apps - These apps mostly involve money transfers, mobile banking, paying bills, etc. Building a finance app would require us to hire a finance expert.

Let's dive deeper into the book and reference genres.

In [57]:
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


The reference genre does seem to be skewed by the Bible and Dictionary.com apps. Let's look at the book genre.

In [56]:
for app in ios_final:
    if app[-5] == 'Book':
        print(app[1], ':', app[5])

Kindle – Read eBooks, Magazines & Textbooks : 252076
Audible – audio books, original series & podcasts : 105274
Color Therapy Adult Coloring Book for Adults : 84062
OverDrive – Library eBooks and Audiobooks : 65450
HOOKED - Chat Stories : 47829
BookShout: Read eBooks & Track Your Reading Goals : 879
Dr. Seuss Treasury — 50 best kids books : 451
Green Riding Hood : 392
Weirdwood Manor : 197
MangaZERO - comic reader : 9
ikouhoushi : 0
MangaTiara - love comic reader : 0
謎解き : 0
謎解き2016 : 0


The book genre seems to be skewed by the Kindle and Audible apps. However, we see potential in the reference and book genres. They are not oversaturated like the games genre, and although they have big players, there is room in the market for other apps. One thing we could do is take a popular book and turn it into an app that offers features beyond the raw text of the book, such as daily quotes, an audio version, or quizzes. On top of that, we could embed a dictionary within the app so users don't need to leave it to look up a word.

This idea fits well with the fact that the App Store is dominated by "fun" apps, which will be hard to compete with. It suggests that a more practical app might have a better chance to stand out.

Now let's analyze the Google Play market.

## Most Popular Apps by Genre on Google Play

For the Android dataset, we have data about the actual number of installs, so we should be able to get a clearer picture of genre popularity. However, the install numbers are not precise: the Installs column is broken into open-ended ranges like 10,000+ and 100,000+.

In [44]:
display_table(android_final, 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


However, we are just trying to get an idea of which genres are popular, so we don't need precise numbers. For our analysis, we are going to consider an app with 10,000+ installs to have 10,000 installs, and so on. To perform computations we need to convert each install number into a float. We will do this directly in the loop below, where we also compute the average number of installs for each genre (category).

In [45]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

On average, communication apps have the most installs (38,456,119). As we saw in the App Store, this number is heavily skewed by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, etc.). Several other apps with over 100 million or 500 million installs add to the skew.

In [49]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

If we removed all the communication apps that have 100 million installs or more, the average would drop roughly tenfold.

In [50]:
under_100_m = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

3603485.3884615386

We see the same pattern for the video players, social, photography, and productivity categories. The concern is that these genres might seem more popular than they really are. These niches seem to be dominated by a few giants that are hard to compete against. The games genre is popular on Google Play as well; however, we previously found this part of the market saturated, so we would like to come up with a different app recommendation if possible.
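
One way to check this pattern for any category is to compute the average with and without the giants. The sketch below assumes the same row layout as `android_final` (name at index 0, category at index 1, installs at index 5). The functions `parse_installs` and `avg_installs` and the sample rows are hypothetical and exist only for illustration.

```python
def parse_installs(raw):
    """Convert an install string such as '10,000+' to a float."""
    return float(raw.replace('+', '').replace(',', ''))


def avg_installs(rows, category, cap=None):
    """Average installs for one category.

    Rows whose install count is at or above `cap` are skipped.
    Returns 0.0 when no row matches.
    """
    values = []
    for row in rows:
        if row[1] != category:
            continue
        n = parse_installs(row[5])
        if cap is None or n < cap:
            values.append(n)
    return sum(values) / len(values) if values else 0.0


# Hypothetical sample rows, not taken from the dataset.
sample_rows = [
    ['Giant Chat', 'SOCIAL', '', '', '', '1,000,000,000+'],
    ['Small Chat', 'SOCIAL', '', '', '', '1,000,000+'],
    ['Tiny Chat', 'SOCIAL', '', '', '', '500,000+'],
    ['Photo Fun', 'PHOTOGRAPHY', '', '', '', '10,000+'],
]

print(avg_installs(sample_rows, 'SOCIAL'))
print(avg_installs(sample_rows, 'SOCIAL', cap=100_000_000))
```

Calling `avg_installs(android_final, 'SOCIAL', cap=100_000_000)` on the real data would give the capped average for that category, in the same spirit as the communication example.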

The books and reference genre looks popular as well, with an average of 8,767,811 installs. Our aim is to recommend an app genre that shows profit potential on both the App Store and Google Play. Since we found that this genre has potential on the App Store, we would like to explore it in more depth for the Google Play market.

We will now look at some of the apps in the Books and Reference genre and their number of installs.

In [73]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

We can see that the books and reference genre includes a variety of apps:
* Ebook readers
* Libraries 
* Dictionaries
* Tutorials

We also see that, as with all the other genres we've looked at, a small number of extremely popular apps skew the average.

In [74]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


However, only a few apps are that popular, so we still see potential in this market. We will try to get some app ideas based on the kinds of apps that fall in the middle (between 1,000,000 and 100,000,000 downloads).

In [75]:

for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

This area seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries. It is therefore probably not a good idea to build similar apps, as the competition would be stiff.
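
The mid-range listing above matches install strings one by one. As an alternative sketch, the same selection can be expressed as a numeric range check, which avoids enumerating every bucket label. The functions `parse_installs` and `in_install_range` and the sample rows are hypothetical. The row layout (name at index 0, category at index 1, installs at index 5) follows `android_final`.

```python
def parse_installs(raw):
    """Convert an install string such as '10,000+' to a float."""
    return float(raw.replace('+', '').replace(',', ''))


def in_install_range(rows, category, low, high):
    """Return (name, installs) pairs where low <= installs < high."""
    matches = []
    for row in rows:
        if row[1] == category and low <= parse_installs(row[5]) < high:
            matches.append((row[0], row[5]))
    return matches


# Hypothetical sample rows, not taken from the dataset.
sample_rows = [
    ['Reader A', 'BOOKS_AND_REFERENCE', '', '', '', '5,000,000+'],
    ['Reader B', 'BOOKS_AND_REFERENCE', '', '', '', '100,000+'],
    ['Reader C', 'BOOKS_AND_REFERENCE', '', '', '', '100,000,000+'],
    ['Game D', 'GAME', '', '', '', '10,000,000+'],
]

mid_range = in_install_range(
    sample_rows, 'BOOKS_AND_REFERENCE', 1_000_000, 100_000_000)

for name, installs in mid_range:
    print(name, ':', installs)
```

With these toy rows only `Reader A` is printed, since it is the only books and reference app with at least 1,000,000 and fewer than 100,000,000 installs.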

We also notice quite a few apps built around the Quran, which suggests that building an app around a popular book can be profitable. Taking a popular book and turning it into an app could be profitable in both the Google Play and App Store markets.

However, the market is already saturated with libraries, so to make our app stand out we will need to add some special features, such as daily quotes from the book, activities, a discussion forum, and tools like dictionaries and highlighters.

## Conclusions

In this project we analyzed data about the App Store and Google Play apps with the goal of recommending an app profile that can be profitable in both markets. 

We concluded that taking a popular book and turning it into an app could be profitable for both markets. However, the market is already full of libraries, so we would need to add some special features. 