# Profitable App Profiles for the App Store and Google Play Markets 
We are a company that builds Android and iOS mobile apps. Our apps are free to download, and our main revenue stream comes from in-app ads. The more users our apps have, the more revenue we make per app.

## Objective
The goal of this project is to help our developers understand what **TYPE** of apps are more likely to attract **MORE USERS** on Google Play and the App Store. We have two free, relevant data sets to help with our analysis:
* A [data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
* A [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

Let's start by opening the data sets and exploring.

### Opening and Exploring the Data

In [1]:
from csv import reader

# The Google Play data set
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    google = list(reader(opened_file))
google_header = google[0]
google = google[1:]

# The App Store data set
with open('AppleStore.csv', encoding='utf8') as opened_file:
    apple = list(reader(opened_file))
apple_header = apple[0]
apple = apple[1:]

Next, we will define the given **explore_data** function to help us explore and read both data sets more easily. We will also add an optional fourth parameter, **rows_and_columns**, that lets us print the number of rows and columns in a data set.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print(google_header)
print('\n')
explore_data(google, 0, 3, True)

As we can see here, our Google Play data set contains 10,841 apps and 13 columns. For our analysis, some useful columns seem to be **'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'**. For more context on the Google Play data, click [here](https://www.kaggle.com/lava18/google-play-store-apps).

In [3]:
print(apple_header)
print('\n')
explore_data(apple, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


As we can see from this output, the App Store data set has 7,197 apps and 16 columns. For our analysis, we will focus on: **'track_name', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'**. For further context on the columns, see [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

### Data Cleaning 
#### Deleting Incorrect Data

Next, we will begin the data cleaning process. First, let's start with the Google Play Data.

Based on the Google Play data set [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), one of the [discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error in row 10,472, which we need to delete from our data set.

In [4]:
print(google_header) #header row (for categories)
print('\n')
print(google[10472]) #named error row for evaluation
print('\n')
print(google[0])     #another row for comparison

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Looking at row 10,472, the app *Life Made WI-Fi Touchscreen Photo Frame*, we can see that the 'Category' value is indeed missing, which shifts every subsequent value one column to the left. We will delete the row.
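
A quick way to catch rows like this without relying on the discussion forum is to compare each row's length with the header's length. The sketch below uses a small made-up sample (`sample_header`, `sample_rows`, and the helper `find_malformed_rows` are illustrative names, not part of the project). On the real data, one would presumably pass `google_header` and `google` instead.

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose length differs from the header's."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]

# Hypothetical three-column sample; the second row is missing its 'Category' value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

print(find_malformed_rows(sample_header, sample_rows))  # [1]
```

A length check only catches rows with missing or extra values. It would not flag a row where a value is simply wrong.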

In [5]:
print(len(google))
del google[10472]
print(len(google))

10841
10840


#### Removing Duplicate Entries

#### Part One

As we explore the data further, along with the [discussion section](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion), we find that the Google Play data also contains duplicate entries. We need to remove them so that each app appears only once. Below, we confirm one such case for *Instagram*:

In [6]:
for app in google:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


While this is just a snippet of the duplicates, let's see how many duplicates are in the data set.

In [7]:
duplicate_apps = []
unique_apps = []

for app in google:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('Number of unique apps:', len(unique_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181
Number of unique apps: 9659


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


In total, there are 1,181 cases where an app occurs more than once. We don't want to count these apps more than once when we analyze the data, so let's remove duplicates.

Looking at the rows we printed for our *Instagram* example, the main difference occurs in the fourth column, which lists the number of reviews. The differing numbers suggest the data was collected at different times. We will use this as our criterion for removing duplicates: the higher the number of reviews, the more recent the data should be, so we will keep the row with the **HIGHEST** number of reviews.

To remove the duplicates, we will:
* Create a dictionary, where each dictionary key is a unique app name and the corresponding value is the highest number of reviews of that app.
* Use the info stored in the dictionary and create a new data set, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).

#### Part Two
We will begin by building our dictionary:

In [8]:
reviews_max = {}

for row in google:
    name = row[0]
    n_reviews = float(row[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In Part One, we found 1,181 duplicate entries and 9,659 unique apps, so the length of our dictionary should be 9,659.

In [9]:
print('Expected length:', len(google) - 1181, 'apps')
print('Actual length:', len(reviews_max), 'apps')

Expected length: 9659 apps
Actual length: 9659 apps


Now that we know they match, let's work on using this dictionary to remove the duplicate rows by only keeping entries with the highest number of reviews:

* We start by building two empty lists below: **google_clean** (which will become a list of lists storing our cleaned data) and **already_added** (to keep track of apps we have already added).
* We will loop through the Google data set and for every iteration:
    * We isolate the name of the app and the number of reviews
    * We add the current row (row) to our **google_clean** list and the app name to our **already_added** list if:
        * The number of reviews of the current app matches the number of reviews of that app in our **reviews_max** dictionary and
        * The name of the app is not already in the **already_added** list. We **need** this condition to account for cases where the highest number of reviews is the same for multiple entries of a duplicate app. For example, the Box app has three entries with the same highest number of reviews. Without this condition, all three entries would be kept.



In [10]:
google_clean = [] # to store our new cleaned data set
already_added = [] # which will just store app names

for row in google:
    name = row[0]
    n_reviews = float(row[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        google_clean.append(row)
        already_added.append(name) 
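
Because `name not in already_added` scans a list on every iteration, the loop above gets slower as the list grows. One possible alternative, sketched here on made-up rows, keeps the best row per app in a dictionary in a single pass. `dedupe_by_max_reviews` is a hypothetical helper, and the Box review count is invented for illustration. Ties keep the earlier row, as in the loop above, but the output order follows each app's first appearance, which may differ from the order in **google_clean**.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Replace only on a strictly higher count, so ties keep the earlier row.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up rows using only the first four Google Play columns.
sample = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],
]

for row in dedupe_by_max_reviews(sample):
    print(row)
```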

Now, let's explore our newly made google_clean data set. The number of rows should, again, equal 9,659.

In [11]:
explore_data(google_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


As expected, we have 9,659 rows.

Reusing the script we used to check for duplicates in the Google Play data, we will quickly check the App Store data. Note that the first column of the App Store data is the app **'id'**, so this check looks for repeated ids rather than repeated names:

In [12]:
duplicate_apps = []
unique_apps = []

for app in apple:
    app_id = app[0]  # the first App Store column is 'id', not the app name
    if app_id in unique_apps:
        duplicate_apps.append(app_id)
    else:
        unique_apps.append(app_id)

print('Number of duplicate apps:', len(duplicate_apps))
print('Number of unique apps:', len(unique_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 0
Number of unique apps: 7197


Examples of duplicate apps: []


There are none. Onward with our data cleaning.

### Removing Non-English Apps
#### Part One
Our company builds apps for an English-speaking audience only, so we'd like to analyze only the apps directed toward that audience. However, both data sets contain apps whose names suggest they are not directed toward an English-speaking audience.

In [13]:
print(apple[813][1])
print(apple[6731][1])
print('\n')
print(google_clean[4412][0])
print(google_clean[7940][0])

Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠
„ÄêËÑ±Âá∫„Ç≤„Éº„É†„ÄëÁµ∂ÂØæ„Å´ÊúÄÂæå„Åæ„Åß„Éó„É¨„Ç§„Åó„Å™„ÅÑ„Åß „ÄúË¨éËß£„ÅçÔºÜ„Éñ„É≠„ÉÉ„ÇØ„Éë„Ç∫„É´„Äú


‰∏≠ÂõΩË™û AQ„É™„Çπ„Éã„É≥„Ç∞
ŸÑÿπÿ®ÿ© ÿ™ŸÇÿØÿ± ÿ™ÿ±ÿ®ÿ≠ DZ


We need to remove these apps. First, we need a criterion our script can use to distinguish English from non-English apps. One way is to remove every app whose name contains a symbol not commonly used in English text. English text usually consists of letters from the English alphabet, the digits 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

All the characters specific to English text are encoded in the [ASCII](https://en.wikipedia.org/wiki/ASCII) (American Standard Code for Information Interchange) system and correspond to numbers in the range 0 to 127. Based on this range, we can build a function that detects whether a character belongs to the set of common English characters.
* If the number is **equal to or less** than 127, the character belongs to the set of common English characters.
* If an app name contains a character with a number **GREATER** than 127, the app probably has a non-English name.
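
To make the range concrete, here is a small sketch that looks up the number for a few characters with Python's built-in `ord()` and labels each one using the rule above. The helper `code_point_label` and the sample characters are our own choices for illustration.

```python
def code_point_label(character):
    """Return the character's code point and whether it is in the ASCII range."""
    code = ord(character)
    return code, ('ASCII' if code <= 127 else 'non-ASCII')

# A few sample characters: letters, a digit, a symbol, and two non-ASCII characters.
for character in ['A', 'z', '7', '+', '™', '爱']:
    code, label = code_point_label(character)
    print(repr(character), code, label)
```

Letters, digits, and common punctuation all fall at or below 127, while the trademark symbol and the Chinese character fall well above it.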

Before we can remove these apps from our data sets, we need a function that checks each character of an app name and finds the number it corresponds to.

First, let's use the built-in ord() function, which returns the encoding number of a character, to build a function that determines whether a string is likely English or non-English:

In [14]:
def is_english(app_name):
    
    for character in app_name:
        if ord(character) > 127:
            return False 
        
    return True 

print(is_english('Instagram'))
print(is_english('Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠'))
print(is_english('Docs To Go‚Ñ¢ Free Office Suite'))
print(is_english('Instachat üòú'))

True
False
False
False


The function works as written, but it wrongly flags the third and fourth names, which are English. Both contain characters outside the ASCII range, such as emoji and other special symbols, whose numbers are greater than 127.

In [15]:
print(ord('😜'))
print(ord('™'))

128540
8482


#### Part Two
If we used this function as is, we would lose a lot of useful data, because many English apps would be incorrectly labeled as non-English. To minimize this impact, we will only remove an app if its name has more than three characters with corresponding numbers outside the ASCII range.
* This means all English apps with up to three emoji or other special characters will still be labeled as English. It's not perfect, but it is more effective.

In [16]:
def is_english(app_name):
    non_ascii = 0 
    
    for character in app_name:
        if ord(character) > 127:
            non_ascii += 1
        
    if non_ascii > 3:
        return False
    else:
        return True 

print(is_english('Docs To Go‚Ñ¢ Free Office Suite'))
print(is_english('Instachat üòú'))
print(is_english('Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠'))


True
True
False


While this function isn't perfect and some non-English apps may get through, it suffices for our purposes, so let's continue.

Next, we will use this function to loop through each data set to filter out non-English apps. If the app is identified as English, append the whole row to a separate list.

In [17]:
google_english = []
apple_english = []

for app in google_clean:
    name = app[0]
    if is_english(name):
        google_english.append(app)
        
for app in apple:
    name = app[1]
    if is_english(name):
        apple_english.append(app)
        
explore_data(google_english, 0, 3, True)
print('\n')
explore_data(apple_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16

We can see we are now left with 9,614 Google Play apps (down from 9,659) and 6,183 App Store apps (down from 7,197), all assumed to be English.

### Isolating the Free Apps

So far in this data cleaning process, we:

* Removed inaccurate data
* Removed duplicate app entries
* Removed non-English apps

As mentioned, we only build apps free to use, and our main source of revenue comes from in-app ads. Our data sets contain both free and non-free apps; we need to isolate only *free* apps for our analysis.

Isolating the free apps will be the last step in our data cleaning process. We will loop through each data set and collect the free apps in separate lists.


In [18]:
google_final = []
apple_final = []

for row in google_english:
    free = row[7]
    if free == '0':
        google_final.append(row)
        
for row in apple_english:
    free = row[4]
    if free == '0.0':
        apple_final.append(row)

print("Free Google Apps:",len(google_final))
print("Free Apple Apps:", len(apple_final))
    

Free Google Apps: 8864
Free Apple Apps: 3222


We are now left with 8,864 free, English Android apps and 3,222 free, English iOS apps.

### Most Common Apps By Genre
#### Part One 

As mentioned in the intro, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is made up of three steps:

 1. Build a minimal Android version of the app, and add it to Google Play
 2. If the app is well received from users, we develop it further. 
 3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because the end goal is to add the app to both Google Play and App Store, we need to find app profiles that are successful in both markets. *Perhaps a profile that works well for both markets is a productivity app that makes use of gamification*.
    
We will start the analysis by getting a sense of what are the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our data sets. 
   * For Google Play, we will use the **'Category'** and **'Genres'** columns
   * For the App Store, we will use the **'prime_genre'** column

#### Part Two
We will build two functions to analyze the frequency tables:
* One function to generate frequency tables that show percentages
* Another function we can use to display the percentages in a descending order

In [19]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
            
    table_percentages = {}
    
    for key in table:
        percentage = (table[key] / total) * 100 
        table_percentages[key] = percentage
    
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
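
The same percentages can also be computed with `collections.Counter` from the standard library. This is an alternative sketch, not the approach used in the rest of this project. `freq_table_counter` and the sample rows are made up for illustration.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, most common value first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs sorted by descending count.
    return {value: count / total * 100 for value, count in counts.most_common()}

# Hypothetical two-column rows: an app name and a genre.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

for genre, percentage in freq_table_counter(sample, 1).items():
    print(genre, ':', percentage)
```

Since dictionaries preserve insertion order, the result is already sorted for display, so a separate sorting step like the one in **display_table** would not be needed.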

#### Part Three
With our functions written, we will examine the frequency tables for our selected columns. We will start with the **'prime_genre'** column in the App Store data:

In [20]:
display_table(apple_final, -5) # prime_genre column

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


We can see that among the free English apps, more than half (58.16%) are games. Entertainment apps are close to 8%, and photo and video apps are about 5%. Only about 3.7% are education apps, and social networking apps make up about 3.3% of the apps in our data set.

The general impression is that apps centered around fun and leisure (games, entertainment, photo & video) dominate the App Store market while apps built around productivity and practicality are more rare or niche (education, shopping, utilities, productivity, lifestyle).

Based on this data alone, one would think the App Store market for apps built around entertainment is highly saturated. However, a large supply doesn't imply these apps have the most users. Demand might not match supply here.

Next, we will look at the **'Genres'** and **'Category'** columns in the Google Play data set.

In [21]:
display_table(google_final, 1) # Category column

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

The landscape in the Google Play market seems vastly different. There aren't as many apps built around fun. More are built for practical purposes (family, tools, education, business, productivity, etc.). Upon investigation, though, one will see that most of the apps in the 'Family' category (about 19% of the market) are games for kids.

Even so, practical apps appear to be better represented on Google Play than in the App Store. The 'Genres' column below tells a similar story: apps built around practicality make up a large share of the supply.

In [22]:
display_table(google_final, -4) # Genres column

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

Though the difference between 'Category' and 'Genres' isn't entirely clear, the 'Genres' column is more granular, as is evident from its larger number of distinct values. Since we want a like-for-like comparison between the App Store and Google Play, we will work with the bigger picture given by 'Category' going forward.

Up to this point, we see that the App Store is dominated in **SUPPLY** by apps built around fun and entertainment, while Google Play is more balanced in **SUPPLY** between practical apps and apps for fun and games.

While all of this is useful to know, it doesn't tell us which app categories or genres have the most users. Let's explore that.

### Most Popular Apps by Genre on the App Store

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the **'Installs'** column, but the App Store data is missing this column. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the **'rating_count_tot'** column.

Let's calculate the average number of user ratings per app genre in the App Store.
* Isolate the apps of each genre
* Sum up the user ratings for the apps of that genre
* Divide the sum by the number of apps belonging to that genre

In [23]:
genres_apple = freq_table(apple_final, -5)

for genre in genres_apple:
    total = 0 #store the sum of user ratings (the number of ratings) specific to each genre.
    len_genre = 0 #store the number of apps specific to each genre
    
    for app in apple_final:
        genre_app = app[-5]
        
        if genre_app == genre:
            num_ratings = float(app[5])
            total += num_ratings
            len_genre += 1
            
    avg_num_ratings = total / len_genre
    print(genre, ':', avg_num_ratings)


Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


Looking at the above output, the Navigation genre has the highest average number of user ratings, followed by Reference and Social Networking. Let's see why.

In [24]:
for app in apple_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


As we can see, the Navigation average is heavily influenced by the Waze and Google Maps apps, which together account for almost half a million user ratings.

In [25]:
for app in apple_final:
    if app[-5] == 'Social Networking':
        print(app[1], ':', app[5])

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miito

The same can be said for social networking apps, where a large share of the ratings goes to Facebook, Pinterest, and Skype, and for music apps with Spotify and Pandora. It seems the bigger players corner the market.

Our goal is to find popular genres so that we can pick a target category to enter, but navigation, social networking, and music may seem more popular than they really are. The average number of ratings seems to be skewed by a few big players in certain genres.

We could get a more balanced picture by removing these extremely popular apps from each genre and recalculating the averages, but that can be done later.
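
One possible way to do that recalculation is sketched below. It is only an illustration: the 100,000-rating cutoff is an arbitrary assumption, `genre_stats` is a hypothetical helper, and the sample rows are the six Navigation apps printed earlier, reduced to three columns with shortened names. The median is included because it is less sensitive to a few very large values.

```python
from statistics import mean, median

def genre_stats(rows, genre, cutoff=100_000, genre_index=1, ratings_index=2):
    """Summarize rating counts for one genre, with and without very large apps."""
    ratings = [float(row[ratings_index]) for row in rows if row[genre_index] == genre]
    trimmed = [r for r in ratings if r < cutoff]  # drop apps at or above the cutoff
    return {
        'mean': mean(ratings),
        'mean_below_cutoff': mean(trimmed) if trimmed else None,
        'median': median(ratings),
    }

# Navigation apps and their rating counts (names shortened, columns reduced).
sample = [
    ['Waze', 'Navigation', '345046'],
    ['Google Maps', 'Navigation', '154911'],
    ['Geocaching', 'Navigation', '12811'],
    ['CoPilot GPS', 'Navigation', '3582'],
    ['ImmobilienScout24', 'Navigation', '187'],
    ['Railway Route Search', 'Navigation', '5'],
]

print(genre_stats(sample, 'Navigation'))
```

With the two largest apps excluded, the Navigation average in this sample falls from about 86,090 to 4,146.25, and the median is 8,196.5. Note that `mean` raises an error if no app matches the requested genre.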

Let's look at one more genre: Reference 

In [26]:
for app in apple_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


While the Reference genre has 74,942 user ratings on average, the Bible and Dictionary.com apps skew the average up significantly.

However, with just those few apps accounting for most of the ratings, this genre shows potential. One suggestion is to take another popular book and turn it into an app with extra features beyond the raw text. These could include daily quotes, an audio version of the book, quizzes about the book, etc.

Also, public and mental health are areas that have seen a lot of growth in popularity recently. Perhaps there is a reference book that could pair with the growing number of meditation apps that seem to be springing up. We could even embed a dictionary feature within the app to keep users within our environment for longer.

This could play well with the fact that the App Store is dominated by for-fun apps. That suggests the market might be oversaturated with those kinds of apps, which means a practical app might have a better chance to break through and stand out among the huge number of apps in the App Store.

Other genres that seem popular include weather, finance, and food and drink. The book genre seems to overlap a bit with the app idea described above, but the other genres don't seem too interesting to us:
* Weather apps - people don't really spend much time in these apps, and thus the chance of making money from in-app adds seem low
* Food and drink - most of this category is dominated and filled with specific brands that own the lifecycle of food creation to delivery or service, which we don't do and outside of our scope.
* Finance - these apps are similar in the sense they are owned by banking institutions or places that specialize in financial sector and offer applicable services. We would need to go outside of our scope and bring someone in just to enter the space

Now let's look at the Google Play market.

### Most Popular Apps by Genre on Google Play
For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture about genre popularity. 

However, the install numbers don't seem precise enough - we see values are open-ended.

In [27]:
display_table(google_final, 5) # the Installs column

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


That said, we don't need perfect precision with respect to the number of users - we only want to find which app genres attract the most users.

To simplify the data, we'll treat an app with 100,000+ installs as having 100,000 installs, an app with 1,000,000+ installs as having 1,000,000 installs, and so on.

To perform computations, however, we need to convert each install number from a string to a float. This means removing the commas and plus characters first; otherwise the conversion raises an error.

We will do this in the loop below.
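
As a small illustration of the cleaning step on its own, the sketch below wraps the two `replace` calls in a helper. The name `parse_installs` and the sample strings are assumptions made for this example; the analysis itself does the same work inline.

```python
def parse_installs(raw):
    """Turn an install string such as '1,000,000+' into a float."""
    # Strip the characters that float() cannot parse
    cleaned = raw.replace('+', '').replace(',', '')
    return float(cleaned)

# Hypothetical sample values in the same format as the Installs column
samples = ['100,000+', '1,000,000+', '0']
parsed = [parse_installs(s) for s in samples]
print(parsed)
```

Calling `float('1,000,000+')` directly would raise a `ValueError`, which is why the characters are removed first.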

In [28]:
genres_google = freq_table(google_final, 1)

for category in genres_google:
    
    total = 0 #store the sum of installs specific to each genre
    len_category = 0 #store the number of apps specific to each genre
    
    for app in google_final:
        category_app = app[1]
        
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1
            
    avg_num_installs = total / len_category
    print(category, ':', avg_num_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

On average, communication apps have the most installs: 38,456,119. However, this number is heavily skewed by a few apps that have over one billion installs:
* WhatsApp
* Facebook Messenger
* Skype
* Google Chrome
* Gmail
* Hangouts
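
The ranking of categories discussed above is easier to read when the averages are stored and sorted rather than printed in dictionary order. The following is a minimal sketch of that idea, not the method used in this notebook: it computes the averages in a single pass and sorts them. The `toy_rows` data is made up, and it only assumes the same layout as `google_final` (category at index 1, installs at index 5).

```python
# Hypothetical rows laid out like google_final (only indexes 1 and 5 matter here)
toy_rows = [
    ['App A', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['App B', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['App C', 'WEATHER', '', '', '', '5,000,000+'],
    ['App D', 'EVENTS', '', '', '', '10,000+'],
]

totals = {}  # category -> sum of installs
counts = {}  # category -> number of apps

for row in toy_rows:
    category = row[1]
    n_installs = float(row[5].replace('+', '').replace(',', ''))
    totals[category] = totals.get(category, 0) + n_installs
    counts[category] = counts.get(category, 0) + 1

averages = {category: totals[category] / counts[category] for category in totals}

# Sort categories from highest to lowest average installs
ranking = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for category, avg in ranking:
    print(category, ':', avg)
```

With the real data, the same sorting step would put COMMUNICATION first and VIDEO_PLAYERS second, matching the figures quoted in the text.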

In [29]:
for app in google_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Me

If we removed all the communication apps with 100 million or more installs, the average would be reduced roughly tenfold:

In [30]:
under_100_m = []

for app in google_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
under_100_m_total = sum(under_100_m) / len(under_100_m)

print("Average number of installs of apps under 100 million installs:")
print(under_100_m_total)

Average number of installs of apps under 100 million installs:
3603485.3884615386
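
Another way to see how much a few giants distort an average is to compare the mean with the median, which is far less sensitive to extreme values. The sketch below is not part of the original analysis: it uses Python's standard `statistics` module on a made-up list of install counts (`toy_installs` is a hypothetical name and the values are illustrative only).

```python
from statistics import mean, median

# Made-up install counts: four mid-sized apps and one giant
toy_installs = [100000.0, 500000.0, 1000000.0, 5000000.0, 1000000000.0]

toy_mean = mean(toy_installs)      # pulled up by the single giant
toy_median = median(toy_installs)  # the middle value of the sorted list

print('Mean:', toy_mean)
print('Median:', toy_median)
```

Here the mean lands above 200 million even though four of the five apps have 5 million installs or fewer, while the median stays at 1 million.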


We see the same pattern for apps in the video players category, which is the runner-up with 24,727,872 installs. That category, though, is dominated by apps like:
* YouTube
* Google Play Movies & TV
* MX Player

The pattern repeats for social apps (as we saw in the App Store data), photography apps (Google Photos and other popular editors), and productivity apps like Microsoft Word, Dropbox, Google Calendar, Evernote, etc.

Again, the main concern is that these large players lead their categories and inflate the average number of installs.

The game genre seems pretty popular, but as we know, it is already a pretty saturated market.

The books and reference genre looks fairly popular as well, with an average of 8,767,811 installs. Let's explore this category in more depth, since we found similar trends in the App Store data and our aim is to build an app that can be profitable in both markets.

In [31]:
for app in google_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+

This genre includes a large variety of apps and there still seems to be a small number of extremely popular apps that skew our average. 

In [32]:
for app in google_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


We see that there are a few apps that raise our average a bit but not as many as other categories. There is potential here for development. Let's look at apps that fall within the middle range of installs (between 1,000,000 and 100,000,000).

In [33]:
for app in google_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
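
The chain of string comparisons above works because the install buckets are a fixed set of labels. An alternative sketch, assuming the same row layout (name at index 0, category at index 1, installs at index 5), converts the label to a number and checks a range instead. The `toy_rows` list and the `in_install_range` helper are hypothetical and exist only for this example.

```python
def in_install_range(raw, low, high):
    """Return True if an install label such as '5,000,000+' falls in [low, high)."""
    n_installs = float(raw.replace('+', '').replace(',', ''))
    return low <= n_installs < high

# Hypothetical rows laid out like google_final (only indexes 0, 1 and 5 matter here)
toy_rows = [
    ['Toy Reader', 'BOOKS_AND_REFERENCE', '', '', '', '5,000,000+'],
    ['Toy Giant', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000,000+'],
    ['Toy Small', 'BOOKS_AND_REFERENCE', '', '', '', '10,000+'],
    ['Toy Chat', 'COMMUNICATION', '', '', '', '10,000,000+'],
]

# Keep books-and-reference apps with at least 1 million and fewer than 100 million installs
mid_range = [
    row[0] for row in toy_rows
    if row[1] == 'BOOKS_AND_REFERENCE'
    and in_install_range(row[5], 1000000, 100000000)
]
print(mid_range)
```

A range check like this avoids listing every bucket label by hand, at the cost of parsing each label first.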

This niche seems to be dominated by software for processing and reading ebooks, as well as collections of libraries and dictionaries, so it's probably not a good idea to build similar apps due to competition.

There are also quite a few apps built around the Quran, probably linked to the large Muslim population worldwide. This suggests that building an app around a popular book can work. Taking a popular book (perhaps a more recent or timely one) and turning it into an app could be profitable for both the Google Play and App Store markets.

However, it looks like the market is already full of libraries, so we need to add some special features besides a raw version of the book. These could include daily quotes, an audio version, quizzes on the book, or a forum where people can discuss it. We should also look to track and source books in the public domain, as those would be easiest to work with since there are no rights or licensing issues to contend with. Stoicism has been gaining popularity lately, so a good idea could be an app based on a popular book on Stoicism with a daily quote feature.

To play into the popularity of games and entertainment, we could make a *Book Club* app that builds a community around one or more books and includes knowledge games about them. It could even be used as a study tool in academia.

# Conclusions

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile/category/genre that could be profitable in both markets.

We concluded that the reference category that spans both markets is ripe for development. More granularly, taking a popular book (perhaps a more recent one) and turning it into an app for both markets could be profitable. 

The markets are already full of libraries and encyclopedias, so we need extra features. These could include features that mimic the success found in games and entertainment. Combined with a sense of community, such as a forum or other social aspects, we could have the best of both worlds. 