# An Analysis of a Profitable App Profile for the App Store and Google Play Markets

#### Introduction
Hello, I'm Russel, and this is my first data analysis project! I've been using this website called Data Quest to help me learn the syntax, concepts, and skills needed to perform data analysis with Python. This is my first guided project after finishing their Fundamentals in Python course. 

Data Quest guided projects are great because they're able to teach you the process of analysis that must be adopted to achieve project goals. For instance, in cleaning data, some matters may evade my judgement (we'll discuss this more later). But because I had the guide from Data Quest, I was able to avoid those pitfalls. Now, moving forward in my data science career, I have added all these considerations to my mental toolbox for my future data analysis projects.

#### Project Brief
The guided project starts with the premise that we work for a company that develops apps for the English market on both Android and iOS. These apps are then made available on Google Play and App Store, free to download and install. The primary source of revenue for the company is driven by in-app ads and in-app purchases. This means that revenue is heavily driven by the size of the user base of the app - the more users who see and engage with ads and the more users who purchase in-app add-ons, the better.

The goal: analyze data from Google Play and the App Store to help our developers understand user preferences, and recommend an app profile that is most likely to attract users.

#### Data Source
The first thing we need to do is gather the data for the analysis. As of the third quarter of 2019, Google Play offers about 2.47 million apps to Android users. Meanwhile, Apple has about 1.8 million apps available in the App Store (Statista, 2019).

Instead of gathering data for all 4.27 million apps - which would take a lot of time, effort, and money - we'll analyze a sample of this data. A quick search shows that relevant samples that suit our project goal already exist at no cost. On Kaggle, we were able to find:

  * A data set containing approximately 10,000 [Android apps from Google Play](https://www.kaggle.com/lava18/google-play-store-apps/home), collected in August 2018.
  * A data set containing approximately 7,000 [iOS apps from App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home), collected in July 2017.

It is important to note the limitations of our data, because they add risk to any conclusions we draw from the analysis. The data sets were gathered in 2017 and 2018, which affects the timeliness of our recommendation - we will not be able to factor in any changes in market taste since then. And while the data sets are large enough to analyze, they are not sufficient to accurately represent the totality of the market. Each data set covers only about 0.4% of its respective market.
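
As a quick sanity check on that 0.4% figure, here is a minimal sketch of the arithmetic. It assumes the row counts that `explore_data` prints later in this notebook (10,841 for Google Play and 7,197 for the App Store) and the Statista market sizes quoted above; the variable names are only illustrative.

```python
# Row counts are the ones printed later in this notebook;
# market sizes are the Statista (2019) figures quoted above.
google_sample, google_market = 10_841, 2_470_000
apple_sample, apple_market = 7_197, 1_800_000

google_share = google_sample / google_market * 100
apple_share = apple_sample / apple_market * 100

print(f"Google Play sample share: {google_share:.2f}%")
print(f"App Store sample share: {apple_share:.2f}%")
```

Both shares come out at a little over 0.4%, so the samples are small relative to the full markets.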

Below, we load both data sets. Then we separate the header row from the actual data, which will make the analysis easier as we go further.

In [1]:
from csv import reader

open_apple = open('C:/Users/Artemis Plus/Documents/datasets/AppleStore.csv', encoding = 'utf8')
read_apple = reader(open_apple)
apple_data = list(read_apple)
apple_header = apple_data[0]
apple = apple_data[1:]

open_google = open('C:/Users/Artemis Plus/Documents/datasets/googleplaystore.csv', encoding = 'utf8')
read_google = reader(open_google)
google_data = list(read_google)
google_header = google_data[0]
google = google_data[1:]
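
The loading step above could also be wrapped in a small helper so that both files are read the same way and are closed automatically. This is only a sketch: `load_csv` is a hypothetical name, not something from the guided project.

```python
from csv import reader

def load_csv(path):
    """Return (header, rows) for the CSV file at `path`."""
    # The `with` block closes the file even if reading fails.
    with open(path, encoding='utf8', newline='') as f:
        data = list(reader(f))
    return data[0], data[1:]

# Example call (the path is a placeholder):
# apple_header, apple = load_csv('AppleStore.csv')
```

The helper returns the header and the data rows separately, which mirrors the manual split into `apple_header` and `apple` above.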

In [2]:
def explore_data(dataset, start, end, row_col = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') #adds new line
    if row_col:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))
        print('\n')

The `explore_data` function above displays a selected chunk of our data in a readable format. The function takes 4 parameters - `dataset`, `start`, `end`, and `row_col`. `dataset` is the data set that we want to read, `start` and `end` are the indices that bound the slice of rows we want to display, and `row_col` is a flag that, when set to `True`, also prints the number of rows and columns of the entire data set. `dataset` here should not include a header row.

In [3]:
explore_data(apple, 0, 4, row_col = True)
explore_data(google, 0, 4, row_col = True)

print(apple_header)
print(google_header)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


Number of rows:  7197
Number of columns:  17


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design

To better understand the output from the `explore_data` function, the tables below contain a brief description of the columns in our data sets. To achieve the goals of our project, we will use the `prime_genre` column from the Apple Store data and the `Category` and `Genres` columns from the Google Play data in our analysis.

##### Apple Store Data

|Name|Description|
|:---:|:---:|
|"id" | App ID|
|"track_name"| App Name|
|"size_bytes"|Size (in Bytes)|
|"currency"| Currency Type|
|"price"|Price amount|
|"rating_count_tot"| User Rating counts (for all versions)|
|"rating_count_ver"| User Rating counts (for current version)|
|"user_rating" |Average User Rating value (for all versions)|
|"user_rating_ver"| Average User Rating value (for current version)|
|"ver" | Latest version code|
|"cont_rating"| Content Rating|
|"prime_genre"|Primary Genre|
|"sup_devices.num"| Number of supported devices|
|"ipadSc_urls.num"| Number of screenshots shown for display|
|"lang.num"| Number of supported languages|
|"vpp_lic"| Vpp Device Based Licensing Enabled|

##### Google Play Data

|Name|Description|
|:---:|:---:|
|App|Application name|
|Category|Category the app belongs to|
|Rating|Overall user rating of the app (as when scraped)|
|Reviews|Number of user reviews for the app (as when scraped)|
|Size|Size of the app (as when scraped)|
|Installs|Number of user downloads/installs for the app (as when scraped)|
|Type|Paid or Free|
|Price|Price of the app (as when scraped)|
|Content Rating|Age group the app is targeted at - Children / Mature 21+ / Adult|
|Genres|An app can belong to multiple genres (apart from its main category).|
|Last Updated|Date when the app was last updated on Play Store (as when scraped)|
|Current Ver|Current version of the app available on Play Store (as when scraped)|
|Android Ver|Min required Android version (as when scraped)|

#### Data Cleaning
Among data analysts, it's been said that 90% of your time will go towards cleaning your data into a usable form. I say 'among data analysts' with familiarity, because my data analytics elective professor in college said that. 

Kidding aside, the goal of data cleaning is to detect inaccurate and duplicate data, so we can remove it from our analysis. In identifying possible inaccuracies and errors, the first task is to read discussions on the source site of the data set that you currently have. This was what I was talking about earlier when I said some issues would evade my judgement. 

The Google Play data set discussion reports an error in row 10472. We can confirm this by printing the rows around that index. The row for the app "Life Made WI-Fi..." is missing its category value, so every later value is shifted one position to the left: the row has only 12 values instead of 13, '1.9' appears where the category should be, the rating reads '19', and the genre value is empty. Since the row is malformed, we remove it with the `del` statement. We need to make sure we don't run that line more than once, otherwise we'll delete a valid row as well.

From here, we can read the discussions about the Apple Store Data to see if there are any inaccuracies. From what I could read though, the data set does not have shifted columns or missing content.
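
A shifted row like the one described above can also be found without relying on the discussion board, because it has fewer values than the header. The sketch below is an assumption-laden illustration: `find_malformed_rows` is a hypothetical helper, and the three rows are a cut-down toy version of the real data.

```python
def find_malformed_rows(dataset, header):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(dataset) if len(row) != expected]

# Tiny illustrative data, not the real data set.
header = ['App', 'Category', 'Rating']
rows = [
    ['Jazz Wi-Fi', 'COMMUNICATION', '3.4'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],  # category missing
    ['Sat-Fi Voice', 'COMMUNICATION', '3.4'],
]

bad = find_malformed_rows(rows, header)
print(bad)

# Building a new list instead of using `del` makes re-running harmless.
clean = [row for i, row in enumerate(rows) if i not in bad]
print(len(clean))
```

Because the filter creates a new list rather than mutating the original, running the cell twice gives the same result, which avoids the double-deletion risk of `del`.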

In [4]:
explore_data(google, 10470, 10475, row_col = True)

['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']


Number of rows:  10841
Number of columns:  13




In [5]:
del google[10472]

In [6]:
explore_data(google, 10470, 10475, row_col = True)

['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']


['Wi-Fi Visualizer', 'TOOLS', '3.9', '132', '2.6M', '50,000+', 'Free', '0', 'Everyone', 'Tools', 'May 17, 2017', '0.0.9', '2.3 and up']


Number of rows:  10840
Number of columns:  13




##### Duplicate Apps

The next step in cleaning is to eliminate duplicate rows. As an example, we print the rows with the app name "Instagram" in the Google Play data set. It currently has 4 entries.

We checked Apple Store Data for any duplicate instances as well, but based on a trial with 3 different apps, the data set is pretty clean.

In [7]:
for app in google:
    name = app[0]
    if name == 'Instagram':
        print(app)

print('\n')

for app in apple:
    name = app[2]
    if name == 'Facebook' or name == 'Instagram' or name == 'Twitter':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['17', '284882215', 'Facebook', '389879808', 'USD', '0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['251', '333903271', 'Twitter', '210569216', 'USD', '0', '354058', '452', '3.5', '4', '6.79.1', '17+', 'News', '37', '2', '33', '1']
['591', '389801252'

To see how many duplicate apps exist in each data set, we create two lists per data set: one for the names we have seen once and one for the names that appear again. Below, we loop through both data sets and check for duplicates.

From here, we can see that there are 1,181 duplicate entries in the Google Play data set and 2 in the Apple Store data set.

In [8]:
google_duplicate_apps = []
google_unique_apps = []

for app in google:
    name = app[0]
    if name in google_unique_apps:
        google_duplicate_apps.append(name)
    else:
        google_unique_apps.append(name)

print('Number of duplicate apps: ', len(google_duplicate_apps))
print('\n')
print('Examples of duplicate apps: ', google_duplicate_apps[:15])

print('\n')

apple_duplicate_apps = []
apple_unique_apps = []

for app in apple:
    name = app[2]
    if name in apple_unique_apps:
        apple_duplicate_apps.append(name)
    else:
        apple_unique_apps.append(name)
        
print('Number of duplicate apps: ', len(apple_duplicate_apps))
print('\n')
print('Examples of duplicate apps: ', apple_duplicate_apps[:15])

Number of duplicate apps:  1181


Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


Number of duplicate apps:  2


Examples of duplicate apps:  ['VR Roller Coaster', 'Mannequin Challenge']
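
The same count can be sketched with `collections.Counter`, which avoids the repeated `name in list` lookups. `duplicate_summary` is a hypothetical helper and the sample rows are made up for illustration; the counting rule (every occurrence after the first is a duplicate) matches the loops above.

```python
from collections import Counter

def duplicate_summary(dataset, name_index):
    """Return (number of surplus rows, names that occur more than once)."""
    counts = Counter(row[name_index] for row in dataset)
    # Every occurrence after the first one counts as a duplicate.
    surplus = sum(n - 1 for n in counts.values())
    repeated = [name for name, n in counts.items() if n > 1]
    return surplus, repeated

# Toy rows holding only an app name.
sample = [['Box'], ['Slack'], ['Box'], ['Box'], ['Zenefits']]
surplus, repeated = duplicate_summary(sample, 0)
print(surplus, repeated)
```

For the real data this would be called as `duplicate_summary(google, 0)` and `duplicate_summary(apple, 2)`.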


Given that we have duplicate rows, we need a criterion for deciding which row to keep. From the duplicate Instagram rows we displayed earlier for Google Play, we can see that the main difference lies in the number of reviews. The different numbers suggest that the data was collected at different times. We assume that the row with the highest number of reviews is the most recent entry, so we keep that row.

Below, we build a dictionary for each data set in which every key is a unique app name and the value is the highest number of reviews seen for that app. We will use these dictionaries in the next step to pick which row to keep.

In [9]:
#Google
google_dic = {}

for app in google:
    name = app[0]
    if name in google_dic:
        reviews = float(app[3])
        if reviews > google_dic[name]:
            google_dic[name] = float(app[3])
    else:
        google_dic[name] = float(app[3])
        
print(len(google_dic))

#Apple
apple_dic = {}

for app in apple:
    name = app[2]
    if name in apple_dic:
        reviews = float(app[6])
        if reviews > apple_dic[name]:
            apple_dic[name] = float(app[6])
    else:
        apple_dic[name] = float(app[6])
        
print(len(apple_dic))

9659
7195


From here, we use the dictionaries made above to eliminate duplicate entries. For each data set we create two new lists: one for the cleaned data and one for the names already added. Then we loop through the data set, check whether the row's number of reviews matches the maximum stored in the dictionary, and if so add that row to the cleaned list. We need the additional `already_added` condition in case there are duplicate entries with the same number of reviews.

To check that everything went according to plan, the cleaned Android data set should have 9,659 rows and the cleaned Apple data set should have 7,195 rows.
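
The two-step approach (dictionary of maximums, then a filtering loop) can also be collapsed into a single pass. This is a sketch under stated assumptions, not the method from the guided project: `keep_most_reviewed` is a hypothetical helper, and the toy rows hold only a name and a review count.

```python
def keep_most_reviewed(dataset, name_index, reviews_index):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' keeps the earlier row on ties, like `already_added` does.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

rows = [
    ['Instagram', '66577313'],
    ['Box', '159'],
    ['Instagram', '66577446'],
    ['Instagram', '66577313'],
]
deduplicated = keep_most_reviewed(rows, 0, 1)
print(deduplicated)
```

One difference to be aware of: the result is ordered by each name's first appearance, whereas the loop-based version keeps the position of the row that was actually selected.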

In [10]:
#Google
android_clean = []
already_added = []

for app in google:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == google_dic[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

explore_data(android_clean, 0, 3, True)         

#Apple
apple_clean = []
app_already_added = []

for app in apple:
    name = app[2]
    n_reviews = float(app[6])
    if (n_reviews == apple_dic[name]) and (name not in app_already_added):
        apple_clean.append(app)
        app_already_added.append(name)

explore_data(apple_clean, 0, 3, True)  

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9659
Number of columns:  13


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '1

##### English Apps

Recall that the company only targets the English market in developing apps. If we explore the data, we'll be able to find apps not directed to an English-speaking audience. Since these apps are not needed for our analysis, we need to remove them.

One technique we can use is to remove apps whose names contain symbols not commonly used in English text. English text typically uses the letters of the alphabet, digits, punctuation marks, and other common symbols ('+', '-', '/'). Each character in a string has a corresponding number associated with it. We can get that number using the `ord()` built-in function.

The numbers of all the characters we commonly use in the English language are all in the range 0 to 127. Based on this number range, we can build a function to identify if a string contains characters not in the English language.
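
To make the 0 to 127 range concrete, here is a short illustration of `ord()` on a few characters, some of which appear in the app names used in the next cell. The list of characters is just an example selection.

```python
# Map a few sample characters to their code points.
codes = {character: ord(character) for character in ['a', 'Z', '+', '™', '爱', '😜']}

for character, code in codes.items():
    print(character, code, 'within 0-127' if code <= 127 else 'outside 0-127')
```

The letters and the plus sign fall inside the range, while the trademark sign, the Chinese character, and the emoji fall outside it.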

In [11]:
def english_only(a_string):
    count = 0
    for character in a_string:
        if ord(character) > 127:
            count += 1
    if count > 3:
        return False
    else:
        return True

print(english_only('Instagram'))
print(english_only('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_only('Docs To Go™ Free Office Suite'))
print(english_only('Instachat 😜'))

True
False
True
True


From above, we can see that we can iterate through strings just like we iterate through lists. The function returns a Boolean value that depends on how many characters in the string fall outside the 0 to 127 range. If the string has 3 or fewer such characters, the function returns `True`, meaning we treat the app as targeted at the English market.

If the string has more than 3 such characters, the function returns `False`. We allow up to 3 so that names containing a few characters outside the range, but still aimed at English speakers, are not discarded. For example, emojis and the ™ symbol are beyond this range, but "Instachat 😜" is still an app made for the English market.
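
The boundary of the rule is easy to get wrong, so here is a small sketch that makes the threshold explicit. The `max_non_ascii` parameter and the `count_non_ascii` helper are assumptions added for illustration; with the default of 3 the behavior matches the function in the cell above.

```python
def count_non_ascii(a_string):
    """Count the characters whose code point is above 127."""
    return sum(1 for character in a_string if ord(character) > 127)

def english_only(a_string, max_non_ascii=3):
    """True when the string has at most `max_non_ascii` characters above 127."""
    return count_non_ascii(a_string) <= max_non_ascii

print(english_only('Instachat 😜'))  # 1 character outside the range
print(english_only('😜😜😜'))         # exactly 3, still accepted
print(english_only('😜😜😜😜'))       # 4, rejected
```

Exposing the threshold as a parameter makes it simple to test how sensitive the final row counts are to that choice.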

Below, we define new lists called `apple_english` and `google_english` to store only the apps targeted to the English market. We loop through the cleaned data sets (without duplicate data) and use the function `english_only` to append the entire row for app names that return a `True` value. 

In [12]:
apple_english = []
google_english = []

for app in android_clean:
    if english_only(app[0]):
        google_english.append(app)

for app in apple_clean:
    if english_only(app[2]):
        apple_english.append(app)

explore_data(apple_english, 0, 3, True)
print('\n')
explore_data(google_english, 0, 3, True)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows:  6181
Number of columns:  17




['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0

##### Free Apps

The last step in data cleaning would be to satisfy the last condition for the apps that the company develops. Recall we only develop free apps - so we need to define two new lists called `final_apple` and `final_google` that will only contain the free apps from `apple_english` and `google_english`.

Below, we isolate the free apps for both data sets and append them to our final lists. We are then left with 3,220 iOS apps and 8,864 Android apps.
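
The two data sets store prices as strings, so a single helper can make the "is this app free?" test uniform. This is a sketch only: `is_free` is a hypothetical name, and the handling of a leading `$` is an assumption about how paid prices might be written, not something shown in the output above.

```python
def is_free(price_text):
    """True when a price string such as '0', '0.0' or '$0.00' equals zero."""
    # Assumption: a paid price may carry a leading dollar sign.
    return float(price_text.replace('$', '')) == 0

print(is_free('0'))
print(is_free('3.99'))
```

With such a helper, both loops could share one condition, for example `if is_free(app[5])` for Apple and `if is_free(app[7])` for Google.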

In [13]:
final_apple = []
final_google = []

for app in apple_english:
    value = float(app[5])
    if value == 0: #converted to float
        final_apple.append(app)    

for app in google_english:
    value = app[7]
    if value == '0': #retained string property for simpler use
        final_google.append(app)
        
explore_data(final_apple, 0, 3, True)
explore_data(final_google, 0, 3, True)

['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


Number of rows:  3220
Number of columns:  17


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '21564

#### Data Analysis

We then create two functions, `display_table` and `freq_table`. The first function arranges and displays the values of a frequency table in decreasing order; it was provided by Data Quest as a template. The second function generates a frequency table of percentages for any data set and column index.

In [14]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

def freq_table(dataset, index):
    freq_table = {}
    count = 0
    for app in dataset:
        category = app[index]
        count += 1
        if category in freq_table:
            freq_table[category] += 1
        else:
            freq_table[category] = 1
            
    freq_table_percentages = {}
    for value in freq_table:
        percentage = freq_table[value] / count
        freq_table_percentages[value] = percentage * 100
    
    return freq_table_percentages

print("Apple Primary Genre: Frequency Table")
display_table(final_apple, -5)
print("\n")
print("Google Category: Frequency Table")
display_table(final_google, 1)
print("\n")
print("Google Genres: Frequency Table")
display_table(final_google, -4)

Apple Primary Genre: Frequency Table
Games : 58.13664596273293
Entertainment : 7.888198757763975
Photo & Video : 4.968944099378882
Education : 3.6645962732919255
Social Networking : 3.291925465838509
Shopping : 2.608695652173913
Utilities : 2.515527950310559
Sports : 2.142857142857143
Music : 2.049689440993789
Health & Fitness : 2.018633540372671
Productivity : 1.7391304347826086
Lifestyle : 1.5838509316770186
News : 1.3354037267080745
Travel : 1.2422360248447204
Finance : 1.1180124223602486
Weather : 0.8695652173913043
Food & Drink : 0.8074534161490683
Reference : 0.5590062111801243
Business : 0.5279503105590062
Book : 0.43478260869565216
Navigation : 0.18633540372670807
Medical : 0.18633540372670807
Catalogs : 0.12422360248447205


Google Category: Frequency Table
FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.5311371841

We can see that the Top 5 Primary Genres of Apple are: Games, Entertainment, Photo & Video, Education, and Social Networking. The Top 5 Categories of Google are: Family, Game, Tools, Business, and Lifestyle. The Top 5 Genres of Google are: Tools, Entertainment, Education, Business, and Productivity.
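
Rather than reading the Top 5 off the printed tables, it can be pulled out of a percentage dictionary directly. In this sketch, `top_n` is a hypothetical helper, and `apple_shares` holds a handful of the Apple percentages from the output above, rounded to two decimals.

```python
def top_n(table, n=5):
    """Return the n keys with the largest values, in decreasing order."""
    ranked = sorted(table.items(), key=lambda item: item[1], reverse=True)
    return [key for key, _ in ranked[:n]]

# A few Apple genre percentages, rounded from the printed table.
apple_shares = {
    'Shopping': 2.61,
    'Education': 3.66,
    'Games': 58.14,
    'Social Networking': 3.29,
    'Entertainment': 7.89,
    'Photo & Video': 4.97,
}

apple_top = top_n(apple_shares)
print(apple_top)
```

On the real data the dictionary would come from `freq_table(final_apple, -5)`.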

To see what's driving the numbers for these categories, we explore the apps behind the Top 5 categories. Below, we print up to 10 apps per Top 5 genre that have more than 100,000 ratings in the App Store data. For Google Play, we print up to 10 apps per Top 5 category and genre that have more than 1,000,000 reviews. We use a higher threshold for Google Play because its user base is larger than Apple's.

In [15]:
print('Apple Store Apps per Genre')
apple_top5 = ['Games', 'Entertainment', 'Photo & Video', 'Education', 'Social Networking']
for genre in apple_top5:
    counter = 0
    for app in final_apple:
        genre_app = app[-5]
        if genre_app == genre:
            name_app = app[2]
            num_reviews = float(app[6])
            if num_reviews > 100000 and counter < 10:
                print(genre, ":", name_app)
                counter += 1
print('\n')
print('Google Play Apps per Category')
google_cat_top5 = ['FAMILY', 'GAME', 'TOOLS', 'BUSINESS', 'LIFESTYLE']
for cat in google_cat_top5:
    counter = 0
    for app in final_google:
        cat_app = app[1]
        if cat_app == cat:
            name_app = app[0]
            num_reviews = float(app[3])
            if num_reviews > 1000000 and counter < 10:
                print(cat, ":", name_app)
                counter += 1
print('\n')
print('Google Play Apps per Genre')
google_gen_top5 = ['Tools', 'Entertainment', 'Education', 'Business', 'Productivity']
for gen in google_gen_top5:
    counter = 0
    for app in final_google:
        gen_app = app[-4]
        if gen_app == gen:
            name_app = app[0]
            num_reviews = float(app[3])
            if num_reviews > 1000000 and counter < 10:
                print(gen, ":", name_app)
                counter += 1

Apple Store Apps per Genre
Games : Blackjack by MobilityWare
Games : PAC-MAN
Games : Beer Pong Game
Games : Sky Burger - Build & Match Food Free
Games : Angry Birds
Games : Glow Hockey 2 FREE
Games : Solitaire
Games : ▻Sudoku
Games : Spider Solitaire Free by MobilityWare
Games : Fruit Ninja®
Entertainment : Fandango Movies - Times + Tickets
Entertainment : Mad Libs
Entertainment : TRUTH or DARE!!! - FREE
Entertainment : IMDb Movies & TV - Trailers and Showtimes
Entertainment : Netflix
Entertainment : Twitch
Entertainment : Action Movie FX
Entertainment : Colorfy: Coloring Book for Adults
Photo & Video : Instagram
Photo & Video : Snapchat
Photo & Video : Pic Collage - Picture Editor & Photo Collage Maker
Photo & Video : YouTube - Watch Videos, Music, and Live Streams
Photo & Video : musical.ly - your video social network
Photo & Video : Funimate video editor: add cool effects to videos
Education : Guess My Age  Math Magic
Education : Duolingo - Learn Spanish, French and more
Social Net

We will treat a genre or category that prints the full 10 apps above the threshold (100,000 ratings for Apple, 1,000,000 reviews for Google) as a saturated market. Given this condition, the Apple genres still ripe for a new app would be Entertainment, Photo & Video, and Education. For Google, the genres and categories that can accommodate a new app would be Business, Lifestyle, and Education.
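
The saturation rule can be expressed as a count of popular apps per genre instead of a printout capped at 10. The sketch below is illustrative: `count_popular` is a hypothetical helper, and the toy rows use a simplified `[name, genre, count]` layout with values taken from the sample rows shown earlier.

```python
def count_popular(dataset, genre_index, count_index, threshold):
    """Count rows per genre whose rating or review count exceeds `threshold`."""
    totals = {}
    for row in dataset:
        if float(row[count_index]) > threshold:
            genre = row[genre_index]
            totals[genre] = totals.get(genre, 0) + 1
    return totals

# Simplified rows: [name, genre, number of ratings].
toy_rows = [
    ['PAC-MAN Premium', 'Games', '21292'],
    ['Evernote - stay organized', 'Productivity', '161065'],
    ['WeatherBug - Local Weather, Radar, Maps, Alerts', 'Weather', '188583'],
    ['eBay: Best App to Buy, Sell, Save! Online Shopping', 'Shopping', '262241'],
]

popular = count_popular(toy_rows, 1, 2, 100000)
print(popular)

# A genre would count as saturated once it reaches 10 popular apps.
saturated = [genre for genre, n in popular.items() if n >= 10]
print(saturated)
```

For the real data the call would look like `count_popular(final_apple, -5, 6, 100000)`.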

From here, we have reason to believe that the best point of entry would be an app that would be categorized under "Education". Both platforms have ample space to welcome a new entry to this category.

In [16]:
# Average number of ratings (column 6) per App Store genre
apple_genres_freq = freq_table(final_apple, -5)
for genre in apple_genres_freq:
    total = 0
    len_genre = 0
    for app in final_apple:
        genre_app = app[-5]
        if genre_app == genre:
            add_ratings = float(app[6])
            total += add_ratings
            len_genre += 1
    ave_number = total/len_genre
    print(genre, ":", ave_number)

Productivity : 21028.410714285714
Weather : 52279.892857142855
Shopping : 26919.690476190477
Reference : 74942.11111111111
Finance : 31467.944444444445
Music : 57326.530303030304
Utilities : 18684.456790123455
Travel : 28243.8
Social Networking : 71548.34905660378
Sports : 23008.898550724636
Health & Fitness : 23298.015384615384
Games : 22812.92467948718
Food & Drink : 33333.92307692308
News : 21248.023255813954
Book : 39758.5
Photo & Video : 28441.54375
Entertainment : 14029.830708661417
Business : 7491.117647058823
Lifestyle : 16485.764705882353
Education : 7003.983050847458
Navigation : 86090.33333333333
Medical : 612.0
Catalogs : 4004.0


Here, we can observe that "Navigation", "Social Networking", "Reference", and "Weather" average roughly 50,000 to 86,000 ratings per app, which we use as a proxy for the number of users. These averages are heavily skewed by apps that have more than a hundred thousand ratings. For example, the Navigation average is inflated by Waze and Google Maps, and the Social Networking average is driven by social media giants like Facebook, Instagram, and Twitter.

As we identified earlier though, we can enter the 3 categories not yet saturated - Photo & Video, Entertainment, and Education. Photo & Video has the highest potential user base, but from cross-analysis between Apple and Google, we see that this category may not thrive in the Google Play market. Our best bet would be to enter the Education genre and leverage its potential to be considered as an app under the Reference genre as well.
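
To illustrate how a single giant app can skew an average, here is a toy comparison of the mean and the median. The numbers are hypothetical and chosen only to show the effect; they are not taken from either data set.

```python
from statistics import mean, median

# Hypothetical rating counts for one genre: five small apps and one giant.
ratings = [1_200, 900, 2_500, 1_800, 600, 500_000]

average = mean(ratings)
middle = median(ratings)

print('mean:', average)
print('median:', middle)
```

The mean lands far above what a typical app in this toy genre receives, while the median stays close to it, which is why per-genre averages should be read with the largest apps in mind.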

Below, we further explore this idea by analyzing the number of installs that Google Play has for each category.

In [17]:
google_category = freq_table(final_google, 1)
for category in google_category:
    total = 0
    len_category = 0
    for app in final_google:
        category_app = app[1]
        if category == category_app:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
    average_installs = total / len_category
    if 10000000 > average_installs > 1000000:
        print(category, ": ", average_installs)

ART_AND_DESIGN :  1986335.0877192982
BOOKS_AND_REFERENCE :  8767811.894736841
BUSINESS :  1712290.1474201474
EDUCATION :  1833495.145631068
FINANCE :  1387692.475609756
FOOD_AND_DRINK :  1924897.7363636363
HEALTH_AND_FITNESS :  4188821.9853479853
HOUSE_AND_HOME :  1331540.5616438356
LIFESTYLE :  1437816.2687861272
FAMILY :  3695641.8198090694
SHOPPING :  7036877.311557789
SPORTS :  3638640.1428571427
PERSONALIZATION :  5201482.6122448975
WEATHER :  5074486.197183099
NEWS_AND_MAGAZINES :  9549178.467741935
MAPS_AND_NAVIGATION :  4056941.7741935486


We leave out the categories whose average is below 1,000,000 or above 10,000,000 installs. We consider those below 1M to be a small potential market and those above 10M a saturated market. Given that we have Education as the primary genre from the Apple Store, we must find categories in Google Play that could be considered under that genre.

From our results, we can see that Books and Reference, Business, Education, and News and Magazines are the possible categories that overlap between the two markets. Given this, we can explore the current apps that belong to these categories in the Google Play data set.
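
The install counts are stored as text such as '10,000+', so they need to be converted before averaging. The sketch below isolates that step and the 1M to 10M band check; `parse_installs` and `in_target_band` are hypothetical helper names, and the band limits are the ones stated above.

```python
def parse_installs(text):
    """Convert an install string such as '10,000+' to a float."""
    return float(text.replace('+', '').replace(',', ''))

def in_target_band(average, low=1_000_000, high=10_000_000):
    """True when the average lies strictly between the two limits."""
    return low < average < high

print(parse_installs('10,000+'))
print(in_target_band(1833495.145631068))  # EDUCATION average from the output
```

Note that '10,000+' means "at least 10,000", so the parsed value is a lower bound on the true number of installs.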

In [18]:
print('\n')
print('Google Play Apps per Category')
google_candidates = ['BOOKS_AND_REFERENCE', 'BUSINESS', 'EDUCATION', 'NEWS_AND_MAGAZINES']
for cat in google_candidates:
    counter = 0
    for app in final_google:
        cat_app = app[1]
        if cat_app == cat:
            name_app = app[0]
            num_reviews = float(app[3])
            if num_reviews > 250000 and counter < 10:
                print(cat, ":", name_app)
                counter += 1



Google Play Apps per Category
BOOKS_AND_REFERENCE : Wikipedia
BOOKS_AND_REFERENCE : Google Play Books
BOOKS_AND_REFERENCE : Bible
BOOKS_AND_REFERENCE : Amazon Kindle
BOOKS_AND_REFERENCE : Wattpad 📖 Free Books
BOOKS_AND_REFERENCE : Al Quran Indonesia
BOOKS_AND_REFERENCE : Al'Quran Bahasa Indonesia
BOOKS_AND_REFERENCE : Quran for Android
BOOKS_AND_REFERENCE : Audiobooks from Audible
BOOKS_AND_REFERENCE : Dictionary.com: Find Definitions for English Words
BUSINESS : Indeed Job Search
BUSINESS : Uber Driver
BUSINESS : OfficeSuite : Free Office + PDF Editor
BUSINESS : MyASUS - Service Center
BUSINESS : Tiny Scanner - PDF Scanner App
BUSINESS : Vault-Hide SMS,Pics & Videos,App Lock,Cloud backup
BUSINESS : Facebook Pages Manager
BUSINESS : File Commander - File Manager/Explorer
EDUCATION : Math Tricks
EDUCATION : English with Lingualeo
EDUCATION : Learn English with Wlingua
EDUCATION : Learn languages, grammar & vocabulary with Memrise
EDUCATION : SoloLearn: Learn to Code for Free
NEWS_AND_

From here, we can see that the Education category has the largest potential for a new entry, because its market is the least saturated. Given the category's weak performance on Apple - an average of only about 7,000 ratings per app - the app should also be able to qualify under Reference, which averages roughly 10x as many ratings as Education.

My recommendation, given this analysis, is that we produce an app that educates its users on a specific field of study, so it can be categorized as a reference as well. The field of study could be psychology, computer science, UX/UI design, or data science - whichever skill users are currently seeking most.