# App Profile Recommendation

## Introduction of project
---
In this project, we work as data analysts for a company that builds Android and iOS mobile apps. The company only builds apps that are free to download and install, so its main source of revenue is in-app ads. This means that the revenue for any given app is mostly determined by the number of users who use it.

## Goal of the project
---
The goal of this project is to analyze data to help the developers understand what types of apps are likely to attract more users, so that they can focus on developing those types of apps to maximise revenue for the company.

As of September 2018, there were approximately 2 million apps each on Google Play and the App Store. Collecting data for over 4 million apps would require a significant amount of time and money, so we will analyze a sample of the data instead. Luckily, two existing data sets available online suit our goals.

## Opening and exploring the data
---
We will first open the CSV files we obtained from Kaggle. The datasets can be obtained [here](https://www.kaggle.com/lava18/google-play-store-apps) (Google Play) and [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) (App Store). We then store each file as a nested list (a list of rows, where each row is a list of strings) for easier data extraction later on. To make the analysis easier, we store the data without its header row (the column names) in the variables `ios_data` and `android_data`.

In [33]:
open_file1 = open('AppleStore.csv', encoding="utf8")
open_file2 = open('googleplaystore.csv', encoding="utf8")
from csv import reader
read_file1 = reader(open_file1)
read_file2 = reader(open_file2)
ios = list(read_file1) 
ios_data = ios[1:] # excluding header
android = list(read_file2)
android_data = android[1:] # excluding header
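
The cell above leaves both file handles open. As a minimal sketch (the helper name `load_csv` is hypothetical and not part of the original notebook), the same loading step could be wrapped in a function that closes the file automatically and returns the header separately from the data rows:

```python
from csv import reader

def load_csv(path, encoding="utf8"):
    """Return (header, rows) for the CSV file at `path`.

    Hypothetical helper: each row is a list of strings, as in the cell above.
    """
    # The with-statement closes the file even if reading fails.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(reader(f))
    return rows[0], rows[1:]
```

Assuming the two Kaggle files sit in the working directory, `header_ios, ios_data = load_csv('AppleStore.csv')` would give the same `ios_data` list as above.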

To explore the datasets, we will first write a function named `explore_data` so that we can repeatedly print rows in a more readable manner. The function also allows us to find out the number of rows and columns in each dataset.

In [4]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new empty line after each row
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


            


Below, we have printed the column names of the AppStore dataset as well as the first 3 rows in the data set. We also managed to find out that there are 7197 apps in this AppStore dataset. Certain data such as prime genre, price, rating count and user rating are particularly important for our analysis later on. For a better description of the column names, you can access the documentation [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps). 

In [5]:
header_ios = ios[0]
print('Column names (ios) = '+ str(header_ios)) 
print('\n')
explore_data(ios_data, 0, 3, True)

Column names (ios) = ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


Similarly, there are 10841 apps in the Google Play data set. Columns such as rating, category, installs and genres look particularly useful for our analysis later on.

In [6]:
header_android = android[0]
print('Column names (android) = '+ str(header_android))
print('\n')
explore_data(android_data, 0, 3, True)

Column names (android) = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


## Data Cleaning
---
Since we are building apps that are free to download and are directed towards an English-speaking audience, we will have to clean our data by removing apps that are non-English or paid. We will also have to remove duplicate data and correct erroneous data based on feedback from the general community.

### Deleting wrong data

There is a discussion thread on [Kaggle](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) about an error in row 10472 of the Google Play data set. A user reported that the entry for the "Category" column is missing and that a column shift has occurred for the app "Life Made WI-Fi Touchscreen Photo Frame". To verify the error, we print row 10472 below.

In [7]:
print(android_data[10472])
print('Number of columns with data entered in row 10472 = ' + str(len(android_data[10472])))
print('Number of columns in the whole Google Play dataset = ' + str(len(android_data[0])))
del android_data[10472] # we are deleting that row
print('Number of remaining rows = ' + str(len(android_data)))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Number of columns with data entered in row 10472 = 12
Number of columns in the whole Google Play dataset = 13
Number of remaining rows = 10840


As seen above, the row has data in only 12 columns when there should be 13. The missing value seems to be from the "Category" column. Since the missing value would compromise our analysis, we remove the row.
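
The same length check can be applied to every row rather than to one reported index. The sketch below is only an illustration: the function name `find_malformed_rows` and the sample rows are made up for this example.

```python
def find_malformed_rows(rows, expected_columns):
    """Return the indexes of rows whose length differs from expected_columns."""
    return [i for i, row in enumerate(rows) if len(row) != expected_columns]

# Tiny made-up sample: the second row is missing one value.
sample = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '4.5'],
    ['App C', 'GAME', '3.9'],
]
print(find_malformed_rows(sample, 3))  # [1]
```

On the real data, the expected column count would be `len(header_android)`. If several rows were reported, deleting them from the highest index to the lowest would keep the remaining indexes valid.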

### Duplicated data

We have also noticed that the Google Play data set has duplicate entries. As seen below, by defining a function `duplicate_entries`, we find that the Google Play dataset contains 1181 duplicate entries (rows whose app name has already appeared earlier in the data).



In [8]:
app_names = []
duplicated_apps = []
def duplicate_entries(apps_data, index):
    for rows in apps_data:
        app_name = rows[index]
        if app_name in app_names:
            duplicated_apps.append(app_name)
        else:
            app_names.append(app_name)

duplicate_entries(android_data, 0) # 0 is the index of the first column (the app name)
print('Number of Google Play apps that are duplicated = ' + str(len(duplicated_apps)))
print('\n')
print('Examples of duplicated app : ' + str(duplicated_apps[:11]))

Number of Google Play apps that are duplicated = 1181


Examples of duplicated app : ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic']
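
The function above relies on two global lists and checks membership in a list for every row. As an alternative sketch (not the notebook's method; `count_duplicates` and the sample rows are hypothetical), `collections.Counter` can produce the same count of surplus rows in one pass:

```python
from collections import Counter

def count_duplicates(rows, name_index):
    """Return (number of surplus rows, names that occur more than once)."""
    counts = Counter(row[name_index] for row in rows)
    repeated = {name: n for name, n in counts.items() if n > 1}
    # Every occurrence after the first one counts as a surplus row.
    surplus = sum(n - 1 for n in repeated.values())
    return surplus, repeated

# Made-up sample: 'Box' appears three times, so two rows are surplus.
sample = [['Box'], ['Slack'], ['Box'], ['Box'], ['Zoom']]
surplus, repeated = count_duplicates(sample, 0)
print(surplus)   # 2
print(repeated)  # {'Box': 3}
```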


To ensure that the function is working properly, we will be printing a few duplicate rows to confirm. For example, one of the duplicated apps is "Slack". Below, we have printed some rows that have the app name "Slack".

In [9]:
def print_duplicated_app(apps_data, app_name, index):
    for rows in apps_data:
        if rows[index] == app_name:
            print(rows)
            print('\n')

print_duplicated_app(android_data, 'Slack', 0)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']




All three rows are nearly identical, except for a slight variation in the 4th column, "Reviews". A higher number of reviews suggests that the row was collected more recently. Hence, we use it as the criterion for removing duplicates, keeping only the row with the highest number of reviews for each app. To do so, we first create a dictionary that stores each app name and its corresponding maximum number of reviews.

In [10]:
reviews_max = {} # creating a dictionary
for rows in android_data:
    name = rows[0]
    n_reviews = float(rows[3])
    if name not in reviews_max:
        reviews_max[name] = n_reviews
    elif n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews # keep only the maximum number of reviews seen so far for this app name
print('Number of entries after clearing duplicates and keeping highest reviews = ' + str(len(reviews_max)))
print('Expected number of entries = ' + str(len(app_names))) # from the code above, `app_names` is the list of app names without duplicates

Number of entries after clearing duplicates and keeping highest reviews = 9659
Expected number of entries = 9659


The code above lets us keep track of the maximum number of reviews for each app, as well as the number of entries we should have after removing the duplicates. The removal itself takes place in the code below. We create two empty lists: `android_clean`, a nested list that will store the 9659 rows of data, and `already_added`, which prevents us from storing an app twice when several of its rows share the same highest number of reviews. After cleaning, there are indeed 9659 rows of non-duplicated apps in the `android_clean` dataset.

In [11]:
android_clean = []
already_added = []
for rows in android_data:
    name = rows[0]
    n_reviews = float(rows[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(rows)
        already_added.append(name)
print('Number of entries in the cleaned Google Play dataset = ' + str(len(android_clean)))

Number of entries in the cleaned Google Play dataset = 9659
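
The two passes above (building `reviews_max`, then filtering) can also be folded into a single pass that stores the best row per app directly. This is only a sketch of an alternative, with a hypothetical function name and made-up sample rows. It keeps the same row per app as the cells above, although the order of the result follows the first appearance of each app name.

```python
def keep_most_reviewed(rows, name_index, reviews_index):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_index]
        # Strict '>' means the earliest row wins when review counts are tied.
        if name not in best or float(row[reviews_index]) > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up sample in the shape [name, reviews].
sample = [['Slack', '51507'], ['Box', '100'], ['Slack', '51510'], ['Slack', '51507']]
print(keep_most_reviewed(sample, 0, 1))  # [['Slack', '51510'], ['Box', '100']]
```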


In comparison, only 2 app names are duplicated in the App Store data set: "Mannequin Challenge" and "VR Roller Coaster".

In [12]:
app_names = []
duplicated_apps = []
duplicate_entries(ios_data, 1)
print('Number of App Store apps that are duplicated = ' + str(len(duplicated_apps)))
print('\n')
print(duplicated_apps)

Number of App Store apps that are duplicated = 2


['Mannequin Challenge', 'VR Roller Coaster']


In [13]:
print_duplicated_app(ios_data, 'Mannequin Challenge', 1)
print_duplicated_app(ios_data, 'VR Roller Coaster', 1)


['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']


['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']


['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']


['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']




Since there are only 2 duplicated apps, we can delete them manually instead of repeating the process above. The total rating count of the first 'Mannequin Challenge' app is 668, compared to 105 for the second, so we delete the second one. Similarly, comparing the rating counts for the 'VR Roller Coaster' app (107 versus 67), we delete the second one. Since the id number of each row is unique, we can use the `list.index` method to find the index of each duplicated row and then delete it. After deleting the duplicates, the number of rows left in the `ios_clean` dataset is 7195.

In [14]:
id_number_list = []
for rows in ios_data:
    id_number = rows[0]
    id_number_list.append(id_number)
    
#print(id_number_list.index('1178454060')) # for identifying the index of the duplicated 'Mannequin Challenge' app
del ios_data[4463]
#print(id_number_list.index('1089824278')) # for identifying the index of the duplicated 'VR Roller Coaster' app
del ios_data[4830]

app_names = []
duplicated_apps = []
duplicate_entries(ios_data, 1)
print('Number of iOS apps that are duplicated = ' + str(len(duplicated_apps)))
print('\n')
ios_clean = ios_data # alias under a clearer name for later use (no copy is made)
print('Number of entries in the cleaned App Store dataset = ' + str(len(ios_clean)))


Number of iOS apps that are duplicated = 0


Number of entries in the cleaned App Store dataset = 7195
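
Hard-coded positions such as the ones used above depend on the row order and shift after every `del`, so re-running the cell would remove different rows. A sketch of an alternative that filters by the unique id instead is shown below. The function name `drop_rows_by_id` is hypothetical and the sample rows are shortened to `[id, name]`.

```python
def drop_rows_by_id(rows, ids_to_drop, id_index=0):
    """Return a new list without the rows whose id is in ids_to_drop."""
    ids_to_drop = set(ids_to_drop)
    return [row for row in rows if row[id_index] not in ids_to_drop]

# Shortened sample rows using the ids printed above.
sample = [
    ['1173990889', 'Mannequin Challenge'],
    ['1178454060', 'Mannequin Challenge'],
    ['952877179', 'VR Roller Coaster'],
    ['1089824278', 'VR Roller Coaster'],
]
cleaned = drop_rows_by_id(sample, ['1178454060', '1089824278'])
print(len(cleaned))  # 2
```

Because the function returns a new list, running it twice gives the same result.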


### Removing non-English apps

One way to go about this is to remove apps whose names contain symbols not commonly used in English text. English text usually includes letters from the English alphabet, digits from 0 to 9, punctuation marks and other symbols (+, *, /).

Each character in a string has a corresponding number (its code point). For example, the character 'a' is 97 while the character 'A' is 65. The numbers corresponding to the characters commonly used in English text all fall in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system.

With this in mind, we can build a function that detects whether an app name is English, using the built-in `ord` function, which returns the number corresponding to a character.

In [15]:
def language_check(a_string):
    for character in a_string:
        if ord(character) > 127:
            return False
    return True

# We will test out whether the above function works
print(language_check('Instagram'))
print(language_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(language_check('Docs To Go™ Free Office Suite'))
print(language_check('Instachat 😜'))
        


True
False
False
False


As seen above, the function misclassifies some English app names such as "Docs To Go™ Free Office Suite" and "Instachat 😜". This is because emojis and characters like '™' fall outside the ASCII range and have corresponding numbers above 127. To minimize data loss, we will only remove an app if its name contains more than 3 characters outside the ASCII range. This filter is not perfect, but it should still be fairly effective. The modified function below classifies all three test names as intended.

In [16]:
def language_check1 (a_string):
    num_of_non_english_char = 0
    for character in a_string:
        if ord(character) > 127:
            num_of_non_english_char += 1
        if num_of_non_english_char > 3:
            return False
    return True

# To test out the code,

print(language_check1('Docs To Go™ Free Office Suite'))
print(language_check1('Instachat 😜'))
print(language_check1('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False
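
The threshold of 3 is a judgment call, so it can help to make it a parameter. The sketch below is a compact variant of the check above, with a hypothetical name (`is_probably_english`) and a hypothetical parameter (`max_non_ascii`). It counts the non-ASCII characters and compares the count with the threshold.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True when `name` has at most `max_non_ascii` characters above code point 127."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Docs To Go™ Free Office Suite'))    # True
print(is_probably_english('Instachat 😜'))                      # True
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))      # False
print(is_probably_english('Instachat 😜', max_non_ascii=0))     # False
```

Setting `max_non_ascii=0` reproduces the strict behaviour of the first `language_check` function.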


By modifying the function to take 3 parameters (`apps_data`, `empty_list` and `index`), we can combine the verification of app names and the extraction of the English apps into one function. It is again named `language_check1` and replaces the previous definition.

In [17]:
english_android_data = []
english_ios_data = []
def language_check1 (apps_data, empty_list, index):
    for rows in apps_data:
        name = rows[index]
        num_of_non_english_char = 0
        for character in name:
            if ord(character) > 127:
                num_of_non_english_char += 1
        if num_of_non_english_char <= 3:
            empty_list.append(rows)
        
language_check1(android_clean, english_android_data, 0)
language_check1(ios_clean, english_ios_data, 1) # 1 is the index of the column that holds the app name

print('Number of rows remaining for Google Play dataset = ' + str(len(english_android_data)))
print('Number of rows remaining for App Store dataset = ' + str(len(english_ios_data)))

Number of rows remaining for Google Play dataset = 9614
Number of rows remaining for App Store dataset = 6181


After removing the non-English apps from each dataset, we can see that there are 9614 English apps left in the Google Play dataset and 6181 English apps left in the App Store dataset.

### Isolating the free apps

This is the last step of our data cleaning process: extracting only the apps that are free. We are left with 8864 Google Play apps and 3220 App Store apps for analysis.

In [18]:
final_android_dataset = []
for rows in english_android_data:
    price = rows[7]
    if price == '0':
        final_android_dataset.append(rows)
print('Number of Google Play apps remaining = ' + str(len(final_android_dataset)))

final_ios_dataset = []
for rows in english_ios_data:
    price = rows[4]
    if price == '0.0':
        final_ios_dataset.append(rows)
print('Number of App Store apps remaining = ' + str(len(final_ios_dataset)))

Number of Google Play apps remaining = 8864
Number of App Store apps remaining = 3220
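
The two loops above compare the price against two different strings (`'0'` and `'0.0'`). A single numeric check could cover both stores. The sketch below is an assumption-laden illustration: the name `is_free` is hypothetical, and stripping a leading `'$'` assumes that paid prices may carry a currency symbol, which the rows printed in this notebook do not show.

```python
def is_free(price_text):
    """Return True when the price string represents zero."""
    # Assumption: paid prices may look like '$4.99'; free ones like '0' or '0.0'.
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0.0

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```

An empty or non-numeric price would raise a `ValueError` here, which makes malformed rows visible instead of silently dropping them.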


## Analysis of apps
---
In order to minimize risk and overhead, our validation strategy for an app idea comprises three steps:

1) Build a minimal Android version of the app and add it to Google Play.

2) If the app has good responses from users, we will develop it further and improve upon it.

3) If the app is profitable after 6 months, we will build an iOS version of the app and add it to the App Store.

Therefore, we need to find an app profile that is successful in both markets. We begin the analysis by finding the most common genres in each market.

### Most common genres

To find out the most common genres of the apps in both stores, certain columns are of particular importance to us: the `prime_genre` column of the App Store data, and the `Category` and `Genres` columns of the Google Play data. By building a frequency table function, we can determine the percentage of apps that belong to each genre. The `freq_table` function can be used to analyze any column. It takes 3 parameters: the apps data, the index of the column, and the (initially empty) dictionary that will store the frequency table.


In [19]:
def freq_table(apps_data, index, empty_dict):
    for rows in apps_data:
        genres = rows[index]
        if genres not in empty_dict:
            empty_dict[genres] = 1
        else:
            empty_dict[genres] += 1
# To display the frequency table in percentages:
    for key in empty_dict:
        empty_dict[key] /= len(apps_data)
        empty_dict[key] *= 100
        
    return empty_dict

A frequency table stored in a dictionary is not sorted by frequency, which makes it difficult to analyze. We need a second function to display the entries in descending order.

The `display_table` function first converts the dictionary to a list of tuples stored in `table_list`, with the percentage as the first element of each tuple, so that the built-in `sorted()` function orders the entries by percentage.

In [20]:
# converting dictionary to tuples for the built in "sorted()" function to work

def display_table(dict_type):
    table_list = []
    for key in dict_type:
        dict_as_tuple = (dict_type[key], key) # the percentage must be the first value of every tuple for the "sorted()" function to work
        table_list.append(dict_as_tuple)
        
# to sort the percentages in descending order, we need to set the "reverse" parameter to True        
    
    sorted_table = sorted(table_list, reverse = True)
    for entry in sorted_table:
        print(entry[1], ':', entry[0]) # this step is to ensure that the key (genre type) is infront while the percentage is behind
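
As an alternative sketch (not the notebook's implementation), the two functions above can be written without mutating a dictionary passed in by the caller. The names `freq_table_percent` and `sorted_table` and the sample rows are hypothetical. `collections.Counter` does the counting, and the `key` argument of `sorted()` removes the need to swap the tuple order.

```python
from collections import Counter

def freq_table_percent(rows, index):
    """Return {value: percentage of rows} for one column."""
    counts = Counter(row[index] for row in rows)
    total = len(rows)
    return {value: count / total * 100 for value, count in counts.items()}

def sorted_table(table):
    """Return (value, percentage) pairs, highest percentage first."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

# Made-up sample with a single column.
sample = [['Games'], ['Games'], ['Music'], ['Games']]
for value, pct in sorted_table(freq_table_percent(sample, 0)):
    print(value, ':', pct)
# Games : 75.0
# Music : 25.0
```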
        

### Analyzing the frequency table

We have first generated the frequency table for the `prime_genre` column of the App Store data. 

In [21]:
genre_ios = {}
freq_table(final_ios_dataset, 11, genre_ios)
display_table(genre_ios)

Games : 58.13664596273293
Entertainment : 7.888198757763975
Photo & Video : 4.968944099378882
Education : 3.6645962732919255
Social Networking : 3.291925465838509
Shopping : 2.608695652173913
Utilities : 2.515527950310559
Sports : 2.142857142857143
Music : 2.049689440993789
Health & Fitness : 2.018633540372671
Productivity : 1.7391304347826086
Lifestyle : 1.5838509316770186
News : 1.3354037267080745
Travel : 1.2422360248447204
Finance : 1.1180124223602486
Weather : 0.8695652173913043
Food & Drink : 0.8074534161490683
Reference : 0.5590062111801243
Business : 0.5279503105590062
Book : 0.43478260869565216
Navigation : 0.18633540372670807
Medical : 0.18633540372670807
Catalogs : 0.12422360248447205


The most common genre among App Store apps is **Games**, followed by **Entertainment**. In fact, more than half of the free English apps in the App Store are in the **Games** genre. It can also be seen that most of the apps (around 71%) are designed for entertainment, i.e. genres such as games, photo & video, social networking and music. Fewer apps are designed for practical purposes, i.e. genres such as education, shopping, utilities and productivity. However, a large number of apps in a particular genre does not imply a large number of users, so we cannot recommend an app profile based on this frequency table alone.

Next, we analyse the "Genres" and "Category" columns of the Google Play dataset.

In [22]:
# Genres column
genre_android = {}
freq_table(final_android_dataset, 9, genre_android)
display_table(genre_android)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

In [23]:
# Category column
category_android = {}
freq_table(final_android_dataset, 1, category_android)
display_table(category_android)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

Comparing the **Genres** and **Category** columns of the Android dataset, we notice that the Genres column is more granular and has many sub-categories. Since we are analysing from a broad perspective, the Category column is more suitable for our project.

From the Category column, we can see that the most common category is "Family", followed by "Game" and then "Tools". At first glance, Google Play has a wider variety of apps, with a more even share of practical and entertainment apps. However, further research into the "Family" apps shows that this category mainly comprises games for children. Thus, nearly 29% of the apps on Google Play can be considered **Games** apps. Nevertheless, this is still much lower than in the App Store, which suggests that Google Play has a more balanced variety of apps and is not dominated by a single genre. Next, we investigate which genres have the most users.

### Most popular apps by genre on App Store

To find out which genres are the most popular (have the most users), we can compute the average number of installs for each app genre. For the Google Play dataset, this information can be found in the **Installs** column. For the App Store dataset, however, this information is not available, so we use the total number of user ratings per app as a proxy instead. This information can be found in the **rating_count_tot** column.


In [24]:
genre_ios = {}
freq_table(final_ios_dataset, 11, genre_ios)
for genre in genre_ios:
    total = 0
    len_genre = 0
    for rows in final_ios_dataset:
        genre_app = rows[11]
        if genre_app == genre:
            num_user_rating = float(rows[5])
            total += num_user_rating
            len_genre += 1
    avg_rating = total/len_genre
    print(genre + ':' + str(avg_rating))

Social Networking:71548.34905660378
Photo & Video:28441.54375
Games:22812.92467948718
Music:57326.530303030304
Reference:74942.11111111111
Health & Fitness:23298.015384615384
Weather:52279.892857142855
Utilities:18684.456790123455
Travel:28243.8
Shopping:26919.690476190477
News:21248.023255813954
Navigation:86090.33333333333
Lifestyle:16485.764705882353
Entertainment:14029.830708661417
Food & Drink:33333.92307692308
Sports:23008.898550724636
Book:39758.5
Finance:31467.944444444445
Education:7003.983050847458
Productivity:21028.410714285714
Business:7491.117647058823
Catalogs:4004.0
Medical:612.0


As seen above, the **Navigation** genre has the highest average number of user ratings, followed by **Reference**, **Social Networking**, **Music**, **Weather**, **Book**, **Food & Drink** and **Finance**.

We have printed the apps in some of these genres for further analysis.

In [25]:
def print_genre(apps_data, genre):
    for rows in apps_data:
        name = rows[1]
        num_of_rating = rows[5]
        if rows[11] == genre:
            print(name + ':' + num_of_rating)

In [26]:
print_genre(final_ios_dataset,'Navigation')
print('\n')
print_genre(final_ios_dataset, 'Reference')
print('\n')
print_genre(final_ios_dataset, 'Social Networking')
print('\n')
print_genre(final_ios_dataset, 'Music')
print('\n')
print_genre(final_ios_dataset, 'Book')
print('\n')
print_genre(final_ios_dataset, 'Food & Drink')

Waze - GPS Navigation, Maps & Real-time Traffic:345046
Google Maps - Navigation & Transit:154911
Geocaching®:12811
CoPilot GPS – Car Navigation & Offline Maps:3582
ImmobilienScout24: Real Estate Search in Germany:187
Railway Route Search:5


Bible:985920
Dictionary.com Dictionary & Thesaurus:200047
Dictionary.com Dictionary & Thesaurus for iPad:54175
Google Translate:26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran:18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition:17588
Merriam-Webster Dictionary:16849
Night Sky:12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE):8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools:4693
GUNS MODS for Minecraft PC Edition - Mods Tools:1497
Guides for Pokémon GO - Pokemon GO News and Cheats:826
WWDC:762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free:718
VPN Express:14
Real Bike Traffic Rider Virt

### Justifying App Store app profile recommendation

1) The average number of user ratings in genres such as Reference, Social Networking, Book and Music is heavily skewed by a few well-known "giant" apps such as 'Bible', 'Dictionary.com', 'Facebook', 'Pinterest', 'Kindle', 'Pandora' and 'Spotify'. Other apps in these genres often have far fewer (<10000) ratings.

Although the Social Networking and Music genres have quite a few apps with more than 10 000 ratings, the market is saturated and there are many apps to compete with.

2) Genres such as Navigation and Weather require data from external parties, which is costly to obtain. Furthermore, apps like Google Maps and Waze are so well developed that it would be extremely difficult to compete with them, and most users will stick with these apps.

3) Finance apps often require domain knowledge as well as expensive infrastructure such as servers, so it may be costly to build a free Finance app that profits mainly from ads.

4) Food & Drink apps are dominated by chain restaurants and delivery services, and our company operates neither. However, one of the apps, 'Allrecipes Dinner Spinner', has a rather high number of user ratings (109349). Furthermore, this is a niche market, as there are not many apps that provide recipes.

Hence, I would suggest that the app profile focus on the Food & Drink genre, not on delivery or reservations, but rather on the preparation of food and drinks. Ads can be played in order to unlock a recipe or the next step in a recipe, bringing in revenue for the company.

### Most popular apps by genre on Google Play

To find the most popular app categories on Google Play, we can use the data in the **Installs** column of the dataset. However, printing the first few values of that column gives the following.

In [27]:
# First 5 rows of Installs column
for rows in final_android_dataset[:5]:
    print(rows[5])

10,000+
5,000,000+
50,000,000+
100,000+
50,000+


As seen above, the **Installs** column stores the data with extra characters such as ',' and '+', so we need to remove these characters before performing computations. Furthermore, the data is not precise: 10,000+ could mean 20,000 or 30,000. Since we only want to find out which categories have the most users, we do not need precise figures. Therefore, we will treat 10,000+ as 10,000 and 5,000,000+ as 5,000,000.
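
The conversion described above can be isolated in a small helper so that it is written only once. This is a minimal sketch with a hypothetical name (`parse_installs`). It returns an `int`, whereas the notebook cells convert to `float`, which makes no difference for the averages.

```python
def parse_installs(text):
    """Convert an Installs string such as '10,000+' to its lower bound as an int."""
    return int(text.replace(',', '').replace('+', ''))

print(parse_installs('10,000+'))     # 10000
print(parse_installs('5,000,000+'))  # 5000000
```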

Next, we adapt the code we used for the App Store dataset to remove these characters and compute the average number of installs per app in each category.

In [28]:
category_android = {}
android_category = []
freq_table(final_android_dataset, 1, category_android)
for category in category_android:
    total = 0
    len_category = 0
    for rows in final_android_dataset:
        category_app = rows[1]
        installs = rows[5]
        if category_app == category:
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            installs = float(installs)
            total += installs
            len_category += 1
    avg_installs = total/len_category
    # the step below is to arrange the average installs per category in descending order for easier analysis
    android_tuple = (avg_installs, category)
    android_category.append(android_tuple)
sorted_android = sorted(android_category, reverse = True)
for category in sorted_android:
    print(category[1] + ':' + str(category[0]))
    

COMMUNICATION:38456119.167247385
VIDEO_PLAYERS:24727872.452830188
SOCIAL:23253652.127118643
PHOTOGRAPHY:17840110.40229885
PRODUCTIVITY:16787331.344927534
GAME:15588015.603248259
TRAVEL_AND_LOCAL:13984077.710144928
ENTERTAINMENT:11640705.88235294
TOOLS:10801391.298666667
NEWS_AND_MAGAZINES:9549178.467741935
BOOKS_AND_REFERENCE:8767811.894736841
SHOPPING:7036877.311557789
PERSONALIZATION:5201482.6122448975
WEATHER:5074486.197183099
HEALTH_AND_FITNESS:4188821.9853479853
MAPS_AND_NAVIGATION:4056941.7741935486
FAMILY:3695641.8198090694
SPORTS:3638640.1428571427
ART_AND_DESIGN:1986335.0877192982
FOOD_AND_DRINK:1924897.7363636363
EDUCATION:1833495.145631068
BUSINESS:1712290.1474201474
LIFESTYLE:1437816.2687861272
FINANCE:1387692.475609756
HOUSE_AND_HOME:1331540.5616438356
DATING:854028.8303030303
COMICS:817657.2727272727
AUTO_AND_VEHICLES:647317.8170731707
LIBRARIES_AND_DEMO:638503.734939759
PARENTING:542603.6206896552
BEAUTY:513151.88679245283
EVENTS:253542.22222222222
MEDICAL:120550.6198083
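
The loop above rescans the whole dataset once per category. As a possible alternative, the same averages could be computed in a single pass by keeping running totals and counts in dictionaries. The sketch below is not the code used to produce the output above; the function name `average_installs_by_category` and the rows in `sample_rows` are made up for illustration, and the column positions (1 for category, 5 for installs) are assumed to match the Google Play dataset used here.

```python
def average_installs_by_category(dataset, category_index=1, installs_index=5):
    """Return a list of (average_installs, category) tuples, largest first."""
    totals = {}  # category -> sum of installs
    counts = {}  # category -> number of apps
    for row in dataset:
        category = row[category_index]
        installs = float(row[installs_index].replace('+', '').replace(',', ''))
        totals[category] = totals.get(category, 0.0) + installs
        counts[category] = counts.get(category, 0) + 1
    averages = [(totals[c] / counts[c], c) for c in totals]
    return sorted(averages, reverse=True)


# Made-up rows with the same column layout (name at 0, category at 1, installs at 5)
sample_rows = [
    ['App A', 'GAME', '', '', '', '1,000+'],
    ['App B', 'GAME', '', '', '', '3,000+'],
    ['App C', 'TOOLS', '', '', '', '500+'],
]

for avg, category in average_installs_by_category(sample_rows):
    print(category + ':' + str(avg))
```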

Once again, categories such as "Communication", "Video_Players", "Social", "Photography" and "Productivity" may not be as popular as they seem. They are all dominated by giant companies that are hard for us to compete with. To give a sense of these giants, I have printed the top 5 apps (by installs) in each of these categories.

In [29]:
def print_category(category, index):
    # print the `index` most-installed apps in the given category
    category_list = []
    for rows in final_android_dataset:
        name = rows[0]
        installs = rows[5]
        if rows[1] == category:
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            installs = float(installs)
            # (installs, name) tuples sort by installs first
            category_tuple = (installs, name)
            category_list.append(category_tuple)
    sorted_list = sorted(category_list, reverse=True)
    for app in sorted_list[:index]:
        print(app[1] + ":" + str(app[0]))

print_category("COMMUNICATION", 5)
print('\n')
print_category("VIDEO_PLAYERS", 5)
print('\n')
print_category("SOCIAL", 5)
print('\n')
print_category("PHOTOGRAPHY", 5)
print('\n')
print_category("PRODUCTIVITY", 5)

WhatsApp Messenger:1000000000.0
Skype - free IM & video calls:1000000000.0
Messenger – Text and Video Chat for Free:1000000000.0
Hangouts:1000000000.0
Google Chrome: Fast & Secure:1000000000.0


YouTube:1000000000.0
Google Play Movies & TV:1000000000.0
MX Player:500000000.0
VivaVideo - Video Editor & Photo Movie:100000000.0
VideoShow-Video Editor, Video Maker, Beauty Camera:100000000.0


Instagram:1000000000.0
Google+:1000000000.0
Facebook:1000000000.0
Snapchat:500000000.0
Facebook Lite:500000000.0


Google Photos:1000000000.0
Z Camera - Photo Editor, Beauty Selfie, Collage:100000000.0
YouCam Perfect - Selfie Photo Editor:100000000.0
YouCam Makeup - Magic Selfie Makeovers:100000000.0
Sweet Selfie - selfie camera, beauty cam, photo edit:100000000.0


Google Drive:1000000000.0
Microsoft Word:500000000.0
Google Calendar:500000000.0
Dropbox:500000000.0
Cloud Print:500000000.0
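
The lists above suggest that a handful of very large apps pull the category averages up. One way to check this would be to recompute the average while leaving out apps at or above some install cutoff. The sketch below is only an illustration: the function name `average_installs`, the cutoff of 100 million and the sample tuples are all assumptions, not values taken from the dataset.

```python
def average_installs(apps, cutoff=None):
    """Average installs over (installs, name) tuples.

    If cutoff is given, apps with installs >= cutoff are left out.
    Returns None when no app remains.
    """
    kept = [installs for installs, name in apps
            if cutoff is None or installs < cutoff]
    if not kept:
        return None
    return sum(kept) / len(kept)


# Made-up (installs, name) tuples in the same shape as category_list above
sample = [
    (1_000_000_000.0, 'Giant app'),
    (1_000_000.0, 'Mid-size app'),
    (500_000.0, 'Small app'),
]

print('All apps:', average_installs(sample))
print('Below 100 million installs:', average_installs(sample, cutoff=100_000_000))
```

A large gap between the two figures for a category would indicate that its average is driven by a few giants rather than by typical apps.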


The "Travel_And_Local" category consists mainly of maps and booking apps, which would once again require co-operation with external parties to build. It is also hard to compete with established apps such as "TripAdvisor ..." and "Google Earth".

In [30]:
print_category("TRAVEL_AND_LOCAL", 25)

Maps - Navigate & Explore:1000000000.0
Google Street View:1000000000.0
TripAdvisor Hotels Flights Restaurants Attractions:100000000.0
Google Earth:100000000.0
Booking.com Travel Deals:100000000.0
trivago: Hotels & Travel:50000000.0
VZ Navigator:50000000.0
MAPS.ME – Offline Map and Travel Navigation:50000000.0
2GIS: directory & navigator:50000000.0
easyJet: Travel App:10000000.0
Yelp: Food, Shopping, Services Nearby:10000000.0
Yatra - Flights, Hotels, Bus, Trains & Cabs:10000000.0
XE Currency:10000000.0
Where is my Train : Indian Railway & PNR Status:10000000.0
Skyscanner:10000000.0
PagesJaunes - local search:10000000.0
NTES:10000000.0
MakeMyTrip-Flight Hotel Bus Cab IRCTC Rail Booking:10000000.0
Live Camera Viewer ★ World Webcam & IP Cam Streams:10000000.0
KakaoMap - Map / Navigation:10000000.0
KAYAK Flights, Hotels & Cars:10000000.0
Hotels.com: Book Hotel Rooms & Find Vacation Deals:10000000.0
Goibibo - Flight Hotel Bus Car IRCTC Booking App:10000000.0
GasBuddy: Find Cheap Gas:1000000

In our earlier analysis of the percentage of apps by genre, we saw that **Game** apps account for only about 10% of the apps on Google Play, compared with 58% on the App Store. Hence, the game market appears less saturated on Google Play. Furthermore, I have printed the top 100 apps in the **Game** category, and it can be seen that a wide variety of game apps have a large number of installs. This may be due to Google Play having access to a wider market. I believe that making a free game app for Google Play may be the best way to gain a large number of installs and, in turn, a large amount of revenue from in-game ads.

In [31]:
print_category("GAME",100)

Subway Surfers:1000000000.0
Temple Run 2:500000000.0
Pou:500000000.0
My Talking Tom:500000000.0
Candy Crush Saga:500000000.0
slither.io:100000000.0
Zombie Tsunami:100000000.0
Yes day:100000000.0
Vector:100000000.0
Trivia Crack:100000000.0
Traffic Racer:100000000.0
Temple Run:100000000.0
Talking Tom Gold Run:100000000.0
Super Mario Run:100000000.0
Sonic Dash:100000000.0
Sniper 3D Gun Shooter: Free Shooting Games - FPS:100000000.0
Smash Hit:100000000.0
Skater Boy:100000000.0
Shadow Fight 2:100000000.0
Score! Hero:100000000.0
Roll the Ball® - slide puzzle:100000000.0
Pokémon GO:100000000.0
Plants vs. Zombies FREE:100000000.0
Piano Tiles 2™:100000000.0
PAC-MAN:100000000.0
My Talking Angela:100000000.0
Modern Combat 5: eSports FPS:100000000.0
Mobile Legends: Bang Bang:100000000.0
Lep's World 2 🍀🍀:100000000.0
Jetpack Joyride:100000000.0
Hungry Shark Evolution:100000000.0
Hill Climb Racing 2:100000000.0
Hill Climb Racing:100000000.0
Helix Jump:100000000.0
Glow Hockey:100000000.0
Geometry Dash

## Conclusion
---

In conclusion, we have analyzed data sets from both the App Store and Google Play. However, the app profiles we recommended for the two stores are not the same. We would therefore like to propose an app with game features (such as solving a puzzle) so that it appeals to the Google Play market. If the reception is good, we can then release it on the App Store, this time targeting **Food & Drink** app users. For example, users could unlock recipes through the game elements, so that the app appeals to both of the app profiles we have recommended.