# Profitable App Profiles for the App Store and Google Play Markets

In this project I am working for a company that builds Android and iOS mobile apps. The apps we build are free to download and install, and all revenue comes from in-app ads. This means that the revenue generated by an app is directly influenced by user engagement: the more users who see and engage with the ads, the higher the revenue.

Our goal for this project is to:

- Analyse data to help our developers understand what types of apps are likely to attract more users and therefore generate higher revenue.

## Data Retrieval 

As of September 2018, there were approximately 2 million iOS apps available on the App Store and 2.1 million Android apps on Google Play. That is too much data to analyse in full. Fortunately, smaller sample data sets have already been collected: a sample of approximately 10,000 Android apps from [here](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv), and approximately 7,000 iOS apps from [here](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

To work with the app data files I had to:

- Import the csv module so that I can use its reader() function
- Use the open() function to open the file
- Use the reader() function to generate a reader object, and assign this to a variable. The reader parses each line of the file into a list, with each element of that list holding the value from one column
- Convert the reader object into a list of lists and assign this to a variable. Each element in this list of lists is a row.
- Isolate the header row in one variable and the body of the data in another using list slicing. This makes analysis of the body easier and avoids errors caused by including the header in the analysis.


In [1]:
# Import csv module
import csv

# Apps Store data import
opened_file_1 = open('AppleStore.csv')
ios_file_reader = csv.reader(opened_file_1)
ios_data = list(ios_file_reader)
ios_data_header = ios_data[0]
ios_data_body = ios_data[1:]

# Google Play data import
opened_file_2 = open('googleplaystore.csv')
android_file_reader = csv.reader(opened_file_2)
android_data = list(android_file_reader)
android_data_header = android_data[0]
android_data_body = android_data[1:]
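
The import steps above are repeated for each file, so they could also be wrapped in a small helper. The sketch below is one possible version and not part of the original notebook: the name `load_csv`, the `utf8` encoding default and the made-up sample rows are assumptions for illustration. It uses a `with` block so the file is closed once it has been read.

```python
import csv
import os
import tempfile


def load_csv(path, encoding="utf8"):
    """Return (header, body) for the CSV file at `path` (hypothetical helper)."""
    # newline="" is what the csv module documentation recommends for files
    # handed to csv.reader; the with-block closes the file automatically.
    with open(path, newline="", encoding=encoding) as opened_file:
        rows = list(csv.reader(opened_file))
    return rows[0], rows[1:]


# Tiny demonstration on a throwaway file; the sample rows are made up.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "w", newline="", encoding="utf8") as sample_file:
        csv.writer(sample_file).writerows([["App", "Price"], ["Demo App", "0"]])
    sample_header, sample_body = load_csv(sample_path)

print(sample_header)
print(sample_body)
```

With the real files this would be called as `load_csv('AppleStore.csv')` and `load_csv('googleplaystore.csv')`, assuming both files sit in the working directory.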



To help understand the characteristics of each data set and the differences between them, it is useful to print a small portion of the data. The number of rows and columns is also useful to know for indexing purposes. I am likely to repeat this inspection as a sanity check throughout the analysis, so a function was defined to avoid rewriting the code. The function works as follows:

- The function is defined with four parameters: the data set to examine, the start and end integers that bound the slice, and a Boolean flag that controls the second part of the function.
- The slice selected by the input arguments is assigned to a variable, and a loop over this slice prints each row.
- A conditional if statement checks the Boolean parameter; when it is True, two lines stating the number of rows and columns are printed.

This function is run on the body of both data sets, with their headers printed separately to make clear what the columns represent.

In [2]:
# defining a new function: explore_data()

def explore_data(data_set, start, end, rows_and_columns=False):
    data_slice = data_set[start:end]
    for row in data_slice:
        print(row)
        print('\n')    # creates an empty line, separating each row making it more readable
        
    if rows_and_columns == True:
        print('Number of rows: ', len(data_set))
        print('Number of columns: ', len(data_set[0]))
        

print('iOS Data Preview')
print('\n')
print(ios_data_header)
print('\n')
explore_data(ios_data_body, 0, 3, True)
print('\n')
print('Android Data Preview')
print('\n')
print(android_data_header)
print('\n')
explore_data(android_data_body, 0, 3, True)
        

iOS Data Preview


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows:  7197
Number of columns:  16


Android Data Preview


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+

## Data Processing

Now that the data has been imported, the next step is to prepare it and make sure it is accurate enough to analyse. This preparation ensures that there are no incorrect or repeated entries, and no entries that don't align with the company objectives. These objectives are:

- Only free apps are to be analysed, because our company is only in the market of free apps
- Only apps targeted at an English-speaking audience are to be analysed

This means that all entries that are not free or are targeted at non-English-speaking audiences should not be included in our analysis.

### Preliminary Review

As a preliminary indicator, the columns in the iOS data set that seem most relevant to our goals are:

- track_name
- price
- prime_genre
- user_rating
- rating_count_tot

Applying the same reasoning to the Android data set, the columns are:

- Category
- Rating
- Reviews
- Installs
- Price
- Genres

### Incorrect Data

The Google Play data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and we can see that one of the [discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error for a certain row. 

The row mentioned, 10472, is printed alongside the header so that the problem can be identified by visual inspection.

In [3]:
print(android_data_header)
print('\n')
print(android_data_body[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


The row has an error in the 'Category' column, which displays '1.9' when it should clearly hold a category name, and a 'Rating' of '19' when ratings only go up to 5. The row also contains 12 values against 13 column names, so it appears that the category is missing and the remaining values have shifted one column to the left.

This row is deleted using the del statement, and the number of rows is printed before and after the deletion to confirm it.

In [4]:
print('Number of apps before delete: '+ str(len(android_data_body)))
del(android_data_body[10472])
print('Number of apps after delete: '+ str(len(android_data_body)))

Number of apps before delete: 10841
Number of apps after delete: 10840
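
The faulty row above was known in advance from the discussion board. Because it holds fewer values than the header, rows like it could also be found by comparing lengths. The following is a minimal sketch on made-up sample rows; `find_malformed_rows` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
def find_malformed_rows(header, body):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(body) if len(row) != expected]


# Made-up sample data: the second row is missing its category.
sample_header = ["App", "Category", "Rating"]
sample_body = [
    ["App A", "TOOLS", "4.1"],
    ["App B", "1.9"],            # category missing, values shifted left
    ["App C", "GAME", "4.5"],
]

bad_indexes = find_malformed_rows(sample_header, sample_body)
print(bad_indexes)

# Delete from the end so that the earlier indexes stay valid.
for i in reversed(bad_indexes):
    del sample_body[i]

print(len(sample_body))
```

A length check only catches rows with missing or extra values; a row with the right number of columns but wrong contents would still need a separate check.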


### Duplicate Apps

A check for duplicate app names is done by:

- Initialising empty lists for unique and duplicate app names for both data sets
- Running a loop over each data set and isolating the app name in each row
- For each iteration, storing the app name in the unique list the first time it is seen, and in the duplicate list if it is already in the unique list
- Printing the lengths of the unique and duplicate lists to obtain the number of duplicate apps, along with the sum of the two as a sanity check that it matches the total number of apps calculated earlier

In [5]:
unique_name_ios = []
unique_name_android = []
duplicate_name_ios = []
duplicate_name_android = []

for row in ios_data_body:
    app_name_ios = row[1]
    if app_name_ios in unique_name_ios:
         duplicate_name_ios.append(app_name_ios)
    else:
         unique_name_ios.append(app_name_ios)

for row in android_data_body:
    app_name_android = row[0]
    if app_name_android in unique_name_android:
         duplicate_name_android.append(app_name_android)
    else:
         unique_name_android.append(app_name_android)
    
    
print('iOS check:')            
print('iOS has ' + str(len(unique_name_ios)) + ' unique apps')
print('iOS has ' + str(len(duplicate_name_ios)) + ' duplicate apps')
print('iOS total apps is ' + str(len(unique_name_ios)+len(duplicate_name_ios)))
print('\n')
print('Android check:')
print('Android has ' + str(len(unique_name_android)) + ' unique apps')
print('Android has ' + str(len(duplicate_name_android)) + ' duplicate apps')
print('Android total apps is ' + str(len(unique_name_android)+len(duplicate_name_android)))

iOS check:
iOS has 7195 unique apps
iOS has 2 duplicate apps
iOS total apps is 7197


Android check:
Android has 9659 unique apps
Android has 1181 duplicate apps
Android total apps is 10840
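
Checking `name in list` scans the whole list on every iteration, which gets slower as the list grows. As an alternative sketch, the same counts can be obtained with `collections.Counter` from the standard library. The helper name `count_duplicates` and the sample rows are assumptions for illustration.

```python
from collections import Counter


def count_duplicates(rows, name_index):
    """Return (unique names, extra occurrences, {name: total occurrences})."""
    names = [row[name_index] for row in rows]
    counts = Counter(names)
    n_unique = len(counts)
    # Every occurrence after the first counts as a duplicate entry.
    n_duplicate_rows = len(names) - n_unique
    repeated = {name: n for name, n in counts.items() if n > 1}
    return n_unique, n_duplicate_rows, repeated


# Made-up sample rows with a single column holding the app name.
sample_rows = [["Instagram"], ["Slack"], ["Instagram"], ["Instagram"], ["Box"]]
n_unique, n_dupes, repeated = count_duplicates(sample_rows, 0)
print(n_unique, n_dupes, repeated)
```

Note that `repeated` stores the total number of occurrences of each name, so a name seen three times maps to 3 here, whereas it would contribute two entries to a list of duplicates.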


The iOS duplicate apps are investigated by printing their rows for a visual inspection, to see whether these are duplicate entries or different apps that coincidentally share a name.

In [6]:
print(duplicate_name_ios)

['Mannequin Challenge', 'VR Roller Coaster']


In [7]:
print(ios_data_header)
print('\n')
for row in ios_data_body:
    app_name_ios = row[1]
    if app_name_ios == 'VR Roller Coaster':
        print(row)


for row in ios_data_body:
    app_name_ios = row[1]
    if app_name_ios == 'Mannequin Challenge':
        print(row)


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']
['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']


Taking a closer look at the two duplicated names in the App Store data, Mannequin Challenge and VR Roller Coaster, there are several notable differences between the rows, including different ids, sizes, rating counts and versions. They have also been analysed on the data set's discussion board, where they were deemed to be different apps; this can be seen [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion/90409).

For these reasons, it is concluded that these are different apps that happen to share the same name, so they will not be discarded from the analysis. This shows that a duplicate name is an indication of a repeated entry but not definitive proof of one.

As with the iOS duplicate names, it needs to be determined whether the duplicate names for Android indicate the same app being recorded more than once, or different apps that coincidentally share a name.

There are over 1,000 duplicate Android entries, but it is unclear how many duplicates each app has, so a frequency table will be generated using a dictionary where the app name is the key and its value is the frequency of occurrence in the duplicate list. Each key is therefore a unique app name from the duplicate list. This may help identify criteria for narrowing down the duplicates. This will be done by:

- Initialising an empty dictionary
- Running a loop over the list of duplicate app names
- Using a conditional if statement to count how often each app name occurs in the list
- Printing the length of this dictionary, which is the number of unique apps that have duplicate entries

In [8]:
app_frequency = {}

for name in duplicate_name_android:
    if name in app_frequency:
        app_frequency[name] += 1
    else:
        app_frequency[name] = 1
        
print('The number of unique apps that have duplicate values is ' + str(len(app_frequency)))


The number of unique apps that have duplicate values is 798


For the iOS apps, time could be spent inspecting the individual rows to work out whether they are duplicates or just coincidentally share a name. This is not feasible for the Android apps because there are so many, so the assumption is made that a duplicate app name indicates a duplicate entry. This means a method is needed to decide which of the duplicates to keep.

One useful criterion is the number of reviews. A higher number of reviews represents a larger sample of data for that app, which benefits the validity of our analysis. The duplicate entry with the highest number of reviews will be kept and the rest removed. To do this, a dictionary that keeps track of the highest number of reviews for each unique app is generated by:

- Initialising a dictionary in which each key is an app name and each value is the maximum number of reviews for that app
- Running a loop over the Android body data set, isolating the app name column and the reviews column
- In each iteration, checking with a conditional if statement whether the app name is already in the dictionary with a lower value than the current row's number of reviews; if so, the value is updated to the current row's number of reviews
- Otherwise, checking with an elif clause whether the app name is not yet in the dictionary, i.e. it is the loop's first encounter with that app; if so, the number of reviews is stored as the value for that key
- Together, these conditions ensure that only the highest value for each unique app name is kept in the dictionary
- The dictionary length should equal the total number of apps minus the number of duplicate entries; this is checked outside the loop by printing both results

In [9]:
reviews_max = {}

for row in android_data_body:
    app_name_android = row[0]
    n_reviews = float(row[3])
    if app_name_android in reviews_max and reviews_max[app_name_android] < n_reviews:
        reviews_max[app_name_android] = n_reviews
    
    elif app_name_android not in reviews_max:
        reviews_max[app_name_android] = n_reviews

print('Check:')
print('After taking only highest reviews, number of unique android apps is: '+str(len(reviews_max)))
print('Difference between total apps and duplicate apps: ' + str(len(unique_name_android)))

Check:
After taking only highest reviews, number of unique android apps is: 9659
Difference between total apps and duplicate apps: 9659
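
The two conditions used to build the dictionary can also be collapsed into a single update with `dict.get()` and `max()`. This is a compact sketch of the same idea on made-up rows; `max_reviews_by_name` is a hypothetical name used only here.

```python
def max_reviews_by_name(rows, name_index, reviews_index):
    """Map each app name to the highest review count seen for it."""
    reviews_max = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # On first sight the default is the row's own count, so it is stored;
        # afterwards the stored value only changes when the new one is larger.
        reviews_max[name] = max(reviews_max.get(name, n_reviews), n_reviews)
    return reviews_max


# Made-up sample rows: [app name, number of reviews]
sample_rows = [["Slack", "100"], ["Slack", "250"], ["Box", "40"], ["Slack", "250"]]
sample_reviews_max = max_reviews_by_name(sample_rows, 0, 1)
print(sample_reviews_max)
```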


With the maximum number of reviews known for each app, a new list representing the clean Android body of data can be generated. The code below does this by:

- Initialising two empty lists: clean_android_body, which will hold the clean rows of app information, and clean_android_names, which will hold the name of each app already added
- For each row in the original Android body data set:
     - The app name and number of reviews are isolated
     - A conditional if statement appends the row to the clean Android body list only if both of the following hold:
         - The number of reviews for the row equals the number of reviews for the same app name in the reviews_max dictionary. This ensures that the entry with the most reviews is selected.
         - The name of the app is not already in the clean_android_names list. This ensures that duplicate entries with the same maximum number of reviews, which would pass the previous check, are only counted once.

A check confirms that both lists are the same length, and therefore that each app name is only included once.

In [10]:
clean_android_body = []
clean_android_names = []

for row in android_data_body:
    app_name_android = row[0]
    n_reviews = float(row[3])
    
    if (n_reviews == reviews_max[app_name_android]) and app_name_android not in clean_android_names:
        clean_android_body.append(row)
        clean_android_names.append(app_name_android)

print('Check to make sure both lists are the same:')
print('The number of apps in our clean android body list is ' + str(len(clean_android_body)))
print('The number of apps in our clean android names list is ' +str(len(clean_android_names)))

Check to make sure both lists are the same:
The number of apps in our clean android body list is 9659
The number of apps in our clean android names list is 9659
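
Building the dictionary of maximum reviews and then filtering takes two passes over the data. As a sketch of an alternative, the best row for each name can be kept directly in a dictionary in a single pass. The helper name `keep_most_reviewed` and the sample rows are assumptions for illustration, and the approach relies on dictionaries preserving insertion order (Python 3.7 and later).

```python
def keep_most_reviewed(rows, name_index, reviews_index):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strictly greater, so ties keep the row that was seen first.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up sample rows: [app name, number of reviews, version label]
sample_rows = [
    ["Slack", "100", "v1"],
    ["Box", "40", "v1"],
    ["Slack", "250", "v2"],
    ["Slack", "250", "v3"],
]
deduplicated = keep_most_reviewed(sample_rows, 0, 1)
for row in deduplicated:
    print(row)
```

The rows selected are the same as with the two-pass approach, but the output order follows the first appearance of each name rather than the position of the selected row, which does not matter for frequency counts.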


### Isolation of Apps for English-Speaking Audience

The company we work for is only interested in apps targeted at an English-speaking audience, so apps that do not meet this criterion need to be removed.

Every character in a string has a number associated with it. Under ASCII, the letters, digits, punctuation and control characters commonly used in English text all have numbers in the range 0 to 127. A function can therefore check an app name for characters outside this range to identify names that are not English.

However, some English app names include characters outside the ASCII range, such as emojis, so a way of making sure useful data is not discarded is needed. The function counts the non-ASCII characters in the name, and if there are more than three the name is classed as non-English. There is a chance that some apps genuinely targeted at an English-speaking audience contain more than three non-ASCII characters, but this margin of error is tolerable at this stage of the analysis. The process is as follows:

- The function is defined, with the string being analysed as its input argument
- A counter for non-ASCII characters is initialised and set to 0
- A loop is run over each character in the string, and in each iteration:
    - The ord() function retrieves the number associated with the character; if it is higher than 127, 1 is added to the non-ASCII counter
    - If the counter exceeds the limit of 3, the function returns False; if the loop finishes without exceeding it, i.e. the string has three or fewer non-ASCII characters, the function returns True
- Empty lists are set up for both data sets
- A loop is run over the clean Android list and the iOS body, and the app name column is isolated
- For each app name, the English check function is run, and if the output is True (indicating an English app name) the row is appended to the corresponding list of English-speaking audience apps
- The explore_data() function created at the beginning is run on the two new lists of non-duplicate, English-audience apps

In [11]:
def english_check(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
            
        if non_ascii > 3:
            return False
    return True

android_english_body = []
ios_english_body =[]

for row in clean_android_body:
    app_name_android = row[0]
    if english_check(app_name_android) == True:
        android_english_body.append(row)
    
    
for row in ios_data_body:
    app_name_ios = row[1]
    if english_check(app_name_ios) == True:
        ios_english_body.append(row)
        
print('Android')
explore_data(android_english_body, 0, 3, True)
print('\n')
print('iOS')
explore_data(ios_english_body, 0, 3, True)


Android
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9614
Number of columns:  13


iOS
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.

So far the data set has been cleaned and the apps targeted at an English-speaking audience have been isolated. The company is only interested in the free apps market, as its main source of revenue is in-app ads, so the final step in setting up the data is isolating the free apps. This will be done by:

- Initialising one empty list for each data set, to hold the free, duplicate-free, English-audience apps
- Running a loop over the previous duplicate-free, English-audience lists, isolating the price column
- Appending the row to the final list if the price is equal to 0
- Printing the number of apps in each final processed list

In [12]:
final_android_body = []
final_ios_body = []


for row in android_english_body:
    price = row[7]
    if price == '0':
        final_android_body.append(row)
        
for row in ios_english_body:
    price = row[4]
    if price == '0.0':
        final_ios_body.append(row)
        
print('English, unique, clean android app number: ' + str(len(final_android_body)))
print('English, unique, clean ios app number: ' + str(len(final_ios_body)))


English, unique, clean android app number: 8864
English, unique, clean ios app number: 3222
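
The free-app filter compares the price text against '0' for Android and '0.0' for iOS, so it depends on the exact formatting in each file. A sketch of a check that converts the text to a number first is shown below. The helper name `is_free` is hypothetical, and the handling of a leading '$' is an assumption about how paid prices may be written.

```python
def is_free(price_text):
    """True when a price string such as '0', '0.0' or '$0.00' means zero."""
    # Assumption: paid prices may carry a leading dollar sign.
    cleaned = price_text.strip().lstrip("$")
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Text that is not a number (e.g. a shifted column) is not treated as free.
        return False


sample_prices = ["0", "0.0", "$4.99", "Everyone"]
sample_free_flags = [is_free(price) for price in sample_prices]
print(sample_free_flags)
```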


## Analysis

Now that the data is prepared, the analysis can begin. The aim is to identify the app characteristics associated with a high number of users, since our profitability comes from in-app ads and more users means more revenue. The company strategy comprises the following steps:

1. Build a minimal Android version of the app, and add it to Google Play
2. If the app has a good response from users, we develop it further
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store

Because the aim is to have apps on both stores, it's important to find app profiles that satisfy the criteria for both. For the analysis of Android apps I am going to focus on the 'Installs', 'Genres' and 'Category' columns to give an indication of which types of app are most common on the store and which are most popular. For iOS I am going to use 'prime_genre' to show how common each type is on the store, and 'rating_count_tot' as a proxy for the number of installs, which is not available for iOS.


### Most Common Apps by Genre

To assess the most common apps by genre, a frequency analysis will be conducted to indicate how many apps of each genre appear in each store. This will be done using dictionaries, and as it will be used again, a function will be made by:

- Defining the function name and its input parameters
- Initialising an empty dictionary to hold each unique value as a key and its frequency of occurrence as the value
- Creating a total variable with an initial value of 0, which will be used to work out the percentages
- Running a loop over the data set, isolating in each row the column whose values are being counted
- Adding 1 to the total variable in each iteration, counting the total number of rows
- Using a conditional if statement so that if the column value already appears in the dictionary as a key, 1 is added to its value
- Adding an else clause so that if the column value does not yet appear in the dictionary, it is added as a key with a value of 1 to count that occurrence
- Outside the loop, converting the counts to percentages, starting by initialising a second empty dictionary
- Running a loop over each key in the first dictionary, where the value in the new dictionary is the count divided by the total and multiplied by 100
- Returning this final table of percentages

To make the output more readable, the table is displayed in descending order. As this will also be done numerous times, a second function is created by:

- Defining the display function and its inputs
- Initialising a variable equal to the output of the frequency function
- Initialising an empty list to which the dictionary entries will be appended, since calling sorted() on a dictionary sorts only its keys
- Running a loop over each key in the frequency dictionary, where a (value, key) tuple is made from each key-value pair and assigned to a variable
- Appending this tuple to the empty list
- Passing the resulting list of tuples to the sorted() function with reverse=True, which orders it by value from highest to lowest
- Running a loop over the sorted list to print each entry as 'key : value'

In [13]:
def freq_table(data_set, index):
    table = {}
    total = 0
    for row in data_set:
        total += 1
        key_column = row[index]
        if key_column in table:
            table[key_column] += 1
        else:
            table[key_column] = 1
    
    table_percentages = {}
    for key in table:       
        percentages = (table[key]/total) * 100
        table_percentages[key] = percentages
    
    return table_percentages

def display_table(data_set, index):
    table = freq_table(data_set, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    
    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1],  ':', entry[0])
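
The counting and sorting steps can also be written with `collections.Counter` and the `key` argument of `sorted()`, which avoids building (value, key) tuples just to sort them. This is a sketch of an alternative rather than the functions used in this notebook; the name `percentage_table` and the sample rows are assumptions for illustration.

```python
from collections import Counter


def percentage_table(rows, index):
    """Return (value, percentage) pairs for one column, highest share first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    pairs = [(key, count / total * 100) for key, count in counts.items()]
    # Sort on the percentage only; ties keep the order in which they were seen.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up sample rows: [app name, genre]
sample_rows = [["a", "Games"], ["b", "Games"], ["c", "Tools"], ["d", "Games"]]
sample_table = percentage_table(sample_rows, 1)
for genre, share in sample_table:
    print(f"{genre}: {share:.2f}%")
```

One difference worth knowing: sorting (value, key) tuples with reverse=True breaks ties in reverse alphabetical order of the key, while sorting on the percentage alone leaves tied entries in first-seen order.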


#### Android
The functions just created are run through the Android 'genres' column.

In [14]:
print("Android 'genres' column")
display_table(final_android_body, -4)

Android 'genres' column
Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Ve

The data presented is very granular, showing a large number of genres. The top of the list is mainly dominated by productivity and development style apps, with genres such as 'Education', 'Business' and 'Productivity', which indicates that these are the types of app that appear most often in the Google Play store.

The display function was then run on the Android 'Category' column:

In [15]:
print("Android 'category' column")
display_table(final_android_body, 1)

Android 'category' column
FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.65433212

These results are far less granular than the 'Genres' column, and they show a similar set of categories dominating the top of the list. In this list, though, FAMILY and GAME are the most frequent. On closer inspection of the family section on the [Google Play Store](https://play.google.com/store/apps/category/FAMILY?hl=en_GB), it encompasses a few subcategories such as games, education and entertainment. This fits in with the picture of the Google Play store being dominated by practical apps. The games category indicates that a large portion of the free apps on Google Play are still games, even with the dominance of practical apps.

#### iOS
Running the display function on the App Store 'prime_genre' column shows which types of app are common:

In [16]:
print("ios 'prime-genre' column" )
display_table(final_ios_body, -5)

ios 'prime-genre' column
Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


The most common iOS genres show that fun apps are more dominant, with genres such as 'Games', 'Entertainment' and 'Photo & Video' at the top. Education is the one common genre among them that fits with the theme of the Google Play store.

The general difference between the two stores is that the Google Play store is dominated more by productivity style apps, while the App Store is dominated more by apps for entertainment and fun. There are elements of crossover, though, such as education and games, which are consistently near the top across both stores.

In terms of what this means, the large presence of productivity apps in the Google Play store and of games in both stores suggests low barriers to entry and a competitive market. With so many developers in these markets, it also indicates (but doesn't guarantee) popularity with customers. This combination translates to a lower perceived risk and suggests that these types of apps are the most promising.

## Most Popular Apps by Genre

To understand which genres are most popular, an analysis similar to the earlier one of the most common app types can be generated, this time based on user numbers rather than app counts. The installs column can be used for the Google Play store, but as this is not available for the Apple store, the total rating count column will be used as a proxy.

#### iOS

Below, we calculate the average number of user ratings per app genre on the Apple Store by:

- Use the frequency function generated on the prime genre column and assign the dictionary to a variable
- Run a loop over each genre in the genre frequency dictionary, so that the loop body runs once per genre
- Initialise a total variable equal to 0 and same for a length of genre variable. This will be used to get the ratings per genre calculation in the end
- Run a nested loop over every app in the final ios data set. This nested loop is inside the outer loop that is happening for each genre in the frequency table
- Initialise a variable for the genre column
- Run a conditional statement inside the nested loop where if the genre name in the outer loop (frequency table) is equal to the genre name in the inner loop (searching the final ios data set), with the ratings column defined, it will add the ratings of that row to the total variable and add 1 to the genre variable
- Outside the nested loop once a genre has been exhausted, the average ratings is calculated and printed as its own set.
- The loop repeats for every element of the initial frequency dictionary (unique genre), where in the end, all average ratings are presented.

In [17]:
genre_ios = freq_table(final_ios_body, -5)

for genre in genre_ios:
    total = 0
    len_genre = 0
    for app in final_ios_body:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    # Average number of ratings for this genre, printed once per genre
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Because the averages are printed one at a time, they are not sorted. Copying the output into an Excel sheet, using the text-to-columns function to split the genre and its average into two columns, and then applying a custom sort quickly gives the most popular genres in order. The result can then be re-imported and presented as a table:

| Genre | Average Number of Ratings |
|---|---|
| Navigation | 86090.33333 |
| Reference | 74942.11111 |
| Social Networking | 71548.34906 |
| Music | 57326.5303 |
| Weather | 52279.89286 |
| Book | 39758.5 |
| Food & Drink | 33333.92308 |
| Finance | 31467.94444 |
| Photo & Video | 28441.54375 |
| Travel | 28243.8 |
| Shopping | 26919.69048 |
| Health & Fitness | 23298.01538 |
| Sports | 23008.89855 |
| Games | 22788.66969 |
| News | 21248.02326 |
| Productivity | 21028.41071 |
| Utilities | 18684.45679 |
| Lifestyle | 16485.76471 |
| Entertainment | 14029.83071 |
| Business | 7491.117647 |
| Education | 7003.983051 |
| Catalogs | 4004 |
| Medical | 612 |
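
The spreadsheet round trip is not strictly necessary. As a minimal sketch, assuming the per-genre averages are collected in a dictionary instead of being printed one by one, Python's built-in `sorted()` can order them directly. The helper name `sort_averages` and the `sample_averages` dictionary are illustrative and not part of the original notebook; the three values are copied from the table above.

```python
def sort_averages(averages):
    """Return (name, average) pairs ordered from largest to smallest average."""
    return sorted(averages.items(), key=lambda pair: pair[1], reverse=True)


# Sample values copied from the table above. In the notebook this dictionary
# would be filled inside the genre loop, e.g. averages[genre] = avg_n_ratings.
sample_averages = {
    'Games': 22788.66969,
    'Navigation': 86090.33333,
    'Medical': 612.0,
}

for name, average in sort_averages(sample_averages):
    print(name, ':', average)
```

Keeping the sort in the notebook means the table can be regenerated whenever the cleaning steps change, without a manual copy and paste.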

On average, navigation apps have the highest number of user ratings, but this figure is heavily influenced by Waze and Google Maps, which together have close to half a million user ratings. So although the navigation genre ranks highest on our popularity proxy and initially seems attractive for generating a high volume of traffic, it is a concentrated market with high barriers to entry such as brand recognition: Waze and Google are established names, making it difficult for our company to win their customers. I would therefore advise against this genre, because once those two legacy apps are excluded, the average number of ratings is only around 4,000, which is fairly low.

In [18]:
for app in final_ios_body:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
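
The "around 4,000" figure can be checked from the rating counts printed above. The sketch below is a worked example only: the dictionary is typed in by hand from that output, and the helper name `average_excluding_top` is an assumption introduced here for illustration.

```python
# Rating counts copied from the Navigation output above.
navigation_ratings = {
    'Waze - GPS Navigation, Maps & Real-time Traffic': 345046,
    'Google Maps - Navigation & Transit': 154911,
    'Geocaching®': 12811,
    'CoPilot GPS – Car Navigation & Offline Maps': 3582,
    'ImmobilienScout24: Real Estate Search in Germany': 187,
    'Railway Route Search': 5,
}


def average_excluding_top(ratings, n_excluded):
    """Average the rating counts after dropping the n_excluded largest values."""
    remaining = sorted(ratings.values(), reverse=True)[n_excluded:]
    return sum(remaining) / len(remaining)


avg_all = average_excluding_top(navigation_ratings, 0)
avg_without_top_two = average_excluding_top(navigation_ratings, 2)

print('All navigation apps:', avg_all)
print('Without Waze and Google Maps:', avg_without_top_two)
```

With all six apps the average matches the 86,090 in the genre table; without the two largest apps it drops to 4,146.25.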


Reference apps have 74,942 user ratings on average, but the Bible and Dictionary.com contribute disproportionately to that average. Although there are more apps in this genre, indicating a more fragmented market than 'Navigation', the same concerns apply: dominant legacy apps, barriers to entry such as brand name, and a generally low number of user ratings once the legacy apps are excluded. I would therefore advise against this type of app as well.

In [19]:
for app in final_ios_body:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


The 'Social Networking' genre shares some attributes with the previous two genres, in that a few legacy apps are very dominant in the market. That said, the market has a large competitive landscape, evident in the large number of players. Although this may indicate a saturated market, it presents an opportunity to assess the capabilities of the current players, see how they differentiate themselves from each other, and learn from their mistakes.

There may be an opportunity to build a hybrid app, bringing in elements from several popular genres such as music and social networking. One possibility is a social networking platform for non-professionals in the music field to share, like and engage with others in their discipline. The app may offer features such as teaching and sharing techniques, incorporating elements of education that can be popular across all demographics and with families. The app could also serve enthusiasts who enjoy discovering new talent, which fits the popular entertainment narrative of the iOS market.

The app may also incorporate reference information about the latest changes in the industry, signings and new artists, serving as a type of reference for those in the industry. This interaction between different elements of popular genres has the potential to unlock numerous synergies.

In [20]:
for app in final_ios_body:
    if app[-5] == 'Social Networking':
        print(app[1], ':', app[5])

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

Exploring some of the other popular genres further down the list, 'Book' and 'Food & Drink', we see that both show far more concentrated markets than the 'Social Networking' genre, with not as many synergy opportunities.

In [21]:
for app in final_ios_body:
    if app[-5] == 'Book':
        print(app[1], ':', app[5])

Kindle – Read eBooks, Magazines & Textbooks : 252076
Audible – audio books, original series & podcasts : 105274
Color Therapy Adult Coloring Book for Adults : 84062
OverDrive – Library eBooks and Audiobooks : 65450
HOOKED - Chat Stories : 47829
BookShout: Read eBooks & Track Your Reading Goals : 879
Dr. Seuss Treasury — 50 best kids books : 451
Green Riding Hood : 392
Weirdwood Manor : 197
MangaZERO - comic reader : 9
ikouhoushi : 0
MangaTiara - love comic reader : 0
謎解き : 0
謎解き2016 : 0


In [22]:
for app in final_ios_body:
    if app[-5] == 'Food & Drink':
        print(app[1], ':', app[5])

Starbucks : 303856
Domino's Pizza USA : 258624
OpenTable - Restaurant Reservations : 113936
Allrecipes Dinner Spinner : 109349
DoorDash - Food Delivery : 25947
UberEATS: Uber for Food Delivery : 17865
Postmates - Food Delivery, Faster : 9519
Dunkin' Donuts - Get Offers, Coupons & Rewards : 9068
Chick-fil-A : 5665
McDonald's : 4050
Deliveroo: Restaurant Delivery - Order Food Nearby : 1702
SONIC Drive-In : 1645
Nowait Guest : 1625
7-Eleven, Inc. : 1356
Outback : 805
Bon Appetit : 750
Starbucks Keyboard : 457
Whataburger : 197
Delish Eatmoji Keyboard : 154
Lieferheld - Delicious food delivery service : 29
Lieferando.de : 29
McDo France : 22
Chefkoch - Rezepte, Kochen, Backen & Kochbuch : 20
Youmiam : 9
Marmiton Twist : 2
Open Food Facts : 1


#### Android

Finding the most popular apps for Android needs a similar approach, calculating the average number of installs per genre. Unlike the iOS data set, the Android data set has an installs column, so the result is more explicit. Two columns were used in the common apps analysis for the Android data set, but the 'genres' column was far too fragmented, making it hard to ascertain the message. The 'category' column will therefore be used to represent the Android genres. To do this analysis I:

- Initialised a variable with the output of the frequency table function applied. The inputs to the function were the android data set with the index number of the 'category' column.
- A loop is run over the genre keys of the frequency function dictionary output
- Initialise a variable for the total number of installs and also the number of apps in a genre to be used in the average calculation
- Run a loop over every app in the final android data set
- Initialise the column for the category data
- Inside the nested loop, a conditional if statement executes its clause if the category name of the app in the final android data set is equal to the current frequency dictionary key
- The clause isolates the installs column and replaces the + and , with an empty string, as the installs are given as a range, e.g. 100,000+ installs. To make the analysis simpler, this is taken to be 100,000 installs for that app.
- These install string values are then converted to floats and added to the total variable initialised earlier
- Simultaneously, a value of 1 is added to the length of the category, counting how many apps are in that category
- Once the nested loop is finished, the average is calculated. The next iteration of the outer loop (the next category) is then run and the whole process starts again.
- The output is not ordered, so it is copied into an Excel sheet, where the text-to-columns and custom sort functions are used to order the categories from largest to smallest for an easier read of the output.

In [23]:
category_android = freq_table(final_android_body, 1)

for category in category_android:
    total = 0
    len_category = 0
    for app in final_android_body:
        category_app = app[1]
        if category_app == category:
            # Installs are stored as text such as '100,000+'
            installs_range = app[5]
            installs_approx = installs_range.replace('+', '')
            installs_approx = installs_approx.replace(',', '')
            installs_value = float(installs_approx)
            total += installs_value
            len_category += 1
    avg_installs = total / len_category
    print(category, ':', avg_installs)

| Category | Average installs |
|---|---|
| COMMUNICATION | 38456119.17 |
| VIDEO_PLAYERS | 24727872.45 |
| SOCIAL | 23253652.13 |
| PHOTOGRAPHY | 17840110.4 |
| PRODUCTIVITY | 16787331.34 |
| GAME | 15588015.6 |
| TRAVEL_AND_LOCAL | 13984077.71 |
| ENTERTAINMENT | 11640705.88 |
| TOOLS | 10801391.3 |
| NEWS_AND_MAGAZINES | 9549178.468 |
| BOOKS_AND_REFERENCE | 8767811.895 |
| SHOPPING | 7036877.312 |
| PERSONALIZATION | 5201482.612 |
| WEATHER | 5074486.197 |
| HEALTH_AND_FITNESS | 4188821.985 |
| MAPS_AND_NAVIGATION | 4056941.774 |
| FAMILY | 3695641.82 |
| SPORTS | 3638640.143 |
| ART_AND_DESIGN | 1986335.088 |
| FOOD_AND_DRINK | 1924897.736 |
| EDUCATION | 1833495.146 |
| BUSINESS | 1712290.147 |
| LIFESTYLE | 1437816.269 |
| FINANCE | 1387692.476 |
| HOUSE_AND_HOME | 1331540.562 |
| DATING | 854028.8303 |
| COMICS | 817657.2727 |
| AUTO_AND_VEHICLES | 647317.8171 |
| LIBRARIES_AND_DEMO | 638503.7349 |
| PARENTING | 542603.6207 |
| BEAUTY | 513151.8868 |
| EVENTS | 253542.2222 |
| MEDICAL | 120550.6198 |
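
The nested loop in cell 23 rescans the whole data set once per category. As an alternative sketch, assuming the same row layout (category at index 1, installs at index 5), the totals can be accumulated in a single pass and sorted in Python. The names `parse_installs` and `average_installs_by_category` and the three sample rows are hypothetical, introduced here only for illustration.

```python
def parse_installs(installs_text):
    """Convert a string such as '1,000,000+' to the float 1000000.0."""
    return float(installs_text.replace('+', '').replace(',', ''))


def average_installs_by_category(rows, category_index=1, installs_index=5):
    """Return {category: average installs}, computed in a single pass."""
    totals = {}
    counts = {}
    for row in rows:
        category = row[category_index]
        installs = parse_installs(row[installs_index])
        totals[category] = totals.get(category, 0.0) + installs
        counts[category] = counts.get(category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}


# Hypothetical rows: only indexes 1 and 5 matter here, the rest are placeholders.
sample_rows = [
    ['App A', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['App B', 'COMMUNICATION', '', '', '', '10,000+'],
    ['App C', 'MEDICAL', '', '', '', '500+'],
]

averages = average_installs_by_category(sample_rows)
# Print categories from largest to smallest average
for category in sorted(averages, key=averages.get, reverse=True):
    print(category, ':', averages[category])
```

In the notebook, `final_android_body` would take the place of `sample_rows`.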

First analysing the 'COMMUNICATION' category, it can be seen that the large average install figure of over 38 million is heavily influenced by some of the large legacy apps with over a billion installs. Taking on these large legacy apps is not within the scope of the company strategy, so a more refined analysis of the data set that excludes them is needed.

In [24]:
for app in final_android_body:
    if app[1] == 'COMMUNICATION':
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
Messenger for SMS : 10,000,000+
My Tele2 : 5,000,000+
imo beta free calls and text : 100,000,000+
Contacts : 50,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Browser 4G : 10,000,000+
MegaFon Dashboard : 10,000,000+
ZenUI Dialer & Contacts : 10,000,000+
Cricket Visual Voicemail : 10,000,000+
TracFone My Account : 1,000,000+
Xperia Link™ : 10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard : 10,000,000+
Skype Lite - Free Video Call & Chat : 5,000,000+
My magenta : 1,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Seznam.cz : 1,000,000+
Antillean Gold Telegram (original version) : 100,000+
AT&T Visual Voicemail : 10,000,000+
GMX Mail : 10,000,000+
Omlet Chat : 10,000,000+
My Vodacom SA : 5,000,000+
Microsoft Edge : 5,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Calls & Text by Mo+ : 5,000,000+
free 

Refining the 'COMMUNICATION' data to remove the skew from large legacy apps was done by:

- Running a loop through every app in the android data set
- Isolating the category column and the installs column in the loop
- Removing the + and , from the installs string value and converting it to an integer
- Applying a conditional statement that displays the app name and its installs column if the category is equal to 'COMMUNICATION' and the installs value is less than a modest 50,000

In [25]:
for app in final_android_body:
    category = app[1]
    # Strip '+' and ',' so that e.g. '10,000+' becomes the integer 10000
    installs = int(app[5].replace('+', '').replace(',', ''))
    if category == 'COMMUNICATION' and installs < 50000:
        print(app[0], ':', app[5])

Mail1Click - Secure Mail : 10,000+
K-9 Material (unofficial) : 5,000+
m:go BiH : 10,000+
/u/app : 10,000+
[verify-U] VideoIdent : 10,000+
Ad Blocker Turbo - Adblocker Browser : 10,000+
AG Contacts, Lite edition : 5,000+
Oklahoma Ag Co-op Council : 10+
Bee'ah Employee App : 100+
tournaments and more.aj.2 : 100+
Aj.Petra : 100+
AK Phone : 5,000+
Access Point Names : 10,000+
ClanHQ : 10,000+
AU Call Blocker - Block Unwanted Calls Texts 2018 : 1,000+
AV Phone : 1,000+
Katalogen.ax : 100+
BA SALES : 1+
BD Dialer : 10,000+
BD Live Call : 5,000+
Best Browser BD social networking : 10+
Traffic signs BD : 500+
BF Browser by Betfilter - Stop Gambling Today! : 10,000+
BH Mail : 1,000+
BJ - Confidential : 10+
BK Chat : 1,000+
Of the wall Arapaho bk : 5+
AC-BL : 50+
DMR BrandMeister Tool : 10,000+
BN MALLORCA Radio : 1,000+
BQ Partners : 1,000+
BS-Mobile : 50+
ATC Unico BS : 500+
BT One Voice mobile access : 5,000+
BT One Phone Mobile App : 10,000+
BV : 100+
Feel Performer : 10,000+
Cb browser : 50

The results show that the market is still very fragmented. This is attractive, as it suggests a large market with many opportunities.
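
One way to put numbers on how fragmented a category is would be to count what share of its apps falls under each installs label. The sketch below is only an illustration: `installs_distribution` is a hypothetical helper, and the sample labels are copied from the first eight rows of the 'COMMUNICATION' output of cell 24, not from the full column.

```python
from collections import Counter


def installs_distribution(labels):
    """Return (label, percentage) pairs ordered from fewest to most installs."""
    def as_number(label):
        return int(label.replace('+', '').replace(',', ''))

    counts = Counter(labels)
    total = len(labels)
    return [(label, counts[label] / total * 100)
            for label in sorted(counts, key=as_number)]


# Install labels copied from the first eight rows of the COMMUNICATION output;
# in the notebook the installs column of every COMMUNICATION app would be used.
sample_labels = [
    '1,000,000,000+', '10,000,000+', '5,000,000+', '100,000,000+',
    '50,000,000+', '5,000,000+', '5,000,000+', '10,000,000+',
]

distribution = installs_distribution(sample_labels)
for label, share in distribution:
    print(label, ':', share)
```

A distribution where most of the share sits in the lower labels would support the fragmentation reading, while a heavy top end would point to a concentrated market.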

## Conclusion

The type of app that will meet the company objective of generating the most user engagement, and that also fits the strategy of being available on both app stores, is that of social networking and communication apps, with the added opportunity of harnessing synergies and incorporating features from other popular genres such as Music and Reference.

A social networking app focused on music is recommended as a potentially popular app to develop. This is supported by the consistent presence of both social networking and music across the two app stores. In addition, it draws on traits common to each store: the practical character of Android apps, reflected in features relating to news, reference, information and education, and the entertainment side drawn from the iOS store.

The market is very fragmented, which presents an opportunity to assess our competitors' capabilities and learn how they differentiate themselves from each other. This will provide more robust information for a further analysis on the app development side, assessing which features are impactful to users.

The overlap of categories presents the opportunity to harness synergies between the individual categories and create a 'one-stop shop' type strategy when designing the app.

### Next Step

- For a more robust analysis, the data cleaning methods could be refined, as some apps aimed at non-English-speaking audiences remained after the cleaning process. Although the impact on the analysis is assumed to be negligible, a deeper cleaning will ensure a more robust assessment.
- Market research into the markets we want to include in our app, involving further data analysis, to determine which features are most effective.