# Profitable App Profiles for the App Store and Google Play Markets

For this project, we are data analysts for a company that builds mobile applications for both iOS and Android, with these apps being available on Google Play and the App Store, respectively.

All of the apps that our company builds are free to download and install. The main source of revenue from these apps are in-app ads.

This means that the revenue for any one of the apps is directly correlated to the number of users for that app - the more users who see and engage with the ads, the more revenue each app brings in.

**The goal for this project is to analyze data to help our development team understand what type of apps are likely to attract more users.**

***
The first step in this project will be to collect and analyze data about all mobile apps available on Google Play and the App Store.

Since there are over 4 million mobile apps, we will focus on a smaller sample size of data for the purposes of this project.

There are two datasets we will be using:
- [A dataset for approximately 10k Android apps from Google Play, collected in August 2018](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv). Saved as `googleplay.csv` within the project folder.
- [A dataset for approximately 7k iOS apps from the App Store, collected in July 2017](https://dq-content.s3.amazonaws.com/350/AppleStore.csv). Saved as `appstore.csv` within the project folder.

***
## Opening and Exploring Data

We will extract the datasets from the CSV files into a list of lists.

In [638]:
from csv import reader

def csv_to_list(file):
    opened_file = open(file)
    read_file = reader(opened_file)
    return list(read_file)

app_store = csv_to_list('appstore.csv')
google_play = csv_to_list('googleplay.csv')

We will define a function to explore these two datasets.

In [639]:
def explore_dataset(dataset, start_index, end_index, print_rows_and_columns = False):
    dataset_sliced = dataset[start_index: end_index]
    for row in dataset_sliced:
        print(row) 
        print('\n') # Add an empty line after each row
        
    if print_rows_and_columns:
        print("Number of rows: ", len(dataset))
        print("Number of columns: ", len(dataset[0]))

In [640]:
explore_dataset(app_store, 0, 5, print_rows_and_columns = True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows:  7198
Number of columns:  16


In [641]:
explore_dataset(google_play, 0, 5, print_rows_and_columns = True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  10842
Number of columns:  13


***
## Data Cleaning
### Identify Key Data Points

Next, we will view only the column names of each dataset in order to identify which data points will be most helpful for our analysis.

The detailed descriptions of the columns can be found from the original page of the datasets, for [Google Play](https://www.kaggle.com/datasets/lava18/google-play-store-apps) and the [App Store](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps).

In [642]:
print(google_play[0])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


From the list of available metrics from Google Play, it appears the most important metrics to use for our analysis should be:
- App
- Category
- Rating
- Reviews
- Installs
- Type
- Price
- Content Rating
- Genres

In [643]:
print(app_store[0])

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


From the list of available metrics from the App Store, it appears the most important metrics to use for our analysis should be:
- track_name
- currency
- price
- rating_count_tot
- user_rating
- cont_rating
- prime_genre

***
### Deleting Wrong or Inaccurate Data
*First, there is a [discussion topic](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) about an error for the Google Play data. We will fix this by removing the entry in question from our local dataset*

In [644]:
explore_dataset(google_play, start_index = 10472, end_index = 10474)

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']




We can see that the data at `google_play[10473]` is missing the `Category` column. Since we can't readily assume what category it may have originally been in, we should just remove it from the dataset.

In [645]:
del google_play[10473]

# Confirm that the data is no longer in the dataset
explore_dataset(google_play, start_index = 10472, end_index = 10474)

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']




***
### Removing Duplicate Entries
Reading through more discussion topics about the Google Play data, it appears there are duplicate entries for many apps. Since this would impact the analysis, we will remove identify and remove any duplicate entries.

In [646]:
def identify_duplicate_entries(dataset, index, remove_header = False):
    unique_entries = []
    duplicate_entries = []
    
    if remove_header:
        dataset = dataset[1:]
    
    for row in dataset:
        value = row[index]
        if value in unique_entries:
            duplicate_entries.append(value)
        else:
            unique_entries.append(value)
            
    print("Number of duplicate entries: ", len(duplicate_entries))
    print("\n")
    print("Examples of duplicate entries: ", duplicate_entries[:10])
    
    return duplicate_entries

In [647]:
google_play_duplicates = identify_duplicate_entries(google_play, 0, remove_header = True)

Number of duplicate entries:  1181


Examples of duplicate entries:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


In [648]:
app_store_duplicates = identify_duplicate_entries(app_store, 1, remove_header = True)

Number of duplicate entries:  2


Examples of duplicate entries:  ['Mannequin Challenge', 'VR Roller Coaster']


***
While it would be easy to just remove any duplicate entries, it is important that we first identify why there are duplicate entries.

We can work on the first duplicate entry for each dataset to try to identify any factors that we can use to ensure we are working with the best data for a particular entry.

In [649]:
def compare_duplicates(original_dataset, duplicate_dataset, index):
    for data in original_dataset:
        value = data[index]
        if value == duplicate_dataset[0]:
            print(data)

In [650]:
compare_duplicates(google_play, google_play_duplicates, 0)

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


In [651]:
compare_duplicates(app_store, app_store_duplicates, 1)

['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']


For each of these two datasets, it appears the metric that differentiates each duplicate entry is the number of ratings, seen in index `3` of the Google Play data, and in index `5` of the App Store data.

With that knowledge, we know that we should only be keeping the entry with the *highest* number of ratings.

In order to do this, we will need to create a new dataset that keeps only one entry for each app, with that entry being the one with the highest number of ratings.

In [652]:
def create_dataset_unique_entries(dataset, index_name, index_unique, remove_header = True):
    if remove_header:
        dataset = dataset[1:]
        
    unique_dict = {}
    for row in dataset:
        name = row[index_name]
        value = float(row[index_unique])
        if name in unique_dict and unique_dict[name] < value:
            unique_dict[name] = value
        elif name not in unique_dict:
            unique_dict[name] = value
            
    return unique_dict

We can confirm that this process was done succesfully because we can assume that the resulting dataset should have the amount of entries equal to `Original Entries` - `Duplicate Entries`.

For the Google Play data, this should be:

In [653]:
print(len(google_play[1:]) - len(google_play_duplicates))

9659


In [654]:
google_play_unique_dict = create_dataset_unique_entries(google_play, 0, 3)
print(len(google_play_unique_dict))

9659


This checks out!

For the App Store data, this should be:

In [655]:
print(len(app_store[1:]) - len(app_store_duplicates))

7195


In [656]:
app_store_unique_dict = create_dataset_unique_entries(app_store, 1, 5)
print(len(app_store_unique_dict))

7195


This also checks out!

The next step here is to use the unique dictionaries that can tell us the highest ratings for each app to create a new *cleaned* dataset as a list of lists.

In [657]:
def create_clean_dataset(dataset, clean_dict, index_name, index_unique, remove_headers = True):
    cleaned_list = [] # Store new cleaned data
    already_added = [] # List of data that has already been added (non-unique)
    
    if remove_headers:
        cleaned_list.append(dataset[0])
        dataset = dataset[1:]
    
    for row in dataset:
        name = row[index_name]
        value = float(row[index_unique])
        if name not in already_added and value == clean_dict[name]:
            cleaned_list.append(row)
            already_added.append(name)
            
    return cleaned_list

In [658]:
google_play_cleaned = create_clean_dataset(google_play, google_play_unique_dict, 0, 3)
print(len(google_play_cleaned[1:]))
print(google_play_cleaned[:5])

9659
[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'], ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']]


In [659]:
app_store_cleaned = create_clean_dataset(app_store, app_store_unique_dict, 1, 5)
print(len(app_store_cleaned[1:]))
print(app_store_cleaned[:5])

7195
[['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'], ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']]


The lengths of both of the cleaned datasets match what we expected, so we can validate that the process was a success!

***
### Removing Other Unwanted Data
At our company, recall that we only develop mobile apps that are free to download and install. We also design our apps for an English-speaking audience.

So we will need to clean up the current datasets to remove all data points that are non-free and not primarily in English.

**Removing Non-free Apps**

For this step, we should just need to loop through each entry in our datasets and keep the entries that have a price of $0.

However, the price data points in each dataset exist as strings, or they contain the `$` character, so we will need to be sure to account for this.

We can use the `ord()` function, which returns the integer representing the Unicode character, and if we encounter anything other than a `.` or numeral, we will remove it. The ASCII integer for `.` is **46** and **48-57** for `0-9` according to this [ASCII table](https://www.asciitable.com/)

In [660]:
def string_to_float(string):
    cleaned_string = ""
    for character in string:
        char_unicode = ord(character)
        if char_unicode == 46 or (char_unicode >= 48 and char_unicode <= 57):
            cleaned_string += character
    return float(cleaned_string)

In [661]:
def create_dataset_free_only(dataset, index_name, index_price, remove_headers = True):
    dataset_free = []
    
    if remove_headers:
        dataset_free.append(dataset[0])
        dataset = dataset[1:]
        
    for row in dataset:
        name = row[index_name]
        price = string_to_float(row[index_price])
        if price == 0:
            dataset_free.append(row)
    
    return dataset_free

In [662]:
google_play_free = create_dataset_free_only(google_play_cleaned, 0, 7)

In [663]:
print(google_play_free[:10])

[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'], ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone'

In [664]:
app_store_free = create_dataset_free_only(app_store_cleaned, 1, 4)

In [665]:
print(app_store_free[:10])

[['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'], ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'], ['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1'], ['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061624',

Great! Now we have a cleaned dataset that has only free apps.

**Removing Non-English Apps**

The next step is to remove all apps that are not aimed at an English-speaking audience.

The process is similar to above. We will compare the unicode characters of the app name string and ignore any apps that have characters outside of the ASCII range of 0-127.

In [666]:
def create_dataset_english(dataset, index_name, remove_headers = True):
    dataset_english = []
    if remove_headers:
        dataset_english.append(dataset[0])
        dataset = dataset[1:]
    
    for row in dataset:
        isEnglish = True
        name = row[index_name]
        for character in name:
            char_unicode = ord(character)
            if char_unicode > 127:
                isEnglish = False
        if isEnglish:
            dataset_english.append(row)
    
    return dataset_english

Now, we will test this function to ensure it is properly filtering out non-English apps:

In [667]:
test_list = ['Instagram', '爱奇艺PPS -《欢乐颂2》电视剧热播', 'Docs To Go™ Free Office Suite', 'Instachat 😜']

test_list_english = create_dataset_english(test_list, 0, remove_headers = False)
print(test_list_english)

['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']


In [668]:
google_play_english = create_dataset_english(google_play_free, 0)
print(google_play_english[:10])

[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'], ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up'], ['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'Ap

In [669]:
app_store_english = create_dataset_english(app_store_free, 1)
print(app_store_english[:10])

[['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'], ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'], ['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1'], ['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061624',

**I think that wraps up all of the data cleaning for this initial analysis!**

***
# Data Analysis

Now that we have cleaned our datasets, we should be clear to perform our analysis on the data.

To recap, the goal is to determine the kinds of apps that are likely to attract more users because the number of people using our apps affect our revenue.

Our validation strategy for an app idea has three steps:
- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

We can assume our end goal is to add the app on both Google Play and the App Store, so we need to find app profiles that are successful in both markets. 

### Most Common Apps by Genre

We can begin the analysis by trying to determine the most common genres within each app marketplace.

To do this, we will generate a frequency table to count how many apps of a certain genre exist.

Recall the respective column headers:

In [670]:
print(google_play_english[0])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [671]:
print(app_store_english[0])

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


So we will be using genre index for each marketplace: `9` for Google Play, and `11` for the App Store.

*Notice that for Google Play, there is also a higher level `Category` column. We will also generate a frequency table for this index and compare it later on.*

In [672]:
def create_freq_table(dataset, index_freq, remove_headers = True, generate_percentage = True):
    freq_table = {}
    if remove_headers:
        dataset = dataset[1:]
        
    number_of_entries = len(dataset)
        
    for row in dataset:
        value = row[index_freq]
        if value in freq_table:
            freq_table[value] += 1
        else:
            freq_table[value] = 1
            
    if generate_percentage:
        for key in freq_table:
            freq_table[key] = round((freq_table[key] / number_of_entries) * 100, 2)

    return freq_table

In [673]:
google_play_genre_freq = create_freq_table(google_play_english, 9)

In [674]:
print(google_play_genre_freq)

{'Art & Design': 0.62, 'Art & Design;Creativity': 0.07, 'Auto & Vehicles': 0.94, 'Beauty': 0.63, 'Books & Reference': 2.19, 'Business': 4.71, 'Comics': 0.56, 'Comics;Creativity': 0.01, 'Communication': 3.22, 'Dating': 1.83, 'Education': 5.39, 'Education;Creativity': 0.05, 'Education;Education': 0.34, 'Education;Pretend Play': 0.06, 'Education;Brain Games': 0.02, 'Entertainment': 6.09, 'Entertainment;Creativity': 0.02, 'Entertainment;Music & Video': 0.13, 'Events': 0.71, 'Finance': 3.73, 'Food & Drink': 1.2, 'Health & Fitness': 3.13, 'House & Home': 0.81, 'Libraries & Demo': 0.9, 'Lifestyle': 3.88, 'Lifestyle;Pretend Play': 0.01, 'Card': 0.4, 'Arcade': 1.83, 'Puzzle': 1.13, 'Racing': 1.02, 'Sports': 3.33, 'Casual': 1.77, 'Simulation': 2.08, 'Trivia': 0.42, 'Action': 3.12, 'Word': 0.25, 'Adventure': 0.65, 'Role Playing': 0.94, 'Strategy': 0.9, 'Music': 0.2, 'Action;Action & Adventure': 0.08, 'Casual;Brain Games': 0.14, 'Educational;Creativity': 0.04, 'Puzzle;Brain Games': 0.18, 'Educatio

In [675]:
google_play_category_freq = create_freq_table(google_play_english, 1)

In [676]:
print(google_play_category_freq)

{'ART_AND_DESIGN': 0.67, 'AUTO_AND_VEHICLES': 0.94, 'BEAUTY': 0.63, 'BOOKS_AND_REFERENCE': 2.19, 'BUSINESS': 4.71, 'COMICS': 0.57, 'COMMUNICATION': 3.22, 'DATING': 1.83, 'EDUCATION': 1.17, 'ENTERTAINMENT': 0.94, 'EVENTS': 0.71, 'FINANCE': 3.73, 'FOOD_AND_DRINK': 1.2, 'HEALTH_AND_FITNESS': 3.13, 'HOUSE_AND_HOME': 0.81, 'LIBRARIES_AND_DEMO': 0.9, 'LIFESTYLE': 3.89, 'GAME': 9.61, 'FAMILY': 18.8, 'MEDICAL': 3.64, 'SOCIAL': 2.66, 'SHOPPING': 2.25, 'PHOTOGRAPHY': 3.01, 'SPORTS': 3.26, 'TRAVEL_AND_LOCAL': 2.31, 'TOOLS': 8.58, 'PERSONALIZATION': 3.31, 'PRODUCTIVITY': 3.97, 'PARENTING': 0.65, 'WEATHER': 0.8, 'VIDEO_PLAYERS': 1.76, 'NEWS_AND_MAGAZINES': 2.79, 'MAPS_AND_NAVIGATION': 1.36}


In [677]:
app_store_genre_freq = create_freq_table(app_store_english, 11)

In [678]:
print(app_store_genre_freq)

{'Social Networking': 3.12, 'Photo & Video': 5.14, 'Games': 59.14, 'Music': 2.16, 'Reference': 0.51, 'Health & Fitness': 1.99, 'Weather': 0.89, 'Travel': 1.13, 'Shopping': 2.5, 'News': 1.34, 'Navigation': 0.14, 'Lifestyle': 1.47, 'Entertainment': 7.53, 'Food & Drink': 0.89, 'Sports': 2.05, 'Finance': 1.1, 'Education': 3.84, 'Productivity': 1.71, 'Utilities': 2.26, 'Book': 0.27, 'Business': 0.51, 'Catalogs': 0.1, 'Medical': 0.21}


We now have a set of frequency tables which enable us to view the percentage of apps that belong to each genre/category.

However, it's a bit hard to easily make out which apps hold a large portion of the marketplace, so we can define another function to sort our frequency tables.

In [727]:
def sort_table(table, index_end = None):
    if index_end == None:
        index_end = len(table) - 1
        
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted[:index_end]:
        print(entry[1], ':', entry[0])

In [728]:
sort_table(google_play_genre_freq)

Tools : 8.56
Entertainment : 6.09
Education : 5.39
Business : 4.71
Productivity : 3.97
Lifestyle : 3.88
Finance : 3.73
Medical : 3.64
Sports : 3.33
Personalization : 3.31
Communication : 3.22
Health & Fitness : 3.13
Action : 3.12
Photography : 3.01
News & Magazines : 2.79
Social : 2.66
Travel & Local : 2.31
Shopping : 2.25
Books & Reference : 2.19
Simulation : 2.08
Dating : 1.83
Arcade : 1.83
Casual : 1.77
Video Players & Editors : 1.74
Maps & Navigation : 1.36
Food & Drink : 1.2
Puzzle : 1.13
Racing : 1.02
Role Playing : 0.94
Auto & Vehicles : 0.94
Strategy : 0.9
Libraries & Demo : 0.9
House & Home : 0.81
Weather : 0.8
Events : 0.71
Adventure : 0.65
Beauty : 0.63
Art & Design : 0.62
Comics : 0.56
Parenting : 0.5
Trivia : 0.42
Educational;Education : 0.42
Card : 0.4
Educational : 0.38
Casino : 0.38
Board : 0.37
Education;Education : 0.34
Word : 0.25
Music : 0.2
Casual;Pretend Play : 0.19
Puzzle;Brain Games : 0.18
Racing;Action & Adventure : 0.14
Casual;Brain Games : 0.14
Entertainment;

In [729]:
sort_table(google_play_category_freq)

FAMILY : 18.8
GAME : 9.61
TOOLS : 8.58
BUSINESS : 4.71
PRODUCTIVITY : 3.97
LIFESTYLE : 3.89
FINANCE : 3.73
MEDICAL : 3.64
PERSONALIZATION : 3.31
SPORTS : 3.26
COMMUNICATION : 3.22
HEALTH_AND_FITNESS : 3.13
PHOTOGRAPHY : 3.01
NEWS_AND_MAGAZINES : 2.79
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.31
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.19
DATING : 1.83
VIDEO_PLAYERS : 1.76
MAPS_AND_NAVIGATION : 1.36
FOOD_AND_DRINK : 1.2
EDUCATION : 1.17
ENTERTAINMENT : 0.94
AUTO_AND_VEHICLES : 0.94
LIBRARIES_AND_DEMO : 0.9
HOUSE_AND_HOME : 0.81
WEATHER : 0.8
EVENTS : 0.71
ART_AND_DESIGN : 0.67
PARENTING : 0.65
BEAUTY : 0.63


In [730]:
sort_table(app_store_genre_freq)

Games : 59.14
Entertainment : 7.53
Photo & Video : 5.14
Education : 3.84
Social Networking : 3.12
Shopping : 2.5
Utilities : 2.26
Music : 2.16
Sports : 2.05
Health & Fitness : 1.99
Productivity : 1.71
Lifestyle : 1.47
News : 1.34
Travel : 1.13
Finance : 1.1
Weather : 0.89
Food & Drink : 0.89
Reference : 0.51
Business : 0.51
Book : 0.27
Medical : 0.21
Navigation : 0.14


**With a set of sorted tables, we can more easily analyze the data.**

When analyzing the sorted tables, we can try to answer some of the following questions to aid our analysis:
- What is the most common genre? What is the next most common?
- What obvious patterns are there?

Let's analyze the top 10 genre/categories for each marketplace.

In [731]:
sort_table(google_play_category_freq, index_end = 10)

FAMILY : 18.8
GAME : 9.61
TOOLS : 8.58
BUSINESS : 4.71
PRODUCTIVITY : 3.97
LIFESTYLE : 3.89
FINANCE : 3.73
MEDICAL : 3.64
PERSONALIZATION : 3.31
SPORTS : 3.26


In [732]:
sort_table(google_play_genre_freq, index_end = 10)

Tools : 8.56
Entertainment : 6.09
Education : 5.39
Business : 4.71
Productivity : 3.97
Lifestyle : 3.88
Finance : 3.73
Medical : 3.64
Sports : 3.33
Personalization : 3.31


In [733]:
sort_table(app_store_genre_freq, index_end = 10)

Games : 59.14
Entertainment : 7.53
Photo & Video : 5.14
Education : 3.84
Social Networking : 3.12
Shopping : 2.5
Utilities : 2.26
Music : 2.16
Sports : 2.05
Health & Fitness : 1.99


**Initial Observations:**
- The most common genres in both marketplaces appear to be Entertainment, Games, and Education apps.
- Tools and Utilities also appear to be among the most common apps
- Google Play has an even spread amongst all app genres, while Games dominate the App Store genres.

***
### Apps with the Most Users

Next, we'd like to determine the type of apps that have the  most users.

We can find this information using the `Installs` column for Google Play: index `5`, and (since there is not an equivalent metric for app installs for the App Store) the `rating_count_tot` column for the App Store: index `5`.

In [734]:
def print_installs_freq_table(original_table, original_dataset, index_genre, index_installs):    
    genre_installs_freq_table = {}
    for genre in original_table:
        genre_installs_freq_table[genre] = 0

    for genre in genre_installs_freq_table:
        for app in original_dataset[1:]:
            app_genre = app[index_genre]
            installs = string_to_float(app[index_installs])
            if app_genre == genre:
                genre_installs_freq_table[genre] += installs

    print(sort_table(genre_installs_freq_table))

In [735]:
print_installs_freq_table(app_store_genre_freq, app_store_english, 11, 5)

Games : 37278366.0
Social Networking : 7149625.0
Photo & Video : 4387465.0
Music : 3489949.0
Entertainment : 3301370.0
Shopping : 2108063.0
Sports : 1547500.0
Reference : 1343439.0
Weather : 1255165.0
Productivity : 1142111.0
Health & Fitness : 1126280.0
Travel : 1125814.0
News : 911905.0
Food & Drink : 866682.0
Finance : 833238.0
Utilities : 763732.0
Lifestyle : 742203.0
Education : 683588.0
Navigation : 500149.0
Book : 133368.0
Business : 102594.0
Catalogs : 15585.0
None


In [736]:
print_installs_freq_table(google_play_category_freq, google_play_english, 1, 5)

GAME : 12471347340.0
COMMUNICATION : 9784905491.0
TOOLS : 7991804304.0
FAMILY : 5744891309.0
PRODUCTIVITY : 5668814314.0
SOCIAL : 5474803752.0
PHOTOGRAPHY : 4579118815.0
VIDEO_PLAYERS : 3734721720.0
TRAVEL_AND_LOCAL : 2810583086.0
NEWS_AND_MAGAZINES : 2351483110.0
BOOKS_AND_REFERENCE : 1564873260.0
PERSONALIZATION : 1397507888.0
SHOPPING : 1381178585.0
HEALTH_AND_FITNESS : 1121337892.0
SPORTS : 999453417.0
ENTERTAINMENT : 975360000.0
BUSINESS : 634771490.0
MAPS_AND_NAVIGATION : 490705280.0
LIFESTYLE : 449722219.0
FINANCE : 423342632.0
WEATHER : 349687520.0
FOOD_AND_DRINK : 199468651.0
EDUCATION : 180800000.0
DATING : 117803757.0
ART_AND_DESIGN : 108221100.0
HOUSE_AND_HOME : 94602361.0
LIBRARIES_AND_DEMO : 51293710.0
AUTO_AND_VEHICLES : 50980061.0
COMICS : 42261150.0
MEDICAL : 36480344.0
PARENTING : 29961010.0
BEAUTY : 27197050.0
None


***We used the Google Play category frequency table, as it is less granular and more comparable to the App Store prime_genre metric.***

One thing to notice is that the number of installs for the Google Play store are massive compared to the App Store numbers. This can be explained that the Android mobile OS accounts for ~72% of the global market share, as compared to ~26% for iOS. [Source](https://gs.statcounter.com/os-market-share/mobile/worldwide)

***
### App Profile Recommendations

Looking at the types of mobile apps in both marketplaces that seem to garner the most installations or reviews, it's clear that Gaming apps dominate. Social Networking/Communication apps also understandably are amongst the top.

One such recommendation that can be made for our company is to develop an app that can reside within these two categories.

A mobile gaming app that is multiplayer in nature could be a lucrative app as it 