# App Store and Google Play Market Analysis

- This project is aimed at analyzing the profiles of applications available on the Apple iOS App Store and Google Play markets
- For fun, I am assuming I work for a company that builds free apps directed towards an English-speaking audience 
- As this company's data analyst, I am tasked with determining which application profiles will likely attract the most users


# Table of Contents

(Links not supported in GitHub)

1. [Dataset Introduction](#introduction)
2. [Dataset Exploration](#paragraph1)
3. [Data Cleaning](#paragraph2)
    1. [Missing Data Points](#subparagraph1)
    2. [Removing Duplicates](#subparagraph2)
    3. [Removing Non-English Apps](#subparagraph3)
    4. [Removing Paid Apps](#subparagraph4)
4. [Most Common Apps by Genre](#paragraph3)
5. [Most Popular Apps by Genre](#paragraph4)
6. [Findings and Recommendations](#paragraph5)

---

<a name="introduction"></a>

## 1. Dataset Introduction
The datasets used for this analysis are as follows:

---

**[Mobile App Statistics (Apple iOS app store)](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)**
- This data set contains details on more than 7,000 Apple iOS mobile applications.
- "Apple" data set

**Columns are as follows:**

In [1]:
from csv import reader

file1 = open('AppleStore.csv', encoding='utf8')
reader1 = reader(file1)
apple = list(reader1)

for item in apple[0][:-1]:
    print(item + ", ", end="")
print(apple[0][-1])

id, track_name, size_bytes, currency, price, rating_count_tot, rating_count_ver, user_rating, user_rating_ver, ver, cont_rating, prime_genre, sup_devices.num, ipadSc_urls.num, lang.num, vpp_lic


--- 

**[Google Play Store Apps](https://www.kaggle.com/lava18/google-play-store-apps)**
- Details on some applications available on the Google Play store.
- Each app (row) has values for category, rating, size, and more.
- "Google" data set

**Columns are as follows:**

In [2]:
file2 = open('googleplaystore.csv', encoding='utf8')
reader2 = reader(file2)
google = list(reader2)

for item in google[0][:-1]:
    print(item + ", ", end="")
print(google[0][-1])

App, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, Genres, Last Updated, Current Ver, Android Ver
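Both files above are loaded the same way, so the pattern can be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: `load_csv` is a hypothetical name, and it uses a `with` block so the file is closed once the rows have been read.

```python
import csv

def load_csv(path, encoding="utf8"):
    """Read a CSV file into a list of rows (each row is a list of strings)."""
    # newline="" is the setting the csv module documentation recommends
    # when passing a file object to csv.reader
    with open(path, encoding=encoding, newline="") as f:
        return list(csv.reader(f))

# Hypothetical usage with the file names from this notebook:
# apple = load_csv("AppleStore.csv")
# google = load_csv("googleplaystore.csv")
```

The first row of each returned list is the header, matching how `apple[0]` and `google[0]` are used throughout this analysis.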


---
<a name="paragraph1"></a>

## 2. Dataset Exploration

In [3]:
def print_data(dataset, start, end, rows_and_columns=True, header=True):        
    
    #Takes chunk of data to display
    ds_slice= dataset[start:end]
    for row in ds_slice:
        print(row)
    
    print()
    
    #Prints number of rows and columns in dataset by default, excluding the header row
    if rows_and_columns:
        if header:
            print('Number of rows:', len(dataset[1:]))
        else:
            print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
    print()

#Calls function using datasets defined earlier
print("Apple Dataset:\n")
print_data(apple, 1, 5)

print("Google Dataset:\n")
print_data(google, 1, 5)


Apple Dataset:

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']
['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']

Number of rows: 7197
Number of columns: 16

Google Dataset:

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
['U Lau

---
<a name="paragraph2"></a>

## 3. Initial Data Cleaning

<a name="subparagraph1"></a>

### 3.1) Checking for and fixing any rows in either dataset with missing data points

In [4]:
#Detect if any rows are missing data points
def detect_invalid_rows(dataset):
    invalid_rows = []
    num_cols = len(dataset[0])
    for row in dataset:
        if len(row) != num_cols:
            invalid_rows.append(row)
    return invalid_rows

print("Rows with incomplete data in Apple dataset: ")
bad_apple_rows = detect_invalid_rows(apple)
print(bad_apple_rows, "\n")
print("Rows with incomplete data in Google dataset: ")
bad_google_rows = detect_invalid_rows(google)
print(bad_google_rows, "\n")

Rows with incomplete data in Apple dataset: 
[] 

Rows with incomplete data in Google dataset: 
[['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']] 



---
- No rows in the Apple dataset with incomplete data
- 1 row in the Google dataset has incomplete data. Let's take a closer look at this one
---

In [5]:
#Checking the 'google' dataset header to determine which data point is missing
print(google[0])
print(bad_google_rows[0])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


---
- Missing data point appears to be the "Category". Checking the available categories:
---

In [6]:
def get_header_col_values(dataset, column):
    all_cols = []
    
    #searches through the value present at the [column] index in each row and adds unique values to a list 
    for row in dataset:
        if row[column] not in all_cols:
            all_cols.append(row[column])
    return all_cols

print(get_header_col_values(google, 1))

['Category', 'ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY', 'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION', 'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE', 'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME', 'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL', 'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL', 'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER', 'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION', '1.9']


---
- App name is "Life Made WI-Fi Touchscreen Photo Frame"
- Likely can be categorized under "Lifestyle"
---

In [7]:
#Find the index of the row missing data in Google's dataset 
idx = google.index(bad_google_rows[0])

#Insert new category to this row, shifting all other elements to the right by 1
google[idx].insert(1, "LIFESTYLE")

print(google[idx])

['Life Made WI-Fi Touchscreen Photo Frame', 'LIFESTYLE', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


---
- Row now has a category and lines up with the header, so it will not prevent accurate analysis (its 'Genres' value is still an empty string)
---
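The misalignment above was spotted by comparing the header and the short row by eye. As a rough sketch of the same check in code, each header name can be paired with the value at the same position. `pair_with_header` and the sample row are made up for illustration.

```python
from itertools import zip_longest

def pair_with_header(header, row):
    """Pair each header name with the value at the same position in a row."""
    # zip_longest pads the shorter sequence, so a short row shows "<missing>"
    # at the end instead of silently dropping the last header column
    return list(zip_longest(header, row, fillvalue="<missing>"))

header = ["App", "Category", "Rating", "Reviews"]
short_row = ["Some App", "1.9", "19"]  # made-up row with no category value

for name, value in pair_with_header(header, short_row):
    print(f"{name:>10}: {value}")
```

A numeric value sitting under `Category` is the hint that every value after the app name has shifted one column to the left.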

<a name="subparagraph2"></a>

### 3.2) Checking for and removing duplicate entries from each dataset

In [8]:
#Takes a dataset and the index of the column that holds the app name
#Iterates through the dataset and checks whether the same app name appears more than once
#Returns two lists: the unique app names and the names seen again (duplicates)
def detect_duplicates(dataset, app_name_col):
    
    #Create 2 lists to store unique and duplicate app names
    unique = []
    duplicates = []
    
    #exclude the header row in datasets
    for row in dataset[1:]:
        #Grab the app name
        app = row[app_name_col]

        #if the app name is already in the unique list, then this row is a duplicate
        if app in unique:
            duplicates.append(app)

        #otherwise this is the first time we've seen the name, so record it as unique
        else:
            unique.append(app)
            
    return (unique, duplicates)

#Run the function on both Apple and Google datasets
#These variables will have a tuple of lists with [0] being the list of unique apps and [1] the list of duplicate apps
apple_duplicates = detect_duplicates(apple, 1)
google_duplicates = detect_duplicates(google, 0)

print("Number of unique entries:")
print("Apple dataset:", len(apple_duplicates[0]), "out of", len(apple[1:]), "total rows")
print("Google dataset:", len(google_duplicates[0]), "out of", len(google[1:]), "total rows")

print()

print("Number of duplicate entries:")
print("Apple dataset:", len(apple_duplicates[1]))
print("Google dataset:", len(google_duplicates[1]))

Number of unique entries:
Apple dataset: 7195 out of 7197 total rows
Google dataset: 9660 out of 10841 total rows

Number of duplicate entries:
Apple dataset: 2
Google dataset: 1181


---
- There are some apps which appear more than once in each dataset. Let's check them out.
---
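The duplicate counts can also be cross-checked with `collections.Counter`, which additionally shows how many times each repeated name occurs. This is a sketch on made-up rows, and `count_duplicates` is a hypothetical helper rather than part of the notebook.

```python
from collections import Counter

def count_duplicates(dataset, app_name_col):
    """Return {app name: occurrences} for names that appear more than once."""
    # Skip the header row, then tally every app name in a single pass
    counts = Counter(row[app_name_col] for row in dataset[1:])
    return {name: n for name, n in counts.items() if n > 1}

sample = [
    ["App", "Reviews"],
    ["A", "10"],
    ["B", "5"],
    ["A", "12"],
    ["A", "9"],
]

dupes = count_duplicates(sample, 0)
extra_rows = sum(n - 1 for n in dupes.values())  # rows beyond the first copy
print(dupes)
print("Extra rows:", extra_rows)
```

The `extra_rows` figure corresponds to the length of the duplicates list returned by `detect_duplicates`, since that list records every repeat after the first occurrence.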

In [9]:
print(apple_duplicates[1])

['Mannequin Challenge', 'VR Roller Coaster']



- Two apps had multiple entries in the Apple Store data set. Let's see how (or if) those entries differ:


In [10]:
#Grab the dataset header
print(apple[0], "\n")

#Print all rows which contain app names with duplicate entries
for row in apple:
    if row[1] in apple_duplicates[1]:
        print(row)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']
['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']


---
- In order to clean up these duplicates, we'll want a data point that is common to both datasets and easy to parse. We'll use the total rating count in this case, on the assumption that the entry with the most total ratings is the most recent one.

- We will keep the entry for each app with the most ratings or reviews (index 5 in the "Apple" dataset, index 3 in the "Google" dataset)
---
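The keep-the-highest-count rule described above can be sketched in a single pass by storing the best row seen so far for each app name. This is an illustrative alternative, not the notebook's implementation, and `dedupe_keep_max` is a hypothetical name.

```python
def dedupe_keep_max(dataset, name_col, count_col):
    """Keep one row per app name: the row with the highest count."""
    header, rows = dataset[0], dataset[1:]
    best = {}  # app name -> row with the highest count seen so far
    for row in rows:
        name = row[name_col]
        # Strictly greater, so the earliest row wins when counts are tied
        if name not in best or float(row[count_col]) > float(best[name][count_col]):
            best[name] = row
    return [header] + list(best.values())

sample = [
    ["App", "Reviews"],
    ["A", "10"],
    ["B", "5"],
    ["A", "12"],
]
print(dedupe_keep_max(sample, 0, 1))
```

One difference to be aware of: rows come out in the order each name first appears, not in the position of the row that was kept.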

In [11]:
#Function to take in a dataset and columns for "app name" and "total rating count"
#Function will return a list of apps which have the highest total rating count, removing the duplicate entries with lower rating counts

def remove_duplicates(dataset, app_name_col, total_rating_col):

    #Dictionary with format {App Name: total_rating_count} to find the apps with highest rating totals
    keepers = {}

    #Iterate through dataset, excluding header row
    for app in dataset[1:]:

        #grab the 2 data points: app name and total ratings count
        app_name = app[app_name_col]
        rating_ct = float(app[total_rating_col])

        #if we detect a duplicate app name
        #and if the rating count of this app is greater than the app in 'keepers' dictionary
        if app_name in keepers and keepers[app_name] < rating_ct:

            #then replace the entry in keepers dictionary
            keepers[app_name] = rating_ct

        #if app has not been detected yet, add it to the keepers dictionary
        elif app_name not in keepers:
            keepers[app_name] = rating_ct
            
    #We now have a dictionary with {App Name: Total rating count} which has only 'kept' the highest total rating count for each app
    #Let's use this dictionary to reconstruct the initial dataset, excluding the duplicate entries
    dedup = []
    reviewed = []
    
    #Grab the header from dataset
    dedup.append(dataset[0])

    #Loop through dataset excluding header
    for app in dataset[1:]:
        
        #Reinitiate app name and ratings count variables
        app_name = app[app_name_col]
        ratings_ct = float(app[total_rating_col])

        #If the app is a 'keeper' and it has not already been reviewed during this for loop
        if (keepers[app_name] == ratings_ct) and (app_name not in reviewed):
            
            #then add this app to our new dataset (dedup) and to the 'reviewed' list so we remember which apps we've seen already
            dedup.append(app)
            reviewed.append(app_name)

    #return the new dataset with duplicates removed
    return dedup
    
apple_dedup = remove_duplicates(apple, 1, 5)
print_data(apple_dedup, 0 , 5)
print()
google_dedup = remove_duplicates(google, 0, 3)
print_data(google_dedup, 0 , 5)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']
['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']

Number of rows: 7195
Number of columns: 16


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Ca

---
<a name="subparagraph3"></a>

### 3.3) Detecting and removing non-English apps

In [12]:
#If more than 3 characters in the app name fall outside the ASCII range, the app is likely not meant for an English-speaking audience
#The heuristic is not perfect: the threshold of 3 is arbitrary, chosen so that legitimate English apps with a few emojis or "™" characters are still included
def is_eng_app(name):
    
    #Counter for characters outside the ASCII range
    num_non_eng_chars = 0
    
    #iterate through each character in the app name string
    for letter in name:
        
        #ASCII covers code points 0-127, so a code point above 127 is not a standard English character
        #then add 1 to the counter
        if ord(letter) > 127:
            num_non_eng_chars += 1
    
    #if more than 3 such characters are found, return False; otherwise return True
    return num_non_eng_chars <= 3

#Tests to show output
print(is_eng_app('instagram'))
print(is_eng_app('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_eng_app('Instachat 😜'))
print(is_eng_app('Docs To Go™ Free Office Suite'))

True
False
True
True


---
- Now that we have the heuristic, we'll create new datasets with non-English apps removed
---
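Because the cutoff of 3 is arbitrary, it can help to make it a parameter and see how the result changes. The sketch below is a compact variant of the same heuristic. `is_probably_english` and `max_non_ascii` are hypothetical names, not part of the notebook.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if the name has at most `max_non_ascii` non-ASCII characters."""
    # ASCII covers code points 0-127; anything above that counts against the name
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

names = ["instagram", "Instachat 😜", "Docs To Go™ Free Office Suite"]
for name in names:
    # Compare the default threshold with a stricter one that allows no non-ASCII characters
    print(name, is_probably_english(name), is_probably_english(name, max_non_ascii=0))
```

With a threshold of 0, names containing a single emoji or trademark symbol are rejected, which is the case the notebook's choice of 3 is meant to avoid.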

In [13]:
def remove_non_eng(dataset, app_name_col):
    
    #Used to store all apps likely meant for english audience
    eng_dataset = []
    
    #Grab header for new dataset
    eng_dataset.append(dataset[0])
    
    #loop through all apps in dataset, excluding header, and add those which are english to the new dataset
    for app in dataset[1:]:
        app_name = app[app_name_col]
        if is_eng_app(app_name):
            eng_dataset.append(app)
            
    return eng_dataset

apple_dedup_eng = remove_non_eng(apple_dedup, 1)
google_dedup_eng = remove_non_eng(google_dedup, 0)

print_data(apple_dedup_eng, 0 , 3)
print_data(google_dedup_eng, 0 , 3)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']

Number of rows: 6181
Number of columns: 16

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,0

---
<a name="subparagraph4"></a>

### 3.4) Detecting and removing paid apps

In [14]:
def get_free_apps(dataset, price_col):
    
    free_apps = []
    free_apps.append(dataset[0])
    
    for app in dataset[1:]:
       
        price = app[price_col]
        
        if price == '0' or price == '0.0':
            free_apps.append(app)
         
    return free_apps

apple_final = get_free_apps(apple_dedup_eng, 4)
google_final = get_free_apps(google_dedup_eng, 7)

print_data(apple_final, 0, 3)
print_data(google_final, 0 , 3)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']

Number of rows: 3220
Number of columns: 16

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,0

---
##### We will now be using the cleaned datasets "apple_final" and "google_final" for analysis
- Recall the scenario: our company builds free apps directed towards an English-speaking audience, and its revenue depends on the number of users on each app
- As this company's data analyst, I am tasked with determining which application profiles will likely attract the most users
- We will begin this analysis in the next section
---

<a name="paragraph3"></a>

## 4. Most Common App Genres per Market
- We will first determine which genre of app is the most common

In [15]:
#This function will return the % frequency of values found in the dataset according to the 'column' provided
def get_frequencies(dataset, column):
    
    #total number of values on which to calculate percentage is # of rows - the header row
    total = len(dataset) - 1
    
    #empty dictionary to store the frequencies
    column_values = {}
    
    #loop through all entries in dataset
    for item in dataset[1:]:
        
        #create variable for the value we find in this row of the dataset
        value = item[column]
        
        #if we have already seen this value before, add 1 to the dictionary value
        if value in column_values:
            column_values[value] += 1
        
        #if we have not seen this value, add a new entry to the dictionary with value of 1
        else:
            column_values[value] = 1
    
    #for every key:value pair in the dictionary, calculate the percentage and store that as a new value in the dictionary
    for entry in column_values:
        value = column_values[entry]
        column_values[entry] = (value / total) * 100
        
    return column_values

#This function will simply print the dictionary created in previous function in an easy to read format
def print_frequencies(data, num_to_print, percentages=True):

    #will be list of tuples from 'data' dictionary
    list_data = []
    
    #populate list of (value, key) tuples
    for item in data:
        data_tuple = (data[item], item)
        list_data.append(data_tuple)
        
    #sort list of tuples based on value (not key) in 'data' dictionary
    sorted_data = sorted(list_data, reverse=True)
    
    if percentages:
        for item in sorted_data[:num_to_print]:
            print(item[1], ':', str(round(item[0], 2)) + '%')
    else:
        for item in sorted_data[:num_to_print]:
            print(item[1], ':', str(round(item[0], 2)))

In [16]:
print("Most common app 'prime_genre' in the Apple iOS App Store:")
print("-------------")
apple_freqs = get_frequencies(apple_final, 11)
print_frequencies(apple_freqs, 10)

Most common app 'prime_genre' in the Apple iOS App Store:
-------------
Games : 58.14%
Entertainment : 7.89%
Photo & Video : 4.97%
Education : 3.66%
Social Networking : 3.29%
Shopping : 2.61%
Utilities : 2.52%
Sports : 2.14%
Music : 2.05%
Health & Fitness : 2.02%


- 'Games' is by far the most common genre in the iOS App Store

In [17]:
print("Most common app 'Category' in the Google Play Store:")
print("-------------")
google_cat_freqs = get_frequencies(google_final, 1)
print_frequencies(google_cat_freqs, 10)

print()

print("Most common app 'Genre' in the Google Play Store:")
print("-------------")
google_genre_freqs = get_frequencies(google_final, 9)
print_frequencies(google_genre_freqs, 10)

Most common app 'Category' in the Google Play Store:
-------------
FAMILY : 18.91%
GAME : 9.72%
TOOLS : 8.46%
BUSINESS : 4.59%
LIFESTYLE : 3.91%
PRODUCTIVITY : 3.89%
FINANCE : 3.7%
MEDICAL : 3.53%
SPORTS : 3.4%
PERSONALIZATION : 3.32%

Most common app 'Genre' in the Google Play Store:
-------------
Tools : 8.45%
Entertainment : 6.07%
Education : 5.35%
Business : 4.59%
Productivity : 3.89%
Lifestyle : 3.89%
Finance : 3.7%
Medical : 3.53%
Sports : 3.46%
Personalization : 3.32%


- It is hard to tell the difference between these two columns in the 'Google' dataset, as they both seem to describe the genre of the app
- The free, English-language apps in the Google Play Store appear to be spread more evenly across genres than those in the iOS App Store
- 'Games' only make up ~9-10% of this population, whereas they make up 58% of the same population in the iOS App Store
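
The percentage tables above can be reproduced more compactly with `collections.Counter`. This is a sketch on made-up rows. `frequency_table` is a hypothetical helper and not a replacement for the notebook's `get_frequencies`.

```python
from collections import Counter

def frequency_table(dataset, column):
    """Return [(value, percentage)] sorted from most to least common."""
    # Skip the header row and count how often each value appears
    counts = Counter(row[column] for row in dataset[1:])
    total = sum(counts.values())
    return [(value, 100 * n / total) for value, n in counts.most_common()]

sample = [
    ["App", "Genre"],
    ["A", "Games"],
    ["B", "Games"],
    ["C", "Music"],
    ["D", "Games"],
]

for genre, pct in frequency_table(sample, 1):
    print(f"{genre} : {pct:.2f}%")
```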

---
- This data is showing us the most common apps which have been published to the respective app stores; however, we are interested in the apps which have the most users
- We'll use the functions above to determine the genres in each market which have the most user activity
---

<a name="paragraph4"></a>

## 5. Most Popular App Genres per Market
- Now we will determine which app genres have the most user activity

- For the 'Apple' dataset, "most user activity" will be measured by the total number of ratings given to each genre ('rating_count_tot')
- For the 'Google' dataset, "most user activity" will be measured by the total number of installs of apps in each genre ('Installs')
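
Both measures come down to converting a text column to numbers and averaging it per genre. The sketch below does this in a single pass over the rows. `parse_count` and `average_by_group` are hypothetical helpers, and the sample rows are made up. The `+` suffix on `Installs` values suggests they are lower bounds rather than exact counts, so the averages should be read as rough estimates.

```python
from collections import defaultdict

def parse_count(text):
    """Turn strings such as '10,000+' or '2974676' into a float."""
    return float(text.replace(",", "").replace("+", ""))

def average_by_group(dataset, group_col, value_col):
    """Average the numeric value column for each distinct group value."""
    totals = defaultdict(float)  # group -> running sum of values
    counts = defaultdict(int)    # group -> number of rows seen
    for row in dataset[1:]:      # skip the header row
        totals[row[group_col]] += parse_count(row[value_col])
        counts[row[group_col]] += 1
    return {group: totals[group] / counts[group] for group in totals}

sample = [
    ["App", "Category", "Installs"],
    ["A", "TOOLS", "10,000+"],
    ["B", "TOOLS", "500,000+"],
    ["C", "GAME", "1,000+"],
]
averages = average_by_group(sample, 1, 2)
print(averages)
```

Grouping in one pass avoids rescanning the whole dataset once per genre, which matters little at this size but scales better.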

In [18]:
#First, we will get all unique genres in each dataset
#Remembering that the 'google' dataset had 2 columns that both appear to describe the app genre
apple_genres = list(get_frequencies(apple_final, 11).keys())
google_categories = list(get_frequencies(google_final, 1).keys())
google_genres = list(get_frequencies(google_final, 9).keys())

In [19]:
#This function will return a dictionary in the form {genre: average value of (activity_column) per app in this genre}
#The activity column holds total ratings for the 'Apple' dataset and installs for the 'Google' dataset
def get_user_activity(dataset, genres, genre_column, activity_column):
    
    #initialize this dictionary
    avg_genre_ratings = {}
    
    #loop through list of unique genres in dataset
    for genre in genres:
        
        #initialize variables for counting the total number of ratings found per genre
        sum_ratings = 0
        #and for the total number of apps associated with each genre
        sum_apps = 0
        
        #for every genre, loop through the original dataset
        for app in dataset[1:]:
            
            #grab the genre of given app
            app_genre = app[genre_column]
            
            #if it matches the genre of the outer loop
            if app_genre == genre:
                
                #convert value in activity column to a computational float
                activity = app[activity_column]
                activity = activity.replace(',', '')
                activity = activity.replace('+', '')
                
                #then add the # of its total ratings to our sum variable
                sum_ratings += float(activity)
                
                #add 1 to sum_apps to track how many apps per genre
                sum_apps += 1
                
        #calculate average number of ratings per genre and store in dictionary
        avg_num_ratings = sum_ratings / sum_apps
        avg_genre_ratings[genre] = avg_num_ratings
        
    return avg_genre_ratings

---
Let's run this for Apple

In [20]:
apple_activity = get_user_activity(apple_final, apple_genres, genre_column=11, activity_column=5)
print_frequencies(apple_activity, 5, percentages=False)

Navigation : 86090.33
Reference : 74942.11
Social Networking : 71548.35
Music : 57326.53
Weather : 52279.89


- The 5 genres with the highest average number of ratings per app are shown above. These averages may be skewed by the 1-3 apps in each genre that receive far more ratings than the rest
- Examples include 'Google Maps' and 'Waze' in Navigation, and the 'Bible' and 'Dictionary.com' apps in Reference

However, we have seen that the iOS market is flooded with games. It may be wiser for our company to invest time in developing an app in a genre with an engaged user base, such as those above.
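
To illustrate how a few very large apps can pull a genre's average upward, the sketch below compares the mean and the median of a made-up list of rating counts. The numbers are invented for illustration and are not taken from either dataset. `summarize` is a hypothetical helper.

```python
from statistics import mean, median

def summarize(values):
    """Return the mean and median of a list of numbers."""
    return {"mean": mean(values), "median": median(values)}

# Made-up rating counts: four small apps and one very large one
rating_counts = [120, 300, 450, 800, 2_000_000]

summary = summarize(rating_counts)
print(summary)
```

When the mean is far above the median, as it is here, a handful of outliers is driving the average, and the median is a better indication of what a typical app in the genre receives.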

---

Let's take a look at the Google Play Store before making our recommendation:

In [21]:
google_cat_activity = get_user_activity(google_final, google_categories, genre_column=1, activity_column=5)
print_frequencies(google_cat_activity, 5, percentages=False)

print()

google_genre_activity = get_user_activity(google_final, google_genres, genre_column=9, activity_column=5)
print_frequencies(google_genre_activity, 5, percentages=False)

COMMUNICATION : 38456119.17
VIDEO_PLAYERS : 24727872.45
SOCIAL : 23253652.13
PHOTOGRAPHY : 17840110.4
PRODUCTIVITY : 16787331.34

Communication : 38456119.17
Adventure;Action & Adventure : 35333333.33
Video Players & Editors : 24947335.8
Social : 23253652.13
Arcade : 22888365.49


Across all three rankings (Apple genres, Google categories, and Google genres), only one genre appears in every top 5: social networking. This genre is likely dominated by established apps such as Facebook and Instagram. Let's look at more data from all 3:

In [22]:
print("Apple iOS Store Top 10 Genres with Highest Average Number of Reviews per app")
print("--------")
print_frequencies(apple_activity, 10, percentages=False)
print()
print("Google Play Store Top 10 Categories with Highest Average Number of Installs per app")
print("--------")
print_frequencies(google_cat_activity, 10, percentages=False)
print()
print("Google Play Store Top 10 Genres with Highest Average Number of Installs per app")
print("--------")
print_frequencies(google_genre_activity, 10, percentages=False)

Apple iOS Store Top 10 Genres with Highest Average Number of Reviews per app
--------
Navigation : 86090.33
Reference : 74942.11
Social Networking : 71548.35
Music : 57326.53
Weather : 52279.89
Book : 39758.5
Food & Drink : 33333.92
Finance : 31467.94
Photo & Video : 28441.54
Travel : 28243.8

Google Play Store Top 10 Categories with Highest Average Number of Installs per app
--------
COMMUNICATION : 38456119.17
VIDEO_PLAYERS : 24727872.45
SOCIAL : 23253652.13
PHOTOGRAPHY : 17840110.4
PRODUCTIVITY : 16787331.34
GAME : 15588015.6
TRAVEL_AND_LOCAL : 13984077.71
ENTERTAINMENT : 11640705.88
TOOLS : 10801391.3
NEWS_AND_MAGAZINES : 9549178.47

Google Play Store Top 10 Genres with Highest Average Number of Installs per app
--------
Communication : 38456119.17
Adventure;Action & Adventure : 35333333.33
Video Players & Editors : 24947335.8
Social : 23253652.13
Arcade : 22888365.49
Casual : 19569221.6
Puzzle;Action & Adventure : 18366666.67
Photography : 17840110.4
Educational;Action & Adventure :

---

Reprinting the most common apps per genre for reference:

In [23]:
print("Most common app 'prime_genre' in the Apple iOS App Store:")
print("-------------")
apple_freqs = get_frequencies(apple_final, 11)
print_frequencies(apple_freqs, 10)

print()

print("Most common app 'Category' in the Google Play Store:")
print("-------------")
google_cat_freqs = get_frequencies(google_final, 1)
print_frequencies(google_cat_freqs, 10)

print()

print("Most common app 'Genre' in the Google Play Store:")
print("-------------")
google_genre_freqs = get_frequencies(google_final, 9)
print_frequencies(google_genre_freqs, 10)

Most common app 'prime_genre' in the Apple iOS App Store:
-------------
Games : 58.14%
Entertainment : 7.89%
Photo & Video : 4.97%
Education : 3.66%
Social Networking : 3.29%
Shopping : 2.61%
Utilities : 2.52%
Sports : 2.14%
Music : 2.05%
Health & Fitness : 2.02%

Most common app 'Category' in the Google Play Store:
-------------
FAMILY : 18.91%
GAME : 9.72%
TOOLS : 8.46%
BUSINESS : 4.59%
LIFESTYLE : 3.91%
PRODUCTIVITY : 3.89%
FINANCE : 3.7%
MEDICAL : 3.53%
SPORTS : 3.4%
PERSONALIZATION : 3.32%

Most common app 'Genre' in the Google Play Store:
-------------
Tools : 8.45%
Entertainment : 6.07%
Education : 5.35%
Business : 4.59%
Productivity : 3.89%
Lifestyle : 3.89%
Finance : 3.7%
Medical : 3.53%
Sports : 3.46%
Personalization : 3.32%


---

<a name="paragraph5"></a>

## 6. Findings and Recommendations

Looking at the results above, the following genres are both reasonably common and show a high average number of reviews or installs per app:

- Business
- Travel
- Shopping
- Food & Drink

**It would appear there is a demand for apps which aggregate data (such as listings of local stores, local restaurants, or local tourist destinations) and display this data in a central place.** Typically this data is freely available on the web, but scattered across multiple websites and business pages. 

This application would require some web scraping, but the high number of reviews suggests real potential for user interaction. If we were to publish an app in this space, we would likely get feedback early and often. This would allow us to respond to the community and gain market share in a space where users are interested in a new application.