# Profitable App Profiles for Apple Store and Google Play Markets

This case study analyzes data on free Android and iOS mobile apps. Free apps generate revenue through in-app ads: the more users an app has, the more people see and engage with ads, and the higher the revenue.

The objective is to help developers understand what types of apps are likely to attract more users.

## Explore the datasets

We have two data sources: one from the Google Play Store and one from the Apple App Store. We open each file and store its rows as a list of lists, keeping the header row separately.

In [1]:
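from csv import reader

def load_csv(path):
    # Read a CSV file and return the header row and the data rows separately.
    # The encoding is assumed to be UTF-8; the file is closed automatically.
    with open(path, encoding='utf8') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]

apple_ds_header, apple_ds = load_csv('AppleStore.csv')
google_ds_header, google_ds = load_csv('googleplaystore.csv')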

The following function will allow us to slice the data and explore the contents of each list.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # prints a newline on top of print's own, leaving blank lines between rows

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(google_ds, 0, 5, rows_and_columns = True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13


In [4]:
explore_data(apple_ds, 0, 4, rows_and_columns = False)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']




In [5]:
apple_ds_header

['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

In [6]:
google_ds_header

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

Based on the column headers, we can use Category, Genres, Rating, Reviews, and Installs to gauge popularity by app type. Content Rating can give further insight into the user population, which matters for targeted advertising.

# Data Cleansing


## Removing Duplicates
Before analyzing a dataset, we need to make sure the data is clean, with no (or minimal) discrepancies, so that the analysis is accurate and less prone to errors. This includes:
- removing duplicate records
- removing or correcting inaccurate data
- ensuring only relevant data is retained (eg. removing non-English apps)

First, check for rows whose number of columns differs from the header, and delete any that turn up. In some cases it may be worth keeping such a row instead, if the missing values can be inferred reliably from the rest of the data.

In [7]:
# Compare each row against the header length and report the row and its index
for i, row in enumerate(google_ds):
    if len(row) != len(google_ds_header):
        print(row)
        print(i)

print(google_ds[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10472
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [8]:
del google_ds[10472]
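
As an alternative to deleting the row, a malformed record could be repaired when the gap is obvious. The sketch below is only an illustration, assuming the sole problem is a single missing column at a known position. The `repair_row` helper and the `'UNKNOWN'` placeholder are hypothetical and are not used in the rest of this analysis.

```python
def repair_row(row, header, missing_index, placeholder='UNKNOWN'):
    """Return a copy of row with a placeholder inserted at missing_index.

    Assumes the row is exactly one column short of the header; any other
    row is returned unchanged so the caller can decide what to do with it.
    """
    if len(row) != len(header) - 1:
        return list(row)
    fixed = list(row)
    fixed.insert(missing_index, placeholder)
    return fixed

header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
          'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
          'Android Ver']

# The malformed row printed above: the Category value (index 1) is missing
bad_row = ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M',
           '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018',
           '1.0.19', '4.0 and up']

fixed_row = repair_row(bad_row, header, missing_index=1)
print(len(fixed_row))   # 13, matching the header
print(fixed_row[:3])
```

Whether a placeholder is acceptable depends on which columns the later analysis relies on; here the Category column is used heavily, so deleting the row is the simpler choice.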

Next, check the data for duplicate records. The loop below uses "Box" as an example: it appears in 3 rows of the Google data. The same kind of check is then run on a few app names in the Apple data.

In [9]:
for app in google_ds:
    name = app[0]
    if name == 'Box':
        print(app)

['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [10]:
# Print every Apple row whose name matches one of the apps we want to inspect
for target in ['Instagram', 'Mannequin Challenge', 'VR Roller Coaster']:
    for app in apple_ds:
        name = app[1]
        if name == target:
            print(app)

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']
['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']


We can expand on the above loop to go through the whole dataset and find every duplicated app name.

In [11]:
duplicate_apps = []
unique_apps = []

for app in google_ds:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:10])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


In [12]:
# Check Apple data for duplicates, using the app id (column 0)

duplicate_apps_apple = []
unique_apps_apple = []

for app in apple_ds:
    app_id = app[0]
    if app_id in unique_apps_apple:
        duplicate_apps_apple.append(app_id)
    else:
        unique_apps_apple.append(app_id)
        
print('Number of duplicate apps:', len(duplicate_apps_apple))
print('\n')
print('Examples of duplicate apps:', duplicate_apps_apple[:10])

Number of duplicate apps: 0


Examples of duplicate apps: []


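Since the id field in the Apple data is unique, we'll assume there are no duplicates in it.
Note: some names do repeat (for example, "Mannequin Challenge" and "VR Roller Coaster" above), but those rows have different ids, sizes, and versions, so they appear to be distinct apps. Further down the line, we may want to double-check using the name field as well.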
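
A name-based check could look like the sketch below. It is a minimal illustration using a few rows from the output above, trimmed to the id and name columns; the helper `names_with_multiple_ids` is a hypothetical name and not part of the original notebook.

```python
from collections import defaultdict

# A few rows from the output above, trimmed to (id, track_name)
sample_apple = [
    ['389801252', 'Instagram'],
    ['1173990889', 'Mannequin Challenge'],
    ['1178454060', 'Mannequin Challenge'],
    ['952877179', 'VR Roller Coaster'],
    ['1089824278', 'VR Roller Coaster'],
]

def names_with_multiple_ids(rows, id_index=0, name_index=1):
    """Map each app name that appears under more than one id to its ids."""
    ids_by_name = defaultdict(set)
    for row in rows:
        ids_by_name[row[name_index]].add(row[id_index])
    return {name: sorted(ids) for name, ids in ids_by_name.items()
            if len(ids) > 1}

repeated = names_with_multiple_ids(sample_apple)
for name, ids in repeated.items():
    print(name, ':', ids)
```

Names flagged this way still need a manual look, since two different apps can legitimately share a name.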

Several criteria can be used to remove duplicate data:
- Where multiple rows are identical, simply retain one of them
- Use the last-updated date to keep the latest values for an app
- Use the number of reviews to keep the row with the highest number of reviews
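
For comparison, the date-based criterion could be sketched as follows. This is an illustration only, not the approach used in this notebook. It assumes dates are formatted like 'January 7, 2018', as in the 'Last Updated' column shown earlier; the helper names and the two-column demo rows are hypothetical.

```python
from datetime import datetime

def parse_updated(text):
    """Parse a 'Last Updated' string such as 'January 7, 2018'."""
    return datetime.strptime(text, '%B %d, %Y')

def keep_latest(rows, name_index=0, date_index=10):
    """Keep, for each app name, the row with the most recent update date."""
    latest = {}
    for row in rows:
        name = row[name_index]
        if (name not in latest or
                parse_updated(row[date_index]) > parse_updated(latest[name][date_index])):
            latest[name] = row
    return list(latest.values())

# Hypothetical rows reduced to (name, last updated) for brevity
demo_rows = [
    ['Demo App', 'June 8, 2018'],
    ['Demo App', 'August 1, 2018'],
    ['Other App', 'January 7, 2018'],
]
latest_rows = keep_latest(demo_rows, date_index=1)
print(latest_rows)
```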

In this example we shall only retain the entries that have the highest number of reviews. We first create a dictionary with the name of the app and the highest number of reviews of the corresponding app. Then we'll use this dictionary to create a new data set to retain just the entries in the source data with the highest number of reviews. 

In [13]:
reviews_max = {}
for app in google_ds:
    name = app[0]
    n_reviews = float(app[3])    
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    elif (name not in reviews_max):
        reviews_max[name] = n_reviews

The length of the dictionary can be used to check the number of unique apps versus the expected result.

In [14]:
print(len(google_ds) - len(duplicate_apps))  # expected number of unique apps
print(len(reviews_max))

9659
9659


Now we loop through the Play Store data and build a clean dataset that keeps, for each app, only the row with the highest number of reviews, as recorded in the dictionary. We also track the apps already added, so that if several rows share the same maximum number of reviews, the app is added only once.

In [15]:
already_added = []
google_clean = []

for app in google_ds:
    name = app[0]
    n_reviews = float(app[3])
    
    if (name not in already_added) and (n_reviews == reviews_max[name]):
        already_added.append(name)
        google_clean.append(app)

# explore the google_clean dataset to ensure the duplicates were removed
print(len(google_clean))
explore_data(google_clean, 0, 4, True)

9659
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9659
Number of columns: 13
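
The two-pass approach above (build a dictionary of maximum review counts, then filter) could also be written as a single pass that stores the best row seen so far for each name. The sketch below assumes the same column layout (name at index 0, reviews at index 3); `dedupe_by_max_reviews` and the demo rows are hypothetical. Unlike the two-pass version, the result is ordered by each app's first appearance.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        if (name not in best or
                float(row[reviews_index]) > float(best[name][reviews_index])):
            best[name] = row
    return list(best.values())

# Hypothetical rows: (name, category, rating, reviews)
demo_rows = [
    ['Demo App', 'TOOLS', '4.1', '100'],
    ['Other App', 'TOOLS', '3.9', '50'],
    ['Demo App', 'TOOLS', '4.2', '250'],
    ['Demo App', 'TOOLS', '4.2', '250'],
]
clean_rows = dedupe_by_max_reviews(demo_rows)
print(clean_rows)
```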


Note: The Apple Store data does not have duplicated data. This can be confirmed by following the above series of steps.

## Removing non-English apps

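The analysis covers only apps directed at English-speaking audiences. The built-in ord function takes a character and returns its Unicode code point; code points 0 to 127 make up the ASCII range, which covers the characters commonly used in English text. App names containing characters outside this range are therefore likely to be non-English. Symbols such as ™ and emojis also fall outside it, so the function below tolerates up to three such characters.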

In [16]:
def eng_check(word):
    check = 0
    for a in word:
        if ord(a) > 127:
            check+=1
            
    if check > 3:  # adding a threshold for emojis and other characters
        return False
    else:
        return True
    
# test the function
# eng_check('爱奇艺PPS -《欢乐颂2》电视剧热播')
# eng_check('Instagram')
# eng_check('Instachat 😜')
eng_check('Docs To Go™ Free Office Suite')

True
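
To see how the threshold behaves, the sketch below counts the non-ASCII characters in each of the test names listed in the comments above. It restates the same rule in a compact form; `count_non_ascii` and `is_probably_english` are hypothetical names used only for this illustration.

```python
def count_non_ascii(text):
    """Count characters whose code point falls outside the ASCII range."""
    return sum(1 for ch in text if ord(ch) > 127)

def is_probably_english(text, max_non_ascii=3):
    """Same rule as eng_check: allow up to max_non_ascii non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii

samples = [
    'Instagram',
    'Instachat 😜',
    'Docs To Go™ Free Office Suite',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]
for name in samples:
    print(name, '->', count_non_ascii(name), is_probably_english(name))
```

The first three names pass the check, while the last one exceeds the threshold. The rule is a heuristic: a non-English name written entirely in ASCII characters would still pass.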

We can use the eng_check function to loop through each dataset and append only the apps with English names to a new list.

In [17]:
google_english = []
google_nonenglish = []
apple_english = []
apple_nonenglish = []

for app in google_clean:
    name = app[0]
    if eng_check(name):
        google_english.append(app)
    else:
        google_nonenglish.append(app)

for app in apple_ds:
    name = app[1]
    if eng_check(name):
        apple_english.append(app)
    else:
        apple_nonenglish.append(app)

In [18]:
# Check the lengths of each dataset - the Google counts should add up to the 9659 deduplicated apps,
# and the Apple counts to the full Apple dataset
print(len(google_english))
print(len(google_nonenglish))
print(len(apple_english))
print(len(apple_nonenglish))

9614
45
6183
1014


In [19]:
explore_data(google_english, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9614
Number of columns: 13


In [20]:
explore_data(apple_english, 0, 5, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 6183
Number of columns: 16


## Removing paid apps

Our analysis needs to include only free apps. We loop through the datasets once more and retain only the free apps among those targeted at English-speaking users.

In [21]:
google_eng_free = []
google_eng_paid = []
apple_eng_free = []
apple_eng_paid = []

for app in google_english:
    price = app[7]
    if price == '0':
        google_eng_free.append(app)
    else:
        google_eng_paid.append(app)

for app in apple_english:
    price = app[4]
    if price == '0.0':
        apple_eng_free.append(app)
    else:
        apple_eng_paid.append(app)

In [22]:
# Check the lengths of each dataset - the Google counts should add up to the 9614 English apps,
# and the Apple counts to the 6183 English apps
print(len(google_eng_free))
print(len(google_eng_paid))
print(len(apple_eng_free))
print(len(apple_eng_paid))

8864
750
3222
2961
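
The comparison above relies on the exact strings '0' and '0.0'. A numeric comparison is less sensitive to formatting, as in the sketch below. It assumes that paid prices may carry a leading '$' (for example '$4.99'), which the rows shown in this notebook do not confirm; `is_free` is a hypothetical helper.

```python
def is_free(price_text):
    """Return True when a price string represents zero.

    Assumes an optional leading '$'; anything that cannot be parsed
    as a number is treated as not free.
    """
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False

print(is_free('0'), is_free('0.0'), is_free('$4.99'))  # True True False
```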


In [23]:
explore_data(apple_eng_free, 0, 5, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 3222
Number of columns: 16


In [24]:
explore_data(google_eng_free, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 8864
Number of columns: 13


# Data Analysis

## Problem Statement

The objective of the analysis is to determine what kinds of apps are likely to attract more users, since ad revenue from free apps depends heavily on the number of users. Additionally, the validation strategy for an app idea has three steps:

1. Build a minimal version of the app and release it on the Google Play store
2. If the app has a good response from users, develop the app further
3. If the app is profitable after 6 months, build an iOS version and release it on the App Store

Given this strategy, it becomes important to understand what apps have been popular across both markets.

## Most common apps by Genre

We're going to create a couple of functions to help us build frequency tables from the Google and Apple datasets.

In [25]:
# This function creates a frequency table using a given dataset, and a column of interest
def freq_table(dataset, index):
    freq = {}
    total = 0
    for app in dataset:
        total += 1
        value = app[index]
        if value in freq:
            freq[value] += 1
        else:
            freq[value] = 1
    
    freq_pct = {}
    for app in freq:
        freq_pct[app] = round((freq[app]/total)*100, 2)
    
    return freq_pct
    
# This function 1) takes a dataset and a column index as input
# 2) builds a frequency table using the freq_table function
# 3) displays the table in descending order of frequency
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

Let's run the above function on the Apple dataset for free English apps.

In [26]:
display_table(apple_eng_free, 11) # Prime Genre (Apple)

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


We can see that Games are the most common free English apps in the Apple Store, accounting for 58% of the population. The second-most common genre is Entertainment (about 8%). Most apps in this population are designed for entertainment: Games, Entertainment, Photo & Video, Social Networking, Sports, and Music together cover roughly 78% of the population. Apps designed for practical purposes (for example, Lifestyle, Shopping, Weather) account for a much smaller portion of the available apps, typically less than 3% each.
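
The combined share of the entertainment-oriented genres can be checked by summing the percentages from the table above. The grouping of genres into "entertainment" is a judgment call made for this illustration.

```python
# Percentages copied from the frequency table above
apple_genre_pct = {
    'Games': 58.16,
    'Entertainment': 7.88,
    'Photo & Video': 4.97,
    'Social Networking': 3.29,
    'Sports': 2.14,
    'Music': 2.05,
}

fun_share = round(sum(apple_genre_pct.values()), 2)
print(fun_share)  # 78.49
```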

Note that these figures only describe which apps are available in the store; they do not tell us whether these genres have a large number of users. This alone is not enough information to recommend an app profile for the App Store market. For better results, we would have to combine this data with the number of users and the average user rating.

Let's look at the Category and Genre details for the Google dataset next.

In [27]:
display_table(google_eng_free, 1) # Category (Google)

FAMILY : 18.91
GAME : 9.72
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


We can see that the largest category in the Google Play store is Family, at ~19% of apps. In the Google Play store, the Family category also includes games. [Note: this is no longer the case. The category "Family" no longer exists in the Google Play Store.]

Even so, the remaining categories are well represented compared to the Apple Store data. Apps designed for fun (taking Family, Game, and Entertainment together) account for ~30% of the population, so the landscape is far more balanced.

In [28]:
display_table(google_eng_free, 9) # Genre (Google)

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.91
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;

This is further confirmed by the Genre data. The Genres column appears to be a more granular view of the apps than the Category column, with a portion of the apps carrying multiple genres separated by a semicolon (for example, "Casual;Pretend Play" or "Puzzle;Brain Games").
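
Because some values combine several genres, one option is to split them on the semicolon and count each genre separately. The sketch below illustrates this on a handful of values taken from the output in this notebook; `count_individual_genres` is a hypothetical helper, and the frequency tables in this analysis do not use it.

```python
from collections import Counter

# Genres values taken from rows and tables shown in this notebook
sample_genres = [
    'Art & Design',
    'Art & Design;Pretend Play',
    'Art & Design;Creativity',
    'Casual;Pretend Play',
    'Puzzle;Brain Games',
]

def count_individual_genres(genre_values):
    """Split multi-genre strings on ';' and count each genre once per app."""
    counts = Counter()
    for value in genre_values:
        for genre in value.split(';'):
            genre = genre.strip()
            if genre:  # skip empty pieces, e.g. from a trailing ';'
                counts[genre] += 1
    return counts

genre_counts = count_individual_genres(sample_genres)
for genre, count in genre_counts.most_common():
    print(genre, ':', count)
```

With this counting, an app with two genres contributes to both, so the counts no longer add up to the number of apps.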

Again, this is not enough information to make a recommendation. It merely tells us which apps are currently available in the store, not how many users each genre has.

## Category of Apps with Most Users

To determine which categories of apps are most popular among users, we want the average number of installs per category. The Google dataset has an 'Installs' column for this. The Apple dataset has no comparable column, so we use 'rating_count_tot', the total number of user ratings, as a proxy. This assumes that the number of ratings is roughly proportional to the number of users who installed the app.

Let's start with the Apple Store data and compute the average number of ratings for each genre.

In [29]:
apple_freq_table = freq_table(apple_eng_free, 11)

for genre in apple_freq_table:
    total = 0
    len_genre = 0
    for app in apple_eng_free:
        genre_app = app[11]
        if genre_app == genre:
            ratings_ct = float(app[5])
            total += ratings_ct
            len_genre += 1
    avg_num_ratings = round(total / len_genre, 2)
    print(genre, ': ', avg_num_ratings)    

Social Networking :  71548.35
Photo & Video :  28441.54
Games :  22788.67
Music :  57326.53
Reference :  74942.11
Health & Fitness :  23298.02
Weather :  52279.89
Utilities :  18684.46
Travel :  28243.8
Shopping :  26919.69
News :  21248.02
Navigation :  86090.33
Lifestyle :  16485.76
Entertainment :  14029.83
Food & Drink :  33333.92
Sports :  23008.9
Book :  39758.5
Finance :  31467.94
Education :  7003.98
Productivity :  21028.41
Business :  7491.12
Catalogs :  4004.0
Medical :  612.0


Navigation (86k) > Reference (75k) > Social Networking (72k) > Music (57k) > Weather (52k) > Book (40k) > Food & Drink (33k) > Finance (31k)
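
The ranking above can be reproduced by sorting the averages instead of reading them off by eye. The sketch below copies the values from the output of the previous cell into a dictionary; in the notebook itself, the averages could be collected into a dictionary inside the loop rather than printed.

```python
# Average number of ratings per genre, copied from the output above
avg_ratings = {
    'Social Networking': 71548.35, 'Photo & Video': 28441.54,
    'Games': 22788.67, 'Music': 57326.53, 'Reference': 74942.11,
    'Health & Fitness': 23298.02, 'Weather': 52279.89,
    'Utilities': 18684.46, 'Travel': 28243.8, 'Shopping': 26919.69,
    'News': 21248.02, 'Navigation': 86090.33, 'Lifestyle': 16485.76,
    'Entertainment': 14029.83, 'Food & Drink': 33333.92,
    'Sports': 23008.9, 'Book': 39758.5, 'Finance': 31467.94,
    'Education': 7003.98, 'Productivity': 21028.41, 'Business': 7491.12,
    'Catalogs': 4004.0, 'Medical': 612.0,
}

# Sort genres by average, largest first, and show the top eight
ranked = sorted(avg_ratings.items(), key=lambda item: item[1], reverse=True)
for genre, avg in ranked[:8]:
    print(f'{genre}: {avg / 1000:.0f}k')
```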

Free Navigation apps have the highest average number of user ratings, followed by Reference and Social Networking apps. Navigation and Social Networking already have key players, so a new app would need a major differentiator to compete.

In [30]:
for app in apple_eng_free:
    genre = app[11]
    if genre == 'Navigation':
        print(app[1], ': ', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic :  345046
Google Maps - Navigation & Transit :  154911
Geocaching® :  12811
CoPilot GPS – Car Navigation & Offline Maps :  3582
ImmobilienScout24: Real Estate Search in Germany :  187
Railway Route Search :  5


In [31]:
for app in apple_eng_free:
    genre = app[11]
    if genre == 'Social Networking':
        print(app[1], ': ', app[5])

Facebook :  2974676
Pinterest :  1061624
Skype for iPhone :  373519
Messenger :  351466
Tumblr :  334293
WhatsApp Messenger :  287589
Kik :  260965
ooVoo – Free Video Call, Text and Voice :  177501
TextNow - Unlimited Text + Calls :  164963
Viber Messenger – Text & Call :  164249
Followers - Social Analytics For Instagram :  112778
MeetMe - Chat and Meet New People :  97072
We Heart It - Fashion, wallpapers, quotes, tattoos :  90414
InsTrack for Instagram - Analytics Plus More :  85535
Tango - Free Video Call, Voice and Chat :  75412
LinkedIn :  71856
Match™ - #1 Dating App. :  60659
Skype for iPad :  60163
POF - Best Dating App for Conversations :  52642
Timehop :  49510
Find My Family, Friends & iPhone - Life360 Locator :  43877
Whisper - Share, Express, Meet :  39819
Hangouts :  36404
LINE PLAY - Your Avatar World :  34677
WeChat :  34584
Badoo - Meet New People, Chat, Socialize. :  34428
Followers + for Instagram - Follower Analytics :  28633
GroupMe :  28260
Marco Polo Video Walki

As we can see from above, 2 Navigation apps (Waze and Google Maps) dominate the user base, with about 97% of the genre's ratings between them, so this would be a hard market to enter. The Social Networking space likewise has several key players already, with Facebook and Pinterest in the lead.
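
How strongly a few apps dominate a genre can be quantified as the share of all ratings held by the top apps. The sketch below does this for the Navigation apps listed above; `top_n_share` is a hypothetical helper written for this illustration.

```python
# Rating counts for free English Navigation apps, copied from the output above
navigation_ratings = {
    'Waze - GPS Navigation, Maps & Real-time Traffic': 345046,
    'Google Maps - Navigation & Transit': 154911,
    'Geocaching®': 12811,
    'CoPilot GPS – Car Navigation & Offline Maps': 3582,
    'ImmobilienScout24: Real Estate Search in Germany': 187,
    'Railway Route Search': 5,
}

def top_n_share(ratings, n):
    """Percentage of all ratings that belongs to the n most-rated apps."""
    counts = sorted(ratings.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    return round(sum(counts[:n]) / total * 100, 2)

print(top_n_share(navigation_ratings, 2))  # 96.79
```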

In [32]:
for app in apple_eng_free:
    genre = app[11]
    if genre == 'Reference':
        print(app[1], ': ', app[5])

Bible :  985920
Dictionary.com Dictionary & Thesaurus :  200047
Dictionary.com Dictionary & Thesaurus for iPad :  54175
Google Translate :  26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran :  18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition :  17588
Merriam-Webster Dictionary :  16849
Night Sky :  12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) :  8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools :  4693
GUNS MODS for Minecraft PC Edition - Mods Tools :  1497
Guides for Pokémon GO - Pokemon GO News and Cheats :  826
WWDC :  762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free :  718
VPN Express :  14
Real Bike Traffic Rider Virtual Reality Glasses :  8
教えて!goo :  0
Jishokun-Japanese English Dictionary & Translator :  0


In [33]:
for app in apple_eng_free:
    genre = app[11]
    if genre == 'Finance':
        print(app[1], ': ', app[5])

Chase Mobile℠ :  233270
Mint: Personal Finance, Budget, Bills & Money :  232940
Bank of America - Mobile Banking :  119773
PayPal - Send and request money safely :  119487
Credit Karma: Free Credit Scores, Reports & Alerts :  101679
Capital One Mobile :  56110
Citi Mobile® :  48822
Wells Fargo Mobile :  43064
Chase Mobile :  34322
Square Cash - Send Money for Free :  23775
Capital One for iPad :  21858
Venmo :  21090
USAA Mobile :  19946
TaxCaster – Free tax refund calculator :  17516
Amex Mobile :  11421
TurboTax Tax Return App - File 2016 income taxes :  9635
Bank of America - Mobile Banking for iPad :  7569
Wells Fargo for iPad :  2207
Stash Invest: Investing & Financial Education :  1655
Digit: Save Money Without Thinking About It :  1506
IRS2Go :  1329
Capital One CreditWise - Credit score and report :  1019
U by BB&T :  790
Paribus - Rebates When Prices Drop :  768
KeyBank Mobile :  623
VyStar Mobile Banking for iPhone :  434
Sparkasse - Your mobile branch :  77
VyStar Mobile Ban

Looking into the Reference apps, we see that these are largely books, dictionaries, or manuals converted to an app format. Finance is also among the top 10 genres by average number of ratings. Specifically, a Reference app targeting Finance / Personal Budgeting could be an option to pursue.
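
One caveat when reading these averages is that a single very popular app can pull the mean up. As a rough check, the sketch below compares the mean and the median for the Reference apps listed above. It is a supplementary illustration and does not change the approach used in this analysis.

```python
from statistics import mean, median

# Rating counts for free English Reference apps, copied from the output above
reference_ratings = [985920, 200047, 54175, 26786, 18418, 17588, 16849,
                     12122, 8535, 4693, 1497, 826, 762, 718, 14, 8, 0, 0]

print(round(mean(reference_ratings), 2))  # 74942.11
print(median(reference_ratings))          # 6614.0
```

The gap between the two figures suggests that the genre average is driven largely by the Bible and Dictionary.com apps.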

We can next review the Google dataset, which has an 'Installs' column. However, this is a text field that provides only ranges for the number of downloads.

In [34]:
freq_table(google_eng_free, 5)

{'10,000+': 10.2,
 '5,000,000+': 6.83,
 '50,000,000+': 2.3,
 '100,000+': 11.55,
 '50,000+': 4.77,
 '1,000,000+': 15.73,
 '10,000,000+': 10.55,
 '5,000+': 4.51,
 '500,000+': 5.56,
 '1,000,000,000+': 0.23,
 '100,000,000+': 2.13,
 '1,000+': 8.39,
 '500,000,000+': 0.27,
 '500+': 3.25,
 '100+': 6.92,
 '50+': 1.92,
 '10+': 3.54,
 '1+': 0.51,
 '5+': 0.79,
 '0+': 0.05,
 '0': 0.01}

For example, an app could have "100,000+" installs, with no further precision. This is fine, since it still gives us an idea of the most popular categories. We treat each value as its lower bound (so "100,000+" counts as 100,000), which means we first need to remove the "+" and "," characters before converting to a number and computing the average.
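
The conversion can be sketched as a small helper. The cells that follow perform the same steps inline; `parse_installs` is a hypothetical name introduced here only to show the idea in isolation.

```python
def parse_installs(text):
    """Convert an Installs string such as '100,000+' to a number.

    The value is treated as a lower bound: '100,000+' becomes 100000.0.
    """
    return float(text.replace('+', '').replace(',', ''))

for value in ['100,000+', '1,000,000,000+', '0+', '0']:
    print(value, '->', parse_installs(value))
```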

In [35]:
# Create a distinct list of categories in the Google dataset
google_freq_table = freq_table(google_eng_free, 1)

for category in google_freq_table:
    total = 0
    len_category = 0
    for app in google_eng_free:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1
    avg_installs = round(total / len_category, 2)
    print(category, ': ', avg_installs)

ART_AND_DESIGN :  1986335.09
AUTO_AND_VEHICLES :  647317.82
BEAUTY :  513151.89
BOOKS_AND_REFERENCE :  8767811.89
BUSINESS :  1712290.15
COMICS :  817657.27
COMMUNICATION :  38456119.17
DATING :  854028.83
EDUCATION :  1833495.15
ENTERTAINMENT :  11640705.88
EVENTS :  253542.22
FINANCE :  1387692.48
FOOD_AND_DRINK :  1924897.74
HEALTH_AND_FITNESS :  4188821.99
HOUSE_AND_HOME :  1331540.56
LIBRARIES_AND_DEMO :  638503.73
LIFESTYLE :  1437816.27
GAME :  15588015.6
FAMILY :  3695641.82
MEDICAL :  120550.62
SOCIAL :  23253652.13
SHOPPING :  7036877.31
PHOTOGRAPHY :  17840110.4
SPORTS :  3638640.14
TRAVEL_AND_LOCAL :  13984077.71
TOOLS :  10801391.3
PERSONALIZATION :  5201482.61
PRODUCTIVITY :  16787331.34
PARENTING :  542603.62
WEATHER :  5074486.2
VIDEO_PLAYERS :  24727872.45
NEWS_AND_MAGAZINES :  9549178.47
MAPS_AND_NAVIGATION :  4056941.77


The Communication and Video Players categories have the highest average number of installs but, as in the Apple Store, they are likely to be dominated by a handful of players.

In [36]:
for app in google_eng_free:
    category = app[1]
    n_installs = app[5]
    n_installs = n_installs.replace('+', '')
    n_installs = float(n_installs.replace(',', ''))
    if category == 'COMMUNICATION' and (n_installs >= 500000000):
        print(app[0], ': ', app[5])

WhatsApp Messenger :  1,000,000,000+
Google Duo - High Quality Video Calls :  500,000,000+
Messenger – Text and Video Chat for Free :  1,000,000,000+
imo free video calls and chat :  500,000,000+
Skype - free IM & video calls :  1,000,000,000+
LINE: Free Calls & Messages :  500,000,000+
Google Chrome: Fast & Secure :  1,000,000,000+
UC Browser - Fast Download Private & Secure :  500,000,000+
Gmail :  1,000,000,000+
Hangouts :  1,000,000,000+
Viber Messenger :  500,000,000+


In [37]:
for app in google_eng_free:
    category = app[1]
    n_installs = app[5]
    n_installs = n_installs.replace('+', '')
    n_installs = float(n_installs.replace(',', ''))
    if category == 'VIDEO_PLAYERS' and (n_installs >= 500000000):
        print(app[0], ': ', app[5])

YouTube :  1,000,000,000+
Google Play Movies & TV :  1,000,000,000+
MX Player :  500,000,000+


Based on the above, 11 apps dominate the Communication category and 3 apps dominate the Video Players category, each with at least half a billion installs. These categories would not be good recommendations, as it would be extremely difficult to enter these markets.

Compared to the Apple Store data, Google Play appears to combine the "Book" and "Reference" genres into a single category.

In [38]:
for app in google_eng_free:
    category = app[1]
    n_installs = app[5]
    n_installs = n_installs.replace('+', '')
    n_installs = float(n_installs.replace(',', ''))
    if category == 'BOOKS_AND_REFERENCE' and (n_installs >= 1000000):
        print(app[0], ': ', app[5])

Wikipedia :  10,000,000+
Cool Reader :  10,000,000+
Book store :  1,000,000+
FBReader: Favorite Book Reader :  10,000,000+
Free Books - Spirit Fanfiction and Stories :  1,000,000+
Google Play Books :  1,000,000,000+
AlReader -any text book reader :  5,000,000+
FamilySearch Tree :  1,000,000+
Cloud of Books :  1,000,000+
ReadEra – free ebook reader :  1,000,000+
Ebook Reader :  5,000,000+
Read books online :  5,000,000+
eBoox: book reader fb2 epub zip :  1,000,000+
All Maths Formulas :  1,000,000+
Ancestry :  5,000,000+
HTC Help :  10,000,000+
Moon+ Reader :  10,000,000+
English-Myanmar Dictionary :  1,000,000+
Golden Dictionary (EN-AR) :  1,000,000+
All Language Translator Free :  1,000,000+
Bible :  100,000,000+
Amazon Kindle :  100,000,000+
Aldiko Book Reader :  10,000,000+
Wattpad 📖 Free Books :  100,000,000+
Dictionary - WordWeb :  5,000,000+
50000 Free eBooks & Free AudioBooks :  5,000,000+
Al-Quran (Free) :  10,000,000+
Al Quran Indonesia :  10,000,000+
Al'Quran Bahasa Indonesia 

Again, similar to the Apple Store data, we see that the Books and Reference apps are essentially large books in app format, including religious texts and dictionaries.

# Conclusion

Our objective was to analyze app data from both the Google Play Store and the Apple App Store to recommend a free app profile that could be a source of significant ad revenue. Through the process, we identified which categories of apps are the most downloaded on average.

On further investigation, we noted that the Books and Reference category offers the most options and flexibility. Essentially, a large book that people refer to regularly (e.g. tax or accounting information) could be built into an app format with search options. Additional features could include a discussion forum and per-user highlighting or saving of quotes and excerpts.