# Profitable Apps in Google Play & App Store Markets

The goal of this project is to determine what makes an app profitable on both the Google Play and App Store markets. This would allow a company to make decisions on which apps to build based on data. 

For the sake of this project, the assumption is that all apps that will be made are free, for English-speaking users, and depend heavily on in-app ads. As a result, the number of users heavily influences the profitability of the app. 

## Opening the Data

In order to achieve this goal, I reviewed data from two sources:
- [Google Play](https://www.kaggle.com/datasets/lava18/google-play-store-apps): collected in August 2018 and containing approximately 10,000 apps
- [App Store](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps): collected in July 2017 and containing approximately 7,000 apps

In [1]:
from csv import reader

# Google Play
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android_data = list(reader(opened_file))
android_header = android_data[0]
android = android_data[1:]

# App Store
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios_data = list(reader(opened_file))
ios_header = ios_data[0]
ios = ios_data[1:]

## Exploring the Data

The function below helps explore the data by slicing it down to the rows we are interested in and printing each row followed by blank lines for readability. Keep in mind that the `end` parameter is not inclusive.

If the `rows_and_columns` argument is `True`, the function also reports how many rows (apps) and columns (data points for each app) the dataset contains.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    for row in dataset[start:end]:
        print(row)
        print('\n')
    
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Let's take a look at the Android column names, along with the first five rows of data. We will also determine how many rows and columns there are in this set.

In [3]:
print(android_header)
print('\n')
# explore_data prints its own output and returns None, so it is not wrapped in print()
explore_data(android, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000', 'Free', '0', 'Everyone

We can quickly see that the Android data has 10,841 apps and 13 columns. At a glance, I would think that `App`, `Category`, `Reviews`, `Installs`, `Type`, `Price`, and `Genres` will be useful.

Now, let's do the same for the iOS data:

In [4]:
print(ios_header)
print('\n')
explore_data(ios, 0, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


The iOS data consists of 7,197 apps and has 16 columns. At a glance, `track_name`, `currency`, `price`, `rating_count_tot`, `rating_count_ver` and `prime_genre` will be useful. 

_Note_: These column names are a bit harder to understand. Their descriptions are available in the dataset documentation linked in the 'Opening the Data' section.

## Data Cleaning

We are going to start data cleaning by checking the discussion threads to see if any errors or missing data have already been reported.

The Google Play discussion section includes [a thread](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015) for an error on row 10472 (if using full data set with header, it will be row 10473).

In [5]:
for i in range(len(android[0])):
    print (f'{android_header[i]}: {android[10472][i]} ')

App: Life Made WI-Fi Touchscreen Photo Frame 
Category: 1.9 
Rating: 19 
Reviews: 3.0M 
Size: 1,000 
Installs: Free 
Type: 0 
Price: Everyone 
Content Rating:  
Genres: February 11, 2018 
Last Updated: 1.0.19 
Current Ver: 4.0 and up 
Android Ver:  


Looking at this, we can definitely see that something is off. Upon further exploration, it seems that the `Category` and `Genres` values are missing, which shifts every value after the app name one column to the left.

There are several ways of handling this, including deleting the row outright or finding the missing information, adding it to the row, and moving the other values back to their correct places. For the sake of this project, I will delete it outright.

In [6]:
del android[10472]

The previous number of rows was 10,841. If the below prints as one less, the delete was successful.

In [7]:
print(len(android))

10840
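
Misaligned rows like the one above can sometimes be found without relying on a discussion thread, because a row that lost a value may end up with fewer fields than the header. The sketch below is only an illustration on made-up data; `find_malformed_rows` and the toy rows are hypothetical names, and the approach assumes the damaged row really is shorter (or longer) than the header.

```python
def find_malformed_rows(rows, header):
    """Return (index, row) pairs whose field count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Toy data (hypothetical): the second row is missing its 'Category' value.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '3.8'],
]

for index, row in find_malformed_rows(toy_rows, toy_header):
    print(index, row)
```

A check like this would need to run before any row is deleted, since deleting a row changes the indexes of the rows after it.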


### Removing Duplicate Entries

The function below can be called on either data set to determine if and how many potential duplicates exist in the data set. For Android data the 0th index of each row is the app name. In the iOS data, the 0th index is the app ID. Both of these work for testing if the app is unique. 

In [8]:
def duplicates(dataset):
    duplicate_apps = []
    unique_apps = []
    
    for app in dataset:
        if app[0] not in unique_apps:
            unique_apps.append(app[0])
        else:
            duplicate_apps.append(app[0])
    
    return duplicate_apps

android_duplicates = duplicates(android)
ios_duplicates = duplicates(ios)
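
Checking membership in a list gets slower as the list grows, so on larger datasets the same idea is often written with a `set`. The version below is a minimal sketch of that variant, not the function used in this notebook; `find_duplicates` and `key_index` are names introduced here for illustration.

```python
def find_duplicates(rows, key_index=0):
    """Return the key of every row whose key already appeared in an earlier row."""
    seen = set()
    repeats = []
    for row in rows:
        key = row[key_index]
        if key in seen:
            repeats.append(key)
        else:
            seen.add(key)
    return repeats


# Toy data (hypothetical): 'Box' appears three times, so it is reported twice.
toy_rows = [['Box'], ['Slack'], ['Box'], ['Box']]
print(find_duplicates(toy_rows))
```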

#### Android Data

I will look at the Android data first.

In [9]:
print('Number of duplicate apps:', len(android_duplicates))

Number of duplicate apps: 1181


In [10]:
print('Examples of duplicate apps:', android_duplicates[:15])

Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


I will pick a few of the duplicated names and print their rows to see how the copies differ. This will help me come up with a criterion for deciding which duplicate to keep.

To make this easier, I will create a function that I can call for each dataset and name I want to check.

In [11]:
def specific_duplicates(dataset, name):
    for app in dataset:
        if app[0] == name:
            print(app)
            
specific_duplicates(android, 'Box')

['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [12]:
specific_duplicates(android, 'Google Ads')

['Google Ads', 'BUSINESS', '4.3', '29313', '20M', '5,000,000', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']
['Google Ads', 'BUSINESS', '4.3', '29313', '20M', '5,000,000', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']
['Google Ads', 'BUSINESS', '4.3', '29331', '20M', '5,000,000', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']


In [13]:
specific_duplicates(android, 'Slack')

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


Several of the duplicate rows I checked are identical. However, for 'Google Ads' and 'Slack' the number of reviews (the fourth value in each row) differs between rows. Since everything else is the same, it is fair to assume that rows with the same name describe the same app, with the data collected at different times.

Thus, the row with the most reviews was likely collected last, and will be the row that is kept, while the others are deleted.

In [14]:
expected_clean_length_android = len(android) - len(android_duplicates)
print('Expected length after removal of duplicates:', expected_clean_length_android)

Expected length after removal of duplicates: 9659


We now know that the expected length of our dataset after we remove the duplicates is 9,659. We are going to use a dictionary to record the highest review count seen for each app. An entry is added or updated when (1) the app is not already in the dictionary, or (2) the app is already there with a lower review count, in which case the count is replaced with the current one.

If the length of the dictionary is the same as the expected length after removal of duplicates, it was done correctly. 

In [15]:
ratings_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name not in ratings_max or ratings_max[name] < n_reviews:
        ratings_max[name] = n_reviews
        
print(len(ratings_max))

9659


Using the `ratings_max` dictionary, we will remove the duplicates from the Android data. To do this, we will iterate through the Android dataset, only adding an app to the new clean dataset if its review count matches the maximum recorded for its name in `ratings_max` and it has not already been added.

To test that this was successful, we will check the length of the clean data. If it equals the expected length (9,659) then we can be confident it was successful.

In [16]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == ratings_max[name] and name not in already_added:
        already_added.append(name)
        android_clean.append(app)

print('The length of the clean data is:', len(android_clean))
print('Cleaning android data was successful:', expected_clean_length_android == len(android_clean))

The length of the clean data is: 9659
Cleaning android data was successful: True
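
The two passes above could also be folded into a single pass by storing the whole row in the dictionary instead of only the review count. This is a sketch under the same assumptions as the cells above (name at index 0, review count at index 3); `keep_most_reviewed` is a hypothetical helper, and its output is ordered by each name's first appearance rather than by the position of the kept row.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' means the first row wins when review counts tie.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Toy data (hypothetical), shaped like the first four Android columns.
toy_rows = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
]
for row in keep_most_reviewed(toy_rows):
    print(row)
```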


#### iOS Data

Now onto the iOS data. I am going to start in the same way: by examining the number of duplicates, printing some of the names out, and examining the specific apps for trends.

In [17]:
print('Number of duplicate iOS apps:', len(ios_duplicates))

Number of duplicate iOS apps: 0


According to this, there are no duplicates in the iOS data. This is great and allows us to continue to the next step. As mentioned earlier, we will be focusing only on free apps for English-speakers. Let's remove all non-English apps, to start.

### Removing Non-English Apps

In this section, we will be removing all non-English apps from the list. 

The American Standard Code for Information Interchange (ASCII) assigns a number from 0 to 127 to each of its characters, which cover the letters, digits, and punctuation generally used in English text. Python's built-in `ord()` function returns the corresponding number for any character. It is important to note that characters outside this set that often appear alongside English text, such as emojis and symbols like `™`, have numbers above 127.

This information will help us create a function that estimates whether a given app name is in English.

Our function will take a string and iterate through each character, checking whether its number falls within that range. If the name of an app includes more than 3 characters above 127, it is fair to assume that the name, and therefore the app, is not in English.

We allow up to 3 such characters to avoid removing English apps that include emojis or symbols (such as `™`) in the title. It is possible that an English app name includes more than 3 of these characters and will be excluded, though this is unlikely to happen often.

In [22]:
def english_name(name):
    count = 0
    for char in name:
        if ord(char) > 127:  # ord() is never negative, so only the upper bound matters
            count += 1
            if count > 3:
                return False
    return True

Some test cases:

In [23]:
print('\'Instagram\' is in English:', english_name('Instagram'))
print('\'爱奇艺PPS -《欢乐颂2》电视剧热播\' is in English:', english_name('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print('\'Docs To Go™ Free Office Suite\' is in English:', english_name('Docs To Go™ Free Office Suite'))
print('\'Instachat 😜\' is in English:', english_name('Instachat 😜'))

'Instagram' is in English: True
'爱奇艺PPS -《欢乐颂2》电视剧热播' is in English: False
'Docs To Go™ Free Office Suite' is in English: True
'Instachat 😜' is in English: True
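
The cutoff of three characters is a judgment call, so it can help to make it a parameter and see how the result changes. The following is a minimal sketch of that idea; `is_probably_english` and `max_non_ascii` are names introduced here and are not part of the analysis above.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` characters above code point 127."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii


# With the default threshold, a single symbol does not disqualify a name.
print(is_probably_english('Docs To Go™ Free Office Suite'))

# With a threshold of zero, any emoji or symbol does.
print(is_probably_english('Instachat 😜', max_non_ascii=0))
```

The first call prints `True` and the second prints `False`, which shows how a stricter threshold would drop English apps that merely decorate their names.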


We will now create another function that loops through the iOS and Android data, using the above function as a helper to determine whether each app appears to be in English. We will run this to create two variables consisting of lists of apps: one for iOS and one for Android.

This function will take in a dataset and a column index. The column index refers to the name column in the given dataset.

In [26]:
def find_english_apps(dataset, column_index):
    english_apps = []
    for app in dataset:
        if english_name(app[column_index]):
            english_apps.append(app)

    return english_apps

android_english = find_english_apps(android_clean, 0)
ios_english = find_english_apps(ios, 1)

print(f'There are {len(android_english)} English apps for Android')
print(f'There are {len(ios_english)} English apps for iOS')

There are 9614 English apps for Android
There are 6183 English apps for iOS


### Removing Paid Apps

As mentioned previously, we are only going to be analyzing free apps. Since the data we have includes _all_ apps, we will take some time to remove any paid apps from our set.

To start, let's see where the information for the app price is.

In [27]:
print('Android Column Names\n')
print(android_header)
print('\n')
print('iOS Column Names')
print(ios_header)

Android Column Names

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


iOS Column Names
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


The Android data has two columns where we can tell if it is free or not: 
1. The `Type` column (index 6)
2. The `Price` column (index 7)

For the sake of this project, I will be using the `Price` column. My thought process here is that it is less likely there will be a mistake, such as a typo, in that column.

The iOS data has a `price` column (index 4), which can be used to determine if it is free. 

Below, we will create a function that can be reused for each dataset. It takes a dataset and the index of the price column. We use type conversion to make sure each price is a `float`, first stripping a leading `'$'` if there is one so that the conversion does not fail. The function loops through the given dataset, appending an app to a new list whenever the price at the given index equals `0.0`. This provides an additional reason for using the `Price` column of the Android data: it allows us to create one reusable function instead of one loop that looks for `0.0` and another that looks for `'Free'`.

In [38]:
def free_apps(dataset, idx):
    free = []
    for app in dataset:
        # Drop a leading '$' (e.g. '$4.99') so the price can be converted to a float
        price = app[idx].lstrip('$')
        if float(price) == 0.0:
            free.append(app)
    return free

android_free_english = free_apps(android_english, 7)
ios_free_english = free_apps(ios_english, 4)

print(f'There are {len(android_free_english)} Android apps remaining.')
print(f'There are {len(ios_free_english)} iOS apps remaining.')

There are 8864 Android apps remaining.
There are 3222 iOS apps remaining.
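
The price handling can be checked in isolation on a few representative strings. This is a small sketch, assuming prices look like `'0'`, `'0.0'`, or `'$4.99'` as described above; `parse_price` and `is_free` are hypothetical helper names, and an empty or non-numeric string would raise a `ValueError`.

```python
def parse_price(text):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    return float(text.strip().lstrip('$'))


def is_free(text):
    """Return True when the parsed price is exactly zero."""
    return parse_price(text) == 0.0


for sample in ['0', '0.0', '$4.99']:
    print(sample, '->', parse_price(sample), 'free' if is_free(sample) else 'paid')
```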


## Most Common Apps By Genre

### Background

The revenue made by the company depends heavily on in-app ads and, therefore, on the number of users who see those ads. As a result, we want to analyze this data to determine what kind of app is likely to attract more users.

Our end goal is to have the app on both the Google Play Store and App Store, so we need to find app profiles that have been successful in both stores. 

If we scroll up a few cells, we can see the column names for Android and iOS apps. It looks like `Category` (index 1) and `Genres` (index 9) in Android and `prime_genre` (index 11) in iOS would be good starting points.  

### Creating and Analyzing Sorted Frequency Tables 

In [33]:
def freq_table(dataset, index):
    freq = {}
    count = 0
    for app in dataset:
        if app[index] not in freq:
            freq[app[index]] = 0
        freq[app[index]] += 1
        count += 1
    
    freq_percentages = {}
    for genre in freq:
        percentage = (freq[genre] / count) * 100
        freq_percentages[genre] = percentage
        
    return freq_percentages



def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
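
For comparison, the same kind of percentage table can be built with `collections.Counter` from the standard library. This is only a sketch on toy data, not a replacement for the functions above; `percentage_table` and the toy rows are hypothetical, and the function assumes the dataset is not empty.

```python
from collections import Counter


def percentage_table(rows, index):
    """Map each distinct value in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


# Toy data (hypothetical): three games and one tool.
toy_rows = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'GAME']]
table = percentage_table(toy_rows, 1)

# Print the values from most to least common.
for value, share in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(value, ':', share)
```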

#### Android Data

Let's analyze the `Category` and `Genres` columns of the Android (Google Play Store) data. We picked both columns because of how closely related they seem.

In [34]:
print('Display table of Android App Categories:')
print('\n')
display_table(android_free_english, 1)

Display table of Android App Categories:


FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PAREN

At first glance, it looks like most apps are practical in nature (tools, business, lifestyle, productivity, and so on) rather than games. However, if you browse the Family category in the Google Play Store, you will find that most of its apps are actually games for kids. This category alone accounts for about 19% of the apps, and combined with the `GAME` category, games account for almost 29% of the free English apps in our Google Play data.

This shows that there are more games than you might think at first glance, though practical apps still make up the larger share of Google Play Store apps.

In [35]:
print('Display table of Android App Genres:')
print('\n')
display_table(android_free_english, 9)

Display table of Android App Genres:


Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.93637184115

The `Genres` column of the Google Play data (above) reinforces much of what we found when analyzing the `Category` column: most of these apps are practical rather than games.

The difference between the `Category` and `Genres` columns is not completely clear, but there are many more genres than categories. Because we are looking at the bigger picture, we will focus on the `Category` column going forward.

#### iOS Data

Now we are going to look at the `prime_genre` column of the iOS (App Store) data.

In [37]:
print('Display table of iOS App Genres:')
print('\n')
display_table(ios_free_english, 11)

Display table of iOS App Genres:


Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


From the output above, we can see that Games make up over half (58.16%) of the free English apps in the App Store. The next most common genre is Entertainment, which accounts for slightly under 8% of the apps. This is a pretty large gap.

This suggests that, among free English apps, the App Store is dominated by "fun" apps, such as games and entertainment-driven apps.

However, just because this is how the apps are divided in the store, it does not necessarily mean that the users are split in the same way. For example, it is possible, though unlikely, that the vast majority of App Store users are using the 3.66% of apps in Education, while only a small minority are downloading Games.

## Most Popular Apps By Genre

### Background

From the earlier analysis, we can see that the App Store has more apps designed for fun, whereas Google Play seems to have a balance between practical and fun apps.

We are now going to analyze which genres are most *popular* in each store. Popularity is a difficult concept: does it mean how many plays per day? Rating?

For this purpose, we will use the `Installs` column (index 5) of the Android data and the `rating_count_tot` column (index 5) of the iOS data. The iOS data has no install counts, so the total number of ratings serves as a proxy. These columns are the best available way for us to estimate *how many* users each app has.

### Creating and Analyzing Popular by Genre Tables

The functions below will be used for both the Android and iOS data. The first function, `popularity_table`, loops through the dataset and builds a dictionary whose keys are the distinct genres and whose values are lists holding the popularity figure (number of ratings or installs) of each app in that genre.

The second function, `genre_popularity`, calls the first function to create that table. It then loops through the table, printing each genre and the average of the numbers in its list. This gives us the average number of ratings or installs for each genre, which we use to gauge how popular the genre is.

You may notice that `users_proxy` is assigned with a conditional expression. This is because a string that contains a comma cannot be converted to a float directly. When this is the case, the string is passed to the `convert` function, which removes the commas and converts the result to a float. This is particularly useful for the Android `Installs` data, which includes commas.

In [54]:
def popularity_table(dataset, genre_idx, popularity_idx):
    popular_table = {}
    for app in dataset:
        genre = app[genre_idx]
        users_proxy = float(app[popularity_idx]) if ',' not in app[popularity_idx] else convert(app[popularity_idx])
        if genre not in popular_table:
            popular_table[genre] = []

        popular_table[genre].append(users_proxy)

    return popular_table

def convert(string):
    # Keep only digits and decimal points so that '1,000,000' becomes '1000000'.
    # If a comma is certain to be the only stray character, this can simply
    # return float(string.replace(',', '')).
    num_string = ''
    for char in string:
        if char in '0123456789.':
            num_string += char

    return float(num_string)

def genre_popularity(dataset, genre_idx, popularity_idx):
    users_per_genre_table = popularity_table(dataset, genre_idx, popularity_idx)
    for genre in users_per_genre_table:
        average = sum(users_per_genre_table[genre]) / len(users_per_genre_table[genre])
        print(f'{genre}: {average}')
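
To see the averaging step on something small enough to check by hand, here is a compact sketch that returns the averages instead of printing them. It assumes commas are the only non-numeric characters in the counts; `parse_count`, `average_by_genre`, and the toy rows are hypothetical and not part of the notebook's own functions.

```python
def parse_count(text):
    """Convert a count such as '1,000,000' or '2974676' to a float."""
    return float(text.replace(',', ''))


def average_by_genre(rows, genre_index, count_index):
    """Return a dictionary mapping each genre to the mean count of its rows."""
    totals = {}
    for row in rows:
        genre = row[genre_index]
        running_total, n = totals.get(genre, (0.0, 0))
        totals[genre] = (running_total + parse_count(row[count_index]), n + 1)
    return {genre: total / n for genre, (total, n) in totals.items()}


# Toy data (hypothetical): two games and one tool.
toy_rows = [
    ['App A', 'GAME', '1,000,000'],
    ['App B', 'GAME', '500,000'],
    ['App C', 'TOOLS', '10,000'],
]
print(average_by_genre(toy_rows, 1, 2))
```

Here the two games average 750,000 installs and the single tool averages 10,000.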


#### Android Data

For the Android (Google Play) data, we will look at the average number of installs in each category to get an idea of its popularity. To start, let's see how this data is set up. We will view it through the `display_table` function defined previously, which shows the percentage of apps at each install count.

In [46]:
display_table(android_free_english, 5) # Index 5 is the Installs section

1,000,000 : 15.726534296028879
100,000 : 11.552346570397113
10,000,000 : 10.548285198555957
10,000 : 10.198555956678701
1,000 : 8.393501805054152
100 : 6.915613718411552
5,000,000 : 6.825361010830325
500,000 : 5.561823104693141
50,000 : 4.7721119133574
5,000 : 4.512635379061372
10 : 3.5424187725631766
500 : 3.2490974729241873
50,000,000 : 2.3014440433213
100,000,000 : 2.1322202166064983
50 : 1.917870036101083
5 : 0.78971119133574
1 : 0.5076714801444043
500,000,000 : 0.2707581227436823
1,000,000,000 : 0.22563176895306858
0 : 0.056407942238267145


The install counts are clearly bucketed rather than exact: it is implausible that more than 15% of apps have exactly 1,000,000 installs. That being said, the data is good enough for our purposes, since we are looking for relative popularity and not *exactly* how many installs each app has.

Let's put this dataset through our pre-made functions to take a look at the average installs for each genre.

In [57]:
genre_popularity(android_free_english, 1, 5) # We are using Index 1 for Category, as discussed in the earlier section

ART_AND_DESIGN: 1986335.0877192982
AUTO_AND_VEHICLES: 647317.8170731707
BEAUTY: 513151.88679245283
BOOKS_AND_REFERENCE: 8767811.894736841
BUSINESS: 1712290.1474201474
COMICS: 817657.2727272727
COMMUNICATION: 38456119.167247385
DATING: 854028.8303030303
EDUCATION: 1833495.145631068
ENTERTAINMENT: 11640705.88235294
EVENTS: 253542.22222222222
FINANCE: 1387692.475609756
FOOD_AND_DRINK: 1924897.7363636363
HEALTH_AND_FITNESS: 4188821.9853479853
HOUSE_AND_HOME: 1331540.5616438356
LIBRARIES_AND_DEMO: 638503.734939759
LIFESTYLE: 1437816.2687861272
GAME: 15588015.603248259
FAMILY: 3695641.8198090694
MEDICAL: 120550.61980830671
SOCIAL: 23253652.127118643
SHOPPING: 7036877.311557789
PHOTOGRAPHY: 17840110.40229885
SPORTS: 3638640.1428571427
TRAVEL_AND_LOCAL: 13984077.710144928
TOOLS: 10801391.298666667
PERSONALIZATION: 5201482.6122448975
PRODUCTIVITY: 16787331.344927534
PARENTING: 542603.6206896552
WEATHER: 5074486.197183099
VIDEO_PLAYERS: 24727872.452830188
NEWS_AND_MAGAZINES: 9549178.467741935
MA

This data shows us that communication apps average over 38 million installs. Let's take a closer look at the most popular apps in that category to see if anything is skewing the data.

In [59]:
for app in android_free_english:
    high_installs = app[5] in ('1,000,000,000', '500,000,000', '100,000,000')
    if app[1] == 'COMMUNICATION' and high_installs:
        print(f'{app[0]}: {app[5]}')

WhatsApp Messenger: 1,000,000,000
imo beta free calls and text: 100,000,000
Android Messages: 100,000,000
Google Duo - High Quality Video Calls: 500,000,000
Messenger – Text and Video Chat for Free: 1,000,000,000
imo free video calls and chat: 500,000,000
Skype - free IM & video calls: 1,000,000,000
Who: 100,000,000
GO SMS Pro - Messenger, Free Themes, Emoji: 100,000,000
LINE: Free Calls & Messages: 500,000,000
Google Chrome: Fast & Secure: 1,000,000,000
Firefox Browser fast & private: 100,000,000
UC Browser - Fast Download Private & Secure: 500,000,000
Gmail: 1,000,000,000
Hangouts: 1,000,000,000
Messenger Lite: Free Calls & Messages: 100,000,000
Kik: 100,000,000
KakaoTalk: Free Calls & Text: 100,000,000
Opera Mini - fast web browser: 100,000,000
Opera Browser: Fast and Secure: 100,000,000
Telegram: 100,000,000
Truecaller: Caller ID, SMS spam blocking & Dialer: 100,000,000
UC Browser Mini -Tiny Fast Private & Secure: 100,000,000
Viber Messenger: 500,000,000
WeChat: 100,000,000
Yahoo M

We can see that the data is heavily skewed due to a few apps that have a particularly high number of installs, such as *WhatsApp*, *Facebook Messenger*, *Skype*, *Google Chrome*, *Gmail*, etc.

Let's now take a look at the video players category, which has an average of over 24 million installs.

In [60]:
for app in android_free_english:
    high_installs = app[5] in ('1,000,000,000', '500,000,000', '100,000,000')
    if app[1] == 'VIDEO_PLAYERS' and high_installs:
        print(f'{app[0]}: {app[5]}')

YouTube: 1,000,000,000
Motorola Gallery: 100,000,000
VLC for Android: 100,000,000
Google Play Movies & TV: 1,000,000,000
MX Player: 500,000,000
Dubsmash: 100,000,000
VivaVideo - Video Editor & Photo Movie: 100,000,000
VideoShow-Video Editor, Video Maker, Beauty Camera: 100,000,000
Motorola FM Radio: 100,000,000


We can see that particularly popular apps, such as *YouTube* and *Google Play Movies & TV*, skew the averages. Even without looking directly at the data, we can guess that other categories with particularly high averages, such as `SOCIAL`, are likely skewed by similar players.

This makes it hard to know whether these categories are genuinely popular or whether a few giants make them seem so. Well-made apps in these categories clearly can attract many users, so they may still be worth considering, but it would be very hard to compete and be seen in a category dominated by such popular and well-made apps.
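
One way to gauge how much a few very large apps distort an average is to compare the mean with the median, which is far less sensitive to extreme values. The numbers below are made up purely to illustrate the effect and do not come from the dataset; the technique is offered as a possible cross-check, not as part of the analysis above.

```python
from statistics import mean, median

# Hypothetical install counts for one category: several small apps and one giant.
toy_installs = [10_000, 50_000, 100_000, 500_000, 1_000_000_000]

print('mean:  ', mean(toy_installs))
print('median:', median(toy_installs))
```

In this toy case the mean is roughly 200 million while the median is 100,000, so a single giant app is enough to make the category look far more popular than a typical app in it.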

The games genre seems popular, though it is a pretty saturated market, as seen before. Another genre that may be worth a closer look is books and reference, which has an average of almost 9 million installs.

In [65]:
for app in android_free_english:
    # app[5] holds the install count as a string; keep only the three largest buckets
    high_installs = app[5] in ('1,000,000,000', '500,000,000', '100,000,000')
    if app[1] == 'BOOKS_AND_REFERENCE' and high_installs:
        print(f'{app[0]}: {app[5]}')

Google Play Books: 1,000,000,000
Bible: 100,000,000
Amazon Kindle: 100,000,000
Wattpad 📖 Free Books: 100,000,000
Audiobooks from Audible: 100,000,000


We can see that there are far fewer extremely popular apps in books and reference, so the data is likely not as skewed. Let's take a look at the slightly less popular apps in books and reference.

In [66]:
for app in android_free_english:
    # Mid-range install buckets: 1 million to 50 million
    mid_installs = app[5] in ('1,000,000', '5,000,000', '10,000,000', '50,000,000')
    if app[1] == 'BOOKS_AND_REFERENCE' and mid_installs:
        print(f'{app[0]}: {app[5]}')

Wikipedia: 10,000,000
Cool Reader: 10,000,000
Book store: 1,000,000
FBReader: Favorite Book Reader: 10,000,000
Free Books - Spirit Fanfiction and Stories: 1,000,000
AlReader -any text book reader: 5,000,000
FamilySearch Tree: 1,000,000
Cloud of Books: 1,000,000
ReadEra – free ebook reader: 1,000,000
Ebook Reader: 5,000,000
Read books online: 5,000,000
eBoox: book reader fb2 epub zip: 1,000,000
All Maths Formulas: 1,000,000
Ancestry: 5,000,000
HTC Help: 10,000,000
Moon+ Reader: 10,000,000
English-Myanmar Dictionary: 1,000,000
Golden Dictionary (EN-AR): 1,000,000
All Language Translator Free: 1,000,000
Aldiko Book Reader: 10,000,000
Dictionary - WordWeb: 5,000,000
50000 Free eBooks & Free AudioBooks: 5,000,000
Al-Quran (Free): 10,000,000
Al Quran Indonesia: 10,000,000
Al'Quran Bahasa Indonesia: 10,000,000
Al Quran Al karim: 1,000,000
Al Quran : EAlim - Translations & MP3 Offline: 5,000,000
Koran Read &MP3 30 Juz Offline: 1,000,000
Hafizi Quran 15 lines per page: 1,000,000
Quran for Andro

We see mostly dictionaries, apps for downloading and reading books, and religious texts. We likely don't want to build a similar app (e.g., a dictionary) since there is a lot of competition in that field. However, we can see that people seem to enjoy apps that are focused on books or topics that interest them, and they are willing to read these books on their mobile devices.

Let's keep this information on the back burner, and see what we can find in the iOS data.

#### iOS Data

Let's start by taking a look at the data in the `rating_count_tot` column to see how it looks.

In [61]:
display_table(ios_free_english, 5)

0 : 4.686530105524519
1 : 0.7138423339540658
7 : 0.4345127250155183
5 : 0.4345127250155183
2 : 0.4345127250155183
10 : 0.40347610180012417
6 : 0.37243947858473
14 : 0.37243947858473
9 : 0.31036623215394166
53 : 0.31036623215394166
29 : 0.31036623215394166
22 : 0.31036623215394166
17 : 0.31036623215394166
8 : 0.27932960893854747
41 : 0.27932960893854747
38 : 0.27932960893854747
105 : 0.27932960893854747
78 : 0.2482929857231533
21 : 0.2482929857231533
115 : 0.2482929857231533
58 : 0.21725636250775915
49 : 0.21725636250775915
43 : 0.21725636250775915
37 : 0.21725636250775915
30 : 0.21725636250775915
3 : 0.21725636250775915
19 : 0.21725636250775915
110 : 0.21725636250775915
109 : 0.21725636250775915
99 : 0.186219739292365
94 : 0.186219739292365
70 : 0.186219739292365
56 : 0.186219739292365
50 : 0.186219739292365
42 : 0.186219739292365
39 : 0.186219739292365
35 : 0.186219739292365
343 : 0.186219739292365
18 : 0.186219739292365
15 : 0.186219739292365
12 : 0.186219739292365
89 : 0.15518311607

The rating counts (shown to the left of the colon) are whole numbers. The percentages on the right are all small because few apps share exactly the same number of ratings, so we can ignore them. This shows us that, unlike the install counts in the Android data, these values are not rounded, and that the strings can be converted to floats easily since they do not include any unneeded characters.
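As a quick sanity check of that claim, the sketch below tests a few of the rating-count strings from the table above. The `rating_counts` list is a hand-copied sample rather than the full column, and the variable names are chosen here for illustration.

```python
# A few rating-count strings copied from the left-hand side of the table above.
rating_counts = ['0', '1', '7', '105', '343']

# str.isdigit() is True only for non-empty strings made up entirely of digit
# characters, so it rejects values containing ',', '+', '.', or whitespace.
all_whole_numbers = all(value.isdigit() for value in rating_counts)

# Because the strings are clean, float() needs no preprocessing.
as_floats = [float(value) for value in rating_counts]

print(all_whole_numbers)
print(as_floats)
```

The same check could be run over the whole column by building the list from index 5 of each row in `ios_free_english`.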

We can now put the dataset through the functions made earlier to get a look at the average number of ratings for each genre.

In [62]:
genre_popularity(ios_free_english, 11, 5)

Social Networking: 71548.34905660378
Photo & Video: 28441.54375
Games: 22788.6696905016
Music: 57326.530303030304
Reference: 74942.11111111111
Health & Fitness: 23298.015384615384
Weather: 52279.892857142855
Utilities: 18684.456790123455
Travel: 28243.8
Shopping: 26919.690476190477
News: 21248.023255813954
Navigation: 86090.33333333333
Lifestyle: 16485.764705882353
Entertainment: 14029.830708661417
Food & Drink: 33333.92307692308
Sports: 23008.898550724636
Book: 39758.5
Finance: 31467.944444444445
Education: 7003.983050847458
Productivity: 21028.410714285714
Business: 7491.117647058823
Catalogs: 4004.0
Medical: 612.0


After a quick glance, we can see that `Navigation` apps have the highest average number of ratings, followed by `Reference` and `Social Networking`.
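Since the output above is not ordered, sorting the averages makes the ranking easier to confirm. This is a minimal sketch that assumes the averages are available in a dictionary; `genre_averages` is a hypothetical name, and it holds only a subset of the values printed above.

```python
# Subset of the per-genre averages copied from the output above.
genre_averages = {
    'Social Networking': 71548.34905660378,
    'Photo & Video': 28441.54375,
    'Games': 22788.6696905016,
    'Music': 57326.530303030304,
    'Reference': 74942.11111111111,
    'Weather': 52279.892857142855,
    'Navigation': 86090.33333333333,
    'Book': 39758.5,
}

# Sort (genre, average) pairs by the average, largest first.
ranked = sorted(genre_averages.items(), key=lambda item: item[1], reverse=True)

for genre, average in ranked[:5]:
    print(f'{genre}: {average:,.0f}')
```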

In the next few code blocks, we are going to take a closer look at which apps are in those genres.

In [43]:
for app in ios_free_english:
    if app[11] == 'Navigation':
        print(f'{app[1]}: {app[5]}') # We will be printing the name of the app and the number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic: 345046
Google Maps - Navigation & Transit: 154911
Geocaching®: 12811
CoPilot GPS – Car Navigation & Offline Maps: 3582
ImmobilienScout24: Real Estate Search in Germany: 187
Railway Route Search: 5


In [44]:
for app in ios_free_english:
    if app[11] == 'Reference':
        print(f'{app[1]}: {app[5]}')

Bible: 985920
Dictionary.com Dictionary & Thesaurus: 200047
Dictionary.com Dictionary & Thesaurus for iPad: 54175
Google Translate: 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran: 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition: 17588
Merriam-Webster Dictionary: 16849
Night Sky: 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE): 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools: 4693
GUNS MODS for Minecraft PC Edition - Mods Tools: 1497
Guides for Pokémon GO - Pokemon GO News and Cheats: 826
WWDC: 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free: 718
VPN Express: 14
Real Bike Traffic Rider Virtual Reality Glasses: 8
教えて!goo: 0
Jishokun-Japanese English Dictionary & Translator: 0


In [45]:
for app in ios_free_english:
    if app[11] == 'Social Networking':
        print(f'{app[1]}: {app[5]}')

Facebook: 2974676
Pinterest: 1061624
Skype for iPhone: 373519
Messenger: 351466
Tumblr: 334293
WhatsApp Messenger: 287589
Kik: 260965
ooVoo – Free Video Call, Text and Voice: 177501
TextNow - Unlimited Text + Calls: 164963
Viber Messenger – Text & Call: 164249
Followers - Social Analytics For Instagram: 112778
MeetMe - Chat and Meet New People: 97072
We Heart It - Fashion, wallpapers, quotes, tattoos: 90414
InsTrack for Instagram - Analytics Plus More: 85535
Tango - Free Video Call, Voice and Chat: 75412
LinkedIn: 71856
Match™ - #1 Dating App.: 60659
Skype for iPad: 60163
POF - Best Dating App for Conversations: 52642
Timehop: 49510
Find My Family, Friends & iPhone - Life360 Locator: 43877
Whisper - Share, Express, Meet: 39819
Hangouts: 36404
LINE PLAY - Your Avatar World: 34677
WeChat: 34584
Badoo - Meet New People, Chat, Socialize.: 34428
Followers + for Instagram - Follower Analytics: 28633
GroupMe: 28260
Marco Polo Video Walkie Talkie: 27662
Miitomo: 23965
SimSimi: 23530
Grindr - G

We can see that the `Navigation` app rating averages are heavily influenced by *Waze* and *Google Maps*, while the `Reference` apps are heavily influenced by *Bible* and *Dictionary.com*. `Social Networking` averages have a similar pattern due to *Facebook*, *Pinterest*, and other large popular apps. 
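To put a number on that influence, we can compare the mean with the median and measure what share of all ratings belongs to the two largest apps. The sketch below uses the six `Navigation` rating counts printed above; the variable names are chosen here for illustration.

```python
from statistics import mean, median

# Rating counts for the Navigation apps, copied from the output above.
navigation_ratings = {
    'Waze - GPS Navigation, Maps & Real-time Traffic': 345046,
    'Google Maps - Navigation & Transit': 154911,
    'Geocaching®': 12811,
    'CoPilot GPS – Car Navigation & Offline Maps': 3582,
    'ImmobilienScout24: Real Estate Search in Germany': 187,
    'Railway Route Search': 5,
}

counts = sorted(navigation_ratings.values(), reverse=True)

average = mean(counts)    # pulled upward by the largest apps
middle = median(counts)   # far less sensitive to a few outliers
top_two_share = sum(counts[:2]) / sum(counts)

print(f'Mean: {average:,.1f}')
print(f'Median: {middle:,.1f}')
print(f'Share of ratings held by the top two apps: {top_two_share:.1%}')
```

The mean matches the `Navigation` average reported earlier, while the median is roughly a tenth of it, and *Waze* and *Google Maps* together account for well over 90% of all ratings in the genre.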

`Music`, `Weather`, `Food & Drink`, and `Finance` are also on the higher side in terms of average number of user ratings. However, these would not work for our purposes for the reasons listed below, so we will not be looking into them further.

- `Weather`: Users generally check weather apps only briefly, so in-app ads would have little opportunity to be profitable. Additionally, such an app would likely require access to expensive APIs.
- `Food & Drink`: Popular food-based apps revolve around a specific store that uses the app to sell its products. We are building an app, not a restaurant or physical service, so this wouldn't be helpful for us.
- `Finance`: These apps require specific financial knowledge, which would require us to hire an entire team dedicated to the financial side of the app. This is not a route we are interested in taking.
- `Music`: The popular music apps are dedicated to downloading and listening to music, which would require some tier of payment for them to be profitable due to the cost of the music itself.

Taking a look at both the Android and iOS data, we can see an overlap in the popularity of the books and reference genre. Though this genre has a few big players, it would be entirely possible to create an app that does not compete with any of them. For example, an app could focus on a certain book or series and include the text, an audiobook version, quizzes, trivia, and a built-in dictionary.

## Conclusion

We have now completed our analysis of the Android (Google Play Store) and iOS (App Store) apps. We cleaned the data so that it only includes complete rows that refer to free, English-language apps. We then analyzed the data, first taking a look at how many apps the stores hold by genre and then at popularity within each genre.

This information led us to conclude that an app in the **books and reference** genre has a high likelihood of succeeding in both the Google Play Store and the App Store. Such an app could be engaging enough to keep people on it long enough for in-app ads to be profitable. To make it engaging and different from simply downloading a book from one of the existing libraries, it would be beneficial to pick a popular book and include special features, such as an in-app dictionary, trivia, quizzes, and an audiobook version.