# Profitable App Profiles for the App Store and Google Play Markets

As data analysts for a mobile app development company, we aim to identify profitable app profiles in the App Store and Google Play markets. We focus on *free* apps for an *English-speaking* audience, with in-app ads as our main revenue source, which means that the number of users heavily affects our earnings. Through data analysis, we aim to guide our developers in building apps that attract more users.

## Data Preparation

We have two datasets that will help us reach this goal.

* [A dataset](https://www.kaggle.com/datasets/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018.
* [A dataset](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017.

First, we write a helper function that reads a CSV file into a list of lists and returns the header row and the data rows separately.

In [1]:
from csv import reader

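def to_list(path):
    # Open the file with a context manager so it is closed after reading.
    with open(path, encoding='utf8') as opened_file:
        apps_data = list(reader(opened_file))

    # Return the header row and the data rows separately.
    return apps_data[0], apps_data[1:]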

In [2]:
ios_header, ios = to_list('AppleStore.csv')
android_header, android = to_list('googleplaystore.csv')

### Data Exploring

Let's explore these two datasets using the `explore_data()` function.

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [4]:
explore_data(ios, 0, 5, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


In [5]:
explore_data(android, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13


We have 7,197 iOS apps and 10,841 Android apps.

Let's check which columns our datasets contain and try to work out which ones can help with our analysis.

In [6]:
print(ios_header)
print('\n')
print(android_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Let's use the documentation to understand what data are in the columns.

* [Documentation for iOS data set](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps)
* [Documentation for Android data set](https://www.kaggle.com/datasets/lava18/google-play-store-apps)

We can assume that the most interesting columns for us are `Category` and `Genres` for Android apps and `prime_genre` for iOS apps, but for now this is only an assumption.

### Data Cleaning

In this step we need to make sure the data we analyze is accurate, or the results of our analysis will be wrong.

To do that, we need to:
* Detect inaccurate data, and correct or remove it.
* Detect duplicate data, and remove the duplicates.
* Detect and remove irrelevant data.


As stated in the introduction, our company builds *free* apps for an *English-speaking* audience, so we have to remove all apps that do not fit these criteria. But first, we'll remove inaccurate data and duplicates.

#### Deleting Wrong Data

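According to this [discussion](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015), one row in the Google Play dataset is incorrect. Its `Category` value is missing, so every subsequent column is shifted one position to the left (the rating `1.9` appears where the category should be).

Let's remove this row.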
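
Rows like this can also be found programmatically by comparing each row's length with the header's length. The snippet below is only a sketch: `find_malformed_rows` is a hypothetical helper that is not part of the original notebook, and the sample rows are made up for illustration.

```python
def find_malformed_rows(header, dataset):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Tiny illustrative sample (not the real dataset): the second row lacks a category.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
]

for index, row in find_malformed_rows(sample_header, sample_rows):
    print(index, row)
```

On the real data the same idea would be called as `find_malformed_rows(android_header, android)`.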

In [7]:
print(android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [8]:
del android[10472]

In [9]:
print(android[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


#### Removing Duplicate Entries

Now that we've removed the incorrect row, we need to remove duplicates. First, we have to find them in our dataset.

For instance, let's find Instagram duplicates in Android apps.

In [10]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Instagram has four entries. Let's find all duplicates in Android apps.

In [11]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


We also need to decide which of the duplicate entries to keep. We will use the number of reviews as the criterion: the entry with the highest number of reviews was most likely collected last, so it should hold the most recent data for the app.

To remove the duplicates, we will do the following:

* Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.
* Use the information stored in the dictionary and create a new dataset, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).


In [12]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(reviews_max['Instagram'])
print(len(reviews_max))

66577446.0
9659


We have a dictionary that contains pairs of data in the format of `App name` : `Maximum number of reviews`. Let's use it to remove duplicates. Here's how we can do it:

1. Create two empty lists called `android_clean` and `already_added`
2. Loop through the android data set, and for every iteration:
    * Check if the number of reviews equals the maximum number of reviews from our `reviews_max` dictionary.
    * Check if the app name is not already in the `already_added` list. We need to do this because we have duplicates with the same number of reviews.
    * If both conditions are met, then add an app to the `android_clean` list and add app name to `already_added` list

In [13]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


The actual number of rows (9,659) matches the expected number, which is the length of the `reviews_max` dictionary.
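
As a side note, the two passes above can be folded into a single pass by storing the best row per app name in a dictionary. This is only an alternative sketch, not the approach the notebook uses: `dedupe_by_max_reviews` is a hypothetical name, the `Example App` row is invented, and the resulting row order may differ from `android_clean`.

```python
def dedupe_by_max_reviews(dataset, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best_rows = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Replace the stored row only when this one has strictly more reviews,
        # so ties keep the first row that was seen.
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())


# Shortened illustrative rows: name, category, rating, reviews.
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Example App', 'TOOLS', '4.0', '100'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

for row in dedupe_by_max_reviews(sample_rows):
    print(row)
```

Dictionary lookups take constant time on average, whereas `name not in already_added` scans a list, so this variant should scale better on larger datasets.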

#### Removing Non-English Apps

As mentioned above, our company develops apps for an *English-speaking* audience, so we have to remove non-English apps from our datasets. In Python, each character has a corresponding number (its Unicode code point), which the built-in `ord()` function returns. For instance, the corresponding number for character `'a'` is 97, character `'A'` is 65, and character `'爱'` is 29,233.

The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the [ASCII](https://en.wikipedia.org/wiki/ASCII) (American Standard Code for Information Interchange) system. Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. If the number is equal to or less than 127, then the character belongs to the set of common English characters.
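
The range check can be tried on single characters with the built-in `ord()` function before writing the full filter. In this small sketch, `is_ascii_char` is a hypothetical helper name introduced only for illustration.

```python
def is_ascii_char(ch):
    """Return True if the character's code point is in the ASCII range 0-127."""
    return ord(ch) <= 127


# Print each character, its code point, and whether it falls in the ASCII range.
for ch in ['a', 'A', '爱', '™']:
    print(repr(ch), ord(ch), is_ascii_char(ch))
```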

Let's write a function that detects non-English app names.

In [14]:
def is_english(string):
    # The parameter is named `string` to avoid shadowing the built-in `str`.
    for ch in string:
        if ord(ch) > 127:
            return False
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
False


Our function works correctly only if the app name doesn't include symbols like `™` or emojis, so using it as is would discard many useful English apps. Let's modify it to avoid this problem: if the input string has more than three characters that fall outside the ASCII range (0 - 127), the function returns `False` (the string is identified as non-English); otherwise it returns `True`.

In [15]:
def is_english(string):
    non_ascii_count = 0

    for ch in string:
        if ord(ch) > 127:
            non_ascii_count += 1
        # Reject the name once more than three non-ASCII characters are found.
        if non_ascii_count > 3:
            return False
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True


The function is still not perfect (an English name with more than three emojis would be rejected), but it is good enough to filter our datasets. Let's do it.

In [16]:
def remove_non_eng_apps(dataset, app_name_index):
    eng_apps = []
    
    for row in dataset:
        name = row[app_name_index]
        
        if is_english(name):
            eng_apps.append(row)
            
    return eng_apps

android_english = remove_non_eng_apps(android_clean, 0)
ios_english = remove_non_eng_apps(ios, 1)

explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

We can see that we're left with 9614 Android apps and 6183 iOS apps.

#### Isolating the Free Apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our datasets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.

In [17]:
def remove_non_free_apps(dataset, price_index):
    free_apps = []
    
    for row in dataset:
        price = row[price_index]
        
        if price == '0' or price == '0.0':
            free_apps.append(row)
    
    return free_apps

android_prepared = remove_non_free_apps(android_english, 7)
ios_prepared = remove_non_free_apps(ios_english, 4)

explore_data(android_prepared, 0, 3, True)
print('\n')
explore_data(ios_prepared, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

We're left with 8,864 Android apps and 3,222 iOS apps. In the original datasets, we had 10,841 Android apps and 7,197 iOS apps, so we removed more than half of the iOS dataset. Now our data is ready for analysis.
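
To quantify how much of each dataset survived the cleaning, we can compute the share of rows that were kept. This is a small sketch using the row counts reported above; `share_kept` is a hypothetical helper name.

```python
def share_kept(original, remaining):
    """Return the percentage of rows that remain after cleaning."""
    return remaining / original * 100


# (original rows, rows after cleaning), taken from the outputs above.
counts = {
    'Android': (10841, 8864),
    'iOS': (7197, 3222),
}

for platform, (original, remaining) in counts.items():
    kept = share_kept(original, remaining)
    print(f'{platform}: kept {kept:.1f}%, removed {100 - kept:.1f}%')
```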

## Data Analysis

Our data went through several stages of cleaning:
* Removing inaccurate data
* Removing duplicate app entries
* Removing non-English apps
* Isolating the free apps

As we mentioned in the introduction, our goal is to determine the kinds of apps that are likely to attract more users, because the number of people using our apps affects our revenue.

To minimize risks and overhead, our validation strategy for an app idea has three steps:
1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by determining the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our datasets.

We'll create a frequency table for the `prime_genre` column in the App Store dataset and the `Genres` and `Category` columns in the Google Play dataset. The frequency table will show how often each category or genre appears in the respective datasets.

### Most Common Apps by Genre

We'll build two functions we can use to analyze the frequency tables:

1. One function to generate frequency tables that show percentages
    * This function takes two inputs: `dataset` and the `index` of the column for which we want to build the frequency table
    * Build frequency table with numbers
    * Transform those numbers into percentages

In [35]:
def freq_table(dataset, index):
    result = {}
    total = len(dataset)
    
    for row in dataset:
        value = row[index]
        
        if value in result:
            result[value] += 1
        else:
            result[value] = 1
        
    for key in result:
        result[key] = (result[key] / total) * 100
    
    return result

Let's check how it works.

In [36]:
freq_table(android_prepared, 9)

{'Art & Design': 0.5979241877256317,
 'Art & Design;Creativity': 0.06768953068592057,
 'Auto & Vehicles': 0.9250902527075812,
 'Beauty': 0.5979241877256317,
 'Books & Reference': 2.1435018050541514,
 'Business': 4.591606498194946,
 'Comics': 0.6092057761732852,
 'Comics;Creativity': 0.01128158844765343,
 'Communication': 3.2378158844765346,
 'Dating': 1.861462093862816,
 'Education': 5.347472924187725,
 'Education;Creativity': 0.04512635379061372,
 'Education;Education': 0.33844765342960287,
 'Education;Pretend Play': 0.056407942238267145,
 'Education;Brain Games': 0.033844765342960284,
 'Entertainment': 6.069494584837545,
 'Entertainment;Brain Games': 0.078971119133574,
 'Entertainment;Creativity': 0.033844765342960284,
 'Entertainment;Music & Video': 0.16922382671480143,
 'Events': 0.7107400722021661,
 'Finance': 3.7003610108303246,
 'Food & Drink': 1.2409747292418771,
 'Health & Fitness': 3.0798736462093865,
 'House & Home': 0.8235559566787004,
 'Libraries & Demo': 0.936371841155234

2. Another function we can use to display the percentages in a descending order
    * Takes in two parameters: `dataset` and `index`. `dataset` will be a list of lists, and `index` will be an integer
    * Generates a frequency table using the `freq_table()` function
    * Transforms the frequency table into a list of tuples, then sorts the list in a descending order
    * Prints the entries of the frequency table in descending order

In [34]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
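
For comparison, the same sorted percentage table can be built with `collections.Counter` from the standard library. This is an alternative sketch rather than the notebook's method: `sorted_percentages` is a hypothetical name, the sample rows are made up, and entries with equal counts may come out in a different order than in `display_table()`.

```python
from collections import Counter


def sorted_percentages(dataset, index):
    """Return (value, percentage) pairs sorted by descending frequency."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    # most_common() yields (value, count) pairs from most to least frequent.
    return [(value, count / total * 100) for value, count in counts.most_common()]


# Made-up rows for illustration: app name and genre.
sample_rows = [
    ['App A', 'Games'],
    ['App B', 'Music'],
    ['App C', 'Games'],
    ['App D', 'Games'],
]

for value, percentage in sorted_percentages(sample_rows, 1):
    print(value, ':', percentage)
```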

Let's analyze frequency table for the `prime_genre` column of the App Store dataset.

In [45]:
display_table(ios_prepared, 11) # prime_genre

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


The two most common genres are `Games` and `Entertainment`, with a large gap between `Games` (about 58%) and every other genre. Most of the apps in the dataset are designed for entertainment (games, photo & video, sports, etc.).
We can't draw general conclusions from the genre frequency table alone, because a large number of apps in a genre does not mean a large number of users. We can, however, test the hypothesis that games are the type of app that attracts the most users. If the hypothesis is confirmed, the company may focus on mobile game development.

Let's analyze frequency tables for the `Category` column and `Genres` column of the Google Play dataset.

In [50]:
display_table(android_prepared, 1) # Category
print('\n')
print('\n')
display_table(android_prepared, 9) # Genres

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

The two most common categories are `FAMILY` and `GAME`, and the most common genre is Tools. Even though many of the apps are designed for fun (family apps usually include games for kids), the `TOOLS` category accounts for about **8%** of the apps, and tools, education, business, and productivity all rank near the top of the genres. For Android we can test two hypotheses: developing family games or developing productivity tools.

### Most Popular Apps by Genre on the App Store

The frequency tables we analyzed in the previous section showed us that apps designed for fun dominate the App Store, while Google Play shows a more balanced landscape of both practical and fun apps. Now, we'd like to determine the kind of apps with the most users.

We can find this information in the `Installs` column for Android apps, but the iOS dataset has no such column. For iOS we'll use the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

In [58]:
ios_genre_freq_table = freq_table(ios_prepared, 11)

for genre in ios_genre_freq_table:
    total = 0
    len_genre = 0
    
    for row in ios_prepared:
        genre_app = row[11]
        
        if genre == genre_app:
            n_rating = float(row[5])
            total += n_rating
            len_genre += 1
            
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


The top three genres by average number of user ratings are `Navigation`, `Reference`, and `Social Networking`. First, we'll check the apps in the `Reference` genre.

In [57]:
for app in ios_prepared:
    if app[11] == 'Reference':
        print(app[1], ':', app[5]) # print name and number of ratings

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


Within `Reference`, the Bible and Dictionary.com apps account for most of the ratings, so the genre average is driven by a few very popular apps. The high averages for `Navigation` and `Social Networking` are likely driven by a few heavily rated apps in the same way. While it may be difficult for our company to compete with the most popular apps in those genres, we could consider developing interactive books as a potential alternative.
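
One way to see how strongly a few apps drive a genre average is to compare the mean with the median. The sketch below uses the `Reference` rating counts printed above; the variable names are illustrative and not part of the original notebook.

```python
from statistics import mean, median

# Rating counts of the Reference apps, copied from the output above.
reference_ratings = [
    985920, 200047, 54175, 26786, 18418, 17588, 16849, 12122, 8535,
    4693, 1497, 826, 762, 718, 14, 8, 0, 0,
]

mean_ratings = mean(reference_ratings)
median_ratings = median(reference_ratings)

# Share of all ratings that belongs to the two most-rated apps.
top_two = sum(sorted(reference_ratings, reverse=True)[:2])
top_two_share = top_two / sum(reference_ratings) * 100

print('Mean:', mean_ratings)
print('Median:', median_ratings)
print('Share of the top two apps (%):', top_two_share)
```

A median far below the mean suggests that a typical app in the genre attracts far fewer ratings than the average implies.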

### Most Popular Apps by Genre on Google Play

Let's perform a similar analysis for Android apps.

In [62]:
android_category_freq_table = freq_table(android_prepared, 1)

for category in android_category_freq_table:
    total = 0
    len_category = 0
    
    for row in android_prepared:
        category_app = row[1]
        
        if category_app == category:
            installs = row[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            installs = float(installs)
        
            total += installs
            len_category += 1
 
    avg_installs = total / len_category
    print(category, ':', avg_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

The top three categories by average number of installs on Android are `COMMUNICATION`, `VIDEO_PLAYERS`, and `SOCIAL`, which differs from iOS, where `Reference` is among the most popular genres. Given this, finding an audience for book-related applications on Android may be challenging. Instead, we could consider developing apps in the `GAME` or `TOOLS` categories, which have many installs and are among the most common categories on the platform.

## Conclusions

After analyzing the datasets, we have formulated several hypotheses:
* We can see that games are popular on both platforms, so we could focus on developing games.
* We observe that the Reference genre is popular on iOS, and we may be able to carve out a niche for ourselves in this genre by developing interactive books.
* Android users tend to prefer apps from the Tools category, so in addition to games, we could also explore developing applications for this category.