# Profitable App Profiles for the App Store and Google Play Markets

## Introduction

For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and in the App Store.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that the number of users of our apps determines our revenue for any given app — the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.


## Opening the datasets

In [1]:
from csv import reader

### The Google Play data set ###
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

### The App Store data set ###
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
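
The two loading blocks above repeat the same steps and never close the files. A minimal sketch of a reusable loader follows. The name `load_csv` is hypothetical (it is not part of the original notebook), and the demonstration reads an in-memory sample instead of the real files so that it runs on its own.

```python
import csv
import io

def load_csv(source):
    """Split a CSV source (open text file or file-like object) into header and data rows."""
    rows = list(csv.reader(source))
    return rows[0], rows[1:]

# In-memory stand-in for a CSV file; quoted fields may contain commas.
sample = io.StringIO(
    'App,Category,Installs\n'
    '"Foo, the app",TOOLS,"1,000+"\n'
    'Bar,GAME,500+\n'
)
sample_header, sample_rows = load_csv(sample)
print(sample_header)
print(sample_rows)

# With the real files, a `with` block closes the file once it has been read:
# with open('googleplaystore.csv', encoding='utf8', newline='') as f:
#     android_header, android = load_csv(f)
```

Returning the header and the data rows as a pair keeps the header out of later counts and averages.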

## Data exploration


In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
explore_data(android, 0, 3, True)
print('\n')
explore_data(ios, 0, 3, True)
print('\n')

##Column names for android##
print(android_header)

print('\n')
##Column names for ios##
print(ios_header)


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+

Based on the headers of both datasets, it seems that in the [Android](https://www.kaggle.com/datasets/lava18/google-play-store-apps) dataset, the following 10 columns could be interesting:

| Column Name  | Description   |  Relevant? |
|:---|:---|---:|
| App    | Application name   | Yes |
| Category | Category the app belongs to | Yes  |
| Rating | Overall user rating of the app (as when scraped)  | Yes  |
| Reviews  | Number of user reviews for the app (as when scraped)  | Yes  |
| Size  | Size of the app (as when scraped)  | Yes  |
| Installs  | Number of user downloads/installs for the app (as when scraped)  | Yes |
| Type  | Paid or Free  | Yes  |
| Price  | Price of the app (as when scraped)  | Yes  |
| Content Rating  | Age group the app is targeted at - Children / Mature 21+ / Adult  | Yes |
| Genres | An app can belong to multiple genres (apart from its main category).  | Yes |
| Last Updated  |   | No  |
| Current Ver  |   | No |
| Android Ver  |   | No  |

For the [iOS](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) data, these look promising (9 columns):


| Column Name  | Description   |  Relevant? |
|---|---|---|
| "id"     |  App ID   | Yes |
| "track_name" | App Name | Yes  |
| "size_bytes" | Size (in Bytes)  | Yes  |
| "currency" | Currency Type  | No (USD for all)  |
| "price"  | Price amount  | Yes  |
| "rating_count_tot"  | User Rating counts (for all version)  | Yes |
| "rating_count_ver"  | User Rating counts (for current version)  | Yes  |
| "ver"  | Latest version code  | Yes  |
| "cont_rating"  | Content Rating (Age group) | Yes |
| "prime_genre" |  Primary Genre  | Yes |
| "sup_devices.num"  |  Number of supporting devices | No  |
| "ipadSc_urls.num"  | Number of screenshots showed for display  | No |
| "lang.num"  | Number of supported languages | No  |
| "vpp_lic" | Vpp Device Based Licensing Enabled | No |

## Data cleaning
The discussion section of the Google Play dataset mentions several potential issues. For example, one row has been found to be erroneous and needs to be removed (row 10472, [reference](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015)).
Another issue concerns potential duplicate entries. Finally, we are only interested in free English-language apps. We start by removing the row that, based on the [discussion section](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion?sort=undefined&page=3) of the dataset, is known to contain an error.

In [3]:
print(android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [4]:
del android[10472]

In [5]:
print(len(android))

10840


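The printed row has only 12 values instead of 13: the `Category` value is missing, so the rating `1.9` sits in the category position and every later value is shifted one column to the left. After deleting the row, the length of the dataset (10840 instead of 10841) confirms that exactly one row was removed.
No other rows are explicitly reported as erroneous, so we move on to the next step and remove duplicate entries.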
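
Since the erroneous row stands out by having fewer values than the header, a length check is one way to look for similar rows without relying on a hard-coded index. The sketch below uses a hypothetical helper `find_malformed_rows` and toy data, not the real dataset.

```python
def find_malformed_rows(dataset, expected_columns):
    """Return the indices of rows whose length differs from the header length."""
    return [i for i, row in enumerate(dataset) if len(row) != expected_columns]

toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['Foo', 'TOOLS', '4.1'],
    ['Bar', '3.9'],            # category missing, values shifted left
    ['Baz', 'GAME', '4.5'],
]

bad = find_malformed_rows(toy_rows, len(toy_header))
print(bad)

# Delete from the end so that the remaining indices stay valid.
for i in reversed(bad):
    del toy_rows[i]
print(len(toy_rows))
```

Unlike `del android[10472]`, this check removes nothing new if the cell is run a second time.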

## Removing duplicate entries

The loop below adds each app name to the list `unique_apps` the first time it is encountered. If the name shows up again (i.e. it is already in `unique_apps`), it is added to the list `duplicate_apps` instead. We find 1181 duplicates.

In [6]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]  # the app name is the first element (index 0) of each row
    if name in unique_apps:
        duplicate_apps.append(name)
    else: 
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))  # number of repeated entries
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])  # prints the label, then the first 15 duplicate names


Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
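
The membership test `name in unique_apps` scans a list on every iteration, which gets slow for larger datasets. As an alternative sketch, `collections.Counter` from the standard library counts all names in one pass. The helper name `count_duplicates` and the toy rows are assumptions for illustration.

```python
from collections import Counter

def count_duplicates(rows, name_idx=0):
    """Return (number of surplus entries, list of names that occur more than once)."""
    counts = Counter(row[name_idx] for row in rows)
    surplus = sum(n - 1 for n in counts.values())
    repeated = [name for name, n in counts.items() if n > 1]
    return surplus, repeated

toy_rows = [['Box'], ['Slack'], ['Box'], ['Box'], ['Zoom']]
surplus, repeated = count_duplicates(toy_rows)
print('Number of duplicate entries:', surplus)
print('Apps with duplicates:', repeated)
```

The surplus count corresponds to the length of `duplicate_apps` in the loop above: every occurrence after the first counts once.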


We will not remove duplicates randomly. Instead, we use the number of reviews in `Reviews` (index number 3) to decide which entry to keep: among the rows of a duplicated app, the one with the highest number of reviews is assumed to be the most recent entry.

In [7]:
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews: 
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews 
        
print('Expected length:', len(android) - 1181)
print('\n')
print('Length of dictionary:', len(reviews_max))

Expected length: 9659


Length of dictionary: 9659


Checking the length of the dictionary against the expected length (dataset - 1181 duplicates = 9659) shows that the procedure worked. The dictionary `reviews_max` now contains the entries with the highest number of reviews. We now will use this information to delete the duplicate entries.

* We will loop through the android dataset and compare the number of reviews in each row with the maximum stored in `reviews_max`.

* We will add the rows with the highest number of reviews to a list called `android_clean`, and we will keep track of the apps already added in the list `already_added`, so that only one row is kept when several duplicates share the same maximum.

In [8]:
android_clean = []
already_added = [] 

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        
print('Expected length:', len(android) - 1181)
print('\n')
print('Length of cleaned list:', len(android_clean))

Expected length: 9659


Length of cleaned list: 9659
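
The two passes above can be wrapped into one function, which makes the rule easy to test on a small example. This is only a sketch: `dedupe_keep_max` is a hypothetical name, the column positions are passed in as parameters, and a set replaces the `already_added` list for faster lookups.

```python
def dedupe_keep_max(rows, name_idx=0, reviews_idx=3):
    """Keep one row per name: the first row that carries the highest review count."""
    # Pass 1: highest review count per app name.
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        if name not in best or n_reviews > best[name]:
            best[name] = n_reviews

    # Pass 2: keep the first row that matches the maximum for its name.
    seen = set()
    clean = []
    for row in rows:
        name = row[name_idx]
        if name not in seen and float(row[reviews_idx]) == best[name]:
            clean.append(row)
            seen.add(name)
    return clean

toy_rows = [
    ['A', 'TOOLS', '4.0', '10'],
    ['B', 'GAME', '4.2', '5'],
    ['A', 'TOOLS', '4.1', '20'],
    ['A', 'TOOLS', '4.1', '20'],  # exact repeat of the best row
]
toy_clean = dedupe_keep_max(toy_rows)
print(toy_clean)
```

The `seen` check matters for the last toy row: without it, both rows with 20 reviews would be kept.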


Checking if this clean dataset looks ok:

In [9]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


As the iOS data does not contain duplicates (per dataset description), we will not undertake any measures to remove duplicates there.

## Removing non-English apps

Here we use a simple strategy to identify characters that are not commonly used in English, based on their code points. Characters commonly used in English text fall in the ASCII range, i.e. `ord()` returns a value of 127 or less. We will write a function that returns `False` if a string contains characters with a value greater than 127, and `True` otherwise. The function is supposed to go through an app name character by character. We will, however, first test this approach on some sample strings.

In [10]:
test_string = 'abc'
def no_eng(a_string):
    for character in a_string:
        value = ord(character)
        if value > 127:
            return False
        else:
            return True
        
print(no_eng(test_string))
print(no_eng('Instagram'))
print(no_eng('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(no_eng('Docs To Go™ Free Office Suite'))
print(no_eng('Instachat 😜'))

True
True
False
True
True


At first glance the function seems to work, but it actually checks only the first character of each string: because `return True` sits in the `else` branch inside the loop, the function returns as soon as the first character has been inspected. The last two names come out as `True` only because they start with an ASCII letter; the trademark symbol and the emoji are never reached. Independently of that, we want to make sure that special characters like the trademark symbol or emojis do not wrongly flag apps as non-English. We therefore change the function so that it only identifies a string as non-English if more than three characters fall outside the ASCII range. We implement this by appending every non-ASCII character to a list called `count` and then checking the length of that list.

In [11]:
test_string = 'abc'
def no_eng(a_string):
    count = []
    for character in a_string:
        value = ord(character)
        if value > 127:
            count.append(character)
    if len(count) > 3:
        return False
    else:
        return True
        
print(no_eng(test_string))
print(no_eng('Instagram'))
print(no_eng('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(no_eng('Docs To Go™ Free Office Suite'))
print(no_eng('Instachat 😜'))

True
True
False
True
True
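
The same check can be written with a plain counter instead of a list, and with the tolerance as a parameter. The sketch below uses hypothetical names (`count_non_ascii`, `is_english`, `max_non_ascii`); setting the tolerance to 0 gives the strict variant in which a single non-ASCII character is enough to reject a name.

```python
def count_non_ascii(text):
    """Count the characters whose code point lies outside the ASCII range (0-127)."""
    return sum(1 for ch in text if ord(ch) > 127)

def is_english(text, max_non_ascii=3):
    """Treat a name as English if it has at most `max_non_ascii` non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii

names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]
for name in names:
    print(name, '|', is_english(name), '|', is_english(name, max_non_ascii=0))
```

Comparing the two columns of the output shows which names are kept only because of the tolerance.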


We now apply this function to both datasets.  If an app name is identified as English, we append the whole row to a separate list. 

In [12]:
android_eng_apps = []
android_no_eng_apps = []
for app in android_clean:
    name = app[0]
    check = no_eng(name)
    if check: 
        android_eng_apps.append(app)
    else:
        android_no_eng_apps.append(app)
        
print('Identified as English:', len(android_eng_apps))
print('\n')
print('Identified as non-English:', len(android_no_eng_apps))

Identified as English: 9614


Identified as non-English: 45


It seems we have 9614 cases in the cleaned android dataset to work with. Exploring the English and non-English datasets, just to make sure that this worked fine:

In [13]:
explore_data(android_eng_apps, 0, 3)
print('\n')
explore_data(android_no_eng_apps, 0, 3)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']




['Flame - درب عقلك يوميا', 'EDUCATION', '4.6', '56065', '37M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'July 26, 2018', '3.3', '4.1 and up']


['သိင်္ Astrology - Min Thein Kha BayDin', 'LIFESTYLE', '4.7', '2225', '15M', '100,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'July 26, 2018', '4.2.1', '4.0.3 and up']


['РИА Новости', 'NEWS_AND_MAGAZINES', '4.5', '44274', '8.0M', '1,000,000+', 'Free', '0', 'Everyone', 'News 

The same for the iOS dataset (don't forget to change the index variable):

In [14]:
ios_eng_apps = []
ios_no_eng_apps = []
for app in ios:
    name = app[1]  # in the iOS data the app name is at index 1
    check = no_eng(name)
    if check: 
        ios_eng_apps.append(app)
    else:
        ios_no_eng_apps.append(app)
        
print('Identified as English:', len(ios_eng_apps))
print('\n')
print('Identified as non-English:', len(ios_no_eng_apps))

Identified as English: 6183


Identified as non-English: 1014


The iOS data contains 1014 apps identified as non-English, leaving 6183 for us to work with. We continue with the datasets `android_eng_apps` and `ios_eng_apps`.
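
The Android and iOS loops differ only in the index of the name column, so the split can be expressed once. The following sketch assumes a hypothetical helper `partition` and a stand-in predicate `mostly_ascii` that mirrors the three-character tolerance used above; the rows are toy data.

```python
def mostly_ascii(text, max_non_ascii=3):
    """Stand-in predicate: at most `max_non_ascii` characters outside the ASCII range."""
    return sum(1 for ch in text if ord(ch) > 127) <= max_non_ascii

def partition(rows, column, predicate):
    """Split rows into (kept, rejected) according to predicate(row[column])."""
    kept, rejected = [], []
    for row in rows:
        if predicate(row[column]):
            kept.append(row)
        else:
            rejected.append(row)
    return kept, rejected

# Toy rows in the iOS layout, where the name sits at index 1.
toy_ios = [
    ['1', 'Instagram'],
    ['2', '爱奇艺PPS -《欢乐颂2》电视剧热播'],
    ['3', 'Instachat 😜'],
]
toy_eng, toy_non_eng = partition(toy_ios, 1, mostly_ascii)
print(len(toy_eng), len(toy_non_eng))
```

Passing the column index as an argument removes the risk of forgetting to change it when switching datasets.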

## Isolating the free apps

Since our revenue depends only on in-app ads, we are only interested in free apps. We now check whether each app is free and set aside all apps that have a price. The price is at index number 7 in the android dataset and at index number 4 in the iOS dataset. The android dataset additionally contains a `Type` column at index number 6 that states whether an app is free or paid.

In the android data, the price column contains a dollar sign in front of the numerical price. We remove it with the line
`price = ''.join(c for c in price if c.isdigit() or c == '.')`.

Here’s what each part of this line is doing:

* `c for c in price`: iterates over each character `c` in the string `price`.
* `if c.isdigit() or c == '.'`: keeps the character `c` only if it is a digit (0-9) or a decimal point (.).
* `''.join(...)`: joins all characters that meet the condition into a new string, with nothing between them.
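
The three parts above can be tried out on a few sample strings. In this sketch, `parse_price` is a hypothetical wrapper around the same expression, and the inputs are made-up examples rather than values taken from the dataset.

```python
def parse_price(price):
    """Strip everything except digits and '.', then convert the rest to a float."""
    cleaned = ''.join(c for c in price if c.isdigit() or c == '.')
    # An input without any digits leaves an empty string, which is treated as 0.
    return float(cleaned) if cleaned else 0.0

for raw in ['$4.99', '0', 'Everyone']:
    print(repr(raw), '->', parse_price(raw))
```

Note that a purely textual value such as `'Everyone'` silently becomes `0.0`, which is one reason the cell below also checks the `Type` column.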

In [15]:
android_free_apps = []
android_no_free_apps = []
for app in android_eng_apps:
    price = app[7] 
    # Remove any non-numeric characters
    price = ''.join(c for c in price if c.isdigit() or c == '.')
    # Convert to float
    price = float(price) if price else 0
    app_type = app[6]
    if price == 0 and app_type == 'Free': 
        android_free_apps.append(app)
    else:
        android_no_free_apps.append(app)
        
print('Identified as free:', len(android_free_apps))
print('\n')
print('Identified as non-free:', len(android_no_free_apps))

Identified as free: 8863


Identified as non-free: 751


We have 8863 cases in `android_free_apps` to work with. (Note that the reference solution finds 8864 free apps. One possible explanation is that our filter also requires the `Type` column to be `Free`; we think our result is accurate and carry on with it.)
In the iOS dataset, we do not need to strip any characters. The price variable can readily be converted into a float.

In [16]:
ios_free_apps = []
ios_no_free_apps = []
for app in ios_eng_apps:
    price = float(app[4]) 
    if price > 0: 
        ios_no_free_apps.append(app)
    else:
        ios_free_apps.append(app)
        
print('Identified as free:', len(ios_free_apps))
print('\n')
print('Identified as non-free:', len(ios_no_free_apps))

Identified as free: 3222


Identified as non-free: 2961


We see that 3222 cases in `ios_free_apps` can be used for our analysis.

## Analysis
## Most common apps by genre

As we mentioned in the introduction, our goal is to determine the kinds of apps that are likely to attract more users, because the number of people using our apps affects our revenue.

To minimize risks and overhead, our validation strategy for an app idea has three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by determining the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our datasets.

The android dataset contains two columns with genre information. The first one is called 'Category' (index number 1), the second one 'Genres' (index number 9, which can list multiple genres).

In the iOS dataset we find the necessary information in the column called 'prime_genre' (index number 11).

**Building a frequency table function**

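We first build a function to create a frequency table and then use another function to display it in descending order. The function `freq_table` takes two inputs: `dataset` (a list of lists) and `index` (an integer). The index indicates the column. It returns a dictionary that maps each value in that column to a tuple of its count and its percentage of all rows.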

In [17]:
def freq_table(dataset, index):
    frequency_table = {}
    total = 0
    
    for row in dataset:
        total += 1
        a_data_point = row[index]
        if a_data_point in frequency_table:
            frequency_table[a_data_point] += 1
        else:
            frequency_table[a_data_point] = 1
            
    for a_data_point in frequency_table:
        count = frequency_table[a_data_point]
        percentage = round((count / total) * 100, 2)
        frequency_table[a_data_point] = (count, percentage)
        
    return frequency_table

print(freq_table(android_free_apps, 1))

{'ART_AND_DESIGN': (57, 0.64), 'AUTO_AND_VEHICLES': (82, 0.93), 'BEAUTY': (53, 0.6), 'BOOKS_AND_REFERENCE': (190, 2.14), 'BUSINESS': (407, 4.59), 'COMICS': (55, 0.62), 'COMMUNICATION': (287, 3.24), 'DATING': (165, 1.86), 'EDUCATION': (103, 1.16), 'ENTERTAINMENT': (85, 0.96), 'EVENTS': (63, 0.71), 'FINANCE': (328, 3.7), 'FOOD_AND_DRINK': (110, 1.24), 'HEALTH_AND_FITNESS': (273, 3.08), 'HOUSE_AND_HOME': (73, 0.82), 'LIBRARIES_AND_DEMO': (83, 0.94), 'LIFESTYLE': (346, 3.9), 'GAME': (862, 9.73), 'FAMILY': (1675, 18.9), 'MEDICAL': (313, 3.53), 'SOCIAL': (236, 2.66), 'SHOPPING': (199, 2.25), 'PHOTOGRAPHY': (261, 2.94), 'SPORTS': (301, 3.4), 'TRAVEL_AND_LOCAL': (207, 2.34), 'TOOLS': (750, 8.46), 'PERSONALIZATION': (294, 3.32), 'PRODUCTIVITY': (345, 3.89), 'PARENTING': (58, 0.65), 'WEATHER': (71, 0.8), 'VIDEO_PLAYERS': (159, 1.79), 'NEWS_AND_MAGAZINES': (248, 2.8), 'MAPS_AND_NAVIGATION': (124, 1.4)}
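
For comparison, the standard library's `collections.Counter` can produce the same counts, and its `most_common()` method already returns them in descending order. The sketch below is an alternative, not the notebook's method; `freq_table_sorted` is a hypothetical name and the rows are toy data.

```python
from collections import Counter

def freq_table_sorted(dataset, index):
    """Return a list of (value, count, percentage) tuples, most frequent first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [
        (value, count, round(count / total * 100, 2))
        for value, count in counts.most_common()
    ]

toy_rows = [
    ['App A', 'GAME'],
    ['App B', 'TOOLS'],
    ['App C', 'GAME'],
    ['App D', 'GAME'],
]
toy_table = freq_table_sorted(toy_rows, 1)
for value, count, pct in toy_table:
    print(value, ':', (count, pct))
```

Returning a sorted list instead of a dictionary means no separate display step is needed to get the descending order.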


As this generally seems to work, we combine this function with the `display_table()` function and display the frequency tables for 'Category' and 'Genres' (android dataset), and 'prime_genre' (iOS dataset).

In [18]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
#Category
print('Category, Android dataset')
display_table(android_free_apps, 1)
print('\n')

#Genres
print('Genres, Android dataset')
display_table(android_free_apps, 9)
print('\n')

#prime_genre
print('prime_genre, iOS dataset')
display_table(ios_free_apps, 11)

Category, Android dataset
FAMILY : (1675, 18.9)
GAME : (862, 9.73)
TOOLS : (750, 8.46)
BUSINESS : (407, 4.59)
LIFESTYLE : (346, 3.9)
PRODUCTIVITY : (345, 3.89)
FINANCE : (328, 3.7)
MEDICAL : (313, 3.53)
SPORTS : (301, 3.4)
PERSONALIZATION : (294, 3.32)
COMMUNICATION : (287, 3.24)
HEALTH_AND_FITNESS : (273, 3.08)
PHOTOGRAPHY : (261, 2.94)
NEWS_AND_MAGAZINES : (248, 2.8)
SOCIAL : (236, 2.66)
TRAVEL_AND_LOCAL : (207, 2.34)
SHOPPING : (199, 2.25)
BOOKS_AND_REFERENCE : (190, 2.14)
DATING : (165, 1.86)
VIDEO_PLAYERS : (159, 1.79)
MAPS_AND_NAVIGATION : (124, 1.4)
FOOD_AND_DRINK : (110, 1.24)
EDUCATION : (103, 1.16)
ENTERTAINMENT : (85, 0.96)
LIBRARIES_AND_DEMO : (83, 0.94)
AUTO_AND_VEHICLES : (82, 0.93)
HOUSE_AND_HOME : (73, 0.82)
WEATHER : (71, 0.8)
EVENTS : (63, 0.71)
PARENTING : (58, 0.65)
ART_AND_DESIGN : (57, 0.64)
COMICS : (55, 0.62)
BEAUTY : (53, 0.6)


Genres, Android dataset
Tools : (749, 8.45)
Entertainment : (538, 6.07)
Education : (474, 5.35)
Business : (407, 4.59)
Productivity : 

Let's start by taking a closer look at the `prime_genre`column in the iOS dataset:

In [19]:
#prime_genre
print('prime_genre, iOS dataset')
display_table(ios_free_apps, 11)

prime_genre, iOS dataset
Games : (1874, 58.16)
Entertainment : (254, 7.88)
Photo & Video : (160, 4.97)
Education : (118, 3.66)
Social Networking : (106, 3.29)
Shopping : (84, 2.61)
Utilities : (81, 2.51)
Sports : (69, 2.14)
Music : (66, 2.05)
Health & Fitness : (65, 2.02)
Productivity : (56, 1.74)
Lifestyle : (51, 1.58)
News : (43, 1.33)
Travel : (40, 1.24)
Finance : (36, 1.12)
Weather : (28, 0.87)
Food & Drink : (26, 0.81)
Reference : (18, 0.56)
Business : (17, 0.53)
Book : (14, 0.43)
Navigation : (6, 0.19)
Medical : (6, 0.19)
Catalogs : (4, 0.12)


We can see that the most common genre is Games, with 58.16 %, followed by Entertainment (7.88 %). That means that approx. 2/3 of the free English apps in our App Store data (66.04 %) are games or entertainment apps. Photo & Video (4.97 %) and Education (3.66 %) come next, but these are already small categories.

We might recommend the development of a Gaming or Entertainment app based on this table. But some caution is needed: While the highest share is games, this alone does not tell us whether these apps also draw the highest share of users.

We now look at the data for the Android (Google Play Store) dataset.

In [20]:
#Category
print('Category, Android dataset')
display_table(android_free_apps, 1)
print('\n')

#Genres
print('Genres, Android dataset')
display_table(android_free_apps, 9)
print('\n')

Category, Android dataset
FAMILY : (1675, 18.9)
GAME : (862, 9.73)
TOOLS : (750, 8.46)
BUSINESS : (407, 4.59)
LIFESTYLE : (346, 3.9)
PRODUCTIVITY : (345, 3.89)
FINANCE : (328, 3.7)
MEDICAL : (313, 3.53)
SPORTS : (301, 3.4)
PERSONALIZATION : (294, 3.32)
COMMUNICATION : (287, 3.24)
HEALTH_AND_FITNESS : (273, 3.08)
PHOTOGRAPHY : (261, 2.94)
NEWS_AND_MAGAZINES : (248, 2.8)
SOCIAL : (236, 2.66)
TRAVEL_AND_LOCAL : (207, 2.34)
SHOPPING : (199, 2.25)
BOOKS_AND_REFERENCE : (190, 2.14)
DATING : (165, 1.86)
VIDEO_PLAYERS : (159, 1.79)
MAPS_AND_NAVIGATION : (124, 1.4)
FOOD_AND_DRINK : (110, 1.24)
EDUCATION : (103, 1.16)
ENTERTAINMENT : (85, 0.96)
LIBRARIES_AND_DEMO : (83, 0.94)
AUTO_AND_VEHICLES : (82, 0.93)
HOUSE_AND_HOME : (73, 0.82)
WEATHER : (71, 0.8)
EVENTS : (63, 0.71)
PARENTING : (58, 0.65)
ART_AND_DESIGN : (57, 0.64)
COMICS : (55, 0.62)
BEAUTY : (53, 0.6)


Genres, Android dataset
Tools : (749, 8.45)
Entertainment : (538, 6.07)
Education : (474, 5.35)
Business : (407, 4.59)
Productivity : 

Here, the pattern is not so clear. The most frequent category is Family (18.9 %), followed by Game (9.73 %), Tools (8.46 %) and Business (4.59 %).
The `Genres` column is not particularly helpful, as we cannot readily identify the larger category that a genre or combination of genres belongs to. However, Tools (8.45 %), Entertainment (6.07 %), and Education (5.35 %) are the most frequent genres in this column.

Comparing the iOS and Android data, games and entertainment, and here predominantly family-friendly entertainment, seem to provide the largest overlap between the platforms. But this is a shaky conclusion, since the two distributions are so different.
A general problem is that, based on these tables, we cannot yet identify the apps that draw the most users on either platform.

## Most popular apps by genre (App Store)

Since the user count is a better metric for us to make a decision, we will try to determine the kind of apps with the most users.

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column (index number 5).

We start with calculating the average number of user ratings per app genre on the App Store. 

In [29]:
genres_ios = freq_table(ios_free_apps, 11)
for genre in genres_ios:
    total = 0 # Storing sum of ratings
    len_genre = 0 # storing number of apps per genre
    for app in ios_free_apps:
        genre_app = app[11]
        if genre_app == genre:
            n_ratings = float(app[5])
            total = total + n_ratings
            len_genre += 1
    avg_n_ratings = round(total / len_genre, 2)
    print(genre, avg_n_ratings)
     

Social Networking 71548.35
Photo & Video 28441.54
Games 22788.67
Music 57326.53
Reference 74942.11
Health & Fitness 23298.02
Weather 52279.89
Utilities 18684.46
Travel 28243.8
Shopping 26919.69
News 21248.02
Navigation 86090.33
Lifestyle 16485.76
Entertainment 14029.83
Food & Drink 33333.92
Sports 23008.9
Book 39758.5
Finance 31467.94
Education 7003.98
Productivity 21028.41
Business 7491.12
Catalogs 4004.0
Medical 612.0


Looking at the average number of ratings per app by genre, we see that the highest average is in the genre "Navigation", with ca. 86090 user ratings per app. In second place is "Reference", with on average 74942 user ratings per app. In this genre we find dictionaries, but also Bible editions: anything that one might use to look things up. After that, with ca. 71548 user ratings per app, come Social Networking apps such as Facebook or Pinterest. We can print the apps in these three genres with their number of ratings, just to get an idea.

In [32]:
print('Navigation')
for app in ios_free_apps:
    if app[11] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings
print('\n')

print('Reference')
for app in ios_free_apps:
    if app[11] == 'Reference':
        print(app[1], ':', app[5]) # print name and number of ratings
print('\n')

print('Social Networking')
for app in ios_free_apps:
    if app[11] == 'Social Networking':
        print(app[1], ':', app[5]) # print name and number of ratings       

Navigation
Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


Reference
Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition 

The results show that in each of these genres a few apps at the top draw a huge number of users, while many apps have only few. Waze and Google Maps dominate the Navigation genre, the Bible and Dictionary.com dominate the Reference genre, and Facebook and Pinterest dominate the Social Networking genre.
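
How strongly a few apps drive the average can be checked by comparing the mean with the median. The sketch below uses the six Navigation rating counts printed above and the standard library's `statistics` module; the variable names are made up for illustration.

```python
from statistics import mean, median

# Rating counts of the six free Navigation apps listed in the output above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

nav_mean = round(mean(navigation_ratings), 2)
nav_median = median(navigation_ratings)

# Share of all Navigation ratings that belongs to the two largest apps.
top_two_share = round(sum(navigation_ratings[:2]) / sum(navigation_ratings) * 100, 1)

print('Mean:', nav_mean)
print('Median:', nav_median)
print('Share of top two apps (%):', top_two_share)
```

The mean reproduces the genre average reported earlier, while the median is roughly ten times smaller, because Waze and Google Maps account for nearly all ratings in this genre.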

What evidence do we have, then? We saw that by far the largest number of apps in the App Store data are Games and Entertainment (ca. 2/3 of all apps), but when it comes to the number of users, apps in the genres Navigation, Reference and Social Networking draw the most users on average. These genres represent only a small fraction of all apps, yet they dominate the user counts.

It would probably be difficult to break into the Navigation sector, since GPS data might cost too much. Building yet another social networking app will probably not work well either, given that this market might already be saturated. However, we might be able to enter the Reference genre, since there are plenty of free reference books out there with no licensing fees, and we can combine features from this genre (easy searching, word lookup, good readability, audio features, etc.) to make a compelling app that is not a game.


## Most Popular Apps by Genre on Google Play

We now turn to the most popular apps on Google Play.
Here we have information about the number of installs for each app. It's stored in index number 5, and we can look at the distribution of this variable using the `display_table()` function.
The numbers are not very precise, as only broad categories are reported. But we might be able to work with that.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on. 

In [33]:
display_table(android_free_apps, 5)

1,000,000+ : (1394, 15.73)
100,000+ : (1024, 11.55)
10,000,000+ : (935, 10.55)
10,000+ : (904, 10.2)
1,000+ : (744, 8.39)
100+ : (613, 6.92)
5,000,000+ : (605, 6.83)
500,000+ : (493, 5.56)
50,000+ : (423, 4.77)
5,000+ : (400, 4.51)
10+ : (314, 3.54)
500+ : (288, 3.25)
50,000,000+ : (204, 2.3)
100,000,000+ : (189, 2.13)
50+ : (170, 1.92)
5+ : (70, 0.79)
1+ : (45, 0.51)
500,000,000+ : (24, 0.27)
1,000,000,000+ : (20, 0.23)
0+ : (4, 0.05)


In order to use these values for our calculations, we remove the `+` and `,` characters from the strings and then convert them to floats. We use the `str.replace(old, new)` method for this. We use the `Category` column (index number 1) in the android dataset to determine the average number of installs per app in each category.

In [34]:
genres_android = freq_table(android_free_apps, 1)
for category in genres_android:
    total = 0 # stores sum of installs in a genre
    len_category = 0 # stores number of apps in a genre
    for app in android_free_apps:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '') # remove the '+' characters
            n_installs = n_installs.replace(',', '') # remove the ',' separators
            n_installs = float(n_installs) # convert to float
            total = total + n_installs
            len_category += 1
    avg_n_installs = round(total / len_category, 2)
    print(category, avg_n_installs)
          

ART_AND_DESIGN 1986335.09
AUTO_AND_VEHICLES 647317.82
BEAUTY 513151.89
BOOKS_AND_REFERENCE 8767811.89
BUSINESS 1712290.15
COMICS 817657.27
COMMUNICATION 38456119.17
DATING 854028.83
EDUCATION 1833495.15
ENTERTAINMENT 11640705.88
EVENTS 253542.22
FINANCE 1387692.48
FOOD_AND_DRINK 1924897.74
HEALTH_AND_FITNESS 4188821.99
HOUSE_AND_HOME 1331540.56
LIBRARIES_AND_DEMO 638503.73
LIFESTYLE 1437816.27
GAME 15588015.6
FAMILY 3697848.17
MEDICAL 120550.62
SOCIAL 23253652.13
SHOPPING 7036877.31
PHOTOGRAPHY 17840110.4
SPORTS 3638640.14
TRAVEL_AND_LOCAL 13984077.71
TOOLS 10801391.3
PERSONALIZATION 5201482.61
PRODUCTIVITY 16787331.34
PARENTING 542603.62
WEATHER 5074486.2
VIDEO_PLAYERS 24727872.45
NEWS_AND_MAGAZINES 9549178.47
MAPS_AND_NAVIGATION 4056941.77
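The averages above are printed in no particular order, which makes the ranking hard to read. A minimal sketch of an alternative follows: it accumulates totals and counts in a single pass over the apps and then sorts the result. The rows in `sample_apps` are made up for illustration, and `average_installs_by_category` is a hypothetical helper, not the notebook's own method.

```python
# Hypothetical sample rows in the same layout as android_free_apps:
# index 1 holds the category and index 5 holds the install label.
sample_apps = [
    ['App A', 'COMMUNICATION', '4.0', '10', '1M', '1,000,000+'],
    ['App B', 'COMMUNICATION', '4.2', '20', '2M', '5,000,000+'],
    ['App C', 'MEDICAL', '3.9', '5', '3M', '100,000+'],
    ['App D', 'SOCIAL', '4.5', '50', '4M', '10,000,000+'],
]

def average_installs_by_category(apps, category_index=1, installs_index=5):
    """Return {category: average installs}, computed in one pass over apps."""
    totals = {}  # sum of installs per category
    counts = {}  # number of apps per category
    for app in apps:
        category = app[category_index]
        installs = float(app[installs_index].replace('+', '').replace(',', ''))
        totals[category] = totals.get(category, 0) + installs
        counts[category] = counts.get(category, 0) + 1
    return {c: round(totals[c] / counts[c], 2) for c in totals}

averages = average_installs_by_category(sample_apps)

# Sort categories from highest to lowest average installs.
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for category, avg in ranked:
    print(category, ':', avg)
```

On the real data, passing `android_free_apps` instead of `sample_apps` should give the same averages as the nested loop, only sorted.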


Looking at the average number of installs per app by category, COMMUNICATION comes first with ca. 38.5 million installs per app. VIDEO_PLAYERS is second with ca. 24.7 million installs per app, followed by SOCIAL (social network apps like Facebook or Instagram) with ca. 23.3 million. After that come several categories with relatively similar averages: Photography, Productivity, Games, Travel and Local, and Entertainment. To get an idea of what drives these numbers, we can print the apps in the top three categories, along with BOOKS_AND_REFERENCE.

In [37]:
for category in ['COMMUNICATION', 'VIDEO_PLAYERS', 'SOCIAL', 'BOOKS_AND_REFERENCE']:
    print(category.replace('_', ' ')) # section label, e.g. 'VIDEO PLAYERS'
    for app in android_free_apps:
        if app[1] == category:
            print(app[0], ':', app[5]) # print name and number of installs
    print('\n')


COMMUNICATION
WhatsApp Messenger : 1,000,000,000+
Messenger for SMS : 10,000,000+
My Tele2 : 5,000,000+
imo beta free calls and text : 100,000,000+
Contacts : 50,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Browser 4G : 10,000,000+
MegaFon Dashboard : 10,000,000+
ZenUI Dialer & Contacts : 10,000,000+
Cricket Visual Voicemail : 10,000,000+
TracFone My Account : 1,000,000+
Xperia Link™ : 10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard : 10,000,000+
Skype Lite - Free Video Call & Chat : 5,000,000+
My magenta : 1,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Seznam.cz : 1,000,000+
Antillean Gold Telegram (original version) : 100,000+
AT&T Visual Voicemail : 10,000,000+
GMX Mail : 10,000,000+
Omlet Chat : 10,000,000+
My Vodacom SA : 5,000,000+
Microsoft Edge : 5,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Calls & Text by Mo+ : 5,

Here too, each category contains a handful of hugely popular apps that pull the average up. In COMMUNICATION, WhatsApp Messenger has more than 1 billion installs, as does Messenger – Text and Video Chat for Free (presumably Facebook Messenger), and web browsers are listed here as well. VIDEO_PLAYERS is dominated by the YouTube app (1,000,000,000+ installs). In SOCIAL, apps like Facebook, Facebook Lite, Tumblr, and Google+ are installed on many Android phones. Because Reference turned out to be a popular genre in the App Store, we also printed BOOKS_AND_REFERENCE. This category is led by the Google Play Books app, but it also contains many widely installed apps for free ebooks and dictionaries, plus quite a few Quran and Bible apps.
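One way to gauge how strongly a few giants pull an average is to compare the mean with the median, or to recompute the mean without the largest apps. The sketch below uses made-up install counts and an arbitrary 100 million cutoff. Both are assumptions for illustration, not values taken from the dataset.

```python
from statistics import mean, median

# Hypothetical install counts for one category:
# one giant app and several mid-sized ones.
installs = [1_000_000_000, 10_000_000, 5_000_000, 5_000_000, 1_000_000]

print('mean  :', mean(installs))
print('median:', median(installs))

# Mean after leaving out apps at or above an arbitrary 100 million cutoff.
cutoff = 100_000_000
without_giants = [n for n in installs if n < cutoff]
print('mean without giants:', mean(without_giants))
```

In this toy example the single giant lifts the mean to roughly 40 times the median, which is the kind of distortion described above.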

What do we make of this? We want an app profile that could be popular in both the App Store and Google Play. We can't realistically enter the market with yet another messenger or social networking app, and the video category is dominated by YouTube and similar applications. BOOKS_AND_REFERENCE is not among the most popular Android categories by average installs (about 8.8 million per app), but that is still quite substantial. Since the Reference genre also does well in the App Store, an app in the books and reference space looks like the best candidate for succeeding in both stores.

The book and reference genre includes a variety of apps: software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, etc. It seems there's still a small number of extremely popular apps that skew the average:

In [38]:
for app in android_free_apps:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


However, it looks like there are only a few very popular apps, so this market still shows potential. Let's try to get some app ideas based on the kind of apps that are somewhere in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads):

In [39]:
for app in android_free_apps:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H
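The filter above matches install labels one by one. A possible alternative, sketched here under assumptions, converts each label to a number and checks it against a range. The rows in `sample_apps` and the helper `installs_in_range` are hypothetical.

```python
# Hypothetical sample rows: (name, category, install label).
sample_apps = [
    ('Reader X', 'BOOKS_AND_REFERENCE', '10,000,000+'),
    ('Giant Books', 'BOOKS_AND_REFERENCE', '1,000,000,000+'),
    ('Tiny Dictionary', 'BOOKS_AND_REFERENCE', '50,000+'),
    ('Chat Y', 'COMMUNICATION', '5,000,000+'),
]

def installs_in_range(apps, category, low, high):
    """Return (name, label) pairs for apps in category with low <= installs < high."""
    matches = []
    for name, app_category, label in apps:
        n_installs = float(label.replace('+', '').replace(',', ''))
        if app_category == category and low <= n_installs < high:
            matches.append((name, label))
    return matches

# Mid-popularity apps: at least 1 million but fewer than 100 million installs.
mid_tier = installs_in_range(sample_apps, 'BOOKS_AND_REFERENCE',
                             1_000_000, 100_000_000)
for name, label in mid_tier:
    print(name, ':', label)
```

A numeric range avoids listing every label by hand and keeps working if a label outside the expected set appears in the data.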

This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.

We also notice there are quite a few apps built around the Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent one) and turning it into an app could be profitable for both the Google Play and the App Store markets.

However, it looks like the market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

## Conclusions

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.