# Profitable App Profiles for the App Store and Google Play Markets

**Our goal:** Identify the characteristics of mobile apps that are likely to be profitable in the App Store and Google Play.

**Our role:** We serve as data analysts for a company that develops mobile apps for Android and iOS. Our task is to assist the development team in making data-driven decisions about the types of apps they should build.

**Monetization model:** Our company's apps are free to download and install. The primary source of revenue is in-app advertising. Therefore, the revenue from a particular app is directly dependent on the number of users.

**Project goal:** Analyze data to help developers understand what types of apps are likely to attract more users.

# Opening and Exploring the Data

In September 2018, the App Store boasted around 2 million iOS applications, while Google Play offered 2.1 million Android apps.
![Image](https://s3.amazonaws.com/dq-content/350/py1m8_statista.png) Source: [Statista](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/)

Instead of gathering information for millions of apps, which would be expensive and time-consuming, let's look for existing data that might be helpful. Thankfully, we've found two relevant datasets:

* [A dataset](https://www.kaggle.com/lava18/google-play-store-apps) with information on roughly ten thousand Android apps from Google Play
* [A dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) with data on about seven thousand iOS apps from the App Store.

Before diving into collecting our own data, let's explore these existing resources to see if they meet our needs.

In [7]:
from csv import reader

### The App Store dataset ###
opened_ios = open('AppleStore.csv', encoding='utf8')
read_ios = reader(opened_ios)
ios = list(read_ios)
ios_header = ios[0]
ios_data = ios[1:]

### The Google Play dataset ###
opened_android = open('googleplaystore.csv', encoding='utf8')
read_android = reader(opened_android)
android = list(read_android)
android_header = android[0]
android_data = android[1:]
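
The cells above leave both files open after reading them. As a minimal sketch, the same loading step could be wrapped in a helper that closes the file automatically. The name `load_csv()` is hypothetical and not part of the original notebook.

```python
import csv

def load_csv(path):
    """Return (header, rows) for a CSV file and close the file when done.

    Hypothetical helper: assumes the file is UTF-8 encoded and that its
    first line is a header row.
    """
    # newline='' is what the csv module recommends when opening files
    with open(path, encoding='utf8', newline='') as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]
```

With this helper, `ios_header, ios_data = load_csv('AppleStore.csv')` would produce the same two variables as the cell above.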

To simplify our analysis of these datasets, we'll create a reusable function called `explore_data()`. This function will display data rows in a clear format, making it easier to understand. We can even design it to show the number of rows and columns within any dataset we pass through it.

In [9]:
def explore_data(dataset, start, end, header = False):
    data_slice = dataset[start:end]
    for row in data_slice:
        print(row)
        print('\n')
    
    if header:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(ios_header)
print('\n')
explore_data(ios_data, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


This dataset contains information on 7197 iOS apps. The following columns are of particular interest:
`track_name`, `currency`, `price`, `rating_count_tot`, `rating_count_ver`, and `prime_genre`.
While not all column names are immediately clear, the dataset [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) provides detailed information about each one.

Now let's take a look at the Google Play dataset.

In [11]:
print(android_header)
print('\n')
explore_data(android_data, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


The Google Play dataset holds information for 10,841 apps across 13 columns. Upon initial inspection, some columns seem particularly relevant for our analysis, including `App`, `Category`, `Reviews`, `Installs`, `Type`, `Price`, and `Genres`.

# Deleting Wrong Data
The Google Play data set has a [discussion section](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015) where users reported an error in row 10472. To verify this, we can print row 10472 and compare it with the data set header and with a valid row.

In [14]:
print(android_data[10472])  # incorrect row
print('\n')
print(android_header)  # header
print('\n')
print(android_data[3])      # correct row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Row 10472 corresponds to the **Life Made WI-Fi Touchscreen Photo Frame** app. It holds only 12 values instead of 13, so every value after the app name sits one column too far to the left. As a result, the `Rating` field contains 19, which is invalid because Google Play ratings only go up to 5. The discussion thread suggests the shift is caused by a missing value in the `Category` column. We'll delete this row to keep the data consistent.
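
A shifted row like this one can also be found without relying on user reports. The sketch below compares each row's length with the header's length. The function name `find_malformed_rows()` and the sample data are made up for illustration.

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose field count differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]

# Tiny made-up sample: the second row is missing its Category value
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5'],
    ['Photo Frame', '1.9'],
    ['Coloring book moana', 'ART_AND_DESIGN', '3.9'],
]

print(find_malformed_rows(sample_header, sample_rows))  # [1]
```

Called as `find_malformed_rows(android_header, android_data)`, this check would flag any row with a missing or extra field, assuming the data was loaded as in the cells above.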

In [16]:
print(len(android_data))
del android_data[10472]  # don't run this more than once
print(len(android_data))

10841
10840


# Removing Duplicate Entries

## Part One

If we explore the Google Play data set long enough, we'll find that some apps have more than one entry. For instance, the application Instagram has four entries:

In [19]:
for app in android_data:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In total, there are 1,181 cases where an app occurs more than once:

In [21]:
unique_apps = []
duplicate_apps = []

for app in android_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
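
The membership test `name in unique_apps` scans a list on every iteration, which gets slow as the list grows. As an alternative sketch, `collections.Counter` can produce the same duplicate count in one pass. The function name `count_duplicates()` and the sample rows are illustrative assumptions.

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Count the extra entries, i.e. every occurrence of a name after its first."""
    counts = Counter(row[name_index] for row in rows)
    return sum(n - 1 for n in counts.values())

# Made-up sample: 'Box' appears three times, so it contributes two duplicates
sample = [['Box'], ['Slack'], ['Box'], ['Box'], ['Zenefits']]
print(count_duplicates(sample))  # 2
```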


To ensure accurate analysis, we need to remove duplicate app entries and keep only one record per app. While randomly removing duplicates is an option, a more effective approach exists.

As observed in the previously printed rows for the Instagram app, the key difference lies in the fourth position (number of reviews). These variations indicate data collection at different times. We can leverage this insight to establish a retention criterion.

Instead of random removal, we'll prioritize rows with the highest number of reviews. This prioritizes data with potentially greater reliability due to the larger review pool.

Our strategy involves:

* Creating a dictionary: Each key represents a unique app name, and the value stores the highest number of reviews for that app.
* Building a new data set: Using the dictionary, we'll create a refined data set with only one entry per app, containing entries with the highest number of reviews.

## Part Two
Let's start by building the dictionary.

In [24]:
reviews_max = {}

for app in android_data:
    name = app[0]
    n_reviews = float(app[3])

    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        

Previously, we identified 1,181 instances where an app appeared multiple times in the data. Therefore, the length of our unique app dictionary should be the total length of the data set minus 1,181.

In [26]:
print('Expected length:', len(android_data) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


Now, we'll utilize the `reviews_max` dictionary to filter out duplicate app entries. We'll retain only entries with the highest number of reviews for each duplicate app.

The code below outlines this process:

* We create two empty lists, `android_clean` and `already_added`.
* We loop through each entry in the `android_data` data set. For each iteration:
    + we extract the app name and its corresponding number of reviews.
    + we add the current row (`app`) to the `android_clean` list and its `name` to the `already_added` list under two conditions:
        - The current app's review count must match the highest review count for that app stored in the `reviews_max` dictionary; and
        - The app name must not already be present in the `already_added` list. This additional check handles cases where multiple entries for an app have the same highest review count. Relying solely on `reviews_max[name] == n_reviews` would introduce duplicate entries in such scenarios (e.g., the "Box" app with three entries with the same highest review count).

In [28]:
android_clean = []
already_added = []

for app in android_data:
    name = app[0]
    n_reviews = float(app[3])

    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

Let's examine the new data set and confirm that it contains 9,659 rows, which would mean the duplicate entries were removed successfully.

In [30]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


We have 9659 rows, just as expected.
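
The two-step approach above can also be written as a single pass that stores the best row per app directly. This is a sketch under assumptions, not the notebook's method: the name `dedupe_by_max_reviews()` is hypothetical, and the result is ordered by each app's first appearance, which may differ from the order of `android_clean`.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means ties keep the row that was seen first
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Shortened sample rows; 'Example App' and its values are made up
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Example App', 'TOOLS', '4.0', '120'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Example App', 'TOOLS', '4.0', '120'],
]

for row in dedupe_by_max_reviews(sample):
    print(row)
```

Because dictionary lookups do not scan a list, this version avoids the repeated `name not in already_added` checks.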

# Removing Non-English Apps
## Part One
As you delve deeper into the datasets, you might encounter app names that suggest they are not targeted towards an English-speaking audience. Here are a few examples from both datasets for illustration:

In [33]:
print(ios_data[813][1])
print(ios_data[6731][1])

print(android_clean[4412][0])
print(android_clean[7940][0])

Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠
„ÄêËÑ±Âá∫„Ç≤„Éº„É†„ÄëÁµ∂ÂØæ„Å´ÊúÄÂæå„Åæ„Åß„Éó„É¨„Ç§„Åó„Å™„ÅÑ„Åß „ÄúË¨éËß£„ÅçÔºÜ„Éñ„É≠„ÉÉ„ÇØ„Éë„Ç∫„É´„Äú
‰∏≠ÂõΩË™û AQ„É™„Çπ„Éã„É≥„Ç∞
ŸÑÿπÿ®ÿ© ÿ™ŸÇÿØÿ± ÿ™ÿ±ÿ®ÿ≠ DZ


Our analysis focuses on apps relevant to an English-speaking audience. Therefore, we'll remove apps whose names contain characters uncommon in English text. Typically, English text includes:

* Letters from the English alphabet
* Numbers (0-9)
* Punctuation marks (., !, ?, ;, etc.)
* Common symbols (+, *, /, etc.)

These characters used in English text are encoded using the ASCII standard. Each ASCII character has a corresponding number between 0 and 127. We can leverage this to create a function that checks an app name for non-ASCII characters.

The code below defines such a function. It utilizes the built-in `ord()` function to determine the encoding number for each character in the app name.

In [35]:
def is_english(string):
    
    for character in string:
        if ord(character) > 127:
            return False
    return True

print(is_english('Instagram'))
print(is_english('Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠'))

True
False


The function works, but it is too strict. Some English app names contain emojis or symbols such as ™, — (em dash), and – (en dash) that fall outside the ASCII range. Using the function as it stands would therefore remove relevant English apps.

In [37]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

print(ord('™'))
print(ord('😜'))

False
False
8482
128540


## Part Two
To minimize the impact of data loss, we'll only remove an app if its name has more than three non-ASCII characters:

In [39]:
def is_english(string):
    no_ascii = 0

    for character in string:
        if ord(character) > 127:
            no_ascii += 1
    if no_ascii > 3:
        return False
    else:
        return True

print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
True


The `is_english()` function, though not flawless, should effectively filter out most non-English apps while potentially missing a small number. This level of accuracy suffices for our present analysis; further refinement can be addressed later.

We'll now utilize the `is_english()` function to remove non-English apps from both datasets.

In [41]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)

for app in ios_data:
    name = app[1]
    if is_english(name):
        ios_english.append(app)

explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+'

We can see that we're left with 9614 Android apps and 6183 iOS apps.

# Isolating the Free Apps
As stated earlier, we exclusively develop free-to-download and install applications. Our primary revenue stream comes from in-app advertisements. Our datasets contain both free and paid apps, and for our analysis, we'll need to isolate only the free ones. Below, we've extracted the free apps from both of our datasets.

In [44]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)

for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)

print(len(android_final))
print(len(ios_final))

8864
3222


This process has resulted in a dataset of 8864 Android apps and 3222 iOS apps, which should be sufficient for our analysis.
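
The filter above compares price strings literally (`'0'` for Google Play, `'0.0'` for the App Store). A numeric comparison handles both spellings with one function. The sketch below assumes paid Google Play prices may carry a leading dollar sign (for example `'$4.99'`); that format is an assumption and not shown in the output above. The name `is_free()` is hypothetical.

```python
def is_free(price_text):
    """Return True if the price string represents zero.

    Assumes prices look like '0', '0.0', or '$4.99'. Anything that cannot
    be parsed as a number is treated as not free.
    """
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```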

# Most Common Apps by Genre
## Part One
As mentioned in the introduction, the objective of this study is to identify the types of mobile applications that are likely to attract the largest number of users, as their user base directly impacts our revenue.

To minimize development risks and expenses, we have developed a three-step app idea validation strategy:
* Create a Minimum Viable Product (MVP) Android version of the app and publish it on Google Play.
* Analyze user feedback: if the app receives positive feedback, it will be further developed and enhanced.
* If the app becomes profitable within six months, an iOS version of the app will be developed and published on the App Store.

Our goal is to develop applications that can achieve success on both the App Store and Google Play. Examples of such applications include productivity apps that incorporate gamification elements.

To begin our analysis, it is necessary to identify the most common app genres on each platform. For this purpose, frequency tables will be created for the following data set columns: `prime_genre` (App Store), `Genres` and `Category` (Google Play).

Analyzing app genres will provide us with insights into user preferences on different platforms, which will serve as the foundation for further app idea development and validation.

## Part Two
To facilitate the analysis of the frequency tables, two functions will be developed:
* Percentage Frequency Table Generation Function: This function will automate the generation of frequency tables that present data points as percentages.
* Descending Order Sorting Function: This function will sort the percentage values within the frequency tables in descending order, allowing for easier identification of the most prevalent genres.

In [48]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1

    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage

    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
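
As an alternative sketch, the standard library's `collections.Counter` can build a sorted percentage table in a few lines. The name `freq_table_pct()` and the sample rows are illustrative assumptions. The result relies on dictionaries preserving insertion order (Python 3.7 and later).

```python
from collections import Counter

def freq_table_pct(rows, index):
    """Return {value: percentage}, ordered from most to least frequent."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs sorted by count, descending
    return {value: count / total * 100 for value, count in counts.most_common()}

# Made-up sample with the genre in column 1
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for genre, pct in freq_table_pct(sample, 1).items():
    print(genre, ':', pct)
```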

## Part Three
Our analysis commences by examining the frequency distribution of genres within the `prime_genre` column of the App Store data set.

In [50]:
display_table(ios_final, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


The frequency distribution of the `prime_genre` column shows that games are by far the largest category, making up over half (58.16%) of the free English apps. Entertainment apps come a distant second at nearly 8%, followed by photo and video apps at approximately 5%. Education and social networking apps account for 3.66% and 3.29% of the dataset, respectively.

This initial analysis suggests a potential bias towards entertainment-oriented applications within the App Store, particularly among free English apps. Games, entertainment, photo and video, social networking, sports, and music applications collectively represent a significant portion of the offerings. Conversely, applications designed for practical purposes, such as education, shopping, utilities, productivity, and lifestyle, appear less prevalent.

However, it is crucial to acknowledge that the sheer volume of a particular genre does not necessarily equate to a corresponding level of user engagement. Popularity, as measured by quantity, may not directly translate to user base size.

We shall now proceed by examining the interrelated `Genres` and `Category` columns within the Google Play data set for further insights.

In [52]:
display_table(android_final, 1) # Category

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

The landscape of app genres on Google Play appears demonstrably distinct from that of the App Store. Unlike the App Store's emphasis on entertainment, Google Play exhibits a seemingly lower concentration of applications designed primarily for recreational purposes. Conversely, a notable presence is observed for applications serving practical needs, including those categorized as family, tools, business, lifestyle, and productivity.

However, a closer examination of the `FAMILY` category, which constitutes nearly 19% of the offerings on Google Play, reveals that a significant portion of it consists of games for children. This suggests that the seemingly practical focus of Google Play's distribution is partially skewed by child-oriented entertainment filed under the family category.

![Image](https://camo.githubusercontent.com/820d6ec2d9a7187d65bd7ed393b39aaeba0dd3fa97862963e5a502bb8da4fc72/68747470733a2f2f73332e616d617a6f6e6177732e636f6d2f64712d636f6e74656e742f3335302f7079316d385f66616d696c792e706e67)

The frequency table for the `Genres` column, shown below, supports the same observation: compared with the App Store, Google Play's genre distribution leans more towards apps designed for practical use.

In [54]:
display_table(android_final, -4)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

While a subtle distinction exists between the `Genres` and `Category` columns within the Google Play data set, a key observation is the increased level of granularity offered by the `Genres` column due to its wider range of categories. For the purposes of this initial exploration aimed at identifying broader trends, we will focus our analysis on the `Category` column.

Our preliminary examination has revealed a potential bias towards entertainment-focused applications within the App Store, particularly among free English apps. Conversely, Google Play appears to exhibit a more balanced distribution, encompassing both applications designed for practical use cases and those intended for recreational purposes.

Moving forward, the focus of our investigation will shift towards gaining insights into the app genres that tend to attract the largest user bases on each platform.

# Most Popular Apps by Genre on the App Store
To determine the app genres with the highest user engagement, we will employ a metric indicative of user interest. Ideally, data on the average number of `Installs` per genre would be utilized for both the App Store and Google Play datasets. Unfortunately, the App Store data lacks this specific information.

As an alternative, we will use the number of user ratings, stored in the App Store's `rating_count_tot` field, as a proxy for the number of users. This approach assumes that apps with more ratings also have more users.

The following section will present the calculated average number of user ratings per app genre on the App Store:

In [57]:
genres_ios = freq_table(ios_final, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ":", avg_n_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


The App Store data suggests that navigation apps have the highest average number of user ratings, but this figure deserves a closer look. Most of it comes from Waze and Google Maps, which together have nearly half a million user ratings.

In [59]:
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
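
To make the skew concrete, the sketch below recomputes the navigation average from the six rating counts printed above and measures the share contributed by the top two apps. The variable names are illustrative.

```python
# Rating counts copied from the output above
navigation_ratings = {
    'Waze - GPS Navigation, Maps & Real-time Traffic': 345046,
    'Google Maps - Navigation & Transit': 154911,
    'Geocaching®': 12811,
    'CoPilot GPS – Car Navigation & Offline Maps': 3582,
    'ImmobilienScout24: Real Estate Search in Germany': 187,
    'Railway Route Search': 5,
}

total = sum(navigation_ratings.values())
top_two = sorted(navigation_ratings.values(), reverse=True)[:2]

print('Total ratings:', total)                                # 516542
print('Average per app:', total / len(navigation_ratings))
print('Share held by the top two apps:', sum(top_two) / total)
```

The top two apps hold more than 96% of all navigation ratings, so the genre average says little about a typical navigation app.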


The average number of user ratings per genre is affected by outliers. Genres like navigation, social networking, and music have averages inflated by a few dominant apps. For instance, Facebook, Pinterest, Skype, Pandora, Spotify, and Shazam likely contribute heavily to the high averages in their respective genres.

This phenomenon skews the data, potentially misrepresenting the true popularity of these genres beyond the established giants. While a more granular analysis could involve removing these outliers and recalculating averages, such a deep dive is beyond the scope of this current exploration.
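
Short of removing outliers, one lightweight check is to compare the mean with the median, which is far less sensitive to a few very large values. The sketch below is not part of the original analysis. The name `median_ratings_by_genre()` is hypothetical, and the sample uses the navigation counts printed earlier in a simplified two-column layout.

```python
from statistics import median

def median_ratings_by_genre(rows, genre_index=-5, ratings_index=5):
    """Return {genre: median number of ratings} for the given rows."""
    by_genre = {}
    for row in rows:
        by_genre.setdefault(row[genre_index], []).append(float(row[ratings_index]))
    return {genre: median(values) for genre, values in by_genre.items()}

# Simplified rows: genre in column 0, rating count in column 1
sample = [
    ['Navigation', '345046'],
    ['Navigation', '154911'],
    ['Navigation', '12811'],
    ['Navigation', '3582'],
    ['Navigation', '187'],
    ['Navigation', '5'],
]

print(median_ratings_by_genre(sample, genre_index=0, ratings_index=1))
# {'Navigation': 8196.5}
```

A median of 8,196.5 against a mean of about 86,090 shows how strongly two apps pull the navigation average upward. The default indexes match the `ios_final` layout used in this notebook.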

Furthermore, the "reference" genre, with an average of 74,942 user ratings, appears to be skewed by the Bible and Dictionary.com apps, which have exceptionally high numbers of user ratings.

In [61]:
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


While the initial analysis of App Store user ratings identified potential outliers within certain genres, it also revealed a niche with intriguing possibilities. The "reference" genre, skewed by applications like the Bible and Dictionary.com, suggests a potential market for well-crafted educational or reference apps.

One possible approach could involve leveraging a popular book and transforming it into a comprehensive app. This app could extend beyond the raw text of the book by incorporating features such as daily quotes, an audiobook version, quizzes for knowledge assessment, and an integrated dictionary to eliminate the need for external app switching.

This concept aligns with the observed dominance of entertainment-focused apps on the App Store. The potential market saturation within this category suggests that a practical app might stand out more effectively amidst the vast number of offerings.

Further exploration of potentially popular genres on the App Store revealed weather, book, food and drink, and finance as potential contenders. However, a closer examination of these options yielded less promising results:

* Weather Apps: User engagement with weather apps is typically brief, limiting the potential for in-app advertising revenue. Additionally, acquiring reliable real-time weather data often necessitates integration with non-free APIs.
* Food and Drink Apps: Popular examples within this genre, such as Starbucks, Dunkin' Donuts, and McDonald's, often rely heavily on physical services like cooking and delivery infrastructure, which fall outside our company's core competencies.
* Finance Apps: Finance apps typically involve functions like banking, bill payments, and money transfers. Building such apps necessitates domain expertise, and recruiting a finance specialist solely for app development is not a strategic fit for our company.

Given these considerations, we shall now shift our focus towards analyzing the Google Play market to identify potential opportunities within that platform.

# Most Popular Apps by Genre on Google Play
In contrast to the App Store data, the Google Play data set offers the advantage of including install numbers for each app. This seemingly more precise metric should facilitate a clearer understanding of genre popularity on the Google Play platform.

However, a closer inspection reveals a potential limitation with the install data. A significant portion of the values appear to be presented in open-ended ranges, such as "100+," "1,000+," and "5,000+." While this approach offers a general sense of install volume, it introduces ambiguity when attempting to pinpoint exact install counts.

In [64]:
display_table(android_final, 5) # the Installs column

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


One problem with this data is that it is not precise. For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes: we only want to get an idea of which app genres attract the most users, and we don't need perfect precision with respect to the number of users.

Therefore, we will leave the install numbers as they are. This means that an app labeled "100,000+" will be treated as having 100,000 installs, an app labeled "1,000,000+" as having 1,000,000 installs, and so on.

It is important to note that calculations will necessitate transforming each install value into `float`. This conversion process requires the removal of commas and plus signs to avoid errors. We will accomplish this task directly within the loop below, where the average number of installs for each genre (category) will also be computed.
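
Before the full loop, the string cleanup on its own can be sketched as a small helper. The name `parse_installs` is hypothetical (it does not appear elsewhere in this notebook); it simply applies the two `replace` calls and the `float` conversion described above.

```python
def parse_installs(raw):
    """Convert an Installs label such as '1,000,000+' to a float.

    The result is the lower bound of the open-ended range, which is
    the simplification adopted in this analysis.
    """
    cleaned = raw.replace(',', '').replace('+', '')
    return float(cleaned)


# A few labels taken from the Installs frequency table
for label in ['0', '100+', '1,000,000+']:
    print(label, '->', parse_installs(label))
```

Calling `float` directly on a label like `'1,000,000+'` would raise a `ValueError`, which is why the characters are stripped first.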

In [93]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0          # sum of installs across apps in this category
    len_category = 0   # number of apps in this category
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            # Strip ',' and '+' so that e.g. '1,000,000+' becomes '1000000'
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    # Average installs per app for this category
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

The initial exploration of average install numbers per genre within the Google Play data reveals communication apps as the apparent leaders, boasting an average of 38,456,119 installs. However, it is crucial to acknowledge that this figure is likely influenced by a handful of outliers. Prominent applications within this genre, such as WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts, all surpass the one-billion-install mark. Additionally, several other communication apps fall within the 100-million and 500-million install range.

In [96]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Me

If we removed all the communication apps with 100 million or more installs, the average would drop by roughly a factor of ten:

In [99]:
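under_100_m = []

for app in android_final:
    # Strip ',' and '+' so the label can be converted to a number
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    # Keep only communication apps below the 100,000,000+ bucket
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))

# Average installs for the remaining communication apps
sum(under_100_m) / len(under_100_m)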

3603485.3884615386
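
The same check can be repeated for any genre. The following is only a sketch: `avg_installs` and `parse_installs` are hypothetical helper names, the column positions (category at index 1, installs at index 5) are assumed to match `android_final`, and the rows are made-up toy data rather than entries from the real dataset.

```python
def parse_installs(raw):
    """Turn a label such as '1,000,000+' into a float lower bound."""
    return float(raw.replace(',', '').replace('+', ''))


def avg_installs(dataset, category, cap=None, category_idx=1, installs_idx=5):
    """Average installs for one category, optionally ignoring apps at or above cap."""
    values = []
    for app in dataset:
        if app[category_idx] != category:
            continue
        n_installs = parse_installs(app[installs_idx])
        if cap is None or n_installs < cap:
            values.append(n_installs)
    if not values:
        return None  # no apps left after filtering
    return sum(values) / len(values)


# Toy rows (hypothetical), laid out like android_final: name, category, ..., installs
toy_rows = [
    ['Chat Giant', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Small Chat', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['Tiny Chat', 'COMMUNICATION', '', '', '', '10,000+'],
    ['Some Reader', 'BOOKS_AND_REFERENCE', '', '', '', '500,000+'],
]

print(avg_installs(toy_rows, 'COMMUNICATION'))                   # all apps
print(avg_installs(toy_rows, 'COMMUNICATION', cap=100_000_000))  # giants removed
```

On the real data, a call such as `avg_installs(android_final, 'COMMUNICATION', cap=100_000_000)` would be expected to mirror the cell above, provided the column layout assumption holds.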

The same pattern appears in other genres. Video players, with an average of 24,727,872 installs, are one example: the market is dominated by established apps such as YouTube, Google Play Movies & TV, and MX Player. The trend extends to social apps (dominated by giants such as Facebook, Instagram, and Google+), photography apps (Google Photos and other popular photo editors), and productivity apps (Microsoft Word, Dropbox, Google Calendar, and Evernote).

This recurring pattern raises a concern: the high average install numbers in these genres may overstate how popular a typical app is. Each genre appears to be dominated by a few giants that would be hard to compete against.
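
One way to see how much a few giants distort an average is to compare the mean with the median, which is far less sensitive to extreme values. This is a sketch only, not part of the original analysis: `mean_and_median` and `parse_installs` are hypothetical helpers, the column indexes are assumed to match `android_final`, and the rows are invented toy data.

```python
from statistics import mean, median


def parse_installs(raw):
    """Turn a label such as '1,000,000+' into a float lower bound."""
    return float(raw.replace(',', '').replace('+', ''))


def mean_and_median(dataset, category, category_idx=1, installs_idx=5):
    """Return (mean, median) of installs for the apps in one category."""
    values = [
        parse_installs(app[installs_idx])
        for app in dataset
        if app[category_idx] == category
    ]
    return mean(values), median(values)


# Toy rows (hypothetical): one giant and four ordinary apps
toy_rows = [
    ['Video Giant', 'VIDEO_PLAYERS', '', '', '', '1,000,000,000+'],
    ['Player A', 'VIDEO_PLAYERS', '', '', '', '5,000,000+'],
    ['Player B', 'VIDEO_PLAYERS', '', '', '', '1,000,000+'],
    ['Player C', 'VIDEO_PLAYERS', '', '', '', '100,000+'],
    ['Player D', 'VIDEO_PLAYERS', '', '', '', '10,000+'],
]

avg, mid = mean_and_median(toy_rows, 'VIDEO_PLAYERS')
print('mean  :', avg)
print('median:', mid)
```

In this toy example the single giant pulls the mean to roughly 200 times the median, which is the kind of gap that would signal a genre dominated by a few apps.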

The game genre also looks popular, but our earlier analysis suggested that this part of the market is somewhat saturated. We would therefore like to come up with a different app recommendation if possible.

The Books & Reference genre emerges as a potentially promising contender, with an average install count of 8,767,811. This genre warrants further investigation due to its potential success on both the App Store and Google Play, an aspect aligned with our goal of identifying a genre with profitability potential on both platforms.

To gain a deeper understanding of the Books & Reference genre, a closer look at individual applications within this category and their corresponding install numbers is necessary:


In [102]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+

The Books & Reference genre encompasses a diverse range of applications, including those designed for ebook processing and reading, library collections, dictionaries, and programming or language tutorials. However, a similar trend to other genres is observed: a small number of highly popular applications appear to skew the average install count.

In [105]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


While a small number of applications dominate the Books & Reference genre, the overall market exhibits potential. Let's focus on identifying app ideas inspired by applications with moderate install numbers, ranging from 1 million to 100 million downloads:

In [109]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
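
The range filter used in the cell above enumerates every install label by hand. An alternative is to compare the numeric values, so the bounds can be changed without listing labels. The sketch below assumes the same column layout as `android_final` (name at index 0, category at index 1, installs at index 5); `apps_in_range` and `parse_installs` are hypothetical names, and the rows are toy data.

```python
def parse_installs(raw):
    """Turn a label such as '1,000,000+' into a float lower bound."""
    return float(raw.replace(',', '').replace('+', ''))


def apps_in_range(dataset, category, low, high,
                  name_idx=0, category_idx=1, installs_idx=5):
    """Return (name, installs label) pairs with low <= installs < high."""
    matches = []
    for app in dataset:
        if app[category_idx] != category:
            continue
        if low <= parse_installs(app[installs_idx]) < high:
            matches.append((app[name_idx], app[installs_idx]))
    return matches


# Toy rows (hypothetical), laid out like android_final
toy_rows = [
    ['Huge Library', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000,000+'],
    ['Mid Reader', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
    ['Pocket Dictionary', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
    ['Tiny Guide', 'BOOKS_AND_REFERENCE', '', '', '', '50,000+'],
    ['Some Game', 'GAME', '', '', '', '5,000,000+'],
]

for name, installs in apps_in_range(toy_rows, 'BOOKS_AND_REFERENCE',
                                    1_000_000, 100_000_000):
    print(name, ':', installs)
```

With `low=1_000_000` and `high=100_000_000`, the filter covers the `1,000,000+` through `50,000,000+` labels, the same buckets the cell above lists explicitly.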

Within Books & Reference, ebook readers, library collections, and dictionaries appear dominant, so building similar apps would likely mean facing stiff competition. Interestingly, several apps in this list are built around the Quran, which suggests that building an app around a popular book can be profitable.

This concept aligns with our earlier App Store analysis: transforming a well-known book (perhaps a recent one) into an app with engaging features could be profitable on both platforms. However, simply offering the raw text wouldn't suffice. To stand out, we should consider features like daily quotes, audiobooks, quizzes for knowledge assessment, and even a discussion forum to foster community engagement.

# Summary
Our analysis of both App Store and Google Play data suggests a promising app concept: an interactive book app. This app would leverage the popularity of a well-known book, potentially a recent release, and transform it into an engaging mobile experience.

Since the market is saturated with traditional library apps, our app would differentiate itself by offering features beyond the raw text. This could include daily quotes, an audiobook version, quizzes to test user knowledge, and a dedicated forum for book discussions, fostering a sense of community among users.