# Identifying Opportunities in Android and iOS Application Markets

# Introduction

For this project, I am assuming the role of a data analyst at a company (ABC) that develops mobile applications. The goal of this project is to help stakeholders understand what types of apps are likely to attract more users on Google Play and the iOS App Store. The following data sets are used throughout the project:

* A [dataset](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)

* A [dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)

In order to make our recommendations, we will try to examine:

* The most common apps by genre
* Apps with the highest number of users by genre
* The highest rated apps by genre

# Data Exploration

Let's start by opening the two data sets and then explore the data.

In [1]:
from csv import reader

# The App store dataset#
open_file = open('AppleStore.csv')
read_file = reader(open_file)
Ios = list(read_file)
Ios_header = Ios[0]
Ios = Ios[1:]


# The google play store dataset#
open_file = open('googleplaystore.csv')
read_file = reader(open_file)
android = list(read_file)
android_header = android[0]
android = android[1:]
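
The loading cell above leaves both files open and relies on the platform's default text encoding. As a hedged alternative, here is a minimal sketch of a helper that closes the file automatically and reads it as UTF-8. The name `load_csv` and the throwaway demo file are assumptions made for illustration; they are not part of the original project.

```python
import csv
import os
import tempfile


def load_csv(path, encoding="utf8"):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # newline="" is what the csv module documentation recommends for file objects
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Tiny demonstration on a throwaway file, so the sketch runs without the real data sets
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="utf8", newline="") as f:
        csv.writer(f).writerows([["App", "Price"], ["Demo App", "0"]])
    demo_header, demo_rows = load_csv(demo_path)

print(demo_header)
print(demo_rows)
```

With the real files in the working directory, the same helper could be called as `android_header, android = load_csv('googleplaystore.csv')`.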

To make it easier to explore the two data sets, we create a function named explore_data() that we can use repeatedly to print rows in a readable way.
We also add an option for the function to show the number of rows and columns of any data set.

In [2]:
def explore_data(dataset, start, end, row_and_columns = False):
    dataset_slice = dataset[start:end]
    
    for row in dataset_slice:
        print(row)
        print('\n') #adds a new (empty) line after each row
        
    if row_and_columns:
        print('Number of rows:',len(dataset))
        print('Number of columns:',len(dataset[0]))

print(Ios_header)
print('\n')
explore_data(Ios , 0, 3, True)     

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


We have 7,197 iOS apps and 16 columns in our data set. Some of the columns look useful for our analysis, such as "price", "currency", "rating_count_tot", and "prime_genre". Not all of the column names are self-explanatory; details about each column can be found in the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

Now let's take a look at the Google Play data set.

In [3]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


We have 10,841 Android apps and 13 columns in this data set. Columns such as "App", "Price", "Category", and "Genres" might be useful for our analysis.

# Data Cleaning

The Google Play data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error in row 10472. Let's check it by printing this row and comparing it against the header and another row that is correct.

In [4]:
print(android[10472]) #wrong row
print('\n')
print(android_header) #header row
print('\n')
print(android[0]) #correct row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Row 10472 corresponds to the app 'Life Made WI-Fi Touchscreen Photo Frame'. The Category value is missing, so every later value is shifted one column to the left and the rating shows as 19. This is clearly wrong because the maximum rating for a Google Play app is 5. As a consequence, we'll delete this row.
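
The faulty row printed above has 12 fields against 13 header columns, so a simple length check would also have found it without the discussion thread. The following is a minimal sketch of that check; `find_malformed_rows` and the three-column sample are hypothetical names and data used only for illustration.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Hypothetical three-column sample; the second row is missing its category
sample_header = ["App", "Category", "Rating"]
sample_rows = [
    ["App A", "TOOLS", "4.1"],
    ["App B", "1.9"],
    ["App C", "GAME", "3.9"],
]
malformed = find_malformed_rows(sample_rows, sample_header)
print(malformed)
```

A length check only catches rows with missing or extra fields; it would not notice a row where values are wrong but the column count is right.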

In [5]:
print(len(android))
del android[10472] #Don't run this statement more than once
print(len(android))

10841
10840


# Handling Duplicate Data 

Now let's check whether each app has only one entry in our data set.
        

In [6]:
duplicate_apps = [] #This is for storing duplicate apps
unique_apps = []   # This is for storing unique apps

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:',duplicate_apps[:20])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']


We do not want to count any app more than once in our analysis, so we need to remove the duplicate entries and keep only one entry per app. First, let's check whether the duplicate entries for an app differ from each other.

In [7]:
#Print all data for one of the apps with duplicate entries
for app in android:
    name = app[0]
    if name == 'Slack':
        print(app)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


It looks like the number of reviews is higher for the last Slack entry, so we can assume that these rows were collected at different times. Instead of removing rows randomly, we will keep the row with the highest number of reviews for each app.

Let's proceed with the following steps:

* Create a dictionary where each key is a unique app name and the corresponding value is the highest number of reviews for that app

* Use the information stored in the dictionary to create a new data set, which will only have one entry per app

## Handling duplicate data: Step 1 - Creating a Dictionary

In [8]:
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

Since we know that there are 1,181 duplicate entries, we can confirm that the reviews_max dictionary has the expected number of entries: the length of the data set minus 1,181.

In [9]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


## Handling duplicate data: Step 2 - Removing duplicate rows

The check above shows that the reviews_max dictionary has the expected number of entries, so we can use it to remove the duplicate rows from our Google Play data set.

In [10]:
android_clean= [] #stores our new cleaned dataset
already_added = [] # stores our app names

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    
    # Add the row to the android_clean list and the app name to the already_added list if:
    # 1. The app's number of reviews matches the number stored for that app in the reviews_max dictionary
    # 2. The app name is not already in the already_added list (in case more than one entry of a duplicate app has the highest number of reviews)

    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
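
The two steps above can also be folded into a single pass that stores the best row per app name in a dictionary. This is only a sketch under stated assumptions: `dedupe_by_max_reviews` is a hypothetical name, the sample rows are shortened to four columns, and the Box values are invented for illustration. It relies on Python 3.7+ dictionaries preserving insertion order.

```python
def dedupe_by_max_reviews(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row that has the highest review count."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means that, on a tie, the earlier row is kept
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Shortened sample rows: name, category, rating, reviews (Box values are made up)
sample = [
    ["Slack", "BUSINESS", "4.4", "51507"],
    ["Box", "BUSINESS", "4.2", "1000"],
    ["Slack", "BUSINESS", "4.4", "51510"],
]
deduped = dedupe_by_max_reviews(sample)
print(deduped)
```

Rows come back in the order each app name was first seen, which can differ from the order produced by the two-step approach. Dictionary lookups take roughly constant time on average, whereas `name not in already_added` scans a list, so this variant should scale better on larger data sets.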

Let's explore the new dataset to confirm the number of rows turns out to be 9659

In [11]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


The clean data set has 9,659 rows, as expected.

# Handling Non-English Apps

Since our mobile development company only has the resources to create English apps, we should only include apps that are directed toward an English-speaking audience. According to the ASCII (American Standard Code for Information Interchange) system, the numbers corresponding to the characters commonly used in English text are all in the range 0 to 127. Therefore, if an app name contains a character whose number is greater than 127, the app likely has a non-English name.

In [12]:
# Create a function that returns False if any character's number is greater than 127

def is_english(string):
    for character in string:
        if ord(character) > 127:
            return False

    return True

# Test our function on some English and non-English app names
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
False


Although the last two apps are likely targeted at an English-speaking audience, our function still flags them as non-English. This is because emojis and the ™ symbol fall outside the ASCII range and have corresponding numbers greater than 127.

In [13]:
print(ord('™'))
print(ord('😜'))

8482
128540


To minimize data loss, we'll only remove an app if its name has more than three characters whose numbers fall outside the ASCII range. This filter is not perfect, but it should be fairly effective.

In [14]:
def is_english(string):
    non_ascii = 0

    # Count the characters that fall outside the ASCII range
    for character in string:
        if (ord(character)) > 127:
            non_ascii += 1
            
    if non_ascii > 3:
        return False
    else:
        return True
    
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))    

True
True
False


We have confirmed that this new version of the function does not exclude apps whose names contain only a few emojis or symbols. Now, let's use this function to remove non-English apps from both data sets.
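
The threshold of three is a judgment call, so it can help to make it a parameter and see how the results change. Here is a minimal sketch of such a variant; the name `is_probably_english` and the `max_non_ascii` parameter are assumptions for illustration, not part of the original function.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` characters outside ASCII."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii


for app_name in ['Docs To Go™ Free Office Suite', 'Instachat 😜', '爱奇艺PPS -《欢乐颂2》电视剧热播']:
    print(app_name, '->', is_probably_english(app_name))

# A stricter setting rejects any name with a non-ASCII character
print(is_probably_english('Instachat 😜', max_non_ascii=0))
```

With the default of three, this gives the same answers as the function above for the three test names.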

In [15]:
android_english = []
Ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in Ios:
    name = app[1]
    if is_english(name):
        Ios_english.append(app)
    

explore_data(android_english, 0, 3, True)
print('\n')
explore_data(Ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16


After removing the non-English apps, we are left with 9,614 Android apps and 6,183 iOS apps.

# Handling paid apps data
Our company only builds apps that are free to download and install, so we need to make sure that we include only free apps in our analysis.

In [17]:
#Creating a loop for Android that will append free apps to our final list

android_final = []
for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
#Creating a loop for IOS that will append free apps to our final list
Ios_final = []
for app in Ios_english:
    price = app[4]
    if price == '0.0':
        Ios_final.append(app)

print(len(android_final))
print(len(Ios_final))

8864
3222


Finally, we are left with 8,864 Android apps and 3,222 iOS apps, which should be enough to move forward with our analysis.

# Data Analysis

Our company's revenue is highly influenced by the number of people using our apps. Therefore, our goal for this project is to create an app that will likely attract more users.

To minimize risk and overhead, our validation strategy for an app idea is comprised of the following steps:

* Build a minimal Android version of the app, and add it to Google Play
* Observe users' response to the app, and develop it further as needed
* If the app is profitable within six months, build an iOS version of the app and add it to the App Store

In the end, we want to create an app on both Apple App Store and Google Play. Therefore, it is important for us to determine which app profile is most successful on both markets.

# Most Common Apps by Genre

Let's begin our analysis by getting a sense of the most common genres in each market. We'll build frequency tables for the prime_genre column of the Apple App Store data set, and for the Genres and Category columns of the Google Play data set.

To do this, we will create two separate functions:

1. A function to generate frequency tables that will also show percentages
2. A function that can display the percentages in a descending order

In [18]:
#Function 1- Generate frequency table that shows percentage of total

def freq_table(dataset, index):
    table = {}
    total = 0
    
    # Count app frequency
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    # Calculate the app frequency as a percentage of total
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
        
    return table_percentages

# Function 2: Display frequency table in a descending order

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    
    # Transform frequency table into a list of tuples 
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    
    # Sort the list in a descending order
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
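
For comparison, the same two helpers can be sketched with `collections.Counter` from the standard library. The names `freq_table_counter` and `display_table_sorted` and the four-row sample are hypothetical; this is an illustrative alternative, not the approach used in the rest of the analysis.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


def display_table_sorted(dataset, index):
    """Print 'value : percentage' lines, largest percentage first."""
    table = freq_table_counter(dataset, index)
    for value, percentage in sorted(table.items(), key=lambda item: item[1], reverse=True):
        print(value, ':', percentage)


# Hypothetical sample: app name, genre
sample = [["a", "Games"], ["b", "Games"], ["c", "Music"], ["d", "Games"]]
display_table_sorted(sample, 1)
```

One behavioral difference: when two values have the same percentage, this sketch keeps them in first-seen order, while sorting `(percentage, key)` tuples in reverse breaks ties by the key in descending order.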

We can now use the display_table function to explore our data set. Let's start by exploring the prime_genre column of the Apple App Store data set.

In [19]:
display_table(Ios_final,-5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Based on this result, the majority of the free English apps on the Apple App Store are Games (58.16%). The runner-up genre is Entertainment, which has only 7.88% of the apps, followed by Photo & Video at 4.97%. On the other hand, the least common genres are Navigation, Medical, and Catalogs, each of which accounts for less than 0.2% of the apps in our data set.

The general impression is that the Apple App Store is dominated by apps that are designed for fun, such as games, entertainment, and photo and video apps, whereas apps with practical purposes, like productivity, finance, and navigation, are rarer. However, a large number of fun apps does not necessarily imply that they have the highest number of users, because demand may not match supply.

Next, we will explore the Genres and Category columns of the Google Play data set. We are going to look into both of them because we are currently unsure which column pertains to the actual genre.

In [20]:
display_table(android_final,1) #category column

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

Compared with the iOS App Store, Google Play has a larger share of apps in "practical" categories: a good number of apps are designed for practical purposes (family, tools, business, finance, etc.). However, if we look into the Family category (which accounts for 18.9% of the Google Play market), it appears to actually consist of games for kids. Therefore, we may also conclude that the top apps on Google Play include apps that are designed for fun.

In [21]:
display_table(android_final,-4)#genres column

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

It is unclear what differentiates the Genres column from the Category column, other than that Genres seems to be more granular. For the purpose of this analysis, we will only work with the Category column because we want to focus on the bigger picture.

Based on our initial analysis, the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and entertainment apps. We might also want to avoid developing another game app, as the market seems to be oversaturated with entertainment apps.

# Apps With Most Users by Genre

Next, we'll look into which genres have the most users. For the Google Play data set, we can get this information from the Installs column. The iOS App Store data set does not include this information, so we will use the total number of user ratings as a proxy; this can be found in the rating_count_tot column.

We need to perform the below steps to conduct this analysis:

* Isolate the apps of each genre
* Sum up the user ratings for the apps of that genre
* Divide the sum by the number of apps belonging to that genre
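
The three steps above can be sketched as a single pass over the data that accumulates a sum and a count per genre. The function name `average_by_group` and the three-row sample are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict


def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the numeric column at `value_index`}."""
    totals = defaultdict(float)  # group -> running sum
    counts = defaultdict(int)    # group -> number of rows
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical sample: app name, genre, number of ratings
sample = [["App A", "Games", "100"], ["App B", "Games", "300"], ["App C", "Music", "50"]]
averages = average_by_group(sample, 1, 2)
print(averages)
```

Because every group in the result has at least one row, the division cannot fail, and the data set is scanned once rather than once per genre.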

Below, we calculate the average number of user ratings per app genre on the iOS App Store.

In [25]:
import operator

genres_ios = freq_table(Ios_final, -5)

unsorted_genres_ios = {}

for genre in genres_ios:
    total = 0       # Store sum of ratings for each genre
    len_genre = 0   # Store number of apps specific to the genre
    for app in Ios_final:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre   # Compute the average number of user rating
    unsorted_genres_ios[genre] = avg_n_ratings # Store the value to an unsorted dictionary

# Sort the dictionary values into a list of tuples and assign it to a new variable
sorted_genres_ios = sorted(unsorted_genres_ios.items(), key=operator.itemgetter(1), reverse=True)

for item in sorted_genres_ios:
    print(item)

('Navigation', 86090.33333333333)
('Reference', 74942.11111111111)
('Social Networking', 71548.34905660378)
('Music', 57326.530303030304)
('Weather', 52279.892857142855)
('Book', 39758.5)
('Food & Drink', 33333.92307692308)
('Finance', 31467.944444444445)
('Photo & Video', 28441.54375)
('Travel', 28243.8)
('Shopping', 26919.690476190477)
('Health & Fitness', 23298.015384615384)
('Sports', 23008.898550724636)
('Games', 22788.6696905016)
('News', 21248.023255813954)
('Productivity', 21028.410714285714)
('Utilities', 18684.456790123455)
('Lifestyle', 16485.764705882353)
('Entertainment', 14029.830708661417)
('Business', 7491.117647058823)
('Education', 7003.983050847458)
('Catalogs', 4004.0)
('Medical', 612.0)


It seems that Navigation and Reference are the most popular genres on the iOS App Store. Let's explore each of these genres to make sure that their average number of user ratings is not skewed or inflated by one or two major apps.

In [30]:
# Print the total number of user ratings for each app under Navigation genre
for app in Ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


In [32]:
# Print the total number of user ratings for each app under Reference genre
for app in Ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


It turns out that these genres are dominated by major apps like Waze, Bible, and Dictionary.com, which each have more than 100,000 user ratings. To prevent our numbers from being skewed by major apps, let's adjust our average calculation to only include apps with fewer than 100,000 ratings:
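
One thing to watch with a cutoff is that a genre could end up with no apps below it, in which case dividing by the count would raise a ZeroDivisionError. A minimal sketch of a guarded version follows; `average_below_cutoff` is a hypothetical helper, and the Navigation numbers are the rating counts from the listing above.

```python
def average_below_cutoff(values, cutoff):
    """Mean of the values strictly below `cutoff`, or None if none qualify."""
    kept = [v for v in values if v < cutoff]
    if not kept:
        return None  # avoids ZeroDivisionError when every value is filtered out
    return sum(kept) / len(kept)


# Rating counts of the six Navigation apps listed above
navigation = [345046, 154911, 12811, 3582, 187, 5]
print(average_below_cutoff(navigation, 100000))  # 4146.25
print(average_below_cutoff([200000], 100000))    # None
```

Returning None for an empty group is one possible design choice; skipping the genre entirely would work as well.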

In [34]:
genres_ios = freq_table(Ios_final, -5)

unsorted_genres_ios = {}

for genre in genres_ios:
    total = 0          # Store sum of ratings for each genre
    len_genre = 0      # Store number of apps specific to the genre
    for app in Ios_final:
        genre_app = app[-5]
        ratings = float(app[5])
        if genre_app == genre and ratings < 100000:  # Skip apps with 100,000 or more user ratings
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre           # Compute the average number of user rating
    unsorted_genres_ios[genre] = avg_n_ratings  # Store the value to an unsorted dictionary

# Sort the dictionary values into a list of tuples and assign it to a new variable
sorted_genres_ios = sorted(unsorted_genres_ios.items(), key=operator.itemgetter(1), reverse=True)

for item in sorted_genres_ios:
    print(item)

('Book', 16605.75)
('Social Networking', 13900.021052631579)
('Shopping', 12601.987012987012)
('Productivity', 12377.692307692309)
('Finance', 10506.354838709678)
('Sports', 10314.09375)
('Reference', 10186.9375)
('Photo & Video', 9317.454545454546)
('Utilities', 8676.57142857143)
('Travel', 8490.081081081082)
('Music', 8395.543859649122)
('Entertainment', 8260.150406504064)
('News', 7850.45)
('Business', 7491.117647058823)
('Weather', 7454.772727272727)
('Games', 6854.495184135977)
('Health & Fitness', 6417.3442622950815)
('Lifestyle', 5297.666666666667)
('Education', 4660.163793103448)
('Navigation', 4146.25)
('Catalogs', 4004.0)
('Food & Drink', 3678.0454545454545)
('Medical', 612.0)


After we exclude the major apps, the Book genre has the highest average number of user ratings, followed by Social Networking and Shopping. Let's look at which apps are actually available in the Book genre.

In [35]:
# Print the total number of user ratings for each app under Book genre
for app in Ios_final:
    if app[-5] == 'Book':
        print(app[1], ':', app[5])

Kindle – Read eBooks, Magazines & Textbooks : 252076
Audible – audio books, original series & podcasts : 105274
Color Therapy Adult Coloring Book for Adults : 84062
OverDrive – Library eBooks and Audiobooks : 65450
HOOKED - Chat Stories : 47829
BookShout: Read eBooks & Track Your Reading Goals : 879
Dr. Seuss Treasury — 50 best kids books : 451
Green Riding Hood : 392
Weirdwood Manor : 197
MangaZERO - comic reader : 9
ikouhoushi : 0
MangaTiara - love comic reader : 0
謎解き : 0
謎解き2016 : 0


The most popular Book apps tend to be library-type apps, where a person can access multiple books in a single app. The coloring book app is the only other non-library app that has a high number of user ratings. However, we might want to avoid creating another entertainment app because, based on our first analysis, the market already appears to be oversaturated with fun apps.

Next, we'll analyze the Google Play market.

In [36]:
display_table(android_final, 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


Unlike the iOS App Store's rating_count_tot column, Google Play's Installs column is not precise because the install numbers are grouped into brackets. For example, we do not know whether an app with 100,000+ installs has 100,500 installs or 350,000 installs. For the purposes of our current project we can leave the data as is, since we do not need that level of precision.

To perform computations on these install numbers, we need to convert them to floats. To do that, we first have to remove the commas and plus signs; otherwise the conversion will fail with an error.
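
As a small illustration of that conversion, here is a sketch of a helper that strips both characters before calling `float()`. The name `parse_installs` is hypothetical; the sample strings are brackets that appear in the Installs frequency table above.

```python
def parse_installs(text):
    """Convert an Installs string such as '1,000,000+' to a float."""
    return float(text.replace(',', '').replace('+', ''))


for raw in ['1,000,000+', '500+', '0']:
    print(raw, '->', parse_installs(raw))
```

Calling `float('1,000,000+')` directly would raise a ValueError, which is why the cleanup comes first. The result is the lower bound of each bracket, so averages computed from it understate the true install counts.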

In [37]:
categories_android = freq_table(android_final, 1)

unsorted_categories_android = {}

for category in categories_android:
    total = 0                 # Store sum of installs for each category
    len_category = 0          # Store number of apps in the category
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')  # Remove commas
            n_installs = n_installs.replace('+','')   # Remove plus signs
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category                  # Compute the average number of installs
    unsorted_categories_android[category] = avg_n_installs # Store the value in an unsorted dictionary

# Sort the dictionary values into a list of tuples and assign it to a new variable
sorted_categories_android = sorted(unsorted_categories_android.items(), key=operator.itemgetter(1), reverse=True)

for item in sorted_categories_android:
    print(item)

('COMMUNICATION', 38456119.167247385)
('VIDEO_PLAYERS', 24727872.452830188)
('SOCIAL', 23253652.127118643)
('PHOTOGRAPHY', 17840110.40229885)
('PRODUCTIVITY', 16787331.344927534)
('GAME', 15588015.603248259)
('TRAVEL_AND_LOCAL', 13984077.710144928)
('ENTERTAINMENT', 11640705.88235294)
('TOOLS', 10801391.298666667)
('NEWS_AND_MAGAZINES', 9549178.467741935)
('BOOKS_AND_REFERENCE', 8767811.894736841)
('SHOPPING', 7036877.311557789)
('PERSONALIZATION', 5201482.6122448975)
('WEATHER', 5074486.197183099)
('HEALTH_AND_FITNESS', 4188821.9853479853)
('MAPS_AND_NAVIGATION', 4056941.7741935486)
('FAMILY', 3695641.8198090694)
('SPORTS', 3638640.1428571427)
('ART_AND_DESIGN', 1986335.0877192982)
('FOOD_AND_DRINK', 1924897.7363636363)
('EDUCATION', 1833495.145631068)
('BUSINESS', 1712290.1474201474)
('LIFESTYLE', 1437816.2687861272)
('FINANCE', 1387692.475609756)
('HOUSE_AND_HOME', 1331540.5616438356)
('DATING', 854028.8303030303)
('COMICS', 817657.2727272727)
('AUTO_AND_VEHICLES', 647317.8170731707

Communication apps appear to be the most popular, with an average of 38,456,119 installs per app. However, this average is likely skewed by a few major apps with over 500 million or over one billion installs. We want to make sure we are not picking a genre that seems more popular than it really is. Furthermore, we do not want to develop an app in a genre that is already dominated by a few major players, because it would be difficult to compete against them.
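
To see how a handful of very large apps can pull an average upward, the sketch below compares the mean and the median of a small list of install counts. The numbers are made up for illustration and are not values from the dataset.

```python
from statistics import mean, median

# Hypothetical install counts: four mid-sized apps and one giant
installs = [100_000, 500_000, 1_000_000, 5_000_000, 1_000_000_000]

print('mean  :', mean(installs))
print('median:', median(installs))

# Dropping the single giant app brings the mean much closer to the median
without_giant = [n for n in installs if n < 100_000_000]
print('mean without giant:', mean(without_giant))
```

In this toy example the mean is roughly two hundred times the median, which is the kind of distortion the category averages above may suffer from.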

Let's modify our analysis by excluding apps with 100 million or more installs.

In [38]:
categories_android = freq_table(android_final, 1)

unsorted_categories_android = {}

for category in categories_android:
    total = 0                   # Store sum of installs for each genre
    len_category = 0            # Store number of apps specific to the genre
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')   # Remove commas
            n_installs = n_installs.replace('+','')    # Remove plus signs
            n_installs = float(n_installs)
            if n_installs < 100000000:                 # Exclude apps with 100 million or more installs
                total += n_installs
                len_category += 1
    avg_n_installs = total / len_category                   # Compute the average number of installs
    unsorted_categories_android[category] = avg_n_installs  # Store the value to an unsorted dictionary

# Sort the dictionary values into a list of tuples and assign it to a new variable
sorted_categories_android = sorted(unsorted_categories_android.items(), key=operator.itemgetter(1), reverse=True)

for item in sorted_categories_android:
    print(item)

('PHOTOGRAPHY', 7670532.29338843)
('GAME', 6272564.694894147)
('ENTERTAINMENT', 6118250.0)
('VIDEO_PLAYERS', 5544878.133333334)
('WEATHER', 5074486.197183099)
('SHOPPING', 4640920.541237113)
('COMMUNICATION', 3603485.3884615386)
('PRODUCTIVITY', 3379657.318885449)
('TOOLS', 3191461.128987517)
('SOCIAL', 3084582.5201793723)
('SPORTS', 2994082.551839465)
('TRAVEL_AND_LOCAL', 2944079.6336633665)
('PERSONALIZATION', 2549775.832167832)
('MAPS_AND_NAVIGATION', 2484104.7540983604)
('FAMILY', 2342897.527075812)
('HEALTH_AND_FITNESS', 2005713.6605166052)
('ART_AND_DESIGN', 1986335.0877192982)
('FOOD_AND_DRINK', 1924897.7363636363)
('EDUCATION', 1833495.145631068)
('NEWS_AND_MAGAZINES', 1502841.8775510204)
('BOOKS_AND_REFERENCE', 1437212.2162162163)
('HOUSE_AND_HOME', 1331540.5616438356)
('BUSINESS', 1226918.7407407407)
('LIFESTYLE', 1152128.779710145)
('FINANCE', 1086125.7859327218)
('DATING', 854028.8303030303)
('COMICS', 817657.2727272727)
('AUTO_AND_VEHICLES', 647317.8170731707)
('LIBRARIES_
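
The nested loops above scan the whole dataset once per category. As an alternative sketch, the same averages can be accumulated in a single pass with running totals and counts. The function name `average_installs_by_category`, its `max_installs` parameter, and the three sample rows are assumptions made for illustration; the column positions (category at index 1, installs at index 5) follow the code above.

```python
from collections import defaultdict


def average_installs_by_category(rows, max_installs=100_000_000):
    """Return {category: mean installs} for apps below max_installs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        installs = float(row[5].replace(',', '').replace('+', ''))
        if installs < max_installs:
            totals[row[1]] += installs
            counts[row[1]] += 1
    # Categories with no qualifying apps never enter the dicts,
    # so the division below cannot hit a zero count
    return {cat: totals[cat] / counts[cat] for cat in totals}


# Hypothetical rows in the same column layout as android_final
sample = [
    ['App A', 'PHOTOGRAPHY', '4.5', '100', '5M', '1,000,000+'],
    ['App B', 'PHOTOGRAPHY', '4.0', '200', '7M', '100,000,000+'],
    ['App C', 'WEATHER', '4.2', '50', '3M', '500,000+'],
]

averages = average_installs_by_category(sample)
for category, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(category, avg)
```

Making the threshold a parameter also makes it easy to rerun the analysis with a different cutoff.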

After excluding apps with 100 million or more installs, Photography rises to become the most popular category. Let's explore this category in more depth. In this analysis, we will also include only apps with fewer than 100 million installs.

In [39]:
for app in android_final:
    # Skip apps in the three largest install brackets
    if app[1] == 'PHOTOGRAPHY' and app[5] not in ('1,000,000,000+',
                                                  '500,000,000+',
                                                  '100,000,000+'):
        print(app[0], ':', app[5])

TouchNote: Cards & Gifts : 1,000,000+
FreePrints – Free Photos Delivered : 1,000,000+
Groovebook Photo Books & Gifts : 500,000+
Moony Lab - Print Photos, Books & Magnets ™ : 50,000+
LALALAB prints your photos, photobooks and magnets : 1,000,000+
Snapfish : 1,000,000+
Motorola Camera : 50,000,000+
HD Camera - Best Cam with filters & panorama : 5,000,000+
LightX Photo Editor & Photo Effects : 10,000,000+
Sweet Snap - live filter, Selfie photo edit : 10,000,000+
HD Camera - Quick Snap Photo & Video : 1,000,000+
Waterfall Photo Frames : 1,000,000+
Photo frame : 100,000+
Huji Cam : 5,000,000+
Unicorn Photo : 1,000,000+
HD Camera : 5,000,000+
Makeup Editor -Beauty Photo Editor & Selfie Camera : 1,000,000+
Makeup Photo Editor: Makeup Camera & Makeup Editor : 1,000,000+
Moto Photo Editor : 5,000,000+
InstaBeauty -Makeup Selfie Cam : 50,000,000+
Garden Photo Frames - Garden Photo Editor : 500,000+
Photo Frame : 10,000,000+
Selfie Camera - Photo Edito

It appears that the photography category is saturated with photo editing, camera, and selfie apps, so we can rule out those app ideas. If we want to combine the most popular genres on both the iOS App Store and Google Play, we can explore the idea of a Book app that relates to photography. For example, we could build a Book app based on photography guidebooks, containing photography tips, reference guides from experts, and even discussion boards on photography topics.

# Highest Rated Apps by Genre

To further support our analysis, we can also look at the highest rated apps by genre, using the Rating column in the Google Play dataset and the user_rating column in the App Store dataset. Our working assumption is that if people are willing to give positive ratings to apps in a certain genre, they are also likely to recommend those apps to their friends. Under that assumption, high ratings point to potential growth for our app, and possibly to opportunities for monetization in the future.

First, we will look at ratings in the iOS App Store dataset. Let's make sure that the user_rating column has appropriate data before we perform our analysis.

In [40]:
for app in Ios_final:
    if app[-5] == 'Book':
        print(app[1],':',app[7])

Kindle – Read eBooks, Magazines & Textbooks : 3.5
Audible – audio books, original series & podcasts : 4.5
Color Therapy Adult Coloring Book for Adults : 5.0
OverDrive – Library eBooks and Audiobooks : 4.0
HOOKED - Chat Stories : 4.5
BookShout: Read eBooks & Track Your Reading Goals : 4.0
Dr. Seuss Treasury — 50 best kids books : 4.5
Green Riding Hood : 4.0
Weirdwood Manor : 4.5
MangaZERO - comic reader : 4.5
ikouhoushi : 0.0
MangaTiara - love comic reader : 0.0
謎解き : 0.0
謎解き2016 : 0.0


Some apps have a rating of 0. This probably means that no one has rated them yet, so we will remove them from our analysis to prevent them from pulling the averages down.
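
As a quick illustration of that effect, the sketch below averages the Book ratings printed above with and without the zeros. It ignores the other filters used in the notebook, so the result is not expected to match the notebook's figures.

```python
# user_rating values copied from the Book apps printed above
book_ratings = [3.5, 4.5, 5.0, 4.0, 4.5, 4.0, 4.5, 4.0, 4.5, 4.5,
                0.0, 0.0, 0.0, 0.0]

# Keep only apps that have actually been rated
rated_only = [r for r in book_ratings if r != 0]

avg_all = sum(book_ratings) / len(book_ratings)
avg_rated = sum(rated_only) / len(rated_only)

print('average including unrated apps:', round(avg_all, 2))
print('average of rated apps only    :', round(avg_rated, 2))
```

Four unrated apps are enough to lower the average by more than a full point in this sample.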

In [41]:
rating_ios = freq_table(Ios_final, -5)

unsorted_ratings_ios = {}

for genre in rating_ios:
    total = 0             # Store sum of ratings for each genre
    len_rating = 0        # Store number of rated apps in the genre
    for app in Ios_final:
        genre_app = app[-5]
        rating_count = float(app[5])   # Number of ratings the app received
        user_rating = float(app[7])    # Average user rating of the app
        # Exclude apps with 100,000 or more ratings and apps with a rating of 0
        if genre == genre_app and rating_count < 100000 and user_rating != 0:
            total += user_rating
            len_rating += 1
    avg_n_ratings = total / len_rating           # Compute the average rating value
    unsorted_ratings_ios[genre] = avg_n_ratings  # Store the value to an unsorted dictionary

# Sort the dictionary values into a list of tuples and assign it to a new variable
sorted_ratings_ios = sorted(unsorted_ratings_ios.items(), key=operator.itemgetter(1), reverse=True)

for item in sorted_ratings_ios:
    print(item)

('Medical', 4.5)
('Book', 4.375)
('Productivity', 4.255102040816326)
('Games', 4.193901716992303)
('Health & Fitness', 4.154545454545454)
('Photo & Video', 4.144827586206897)
('Catalogs', 4.125)
('Shopping', 4.114864864864865)
('Reference', 4.107142857142857)
('Business', 3.9705882352941178)
('Music', 3.9642857142857144)
('Weather', 3.8947368421052633)
('Education', 3.86697247706422)
('Utilities', 3.857142857142857)
('Travel', 3.75)
('Social Networking', 3.7444444444444445)
('Entertainment', 3.660337552742616)
('News', 3.5694444444444446)
('Lifestyle', 3.5217391304347827)
('Food & Drink', 3.5)
('Navigation', 3.5)
('Finance', 3.4655172413793105)
('Sports', 3.1475409836065573)


The genre with the highest average rating is Medical (4.5), followed by Book at 4.375. This further reinforces our idea of developing an app within the Book genre.

Next, we will look at the Google Play data to see whether the same pattern holds on the Android market. Before we perform our analysis, let's make sure that the Rating column has appropriate data.

In [42]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0],':',app[2])

E-Book Read - Read Book for free : 4.5
Download free book with green book : 4.6
Wikipedia : 4.4
Cool Reader : 4.5
Free Panda Radio Music : 4.5
Book store : 4.4
FBReader: Favorite Book Reader : 4.5
English Grammar Complete Handbook : 4.6
Free Books - Spirit Fanfiction and Stories : 4.8
Google Play Books : 3.9
AlReader -any text book reader : 4.6
Offline English Dictionary : 4.2
Offline: English to Tagalog Dictionary : 4.7
FamilySearch Tree : 4.3
Cloud of Books : 3.3
Recipes of Prophetic Medicine for free : 4.6
ReadEra – free ebook reader : 4.8
Anonymous caller detection : NaN
Ebook Reader : 4.1
Litnet - E-books : 4.6
Read books online : 4.1
English to Urdu Dictionary : 4.6
eBoox: book reader fb2 epub zip : 4.7
English Persian Dictionary : 4.5
Flybook : 3.9
All Maths Formulas : 4.4
Ancestry : 4.3
HTC Help : 4.2
English translation from Bengali : 4.5
Pdf Book Download - Read Pdf Book : 4.4
Free Book Reader : 3.4
eBoox new: Reader for fb2 epub zip books : 4.9
Only 30 days in English, the g

In the Google Play dataset, apps without ratings are marked with a NaN value. We need to exclude these apps from our analysis: float('NaN') converts to nan without raising an error, and a single nan would turn the average for its whole category into nan.
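
A small sketch of why the NaN values matter. In Python, `float('NaN')` does not raise an exception; it returns the special value `nan`, which then propagates through any arithmetic. The three rating strings below are hypothetical stand-ins for values read from the CSV.

```python
import math

ratings = ['4.5', 'NaN', '4.1']   # Hypothetical Rating values as read from the CSV

# The conversion succeeds, but the NaN entry becomes the float nan
values = [float(r) for r in ratings]
naive_average = sum(values) / len(values)
print(naive_average)              # nan propagates through the sum

# Skipping nan values gives a usable average
valid = [v for v in values if not math.isnan(v)]
clean_average = sum(valid) / len(valid)
print(clean_average)
```

Comparing the raw string against `'NaN'` before converting, as the notebook does, has the same effect as the `math.isnan` check here.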

In [43]:
ratings_android = freq_table(android_final, 1)

unsorted_ratings_android = {}

for category in ratings_android:
    total = 0                  # Store sum of ratings for each category
    len_ratings = 0            # Store number of rated apps in the category
    for app in android_final:
        category_app = app[1]
        rating = app[2]
        if category_app == category and rating != 'NaN': # Exclude apps with NaN rating
            installs = app[5]
            installs = installs.replace('+','')   # Remove plus signs
            installs = installs.replace(',','')   # Remove commas
            installs = float(installs)
            rating = float(rating)
            if installs < 100000000:              # Exclude apps with 100 million or more installs
                total += rating
                len_ratings += 1
    avg_n_ratings = total / len_ratings           # Compute the average rating value
    unsorted_ratings_android[category] = avg_n_ratings # Store the value to an unsorted dictionary

# Sort the dictionary values into a list of tuples and assign it to a new variable
sorted_ratings_android = sorted(unsorted_ratings_android.items(), key=operator.itemgetter(1), reverse=True)

for item in sorted_ratings_android:
    print(item)

('EVENTS', 4.435555555555557)
('BOOKS_AND_REFERENCE', 4.346753246753246)
('EDUCATION', 4.3401960784313705)
('PARENTING', 4.3395833333333345)
('ART_AND_DESIGN', 4.338181818181818)
('PERSONALIZATION', 4.291555555555557)
('BEAUTY', 4.278571428571428)
('SOCIAL', 4.25159574468085)
('HEALTH_AND_FITNESS', 4.233333333333333)
('WEATHER', 4.229230769230768)
('SHOPPING', 4.221387283236996)
('GAME', 4.218372703412077)
('SPORTS', 4.213135593220339)
('AUTO_AND_VEHICLES', 4.184722222222223)
('LIBRARIES_AND_DEMO', 4.178125)
('COMICS', 4.177358490566039)
('FAMILY', 4.169251700680277)
('FOOD_AND_DRINK', 4.1673913043478255)
('PRODUCTIVITY', 4.16461538461539)
('MEDICAL', 4.147807017543858)
('PHOTOGRAPHY', 4.144104803493449)
('HOUSE_AND_HOME', 4.140983606557378)
('FINANCE', 4.128125000000001)
('ENTERTAINMENT', 4.115)
('NEWS_AND_MAGAZINES', 4.103076923076923)
('BUSINESS', 4.102390438247011)
('COMMUNICATION', 4.1019323671497565)
('LIFESTYLE', 4.082374100719423)
('TRAVEL_AND_LOCAL', 4.059195402298849)
('MAPS_

As on the iOS App Store, Books and Reference is the second highest rated category on Google Play. This suggests that users are generally receptive to apps within the Book genre and may be willing to recommend them. We should proceed with our idea of developing an app within the Book genre, because the ratings point to good potential for growth and monetization.
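
To check a genre's position in both rankings without reading the lists by eye, a small lookup can help. This is a sketch only: `rank_of` is a hypothetical helper, and the two lists hold just the top three entries copied from the sorted outputs above. Because the stores name their genres differently, each lookup uses that store's own label.

```python
def rank_of(sorted_pairs, name):
    """Return the 1-based position of name in a list of (name, value) tuples."""
    for position, (label, _value) in enumerate(sorted_pairs, start=1):
        if label == name:
            return position
    return None  # name not present in the list


# Top entries copied from the two sorted outputs above
top_ios = [('Medical', 4.5), ('Book', 4.375), ('Productivity', 4.255102040816326)]
top_android = [('EVENTS', 4.435555555555557),
               ('BOOKS_AND_REFERENCE', 4.346753246753246),
               ('EDUCATION', 4.3401960784313705)]

print('App Store rank  :', rank_of(top_ios, 'Book'))
print('Google Play rank:', rank_of(top_android, 'BOOKS_AND_REFERENCE'))
```

In the notebook itself, the full `sorted_ratings_ios` and `sorted_ratings_android` lists could be passed in place of the shortened lists.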

# Conclusion
In this project, we analyzed iOS App Store and Google Play data to determine what type of app our company should build to be profitable in both markets.

Our conclusion is to build a book app that functions as a reference or guide for photographers. We found that both markets are already oversaturated with entertainment apps. Therefore, we decided to go with a combination of the most popular genres from each market: Books for the App Store and Photography for Google Play. For example, we can develop a Book app based on photography guidebooks, containing photography tips, reference guides from experts, and discussion boards on photography topics. The ratings data also indicates that users in both markets rate book apps highly, which can help with our future growth and monetization plans.