# Analyzing Profitable Apps in the App Store and Google Play Store

# Introduction

The main objective of this project is to determine which kinds of mobile apps in the two most popular app markets (the App Store and the Google Play Store) are profitable. For context, we are data analysts at a company that specializes in developing iOS and Android apps, and our task is to analyse the data to learn which apps are worth making and which are not.

All of the apps we develop are free to install, so our revenue model is in-app ads and in-app purchases. The more users an app has, the higher the revenue. This means we must analyze the data to understand which types of apps attract the most downloads in order to maximize revenue.

## Opening and Exploring the Data

In [1]:
from csv import reader

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
    
    # display the number of rows and columns in the dataset
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
# open the two datasets (AppleStore.csv & googleplaystore.csv)
# an explicit encoding avoids platform-dependent decoding errors,
# and the with-blocks close the files once they have been read
with open("AppleStore.csv", encoding="utf8") as ios_file:
    ios_content = list(reader(ios_file))
ios_header = ios_content[0]
ios_data = ios_content[1:]

with open("googleplaystore.csv", encoding="utf8") as android_file:
    android_content = list(reader(android_file))
android_header = android_content[0]
android_data = android_content[1:]

# display the header and the first three rows of the ios dataset
print("App Store - First 3 Rows")
explore_data(ios_content, 0, 4, True)
print("")

App Store - First 3 Rows
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7198
Number of columns: 16



<b>Comment:</b> A few columns in the App Store dataset could help us with our objective: the ratings, the genre and possibly the currency.

In [2]:
# display the header and the first three rows of the android dataset
print("Google Play Store - First 3 Rows")
explore_data(android_content, 0, 4, True)

Google Play Store - First 3 Rows
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


<b>Comment:</b> Here, we can see that there are several columns of interest that we can use to solve this problem. We can probably expect the genre, rating, type, category and number of installs to play a large role in the success of an app.

## Removing Wrong Data

We have been notified of an error in one of the rows of the Google Play dataset: entry 10472 has a wrong rating. Our task is to find and remove this row, because it may alter the results of our analysis if kept.

In [3]:
print("Google Play Dataset Header")
print(android_header)
print("")

print("Entry 10472")
print(android_data[10472])

Google Play Dataset Header
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

Entry 10472
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


<b>Comment:</b> Looking at the 2nd column, which should hold the category, entry 10472 has the value '1.9'. The row contains only 12 values against the 13 columns of the header: the category is missing, so every later value is shifted one column to the left. Because the row is invalid, we have concluded that it is better to remove it entirely.
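The length mismatch described above suggests a general check for malformed rows. The following is a minimal sketch on made-up toy data; `find_malformed_rows` and the sample rows are hypothetical and are not part of the notebook.

```python
# Hypothetical toy data; these rows are illustrative, not taken from the dataset.
header = ['App', 'Category', 'Rating']
rows = [
    ['Photo Editor', 'ART_AND_DESIGN', '4.1'],
    ['Broken Frame App', '1.9'],  # Category is missing, so the row is too short
    ['Wi-Fi Finder', 'TOOLS', '4.2'],
]

def find_malformed_rows(header, rows):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]

bad_indices = find_malformed_rows(header, rows)
print(bad_indices)
```

Run against `android_header` and `android_data`, a check like this would be expected to flag entry 10472, assuming no other row has a column-count problem.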

In [4]:
# remove the row from the Google Play Dataset
del android_data[10472]

# recheck to see if the row of data is still there (should be a new row)
print(android_data[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


## Removing Duplicate Entries
Finding and removing duplicates is another data cleaning task we must do. According to some discussions, the Google Play dataset contains several duplicate entries, such as the following.

In [5]:
# print the instagram duplicate entries
for app in android_data:
    app_name = app[0]
    if app_name == "Instagram":
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


<b>Comment:</b> There are 4 Instagram entries in the Google Play dataset. Duplicates like these will dilute the analysis, so it is best to check whether any more are present in the dataset. Our criterion for removing duplicates will be the number of reviews (4th column): for each app, the row with the highest number of reviews will be kept.

In [6]:
# create two lists to store the unique apps and duplicate apps
unique_apps = []
duplicate_apps = []

for app in android_data:
    app_name = app[0]
    if app_name not in unique_apps:
        unique_apps.append(app_name)
    else:
        duplicate_apps.append(app_name)
        
# show number of unique and duplicate app entries
print("Number of unique android apps:", len(unique_apps))
print("Number of duplicate android apps:", len(duplicate_apps))

Number of unique android apps: 9659
Number of duplicate android apps: 1181


<b>Comment:</b> Here, we can see there are 1181 entries with duplicate app names. We must remove them and keep, for each app, the entry with the highest number of reviews, as that will most likely be the most up-to-date entry.

Now, we must keep track of each unique app entry and if they have duplicate entries, then we will save the one with the highest number of reviews and remove the rest.
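As a compact illustration of this rule, the sketch below keeps the most-reviewed row per app in a single pass. It is an alternative formulation on toy data, not the notebook's own two-step method; `keep_most_reviewed` and the shortened sample rows (including the 'Maps' row) are hypothetical.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the row with the highest review count.

    On ties, the first row encountered is kept.
    """
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = int(row[reviews_idx])
        if name not in best or n_reviews > int(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Toy rows with only four columns: name, category, rating, reviews.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Maps', 'TRAVEL_AND_LOCAL', '4.3', '100'],
]

clean = keep_most_reviewed(sample)
for row in clean:
    print(row)
```

Storing the whole row in the dictionary avoids the second loop and the `already_added` list, at the cost of holding one row per app in memory.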

In [7]:
# keep track of the app name and the highest number of reviews for this app
reviews_max = {}
for app in android_data:
    name = app[0]
    n_reviews = int(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

# check if the length of unique keys is equal to the number of unique apps
print("Unique Apps:", len(unique_apps))
print("Unique Dict:", len(reviews_max.keys()))

Unique Apps: 9659
Unique Dict: 9659


<b>Comment:</b> Now, we should make a clean android dataset with only the unique app entries with the highest number of reviews of each app.

In [8]:
# make a clean version of the android dataset
android_clean = [] # store cleaned dataset
already_added = [] # store app names

for app in android_data:
    name = app[0]
    n_reviews = int(app[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
        
# display the length of the new android dataset to confirm that we have removed the duplicates
print("Clean Android Dataset Apps:", len(android_clean))

Clean Android Dataset Apps: 9659


## Removing Non-English Apps 

We should now take out the rows whose app names are not in English, because our target market is free Android and iOS apps for English-speaking users only.

In [9]:
# check whether an app name is made up of common english characters
# returns False if the app name has at least 3 non-english characters
def is_english(app_name):
    non_english_chars = 0
    for char in app_name:
        # if the number associated with the character is > 127 then it is
        # not a member of the set of common english characters
        if ord(char) > 127:
            non_english_chars += 1
    
    if non_english_chars >= 3:
        return False
    else:
        return True

Now that we have a function in place to detect app entries with a non-English name, we can clean up the android dataset even further.

In [10]:
# go through the Google Play list and remove non english apps
android_data = []
for app in android_clean:
    app_name = app[0]
    
    if is_english(app_name):
        android_data.append(app)
    
# display the new number of rows for the cleaner android dataset
print("Unique English Android Apps:", len(android_data))

Unique English Android Apps: 9597


<b>Comment:</b> From the above, we have removed 62 entries because they were non-English apps. Now we should filter the list even further to find the apps that are free (our target market).

## Isolating the Free Apps

This process is very similar to the one above, but here we focus on the app's price: we collect the android apps that cost nothing into a new list. The 7th column of the android dataset records whether the app is free or paid, and we can use this as the filter.
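The filter described here can also be sketched as a list comprehension. The helper name `free_apps` and the seven-column toy rows are hypothetical; only the position of the Type column (index 6) comes from the dataset header.

```python
def free_apps(rows, type_idx=6):
    """Return the rows whose Type column equals 'Free'."""
    return [row for row in rows if row[type_idx] == 'Free']

# Toy rows following the first seven Google Play columns:
# App, Category, Rating, Reviews, Size, Installs, Type
sample = [
    ['App A', 'TOOLS', '4.2', '100', '4.1M', '10,000+', 'Free'],
    ['App B', 'GAME', '4.0', '50', '20M', '1,000+', 'Paid'],
    ['App C', 'SOCIAL', '4.5', '900', '15M', '100,000+', 'Free'],
]

free_sample = free_apps(sample)
print("Number of free apps:", len(free_sample))
```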

In [11]:
# store free android apps in new list
free_android_apps = []

for app in android_data:
    app_type = app[6] # either Free or Paid
    
    if app_type == "Free":
        free_android_apps.append(app)
    
# display the number of free apps
print("Number of free android apps:", len(free_android_apps))

Number of free android apps: 8847


## Data Analysis 

Our goal was to determine the kind of free apps that will attract more users as our revenue model relies on in-app advertisements. So, in order to minimise future risks and overhead, we can create a validation strategy by doing the following:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Here, we want to find apps that are successful on both app markets. We must therefore get an idea of the genres that attract the most users, which means building frequency tables. Good candidate columns are Genres and Category in the Google Play dataset and prime_genre in the App Store dataset.

We'll build two functions we can use to analyze the frequency tables:

* One function to generate frequency tables that show percentages
* Another function that we can use to display the percentages in a descending order
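As a rough sketch of these two helpers, the standard library's `collections.Counter` can do the counting. The names `freq_table_pct` and `sorted_table` and the toy rows are assumptions for illustration, not the notebook's own functions.

```python
from collections import Counter

def freq_table_pct(dataset, index):
    """Map each value in the given column to its share of all rows."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {key: count / total for key, count in counts.items()}

def sorted_table(dataset, index):
    """Return (value, share) pairs sorted by share, largest first."""
    table = freq_table_pct(dataset, index)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

# Toy rows: app name, genre.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Education'], ['d', 'Games']]

for genre, share in sorted_table(sample, 1):
    print(genre, ':', share)
```

Sorting the dictionary items with a `key` function avoids building a separate list of `(value, key)` tuples just to sort them.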

In [14]:
# function to generate freq_table
def freq_table(dataset, index):
    ftable = {}
    
    for app in dataset:
        col = app[index]
        if col in ftable:
            ftable[col] += 1
        else:
            ftable[col] = 1
            
    for key in ftable:
        ftable[key] = ftable[key] / len(dataset)
        
    return ftable

# display frequency table
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

Now, we should display the frequency tables of the columns prime_genre (App Store dataset), Genres and Category (Google Play dataset) to get an idea of the most popular categories.

In [17]:
# prime_genre frequency table
display_table(ios_data, 11)

Games : 0.5366124774211477
Entertainment : 0.07433652910935112
Education : 0.06294289287203002
Photo & Video : 0.04849242740030568
Utilities : 0.034458802278727246
Health & Fitness : 0.02501042100875365
Productivity : 0.024732527441989716
Social Networking : 0.023204112824788105
Lifestyle : 0.020008336807002917
Music : 0.01917465610671113
Shopping : 0.016951507572599694
Sports : 0.015839933305543976
Book : 0.015562039738780047
Finance : 0.014450465471724329
Travel : 0.011254689453939141
News : 0.010421008753647354
Weather : 0.010004168403501459
Reference : 0.008892594136445742
Food & Drink : 0.008753647353063776
Business : 0.007919966652771988
Navigation : 0.0063915520355703765
Medical : 0.0031957760177851883
Catalogs : 0.0013894678338196471


<b>Comment:</b> From the App Store dataset, we can see that the most common genre is Games, followed by Entertainment. The most common genres seem to cater to younger audiences, while the less common ones, such as Weather or Business, are more likely to be downloaded by older audiences.

We can also see that the most common genres are geared towards entertainment (e.g. Games, Entertainment and Social Networking) rather than practical use (Food & Drink, Medical or Finance).

Thus, the recommended focus for the App Store would be a game or another app that entertains users rather than one that makes them more productive. However, although Games is the clear winner by share of apps, that also means an abundance of competitors, so it may be better to focus on the next few genres.

In [20]:
# Genres frequency table
display_table(android_data, 9)

Tools : 0.08596436386370741
Entertainment : 0.058038970511618215
Education : 0.05241221214963009
Business : 0.04365947691987079
Medical : 0.04115869542565385
Personalization : 0.03907471084713973
Productivity : 0.03886631238928832
Lifestyle : 0.03761592164217985
Finance : 0.035948733979368555
Sports : 0.03438574554548297
Communication : 0.03261435865374596
Action : 0.031051370219860375
Health & Fitness : 0.030009377930603313
Photography : 0.029175784099197667
News & Magazines : 0.025945608002500783
Social : 0.02490361571324372
Travel & Local : 0.0227154319058039
Books & Reference : 0.02261123267687819
Shopping : 0.020944045014066895
Simulation : 0.01979785349588413
Arcade : 0.019068458893404187
Dating : 0.01771386891737001
Casual : 0.017192872772741483
Video Players & Editors : 0.016776075857038657
Maps & Navigation : 0.01333750130249036
Puzzle : 0.012399708242159009
Food & Drink : 0.011670313639679067
Role Playing : 0.010836719808273419
Strategy : 0.009794727519016359
Racing : 0.00948

<b>Comment:</b> The most common value in the Genres column of the Google Play dataset is Tools, followed by Entertainment. There are also a lot more genres than in the App Store dataset, perhaps because this column is more specific whereas the App Store categorisation is broader.

The picture is close to the opposite of the App Store: practical and productive apps appear to be better represented than entertainment apps.

In [21]:
# Category frequency table
display_table(android_data, 1)

FAMILY : 0.19360216734396166
GAME : 0.0979472751901636
TOOLS : 0.08606856309263311
BUSINESS : 0.04365947691987079
MEDICAL : 0.04115869542565385
PERSONALIZATION : 0.03907471084713973
PRODUCTIVITY : 0.03886631238928832
LIFESTYLE : 0.03772012087110555
FINANCE : 0.035948733979368555
SPORTS : 0.03376055017192873
COMMUNICATION : 0.03261435865374596
HEALTH_AND_FITNESS : 0.030009377930603313
PHOTOGRAPHY : 0.029175784099197667
NEWS_AND_MAGAZINES : 0.025945608002500783
SOCIAL : 0.02490361571324372
TRAVEL_AND_LOCAL : 0.022819631134729602
BOOKS_AND_REFERENCE : 0.02261123267687819
SHOPPING : 0.020944045014066895
DATING : 0.01771386891737001
VIDEO_PLAYERS : 0.01698447431489007
MAPS_AND_NAVIGATION : 0.01333750130249036
FOOD_AND_DRINK : 0.011670313639679067
EDUCATION : 0.01104511826612483
ENTERTAINMENT : 0.009065332916536417
LIBRARIES_AND_DEMO : 0.0087527352297593
AUTO_AND_VEHICLES : 0.0087527352297593
WEATHER : 0.008127539856205065
HOUSE_AND_HOME : 0.007398145253725123
EVENTS : 0.00666875065124518
PA

<b>Comment:</b> The most common Category in the Google Play dataset is FAMILY, followed by GAME. The Genres and Category columns show different rankings, but that may be because the genres are more specific whereas the categories are broader.

<b>Conclusion:</b> Therefore, the best apps to develop for the App Store would be apps that are considered fun, such as games or social apps. On the other hand, the Google Play Store's most common genres are more balanced between productivity and entertainment apps. This gives a rough idea of the types of apps that we should develop to maximise our profits and customer base.

## Most Popular Apps by Genre on the App Store

To quantitatively show which genre is the most successful (has the most users), we must calculate the average number of installs for each app genre. 

For the Google Play data set, we can find this information in the Installs column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

Let's start with calculating the average number of user ratings per app genre on the App Store. To do that, we'll need to:

- Isolate the apps of each genre.
- Sum up the user ratings for the apps of that genre.
- Divide the sum by the number of apps belonging to that genre (not by the total number of apps). 
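The three steps above can be sketched in a single pass over the data by accumulating a running sum and a count per genre. The helper name `average_by_group` and the toy rows are hypothetical.

```python
from collections import defaultdict

def average_by_group(dataset, group_idx, value_idx):
    """Average the numeric column value_idx within each value of group_idx."""
    totals = defaultdict(float)  # sum of values per group
    counts = defaultdict(int)    # number of rows per group
    for row in dataset:
        group = row[group_idx]
        totals[group] += float(row[value_idx])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows: genre, total number of user ratings.
sample = [
    ['Social Networking', '100'],
    ['Social Networking', '300'],
    ['Games', '50'],
]

averages = average_by_group(sample, 0, 1)
best_genre = max(averages, key=averages.get)
print(best_genre, averages[best_genre])
```

Each sum is divided by the count for its own genre, not by the total number of rows, which matches the last step in the list.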

In [32]:
# generate freq table for the prime_genre column to get the unique app genres
freq_genre = freq_table(ios_data, 11)

# display the average number of user ratings per app genre for the App Store
best_genre = ""
best_avg_number = 0
for genre in freq_genre.keys():
    total = 0 # store the sum of the user ratings
    len_genre = 0 # store the number of apps specific to that genre
    
    for app in ios_data:
        genre_app = app[11]
        
        if genre == genre_app:
            n_user_ratings = float(app[5])
            total += n_user_ratings
            len_genre += 1
    
    avg_user_ratings = total / len_genre
    
    # check if this genre has the best average number of ratings
    if avg_user_ratings > best_avg_number:
        best_genre = genre
        best_avg_number = avg_user_ratings

# display the app genre with the highest average number of user ratings
print("Genre:", best_genre, "| Average Number of User Ratings:", best_avg_number)

Genre: Social Networking | Average Number of User Ratings: 45498.89820359281


<b>Conclusion:</b> From the above, we can see that the genre with the highest average number of user ratings is Social Networking. However, this sector is heavily skewed towards social behemoths such as YouTube, Instagram, Facebook and Twitter. We would therefore need to target a smaller niche within the social networking genre to gain a slice of the market share, but if we do, we should gain a decent following.

## Most Popular Apps by Genre on Google Play

Now that we have made an app genre recommendation for the App Store, we can turn to Google Play. We have data about the number of installs for the Google Play market, so we should be able to get a clearer picture of genre popularity. However, the install numbers don't seem precise enough: most values are open-ended (100+, 1,000+, 5,000+, etc.).

For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes. We only want to find out which app genres attract the most users, and we don't need perfect precision with respect to the number of users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.
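The conversion this implies can be sketched as a small helper that strips the plus sign and the thousands separators. The function name `parse_installs` is hypothetical.

```python
def parse_installs(raw):
    """Convert an Installs string such as '1,000,000+' to its lower bound as an int."""
    return int(raw.replace('+', '').replace(',', ''))

for raw in ['100+', '1,000+', '1,000,000+']:
    print(raw, '->', parse_installs(raw))
```

Returning the lower bound means every average built on these values is itself a lower bound on the true figure.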

In [48]:
# generate a frequency table for the Category column in the Google Play dataset
freq_category = freq_table(android_data, 1)

# loop over the unique categories and get the average number of installs per category
best_category = ""
best_avg = 0
for category in freq_category.keys():
    total = 0
    len_category = 0
    
    for app in android_data:
        category_app = app[1]
        
        if category == category_app:
            # remove non numeric chars from the installs value
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            n_installs = float(n_installs)
            
            total += n_installs
            len_category += 1

    avg_number_installs = total / len_category
         
    # set this category as the best if the average number of installs is the max
    if avg_number_installs > best_avg:
        best_category = category
        best_avg = avg_number_installs   
        
print("Category:", best_category, "||| Average Number of Installs:", int(best_avg))

Category: COMMUNICATION ||| Average Number of Installs: 35266026


<b>Conclusion:</b> So, we can see that the best category in which to develop an app on the Google Play store is communication. It has on average over 35 million installs per app, which makes sense considering that we are using our phones more than ever to communicate. A possible reason for this may be that most of the installs come from only a few apps such as Facebook or I