# Analyzing Profitable Apps in the App Store and Google Play Store

## Introduction

The main objective of this project is to identify which kinds of mobile apps are profitable in the two most popular app markets (the App Store and the Google Play Store). For context, we are working as data analysts for a company that develops iOS and Android apps, and our task is to analyze the data to find out which apps are worth building and which are not.

All of the apps the company develops are free to install, so revenue comes from in-app ads and in-app purchases. Under this model, revenue depends mostly on the number of users: the more people use an app, the more ad views and purchases it can generate. We therefore need to analyze the data to understand which types of apps attract the most downloads.

## Opening and Exploring the Data

In [2]:
from csv import reader

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
    
    # display the number of rows and columns in the dataset
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
# open the two datasets (AppleStore.csv & googleplaystore.csv)
# assumption: both files are UTF-8 encoded; the with-blocks close each file after reading
with open("AppleStore.csv", encoding="utf8") as ios_file:
    ios_content = list(reader(ios_file))
ios_header = ios_content[0]
ios_data = ios_content[1:]

with open("googleplaystore.csv", encoding="utf8") as android_file:
    android_content = list(reader(android_file))
android_header = android_content[0]
android_data = android_content[1:]

# display the header and the first three rows of the iOS dataset
print("App Store - First 3 Rows")
explore_data(ios_content, 0, 4, True)
print("")

App Store - First 3 Rows
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7198
Number of columns: 16



<b>Comment:</b> A few columns in the App Store dataset look useful for our objective: the rating columns (such as `rating_count_tot` and `user_rating`), the genre (`prime_genre`) and possibly the currency.

In [3]:
# display the header and the first three rows of the Android dataset
print("Google Play Store - First 3 Rows")
explore_data(android_content, 0, 4, True)

Google Play Store - First 3 Rows
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


<b>Comment:</b> Here, we can see that there are several columns of interest that we can use to solve this problem. A reasonable working assumption is that the genre, rating, type, category and number of installs are all related to the success of an app.
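Since the analysis relies on a handful of named columns, it can help to look up their positions from the header instead of hard-coding indexes such as `app[3]`. The following is a minimal sketch of that idea; `column_index` and `android_header_sample` are hypothetical names introduced here, and the sample header is copied from the output above.

```python
# Hypothetical helper (not part of the original notebook): find a column's
# position by name so later cells do not depend on hard-coded indexes.
def column_index(header, name):
    """Return the position of `name` in `header`, or raise a clear error."""
    try:
        return header.index(name)
    except ValueError:
        raise KeyError(f"column {name!r} not found in header") from None


# Sample header copied from the Google Play output shown above.
android_header_sample = ['App', 'Category', 'Rating', 'Reviews', 'Size',
                         'Installs', 'Type', 'Price', 'Content Rating',
                         'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

reviews_col = column_index(android_header_sample, 'Reviews')
installs_col = column_index(android_header_sample, 'Installs')
print("Reviews column:", reviews_col)
print("Installs column:", installs_col)
```

With the real data, the same call would be made on `android_header` or `ios_header`.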

## Removing Wrong Data

We have been notified that one of the rows in the Google Play dataset contains an error (a wrong rating for entry 10472). Our task is to find this row and remove it, since it could distort the results of our analysis if we kept it.

In [4]:
print("Google Play Dataset Header")
print(android_header)
print("")

print("Entry 10472")
print(android_data[10472])

Google Play Dataset Header
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

Entry 10472
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


<b>Comment:</b> The 2nd column, `Category`, holds the value '1.9' for entry 10472, which is not a category. The row has only 12 values instead of 13: the category is missing, so every later value is shifted one column to the left and the `Rating` column reads '19', which is outside the 1 to 5 rating scale. Because the row is malformed, we remove it entirely.
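The same kind of problem can be searched for across the whole dataset by comparing each row's length with the header's length. This is only a sketch under that assumption; `find_malformed_rows` is a hypothetical helper, and the small sample below stands in for `android_header` and `android_data`.

```python
# Hypothetical helper (not part of the original notebook): report rows whose
# number of values differs from the number of columns in the header.
def find_malformed_rows(header, rows):
    """Return a list of (row_index, row_length) for rows of the wrong length."""
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != expected]


# Tiny illustrative sample: the second row is missing one value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
    ['App C', 'SOCIAL', '4.5'],
]

for index, length in find_malformed_rows(sample_header, sample_rows):
    print("Row", index, "has", length, "values, expected", len(sample_header))
```

A check like this only catches rows with a missing or extra value; a wrong value in a row of the correct length would need a separate validation.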

In [5]:
# remove the row from the Google Play Dataset
del android_data[10472]

# confirm the removal: index 10472 should now hold a different row
# (run this cell only once, otherwise another row is deleted)
print(android_data[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


## Removing Duplicate Entries
Finding and removing duplicates is another data cleaning task we must do. According to discussions about this dataset, the Google Play data contains several duplicate entries, such as the following.

In [6]:
# print the Instagram duplicate entries
for app in android_data:
    app_name = app[0]
    if app_name == "Instagram":
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


<b>Comment:</b> There are 4 Instagram entries in the Google Play dataset. Counting the same app several times would skew the analysis, so we should check whether other apps are duplicated as well. Our criterion for removing duplicates is the number of reviews (4th column): for each app, we keep the row with the highest number of reviews.

In [8]:
# create two lists to store the unique app names and the duplicate app names
unique_apps = []
duplicate_apps = []

for app in android_data:
    app_name = app[0]
    if app_name not in unique_apps:
        unique_apps.append(app_name)
    else:
        duplicate_apps.append(app_name)
        
# show number of unique and duplicate app entries
print("Number of unique android apps:", len(unique_apps))
print("Number of duplicate android apps:", len(duplicate_apps))

Number of unique android apps: 9659
Number of duplicate android apps: 1181


<b>Comment:</b> Here, we can see that there are 1181 entries with duplicate app names. We must remove them and keep, for each app, the entry with the highest number of reviews, since that entry is most likely the most recent one.
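The counting cell above checks membership in a list, which gets slow as the list grows. As an alternative sketch, `collections.Counter` can tally the app names in one pass and also show which apps are repeated most often. The helper name `count_duplicates` and the sample rows are assumptions made for illustration.

```python
from collections import Counter


# Hypothetical helper (not part of the original notebook): count unique app
# names and surplus (duplicate) rows in a single pass.
def count_duplicates(rows, name_index=0):
    """Return (n_unique, n_duplicates, counts) for the names in `rows`."""
    counts = Counter(row[name_index] for row in rows)
    n_unique = len(counts)
    # every row beyond the first one for a given name is a duplicate
    n_duplicates = sum(counts.values()) - n_unique
    return n_unique, n_duplicates, counts


# Tiny illustrative sample standing in for android_data.
sample_rows = [['Instagram'], ['Slack'], ['Instagram'], ['Instagram']]

n_unique, n_duplicates, counts = count_duplicates(sample_rows)
print("Number of unique apps:", n_unique)
print("Number of duplicate entries:", n_duplicates)
print("Most repeated:", counts.most_common(1))
```

Applied to `android_data`, this should reproduce the two totals printed above, since it counts the same thing in a different way.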

Now, we must keep track of each unique app entry and if they have duplicate entries, then we will save the one with the highest number of reviews and remove the rest.

In [13]:
# keep track of the app name and the highest number of reviews for this app
reviews_max = {}
for app in android_data:
    name = app[0]
    n_reviews = int(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

# check if the length of unique keys is equal to the number of unique apps
print("Unique Apps:", len(unique_apps))
print("Unique Dict:", len(reviews_max.keys()))

Unique Apps: 9659
Unique Dict: 9659


<b>Comment:</b> Now, we should make a clean android dataset with only the unique app entries with the highest number of reviews of each app.

In [17]:
# make a cleaned version of the android dataset
android_clean = [] # store cleaned dataset
already_added = [] # store app names

for app in android_data:
    name = app[0]
    n_reviews = int(app[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
        
# display the length of the new android dataset to confirm that we have removed the duplicates
print("Clean Android Dataset Apps:", len(android_clean))

Clean Android Dataset Apps: 9659
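The two passes above (build `reviews_max`, then filter the rows) can also be folded into a single pass that stores the best row seen so far for each app name. The sketch below is one possible variant, not the notebook's method; `dedupe_by_max_reviews` is a hypothetical name, and it assumes the reviews column always converts cleanly with `int`.

```python
# Hypothetical single-pass variant (not part of the original notebook):
# keep, for each app name, the row with the highest number of reviews.
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Return one row per app name, choosing the row with the most reviews."""
    best = {}  # app name -> best row seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = int(row[reviews_index])
        # strict '>' keeps the first row when several rows tie on reviews
        if name not in best or n_reviews > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Tiny illustrative sample standing in for android_data.
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

clean_rows = dedupe_by_max_reviews(sample_rows)
print("Clean sample apps:", len(clean_rows))
```

One difference to keep in mind: the result is ordered by each app's first appearance, whereas `android_clean` is ordered by where the kept row sits in the data. The dictionary lookup also avoids the repeated `name not in already_added` list scans.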


## Removing Non-English Apps 