# Profitable App Profiles for Apple Store and Google Play Markets

This case study analyzes data on free Android and iOS mobile apps. Free apps generate revenue through in-app ads: the more users an app has, the more people see and engage with ads, and the higher the revenue generated.

The objective is to help developers understand what types of apps are likely to attract more users.

## Explore the datasets

We have two sources of data: one from the Google Play store and the other from the Apple Store. We can open these files and store their contents as lists.

In [None]:
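from csv import reader

# Read each CSV into a list of rows; the with-block closes the file afterwards
with open('/Users/shubzroy/Documents/GitHub/Analytics/Datasets/AppleStore.csv', encoding='utf8') as f:
    apple_ds = list(reader(f))
apple_ds_header = apple_ds[0]  # first row holds the column names
apple_ds = apple_ds[1:]

with open('/Users/shubzroy/Documents/GitHub/Analytics/Datasets/googleplaystore.csv', encoding='utf8') as f:
    google_ds = list(reader(f))
google_ds_header = google_ds[0]
google_ds = google_ds[1:]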
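
Since both files are loaded with the same steps, the pattern could be wrapped in a small helper. This is only a minimal sketch: `load_csv` is a hypothetical name that does not appear in the notebook, and the UTF-8 default is an assumption about how the files are encoded.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file. The encoding default is an assumption."""
    # The context manager closes the file even if parsing fails
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    # First row is the header, the rest are data rows
    return rows[0], rows[1:]
```

With this helper, each dataset would be loaded with a single call of the form `header, rows = load_csv(path)`.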

The following function will allow us to slice the data and explore the contents of each list.

In [None]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [None]:
explore_data(google_ds, 0, 5, rows_and_columns = True)

In [None]:
explore_data(apple_ds, 0, 4, rows_and_columns = False)

In [None]:
apple_ds_header

In [None]:
google_ds_header

Based on the column headers, we can use the Category, Genres, Rating, Number of Reviews, and Installs columns to determine popularity by app type. Content Rating can give further insight into the user population in the case of targeted advertising.

# Data Cleansing


## Removing Duplicates
Before analyzing a dataset, we need to ensure the data is clean, with no (or minimal) discrepancies, so that the analysis is accurate and the possibility of errors is reduced. This includes:
- removing duplicate records
- removing or correcting inaccurate data
- ensuring only relevant data is retained (e.g. removing non-English apps)

Check the data for rows that don't have the same number of columns as the header, and delete any such rows. In some cases, it might be useful to retain the data point if the missing values can be estimated based on sampling.

In [None]:
for index, row in enumerate(google_ds):
    # Compare each row's column count against the header's
    if len(row) != len(google_ds_header):
        print(row)
        print(index)

print(google_ds[10472])

In [None]:
del google_ds[10472]
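
The column-count check above can be generalized so that no row index has to be hardcoded. The following is a minimal sketch on made-up toy data; `find_malformed_rows` is a hypothetical helper name, not something from the original notebook.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]

# Hypothetical toy data: the second row is missing a column
header = ['App', 'Category', 'Rating']
rows = [['A', 'TOOLS', '4.1'], ['B', '1.9'], ['C', 'GAME', '4.5']]

bad = find_malformed_rows(header, rows)
print(bad)

# Delete from the end so that earlier indices stay valid
for i, _ in reversed(bad):
    del rows[i]
```

Deleting in reverse order matters when more than one row is malformed, because removing a row shifts the index of every row after it.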

Check the data for duplicate records. Some apps are repeated over several rows. The loops below print every row matching a given app name ('Box' in the Google data; 'Instagram', 'Mannequin Challenge', and 'VR Roller Coaster' in the Apple data) to check for multiple rows for the same app.

In [None]:
for app in google_ds:
    name = app[0]
    if name == 'Box':
        print(app)

In [None]:
for app in apple_ds:
    name = app[1]
    if name == 'Instagram':
        print(app)

for app in apple_ds:
    name = app[1]
    if name == 'Mannequin Challenge':
        print(app)
        
for app in apple_ds:
    name = app[1]
    if name == 'VR Roller Coaster':
        print(app)

We can expand on the above loops to go through the whole dataset and find all instances of duplicated data.

In [None]:
duplicate_apps = []
unique_apps = []

for app in google_ds:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:10])
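
The duplicate check can also be done in a single counting pass. Testing membership with `in` on a list scans the list each time, whereas `collections.Counter` tallies every name once. This is a sketch on hypothetical toy rows, assuming the app name sits in column 0 as in the Google data.

```python
from collections import Counter

# Hypothetical toy rows; the app name is in column 0
sample = [['Box'], ['Slack'], ['Box'], ['Instagram'], ['Instagram'], ['Instagram']]

name_counts = Counter(row[0] for row in sample)

# For every name that appears more than once, the number of extra rows
extra_rows = {name: n - 1 for name, n in name_counts.items() if n > 1}

print('Number of duplicate rows:', sum(extra_rows.values()))
print('Duplicated names:', sorted(extra_rows))
```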

In [None]:
# Check Apple data for duplicates

duplicate_apps_apple = []
unique_apps_apple = []

for app in apple_ds:
    app_id = app[0]  # the id column identifies each app
    if app_id in unique_apps_apple:
        duplicate_apps_apple.append(app_id)
    else:
        unique_apps_apple.append(app_id)
        
print('Number of duplicate apps:', len(duplicate_apps_apple))
print('\n')
print('Examples of duplicate apps:', duplicate_apps_apple[:10])

Given the ID field in the Apple Data is unique, we'll assume there are no duplicates in the data.
Note: further down the line, we may want to double check with the name field as well.

We can use several different criteria to remove duplicate data:
- For cases where there are multiple rows with no difference in values, we simply retain just one of the records
- We can use the updated date to obtain the latest values for an app
- We can use the number of reviews to retain the row with the highest number of reviews

In this example we shall only retain the entries that have the highest number of reviews. We first create a dictionary with the name of the app and the highest number of reviews of the corresponding app. Then we'll use this dictionary to create a new data set to retain just the entries in the source data with the highest number of reviews. 

In [None]:
reviews_max = {}
for app in google_ds:
    name = app[0]
    n_reviews = float(app[3])    
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    elif (name not in reviews_max):
        reviews_max[name] = n_reviews

The length of the dictionary can be used to check the number of unique apps versus the expected result.

In [None]:
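# Expected length: total rows minus the duplicate rows counted earlier
print('Expected length:', len(google_ds) - len(duplicate_apps))
print('Actual length:', len(reviews_max))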

Now we loop through the Play Store data and create a clean data set that keeps, for each app, only the entry with the highest number of reviews, as per the dictionary. We also track the apps already added so that if multiple entries share the same number of reviews, we add the app only once.

In [None]:
already_added = []
google_clean = []

for app in google_ds:
    name = app[0]
    n_reviews = float(app[3])
    
    if (name not in already_added) and (n_reviews == reviews_max[name]):
        already_added.append(name)
        google_clean.append(app)

# explore the google_clean dataset to ensure the duplicates were removed
print(len(google_clean))
explore_data(google_clean, 0, 4, True)
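
The two passes above (building the dictionary, then filtering) can be folded into one function. This is a minimal sketch under stated assumptions: `dedupe_by_max_reviews` is a hypothetical name, the toy rows are made up, and the column positions mirror the Google data (name in column 0, reviews in column 3).

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_idx]
        # Strict '>' keeps the earlier row when review counts tie
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Hypothetical toy rows laid out like the Google data
sample = [
    ['Box', 'BUSINESS', '4.2', '100'],
    ['Box', 'BUSINESS', '4.2', '250'],
    ['Slack', 'BUSINESS', '4.4', '80'],
    ['Box', 'BUSINESS', '4.2', '250'],
]
clean = dedupe_by_max_reviews(sample)
print(clean)
```

One difference from the two-pass version: the result here is ordered by the first appearance of each app name rather than by the position of the retained row.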

Note: The Apple Store data does not have duplicated data. This can be confirmed by following the above series of steps.

## Removing non-English characters

The analysis involves apps directed at English-speaking audiences only. The ord function takes a character and returns its Unicode code point. Code points in the range 0 to 127 are the ASCII characters, which cover the letters, digits, and punctuation commonly used in English text.

In [None]:
def eng_check(word):
    check = 0
    for a in word:
        if ord(a) > 127:
            check+=1
            
    if check > 3:  # adding a threshold for emojis and other characters
        return False
    else:
        return True
    
# test the function
# eng_check('爱奇艺PPS -《欢乐颂2》电视剧热播')
# eng_check('Instagram')
# eng_check('Instachat 😜')
eng_check('Docs To Go™ Free Office Suite')
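
To make the threshold in `eng_check` more concrete, it can help to look at the code points that `ord` actually returns. The sketch below uses a hypothetical helper, `non_ascii_code_points`, to list the characters of a name that fall outside the ASCII range.

```python
def non_ascii_code_points(text):
    """Code points of the characters in `text` that fall outside ASCII (0-127)."""
    return [ord(ch) for ch in text if ord(ch) > 127]

print(ord('a'), ord('Z'))                        # 97 90, both within ASCII
print(non_ascii_code_points('Docs To Go™'))      # [8482], the trademark sign
print(non_ascii_code_points('Instachat 😜'))     # [128540], a single emoji
print(len(non_ascii_code_points('爱奇艺PPS')))    # 3
```

A name with one trademark sign or emoji stays under the threshold of three non-ASCII characters, while a name written mostly in another script exceeds it quickly.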

We can use the above function to loop through the dataset and append only the apps with English names to a new list.

In [None]:
google_english = []
google_nonenglish = []
apple_english = []
apple_nonenglish = []

for app in google_clean:
    name = app[0]
    if eng_check(name):
        google_english.append(app)
    else:
        google_nonenglish.append(app)

for app in apple_ds:
    name = app[1]
    if eng_check(name):
        apple_english.append(app)
    else:
        apple_nonenglish.append(app)

In [None]:
# Check the lengths of each dataset - should add up to original 9659 apps
print(len(google_english))
print(len(google_nonenglish))
print(len(apple_english))
print(len(apple_nonenglish))

In [None]:
explore_data(google_english, 0, 5, True)

In [None]:
explore_data(apple_english, 0, 5, True)

## Removing paid apps

Our analysis needs to include only free apps. We will loop through the data set once again and retain only the free apps targeted at English-speaking customers.

In [None]:
google_eng_free = []
google_eng_paid = []
apple_eng_free = []
apple_eng_paid = []

for app in google_english:
    price = app[7]
    if price == '0':
        google_eng_free.append(app)
    else:
        google_eng_paid.append(app)

for app in apple_english:
    price = app[4]
    if price == '0.0':
        apple_eng_free.append(app)
    else:
        apple_eng_paid.append(app)
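
The comparisons above rely on the exact strings '0' and '0.0'. A numeric comparison is less sensitive to formatting. This is a sketch only: `is_free` is a hypothetical helper, and it assumes prices are plain numbers with an optional leading '$', which should be verified against the actual data.

```python
def is_free(price):
    """True when a price string such as '0', '0.0' or '$0.00' represents zero."""
    # Assumption: prices are plain numbers with an optional leading '$'
    return float(price.strip().lstrip('$')) == 0.0

for p in ['0', '0.0', '$4.99', '2.99']:
    print(p, is_free(p))
```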

In [None]:
# Check the lengths of each dataset - should add up to original 9614 english apps
print(len(google_eng_free))
print(len(google_eng_paid))
print(len(apple_eng_free))
print(len(apple_eng_paid))

In [None]:
explore_data(apple_eng_free, 0, 5, True)

In [None]:
explore_data(google_eng_free, 0, 5, True)

# Data Analysis

## Problem Statement

The objective of the analysis is to determine what kinds of apps are likely to attract more users, since ad revenue from free apps depends heavily on the number of users. Additionally, the validation strategy for an app idea is:

1. Release a minimal version of the app on the Google Play store
2. If the app gets a good response from users, develop it further
3. If the app is profitable after 6 months, build an iOS version and release it on the App Store

Given this strategy, it becomes important to understand what apps have been popular across both markets.

## Most common apps by Genre

We're going to create a couple of functions to help us build frequency tables using the Google and Apple datasets.

In [None]:
# This function creates a frequency table using a given dataset, and a column of interest
def freq_table(dataset, index):
    freq = {}
    total = 0
    for app in dataset:
        total += 1
        value = app[index]
        if value in freq:
            freq[value] += 1
        else:
            freq[value] = 1
    
    freq_pct = {}
    for app in freq:
        freq_pct[app] = round((freq[app]/total)*100, 2)
    
    return freq_pct
    
# This function 1) takes a dataset and a column index as input
# 2) builds a frequency table using the freq_table function
# 3) displays the table in descending order of frequency
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
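
For comparison, the same percentage table can be built with `collections.Counter`, whose `most_common()` method already returns entries sorted by count. This is a sketch on hypothetical toy rows; `freq_table_pct` is an illustrative name, not part of the notebook.

```python
from collections import Counter

def freq_table_pct(dataset, index):
    """Percentage frequency of each value in column `index`, most common first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, round(n / total * 100, 2)) for value, n in counts.most_common()]

# Hypothetical toy rows with a genre in column 0
sample = [['Games'], ['Games'], ['Games'], ['Music']]
for value, pct in freq_table_pct(sample, 0):
    print(value, ':', pct)
```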

Let's run the above function on the Apple dataset for free English apps.

In [None]:
display_table(apple_eng_free, 11) # Prime Genre (Apple)

We can see that Games are the most common free English apps available from the Apple Store, accounting for 58% of the population. The second-most common genre is Entertainment. Most apps in this population are designed for entertainment purposes: Games, Entertainment, Photo & Video, Social Networking, Sports, and Music cover ~80% of the population. Apps designed for practical purposes (for example, Lifestyle, Shopping, Weather) account for a much smaller portion of the available apps, often less than 2-3% each.

Note that these are just the apps available in the store; the figures do not show whether these genres have a large number of users. This alone is not enough information to determine the app profile recommendation for the App Store market. We would have to combine this data with the number of users and the average user rating for better results.

Let's look at the Category and Genre details for the Google dataset next.

In [None]:
display_table(google_eng_free, 1) # Category (Google)

We can see that the largest category in the Google Play store is Family, at ~19% of apps. As per the Google Play store, the Family category also includes Games. [Note: this is no longer the case. The category "Family" no longer exists in the Google Play Store.]

Even so, the remaining categories have a decent representation compared to the Apple Store data. Apps designed for fun account for ~30% of the population. The landscape is far more balanced. 

In [None]:
display_table(google_eng_free, 9) # Genre (Google)

This is further confirmed by the Genre data. The Genre information appears to be a more granular look at the apps compared to the Category data, with a portion of the apps having multiple genres (for example, "Lifestyle; Education" or "Health & Fitness ; Action & Adventure").
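
Because some entries combine several genres, one option is to split them and count each genre on its own. This is a minimal sketch, assuming the genres are separated by a semicolon; `genre_counts` is a hypothetical helper and the sample values are illustrative.

```python
from collections import Counter

def genre_counts(genre_strings, sep=';'):
    """Count each individual genre, splitting combined entries on `sep`."""
    counts = Counter()
    for entry in genre_strings:
        # strip() tolerates spaces around the separator
        counts.update(part.strip() for part in entry.split(sep) if part.strip())
    return counts

# Hypothetical toy values in the style of the Genres column
sample = ['Lifestyle;Education', 'Education', 'Health & Fitness ; Action & Adventure']
print(genre_counts(sample).most_common())
```

Counting this way means an app with two genres contributes to both, so the counts can add up to more than the number of apps.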

Again, this is not enough information to decide on a recommendation. It merely tells us which apps are currently available in the store, not how many users each genre has.

## Category of Apps with Most Users

To determine which category of apps are the most popular among users, we'd want to find the average number of installs per category. For this we have the 'Installs' column in the Google dataset. We do not have a similar datapoint in the Apple dataset. We could use the 'rating_count_tot' datapoint as a workaround. This is the total number of user ratings, and we'll be making the assumption that the users that installed the app rated the app as well.

Let's start with the Apple Store data, and generate a unique list of genres.

In [None]:
apple_freq_table = freq_table(apple_eng_free, 11)

for genre in apple_freq_table:
    total = 0
    len_genre = 0
    for app in apple_eng_free:
        genre_app = app[11]
        if genre_app == genre:
            ratings_ct = float(app[5])
            total += ratings_ct
            len_genre += 1
    avg_num_ratings = round(total / len_genre, 2)
    print(genre, ': ', avg_num_ratings)    
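
The nested loop above rescans the whole dataset once per genre. The same averages can be accumulated in a single pass. This is a sketch with a hypothetical helper name (`average_by_group`) and made-up toy rows.

```python
from collections import defaultdict

def average_by_group(dataset, group_idx, value_idx):
    """Average of column `value_idx` for each distinct value of column `group_idx`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        totals[row[group_idx]] += float(row[value_idx])
        counts[row[group_idx]] += 1
    return {group: round(totals[group] / counts[group], 2) for group in totals}

# Hypothetical toy rows: genre in column 0, rating count in column 1
sample = [['Games', '100'], ['Games', '300'], ['Music', '50']]
print(average_by_group(sample, 0, 1))
```

For the Apple data, the equivalent call would be `average_by_group(apple_eng_free, 11, 5)`, using the same column indices as the loop above.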

Navigation (86k) > Reference (75k) > Social Networking (71k) > Music (57k) > Weather (52k) > Book (39k) > Food & Drink (33k) > Finance (31k)

Free Navigation apps have the highest average number of user ratings, followed by Reference and Social Networking apps. Navigation and Social Networking already have key players, and a new app would need a major differentiator to compete.

In [None]:
for app in apple_eng_free:
    genre = app[11]
    if genre == 'Navigation':
        print(app[1], ': ', app[5])

In [None]:
for app in apple_eng_free:
    genre = app[11]
    if genre == 'Social Networking':
        print(app[1], ': ', app[5])

As we can see from above, 2 Navigation apps (Waze and Google Maps) currently dominate the user base, and it would be a hard market to enter. There are several key players already in the Social Networking space (Facebook and Pinterest being the key leaders).

In [None]:
for app in apple_eng_free:
    genre = app[11]
    if genre == 'Reference':
        print(app[1], ': ', app[5])

In [None]:
for app in apple_eng_free:
    genre = app[11]
    if genre == 'Finance':
        print(app[1], ': ', app[5])

Looking into the Reference apps, we see that these are largely books or manuals converted to an app format. The Finance category is also one of the Top 10 categories. Specifically, a Reference app targeting Finance / Personal Budgeting could be an option to pursue.

We can next review the Google dataset. We see that we have the "installs" column in the dataset. However, this is a text field, and provides just ranges for the number of downloads. 

In [None]:
freq_table(google_eng_free, 5)

For example, an app could have "100,000+" downloads, with no specificity. This is fine since this still gives us an idea of the most popular categories based on the number. We first need to remove the "+" and "," to be able to estimate the average.

In [None]:
# Create a distinct list of categories in the Google dataset
google_freq_table = freq_table(google_eng_free, 1)

for category in google_freq_table:
    total = 0
    len_category = 0
    for app in google_eng_free:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1
    avg_installs = round(total / len_category, 2)
    print(category, ': ', avg_installs)
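
The string clean-up is repeated in several cells, so it could be factored into one small function. A minimal sketch, where `parse_installs` is a hypothetical name:

```python
def parse_installs(text):
    """Convert an installs label such as '100,000+' to the integer 100000."""
    # The result is a lower bound: '100,000+' means at least 100,000 installs
    return int(text.replace(',', '').replace('+', ''))

for label in ['0', '10+', '100,000+', '1,000,000,000+']:
    print(label, '->', parse_installs(label))
```

Because each label is only a lower bound, averages computed from these values are rough estimates rather than exact install counts.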

The Communication and Video Players categories have the highest average number of installs but, similar to the Apple Store, are likely to be dominated by a handful of players.

In [None]:
for app in google_eng_free:
    category = app[1]
    n_installs = app[5]
    n_installs = n_installs.replace('+', '')
    n_installs = float(n_installs.replace(',', ''))
    if category == 'COMMUNICATION' and (n_installs >= 500000000):
        print(app[0], ': ', app[5])

In [None]:
for app in google_eng_free:
    category = app[1]
    n_installs = app[5]
    n_installs = n_installs.replace('+', '')
    n_installs = float(n_installs.replace(',', ''))
    if category == 'VIDEO_PLAYERS' and (n_installs >= 500000000):
        print(app[0], ': ', app[5])

Based on the above, 11 apps and 3 apps dominate the Communication and Video Players categories, respectively, each with more than half a billion installs. These categories would not be good recommendations, as it would be extremely difficult to enter these markets.
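
One way to gauge how much a few very large apps skew a category average is to recompute the average without them. The sketch below uses made-up install counts and a hypothetical helper, `average_installs`; the cutoff of 500 million mirrors the threshold used in the cells above.

```python
def average_installs(installs, cap=None):
    """Mean of install counts, optionally ignoring apps at or above `cap`."""
    kept = [n for n in installs if cap is None or n < cap]
    return sum(kept) / len(kept) if kept else 0.0

# Hypothetical install counts for one category: one giant and three smaller apps
sample = [1_000_000_000, 1_000_000, 500_000, 100_000]

print(average_installs(sample))                   # average including the giant
print(average_installs(sample, cap=500_000_000))  # average without it
```

A large gap between the two figures suggests that the category's average is driven by a handful of apps rather than by typical ones.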

If we compare to the Apple Store data, Google Play appears to categorize the "Book" and "Reference" categories into one.

In [None]:
for app in google_eng_free:
    category = app[1]
    n_installs = app[5]
    n_installs = n_installs.replace('+', '')
    n_installs = float(n_installs.replace(',', ''))
    if category == 'BOOKS_AND_REFERENCE' and (n_installs >= 1000000):
        print(app[0], ': ', app[5])

Again, similar to the Apple Store data, we see that the books and reference apps are essentially large books that are in the format of an app. These include religious books, and dictionaries.

# Conclusion

Our objective was to analyze the app data from both the Google Play Store and the Apple App Store to provide a recommendation for a free app that could be a source of significant ad revenue. Through the process, we identified which categories of apps are the most downloaded on average.

On further investigation, we noted that the Books and Reference category offers the most options and flexibility. Essentially, a large book that is referred to at a regular cadence (e.g. tax or accounting information) could be built into an app format with search options. Additional features could include a discussion forum and per-user highlighting or storing of quotes and excerpts.