# Profitable App profiles for the Google Play and App Store markets

This project analyzes the profiles of successful apps in both markets and recommends characteristics that developers could build into their next app to improve its chances of success.

In [82]:
from csv import reader

appStorePath = 'Datasets/AppleStore.csv'
googlePlayPath = 'Datasets/googleplaystore.csv'
# Read each CSV into a list of rows; the `with` block closes the file afterwards
with open(appStorePath, encoding='utf8') as appStoreFile:
    appStoreData = list(reader(appStoreFile))
with open(googlePlayPath, encoding='utf8') as googlePlayFile:
    googlePlayData = list(reader(googlePlayFile))

Add a predefined function to help us explore the datasets. For details on the Google Play dataset, go [here](https://www.kaggle.com/lava18/google-play-store-apps). For details on the App Store dataset, go [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

In [83]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [84]:
explore_data(googlePlayData, 0,1,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Number of rows: 10842
Number of columns: 13


# Data Cleaning

## Removing inaccurate data

First, let's remove the header row from both datasets.

In [85]:
del appStoreData[0]
del googlePlayData[0]

Based on this Kaggle [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), right off the bat we need to delete one entry in the Google Play Dataset.

In [86]:
del googlePlayData[10473]
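
A hardcoded index is sensitive to whether the header row is still present, so it is worth cross-checking before deleting. As a rough sketch, and assuming the malformed entry shows up as a row whose number of fields differs from the header's, a scan like the following would locate it. The helper `find_malformed_rows` and the toy rows are hypothetical and not part of the original notebook.

```python
def find_malformed_rows(rows, expected_len):
    """Return (index, row) pairs whose number of fields differs from expected_len."""
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected_len]

# Toy data standing in for the real dataset (values are illustrative only).
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],  # a field is missing, so the remaining columns are shifted
    ['App C', 'GAME', '4.5'],
]

malformed = find_malformed_rows(rows, len(header))
for index, row in malformed:
    print(index, row)
```

Printing the candidate row before calling `del` makes it easy to confirm that the index points at the entry described in the discussion.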

## Removing duplicate entries

Now let's create a function that searches for duplicate app entries...

In [87]:
def find_duplicate_entries(dataset):
    duplicate_apps = []
    unique_apps = []
    for entry in dataset:
        name = entry[0]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)
            
    return duplicate_apps

...and use it to find duplicate entries in both datasets

In [88]:
appStore_duplicate_apps = find_duplicate_entries(appStoreData)
googlePlay_duplicate_apps = find_duplicate_entries(googlePlayData)
print(appStore_duplicate_apps[:3])
print(googlePlay_duplicate_apps[:3])
print('Google Play dataset has {} repeated entries'.format(len(googlePlay_duplicate_apps)))

[]
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business']
Google Play dataset has 1180 repeated entries


Looks like the Google Play dataset has quite a few repeated entries, while the App Store dataset has none. Let's explore one of these duplicates (Instagram).

In [89]:
for entry in googlePlayData:
    name = entry[0]
    if name == "Instagram":
        print(entry)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


One notable difference between these duplicate entries is the number of reviews. Presumably, the more reviews an entry has, the more recently it was collected. So let's use this as a data cleanup rule: for each duplicated app, keep only the entry with the highest number of reviews. To apply this rule, we first create a helper function that converts the number of reviews into a float.

In [90]:
def convert_to_float(number_as_string):
    if 'M' in number_as_string:
        result = float(number_as_string.split('M')[0])*(10**6)
    else:
        result = float(number_as_string)
    
    return result

Then we create a dictionary `reviews_max` that maps each app name to its maximum number of reviews. Using this dictionary, we iterate through `googlePlayData` and keep, for each name, only the first entry whose number of reviews matches the value stored in `reviews_max`.

In [91]:
reviews_max = {}
for app in googlePlayData:
    name = app[0]
    if (name in reviews_max) and (reviews_max[name] < convert_to_float(app[3])):
        reviews_max[name] = convert_to_float(app[3])
    elif name not in reviews_max:
        reviews_max[name] = convert_to_float(app[3])

googlePlayData_clean = []
already_added = []

for app in googlePlayData:
    name = app[0]
    n_reviews = convert_to_float(app[3])
    if name not in already_added and n_reviews == reviews_max[name]:
        googlePlayData_clean.append(app)
        already_added.append(name)

In [92]:
googlePlayData = googlePlayData_clean #optional cell just to keep the naming/nomenclature clean and consistent
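
The cleanup rule (for each app name, keep the row with the most reviews) can also be sketched as a single pass that remembers the best row seen so far per name. This is an illustrative alternative rather than the notebook's method; `keep_most_reviewed` is an assumed name, the Instagram rows are shortened versions of the ones printed earlier, and the `Box` row contains made-up values.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # name -> row with the most reviews seen so far
    for row in dataset:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],  # illustrative values
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
deduplicated = keep_most_reviewed(sample)
for row in deduplicated:
    print(row)
```

Unlike the two-pass version, the rows come back ordered by the first appearance of each name, which may differ from the order produced by the cell above.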

## Removing non-English apps

In [93]:
def check_if_english(name):
    flag_count = 0
    i = 0
    while (flag_count < 3) and (i <= len(name)-1):
        if ord(name[i]) > 127:
            flag_count += 1
            i += 1
        else:
            i += 1
    if flag_count < 3:
        result = True
    else:
        result = False
        
    return result

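The function above treats a name as English if fewer than three of its characters fall outside the ASCII range (code points above 127), which tolerates the occasional emoji or trademark symbol. It's not foolproof, but it should be fairly effective. Let's use it to screen out non-English apps below.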

In [94]:
googlePlayData_english = []
for entry in googlePlayData:
    if check_if_english(entry[0]):
        googlePlayData_english.append(entry)

In [95]:
appStoreData_english = []
for entry in appStoreData:
    if check_if_english(entry[1]):
        appStoreData_english.append(entry)

Check that some apps were actually screened out: the length of `googlePlayData_english` should be less than or equal to that of `googlePlayData`, and likewise for `appStoreData_english` and `appStoreData`.

In [96]:
print(len(googlePlayData_english))
print(len(googlePlayData))
print(len(appStoreData_english))
print(len(appStoreData))

9598
9660
6155
7197


In [97]:
appStoreData = appStoreData_english
googlePlayData = googlePlayData_english

## Removing non-free apps

We're only interested in free apps, so let's screen out the non-free apps.

In [98]:
print(googlePlayData[:3])

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]


In [99]:
appStoreData_free = []
for entry in appStoreData:
    if entry[4] == '0.0':
        appStoreData_free.append(entry)
        
googlePlayData_free = []
for entry in googlePlayData:
    if entry[6] == 'Free':
        googlePlayData_free.append(entry)

Check for the expected behavior (each filtered list should be no longer than the list it came from):

In [100]:
print(len(googlePlayData_free))
print(len(googlePlayData))
print(len(appStoreData_free))
print(len(appStoreData))

8847
9598
3203
6155


In [101]:
appStoreData = appStoreData_free
googlePlayData = googlePlayData_free

# Data Analysis

Since our hope is to eventually publish an app on both the Google Play Store and the App Store, we need to learn which app profiles are profitable on both platforms. Roughly speaking, the validation strategy (to gauge whether there is a market for the app) is as follows:
1. Publish a barebones version of the app on the Google Play Store.
2. If we see traction, develop the app further.
3. If the app is profitable after six months, create an iOS version of the app and publish it to the App Store.

We need to find the most common and the most popular genres on each platform. For popularity, we'll use the number of user ratings (App Store) or installs (Google Play) for each type of app as a proxy.

Let's start by defining some helper functions:

In [102]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
def freq_table(dataset, index):
    result = {}
    for entry in dataset:
        if entry[index] in result:
            result[entry[index]] += 1
        else:
            result[entry[index]] = 1
            
    for item in result:
        result[item] = round((result[item]/len(dataset))*100, 2)
    return result
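
For comparison, here is a minimal sketch of the same percentage table built with `collections.Counter` from the standard library. The name `freq_table_counter` and the sample rows are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Map each value in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: round(count / total * 100, 2) for value, count in counts.items()}

sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_table_counter(sample, 1)

# Sort by percentage, largest first, mirroring what display_table prints.
for genre, percent in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', percent)
```

Sorting on the percentage alone avoids building the intermediate list of `(value, key)` tuples, although ties are then left in insertion order rather than being broken by name.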

## Popularity by Genre (Market Competition)

Let's take a look at the App Store data first.

In [103]:
display_table(appStoreData, 11)

Games : 58.26
Entertainment : 7.84
Photo & Video : 5.0
Education : 3.68
Social Networking : 3.31
Shopping : 2.59
Utilities : 2.47
Sports : 2.15
Music : 2.06
Health & Fitness : 2.03
Productivity : 1.75
Lifestyle : 1.56
News : 1.34
Travel : 1.25
Finance : 1.09
Weather : 0.87
Food & Drink : 0.81
Reference : 0.53
Business : 0.53
Book : 0.37
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


Among free English apps on the App Store, games are by far the most common (58.26%), followed by entertainment apps (7.84%), photo and video apps (5.0%), and education apps (3.68%). This doesn't necessarily mean that these genres are the most popular among users; it only means that they are the most prevalent by sheer number on the App Store. If anything, it indicates how much competition we would face in this space were we to create an app in one of these categories. Additionally, very few apps fall into the catalogs, medical, and navigation categories, perhaps because of how much data, regulation, or infrastructure is needed to build an app in these spaces.

Next let's take a look at Google Play data.

In [104]:
display_table(googlePlayData, 1) #category
display_table(googlePlayData, 9) #genres

FAMILY : 18.93
GAME : 9.7
TOOLS : 8.45
BUSINESS : 4.6
PRODUCTIVITY : 3.9
LIFESTYLE : 3.89
FINANCE : 3.71
MEDICAL : 3.54
SPORTS : 3.39
PERSONALIZATION : 3.32
COMMUNICATION : 3.23
HEALTH_AND_FITNESS : 3.09
PHOTOGRAPHY : 2.95
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.67
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.87
VIDEO_PLAYERS : 1.8
MAPS_AND_NAVIGATION : 1.39
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.8
WEATHER : 0.79
EVENTS : 0.71
PARENTING : 0.66
ART_AND_DESIGN : 0.64
COMICS : 0.61
BEAUTY : 0.6
Tools : 8.44
Entertainment : 6.08
Education : 5.36
Business : 4.6
Productivity : 3.9
Lifestyle : 3.88
Finance : 3.71
Medical : 3.54
Sports : 3.46
Personalization : 3.32
Communication : 3.23
Action : 3.1
Health & Fitness : 3.09
Photography : 2.95
News & Magazines : 2.8
Social : 2.67
Travel & Local : 2.33
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.05
Dating : 1.87
Arcad

First of all, the Google Play Store has a larger number of effective categories than the App Store, which prevents a strict 1:1 comparison between the two stores. As an aside, one way to get a closer comparison would be to group Google Play's categories into umbrella categories that resemble the App Store's.
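
The grouping idea could look something like the sketch below. The mapping is hypothetical: the pairings are illustrative guesses rather than an official correspondence between the two stores, and `UMBRELLA`, `to_umbrella`, and `umbrella_counts` are names introduced only for this example.

```python
# Hypothetical mapping from Google Play categories to App Store-style genres.
# The pairings are illustrative guesses, not an official correspondence.
UMBRELLA = {
    'GAME': 'Games',
    'SOCIAL': 'Social Networking',
    'MAPS_AND_NAVIGATION': 'Navigation',
    'BOOKS_AND_REFERENCE': 'Reference',
    'PHOTOGRAPHY': 'Photo & Video',
    'VIDEO_PLAYERS': 'Photo & Video',
}

def to_umbrella(category):
    """Return the umbrella genre, or 'Other' if the category is unmapped."""
    return UMBRELLA.get(category, 'Other')

def umbrella_counts(dataset, category_index=1):
    """Count rows per umbrella genre."""
    counts = {}
    for row in dataset:
        genre = to_umbrella(row[category_index])
        counts[genre] = counts.get(genre, 0) + 1
    return counts

sample = [['A', 'GAME'], ['B', 'PHOTOGRAPHY'], ['C', 'VIDEO_PLAYERS'], ['D', 'BEAUTY']]
grouped = umbrella_counts(sample)
print(grouped)
```

Any real mapping would need to be reviewed category by category, since several Google Play categories have no obvious App Store counterpart.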

Overall, Family, Game, and Tools apps make up the largest shares of free English apps on Google Play by category (18.93%, 9.7%, and 8.45%). By genre, the most prevalent are Tools, Entertainment, and Education. The App Store and Google Play are similar in that gaming, entertainment, and education apps are among the most prevalent on both.

To recommend an app profile to the company's developers, though, we need to look at the size of the user base for each genre.

## Popularity by size of user base

Let's first define a helper function:

In [109]:
def display_table_descending(table):
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [105]:
appStoreGenreFreqTable = freq_table(appStoreData, 11)
appStoreAppsPerGenre = appStoreGenreFreqTable.copy()
appStoreRatingsPerGenre = appStoreGenreFreqTable.copy() # total ratings, not average
for genre in appStoreAppsPerGenre:
    appStoreAppsPerGenre[genre] = 0 # reset the copied percentages to zero before counting
    appStoreRatingsPerGenre[genre] = 0 # reset the copied percentages to zero before summing

for entry in appStoreData:
    entry_genre = entry[11]
    appStoreAppsPerGenre[entry_genre] += 1
    appStoreRatingsPerGenre[entry_genre] += int(entry[5])
    
avgappStoreRatingsPerGenre = {}
for genre in appStoreRatingsPerGenre: # calculating average ratings
    avgappStoreRatingsPerGenre[genre] = appStoreRatingsPerGenre[genre]/appStoreAppsPerGenre[genre]
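
The averaging step can be sketched more directly by accumulating a count and a running total per genre, without first copying the frequency table. The helper `average_by_group` and the sample rows are hypothetical; in the App Store data used here, the genre sits at index 11 and the rating count at index 5.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the numeric column value_index within that group}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

sample = [['Games', '100'], ['Games', '300'], ['Music', '50']]
averages = average_by_group(sample, 0, 1)
print(averages)
```

The same function would cover the Google Play case if the install strings were converted to numbers beforehand.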

In [110]:
print(appStoreAppsPerGenre)
print(appStoreRatingsPerGenre)
display_table_descending(avgappStoreRatingsPerGenre)

{'Social Networking': 106, 'Photo & Video': 160, 'Games': 1866, 'Music': 66, 'Reference': 17, 'Health & Fitness': 65, 'Weather': 28, 'Utilities': 79, 'Travel': 40, 'Shopping': 83, 'News': 43, 'Navigation': 6, 'Lifestyle': 50, 'Entertainment': 251, 'Food & Drink': 26, 'Sports': 69, 'Book': 12, 'Finance': 35, 'Education': 118, 'Productivity': 56, 'Business': 17, 'Catalogs': 4, 'Medical': 6}
{'Social Networking': 7584125, 'Photo & Video': 4550647, 'Games': 42705961, 'Music': 3783551, 'Reference': 1348958, 'Health & Fitness': 1514371, 'Weather': 1463837, 'Utilities': 1513363, 'Travel': 1129752, 'Shopping': 2260151, 'News': 913665, 'Navigation': 516542, 'Lifestyle': 840774, 'Entertainment': 3563035, 'Food & Drink': 866682, 'Sports': 1587614, 'Book': 556619, 'Finance': 1132846, 'Education': 826470, 'Productivity': 1177591, 'Business': 127349, 'Catalogs': 16016, 'Medical': 3672}
Navigation : 86090.33333333333
Reference : 79350.4705882353
Social Networking : 71548.34905660378
Music : 57326.530

Purely by the numbers (average number of user ratings per app, used here as a proxy for the size of the user base), I'd recommend developing a navigation app, a reference app, or a social networking app, in that order, if developing for the App Store. However, the social networking and navigation markets are already dominated by a few key players, so I'd recommend aiming for a reference app. Now let's repeat the same analysis for the Google Play dataset.

In [107]:
def number_parser(string_to_parse):
    result = string_to_parse.replace(",","")
    result = result.replace("+","")
    result = float(result)
    return result

googlePlayGenreFreqTable = freq_table(googlePlayData, 1)
googlePlayAppsPerGenre = googlePlayGenreFreqTable.copy()
googlePlayRatingsPerGenre = googlePlayGenreFreqTable.copy() # despite the name, holds total installs per category, not ratings

for genre in googlePlayAppsPerGenre:
    googlePlayAppsPerGenre[genre] = 0 # reset the copied percentages to zero before counting
    googlePlayRatingsPerGenre[genre] = 0 # reset the copied percentages to zero before summing
    
for entry in googlePlayData:
    entry_genre = entry[1] # Category column
    googlePlayAppsPerGenre[entry_genre] += 1
    googlePlayRatingsPerGenre[entry_genre] += number_parser(entry[5]) # Installs column
    
avgGooglePlayRatingsPerGenre = {}
for genre in googlePlayRatingsPerGenre: # average installs per app in each category
    avgGooglePlayRatingsPerGenre[genre] = googlePlayRatingsPerGenre[genre]/googlePlayAppsPerGenre[genre]
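
Google Play reports installs as open-ended buckets such as `'1,000,000+'`, so each parsed value is a lower bound rather than an exact count. A small sketch of that parsing step follows; the name `parse_installs` is an assumption, and the sample strings come from the rows printed earlier.

```python
def parse_installs(installs):
    """Convert a string like '1,000,000+' to the float lower bound 1000000.0."""
    return float(installs.replace(',', '').replace('+', ''))

for raw in ['10,000+', '5,000,000+', '1,000,000,000+']:
    print(raw, '->', parse_installs(raw))
```

Because every app in a bucket gets the same value, the per-category averages should be read as rough estimates.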

In [111]:
print(googlePlayAppsPerGenre)
print(googlePlayRatingsPerGenre)
display_table_descending(avgGooglePlayRatingsPerGenre)

{'ART_AND_DESIGN': 57, 'AUTO_AND_VEHICLES': 82, 'BEAUTY': 53, 'BOOKS_AND_REFERENCE': 189, 'BUSINESS': 407, 'COMICS': 54, 'COMMUNICATION': 286, 'DATING': 165, 'EDUCATION': 103, 'ENTERTAINMENT': 85, 'EVENTS': 63, 'FINANCE': 328, 'FOOD_AND_DRINK': 110, 'HEALTH_AND_FITNESS': 273, 'HOUSE_AND_HOME': 71, 'LIBRARIES_AND_DEMO': 83, 'LIFESTYLE': 344, 'GAME': 858, 'FAMILY': 1675, 'MEDICAL': 313, 'SOCIAL': 236, 'SHOPPING': 199, 'PHOTOGRAPHY': 261, 'SPORTS': 300, 'TRAVEL_AND_LOCAL': 207, 'TOOLS': 748, 'PERSONALIZATION': 294, 'PRODUCTIVITY': 345, 'PARENTING': 58, 'WEATHER': 70, 'VIDEO_PLAYERS': 159, 'NEWS_AND_MAGAZINES': 248, 'MAPS_AND_NAVIGATION': 123}
{'ART_AND_DESIGN': 113221100.0, 'AUTO_AND_VEHICLES': 53080061.0, 'BEAUTY': 27197050.0, 'BOOKS_AND_REFERENCE': 1665883760.0, 'BUSINESS': 696902090.0, 'COMICS': 44961150.0, 'COMMUNICATION': 11036906191.0, 'DATING': 140914757.0, 'EDUCATION': 188850000.0, 'ENTERTAINMENT': 989460000.0, 'EVENTS': 15973160.0, 'FINANCE': 455163132.0, 'FOOD_AND_DRINK': 211738

Based purely on these numbers (average installs per app in each category), I'd recommend developing a communication app, a video player app, or a social media app if developing for the Google Play Store. However, the marketplace for social media apps is already dominated by a few key players, so I'd recommend aiming for a communication app or a video player app.