# Profitable App Profiles for the App Store and Google Play Markets

The goal of this project is to find app profiles that are profitable for the App Store and Google Play markets.

In [2]:
from csv import reader

# Load the Google Play dataset; the first row is the header.
with open('googleplaystore.csv', encoding='utf8') as file:
    android = list(reader(file))
android_header = android[0]
# android[1:] refers to the dataset without the header row
android = android[1:]

# Load the App Store dataset the same way.
with open('AppleStore.csv', encoding='utf8') as file:
    ios = list(reader(file))
ios_header = ios[0]
ios = ios[1:]

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [4]:
print(android_header)
print('\n')
explore_data(android,0,4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


There are 10841 listed Android apps in this dataset. The columns we may want to focus on are Category, Rating, Reviews, Type, Price, Content Rating, and Genres. Here is a link for the documentation of each column: https://www.kaggle.com/lava18/google-play-store-apps/home

These columns were selected on the assumption that popular apps tend to be free or fall within a certain price range, belong to particular genres, and have high ratings.

In [5]:
print(ios_header)
print('\n')
explore_data(ios, 0, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16


The App Store dataset contains 7197 apps. Columns worth examining to discern what makes a popular iOS app might be price, user rating, primary genre, content rating, and language count (lang.num). This is on the assumption that the market largely caters to certain languages, low prices, and apps that have already earned high ratings.

Documentation for the iOS dataset: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home

# Data cleaning portion of the project

In [6]:
print(android_header)
print('\n')
print(android[10472])


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


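Row 10472 is missing its Category value, so every later field is shifted one column to the left: the row has 12 values instead of 13, and the Rating column reads 19, which exceeds the maximum rating of 5 allowed by the store. This row will be deleted to avoid skewing the app ratings.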
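A shifted row like this one can also be found automatically by comparing each row's length with the header's. The snippet below is a minimal sketch on toy data; `find_malformed_rows` and the sample rows are hypothetical names and values introduced only for illustration, not part of the original notebook.

```python
# Toy header and rows (hypothetical data, not taken from the real CSV).
header = ['App', 'Category', 'Rating', 'Reviews']
rows = [
    ['Good App', 'TOOLS', '4.2', '100'],
    ['Broken App', '19', '3.0M'],          # 'Category' is missing, so later values shift left
    ['Another App', 'GAME', '3.9', '57'],
]

def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose length differs from the header's."""
    return [(i, row) for i, row in enumerate(dataset) if len(row) != len(header)]

malformed = find_malformed_rows(rows, header)
for index, row in malformed:
    print('Row', index, 'has', len(row), 'columns, expected', len(header))
```

A length check only catches rows with a missing or extra field; a row with the right number of fields but a wrong value would still need a range check on the column itself.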

In [7]:
del android[10472]
print(len(android))

10840


The Google Play data set has duplicate app entries as shown below. 

In [8]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)


print('Number of duplicate apps', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


Before deleting any duplicate app entries, we must first take a look at why there are duplicates.

In [9]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


As shown with the Instagram app, the duplicate rows differ mainly in their total number of reviews, and some are fully identical (two of the Instagram rows both list 66577313 reviews). The row with the most reviews is presumably the most current entry for the app in this dataset, so that row will be preserved and the other duplicate rows discarded.

In [10]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    # Only overwrite an existing entry when the new count is larger.
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In [11]:
print('Original number of rows: ', len(android))
print('\n')
print('Expected number of rows. Original number of rows minus total duplicates: ',len(android)-1181)
print('\n')
print('Number of rows after we delete duplicates: ', len(reviews_max))

Original number of rows:  10840


Expected number of rows. Original number of rows minus total duplicates:  9659


Number of rows after we delete duplicates:  9659


In [12]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

The previous cell cleans the original Google Play Store dataset by removing duplicate rows. We iterated through the original dataset and, for each app, added only the row with the highest review count to a list called 'android_clean'. The 'already_added' list prevents a second copy from being added when two rows tie for the maximum.
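For comparison, the same "keep the most-reviewed row" idea can be sketched in a single pass with a dictionary keyed by app name. This is an alternative under stated assumptions rather than the notebook's method: `keep_most_reviewed` is a hypothetical helper, the rows are toy data, and the result is ordered by each app's first appearance, which may differ from the order in 'android_clean'.

```python
# Toy rows in the Google Play layout: name at index 0, review count at index 3.
rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}  # name -> row with the most reviews seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means that on a tie the earlier row is kept.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

deduplicated = keep_most_reviewed(rows)
print(len(deduplicated))
```

Dictionary lookups avoid the repeated list membership tests (`name not in already_added`), which can matter on larger datasets.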

The cell below shows the first three rows of the cleaned dataset, confirming that the original format is preserved, along with the number of remaining rows.

In [13]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


The next step in the data cleaning process is removing apps with non-English names. The following function counts the characters in a name that fall outside the ASCII range (code points above 127) and treats the name as non-English when there are more than three of them.

In [14]:
def EnglishCheck(word):
    nonEnglish = 0
    for letter in word:
        if ord(letter) > 127:
            nonEnglish += 1
    if nonEnglish > 3:
        return False
    else:
        return True

The following cell is a test for our EnglishCheck function which checks to see if an app name has more than three non-English characters.

In [15]:
EnglishCheck('Instachat 😜'), EnglishCheck('Docs To Go™ Free Office Suite'),
EnglishCheck('爱奇艺PPS -《欢乐颂2》电视剧热播')

False
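The cell above displays only `False` because a notebook shows the value of the last expression in a cell, which here is the check on the third name. To see all three results, each one can be printed explicitly. The sketch below uses a hypothetical stand-in, `is_probably_english`, that applies the same rule as EnglishCheck (at most three characters with a code point above 127) so that the snippet runs on its own.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` characters outside ASCII."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

samples = ['Instachat 😜', 'Docs To Go™ Free Office Suite', '爱奇艺PPS -《欢乐颂2》电视剧热播']
results = [is_probably_english(name) for name in samples]
for name, result in zip(samples, results):
    print(name, '->', result)
```

The first two names contain a single non-ASCII character each (an emoji and a trademark sign) and pass; the third contains many and is rejected.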

In the next cell, we keep only the apps with English names in both the Google Play Store and iOS datasets; the cell after it shows how many remain in each.

In [16]:
iOS_English = []

android_English = []


for app in android_clean:
    if EnglishCheck(app[0]):
        android_English.append(app)

for app in ios:
    if EnglishCheck(app[1]):
        iOS_English.append(app)


In [17]:
explore_data(iOS_English, 0, 3, True)
print('\n')
explore_data(android_English, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Vari

After ridding both datasets of non-English apps, we can see that we have 6183 iOS apps and 9614 Google Play Store apps.

The firm we're pretending to work for only publishes free apps, so we have to clean the remaining data further to include only free apps. The Google Play Store dataset has a Type column that marks each app as free or paid. The Apple App Store dataset has a price column, from which we can select the apps whose price is zero.

In [18]:
android_Free = []
iOS_Free = []


for app in android_English:
    hasPrice = app[6]
    if hasPrice == "Free":
        android_Free.append(app)

for app in iOS_English:
    hasPrice = float(app[4])
    if hasPrice == 0:
        iOS_Free.append(app)

print("We have " + str(len(android_Free)) + " free Google Play Store apps.")
print("We have " + str(len(iOS_Free)) + " free Apple App Store apps.")

We have 8863 free Google Play Store apps.
We have 3222 free Apple App Store apps.


# Validation Strategy

- Build a minimal Android version of an app and then add it to Google Play.
- If the app receives good feedback then the app will be further developed.
- If the app is profitable after six months, an iOS version of the app will be made and added to the App Store.

Before we implement this validation strategy we'll first seek out the most common app genres for each market. In order to do this we'll create frequency tables for the genres for each market.

In [19]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


### This function displays tables in descending order. ###

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])        
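As a point of comparison, the same percentage table can be sketched with `collections.Counter` from the standard library, which handles the counting and the descending sort. `freq_table_counter` and the toy rows are hypothetical and only illustrate the idea; the notebook itself uses the two functions above.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, most common value first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.most_common()}

# Toy rows (hypothetical): genre at index 1.
rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Tools'], ['d', 'Games']]
table = freq_table_counter(rows, 1)
for genre, percentage in table.items():
    print(genre, ':', percentage)
```

Because dictionaries preserve insertion order, iterating over the result already yields the values from most to least common.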

In [20]:
display_table(iOS_Free, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Among free English apps on the App Store, the most common genres are games (about 58%), entertainment (about 8%), photo and video editing (about 5%), and education (about 4%). Entertainment-oriented apps are far more common than utility apps. Comparing this with the Google Play Store should give us greater insight into profitable app profiles. As a side note, the genres published most often are not necessarily the ones downloaded most often, so this analysis would be greatly aided by looking at the most downloaded apps as well.

Below are frequency tables for the Google Play Store's Category and Genres columns. Both columns are analyzed because apps sometimes fit multiple genres.

In [21]:
#Category column
display_table(android_Free, 1)

FAMILY : 19.21471285117906
GAME : 9.511452104253639
TOOLS : 8.462146000225657
BUSINESS : 4.580841701455489
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.542818458761142
SPORTS : 3.4187069840911652
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2494640640866526
HEALTH_AND_FITNESS : 3.068938282748505
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7826920907142052
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.128286133363421
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
ENTERTAINMENT : 0.8800631840234684
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0.64

The frequency table for the Category column above shows that family (about 19%), game (about 10%), and tools (about 8%) are the three most common categories of apps. Game apps are the second most common, with tools a close third. This suggests that a good portion of the Google Play market consists of apps that offer a practical use, not only entertainment apps.

In [22]:
#Genre column

display_table(android_Free, 9)

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.580841701455489
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.542818458761142
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2494640640866526
Action : 3.102786866749408
Health & Fitness : 3.068938282748505
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8616721200496444
Video Players & Editors : 1.7826920907142052
Casual : 1.7488435067133026
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & Vehicles : 0.9251946293580051
St

The display table above lists the most common app genres. It falls in line with what we've seen in the previous tables: tools (about 8%) and entertainment (about 6%) are the most common kinds of apps.

Based on frequency alone, a promising app profile would be an entertainment application or one that serves a utilitarian purpose for smartphone owners. On the iOS App Store, the Utilities genre makes up only about 2.5% of free apps, but Photo & Video, a practical media-editing genre, is the third most common.

Further research on what kinds of games and what kinds of utility apps are preferred by users would be helpful.

## Most Popular Apps by Genre on the App Store

In [36]:
iOS_prime_genres = freq_table(iOS_Free, 11)

for genre in iOS_prime_genres:
    total = 0        # sum of rating counts for this genre
    len_genre = 0    # number of apps in this genre
    for app in iOS_Free:
        genre_app = app[11]  # prime_genre column
        if genre_app == genre:
            len_genre += 1
            total += float(app[5])  # rating_count_tot column
    
    print('\n')
    print("Total number of ratings: " + str(total))
    print("Number of apps specific to the " + genre + " genre: " + str(len_genre))
    print("Genre: " + genre + " | Average number of ratings: " + str(total/len_genre))
    print('\n')




Total number of ratings: 1177591.0
Number of apps specific to the Productivity genre: 56
Genre: Productivity | Average number of ratings: 21028.410714285714




Total number of ratings: 3783551.0
Number of apps specific to the Music genre: 66
Genre: Music | Average number of ratings: 57326.530303030304




Total number of ratings: 1348958.0
Number of apps specific to the Reference genre: 18
Genre: Reference | Average number of ratings: 74942.11111111111




Total number of ratings: 1463837.0
Number of apps specific to the Weather genre: 28
Genre: Weather | Average number of ratings: 52279.892857142855




Total number of ratings: 516542.0
Number of apps specific to the Navigation genre: 6
Genre: Navigation | Average number of ratings: 86090.33333333333




Total number of ratings: 3563577.0
Number of apps specific to the Entertainment genre: 254
Genre: Entertainment | Average number of ratings: 14029.830708661417




Total number of ratings: 866682.0
Number of apps specific to the Fo

Navigation apps have the highest average number of ratings among the genres shown above, yet the genre contains only six apps. This suggests that the navigation app market has few competitors and plenty of consumers, although with so few apps the average may be driven by one or two very popular titles. Games are the most commonly published genre and also attract plenty of ratings. Perhaps an intersection of navigation and gaming could be exploited, as the navigation genre has few competitors while the market shows a strong preference for entertainment and games.
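Comparing genres is easier when the averages are collected and sorted rather than printed in dictionary order. The following is a minimal sketch of that idea on toy data; `average_by_group` is a hypothetical helper, and the column positions are parameters rather than the real dataset's indices.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Return [(group, mean value, row count)] sorted by mean, highest first."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    averages = [(group, totals[group] / counts[group], counts[group]) for group in totals]
    return sorted(averages, key=lambda item: item[1], reverse=True)

# Toy rows (hypothetical): genre at index 0, rating count at index 1.
rows = [['Navigation', '300'], ['Games', '50'], ['Games', '150'], ['Navigation', '100']]
summary = average_by_group(rows, 0, 1)
for genre, mean, n_apps in summary:
    print(genre, '| average:', mean, '| apps:', n_apps)
```

Keeping the app count next to each average makes it easier to spot genres whose average rests on only a handful of apps.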



## Most Popular Apps by Genre on Google Play

In [52]:
android_categories = freq_table(android_Free, 1)

for category in android_categories:
    total = 0
    len_category = 0
    for app in android_Free:
        category_app = app[1]
        if category_app == category:
            # Installs are strings such as '10,000+'; strip ',' and '+' before converting.
            installs = app[5]
            installs = installs.replace(',', '')
            installs = installs.replace('+','')
            total += float(installs)
            len_category+=1

    
    print('\n')
    print("Total number of installs: " + str(total))
    print("Number of apps specific to the " + category + " genre: " + str(len_category))
    print("Category: " + category + " | Average number of installs: " + str(total/len_category))
    print('\n')




Total number of installs: 691902090.0
Number of apps specific to the BUSINESS genre: 406
Category: BUSINESS | Average number of installs: 1704192.3399014778




Total number of installs: 53080061.0
Number of apps specific to the AUTO_AND_VEHICLES genre: 82
Category: AUTO_AND_VEHICLES | Average number of installs: 647317.8170731707




Total number of installs: 713460000.0
Number of apps specific to the ENTERTAINMENT genre: 78
Category: ENTERTAINMENT | Average number of installs: 9146923.076923076




Total number of installs: 8826995690.0
Number of apps specific to the FAMILY genre: 1703
Category: FAMILY | Average number of installs: 5183203.576042279




Total number of installs: 4656268815.0
Number of apps specific to the PHOTOGRAPHY genre: 261
Category: PHOTOGRAPHY | Average number of installs: 17840110.40229885




Total number of installs: 113221100.0
Number of apps specific to the ART_AND_DESIGN genre: 57
Category: ART_AND_DESIGN | Average number of installs: 1986335.0877192982




Total n

A seemingly popular app category in the Google Play Store is communication. However, if we take a close look at the communication category we'd find that this is a skewed metric where very few apps dominate this category. Even so, the number of installs of these