# Profitable App Profiles for the App Store and Google Play Markets
This project uses the [AppStore](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) and [GooglePlay](https://www.kaggle.com/lava18/google-play-store-apps) datasets from Kaggle, which contain roughly 7,000 and 10,000 apps respectively. The goal is to explore the raw data and, from the available features, identify the kinds of apps that are likely to be profitable.

## Data Exploration

In [1]:
from csv import reader

# retrieve Apple Store data from CSV
# (the context manager closes the file; utf8 is set explicitly
# because app names contain non-ASCII characters)
with open('AppleStore.csv', encoding='utf8') as file_opener:
    apple_store = list(reader(file_opener))
ios_header, ios = apple_store[0], apple_store[1:]

# retrieve Google Play Store data from CSV
with open('googleplaystore.csv', encoding='utf8') as file_opener:
    google_play = list(reader(file_opener))
android_header, android = google_play[0], google_play[1:]

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

**iOS Head Data**

In [3]:
explore_data(ios, 0, 5, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


**Android Head Data**

In [4]:
explore_data(android, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13


The two datasets use different column names, so the same kind of information sits at a different index in each:

In [5]:
print("iOS:")
print(ios_header)
print("\nAndroid:")
print(android_header)

iOS:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

Android:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
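Because the column positions differ between the two stores, looking them up by name can be less error-prone than hard-coding indices. The sketch below is illustrative only: `col_index` and the `*_sample` variables are hypothetical names that are not part of the original notebook, and the header lists are copied from the output above.

```python
# Hypothetical helper (not in the original notebook):
# find a column's position from its header name.
def col_index(header, name):
    """Return the position of `name` in `header`; raises ValueError if absent."""
    return header.index(name)

# Header rows copied from the printed output above.
ios_header_sample = ['id', 'track_name', 'size_bytes', 'currency', 'price',
                     'rating_count_tot', 'rating_count_ver', 'user_rating',
                     'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                     'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
android_header_sample = ['App', 'Category', 'Rating', 'Reviews', 'Size',
                         'Installs', 'Type', 'Price', 'Content Rating', 'Genres',
                         'Last Updated', 'Current Ver', 'Android Ver']

print(col_index(ios_header_sample, 'prime_genre'))  # 11
print(col_index(android_header_sample, 'Price'))    # 7
```

These are the same indices that the later cells pass in by hand.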


## Data Cleaning

In [6]:
# Remove rows that are invalid based on Kaggle discussion
print(android[10472]) # the Category value is missing, so every later column is shifted left ('1.9' is the rating)
print(android_header)
del android[10472]

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
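The row printed above has 12 values while the header has 13, which is what makes it invalid. A minimal sketch of how such rows could be found automatically, rather than by index from a discussion thread, is shown below. `find_malformed_rows` and the shortened sample data are assumptions made for illustration.

```python
# Hypothetical helper: flag rows whose length does not match the header.
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]

# Shortened, illustrative data: three columns instead of thirteen.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],  # Category is missing
]

for index, row in find_malformed_rows(sample_header, sample_rows):
    print(index, row)
```

When deleting several flagged rows from a list, it is safer to delete from the highest index down so that earlier deletions do not shift the later indices.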


In [7]:
# Check duplicate apps
def check_duplicates(dataset, name_col_index):
    duplicate_apps = []
    unique_apps = set()  # set membership checks are much faster than list lookups

    for row in dataset:
        name = row[name_col_index]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.add(name)

    return duplicate_apps

android_duplicates = check_duplicates(android, 0)
ios_duplicates = check_duplicates(ios, 1)

print("There are {} duplicate apps in Android".format(len(android_duplicates)))
print("There are {} duplicate apps in iOS".format(len(ios_duplicates)))

There are 1181 duplicate apps in Android
There are 2 duplicate apps in iOS


In [8]:
print(ios_duplicates)
print("\n")
print(android_duplicates[:2])

['Mannequin Challenge', 'VR Roller Coaster']


['Quick PDF Scanner + OCR FREE', 'Box']
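To see how many times each duplicated name occurs, and not only that it is duplicated, the standard library's `collections.Counter` can be used. This is a sketch under assumptions: `count_duplicates` is a hypothetical helper and the sample rows are made up for illustration.

```python
from collections import Counter

# Hypothetical helper: map each duplicated name to its number of occurrences.
def count_duplicates(rows, name_col_index):
    counts = Counter(row[name_col_index] for row in rows)
    return {name: n for name, n in counts.items() if n > 1}

# Made-up rows in (name, reviews) form, for illustration only.
sample = [
    ['App A', '100'],
    ['App A', '120'],
    ['App B', '50'],
    ['App A', '130'],
]

duplicates = count_duplicates(sample, 0)
print(duplicates)  # {'App A': 3}

# Number of surplus rows, i.e. rows beyond the first occurrence of each name.
surplus_rows = sum(n - 1 for n in duplicates.values())
print(surplus_rows)  # 2
```

The surplus-row count corresponds to what `check_duplicates` reports, since that function appends a name once for every repeat occurrence.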


In [9]:
def fix_duplicates(dataset, name_col_index, reviews_col_index):
    reviews_max = {}
    clean = []
    already_added = set()

    # record the highest review count seen for each app name
    for row in dataset:
        name = row[name_col_index]
        n_reviews = float(row[reviews_col_index])

        if name not in reviews_max or reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews

    # keep one row per app: the first row whose review count equals the maximum
    for row in dataset:
        name = row[name_col_index]
        n_reviews = float(row[reviews_col_index])

        if n_reviews == reviews_max[name] and name not in already_added:
            clean.append(row)
            already_added.add(name)

    return reviews_max, clean

android_reviews_max, android_clean = fix_duplicates(android, 0, 3)
print("Android apps with no duplicates (highest reviews): {}".format(len(android_reviews_max)))
print("Length of cleaned Android data: {}".format(len(android_clean)))

Android apps with no duplicates (highest reviews): 9659
Length of cleaned Android data: 9659


In [10]:
# clean iOS apps
ios_reviews_max, ios_clean = fix_duplicates(ios, 1, 5)
print("iOS apps with no duplicates (highest reviews): {}".format(len(ios_reviews_max)))
print("Length of cleaned iOS data: {}".format(len(ios_clean)))

iOS apps with no duplicates (highest reviews): 7195
Length of cleaned iOS data: 7195


**Remove apps that are non-English**

In [11]:
# treat a name as non-English if it contains
# more than 3 non-ASCII characters (code point > 127)
def check_non_english(str_val):
    non_english_chars = []

    for char in str_val:
        if ord(char) > 127:
            non_english_chars.append(char)

    return len(non_english_chars) > 3
    
# Sanity check
print(check_non_english('Docs To Go™ Free Office Suite'))
print(check_non_english('Instachat 😜'))
print(check_non_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

False
False
True
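The cutoff of three non-ASCII characters is a heuristic: it tolerates names with a trademark sign or an emoji while still catching names written mostly in another script. The sketch below makes the cutoff a parameter so its effect can be tested. `count_non_ascii`, `looks_non_english`, and `max_non_ascii` are hypothetical names, not part of the original notebook.

```python
# Hypothetical variant of the check with a configurable threshold.
def count_non_ascii(text):
    """Count characters whose code point falls outside the ASCII range (0-127)."""
    return sum(1 for char in text if ord(char) > 127)

def looks_non_english(text, max_non_ascii=3):
    """Flag a name only when it has more than `max_non_ascii` non-ASCII characters."""
    return count_non_ascii(text) > max_non_ascii

# The same names used in the sanity check above.
names = [
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

for name in names:
    print(name, '->', looks_non_english(name))

# With a threshold of 0, a single trademark sign is enough to reject a name.
print(looks_non_english('Docs To Go™ Free Office Suite', max_non_ascii=0))
```

A stricter threshold would drop legitimate English apps, which is presumably why the notebook allows up to three such characters.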


In [12]:
# Remove non english apps 
def remove_non_english_apps(dataset, name_col_index):
    english_apps = []
    non_english_apps = []
    
    for app in dataset:
        name = app[name_col_index]
        
        if check_non_english(name):
            non_english_apps.append(app)
        else:
            english_apps.append(app)
            
    return english_apps, non_english_apps

android_clean_eng, android_clean_non_eng = remove_non_english_apps(android_clean, 0)
print("Length of Android English Apps: {}".format(len(android_clean_eng)))
print("Length of Android Non English Apps: {}".format(len(android_clean_non_eng)))

ios_clean_eng, ios_clean_non_eng = remove_non_english_apps(ios_clean, 1)
print("Length of iOS English Apps: {}".format(len(ios_clean_eng)))
print("Length of iOS Non English Apps: {}".format(len(ios_clean_non_eng)))

Length of Android English Apps: 9614
Length of Android Non English Apps: 45
Length of iOS English Apps: 6181
Length of iOS Non English Apps: 1014


**Isolate the free apps only**

In [13]:
# check the price col of the dataset to verify if free or not
def get_free_apps(dataset, price_col_index):
    free_apps = []
    for app in dataset:
        price = float(app[price_col_index].replace("$", ""))
        if price == 0:
            free_apps.append(app)
    return free_apps

In [14]:
# check headers
print("Android")
print(android_header)
print("\niOS:")
print(ios_header)

Android
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

iOS:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [15]:
# Android
android_free_apps = get_free_apps(android_clean_eng, price_col_index=7)
print("No. of free Android apps: {}".format(len(android_free_apps)))

# iOS
ios_free_apps = get_free_apps(ios_clean_eng, price_col_index=4)
print("No. of free iOS apps: {}".format(len(ios_free_apps)))

No. of free Android apps: 8864
No. of free iOS apps: 3220


## Analysis

Since our goal is to find app profiles that are profitable in both markets (Google Play and the App Store), we want to explore the following:
1. Which apps are available in both Google Play and the App Store?
2. Which apps have the highest ratings and the most ratings/reviews?
3. Which category/genre does each app belong to?
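The first question in the list can be approached by intersecting the app names of the two datasets. The sketch below is only an assumption about how that could be done: `normalize` and `apps_in_both` are hypothetical helpers, the sample rows are made up, and matching on lower-cased names will miss apps that are published under different names in each store.

```python
# Hypothetical helpers: find app names that appear in both datasets.
def normalize(name):
    """Lower-case and trim a name so trivial differences do not block a match."""
    return name.strip().lower()

def apps_in_both(android_rows, ios_rows, android_name_col=0, ios_name_col=1):
    android_names = {normalize(row[android_name_col]) for row in android_rows}
    ios_names = {normalize(row[ios_name_col]) for row in ios_rows}
    return sorted(android_names & ios_names)

# Made-up rows that follow the column layout of each dataset
# (Android: name at index 0; iOS: name at index 1).
android_sample = [['Example Chat', 'SOCIAL'], ['Example Notes', 'PRODUCTIVITY']]
ios_sample = [['1001', 'example chat '], ['1002', 'Example Weather']]

print(apps_in_both(android_sample, ios_sample))  # ['example chat']
```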

In [16]:
# check frequency of genre in both markets
def freq_table(dataset, index):
    table = {}
    for app in dataset:
        col_val = app[index]
        if col_val in table:
            table[col_val] += 1
        else:
            table[col_val] = 1
    
    total = len(dataset)
    perc_table = {}
    for key in table:
        perc_table[key] = round(table[key] / total * 100, 2)
    
    return perc_table

def display_freq_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [17]:
# iOS
display_freq_table(ios_free_apps, 11)

Games : 58.14
Entertainment : 7.89
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.52
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.34
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


Among the free English iOS apps, the top three genres are Games (58.14%), Entertainment (7.89%), and Photo & Video (4.97%). The frequency table suggests that free iOS apps are built mostly for entertainment (Games and Entertainment), and that users probably enjoy using the iPhone camera to capture photos (Photo & Video).

In summary, games, entertainment, and photo & video apps dominate the free iOS catalog, so these genres are a reasonable starting point when looking for a profitable app profile. Keep in mind that this table counts apps rather than users, so it shows which genres are offered most often, not necessarily which are downloaded most.
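The top-genre ranking above can also be produced with `collections.Counter`, whose `most_common` method handles the sorting. This is a sketch, not the notebook's method: `top_genres` is a hypothetical helper and the sample rows are made up, so the percentages below do not correspond to the real dataset.

```python
from collections import Counter

# Hypothetical helper: return the n most frequent genres with their share in percent.
def top_genres(rows, genre_col_index, n=3):
    counts = Counter(row[genre_col_index] for row in rows)
    total = sum(counts.values())
    return [(genre, round(count / total * 100, 2))
            for genre, count in counts.most_common(n)]

# Made-up rows in (name, genre) form, for illustration only.
sample_rows = [
    ['a', 'Games'], ['b', 'Games'], ['c', 'Games'],
    ['d', 'Entertainment'], ['e', 'Entertainment'],
    ['f', 'Music'],
]

for genre, share in top_genres(sample_rows, 1, n=2):
    print(genre, ':', share)
# Games : 50.0
# Entertainment : 33.33
```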

In [18]:
# Android Category
display_freq_table(android_free_apps, 1)

FAMILY : 18.91
GAME : 9.72
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


In [19]:
# Android Genre
display_freq_table(android_free_apps, 9)

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.91
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;

In contrast with iOS, the most common category among free Android apps is Family (18.91%), with Game only in second place (9.72%). In the Genres column, Tools is the largest genre (8.45%); these are apps that people probably use every day. One pattern we can infer from the tables is that Android apps lean toward the practical and toward things that can be shared with the family. It is possible that Google Play offers more free, high-quality apps in the family and tools categories than in entertainment (games included).
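Several values in the Genres column combine two genres with a semicolon (for example `Art & Design;Pretend Play`), so the frequency table treats each combination as its own genre. One possible refinement, sketched below as an assumption and not as something the notebook does, is to split those values and count every genre separately. `split_genre_counts` is a hypothetical helper; the sample values are the Genres entries of the five Android rows printed earlier.

```python
from collections import Counter

# Hypothetical helper: count each genre separately when a cell lists several.
def split_genre_counts(rows, genre_col_index):
    counts = Counter()
    for row in rows:
        for genre in row[genre_col_index].split(';'):
            genre = genre.strip()
            if genre:  # skip empty pieces such as a trailing ';'
                counts[genre] += 1
    return counts

# Genres values from the first five Android rows shown in the exploration step.
sample_genres = [
    ['Art & Design'],
    ['Art & Design;Pretend Play'],
    ['Art & Design'],
    ['Art & Design'],
    ['Art & Design;Creativity'],
]

counts = split_genre_counts(sample_genres, 0)
print(counts['Art & Design'], counts['Pretend Play'], counts['Creativity'])  # 5 1 1
```

With this approach an app can contribute to more than one genre, so the shares no longer sum to 100%.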

In [34]:
# Popular genres by average popularity per app:
# installs for Android, total rating count for iOS (used as a proxy for installs)
def genre_install_table(dataset, genres_table, genre_col_index, tot_ratings_col_index):
    for genre in genres_table:
        total_ratings_per_genre = 0
        genre_count = 0
        for row in dataset:
            genre_col = row[genre_col_index]
            if genre_col == genre:
                # strip "+" and "," so values like "10,000+" can be parsed
                n_ratings = float(row[tot_ratings_col_index].replace("+","").replace(",",""))
                total_ratings_per_genre += n_ratings
                genre_count += 1
        avg_per_genre = total_ratings_per_genre / genre_count
        print("{}: {}".format(genre, avg_per_genre))
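The function above scans the whole dataset once per genre. A single-pass alternative that accumulates totals and counts in dictionaries is sketched below. `parse_installs` and `average_by_group` are hypothetical names and the sample rows are made up. Note that Android install values such as `10,000+` are open-ended buckets, so the parsed number is a lower bound and not an exact count.

```python
from collections import defaultdict

# Hypothetical helper: turn strings like "10,000+" or "2974676" into an integer.
def parse_installs(text):
    return int(text.replace('+', '').replace(',', ''))

# Hypothetical helper: average a numeric column per group in one pass over the rows.
def average_by_group(rows, group_col_index, value_col_index):
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_col_index]
        totals[group] += parse_installs(row[value_col_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up rows in (group, installs) form, for illustration only.
sample_rows = [
    ['A', '1,000+'],
    ['A', '3,000+'],
    ['B', '10+'],
]

print(average_by_group(sample_rows, 0, 1))  # {'A': 2000.0, 'B': 10.0}
```

Returning a dictionary instead of printing also makes it straightforward to sort the genres by their average afterwards.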

In [35]:
# iOS
ios_genres = freq_table(ios_free_apps, 11)
genre_install_table(ios_free_apps, ios_genres, 11, 5)

Reference: 74942.11111111111
Weather: 52279.892857142855
Entertainment: 14029.830708661417
Utilities: 18684.456790123455
News: 21248.023255813954
Games: 22812.92467948718
Lifestyle: 16485.764705882353
Music: 57326.530303030304
Business: 7491.117647058823
Health & Fitness: 23298.015384615384
Catalogs: 4004.0
Shopping: 26919.690476190477
Photo & Video: 28441.54375
Education: 7003.983050847458
Book: 39758.5
Productivity: 21028.410714285714
Navigation: 86090.33333333333
Sports: 23008.898550724636
Medical: 612.0
Social Networking: 71548.34905660378
Travel: 28243.8
Finance: 31467.944444444445
Food & Drink: 33333.92307692308


In [36]:
# Android
android_genres = freq_table(android_free_apps, 1)
genre_install_table(android_free_apps, android_genres, 1, 5)

SPORTS: 3638640.1428571427
AUTO_AND_VEHICLES: 647317.8170731707
FINANCE: 1387692.475609756
PARENTING: 542603.6206896552
MEDICAL: 120550.61980830671
LIBRARIES_AND_DEMO: 638503.734939759
COMICS: 817657.2727272727
BUSINESS: 1712290.1474201474
WEATHER: 5074486.197183099
TOOLS: 10801391.298666667
PRODUCTIVITY: 16787331.344927534
EVENTS: 253542.22222222222
NEWS_AND_MAGAZINES: 9549178.467741935
SHOPPING: 7036877.311557789
ART_AND_DESIGN: 1986335.0877192982
COMMUNICATION: 38456119.167247385
HOUSE_AND_HOME: 1331540.5616438356
LIFESTYLE: 1437816.2687861272
GAME: 15588015.603248259
BEAUTY: 513151.88679245283
FAMILY: 3695641.8198090694
ENTERTAINMENT: 11640705.88235294
EDUCATION: 1833495.145631068
MAPS_AND_NAVIGATION: 4056941.7741935486
PERSONALIZATION: 5201482.6122448975
FOOD_AND_DRINK: 1924897.7363636363
BOOKS_AND_REFERENCE: 8767811.894736841
SOCIAL: 23253652.127118643
DATING: 854028.8303030303
TRAVEL_AND_LOCAL: 13984077.710144928
VIDEO_PLAYERS: 24727872.452830188
PHOTOGRAPHY: 17840110.40229885
H