# Profitable App Profiles for the App Store and Google Play Markets

Download datasets here:
- [AppleStore.csv](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)
- [googleplaystore.csv](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)

Documentation on datasets:
- [Apple Store Data](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)
- [Google Play Store Data](https://www.kaggle.com/lava18/google-play-store-apps)



## Utility Functions

Function **explore_data** prints the rows of a dataset between `start` and `end` and, optionally, the number of rows and columns

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Function **get_csv_list** reads a CSV file and returns its rows as a list of lists

In [2]:
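from csv import reader

def get_csv_list(filename):
    # Open with an explicit encoding (app names contain non-ASCII characters)
    # and let the with-block close the file once all rows have been read
    with open(filename, encoding='utf8', newline='') as opened_file:
        return list(reader(opened_file))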

## Dataset Descriptions

**AppleStore.csv**

| Column name | Description |
| ----------- | ----------- |
| '' | |
| 'id' | |
| 'track_name' | |
| 'size_bytes' | |
| 'currency' | |
| 'price' | |
| 'rating_count_tot' | |
| 'rating_count_ver' | |
| 'user_rating' | |
| 'user_rating_ver' | |
| 'ver' | |
| 'cont_rating' | |
| 'prime_genre' | |
| 'sup_devices.num' | |
| 'ipadSc_urls.num' | |
| 'lang.num' | |
| 'vpp_lic' | |

**googleplaystore.csv**

| Column name | Description |
| ----------- | ----------- |
| 'App' | |
| 'Category' | |
| 'Rating' | |
| 'Reviews' | |
| 'Size' | |
| 'Installs' | |
| 'Type' | |
| 'Price' | |
| 'Content Rating' | |
| 'Genres' | |
| 'Last Updated' | |
| 'Current Ver' | |
| 'Android Ver' | |

In [106]:
apple_app_data = get_csv_list('../datasets/AppleStore.csv')

In [160]:
explore_data(apple_app_data, 0, 10, True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37'

In [180]:
play_app_data = get_csv_list('../datasets/googleplaystore.csv')

In [161]:
explore_data(play_app_data, 0, 10, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Eve

## Data Cleanup
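Print and delete the row that contains an error: its category value is missing, so the row has 12 columns instead of 13 and every later value is shifted left by one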

In [13]:
print(play_app_data[10472])
print(play_app_data[10473])

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [110]:
del play_app_data[10473]
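
Rows like this one can also be located without knowing the index in advance. The following is a minimal sketch, assuming a malformed row is one whose length differs from the header's; the helper name `find_malformed_rows` and the sample rows are hypothetical and not part of the datasets.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Hypothetical sample: the last row is missing its category value
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.1'],
    ['Broken App', '1.9'],
]
bad = find_malformed_rows(sample)
print(bad)
```

When several rows are reported, deleting them in descending index order avoids shifting the indices of rows that still have to be removed.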

---
Function **get_duplicate_apps** finds and returns a dict of duplicate apps in the Play dataset
- the dict is keyed by the index of the row in the original dataset
- the value of each entry in the dict is the duplicate record

In [104]:
def get_duplicate_apps(dataset):
    results = {}    # row index -> duplicate record
    seen = set()    # app names encountered so far

    for index, app in enumerate(dataset):
        name = app[0]
        if name in seen:
            results[index] = app
        else:
            seen.add(name)

    return results

In [105]:
duplicates = get_duplicate_apps(play_app_data)
print("Number of duplicate apps: ", len(duplicates))
#print("\n")
#print("Examples of duplicate apps: ")
#count = 0
#for index in duplicates.keys():
#    app = duplicates[index]
#    print(app)
#    count += 1
#    if count == 15:
#        break

Number of duplicate apps:  1181


---
Function **remove_duplicate_apps** removes duplicate apps.  
- The instance with the highest number of reviews is assumed to be the most recent
- Only that instance of each app is kept

In [122]:
def remove_duplicate_apps(dataset):
    # First pass: record the highest review count seen for each app name
    most_recent = {}

    for app in dataset[1:]:    # skip the header row
        name = app[0]
        num_reviews = float(app[3])
        if name in most_recent:
            num_reviews_mr = most_recent[name]
            if num_reviews > num_reviews_mr:
                most_recent[name] = num_reviews
        else:
            most_recent[name] = num_reviews

    # Second pass: keep the first row of each app that has that review count
    new_dataset = []
    app_names = set()
    new_dataset.append(dataset[0])
    for app in dataset[1:]:
        name = app[0]
        num_reviews = float(app[3])
        if num_reviews == most_recent[name] and name not in app_names:
            new_dataset.append(app)
            app_names.add(name)

    return new_dataset

In [154]:
play_app_data = get_csv_list('../datasets/googleplaystore.csv')
del play_app_data[10473]
print("Number in original dataset: ", len(play_app_data))

clean_play_app_data = remove_duplicate_apps(play_app_data)
print("Number in clean dataset: ", len(clean_play_app_data))

# check for duplicates on the clean dataset
duplicates = get_duplicate_apps(clean_play_app_data)
print("Number of duplicate apps: ", len(duplicates))

Number in original dataset:  10841
# most recent records = 9659
# recs=9660
Number in clean dataset:  9660
Number of duplicate apps:  0
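
The same rule (keep the instance with the most reviews) can be applied in a single pass. This is only a sketch of an alternative, not the method used above: the name `dedupe_by_max_reviews`, its default indices, and the sample rows are assumptions made for illustration. It relies on dicts preserving insertion order (Python 3.7+), so rows come out ordered by the first appearance of each name rather than by the position of the kept row.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep, for each app name, the first row with the highest review count."""
    header, body = rows[0], rows[1:]
    best = {}    # app name -> row with the most reviews so far
    for row in body:
        name = row[name_idx]
        reviews = float(row[reviews_idx])
        # Replace only on a strictly higher count, so ties keep the earlier row
        if name not in best or reviews > float(best[name][reviews_idx]):
            best[name] = row
    return [header] + list(best.values())

# Hypothetical sample with three entries for app 'A'
sample = [
    ['App', 'Category', 'Rating', 'Reviews'],
    ['A', 'TOOLS', '4.0', '10'],
    ['B', 'TOOLS', '4.0', '5'],
    ['A', 'TOOLS', '4.0', '30'],
    ['A', 'TOOLS', '4.0', '30'],
]
deduped = dedupe_by_max_reviews(sample)
print(len(deduped))
```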


---
Function **is_english_string** returns True if string has three or fewer non-ascii characters
- non-ascii characters are those with an ordinal value greater than 127

In [142]:
def is_english_string(src):
    cnt = 0
    for c in src:
        if ord(c) > 127:
            cnt += 1
            if cnt > 3:
                return False
    return True

In [143]:
print(is_english_string('Instagram!'))
print(is_english_string('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english_string('Docs To Go™ Free Office Suite'))
print(is_english_string('Instachat 😜'))

True
False
True
True


---
Filter non-English apps out of both datasets using function **is_english_string**

In [155]:
english_apple_app_data = []
english_play_app_data = []
for app in apple_app_data:
    name = app[2]
    if is_english_string(name):
        english_apple_app_data.append(app)
print("Number in english Apple dataset: ", len(english_apple_app_data))         

for app in clean_play_app_data:
    name = app[0]
    if is_english_string(name):
        english_play_app_data.append(app)
print("Number in English Play dataset: ", len(english_play_app_data))         

Number in english Apple dataset:  6184
Number in English Play dataset:  9615


---
Filter each dataset down to free apps only
- Free apps in the Apple dataset are those with a cost of '0' at index 5
- Free apps in the Play dataset are those with a cost of '0' at index 7

In [159]:
apple_app_data_final = []
play_app_data_final = []
for app in english_apple_app_data:
    cost = app[5]
    if cost == '0':
        apple_app_data_final.append(app)
print("Number in free Apple dataset: ", len(apple_app_data_final))         

for app in english_play_app_data:
    cost = app[7]
    if cost == '0':
        play_app_data_final.append(app)
print("Number in free Play dataset: ", len(play_app_data_final))         

Number in free Apple dataset:  3222
Number in free Play dataset:  8864
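
Comparing against the exact string '0' works for the rows shown here, but it would miss a free app whose price was written as '0.0'. A more tolerant check could compare numerically instead. This is a sketch under assumptions: the helper `is_free` is hypothetical, and it assumes paid prices may appear either as plain numbers such as '3.99' or with a leading dollar sign such as '$4.99'.

```python
def is_free(price):
    """Return True when a price string represents zero, whatever its format."""
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Non-numeric cells, such as the header values 'price' or 'Price'
        return False

for value in ['0', '0.0', '3.99', '$4.99', 'Price']:
    print(value, is_free(value))
```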


---
## Data Analysis
Build frequency tables to determine the most common genres in each market

In [None]:
#'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre'

#'Category', 'Rating', 'Reviews', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres'

---
Function **freq_table** generates a frequency table that shows percentages for the column at the given index

Function **display_table** displays the percentages in descending order

In [175]:
def freq_table(dataset, index):
    table = {}
    total = 0
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
            
    table_percentages = {}
    for value in table:
        table_percentages[value] = (table[value] / total) * 100
    
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for value in table:
        table_display.append((table[value], value))
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(str(entry[1]) + ':  ' + str(entry[0]))    # entry[0] is a float
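
For comparison, the counting and sorting steps can be sketched with `collections.Counter` from the standard library. The function name `freq_table_counter` and the sample rows are hypothetical; the percentages are computed the same way as in **freq_table**.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage}, ordered from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs in descending count order
    return {value: count / total * 100 for value, count in counts.most_common()}

# Hypothetical sample: genre is stored at index 1
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_table_counter(sample, 1)
for genre, pct in table.items():
    print(genre + ':  ' + str(pct))
```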
        

In [169]:
display_table(apple_app_data_final, 12)

Games:  58.16263190564867
Entertainment:  7.883302296710118
Photo & Video:  4.9658597144630665
Education:  3.662321539416512
Social Networking:  3.2898820608317814
Shopping:  2.60707635009311
Utilities:  2.5139664804469275
Sports:  2.1415270018621975
Music:  2.0484171322160147
Health & Fitness:  2.0173805090006205
Productivity:  1.7380509000620732
Lifestyle:  1.5828677839851024
News:  1.3345747982619491
Travel:  1.2414649286157666
Finance:  1.1173184357541899
Weather:  0.8690254500310366
Food & Drink:  0.8069522036002483
Reference:  0.5586592178770949
Business:  0.5276225946617008
Book:  0.4345127250155183
Navigation:  0.186219739292365
Medical:  0.186219739292365
Catalogs:  0.12414649286157665


In [171]:
display_table(play_app_data_final, 1)

FAMILY:  18.907942238267147
GAME:  9.724729241877256
TOOLS:  8.461191335740072
BUSINESS:  4.591606498194946
LIFESTYLE:  3.9034296028880866
PRODUCTIVITY:  3.892148014440433
FINANCE:  3.7003610108303246
MEDICAL:  3.531137184115524
SPORTS:  3.395758122743682
PERSONALIZATION:  3.3167870036101084
COMMUNICATION:  3.2378158844765346
HEALTH_AND_FITNESS:  3.0798736462093865
PHOTOGRAPHY:  2.944494584837545
NEWS_AND_MAGAZINES:  2.7978339350180503
SOCIAL:  2.6624548736462095
TRAVEL_AND_LOCAL:  2.33528880866426
SHOPPING:  2.2450361010830324
BOOKS_AND_REFERENCE:  2.1435018050541514
DATING:  1.861462093862816
VIDEO_PLAYERS:  1.7937725631768955
MAPS_AND_NAVIGATION:  1.3989169675090252
FOOD_AND_DRINK:  1.2409747292418771
EDUCATION:  1.1620036101083033
ENTERTAINMENT:  0.9589350180505415
LIBRARIES_AND_DEMO:  0.9363718411552346
AUTO_AND_VEHICLES:  0.9250902527075812
HOUSE_AND_HOME:  0.8235559566787004
WEATHER:  0.8009927797833934
EVENTS:  0.7107400722021661
PARENTING:  0.6543321299638989
ART_AND_DESIGN:  

In [172]:
display_table(play_app_data_final, 9)

Tools:  8.449909747292418
Entertainment:  6.069494584837545
Education:  5.347472924187725
Business:  4.591606498194946
Productivity:  3.892148014440433
Lifestyle:  3.892148014440433
Finance:  3.7003610108303246
Medical:  3.531137184115524
Sports:  3.463447653429603
Personalization:  3.3167870036101084
Communication:  3.2378158844765346
Action:  3.1024368231046933
Health & Fitness:  3.0798736462093865
Photography:  2.944494584837545
News & Magazines:  2.7978339350180503
Social:  2.6624548736462095
Travel & Local:  2.3240072202166067
Shopping:  2.2450361010830324
Books & Reference:  2.1435018050541514
Simulation:  2.0419675090252705
Dating:  1.861462093862816
Arcade:  1.8501805054151623
Video Players & Editors:  1.7712093862815883
Casual:  1.7599277978339352
Maps & Navigation:  1.3989169675090252
Food & Drink:  1.2409747292418771
Puzzle:  1.128158844765343
Racing:  0.9927797833935018
Role Playing:  0.9363718411552346
Libraries & Demo:  0.9363718411552346
Auto & Vehicles:  0.9250902527075

---
Calculate the average number of user ratings per app genre on the App Store data

In [177]:
# First approach
app_ratings_totals = {}
app_genre_totals = {}

# The header row was already dropped by the free-app filter (its price is
# 'price', not '0'), so iterate over every row instead of skipping the first
for app in apple_app_data_final:
    genre = app[12]
    rating_count_tot = int(app[6])

    if genre in app_ratings_totals:
        app_ratings_totals[genre] += rating_count_tot
    else:
        app_ratings_totals[genre] = rating_count_tot

    if genre in app_genre_totals:
        app_genre_totals[genre] += 1
    else:
        app_genre_totals[genre] = 1

avg_app_ratings = {}
for genre in app_ratings_totals:
    avg_app_ratings[genre] = app_ratings_totals[genre] / app_genre_totals[genre]

for genre in avg_app_ratings:
    print(genre + ':  ' + str(avg_app_ratings[genre]))


Productivity:  21028.410714285714
Weather:  52279.892857142855
Shopping:  26919.690476190477
Reference:  74942.11111111111
Finance:  31467.944444444445
Music:  57326.530303030304
Utilities:  18684.456790123455
Travel:  28243.8
Social Networking:  71548.34905660378
Sports:  23008.898550724636
Health & Fitness:  23298.015384615384
Games:  22788.6696905016
Food & Drink:  33333.92307692308
News:  21248.023255813954
Book:  39758.5
Photo & Video:  28441.54375
Entertainment:  14029.830708661417
Business:  7491.117647058823
Lifestyle:  16485.764705882353
Education:  7003.983050847458
Navigation:  86090.33333333333
Medical:  612.0
Catalogs:  4004.0


In [179]:
# Second approach

genres_apple = freq_table(apple_app_data_final, 12)

for genre in genres_apple:
    total = 0    # sum of number of user ratings
    len_genre = 0    # number of apps specific to each genre
    
    # apple_app_data_final has no header row, so every row is an app
    for app2 in apple_app_data_final:
        genre_app = app2[12]
        if genre_app == genre:
            num_ratings = int(app2[6])
            total += num_ratings
            len_genre += 1

    avg_num_ratings = total / len_genre
    print(genre + ':  ' + str(avg_num_ratings))


Productivity:  21028.410714285714
Weather:  52279.892857142855
Shopping:  26919.690476190477
Reference:  74942.11111111111
Finance:  31467.944444444445
Music:  57326.530303030304
Utilities:  18684.456790123455
Travel:  28243.8
Social Networking:  71548.34905660378
Sports:  23008.898550724636
Health & Fitness:  23298.015384615384
Games:  22788.6696905016
Food & Drink:  33333.92307692308
News:  21248.023255813954
Book:  39758.5
Photo & Video:  28441.54375
Entertainment:  14029.830708661417
Business:  7491.117647058823
Lifestyle:  16485.764705882353
Education:  7003.983050847458
Navigation:  86090.33333333333
Medical:  612.0
Catalogs:  4004.0
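
Both approaches print the genres in dictionary order, which makes the largest averages hard to spot. A possible third variant, sketched here with `collections.defaultdict`, returns the averages sorted from highest to lowest. The function name `average_by_genre` and the sample rows are hypothetical; for the Apple dataset the call would presumably be `average_by_genre(apple_app_data_final, 12, 6)`.

```python
from collections import defaultdict

def average_by_genre(dataset, genre_idx, count_idx):
    """Return (genre, average count) pairs sorted by average, descending."""
    totals = defaultdict(int)    # genre -> sum of rating counts
    counts = defaultdict(int)    # genre -> number of apps
    for row in dataset:
        genre = row[genre_idx]
        totals[genre] += int(row[count_idx])
        counts[genre] += 1
    averages = {genre: totals[genre] / counts[genre] for genre in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sample: genre at index 0, rating count at index 1
sample = [['Maps', '100'], ['Chat', '50'], ['Maps', '300']]
ranked = average_by_genre(sample, 0, 1)
for genre, avg in ranked:
    print(genre + ':  ' + str(avg))
```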
