# Analyze profitable mobile apps

In this project, we analyze the characteristics of profitable mobile apps: apps that are free to download and earn their revenue from in-app ads.

Let's define a function to explore a dataset without a header row:

In [70]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [71]:
from csv import reader

# Read each CSV file fully; the `with` block closes the file afterwards.
# encoding='utf8' is an assumption about how the files are stored.
with open('AppleStore.csv', encoding='utf8') as opened_file_ios:
    ios = list(reader(opened_file_ios))
with open('googleplaystore.csv', encoding='utf8') as opened_file_google:
    google = list(reader(opened_file_google))

columns_ios = ios[0] # save ios columns
columns_google = google[0] # save google columns
ios = ios[1:] # save dataset without columns
google = google[1:] # save dataset without columns

In [72]:
explore_data(ios,0,5) # first 5 rows without header

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']




In [73]:
print(columns_google) # column names of google dataset

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [74]:
print(google[10472])
del google[10472] # remove erroneous row (its 'Category' value is missing)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
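
The row printed above has 12 values while the header has 13, which is why it was removed. A minimal sketch of how such rows could be found automatically is shown here; `find_malformed_rows` and the miniature dataset are hypothetical and only illustrate the idea of comparing each row's length with the header's.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Hypothetical miniature dataset: the second row is missing one field.
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'GAME', '4.5'],
    ['App B', '1.9'],
    ['App C', 'TOOLS', '4.0'],
]

for index, row in find_malformed_rows(header, rows):
    print(index, row)
```

With the real data, the call would presumably be `find_malformed_rows(columns_google, google)`.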


Next, we look for duplicated app names in the Google Play dataset.

In [85]:
duplicate_apps = []
unique_apps = []

for app in google:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps', duplicate_apps[:5])

Number of duplicate apps: 1181


Examples of duplicate apps ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']
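
Checking `name in unique_apps` scans a list on every iteration, which gets slow as the dataset grows. As an alternative sketch (not the approach used in this notebook), `collections.Counter` can count each name in one pass. The rows below are made up for illustration.

```python
from collections import Counter

# Hypothetical rows; only the first column (the app name) matters here.
rows = [
    ['Box', 'BUSINESS'],
    ['Slack', 'BUSINESS'],
    ['Box', 'BUSINESS'],
    ['Box', 'BUSINESS'],
]

name_counts = Counter(row[0] for row in rows)

# Every occurrence beyond the first counts as one duplicate row,
# which matches how the loop above fills duplicate_apps.
n_duplicates = sum(count - 1 for count in name_counts.values())
duplicated_names = [name for name, count in name_counts.items() if count > 1]

print('Number of duplicate rows:', n_duplicates)
print('Duplicated names:', duplicated_names)
```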


Next, we remove the duplicate rows. For each app, we keep only the entry with the highest number of reviews (presumably the most recent one) and drop the others.

First, create a dictionary where each key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.

In [80]:
reviews_max = {}
for app in google:
    name = app[0]
    n_reviews = float(app[3])
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print('Non-duplicated apps: ' + str(len(reviews_max))) # expected is 9659 entries

Non-duplicated apps: 9659

Then, use the dictionary to build a cleaned list that keeps, for each app, only the row whose number of reviews matches the maximum.


In [81]:
android_clean = []
already_added = []
for app in google:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
print('Non-duplicated apps: ' +str(len(android_clean))) # must be 9659 rows

Non-duplicated apps: 9659
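
The two steps above can also be combined into a single pass. The sketch below is one possible variant, not the method used in this notebook: the hypothetical helper `keep_most_reviewed` stores the best row seen so far for each name. It relies on dictionaries preserving insertion order (Python 3.7+), so the result is ordered by each name's first appearance, which can differ from the order produced by the cells above.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts are tied.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Hypothetical rows laid out as [App, Category, Rating, Reviews].
rows = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Notes', 'TOOLS', '4.0', '120'],
    ['Box', 'BUSINESS', '4.2', '159875'],
]

for row in keep_most_reviewed(rows):
    print(row)
```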


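Next, we remove apps whose names are not in English. The built-in `ord()` function returns the number (code point) corresponding to a character, and the characters commonly used in English text all fall in the range **0** to **127**. The function below treats a name as non-English once it contains more than three characters outside that range.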

In [None]:
def check_string(string):
    count = 0
    for character in string:
        if ord(character) > 127:
            count += 1
            if count > 3:
                return False
    return True

In [82]:
non_english_apps = []
english_apps = []
for app in android_clean:
    name = app[0]
    result = check_string(name)
    if result:
        english_apps.append(app)
    else: 
        non_english_apps.append(app)

print('English apps: ' + str(len(english_apps)))
print('Non-english apps: ' +str(len(non_english_apps)))

English apps: 9614
Non-english apps: 45

Next, we isolate the free apps in each dataset.


In [100]:
free_android_apps = []
non_free_android_apps = []
for app in english_apps:
    app_type = app[6] # index 6 is the 'Type' column, not 'Price'
    if app_type == 'Free':
        free_android_apps.append(app)
    else:
        non_free_android_apps.append(app)
print('Free english android apps: ' + str(len(free_android_apps)))
print('Non-free english android apps: ' +str(len(non_free_android_apps)))

Free english android apps: 8863
Non-free english android apps: 751


In [106]:
free_ios_apps = []
non_free_ios_apps = []
for app in ios:
    price = float(app[4])
    if price == 0.0:
        free_ios_apps.append(app)
    else:
        non_free_ios_apps.append(app)
# Note: the ios dataset was not filtered with check_string, so these counts cover all iOS apps.
print('Free ios apps: ' + str(len(free_ios_apps)))
print('Non-free ios apps: ' + str(len(non_free_ios_apps)))

Free ios apps: 4056
Non-free ios apps: 3141


To minimize risk and overhead, our validation strategy for an app idea consists of three steps:
1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

We want to find an app profile that fits both the App Store and Google Play, so we'll build a frequency table to see the most common genres.

In [90]:
print(columns_ios) 
print(columns_google)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


The **genre** columns are:
- `prime_genre` in iOS (index 11)
- `Category` and `Genres` in Android (indexes 1 and 9)
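
Rather than counting positions by hand, the indexes can be looked up from the header rows with `list.index`. This is a small sketch using the two headers printed above, copied into the block so that it runs on its own.

```python
# Header rows copied from the output above.
columns_ios = ['id', 'track_name', 'size_bytes', 'currency', 'price',
               'rating_count_tot', 'rating_count_ver', 'user_rating',
               'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
               'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
columns_google = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

# list.index returns the position of the first matching element.
ios_genre_index = columns_ios.index('prime_genre')
android_category_index = columns_google.index('Category')
android_genres_index = columns_google.index('Genres')

print(ios_genre_index, android_category_index, android_genres_index)  # 11 1 9
```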

In [92]:
def freq_table(dataset, index):
    table = {} # maps each value in the column to its number of occurrences
    for app in dataset:
        value = app[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    return table

In [93]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
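
Because the two stores contain different numbers of apps, raw counts are hard to compare across them. One possible variant, sketched here with a hypothetical `freq_table_percent` helper and made-up rows, expresses each count as a percentage of the dataset and sorts by that share.

```python
def freq_table_percent(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = {}
    for row in dataset:
        value = row[index]
        counts[value] = counts.get(value, 0) + 1
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}


# Hypothetical rows laid out as [name, genre].
rows = [['A', 'Games'], ['B', 'Games'], ['C', 'Music'], ['D', 'Games']]

table = freq_table_percent(rows, 1)
# Sort by share, largest first, without building (value, key) tuples.
for genre, share in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', round(share, 1))
```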

In [107]:
display_table(free_ios_apps,11) # display_table prints its rows and returns None, so no print() is needed

Games : 2257
Entertainment : 334
Photo & Video : 167
Social Networking : 143
Education : 132
Shopping : 121
Utilities : 109
Lifestyle : 94
Finance : 84
Sports : 79
Health & Fitness : 76
Music : 67
Book : 66
Productivity : 62
News : 58
Travel : 56
Food & Drink : 43
Weather : 31
Reference : 20
Navigation : 20
Business : 20
Catalogs : 9
Medical : 8


In [108]:
display_table(free_android_apps,1) # display_table prints its rows and returns None, so no print() is needed

FAMILY : 1675
GAME : 862
TOOLS : 750
BUSINESS : 407
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 313
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 124
FOOD_AND_DRINK : 110
EDUCATION : 103
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 73
WEATHER : 71
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 57
COMICS : 55
BEAUTY : 53


In [109]:
display_table(free_android_apps,9) # display_table prints its rows and returns None, so no print() is needed

Tools : 749
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 345
Finance : 328
Medical : 313
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181
Dating : 165
Arcade : 164
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 124
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 80
House & Home : 73
Weather : 71
Events : 63
Adventure : 60
Comics : 54
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 40
Casino : 38
Trivia : 37
Educational;Education : 35
Board : 34
Educational : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 12
Arcade;Action & Advent

Most common **genres** among free apps in iOS and in Android (`Category` and `Genres` columns):

|iOS (`prime_genre`)|Android (`Category`)|Android (`Genres`)|
| --- | --- | --- |
|Games|Family|Tools|
|Entertainment|Game|Entertainment|
|Photo & Video|Tools|Education|

Next, we want to know which kinds of apps attract the most users, so we calculate the average number of installs for each app genre. The Android dataset has an `Installs` column, but the iOS dataset does not, so for iOS we use the total number of user ratings, found in the `rating_count_tot` column, as a proxy.

In [114]:
freq_table_ios = freq_table(free_ios_apps,11)
for genre in freq_table_ios:
    total = 0
    len_genre = 0
    for app in free_ios_apps:
        genre_app = app[11]
        if genre_app == genre:
            user_ratings = float(app[5])
            total += user_ratings
            len_genre += 1
    print(genre + ': ' + str(round(total/len_genre)))

Entertainment: 10823
Travel: 20216
Health & Fitness: 19952
Productivity: 19054
Book: 8498
Utilities: 14010
Lifestyle: 8978
Navigation: 25972
News: 15893
Music: 56482
Reference: 67448
Finance: 13522
Weather: 47221
Social Networking: 53078
Education: 6266
Games: 18925
Photo & Video: 27250
Catalogs: 1780
Food & Drink: 20179
Sports: 20129
Business: 6368
Shopping: 18747
Medical: 460
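
The nested loop above rescans the whole dataset once per genre. As a sketch of an alternative (with a hypothetical `average_by_group` helper and made-up rows), the same averages can be accumulated in a single pass by keeping a running total and count for each genre.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} in one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical rows laid out as [name, genre, rating_count].
rows = [
    ['A', 'Games', '100'],
    ['B', 'Games', '300'],
    ['C', 'Music', '50'],
]

print(average_by_group(rows, 1, 2))
```

With the real data, the call would presumably be `average_by_group(free_ios_apps, 11, 5)`.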


Based on the average number of user ratings (our proxy for installs), the recommended profile for free App Store apps is the **Reference** genre, which includes apps like `Wikipedia` or `Google Translate`.

Now let's examine the Google Play Store:

In [117]:
freq_table_android_1 = freq_table(free_android_apps,1)
for category in freq_table_android_1:
    total = 0
    len_category = 0
    for app in free_android_apps:
        category_app = app[1]
        if category_app == category:
            installs = app[5]
            installs = installs.replace('+','')
            installs = installs.replace(',','')
            installs = float(installs)
            total += installs
            len_category += 1
    print(category + ': ' + str(round(total/len_category)))

BOOKS_AND_REFERENCE: 8767812
TOOLS: 10801391
FOOD_AND_DRINK: 1924898
FINANCE: 1387692
HOUSE_AND_HOME: 1331541
ENTERTAINMENT: 11640706
GAME: 15588016
AUTO_AND_VEHICLES: 647318
BEAUTY: 513152
ART_AND_DESIGN: 1986335
PARENTING: 542604
SOCIAL: 23253652
EDUCATION: 1833495
PRODUCTIVITY: 16787331
SPORTS: 3638640
NEWS_AND_MAGAZINES: 9549178
FAMILY: 3697848
COMMUNICATION: 38456119
MEDICAL: 120551
COMICS: 817657
VIDEO_PLAYERS: 24727872
MAPS_AND_NAVIGATION: 4056942
PHOTOGRAPHY: 17840110
BUSINESS: 1712290
TRAVEL_AND_LOCAL: 13984078
SHOPPING: 7036877
WEATHER: 5074486
PERSONALIZATION: 5201483
LIBRARIES_AND_DEMO: 638504
DATING: 854029
EVENTS: 253542
HEALTH_AND_FITNESS: 4188822
LIFESTYLE: 1437816
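
The install counts are stored as strings such as `'1,000+'`, so the loop above strips the `+` and `,` characters before converting. A small sketch of that conversion as a reusable function follows; the name `parse_installs` is an assumption for illustration. Note that the result is a lower bound, since `'1,000+'` means at least 1,000 installs.

```python
def parse_installs(installs):
    """Convert a Google Play install bucket such as '1,000+' to a float."""
    return float(installs.replace('+', '').replace(',', ''))


print(parse_installs('1,000+'))      # 1000.0
print(parse_installs('1,000,000+'))  # 1000000.0
```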


Based on the average number of installs, the recommended profile for free Google Play apps is the **Communication** category, which includes apps like `WhatsApp Messenger` or `Facebook Messenger`.