## Insights on Revenue from Free Apps on the App Store and Google Play

As data analysts at our company, our goal is to understand how our free apps generate revenue through ads. Our company builds only free apps, so we rely on ad revenue to sustain the business.

The objective of this project is to identify which kinds of free apps on the market are likely to be profitable for our company. Our team of data analysts will explore the data to help our developers make data-driven decisions and identify the kinds of apps that attract users.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [2]:
from csv import reader

# Apple iOS apps dataset #
# encoding='utf8' avoids decode errors on app names with non-ASCII characters
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios_apps = list(read_file)
opened_file.close()
ios_apps_header = ios_apps[0]
ios_apps = ios_apps[1:]

# Google Android apps dataset #
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android_apps = list(read_file)
opened_file.close()
android_apps_header = android_apps[0]
android_apps = android_apps[1:]


In [3]:
print(ios_apps_header)
print('\n')
explore_data(ios_apps, 1, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


In [4]:
print(android_apps_header)
print('\n')
explore_data(android_apps, 1, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13


From an initial look, the columns in the iOS dataset that are useful for our analysis are `track_name`, `currency`, `price`, `rating_count_tot`,
`rating_count_ver`, and `prime_genre`. The column descriptions can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

For the Android dataset, the useful columns are `App`, `Category`, `Rating`, `Reviews`, `Installs`, `Type`, `Price`, and
`Genres`. Information about these columns can also be found [here](https://www.kaggle.com/lava18/google-play-store-apps/home).
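Later cells refer to these columns by position (for example `app[5]` or `app[-5]`). As a minimal sketch, the positions can be looked up from the header rows with `list.index`. The two header lists are copied from the output above, and the variable names `ios_header` and `android_header` are only illustrative.

```python
# Header rows copied from the printed output above.
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

# list.index returns the position of the first matching column name.
ios_genre_idx = ios_header.index('prime_genre')
ios_ratings_idx = ios_header.index('rating_count_tot')
android_installs_idx = android_header.index('Installs')
android_genres_idx = android_header.index('Genres')

print(ios_genre_idx, ios_ratings_idx)            # 11 5
print(android_installs_idx, android_genres_idx)  # 5 9
```

Counted from the end of a row, index 11 of a 16-column iOS row is `-5`, and index 9 of a 13-column Android row is `-4`, which is why both forms appear in this notebook.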

In [5]:
# This row has only 12 values: 'Category' is missing, so the later columns are shifted left
android_apps[10472]

['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']

In [6]:
# Delete the malformed row (run this cell only once, or a valid row will be removed)
del android_apps[10472]

# Removing duplicate entries
## Part One
Inspecting the Android dataset shows that some apps appear more than once.

In [7]:
android_duplicate_app = []
android_unique_app = []

for app in android_apps:
    name = app[0]
    if name in android_unique_app:
        android_duplicate_app.append(name)
    else:
        android_unique_app.append(name)
        
print('Number of duplicate android apps: ', len(android_duplicate_app))
print('\n')
print('Examples of duplicate android apps: ', android_duplicate_app[:10])




Number of duplicate android apps:  1181


Examples of duplicate android apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


## Part Two

After identifying the duplicate entries, the next step is to remove the ones we will not use for analysis. For example, Instagram has multiple entries in this dataset because the data was collected at different times. Rather than removing entries at random, we keep the entry with the highest number of reviews, on the assumption that it is the most recent one.

In [8]:
reviews_max = {}
for app in android_apps:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print('Expected length: ', len(android_apps) -  len(android_duplicate_app))
print('Length of Updated apps: ', len(reviews_max))

android_clean = []
already_added = []

for app in android_apps:
    name = app[0]
    n_reviews = float(app[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
    


Expected length:  9659
Length of Updated apps:  9659


To remove the duplicates, I first built a dictionary, `reviews_max`, that maps each unique app name to the highest review count found for that name. The loop compares the review count of each row with the stored value and keeps the larger one. A name that is not yet in the dictionary is added as a new key.

In the second step, I created two lists: `android_clean`, which stores the cleaned dataset, and `already_added`, which tracks the names that have already been stored. For each row in the original dataset, the review count is compared with the maximum in `reviews_max`. If they match and the name is not yet in `already_added`, the row is appended to `android_clean` and its name to `already_added`. This second check prevents repeats when several rows for the same app share the maximum review count.
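As a compact illustration of the same idea, the sketch below keeps the row with the highest review count per app name in a single pass. The function name `dedupe_by_max_reviews` and the sample rows are made up for this example and are not part of the dataset.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Replace the stored row only when this one has strictly more reviews,
        # so ties keep the first occurrence.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Toy rows (name, category, rating, reviews); values are invented.
sample_rows = [
    ['App A', 'TOOLS', '4.0', '100'],
    ['App B', 'GAME', '4.5', '50'],
    ['App A', 'TOOLS', '4.1', '250'],
    ['App A', 'TOOLS', '4.1', '250'],
]

deduped = dedupe_by_max_reviews(sample_rows)
print(len(deduped))  # 2
print(deduped)
```

Storing the whole row in the dictionary avoids the second loop and the `already_added` list, at the cost of departing from the two-step approach used in this notebook.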

In [9]:
## Return False when a string has more than three non-ASCII characters
## (code point above 127). Allowing up to three keeps English names
## that include symbols such as ™ or an emoji.
def english_check(a_string):
    non_ascii_count = 0
    for character in a_string:
        if ord(character) > 127:
            non_ascii_count += 1
    return non_ascii_count <= 3

print(english_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_check('Docs To Go™ Free Office Suite'))
print(english_check('Instachat 😜'))

False
True
True


In [10]:
## Removing non-english apps from android_clean ##
english_android_apps = []
for app in android_clean:
    name = app[0]
    if english_check(name):
        english_android_apps.append(app)
print(len(english_android_apps))
explore_data(english_android_apps, 0, 2, True)

9614
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9614
Number of columns: 13


In [11]:
## Removing non-english apps from ios_apps ##
english_ios_apps = []
for app in ios_apps:
    name = app[1]
    if english_check(name):
        english_ios_apps.append(app)
print(len(english_ios_apps))
explore_data(english_ios_apps, 0, 2, True)

6183
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 6183
Number of columns: 16


In [12]:
## Isolating the free apps from the android dataset ##
free_eng_android = []
for app in english_android_apps:
    price = app[7]
    if price == '0':
        free_eng_android.append(app)
        
explore_data(free_eng_android, 0, 2, True)
        
    

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 8864
Number of columns: 13


In [13]:
## Isolating the free apps from the ios dataset ##
free_eng_ios = []
for app in english_ios_apps:
    price = app[4]
    if price == '0.0':
        free_eng_ios.append(app)
        
explore_data(free_eng_ios, 0 , 2, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 3222
Number of columns: 16


To maximize revenue from free apps, we need to minimize the risk and cost of building them. The idea is to build a minimal Android version of an app and publish it on Google Play. If the app receives a positive response from users, we develop it further. If it is still doing well after six months or so, we develop an iOS version and add it to the App Store. Our end goal is a profitable app on both the iOS and Android stores. Let's begin by looking at the most common genres on both the App Store and Google Play.

In [14]:
## A function to create frequency table to observe ##
## most popular apps by genre ##
def freq_table(dataset, index):
    table = {}
    total = 0
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
            
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
    
    return table_percentages

## A function to display a frequency table ##
def display_table (dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [15]:
## Run the iOS dataset through the functions we just created ##
## iOS - Prime Genre Column ##
display_table(free_eng_ios, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Among the free English apps on the iOS App Store, the most common genre is `Games` (about 58% of apps), followed by `Entertainment` (about 8%). Most of these apps are built for entertainment, while apps designed for practical purposes, such as `Education`, `Shopping`, `Utilities`, and `Productivity`, make up a much smaller share.

However, a genre that contains many apps does not necessarily have many users per app. In a crowded genre such as games, the number of users of any single app may also fluctuate because of competition from other apps in the market.

In [16]:
## Run the Android dataset through the functions we just created ##
## Android - Category Column ##
display_table(free_eng_android, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

Among the free English Android apps, the most common categories are `FAMILY`, `GAME`, and `TOOLS`. `FAMILY` is a vague label and could largely mean games for kids. Compared to the App Store, Google Play has a larger share of apps built for practical purposes.

In [17]:
## Android - Genres Column ##
display_table(free_eng_android, -4)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

The difference between the `Category` and `Genres` columns is not obvious. The main difference is that `Genres` has many more labels. For now, we will use the `Category` column to get the big picture.
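One likely reason `Genres` has more labels is that a single value can combine several labels separated by a semicolon, as in `Art & Design;Pretend Play` in the rows printed earlier. The sketch below, with the hypothetical helper `count_genre_labels`, splits such values and counts each label on its own. The sample values are the `Genres` entries of the four Android rows shown near the top of this notebook.

```python
from collections import Counter

def count_genre_labels(genre_values):
    """Count individual labels in ';'-separated genre strings."""
    counts = Counter()
    for value in genre_values:
        for label in value.split(';'):
            counts[label.strip()] += 1
    return counts

# `Genres` values copied from the four Android rows printed earlier.
sample_genres = [
    'Art & Design;Pretend Play',
    'Art & Design',
    'Art & Design',
    'Art & Design;Creativity',
]

label_counts = count_genre_labels(sample_genres)
print(label_counts['Art & Design'])  # 4
print(label_counts['Pretend Play'])  # 1
```

On the full dataset, the input could be built with something like `[app[-4] for app in free_eng_android]`.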

So far, we have observed that the App Store contains more apps designed for fun and entertainment, while the Google Play store has a mixture of both fun and practical apps.

# Most Popular Apps by Genre on the App Store

One way to find the most popular genre is to look at the number of installs per app. The App Store dataset has no installs column, so we use the total number of user ratings, `rating_count_tot`, as a proxy.


In [18]:
genres_ios = freq_table(free_eng_ios, -5)

for genre in genres_ios:
    total = 0 # store sum of user ratings
    len_genre = 0 # store the number of apps specific to each genre
    for app in free_eng_ios:
        genre_app = app[-5]
        if genre_app == genre:
            number_rating = float(app[5])
            total += number_rating
            len_genre += 1
    avg_number_rating = total / len_genre
    print(genre, ':', avg_number_rating)

Social Networking : 71548.34905660378
Sports : 23008.898550724636
Medical : 612.0
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Finance : 31467.944444444445
Book : 39758.5
News : 21248.023255813954
Lifestyle : 16485.764705882353
Music : 57326.530303030304
Navigation : 86090.33333333333
Travel : 28243.8
Entertainment : 14029.830708661417
Catalogs : 4004.0
Business : 7491.117647058823
Games : 22788.6696905016
Health & Fitness : 23298.015384615384
Photo & Video : 28441.54375
Shopping : 26919.690476190477
Food & Drink : 33333.92307692308
Reference : 74942.11111111111
Weather : 52279.892857142855
Education : 7003.983050847458


Based on the average number of ratings per app, `Navigation` apps receive the most user ratings (about 86,090 on average). This high count could be influenced by GPS-type navigation apps. The genre with the second-highest average is `Reference`, which includes informative apps such as `Google Translate`.
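To check whether a handful of very popular apps drive a genre's average, one option is to list the apps in that genre ordered by `rating_count_tot`. The helper `top_apps_in_genre` below is a hypothetical name, and the sample rows are copied from the output earlier in this notebook. On the real data, `free_eng_ios` and a genre such as `'Navigation'` would be passed instead.

```python
def top_apps_in_genre(rows, genre, n=3, name_idx=1, ratings_idx=5, genre_idx=11):
    """Return up to n (name, rating count) pairs for a genre, largest first."""
    matches = [(row[name_idx], int(row[ratings_idx]))
               for row in rows if row[genre_idx] == genre]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)[:n]

# Rows copied from the iOS output printed earlier in this notebook.
sample_rows = [
    ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212',
     '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'],
    ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842',
     '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'],
    ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579',
     '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'],
]

top_games = top_apps_in_genre(sample_rows, 'Games')
for name, n_ratings in top_games:
    print(name, ':', n_ratings)
# Clash of Clans : 2130805
# Temple Run : 1724546
```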

# Most Popular Apps by Genre on the Google Play Store

In [28]:
categories_android = freq_table(free_eng_android, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in free_eng_android:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

MAPS_AND_NAVIGATION : 4056941.7741935486
SHOPPING : 7036877.311557789
HEALTH_AND_FITNESS : 4188821.9853479853
DATING : 854028.8303030303
BUSINESS : 1712290.1474201474
PARENTING : 542603.6206896552
PERSONALIZATION : 5201482.6122448975
BOOKS_AND_REFERENCE : 8767811.894736841
PHOTOGRAPHY : 17840110.40229885
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
SOCIAL : 23253652.127118643
EVENTS : 253542.22222222222
GAME : 15588015.603248259
LIBRARIES_AND_DEMO : 638503.734939759
TOOLS : 10801391.298666667
SPORTS : 3638640.1428571427
BEAUTY : 513151.88679245283
FAMILY : 3695641.8198090694
VIDEO_PLAYERS : 24727872.452830188
EDUCATION : 1833495.145631068
PRODUCTIVITY : 16787331.344927534
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
MEDICAL : 120550.61980830671
TRAVEL_AND_LOCAL : 13984077.710144928
FOOD_AND_DRINK : 1924897.7363636363
WEATHER : 5074486.197183099
HOUSE_AND_HOME : 1331540.5616438356
ART_AND_DESIGN : 1986335.0877192982
COMMUNICATION : 38456119.167247385
EN

On average, the most-installed `Category` is `COMMUNICATION`, with about 38,456,119 installs per app.
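An average like this can be pulled upward by a few apps with an extremely large number of installs. As a rough sketch of how to check that, the code below recomputes an average with and without apps at or above a cutoff. The names `parse_installs` and `average_installs` and the `cap` value are assumptions made for this example, and the three install strings simply follow the format of the `Installs` column.

```python
def parse_installs(installs):
    """Convert an install string such as '1,000,000+' to a float."""
    return float(installs.replace('+', '').replace(',', ''))

def average_installs(install_strings, cap=None):
    """Average the installs, optionally ignoring values at or above `cap`."""
    values = [parse_installs(s) for s in install_strings]
    if cap is not None:
        values = [v for v in values if v < cap]
    # Return 0.0 rather than dividing by zero when nothing is left.
    return sum(values) / len(values) if values else 0.0

# Illustrative install strings in the format used by the `Installs` column.
sample_installs = ['1,000,000,000+', '5,000,000+', '1,000,000+']

avg_all = average_installs(sample_installs)
avg_capped = average_installs(sample_installs, cap=100_000_000)
print(avg_all)     # roughly 335 million
print(avg_capped)  # 3000000.0
```

The cutoff of 100 million is arbitrary. The point is only to compare the two averages, not to pick a definitive threshold.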

In [34]:
for app in free_eng_android:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+' 
                                      or app[5] == '5,000,000+' 
                                      or app[5] == '1,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
My Tele2 : 5,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
TracFone My Account : 1,000,000+
Skype Lite - Free Video Call & Chat : 5,000,000+
My magenta : 1,000,000+
Seznam.cz : 1,000,000+
My Vodacom SA : 5,000,000+
Microsoft Edge : 5,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
Calls & Text by Mo+ : 5,000,000+
Skype - free IM & video calls : 1,000,000,000+
Messaging+ SMS, MMS Free : 1,000,000+
mysms SMS Text Messaging Sync : 1,000,000+
2ndLine - Second Phone Number : 1,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Ninesky Browser : 1,000,000+
Ghostery Privacy Browser : 1,000,000+
InBrowser - Incognito Browsing : 1,000,000+
PHONE for Google Voice & GTalk : 1,000,000+
Safest Call Blocker : 1,000,000+
Full Screen Caller ID : 5,000,000+
Should I Answer? : 1,000,000+
RocketDial Dialer & Contacts : 1,000,000+
CIA - Caller ID & Call Blocker : 5,000,000+
Call Control - Call Blocker : 5,000,000+