# Data analysis to determine free apps capable of generating profitable ad-based revenue

## Description:
In this project, a data set containing Android and iOS applications will be analysed. We will concentrate specifically on free, ad-based applications to determine what kinds of apps generate better profits.

## Goal:
The goal of this project is to determine which free, ad-based apps attract more users and generate maximum profits.

In [1]:
from csv import reader

# An explicit encoding avoids platform-dependent errors on non-ASCII app names
opened_apple_store = open('AppleStore.csv', encoding='utf8')
read_apple_store = reader(opened_apple_store)
app_store = list(read_apple_store)


opened_play_store = open('googleplaystore.csv', encoding='utf8')
read_play_store = reader(opened_play_store)
play_store = list(read_play_store)
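
The loading steps above can also be wrapped in a small helper. This is a minimal sketch rather than the notebook's original code: `load_csv` is a hypothetical name, and UTF-8 is assumed for both files. The `with` block closes the file even if reading fails.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # newline='' is what the csv module documentation recommends for file objects
    with open(path, encoding=encoding, newline='') as f:
        data = list(reader(f))
    return data[0], data[1:]

# Example usage (assumes the files sit next to the notebook):
# ios_header, ios_rows = load_csv('AppleStore.csv')
# android_header, android_rows = load_csv('googleplaystore.csv')
```

Returning the header separately removes the need for the `[1:]` slices used throughout the notebook, at the cost of keeping two variables per data set.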

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset[1:]))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(app_store, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


In [4]:
explore_data(play_store, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


#### What we can deduce from the information above:

**A. APP STORE:**
The data set has 7197 entries with 16 parameters. The parameters of most interest for our analysis would be `'id'`, `'track_name'`, `'price'`, `'rating_count_tot'`, `'prime_genre'` and `'rating_count_ver'`.

**B. PLAY STORE:**
The data set has 10841 entries with 13 parameters. The parameters of most interest for our analysis would be `'App'`, `'Category'`, `'Reviews'`, `'Installs'`, `'Price'` and `'Genres'`.

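#### Below we examine entry 10473 for an error in the data:

The row at index 10473 has only 12 values against the header's 13: the `'Category'` value is missing, so every later value is shifted one column to the left. We print it next to the header and then delete it.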

In [5]:
print(play_store[0])
print('\n')
print(play_store[10473])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [6]:
del play_store[10473]
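
Rather than relying on a known row number, rows like this one can be found by comparing each row's length to the header's. A minimal sketch, assuming the first row is the header; `find_malformed_rows` and the sample data are hypothetical and not part of the original notebook:

```python
def find_malformed_rows(dataset):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(dataset[0])  # dataset[0] is assumed to be the header row
    return [i for i, row in enumerate(dataset) if len(row) != expected]

sample = [
    ['App', 'Category', 'Rating'],
    ['Demo A', 'TOOLS', '4.1'],
    ['Demo B', '1.9'],  # 'Category' is missing, so the row is one value short
]
print(find_malformed_rows(sample))  # [2]
```

If several rows were reported, deleting them from the highest index down would keep the remaining indices valid.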

#### We will now examine the Google Play Store data set for repeated or duplicate entries:

For example, the Instagram app has multiple entries in the data set:


In [7]:
for value in play_store[1:]:
    if value[0] == 'Instagram':
        print(value)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


**Here, we can see that Instagram alone appears four times. We will now count the total number of duplicates in the Google Play Store data set.**

In [8]:
duplicate_values = []
unique_values = []

for value in play_store[1:]:
    if value[0] in unique_values:
        duplicate_values.append(value[0])
    else:
        unique_values.append(value[0])

print('Total number of duplicates =', len(duplicate_values))
print('\n')
print('Total number of unique values =', len(unique_values))

Total number of duplicates = 1181


Total number of unique values = 9659


**We can see that over 10% of the rows in the data set are duplicates. We will need to remove them for better accuracy.**
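
The counting loop above checks membership in a list, which gets slower as the list grows. A minimal sketch of the same count using a set, whose lookups take roughly constant time; `count_duplicates` and the sample rows are illustrative and not from the original notebook:

```python
def count_duplicates(rows, name_index=0):
    """Return (number of duplicate rows, number of unique names)."""
    seen = set()
    duplicates = 0
    for row in rows:
        name = row[name_index]
        if name in seen:
            duplicates += 1
        else:
            seen.add(name)
    return duplicates, len(seen)

sample_rows = [['Instagram'], ['Demo'], ['Instagram'], ['Instagram']]
print(count_duplicates(sample_rows))  # (2, 2)
```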

**Consider the Instagram duplicates above:**
The one column that distinguishes the entries is `'Reviews'`, which shows a different value for most of them. Based on this, we can treat the entry with the highest number of reviews as the most recent one and discard the other duplicates. The code below aims at fulfilling this purpose:

Below we've created a `'reviews_max'` dictionary that has the app names as the keys and the maximum number of reviews as the corresponding value:

In [9]:
reviews_max = {}

for value in play_store[1:]:
    name = value[0]
    n_reviews = float(value[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print('Expected dictionary length based on our analysis = 9659')
print('Actual dictionary length =', len(reviews_max))

Expected dictionary length based on our analysis = 9659
Actual dictionary length = 9659


We now want to delete these duplicate rows. To do that, we've created a separate list `'android_clean'` that stores only the entry with the highest number of reviews for each app and discards the other duplicates.

In [10]:
android_clean = []
already_added = []

for value in play_store[1:]:
    name = value[0]
    n_reviews = float(value[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(value)
        already_added.append(name)
        
print(len(android_clean))
print(android_clean[0:2])

9659
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']]
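
The two passes above can be combined into one by storing the best row seen so far for each app name. This is a sketch under the same assumptions as the notebook (name in column 0, review count in column 3); `keep_max_reviews` is a hypothetical helper. On ties it keeps the first row encountered, like the original logic, although the result is ordered by each name's first appearance rather than by the position of the kept row.

```python
def keep_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Demo', 'TOOLS', '4.0', '12'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
for row in keep_max_reviews(sample_rows):
    print(row[0], ':', row[3])
```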


**We also want to eliminate non-English apps, as our analysis targets only an English-speaking audience.**

To do this, we've first created a function that takes an app name as input and returns whether the name consists only of English characters. Since the characters commonly used in English text have codes between 0 and 127 in the ASCII system, we use that range in our function.

In [11]:
def english_apps(app_name):
    for character in app_name:
        if ord(character) > 127:
            return False
    # Only reached if every character is within the ASCII range
    return True

We will now test our function

In [12]:
english_apps('Instagram')

True

In [13]:
english_apps('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

The function we created above eliminates most non-English apps. However, it also rejects English app names that contain symbols such as `'😜'` or `'™'`, which fall outside the ASCII range. To reduce this data loss, we will classify a name as non-English only if more than 3 of its characters fall outside the ASCII range. We'll modify the code above accordingly.

In [14]:
def english_apps(app_name):
    
    counter = 0
    
    for character in app_name:
        if ord(character) > 127:
            counter = counter + 1
            
    if counter > 3:
        return False
    else:
        return True

Let us now test our code for a few app names:

In [15]:
english_apps('Docs To Go™ Free Office Suite')

True

In [16]:
english_apps('Instachat 😜')

True

In [17]:
english_apps('爱奇艺PPS -《欢乐颂2》电视剧热播')

False
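
The same rule can be written more compactly with the threshold exposed as a parameter, which makes it easy to experiment with stricter or looser cut-offs. A minimal sketch; `is_english` and `max_non_ascii` are hypothetical names, and the default of 3 simply mirrors the rule described above:

```python
def is_english(app_name, max_non_ascii=3):
    """Return True if at most `max_non_ascii` characters fall outside ASCII."""
    non_ascii = sum(1 for character in app_name if ord(character) > 127)
    return non_ascii <= max_non_ascii

names = ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜',
         '爱奇艺PPS -《欢乐颂2》电视剧热播']
print([is_english(name) for name in names])  # [True, True, True, False]
```

This remains a heuristic: a short non-English name with three or fewer non-ASCII characters would still pass.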

We will now use the function we created to eliminate non-English apps from both data sets:

In [18]:
android_clean_1 = []

for row in android_clean:
    if english_apps(row[0]):
        android_clean_1.append(row)
        
print('Total English Apps (Android): ', len(android_clean_1))
print(android_clean_1[0:2])

Total English Apps (Android):  9614
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']]


In [19]:
ios_clean_1 = []

for row in app_store[1:]:
    if english_apps(row[1]):
        ios_clean_1.append(row)
        
print('Total English Apps iOS: ', len(ios_clean_1))
print(ios_clean_1[0:2])

Total English Apps iOS:  6183
[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']]


We will now filter the free apps, as we will only be analyzing free apps that have ad-based revenue:

In [20]:
android_clean_2 = []

for row in android_clean_1:
    if row[7] == '0':
        android_clean_2.append(row)

print('Total English Free Android Apps: ', len(android_clean_2))
print(android_clean_2[0:2])

Total English Free Android Apps:  8864
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']]


In [21]:
ios_clean_2 = []

for row in ios_clean_1:
    if float(row[4]) == 0.0:
        ios_clean_2.append(row)

print('Total English Free iOS apps: ', len(ios_clean_2))
print(ios_clean_2[0:2])

Total English Free iOS apps:  3222
[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']]


**Now that we've cleaned the data, we'll perform some analysis on it. Our strategy for validating an app idea has three steps:**

1. Build a minimal Android version of the app and add it to the Play Store.
2. Develop the app further depending on the user response.
3. If the app shows good profits after 6 months, build an iOS version of the app.

As such, we're looking for apps that will be successful on both platforms.

Let us start by finding the most common genres for each market by building frequency tables:

In [22]:
print(app_store[0])
print('\n')
print(play_store[0])

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


From the above, we can see that `'prime_genre'` and `'Genres'` are the columns of interest to us.

In [23]:
def freq_table(dataset, index):
    
    frequency_table = {}
    frequency_table_percentage = {}
    
    for row in dataset:
        
        parameter = row[index]
        
        if parameter in frequency_table:
            frequency_table[parameter] += 1
        else:
            frequency_table[parameter] = 1
            
    for value in frequency_table:
        frequency_table_percentage[value] = round(((frequency_table[value] / len(dataset)) * 100), 2)
            
    return frequency_table_percentage
        

We now have a function that generates a frequency table showing the percentage share of each genre on iOS / Android. However, to better understand and interpret the data, we need to sort it in descending order. Below, we write a function that converts the table into a list of tuples, sorts it, and prints it:

In [24]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
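
For comparison, the counting and sorting done by `freq_table` and `display_table` can be sketched with `collections.Counter` from the standard library. `sorted_freq_table` is a hypothetical helper, not the notebook's code, and its ordering of tied values may differ: `most_common()` keeps first-seen order for equal counts, whereas the tuple sort above breaks ties by name.

```python
from collections import Counter

def sorted_freq_table(rows, index):
    """Return [(value, percentage), ...] sorted by percentage, highest first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, round(count / total * 100, 2))
            for value, count in counts.most_common()]

sample_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for genre, pct in sorted_freq_table(sample_rows, 1):
    print(genre, ':', pct)
```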

We will now try to display the sorted frequency tables for `'prime_genre'`, `'Genres'` and `'Category'` columns.

In [25]:
display_table(ios_clean_2, app_store[0].index('prime_genre'))

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


In [26]:
display_table(android_clean_2, play_store[0].index('Genres'))

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.91
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;

In [27]:
display_table(android_clean_2, play_store[0].index('Category'))

FAMILY : 18.91
GAME : 9.72
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


A higher share in the App Store / Play Store doesn't necessarily mean that the particular genre will have a high number of users as well. We're interested in building an app that will attract more users.

**Let us now analyze the average number of users each genre has. We can do this using the number of `Installs` in the Play Store and, as a proxy, `rating_count_tot` in the App Store:**

In [28]:
freq_table(ios_clean_2, app_store[0].index('prime_genre'))

{'Book': 0.43,
 'Business': 0.53,
 'Catalogs': 0.12,
 'Education': 3.66,
 'Entertainment': 7.88,
 'Finance': 1.12,
 'Food & Drink': 0.81,
 'Games': 58.16,
 'Health & Fitness': 2.02,
 'Lifestyle': 1.58,
 'Medical': 0.19,
 'Music': 2.05,
 'Navigation': 0.19,
 'News': 1.33,
 'Photo & Video': 4.97,
 'Productivity': 1.74,
 'Reference': 0.56,
 'Shopping': 2.61,
 'Social Networking': 3.29,
 'Sports': 2.14,
 'Travel': 1.24,
 'Utilities': 2.51,
 'Weather': 0.87}

In [29]:
print(app_store[0])

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [30]:
genres_total_ratings = {}
genres_total_count = {}
average_genre_ratings = {}

for row in ios_clean_2:
    
    name = row[app_store[0].index('prime_genre')]
    ratings = float(row[app_store[0].index('rating_count_tot')])
    
    if name in genres_total_ratings:
        genres_total_ratings[name] = genres_total_ratings[name] + ratings
        genres_total_count[name] += 1
    
    else:
        genres_total_ratings[name] = ratings
        genres_total_count[name] = 1
        average_genre_ratings[name] = 0
        
for value in average_genre_ratings:
    
    average_genre_ratings[value] = genres_total_ratings[value] / genres_total_count[value]

    
display = []
    
for key in average_genre_ratings:
    
    key_val_tuple = (average_genre_ratings[key], key)
    
    display.append(key_val_tuple)
    
display = sorted(display, reverse = True)

for value in display:
    
    print(value[1], ':', value[0])
    

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


We can see that "Navigation" has the highest average number of user ratings per app. Let us try to further analyze "Navigation" specifically.

In [31]:
for value in ios_clean_2:
    if value[app_store[0].index('prime_genre')] == 'Navigation':
        print(value[app_store[0].index('track_name')], ':', value[app_store[0].index('rating_count_tot')])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


Above, we can see that almost half a million ratings are concentrated in Waze and Google Maps, whereas the other "Navigation" apps have very few ratings in comparison to these two.

Similarly, let us now try analyzing the "Social Networking" category:

In [32]:
for value in ios_clean_2:
    if value[app_store[0].index('prime_genre')] == 'Social Networking':
        print(value[app_store[0].index('track_name')], ':', value[app_store[0].index('rating_count_tot')])

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

Much like Navigation, the Social Networking category is dominated by a few very popular apps such as Facebook, Pinterest and Skype.

Overall, we can observe that the average number of ratings is skewed by the large number of users of just a handful of apps in Social Networking, Navigation and Music, whereas many other apps in these categories struggle to get past 10,000 ratings.
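
One quick way to see this skew is to compare the mean with the median, which is far less sensitive to a few very large values. The sketch below uses the `rating_count_tot` values of the six Navigation apps listed earlier; using the median here is an illustrative addition, not a step from the original analysis.

```python
from statistics import mean, median

# rating_count_tot values of the free English Navigation apps printed earlier
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print('Mean  :', round(mean(navigation_ratings), 2))  # 86090.33
print('Median:', median(navigation_ratings))          # 8196.5
```

The mean is roughly ten times the median, which is consistent with two apps accounting for most of the ratings.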

Let us take a look at reference apps:

In [33]:
for value in ios_clean_2:
    if value[app_store[0].index('prime_genre')] == 'Reference':
        print(value[app_store[0].index('track_name')], ':', value[app_store[0].index('rating_count_tot')])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


The 'Reference' category has approximately 75K ratings per app on average. Although this category also seems to be dominated by a few apps such as Bible and Dictionary.com, it still has some scope, because the dominance is less extreme than in the categories we mentioned above.

One good suggestion would be to take a book and convert it into an app. We can also add additional features to the app such as dictionary lookup within the app, audio and accent, pronunciation, quotes and extra information on some event in the book, pictures, quizzes, etc.

We also saw that the App Store is dominated by apps made for fun (over half of the free English apps are games), so an app serving a more practical purpose may stand a better chance, considering the saturation of fun apps.

Furthermore, apps in categories such as 'Food', 'Travel' or 'Finance' will require additional resources such as setting up delivery services, restaurant contacts, travel agents, banking and finance experts, etc. However, our aim was to explicitly design an app that will help us generate ad based revenue without much initial investment.

Apps such as 'Weather' won't be of much use, as people usually don't spend much time in weather apps, which limits ad-based revenue.

**Let us now analyze the Google Play Store dataset**

In [34]:
display_table(android_clean_2, play_store[0].index('Installs'))

1,000,000+ : 15.73
100,000+ : 11.55
10,000,000+ : 10.55
10,000+ : 10.2
1,000+ : 8.39
100+ : 6.92
5,000,000+ : 6.83
500,000+ : 5.56
50,000+ : 4.77
5,000+ : 4.51
10+ : 3.54
500+ : 3.25
50,000,000+ : 2.3
100,000,000+ : 2.13
50+ : 1.92
5+ : 0.79
1+ : 0.51
500,000,000+ : 0.27
1,000,000,000+ : 0.23
0+ : 0.05
0 : 0.01


Above, we can see that the install counts are not exact numbers; instead, they are displayed as open-ended ranges such as `100,000+`. However, this information should be sufficient for us to get an estimate that fulfills our purpose.

For our analysis, we will consider 100+ as 100 installs, 10000+ as 10000 installs and so on.
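
The conversion described above can be sketched as a small function that strips the commas and the plus sign before converting. `parse_installs` is a hypothetical helper name; the labels are taken from the table above.

```python
def parse_installs(installs):
    """Convert a Play Store install label such as '10,000+' to a float."""
    return float(installs.replace(',', '').replace('+', ''))

for label in ['100+', '10,000+', '1,000,000,000+', '0']:
    print(label, '->', parse_installs(label))
```

Because each label is a lower bound, averages computed this way underestimate the true number of installs.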

In [35]:
freq_table(android_clean_2, play_store[0].index('Category'))

{'ART_AND_DESIGN': 0.64,
 'AUTO_AND_VEHICLES': 0.93,
 'BEAUTY': 0.6,
 'BOOKS_AND_REFERENCE': 2.14,
 'BUSINESS': 4.59,
 'COMICS': 0.62,
 'COMMUNICATION': 3.24,
 'DATING': 1.86,
 'EDUCATION': 1.16,
 'ENTERTAINMENT': 0.96,
 'EVENTS': 0.71,
 'FAMILY': 18.91,
 'FINANCE': 3.7,
 'FOOD_AND_DRINK': 1.24,
 'GAME': 9.72,
 'HEALTH_AND_FITNESS': 3.08,
 'HOUSE_AND_HOME': 0.82,
 'LIBRARIES_AND_DEMO': 0.94,
 'LIFESTYLE': 3.9,
 'MAPS_AND_NAVIGATION': 1.4,
 'MEDICAL': 3.53,
 'NEWS_AND_MAGAZINES': 2.8,
 'PARENTING': 0.65,
 'PERSONALIZATION': 3.32,
 'PHOTOGRAPHY': 2.94,
 'PRODUCTIVITY': 3.89,
 'SHOPPING': 2.25,
 'SOCIAL': 2.66,
 'SPORTS': 3.4,
 'TOOLS': 8.46,
 'TRAVEL_AND_LOCAL': 2.34,
 'VIDEO_PLAYERS': 1.79,
 'WEATHER': 0.8}

In [36]:
app_category = {}
len_apps = {}
average_installs = {}

for value in android_clean_2:
    installs_count = value[play_store[0].index('Installs')].replace('+', '')
    installs_count = float(installs_count.replace(',', ''))
    if value[play_store[0].index('Category')] in app_category:
        app_category[value[play_store[0].index('Category')]] += installs_count
        len_apps[value[play_store[0].index('Category')]] += 1
    else:
        app_category[value[play_store[0].index('Category')]] = installs_count
        len_apps[value[play_store[0].index('Category')]] = 1
    
for value in app_category:
    average_installs[value] = app_category[value] / len_apps[value]

average_installs
    

{'ART_AND_DESIGN': 1986335.0877192982,
 'AUTO_AND_VEHICLES': 647317.8170731707,
 'BEAUTY': 513151.88679245283,
 'BOOKS_AND_REFERENCE': 8767811.894736841,
 'BUSINESS': 1712290.1474201474,
 'COMICS': 817657.2727272727,
 'COMMUNICATION': 38456119.167247385,
 'DATING': 854028.8303030303,
 'EDUCATION': 1833495.145631068,
 'ENTERTAINMENT': 11640705.88235294,
 'EVENTS': 253542.22222222222,
 'FAMILY': 3695641.8198090694,
 'FINANCE': 1387692.475609756,
 'FOOD_AND_DRINK': 1924897.7363636363,
 'GAME': 15588015.603248259,
 'HEALTH_AND_FITNESS': 4188821.9853479853,
 'HOUSE_AND_HOME': 1331540.5616438356,
 'LIBRARIES_AND_DEMO': 638503.734939759,
 'LIFESTYLE': 1437816.2687861272,
 'MAPS_AND_NAVIGATION': 4056941.7741935486,
 'MEDICAL': 120550.61980830671,
 'NEWS_AND_MAGAZINES': 9549178.467741935,
 'PARENTING': 542603.6206896552,
 'PERSONALIZATION': 5201482.6122448975,
 'PHOTOGRAPHY': 17840110.40229885,
 'PRODUCTIVITY': 16787331.344927534,
 'SHOPPING': 7036877.311557789,
 'SOCIAL': 23253652.127118643,
 

In [37]:
app_categories = freq_table(android_clean_2, 1)

display = []

for value in app_categories:
    total = 0
    len_apps = 0
    for row in android_clean_2:
        app_cat = row[1]
        if app_cat == value:
            installs_count = row[5].replace('+', '')
            installs_count = float(installs_count.replace(',', ''))
            total = total + installs_count
            len_apps = len_apps + 1
            avg_rat = total / len_apps
    display.append((round(avg_rat, 2), value))
    
display = sorted(display, reverse = True)
    
for value in display:
    print(value[1], ':', value[0])

COMMUNICATION : 38456119.17
VIDEO_PLAYERS : 24727872.45
SOCIAL : 23253652.13
PHOTOGRAPHY : 17840110.4
PRODUCTIVITY : 16787331.34
GAME : 15588015.6
TRAVEL_AND_LOCAL : 13984077.71
ENTERTAINMENT : 11640705.88
TOOLS : 10801391.3
NEWS_AND_MAGAZINES : 9549178.47
BOOKS_AND_REFERENCE : 8767811.89
SHOPPING : 7036877.31
PERSONALIZATION : 5201482.61
WEATHER : 5074486.2
HEALTH_AND_FITNESS : 4188821.99
MAPS_AND_NAVIGATION : 4056941.77
FAMILY : 3695641.82
SPORTS : 3638640.14
ART_AND_DESIGN : 1986335.09
FOOD_AND_DRINK : 1924897.74
EDUCATION : 1833495.15
BUSINESS : 1712290.15
LIFESTYLE : 1437816.27
FINANCE : 1387692.48
HOUSE_AND_HOME : 1331540.56
DATING : 854028.83
COMICS : 817657.27
AUTO_AND_VEHICLES : 647317.82
LIBRARIES_AND_DEMO : 638503.73
PARENTING : 542603.62
BEAUTY : 513151.89
EVENTS : 253542.22
MEDICAL : 120550.62


Above, we observe that COMMUNICATION has the highest average number of installs. However, we need to check whether that is skewed because of specific apps:

In [39]:
for app in android_clean_2:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

For better clarity, we will remove the above apps (those with 100M or more installs) from our average installs calculation:

In [44]:
under_100m = []

for value in android_clean_2:
    n_installs = value[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (value[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100m.append(float(n_installs))
        
round(sum(under_100m) / len(under_100m), 2)

3603485.39

We can see that the average installs for 'Communications' dropped significantly after the highly dominant apps were removed from consideration.
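
To repeat this check for other categories without copying the cell, the calculation can be generalized. A minimal sketch, assuming the same column layout as the Play Store data (category in column 1, installs in column 5); `average_installs_below` and the sample rows are hypothetical.

```python
def average_installs_below(rows, category, cap, category_index=1, installs_index=5):
    """Average installs for `category`, ignoring apps at or above `cap` installs."""
    values = []
    for row in rows:
        if row[category_index] != category:
            continue
        installs = float(row[installs_index].replace(',', '').replace('+', ''))
        if installs < cap:
            values.append(installs)
    # Return None instead of dividing by zero when nothing matches
    return sum(values) / len(values) if values else None

sample_rows = [
    ['App A', 'SOCIAL', '', '', '', '1,000,000,000+'],
    ['App B', 'SOCIAL', '', '', '', '1,000,000+'],
    ['App C', 'SOCIAL', '', '', '', '500,000+'],
    ['App D', 'TOOLS', '', '', '', '10,000+'],
]
print(average_installs_below(sample_rows, 'SOCIAL', 100_000_000))  # 750000.0
```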

A similar pattern is observed for 'Video_Players', 'Social', 'Photography', etc.

We are looking at an app that will succeed both on App Store as well as on Play Store.

'Games', as we observed earlier, is again quite a saturated category.

We can explore the 'Books and Reference' category, as 'Reference' showed some potential on the App Store. Let us see whether it shows potential on Android as well:

In [45]:
for value in android_clean_2:
    if value[1] == 'BOOKS_AND_REFERENCE':
        print(value[0], ':', value[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

The 'Books and Reference' category includes a variety of apps such as software for processing / reading ebooks, dictionaries, tutorials on languages, etc. There are, of course, a few dominant apps that skew the average, but their proportion is quite small compared to other categories.

In [47]:
for value in android_clean_2:
    if value[1] == 'BOOKS_AND_REFERENCE' and (value[5] == '1,000,000,000+'
                                            or value[5] == '500,000,000+'
                                            or value[5] == '100,000,000+'):
        print(value[0], ':', value[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


It can be observed that only a small portion of apps in the 'Books and Reference' category skew the average.

Let us now try to get some insights by fetching apps that have installs somewhere in the middle range between 1,000,000 and 100,000,000 installs.

In [48]:
for value in android_clean_2:
    if value[1] == 'BOOKS_AND_REFERENCE' and (value[5] == '1,000,000+'
                                            or value[5] == '5,000,000+'
                                            or value[5] == '10,000,000+'
                                            or value[5] == '50,000,000+'):
        print(value[0], ':', value[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H
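
The chain of string comparisons in the cell above can also be written as a numeric range check, which avoids listing every install label by hand. A sketch under the same column assumptions (name in column 0, category in column 1, installs in column 5); `apps_in_install_range` is a hypothetical helper, and the sample rows reuse three app names and install labels from the outputs above with the other columns left empty.

```python
def apps_in_install_range(rows, category, low, high,
                          name_index=0, category_index=1, installs_index=5):
    """Return (name, installs label) pairs with low <= installs < high."""
    matches = []
    for row in rows:
        installs = float(row[installs_index].replace(',', '').replace('+', ''))
        if row[category_index] == category and low <= installs < high:
            matches.append((row[name_index], row[installs_index]))
    return matches

sample_rows = [
    ['Wikipedia', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
    ['Google Play Books', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000,000+'],
    ['Offline English Dictionary', 'BOOKS_AND_REFERENCE', '', '', '', '100,000+'],
]
print(apps_in_install_range(sample_rows, 'BOOKS_AND_REFERENCE',
                            1_000_000, 100_000_000))
```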

This category seems to be dominated by ebook readers, libraries, dictionaries, etc. Building similar apps may not be beneficial, as there is already considerable competition.

Notice that there are some apps built around the Quran. This suggests that building an app around a particular book and adding extra, unique features to it can be profitable. For the app to succeed, there should be some special features besides the raw version of the book, such as daily quotes from the book, quizzes on the book, a discussion forum, etc.

## Conclusions:

In this project, we analyzed data on the App Store and the Google Play Store with the goal of recommending an app capable of generating profits in both markets.

We concluded that building an app around a particular book and adding extra, unique features to it can be profitable in both markets. For the app to succeed, there should be some special features besides the raw version of the book, such as daily quotes from the book, quizzes on the book, a discussion forum, etc.