### Analysing Profitable Apps in Google and Apple Store

The Google Play Store and the Apple App Store host millions of apps. Some of those apps are very popular and profitable while others are not. This project looks at what makes an app profitable and which categories of apps are more popular and profitable than others.

The goal of this project is to understand the key points that make an app successful. It should give us insight into which categories of apps are most popular, so that developers get an idea of how to attract users with their next app.

In [81]:
from collections import defaultdict, Counter

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    """utility function to print dataset"""
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print("Number of rows:    ", len(dataset))
        print("Number of columns: ", len(dataset[0]))

In [2]:
# create list of lists for both apple and google dataset
from csv import reader
with open("AppleStore.csv", "r", encoding="utf8") as fp:
    apple = reader(fp)
    apple_data = list(apple)
with open("googleplaystore.csv", "r", encoding="utf8") as fp:
    google = reader(fp)
    google_data = list(google)
    
android_header = google_data[0]
google_data = google_data[1:]
ios_header = apple_data[0]
apple_data = apple_data[1:]
print("Exploring Apple Dataset: \n")
explore_data(apple_data, 1, 3, True)

Exploring Apple Dataset: 

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows:     7197
Number of columns:  16
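
The loading steps above can also be wrapped in one small helper. The sketch below is only an illustration: `load_rows` is a hypothetical name, and it reads from any file-like object so that it can be tried on an in-memory sample instead of the real CSV files.

```python
import csv
import io


def load_rows(file_obj):
    """Read CSV rows from a file-like object and split off the header row."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# In-memory sample (made up for illustration, not taken from the datasets)
sample = io.StringIO('id,track_name,price\n1,"Demo, the App",0.0\n')
header, data = load_rows(sample)
print(header)
print(data)
```

With a real file we would pass the object returned by `open(...)` instead of the `StringIO` sample.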


In [3]:
print("First Apple row")
print(apple_data[0])
print("\n")
print("First Google row")
print(google_data[0])

First Apple row
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


First Google row
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


#### More Info on Datasets

To get detailed information about the Google Play Store dataset, go to [Google dataset](https://www.kaggle.com/lava18/google-play-store-apps)

Similarly, more info about Apple App Store dataset can be found at [Apple dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

### Data Cleaning

The process of preparing our data for analysis is called __data cleaning__.
It includes -
* _deleting or correcting_ wrong data
* _deleting_ duplicate data
* _modifying_ data to fit our needs

`It is said that data scientists spend 80% of their time cleaning data and 20% of their time analysing it.`

### Deleting wrong data
The Google dataset has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) points out a row with wrong data. We'll check whether it is actually wrong.

In [4]:
print(google_data[10471])  # correct row
print('\n')
print(android_header)  # header row
print('\n')
print(google_data[10472])  # incorrect row

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [5]:
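# Per the Kaggle discussion, the row at index 10472 (row 10473 when the header is counted)
# is missing its 'Category' value, so it has 12 columns instead of 13.
# Run this cell only once: running it again would delete a valid row.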
print(len(google_data))
del google_data[10472]
print(len(google_data))

10841
10840
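
Instead of relying only on the discussion thread, a shifted row like this one can also be found by comparing each row's length with the header's length. A minimal sketch, assuming the data is a list of lists as above; `find_malformed_rows` is a hypothetical helper name and the sample rows are made up:

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose column count differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]


sample_header = ["App", "Category", "Rating"]
sample_rows = [
    ["App A", "TOOLS", "4.1"],
    ["App B", "1.9"],  # 'Category' is missing, so the row is too short
    ["App C", "GAME", "4.5"],
]
print(find_malformed_rows(sample_header, sample_rows))
```

This only catches rows with a wrong number of columns; a row with the right length but wrong values would still need a manual check.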


### Removing duplicate entries
__PART 1__

Looking closely at the dataset, we can see that there are many duplicate entries. For instance, the app "Instagram" has four entries.

In [6]:
for app in google_data:
    name = app[0]
    if name == "Instagram":
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [7]:
duplicate_apps = []
unique_apps = []
for app in google_data:
    app_name = app[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)
        
print(f"Length of duplicate apps: {len(duplicate_apps)}")
print(f"Length of unique apps: {len(unique_apps)}")
print("\n")
print("Few examples of duplicate apps -")
print(duplicate_apps[:15])

Length of duplicate apps: 1181
Length of unique apps: 9659


Few examples of duplicate apps -
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
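
Checking `app_name in unique_apps` scans a list on every iteration, which gets slower as the list grows. As a sketch of an alternative, `collections.Counter` can count the names in a single pass. `count_duplicates` is a hypothetical helper and the sample names are only illustrative:

```python
from collections import Counter


def count_duplicates(names):
    """Return (number of unique names, number of extra duplicate entries)."""
    counts = Counter(names)
    unique = len(counts)
    duplicates = sum(counts.values()) - unique
    return unique, duplicates


sample_names = ["Box", "Slack", "Box", "Box", "Zenefits"]
print(count_duplicates(sample_names))
```

Here "Box" appears three times, so it contributes one unique name and two duplicate entries.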


We can see that there are 1181 duplicate entries in total.

If we look at the Instagram duplicates, we find that the value in the reviews column differs between rows. The different values show that the data was collected at different times. We can use this as the criterion for keeping rows: instead of deleting duplicates at random, we keep only the row with the highest number of reviews, since more reviews make the rating more reliable.

__PART 2__

We will start by creating a dictionary whose keys are app names and whose values are the highest review count among all of that app's duplicate rows.

In [8]:
from collections import defaultdict
reviews_max = defaultdict(int)

for app in google_data:
    app_name = app[0]
    reviews = float(app[3])
    old_review = reviews_max[app_name]
    if old_review < reviews:
        reviews_max[app_name] = reviews

In [9]:
print(len(unique_apps) == len(reviews_max))  # confirming actual vs expected length

True


Using the dictionary we created, we will iterate over the entire dataset and select only those rows whose review count matches the value in the dictionary (the dictionary holds the highest review count for each app). We will create a new dataset with the cleaned data.

We will also keep a separate list of the app names that have already been added to the cleaned dataset. This lets us skip duplicate entries that have the same number of reviews. If we only checked `reviews == max_review`, duplicates that share the highest review count would all be kept.

In [14]:
cleaned_android = []
already_added = []
for app in google_data:
    app_name = app[0]
    reviews = float(app[3])
    max_review = reviews_max[app_name]
    if max_review == reviews and app_name not in already_added:
        cleaned_android.append(app)
        already_added.append(app_name)

print(len(cleaned_android))
print(len(cleaned_android) == len(reviews_max) == len(unique_apps))

9659
True
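
The two passes above can also be combined into one. The sketch below keeps, for each app name, the first row that has the highest review count. `keep_most_reviewed` is a hypothetical helper; the Instagram rows are shortened versions of the ones printed earlier and "Demo App" is made up:

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' means that ties keep the row seen first
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


sample_rows = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Demo App", "TOOLS", "4.0", "120"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Instagram", "SOCIAL", "4.5", "66577313"],
]
for row in keep_most_reviewed(sample_rows):
    print(row)
```

One difference from the two-pass version: the result is ordered by the first appearance of each app name, not by the position of the row that is kept.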


In [16]:
explore_data(cleaned_android, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:     9659
Number of columns:  13


### Removing non-English apps
#### Part 1
If we explore the datasets enough, we find that there are apps that are not directed towards an English-speaking audience. Below we see a couple of examples from both datasets -

In [44]:
print(apple_data[813][1])
print(apple_data[6731][1])

print(cleaned_android[4412][0])
print(cleaned_android[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜
中国語 AQリスニング
لعبة تقدر تربح DZ


We will write a function that takes a string and checks whether all of its characters are ASCII. According to the ASCII table, the characters commonly used in English text have code points between 0 and 127. To get the code point of a character, we'll use the built-in function `ord()`. Our function returns `False` as soon as it finds a character whose code point is above 127, and `True` otherwise.

In [19]:
def is_english(word):
    for c in word:
        if ord(c) > 127:
            return False
    return True

In [20]:
test_non_english = ["Instagram", "爱奇艺PPS -《欢乐颂2》电视剧热播", "Docs To Go™ Free Office Suite", "Instachat 😜"]
for sample in test_non_english:
    print(is_english(sample))

True
False
False
False
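
For this strict check, Python 3.7 and later also provide the built-in string method `str.isascii()`, which returns `True` when every character has a code point below 128. A minimal sketch, with `is_ascii_only` as a hypothetical wrapper name:

```python
def is_ascii_only(text):
    """Strict ASCII check using the built-in str.isascii() (Python 3.7+)."""
    return text.isascii()


samples = [
    "Instagram",
    "Docs To Go\u2122 Free Office Suite",  # \u2122 is the trademark sign
    "Instachat \U0001F61C",  # \U0001F61C is an emoji
]
for sample in samples:
    print(is_ascii_only(sample))
```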


There can be cases where the app is intended for an English-speaking audience but its name contains an emoji or a character like `™`.
We may lose a lot of data if we filter this strictly. To counter that, we will accept an app name as English if it contains at most 3 non-ASCII characters. This is not a perfect solution, but it should keep apps with 3 emojis or fewer and similar cases.

To achieve that we will use a counter for the number of non-ASCII characters in a name.

#### Part 2

To minimize the data loss, let's redefine the basic `is_english` function.

In [39]:
def is_english(word):
    non_ascii = 0
    for c in word:
        if ord(c) > 127:
            if non_ascii >= 3:
                return False
            else:
                non_ascii += 1
    return True

In [40]:
test_non_english = ["Instagram 😜😜😜", "Instagram 😜😜😜😜", "爱奇艺PPS -《欢乐颂2》电视剧热播", "Docs To Go™ Free Office Suite", "Instachat 😜"]
for sample in test_non_english:
    print(is_english(sample))

True
False
False
True
True
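
The same rule can be written more compactly by counting all non-ASCII characters first and comparing the count with the limit. This is only a sketch of an equivalent formulation; `is_mostly_ascii` and its `max_non_ascii` parameter are hypothetical names:

```python
def is_mostly_ascii(name, max_non_ascii=3):
    """Accept a name if it has at most `max_non_ascii` characters above code point 127."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii


emoji = "\U0001F61C"  # a single emoji character
print(is_mostly_ascii("Instagram " + emoji * 3))
print(is_mostly_ascii("Instagram " + emoji * 4))
print(is_mostly_ascii("Docs To Go\u2122 Free Office Suite"))
```

Making the limit a parameter keeps the threshold of 3 visible as a choice that can be tuned.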


At this point, we keep a fairly good share of the English-directed apps. We will stop refining the filter here and use the function to clean our datasets.

In [46]:
android_english = []
ios_english = []
for app in cleaned_android:
    app_name = app[0]
    if is_english(app_name):
        android_english.append(app)

for app in apple_data:
    app_name = app[1]
    if is_english(app_name):
        ios_english.append(app)
        
print("Length of english android apps: ", len(android_english))
print("Length of english ios apps: ", len(ios_english))

Length of english android apps:  9614
Length of english ios apps:  6183


We see that we are left with __9614__ android apps and __6183__ ios apps.

In [58]:
non_english_android_perc = (1 - (len(android_english) / len(cleaned_android))) * 100
print(f"We reduced: {round(non_english_android_perc, 2)} percentage of non-english android apps")
non_english_ios_perc = (1 - (len(ios_english) / len(apple_data))) * 100
print(f"We reduced: {round(non_english_ios_perc, 2)} percentage of non-english ios apps")

We reduced: 0.47 percentage of non-english android apps
We reduced: 14.09 percentage of non-english ios apps


### Selecting only the free apps
We are only interested in building free apps, as our main source of revenue is in-app ads. We will therefore select only the apps that are free. This is the last step of _Data Cleaning_, in which we have already covered - 
> * Removing inaccurate data
> * Removing duplicate entries
> * Removing non-english apps

In [60]:
print("Checking Android Header")
print(android_header)
print('\n')
print("Checking iOS Header")
print(ios_header)

Checking Android Header
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Checking iOS Header
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [80]:
explore_data(android_english, 115, 119, False)

['Mirror Camera (Mirror + Selfie Camera)', 'BEAUTY', '4.1', '9315', '2.6M', '1,000,000+', 'Free', '0', 'Everyone', 'Beauty', 'November 21, 2017', '1.4.2', '4.0 and up']


['Beauty Tips - Beauty Tips in Sinhala', 'BEAUTY', '4.4', '75', '4.2M', '50,000+', 'Free', '0', 'Everyone', 'Beauty', 'October 18, 2017', '1.0.0', '4.0.3 and up']


['Haircut Tutorials/Haircut Videos', 'BEAUTY', '4.6', '38', '7.1M', '10,000+', 'Free', '0', 'Everyone', 'Beauty', 'May 29, 2018', '1.0', '4.0 and up']


['Sephora: Skin Care, Beauty Makeup & Fragrance Shop', 'BEAUTY', '4.5', '26834', '57M', '1,000,000+', 'Free', '0', 'Everyone', 'Beauty', 'July 24, 2018', '18.5', '5.0 and up']




In [75]:
android_final = []
ios_final = []
for app in android_english:
    price = app[7]
    if price == "0":
        android_final.append(app)
for app in ios_english:
    price = app[4]
    if price == "0.0":
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))

8864
3222
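
Comparing prices as strings works for the exact values `"0"` and `"0.0"`, but it depends on how each store formats its prices. A slightly more tolerant sketch converts the price to a number first. `is_free` is a hypothetical helper, and the `$` prefix it strips is an assumption about how paid prices might be formatted:

```python
def is_free(price_text):
    """Return True if the price string represents zero."""
    # Assumption: paid prices may carry a leading '$', e.g. "$4.99"
    cleaned = price_text.replace("$", "").strip()
    return float(cleaned) == 0.0


for price in ["0", "0.0", "$4.99"]:
    print(price, is_free(price))
```

With this helper the same condition could be used for both datasets, whichever of the two zero formats a store uses.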


### Most Common Apps By Genre
#### Part 1
Our __VALIDATION STRATEGY__ for building an app consists of the following steps -
1. Find out the most popular app genre common to both the Google and Apple stores
2. Create a minimal Android version of an app in that genre and publish it
3. If the app gets a good response, we develop it further
4. Release the full version of the app in both app stores

To find the popular genres, we are going to create a frequency table that holds each genre and the share of apps belonging to it.

#### Part 2
We will build two functions -
* One will create a frequency table of the most common genres of apps in the different stores.
* The second function will display the list of genres in percentage in descending order.

In [122]:
def freq_table(dataset, index):
    """Builds a frequency table for the column at position `index`.
    dataset: list of lists
    index: column number as an int
    returns a Counter that maps each distinct value of the column to the
    percentage of rows (rounded to 2 decimals) in which it occurred"""
    table = Counter()  # named `table` so it does not shadow the function name
    for row in dataset:
        column = row[index]
        table[column] += 1

    # convert raw counts to percentages; only values change, the keys stay the same
    for k, v in table.items():
        table[k] = round((v / len(dataset)) * 100, 2)
    return table

android_genre_frequency = freq_table(android_final, 9)
android_category_frequency = freq_table(android_final, 1)
ios_genre_frequency = freq_table(ios_final, 11)
print(f"Total genres found in android: {len(android_genre_frequency)}")
print(f"Total categories found in android: {len(android_category_frequency)}")
print(f"Total genres found in ios: {len(ios_genre_frequency)}")

Total genres found in android: 114
Total categories found in android: 33
Total genres found in ios: 23


Let's sort the frequency tables so that the top genre is shown first.

In [123]:
def display_table(freq):
    sorted_list = sorted(freq.items(), key=lambda item: item[1], reverse=True)
    for t in sorted_list:
        print(t[0], t[1])
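
Counting and sorting can also be done in one step, because `Counter.most_common()` already returns the entries ordered from most to least frequent. A minimal sketch, assuming the same list-of-lists layout; `percentage_table` is a hypothetical name and the sample rows are made up:

```python
from collections import Counter


def percentage_table(rows, index):
    """Return (value, percentage) pairs for one column, most frequent first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [
        (value, round(count / total * 100, 2))
        for value, count in counts.most_common()
    ]


sample_rows = [
    ["App A", "Games"],
    ["App B", "Games"],
    ["App C", "Music"],
    ["App D", "Games"],
]
for genre, share in percentage_table(sample_rows, 1):
    print(genre, share)
```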

#### Part 3
We will start by analysing the most common genres in iOS app store.

In [124]:
display_table(ios_genre_frequency)

Games 58.16
Entertainment 7.88
Photo & Video 4.97
Education 3.66
Social Networking 3.29
Shopping 2.61
Utilities 2.51
Sports 2.14
Music 2.05
Health & Fitness 2.02
Productivity 1.74
Lifestyle 1.58
News 1.33
Travel 1.24
Finance 1.12
Weather 0.87
Food & Drink 0.81
Reference 0.56
Business 0.53
Book 0.43
Navigation 0.19
Medical 0.19
Catalogs 0.12


Looking at the data, we can see that more than half (58.16%) of the free English-directed apps in the iOS App Store are games. Entertainment apps make up approximately 8% of the market, followed by photo and video apps with approx. 5%. Education apps come in fourth position, followed by social networking apps.

The general impression after looking at the list (of English-directed apps) is that the market is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports etc.) while apps with a practical purpose (education, health, lifestyle etc.) are rarer. However, these statistics only tell us how many apps exist per genre, so we cannot conclude that the genres with the largest share also have the most users - __the supply can be more than the demand.__

Let's continue by looking at the category and genre columns of the Google Play Store.

In [127]:
display_table(android_category_frequency)

FAMILY 18.91
GAME 9.72
TOOLS 8.46
BUSINESS 4.59
LIFESTYLE 3.9
PRODUCTIVITY 3.89
FINANCE 3.7
MEDICAL 3.53
SPORTS 3.4
PERSONALIZATION 3.32
COMMUNICATION 3.24
HEALTH_AND_FITNESS 3.08
PHOTOGRAPHY 2.94
NEWS_AND_MAGAZINES 2.8
SOCIAL 2.66
TRAVEL_AND_LOCAL 2.34
SHOPPING 2.25
BOOKS_AND_REFERENCE 2.14
DATING 1.86
VIDEO_PLAYERS 1.79
MAPS_AND_NAVIGATION 1.4
FOOD_AND_DRINK 1.24
EDUCATION 1.16
ENTERTAINMENT 0.96
LIBRARIES_AND_DEMO 0.94
AUTO_AND_VEHICLES 0.93
HOUSE_AND_HOME 0.82
WEATHER 0.8
EVENTS 0.71
PARENTING 0.65
ART_AND_DESIGN 0.64
COMICS 0.62
BEAUTY 0.6


The picture looks different for the Google Play Store. Practical apps seem to outnumber the fun apps. However, if we investigate further, we can see that the Family category, which covers almost 19% of the apps, consists largely of games for kids.

![](./google_play_store.png)

Even then, practical apps seem to be more common than fun apps. This is also confirmed by the genre breakdown of the apps below -

In [128]:
display_table(android_genre_frequency)

Tools 8.45
Entertainment 6.07
Education 5.35
Business 4.59
Lifestyle 3.89
Productivity 3.89
Finance 3.7
Medical 3.53
Sports 3.46
Personalization 3.32
Communication 3.24
Action 3.1
Health & Fitness 3.08
Photography 2.94
News & Magazines 2.8
Social 2.66
Travel & Local 2.32
Shopping 2.25
Books & Reference 2.14
Simulation 2.04
Dating 1.86
Arcade 1.85
Video Players & Editors 1.77
Casual 1.76
Maps & Navigation 1.4
Food & Drink 1.24
Puzzle 1.13
Racing 0.99
Libraries & Demo 0.94
Role Playing 0.94
Auto & Vehicles 0.93
Strategy 0.91
House & Home 0.82
Weather 0.8
Events 0.71
Adventure 0.68
Comics 0.61
Art & Design 0.6
Beauty 0.6
Parenting 0.5
Card 0.45
Casino 0.43
Trivia 0.42
Educational;Education 0.39
Board 0.38
Educational 0.37
Education;Education 0.34
Word 0.26
Casual;Pretend Play 0.24
Music 0.2
Entertainment;Music & Video 0.17
Puzzle;Brain Games 0.17
Racing;Action & Adventure 0.17
Casual;Brain Games 0.14
Casual;Action & Adventure 0.14
Arcade;Action & Adventure 0.12
Action;Action & Adventure 0

Looking at all three frequency tables, we can say that the Apple App Store has more fun-related apps, while the Google Play Store is more balanced between practical and for-fun apps. We will now move on and look at the kinds of apps that have the most users.

In [132]:
print(ios_header)
print('\n')
print(ios_final[0])
print('\n')
print(android_header)
print('\n')
print(android_final[0])

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


### Most Popular Apps By Genre On The App Store (Apple)
To find out which genres have the most users, we can look at the number of installs. For the Google Play Store, this information is in the 'Installs' column. The Apple App Store dataset has no such column, so as a workaround we'll use 'rating_count_tot', the total number of user ratings, as a proxy for the number of users.

Below, we compute the average number of ratings per genre for the App Store.

In [140]:
ios_ratings_per_genre = defaultdict(int)  # genre: total_ratings ...1
apps_per_genre = defaultdict(int)  # genre: total_apps ...2
ios_avg_ratings_per_genre = {}  # genre: avg_rating ...(1 divide by 2)
for app in ios_final:
    genre = app[-5]
    rating = float(app[5])
    ios_ratings_per_genre[genre] += rating
    apps_per_genre[genre] += 1
    
for genre, rating in ios_ratings_per_genre.items():
    avg_ratings_per_genre = rating / apps_per_genre[genre]
    ios_avg_ratings_per_genre[genre] = avg_ratings_per_genre
    
sorted_ios_ratings_per_genre = sorted(ios_avg_ratings_per_genre.items(), key=lambda t: t[1], reverse=True)
for k, v in sorted_ios_ratings_per_genre:
    print(k, " : ", v)

Navigation  :  86090.33333333333
Reference  :  74942.11111111111
Social Networking  :  71548.34905660378
Music  :  57326.530303030304
Weather  :  52279.892857142855
Book  :  39758.5
Food & Drink  :  33333.92307692308
Finance  :  31467.944444444445
Photo & Video  :  28441.54375
Travel  :  28243.8
Shopping  :  26919.690476190477
Health & Fitness  :  23298.015384615384
Sports  :  23008.898550724636
Games  :  22788.6696905016
News  :  21248.023255813954
Productivity  :  21028.410714285714
Utilities  :  18684.456790123455
Lifestyle  :  16485.764705882353
Entertainment  :  14029.830708661417
Business  :  7491.117647058823
Education  :  7003.983050847458
Catalogs  :  4004.0
Medical  :  612.0


In general, 'Navigation' apps have the highest average number of ratings of any genre. But let's check whether this is driven by only a few apps within that genre (a __skewed distribution__) or whether the ratings are spread fairly __evenly__ across its apps.

In [141]:
for app in ios_final:
    if app[-5] == "Navigation":
        print(app[1], " : ", app[5])

Waze - GPS Navigation, Maps & Real-time Traffic  :  345046
Google Maps - Navigation & Transit  :  154911
Geocaching®  :  12811
CoPilot GPS – Car Navigation & Offline Maps  :  3582
ImmobilienScout24: Real Estate Search in Germany  :  187
Railway Route Search  :  5


We see that the ratings are heavily skewed: most of the rating count comes from '_Waze - GPS Navigation, Maps & Real-time Traffic_' and '_Google Maps - Navigation & Transit_'.

The same pattern applies to the 'Social Networking' genre, where the rating counts are high for giants like Facebook, Pinterest, Skype etc. Similarly for 'Music', the main players are giants like Spotify, Pandora and Shazam. 

The most popular genres based on the average number of ratings seem to be navigation, reference, social networking and music. But the average is heavily influenced by just a few apps within each genre. One way forward would be to remove the few giants from those genres and recompute the average per genre, but we leave that for later work.
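
A quick way to see how much a few giants pull the average up is to compare the mean with the median, which is far less sensitive to outliers. The sketch below uses the six Navigation rating counts printed above; using the median here is our own suggestion, not a step of the analysis:

```python
from statistics import mean, median

# Rating counts of the six Navigation apps printed above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

nav_mean = mean(navigation_ratings)
nav_median = median(navigation_ratings)
print("mean:  ", nav_mean)
print("median:", nav_median)
```

The mean reproduces the Navigation average from the table above, while the median lands between the third and fourth app, far below it.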

In [142]:
for app in ios_final:
    if app[-5] == "Reference":
        print(app[1], " : ", app[5])

Bible  :  985920
Dictionary.com Dictionary & Thesaurus  :  200047
Dictionary.com Dictionary & Thesaurus for iPad  :  54175
Google Translate  :  26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran  :  18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition  :  17588
Merriam-Webster Dictionary  :  16849
Night Sky  :  12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE)  :  8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools  :  4693
GUNS MODS for Minecraft PC Edition - Mods Tools  :  1497
Guides for Pokémon GO - Pokemon GO News and Cheats  :  826
WWDC  :  762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free  :  718
VPN Express  :  14
Real Bike Traffic Rider Virtual Reality Glasses  :  8
教えて!goo  :  0
Jishokun-Japanese English Dictionary & Translator  :  0


Looking at the second most popular genre by number of ratings, we see that 'Bible' and 'Dictionary.com' have the highest influence on the average. Still, this niche seems to have some potential. We could take another popular book and turn it into an app with added features, such as daily quotes from the book, an audio version of the book, and quizzes about the book. We could also embed a dictionary in the app so that users don't have to leave it to look up a word.

This idea is also supported by the fact that the App Store is dominated by for-fun apps. The market may be saturated with them, so turning a productivity or practical-purpose app into a fun one could be a good strategy to make it popular on the App Store.

Other popular genres do not seem as interesting, for the reasons below -
* Weather apps - They are not engaging enough, as users might just check the weather and close the app. The chances of making a profit from in-app ads are low if the app is not engaging.
* Food & Drink - Mostly influenced by Starbucks, McDonald's etc. Making such an app would mean running an actual cooking and delivery service, and we don't want to open a kitchen to develop an app.
* Finance apps - They involve banking, paying bills, money transfers etc., and we don't have expertise in the domain. We would have to consult or hire a finance expert, and we don't want that.

Now, let's analyze the Google Play Store.

### Most Popular Apps By Genre On The Play Store (Google)
The Google Play Store dataset has an explicit column for the number of installs per app, which should give a fairly clear idea of an app's popularity. So, let's begin.

In [144]:
display_table(freq_table(android_final, 5))  # Installs column

1,000,000+ 15.73
100,000+ 11.55
10,000,000+ 10.55
10,000+ 10.2
1,000+ 8.39
100+ 6.92
5,000,000+ 6.83
500,000+ 5.56
50,000+ 4.77
5,000+ 4.51
10+ 3.54
500+ 3.25
50,000,000+ 2.3
100,000,000+ 2.13
50+ 1.92
5+ 0.79
1+ 0.51
500,000,000+ 0.27
1,000,000,000+ 0.23
0+ 0.05
0 0.01


We see in the above output that the installs column is not precise. The values are __open-ended__ buckets.
For example, it is not clear from 100,000+ whether an app has 100,500 installs, 190,000 installs, or something in between. 

Here we will assume that each app has exactly the number of installs its bucket states, so 100,000+ is treated as 100,000. To perform the computation, we remove the ',' and '+' characters from the data points so that the string can be converted to a float.
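
The conversion just described can be sketched as a small helper, which avoids repeating the same `replace` calls in several cells. `parse_installs` is a hypothetical name, and returning an `int` instead of a `float` is our own choice:

```python
def parse_installs(text):
    """Convert an installs bucket such as '1,000,000+' to the number 1000000."""
    return int(text.replace(",", "").replace("+", ""))


for bucket in ["1,000,000+", "100,000+", "0+", "0"]:
    print(bucket, "->", parse_installs(bucket))
```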

In [145]:
android_installs_per_category = defaultdict(int)
android_apps_per_category = defaultdict(int)
for app in android_final:
    installs = app[5]
    installs = installs.replace(',', '')
    installs = installs.replace('+', '')
    installs = float(installs)
    category = app[1]
    android_installs_per_category[category] += installs
    android_apps_per_category[category] += 1
    
avg_android_installs_per_category = {}
for category, installs in android_installs_per_category.items():
    avg_android_installs_per_category[category] = installs / android_apps_per_category[category]
    
sorted_avg_android_installs_per_category = sorted(avg_android_installs_per_category.items(), key=lambda t: t[1], reverse=True)
for k, v in sorted_avg_android_installs_per_category:
    print(k, " : ", v)
    

COMMUNICATION  :  38456119.167247385
VIDEO_PLAYERS  :  24727872.452830188
SOCIAL  :  23253652.127118643
PHOTOGRAPHY  :  17840110.40229885
PRODUCTIVITY  :  16787331.344927534
GAME  :  15588015.603248259
TRAVEL_AND_LOCAL  :  13984077.710144928
ENTERTAINMENT  :  11640705.88235294
TOOLS  :  10801391.298666667
NEWS_AND_MAGAZINES  :  9549178.467741935
BOOKS_AND_REFERENCE  :  8767811.894736841
SHOPPING  :  7036877.311557789
PERSONALIZATION  :  5201482.6122448975
WEATHER  :  5074486.197183099
HEALTH_AND_FITNESS  :  4188821.9853479853
MAPS_AND_NAVIGATION  :  4056941.7741935486
FAMILY  :  3695641.8198090694
SPORTS  :  3638640.1428571427
ART_AND_DESIGN  :  1986335.0877192982
FOOD_AND_DRINK  :  1924897.7363636363
EDUCATION  :  1833495.145631068
BUSINESS  :  1712290.1474201474
LIFESTYLE  :  1437816.2687861272
FINANCE  :  1387692.475609756
HOUSE_AND_HOME  :  1331540.5616438356
DATING  :  854028.8303030303
COMICS  :  817657.2727272727
AUTO_AND_VEHICLES  :  647317.8170731707
LIBRARIES_AND_DEMO  :  638

We see that the 'Communication' category has the highest average number of installs. A closer look inside the category shows that the number is heavily influenced by a few apps.

In [147]:
for app in android_final:
    if app[1] == "COMMUNICATION" and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], " : ", app[5])

WhatsApp Messenger  :  1,000,000,000+
imo beta free calls and text  :  100,000,000+
Android Messages  :  100,000,000+
Google Duo - High Quality Video Calls  :  500,000,000+
Messenger – Text and Video Chat for Free  :  1,000,000,000+
imo free video calls and chat  :  500,000,000+
Skype - free IM & video calls  :  1,000,000,000+
Who  :  100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji  :  100,000,000+
LINE: Free Calls & Messages  :  500,000,000+
Google Chrome: Fast & Secure  :  1,000,000,000+
Firefox Browser fast & private  :  100,000,000+
UC Browser - Fast Download Private & Secure  :  500,000,000+
Gmail  :  1,000,000,000+
Hangouts  :  1,000,000,000+
Messenger Lite: Free Calls & Messages  :  100,000,000+
Kik  :  100,000,000+
KakaoTalk: Free Calls & Text  :  100,000,000+
Opera Mini - fast web browser  :  100,000,000+
Opera Browser: Fast and Secure  :  100,000,000+
Telegram  :  100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer  :  100,000,000+
UC Browser Mini -Tiny Fas

Let's try and remove those apps that highly influence the average installs.

In [148]:
under_100_m = []
for app in android_final:
    installs = app[5]
    installs = installs.replace(',', '')
    installs = installs.replace('+', '')
    installs = float(installs)
    if app[1] == "COMMUNICATION" and installs < 100000000:
        under_100_m.append(installs)
print(sum(under_100_m) / len(under_100_m))

3603485.3884615386
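
The cut-off of 100 million installs is a judgment call. A small helper makes it easy to try other cut-offs and see how sensitive the average is. `average_below_cap` is a hypothetical name and the sample install counts are made up:

```python
def average_below_cap(values, cap):
    """Average of the values strictly below `cap`, or None if none remain."""
    kept = [v for v in values if v < cap]
    if not kept:
        return None
    return sum(kept) / len(kept)


sample_installs = [1_000_000_000, 500_000_000, 10_000_000, 1_000_000, 100_000]
print(average_below_cap(sample_installs, 100_000_000))
print(average_below_cap(sample_installs, 5_000_000))
```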


Here the average number of installs for communication apps drops roughly tenfold (from about 38.5 million to about 3.6 million). The same pattern can be seen in the category with the second-highest average number of installs. The market is dominated by a few giant companies.

It wouldn't be a great idea to compete against giants like Google, Facebook and Microsoft in categories such as Communication, Social, Music etc. 

The games category looks pretty popular, but as we saw in the App Store, it seems to be saturated in the Play Store as well.

The books and reference category looks worth investigating, with an average of 8,767,811 installs. It would be a good idea to explore this category further, as it also showed some potential in the App Store.

Let's look at some of the apps in this category and their number of installs.

In [150]:
for app in android_final:
    if app[1] == "BOOKS_AND_REFERENCE":
        print(app[0], " : ", app[5])

E-Book Read - Read Book for free  :  50,000+
Download free book with green book  :  100,000+
Wikipedia  :  10,000,000+
Cool Reader  :  10,000,000+
Free Panda Radio Music  :  100,000+
Book store  :  1,000,000+
FBReader: Favorite Book Reader  :  10,000,000+
English Grammar Complete Handbook  :  500,000+
Free Books - Spirit Fanfiction and Stories  :  1,000,000+
Google Play Books  :  1,000,000,000+
AlReader -any text book reader  :  5,000,000+
Offline English Dictionary  :  100,000+
Offline: English to Tagalog Dictionary  :  500,000+
FamilySearch Tree  :  1,000,000+
Cloud of Books  :  1,000,000+
Recipes of Prophetic Medicine for free  :  500,000+
ReadEra – free ebook reader  :  1,000,000+
Anonymous caller detection  :  10,000+
Ebook Reader  :  5,000,000+
Litnet - E-books  :  100,000+
Read books online  :  5,000,000+
English to Urdu Dictionary  :  500,000+
eBoox: book reader fb2 epub zip  :  1,000,000+
English Persian Dictionary  :  500,000+
Flybook  :  500,000+
All Maths Formulas  :  1,000

We follow the same process for this category and look for the few apps that heavily skew the average.

In [151]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


Only a handful of apps reach this level, far fewer than in other categories, so this category still has potential. Let's look at the apps with a mid-level number of installs (between 1 million and 50 million).

In [153]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

This category looks promising: apps built around popular books such as the Quran have a good number of installs. Taking other popular books, perhaps including recent ones, and turning them into ebook apps could be a good idea.

However, this market is dominated by plain ebook apps, so putting extra effort into creative features that make an app more engaging may work well.
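The mid-level filter used earlier compares the raw installs strings one by one. The same selection can be sketched as a numeric range check, which avoids listing every bucket by hand. This is a minimal sketch under assumptions: the rows are hypothetical, the column positions (name at index 0, category at index 1, installs at index 5) follow the cells above, and `apps_in_install_range` is an illustrative helper rather than part of the notebook.

```python
def parse_installs(raw):
    """Convert an installs string such as '5,000,000+' to an int."""
    return int(raw.replace(',', '').replace('+', ''))


def apps_in_install_range(rows, category, low, high):
    """Return (name, installs) pairs for apps in `category` with low <= installs <= high."""
    matches = []
    for row in rows:
        if row[1] == category and low <= parse_installs(row[5]) <= high:
            matches.append((row[0], row[5]))
    return matches


# Hypothetical rows (not taken from the dataset).
sample_rows = [
    ['Example Reader', 'BOOKS_AND_REFERENCE', '', '', '', '5,000,000+'],
    ['Example Library', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000,000+'],
    ['Example Dictionary', 'BOOKS_AND_REFERENCE', '', '', '', '100,000+'],
    ['Example Chat', 'COMMUNICATION', '', '', '', '10,000,000+'],
]

mid_level = apps_in_install_range(
    sample_rows, 'BOOKS_AND_REFERENCE', 1_000_000, 50_000_000
)
for name, installs in mid_level:
    print(name, ':', installs)
```

The sketch assumes every installs value is a clean string of digits, commas and a trailing `+`, which should hold only after the corrupt row has been removed.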

### Conclusion
In this project we analyzed App Store and Google Play Store apps with the intention of recommending an app profile that could be profitable in both stores.

We found that taking a book or reference work and turning it into an ebook app with some additional creative features has the potential to win users. Such features could include daily quotes from the book, an audio version, a built-in dictionary, quizzes, etc.

### Technical Summary:
We analyzed the data thoroughly and found that both datasets needed some cleaning.
Our data cleaning steps included:
* Removing a corrupt data row
* Logically removing duplicate rows
* Removing non-English apps

After that we went through each dataset and listed the top genres by number of apps.
We also computed the average installs/ratings per genre.
With an idea of which genres/categories have the most ratings/installs, we _ignored those categories that are imbalanced_ (where only a few apps were responsible for most of the average).
We then looked closely at the categories that fall in between, considering both the top genres by number of apps and the average installs per genre (the intersection of the two).
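The per-genre averaging step described above can be sketched as follows. This is an illustration under assumptions, not the notebook's exact code: the rows are hypothetical, the default column indices mirror the Play Store cells above, and `average_installs_by_category` is a made-up helper name.

```python
from collections import defaultdict


def average_installs_by_category(rows, category_index=1, installs_index=5):
    """Return (category, average installs) pairs sorted from highest to lowest average."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        # Installs are strings such as '1,000,000+'
        installs = float(row[installs_index].replace(',', '').replace('+', ''))
        totals[row[category_index]] += installs
        counts[row[category_index]] += 1
    averages = {category: totals[category] / counts[category] for category in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rows (not taken from the dataset).
sample_rows = [
    ['Example Book A', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
    ['Example Book B', 'BOOKS_AND_REFERENCE', '', '', '', '3,000,000+'],
    ['Example Game', 'GAME', '', '', '', '10,000+'],
]

ranking = average_installs_by_category(sample_rows)
for category, average in ranking:
    print(category, ':', average)
```

Sorting by average makes it easy to see which categories lead, after which each leading category can be checked for a handful of dominant apps.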
