# Profitable App Profiles for the App Store and Google Play Markets

This project aims to identify the types of apps that attract the most users. The goal is to give app developers insight into which app profiles draw the largest audiences, so that they can build apps that earn revenue through in-app ads.

Notes: 
- The application developers only build apps that are free to download and install, and that are directed toward an _English-speaking_ audience.

## Define functions

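The `explore_data()` function returns a slice of the dataset bounded by the `start` and `end` parameters.
The `delete_data()` function returns a copy of the dataset that excludes the rows whose indices appear in the `wrong_data_dict` parameter (despite its name, any collection of indices works; a set is used below).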


In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
#     for row in dataset_slice:
#         print(row)
#         print('\n') # adds a new (empty) line after each row
    
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))
        
    return dataset_slice

def delete_data(dataset, wrong_data_dict, rows_and_columns=False):
    result = [v for i, v in enumerate(dataset) if i not in wrong_data_dict]
    
    if rows_and_columns:
        print('Number of rows: ', len(result))
        print('Number of columns: ', len(result[0]))
    
    return result
        
        

## Retrieve datasets for both the App Store and Google Play

Using the `explore_data()` function defined above, load the datasets for both the App Store and Google Play, keeping the header row separate from the data rows.

In [2]:
from csv import reader

apple_store_opened_file = open('datasets/AppleStore.csv', encoding="utf8")
apple_store_read_file = reader(apple_store_opened_file)
apple_store_data_list = list(apple_store_read_file) 

ios_header = explore_data(apple_store_data_list, 0, 1)
ios_dataset = explore_data(apple_store_data_list[1:], 0, len(apple_store_data_list[1:]), True)
print(ios_header)

google_play_store_opened_file = open('datasets/googleplaystore.csv', encoding="utf8")
google_play_store_read_file = reader(google_play_store_opened_file)
google_play_store_data_list = list(google_play_store_read_file)

android_header = explore_data(google_play_store_data_list, 0, 1)
android_dataset = explore_data(google_play_store_data_list[1:], 0, len(google_play_store_data_list[1:]), True)
print(android_header)





Number of rows:  7197
Number of columns:  16
[['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']]
Number of rows:  10841
Number of columns:  13
[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']]


## Remove wrong data from the datasets

Some rows were reported as incorrect in the dataset's [discussion forum](https://www.kaggle.com/lava18/google-play-store-apps/discussion?search=wrong) and should be removed so that the analysis runs on accurate data.

No incorrect rows were identified in the App Store dataset.

In [3]:
android_wrong_data = {9148, 10472} # wrong data indices
android_dataset = delete_data(android_dataset, android_wrong_data, True)

Number of rows:  10839
Number of columns:  13
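The indices above come from the forum reports. As a rough, independent check, the sketch below flags rows whose column count differs from the header's. It assumes that a malformed row shows up as a length mismatch, which may not hold for every reported row; `find_malformed_rows` and the toy data are hypothetical and not part of the original notebook.

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]

# Toy data for illustration only (not taken from the real dataset).
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '4.7'],            # a column is missing here
    ['App C', 'GAME', '3.9'],
]

malformed = find_malformed_rows(toy_header, toy_rows)
print(malformed)
```

Any index returned this way would still need a manual look before it is passed to `delete_data()`.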


## Removing duplicate entries

If we look through the [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section, we notice that some apps have duplicate entries. For instance, Instagram has four entries:

In [4]:
for app in android_dataset:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In total, there are 1,181 cases where an app occurs more than once:

In [5]:
duplicate_apps = []
unique_apps = []

for app in android_dataset:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print("Total duplicate entries from Android: " + str(len(duplicate_apps)))

Total duplicate entries from Android: 1181
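Because `name in unique_apps` scans a list on every iteration, the loop above slows down as the dataset grows. A minimal alternative sketch, assuming only the app name identifies a duplicate, uses `collections.Counter`; `count_duplicates` and the toy rows are hypothetical.

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Count the extra occurrences of each name beyond its first one."""
    counts = Counter(row[name_index] for row in rows)
    return sum(n - 1 for n in counts.values())

# Toy rows for illustration: 'Instagram' appears 3 times, 'Maps' twice.
toy_rows = [['Instagram'], ['Maps'], ['Instagram'], ['Instagram'], ['Maps']]
n_duplicates = count_duplicates(toy_rows)
print("Total duplicate entries:", n_duplicates)
```

The count follows the same definition as `len(duplicate_apps)`: every occurrence after the first is a duplicate.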


Rather than removing duplicate entries randomly, it's better to set a criterion for removal. In the Instagram example, the number of reviews differs between the entries, which suggests that the data was collected at different times.

In [6]:
for app in android_dataset:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Therefore, we keep only the entry with the highest number of reviews, on the assumption that it is the most recent one.

In [7]:
print("Expected number of entries after removing duplicates: ", len(android_dataset) - 1181)

reviews_max = {}

for app in android_dataset:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print("The number of entries after deleting duplicates: ", len(reviews_max))

Expected number of entries after removing duplicates:  9658
The number of entries after deleting duplicates:  9658


Use the dictionary created above to remove the duplicate rows:
- Start by creating two empty lists: `android_clean` (which will store our new cleaned data set) and `already_added` (which will just store app names).
- Loop through the Google Play data set (without the header row), and for each row whose number of reviews equals `reviews_max[name]` and whose name is not yet in `already_added`:
 - Append the entire row to the `android_clean` list (which will eventually be a list of lists and store our cleaned data set).
 - Append the name of the app `name` to the `already_added` list; this keeps track of the apps we have already added, so that duplicates sharing the same maximum number of reviews are added only once.

In [8]:
android_clean = []
already_added = [] # will store app names.

for app in android_dataset:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
        
print("The number of clean data entities: ", len(android_clean))



The number of clean data entities:  9658


## Removing non-English apps

Remember we use English for the apps we develop, and we'd like to analyze only the apps that are directed toward an English-speaking audience. However, if we explore the data long enough, we'll find that both data sets have apps with names that suggest they are not directed toward an English-speaking audience.

In [9]:
print(ios_dataset[813][1])
print(ios_dataset[6731][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


The numbers corresponding to the characters we commonly use in English text are all in the range 0 to 127, according to the [ASCII](https://en.wikipedia.org/wiki/ASCII) (American Standard Code for Information Interchange) system. Based on this range, we can build a function that detects whether a character belongs to the set of common English characters: if the number returned by `ord()` is less than or equal to 127, the character belongs to that set.


In [10]:
def is_app_name_english(app_name):
    for char in app_name:
        ascii_num = ord(char)
        if ascii_num > 127:
            return False
    
    return True

Use the function to check whether these app names are detected as English or non-English:
- `'Instagram'`
- `'爱奇艺PPS -《欢乐颂2》电视剧热播'`
- `'Docs To Go™ Free Office Suite'`
- `'Instachat 😜'`
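The check on these four names can be run as in the sketch below. The function is restated so the snippet runs on its own; the `names` and `results` variables are illustrative.

```python
def is_app_name_english(app_name):
    # Same logic as the cell above: any non-ASCII character fails the check.
    for char in app_name:
        if ord(char) > 127:
            return False
    return True

names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]

results = {name: is_app_name_english(name) for name in names}
for name, verdict in results.items():
    print(name, '->', verdict)
```

Only `'Instagram'` passes; `™` and the emoji both have code points above 127, so the two English names that contain them are rejected.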

After running those app names through the function, we notice that English app names containing special characters (such as `™` or an emoji) are classified as non-English. This loses useful data, since many English apps will be incorrectly labeled as non-English. To minimize the data loss, we'll only remove an app if its name contains more than three characters whose numbers fall outside the ASCII range. The function is still not perfect, but it should be fairly effective.

In [11]:
def is_app_name_english(app_name):
    # Count the characters whose code point falls outside the ASCII range (0-127).
    num_of_outside_range = 0
    for char in app_name:
        if ord(char) > 127:
            num_of_outside_range += 1
            # More than three non-ASCII characters: treat the name as non-English.
            if num_of_outside_range > 3:
                return False
    
    return True

Use the new function to check whether these app names are detected as English or non-English:

- `'Docs To Go™ Free Office Suite'`
- `'Instachat 😜'`
- `'爱奇艺PPS -《欢乐颂2》电视剧热播'`
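A short sketch of that check follows. The function is restated compactly so the snippet is self-contained; the `max_non_ascii` parameter is a hypothetical addition that makes the threshold of three explicit, and the behavior is intended to match the relaxed check above.

```python
def is_app_name_english(app_name, max_non_ascii=3):
    # Count characters outside the ASCII range and allow at most max_non_ascii.
    non_ascii = sum(1 for char in app_name if ord(char) > 127)
    return non_ascii <= max_non_ascii

names = [
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

results = {name: is_app_name_english(name) for name in names}
for name, verdict in results.items():
    print(name, '->', verdict)
```

The two English names now pass because each contains a single non-ASCII character, while the Chinese title is still rejected.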

Use the function to filter out non-English apps from both datasets.

In [12]:
ios_eng_apps = []
android_eng_apps = []

for app in ios_dataset:
    name = app[1]
    if is_app_name_english(name):
        ios_eng_apps.append(app)

for app in android_clean:
    name = app[0]
    if is_app_name_english(name):
        android_eng_apps.append(app)
    
print("Remaining ios data entities: ", len(ios_eng_apps))
print("Remaining android data entities: ", len(android_eng_apps))

Remaining ios data entities:  6183
Remaining android data entities:  9613


As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our datasets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.

In [13]:
ios_free_apps = []
android_free_apps = []

for app in ios_eng_apps:
    price = float(app[4])
    if price == 0.0:
        ios_free_apps.append(app)

for app in android_eng_apps:
    price = float(app[7].replace('$', ''))
    if price == 0.0:
        android_free_apps.append(app)
        
print("Remaining ios data entities: ", len(ios_free_apps))
print("Remaining android data entities: ", len(android_free_apps))
        

Remaining ios data entities:  3222
Remaining android data entities:  8863


## Find most common apps by genre

Our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps. 

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well on both markets might be a productivity app that makes use of gamification.

For this, we'll need to build frequency tables, displayed in ascending order, for a few columns in our datasets.
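As a toy illustration of what such a table contains, the sketch below computes percentage frequencies with `collections.Counter`. It is an alternative formulation rather than the notebook's own helper; `percentage_table` and the toy rows are hypothetical.

```python
from collections import Counter

def percentage_table(rows, index):
    """Return (value, percentage) pairs sorted by ascending percentage."""
    counts = Counter(row[index] for row in rows)
    total = len(rows)
    pairs = ((value, count / total * 100) for value, count in counts.items())
    return sorted(pairs, key=lambda pair: pair[1])

# Toy rows: column 1 plays the role of the genre column.
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Games'], ['d', 'Music']]
toy_table = percentage_table(toy_rows, 1)
for genre, pct in toy_table:
    print(genre, ':', pct)
```

Each percentage is the share of rows carrying that value, so the values in one table always sum to 100.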


In [22]:
def freq_table(dataset, index):
    freq_table_dict = {}
    total_apps = len(dataset)
    for row in dataset:
        val = row[index]
        if val in freq_table_dict:
            freq_table_dict[val] += 1
        else:
            freq_table_dict[val] = 1
            
    for key in freq_table_dict:
        freq_table_dict[key] = (freq_table_dict[key]/total_apps) * 100
    return freq_table_dict

def display_table(dataset, index, reverse=False):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    
    table_sorted = sorted(table_display, reverse=reverse)
    
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

display_table(ios_free_apps, 11)
print("\n")
display_table(android_free_apps, 9)
print("\n")
display_table(android_free_apps, 1)

Catalogs : 0.12414649286157665
Medical : 0.186219739292365
Navigation : 0.186219739292365
Book : 0.4345127250155183
Business : 0.5276225946617008
Reference : 0.5586592178770949
Food & Drink : 0.8069522036002483
Weather : 0.8690254500310366
Finance : 1.1173184357541899
Travel : 1.2414649286157666
News : 1.3345747982619491
Lifestyle : 1.5828677839851024
Productivity : 1.7380509000620732
Health & Fitness : 2.0173805090006205
Music : 2.0484171322160147
Sports : 2.1415270018621975
Utilities : 2.5139664804469275
Shopping : 2.60707635009311
Social Networking : 3.2898820608317814
Education : 3.662321539416512
Photo & Video : 4.9658597144630665
Entertainment : 7.883302296710118
Games : 58.16263190564867


Adventure;Education : 0.011282861333634209
Arcade;Pretend Play : 0.011282861333634209
Art & Design;Action & Adventure : 0.011282861333634209
Art & Design;Pretend Play : 0.011282861333634209
Books & Reference;Education : 0.011282861333634209
Card;Action & Adventure : 0.011282861333634209
Casual

The frequency tables we analyzed above showed us that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and fun apps. Now, we'd like to get an idea about the kind of apps with the most users.

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play dataset, we can find this information in the `Installs` column, but this information is missing for the App Store dataset. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.


In [15]:
ios_genre = freq_table(ios_free_apps, 11)

for genre in ios_genre:
    total = 0
    len_genre = 0
    for app in ios_free_apps:
        genre_app = app[11]
        if genre_app == genre:
            num_of_ratings = float(app[5])
            total += num_of_ratings
            len_genre += 1
    avg_app_genre = total/len_genre
    print(genre + ": " + str(avg_app_genre))

Social Networking: 71548.34905660378
Photo & Video: 28441.54375
Games: 22788.6696905016
Music: 57326.530303030304
Reference: 74942.11111111111
Health & Fitness: 23298.015384615384
Weather: 52279.892857142855
Utilities: 18684.456790123455
Travel: 28243.8
Shopping: 26919.690476190477
News: 21248.023255813954
Navigation: 86090.33333333333
Lifestyle: 16485.764705882353
Entertainment: 14029.830708661417
Food & Drink: 33333.92307692308
Sports: 23008.898550724636
Book: 39758.5
Finance: 31467.944444444445
Education: 7003.983050847458
Productivity: 21028.410714285714
Business: 7491.117647058823
Catalogs: 4004.0
Medical: 612.0


On average, Navigation apps have the highest number of user reviews, but this figure is heavily influenced by Waze and Google Maps, which have close to half a million user reviews together:

In [18]:
for app in ios_free_apps:
    if app[11] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


The same pattern applies to social networking apps, where the average number is heavily influenced by a few giants like Facebook, Pinterest, Skype, etc. The same applies to music apps, where a few big players like Pandora, Spotify, and Shazam heavily influence the average number.

Our aim is to find popular genres, but navigation, social networking or music apps might seem more popular than they really are. The average number of ratings seems to be skewed by a very small number of apps that have hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold. We could get a better picture by removing these extremely popular apps for each genre and then reworking the averages, but we'll leave this level of detail for later.
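One lightweight way to see how strongly a few giants skew a genre, short of removing them, is to compare the mean with the median. The sketch below uses the Navigation rating counts printed earlier; using the median as the robust statistic is our assumption here, not part of the original analysis.

```python
from statistics import mean, median

# Rating counts of the six free English Navigation apps listed earlier.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

nav_mean = mean(navigation_ratings)
nav_median = median(navigation_ratings)

print("Mean:  ", nav_mean)
print("Median:", nav_median)
```

The mean matches the Navigation average reported above (about 86,090), while the median is 8,196.5, roughly ten times smaller, which illustrates how much Waze and Google Maps dominate the genre.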

Reference apps have 74,942 user ratings on average, but it's actually the Bible and Dictionary.com that skew the average upward:


In [20]:
for app in ios_free_apps:
    if app[11] == 'Reference':  # index 11 is prime_genre, same column as app[-5]
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


However, this niche seems to show some potential. One thing we could do is take another popular book and turn it into an app where we could add different features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes about the book, etc. On top of that, we could also embed a dictionary within the app, so users don't need to exit our app to look up words in an external app.

This idea seems to fit well with the fact that the App Store is dominated by for-fun apps. This suggests the market might be a bit saturated with for-fun apps, which means a practical app might have more of a chance to stand out among the huge number of apps on the App Store.

Other genres that seem popular include weather, book, food and drink, and finance. The book genre seems to overlap a bit with the app idea described above, but the other genres don't seem too interesting to us:
- Weather apps: people generally don't spend much time in-app, so the chances of making a profit from in-app ads are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.
- Food and drink: examples here include Starbucks, Dunkin' Donuts, McDonald's, etc. Making a popular food and drink app therefore requires actual cooking and a delivery service, which is outside our scope.
- Finance apps: these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.

Now let's analyze the Google Play market a bit.

## Most popular apps by genre on Google Play

In [23]:
display_table(android_free_apps, 5, True)


1,000,000+ : 15.728308699086089
100,000+ : 11.55365000564143
10,000,000+ : 10.549475346947986
10,000+ : 10.199706645605326
1,000+ : 8.394448832223853
100+ : 6.916393997517771
5,000,000+ : 6.826131106848697
500,000+ : 5.562450637481666
50,000+ : 4.772650344127271
5,000+ : 4.513144533453684
10+ : 3.542818458761142
500+ : 3.2494640640866526
50,000,000+ : 2.3017037120613786
100,000,000+ : 2.1324607920568655
50+ : 1.9180864267178157
5+ : 0.7898002933543946
1+ : 0.5077287600135394
500,000,000+ : 0.270788672007221
1,000,000,000+ : 0.2256572266726842
0+ : 0.045131445334536835
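The install counts are open-ended strings such as `'1,000,000+'`, so they have to be converted to numbers before an average per category can be computed. A possible next step is sketched below, under the assumption that each string is treated as its lower bound (so `'100,000+'` counts as 100,000 installs). `parse_installs`, `average_installs_by_category`, and the toy rows are hypothetical.

```python
def parse_installs(text):
    """Convert a string like '1,000,000+' to a float (its lower bound)."""
    return float(text.replace(',', '').replace('+', ''))

def average_installs_by_category(rows, category_index=1, installs_index=5):
    """Return {category: average installs} for Google Play style rows."""
    totals = {}
    counts = {}
    for row in rows:
        category = row[category_index]
        installs = parse_installs(row[installs_index])
        totals[category] = totals.get(category, 0.0) + installs
        counts[category] = counts.get(category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}

# Toy rows: only columns 1 (Category) and 5 (Installs) matter here.
toy_rows = [
    ['A', 'SOCIAL', '', '', '', '1,000,000+'],
    ['B', 'SOCIAL', '', '', '', '500,000+'],
    ['C', 'TOOLS', '', '', '', '10,000+'],
]

averages = average_installs_by_category(toy_rows)
print(averages)
```

On the real data the call would be `average_installs_by_category(android_free_apps)`. Because the buckets are lower bounds, the resulting averages are approximate and best read as a ranking signal rather than exact install counts.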
