# Profitable App Profiles

## Intro
This project analyzes data from the Apple App Store and the Google Play Store to find profitable app profiles, meaning the kinds of apps most likely to attract users. We focus specifically on free apps. The goal is to decide what kind of app to develop so that it has a good chance of doing well on both iOS and Android.

## Opening the Files

As of September 2018 there were approximately 2 million iOS apps and 2.1 million Android apps on their respective stores. Analyzing millions of apps would be overkill, so we will work with a sample instead.

The datasets we will be using are:
* [A data set](https://www.kaggle.com/lava18/google-play-store-apps) with data on ~10,000 Android apps. The data was scraped in August 2018. The dataset can be downloaded directly [here](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
* [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) with data on ~7,000 iOS apps. The data was scraped in July 2017. The dataset can be downloaded directly [here](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

We will start by opening the datasets for exploration.

In [1]:
from csv import reader

# Google Play data (the with-block closes the file once it has been read)
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

# Apple App Store data
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

Because we'll want to explore the datasets several times we can write a function to make this easier for us.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

The function above takes four parameters:
* dataset, which is expected to be a 2D list without a header row (otherwise the function prints the wrong number of rows)
* start and end, which are expected to be integers marking the start and end indices of a slice of the dataset
* rows_and_columns, which is expected to be a boolean and defaults to False

The function:
* slices the dataset according to start and end
* loops through the slice and prints each row, followed by a blank line
* prints the number of rows and columns if rows_and_columns is True

Let's take a look at the Google Play Store data.

In [3]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


It appears that the columns that could help with our objective are
* App
* Category
* Rating
* Installs
* Type
* Price
* Genres

Now let's look at the Apple App Store data.

In [4]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


The columns of interest appear to be
* track_name
* currency
* price
* rating_count_tot
* rating_count_ver
* prime_genre

## Data Cleaning
We've previously established we are only interested in the apps which are free so we will need to clean our data to get only the appropriate apps. We also want to make sure our dataset consists of English apps only to make our analysis simpler for us. Naturally we also want to ensure that the dataset is free from any errors.

The Google Play dataset has a [dedicated discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion). In that section, [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error in a certain row.

The row at index 10472 (header excluded) appears to be missing its category. Let's verify that.

In [5]:
print(android_header)
print('\n')
print(android[10472])
print('\n')
print(android[0]) # Print another row to compare with

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


That row is indeed missing its category value, which shifts every later column one position to the left, so we will simply remove it.

In [6]:
print(len(android))
del android[10472]  # Make sure to only run this once
print(len(android))

10841
10840


Looking again at the discussion section on Kaggle, we can also see reports of apps having duplicate entries. Instagram, for example, was reported as having duplicates.

In [7]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In order to find all instances of apps with duplicate entries we can use the following code.

In [8]:
duplicate_apps = []
unique_apps  = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))

Number of duplicate apps: 1181


Above, we:
* Created two lists, one for storing the names of duplicate apps and the other for storing the names of unique apps.
* Looped through the android dataset, and for each iteration:
    * Saved the app name to a variable named name.
    * If name was already in the unique_apps list, appended name to the duplicate_apps list.
    * Otherwise, appended name to the unique_apps list.
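
The membership test on a list scans the whole list each time, so the duplicate check slows down as the dataset grows. A minimal sketch of the same count using a set is shown here. The helper name `count_duplicate_names` and the sample rows are illustrative assumptions, not part of the notebook.

```python
def count_duplicate_names(dataset, name_index):
    """Return one entry for every repeat occurrence of a name."""
    seen = set()      # names encountered so far (fast membership tests)
    duplicates = []   # one entry per extra occurrence of a name
    for row in dataset:
        name = row[name_index]
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates


# Hypothetical sample rows: [name, number of reviews]
sample_rows = [
    ['Instagram', '300'],
    ['Instagram', '350'],
    ['Wikipedia', '100'],
    ['Instagram', '320'],
]
sample_duplicates = count_duplicate_names(sample_rows, 0)
print('Number of duplicate apps:', len(sample_duplicates))
```

The result should match the list-based loop; only the lookup structure changes.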
    
Now comes the question of how to remove these duplicate entries. We could remove them indiscriminately, but there is a better way. If we examine the rows we printed for Instagram, we can see that the number of reviews differs between entries. Since review counts grow over time, the row with the highest number of reviews should be the most recent one, so that is the row we will keep.

To remove the duplicates, we will:
* Create a dictionary where each key is a unique app name and the corresponding value is the highest number of reviews for that app.
* Use the information stored in the dictionary to create a new dataset with only one entry per app.

In [9]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

We know from earlier that we have 1181 duplicate apps so our dictionary should be the length of the original dataset - 1181.

In [10]:
print('Expected Length: ', len(android) - 1181)
print('Actual Length: ', len(reviews_max))

Expected Length:  9659
Actual Length:  9659


Now we can use this dictionary to actually remove the duplicate rows.

In [11]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

Here we create two lists: one to store the cleaned dataset and another to store the names of the apps we have already included.

We loop through the dataset. If n_reviews equals the maximum number of reviews for that app name (found in the reviews_max dictionary) and name is not yet in already_added, we append the entire row to android_clean and append the name to already_added. The second condition matters because duplicate rows can share the same maximum review count; without it, each of those rows would be kept.
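
The two passes can also be collapsed into one by storing the best row per name directly. This is only a sketch under the same rule (keep the entry with the most reviews, and keep the first one on ties). `dedupe_by_max_reviews` and the sample rows are hypothetical names introduced for illustration.

```python
def dedupe_by_max_reviews(dataset, name_index, reviews_index):
    """Keep one row per app name: the row with the highest review count."""
    best_rows = {}  # app name -> row with the most reviews seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        current = best_rows.get(name)
        # Strict '>' keeps the earlier row when review counts tie.
        if current is None or n_reviews > float(current[reviews_index]):
            best_rows[name] = row
    # Dictionaries preserve insertion order (Python 3.7+), so apps come out
    # in the order in which their names were first seen.
    return list(best_rows.values())


# Hypothetical sample rows: [name, number of reviews]
sample_rows = [
    ['Instagram', '300'],
    ['Instagram', '350'],
    ['Sketch', '120'],
    ['Instagram', '350'],
]
sample_clean = dedupe_by_max_reviews(sample_rows, 0, 1)
print(sample_clean)
```

The row order can differ from the two-pass version, which emits each app at the position of its highest-review row rather than at its first appearance.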

In [12]:
# Verifying our above work is correct, our data should have 9659 rows

explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


Now we check to see if there are any duplicates in the iOS dataset.

In [13]:
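duplicate_ios = []
unique_ios = []

for app in ios:
    name = app[1]
    if name in unique_ios:
        duplicate_ios.append(name)  # record repeats in the iOS list
    else:
        unique_ios.append(name)     # not in the Android unique_apps list

print('Number of duplicate apps:', len(duplicate_ios))

Number of duplicate apps: 0


The recorded output reports no duplicates, but it should be treated with caution: it was printed when the loop still appended to the Android lists (duplicate_apps and unique_apps), which leaves duplicate_ios empty whatever the data contains. Re-run the cell as written above to confirm the count before moving on to the next step.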

## Removal of Non-English Apps

We are not interested in keeping apps with non-English names, since as English speakers we would not be able to analyze them properly. To detect them we will use ASCII: the characters commonly used in English text have code points in the range 0 to 127, so a character outside that range suggests that the app name is not English.

To do so we will first write a function that checks whether an app has an English title or not.

In [14]:
def is_eng(a_string):
    for character in a_string:
        if ord(character) > 127:
            return False
    return True

# Testing
print(is_eng('Instagram'))
print(is_eng('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
False


We know this function works but what about apps with English titles and special characters?

In [15]:
print(is_eng('Docs To Go™ Free Office Suite'))
print(is_eng('Instachat 😜'))

False
False


Because the ™ symbol and the emoji have Unicode code points greater than 127 (they are not ASCII characters at all), our function treats these names as non-English. If we used the function as is, we would discard useful data points. To prevent this we can modify the original function.
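
To see why the first version rejects these titles, it helps to look at the code points involved. The snippet below is a small illustration using only the built-in `ord`. The choice of characters and the `code_points` name are ours.

```python
# Code points for a plain letter, the trademark sign, and an emoji.
code_points = {character: ord(character) for character in ['a', '™', '😜']}

for character, code_point in code_points.items():
    # The last column shows whether the character is within the ASCII range.
    print(repr(character), code_point, code_point <= 127)
```

Only the plain letter falls inside the 0 to 127 range, which is why a single symbol is enough to make the strict check return False.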

In [16]:
def is_eng(a_string):
    counter = 0
    for character in a_string:
        if ord(character) > 127:
            counter += 1
    if counter > 3:
        return False
    else:
        return True

The function above deems an app non-English only if its name contains more than three characters outside the ASCII range. It is not a perfect solution, but it will work for us.
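
The same rule can be written more compactly with the threshold exposed as a parameter, which makes it easy to experiment with stricter or looser cut-offs. This is a sketch, not the notebook's function: `is_probably_english` and `max_non_ascii` are hypothetical names.

```python
def is_probably_english(a_string, max_non_ascii=3):
    """Return True when at most `max_non_ascii` characters fall outside ASCII."""
    non_ascii_count = sum(1 for character in a_string if ord(character) > 127)
    return non_ascii_count <= max_non_ascii


print(is_probably_english('Docs To Go™ Free Office Suite'))
print(is_probably_english('Instachat 😜'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
```

With the default threshold, the first two titles pass and the third is rejected, matching the behaviour described for the modified function.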

In [17]:
android_eng = []
ios_eng = []

for app in android_clean:
    name = app[0]
    if is_eng(name):
        android_eng.append(app)
        
for app in ios:
    name = app[1]
    if is_eng(name):
        ios_eng.append(app)
        
explore_data(android_eng, 0, 3, True)
print('\n')
explore_data(ios_eng, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

## Isolating the Free Apps

Now that we have the two datasets of English apps, we can isolate the free apps by following a procedure similar to the one above.

In [18]:
android_free = []
ios_free = []

for app in android_eng:
    app_type = app[6]  # index 6 is the 'Type' column, not 'Price'
    if app_type == 'Free':
        android_free.append(app)
        
for app in ios_eng:
    price = float(app[4])
    if price == 0.0:
        ios_free.append(app)
        
explore_data(android_free, 0, 3, True)
print()
explore_data(ios_free, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8863
Number of columns: 13

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', '

## Analysis

Now that we have finished cleaning our data, we can move on to the analysis. First we need to determine what kinds of apps are successful on both the Android and iOS markets. We can begin by getting a sense of the most common app genres in each market.

For this we will build a frequency table for the genres for each dataset.

### Building the Frequency Table

In [19]:
def freq_table(dataset, index):
    table = {}
    total = 0
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
    
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
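
For comparison, the percentage table can also be built with `collections.Counter` from the standard library. The sketch below is an alternative, not a replacement for the functions above. `freq_table_counter` and the sample rows are hypothetical.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Map each value in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


# Hypothetical rows: [name, genre]
sample_rows = [['A', 'Games'], ['B', 'Games'], ['C', 'Music'], ['D', 'Games']]
sample_table = freq_table_counter(sample_rows, 1)

# Sort by percentage, largest first, as display_table does.
for genre, percentage in sorted(sample_table.items(),
                                key=lambda item: item[1], reverse=True):
    print(genre, ':', percentage)
```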

### Genre Sizes for the iOS and Google Play Store

Let us first examine the iOS dataset.

In [20]:
print(ios_header)
print()

display_table(ios_free, -5)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


These are the percentages of apps that fall into each genre. Among free English apps, a clear majority (58%) are games.

The general impression is that most free apps are designed for games and fun. To me this suggests that the category is oversaturated: if we developed a game, it would likely get lost in the sea of free gaming apps. It is also important to note that this table tells us which kinds of apps are most frequent, not which have the most users.

Now let's take a look at what we can determine from the Android dataset.

In [21]:
print(android_header)
print()

# Categories
display_table(android_free, 1)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.925194

In [22]:
# Genres
display_table(android_free, -4)

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & Vehicles : 0.9251946293580051
S

The top category on the Play Store is "Family", followed by "Game". The genres are quite diverse.

From this data we again can't make a solid recommendation for an app profile, although it suggests avoiding the most saturated categories. As before, these frequencies count apps, not users. To profit from a free app we depend on running ads, which means we want to reach the greatest possible number of users.

### Most Popular Apps by Genre on iOS

To get an idea of which apps have the most users, we can calculate the average number of installs for each app genre. The Google Play dataset has an Installs column, but the iOS dataset does not, so for iOS we will use the total number of user ratings (rating_count_tot) as a proxy.

To do this we will need to:
* Isolate the apps of each genre
* Sum the number of user ratings for the apps of that genre
* Divide the sum by the number of apps belonging to that genre
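
The per-genre average can also be computed in a single pass over the rows by keeping a running sum and count per genre. The sketch below assumes nothing beyond a list of rows with a numeric string column. `average_by_group` and the sample rows are hypothetical.

```python
def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the numeric column} using one pass over the rows."""
    totals = {}  # group -> running sum of the value column
    counts = {}  # group -> number of rows in the group
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical rows: [name, number of ratings, genre]
sample_rows = [
    ['A', '100', 'Games'],
    ['B', '300', 'Games'],
    ['C', '50', 'Music'],
]
sample_averages = average_by_group(sample_rows, 2, 1)
print(sample_averages)
```

This avoids re-scanning the whole dataset once per genre, which matters little at this size but keeps the logic in one loop.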

In [23]:
genres_ios = freq_table(ios_free, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[-5]
        if genre_app == genre:
            num_ratings = float(app[5])
            total += num_ratings
            len_genre += 1
    avg_ratings = total / len_genre
    print(genre, ":", avg_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


Navigation apps have the highest average number of ratings on the Apple App Store, followed by Reference and Social Networking apps. However, these averages may be skewed by a handful of massively popular apps, for instance Google Maps, Waze, Facebook, Instagram, and Twitter.

In [24]:
for app in ios_free:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


In [25]:
for app in ios_free:
    if app[-5] == "Social Networking":
        print(app[1], ':', app[5])

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

As we can see, the most popular apps in these categories are developed by tech giants, and a few of them account for most of the ratings. It is probably safe to assume that the same holds for many other categories: a few massively popular apps skew the averages.
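
One way to see how strongly a few apps pull the average is to compare the mean with the median, which is less sensitive to outliers. The snippet below uses the Navigation rating counts printed earlier. Using the median is our suggestion here, not part of the notebook's method.

```python
from statistics import mean, median

# Rating counts for the six free Navigation apps listed in the output above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print('Mean:', mean(navigation_ratings))
print('Median:', median(navigation_ratings))
```

The mean reproduces the Navigation average from the genre table (about 86,090), while the median is 8196.5, roughly a tenth of that.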

Let's take a look at one of the categories unlikely to be dominated by one of these tech giants.

In [26]:
for app in ios_free:
    if app[-5] == "Health & Fitness":
        print(app[1], ':', app[5])

Calorie Counter & Diet Tracker by MyFitnessPal : 507706
Lose It! – Weight Loss Program and Calorie Counter : 373835
Weight Watchers : 136833
Sleep Cycle alarm clock : 104539
Fitbit : 90496
Period Tracker Lite : 53620
Nike+ Training Club - Workouts & Fitness Plans : 33969
Plant Nanny - Water Reminder with Cute Plants : 27421
Sworkit - Custom Workouts for Exercise & Fitness : 16819
Clue Period Tracker: Period & Ovulation Tracker : 13436
Headspace : 12819
Fooducate - Lose Weight, Eat Healthy,Get Motivated : 11875
Runtastic Running, Jogging and Walking Tracker : 10298
WebMD for iPad : 9142
8fit - Workouts, meal plans and personal trainer : 8730
Garmin Connect™ Mobile : 8341
Record by Under Armour, connects with UA HealthBox : 7754
Fitstar Personal Trainer : 7496
My Cycles Period and Ovulation Tracker : 7469
Seven - 7 Minute Workout Training Challenge : 6808
RUNNING for weight loss: workout & meal plans : 6407
Lifesum – Inspiring healthy lifestyle app : 5795
Waterlogged - Daily Hydration Tr

This niche shows some promise. While there are some hugely popular apps (MyFitnessPal, Nike Run Club), several other apps are still quite popular and are not developed by a massive company.

Plant Nanny has 27,000+ reviews and it appears to be a water intake tracking app with some kind of gamification aspect. Other popular apps appear to be workout trackers. Perhaps an app that could do well in this niche would be a workout logger/tracker with some gamification aspect implemented.

Now let's take a look at the Google Play Store.

### Most Popular Apps by Genre on Google Play

In [27]:
display_table(android_free, 5)

1,000,000+ : 15.728308699086089
100,000+ : 11.55365000564143
10,000,000+ : 10.549475346947986
10,000+ : 10.199706645605326
1,000+ : 8.394448832223853
100+ : 6.916393997517771
5,000,000+ : 6.826131106848697
500,000+ : 5.562450637481666
50,000+ : 4.772650344127271
5,000+ : 4.513144533453684
10+ : 3.542818458761142
500+ : 3.2494640640866526
50,000,000+ : 2.3017037120613786
100,000,000+ : 2.1324607920568655
50+ : 1.9180864267178157
5+ : 0.7898002933543946
1+ : 0.5077287600135394
500,000,000+ : 0.270788672007221
1,000,000,000+ : 0.2256572266726842
0+ : 0.045131445334536835


Above we see the distribution of the Installs column for Google Play. The values are not very granular: an app listed as 100,000+ could have 100,000 installs or 200,000, and there is no way of knowing which. For our purposes this is fine. We only want to find out which app genres attract the most users, and we don't need perfect precision for that.

We can leave these numbers as they are, but we need to convert the column from strings to numbers. To do that we first have to remove the commas and plus signs.
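
The conversion itself is small enough to isolate in a helper, which also makes it easy to test on its own. `parse_installs` is a hypothetical name used only for this sketch.

```python
def parse_installs(raw_installs):
    """Convert an install bucket such as '1,000,000+' to a float."""
    cleaned = raw_installs.replace(',', '').replace('+', '')
    return float(cleaned)


for raw in ['1,000,000+', '500+', '0+']:
    print(raw, '->', parse_installs(raw))
```

The result is the lower bound of the bucket, so averages built from it understate the true install counts.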

In [28]:
categories_android = freq_table(android_free, 1)

for cat in categories_android:
    total = 0
    len_cat = 0
    for app in android_free:
        category_app = app[1]
        if category_app == cat:
            num_installs = app[5]
            num_installs = num_installs.replace(',', '')
            num_installs = num_installs.replace('+', '')
            total += float(num_installs)
            len_cat += 1
    avg_num_installs = total / len_cat
    print(cat, ':', avg_num_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3697848.1731343283
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

Again, it is probably wise to avoid categories that are likely to be dominated by apps from giant companies. Communication is one example: some apps in this category have over a billion installs.

In [29]:
for app in android_free:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

These categories seem more popular than they really are because a small number of giants dominate them, so we should avoid them when developing a new application.

Let's explore the BOOKS_AND_REFERENCE category, as it appears to be quite sizable.

In [30]:
for app in android_free:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

Again we see a few applications with an overwhelmingly large number of installs that skew the average.

In [31]:
for app in android_free:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


Let's ignore these and see which apps perform well in the range from 1,000,000+ to 50,000,000+ installs, that is, below 100 million.
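
Selecting a range of install buckets can also be done numerically instead of comparing against each bucket string. The sketch below uses simplified two-column rows built from apps shown in the earlier output. `installs_in_range` is a hypothetical helper.

```python
def installs_in_range(dataset, installs_index, minimum, maximum):
    """Return rows whose install bucket satisfies minimum <= installs < maximum."""
    selected = []
    for row in dataset:
        installs = float(row[installs_index].replace(',', '').replace('+', ''))
        if minimum <= installs < maximum:
            selected.append(row)
    return selected


# Simplified rows: [name, installs]
sample_rows = [
    ['Wikipedia', '10,000,000+'],
    ['Google Play Books', '1,000,000,000+'],
    ['E-Book Read - Read Book for free', '50,000+'],
    ['Bible', '100,000,000+'],
]
mid_range = installs_in_range(sample_rows, 1, 1_000_000, 100_000_000)
for app in mid_range:
    print(app[0], ':', app[1])
```

The upper bound is exclusive, so apps in the 100,000,000+ bucket are left out, as in the string-based filter.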

In [32]:
for app in android_free:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

We see that many of the apps in this segment are e-readers, e-book apps, and related services, so we probably don't want to compete by creating a similar app.

Other popular applications in this category are dictionaries, game guides, and religious books turned into apps.

Perhaps a good niche for us would be digitizing some kind of reference material for something popular.

## Conclusion

In this project we analyzed different app profiles in the Apple App Store and the Google Play Store to determine what kinds of applications are most frequent and most popular.

We also came to the conclusion that some profitable app ideas could be creating a gamified workout tracker or a digitized reference book/guide book for something popular.