# Profitable App Analysis
by Serge Kamilchu

---

A brief dive into the profiles of apps offered on the **Apple App Store** and **Google Play Market** to assess the profit potential of apps based on selected descriptor columns (for example `Genre`, `Category`, `user_ratings`, `Installs`).
The **primary focus** is on free, ad-revenue apps. The analysis uses Python 3 with the `csv` module. Step-by-step notes precede the code used to manipulate these data sets.

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. This is a vast amount of data that would take a considerable amount of time and money to collect. For this project we'll have to make do with smaller samples to represent the whole:
* [Android Kaggle data-set](https://www.kaggle.com/lava18/google-play-store-apps/home) of approximately ten thousand Android Google Play apps - August 2018.
* [iOS Kaggle data-set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) of approximately seven thousand Apple Store apps - July 2017.

---

In [2]:
import csv
import os

# store both datasets as lists of lists
ios_file = list(csv.reader(open('../data_sets/AppleStore.csv')))
android_file = list(csv.reader(open('../data_sets/googleplaystore.csv')))

# multiple assignment to split the header from the body of each dataset
ios_header, ios_body = ios_file[0], ios_file[1:]
android_header, android_body = android_file[0], android_file[1:]
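
The loading step can also be wrapped in a small helper that closes the file when it is done. This is only a sketch: `load_csv` is a hypothetical name that is not used anywhere else in this notebook, and the `utf8` default is an assumption about how the Kaggle files are encoded.

```python
import csv

def load_csv(path, encoding="utf8"):
    """Read a CSV file and return (header, body) as lists."""
    # newline="" is the setting the csv module documentation recommends
    # for file objects handed to csv.reader
    with open(path, newline="", encoding=encoding) as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]
```

Under those assumptions, `ios_header, ios_body = load_csv('../data_sets/AppleStore.csv')` would replace the two-step assignment above.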


The `explore_data()` function below takes a few arguments:
>- `dataset`: the dataset as a list of lists.
>- `start`, `end`: indices bounding the slice of rows to print.
>- `rows_and_columns`: if `True`, also prints the number of rows and columns; defaults to `False`.

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # print empty row for readability
        
    if rows_and_columns:
        print('Total number of rows in : ', len(dataset))
        print('Total number of columns: ', len(dataset[0]))
        

---
## Apps at a brief glance:

In the next few cells I will try to identify what each column represents, using the headers stored in `android_header` and `ios_header`. The Kaggle sources linked above can be consulted for further detail on each platform's columns if needed.

Below is a printout of `android_header` and `ios_header`, along with the first two rows of their respective `_body` datasets.
Let's look at the headers to see which columns are of interest for our goal.

In [4]:
print(F"<<ANDROID HEADER ROW>>: ", android_header, '\n')
explore_data(android_body, 0, 2, True)

<<ANDROID HEADER ROW>>:  ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Total number of rows in :  10841
Total number of columns:  13


`android_body` contains a total of 10841 app rows. A few columns of interest in the header row are:
- `'App'`, `'Category'`, `'Reviews'`, `'Installs'`, `'Type'`, `'Price'`, and `'Genres'`.

In [5]:
print("<<IOS HEADER ROW>>: ", ios_header, '\n')
explore_data(ios_body, 0, 2, True)

<<IOS HEADER ROW>>:  ['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


Total number of rows in :  7197
Total number of columns:  17


`ios_body` contains a total of 7197 apps with a few columns of interest in its header row:
- `'track_name'`, `'currency'`, `'price'`, `'rating_count_tot'`, `'rating_count_ver'`, and `'prime_genre'`.



---
## Cleaning incompatible and wrong data

Before beginning our analysis, we need to make sure the data we analyze is accurate, otherwise the results of our analysis will be wrong. This means that we'll need to:

- Detect apps containing 'NaN' and remove them
- Detect inaccurate data and correct (or remove) it
- Detect duplicate data and remove the duplicates
- Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播
- Remove non-free apps

In [5]:
print(len(android_body))

# this loop deletes any app row that contains missing items or 'NaN'
for row in android_body:
    if ('' in row) or ('NaN' in row):
        android_body.remove(row)
            
print(len(android_body))

10841
9753
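
A caveat about the loop above: calling `remove()` on a list while iterating over it makes Python skip the element that slides into the removed slot, so two incomplete rows in a row are not both caught. A minimal sketch of a filter that builds a new list instead follows. `drop_incomplete_rows` and the toy rows are made up for illustration, and on the real dataset this approach may give a different count from the one printed above.

```python
def drop_incomplete_rows(rows, missing_markers=("", "NaN")):
    """Return a new list without rows that contain any missing-value marker."""
    return [row for row in rows
            if not any(marker in row for marker in missing_markers)]

# Toy rows (hypothetical): two incomplete rows sit next to each other.
sample = [
    ["App A", "4.1"],
    ["App B", "NaN"],
    ["App C", ""],
    ["App D", "3.9"],
]

# Removing in place while iterating skips "App C".
in_place = list(sample)
for row in in_place:
    if ("" in row) or ("NaN" in row):
        in_place.remove(row)

cleaned = drop_incomplete_rows(sample)
print(len(in_place))  # 3 ("App C" survives)
print(len(cleaned))   # 2
```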


---
## Removing duplicate entries
##### _Side note: the iOS dataset has no empty values or duplicate apps, so duplicates are removed from the Android dataset only._

If we explore the data within `android_body` long enough, we'll find that some apps have more than one entry. For instance, the application Instagram has four entries:


In [6]:
print(android_header, '\n')

for app in android_body:
    name = app[0]
    if name == 'Instagram':
        print(app)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


---
Duplicates, duplicates, duplicates. But how do we choose which ones to keep and which to discard? Looking at Instagram's duplicates, we can see that the only field that varies between rows is `'Reviews'`. Our best educated guess is that the highest `'Reviews'` count corresponds to the most recent data on the app.

In [7]:
duplicate_apps = []
unique_apps = []

for app in android_body:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps: ', len(duplicate_apps),
      '\n')
print('Examples of duplicate apps: ', duplicate_apps[:10])

Number of duplicate apps:  1170 

Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


---
In the cell below we loop through `android_body` and build a dictionary, `reviews_max`, in which:
>- each app name appears only once, as a `key`;
>- the `value` is the highest review count found among that app's entries.

In [8]:
reviews_max =  {}

for row in android_body:
    name = row[0]
    n_reviews = float(row[3])
    
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max))

8583


---
Now, in the cell below, let's use the `reviews_max` dictionary built earlier to filter duplicates out of `android_body` and build a new duplicate-free list of lists called `android_clean`. We add a full row to `android_clean` if:
>- The number of reviews of the current app matches the number of reviews stored for that app in the `reviews_max` dictionary; and
>- The name of the app is not already in `android_added`. We need this supplementary condition to account for cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we only checked `reviews_max[name] == n_reviews`, we'd still end up with duplicate entries for some apps.

In [9]:
android_clean = []
android_added = []

# loop to clean out Android duplicates 
for row in android_body:
    name = row[0]
    n_reviews = float(row[3])
    
    if (reviews_max[name] == n_reviews) and (name not in android_added):
        android_clean.append(row)
        android_added.append(name)
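
For comparison, the two passes above (building `reviews_max`, then filtering) can be folded into a single pass. This is a sketch, not the method used in this notebook: `keep_most_reviewed` is a hypothetical helper, the Box review count in the toy data is made up, and the rows come out ordered by each app's first appearance, which may differ from the order of `android_clean`.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row holding the highest review count."""
    best = {}  # name -> row; dicts keep insertion order in Python 3.7+
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict ">" means ties keep the earlier row, as in the cell above.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Toy rows with only four columns; the Box figures are invented.
sample = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Box", "BUSINESS", "4.2", "159872"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Box", "BUSINESS", "4.2", "159872"],
]
deduplicated = keep_most_reviewed(sample)
print(len(deduplicated))  # 2
```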

In [10]:
print(android_header, '\n')
explore_data(android_clean, 0, 3, True)


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Total number of rows in :  8583
Total number of columns:  13


---
## Removing Non-English Apps

If you explore the data sets enough, you'll notice the names of some of the apps suggest they are not directed toward an English-speaking audience. Below, we see a couple of app row examples from both data sets:

In [11]:
print(ios_body[814][2])
print(ios_body[6734][2])

print(android_clean[4412][0])
print(android_clean[7940][0])

搜狐新闻—新闻热点资讯掌上阅读软件
エレメンタル ファンタジー - 高精細３ＤアクションＲＰＧ
Adventure Xpress
Advanced EX for KIA


---
We're not interested in keeping these kinds of apps, so we'll remove them. One way to go about this is to remove each app whose name contains a symbol that is not commonly used in English text. English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;, etc.), and other symbols (+, *, /, etc.).

All these characters that are specific to English texts are encoded using the [ASCII standard](https://www.ibm.com/support/knowledgecenter/SSQ2R2_9.1.1/com.ibm.ent.cbl.zos.doc/PGandLR/ref/rlebcasc.html). Each ASCII character has a corresponding number between 0 and 127 associated with it, and we can take advantage of that to build a function that checks an app name and tells us whether it contains non-ASCII characters.

In [12]:
# the built-in `ord()` function returns the Unicode code point of the character passed to it (0 to 127 for ASCII)
print(ord('h'))
print(ord('😜'))
print(ord('ت'))

104
128540
1578


The function below checks whether the string we pass to it is mostly ASCII.
Since some English app names also contain emojis and symbols (™, em dash, en dash) that fall outside the ASCII range, the function only rejects a name once it contains three or more non-ASCII characters. Although obviously imperfect, it should suffice for this particular demonstration.

---

In [13]:
def is_english(string):
    '''
    Takes string as argument. -> Returns bool.
    
    Returns True if string contains 2 or fewer non-ASCII characters (code points greater than 127)
    Returns False if string contains 3 or more non-ASCII characters
    
    Example:
        >>> is_english('Docs To Go™ Free Office Suite')
        True

        >>> is_english('لعبة تقدر تربح DZ')
        False
    '''
    non_ascii_count = 0
    for letter in string:
        if ord(letter) > 127:
            non_ascii_count += 1
            # stop as soon as the third non-ASCII character is seen
            if non_ascii_count >= 3:
                return False

    return True

print(is_english('hello world!'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('لعبة تقدر تربح DZ'))
print(is_english('Instachat 😜'))

True
True
False
True
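
A compact variant of the same idea, offered only as a sketch: `is_mostly_ascii` and its `max_non_ascii` parameter are hypothetical names, not part of this notebook. It counts every character before deciding, so a name that ends with its third non-ASCII character is rejected as well.

```python
def is_mostly_ascii(name, max_non_ascii=2):
    """Return True if name has at most max_non_ascii characters outside ASCII."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_mostly_ascii('hello world!'))                   # True
print(is_mostly_ascii('Docs To Go™ Free Office Suite'))  # True
print(is_mostly_ascii('لعبة تقدر تربح DZ'))              # False
print(is_mostly_ascii('Instachat 😜'))                   # True
```

Making the threshold a parameter keeps the "three or more" rule easy to adjust if it turns out to be too strict or too lenient.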


---
## Isolating Free and English Apps

We'll use the newly created `is_english()` function to filter out non-English apps from both data sets. As mentioned in the introduction, we are also interested only in apps that are free to download and install. Our datasets contain both free and non-free apps, so we'll need to isolate the free apps for our analysis as well. 

 - The code below loops through each row of `android_clean` and `ios_body`. Any row that passes both tests, the `is_english()` check and the price comparison (price equal to `'0'`), is added to a new "final" dataset.

In [14]:
android_final = []
ios_final = []

# loop to filter Android
for row in android_clean:
    name = row[0]
    price = row[7]
    if is_english(name) and (price == '0'):
        android_final.append(row)

# loop to filter iOS
for row in ios_body:
    name = row[2]
    price = row[5]
    if is_english(name) and price == '0':
        ios_final.append(row)
        
print(android_header, '\n')
explore_data(android_final, 0, 2, True)
print('\n')
print(ios_header, '\n')
explore_data(ios_final, 0, 2, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Total number of rows in :  7909
Total number of columns:  13


['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '2

---
## Most Common Apps by Genre

As we mentioned in the introduction, our aim is to determine the kinds of free apps (on both platforms) that are likely to attract more users, because the revenue of free apps is highly influenced by the number of people using them.
We'll begin the analysis by getting a sense of the most common genres for each market. For this, we'll code a function that:

- Compiles a frequency table(`dict` type) for the `'prime_genre'` column of the `ios_final` dataset, or the `'Genres'` and `'Category'` columns of the `android_final` dataset.
- Calculates the percentage of apps per genre, sorts in descending order based on percentage, then displays it.

In [15]:

def app_genre_percentage_display(dataset, index):
    '''app_genre_percentage_display(iterable, int) -> None
    Prints each distinct value of row[index] with its percentage of occurrence in dataset,
    sorted from most to least common.
    Example:
    >>> app_genre_percentage_display(['x','y'], 0)
    x : 50.0%
    y : 50.0%
    '''
    apps_freq = {}
    for row in dataset:
        value = row[index]
        if value in apps_freq:
            apps_freq[value] += 1
        else:
            apps_freq[value] = 1

    apps_percentages = {}
    for key in apps_freq:
        percentage = apps_freq[key] / len(dataset) * 100
        apps_percentages[key] = round(percentage, 3)
    
    apps_percentages = sorted(apps_percentages.items(), key=lambda x: x[1], reverse=True)
    
    for genre, percentage in apps_percentages:
        print(f"{genre} : {percentage}%")


app_genre_percentage_display(ios_final, -5) # 'prime_genre' column

Games : 58.24%
Entertainment : 7.867%
Photo & Video : 4.975%
Education : 3.669%
Social Networking : 3.296%
Shopping : 2.581%
Utilities : 2.488%
Sports : 2.146%
Music : 2.052%
Health & Fitness : 2.021%
Productivity : 1.741%
Lifestyle : 1.586%
News : 1.337%
Travel : 1.244%
Finance : 1.119%
Weather : 0.871%
Food & Drink : 0.808%
Reference : 0.529%
Business : 0.529%
Book : 0.404%
Navigation : 0.187%
Medical : 0.187%
Catalogs : 0.124%


We can see that within the `ios_final` dataset, more than half of the apps (58.24%) are games. `'Entertainment'` apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.67% of the apps are designed for education, followed by social networking apps, which account for 3.30% of the apps in our data set.

The general impression is that the `ios_final` dataset is dominated by apps designed for fun (`'Games'`, `'Entertainment'`, `'Photo & Video'`, `'Social Networking'`, `'Sports'`, `'Music'`, etc.), while apps with practical purposes (`'Education'`, `'Shopping'`, `'Utilities'`, `'Productivity'`, `'Lifestyle'`, etc.) are rarer. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users: the demand might not match the supply.
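
As an aside, the counting step inside `app_genre_percentage_display()` can be sketched with `collections.Counter` from the standard library. `percentage_table` and the toy rows are hypothetical and only illustrate the technique. The function returns the pairs instead of printing them, which makes the result easier to reuse.

```python
from collections import Counter

def percentage_table(rows, index):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, round(count / total * 100, 3))
            for value, count in counts.most_common()]

# Toy rows (made up): three games and one music app.
toy = [["a", "Games"], ["b", "Games"], ["c", "Music"], ["d", "Games"]]
for genre, pct in percentage_table(toy, 1):
    print(f"{genre} : {pct}%")
# Games : 75.0%
# Music : 25.0%
```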

---
Let's continue by examining the `'Category'` column of the `android_final` data set.
- ( Note: the `'Genres'` column could also be used; however, it is far too granular for this analysis. )

In [16]:
app_genre_percentage_display(android_final, 1) # 'Category' column

FAMILY : 19.383%
GAME : 10.558%
TOOLS : 8.573%
PRODUCTIVITY : 3.818%
LIFESTYLE : 3.806%
FINANCE : 3.793%
BUSINESS : 3.679%
SPORTS : 3.262%
MEDICAL : 3.249%
PHOTOGRAPHY : 3.174%
PERSONALIZATION : 3.148%
COMMUNICATION : 3.072%
HEALTH_AND_FITNESS : 3.072%
NEWS_AND_MAGAZINES : 2.693%
SOCIAL : 2.655%
SHOPPING : 2.314%
TRAVEL_AND_LOCAL : 2.314%
BOOKS_AND_REFERENCE : 2.074%
VIDEO_PLAYERS : 1.871%
DATING : 1.808%
MAPS_AND_NAVIGATION : 1.467%
EDUCATION : 1.29%
FOOD_AND_DRINK : 1.252%
ENTERTAINMENT : 1.075%
AUTO_AND_VEHICLES : 0.923%
LIBRARIES_AND_DEMO : 0.885%
WEATHER : 0.834%
HOUSE_AND_HOME : 0.771%
ART_AND_DESIGN : 0.708%
COMICS : 0.657%
EVENTS : 0.645%
PARENTING : 0.62%
BEAUTY : 0.556%


Up to this point, we found that the `ios_final` dataset is dominated by apps designed for fun, while `android_final` shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea of the kinds of apps that have the most users.

---

### Most Popular Apps by Genre on the App Store

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the `android_final` dataset, we can find this information in the `'Installs'` column, but for the `ios_final` dataset this information is missing. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column. Below, we calculate the average number of user ratings per app genre on the App Store:

In [17]:

def freq_table(dataset, index):
    '''freq_table(iterable, int) -> dictionary mapping each value of row[index] to its number of occurrences
    >>> freq_table([['x','y'],['y','y']], 1)
    {'y': 2}
    '''
    table = {}
    for row in dataset:
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
            
    return table

ios_genre_freq = freq_table(ios_final, -5)

print('AVERAGE RATING COUNT PER APPLE APP GENRE:', '\n')
for genre in ios_genre_freq:
    total = 0
    len_genre = 0
    for row in ios_final:
        app_genre = row[-5]
        if app_genre == genre:
            n_ratings = float(row[6])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', round(avg_n_ratings, 2))

AVERAGE RATING COUNT PER APPLE APP GENRE: 

Productivity : 21028.41
Weather : 52279.89
Shopping : 27230.73
Reference : 79350.47
Finance : 31467.94
Music : 57326.53
Utilities : 18917.04
Travel : 28243.8
Social Networking : 71548.35
Sports : 23008.9
Health & Fitness : 23298.02
Games : 22800.84
Food & Drink : 33333.92
News : 21248.02
Book : 42816.85
Photo & Video : 28441.54
Entertainment : 14084.89
Business : 7491.12
Lifestyle : 16485.76
Education : 7003.98
Navigation : 86090.33
Medical : 612.0
Catalogs : 4004.0


---
On average, navigation apps have the highest number of user ratings, but this figure is heavily influenced by `Waze` and `Google Maps`, which together have close to half a million ratings (`rating_count_tot`):

In [18]:
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[2], ':', app[6])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Geocaching® : 12811
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
CoPilot GPS – Car Navigation & Offline Maps : 3582
Google Maps - Navigation & Transit : 154911


The same pattern applies to social networking apps, where the average number is heavily influenced by a few giants like `Facebook`, `Pinterest`, `Skype`, etc. The same applies to music apps, where a few big players like `Pandora`, `Spotify`, and `Shazam` heavily influence the average number.

Our aim is to find popular genres, but navigation, social networking or music apps might seem more popular than they really are. The average number of ratings seems to be skewed by a very small number of apps with hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold. We could get a better picture by removing these extremely popular apps from each genre and then reworking the averages, but we'll leave this level of detail for later.
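
One hedged way to see how much the giants distort the picture is to compare the mean with the median, which is far less sensitive to outliers. The sketch below uses the six Navigation rating counts from the output above. It is an illustration only and not part of the original analysis.

```python
from statistics import mean, median

# rating_count_tot values of the six Navigation apps printed above
navigation_ratings = [345046, 12811, 187, 5, 3582, 154911]

print(round(mean(navigation_ratings), 2))  # 86090.33
print(median(navigation_ratings))          # 8196.5
```

The median is roughly a tenth of the mean here, which supports the reading that `Waze` and `Google Maps` carry the genre average.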

---
Reference apps have 79,350 user ratings on average, but it's actually the `Bible` and `Dictionary.com` apps that skew the average upward:

In [19]:
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[2], ':', app[6])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
Merriam-Webster Dictionary : 16849
Google Translate : 26786
Night Sky : 12122
WWDC : 762
Jishokun-Japanese English Dictionary & Translator : 0
VPN Express : 14
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Real Bike Traffic Rider Virtual Reality Glasses : 8


In [20]:
app_genre_percentage_display(android_final, 5)

1,000,000+ : 17.6%
100,000+ : 12.72%
10,000,000+ : 11.822%
10,000+ : 10.962%
1,000+ : 7.65%
5,000,000+ : 7.637%
500,000+ : 6.195%
50,000+ : 5.184%
5,000+ : 4.64%
100+ : 4.362%
50,000,000+ : 2.579%
500+ : 2.529%
100,000,000+ : 2.39%
10+ : 1.669%
50+ : 1.037%
500,000,000+ : 0.303%
5+ : 0.278%
1,000,000,000+ : 0.253%
1+ : 0.164%
0+ : 0.025%


---
One problem with this data is that it is not precise. For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes: we only want to get an idea of which app genres attract the most users.

We're going to leave the numbers as they are, which means that we'll treat an app with 100,000+ installs as having 100,000 installs, an app with 1,000,000+ installs as having 1,000,000 installs, and so on.

To perform computations, however, we'll need to convert each install number to a float. This means removing the commas and the plus characters; otherwise the conversion will fail and raise an error. We'll do this directly in the loop below, where we also compute the average number of installs for each genre (category).
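
The conversion on its own can be isolated in a small helper. This is a sketch only: `parse_installs` is a hypothetical name that does not appear in this notebook's cells, which do the same replacement inline.

```python
def parse_installs(text):
    """Convert a string such as '1,000,000+' to a float."""
    return float(text.replace(",", "").replace("+", ""))

print(parse_installs("1,000,000+"))  # 1000000.0
print(parse_installs("0+"))          # 0.0

# Without the clean-up, float() rejects the raw string.
try:
    float("10,000+")
except ValueError as error:
    print("conversion failed:", error)
```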

In [21]:
android_category_freq = freq_table(android_final, 1)
print('AVERAGE INSTALLS PER ANDROID APP CATEGORY:', '\n')
for category in android_category_freq:
    total = 0
    len_category = 0
    
    for row in android_final:
        app_category = row[1]
        
        if app_category == category:
            installs = row[5]
            installs = installs.replace('+','').replace(',', '')
            total += float(installs)
            len_category += 1
    category_avg = total / len_category
    print(category, ':', round(category_avg, 2))

AVERAGE INSTALLS PER ANDROID APP CATEGORY: 

ART_AND_DESIGN : 2021716.07
AUTO_AND_VEHICLES : 727121.92
BEAUTY : 611961.36
BOOKS_AND_REFERENCE : 10156777.56
BUSINESS : 2394383.08
COMICS : 863675.0
COMMUNICATION : 45419276.46
DATING : 985340.18
EDUCATION : 1850490.2
ENTERTAINMENT : 11640705.88
EVENTS : 312761.18
FINANCE : 1517130.87
FOOD_AND_DRINK : 2137555.82
HEALTH_AND_FITNESS : 4705903.95
HOUSE_AND_HOME : 1581509.85
LIBRARIES_AND_DEMO : 746988.57
LIFESTYLE : 1652537.0
GAME : 16091751.5
FAMILY : 4040034.88
MEDICAL : 146741.84
SOCIAL : 26132614.55
SHOPPING : 7652095.9
PHOTOGRAPHY : 18550857.81
SPORTS : 4244782.33
TRAVEL_AND_LOCAL : 15817915.36
TOOLS : 11944673.57
PERSONALIZATION : 6140956.19
PRODUCTIVITY : 19177522.8
PARENTING : 634204.29
WEATHER : 5457378.79
VIDEO_PLAYERS : 26565731.08
NEWS_AND_MAGAZINES : 11117853.0
MAPS_AND_NAVIGATION : 4293557.5


On average, `'COMMUNICATION'` apps have the most installs: 45,419,276. This number is heavily skewed upward by a few apps that have over one billion installs (`WhatsApp`, `Facebook Messenger`, `Skype`, `Google Chrome`, `Gmail`, and `Hangouts`), and a few others with over 100 million or 500 million `'Installs'`:

In [22]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                     or app[5] == '500,000,000+'
                                     or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

---
If we removed all the communication apps that have over 100 million installs, the average would drop roughly tenfold:

In [23]:
under_100_mil = []

for row in android_final:
    n_installs = row[5]
    n_installs = n_installs.replace(',', '').replace('+','')
    if (row[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_mil.append(float(n_installs))
        
sum(under_100_mil) / len(under_100_mil)

4337426.759259259

We see the same pattern for the `'VIDEO_PLAYERS'` category, which is the runner-up with 26,565,731 installs. 

---

## Conclusion/Observation:

The video player market is dominated by apps like `Youtube`, `Google Play Movies & TV`, or `MX Player`. The pattern is repeated for social apps (where we have giants like `Facebook`, `Instagram`, `Google+`, etc.), photography apps (`Google Photos` and other popular photo editors), and productivity apps (`Microsoft Word`, `Dropbox`, `Google Calendar`, `Evernote`, etc.).




It seems clear that the types of apps mentioned above attract a large number of users. The main concern (and my personal assessment for now) is that these app __genres__ might seem more popular than they really are because a few giants nearly monopolize them. If I were to develop an application for either platform, I would take a more nuanced approach and look for a genre that may have large competitors, but not to the extent seen in the previous examples.

---

### Going Forward:

This was a brief practice analysis.
I will be adding more cells below in the future as I tinker further with the `Genre` column and others such as `User Ratings`.

In [30]:
for app in ios_final:
    if app[-5] == 'Reference' and float(app[5]) > 0 :
        print(app)
#         print(app[2], ':', app[6])