# Profitable app profiles (AppStore and GooglePlay)


## About

In this project, we analyse the AppStore and GooglePlay datasets to understand which app profiles could be profitable in each marketplace.

## Goal:

Through this project, we aim to understand the *free apps*\** listed in the Apple AppStore and Google Play markets and, based on actual usage statistics, determine which app profile would be best suited for us to develop as a free app so that it maximises in-app ad revenue.

\** *We only consider apps with English names.*

In [1]:
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))

from csv import reader
open_appstore = open('Datasets/AppleStore.csv', encoding="utf8")
read_appstore = reader(open_appstore)
iosapps = list(read_appstore)
iosapps_header = iosapps[0]
iosapps = iosapps[1:]
open_playstore = open('Datasets/googleplaystore.csv', encoding="utf8")
read_playstore = reader(open_playstore)
androidapps = list(read_playstore)
androidapps_header = androidapps[0]
androidapps = androidapps[1:]

print(iosapps_header)
print('\n')
explore_data(iosapps, 0, 3, rows_and_columns = True)
print('\n')
print(androidapps_header)
print('\n')
explore_data(androidapps, 0, 3, rows_and_columns = True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows:  7197
Number of columns:  16


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design'

## About the dataset

1. Looking at the Apple AppStore dataset:
   * There are **7197** apps, with information about each app spread across **16** columns
   * The following columns might be of interest in our analysis:
      * `prime_genre` - Genre of the app
      * `rating_count_tot` - Total number of user ratings for the app across all versions
      * `price` - To determine whether it is a free or a paid app
   * For full documentation about the dataset, please follow this [link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)


2. Looking at the Google PlayStore dataset:
   * There are **10841** apps, with information about each app spread across **13** columns
   * The following columns might be of interest in our analysis:
      * `Category` - Higher level category of the app
      * `Genres` - Genre of the app
      * `Installs` - Number of users who have installed the app
      * `Type` - To determine whether it is a free or a paid app
   * For full documentation about the dataset, please follow this [link](https://www.kaggle.com/lava18/google-play-store-apps/notebooks)
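
The columns of interest listed above are addressed by position later in the analysis. As a minimal sketch, the positions can be looked up from the header rows rather than counted by hand. The helper name `column_index` is hypothetical and not part of the original notebook; the header lists are copied from the output above.

```python
# Header rows copied from the notebook output above.
iosapps_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
                  'rating_count_tot', 'rating_count_ver', 'user_rating',
                  'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                  'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
androidapps_header = ['App', 'Category', 'Rating', 'Reviews', 'Size',
                      'Installs', 'Type', 'Price', 'Content Rating', 'Genres',
                      'Last Updated', 'Current Ver', 'Android Ver']


def column_index(header, column_name):
    """Return the position of column_name in header (hypothetical helper)."""
    return header.index(column_name)


for name in ('prime_genre', 'rating_count_tot', 'price'):
    print('iOS', name, '->', column_index(iosapps_header, name))
for name in ('Category', 'Genres', 'Installs', 'Type'):
    print('Android', name, '->', column_index(androidapps_header, name))
```

`list.index` raises a `ValueError` for an unknown column name, which makes a misspelled column easy to spot.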

## Data cleaning

### Part 1: Missing data

In [2]:
for app in androidapps:
    if app[0] == "Life Made WI-Fi Touchscreen Photo Frame":
        print(androidapps_header)
        print('\n')
        print(app)        

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


If we look at the data for this app in the PlayStore dataset, the row has only 12 values instead of 13: the `Category` value is missing, so every later value has shifted one position to the left. For example, the 2nd column `Category` holds **1.9**, which really belongs in the `Rating` column.

Checking the Google Play Store on the web (screenshot below), we see that the category for this app is **Lifestyle**. 

![PlayStore](./Assets/PlayStore_App_Missing_Category.png)

So now let's fix this in our dataset by using the `insert()` method on the list at index position 1.

In [3]:
for app in androidapps:
    if app[0] == "Life Made WI-Fi Touchscreen Photo Frame":
        app.insert(1, "Lifestyle")                
        print(app)

['Life Made WI-Fi Touchscreen Photo Frame', 'Lifestyle', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
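
The shifted row was found here by app name. As a rough sketch of a more general check, rows whose length differs from the header length can be flagged automatically. The function name `find_misaligned_rows` is an assumption for illustration; the two sample rows are copied from this notebook's output.

```python
def find_misaligned_rows(header, rows):
    """Return (position, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
          'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
          'Android Ver']
sample_rows = [
    # A well-formed row with 13 values.
    ['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device',
     '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018',
     'Varies with device', 'Varies with device'],
    # The row with the missing Category value (12 values).
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+',
     'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'],
]

for position, row in find_misaligned_rows(header, sample_rows):
    print(position, len(row), row[0])
```

Note that re-running the `insert()` cell above would add "Lifestyle" a second time, so a length check like this is also a cheap way to confirm the fix was applied exactly once.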


### Part 2: Duplicate data

Before performing detailed data analysis, it is always important to check if there are any duplicates in our data.

**PlayStore Dataset**:

In [4]:
duplicate_apps = []
unique_apps = []
for app in androidapps:
    app_name = app[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)

print('Number of duplicate apps: ',len(duplicate_apps))
print('Number of unique apps: ',len(unique_apps))
print('\n')
print(androidapps_header)
print('\n')
for app in androidapps:
    app_name = app[0]
    if app_name in duplicate_apps and app_name == "Instagram":
        print(app)

Number of duplicate apps:  1181
Number of unique apps:  9660


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


We see **1181** duplicate rows in the dataset. As a sample, we show all the rows for one such app, **"Instagram"**.

To remove the duplicates, we keep only one row for each duplicated app and delete the rest.

Looking at the `Reviews` column for the **Instagram** duplicate rows, the values vary between rows. We assume that the higher the number of reviews, the more relevant the data is for our analysis.

So for each app we will keep the row with the maximum number of reviews and remove the remaining rows.

In [5]:
## Build a dictionary: key = unique app name, value = max reviews count ##
reviews_max = {}
for app in androidapps:
    app_name = app[0]
    n_reviews = int(app[3])
    if app_name in reviews_max and reviews_max[app_name] < n_reviews:
        reviews_max[app_name] = n_reviews
    elif app_name not in reviews_max:
        reviews_max[app_name] = n_reviews

print('Unique apps in max reviews dictionary : ',len(reviews_max))
for key in reviews_max:
    if key == "Instagram":
        print('----> e.g. App:',key, ', Max reviews:', reviews_max[key])

## Use the dictionary above to build a clean list of lists of android app dataset ##
android_clean  = []
already_added = []
for app in androidapps:
    app_name = app[0]
    n_reviews = int(app[3])    
    if app_name not in already_added and n_reviews == reviews_max[app_name]:
        android_clean.append(app)
        already_added.append(app_name)
print('\n')
print('Unique apps:',len(android_clean))
for app in android_clean:
    if app[0] == "Instagram":
        print('----> e.g. App:',app)

Unique apps in max reviews dictionary :  9660
----> e.g. App: Instagram , Max reviews: 66577446


Unique apps: 9660
----> e.g. App: ['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


To remove the duplicates, we took the following steps in the above code block:
   1. Looped through the `androidapps` dataset and built a dictionary (`reviews_max`) with the app name as the key and the maximum review count for that app as the value
   2. Checked that the `reviews_max` dictionary has one entry per unique app by printing its length (9660, matching the unique count found earlier) and, as an example, printed the max reviews for **Instagram**, which we know had duplicates
   3. Used the dictionary to build the final clean list of Android apps, `android_clean`, which we carry forward
      * To build the clean list, we check whether each row's review count matches the max reviews in our `reviews_max` dictionary
      * To make sure we add each app only once (some duplicates share the same maximum review count), we use a supplementary list called `already_added`
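
The steps above could also be sketched as a single pass that stores the best row per app name directly. This is an alternative under stated assumptions, not the notebook's method: the function name `dedupe_by_max_reviews` is hypothetical, the sample rows are the Instagram rows from above trimmed to four columns, and the result is ordered by first appearance of each name, which may differ from the order of `android_clean`.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = int(row[reviews_index])
        # Strict '>' keeps the earliest row when several share the maximum.
        if name not in best or n_reviews > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

for row in dedupe_by_max_reviews(sample_rows):
    print(row)
```

Because the lookup is a dictionary rather than a list such as `already_added`, the membership test does not slow down as the dataset grows.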

**AppStore Dataset**:

In [6]:
duplicate_apps = []
unique_apps = []
for app in iosapps:
    app_name = app[1]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)

print('Number of duplicate apps: ',len(duplicate_apps))
print('Number of unique apps: ',len(unique_apps))
print('\n')
print(iosapps_header)
print('\n')
for app in iosapps:
    app_name = app[1]
    if app_name in duplicate_apps:
        print(app)

Number of duplicate apps:  2
Number of unique apps:  7195


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']
['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']


If we explore the duplicate rows in the AppStore dataset, we see **2** duplicated app names. However, looking at the rows, the `id` column is unique for each row, and the `sup_devices.num` and `ipadSc_urls.num` columns also differ. This suggests that these rows are listed as separate apps in the AppStore, perhaps as dedicated iPad and iPhone versions.

So we are going to keep these rows as they are for the rest of our analysis.

### Part 3: English only apps

If we remember our project goal, we are looking to develop an app targeted towards an English-speaking audience.

In [7]:
print('iOS Apps')
print(iosapps[813])
print(iosapps[6731])
print('\n')
print('Android Apps')
print(android_clean[4412])
print(android_clean[7940])

iOS Apps
['445375097', '爱奇艺PPS -《欢乐颂2》电视剧热播', '224617472', 'USD', '0.0', '14844', '0', '4.0', '0.0', '6.3.3', '17+', 'Entertainment', '38', '5', '3', '1']
['1120021683', '【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜', '77551616', 'USD', '0.0', '0', '0', '0.0', '0.0', '1.3', '12+', 'Games', '38', '0', '1', '1']


Android Apps
['中国語 AQリスニング', 'FAMILY', 'NaN', '21', '17M', '5,000+', 'Free', '0', 'Everyone', 'Education', 'June 22, 2016', '2.4.0', '4.0 and up']
['لعبة تقدر تربح DZ', 'FAMILY', '4.2', '238', '6.8M', '10,000+', 'Free', '0', 'Everyone', 'Education', 'November 18, 2016', '6.0.0.0', '4.1 and up']


But a brief exploration of both the AppStore and PlayStore datasets reveals that some apps are targeted at a non-English-speaking audience.

Hence, to keep our analysis relevant to the project goal, we are going to remove these apps.

In [8]:
def is_english_app(input_string):
    non_eng_el_cnt = 0
    for el in input_string:
        if ord(el) > 127:
            non_eng_el_cnt += 1
    if non_eng_el_cnt > 3:
        return False
    else:
        return True

print(is_english_app('Instagram'))
print(is_english_app('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english_app('Docs To Go™ Free Office Suite'))
print(is_english_app('Instachat 😜'))

True
False
True
True


To determine whether a character is a standard English character, we use its ASCII value. Characters with values up to **127** include the letters of the English alphabet, the digits 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

To get this value for a character we can use the built-in function `ord()`.

However, to ensure that we do not discard apps that contain special characters such as ™ or 😜 in their names (these have code points greater than 127), we establish a rule that up to 3 characters with a value greater than 127 are allowed.

For this purpose, we have defined the function `is_english_app()` in the above code block, which returns `True` if a string contains 3 or fewer non-standard characters and `False` otherwise.

To check that the function works as intended, we tested it on a few sample app names.
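
As a compact sketch of the same rule, the count of non-ASCII characters can be computed with `sum()` and the threshold exposed as a parameter. The name `is_english_name` and the `max_non_ascii` parameter are assumptions for illustration; the default of 3 mirrors the rule described above.

```python
def is_english_name(name, max_non_ascii=3):
    """Return True if name has at most max_non_ascii characters above 127."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii


for sample in ('Instagram',
               '爱奇艺PPS -《欢乐颂2》电视剧热播',
               'Docs To Go™ Free Office Suite',
               'Instachat 😜'):
    print(sample, '->', is_english_name(sample))
```

Making the threshold a parameter allows a stricter or looser filter to be tried without editing the function. The rule remains a heuristic: a non-English name written entirely in ASCII characters would still pass.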

Let's now use this new function on our AppStore (`iosapps`) and PlayStore (`android_clean`) datasets.

In [9]:
ios_english = []
ios_non_english = []
android_english = []
android_non_english = []
for app in iosapps:
    app_name = app[1]
    if is_english_app(app_name):
        ios_english.append(app)
    else:
        ios_non_english.append(app)

for app in android_clean:
    app_name = app[0]
    if is_english_app(app_name):
        android_english.append(app)
    else:
        android_non_english.append(app)        

print('iOS English Apps:',len(ios_english))
print('iOS Non-English Apps:',len(ios_non_english))
print('\n')
print('Android English Apps:',len(android_english))
print('Android Non-English Apps:',len(android_non_english))

iOS English Apps: 6183
iOS Non-English Apps: 1014


Android English Apps: 9615
Android Non-English Apps: 45


Running our rule for detecting English-language apps on both the AppStore and PlayStore datasets, we now have separate lists for English and non-English apps.

   * For iOS, we separate the dataset into `ios_english` and `ios_non_english`.

   * For Android, we separate the dataset into `android_english` and `android_non_english`.

From the above statistics, the iOS store seems to have many more apps targeted at a non-English audience than Android (1014 versus 45).

### Part 4: Isolating free apps

If we remember our project goal, we are looking to develop a free app and maximise our ad revenue, so our profiling only needs to consider free apps.

In this final part of the data cleaning exercise, we are going to separate the free and paid apps in both the AppStore and PlayStore datasets.

In [10]:
ios_final = []
ios_paid = []
for app in ios_english:
    app_price = float(app[4]) ## column "price" on the AppleStore.csv - Convert string to float
    if app_price == 0.0:
        ios_final.append(app)
    else:
        ios_paid.append(app)

print('Final number of iOS apps (Free):',len(ios_final))
print('Number of paid iOS apps excluded:',len(ios_paid))

android_final = []
android_paid = []
for app in android_english:
    app_price = app[6] ## column "Type" on googleplaystore.csv - holds a string such as "Free", so no conversion is needed
    if app_price == "Free":
        android_final.append(app)
    else:
        android_paid.append(app)

print('\n')        
print('Final number of Android apps (Free):',len(android_final))
print('Number of paid Android apps excluded:',len(android_paid))     

Final number of iOS apps (Free): 3222
Number of paid iOS apps excluded: 2961


Final number of Android apps (Free): 8864
Number of paid Android apps excluded: 751


## Data Analysis

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets.

Looking at the columns on both AppStore and PlayStore datasets:
   1. In AppStore, column `prime_genre` (**Index: -5**) shows the App genre
   2. In PlayStore, however we have two columns which can be used for categorising an app:
      * `Category` (**Index: 1**)
      * `Genres` (**Index: -4**)

Let's use these columns on the datasets to see what patterns we can uncover.

Before we look at the patterns by genre across the AppStore and PlayStore, we have to generate frequency tables to aid our analysis. We do that in the code block below.

For this, we have built a generic function `freq_table` which takes the dataset (a list of lists) and the index of the column on which to build the frequency table. The output of this function is a dictionary that shows the percentage of apps for each value of the column at the index we pass.

To sort and display this frequency table by the highest percentage of apps, we use a function called `display_table`, which first calls `freq_table` and then arranges the data as (value, key) tuples so that the built-in function `sorted()` can sort them in reverse order.

In [11]:
## To generate the frequency table to understand proportions
def freq_table(dataset, index):
    freq_tbl = {}
    for app in dataset:
        app_freq_key = app[index]
        if app_freq_key in freq_tbl:
            freq_tbl[app_freq_key] += 1
        else:
            freq_tbl[app_freq_key] = 1

    total_apps = len(dataset)
    for key in freq_tbl:
        tot_apps_per_key = freq_tbl[key]
        freq_tbl[key] =  round((tot_apps_per_key / total_apps) * 100,2)
    return freq_tbl

## To display the results based on highest proportion
### Note: We cannot use sorted function effectively on our frequency table directly as key is string and 
### we want to sort based on value
def display_table(dataset, index):
    freq_tbl = freq_table(dataset, index)    
    display_tbl = []    
    for key in freq_tbl:
        value_key_tpl = (freq_tbl[key], key)
        display_tbl.append(value_key_tpl)     
    for entry in sorted(display_tbl, reverse = True):
        print(entry[1],':',entry[0])
print('iOS Apps by Genre')
display_table(ios_final, -5)
print('\n')
print('Android App by Category')
display_table(android_final, 1)
print('\n')
print('Android App by Genre')
display_table(android_final, -4)

iOS Apps by Genre
Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


Android App by Category
FAMILY : 18.9
GAME : 9.72
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
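
The frequency table built above could also be sketched with `collections.Counter` from the standard library. The names `freq_table_counter` and `sorted_table` are hypothetical, and the four sample rows are made up for illustration only. One difference from `display_table`: ties here keep their first-seen order instead of being ordered by key.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at index."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {key: round(count / total * 100, 2) for key, count in counts.items()}


def sorted_table(table):
    """Return (key, percentage) pairs, highest percentage first."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


# Made-up sample rows: [app name, genre].
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

for key, pct in sorted_table(freq_table_counter(sample, 1)):
    print(key, ':', pct)
```

Sorting with a `key` function avoids the intermediate list of (value, key) tuples while giving the same highest-first ordering.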


### Apps categorisation

**AppStore**: 
Looking at the data displayed above, we can see the following patterns:
   1. The AppStore is dominated by the *Games* genre, which accounts for over half of the free English apps at **58.16%**
   2. If we categorise the genres as Fun and Practical, we see that the AppStore is dominated by apps for fun rather than practical purposes:
      * Fun (Games, Entertainment, Photo & Video, Social Networking, Sports, Music) - occupies ~**78%**
      * Practical - occupies ~**22%**
   3. So an app with a practical purpose might be beneficial, as the fun category may have reached saturation point.
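
The fun versus practical split above can be reproduced by summing the percentages of the six genres classed as fun. This is a small worked sketch; the values are copied from the "iOS Apps by Genre" output, and the variable names are chosen for illustration.

```python
# Percentages copied from the "iOS Apps by Genre" output above.
fun_genre_pct = {
    'Games': 58.16,
    'Entertainment': 7.88,
    'Photo & Video': 4.97,
    'Social Networking': 3.29,
    'Sports': 2.14,
    'Music': 2.05,
}

# Everything not classed as fun is treated as practical.
fun_share = round(sum(fun_genre_pct.values()), 2)
practical_share = round(100 - fun_share, 2)

print('Fun :', fun_share)
print('Practical :', practical_share)
```

The split depends on which genres are classed as fun, which is a judgment call; moving a genre such as Sports to the practical side shifts both shares by its percentage.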
   
**PlayStore**:
Here it is a little more complicated because we have two columns (*Genres* and *Category*) that could be used for categorisation. If we look at the *Genres* column, the data is too granular to compare at a high level with the AppStore.

So, to keep our subsequent analysis at the same level as the AppStore, we are going to use the *Category* column and ignore the *Genres* column in the PlayStore.

Looking at the data displayed above for PlayStore based on *Category*:

   1. At first glance, the PlayStore seems to have a much larger proportion of apps for practical purposes than for fun
   2. Apart from the "FAMILY" category, we can separate the categories as either fun or practical
   
Let's now look at the "FAMILY" category more closely, as it constitutes ~**19%** of our dataset.   

In [12]:
family_cat_genres = {}
tot_family_apps = 0
for app in android_final:
    app_category = app[1]
    app_genre = app[-4]
    if app_category == "FAMILY":
        tot_family_apps += 1
        if app_genre in family_cat_genres:
            family_cat_genres[app_genre] += 1
        else:
            family_cat_genres[app_genre] = 1
for key in family_cat_genres:
    family_cat_genres[key] = round((family_cat_genres[key] / tot_family_apps) * 100,2)
display_tbl = []
for key in family_cat_genres:
    value_key_tpl = (family_cat_genres[key], key)
    display_tbl.append(value_key_tpl)     
for entry in sorted(display_tbl, reverse = True):
    print(entry[1],':',entry[0])
print('\n')
print('Checking the fun vs practical within FAMILY category')
practical_sum = 0
fun_sum = 0
for entry in display_tbl:
    genre = entry[1]
    value = entry[0]
    if 'EDUCATION' in genre.upper():
        practical_sum += value
    else:
        fun_sum += value
print('Fun :',round(fun_sum,2))        
print('Practical :',round(practical_sum,2))

Entertainment : 27.34
Education : 22.81
Simulation : 10.39
Casual : 8.0
Puzzle : 4.66
Role Playing : 4.3
Strategy : 3.88
Educational;Education : 2.09
Educational : 1.97
Education;Education : 1.43
Casual;Pretend Play : 1.25
Racing;Action & Adventure : 0.9
Puzzle;Brain Games : 0.9
Entertainment;Music & Video : 0.72
Casual;Action & Adventure : 0.72
Casual;Brain Games : 0.66
Arcade;Action & Adventure : 0.66
Educational;Pretend Play : 0.48
Action;Action & Adventure : 0.48
Simulation;Action & Adventure : 0.42
Board;Brain Games : 0.42
Entertainment;Brain Games : 0.36
Educational;Brain Games : 0.36
Casual;Creativity : 0.36
Role Playing;Pretend Play : 0.24
Education;Pretend Play : 0.24
Role Playing;Action & Adventure : 0.18
Puzzle;Action & Adventure : 0.18
Entertainment;Action & Adventure : 0.18
Educational;Creativity : 0.18
Educational;Action & Adventure : 0.18
Education;Music & Video : 0.18
Education;Action & Adventure : 0.18
Adventure;Action & Adventure : 0.18
Sports;Action & Adventure : 0.1

Continuing our analysis on the PlayStore dataset's **FAMILY** category (~19% of total apps):
   1. We see that ~69% of the FAMILY category is attributed to fun
   2. The remaining ~31% of the FAMILY category is attributed to practical purposes

Putting this all together across categories, we still see that in the PlayStore:
   1. Only ~32% of the apps are related to fun
   2. A large part (~68%) of the apps are related to practical purposes
   
This is different from the AppStore analysis in the previous section.

Before we recommend an app profile to build, we will also analyse the user base of apps.
   * For the PlayStore, we have a column called `Installs` which shows the number of users who have installed each app
   * For the AppStore, however, we do not have such a column. Even though it is not ideal, we will use the `rating_count_tot` column as a proxy, as it shows the number of users who have rated the app across versions.
   
Let's use these columns for our analysis now.   

### App users by genre

**AppStore**:

We will look at each app genre and its average number of users, in the following order:

   1. We reuse our `freq_table` function to build a dictionary of genres
   2. We use this genre dictionary in combination with the iOS dataset (as a nested loop) to determine
   > average users per genre = total of all app users per genre / number of apps per genre
   3. Finally, we sort the result in reverse order so that the genres with the highest average number of users appear at the top

In [13]:
ios_genre_dict = freq_table(ios_final, -5)
display_tbl = []
for genre in ios_genre_dict:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre == genre_app:
            n_user_ratings = float(app[5])
            total += n_user_ratings
            len_genre += 1
    avg_ratings = round(total / len_genre,2)
    value_key_tpl = (avg_ratings, genre)
    display_tbl.append(value_key_tpl)
for entry in sorted(display_tbl, reverse = True):
    print(entry[1],':',entry[0])

Navigation : 86090.33
Reference : 74942.11
Social Networking : 71548.35
Music : 57326.53
Weather : 52279.89
Book : 39758.5
Food & Drink : 33333.92
Finance : 31467.94
Photo & Video : 28441.54
Travel : 28243.8
Shopping : 26919.69
Health & Fitness : 23298.02
Sports : 23008.9
Games : 22788.67
News : 21248.02
Productivity : 21028.41
Utilities : 18684.46
Lifestyle : 16485.76
Entertainment : 14029.83
Business : 7491.12
Education : 7003.98
Catalogs : 4004.0
Medical : 612.0
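
The nested loop above rescans the whole dataset once per genre. As a sketch of an alternative, the same averages can be accumulated in a single pass with two dictionaries. The function name `average_by_group` is hypothetical, and the three sample rows use made-up numbers purely to show the arithmetic.

```python
from collections import defaultdict


def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the value column}, rounded to 2 decimals."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: round(totals[group] / counts[group], 2) for group in totals}


# Made-up sample rows: [genre, rating count].
sample = [['Navigation', '100'], ['Navigation', '300'], ['Medical', '50']]

averages = average_by_group(sample, 0, 1)
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', avg)
```

With the notebook's data this would be called as `average_by_group(ios_final, -5, 5)`, assuming `ios_final` is built as in the cells above.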


Even though the Games genre has the largest proportion of apps, the above data shows that the Navigation and Reference genres have more users per app on average.

We will explore further and see the patterns.

To aid this, we have written a function `top_apps_by_genre` below that takes several parameters:
   * `dataset` - Dataset passed as list of lists as this can be used for both AppStore and PlayStore
   * `genre` - Genre or category value that we pass to get the top apps for that genre
   * `genre_index` - Index of the genre or category column in the dataset
   * `appname_index` - Index of app name column in the dataset
   * `users_index` - Index of the column holding number of users in the dataset
   * `top_n` - How many apps we want to display - Default is **5** for top 5 apps
   * `pct` - Whether we want to display the number of users of the app as a percentage of the total users across all apps in the genre - Default is False

In [14]:
def top_apps_by_genre(dataset, genre, genre_index, appname_index, users_index, top_n = 5, pct = False):
    genre_apps = []
    total_genre_users = 0
    for app in dataset:        
        app_genre = app[genre_index]
        app_name = app[appname_index]
        app_users = int((app[users_index].replace(',','')).replace('+',''))        
        if app_genre == genre:
            total_genre_users += app_users
            app_tupple = (app_users, app_name)
            genre_apps.append(app_tupple)
    top = 0
    print('*'*5,'Top',top_n,'apps for',genre,'*'*5)    
    for app in sorted(genre_apps, reverse = True):
        top += 1
        if top > top_n:
            print('\n')
            break
        app_name = app[1]
        app_users = app[0]
        if pct:
            if total_genre_users != 0:
                app_user_pct = round((app_users / total_genre_users) * 100,2)
            else:
                app_user_pct = 0
            print(app_name,':',app_users, '(' + str(app_user_pct)+'%)')
        else:
            print(app_name,':',app_users)
        
top_apps_by_genre(dataset = ios_final, genre="Navigation", genre_index=-5, appname_index=1, users_index=5, pct=True)
top_apps_by_genre(dataset = ios_final, genre="Reference", genre_index=-5, appname_index=1, users_index=5, pct=True)
top_apps_by_genre(dataset = ios_final, genre="Social Networking", genre_index=-5, appname_index=1, users_index=5, pct=True)
top_apps_by_genre(dataset = ios_final, genre="Music", genre_index=-5, appname_index=1, users_index=5, pct=True)
top_apps_by_genre(dataset = ios_final, genre="Weather", genre_index=-5, appname_index=1, users_index=5, pct=True)
top_apps_by_genre(dataset = ios_final, genre="Book", genre_index=-5, appname_index=1, users_index=5, pct=True)
top_apps_by_genre(dataset = ios_final, genre="Finance", genre_index=-5, appname_index=1, users_index=5, pct=True)

***** Top 5 apps for Navigation *****
Waze - GPS Navigation, Maps & Real-time Traffic : 345046 (66.8%)
Google Maps - Navigation & Transit : 154911 (29.99%)
Geocaching® : 12811 (2.48%)
CoPilot GPS – Car Navigation & Offline Maps : 3582 (0.69%)
ImmobilienScout24: Real Estate Search in Germany : 187 (0.04%)


***** Top 5 apps for Reference *****
Bible : 985920 (73.09%)
Dictionary.com Dictionary & Thesaurus : 200047 (14.83%)
Dictionary.com Dictionary & Thesaurus for iPad : 54175 (4.02%)
Google Translate : 26786 (1.99%)
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418 (1.37%)


***** Top 5 apps for Social Networking *****
Facebook : 2974676 (39.22%)
Pinterest : 1061624 (14.0%)
Skype for iPhone : 373519 (4.93%)
Messenger : 351466 (4.63%)
Tumblr : 334293 (4.41%)


***** Top 5 apps for Music *****
Pandora - Music & Radio : 1126879 (29.78%)
Spotify Music : 878563 (23.22%)
Shazam - Discover music, artists, videos & lyrics : 402925 (10.65%)
iHeartRadio – Free Music & Radio Stations : 29

Looking at the top 5 apps for the few genres that command the highest user numbers, we see the following patterns:

   1. In the Navigation, Social Networking and Music genres, 2-3 apps consistently dominate the user base and hence skew the overall averages.
   2. The Reference genre is interesting in that Bible and Dictionary.com hold a near monopoly, much as Google does in Navigation (Waze and Google Maps are both owned by Google).
   3. The Weather genre shows promise in that it is a practical rather than a fun genre, and fun apps are near saturation point in the AppStore. However, considering our primary goal of a free app that maximises ad revenue, weather might not be a suitable genre, as users may not stay within the app long enough.
   4. Other practical genres such as "Food & Drink" could be considered, but these require deeper partnerships at the supply-chain level; a marketplace model is perhaps the preferable option here.

Even though the "Book" genre is predominantly covered by Amazon, we do see some promise in that apps from less well-known companies, such as "*Color Therapy Adult Coloring Book for Adults*", "*HOOKED - Chat Stories*" and "*OverDrive – Library eBooks and Audiobooks*", also have a good share of the user base (combined ~35%).

This definitely shows promise: if we bring out a standalone app in the "Book" genre, perhaps for a famous best-selling book with embedded reference features such as a dictionary and quotes, there is potential for maximising ad revenue by keeping the user within our app for longer. 

**Note of caution**: We need to check the rights and partnerships for the book and, more importantly, that it is not already available as an eBook or audiobook through Amazon's platform.

"Finance" genre is interesting here:
   1. Apps in this genre make up only **1.12%** of the AppStore
   2. But the average number of users per app in this genre is quite high
   3. Looking at the top 5 apps in this genre, we see that it is not monopolised by any big player and the user base is spread fairly evenly
   4. A higher proportion of the apps in this genre fall into the banking/payment services sub-category, but there is also room in the "Personal Finance" sub-category
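
The first two points combine two measures: a genre's share of all apps and its average user count per app. A minimal sketch of how both could be computed together is shown below. The function name `genre_share_and_avg` and the toy rows are hypothetical and use made-up numbers, not values from the dataset.

```python
def genre_share_and_avg(dataset, genre, genre_index, users_index):
    """Return (genre's share of all apps in %, average users per app in the genre)."""
    counts = [float(row[users_index]) for row in dataset if row[genre_index] == genre]
    if not counts:
        # Genre not present: avoid dividing by zero.
        return 0.0, 0.0
    share = round(100 * len(counts) / len(dataset), 2)
    avg = round(sum(counts) / len(counts), 2)
    return share, avg

# Toy rows (made-up numbers): [name, genre, rating count]
toy = [
    ['A', 'Finance', '300'],
    ['B', 'Games', '100'],
    ['C', 'Games', '50'],
    ['D', 'Finance', '100'],
]
print(genre_share_and_avg(toy, 'Finance', genre_index=1, users_index=2))  # (50.0, 200.0)
```

A genre with a low share but a high average, as described for "Finance" above, would show up as a small first value paired with a large second value.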

Although this genre requires deeper domain expertise, if we could partner with a wealth management company or a financial adviser, we would have the potential to add significant value for the user while maximising the ad revenue related to the financial advice in our app.

> From all of the genres in the AppStore, two genres definitely emerge as potential candidates:
>
>> 1. "Book" genre, perhaps with a popular/best-selling published book not yet on Amazon's platform
>> 2. "Finance" genre, in the Personal Finance sub-category, to maximise ad revenue
>>
>> Note: Other areas in Finance (such as Payments) need more domain infrastructure, licensing, and expertise

Both of these definitely fit our theme of apps for practical use.

Let's now explore the Google PlayStore.

**Note**: In the PlayStore dataset, the `Installs` column holds open-ended ranges such as 1,000+ and 10,000+. For our analysis we treat these as exact numbers (by removing "," and "+"). Because we apply the same technique across the whole dataset, it should not distort the comparison between categories.
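
As a small illustration of that conversion, the helper below turns an install bucket into a number. The name `parse_installs` is hypothetical and is not part of the notebook; it only isolates the string clean-up described in the note.

```python
def parse_installs(raw):
    """Convert an install bucket such as '1,000+' to a float."""
    # Remove the thousands separators and the trailing '+', then convert.
    return float(raw.replace(',', '').replace('+', ''))

print(parse_installs('1,000+'))    # 1000.0
print(parse_installs('10,000+'))   # 10000.0
```

Note that the result is the lower bound of each bucket, so every app is undercounted to some degree.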

In [15]:
# Average number of installs per category in the PlayStore dataset.
android_genre_dict = freq_table(android_final, 1)
display_tbl = []
for category in android_genre_dict:
    total = 0
    len_category = 0
    for app in android_final:
        app_category = app[1]
        if category == app_category:
            # Installs are strings such as '1,000+'; strip ',' and '+' before converting.
            n_installs = float(app[5].replace(',', '').replace('+', ''))
            total += n_installs
            len_category += 1
    avg_installs = round(total / len_category, 2)
    # Store the average first so that sorting the tuples orders by average.
    display_tbl.append((avg_installs, category))
for entry in sorted(display_tbl, reverse = True):
    print(entry[1],':',entry[0])

COMMUNICATION : 38456119.17
VIDEO_PLAYERS : 24727872.45
SOCIAL : 23253652.13
PHOTOGRAPHY : 17840110.4
PRODUCTIVITY : 16787331.34
GAME : 15588015.6
TRAVEL_AND_LOCAL : 13984077.71
ENTERTAINMENT : 11640705.88
TOOLS : 10801391.3
NEWS_AND_MAGAZINES : 9549178.47
BOOKS_AND_REFERENCE : 8767811.89
SHOPPING : 7036877.31
PERSONALIZATION : 5201482.61
WEATHER : 5074486.2
HEALTH_AND_FITNESS : 4188821.99
MAPS_AND_NAVIGATION : 4056941.77
FAMILY : 3697848.17
SPORTS : 3638640.14
ART_AND_DESIGN : 1986335.09
FOOD_AND_DRINK : 1924897.74
EDUCATION : 1833495.15
BUSINESS : 1712290.15
LIFESTYLE : 1437816.27
FINANCE : 1387692.48
HOUSE_AND_HOME : 1331540.56
DATING : 854028.83
COMICS : 817657.27
AUTO_AND_VEHICLES : 647317.82
LIBRARIES_AND_DEMO : 638503.73
PARENTING : 542603.62
BEAUTY : 513151.89
EVENTS : 253542.22
MEDICAL : 120550.62
Lifestyle : 1000.0


Although the PlayStore categories differ slightly from those of the AppStore, there are some common themes and categories. The PlayStore user base shows good promise for both fun and practical-purpose apps.
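
The per-category averages listed above could also be computed in a single pass over the dataset rather than one pass per category. The sketch below is one way to do that, assuming the same row layout (category and installs held as strings at known indexes). The function name `avg_installs_by_category` and the toy rows are hypothetical.

```python
from collections import defaultdict

def avg_installs_by_category(dataset, category_index, installs_index):
    """Return {category: average installs}, visiting each row exactly once."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for app in dataset:
        category = app[category_index]
        # Installs look like '1,000+'; strip ',' and '+' before converting.
        installs = float(app[installs_index].replace(',', '').replace('+', ''))
        totals[category] += installs
        counts[category] += 1
    return {c: round(totals[c] / counts[c], 2) for c in totals}

# Toy rows (made-up numbers): [name, category, installs]
toy = [
    ['App A', 'TOOLS', '1,000+'],
    ['App B', 'TOOLS', '3,000+'],
    ['App C', 'FINANCE', '500+'],
]
averages = avg_installs_by_category(toy, category_index=1, installs_index=2)
for category, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(category, ':', avg)
```

On a dataset of this size either approach works; the single pass mainly avoids rescanning every row for every category.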

In [16]:
top_apps_by_genre(dataset = android_final, genre="COMMUNICATION", genre_index=1, appname_index=0, users_index=5, pct=True)
top_apps_by_genre(dataset = android_final, genre="PRODUCTIVITY", genre_index=1, appname_index=0, users_index=5, pct=True)
top_apps_by_genre(dataset = android_final, genre="FINANCE", genre_index=1, appname_index=0, users_index=5, pct=True)
top_apps_by_genre(dataset = android_final, genre="BOOKS_AND_REFERENCE", genre_index=1, appname_index=0, users_index=5, pct=True)

***** Top 5 apps for COMMUNICATION *****
WhatsApp Messenger : 1000000000 (9.06%)
Skype - free IM & video calls : 1000000000 (9.06%)
Messenger – Text and Video Chat for Free : 1000000000 (9.06%)
Hangouts : 1000000000 (9.06%)
Google Chrome: Fast & Secure : 1000000000 (9.06%)


***** Top 5 apps for PRODUCTIVITY *****
Google Drive : 1000000000 (17.27%)
Microsoft Word : 500000000 (8.63%)
Google Calendar : 500000000 (8.63%)
Dropbox : 500000000 (8.63%)
Cloud Print : 500000000 (8.63%)


***** Top 5 apps for FINANCE *****
Google Pay : 100000000 (21.97%)
PayPal : 50000000 (10.99%)
İşCep : 10000000 (2.2%)
Wells Fargo Mobile : 10000000 (2.2%)
Mobile Bancomer : 10000000 (2.2%)


***** Top 5 apps for BOOKS_AND_REFERENCE *****
Google Play Books : 1000000000 (60.03%)
Wattpad 📖 Free Books : 100000000 (6.0%)
Bible : 100000000 (6.0%)
Audiobooks from Audible : 100000000 (6.0%)
Amazon Kindle : 100000000 (6.0%)




## Conclusion:

Our primary goal is to launch the app on both the AppStore and the PlayStore, albeit on different timelines.

From that perspective, we looked at the top 5 apps by user base in the PlayStore categories that correspond to the genres we shortlisted in the AppStore (Books and Finance).

Here again, the "Books" category is dominated by big players such as Google and Amazon with their marketplace apps.

This makes it especially hard to launch a famous/best-selling book as an app without running into licensing/rights issues with these big players.

Our recommendation at this stage is:

> **Develop a "Personal Finance" app in both AppStore and PlayStore**


Note: 
* We should not consider any sub-categories that require a banking license, deep domain expertise, and infrastructure (such as Payments) at this stage
* To maximise the value we create for our user base, and in turn our ad revenue, we should look at a partnership with a financial adviser.