# Profitable apps within the Google Play store and the Apple iOS app store

- I aim to identify which types of apps attract the most users
    - Mobile app analytics allows for development of apps which will attract high user engagement, increasing app revenue

- There are approximately 2 million iOS apps available on the App Store and 2.1 million Android apps on the Google Play store (Sept. 2018)
    - Android holds about 53.2% of the smartphone market, while iOS holds about 43%
    
- To understand which apps attract users in both markets I will analyze two freely available data sets:
    - Google Play Store Apps
        - This data set contains details for more than 10,000 Google Play mobile applications
        - Web scraping tools were used to extract data from the Google Play store
        - You can download this data set [here](https://www.kaggle.com/lava18/google-play-store-apps)
        
    - Mobile App Statistics (Apple iOS app store)
        - This data set contains details for more than 7,000 Apple iOS mobile applications
        - R and Linux web scraping tools were used to extract data from the iTunes Search API at the Apple Inc website
        - You can download this data set [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

-----

I will begin by opening each data set, followed by data exploration, cleaning, and analysis.

In [1]:
from csv import reader

## Apple iOS app store dataset
# assumes the file is UTF-8 encoded; the with-block closes it after reading
with open('AppleStore.csv', encoding='utf8') as open_apple:
    apple_list = list(reader(open_apple))
apple_list_header = apple_list[0]
apple_list = apple_list[1:] # store the data set without the header

## Google Play store dataset
with open('googleplaystore.csv', encoding='utf8') as open_google:
    google_list = list(reader(open_google))
google_list_header = google_list[0]
google_list = google_list[1:] # store the data set without the header

The `explore_data()` function breaks up the rows of each data set to print them in a more readable way. This function can also be used to print the number of rows and columns of each data set.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
print('Apple iOS app store:')
print('\n')
print(apple_list_header)
print('\n')
explore_data(apple_list,0,3, True)

Apple iOS app store:


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


We see that the Apple iOS app store data set contains 16 columns and 7197 apps.

I expect to predominantly use the track_name, price, rating_count_tot, user_rating, cont_rating, and prime_genre columns in this analysis. Details about each column can be found in the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

In [4]:
print('Google Play Store:')
print('\n')
print(google_list_header)
print('\n')
explore_data(google_list,0,3, True)

Google Play Store:


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


We see that the Google Play store data set contains 13 columns and 10841 apps.

I expect to predominantly use the App, Category, Rating, Reviews, Type, Price, Content Rating, and Genres columns in this analysis. Details about each column can be found in the data set [documentation](https://www.kaggle.com/lava18/google-play-store-apps).

# Deleting Wrong Data

- Both data sets have dedicated discussion sections
- The Google Play store data set discussions have identified an error for row 10472 

Below I will print this row and compare it to the header and another row that is correct.

In [5]:
print(google_list[10472]) # incorrect row
print('\n')
print(google_list_header)  # header
print('\n')
print(google_list[0]) # correct row as example

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Row 10472 contains the app "Life Made WI-Fi Touchscreen Photo Frame". It has only 12 values where the header has 13: the Category value is missing, so every later value is shifted one column to the left and the rating is listed as 19. The maximum rating for a Google Play app is 5, so this must be incorrect, and I will remove this row. I will print the number of rows (using `len()`) of the Google Play store data set before and after deleting this row, to confirm that only one row was removed.

In [6]:
print(len(google_list))
del(google_list[10472]) # only run this code once 
print(len(google_list))
print('\n')
print(google_list[10472])

10841
10840


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
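
A length check offers a quick way to spot rows like this one without relying on the discussion section. The following is a minimal sketch: `find_malformed_rows` is a hypothetical helper name, and the sample header and rows are shortened for illustration rather than taken from the full data set.

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose length differs from the header's."""
    return [index for index, row in enumerate(rows) if len(row) != len(header)]


# Shortened, illustrative sample (three columns instead of thirteen)
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],  # Category is missing
    ['Coloring book moana', 'ART_AND_DESIGN', '3.9'],
]

malformed = find_malformed_rows(sample_header, sample_rows)
print(malformed)
```

Applied to the real data, the same call would take `google_list_header` and `google_list` as arguments.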


# Removing Duplicate Entries

The Google Play store data set discussions also indicate that some apps have more than one entry. I will check this using Twitter and Instagram, two apps that push out updates regularly and were first added to the store approximately 10 years ago.

In [7]:
for app in google_list:
    name = app[0]
    if name == 'Twitter':
        print(app)
print('\n')
for app in google_list:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11667403', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 'August 6, 2018', 'Varies with device', 'Varies with device']
['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11667403', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 'August 6, 2018', 'Varies with device', 'Varies with device']
['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11657972', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 'July 30, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varie

As hypothesized, some apps have more than one entry within the Google Play store data set. To begin removing these duplicates, I count them and list some examples below. Although the Apple iOS app store data set discussions did not mention any duplicates in the data, I will include it in the analysis below just in case.

In [8]:
def duplicate_search(app_list):
    duplicate_apps = []
    unique_apps = []

    for each_line in app_list:
        name = each_line[0] # first column of each row
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)
    return duplicate_apps # one entry per repeated occurrence

In [9]:
google_duplicates = duplicate_search(google_list)

print('Number of duplicate apps in Google Play store:', len(google_duplicates))
print('\n')
print('Examples of duplicate apps in Google Play store:', google_duplicates[:15])

Number of duplicate apps in Google Play store: 1181


Examples of duplicate apps in Google Play store: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


In total, there are 1,181 cases where an app occurs more than once within the Google Play store.

In [10]:
apple_duplicates = duplicate_search(apple_list)
print('Number of duplicate apps in Apple iOS store:', len(apple_duplicates))
print('\n')
print('Examples of duplicate apps in Apple iOS store:', apple_duplicates[:15])

Number of duplicate apps in Apple iOS store: 0


Examples of duplicate apps in Apple iOS store: []


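No duplicate entries were found in the Apple iOS app store data set. Note that `duplicate_search()` compares the first column of each row, which is the app `id` in this data set rather than `track_name`.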
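
To compare a column other than the first, such as `track_name` at index 1 in the Apple data set, the search could take the column index as a parameter. This is a minimal sketch under that assumption: `duplicate_search_by_column` is a hypothetical name, the sample rows are made up, and it has not been run against the real data, so no result is claimed for it.

```python
def duplicate_search_by_column(app_list, name_index):
    """Return every repeated value found in the given column."""
    seen = set()       # set membership tests are fast even for large lists
    duplicates = []
    for row in app_list:
        name = row[name_index]
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates


# Made-up rows in the shape [id, track_name]
sample_rows = [
    ['1', 'Alpha'],
    ['2', 'Beta'],
    ['3', 'Alpha'],
]

print(duplicate_search_by_column(sample_rows, 0))  # compare ids
print(duplicate_search_by_column(sample_rows, 1))  # compare names
```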

To avoid counting some apps more than once during analysis, I will remove the duplicate entries found in the Google Play store data set, keeping only one entry per app. Looking at the Twitter app entries printed above, the entries differ in the fourth column (Reviews, index 3) and/or the eleventh column (Last Updated, index 10). As the number of reviews differs more often, I will use it to decide which duplicate to keep and which to remove. The entry with the highest number of reviews will be kept, as it is likely the most recent entry, and the more reviews an app has, the more reliable its rating is.


To facilitate this I will create a dictionary `reviews_max` where each key is a unique app name, and the value is the highest number of reviews of that app. I will then use this dictionary to create a new data set, containing only one entry per app.

In [11]:
reviews_max = {}
for each_line in google_list:
    name = each_line[0]
    n_reviews = float(each_line[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

I found previously that there are 1,181 app duplications, so the length of `reviews_max` should equal the difference between the length of the Google Play app store data set, and 1,181.

In [12]:
print('Expected length of Google Play store list:', len(google_list) - 1181)
print('Actual length of Google Play store list:', len(reviews_max))

Expected length of Google Play store list: 9659
Actual length of Google Play store list: 9659


I will use `reviews_max` to remove the Google Play store data set duplicates, keeping only the entries with the highest number of reviews.

To do so I will initialize two empty lists, `google_clean` and `already_added`, loop through the Google Play data set, and for every entry isolate the name of the app and the number of reviews.

Each entry will be added to the google_clean list, and the corresponding app name to the already_added list if:
- The number of reviews of the current app matches the number of reviews of that app within the reviews_max dictionary and
- The name of the app is not already in the already_added list
    - This supplementary condition accounts for cases where the highest number of reviews of a duplicate app is the same for more than one entry

In [13]:
google_clean = []
already_added = []
for each_entry in google_list:
    name = each_entry[0]
    n_reviews = float(each_entry[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        google_clean.append(each_entry)
        already_added.append(name)

The google_clean list is expected to now have 9659 entries. 

In [14]:
explore_data(google_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
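
The two-step approach (a dictionary of maximum review counts, then a filtering loop) can also be collapsed into a single pass that stores the best row per app name. This is only a sketch of that alternative: `keep_max_reviews` is a hypothetical helper, and the sample rows are shortened to four columns. One difference to be aware of is ordering: the result follows the first appearance of each name, which may not match the order produced by the filtering loop.

```python
def keep_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means that ties keep the earliest row
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Shortened sample rows: [App, Category, Rating, Reviews]
sample_rows = [
    ['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11667403'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11657972'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
]

deduplicated = keep_max_reviews(sample_rows)
for row in deduplicated:
    print(row[0], ':', row[3])
```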


# Removing Non-English Apps

My analysis will only include apps directed toward an English-speaking audience. I will narrow my scope of analysis by removing each app whose name contains symbols not commonly used in English text.
- English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;, etc.), and other symbols (+, *, /, etc.)
- These characters are encoded using the ASCII standard
- Each ASCII character has a corresponding number between 0 and 127
- I will use these characters to check if an app name contains non-ASCII characters
- To minimize the impact of data loss, I will only remove an app if its name has more than three non-ASCII characters
    - This avoids removal of an app name that may contain a specialty symbol like an emoji or a copyright mark

In [15]:
def english_app_check(string):
    count = 0
    for each_character in string:
        if ord(each_character) > 127: # 0-127 ascii
            count += 1
    if count > 3:
        return False
    else:
        return True

print(english_app_check('Instagram'))
print(english_app_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_app_check('Docs To Go™ Free Office Suite'))
print(english_app_check('Instachat 😜'))

True
False
True
True
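
To see why a threshold of three is forgiving enough for names with a trademark sign or an emoji, it can help to print the actual count of non-ASCII characters for each test name. The sketch below uses a hypothetical helper, `count_non_ascii`, that mirrors the counting step of `english_app_check()`.

```python
def count_non_ascii(string):
    """Count the characters whose code point falls outside the ASCII range 0-127."""
    return sum(1 for each_character in string if ord(each_character) > 127)


test_names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

for name in test_names:
    print(name, ':', count_non_ascii(name))
```

The first three names stay at or below one non-ASCII character, while the last one is well above the threshold.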


In [16]:
google_clean_english = []
for each_line in google_clean:
    name = each_line[0] # name column
    if english_app_check(name):
        google_clean_english.append(each_line)

apple_list_english = []
for each_line in apple_list:
    name = each_line[1] # name column
    if english_app_check(name):
        apple_list_english.append(each_line)

In [17]:
explore_data(google_clean_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


In [18]:
explore_data(apple_list_english, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16


# Isolating Free Apps

My analysis will only include apps that are free to download and install. In this case the main revenue source of a future app would be in-app ads. Both data sets contain both free and non-free apps, and so I will isolate only the free apps for analysis.

In [19]:
google_clean_english_free = []
for each_line in google_clean_english:
    price = each_line[7] # price column
    if price == '0':
        google_clean_english_free.append(each_line)
        
apple_list_english_free = []
for each_line in apple_list_english:
    price = each_line[4] # price column
    if price == '0.0':
        apple_list_english_free.append(each_line)

In [20]:
explore_data(google_clean_english_free, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13


In [21]:
explore_data(apple_list_english_free, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 3222
Number of columns: 16


In [22]:
google_final = google_clean_english_free
apple_final = apple_list_english_free
print(len(google_final))
print(len(apple_final))

8864
3222


I'm left with 8864 Android apps and 3222 iOS apps for analysis.
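
Comparing price strings against `'0'` and `'0.0'` works for the rows shown above, but it depends on each store's exact formatting. A numeric comparison is a more forgiving alternative. The sketch below assumes that paid prices may carry a leading currency symbol such as `$`, which the excerpts above do not show; `is_free` is a hypothetical helper name.

```python
def is_free(price):
    """Return True when a price string such as '0', '0.0' or '$0.00' equals zero."""
    cleaned = price.strip().lstrip('$')  # assumption: '$' is the only symbol used
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Prices that cannot be parsed are treated as not free
        return False


for price in ['0', '0.0', '$4.99', '']:
    print(repr(price), ':', is_free(price))
```

With such a helper, both filtering loops could share the same condition, for example `if is_free(each_line[7]):` for the Google Play rows.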

# Most Common Apps by Genre

I aim to identify app genres that are likely to attract more users, since the main revenue source will be in-app ads. I am looking particularly for apps that would be successful in both the Google Play store and the App Store.

To begin I will identify some of the most common app genres for each market by building a frequency table for the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set.

First I will review each dataset:

In [23]:
print('Final Google Data Set:')
print(google_list_header)
print('\n')
explore_data(google_final,0,3,True)
print('\n')
print('Final Apple Data Set:')
print(apple_list_header)
print('\n')
explore_data(apple_final,0,3,True)

Final Google Data Set:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13


Final Apple Data Set:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadS

To analyze the genres I will build one function that generates frequency tables showing percentages, and another function that displays the percentages in descending order.

In [24]:
def freq_table(dataset, index):
    table = {}
    total = 0
    for each_line in dataset:
        total += 1
        each_entry = each_line[index]
        if each_entry in table:
            table[each_entry] += 1
        else:
            table[each_entry] = 1
    table_percent = {}
    for key in table:
        percent = ((table[key] / total) * 100)
        table_percent[key] = percent
    return table_percent

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
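
For comparison, the standard library's `collections.Counter` can produce the same kind of sorted percentage table with less code. This is a minimal sketch rather than a replacement for the functions above: `freq_table_percent` is a hypothetical name, and the sample rows are made up.

```python
from collections import Counter


def freq_table_percent(dataset, index):
    """Return (value, percentage) pairs for one column, most common first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]


# Made-up rows in the shape [name, genre]
sample_rows = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Education'],
    ['App D', 'Games'],
]

for genre, percent in freq_table_percent(sample_rows, 1):
    print(genre, ':', percent)
```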

I will input the following columns (by index) into my functions:
- Google: Category = index 1 and Genres = index 9
- Apple: prime_genre = index 11

In [25]:
display_table(google_final, 1) # prints the table; there is no return value to store

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In [26]:
display_table(google_final, 9) # prints the table; there is no return value to store

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

In [27]:
display_table(apple_final, 11) # prints the table; there is no return value to store

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Apple - prime_genre:
- Top 5 most common applications:
    - Games @ 58.16%
    - Entertainment @ 7.88%
    - Photo & Video @ 4.97%
    - Education @ 3.66%
    - Social Networking @ 3.29%
- Media consumption through games, entertainment, photos, videos, and social networking appears to be a common theme within the Apple iOS app store

Google - category:
- Top 5 most common applications:
    - Family @ 18.91%
    - Game @ 9.72%
    - Tools @ 8.46%
    - Business @ 4.59%
    - Lifestyle @ 3.90%

Google - genres:
- Top 5 most common applications:
    - Tools @ 8.45%
    - Entertainment @ 6.07%
    - Education @ 5.35%
    - Business @ 4.59%
    - Productivity @ 3.89%
- Practicality through tools, business, education, lifestyle, and productivity appears to be a common theme within the Google Play app store

- The two platforms share commonalities in the gaming, entertainment, and education categories

The difference between the Google Play store Genres and Category columns is not clear, but the Genres column appears to have more categories. I am more interested in general trends, so I will focus on the Category column.

# Most Popular Apps by Genre

I want to know what kind of apps attract the most users. I will do this by calculating, for each app genre, the average number of installs (Google Play data set) or the average number of user ratings (Apple App Store data set). The App Store data set has no installs column, so the number of user ratings serves as a proxy.

- Apple App Store data set: rating_count_tot column
- Google Play data set: Installs column

In [28]:
# Apple - average number of user ratings (rating_count_tot, index 5) per genre
apple_genres = freq_table(apple_final, 11)
for genre in apple_genres:
    total = 0      # sum of user ratings for this genre
    len_genre = 0  # number of apps in this genre
    for each_line in apple_final:
        genre_app = each_line[11]
        if genre_app == genre:
            users = float(each_line[5])
            total += users
            len_genre += 1
    avg = total / len_genre
    print(genre, ':', avg)

Education : 7003.983050847458
News : 21248.023255813954
Reference : 74942.11111111111
Catalogs : 4004.0
Games : 22788.6696905016
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Social Networking : 71548.34905660378
Navigation : 86090.33333333333
Productivity : 21028.410714285714
Business : 7491.117647058823
Medical : 612.0
Finance : 31467.944444444445
Health & Fitness : 23298.015384615384
Photo & Video : 28441.54375
Food & Drink : 33333.92307692308
Book : 39758.5
Music : 57326.530303030304
Shopping : 26919.690476190477
Entertainment : 14029.830708661417
Travel : 28243.8
Sports : 23008.898550724636
Weather : 52279.892857142855


Apple - average number of user ratings:
- Top 5 most-used app genres:
    - Navigation @ 86090.33
    - Reference @ 74942.11
    - Social Networking @ 71548.35
    - Music @ 57326.53
    - Weather @ 52279.89
- Navigation, social networking, and music may be skewed by highly popular apps like Google Maps, Waze, Facebook, Twitter, Google Play, Spotify, etc.
- This leaves Reference and Weather as potential app profiles of interest
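
The per-genre averages can also be computed in a single pass over the data instead of one pass per genre. The sketch below shows that idea with running totals and counts; `average_by_group` is a hypothetical helper, and the sample rows are made up.

```python
from collections import defaultdict


def average_by_group(dataset, group_index, value_index):
    """Average a numeric column for each distinct value of a grouping column."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up rows in the shape [genre, rating_count]
sample_rows = [
    ['Weather', '100'],
    ['Weather', '300'],
    ['Reference', '50'],
]

for genre, avg in average_by_group(sample_rows, 0, 1).items():
    print(genre, ':', avg)
```

For the Apple data this would be called as `average_by_group(apple_final, 11, 5)`.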

In [29]:
for app in apple_final:
    if app[11] == 'Navigation': # prime_genre column
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


Reference apps have 74,942 user ratings on average, but this average is skewed by the Bible and Dictionary.com apps:

In [30]:
for app in apple_final:
    if app[11] == 'Reference': # prime_genre column
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


Regardless, this genre could be a possible niche for developing a new app outside of a saturated market. For example, an app could focus on a particularly popular topic or book of interest and include associated readings, reflections, etc. as well as an integrated dictionary.

This is preferable to a weather app, as weather app users may not spend enough time in-app to generate ad revenue. Also, getting reliable live weather data may require connection to non-free APIs.

In [31]:
# Google - installs column 5
google_genres = freq_table(google_final, 1)
for genre in google_genres:
    total = 0
    len_genre = 0
    for each_line in google_final:
        genre_app = each_line[1]
        if genre_app == genre:
            users = each_line[5]
            users = users.replace(',','')
            users = users.replace('+','')
            total += float(users)
            len_genre += 1
    avg = total / len_genre
    print(genre, ':', avg)

SPORTS : 3638640.1428571427
MEDICAL : 120550.61980830671
HOUSE_AND_HOME : 1331540.5616438356
FAMILY : 3695641.8198090694
TOOLS : 10801391.298666667
ART_AND_DESIGN : 1986335.0877192982
EDUCATION : 1833495.145631068
FINANCE : 1387692.475609756
BEAUTY : 513151.88679245283
MAPS_AND_NAVIGATION : 4056941.7741935486
EVENTS : 253542.22222222222
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
ENTERTAINMENT : 11640705.88235294
AUTO_AND_VEHICLES : 647317.8170731707
HEALTH_AND_FITNESS : 4188821.9853479853
VIDEO_PLAYERS : 24727872.452830188
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
PRODUCTIVITY : 16787331.344927534
SOCIAL : 23253652.127118643
PERSONALIZATION : 5201482.6122448975
TRAVEL_AND_LOCAL : 13984077.710144928
SHOPPING : 7036877.311557789
BUSINESS : 1712290.1474201474
GAME : 15588015.603248259
PHOTOGRAPHY : 17840110.40229885
FOOD_AND_DRINK : 1924897.7363636363
PARENTING : 542603.6206896552
BOOKS_AND_REFERENCE : 8767811.894736841
WEATHER : 5074486.19718

On average, communication apps have the most installs: 38,456,119. This number is heavily skewed upward by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 and 500 million installs.

Google - average number of installs:
- Top 5 high use app genres:
    - Communication @ 38456119.17
    - Video players @ 24727872.45
    - Social @ 23253652.13
    - Photography @ 17840110.40
    - Game @ 15588015.60
    
- Communication, Video players, and Social may be skewed by highly popular apps like WhatsApp, Facebook, Netflix, Gmail, etc.
- This leaves Photography and Game as potential app profiles of interest; however, Game is a highly saturated market.

In [32]:
for app in google_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

If we removed all the communication apps that have 100 million or more installs, the average would be reduced roughly tenfold:

In [34]:
under_100_m = []

for app in google_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

3603485.3884615386
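
The same calculation can be wrapped in a function so that other genres and cut-offs are easy to try. This is a minimal sketch rather than part of the original analysis: `to_number`, `mean_installs_below`, and the `demo` rows are hypothetical, and the default indexes assume the category is at position 1 and the installs string at position 5, as in `google_final`.

```python
def to_number(raw):
    """Convert an install string such as '1,000,000+' to a float."""
    return float(raw.replace(',', '').replace('+', ''))

def mean_installs_below(rows, category, cap, category_idx=1, installs_idx=5):
    """Average installs for one category, ignoring apps at or above cap.

    Returns None when no app in the category falls below the cap.
    """
    values = [to_number(row[installs_idx]) for row in rows
              if row[category_idx] == category
              and to_number(row[installs_idx]) < cap]
    return sum(values) / len(values) if values else None

# Hypothetical rows padded so that the installs string sits at index 5
demo = [
    ['Big chat', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Mid chat', 'COMMUNICATION', '', '', '', '10,000,000+'],
    ['Small chat', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['Reader', 'BOOKS_AND_REFERENCE', '', '', '', '5,000,000+'],
]

for cap in (100_000_000, 10_000_000):
    print(cap, ':', mean_installs_below(demo, 'COMMUNICATION', cap))
```

With the real data, a call such as `mean_installs_below(google_final, 'COMMUNICATION', 100000000)` should reproduce the figure above, assuming `google_final` has the column layout described.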

Because genres like Communication and Social are dominated by a few large, highly popular applications, they are difficult spaces in which to find success.

As in the Apple App Store, the Books and Reference genre is fairly popular, with an average of 8,767,811 installs.

In [35]:
for app in google_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

This genre includes software for processing and reading ebooks, libraries, dictionaries, tutorials, etc.

In [36]:
for app in google_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


Unlike the game, social, communication, etc. genres, there are only a few very popular book/reference apps. To get some app ideas, I will explore the apps that are somewhere in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads):

In [37]:
for app in google_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

The apps in this group are mostly focused on ebook reading, library collections, and dictionaries, as well as Quran readers and study guides.
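
A rough way to back up this reading of the list is to count how often a few keywords appear in the app names. The sketch below uses a handful of names copied from the output above; the keyword list and the `keyword_counts` helper are assumptions chosen for illustration, not part of the original analysis.

```python
from collections import Counter

# App names copied from the mid-popularity output above (a subset)
names = [
    'Wikipedia', 'Cool Reader', 'FBReader: Favorite Book Reader',
    'Moon+ Reader', 'Aldiko Book Reader', 'Dictionary - WordWeb',
    'English-Myanmar Dictionary', 'Golden Dictionary (EN-AR)',
    'Al-Quran (Free)', 'Al Quran Indonesia', 'Al Quran Al karim',
]

# Assumed keywords; each name is counted at most once per keyword
keywords = ('reader', 'dictionary', 'quran')

def keyword_counts(app_names, terms):
    """Count the names that contain each term, ignoring case."""
    counts = Counter()
    for name in app_names:
        lowered = name.lower()
        for term in terms:
            if term in lowered:
                counts[term] += 1
    return counts

counts = keyword_counts(names, keywords)
print(counts.most_common())
```

Substring matching like this is crude (it would miss differently spelled or translated titles), so the counts are only a quick sanity check on the themes described above.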

Once again, an app focused on a particularly popular topic or book of interest, with associated readings and reflections as well as an integrated dictionary, could be successful within this category in both app stores.