# Project Overview
This is a guided project at the end of module 1. The purpose of this project is to apply the skills learnt in this module, including:
- The basics of programming in Python (arithmetical operations, variables, common data types, etc.)
- Lists and for loops
- Conditional statements
- Dictionaries and frequency tables
- Functions
- Jupyter Notebook

### Project scenario
For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.

We only build apps that are free to download and install, and our main source of revenue is in-app ads. This means our revenue for any given app depends mostly on the number of people who use it: the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users.

In [1]:
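from csv import reader

# Load the iOS (App Store) dataset as a list of rows; row 0 is the header.
# The encoding is an assumption: the app names include non-ASCII text, so UTF-8 is stated explicitly.
with open('Apple iOS Store/AppleStore.csv', encoding='utf8') as opened_file:
    ios_apps_data = list(reader(opened_file))

# Load the Google Play dataset the same way.
with open('Google Play Store/googleplaystore.csv', encoding='utf8') as opened_file:
    gp_apps_data = list(reader(opened_file))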

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

## iOS Data Summary

The iOS dataset has 7197 rows (excluding the header) and 17 columns.
\
Here is a [link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) to the source data.
\
The columns are:

| 0   | 1      | 2          | 3               | 4             | 5            | 6                                    | 7                                        | 8                                           | 9                                               | 10                  | 11             | 12             | 13                            | 14                                       | 15                            | 16                                 |
|:----|:-------|:-----------|:----------------|:--------------|:-------------|:-------------------------------------|:-----------------------------------------|:--------------------------------------------|:------------------------------------------------|:--------------------|:---------------|:---------------|:------------------------------|:-----------------------------------------|:------------------------------|:-----------------------------------|
|     | id     | track_name | size_bytes      | currency      | price        | rating_count_tot                     | rating_count_ver                         | user_rating                                 | user_rating_ver                                 | ver                 | cont_rating    | prime_genre    | sup_devices.num               | ipadSc_urls.num                          | lang.num                      | vpp_lic                            |
| Row | App ID | App Name   | Size (in Bytes) | Currency Type | Price amount | User Rating counts (for all version) | User Rating counts (for current version) | Average User Rating value (for all version) | Average User Rating value (for current version) | Latest version code | Content Rating | Primary Genre  | Number of supporting devices  | Number of screenshots showed for display | Number of supported languages | Vpp Device Based Licensing Enabled |



In [2]:
print('iOS Data Summary')
print('Number of rows excl header: ', len(ios_apps_data[1:]))
print('Number of columns: ', len(ios_apps_data[0]))
print('\n')
explore_data(ios_apps_data, 0, 6)

iOS Data Summary
Number of rows excl header:  7197
Number of columns:  17


['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['5', '282935706', 'Bible', '92774400'

## Google Play Store Data Summary
The Google Play Store dataset has 10841 rows (excluding the header) and 13 columns.
\
Here is a [link](https://www.kaggle.com/lava18/google-play-store-apps) to the source data.
\
The columns are:


| 0        | 1                           | 2                              | 3                                  | 4               | 5                                             | 6            | 7                | 8                                                                | 9                                   | 10                                               | 11                                                 | 12                           |
|:---------|:----------------------------|:-------------------------------|:-----------------------------------|:----------------|:----------------------------------------------|:-------------|:-----------------|:-----------------------------------------------------------------|:-------------------------------------|:-------------------------------------------------|:---------------------------------------------------|:-----------------------------|
| App      | Category                    | Rating                         | Reviews                            | Size            | Installs                                      | Type         | Price            | Content Rating                                                   | Genres                               | Last Updated                                     | Current Ver                                        | Android Ver                  |
| App name | Category the App belongs to | Overall user rating of the app | Number of user reviews for the app | Size of the app | Number of user downloads/installs for the app | Paid or Free | Price of the app | Age group the app is targeted at - Children / Mature 21+ / Adult | An app can belong to multiple genres | Date when the app was last updated on Play Store | Current version of the app available on Play Store | Min required Android version |


In [3]:
print('Google Play Data Summary')
print('Number of rows excl header: ', len(gp_apps_data[1:]))
print('Number of columns: ', len(gp_apps_data[0]))
print('\n')
explore_data(gp_apps_data, 0, 6)

Google Play Data Summary
Number of rows excl header:  10841
Number of columns:  13


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art 

## Data Cleaning
---

### Inaccurate data in GP dataset
Per the notes accompanying the data, one row is malformed: it is missing its `Category` value, which shifts every later field one column to the left. We will drop this row.
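A malformed row of this kind can also be located without relying on the reported position, by comparing each row's length with the header's. A minimal sketch, assuming a small made-up dataset; the `sample` list and the `find_malformed_rows` helper are illustrative and not part of the notebook:

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset[1:], start=1)
            if len(row) != expected]

# Hypothetical three-column sample; the second data row is missing a field.
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
]

malformed = find_malformed_rows(sample)
print(malformed)
```

The indices returned count the header as row 0, which matches how the cells below index into `gp_apps_data`.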

In [4]:
print('problem row reported as row 10472. Missing "Category"')
print('\n')
print(gp_apps_data[10473])

print('\n')

print('5 rows around 10472')
print('\n')
explore_data(gp_apps_data, 10470, 10476)

problem row reported as row 10472. Missing "Category"


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


5 rows around 10472


['TownWiFi | Wi-Fi Everywhere', 'COMMUNICATION', '3.9', '2372', '58M', '500,000+', 'Free', '0', 'Everyone', 'Communication', 'August 2, 2018', '4.2.1', '4.2 and up']


['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14'

In [5]:
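# Delete the malformed row (reported as row 10472; index 10473 once the header is counted).
# Run this cell only once: re-running it would delete the next row as well.
del gp_apps_data[10473]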

In [6]:
print('5 rows around 10472')
print('\n')
explore_data(gp_apps_data, 10470, 10476)

5 rows around 10472


['TownWiFi | Wi-Fi Everywhere', 'COMMUNICATION', '3.9', '2372', '58M', '500,000+', 'Free', '0', 'Everyone', 'Communication', 'August 2, 2018', '4.2.1', '4.2 and up']


['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']


['Wi-Fi Visualizer', 'TOOLS', '3.9', '132', '2.6M', '50,000+', 'Free', '0', 'Everyone', 'Tools', 'May 17, 2017', '0.0.9', '2.3 and up']




### Duplicate apps in gp data

In [7]:
# Check for duplicate apps in the GP data. Comments on the dataset suggest there are a number of duplicates.

for row in gp_apps_data:
    name = row[0]
    if name == 'Instagram':
        print(row)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [8]:
#Identify all duplicates in gp data

gp_dup_apps = []
gp_unique_apps = []

for row in gp_apps_data:
    name = row[0]
    
    if name in gp_unique_apps:
        gp_dup_apps.append(name)
    else:
        gp_unique_apps.append(name)
        
print('Number of duplicate apps: ', len(gp_dup_apps))
print('Number of unique apps: ', len(gp_unique_apps))
print('\n')
print('Examples of duplicates:', gp_dup_apps[:15])

Number of duplicate apps:  1181
Number of unique apps:  9660


Examples of duplicates: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


### Removing duplicate apps in gp data
There are 1181 duplicate entries in the GP data.
In the Instagram example above, the rows differ only in the number of reviews (index 3), which suggests the data was collected at different points in time. We will use the review count to remove the duplicates, keeping for each app the row with the highest count, which is presumably the most recent.
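The keep-the-highest-review-count rule can also be expressed as a single pass over the rows. A minimal sketch, not the approach the notebook takes below; `keep_most_reviewed` is a hypothetical helper, the sample rows are shortened to four columns, and review counts are assumed to be plain integer strings:

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep, for each app name, the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = int(row[reviews_idx])
        # Strict '>' keeps the first row seen when review counts tie.
        if name not in best or n_reviews > int(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Shortened sample rows (name, category, rating, reviews).
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
]

deduped = keep_most_reviewed(sample_rows)
print(len(deduped))
```

Because the dictionary is keyed on the app name, ties need no separate "already added" list. One difference to be aware of: the result is ordered by each app's first appearance rather than by the position of its most-reviewed row.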

In [9]:
# Target length for gp data after dups removed
len(gp_apps_data[1:]) - 1181

9659

In [10]:
# Create a dictionary mapping each app in the GP dataset to its highest review count
gp_reviews_max = {}

for row in gp_apps_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if name in gp_reviews_max and gp_reviews_max[name] < n_reviews:
        gp_reviews_max[name] = n_reviews
    
    if name not in gp_reviews_max:
        gp_reviews_max[name] = n_reviews
        
print(gp_reviews_max['Instagram'])
print(gp_reviews_max['Jazz Wi-Fi'])

66577446.0
49.0


In [11]:
# Use the dictionary to identify the row with the highest review count for each app and add it to a new list of lists called "gp_clean".
# We track two lists (clean rows and names already added) because some apps have the same number of reviews across duplicate rows.
# The second list guards against this and ensures there are no duplicates in the final 'clean' list.

gp_clean = []
gp_already_added = []

for row in gp_apps_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if n_reviews == gp_reviews_max[name] and name not in gp_already_added:
        gp_clean.append(row)
        gp_already_added.append(name)
        
print('gp_clean number of rows: ', len(gp_clean), '\n')

explore_data(gp_clean, 0, 6)

gp_clean number of rows:  9659 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIG

Expected length of 9659 matches actual length of 9659.

### Duplicate apps in iOS data

In [12]:
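# Identify duplicates in the iOS data, using the app id (index 1) as the unique key.
# Note: the loop includes the header row, so the unique count is one higher than the number of apps.

ios_dup_apps = []
ios_unique_apps = []

for row in ios_apps_data:
    iid = row[1]
    
    if iid in ios_unique_apps:
        ios_dup_apps.append(iid)
    else:
        ios_unique_apps.append(iid)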
        
print('Number of duplicate apps: ', len(ios_dup_apps))
print('Number of unique apps: ', len(ios_unique_apps))


Number of duplicate apps:  0
Number of unique apps:  7198


No duplicate apps were found in the iOS data.

In [13]:
row2 = ios_apps_data[3]
name = row2[2]
char = name[0]
ord(char)


87

### Create a list of iOS apps with English-only names
Search through the app names and, where a name has more than 3 characters with a Unicode code point greater than 127 (that is, outside the ASCII range), add the app to the list `ios_non_eng_apps`.
Use the `ios_non_eng_apps` list to create a list of iOS apps with English-only names.
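The same rule can be packaged as a small predicate, which makes the threshold explicit and easy to test. A minimal sketch; `is_english` and its `max_non_ascii` parameter are hypothetical names, not part of the notebook:

```python
def is_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_english('Instagram'))
print(is_english('优酷视频'))
print(is_english('知乎'))
```

The rule is a heuristic: tolerating a few non-ASCII characters keeps names containing symbols such as `™` or `–`, but very short non-English names (such as the two-character `知乎`) also pass the check.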

In [14]:
ios_non_eng_apps = []
ios_apps_eng = []

for row in ios_apps_data[1:]:
    name = row[2]
    char_count = 0
    for character in name:
        if ord(character) > 127 and char_count < 3:
            char_count += 1
        elif ord(character) > 127 and char_count == 3:
            char_count += 1
            ios_non_eng_apps.append(name)
    if name not in ios_non_eng_apps:
        ios_apps_eng.append(row)
            
print('number of non-english apps: ',len(ios_non_eng_apps))
print('target number for english apps only list: ', len(ios_apps_data[1:]) - len(ios_non_eng_apps))
print(ios_non_eng_apps[:6])
print('\n')
print('number of english-only apps :', len(ios_apps_eng))
print(ios_apps_eng[:6])

number of non-english apps:  1014
target number for english apps only list:  6183
['新浪新闻-阅读最新时事热门头条资讯视频', '同花顺-炒股、股票', '央视影音-海量央视内容高清直播', '优酷视频', 'クックパッド - No.1料理レシピ検索アプリ', '大众点评-发现品质生活']


number of english-only apps : 6183
[['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1'], ['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1'], ['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1'], ['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1'], ['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45',

The expected length of the English-only list matches the final number.

### Create a list of Google Play apps with English-only names
Search through the app names and, where a name has more than 3 characters with a Unicode code point greater than 127 (that is, outside the ASCII range), add the app to the list `gp_non_eng_apps`.
Use the `gp_non_eng_apps` list to create a list of GP apps with English-only names.

In [15]:
gp_non_eng_apps = []
gp_clean_eng = []

for row in gp_clean:
    name = row[0]
    char_count = 0
    for character in name:
        if ord(character) > 127 and char_count < 3:
            char_count += 1
        elif ord(character) > 127 and char_count == 3:
            char_count += 1
            gp_non_eng_apps.append(name)
    if name not in gp_non_eng_apps:
        gp_clean_eng.append(row)
            
print('number of non-english apps: ',len(gp_non_eng_apps))
print('target number for english apps only list: ', len(gp_clean) - len(gp_non_eng_apps))
print(gp_non_eng_apps[:6])
print('\n')
print('number of english-only apps :', len(gp_clean_eng))
print(gp_clean_eng[:6])

number of non-english apps:  45
target number for english apps only list:  9614
['Flame - درب عقلك يوميا', 'သိင်္ Astrology - Min Thein Kha BayDin', 'РИА Новости', 'صور حرف H', 'L.POINT - 엘포인트 [ 포인트, 멤버십, 적립, 사용, 모바일 카드, 쿠폰, 롯데]', 'RMEduS - 음성인식을 활용한 R 프로그래밍 실습 시스템']


number of english-only apps : 9614
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creat

The expected length of the English-only list matches the final number.

### Create list of iOS apps that are free
Use the price field (which is a string) and keep the apps whose price string is `'0'`, `'0.0'`, or `'0.00'`.
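An alternative to listing every spelling of zero is to convert the price to a number first. A minimal sketch, assuming prices are plain numeric strings; `is_free` is a hypothetical helper, and anything that fails to parse is treated as not free:

```python
def is_free(price_text):
    """Return True when a price string represents zero."""
    try:
        return float(price_text) == 0.0
    except ValueError:
        # Not a plain number (for example, an empty field), so treat it as not free.
        return False

print(is_free('0'), is_free('0.00'), is_free('3.99'))
```

This also covers spellings such as `'0.000'` that an explicit list of strings would miss.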

In [16]:
ios_apps_eng_free = []

for row in ios_apps_eng:
    price = row[5]
    if price == '0' or price == '0.0' or price == '0.00':
        ios_apps_eng_free.append(row)
        
print('number of free apps: ', len(ios_apps_eng_free))
print('number of non-free apps: ', len(ios_apps_eng) - len(ios_apps_eng_free))
print(ios_apps_eng_free[:6])

number of free apps:  3222
number of non-free apps:  2961
[['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1'], ['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1'], ['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1'], ['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1'], ['7', '283646709', 'PayPal - Send and request money safely', '227795968', 'USD', '0', '119487', '879', '4', '4.5', '6.12.0', '4+', 'Finance', '37', '0', '19', '1'], ['8', '284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0', '1126879', '3594', '4', '4.5', '8.4.1', '12+', 'Music', '37', '4

### Create a list of Google Play apps that are free
The GP app dataset has a field that indicates whether an app is free or paid.
We will use this field to identify the apps that are free.

In [17]:
gp_clean_eng_free = []

for row in gp_clean_eng:
    app_type = row[6]  # renamed from `type` to avoid shadowing the built-in
    if app_type.lower() == 'free':
        gp_clean_eng_free.append(row)
        
print('number of free apps: ', len(gp_clean_eng_free))
print('number of non-free apps: ', len(gp_clean_eng) - len(gp_clean_eng_free))
print(gp_clean_eng_free[:6])

number of free apps:  8863
number of non-free apps:  751
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up'], ['Smoke Effect Photo Maker - Smoke Edi

### Final cleaned lists to use for analysis
iOS data `ios_apps_eng_free`

GP data `gp_clean_eng_free`

Cleaning performed:
- Inaccurate data removed (GP only)
- Duplicates removed (GP only - none in iOS)
- Removed apps with non-English names
- Removed apps that are not free

## Data analysis
---

### Analysis objective
Our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea consists of three steps:

- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app to both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

#### Step 1: Create functions 
Create two functions:
- one that builds a frequency table (as percentages); and
- one that displays the frequency table sorted from highest to lowest, using a list of tuples
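For comparison, the same two steps can be written with `collections.Counter` from the standard library. A minimal sketch, not the notebook's implementation; `freq_table_pct`, `sorted_table`, and the two-column `sample` are made up for illustration:

```python
from collections import Counter

def freq_table_pct(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}

def sorted_table(dataset, index):
    """Return (value, percentage) pairs sorted from highest to lowest."""
    table = freq_table_pct(dataset, index)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sample: (app name, genre).
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Weather'], ['d', 'Games']]

for genre, pct in sorted_table(sample, 1):
    print(genre, ':', pct)
```

Sorting with a `key` function avoids having to swap each pair into `(value, key)` order before calling `sorted`.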

In [26]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        key = row[index]
        if key in table:
            table[key] += 1
        else: 
            table[key] = 1
            
    table_percentages = {}
    
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
        
    return table_percentages
    
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

#### Step 2: Generate sorted frequency tables by Genre
Generate sorted frequency tables for the following:
- iOS `prime_genre` `[12]`
- GP `Genres` `[9]`
- GP `Category` `[1]`

In [29]:
display_table(ios_apps_eng_free, 12)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


In [30]:
display_table(gp_clean_eng_free, 9)

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & Vehicles : 0.9251946293580051
S

In [31]:
display_table(gp_clean_eng_free, 1)

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

In the iOS data, Games are clearly the most common app type. This also appears to be true for the GP data, as the Family category includes games for kids.

Whilst the frequency tables above show the share of apps in each category, they do not tell us how popular these apps are (i.e. how many installs they have). We can determine this easily for the GP data, as there is a data point for `Installs`. The iOS data has no install count, so we will use the number of user ratings (`rating_count_tot`) as a proxy.

#### Step 3: Generate popularity frequency table by genre
For the GP data, we will use the `Installs` field

For the iOS data, we will use the `rating_count_tot` field as a proxy for popularity

##### iOS Data

In [33]:
genres_ios = freq_table(ios_apps_eng_free, 12)

for genre in genres_ios:
    total = 0
    genre_len = 0
    
    for app in ios_apps_eng_free:
        app_genre = app[12]
        if app_genre == genre:
            total += float(app[6])
            genre_len += 1
            
    avg_n_ratings = total / genre_len
    
    print(genre, ': ', avg_n_ratings)

Productivity :  21028.410714285714
Weather :  52279.892857142855
Shopping :  26919.690476190477
Reference :  74942.11111111111
Finance :  31467.944444444445
Music :  57326.530303030304
Utilities :  18684.456790123455
Travel :  28243.8
Social Networking :  71548.34905660378
Sports :  23008.898550724636
Health & Fitness :  23298.015384615384
Games :  22788.6696905016
Food & Drink :  33333.92307692308
News :  21248.023255813954
Book :  39758.5
Photo & Video :  28441.54375
Entertainment :  14029.830708661417
Business :  7491.117647058823
Lifestyle :  16485.764705882353
Education :  7003.983050847458
Navigation :  86090.33333333333
Medical :  612.0
Catalogs :  4004.0


Below is the average number of iOS ratings per app by genre, sorted from highest to lowest

![Ratings](ios_average_n_ratings.jpg)

As we can see, Navigation apps are at the top of the list, followed by Reference, Social Networking, Music and Weather.
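The per-genre averages can also be computed in a single pass over the data and sorted in code rather than in a separate chart. A minimal sketch; `average_by_group` is a hypothetical helper, and the sample uses a made-up three-column layout with rating counts taken from the notebook's output:

```python
from collections import defaultdict

def average_by_group(rows, group_idx, value_idx):
    """Return (group, mean value) pairs sorted from highest to lowest mean."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_idx]] += float(row[value_idx])
        counts[row[group_idx]] += 1
    averages = {group: totals[group] / counts[group] for group in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Illustrative layout: (app name, genre, total rating count).
sample = [
    ['Waze - GPS Navigation, Maps & Real-time Traffic', 'Navigation', '345046'],
    ['Railway Route Search', 'Navigation', '5'],
    ['Bible', 'Reference', '985920'],
]

for genre, avg in average_by_group(sample, 1, 2):
    print(genre, ':', avg)
```

Accumulating totals and counts per genre avoids rescanning the whole dataset once for every genre.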

In [36]:
for app in ios_apps_eng_free:
    if app[12] == 'Navigation':
        print(app[2], ' : ',app[6])

Waze - GPS Navigation, Maps & Real-time Traffic  :  345046
Geocaching®  :  12811
ImmobilienScout24: Real Estate Search in Germany  :  187
Railway Route Search  :  5
CoPilot GPS – Car Navigation & Offline Maps  :  3582
Google Maps - Navigation & Transit  :  154911


In [38]:
for app in ios_apps_eng_free:
    if app[12] == 'Reference':
        print(app[2], ' : ',app[6])

Bible  :  985920
Dictionary.com Dictionary & Thesaurus  :  200047
Dictionary.com Dictionary & Thesaurus for iPad  :  54175
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran  :  18418
Merriam-Webster Dictionary  :  16849
Google Translate  :  26786
Night Sky  :  12122
WWDC  :  762
Jishokun-Japanese English Dictionary & Translator  :  0
教えて!goo  :  0
VPN Express  :  14
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition  :  17588
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools  :  4693
Guides for Pokémon GO - Pokemon GO News and Cheats  :  826
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free  :  718
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE)  :  8535
GUNS MODS for Minecraft PC Edition - Mods Tools  :  1497
Real Bike Traffic Rider Virtual Reality Glasses  :  8


In [37]:
for app in ios_apps_eng_free:
    if app[12] == 'Social Networking':
        print(app[2], ' : ',app[6])

Facebook  :  2974676
LinkedIn  :  71856
Skype for iPhone  :  373519
Tumblr  :  334293
Match™ - #1 Dating App.  :  60659
WhatsApp Messenger  :  287589
TextNow - Unlimited Text + Calls  :  164963
Grindr - Gay and same sex guys chat, meet and date  :  23201
imo video calls and chat  :  18841
Ameba  :  269
Weibo  :  7265
Badoo - Meet New People, Chat, Socialize.  :  34428
Kik  :  260965
Qzone  :  1649
Fake-A-Location Free ™  :  354
Tango - Free Video Call, Voice and Chat  :  75412
MeetMe - Chat and Meet New People  :  97072
SimSimi  :  23530
Viber Messenger – Text & Call  :  164249
Find My Family, Friends & iPhone - Life360 Locator  :  43877
Weibo HD  :  16772
POF - Best Dating App for Conversations  :  52642
GroupMe  :  28260
Lobi  :  36
WeChat  :  34584
ooVoo – Free Video Call, Text and Voice  :  177501
Pinterest  :  1061624
知乎  :  397
Qzone HD  :  458
Skype for iPad  :  60163
LINE  :  11437
QQ  :  9109
LOVOO - Dating Chat  :  1985
QQ HD  :  5058
Messenger  :  351466
eHarmony™ Dating App

In [39]:
for app in ios_apps_eng_free:
    if app[12] == 'Weather':
        print(app[2], ' : ',app[6])

WeatherBug - Local Weather, Radar, Maps, Alerts  :  188583
The Weather Channel: Forecast, Radar & Alerts  :  495626
AccuWeather - Weather for Life  :  144214
MyRadar NOAA Weather Radar Forecast  :  150158
The Weather Channel App for iPad – best local forecast, radar map, and storm tracking  :  208648
Météo-France  :  24
Yurekuru Call  :  53
QuakeFeed Earthquake Map, Alerts, and News  :  6081
Moji Weather - Free Weather Forecast  :  2333
FEMA  :  128
Weather Underground: Custom Forecast & Local Radar  :  49192
JaxReady  :  22
Hurricane Tracker WESH 2 Orlando, Central Florida  :  203
Hurricane by American Red Cross  :  1158
Weather & Radar  :  37
WRAL Weather Alert  :  25
Yahoo Weather  :  112603
Weather Live Free - Weather Forecast & Alerts  :  35702
NOAA Weather Radar - Weather Forecast & HD Radar  :  45696
iWeather - World weather forecast  :  80
Almanac Long-Range Weather Forecast  :  12
TodayAir  :  0
Weather - Radar - Storm with Morecast App  :  78
Storm Radar  :  22792
WarnWetter 

##### Google Play data

In [41]:
display_table(gp_clean_eng_free, 5)

1,000,000+ : 15.728308699086089
100,000+ : 11.55365000564143
10,000,000+ : 10.549475346947986
10,000+ : 10.199706645605326
1,000+ : 8.394448832223853
100+ : 6.916393997517771
5,000,000+ : 6.826131106848697
500,000+ : 5.562450637481666
50,000+ : 4.772650344127271
5,000+ : 4.513144533453684
10+ : 3.542818458761142
500+ : 3.2494640640866526
50,000,000+ : 2.3017037120613786
100,000,000+ : 2.1324607920568655
50+ : 1.9180864267178157
5+ : 0.7898002933543946
1+ : 0.5077287600135394
500,000,000+ : 0.270788672007221
1,000,000,000+ : 0.2256572266726842
0+ : 0.045131445334536835


The GP `Installs` value is actually a series of groups (for example `100,000+`) and not a precise number (not an integer). Therefore, we will take the same approach as for the iOS data and use the count of ratings as a proxy for popularity.

In [43]:
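# Categories present in the dataset (column index 1)
cats_gp = freq_table(gp_clean_eng_free, 1)

for cat in cats_gp:
    total = 0      # sum of rating counts in this category
    cat_len = 0    # number of apps in this category

    for app in gp_clean_eng_free:
        app_cat = app[1]
        if app_cat == cat:
            total += float(app[3])  # column index 3 is used as the number of ratings
            cat_len += 1

    avg_n_ratings = total / cat_len

    print(cat, ': ', avg_n_ratings)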

ART_AND_DESIGN :  24699.42105263158
AUTO_AND_VEHICLES :  14140.280487804877
BEAUTY :  7476.226415094339
BOOKS_AND_REFERENCE :  87995.06842105264
BUSINESS :  24239.727272727272
COMICS :  42585.61818181818
COMMUNICATION :  995608.4634146341
DATING :  21953.272727272728
EDUCATION :  56293.09708737864
ENTERTAINMENT :  301752.24705882353
EVENTS :  2555.84126984127
FINANCE :  38535.8993902439
FOOD_AND_DRINK :  57478.79090909091
HEALTH_AND_FITNESS :  78094.9706959707
HOUSE_AND_HOME :  26435.465753424658
LIBRARIES_AND_DEMO :  10925.807228915663
LIFESTYLE :  33921.82369942196
GAME :  683523.8445475638
FAMILY :  113210.54626865672
MEDICAL :  3730.1533546325877
SOCIAL :  965830.9872881356
SHOPPING :  223887.34673366835
PHOTOGRAPHY :  404081.3754789272
SPORTS :  116938.6146179402
TRAVEL_AND_LOCAL :  129484.42512077295
TOOLS :  305732.8973333333
PERSONALIZATION :  181122.31632653062
PRODUCTIVITY :  160634.5420289855
PARENTING :  16378.706896551725
WEATHER :  171250.77464788733
VIDEO_PLAYERS :  4253

Below is the average number of GP app ratings by category.

![gp average n ratings by cat](gp_avg_n_ratings_by_cat.jpg)
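The per-category averages above can also be computed in a single pass over the data instead of one pass per category. The following is only a minimal sketch: the function name `average_by_category` and the three sample rows are made up for illustration, and the column positions (category at index 1, number of ratings at index 3) are assumed to match the cell above.

```python
def average_by_category(rows, cat_index, value_index):
    """Return {category: mean of the value column} using one pass over rows."""
    totals = {}   # running sum per category
    counts = {}   # number of rows per category
    for row in rows:
        cat = row[cat_index]
        totals[cat] = totals.get(cat, 0.0) + float(row[value_index])
        counts[cat] = counts.get(cat, 0) + 1
    return {cat: totals[cat] / counts[cat] for cat in totals}


# Hypothetical rows (not taken from the dataset): name, category, rating, ratings count
sample_rows = [
    ['App A', 'TOOLS', '4.1', '100'],
    ['App B', 'TOOLS', '4.5', '300'],
    ['App C', 'WEATHER', '4.0', '50'],
]

averages = average_by_category(sample_rows, 1, 3)
for cat in averages:
    print(cat, ': ', averages[cat])
```

Both approaches should give the same averages; the single pass simply avoids rescanning every app once per category.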

In [46]:
for app in gp_clean_eng_free:
    if app[1] == 'COMMUNICATION':
        print(app[0], ' : ',app[3])

WhatsApp Messenger  :  69119316
Messenger for SMS  :  125257
My Tele2  :  158679
imo beta free calls and text  :  659395
Contacts  :  66602
Call Free – Free Call  :  30209
Web Browser & Explorer  :  36901
Browser 4G  :  192948
MegaFon Dashboard  :  99559
ZenUI Dialer & Contacts  :  437674
Cricket Visual Voicemail  :  13698
TracFone My Account  :  20769
Xperia Link™  :  45487
TouchPal Keyboard - Fun Emoji & Android Keyboard  :  615381
Skype Lite - Free Video Call & Chat  :  33053
My magenta  :  42370
Android Messages  :  781810
Google Duo - High Quality Video Calls  :  2083237
Seznam.cz  :  46702
Antillean Gold Telegram (original version)  :  2939
AT&T Visual Voicemail  :  13761
GMX Mail  :  258556
Omlet Chat  :  40751
My Vodacom SA  :  25021
Microsoft Edge  :  27187
Messenger – Text and Video Chat for Free  :  56646578
imo free video calls and chat  :  4785988
Calls & Text by Mo+  :  83239
free video calls and chat  :  594728
Skype - free IM & video calls  :  10484169
Who  :  2451093
G

In [47]:
for app in gp_clean_eng_free:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ' : ',app[3])

E-Book Read - Read Book for free  :  1857
Download free book with green book  :  4478
Wikipedia  :  577550
Cool Reader  :  246315
Free Panda Radio Music  :  418
Book store  :  22486
FBReader: Favorite Book Reader  :  203130
English Grammar Complete Handbook  :  1435
Free Books - Spirit Fanfiction and Stories  :  116507
Google Play Books  :  1433233
AlReader -any text book reader  :  90468
Offline English Dictionary  :  860
Offline: English to Tagalog Dictionary  :  967
FamilySearch Tree  :  17506
Cloud of Books  :  1862
Recipes of Prophetic Medicine for free  :  2084
ReadEra – free ebook reader  :  47303
Anonymous caller detection  :  161
Ebook Reader  :  85842
Litnet - E-books  :  7831
Read books online  :  91615
English to Urdu Dictionary  :  4620
eBoox: book reader fb2 epub zip  :  21336
English Persian Dictionary  :  26875
Flybook  :  1778
All Maths Formulas  :  2709
Ancestry  :  64513
HTC Help  :  8342
English translation from Bengali  :  527
Pdf Book Download - Read Pdf Book  :  

We can also try treating the `Installs` field as an actual number by taking the lower bound of each group. For instance, if an app is in the `100,000+` group, we assume it has 100,000 installs.
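As a small illustration of that conversion, the sketch below wraps it in a helper. The name `parse_installs` is hypothetical and not part of the notebook; the labels passed to it are group names that appear in the installs table above.

```python
def parse_installs(label):
    """Convert an install group label such as '100,000+' to its lower bound."""
    cleaned = label.replace('+', '').replace(',', '')
    return int(cleaned)


# Group labels as they appear in the Installs column
for label in ['0+', '100,000+', '1,000,000,000+']:
    print(label, ' : ', parse_installs(label))
```

Because every group is reduced to its lower bound, the resulting averages understate the true number of installs.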

In [50]:
cats_gp = freq_table(gp_clean_eng_free, 1)

for category in cats_gp:
    total = 0      # sum of install lower bounds in this category
    cat_len = 0    # number of apps in this category

    for app in gp_clean_eng_free:
        app_cat = app[1]

        if app_cat == category:
            # Turn a group label such as '100,000+' into the integer 100000
            installs = app[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            installs = int(installs)
            total += installs
            cat_len += 1

    avg_installs = total / cat_len

    print(category, ' : ', avg_installs)

ART_AND_DESIGN  :  1986335.0877192982
AUTO_AND_VEHICLES  :  647317.8170731707
BEAUTY  :  513151.88679245283
BOOKS_AND_REFERENCE  :  8767811.894736841
BUSINESS  :  1712290.1474201474
COMICS  :  817657.2727272727
COMMUNICATION  :  38456119.167247385
DATING  :  854028.8303030303
EDUCATION  :  1833495.145631068
ENTERTAINMENT  :  11640705.88235294
EVENTS  :  253542.22222222222
FINANCE  :  1387692.475609756
FOOD_AND_DRINK  :  1924897.7363636363
HEALTH_AND_FITNESS  :  4188821.9853479853
HOUSE_AND_HOME  :  1331540.5616438356
LIBRARIES_AND_DEMO  :  638503.734939759
LIFESTYLE  :  1437816.2687861272
GAME  :  15588015.603248259
FAMILY  :  3697848.1731343283
MEDICAL  :  120550.61980830671
SOCIAL  :  23253652.127118643
SHOPPING  :  7036877.311557789
PHOTOGRAPHY  :  17840110.40229885
SPORTS  :  3638640.1428571427
TRAVEL_AND_LOCAL  :  13984077.710144928
TOOLS  :  10801391.298666667
PERSONALIZATION  :  5201482.6122448975
PRODUCTIVITY  :  16787331.344927534
PARENTING  :  542603.6206896552
WEATHER  :  50

Below is the average number of installs by category. Note that these results are not significantly different from those for the user rating count proxy considered above.

![gp average installs by cat](gp_avg_installs_by_cat.jpg)
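One rough way to compare the two measures is to check how much their top categories overlap. The sketch below uses a subset of ten categories whose averages are copied (rounded) from the outputs above; categories whose printed output is cut off are left out, so this is an illustration under that assumption and not a full comparison. The helper name `top_n` is made up for this example.

```python
# Average number of ratings per category (rounded, subset of the output above)
avg_ratings = {
    'COMMUNICATION': 995608, 'SOCIAL': 965831, 'GAME': 683524,
    'PHOTOGRAPHY': 404081, 'TOOLS': 305733, 'ENTERTAINMENT': 301752,
    'SHOPPING': 223887, 'PRODUCTIVITY': 160635,
    'TRAVEL_AND_LOCAL': 129484, 'BOOKS_AND_REFERENCE': 87995,
}

# Average installs per category (rounded, same subset)
avg_installs = {
    'COMMUNICATION': 38456119, 'SOCIAL': 23253652, 'GAME': 15588016,
    'PHOTOGRAPHY': 17840110, 'TOOLS': 10801391, 'ENTERTAINMENT': 11640706,
    'SHOPPING': 7036877, 'PRODUCTIVITY': 16787331,
    'TRAVEL_AND_LOCAL': 13984078, 'BOOKS_AND_REFERENCE': 8767812,
}


def top_n(table, n):
    """Return the n keys with the largest values, largest first."""
    ranked = sorted(table, key=table.get, reverse=True)
    return ranked[:n]


top_by_ratings = top_n(avg_ratings, 5)
top_by_installs = top_n(avg_installs, 5)
shared = set(top_by_ratings) & set(top_by_installs)

print('Top 5 by ratings :', top_by_ratings)
print('Top 5 by installs:', top_by_installs)
print('In both lists    :', sorted(shared))
```

Within this subset the two top-five lists largely agree, although a few categories, such as Productivity, rank noticeably higher by installs than by number of ratings.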

## Conclusions
---

The two app stores appear to be dominated by different app types. At face value some app categories look popular, but a closer look shows that they are often dominated by a few very popular apps, which skews the apparent popularity of the category as a whole.
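The skew can be illustrated with the iOS Reference category, whose apps are all listed in the output above. This is only a sketch: the rating counts are copied from that list, and comparing the mean with the median is one possible way to expose the effect, not a step performed in the original analysis.

```python
from statistics import mean, median

# Rating counts of the free English iOS apps in the 'Reference' genre,
# copied from the printed list above
reference_ratings = [
    985920, 200047, 54175, 18418, 16849, 26786, 12122, 762, 0,
    0, 14, 17588, 4693, 826, 718, 8535, 1497, 8,
]

avg = mean(reference_ratings)
mid = median(reference_ratings)
top_share = max(reference_ratings) / sum(reference_ratings)

print('Mean   :', round(avg))
print('Median :', mid)
print('Share of all ratings held by the top app:', round(top_share, 2))
```

The mean is more than ten times the median here because a single app (Bible) accounts for most of the ratings, so the category average says little about what a typical app in the category achieves.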

App categories that appear to offer more genuine and interesting opportunities are:
- Productivity apps
- Tools apps (similar to Productivity)
- Books and Reference type apps