# Project Overview
This is a guided project at the end of module 1. The purpose of this project is to apply the skills learnt in this module, including:
- The basics of programming in Python (arithmetical operations, variables, common data types, etc.)
- Lists and for loops
- Conditional statements
- Dictionaries and frequency tables
- Functions
- Jupyter Notebook

### Project scenario
For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.

We only build apps that are free to download and install, and our main source of revenue is in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use it: the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users.

In [1]:
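from csv import reader

# Load the App Store data as a list of lists (the first row is the header)
opened_file = open('Apple iOS Store/AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios_apps_data = list(read_file)
opened_file.close()

# Load the Google Play data the same way
opened_file = open('Google Play Store/googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
gp_apps_data = list(read_file)
opened_file.close()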

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
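
The loading step above can also be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: `load_csv` is a hypothetical name, and the demo file is made up purely to show the call. It assumes the files are UTF-8 encoded.

```python
import csv
import os
import tempfile


def load_csv(path, encoding='utf8'):
    """Read a CSV file into a list of rows (each row is a list of strings)."""
    # The with-statement closes the file even if reading fails.
    # newline='' is what the csv module documentation recommends for file objects.
    with open(path, encoding=encoding, newline='') as f:
        return list(csv.reader(f))


# Tiny made-up file (not the real dataset), just to demonstrate the helper.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')
    with open(demo_path, 'w', encoding='utf8', newline='') as f:
        csv.writer(f).writerows([['App', 'Reviews'], ['Demo App', '10']])
    demo_rows = load_csv(demo_path)

print(demo_rows)
```

With a helper like this, each dataset could be loaded in one line, for example `load_csv('Apple iOS Store/AppleStore.csv')`.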

## iOS Data Summary

The iOS dataset has 7197 rows (excluding the header) and 17 columns.
\
Here is a [link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) to the source data.
\
The columns are:

| 0   | 1      | 2          | 3               | 4             | 5            | 6                                    | 7                                        | 8                                           | 9                                               | 10                  | 11             | 12             | 13                            | 14                                       | 15                            | 16                                 |
|:----|:-------|:-----------|:----------------|:--------------|:-------------|:-------------------------------------|:-----------------------------------------|:--------------------------------------------|:------------------------------------------------|:--------------------|:---------------|:---------------|:------------------------------|:-----------------------------------------|:------------------------------|:-----------------------------------|
|     | id     | track_name | size_bytes      | currency      | price        | rating_count_tot                     | rating_count_ver                         | user_rating                                 | user_rating_ver                                 | ver                 | cont_rating    | prime_genre    | sup_devices.num               | ipadSc_urls.num                          | lang.num                      | vpp_lic                            |
| Row | App ID | App Name   | Size (in Bytes) | Currency Type | Price amount | User Rating counts (for all versions) | User Rating counts (for current version) | Average User Rating value (for all versions) | Average User Rating value (for current version) | Latest version code | Content Rating | Primary Genre | Number of supporting devices | Number of screenshots shown for display | Number of supported languages | Vpp Device Based Licensing Enabled |



In [2]:
print('iOS Data Summary')
print('Number of rows excl header: ', len(ios_apps_data[1:]))
print('Number of columns: ', len(ios_apps_data[0]))
print('\n')
explore_data(ios_apps_data, 0, 6)

iOS Data Summary
Number of rows excl header:  7197
Number of columns:  17


['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['5', '282935706', 'Bible', '92774400'

## Google Play Store Data Summary
The Google Play Store dataset has 10841 rows (excluding the header) and 13 columns.
\
Here is a [link](https://www.kaggle.com/lava18/google-play-store-apps) to the source data.
\
The columns are:


| 0        | 1                           | 2                              | 3                                  | 4               | 5                                             | 6            | 7                | 8                                                                | 9                                   | 10                                               | 11                                                 | 12                           |
|:---------|:----------------------------|:-------------------------------|:-----------------------------------|:----------------|:----------------------------------------------|:-------------|:-----------------|:-----------------------------------------------------------------|:-------------------------------------|:-------------------------------------------------|:---------------------------------------------------|:-----------------------------|
| App      | Category                    | Rating                         | Reviews                            | Size            | Installs                                      | Type         | Price            | Content Rating                                                   | Genres                               | Last Updated                                     | Current Ver                                        | Android Ver                  |
| App name | Category the App belongs to | Overall user rating of the app | Number of user reviews for the app | Size of the app | Number of user downloads/installs for the app | Paid or Free | Price of the app | Age group the app is targeted at - Children / Mature 21+ / Adult | An app can belong to multiple genres | Date when the app was last updated on Play Store | Current version of the app available on Play Store | Min required Android version |


In [3]:
print('Google Play Data Summary')
print('Number of rows excl header: ', len(gp_apps_data[1:]))
print('Number of columns: ', len(gp_apps_data[0]))
print('\n')
explore_data(gp_apps_data, 0, 6)

Google Play Data Summary
Number of rows excl header:  10841
Number of columns:  13


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art 

In [10]:
print('problem row reported as row 10472. Missing "Category"')
print('\n')
print(gp_apps_data[10473])

print('\n')

print('5 rows around 10472')
print('\n')
explore_data(gp_apps_data, 10470, 10476)

problem row reported as row 10472. Missing "Category"


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


5 rows around 10472


['TownWiFi | Wi-Fi Everywhere', 'COMMUNICATION', '3.9', '2372', '58M', '500,000+', 'Free', '0', 'Everyone', 'Communication', 'August 2, 2018', '4.2.1', '4.2 and up']


['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14'

In [11]:
# Delete the malformed row: row 10472 (header excluded) sits at index 10473 once the header is counted. Run this cell only once.
del gp_apps_data[10473]

In [12]:
print('5 rows around 10472')
print('\n')
explore_data(gp_apps_data, 10470, 10476)

5 rows around 10472


['TownWiFi | Wi-Fi Everywhere', 'COMMUNICATION', '3.9', '2372', '58M', '500,000+', 'Free', '0', 'Everyone', 'Communication', 'August 2, 2018', '4.2.1', '4.2 and up']


['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']


['Wi-Fi Visualizer', 'TOOLS', '3.9', '132', '2.6M', '50,000+', 'Free', '0', 'Everyone', 'Tools', 'May 17, 2017', '0.0.9', '2.3 and up']
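
The malformed row above was located from a report about the dataset. As a rough sketch of how such rows could also be found programmatically, the check below compares each row's length with the header's. `find_malformed_rows` is a hypothetical helper, and the demo rows are made up for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(dataset[0])  # the header row defines the expected column count
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Made-up rows for illustration; 'App B' is missing its 'Category' value.
demo = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
    ['App C', 'SOCIAL', '4.5'],
]

malformed = find_malformed_rows(demo)
print(malformed)  # [(2, ['App B', '1.9'])]
```

A check like this only catches rows with a missing or extra field. It would not notice a row that has the right number of columns but wrong values.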




## Duplicate apps in gp data

In [15]:
# Check for duplicate apps in the gp data. Comments on the dataset suggest there are a number of duplicates

for row in gp_apps_data:
    name = row[0]
    if name == 'Instagram':
        print(row)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [14]:
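# Identify all duplicates in gp data
# Note: the loop includes the header row, so the unique count below is one higher than the number of unique apps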

gp_dup_apps = []
gp_unique_apps = []

for row in gp_apps_data:
    name = row[0]
    
    if name in gp_unique_apps:
        gp_dup_apps.append(name)
    else:
        gp_unique_apps.append(name)
        
print('Number of duplicate apps: ', len(gp_dup_apps))
print('Number of unique apps: ', len(gp_unique_apps))
print('\n')
print('Examples of duplicates:', gp_dup_apps[:15])

Number of duplicate apps:  1181
Number of unique apps:  9660


Examples of duplicates: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
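
Checking `name in some_list` scans the whole list each time, which gets slow as the list grows. As an alternative sketch (not the approach used in the cell above), `collections.Counter` can tally the names in one pass. `count_duplicates` is a hypothetical helper and the demo data is made up.

```python
from collections import Counter


def count_duplicates(dataset, key_index=0, has_header=True):
    """Return (number of unique keys, number of surplus duplicate rows)."""
    rows = dataset[1:] if has_header else dataset
    counts = Counter(row[key_index] for row in rows)
    n_unique = len(counts)
    # Every occurrence beyond the first one for a key counts as a duplicate row.
    n_duplicate_rows = sum(counts.values()) - n_unique
    return n_unique, n_duplicate_rows


# Made-up data: 'Box' appears three times, so two of its rows are duplicates.
demo = [['App'], ['Box'], ['Slack'], ['Box'], ['Box']]
n_unique, n_dups = count_duplicates(demo)
print(n_unique, n_dups)  # 2 2
```

Because the header is skipped here, the unique count refers to apps only.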


### Removing duplicate apps in gp data
There are 1181 duplicate rows in the gp data.
In the Instagram example above, the rows for the same app differ in the number of reviews (index 3), which suggests the data was collected at different points in time. We will use this column to remove the duplicates, keeping for each app the row with the highest review count, since it is likely the most recent.
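
The rule described above (keep one row per app, the one with the most reviews) can be sketched in a single pass over the data. This is an illustrative alternative, not the notebook's own method: `keep_most_reviewed` is a hypothetical helper, and the demo rows are trimmed to four columns.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}  # app name -> best row seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strictly greater: when review counts tie, the first row seen is kept.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Rows trimmed to four columns for illustration (header already removed).
demo_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

deduped = keep_most_reviewed(demo_rows)
for row in deduped:
    print(row)
```

Because dictionaries preserve insertion order (Python 3.7+), the result is ordered by each app's first appearance, which may differ from the position of the row that is kept.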

In [22]:
# Target length for gp data after dups removed
len(gp_apps_data[1:]) - 1181

9659

In [21]:
# create a dictionary mapping each app in the gp dataset to its highest review count
gp_reviews_max = {}

for row in gp_apps_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if name in gp_reviews_max and gp_reviews_max[name] < n_reviews:
        gp_reviews_max[name] = n_reviews
    
    if name not in gp_reviews_max:
        gp_reviews_max[name] = n_reviews
        
print(gp_reviews_max['Instagram'])
print(gp_reviews_max['Jazz Wi-Fi'])

66577446.0
49.0


In [25]:
# use the dictionary to identify the rows with the highest review count and add those to a new list of lists called "gp_clean"
# we use two lists (clean and already added) because some apps may have the same number of reviews across dup rows.
# the second list protects us against this and ensures no duplicates in the final 'clean' list

gp_clean = []
gp_already_added = []

for row in gp_apps_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if n_reviews == gp_reviews_max[name] and name not in gp_already_added:
        gp_clean.append(row)
        gp_already_added.append(name)
        
print('gp_clean number of rows: ', len(gp_clean), '\n')

explore_data(gp_clean, 0, 6)

gp_clean number of rows:  9659 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIG

The expected length of 9659 matches the actual length of 9659.

## Duplicate apps in iOS data

In [26]:
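# identify duplicates in ios data, using the app id (index 1) as the unique key
# note: the loop includes the header row, so the unique count is one higher than the number of apps

ios_dup_apps = []
ios_unique_apps = []

for row in ios_apps_data:
    iid = row[1]
    
    # compare and store the id itself (the earlier version appended the stale variable `name`)
    if iid in ios_unique_apps:
        ios_dup_apps.append(iid)
    else:
        ios_unique_apps.append(iid)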
        
print('Number of duplicate apps: ', len(ios_dup_apps))
print('Number of unique apps: ', len(ios_unique_apps))


Number of duplicate apps:  0
Number of unique apps:  7198


No duplicate apps in the iOS data