# Analysis of Android and iOS App Installations

##### Overview
This project explores data on user engagement with free mobile applications in the App Store and Google Play.

##### Goal
The goal of this analysis is to better understand which types of mobile applications attract more users.

Let's start by importing data on the App Store and Google Play:

In [5]:
from csv import reader

### Apple Store Dataset ###
as_opened_file = open('AppleStore.csv')
as_read_file = reader(as_opened_file)
as_apps_data = list(as_read_file)
as_header = as_apps_data[0]
as_content = as_apps_data[1:]

### Google Play Dataset ###
gp_opened_file = open('googleplaystore.csv')
gp_read_file = reader(gp_opened_file)
gp_apps_data = list(gp_read_file)
gp_header = gp_apps_data[0]
gp_content = gp_apps_data[1:]
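
The two loading blocks above repeat the same steps, so they could be folded into one helper. The following is only a sketch: `load_csv` is a hypothetical name, the `utf8` encoding is an assumption about the files, and the demo uses a throwaway file with made-up rows rather than the real datasets.

```python
import csv
import os
import tempfile


def load_csv(path, encoding="utf8"):
    """Return (header, rows) for a CSV file; the file is closed automatically."""
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Demo on a tiny temporary file (made-up rows, not the real data)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="utf8", newline="") as f:
        csv.writer(f).writerows([["App", "Reviews"], ["Foo", "10"], ["Bar", "3"]])
    demo_header, demo_rows = load_csv(demo_path)

print(demo_header)     # ['App', 'Reviews']
print(len(demo_rows))  # 2
```

With the real files this would be called as `as_header, as_content = load_csv('AppleStore.csv')`, and likewise for `googleplaystore.csv`.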

We explore these datasets with the `explore_data()` function defined below, which prints a slice of rows and, optionally, the number of rows and columns. This shows what information is available for each app in both datasets. The App Store dataset has 7,197 apps and 16 columns; the Google Play dataset has 10,841 apps and 13 columns.

In [6]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print('Apple Store Dataset:')
print(as_header)
print('\n')
explore_data(as_content, 10, 12, True)

print('\n')

print('Google Play Dataset:')
print(gp_header)
print('\n')
explore_data(gp_content, 10, 12, True)

Apple Store Dataset:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['512939461', 'Subway Surfers', '156038144', 'USD', '0.0', '706110', '97', '4.5', '4.0', '1.72.1', '9+', 'Games', '38', '5', '1', '1']


['362949845', 'Fruit Ninja Classic', '104590336', 'USD', '1.99', '698516', '132', '4.5', '4.0', '2.3.9', '4+', 'Games', '38', '5', '13', '1']


Number of rows: 7197
Number of columns: 16


Google Play Dataset:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Text on Photo - Fonteee', 'ART_AND_DESIGN', '4.4', '13880', '28M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'October 27, 2017', '1.0.4', '4.1 and up']


['Name Art Photo Editor - Focus n Filters', 'ART_AND_DESIGN', '4.4', '8788', '1

#### Data Cleanup
##### Part 1 - Identifying Missing and Duplicate Data
Let's start with the Google Play dataset. It contains a few discrepancies: some entries have missing data, and some apps appear more than once. In this step we begin cleaning the data to make the analysis more robust and reliable.

First, let's remove rows with incomplete information. The row at index 10472 is missing its `Category` value, so it has 12 fields instead of 13, and we delete it:

In [7]:
print(gp_header)
print('\n')
explore_data(gp_content, 10472, 10473, True)
print('\n')
gp_content[10472]

del gp_content[10472]
gp_content[10472]

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Number of rows: 10841
Number of columns: 13




['osmino Wi-Fi: free WiFi',
 'TOOLS',
 '4.2',
 '134203',
 '4.1M',
 '10,000,000+',
 'Free',
 '0',
 'Everyone',
 'Tools',
 'August 7, 2018',
 '6.06.14',
 '4.4 and up']
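
The faulty row above is located by a known index. A more general check, sketched here on made-up toy rows, compares each row's length with the header's length. `find_malformed_rows` is a hypothetical helper and not part of the original notebook.

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose length differs from the header's."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]


toy_header = ["App", "Category", "Rating"]
toy_rows = [
    ["Foo", "TOOLS", "4.2"],
    ["Bar", "1.9"],           # 'Category' is missing, so the row is one field short
    ["Baz", "GAME", "4.5"],
]

bad = find_malformed_rows(toy_header, toy_rows)
print(bad)  # [1]

# Delete from the end so the remaining indexes stay valid
for i in reversed(bad):
    del toy_rows[i]

print(len(toy_rows))  # 2
```

Unlike a bare `del` at a fixed index, this check finds nothing on a second run, so re-executing the cell does not remove a valid row by accident.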

Next, let's check for duplicate entries. We initialize two lists, `gp_duplicate_apps` and `gp_unique_apps`, then loop through `gp_content` and examine the name of each app. If the name is already in `gp_unique_apps`, we add it to `gp_duplicate_apps`. Otherwise, we append it to `gp_unique_apps`.

In [8]:
gp_duplicate_apps = []
gp_unique_apps = []

for app in gp_content:
    name = app[0]
    if name in gp_unique_apps:
        gp_duplicate_apps.append(name)
    else:
        gp_unique_apps.append(name)
        
print('Number of duplicate apps:', len(gp_duplicate_apps))
print('\n')
print('Examples of duplicate apps:', gp_duplicate_apps[:20])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']
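
The same count can be obtained without repeated list lookups. The sketch below uses `collections.Counter` on a made-up list of names; it is an alternative to the loop above, not the notebook's own method.

```python
from collections import Counter

toy_names = ["Box", "Slack", "Box", "Zoom", "Box", "Slack"]
name_counts = Counter(toy_names)

# Every occurrence beyond the first one counts as a duplicate
n_duplicates = sum(count - 1 for count in name_counts.values())
repeated = sorted(name for name, count in name_counts.items() if count > 1)

print(n_duplicates)  # 3
print(repeated)      # ['Box', 'Slack']
```

On the real data, `toy_names` would be replaced by `[app[0] for app in gp_content]`.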


The dataset contains quite a few duplicate entries. There are several ways to decide which entry to keep for each app; two options are the `Reviews` column and the `Installs` column, which hold the total number of reviews and installs, respectively. We use the `Reviews` column: review counts generally grow over time, so the entry with the highest count is likely the most recent one.

In [9]:
print('Expected length:', len(gp_content)-1181)

Expected length: 9659


In [10]:
gp_header

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

##### Part 2 - Consolidation of Multiple Data

Here we loop through the Google Play rows and find the maximum number of reviews for each app. We start by initializing a dictionary, `reviews_max`, and then loop through `gp_content`. If the app's name is already a key in `reviews_max` **and** the stored value is less than the number of reviews for the current entry, we update the value. If the name is not yet in the dictionary, we add it as a new key with the current review count as its value.

In [19]:
import random
reviews_max = {}
for app in gp_content:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews


print(random.sample(list(reviews_max.items()), 10))  # sample() requires a sequence in Python 3.11+

[('Wave Z Live Wallpaper', 63699.0), ('Soldiers of Glory: Modern War', 64815.0), ('Stash: Invest. Learn. Save.', 11919.0), ('Kids A-Z', 26426.0), ('Medical terms (OFFLINE)', 104.0), ('My Recipes Cookbook : RecetteTek', 11707.0), ('U Pull It Auto Dismantler', 71.0), ('Yandex.Shell (Launcher+Dialer)', 87300.0), ('Antivirus & Mobile Security', 351267.0), ('CX-42', 0.0)]


In [20]:
print(len(reviews_max))

9659
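
The two branches of the loop above can be collapsed into a single `max` call with `dict.get`. This is a sketch of the same idea on toy rows; the review counts are made up for illustration.

```python
toy_rows = [
    ["Box", "BUSINESS", "4.2", "159872"],
    ["Slack", "BUSINESS", "4.4", "51507"],
    ["Box", "BUSINESS", "4.2", "159880"],
]

reviews_max = {}
for row in toy_rows:
    name, n_reviews = row[0], float(row[3])
    # Names not seen yet fall back to -1.0, so their first entry always wins
    reviews_max[name] = max(reviews_max.get(name, -1.0), n_reviews)

print(reviews_max)  # {'Box': 159880.0, 'Slack': 51507.0}
```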


This dictionary tells us which entry to keep for each app. The next step builds a new list that contains, for each app, only the entry with the highest number of reviews.
We initialize two empty lists, `android_clean` and `already_added`. Looping through `gp_content`, we convert each entry's review count to a float and store it in `n_reviews`. We then compare `n_reviews` with the value stored for that app in `reviews_max`. If the values match and the app's name is not yet in `already_added`, we append the entry to `android_clean` and the name to `already_added`. The second condition matters because duplicate entries can share the same maximum review count; without it, all of them would be kept.

In [21]:
android_clean = []
already_added = []

for app in gp_content:
    name = app[0]
    n_reviews = float(app[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

print(android_clean[10])

['Name Art Photo Editor - Focus n Filters', 'ART_AND_DESIGN', '4.4', '8788', '12M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'July 31, 2018', '1.0.15', '4.0 and up']


In [22]:
print(len(android_clean))

9659
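
Checking `name not in already_added` on a list scans the list each time. A set gives the same result with constant-time lookups on average. The sketch below uses made-up toy rows, including a tie on the maximum, to show that only one entry per app survives.

```python
toy_rows = [
    ["Box", "BUSINESS", "4.2", "159872"],
    ["Box", "BUSINESS", "4.2", "159880"],
    ["Box", "BUSINESS", "4.2", "159880"],  # tie on the maximum review count
    ["Slack", "BUSINESS", "4.4", "51507"],
]
reviews_max = {"Box": 159880.0, "Slack": 51507.0}

clean = []
seen = set()  # plays the role of `already_added`
for row in toy_rows:
    name, n_reviews = row[0], float(row[3])
    if n_reviews == reviews_max[name] and name not in seen:
        clean.append(row)
        seen.add(name)

print([row[0] for row in clean])  # ['Box', 'Slack']
print(len(clean))                 # 2
```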


Next we remove non-English apps from the dataset using character codes. Standard ASCII, which covers the characters commonly used in English text, spans code points 0 to 127. The function below takes a string and returns `False` as soon as it finds a character whose code point is greater than 127.

In [29]:
def is_english(input_str):
    for character in input_str:
        if ord(character) > 127:
            return False      
    return True

Next, let's test out the function on various types of strings:

In [30]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
False


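There is a minor issue with the `is_english` function defined above: it rejects English names that contain special characters or emojis, because those characters fall outside the ASCII range we have set. To handle this, the function is revised to return `False` only when a name contains more than three non-ASCII characters.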
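
To see which characters fall outside the range, one can print their code points with `ord()` and count them per name. This is an illustrative sketch; `count_non_ascii` is a hypothetical helper.

```python
def count_non_ascii(text):
    """Count the characters whose code point is above 127."""
    return sum(1 for ch in text if ord(ch) > 127)


# Code points of a few characters from the test strings
code_points = {ch: ord(ch) for ch in ["a", "™", "😜", "爱"]}
for ch, cp in code_points.items():
    print(repr(ch), cp, cp > 127)
# 'a' is 97; '™' is 8482; '😜' is 128540; '爱' is 29233

print(count_non_ascii("Docs To Go™ Free Office Suite"))  # 1
print(count_non_ascii("Instachat 😜"))                   # 1
print(count_non_ascii("爱奇艺PPS -《欢乐颂2》电视剧热播") > 3)  # True
```

The two English names each contain a single non-ASCII character, while the Chinese title contains many, which is what a small tolerance threshold relies on.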

In [32]:
def is_english(input_str):
    non_eng = 0
    for character in input_str:
        if ord(character) > 127:
            non_eng += 1
            if non_eng > 3:
                return False      
    return True

In [33]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True


In [42]:
android_english = []
for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
print('The number of english apps in Google Play is:', len(android_english))

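os_english = []
for app in as_content:
    # Caution: in the App Store data, index 0 holds the numeric `id`;
    # the app name (`track_name`) is at index 1 (see as_header).
    # Ids are pure ASCII, so this check keeps every row.
    name = app[0]
    if is_english(name):
        os_english.append(app)

print('The number of english apps in App Store is:', len(os_english))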



The number of english apps in Google Play is: 9614
The number of english apps in App Store is: 7197
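
The App Store count equals the full 7,197 rows because the check runs on index 0, which is the `id` column in that dataset. A sketch of a filter that takes the name column as a parameter is shown below. `filter_english` and `max_non_ascii` are hypothetical names, the toy rows and ids are made up, and index 1 for the App Store name follows the `track_name` position in `as_header`.

```python
def is_english(text, max_non_ascii=3):
    """True if `text` has at most `max_non_ascii` characters above code point 127."""
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    return non_ascii <= max_non_ascii


def filter_english(rows, name_index):
    """Keep the rows whose name column passes is_english()."""
    return [row for row in rows if is_english(row[name_index])]


toy_android = [["Instachat 😜", "SOCIAL"], ["爱奇艺PPS -《欢乐颂2》电视剧热播", "VIDEO"]]
toy_ios = [["1", "Instagram"], ["2", "爱奇艺PPS -《欢乐颂2》电视剧热播"]]

print(len(filter_english(toy_android, name_index=0)))  # 1
print(len(filter_english(toy_ios, name_index=1)))      # 1
print(len(filter_english(toy_ios, name_index=0)))      # 2, ids are always ASCII
```

On the real data this would be `filter_english(android_clean, 0)` and `filter_english(as_content, 1)`.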
