# Analysis of Android and iOS App Installations

## Overview
This project will explore data on user engagement with mobile applications in the App Store and Google Play that are available free of cost.

## Goal
The goal of this data research is to understand better which types of mobile applications attract more users.

### Part 1 - Exploring Data
Let's start by importing data on the App Store and Google Play, which we source from Kaggle. The [Google Play Data](https://www.kaggle.com/lava18/google-play-store-apps) has 9660 unique entries and the [App Store Data](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) has 7195 unique entries.

Next we import `reader` from the `csv` module and open the files.

In [1]:
from csv import reader

### Apple Store Dataset ###
os_opened_file = open('AppleStore.csv')
os_read_file = reader(os_opened_file)
os_apps_data = list(os_read_file)
os_header = os_apps_data[0]
os_content = os_apps_data[1:]

### Google Play Dataset ###
gp_opened_file = open('googleplaystore.csv')
gp_read_file = reader(gp_opened_file)
gp_apps_data = list(gp_read_file)
gp_header = gp_apps_data[0]
gp_content = gp_apps_data[1:]
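
The loading steps above can also be wrapped in a small helper that closes the file when it is done. This is only a sketch: `load_csv` is a hypothetical name, and the `utf8` encoding is an assumption about how the Kaggle files are stored. The demo writes a tiny throwaway file so that it runs without the real datasets.

```python
import csv
import tempfile
from pathlib import Path

def load_csv(path, encoding="utf8"):
    """Return (header, rows) for a CSV file; the file is closed on exit."""
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

# Demo on a temporary two-line file (not the real Kaggle data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    with open(demo_path, "w", encoding="utf8", newline="") as f:
        csv.writer(f).writerows([["App", "Category"], ["Box", "BUSINESS"]])
    demo_header, demo_content = load_csv(demo_path)

print(demo_header)
print(demo_content)
```

With the real files, the call would look like `os_header, os_content = load_csv('AppleStore.csv')`.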

We are going to explore these datasets using the following `explore_data()` function, which prints a slice of rows and, optionally, the number of rows and columns. This lets us see what information is available for each mobile app in both datasets. The preliminary analysis shows that the App Store dataset has 7,197 apps with 16 columns of information, and the Google Play dataset has 10,841 apps with 13 columns.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print('Apple Store Dataset:')
print(os_header)
print('\n')
explore_data(os_content, 10, 12, True)

print('\n')

print('Google Play Dataset:')
print(gp_header)
print('\n')
explore_data(gp_content, 10, 12, True)

Apple Store Dataset:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['512939461', 'Subway Surfers', '156038144', 'USD', '0.0', '706110', '97', '4.5', '4.0', '1.72.1', '9+', 'Games', '38', '5', '1', '1']


['362949845', 'Fruit Ninja Classic', '104590336', 'USD', '1.99', '698516', '132', '4.5', '4.0', '2.3.9', '4+', 'Games', '38', '5', '13', '1']


Number of rows: 7197
Number of columns: 16


Google Play Dataset:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Text on Photo - Fonteee', 'ART_AND_DESIGN', '4.4', '13880', '28M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'October 27, 2017', '1.0.4', '4.1 and up']


['Name Art Photo Editor - Focus n Filters', 'ART_AND_DESIGN', '4.4', '8788', '1

---

## Data Cleanup
### Part 1 - Identifying Missing and Duplicate Data
Let's start our analysis with the Google Play dataset. There are a few discrepancies in the data: some entries have missing values, and some apps appear more than once. In this step we start to clean the data so that the analysis is more robust and reliable.

First, let's find and remove entries with incomplete information. Row 10472 is missing its `Category` value, so every later column is shifted one position to the left:

In [3]:
print(gp_header)
print('\n')
explore_data(gp_content, 10472, 10473, True)
print('\n')
gp_content[10472]

del gp_content[10472]
gp_content[10472]

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Number of rows: 10841
Number of columns: 13




['osmino Wi-Fi: free WiFi',
 'TOOLS',
 '4.2',
 '134203',
 '4.1M',
 '10,000,000+',
 'Free',
 '0',
 'Everyone',
 'Tools',
 'August 7, 2018',
 '6.06.14',
 '4.4 and up']
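
Rather than inspecting a known index, rows like this one can be found by comparing each row's length with the header's length. The following is a minimal sketch: `find_malformed_rows` is a hypothetical helper, and the sample rows are trimmed to three columns for brevity.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]

sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],  # category missing
]

bad = find_malformed_rows(sample_header, sample_rows)
print(bad)

# Delete from the highest index down so earlier deletions
# do not shift the positions of later ones.
for i, _ in reversed(bad):
    del sample_rows[i]

print(len(sample_rows))
```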

Next, let's check for duplicate entries. We initialize two lists, `gp_duplicate_apps` and `gp_unique_apps`, then loop through `gp_content` and examine the name of each app. If the name is already in `gp_unique_apps`, we append it to `gp_duplicate_apps`. Otherwise, we append it to `gp_unique_apps`.

In [4]:
gp_duplicate_apps = []
gp_unique_apps = []

for app in gp_content:
    name = app[0]
    if name in gp_unique_apps:
        gp_duplicate_apps.append(name)
    else:
        gp_unique_apps.append(name)
        
print('Number of duplicate apps:', len(gp_duplicate_apps))
print('\n')
print('Examples of duplicate apps:', gp_duplicate_apps[:20])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']


The dataset contains quite a few duplicate entries. Rather than removing duplicates at random, we want to keep the most recent entry for each app. Two columns could serve as a proxy for recency: `Reviews` and `Installs`, which show the total number of reviews and the total number of installs, respectively. We use the `Reviews` column, on the assumption that the entry with the highest number of reviews is the most recent one.

In [5]:
print('Expected length:', len(gp_content)-1181)

Expected length: 9659
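
The duplicate count and the expected length can also be derived with `collections.Counter`, which avoids the repeated `name in list` lookups of the loop above. This is a sketch on a made-up five-row sample; `count_duplicates` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter

def count_duplicates(rows, name_index):
    """Return (number of extra rows, names that occur more than once)."""
    counts = Counter(row[name_index] for row in rows)
    # Every occurrence beyond the first counts as one duplicate row.
    n_duplicates = sum(c - 1 for c in counts.values())
    repeated = [name for name, c in counts.items() if c > 1]
    return n_duplicates, repeated

sample = [['Box'], ['Slack'], ['Box'], ['Box'], ['Zenefits']]
n_dup, repeated = count_duplicates(sample, 0)

print('Number of duplicate rows:', n_dup)
print('Repeated names:', repeated)
print('Expected length:', len(sample) - n_dup)
```

Like the loop above, this counts an app that appears three times as two duplicates.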


Using the same method, let's quickly check for duplicate entries in the App Store dataset:

In [6]:
os_duplicate_apps = []
os_unique_apps = []

for app in os_content:
    name = app[1]
    if name in os_unique_apps:
        os_duplicate_apps.append(name)
    else:
        os_unique_apps.append(name)
        
print('Number of duplicate apps:', len(os_duplicate_apps))
print('\n')
print('Duplicate apps:', os_duplicate_apps)
print('Expected length:', len(os_content)-2)

Number of duplicate apps: 2


Duplicate apps: ['Mannequin Challenge', 'VR Roller Coaster']
Expected length: 7195


### Part 2 - Consolidation of Multiple Data

Here we loop through the Google Play content list and identify the maximum number of reviews for each application. We start by initializing a dictionary, `reviews_max`, and then loop through `gp_content`. If the name of the app is already a key in `reviews_max` **and** the number of reviews stored for it is less than the number of reviews for the current entry, we update the value. Otherwise, if the name is not yet in the dictionary, we add it as a new key with the current number of reviews as its value.

In [7]:
import random
reviews_max = {}
for app in gp_content:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews


# random.sample needs a sequence, so convert the dict view to a list first
print(random.sample(list(reviews_max.items()), 10))

[('Free Live Talk-Video Call', 158.0), ('Ascape VR: 360° Virtual Travel', 890.0), ('Phoenix - Facebook & Messenger', 9606.0), ('R Bank', 90.0), ('Disc Label Print', 262.0), ('AJ Jump: Animal Jam Kangaroos!', 2975.0), ('EZ-SEE', 71.0), ('BK Formula Calculator', 6.0), ('Organizer', 936.0), ('PitchBlack S - Samsung Substratum Theme “For Oreo”', 90.0)]


In [8]:
print(len(reviews_max))

9659


This dictionary serves as a guide for picking the correct entries out of the main dataset. The next step generates a new list that keeps only one entry per app: the one with the highest number of reviews, which we treat as the most recent.
Here we initialize two empty lists, `android_clean` and `already_added`. Looping through the applications in `gp_content`, we convert the number of reviews for each entry to a float and assign it to the variable `n_reviews`. We then compare `n_reviews` with the value stored for that app in the `reviews_max` dictionary from the previous step. If the values match and the name of the app is not in the `already_added` list, we append the entry to `android_clean` and add the name to `already_added`. The second check prevents a repeat entry when several duplicates share the same maximum number of reviews.

In [9]:
android_clean = []
already_added = []

for app in gp_content:
    name = app[0]
    n_reviews = float(app[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

print(android_clean[10])

['Name Art Photo Editor - Focus n Filters', 'ART_AND_DESIGN', '4.4', '8788', '12M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'July 31, 2018', '1.0.15', '4.0 and up']


In [10]:
print(len(android_clean))

9659
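
The two passes above (build `reviews_max`, then filter) can be folded into a single pass that stores the best row per app name directly. This is a sketch under the same assumption that the highest review count marks the entry to keep; `keep_most_reviewed` is a hypothetical helper and the sample rows are made up. Note that the order of the result may differ from `android_clean`, because each app keeps the position of its first occurrence.

```python
def keep_most_reviewed(rows, name_index, reviews_index):
    """Return one row per name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample = [
    ['Box', '100'],
    ['Slack', '50'],
    ['Box', '150'],
    ['Slack', '50'],
]
cleaned = keep_most_reviewed(sample, 0, 1)
print(cleaned)
```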


Next we repeat the process to ensure that the data for the App Store is clean and has the most current information on each app.

In [11]:
print(len(os_unique_apps))
os_reviews_max = {}

for app in os_content:
    name = app[1]
    os_reviews = float(app[5])
    
    if name in os_reviews_max and os_reviews_max[name] < os_reviews:
        os_reviews_max[name] = os_reviews
    
    elif name not in os_reviews_max:
        os_reviews_max[name] = os_reviews
# random.sample needs a sequence, so convert the dict view to a list first
print(random.sample(list(os_reviews_max.items()), 10))
print(len(os_reviews_max))


7195
[('Troll Face Quest Video Games', 2118.0), ('Flowstate', 70.0), ('Water Bottle Flip Challenge', 868.0), ('【殺人現場へようこそ】推理サスペンス劇場/謎解き大人の脳トレゲーム', 0.0), ('A.BIG.T -- A Smart VPN', 17.0), ('Dude Perfect', 9763.0), ('Beautiful Japanese Handwriting', 0.0), ('Tayasui Sketches Pro', 1761.0), ('CLUE Bingo', 12123.0), ('Pirate Power', 2555.0)]
7195


In [12]:
os_clean = []
os_already_added = []
for app in os_content:
    name = app[1]
    os_reviews = float(app[5])
    if os_reviews == os_reviews_max[name] and name not in os_already_added:
        os_clean.append(app)
        os_already_added.append(name)

print(len(os_already_added))

7195


### Part 3 - Extraction of English-only Apps

Next we will remove non-English apps from the datasets by checking app names against the ASCII character set. Here we create a function that takes a string as input and checks whether each character falls within the ASCII range. The characters commonly used in English text have code points from 0 to 127, inclusive.

In [13]:
def is_english(input_str):
    for character in input_str:
        if ord(character) > 127:
            return False      
    return True

Next, let's test out the function on various types of strings:

In [14]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
False


There is an issue with the `is_english` function defined above: it rejects any name that contains a special character or emoji, such as `™` or `😜`, because these fall outside the ASCII range. To allow for this, we change the function so that it returns `False` only if a name contains more than three non-ASCII characters.

In [15]:
def is_english(input_str):
    non_eng = 0
    for character in input_str:
        if ord(character) > 127:
            non_eng += 1
            if non_eng > 3:
                return False      
    return True

In [16]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True
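
The cutoff of three non-ASCII characters is a heuristic, so it can be useful to expose it as a parameter and see how the result changes. The sketch below is an equivalent formulation, not the notebook's own function: `is_mostly_ascii` and `max_non_ascii` are hypothetical names.

```python
def is_mostly_ascii(text, max_non_ascii=3):
    """True if text has at most max_non_ascii characters outside ASCII."""
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    return non_ascii <= max_non_ascii

names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]

for name in names:
    # Compare the lenient default with the strict zero-tolerance setting.
    print(name, is_mostly_ascii(name), is_mostly_ascii(name, max_non_ascii=0))
```

With `max_non_ascii=0` the function behaves like the first version of `is_english`. The heuristic remains imperfect either way: an English app name with four or more emoji would still be filtered out.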


In [17]:
android_english = []
for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
print('The number of english apps in Google Play is:', len(android_english))

os_english = []
for app in os_clean:
    name = app[1]
    if is_english(name):
        os_english.append(app)

print('The number of english apps in App Store is:', len(os_english))

The number of english apps in Google Play is: 9614
The number of english apps in App Store is: 6181

Finally, since this project focuses on apps that are available free of cost, we keep only the apps whose price is zero.


In [18]:
android = []
for app in android_english:
    price = float(app[7].strip('$'))
    if price == 0.0:
        android.append(app)
        
# Preview a few of the free Google Play apps
print(android[21:25])
print('\n')
print('The number of free apps in Google Play is:', len(android))

os = []
for app in os_english:
    price = app[4]
    if price == '0.0':
        os.append(app)
        
print('The number of free apps in App Store is:', len(os))

[['Superheroes Wallpapers | 4K Backgrounds', 'ART_AND_DESIGN', '4.7', '7699', '4.2M', '500,000+', 'Free', '0', 'Everyone 10+', 'Art & Design', 'July 12, 2018', '2.2.6.2', '4.0.3 and up'], ['HD Mickey Minnie Wallpapers', 'ART_AND_DESIGN', '4.7', '118', '23M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'July 7, 2018', '1.1.3', '4.1 and up'], ['Harley Quinn wallpapers HD', 'ART_AND_DESIGN', '4.8', '192', '6.0M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'April 25, 2018', '1.5', '3.0 and up'], ['Colorfit - Drawing & Coloring', 'ART_AND_DESIGN', '4.7', '20260', '25M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'October 11, 2017', '1.0.8', '4.0.3 and up']]


The number of free apps in Google Play is: 8864
The number of free apps in App Store is: 3220
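
The two stores encode prices differently: the Google Play code strips a `$` sign before converting, while the App Store code compares against the string `'0.0'`. A single numeric check can cover both. This is a sketch with a hypothetical helper name, `is_free`, and it assumes every price string parses as a number once the `$` is removed.

```python
def is_free(price_text):
    """True when a price string such as '0', '0.0' or '$0.00' means free."""
    return float(price_text.strip().lstrip('$')) == 0.0

for price in ['0', '0.0', '$4.99', '1.99']:
    print(repr(price), is_free(price))
```

Comparing numbers rather than strings means an App Store price written as `'0'` or `'0.00'` would still be recognized as free.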


---

## Data Analysis

The cleaned datasets are now ready for analysis. We will now explore the data using a previously defined function, `explore_data`, to determine which types of apps are most likely to be installed and used. Let's look to see which columns could prove to be most useful for analysis.

In [19]:
print('Google Play Dataset:')
print(gp_header)
print('\n')
explore_data(android, 0, 2, True)
print('\n')
print('Apple Store Dataset:')
print(os_header)
print('\n')
explore_data(os, 0, 2, True)

Google Play Dataset:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 8864
Number of columns: 13


Apple Store Dataset:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']




For the Google Play dataset, we can use the `Category` and `Genres` columns, and for the Apple Store dataset, the `prime_genre` column would prove most useful.

Next we will create a new function, `freq_table`, that returns a frequency table for a column, with each frequency expressed as a percentage of all rows in the dataset.

In [20]:
def freq_table(dataset, index):
    freq = {}
    for app in dataset:
        genre = app[index]
        if genre in freq:
            freq[genre] += 1
        else:
            freq[genre] = 1
    
    freq_percent = {}
    for genre in freq:
        freq_percent[genre] = ((freq[genre]/len(dataset))*100)
        
    return freq_percent

In [21]:
### provided by coursework
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
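
For comparison, the same percentages and ordering can be produced with `collections.Counter`, whose `most_common()` method returns entries sorted from most to least frequent. This is only a sketch on a made-up four-row sample; `freq_table_sorted` is a hypothetical name.

```python
from collections import Counter

def freq_table_sorted(dataset, index):
    """Return (value, percentage) pairs, most frequent first."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return [(value, count / total * 100)
            for value, count in counts.most_common()]

sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Games'], ['d', 'Music']]
table = freq_table_sorted(sample, 1)

for value, pct in table:
    print(value, ':', pct)
```

One difference from `display_table`: when two values have the same frequency, `most_common()` keeps them in the order first encountered, while sorting `(percentage, key)` tuples in reverse breaks ties by reverse alphabetical order of the key.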

Let us examine the frequency tables of the Google Play and the App Store data:

App Store:

Here we can observe that a single genre holds a clear majority: 58% of the free English apps fall under the Games genre. The next most common genre is Entertainment at around 7.9%, followed by Photo & Video at almost 5%. Games clearly dominate the free apps that make it to the App Store. Most other genres fall into low single-digit percentages, with only a couple of percentage points separating them.

Although most apps are categorized as games, we will have to investigate further to see if the most successful apps still fall under this genre. Some future analysis would include examining the number of ratings and average user ratings for each app.

In [22]:
display_table(os, -5)

Games : 58.13664596273293
Entertainment : 7.888198757763975
Photo & Video : 4.968944099378882
Education : 3.6645962732919255
Social Networking : 3.291925465838509
Shopping : 2.608695652173913
Utilities : 2.515527950310559
Sports : 2.142857142857143
Music : 2.049689440993789
Health & Fitness : 2.018633540372671
Productivity : 1.7391304347826086
Lifestyle : 1.5838509316770186
News : 1.3354037267080745
Travel : 1.2422360248447204
Finance : 1.1180124223602486
Weather : 0.8695652173913043
Food & Drink : 0.8074534161490683
Reference : 0.5590062111801243
Business : 0.5279503105590062
Book : 0.43478260869565216
Navigation : 0.18633540372670807
Medical : 0.18633540372670807
Catalogs : 0.12422360248447205


Google Play (Genres):

Immediately we can observe two main things about the Genres data table. First, the data is much more evenly spread out, and second there are many more genres. Here we see that the most common genre is Tools, containing almost 8.5% of all apps in the dataset. Similar to the App Store data, the second most common genre is Entertainment, at around 6.1%; followed by Education at around 5.3%.

One interesting thing to note about this column is that some genres seem to have sub-genres (e.g. Education and Entertainment). This requires some extra cleanup before the figures can be treated as robust.
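
One possible cleanup is to keep only the main genre of each entry. The sketch below assumes that a semicolon separates the main genre from its sub-genre, based on the value `'Art & Design;Creativity'` seen in one of the rows printed earlier; `primary_genre` is a hypothetical helper.

```python
from collections import Counter

def primary_genre(genres_text):
    """Return the part of a Genres value before the first ';'."""
    return genres_text.split(';')[0].strip()

# Genres values taken from rows shown earlier in this notebook.
sample_genres = ['Art & Design', 'Art & Design;Creativity', 'Tools']

counts = Counter(primary_genre(g) for g in sample_genres)
print(counts)
```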

In [23]:
display_table(android, -4)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

Google Play (Category):

While the `Genres` column in the previous analysis is more granular, the `Category` column is cleaner and more concise. It can be compared more directly with the App Store data.

There are some sharp contrasts here with the App Store data. The top category is FAMILY, at almost 19% of all apps. GAME is still near the top, in second place at about 9.7%, but its share is far smaller than in the App Store. TOOLS is the third highest category at about 8.5%.

The types of apps in the Google Play store are still more evenly distributed than in the App Store.

In [24]:
display_table(android, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

Overall, games are a dominant app genre in both marketplaces. However, this analysis does not show that game apps are the most successful of all the genres. For that, it is also important to take user feedback into account, in the form of average ratings, total number of ratings, and total number of users.
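
As a starting point for that follow-up, the average number of ratings per genre could be computed along these lines. This is a sketch only: `average_by_group` is a hypothetical helper, and the three sample rows are trimmed versions of App Store rows printed earlier (name, `rating_count_tot`, `prime_genre`), so the numbers say nothing about the full dataset.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} for the given rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Trimmed rows: track_name, rating_count_tot, prime_genre
sample = [
    ['Subway Surfers', '706110', 'Games'],
    ['Fruit Ninja Classic', '698516', 'Games'],
    ['Facebook', '2974676', 'Social Networking'],
]

averages = average_by_group(sample, 2, 1)
for genre, avg in averages.items():
    print(genre, ':', avg)
```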