## Profitable App Profiles for the App Store and Google Play

In this project we look at what types of apps are likely to attract more users.  The apps are free to download and install, and the main source of revenue is in-app ads.  For any given app, the revenue is mostly influenced by the number of users of that app.<br>We will collect and analyze data about mobile apps available on Google Play and the App Store.<br>
* The Google Play data set containing approximately 10,000 Android apps can be found [here](https://www.kaggle.com/lava18/google-play-store-apps)
* The App Store data set containing approximately 7,000 iOS apps can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

These datasets are from 2018 and 2017, respectively.

In [110]:
import pandas as pd
import csv
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 

In [111]:
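def create_file_list(file):
    # Read a CSV file into a list of rows; the with-block closes the file.
    with open(file, encoding='utf8', newline='') as file_open:
        file_read=csv.reader(file_open)
        return list(file_read)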
ios_all=create_file_list('AppleStore.csv')
ios_header=ios_all[0]
ios=ios_all[1:]
android_all=create_file_list('googleplaystore.csv')
android_header=android_all[0]
android=android_all[1:]

In [112]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice=dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:',len(dataset[0]))

In [113]:
#iOS
print('Headers-iOS','\n','\n',ios_header,'\n')
print('First few rows of data-iOS','\n')
explore_data(ios, 0,2,rows_and_columns=True)

Headers-iOS 
 
 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

First few rows of data-iOS 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


Not all of the column headers are self-explanatory; see the data documentation [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) for descriptions.  At first glance, some of the columns that may be useful are `id`, `track_name`, `price`, `rating_count_tot`, `user_rating`, and `prime_genre`.

In [114]:
#Android
print('Headers-Android','\n','\n',android_header,'\n')
print('First few rows of data-Android','\n')
explore_data(android, 0,2,rows_and_columns=True)

Headers-Android 
 
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

First few rows of data-Android 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


Not all of the column headers are self-explanatory; see the data documentation [here](https://www.kaggle.com/lava18/google-play-store-apps) for descriptions.  At first glance, some of the columns that may be useful are 
`App`, `Category`, `Rating`, `Reviews`, `Installs`, `Price`, and `Genres`.
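
Because each row is a plain list, these columns have to be addressed by position. The sketch below looks the positions up from the header instead of hard-coding them. `column_indices` is a hypothetical helper that is not part of the original notebook, and the header is copied from the output above.

```python
# Header row copied from the Android output above.
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

def column_indices(header, names):
    """Map each requested column name to its position in the header row."""
    return {name: header.index(name) for name in names}

# Hypothetical helper applied to the columns listed as potentially useful.
useful = column_indices(
    android_header,
    ['App', 'Category', 'Rating', 'Reviews', 'Installs', 'Price', 'Genres'],
)
print(useful)
```

With this mapping, `useful['Reviews']` gives the index later used to compare duplicate rows.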

### Data Cleaning
* Detect inaccurate data, correct or remove
* Detect duplicate data, remove it
* Remove non-English apps (we are only interested in an English-speaking audience for this project)
* Remove apps that aren't free (we are only concerned with apps free to download and install for this project)

The Google Play [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error in row 10472.  Let us print this row and check it.  It appears to be missing the category, so the remaining values have shifted one column to the left.  We could research the missing category or delete the row.  Since it is just one row, we will delete it.

In [115]:
print(android_header)
print(android[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [116]:
del android[10472]
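
The shifted row has 12 values against a 13-column header, so a length comparison would catch this kind of error without relying on the discussion section. The following is a minimal sketch under that assumption; `find_malformed_rows` is a hypothetical helper, and the two sample rows are copied from the outputs above.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]

header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
          'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
          'Android Ver']

rows = [
    # A well-formed row (13 values).
    ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1',
     '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design',
     'January 7, 2018', '1.0.0', '4.0.3 and up'],
    # The row reported in the discussion section (12 values, category missing).
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+',
     'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'],
]

malformed = find_malformed_rows(header, rows)
for index, row in malformed:
    print(index, len(row), row[0])
```

A length check only finds rows with a missing or extra value; a row with the right number of columns but a wrong value in one of them would still pass.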

### Removing Duplicates
The discussion section also indicates multiple duplicate entries.  We will define a function to create a list of the names of the duplicate apps and a list of the names of the unique apps.  We don't want to delete duplicates at random; first we must explore the duplicates and see how they differ, so we can decide which one to keep.

In [117]:
#Function to create list of duplicate and unique apps
def duplicate_apps(app_list):
    # Local names differ from the function name to avoid shadowing it.
    duplicates=[]
    uniques=[]
    for app in app_list:
        name=app[0]  # first column of the row
        if name in uniques:
            duplicates.append(name)
        else:
            uniques.append(name)
    print('Number of duplicate apps:', len(duplicates),'\n')
    print('Examples of duplicate apps:', duplicates[:20])
    return duplicates, uniques

In [118]:
#Android
android_duplicate_apps, android_unique_apps = duplicate_apps(android)

Number of duplicate apps: 1181 

Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']


Let us look at the Slack entries in the Android app list and determine a criterion for deleting duplicates.

In [119]:
for app in android:
    name=app[0]
    if name == 'Slack':
        print(app)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


The fourth column, `Reviews` (index 3), is the only difference between these duplicates, which suggests the rows were collected at different times.  We will keep the row with the highest number of reviews.

In [121]:
print('Expected length for Android apps after removing duplicates:', len(android)-len(android_duplicate_apps))

Expected length for Android apps after removing duplicates: 9659


Now we will check the iOS apps for duplicates.

In [122]:
#iOS
ios_duplicate_apps, ios_unique_apps = duplicate_apps(ios)

Number of duplicate apps: 0 

Examples of duplicate apps: []


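The function compares the first column of each row, which for the iOS data is `id`, so this shows that there are no duplicate ids in the iOS app list.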

In [123]:
def max_reviews(app_list):
    # Map each app name to the highest review count seen for it.
    reviews_max={}
    for app in app_list:
        name=app[0]
        n_reviews=float(app[3])
        if name in reviews_max and reviews_max[name]<n_reviews:
            reviews_max[name]=n_reviews
        if name not in reviews_max:
            reviews_max[name]=n_reviews
    return reviews_max
def clean_app_list(app_list):
    reviews_max=max_reviews(app_list)
    clean=[]
    already_added=[]
    for app in app_list:
        name=app[0]
        n_reviews=float(app[3])
        # Keep only the first row that has the maximum review count;
        # already_added guards against ties such as the two 51507 Slack rows.
        if n_reviews == reviews_max[name] and name not in already_added:
            clean.append(app)
            already_added.append(name)
    return clean, already_added
        

In [124]:
android_clean, android_already_added=clean_app_list(android)

In [125]:
len(android_clean)

9659
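
The same rule (keep the row with the most reviews per app name) can also be sketched in a single pass with a dictionary keyed by name. This is an alternative illustration, not the notebook's method: `dedupe_keep_max_reviews` is a hypothetical name, the Slack rows are truncated copies of the output above, and the `Example App` row is made up for the demonstration.

```python
def dedupe_keep_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        # Replace the stored row only when this one has strictly more reviews,
        # so the first of several tied rows is the one that survives.
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

sample = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
    ['Example App', 'TOOLS', '4.0', '12'],   # made-up row for illustration
]

deduped = dedupe_keep_max_reviews(sample)
for row in deduped:
    print(row)
```

One difference from `clean_app_list`: the result is ordered by each name's first appearance rather than by the position of the retained row, which matters only if later steps depend on row order.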

### Removing non-English apps
According to the [ASCII](https://en.wikipedia.org/wiki/ASCII) (American Standard Code for Information Interchange) system, the numbers corresponding to English characters range from 0 to 127.  We will create a function to remove the apps whose names contain non-English characters.<br>To account for English apps with emojis and characters like `™`, we will remove only apps whose names contain more than three characters outside the ASCII range.  This allows us to keep English apps with up to three special characters.

In [131]:
def english(string):
    count=0
    for char in string:
        if ord(char) > 127:
            count+=1
    if count > 3:
        return False
    else:
        return True
    

In [132]:
print(english('Instagram'))
print(english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english('Docs To Go™ Free Office Suite'))
print(english('Instachat 😜'))

True
False
True
True
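
To make the threshold concrete, the short sketch below prints the code point that `ord` returns for a few characters taken from the test strings above. The `code_points` dictionary is introduced only for this illustration.

```python
# Characters taken from the test strings above.
chars = ['a', 'A', '™', '😜', '爱']

# Map each character to its Unicode code point.
code_points = {char: ord(char) for char in chars}

for char, value in code_points.items():
    # Values above 127 fall outside the ASCII range.
    print(repr(char), value, value > 127)
```

Letters stay below 128, while `™`, the emoji, and the Chinese character are all well above it. A name such as `'Instachat 😜'` therefore has one out-of-range character and is kept, whereas a name written mostly in Chinese exceeds the limit of three.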


In [135]:
android_english=[]
ios_english=[]
for app in android_clean:
    name=app[0]
    if english(name):
        android_english.append(app)
        
for app in ios:
    name=app[1]
    if english(name):
        ios_english.append(app)
        
print('Number of android english apps:',len(android_english))
print('Number of non english android apps deleted:',len(android_clean)-len(android_english))

print('Number of ios english apps:',len(ios_english))
print('Number of non english ios apps deleted:',len(ios)-len(ios_english))
        
        

Number of android english apps: 9614
Number of non english android apps deleted: 45
Number of ios english apps: 6183
Number of non english ios apps deleted: 1014


### Isolate the apps that are free to download