## Project: Profitable App Profiles for the App Store and Google Play Markets

In this project, our goal is to analyze data to help developers understand what types of apps are likely to attract more users on Google Play and the App Store.

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Collecting data for over 4 million apps requires a significant amount of time and money, so we'll analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we should first see whether any relevant existing data is available at no cost. Luckily, there are two data sets that seem suitable for our goals:

- [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from [this link.](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)

- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from [this link.](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)


In [8]:
from csv import reader

# Google Play data set
# encoding='utf8' avoids decode errors where the platform default is not UTF-8
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]   # first row holds the column names
android = android[1:]         # remaining rows hold the app data
len_android_columns = 'len_android_columns = ' + str(len(android_header))
len_android_rows = 'len_android_rows = ' + str(len(android))

# App Store data set
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]
len_ios_columns = 'len_ios_columns = ' + str(len(ios_header))
len_ios_rows = 'len_ios_rows = ' + str(len(ios))

To make it easier to explore the two data sets, we'll first write a function named explore_data() that we can use repeatedly to explore rows in a more readable way. We'll also add an option for our function to show the number of rows and columns for any data set.

In [9]:
#explore data function - provided#

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row,'\n')
        #adds a new empty line after each row
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns', len(dataset[0]))

In [13]:
#google exploring#
print('~ GOOGLE DATA REVIEW ~')
print('\n')
print(android_header)
print('\n')
explore_data(android, 0, 3, True)
print('\n')
print('\n')

#apple exploring#
print('~ IOS DATA REVIEW ~')
print('\n')
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

~ GOOGLE DATA REVIEW ~


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up'] 

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] 

Number of rows: 10841
Number of columns 13




~ IOS DATA REVIEW ~


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipad

In [20]:
print(android[10472])
print('\n')
print(android_header)
print('\n')
print(android[0])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


## Deleting Wrong Data

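The Google Play data set has a [dedicated discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section, and [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/164101) outlines an error for row 10472. The output above prints this row alongside the header and a correct row. Row 10472 has only 12 values instead of 13 because the Category value is missing, so every later value is shifted one column to the left (for example, the Rating column holds 19, above the maximum rating of 5). We delete this row below.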

In [23]:
print(len(android))
del android[10472]  # run this only once; running it again would delete a valid row
print(len(android))

10841
10840
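
Rows like this one can also be located without relying on the discussion thread. The following is a minimal sketch, assuming a malformed row shows up as a length mismatch against the header. The helper name `find_malformed_rows` is hypothetical and the sample data is made up for illustration.

```python
def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    bad_indexes = []
    for index, row in enumerate(dataset):
        if len(row) != len(header):
            bad_indexes.append(index)
    return bad_indexes

# Made-up sample: the second row is missing its Category value
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'SOCIAL', '4.5'],
]
malformed = find_malformed_rows(sample_rows, sample_header)
print(malformed)
```

Applied to `android` and `android_header` before the deletion above, a check like this should flag index 10472, since that row has 12 values against 13 header columns.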


## Removing Duplicate Entries

**Part One**

Upon further review of the Google Play data and [discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section, we found that some apps have more than one entry. For instance, the application Instagram has several duplicate entries:

In [27]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


## Identifying the Number of Duplicate Apps in the Google Play Dataset

We will build two lists, one for duplicate apps and one for unique apps in the Google Play dataset. To do this, we loop through the android dataset and, on each iteration:

- Save the app name to a variable named *name*.
- If *name* is already in the unique apps list, append it to the duplicate apps list.
- Otherwise, append *name* to the unique apps list.

Lastly, we will print the length of each list (*duplicate_apps* and *unique_apps*) to see how many rows need to be removed from the dataset.

In [33]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')    
print('Examples of duplicate apps:', duplicate_apps[:15])
print('\n')
print('Number of unique apps:', len(unique_apps))

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


Number of unique apps: 9659
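
The same counts can be cross-checked with `collections.Counter` from the standard library. This is an alternative sketch rather than the approach used above, and the sample names are made up for illustration.

```python
from collections import Counter

# Made-up sample of app names; with the real data this could be
# Counter(app[0] for app in android)
sample_names = ['Box', 'Slack', 'Box', 'Zoom', 'Box', 'Slack']
name_counts = Counter(sample_names)

# Every distinct name is one unique app
n_unique = len(name_counts)
# Every occurrence beyond the first counts as a duplicate entry
n_duplicates = sum(count - 1 for count in name_counts.values())

print('Number of unique apps:', n_unique)
print('Number of duplicate apps:', n_duplicates)
```

A `Counter` looks names up in a dictionary, so it avoids scanning a growing list on every iteration, which may matter for larger data sets.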


## Removing Duplicate Apps in the Google Play Dataset

There are 1,181 duplicate entries in the android dataset. After removing them, we expect to be left with 9,659 rows, one per unique app.

To remove the duplicates, we will do the following:
- Create a dictionary where each key is a unique app name and the corresponding value is the highest number of reviews recorded for that app. We use the number of reviews because it usually differs between duplicate rows, and the row with the *highest* count should hold the most recent data.
- Use the dictionary to build a new dataset with only one entry per app, the one with the *highest* number of reviews. Because some duplicate rows share the same review count (as two of the Instagram rows above do), we also track which names have already been added so that only one of the tied rows is kept.

In [58]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print('Length of Reviews_Max Dictionary:',len(reviews_max))

android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

print('Length of Android_Cleaned List:',len(android_clean))
print('\n')
explore_data(android_clean, 0, 3, True)

Length of Reviews_Max Dictionary: 9659
Length of Android_Cleaned List: 9659


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] 

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] 

Number of rows: 9659
Number of columns 13
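
The two-pass approach above could also be written as a single pass that stores the best row for each name directly. The following is a minimal sketch, assuming the app name is at index 0 and the review count at index 3, as in the Google Play data. The helper name `keep_most_reviewed` is hypothetical and the sample rows are shortened for illustration.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best_rows = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict comparison: on a tie, the row seen first is kept
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())

# Shortened, illustrative rows (name, category, rating, reviews)
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
]
cleaned = keep_most_reviewed(sample)
for row in cleaned:
    print(row)
```

This should keep the same rows as the `already_added` check, although the order of the rows in the result can differ because each name keeps the position of its first appearance.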


## Removing Non-English Apps from App Store Dataset

Previously, we managed to remove duplicate app entries in the Google Play dataset. The App Store dataset does not contain duplicates, but it does contain non-English apps. We will remove the non-English apps from the App Store dataset as follows:

**NOTE: Each character in a string has a corresponding number. According to the [ASCII](https://en.wikipedia.org/wiki/ASCII) (American Standard Code for Information Interchange) system, the characters commonly used in English text fall in the range 0 to 127. Therefore, if a character's number is less than or equal to 127, we treat it as an English character.**

- Write a function that takes in a string and returns *False* if any character in the string falls outside the set of English characters (number > 127).
  - In the function, iterate over the input string and check whether the number associated with each character is greater than 127. If so, return *False*.
  - If the loop finishes without the return statement being executed, no character had a number greater than 127, so the app name is treated as English and the function returns *True*.
- Test the following names to confirm whether they are detected as English or non-English:
  - 'Instagram'
  - '爱奇艺PPS -《欢乐颂2》电视剧热播'
  - 'Docs To Go™ Free Office Suite'
  - 'Instachat 😜'


In [64]:
def is_english(string):
    for character in string:
        if ord(character) > 127:
            return False
    
    return True

#Testing if function works
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))


True
False
False
False


The results show that the function returns *False* for English app names that contain special characters. Emojis and characters like ™ fall outside the ASCII range and have corresponding numbers greater than 127.

The numbers corresponding to ™ and 😜 are shown below.

In [68]:
print(ord('™'))
print(ord('😜'))

8482
128540


To minimize losing useful data, we will only remove an app if its name has more than three characters with corresponding numbers greater than 127.
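
One way this rule could be implemented is sketched below. The function name `is_english_relaxed` and the `max_non_ascii` parameter are illustrative assumptions and are not part of the original function.

```python
def is_english_relaxed(string, max_non_ascii=3):
    """Return False only if more than max_non_ascii characters fall outside ASCII."""
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
            # Stop as soon as the allowed number is exceeded
            if non_ascii > max_non_ascii:
                return False
    return True

# Same test names as before
print(is_english_relaxed('Instagram'))
print(is_english_relaxed('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english_relaxed('Docs To Go™ Free Office Suite'))
print(is_english_relaxed('Instachat 😜'))
```

With this threshold, names containing a single ™ or emoji are kept, while the Chinese title is still rejected. The filter remains imperfect: a non-English name with three or fewer non-ASCII characters would still pass.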
