# Predicting Profitable Apps Based on User Reviews

In this project our objective will be to determine what makes app attractive to potential users. We will be working 
as data analysts for a app developing company that specializes in free apps sold on the App Store and Google Play marketplaces.

Given that our company only develops free apps, the overwhelming majority of revenue will be coming from in app advertisements. This implies that the most relevant metric when analyzing the apps data will be the number of users for each app. By analyzing the data with speficic interest in the aforementioned metric of number of users, we will offer useful insights to our company.

# Opening the Datasets
The iOS App Store and Android Google Play Store, have about 2 million and 2.1 million apps available for download, respectively. Analyzing the entirety of the two datasets would be expensive and time consuming and is beyond the scope of this project. However, we are still able to analyze two samples of the two data sets.
* This is a sample of the original Google Play Store data set. It consists of data for about 10,000 apps and is available for download [here](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)
* This is a sample of the original iOS App Store data set. It consists of data for about 7,000 apps and is available for download [here](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)

We will begin by opening each of the samples in Python

In [1]:
#The App Store Marketplace
from csv import reader
f_object = open('/Users/Tornyeli/Datasets/AppleStore.csv')
read_file = reader(f_object)
apple = list(read_file)
apple_header = apple[0]
apple = apple[1:]

#The Google Play Marketplace
from csv import reader
f_object = open('/Users/Tornyeli/Datasets/googleplaystore.csv')
read_file = reader(f_object)
android = list(read_file)
android_header = android[0]
android = android[1:]

In the interest of efficiency we'll create a function that helps us to explore the datasets, present the data in a way that is more comprehensible, and count the rows and columns of any dataset. 

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    """A function used to present segments of, or all of the data from a dataset"""
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
print(android_header)
print('\n')
explore_data(android, 1, 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


So our function has helped us determine an exact amount of apps for this data set, 10,841. Relevant columns include: 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.

Below we will explore the iOS dataset.

In [4]:
print(apple_header)
print('\n')
explore_data(apple, 1, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16


There are 7197 iOS apps in this data set. Relevant columns include: 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'. Some of the column names are not very descriptive, but futher explanation is provided in the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)

# Deleting Duplicate/Incorrect Data
In the [discussion session](https://www.kaggle.com/lava18/google-play-store-apps/discussion) of the Google Play Store dataset there is a specific [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) covering an error in row 10472. To get a better picture of the error we'll print the header and a correct row and compare them with the error row.

In [5]:
print(android[10472]) # error row

print(android[0],'\n') # correct row

print(len(android_header)) #header

for row in android[1:]:
    if len(row) != len(android_header):
        print(row)
        print("\n")
        print("Index postion is:", android.index(row))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

13
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Index postion is: 10472


The row 10472 corresponds to the app Life Made WI-Fi Touchscreen Photo Frame, and has a rating of 19. This must be an error because the maximum rating for a Google Play app is 5. The discussion session attributes the error to a missing value in the 'Category' column, since we do not have the correct value to replace the missing one, we'll just delete the row.

In [6]:
print(len(android))
del android[10472]
print(len(android))

10841
10840


In both datasets, there are apps that listed multiple times, take, for example, the Instagram app below.

In [7]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [8]:
duplicate_apps_ios = []
unique_apps_ios = []
for app in apple:
    name = app[1]
    if name in unique_apps_ios:
        duplicate_apps_ios.append(name)
    else:
        unique_apps_ios.append(name)

print('Number of duplicate apps:', len(duplicate_apps_ios))
print('\n')
print('Examples of duplicate apps:', duplicate_apps_ios[:15])        

Number of duplicate apps: 2


Examples of duplicate apps: ['Mannequin Challenge', 'VR Roller Coaster']


In [9]:
duplicate_apps_droid = []
unique_apps_droid = []
for app in android:
    name = app[0]
    if name in unique_apps_droid:
        duplicate_apps_droid.append(name)
    else:
        unique_apps_droid.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps_droid))
print('\n')
print('Examples of duplicate apps:', duplicate_apps_droid[:15])    

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


Counting the same app multiple times will be redundant and will likely skew our data. This leads us to our next question: On what basis will we remove duplicate entries? If we take a look at the multiple entries for the Instagram app displayed previously, we can see that the one column with values that differ between each of the entries is the fourth column which stores the quantity of reviews. The differences in this value are due to the data being recorded at different times throughout the day. So our basis for removing rows will be the number of reviews, with preference for keeping the entries with more reviews (data will be more reliable).

To do this we will create a dictionary that has key-value pairs of app name and review count respectively. Once we've finished creating the dictionary we'll use it to make a new dataset where each app only appears once with the highest number of reviews taken.

In [10]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In a previous code cell, we discovered that there are 1,181 instances of duplicate apps across the two datasets, so the length of our dictionary should reflect that and have 1,181 less items (key-value pairs) than the length of our dataset.

In [11]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


We'll use the reviews_max dictionary to sort our duplicate free data into a new list titled android_clean. The purpose of the if block is to add each app to the android_clean list if the number of reviews matches up with the quanitity listed in the reviews_max dictionary and the name is not in the already_added list, if either of these conditions are not met then the app will be placed in the already_added list.

In [12]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

We'll do a quick check using the function we defined earlier, if everything is working properly we should get 9,659 for the number of rows.

In [13]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


# Removing Non-English Apps
Some of the apps in the google play store dataset are made for non-English speakers. Rather then attempt to translate each of the non-English apps we'll just delete them. To do this we'll delete all apps that have characters that are not found in the English alphabet. We can onsider numbers (0-9), punctuation marks, and common symbols (+, =, -, etc.) to be part of the English language in these cases. Each of these alphanumeric characters is  encoded using the ASCII standard, so each character has its own unique ASCII number that represents it. The English and alphanumeric characters only go up to 127 in the ASCII system so we can use this to our advantage by creating a function that checks whether a string has any non-English/alphanumeric ASCII characters.

The built in function ord() will check the ASCII number of a character for us so that we don't have to check the ASCII table.

In [14]:
def language_checker(string):
    
    for character in string:
        if ord(character) > 127:
            return False
        
    return True

print(language_checker('Instagram'))
print(language_checker('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(language_checker('Docs To Go™ Free Office Suite'))
print(language_checker('Instachat 😜'))

True
False
False
False


The function's performance is still not satisfactory because we are getting false positives for the third and fourth test cases. Emojis and symbols like ™ are included in the alphanumeric/English alphabet range so they trigger a false result. To improve the function we can adjust it such that the number of characters necessary to trigger a false result is increased. Although it isn't a perfect solution, it will greatly decrease the number of false positives we attain.

In [15]:
def eng_checker(string):
    
    non_eng_chars = 0
    
    for character in string:
        if ord(character) > 127:
            non_eng_chars += 1
    
    if non_eng_chars > 3:
        return False
    
    else:
        return True

print(eng_checker('Docs To Go™ Free Office Suite'))
print(eng_checker('Instachat 😜'))

True
True


Now that we've improved our function's performance we can use it to create two lists of the english apps for their respective marketplaces.

In [16]:
eng_apps_droid = []
for app in android:
    name = app[0]
    eng_checker(name)
    if eng_checker(name):
        eng_apps_droid.append(app)     

eng_apps_ios = []        
for row in apple:
    name_ios = row[1]
    eng_checker(name_ios)
    if language_checker(name_ios):
        eng_apps_ios.append(row)
        
explore_data(eng_apps_droid, 4, 8, True)
print('\n')
explore_data(eng_apps_ios, 4, 8, True)        

['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'April 26, 2018', '1.1', '4.0.3 and up']


['Infinite Painter', 'ART_AND_DESIGN', '4.1', '36815', '29M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'June 14, 2018', '6.1.61.1', '4.2 and up']


Number of rows: 10795
Number of columns: 13


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061624', '1814', '4.5', '4.0', '6.26', '12+', 'Social Networ

# Removing Paid Apps

In [27]:
free_droid_apps = []
free_ios_apps = []

for app in eng_apps_droid:
    price = app[6]
    name = app[0]
    if price == ('Free') or ('0.00'):
        free_droid_apps.append(name)

for app in eng_apps_ios:
    price = app[4]
    name = app[1]
    if price == ('Free') or ('0.0'):
        free_ios_apps.append(name)           

In [28]:
print(free_ios_apps)

