## Profitable App Profiles for the Apple App Store and Google Play Store

- This project analyzes apps from both the Google Play Store and the Apple App Store to find out which kinds of apps are worth investing in.
- My goal in this project is to apply everything I have learned so far to this analysis.

## Opening and Exploring the Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store and 2.1 million Android apps on Google Play.
Collecting data for over 4 million apps requires a significant amount of time and money, so we'll analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we should first check whether any relevant existing data is available at no cost. Luckily, these two data sets seem suitable for our goals:

- [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data of about 10,000 Android apps from Google Play; the data set can be downloaded directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)
- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data of about 7,000 iOS apps from the App Store; the data set can be downloaded directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)

In [1]:
from csv import reader

# Opening AppleStore Apps
open_file = open('AppleStore.csv', encoding='utf-8')
read_file = reader(open_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]

# Opening Google PlayStore Apps
open_file = open('googleplaystore.csv',encoding='utf-8')
read_file = reader(open_file)
android = list(read_file)
android_header = android[0]
android = android[1:]
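
The two loading steps above follow the same pattern, so they could be wrapped in a small helper. The sketch below is one possible way to do it, not the code used in this project: `load_csv()` is a hypothetical name, and the tiny `sample_apps.csv` file exists only to make the example self-contained. Using `with` closes the file automatically once the rows are read.

```python
import csv
import os
import tempfile


def load_csv(path):
    """Return (header, rows) for a CSV file; the file is closed automatically."""
    with open(path, encoding='utf-8', newline='') as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Hypothetical sample file, created only to demonstrate the helper
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_apps.csv')
with open(sample_path, 'w', encoding='utf-8', newline='') as f:
    csv.writer(f).writerows([['App', 'Price'], ['Box', '0'], ['Slack', '0']])

sample_header, sample_rows = load_csv(sample_path)
print(sample_header)     # ['App', 'Price']
print(len(sample_rows))  # 2
```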

Next, we create a function, `explore_data()`, to make the data sets easier to explore. It prints a slice of rows in a readable way and, optionally, shows the number of rows and columns in the data set.

In [2]:
def explore_data(dataset, start, end, rows_and_columns = True):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row,'\n')  # print the row followed by a blank line
        
    if rows_and_columns:
        print('Number of rows:',len(dataset))
        print('Number of columns:',len(dataset[0]))

print(ios_header,'\n')
explore_data(ios,0,5)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1'] 

['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1'] 

['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1'] 

['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1'] 

['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37'

We can see that there are 7,197 apps in this data set. The columns most important to us in this project are 'track_name', 'size_bytes', 'price', 'rating_count_tot', 'user_rating', and 'prime_genre'.

More information on the iOS app columns can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).
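
Note that the iOS header starts with an unnamed column, so `track_name` sits at index 2 rather than index 1. One way to avoid hard-coding positions is to look them up by name with `list.index()`. This is a small sketch, assuming the header and first row shown in the output above; `ios_header_sample`, `first_row`, `name_idx`, and `genre_idx` are illustrative names.

```python
# Header copied from the output above
ios_header_sample = ['', 'id', 'track_name', 'size_bytes', 'currency', 'price',
                     'rating_count_tot', 'rating_count_ver', 'user_rating',
                     'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                     'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

# First data row copied from the output above
first_row = ['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99',
             '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']

# Look up column positions by name instead of hard-coding them
name_idx = ios_header_sample.index('track_name')
genre_idx = ios_header_sample.index('prime_genre')

print(name_idx, first_row[name_idx])    # 2 PAC-MAN Premium
print(genre_idx, first_row[genre_idx])  # 12 Games
```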

In [3]:
print(android_header,'\n')
explore_data(android,0,5)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up'] 

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] 

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] 

['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Eve

We can see that there are 10,841 apps in this data set. The columns most important to us in this project are 'App', 'Category', 'Rating', 'Size', 'Price', 'Content Rating', and 'Genres'.

More information on the Android app columns can be found [here](https://www.kaggle.com/lava18/google-play-store-apps).


## Data Cleaning

This is one of the most important parts of this project and it involves removing or correcting wrong data, removing duplicate data, and modifying the data for the sole purpose of our analysis.

In [4]:
# Delete the malformed Android row. Run this cell only ONCE:
# running it again would delete a valid row.

print(android_header,'\n')
print(android[10472],'\n')   # the malformed row
del android[10472]
print(android[10472],'\n')   # the row that now sits at index 10472

print(len(android))   # the new length has been reduced by 1

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] 

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up'] 

10840


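Above, we can see that row 10472 in the Android data set has only 12 values instead of 13: the 'Category' value is missing, so every later value is shifted one column to the left (the rating appears in the 'Category' position).

The row with the error has now been deleted, so we can proceed with the rest of the data cleaning process.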

In [5]:
# To check if there are similar errors let us check the length of each row

for app in android:
    if len(app) != len(android_header):
        print(app)
        
for app in ios:
    if len(app) != len(ios_header):
        print(app)

# No output means every row has the same number of columns as its header
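
The same length check can also report where a malformed row sits, which is one way an index such as 10472 could be located. This is a minimal sketch on made-up rows; `find_malformed_rows()` and the sample data are hypothetical.

```python
def find_malformed_rows(dataset, header):
    """Return the indices of rows whose length differs from the header's."""
    return [i for i, row in enumerate(dataset) if len(row) != len(header)]


# Made-up data: the second row is missing its 'Category' value
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
    ['App C', 'SOCIAL', '4.5'],
]

print(find_malformed_rows(sample_rows, sample_header))  # [1]
```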

## Removing Duplicate Apps
The [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section of the Kaggle page points out that some apps appear more than once in the Google Play data set, so we need to remove the duplicates.

In [6]:
# Checking whether an app (here Instagram) appears more than once in the Android data set

for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


The iOS data set has no such duplicates (this can be checked by running the code above on `ios`), so let us count the duplicate apps in the Android data set only.

In [7]:
# Counting the duplicate apps in the Android data set

duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate_apps:', len(duplicate_apps),'\n')
print('Number of unique_apps:', len(unique_apps),'\n')
print('Examples of duplicate apps:',duplicate_apps[:20])

Number of duplicate_apps: 1181 

Number of unique_apps: 9659 

Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']


In [8]:
# Counting how often each app name appears, to confirm that some apps appear more than once

app_freq = {}
for app in android:
    name = app[0]
    if name in app_freq:
        app_freq[name] += 1
    else:
        app_freq[name] = 1
        
# print(app_freq)
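
For reference, the standard library's `collections.Counter` builds the same kind of frequency table in one step. The sketch below uses a few made-up rows rather than the real data set; `sample_android`, `name_counts`, and `repeated` are illustrative names.

```python
from collections import Counter

# Made-up rows: only the app name (index 0) matters here
sample_android = [['Box', 'BUSINESS'], ['Slack', 'BUSINESS'],
                  ['Box', 'BUSINESS'], ['Box', 'BUSINESS']]

# Count how many times each app name occurs
name_counts = Counter(row[0] for row in sample_android)

# Keep only the names that occur more than once
repeated = {name: n for name, n in name_counts.items() if n > 1}
print(repeated)  # {'Box': 3}
```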

### Removing Duplicates
### Part One
Since some apps have several entries, we need to keep one entry per app and remove the others. Rather than removing rows at random, we will keep the entry with the highest number of reviews: the more reviews an entry has, the more recent its data is likely to be. This way, we can eliminate the duplicates.

We found that there are 1,181 duplicate entries and 9,659 unique apps.

In [9]:
print('Expected length:', len(android)-1181)     #To confirm the observation

Expected length: 9659


To remove the duplicates:

- We create a dictionary where each key is a unique app name and the value is the highest number of reviews of that app
- The information in the new dictionary is used to build a new data set that has only one entry per app (the entry with the highest number of reviews)

In [10]:
# Creating the dictionary

reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
    
# print(reviews_max)
print(len(reviews_max))

9659


### Part Two
Now we will use the `reviews_max` dictionary to remove the duplicates. Remember that for each app we keep only the entry with the highest number of reviews. This is how the code below works:
* We start by initializing two empty lists, `android_clean` and `already_added`
* We loop through the `android` data set, and for every iteration:
    * We isolate the name and number of reviews of the app
    * We add the app to the `android_clean` list and the app name to the `already_added` list if:
        * The number of reviews is the same as the maximum number of reviews of the app in the `reviews_max` dictionary
        * The app name is not already in the `already_added` list. This second check is needed because some duplicate entries share the same maximum number of reviews; if we only checked `n_reviews == reviews_max[name]`, those entries would all be added and the duplicates would remain.

In [11]:
# Creating the clean data

android_clean = []
already_added = []
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        
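explore_data(android_clean,1,5)
# already_added is a flat list of names, so the 'Number of columns' printed
# for it is only the length of the first name string, not a column count
explore_data(already_added,1,5)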

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] 

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] 

['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'] 

['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up'] 

Number of rows: 9659
Number of columns: 13
U Launcher Lite – FREE Live Cool Themes, Hide Apps 

Sketch - Draw & Paint 

Pixel Draw - Number Art Coloring Book 

Paper flowers instructions 

Number of rows: 9659
Number of columns: 46
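
Checking `name not in already_added` scans a list, which gets slower as the list grows. A set gives the same result with faster membership tests. The following is a sketch of that variation on made-up rows, not the code used above; `dedupe_by_max_reviews()` and `sample_rows` are hypothetical names, and the reviews column is assumed to be at index 3 as in the Android data.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row with the highest review count."""
    # Pass 1: highest review count seen for each name
    reviews_max = {}
    for row in rows:
        name, n_reviews = row[name_idx], float(row[reviews_idx])
        if n_reviews > reviews_max.get(name, -1.0):
            reviews_max[name] = n_reviews

    # Pass 2: keep the first row that matches the maximum for its name
    clean, seen = [], set()
    for row in rows:
        name, n_reviews = row[name_idx], float(row[reviews_idx])
        if n_reviews == reviews_max[name] and name not in seen:
            clean.append(row)
            seen.add(name)
    return clean


# Made-up rows: 'Box' has two entries tied at the maximum of 200 reviews
sample_rows = [
    ['Box', 'BUSINESS', '4.2', '150'],
    ['Box', 'BUSINESS', '4.2', '200'],
    ['Slack', 'BUSINESS', '4.4', '50'],
    ['Box', 'BUSINESS', '4.2', '200'],
]

for row in dedupe_by_max_reviews(sample_rows):
    print(row)
# ['Box', 'BUSINESS', '4.2', '200']
# ['Slack', 'BUSINESS', '4.4', '50']
```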


### Removing Non-English Apps
### Part One
If we explore both data sets, we will notice that some apps have non-English names, and our company focuses only on English apps.

In [12]:
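# Note: in this file index 0 is an unnamed row-number column and index 1 is 'id',
# so these two lines print app ids; the app name ('track_name') is at index 2
print(ios[813][1])
print(ios[6731][1],'\n')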
print(android_clean[4412][0])
print(android_clean[7940][0])

436672029
1144164707 

中国語 AQリスニング
لعبة تقدر تربح DZ


We are not interested in keeping apps with non-English names, so we will remove them. Besides the letters of the English alphabet, English text commonly uses the digits 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, `*`, /).

Behind the scenes, each character in a string has a corresponding number associated with it. For instance, the corresponding number for the character `'a'` is 97, for `'A'` it is 65, and for `'爱'` it is 29,233. The characters commonly used in English text all belong to the ASCII standard, whose corresponding numbers range from 0 to 127.

We can take advantage of this by using the built-in function `ord()` to detect characters that fall outside the ASCII range.

If an app name contains a character whose number is greater than 127, then it is probably not an English name.
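
The idea can be sketched as a small function. This is one possible implementation of the check described above, not necessarily the one used later in this project; `has_only_ascii()` is a hypothetical name.

```python
def has_only_ascii(text):
    """Return True if every character in text has a code point of 127 or less."""
    for character in text:
        if ord(character) > 127:
            return False
    return True


print(ord('a'), ord('A'), ord('爱'))        # 97 65 29233
print(has_only_ascii('Instagram'))          # True
print(has_only_ascii('中国語 AQリスニング'))  # False
```

A strict check like this also rejects English names that contain a single emoji or a symbol such as `™` (whose number is 8482), so it may need to be relaxed in practice.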