# Profitable App Profiles for the App Store and Google Play Markets
---
This project looks for mobile app profiles that are profitable on the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions about the kinds of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

---

In [1]:
from csv import reader

### The Google Play data set ###
# The `with` block closes the file once it has been read.
with open('googleplaystore.csv', encoding='UTF-8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

### The App Store data set ###
with open('AppleStore.csv', encoding='UTF-8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

### Print each header to see what each column contains ###
print(android_header)
print(ios_header)


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line between rows
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(ios,0,2,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


In [4]:
explore_data(android,0,2,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


---
## Clearing Corrupted Data

__Corrupted Data__

It is essential that all the data in the `csv` file is consistent. A single row whose format differs from the rest of the data can distort the results.

The Google Play data set has a corrupted row at index `10472` (`android[10472]`). It has 12 columns instead of 13 because the `Category` value is missing, so every later value sits one position to the left of where it belongs. Left in place, this row would corrupt our results.

---

In [5]:
print(android[10472])
print('Columns: ', len(android[10472]))

del android[10472]  # run this cell only once; re-running it would delete a different, valid row
explore_data(android,0,2,True)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Columns:  12
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10840
Number of columns: 13
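
Rather than relying on a hard-coded index, rows whose length differs from the header can be located with a short scan. The sketch below is only an illustration: the helper name `find_malformed_rows` and the toy rows are assumptions made for the example and are not part of the original notebook.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row_length) pairs for rows whose length differs from the header's."""
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]


# Toy data (hypothetical): the second row is missing its 'Category' value.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '3.9'],
    ['App C', 'GAME', '4.5'],
]

malformed = find_malformed_rows(toy_rows, toy_header)
print(malformed)

# Delete from the end so that earlier indices stay valid while removing rows.
for index, _ in reversed(malformed):
    del toy_rows[index]
print(len(toy_rows))
```

On the real data the same helper would be called as `find_malformed_rows(android, android_header)` before any row is deleted.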


___
## Duplicate Data

__Part One__
___

Checking the data set for duplicates, and choosing a clear criterion for which copy to keep, makes the results more accurate.

Once the corrupted data has been removed, we create two lists to separate the duplicates: `unique_apps` and `duplicate_apps`. `unique_apps` stores each app name once, and `duplicate_apps` stores the name of every repeated row.

In [15]:
unique_apps   = []
duplicate_apps= []

for app in android:
    if app[0] in unique_apps:
        duplicate_apps.append(app[0])
    else:
        unique_apps.append(app[0])

print(len(unique_apps))
print(len(duplicate_apps))

9659
1181
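
The `in` test on a list scans the whole list for every row, which gets slow as the data grows. A possible variant keeps a `set` of names already seen for the membership test. This is a sketch under assumptions: the helper name `split_duplicates` and the toy rows are hypothetical and only illustrate the idea.

```python
def split_duplicates(rows, key_index=0):
    """Return (unique_keys, duplicate_keys), keeping first-seen order."""
    seen = set()          # fast membership test
    unique_keys = []
    duplicate_keys = []
    for row in rows:
        key = row[key_index]
        if key in seen:
            duplicate_keys.append(key)
        else:
            seen.add(key)
            unique_keys.append(key)
    return unique_keys, duplicate_keys


# Toy rows (hypothetical): 'App A' appears three times.
toy_rows = [['App A', '100'], ['App B', '50'], ['App A', '120'], ['App A', '120']]
unique_names, duplicate_names = split_duplicates(toy_rows)
print(len(unique_names))
print(len(duplicate_names))
```

The counts should match the list-based loop above, since only the membership test changes.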


---
There are `1181` duplicate rows in total.
Removing duplicates at random would make the results less accurate, because the duplicate rows of an app can differ in their number of reviews.

Unlike the `android` list, the `ios` list contains no duplicates when checked on its `id` column (index `0`), as the check below shows, so only the `android` data set needs deduplication.

In [12]:
ios_unique_apps   = []
ios_duplicate_apps= []

for app in ios:
    if app[0] in ios_unique_apps:
        ios_duplicate_apps.append(app[0])
    else:
        ios_unique_apps.append(app[0])

print(len(ios_duplicate_apps))

0


__Part Two__

Since we have decided that the number of reviews is an important factor, it is reasonable to keep, for each duplicated app, the row with the highest number of reviews.

To accomplish that, I've created a dictionary that maps each __app name__ (the `key`) to its highest number of __reviews__ (the `val`).


In [18]:
print(android[0])

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


In [22]:
max_reviews = {}

for app in android:
    name = app[0]
    reviews = float(app[3])
    
    if name in max_reviews and max_reviews[name] < reviews:
        max_reviews[name] = reviews
    elif name not in max_reviews:
        max_reviews[name] = reviews
    


The `max_reviews` dictionary should now hold the highest review count for each app, with one entry per unique app name.
To check this, its length should equal the length of `android` minus the `1181` duplicates.


In [23]:
print('max_reviews Length : ',len(max_reviews))
print('android list without dup : ', len(android)-1181)

max_reviews Length :  9659
android list without dup :  9659
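
The two branches that build the dictionary can be collapsed into a single update with `dict.get` and `max`. The following is a minimal sketch of that variant: the function name `highest_reviews`, its parameters, and the toy rows are assumptions for illustration, not the notebook's own code.

```python
def highest_reviews(rows, name_index=0, reviews_index=3):
    """Map each app name to the largest review count found among its rows."""
    max_reviews = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # dict.get falls back to the current value when the name is new.
        max_reviews[name] = max(reviews, max_reviews.get(name, reviews))
    return max_reviews


# Toy rows (hypothetical) with the review count at index 3, as in `android`.
toy_rows = [
    ['App A', 'TOOLS', '4.1', '100'],
    ['App B', 'GAME', '4.5', '50'],
    ['App A', 'TOOLS', '4.1', '120'],
]
toy_max = highest_reviews(toy_rows)
print(toy_max)
```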


With the deduplicated dictionary in place, we can remove the duplicates from the `android` list. We start by initializing an empty list, `final_android`, to store the final data, and a second list, `already_added`, to track the names we have kept.

We loop through the `android` data set, and for every iteration:
   - We isolate the name of the app and the number of reviews.
   - We add the current row (`app`) to the `final_android` list, and the app name (`name`) to the `already_added` list if:
        - The number of reviews of the current app matches the number of reviews of that app as recorded in the `max_reviews` dictionary; and
        - The name of the app is not already in the `already_added` list.

We need the second condition to account for cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries with the same number of reviews). If we only checked `max_reviews[name] == review`, we would still end up with duplicate entries for some apps.

In [26]:
final_android = []
already_added = []

for app in android:
    name = app[0]
    review = float(app[3])
    
    if (max_reviews[name] == review) and (name not in already_added):
        final_android.append(app)
        already_added.append(name)
        
print(len(final_android))
print(final_android[:2])

9659
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']]
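
To see why the name check matters, the tie case can be reproduced on a few toy rows. This is a sketch with hypothetical data (two-column rows and made-up review counts). It also tracks the kept names in a `set` rather than a list, which is an assumed variation on the cell above.

```python
# Toy rows (hypothetical): 'App A' has two entries tied at the highest count.
toy_rows = [
    ['App A', '120'],
    ['App A', '120'],
    ['App B', '50'],
]
max_reviews = {'App A': 120.0, 'App B': 50.0}

# Checking the review count alone keeps both tied 'App A' rows.
without_check = [row for row in toy_rows if float(row[1]) == max_reviews[row[0]]]

# Adding the name check keeps exactly one row per app.
cleaned = []
already_added = set()
for row in toy_rows:
    name, reviews = row[0], float(row[1])
    if reviews == max_reviews[name] and name not in already_added:
        cleaned.append(row)
        already_added.add(name)

print(len(without_check))
print(len(cleaned))
```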



## Removing Non-English Apps

__Part One__

