### Analysing Profitable Apps in the Google Play Store and Apple App Store

The Google and Apple app stores have millions of apps. Some of them are very popular and profitable, while others are not. This project looks at what makes an app profitable and which categories of apps are more popular and profitable than others.

The goal of this project is to understand the key factors that make an app successful. The analysis should show which app categories are more popular, so that developers can get an idea of how to attract users to their next app.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    """utility function to print dataset"""
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print("Number of rows:    ", len(dataset))
        print("Number of columns: ", len(dataset[0]))

In [3]:
# create list of lists for both apple and google dataset
from csv import reader
with open("AppleStore.csv", "r") as fp:
    apple = reader(fp)
    apple_data = list(apple)
with open("googleplaystore.csv", "r") as fp:
    google = reader(fp)
    google_data = list(google)
    
print("Exploring Apple Dataset: \n")
explore_data(apple_data, 1, 3, True)

Exploring Apple Dataset: 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows:     7198
Number of columns:  16


In [4]:
print("Apple columns")
print(apple_data[0])
print("\n")
print("Google columns")
print(google_data[0])

Apple columns
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Google columns
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


#### More Info on Datasets

For detailed information about the Google Play Store dataset, see the [Google dataset](https://www.kaggle.com/lava18/google-play-store-apps).

Similarly, more information about the Apple App Store dataset can be found at the [Apple dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

### Data Cleaning

The process of preparing our data for analysis is called __data cleaning__.
It includes -
* _deleting or correcting_ wrong data
* _deleting_ duplicate data
* _modifying_ data to fit our needs

`It is said that data scientists spend 80% of their time cleaning data and 20% of their time analysing it.`

### Deleting wrong data
The Google dataset has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) reports a row that contains wrong data. We'll check whether that row is actually wrong.

In [7]:
print(google_data[10472])  # correct row
print('\n')
print(google_data[0])  # header row
print('\n')
print(google_data[10473])  # incorrect row

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [8]:
# As noted in the Kaggle discussion, the row at index 10473 has only 12 fields instead of 13:
# the 'Category' value is missing, so the remaining values are shifted one column to the left.
print(len(google_data))
del google_data[10473]
print(len(google_data))

10842
10841
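
Rows like the one deleted above can also be found programmatically instead of relying on a discussion thread. The following is a minimal sketch, assuming that a malformed row can be recognised by having a different number of fields than the header row. The helper name `find_malformed_rows` and the small `sample` list are hypothetical and only for illustration; they are not part of the original notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row.

    Assumes dataset[0] is the header row.
    """
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Made-up sample: the last row is missing one field.
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.1'],
    ['Broken App', '1.9'],
]

malformed = find_malformed_rows(sample)
for index, row in malformed:
    print(index, row)
```

The same call could be run on `google_data` before deleting anything, so that the index to delete comes from the data itself rather than being hard-coded.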


### Removing duplicate entries
__PART 1__

By looking closely at the dataset, we can see that there are many duplicate entries. For instance, the app "Instagram" has four entries.

In [9]:
for app in google_data:
    name = app[0]
    if name == "Instagram":
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [12]:
duplicate_apps = []
unique_apps = []
for app in google_data:
    app_name = app[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)
        
print(f"Length of duplicate apps: {len(duplicate_apps)}")
print(f"Length of unique apps: {len(unique_apps)}")
print("\n")
print("Few examples of duplicate apps -")
print(duplicate_apps[:15])

Length of duplicate apps: 1181
Length of unique apps: 9660


Few examples of duplicate apps -
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


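We can see that there are a total of 1181 duplicate entries. Note that the loop also iterates over the header row, so the count of 9660 unique names includes the header value `'App'`.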
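
As a possible alternative to the list-based loop, the duplicate names can also be counted with `collections.Counter` from the standard library, which avoids repeatedly searching a list. This is only a sketch: the helper name `count_duplicates`, its parameters, and the `sample` rows are hypothetical, and it assumes the first row is a header that should be skipped.

```python
from collections import Counter


def count_duplicates(dataset, name_index=0, has_header=True):
    """Map each app name that occurs more than once to its number of rows."""
    rows = dataset[1:] if has_header else dataset
    counts = Counter(row[name_index] for row in rows)
    return {name: n for name, n in counts.items() if n > 1}


# Made-up sample with a header row.
sample = [
    ['App', 'Category'],
    ['Box', 'BUSINESS'],
    ['Slack', 'BUSINESS'],
    ['Box', 'BUSINESS'],
    ['Box', 'BUSINESS'],
]

duplicates = count_duplicates(sample)
# Rows beyond the first occurrence of each name.
extra_rows = sum(n - 1 for n in duplicates.values())
print(duplicates)
print(extra_rows)
```

Here `extra_rows` corresponds to what the loop above collects in `duplicate_apps`: every occurrence of a name after its first one.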

If we look at the Instagram duplicates, we can see that the value in the Reviews column (the fourth field) is not the same in every row. The different values show that the data was collected at different times. We can use this as the criterion for choosing which row to keep: instead of deleting duplicate rows at random, we keep only the row with the highest number of reviews, since more reviews make the rating more reliable.
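
The criterion described above could be sketched as follows. This is a minimal illustration, not necessarily the approach used in the rest of the notebook. It assumes the app name is at index 0, the review count is at index 3, every review count parses as an integer, and the header row has already been excluded. The helper name `keep_most_reviewed` is hypothetical, the sample rows are truncated to four fields, and the `Box` row uses made-up values.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count.

    Assumes rows contains no header row and that the reviews
    field of every row can be converted with int().
    """
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = int(row[reviews_index])
        if name not in best or reviews > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Truncated sample rows: name, category, rating, reviews.
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
    ['Box', 'BUSINESS', '4.2', '159872'],  # made-up values
]

cleaned = keep_most_reviewed(sample_rows)
for row in cleaned:
    print(row)
```

When two rows of the same app have an identical review count, this sketch keeps whichever of them appears first.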

__PART 2__