### Analysing Profitable Apps in the Google Play and Apple App Stores

The Google Play and Apple App Stores have millions of apps. A few of them become popular and profitable, while most do not. This project looks for what makes an app profitable and which categories of apps tend to be more popular and profitable than others.

The goal of this project is to understand the key factors that make an app successful. The analysis should show which app categories are most popular, so that developers get a better idea of how to attract users with their next app.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    """Print rows start..end of dataset and, optionally, its dimensions."""
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print("Number of rows:    ", len(dataset))
        print("Number of columns: ", len(dataset[0]))

In [2]:
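# Read both CSV files into lists of lists (one inner list per row)
from csv import reader
# An explicit encoding avoids decode errors on platforms whose default is not UTF-8
with open("AppleStore.csv", "r", encoding="utf8") as fp:
    apple = reader(fp)
    apple_data = list(apple)
with open("googleplaystore.csv", "r", encoding="utf8") as fp:
    google = reader(fp)
    google_data = list(google)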
    
android_header = google_data[0]
google_data = google_data[1:]
ios_header = apple_data[0]
apple_data = apple_data[1:]
print("Exploring Apple Dataset: \n")
explore_data(apple_data, 1, 3, True)

Exploring Apple Dataset: 

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows:     7197
Number of columns:  16
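
The load-then-split-header step is repeated for both files, so it could be wrapped in a small helper. The sketch below is only an illustration: `read_csv_rows` is a hypothetical name, and the sample rows are made up and fed through `io.StringIO` so the example runs without the real CSV files.

```python
import csv
import io

def read_csv_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

# Made-up sample content standing in for a real CSV file
sample = io.StringIO("App,Category,Rating\nDemo App,TOOLS,4.2\nOther App,GAMES,3.9\n")
header, rows = read_csv_rows(sample)
print(header)
print(len(rows))
```

With a real file, the same helper would be called inside `with open(path, encoding="utf8", newline="") as fp:`.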


In [3]:
print("First Apple row")
print(apple_data[0])
print("\n")
print("First Google row")
print(google_data[0])

First Apple row
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


First Google row
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


#### More Info on Datasets

To get detailed information about the Google Play Store dataset, go to [Google dataset](https://www.kaggle.com/lava18/google-play-store-apps)

Similarly, more information about the Apple App Store dataset can be found at [Apple dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

### Data Cleaning

The process of preparing our data for analysis is called __data cleaning__.
It includes -
* _deleting or correcting_ wrong data
* _deleting_ duplicate data
* _modifying_ data to fit our needs

`It is often said that data scientists spend 80% of their time cleaning data and 20% of their time analysing it.`

### Deleting wrong data
The Google dataset has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) reports a row with wrong data. We'll check whether that row is actually wrong by printing it next to the header and a correct row.

In [4]:
print(google_data[10471])  # correct row
print('\n')
print(android_header)  # header row
print('\n')
print(google_data[10472])  # incorrect row

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [5]:
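# The Kaggle discussion flags row 10473 (counting the header), which is index 10472 here.
# It has fewer values than the header, so we delete it.
# The length check keeps a re-run of this cell from deleting a valid row.
print(len(google_data))
if len(google_data[10472]) != len(android_header):
    del google_data[10472]
print(len(google_data))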

10841
10840
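
The faulty row printed above has 12 values against 13 header columns, so rows like it can also be found without knowing their index in advance. The sketch below assumes that a length mismatch is the only symptom we care about; `find_malformed_rows` is a hypothetical helper and the sample data is made up.

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]

# Made-up sample data, only for illustration
sample_header = ["App", "Category", "Rating"]
sample_rows = [
    ["Demo App", "TOOLS", "4.2"],
    ["Broken App", "1.9"],          # the Category value is missing
    ["Other App", "GAMES", "3.9"],
]

bad = find_malformed_rows(sample_header, sample_rows)
print(bad)

# Delete from the end so the earlier indexes stay valid
for i in reversed(bad):
    del sample_rows[i]
print(len(sample_rows))
```

A row with the right number of values but wrong content would not be caught by this check.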


### Removing duplicate entries
__PART 1__

Looking closely at the dataset, we can see that there are many duplicate entries. For instance, the app "Instagram" has four entries.

In [6]:
for app in google_data:
    name = app[0]
    if name == "Instagram":
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [7]:
duplicate_apps = []
unique_apps = []
for app in google_data:
    app_name = app[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)
        
print(f"Length of duplicate apps: {len(duplicate_apps)}")
print(f"Length of unique apps: {len(unique_apps)}")  # header row was already removed
print("\n")
print("Few examples of duplicate apps -")
print(duplicate_apps[:15])

Length of duplicate apps: 1181
Length of unique apps: 9659


Few examples of duplicate apps -
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


We can see that there are 1181 duplicate entries in total.
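
As a cross-check, the same two counts can be computed with `collections.Counter` from the standard library. This is only a sketch on a made-up list of names, not the dataset itself; every occurrence after the first is counted as a duplicate, matching the loop above.

```python
from collections import Counter

# Made-up sample of app names, only for illustration
sample_names = ["Box", "Slack", "Box", "Instagram", "Box", "Slack"]

counts = Counter(sample_names)
unique_count = len(counts)
# Every occurrence beyond the first one counts as a duplicate
duplicate_count = sum(n - 1 for n in counts.values())

print(unique_count, duplicate_count)
```

Unlike the list-based loop, this avoids the repeated `in` lookups on a list, which matters more as the dataset grows.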

If we look at the Instagram duplicates, we find that the values in the reviews column differ. The different values show that the data was collected at different times. We can use this as the criterion for keeping rows: instead of deleting duplicate rows at random, we keep only the row with the most reviews, since more reviews make the rating more reliable.

__PART 2__

We will start by creating a dictionary whose keys are app names and whose values are the highest review count found for that app among all its duplicate rows.

In [8]:
from collections import defaultdict
reviews_max = defaultdict(int)

for app in google_data:
    app_name = app[0]
    reviews = float(app[3])
    old_review = reviews_max[app_name]
    if old_review < reviews:
        reviews_max[app_name] = reviews

In [9]:
print(len(unique_apps) == len(reviews_max))  # confirming actual vs expected length

True


Using this dictionary, we will iterate over the entire dataset and keep only the rows whose review count matches the dictionary value (the dictionary holds the highest review count for each app). The result is a new, cleaned dataset.

We will also keep a separate list of the app names that have already been added to the cleaned dataset. This handles duplicate rows that share the same, highest number of reviews: checking `reviews == max_review` alone would let every one of them through.

In [14]:
cleaned_android = []
already_added = []
for app in google_data:
    app_name = app[0]
    reviews = float(app[3])
    max_review = reviews_max[app_name]
    if max_review == reviews and app_name not in already_added:
        cleaned_android.append(app)
        already_added.append(app_name)

print(len(cleaned_android))
print(len(cleaned_android) == len(reviews_max) == len(unique_apps))

9659
True
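
The two passes above can also be folded into a single pass that stores the best row per app directly. This is a sketch of an alternative, not the approach used in this notebook; `keep_most_reviewed` is a hypothetical helper, and the sample rows are shortened, partly made-up stand-ins for real rows.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        reviews = float(row[reviews_idx])
        # Strict '>' means a later row with an equal count never replaces the kept one
        if name not in best or reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Shortened sample rows (name, category, rating, reviews), only for illustration
sample_rows = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Box", "BUSINESS", "4.2", "159872"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Instagram", "SOCIAL", "4.5", "66509917"],
    ["Box", "BUSINESS", "4.2", "159872"],
]

deduped = keep_most_reviewed(sample_rows)
for row in deduped:
    print(row)
```

One difference to keep in mind: the result is ordered by each app's first appearance, which may not match the row order produced by the two-pass version.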


In [16]:
explore_data(cleaned_android, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:     9659
Number of columns:  13


### Removing non-English apps
#### Part 1
If we explore the datasets enough, we find that certain apps are not aimed at an English-speaking audience. Below are a couple of examples from each dataset -

In [44]:
print(apple_data[813][1])
print(apple_data[6731][1])

print(cleaned_android[4412][0])
print(cleaned_android[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜
中国語 AQリスニング
لعبة تقدر تربح DZ


We will write a function that takes a string and checks whether every character in it is an ASCII character. The characters commonly used in English text have ASCII codes between 0 and 127. To get a character's code, we'll use the built-in function `ord()`. Our function returns `False` as soon as it finds a character whose code is above 127, and `True` otherwise.
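
To make the 127 cut-off concrete, the sketch below prints the code that `ord()` returns for a few sample characters, together with whether each one falls inside the ASCII range. The characters are chosen only for illustration.

```python
# A few sample characters: three ASCII ones, then a symbol, a CJK character and an emoji
samples = ["A", "z", "~", "™", "爱", "😜"]

codes = {ch: ord(ch) for ch in samples}
for ch, code in codes.items():
    # True when the character is inside the ASCII range 0..127
    print(repr(ch), code, code <= 127)
```

The first three characters stay within the range, while `™`, the CJK character and the emoji all have codes far above 127.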

In [19]:
def is_english(word):
    for c in word:
        if ord(c) > 127:
            return False
    return True

In [20]:
test_non_english = ["Instagram", "爱奇艺PPS -《欢乐颂2》电视剧热播", "Docs To Go™ Free Office Suite", "Instachat 😜"]
for sample in test_non_english:
    print(is_english(sample))

True
False
False
False


There can be cases where an app is intended for an English-speaking audience but its name contains an emoji or a character like `™`.
We may lose a lot of data if we filter this strictly. To counter that, we will accept an app name as English if it contains at most 3 non-ASCII characters. This is not a perfect solution, but it keeps apps with three or fewer emoji or special characters in their names.

To achieve that, we will use a counter for the number of non-ASCII characters in the name.

#### Part 2

To minimize the data loss, let's redefine the basic `is_english` function.

In [39]:
def is_english(word):
    non_ascii = 0
    for c in word:
        if ord(c) > 127:
            if non_ascii >= 3:
                return False
            else:
                non_ascii += 1
    return True

In [40]:
test_non_english = ["Instagram 😜😜😜", "Instagram 😜😜😜😜", "爱奇艺PPS -《欢乐颂2》电视剧热播", "Docs To Go™ Free Office Suite", "Instachat 😜"]
for sample in test_non_english:
    print(is_english(sample))

True
False
False
True
True
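
The limit of 3 is a judgment call, so it can help to make it a parameter and see how the result changes. The sketch below is a variant under that assumption; `is_mostly_ascii` and `max_non_ascii` are hypothetical names, not part of the notebook's code.

```python
def is_mostly_ascii(text, max_non_ascii=3):
    """Return True if text has at most max_non_ascii characters with a code above 127."""
    non_ascii = sum(1 for c in text if ord(c) > 127)
    return non_ascii <= max_non_ascii

samples = ["Instagram 😜😜😜", "Instagram 😜😜😜😜", "Docs To Go™ Free Office Suite"]
for sample in samples:
    # Compare the default limit with a strict limit of zero
    print(is_mostly_ascii(sample), is_mostly_ascii(sample, max_non_ascii=0))
```

With `max_non_ascii=0` this behaves like the first, strict version of `is_english`.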


At this point, the function keeps a fairly good share of the apps aimed at English speakers. We will stop optimizing the check here and use the function to clean our datasets.

In [46]:
android_english = []
ios_english = []
for app in cleaned_android:
    app_name = app[0]
    if is_english(app_name):
        android_english.append(app)

for app in apple_data:
    app_name = app[1]
    if is_english(app_name):
        ios_english.append(app)
        
print("Length of english android apps: ", len(android_english))
print("Length of english ios apps: ", len(ios_english))

Length of english android apps:  9614
Length of english ios apps:  6183


We are left with __9614__ Android apps and __6183__ iOS apps.

In [58]:
non_english_android_perc = (1 - (len(android_english) / len(cleaned_android))) * 100
print(f"Removed {round(non_english_android_perc, 2)}% of Android apps as non-English")
non_english_ios_perc = (1 - (len(ios_english) / len(apple_data))) * 100
print(f"Removed {round(non_english_ios_perc, 2)}% of iOS apps as non-English")

Removed 0.47% of Android apps as non-English
Removed 14.09% of iOS apps as non-English


### Selecting only the free apps
We are only interested in building free apps, as our main source of revenue is in-app ads, so we will select only the apps that are free. This is the last step of our _Data Cleaning_ process, in which we have already covered -
> * Removing inaccurate data
> * Removing duplicate entries
> * Removing non-English apps

In [60]:
print("Checking Android Header")
print(android_header)
print('\n')
print("Checking iOS Header")
print(ios_header)

Checking Android Header
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Checking iOS Header
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [73]:
explore_data(android_english, 0, 2, False)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']




In [75]:
android_final = []
ios_final = []
for app in android_english:
    price = app[7]
    if price == "0":
        android_final.append(app)
for app in ios_english:
    price = app[4]
    if price == "0.0":
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))

8864
3222
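
The filter above compares price strings exactly, so `"0"` and `"0.0"` have to be handled separately for each store. A more tolerant check could parse the price as a number first. The sketch below assumes that paid prices may carry a leading `$` and that anything unparseable should count as not free; `is_free` is a hypothetical helper, not part of the notebook.

```python
def is_free(price_text):
    """Return True if the price string parses to zero."""
    cleaned = price_text.strip().lstrip("$")
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Not a number at all (for example a value from a shifted row)
        return False

for sample in ["0", "0.0", "$4.99", "2.99", "Everyone"]:
    print(sample, is_free(sample))
```

With this helper, both loops could use the same condition, differing only in the column index.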
