# Improving the profitability of applications in online stores

In this project I combine what I have learned so far and apply it to a basic data analysis.

I take the role of a data analyst at a company that builds Android and iOS mobile apps. The apps are free to download and revenue comes from in-app ads, so the revenue of an app depends on the number of active users who use it and engage with the ads.

Goal: Analyze data to help developers understand what type of apps are likely to attract more users.

## Loading the data
Performing necessary imports

In [None]:
from csv import reader

Defining a function to explore the dataset

In [None]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Opening the datasets

In [None]:
appleDataOpened = open("../input/google-and-apple-store/AppleStore.csv", encoding = "utf-8")
googleDataOpened = open("../input/google-and-apple-store/googleplaystore.csv", encoding = "utf-8")

Reading the data

In [None]:
appleData = list(reader(appleDataOpened))
googleData = list(reader(googleDataOpened))

In [None]:
appleHeader = appleData[0]
appleData = appleData[1:]
googleHeader = googleData[0]
googleData = googleData[1:]

Exploring the data
* Apple Store data

In [None]:
explore_data(dataset = appleData,start = 0, end = 5, rows_and_columns = True)

* Google Play Store data

In [None]:
explore_data(dataset = googleData,start = 0, end = 5, rows_and_columns = True)

## Data processing
* Detect inaccurate data, and correct or remove it.
* Detect duplicate data, and remove the duplicates.

App development is targeted at an English-speaking audience, hence:
* Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
* Remove apps that aren't free.  

From the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) on Kaggle we learn that row 10472 contains an error. Let us compare it with a correct row.

In [None]:
print(googleHeader)
print(googleData[13])
print(googleData[10472])

The 'Genres' column seems to be missing data.  
Removing the row with error 

In [None]:
del googleData[10472]
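
A row like this can also be found without relying on the discussion thread, by comparing each row's length with the header's. The sketch below runs on made-up rows, and `find_malformed_rows` is a hypothetical helper that is not part of the original notebook.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Toy data, made up for illustration: the second row has one value too few.
header = ["App", "Category", "Rating"]
rows = [
    ["Photo Editor", "ART_AND_DESIGN", "4.1"],
    ["Broken Entry", "19"],
    ["Sketch Pad", "ART_AND_DESIGN", "4.5"],
]

malformed = find_malformed_rows(header, rows)
print(malformed)
```

If several rows are reported, deleting them from the highest index down keeps the remaining indices valid.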

### Removing duplicate entries

In [None]:
appCount = {}
for app in googleData:
    if(app[0] in appCount):
        appCount[app[0]] +=1
    else:
        appCount[app[0]] = 1

In [None]:
uniqueApps = []
duplicateApps = []
for app in appCount:
    if(appCount[app]>1):
        duplicateApps.append(app)
    else:
        uniqueApps.append(app)

Examining the apps that have duplicate records

In [None]:
print(duplicateApps[:15],'\nThere are',len(duplicateApps),'duplicate apps.. The first 15 are displayed above')

Examining an app and its duplicate records

In [None]:
print(googleHeader)
for app in googleData:
    if app[0] == 'Instagram':
        print(app)

The 'Reviews' values differ between the duplicates, which suggests that the records were extracted from the store at different times.  
We use this field to remove duplicates: for each app we keep only the record with the highest number of reviews.
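
The keep-the-highest-review-count rule can be sketched on a few made-up rows. This is a single-pass variant that stores the whole row in the dictionary, which differs from the two-step approach used in the cells that follow. The names `rows`, `best` and `deduplicated` are illustrative only.

```python
# Toy rows, made up for illustration: [name, category, rating, reviews]
rows = [
    ["Instagram", "SOCIAL", "4.5", "100"],
    ["Instagram", "SOCIAL", "4.5", "250"],
    ["Notes", "PRODUCTIVITY", "4.2", "40"],
    ["Instagram", "SOCIAL", "4.5", "180"],
]

best = {}  # app name -> row with the highest review count seen so far
for row in rows:
    name, reviews = row[0], int(row[3])
    if name not in best or reviews > int(best[name][3]):
        best[name] = row

deduplicated = list(best.values())
print(deduplicated)
```

When two records of the same app share the highest review count, this sketch keeps the first one it meets.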

Expected number of records after removal of duplicates in google playstore dataset

In [None]:
expectedRecordsGoogle = len(set(duplicateApps)) + len(uniqueApps) 
print(expectedRecordsGoogle)

Creating a dictionary where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.

In [None]:
appDictGoogle = {}
for app in googleData:
    name = app[0]
    reviews = int(app[3])
    if(name in appDictGoogle and appDictGoogle[name] < reviews):
        appDictGoogle[name] = reviews
    elif(name not in appDictGoogle):
        appDictGoogle[name] = reviews

We've now obtained the highest review count for each app, which identifies the best (latest) record in each duplicate set.  
Removing the duplicates from googleData

In [None]:
cleanedData = []
already_added = []
for app in googleData:
    name = app[0]
    reviews = int(app[3])
    if(appDictGoogle[name] == reviews and name not in already_added):
        cleanedData.append(app)
        already_added.append(name)
googleData = cleanedData       

In [None]:
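# googleData was already replaced by cleanedData above, so the original count is recovered from appCount
print('Number of records in dataset before removing duplicates: ', sum(appCount.values()))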
print('Expected number of records in dataset after removing duplicates: ',expectedRecordsGoogle)
print('Number of records in dataset after removing duplicates: ',len(cleanedData))

Checking if the App Store dataset has any duplicates 

In [None]:
appCount = {}
for app in appleData:
    if(app[0] in appCount):
        appCount[app[0]] +=1
    else:
        appCount[app[0]] = 1
uniqueApps = []
duplicateApps = []
for app in appCount:
    if(appCount[app]>1):
        duplicateApps.append(app)
    else:
        uniqueApps.append(app)
print('Number of duplicates found: ',len(duplicateApps))

### Removing non-English entries
The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system.  
Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not.  

Note: Some apps use emojis or symbols such as ™ in their names. To avoid filtering these out, I only treat a name as non-English when more than 3 of its characters have a code point outside the ASCII range.
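
A compact sketch of this check, assuming the same threshold of three non-ASCII characters that the function below uses. `looks_english` is a hypothetical helper, not the notebook's own function.

```python
def looks_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` characters outside ASCII."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii


# ord() returns the code point of a character; ASCII characters are 0 to 127.
print(ord("a"), ord("™"), ord("爱"))

print(looks_english("Instachat 😜"))
print(looks_english("Docs To Go™ Free Office Suite"))
print(looks_english("爱奇艺PPS -《欢乐颂2》电视剧热播"))
```

The threshold is a heuristic: an English name with four or more emojis would still be dropped, and a non-English name written mostly in ASCII would be kept.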

Function to find the names of non-English apps that need to be removed

In [None]:
def findNonEnglish(dataset):
    removeList = []
    for app in dataset:
        name = app[0]
        nameSplit = [x for x in name]
        count = 0
        for char in nameSplit:
            c_ord = ord(char)
            if(c_ord not in range(0,128)):
                count+=1
                if(count>3):
                    if(name not in removeList):
                        removeList.append(name)
    return(removeList)

Function to remove unnecessary apps from an input list

In [None]:
def required(dataset,removeList):
    requiredApps = []
    for app in dataset:
        name = app[0]
        if(name not in removeList):
            requiredApps.append(app)
    return(requiredApps)

Removing the non-English apps from the Google Play Store data

In [None]:
removeList = findNonEnglish(googleData)
googleData = required(googleData,removeList)

In [None]:
print('Length of google playstore dataset after removing unnecessary apps: ',len(googleData))

Performing the same action for appstore apps

In [None]:
removeList = findNonEnglish(appleData)
appleData = required(appleData,removeList)

In [None]:
print('Length of apple appstore dataset after removing unnecessary apps: ',len(appleData))

### Isolating the free apps

In [None]:
googleFinal = []
appleFinal = []

for app in googleData:
    price = app[7]
    if price == '0':
        googleFinal.append(app)
        
for app in appleData:
    price = app[4]
    if price == '0.0':
        appleFinal.append(app)

In [None]:
print('Free Google Playstore apps: ',len(googleFinal))
print('Free Apple appstore apps: ',len(appleFinal))


Function to generate frequency table

In [None]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

Function to display the contents in descending order of value

In [None]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
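
For comparison, the same percentage table can be built with `collections.Counter` from the standard library. This is a minimal sketch on a made-up list of genres, and `percentage_table` is a hypothetical name rather than part of the notebook.

```python
from collections import Counter


def percentage_table(values):
    """Map each distinct value to its share of `values`, in percent."""
    counts = Counter(values)
    total = sum(counts.values())
    # most_common() yields the values from most to least frequent
    return {value: count / total * 100 for value, count in counts.most_common()}


genres = ["Tools", "Games", "Tools", "Education"]  # made-up sample
table = percentage_table(genres)
for genre, share in table.items():
    print(genre, ":", share)
```

Because dictionaries preserve insertion order, iterating over the result already gives the descending order that `display_table` produces by sorting tuples.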

Analyzing the apps by category to see which category is the most popular on both platforms

In [None]:
display_table(appleFinal, -5)  # use the free-apps subset isolated above

In [None]:
display_table(googleFinal, 9)  # use the free-apps subset isolated above

In the Google Play data, the most common genre seems to be "Tools".

Analyzing data based on category

In [None]:
display_table(googleFinal, 1)  # use the free-apps subset isolated above

## Pandas Analysis

Here's the scenario. You're working for one of Google's data science teams and someone from another team, specifically an account manager — "a person who works for a company and is responsible for the management of sales and relationships with particular customers" — approaches you with a request. She wants to take a prophylactic approach and improve the revenue of undervalued apps to motivate the developers to keep working on them.

Since her department's marketing budget won't allow her to invest in ads (which would boost the number of sales), the only way to improve the revenue is by tweaking the price. She asks you to determine which paid apps are undervalued, meaning that their price could be increased without lowering demand.

In [None]:
import numpy as np
import pandas as pd

In [None]:
playstore = pd.read_csv("../input/google-and-apple-store/googleplaystore.csv")
print(playstore.shape)

In [None]:
playstore.head()

### Fixing problematic fields
We drop the problematic row (index 10472, identified earlier). It is a free app and this part of the project focuses on paid apps, so nothing is lost. Let's continue exploring our dataset.

In [None]:
playstore.drop(labels=10472, inplace=True)

### Fixing datatypes

Several columns that should be numeric have the type object instead, specifically Reviews, Size, and Price.

1. Reviews: No problems with this column. The only reason pandas didn't use the proper type is the presence of the problematic row we just dropped.

2. Size: The Size values contain letters like M and k, which denote memory size units. To clean this column, we'll use the function defined below. It takes strings, like the values of the Size column, as input and returns a float representing the size in megabytes (or NaN when the size is listed as "Varies with device").

3. Price: Some of the values include a $; removing it is enough to make the values ready for conversion.
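
One detail matters for the Price step: `$` is also the end-of-string anchor in regular expressions, so it is safer to ask pandas for a literal replacement. The sketch below uses a made-up Series instead of the real column.

```python
import pandas as pd

prices = pd.Series(["0", "$4.99", "$0.99"])  # made-up sample values

# regex=False treats "$" as a literal character rather than a regex anchor.
cleaned = prices.str.replace("$", "", regex=False).astype(float)
print(cleaned.tolist())
```

Passing `regex` explicitly also keeps the behavior the same across pandas versions, since the default value of this parameter changed in pandas 2.0.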

Function to clean the values in the Size column

In [None]:
def clean_size(size):
    size = size.replace("M","")
    if size.endswith("k"):
        size = float(size[:-1])/1000
    elif size == "Varies with device":
        size = np.nan  # np.NaN was removed in NumPy 2.0
    else:
        size = float(size)
    return size

In [None]:
playstore["Price"] = playstore["Price"].str.replace("$", "", regex=False).astype("float")
paid = playstore[playstore["Price"] != 0].copy()
paid.drop("Type", axis = "columns",inplace = True)
paid["Reviews"] = paid["Reviews"].astype(int)
paid["Size"] = paid['Size'].apply(clean_size).astype(float)
paid.info()

### Removing duplicates

In [None]:
paid = paid.sort_values("Reviews", ascending = False)
paid.drop_duplicates(subset = "App", keep = "first",inplace = True)
print(paid.duplicated("App").sum())  # number of remaining duplicates, expected to be 0
paid.reset_index(drop = True, inplace = True)

### Exploratory analysis around undervalued apps

In [None]:
paid["Price"].plot(kind = "hist")

In [None]:
paid.sort_values(by = "Price", ascending = False)

Most of the apps at the top of the list are very niche and get in the way of our analysis, so we keep only the apps priced below $50.

In [None]:
affordable_apps = paid[paid["Price"]<50].copy()
affordable_apps['Price'].plot(kind = "hist", figsize=(12,6))

The distribution is still skewed to the right, so we split the data into two segments at the $5 price mark.

In [None]:
cheap = affordable_apps['Price']<5
reasonable = affordable_apps['Price']>=5

#### Viewing the cheaper apps

In [None]:
affordable_apps['Price'][cheap].plot(kind = "hist", figsize=(12,6))

#### Viewing reasonably priced apps

In [None]:
affordable_apps['Price'][reasonable].plot(kind = "hist", figsize=(12,6))

In [None]:
affordable_apps["affordability"] = affordable_apps["Price"].apply(lambda x: "cheap" if x<5 else "reasonable")

### Investigating correlations in variables

In [None]:
affordable_apps[cheap].plot(x = "Price", y = "Rating", kind = "scatter", figsize=(12,6))

Examining Pearson's correlation coefficient for the pair of variables 'Rating' and 'Price'

In [None]:
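# Select the two numeric columns first; corr() on the full frame fails on text columns in recent pandas versions
affordable_apps[cheap][["Rating", "Price"]].corr().loc["Rating", "Price"]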

The value is close to 0, so we can deduce that price has very little to do with the rating of apps priced below $5.  
This is good news for our price tweaking strategy, because it suggests that we can change prices without it being reflected in the apps' ratings.

In [None]:
cheap_mean = affordable_apps['Price'][cheap].mean()
cheap_mean

In [None]:
affordable_apps["price_criterion"] = affordable_apps["Price"].apply(lambda x: 1 if x<cheap_mean else 0)

And similarly for the reasonably priced apps

In [None]:
affordable_apps[reasonable].plot(x = "Price", y = "Rating", kind = "scatter", figsize=(12,6))

In [None]:
affordable_apps[reasonable].corr().loc["Rating", "Price"]

In [None]:
reasonable_mean = affordable_apps['Price'][reasonable].mean()
reasonable_mean

In [None]:
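# Only relabel the reasonable apps so that the labels already assigned to the cheap apps are kept
affordable_apps.loc[reasonable, "price_criterion"] = affordable_apps.loc[reasonable, "Price"].apply(lambda x: 1 if x<reasonable_mean else 0)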

Thus we've labelled the apps whose price we could strategically increase without affecting their ratings.

### Examining the genre and category fields

In the interest of getting some quick results for our prototype, we'll now focus on the categories and genres, leaving other features for another time.  

In [None]:
affordable_apps.head(5)

Looking at the first few rows of affordable_apps, we see that multiple genres are separated by ';'

Since affordable_apps has only around 700 rows and the Genres column can hold multiple values separated by ';', segmenting by this column could spread our data too thin to extract any significant insights.

Instead of simply ignoring it, we'll extract some information from there and see where that leaves us.

In [None]:
affordable_apps["Genres"].unique()

Looking at the possible values for this column, we see that ';' only appears as a separator and is never part of the name of a single genre.

We create a column that counts the number of genres of each app, which is the number of occurrences of ';' plus one.

In [None]:
affordable_apps["genre_count"] = affordable_apps["Genres"].str.count(";")+1
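
The counting rule is easy to check on plain strings. The genre values below are made up for illustration.

```python
genres = ["Action", "Action;Action & Adventure", "Education;Pretend Play"]

# One separator means two genres, so the count is the number of ';' plus one.
genre_counts = [value.count(";") + 1 for value in genres]
print(genre_counts)
```

Splitting on ';' and taking the length of the result gives the same numbers.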

We now look at how the mean price varies with the number of genres within each of the two segments, 'cheap' and 'reasonable'.

In [None]:
# Select Price before aggregating; mean() over text columns fails in recent pandas versions
genres_mean = affordable_apps.groupby(["affordability", "genre_count"])[["Price"]].mean()
genres_mean

Curiously, apps that belong to two genres are more expensive among the cheap apps and cheaper among the reasonable apps.

For each segment, let's label the apps that cost less than their corresponding segments' mean with 1, and the others with 0

Function to perform the labelling:

In [None]:
def label_genres(row):
    aff = row["affordability"]
    gc = row["genre_count"]
    price = row["Price"]

    if price < genres_mean.loc[(aff, gc), "Price"]:
        return 1
    else:
        return 0

In [None]:
affordable_apps["genre_criterion"] = affordable_apps.apply(label_genres, axis="columns")
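
The same labelling can also be done without a row-wise `apply`, by broadcasting each segment's mean back to its rows with `groupby(...).transform`. This is an alternative sketch on a made-up frame, not the method used in this notebook.

```python
import pandas as pd

# Made-up data with the same column names as affordable_apps
apps = pd.DataFrame({
    "affordability": ["cheap", "cheap", "cheap", "reasonable", "reasonable"],
    "genre_count": [1, 1, 2, 1, 1],
    "Price": [0.99, 2.99, 1.99, 5.99, 9.99],
})

# transform("mean") returns one value per row: the mean price of that row's segment
segment_mean = apps.groupby(["affordability", "genre_count"])["Price"].transform("mean")
apps["genre_criterion"] = (apps["Price"] < segment_mean).astype(int)
print(apps)
```

An app that is alone in its segment is never below its own mean, so it is always labelled 0.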

And now, similarly for the category variable

Creating a dataframe that stores the mean price for each segment

In [None]:
# Select Price before aggregating; mean() over text columns fails in recent pandas versions
categories_mean = affordable_apps.groupby(["affordability", "Category"])[["Price"]].mean()
categories_mean

For each app whose price is less than the category mean we label the newly created category_criterion with 1 and 0 otherwise.

Function to perform the same

In [None]:
def label_categories(row):
    aff = row["affordability"]
    cat = row["Category"]
    price = row["Price"]
    if price < categories_mean.loc[(aff, cat), "Price"]:
        return 1
    else:
        return 0

In [None]:
affordable_apps["category_criterion"] = affordable_apps.apply(label_categories, axis="columns")

### Majority voting criterion for selection of apps whose price is to be increased

We have three voting criteria:
* price_criterion
* genre_criterion
* category_criterion  

If more than one criterion is set to 1 for an app, that app qualifies for a price increase.

In [None]:
criteria = ["price_criterion", "genre_criterion", "category_criterion"]
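# mode() returns a DataFrame; with three binary votes the mode is unique, so column 0 holds the result
affordable_apps["Result"] = affordable_apps[criteria].mode(axis='columns')[0]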
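
With three binary criteria, taking the mode is the same as asking whether at least two votes are 1. The sketch below checks this for all eight combinations. `majority` is a hypothetical helper.

```python
from itertools import product
from statistics import mode


def majority(votes):
    """Return 1 if more than half of the binary votes are 1, else 0."""
    return int(sum(votes) > len(votes) / 2)


# Enumerate every combination of three 0/1 votes and compare with the mode.
for votes in product([0, 1], repeat=3):
    assert majority(votes) == mode(votes)
    print(votes, "->", majority(votes))
```

An odd number of criteria guarantees that there is never a tie.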

Number of apps eligible for a price increase

In [None]:
affordable_apps["Result"].sum()

This is approximately 50% of the affordable apps.

### Estimation of impact of price increase

We estimate the new price for these apps as:
* For apps in cheap category the new price will be max(cheap_mean, Price)
* For apps in reasonable category the new price will be max(reasonable_mean, Price)

Creating function to compute price

In [None]:
def new_price(row):
    if row["affordability"] == "cheap":
        return round(max(row["Price"], cheap_mean), 2)
    else:
        return round(max(row["Price"], reasonable_mean), 2)

In [None]:
affordable_apps["New Price"] = affordable_apps.apply(new_price, axis="columns")

Computing the impact, defined as total number of installs \* (New Price - Old Price)

In [None]:
affordable_apps["Installs"] = affordable_apps["Installs"].str.replace("[+,]", "", regex=True).astype(int)

In [None]:
affordable_apps["Impact"] = (affordable_apps["New Price"]-affordable_apps["Price"])*affordable_apps["Installs"]
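
A worked example of the impact formula for single apps. The segment mean of 2.0 and the install count of 1000 are made up, and `impact` is a hypothetical helper that combines the pricing rule and the formula above.

```python
def impact(old_price, segment_mean, installs):
    """Estimated extra revenue if the price is raised to the segment mean."""
    new_price = round(max(old_price, segment_mean), 2)
    return (new_price - old_price) * installs


# Priced below the segment mean: the price rises to 2.0, a gain of about 1.01 per install.
print(impact(0.99, 2.0, 1000))

# Already priced above the segment mean: the price stays the same, so the impact is 0.
print(impact(3.99, 2.0, 1000))
```

The Installs values are stripped of '+' and ',' before conversion, which suggests they are rounded figures such as "1,000+". The total is therefore best read as a rough estimate, and it assumes that demand does not change after the price increase.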

In [None]:
total_impact = affordable_apps["Impact"].sum()
total_impact

## To-do
Include unused data in the analysis, specifically:
* The number of reviews;
* The size of the app;
* The content rating;
* The last time the app was updated;
* The app's Android versions.