# Advanced Certification Program in Computational Data Science

##  A program by IISc and TalentSprint

### Mini Project Notebook 1: Data Analytics and Pandas (Solution)

Ungraded Mini-project

**DISCLAIMER:** THIS NOTEBOOK IS PROVIDED ONLY AS A REFERENCE SOLUTION NOTEBOOK FOR THE MINI-PROJECT. THERE MAY BE OTHER POSSIBLE APPROACHES/METHODS TO ACHIEVE THE SAME RESULTS.   

## Learning Objectives



At the end of the experiment, you will be able to:


* understand the requirements for a “clean” dataset, ready for use in statistical analysis

* use Python libraries like Pandas, NumPy, and Matplotlib to perform the data-preprocessing steps accordingly

* derive meaningful insights from the data


## Dataset

The dataset chosen for this experiment is the **Play Store** dataset, which is publicly available and was created with this [methodology](https://nycdatascience.com/blog/student-works/google-play-store-everything-that-you-need-to-know-about-the-android-market/).

This dataset consists of 10841 records. Each record is made up of 13 fields.

Each record consists of the following fields: App, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, Genres, Last Updated, Current Ver, and Android Ver.

## Problem Statement

Before we can derive any meaningful insights from the Play Store data, it is essential to pre-process the data and make it suitable for further analysis. This pre-processing step forms a major part of data wrangling (or data munging) and ensures better-quality data. It consists of transforming and mapping data from a "raw" form into another format that is more valuable for downstream purposes such as analytics. Data analysts typically spend more time on data wrangling than on the actual analysis of the data.

After data munging is performed, several actionable insights can be derived from the Play Store apps data. Such insights could help to unlock the enormous potential to drive app-making businesses to success.

In [None]:
#@title Download the data
!wget -qq https://cdn.iisc.talentsprint.com/CDS/Datasets/googleplaystore.csv

#### Import required packages

In [None]:
import numpy as np
import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt
from scipy import stats

#### Load the dataset

In [None]:
# YOUR CODE HERE
playstore_data = pd.read_csv("googleplaystore.csv")
playstore_data.head()

## Pre-processing

### Task 1: Data Cleaning

* Check whether there are any null values and decide how you want to handle them.
  
    **Hint:** isna(), dropna(), fillna()
* If there is any duplication of a record, how would you like to handle it?

    Hint: [drop_duplicates](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html)

* Are there any non-English apps? And how to filter them?

* In the `Size` column, multiply entries ending in M by 1,000,000 and entries ending in k by 1,000.
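As a minimal sketch of the first two bullets, the snippet below applies the hinted calls to a small made-up frame. The frame `toy` and its values are assumptions for illustration only, not part of the Play Store data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Play Store data (values are made up).
toy = pd.DataFrame({
    "App": ["A", "A", "B", "C"],
    "Rating": [4.1, 4.1, np.nan, 3.9],
    "Type": ["Free", "Free", "Paid", None],
})

null_counts = toy.isna().sum()   # missing values per column
no_nulls = toy.dropna()          # option 1: drop rows with any null
# option 2: impute instead of dropping (the fill values here are arbitrary choices)
filled = toy.fillna({"Rating": toy["Rating"].mean(), "Type": "Free"})
# keep one row per app name
deduped = toy.drop_duplicates(subset=["App"], keep="first")
```

Dropping rows is the simpler option, while imputing keeps more records at the cost of introducing assumed values.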

In [None]:
playstore_data.isna().sum()

In [None]:
playstore_data.dropna(inplace=True)

In [None]:
# identify the duplicate apps
len(set(playstore_data['App'].values)), playstore_data.shape

In [None]:
# Remove the duplicate apps
playstore_data = playstore_data.drop_duplicates(['App'], keep='first')

In [None]:
# Check for any null values
playstore_data.isnull().sum()

In [None]:
# Check datatype of each column
playstore_data.dtypes

Find out the Non-English Apps

In [None]:
# Detect strings that are mostly made up of non-English (non-ASCII) characters.
# A name is treated as non-English only if more than half of its characters are non-ASCII.
# Ex: English app names that contain emojis or symbols should still count as English.
# The target is to find non-English apps, which may end in a few Latin letters but are mostly non-English.
def is_English(string):
    spl_count = 0
    for character in string:
        if ord(character) > 127:
            spl_count += 1
    if spl_count > len(string) // 2:
        return False
    return True

In [None]:
# Find the Non-English Apps
playstore_data[~playstore_data['App'].apply(is_English)]

In [None]:
# Filter the Non English Apps
playstore_data = playstore_data[playstore_data['App'].apply(is_English)]
playstore_data.shape
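To see how the half-length threshold behaves, the sketch below restates the same rule in a compact helper and runs it on three made-up names. The name `is_english` and the example strings are assumptions for illustration.

```python
def is_english(name):
    """Return False when more than half of the characters are non-ASCII."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= len(name) // 2

# An English name with an emoji, a fully non-ASCII name, and a mixed name
examples = ["Photo Editor ✨", "日本語アプリ", "QR 扫描"]
flags = [is_english(s) for s in examples]
```

The mixed name is kept because exactly two of its five characters are non-ASCII, which does not exceed half the length. This is a heuristic on character codes, not real language detection.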

In the `Size` column, multiply values ending in M by 1,000,000 and values ending in k by 1,000.

In [None]:
playstore_data.Size.value_counts()

In [None]:
playstore_data['Size'] = playstore_data['Size'].apply(lambda x: str(x).replace('Varies with device','NaN') if 'Varies with device' in x else x)
playstore_data['Size'] = playstore_data['Size'].apply(lambda x: float(str(x).rstrip('M'))*(10**6) if 'M' in str(x) else x)
playstore_data['Size'] = playstore_data['Size'].apply(lambda x: float(str(x).rstrip('k'))*(10**3) if 'k' in str(x) else x)
playstore_data = playstore_data[~(playstore_data['Size'] == 'NaN')]
playstore_data['Size'] = playstore_data['Size'].astype(float)
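The same conversion can also be written as one small parsing function. This is only a sketch: `parse_size` is a hypothetical helper, and it assumes every size string ends in `M` or `k` or is otherwise unknown.

```python
import numpy as np
import pandas as pd

def parse_size(text):
    """Convert strings such as '19M' or '201k' to a number (NaN if unknown)."""
    multipliers = {"M": 10**6, "k": 10**3}
    text = str(text).strip()
    if text and text[-1] in multipliers:
        return float(text[:-1]) * multipliers[text[-1]]
    return np.nan  # e.g. 'Varies with device'

sizes = pd.Series(["19M", "201k", "Varies with device", "8.7M"]).map(parse_size)
```

Returning a real `NaN` rather than the string `'NaN'` means the unknown sizes can later be dropped with `dropna(subset=['Size'])` or kept, depending on the analysis.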

## Visualization

### Task 2: Perform the following:

##### Exercise 1: Find the number of apps in various categories by using an appropriate plot.

In [None]:
# YOUR CODE HERE
playstore_data['Category'].nunique()

In [None]:
counts_of_Apps = playstore_data['Category'].value_counts()
counts_of_Apps

In [None]:
counts_of_Apps.index.values

In [None]:
plt.figure(figsize=(10, 15))
plt.pie(counts_of_Apps, labels = counts_of_Apps.index.values, autopct='%1.1f%%')
plt.show()
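A pie chart with many small slices can be hard to read. One possible refinement, sketched below on made-up counts, is to merge small categories into a single slice before plotting. The threshold `min_share` and the label `OTHER` are assumptions for illustration.

```python
import pandas as pd

# Made-up category counts standing in for playstore_data['Category'].value_counts()
counts = pd.Series({"FAMILY": 50, "GAME": 30, "TOOLS": 15, "COMICS": 3, "BEAUTY": 2})

min_share = 0.05  # hypothetical threshold: slices under 5% are merged
share = counts / counts.sum()
major = counts[share >= min_share]
other_total = counts[share < min_share].sum()
grouped = pd.concat([major, pd.Series({"OTHER": other_total})])
```

`grouped` can then be passed to `plt.pie` in place of the full series.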

##### Exercise 2: Explore the distribution of free and paid apps across different categories

**Hint:** Stacked Bar Chart

In [None]:
# YOUR CODE HERE

# Identify the free and paid apps
free_apps = playstore_data[playstore_data.Type == "Free"]
paid_apps = playstore_data[playstore_data.Type == "Paid"]
paid_apps.shape, free_apps.shape

In [None]:
paid_categories = paid_apps['Category'].value_counts()
free_categories = free_apps['Category'].value_counts()
paid_categories

In [None]:
len(free_categories), len(paid_categories)

In [None]:
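# Bar chart showing the distribution of free and paid apps for the top 10 categories
N = 10
idx = np.arange(N)

# Use the same categories, in the same order, for both series so the stacked bars line up
top_categories = free_categories.index[:N]
free_counts = free_categories.loc[top_categories].values
paid_counts = paid_categories.reindex(top_categories, fill_value=0).values

p1 = plt.bar(idx, free_counts)
p2 = plt.bar(idx, paid_counts, bottom=free_counts)

plt.xticks(idx, top_categories, rotation=35)
plt.legend((p1[0], p2[0]), ('Free', 'Paid'))
plt.show()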
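Two separate `value_counts()` results are each sorted by their own counts, so their positions do not necessarily refer to the same category. A cross-tabulation keeps free and paid counts on one shared index. The sketch below uses a made-up frame `toy`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS", "MEDICAL"],
    "Type":     ["Free", "Free", "Paid", "Free", "Paid", "Paid"],
})

# Rows are categories, columns are Free/Paid; missing combinations become 0.
type_by_category = pd.crosstab(toy["Category"], toy["Type"])
# Order by total apps so the tallest stacks come first.
order = type_by_category.sum(axis=1).sort_values(ascending=False).index
type_by_category = type_by_category.loc[order]
# type_by_category.plot(kind="bar", stacked=True) would draw the stacked chart.
```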

##### Exercise 3: Represent the distribution of app rating on a scale of 1-5 using an appropriate plot

**Hint:** histogram / strip plot

In [None]:
# YOUR CODE HERE
ratings = playstore_data['Rating']

plt.hist( ratings, bins=5)
plt.title('Rating Distribution')
plt.xlabel('Ratings')
plt.show()

In [None]:
# 2nd option
# Distribution of Rating using stripplot
sns.set(style="whitegrid")
sns.stripplot(data=ratings, jitter=True, orient='h');


##### Exercise 4: Identify outliers in the rating column by plotting a category-wise boxplot, and handle them.

**Hint:** Removing Outliers using z-score, quantile [link](https://kanoki.org/2020/04/23/how-to-remove-outliers-in-python/)

In [None]:
# YOUR CODE HERE

df_categories = playstore_data.groupby('Category').filter(lambda x: len(x) >= 120)

sns.boxplot(x='Category', y='Rating', data=df_categories);
plt.xticks(rotation=50)
plt.xlabel('Categories',fontsize=17, fontweight='bold', color='#191970', )
plt.ylabel('Ratings', fontsize=17, fontweight='bold', color='#191970')

In [None]:
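def remove_outliers(data):
  # Values more than 3 standard deviations from the mean are treated as outliers.
  # They are replaced with the mean (not dropped), so the number of rows stays the same.
  data_mean, data_std = data.mean(), data.std()
  cut_off = data_std * 3
  lower, upper = data_mean - cut_off, data_mean + cut_off
  outliers_removed = [x if lower < x < upper else data_mean for x in data]
  return outliers_removed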

In [None]:
playstore_data['Rating'] = remove_outliers(playstore_data['Rating'])
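The hint also mentions quantiles. As an alternative sketch, the interquartile-range (IQR) rule flags values outside `[Q1 - k*IQR, Q3 + k*IQR]`. The helper `iqr_mask`, the multiplier `k=1.5`, and the sample ratings are assumptions for illustration.

```python
import pandas as pd

def iqr_mask(values, k=1.5):
    """Boolean mask that is True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values.between(q1 - k * iqr, q3 + k * iqr)

ratings = pd.Series([4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 1.0])
kept = ratings[iqr_mask(ratings)]  # rows outside the fences are dropped
```

Unlike the z-score version, this one drops the flagged rows instead of replacing them, and it is less sensitive to the outliers themselves because quartiles are robust statistics.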

In [None]:
df_categories = playstore_data.groupby('Category').filter(lambda x: len(x) >= 120)

sns.boxplot(x='Category', y='Rating', data=df_categories);
plt.xticks(rotation=50)
plt.xlabel('Categories',fontsize=17, fontweight='bold', color='#191970', )
plt.ylabel('Ratings', fontsize=17, fontweight='bold', color='#191970')

##### Exercise 5: Plot the barplot of all the categories indicating no. of installs

In [None]:
# YOUR CODE HERE

playstore_data['Installs'] = playstore_data['Installs'].str.rstrip('+').str.replace(',','')
playstore_data['Installs'] = playstore_data['Installs'].astype(int) #Converting to int

In [None]:
temp_df = playstore_data.groupby(['Category']).agg({'Installs':'sum'}).sort_values(by='Installs',ascending=False).reset_index()

sns.barplot(x=temp_df['Installs'], y=temp_df['Category'] )

# plt.figure()
plt.yticks(rotation=10)
plt.xlabel('Installs', fontsize=15, color='#191970')
plt.ylabel('Categories', fontsize=15, color='#191970')

## Insights


### Task 3: Derive the below insights

##### Exercise 1: Does the price correlate with the size of the app?

  **Hint:** plot the scatterplot of `Size` and `Price`

In [None]:
playstore_data['Price'].unique()

In [None]:
# YOUR CODE HERE
playstore_data['Price'] = playstore_data['Price'].str.lstrip('$')
playstore_data['Price'] = playstore_data['Price'].astype(float)

sns.lmplot(x='Price', y='Size', data=playstore_data, fit_reg=False)  # scatter plot only, no regression line

**Conclusion:** There is no particular pattern, increasing or decreasing.

*Price does not appear to depend on the size of the app.*
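A scatter plot can be backed up with a correlation coefficient. The sketch below computes a Pearson (linear) and a rank-based (Spearman-style) coefficient on made-up numbers. The frame `toy` is an assumption and says nothing about the real data.

```python
import pandas as pd

toy = pd.DataFrame({
    "Price": [0.0, 0.99, 1.99, 4.99, 9.99],
    "Size":  [5e6, 40e6, 3e6, 25e6, 12e6],
})

# Pearson: strength of the linear association
pearson = toy["Price"].corr(toy["Size"])
# Spearman: Pearson correlation of the ranks, which captures monotonic association
spearman = toy["Price"].rank().corr(toy["Size"].rank())
```

Values close to 0 are consistent with the visual impression of no clear relationship, while values near +1 or -1 would contradict it.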

##### Exercise 2: Find the popular app categories based on rating and no. of installs

**Hint:** [df.groupby.agg()](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.DataFrameGroupBy.agg.html); Taking the average rating could be another approach



In [None]:
# YOUR CODE HERE
popular_categories = playstore_data.groupby(['Category']).agg({'Installs':'sum','Rating':'sum'}).sort_values(by='Rating',ascending=False).reset_index()
popular_categories.head()

In [None]:
# Another approach using the average rating
popular_categories1 = playstore_data.groupby(['Category']).Rating.mean().sort_values(ascending=False).reset_index()
popular_categories1
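Summed ratings grow with the number of apps in a category, so it may help to look at total installs, mean rating, and app count side by side. The sketch below uses pandas named aggregation on a made-up frame. The column names `total_installs`, `mean_rating`, and `n_apps` are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS", "EVENTS"],
    "Installs": [1_000_000, 500_000, 100_000, 50_000, 1_000],
    "Rating":   [4.0, 4.4, 4.1, 3.9, 4.8],
})

summary = (
    toy.groupby("Category")
       .agg(total_installs=("Installs", "sum"),
            mean_rating=("Rating", "mean"),
            n_apps=("Rating", "size"))
       .sort_values(["total_installs", "mean_rating"], ascending=False)
)
```

In this toy example `EVENTS` has the highest mean rating but only one app and very few installs, which shows why a single metric can be misleading.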

##### Exercise 3: How many apps are produced in each year, category-wise?

  * Create a `Year` column by slicing the values of `Last Updated` column and find the Year with most no. of apps produced

    **For example**, slice the year `2017` from `February 8, 2017`

  * Find the categories which have a consistent rating in each year

      **Hint:** `sns.countplot`

In [None]:
# YOUR CODE HERE

# Create a Year column
playstore_data["Year"] = playstore_data['Last Updated'].str[-4:]

playstore_data["Year"].unique()
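Slicing the last four characters works when every date string ends with the year. A slightly more defensive sketch parses the date first. It assumes the strings follow the `February 8, 2017` pattern and English month names.

```python
import pandas as pd

last_updated = pd.Series(["February 8, 2017", "July 31, 2018", "January 1, 2010"])
# format matches strings like 'February 8, 2017'; errors='coerce' turns bad rows into NaT
parsed = pd.to_datetime(last_updated, format="%B %d, %Y", errors="coerce")
years = parsed.dt.year
```

Note that `years` holds integers here, whereas string slicing yields strings such as `"2018"`, so any later comparisons need to use the matching type.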

In [None]:
App2018 = playstore_data[playstore_data["Year"]== "2018"]

plt.title('Apps last updated in 2018, by category')
plt.xticks(rotation = 'vertical')
sns.countplot(hue = 'Year', x = 'Category', data = App2018)

In [None]:
App2017 = playstore_data[playstore_data["Year"]== "2017"]
plt.title('Apps last updated in 2017, by category')
plt.xticks(rotation = 'vertical')
sns.countplot(hue = 'Year', x = 'Category', data = App2017)

##### Exercise 4: Identify the highest paid apps with a good rating

**Assumption:** `App` with a rating equal to, or greater than 4 can be considered as `App with good rating`

In [None]:
topRated = playstore_data[(playstore_data.Rating >= 4.0) & (playstore_data.Type == 'Paid')].sort_values(by='Price',ascending=False)

In [None]:
topRated[['App', 'Price', 'Rating', 'Reviews']].head()

##### Exercise 5: Are the top-rated apps genuine? How about checking the review counts of top-rated apps?

In [None]:
# YOUR CODE HERE
topRated = playstore_data[playstore_data.Rating == playstore_data.Rating.max()].copy()  # copy so the column assignment below does not act on a view
idx_topRate = np.arange(0, len(topRated))

topRated['Reviews'] = topRated['Reviews'].astype(int)
topRated['Reviews']

In [None]:
topRated['Reviews'].max(), topRated['Reviews'].min()

In [None]:
plt.title("Distribution of Review count for top-rated apps")
plt.plot(idx_topRate, topRated['Reviews'])
plt.show()

In [None]:
# Frequency distribution of Reviews count
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(playstore_data[playstore_data.Rating == playstore_data.Rating.max()].Reviews.astype(int), kde=True)

##### Exercise 6: If the number of reviews of an app is very low, what could be the reason for its top rating?

In [None]:
Apps_Below_review_5 = topRated[topRated['Reviews'] < 5]
Free_apps_below_ReviewCount5 = Apps_Below_review_5[Apps_Below_review_5['Type'] == 'Free'].shape[0]
Paid_apps_below_ReviewCount5 = Apps_Below_review_5[Apps_Below_review_5['Type'] == 'Paid'].shape[0]
Free_apps_below_ReviewCount5 , Paid_apps_below_ReviewCount5

# Conclusion: Most of the top-rated apps with fewer than 5 reviews are free, which may be why the few users who reviewed them gave a 5.0 rating.

##### Exercise 7: What is the 95% confidence interval for the rating of apps in the Google Play Store?

In [None]:
data = playstore_data.Rating

# Calculate mean and standard error of the mean (SEM)
mean = np.mean(data)
sem = stats.sem(data)

# Calculate 95% confidence interval
confidence_interval = stats.t.interval(0.95, len(data)-1, loc=mean, scale=sem)

print("Mean:", mean)
print("95% Confidence Interval:", confidence_interval)
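The interval returned by `stats.t.interval` corresponds to mean ± t* × SEM, where t* is the critical value of the t-distribution with n − 1 degrees of freedom. The sketch below reproduces that arithmetic by hand on a small made-up sample.

```python
import numpy as np
from scipy import stats

sample = np.array([4.1, 4.5, 3.8, 4.7, 4.2, 4.0, 4.4, 3.9])  # made-up ratings
n = len(sample)
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)    # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)    # two-sided 95% critical value
manual_ci = (mean - t_crit * sem, mean + t_crit * sem)
```

Note that this is a confidence interval for the mean rating, not a range that contains 95% of individual app ratings.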

##### Exercise 8: Test if there is a statistically significant difference in the ratings between free and paid apps using a t-test

Steps:

* Set the null hypothesis and alternate hypothesis
* Separate the ratings of free and paid apps.
* Perform t-test: Use an independent samples t-test.
* Interpret results based on the p-value, decide whether to reject or fail to reject the null hypothesis.

Step 1: Set the null hypothesis and alternate hypothesis

* Null Hypothesis (H0): There is no difference in mean rating between free and paid apps.
* Alternate Hypothesis (H1): There is a difference in mean rating between free and paid apps.

Separate the ratings of free and paid apps

In [None]:
ratings_free_apps = playstore_data[playstore_data['Type'] == 'Free']['Rating']

In [None]:
ratings_paid_apps = playstore_data[playstore_data['Type'] == 'Paid']['Rating']

Perform t-test - use an independent samples t-test

In [None]:
from scipy.stats import ttest_ind

In [None]:
# Perform Welch's t-test, which does not assume equal variances
# (drop equal_var=False to run the pooled-variance Student's t-test instead)
t_stat, p_value_ttest = ttest_ind(ratings_free_apps, ratings_paid_apps, equal_var=False)

print(f"T-statistic: {t_stat:.4f}")
print(f"P-value for t-test: {p_value_ttest:.4f}")
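One way to choose between the pooled-variance and Welch variants of `ttest_ind` is to check the variances first, for example with Levene's test. The sketch below does this on synthetic data. The groups, the seed, and the 0.05 cut-off are all assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=4.2, scale=0.3, size=500)  # synthetic "free" ratings
group_b = rng.normal(loc=4.2, scale=1.0, size=500)  # synthetic "paid" ratings, wider spread

# Levene's test: a small p-value suggests the variances differ
_, levene_p = stats.levene(group_a, group_b)
# Keep the pooled-variance test only if the variances look similar
equal_var = bool(levene_p >= 0.05)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
```

A simpler design choice is to use Welch's test by default, since it remains valid when the variances happen to be equal.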

Interpret results based on the p-value

In [None]:
alpha = 0.05  # significance level

if p_value_ttest < alpha:
    print("Reject the null hypothesis. There is a statistically significant difference.")
else:
    print("Fail to reject the null hypothesis. There is no statistically significant difference.")