<a href="https://colab.research.google.com/github/subho99/Computational-Data-Science/blob/main/M0_NB_MiniProject_01_Data_Munging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Program in Computational Data Science

##  A program by IISc and TalentSprint

### Mini Project Notebook 1 : Data munging

(Ungraded Mini-Project)

## Learning Objectives



At the end of the experiment, you will be able to


* understand the requirements for a “clean” dataset, ready for use in statistical analysis.

* use Python libraries like Pandas, Numpy, and Matplotlib to perform the  data-preprocessing steps accordingly.

* derive meaningful insights from the data


## Dataset

The dataset chosen for this experiment is the **Play Store** dataset, which is publicly available and was created with this [methodology](https://nycdatascience.com/blog/student-works/google-play-store-everything-that-you-need-to-know-about-the-android-market/).

This dataset consists of 10841 records. Each record consists of 13 fields (features).

**For example**, one record consists of App, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, Genres, Last Updated, Current Ver, and Android Ver.

## Problem Statement

Before we can derive any meaningful insights from the Play Store data, it is essential to pre-process the data and make it suitable for further analysis. This pre-processing step forms a major part of data wrangling (or data munging) and ensures better quality data. It consists of transforming and mapping data from a "raw" form into another format that is more valuable for downstream purposes such as analytics and modelling. Data analysts typically spend a sizeable amount of time on data wrangling compared with the actual analysis of the data.

After data munging is performed, several actionable insights can be derived from the Play Store apps data. Such insights could help to unlock the enormous potential to drive app-making businesses to success.

In [None]:
#@title Download the data
!wget -qq https://cdn.iisc.talentsprint.com/CDS/Datasets/googleplaystore.csv
print("Data downloaded successfully!")

#### Load the dataset

In [None]:
# YOUR CODE HERE
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
data_set = pd.read_csv('googleplaystore.csv')
data_set.info()

## Pre-processing

There are several steps involved in data pre-processing:

    1. Data Cleaning → In this step the primary focus is on
        - Handling missing data
        - Handling noisy data
        - Detecting and removing outliers
    
    2. Data Integration → This step is used when data is gathered from various sources
    and combined into one consistent dataset. The integrated data, once cleaned, is used
    for analysis.
    
    3. Data Transformation → In this step we convert the raw data into the format
    required by the model we are building. Common transformation options include:
        - Normalization
        - Aggregation
        - Generalization
        
    4. Data Reduction → After transformation and scaling, redundancy within the data
    is removed and the data is organized efficiently.

### Task 1: Data Cleaning

* Check whether there are any null values and decide how you want to handle them.
  
    **Hint:** isna(), dropna(), fillna()
* If any record is duplicated, how would you like to handle it?

    **Hint:** [drop_duplicates](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html)

* Are there any non-English apps? How would you filter them?

* In the `Size` column, convert each value to a number: multiply by 1,000,000 if the cell ends in M and by 1,000 if it ends in K.
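The last two questions have no hint, so here is a minimal sketch of one possible approach on a toy frame. The helper names `size_to_number` and `is_mostly_ascii`, the threshold `max_non_ascii`, and the toy rows are all hypothetical; counting non-ASCII characters is only a rough heuristic for "English" and will misclassify some names.

```python
import pandas as pd

def size_to_number(size):
    """Convert strings such as '19M' or '201k' to a float; anything else becomes NaN."""
    multipliers = {"M": 1_000_000, "K": 1_000}
    text = str(size).strip()
    suffix = text[-1:].upper()  # accept both 'k' and 'K'
    if suffix in multipliers:
        try:
            return float(text[:-1].replace(",", "")) * multipliers[suffix]
        except ValueError:
            return float("nan")
    return float("nan")

def is_mostly_ascii(name, max_non_ascii=2):
    """Heuristic (assumption): a name with few non-ASCII characters is treated as English."""
    return sum(ord(ch) > 127 for ch in str(name)) <= max_non_ascii

# Toy rows standing in for the real dataset
toy = pd.DataFrame({
    "App": ["Photo Editor", "日本語の勉強アプリ", "Café Finder"],
    "Size": ["19M", "201k", "Varies with device"],
})

toy["Size_num"] = toy["Size"].apply(size_to_number)
english_only = toy[toy["App"].apply(is_mostly_ascii)]
```

Allowing a couple of non-ASCII characters keeps names such as "Café Finder" while dropping names written mostly in another script. Sizes that are not numeric (for example "Varies with device") become `NaN` and can then be handled like any other missing value.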

In [None]:
# Checking for missing data: count the null values in each column
data_set.isnull().sum()

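Once the missing values are located, they still need to be handled. The sketch below shows two common options on a toy frame (the rows are made up): dropping rows and imputing with the median. Which columns to drop or impute in the real data is a judgment call, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real dataset
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.5, np.nan, 3.9, 4.1],
    "Type": ["Free", "Paid", None, "Free"],
})

# Count missing values per column
missing_per_column = toy.isna().sum()

# Option 1: drop rows that lack a value we cannot impute sensibly
cleaned = toy.dropna(subset=["Type"]).copy()

# Option 2: fill numeric gaps with the column median
cleaned["Rating"] = cleaned["Rating"].fillna(cleaned["Rating"].median())
```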
In [None]:
# Checking for duplicate data based on all columns
duplicates = data_set[data_set.duplicated()]
print("Duplicate Rows :")
duplicates

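Listing the duplicates is only half of the task. A minimal sketch of removing them with `drop_duplicates` follows, using made-up rows. Whether to deduplicate on every column or on the app name alone is an assumption to settle for the real data.

```python
import pandas as pd

# Toy rows standing in for the real dataset
toy = pd.DataFrame({
    "App": ["A", "A", "B", "B"],
    "Reviews": [10, 10, 5, 7],
})

# Remove rows that are identical in every column, keeping the first occurrence
exact = toy.drop_duplicates()

# Stricter variant: one row per app name, keeping the last-listed row
per_app = toy.drop_duplicates(subset="App", keep="last")
```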
## Visualization

### Task 2: Perform the following:

##### Exercise 1: Find the number of apps in various categories by using an appropriate plot.

In [None]:
#Getting data on unique categories in dataset
import matplotlib.pyplot as plt

cat = data_set.Category.unique()

cat

In [None]:
# Plotting data based on category

plt.figure(figsize=(12,12))
most_cat = data_set.Category.value_counts()
sns.barplot(x = most_cat.values, y = most_cat.index)

##### Exercise 2: Explore the distribution of free and paid apps across different categories

**Hint:** Stacked Bar Chart

In [None]:
# TYPE per Category

data_set.Type.unique()



In [None]:
# Assign back instead of calling inplace on a column selection, which may not update data_set in newer pandas
data_set['Type'] = data_set['Type'].replace('0', 'Free').fillna('Free')

In [None]:
print(data_set.groupby('Category')['Type'].value_counts())
# Stacked horizontal bars: one bar per category, split into Free and Paid
Type_Cat = data_set.groupby('Category')['Type'].value_counts().unstack().plot.barh(stacked = True, figsize = (10,20), width = 0.7)
plt.show()

##### Exercise 3: Represent the distribution of app rating on a scale of 1-5 using an appropriate plot

**Hint:** histogram / strip plot

In [None]:
# Inspect the unique rating values before plotting their distribution

data_set.Rating.unique()

In [None]:
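# 19.0 lies outside the 1-5 scale; treat it as a misplaced decimal point
data_set['Rating'] = data_set['Rating'].replace(19.0, 1.9)
# sns.distplot is deprecated; histplot with kde=True draws the histogram plus a density curve
sns.histplot(data_set['Rating'].dropna(), kde = True)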

In [None]:
cat_rate = sns.FacetGrid(data_set, col = 'Category', palette = 'Set1', col_wrap = 5, height = 5)
# kdeplot + rugplot replace the deprecated distplot(hist=False, rug=True)
cat_rate.map(sns.kdeplot, "Rating", color = 'r')
cat_rate.map(sns.rugplot, "Rating", color = 'r')

In each panel, the horizontal axis shows the rating and the vertical axis shows the density of apps at that rating.

In [None]:
# Mean Rating plot category wise

plt.figure(figsize=(12,12))
mean_rat = data_set.groupby(['Category'])['Rating'].mean().sort_values(ascending = False)
sns.barplot(x = mean_rat.values, y = mean_rat.index)

##### Exercise 4: Identify outliers in the rating column by plotting a category-wise boxplot, and handle them.

**Hint:** Removing Outliers using z-score, quantile [link](https://kanoki.org/2020/04/23/how-to-remove-outliers-in-python/) 
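As one possible take on the quantile approach from the hint, the sketch below flags ratings outside 1.5 × IQR within each category, which is the same rule a default boxplot uses for its whiskers. The helper `iqr_mask`, the factor `k`, and the toy rows are assumptions for illustration, not the notebook's prescribed solution.

```python
import pandas as pd

# Toy rows standing in for the real dataset; the 1.0 in GAME is the planted outlier
toy = pd.DataFrame({
    "Category": ["GAME"] * 6 + ["TOOLS"] * 5,
    "Rating": [4.4, 4.5, 4.5, 4.6, 4.7, 1.0, 4.0, 4.1, 4.2, 4.3, 4.4],
})

def iqr_mask(ratings, k=1.5):
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = ratings.quantile(0.25), ratings.quantile(0.75)
    iqr = q3 - q1
    return ratings.between(q1 - k * iqr, q3 + k * iqr)

# Evaluate the rule separately for each category, then keep only the inliers
keep = toy.groupby("Category")["Rating"].transform(iqr_mask).astype(bool)
no_outliers = toy[keep]
```

On the real data, `sns.boxplot(data=data_set, x="Rating", y="Category")` would show the same points as fliers beyond the whiskers. Dropping them is one choice; capping them at the bounds is another.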

In [None]:
# YOUR CODE HERE

##### Exercise 5: Plot the barplot of all the categories indicating no. of installs

In [None]:
# INSTALL Dataset

data_set.Installs.unique()

In [None]:
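# Map the malformed entries to '0+' so every value has the form '<number>+'
data_set['Installs'] = data_set['Installs'].replace(['0', 'Free'], '0+')

# Drop thousands separators and the trailing '+', then convert to float
data_set['Installs'] = (
    data_set['Installs'].str.replace(',', '', regex = False).str.rstrip('+').astype(float)
)

# sns.distplot is deprecated; histplot draws the same distribution
sns.histplot(data_set['Installs'], bins = 50)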

In [None]:
#Distributed value of Installs of Each Category

inst = sns.FacetGrid(data_set, col = 'Category', palette = 'Set1', col_wrap = 5, height = 4)
inst = (inst.map(plt.hist, "Installs", bins = 5, color = 'c'))

In [None]:
#Total Installs

plt.figure(figsize = (12,12))
tot_inst = data_set.groupby(['Category'])['Installs'].sum().sort_values(ascending = False)
sns.barplot(x = tot_inst.values, y = tot_inst.index)

## Insights


### Task 3: Derive the below insights

##### Exercise 1: Does the price correlate with the size of the app?

  **Hint:** plot the scatterplot of `Size` and `Price`
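A scatterplot gives the visual answer; a correlation coefficient gives a number to go with it. The sketch below assumes `Size` has already been converted to a number and that `Price` is stored as text with a leading dollar sign; the toy values are made up.

```python
import pandas as pd

# Toy rows standing in for the real dataset
toy = pd.DataFrame({
    "Size": [2_000_000.0, 15_000_000.0, 40_000_000.0, 90_000_000.0],
    "Price": ["0", "$0.99", "$2.99", "$4.99"],
})

# Strip the dollar sign so Price can be treated as a number
toy["Price"] = toy["Price"].str.replace("$", "", regex=False).astype(float)

# Pearson correlation between the two numeric columns (ranges from -1 to 1)
corr = toy["Size"].corr(toy["Price"])
```

With both columns numeric, `sns.scatterplot(data=toy, x="Size", y="Price")` would draw the plot the hint asks for. The strong correlation in this toy frame is an artifact of the invented values and says nothing about the real data.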

In [None]:
# YOUR CODE HERE

##### Exercise 2: Find the popular app categories based on rating and no. of installs

**Hint:** [df.groupby.agg()](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.DataFrameGroupBy.agg.html); Taking the average rating could be another approach
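The `groupby().agg()` hint can be sketched as follows on made-up rows. Ranking by total installs first and mean rating second is an assumption; "popular" could reasonably be defined in other ways.

```python
import pandas as pd

# Toy rows standing in for the cleaned dataset
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS", "ART"],
    "Rating": [4.5, 4.1, 4.0, 4.4, 4.8],
    "Installs": [1_000_000.0, 500_000.0, 100_000.0, 50_000.0, 10_000.0],
})

# One row per category: average rating, total installs, and number of apps
popularity = (
    toy.groupby("Category")
       .agg(mean_rating=("Rating", "mean"),
            total_installs=("Installs", "sum"),
            n_apps=("Rating", "size"))
       .sort_values(["total_installs", "mean_rating"], ascending=False)
)
```

Keeping `n_apps` alongside the averages helps when reading the result, since a high mean rating over a handful of apps is weaker evidence than the same mean over hundreds.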

In [None]:
# YOUR CODE HERE

##### Exercise 3: How many apps are produced in each year, category-wise?

  * Create a `Year` column by slicing the values of the `Last Updated` column and find the year with the most apps produced

    **For example**, slice the year `2017` from `February 8, 2017` 

  * Find the categories which have a consistent rating in each year

      **Hint:** `sns.countplot`
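Instead of slicing the string by position, the year can also be extracted by parsing the date. The sketch below assumes the `Last Updated` values follow the `February 8, 2017` pattern from the example; `errors="coerce"` turns anything else into a missing value. Note that this column records the last update, so it is only a proxy for the year an app was produced. The toy rows are made up.

```python
import pandas as pd

# Toy rows standing in for the real dataset
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS"],
    "Last Updated": ["February 8, 2017", "July 31, 2018",
                     "August 1, 2018", "June 20, 2018"],
})

# Parse the date text and keep only the year
parsed = pd.to_datetime(toy["Last Updated"], format="%B %d, %Y", errors="coerce")
toy["Year"] = parsed.dt.year

# Number of apps per year and category, with 0 where a combination is absent
apps_per_year = toy.groupby(["Year", "Category"]).size().unstack(fill_value=0)

# Year with the most apps overall
busiest_year = toy["Year"].value_counts().idxmax()
```

On the real data, `sns.countplot(data=data_set, x="Year", hue="Category")` would plot the same counts.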

In [None]:
# YOUR CODE HERE

##### Exercise 4: Identify the highest paid apps with a good rating
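One way to read this exercise is "filter to paid apps above a rating threshold, then sort by price". The sketch below does that on made-up rows; the threshold `GOOD_RATING` is a hypothetical choice, and `Price` is assumed to be numeric already.

```python
import pandas as pd

# Toy rows standing in for the cleaned dataset
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Type": ["Paid", "Paid", "Free", "Paid"],
    "Price": [399.99, 4.99, 0.0, 29.99],
    "Rating": [3.6, 4.7, 4.8, 4.5],
})

GOOD_RATING = 4.5  # hypothetical cut-off for a "good" rating

# Keep paid apps at or above the threshold, then list the most expensive first
good_paid = toy[(toy["Type"] == "Paid") & (toy["Rating"] >= GOOD_RATING)]
top_priced = good_paid.sort_values("Price", ascending=False).head(10)
```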

In [None]:
# YOUR CODE HERE

##### Exercise 5: Are the top-rated apps genuine? How about checking the review counts of top-rated apps?
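A simple check is to put the review count next to each top rating. The sketch below uses made-up rows; the rating cut-off of 4.9 and the constant `MIN_REVIEWS` are hypothetical thresholds, and `Reviews` is assumed to arrive as text, hence the `pd.to_numeric` step.

```python
import pandas as pd

# Toy rows standing in for the real dataset
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [5.0, 5.0, 4.9, 4.6],
    "Reviews": ["3", "12", "250000", "1200000"],
})

# Convert review counts to numbers; unparseable entries become NaN
toy["Reviews"] = pd.to_numeric(toy["Reviews"], errors="coerce")

# Top-rated apps, most-reviewed first
top_rated = toy[toy["Rating"] >= 4.9].sort_values("Reviews", ascending=False)

MIN_REVIEWS = 1000  # hypothetical bar for a rating backed by enough reviews
well_supported = top_rated[top_rated["Reviews"] >= MIN_REVIEWS]
```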

In [None]:
# YOUR CODE HERE

##### Exercise 6: If the number of reviews of an app is very low, what could be the reason for its top rating?
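One plausible explanation is small-sample noise: an average over a handful of reviews can easily land on an extreme value. The sketch below groups made-up apps into review-count buckets and compares the spread of ratings in each; the bucket edges and labels are arbitrary assumptions.

```python
import pandas as pd

# Toy rows standing in for the cleaned dataset
toy = pd.DataFrame({
    "Rating": [5.0, 5.0, 1.0, 5.0, 4.4, 4.5, 4.3, 4.6],
    "Reviews": [1, 2, 3, 4, 5_000, 20_000, 90_000, 1_500_000],
})

# Hypothetical bucket edges for the number of reviews
bins = [0, 10, 1_000, 100_000, float("inf")]
labels = ["<=10", "11-1k", "1k-100k", ">100k"]
toy["Review bucket"] = pd.cut(toy["Reviews"], bins=bins, labels=labels,
                              include_lowest=True)

# Count, mean, and spread of ratings within each observed bucket
summary = (
    toy.groupby("Review bucket", observed=True)["Rating"]
       .agg(["count", "mean", "std"])
)
```

If the real data shows a much larger standard deviation in the lowest bucket, as this toy frame does by construction, that would support the small-sample explanation.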

In [None]:
# YOUR CODE HERE