In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<h2 style=color:green align="left"> Table-of-contents </h2>

* [1) Introduction](#1)

* [2) Load Required Libraries](#2)
* [3) Read Data](#3)
* [4) DataPrep (AutoEDA)](#4)
  *  [a) Telecom Dataset](#a)
     * [4.1) Analyze distributions with plot()](#4.1)
     
     * [4.2) Analyze correlations with plot_correlation()](#4.2)
     * [4.3) Analyze missing values with plot_missing()](#4.3)
     * [4.4) Create a profile report with create_report()](#4.4)
     
  *  [b) Titanic Dataset](#b)
  
  *  [c) COVID-19 in India Dataset](#c)

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> 1) Introduction </h1>

In [None]:
from IPython.display import Image
Image("../input/images/dataprep_01.png",width=400,height=400)

In [None]:
Image("../input/images/dataprep_03.png",width=700,height=800)

### Introduction to Exploratory Data Analysis and dataprep.eda


- DataPrep.EDA is a fast, easy-to-use EDA (Exploratory Data Analysis) tool for Python. It lets you understand a Pandas/Dask DataFrame with a few lines of code, often within seconds.

- You can create a beautiful profile report from a Pandas/Dask DataFrame with the create_report function. DataPrep.EDA has the following advantages compared to other tools:

 - DataPrep.EDA is claimed to be **10-100X faster** than **Pandas-based profiling** tools, thanks to its highly optimized Dask-based computing module.

 - DataPrep.EDA generates interactive visualizations in a report, which makes the report look more appealing to end users.

 - DataPrep.EDA naturally supports big data stored in a Dask cluster by accepting a Dask dataframe as input.

- **Exploratory Data Analysis (EDA)** is the process of exploring a dataset and getting an understanding of its main characteristics. The dataprep.eda package simplifies this process by allowing the user to explore important characteristics with simple APIs. Each API allows the user to analyze the dataset from a high level to a low level, and from different perspectives. Specifically, dataprep.eda provides the following functionality:

#### 1) Analyze distributions with plot()

- **Analyze column distributions with plot().** The function plot() explores the column distributions and statistics of the dataset. It will detect the column type, and then output various plots and statistics that are appropriate for the respective type. The user can optionally pass one or two columns of interest as parameters: If one column is passed, its distribution will be plotted in various ways, and column statistics will be computed. If two columns are passed, plots depicting the relationship between the two columns will be generated.

 - **plot(df):** plots the distribution of each column and calculates dataset statistics

 - **plot(df, x):** plots the distribution of column x in various ways and calculates column statistics

 - **plot(df, x, y):** generates plots depicting the relationship between columns x and y
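
As a rough analogy for the type-dependent behaviour described above, the sketch below picks different summary statistics for numerical and categorical columns using plain pandas. The toy DataFrame and the helper name `summarize_column` are hypothetical, and this is not how dataprep works internally; it only illustrates the idea of choosing the output based on the column type.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def summarize_column(df: pd.DataFrame, column: str) -> dict:
    """Return type-appropriate summary statistics for one column (hypothetical helper)."""
    series = df[column]
    if is_numeric_dtype(series):
        # Numerical column: report location and range.
        return {
            "kind": "numerical",
            "mean": series.mean(),
            "min": series.min(),
            "max": series.max(),
        }
    # Anything else is treated as categorical here: count the distinct values.
    return {
        "kind": "categorical",
        "distinct": series.nunique(),
        "most_common": series.value_counts().idxmax(),
    }


# Toy data standing in for a real dataset.
toy = pd.DataFrame(
    {
        "MonthlyCharges": [20.0, 40.0, 60.0, 80.0],
        "Churn": ["No", "No", "Yes", "No"],
    }
)

numeric_summary = summarize_column(toy, "MonthlyCharges")
categorical_summary = summarize_column(toy, "Churn")
print(numeric_summary)
print(categorical_summary)
```

A real tool also has to deal with dates, free text, and low-cardinality integers, so the two-way split here is a deliberate simplification.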

#### 2) Analyze correlations with plot_correlation()

- **Analyze correlations with plot_correlation().** The function plot_correlation() explores the correlation between columns in various ways and using multiple correlation metrics. By default, it plots correlation matrices with various metrics. The user can optionally pass one or two columns of interest as parameters: If one column is passed, the correlation between this column and all other columns will be computed and ranked. If two columns are passed, a scatter plot and regression line will be plotted.

  - **Metrics:** the correlation matrices are computed with the **Pearson, Spearman, and KendallTau correlation coefficients**.

  - **plot_correlation(df):** plots correlation matrices (correlations between all pairs of columns)

  - **plot_correlation(df, x):** plots the most correlated columns to column x

  - **plot_correlation(df, x, y):** plots the joint distribution of column x and column y and computes a regression line
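
To see why more than one correlation metric is useful, here is a minimal sketch that uses pandas' own `DataFrame.corr` (not dataprep) on a made-up two-column frame. The relationship between `x` and `y` is monotonic but not linear, so the two metrics disagree. KendallTau is left out to keep the example short.

```python
import pandas as pd

# Toy data: y grows monotonically with x, but not linearly.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3

# Pearson measures linear association; Spearman measures monotonic (rank) association.
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Spearman reaches 1 because the ranks of `x` and `y` agree perfectly, while Pearson stays below 1 because the points do not lie on a straight line.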

#### 3) Analyze missing values with plot_missing()

- **Analyze missing values with plot_missing().** The function plot_missing() enables thorough analysis of the missing values and their impact on the dataset. By default, it will generate various plots which display the amount of missing values for each column and any underlying patterns of the missing values in the dataset. To understand the impact of the missing values in one column on the other columns, the user can pass the column name as a parameter. Then, plot_missing() will generate the distribution of each column with and without the missing values from the given column, enabling a thorough understanding of their impact.

 - **plot_missing(df):** plots the amount and position of missing values, and their relationship between columns

 - **plot_missing(df, x):** plots the impact of the missing values in column x on all other columns

 - **plot_missing(df, x, y):** plots the impact of the missing values from column x on column y in various ways.
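
The idea behind analysing the impact of missing values can be sketched with plain pandas. The example below is only an illustration under assumed toy data (the column names and values are made up) and does not reproduce dataprep's plots: it computes the share of missing values per column, then compares one column between the rows where another column is missing and the rows where it is present.

```python
import numpy as np
import pandas as pd

# Toy data with two missing values in 'TotalCharges'.
toy = pd.DataFrame(
    {
        "TotalCharges": [100.0, np.nan, 300.0, np.nan, 500.0, 600.0],
        "tenure": [2, 1, 6, 1, 10, 12],
    }
)

# Fraction of missing values per column.
missing_fraction = toy.isnull().mean()

# Compare 'tenure' between rows where 'TotalCharges' is missing and rows where it is present.
is_missing = toy["TotalCharges"].isnull()
tenure_when_missing = toy.loc[is_missing, "tenure"].mean()
tenure_when_present = toy.loc[~is_missing, "tenure"].mean()

print(missing_fraction)
print("mean tenure, TotalCharges missing:", tenure_when_missing)
print("mean tenure, TotalCharges present:", tenure_when_present)
```

A large gap between the two means suggests that the values are not missing at random, which is the kind of pattern the two-argument and three-argument forms are meant to surface.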

#### 4) Create a profile report with create_report()

 - **create_report()**: generates a comprehensive profile report of the dataset, containing:

 - **Overview:** the detected types of the columns in the dataframe

 - **Variables:** variable type, unique values, distinct count, missing values

 - **Quantile statistics** like minimum value, Q1, median, Q3, maximum, range, interquartile range

 - **Descriptive statistics** like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness

 - **Text analysis** for length, sample and letter

 - **Correlations:** highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices

 - **Missing Values:** bar chart, heatmap and spectrum of missing values

#### 5) Time Series Data Analysis

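 - **plot(covid, "Date", "Confirmed", "State/UnionTerritory", agg='sum'):** plots the summed Confirmed counts over Date, grouped by State/UnionTerritory

 - **plot(eu, "Date", "Confirmed", "State/UnionTerritory", agg='sum', ngroups=50):** the same plot restricted to a single state, where eu = covid.loc[covid['State/UnionTerritory'] == 'Maharashtra']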
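
The aggregation behind such a time series plot can be approximated with a pandas `groupby`. This is a sketch on made-up numbers, not dataprep's implementation (which may additionally bin the dates): it sums `Confirmed` per date and state, then pivots the states into columns so that each column corresponds to one line in a chart.

```python
import pandas as pd

# Toy data in the same shape as the COVID-19 table (values are invented).
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02", "2020-03-02"]
        ),
        "State/UnionTerritory": [
            "Kerala", "Maharashtra", "Kerala", "Maharashtra", "Maharashtra",
        ],
        "Confirmed": [3, 5, 4, 6, 1],
    }
)

# Sum 'Confirmed' per date and state, then turn each state into its own column.
confirmed_by_state = (
    toy.groupby(["Date", "State/UnionTerritory"])["Confirmed"].sum().unstack()
)
print(confirmed_by_state)
```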

In [None]:
Image("../input/images/dataprep_04.png",width=800,height=800)

### I want an overview of the dataset
#### plot(df)

---------------------------------------------------------

### Understand Missing Values
#### plot_missing(df)

--------------------------------------------------------

### Understand Correlation
#### plot_correlation(df)

-------------------------------------------------------

### Understand Numerical Column
#### plot(df, 'Age')

-------------------------------------------------------

### Understand Text Column
#### plot(df, 'Name')

-------------------------------------------------------

### Understand Column Relationship
#### plot(df, 'Age', 'Price')

-------------------------------------------------------

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> 2) Load Required Libraries </h1>

In [None]:
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> 3) Read Data </h1>

In [None]:
telecom = pd.read_csv("/kaggle/input/telecom-users-dataset/telecom_users.csv")
covid = pd.read_csv("/kaggle/input/covid19-in-india/covid_19_india.csv", parse_dates=['Date'], index_col=0)

In [None]:
display(telecom.head())
display(covid.head())

In [None]:
display(telecom.shape)
display(covid.shape)

In [None]:
# info() prints its summary directly and returns None, so wrapping it in display() only adds a stray "None"
telecom.info()
print('-'*75)
covid.info()

In [None]:
display(telecom.isnull().sum())
print('-'*75)
display(covid.isnull().sum())

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> 4) DataPrep (AutoEDA) </h1>

<h1 style="background-color:skyblue; font-family:newtimeroman; font-size:180%; text-align:left;"> a) Telecom Dataset </h1>

<h2 style=color:green align="left"> Table-of-contents </h2>

* [4.1) Analyze distributions with plot()](#4.1)

* [4.2) Analyze correlations with plot_correlation()](#4.2)

* [4.3) Analyze missing values with plot_missing()](#4.3)

* [4.4) Create a profile report with create_report()](#4.4)

In [None]:
!pip install dataprep

In [None]:
from dataprep.eda import plot, plot_correlation, plot_missing, create_report

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 4.1) Analyze distributions with plot() </h1>

 - **a) The function plot()** explores the distributions and statistics of the dataset. The following describes the functionality of plot() for a given dataframe df.


 - **b) plot(df):** plots the distribution of each column and calculates dataset statistics (“I want to see an overview of the dataset”)


 - **c) plot(df, x):** plots the distribution of column x in various ways and calculates column statistics (“I want to understand the column x”)


 - **d) plot(df, x, y):** generates plots depicting the relationship between columns x and y. (“I want to understand the relationship between x and y”)

In [None]:
from dataprep.eda import plot

In [None]:
# plots the distribution of each column and calculates dataset statistics
plot(telecom)

In [None]:
# plots the distribution of column x in various ways and calculates column statistics
plot(telecom, 'MonthlyCharges')

In [None]:
# generates plots depicting the relationship between columns x and y
plot(telecom, 'MonthlyCharges', 'Churn')

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 4.2) Analyze correlations with plot_correlation() </h1>

 - The function **plot_correlation()** explores the **correlation between columns** in various ways and using multiple correlation metrics. It generates correlation matrices using **Pearson, Spearman, and KendallTau correlation** coefficients.

 - **plot_correlation(df):** plots correlation matrices (correlations between all pairs of columns)

 - **plot_correlation(df, x):** plots the most correlated columns to column x

 - **plot_correlation(df, x, y):** plots the joint distribution of column x and column y and computes a regression line

In [None]:
from dataprep.eda import plot_correlation

In [None]:
plot_correlation(telecom)

In [None]:
# Encode Churn as 1/0 so that it can be treated as a numerical column in the correlation plots
telecom["Churn"] = telecom["Churn"].replace({"Yes": 1, "No": 0})

In [None]:
# plots the most correlated columns to column "Churn"
plot_correlation(telecom, 'Churn')

In [None]:
# plots the joint distribution of column x and column y and computes a regression line
plot_correlation(telecom, 'tenure', 'Churn')

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 4.3) Analyze missing values with plot_missing() </h1>

 - The function **plot_missing()** enables thorough analysis of the missing values and their impact on the dataset. The following describes the functionality of plot_missing() for a given dataframe df.

 - **plot_missing(df):** plots the amount and position of missing values, and their relationship between columns (“I want to understand the missing values of the dataset”)

 - **plot_missing(df, x):** plots the impact of the missing values in column x on all other columns

 - **plot_missing(df, x, y):** plots the impact of the missing values from column x on column y in various ways.

In [None]:
from dataprep.eda import plot_missing

In [None]:
plot_missing(telecom)

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 4.4) Create a profile report with create_report() </h1>


- The function **create_report()** generates a **comprehensive profile report** of the dataset. create_report() **combines the individual components** of the dataprep.eda package and outputs them into a nicely formatted **HTML** document. The document contains the following information:

 - **Overview:** the detected types of the columns in the dataframe

 - **Variables:** variable type, unique values, distinct count, missing values

 - **Quantile statistics** like minimum value, Q1, median, Q3, maximum, range, interquartile range

 - **Descriptive statistics** like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness

 - **Text analysis** for length, sample and letter

 - **Correlations:** highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices

 - **Missing Values:** bar chart, heatmap and spectrum of missing values

In [None]:
create_report(telecom)

<h1 style="background-color:skyblue; font-family:newtimeroman; font-size:180%; text-align:left;"> b) Titanic Dataset </h1>

In [None]:
from dataprep.datasets import load_dataset
titanic = load_dataset("titanic")
titanic.head()

<h1 style="background-color:magenta; font-family:newtimeroman; font-size:180%; text-align:left;"> Univariate Analysis </h1>

In [None]:
plot(titanic, 'sex', bins=26)

<h1 style="background-color:magenta; font-family:newtimeroman; font-size:180%; text-align:left;"> Bivariate Analysis </h1>
 
 - Numerical and Numerical
 
 - Categorical and Categorical
 
 - Numerical and Categorical
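
The three pair types listed above can be told apart with a small dtype check. The helper `pair_kind` and the toy rows below are hypothetical and much simpler than the type detection a tool like dataprep performs; for example, a 0/1 column counts as numerical by dtype here, although a tool may reasonably choose to treat it as categorical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def pair_kind(df: pd.DataFrame, x: str, y: str) -> str:
    """Label a column pair using dtypes only (hypothetical helper)."""
    kinds = [
        "Numerical" if is_numeric_dtype(df[col]) else "Categorical"
        for col in (x, y)
    ]
    if kinds[0] == kinds[1]:
        return f"{kinds[0]} and {kinds[1]}"
    # Mixed pair, regardless of the order in which the columns were given.
    return "Numerical and Categorical"


# Toy rows with Titanic-like column names (values are illustrative).
toy = pd.DataFrame(
    {
        "age": [22.0, 38.0, 26.0],
        "fare": [7.25, 71.28, 7.93],
        "sex": ["male", "female", "female"],
        "class": ["Third", "First", "Third"],
    }
)

print(pair_kind(toy, "age", "fare"))
print(pair_kind(toy, "sex", "class"))
print(pair_kind(toy, "sex", "age"))
```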

In [None]:
# Numerical and Numerical
plot(titanic, 'age', 'fare')

In [None]:
# Categorical and Numerical ('sex' is categorical, 'survived' is a 0/1 flag)
plot(titanic, 'sex', 'survived')

In [None]:
# Categorical and Categorical
plot(titanic, 'sex', 'class')

In [None]:
# Numerical and Categorical
plot(titanic, 'sex', 'age')

In [None]:
# Numerical and Categorical
plot(titanic, 'survived', 'class')

In [None]:
# Missing values
plot_missing(titanic)

In [None]:
create_report(titanic)

<h1 style="background-color:skyblue; font-family:newtimeroman; font-size:180%; text-align:left;"> c) COVID-19 in India Dataset (Time Series Data Analysis) </h1>

In [None]:
plot(covid, "Date", "Confirmed", "State/UnionTerritory", agg='sum')

In [None]:
eu = covid.loc[covid['State/UnionTerritory'] == 'Maharashtra']
plot(eu, "Date", "Confirmed", "State/UnionTerritory", agg='sum', ngroups=50)

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> If you like the kernel, don't forget to upvote! </h1>