## Objective: To explore various AutoEDA capabilities and perform analysis on a given dataset

In [None]:
from IPython.display import Image
Image("../input/autoedaimage/AutoEDA.png")

# 1. AutoEDA - Pandas Profiling

### Dataset Reference: Loan Prediction dataset from Kaggle

### Features:

* General Overview - Quick insights of all variables in the dataset
* Details about each variable / feature in the dataset
* Interactions between numeric variables
* Correlations between variables - Pearson's, Spearman's rank, Kendall's rank and Phik correlation coefficients, plus Cramér's V as an association measure for nominal variables
* Missing Values - Count, Matrix, Heatmap, Dendrogram representations
* Sample data - first and last 10 rows
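
To make the Cramér's V entry in the list above more concrete, here is a minimal sketch of the measure in plain Python. It uses the uncorrected textbook formula, so the value shown in a generated report may differ if the library applies a bias correction. The helper name `cramers_v` and the toy columns are illustrative assumptions, not part of Pandas Profiling or the loan dataset.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equal-length lists of category labels."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # observed count of each (x, y) pair
    x_totals = Counter(xs)        # row totals of the contingency table
    y_totals = Counter(ys)        # column totals of the contingency table

    # Pearson chi-squared statistic summed over every cell of the table.
    chi2 = 0.0
    for x, x_count in x_totals.items():
        for y, y_count in y_totals.items():
            expected = x_count * y_count / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(x_totals), len(y_totals)) - 1
    if k == 0:
        return 0.0  # a constant column carries no association
    return math.sqrt(chi2 / (n * k))


# Hypothetical toy columns (made-up values, not rows from the loan dataset).
area = ["Urban"] * 4 + ["Rural"] * 4
status_same = ["Y"] * 4 + ["N"] * 4       # fully determined by area
status_indep = ["Y", "Y", "N", "N"] * 2   # same Y/N split in both areas

print(cramers_v(area, status_same))   # 1.0
print(cramers_v(area, status_indep))  # 0.0
```

A value near 1 means one nominal column almost determines the other, while a value near 0 means the two look independent.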


### When To Use?

* Dataset size is not very large
* Need some quick insights about an unknown dataset
* Use the report as a starting point for deeper, manual EDA

In [None]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import pandas_profiling as pp # newer releases are published as ydata-profiling (import ydata_profiling)

In [None]:
df_train = pd.read_csv("../input/loan-eligible-dataset/loan-train.csv")

df_train.head()

In [None]:
df_test = pd.read_csv("../input/loan-eligible-dataset/loan-test.csv")

df_test.head()

In [None]:
df_train.shape

In [None]:
df_test.shape

In [None]:
pp.ProfileReport(df_train)

# 2. AutoEDA - DataPrep

### Features:

* General Overview - Quick insights of all variables in the dataset by calling plot on the DataFrame
* Details about each variable / feature in the dataset by using create_report - overview, variables, interactions, correlations, missing values
* Interactions - based on x-axis and y-axis scatter plots
* Correlations between variables - Pearson's Correlation Coefficient, Spearman's Rank Correlation Coefficient, Kendall's Rank Correlation Coefficient
* Missing Values - Bar chart, Spectrum, Heatmap, Dendrogram representations
* Single-feature analysis - pick one feature and get Stats, Bar chart, Pie chart, Word Count, Word Frequency etc., as applicable
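
The correlation coefficients listed above answer slightly different questions. The sketch below, in plain Python, contrasts Pearson's coefficient (linear relationship) with Spearman's (monotonic relationship, computed as Pearson on ranks). The function names and the toy numbers are assumptions for illustration only; they are not DataPrep's implementation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson applied to the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical values: y grows monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

print(round(pearson(x, y), 3))  # below 1 because the relationship is curved
print(spearman(x, y))           # 1.0 because the ordering is preserved
```

When the two coefficients disagree noticeably in a report, that is often a hint of a nonlinear but monotonic relationship, or of outliers pulling on Pearson's value.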


### When To Use?

* Dataset size is large (DataPrep seems to be about 10X faster than Pandas Profiling due to its highly optimized Dask-based computing module)
* Need some quick insights about an unknown dataset
* Use the report as a starting point for deeper, manual EDA

In [None]:
!pip install dataprep  # Run once if dataprep is not already installed in your environment

from dataprep.eda import create_report, plot, plot_correlation, plot_missing

In [None]:
plot(df_train)

In [None]:
create_report(df_train)

In [None]:
plot(df_train, "Property_Area")

# 3. AutoEDA - SweetViz

### Features:

* General Overview - Quick insights of all variables in the dataset, with associations / correlations shown as a heatmap (also reports the number of duplicates and of categorical/numerical/text variables)
* Details about each variable / feature in the dataset - missing values, distinct values etc.
* Compares Train and Test datasets
* Provides visualization of target variable in context of train dataset
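
The per-feature details mentioned above (missing and distinct values) boil down to simple counts. As a rough sketch of what such a summary contains, the plain-Python helper below counts both per column. The name `summarize_columns`, the choice of missing markers and the sample rows are hypothetical; SweetViz computes its own, richer statistics.

```python
def summarize_columns(rows, missing_markers=("", None)):
    """Return {column: {"missing": int, "distinct": int}} for a list of dict rows."""
    summary = {}
    for row in rows:
        for column, value in row.items():
            stats = summary.setdefault(column, {"missing": 0, "values": set()})
            if value in missing_markers:
                stats["missing"] += 1        # treat empty string / None as missing
            else:
                stats["values"].add(value)   # track distinct non-missing values
    return {
        column: {"missing": s["missing"], "distinct": len(s["values"])}
        for column, s in summary.items()
    }


# Hypothetical rows using two column names from the loan dataset; values are made up.
rows = [
    {"Property_Area": "Urban", "Loan_Status": "Y"},
    {"Property_Area": "Rural", "Loan_Status": ""},
    {"Property_Area": "Urban", "Loan_Status": "N"},
]

result = summarize_columns(rows)
print(result["Property_Area"])  # {'missing': 0, 'distinct': 2}
print(result["Loan_Status"])    # {'missing': 1, 'distinct': 2}
```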


### When To Use?

* Need some quick insights about an unknown dataset
* Use the report as a starting point for deeper, manual EDA
* Need a quick statistical comparison between the train and test datasets
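
One thing a train/test comparison is useful for is spotting a categorical feature whose mix of values differs between the two sets. The sketch below shows that idea in plain Python by comparing per-category shares. The helper names and the toy lists are assumptions for illustration; this is not how SweetViz is implemented.

```python
from collections import Counter


def category_shares(values):
    """Fraction of rows falling in each category."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def share_gap(train_values, test_values):
    """Per-category difference in share (test minus train)."""
    train = category_shares(train_values)
    test = category_shares(test_values)
    categories = sorted(set(train) | set(test))
    return {c: test.get(c, 0.0) - train.get(c, 0.0) for c in categories}


# Hypothetical Property_Area values, not taken from the loan dataset.
train_area = ["Urban", "Urban", "Rural", "Rural"]
test_area = ["Urban", "Rural", "Rural", "Rural"]

gaps = share_gap(train_area, test_area)
print(gaps)  # {'Rural': 0.25, 'Urban': -0.25}
```

Large gaps suggest the test set is drawn from a different mix than the train set, which is worth checking before trusting any model evaluation.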

In [None]:
!pip install sweetviz # Run once if sweetviz is not already installed in your environment

In [None]:
import sweetviz as sv

In [None]:
analysis_report = sv.analyze(df_train)

In [None]:
# analysis_report.show_html() # This will generate a separate report named SWEETVIZ_REPORT.html
analysis_report.show_notebook(w="100%",h="full")

In [None]:
analysis_report2 = sv.analyze([df_train,'Train'], target_feat='Loan_Status')

In [None]:
analysis_report2.show_notebook(w="100%",h="full")

In [None]:
analysis_report3 = sv.compare([df_train,'Train'],[df_test,'Test'],target_feat='Loan_Status')

analysis_report3.show_notebook(w="100%",h="full")

# Final Interpretation - AutoEDA:

* AutoEDA libraries give quick insights into summary statistics, missing values, duplicates, categorical/numerical/text features, correlations, interactions, the first and last 10 rows, and comparisons between train and test datasets.

* This saves significant time, because the reports produce these statistics and insights with a single call.

* If the dataset is large, we may use "DataPrep", which seems to be 10X faster than Pandas Profiling as it uses Dask-based computing methods.

* We used one sample dataset (Loan Eligible / Loan Prediction dataset) to demonstrate the features of each library. Which one to use depends on the scenario, data context, business need, feasibility etc. The libraries covered here are a non-exhaustive list.

* Some of the AutoEDA libraries in Python are as follows:
    * Pandas Profiling
    * DataPrep
    * SweetViz
    * AutoViz
    * Lux
    * D-Tale


We will keep adding more experiments.

### Please share your feedback, comments and experiences with AutoEDA techniques: what has worked for you, and for which industry use cases.