## Objective: To explore various AutoEDA capabilities and perform analysis on a given dataset

### This notebook will focus on DataPrep

# 2. AutoEDA - DataPrep

### Dataset Reference: Loan Prediction dataset from Kaggle

### Features:

* General Overview - Quick insights into all variables in the dataset by calling plot on the dataframe.
* Details about each variable / feature in the dataset by using create_report - overview, variables, interactions, correlations, missing values
* Interactions - scatter plots for a chosen pair of x-axis and y-axis variables
* Correlations between variables - Pearson's Correlation Coefficient, Spearman's Rank Correlation Coefficient, Kendall's Rank Correlation Coefficient
* Missing Values - Bar chart, Spectrum, Heatmap, Dendrogram representations
* We can pick one particular feature and analyze it - Stats, Bar chart, Pie chart, Word Count, Word Frequency etc. as applicable


### When To Use?

* The dataset is large (DataPrep seems to be about 10X faster than Pandas Profiling tools due to its highly optimized Dask-based computing module)
* You need some quick insights about an unknown dataset
* You want a starting point for further EDA

In [None]:
import pandas as pd
import warnings

warnings.filterwarnings("ignore")

In [None]:
!pip --disable-pip-version-check install dataprep  # Run this once if dataprep is not already installed in your environment

In [None]:
from dataprep.eda import create_report, plot, plot_correlation, plot_missing, plot_diff

In [None]:
df_train = pd.read_csv("../input/loan-eligible-dataset/loan-train.csv")

df_train.head()

In [None]:
df_test = pd.read_csv("../input/loan-eligible-dataset/loan-test.csv")

df_test.head()

In [None]:
df_train.shape

In [None]:
df_test.shape

# 2.1 Analyze distributions

* plot(df): plots the distribution of each column and computes dataset statistics
* plot(df, col1): plots the distribution of column col1 in various ways, and computes its statistics
* plot(df, col1, col2): generates plots depicting the relationship between columns col1 and col2

In [None]:
plot(df_train)

In [None]:
# plots the distribution of column x in various ways and calculates column statistics

plot(df_train, "Property_Area")

In [None]:
# generates plots depicting the relationship between columns

plot(df_train, "Property_Area", "Loan_Status")
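
For a plain-text check of the same relationship, a pandas cross-tabulation can complement the plot. The sketch below is not part of DataPrep; it uses a small made-up frame whose rows are illustrative only (the column names mirror the loan dataset, the values do not come from it).

```python
import pandas as pd

# Hypothetical sample rows; only the column names mirror the loan dataset.
sample = pd.DataFrame(
    {
        "Property_Area": ["Urban", "Rural", "Semiurban", "Urban", "Rural", "Semiurban"],
        "Loan_Status": ["Y", "N", "Y", "Y", "Y", "N"],
    }
)

# Count of each Loan_Status value per Property_Area.
counts = pd.crosstab(sample["Property_Area"], sample["Loan_Status"])

# Row-normalised shares: fraction of each Loan_Status within an area.
shares = pd.crosstab(sample["Property_Area"], sample["Loan_Status"], normalize="index")

print(counts)
print(shares.round(2))
```

Replacing `sample` with `df_train` should give the table behind the stacked or grouped bar charts for two categorical columns.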

# 2.2 Analyze correlations

* plot_correlation(df): plots correlation matrices (correlations between all pairs of columns)
* plot_correlation(df, col1): plots the most correlated columns to column col1
* plot_correlation(df, col1, col2): plots the joint distribution of column col1 and column col2 and computes a regression line

In [None]:
# plots correlation matrices (correlations between all pairs of columns)

plot_correlation(df_train)

In [None]:
# plots the columns most correlated with column x
# Column x must be numerical for this analysis

plot_correlation(df_train, "LoanAmount")

In [None]:
# plots the joint distribution of column col1 and column col2 and computes a regression line

plot_correlation(df_train, "LoanAmount", "ApplicantIncome")
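
To see why the Pearson and Spearman matrices can disagree, it helps to compute both by hand on a tiny example. The sketch below uses pandas only, with made-up values; it treats Spearman as Pearson applied to ranks, which is its standard definition. Kendall is left out to keep the example dependency-free.

```python
import pandas as pd

# Hypothetical values: y grows monotonically, but not linearly, with x.
demo = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 8, 27, 64, 125]})

# Pearson measures linear association on the raw values (the pandas default).
pearson = demo["x"].corr(demo["y"])

# Spearman is Pearson computed on the ranks, so a perfectly monotonic
# relationship scores 1 even when it is not linear.
spearman = demo["x"].rank().corr(demo["y"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A large gap between the two coefficients for a pair of columns is a hint that the relationship is monotonic but non-linear, or that outliers are influencing the Pearson value.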

# 2.3 Analyze missing values

* plot_missing(df): plots the amount and position of missing values, and their relationship between columns
* plot_missing(df, col1): plots the impact of the missing values in column col1 on all other columns
* plot_missing(df, col1, col2): plots the impact of the missing values from column col1 on column col2 in various ways.

In [None]:
# plots the amount and position of missing values, and their relationship between columns

plot_missing(df_train)

In [None]:
# plots the impact of the missing values in column col1 on all other columns

plot_missing(df_train, "Credit_History")

In [None]:
# plots the impact of the missing values from column col1 on column col2 in various ways

plot_missing(df_train, "Credit_History", "Loan_Status")
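
The numbers behind these plots can be approximated with plain pandas. The sketch below is an assumption-laden stand-in, not DataPrep's implementation: it counts missing values per column and compares the distribution of one column between rows where another column is missing and rows where it is present. The sample rows are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names mirror the loan dataset.
sample = pd.DataFrame(
    {
        "Credit_History": [1.0, np.nan, 0.0, 1.0, np.nan, 1.0],
        "Loan_Status": ["Y", "N", "N", "Y", "Y", "Y"],
    }
)

# Count and share of missing values per column.
missing_count = sample.isna().sum()
missing_share = sample.isna().mean()

# Label each row by whether Credit_History is missing or present.
history_flag = (
    sample["Credit_History"]
    .isna()
    .map({True: "missing", False: "present"})
    .rename("Credit_History")
)

# Loan_Status distribution within each group, as row-normalised shares.
impact = pd.crosstab(history_flag, sample["Loan_Status"], normalize="index")

print(missing_count)
print(impact)
```

If the two rows of `impact` differ noticeably on the real data, the missingness in `Credit_History` is probably not random with respect to `Loan_Status`.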

# 2.4 Analyze difference between dataframes

* plot_diff(): explores the difference of column distributions and statistics across multiple datasets

In [None]:
# We can analyze differences with plot_diff()
# This is a quick way to get some insights between Train and Test datasets

plot_diff([df_train, df_test])
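
A rough numeric counterpart to this comparison can be built with pandas by placing the summary statistics of both frames side by side. This is a minimal sketch with made-up values standing in for the train and test data; it does not reproduce what plot_diff computes internally.

```python
import pandas as pd

# Hypothetical stand-ins for the train and test frames.
train = pd.DataFrame(
    {"LoanAmount": [100.0, 120.0, 140.0], "ApplicantIncome": [3000.0, 4000.0, 5000.0]}
)
test = pd.DataFrame(
    {"LoanAmount": [110.0, 130.0, 180.0], "ApplicantIncome": [3500.0, 4500.0, 5500.0]}
)

# Put the describe() tables side by side, keyed by dataset name.
summary = pd.concat({"train": train.describe(), "test": test.describe()}, axis=1)

# Difference in column means (test minus train) as a quick shift indicator.
mean_shift = test.mean() - train.mean()

print(summary.loc[["mean", "std"]])
print(mean_shift)
```

Large shifts between the two datasets are worth investigating before modelling, since a model trained on one distribution may not transfer to the other.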

# 2.5 Create Profile Report

* Captures a consolidated report with summary
    * Overview: detect the types of columns in a dataframe
    * Variables: variable type, unique values, distinct count, missing values
    * Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
    * Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
    * Text analysis for length, sample and letter
    * Correlations: highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
    * Missing Values: bar chart, heatmap and spectrum of missing values

In [None]:
create_report(df_train)
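
To make the quantile and descriptive statistics listed above concrete, the sketch below computes several of them with pandas on a made-up series containing one outlier. The estimators pandas uses for skewness and kurtosis may differ from the ones in the DataPrep report, so treat the values as illustrative.

```python
import pandas as pd

# Hypothetical values with one outlier, to contrast robust and non-robust statistics.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0], name="LoanAmount")

# Q1, median and Q3 in one call.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])

stats = {
    "range": values.max() - values.min(),
    "iqr": q3 - q1,  # interquartile range
    "mad": (values - median).abs().median(),  # median absolute deviation
    "cv": values.std() / values.mean(),  # coefficient of variation
    "skewness": values.skew(),
    "kurtosis": values.kurt(),
}

for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

The outlier inflates the range and the coefficient of variation, while the interquartile range and the median absolute deviation stay small, which is why the report shows both kinds of statistics.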

# Interpretation Summary

### Key Features
* Analyze distributions
* Analyze correlations
* Analyze missing values
* Analyze difference between dataframes
* Create profile report

### When to use
* The dataset is large (DataPrep seems to be about 10X faster than Pandas Profiling tools due to its highly optimized Dask-based computing module)
* You need some quick insights about an unknown dataset
* You want a starting point for further EDA

In summary, use DataPrep for preliminary analysis, then build on the charts and visualizations it creates to decide which "actions" to take next.