# Introduction

## Exploration and storytelling are the most important parts of any data science workflow. Many visualization libraries can be used to plot and explore data. Here I look at tools that automate some basic EDA and at how they can help data scientists tell their story better.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [None]:
train_data = pd.read_csv('/kaggle/input/loan-prediction/train_loan.csv')

### Let's get a baseline understanding of the data so that we can judge how well each library captures it.

In [None]:
train_data.head()

In [None]:
train_data.info()

### So here we have a good variety of data: numerical, textual, unique IDs, categorical text, and categorical numeric columns. Good.
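To make this baseline explicit, here is a minimal sketch of a per-column summary built with plain pandas. The helper name `baseline_summary` and the toy frame are hypothetical and only stand in for `train_data`; the column names in the toy frame are assumptions for illustration.

```python
import pandas as pd


def baseline_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing count, and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(dropna=True),
    })


# Hypothetical toy frame standing in for train_data
demo = pd.DataFrame({
    "Loan_ID": ["LP1", "LP2", "LP3", "LP4"],
    "Gender": ["Male", "Female", None, "Male"],
    "LoanAmount": [120.0, None, 66.0, 120.0],
    "Credit_History": [1, 0, 1, 1],
})

summary = baseline_summary(demo)
print(summary)
```

Calling `baseline_summary(train_data)` on the real data should give a compact table to compare against what each library reports.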

# 1. Pandas Profiling

#### This is a well known library for EDA. Let's see what it captures for this dataset.

In [None]:
from pandas_profiling import ProfileReport
profile = ProfileReport(train_data, title="Pandas Profiling Report")

In [None]:
profile.to_widgets()

#### What does pandas profiling tell us about the data?
### 1. Overview tab - This tab gives a bird's-eye view of the dataset.
#### What stood out?
The Warnings tab stood out to me. It surfaces dataset-level insights, such as high cardinality in some columns or too many zeros, which are good indicators of where attention is needed.
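As a rough illustration of what such warnings check, the sketch below flags high-cardinality and zero-heavy columns with plain pandas. The function name `simple_warnings`, both thresholds, and the toy column names are assumptions chosen for this example; they are not the rules pandas profiling itself applies.

```python
import pandas as pd


def simple_warnings(df, cardinality_threshold=0.5, zero_threshold=0.5):
    """Flag columns that look high-cardinality or zero-heavy (thresholds are arbitrary)."""
    warnings = []
    n_rows = len(df)
    if n_rows == 0:
        return warnings
    for col in df.columns:
        series = df[col]
        is_numeric = pd.api.types.is_numeric_dtype(series)
        # Non-numeric column where most values are distinct
        if not is_numeric and series.nunique() / n_rows > cardinality_threshold:
            warnings.append((col, "high cardinality"))
        # Numeric column where most values are exactly zero
        if is_numeric and (series == 0).mean() > zero_threshold:
            warnings.append((col, "many zeros"))
    return warnings


# Hypothetical toy frame standing in for train_data
demo = pd.DataFrame({
    "Loan_ID": ["LP1", "LP2", "LP3", "LP4"],
    "Gender": ["Male", "Female", "Male", "Male"],
    "CoapplicantIncome": [0.0, 0.0, 1500.0, 0.0],
})

found = simple_warnings(demo)
print(found)
```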

### 2. Variables Tab - Contains a drop-down section for each variable/column in the dataset, with important details such as distribution, uniqueness, missing count, and datatype.
#### What stood out?
Variables containing only 0s and 1s, and variables with Yes/No or even Y/N values, are treated as Boolean. Other variables with only two categories (Gender and Education in this example) are marked as categorical instead. That level of type identification is commendable.

### 3. Interactions Tab - Takes all continuous variables and shows the interactions among them in sub-tabs.

### 4. Correlations Tab - Takes all numeric variables and shows different sorts of correlations between them.

### 5. Missing Tab - Shows different visualisations of the missing values.
#### What stood out?
The heatmap seems very useful for deciding which kind of imputation to apply.
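As far as I understand, this heatmap shows nullity correlation, that is, how strongly the presence of missing values in one column goes together with missing values in another. The sketch below reproduces that idea with plain pandas. The helper `nullity_correlation` and the toy values are hypothetical, and the library's exact computation may differ.

```python
import numpy as np
import pandas as pd


def nullity_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Correlate the missing-value indicators of columns that are partly missing."""
    null_mask = df.isna()
    # Columns that are never (or always) missing have no variance, so skip them
    varying = null_mask.loc[:, null_mask.nunique() > 1]
    return varying.astype(int).corr()


# Hypothetical toy frame standing in for train_data
demo = pd.DataFrame({
    "LoanAmount": [120.0, np.nan, 66.0, np.nan, 141.0],
    "Loan_Amount_Term": [360.0, np.nan, 360.0, np.nan, 360.0],
    "Credit_History": [1.0, 1.0, np.nan, 1.0, 1.0],
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000],
})

nullity = nullity_correlation(demo)
print(nullity.round(2))
```

A value near 1 means two columns tend to be missing in the same rows, which suggests they could be imputed together or that the missingness has a common cause.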

### 6. Sample Tab - Shows the first and last few rows of the dataset.

#### You can export this as an HTML report for presentation purposes with this command:

In [None]:
profile.to_file("test_report.html")

# 2. SweetViz

#### At the time of writing this notebook, Kaggle doesn't provide this library built in, so it has to be installed as shown below. Make sure the Internet option in the Settings pane is enabled for this to work.

In [None]:
!pip install sweetviz

In [None]:
import sweetviz as sv
first_report = sv.analyze(train_data)
first_report.show_notebook() 

#### 1. The Sweetviz dashboard looks very attractive.
#### 2. Right on the top pane there is a set of statistics and information regarding the dataset at a larger level.
#### 3. Each column is explained in its own tab, with plenty of statistical detail and distribution plots. Missing-value details are also included. On the right there is detailed correlation analysis for the selected variable.
#### 4. For numeric columns, details such as the most frequent values are shown, which stood out.
#### 5. The association heatmap is highly intuitive. I really like how the size and shape of the fill for each correlation change as the level of correlation changes. This is quite different from the generic heatmaps we usually see.

In addition to showing the traditional numerical correlations, it combines in a single graph the numerical correlation, the uncertainty coefficient (for categorical-categorical pairs), and the correlation ratio (for categorical-numerical pairs). Squares represent associations that involve a categorical feature, and circles represent numerical-numerical correlations.
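For intuition about the categorical-numerical case, here is a minimal sketch of the textbook correlation ratio (eta): the square root of the between-group sum of squares divided by the total sum of squares. The helper name and toy values are hypothetical, and Sweetviz's own implementation may differ in its details.

```python
import numpy as np
import pandas as pd


def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Textbook correlation ratio: sqrt(between-group SS / total SS)."""
    data = pd.DataFrame({"cat": categories, "val": values}).dropna()
    overall_mean = data["val"].mean()
    grouped = data.groupby("cat")["val"]
    # Spread of the group means around the overall mean, weighted by group size
    ss_between = (grouped.count() * (grouped.mean() - overall_mean) ** 2).sum()
    ss_total = ((data["val"] - overall_mean) ** 2).sum()
    if ss_total == 0:
        return 0.0
    return float(np.sqrt(ss_between / ss_total))


# Hypothetical toy data: the category fully determines the value in the first case
education = pd.Series(["Graduate", "Graduate", "Not Graduate", "Not Graduate"])
income_separated = pd.Series([6000, 6000, 3000, 3000])
income_mixed = pd.Series([6000, 3000, 6000, 3000])

eta_separated = correlation_ratio(education, income_separated)
eta_mixed = correlation_ratio(education, income_mixed)
print(eta_separated, eta_mixed)
```

A value of 1 means the category fully explains the numeric variable, and 0 means the group means are identical.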

IMPORTANT: categorical-categorical associations (provided by the uncertainty coefficient) are ASYMMETRICAL, meaning that each row represents how much the row title (on the left) gives information on each column.
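The asymmetry is easier to see with a small worked example. Below is a minimal sketch of the uncertainty coefficient (Theil's U) based on the standard entropy definition, U(X|Y) = (H(X) - H(X|Y)) / H(X). The helper names and the toy columns are hypothetical, and this is not Sweetviz's own code.

```python
import numpy as np
import pandas as pd


def entropy(series: pd.Series) -> float:
    """Shannon entropy (natural log) of a categorical series."""
    probs = series.value_counts(normalize=True)
    return float(-(probs * np.log(probs)).sum())


def uncertainty_coefficient(x: pd.Series, y: pd.Series) -> float:
    """U(X|Y): fraction of the uncertainty in x that is removed by knowing y."""
    data = pd.DataFrame({"x": x, "y": y}).dropna()
    h_x = entropy(data["x"])
    if h_x == 0:
        return 1.0  # x is constant, so nothing is left to explain
    # Conditional entropy H(X|Y): entropy of x within each y group, weighted by group size
    h_x_given_y = sum(
        len(group) / len(data) * entropy(group["x"])
        for _, group in data.groupby("y")
    )
    return (h_x - h_x_given_y) / h_x


# Hypothetical toy data: the area determines the flag, but not the other way round
area = pd.Series(["Urban", "Urban", "Rural", "Rural", "Semiurban", "Semiurban"])
is_urban = pd.Series(["Y", "Y", "N", "N", "N", "N"])

u_flag_given_area = uncertainty_coefficient(is_urban, area)
u_area_given_flag = uncertainty_coefficient(area, is_urban)
print(round(u_flag_given_area, 3), round(u_area_given_flag, 3))
```

Knowing the area tells us the flag completely, while knowing the flag only partly narrows down the area, so the two directions give different values.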

## 2.1 Support Functions

### 1. The compare() function facilitates a comparative study between different parts of the same data. This can be very helpful when comparing test and train datasets. Specifying the target variable is also useful in this case.

#### Below is an example that splits the data by position: the rows from index 500 onward versus the first 500 rows.

In [None]:
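# Part 1: rows from position 500 onward; Part 2: the first 500 rows
# (equivalent to the original slice as long as the dataset has more than 500 rows)
compare_report = sv.compare([train_data[500:], "Part 1"], [train_data[:500], "Part 2"],
                           "Loan_Status")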
compare_report.show_notebook()

* Notice that two different colours are used in every visualisation and data representation. Each colour corresponds to one of the two datasets.
* The top pane shows stats about the two datasets and also indicates which colour represents which dataset.
* This function can be very useful for comparative analysis before and after model development, for evaluation.
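The slices in the example are taken by position. If a genuinely random split is wanted, one option is to shuffle with pandas first and pass the two resulting frames to `sv.compare()` in place of the slices. This is a minimal sketch: the helper `random_split`, the 80/20 fraction, and the seed are assumptions for illustration, and it assumes the DataFrame index is unique.

```python
import pandas as pd


def random_split(df: pd.DataFrame, frac: float = 0.8, seed: int = 42):
    """Randomly split df into two disjoint parts (assumes a unique index)."""
    part_1 = df.sample(frac=frac, random_state=seed)
    # Everything not sampled into part_1 goes into part_2
    part_2 = df.drop(part_1.index)
    return part_1, part_2


# Hypothetical toy frame standing in for train_data
demo = pd.DataFrame({
    "Loan_Status": ["Y", "N"] * 5,
    "LoanAmount": range(10),
})

part_1, part_2 = random_split(demo)
print(len(part_1), len(part_2))
```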

### 2. The compare_intra() function divides the dataset into parts based on the categories in a column/feature.
#### Below is an example where the target variable is used to contrast the two resulting subsets.

In [None]:
compare_intra_report = sv.compare_intra(train_data, train_data['Loan_Status']=='Y', ["Y", "N"])
compare_intra_report.show_notebook()

## 2.2 Possible error while installing sweetviz library

###### AttributeError: module 'numpy.linalg.lapack_lite' has no attribute '_ilp64'
#### Solution that worked in this case: restart the notebook, reinstall the library, and import it again.

## To be updated with more such libraries and their demo and features!