In [None]:
import pandas as pd

# Introduction

Before even considering any feature engineering or modeling aspect, it is imperative to explore the data we have. This step is usually called **Exploratory Data Analysis (EDA)**.

In a standard Machine Learning pipeline, understanding and preparing the data is often reported as the [most time-consuming task](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/).

**Have you ever found yourself repeating the same procedures over and over at the beginning of a new competition?**

Well, let's bring in some **automation**: let me introduce you to [`Pandas Profiling`](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/introduction.html#).

# Pandas Profiling in action

Pandas Profiling is a fantastic library that allows you to save time by getting rid of all the repetitive tasks in the initial exploration phase of your tabular datasets.

Let's import the library and get started!

In [None]:
import pandas_profiling as pp

As an example, let's load our training set:

In [None]:
train = pd.read_csv("../input/tabular-playground-series-jan-2022/train.csv")

After that, basic use is [really simple](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/getting_started.html): we just need to instantiate the `ProfileReport` class, passing our dataset and providing a meaningful title. Here we also set `explorative=True`, which switches on the library's more in-depth analyses.

In [None]:
profile = pp.ProfileReport(train, title="EDA on training set", explorative=True)

Almost done!

There are now **two main ways** to explore the output: through widgets and through an HTML report.

## Widget call

In [None]:
profile.to_widgets()

## HTML report

In [None]:
profile.to_notebook_iframe()

As stated in the documentation, the interactive HTML report presents the following statistics for each column, where relevant for the column type:

- **Type inference**: detects the types of columns in a dataframe
- **Essentials**: type, unique values, missing values
- **Quantile statistics**: minimum value, Q1, median, Q3, maximum, range, interquartile range
- **Descriptive statistics**: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- **Most frequent values**
- **Histograms**
- **Correlations**: highlighting of highly correlated variables; Spearman, Pearson and Kendall matrices
- **Missing values**: matrix, count, heatmap and dendrogram of missing values
- **Duplicate rows**: lists the most frequently occurring duplicate rows
- **Text analysis**: categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data
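
To get a feel for what the report automates, here is a minimal sketch that computes a handful of the statistics listed above by hand with plain pandas. The `toy` dataframe and its column names are hypothetical and are not the competition data. This is only an illustration of the manual work, not how the library computes its report internally.

In [None]:
```python
import pandas as pd

# Hypothetical toy data, NOT the competition's training set.
toy = pd.DataFrame(
    {
        "num_sold": [10, 20, 20, 40, None, 10],
        "store": ["A", "B", "B", "A", "A", "A"],
    }
)

col = toy["num_sold"]

# Essentials: inferred type, distinct values, missing values
essentials = {
    "dtype": str(col.dtype),
    "n_unique": col.nunique(),
    "n_missing": int(col.isna().sum()),
}

# Quantile statistics (missing values are skipped by pandas)
q1, median, q3 = col.quantile([0.25, 0.5, 0.75])
quantiles = {
    "min": col.min(),
    "q1": q1,
    "median": median,
    "q3": q3,
    "max": col.max(),
    "range": col.max() - col.min(),
    "iqr": q3 - q1,
}

# Descriptive statistics
descriptive = {
    "mean": col.mean(),
    "std": col.std(),
    "skewness": col.skew(),
    "kurtosis": col.kurt(),
}

# Most frequent values of a categorical column, and fully duplicated rows
top_values = toy["store"].value_counts()
n_duplicates = int(toy.duplicated().sum())

print(essentials, quantiles, descriptive, top_values, n_duplicates, sep="\n")
```

Repeating this for every column of every new dataset is exactly the kind of chore the report takes off our hands.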

If you want to save your HTML report, you can use the `to_file()` method:

In [None]:
profile.to_file("report_training.html")

# Conclusion

This kind of automation saves a lot of time. With the first exploration handled by `Pandas Profiling`, we can **focus on more detailed explorations**, and then move on to the subsequent phases of the ML pipeline.

For more **customizations**, you can refer to the [Advanced section](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/advanced_usage.html) of the documentation, which explains in detail how to change the default settings.

For **large datasets**, there is a specific section [here](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/big_data.html)!
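
As a rough illustration of one common approach to large datasets, the sketch below profiles a random sample instead of the full table. The helper `sample_for_profiling`, its default values and the `big` dataframe are hypothetical names introduced for this example. The commented-out call assumes that `ProfileReport` accepts `minimal=True` to skip the more expensive computations, so please check the linked section for the options available in your installed version.

In [None]:
```python
import pandas as pd


def sample_for_profiling(df: pd.DataFrame, max_rows: int = 10_000, seed: int = 0) -> pd.DataFrame:
    """Return df unchanged if it is small enough, else a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


# Hypothetical stand-in for a large training set.
big = pd.DataFrame({"row_id": range(50_000), "value": [i % 7 for i in range(50_000)]})
sample = sample_for_profiling(big, max_rows=5_000)

# With pandas_profiling imported as pp, the sample could then be profiled, e.g.:
# profile = pp.ProfileReport(sample, title="EDA on a sample", minimal=True)
```

Keep in mind that statistics computed on a sample are estimates: rare categories and duplicate rows in particular may be under-represented.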



### I hope this short notebook helps you enjoy this competition (and many others) even more!