Know Where Your Data Comes From
-----------------------

Before starting any EDA, learn as much as possible about the provenance of the data you are analyzing. You need to understand how the data was collected and how it was processed. Were any transformations applied to the data that could affect your analysis?

**You should be able to answer the following questions about your dataset:**

How was it collected?

Is it a sample?

Was it properly sampled?

Was the dataset transformed in any way?

Are there any known problems with the dataset?


If you don’t understand where the data comes from, you will have a hard time drawing meaningful conclusions from it. You also risk making serious analysis mistakes.

Additionally, make sure the dataset is structured in a standardized manner. The recommended format is tidy data, which corresponds to the third normal form of relational databases. A “tidy” dataset has the following attributes:

1. Each variable forms a column and contains values

2. Each observation forms a row

3. Each type of observational unit forms a table

Respecting this standardized format will speed up your analysis, since it is compatible with many tools and libraries.
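As a small illustration of the three attributes above, here is a minimal sketch that reshapes a "wide" table into a tidy one with `DataFrame.melt`. The table, its column names, and its values are invented for the example and are not part of the competition data.

```python
import pandas as pd

# Hypothetical "wide" table: one row per city, one column per year.
wide = pd.DataFrame({
    "city": ["Lyon", "Oslo"],
    "2019": [12.1, 6.3],
    "2020": [12.8, 6.9],
})

# melt() turns the year columns into (year, temperature) pairs, so that
# each variable becomes a column and each observation becomes a row.
tidy = wide.melt(id_vars="city", var_name="year", value_name="temperature")
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```

In the tidy version, `year` is a regular column that can be filtered, grouped, or plotted like any other variable.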

Introducing **pandas_profiling** for simple and fast exploratory data analysis of a pandas DataFrame
----------------------------

**Exploratory Data Analysis (EDA)** plays a very important role in understanding the **dataset**. Whether you are going to build a machine learning model or just want to bring out insights from the given data, EDA is the first task to perform.

Exploratory data analysis (EDA) is a statistical approach that aims at discovering and summarizing a dataset. At this step of the data science process, you want to explore the structure of your dataset, the variables and their relationships.


While EDA is undeniably important, ***the effort of performing it grows with the number of columns in your dataset.***



**For example:**
----------------------
Assume you have a dataset with 10 rows x 2 columns. It is very simple to specify those two column names separately and draw all the plots required for EDA.

If instead the dataset has 20 columns, you have to repeat that same exercise ten times over.

There is also another layer of complexity: the visualization you choose for a continuous variable differs from the one you choose for a categorical variable, hence ***the type of plot changes when the data type changes.***



Given all these conditions, EDA sometimes becomes a tedious task. But remember that it is all driven by a set of rules:

1. Plot a **boxplot** and a **histogram** for a **continuous** variable

2. Measure missing values

3. Calculate the **frequency** of each value for a **categorical** variable

Because these rules are mechanical, they can be automated. That is the basis of the Python module **pandas_profiling**, which helps automate the first level of EDA.
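To make the idea of rule-driven EDA concrete, here is a minimal sketch that picks a summary based on each column's data type. It is not how pandas_profiling is implemented; `summarize_column`, `summarize_frame`, and the toy data are hypothetical names and values. Plots are left out, and the quartiles stand in for the numbers a boxplot would show.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def summarize_column(series):
    """Apply a simple per-type rule set to one column (hypothetical helper)."""
    # Rule: always measure missing values.
    summary = {"missing": int(series.isna().sum())}
    if is_numeric_dtype(series):
        # Rule: numeric columns get quartiles (the numbers behind a boxplot).
        summary["kind"] = "continuous"
        summary["quartiles"] = series.quantile([0.25, 0.5, 0.75]).tolist()
    else:
        # Rule: other columns get a frequency count per value.
        summary["kind"] = "categorical"
        summary["frequency"] = series.value_counts().to_dict()
    return summary


def summarize_frame(df):
    """Run the rule set over every column of a DataFrame."""
    return {name: summarize_column(df[name]) for name in df.columns}


# Invented toy data, only to exercise both branches.
toy = pd.DataFrame({
    "age": [1.0, 2.0, 3.0, None],
    "color": ["black", "brown", "black", None],
})
report = summarize_frame(toy)
print(report)
```

The same loop works unchanged for 2 columns or 200, which is the point of automating the rules.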


**pandas-profiling**
--------------------------------------------

***Generates profile reports from a pandas DataFrame.*** The pandas **df.describe()** function is great but a little basic for serious exploratory data analysis.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

1. **Essentials**: type, unique values, missing values

2. **Quantile statistics** like minimum value, Q1, median, Q3, maximum, range, interquartile range

3. **Descriptive statistics** like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness

4. **Most frequent values**

5. **Histogram**

6. **Correlations** highlighting highly correlated variables, with Spearman and Pearson matrices
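For reference, most of the quantile and descriptive statistics listed above can be computed by hand with plain pandas. The sketch below uses an invented series. The report may use slightly different definitions (for example for the standard deviation or kurtosis), so treat the formulas here as assumptions.

```python
import pandas as pd

# Invented values, only for illustration.
values = pd.Series([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0], name="toy")

q1, median, q3 = values.quantile([0.25, 0.5, 0.75])

stats = {
    # Quantile statistics
    "minimum": values.min(),
    "Q1": q1,
    "median": median,
    "Q3": q3,
    "maximum": values.max(),
    "range": values.max() - values.min(),
    "interquartile range": q3 - q1,
    # Descriptive statistics
    "mean": values.mean(),
    "mode": values.mode().tolist(),
    "standard deviation": values.std(),  # sample standard deviation (ddof=1)
    "sum": values.sum(),
    "median absolute deviation": (values - median).abs().median(),
    "coefficient of variation": values.std() / values.mean(),
    "kurtosis": values.kurt(),
    "skewness": values.skew(),
}

for name, value in stats.items():
    print(f"{name}: {value}")
```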


**GitHub link:** You can read more about this module on GitHub: https://github.com/pandas-profiling/pandas-profiling



In [None]:
import pandas as pd
import pandas_profiling as pp

**Loading Training Dataset**
-------------

In [None]:
train_data = pd.read_csv('../input/train/train.csv')

In [None]:
train_data.head()

In [None]:
pp.ProfileReport(train_data)

**To retrieve the list of variables which are rejected due to high correlation:**
--------------------

In [None]:
profile = pp.ProfileReport(train_data)
rejected_variables = profile.get_rejected_variables(threshold=0.9)
rejected_variables
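To give an intuition for what rejecting a variable due to high correlation means, here is a minimal sketch in plain pandas. It is an approximation written for this notebook, not the library's own implementation. `highly_correlated_columns` is a hypothetical helper, and the toy columns are invented.

```python
import pandas as pd


def highly_correlated_columns(df, threshold=0.9):
    """Return numeric columns whose absolute Pearson correlation with an
    earlier column exceeds `threshold` (hypothetical helper)."""
    corr = df.select_dtypes(include="number").corr(method="pearson").abs()
    rejected = []
    for i, column in enumerate(corr.columns):
        # Compare each column only with the columns that come before it,
        # so that one member of every correlated pair is kept.
        if (corr.iloc[:i, i] > threshold).any():
            rejected.append(column)
    return rejected


# Invented data: height in inches is a rescaled copy of height in cm.
toy = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "weight_kg": [80.0, 55.0, 90.0, 60.0, 72.0],
})
print(highly_correlated_columns(toy, threshold=0.9))
```

Here `height_in` carries almost the same information as `height_cm`, so it is the one flagged, while `weight_kg` is kept.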

**Advanced usage**
-------------------
A set of options is available to adapt the generated report.

**bins (int):** Number of bins in histogram (10 by default).

**Correlation settings:**

**check_correlation (boolean):** Whether or not to check correlation (True by default)

**correlation_threshold (float):** Threshold to determine if the variable pair is correlated (0.9 by default).

**correlation_overrides (list):** Variable names not to be rejected because they are correlated (None by default).

**check_recoded (boolean):** Whether or not to check recoded correlation (False by default). Since it is an expensive computation, it is best activated only for small datasets.

**pool_size (int):** Number of workers in the thread pool. The default is equal to the number of CPUs.
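The sketch below shows one way these options might be passed. It assumes that `ProfileReport` accepts them as keyword arguments in the release this notebook uses. Other releases of the library may name or handle them differently. The values, the column name `"Age"`, and the `build_report` helper are hypothetical.

```python
# Option names are the ones listed above; the values are only illustrative.
report_options = {
    "bins": 20,
    "check_correlation": True,
    "correlation_threshold": 0.95,
    "correlation_overrides": ["Age"],  # hypothetical column name
    "check_recoded": False,
    "pool_size": 2,
}


def build_report(df, options=None):
    """Build a profile report with custom options (hypothetical helper).

    Assumes ProfileReport takes the options as keyword arguments.
    """
    # Imported here so the options can be inspected without the package.
    import pandas_profiling as pp

    if options is None:
        options = report_options
    return pp.ProfileReport(df, **options)
```

In this notebook the call would then be `build_report(train_data)`.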