# Exploratory data analysis
In this notebook, we perform an extensive exploratory data analysis (EDA) to gain insight into the data. The findings will guide more elaborate preprocessing later on.

### Table of contents
* [Load data](#load)
* [First data inspection](#inspection)
* [Faulty data](#faulty)
* [Missing data](#missing)
* [Categorical variables](#categorical)
* [Distributions](#distributions)
* [Quantitative relationships & feature importance](#quantitative)
* [Qualitative relationships](#qualitative)
* [Outliers](#outliers)
* [Clusters](#clusters)

In [4]:
import pandas as pd

train_data_file = "../data/train.csv"
test_data_file = "../data/test.csv"
target_col = "SalePrice"

## Load data <a class="anchor"  id="load"></a>

In [11]:
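# Load the training set and use the "Id" column as the index.
train_df = pd.read_csv(train_data_file, index_col="Id")

# Separate the target from the features.
y_train = train_df[target_col]
X_train = train_df.drop(columns=[target_col])

# The test file is read as features only.
X_test = pd.read_csv(test_data_file, index_col="Id")

# Combine train and test features into one frame for the analysis below.
X = pd.concat([X_train, X_test])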

## First data inspection <a class="anchor"  id="inspection"></a>
Questions: What is the size of the data? What are the variable data types (numerical, categorical)? What do the basic statistics look like?
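
The questions above can be answered with a handful of pandas calls. The following is a minimal sketch, not a fixed recipe: `summarize` is a hypothetical helper, and `toy` with its column names is a made-up stand-in so that the cell runs on its own. In this notebook, the same calls would be applied to `X`.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column overview: dtype, non-null count and number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "n_unique": df.nunique(),
    })

# Toy stand-in for X; the column names are hypothetical.
toy = pd.DataFrame({
    "area": [84.5, 96.0, 112.5],
    "street": ["paved", "paved", "gravel"],
})

print(toy.shape)                    # (number of rows, number of columns)
overview = summarize(toy)
print(overview)
print(toy.describe(include="all"))  # basic statistics for numeric and categorical columns
```

Columns with very few distinct values are candidates for being categorical, even if their dtype is numeric.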

## Faulty data <a class="anchor"  id="faulty"></a>
Questions: Does each feature's data type make sense? Is a numerical feature actually categorical or vice versa? Do all values of one feature have the same data type? Are all values in the expected range? If several features are related, are they consistent with each other? Are there duplicated rows or columns?
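
Most of these checks are one-liners. The sketch below illustrates them on a made-up frame: the column names, the valid year range and the helper `mixed_type_columns` are assumptions for illustration only, and the real checks depend on what the features in `X` mean.

```python
import pandas as pd

def mixed_type_columns(df: pd.DataFrame) -> list:
    """Columns whose non-null values have more than one Python type."""
    return [c for c in df.columns if df[c].dropna().map(type).nunique() > 1]

# Toy stand-in for X; the column names are hypothetical.
toy = pd.DataFrame({
    "year_built": [1990, 2005, 2005, 2030],
    "year_sold": [2008, 2009, 2009, 2010],
    "quality": [5, "7", "7", 6],  # numbers stored partly as strings
})

# Duplicated rows and duplicated columns.
n_duplicate_rows = int(toy.duplicated().sum())
duplicate_columns = toy.columns[toy.T.duplicated()].tolist()

# Columns mixing several value types.
mixed = mixed_type_columns(toy)

# Values outside an expected range (the bounds are an assumption).
out_of_range = toy[~toy["year_built"].between(1800, 2010)]

# Consistency between related features: a sale cannot precede construction.
inconsistent = toy[toy["year_sold"] < toy["year_built"]]

print(n_duplicate_rows, duplicate_columns, mixed)
print(out_of_range)
print(inconsistent)
```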

## Missing data <a class="anchor"  id="missing"></a>
Questions: How is missing data represented? How much data is missing? Does a missing value actually encode a specific meaning (for example, that the feature is absent)?
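
By default, `pd.read_csv` converts strings such as `NA` and empty fields to `NaN`, so a first count of missing values is straightforward. Other sentinels (for example `-1` or `?`) would have to be checked separately. The sketch below uses a hypothetical helper `missing_report` and a made-up frame. Whether a missing value should become an explicit category such as `"None"` is an assumption to be verified against the data description.

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and share of missing values per column, most affected first."""
    n_missing = df.isna().sum()
    report = pd.DataFrame({"n_missing": n_missing, "share": n_missing / len(df)})
    return report[report["n_missing"] > 0].sort_values("share", ascending=False)

# Toy stand-in for X; the column names are hypothetical.
toy = pd.DataFrame({
    "pool_quality": ["Gd", np.nan, np.nan, np.nan],
    "lot_frontage": [65.0, np.nan, 80.0, 70.0],
    "garage_cars": [2, 1, 0, 2],
})

report = missing_report(toy)
print(report)

# If "missing" really means "feature absent", an explicit category keeps that information.
toy["pool_quality"] = toy["pool_quality"].fillna("None")
```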

## Categorical variables <a class="anchor"  id="categorical"></a>
Questions: Is there an ordering in the categories? Are there special (non-ASCII) characters?
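
A possible way to approach both questions is sketched below. The frame, the category order in `quality_order` and the helper `non_ascii_values` are hypothetical. An actual ordering has to come from domain knowledge or the data description.

```python
import pandas as pd

def non_ascii_values(series: pd.Series) -> list:
    """Distinct values that contain at least one non-ASCII character."""
    values = series.dropna().astype(str).unique()
    return sorted(v for v in values if not v.isascii())

# Toy stand-in for X; the column names and categories are hypothetical.
toy = pd.DataFrame({
    "quality": ["Gd", "TA", "Ex", "Fa", "TA"],
    "neighborhood": ["Old Town", "Café Row", "Old Town", "Sawyer", "Sawyer"],
})

# Frequency of each category in every non-numeric column.
cat_cols = toy.select_dtypes(exclude="number").columns
for col in cat_cols:
    print(toy[col].value_counts(dropna=False))

# Encode an ordinal feature with an explicit order (assumed here, worst to best).
quality_order = ["Fa", "TA", "Gd", "Ex"]
toy["quality"] = pd.Categorical(toy["quality"], categories=quality_order, ordered=True)
quality_codes = toy["quality"].cat.codes

# Values with special characters.
special = non_ascii_values(toy["neighborhood"])
print(quality_codes.tolist(), special)
```

An ordered categorical supports comparisons and `min`/`max`, and its integer codes can serve as an ordinal encoding.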

## Distributions <a class="anchor"  id="distributions"></a>
Questions: Is the feature continuous or discrete? If discrete, how many distinct values are there? Is the distribution similar to a normal distribution? Is the distribution skewed? Are there particularly frequent or rare values? Are there inf values? Do the features have a similar range? Are the assumptions of the ML algorithm you want to use met?
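
Several of these questions can be answered numerically before plotting anything. The sketch below computes distinct-value counts, inf counts, ranges and skewness with a hypothetical helper `distribution_summary` on a made-up frame. It also shows a log transform as one common (not the only) remedy for right skew.

```python
import numpy as np
import pandas as pd

def distribution_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Distinct values, inf count, range and skewness of each numeric column."""
    num = df.select_dtypes(include="number").astype(float)
    # Treat +/-inf as missing so that they do not break the other statistics.
    finite = num.replace([np.inf, -np.inf], np.nan)
    return pd.DataFrame({
        "n_unique": finite.nunique(),
        "n_inf": np.isinf(num).sum(),
        "min": finite.min(),
        "max": finite.max(),
        "skew": finite.skew(),
    })

# Toy stand-in for X; the column names are hypothetical.
toy = pd.DataFrame({
    "price": [100, 200, 400, 800, 1600],
    "ratio": [0.5, 1.0, np.inf, 2.0, 1.5],
    "rooms": [3, 3, 4, 3, 5],
})

summary = distribution_summary(toy)
print(summary)

# A log transform often reduces right skew.
log_skew = np.log1p(toy["price"]).skew()
print(summary.loc["price", "skew"], log_skew)
```

Very different `min`/`max` ranges across columns hint that scaling may be needed for algorithms that are sensitive to feature scale.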

## Quantitative relationships & feature importance <a class="anchor"  id="quantitative"></a>
Questions: Are there collinear variables? Which features are most predictive?
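
A first, rough answer to both questions comes from Pearson correlations: between features for collinearity, and between each feature and the target as a proxy for predictive power. This is only a sketch under that assumption. Correlation captures linear relationships only, and model-based importances may rank features differently. The helper `collinear_pairs`, the threshold and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def collinear_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so that each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy stand-in for X_train and y_train; the column names are hypothetical.
features = pd.DataFrame({
    "area": [50.0, 80.0, 100.0, 120.0, 150.0],
    "age": [30.0, 5.0, 20.0, 10.0, 1.0],
})
features["area_sqft"] = features["area"] * 10.764  # same information, other unit
target = pd.Series([100.0, 170.0, 200.0, 250.0, 300.0], name="price")

pairs = collinear_pairs(features)
target_corr = features.corrwith(target).abs().sort_values(ascending=False)
print(pairs)
print(target_corr)
```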

## Qualitative relationships <a class="anchor"  id="qualitative"></a>
Questions: Is there a clear relationship between a feature and the target that the quantitative analysis did not capture? Would a transformation of a feature help your ML algorithm?
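
Scatter plots are the usual tool here. As a plot-free illustration, the sketch below builds a made-up U-shaped relationship whose Pearson correlation is zero, shows that binned target means still reveal the pattern, and checks that a suitable transformation makes the relationship linear. The data, the bin edges and the chosen transformation are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up feature with a U-shaped effect on the target.
x = pd.Series(np.arange(11, dtype=float), name="feature")
y = ((x - 5) ** 2).rename("target")

# The linear correlation misses the relationship entirely.
pearson = x.corr(y)

# Mean target per feature bin makes the pattern visible.
bins = pd.cut(x, bins=[-0.5, 2.5, 7.5, 10.5], labels=["low", "mid", "high"])
binned_mean = y.groupby(bins, observed=True).mean()

# A transformed feature can be linearly related to the target.
transformed = (x - x.median()) ** 2
transformed_corr = transformed.corr(y)

print(pearson)
print(binned_mean)
print(transformed_corr)
```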

## Outliers <a class="anchor"  id="outliers"></a>
Questions: Are there outliers? Do those outliers distort the predictions?
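
One simple, common convention (among several) is Tukey's rule: flag values more than 1.5 interquartile ranges outside the quartiles. The sketch below applies it with a hypothetical helper `iqr_outliers` to made-up data and compares the feature-target correlation with and without the flagged rows as a rough indication of their influence.

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Toy stand-in for one feature and the target; the names are hypothetical.
toy = pd.DataFrame({
    "area": [50.0, 60.0, 70.0, 80.0, 90.0, 500.0],
    "price": [100.0, 120.0, 140.0, 160.0, 180.0, 150.0],
})

mask = iqr_outliers(toy["area"])
print(toy[mask])

# Influence check: correlation with and without the flagged rows.
corr_all = toy["area"].corr(toy["price"])
corr_clean = toy.loc[~mask, "area"].corr(toy.loc[~mask, "price"])
print(corr_all, corr_clean)
```

A large gap between the two correlations suggests that the flagged rows deserve a closer look before modelling. It does not by itself justify removing them.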

## Clusters <a class="anchor"  id="clusters"></a>
Questions: Are there clear clusters in the data that should be predicted separately?
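
One way to look for such clusters is to standardize the numeric features, run k-means and compare the target across the resulting groups. The sketch below uses a deliberately tiny NumPy k-means (Lloyd's algorithm with farthest-point initialization) so that it runs without extra dependencies. In practice, a library implementation such as scikit-learn's `KMeans` would likely be preferable. The data, the helper `kmeans` and the choice `k=2` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def kmeans(points, k, n_iter=50):
    """Minimal Lloyd's k-means with farthest-point initialization."""
    points = np.asarray(points, dtype=float)
    # Start from the first point, then repeatedly add the point farthest from all centers.
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each center to the mean of its points (keep it if the cluster is empty).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy stand-in for X_train and y_train; the column names are hypothetical.
toy = pd.DataFrame({
    "area": [50.0, 55.0, 60.0, 200.0, 210.0, 220.0],
    "rooms": [2.0, 2.0, 3.0, 6.0, 7.0, 6.0],
    "price": [100.0, 110.0, 120.0, 500.0, 520.0, 540.0],
})

features = toy[["area", "rooms"]]
standardized = (features - features.mean()) / features.std()
labels, centers = kmeans(standardized.to_numpy(), k=2)

# Compare the target between clusters.
cluster = pd.Series(labels, index=toy.index, name="cluster")
cluster_price_mean = toy["price"].groupby(cluster).mean()
print(cluster_price_mean)
```

If the target behaves very differently between clusters, separate models per cluster (or the cluster label as a feature) may be worth trying.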