# Exploratory data analysis
In this notebook, we perform an exploratory data analysis (EDA) to gain insight into the data. The findings will guide the more elaborate preprocessing that follows.

### Table of contents
* [Load data](#load)
* [First data inspection](#inspection)
* [Missing data](#missing)
* [Faulty data](#faulty)
* [Categorical variables](#categorical)
* [Distributions](#distributions)
* [Quantitative relationships & feature importance](#quantitative)
* [Qualitative relationships](#qualitative)
* [Outliers](#outliers)
* [Clusters](#clusters)

In [74]:
import pandas as pd

# display full dataframes:
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

train_data_file = "../data/train.csv"
target_col = "SalePrice"

## Load data <a class="anchor"  id="load"></a>
To avoid decisions that are biased by information from held-out data, the EDA is performed on the training set only. Thus, only the training data is loaded.

In [82]:
train_df = pd.read_csv(train_data_file)
train_df.set_index("Id", inplace=True)
y_train = train_df[target_col]
X_train = train_df.drop(columns=[target_col])

## First data inspection <a class="anchor"  id="inspection"></a>
**Questions: What is the size of the data? What does the data look like? What are the variable data types?**

In [83]:
print(f"Size of training data: {train_df.shape}")

Size of training data: (1460, 80)


In [76]:
train_df.sample(5)

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
975,70,RL,60.0,11414,Pave,,IR1,Lvl,AllPub,Corner,Gtl,BrkSide,RRAn,Feedr,1Fam,2Story,7,8,1910,1993,Gable,CompShg,HdBoard,HdBoard,,0.0,TA,Gd,BrkTil,Gd,TA,No,Unf,0,Unf,0,728,728,GasA,TA,N,SBrkr,1136,883,0,2019,0,0,1,0,3,1,Gd,8,Typ,0,,Detchd,1997.0,Unf,2,532,TA,TA,Y,509,135,0,0,0,0,,GdPrv,,0,10,2009,WD,Normal,167500
6,50,RL,85.0,14115,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Mitchel,Norm,Norm,1Fam,1.5Fin,5,5,1993,1995,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,Wood,Gd,TA,No,GLQ,732,Unf,0,64,796,GasA,Ex,Y,SBrkr,796,566,0,1362,1,0,1,1,1,1,TA,5,Typ,0,,Attchd,1993.0,Unf,2,480,TA,TA,Y,40,30,0,320,0,0,,MnPrv,Shed,700,10,2009,WD,Normal,143000
1048,20,RL,57.0,9245,Pave,,IR2,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,1Story,5,5,1994,1995,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,GLQ,686,Unf,0,304,990,GasA,Ex,Y,SBrkr,990,0,0,990,0,1,1,0,3,1,TA,5,Typ,0,,Detchd,1996.0,Unf,2,672,TA,TA,Y,0,0,0,0,0,0,,,,0,2,2008,WD,Normal,145000
1309,20,RM,100.0,12000,Pave,,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,1Fam,1Story,5,7,1948,2005,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,TA,TA,No,GLQ,144,ALQ,608,172,924,GasA,Ex,Y,SBrkr,1122,0,0,1122,1,0,1,0,2,1,Gd,6,Typ,0,,Attchd,1948.0,Unf,2,528,TA,TA,Y,0,36,0,0,0,0,,GdWo,,0,5,2008,WD,Normal,147000
166,190,RL,62.0,10106,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Edwards,Norm,Norm,2fmCon,1.5Fin,5,7,1940,1999,Gable,CompShg,Wd Sdng,Wd Sdng,,0.0,TA,Gd,BrkTil,TA,TA,No,ALQ,351,Rec,181,112,644,GasA,Gd,Y,SBrkr,808,547,0,1355,1,0,2,0,4,2,TA,6,Typ,0,,,,,0,0,,,Y,140,0,0,0,0,0,,,,0,9,2008,WD,Normal,127500


In [77]:
# distribution of data types:
X_train.dtypes.value_counts()

object     43
int64      33
float64     3
Name: count, dtype: int64

In [78]:
print(f"Data type of target: {y_train.dtype}")

Data type of target: int64


**Findings**:
The training data has 1460 rows and 80 columns, one of which is the target. Of the 79 feature columns, 43 have dtype `object` (string-valued, i.e. categorical), 33 are integers, and 3 are floating-point numbers. The target is stored as an integer.

## Missing data <a class="anchor"  id="missing"></a>
**Questions: How is missing data represented? How much data is missing? Do missing values actually mean a specific value?**
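
As a starting point for these questions, the sketch below counts missing values per column. The helper name `missing_summary` and the toy frame are assumptions made for illustration; in the notebook, the function would be applied to `X_train`. Note that `pd.read_csv` parses strings such as `NA` as `NaN` by default, so a column where `NA` stands for a real category (for example "no alley") would show up here as mostly missing.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, most affected first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame(
        {"n_missing": n_missing, "pct_missing": 100 * n_missing / len(df)}
    )
    # keep only columns that actually contain missing values
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("n_missing", ascending=False)


# Toy frame standing in for X_train (values are made up for illustration).
toy = pd.DataFrame(
    {
        "LotFrontage": [60.0, np.nan, 57.0, 100.0],
        "Alley": [np.nan, np.nan, np.nan, "Grvl"],
        "LotArea": [11414, 14115, 9245, 12000],
    }
)
print(missing_summary(toy))
```

Columns with a very high share of missing values are the first candidates for checking whether "missing" encodes a specific value rather than an unknown one.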

## Faulty data <a class="anchor"  id="faulty"></a>
**Questions: Are there duplicated rows or columns? Does the feature’s data type make sense? Is a numerical feature actually categorical or vice versa? Do all values of one feature have the same data type? Are all values in the expected range? If multiple features are connected, is there any inconsistency?**
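
A few of these checks can be sketched as follows. The function name `faulty_data_report` and the two consistency rules (a house cannot be remodelled or sold before it was built) are assumptions chosen for illustration, not an exhaustive list; the toy values are made up.

```python
import pandas as pd


def faulty_data_report(df: pd.DataFrame) -> dict:
    """Collect a few basic sanity checks; the chosen rules are illustrative only."""
    return {
        # duplicated() compares column values and ignores the index
        "duplicated_rows": int(df.duplicated().sum()),
        # transpose so that identical columns become identical rows
        "duplicated_columns": int(df.T.duplicated().sum()),
        "remodelled_before_built": df.index[
            df["YearRemodAdd"] < df["YearBuilt"]
        ].tolist(),
        "sold_before_built": df.index[df["YrSold"] < df["YearBuilt"]].tolist(),
    }


# Toy frame standing in for train_df (values are made up for illustration).
toy = pd.DataFrame(
    {
        "YearBuilt": [1910, 1993, 2008, 1993],
        "YearRemodAdd": [1993, 1995, 2007, 1995],
        "YrSold": [2009, 2009, 2008, 2009],
    },
    index=pd.Index([1, 2, 3, 4], name="Id"),
)
report = faulty_data_report(toy)
print(report)
```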

## Categorical variables <a class="anchor"  id="categorical"></a>
**Questions: Is there an ordering in the categories? Are there special characters (non ASCII)?**
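
The sketch below gives one possible way to approach both questions. The helper `categorical_overview` is hypothetical, and the quality scale `Po < Fa < TA < Gd < Ex` is an assumption about how the quality codes are ordered; it should be verified against the dataset's documentation before use.

```python
import pandas as pd


def categorical_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise cardinality and non-ASCII content of non-numerical columns."""
    cat_df = df.select_dtypes(exclude="number")
    rows = {}
    for col in cat_df.columns:
        values = cat_df[col].dropna().astype(str)
        rows[col] = {
            "n_unique": values.nunique(),
            # True if at least one value contains a non-ASCII character
            "has_non_ascii": not bool(values.map(str.isascii).all()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy frame standing in for X_train (values are made up for illustration).
toy = pd.DataFrame(
    {
        "ExterQual": ["TA", "Gd", "Ex", "TA"],
        "Street": ["Pave", "Pave", "Grvl", "Pavé"],
        "LotArea": [11414, 14115, 9245, 12000],
    }
)
overview = categorical_overview(toy)
print(overview)

# Assumed ordering of the quality codes, from worst to best.
quality_levels = ["Po", "Fa", "TA", "Gd", "Ex"]
exter_qual = pd.Categorical(toy["ExterQual"], categories=quality_levels, ordered=True)
print(exter_qual.codes)
```

An ordered categorical keeps the ranking explicit, so an ordinal encoding can later be derived from `codes` instead of a one-hot encoding.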

## Distributions <a class="anchor"  id="distributions"></a>
**Questions: Is the feature continuous or discrete? If discrete: How many distinct values are there? Is the distribution similar to a normal distribution? Is the distribution skewed? Are there particularly frequent or rare values? Are there inf values? Do the features have a similar range? Are the assumptions of the ML algorithm you want to use met?**
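
Several of these questions can be answered numerically before plotting anything. The following is a minimal sketch, assuming a hypothetical helper `distribution_summary`; the toy values are made up and chosen so that a log transform removes the skew.

```python
import numpy as np
import pandas as pd


def distribution_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per numerical column: distinct values, range, skewness and count of inf values."""
    num_df = df.select_dtypes(include="number")
    return pd.DataFrame(
        {
            "n_unique": num_df.nunique(),
            "min": num_df.min(),
            "max": num_df.max(),
            "skew": num_df.skew(),
            "n_inf": np.isinf(num_df).sum(),
        }
    )


# Toy frame standing in for train_df (values are made up for illustration).
toy = pd.DataFrame(
    {
        "SalePrice": [50000, 100000, 200000, 400000, 800000],
        "Fireplaces": [0, 1, 0, 1, 2],
    }
)
summary = distribution_summary(toy)
print(summary)

# A log transform is a common remedy for right-skewed, strictly positive features.
log_skew = np.log1p(toy["SalePrice"]).skew()
print(f"Skewness of log1p(SalePrice): {log_skew:.3f}")
```

A small `n_unique` hints at a discrete feature, while a large gap between the `min`/`max` ranges of different columns suggests that scaling may be needed for scale-sensitive algorithms.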

## Quantitative relationships & feature importance <a class="anchor"  id="quantitative"></a>
**Questions: Are there collinear variables? Which features are most predictive?**
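
One simple way to approach both questions is the Pearson correlation matrix, sketched below. The helper `collinear_pairs` and the threshold of 0.8 are assumptions; correlation with the target only captures linear relationships and is a crude proxy for feature importance, not a substitute for a model-based measure.

```python
import numpy as np
import pandas as pd


def collinear_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # keep the upper triangle only so that every pair appears exactly once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data standing in for X_train and y_train (values are made up).
X = pd.DataFrame(
    {
        "GarageCars": [1, 2, 2, 3, 1, 2],
        "GarageArea": [250, 480, 500, 750, 240, 520],
        "LotArea": [9000, 8000, 12000, 7000, 11000, 9500],
    }
)
y = pd.Series([120000, 180000, 190000, 260000, 115000, 200000], name="SalePrice")

pairs = collinear_pairs(X)
print(pairs)

# Absolute linear correlation of each numerical feature with the target.
target_corr = X.select_dtypes(include="number").corrwith(y).abs()
print(target_corr.sort_values(ascending=False))
```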

## Qualitative relationships <a class="anchor"  id="qualitative"></a>
**Questions: Is there a clear relationship of a feature with the target that has not been caught by the quantitative analysis? Is a transformation of a feature helpful with your ML algorithm?**
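
Categorical features are not covered by a correlation matrix, so a grouped summary of the target can serve as a first check. This is a minimal sketch with a hypothetical helper `target_by_category`; the toy rows are made up for illustration.

```python
import pandas as pd


def target_by_category(df: pd.DataFrame, feature: str, target: str) -> pd.DataFrame:
    """Count and median of the target per category, highest median first."""
    return (
        df.groupby(feature)[target]
        .agg(["count", "median"])
        .sort_values("median", ascending=False)
    )


# Toy frame standing in for train_df (values are made up for illustration).
toy = pd.DataFrame(
    {
        "Neighborhood": ["CollgCr", "CollgCr", "OldTown", "OldTown", "Edwards"],
        "SalePrice": [200000, 220000, 120000, 140000, 100000],
    }
)
by_neighborhood = target_by_category(toy, "Neighborhood", "SalePrice")
print(by_neighborhood)
```

Large differences between the medians, backed by a reasonable `count` per category, indicate a feature worth a closer visual inspection.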

## Outliers <a class="anchor"  id="outliers"></a>
**Questions: Are there outliers? Are those outliers misleading the prediction?**
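
A common first pass for univariate outliers is the interquartile range (IQR) rule, sketched below. The helper `iqr_outlier_mask` and the factor `k=1.5` are assumptions; the rule only flags candidates, and whether a flagged point actually misleads the prediction still has to be judged separately.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy series standing in for a numerical feature (values are made up).
toy = pd.Series(
    [1136, 1362, 990, 1122, 1355, 5642],
    index=pd.Index([1, 2, 3, 4, 5, 6], name="Id"),
    name="GrLivArea",
)
mask = iqr_outlier_mask(toy)
flagged_ids = toy.index[mask].tolist()
print(f"Flagged as potential outliers: {flagged_ids}")
```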

## Clusters <a class="anchor"  id="clusters"></a>
**Questions: Are there clear clusters in the data that should be predicted separately?**