# Exploratory data analysis
In this notebook, we perform an extensive exploratory data analysis (EDA) to gain insight into the data. The findings will guide the more elaborate preprocessing that follows.

### Table of contents
* [Load data](#load)
* [First data inspection](#inspection)
* [Faulty data](#faulty)
* [Missing data](#missing)
* [Categorical variables](#categorical)
* [Distributions](#distributions)
* [Quantitative relationships & feature importance](#quantitative)
* [Qualitative relationships](#qualitative)
* [Outliers](#outliers)
* [Clusters](#clusters)

In [2]:
import pandas as pd

train_data_file = "../data/train.csv"
test_data_file = "../data/test.csv"
target_col = "SalePrice"

## Load data <a class="anchor"  id="load"></a>

In [28]:
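# Load the training data, indexed by the "Id" column, and split off the target.
train_df = pd.read_csv(train_data_file, index_col="Id")
y_train = train_df[target_col]
X_train = train_df.drop(columns=[target_col])

# The test set has no target column.
X_test = pd.read_csv(test_data_file, index_col="Id")

# Stack train and test features so the inspection covers all rows.
X = pd.concat([X_train, X_test])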

## First data inspection <a class="anchor"  id="inspection"></a>
Questions: What is the size of the data? What does the data look like? What are the data types of the variables?

In [31]:
print(f"Size of training data: {train_df.shape}")
print(f"Size of test data: {X_test.shape}")

Size of training data: (1460, 80)
Size of test data: (1459, 79)


In [32]:
# peek into data:
X.sample(5)

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
1817,70,RM,60.0,9600,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,6,2009,WD,Normal
1501,160,FV,,2980,Pave,,Reg,Lvl,AllPub,Corner,...,0,0,,,,0,5,2010,WD,Normal
51,60,RL,,13869,Pave,,IR2,Lvl,AllPub,Corner,...,0,0,,,,0,7,2007,WD,Normal
1112,60,RL,80.0,10480,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,9,2008,WD,Normal
558,50,C (all),60.0,11040,Pave,,Reg,Low,AllPub,Inside,...,0,0,,,,0,9,2006,COD,Normal


In [33]:
X.dtypes.value_counts()

object     43
int64      25
float64    11
Name: count, dtype: int64

In [34]:
# basic statistics of numerical columns:
X.describe()

statistic,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
count,2919.0,2433.0,2919.0,2919.0,2919.0,2919.0,2919.0,2896.0,2918.0,2918.0,...,2918.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0
mean,57.137718,69.305795,10168.11408,6.089072,5.564577,1971.312778,1984.264474,102.201312,441.423235,49.582248,...,472.874572,93.709832,47.486811,23.098321,2.602261,16.06235,2.251799,50.825968,6.213087,2007.792737
std,42.517628,23.344905,7886.996359,1.409947,1.113131,30.291442,20.894344,179.334253,455.610826,169.205611,...,215.394815,126.526589,67.575493,64.244246,25.188169,56.184365,35.663946,567.402211,2.714762,1.314964
min,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0
25%,20.0,59.0,7478.0,5.0,5.0,1953.5,1965.0,0.0,0.0,0.0,...,320.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2007.0
50%,50.0,68.0,9453.0,6.0,5.0,1973.0,1993.0,0.0,368.5,0.0,...,480.0,0.0,26.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0
75%,70.0,80.0,11570.0,7.0,6.0,2001.0,2004.0,164.0,733.0,0.0,...,576.0,168.0,70.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0
max,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1526.0,...,1488.0,1424.0,742.0,1012.0,508.0,576.0,800.0,17000.0,12.0,2010.0


In [35]:
# basic statistics of categorical columns:
X.describe(include=["object"])

statistic,MSZoning,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,...,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,PoolQC,Fence,MiscFeature,SaleType,SaleCondition
count,2915,2919,198,2919,2919,2917,2919,2919,2919,2919,...,2762,2760,2760,2760,2919,10,571,105,2918,2919
unique,5,2,2,4,4,2,5,3,25,9,...,6,3,5,5,3,3,4,4,9,6
top,RL,Pave,Grvl,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,...,Attchd,Unf,TA,TA,Y,Ex,MnPrv,Shed,WD,Normal
freq,2265,2907,120,1859,2622,2916,2133,2778,443,2511,...,1723,1230,2604,2654,2641,4,329,95,2525,2402


## Faulty data <a class="anchor"  id="faulty"></a>
Questions: Does the feature’s data type make sense? Is a numerical feature actually categorical or vice versa? Do all values of one feature have the same data type? Are all values in the expected range? If multiple features are connected, is there any inconsistency? Are there duplicated rows or columns?
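
A minimal sketch of how some of these checks could be automated. The helper `faulty_data_report`, its `max_levels` threshold, and the toy frame (including the `StreetCopy` column) are illustrative assumptions, not part of the original notebook; in the notebook the function would be called on `X`.

```python
import pandas as pd


def faulty_data_report(df: pd.DataFrame, max_levels: int = 20) -> dict:
    """Collect simple sanity checks; the threshold is an illustrative assumption."""
    # Non-numeric columns whose non-null values have more than one Python type.
    mixed_types = [
        col
        for col in df.select_dtypes(exclude="number").columns
        if df[col].dropna().map(type).nunique() > 1
    ]
    # Integer columns with few distinct values may really be categorical codes.
    maybe_categorical = [
        col
        for col in df.columns
        if pd.api.types.is_integer_dtype(df[col]) and df[col].nunique() <= max_levels
    ]
    return {
        "mixed_types": mixed_types,
        "maybe_categorical": maybe_categorical,
        "duplicated_rows": int(df.duplicated().sum()),
        "duplicated_columns": df.columns[df.T.duplicated()].tolist(),
    }


# Toy data (made up) that triggers every check once.
toy = pd.DataFrame({
    "MSSubClass": [20, 60, 20, 60],
    "YearBuilt": [2000, 1990, 2000, 2005],
    "YearRemodAdd": [2005, 1990, 2005, 2001],
    "Street": ["Pave", "Grvl", "Pave", 1],
    "StreetCopy": ["Pave", "Grvl", "Pave", 1],
})
report = faulty_data_report(toy, max_levels=2)

# Consistency between connected features: a remodel cannot predate construction.
inconsistent = toy[toy["YearRemodAdd"] < toy["YearBuilt"]]
print(report)
print(inconsistent.index.tolist())
```

The integer heuristic only proposes candidates: whether a column such as `MSSubClass` is really a category code still has to be decided by reading the data description.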

## Missing data <a class="anchor"  id="missing"></a>
Questions: How is missing data represented? How much data is missing? Do missing values actually mean a specific value?
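
A possible starting point, sketched under assumptions: `missing_summary` is a hypothetical helper and the toy values are made up. The last step probes whether a missing `PoolQC` might simply mean "no pool" by checking it against `PoolArea`; that interpretation is a hypothesis to verify against the data description, not an established fact.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and fraction of NaN values per column, most affected columns first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "frac_missing": n_missing / len(df),
    })
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Toy data (made up); in the notebook this would be applied to X.
toy = pd.DataFrame({
    "PoolArea": [0, 0, 512, 0],
    "PoolQC": [np.nan, np.nan, "Ex", np.nan],
    "LotFrontage": [60.0, np.nan, 80.0, 68.0],
    "LotArea": [9600, 2980, 13869, 10480],
})
summary = missing_summary(toy)

# Share of rows with a missing PoolQC that also report zero pool area.
no_pool_when_missing = (toy.loc[toy["PoolQC"].isna(), "PoolArea"] == 0).mean()
print(summary)
print(no_pool_when_missing)
```

A share close to 1 would support encoding the missing level as its own category rather than imputing it.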

## Categorical variables <a class="anchor"  id="categorical"></a>
Questions: Is there an ordering in the categories? Are there special (non-ASCII) characters?
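
Both questions can be probed with a few lines of pandas. This is a sketch under assumptions: the helper `non_ascii_levels` is hypothetical, the level `"Café"` is made up to trigger the check, and the quality ordering `Po < Fa < TA < Gd < Ex` is assumed rather than taken from this notebook, so it should be confirmed against the data description.

```python
import pandas as pd


def non_ascii_levels(df: pd.DataFrame) -> dict:
    """Map each non-numeric column to the levels containing non-ASCII characters."""
    result = {}
    for col in df.select_dtypes(exclude="number").columns:
        levels = df[col].dropna().astype(str).unique()
        bad = [level for level in levels if not level.isascii()]
        if bad:
            result[col] = bad
    return result


# Toy data (made up); "Café" is not a real level of the dataset.
toy = pd.DataFrame({
    "GarageQual": ["TA", "Gd", "Fa", "TA"],
    "Neighborhood": ["NAmes", "Café", "NAmes", "NAmes"],
})
flagged = non_ascii_levels(toy)

# Assumed ordering of the quality levels, from worst to best.
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]
quality_dtype = pd.CategoricalDtype(categories=quality_order, ordered=True)
codes = toy["GarageQual"].astype(quality_dtype).cat.codes
print(flagged)
print(codes.tolist())
```

Levels that are not listed in `quality_order` become missing and receive the code `-1`, which makes unexpected levels easy to spot.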

## Distributions <a class="anchor"  id="distributions"></a>
Questions: Is the feature continuous or discrete? If discrete: How many distinct values are there? Is the distribution similar to a normal distribution? Is the distribution skewed? Are there particularly frequent or rare values? Are there inf values? Do the features have a similar range? Are the assumptions of the ML algorithm you want to use met?
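
Several of these questions can be answered numerically before plotting anything. The sketch below assumes a hypothetical helper `distribution_summary`; the toy values and the `Ratio` column are made up for illustration.

```python
import numpy as np
import pandas as pd


def distribution_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per numeric column: distinct values, skewness, inf count, and range."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "n_unique": numeric.nunique(),
        "skew": numeric.skew(),
        "n_inf": numeric.isin([np.inf, -np.inf]).sum(),
        "min": numeric.min(),
        "max": numeric.max(),
    })


# Toy data (made up); in the notebook this would be applied to X.
toy = pd.DataFrame({
    "LotArea": [7000, 8000, 9000, 10000, 200000],
    "OverallQual": [5, 6, 5, 7, 6],
    "Ratio": [0.5, np.inf, 1.0, 2.0, 1.5],  # hypothetical column containing inf
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Pave"],
})
summary = distribution_summary(toy)
print(summary)
```

A small `n_unique` hints at a discrete feature, a large positive `skew` at a long right tail, and very different `min`/`max` ranges across columns at a need for scaling, depending on the algorithm used later.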

## Quantitative relationships & feature importance <a class="anchor"  id="quantitative"></a>
Questions: Are there collinear variables? Which features are most predictive?
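
One simple way to approach both questions is through pairwise Pearson correlations. This is only a sketch: the helpers `target_correlations` and `collinear_pairs`, the `0.9` threshold, the toy values, and the `GarageAreaSqM` column are all assumptions made for illustration. Correlation with the target is a crude proxy for feature importance, since it captures only linear, single-feature effects.

```python
import numpy as np
import pandas as pd


def target_correlations(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Pearson correlation of each numeric feature with the target, strongest first."""
    numeric = X.select_dtypes(include="number")
    return numeric.corrwith(y).sort_values(key=abs, ascending=False)


def collinear_pairs(X: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = X.select_dtypes(include="number").corr().abs()
    # Keep the strict upper triangle so each pair is reported only once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data (made up); GarageAreaSqM is a hypothetical rescaled copy of GarageArea.
toy_X = pd.DataFrame({
    "OverallQual": [4, 5, 6, 7, 8, 9],
    "GarageArea": [200, 300, 350, 500, 600, 700],
    "MoSold": [6, 1, 12, 3, 9, 2],
})
toy_X["GarageAreaSqM"] = toy_X["GarageArea"] * 0.0929
toy_y = pd.Series([100000, 120000, 150000, 200000, 260000, 330000], name="SalePrice")

ranking = target_correlations(toy_X, toy_y)
pairs = collinear_pairs(toy_X)
print(ranking)
print(pairs)
```

In the notebook, `target_correlations(X_train, y_train)` would restrict the ranking to rows that have a target, while `collinear_pairs(X)` can use all rows.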

## Qualitative relationships <a class="anchor"  id="qualitative"></a>
Questions: Is there a clear relationship of a feature with the target that has not been caught by the quantitative analysis? Is a transformation of a feature helpful with your ML algorithm?
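
Relationships of this kind are usually inspected in plots; as a text-only stand-in, the target can be summarized per category level. The helper `target_by_category` and the toy values below are assumptions for illustration, not the notebook's own method.

```python
import pandas as pd


def target_by_category(X: pd.DataFrame, y: pd.Series, col: str) -> pd.DataFrame:
    """Count and median of the target per level of a categorical feature."""
    # dropna=False keeps missing values as their own group.
    grouped = y.groupby(X[col], dropna=False)
    return grouped.agg(["count", "median"]).sort_values("median")


# Toy data (made up); in the notebook this would use X_train and y_train.
toy_X = pd.DataFrame({"MSZoning": ["RL", "RM", "RL", "FV", "RM", "RL"]})
toy_y = pd.Series([200000, 120000, 220000, 250000, 100000, 180000], name="SalePrice")

by_zone = target_by_category(toy_X, toy_y, "MSZoning")
print(by_zone)
```

Levels with clearly separated medians and enough observations are candidates for an ordinal or target-based encoding; levels with very few rows should be read with caution.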

## Outliers <a class="anchor"  id="outliers"></a>
Questions: Are there outliers? Are those outliers misleading the prediction?
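
A common first pass is Tukey's rule, which flags values more than `k` interquartile ranges outside the quartiles. The sketch below uses a hypothetical helper `iqr_outliers` with the conventional `k = 1.5` and made-up toy values; it is one possible outlier definition, not the only one.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy data (made up); in the notebook this would be applied per numeric column.
lot_area = pd.Series([7000, 8000, 9000, 10000, 200000], name="LotArea")
mask = iqr_outliers(lot_area)
print(lot_area[mask])
```

Whether a flagged row actually misleads the prediction cannot be read off the mask; comparing model error with and without those rows is one way to find out.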

## Clusters <a class="anchor"  id="clusters"></a>
Questions: Are there clear clusters in the data that should be predicted separately?