# Exploratory data analysis
In this notebook, we perform an exploratory data analysis (EDA) to gain insight into the data. The findings inform the more elaborate preprocessing steps that follow.

### Table of contents
* [Load data](#load)
* [First data inspection](#inspection)
* [Missing data](#missing)
* [Faulty data](#faulty)
* [Categorical variables](#categorical)
* [Distributions](#distributions)
* [Quantitative relationships & feature importance](#quantitative)
* [Qualitative relationships](#qualitative)
* [Outliers](#outliers)
* [Clusters](#clusters)

In [12]:
import pandas as pd

# display full dataframes:
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

train_data_file = "../data/train.csv"
target_col = "SalePrice"

## Load data <a class="anchor"  id="load"></a>
To prevent any biased decisions, the EDA is performed only on the training set. Thus, only the training data is loaded.

In [13]:
train_df = pd.read_csv(train_data_file)
train_df.set_index("Id", inplace=True)
y_train = train_df[target_col]
X_train = train_df.drop(columns=[target_col])

## First data inspection <a class="anchor"  id="inspection"></a>
**Questions: What is the size of the data? What does the data look like? What are the data types of the variables?**

In [4]:
print(f"Size of training data: {train_df.shape}")

Size of training data: (1460, 80)


In [5]:
train_df.sample(5)

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
866,20,RL,,8750,Pave,,IR1,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,6,1970,1970,Gable,CompShg,MetalSd,MetalSd,BrkFace,76.0,TA,TA,CBlock,TA,TA,No,BLQ,828,Unf,0,174,1002,GasA,TA,Y,SBrkr,1002,0,0,1002,1,0,1,0,3,1,TA,5,Typ,0,,Detchd,1973.0,Unf,2,902,TA,TA,Y,0,0,0,0,0,0,,MnPrv,,0,8,2009,WD,Normal,148500
674,20,RL,110.0,14442,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Crawfor,Norm,Norm,1Fam,1Story,6,7,1957,2004,Hip,CompShg,CemntBd,CmentBd,BrkFace,106.0,TA,TA,PConc,TA,TA,No,GLQ,1186,Unf,0,291,1477,GasA,Ex,Y,SBrkr,1839,0,0,1839,1,0,2,0,3,1,Gd,7,Typ,2,TA,Attchd,1957.0,Fin,2,416,TA,TA,Y,0,87,0,0,200,0,,,,0,6,2007,WD,Normal,257500
1316,60,RL,85.0,11075,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,2Story,6,5,1969,1969,Gable,CompShg,HdBoard,HdBoard,,0.0,TA,TA,CBlock,Fa,TA,Mn,ALQ,500,LwQ,276,176,952,GasA,TA,Y,SBrkr,1092,1020,0,2112,0,0,2,1,4,1,TA,9,Typ,2,Gd,Attchd,1969.0,Unf,2,576,TA,TA,Y,280,0,0,0,0,0,,,,0,6,2008,WD,Normal,206900
1242,20,RL,83.0,9849,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,1Fam,1Story,7,6,2007,2007,Hip,CompShg,VinylSd,VinylSd,Stone,0.0,Gd,TA,PConc,Gd,TA,Av,Unf,0,Unf,0,1689,1689,GasA,Ex,Y,SBrkr,1689,0,0,1689,0,0,2,0,3,1,Gd,7,Typ,0,,Attchd,2007.0,RFn,3,954,TA,TA,Y,0,56,0,0,0,0,,,,0,6,2007,New,Partial,248328
410,60,FV,85.0,10800,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,1Fam,2Story,8,5,2007,2008,Gable,CompShg,VinylSd,VinylSd,Stone,100.0,Gd,TA,PConc,Ex,TA,No,GLQ,789,Unf,0,245,1034,GasA,Ex,Y,SBrkr,1050,1028,0,2078,1,0,2,1,3,1,Ex,8,Typ,1,Gd,Attchd,2008.0,Fin,3,836,TA,TA,Y,0,102,0,0,0,0,,,,0,4,2008,New,Partial,339750


In [6]:
# distribution of data types:
X_train.dtypes.value_counts()

object     43
int64      33
float64     3
Name: count, dtype: int64

In [7]:
print(f"Data type of target: {y_train.dtype}")

Data type of target: int64


**Findings**:
The training data has 1460 rows and 80 columns, one of which is the target. Of the 79 feature columns, 43 have dtype `object` (categorical), 33 are integers and 3 are floating-point numbers. The target is stored as an integer.
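
The dtype counts above can also be turned into explicit column lists, which is handy for the later sections. The following is a minimal sketch on a toy frame with hypothetical values; the names `toy`, `numerical_cols` and `categorical_cols` are illustrative and not part of this notebook.

```python
import pandas as pd

# Toy frame standing in for X_train (hypothetical values, real column names).
toy = pd.DataFrame(
    {
        "MSZoning": ["RL", "FV", "RL"],
        "LotArea": [8750, 10800, 9849],
        "LotFrontage": [85.0, float("nan"), 83.0],
    }
)

# Columns stored as integers or floats.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else (here: strings) is a candidate categorical feature.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(f"numerical:   {numerical_cols}")
print(f"categorical: {categorical_cols}")
```

Note that the storage type is only a first hint: a feature stored as an integer can still be a category code, which is one of the questions of the faulty data section.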

## Missing data <a class="anchor"  id="missing"></a>
**Questions: How is missing data represented? How much data is missing? Do missing values actually mean a specific value?**

In [8]:
# check for representations of missing values other than np.nan:
other_nan_representations = [
    "?",
    "-",
    "",
    " ",
    "None",
    None,
    "nan",
    "NAN",
    "n/a",
    "na",
    "NA",
    "null",
    "NULL",
    "nil",
    "NIL",
    "empty",
]
other_nan_rep_present = X_train.isin(other_nan_representations).any().any()
print(
    "Is there any other representation for missing values than np.nan?"
    f" {other_nan_rep_present}"
)

Is there any other representation for missing values than np.nan? False


In [9]:
# count missing values:
sum_missing = X_train.isna().sum().sort_values(ascending=False)
sum_missing[sum_missing > 0.0]

PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
MasVnrType       872
FireplaceQu      690
LotFrontage      259
GarageCond        81
GarageType        81
GarageYrBlt       81
GarageQual        81
GarageFinish      81
BsmtFinType2      38
BsmtExposure      38
BsmtFinType1      37
BsmtCond          37
BsmtQual          37
MasVnrArea         8
Electrical         1
dtype: int64

According to the data description, a missing value in the following features means that the house does not have the corresponding feature: Alley, MasVnrType, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, FireplaceQu, GarageType, GarageFinish, GarageQual, GarageCond, PoolQC, Fence, MiscFeature.

**Findings:**
- All missing values seem to be represented by `np.nan`.
- There are 19 columns with missing values in the training set. The features with the most missing values are PoolQC, MiscFeature, Alley, Fence, MasVnrType, FireplaceQu and LotFrontage.
- For all features with missing values except LotFrontage, GarageYrBlt, MasVnrArea and Electrical, the data description states that a missing value means that the house does not have the feature.
- For the numerical features (LotFrontage, GarageYrBlt, MasVnrArea), the data description does not state the meaning of a missing value. I assume that it means the same thing, that is, the house does not have the corresponding feature.
- The single missing value in the categorical feature Electrical is, I think, a truly missing value with no special meaning.
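
Absolute counts are easier to judge relative to the 1460 rows; for instance, the 1453 missing PoolQC values correspond to roughly 99.5 % of all rows. A minimal sketch of computing the missing share per column, on a toy frame with hypothetical values (the names `toy` and `missing_share` are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X_train (hypothetical values, real column names).
toy = pd.DataFrame(
    {
        "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
        "LotFrontage": [85.0, np.nan, 83.0, 110.0],
        "LotArea": [8750, 10800, 9849, 14442],
    }
)

# The mean of a boolean mask is the fraction of True entries,
# so isna().mean() yields the share of missing values per column.
missing_share = toy.isna().mean().sort_values(ascending=False)
missing_share = missing_share[missing_share > 0]

# Report as percentages.
print((missing_share * 100).round(1))
```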

The missing data should be handled before the next steps of the EDA.

**Handling missing values:**
- For the categorical features Alley, MasVnrType, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, FireplaceQu, GarageType, GarageFinish, GarageQual, GarageCond, PoolQC, Fence and MiscFeature, missing values can be replaced by a unique string such as "None".
- For the numerical features LotFrontage and MasVnrArea, which measure a size, a reasonable replacement for missing values is their lower bound, zero.
- For GarageYrBlt (numerical), the choice is more difficult since it does not measure a size. The median or the rounded mean may be a sensible replacement. The mean should be rounded to an integer value so that the column keeps holding whole years only.
- The row with the single missing value of Electrical can be deleted entirely since it is only one row.

In [15]:
# handling the missing values:
nan_to_string = [
    "Alley",
    "MasVnrType",
    "BsmtQual",
    "BsmtCond",
    "BsmtExposure",
    "BsmtFinType1",
    "BsmtFinType2",
    "FireplaceQu",
    "GarageType",
    "GarageFinish",
    "GarageQual",
    "GarageCond",
    "PoolQC",
    "Fence",
    "MiscFeature",
]
nan_to_zero = ["LotFrontage", "MasVnrArea"]
nan_to_mean = ["GarageYrBlt"]
nan_delete_row = ["Electrical"]

X_train[nan_to_string] = X_train[nan_to_string].fillna("None")
X_train[nan_to_zero] = X_train[nan_to_zero].fillna(0)
X_train[nan_to_mean] = X_train[nan_to_mean].apply(
    lambda col: round(col.fillna(col.mean())), axis=0
)
X_train.dropna(axis=0, subset=nan_delete_row, inplace=True)
# keep the target aligned with the remaining feature rows:
y_train = y_train.loc[X_train.index]
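
After imputing and dropping rows, two things are worth verifying: that no missing values remain, and that features and target still refer to the same rows. The sketch below illustrates both checks on toy stand-ins with hypothetical values; `X_toy` and `y_toy` are illustrative names, not objects of this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X_train and y_train (hypothetical values).
X_toy = pd.DataFrame(
    {
        "Alley": [np.nan, "Grvl", np.nan],
        "Electrical": ["SBrkr", np.nan, "SBrkr"],
    },
    index=pd.Index([1, 2, 3], name="Id"),
)
y_toy = pd.Series([200000, 150000, 180000], index=X_toy.index, name="SalePrice")

# Same strategy as above: fill one column, drop rows with a missing Electrical.
X_toy["Alley"] = X_toy["Alley"].fillna("None")
X_toy = X_toy.dropna(subset=["Electrical"])

# Dropping rows from the features requires dropping them from the target too.
y_toy = y_toy.loc[X_toy.index]

# Check 1: no missing values are left.
assert not X_toy.isna().any().any()
# Check 2: features and target share exactly the same index.
assert X_toy.index.equals(y_toy.index)
```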

## Faulty data <a class="anchor"  id="faulty"></a>
**Questions: Are there duplicated rows or columns? Does the feature’s data type make sense? Is a numerical feature actually categorical or vice versa? Do all values of one feature have the same data type? Are all values in the expected range? If multiple features are connected, is there any inconsistency?**
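
Some of these questions can be answered with a few pandas calls. The following is a minimal sketch on a toy frame with hypothetical values. The consistency rule (a remodel year should not precede the construction year) is an assumption about how YearBuilt and YearRemodAdd relate and should be verified against the data description.

```python
import pandas as pd

# Toy frame standing in for X_train (hypothetical values, real column names).
toy = pd.DataFrame(
    {
        "YearBuilt": [1970, 1957, 1970, 2007],
        "YearRemodAdd": [1970, 2004, 1970, 2006],
        "YrSold": [2009, 2007, 2009, 2007],
    }
)

# Fully identical rows; the first occurrence is not counted as a duplicate.
n_duplicate_rows = int(toy.duplicated().sum())

# Identical columns: transpose so that columns become rows, then reuse duplicated().
duplicate_cols = toy.columns[toy.T.duplicated().to_numpy()].tolist()

# Consistency between connected features (assumed rule, see above).
inconsistent = toy[toy["YearRemodAdd"] < toy["YearBuilt"]]

print(f"duplicated rows: {n_duplicate_rows}")
print(f"duplicated columns: {duplicate_cols}")
print(f"rows remodelled before being built: {inconsistent.index.tolist()}")
```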

## Categorical variables <a class="anchor"  id="categorical"></a>
**Questions: Is there an ordering in the categories? Are there special characters (non ASCII)?**
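
Both questions can be checked programmatically. The sketch below encodes an ordered quality scale and flags labels containing non-ASCII characters. The scale `Po < Fa < TA < Gd < Ex` is an assumption to be confirmed in the data description, and the label `Brké` is a made-up example.

```python
import pandas as pd

# Assumed quality scale from worst to best; verify against the data description.
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]

exter_qual = pd.Series(["TA", "Gd", "Ex", "TA"], name="ExterQual")

# An ordered categorical keeps the labels but knows their rank.
ordered = pd.Categorical(exter_qual, categories=quality_order, ordered=True)
# Integer codes follow the category order: 0 = worst, 4 = best.
codes = pd.Series(ordered.codes, index=exter_qual.index)

# Flag labels that contain characters outside the ASCII range.
labels = pd.Series(["VinylSd", "CemntBd", "Brk\u00e9"])  # last label is made up
has_non_ascii = labels.map(lambda s: not s.isascii()).astype(bool)

print(codes.tolist())
print(labels[has_non_ascii].tolist())
```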

## Distributions <a class="anchor"  id="distributions"></a>
**Questions: Is the feature continuous or discrete? If discrete: How many distinct values are there? Is the distribution similar to a normal distribution? Is the distribution skewed? Are there particularly frequent or rare values? Are there inf values? Do the features have a similar range? Are the assumptions of the ML algorithm you want to use met?**
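
Before plotting, several of these questions can be answered numerically. A minimal sketch on a toy frame with hypothetical values, covering the number of distinct values, the sample skewness and the presence of infinite values:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the numerical part of X_train (hypothetical values).
toy = pd.DataFrame(
    {
        "GarageCars": [2, 2, 3, 2, 1, 2],
        "LotArea": [8750.0, 14442.0, 11075.0, 9849.0, 10800.0, 60000.0],
    }
)

# Few distinct values suggest a discrete feature.
n_distinct = toy.nunique()

# Sample skewness: about 0 for symmetric data, positive for a long right tail.
skewness = toy.skew()

# Count infinite values per column.
n_inf = toy.isin([np.inf, -np.inf]).sum()

print(pd.DataFrame({"n_distinct": n_distinct, "skewness": skewness, "n_inf": n_inf}))
```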

## Quantitative relationships & feature importance <a class="anchor"  id="quantitative"></a>
**Questions: Are there collinear variables? Which features are most predictive?**
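
One simple way to approach both questions is the Pearson correlation: between each feature and the target as a rough indicator of predictive value, and between pairs of features as an indicator of collinearity. The sketch below uses a toy frame with hypothetical values. The threshold of 0.9 is an arbitrary assumption, and Pearson correlation only captures linear relationships.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values (real column names).
toy = pd.DataFrame(
    {
        "GarageCars": [1, 2, 2, 3, 3],
        "GarageArea": [280, 416, 576, 836, 954],
        "MoSold": [8, 6, 6, 4, 6],
        "SalePrice": [110000, 148500, 206900, 248328, 339750],
    }
)
features = toy.drop(columns=["SalePrice"])

# Absolute correlation of every feature with the target, strongest first.
corr_with_target = (
    features.corrwith(toy["SalePrice"]).abs().sort_values(ascending=False)
)

# Pairwise feature correlations; keep the upper triangle so each pair appears once.
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
collinear_pairs = upper.stack()
collinear_pairs = collinear_pairs[collinear_pairs > 0.9]  # assumed threshold

print(corr_with_target)
print(collinear_pairs)
```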

## Qualitative relationships <a class="anchor"  id="qualitative"></a>
**Questions: Is there a clear relationship of a feature with the target that has not been caught by the quantitative analysis? Is a transformation of a feature helpful with your ML algorithm?**

## Outliers <a class="anchor"  id="outliers"></a>
**Questions: Are there outliers? Are those outliers misleading the prediction?**
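
A common first pass for univariate outliers is the interquartile range (IQR) rule, which flags values more than 1.5 IQR below the first or above the third quartile. This is a minimal sketch on hypothetical values and only one of several possible criteria. Whether a flagged point actually misleads the prediction still has to be judged separately.

```python
import pandas as pd

# Hypothetical living-area values; the last one is deliberately extreme.
gr_liv_area = pd.Series([1200, 1839, 2112, 1689, 2078, 5642], name="GrLivArea")

# First and third quartile and the resulting fences.
q1, q3 = gr_liv_area.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as outlier candidates.
outliers = gr_liv_area[(gr_liv_area < lower) | (gr_liv_area > upper)]
print(outliers)
```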

## Clusters <a class="anchor"  id="clusters"></a>
**Questions: Are there clear clusters in the data that should be predicted separately?**