In [21]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [22]:
df = pd.read_csv('train.csv')

In [23]:
df.shape

(1460, 81)

In [24]:
columns = df.iloc[:,:30].columns

In [25]:
# List the features with NaN values
na_counts = df[columns].isna().sum(axis=0)
na_counts[na_counts > 0]

LotFrontage     259
Alley          1369
MasVnrType      872
MasVnrArea        8
dtype: int64

## MasVnrType and MasVnrArea
First, let's take a closer look at MasVnrType and MasVnrArea.

In [26]:
df.MasVnrType.value_counts(dropna=False)

MasVnrType
NaN        872
BrkFace    445
Stone      128
BrkCmn      15
Name: count, dtype: int64

As mentioned in the description of MasVnrType, NaN means that the house does not have any masonry veneer. This suggests that MasVnrArea should also be NaN or zero when there is no masonry veneer. Let's verify this.

In [27]:
df[df.MasVnrType.isna()].MasVnrArea.value_counts(dropna=False)

MasVnrArea
0.0      859
NaN        8
1.0        2
288.0      1
344.0      1
312.0      1
Name: count, dtype: int64

This shows that when the masonry veneer type is NaN, the area is either 0 or NaN, except for 5 observations with a positive area. For those rows, either the type is simply missing or the non-zero area is a measurement mistake. Since only 5 rows are ambiguous, we can just drop them.

In [28]:
# Drop rows with a missing veneer type but a positive veneer area
df.query("~(MasVnrType.isna() & (MasVnrArea > 0))", inplace=True)
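
For readers less familiar with `query` strings, the same filter can be sketched as a plain boolean mask. The `toy` frame below is a hypothetical stand-in with made-up values, not the real `train.csv`; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values are invented for illustration only).
toy = pd.DataFrame({
    "MasVnrType": [np.nan, "BrkFace", np.nan, np.nan, "Stone"],
    "MasVnrArea": [0.0, 120.0, 288.0, np.nan, 0.0],
})

# Ambiguous rows: no veneer type recorded, yet a positive veneer area.
ambiguous = toy["MasVnrType"].isna() & (toy["MasVnrArea"] > 0)

# Keep everything else. NaN > 0 evaluates to False, so rows with a
# missing area are kept, just as in the query above.
cleaned = toy[~ambiguous]
```

Unlike `query(..., inplace=True)`, this version returns a new frame and leaves `toy` untouched, which can make it easier to compare the data before and after the drop.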

Conversely, when MasVnrArea is NaN or zero, MasVnrType should be NaN. The following code checks whether this holds.

In [29]:
df[df.MasVnrArea.isin([0, np.nan])].MasVnrType.value_counts(dropna=False)

MasVnrType
NaN        867
BrkFace      1
Stone        1
Name: count, dtype: int64

This is mostly the case: when the masonry veneer area is zero or NaN, there is usually no masonry veneer type. The rule is violated in only two rows, and since they are so few we can drop them as well.

In [30]:
# Drop rows with a veneer type but a zero or missing veneer area
df.query("~(((MasVnrArea == 0) | MasVnrArea.isna()) & MasVnrType.notna())", inplace=True)
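
After both drops, it may be worth confirming that the two columns now agree. The sketch below is one possible check, not part of the original notebook: `veneer_is_consistent` is a hypothetical helper, the `toy` frame holds made-up values, and the check assumes veneer areas are never negative.

```python
import numpy as np
import pandas as pd

def veneer_is_consistent(frame: pd.DataFrame) -> bool:
    """True when a missing type always pairs with a zero/missing area, and vice versa."""
    no_type = frame["MasVnrType"].isna()
    no_area = frame["MasVnrArea"].isna() | (frame["MasVnrArea"] == 0)
    return bool((no_type == no_area).all())

# Hypothetical toy data (values are invented for illustration only).
toy = pd.DataFrame({
    "MasVnrType": [np.nan, "BrkFace", np.nan, "Stone", np.nan],
    "MasVnrArea": [0.0, 120.0, 288.0, 0.0, np.nan],
})
before = veneer_is_consistent(toy)  # rows 2 and 3 break the rule

# Keep only rows where "no type" and "no area" agree; under the
# non-negative-area assumption this matches the two filters above.
no_type = toy["MasVnrType"].isna()
no_area = toy["MasVnrArea"].isna() | (toy["MasVnrArea"] == 0)
cleaned = toy[no_type == no_area]
after = veneer_is_consistent(cleaned)
```

On the real data, calling `veneer_is_consistent(df)` after the two `query` cells would be expected to return `True` if the drops did what was intended.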

## LotFrontage

There are 259 NaN values for the lot frontage, which might be reasonable since the property might not face the street. To investigate this, we need to take a look at LotConfig.

In [32]:
df.LotConfig[df.LotFrontage.isna()].value_counts(dropna=False)

LotConfig
Inside     133
Corner      62
CulDSac     48
FR2         14
Name: count, dtype: int64

This might be reasonable for the Inside lot configuration if there is no direct street access to the property. For the other configurations, it might be reasonable if the lot does not extend toward the street. However, I am not sure that either explanation is correct.
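
Raw counts are hard to interpret because Inside lots are probably also the most common configuration overall. One way to probe this further is to compare the share of missing LotFrontage values within each LotConfig group. The sketch below uses a hypothetical `toy` frame with made-up values; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values are invented for illustration only).
toy = pd.DataFrame({
    "LotConfig": ["Inside", "Inside", "Corner", "Corner", "CulDSac", "FR2"],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0, np.nan, 60.0],
})

# Fraction of rows with a missing LotFrontage within each LotConfig group.
missing_share = (
    toy["LotFrontage"].isna()
    .groupby(toy["LotConfig"])
    .mean()
    .sort_values(ascending=False)
)
```

If the share turned out to be similar across configurations on the real data, that would suggest the missing values are not explained by the lot configuration.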