## Random sample imputation - pandas

To download the House Prices dataset, please refer to the lecture **Datasets** in **Section 2** of this course.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [2]:
# We'll use the following variables:

cols_to_use = [
    "OverallQual",
    "TotalBsmtSF",
    "1stFlrSF",
    "GrLivArea",
    "WoodDeckSF",
    "BsmtUnfSF",
    "LotFrontage",
    "MasVnrArea",
    "GarageYrBlt",
    "BsmtQual",
    "FireplaceQu",
    "SalePrice",
]

In [3]:
# Load the House Prices dataset.

data = pd.read_csv("../../houseprice.csv", usecols=cols_to_use)

data.head()

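| | LotFrontage | OverallQual | MasVnrArea | BsmtQual | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | GrLivArea | FireplaceQu | GarageYrBlt | WoodDeckSF | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65.0 | 7 | 196.0 | Gd | 150 | 856 | 856 | 1710 | NaN | 2003.0 | 0 | 208500 |
| 1 | 80.0 | 6 | 0.0 | Gd | 284 | 1262 | 1262 | 1262 | TA | 1976.0 | 298 | 181500 |
| 2 | 68.0 | 7 | 162.0 | Gd | 434 | 920 | 920 | 1786 | TA | 2001.0 | 0 | 223500 |
| 3 | 60.0 | 7 | 0.0 | TA | 540 | 756 | 961 | 1717 | Gd | 1998.0 | 0 | 140000 |
| 4 | 84.0 | 8 | 350.0 | Gd | 490 | 1145 | 1145 | 2198 | TA | 2000.0 | 192 | 250000 |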


In [4]:
# Let's separate into training and testing sets.

X_train, X_test, y_train, y_test = train_test_split(
    data.drop("SalePrice", axis=1),
    data["SalePrice"],
    test_size=0.3,
    random_state=0,
)

X_train.shape, X_test.shape

((1022, 11), (438, 11))

In [5]:
# Find the variables that contain missing values in the train set.

vars_na = [var for var in X_train.columns if X_train[var].isnull().sum() > 0]

vars_na

['LotFrontage', 'MasVnrArea', 'BsmtQual', 'FireplaceQu', 'GarageYrBlt']

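Five variables in the train set contain missing values: three numerical (LotFrontage, MasVnrArea, GarageYrBlt) and two categorical (BsmtQual, FireplaceQu). We will impute all of them by random sampling.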
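
Before imputing, it can help to know how much data is missing in each variable, since that determines how many values will be drawn at random. The snippet below is a minimal sketch on a made-up toy dataframe (the values are illustrative and are not the House Prices data); `toy` and `na_fraction` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for X_train (not the real dataset).
toy = pd.DataFrame(
    {
        "LotFrontage": [65.0, np.nan, 68.0, 60.0, np.nan],
        "BsmtQual": ["Gd", "Gd", None, "TA", "Gd"],
        "GrLivArea": [1710, 1262, 1786, 1717, 2198],
    }
)

# isnull().mean() returns the fraction of missing values per column.
na_fraction = toy.isnull().mean()

# Keep only the columns that actually contain missing values.
na_fraction = na_fraction[na_fraction > 0].sort_values(ascending=False)

print(na_fraction)
```

Applied to the notebook's data, the same two lines would run on `X_train` instead of `toy`.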

## Random sample imputation

In [6]:
for var in vars_na:

    # Extract the random samples used to fill the NA.
    # We always draw the values from the train set, and use them
    # to fill the NA in both the train and the test sets.

    random_sample_train = (
        X_train[var].dropna().sample(X_train[var].isnull().sum(), random_state=0)
    )

    random_sample_test = (
        X_train[var].dropna().sample(X_test[var].isnull().sum(), random_state=0)
    )

    # What is the code above doing?

    # 1) dropna() removes the NA from the train variable, so that
    # we sample only from observed values and never draw an NA.

    # 2) sample() performs the random sampling (without replacement
    # by default).

    # 3) X_train[var].isnull().sum() and X_test[var].isnull().sum() are the
    # number of values to extract: as many as there are NA in the variable
    # of the train set and of the test set, respectively.

    # 4) random_state sets the seed for reproducibility, so that we extract
    # the same random values every time we run this notebook.

    # pandas aligns on the index when assigning a Series, so the sampled
    # values must carry the index of the rows that contain NA
    random_sample_train.index = X_train[X_train[var].isnull()].index
    random_sample_test.index = X_test[X_test[var].isnull()].index

    # replace the NA in the original variable with the sampled values
    X_train.loc[X_train[var].isnull(), var] = random_sample_train
    X_test.loc[X_test[var].isnull(), var] = random_sample_test
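
The loop above can also be wrapped in a small reusable function. The following is a sketch only, not part of the original notebook: the function name `random_sample_impute`, its parameters, and the toy dataframes are hypothetical. It makes two assumptions that differ from the loop: it works on copies instead of modifying the inputs, and it falls back to sampling with replacement when a variable has more NA than observed values in the train set, a case in which `sample()` would otherwise raise an error.

```python
import numpy as np
import pandas as pd


def random_sample_impute(train, test, variables, seed=0):
    """Fill NA in train and test with values drawn at random from train."""
    train, test = train.copy(), test.copy()

    for var in variables:
        # Donor pool: observed values of the train set only.
        observed = train[var].dropna()

        for df in (train, test):
            mask = df[var].isnull()
            n_missing = int(mask.sum())
            if n_missing == 0:
                continue

            # Sample with replacement only if there are more NA than donors.
            sample = observed.sample(
                n_missing,
                replace=n_missing > len(observed),
                random_state=seed,
            )

            # Assign by position, so the donor index does not interfere.
            df.loc[mask, var] = sample.to_numpy()

    return train, test


# Hypothetical toy data, for illustration only.
toy_train = pd.DataFrame(
    {"x": [1.0, np.nan, 3.0, np.nan, 5.0], "q": ["Gd", "TA", None, "Gd", "TA"]}
)
toy_test = pd.DataFrame({"x": [np.nan, 2.0], "q": ["TA", None]})

train_imp, test_imp = random_sample_impute(toy_train, toy_test, ["x", "q"])

print(train_imp.isnull().sum())
print(test_imp.isnull().sum())
```

Assigning `sample.to_numpy()` instead of re-indexing the sample is a design choice: it avoids the index alignment step entirely, at the cost of being less explicit about which donor fills which row.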

In [7]:
X_train.isnull().sum()

LotFrontage    0
OverallQual    0
MasVnrArea     0
BsmtQual       0
BsmtUnfSF      0
TotalBsmtSF    0
1stFlrSF       0
GrLivArea      0
FireplaceQu    0
GarageYrBlt    0
WoodDeckSF     0
dtype: int64

In [8]:
X_test.isnull().sum()

LotFrontage    0
OverallQual    0
MasVnrArea     0
BsmtQual       0
BsmtUnfSF      0
TotalBsmtSF    0
1stFlrSF       0
GrLivArea      0
FireplaceQu    0
GarageYrBlt    0
WoodDeckSF     0
dtype: int64
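
Because the imputed values are drawn from the observed values, random sample imputation should roughly preserve the spread of the variable, whereas mean imputation always shrinks it. The sketch below illustrates this on synthetic data; the series, the 20% missing rate, and all variable names are assumptions made for the example and do not come from the House Prices dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical numeric variable with roughly 20% of values missing at random.
original = pd.Series(rng.normal(loc=70, scale=20, size=1000))
original[rng.random(1000) < 0.2] = np.nan

# Random sample imputation, following the same steps as the loop above.
imputed = original.copy()
mask = imputed.isnull()
sample = original.dropna().sample(int(mask.sum()), random_state=0)
sample.index = imputed[mask].index
imputed.loc[mask] = sample

# Mean imputation, for comparison only.
mean_imputed = original.fillna(original.mean())

print("std, original (NA ignored):", original.std())
print("std, random sample imputed:", imputed.std())
print("std, mean imputed:         ", mean_imputed.std())
```

On the real data, the same comparison could be made per variable by keeping a copy of `X_train` before running the imputation loop.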