## Mean / Median imputation in Pandas
We will use the data from the
[Housing Dataset](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data?select=train.csv).

In [1]:
import io
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [2]:
from google.colab import files
uploaded = files.upload()

Saving houseprice.csv to houseprice.csv


In [3]:
# Limit our data and use only these columns
cols_to_use = [
    "TotalBsmtSF",
    "GrLivArea",
    "BsmtUnfSF",
    "LotFrontage",
    "MasVnrArea",
    "GarageYrBlt",
    "SalePrice",
]

In [4]:
# Load the House Prices dataset.
data = pd.read_csv(io.BytesIO(uploaded['houseprice.csv']), usecols=cols_to_use)
data.head()

| | LotFrontage | MasVnrArea | BsmtUnfSF | TotalBsmtSF | GrLivArea | GarageYrBlt | SalePrice |
|---|---|---|---|---|---|---|---|
| 0 | 65.0 | 196.0 | 150 | 856 | 1710 | 2003.0 | 208500 |
| 1 | 80.0 | 0.0 | 284 | 1262 | 1262 | 1976.0 | 181500 |
| 2 | 68.0 | 162.0 | 434 | 920 | 1786 | 2001.0 | 223500 |
| 3 | 60.0 | 0.0 | 540 | 756 | 1717 | 1998.0 | 140000 |
| 4 | 84.0 | 350.0 | 490 | 1145 | 2198 | 2000.0 | 250000 |


**Remember that the mean or median used to replace missing values (NA) must be calculated on the train set only, and then applied to both the train and test sets.**
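To make the point concrete, here is a minimal sketch on toy data. The values and the names `toy_train`, `toy_test`, and `train_median` are made up for illustration and are not part of the House Prices dataset. The median is learned from the training rows only and then reused for the test rows, so the test set never influences the imputation value.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the House Prices dataset).
toy_train = pd.DataFrame({"LotFrontage": [60.0, 70.0, np.nan, 80.0]})
toy_test = pd.DataFrame({"LotFrontage": [np.nan, 200.0]})

# Learn the imputation value from the training set only.
# median() skips NaN by default, so this is the median of 60, 70 and 80.
train_median = toy_train["LotFrontage"].median()

# Apply the same value to both sets; the test set never contributes to it.
toy_train_imputed = toy_train.fillna({"LotFrontage": train_median})
toy_test_imputed = toy_test.fillna({"LotFrontage": train_median})

print(train_median)
print(toy_test_imputed)
```

Note that the large value in `toy_test` (200.0) has no effect on the number used to fill the gaps, which is the behaviour we want when the test set stands in for unseen data.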

In [5]:
# Let's separate into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("SalePrice", axis=1),
    data["SalePrice"],
    test_size=0.3,
    random_state=42,
)

X_train.shape, X_test.shape

((1022, 6), (438, 6))

In [6]:
# Find missing data
X_train.isnull().sum()

LotFrontage    190
MasVnrArea       3
BsmtUnfSF        0
TotalBsmtSF      0
GrLivArea        0
GarageYrBlt     54
dtype: int64

In [7]:
X_train.isnull().mean()

LotFrontage    0.185910
MasVnrArea     0.002935
BsmtUnfSF      0.000000
TotalBsmtSF    0.000000
GrLivArea      0.000000
GarageYrBlt    0.052838
dtype: float64

In [8]:
# Capture the variables to impute in a list.
vars_to_impute = [var for var in X_train.columns if X_train[var].isnull().sum() > 0]
vars_to_impute

['LotFrontage', 'MasVnrArea', 'GarageYrBlt']

In [9]:
# Equivalent to the list comprehension above, written as an explicit loop.
vars_to_impute = []
for var in X_train.columns:
    if X_train[var].isnull().sum() > 0:
        vars_to_impute.append(var)
vars_to_impute

['LotFrontage', 'MasVnrArea', 'GarageYrBlt']

In [10]:
# Capture the median of the 3 variables in a dictionary
imputation_dict = X_train[vars_to_impute].median().to_dict()
imputation_dict

{'LotFrontage': 70.0, 'MasVnrArea': 0.0, 'GarageYrBlt': 1980.0}

To perform mean imputation instead of median imputation, replace the code in the previous cell with:
```python
imputation_dict = X_train[vars_to_impute].mean().to_dict()
```
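The two statistics can differ noticeably when a variable contains extreme values. The following sketch uses a made-up series (the numbers and the names `s`, `mean_value`, and `median_value` are illustrative assumptions) to show how a single large value pulls the mean upward while leaving the median unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one extreme observation and one missing entry.
s = pd.Series([50.0, 60.0, 70.0, 80.0, 340.0, np.nan])

# Both methods ignore NaN by default.
mean_value = s.mean()      # (50 + 60 + 70 + 80 + 340) / 5
median_value = s.median()  # middle value of the five observed numbers

print(mean_value, median_value)

# Fill the gap with each statistic to compare the results.
print(s.fillna(mean_value).iloc[-1], s.fillna(median_value).iloc[-1])
```

Which statistic suits a given variable depends on its distribution, so it may be worth inspecting the data before choosing.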

In [11]:
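# Replace missing data with the values learned from the train set.
# Assigning the result (instead of inplace=True) avoids modifying the
# split DataFrames in place, which can trigger a SettingWithCopyWarning.
X_train = X_train.fillna(imputation_dict)
X_test = X_test.fillna(imputation_dict)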

In [12]:
# Validate Replacement for Train Data
X_train.isnull().sum()

LotFrontage    0
MasVnrArea     0
BsmtUnfSF      0
TotalBsmtSF    0
GrLivArea      0
GarageYrBlt    0
dtype: int64

In [13]:
# Validate Replacement for Test Data
X_test.isnull().sum()

LotFrontage    0
MasVnrArea     0
BsmtUnfSF      0
TotalBsmtSF    0
GrLivArea      0
GarageYrBlt    0
dtype: int64
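The steps in this notebook can be collected into a small helper. This is only a sketch: the function name `impute_from_train`, its `strategy` parameter, and the toy DataFrames are hypothetical and not part of the original notebook. Like the cells above, it imputes only the columns that have missing values in the train set, so a column that is complete in train but has gaps in test would be left untouched.

```python
import numpy as np
import pandas as pd


def impute_from_train(train, test, strategy="median"):
    """Fill NA in train and test using statistics learned from train only."""
    # Columns with at least one missing value in the train set.
    vars_to_impute = [col for col in train.columns if train[col].isnull().any()]

    # Learn one value per column from the train set.
    if strategy == "median":
        stats = train[vars_to_impute].median().to_dict()
    elif strategy == "mean":
        stats = train[vars_to_impute].mean().to_dict()
    else:
        raise ValueError("strategy must be 'mean' or 'median'")

    # fillna returns new DataFrames; the inputs are left unchanged.
    return train.fillna(stats), test.fillna(stats), stats


# Toy data (hypothetical values) to exercise the helper.
toy_train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [1.0, 2.0, 3.0]})
toy_test = pd.DataFrame({"a": [np.nan, 5.0], "b": [4.0, 5.0]})

train_filled, test_filled, learned = impute_from_train(toy_train, toy_test)
print(learned)
print(test_filled)
```

Returning the learned values alongside the filled DataFrames makes it easy to reuse the same numbers on data that arrives later.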