# Missing Values Handling
This notebook is an exercise from the Intermediate Machine Learning course on Kaggle. Tutorial link: https://www.kaggle.com/code/alexisbcook/missing-values/tutorial
Housing prices data source: https://www.kaggle.com/competitions/home-data-for-ml-course/data

In [24]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X_full = pd.read_csv('train.csv', index_col='Id')
X_test_full = pd.read_csv('test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)  # explained below
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)

# To keep things simple, I will use only numerical predictors
X = X_full.select_dtypes(exclude=['object'])
X_test = X_test_full.select_dtypes(exclude=['object'])

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

- dropna: pandas DataFrame method that removes rows or columns containing missing (NaN, Not-a-Number) values.
- axis=0: drop rows; axis=1 would drop columns instead.
- subset=['SalePrice']: the column(s) to check for missing values. Missing values in other columns are ignored.
- inplace=True: modify the DataFrame directly instead of returning a new one. Here, the rows with a missing SalePrice are removed from X_full itself.
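A minimal sketch of how these parameters interact, using a toy DataFrame with made-up values (not taken from the housing data). It uses the non-inplace form so the original frame stays available for comparison.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values, for illustration only
toy = pd.DataFrame({
    "LotArea": [8450, np.nan, 11250, 9550],
    "SalePrice": [208500, 181500, np.nan, 140000],
})

# Only rows with a missing SalePrice are dropped;
# the row with a missing LotArea is kept because LotArea is not in subset
cleaned = toy.dropna(axis=0, subset=["SalePrice"])
print(cleaned)
```

The row at index 2 is removed, while the NaN in LotArea at index 1 is left for a later imputation step.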

In [25]:
X_train.head()

Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
619,20,90.0,11694,9,5,2007,2007,452.0,48,0,...,774,0,108,0,0,260,0,0,7,2007
871,20,60.0,6600,5,5,1962,1962,0.0,0,0,...,308,0,0,0,0,0,0,0,8,2009
93,30,80.0,13360,5,7,1921,2006,0.0,713,0,...,432,0,0,44,0,0,0,0,8,2009
818,20,,13265,8,5,2002,2002,148.0,1218,0,...,857,150,59,0,0,0,0,0,7,2008
303,20,118.0,13704,7,5,2001,2002,150.0,0,0,...,843,468,81,0,0,0,0,0,1,2006


## Preliminary Investigation

In [26]:
# Shape of training data (num_rows, num_columns)
print(X_train.shape)

# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column>0])

(1168, 36)
LotFrontage    212
MasVnrArea       6
GarageYrBlt     58
dtype: int64
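The raw counts are easier to judge as a share of the 1168 training rows. The sketch below recomputes that share from the numbers printed above; inside the notebook, `X_train.isnull().mean()` should give the same fractions directly.

```python
import pandas as pd

# Counts and row total copied from the output above
missing_counts = pd.Series({"LotFrontage": 212, "MasVnrArea": 6, "GarageYrBlt": 58})
n_rows = 1168

# Percentage of training rows with a missing value, per column
missing_pct = (missing_counts / n_rows * 100).round(1)
print(missing_pct)
```

LotFrontage is missing in roughly 18% of rows, so dropping that column discards a feature that is present for most houses.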


## Function to measure MAE

In [27]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

## 1. Drop columns with missing values

In [28]:
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                    if X_train[col].isnull().any()] 

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

# Mean Absolute Error
print("MAE after dropping columns with missing values: ")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

MAE after dropping columns with missing values: 
17837.82570776256


## 2. Imputation

In [29]:
from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE for Imputation data: ")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

MAE for Imputation data: 
18062.894611872147
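As an alternative to restoring the column names afterwards, the names and the row index can be passed when the DataFrame is built. This is a sketch on toy data with hypothetical values; it also shows that the validation set is filled with the mean learned from the training set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames with hypothetical values; the index stands in for the Id column
train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, 30.0]},
                     index=[101, 102, 103])
valid = pd.DataFrame({"a": [np.nan], "b": [40.0]}, index=[104])

imputer = SimpleImputer()  # default strategy is the column mean

# fit_transform learns the means from the training data only
train_imp = pd.DataFrame(imputer.fit_transform(train),
                         columns=train.columns, index=train.index)
# transform reuses the training means on the validation data
valid_imp = pd.DataFrame(imputer.transform(valid),
                         columns=valid.columns, index=valid.index)

print(train_imp)
print(valid_imp)
```

Keeping the original index is optional for model fitting, but it makes it easier to line predictions up with the Ids later.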


## 3. Imputation with the median
Fill in the missing values with each column's median instead of its mean (the SimpleImputer default).
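A small sketch of why the two strategies can differ, using a single toy column with one large outlier (values are made up for illustration). The mean is pulled toward the outlier, while the median is not.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One feature, five rows; the fourth value is missing and 300.0 is an outlier
col = np.array([[60.0], [65.0], [70.0], [np.nan], [300.0]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(col)
median_filled = SimpleImputer(strategy="median").fit_transform(col)

print(mean_filled[3, 0])    # mean of 60, 65, 70, 300
print(median_filled[3, 0])  # median of 60, 65, 70, 300
```

Here the mean fill is 123.75 and the median fill is 67.5, so the median stays closer to the typical values in the column.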

In [30]:
# Imputation
final_imputer = SimpleImputer(strategy='median')
final_X_train = pd.DataFrame(final_imputer.fit_transform(X_train))
final_X_valid = pd.DataFrame(final_imputer.transform(X_valid))

# Imputation removed column names; put them back
final_X_train.columns = X_train.columns
final_X_valid.columns = X_valid.columns

# Define and fit model
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(final_X_train, y_train)

# Get validation predictions and MAE
preds_valid = model.predict(final_X_valid)
print("MAE (Your approach):")
print(mean_absolute_error(y_valid, preds_valid))

MAE (Your approach):
17791.59899543379


## Compare all different approaches to decide the best result

In [31]:
# First approach: dropping columns with missing values
print("1. MAE for dropped columns with missing values: ")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

# Second approach: filling in the missing values with the mean
print("2. MAE for Imputation with mean: ")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

# Third approach: filling in the missing values with the median
print("3. MAE for Imputation with median: ")
print(score_dataset(final_X_train, final_X_valid, y_train, y_valid))

1. MAE for dropped columns with missing values: 
17837.82570776256
2. MAE for Imputation with mean: 
18062.894611872147
3. MAE for Imputation with median: 
17791.59899543379
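The choice can also be made programmatically. The sketch below copies the three scores printed above into a dictionary (the keys are illustrative labels, not names from the notebook) and picks the lowest MAE.

```python
# MAE values copied from the output above
mae_by_approach = {
    "drop columns": 17837.82570776256,
    "mean imputation": 18062.894611872147,
    "median imputation": 17791.59899543379,
}

# Lower MAE is better
best = min(mae_by_approach, key=mae_by_approach.get)
print(best, mae_by_approach[best])
```

Median imputation gives the lowest validation MAE on this split, which is why it is used for the test predictions.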


## Preprocess test data and prediction

In [32]:
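# Define and fit model on the median-imputed training data
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(final_X_train, y_train)

# Preprocess test data with the imputer fitted on the training data;
# restore column names so they match the features the model was fitted on
final_X_test = pd.DataFrame(final_imputer.transform(X_test),
                            columns=X_test.columns, index=X_test.index)

# Get test predictions
preds_test = model.predict(final_X_test)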



## Save the results to a CSV file 

In [34]:
output = pd.DataFrame({'Id': X_test.index,
                      'SalePrice': preds_test})
output.to_csv('submission2.csv', index=False)