**Q. How do you handle missing data in a dataset, and why is it important?**

**Ans.**

**Techniques for handling missing data:**

We can delete the rows or columns that contain missing data.

If we don't want to delete rows or columns, we can fill the missing values with the mean, median, or mode of the column. We can also use `KNNImputer`, which fills each missing value from the nearest data points.
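As a minimal sketch of the deletion and simple-fill options on a toy DataFrame (the column names `id`, `age`, and `city` and all values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: one complete column and two with gaps.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [25.0, np.nan, 35.0, 40.0],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

# Option 1: deletion.
rows_dropped = toy.dropna(axis=0)  # keep only rows with no missing value
cols_dropped = toy.dropna(axis=1)  # keep only columns with no missing value

# Option 2: simple imputation.
filled = toy.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())      # numeric: median
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])  # categorical: mode

print(rows_dropped.shape, cols_dropped.shape)
print(filled)
```

Deletion is the simplest choice but discards information: here it removes half of the rows, or two of the three columns.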

**Importance of handling missing data:**

Missing data can compromise the quality and reliability of your analysis.

It can lead to biased results, inaccurate conclusions, and reduced model performance.
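To make the bias concrete, here is a small simulation on synthetic numbers (not the housing data). It assumes, purely for illustration, that large values are more likely to go missing; the mean of what remains then underestimates the true mean.

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.normal(loc=50.0, scale=10.0, size=10_000)  # synthetic "true" values

# Assumption for illustration: values above 60 go missing half of the time.
is_missing = (income > 60.0) & (rng.random(income.size) < 0.5)
observed = income[~is_missing]

true_mean = income.mean()
observed_mean = observed.mean()
print(f"true mean: {true_mean:.2f}, mean of observed values: {observed_mean:.2f}")
```

Dropping the incomplete rows, or filling them with the observed mean, would carry this underestimate into any later analysis.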

# Practical

### Importing required libraries

In [47]:
from sklearn.datasets import fetch_california_housing
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error, r2_score

In [2]:
data = fetch_california_housing(as_frame=True)
df = data.frame

In [3]:
df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [4]:
df.isna().sum()

MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64

No column contains any null values, so let's introduce some artificially.

In [6]:
missing_columns = ["AveRooms", "AveBedrms", "Population"]
# Blank out a fixed 20% sample of the rows in each of the chosen columns.
missing_idx = df.sample(frac=0.2, random_state=42).index
df.loc[missing_idx, missing_columns] = np.nan

In [7]:
df.isna().sum()

MedInc            0
HouseAge          0
AveRooms       4128
AveBedrms      4128
Population     4128
AveOccup          0
Latitude          0
Longitude         0
MedHouseVal       0
dtype: int64

Now 20% of the values (4,128 of 20,640 rows) are missing in each of the three columns.

Let's try imputing them with various techniques and compare the results.
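The cell above blanks out the same rows in all three columns, because one fixed `random_state` determines the sample. If independent missingness per column is preferred, one possible sketch draws a fresh set of rows for each column. It uses a small stand-in DataFrame whose column names `a`, `b`, and `c` are placeholders, not the housing columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Stand-in data; the column names are placeholders for illustration.
toy = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["a", "b", "c"])

for col in ["a", "b"]:
    # Draw a fresh 20% sample of row labels for each column.
    missing_idx = rng.choice(toy.index, size=int(0.2 * len(toy)), replace=False)
    toy.loc[missing_idx, col] = np.nan

print(toy.isna().sum())
# Count the rows that happen to be missing in both columns.
both_missing = int((toy["a"].isna() & toy["b"].isna()).sum())
print(both_missing)
```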

### Using mean imputer

In [11]:
mean_imputer = SimpleImputer(strategy="mean")
df_mean_imputed = df.copy()
col_names = df.columns
df_mean_imputed[col_names] = mean_imputer.fit_transform(df[col_names])

In [12]:
df_mean_imputed

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 5.435235 | 1.096685 | 1426.453004 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.000000 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.000000 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.435235 | 1.096685 | 1426.453004 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.000000 | 2.181467 | 37.85 | -122.25 | 3.422 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.435235 | 1.096685 | 1426.453004 | 2.560606 | 39.48 | -121.09 | 0.781 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.000000 | 3.122807 | 39.49 | -121.21 | 0.771 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.000000 | 2.325635 | 39.43 | -121.22 | 0.923 |
| 20638 | 1.8672 | 18.0 | 5.435235 | 1.096685 | 1426.453004 | 2.123209 | 39.43 | -121.32 | 0.847 |


### Splitting dataset into train and test sets

In [15]:
X = df_mean_imputed.drop(["MedHouseVal"], axis=1)
y = df_mean_imputed["MedHouseVal"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
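The cells above fit the imputer on the full DataFrame (target column included) before splitting, so statistics from the test rows influence the filled-in values. A common alternative is to split first and let the imputer learn from the training rows only. The following is a minimal sketch on synthetic data, assuming a scikit-learn pipeline is acceptable here; it is not the notebook's own method, and its scores are not comparable with the ones reported in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))                         # synthetic features
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=500)   # synthetic target
X[rng.random(X.shape) < 0.2] = np.nan                 # remove ~20% of the entries

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The imputer learns its column means from X_train only;
# the same means are reused when the pipeline sees X_test.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestRegressor(n_estimators=50, random_state=42),
)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(f"R2 on held-out rows: {score:.3f}")
```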

### Training Random Forest Regressor

In [18]:
forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(X_train, y_train)

In [20]:
y_preds = forest_reg.predict(X_test)

In [24]:
forest_rmse = root_mean_squared_error(y_test, y_preds)
forest_r2 = r2_score(y_test, y_preds)

In [29]:
heading = "Metrics for Mean Imputer"
print(heading)
print("="*len(heading))
print(f"RMSE: {forest_rmse}")
print(f"R2 Score: {forest_r2}")

Metrics for Mean Imputer
RMSE: 0.544265415633582
R2 Score: 0.7739447397159552
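The same impute, split, fit, and score steps are needed for every imputer being compared. One way to avoid repeating them is a small helper. `evaluate_imputer` is a hypothetical name, and the example runs on synthetic data rather than the housing frame. Like the cells above, it imputes before splitting. RMSE is computed as the square root of `mean_squared_error`, which is the same quantity that `root_mean_squared_error` returns.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_imputer(imputer, X, y, seed=42):
    """Impute X, train a random forest, and return (rmse, r2) on a held-out split."""
    X_imputed = imputer.fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(
        X_imputed, y, test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, preds)))
    return rmse, float(r2_score(y_test, preds))


# Synthetic stand-in data with ~20% of the entries removed.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = X.sum(axis=1)
X[rng.random(X.shape) < 0.2] = np.nan

results = {
    "mean": evaluate_imputer(SimpleImputer(strategy="mean"), X, y),
    "knn": evaluate_imputer(KNNImputer(), X, y),
}
for name, (rmse, r2) in results.items():
    print(f"{name}: RMSE={rmse:.3f}, R2={r2:.3f}")
```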


### Using KNN Imputer

In [32]:
knn_imputer = KNNImputer()
df_knn_imputed = df.copy()
df_knn_imputed[col_names] = knn_imputer.fit_transform(df[col_names])
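To see what `KNNImputer` does to a single value, a tiny hand-checkable matrix helps. The sketch below uses `n_neighbors=2` so the arithmetic is easy to follow; the cell above relies on the default of 5 neighbours.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
# Row 0, column 2 becomes 4.0: the average of 3.0 and 5.0 from its two nearest rows.
# Row 2, column 0 becomes 5.5: the average of 3.0 and 8.0 from its two nearest rows.
```

Distances are computed only over the features that both rows have present, which is why rows with gaps can still act as neighbours.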

In [33]:
X = df_knn_imputed.drop(["MedHouseVal"], axis=1)
y = df_knn_imputed["MedHouseVal"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [34]:
forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(X_train, y_train)

In [35]:
y_preds = forest_reg.predict(X_test)

In [37]:
forest_rmse = root_mean_squared_error(y_test, y_preds)
forest_r2 = r2_score(y_test, y_preds)

In [39]:
heading = "Metrics for KNN Imputer"
print(heading)
print("="*len(heading))
print(f"RMSE: {forest_rmse}")
print(f"R2 Score: {forest_r2}")

Metrics for KNN Imputer
RMSE: 0.5148864557609095
R2 Score: 0.7976905937550018


### Using KNN Imputer with weights='distance'

In [41]:
knn_imputer = KNNImputer(weights='distance')
df_knn_imputed = df.copy()
df_knn_imputed[col_names] = knn_imputer.fit_transform(df[col_names])
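The effect of `weights='distance'` is easiest to see on a toy matrix where one row has a single gap. In this sketch (values invented for illustration) the two nearest donors hold 20 and 10, and the closer one is half as far away as the other.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Feature 0 is complete; feature 1 is missing for the last row.
X = np.array([
    [0.0, 10.0],
    [1.0, 20.0],
    [5.0, 50.0],
    [2.0, np.nan],
])

uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)[3, 1]
distance = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)[3, 1]

# The two nearest donors are the rows with feature 0 equal to 1.0 and 0.0.
print(uniform)   # 15.0: plain average of 20 and 10
print(distance)  # about 16.67: the closer donor (20) gets twice the weight
```

With uniform weights every neighbour counts equally; with distance weights each neighbour is weighted by the inverse of its distance, so closer rows pull the filled value towards themselves.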

In [42]:
X = df_knn_imputed.drop(["MedHouseVal"], axis=1)
y = df_knn_imputed["MedHouseVal"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [43]:
forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(X_train, y_train)

In [44]:
y_preds = forest_reg.predict(X_test)

In [45]:
forest_rmse = root_mean_squared_error(y_test, y_preds)
forest_r2 = r2_score(y_test, y_preds)

In [49]:
heading = "Metrics for KNN Imputer with weights: 'distance'"
print(heading)
print("="*len(heading))
print(f"RMSE: {forest_rmse}")
print(f"R2 Score: {forest_r2}")

Metrics for KNN Imputer with weights: 'distance'
RMSE: 0.5147548326176598
R2 Score: 0.7977940153775995


### Drawing conclusions

In [50]:
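# 1. KNN Imputer with weights='distance' did marginally better than the default weights='uniform' (RMSE 0.5148 vs 0.5149), a difference too small to be meaningful on a single split.
# 2. Mean Imputer is simpler and cheaper to compute, but scored slightly worse than KNN Imputer (RMSE 0.5443 vs 0.5149).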

### Advantages and Disadvantages of KNN Imputer

**Advantages:**
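1. Usually more accurate than mean imputation, because each missing value is filled from the most similar rows (nearest neighbours) rather than from a single column-wide statistic.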

**Disadvantages:**
1. More computation, since distances must be calculated between a point and the other points in the dataset.
2. The training dataset must be kept available (for example, stored on the server) in order to fill missing values for a new data point, which costs storage.
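The second disadvantage can be seen in a small sketch with made-up numbers: the imputer is fitted once on the training rows and must then search those same rows whenever a new point arrives.

```python
import numpy as np
from sklearn.impute import KNNImputer

train = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [10.0, 100.0],
])

imputer = KNNImputer(n_neighbors=2)
imputer.fit(train)  # the fitted object keeps the training rows to search later

new_point = np.array([[2.4, np.nan]])
filled = imputer.transform(new_point)
print(filled)  # second value is 25.0, the average of the two nearest rows (20 and 30)
```

Because the fitted imputer holds the training rows, saving it for deployment saves that data along with it, whereas a mean imputer only needs one number per column.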