# 🤖 Imputing Missing Values (NaN) Using Machine Learning Algorithms

Missing values (**NaN**) are common in real-world datasets. Instead of simply dropping rows/columns or filling with mean/median, we can use **machine learning algorithms** to make smarter imputations by learning from the patterns in the data.

## 🔹 1. K-Nearest Neighbors (KNN) Imputation
- Uses similarity between data points to estimate missing values.  
- For each sample with a missing value:
  1. Find its **k nearest neighbors** based on distance (e.g., Euclidean).  
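  2. Replace the missing value with the **mean** of its neighbors' values (numerical feature) or their **majority vote** (categorical feature).  
- **Pros**: Captures local structure in the data.  
- **Cons**: Computationally expensive on large datasets, and sensitive to feature scaling because distances are dominated by features with large ranges.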
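
The two steps above can be sketched in plain NumPy. This is a minimal illustration, not the implementation used by any particular library: the helper name `knn_impute_column` and the toy array are hypothetical, and the sketch assumes that only one column has missing values while all other features are fully observed.

```python
import numpy as np

def knn_impute_column(X, col, k=2):
    """Fill NaNs in column `col` with the mean of that column over the k nearest rows."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    others = [j for j in range(X.shape[1]) if j != col]
    donors = np.flatnonzero(~np.isnan(X[:, col]))   # rows where `col` is observed
    for i in np.flatnonzero(np.isnan(X[:, col])):   # rows where `col` is missing
        # Step 1: Euclidean distance on the remaining features
        # (assumed to be fully observed in this sketch).
        dist = np.linalg.norm(X[donors][:, others] - X[i, others], axis=1)
        nearest = donors[np.argsort(dist)[:k]]
        # Step 2: replace the missing value with the neighbors' mean.
        out[i, col] = X[nearest, col].mean()
    return out

# Hypothetical toy data: the last row is closest to the first two rows.
X = np.array([
    [1.0, 1.0, 10.0],
    [1.1, 0.9, 12.0],
    [5.0, 5.0, 50.0],
    [5.2, 4.8, 54.0],
    [1.0, 1.1, np.nan],
])
filled = knn_impute_column(X, col=2, k=2)
print(filled[4, 2])  # mean of the two nearest donors, 10.0 and 12.0
```

Because the distance is unscaled, a feature with a much larger range would dominate the neighbor search; standardizing the features first is a common remedy.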

## 🔹 2. Random Forest Imputation
- Uses an **ensemble of decision trees** to predict missing values.  
- For each feature with missing values:
  1. Treat the feature as the **target variable**.  
  2. Use all other features as predictors to train a Random Forest model.  
  3. Predict missing entries with the trained model.  
- **Pros**: Handles both linear and nonlinear relationships well.  
- **Cons**: More complex and slower to train than simple fills such as the mean or median.

## 🔹 3. Expectation-Maximization (EM) Algorithm
- A **probabilistic method** that estimates missing data iteratively:
  1. **E-step (Expectation):** Compute the expected values of the missing entries, given the observed data and the current parameter estimates.  
  2. **M-step (Maximization):** Re-estimate the parameters (e.g., mean and covariance) from the completed dataset.  
  3. Repeat until the estimates converge.  
- **Pros**: Statistically principled; exploits the correlations between features.  
- **Cons**: Requires distributional assumptions (e.g., multivariate normality) and can be slow to converge.
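
To make the E-step and M-step concrete, here is a minimal NumPy sketch that assumes the data follow a multivariate normal distribution. The function name `em_impute`, its parameters, and the synthetic data are hypothetical and chosen for illustration; this is not necessarily how `ycimpute` implements EM.

```python
import numpy as np

def em_impute(X, n_iter=100, tol=1e-6):
    """EM imputation sketch under a multivariate normal assumption."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    missing = np.isnan(X)
    mu = np.nanmean(X, axis=0)                  # initial mean from observed entries
    filled = np.where(missing, mu, X)           # start from a mean fill
    sigma = np.cov(filled, rowvar=False, bias=True)
    for _ in range(n_iter):
        extra_cov = np.zeros((p, p))            # conditional covariance of missing parts
        # E-step: conditional expectation of the missing entries in each row.
        for i in np.flatnonzero(missing.any(axis=1)):
            m, o = missing[i], ~missing[i]
            if not o.any():                     # nothing observed in this row
                filled[i] = mu
                extra_cov += sigma
                continue
            s_mo = sigma[np.ix_(m, o)]
            coef = s_mo @ np.linalg.pinv(sigma[np.ix_(o, o)])
            filled[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
            extra_cov[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ s_mo.T
        # M-step: update mean and covariance from the completed data.
        mu_new = filled.mean(axis=0)
        centered = filled - mu_new
        sigma_new = (centered.T @ centered + extra_cov) / n
        change = max(np.abs(mu_new - mu).max(), np.abs(sigma_new - sigma).max())
        mu, sigma = mu_new, sigma_new
        if change < tol:                        # stop once the estimates settle
            break
    return filled

# Synthetic example: the second column depends strongly on the first.
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
x1 = 2.0 * x0 + rng.normal(scale=0.1, size=200)
truth = np.column_stack([x0, x1])
data = truth.copy()
holes = rng.random(200) < 0.2                   # hide roughly 20% of the second column
data[holes, 1] = np.nan
completed = em_impute(data)
```

Since the imputed values come from the conditional mean, they follow the linear relationship between the columns, which a plain mean fill would ignore.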

---


#### » Load the Titanic dataset and keep only the numerical columns

In [1]:
import seaborn as sns
titanic = sns.load_dataset("titanic")
titanic = titanic.select_dtypes(include=["float64", "int64"])
df = titanic.copy()
df.head()

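| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1 |
| 4 | 0 | 3 | 35.0 | 0 | 0 | 8.05 |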


#### » Display number of NaN values in each column

In [2]:
df.isnull().sum()

survived      0
pclass        0
age         177
sibsp         0
parch         0
fare          0
dtype: int64
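
Alongside the raw counts, the share of missing values per column is often easier to interpret. A minimal sketch on a hypothetical toy frame (the values below are made up; only the column names echo the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two of the dataset's column names.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# isna() yields booleans; their column-wise mean is the fraction of missing values.
missing_share = toy.isna().mean()
print(missing_share)
```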

## 1. K-Nearest Neighbors (KNN) Imputation Method

#### » Save the column names, then convert the DataFrame to a NumPy array (the KNN imputer expects an array)

In [3]:
import numpy as np
var_names = list(df)
np_df = np.array(df)
np_df

array([[ 0.    ,  3.    , 22.    ,  1.    ,  0.    ,  7.25  ],
       [ 1.    ,  1.    , 38.    ,  1.    ,  0.    , 71.2833],
       [ 1.    ,  3.    , 26.    ,  0.    ,  0.    ,  7.925 ],
       ...,
       [ 0.    ,  3.    ,     nan,  1.    ,  2.    , 23.45  ],
       [ 1.    ,  1.    , 26.    ,  0.    ,  0.    , 30.    ],
       [ 0.    ,  3.    , 32.    ,  0.    ,  0.    ,  7.75  ]])

#### » Apply the method using the desired neighbor count (k)

In [4]:
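# Compatibility shim: NumPy 2.0 removed the `np.infty` alias, which ycimpute appears to rely on
np.infty = np.inf
from ycimpute.imputer import knnimput
dff = knnimput.KNN(k=4).complete(np_df)  # k=4 neighbors; returns the completed NumPy array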

Imputing row 1/891 with 0 missing, elapsed time: 0.060
Imputing row 101/891 with 0 missing, elapsed time: 0.061
Imputing row 201/891 with 0 missing, elapsed time: 0.061
Imputing row 301/891 with 1 missing, elapsed time: 0.061
Imputing row 401/891 with 0 missing, elapsed time: 0.061
Imputing row 501/891 with 0 missing, elapsed time: 0.062
Imputing row 601/891 with 0 missing, elapsed time: 0.062
Imputing row 701/891 with 0 missing, elapsed time: 0.062
Imputing row 801/891 with 0 missing, elapsed time: 0.062


In [5]:
type(dff)

numpy.ndarray

#### » Convert the array to a dataframe

In [6]:
import pandas as pd
dff = pd.DataFrame(dff,columns=var_names)
type(dff)

pandas.core.frame.DataFrame

In [7]:
dff

| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 3.0 | 22.000000 | 1.0 | 0.0 | 7.2500 |
| 1 | 1.0 | 1.0 | 38.000000 | 1.0 | 0.0 | 71.2833 |
| 2 | 1.0 | 3.0 | 26.000000 | 0.0 | 0.0 | 7.9250 |
| 3 | 1.0 | 1.0 | 35.000000 | 1.0 | 0.0 | 53.1000 |
| 4 | 0.0 | 3.0 | 35.000000 | 0.0 | 0.0 | 8.0500 |
| ... | ... | ... | ... | ... | ... | ... |
| 886 | 0.0 | 2.0 | 27.000000 | 0.0 | 0.0 | 13.0000 |
| 887 | 1.0 | 1.0 | 19.000000 | 0.0 | 0.0 | 30.0000 |
| 888 | 0.0 | 3.0 | 26.026414 | 1.0 | 2.0 | 23.4500 |
| 889 | 1.0 | 1.0 | 26.000000 | 0.0 | 0.0 | 30.0000 |


In [8]:
dff.isnull().sum()

survived    0
pclass      0
age         0
sibsp       0
parch       0
fare        0
dtype: int64
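
As an alternative to `ycimpute`, scikit-learn ships a `KNNImputer` that works directly on DataFrames and does not need the `np.infty` workaround. The sketch below swaps in that class on a hypothetical toy frame (column names `a`, `b`, `c` are made up), so its results will not necessarily match the `ycimpute` output above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy frame with one missing value in "a" and one in "c".
toy = pd.DataFrame(
    [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]],
    columns=["a", "b", "c"],
)

# KNNImputer uses a NaN-aware Euclidean distance and averages the neighbors' values.
imputer = KNNImputer(n_neighbors=2)
toy_filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(toy_filled)
```

Wrapping the result back into a DataFrame keeps the column names, so the separate `var_names` bookkeeping is not needed in this variant.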

## 2. Random Forest Imputation

#### » Restore the original DataFrame

In [9]:
df = titanic.copy()
df.isnull().sum()

survived      0
pclass        0
age         177
sibsp         0
parch         0
fare          0
dtype: int64

#### » Split the DataFrame into rows with a known age and rows with a missing age

In [10]:
df_known = df[df['age'].notnull()]
df_unknown = df[df['age'].isnull()]
df_known

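| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 |
| 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 |
| 4 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 |
| ... | ... | ... | ... | ... | ... | ... |
| 885 | 0 | 3 | 39.0 | 0 | 5 | 29.1250 |
| 886 | 0 | 2 | 27.0 | 0 | 0 | 13.0000 |
| 887 | 1 | 1 | 19.0 | 0 | 0 | 30.0000 |
| 889 | 1 | 1 | 26.0 | 0 | 0 | 30.0000 |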


In [11]:
df_unknown

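| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| 5 | 0 | 3 | NaN | 0 | 0 | 8.4583 |
| 17 | 1 | 2 | NaN | 0 | 0 | 13.0000 |
| 19 | 1 | 3 | NaN | 0 | 0 | 7.2250 |
| 26 | 0 | 3 | NaN | 0 | 0 | 7.2250 |
| 28 | 1 | 3 | NaN | 0 | 0 | 7.8792 |
| ... | ... | ... | ... | ... | ... | ... |
| 859 | 0 | 3 | NaN | 0 | 0 | 7.2292 |
| 863 | 0 | 3 | NaN | 8 | 2 | 69.5500 |
| 868 | 0 | 3 | NaN | 0 | 0 | 9.5000 |
| 878 | 0 | 3 | NaN | 0 | 0 | 7.8958 |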


#### » Train a random forest on the rows whose age is known

In [12]:
from sklearn.ensemble import RandomForestRegressor

x = df_known.drop(['age'], axis=1)
y = df_known['age']

regressor = RandomForestRegressor(n_estimators=100, random_state=42, oob_score=True)
regressor.fit(x, y)
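
The regressor is created with `oob_score=True`, but the score is never read. After fitting, scikit-learn exposes it as the `oob_score_` attribute, which for a regressor is the R² computed on out-of-bag samples and gives a rough idea of how well the other features predict the target. A minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the Titanic features):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: the target depends mostly on the first feature.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=42, oob_score=True)
forest.fit(X_demo, y_demo)

# R^2 estimated from the samples each tree did not see during training.
print(round(forest.oob_score_, 3))
```

A low out-of-bag score would suggest that the imputed values are only weakly informed by the predictors.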

#### » Apply the algorithm to predict ages

In [13]:
X_pred = df_unknown.drop(['age'], axis=1)
pre_ages = regressor.predict(X_pred)
pre_ages

array([23.94872727, 33.47462571, 18.421     , 34.52981349, 22.78747619,
       27.63197569, 35.58506667, 22.30313889, 16.9745    , 27.63197569,
       31.41960196, 34.50933333, 22.30313889, 23.26333333, 39.65      ,
       39.18583333, 14.6568    , 27.63197569, 31.41960196, 23.233     ,
       31.41960196, 31.41960196, 27.63197569, 22.63188889, 30.10113187,
       31.41960196, 40.32982541, 13.96955   , 31.28016667, 30.05747556,
       25.30996374, 11.9132417 , 24.99133333, 58.0955873 ,  8.43706421,
       11.9132417 , 33.25      , 57.84      , 25.52416667, 40.32982541,
       22.30313889, 11.9132417 , 36.11547901, 27.63197569,  8.43706421,
       36.30157143, 23.85113294, 25.52416667, 30.05747556, 35.21166667,
       40.32982541, 40.32982541, 52.41833333, 22.30313889, 36.11517599,
       59.1155873 , 39.18583333, 37.05428571, 22.30313889, 26.83113095,
       31.47235073, 31.41960196, 29.25085714, 11.9132417 , 26.83113095,
       31.42333333, 27.63197569, 26.17866667, 60.52      , 34.52

#### » Fill the NaN values with the predicted ages 

In [14]:
df.loc[df["age"].isnull(), "age"] = pre_ages
df.isnull().sum()

survived    0
pclass      0
age         0
sibsp       0
parch       0
fare        0
dtype: int64
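
The split, train, predict, and fill steps above can be wrapped in one reusable function. This is a sketch under the same assumptions as the walkthrough (a numerical target and no missing values in the predictor columns); the name `impute_with_forest` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_with_forest(frame, target, n_estimators=100, random_state=42):
    """Fill NaNs in `target` using a random forest trained on the other columns."""
    missing = frame[target].isna()
    if not missing.any():
        return frame.copy()
    features = frame.drop(columns=[target])   # predictors, assumed free of NaN
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
    model.fit(features[~missing], frame.loc[~missing, target])
    result = frame.copy()                      # leave the input frame untouched
    result.loc[missing, target] = model.predict(features[missing])
    return result

# Hypothetical toy data: "y" depends on "x1"; every tenth value of "y" is hidden.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
toy["y"] = 2.0 * toy["x1"] + rng.normal(scale=0.1, size=100)
toy.loc[toy.index[::10], "y"] = np.nan

toy_filled = impute_with_forest(toy, "y")
print(toy_filled["y"].isna().sum())
```

Returning a copy keeps the original frame available for comparing methods, in the same way the walkthrough repeatedly calls `titanic.copy()`.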

## 3. Expectation-Maximization (EM) Algorithm

#### » Copy the original DataFrame

In [15]:
df = titanic.copy()
df.isnull().sum()

survived      0
pclass        0
age         177
sibsp         0
parch         0
fare          0
dtype: int64

#### » Save the column names, then convert the DataFrame to a NumPy array (the EM imputer expects an array)

In [20]:
var_names = list(df)
np_df = np.array(df)

#### » Apply the method

In [21]:
from ycimpute.imputer import EM
dff = EM().complete(np_df)
dff

array([[ 0.    ,  3.    , 22.    ,  1.    ,  0.    ,  7.25  ],
       [ 1.    ,  1.    , 38.    ,  1.    ,  0.    , 71.2833],
       [ 1.    ,  3.    , 26.    ,  0.    ,  0.    ,  7.925 ],
       ...,
       [ 0.    ,  3.    ,  0.    ,  1.    ,  2.    , 23.45  ],
       [ 1.    ,  1.    , 26.    ,  0.    ,  0.    , 30.    ],
       [ 0.    ,  3.    , 32.    ,  0.    ,  0.    ,  7.75  ]])

#### » Convert the array to a dataframe

In [17]:
dff = pd.DataFrame(dff,columns=var_names)
dff.isnull().sum()

survived    0
pclass      0
age         0
sibsp       0
parch       0
fare        0
dtype: int64
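
A zero NaN count only shows that every cell was filled, not that the fills are plausible. In the EM array preview above, the row whose age was `nan` in the input (third from the end) now shows `0.`, so a quick sanity check is worthwhile. The helper below is a hypothetical sketch (the name `imputation_report` and the toy frames are made up) that compares the imputed values with the range of the observed ones.

```python
import numpy as np
import pandas as pd

def imputation_report(original, imputed, column):
    """Summarize how the imputed values of `column` compare with the observed ones."""
    was_missing = original[column].isna()
    observed = original.loc[~was_missing, column]
    new_values = imputed.loc[was_missing, column]
    outside = (new_values < observed.min()) | (new_values > observed.max())
    return {
        "n_imputed": int(was_missing.sum()),
        "observed_range": (float(observed.min()), float(observed.max())),
        "n_outside_observed_range": int(outside.sum()),
        "n_distinct_imputed": int(new_values.nunique()),  # 1 hints at a constant fill
    }

# Hypothetical toy frames: both missing ages were filled with 0.0.
before = pd.DataFrame({"age": [22.0, np.nan, 26.0, np.nan, 35.0]})
after = pd.DataFrame({"age": [22.0, 0.0, 26.0, 0.0, 35.0]})
report = imputation_report(before, after, "age")
print(report)
```

In the notebook, the same check could be run with `titanic` as the original frame and each method's `dff` or `df` as the imputed one, assuming both share the same index.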