## Numeric Data

For numeric data, the two most common and straightforward approaches are to fill in the missing values with the mean or the median of the observed (non-missing) values. These are safe choices because both statistics lie near the center of the observed data, so the filled-in values are plausible.

Note that outliers affect these statistics, the mean in particular, so decide deliberately whether to compute the fill value before or after handling outliers.
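
As a rough illustration of why the order matters, the sketch below compares the mean and median of a small, made-up list of ages before and after dropping one extreme value. The numbers and the cutoff of 120 are assumptions chosen for the example, not values from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical ages: one missing value and one extreme value (e.g. a data-entry error)
ages = pd.Series([22.0, 26.0, 28.0, 35.0, 38.0, np.nan, 400.0])

# pandas skips NaN by default when computing mean and median
mean_raw, median_raw = ages.mean(), ages.median()

# Drop the extreme value (assumed cutoff), keeping the missing entry for later imputation
cleaned = ages[ages.isna() | (ages <= 120)]
mean_clean, median_clean = cleaned.mean(), cleaned.median()

print(f"before: mean={mean_raw:.1f}, median={median_raw:.1f}")
print(f"after:  mean={mean_clean:.1f}, median={median_clean:.1f}")
```

In this toy case the mean moves far more than the median once the extreme value is removed, which is one reason the median is often preferred when outliers have not been handled yet.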

The scikit-learn class [`sklearn.impute.SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) is commonly used for this task.

Take the `Age` column in the Titanic data as an example. In this dataset, `train.csv` has $891 - 714 = 177$ missing values and `test.csv` has $418 - 332 = 86$ missing values.

In [99]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 

# base path, defined once so it can be reused for every file
titanic_path = "/Users/charles/MLE/Pre_Data/EDA/Data/"

df_train = pd.read_csv(titanic_path + "Titanic_train.csv")
df_test = pd.read_csv(titanic_path + "Titanic_test.csv")

In [100]:
df_train[["Age"]].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Age     714 non-null    float64
dtypes: float64(1)
memory usage: 7.1 KB


In [101]:
df_test[["Age"]].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Age     332 non-null    float64
dtypes: float64(1)
memory usage: 3.4 KB
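
The missing counts above come from subtracting the non-null count from the number of rows. A minimal sketch of getting the same figure directly, using a small made-up frame (the `toy` name and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical five-row frame with two missing ages
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, 35.0]})

n_rows = len(toy)                    # total number of rows
n_present = toy["Age"].count()       # non-null count, as reported by info()
n_missing = toy["Age"].isna().sum()  # missing count, computed directly

print(n_rows, n_present, n_missing)
```

Applied to the real data, the same `isna().sum()` call on `df_train["Age"]` should agree with the subtraction shown earlier.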


Another point worth noting is that the fill value is computed from the training data only, in this case `train.csv`, so that no information from the test set leaks into preprocessing.

When filling in missing values in `test.csv`, we reuse the value obtained from `train.csv`.

In [102]:
from sklearn.impute import SimpleImputer

# Create a SimpleImputer that fills missing values with the median
imputer_num = SimpleImputer(strategy="median")
# Learn the median from the training set only
# (note: pass a one-column DataFrame, df[["Age"]], not a Series, df["Age"])
imputer_num.fit(df_train[["Age"]])

# Store the filled ages in a new column; the test set is filled with the training median
df_train[["Imputed_Age"]] = imputer_num.transform(df_train[["Age"]])
df_test[["Imputed_Age"]] = imputer_num.transform(df_test[["Age"]])

In [103]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Imputed_Age
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,22.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,38.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,26.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,35.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,35.0


In [104]:
df_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Imputed_Age
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,34.5
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,47.0
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,62.0
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,27.0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,22.0


The code above calculates the median (`strategy='median'`) from the non-missing values in the training set and then uses it to fill both sets. The `NaN` values in the `Age` column are filled with a value close to $28.0$ in the `Imputed_Age` column. You can also try other `strategy` values to see which gives the best results. Remember that there is no single right way to impute values for all types of data; you need to understand the data to choose the option you think will work best.
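
To compare strategies, one option is to fit one imputer per strategy and inspect the learned fill value in its `statistics_` attribute. The sketch below does this on a small made-up training frame; the `train` and `test` frames and their values are assumptions, not the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and test ages
train = pd.DataFrame({"Age": [22.0, 22.0, 26.0, 35.0, 60.0, np.nan]})
test = pd.DataFrame({"Age": [np.nan, 40.0]})

# Fit one imputer per strategy on the training set and record its fill value
fill_values = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    imputer.fit(train[["Age"]])
    fill_values[strategy] = float(imputer.statistics_[0])

print(fill_values)

# The test set is filled with the value learned from the training set
median_imputer = SimpleImputer(strategy="median").fit(train[["Age"]])
test_filled = median_imputer.transform(test[["Age"]])
print(test_filled.ravel())
```

Which fill value works best depends on the downstream model, so a comparison like this is only a starting point; the final choice is usually made by validating model performance.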

If you have more time, you can impute more carefully, for example by using a different fill value for each gender category.
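
A minimal sketch of per-group imputation with plain pandas follows. It assumes a `Sex` column like the one in the Titanic data; the toy frames, the helper name `impute_age_by_sex`, and the fallback to the overall median are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical training and test frames
train = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [30.0, 40.0, np.nan, 20.0, 24.0, np.nan],
})
test = pd.DataFrame({"Sex": ["female", "male"], "Age": [np.nan, np.nan]})

# Per-group medians are computed from the training set only
group_medians = train.groupby("Sex")["Age"].median()
# Fallback for a category that never appears in the training set
overall_median = train["Age"].median()

def impute_age_by_sex(df):
    """Fill missing ages with the training median of the row's Sex group."""
    fill = df["Sex"].map(group_medians).fillna(overall_median)
    return df["Age"].fillna(fill)

train["Imputed_Age"] = impute_age_by_sex(train)
test["Imputed_Age"] = impute_age_by_sex(test)
print(test)
```

As with the single-value case, the group medians are learned on the training set and then reused unchanged on the test set.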

## Category Data


With categorical data, the mean cannot be calculated, so the usual approach is either to fill in the value that occurs most often (`strategy='most_frequent'`) or to treat the missing value itself as a special category (`strategy='constant'`), with the replacement value passed through the `fill_value` parameter of [`sklearn.impute.SimpleImputer`](https://scikit-learn.org/stable/modules/generated/).
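
The `strategy='constant'` option can be sketched as follows on a made-up `Cabin` column. The label `"Missing"` and the toy frames are assumptions for illustration; any value that does not collide with a real category would do.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical cabin columns with missing entries (object dtype, NaN as the missing marker)
train = pd.DataFrame({"Cabin": ["C85", np.nan, "C123", np.nan]}, dtype=object)
test = pd.DataFrame({"Cabin": [np.nan, "B45"]}, dtype=object)

# Treat "missing" as its own category; "Missing" is an assumed placeholder label
imputer = SimpleImputer(strategy="constant", fill_value="Missing")
imputer.fit(train[["Cabin"]])

train_filled = imputer.transform(train[["Cabin"]])
test_filled = imputer.transform(test[["Cabin"]])
print(train_filled.ravel())
print(test_filled.ravel())
```

This variant can be a reasonable choice when the fact that a value is missing may itself carry information, which the most-frequent fill would hide.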

In [105]:
df_train[["Cabin"]].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Cabin   204 non-null    object
dtypes: object(1)
memory usage: 7.1+ KB


In [106]:
from sklearn.impute import SimpleImputer

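# Fill missing cabins with the most frequent cabin in the training set
imputer_cat = SimpleImputer(strategy="most_frequent")
imputer_cat.fit(df_train[["Cabin"]])
# Apply the training-set fill value to both sets
df_train[["Imputed_Cabin"]] = imputer_cat.transform(df_train[["Cabin"]])
df_test[["Imputed_Cabin"]] = imputer_cat.transform(df_test[["Cabin"]])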

In [107]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Imputed_Age,Imputed_Cabin
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,22.0,B96 B98
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,38.0,C85
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,26.0,B96 B98
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,35.0,C123
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,35.0,B96 B98
