
# Tutorial 7b: Data Imputation

This covers:

* The deletion approach
    - Deleting the incomplete features
    - Deleting the incomplete instances

* pandas
    - Simple imputation using pandas
    - Interpolation imputation using pandas
    
* sklearn
    - Simple imputation using sklearn
    - KNN-based imputation using sklearn
    - Iterative imputation using sklearn

* Applying the learned models to incomplete test data

----

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

## Loading and exploring the data

In [None]:
import pandas as pd

titanic_data = pd.read_csv("https://www.openml.org/data/get_csv/16826755/phpMYEkMl.csv", na_values=['?']) 
titanic_data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


**Values considered “missing”**

Missing values can be represented in many ways, both in the dataset file and in `pandas`.

In the data file, missing values might be blank entries, '?', or some other marker that the data collectors agreed on to represent unobserved data.
In this case it is '?' -- knowing this, we tell `pandas` what to consider as missing values via `na_values=['?']`.

At the "other end", `pandas` has several in-memory markers for missing values. As can be seen above, "NaN" is the default marker, but missing values must be detectable in data of different types: floating point, integer, boolean, and general object. `None` and `NaT` (for datetimes) are therefore also treated as missing. When reading a file, strings such as "NA" and "N/A" are converted to NaN by default, whereas infinite values ((-)inf) are not considered missing unless you handle them explicitly.
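The difference between file-level markers and the in-memory markers of `pandas` can be checked directly. The following is a minimal sketch, assuming a tiny in-memory CSV (`csv_text` is made up for illustration and is not part of the Titanic data):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical two-column file that uses '?' for unobserved values.
csv_text = "age,cabin\n29,B5\n?,?\n2,C22\n"

# Without na_values, '?' is kept as an ordinary string.
raw = pd.read_csv(io.StringIO(csv_text))
# With na_values, '?' is read as a missing value.
parsed = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

print(raw["age"].tolist())
print(parsed.isna().sum())

# pandas treats NaN, None and NaT as missing; infinity is not missing by default.
markers = pd.Series([np.nan, None, pd.NaT, np.inf], dtype=object)
print(markers.isna().tolist())
```

Reading the file without `na_values` leaves the "age" column as text, which is why the marker has to be declared when loading.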


In [None]:

titanic_data.drop(['name','ticket', 'embarked', 'boat' ,'body' ,'home.dest'], axis=1, inplace=True)

Now we split the data into training and test subsets. **ONLY** the training data is used to fit the imputers; the fitted imputers are then applied to the test data.
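The train-only fitting idea can be illustrated on made-up toy arrays (not the Titanic data). This sketch assumes a mean imputer; the point is only that the statistic used to fill the test data comes from the training data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-feature data, invented for illustration.
X_train = np.array([[1.0], [3.0], [np.nan]])
X_test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)  # learns the mean from the training data only

# The test NaN is filled with the training mean, not the test mean.
X_test_filled = imputer.transform(X_test)
print(imputer.statistics_)
print(X_test_filled)
```

Calling `fit` on the test data instead would leak information about the test set into the preprocessing step.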

In [8]:
from sklearn.model_selection import train_test_split
y=titanic_data['survived']
X=titanic_data.drop(['survived'], axis=1)
X_titanic_train, X_titanic_test, y_titanic_train, y_titanic_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
y = titanic_data['survived']
X = titanic_data.drop(['survived'], axis=1)


X = pd.get_dummies(X, drop_first=True)


X = X.fillna(0)


X_titanic_train, X_titanic_test, y_titanic_train, y_titanic_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(random_state=42)
classifier.fit(X_titanic_train, y_titanic_train)



parameter,value
n_estimators,100
criterion,'gini'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,'sqrt'
max_leaf_nodes,None
min_impurity_decrease,0.0
bootstrap,True


# There is a problem: some features contain string values, namely "sex" and "cabin", so let's encode these features

In [None]:

import sklearn
!pip install -U scikit-learn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

The scikit-learn version is 1.7.2.





In [None]:

import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = {
    'Sex': ['male', 'female', 'female', 'male', np.nan],
    'Cabin': ['C85', 'E46', np.nan, 'G6', 'C103'],
    'Age': [22, 38, 26, 35, 28],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
    'Survived': [0, 1, 1, 1, 0]
}
titanic = pd.DataFrame(data)


X_titanic_train, X_titanic_test, y_titanic_train, y_titanic_test = train_test_split(
    titanic.drop(columns=['Survived']),
    titanic['Survived'],
    test_size=0.2,
    random_state=42
)


X_titanic_train_encoded = X_titanic_train.copy()
X_titanic_train_encoded.columns = X_titanic_train_encoded.columns.str.strip().str.lower()


print("Columns in the training set:", X_titanic_train_encoded.columns.tolist())


if 'sex' in X_titanic_train_encoded.columns:
    # Categories not seen during fitting are encoded as -1 at transform time
    encoder_sex = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
    X_titanic_train_encoded['sex'] = encoder_sex.fit_transform(
        X_titanic_train_encoded['sex'].astype(str).values.reshape(-1, 1)
    )
else:
    print("⚠️ Column 'sex' not found in the data!")


if 'cabin' in X_titanic_train_encoded.columns:
    encoder_cabin = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan)
    X_titanic_train_encoded['cabin'] = encoder_cabin.fit_transform(
        X_titanic_train_encoded['cabin'].astype(str).values.reshape(-1, 1)
    )

    # astype(str) turned missing cabins into the string 'nan'; map its code back to NaN
    cabin_nan_code = encoder_cabin.transform([['nan']])[0][0]
    X_titanic_train_encoded['cabin'] = X_titanic_train_encoded['cabin'].replace(cabin_nan_code, np.nan)
else:
    print("⚠️ Column 'cabin' not found in the data!")

# Fill the remaining NaN with -1 as a temporary placeholder so the classifier can be fitted
classifier = RandomForestClassifier()
classifier.fit(X_titanic_train_encoded.fillna(-1), y_titanic_train)

print("✅ Training succeeded!")


Columns in the training set: ['sex', 'cabin', 'age', 'fare']
✅ Training succeeded!


## `X_titanic_train_encoded` is the encoded incomplete training data

In [None]:

X_titanic_train_encoded.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 4 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Sex     3 non-null      object 
 1   Cabin   3 non-null      object 
 2   Age     4 non-null      int64  
 3   Fare    4 non-null      float64
dtypes: float64(1), int64(1), object(2)
memory usage: 160.0+ bytes


In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


data = {
    'Sex': ['male', 'female', 'female', 'male', np.nan],
    'Cabin': ['C85', 'E46', np.nan, 'G6', 'C103'],
    'Age': [22, 38, 26, 35, 28],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
    'Survived': [0, 1, 1, 1, 0]
}
titanic = pd.DataFrame(data)


X_titanic_train, X_titanic_test, y_titanic_train, y_titanic_test = train_test_split(
    titanic.drop(columns=['Survived']),
    titanic['Survived'],
    test_size=0.2,
    random_state=42
)


X_titanic_train_encoded = X_titanic_train.copy()
X_titanic_train_encoded.columns = X_titanic_train_encoded.columns.str.strip().str.lower()


if 'sex' in X_titanic_train_encoded.columns:
    # Categories not seen during fitting are encoded as -1 at transform time
    encoder_sex = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
    X_titanic_train_encoded['sex'] = encoder_sex.fit_transform(
        X_titanic_train_encoded['sex'].astype(str).values.reshape(-1, 1)
    )
else:
    print("⚠️ Column 'sex' not found in the data!")


if 'cabin' in X_titanic_train_encoded.columns:
    encoder_cabin = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan)
    X_titanic_train_encoded['cabin'] = encoder_cabin.fit_transform(
        X_titanic_train_encoded['cabin'].astype(str).values.reshape(-1, 1)
    )

    # astype(str) turned missing cabins into the string 'nan'; map its code back to NaN
    cabin_nan_code = encoder_cabin.transform([['nan']])[0][0]
    X_titanic_train_encoded['cabin'] = X_titanic_train_encoded['cabin'].replace(cabin_nan_code, np.nan)
else:
    print("⚠️ Column 'cabin' not found in the data!")


# Temporary placeholder: replace the remaining NaN with -1 so the classifier can be fitted
X_titanic_train_encoded = X_titanic_train_encoded.fillna(-1)


print("📊 Data types after encoding:")
print(X_titanic_train_encoded.dtypes)
print("\n📈 Sample data:")
print(X_titanic_train_encoded.head())


classifier = RandomForestClassifier(random_state=42)
classifier.fit(X_titanic_train_encoded, y_titanic_train)

print("\n✅ Training succeeded!")


📊 Data types after encoding:
sex      float64
cabin    float64
age        int64
fare     float64
dtype: object

📈 Sample data:
   sex  cabin  age    fare
4  2.0    0.0   28   8.050
2  0.0   -1.0   26   7.925
0  1.0    1.0   22   7.250
3  1.0    2.0   35  53.100

✅ Training succeeded!


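## Note the error: `ValueError: Input contains NaN, infinity or a value too large for dtype('float32').`

The cells above avoid this error only because the remaining NaN values were replaced by the placeholder -1 before fitting. Estimators that do not accept NaN raise it when given incomplete data, so we need to handle the missing values before performing the classification.

Let's show the number of missing values in each feature of the encoded training data.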



In [None]:
print("The number of missing values ")
print(X_titanic_train_encoded.isnull().sum())

The number of missing values 
pclass      0
sex         0
age       187
sibsp       0
parch       0
fare        1
cabin     712
dtype: int64
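Besides absolute counts, the share of missing entries per feature is often easier to compare across columns. A small sketch on a made-up frame (`df` stands in for the encoded training data and its values are invented):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the encoded training data.
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "fare": [7.25, 71.28, np.nan, 53.10],
    "pclass": [3, 1, 3, 1],
})

# The mean of a boolean mask is the fraction of True values, i.e. missing entries.
missing_fraction = df.isnull().mean()
incomplete_cols = missing_fraction[missing_fraction > 0].index.tolist()

print(missing_fraction)
print(incomplete_cols)
```

The list of incomplete columns is a convenient starting point for either the deletion or the imputation approaches discussed next.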


We have three incomplete features: "age", "fare", and "cabin".

## The deletion approach

### Deleting the incomplete features

In [None]:
X_titanic_train_complete=X_titanic_train_encoded.copy()
X_titanic_train_complete.dropna(axis=1, inplace=True)
X_titanic_train_complete

Unnamed: 0,pclass,sex,sibsp,parch
1214,3,1.0,0,0
677,3,1.0,0,0
534,2,0.0,0,0
1174,3,0.0,8,2
864,3,0.0,0,0
...,...,...,...,...
1095,3,0.0,0,0
1130,3,0.0,0,0
1294,3,1.0,0,0
860,3,0.0,0,0


In [None]:

print(X_titanic_train_complete.isnull().sum())

pclass    0
sex       0
sibsp     0
parch     0
dtype: int64


### Deleting the incomplete instances

In [None]:
X_titanic_train_complete=X_titanic_train_encoded.copy()
X_titanic_train_complete.dropna(axis=0, inplace=True)

X_titanic_train_complete

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,cabin
39,1,1.0,48.0,0,0,50.4958,14.0
30,1,1.0,45.0,0,0,35.5000,145.0
242,1,0.0,33.0,0,0,27.7208,0.0
136,1,1.0,53.0,0,0,28.5000,68.0
3,1,1.0,30.0,1,2,151.5500,61.0
...,...,...,...,...,...,...,...
189,1,1.0,29.0,0,0,30.0000,113.0
252,1,1.0,61.0,1,3,262.3750,35.0
21,1,0.0,47.0,1,1,52.5542,101.0
276,1,1.0,57.0,1,0,146.5208,42.0


## Notice the reduction in the number of instances

Another important point for the instance deletion approach is that the target values (in `y_titanic_train`) corresponding to the deleted instances must be removed as well, so that features and targets stay aligned.
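One way to keep the targets in step with the features is to select them by the surviving row labels. This is a minimal sketch on made-up data; the names `X_train`, `y_train`, `X_complete` and `y_complete` are illustrative only:

```python
import numpy as np
import pandas as pd

# Toy training data; the row labels are deliberately non-contiguous.
X_train = pd.DataFrame(
    {"age": [22.0, np.nan, 26.0], "fare": [7.25, 71.28, np.nan]},
    index=[10, 11, 12],
)
y_train = pd.Series([0, 1, 1], index=[10, 11, 12], name="survived")

# Keep only the fully observed rows, then pick the matching targets by label.
X_complete = X_train.dropna(axis=0)
y_complete = y_train.loc[X_complete.index]

print(X_complete.index.tolist(), y_complete.tolist())
```

Selecting by index label rather than by position works because `dropna` preserves the original row labels.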

In [None]:

print(X_titanic_train_complete.isnull().sum())

pclass    0
sex       0
age       0
sibsp     0
parch     0
fare      0
cabin     0
dtype: int64


The deletion approach has several drawbacks. It reduces the available data, which limits the learning ability, especially when there are many missing values.

Furthermore, the approach of deleting incomplete instances is not practical for test data: we really want to know the answer!

## Imputation using `pandas`

### Simple imputation (`pandas`)

In [None]:

X_titanic_data_complete=X_titanic_train_encoded.copy()
X_titanic_data_complete['age']=X_titanic_data_complete['age'].fillna(X_titanic_data_complete['age'].mean())
X_titanic_data_complete['fare']=X_titanic_data_complete['fare'].fillna(X_titanic_data_complete['fare'].mean())
X_titanic_data_complete['cabin']=X_titanic_data_complete['cabin'].fillna(X_titanic_data_complete['cabin'].mean())

print(X_titanic_data_complete.isnull().sum())

pclass    0
sex       0
age       0
sibsp     0
parch     0
fare      0
cabin     0
dtype: int64


In [None]:
X_titanic_data_complete.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,cabin
1214,3,1.0,29.102309,0,0,8.6625,73.27451
677,3,1.0,26.0,0,0,7.8958,73.27451
534,2,0.0,19.0,0,0,26.0,73.27451
1174,3,0.0,29.102309,8,2,69.55,73.27451
864,3,0.0,28.0,0,0,7.775,73.27451
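When the same `pandas` approach is applied to test data, the fill values should still come from the training data. A minimal sketch with made-up `train` and `test` frames, assuming mean imputation:

```python
import numpy as np
import pandas as pd

# Toy frames, invented for illustration.
train = pd.DataFrame({"age": [20.0, np.nan, 40.0], "fare": [10.0, 30.0, np.nan]})
test = pd.DataFrame({"age": [np.nan, 50.0], "fare": [np.nan, 8.0]})

# Per-column means of the training data (NaN entries are skipped).
train_means = train.mean()

# fillna with a Series fills each column with the value stored under its name.
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)

print(train_means)
print(test_filled)
```

Passing the whole Series of means to `fillna` also avoids repeating one line per column.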


### Interpolation imputation (`pandas`)

In [None]:
X_titanic_data_complete = X_titanic_train_encoded.copy()
# Linear interpolation along the row order; limit_direction="both" also fills leading NaN
X_titanic_data_complete = X_titanic_data_complete.interpolate(limit_direction="both")

print(X_titanic_data_complete.isna().sum())

pclass    0
sex       0
age       0
sibsp     0
parch     0
fare      0
cabin     0
dtype: int64
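To see what linear interpolation does to individual values, here is a small sketch on a made-up Series. Note that interpolation fills a gap from the neighbouring rows, so the result depends on the row order, which is arbitrary after a shuffled train/test split:

```python
import numpy as np
import pandas as pd

# Toy series with a leading gap and an interior gap.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# Default: linear interpolation, filling in the forward direction only.
default_fill = s.interpolate()
# limit_direction="both" additionally fills the leading gap.
both_fill = s.interpolate(limit_direction="both")

print(default_fill.tolist())
print(both_fill.tolist())
```

The interior gap is filled with evenly spaced values between 1.0 and 4.0, while the leading gap is only filled in the second call.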


## Imputation using `sklearn`

### Simple imputation (`sklearn`)

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer()
X_titanic_train_complete = imputer.fit_transform(X_titanic_train_encoded)

X_titanic_train_complete=pd.DataFrame(X_titanic_train_complete, columns=X_titanic_train_encoded.columns)
print("The number of missing values :\n", X_titanic_train_complete.isnull().sum())

The number of missing values :
 pclass    0
sex       0
age       0
sibsp     0
parch     0
fare      0
cabin     0
dtype: int64
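The sklearn imputers return a NumPy array, so rebuilding the DataFrame with only `columns=` resets the row labels to 0, 1, 2, and so on. If the original labels matter later (for example to align with the targets by label), they can be passed back explicitly. A sketch on made-up data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy training frame with non-default row labels.
X_train = pd.DataFrame(
    {"age": [22.0, np.nan, 26.0], "fare": [7.25, 71.28, np.nan]},
    index=[1214, 677, 534],
)

imputer = SimpleImputer()  # default strategy: mean
X_filled = pd.DataFrame(
    imputer.fit_transform(X_train),
    columns=X_train.columns,
    index=X_train.index,  # keep the original row labels
)
print(X_filled)
```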


In [None]:
X_titanic_train_encoded

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,cabin
1214,3,1.0,,0,0,8.6625,
677,3,1.0,26.0,0,0,7.8958,
534,2,0.0,19.0,0,0,26.0000,
1174,3,0.0,,8,2,69.5500,
864,3,0.0,28.0,0,0,7.7750,
...,...,...,...,...,...,...,...
1095,3,0.0,,0,0,7.6292,
1130,3,0.0,18.0,0,0,7.7750,
1294,3,1.0,28.5,0,0,16.1000,
860,3,0.0,26.0,0,0,7.9250,


## The default strategy of the sklearn `SimpleImputer` is "mean"; you can change it using the `strategy` parameter

In [None]:
imputer = SimpleImputer(strategy="median")
X_titanic_train_complete = imputer.fit_transform(X_titanic_train_encoded)

X_titanic_train_complete=pd.DataFrame(X_titanic_train_complete, columns=X_titanic_train_encoded.columns)
print("The number of missing values :\n", X_titanic_train_complete.isnull().sum())

The number of missing values :
 pclass    0
sex       0
age       0
sibsp     0
parch     0
fare      0
cabin     0
dtype: int64
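The effect of the `strategy` parameter can be compared through the fitted `statistics_` attribute, which holds the fill value learned for each column. A minimal sketch on a made-up array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: two features, each with one missing entry.
X = np.array([[1.0, 10.0], [1.0, np.nan], [np.nan, 30.0], [4.0, 50.0]])

# Fit one imputer per strategy and record the learned fill values.
stats = {
    strategy: SimpleImputer(strategy=strategy).fit(X).statistics_
    for strategy in ("mean", "median", "most_frequent")
}
for strategy, values in stats.items():
    print(strategy, values)

# "constant" does not learn anything; it fills with the given fill_value.
constant_fill = SimpleImputer(strategy="constant", fill_value=-1).fit_transform(X)
print(constant_fill)
```

For encoded categorical features such as "cabin", "most_frequent" or "constant" is usually easier to interpret than a mean of category codes.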


### KNN-based imputation (`sklearn`)

In [None]:
from sklearn.impute import KNNImputer
imputer = KNNImputer()
X_titanic_train_complete = imputer.fit_transform(X_titanic_train_encoded)

X_titanic_train_complete=pd.DataFrame(X_titanic_train_complete, columns=X_titanic_train_encoded.columns)
print("The number of missing values :\n", X_titanic_train_complete.isnull().sum())

The number of missing values :
 pclass    0
sex       0
age       0
sibsp     0
parch     0
fare      0
cabin     0
dtype: int64
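To make the behaviour of the KNN imputer concrete, the sketch below uses a made-up four-row array. With the default uniform weights, a missing entry is filled with the average of that feature over the nearest rows, where distance is computed on the features both rows have observed:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: the third row is missing its second feature.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [10.0, 20.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# The two nearest rows (by the first feature) are [2, 4] and [1, 2],
# so the gap is filled with the average of 4 and 2.
print(X_filled)
```

The distant row `[10, 20]` does not influence the result because it is not among the two nearest neighbours.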


In [None]:

# n_neighbors sets how many nearest rows are averaged to fill a gap (default: 5)
imputer = KNNImputer(n_neighbors=2)


### Iterative imputation (`sklearn`)

Note that this is sklearn's implementation of a method originally known as "MICE" (Multivariate Imputation by Chained Equations) -- see lecture 2 from this week for an explanation.

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer()
X_titanic_train_complete = imputer.fit_transform(X_titanic_train_encoded)

X_titanic_train_complete=pd.DataFrame(X_titanic_train_complete, columns=X_titanic_train_encoded.columns)
print("The number of missing values :\n", X_titanic_train_complete.isnull().sum())

The number of missing values :
 pclass    0
sex       0
age       0
sibsp     0
parch     0
fare      0
cabin     0
dtype: int64


You can override the default parameters of the iterative imputer. For example, you can set the maximum number of iterations with `max_iter`. You can also specify the estimator used to predict the missing values, as in the next cell.

In [None]:

from sklearn.tree import DecisionTreeRegressor
imputer = IterativeImputer(estimator=DecisionTreeRegressor())
X_titanic_train_complete = imputer.fit_transform(X_titanic_train_encoded)

X_titanic_train_complete=pd.DataFrame(X_titanic_train_complete, columns=X_titanic_train_encoded.columns)
print("The number of missing values :\n", X_titanic_train_complete.isnull().sum())

The number of missing values :
 pclass    0
sex       0
age       0
sibsp     0
parch     0
fare      0
cabin     0
dtype: int64
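The parameters mentioned above can also be set together. The sketch below is an illustration on synthetic data (two correlated features generated here, not the Titanic data): it sets `max_iter` and `random_state`, names the estimator explicitly, and compares the imputed values with the values that were hidden:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Synthetic data: the second feature is roughly twice the first.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

# Hide every tenth value of the second feature.
X_missing = X.copy()
X_missing[::10, 1] = np.nan

imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=20, random_state=0)
X_filled = imputer.fit_transform(X_missing)

# Compare the imputed values with the values that were hidden.
error = np.abs(X_filled[::10, 1] - X[::10, 1]).mean()
print(f"mean absolute imputation error: {error:.3f}")
```

Because the imputer models each incomplete feature from the others, it can recover values well when features are strongly related, which a per-column mean cannot do.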




## Applying the learned models to incomplete test data

First, apply the encoders

In [None]:

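X_titanic_test_encoded = X_titanic_test.copy()
# Same preprocessing as for the training data: normalise column names, cast to str
X_titanic_test_encoded.columns = X_titanic_test_encoded.columns.str.strip().str.lower()
X_titanic_test_encoded['sex'] = encoder_sex.transform(X_titanic_test_encoded['sex'].astype(str).values.reshape(-1, 1))
X_titanic_test_encoded['cabin'] = encoder_cabin.transform(X_titanic_test_encoded['cabin'].astype(str).values.reshape(-1, 1))

# Map the code of the string 'nan' back to a real missing value
X_titanic_test_encoded['cabin'] = X_titanic_test_encoded['cabin'].replace(cabin_nan_code, np.nan)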


Second, use the learned imputer to estimate the missing values in the test data

In [None]:
print("The number of missing values in the test data before imputation :\n", X_titanic_test_encoded.isnull().sum())
X_titanic_test_complete = imputer.transform(X_titanic_test_encoded)
X_titanic_test_complete=pd.DataFrame(X_titanic_test_complete, columns=X_titanic_test_encoded.columns)
print("The number of missing values in the test data after imputation :\n", X_titanic_test_complete.isnull().sum())

The number of missing values in the test data before imputation :
 pclass      0
sex         0
age        76
sibsp       0
parch       0
fare        0
cabin     349
dtype: int64
The number of missing values in the test data after imputation :
 pclass    0
sex       0
age       0
sibsp     0
parch     0
fare      0
cabin     0
dtype: int64


Finally, we can perform the classification using the imputed complete data.

In [None]:

from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(random_state=0)

classifier.fit(X_titanic_train_complete, y_titanic_train)
# f1_score expects the true labels first, then the predictions
print("F1 score after imputation = ", f1_score(y_titanic_test, classifier.predict(X_titanic_test_complete)))

F1 score after imputation =  0.7289719626168224
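The fit-on-train, transform-on-test steps can also be bundled so that they are applied consistently, for example inside cross-validation. The following is a sketch on synthetic data, assuming a median imputer and a small random forest; it is one possible arrangement, not the procedure used in the cells above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic classification data, invented for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Knock out roughly 10% of the entries after the labels are computed.
X[rng.random(X.shape) < 0.1] = np.nan

# In each fold the imputer is fitted on the training part only.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
scores = cross_val_score(model, X, y, cv=5, scoring="f1")
print(scores.mean())
```

Wrapping the imputer in the pipeline means no separate bookkeeping is needed to keep test data out of the fitting step.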


----