This notebook compares the values imputed for missing (NaN) entries in the mean radius column under different imputation strategies:
SimpleImputer, KNNImputer, IterativeImputer (MICE), and linear regression.

Author : Sangeetha Vijayam
Date : 14-Feb-2025

In [1]:
# import required libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.linear_model import LinearRegression

In [2]:
# Load dataset
brst_cancer = load_breast_cancer(as_frame=True)
df = brst_cancer.data

In [3]:
pd.set_option("display.max_columns", None)
df.head(3)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758


In [4]:
# Count missing values per column and print only the columns that have any
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])  

Series([], dtype: int64)


Randomly select 5 rows and set their mean radius to NaN to test the different imputation strategies.

In [5]:
np.random.seed(42)
# Randomly select 5 indices from the DataFrame and set mean radius column to NaN
random_indices = df.sample(n=5).index
df.loc[random_indices, 'mean radius'] = np.nan

print(df.loc[random_indices, 'mean radius'])

204   NaN
70    NaN
131   NaN
431   NaN
540   NaN
Name: mean radius, dtype: float64


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[random_indices, 'mean radius'] = np.nan
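
The warning above is most likely raised because `brst_cancer.data` is a slice of a larger frame. A minimal sketch of one way to avoid it, assuming it is acceptable to work on an independent copy: copy the frame first, and keep the original values before masking so the imputed values can be checked against them later. The names `masked_idx` and `true_values` are illustrative and do not appear in the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Work on an independent copy so assignments do not touch a slice of another frame
df = load_breast_cancer(as_frame=True).data.copy()

# Pick 5 rows reproducibly (random_state is used here instead of a global seed)
masked_idx = df.sample(n=5, random_state=42).index

# Keep the original values before overwriting them (illustrative helper variable)
true_values = df.loc[masked_idx, 'mean radius'].copy()

# Mask the selected entries
df.loc[masked_idx, 'mean radius'] = np.nan
print(df['mean radius'].isna().sum())
```

Keeping `true_values` around makes it possible to measure the error of each strategy directly instead of judging the imputed values by eye.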


In [6]:
# Recompute the missing-value counts after the NaN update (the Series from In [4] is stale)
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])

mean radius    5
dtype: int64


In [7]:
nan_rows = df[df.isnull().any(axis=1)]
nan_rows

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
70,,21.31,123.6,1130.0,0.09009,0.1029,0.108,0.07951,0.1582,0.05461,0.7888,0.7975,5.486,96.05,0.004444,0.01652,0.02269,0.0137,0.01386,0.001698,24.86,26.58,165.9,1866.0,0.1193,0.2336,0.2687,0.1789,0.2551,0.06589
131,,19.48,101.7,748.9,0.1092,0.1223,0.1466,0.08087,0.1931,0.05796,0.4743,0.7859,3.094,48.31,0.00624,0.01484,0.02813,0.01093,0.01397,0.002461,19.26,26.0,124.9,1156.0,0.1546,0.2394,0.3791,0.1514,0.2837,0.08019
204,,18.6,81.09,481.9,0.09965,0.1058,0.08005,0.03821,0.1925,0.06373,0.3961,1.044,2.497,30.29,0.006953,0.01911,0.02701,0.01037,0.01782,0.003586,14.97,24.64,96.05,677.9,0.1426,0.2378,0.2671,0.1015,0.3014,0.0875
431,,17.68,81.47,467.8,0.1054,0.1316,0.07741,0.02799,0.1811,0.07102,0.1767,1.46,2.204,15.43,0.01,0.03295,0.04861,0.01167,0.02187,0.006005,12.88,22.91,89.61,515.8,0.145,0.2629,0.2403,0.0737,0.2556,0.09359
540,,14.44,74.65,402.9,0.09984,0.112,0.06737,0.02594,0.1818,0.06782,0.2784,1.768,1.628,20.86,0.01215,0.04112,0.05553,0.01494,0.0184,0.005512,12.26,19.68,78.78,457.8,0.1345,0.2118,0.1797,0.06918,0.2329,0.08134


Select row 70 for the analysis.

In [8]:
columns_of_interest = ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness']
rows_70 = df.loc[[70], columns_of_interest]
rows_70

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness
70,,21.31,123.6,1130.0,0.09009


In [9]:
# Take the mean texture, mean perimeter, and mean area of row 70 and use them
# to find other rows with similar values. The known mean radius of those rows
# can be compared with the value imputed for row 70 once a strategy is applied.

# Row 70 has mean texture 21.31, mean perimeter 123.6, and mean area 1130.
# Keep rows where mean texture is in [20, 23] AND mean perimeter is in [123, 126],
# OR where mean area is in [1103, 1110] (a band just below row 70's own area).
# The parentheses make the grouping explicit, since & binds tighter than |.

filtered_rows = df[
    (
        df['mean texture'].between(20, 23, inclusive='both')
        & df['mean perimeter'].between(123, 126, inclusive='both')
    )
    | df['mean area'].between(1103.0, 1110.0, inclusive='both')
]
filtered_rows

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
42,19.07,24.81,128.3,1104.0,0.09081,0.219,0.2107,0.09961,0.231,0.06343,0.9811,1.666,8.83,104.9,0.006548,0.1006,0.09723,0.02638,0.05333,0.007646,24.09,33.17,177.4,1651.0,0.1247,0.7444,0.7242,0.2493,0.467,0.1038
70,,21.31,123.6,1130.0,0.09009,0.1029,0.108,0.07951,0.1582,0.05461,0.7888,0.7975,5.486,96.05,0.004444,0.01652,0.02269,0.0137,0.01386,0.001698,24.86,26.58,165.9,1866.0,0.1193,0.2336,0.2687,0.1789,0.2551,0.06589
400,17.91,21.02,124.4,994.0,0.123,0.2576,0.3189,0.1198,0.2113,0.07115,0.403,0.7747,3.123,41.51,0.007159,0.03718,0.06165,0.01051,0.01591,0.005099,20.8,27.78,149.6,1304.0,0.1873,0.5917,0.9034,0.1964,0.3245,0.1198
433,18.82,21.97,123.7,1110.0,0.1018,0.1389,0.1594,0.08744,0.1943,0.06132,0.8191,1.931,4.493,103.9,0.008074,0.04088,0.05321,0.01834,0.02383,0.004515,22.66,30.93,145.3,1603.0,0.139,0.3463,0.3912,0.1708,0.3007,0.08314
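
The filter above combines three range checks with `&` and `|`. As a small sketch of how that expression groups, the example below rebuilds the four matching rows by hand (the frame `toy` and the mask names are illustrative) and compares the unparenthesized form with the explicit grouping and with a stricter all-three-conditions variant.

```python
import pandas as pd

# Tiny illustrative frame (values taken from the rows shown above)
toy = pd.DataFrame(
    {
        'mean texture': [24.81, 21.31, 21.02, 21.97],
        'mean perimeter': [128.3, 123.6, 124.4, 123.7],
        'mean area': [1104.0, 1130.0, 994.0, 1110.0],
    },
    index=[42, 70, 400, 433],
)

a = toy['mean texture'].between(20, 23)
b = toy['mean perimeter'].between(123, 126)
c = toy['mean area'].between(1103.0, 1110.0)

implicit = a & b | c        # how Python parses the unparenthesized form
explicit_or = (a & b) | c   # same grouping, written out
strict_and = a & b & c      # all three conditions required

print(implicit.equals(explicit_or))
print(toy[strict_and].index.tolist())
```

Under the stricter variant only row 433 remains, because row 70's own mean area (1130) lies outside the 1103 to 1110 band.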


In [10]:
# Get the columns of interest for the selected row indices
df_to_compare = df.loc[[70, 42, 400, 433], columns_of_interest]
df_to_compare

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness
70,,21.31,123.6,1130.0,0.09009
42,19.07,24.81,128.3,1104.0,0.09081
400,17.91,21.02,124.4,994.0,0.123
433,18.82,21.97,123.7,1110.0,0.1018


This DataFrame is used for the rest of the analysis.
Each imputation strategy is applied to the mean radius column and its result is stored in a new column,
so that the imputed values can be compared later.

In [11]:
# Median and mean imputation; the statistics come from the three known values in df_to_compare
imputer = SimpleImputer(strategy='median')
df_to_compare['mean radius simpleImputer median'] = imputer.fit_transform(df_to_compare[['mean radius']])
imputer = SimpleImputer(strategy='mean')
df_to_compare['mean radius simpleImputer mean'] = imputer.fit_transform(df_to_compare[['mean radius']])

df_to_compare.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean radius simpleImputer median,mean radius simpleImputer mean
70,,21.31,123.6,1130.0,0.09009,18.82,18.6
42,19.07,24.81,128.3,1104.0,0.09081,19.07,19.07
400,17.91,21.02,124.4,994.0,0.123,17.91,17.91
433,18.82,21.97,123.7,1110.0,0.1018,18.82,18.82
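
The two imputed values for row 70 can be checked by hand, since SimpleImputer only uses the three known mean radius values in this four-row subset. A minimal sketch with the standard library (the list name `known_radii` is illustrative):

```python
from statistics import mean, median

# Known mean radius values of rows 42, 400, and 433
known_radii = [19.07, 17.91, 18.82]

print(round(mean(known_radii), 2))   # 18.6
print(median(known_radii))           # 18.82
```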


In [12]:
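# KNN Imputer
# Note: only the 'mean radius' column is passed in, so there are no other features
# to measure distances on and the NaN falls back to the column mean (18.6).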
imputer = KNNImputer(n_neighbors=5)

df_to_compare['mean radius KNN Imputer'] = imputer.fit_transform(df_to_compare[['mean radius']])
df_to_compare

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean radius simpleImputer median,mean radius simpleImputer mean,mean radius KNN Imputer
70,,21.31,123.6,1130.0,0.09009,18.82,18.6,18.6
42,19.07,24.81,128.3,1104.0,0.09081,19.07,19.07,19.07
400,17.91,21.02,124.4,994.0,0.123,17.91,17.91,17.91
433,18.82,21.97,123.7,1110.0,0.1018,18.82,18.82,18.82
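
KNNImputer matches the mean here because it only sees the mean radius column. The sketch below, which rebuilds the four-row subset by hand, contrasts that with passing all five columns so that neighbours are chosen by distance on the observed features. The choice of `n_neighbors=2` and the variable names are assumptions made for illustration, not the notebook's settings.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# The four rows of df_to_compare, rebuilt by hand from the table above
subset = pd.DataFrame(
    {
        'mean radius': [np.nan, 19.07, 17.91, 18.82],
        'mean texture': [21.31, 24.81, 21.02, 21.97],
        'mean perimeter': [123.6, 128.3, 124.4, 123.7],
        'mean area': [1130.0, 1104.0, 994.0, 1110.0],
        'mean smoothness': [0.09009, 0.09081, 0.123, 0.1018],
    },
    index=[70, 42, 400, 433],
)

# Single column: no features to measure distance on, so the column mean is used
single = KNNImputer(n_neighbors=2).fit_transform(subset[['mean radius']])

# All five columns: neighbours are chosen by distance on the observed features
multi = KNNImputer(n_neighbors=2).fit_transform(subset)

knn_single = single[0, 0]
knn_multi = multi[0, 0]
print(round(knn_single, 3), round(knn_multi, 3))
```

With all columns, the two nearest rows are 433 and 42, so the imputed value is the average of 18.82 and 19.07. The features are unscaled in this sketch, so mean area dominates the distance; standardizing the columns first may be preferable in practice.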


In [13]:
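# IterativeImputer with a BayesianRidge estimator
# Note: with a single input column there are no predictor features, so the result
# is just the initial mean imputation (18.6).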
imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=42)

df_to_compare['mean radius Iterative Imputer'] = imputer.fit_transform(df_to_compare[['mean radius']])
df_to_compare

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean radius simpleImputer median,mean radius simpleImputer mean,mean radius KNN Imputer,mean radius Iterative Imputer
70,,21.31,123.6,1130.0,0.09009,18.82,18.6,18.6,18.6
42,19.07,24.81,128.3,1104.0,0.09081,19.07,19.07,19.07,19.07
400,17.91,21.02,124.4,994.0,0.123,17.91,17.91,17.91,17.91
433,18.82,21.97,123.7,1110.0,0.1018,18.82,18.82,18.82,18.82
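
IterativeImputer models each incomplete column from the other columns, so it needs more than one column to do anything beyond the initial mean fill. A hedged sketch of that setup follows, assuming the five columns of interest from the full dataset are used and only row 70 is masked; `true_radius` and `mice_radius` are illustrative names, and the imputed value is not one of the notebook's results.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

cols = ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness']
data = load_breast_cancer(as_frame=True).data[cols].copy()

# Keep the original value only to check the result, then mask it
true_radius = data.loc[70, 'mean radius']
data.loc[70, 'mean radius'] = np.nan

# Same imputer settings as in the cell above, but fitted on all five columns
imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=42)
imputed = imputer.fit_transform(data)

# fit_transform returns a NumPy array, so look up row 70 by position
row_pos = data.index.get_loc(70)
mice_radius = imputed[row_pos, 0]
print(round(mice_radius, 2), true_radius)
```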


In [14]:
# Use linear regression with mean radius as the target and fill the NaN manually

# Rows with NaN in mean radius form the test set; the remaining rows form the training set
train_data = df_to_compare[df_to_compare['mean radius'].notna()]
test_data = df_to_compare[df_to_compare['mean radius'].isna()]

# Split features and target
# Note: the features include the imputed columns added above, not only the original measurements
X_train = train_data.drop(columns=['mean radius'])  # features
y_train = train_data['mean radius']                 # target
X_test = test_data.drop(columns=['mean radius'])

# Create Linear Regression Model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
predictions = model.predict(X_test)

# Create a new column for the linear regression result
df_to_compare['mean radius Linear Reg'] = df_to_compare['mean radius']

# Overwrite only the NaN entries with the predictions
df_to_compare.loc[df_to_compare['mean radius'].isna(), 'mean radius Linear Reg'] = predictions
df_to_compare

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean radius simpleImputer median,mean radius simpleImputer mean,mean radius KNN Imputer,mean radius Iterative Imputer,mean radius Linear Reg
70,,21.31,123.6,1130.0,0.09009,18.82,18.6,18.6,18.6,18.949927
42,19.07,24.81,128.3,1104.0,0.09081,19.07,19.07,19.07,19.07,19.07
400,17.91,21.02,124.4,994.0,0.123,17.91,17.91,17.91,17.91,17.91
433,18.82,21.97,123.7,1110.0,0.1018,18.82,18.82,18.82,18.82,18.82
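
In the cell above, the regression is trained on three rows and its features include the previously imputed columns, which equal the target on the training rows. A minimal sketch of an alternative, assuming only the original measurement columns are used as features and the model is trained on every row of the full dataset with a known mean radius (the names `features`, `known`, `missing`, and `linreg_radius` are illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LinearRegression

target = 'mean radius'
features = ['mean texture', 'mean perimeter', 'mean area', 'mean smoothness']
data = load_breast_cancer(as_frame=True).data[[target] + features].copy()

# Keep the original value only to check the result, then mask it
true_radius = data.loc[70, target]
data.loc[70, target] = np.nan

# Train on rows with a known target, predict for rows where it is missing
known = data[data[target].notna()]
missing = data[data[target].isna()]

model = LinearRegression().fit(known[features], known[target])
data.loc[missing.index, target] = model.predict(missing[features])

linreg_radius = data.loc[70, target]
print(round(linreg_radius, 2), true_radius)
```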


Conclusion:

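Based on the values imputed for mean radius in row 70, the strategies differ only slightly: the median gives 18.82; the mean, KNNImputer, and IterativeImputer all give 18.6; and linear regression gives about 18.95. KNNImputer and IterativeImputer match the mean because they were given only the mean radius column.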
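
To put a number on how small the differences are, the sketch below takes the five imputed values for row 70 from the final table and computes their spread. The dictionary and variable names are illustrative.

```python
# Imputed mean radius for row 70, copied from the final comparison table
imputed_row_70 = {
    'SimpleImputer median': 18.82,
    'SimpleImputer mean': 18.6,
    'KNNImputer': 18.6,
    'IterativeImputer': 18.6,
    'Linear regression': 18.949927,
}

lowest = min(imputed_row_70.values())
highest = max(imputed_row_70.values())
spread = highest - lowest
relative_spread = spread / lowest

print(f"spread = {spread:.3f} ({relative_spread:.1%} of the lowest value)")
```

The spread is about 0.35, or roughly 1.9% of the lowest imputed value.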