In [1]:
# Top 5 Interview Questions on Missing Value Imputation

In [2]:
# https://www.analyticsvidhya.com/blog/2022/11/top-5-interview-questions-on-missing-value-imputation/?utm_source=related_WP&utm_medium=https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/

In [5]:
# importing required libraries
import numpy as np
import pandas as pd
# importing the dataset
df = pd.read_csv('iris.csv')

In [6]:
df.shape

(150, 5)

In [7]:
print(df.isnull().sum())

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64


In [8]:
print(df.isna().sum())

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64


In [9]:
# dropping rows with missing data (complete case analysis)
df = df.dropna()

In [10]:
df.shape

(150, 5)

Complete case analysis, however, is rarely the best way to handle missing data. Dropping incomplete rows also discards the values that were observed in those rows, and the dropped rows may carry information that the remaining data does not. For that reason it is usually treated as a last resort, used only when no better option is available.

According to researchers, this technique should be considered only when 5% or less of the data is missing from the dataset.
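As a rough sketch of the 5% guideline above, the snippet below measures the share of rows that contain at least one missing value and drops them only when that share is at or below a threshold. The helper name `drop_if_few_missing`, the toy DataFrame, and the default threshold are assumptions made for illustration; they do not come from the original article.

```python
import numpy as np
import pandas as pd

def drop_if_few_missing(frame, max_fraction=0.05):
    """Drop incomplete rows only when they make up at most max_fraction of the data."""
    # Fraction of rows that contain at least one missing value
    incomplete = frame.isna().any(axis=1).mean()
    if incomplete <= max_fraction:
        return frame.dropna(), incomplete
    # Too much would be lost: return the data unchanged so it can be imputed instead
    return frame, incomplete

# Toy data (hypothetical): 1 incomplete row out of 40
toy = pd.DataFrame({"a": np.arange(40, dtype=float), "b": np.arange(40, dtype=float)})
toy.loc[3, "b"] = np.nan

cleaned, fraction = drop_if_few_missing(toy)
print(fraction, cleaned.shape)
```

When the share of incomplete rows exceeds the threshold, the helper returns the data untouched, which leaves room for one of the imputation strategies covered next.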

In [11]:
# Handling Missing Data with SimpleImputer

In [12]:
# https://www.analyticsvidhya.com/blog/2022/10/handling-missing-data-with-simpleimputer/?utm_source=related_WP&utm_medium=https://www.analyticsvidhya.com/blog/2022/11/top-5-interview-questions-on-missing-value-imputation/

In [13]:
# importing simpleimputer
from sklearn.impute import SimpleImputer

In [14]:
# Example
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
age = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp_mean.transform(age))

[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]


Performing “Most_frequent” Imputation
Most frequent imputation is commonly used to handle missing values in a categorical column.

Applying it to a categorical column fills each missing value with the value that occurs most often in that column.

Code:

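imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# 'category' stands for any categorical column that has missing values
# SimpleImputer expects 2-D input, so select the column with double brackets
df['category'] = imputer.fit_transform(df[['category']]).ravel()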

In [16]:
# Example:

imp_mf = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_mf.fit([['one', 'two', 'three'], ['four', np.nan, 'six'], ['two', 'five', 'two']])
category = [[np.nan, 'two', 'two'], ['four', np.nan, 'six'], ['ten', np.nan, 'nine']]
print(imp_mf.transform(category))

[['four' 'two' 'two']
 ['four' 'five' 'six']
 ['ten' 'five' 'nine']]
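The same idea can be applied to a single DataFrame column. The sketch below is a minimal, self-contained version; the toy DataFrame and its values are hypothetical and chosen only so that one category clearly occurs most often.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: a categorical column with two missing entries
toy = pd.DataFrame(
    {"category": pd.Series(["red", "blue", np.nan, "red", np.nan], dtype=object)}
)

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
# SimpleImputer expects 2-D input, so select the column with double brackets
filled = imputer.fit_transform(toy[["category"]])
# The result has shape (n_rows, 1); flatten it before assigning back to the column
toy["category"] = filled.ravel()

print(toy["category"].tolist())
```

Here "red" appears twice among the observed values, so both missing entries are filled with "red".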


Performing “Constant” Imputation
Constant imputation is a SimpleImputer strategy that fills every missing value with a single value of our choice. It works on both string and numerical data.

The value passed to the fill_value parameter replaces all missing values in the dataset.

In [17]:
# Example:

imp_constant = SimpleImputer(missing_values=np.nan, strategy='constant',fill_value=20)
imp_constant.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
age = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp_constant.transform(age))

[[20.  2.  3.]
 [ 4. 20.  6.]
 [10. 20.  9.]]
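Since the constant strategy also works on string data, here is a small sketch with a string fill value. The array contents and the "unknown" label are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical string data stored as an object array so that np.nan stays a real NaN
colors = np.array(
    [["red", "small"], [np.nan, "large"], ["blue", np.nan]], dtype=object
)

imp_label = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value="unknown")
result = imp_label.fit_transform(colors)

print(result)
```

Every missing entry, regardless of column, receives the same "unknown" label.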


Conclusion
This article discusses handling missing data with the SimpleImputer class. Four strategies, mean, median, most_frequent, and constant, can be used to fill in missing values; the code examples above demonstrate mean, most_frequent, and constant.

Some key takeaways from this article are:

1. Consider outliers when using the mean strategy, since outliers distort the mean and therefore the imputed values, which can lead to a less accurate model with unexpected behavior. Avoid the mean strategy when outliers are present.

2. The mean and median strategies can be used only on numerical data, while the most frequent strategy works on both categorical and numerical data. All three are among the simplest and computationally cheapest methods.

3. The constant strategy is appropriate when we understand the dataset well and already know the impact of filling the missing values with our chosen number or string. It can be used on string and numerical data.
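To make the first takeaway concrete, the sketch below compares the mean and median strategies on a column that contains one extreme value. The income figures are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical incomes with one extreme outlier and one missing value
income = np.array([[30.0], [32.0], [35.0], [1000.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(income)
median_filled = SimpleImputer(strategy="median").fit_transform(income)

# The last row held the missing value in the original data
print(mean_filled[-1, 0])    # mean of the observed values, pulled up by the outlier
print(median_filled[-1, 0])  # median of the observed values, close to the typical value
```

The mean strategy fills the gap with 274.25, far from any typical income in the column, while the median strategy fills it with 33.5.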

In [18]:
# Effective Strategies for Handling Missing Values in Data Analysis (Updated 2023)

In [19]:
# https://www.analyticsvidhya.com/blog/2021/10/handling-missing-value/?utm_source=related_WP&utm_medium=https://www.analyticsvidhya.com/blog/2022/10/handling-missing-data-with-simpleimputer/

In [None]:
#https://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/

In [None]:
# https://courses.analyticsvidhya.com/courses/take/loan-prediction-practice-problem-using-python/texts/6119370-hypothesis-generation

In [None]:
# https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/

In [None]:
# https://www.analyticsvidhya.com/blog/2015/12/improve-machine-learning-results/