# Module 1: Data Analysis and Data Preprocessing

## Section 1: Handling missing data

### Part 2: Simple imputer

In this part, we will explore techniques for handling missing data by imputing values using the mean, the median, the most frequent value, or a constant. Missing data can occur for various reasons, and imputation allows us to estimate or replace the missing values based on the available data. Let's get started!

### 2.1 Understanding imputation techniques

Imputation is the process of filling in missing data with estimated or substituted values. It is a common approach to handle missing values in a dataset. The choice of imputation technique depends on the nature of the data and the reasons for the missing values.
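Before choosing a technique, it usually helps to check how many values are missing in each column. The snippet below is a minimal sketch of that check; the small DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one numeric and one categorical column
df = pd.DataFrame({
    "Height": [165, 175, np.nan, 158],
    "City": ["Paris", None, "Rome", "Rome"],
})

# Number of missing values per column
missing_counts = df.isna().sum()

# Share of missing values per column (between 0 and 1)
missing_share = df.isna().mean()

print(missing_counts)
print(missing_share)
```

Columns with only a few missing values are typically good candidates for the simple strategies covered in this part.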

### 2.2 Imputation with mean

Imputing missing values with the mean is a simple and commonly used technique. It replaces the missing values with the mean of the available data for the respective feature. It works only with numeric fields. This method assumes that the missing values are missing at random and that the mean is representative of the feature's distribution.

Scikit-Learn provides the SimpleImputer class to perform mean imputation. Here's an example of how to use it:

In [None]:
import pandas as pd
from sklearn.impute import SimpleImputer

# Sample data with missing values
data = {
    'Height': [165, 175, None, 158, 180, None, 170, 163, 172, 168]
}

# Convert data to a DataFrame
df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Create a SimpleImputer object with the mean strategy
imputer = SimpleImputer(strategy='mean')

# Fit and transform the data to impute missing values with the mean
imputed_data = imputer.fit_transform(df)

# Convert the imputed_data back to a DataFrame
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)

# Display the DataFrame after imputation
print("\nDataFrame after imputation:")
print(imputed_df)

As you can see, the missing values in the "Height" column (indicated by "NaN") have been replaced with the mean of the available heights, which is 168.875 in this case (1351 divided by the 8 observed values). The SimpleImputer helps us handle missing data by providing a reasonable estimate based on the available information, which is important for building accurate machine learning models.
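The value used for filling is learned during `fit` and stored in the imputer's `statistics_` attribute, so the same imputer can be reused on data it has not seen before. The following is a minimal sketch of that workflow; the `new` DataFrame is a hypothetical batch of later observations.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same heights as in the example above
train = pd.DataFrame(
    {'Height': [165, 175, None, 158, 180, None, 170, 163, 172, 168]}
)

# Hypothetical new observations, one of them missing
new = pd.DataFrame({'Height': [np.nan, 181.0]})

# Learn the mean from the training data only
imputer = SimpleImputer(strategy='mean')
imputer.fit(train)

# The learned fill value for each column is stored in statistics_
learned_mean = imputer.statistics_[0]
print("Learned mean:", learned_mean)

# Reuse the learned mean to fill the gaps in the new data
new_imputed = imputer.transform(new)
print(new_imputed)
```

Fitting on one dataset and transforming another is the usual way to avoid leaking information from test data into the imputation step.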

### 2.3 Imputation with median

Imputing missing values with the median is another technique, commonly used when dealing with outliers or skewed distributions. It replaces the missing values with the median of the available data for the respective feature. It works only with numeric fields. This method is less sensitive to extreme values than mean imputation.

Scikit-Learn's SimpleImputer class can also be used for median imputation. Here's an example:

In [None]:
import pandas as pd
from sklearn.impute import SimpleImputer

# Sample data with missing values
data = {
    'Age': [25, 30, None, 22, None, 27, 35, 29, None, 31]
}

# Convert data to a DataFrame
df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Create a SimpleImputer object with the median strategy
imputer = SimpleImputer(strategy='median')

# Fit and transform the data to impute missing values with the median
imputed_data = imputer.fit_transform(df)

# Convert the imputed_data back to a DataFrame
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)

# Display the DataFrame after imputation
print("\nDataFrame after imputation:")
print(imputed_df)

As you can see, the missing values in the "Age" column (indicated by "NaN") have been replaced with the median value of the available ages, which is 29.0 in this case. The SimpleImputer with the median strategy provides us with a robust estimate for handling missing data based on the existing information in the dataset.
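To see why the median is less sensitive to extreme values, the sketch below compares the fill values that the mean and median strategies learn from the same column. The numbers are invented for illustration and include one deliberate outlier.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature with one extreme value (500) and one missing entry
X = np.array([[30.0], [32.0], [35.0], [31.0], [500.0], [np.nan]])

# Fill value learned by each strategy
mean_fill = SimpleImputer(strategy='mean').fit(X).statistics_[0]
median_fill = SimpleImputer(strategy='median').fit(X).statistics_[0]

print("Mean fill value:  ", mean_fill)
print("Median fill value:", median_fill)
```

The outlier pulls the mean far away from the typical values, while the median stays close to them.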

### 2.4 Imputation with most frequent values

Imputing missing values with the most frequent value is applicable to both numeric and categorical columns. It replaces the missing values with the most frequent value (mode) of the available data for the respective feature. This method assumes that the missing values are most likely to have the same value as the majority of the observations. If more than one value is tied for most frequent, only the smallest is returned.

If every value in a column occurs only once, the mode is not a meaningful measure of central tendency: the imputer then simply fills in the smallest value. For such numeric columns, the mean or median is usually a better choice.

Scikit-Learn's SimpleImputer class supports mode imputation through the most_frequent strategy. Here's an example with numeric data:

In [None]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Create a random dataset of 10 rows and 4 columns
df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))

# Randomly set about 20% of the values to NaN
df = df.mask(np.random.random((10, 4)) < .20)

# Give two cells in column B the same value so that the column has a clear mode
# (.loc is used instead of chained indexing, which may not modify df)
df.loc[[8, 9], 'B'] = 0.5

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Create a SimpleImputer object with the most_frequent strategy
most_frequent_imputer = SimpleImputer(strategy='most_frequent')

# Fit and transform the data to impute missing values with the mode
result_most_frequent_imputer = most_frequent_imputer.fit_transform(df)

# Display the DataFrame after imputation
print("\nDataFrame after imputation:")
print(pd.DataFrame(result_most_frequent_imputer, columns=df.columns))

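As you can see, the missing values (indicated by "NaN") have been replaced with the most frequent value of each column. In column B, which contains a repeated value, missing entries are filled with that value. The other columns hold only unique values, so their missing entries are filled with the smallest value in the column. The SimpleImputer with the most frequent strategy provides a straightforward way to handle missing data by filling in the most commonly occurring values based on the available information in the dataset.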
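The same strategy also works on categorical data. The sketch below applies it to a small column of strings; the "Color" column is made up for illustration, and the column is stored with the object dtype so that the strings and NaN markers can sit side by side.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with two missing entries
df = pd.DataFrame(
    {'Color': ['red', 'blue', np.nan, 'red', 'green', np.nan]},
    dtype=object,
)

# 'red' appears most often, so it becomes the fill value
imputer = SimpleImputer(strategy='most_frequent')
imputed = imputer.fit_transform(df)

print(pd.DataFrame(imputed, columns=df.columns))
```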

### 2.5 Imputation with constant

The constant strategy fills in the missing values with a fixed value that you specify through the fill_value parameter. This approach is applicable to both numeric and categorical columns.

Here's an example:

In [None]:
import pandas as pd
from sklearn.impute import SimpleImputer

# Sample data with missing values
data = {
    'Age': [25, 30, None, 22, None, 27, 35, 29, None, 31]
}

# Convert data to a DataFrame
df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Create a SimpleImputer object with the constant strategy
imputer = SimpleImputer(strategy='constant', fill_value=100)

# Fit and transform the data to impute missing values with the constant value
imputed_data = imputer.fit_transform(df)

# Convert the imputed_data back to a DataFrame
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)

# Display the DataFrame after imputation
print("\nDataFrame after imputation:")
print(imputed_df)

As you can see, the missing values in the "Age" column (indicated by "NaN") have been replaced with the constant value of 100, as specified in the SimpleImputer. The SimpleImputer with the constant strategy allows us to handle missing data by filling in a fixed value, which can be useful in certain situations where the constant value is meaningful for the analysis or modeling task.
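For categorical columns, a common choice is to fill the gaps with an explicit placeholder category. The following is a minimal sketch of that idea; the "City" column and the "Unknown" label are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with two missing entries
df = pd.DataFrame({'City': ['Paris', np.nan, 'Rome', np.nan]}, dtype=object)

# Replace every missing entry with the placeholder category 'Unknown'
imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
imputed = imputer.fit_transform(df)

print(pd.DataFrame(imputed, columns=df.columns))
```

Using a separate placeholder keeps the information that a value was missing, instead of blending it into an existing category.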

### 2.6 Summary

Imputation is a crucial step in handling missing data. Depending on the nature of the data and the reasons for the missing values, we can use imputation techniques such as mean, median, most frequent value, or constant imputation. Scikit-Learn's SimpleImputer class provides a convenient way to perform imputation for both numerical and categorical features.

In the next part, we will explore imputation using the iterative imputer.
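When a dataset mixes numerical and categorical columns, different strategies can be applied to different columns. The sketch below shows one possible way to do this with scikit-learn's `ColumnTransformer`, which goes slightly beyond what this part covers; the data and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical dataset with one numeric and one categorical column
df = pd.DataFrame({
    'Age': [25.0, np.nan, 35.0, 30.0],
    'City': pd.Series(['Paris', 'Rome', np.nan, 'Rome'], dtype=object),
})

# Median for the numeric column, most frequent value for the categorical one
preprocess = ColumnTransformer([
    ('num', SimpleImputer(strategy='median'), ['Age']),
    ('cat', SimpleImputer(strategy='most_frequent'), ['City']),
])

# The output columns follow the order of the transformers listed above
imputed = preprocess.fit_transform(df)
imputed_df = pd.DataFrame(imputed, columns=['Age', 'City'])

print(imputed_df)
```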