# Module 1: Introduction to Scikit-Learn

## Section 2: Exploratory Data Analysis (EDA) and Data Preprocessing

### Part 1: Imputation using Mean, Median, or Most Frequent Values

In this section, we will explore techniques for handling missing data by imputing values using the mean, median, or most frequent values. Missing data can occur for various reasons, and imputation allows us to estimate or replace the missing values based on the available data. Let's get started!

### 1.1 Understanding Imputation Techniques

Imputation is the process of filling in missing data with estimated or substituted values. It is a common approach to handle missing values in a dataset. The choice of imputation technique depends on the nature of the data and the reasons for the missing values.
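Before choosing a technique, it usually helps to see how much data is missing and in which kinds of columns. The snippet below is a minimal sketch using pandas; the DataFrame `df` and its column names are made up for illustration and do not come from a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; np.nan marks the missing entries
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50000.0, 62000.0, np.nan, np.nan],
    "city": pd.Series(["Paris", np.nan, "Lyon", "Paris"], dtype=object),
})

# Number and fraction of missing values per column
missing_counts = df.isna().sum()
missing_fraction = df.isna().mean()

print(missing_counts)
print(missing_fraction)
print(df.dtypes)  # numeric vs. object columns call for different strategies
```

Numeric columns are candidates for mean or median imputation, while string-valued columns such as `city` are candidates for most frequent value imputation.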

### 1.2 Imputation with Mean

Imputing missing values with the mean is a simple and commonly used technique. It replaces each missing value with the mean of the observed values for that feature. This method works best when values are missing completely at random (the missingness is unrelated to any variable in the data) and the mean is representative of the feature's distribution. Keep in mind that it reduces the feature's variance, because every imputed entry receives the same value.

Scikit-Learn provides the SimpleImputer class to perform mean imputation. Here's an example of how to use it:

```python
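import numpy as np
from sklearn.impute import SimpleImputer

# Example numerical data (rows are samples, columns are features); np.nan marks missing entries
numerical_features = np.array([[1.0, 10.0], [2.0, np.nan], [np.nan, 30.0]])

# Create an instance of the SimpleImputer with the strategy set to 'mean'
imputer = SimpleImputer(strategy='mean')

# Fit and transform the numerical feature(s): each np.nan becomes its column's mean
imputed_features = imputer.fit_transform(numerical_features)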
```
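It can be useful to see what the imputer actually learns. The sketch below, using made-up arrays named `X_train` and `X_test`, fits the imputer on one array, reads the learned per-column means from the `statistics_` attribute, and then reuses those means on a second array. Fitting on training data only and applying the result to new data is a common way to keep information from the test set out of preprocessing.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training and test arrays; np.nan marks missing entries
X_train = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])
X_test = np.array([[np.nan, 5.0], [4.0, np.nan]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)

# Per-column means computed from the observed training values
learned_means = imputer.statistics_
print(learned_means)

# The test set is filled with the training means, not its own means
X_test_imputed = imputer.transform(X_test)
print(X_test_imputed)
```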

### 1.3 Imputation with Median

Imputing missing values with the median is another technique commonly used when dealing with outliers or skewed distributions. It replaces the missing values with the median value of the available data for the respective feature. This method is less sensitive to extreme values compared to mean imputation.

Scikit-Learn's SimpleImputer class can also be used for median imputation. Here's an example:

```python
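import numpy as np
from sklearn.impute import SimpleImputer

# Example numerical data (rows are samples, columns are features); np.nan marks missing entries
numerical_features = np.array([[1.0, 10.0], [2.0, np.nan], [np.nan, 30.0]])

# Create an instance of the SimpleImputer with the strategy set to 'median'
imputer = SimpleImputer(strategy='median')

# Fit and transform the numerical feature(s): each np.nan becomes its column's median
imputed_features = imputer.fit_transform(numerical_features)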
```
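To make the point about extreme values concrete, here is a small sketch that compares the two strategies on a single made-up feature containing one outlier. The numbers are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature with one extreme value (1000) and one missing entry
values = np.array([[30.0], [32.0], [35.0], [38.0], [1000.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(values)
median_filled = SimpleImputer(strategy="median").fit_transform(values)

# Value placed into the missing slot (last row) by each strategy
mean_fill_value = mean_filled[-1, 0]
median_fill_value = median_filled[-1, 0]
print(mean_fill_value, median_fill_value)
```

The outlier pulls the mean far above the typical values, so the mean-imputed entry looks nothing like most observations, while the median stays close to them.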

### 1.4 Imputation with Most Frequent Values

Imputing missing values with the most frequent values is suitable for categorical or discrete features. It replaces the missing values with the most frequent value (mode) of the available data for the respective feature. This method assumes that the missing values are most likely to have the same value as the majority of the observations.

Scikit-Learn's SimpleImputer class supports mode imputation for both string and numerical data. If several values tie for most frequent, it uses the smallest one. Here's an example:

```python
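import numpy as np
from sklearn.impute import SimpleImputer

# Example categorical data stored as an object array; np.nan marks missing entries
categorical_features = np.array([['red'], ['blue'], [np.nan], ['red']], dtype=object)

# Create an instance of the SimpleImputer with the strategy set to 'most_frequent'
imputer = SimpleImputer(strategy='most_frequent')

# Fit and transform the categorical feature(s): each np.nan becomes the column's mode
imputed_features = imputer.fit_transform(categorical_features)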
```
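The same strategy also applies to discrete numerical features, such as counts. The sketch below uses two made-up columns: in the first, the value 2 is the clear mode; in the second, 1 and 2 are tied, which illustrates how a tie is resolved.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical discrete features; np.nan marks missing entries
# Column 0: mode is 2. Column 1: 1 and 2 each appear twice (a tie).
discrete_features = np.array([
    [1.0, 1.0],
    [2.0, 1.0],
    [2.0, 2.0],
    [3.0, 2.0],
    [np.nan, np.nan],
])

imputer = SimpleImputer(strategy="most_frequent")
imputed_discrete = imputer.fit_transform(discrete_features)

# Last row holds the imputed values for each column
print(imputed_discrete[-1])
```

When the mode is not unique, as in the second column, the filled-in value is somewhat arbitrary, so it is worth checking how dominant the most frequent value really is before relying on it.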

### 1.5 Dealing with Multiple Features

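If your dataset has multiple features with missing values, you can impute them one at a time or all together. Scikit-Learn's SimpleImputer accepts a 2D array and computes its statistic separately for each column, so several features can be transformed in a single step. A single imputer applies one strategy to all of its columns, so features that need different strategies (for example, numerical and categorical ones) need separate imputers.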
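One way to apply different strategies to different columns in a single step is Scikit-Learn's `ColumnTransformer`, which the text above does not cover; it is shown here only as one possible approach. The DataFrame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical mixed-type dataset; np.nan marks missing entries
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50000.0, 62000.0, np.nan, np.nan],
    "city": pd.Series(["Paris", np.nan, "Lyon", "Paris"], dtype=object),
})

# One imputer per group of columns, each with its own strategy
preprocessor = ColumnTransformer(transformers=[
    ("age_mean", SimpleImputer(strategy="mean"), ["age"]),
    ("income_median", SimpleImputer(strategy="median"), ["income"]),
    ("city_mode", SimpleImputer(strategy="most_frequent"), ["city"]),
])

# Output columns follow the order of the transformers: age, income, city
imputed = preprocessor.fit_transform(df)
imputed_df = pd.DataFrame(imputed, columns=["age", "income", "city"])
print(imputed_df)
```

Because the result mixes numbers and strings, the combined output is an object array; convert the numerical columns back to a numeric dtype if later steps need it.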

### 1.6 Summary

Imputation is a crucial step in handling missing data. Depending on the nature of the data and the reasons for the missing values, we can impute with the mean, the median, or the most frequent value. Scikit-Learn's SimpleImputer class provides a convenient way to perform imputation for both numerical and categorical features.

In the next part, we will explore feature scaling techniques to standardize the scale of numerical features.

Feel free to practice imputation using mean, median, or most frequent values on your own datasets. Adapt the strategies based on the specific characteristics of your data and the nature of the missing values.