# Module 1: Data Analysis and Data Preprocessing

## Section 1: Handling missing data

### Part 5: K-Nearest Neighbors (KNN) Imputation

In this part, we will explore K-Nearest Neighbors (KNN) imputation, a technique for handling missing data. KNN imputation fills in missing values by using the values from neighboring data points.

### 5.1 Understanding K-Nearest Neighbors (KNN) Imputation

K-Nearest Neighbors (KNN) imputation is a technique that estimates missing values based on the values of their nearest neighbors. It assumes that similar data points have similar feature values. The algorithm finds the K nearest neighbors of a data point with missing values and imputes the missing values using the average or weighted average of the neighbors' values.

The key idea behind KNN imputation is to leverage the similarity between data points to estimate missing values. Because the estimate comes from the most similar rows rather than from the whole dataset, the imputed values are expected to be closer to the true values than a global statistic such as the column mean.
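The idea can be traced by hand on a toy array. The following is a minimal sketch in plain NumPy, not a description of how scikit-learn implements it; the array `X`, the choice `k = 2`, and all variable names are assumptions made for illustration.

```python
import numpy as np

# Toy data (hypothetical): column 0 is fully observed,
# column 1 has a single missing value in row 2.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [4.0, 40.0],
    [9.0, 90.0],
])

k = 2              # number of neighbors, chosen for this sketch
target_row = 2     # row that contains the missing value
missing_col = 1    # column to impute
observed_col = 0   # column used to measure similarity

# Only rows where the missing column is observed can donate a value.
donors = np.flatnonzero(~np.isnan(X[:, missing_col]))

# Distance between the target row and each donor, using the observed column.
dist = np.abs(X[donors, observed_col] - X[target_row, observed_col])

# Keep the k closest donors and average their values.
nearest = donors[np.argsort(dist, kind="stable")[:k]]
imputed_value = X[nearest, missing_col].mean()

print("Nearest donor rows:", nearest)
print("Imputed value:", imputed_value)
```

Rows 1 and 3 are the closest donors, so the missing entry is filled with the mean of 20 and 40.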

### 5.2 Training and Imputation

KNN imputation has no separate training phase in the usual sense: fitting stores the dataset, and imputation then searches it for neighbors. For each data point with missing values, the algorithm identifies the nearest neighbors based on a distance metric, such as Euclidean distance, computed over the features that are observed in both points. It then imputes the missing values using the average or weighted average of the neighbors' values.
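To make the distance step concrete, the sketch below uses scikit-learn's `nan_euclidean_distances`, the metric that `KNNImputer` uses by default. It skips coordinates that are missing in either point and rescales the result by the share of coordinates that were usable. The two vectors are illustrative values, not data from this part.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Two illustrative points with missing coordinates.
a = np.array([[3.0, np.nan, np.nan, 6.0]])
b = np.array([[1.0, np.nan, 4.0, 5.0]])

# Only coordinates 0 and 3 are present in both points, so the distance is
# sqrt(4 / 2 * ((3 - 1)**2 + (6 - 5)**2)), where 4 is the total number of
# coordinates and 2 is the number of usable ones.
d = nan_euclidean_distances(a, b)[0, 0]

print("NaN-aware Euclidean distance:", d)
```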

Scikit-Learn provides the KNNImputer class for performing KNN imputation. Here's an example of how to use it:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import KNNImputer

np.random.seed(1)
# Generate 25 random points for each category with distinct centers
category1 = np.random.normal(loc=[0,0], scale=5, size=(25, 2))
category2 = np.random.normal(loc=[50,50], scale=5, size=(25, 2))
category3 = np.random.normal(loc=[100,100], scale=5, size=(25, 2))
# Combine the points from each category
data = np.vstack([category1, category2, category3])
# Create labels for each category
labels = np.repeat([1, 2, 3], 25)
# Convert the NumPy array to a DataFrame
df = pd.DataFrame(data, columns=['X', 'Y'])
df['Category'] = labels
# Add 25 more random points to the dataset without category
new_points = np.random.rand(25, 2) * 120  # Generate random points within the range (0, 120)
df = pd.concat([df, pd.DataFrame(new_points, columns=['X', 'Y'])], ignore_index=True)

# Create a KNNImputer instance that averages over the 25 nearest neighbors
imputer = KNNImputer(n_neighbors=25)
# Fill the missing 'Category' values; neighbors are found using X and Y
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
# The imputed value is a mean of category codes, so round it back to an integer (1, 2, or 3)
df_imputed['Category'] = df_imputed['Category'].round().astype(int)

print("Original DataFrame:")
print(df)
print("DataFrame with Imputed Categories:")
print(df_imputed)

# Scatter plot to visualize the 3 categories with NaN points
plt.figure(figsize=(12, 6))  # Updated figsize to accommodate both plots side by side
plt.subplot(1, 2, 1)  # Subplot 1: Scatter plot with NaN Categories
plt.scatter(df[df['Category'] == 1]['X'], df[df['Category'] == 1]['Y'], label='Category 1', marker='o', s=100, color='blue')
plt.scatter(df[df['Category'] == 2]['X'], df[df['Category'] == 2]['Y'], label='Category 2', marker='o', s=100, color='green')
plt.scatter(df[df['Category'] == 3]['X'], df[df['Category'] == 3]['Y'], label='Category 3', marker='o', s=100, color='orange')
# Mark NaN points with 'x'
plt.scatter(df['X'][df['Category'].isnull()], df['Y'][df['Category'].isnull()], label='Unknown Category', marker='x', s=100, color='red')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Scatter Plot of 3 Categories with NaN Categories')
plt.legend()
plt.grid(True)
# Scatter plot to visualize the 3 categories with imputed categories
plt.subplot(1, 2, 2)  # Subplot 2: Scatter plot with Imputed Categories
plt.scatter(df_imputed[df_imputed['Category'] == 1]['X'], df_imputed[df_imputed['Category'] == 1]['Y'], label='Category 1', marker='o', s=100, color='blue')
plt.scatter(df_imputed[df_imputed['Category'] == 2]['X'], df_imputed[df_imputed['Category'] == 2]['Y'], label='Category 2', marker='o', s=100, color='green')
plt.scatter(df_imputed[df_imputed['Category'] == 3]['X'], df_imputed[df_imputed['Category'] == 3]['Y'], label='Category 3', marker='o', s=100, color='orange')
# No 'Unknown Category' markers here: after imputation every point has a category
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Scatter Plot of 3 Categories with Imputed Categories')
plt.legend()
plt.grid(True)
plt.tight_layout()  # Adjust spacing between subplots
plt.show()

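This example generates a dataset with three categories of points and then appends 25 uniformly distributed points whose category is missing (NaN). It uses the KNNImputer to estimate the categories for those points and plots the dataset before and after imputation. Note that KNNImputer treats the category codes as numbers: each missing code is filled with the mean of its neighbors' codes and then rounded. That is acceptable for this illustration, but for truly categorical labels a majority vote among the neighbors is usually more appropriate.
The first subplot shows the three categories, with the points of unknown category marked with 'x'. The second subplot shows the same points after their categories have been imputed using KNNImputer.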
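Because KNNImputer averages numeric values, rounding an averaged category code can produce a label that none of the neighbors carries. The sketch below contrasts this with a majority vote. It swaps in scikit-learn's `KNeighborsClassifier`, which the example above does not use, and the tiny one-feature dataset is made up for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: one feature and a category code (only 1 and 3 occur).
# The last row has an unknown category.
data = np.array([
    [0.0, 1.0],
    [1.0, 1.0],
    [9.0, 3.0],
    [10.0, 3.0],
    [2.0, np.nan],
])

# Approach from the example above: average the neighbors' codes, then round.
filled = KNNImputer(n_neighbors=3).fit_transform(data)
imputed_code = int(np.round(filled[-1, 1]))

# Alternative: majority vote among the same number of neighbors.
known = ~np.isnan(data[:, 1])
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(data[known, :1], data[known, 1].astype(int))
voted_label = int(clf.predict(data[~known, :1])[0])

print("Averaged and rounded code:", imputed_code)
print("Majority-vote label:", voted_label)
```

The three nearest neighbors carry the codes 1, 1, and 3. Their mean is about 1.67, which rounds to 2, a category that does not occur in the data, while the majority vote returns 1.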

### 5.3 Choosing Parameters

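The KNNImputer class has several important parameters that need to be set appropriately. The `n_neighbors` parameter determines the number of neighbors to consider when imputing missing values (the default is 5). The `weights` parameter controls how neighbors are combined: `'uniform'` (the default) gives every neighbor equal weight, while `'distance'` weights each neighbor by the inverse of its distance. The `metric` parameter sets the distance metric, which defaults to `'nan_euclidean'`.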
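The effect of `weights` is easiest to see on a small case. The following sketch uses a made-up three-row array in which the row to impute is twice as close to one donor as to the other.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: column 0 is observed everywhere, column 1 is missing in the last row.
# The last row is at distance 1 from the first donor and 2 from the second (in column 0).
X = np.array([
    [0.0, 0.0],
    [3.0, 30.0],
    [1.0, np.nan],
])

# Equal weight for both neighbors: plain mean of 0 and 30.
uniform_value = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)[2, 1]

# Inverse-distance weights: the closer donor counts twice as much as the farther one.
distance_value = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)[2, 1]

print("uniform: ", uniform_value)
print("distance:", distance_value)
```

With uniform weights the imputed value is the midpoint of the two donors. With distance weights it moves toward the value of the closer donor.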

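The value of K, the number of nearest neighbors to consider, significantly affects the quality of the imputed values. Choosing an optimal value for K requires careful consideration.

- A small value of K makes each imputed value depend on only a few neighbors, so it is sensitive to noise in those points (high variance, overfitting).
- A large value of K averages over increasingly dissimilar points, which pulls the imputed values toward the overall column mean (high bias, underfitting).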
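One common way to compare values of K is to hide some entries whose true values are known, impute them, and measure the error. The sketch below does this on synthetic data. The data-generating process, the 10% masking rate, and the candidate values of K are all assumptions made for illustration, so the resulting numbers say nothing about any real dataset.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): three correlated columns driven by one variable.
x = rng.normal(size=200)
X_true = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    -x + rng.normal(scale=0.1, size=200),
])

# Hide about 10% of the entries in columns 1 and 2; column 0 stays fully observed.
mask = rng.random(X_true.shape) < 0.1
mask[:, 0] = False
X_missing = X_true.copy()
X_missing[mask] = np.nan

# Impute with several values of K and score only the hidden entries.
rmse_by_k = {}
for k in (1, 3, 5, 10, 25, 100):
    X_filled = KNNImputer(n_neighbors=k).fit_transform(X_missing)
    error = X_filled[mask] - X_true[mask]
    rmse_by_k[k] = float(np.sqrt(np.mean(error ** 2)))

best_k = min(rmse_by_k, key=rmse_by_k.get)

for k, rmse in rmse_by_k.items():
    print(f"K={k:>3}  RMSE={rmse:.4f}")
print("Lowest error at K =", best_k)
```

On real data the true values are not available, so the same procedure is applied to entries that are observed and then deliberately masked.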

### 5.4 Summary

KNN imputation is a powerful technique for handling missing data. However, it requires that the dataset has a sufficient number of data points with complete information to estimate missing values accurately. Additionally, the choice of K and the distance metric can have a significant impact on the imputation results. Scikit-Learn provides the KNNImputer class for performing KNN imputation easily. Understanding how neighbors are found, how their values are combined, and how the parameters are tuned is crucial for using KNN imputation effectively in practice.

In the next part, we will explore how different machine learning models can be used to handle missing data.