# Module 1: Data Analysis and Data Preprocessing

## Section 1: Handling missing data

### Part 6: Regressors and classifiers

In this part, we explore two ways of handling missing data with machine learning algorithms: imputing with regressors and imputing with classifiers.

Later sections cover different machine learning models for regression and classification problems in detail. For now, keep in mind that these models predict values, so they can also predict the missing values in a dataset.

Remember that the models we build here serve only to fill in the missing values of the original dataset. To make predictions on new data, we then need to train a separate model on the completed dataset (the one that no longer has missing values).
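
The two-stage idea can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the column names `a`, `b`, and `target`, the toy data, and the choice of `LinearRegression` for both stages are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: 'target' depends on two features, 'a' and 'b'
rng = np.random.default_rng(0)
a = rng.uniform(0, 10, 200)
b = 3 * a + rng.normal(0, 1, 200)
target = a + 0.5 * b + rng.normal(0, 1, 200)
df = pd.DataFrame({'a': a, 'b': b, 'target': target})

# Remove 30 values of feature 'b' to simulate missing data
missing_rows = rng.choice(200, size=30, replace=False)
df.loc[missing_rows, 'b'] = np.nan

# Stage 1: imputation model, predicts 'b' from 'a' using the complete rows
known = df['b'].notna()
imputer_model = LinearRegression().fit(df.loc[known, ['a']], df.loc[known, 'b'])
df_completed = df.copy()
df_completed.loc[~known, 'b'] = imputer_model.predict(df.loc[~known, ['a']])

# Stage 2: final model, trained on the completed dataset to predict 'target'
final_model = LinearRegression().fit(df_completed[['a', 'b']], df_completed['target'])
print("Missing values left in 'b':", int(df_completed['b'].isna().sum()))
```

The imputation model is discarded once the gaps are filled; only `final_model` would be used for predictions on new data.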

### 6.1 Handling Missing Data with Regressors

When dealing with continuous or numerical features that have missing values, one effective approach is to use regressors to predict the missing values. The process involves the following steps:

1. Identify the numerical features with missing values in the dataset.
2. Split the dataset into two parts: one with complete data for the feature being imputed (training set) and one with missing values (test set).
3. For each numerical feature with missing values, use the complete data from the training set to build a regression model.
4. Use the regression model to predict the missing values in the test set.

The advantage of using regressors for imputation is that they capture the relationships between features and use this information to produce more accurate estimates than simple strategies such as filling in the mean or median. Popular regressors for this task include Linear Regression, Decision Trees, Random Forests, and Gradient Boosting.

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(42)
x = np.linspace(0, 10, 100)
y = 2 * x + 5 + np.random.normal(0, 2, 100)
# Introduce missing values to 'y' feature only
missing_indices = np.random.choice(100, size=20, replace=False)
y_with_missing = y.copy()
y_with_missing[missing_indices] = np.nan
# Convert the NumPy arrays to a DataFrame
df = pd.DataFrame({'x': x, 'y': y_with_missing})

# Split the dataset into complete data (training set) and missing data (test set)
train_data = df.dropna()
test_data = df[df['y'].isnull()]
# Create a linear regression model
regressor = LinearRegression()
# Fit the model on the training data
regressor.fit(train_data[['x']].values, train_data['y'].values)
# Predict the missing values using the fitted model
imputed_values = regressor.predict(test_data[['x']].values)
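# Uncomment the next line to fill in the missing values in the original DataFrame.
# It is left disabled here so that the plot below shows observed and imputed points separately.
#df.loc[df['y'].isnull(), 'y'] = imputed_values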

# Plot the original data with missing values and the imputed data
plt.scatter(df['x'], df['y'], label='Original Data with Missing Values', color='blue')
plt.scatter(test_data['x'], imputed_values, label='Imputed Values', color='red', marker='x', s=100)
plt.plot(x, regressor.predict(x.reshape(-1, 1)), label='Linear Regression Line', color='green')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Linear Regression Imputation of Missing Data')
plt.legend()
plt.grid(True)
plt.show()

In this example, we first generate synthetic data with a linear relationship between 'x' and 'y', then remove 20 values of 'y' at randomly chosen indices. A linear regression fitted on the remaining 80 rows predicts the missing 'y' values from 'x'. The blue points are the observed data, and the red 'x' markers are the imputed values.
The green line is the regression line fitted to the observed points. Because the imputed values are predictions of that same model, they fall exactly on the line, which shows how they follow the overall trend of the data.
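
Because the true values are known in this synthetic setup, the quality of the imputation can be checked directly. The sketch below rebuilds the same data and compares regression imputation with plain mean imputation using the root mean squared error (RMSE). The `rmse` helper and the mean-imputation baseline are additions for illustration; with real data the true values would not be available for such a check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same synthetic setup as above: y is linear in x plus Gaussian noise
np.random.seed(42)
x = np.linspace(0, 10, 100)
y = 2 * x + 5 + np.random.normal(0, 2, 100)
missing_indices = np.random.choice(100, size=20, replace=False)
observed = np.ones(100, dtype=bool)
observed[missing_indices] = False

# Regression imputation: fit on observed rows, predict the hidden ones
model = LinearRegression().fit(x[observed].reshape(-1, 1), y[observed])
regression_imputed = model.predict(x[~observed].reshape(-1, 1))

# Mean imputation: every hidden value is replaced by the observed mean
mean_imputed = np.full((~observed).sum(), y[observed].mean())

def rmse(y_true, y_estimate):
    """Root mean squared error between true and estimated values."""
    return float(np.sqrt(np.mean((y_true - y_estimate) ** 2)))

rmse_regression = rmse(y[~observed], regression_imputed)
rmse_mean = rmse(y[~observed], mean_imputed)
print(f"RMSE, regression imputation: {rmse_regression:.2f}")
print(f"RMSE, mean imputation:       {rmse_mean:.2f}")
```

Since 'y' depends strongly on 'x' here, the regression estimates should land much closer to the hidden values than the single mean value does.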

### 6.2 Handling Missing Data with Classifiers

For categorical features with missing values, the approach is similar, but we use classification algorithms instead of regressors. The steps are as follows:

1. Identify the categorical features with missing values in the dataset.
2. Split the dataset into two parts: one with complete data for the feature being imputed (training set) and one with missing values (test set).
3. For each categorical feature with missing values, use the complete data from the training set to build a classification model.
4. Use the classification model to predict the missing categories in the test set.

Using classifiers for imputation preserves the relationships between features and keeps the data categorical, since every imputed value is one of the categories seen during training. Common classifiers used for this purpose are Decision Trees, Random Forests, Naive Bayes, and Support Vector Machines.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Generate synthetic data
np.random.seed(1)
X1 = np.random.normal(loc=[0, 0], scale=1, size=(50, 2))
X2 = np.random.normal(loc=[2, 2], scale=1, size=(50, 2))
X3 = np.random.normal(loc=[5, 5], scale=1, size=(50, 2))
X = np.vstack([X1, X2, X3])
y = np.repeat([1, 2, 3], 50)
# Introduce missing class labels randomly
missing_indices = np.random.choice(150, size=10, replace=False)
y_with_missing = y.copy()
y_with_missing[missing_indices] = -1  # Use -1 to indicate missing class labels
# Convert the NumPy arrays to a DataFrame
df = pd.DataFrame(X, columns=['X1', 'X2'])
df['Class'] = y_with_missing

# Split the dataset into complete data (training set) and missing data (test set)
train_data = df[df['Class'] != -1]
test_data = df[df['Class'] == -1]
# Split the training and test data into features (X) and class labels (y)
X_train = train_data[['X1', 'X2']]
y_train = train_data['Class']
X_test = test_data[['X1', 'X2']]

# Create an SVM classifier
classifier = SVC()
# Fit the classifier on the training data
classifier.fit(X_train, y_train)
# Predict the missing class labels using the fitted classifier
imputed_labels = classifier.predict(X_test)
df_final = df.copy()
# Fill in the missing class labels in the original DataFrame
df_final.loc[df_final['Class'] == -1, 'Class'] = imputed_labels
# Convert the class labels back to integers (optional)
df_final['Class'] = df_final['Class'].astype(int)

# Evaluate the imputed labels against the true labels
# (possible here only because the data is synthetic and the true labels are known)
missing_mask = (df['Class'] == -1).values
y_true = y[missing_mask]
y_pred = imputed_labels
accuracy = accuracy_score(y_true, y_pred)

print("Accuracy of SVM Classifier on the Imputed Labels:", accuracy)

# Plot the data points with different classes and the imputed points
plt.figure(figsize=(8, 6))
plt.scatter(df_final[df_final['Class'] == 1]['X1'], df_final[df_final['Class'] == 1]['X2'], label='Class 1', marker='o', s=100, color='blue')
plt.scatter(df_final[df_final['Class'] == 2]['X1'], df_final[df_final['Class'] == 2]['X2'], label='Class 2', marker='o', s=100, color='green')
plt.scatter(df_final[df_final['Class'] == 3]['X1'], df_final[df_final['Class'] == 3]['X2'], label='Class 3', marker='o', s=100, color='orange')
plt.scatter(df[df['Class'] == -1]['X1'], df[df['Class'] == -1]['X2'], label='Imputed Points', marker='x', s=100, color='red')
plt.xlabel('X1')
plt.ylabel('X2')
plt.title('Data Points with Different Classes and Imputed Points')
plt.legend()
plt.grid(True)
plt.show()

In this example, we generate a synthetic dataset with three classes, hide 10 of the class labels, and use an SVM classifier to predict them. The classifier is trained on the rows with known labels and then predicts the labels of the remaining rows. Finally, we fill in the missing class labels in a copy of the original DataFrame and compute an accuracy score against the true labels, which are available here because the data is synthetic.
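
The example above marks missing labels with the sentinel value -1. Categorical columns often hold strings with real `NaN` entries instead, and the same steps apply. The following sketch assumes a hypothetical `label` column derived from a numeric `weight` feature and uses a `DecisionTreeClassifier`; both the data and the model choice are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: a string-valued 'label' category driven by 'weight'
rng = np.random.default_rng(0)
weight = rng.uniform(0, 30, 120)
label = np.where(weight < 10, 'small', np.where(weight < 20, 'medium', 'large'))
df = pd.DataFrame({'weight': weight, 'label': label.astype(object)})

# Hide 15 labels as real missing values (NaN) rather than a sentinel
hidden_rows = rng.choice(120, size=15, replace=False)
df.loc[hidden_rows, 'label'] = np.nan

# Train on rows where the category is known, predict where it is missing
known = df['label'].notna()
classifier = DecisionTreeClassifier(max_depth=3, random_state=0)
classifier.fit(df.loc[known, ['weight']], df.loc[known, 'label'])

df_imputed = df.copy()
df_imputed.loc[~known, 'label'] = classifier.predict(df.loc[~known, ['weight']])
print(df_imputed['label'].value_counts())
```

Selecting rows with `notna()` replaces the `!= -1` comparison, and the classifier returns the original string categories, so no encoding step is needed for the target column.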


### 6.3 Benefits of Machine Learning-Based Imputation

1. Preservation of Relationships: Regressors and classifiers predict each missing value from the other features, so the imputations reflect the relationships between features instead of ignoring them, as mean or mode imputation does.
2. Flexibility: Different algorithms can be chosen to suit the data, and the same workflow covers both continuous features (regressors) and categorical features (classifiers).
3. Handling Multiple Features: The approach extends to datasets with missing values in several features by fitting one model per incomplete feature, each using the other features as predictors and thereby exploiting the correlations between them.
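
One way to handle several incomplete features is a loop that fits one model per column and picks a regressor or a classifier by column type. The helper `impute_with_models`, the column names, and the use of random forests are assumptions for this sketch. It also assumes the predictor columns themselves are fully observed, which real datasets do not always guarantee.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def impute_with_models(df, predictors):
    """Impute each non-predictor column that has NaNs, one model per column.

    Assumes the columns listed in `predictors` are fully observed.
    """
    result = df.copy()
    for column in df.columns.difference(predictors):
        missing = df[column].isna()
        if not missing.any():
            continue
        # Regressor for numeric columns, classifier for everything else
        if pd.api.types.is_numeric_dtype(df[column]):
            model = RandomForestRegressor(n_estimators=50, random_state=0)
        else:
            model = RandomForestClassifier(n_estimators=50, random_state=0)
        model.fit(df.loc[~missing, predictors], df.loc[~missing, column])
        result.loc[missing, column] = model.predict(df.loc[missing, predictors])
    return result

# Hypothetical toy data with one numeric and one categorical incomplete column
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
df = pd.DataFrame({
    'x1': x1,
    'x2': x2,
    'income': 3 * x1 - x2 + rng.normal(0, 0.1, 200),
    'segment': np.where(x1 + x2 > 0, 'high', 'low').astype(object),
})
df.loc[rng.choice(200, 20, replace=False), 'income'] = np.nan
df.loc[rng.choice(200, 20, replace=False), 'segment'] = np.nan

completed = impute_with_models(df, predictors=['x1', 'x2'])
print(completed.isna().sum())
```

Each model is trained only on the rows where its own target column is observed, so the two incomplete columns are handled independently of each other.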

### 6.4 Caveats and Considerations

While machine learning-based imputation techniques can be powerful, it is essential to be cautious when applying them:

- Overfitting: Complex models can overfit the observed rows, so their imputations generalize poorly to the rows with missing values. Regularization and cross-validation can help mitigate this issue.
- Model Selection: Careful selection of the appropriate regression or classification model is crucial to obtain accurate imputations.
- Data Distribution: The quality of the imputations depends on the underlying data distribution and the relationships between features.
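
Model selection and the overfitting check can be combined by cross-validating candidate models on the rows where the feature is observed. The sketch below is one possible setup: the candidate list, the 5-fold split, and RMSE as the metric are assumptions chosen for illustration, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical observed rows: the feature to impute (y) is linear in x
rng = np.random.default_rng(0)
X_observed = rng.uniform(0, 10, size=(80, 1))
y_observed = 2 * X_observed[:, 0] + 5 + rng.normal(0, 2, 80)

candidates = {
    'linear': LinearRegression(),
    'ridge': Ridge(alpha=1.0),
    'random_forest': RandomForestRegressor(n_estimators=100, random_state=0),
}

# 5-fold cross-validated RMSE on the observed rows only (lower is better)
scores = {}
for name, model in candidates.items():
    neg_mse = cross_val_score(model, X_observed, y_observed, cv=5,
                              scoring='neg_mean_squared_error')
    scores[name] = float(np.sqrt(-neg_mse.mean()))
    print(f"{name}: RMSE = {scores[name]:.2f}")

best_name = min(scores, key=scores.get)
print("Selected imputation model:", best_name)
```

A model that scores well on its training rows but poorly under cross-validation is likely overfitting and is a weak choice for imputation.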

### 6.5 Summary

In conclusion, handling missing data with machine learning algorithms offers a powerful and flexible approach that can significantly improve the completeness and usefulness of datasets. However, proper model selection, regularization, and evaluation are essential to ensure the reliability of the imputed data.