# Part 1: Preprocessing

## Preprocessing Categorical Variables

In machine learning, most algorithms require numerical input data, so categorical data must be converted into a numerical format. Two common approaches for this are label encoding and one-hot encoding. We covered both methods in `Unit 1.8`; here is a recap.

### Label Encoding

Label encoding is a simple and straightforward method where each unique category value is assigned an integer value.

**How it Works**

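For example, if you have a `color` feature with three categories, `red`, `green`, and `blue`, label encoding could replace them with `0`, `1`, and `2`, respectively. The exact mapping depends on the tool: the pandas and scikit-learn examples later in this section assign codes in sorted (alphabetical) order, so `blue` = `0`, `green` = `1`, and `red` = `2`.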

**When to Use**

Label encoding is ideal for ordinal data, where the categories have some inherent order. However, using this method on nominal data (no intrinsic order) can introduce a new problem: the model might assume a natural ordering between categories which may result in poor performance or unexpected results.
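
For ordinal data, the integer codes should follow the real order of the categories rather than alphabetical order. The sketch below assumes a hypothetical `size` feature (not part of the original example) and uses `pd.Categorical` with an explicit category list to control the mapping.

```python
import pandas as pd

# Hypothetical ordinal feature: small < medium < large is a meaningful order.
sizes = pd.DataFrame({"size": ["medium", "small", "large", "small"]})

# Declare the order explicitly so the codes follow it
# instead of the default sorted (alphabetical) order.
size_order = ["small", "medium", "large"]
sizes["size_encoded"] = pd.Categorical(
    sizes["size"], categories=size_order, ordered=True
).codes

print(sizes)
```

With this mapping, `small` becomes `0`, `medium` becomes `1`, and `large` becomes `2`, so comparisons between codes match comparisons between sizes.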

We can implement label encoding and one-hot encoding using `pandas` or `scikit-learn`.

### One-Hot Encoding

One-hot encoding converts categorical values into a binary vector representation where only one bit is set to `1` out of all the bits representing the categories.

**How it Works**

Taking the same `color` example: for `red`, `green`, and `blue`, one-hot encoding would create three features, `is_red`, `is_green`, and `is_blue`. If the color is `red`, the corresponding feature `is_red` would be `1`, and the rest would be `0`: `red` = `[1, 0, 0]`, `green` = `[0, 1, 0]`, `blue` = `[0, 0, 1]`.

**When to Use**

One-hot encoding is best used for nominal data where no ordinal relationship exists. The downside is that it can lead to a high-dimensional feature space, which might be problematic for models that struggle with high dimensionality.

### 1. Encoding with Pandas

This section demonstrates the quickest way to encode categorical data using built-in pandas functionality. This approach is often used for quick data analysis or preprocessing before modeling.

**Label Encoding**: Achieved by converting the column to the `category` data type and accessing the numerical codes.

**One-Hot Encoding**: Achieved using the `pd.get_dummies()` function, which automatically creates new binary columns.

In [1]:
import pandas as pd
import numpy as np

# Initialize a DataFrame and perform label encoding on the 'color' 
# column using pandas category codes.

df = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green', 'red']
})
df['color_encoded'] = df['color'].astype('category').cat.codes
display(df)

|   | color | color_encoded |
|---|---|---|
| 0 | red | 2 |
| 1 | green | 1 |
| 2 | blue | 0 |
| 3 | green | 1 |
| 4 | red | 2 |


In [2]:
# Perform one-hot encoding on the 'color' column using pandas get_dummies.

df_one_hot = pd.get_dummies(df, columns=['color'])
display(df_one_hot)

|   | color_encoded | color_blue | color_green | color_red |
|---|---|---|---|---|
| 0 | 2 | False | False | True |
| 1 | 1 | False | True | False |
| 2 | 0 | True | False | False |
| 3 | 1 | False | True | False |
| 4 | 2 | False | False | True |


### 2. Encoding with Scikit-Learn

This approach uses scikit-learn preprocessing classes. This is the standard method for machine learning pipelines because it allows you to fit the encoder on training data and consistently transform future test data.

**LabelEncoder**: Converts labels into integers. Note: in `sklearn`, `LabelEncoder` is designed for target labels (`y`), although it is often used for simple integer encoding of a single feature column. For features (`X`), `sklearn` also provides `OrdinalEncoder`.

**OneHotEncoder**: The standard transformer for creating binary variables from categorical features. It expects 2D array inputs and can output sparse matrices to save memory (disabled here with `sparse_output=False` for readability).

In [3]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Use scikit-learn's LabelEncoder to convert categorical 'color' 
# labels into numerical codes within a DataFrame.

df = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green', 'red']
})
le = LabelEncoder()
df['color_encoded'] = le.fit_transform(df['color'])
display(df)

|   | color | color_encoded |
|---|---|---|
| 0 | red | 2 |
| 1 | green | 1 |
| 2 | blue | 0 |
| 3 | green | 1 |
| 4 | red | 2 |


In [4]:
# Apply One-Hot Encoding to the 'color' feature using scikit-learn to generate a binary feature DataFrame.

colors = df['color'].values.reshape(-1, 1)
encoder = OneHotEncoder(sparse_output=False)
colors_encoded = encoder.fit_transform(colors)

# Convert encoded features into a pandas DataFrame with descriptive column names from the encoder.

df_one_hot_sklearn = pd.DataFrame(
    colors_encoded, 
    columns=encoder.get_feature_names_out(['color'])
)
display(df_one_hot_sklearn)

|   | color_blue | color_green | color_red |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 |


## Preprocessing Numerical Variables

Preprocessing numerical variables helps many models train and perform well. Common techniques for preprocessing numerical data include scaling, normalization, and handling missing values.

In [5]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer


# Initialize a pandas DataFrame from a dictionary to structure feature data as a table.

data = { 'feature1': [1, 2, 3],
         'feature2': [4, 5, 6]
}
df = pd.DataFrame(data)
df

|   | feature1 | feature2 |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |


### Scaling

Scaling adjusts the range of data so that different features contribute equally to the final prediction. It's essential when using algorithms that are sensitive to the magnitude of the variables, such as Support Vector Machines (SVM) or K-nearest neighbors (KNN).

**Standardization**

Standardization rescales data to have a mean (μ) of 0 and standard deviation (σ) of 1 (unit variance).

$$ x' = \frac{x - \mu}{\sigma} $$

In [6]:
# Standardize the dataset features using scikit-learn's StandardScaler and reconstruct the DataFrame.

scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
df_standardized

|   | feature1 | feature2 |
|---|---|---|
| 0 | -1.224745 | -1.224745 |
| 1 | 0.0 | 0.0 |
| 2 | 1.224745 | 1.224745 |
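
The standardized values can be reproduced by hand from the formula. The sketch below rebuilds the same small table under the name `df_demo` (a name chosen here) and uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
import pandas as pd

# Same values as the example table, rebuilt so the block is self-contained.
df_demo = pd.DataFrame({"feature1": [1, 2, 3], "feature2": [4, 5, 6]})

mu = df_demo.mean()            # per-column mean
sigma = df_demo.std(ddof=0)    # per-column population standard deviation

# Apply x' = (x - mu) / sigma column by column.
manual_standardized = (df_demo - mu) / sigma
print(manual_standardized)
```

Note that pandas' `std()` defaults to the sample standard deviation (`ddof=1`), which would give `-1.0`, `0.0`, `1.0` here instead of the `StandardScaler` result.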


**Min-Max Scaling**

Min-max scaling rescales the feature to a fixed range, usually 0 to 1.

$$ x' = \frac{x - \text{min}(x)}{\text{max}(x) - \text{min}(x)} $$

In [7]:
# Scale dataset features to a fixed range (typically 0 to 1) using MinMaxScaler and reconstruct the DataFrame.

minmax_scaler = MinMaxScaler()
df_minmax = pd.DataFrame(minmax_scaler.fit_transform(df), columns=df.columns)
df_minmax

|   | feature1 | feature2 |
|---|---|---|
| 0 | 0.0 | 0.0 |
| 1 | 0.5 | 0.5 |
| 2 | 1.0 | 1.0 |
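
The minimum and maximum in the formula are learned when the scaler is fitted. As a small sketch with made-up values, a scaler fitted on one set of numbers and then applied to new numbers outside that range produces results outside 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: fit on values in [1, 3], then transform values outside it.
train_values = np.array([[1.0], [2.0], [3.0]])
new_values = np.array([[0.0], [4.0]])

scaler = MinMaxScaler()
scaler.fit(train_values)                    # learns min=1 and max=3
scaled_new = scaler.transform(new_values)   # applies (x - 1) / (3 - 1)
print(scaled_new)
```

Here `0.0` maps to `-0.5` and `4.0` maps to `1.5`, so the 0 to 1 range is only guaranteed for the data the scaler was fitted on.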


**Normalization**

Normalization keeps the "direction" of each data sample and discards its "magnitude." Imagine each data sample is an arrow pointing from the origin. Some arrows are very long (large values), and some are short (small values). Normalization shrinks or stretches every arrow so they all have the exact same length (usually 1), while keeping them pointing in the original direction.

This ensures that the patterns (ratios between features) matter more than the raw counts or volumes.

**L2 Normalization**

In `scikit-learn`, the `Normalizer()` class defaults to L2 normalization, which uses the "straight-line" (Euclidean) length of each row. It squares every number in the row, adds the squares up, takes the square root, and then divides every number in the row by that result.

$$x' = \frac{x}{\sqrt{\sum_i x_i^2}} = \frac{x}{||x||_2}$$

In [8]:
# Normalize the dataset row-wise and return the result as a DataFrame with original column names.
# By default (norm='l2'), this scales each *row* (sample) individually to have a unit norm.

normalizer = Normalizer() 
df_normalized = pd.DataFrame(normalizer.fit_transform(df), columns=df.columns)
df_normalized

|   | feature1 | feature2 |
|---|---|---|
| 0 | 0.242536 | 0.970143 |
| 1 | 0.371391 | 0.928477 |
| 2 | 0.447214 | 0.894427 |
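
The row-wise computation can be checked by hand with NumPy. This sketch uses the same three rows under the name `X_demo` (chosen here) and, for contrast, also computes the L1 variant, which divides each row by the sum of its absolute values.

```python
import numpy as np

# Same rows as the example table.
X_demo = np.array([[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]])

# L2: divide each row by its Euclidean length.
l2_norms = np.sqrt((X_demo ** 2).sum(axis=1, keepdims=True))
X_l2 = X_demo / l2_norms

# L1: divide each row by the sum of its absolute values.
l1_norms = np.abs(X_demo).sum(axis=1, keepdims=True)
X_l1 = X_demo / l1_norms

print(X_l2)
print(X_l1)
```

For the first row, the L2 length is the square root of 17, giving roughly `0.2425` and `0.9701`, while the L1 version gives `0.2` and `0.8`.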


### Handling Missing Values

Missing values can significantly affect the performance of machine learning models. Common strategies for handling missing data include imputation and removing records with missing values. You've learnt about this in `Unit 1.8`.

**Imputation**

Imputation fills in missing values with a specific value, such as the mean, median, or mode of the column.

**Removing Missing Values**

If the dataset has only a few missing values, it might be reasonable to drop those records. However, this can lead to loss of valuable data.

`Unit 1.8` covered handling missing data with `pandas`; `sklearn` also provides utilities for this, such as `SimpleImputer`.

In [9]:
from sklearn.impute import SimpleImputer

# Initialize a pandas DataFrame from a dictionary, handling missing values as NaNs.

data_with_missing = {'feature1': [1, 2, None], 'feature2': [4, None, 6]}
df_missing = pd.DataFrame(data_with_missing)

# Display the DataFrame to see the missing values (NaN)
df_missing

|   | feature1 | feature2 |
|---|---|---|
| 0 | 1.0 | 4.0 |
| 1 | 2.0 | NaN |
| 2 | NaN | 6.0 |


In [10]:
# Impute missing values in the DataFrame using the mean strategy and reconstruct with original column names.

imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df_missing), columns=df_missing.columns)
df_imputed

|   | feature1 | feature2 |
|---|---|---|
| 0 | 1.0 | 4.0 |
| 1 | 2.0 | 5.0 |
| 2 | 1.5 | 6.0 |
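
For comparison, the alternative strategy of removing records with missing values can be sketched with pandas' `dropna()`. The frame is rebuilt here as `df_demo` (a name chosen for this sketch) with the same values as the example.

```python
import pandas as pd

# Same values as the missing-data example.
df_demo = pd.DataFrame({"feature1": [1, 2, None], "feature2": [4, None, 6]})

# Drop every row that contains at least one missing value.
df_dropped = df_demo.dropna()
print(df_dropped)
```

Only the first row survives, which illustrates the data-loss risk: two of the three records are discarded even though each was missing just one value.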


In [11]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Generate synthetic classification data, partition into train/test sets, and train a Logistic Regression model.

X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)



To evaluate the trained classifier on the held-out test set, we will use several key metrics:

- **Confusion Matrix**: A table showing true positives, false positives, true negatives, and false negatives.

- **Accuracy**: The overall percentage of correct predictions.

- **Precision & Recall**: Precision measures how many selected items are relevant, while Recall measures how many relevant items are selected.

- **F1 Score**: The harmonic mean of precision and recall, useful when class distribution is uneven.

- **ROC Curve & AUC**: The Receiver Operating Characteristic curve plots the True Positive Rate against the False Positive Rate at various threshold settings. AUC (Area Under the Curve) summarizes how well the model separates the two classes: 1.0 means perfect separation and 0.5 is no better than random guessing.

In [12]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc

# Generate class predictions (0 or 1) and probability predictions (0.0 to 1.0)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # Get probabilities for the positive class

# Calculate performance metrics
conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Calculate ROC curve points and Area Under the Curve (AUC)
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

In [13]:
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"AUC: {roc_auc:.2f}")

Confusion Matrix:
[[ 90   3]
 [  4 103]]
Accuracy: 0.96
Precision: 0.97
Recall: 0.96
F1 Score: 0.97
AUC: 0.98
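
As a cross-check on the AUC computation, the sketch below rebuilds the same synthetic setup and compares the two-step route (`roc_curve` followed by `auc`) with scikit-learn's `roc_auc_score`, which computes the area directly from the labels and probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Rebuild the data, split, and model so the block is self-contained.
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# Route 1: build the ROC curve, then integrate it.
fpr, tpr, _ = roc_curve(y_test, y_prob)
auc_from_curve = auc(fpr, tpr)

# Route 2: compute the area directly.
auc_direct = roc_auc_score(y_test, y_prob)

print(f"auc(fpr, tpr): {auc_from_curve:.4f}")
print(f"roc_auc_score: {auc_direct:.4f}")
```

Both routes should agree; `roc_auc_score` is the shorter option when the curve itself is not needed for plotting.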


### Accuracy

Accuracy is the most intuitive performance measure. It is simply the ratio of correctly predicted observations to the total observations.

$$ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Observations}} $$

**When to Use**

Accuracy is best used when the target classes are well balanced. However, it can be misleading when dealing with imbalanced datasets.
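
A small sketch shows how accuracy can mislead on imbalanced data. The labels are made up: 95 negatives and 5 positives, scored against a "model" that always predicts the negative class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives.
y_true = np.array([0] * 95 + [1] * 5)

# A trivial baseline that predicts the majority class for every sample.
y_always_negative = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_negative)
rec = recall_score(y_true, y_always_negative)
print(f"Accuracy: {acc:.2f}")
print(f"Recall:   {rec:.2f}")
```

The baseline reaches 95% accuracy while finding none of the positive cases, so its recall is 0.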

### Confusion Matrix

A confusion matrix is a table that is used to describe the performance of a classification model on a set of test data for which the true values are known.

**Components**

![confusion-matrix](../assets/confusion-matrix.png)

- True Positive (TP): Correctly predicted positives
- True Negative (TN): Correctly predicted negatives
- False Positive (FP): Incorrectly predicted positives (Type I error)
- False Negative (FN): Incorrectly predicted negatives (Type II error)

**When to Use**

The confusion matrix is not a metric but a helpful tool for computing various metrics and gaining a more detailed insight into where the model is making errors.
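
In scikit-learn, `confusion_matrix` for a binary problem returns the counts in the layout `[[TN, FP], [FN, TP]]`, so the four components can be unpacked with `ravel()`. The labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for eight samples.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Note that this layout puts true negatives in the top-left corner, which may differ from diagrams that list the positive class first.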

### Precision

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations.

$$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $$

**When to Use**

Use precision when the cost of a false positive is high, such as in spam email detection.

### Recall (also known as True Positive Rate or Sensitivity)

Recall is the ratio of correctly predicted positive observations to all observations that are actually positive.

$$ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$

**When to Use**

Use recall when the cost of a false negative is high, such as in fraud detection.

### Specificity (also known as True Negative Rate)

Specificity measures the proportion of actual negatives that are correctly identified as such. It complements recall (sensitivity) by focusing on the model's performance with the negative class.

$$ \text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}} $$

**When to Use**

Specificity is particularly important in situations where the cost of a false positive is high. For example, in medical diagnostics, a false positive might lead to unnecessary treatment, which could be costly or harmful.
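
Precision, recall, and specificity can all be computed directly from the four confusion-matrix counts. The sketch below uses the counts from the confusion matrix printed earlier, read in scikit-learn's `[[TN, FP], [FN, TP]]` layout.

```python
# Counts taken from the confusion matrix [[90, 3], [4, 103]].
tn, fp, fn, tp = 90, 3, 4, 103

precision = tp / (tp + fp)     # of the predicted positives, how many are correct
recall = tp / (tp + fn)        # of the actual positives, how many were found
specificity = tn / (tn + fp)   # of the actual negatives, how many were found

print(f"Precision:   {precision:.2f}")
print(f"Recall:      {recall:.2f}")
print(f"Specificity: {specificity:.2f}")
```

Precision and specificity both penalize false positives, but precision divides by the predicted positives while specificity divides by the actual negatives.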

### F1 Score

The F1 Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account, and it is high only when both precision and recall are high.

$$ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$

**When to Use**

Use the F1 score when you want to balance precision and recall, especially if there is an uneven class distribution.
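
To see why the formula behaves differently from a simple average, the sketch below compares the two on made-up values where precision is perfect but recall is very low.

```python
# Hypothetical scores: perfect precision, very poor recall.
precision, recall = 1.0, 0.1

arithmetic_mean = (precision + recall) / 2
f1 = 2 * precision * recall / (precision + recall)

print(f"Arithmetic mean: {arithmetic_mean:.2f}")
print(f"F1 score:        {f1:.2f}")
```

The arithmetic mean is 0.55, while the F1 score is about 0.18: the F1 formula is pulled toward the weaker of the two values.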

