# Chapter 2: Pre-Model Workflow and Data Preprocessing

This notebook provides "recipes" for using the scikit-learn Python library to preprocess data before modeling. Each recipe includes explanations, code examples, visualizations, best practices, and common pitfalls.

## Handling Missing Data

In this section, we will explore different strategies for handling missing data using scikit-learn's imputation tools.

### Getting ready

To begin, we will create a toy dataset of 20 samples and ten random numeric features, with roughly 20% of the values in each column replaced by missing values. We then store the dataset in a pandas `DataFrame` for better readability.

In [None]:
# Load libraries
import numpy as np
import pandas as pd

# Create a larger sample dataset with missing values
np.random.seed(2024)  # For reproducibility
n_samples = 20
n_features = 10

# Generate random data
data = {
    f"Feature{i+1}": np.random.uniform(0, 100, n_samples) for i in range(n_features)
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Randomly introduce missing values (approximately 20% of the data)
for column in df.columns:
    mask = np.random.random(n_samples) < 0.2
    df.loc[mask, column] = np.nan

# Display the DataFrame with missing values
display(df)

### How to do it...

The `SimpleImputer` class provides basic strategies for imputing missing values. It can replace missing values using a constant, the mean, median, or most frequent value of each column.

In [None]:
# Load libraries
from sklearn.impute import SimpleImputer

# Initialize the SimpleImputer; strategy can be "mean", "median", "most_frequent", or "constant"
imputer = SimpleImputer(strategy="mean")

# Fit and transform the data
imputed_data = imputer.fit_transform(df)
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
imputed_df
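
To make the mean strategy concrete, here is a minimal sketch on a hypothetical 3x2 array (`toy` and `toy_imputer` are illustrative names, not part of the recipe). After fitting, the learned per-column fill values are available in the `statistics_` attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy array (not the recipe's DataFrame): one gap per column
toy = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])

toy_imputer = SimpleImputer(strategy="mean")
toy_filled = toy_imputer.fit_transform(toy)

# Column means are computed from the observed values only:
# column 0 -> (1 + 3) / 2, column 1 -> (10 + 20) / 2
print(toy_imputer.statistics_)
print(toy_filled)
```

Because the fill values are learned in `fit`, the same fitted imputer can later be applied to new data with `transform` without recomputing the means.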

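The `KNNImputer` class uses the k-Nearest Neighbors approach to impute missing values. Each missing value is replaced with the average of that feature across the `n_neighbors` most similar samples, where similarity is measured with a Euclidean distance that ignores missing coordinates.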

In [None]:
# Load libraries
from sklearn.impute import KNNImputer

# Initialize the KNNImputer
knn_imputer = KNNImputer(n_neighbors=2)

# Fit and transform the data using the previously defined DataFrame
knn_imputed_data = knn_imputer.fit_transform(df)
knn_imputed_df = pd.DataFrame(knn_imputed_data, columns=df.columns)
knn_imputed_df
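
The neighbor averaging can be checked by hand on a small example. The sketch below uses a hypothetical four-row array (`toy` is an illustrative name) in which the first row is clearly closest to the second and third rows, so its missing value should become the average of their values in that column.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy array: row 0 is missing its third feature
toy = np.array(
    [
        [1.0, 2.0, np.nan],
        [1.1, 2.1, 10.0],
        [0.9, 1.9, 12.0],
        [8.0, 9.0, 50.0],  # far away from row 0
    ]
)

knn_toy = KNNImputer(n_neighbors=2)
knn_filled = knn_toy.fit_transform(toy)

# Rows 1 and 2 are the two nearest neighbors of row 0,
# so the gap is filled with the mean of 10 and 12
print(knn_filled[0, 2])
```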

The `IterativeImputer` class models each feature with missing values as a function of other features, and iteratively estimates missing values.

In [None]:
# Load libraries
from sklearn.experimental import enable_iterative_imputer # Experimental feature requires loading
from sklearn.impute import IterativeImputer

# Initialize the IterativeImputer
iterative_imputer = IterativeImputer()

# Fit and transform the data using the previously defined DataFrame
iterative_imputed_data = iterative_imputer.fit_transform(df)
iterative_imputed_df = pd.DataFrame(iterative_imputed_data, columns=df.columns)
iterative_imputed_df
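
As a rough illustration of what "modeling each feature as a function of the others" buys you, the sketch below builds a hypothetical two-column array in which the second column is exactly twice the first, then hides one value. The names `toy` and `toy_iterative` are illustrative; `max_iter` and `random_state` are set explicitly here only to make the run easy to reproduce.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data: column b is exactly 2 * column a
a = np.arange(1.0, 11.0)
toy = np.column_stack([a, 2.0 * a])
toy[4, 1] = np.nan  # hide the value that belongs to a = 5

toy_iterative = IterativeImputer(max_iter=10, random_state=0)
toy_iterative_filled = toy_iterative.fit_transform(toy)

# The imputed value is predicted from column a, so it should land
# close to 10 rather than at the column mean used by SimpleImputer
print(toy_iterative_filled[4, 1])
```

On the random, uncorrelated features used in this recipe there is little structure for the regressions to exploit, so the benefit over mean imputation is more visible on data with correlated columns.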

## Scaling Techniques

Scaling and normalization are crucial steps in preprocessing data for machine learning models. They put features on comparable scales so that features with large numeric ranges do not dominate the distance calculations in algorithms like k-NN and SVM.

### Getting ready

We will use the previously defined `iterative_imputed_df` DataFrame for this recipe, so there is no need to redefine it.

### How to do it...

The `StandardScaler` standardizes features by removing the mean and scaling to unit variance.

In [None]:
# Load libraries
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the data using the iterative imputed DataFrame
scaled_data = scaler.fit_transform(iterative_imputed_df)
scaled_df = pd.DataFrame(scaled_data, columns=iterative_imputed_df.columns)
scaled_df
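
A quick sanity check of standardization is to compare the scaler's output with the formula it implements, z = (x - mean) / std, computed per column. The sketch below does this on a hypothetical 3x2 array (`toy` is an illustrative name); note that the standard deviation used is the population one (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy array with two features on very different scales
toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy)

# Manual standardization: subtract the column mean, divide by the column std
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(toy_scaled, manual))
print(toy_scaled.mean(axis=0), toy_scaled.std(axis=0))
```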

The `MinMaxScaler` transforms features by scaling each feature to a given range, by default between zero and one.

In [None]:
# Load libraries
from sklearn.preprocessing import MinMaxScaler

# Initialize the MinMaxScaler
minmax_scaler = MinMaxScaler()

# Fit and transform the data using the iterative imputed DataFrame
minmax_scaled_data = minmax_scaler.fit_transform(iterative_imputed_df)
minmax_scaled_df = pd.DataFrame(
    minmax_scaled_data, columns=iterative_imputed_df.columns
)
minmax_scaled_df
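
The target range is controlled by the `feature_range` parameter. As a small sketch on a hypothetical single-column array (`toy` is an illustrative name), the same values are mapped once to the default range and once to a symmetric range:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy column with minimum 10 and maximum 40
toy = np.array([[10.0], [20.0], [40.0]])

# Default range: the minimum maps to 0 and the maximum maps to 1
unit_range = MinMaxScaler().fit_transform(toy)

# Custom range: the minimum maps to -1 and the maximum maps to 1
symmetric = MinMaxScaler(feature_range=(-1, 1)).fit_transform(toy)

print(unit_range.ravel())
print(symmetric.ravel())
```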

The `Normalizer` rescales each individual sample (row), rather than each feature (column), to have unit norm. The L2 norm is used by default.

In [None]:
# Load libraries
from sklearn.preprocessing import Normalizer

# Initialize the Normalizer
normalizer = Normalizer()

# Fit and transform the data using the iterative imputed DataFrame
normalized_data = normalizer.fit_transform(iterative_imputed_df)
normalized_df = pd.DataFrame(normalized_data, columns=iterative_imputed_df.columns)
normalized_df
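
Because `Normalizer` works row by row, it is easy to confuse with the column-wise scalers above. The following sketch, on a hypothetical two-row array (`toy` is an illustrative name), shows that every row ends up with length 1 while the columns are not standardized in any way.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical toy array: row 0 has L2 norm 5, row 1 has L2 norm 10
toy = np.array([[3.0, 4.0], [6.0, 8.0]])

toy_normalized = Normalizer().fit_transform(toy)

# Both rows point in the same direction, so they become identical
print(toy_normalized)
print(np.linalg.norm(toy_normalized, axis=1))
```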

## Encoding Categorical Variables

Encoding categorical variables is essential for converting non-numeric data into a format that can be used by machine learning algorithms.

### Getting ready

To begin, we will create a toy dataset of 20 records with three categorical features (`Department`, `Position`, and `Location`) whose values are drawn at random. We will then store the dataset in a pandas `DataFrame` for better readability.

In [None]:
# Load libraries
import numpy as np
import pandas as pd

# Create sample categorical data with 20 records
np.random.seed(2024)  # for reproducibility
categories = ["A", "B", "C", "D"]
categorical_data = pd.DataFrame(
    {
        "Department": np.random.choice(categories, size=20),
        "Position": np.random.choice(["Junior", "Senior", "Manager"], size=20),
        "Location": np.random.choice(["NY", "SF", "LA", "CHI"], size=20),
    }
)

# Display the DataFrame with categorical values
display(categorical_data)

### How to do it...

The `OneHotEncoder` converts categorical values into a one-hot numeric array.

In [None]:
# Load libraries
from sklearn.preprocessing import OneHotEncoder

# Initialize the OneHotEncoder
onehot_encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the data
onehot_encoded_data = onehot_encoder.fit_transform(categorical_data)
onehot_encoded_df = pd.DataFrame(
    onehot_encoded_data, columns=onehot_encoder.get_feature_names_out()
)
onehot_encoded_df

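The `LabelEncoder` encodes labels with integer values between 0 and `n_classes - 1`. It is designed for target labels, so it accepts one column at a time; the loop below refits the same encoder on each column, which means only the last column's mapping remains available in `classes_` afterwards.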

In [None]:
# Load libraries
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Create a new DataFrame to store label encoded values
label_encoded_df = pd.DataFrame()

# Fit and transform each categorical column
for column in categorical_data.columns:
    label_encoded_df[f"{column}_encoded"] = label_encoder.fit_transform(
        categorical_data[column]
    )
label_encoded_df
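
For feature columns, scikit-learn also provides `OrdinalEncoder`, which accepts a whole DataFrame and lets you state the category order explicitly. The sketch below is an alternative to the loop above, not a replacement prescribed by this recipe; the ordering Junior < Senior < Manager is an assumption made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical column with a natural ordering
positions = pd.DataFrame({"Position": ["Senior", "Junior", "Manager", "Junior"]})

# Assumed order for illustration: Junior < Senior < Manager
ordinal_encoder = OrdinalEncoder(categories=[["Junior", "Senior", "Manager"]])
position_codes = ordinal_encoder.fit_transform(positions)

# Junior -> 0, Senior -> 1, Manager -> 2
print(position_codes.ravel())
```

Without the `categories` argument, both encoders assign codes in sorted (alphabetical) order, which rarely matches a meaningful ranking.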

The `ColumnTransformer` applies different transformers to different columns or column subsets of the input, then concatenates the features generated by each transformer into a single feature space.

In [None]:
# Load libraries
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Create sample mixed data with 20 records
np.random.seed(2024)  # for reproducibility
mixed_data = pd.DataFrame(
    {
        "Age": np.random.randint(25, 65, size=20),
        "Salary": np.round(np.random.normal(60000, 15000, size=20), 2),
        "Experience": np.random.randint(1, 20, size=20),
        "Department": np.random.choice(["IT", "HR", "Sales", "Finance"], size=20),
        "Position": np.random.choice(["Junior", "Senior", "Manager"], size=20),
    }
)

# Display the DataFrame with mixed data
display(mixed_data)

# Initialize the ColumnTransformer
column_transformer = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Age", "Salary", "Experience"]),
        # Dense output guarantees a NumPy array for the DataFrame built below
        ("cat", OneHotEncoder(sparse_output=False), ["Department", "Position"]),
    ],
    remainder="passthrough",
)

# Fit and transform the data
transformed_data = column_transformer.fit_transform(mixed_data)

# Get feature names for the transformed columns
numeric_cols = ["Age_scaled", "Salary_scaled", "Experience_scaled"]
categorical_cols = column_transformer.named_transformers_["cat"].get_feature_names_out(
    ["Department", "Position"]
)

# Create the transformed DataFrame
transformed_df = pd.DataFrame(
    transformed_data, columns=numeric_cols + list(categorical_cols)
)
transformed_df

## Introduction to Pipelines

Pipelines are a simple way to streamline a machine learning workflow by chaining together transformers and estimators.

### Getting ready

The general syntax for defining a pipeline is as follows:

```
pipeline = Pipeline(
    [("name of step", transformer), ("name of step", transformer), ..., ("name of step", estimator)]
)
```
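
As one concrete instance of this syntax, the sketch below chains two transformers and a final estimator on hypothetical synthetic data. The names `X_demo`, `y_demo`, and `demo_pipeline`, and the choice of `LinearRegression` as the last step, are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic regression data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5])
X_demo[::7, 1] = np.nan  # knock out a few values after computing the target

demo_pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="mean")),  # transformer
        ("scaler", StandardScaler()),  # transformer
        ("model", LinearRegression()),  # final estimator
    ]
)

# fit runs fit_transform on each transformer in order, then fits the estimator
demo_pipeline.fit(X_demo, y_demo)
demo_predictions = demo_pipeline.predict(X_demo)
print(list(demo_pipeline.named_steps))
```

Each step name must be unique, and every step except the last must implement `transform`.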

### How to do it...

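A basic pipeline chains together a sequence of transformations and, optionally, a final estimator. The pipeline in this recipe contains only preprocessing steps, so it is fit on the training split and then applied to the test split with `transform`.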

In [None]:
# Load libraries
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# First, separate features and target (for illustration only, the last column,
# a one-hot Position indicator, is treated as the target)
X = transformed_df.iloc[:, :-1]  # all columns except last
y = transformed_df.iloc[:, -1]  # last column

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2024
)

# Create a pipeline
pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="mean")),  # handle missing values
        ("scaler", StandardScaler()),  # scale the features
    ]
)

# Fit and transform the training data
X_train_transformed = pipeline.fit_transform(X_train)

# Transform the test data
X_test_transformed = pipeline.transform(X_test)

# Create DataFrames with the transformed data (to preserve column names)
X_train_transformed = pd.DataFrame(
    X_train_transformed, columns=X_train.columns, index=X_train.index
)

X_test_transformed = pd.DataFrame(
    X_test_transformed, columns=X_test.columns, index=X_test.index
)

### Visualizing Pipelines

Visualizing pipelines can help you understand the workflow and confirm that all steps are configured correctly.

In [None]:
# Load libraries
from sklearn import set_config

# Set display configuration
set_config(display="diagram")

# Display the pipeline
pipeline

## Feature Engineering

Feature engineering involves creating new features or modifying existing ones to improve model performance.

### Getting ready

We will use the previously defined `X_train_transformed` DataFrame for this recipe, so there is no need to redefine it.

### How to do it...

The `PolynomialFeatures` transformer generates polynomial and interaction features.

In [None]:
# Load libraries
from sklearn.preprocessing import PolynomialFeatures

# Initialize the PolynomialFeatures
poly = PolynomialFeatures(degree=2)

# Fit and transform the X_train_transformed data
poly_features = poly.fit_transform(X_train_transformed)
poly_features_df = pd.DataFrame(
    poly_features, columns=poly.get_feature_names_out(X_train_transformed.columns)
)
poly_features_df

The `KBinsDiscretizer` discretizes continuous features into k bins.

In [None]:
# Load libraries
from sklearn.preprocessing import KBinsDiscretizer

# Initialize the KBinsDiscretizer
kbins = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")

# Fit and transform the X_train_transformed data
binned_data = kbins.fit_transform(X_train_transformed)
binned_df = pd.DataFrame(binned_data, columns=X_train_transformed.columns)
binned_df
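
To see where the bin boundaries fall, the fitted discretizer exposes them in `bin_edges_`. The sketch below uses a hypothetical single column ranging from 0 to 9 (`toy` is an illustrative name); with `strategy="uniform"` the three bins have equal width.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical toy column with values between 0 and 9
toy = np.array([[0.0], [1.0], [4.0], [5.0], [8.0], [9.0]])

toy_kbins = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
toy_binned = toy_kbins.fit_transform(toy)

# Equal-width edges at 0, 3, 6, and 9
print(toy_kbins.bin_edges_[0])
print(toy_binned.ravel())
```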

`RFE` (recursive feature elimination) repeatedly fits the specified estimator and removes the least important feature, as judged by the estimator's coefficients or feature importances, until the requested number of features remains.

In [None]:
# Load libraries
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Initialize the RFE
rfe = RFE(estimator=LinearRegression(), n_features_to_select=1)

# Fit the RFE to the X_train_transformed and y_train data
rfe.fit(X_train_transformed, y_train)

# Get the ranking of features
rfe.ranking_
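
The ranking is easier to interpret on data where the answer is known in advance. In the hypothetical sketch below, the target depends only on the first and third of four features, so those two should be the ones that survive elimination. The data and the names `X_demo`, `y_demo`, and `rfe_demo` are illustrative.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Hypothetical data: only features 0 and 2 influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = 5.0 * X_demo[:, 0] + 3.0 * X_demo[:, 2] + rng.normal(scale=0.01, size=100)

rfe_demo = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe_demo.fit(X_demo, y_demo)

# support_ is a boolean mask of kept features; ranking_ is 1 for kept features
print(rfe_demo.support_)
print(rfe_demo.ranking_)
```

Because linear coefficients depend on feature scale, RFE with a linear model is most meaningful on standardized inputs such as `X_train_transformed`.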

`SelectFromModel` keeps the features whose importance weights, derived from a given model, meet a threshold. For a linear model, the importance of a feature is the absolute value of its coefficient.

In [None]:
# Load libraries
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

# Initialize SelectFromModel with LinearRegression
selector = SelectFromModel(
    estimator=LinearRegression(),
    prefit=False,
    threshold="mean",  # Keep features whose |coefficient| is at least the mean |coefficient|
)

# Fit the selector
selector.fit(X_train_transformed, y_train)

# Get selected features
selected_features_mask = selector.get_support()

# Get feature names that were selected
selected_features = X_train_transformed.columns[selected_features_mask].tolist()

# Print feature importance scores and selection status
feature_importance = pd.DataFrame(
    {
        "Feature": X_train_transformed.columns,
        "Importance": selector.estimator_.coef_,
        "Selected": selected_features_mask,
    }
)
feature_importance.sort_values("Importance", key=abs, ascending=False)
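
Once fitted, the selector can also be used as a transformer to produce the reduced feature matrix directly. The sketch below uses hypothetical synthetic data in which one coefficient is much larger than the rest, so only that feature clears the mean threshold; `X_demo`, `y_demo`, and `demo_selector` are illustrative names.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

# Hypothetical data: feature 0 dominates, feature 1 has a small effect
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = 5.0 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(scale=0.01, size=100)

demo_selector = SelectFromModel(estimator=LinearRegression(), threshold="mean")
demo_selector.fit(X_demo, y_demo)

# transform drops every column that falls below the threshold
X_selected = demo_selector.transform(X_demo)
print(demo_selector.get_support())
print(X_selected.shape)
```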

## Practical Exercise on Data Preprocessing

In this section, we will combine all the recipes into a comprehensive pipeline and apply it to the California Housing dataset.

### Comprehensive Pipeline

We will create a pipeline that includes imputation, scaling, encoding, and modeling steps.

In [None]:
# Load libraries
YOUR CODE HERE

# Load the California Housing dataset
YOUR CODE HERE

# Split the data
YOUR CODE HERE

# Create a comprehensive pipeline
YOUR CODE HERE

# Fit the pipeline
YOUR CODE HERE

# Evaluate the pipeline
YOUR CODE HERE