### Building a sklearn Pipeline - Example ###

This exercise builds a preprocessing pipeline, saves the fitted model, and uses the saved model for prediction.

I have created a synthetic dataset that requires feature scaling, missing-value imputation, and categorical-to-numeric conversion.

Don't worry about the accuracy: the data is made up, so the score is not meaningful.
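
The `data.csv` file itself is not included here, so the following is only a rough sketch of how a comparable synthetic dataset could be generated. The row count, value ranges, seed, and random labels are assumptions; only the column names, the category spellings, and the count of 10 missing values in `age`, `income`, and `gender` are taken from the output shown in cell 2.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, an assumption
n_rows = 200  # hypothetical size; the original row count is not stated

df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n_rows).astype(float),
    "income": rng.integers(20_000, 100_000, size=n_rows).astype(float),
    "gender": rng.choice(["female", "male"], size=n_rows).astype(object),
    "education": rng.choice(["bachelors", "masters", "PhD"], size=n_rows),
    "label": rng.integers(0, 2, size=n_rows),  # random labels, unrelated to the features
})

# Blank out 10 distinct cells in each of age, income, and gender,
# matching the missing-value counts printed in the notebook.
for col in ["age", "income", "gender"]:
    missing_idx = rng.choice(n_rows, size=10, replace=False)
    df.loc[missing_idx, col] = np.nan

print(df.isna().sum())

# Uncomment to write the file that cell 2 reads:
# df.to_csv("data.csv", index=False)
```

Because the labels in this sketch are drawn independently of the features, any classifier trained on it should score close to chance.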

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


# New Imports
from sklearn.compose import ColumnTransformer # used to apply different preprocessing to different columns
from sklearn.pipeline import Pipeline # used to chain together different transformers
from sklearn.impute import SimpleImputer # used to fill in missing values
import joblib # used to save and load the fitted pipeline


In [2]:
# Read in data
data = pd.read_csv('data.csv') # read in the data

print(f"Data: {data.head(10)}") # print the first 10 rows of the data

X = data.drop('label', axis=1) # separate the features from the labels
y = data['label'] # set the labels
print(f"Missing values in different features:\n{X.isna().sum()}") # print the number of missing values in each column

Data:     age   income  gender  education  label
0  37.0  45749.0  female        PhD      1
1  69.0  73161.0    male        PhD      1
2  23.0  47514.0    male  bachelors      1
3  29.0  45953.0  female    masters      1
4  54.0  89857.0    male        PhD      1
5  69.0  39660.0     NaN    masters      0
6  35.0  49887.0     NaN  bachelors      1
7  61.0  80866.0     NaN        PhD      0
8  22.0  30073.0     NaN    masters      0
9  67.0  22538.0     NaN    masters      0
Missing values in different features:
age          10
income       10
gender       10
education     0
dtype: int64


In [3]:
# Group the features by type so it is easy to tell the column transformer which columns each transformer should process

numerical_features = ['age', 'income'] # numerical features
categorical_features = ['gender', 'education'] # categorical features
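
Listing the columns by hand works well for four features. As an alternative, the lists can be derived from the column dtypes with `make_column_selector` from `sklearn.compose`. The sketch below uses a small stand-in frame `X_demo` (a hypothetical name, with a few illustrative rows) rather than the real `X`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Small stand-in with the same column names and dtypes as X (illustrative rows only)
X_demo = pd.DataFrame({
    "age": [37.0, np.nan, 23.0],
    "income": [45749.0, 73161.0, np.nan],
    "gender": ["female", np.nan, "male"],
    "education": ["PhD", "masters", "bachelors"],
})

# A selector is a callable that returns the matching column names for a DataFrame
select_numerical = make_column_selector(dtype_include=np.number)
select_categorical = make_column_selector(dtype_exclude=np.number)

auto_numerical = select_numerical(X_demo)
auto_categorical = select_categorical(X_demo)

print(auto_numerical)    # ['age', 'income']
print(auto_categorical)  # ['gender', 'education']
```

The selector objects can also be passed directly to `ColumnTransformer` in place of the column lists, in which case the columns are resolved when the transformer is fitted.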

In [4]:
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # split the data into training and testing sets

# Build the pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')), # fill in missing values with the mean of the column
    ('scaler', StandardScaler()) # scale the data
]) # Numeric transformer

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')), # fill in missing values with the mode of the column
    ('encoder', OneHotEncoder(handle_unknown='ignore')) # encode the categorical data
]) # Categorical transformer only works on categorical data

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_features), # use the numeric_transformer on the numerical_features
        ('cat', categorical_transformer, categorical_features) # use the categorical_transformer on the categorical_features
    ]) # Combine the transformers into a single preprocessor

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor), # Apply the preprocessor to the data
    ('model', RandomForestClassifier()) # Use a random forest classifier (no random_state is set, so results can vary between runs)
]) # Combine the preprocessor and the model into a single pipeline



In [5]:
# Fit the model and make predictions
pipeline.fit(X_train, y_train) # Fit the pipeline on training data
y_pred = pipeline.predict(X_test) # Make predictions on test data

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred) # Calculate the accuracy of the model
print(f"Accuracy: {accuracy:.2f}") # Print the accuracy of the model (low accuracy)

Accuracy: 0.52


In [6]:
joblib.dump(pipeline, 'model_pipeline.pkl') # Save the entire pipeline (including preprocessing and model)

['model_pipeline.pkl']

In [7]:
loaded_pipeline = joblib.load('model_pipeline.pkl') # use joblib to load the pipeline

# We no longer have to preprocess the new data ourselves: the loaded pipeline imputes, scales, and encodes it for us
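new_data = pd.DataFrame({
    'age': [37, np.nan, 40], # numeric data with one missing value
    'income': [46000, 90000, np.nan], # numeric data with one missing value
    # Note: 'Female', 'Male', and 'Master' do not match the spellings in the training data preview
    # ('female', 'male', 'masters'). Unless those exact spellings also occur in the training set,
    # the encoder treats them as unknown and, with handle_unknown='ignore', encodes them as all zeros
    'gender': ['Female', 'Female', 'Male'], # categorical data
    'education': ['PhD', 'Master', 'PhD'] # categorical data
}) # Generate new synthetic data for demonstration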


new_predictions = loaded_pipeline.predict(new_data) # Make predictions using the loaded pipeline
print("New Data Predictions:") # Print a header for the predictions
print(new_predictions) # Print the predictions


New Data Predictions:
[1 0 0]
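
One detail in cell 7 is worth checking: the new rows spell the categories as `'Female'`, `'Male'`, and `'Master'`, while the training data preview shows `'female'`, `'male'`, and `'masters'`. With `handle_unknown='ignore'`, an unseen category does not raise an error; it is encoded as a row of zeros. The minimal sketch below isolates that behavior on a single made-up column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up training and prediction-time values for one categorical column
train = pd.DataFrame({"gender": ["female", "male", "female"]})
new = pd.DataFrame({"gender": ["female", "Female", "Male"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The encoder returns a sparse matrix by default, so convert it for printing
encoded = encoder.transform(new).toarray()

print(encoder.categories_)  # categories learned from the training data
print(encoded)
# [[1. 0.]
#  [0. 0.]
#  [0. 0.]]
```

Only the first row matches a known category. The other two carry no gender information into the model, so normalizing the spelling of new data (for example with `str.lower()`) before calling `predict` is a reasonable precaution.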
