<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

# Using Pipelines for Pre-processing and Modeling
The dataset we will work with in this example includes data on over 400k historical sales of heavy machinery equipment (bulldozers, etc.) at auction.  The objective is to develop a model that predicts the expected sale price of a piece of machinery from information about the machine's condition and type, similar to what Kelley Blue Book does for used cars. 

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression, mutual_info_regression, f_classif
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

path = Path('')

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Run this before any other code cell
# This downloads the csv data files into the same directory where you have saved this notebook

import urllib.request
from pathlib import Path
import os
path = Path()

# Dictionary of file names and download links
files = {'bulldozer_data.csv':'https://storage.googleapis.com/aipi_datasets/bulldozer_data.csv'}

# Download each file
for key,value in files.items():
    filename = path/key
    url = value
    # If the file does not already exist in the directory, download it
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url,filename)

In [3]:
# Read in the raw data
df_raw = pd.read_csv(path/'bulldozer_data.csv',)
print('Shape is {}'.format(df_raw.shape))
df_raw.head().T

Shape is (401125, 47)


| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| SalePrice | 66000 | 57000 | 10000 | 38500 | 11000 |
| ModelID | 3157 | 77 | 7009 | 332 | 17311 |
| YearMade | 2004 | 1996 | 2001 | 2001 | 2007 |
| MachineHoursCurrentMeter | 68.0 | 4640.0 | 2838.0 | 3486.0 | 722.0 |
| UsageBand | Low | Low | High | High | Medium |
| fiModelDesc | 521D | 950FII | 226 | PC120-6E | S175 |
| fiBaseModel | 521 | 950 | 226 | PC120 | S175 |
| fiSecondaryDesc | D | F | | | |
| fiModelSeries | | II | | -6E | |
| fiModelDescriptor | | | | | |


In [4]:
def split_data(df):
    # Split data into train and test sets
    X = df.drop('SalePrice',axis=1)
    y = df['SalePrice']
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
    return X_train,X_test,y_train,y_test

X_train, X_test, y_train, y_test = split_data(df_raw)

### Build pipeline
Let's create a pipeline from all the preprocessing and modeling steps for our problem.  Below are the steps we will need to include in our pipeline:  
- **Data preprocessing:**  
    - Numerical features:  
        - Fill missing data with median value of feature
        - Standardize each feature. 
    - Categorical features:  
        - Fill missing data with mode value of feature  
        - One-hot encode values for each feature. 
- **Modeling:**  
    - Train a linear regression model on the data
    
We can create two transformer pipelines for the preprocessing of the numerical features and the categorical features, and combine them using `ColumnTransformer` to apply them to the numerical features and categorical features respectively.  We can then place our `ColumnTransformer` in a pipeline together with our model (`LinearRegression`).  

We fit the entire pipeline on our training data (X_train and y_train) using `pipeline.fit(X_train,y_train)`.  Once our pipeline is fitted, we can then apply it to new data using `pipeline.predict(new_data)`.  The preprocessing steps in the pipeline are first applied to the new data and then the processed data is fed into the LinearRegression model to generate predictions.
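
To make the preprocessing steps concrete before applying them to the full dataset, here is a minimal sketch on a tiny, made-up DataFrame. The column names (`hours`, `size`) and values are hypothetical and not taken from the bulldozer data; `sparse_threshold=0` is set only so the result comes back as a dense array that is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical column, each with a gap
toy = pd.DataFrame({
    "hours": [100.0, np.nan, 300.0, 500.0],
    "size": pd.Series(["Small", "Large", np.nan, "Small"], dtype=object),
})

# Numeric branch: fill missing values with the median, then standardize
num_steps = Pipeline(steps=[("fill_missing", SimpleImputer(strategy="median")),
                            ("scaler", StandardScaler())])

# Categorical branch: fill missing values with the mode, then one-hot encode
cat_steps = Pipeline(steps=[("fill_missing", SimpleImputer(strategy="most_frequent")),
                            ("encoder", OneHotEncoder(handle_unknown="ignore"))])

# sparse_threshold=0 forces a dense array so the result is easy to inspect
toy_preprocessor = ColumnTransformer(
    transformers=[("num", num_steps, ["hours"]), ("cat", cat_steps, ["size"])],
    sparse_threshold=0,
)

processed = toy_preprocessor.fit_transform(toy)
print(processed.shape)
print(processed)
```

The result has three columns: the scaled `hours` value followed by one indicator column per category (`Large`, `Small`). The missing `hours` entry is filled with the median (300) and the missing `size` entry with the mode (`Small`) before scaling and encoding.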

In [32]:
# Create transformer for numeric features
# Fill missing values with median and then scale
numeric_features = ['YearMade','MachineHoursCurrentMeter']
numeric_transformer = Pipeline(steps=
                               [("fill_missing", SimpleImputer(strategy="median")), 
                                ("scaler", StandardScaler())])

# Create transformer for categorical features
# Fill missing values with mode and then one-hot encode
categorical_features = ['ModelID', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc', 
                        'fiModelSeries', 'fiModelDescriptor', 'ProductSize', 'fiProductClassDesc', 
                        'ProductGroup', 'ProductGroupDesc', 'Drive_System', 'Enclosure', 'Forks', 
                        'Pad_Type', 'Ride_Control', 'Stick', 'Transmission', 'Turbocharged', 
                        'Blade_Extension', 'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics', 
                        'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size', 'Coupler', 
                        'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow', 'Track_Type', 
                        'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb', 'Pattern_Changer', 'Grouser_Type', 
                        'Backhoe_Mounting', 'Blade_Type', 'Travel_Controls', 'Differential_Type', 
                        'Steering_Controls','UsageBand']
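# Convert non-missing values to strings in the train and test sets
# (converting df_raw here would not affect X_train/X_test, which were already split off;
# missing values are kept as NaN so the imputer below can fill them)
for feat in categorical_features:
    X_train[feat] = X_train[feat].astype(str).where(X_train[feat].notna())
    X_test[feat] = X_test[feat].astype(str).where(X_test[feat].notna())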
# Create transformer pipeline for categorical data
categorical_transformer = Pipeline(steps=
                                   [("fill_missing",SimpleImputer(strategy="most_frequent")),
                                    ("encoder",OneHotEncoder(handle_unknown="ignore"))])

# Combine the two transformers into single ColumnTransformer preprocessor
preprocessor = ColumnTransformer(transformers=
                                 [("num", numeric_transformer, numeric_features),
                                  ("cat", categorical_transformer, categorical_features)])

# Create pipeline with preprocessor and model
model_pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", LinearRegression())])

# Fit the pipeline on the training data
model_pipeline.fit(X_train, y_train)

# Use the pipeline to get predictions on test set and evaluate
test_preds = model_pipeline.predict(X_test)
r2 = r2_score(y_test,test_preds)
print("R-squared on test set: {:.3f}".format(r2))

R-squared on test set: 0.751
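
The `handle_unknown="ignore"` setting on the encoder above matters whenever the test set contains a category that never appeared in the training data. The small sketch below, using made-up category values rather than the bulldozer data, suggests what happens in that case: the unseen category is encoded as a row of all zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on three known categories (hypothetical values)
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["Low"], ["Medium"], ["High"]], dtype=object))

# Transform one known category and one category unseen during fitting
encoded = encoder.transform(np.array([["Low"], ["Unknown"]], dtype=object)).toarray()

print(encoder.categories_)  # learned categories, in sorted order
print(encoded)              # the row for the unseen category contains only zeros
```

With the default `handle_unknown="error"`, the same `transform` call would raise a `ValueError`, which is why the pipeline above uses `"ignore"` for high-cardinality columns such as `fiModelDesc`.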


In [29]:
# Save fitted pipeline to pkl file for later use
import joblib
joblib.dump(model_pipeline, 'pipeline.pkl')

['pipeline.pkl']

In [27]:
# Load pipeline and use to generate predictions on new data
model_pipeline = joblib.load('pipeline.pkl')
new_data = X_test
new_predictions = model_pipeline.predict(new_data)
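
As a sanity check on the save/load step, it can be worth confirming that a reloaded pipeline reproduces the predictions of the one that was saved. The sketch below does this for a small stand-in pipeline trained on synthetic data; the data, the file name `toy_pipeline.pkl`, and the temporary directory are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (hypothetical stand-in for the real features)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Fit a small pipeline: scaling followed by linear regression
toy_pipeline = Pipeline(steps=[("scaler", StandardScaler()),
                               ("model", LinearRegression())])
toy_pipeline.fit(X, y)

# Save to a temporary directory and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = Path(tmp_dir) / "toy_pipeline.pkl"
    joblib.dump(toy_pipeline, file_path)
    reloaded = joblib.load(file_path)

# The reloaded pipeline should give the same predictions as the original
same = np.allclose(toy_pipeline.predict(X), reloaded.predict(X))
print(same)
```

Because the whole pipeline is saved, the fitted imputers, scaler, and encoder travel with the model, so new data can be passed in raw form. Files saved this way should only be loaded from trusted sources and, ideally, with the same scikit-learn version that created them.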