# Putting it all together

## Revisiting the entire pipeline

We've covered a lot, and so far the code is all over the place.

That's nothing to worry about: machine learning projects often start out like this.

There's a whole bunch of experimenting and scattered code at the start, and then once you've found something that works, the refinement process begins.

What would this refinement process look like?

We'll use the car sales regression problem (predicting the sale price of cars) as an example.

To tidy things up, we'll be using Scikit-Learn's [`sklearn.pipeline.Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class. 

You can imagine `Pipeline` as being a way to string a number of different Scikit-Learn processes together.

### 7.1 Creating a regression [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
You might recall that, way back in Section 2: Getting Data Ready, we dealt with the car sales data. To build a regression model on it, we had to encode the categorical features into numbers and fill the missing data.

The code we used worked, but it was a bit all over the place. 

Good news is, `Pipeline` can help us clean it up.

Let's remind ourselves what the data looks like.

In [3]:
import pandas as pd
import sklearn
import numpy as np

In [4]:
data = pd.read_csv("data/car-sales-extended-missing-data.xls")
data

| | Make | Colour | Odometer (KM) | Doors | Price |
|---|---|---|---|---|---|
| 0 | Honda | White | 35431.0 | 4.0 | 15323.0 |
| 1 | BMW | Blue | 192714.0 | 5.0 | 19943.0 |
| 2 | Honda | White | 84714.0 | 4.0 | 28343.0 |
| 3 | Toyota | White | 154365.0 | 4.0 | 13434.0 |
| 4 | Nissan | Blue | 181577.0 | 3.0 | 14043.0 |
| ... | ... | ... | ... | ... | ... |
| 995 | Toyota | Black | 35820.0 | 4.0 | 32042.0 |
| 996 | NaN | White | 155144.0 | 3.0 | 5716.0 |
| 997 | Nissan | Blue | 66604.0 | 4.0 | 31570.0 |
| 998 | Honda | White | 215883.0 | 4.0 | 4001.0 |


In [5]:
data.dtypes

Make              object
Colour            object
Odometer (KM)    float64
Doors            float64
Price            float64
dtype: object

In [6]:
data.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

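There are 1,000 rows. Three columns can be treated as categorical (`Make`, `Colour` and `Doors`, even though `Doors` is stored as `float64`) and the other two are numerical (`Odometer (KM)` and `Price`, the label we want to predict). In total there are 249 missing values (49 + 50 + 50 + 50 + 50).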
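As a quick check on how figures like these are produced, here's a minimal sketch on a tiny made-up DataFrame (the `toy` frame and its values are illustrative assumptions, not the real car sales data). Summing the per-column result of `isna().sum()` gives the total number of missing values.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame (illustrative only, not the real car sales data)
toy = pd.DataFrame({
    "Make": ["Honda", np.nan, "BMW"],
    "Doors": [4.0, np.nan, 5.0],
    "Price": [15323.0, 19943.0, np.nan],
})

n_rows = len(toy)                              # number of rows
missing_per_column = toy.isna().sum()          # missing values in each column
total_missing = int(missing_per_column.sum())  # missing values across the whole frame

print(f"Rows: {n_rows}")
print(missing_per_column)
print(f"Total missing values: {total_missing}")
```

Running the same three lines on `data` should reproduce the row count and the total quoted above.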

We're going to have to turn the categorical features into numbers and fill the missing values before we can fit a model.

We'll build a `Pipeline` to do so.

`Pipeline`'s main input parameter is `steps`, a list of tuples of the form `(step_name, action_to_take)`: the name of each step, plus the transformer or model you'd like that step to run.
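To make the shape of `steps` concrete, here's a minimal sketch on made-up data (the arrays, the step names and the choice of `LinearRegression` are assumptions for illustration only). It fills one missing value, then fits a model, all through a single `fit()` call.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Made-up data following y = 2 * x, with one missing x value
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y_toy = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

toy_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),  # step 1: fill missing values
    ("model", LinearRegression()),                # step 2: fit a model on the filled data
])

# fit() runs every step in order: impute, then fit the model
toy_pipeline.fit(X_toy, y_toy)

# Each step can be looked up by the name given in its tuple
fill_value = toy_pipeline.named_steps["imputer"].statistics_[0]  # mean of the observed x values: 3.0
prediction = toy_pipeline.predict(np.array([[3.0]]))[0]          # should be close to 6

print(fill_value, prediction)
```

The last step is the model and every step before it is a transformer, which is the same layout we'll use for the car sales data.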

In our case, you could think of the steps as:
1. Fill missing data
2. Convert data to numbers
3. Build a model on the data

Let's do it!

In [23]:
# Getting data ready
import pandas as pd
import sklearn
from sklearn.compose import ColumnTransformer  # apply different transformers to different columns
from sklearn.pipeline import Pipeline  # chain preprocessing and modelling steps together
from sklearn.impute import SimpleImputer  # fill missing values
from sklearn.preprocessing import OneHotEncoder  # turn categories into numbers

# Modelling
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV

# Setup random seed
import numpy as np
np.random.seed(42)

# Import data and drop rows with missing labels
data = pd.read_csv("data/car-sales-extended-missing-data.xls")
data.dropna(subset=["Price"], inplace=True)

# Define different features and transformer pipelines
# Categorical features: each value belongs to one of a fixed set of categories
categorical_features = ["Make", "Colour"]
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),  # fill missing values with "missing"
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])  # convert categories to numbers

# Doors: fill missing values with 4
door_feature = ["Doors"]
door_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=4))])

# Numeric features: fill missing values with the column mean
numeric_feature = ["Odometer (KM)"]
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))])

# Setup preprocessing steps (fill missing values, then convert to numbers)
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", categorical_transformer, categorical_features),
        ("door", door_transformer, door_feature),
        ("num", numeric_transformer, numeric_feature)])

# Create a preprocessing and modelling pipeline
model = Pipeline(steps=[("preprocessor", preprocessor),
                        ("model", RandomForestRegressor())])

# Split data into features (X) and labels (y), then into train and test sets
X = data.drop("Price", axis=1)
y = data["Price"]
xtr, xte, ytr, yte = train_test_split(X, y, test_size=0.2)

# Fit and score the model (score() returns R^2 for a regressor)
model.fit(xtr, ytr)
model.score(xte, yte)


0.22188417408787875

What we've done is combine a series of data preprocessing steps (filling missing values, encoding categorical values) as well as a model into a `Pipeline`.

Doing so not only cleans up the code, it also ensures the same steps are taken every time the code is run, rather than having different processing steps happen at different stages.
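One practical consequence is that a fitted `Pipeline` can be handed raw, unprocessed rows. The sketch below uses a small made-up DataFrame (all values, the `toy_` names and the `n_estimators=10` setting are assumptions for illustration) and predicts on new rows containing a missing value and a make that never appeared during training.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training data (illustrative only)
toy_train = pd.DataFrame({
    "Make": ["Honda", "BMW", np.nan, "Toyota", "Honda", "BMW"],
    "Odometer (KM)": [35000.0, 190000.0, 85000.0, np.nan, 180000.0, 60000.0],
    "Price": [15000.0, 20000.0, 12000.0, 13000.0, 6000.0, 30000.0],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Make"]),
    ("num", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])

toy_model = Pipeline(steps=[
    ("preprocessor", toy_preprocessor),
    ("model", RandomForestRegressor(n_estimators=10, random_state=42)),
])
toy_model.fit(toy_train.drop("Price", axis=1), toy_train["Price"])

# New, unprocessed rows: an unseen make and missing values
new_cars = pd.DataFrame({
    "Make": ["Nissan", np.nan],
    "Odometer (KM)": [np.nan, 120000.0],
})

# The fitted preprocessing steps are reused as-is (nothing is refit on new data)
processed = toy_model.named_steps["preprocessor"].transform(new_cars)
predictions = toy_model.predict(new_cars)

print(processed.shape)  # 4 one-hot columns (BMW, Honda, Toyota, missing) + 1 numeric column
print(predictions)
```

Because `handle_unknown="ignore"` is set, the unseen make is encoded as all zeros rather than raising an error, and the missing odometer reading is filled with the mean learned from the training rows.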

It's also possible to use `GridSearchCV` or `RandomizedSearchCV` with a `Pipeline`.

The main difference is when creating a hyperparameter grid, you have to add a prefix to each hyperparameter (see the [documentation for `RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) for a full list of possible hyperparameters to tune).

The prefix is the name of the `Pipeline` step you'd like to alter, followed by two underscores.

For example, to adjust `n_estimators` of `"model"` in the `Pipeline`, you'd use `"model__n_estimators"` (note the double underscore between the step name `model` and the hyperparameter name).
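If you're unsure what the prefixed name of a hyperparameter is, `get_params()` lists every name a `Pipeline` will accept. The sketch below builds a cut-down stand-in for our pipeline (the `demo_` names are assumptions for illustration; no data is needed because nothing is fit) and checks two of the names.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Cut-down stand-in for the pipeline above (illustrative only, never fit)
demo_numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="mean"))])
demo_preprocessor = ColumnTransformer(
    transformers=[("num", demo_numeric_transformer, ["Odometer (KM)"])])
demo_model = Pipeline(steps=[("preprocessor", demo_preprocessor),
                             ("model", RandomForestRegressor())])

# Every tunable name, with step prefixes joined by double underscores
param_names = sorted(demo_model.get_params().keys())
for name in param_names:
    if name.startswith("model__n_") or name.endswith("imputer__strategy"):
        print(name)

# The same names work with set_params()
demo_model.set_params(model__n_estimators=50)
print(demo_model.named_steps["model"].n_estimators)
```

Nested steps chain their prefixes, which is why the imputer strategy inside the `"num"` transformer inside the `"preprocessor"` step ends up as `"preprocessor__num__imputer__strategy"`.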

Let's see it!

> **Note:** Depending on your computer's processing power, the cell below may take a few minutes to run. For reference, it took about 60 seconds on my M1 Pro MacBook Pro.

In [29]:
pipe_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],  # note the double underscore after each prefix, e.g. "preprocessor__"
    "model__n_estimators": [100, 1000],
    "model__max_depth": [None, 5],
    "model__max_features": ["sqrt"],
    "model__min_samples_split": [2, 4]}

# Setup GridSearchCV (5-fold cross-validation for every combination in pipe_grid), then fit it
gs = GridSearchCV(model, pipe_grid, cv=5, verbose=2)
%time gs.fit(xtr, ytr)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_split=2, model__n_estimators=100, preprocessor__num__imputer__strategy=mean; total time=   0.2s
[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_split=2, model__n_estimators=100, preprocessor__num__imputer__strategy=mean; total time=   0.2s
[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_split=2, model__n_estimators=100, preprocessor__num__imputer__strategy=mean; total time=   0.2s
[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_split=2, model__n_estimators=100, preprocessor__num__imputer__strategy=mean; total time=   0.2s
[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_split=2, model__n_estimators=100, preprocessor__num__imputer__strategy=mean; total time=   0.2s
...

In [30]:
%time gs.score(xte, yte)

CPU times: user 20.5 ms, sys: 104 µs, total: 20.6 ms
Wall time: 18.8 ms


0.2970584538514702
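Once a search has been fit, the winning combination is available as `best_params_` and the refit pipeline as `best_estimator_`. Here's a minimal sketch using `RandomizedSearchCV` on made-up data (the data, the `toy_` names and the small search space are assumptions chosen so it runs in a few seconds), with the same prefixed names as before.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data (illustrative only): price falls as the odometer reading rises
rng = np.random.default_rng(42)
n = 60
toy = pd.DataFrame({
    "Make": rng.choice(["Honda", "BMW", "Toyota"], size=n),
    "Odometer (KM)": rng.uniform(10_000, 200_000, size=n),
})
toy["Price"] = 30_000 - 0.1 * toy["Odometer (KM)"] + rng.normal(0, 1_000, size=n)
toy.loc[toy.index % 10 == 0, "Odometer (KM)"] = np.nan  # knock out some values

toy_preprocessor = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Make"]),
    ("num", Pipeline(steps=[("imputer", SimpleImputer())]), ["Odometer (KM)"]),
])
toy_model = Pipeline(steps=[
    ("preprocessor", toy_preprocessor),
    ("model", RandomForestRegressor(random_state=42)),
])

param_distributions = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "model__n_estimators": [10, 50],
    "model__max_depth": [None, 5],
}

# Try 4 randomly chosen combinations with 3-fold cross-validation
rs = RandomizedSearchCV(toy_model, param_distributions, n_iter=4, cv=3, random_state=42)
rs.fit(toy.drop("Price", axis=1), toy["Price"])

print(rs.best_params_)            # best combination found
print(rs.best_score_)             # its mean cross-validated score
best_pipeline = rs.best_estimator_  # whole pipeline, refit with the best settings
```

The same attributes exist on the `gs` object above, so `gs.best_params_` shows which combination produced the improved score.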