<h1 style="text-align:center;">Pipelines</h1>

# Introduction

Scikit-learn pipelines streamline the machine learning workflow by organizing sequential steps, preventing data leakage, and enhancing code readability. They offer a unified interface for training, prediction, and hyperparameter tuning. By encapsulating the entire process, pipelines ensure consistent results, simplify cross-validation, and facilitate smoother model deployment, making them an essential tool for efficient and error-free machine learning development.

Often in machine learning and data science, we need to perform a sequence of transformations on the input data, such as finding or generating a set of features, before running an estimator on it. Pipelines encapsulate the transformers and the predictor to simplify this process. In other words, a pipeline sequentially applies a list of transformers followed by a final estimator.

### Pipelines in Machine Learning

Machine learning workflows often involve a series of steps - from data cleaning and preprocessing to model training and evaluation. Managing these steps can become complex, especially when you have to ensure that the right preprocessing is done on both training and test data. Pipelines provide a way to streamline this process, ensuring that data transformations and model training are handled consistently.

### Key Concepts

#### 1. Transformers
Transformers are primarily used to modify or filter the dataset. They implement two key methods:

- `fit()`: Computes the parameters needed to apply the transformation. For instance, when scaling features, `fit` computes the mean and standard deviation of each feature.
- `transform()`: Applies the transformation to the data. Using the earlier example, `transform` scales the features based on the computed mean and standard deviation.

Commonly used transformers in `scikit-learn` include:
- **Scaler classes** like `StandardScaler`, `MinMaxScaler`, etc., for feature scaling.
- **`OneHotEncoder`** for converting categorical variables into a one-hot encoded format.
- **`SimpleImputer`** for handling missing values.
- **`PolynomialFeatures`** for generating polynomial combinations of features.
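
As a minimal sketch of the `fit()` / `transform()` split described above, the snippet below scales a toy array with `StandardScaler`. The data values are made up for illustration and are not from the notebooks.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test data (illustrative values only)
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)  # learns mean_ and scale_ from the training data only

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training statistics

print(scaler.mean_, scaler.scale_)
```

Note that the test row is scaled with the mean and standard deviation learned from the training rows, not with its own statistics.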

#### 2. Estimators
Estimators are algorithms that can learn from data. They also implement the `fit()` method, but in addition to that, they have a `predict()` method to make predictions on new data. Some estimators also have a `transform()` method, making them both transformers and estimators.

Commonly used estimators in `scikit-learn` include:
- Regression models like `LinearRegression`, `Ridge`, and `Lasso`.
- Classification models like `LogisticRegression`, `SVC`, and `RandomForestClassifier`.
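
To make the `fit()` / `predict()` interface concrete, here is a small sketch with `LinearRegression` on toy data that follows `y = 2x + 1` exactly. The numbers are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2x + 1 (illustrative values only)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()
model.fit(X, y)  # learns the slope and intercept

# Predict for an unseen input
prediction = model.predict(np.array([[4.0]]))
print(model.coef_, model.intercept_, prediction)
```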

### Preprocessing Steps in `scikit-learn`

1. **Data Cleaning**: Before preprocessing, it's crucial to clean the data, which might involve:
   - Removing duplicates.
   - Handling missing values using transformers like `SimpleImputer`.
   - Removing outliers or erroneous data.

2. **Feature Engineering**:
   - Generating new features that might help improve model performance.
   - Using transformers like `PolynomialFeatures` to generate interaction terms.

3. **Feature Scaling**: Bringing all features to a similar scale using:
   - `StandardScaler`: Standardizes features by removing the mean and scaling to unit variance.
   - `MinMaxScaler`: Transforms features by scaling each feature to a given range, usually [0, 1].

4. **Encoding Categorical Variables**: Converting categorical data into a format suitable for machine learning models:
   - `OneHotEncoder`: Converts categorical variables into a one-hot encoded format.
   - `OrdinalEncoder`: Converts categorical variables into an integer format.

5. **Feature Selection**: Reducing the number of input features, which can help improve model performance and reduce overfitting. Techniques include:
   - Recursive Feature Elimination (RFE).
   - Using `SelectKBest` to choose the top 'k' features based on certain criteria.

6. **Data Splitting**: Dividing the data into training and test sets using `train_test_split`.
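
The steps above can be sketched on synthetic data as follows. This is only an illustration under assumed values: the data is random, only the first feature drives the target, and the split is done first so that every statistic is learned from the training rows alone.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (assumption): only feature 0 influences the target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
X[::10, 1] = np.nan  # inject some missing values into feature 1

# Split first so that preprocessing statistics come from the training set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

imputer = SimpleImputer(strategy="median")            # fill missing values
scaler = MinMaxScaler()                               # scale to [0, 1]
selector = SelectKBest(score_func=f_regression, k=2)  # keep the 2 best features

# fit_transform on the training data, transform only on the test data
X_train_p = selector.fit_transform(
    scaler.fit_transform(imputer.fit_transform(X_train)), y_train
)
X_test_p = selector.transform(scaler.transform(imputer.transform(X_test)))

print(X_train_p.shape, X_test_p.shape, selector.get_support())
```

Chaining the calls by hand like this works, but it is easy to get wrong, which is the motivation for the `Pipeline` class.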

### Constructing Pipelines in `scikit-learn`
Pipelines can be constructed using the `Pipeline` class from `scikit-learn`. The `Pipeline` class takes a list of tuples, each pairing a step name (any alias you choose) with the actual transformer object, arranged in the order in which the transformations should be applied to the data. We then call `fit_transform` on the training data and only `transform` on the test data.

The main advantage of using pipelines is to ensure that the same preprocessing steps are applied to both training and test data, reducing the risk of data leakage.
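
A minimal sketch of this pattern on toy data is shown here. The step names and values are illustrative assumptions; the point is that the test row is imputed and scaled with statistics learned from the training rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with missing values (illustrative only)
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
X_test = np.array([[np.nan, 20.0]])

prep = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # step alias, transformer
    ("scaler", StandardScaler()),
])

X_train_prepared = prep.fit_transform(X_train)  # fit and transform on train
X_test_prepared = prep.transform(X_test)        # transform only on test

# Medians learned from the training data: 2.0 and 20.0
print(prep.named_steps["imputer"].statistics_)
print(X_test_prepared)
```

The missing test value is filled with the training median and then centered on the training mean, so the test row ends up at zero in both columns.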

---

We are going to revisit everything we covered in the previous notebooks, this time passing everything through a pipeline and creating a custom transformer too.

## Bike Rentals

We will now apply pipelines to our first attempt at regression.

In [1]:
import pandas as pd
import numpy as np
from wrangle_bike_rentals import *
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn import model_selection
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from category_encoders import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from xgboost import XGBRegressor, XGBClassifier

In [2]:
url = 'https://raw.githubusercontent.com/theAfricanQuant/XGBoost4machinelearning/main/data/bike_rentals.csv'

In [3]:
df = get_data(url)
df.head()

| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1.0 | 0.0 | 1.0 | 0.0 | 6.0 | 0.0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 331 | 654 | 985 |
| 1 | 2 | 2011-01-02 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 | 670 | 801 |
| 2 | 3 | 2011-01-03 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 | 1229 | 1349 |
| 3 | 4 | 2011-01-04 | 1.0 | 0.0 | 1.0 | 0.0 | 2.0 | 1.0 | 1 | 0.2 | 0.212122 | 0.590435 | 0.160296 | 108 | 1454 | 1562 |
| 4 | 5 | 2011-01-05 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 1.0 | 1 | 0.226957 | 0.22927 | 0.436957 | 0.1869 | 82 | 1518 | 1600 |


Next we create a transformer that applies the `prep_data` function we wrote in notebook 001, with a small modification. We were too fancy with the imputation there. Here we will let sklearn handle that for us, so we delete that part and rename the function to `prep_bike_data`.

In [4]:
def prep_bike_data(data):
    return (data
            .assign(windspeed = data["windspeed"]
                    .fillna((data["windspeed"]
                             .median())),
                    hum = (data['hum']
                   .fillna(data.groupby('season')['hum']
                           .transform('median'))),
                    dteday = pd.to_datetime(data['dteday']),
                    mnth = lambda x: x['dteday'].dt.month,
                    yr = data['yr'].ffill()
                   )
            .drop(['dteday', 'casual','registered'], axis=1)
           )

In [5]:
class PrepDataTransformer(BaseEstimator,
    TransformerMixin):
    """
    This transformer takes a Pandas DataFrame containing our bike rental
    data as input and returns a new version of the DataFrame.

    Parameters
    ----------
    ycol : str, optional
        The name of the column to be used as the target variable.
        If not specified, the target variable will not be set.

    Attributes
    ----------
    ycol : str
        The name of the column to be used as the target variable.
    """
    def __init__(self, ycol=None):
        self.ycol = ycol
    
    def transform(self, X):
        return prep_bike_data(X)

    def fit(self, X, y=None):
        return self
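
For a stateless function like this one, scikit-learn's `FunctionTransformer` is an alternative to writing a class by hand. The sketch below is not what the notebook uses; the `drop_id_column` helper and the `record_id` column are hypothetical names chosen for illustration.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def drop_id_column(data):
    """Hypothetical stateless cleaning step: drop an identifier column."""
    return data.drop(columns=["record_id"])


# Toy frame (illustrative values only)
toy = pd.DataFrame({"record_id": [1, 2, 3], "temp": [0.2, 0.4, 0.6]})

pipe_sketch = Pipeline([
    ("tweak", FunctionTransformer(drop_id_column)),  # wraps the plain function
    ("scaler", StandardScaler()),
])

result = pipe_sketch.fit_transform(toy)
print(result.shape)
```

A hand-written class remains the better fit when the step needs to learn something in `fit`, as the `OHETransformer` later in this notebook does.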

Next we create the pipeline using the `Pipeline` class from sklearn. We will pass in tuples of the transformer alias and the actual transformer object we want to use.

In [6]:
pipe = Pipeline(
    [('tweak', PrepDataTransformer()),
     ('imputer', SimpleImputer(strategy='median')),  # Imputing null values using the median
     ('scaler', StandardScaler())
    ]
)

Time to split our data

In [7]:
def splitX_y(df, trgt):
    # Compare names for equality: `col not in trgt` would perform a
    # substring test when trgt is a string
    features = [col for col in df.columns if col != trgt]
    return (df[features], df[trgt])

In [8]:
bikes_X, bikes_y = splitX_y(df, 'cnt')

print(f"shape of target vector: {bikes_y.shape}")
print(f"shape of feature matrix: {bikes_X.shape}")

shape of target vector: (731,)
shape of feature matrix: (731, 15)


In [9]:
bikes_X_train, bikes_X_test, bikes_y_train, bikes_y_test = (model_selection
                                    .train_test_split(bikes_X, bikes_y, 
                                                      test_size=.3, 
                                                      random_state=43,)
                                                        )

In [10]:
X_train = pipe.fit_transform(bikes_X_train, bikes_y_train)
X_test = pipe.transform(bikes_X_test)
X_train

array([[-0.42851322,  1.41014134, -0.99804496, ...,  0.62710628,
         1.80584285, -0.95596646],
       [-0.42382335,  1.41014134, -0.99804496, ...,  0.63094142,
         1.54909312, -0.57245613],
       [-1.0100577 , -0.38906521, -0.99804496, ...,  1.11278731,
         0.34996609,  0.06987143],
       ...,
       [-0.39099422,  1.41014134, -0.99804496, ...,  0.34000416,
         0.14281996, -0.21207628],
       [-0.49417147,  0.51053806, -0.99804496, ...,  0.80298336,
         0.59504315, -0.65839276],
       [-0.1893296 ,  1.41014134, -0.99804496, ..., -0.89546132,
        -0.36192674,  1.40103935]])

In [11]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, bikes_y_train)
y_pred = lin_reg.predict(X_test)

mse = mean_squared_error(bikes_y_test, y_pred)

rmse = np.sqrt(mse)

print(f"RMSE: {rmse:.2f}")

RMSE: 983.05


This is worse than the first one. Let us see what `xgboost` will get us.

In [12]:
xgb_reg = XGBRegressor()
xgb_reg.fit(X_train, bikes_y_train)
y_pred = xgb_reg.predict(X_test)

mse = mean_squared_error(bikes_y_test, y_pred)

rmse = np.sqrt(mse)

print(f"RMSE: {rmse:.2f}")

RMSE: 647.12


I love this one!
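
As a reminder of what the RMSE figures above measure, here is a small worked sketch on made-up numbers. It computes the root mean squared error by hand and checks it against `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy targets and predictions (illustrative values only)
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# Errors are 1, 0 and -2, so the squared errors are 1, 0 and 4
mse_manual = np.mean((y_true - y_hat) ** 2)
rmse_manual = np.sqrt(mse_manual)

# Same quantity via scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```

Because RMSE is in the same units as the target, the values above can be read as a typical error in the daily rental count.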

Next is cross-validation. Here we pass in the entire pipeline, so the preprocessing steps are refit on the training folds of each split, which makes the scores more reliable.

In [13]:
lr_pipe = Pipeline(
    [('tweak', PrepDataTransformer()),
     ('imputer', SimpleImputer(strategy='median')),  # Imputing null values using the median
     ('scaler', StandardScaler()),
     ('linreg', LinearRegression())
    ]
)
scores = cross_val_score(lr_pipe, bikes_X, bikes_y, scoring='neg_mean_squared_error', cv=10)

rmse = np.sqrt(-scores)

print(f'Reg rmse: {np.round(rmse, 2)}')

print(f'RMSE mean: {rmse.mean()}')

Reg rmse: [ 504.73  842.99 1142.28  728.66  639.54  970.19 1133.83 1252.64 1085.75
 1432.39]
RMSE mean: 973.3001612624137


In [14]:
xg_pipe = Pipeline(
    [('tweak', PrepDataTransformer()),
     ('imputer', SimpleImputer(strategy='median')),  # Imputing null values using the median
     ('scaler', StandardScaler()),
     ('xgb', XGBRegressor())
    ]
)
scores = cross_val_score(xg_pipe, bikes_X, bikes_y, scoring='neg_mean_squared_error', cv=10)

rmse = np.sqrt(-scores)

print(f'Reg rmse: {np.round(rmse, 2)}')

print(f'RMSE mean: {rmse.mean()}')

Reg rmse: [ 675.05  696.86  555.72  690.63  862.77 1039.04 1008.01  846.13  875.39
 1649.44]
RMSE mean: 889.9043382575034
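
The cross-validation pattern used above can be reproduced on synthetic data, which makes it easy to experiment without the bike rentals file. This is a sketch under assumed settings: the data comes from `make_regression` and the fold count is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (assumption, not the bike rentals data)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("linreg", LinearRegression()),
])

# The whole pipeline is cloned and refit on the training part of every fold
neg_mse = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)

# scikit-learn reports negated MSE so that higher is better; flip the sign
rmse_per_fold = np.sqrt(-neg_mse)
print(np.round(rmse_per_fold, 2), rmse_per_fold.mean())
```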


## Census

We now switch our attention to the census data we worked with in the last notebook while working on classification.

In [15]:
url_census = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'

def prep_census(url):
    col_names = ['age', 'workclass', 'fnlwgt', 
                 'education', 'education-num', 
                 'marital-status', 'occupation',
                  'relationship', 'race', 'sex', 
                 'capital-gain', 'capital-loss', 
                 'hours-per-week', 'native-country', 
                   'income']
    return (pd
            .read_csv(url, header=None)
            .pipe(lambda x: x.rename(columns={i: name for i, name in enumerate(col_names)}))
            .drop(['education'], axis=1)
    )

df_census = prep_census(url_census)
df_census.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education-num   32561 non-null  int64 
 4   marital-status  32561 non-null  object
 5   occupation      32561 non-null  object
 6   relationship    32561 non-null  object
 7   race            32561 non-null  object
 8   sex             32561 non-null  object
 9   capital-gain    32561 non-null  int64 
 10  capital-loss    32561 non-null  int64 
 11  hours-per-week  32561 non-null  int64 
 12  native-country  32561 non-null  object
 13  income          32561 non-null  object
dtypes: int64(6), object(8)
memory usage: 3.5+ MB


In [16]:
df_census.select_dtypes(include=['object']).columns.tolist()

['workclass',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country',
 'income']

In [17]:
class OHETransformer(BaseEstimator, TransformerMixin):
    """
    A transformer that applies one-hot encoding to columns of type 'object' 
    in a Pandas DataFrame. The one-hot encoding process converts categorical 
    columns into a format that can be provided to machine learning algorithms 
    to improve predictions.

    The transformer identifies columns with data type 'object' and uses the
    OneHotEncoder from the category_encoders library to perform the encoding.

    Attributes
    ----------
    cols : list
        List of column names in the DataFrame identified for one-hot encoding.
    
    encode : OneHotEncoder object
        The encoder instance from category_encoders that performs the actual 
        one-hot encoding.
    """

    def __init__(self):
        self.cols = None
        self.encode = None

    def fit(self, X, y=None):
        self.cols = X.select_dtypes(include=['object']).columns.tolist()
        self.encode = OneHotEncoder(cols=self.cols, use_cat_names=False)
        self.encode.fit(X[self.cols])
        return self

    def transform(self, X):
        X_encoded = self.encode.transform(X[self.cols])
        X = X.drop(columns=self.cols)
        return pd.concat([X, X_encoded], axis=1)
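
For comparison, a similar effect can be sketched with scikit-learn's own `OneHotEncoder` inside a `ColumnTransformer`. This is an alternative, not the approach this notebook uses, and the toy frame and the unseen `"Other"` category are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame with one numeric and one categorical column (illustrative only)
toy = pd.DataFrame({
    "age": [24, 37, 46, 50],
    "sex": ["Male", "Female", "Female", "Male"],
})

encoder = ColumnTransformer(
    [("ohe", OneHotEncoder(handle_unknown="ignore"), ["sex"])],
    remainder="passthrough",  # keep the numeric column unchanged
    sparse_threshold=0,       # always return a dense array
)

encoded = encoder.fit_transform(toy)

# A category never seen during fit is encoded as all zeros
new_rows = pd.DataFrame({"age": [30], "sex": ["Other"]})
encoded_new = encoder.transform(new_rows)

print(encoded)
print(encoded_new)
```

The one-hot columns come first in the output, followed by the passthrough column.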


In [18]:
census_pipe = Pipeline(
    [('ohe', OHETransformer())]
)

In [19]:
cen_X, cen_y = splitX_y(df_census, 'income')

print(f"shape of target vector: {cen_y.shape}")
print(f"shape of feature matrix: {cen_X.shape}")

shape of target vector: (32561,)
shape of feature matrix: (32561, 13)


In [20]:
cen_X_train, cen_X_test, cen_y_train, cen_y_test = (model_selection
                                    .train_test_split(cen_X, cen_y, 
                                                      test_size=.3, 
                                                      random_state=43,)
                                                        )

X_train = census_pipe.fit_transform(cen_X_train)
X_test = census_pipe.transform(cen_X_test)
X_train

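| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week | workclass_1 | workclass_2 | workclass_3 | workclass_4 | ... | native-country_33 | native-country_34 | native-country_35 | native-country_36 | native-country_37 | native-country_38 | native-country_39 | native-country_40 | native-country_41 | native-country_42 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20717 | 24 | 32950 | 13 | 0 | 0 | 35 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11366 | 37 | 34996 | 9 | 0 | 0 | 40 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 28940 | 46 | 189498 | 13 | 0 | 1848 | 45 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 28302 | 50 | 301583 | 9 | 0 | 0 | 40 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10929 | 46 | 224559 | 9 | 0 | 0 | 40 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26901 | 42 | 86185 | 10 | 0 | 0 | 40 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7985 | 33 | 142675 | 13 | 0 | 0 | 30 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 18687 | 36 | 35429 | 10 | 0 | 0 | 60 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 19776 | 20 | 493443 | 7 | 0 | 0 | 40 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |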


In [21]:
logreg = LogisticRegression(solver='saga', max_iter=10000)
logreg.fit(X_train, cen_y_train)

In [22]:
cen_y_train

20717     <=50K
11366     <=50K
28940      >50K
28302     <=50K
10929     <=50K
          ...  
26901     <=50K
7985      <=50K
18687     <=50K
19776     <=50K
14148      >50K
Name: income, Length: 22792, dtype: object

In [23]:
logreg.score(X_test, cen_y_test)

0.7958849421639881

This looks good. Next we will check out xgboost. One issue, though, is that xgboost won't accept our string labels as they are. We will need to encode them using scikit-learn's `LabelEncoder` class.

In [24]:
le = LabelEncoder()

In [25]:
y_train = le.fit_transform(cen_y_train)
y_test = le.transform(cen_y_test)

In [26]:
le.classes_

array([' <=50K', ' >50K'], dtype=object)
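
As a small standalone sketch of what `LabelEncoder` does, the snippet below encodes a toy label list and decodes it again. The three labels are illustrative; note the leading space, which the raw census labels also carry.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels mirroring the census format, leading space included
labels = [" <=50K", " >50K", " <=50K"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)       # classes are sorted, then numbered
decoded = encoder.inverse_transform(codes)  # map integers back to strings

print(encoder.classes_, codes, decoded)
```

`inverse_transform` is handy for turning the integer predictions of a classifier back into the original income labels.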

In [27]:
xgbc= XGBClassifier()
xgbc.fit(X_train, y_train)
xgbc.score(X_test, y_test)

0.8708158460436073

Our score is a clear improvement over that of logistic regression. Let us put everything through cross-validation.

In [28]:
logreg_pipe = Pipeline(
    [('ohe', OHETransformer()),
     ('logreg', LogisticRegression(solver='saga', max_iter=10000))
    ]
)
scores = cross_val_score(logreg_pipe, cen_X, cen_y, scoring='accuracy', cv=10)


print(f'Accuracy per fold: {np.round(scores, 2)}')

print(f'Mean accuracy: {scores.mean()}')

Accuracy per fold: [0.8  0.8  0.79 0.79 0.79 0.8  0.79 0.8  0.8  0.8 ]
Mean accuracy: 0.7952765410203237


In [29]:
xgbc_pipe = Pipeline(
    [('ohe', OHETransformer()),
     ('xgbc', XGBClassifier(n_estimators=5))
    ]
)

le_y = le.transform(cen_y) # encoding the full label vector like we did above

scores = cross_val_score(xgbc_pipe, cen_X, le_y, scoring='accuracy', cv=10)


print(f'Accuracy per fold: {np.round(scores, 2)}')

print(f'Mean accuracy: {scores.mean()}')

Accuracy per fold: [0.85 0.86 0.87 0.85 0.86 0.86 0.86 0.87 0.86 0.86]
Mean accuracy: 0.8591876766654168


Before we wrap up this part of the walk-through, let us create a `helper_file.py` file where we keep all the relevant helper functions we have worked on thus far. We did this for the bike rentals before; this one will cover all the notebooks. We should have done it that way right from the very beginning, but it is what it is!

In [3]:
%%writefile helper_file.py

import pandas as pd
import numpy as np
from wrangle_bike_rentals import *
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn import model_selection
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from category_encoders import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from xgboost import XGBRegressor, XGBClassifier

import datetime as dt


def show_nulls(df):
    return (df[df
            .isna()
            .any(axis=1)]
           )
    
def total_nulls(df):
    return (df
         .isna()
         .sum()
         .sum()
        )

def get_data(url):
    return (
        pd.read_csv(url)
    )


url_census = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'

def prep_census(url):
    col_names = ['age', 'workclass', 'fnlwgt', 
                 'education', 'education-num', 
                 'marital-status', 'occupation',
                  'relationship', 'race', 'sex', 
                 'capital-gain', 'capital-loss', 
                 'hours-per-week', 'native-country', 
                   'income']
    return (pd
            .read_csv(url, header=None)
            .pipe(lambda x: x.rename(columns={i: name for i, name in enumerate(col_names)}))
            .drop(['education'], axis=1)
    )

df_census = prep_census(url_census)

def splitX_y(df, trgt):
    # Compare names for equality: `col not in trgt` would perform a
    # substring test when trgt is a string
    features = [col for col in df.columns if col != trgt]
    return (df[features], df[trgt])

def prep_bike_data(data):
    return (data
            .assign(windspeed = data["windspeed"]
                    .fillna((data["windspeed"]
                             .median())),
                    hum = (data['hum']
                   .fillna(data.groupby('season')['hum']
                           .transform('median'))),
                    dteday = pd.to_datetime(data['dteday']),
                    mnth = lambda x: x['dteday'].dt.month,
                    yr = data['yr'].ffill()
                   )
            .drop(['dteday', 'casual','registered'], axis=1)
           )

url_bikes = 'https://raw.githubusercontent.com/theAfricanQuant/XGBoost4machinelearning/main/data/bike_rentals.csv'

class PrepDataTransformer(BaseEstimator,
    TransformerMixin):
    """
    This transformer takes a Pandas DataFrame containing our bike rental
    data as input and returns a new version of the DataFrame.

    Parameters
    ----------
    ycol : str, optional
        The name of the column to be used as the target variable.
        If not specified, the target variable will not be set.

    Attributes
    ----------
    ycol : str
        The name of the column to be used as the target variable.
    """
    def __init__(self, ycol=None):
        self.ycol = ycol
    
    def transform(self, X):
        return prep_bike_data(X)

    def fit(self, X, y=None):
        return self

df_bikes = get_data(url_bikes)

class OHETransformer(BaseEstimator, TransformerMixin):
    """
    A transformer that applies one-hot encoding to columns of type 'object' 
    in a Pandas DataFrame. The one-hot encoding process converts categorical 
    columns into a format that can be provided to machine learning algorithms 
    to improve predictions.

    The transformer identifies columns with data type 'object' and uses the
    OneHotEncoder from the category_encoders library to perform the encoding.

    Attributes
    ----------
    cols : list
        List of column names in the DataFrame identified for one-hot encoding.
    
    encode : OneHotEncoder object
        The encoder instance from category_encoders that performs the actual 
        one-hot encoding.
    """

    def __init__(self):
        self.cols = None
        self.encode = None

    def fit(self, X, y=None):
        self.cols = X.select_dtypes(include=['object']).columns.tolist()
        self.encode = OneHotEncoder(cols=self.cols, use_cat_names=False)
        self.encode.fit(X[self.cols])
        return self

    def transform(self, X):
        X_encoded = self.encode.transform(X[self.cols])
        X = X.drop(columns=self.cols)
        return pd.concat([X, X_encoded], axis=1)


Writing helper_file.py
