## Machine Learning Workflow

So what exactly is an ML workflow?
<br>
The machine learning workflow consists of several key steps that are typically followed when developing a machine learning model: defining the problem, collecting data, data preprocessing, data splitting, model selection, model training, hyperparameter tuning, model evaluation, model interpretation, final model selection, model deployment and, finally, monitoring and maintenance.

So what's the plan for today?
<br>
- So, today, I'll start by sharing some model selection tips with you guys. These are essentially guidelines on how to choose the right models for specific tasks. 
- Following that, we'll delve into pipelines, which we'll break down into two main sections. The first section covers the pre-processing pipeline, encompassing all pre-processing activities you already know such as scaling, imputing, and encoding, but approached in a more streamlined and efficient manner. 
- Then, we'll discuss full pipelines, which integrate these pre-processing steps with a model at the end. 
- This setup allows us to seamlessly feed our pre-processed data into a model for fitting, scoring, cross-validation, and grid searching. 
- To wrap up, I have an exciting feature to share with you.

## 1. Model Selection

#### Let's take a step back: which models have we seen so far?

The expression \( \hat{y} = f_\beta(X) \), read "y hat equals f beta of X", represents a relationship between the predicted target \( \hat{y} \) and a set of predictor variables \( X \), parameterized by the vector \( \beta \). Let's break down what each component means:

- \( \hat{y} \) is the model's prediction of the target variable \( y \), also known as the dependent variable or the response variable. This is the variable that we aim to predict or explain using the predictor variables \( X \).

- \( X \) is a matrix representing the predictor variables. Each row of \( X \) corresponds to an observation, and each column corresponds to a different predictor variable. \( X \) is often referred to as the feature matrix, design matrix, or input matrix.

- \( \beta \) is a vector of parameters or coefficients that represent the relationship between the predictor variables \( X \) and the target variable \( y \). Each element of \( \beta \) corresponds to the weight or effect of the corresponding predictor variable on the target variable.

- \( f_\beta(X) \) represents a function that maps the predictor variables \( X \) to the prediction \( \hat{y} \), parameterized by \( \beta \). This function captures the relationship between the predictors and the target, as determined by the values of \( \beta \).

In essence, the expression \( \hat{y} = f_\beta(X) \) describes how the target variable can be predicted or explained using a function \( f_\beta \) that takes the predictor variables \( X \) as input and is parameterized by the coefficients \( \beta \). The goal in many machine learning and statistical modeling tasks is to estimate the values of \( \beta \) that best capture the relationship between the predictors and the target, allowing for accurate prediction or explanation of \( y \) for new observations.
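
To make the idea of "estimating \( \beta \)" concrete, here is a minimal sketch for the simplest case, a linear model where \( f_\beta(X) = X\beta \). The synthetic data, the `true_beta` values and the use of NumPy's least-squares solver are assumptions chosen for illustration; they are not part of the lecture dataset.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 100 observations, 2 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_beta = np.array([3.0, -2.0])
y = X @ true_beta + rng.normal(scale=0.1, size=100)  # target with a little noise

# Estimate beta by ordinary least squares: minimise ||y - X beta||^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# For a linear model, f_beta(X) is simply X @ beta
y_hat = X @ beta_hat

print("estimated beta:", beta_hat)
```

The estimated coefficients should land close to `true_beta`, and `y_hat` is the vector of predictions \( \hat{y} \) for the same observations.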

# BETA
![2024-02-13_17-04-15.png](attachment:a512500c-1b6b-4dbd-b925-24ac2df7b154.png)

Regression models are parametric: they are characterized by a predetermined number of parameters that are adjusted during training.
<br>
The advantage of parametric models lies in their rapid training capabilities, even with large datasets. 
This efficiency comes from their use of a consistent set of parameters, allowing them to perform fast calculations.
<br>
However, there is a drawback: we need to understand the data's structure beforehand. To adjust these parameters accurately, we must have a good grasp of what the data looks like. This requirement exists, guys, because parametric models rely on assumptions regarding the data's distribution, which, if incorrect, can lead to models that fail to accurately represent the relationship between variables.

#### KNN, kernel-SVM are non-parametric

Now let's talk about the non-parametric models, which encompass algorithms such as KNN (k-nearest neighbors) and kernel SVM. 
Who can tell me what KNN and kernel SVM are? (Supervised, unsupervised, or reinforcement learning algorithms?)
- KNN: K-Nearest Neighbors (KNN) is a simple yet effective supervised machine learning algorithm used for classification and regression tasks. It's a non-parametric method, meaning that it does not rely on assuming a specific form or distribution for the underlying data. Instead, it allows the data to dictate its own structure without imposing predetermined parameters. This flexibility can be advantageous in situations where the true data distribution is unknown or complex. KNN relies on instance-based learning, where the algorithm memorizes the training data and makes predictions based on the similarity between new data points and the training examples.
- Kernel SVM: Kernel Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used for classification and regression tasks. SVMs work by finding the optimal hyperplane that separates different classes of data points in a feature space, aiming to maximize the margin between classes. They are effective for both linearly separable and non-linearly separable data, thanks to the kernel trick, which lets SVMs handle complex decision boundaries by implicitly transforming the input features into a higher-dimensional space. SVMs are particularly effective in high-dimensional spaces and in cases where the number of dimensions exceeds the number of samples, that is, when there are more features or attributes being measured than there are observations to analyze.

<br>
KERNEL TRICK
<br>
When dealing with SVMs, the kernel trick allows the algorithm to effectively handle data that is not linearly separable in its original feature space. This means that the decision boundary between different classes cannot be represented by a straight line (or hyperplane) in the original feature space.

The kernel trick works by implicitly mapping the input features into a higher-dimensional space, where the data may become linearly separable. This is achieved through a kernel function, which takes two data points in their original coordinates and returns their similarity, namely the dot product they would have in the higher-dimensional space. The SVM can then find a hyperplane in that space that separates the classes more effectively. Importantly, the kernel trick never explicitly computes the coordinates of the data points in the higher-dimensional space: every dot product it needs is obtained by evaluating the kernel on pairs of points in the original feature space, and these values are then used to define the decision boundary in the transformed space.<br>
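
A small numerical check can make the kernel trick less abstract. The sketch below uses the degree-2 polynomial kernel \( k(x, z) = (x \cdot z)^2 \) on 2-D inputs as an example; the helper names `phi` and `poly_kernel` and the two sample points are hypothetical, picked only for this illustration.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(a, b):
    """Kernel value computed from the ORIGINAL 2-D coordinates only."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))  # dot product after mapping to 3-D
implicit = poly_kernel(x, z)       # same value, without ever mapping

print(explicit, implicit)
```

Both numbers agree, which is the whole point: the SVM gets the higher-dimensional dot product while only ever touching the original features.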

These approaches operate without predefined assumptions about the data's structure. However, this comes at a computational cost, because the amount of information the model has to keep grows with the data. For instance, KNN uses a distance-based mechanism, measuring the proximity between data points. This requires storing the entire training set and computing distances to it for every prediction, which costs memory and time as the dataset grows. In the same way, kernel Support Vector Machines (SVMs) need to evaluate the kernel for many pairs of data points, which makes them resource-intensive on large datasets.<br>
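
To see why KNN has to keep the whole training set around, here is a minimal from-scratch sketch of a KNN classifier. The function name `knn_predict` and the toy points are assumptions for illustration; in practice you would use scikit-learn's `KNeighborsClassifier`.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Distance from x_new to EVERY stored training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Two small, well-separated groups of points (toy data)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([0.05, 0.05]))
pred_b = knn_predict(X_train, y_train, np.array([5.0, 5.1]))
print(pred_a, pred_b)
```

Notice that there is no training step at all: every prediction scans all stored points, which is where the memory and time cost comes from.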

But of course, there is the upside:
- Non-parametric models are flexible and can capture complex, non-linear relationships between the variables, making them suitable for a wide range of applications.
- They are often more robust to violations of distributional assumptions than parametric models, and can still provide good estimates of the relationship between the variables even if the data is not normally distributed.
- They are data-driven and do not require the specification of a fixed set of parameters, making them suitable for situations where the underlying relationship between the variables is not well understood.
- They can be used for both regression and classification problems and, given a suitable distance measure or encoding, with categorical, ordinal, and continuous data.
- Their flexibility is controlled by a few hyperparameters (such as k in KNN), which can be tuned to limit overfitting. Bear in mind, though, that flexible models generally need more data than parametric ones to generalize well.

### KNN
![2024-02-14_04-04-57.png](attachment:98e14475-1a7c-4bcf-8581-c0a016802983.png)

### SVM
![2024-02-14_04-09-29.png](attachment:ab686c63-3234-4ecc-b64e-459c13dd4e14.png)

# Image

You've seen this image a few times before, I guess, right? It's from scikit-learn, showing a few areas of machine learning. We're still moving around the classification and regression fields. We're not looking at clustering yet; that's in the future for you guys. We've already done a bit of dimensionality reduction, of course, but today we're staying in the area of classification and regression, just looking at it from a different angle.

# A Pipeline is a chain of operations in a Machine Learning project (preprocessing, training, predicting, etc.) - READ

## 2. Pipelines

Dimensionality Reduction -> Dimensionality reduction in machine learning refers to the process of reducing the number of features or variables in a dataset while preserving as much of the original information as possible. This is typically done to simplify the dataset, make it easier to work with, and improve the performance of machine learning models.

So, what exactly is a pipeline? It's essentially a sequence of operations, similar in nature to a function. Picture it as a series of defined steps where a function receives certain parameters, processes these parameters, and ultimately delivers a result. This is the core concept of a pipeline: it processes data through predefined steps and yields an outcome.

The primary benefits include enhancing the readability and comprehension of your workflow. This is particularly valuable considering how quickly we can lose track of the purpose and function of code we wrote just a few days prior. Pipelines offer a solution because they have a very straightforward and cohesive structure.

Also, pipelines facilitate reproducibility in your work. This is akin to the functionality of a function, allowing you to consistently apply the same steps to any data input into the pipeline, which is a significant advantage.

And so if you look at this graph here below, you can kind of get an idea of what a pipeline is.

Components of a Pipeline: Here we have the key components of a pipeline: transformers and models for the learning algorithm. Transformers are used to preprocess or transform the data, while models represent the learning algorithms used for prediction tasks.

Methods of Pipelines: Pipelines have specific methods, such as `.fit` and `.predict`. The `.fit` method is used to train the pipeline, which involves running every fit and transform step of all the transformers, and then fitting the model at the end. After training, the pipeline is ready for prediction.

Prediction with Pipelines: When using the predict method of a pipeline, new data is provided as input. The pipeline first transforms this new data (using the transformations learned during training) and then uses the trained model to make predictions on the transformed data. Since the pipeline is already fitted on the training data, it doesn't need to fit again; it only transforms the new data and runs predict on the model.
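
The fit-then-predict behaviour described above can be sketched as follows. The synthetic data and the choice of `LinearRegression` as the final model are assumptions made for this illustration; any scikit-learn estimator could sit at the end of the pipeline.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data (illustration only)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 2))
y_train = 2.0 * X_train[:, 0] - X_train[:, 1]
X_train[0, 0] = np.nan  # add a missing value for the imputer to fill

# New, unseen data, also with a missing value
X_new = np.array([[0.5, -1.0], [np.nan, 2.0]])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # transformer
    ("scaler", StandardScaler()),                   # transformer
    ("model", LinearRegression()),                  # final estimator
])

# .fit: fit + transform each transformer in order, then fit the model
pipe.fit(X_train, y_train)

# .predict: only transform X_new with what was learned, then predict
predictions = pipe.predict(X_new)
print(predictions)
```

The median and the scaling statistics used on `X_new` are the ones learned from `X_train`; nothing is re-fitted at prediction time.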

# Preprocessing Pipelines

Moving on to pre-processing pipelines, we'll explore this particular dataset. The goal is to predict health insurance charges, as seen on the right, using various features. These include age, BMI, number of children, smoking status, and region. The idea is that these factors will help us to accurately forecast health insurance costs.

You can download the dataset just by clicking on the link here, which is also mentioned in the challenge. Now let's define our data. Alright, I'm talking too much, someone tell me what this code is doing...

In [1]:
import pandas as pd

data = pd.read_csv('data_workflow.csv')
data.head()

Unnamed: 0,age,bmi,children,smoker,region,charges
0,19.0,27.9,0,True,southwest,16884.924
1,18.0,33.77,1,False,southeast,1725.5523
2,,33.0,3,False,southeast,4449.462
3,33.0,22.705,0,False,northwest,21984.47061
4,32.0,28.88,0,False,northwest,3866.8552


In [2]:
# Defining the features and the target
X = data.drop(columns='charges')
y = data['charges']

# Train-Test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1070, 5), (268, 5), (1070,), (268,))

# ✏️ Today's challenges:

Open notebook and talk about the imports

Run cells up to the train-test split

So let's say that we want to preprocess the age column. All we do is build a Pipeline object.<br>
Okay, so what you see here is just my input pipeline.

explain pipeline

So if I press Shift+Tab, I can see the mini documentation.

I just put a comma and then a second tuple, where I put in my scaler and instantiate the StandardScaler,

and that's it. That's my pipeline.

In [3]:
# Preprocess "age"
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Build the pipeline with the different steps
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('standard_scaler', StandardScaler())
])

pipeline.fit(X_train[['age']]) # learning the data
pipeline.transform(X_train[['age']]) #applying what we learned to the data

array([[-0.45371332],
       [-1.23239598],
       [-1.09081732],
       ...,
       [ 0.39575868],
       [ 0.32496935],
       [-1.51555331]])

## Explanation of the code:
This code snippet demonstrates how to preprocess the 'age' feature using scikit-learn's Pipeline. Preprocessing is a crucial step in machine learning pipelines as it helps in preparing the data for modeling by handling missing values and scaling features.

Let's break down the code:

1. `from sklearn.pipeline import Pipeline`: Importing the Pipeline class from the scikit-learn library, which allows chaining together multiple data processing steps.

2. `from sklearn.impute import SimpleImputer`: Importing SimpleImputer, which is used to handle missing values by filling them with a specified strategy, in this case, the median.

3. `from sklearn.preprocessing import StandardScaler`: Importing StandardScaler, which is used for scaling features by removing the mean and scaling to unit variance. 

4. Creating the pipeline: The pipeline is created using a list of tuples, where each tuple consists of a name and a transformer or estimator. Here, the first step in the pipeline is SimpleImputer to impute missing values with the median, and the second step is StandardScaler. StandardScaler is a preprocessing technique used in machine learning to standardize the features of a dataset. It transforms the data such that each feature has a mean of 0 and a standard deviation of 1. This process is also known as z-score normalization. When to use it:
   - When features have different scales.
   - When using distance-based algorithms.
   - When using regularization (regularized models like Ridge Regression and Lasso Regression penalize the magnitudes of the coefficients).
   - When comparing features (standardizing the features makes the coefficients directly comparable).
   - When using PCA (standardizing the features before applying PCA ensures that features with larger scales do not dominate the variance calculations).

5. `pipeline.fit(X_train[['age']])`: Fitting the pipeline to the training data for the 'age' feature. This step learns the parameters of the imputer (median) and the scaler (mean and standard deviation) from the training data.

6. `pipeline.transform(X_train[['age']])`: Applying the learned transformations to the 'age' feature in the training data. This step replaces missing values with the median and scales the values using the mean and standard deviation learned during the fitting step.

So Basically, this pipeline ensures that missing values in the 'age' feature are handled appropriately, and the feature is scaled for better performance.
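
To check what the two steps actually compute, here is a small sketch that does the median imputation and the z-score scaling by hand and compares the result with the pipeline. The five ages are the ones shown in the `data.head()` output above; the variable names `manual` and `from_pipeline` are just for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The first five ages from the dataset preview, including the missing one
age = np.array([[19.0], [18.0], [np.nan], [33.0], [32.0]])

# By hand: fill the gap with the median, then apply z = (x - mean) / std
median = np.nanmedian(age)
filled = np.where(np.isnan(age), median, age)
manual = (filled - filled.mean()) / filled.std()  # population std (ddof=0)

# Same two steps with the pipeline
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("standard_scaler", StandardScaler()),
])
from_pipeline = pipeline.fit_transform(age)

print(median)
print(np.allclose(manual, from_pipeline))
```

The hand-made version and the pipeline give the same numbers, and the scaled column ends up with mean 0 and standard deviation 1.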

And so we can look at the pipeline. The nice thing about this display is that I can open all of these little arrows to see how it looks on the inside. So this, for example, is my pipeline object as the computer sees it, right? It's just my pipeline object where I have my steps: my imputer (SimpleImputer) and my scaler (StandardScaler). So essentially it's the same thing I coded up here. And I can open the individual steps as well.

# why AGE?

Imputation of Missing Values: Age may be an important predictor variable, and missing values in this column could be imputed using statistical methods such as mean, median, or mode imputation. Since age is often correlated with other variables such as BMI, smoking status, and region, imputing missing age values may help retain valuable information in the dataset.

Encoding for Categorical Variables: In some cases, age might be discretized into categorical groups to capture non-linear relationships or to simplify model interpretation. For example, age groups such as "adolescent," "adult," and "elderly" could be created and encoded as categorical variables. These categories might have different risk profiles or healthcare needs, making them relevant for modeling.

Feature Engineering: Age can also be used for feature engineering to create new variables that could potentially improve model performance. For instance, interaction terms between age and other variables (e.g., age multiplied by BMI) could capture additional information about the relationship between age and healthcare costs.

Domain Knowledge: Depending on the domain, age may have a significant impact on the outcome variable (in this case, insurance charges). Insurance premiums often vary based on age due to factors such as risk of health issues or life expectancy. Therefore, including age as a feature could enhance the predictive power of the model.

In [4]:
pipeline

# Column transformer

So what exactly is ColumnTransformer?

The ColumnTransformer is a class that takes a list of transformers, where each transformer is applied to a specific subset of the columns in the dataset. The transformers can be any of the scikit-learn preprocessing transformers, like StandardScaler, OneHotEncoder, SimpleImputer...

#  Let's perform the following operations in parallel:

So using the ColumnTransformer is about applying specific changes to specific columns of your dataset, but in parallel.
Essentially it's a matter of not wasting time, I guess, is the crude way of saying it.<br>

So we want to impute and scale the numerical values, which is what we just did above. But then we also want to encode the categorical variables. The thing is, though, I don't necessarily want to run a separate pipeline for each of these steps. That's the point here. So what I want to do is copy and paste from above, so it's the same kind of pipeline with the imputer and scaler. I save this now as num_transformer.

And here I have my categorical transformer, for which I'm just using my OneHotEncoder.

In [5]:
from sklearn.compose import ColumnTransformer

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

from sklearn.preprocessing import OneHotEncoder


# Impute then scale numerical values: 
num_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy="mean")),
    ('standard_scaler', StandardScaler())
])

# Encode categorical values
# handle_unknown='ignore': if the encoder meets a category it did not see during fit,
# it encodes it as all zeros instead of raising an error.
cat_transformer = OneHotEncoder(handle_unknown='ignore')

# Parallelize "num_transformer" and "cat_transformer"
preprocessor = ColumnTransformer(
    [
        ('num_transformer', num_transformer, ['age', 'bmi']), # here i tell it which columns I want it to apply (only)
        ('cat_transformer', cat_transformer, ['smoker', 'region'])
])

In this example, we define two transformers: num_transformer to preprocess the numerical features (age and bmi) and cat_transformer to preprocess the categorical features (smoker and region). We then use the ColumnTransformer to apply the appropriate transformer to each subset of columns. Calling the fit_transform method of the ColumnTransformer will then apply these preprocessing steps to the dataset. For now, this is what our preprocessor looks like.

In [6]:
preprocessor

Notice how this is sort of a union of transformers, because it allows you to apply different data preprocessing steps to different columns of a dataset.

As you can imagine, one branch here is a pipeline and the other is just an encoder, but you can have multiple pipelines, right? So you can create three, four, five, six different pipelines, whatever you need, put them all inside, and then do everything in parallel at once. Quite nice.

### WHY USE THE COLUMN TRANSFORMER?
ColumnTransformer can be very useful when dealing with datasets that have columns with different data types or require different preprocessing steps. 
One good reason is that you can apply different preprocessing steps to different subsets of columns in a single pass over the data.
It can also help you avoid data leakage, which is a common problem in machine learning when you inadvertently use information from the test data to preprocess the training data. Because the ColumnTransformer learns its statistics (means, medians, categories) when you fit it on the training data and only re-applies them when you transform the test data, you can ensure that the training and test data are preprocessed in the same way, without leaking any information from the test data.
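
Here is a minimal sketch of that "fit on train, transform on test" pattern. The two tiny DataFrames are made up for illustration, and `sparse_threshold=0` is set only so that the output is always a dense array that is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy train and test sets (illustration only)
train = pd.DataFrame({"bmi": [20.0, 30.0, 40.0],
                      "region": ["southwest", "southeast", "southeast"]})
test = pd.DataFrame({"bmi": [30.0, 50.0],
                     "region": ["southeast", "northwest"]})  # northwest is unseen

preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), ["bmi"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_t = preprocessor.fit_transform(train)  # statistics are learned here only
test_t = preprocessor.transform(test)        # re-used, never re-fitted

print(test_t)
```

The test `bmi` values are scaled with the training mean and standard deviation, and the unseen region `northwest` becomes a row of zeros in the one-hot columns instead of raising an error.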

### How do we use it?
I take my preprocessor and do a fit_transform(X_train)

In [7]:
X_train_transformed = preprocessor.fit_transform(X_train)

#Using display instead of print to see a proper DF 
# in the form of a table with borders around rows and columns
print("Original training set")
display(X_train.head(3))

print("Preprocessed training set - Notice how we are missing feature names!")
display(pd.DataFrame(data=X_train_transformed).head(3))

Original training set


Unnamed: 0,age,bmi,children,smoker,region
1138,33.0,30.25,0,False,southeast
792,22.0,23.18,0,False,northeast
693,24.0,23.655,0,False,northwest


Preprocessed training set - Notice how we are missing feature names!


Unnamed: 0,0,1,2,3,4,5,6,7
0,-0.453795,-0.078058,1.0,0.0,0.0,0.0,1.0,0.0
1,-1.232479,-1.260928,1.0,0.0,1.0,0.0,0.0,0.0
2,-1.0909,-1.181456,1.0,0.0,0.0,1.0,0.0,0.0


But I'm missing feature names! That is easy to solve by using the method get_feature_names_out().<br>
1. First, we get our feature names

# Get your features' names
`preprocessor.get_feature_names_out()`

In [8]:
preprocessor.get_feature_names_out()

array(['num_transformer__age', 'num_transformer__bmi',
       'cat_transformer__smoker_False', 'cat_transformer__smoker_True',
       'cat_transformer__region_northeast',
       'cat_transformer__region_northwest',
       'cat_transformer__region_southeast',
       'cat_transformer__region_southwest'], dtype=object)

#### 2. then we rebuild the dataframe with the new columns

In [9]:
pd.DataFrame(
    preprocessor.fit_transform(X_train), 
    columns=preprocessor.get_feature_names_out()
).head()

Unnamed: 0,num_transformer__age,num_transformer__bmi,cat_transformer__smoker_False,cat_transformer__smoker_True,cat_transformer__region_northeast,cat_transformer__region_northwest,cat_transformer__region_southeast,cat_transformer__region_southwest
0,-0.453795,-0.078058,1.0,0.0,0.0,0.0,1.0,0.0
1,-1.232479,-1.260928,1.0,0.0,1.0,0.0,0.0,0.0
2,-1.0909,-1.181456,1.0,0.0,0.0,1.0,0.0,0.0
3,-1.161689,-0.561579,1.0,0.0,0.0,1.0,0.0,0.0
4,-0.029059,0.566079,1.0,0.0,0.0,0.0,0.0,1.0


### What happened to the children column? What if we want to keep it untouched?

# remainder=passthrough

By default, a ColumnTransformer uses remainder='drop', so any column that is not listed in one of its transformers (here, children) is dropped from the output. Setting remainder='passthrough' means that these unselected columns are passed through unchanged and appended to the output.

In [10]:
preprocessor = ColumnTransformer([
    ('num_transformer', num_transformer, ['age','bmi']),
    ('cat_transformer', cat_transformer, ['region','smoker'])],
    remainder='passthrough'
)

preprocessor

In [11]:
pd.DataFrame(preprocessor.fit_transform(X_train),
            columns=preprocessor.get_feature_names_out()).head(3)

Unnamed: 0,num_transformer__age,num_transformer__bmi,cat_transformer__region_northeast,cat_transformer__region_northwest,cat_transformer__region_southeast,cat_transformer__region_southwest,cat_transformer__smoker_False,cat_transformer__smoker_True,remainder__children
0,-0.453795,-0.078058,0.0,0.0,1.0,0.0,1.0,0.0,0.0
1,-1.232479,-1.260928,1.0,0.0,0.0,0.0,1.0,0.0,0.0
2,-1.0909,-1.181456,0.0,1.0,0.0,0.0,1.0,0.0,0.0


# Custom: Function Transformer 

A Function Transformer is a type of transformer object that applies a **user-defined function** to the input data and returns the transformed data.<br>
- They can be used with either Pipelines (→ → →) or ColumnTransformers (⑂)
- They can also be used for feature engineering and preprocessing, and are particularly useful when we need to apply a custom function to transform the input data.

# 👆 If you want to use your own transformer in a Pipeline or a ColumnTransformer (not one already available in Sklearn), you must encapsulate your function within a FunctionTransformer.

In [12]:

import numpy as np
from sklearn.preprocessing import FunctionTransformer
# Create a transformer that rounds data to the nearest integer (np.round's default)
rounder = FunctionTransformer(np.round)

# We can use a lambda function for more customization, e.g. rounding to 2 decimals
rounder = FunctionTransformer(lambda array: np.round(array, decimals=2))

- OneHotEncoder: This is a preprocessing transformer in scikit-learn used to convert categorical variables into one-hot encoded format.
- `drop='if_binary'`: This parameter specifies how to handle binary categorical variables. If a categorical variable has only two unique values, it's considered binary. When set to 'if_binary', the encoder drops one of the two one-hot columns for each binary variable (the first category), since a single 0/1 column already carries all the information. Variables with more than two categories keep all their columns.
- `handle_unknown='ignore'`: This parameter specifies how to handle unknown categories during encoding. When set to 'ignore', any unseen categories in the test set (i.e., categories not present in the training set) will be ignored during transformation. This means that if a category is encountered during prediction that was not present during training, the corresponding one-hot encoded columns for that category will be all zeros.

In [13]:
# Add it at the end of our numerical transformer
num_transformer = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('rounder', rounder)])

# Encode categorical values
cat_transformer = OneHotEncoder(drop='if_binary',
                                handle_unknown='ignore')

preprocessor = ColumnTransformer([
    ('num_transformer', num_transformer, ['bmi', 'age']),
    ('cat_transformer', cat_transformer, ['region', 'smoker'])],
    remainder='passthrough')
preprocessor

In [18]:
pd.DataFrame(preprocessor.fit_transform(X_train)).head(3)

Unnamed: 0,0,1,2,3,4,5,6,7
0,-0.08,-0.45,0.0,0.0,1.0,0.0,0.0,0.0
1,-1.26,-1.23,1.0,0.0,0.0,0.0,0.0,0.0
2,-1.18,-1.09,0.0,1.0,0.0,0.0,0.0,0.0


# ❗️ FunctionTransformer only works for stateless transformations ❗️

Stateless transformations in machine learning refer to transformations that do not depend on the state or distribution of the data. In other words, the transformation applied to each data point is independent of other data points in the dataset.

### Examples of transformations which don't "learn" anything:
![2024-02-14_05-07-45.png](attachment:580f2b76-e7ce-4001-8d92-03c8f365c692.png)

# 👩‍🏫 stateful transformations are transformations which store information during .fit(X_train). This information is re-used for .transform(X_test)
![2024-02-14_05-10-52.png](attachment:fa326976-df38-4465-81e9-6634625dafe9.png)
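
For a stateful transformation, one common approach is to write a small class with `fit` and `transform` methods instead of using a FunctionTransformer. The class below, `MeanCenterer`, is a hypothetical example written for this illustration: it stores the column means during `.fit` and re-uses them in `.transform`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical stateful transformer: subtracts the means learned in fit."""

    def fit(self, X, y=None):
        # State learned from the training data only
        self.means_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # Re-use the stored means, whatever data is passed in
        return np.asarray(X, dtype=float) - self.means_

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])

centerer = MeanCenterer().fit(X_train)
train_centered = centerer.transform(X_train)
test_centered = centerer.transform(X_test)

print(train_centered.ravel(), test_centered.ravel())
```

The test value is centered with the training mean (2.0), not with its own mean, which is exactly what "storing information during .fit" means. Inheriting from `TransformerMixin` also gives the class a `fit_transform` method for free.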

# 🕵🏻‍♂️ Transformers under the hood

# FeatureUnion | |

A feature union pipeline typically consists of two main parts:

Feature Union: The first part is the feature union step, which combines the outputs of multiple data preprocessing pipelines. The data preprocessing pipelines may contain different types of transformers, such as StandardScaler, OneHotEncoder, CountVectorizer, or TfidfVectorizer, that perform different types of data preprocessing operations on the data. The feature union step concatenates the outputs of these transformers side by side into a single dataset, which is then passed on to the next step in the pipeline.

Modeling: The second part of the pipeline is the modeling step, which includes a machine learning model such as a decision tree, random forest, or logistic regression. This step takes the preprocessed data produced by the feature union step and trains a model to make predictions on new, unseen data.

The feature union pipeline is particularly useful when dealing with datasets that contain heterogeneous data, i.e., data that includes different types of features such as categorical, numerical, and text data. By using a feature union pipeline, we can apply different data preprocessing techniques to each type of feature and combine the results into a single dataset that can be used for machine learning.

In [19]:
X_train.head(3)

Unnamed: 0,age,bmi,children,smoker,region
1138,33.0,30.25,0,False,southeast
792,22.0,23.18,0,False,northeast
693,24.0,23.655,0,False,northwest


In [20]:
from sklearn.pipeline import FeatureUnion

# Create a custom transformer that divides one column by another (bmi / age)
# Note: this new feature is arbitrary and serves only as an example
bmi_age_ratio_constructor = FunctionTransformer(lambda df: pd.DataFrame(df["bmi"] / df["age"]))

union = FeatureUnion([
    ('preprocess', preprocessor), # columns 0-7
    ('bmi_age_ratio', bmi_age_ratio_constructor) # new column 8
])

union

In [21]:
pd.DataFrame(union.fit_transform(X_train)).head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,-0.08,-0.45,0.0,0.0,1.0,0.0,0.0,0.0,0.916667
1,-1.26,-1.23,1.0,0.0,0.0,0.0,0.0,0.0,1.053636
2,-1.18,-1.09,0.0,1.0,0.0,0.0,0.0,0.0,0.985625


# Building your preprocessor with make_*** shortcuts ⚡️


Pipeline vs. make_pipeline:

- Pipeline is a class that represents a sequence of transformations, where each step in the sequence is a tuple containing a name and an estimator or transformer object.
- make_pipeline is a convenience function that creates a pipeline without the need to specify step names explicitly. It automatically generates step names based on the class names of the transformers.
- The main difference is that Pipeline requires explicit naming of the steps, while make_pipeline automatically names the steps based on the class names.
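
A quick way to see the naming difference is to inspect the `steps` attribute of both objects. The step names `my_imputer` and `my_scaler` below are arbitrary choices for this sketch.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Explicit names chosen by us
explicit = Pipeline([
    ("my_imputer", SimpleImputer()),
    ("my_scaler", StandardScaler()),
])

# Names generated automatically from the lowercased class names
automatic = make_pipeline(SimpleImputer(), StandardScaler())

explicit_names = [name for name, _ in explicit.steps]
automatic_names = [name for name, _ in automatic.steps]

print(explicit_names)
print(automatic_names)
```

These names matter later on, for example when you refer to a step's parameters in a grid search (`simpleimputer__strategy` with the automatic names).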

In [26]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.pipeline import make_pipeline
from sklearn.pipeline import make_union
from sklearn.compose import make_column_transformer

In [27]:
Pipeline([
    ('my_name_for_the_imputer', SimpleImputer()),
    ('my_name_for_the_scaler', StandardScaler())
])

In [28]:
make_pipeline(SimpleImputer(), StandardScaler())

In [29]:
num_transformer = make_pipeline(SimpleImputer(), StandardScaler())
cat_transformer = OneHotEncoder()

preproc_basic = make_column_transformer(
    (num_transformer, ['age', 'bmi']),
    (cat_transformer, ['smoker', 'region']),
    remainder='passthrough'
)

preproc_full = make_union(preproc_basic, bmi_age_ratio_constructor)

preproc_full

# make_column_selector selects features automatically based on dtype

The `make_column_selector` function in scikit-learn selects columns from a DataFrame based on their data types (or on a regex pattern applied to the column names). It is especially useful together with `make_column_transformer`, where it replaces hard-coded column lists when specifying which columns should undergo which transformation.

In [30]:
from sklearn.compose import make_column_selector

num_col = make_column_selector(dtype_include=['float64'])
cat_col = make_column_selector(dtype_include=['object','bool'])

In [31]:
X_train.dtypes

age         float64
bmi         float64
children      int64
smoker         bool
region       object
dtype: object
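
A selector is a callable: applied to a DataFrame, it returns the list of matching column names. The sketch below uses a small hypothetical frame (`toy`) with the same dtypes as `X_train` to show what each selector would pick, and which column is left for `remainder`.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame mirroring the dtypes of X_train
toy = pd.DataFrame({
    "age": [33.0, 22.0],
    "bmi": [30.25, 23.18],
    "children": [0, 1],
    "smoker": [False, True],
    # dtype set explicitly so the column is 'object' on any pandas version
    "region": pd.Series(["southeast", "northeast"], dtype="object"),
})

num_col = make_column_selector(dtype_include=["float64"])
cat_col = make_column_selector(dtype_include=["object", "bool"])

num_names = num_col(toy)  # columns with float64 dtype
cat_names = cat_col(toy)  # columns with object or bool dtype
leftover = [c for c in toy.columns if c not in num_names + cat_names]

print(num_names)  # ['age', 'bmi']
print(cat_names)  # ['smoker', 'region']
print(leftover)   # ['children']
```

Note that `children` is an integer column, so neither selector matches it. In a column transformer it would be handled by the `remainder` setting.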

# 🎉 Complete preprocessing pipeline 🎉

In [33]:
from sklearn.compose import make_column_selector

num_transformer = make_pipeline(SimpleImputer(), StandardScaler())
num_col = make_column_selector(dtype_include=['float64'])

cat_transformer = OneHotEncoder()
cat_col = make_column_selector(dtype_include=['object','bool'])

preproc_basic = make_column_transformer(
    (num_transformer, num_col),
    (cat_transformer, cat_col),
    remainder='passthrough'
)

# create a preprocessing pipeline using make_union
preproc_full = make_union(preproc_basic, bmi_age_ratio_constructor)

preproc_full

**preproc_full:** This variable is assigned the result of combining preproc_basic with bmi_age_ratio_constructor using make_union. This creates a preprocessing pipeline that applies both the basic preprocessing steps and the additional bmi_age_ratio_constructor transformation, concatenating their outputs.
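
What "concatenating their outputs" means can be seen on a tiny NumPy array. This is a sketch on made-up data, with a hypothetical ratio feature standing in for `bmi_age_ratio_constructor`.

```python
import numpy as np
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Made-up data: 3 rows, 2 columns
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Hypothetical extra feature: second column divided by the first
ratio = FunctionTransformer(lambda a: (a[:, 1] / a[:, 0]).reshape(-1, 1))

# Each transformer receives the same input; the outputs are stacked side by side
union = make_union(StandardScaler(), ratio)
X_out = union.fit_transform(X)

print(X_out.shape)  # (3, 3): 2 scaled columns + 1 ratio column
```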

# Including models in Pipelines

Model objects can be plugged into Pipelines.

A Pipeline exposes the methods of the last object in the sequence. The final object is typically a machine learning model, so the Pipeline gains its methods for fitting on the training data and making predictions on new data.

When you call the `fit` method on the Pipeline object, scikit-learn fits each transformer in turn, transforms the data with it, and then calls the `fit` method of the final object on the transformed data. Similarly, when you call the `predict` method, scikit-learn applies the already fitted transformations to the input data (without refitting them) and then calls the `predict` method of the final object.
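
The delegation described above can be checked by hand. The following sketch, on synthetic data, compares a two-step pipeline with the equivalent manual sequence of calls; the data and variable names are made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
X_new = rng.normal(size=(5, 3))

# Pipeline: one fit call, one predict call
pipe = make_pipeline(StandardScaler(), Ridge())
pipe.fit(X, y)
pipe_pred = pipe.predict(X_new)

# Manual equivalent: fit_transform on training data, transform only on new data
scaler = StandardScaler()
model = Ridge().fit(scaler.fit_transform(X), y)
manual_pred = model.predict(scaler.transform(X_new))

print(np.allclose(pipe_pred, manual_pred))  # True
```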

# Next slide - read

So now let's see what a full pipeline looks like!

Ridge Regression is a machine learning model that is commonly used for regression tasks. It is a regularized linear regression method that adds a penalty term to the loss function, which helps to prevent overfitting and improve the generalization performance of the model.

In Ridge Regression, the loss function is defined as:
`L = RSS + alpha * (sum of squared coefficients)`
where RSS (Residual Sum of Squares) is the sum of the squared differences between the predicted values and the true values, and alpha is a hyperparameter that controls the strength of the regularization. The sum of the squared coefficients is the squared L2 norm of the coefficient vector, which is why this term is also called an L2 penalty.
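
As a sanity check of the formula, here is a small sketch on synthetic data. It assumes a model without an intercept (`fit_intercept=False`), so that the loss above has the closed-form minimizer `(XᵀX + alpha·I)⁻¹ Xᵀy`, and compares that with what `Ridge` finds. The helper `ridge_loss` is written for this illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

def ridge_loss(beta, X, y, alpha):
    """L = RSS + alpha * (sum of squared coefficients)."""
    residuals = y - X @ beta
    return residuals @ residuals + alpha * (beta @ beta)

alpha = 5.0
model = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)

# Closed-form minimizer of the loss (no intercept assumed)
beta_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
print(np.allclose(model.coef_, beta_closed, atol=1e-6))

# A larger alpha shrinks the coefficients towards zero
norms = [
    np.linalg.norm(Ridge(alpha=a, fit_intercept=False).fit(X, y).coef_)
    for a in (0.1, 10.0, 1000.0)
]
print(norms)
```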

Pipeline Setup:

pipeline: This variable is assigned a pipeline created using the make_pipeline function.
<br>

The pipeline consists of two main components:
- preproc: This represents the preprocessing steps that were defined earlier. It could be any preprocessing pipeline, such as preproc_basic or preproc_full from the previous examples.
- Ridge(): This represents the estimator or predictive model that will be trained on the preprocessed data. In this case, it's using Ridge Regression as the estimator. You can replace Ridge() with any other scikit-learn estimator.
<br>

Return Value:

- The pipeline variable now holds the complete machine learning pipeline, which includes both preprocessing and modeling steps. It's ready to be used for training and prediction tasks.

In [34]:
from sklearn.linear_model import Ridge

# Preprocessor
num_transformer = make_pipeline(SimpleImputer(), StandardScaler())
cat_transformer = OneHotEncoder()

preproc = make_column_transformer(
    (num_transformer, make_column_selector(dtype_include=['float64'])),
    (cat_transformer, make_column_selector(dtype_include=['object','bool'])),
    remainder='passthrough'
)

# Add estimator
pipeline = make_pipeline(preproc, Ridge())
pipeline

In [35]:
# Train Pipeline
pipeline.fit(X_train,y_train)

# Make predictions
pipeline.predict(X_test.iloc[0:1])

# Score model
pipeline.score(X_test,y_test)

0.7555145003327789

In [36]:
from sklearn.model_selection import cross_val_score

# Cross-validate Pipeline
cross_val_score(pipeline, X_train, y_train, cv=5, scoring='r2').mean()

0.7418694094483809

# Grid search - READ

The `get_params()` method retrieves the parameters of an estimator or a pipeline. When applied to a pipeline, it returns a dictionary containing the pipeline's own parameters and those of every step, with nested parameters addressed as `<step name>__<parameter name>`.
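
The double-underscore keys can be explored on a smaller pipeline. This sketch (a simplified three-step pipeline, not the one built above) lists the nested parameter names and changes two of them with `set_params`, which is the same mechanism a grid search relies on.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(SimpleImputer(), StandardScaler(), Ridge())

# Keys containing a double underscore are hyperparameters nested inside a step
tunable = sorted(key for key in pipe.get_params() if "__" in key)
print(tunable)

# The same keys can be used to change a hyperparameter in place
pipe.set_params(ridge__alpha=5, simpleimputer__strategy="median")
print(pipe.get_params()["ridge__alpha"])
```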

In [37]:
# Which parameters of the pipeline are GridSearch-able?
pipeline.get_params()

{'memory': None,
 'steps': [('columntransformer',
   ColumnTransformer(remainder='passthrough',
                     transformers=[('pipeline',
                                    Pipeline(steps=[('simpleimputer',
                                                     SimpleImputer()),
                                                    ('standardscaler',
                                                     StandardScaler())]),
                                    <sklearn.compose._column_transformer.make_column_selector object at 0x7f274ed8b760>),
                                   ('onehotencoder', OneHotEncoder(),
                                    <sklearn.compose._column_transformer.make_column_selector object at 0x7f274ed88f40>)])),
  ('ridge', Ridge())],
 'verbose': False,
 'columntransformer': ColumnTransformer(remainder='passthrough',
                   transformers=[('pipeline',
                                  Pipeline(steps=[('simpleimputer',
                                

The code below performs hyperparameter tuning with grid search:
- `param_grid`: the keys use the `<step name>__<parameter name>` syntax to reach hyperparameters nested inside the Pipeline.
- `grid_search.best_params_`: after fitting, this attribute holds the best combination of parameters found by the grid search.
- The best parameters are the ones with the highest mean cross-validated score for the metric given by the `scoring` parameter (`r2` in this case).

In [38]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(
    pipeline,
    param_grid={
        # Access any component of the Pipeline
        # and any available hyperparameter you want to optimize
        'columntransformer__pipeline__simpleimputer__strategy': ['mean', 'median'],
        'ridge__alpha': [0.1, 0.5, 1, 5, 10]
    },
    cv=5,
    scoring="r2")

grid_search.fit(X_train, y_train)

grid_search.best_params_

{'columntransformer__pipeline__simpleimputer__strategy': 'mean',
 'ridge__alpha': 5}

# 💾 Let's save the pipelined model with the best hyperparameters.

In [39]:
pipeline_tuned = grid_search.best_estimator_
pipeline_tuned

In [40]:
pipeline_tuned.predict(X_test.iloc[0:1])

array([1899.68340546])

# Cache to avoid repeated computations

Caching is a technique for avoiding repeated computations when fitting a Pipeline. Complex Pipelines that involve feature extraction, feature selection, and model training often recompute the same intermediate results many times, for example during a grid search where only the final model's hyperparameters change. Caching the fitted transformers lets those results be reused, which can speed up training and reduce the overall computational cost.

To enable it, we create a `Memory` object (or simply pass a path) pointing to a caching directory, and set it as the `memory` parameter of the Pipeline.
- When we call the `fit` method on the Pipeline, scikit-learn caches the fitted transformers, so a later fit with the same parameters and input data loads them from the cache instead of recomputing them.
- The final estimator is never cached, and calling the `predict` method does not add anything to the cache: it simply applies the already fitted steps.
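
A minimal caching sketch, assuming a throwaway temporary directory as the cache location and synthetic data. It passes the directory path directly as `memory`, which scikit-learn accepts in place of a `Memory` object.

```python
from shutil import rmtree
from tempfile import mkdtemp

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, 2.0, 3.0])

cachedir = mkdtemp()  # temporary caching directory (assumed location)
try:
    pipe = make_pipeline(StandardScaler(), Ridge(), memory=cachedir)
    pipe.fit(X, y)  # fits the scaler and stores the result in the cache
    pipe.fit(X, y)  # same data and parameters: the scaler comes from the cache
    preds = pipe.predict(X)
finally:
    rmtree(cachedir)  # remove the cache once it is no longer needed

print(preds.shape)  # (40,)
```

The benefit shows up mostly in grid searches over the final model only, where the preprocessing steps would otherwise be refitted for every candidate.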

# Debug your pipe

In scikit-learn, the named_steps attribute of a pipeline provides access to the individual components (transformers or estimators) within the pipeline by their assigned names. The keys() method retrieves the names of the components.

In [44]:
# Access the components of a Pipeline with `named_steps`
pipeline_tuned.named_steps.keys()

dict_keys(['columntransformer', 'ridge'])

This snippet checks the shape of the data before and after the preprocessing step of the pipeline. The shape of X_train gives the original dimensions of the dataset. The shape of the preprocessed training data shows how preprocessing changed them (here, one-hot encoding adds columns).

In [45]:
# Check intermediate steps
print("Before preprocessing, X_train.shape = ")
print(X_train.shape)
print("After preprocessing, X_train_preprocessed.shape = ")
pipeline_tuned.named_steps["columntransformer"].fit_transform(X_train).shape

Before preprocessing, X_train.shape = 
(1070, 5)
After preprocessing, X_train_preprocessed.shape = 


(1070, 9)

# Exporting models/Pipelines

Pickle files store serialized Python objects in binary format and are commonly used for saving and loading machine learning models, preprocessing pipelines, and other complex data structures in Python. They are also used for caching intermediate results, storing configuration settings, and exchanging data between Python programs.

In [46]:
import pickle

# Export Pipeline as pickle file
with open("pipeline.pkl", "wb") as file:
    pickle.dump(pipeline_tuned, file)

# Load Pipeline from pickle file
with open("pipeline.pkl", "rb") as file:
    my_pipeline = pickle.load(file)

my_pipeline.score(X_test, y_test)

0.7551048402327243

# AUTO ML

Automated Machine Learning (AutoML) is a technique for automatically searching for the best combination of data preprocessing techniques and machine learning models to solve a given problem. AutoML can save time and effort by automating model selection and tuning, which can be time-consuming and sometimes tedious.

How does the magic happen? A simple form of AutoML can be achieved using the Pipeline class with the GridSearchCV or RandomizedSearchCV classes. These classes perform an exhaustive or randomized search, respectively, over a predefined search space of hyperparameters for each step of the Pipeline. By specifying a set of candidate hyperparameters for each step, the search finds the combination that optimizes the performance of the Pipeline on a given task.
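
One way to sketch this idea: because a whole step can itself be treated as a parameter, a grid search can compare different models, each with its own hyperparameter grid. The example below uses synthetic data and a hypothetical step name `model`. It illustrates the principle and is not a full AutoML system.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with two irrelevant features
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([3.0, 0.0, -2.0, 0.0]) + rng.normal(scale=0.5, size=120)

pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])

# One dict per candidate model, so each gets its own hyperparameter grid
param_grid = [
    {"model": [Ridge()], "model__alpha": [0.1, 1.0, 10.0]},
    {"model": [Lasso()], "model__alpha": [0.01, 0.1, 1.0]},
]

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X, y)

best_model = search.best_estimator_.named_steps["model"]
n_candidates = len(search.cv_results_["params"])

print(n_candidates)  # 6 candidate pipelines were cross-validated
print(type(best_model).__name__, search.best_params_["model__alpha"])
```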

With the TPOT library, we create a TPOTRegressor object and fit it to the training data. TPOT uses genetic programming to search for the best pipeline for the regression problem, and returns the best pipeline found after a predefined number of generations with a predefined population size. Finally, we evaluate the best pipeline found by TPOT on the test data using the mean squared error.

The TPOTRegressor constructor takes the same parameters as TPOTClassifier, including the number of generations, the population size, the verbosity level, and the scoring metric used to evaluate the fitness of the pipelines. For regression problems, TPOT can score pipelines with metrics such as mean squared error, mean absolute error, R², and explained variance.