In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 5)


# Lecture 15 - `sklearn` Pipelines

## DSC 80, Fall 2022

### Today, in DSC 80

- Building machine learning pipelines in `sklearn`

Remember to refer to [dsc80.com/resources/#regular-expressions](https://dsc80.com/resources/#regular-expressions).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn-white')
plt.rc('figure', dpi=100, figsize=(7, 5))
plt.rc('font', size=12)

## `sklearn` overview

### The steps of the modeling pipeline

<center><img src="imgs/image_0.png" width="60%"></center>

1. Create features to best reflect the "meaning" behind data.
2. Choose a model that is appropriate to capture the relationships between features and the response.
3. Select a loss function and fit the model (i.e., determine $w^*$).
4. Evaluate the model (e.g. using RMSE).

### Features and models using `sklearn`

<center><img src="imgs/sklearn.png" width="20%"></center>
    
* Scikit-learn (`sklearn`) implements many common steps in the feature and model creation pipeline.
    - It is **widely** used throughout [industry](https://scikit-learn.org/stable/testimonials/testimonials.html#:~:text=It%20is%20very%20widely%20used,very%20approachable%20and%20very%20powerful.) and academia.
* It interfaces with `numpy` arrays, and to an extent, `pandas` DataFrames.
* Huge benefit: the [documentation online](https://scikit-learn.org/stable/modules/classes.html) is **excellent**.

### `preprocessing` and `linear_model`

For the **feature creation** step of the modeling pipeline, we will use `sklearn`'s [`preprocessing`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) module.

<center><img src="imgs/feature_part.png" width="30%"></center>

For the **model creation** step of the modeling pipeline, we will use `sklearn`'s [`linear_model`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) module.

<center><img src="imgs/model_part.png" width="36%"></center>

## Transformers in `sklearn`

### Transformer classes

- **Transformers** take in "raw" data and output "processed" data. They are used for **creating features**.
    - The input should be a multi-dimensional `numpy` array.
        - Inputs can be DataFrames, but `sklearn` only looks at the values (i.e. it calls `to_numpy()` on input DataFrames).
    - The output is typically a `numpy` array, not a DataFrame or Series (some transformers, such as `OneHotEncoder`, return a sparse matrix by default).

- Transformers, like most relevant features of `sklearn`, are **classes**, not functions, meaning you need to instantiate them and call their methods.

### Example transformer: `Binarizer`

The `Binarizer` transformer allows us to map a quantitative sequence to a sequence of 1s and 0s, depending on whether values are above or below a threshold.

|Property|Example|Description|
|---|---|---|
|Initialize with parameters| `binar = Binarizer(threshold=thresh)` | set x=1 if x > thresh, else 0|
|Transform data in a dataset | `feat = binar.transform(data)` | Binarize all columns in `data`|

First, we need to import the relevant class from `sklearn.preprocessing`. (Tip: import just the relevant classes you need from `sklearn`.)

In [None]:
from sklearn.preprocessing import Binarizer

Let's try binarizing `'total_bill'`. We'll say a "large" bill is one that is over \$20.

In [None]:
tips = sns.load_dataset('tips') # To remove the columns we "engineered" before
tips['total_bill'].head()

First, we initialize a `Binarizer` object with the threshold we want.

In [None]:
bi = Binarizer(threshold=20)

Then, we call `bi`'s `transform` method and pass it the data we'd like to transform. Note that its input and output are both 2D.

In [None]:
transformed_bills = bi.transform(tips[['total_bill']]) # Must pass transform a 2D array/DataFrame
transformed_bills[:5]

Cool! We can verify that it worked correctly:

In [None]:
((tips['total_bill'] > 20).astype(int) == transformed_bills.flatten()).all()
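As a quick check of the threshold rule, here is a minimal sketch on a hypothetical toy array (not the `tips` data). It relies on the rule from the table above: only values strictly greater than the threshold map to 1, so a value exactly equal to the threshold maps to 0. The names `toy_bills`, `toy_binarizer`, and `toy_flags` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical toy bills: just below, exactly at, and just above the threshold.
toy_bills = np.array([[19.99], [20.0], [20.01]])

toy_binarizer = Binarizer(threshold=20)

# Binarizer stores no statistics about the data, so transform works without fit.
toy_flags = toy_binarizer.transform(toy_bills)

print(toy_flags.ravel())
```

Only the last value is flagged as "large", since the comparison is strict.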

### Example transformer: `StandardScaler`

- `StandardScaler` **standardizes** data using the mean and standard deviation of the data.

$$z_i = \frac{x_i - \bar{x}}{\sigma_x}$$

- Unlike `Binarizer`, `StandardScaler` **requires some knowledge (mean and SD) of the dataset before transforming**.
- As such, we need to **`fit`** a `StandardScaler` transformer before we can use the `transform` method.
- Typical usage: fit transformer on a sample; use that fit transformer to transform future data.


|Property|Example|Description|
|---|---|---|
|Initialize with parameters| `stdscaler = StandardScaler()` | z-scale the data (no parameters) |
|Fit the transformer| `stdscaler.fit(data)` | compute the mean and SD of `data`|
|Transform data in a dataset | `feat = stdscaler.transform(newdata)` | z-scale `newdata` with mean and SD of `data`|

It only makes sense to standardize the already-quantitative columns of `tips`, so let's select just those.

In [None]:
tips_quant = tips[['total_bill', 'tip', 'size']]
tips_quant.head()

Let's initialize a `StandardScaler` object.

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
stdscaler = StandardScaler()

Note that the following **does not work!** The error message is very helpful.

In [None]:
stdscaler.transform(tips_quant)

Instead, we need to first call the `fit` method on `stdscaler`.

In [None]:
stdscaler.fit(tips_quant)

Now, `transform` will work.

In [None]:
# First column is 'total_bill', second column is 'tip', third column is 'size'
tips_quant_z = stdscaler.transform(tips_quant)
tips_quant_z[:5]

We can also access the mean and variance `stdscaler` computed for each column:

In [None]:
stdscaler.mean_

In [None]:
stdscaler.var_

Note that we can call `transform` on DataFrames other than `tips_quant`:

In [None]:
stdscaler.transform(tips_quant.head(5))
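To make the "fit on one dataset, transform another" idea concrete, here is a minimal sketch on hypothetical toy arrays (`train` and `new` are made up, not taken from `tips`). It checks that `transform` standardizes new data using the mean and SD learned during `fit`, where the SD is the population SD (`ddof=0`, which is also `np.std`'s default).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: the scaler is fit on `train`, then applied to `new`.
train = np.array([[10.0, 1.0], [20.0, 2.0], [30.0, 3.0], [40.0, 4.0]])
new = np.array([[25.0, 2.0], [50.0, 6.0]])

scaler = StandardScaler()
scaler.fit(train)  # learns the column means and variances of `train`

z_sklearn = scaler.transform(new)

# Manual z-scores of `new`, using the statistics of `train` (not of `new`).
z_manual = (new - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(z_sklearn, z_manual))
```

Note that the transformed `new` values are not guaranteed to have mean 0 and SD 1, because the statistics come from `train`.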

### Example transformer: `OneHotEncoder`

Let's keep just the categorical columns in `tips`.

In [None]:
tips_cat = tips[['sex', 'smoker', 'day', 'time']]
tips_cat.head()

Like `StandardScaler`, we will need to `fit` our `OneHotEncoder` transformer before it can transform anything.

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
ohe = OneHotEncoder()
ohe.fit(tips_cat)

We can look at the unique values (i.e. categories) in each column by using the `categories_` attribute:

In [None]:
ohe.categories_

In [None]:
ohe_features = ohe.transform(tips_cat)
ohe_features

Since the resulting matrix is **sparse** (most of its elements are 0), `sklearn` uses a more efficient representation than a regular `numpy` array. That's no issue, though:

In [None]:
ohe_features.toarray()

Notice that the transformed matrix itself doesn't carry the column names from `tips_cat`.

We can use the `get_feature_names_out` method on `ohe` (called `get_feature_names` in `sklearn` versions before 1.0) to access the names of the one-hot-encoded columns, though:

In [None]:
ohe.get_feature_names_out() # Each name starts with its source column in tips_cat (or x0, x1, x2, x3 in older versions)

`ohe` also has an `inverse_transform` method, which takes a one-hot-encoded matrix and returns a categorical matrix.

In [None]:
ohe.inverse_transform(ohe_features[:10])
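The round trip between categories and one-hot columns can be sketched on a small hypothetical DataFrame (`toy_cat` is made up for illustration, with two of the categorical columns from `tips`). Each row should contain exactly one 1 per original column, and `inverse_transform` should recover the original values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: 3 distinct days and 2 distinct times.
toy_cat = pd.DataFrame({
    'day': ['Sat', 'Sun', 'Sat', 'Thur'],
    'time': ['Dinner', 'Dinner', 'Lunch', 'Lunch'],
})

toy_ohe = OneHotEncoder()

# fit_transform returns a sparse matrix by default; toarray() makes it dense.
toy_encoded = toy_ohe.fit_transform(toy_cat).toarray()

# Map the one-hot rows back to the original categories.
toy_decoded = toy_ohe.inverse_transform(toy_encoded)

print(toy_encoded.shape)  # one column per (column, category) pair: 3 + 2
print(toy_decoded)
```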

## Models in `sklearn`

### Model classes

- `sklearn` model classes (called "estimators") behave like transformers, in that we need to instantiate and `fit` them.
- The difference is that we also need to specify what our "response" or "target" variable is, i.e. what we are trying to predict.
    - Calling `fit` is the same as "training our model".
- There are several models in the [`linear_model`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) module; we will start with `LinearRegression`.

### The `LinearRegression` class

We've seen this a few times in lecture already, but never formally.

In [None]:
from sklearn.linear_model import LinearRegression

**Important:** From [the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression), we have

> LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.

In other words, `LinearRegression` minimizes mean squared error by default.

Additionally, by default the `fit_intercept` argument is set to `True`.
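One way to see what "minimize the residual sum of squares, with an intercept" means is to solve the same least squares problem by hand. The sketch below uses hypothetical toy data (`X_toy`, `y_toy`) and `np.linalg.lstsq` on a design matrix with an added column of ones; this is an illustration of the objective, not a claim about how `sklearn` solves it internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: two features (bill, table size) and a response (tip).
X_toy = np.array([[10.0, 2], [20.0, 2], [30.0, 4], [40.0, 3], [50.0, 6]])
y_toy = np.array([1.5, 3.0, 4.0, 6.5, 7.0])

lr_toy = LinearRegression()  # fit_intercept=True by default
lr_toy.fit(X_toy, y_toy)

# Same problem by hand: prepend a column of ones so that w[0] is the intercept.
design = np.column_stack([np.ones(len(X_toy)), X_toy])
w, *_ = np.linalg.lstsq(design, y_toy, rcond=None)

print(np.isclose(w[0], lr_toy.intercept_))
print(np.allclose(w[1:], lr_toy.coef_))
```

The intercept lives in `intercept_`, separately from the slopes in `coef_`, which is why the manual solution has one more entry than `coef_`.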

In [None]:
LinearRegression?

### Example: Predicting `'tip'` from `'total_bill'` and `'size'`

In [None]:
tips.head()

First, we instantiate and fit. By calling `fit`, we are saying "minimize mean squared error and find $w^*$".

In [None]:
lr = LinearRegression()

# Note that there are two arguments to fit: X and y!
# (It is not necessary to write X= and y=)
lr.fit(X=tips[['total_bill', 'size']], y=tips['tip'])

After fitting, the `predict` method is available. Note that the argument to `predict` can be any 2D array with two columns.

In [None]:
# Predicted tip from a table of 3 that spends $25 
lr.predict([[25, 3]])

In [None]:
# Predicted tip from a table of 14 that spends $1000 (probably not accurate!)
lr.predict([[1000, 14]])

We can access the intercepts and slopes individually. This model is of the form

$$\text{predicted tip} = w_0^* + w_1^* \cdot \text{total bill} + w_2^* \cdot \text{table size}$$

so we should expect three parameters total.

In [None]:
lr.intercept_

In [None]:
lr.coef_

If we want to compute the RMSE of our model, we need to find its predictions on every row in the training data (`tips`).

In [None]:
all_preds = lr.predict(tips[['total_bill', 'size']])

In [None]:
np.sqrt(np.mean((all_preds - tips['tip']) ** 2))

It turns out that fit `LinearRegression` objects also have a `score` method:

In [None]:
lr.score(tips[['total_bill', 'size']], tips['tip'])

That doesn't look like the RMSE... what is it? 🤔

### Aside: $R^2$

- $R^2$, or the **coefficient of determination**, is a measure of the **quality of a linear fit**.
- There are a few equivalent ways of computing it, assuming your model has an intercept term:

$$R^2 = \frac{\text{var}(\text{predicted $y$ values})}{\text{var}(\text{actual $y$ values})}$$

$$R^2 = \left[ \text{correlation}(\text{predicted $y$ values}, \text{actual $y$ values}) \right]^2$$

- In the simple linear regression case, it is the square of the correlation coefficient, $r$.
- **Key idea:** $R^2$ ranges from 0 to 1. **The closer it is to 1, the better the linear fit is.**
- Interpretation: $R^2$ is the **proportion of variance in $y$ that the linear model explains**.

### Calculating $R^2$

Recall, `all_preds` contains the predicted `'tip'` for every data point in `tips`.

In [None]:
tips.head()

In [None]:
all_preds[:5]

**Method 1:** $R^2 = \frac{\text{var}(\text{predicted $y$ values})}{\text{var}(\text{actual $y$ values})}$


In [None]:
np.var(all_preds) / np.var(tips['tip'])

**Method 2:** $R^2 = \left[ \text{correlation}(\text{predicted $y$ values}, \text{actual $y$ values}) \right]^2$

Note: By correlation here, we are referring to $r$.

In [None]:
np.corrcoef(all_preds, tips['tip'])[0, 1] ** 2 # corrcoef returns a 2x2 matrix; entry [0, 1] is r

**Method 3:** `lr.score`

In [None]:
lr.score(tips[['total_bill', 'size']], tips['tip'])

All three methods provide the same result!
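The agreement can also be checked on data other than `tips`. The sketch below uses hypothetical toy data and adds a fourth formula, $1 - \frac{\text{SS}_\text{res}}{\text{SS}_\text{tot}}$, which is how the `sklearn` documentation defines the value returned by `score`. The equalities are expected to hold here because the model is linear, has an intercept, and is evaluated on the same data it was fit to.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data (not the tips dataset).
X_toy = np.array([[10.0, 2], [20.0, 2], [30.0, 4], [40.0, 3], [50.0, 6]])
y_toy = np.array([1.5, 3.0, 4.0, 6.5, 7.0])

lr_toy = LinearRegression().fit(X_toy, y_toy)
preds = lr_toy.predict(X_toy)

# Method 1: ratio of variances.
r2_var = np.var(preds) / np.var(y_toy)

# Method 2: squared correlation between predictions and actual values.
r2_corr = np.corrcoef(preds, y_toy)[0, 1] ** 2

# Method 3: the score method.
r2_score_method = lr_toy.score(X_toy, y_toy)

# Definition: 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y_toy - preds) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r2_resid = 1 - ss_res / ss_tot

print(r2_var, r2_corr, r2_score_method, r2_resid)
```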

### `LinearRegression` summary

|Property|Example|Description|
|---|---|---|
|Initialize model parameters| `lr = LinearRegression()` | Create (empty) linear regression model|
|Fit the model to the data | `lr.fit(data, responses)` | Determines regression coefficients|
|Use model for prediction |`lr.predict(newdata)`| Use regression line to make predictions|
|Evaluate the model| `lr.score(data, responses)` | Calculate the $R^2$ of the LR model|
|Access model attributes| `lr.coef_` | Access the regression coefficients|

***Note:*** Once `fit`, estimators like `LinearRegression` are just transformers (`predict` <-> `transform`).

## Summary

### Summary

- Quantitative feature transformations allow us to use linear models to model non-linear data.
- Transformers in `sklearn` are used for **feature engineering**, while estimators in `sklearn` are used for **models**.
- A common pattern:
    - Instantiate.
    - `fit`.
    - `transform` / `predict`.
- We like linear models with **low RMSE** and **high $R^2$**!
- **Next:** Combining transformers and estimators in a single **pipeline**.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression

plt.style.use('seaborn-white')
plt.rc('figure', dpi=100, figsize=(7, 5))
plt.rc('font', size=12)

import warnings
warnings.simplefilter('ignore')

## Models in `sklearn`

### Example: Predicting `'tip'` from `'total_bill'` and `'size'`

In [None]:
tips = sns.load_dataset('tips')
tips.head()

First, we instantiate and fit. By calling `fit`, we are saying "minimize mean squared error and find $w^*$".

In [None]:
lr = LinearRegression()

# Note that there are two arguments to fit: X and y!
# (It is not necessary to write X= and y=)
lr.fit(X=tips[['total_bill', 'size']], y=tips['tip'])

After fitting, the `predict` method is available. Note that the argument to `predict` can be any 2D array with two columns.

In [None]:
# Predicted tip from a table of 3 that spends $25 
lr.predict([[25, 3]])

In [None]:
# Predicted tip from a table of 14 that spends $1000 (probably not accurate!)
lr.predict([[1000, 14]])

We can access the intercepts and slopes individually. This model is of the form

$$\text{predicted tip} = w_0^* + w_1^* \cdot \text{total bill} + w_2^* \cdot \text{table size}$$

so we should expect three parameters total.

In [None]:
lr.intercept_

In [None]:
lr.coef_

If we want to compute the RMSE of our model, we need to find its predictions on every row in the training data (`tips`).

In [None]:
all_preds = lr.predict(tips[['total_bill', 'size']])

In [None]:
np.sqrt(np.mean((all_preds - tips['tip']) ** 2))

It turns out that fit `LinearRegression` objects also have a `score` method:

In [None]:
lr.score(tips[['total_bill', 'size']], tips['tip'])

That doesn't look like the RMSE... what is it? 🤔

### Aside: $R^2$

- $R^2$, or the **coefficient of determination**, is a measure of the **quality of a linear fit**.
- There are a few equivalent ways of computing it, assuming your model **is linear and has an intercept term**:

$$R^2 = \frac{\text{var}(\text{predicted $y$ values})}{\text{var}(\text{actual $y$ values})}$$

$$R^2 = \left[ \text{correlation}(\text{predicted $y$ values}, \text{actual $y$ values}) \right]^2$$

- In the simple linear regression case, it is the square of the correlation coefficient, $r$.
- **Key idea:** $R^2$ ranges from 0 to 1. **The closer it is to 1, the better the linear fit is.**
    - $R^2$ has no units of measurement, unlike RMSE.
- Interpretation: $R^2$ is the **proportion of variance in $y$ that the linear model explains**.

### Calculating $R^2$

Recall, `all_preds` contains the predicted `'tip'` for every data point in `tips`.

In [None]:
tips.head()

In [None]:
all_preds[:5]

**Method 1:** $R^2 = \frac{\text{var}(\text{predicted $y$ values})}{\text{var}(\text{actual $y$ values})}$


In [None]:
np.var(all_preds) / np.var(tips['tip'])

**Method 2:** $R^2 = \left[ \text{correlation}(\text{predicted $y$ values}, \text{actual $y$ values}) \right]^2$

Note: By correlation here, we are referring to $r$.

In [None]:
np.corrcoef(all_preds, tips['tip'])[0, 1] ** 2 # corrcoef returns a 2x2 matrix; entry [0, 1] is r

**Method 3:** `lr.score`

In [None]:
lr.score(tips[['total_bill', 'size']], tips['tip'])

All three methods provide the same result!

### `LinearRegression` summary

|Property|Example|Description|
|---|---|---|
|Initialize model parameters| `lr = LinearRegression()` | Create (empty) linear regression model|
|Fit the model to the data | `lr.fit(data, responses)` | Determines regression coefficients|
|Use model for prediction |`lr.predict(newdata)`| Use regression line to make predictions|
|Evaluate the model| `lr.score(data, responses)` | Calculate the $R^2$ of the LR model|
|Access model attributes| `lr.coef_` | Access the regression coefficients|

***Note:*** Once `fit`, estimators like `LinearRegression` are just transformers (`predict` <-> `transform`).

## Pipelines

<center><img src="imgs/image_0.png" width="50%"></center>

<br>

So far, we've used transformers for feature engineering and models for prediction. We can combine these steps into a single `Pipeline`.

### `Pipeline`s in `sklearn`

- A `Pipeline` object is instantiated using a **list** containing transformer(s) and a model (estimator).
```py
pl = Pipeline([feat_trans1, feat_trans2, ..., mdl])
```
- Once a `Pipeline` is instantiated, you can fit **all** steps (transformers and model) using `fit`.
```py
pl.fit(data, responses)
```
- To make predictions using **raw (untransformed) data**, use `pl.predict`.

### Creating a `Pipeline`

- To instantiate a `Pipeline`, we must provide a list with zero or more transformers followed by a single model.
    - All "steps" must have `fit` methods, and all but the last must have `transform` methods.
- The list we provide `Pipeline` with must be a list of **tuples**, where
    - The first element is a "name" (that we choose) for the step.
    - The second element is a transformer or estimator instance.

Let's build a `Pipeline` that:
- One-hot-encodes the categorical features in `tips`.
- Fits a regression model on the one-hot-encoded data.

In [None]:
tips_cat = tips[['sex', 'smoker', 'day', 'time']]
tips_cat.head()

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

In [None]:
pl = Pipeline([
    ('one-hot', OneHotEncoder()),
    ('lin-reg', LinearRegression())
])

Now that `pl` is instantiated, we `fit` it the same way we would fit the individual steps.

In [None]:
pl.fit(tips_cat, tips['tip'])

Now, to make predictions using **raw data**, all we need to do is use `pl.predict`:

In [None]:
pl.predict([['Male', 'Yes', 'Sat', 'Lunch']])

In [None]:
pl.predict(tips_cat.iloc[:5])

`pl` performs **both** feature transformation and prediction with just a single call to `predict`!
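To see what that single call replaces, here is a minimal sketch comparing a `Pipeline` against the same two steps done by hand. The data (`toy_X`, `toy_y`) is hypothetical and the variable names are made up; the point is only that both routes should produce the same predictions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two categorical features and a numeric response.
toy_X = pd.DataFrame({
    'day': ['Sat', 'Sun', 'Sat', 'Thur', 'Sun', 'Thur'],
    'time': ['Dinner', 'Dinner', 'Lunch', 'Lunch', 'Lunch', 'Dinner'],
})
toy_y = np.array([3.0, 3.5, 2.0, 2.5, 3.0, 4.0])

# Route 1: a Pipeline handles encoding and regression together.
toy_pl = Pipeline([
    ('one-hot', OneHotEncoder()),
    ('lin-reg', LinearRegression()),
])
toy_pl.fit(toy_X, toy_y)
pl_preds = toy_pl.predict(toy_X)

# Route 2: the same steps by hand, transforming before both fit and predict.
manual_ohe = OneHotEncoder().fit(toy_X)
manual_lr = LinearRegression().fit(manual_ohe.transform(toy_X), toy_y)
manual_preds = manual_lr.predict(manual_ohe.transform(toy_X))

print(np.allclose(pl_preds, manual_preds))
```

With the manual route, it is easy to forget to transform the data before calling `predict`; the `Pipeline` removes that step from our hands.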

We can access individual "steps" of a `Pipeline` through the `named_steps` attribute:

In [None]:
pl.named_steps

In [None]:
pl.named_steps['one-hot'].transform(tips_cat).toarray()

In [None]:
pl.named_steps['lin-reg'].coef_

### More sophisticated `Pipeline`s

- In the previous example, we one-hot-encoded every input column. **What if we want to perform different transformations on different columns?**
- **Solution:** Use a `ColumnTransformer`.
    - Instantiate a `ColumnTransformer` using a list of tuples, where:
        - The first element is a "name" we choose for the transformer.
        - The second element is a transformer instance (e.g. `OneHotEncoder()`).
        - The third element is a **list of relevant column names**.
    - `ColumnTransformer` is extremely useful, but it was only added to `sklearn` in 2018!

<center><img src='imgs/image_3.png' width=50%></center>

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

Let's perform different transformations on the quantitative and categorical features of `tips` (so, we will not transform `'tip'`).

In [None]:
tips_features = tips.drop('tip', axis=1)
tips_features.head()

- To the **quantitative features (`'total_bill'` and `'size'`)**, we will apply the `StandardScaler` transformer.
- To the **categorical features**, we will apply the `OneHotEncoder` transformer.

In [None]:
preproc = ColumnTransformer(
    transformers = [
        ('quant', StandardScaler(), ['total_bill', 'size']),
        ('cat', OneHotEncoder(), ['sex', 'smoker', 'day', 'time'])
    ]
)

Now, let's create a `Pipeline` using `preproc` as a transformer, and `fit` it:

In [None]:
pl = Pipeline([
    ('preprocessor', preproc), 
    ('lin-reg', LinearRegression())
])

In [None]:
pl.fit(tips_features, tips['tip'])

Prediction is as easy as calling `predict`:

In [None]:
tips_features.head()

In [None]:
pl.predict(tips_features.head())

`pl` also has a `score` method, the same way a fit `LinearRegression` instance does:

In [None]:
pl.score(tips_features, tips['tip'])

Recall, we can access the individual "steps" in `pl` using the `named_steps` attribute:

In [None]:
pl.named_steps['preprocessor'].transform(tips_features)

**Note:** `ColumnTransformer` has a `remainder` argument that you can use to specify what to do with columns that aren't being transformed (`'drop'`, the default, or `'passthrough'`).
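The effect of `remainder` can be sketched on a small hypothetical DataFrame (`toy_features`, made up for illustration) in which the `'size'` column is deliberately left out of every transformer. The helper `make_preproc` is also hypothetical; it just builds the same `ColumnTransformer` with a different `remainder` setting.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; 'size' is not listed in any transformer below.
toy_features = pd.DataFrame({
    'total_bill': [10.0, 20.0, 30.0, 40.0],
    'day': ['Sat', 'Sun', 'Sat', 'Sun'],
    'size': [2, 3, 2, 4],
})

def make_preproc(remainder):
    """Build the same ColumnTransformer with a chosen remainder setting."""
    return ColumnTransformer(
        transformers=[
            ('quant', StandardScaler(), ['total_bill']),
            ('cat', OneHotEncoder(), ['day']),
        ],
        remainder=remainder,
    )

# 'drop' discards 'size'; 'passthrough' appends it, untransformed, at the end.
dropped = make_preproc('drop').fit_transform(toy_features)
passed = make_preproc('passthrough').fit_transform(toy_features)

print(dropped.shape)
print(passed.shape)
```

The second result has one more column than the first, since `'size'` is carried through unchanged.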