# Pipelines

In this notebook, we will learn about pipelines in scikit-learn.

<a href="https://colab.research.google.com/github/thomasjpfan/ml-workshop-intro/blob/master/notebooks/04-pipelines.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

In [None]:
# Install dependencies for google colab
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    %pip install -r https://raw.githubusercontent.com/thomasjpfan/ml-workshop-intro/master/requirements.txt

In [None]:
import sklearn
assert sklearn.__version__.startswith("1.0"), "Please install scikit-learn 1.0"

In [None]:
sklearn.set_config(display='diagram')

Load the data from the previous notebook:

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

housing = fetch_california_housing(as_frame=True)
X, y = housing.data, housing.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

## Make a pipeline!

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

knr = make_pipeline(
    StandardScaler(), KNeighborsRegressor()
)
knr.fit(X_train, y_train)

In [None]:
knr.score(X_test, y_test)
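
To make clearer what the pipeline above does, here is a minimal sketch comparing it with the manual steps it replaces. It runs on synthetic data from `make_regression` (an assumption, so the sketch works offline), and the names `X_demo`, `manual_knr`, and `pipe` are illustrative only, not part of the workshop material.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: used instead of the housing data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0
)

# Manual version: fit the scaler on the training data only,
# then reuse the fitted scaler to transform the test data.
scaler = StandardScaler().fit(Xd_train)
manual_knr = KNeighborsRegressor().fit(scaler.transform(Xd_train), yd_train)
manual_score = manual_knr.score(scaler.transform(Xd_test), yd_test)

# Pipeline version: the same two steps behind a single fit/score call.
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
pipe.fit(Xd_train, yd_train)
pipe_score = pipe.score(Xd_test, yd_test)

# Both approaches should give the same R^2 score.
print(np.isclose(manual_score, pipe_score))
```

With the pipeline, the scaler is fitted only on the data passed to `fit`, so the test set cannot accidentally leak into the preprocessing step.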

In [None]:
from sklearn.preprocessing import SplineTransformer

knr_spline = make_pipeline(
    StandardScaler(),
    SplineTransformer(),
    KNeighborsRegressor()
)
knr_spline.fit(X_train, y_train)

In [None]:
knr_spline.score(X_test, y_test)

## Exercise 1

1. Load the cancer dataset from `sklearn.datasets` using the `load_breast_cancer` function.
2. Is this a classification or regression problem?
3. Split the data into training and test datasets. (Use `random_state=0`)
4. Create a pipeline with a `StandardScaler` and a `sklearn.linear_model.LogisticRegression`.
5. Evaluate the performance of this pipeline on the test dataset.
6. **Extra**: Add a `sklearn.preprocessing.PolynomialFeatures` to the pipeline and see if the performance improves.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

**If you are running locally**, you can uncomment the following cell to load the solution into the cell. On **Google Colab**, [see solution here](https://github.com/thomasjpfan/ml-workshop-intro/blob/master/notebooks/solutions/04-ex01-solution.py). 

In [None]:
# %load solutions/04-ex01-solution.py