# Pipelines

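This notebook introduces pipelines in scikit-learn, which chain preprocessing steps and a final estimator into a single object that can be fit and scored like any other model.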

In [None]:
import sklearn
sklearn.set_config(display='diagram')

Load the Boston housing data used in the previous notebook and split it into training and test sets.

In [None]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

boston = fetch_openml(data_id=531, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, random_state=42)

## Make a pipeline

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

knr = make_pipeline(
    StandardScaler(), KNeighborsRegressor()
)
knr.fit(X_train, y_train)

In [None]:
knr.score(X_test, y_test)
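
To see what the pipeline is doing, here is a minimal sketch that compares it with the same steps written out by hand. It uses synthetic data from `make_regression` rather than the housing data, and the `*_demo` / `Xd_*` names are made up for this illustration so they do not overwrite the variables above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration, not the notebook's dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=42)

# Pipeline version: scaling and regression are fit in a single call
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
pipe.fit(Xd_train, yd_train)
pipe_score = pipe.score(Xd_test, yd_test)

# Manual version: fit the scaler on the training data only,
# then apply the same fitted scaler to both training and test data
scaler = StandardScaler().fit(Xd_train)
knn = KNeighborsRegressor().fit(scaler.transform(Xd_train), yd_train)
manual_score = knn.score(scaler.transform(Xd_test), yd_test)

# make_pipeline names each step after its lowercased class name
print(list(pipe.named_steps))
print(np.isclose(pipe_score, manual_score))
```

Both versions should give the same score. The pipeline keeps the "fit on train, transform on test" bookkeeping in one place, which makes it harder to leak test data into preprocessing by accident.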

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

knr_select = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=8),
    KNeighborsRegressor()
)
knr_select.fit(X_train, y_train)

In [None]:
knr_select.score(X_test, y_test)
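
Individual steps of a fitted pipeline can be inspected afterwards. The sketch below looks up the selection step by name and checks which columns it kept. It again uses synthetic data, and `k=3` plus the `select_pipe` and `X_sel` names are choices made only for this illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration)
X_sel, y_sel = make_regression(
    n_samples=200, n_features=10, n_informative=3, random_state=0)

select_pipe = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=3),
    KNeighborsRegressor()
)
select_pipe.fit(X_sel, y_sel)

# Steps are accessible by their auto-generated lowercase names
selector = select_pipe.named_steps["selectkbest"]

# Boolean mask over the input columns: True where a feature was kept
mask = selector.get_support()
print(mask)

# Slicing off the final estimator gives a transformer-only pipeline,
# which shows the data as the regressor receives it
X_reduced = select_pipe[:-1].transform(X_sel)
print(X_reduced.shape)
```

The same pattern should work on `knr_select` above, where the mask would have one entry per column of `X_train`.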

## Exercise 1

1. Load the breast cancer dataset from `sklearn.datasets` using the `load_breast_cancer` function.
2. Split the data into training and test sets.
3. Create a pipeline with a `StandardScaler` and a `sklearn.linear_model.LogisticRegression`.
4. Evaluate the performance of this pipeline on the test dataset.
5. **Extra**: Add a `sklearn.preprocessing.PolynomialFeatures` to the pipeline and see if the performance improves.

In [None]:
# %load solutions/04-ex01-solution.py