# Chapter 6: Learning best practices for model evaluation and hyperparameter tuning

## Streamlining workflows with pipelines

Pipelines allow fitting a model together with an arbitrary number of transformation steps, and then applying those same steps when predicting on new data.

### Reading in the data

1. Get the data

In [3]:
import pandas as pd
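# The raw UCI wdbc.data file has no header row, so header=None keeps the first sample as data
df = pd.read_csv('wdbc.data', header=None)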

2. Use `LabelEncoder` to convert the class labels into integers

(M = malignant, B = benign; after encoding, B maps to 0 and M maps to 1)

In [5]:
from sklearn.preprocessing import LabelEncoder 
X = df.iloc[:, 2:].values  # columns 2 onward: the 30 features
y = df.iloc[:, 1].values   # column 1: diagnosis label ('M' or 'B')
le = LabelEncoder()
y = le.fit_transform(y)
le.classes_

array(['B', 'M'], dtype=object)

In [6]:
le.transform(['M', 'B'])

array([1, 0])

3. Split the data into training (80%) and test (20%) sets, stratified by class

In [7]:
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=1)
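
To see what `stratify=y` does, the sketch below splits a synthetic label vector with the same class balance as this dataset (357 benign, 212 malignant) and compares the class proportions in each subset. The placeholder feature matrix and the `*_demo` names are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the WDBC class balance: 357 benign (0), 212 malignant (1)
y_demo = np.array([0] * 357 + [1] * 212)
# Placeholder features; only the labels matter for this check
X_demo = np.zeros((len(y_demo), 1))

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=1
)

# Fraction of malignant samples overall and in each subset
frac_all = y_demo.mean()
frac_train = y_tr.mean()
frac_test = y_te.mean()
print(f'all: {frac_all:.3f}, train: {frac_train:.3f}, test: {frac_test:.3f}')
```

With stratification the three fractions should be nearly identical; without it, a small test set can end up with a noticeably different class balance.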

### Apply the transforms with a pipeline

The features need to be standardized first. PCA is then used to compress the 30 features into a lower-dimensional subspace (two principal components here).

In [8]:
from sklearn.preprocessing import StandardScaler 
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression 
from sklearn.pipeline import make_pipeline  # builds the pipeline

Make a pipeline that applies the scaler, then PCA dimensionality reduction, and then the model. `make_pipeline` takes an arbitrary number of transformer objects (objects that implement `fit` and `transform`), followed by an estimator that implements `fit` and `predict`. The resulting pipeline object then acts like a "meta-estimator".

In [9]:
pipe_lr = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
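
`make_pipeline` names each step after its lowercased class name, which is how individual steps can be looked up later. A minimal sketch, rebuilding the same pipeline under the assumed name `pipe_demo`:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe_demo = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())

# Each step is a (name, object) pair; names are derived from the class names
step_names = [name for name, _ in pipe_demo.steps]
print(step_names)

# A step can be retrieved by its generated name
pca_step = pipe_demo.named_steps['pca']
print(pca_step.n_components)
```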

`pipe_lr` will now act like a model object for the data:

In [10]:
pipe_lr.fit(X_train, y_train)
y_pred = pipe_lr.predict(X_test)
test_acc = pipe_lr.score(X_test, y_test)
print(f'Test accuracy: {test_acc:.3f}')

Test accuracy: 0.930


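The pipeline's `fit()` is called on the training data and its `predict()` on the test data. During `fit`, the training data passes through the `fit` and `transform` methods of each transformer in turn, and the final estimator is then fitted on the transformed data. During `predict`, the test data passes only through the `transform` methods, which reuse the parameters learned from the training data, before the estimator makes its prediction.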
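
The flow through a pipeline can be checked by performing the same steps by hand. The sketch below assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`) in place of the local `wdbc.data` file, so its numbers may differ from the cells above; all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled breast cancer data stands in for wdbc.data
X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, test_size=0.20, stratify=y_bc, random_state=1
)

# Pipeline version
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# Manual version: fit and transform on the training data ...
scaler = StandardScaler()
pca = PCA(n_components=2)
clf = LogisticRegression()
X_tr_pca = pca.fit_transform(scaler.fit_transform(X_tr))
clf.fit(X_tr_pca, y_tr)

# ... but only transform on the test data, reusing the fitted parameters
X_te_pca = pca.transform(scaler.transform(X_te))
manual_pred = clf.predict(X_te_pca)

# Both routes should give the same predictions
print(np.array_equal(pipe_pred, manual_pred))
```

Fitting the scaler and PCA on the training split only is what keeps test-set information out of the learned parameters; the pipeline simply automates that bookkeeping.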

## Using k-fold cross-validation to assess model performance

