# Quick Review of scikit-learn

<a href="https://colab.research.google.com/github/thomasjpfan/ml-workshop-intermediate-1-of-2/blob/master/notebooks/00-review-sklearn.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

In [None]:
# Install dependencies when running on Google Colab
import sys
if 'google.colab' in sys.modules:
    %pip install -r https://raw.githubusercontent.com/thomasjpfan/ml-workshop-intermediate-1-of-2/master/requirements.txt

In [None]:
import sklearn
assert sklearn.__version__.startswith("1.0"), "Please install scikit-learn 1.0"

In [None]:
import seaborn as sns
sns.set_theme(context="notebook", font_scale=1.2,
              rc={"figure.figsize": [10, 6]})
sklearn.set_config(display="diagram")

In [None]:
from sklearn.datasets import fetch_openml

steel = fetch_openml(data_id=1504, as_frame=True)

In [None]:
print(steel.DESCR)

In [None]:
_ = steel.data.hist(figsize=(30, 15), layout=(5, 8))

### Split Data

In [None]:
from sklearn.model_selection import train_test_split
X, y = steel.data, steel.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y)
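
The effect of `stratify=y` can be checked on a small synthetic example. The sketch below uses made-up arrays (`X_toy`, `y_toy` are illustrative names, not part of the steel dataset) with an 80/20 class balance and confirms that both splits keep the same share of the minority class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption for illustration): 80 samples of class 0, 20 of class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# Default test_size is 0.25, so 75 training and 25 test samples
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy)

# Fraction of class 1 in each split; stratification keeps both at 0.2
train_frac = y_tr.mean()
test_frac = y_te.mean()
print(train_frac, test_frac)
```

Without `stratify`, the class proportions in each split would vary with the random seed, which matters most when one class is rare.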

### Train DummyClassifier

In [None]:
from sklearn.dummy import DummyClassifier

dc = DummyClassifier(strategy='prior').fit(X_train, y_train)
dc.score(X_test, y_test)
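
To see what the `'prior'` strategy does, here is a minimal sketch on made-up labels (`X_toy` and `y_toy` are illustrative, not the steel data). The classifier ignores the features, always predicts the most frequent training class, and reports the training class frequencies as probabilities, so its accuracy equals the share of the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data (assumption for illustration): features carry no information
X_toy = np.zeros((10, 1))
y_toy = np.array([0] * 7 + [1] * 3)

baseline = DummyClassifier(strategy="prior").fit(X_toy, y_toy)

baseline_pred = baseline.predict(X_toy)            # always the majority class
baseline_proba = baseline.predict_proba(X_toy[:1])  # class priors from training
baseline_acc = baseline.score(X_toy, y_toy)         # share of the majority class

print(baseline_pred, baseline_proba, baseline_acc)
```

This is why the dummy score is a useful floor: any real model should beat it.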

### Train KNN-based model

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier

knc = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier()
)
knc.fit(X_train, y_train)

In [None]:
knc.score(X_test, y_test)
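
As a rough illustration of what the pipeline does, the sketch below compares a pipeline with the equivalent manual steps on synthetic data from `make_classification` (the dataset and the names `pipe`, `scaler`, and `knn` are assumptions for this example). The scaler is fit on the training set only, and the same fitted scaler is then applied to the test set before prediction.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=0, stratify=y_toy)

# Pipeline: scaling and classification in one estimator
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)

# Manual equivalent: fit the scaler on training data, reuse it on test data
scaler = StandardScaler().fit(X_tr)
knn = KNeighborsClassifier().fit(scaler.transform(X_tr), y_tr)
manual_pred = knn.predict(scaler.transform(X_te))

same = np.array_equal(pipe.predict(X_te), manual_pred)
step_names = list(pipe.named_steps)  # make_pipeline names steps after their classes
print(same, step_names)
```

The pipeline saves the bookkeeping and makes it harder to leak test-set statistics into the scaler by accident.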

## Exercise 1

1. Load the Wisconsin breast cancer dataset from `sklearn.datasets.load_breast_cancer`.
2. Are the labels imbalanced? (**Hint**: `value_counts`)
3. Split the data into a training and test set.
4. Create a pipeline with a `StandardScaler` and `LogisticRegression` and fit on the training set.
5. Evaluate the pipeline on the test set.
6. **Extra**: Use `sklearn.metrics.f1_score` to compute the F1 score on the test set.
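
The two hints above can be tried on toy data before you start the exercise. The sketch below is not the solution: it uses made-up labels and predictions (`labels`, `y_true`, `y_pred` are illustrative) to show how `value_counts` reports class shares and how `f1_score` is called.

```python
import pandas as pd
from sklearn.metrics import f1_score

# Toy labels (assumption for illustration): six positives, two negatives
labels = pd.Series([1, 1, 1, 0, 1, 0, 1, 1])

# normalize=True returns class shares instead of raw counts
class_share = labels.value_counts(normalize=True)
print(class_share)

# Toy predictions with one false negative and one false positive
y_true = [1, 1, 1, 0, 1, 0, 1, 1]
y_pred = [1, 1, 0, 0, 1, 1, 1, 1]

# F1 is the harmonic mean of precision and recall for the positive class
toy_f1 = f1_score(y_true, y_pred)
print(toy_f1)
```

Here precision and recall are both 5/6, so the F1 score is also 5/6.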

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

**If you are running locally**, you can uncomment the following cell to load the solution into the cell. On **Google Colab**, [see solution here](https://github.com/thomasjpfan/ml-workshop-intermediate-1-of-2/blob/master/notebooks/solutions/00-ex01-solutions.py). 

In [None]:
# %load solutions/00-ex01-solutions.py