# Supervised learning with scikit-learn

In this notebook, we review scikit-learn's API for training a model.

<a href="https://colab.research.google.com/github/thomasjpfan/ml-workshop-intro/blob/master/notebooks/02-supervised-learning.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

In [None]:
# Install dependencies when running on Google Colab
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    %pip install -r https://raw.githubusercontent.com/thomasjpfan/ml-workshop-intro/master/requirements.txt

In [None]:
import sklearn
assert sklearn.__version__.startswith("1.0"), "Please install scikit-learn 1.0"

In [None]:
from sklearn.datasets import fetch_openml

blood = fetch_openml('blood-transfusion-service-center', as_frame=True)

In [None]:
blood.frame.head()

In [None]:
X, y = blood.data, blood.target

In [None]:
X.head()

In [None]:
y.head()

In [None]:
y.value_counts(normalize=True)

## Split Data

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0
)

In [None]:
X_train.head()

In [None]:
y_train.value_counts(normalize=True)

In [None]:
y_test.value_counts(normalize=True)

### Stratify!

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

In [None]:
y_train.value_counts(normalize=True)

In [None]:
y_test.value_counts(normalize=True)
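
To see what `stratify` changes in isolation, here is a minimal sketch on made-up labels (the 76/24 class split, `X_demo`, and `y_demo` are assumptions for illustration, not the blood transfusion data). With stratification, the positive rate in both splits should stay close to the overall rate, while a plain random split is free to drift.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 76 negatives and 24 positives.
y_demo = np.array([0] * 76 + [1] * 24)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# Plain random split: class proportions may differ between train and test.
_, _, y_tr_plain, y_te_plain = train_test_split(X_demo, y_demo, random_state=0)

# Stratified split: class proportions are preserved as closely as possible.
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

print("overall positive rate:   ", y_demo.mean())
print("plain split (train/test):", y_tr_plain.mean(), y_te_plain.mean())
print("stratified  (train/test):", y_tr_strat.mean(), y_te_strat.mean())
```

Stratifying matters most when a class is rare, because a plain split can leave the test set with too few examples of it to give a reliable score.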

## scikit-learn API

In [None]:
from sklearn.linear_model import Perceptron

In [None]:
percept = Perceptron()

In [None]:
percept.fit(X_train, y_train)

In [None]:
percept.predict(X_train)

In [None]:
y_train

In [None]:
percept.score(X_train, y_train)

In [None]:
percept.score(X_test, y_test)
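
For classifiers, `score` reports mean accuracy: the fraction of samples whose prediction matches the true label. The sketch below checks this by hand on a synthetic dataset (the `make_classification` data and the `*_demo` names are assumptions for illustration, not part of the workshop data).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem, used only for illustration.
X_demo, y_demo = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

clf = Perceptron(random_state=0)
clf.fit(X_tr, y_tr)        # learn from the training data
y_pred = clf.predict(X_te)  # predict labels for unseen data

# Accuracy computed by hand should match what `score` returns.
manual_accuracy = np.mean(y_pred == y_te)
print(manual_accuracy, clf.score(X_te, y_te))
```

The same `fit`, `predict`, and `score` calls work for the other estimators used in this notebook.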

## Another estimator

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf = RandomForestClassifier()

In [None]:
rf.fit(X_train, y_train)

In [None]:
rf.score(X_train, y_train)

In [None]:
rf.score(X_test, y_test)

## Are these results any good?

In [None]:
from sklearn.dummy import DummyClassifier

In [None]:
dc = DummyClassifier()
dc.fit(X_train, y_train)
dc.score(X_test, y_test)
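
A dummy classifier ignores the features, so its score is a floor that any useful model should beat. As a rough sketch with made-up labels (the class counts and `*_demo` names are assumptions, and `strategy="most_frequent"` is set explicitly rather than relying on the default), always predicting the majority class scores exactly the share of that class in the test set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: class 0 is the majority in both splits.
y_train_demo = np.array([0] * 57 + [1] * 18)
y_test_demo = np.array([0] * 19 + [1] * 6)

# The features are never used by the dummy model, so zeros are enough.
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)

baseline_score = baseline.score(X_test_demo, y_test_demo)
majority_share = np.mean(y_test_demo == 0)  # share of class 0 in the test set
print(baseline_score, majority_share)
```

On imbalanced data this baseline can look deceptively high, which is why a model's accuracy should be read relative to it.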

## Exercise 1 

1. Import `sklearn.linear_model.LogisticRegression` and evaluate its performance on the above dataset.
2. How does its test performance compare to the models we already looked at?

In [None]:
from sklearn.linear_model import LogisticRegression

**If you are running locally**, you can uncomment the line in the following cell and run it to load the solution into the cell. On **Google Colab**, [see solution here](https://github.com/thomasjpfan/ml-workshop-intro/blob/master/notebooks/solutions/02-ex1-solution.py).

In [None]:
# %load solutions/02-ex1-solution.py

## Exercise 2

1. Load the wine dataset from the `sklearn.datasets` module using the `load_wine` function with `as_frame=True`.
2. Split it into a training and test set using `train_test_split`. (**Hint**: Use `stratify` and set `random_state=42`)
3. Train `sklearn.neighbors.KNeighborsClassifier`, `sklearn.ensemble.RandomForestClassifier` and `sklearn.linear_model.LogisticRegression` on the wine dataset.
    - **You can ignore warnings here. We will cover them in the next section.**
4. How do they perform on the test set?

In [None]:
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

**If you are running locally**, you can uncomment the line in the following cell and run it to load the solution into the cell. On **Google Colab**, [see solution here](https://github.com/thomasjpfan/ml-workshop-intro/blob/master/notebooks/solutions/02-ex2-solution.py).

In [None]:
# %load solutions/02-ex2-solution.py