# Supervised learning with scikit-learn

In this notebook, we review scikit-learn's API for training a model.

<a href="https://colab.research.google.com/github/thomasjpfan/ml-workshop-intro/blob/master/notebooks/02-supervised-learning.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

In [1]:
# Install dependencies when running on Google Colab
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    %pip install -r https://raw.githubusercontent.com/thomasjpfan/ml-workshop-intro/master/requirements.txt

In [2]:
import sklearn
assert sklearn.__version__.startswith("1.0"), "Please install scikit-learn 1.0"

In [3]:
from sklearn.datasets import fetch_openml

blood = fetch_openml('blood-transfusion-service-center', as_frame=True)

In [4]:
blood.frame.head()

Unnamed: 0,V1,V2,V3,V4,Class
0,2.0,50.0,12500.0,98.0,2
1,0.0,13.0,3250.0,28.0,2
2,1.0,16.0,4000.0,35.0,2
3,2.0,20.0,5000.0,45.0,2
4,1.0,24.0,6000.0,77.0,1


In [5]:
X, y = blood.data, blood.target

In [6]:
X.head()

Unnamed: 0,V1,V2,V3,V4
0,2.0,50.0,12500.0,98.0
1,0.0,13.0,3250.0,28.0
2,1.0,16.0,4000.0,35.0
3,2.0,20.0,5000.0,45.0
4,1.0,24.0,6000.0,77.0


In [7]:
y.head()

0    2
1    2
2    2
3    2
4    1
Name: Class, dtype: category
Categories (2, object): ['1', '2']

In [8]:
y.value_counts(normalize=True)

1    0.762032
2    0.237968
Name: Class, dtype: float64

## Split Data

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0
)

In [10]:
X_train.head()

Unnamed: 0,V1,V2,V3,V4
449,16.0,3.0,750.0,50.0
235,8.0,10.0,2500.0,63.0
325,14.0,2.0,500.0,16.0
77,2.0,2.0,500.0,4.0
363,21.0,13.0,3250.0,57.0


In [11]:
y_train.value_counts(normalize=True)

1    0.780749
2    0.219251
Name: Class, dtype: float64

In [12]:
y_test.value_counts(normalize=True)

1    0.705882
2    0.294118
Name: Class, dtype: float64

### Stratify!

In [13]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

In [14]:
y_train.value_counts(normalize=True)

1    0.761141
2    0.238859
Name: Class, dtype: float64

In [15]:
y_test.value_counts(normalize=True)

1    0.764706
2    0.235294
Name: Class, dtype: float64

## scikit-learn API

In [16]:
from sklearn.linear_model import Perceptron

In [17]:
percept = Perceptron()

In [18]:
percept.fit(X_train, y_train)

Perceptron()

In [19]:
percept.predict(X_train)

array(['2', '1', '2', '1', '1', '1', '1', '1', '2', '1', '1', '2', '1',
       '1', '1', '1', '1', '2', '1', '1', '1', '2', '2', '1', '1', '1',
       '2', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '2', '1', '1', '2', '1', '2', '1', '1', '1', '2', '1',
       '1', '1', '2', '2', '1', '1', '1', '2', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '2', '1', '2', '1', '1', '1', '2', '1', '1',
       '1', '1', '1', '1', '1', '2', '1', '2', '1', '1', '2', '1', '2',
       '1', '1', '1', '1', '1', '1', '2', '1', '1', '1', '1', '1', '2',
       '2', '1', '1', '2', '1', '2', '1', '1', '1', '1', '1', '2', '1',
       '1', '1', '2', '1', '1', '1', '2', '2', '2', '2', '2', '1', '1',
       '2', '2', '1', '1', '2', '1', '2', '2', '1', '1', '2', '1', '1',
       '2', '1', '2', '1', '1', '1', '1', '1', '1', '1', '2', '1', '1',
       '1', '2', '1', '2', '1', '1', '1', '1', '2', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '2', '1

In [20]:
y_train

511    1
345    1
523    2
273    1
200    1
      ..
393    1
41     1
206    1
12     2
543    2
Name: Class, Length: 561, dtype: category
Categories (2, object): ['1', '2']

In [21]:
percept.score(X_train, y_train)

0.7344028520499108

In [22]:
percept.score(X_test, y_test)

0.679144385026738
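For classifiers, `score` returns the mean accuracy, that is, the fraction of predictions that match the true labels. The sketch below checks this by hand on synthetic data from `make_classification`; the dataset and the `random_state` values are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

clf = Perceptron(random_state=0).fit(X_tr, y_tr)

# Accuracy computed by hand: fraction of correct predictions
manual_acc = np.mean(clf.predict(X_te) == y_te)
# Accuracy as reported by the estimator's score method
api_acc = clf.score(X_te, y_te)
print(manual_acc, api_acc)
```

The two numbers should agree, which is why `score` can be read directly as accuracy in the cells above.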

## Another estimator

In [23]:
from sklearn.ensemble import RandomForestClassifier

In [24]:
rf = RandomForestClassifier()

In [25]:
rf.fit(X_train, y_train)

RandomForestClassifier()

In [26]:
rf.score(X_train, y_train)

0.9429590017825312

In [27]:
rf.score(X_test, y_test)

0.7112299465240641
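Both estimators above were used through the same three calls: construct, `fit`, then `score`. A minimal sketch of that shared interface, assuming synthetic data from `make_classification` rather than the blood transfusion dataset, loops over the two estimators and records their training and test accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

# Synthetic data, used only to illustrate the shared estimator interface
X_demo, y_demo = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

scores = {}
for est in (Perceptron(random_state=0), RandomForestClassifier(random_state=0)):
    est.fit(X_tr, y_tr)  # same call regardless of the model type
    scores[type(est).__name__] = (
        est.score(X_tr, y_tr),  # training accuracy
        est.score(X_te, y_te),  # test accuracy
    )

for name, (train_acc, test_acc) in scores.items():
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f}")
```

Comparing the two numbers per model makes a gap between training and test accuracy easy to spot, like the one the random forest shows above.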

## Are these results any good?

In [28]:
from sklearn.dummy import DummyClassifier

In [29]:
dc = DummyClassifier(strategy='prior')
dc.fit(X_train, y_train)
dc.score(X_test, y_test)

0.7647058823529411
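With `strategy='prior'`, `DummyClassifier` ignores the features and always predicts the most frequent class seen during training, so its test accuracy equals the share of that class in the test labels. The sketch below uses hypothetical label counts and placeholder features to make this visible.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: class "1" is the majority in both splits
y_tr = np.array(["1"] * 76 + ["2"] * 24)
y_te = np.array(["1"] * 19 + ["2"] * 6)
# Placeholder features; the dummy classifier does not use them
X_tr = np.zeros((len(y_tr), 1))
X_te = np.zeros((len(y_te), 1))

dc_demo = DummyClassifier(strategy="prior").fit(X_tr, y_tr)
pred = dc_demo.predict(X_te)

baseline = dc_demo.score(X_te, y_te)
majority_share = np.mean(y_te == "1")
print(baseline, majority_share)
```

This matches the result above: the baseline of about 0.765 is the proportion of class `'1'` in the stratified test set, and neither the perceptron nor the random forest beats it on test accuracy.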

## Exercise 1

1. Import and evaluate the performance of `sklearn.linear_model.LogisticRegression` on the above dataset
2. How does the test performance compare to the ones we already looked at?

In [33]:
# %load https://raw.githubusercontent.com/thomasjpfan/ml-workshop-intro/master/notebooks/solutions/02-ex1-solution.py

## Exercise 2

1. Load the wine dataset from the `sklearn.datasets` module using the `load_wine` function.
2. Split it into a training and test set using `train_test_split`.
3. Train and evaluate `sklearn.neighbors.KNeighborsClassifier`, `sklearn.ensemble.RandomForestClassifier` and `sklearn.linear_model.LogisticRegression` on the wine dataset.
4. How do they perform on the training and test set?
5. Which one is best on the training set and which one is best on the test set?

In [None]:
# %load https://raw.githubusercontent.com/thomasjpfan/ml-workshop-intro/master/notebooks/solutions/02-ex2-solution.py