# Introduction to Scikit-Learn Workshop 14th August 2019

![title](images/pydata_cardiff.jpg)

## Outline and Main Aims

This workshop will provide a basic introduction to using the API in the [scikit-learn](https://scikit-learn.org/stable/) Python machine learning library. The following topics will be demonstrated:

* How the API is used to fit a model - and predict outputs from new data points
* The difference between regression and classification problems
* How to visualise the model outputs
* Examples of using metrics to assess model performance
* Methods for post-hoc examination of features in the model

For the visualisations and post-hoc analysis, examples will be shown using additional machine learning libraries.

### Expectation Management

Machine learning is a __massive__ topic! This cannot be stressed enough! Many aspects of the field are still under active research, and this will most likely continue in the coming decades. This workshop is intended to provide an introduction for new users, as well as a whirlwind tour of some additional techniques and libraries that you can use alongside scikit-learn.

The important thing to remember is that machine learning, together with other methods (or definitions) of Artificial Intelligence and more traditional statistical learning methods, is a lot more involved than just installing a library and hitting 'GO'. All of these techniques are meant to work alongside background research and domain expertise in any particular field.

Here is a brief list of topics that cannot be covered today:

* Detailed description of feature engineering
    * Some basic examples will be given - but this is, in itself, a massive area of ongoing study
* An in depth explanation of the mathematics behind the algorithms
    * Some basic intuition is given, but this material is aimed at a practical level
    * It is important to realise the strengths, weaknesses, and assumptions about any technique used
    * The [scikit-learn user-guide](https://scikit-learn.org/stable/user_guide.html) documentation pages are a fantastic source of information
* Domain specific interpretation of models (very important disclaimer):
    * Different fields require different outcomes of models:
        * Internet advertising - get the message out to as many as you can afford and use the model to pick the best subset. The _confetti_ approach
        * Clinical trials and drug research - you need to be far more certain of any outcome


# Introduction to Scikit-Learn

Points to mention:

* Single feature linear regression
* Single feature logistic regression
* Make blobs linearly separable
* Make moons/circles - non-linearly separable
    * Find a way to visualise these decision boundaries
* Make an overlapping dataset and overfit massively
* Show how a CV procedure can ameliorate this
* Put this into a pipeline with feature normalisation
    * This can be done by transforming the moons first
* If time - show a multi class classification
* Give an example of a Random Forest:
    * Show how the SHAP library works https://github.com/slundberg/shap
    

In [None]:
# Getting a wider notebook

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

#### Getting the basic imports - others will be added when introduced

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from copy import deepcopy
# from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline

In [None]:
sns.set_style('darkgrid')

## Compulsory _Linear Regression_ example

$y = m \cdot x + c$

While using machine learning might seem like overkill here, when more basic statistical approaches would suffice, it is a good first example to show how the models can recover parameters that we _know_ to be correct - because we have made them!

#### Data Creation

Here, we will make some artificial data where the data points broadly fit along a line on a 2D graph, made from the following:

* The __slope__ of the line _m_
* The __intercept__ - the point at which the line crosses the y-axis _c_
* Some added Gaussian noise (normally distributed)

In [None]:
np.random.seed(22)
noise = np.random.randn(200) * 2

intercept = -1.7
slope = 3.4

x_lin_reg = np.linspace(-2, 10, 200)
y_lin_reg = intercept + slope * x_lin_reg + noise

#### Data and Targets

This has provided us with our input data and desired outputs. This task is an example of __supervised__ learning, whereby the model is given the intended outcome for every data point. Its aim is to learn the relationship, so that other data points that do not have labels can be assessed after the model is trained.

The aim is to __use the x values to predict the y values__ - in other words, we want to recover the intercept and slope, as we can then use them to compute a y value from any x value.

In [None]:
fig, ax = plt.subplots(figsize=(16, 8))
ax.scatter(x_lin_reg, y_lin_reg);

## Importing the model type

Here we use the `LinearRegression` model from the `linear_model` module. scikit-learn relies heavily on the concept of _object oriented programming_. The model that we import is a `class` that bundles together a range of procedures that can be carried out on some data.

When we use this model, we create an __instance__ of the class, and when creating this instance, we can provide some instructions on how it should behave.

Each model has specific instructions and functionality, and all of these can be seen in the documentation.
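
As a minimal sketch of this idea (not one of the workshop's own cells), the instructions given when creating an instance can be read back with `get_params()`, which scikit-learn estimators provide. The variable names below are illustrative only.

```python
from sklearn.linear_model import LinearRegression

# Create an instance and give it an instruction: do not fit an intercept
no_intercept_model = LinearRegression(fit_intercept=False)

# get_params() returns the settings of this particular instance as a dict
params = no_intercept_model.get_params()
print(params["fit_intercept"])
```

Each instance keeps its own settings, so several differently configured models can exist side by side.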

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# LinearRegression?

In this instance we want the intercept - this can be specified explicitly, but it is the default

In [None]:
linear_model = LinearRegression()

# The following is exactly the same!
# linear_model = LinearRegression(fit_intercept=True)

## Note that the inputs to the model _must_ be a 2D array

Although this might seem strange for a single feature input - it is nevertheless necessary.

There are 2 main ways to do this in `numpy`:

* Using `reshape` - this reshapes the array as (`n_rows`, `n_columns`) - the -1 is telling `numpy` - 'use as many as you need'
* The `newaxis` method using a slicing operation

Note that this notebook will use capitals for variables that are in 2D
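
The two approaches can be compared on a small toy array. This is a sketch with made-up values (`x_small` is not part of the workshop data), intended only to show the shapes involved.

```python
import numpy as np

# A 1D array with 5 values - shape (5,)
x_small = np.arange(5)

# Both approaches add a second dimension, giving 5 rows and 1 column
X_reshape = x_small.reshape(-1, 1)
X_newaxis = x_small[:, np.newaxis]

print(x_small.shape, X_reshape.shape, X_newaxis.shape)

# The two results hold exactly the same values
same = np.array_equal(X_reshape, X_newaxis)
print(same)
```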

In [None]:
# Two equivalent ways of adding the second dimension - either line alone is enough
X_lin_reg = x_lin_reg.reshape(-1, 1)
X_lin_reg = x_lin_reg[:, np.newaxis]

In [None]:
X_lin_reg

In [None]:
linear_model.fit(X_lin_reg, y_lin_reg)

## Viewing recovered parameters

In [None]:
linear_model.intercept_

In [None]:
linear_model.coef_

## Viewing model output

Note that the output now contains the signal without the noise - that's ultimately what we want!
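
The same `predict` call also works on data points the model has never seen. The sketch below rebuilds similar data so that it runs on its own. The new x values (`X_new`) are arbitrary choices for illustration, and the random number generator is seeded separately from the cells above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Recreate a noisy line with a known intercept and slope
rng = np.random.RandomState(22)
x = np.linspace(-2, 10, 200)
y = -1.7 + 3.4 * x + rng.randn(200) * 2

model = LinearRegression().fit(x.reshape(-1, 1), y)

# New, unseen x values - these must also be 2D
X_new = np.array([[0.0], [5.0], [12.0]])
y_new = model.predict(X_new)

# The noise-free values that the known line would give
y_true = -1.7 + 3.4 * X_new.ravel()
print(y_new)
print(y_true)
```

The predictions should sit close to the noise-free values, since the fitted slope and intercept are close to the ones used to create the data.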

In [None]:
y_pred_lin_reg = linear_model.predict(X_lin_reg)

In [None]:
fig, ax = plt.subplots(figsize=(16, 8))
ax.scatter(x_lin_reg, y_lin_reg)
ax.scatter(x_lin_reg, y_pred_lin_reg, c='g');

# Classification problems

The previous model was used to fit a _regression_ problem, whereby the target variable contains a large range of values - it is _continuous_ in nature. However, in many cases, we want to predict a class - or membership of a particular group.

The first example will be that of _binary_ classification - either 0 or 1. We'll make some random data. If x > 3, then y = 1, otherwise y = 0.

Note that one ambiguous data point is being added

In [None]:
np.random.seed(12)

x_cat = np.random.uniform(low=-2, high=8, size=200)
y_cat = np.where(x_cat > 3, 1, 0)

x_cat = np.append(x_cat, 3.5)
y_cat = np.append(y_cat, 0)

In [None]:
fig, ax = plt.subplots(figsize=(16, 8))
ax.scatter(x_cat, y_cat);

## Trying out Linear Regression

Remember to add the dimension!

In [None]:
X_cat = x_cat.reshape(-1, 1)

In [None]:
linear_model.fit(X_cat, y_cat)

In [None]:
y_cat_pred = linear_model.predict(X_cat)

In [None]:
fig, ax = plt.subplots(figsize=(16, 8))
ax.scatter(x_cat, y_cat)
ax.plot(x_cat, y_cat_pred, c='r');

# Getting a better model

The most basic model for use here is called __Logistic Regression__. But why is it called regression? This can be very confusing! It is because the same mathematical methods are used to fit a line, but the output is transformed using the function below - this can then be used for classification

### Logistic Regression function

$\frac{1}{1 + e^{-x}}$

But remember this (as it confused me when I was starting out): __Logistic Regression is used for CLASSIFICATION!!!__

We'll now visualise how this function looks

In [None]:
def sigmoid(ar):
    return 1 / (1 + np.exp(-ar))

In [None]:
x_sig = np.linspace(-5, 8, 200)
y_sig = sigmoid(x_sig)

In [None]:
fig, ax = plt.subplots(figsize=(16, 8))
ax.plot(x_sig, y_sig, linewidth=3);

### How this works for classification

We still can map x to y - but now, if y is > 0.5 we classify it as 1, and 0 otherwise
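
As a small sketch of this thresholding step, the function can be applied to a handful of made-up x values (`x_demo` is illustrative only) and the result cut at 0.5.

```python
import numpy as np

def sigmoid(ar):
    return 1 / (1 + np.exp(-ar))

# A few illustrative inputs either side of zero
x_demo = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
probs = sigmoid(x_demo)

# Anything strictly above 0.5 becomes class 1, everything else class 0
labels = (probs > 0.5).astype(int)
print(probs)
print(labels)
```

Note that an input of exactly 0 gives 0.5, which is not above the threshold, so it is assigned to class 0 here.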

# Logistic Regression - with slope and intercept

We can still add parameters to this function

$\frac{1}{1 + e^{-slope \cdot (x - intercept)}}$

In [None]:
def sigmoid_extra(ar, slope, intercept):
    return 1 / (1 + np.exp(-slope * (ar - intercept)))

In [None]:
x_sig_extra = np.linspace(-5, 8, 200)
y_sig_extra = sigmoid_extra(x_sig_extra, 5, 2)

In [None]:
fig, ax = plt.subplots(figsize=(16, 8))
ax.plot(x_sig_extra, y_sig_extra, linewidth=3);

# Importing and using the model

Note that the `solver` is being specified here. This has been chosen by looking at the documentation, which describes it as a good choice for small datasets.

As mentioned in the introduction cells, we won't go into the maths behind this here. But importantly: __don't worry__ - just read through the documentation and don't get bewildered with this! The scikit-learn developers have done a fantastic job of making their software usable.
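
To connect the fitted model back to the function shown earlier, here is a sketch on similar toy data (regenerated so that it runs on its own; the variable names are illustrative). For a binary problem, the fitted model computes the sigmoid of `coef_ * x + intercept_`, so the point where the probability crosses 0.5 is at `-intercept_ / coef_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: class 1 when x > 3, class 0 otherwise
rng = np.random.RandomState(12)
x = rng.uniform(low=-2, high=8, size=200)
y = (x > 3).astype(int)
X = x.reshape(-1, 1)

model = LogisticRegression(solver='liblinear').fit(X, y)

# Pull out the single weight and the intercept as plain numbers
w = model.coef_[0, 0]
b = model.intercept_[0]

# Probability of class 1, computed by hand and by the model
manual = 1 / (1 + np.exp(-(w * x + b)))
from_model = model.predict_proba(X)[:, 1]
print(np.allclose(manual, from_model))

# The x value where the probability is exactly 0.5
boundary = -b / w
print(boundary)
```

The boundary will not necessarily be exactly 3, as the model is regularised by default.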

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
log_reg = LogisticRegression(solver='liblinear')

In [None]:
X_cat = x_cat.reshape(-1, 1)

In [None]:
log_reg.fit(X_cat, y_cat)

In [None]:
y_cat_pred = log_reg.predict(X_cat)

In [None]:
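# sigmoid_extra expects the midpoint of the curve, which is -intercept_ / coef_ for the fitted model
sig_fitted = sigmoid_extra(x_sig, log_reg.coef_[0], -log_reg.intercept_ / log_reg.coef_[0])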

In [None]:
sns.set_style('darkgrid')

In [None]:
fig, ax = plt.subplots(figsize=(16, 8))
ax.scatter(x_cat, y_cat, c='b');
ax.scatter(x_cat, y_cat_pred, c='g')
ax.plot(x_sig, sig_fitted, linewidth=3, c='r');

In [None]:
from yellowbrick.classifier import ROCAUC

In [None]:
# ROCAUC?

In [None]:
visualiser = ROCAUC(LogisticRegression(solver='liblinear'), micro=False, macro=False, per_class=True)

In [None]:
visualiser.fit(X_cat, y_cat)
visualiser.score(X_cat, y_cat)
# show() replaces the older poof() name in yellowbrick 1.0 and later
g = visualiser.show()

In [None]:
x_cat2 = deepcopy(x_cat)

In [None]:
x_cat2[y_cat==1] -= 1.8
X_cat2 = x_cat2.reshape(-1, 1)

In [None]:
log_reg.fit(X_cat2, y_cat)
# As before, the midpoint of the fitted curve is -intercept_ / coef_
sig_fitted2 = sigmoid_extra(x_sig, log_reg.coef_[0], -log_reg.intercept_ / log_reg.coef_[0])

In [None]:
fig, ax = plt.subplots(figsize=(16, 8))
ax.scatter(x_cat2, y_cat)
ax.plot(x_sig, sig_fitted2, c='r')

In [None]:
visualiser = ROCAUC(LogisticRegression(solver='liblinear'), macro=False, micro=False)
visualiser.fit(X_cat2, y_cat)
visualiser.score(X_cat2, y_cat)
# show() replaces the older poof() name in yellowbrick 1.0 and later
g = visualiser.show()

In [None]:
from sklearn.datasets import make_blobs

In [None]:
X, y = make_blobs(n_samples=300, n_features=2, centers=((1, 1), (5, 5)), cluster_std=1)

In [None]:
trm = np.array([[1, -2], [-2, 1]])

In [None]:
X = X @ trm

In [None]:
sns.set_palette('Set1')

In [None]:
fig, ax = plt.subplots(figsize=(7, 7))
ax.scatter(X[:, 0], X[:, 1], c=y)
ax.axis('equal');

In [None]:
log_reg2d = LogisticRegression(solver='liblinear')

In [None]:
log_reg2d.fit(X, y)

In [None]:
y_pred = log_reg2d.predict(X)

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(X[:, 0], X[:, 1], c=y_pred)
ax.axis('equal');

In [None]:
y_prob = log_reg2d.predict_proba(X)[:, 1]

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(X[:, 0], X[:, 1], c=y_prob)
ax.axis('equal');

In [None]:
from mlxtend.plotting import plot_decision_regions

In [None]:
from sklearn.svm import SVC

In [None]:
X, y = make_blobs(n_samples=300, n_features=2, centers=((1, 1), (5, 5)), cluster_std=0.6)
X = X @ trm

In [None]:
svm_blobs = SVC(kernel='linear')

In [None]:
svm_blobs.fit(X, y)

In [None]:
y_svm_pred = svm_blobs.predict(X)

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(X[:, 0], X[:, 1], c=y_svm_pred)
ax.axis('equal');

In [None]:
from sklearn.datasets import make_circles

In [None]:
X, y = make_circles(n_samples=500, noise=0.05, factor=0.6)

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(X[:, 0], X[:, 1], c=y)
ax.axis('equal');

In [None]:
circle_df = pd.DataFrame(data=X, columns=['x1', 'x2'])
# Despite the column name, this is the distance of each point from the origin (the radius)
circle_df['squared'] = np.sqrt(circle_df['x1']**2 + circle_df['x2']**2)

In [None]:
circle_df.head()

In [None]:
svm_circle = SVC(kernel='linear')

In [None]:
svm_circle.fit(circle_df, y)

In [None]:
y_linear_pred = svm_circle.predict(circle_df)

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(X[:, 0], X[:, 1], c=y_linear_pred)
ax.axis('equal');

In [None]:
svm_rbf = SVC(gamma='scale')

In [None]:
svm_rbf.fit(X, y)

In [None]:
y_rbf_pred = svm_rbf.predict(X)

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(X[:, 0], X[:, 1], c=y_rbf_pred)
ax.axis('equal');

In [None]:
from sklearn.datasets import make_moons

In [None]:
X_moons, y_moons = make_moons(n_samples=500, noise=0.2, random_state=27)

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons);

In [None]:
svm_moons_1 = SVC(C=1, gamma='scale')

In [None]:
svm_moons_1.fit(X_moons, y_moons)

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))
plot_decision_regions(X_moons, y_moons, svm_moons_1);

In [None]:
svm_moons_100 = SVC(C=100, gamma=100)

In [None]:
svm_moons_100.fit(X_moons, y_moons)

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))
plot_decision_regions(X_moons, y_moons, svm_moons_100);

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))
plot_decision_regions(X_moons, y_moons, svm_moons_1, zoom_factor=0.1);

## Cross Validation
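
As a minimal sketch of why cross validation is useful, the heavily overfitted settings from above (`C=100`, `gamma=100`) can be scored in two ways: on the same data the model was trained on, and on held-out folds using `cross_val_score`. The data is regenerated here so that the block runs on its own, and the choice of 4 folds is just for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=27)

overfit_svc = SVC(C=100, gamma=100)

# Accuracy on the very data the model was fitted to - overly optimistic
train_acc = overfit_svc.fit(X, y).score(X, y)

# Accuracy averaged over 4 held-out folds - a fairer estimate
cv_acc = cross_val_score(overfit_svc, X, y, cv=4).mean()

print(train_acc, cv_acc)
```

The gap between the two numbers gives a rough indication of how much the model has memorised the training data.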

In [None]:
# The value that gamma='scale' corresponds to: 1 / (n_features * X.var()), with 2 features here
1 / (2 * X_moons.var())

In [None]:
C_values = np.logspace(-1, 3, 5)
C_values

In [None]:
gamma_values = np.logspace(-2, 2, 5)
gamma_values

In [None]:
kernel_values = ['linear', 'rbf']

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
parameters = {'kernel': kernel_values, 'C': C_values, 'gamma': gamma_values}

In [None]:
agnostic_svc = SVC(random_state=12)

In [None]:
clf = GridSearchCV(estimator=agnostic_svc, param_grid=parameters, scoring='roc_auc', cv=4, n_jobs=-1)

In [None]:
clf.fit(X_moons, y_moons)

In [None]:
clf.best_score_

In [None]:
clf.best_params_

In [None]:
X_moons_skewed = deepcopy(X_moons)
X_moons_skewed[:, 1] *= 1e-6

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X1, X2, y1, y2 = train_test_split(X_moons_skewed, y_moons)

In [None]:
svm_moons_1.fit(X1, y1)

In [None]:
y_pred = svm_moons_1.predict(X2)

In [None]:
y2

In [None]:
y_pred

In [None]:
from sklearn.metrics import roc_auc_score, accuracy_score

In [None]:
accuracy_score(y2, y_pred)
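
One of the points listed at the start of this section was putting the model into a pipeline with feature normalisation. A minimal sketch of that idea is below, assuming `StandardScaler` as the normalisation step (the notebook does not say which scaler was intended). The data and the skew are regenerated so that the block runs on its own, and `random_state=0` is an arbitrary choice.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=27)

# Shrink the second feature so that it is tiny compared with the first
X_skewed = X.copy()
X_skewed[:, 1] *= 1e-6

X_train, X_test, y_train, y_test = train_test_split(X_skewed, y, random_state=0)

# The same model, with and without scaling the features first
plain = SVC(C=1, gamma='scale').fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), SVC(C=1, gamma='scale')).fit(X_train, y_train)

plain_acc = plain.score(X_test, y_test)
scaled_acc = scaled.score(X_test, y_test)
print(plain_acc, scaled_acc)
```

The pipeline learns the scaling from the training data only and applies it again at prediction time, so the test set does not leak into the fit.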