# The basic workflow

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
#plt.rcParams['figure.figsize'] = [10, 10]

## Explore the data

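Let's load the MNIST dataset, a collection of handwritten digits that you will encounter in a lot of machine learning tutorials. Note that the code below requests `DatasetType.FASHION`; if that resolves to Fashion-MNIST, the images show clothing items rather than digits, but the shape of the data and the labels 0 to 9 are the same.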

If you want to run this notebook, you will need mads-datasets with the torch extra.
You can add it with:
```bash
pdm add "mads-datasets[torch]"
```

If you are reading this, I should already have updated your pyproject.toml file, so you can just run:
```bash
pdm install
```

In [None]:
from mads_datasets import DatasetFactoryProvider, DatasetType
fashiondataset = DatasetFactoryProvider.create_factory(
    DatasetType.FASHION
)

data = fashiondataset.create_dataset()


In [None]:
data.keys()

In [None]:
train = data['train']
test = data['valid']

The shape is (60000, 28, 28) for the train set. This means we have 60000 cases, and every case is a 28x28 matrix. We can visualize a single instance:

In [None]:
idx = 25 #let's have a look at case 25. You can change this to have a look at others
digit, y = train[idx]
img = digit.squeeze().numpy()
plt.imshow(img, cmap='binary')
print(y)

Let's start with trying to predict the cases with number 3 only.

In [None]:
y_train = train.labels.numpy()
X_train = train.data.numpy()
# the validation split was stored as `test` above, so we read from that variable
y_test = test.labels.numpy()
X_test = test.data.numpy()

In [None]:
y_train_single, y_test_single = (y_train == 3, y_test == 3)

np.mean(y_train_single) , np.mean(y_test_single)

Let's check how balanced the dataset is

In [None]:
pd.DataFrame(y_train, columns = ['train']).\
    groupby('train').\
    size().\
    plot.bar()

We can see that 10% of the dataset is a three. This is what you would expect for an evenly distributed set, which the barplot confirms. Now let's reshape the 28x28 matrices to a vector of 28x28=784 numbers.

## Prepare the data

We need to reshape the data, because our model can't handle 2D data.

In [None]:
# the -1 tells reshape to infer that dimension from the total number of values.
# because the first dimension is fixed at 60,000 rows, the second must become 28*28 = 784:
# that is the only way to fit all values into a matrix with 60,000 rows.
X_train = X_train.reshape(X_train.shape[0], -1)
X_test = X_test.reshape(X_test.shape[0], -1)
X_train.shape, X_test.shape

What we are actually doing is reshaping the grid into one long vector. While that might be a weird representation for an image, a classifier works surprisingly well with it.
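
To see what the `-1` does on a scale you can check by hand, here is a minimal sketch with a made-up toy batch (3 "images" of 2x2 pixels, standing in for the real 28x28 data):

```python
import numpy as np

# toy batch: 3 "images" of 2x2 pixels, a stand-in for the (60000, 28, 28) array
batch = np.arange(12).reshape(3, 2, 2)

# keep the first axis (the number of cases) and let numpy infer the rest
flat = batch.reshape(batch.shape[0], -1)

print(flat.shape)  # (3, 4)
print(flat[0])     # [0 1 2 3]: the rows of the first image, one after another
```

Each image row is simply appended to the previous one, so pixels that were vertical neighbours end up far apart in the vector.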

Can you work out what the classifier is doing with this representation? Could you explain the strategy of this approach in plain language, and the logic of why it works? What would be a downside of this approach?

The data ranges from 0 to 255, which is normal for images.

In [None]:
min(X_train[0]), max(X_train[0])

So let's scale the data to make things a bit easier for the model

In [None]:
X_train = X_train / 255
X_test = X_test / 255



# Fitting a model

The basic drill is:

1. make a train-test split
2. explore the data, preprocess where needed
3. select and import a model, set some hyperparameters
4. fit the model
5. evaluate the model

Now, let's see that in code. The most basic syntax is:

``` 
from sklearn import model # import the model
clf = model(parameters) # set parameters
clf.fit(train_X, train_y) # fit on the data
```

And we can predict:

`yhat = clf.predict(test_X)`

and calculate a score with the metric we pick.

Let's try this for a SGDClassifier


In [None]:
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score, accuracy_score


from sklearn.linear_model import SGDClassifier # import the sgd classifier
sgd = SGDClassifier(max_iter=10) # we change the max amount of iterations for speedup
sgd.fit(X_train, y_train_single) # fit the model
yhat = sgd.predict(X_test)  # predict
accuracy_score(y_test_single, yhat) # and score

That's the simplest way to select, fit and predict with a model. Sklearn handles everything about the model. We can tune some hyperparameters, but for now we just used the defaults, except for `max_iter`. This reduces the number of iterations from the default of 1000 to just 10, which speeds things up while we are just testing. Nevertheless, the performance seems to be pretty good (however, there is a catch we will look at in a few moments).



## Evaluate the model
Let's visualize what the model is doing, in terms of its weights. Can you explain what the model is doing, and why?

In [None]:
weights = sgd.coef_.reshape(28, 28)
sns.heatmap(weights, center = 0)

Let's use cross validation to test the performance. Here, we can specify splits of the data. We make 5 different splits, and calculate the average performance. This helps us to reduce the impact of lucky splits.
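
To make the idea concrete, here is a minimal sketch of roughly what happens when you cross-validate a classifier with `cv=5`. The synthetic dataset and the `GaussianNB` model are assumptions chosen only because they are small and deterministic; they are not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# small synthetic binary problem, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = GaussianNB()

# for a classifier and an integer cv, sklearn uses stratified folds
manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_clf = clone(clf)                      # fresh, unfitted copy per fold
    fold_clf.fit(X[train_idx], y[train_idx])   # fit on four parts
    manual_scores.append(fold_clf.score(X[val_idx], y[val_idx]))  # score on the fifth
manual_scores = np.array(manual_scores)

auto_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(manual_scores.mean(), auto_scores.mean())
```

Note that the estimator is cloned for every fold, so whatever was fitted before the call is not reused.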

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(sgd, X_test, y_test_single, cv = 5, scoring='accuracy')

That looks great. But don't cheer too fast... This high percentage is due to the unbalanced dataset.
Let's see how a dummy classifier performs that just picks the most frequent occurrence (in our case: 90% is NOT a three, so the dummy will predict that everything is NOT a three).

In [None]:
from sklearn.dummy import DummyClassifier
dummy_majority = DummyClassifier(strategy='most_frequent').fit(X_train, y_train_single)
cross_val_score(dummy_majority, X_test, y_test_single, cv = 5, scoring='accuracy')

Ouch... Those are pretty high scores too. Maybe we didn't do as well as simply looking at the accuracy seemed to promise.

This should be a lesson about the problems you could encounter when trying to assess performance on an unbalanced dataset.
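
The arithmetic behind this is easy to verify. The sketch below uses made-up labels (100 positives among 1000 cases, mirroring the 10% share of threes) and a "model" that always answers "not a three":

```python
import numpy as np

# hypothetical labels: 100 positives ("three") among 1000 cases
y_true = np.zeros(1000, dtype=bool)
y_true[:100] = True

# a "model" that always predicts "not a three"
y_pred = np.zeros(1000, dtype=bool)

accuracy = float(np.mean(y_true == y_pred))
found = int(np.sum(y_true & y_pred))

print(accuracy)  # 0.9
print(found)     # 0: not a single three was actually found
```

The accuracy equals the share of the majority class, even though the model never finds a positive case.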

In [None]:
# useful for plotting heatmaps of a confusion matrix
def cfm_heatmap(cfm, figsize = (8,8), scale = None, vmin=None, vmax=None):
    """
    figsize: tuple, default (8,8)
    scale: string. The direction over which the numbers are scaled. Either None, 'total', 'rowwise' or 'colwise'
    """

    if (scale == 'total'):
        cfm_norm = cfm / np.sum(cfm)
    elif (scale == 'rowwise'):
        cfm_norm = cfm / np.sum(cfm, axis=1, keepdims=True)
    elif (scale == 'colwise'):
        cfm_norm = cfm / np.sum(cfm, axis=0, keepdims=True)
    else:
        cfm_norm = cfm
    plt.figure(figsize=figsize)
    plot = sns.heatmap(cfm_norm, annot = cfm_norm, vmin=vmin, vmax=vmax)
    plot.set(xlabel = 'Predicted', ylabel = 'Target')


We are going to make a confusion matrix. Now it is much clearer what is going on.

In [None]:
from sklearn.metrics import confusion_matrix, f1_score
yhat_dummy = dummy_majority.predict(X_test)
cfm = confusion_matrix(y_test_single, yhat_dummy)
cfm_heatmap(cfm, scale = 'rowwise')
f1_score(y_test_single, yhat_dummy)

So, what is going on? In the 'Predicted' columns, everything is predicted as 0. All targets with label 0 (not a three) are therefore predicted correctly, which is where the 90% accuracy comes from. But every target with label 1 (in our case, the number three) is also predicted as 'not a three'. A nice way to express this is with the F1-score.

**Precision**: how many of the samples *predicted* as positive are actually positive

$$ Precision = \frac{TP}{TP + FP}$$

**Recall**: how many of *actual* positive samples are indeed predicted as positive

$$ Recall = \frac{TP}{TP + FN}$$

**F-score**: the harmonic mean of precision and recall

$$ F = 2 * \frac{precision * recall}{precision + recall} $$
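
As a small worked example of these three formulas, the sketch below counts TP, FP and FN by hand on made-up labels and compares the result with sklearn's `f1_score`:

```python
from sklearn.metrics import f1_score

# made-up labels: 1 = "three", 0 = "not a three"
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # 3 threes found
fp = sum(t == 0 and p == 1 for t, p in pairs)  # 2 false alarms
fn = sum(t == 1 and p == 0 for t, p in pairs)  # 1 three missed

precision = tp / (tp + fp)  # 3 / 5 = 0.6
recall = tp / (tp + fn)     # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, round(f1, 3))
print(f1_score(y_true, y_pred))  # should agree with the manual value
```

Because the harmonic mean is pulled towards the lower of the two values, a classifier cannot get a high F-score by doing well on only one of them.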

If we look at the f1-score, it is actually zero. So, let's make a confusion matrix of the SGD classifier:

In [None]:
y_test_hat = sgd.predict(X_test)
cfm = confusion_matrix(y_test_single, y_test_hat)
cfm_heatmap(cfm, scale = 'rowwise')
f1_score(y_test_single, y_test_hat)

This looks much better. This should also make clear how you can be deceived by a simple accuracy measure, while the confusion matrix does show the difference in performance.

We normalized rowwise, which means that the rows (the actual label) sum up to 1. We see that we predicted 82 percent of the actual threes indeed as a three, making an error in 18% of the cases. We also mistook 1.8% of the non-threes for a three. So let's look at what's going on internally:

In [None]:
from sklearn.model_selection import cross_val_predict
y_decision = cross_val_predict(sgd, X_train, y_train_single, cv = 3, n_jobs = 4, method = 'decision_function')

Let's have a look at the first few values.

In [None]:
val = zip(y_decision[5:10], y_train_single[5:10])
pd.DataFrame(val)

You can probably figure out what is going on. Low values mean 'not a three', high values mean 'a three'. By using these decision values, we can change the behavior of the classifier to be more strict or more loose when it comes to deciding whether something is a three or not.
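
A minimal sketch of that idea, using made-up decision values rather than the ones computed above. For a binary linear classifier such as `SGDClassifier`, `predict` corresponds to a cut-off at 0; the other two thresholds are arbitrary choices for illustration.

```python
import numpy as np

# made-up decision values, like those returned by method='decision_function'
scores = np.array([-3.2, -0.4, 0.3, 1.1, 2.5])

default_pred = scores > 0    # the default cut-off of a linear classifier
strict_pred = scores > 1.0   # stricter: fewer cases are labeled 'a three'
loose_pred = scores > -1.0   # looser: more cases are labeled 'a three'

print(default_pred.sum(), strict_pred.sum(), loose_pred.sum())  # 3 2 4
```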

In [None]:
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_train_single, y_decision)


In [None]:
data = pd.DataFrame({'precision':precision[:-1],'recall': recall[:-1], 'thresholds':thresholds})

In [None]:
sns.lineplot(x = 'thresholds', y='precision', label = 'precision', data = data)
sns.lineplot(x = 'thresholds', y='recall', label = 'recall', data=data)

As you might have figured out by now, this plot shows that you can achieve any precision you want! The only problem is that your recall will drop, and vice versa... In some cases you could want to tune this threshold.
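
If you do want to tune the threshold, one possible approach is to pick the lowest threshold that reaches a target precision. The sketch below uses made-up labels and scores; the target of 0.9 is an arbitrary assumption.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# made-up labels and decision values
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
scores = np.array([-2.0, -1.5, -0.5, -0.2, 0.1, 0.4, 1.2, 2.0])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# precision and recall have one more entry than thresholds, so drop the last one
target = 0.9
reaches_target = precision[:-1] >= target
chosen = thresholds[reaches_target][0]  # thresholds are sorted in increasing order

y_pred = scores >= chosen
tp = np.sum(y_pred & (y_true == 1))
print(chosen, tp / y_pred.sum(), tp / (y_true == 1).sum())
```

With these toy values the chosen threshold gives perfect precision, but one of the four positives is no longer found: the trade-off described above.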

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_single, y_decision)
data = pd.DataFrame({'fpr' : fpr, 'tpr':tpr})

In [None]:
sns.set_theme()
plot = sns.lineplot(x = 'fpr', y = 'tpr', data=data)
plot.set(xlabel = 'FPR', ylabel = 'TPR')
plt.plot([0,1], [0,1], 'k--')

Another visualization that is often used is the ROC curve. You plot the False Positive Rate against the True Positive Rate. The diagonal line is what you would expect from chance, so you want to get away from it.

The steep rise means: even though the False Positive Rate is very low, you identify already about 60-80% of the True Positive cases. That is nice!

If you also want to get those last difficult cases, you will have to accept that your False Positive rate will also grow, meaning that you will make more mistakes in giving something a label while you should not have done so.
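
To make the two axes tangible, here is a sketch that computes both rates by hand at two thresholds. The labels, scores, and the helper `rates` are made up for illustration.

```python
import numpy as np

# made-up labels and decision values
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1], dtype=bool)
scores = np.array([-2.0, -1.5, -0.5, -0.2, 0.1, 0.4, 1.2, 2.0])

def rates(threshold):
    """Return (TPR, FPR) when everything scoring >= threshold is called positive."""
    pred = scores >= threshold
    tpr = np.sum(pred & y_true) / np.sum(y_true)     # share of positives we found
    fpr = np.sum(pred & ~y_true) / np.sum(~y_true)   # share of negatives we mislabeled
    return float(tpr), float(fpr)

print(rates(0.4))   # strict threshold: (0.75, 0.0)
print(rates(-0.3))  # loose threshold:  (1.0, 0.25)
```

Lowering the threshold picks up the last positive case, at the cost of labeling a quarter of the negatives as positive.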

# Multiclass prediction
Now let's move on to a more complex case, where we actually want to predict every number.

In [None]:
# first scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

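Instead of doing a simple fit, we can use cross validation. Internally, this splits the dataset into equal parts, fits on all parts but one, and predicts on the part that was held out, so that every case gets a prediction from a model that never saw it.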

In [None]:
%%time
sgd = SGDClassifier(random_state=5, max_iter=5, n_jobs=4)
yhat = cross_val_predict(sgd, X_train, y_train, cv = 5)

In [None]:
cfm = confusion_matrix(y_train, yhat)
cfm_heatmap(cfm, figsize=(12,12), scale='rowwise', vmax= 0.05)

This might seem like a lot of information to take in. But let's not forget that we have 10 classes to predict: every case has one of 10 actual labels and can receive one of 10 predicted labels, which gives 100 cells in total. Considering that, the heatmap is a nice way to quickly spot the problems.

Again, we normalised over the rows. We see that what is actually a three is often mistaken for a five. The same goes the other way around. Also the seven is often mistaken for a nine.
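
If you would rather not find such pairs by eye, you can search the off-diagonal cells. This is a minimal sketch on a made-up 3-class confusion matrix, not on the matrix computed above:

```python
import numpy as np

# made-up confusion matrix: rows are targets, columns are predictions
cfm = np.array([[90,  8,  2],
                [ 5, 80, 15],
                [ 1, 19, 80]])

# normalize rowwise, then blank out the diagonal (the correct predictions)
norm = cfm / cfm.sum(axis=1, keepdims=True)
off_diag = norm.copy()
np.fill_diagonal(off_diag, 0)

# the largest remaining cell is the most frequent confusion
target, predicted = np.unravel_index(np.argmax(off_diag), off_diag.shape)
print(f"class {target} is most often mistaken for class {predicted}")
```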

# Scanning models at scale
So, due to the ["no free lunch theorem"](https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization) we might have an intuition about which model is best, but no model is best by default, and often we will simply need to test and compare. Let's try to scale this up.

## Create synthetic data
Let's create some data and explore it a bit.

In [None]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples = 500, noise = 0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7)

In [None]:
data = pd.DataFrame(X_train)
data.head()

In [None]:
sns.scatterplot(data=data, x=0, y=1, hue=y_train, palette='Set1')


## Pick some models
Looking around in the sklearn documentation about classifiers, we can pick some classifiers.

In [None]:
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

For a single classifier, the process would look like this:

In [None]:
%%time
svc = SVC()
svc.fit(X_train, y_train)
result = cross_val_score(svc, X_test, y_test, cv = 5, scoring='f1_macro')
result

And, sure, we could repeat that by copy-pasting these lines. But why copy paste if we can program the repetitive part as a for-loop?



In [None]:
cv = 5
classifiers = [
    ('svc-linear', SVC(kernel='linear')),
    ('svc-kernel', SVC()),
    ('random-forest', RandomForestClassifier()),
    ('naive bayes', GaussianNB()),
    ('gaussian', GaussianProcessClassifier()),
    ('kNN', KNeighborsClassifier(3)),
    ('decision tree', DecisionTreeClassifier())
]

for i, (name, clf) in enumerate(classifiers):
    clf.fit(X_train, y_train)
    result = cross_val_score(clf, X_test, y_test, cv = cv, scoring='f1_macro')

    mu = np.mean(result)
    stderr = np.std(result)/np.sqrt(cv)

    plt.scatter(i, mu, label=name)
    plt.errorbar(i, mu, yerr=stderr)
    plt.legend(loc=3)

plt.xticks(np.arange(len(classifiers)), [name[0] for name in classifiers], rotation=45);
plt.show()

That looks good. We get a nice overview of performance, even without tweaking the models by changing the hyperparameters.


In addition to this, let's set up a contour plot to check the insides of the model (i.e., how the model decides on the class of a point).

In [None]:
def plot_countour(X_train, y_train, model, granularity=0.1, grid_side=0.5, palette='Set1', ax=None):
    X_train = pd.DataFrame(X_train)

    # first, we get the min-max range over which we want to plot
    # this is the area for which we want to know the behavior of the model
    # we add some extra space with grid_side to the feature space.
    x0_min, x0_max = X_train.iloc[:,0].min() -grid_side, X_train.iloc[:,0].max() +grid_side
    x1_min, x1_max = X_train.iloc[:,1].min() -grid_side, X_train.iloc[:,1].max() +grid_side

    # we make a grid of coordinates
    xx, yy = np.meshgrid(np.arange(x0_min, x0_max, granularity),
                         np.arange(x1_min, x1_max, granularity))
    # and combine the grid into a new dataset.
    # this new dataset covers (with some granularity) every point of the original dataset
    # this newx is equal to the featurespace we want to examine.
    newx = np.c_[xx.ravel(), yy.ravel()]

    # we make a prediction with the new dataset. This will show us predictions over the complete featurespace.
    yhat = model.predict(newx)

    # and reshape the prediction, such that it will match our gridsize
    z = yhat.reshape(xx.shape)
    cm = sns.color_palette(palette, as_cmap=True)
    if ax is None:
        # in the case we want to make a single plot
        plt.contourf(xx, yy, z, cmap=cm, alpha = 0.5)
    else:
        # in the case we have subplots and have our own axes to plot on
        ax.contourf(xx, yy, z, cmap=cm, alpha = 0.5)

    x1, x2 = X_train.iloc[:,0], X_train.iloc[:,1]
    sns.scatterplot(x=x1, y=x2, hue=y_train, palette=palette, ax=ax,style=y_train, alpha=0.5, markers={0 : "s", 1:"o"})

This works for a single model

In [None]:
plot_countour(X_train, y_train, svc)

However, let's scale that up as well.

In [None]:
fig, axs = plt.subplots(2, 4, figsize=(16,12))
axs = axs.ravel()

for i, (name, clf) in enumerate(classifiers):
    clf.fit(X_train, y_train)
    result = cross_val_score(clf, X_test, y_test, cv = cv, scoring='f1_macro')
    plot_countour(X_train, y_train, clf, ax=axs[i], palette="Set1")
    axs[i].set_title(name)

Note that you can only use this conveniently for 2D data (because, well, how would you want to plot data that has 8 dimensions? or 30?)

Another way to check performance is with precision-recall and ROC curves. Earlier, we showed how to do this for a model with a decision function. But not every model has one: some models work with probabilities instead, and those have a `predict_proba` method.

In [None]:
for name, clf in classifiers:
    if hasattr(clf, "decision_function"):
        print("decision_function : {}".format(name))
    if hasattr(clf,"predict_proba"):
        print("predict_proba     : {}".format(name))

To use one of the probability models, we have to make some small modifications

In [None]:
gpc = GaussianProcessClassifier()
gpc.fit(X_train, y_train)

In [None]:
proba = gpc.predict_proba(X_train)
proba[:5], y_train[:5]

Note how each row of probabilities holds two values. You actually need just one of them (because the other is 1-p). So we have to figure out which label corresponds to which column. In this case, a label of value 1 corresponds to a high value in the column with index 1.
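
Rather than guessing, you can look up the column order in the fitted model's `classes_` attribute. The sketch below uses a tiny made-up dataset and `GaussianNB` as a stand-in for any classifier with `predict_proba`:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# tiny made-up dataset: low values belong to class 0, high values to class 1
X = np.array([[0.0], [0.2], [0.4], [2.0], [2.2], [2.4]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB().fit(X, y)
proba = clf.predict_proba(X)

# the columns of predict_proba follow the order of clf.classes_
print(clf.classes_)
pos_col = list(clf.classes_).index(1)
p_positive = proba[:, pos_col]  # probability of the label 1
print(p_positive.round(3))
```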

To make the prediction generalize better, let's use `cross_val_predict`. Note how we need to change the method.

In [None]:
y_decision = cross_val_predict(gpc, X_train, y_train, cv = 3, n_jobs = 4, method = 'predict_proba')
y_decision[:5], y_train[:5]

In [None]:
precision, recall, thresholds = precision_recall_curve(y_train, y_decision[:,1])
data = pd.DataFrame({'precision':precision[:-1],'recall': recall[:-1], 'thresholds':thresholds})
sns.lineplot(x = 'thresholds', y='precision', label = 'precision', data = data)
sns.lineplot(x = 'thresholds', y='recall', label = 'recall', data=data)

In [None]:
fpr, tpr, thresholds = roc_curve(y_train, y_decision[:,1])
data = pd.DataFrame({'fpr' : fpr, 'tpr':tpr})
plot = sns.lineplot(x = 'fpr', y = 'tpr', data=data)
plot.set(xlabel = 'FPR', ylabel = 'TPR')
plt.plot([0,1], [0,1], 'k--')