This notebook covers the following topics:
1. Basic Machine Learning Terminology
2. TensorFlow 2.0 Basics including Gradient Descent
3. Keras API
4. Support Vector Machines, Linear Regression, Logistic Regression, Multi-Layer Perceptrons
5. Accuracy + Confusion Matrix
6. Cross-Validation

## Basic ML Terminology

#### Training, Test, and Validation Data
In machine learning, the goal is to find a representation of a 'true' relationship that is captured by a set of examples in a dataset. To figure out which model is a good representation, the available data is typically split into different subsets dedicated to different purposes:

1. Training set: 
   A set of examples used for learning, that is to fit the parameters [i.e.,
   weights] of the classifier. 
2. Validation set: 
   A set of examples used to tune the parameters [i.e., architecture, not
   weights] of a classifier, for example to choose the number of hidden
   units in a neural network. 
3. Test set: 
   A set of examples used only to assess the performance [generalization] of
   a fully-specified classifier. 

The ratio between these sets varies depending on how much data is available. A common rule of thumb is to use 80% of the available data for training, 10% for validation, and 10% for testing. One thing to keep in mind when creating these sets for a classification task is the number of examples in each class per set. Ideally, these are balanced (the same number of examples for each class), as an imbalance can lead the model to put more emphasis on one class than another. If classes are imbalanced, and they usually are, then the training, validation, and test sets should have the same distribution over classes and mimic the distribution in the full dataset. This approach is called stratifying.
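
As a rough illustration of the 80/10/10 rule with stratification, the sketch below splits a made-up imbalanced dataset in two steps using `train_test_split` from `sklearn`. The dataset, the variable names, and the 90/10 class ratio are assumptions chosen for the example, not part of this notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 900 examples of class 0, 100 of class 1.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 4))
labels = np.array([0] * 900 + [1] * 100)

# Step 1: hold out 20% of the data while preserving the class ratio.
X_train, X_rest, y_train, y_rest = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0
)
# Step 2: split the held-out 20% in half to get validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

for name, y in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: {len(y)} examples, {y.mean():.0%} in class 1")
```

Each subset should end up with roughly the same share of class 1 as the full dataset.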

For a more in-depth description see this FAQ on neural networks: [design set, validation set, and test set?](http://www.faqs.org/faqs/ai-faq/neural-nets/part1/section-14.html)

#### Regression vs Classification
Before even beginning to think about the model there is a very important question that needs to be answered: What is the output of the model going to be, i.e., what is the target? Is it continuous or categorical? If it is continuous then the problem is called a regression problem; otherwise, it is a classification problem. Most models can be used for both regression and classification; however, there is a fundamental difference in designing a model for either.

The goal of regression models is to project the input data into a (typically lower-dimensional) space in which the target data, when plotted over a single dimension in a 2D graph, becomes linear, and to then learn the slope for each dimension of this space. In other words, for any of the model's features just before the output (in neural networks that is the last hidden layer), you can draw a straight line that **is very close to all data points**.

The goal of classification, on the other hand, is to project the input data into a space in which the target data, when again plotted over a single dimension in a 2D graph, separates one category as much as possible from all other categories of interest, and to then learn a threshold value for which input is no longer part of that one category. In other words, for any of the model's features just before the output, you can draw a straight line that has all examples of one category on one side and all other examples on the other, and that **is very far away from all data points**.

This means that the kind of features you would like to see for a regression model or a classification model are fundamentally different. Understanding this difference is important when assessing the quality of a dataset or inspecting the features a model has learned.

While this tutorial introduces concepts used for either problem, examples and exercises will primarily come from classification, because the problem you will tackle in the machine learning part of the project is a classification problem.


#### Inference vs Prediction
Another aspect that needs to be decided on when selecting a model to train is whether the model will be used for prediction or for inference. Prediction literally tries to predict the unknown and to interpolate well between seen examples so that it can assign the correct value to unseen ones. Inference, on the other hand, tries to infer the importance of input features on the output, meaning that the goal is to understand why a certain value is the way it is, and which of the features cause this value.

While the way models are trained for either purpose is again quite similar, there is a difference in the model's relationship to causality. Take the (classic) example of a dataset that measures how many people are drowning at a beach in each week of a year, together with other features that are recorded at that beach during those weeks. One of these features might be the amount of ice cream sold by the local kiosk. As ice cream sales go up, so does the number of people drowning. If the goal is to create a prediction model for the number of people drowning, i.e., estimate in which weeks there will be more demand for lifeguards, the amount of ice cream sales would be an interesting feature to add to the model; anything that reliably correlates with the target values is a useful feature. If the goal is to perform inference, i.e., explain why the number of people drowning is high in a particular week, then adding ice cream sales as a feature is a bad idea. The resulting explanation would be "eating more ice cream increases your chance of drowning", which is incorrect. It suffers from the confound that the season (summer/winter) drives both values up and down. In sum, prediction problems don't need the model's inputs and outputs to be causally linked (although it is desirable for robustness) and correlation is fine, whereas for inference correlation is nice, but one is ultimately interested in causality.
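
The confound described above can be imitated with a small simulation. The sketch below is only an illustration on synthetic data: it uses a made-up `temperature` variable as a stand-in for the season, and the coefficients and noise levels are arbitrary assumptions. It shows that the two quantities correlate strongly, yet the correlation mostly disappears once the influence of the confounder is removed from both.

```python
import numpy as np

rng = np.random.default_rng(42)
n_weeks = 2000  # hypothetical number of observed weeks

# The confounder drives both quantities; neither causes the other.
temperature = rng.normal(size=n_weeks)
ice_cream_sales = 2.0 * temperature + 0.5 * rng.normal(size=n_weeks)
drownings = 1.5 * temperature + 0.5 * rng.normal(size=n_weeks)

raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residual(values, confounder):
    """Remove the best linear fit on the confounder from the values."""
    slope, intercept = np.polyfit(confounder, values, deg=1)
    return values - (slope * confounder + intercept)

adjusted_corr = np.corrcoef(
    residual(ice_cream_sales, temperature),
    residual(drownings, temperature),
)[0, 1]

print(f"correlation before adjusting for temperature: {raw_corr:.2f}")
print(f"correlation after adjusting for temperature:  {adjusted_corr:.2f}")
```

For prediction, the strong raw correlation is useful; for inference, the adjusted value is the more honest summary of how the two quantities relate.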


#### OneHot Encoding

OneHot encoding is a way of transforming categorical features in such a way that they can be incorporated into machine learning models. Given a feature with a set of $N$ discrete categories, one constructs $N$ new features, one for each category of the original feature. For each new feature, its value is $1$ if the category of that example in the old feature corresponds to the category the new column represents and $0$ otherwise. This means that exactly one feature of the new set of features will have value $1$, and all the others will have value $0$; hence the name OneHot.

This step is necessary because there is no order defined on strictly categorical data. Adding it as a single dimension would artificially create such an order, and would lead the model to learn wrong parameters.
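
A minimal sketch of this encoding using only NumPy is shown below. The `colors` feature and its three categories are made up for illustration.

```python
import numpy as np

# Hypothetical categorical feature with N = 3 categories.
colors = np.array(["red", "green", "blue", "green"])

# np.unique returns the sorted categories and, for every example,
# the index of its category in that sorted list.
categories, indices = np.unique(colors, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(len(categories), dtype=np.float32)[indices]

print(categories)  # column order of the new features
print(one_hot)
```

Every row contains exactly one `1`, in the column that belongs to the example's category.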

## TensorFlow 2.0

In [1]:
import tensorflow as tf
import numpy as np
import sklearn as sk
from sklearn.datasets import make_classification
import scikitplot as skplt

### Basics of TensorFlow

To understand TensorFlow and how to work with it, you need to understand that at its core it's all about one thing: Tensors. For the practical purposes of this tutorial, you can think of tensors as multi-dimensional arrays. While not every multi-dimensional array is a tensor, every tensor can be represented as a multi-dimensional array. 

These tensors are operated on by functions, which TensorFlow calls nodes. A node takes in a tensor as the (main) input (sometimes also multiple tensors) and outputs a tensor (or multiple tensors). Because this output is a tensor, it can again be the input of another node. Stacking these nodes leads to a graph-like structure called a node graph, and handling the node graph efficiently is the objective of TensorFlow. In this graph, tensors "flow" back and forth between different nodes, which inspires the name of the framework: TensorFlow.

*Note: The abstract blueprint of a node is called an op (for operation). This is similar to the relationship between classes and objects, where an object is a concrete instantiation and the class is the abstract blueprint.*

In [None]:
a = tf.constant([3.0, 5.0], shape=[1,2]) # node with constant output
b = tf.constant([-4.0, 4.0], shape = [1,2])
W = tf.constant([[1.0, 2.0], [3.0, 4.0]], shape=[2,2])
node = tf.add(a, b) # a + b
node.numpy() # convert the output tensor of a node to a numpy array

Above you can see a very simple graph. TensorFlow has a large collection of primitive nodes, and I recommend that you consult the documentation at some point and just scroll through the list of ops that TensorFlow provides out of the box. Here are some more simple building blocks:

In [None]:
tf.subtract(a, b) # element-wise a - b
tf.multiply(a, b) # element-wise a * b
tf.divide(a, b) # element-wise a/b
tf.exp(a) # element-wise e^a
tf.math.log(a) # element-wise log(a)
tf.pow(a, 2) # element-wise a^2
tf.reduce_mean(a, axis=0) # compute the mean of a along axis
tf.reduce_sum(a, axis=0) # sum the elements of a along axis
tf.maximum(a, b) # element-wise maximum of both tensors
tf.matmul(a, W) # matrix multiplication a * W

With this, you can already build some complex graphs. Here is an example of a linear function:

In [None]:
X = tf.constant([1.0, 1.0, 1.0], shape=[1, 3])
W = tf.constant([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], shape=[3, 3])
c = tf.constant([0.0, 1.0, 2.0], shape=[1, 3])

Y = tf.add(tf.matmul(X, W), c)
Y.numpy()

A note on `matmul(a, W)`: In data science and machine learning, it is common to identify one example, i.e., one data point, as a row in a spreadsheet, .csv, or other data file. Higher-level functions in TensorFlow follow this standard and assume that the first axis of a tensor identifies an example; for datasets that can be written as a matrix, this means that an example is a row vector. Classic linear algebra, on the other hand, usually assumes that a vector is a column vector and defines matrix multiplication as $A\cdot x$. To account for this difference, one has to transpose the row vector $x$ before multiplying, i.e., compute $A\cdot x^T$, as a means of "switching between frameworks". Up to a transpose of the result, this can also be written as $x\cdot A^T$, where $x$ is the row vector used in ML. $A$ typically is a weight matrix to be learned and is randomly initialized. Whether one randomly initializes $A$ and transposes or initializes $A^T$ directly doesn't matter much; both initialize randomly with the same distribution. Hence, instead of computing `tf.matmul(W, tf.transpose(a))` one typically computes `tf.matmul(a, W)`.
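
To make the two conventions concrete, the sketch below computes the same product both ways. The matrix `A` and the example `x` are made-up values; the only claim is that the two results agree up to a transpose.

```python
import numpy as np
import tensorflow as tf

# Hypothetical weight matrix mapping 2 features to 3 outputs (3 x 2).
A = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
# One example stored as a row vector (1 x 2), as is common in ML.
x = tf.constant([[1.0, -1.0]])

# Linear algebra convention: column vector on the right, result is 3 x 1.
column_style = tf.matmul(A, tf.transpose(x))
# ML convention: row vector on the left, result is 1 x 3.
row_style = tf.matmul(x, tf.transpose(A))

print(column_style.numpy())
print(row_style.numpy())
print(np.allclose(tf.transpose(column_style).numpy(), row_style.numpy()))
```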

### Exercise 1

Create a graph that computes
$$L=(Y_\textrm{pred}-Y)^2$$
for the input

In [49]:
Y_pred = tf.constant([1.0], shape=(1, 1))
Y = tf.constant([1.5], shape=(1, 1))

In [50]:
tf.pow(tf.subtract(Y_pred, Y), 2)

<tf.Tensor: id=65318, shape=(1, 1), dtype=float32, numpy=array([[0.25]], dtype=float32)>

*Note: The expected output is* `<tf.Tensor: id=XX, shape=(1, 1), dtype=float32, numpy=array([[0.25]], dtype=float32)>`

This function is called the least-squares loss and is used to fit regression models. To perform optimization with multiple data points, one minimizes the expected value of the loss, i.e., the average squared distance between a predicted value and the true value.

### Exercise 2

Using the graph defined in exercise 1, create a graph that computes
$$L=\frac1N\sum_N(Y_\textrm{pred}-Y)^2,$$
where $N$ is the number of examples in the dataset.

Notice that 
$$\frac1N\sum_N x = \textrm{mean}(x).$$

The inputs to this function are

In [37]:
Y_pred = tf.constant([[1.0], [2.0], [3.0], [4.0], [5.0]], shape=(5, 1))
Y = tf.constant([[1.5], [3.0], [2.5], [4.0], [6.0]], shape=(5, 1))

In [39]:
tf.reduce_mean(tf.pow(tf.subtract(Y_pred, Y), 2))

<tf.Tensor: id=43899, shape=(), dtype=float32, numpy=0.5>

*Note: The expected output of this function is* `<tf.Tensor: id=XX, shape=(), dtype=float32, numpy=0.5>`

### Exercise 3

Create a graph that computes the sigmoid function
$$y = \frac{1}{1+e^{-X}}$$
element-wise for the input

In [None]:
X = tf.Variable([1.0, 2.0, 3.0, 4.0]) # like a constant, but may change value

In [None]:
# use TensorFlow ops to build the graph; .numpy() converts the result to an array
tf.divide(1.0, tf.add(1.0, tf.exp(tf.multiply(-1.0, X)))).numpy()

*Note: The expected output is* `array([0.7310586 , 0.880797  , 0.95257413, 0.98201376], dtype=float32)`

### Writing Cleaner Code

To make this more readable, the TensorFlow developers have overloaded most Python operators. This means one can work with TensorFlow nodes almost in the same way one would work with normal Python code. For example, take a look at the linear function from before:

In [None]:
X = tf.constant([1.0, 1.0, 1.0], shape=[1, 3])
W = tf.constant([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], shape=[3, 3])
c = tf.constant([0.0, 1.0, 2.0], shape=[1, 3])

Y = tf.matmul(X, W) + c
Y.numpy()

### Exercise 4

Create a graph that computes
$$L = \max(0, 1- Y Y_\textrm{pred})$$
for the input 

In [None]:
Y_pred = tf.constant([0.5])
Y = tf.constant([-1.0])

In [None]:
s = tf.maximum(0.0, 1 - Y * Y_pred) # float literal matches the tensors' dtype
s.numpy()

*Note: The expected output is:* `array([1.5], dtype=float32)`

This function is called the [hinge loss](https://en.wikipedia.org/wiki/Hinge_loss). It is used in binary classification and is best known for training support vector machines (SVMs). Similar to the least-squares loss presented earlier, when computing the loss for multiple data points, one again minimizes the expected value of the loss.

One important thing to know about the hinge loss is that the binary labels, which the loss expects for $Y$, are not $0$ and $1$ (the values used in most losses for classification), but instead are $-1$ and $1$.
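
The sketch below illustrates both points on made-up numbers, using plain NumPy: it first relabels $\{0, 1\}$ labels to $\{-1, 1\}$ and then evaluates the hinge loss per example. The scores are arbitrary values chosen to cover the typical cases.

```python
import numpy as np

# Hypothetical labels in the usual {0, 1} convention.
labels_01 = np.array([1.0, 1.0, 0.0, 0.0])
# Relabel to the {-1, +1} convention expected by the hinge loss.
labels_pm1 = 2 * labels_01 - 1

# Hypothetical raw model outputs (signed scores, not probabilities).
scores = np.array([2.0, 0.5, -0.2, 1.0])

losses = np.maximum(0.0, 1.0 - labels_pm1 * scores)
print(losses)  # [0.  0.5 0.8 2. ]
```

The first example is on the correct side with a margin of at least $1$ and incurs no loss. The second and third are on the correct side but inside the margin, so they are still penalized. The last one is on the wrong side and receives the largest loss.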

### Exercise 5

Create a graph that computes
$$L = \frac1N\sum_N \max(0, 1 - YY_\text{pred})$$
for the input

In [None]:
Y_pred = tf.constant([[1.0], [0.5], [0.0], [-0.5], [-1.0]], shape=[5,1])
Y = tf.constant([[1.0], [-1.0], [1.0], [-1.0], [1.0]], shape=[5,1])

In [None]:
tf.reduce_mean(tf.maximum(0.0, 1 - Y * Y_pred))

*Note: The expected output is* `<tf.Tensor: id=XX, shape=(), dtype=float32, numpy=1.0>`

### Using Python Functions with TensorFlow

It is also possible to build functions that create different sections of the graph. This is very useful when trying to reuse code and create more complex structures of nodes, such as layers in a neural network.

In [51]:
def linear(X, W, c):
    return tf.matmul(X, W) + c

X = tf.constant([1.0, 1.0, 1.0], shape=[1, 3])
W = tf.constant([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], shape=[3, 3])
c = tf.constant([0.0, 1.0, 2.0], shape=[1, 3])

linear(X, W, c)

<tf.Tensor: id=65327, shape=(1, 3), dtype=float32, numpy=array([[1., 2., 3.]], dtype=float32)>

### Exercise 6

Create two functions that produce graphs using tensorflow.

Function 1 is called `sigmoid`, takes $X$ as input, and has the form
$$\textrm{sigmoid}(X) = \frac{1}{1+e^{-X}}$$

Function 2 is called `Y_pred`, takes as input $X$, $W$, $c$, and applies the linear transformation from the example above followed by the sigmoid from function 1. It has the form
$$Y_{pred}(X) = \textrm{sigmoid}(\textrm{linear}(X, W, c))$$

Then, compute `Y_pred(X, W, c)`.

This time, the inputs are a fake dataset of 5 examples with 2 features each, a weight matrix, and a constant.

In [None]:
X = tf.constant([[0.5, 1], [0.5, 2], [0.5, 3], [0.5, 4], [0.001, 0.1]], shape=[5, 2])
W = tf.Variable([[0.5],[-0.5]]) # like a constant, but it can change its value!
c = tf.Variable([0.1])

In [None]:
def sigmoid(X):
    return 1 / (1 + tf.exp(-X))

def Y_pred(X, W, c):
    return sigmoid(linear(X, W, c))

Y_pred(X, W, c)

*Note: The expected output is* 
```
<tf.Tensor: id=2845, shape=(5, 1), dtype=float32, numpy=
array([[0.46257016],
       [0.34298956],
       [0.2404891 ],
       [0.16110894],
       [0.5126223 ]], dtype=float32)>
```

The function $Y_\textrm{pred}$ is called a perceptron, a binary classifier, and computes the probability that the example belongs to the class with label $1$. In the exercise you just completed, every example would be classified as belonging to class $0$, with the exception of the last one, whose probability is above 50%.
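
Turning these probabilities into class labels amounts to comparing them with the 50% threshold. A minimal sketch with NumPy, reusing the expected output of the exercise as input:

```python
import numpy as np

# Probabilities taken from the expected output of the exercise above.
probabilities = np.array([[0.46257016],
                          [0.34298956],
                          [0.2404891],
                          [0.16110894],
                          [0.5126223]])

# Class 1 if the probability exceeds 50%, class 0 otherwise.
predicted_class = (probabilities > 0.5).astype(int)
print(predicted_class.ravel())  # [0 0 0 0 1]
```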

### Computing Gradients

With the basics out of the way, it is time to turn to the most important feature of TensorFlow: automatic differentiation. With it, one can easily compute the gradient of an output tensor with respect to an input tensor, regardless of the location of either in the graph. In practical terms, this means one can figure out how the performance of a model changes if one changes a parameter of the model. (Note: TensorFlow calls these parameters trainable weights.)

As an example of how to differentiate/compute a gradient, consider the function

In [None]:
def complex_function(X):
    return 1 / (1 + tf.exp(- (0.5 * X ** 2)))

To compute the gradient, TensorFlow records the "flow" of the tensors on a `GradientTape`. By default, it only tracks `tf.Variable`s that are flagged as `trainable=True`. However, one can explicitly tell it to track other tensors as well, for example, the $X$ in the function above.

In [None]:
with tf.GradientTape() as tape:
    X = tf.constant(1.0)
    tape.watch(X)
    y = complex_function(X)
gradient = tape.gradient(y, X) # the gradient of y with respect to X
gradient.numpy()
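
One way to build trust in the result is to compare it with a derivative worked out by hand. Writing the function as $\sigma(u)$ with $u = 0.5X^2$, the chain rule gives $\sigma(u)(1-\sigma(u))\cdot X$. The sketch below repeats the computation from the cell above and checks it against this expression; the names `autodiff` and `analytic` are introduced here for illustration only.

```python
import tensorflow as tf

def complex_function(X):
    return 1 / (1 + tf.exp(-(0.5 * X ** 2)))

X = tf.constant(1.0)
with tf.GradientTape() as tape:
    tape.watch(X)  # X is a constant, so it has to be watched explicitly
    y = complex_function(X)
autodiff = tape.gradient(y, X)

# Chain rule by hand: sigmoid'(u) = sigmoid(u) * (1 - sigmoid(u)), du/dX = X
analytic = y * (1 - y) * X

print(autodiff.numpy(), analytic.numpy())
```

Both values should agree up to floating-point rounding.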

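Another default setting of `GradientTape` is that it releases its resources after the first call to `gradient()`, so it can only be queried once. To call `gradient()` several times on the same tape, for example once per variable, one needs to make the tape persistent using `persistent=True`. (Alternatively, a single `gradient()` call accepts a list of variables.)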

For example, above you have created a function $Y_\textrm{pred}$ that computes a class probability. Using persistence, you can compute the gradient with respect to the weights $W$, and the constant $c$. This means you can figure out how to change these variables to make the probability larger.

In [None]:
# use the same data as in the exercise
X = tf.constant([[0.5, 1], [0.5, 2], [0.5, 3], [0.5, 4], [0.001, 0.1]], shape=[5, 2])
W = tf.Variable([[0.5],[-0.5]])
c = tf.Variable([0.1])

with tf.GradientTape(persistent=True) as tape:
    y = Y_pred(X, W, c)
dW = tape.gradient(y, W) # the gradient of y with respect to W
dc = tape.gradient(y, c)

print(dW)
print(dc)

Note that variables are trainable by default, unless specified as `trainable=False`.
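
As an alternative to a persistent tape, `tape.gradient` also accepts a list of sources and returns one gradient per entry. The sketch below repeats the example with a regular tape; it uses `tf.math.sigmoid` in place of the hand-written `sigmoid` so that the snippet is self-contained.

```python
import tensorflow as tf

X = tf.constant([[0.5, 1], [0.5, 2], [0.5, 3], [0.5, 4], [0.001, 0.1]], shape=[5, 2])
W = tf.Variable([[0.5], [-0.5]])
c = tf.Variable([0.1])

with tf.GradientTape() as tape:  # not persistent
    y = tf.math.sigmoid(tf.matmul(X, W) + c)

# One call, several sources. Because y is not a scalar, the result is
# the gradient of the sum of all elements of y.
dW, dc = tape.gradient(y, [W, c])

print(dW.shape, dc.shape)
```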

### Gradient Descent

Using `GradientTape`, one can implement a snippet that finds the minimum of a function using gradient descent with a fixed step size and a set number of steps:

In [None]:
# find the minimum of y = x^2
def functionToMinimize(X):
    return X * X

X = tf.Variable(5.0)
step_size = 0.1
for step in range(25):
    # compute the gradient
    with tf.GradientTape() as tape:
        tape.watch(X)
        y = functionToMinimize(X)
    gradient = tape.gradient(y, X)
    
    # make a small step against the gradient,
    # i.e. in the direction of steepest descent
    X = X - step_size * gradient
    print(f"f({X:.2f}) = {functionToMinimize(X).numpy():.2e}")

As you can see, the algorithm updates `X` in such a way that `functionToMinimize` gets smaller and smaller.
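
Note that the assignment `X = X - step_size * gradient` replaces the `tf.Variable` with a plain tensor, which is why the loop has to call `tape.watch`. A possible variation, sketched below under the same settings, updates the variable in place with `assign_sub`, so it stays a `tf.Variable` and is tracked automatically.

```python
import tensorflow as tf

def functionToMinimize(X):
    return X * X

X = tf.Variable(5.0)
step_size = 0.1
for step in range(25):
    with tf.GradientTape() as tape:
        y = functionToMinimize(X)  # trainable variables are watched by default
    gradient = tape.gradient(y, X)

    # in-place update: X <- X - step_size * gradient
    X.assign_sub(step_size * gradient)

print(f"X = {X.numpy():.4f}, f(X) = {functionToMinimize(X).numpy():.2e}")
```

For this particular function each step multiplies `X` by $1 - 2\cdot 0.1 = 0.8$, so after 25 steps `X` is roughly $5\cdot 0.8^{25}$.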

Now all the pieces are in place to do some machine learning in TensorFlow!

### Exercise 7

Create a function called `leastSquares` that takes as input $Y_{pred}$ and $Y$ and computes the expected least-squares loss for a set of examples.

Then, create a function called `objectiveFunction` that takes as input $X$, $Y$, $W$, $c$, and outputs the result of
```
leastSquares(linear(X, W, c), Y)
```
where `linear` is the linear mapping defined above.

Finally, print the output of `objectiveFunction(X_train, Y_train, W, c)`, where the variables are defined as

In [52]:
# Small training dataset of 10 examples
# Note: Y_train = 2 * X_train + 3 + noise
X_train = tf.constant([[-1.],
                       [-0.77777778],
                       [-0.55555556],
                       [-0.33333333],
                       [-0.11111111],
                       [0.11111111],
                       [0.33333333],
                       [0.55555556],
                       [0.77777778],
                       [1.]],
                      shape=[10, 1])
Y_train = tf.constant([[1.0319852],
                       [1.4628717],
                       [1.9185644],
                       [2.3697991],
                       [2.8011405],
                       [3.228254 ],
                       [3.6749988],
                       [4.1398144],
                       [4.6044483],
                       [5.0241356]],
                      shape= [10, 1])

# parameters of the model
W = tf.Variable([[0.5]], shape=[1, 1])
c = tf.Variable(0.0)

In [53]:
def leastSquares(Y_pred, Y):
    return tf.reduce_mean(tf.pow(tf.subtract(Y_pred, Y), 2))

def objectiveFunction(X, Y, W, c):
    return leastSquares(linear(X, W, c), Y)
    
objectiveFunction(X_train, Y_train, W, c)

<tf.Tensor: id=65352, shape=(), dtype=float32, numpy=10.072276>

*Note: The expected output is* `<tf.Tensor: id=XX, shape=(), dtype=float32, numpy=10.072276>`

### Exercise 8

Use `GradientTape` to compute the gradient of `objectiveFunction(X_train, Y_train, W, c)` with respect to both $W$ and $c$. Store the value of the objective functions in a new variable called `loss`, and the gradients in two new variables called `dW` and `dc`.

In [7]:
with tf.GradientTape(persistent=True) as tape:
    loss = objectiveFunction(X_train, Y_train, W, c)
dW = tape.gradient(loss, W)
dc = tape.gradient(loss, c)

*Note: The expected values for `loss`, `dW`, and `dc` are*
```
loss = <tf.Tensor: id=XX, shape=(), dtype=float32, numpy=10.072276>
dW = <tf.Tensor: id=XX, shape=(1, 1), dtype=float32, numpy=array([[-1.2230227]], dtype=float32)>
dc = <tf.Tensor: id=XX, shape=(), dtype=float32, numpy=-6.051203>
```

### Exercise 9

Update $W$ and $c$ using the gradients such that the loss will get smaller. That is, compute
$$W = W - h~\mathrm{d}W$$
and
$$c = c - h~\mathrm{d}c,$$
where $h$ is the step size.

In [11]:
h = 0.01

In [12]:
W = W - h * dW
c = c - h * dc

In [14]:
c

<tf.Tensor: id=110, shape=(), dtype=float32, numpy=0.060512025>

*Note: The expected values of W and c are*
```
W = tf.Tensor([[0.5122302]], shape=(1, 1), dtype=float32)
c = tf.Tensor(0.060512025, shape=(), dtype=float32)
```

### Exercise 10

Using the same method presented at the beginning of the chapter on gradient descent, compute the gradient of `loss` with respect to both $W$ and $c$ and then update both variables so that `loss` is minimized. Do this for `250` steps using the step size $h$. Afterward, print the values of $W$ and $c$.

In [15]:
# reset the variables of the model
W = tf.Variable([[0.5]], shape=[1, 1])
c = tf.Variable(0.0)

In [16]:
for step in range(250):
    with tf.GradientTape(persistent=True) as tape:
        tape.watch(W)
        tape.watch(c)
        loss = objectiveFunction(X_train, Y_train, W, c)
    dW = tape.gradient(loss, W)
    dc = tape.gradient(loss, c)
    
    W = W - h * dW
    c = c - h * dc

*Note: The expected values for W and c are*
```
W = <tf.Tensor: id=361749, shape=(1, 1), dtype=float32, numpy=array([[1.8068591]], dtype=float32)>
c = <tf.Tensor: id=361752, shape=(), dtype=float32, numpy=3.006222>
```

The values used to generate this dataset are $W = 2$ and $c = 3$; training has come pretty close, and running more steps would bring $W$ closer still.

Congratulations! You have implemented linear regression from scratch using TensorFlow. This lays the foundation for understanding the vast majority of machine learning algorithms trained via gradient descent.
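
For a problem this small, the result of gradient descent can be cross-checked against a closed-form least-squares fit. The sketch below uses `np.polyfit` on the training data from Exercise 7. Because the data contains noise, the fitted values are expected to be close to, but not exactly, $2$ and $3$.

```python
import numpy as np

x = np.array([-1., -0.77777778, -0.55555556, -0.33333333, -0.11111111,
              0.11111111, 0.33333333, 0.55555556, 0.77777778, 1.])
y = np.array([1.0319852, 1.4628717, 1.9185644, 2.3697991, 2.8011405,
              3.228254, 3.6749988, 4.1398144, 4.6044483, 5.0241356])

# Degree-1 polynomial fit: returns the slope first, then the intercept.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"W = {slope:.4f}, c = {intercept:.4f}")
```

Gradient descent with enough steps should approach these values rather than the noise-free $2$ and $3$.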

### A Toy Dataset

Next on the list is classification. To make things more interesting this notebook will use a larger dataset that is generated from two Gaussian distributions with different means in 2D space. This is achieved using a utility function from `sklearn`.

In [40]:
X, Y = make_classification(n_samples=150, 
                           n_features=2, 
                           n_redundant=0, 
                           n_clusters_per_class=1, 
                           flip_y=0,
                           class_sep=2.0,
                           shuffle=True,
                           random_state=1337
                          )
Y = (2 * Y - 1) # relabel from 0, +1 to -1, +1

# convert to a datatype that supports GPU acceleration on a large number of GPUs
Y = Y.astype(np.float32)
X = X.astype(np.float32)

X_train, Y_train = tf.constant(X[:100,:]), tf.constant(Y[:100], shape=[100, 1])
X_test, Y_test = tf.constant(X[100:,:]), tf.constant(Y[100:], shape=[50, 1])

### Exercise 11

Create a function called `hingeLoss` that takes $Y_{pred}$ and $Y$ as input and returns the expected value of the hinge loss as encountered in exercise 5. 

Then, create a function called `model` that takes $X$, $W$, $c$ as input and computes `linear(X, W, c)`.

Then, create a function called `objectiveFunction` that takes $X$, $Y$, $W$, $c$ as input and computes
```
hingeLoss(model(X, W, c), Y)
```

Finally, apply the `objectiveFunction` to the training dataset generated above using `objectiveFunction(X_train, Y_train, W, c)`.

These are the values for this exercise:

In [41]:
W = tf.Variable([[0.593], [0.236]], shape=[2,1], dtype=tf.float32)
c = tf.Variable(0.0, dtype=tf.float32)

In [42]:
def model(X, W, c):
    return linear(X, W, c)

def hingeLoss(Y_pred, Y):
    return tf.reduce_mean(tf.maximum(0.0, 1 - Y * Y_pred))

def objectiveFunction(X, Y, W, c):
    return hingeLoss(model(X, W, c), Y)

objectiveFunction(X_train, Y_train, W, c)

<tf.Tensor: id=43932, shape=(), dtype=float32, numpy=2.1517909>

*Note: The expected output is* `<tf.Tensor: id=XX, shape=(), dtype=float32, numpy=2.1517909>`

### Exercise 12

Compute the gradient of `objectiveFunction` from exercise 11 with respect to both the weights $W$ and the constant $c$. Name these variables `dW` and `dc`, similar to exercise 8.

In [43]:
with tf.GradientTape(persistent=True) as tape:
    loss = objectiveFunction(X_train, Y_train, W, c)
dW = tape.gradient(loss, W)
dc = tape.gradient(loss, c)

In [44]:
(dW, dc)

(<tf.Tensor: id=43978, shape=(2, 1), dtype=float32, numpy=
 array([[1.9143866 ],
        [0.07016731]], dtype=float32)>,
 <tf.Tensor: id=44012, shape=(), dtype=float32, numpy=-0.02>)

*Note: The expected outputs for dW and dc are* 
```
dW = <tf.Tensor: id=XX, shape=(2, 1), dtype=float32, numpy=
array([[1.9143866 ],
       [0.07016731]], dtype=float32)>
dc = <tf.Tensor: id=XX, shape=(), dtype=float32, numpy=-0.02>
```

### Exercise 13

Similar to exercise 10, find values for $W$ and $c$ that minimize the `objectiveFunction`. Do this using gradient descent with a constant stepsize $h$ and perform 250 steps.

In [45]:
W = tf.Variable([[0.593], [0.236]], shape=[2,1], dtype=tf.float32)
c = tf.Variable(0.0, dtype=tf.float32)
h = 0.01

In [46]:
for step in range(250):
    with tf.GradientTape(persistent=True) as tape:
        tape.watch(W)
        tape.watch(c)
        loss = objectiveFunction(X_train, Y_train, W, c)
    dW = tape.gradient(loss, W)
    dc = tape.gradient(loss, c)

    W = W - h * dW
    c = c - h * dc

In [47]:
(W, c)

(<tf.Tensor: id=65278, shape=(2, 1), dtype=float32, numpy=
 array([[-0.82727194],
        [ 0.08324701]], dtype=float32)>,
 <tf.Tensor: id=65281, shape=(), dtype=float32, numpy=-0.0955>)

*Note: After the optimization the values of W and c should be*
```
W = <tf.Tensor: id=43875, shape=(2, 1), dtype=float32, numpy=
array([[-0.82727194],
       [ 0.08324701]], dtype=float32)>
c = <tf.Tensor: id=43878, shape=(), dtype=float32, numpy=-0.0955>

```

Amazing! You have just built and trained a linear SVM from scratch in TensorFlow. The next step is to investigate the performance of the model:

In [48]:
# First we check the training accuracy
correctly_classified_examples = tf.sign(model(X_train, W, c)) == Y_train
correct_percent = tf.math.count_nonzero(correctly_classified_examples, dtype=tf.float64) / Y_train.shape[0]
print(f"Training accuracy is: {correct_percent*100:.2f}%")

# then the test accuracy
correctly_classified_examples = tf.sign(model(X_test, W, c)) == Y_test
correct_percent = tf.math.count_nonzero(correctly_classified_examples, dtype=tf.float64) / Y_test.shape[0]
print(f"Test accuracy is: {correct_percent*100:.2f}%")

Training accuracy is: 100.00%
Test accuracy is: 100.00%


A perfect score!

### High-Level Ops and Keras

As you may have noticed, certain operations (such as `linear`) appear again and again when training machine learning models. The same is also true for the code that runs gradient descent, for the loss functions, and for many more things that can't all be covered in a single tutorial.

This is where higher-level ops (and their nodes) come in. These ops generate reusable, frequently occurring, sets of nodes for all aspects of the TensorFlow pipeline, and are bundled in a module called `tf.keras`. Here are some examples:

In [None]:
linear = tf.keras.layers.Dense(1)

The `Dense` node is a more general version of the `linear` function used so far. The number, 1 in this case, refers to the number of transformations that should be created. Each transformation takes the same input but has a different set of weights. The output of each transformation is concatenated and then returned as a tensor. The Op also creates variables for $W$ and $c$ automatically.

This is a very common layer in deep neural networks, which is where the name `Dense` comes from. To apply the function to some input one would use

In [None]:
linear(X_train)

One important thing to note, and this applies to all ops that create trainable weights, is that the weights are associated with the function. This means that using `linear` in multiple places creates a graph where weights are shared in different places. It is very handy when doing more advanced things like multi-objective learning, but can be the source of nasty, hard-to-debug problems if one forgets this.
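
The sharing can be observed directly. In the sketch below, one `Dense` layer is applied to two inputs; the names `shared_layer`, `batch_a`, and `batch_b` are made up for the illustration. The layer still owns a single kernel and a single bias, so identical inputs produce identical outputs.

```python
import numpy as np
import tensorflow as tf

shared_layer = tf.keras.layers.Dense(1)

batch_a = tf.constant([[1.0, 2.0]])
batch_b = tf.constant([[1.0, 2.0]])

# The weights are created on the first call and reused on the second.
out_a = shared_layer(batch_a)
out_b = shared_layer(batch_b)

print(len(shared_layer.trainable_variables))  # kernel and bias only
print(np.allclose(out_a.numpy(), out_b.numpy()))
```

To get independent weights, one would create a second `Dense` layer instead of calling the same one twice.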

For gradient descent, there also exists a high-level function

In [None]:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

SGD stands for stochastic gradient descent. It works just like the gradient descent used above, except that the gradient is not computed using all of `X_train`, but using a randomly chosen subset of examples. This is useful when the dataset becomes too large to compute the gradient for all examples in `X_train` in a reasonable time. It also has some nice numerical properties that allow for better convergence. As a special case, it is possible to choose the entire dataset as the "randomly chosen subset of examples". In this case, SGD becomes normal gradient descent.
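
To show what "a randomly chosen subset" means in practice, the sketch below writes the mini-batch loop by hand instead of using the Keras optimizer. The noise-free dataset, the batch size of 10, and the step size of 0.1 are assumptions made for the example; setting `batch_size` to the dataset size would recover plain gradient descent.

```python
import numpy as np
import tensorflow as tf

# Hypothetical noise-free dataset: Y = 2 * X + 3
X_all = tf.constant(np.linspace(-1.0, 1.0, 100, dtype=np.float32).reshape(100, 1))
Y_all = 2.0 * X_all + 3.0

W = tf.Variable([[0.5]])
c = tf.Variable(0.0)

def loss_fn(X, Y):
    return tf.reduce_mean((tf.matmul(X, W) + c - Y) ** 2)

initial_loss = loss_fn(X_all, Y_all).numpy()

rng = np.random.default_rng(0)
batch_size, step_size = 10, 0.1
for step in range(200):
    # draw a random subset of examples for this step
    idx = rng.choice(100, size=batch_size, replace=False)
    X_batch, Y_batch = tf.gather(X_all, idx), tf.gather(Y_all, idx)

    with tf.GradientTape() as tape:
        loss = loss_fn(X_batch, Y_batch)
    dW, dc = tape.gradient(loss, [W, c])

    W.assign_sub(step_size * dW)
    c.assign_sub(step_size * dc)

final_loss = loss_fn(X_all, Y_all).numpy()
print(f"loss on all data: {initial_loss:.3f} -> {final_loss:.2e}")
```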

Similarly, there is a function to compute the hinge loss

In [None]:
Y_pred = linear(X_train)
tf.keras.losses.hinge(Y_train, Y_pred) # argument order: true labels first, then predictions

The hinge loss in TensorFlow also comes with a convenient feature. If the true labels passed in are drawn from the standard $\{0, 1\}$ instead of $\{-1, 1\}$, it will rescale the values automatically. This means the rescaling when generating the data becomes unnecessary, and the hinge loss can be used like any other loss.
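
A quick way to see this conversion is to evaluate the loss once with $\{0, 1\}$ labels and once with the corresponding $\{-1, 1\}$ labels. The labels and scores below are made-up values for illustration.

```python
import numpy as np
import tensorflow as tf

y_pred = tf.constant([[-0.5], [0.3], [2.0]])   # hypothetical raw scores
labels_01 = tf.constant([[0.0], [1.0], [1.0]])
labels_pm1 = tf.constant([[-1.0], [1.0], [1.0]])

# The true labels are the first argument, the predictions the second.
loss_01 = tf.keras.losses.hinge(labels_01, y_pred)
loss_pm1 = tf.keras.losses.hinge(labels_pm1, y_pred)

print(loss_01.numpy())
print(loss_pm1.numpy())
```

Both calls should return the same per-example losses.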

In [None]:
X, Y = make_classification(n_samples=150, 
                           n_features=2, 
                           n_redundant=0, 
                           n_clusters_per_class=1, 
                           flip_y=0,
                           class_sep=2.0,
                           shuffle=True,
                           random_state=1337
                          )
Y = Y.astype(np.float32)
X = X.astype(np.float32)

X_train, Y_train = tf.constant(X[:100,:]), tf.constant(Y[:100], shape=[100, 1])
X_test, Y_test = tf.constant(X[100:,:]), tf.constant(Y[100:], shape=[50, 1])

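The accuracy can then be computed using `tf.keras.metrics.BinaryAccuracy` and a threshold value of $0$. The threshold is $0$ rather than $0.5$ because the model outputs a signed score instead of a probability: following the hinge loss convention of labeling one class as $1$ and the other as $-1$, positive scores indicate one class and negative scores the other.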

In [None]:
acc = tf.keras.metrics.BinaryAccuracy(threshold=0)
acc.update_state(Y_train, Y_pred)
acc.result().numpy() # accuracy as a fraction between 0 and 1

On top of this, TensorFlow provides a class called `tf.keras.models.Model` which takes care of all the plumbing. With it, building an SVM simplifies to:

In [None]:
# create the SVM model
X = tf.keras.layers.Input(shape=(2,))
Y_pred = tf.keras.layers.Dense(1,
                               kernel_initializer=tf.constant_initializer(np.array([[0.593, 0.236]])),
                               bias_initializer=tf.constant_initializer(0.145)
                              )(X)
model = tf.keras.models.Model(inputs=X, outputs=Y_pred)

# set up gradient descent / create training pipeline
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss=tf.keras.losses.hinge,
              metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0)]
             )

# perform gradient descent / train
model.fit(x=X_train,
          y=Y_train,
          batch_size=X_train.shape[0],
          epochs=150
         )

# evaluate the training accuracy
_, training_accuracy = model.evaluate(X_train, Y_train, verbose = 0)

# evaluate the test accuracy
_, test_accuracy = model.evaluate(X_test, Y_test, verbose = 0)

print(f"The training accuracy is {training_accuracy * 100}% and the test accuracy is {test_accuracy * 100}%")

As you can see, there is virtually no difference between the SVM that you have trained above, and the SVM trained here. The Keras API simply saves you time and produces more readable code.

### Exercise 14

Now it is your turn to create a model using the Keras API. For this, create $3$ nodes.

The first node, called `X`, is of type `tf.keras.layers.Input` and has the same shape as the node `X` in the example.  The second node, called `linear`, is of type `tf.keras.layers.Dense` and has 1 unit. To initialize the weight and bias use the initializers provided below. The input for this node is `X`. The third node, called `sigmoid`, is of type `tf.math.sigmoid`. It takes `linear` as input.

Afterwards, create a `tf.keras.models.Model` which has `X` as input and `sigmoid` as output. Compile this model using the optimizer `tf.keras.optimizers.SGD` with learning rate 0.01 and the loss `tf.keras.losses.BinaryCrossentropy`. Also, add the metric `tf.keras.metrics.BinaryAccuracy` with threshold 0.5.

Next, fit the model to $X_{train}$ and $Y_{train}$ using a batch size equal to the dataset size. Train the model for $150$ steps/epochs.

Finally, evaluate the accuracy of the model on the test data $X_{test}$, $Y_{test}$.

In [None]:
W_0=tf.constant_initializer(np.array([[0.593, 0.236]]))
c_0=tf.constant_initializer(0.0)

In [None]:
X = tf.keras.layers.Input(shape=(2,))
linear = tf.keras.layers.Dense(1,
                               kernel_initializer=W_0,
                               bias_initializer=c_0
                              )(X)
sigmoid = tf.math.sigmoid(linear)
model = tf.keras.models.Model(inputs=X, outputs=sigmoid)

# set up gradient descent / create training pipeline
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.5)]
             )

# perform gradient descent / train
model.fit(x=X_train,
          y=Y_train,
          batch_size=X_train.shape[0],
          epochs=150
         )

# evaluate the test accuracy
_, test_accuracy = model.evaluate(X_test, Y_test, verbose = 0)
print(f"The test accuracy is {test_accuracy * 100}%")

*Note: The loss after training is* 0.2793 *with a test_accuracy of 100% and a training accuracy of 100%.*

Well done! You have now implemented logistic regression using Keras.

### Real World Data

This tutorial will now move away from artificially generated data and toward an easy real-world dataset: [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist). You may have heard about [MNIST](http://yann.lecun.com/exdb/mnist/), which is a dataset containing handwritten digits; it is the most popular dataset for teaching ML. Fashion MNIST is a contribution from Zalando Research that aims to replace MNIST. It adds just enough complexity over traditional MNIST to highlight the differences between classic models and deep models, while still being an easy dataset.

In [None]:
fashion_mnist = tf.keras.datasets.fashion_mnist
(X_train, Y_train), (X_test, Y_test) = fashion_mnist.load_data()

A lot of benchmark datasets are provided online and are free to download. For many of these, there are snippets available that will download the file and load it into TensorFlow. Fashion MNIST is no exception.

Real-world data often needs cleaning and preprocessing before it can be used to build models. This is the step where ML projects spend most of their time. Fortunately, Fashion MNIST is already quite clean; it only needs some minor retouching, which can be done using sklearn.
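
The retouching in question is one-hot encoding of the integer class labels, which is what sklearn's `OneHotEncoder` performs. The sketch below shows the idea in plain Python; the helper `one_hot` and the example labels are illustrative and not part of sklearn.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into vectors with a single 1.0 at the label's index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

# hypothetical labels for three examples of a 10-class problem
encoded = one_hot([0, 2, 9], num_classes=10)
```

Each row has exactly one non-zero entry, so it can be compared directly against a 10-dimensional vector of class probabilities.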

In [None]:
encoder = sk.preprocessing.OneHotEncoder(dtype=np.float32)
encoder.fit(Y_train.reshape((-1, 1)))

Y_train_encoded = encoder.transform(Y_train.reshape((-1, 1))).toarray()
Y_test_encoded = encoder.transform(Y_test.reshape((-1, 1))).toarray()

The difference from the datasets used previously is that Fashion MNIST has 10 classes instead of the 2 encountered so far. Hence, it is no longer a binary classification task, but general (multi-class) classification. Interestingly, this doesn't change the training syntax much:

In [None]:
X = tf.keras.layers.Input(shape=(28,28))
flat = tf.keras.layers.Flatten()(X)
linear = tf.keras.layers.Dense(10)(flat)
softmax = tf.math.softmax(linear)
model = tf.keras.models.Model(inputs=X, outputs=softmax)

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4),
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=[tf.keras.metrics.CategoricalAccuracy()]
             )

# perform gradient descent / train
model.fit(x=X_train,
          y=Y_train_encoded,
          batch_size=X_train.shape[0],
          epochs=10
         )

# evaluate the test accuracy
_, test_accuracy = model.evaluate(X_test, Y_test_encoded, verbose = 0)
print(f"The test accuracy is {test_accuracy * 100}%")

To train on multi-class data, the loss, as well as the accuracy metric, have to be replaced. The output of the model is now 10-dimensional, and to interpret it as class-probabilities one can use the softmax function. Another change is the introduction of the `Flatten` layer. This has to be done, because examples are 28x28 images, whereas a `Dense` layer expects a vector / rank-1 tensor as input. The `Flatten` layer converts between the two.
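
The two new ingredients can be sketched in a few lines of plain Python. This is only an illustration of what `Flatten` and softmax compute, not how TensorFlow implements them; the helper names and the example logits are made up.

```python
import math

def flatten(image):
    """Convert a 2D image (list of rows) into a single flat vector."""
    return [pixel for row in image for pixel in row]

def softmax(logits):
    """Map arbitrary real-valued scores to probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

image = [[0] * 28 for _ in range(28)]      # hypothetical all-black 28x28 image
vector = flatten(image)                    # 784 entries, ready for a Dense layer
probabilities = softmax([2.0, 1.0, 0.1])   # hypothetical scores for 3 classes
```

The largest score receives the largest probability, and the ordering of the scores is preserved.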

The performance is already not too shabby. To achieve higher accuracy, one needs to (1) identify the problem, and (2) change the training process/model based on these insights. To do this properly, it is useful to leverage some of sklearn's utility functions. This is done via the `tensorflow.keras.wrappers.scikit_learn.KerasClassifier` wrapper.

*Note: At the time of this writing `KerasClassifier` only works with models of type `tf.keras.models.Sequential`. So far, this tutorial used the more general and powerful functional API. `Sequential` only allows for models that have one node followed by another, which allows some more convenience methods (e.g., `add`).*

In [None]:
# this snippet will disappear at some point in the future
# it exists because I found two bugs in KerasClassifier while writing
# this notebook. 
# Issues are raised in the tensorflow repo and this should be fixed soon(TM)

from tensorflow.keras.wrappers.scikit_learn import KerasClassifier 

class KerasClassifier_Patched(KerasClassifier):
    # bugfix: classifier doesn't declare that it is a classifier
    # in the Scikit learn API
    _estimator_type = "classifier"
    
    # bugfix: the current wrapper does not work with one-hot encoded
    # labels
    # this is only a fix in the specific case of this notebook,
    # not a general one
    def score(self, x, y, **kwargs):
        _, accuracy = self.model.evaluate(x,y, verbose=0, **kwargs)
        return accuracy

In [None]:
def setupModel():
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Input(shape=(28,28)))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(10))
    model.add(tf.keras.layers.Softmax())

    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-5),
                  loss=tf.keras.losses.CategoricalCrossentropy(),
                  metrics=[tf.keras.metrics.CategoricalAccuracy()]
                 )
    return model

# this model now uses the sklearn API instead of the Keras API
# it still trains using tensorflow under the hood
model = KerasClassifier_Patched(build_fn=setupModel,
                                epochs=150,
                                batch_size=256,
                                verbose=0 # do not print training progress to console
                               )

# train the model
model.fit(X_train, Y_train_encoded)

# evaluate the test accuracy
test_accuracy = model.score(X_test, Y_test_encoded)
print(f"The test accuracy is {test_accuracy * 100:.2f}%")

*Note: training takes a while, and the output is disabled. If you want to see the progress, remove the `verbose=0` line where the model is instantiated.*

### Exercise 15

Create and train a multi-layer perceptron on fashion MNIST.

First, create a function called `setupModel` that creates and returns a `Sequential` model similar to the example above. Add another `Dense` layer between the existing `Flatten` and `Dense` layer. Set the number of units to 400, and add the optional argument `activation='sigmoid'`. This will apply a sigmoid function to the outputs of the first `Dense` layer before passing them into the second `Dense` layer. The resulting architecture is called a multi-layer perceptron (MLP).
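
To see what the extra layer does to the data flow, here is a sketch of the forward pass of such a network in plain Python. It uses tiny, hypothetical layer sizes (4 inputs, 3 hidden units, 2 classes) and made-up weights; it illustrates the computation only and is not the Keras model asked for in the exercise.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation=None):
    """Fully connected layer; weights[j] holds the incoming weights of unit j."""
    outputs = [sum(w * x for w, x in zip(row, inputs)) + b
               for row, b in zip(weights, biases)]
    return [activation(z) for z in outputs] if activation else outputs

def softmax(logits):
    shift = max(logits)  # for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical input and parameters
x = [0.5, -1.0, 2.0, 0.0]
W1 = [[0.1, 0.2, -0.1, 0.0], [0.0, -0.3, 0.2, 0.1], [0.5, 0.1, 0.0, -0.2]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.3, -0.2, 0.1], [-0.1, 0.4, 0.2]]
b2 = [0.0, 0.0]

hidden = dense(x, W1, b1, activation=sigmoid)    # first Dense layer with sigmoid
probabilities = softmax(dense(hidden, W2, b2))   # second Dense layer, then softmax
```

Without the sigmoid between the two layers, the two `dense` calls would collapse into a single linear map, which is why the activation matters.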

Next, create the model in scikit-learn using `KerasClassifier_Patched` and store it in a variable called `mlp_model`. Use the same parameters as the example above. Then, train the model and compute the accuracy on the test set.

In [None]:
def setupModel(alpha=1e-4):
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Input(shape=(28,28)))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(400, activation='sigmoid'))
    model.add(tf.keras.layers.Dense(10))
    model.add(tf.keras.layers.Softmax())

    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=alpha),
                  loss=tf.keras.losses.CategoricalCrossentropy(),
                  metrics=[tf.keras.metrics.CategoricalAccuracy()]
                 )
    return model

# this model now uses the sklearn API instead of the Keras API
# it still trains using tensorflow under the hood
mlp_model = KerasClassifier_Patched(build_fn=setupModel,
                                    epochs=150,
                                    batch_size=256,
                                    #verbose = 0 # do not print training progress to console
                                   )

# train the model
mlp_model.fit(X_train, Y_train_encoded)

# evaluate the test accuracy
# `score` already passes verbose=0 to `evaluate`; passing it again raises a TypeError
test_accuracy = mlp_model.score(X_test, Y_test_encoded)
print(f"The test accuracy is {test_accuracy * 100:.2f}%")

*Note: For this and the following exercises there will be no target output value. The exact numbers depend on the random initialization of weights in the network. Your accuracy should be similar to the accuracy of the example.*

### Confusion Matrix

In [None]:
# increase the default plot size
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 10]

What the accuracy of a model measures is largely self-explanatory: what fraction of the examples was classified correctly? While the prime objective of a classification model is a high accuracy, one cannot learn much about how to improve its performance by looking at this value alone. Instead, one has to look at the cases where the classification fails.

This is where the confusion matrix comes into play. It is a square matrix that has, along the first dimension, each category present in $Y$ and, along the second dimension, each category of $Y_\textrm{pred}$. A cell $c_{ab}$ of the matrix denotes how often an example from class $a\in Y$ has been classified with label $b\in Y_\textrm{pred}$. The numbers along the diagonal represent how many examples were classified correctly, while the other elements represent the way a model 'confuses' an example of category $a$ for an example of category $b$. Hence the name confusion matrix.
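
The definition translates almost directly into code. The following sketch builds the matrix in plain Python; the function name and the toy labels are invented for illustration, and no normalization is applied.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[a][b] counts examples of true class a that were labeled as class b."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for a, b in zip(y_true, y_pred):
        matrix[a][b] += 1
    return matrix

# hypothetical true and predicted labels for a 3-class problem
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
correct = sum(cm[i][i] for i in range(3))  # the diagonal holds the correct predictions
accuracy = correct / len(y_true)
```

Here one example of class 0 is confused for class 1 and one example of class 2 for class 0; the accuracy alone would not reveal which classes are involved.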

In [None]:
labels = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat", "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
sk.metrics.plot_confusion_matrix(model, 
                                 X_test, 
                                 Y_test, 
                                 normalize="pred",
                                 display_labels = labels
                                )
plt.show()

Looking at this plot, the hardest class to detect is shirts, which are most commonly confused with coats or T-shirts.

### Exercise 16

Plot the confusion matrix for `mlp_model`.

In [None]:
labels = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat", "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
sk.metrics.plot_confusion_matrix(mlp_model, 
                                 X_test, 
                                 Y_test, # integer labels; the wrapper predicts class indices
                                 normalize="all",
                                 display_labels = labels
                                )

### Save the trained Model

After training a good model, one usually wants to use it in other programs to make predictions. To do so, the model can be saved to disk in either the HDF5 (`.h5`) or the TensorFlow SavedModel (`.tf`) format.

In [None]:
# unwrap the Keras model; `model` itself stays wrapped in the scikit learn API
# because later cells (e.g., cross-validation) still need the wrapper
keras_model = model.model

# save the model using the Keras API
keras_model.save('model.tf')

### Load the trained Model

In [None]:
# load using the keras API (not scikit learn)
keras_model = tf.keras.models.load_model('model.tf')

### Exercise 17

Save and load `mlp_model`.

In [None]:
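# unwrap the Keras model, but keep the scikit learn wrapper in `mlp_model`
# so that it can still be used for cross-validation later on
keras_mlp_model = mlp_model.model
keras_mlp_model.save('mlp_model.tf')
keras_mlp_model = tf.keras.models.load_model('mlp_model.tf')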

### Cross-Validation

When selecting models one is often interested in a model's ability to generalize. Even though the available data might not be exhaustive, if the chosen model generalizes well, it should (in theory) learn a good enough representation to perform well on unseen examples. Yet, how should generalizability be quantified?

This is what $K$-fold cross-validation does. It takes an existing dataset and creates $K$ new datasets by splitting the data into $K$ equal chunks (folds), removing the $k$-th fold from the dataset and calling the result a new dataset. The $k$-th fold is then used as a validation set. The idea is: if the model has high validation accuracy, regardless of which chunk of data gets removed from the training set and regardless of which chunk is chosen as a validation set, then the model generally fits the data well, i.e., it generalizes well.
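
The splitting itself can be sketched as follows. This is a plain-Python illustration without shuffling or stratification, and `k_fold_splits` is a made-up helper, not the scikit-learn implementation.

```python
def k_fold_splits(num_examples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(num_examples))
    # distribute the remainder so that fold sizes differ by at most one
    fold_sizes = [num_examples // k + (1 if i < num_examples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]             # the k-th fold
        train = indices[:start] + indices[start + size:]     # everything else
        yield train, validation
        start += size

splits = list(k_fold_splits(num_examples=10, k=3))
```

Every example ends up in exactly one validation fold, and no validation example is ever part of the matching training set.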

Evaluating the accuracy using cross-validation in scikit-learn is quite easy, and you can find a [great practical overview](https://scikit-learn.org/stable/modules/cross_validation.html) of various types of cross-validation on their website.

In [None]:
from sklearn.model_selection import cross_val_score

folds = 3 # number of chunks to split the dataset into
scores = cross_val_score(model, X_train, Y_train_encoded, cv=folds)
print(f"Accuracy: {scores.mean():.2f} Standard Deviation: {scores.std():0.4f}")

On top of this, cross-validation measures how sensitive a model is to changes in the training data. $K$-fold cross-validation results in $K$ validation scores; their mean gives an estimate of general model performance, while their standard deviation estimates the sensitivity to the input data.
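
As a small worked example with made-up scores (the three accuracies below are hypothetical, not results from this notebook), the two summary statistics can be computed like this. The population standard deviation is used because that is what NumPy's `std()` computes by default.

```python
import statistics

scores = [0.81, 0.79, 0.83]  # hypothetical validation accuracies from 3 folds

mean_score = statistics.mean(scores)    # estimate of general performance
std_score = statistics.pstdev(scores)   # spread across folds (population std)

print(f"Accuracy: {mean_score:.2f} Standard Deviation: {std_score:0.4f}")
```

A small standard deviation relative to the mean suggests that the model's performance does not depend much on which fold was held out.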


### Exercise 18

Compute the cross-validation score for the multi-layer perceptron model you have created above.

In [None]:
folds = 3 # number of chunks to split the dataset into
scores = cross_val_score(mlp_model, X_train, Y_train_encoded, cv=folds)
print(f"Accuracy: {scores.mean():.2f} Standard Deviation: {scores.std():0.4f}")

Being able to measure this sensitivity enables parameter optimization. While this tutorial can only scratch the surface of parameter tuning, one example of it is the impact of the number of examples on performance. There is the common belief among data scientists that "More data is better". While this is generally true, there comes a point of diminishing returns after which adding more examples only has a marginal effect on performance.

To measure the effect of the number of samples on model performance, one first creates datasets of different sizes from the available data and then performs cross-validation on each of these new datasets. The trend of the mean validation scores as the amount of training data increases then informs about the effect of dataset size on performance. If the curve is still rising and has not flattened out yet, more data will likely continue to significantly improve performance.
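
The procedure can be sketched in plain Python as shown below. This is only an outline of the idea under simplifying assumptions: the "model" is a toy stand-in that always predicts the most common training label, the labels are invented, and five training-set fractions combined with three folds are assumed, which results in 15 model fits.

```python
from collections import Counter

def majority_class_accuracy(train_labels, validation_labels):
    """Toy stand-in for 'train a model, then score it' (predicts the most common label)."""
    prediction = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for y in validation_labels if y == prediction)
    return hits / len(validation_labels)

def learning_curve(labels, fractions, k):
    """Mean k-fold validation score for each training-set fraction."""
    fold = len(labels) // k
    curve = []
    for fraction in fractions:
        scores = []
        for i in range(k):
            validation = labels[i * fold:(i + 1) * fold]
            train = labels[:i * fold] + labels[(i + 1) * fold:]
            # only use the first part of the training data
            subset = train[:max(1, int(fraction * len(train)))]
            scores.append(majority_class_accuracy(subset, validation))
        curve.append((fraction, sum(scores) / k))
    return curve

labels = [0, 1, 1] * 10                       # hypothetical labels for 30 examples
fractions = [0.1, 0.325, 0.55, 0.775, 1.0]    # assumed training-set fractions
curve = learning_curve(labels, fractions, k=3)
models_trained = len(fractions) * 3           # one fit per fraction and fold
```

Plotting the second entry of each tuple in `curve` against the first gives the learning curve.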

In [None]:
def setupModel():
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Input(shape=(28,28)))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(10))
    model.add(tf.keras.layers.Softmax())

    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-5),
                  loss=tf.keras.losses.CategoricalCrossentropy(),
                  metrics=[tf.keras.metrics.CategoricalAccuracy()]
                 )
    return model

# this model now uses the sklearn API instead of the Keras API
# it still trains using tensorflow under the hood
model = KerasClassifier_Patched(build_fn=setupModel,
                                epochs=150,
                                batch_size=256,
                                verbose=0
                               )

plot_obj = skplt.estimators.plot_learning_curve(model, X_train, Y_train_encoded)

*Note: The code block above trains 15 classifiers over 150 epochs. It will take several minutes to complete; on my machine around 30.*

### Exercise 19

Compute the relationship between the number of training examples and model accuracy for the multi-layer perceptron model you have created above. For this, create a classifier that trains over 100 epochs.

In [None]:
def setupModel():
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Input(shape=(28,28)))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(400, activation='sigmoid'))
    model.add(tf.keras.layers.Dense(10))
    model.add(tf.keras.layers.Softmax())

    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-5),
                  loss=tf.keras.losses.CategoricalCrossentropy(),
                  metrics=[tf.keras.metrics.CategoricalAccuracy()]
                 )
    return model

# this model now uses the sklearn API instead of the Keras API
# it still trains using tensorflow under the hood
model = KerasClassifier_Patched(build_fn=setupModel,
                                epochs=100,
                                batch_size=256,
                                verbose=0
                               )

plot_obj = skplt.estimators.plot_learning_curve(model, X_train, Y_train_encoded)

*Note: The code block above trains 15 classifiers over 100 epochs. It will take several minutes to complete; on my machine around 20.*