# Neural Networks

Hopefully you've watched the three videos by [Grant Sanderson](https://twitter.com/3blue1brown) (a.k.a. [3blue1brown](https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw)).

* [But what is a Neural Network?](https://www.youtube.com/watch?v=aircAruvnKk) (19:13)
* [Gradient descent, how neural networks learn](https://www.youtube.com/watch?v=IHZwWFHWa-w) (21:00)
* [What is back propagation really doing?](https://www.youtube.com/watch?v=Ilg3gGewQ5U) (13:53)

---

## A very brief recap from the homework

**Neurons**:

* Hold a value
* This value is related to the values of neurons on previous layers via:
    * **weights**
    * **bias**
    * **activation function**
* Some jargon: weights and biases are called **parameters** of the model (they are estimated from data automatically). Other settings that we choose ourselves, such as the number of layers or the activation function, are called **hyperparameters**.
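
To make the relationship above concrete, here is a minimal sketch of how a single neuron's value could be computed from the previous layer. The input values, weights and bias are made-up numbers for illustration, and the sigmoid is just one possible activation function.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values held by three neurons on the previous layer
previous_layer = np.array([0.5, 0.1, 0.9])
# One weight per incoming connection, plus a single bias (made-up numbers)
weights = np.array([0.4, -0.6, 0.2])
bias = -0.1

# Weighted sum of the inputs plus the bias ...
weighted_sum = np.dot(weights, previous_layer) + bias
# ... passed through the activation function gives the neuron's value
neuron_value = sigmoid(weighted_sum)
print(weighted_sum, neuron_value)
```

A whole layer does the same thing for many neurons at once, which is why the weights of a layer are usually stored as a matrix.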

**Neural network structure**:

<img src="https://upload.wikimedia.org/wikipedia/commons/1/1d/Neural_network_example.png"  style="width:200px;">

* An input layer
* One or more hidden layers (this is where the term "deep" comes from)
* An output layer

**Learning**:

* Minimizing a **loss function** (or **cost function**) through back propagation
  * Loss is often **Mean Squared Error** (**MSE**) between the predicted and the true values
* An **optimizer** helps find the best possible parameters
  * Data is fed to the model with the current weights and biases, and the optimizer instructs how to adjust the weights and biases, and the process is iterated.
  * This can be **gradient descent**, which is a slow process.
  * The choice of optimizer might mean the difference between a model that is trained in minutes vs days.
  * Each time the entire set of data is fed to the algorithm, it is called an **epoch**.
  * Sometimes the adjustment process can be sped up by feeding in the data in smaller **batches** (usually randomly selected) and adjusting the weights more frequently.
    * An example of this strategy is **stochastic gradient descent**.
    * A modern extension of stochastic gradient descent is the **Adam** optimizer, which is now very commonly used. The math is pretty heavy, but you can read about some of the details here: [Gentle Introduction to the Adam Optimization Algorithm for Deep Learning](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)
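
As a rough illustration of the learning loop described above, the sketch below fits a tiny linear model with plain gradient descent on the MSE loss. The toy data, the learning rate and the number of epochs are arbitrary choices for this example; a real neural network has many more parameters and gets its gradients from back propagation.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between truth and prediction."""
    return np.mean((y_true - y_pred) ** 2)

# Toy data generated from y = 2x + 1 (illustrative only)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0        # parameters start at arbitrary values
learning_rate = 0.05   # a hyperparameter we picked by hand

initial_loss = mean_squared_error(y, w * x + b)
for epoch in range(2000):
    error = (w * x + b) - y
    # Gradients of the MSE with respect to w and b
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Step "downhill": adjust each parameter against its gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
final_loss = mean_squared_error(y, w * x + b)

print(w, b, initial_loss, final_loss)
```

Each pass through the loop uses all of the data once, so in this sketch one iteration is one epoch.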


<img src="https://pbs.twimg.com/media/EybMJzOU8AY8g7M?format=png&name=small"  style="width:400px;">

Now that we have some concepts defined, let's play around with a neural network before touching any code:

https://playground.tensorflow.org/

In [None]:
# Download data and solutions

import urllib.request
import os

def download_data(path, branch='main'):
    base_url = 'https://raw.githubusercontent.com/ualberta-rcg/python-machine-learning'
    if os.path.exists(path):
        return
    if not os.path.exists('data'):
        os.mkdir('data')
    if not os.path.exists('data/numbers'):
        os.mkdir('data/numbers')
    url = '{}/{}/notebooks/{}'.format(base_url, branch, path)
    output_file = path
    urllib.request.urlretrieve(url, output_file)
    print("Downloaded " + path)

download_data('data/numbers/cwant_1.png')
download_data('data/numbers/cwant_3.png')
download_data('data/numbers/cwant_5.png')
download_data('data/numbers/cwant_8.png')
download_data('data/numbers/cwant_thick_1.png')
download_data('data/numbers/cwant_thick_3.png')
download_data('data/numbers/cwant_thick_4.png')
download_data('data/numbers/cwant_thick_5.png')
download_data('data/numbers/cwant_thick_6.png')
download_data('data/numbers/cwant_thick_9.png')

In [None]:
# !pip install keras
# !pip install tensorflow

Like other packages we have seen, Keras has a submodule of sample datasets. The **MNIST** dataset of handwritten numbers is included, which we can load as both training and test data.

In [None]:
import keras
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

We can see how many samples are in the **training** features data, and the shape of each sample ...

In [None]:
x_train.shape

Same for the **test** data...

In [None]:
x_test.shape

In [None]:
# 60000
num_train = x_train.shape[0]

# 10000
num_test = x_test.shape[0]

# 784
num_pixels = x_train.shape[1] * x_train.shape[2]

We can look at an individual sample in the training data ...

In [None]:
x_train[31] # 32nd record

But it probably makes more sense to convert this data into an image and render it. The `PIL` module makes this easy.

In [None]:
# Import the submodule explicitly; `import PIL` alone may not load `PIL.Image`
import PIL.Image
PIL.Image.fromarray(x_train[31])

We can then check the label to see that the image corresponds to the number we think it is ...

In [None]:
y_train[31]

We will now transform the feature data to convert each 28 * 28 image to a 784 entry array through the `reshape` method from `numpy.ndarray`.

In [None]:
X_train = x_train.reshape(num_train, num_pixels)
X_test = x_test.reshape(num_test, num_pixels)

X_train.shape

In [None]:
X_train[0]

In [None]:
# Array of 28x28 inputs
print(x_train[128][14][13])

# Array of 784 inputs, basically shove each row at the end of the previous
print(X_train[128][14*28+13])

And we can convert the numbers in the label data to categorical data (basically one-hot encoding).

In [None]:
import keras.utils as ku

Y_train = ku.to_categorical(y_train, 10)
Y_test = ku.to_categorical(y_test, 10)

In [None]:
y_train[26]

In [None]:
Y_train[26]

In [None]:
Y_train[26].argmax()
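
To see what one-hot encoding does without Keras, here is a small NumPy sketch. The `one_hot` helper is a hypothetical function written for this illustration (it is not how `to_categorical` is implemented), and the labels are made up.

```python
import numpy as np

def one_hot(labels, num_classes):
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes)[labels]

labels = np.array([3, 0, 9])      # made-up digit labels
encoded = one_hot(labels, 10)
print(encoded[0])

# argmax picks the position of the 1, which recovers the original label
decoded = encoded.argmax(axis=1)
print(decoded)
```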

## Sequential model

Sequential groups a linear stack of layers. The code below:

* Specifies the input layer as having 784 items
* Has an intermediate layer with 128 nodes
* Has an output layer of 10 nodes

Each layer has a `sigmoid` activation function.

In [None]:
import keras.models as km
import keras.layers as kl

model = km.Sequential()
model.add(kl.Dense(128, input_dim = 784, activation = 'sigmoid', name = 'hidden'))
model.add(kl.Dense(10, name = 'output', activation = 'sigmoid'))

In [None]:
model.summary()
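
The parameter counts reported by the summary can be checked by hand. The sketch below assumes a fully connected (dense) layer, where every node has one weight per input plus one bias; `dense_parameter_count` is a helper name made up for this example.

```python
def dense_parameter_count(num_inputs, num_nodes):
    # One weight per (input, node) pair, plus one bias per node
    return num_inputs * num_nodes + num_nodes

hidden_params = dense_parameter_count(784, 128)   # input -> hidden
output_params = dense_parameter_count(128, 10)    # hidden -> output
total_params = hidden_params + output_params
print(hidden_params, output_params, total_params)
```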

## Compiling the model

Compiling prepares the model for training.

The optimizer chosen here is `sgd` (Stochastic Gradient Descent).

The loss/cost function we will use is `mean_squared_error`.

The accuracy is reported during training for each epoch.

In [None]:
model.compile(optimizer='sgd',
              metrics=['accuracy'],
              loss="mean_squared_error")

## Training

Gradient descent is a slow process, so one speed-up is to send the data to the algorithm in random batches until all of the data has been read (stochastic gradient descent). Each full pass through the training data is called an **epoch**.

An epoch can be split into **minibatches** (or just **batches**), and the model's parameters are updated after each one.

So the number of epochs you train for is how many times the model will see each training sample.
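
As a back-of-the-envelope sketch of how epochs and batches relate, the code below counts parameter updates for 60000 samples, a batch size of 100 and 25 epochs, and shows one way an epoch could be cut into shuffled batches. The `iterate_minibatches` helper and the random seed are assumptions for illustration; Keras handles the batching internally.

```python
import math
import numpy as np

num_samples = 60000
batch_size = 100
epochs = 25

# The parameters are updated once per batch
updates_per_epoch = math.ceil(num_samples / batch_size)
total_updates = updates_per_epoch * epochs
print(updates_per_epoch, total_updates)

def iterate_minibatches(num_samples, batch_size, rng):
    """Yield index arrays that together cover every sample exactly once."""
    order = rng.permutation(num_samples)   # shuffle the sample indices
    for start in range(0, num_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(seed=0)        # the seed is arbitrary
batches = list(iterate_minibatches(num_samples, batch_size, rng))
print(len(batches), batches[0][:5])
```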

In [None]:
%%time
history = model.fit(X_train,
                    Y_train,
                    epochs=25,
                    batch_size=100,
                    verbose=1)

We can now check out the accuracy of our model on our unseen test data

In [None]:
score = model.evaluate(X_test, Y_test)

What's up with that `history` variable that's output from training? It provides some information about the loss and accuracy for each epoch.

We can use this to plot the loss and accuracy over the epochs for this training session.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'accuracy']].plot();
plt.xlabel('Epoch')
plt.ylabel('Accuracy/Loss')

**Now run the training and evaluation cells again.** (Training continues where we left off, and we can continue training the same model.)

## Exercise: the ultimate test

Now the ultimate test: can this model correctly detect **your** hand-drawn numbers?

You might want to try drawing your own number here:

https://drawisland.com/?w=200&h=200

Rules:
* Draw a digit with a black pen on a white background (default)
* Perhaps bump up the pen size
* Click the **Save** button to save a `png` file to your computer (hint: put the digit you drew as part of the filename).
* Put the image (or upload to Colab) in the subdirectory `data/numbers` of your current workbook directory. There should be some `png` files of numbers I drew already in there.

To figure out the current notebook directory, uncomment one of the lines with the exclamation mark:

In [None]:
# Linux/Mac/Colab
# !pwd

# Windows
# !dir

We can write a function that loads/displays/transforms/predicts an image file:

In [None]:
import PIL.Image
import PIL.ImageOps
import numpy as np

def image_predict(model, filename):
    # Load, resize to 28x28, and convert to grayscale
    image = PIL.Image.open(filename).resize( (28,28) ).convert( 'L' )
    # Switch black and white
    image = PIL.ImageOps.invert(image)
    # Display
    print("Filename: {}".format(filename))
    print("Image:")
    display(image)
    # Convert to numpy array and reshape as 784 length vector
    image_array = np.array(image)[:,:].reshape(784)
    # Predict!
    prediction = model.predict(np.array([image_array])).argmax()
    print("Prediction: {}\n".format(prediction))

We can now test it out on your file (replace `cwant_8.png` with your filename):

In [None]:
image_predict(model, 'data/numbers/cwant_8.png')

Did the model predict the correct number?

We use the `glob` module to predict all of the numbers in the `data/numbers` directory:

In [None]:
import glob
for filename in glob.glob('data/numbers/*.png'):
    image_predict(model, filename)

How did the model do?

## How about adding another layer?

We now have more than three layers (including input and output), so our network is considered to be **deep** (and we are doing **deep learning**). In general, the deeper the network, the more complex learning it can do (at the cost of having to optimize many more parameters).

In [None]:
%%time

model = km.Sequential()
model.add(kl.Dense(128, input_dim = 784, activation = 'sigmoid', name = 'hidden'))
model.add(kl.Dense(128, activation = 'sigmoid', name = 'hidden2'))
model.add(kl.Dense(10, name = 'output', activation = 'sigmoid'))

model.compile(optimizer='sgd',
              metrics=['accuracy'],
              loss="mean_squared_error")

# We will take the default batch size (32)
history = model.fit(X_train,
                    Y_train,
                    epochs=25,
                    verbose=1)

score = model.evaluate(X_test, Y_test)

In [None]:
import glob
for filename in glob.glob('data/numbers/*.png'):
    image_predict(model, filename)

This time the descent of the loss is slower, and the accuracy is less impressive. We would need a lot more epochs to train this model.

## How about just a wider layer?

Another way to have a network learn more complex patterns is with wider layers.

In [None]:
%%time

model = km.Sequential()
model.add(kl.Dense(1024, input_dim = 784, activation = 'sigmoid', name = 'hidden'))
model.add(kl.Dense(10, name = 'output', activation = 'sigmoid'))

model.compile(optimizer='sgd',
              metrics=['accuracy'],
              loss="mean_squared_error")

history = model.fit(X_train,
                    Y_train,
                    epochs=25,
                    verbose=1)

score = model.evaluate(X_test, Y_test)

In [None]:
import glob
for filename in glob.glob('data/numbers/*.png'):
    image_predict(model, filename)

## How about a different activation function?

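Our current activation function is the sigmoid, which squashes its input into the range (0, 1).

There is another very popular activation function used in machine learning, the **Rectified Linear Unit** (ReLU), which outputs zero for negative inputs and the input itself otherwise.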
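
Here is a small NumPy sketch of the two activation functions, evaluated at a few arbitrary sample points. These are the textbook definitions written out by hand, not the Keras implementations.

```python
import numpy as np

def sigmoid(z):
    # Smooth S-shaped curve: output always lies between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Zero for negative inputs, the input itself otherwise
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
sigmoid_values = sigmoid(z)
relu_values = relu(z)
print(sigmoid_values)
print(relu_values)
```

The sigmoid flattens out for large positive or negative inputs, while ReLU keeps a constant slope for positive inputs.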

Let's set up our model again to use ReLU for one of the hidden layers ...

In [None]:
%%time

model = km.Sequential()
model.add(kl.Dense(128, input_dim = 784, activation = 'sigmoid', name = 'hidden'))
model.add(kl.Dense(128, activation = 'relu', name = 'hidden2'))
model.add(kl.Dense(10, name = 'output', activation = 'sigmoid'))

model.compile(optimizer='sgd',
              metrics=['accuracy'],
              loss="mean_squared_error")

history = model.fit(X_train,
                    Y_train,
                    epochs=25,
                    batch_size=100,
                    verbose=1)

ReLU helps the training go quicker ...


## Different optimizer (adam)

In [None]:
model = km.Sequential()
model.add(kl.Dense(128, input_dim = 784, activation = 'sigmoid', name = 'hidden'))
model.add(kl.Dense(128, activation = 'relu', name = 'hidden2'))
model.add(kl.Dense(10, name = 'output', activation = 'sigmoid'))

model.compile(optimizer='adam',
              metrics=['accuracy'],
              loss="mean_squared_error")

In [None]:
%%time
history = model.fit(X_train,
                    Y_train,
                    epochs=25,
#                    batch_size=100,
                    verbose=1)

In [None]:
score = model.evaluate(X_test, Y_test)

In [None]:
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'accuracy']].plot();
plt.xlabel('Epoch')
plt.ylabel('Accuracy/Loss')

In [None]:
for filename in glob.glob('data/numbers/*.png'):
    image_predict(model, filename)

### A third hidden layer (ReLU) ...

In [None]:
model = km.Sequential()
model.add(kl.Dense(128, input_dim = 784, activation = 'sigmoid', name = 'hidden'))
model.add(kl.Dense(128, activation = 'relu', name = 'hidden2'))
model.add(kl.Dense(128, activation = 'relu', name = 'hidden3'))
model.add(kl.Dense(10, name = 'output', activation = 'sigmoid'))

model.compile(optimizer='adam',
              metrics=['accuracy'],
              loss="mean_squared_error")

%time history = model.fit(X_train, Y_train, epochs=25)

score = model.evaluate(X_test, Y_test)

In [None]:
for filename in glob.glob('data/numbers/*.png'):
    image_predict(model, filename)

## Stopping early ...

In [None]:
from keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    min_delta=0.001, # minimum amount of change to count as an improvement
    patience=3, # how many epochs to wait before stopping
    restore_best_weights=True,
)

model = km.Sequential()
model.add(kl.Dense(128, input_dim = 784, activation = 'sigmoid', name = 'hidden'))
model.add(kl.Dense(128, activation = 'relu', name = 'hidden2'))
model.add(kl.Dense(128, activation = 'relu', name = 'hidden3'))
model.add(kl.Dense(10, name = 'output', activation = 'sigmoid'))

model.compile(optimizer='adam',
              metrics=['accuracy'],
              loss="mean_squared_error")

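# EarlyStopping monitors `val_loss` by default, so hold out part of the
# training data for validation (the 20% split is an arbitrary choice)
%time history = model.fit(X_train, Y_train, epochs=25, validation_split=0.2, callbacks=[early_stopping])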

score = model.evaluate(X_test, Y_test)
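
The idea behind early stopping can be sketched in plain Python. This is a simplified illustration of the `min_delta` and `patience` logic, not the actual Keras implementation, and the list of validation losses is made up.

```python
def early_stopping_epoch(val_losses, min_delta=0.001, patience=3):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # A real improvement: remember it and reset the counter
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran out of epochs without triggering the stop
    return len(val_losses) - 1, best_epoch

# Made-up validation losses, one per epoch
losses = [0.50, 0.30, 0.20, 0.1995, 0.1999, 0.21, 0.19]
stopped_at, best_epoch = early_stopping_epoch(losses)
print(stopped_at, best_epoch)
```

With `restore_best_weights=True`, the model would be rolled back to the weights from the best epoch rather than keeping those from the epoch where training stopped.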

## Dropping out ...

In [None]:
model = km.Sequential()
model.add(kl.Dense(128, input_dim = 784, activation = 'sigmoid', name = 'hidden'))
model.add(kl.Dropout(0.2))
model.add(kl.Dense(128, activation = 'relu', name = 'hidden2'))
model.add(kl.Dropout(0.2))
model.add(kl.Dense(128, activation = 'relu', name = 'hidden3'))
model.add(kl.Dropout(0.2))
model.add(kl.Dense(10, name = 'output', activation = 'sigmoid'))

model.compile(optimizer='adam',
              metrics=['accuracy'],
              loss="mean_squared_error")

%time history = model.fit(X_train, Y_train, epochs=25)

score = model.evaluate(X_test, Y_test)

In [None]:
for filename in glob.glob('data/numbers/*.png'):
    image_predict(model, filename)
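
The `Dropout(0.2)` layers above randomly switch off a fraction of the values passing through them during training, which makes it harder for the network to rely too heavily on any single neuron. Here is a minimal NumPy sketch of the idea; the `dropout` helper, the seed and the all-ones input are assumptions for illustration.

```python
import numpy as np

def dropout(activations, rate, rng):
    # Keep each value with probability (1 - rate), zero it otherwise
    keep_mask = rng.random(activations.shape) >= rate
    # Scale the survivors up so the expected total stays the same
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(seed=0)   # the seed is arbitrary
activations = np.ones(10000)          # pretend outputs of a layer
dropped = dropout(activations, rate=0.2, rng=rng)

fraction_zeroed = np.mean(dropped == 0.0)
print(fraction_zeroed, dropped.mean())
```

Dropout is only active while training; at prediction time all of the values are passed through unchanged.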

# JUNKYARD

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
plt.plot(pd.DataFrame(history.history)[['accuracy', 'loss']])
plt.xlabel('Epoch')
plt.ylabel('Accuracy')

In [None]:
history.history

In [None]:
dir(model)

In [None]:
help(model.make_predict_function)

In [None]:
keras.backend.clear_session()
history = model.fit(
    X_train, Y_train,
    validation_data=(X_test, Y_test),
    batch_size=1000,
    epochs=1,
    # callbacks=[early_stopping], # put your callbacks in a list
    verbose=1,  # show the training log (use 0 to turn it off)
)

Activation functions
Optimizers

https://playground.tensorflow.org/


### Question

What do you call it when your model works great on the training data, but doesn't work so well on unseen data?


## TODO: Good section on overfitting and underfitting:

https://www.kaggle.com/ryanholbrook/overfitting-and-underfitting

## Regularization

Regularization is a method we can use to tackle overfitting.

To quote the SciNet neural networks workshop:

"Regularization is an ad hoc technique by which parameters in a model are penalized to prevent
individual parameters from becoming excessively important to the fit."

This technique adds a penalty term to the cost function used in training, so that large parameter values are penalized. The extent of the penalty is controlled by a parameter lambda ($\lambda$). (Note that we can't call the parameter `lambda` below, because `lambda` is a reserved keyword in Python, so we call it `lam`.)
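
As a rough sketch of what an L2 penalty does to the cost, the code below adds `lam` times the sum of squared weights to a loss value. The weights and the unregularized loss are made-up numbers, and `l2_penalty` is a helper written for this illustration.

```python
import numpy as np

def l2_penalty(weights, lam):
    # lam times the sum of the squared weights
    return lam * np.sum(weights ** 2)

weights = np.array([[0.5, -1.0],
                    [2.0,  0.0]])   # made-up weight matrix
base_loss = 0.25                    # hypothetical unregularized loss
lam = 0.001

penalty = l2_penalty(weights, lam)
regularized_loss = base_loss + penalty
print(penalty, regularized_loss)
```

Because the weights are squared, the largest weight (2.0 here) contributes most of the penalty, so the optimizer is nudged toward keeping all weights small.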

In [None]:
import keras.models as km
import keras.layers as kl
import keras.regularizers as kr

def get_regularized_model(numnodes, lam=0.0):
  model = km.Sequential()
  model.add(kl.Dense(numnodes, input_dim = 784, activation = 'sigmoid', name = 'hidden', kernel_regularizer = kr.l2(lam)))
  model.add(kl.Dense(10, name = 'output', activation = 'sigmoid',kernel_regularizer = kr.l2(lam)))
  return model

In [None]:
model2 = get_regularized_model(30, lam = 0.001)

model2.compile(optimizer = 'sgd', metrics = ['accuracy'], loss = "mean_squared_error")

%time fit2 = model2.fit(x_train2, y_train2, epochs = 1000, batch_size = 5, verbose = 2)

In [None]:
model2.evaluate(x_test2, y_test2)

In [None]:
def get_model_more(numnodes):
  model = km.Sequential()
  model.add(kl.Dense(numnodes, input_dim = 784, activation = 'sigmoid', name = 'hidden'))
  model.add(kl.Dense(numnodes, input_dim = numnodes, activation = 'sigmoid', name = 'hidden2'))
  model.add(kl.Dense(10, name = 'output', activation = 'sigmoid'))
  return model

In [None]:
%%time

NUM_TRAINING = 4000 # 60000 max
NUM_TESTING = 1000 # 10000 max

NUM_NODES = 30
NUM_HIDDEN_LAYERS = 1

BATCH_SIZE=1000
EPOCHS=150

LAM=0.001

(x_train, y_train), (x_test, y_test) = get_data(num_training=NUM_TRAINING,
                                                num_testing=NUM_TESTING)
#model = get_model(num_nodes=NUM_NODES,
#                  num_hidden_layers=NUM_HIDDEN_LAYERS,
#                  lam=LAM)

train_model(model,
            x_train,
            y_train,
            batch_size=BATCH_SIZE,
            epochs=EPOCHS)

evaluate_model(model, x_test, y_test)

In [None]:
import glob
for filename in glob.glob('data/numbers/*.png'):
    model_predict(model, filename)

In [None]:
import pandas

y_values = [v.argmax() for v in y_train]
pandas.Series(y_values).value_counts()

In [None]:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())


In [None]:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

## Example MNIST

https://www.kaggle.com/hassanamin/tensorflow-mnist-gpu-tutorial

In [None]:
mnist = keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

In [None]:
# x_train, x_test = x_train / 255.0, x_test / 255.0

In [None]:
#y_train = ku.to_categorical(y_train[:], 10)
#y_test = ku.to_categorical(y_test[:], 10)

In [None]:
y_train[0]

In [None]:
model = keras.models.Sequential()
#  keras.layers.Flatten(input_shape=(28, 28)),
model.add(kl.Dense(128, input_dim = 784, activation='relu'))
model.add(kl.Dense(128, activation='relu'))
model.add(kl.Dense(128, activation='relu'))
#model.add(kl.Dropout(0.2))
model.add(kl.Dense(10))


In [None]:
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)

#model.compile(optimizer='adam',
#              #loss=loss_fn,
#              metrics=['accuracy'])
model.compile(optimizer = 'sgd', metrics = ['accuracy'], loss = "mean_squared_error")


In [None]:
model.fit(X_train, Y_train, epochs=10)

In [None]:
# Evaluate on data in the same form the model was trained on
# (flattened images, one-hot labels)
model.evaluate(X_test, Y_test, verbose=2)

In [None]:
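import numpy as np
import PIL.Image
import PIL.ImageOps

def model_predict(model, filename):
    # Load, resize to 28x28, and convert to grayscale
    image = PIL.Image.open(filename).resize( (28,28) ).convert( 'L' )
    # Switch black and white to match the MNIST convention
    image = PIL.ImageOps.invert(image)
    print("Filename: {}".format(filename))
    print("Image:")
    display(image)
    # Flatten to a 784-length vector to match the model's input layer
    image_array = np.array(image).reshape(784)
    prediction = model.predict(np.array([image_array])).argmax()
    print("Prediction: {}\n".format(prediction))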

In [None]:
import glob
for filename in glob.glob('data/numbers/*.png'):
    model_predict(model, filename)

In [None]:
mnist = keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

model = keras.models.Sequential([
  keras.layers.Flatten(input_shape=(28, 28)),
#  keras.layers.Dense(128, activation='relu'),
#  keras.layers.Dense(128, activation='relu'),
#  keras.layers.Dense(128, activation='relu'),
  keras.layers.Dropout(0.2),
  keras.layers.Dense(10)
])

loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])

%time model.fit(x_train, y_train, epochs=10)
model.evaluate(x_test,  y_test, verbose=2)

for filename in glob.glob('data/numbers/*.png'):
    image_predict(model, filename)

In [None]:
mnist = keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

model = keras.models.Sequential([
  keras.Input(shape=(784,)),  # the input shape must be given as a tuple
  keras.layers.Dense(128, activation='relu'),
#  keras.layers.Dense(128, activation='relu'),
#  keras.layers.Dense(128, activation='relu'),
  keras.layers.Dropout(0.2),
  keras.layers.Dense(10)
])

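# Y_train is one-hot encoded and the output layer has no activation,
# so use the non-sparse loss on raw logits
loss_fn = keras.losses.CategoricalCrossentropy(from_logits=True)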
model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])

%time model.fit(X_train, Y_train, epochs=10)
model.evaluate(X_test,  Y_test, verbose=2)

for filename in glob.glob('data/numbers/*.png'):
    image_predict(model, filename)

In [None]:
import PIL.Image
PIL.Image.fromarray(x_train[0])

In [None]:
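def model_predict(model, filename):
    # Load, resize to 28x28, and convert to grayscale
    image = PIL.Image.open(filename).resize( (28,28) ).convert( 'L' )
    # Switch black and white to match the MNIST convention
    image = PIL.ImageOps.invert(image)
    print("Image:")
    display(image)
    # Flatten to a 784-length vector to match the model's input layer
    image_array = np.array(image).reshape(784)
    prediction = model.predict(np.array([image_array])).argmax()
    print("Prediction: {}".format(prediction))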

## Saving models

So you've spent a lot of time training a model... now what? If we want to use the model in the future, do we have to train it all over again?

No. What you probably want to do is save your trained model for use elsewhere.

A potential workflow:

* Train your model on an HPC cluster
* Dump and download your model
* Use your model to predict elsewhere

Converting your in-memory data into a form that can be written to disk (and read again later) is called **serialization**. For generic use cases, Python comes with a popular package for serializing variables called **`pickle`**.
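
As a quick illustration of serialization with `pickle`, the sketch below round-trips a small dictionary through bytes. The dictionary contents are made up; for Keras models, the model's own save methods are the better choice.

```python
import pickle

# A made-up Python object standing in for "something worth keeping"
parameters = {"weights": [0.4, -0.6, 0.2], "bias": -0.1}

# Serialize: turn the in-memory object into bytes that could be written to disk
serialized = pickle.dumps(parameters)

# Deserialize: rebuild an equal object from those bytes
restored = pickle.loads(serialized)
print(type(serialized), restored)
```

Only unpickle data that you trust, since loading a pickle can run arbitrary code.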

The Keras documentation has a section on how to serialize and save your trained models, using some methods that are defined for the model objects.

https://www.tensorflow.org/guide/keras/save_and_serialize

In [None]:
# Recent Keras versions expect a `.keras` (or `.h5`) extension in the filename
model.save('my_model.keras')

In [None]:
loaded_model = keras.models.load_model("my_model.keras")

In [None]:
# Use test data in the same form the saved model was trained on
loaded_model.evaluate(X_test, Y_test, verbose=2)

In [None]:
import glob
for filename in glob.glob('data/numbers/*.png'):
    model_predict(loaded_model, filename)

In [None]:
help(model.save)

In [None]:
import tensorflow as tf
help(tf.saved_model.SaveOptions)

I will be adapting a lot of this material from:

* The SciNet workshop on neural networks:
  
  https://support.scinet.utoronto.ca/education/go.php/451/index.php/ib/1//p_course/451
  
  This course goes a lot deeper into the mathematics of neural networks.
* The Kaggle course on neural networks
  
  https://www.kaggle.com/learn/intro-to-deep-learning
  
  A nice interactive approach.


## Further exploration

* Convolutional Neural Networks
  * https://www.cs.ryerson.ca/~aharley/vis/conv/flat.html
* Transfer learning
  * Using pre-trained neural networks as an initial base for more specific training
* Free book!
  * http://neuralnetworksanddeeplearning.com/
* Kaggle courses
  * https://www.kaggle.com/learn
  * Do tutorials
  * Each tutorial has a challenge notebook to complete to get credit
  * At the end of the course you get a certificate.