# Neural Networks

Hopefully you've watched the three videos by [Grant Sanderson](https://twitter.com/3blue1brown) (a.k.a. [3blue1brown](https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw)).

* [But what is a Neural Network?](https://www.youtube.com/watch?v=aircAruvnKk) (19:13)
* [Gradient descent, how neural networks learn](https://www.youtube.com/watch?v=IHZwWFHWa-w) (21:00)
* [What is back propagation really doing?](https://www.youtube.com/watch?v=Ilg3gGewQ5U) (13:53)

---

## A very brief recap from the homework

**Neurons**:

* Hold a value
* This value is related to the values of neurons on previous layers via:
    * **weights**
    * **bias**
    * **activation function**
* Some jargon: weights and biases are called **parameters** of the model (they are estimated from data automatically). Settings you choose yourself, such as the number of layers, the activation function, or the optimizer, are called **hyperparameters**.
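
As a minimal sketch of how these pieces combine, the value of a single neuron can be computed as the activation function applied to a weighted sum of the previous layer plus a bias. The numbers below are hypothetical and chosen only for illustration; a sigmoid is assumed as the activation function.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values, chosen only for illustration
previous_layer = np.array([0.5, 0.1, 0.9])   # values held by three neurons in the previous layer
weights = np.array([0.4, -0.6, 0.2])         # one weight per incoming connection
bias = 0.1

# Weighted sum of the inputs plus the bias, passed through the activation function
z = np.dot(weights, previous_layer) + bias
neuron_value = sigmoid(z)
print(neuron_value)
```

Here `weights` and `bias` are the parameters that training would adjust, while the choice of `sigmoid` is a hyperparameter.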

**Neural network structure**:

<img src="https://upload.wikimedia.org/wikipedia/commons/1/1d/Neural_network_example.png"  style="width:200px;">

* An input layer
* One or more hidden layers (networks with several hidden layers are where the term "deep" comes from)
* An output layer

**Learning**:

* Minimizing a **loss function** (or **cost function**) through back propagation
  * Loss is often **Mean Squared Error** (**MSE**) between the labels and the predicted labels
* An **optimizer** helps find the best possible parameters
  * Data is fed to the model with the current weights and biases, and the optimizer instructs how to adjust the weights and biases, and the process is iterated.
  * This can be **gradient descent**, which is a slow process.
  * The choice of optimizer might mean the difference between a model that is trained in minutes vs days.
  * Each time the entire set of data is fed to the algorithm, it is called an **epoch**.
  * Sometimes the adjustment process can be sped up by feeding in the data in smaller **batches** (usually randomly selected) and adjusting the weights more frequently.
    * An example of this strategy is **stochastic gradient descent**.
    * A modern extension of stochastic gradient descent is the **Adam** optimizer, which is now very commonly used. The math is pretty heavy, but you can read about some of the details here: [Gentle Introduction to the Adam Optimization Algorithm for Deep Learning](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)
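
To make the learning loop concrete, here is a rough sketch of mini-batch stochastic gradient descent that fits a straight line by minimizing the mean squared error. It is a toy under stated assumptions, not how Keras implements training: the data is synthetic, and the learning rate, batch size, and number of epochs are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 3x + 2 plus a little noise
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 0.05, size=200)

w, b = 0.0, 0.0          # parameters, estimated from the data
learning_rate = 0.1      # hyperparameters, chosen by us
batch_size = 20
epochs = 100

for epoch in range(epochs):
    order = rng.permutation(len(x))              # shuffle once per epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        error = (w * x[idx] + b) - y[idx]
        # Gradients of the mean squared error with respect to w and b
        grad_w = 2.0 * np.mean(error * x[idx])
        grad_b = 2.0 * np.mean(error)
        # Step against the gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

mse = np.mean((w * x + b - y) ** 2)
print(w, b, mse)
```

The fitted `w` and `b` should end up close to the 3 and 2 used to generate the data. A neural network follows the same loop, except that back propagation is needed to compute the gradients for every weight and bias.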


<img src="assets/silly-tshirt.png"  style="width:400px;">

Now that we have some concepts defined, let's play around with a neural network before touching any code:

https://playground.tensorflow.org/

## Download Data and Solutions

In [None]:
# Download data and solutions

import urllib.request
import os

def download_data(path, branch='main'):
    base_url = 'https://raw.githubusercontent.com/ualberta-rcg/python-machine-learning'
    if os.path.exists(path):
        return
    if not os.path.exists('data'):
        os.mkdir('data')
    if not os.path.exists('data/titanic'):
        os.mkdir('data/titanic')
    if not os.path.exists('data/numbers'):
        os.mkdir('data/numbers')
    url = '{}/{}/notebooks/{}'.format(base_url, branch, path)
    output_file = path
    urllib.request.urlretrieve(url, output_file)
    print("Downloaded " + path)
    
download_data('data/titanic/train.csv')
download_data('data/numbers/cwant_1.png')
download_data('data/numbers/cwant_3.png')
download_data('data/numbers/cwant_5.png')
download_data('data/numbers/cwant_8.png')
download_data('data/numbers/cwant_thick_1.png')
download_data('data/numbers/cwant_thick_3.png')
download_data('data/numbers/cwant_thick_4.png')
download_data('data/numbers/cwant_thick_5.png')
download_data('data/numbers/cwant_thick_6.png')
download_data('data/numbers/cwant_thick_9.png')

In [None]:
# !pip install keras
# !pip install tensorflow

Like other packages we have seen, Keras has a submodule of sample datasets. The **MNIST** dataset of handwritten numbers is included, which we can load as both training and test data.

In [None]:
import keras
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

We can see how many samples are in the **training** features data, and the shape of each sample ...

In [None]:
x_train.shape

Same for the **test** data...

In [None]:
x_test.shape

We can look at an individual sample in the training data ...

In [None]:
x_train[31] # 32nd record

But it probably makes more sense to convert this data into an image and render it. The `PIL` module makes this easy.

In [None]:
import PIL
PIL.Image.fromarray(x_train[31])

We can then check the label to see that the image corresponds to the number we think it is ...

In [None]:
y_train[31]

We will now transform the feature data to convert each 28 * 28 image to a 784 entry array through the `reshape` method from `numpy.ndarray`.

In [None]:
X_train = x_train.reshape(60000, 784)
X_test = x_test.reshape(10000, 784)

X_train.shape

In [None]:
X_train[0]

In [None]:
# Array of 28x28 inputs
print(x_train[128][14][13])

# Array of 784 inputs
# basically each of the 28 rows is shoved at the end of the previous
print(X_train[128][14*28+13])

And we can convert the numbers in the label data to categorical data (basically one-hot encoding)

In [None]:
try:
    import keras.utils as ku
    # API change ... is the function in here?
    type(ku.to_categorical)
except (ImportError, AttributeError):
    import keras.utils.np_utils as ku

Y_train = ku.to_categorical(y_train, 10)
Y_test = ku.to_categorical(y_test, 10)

The original y values:

In [None]:
y_train[26]

The new ones look like:

In [None]:
Y_train[26]

Getting the previous value is essentially running `argmax` (the index of the largest value)

In [None]:
Y_train[26].argmax()
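
For intuition, one-hot encoding can be sketched directly in NumPy: row `i` of a 10 x 10 identity matrix is the encoding of digit `i`. This is only an illustration with made-up labels, not necessarily how `to_categorical` is implemented.

```python
import numpy as np

labels = np.array([3, 0, 9])           # example digit labels
one_hot = np.eye(10)[labels]           # row i of the identity matrix encodes digit i

print(one_hot[0])                      # a 1 at index 3, zeros elsewhere
print(one_hot.argmax(axis=1))          # recovers the original labels
```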

## Sequential model

Sequential groups a linear stack of layers. The code below:

* Specifies the input layer as having 784 items
* Has an intermediate layer with 128 nodes
* Has an output layer of 10 nodes

Each layer has a `sigmoid` activation function.

In [None]:
import keras.models as km
import keras.layers as kl

model = km.Sequential()
model.add(kl.Dense(128, input_dim = 784, activation = 'sigmoid', name = 'hidden'))
model.add(kl.Dense(10, name = 'output', activation = 'sigmoid'))

In [None]:
model.summary()
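
The parameter counts in the summary can be checked by hand: a fully connected (`Dense`) layer has one weight per connection plus one bias per node. The helper below is a hypothetical function written for this check, not part of Keras.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases for a stack of fully connected layers.

    layer_sizes lists the number of nodes per layer, input first.
    """
    total = 0
    for n_inputs, n_units in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = n_inputs * n_units   # one weight per connection
        biases = n_units               # one bias per node
        total += weights + biases
    return total

print(dense_parameter_count([784, 128, 10]))
```

For this network that is 784 * 128 + 128 = 100,480 parameters in the hidden layer and 128 * 10 + 10 = 1,290 in the output layer, for 101,770 in total.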

## Compiling the model

Compiling prepares the model for training.

The optimizer chosen here is `sgd` (Stochastic Gradient Descent).

The loss/cost function we will use is `mean_squared_error`.

The accuracy is reported during training for each epoch.

In [None]:
model.compile(optimizer='sgd',
              metrics=['accuracy'],
              loss="mean_squared_error")

## Training

Gradient Descent is a slow process, so one speed up is to send the data to the algorithms in random batches until all of the data is read (Stochastic Gradient Descent). Each time all of the data is fed into the model for training, it's called an **epoch**.

An epoch can be split into **minibatches** (or just **batches**), and the model's parameters are updated after each one.

So the number of epochs you train is how many times the model will see each training sample.
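
As a quick piece of arithmetic, the number of parameter updates follows from the number of samples, the batch size, and the number of epochs. The sketch below uses the 60,000 MNIST training samples with a batch size of 100 and 25 epochs.

```python
import math

n_samples = 60000    # MNIST training samples
batch_size = 100
epochs = 25

# The last batch of an epoch may be smaller, hence the rounding up
updates_per_epoch = math.ceil(n_samples / batch_size)
total_updates = updates_per_epoch * epochs
print(updates_per_epoch, total_updates)
```

That works out to 600 updates per epoch and 15,000 updates overall. A smaller batch size means more updates per epoch.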

In [None]:
%%time
history = model.fit(X_train,
                    Y_train,
                    epochs=25,
                    batch_size=100,
                    verbose=1)

We can now check out the accuracy of our model on our unseen test data

In [None]:
score = model.evaluate(X_test, Y_test)

What's up with that `history` variable that's output from training? It provides some information about the loss and accuracy for each epoch.

We can use this to plot the loss and accuracy over the epochs for this training session.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'accuracy']].plot();
plt.xlabel('Epoch')
plt.ylabel('Accuracy/Loss')

**Now run the training and evaluation cells again.** (Training continues where it left off, so each rerun trains the same model further.)

## Exercise: the ultimate test

Now the ultimate test: can this model correctly detect **your** hand-drawn numbers?

You might want to try drawing your own number here:

https://drawisland.com/?w=200&h=200

Rules:
* Draw a digit with a black pen on a white background (default)
* Perhaps bump up the pen size
* Click the **Download** button to save a `png` file to your computer (hint: put the digit you drew as part of the filename).
* Put the image (or upload to Colab) in the subdirectory `data/numbers` of your current workbook directory. There should be some `png` files of numbers I drew already in there.

To figure out the current notebook directory, uncomment one of the lines with the exclamation mark:

In [None]:
# Linux/Mac/Colab
# !pwd

# Windows
# !dir

We can write a function that loads/displays/transforms/predicts an image file:

In [None]:
import PIL.Image
import PIL.ImageOps
import numpy as np

def image_predict(model, filename):
    # Load and resize to 28x28
    image = PIL.Image.open(filename).convert('L').resize((28,28))
    # Switch black and white
    image = PIL.ImageOps.invert(image)
    # Display
    print("Filename: {}".format(filename))
    print("Image:")
    display(image)
    # Convert to numpy array and reshape as 784 length vector
    image_array = np.array(image)[:,:].reshape(784)
    # Predict!
    prediction = model.predict(np.array([image_array])).argmax()
    print("Prediction: {}\n".format(prediction))

We can now test it out on your file (replace `cwant_8.png` with your filename):

In [None]:
# ('data/numbers/*.png'):
image_predict(model, 'data/numbers/cwant_8.png')

Did the model predict the correct number?

We use the `glob` module to predict all of the numbers in the `data/numbers` directory:

In [None]:
import glob
for filename in glob.glob('data/numbers/*.png'):
    image_predict(model, filename)

How did the model do?

## How about adding another layer?

After adding another layer, we'll have more than three layers (including input and output), so our network is considered to be **deep** (and we are doing **deep learning**). In general, the deeper the network, the more complex learning it can do (at the cost of having to optimize many more parameters, which takes longer).

We will also use a feature that allows the model to be evaluated on the test data after each epoch (the `validation_data` argument of the `fit` method).

In [None]:
model = km.Sequential()
model.add(kl.Dense(128, input_dim = 784, activation = 'sigmoid',
                   name = 'hidden'))
model.add(kl.Dense(128, activation = 'sigmoid', name = 'hidden2'))
model.add(kl.Dense(10, name = 'output', activation = 'sigmoid'))

model.compile(optimizer='sgd',
              metrics=['accuracy'],
              loss="mean_squared_error")

model.summary()

In [None]:
%%time

# We will take the default batch size (32)
history = model.fit(X_train,
                    Y_train,
                    epochs=25,
                    validation_data=(X_test, Y_test))
# Note, `verbose=1` is the default, so omitted

And try the test on our hand-drawn characters again:

In [None]:
import glob
for filename in glob.glob('data/numbers/*.png'):
    image_predict(model, filename)

## GPUs ...

Record the time to train the model above.

We might consider running on a GPU if your computer has one (and if tensorflow and the libraries on your computer are set up to use one). We can check ...

In [None]:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

If you are running in Colab, you might want to change your runtime type and choose 'GPU'.
(After doing that, run the check above again).

If you did change the runtime type, your notebook will be running on a different computer.

The next cell will get you caught up ... (but also run the [download data cell](#Download-Data-and-Solutions) at the top of the notebook).

In [None]:
import keras
from keras.datasets import mnist
try:
    import keras.utils as ku
    type(ku.to_categorical)
except (ImportError, AttributeError):
    import keras.utils.np_utils as ku
import PIL.Image
import PIL.ImageOps
import numpy as np
import keras.models as km
import keras.layers as kl
import pandas as pd
import matplotlib.pyplot as plt

def image_predict(model, filename):
    # Load and resize to 28x28
    image = PIL.Image.open(filename).resize( (28,28) ).convert( 'L' )
    # Switch black and white
    image = PIL.ImageOps.invert(image)
    # Display
    print("Filename: {}".format(filename))
    print("Image:")
    display(image)
    # Convert to numpy array and reshape as 784 length vector
    image_array = np.array(image)[:,:].reshape(784)
    # Predict!
    prediction = model.predict(np.array([image_array])).argmax()
    print("Prediction: {}\n".format(prediction))

(x_train, y_train), (x_test, y_test) = mnist.load_data()

X_train = x_train.reshape(60000, 784)
X_test = x_test.reshape(10000, 784)
Y_train = ku.to_categorical(y_train, 10)
Y_test = ku.to_categorical(y_test, 10)

Now run the model again and watch the timing ...

In [None]:
%%time

model = km.Sequential()
model.add(kl.Dense(128, input_dim = 784, activation = 'sigmoid', name = 'hidden'))
model.add(kl.Dense(128, activation = 'sigmoid', name = 'hidden2'))
model.add(kl.Dense(10, name = 'output', activation = 'sigmoid'))

model.compile(optimizer='sgd',
              metrics=['accuracy'],
              loss="mean_squared_error")

# We will take the default batch size (32)
history = model.fit(X_train,
                    Y_train,
                    epochs=25,
                    validation_data=(X_test, Y_test))

score = model.evaluate(X_test, Y_test)

Did you get a performance boost?

## How about just a wider layer?

Another way to have a network learn more complex patterns is with wider layers. Notice the reduced number of epochs, and how quickly the model is trained.

In [None]:
model = km.Sequential()
model.add(kl.Dense(1024, input_dim = 784, activation = 'sigmoid', name = 'hidden'))
model.add(kl.Dense(10, name = 'output', activation = 'sigmoid'))

model.compile(optimizer='sgd',
              metrics=['accuracy'],
              loss="mean_squared_error")

model.summary()

In [None]:
%%time

history = model.fit(X_train,
                    Y_train,
                    epochs=7,
                    validation_data=(X_test, Y_test))

In [None]:
import glob
for filename in glob.glob('data/numbers/*.png'):
    image_predict(model, filename)

## How about a different activation function?

![Popular activation functions, source: https://www.researchgate.net/publication/335845675_Reconstruction_of_porous_media_from_extremely_limited_information_using_conditional_generative_adversarial_networks](assets/common-activation.png)

Our current activation is a sigmoid.

There is another very popular activation function called "The Rectified Linear Unit" (ReLU) that is used in machine learning.

* ReLU has the advantage that it makes the math easier
* Sigmoid sometimes has a problem where the gradient can vanish (so gradient descent doesn't really step anywhere). ReLU has a constant gradient wherever the unit is active (positive input).
* ReLU has its own problems: it can "blow up" (see that it's not bounded above).
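
The points above can be illustrated with a small NumPy sketch of both activation functions and their gradients. The input values are arbitrary, and the ReLU gradient at exactly zero is taken to be 0 here, which is a common convention rather than the only one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_gradient(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, z)

def relu_gradient(z):
    # 1 where the unit is active (z > 0), 0 elsewhere
    return (z > 0).astype(float)

z = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid_gradient(z))   # close to zero at both extremes
print(relu_gradient(z))      # 0 for negative inputs, 1 for positive inputs
print(relu(z))               # grows without bound as z grows
```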

Let's set up our model again to use ReLU for one of the hidden layers ...

In [None]:
%%time

model = km.Sequential()
model.add(kl.Dense(128, input_dim = 784, activation = 'sigmoid', name = 'hidden'))
model.add(kl.Dense(128, activation = 'relu', name = 'hidden2'))
model.add(kl.Dense(10, name = 'output', activation = 'sigmoid'))

model.compile(optimizer='sgd',
              metrics=['accuracy'],
              loss="mean_squared_error")

model.summary()

history = model.fit(X_train,
                    Y_train,
                    epochs=25,
                    validation_data=(X_test, Y_test))

Again, we can plot our loss/accuracy, this time with the validation data too...

In [None]:
history_df = pd.DataFrame(history.history)
history_df.loc[:, history_df.columns].plot();
print("Minimum validation loss: {}".format(history_df['val_loss'].min()))

## Different optimizer (adam)

Adam (Adaptive Moment Estimation) is currently one of the most widely used optimizers.

A full description of how it works is beyond the scope of this course, but you can check out some comparisons of optimizers here.

https://medium.com/swlh/strengths-and-weaknesses-of-optimization-algorithms-used-for-machine-learning-58926b1d69dd

In [None]:
model = km.Sequential()
model.add(kl.Dense(128, input_dim = 784, activation = 'sigmoid', name = 'hidden'))
model.add(kl.Dense(128, activation = 'relu', name = 'hidden2'))
model.add(kl.Dense(10, name = 'output', activation = 'sigmoid'))

model.compile(optimizer='adam',
              metrics=['accuracy'],
              loss="mean_squared_error")

model.summary()

In [None]:
%%time
history = model.fit(X_train,
                    Y_train,
                    epochs=25,
                    validation_data=(X_test, Y_test))

Do you notice anything different with this loss/accuracy plot?

In [None]:
history_df = pd.DataFrame(history.history)
history_df.loc[:, history_df.columns].plot();
print("Minimum validation loss: {}".format(history_df['val_loss'].min()))


... and another look at predictions based on our self-generated data.

In [None]:
for filename in glob.glob('data/numbers/*.png'):
    image_predict(model, filename)

## A third ReLU layer ...

Why not? (Well, because it will take longer to train)

In [None]:
%%time

model = km.Sequential()
model.add(kl.Dense(128, input_dim = 784, activation = 'sigmoid', name = 'hidden'))
model.add(kl.Dense(128, activation = 'relu', name = 'hidden2'))
model.add(kl.Dense(128, activation = 'relu', name = 'hidden3'))
model.add(kl.Dense(10, name = 'output', activation = 'sigmoid'))

model.compile(optimizer='adam',
              metrics=['accuracy'],
              loss="mean_squared_error")

model.summary()

history = model.fit(X_train,
                    Y_train,
                    epochs=25,
                    validation_data=(X_test, Y_test))

score = model.evaluate(X_test, Y_test)

In [None]:
for filename in glob.glob('data/numbers/*.png'):
    image_predict(model, filename)

## Stopping early ...

That last training example seemed to converge pretty quickly, so the additional benefits of the extra epochs may not have been worth it. We can configure our training to have a `callback` that checks whether a certain condition has occurred, and stops early if directed to do so.

In this case, we exit early if we haven't had sufficient change in the validation loss in a specified number of epochs.
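
The stopping rule can be sketched in plain Python. This is a simplified illustration of the idea behind `min_delta` and `patience`, not the Keras implementation, and the validation losses are made up.

```python
def epochs_run_with_early_stopping(val_losses, min_delta=0.001, patience=5):
    """Return how many epochs run before training stops (simplified logic)."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improved by more than min_delta: remember it and reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)

# Made-up validation losses: improvement stalls after the third epoch
losses = [0.090, 0.050, 0.030, 0.0295, 0.0299, 0.0301, 0.0296, 0.0300, 0.0298, 0.0297]
print(epochs_run_with_early_stopping(losses))
```

With these losses, training stops after epoch 8: the best loss was reached at epoch 3, and the next five epochs fail to beat it by at least `min_delta`.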

In [None]:
%%time

from keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    min_delta=0.001, # minimum amount of change to count as an improvement
    patience=5, # how many epochs to wait before stopping
    restore_best_weights=True,
)

model = km.Sequential()
model.add(kl.Dense(128, input_dim = 784, activation = 'sigmoid', name = 'hidden'))
model.add(kl.Dense(128, activation = 'relu', name = 'hidden2'))
model.add(kl.Dense(128, activation = 'relu', name = 'hidden3'))
model.add(kl.Dense(10, name = 'output', activation = 'sigmoid'))

model.compile(optimizer='adam',
              metrics=['accuracy'],
              loss="mean_squared_error")

model.summary()

history = model.fit(X_train,
                    Y_train,
                    epochs=25,
                    callbacks=[early_stopping],
                    validation_data=(X_test, Y_test))

In [None]:
score = model.evaluate(X_test, Y_test)

## Saving models

So you've spent a lot of time training a model... now what? If you want to use the model in the future, do you have to retrain it?

No. What you probably want to do is save your trained model for use elsewhere.

A potential workflow:

* Train your model on an HPC cluster
* Dump and download your model
* Use your model to predict elsewhere

Converting your in-memory data into a form that can be written to disk (and read again later) is called **serialization**. For generic use cases, Python comes with a popular package for serializing variables called **`pickle`**.
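
As a small illustration of serialization with `pickle`, the sketch below converts a Python object to bytes and back again. The dictionary is a hypothetical stand-in for any in-memory data.

```python
import pickle

# A stand-in for any in-memory Python object (hypothetical values)
results = {"model_name": "demo", "epochs": 25, "layer_sizes": [784, 128, 10]}

serialized = pickle.dumps(results)      # object -> bytes
restored = pickle.loads(serialized)     # bytes -> object

print(type(serialized))
print(restored == results)
```

To write to a file instead, `pickle.dump` and `pickle.load` take an open binary file object. Only unpickle data from sources you trust, because loading a pickle can execute arbitrary code.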

The Keras documentation has a section on how to serialize and save your trained models, using some methods that are defined for the model objects.

https://www.tensorflow.org/guide/keras/save_and_serialize

In [None]:
# Recent Keras versions require a file extension such as .keras
model.save('my_model.keras')

In [None]:
loaded_model = keras.models.load_model("my_model.keras")

In [None]:
loaded_model.evaluate(X_test,  Y_test, verbose=2)

In [None]:
import glob
for filename in glob.glob('data/numbers/*.png'):
    image_predict(loaded_model, filename)

## Titanic revisited

Let's look at how well neural networks do on our original Titanic classification problem.

Our pipeline starts out identical to what we've already seen ...

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Cherry picked seed!
np.random.seed(1337)
# This one does some strange stuff
#np.random.seed(1)

# Load data
train_df = pd.read_csv('data/titanic/train.csv')

# Choose features and labels
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_df[features], drop_first=True)
# Note: some versions of tensorflow might instead need:
# X = pd.get_dummies(train_df[features], drop_first=True).values.astype(np.float32)


y = train_df['Survived']

# Split data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

Here is a basic network that performs about as well as the previous Decision Tree/Random Forest models:

In [None]:
model = km.Sequential()
model.add(kl.Dense(128, input_dim = 4, activation = 'sigmoid',
                   name = 'hidden'))
model.add(kl.Dense(1, name = 'output', activation = 'sigmoid'))

model.compile(optimizer='adam',
              metrics=['accuracy'],
              loss="mean_squared_error")

model.summary()

We will use early stopping again...

In [None]:
%%time

from keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    min_delta=0.001, # minimum amount of change to count as an improvement
    patience=20, # how many epochs to wait before stopping
    restore_best_weights=True,
)

history = model.fit(X_train,
                    y_train,
                    epochs=200,
                    callbacks=[early_stopping],
                    validation_data=(X_test, y_test))

**Do you notice a difference between the accuracy and the validation/test accuracy? What do you suppose this means?**

We can now use the model to make predictions on the test data (and look at the first few values) ...

In [None]:
# Use model to predict on unseen test data
# Some tricks needed to predict as integers
predictions = model.predict(X_test)
predictions[:10]

Notice that we have solved a regression problem, not a classification problem!

We can convert the predicted values for the `Survived` column by selecting a cut off...

In [None]:
# We are using 0.5 for a cutoff, but we may want to
# use some other value to prevent false positives/negatives
def cut_off(x):
    if x < 0.5: return 0
    return 1

predictions = model.predict(X_test)
predictions = [cut_off(x) for x in predictions]
predictions

# Evaluate how well the model did
print('Accuracy: {}'.format(accuracy_score(y_test, predictions)))
print('Precision: {}'.format(precision_score(y_test, predictions)))
print('Recall: {}'.format(recall_score(y_test, predictions)))

# I need this or my head will explode ...
confusion_matrix(y_test, predictions)
# For labels 0/1, scikit-learn lays the matrix out as:
# [[TN FP]
#  [FN TP]]

**Change the cut off above. How do the precision and recall values change?**
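
To see the trade-off without retraining anything, here is a self-contained sketch that computes precision and recall at several cut-offs. The labels and scores are made up for illustration, and `precision_recall` is a hypothetical helper rather than a scikit-learn function.

```python
def precision_recall(y_true, scores, cutoff):
    """Precision and recall when scores at or above the cutoff are labelled 1."""
    predicted = [1 if s >= cutoff else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels and model outputs, for illustration only
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.35, 0.2, 0.1]

for cutoff in (0.3, 0.5, 0.7):
    print(cutoff, precision_recall(y_true, scores, cutoff))
```

On this toy data, raising the cut-off from 0.3 to 0.7 raises precision (fewer false positives) while lowering recall (more false negatives).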

## Overfitting

When there is a gap between the training and testing accuracy/loss, this is evidence that overfitting is going on (the model performs better on the training data than it does on the test data).

In [None]:
history_df = pd.DataFrame(history.history)
history_df.loc[:, history_df.columns].plot();
print("Minimum validation loss: {}".format(history_df['val_loss'].min()))

Two common ways to handle overfitting are through `regularization` and `dropping out`.

## Regularization

Regularization is a method we can use to tackle overfitting.

To quote the SciNet neural networks workshop:

"Regularization is an ad hoc technique by which parameters in a model are penalized to prevent
individual parameters from becoming excessively important to the fit."

This technique adds a penalty term to the cost function used in training, so that large parameter values increase the cost. The extent to which large parameters are penalized is controlled by a parameter lambda ($\lambda$). (Note that we can't name the variable `lambda` below, because `lambda` is a reserved keyword in Python, so we call it `lam`.)
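
As a rough sketch of the idea, an L2 penalty adds `lam` times the sum of the squared weights to the loss computed from the data. The weights and the loss value below are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def l2_penalty(weights, lam):
    """Sum of squared weights, scaled by lam."""
    return lam * np.sum(np.square(weights))

# Hypothetical weights and loss value, for illustration only
weights = np.array([[0.5, -2.0], [1.0, 3.0]])
data_loss = 0.20      # e.g. the mean squared error on a batch
lam = 0.001

total_loss = data_loss + l2_penalty(weights, lam)
print(total_loss)
```

The largest weight (3.0) contributes most of the penalty, so the optimizer is nudged toward shrinking it unless it clearly helps the fit.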

In [None]:
import keras.regularizers as kr

lam = 0.001

model = km.Sequential()
model.add(kl.Dense(128, input_dim = 4, activation = 'sigmoid',
                   name = 'hidden',
                   kernel_regularizer = kr.l2(lam)))
model.add(kl.Dense(1, name = 'output', activation = 'sigmoid'))

model.compile(optimizer='adam',
              metrics=['accuracy'],
              loss="mean_squared_error")

In [None]:
%%time

early_stopping = EarlyStopping(
    min_delta=0.001, # minimum amount of change to count as an improvement
    patience=20, # how many epochs to wait before stopping
    restore_best_weights=True,
)

history = model.fit(X_train,
                    y_train,
                    epochs=200,
                    callbacks=[early_stopping],
                    validation_data=(X_test, y_test))

## Dropping out ...

Again, to quote the SciNet neural networks workshop:

"The principle is simple: randomly ”drop out” neurons from the network during each batch
of the stochastic gradient descent. Like regularization, this results in the network not putting too much importance on any given weight, since the weights keep randomly disappearing from the network.
It can be thought of as averaging over several different-but-similar neural networks."
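
A minimal NumPy sketch of the mechanism follows. It assumes the common "inverted dropout" formulation, in which the surviving activations are scaled up by `1 / (1 - rate)` during training; the `dropout` function here is a hypothetical helper, and the layer of ones is a stand-in for real hidden-layer outputs.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero a random fraction `rate` of the activations and rescale the rest."""
    keep_mask = rng.random(activations.shape) >= rate
    # Rescaling keeps the expected sum of the activations unchanged
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
hidden = np.ones(10000)            # stand-in for the outputs of a hidden layer
dropped = dropout(hidden, rate=0.3, rng=rng)

print((dropped == 0).mean())       # fraction dropped, close to 0.3
print(dropped.mean())              # close to 1.0 thanks to the rescaling
```

Dropout is only applied during training. When the model predicts, all neurons are used.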

In [None]:
model = km.Sequential()
model.add(kl.Dense(128, input_dim = 4, activation = 'sigmoid',
                   name = 'hidden'))
# apply 30% dropout to the next layer
model.add(kl.Dropout(0.3))
model.add(kl.Dense(1, name = 'output', activation = 'sigmoid'))

model.compile(optimizer='adam',
              metrics=['accuracy'],
              loss="mean_squared_error")

In [None]:
%%time

from keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    min_delta=0.001, # minimum amount of change to count as an improvement
    patience=20, # how many epochs to wait before stopping
    restore_best_weights=True,
)

history = model.fit(X_train,
                    y_train,
                    epochs=200,
                    callbacks=[early_stopping],
                    validation_data=(X_test, y_test))

## Some references

* The **SciNet workshop** on neural networks:

  https://support.scinet.utoronto.ca/education/go.php/451/index.php/ib/1//p_course/451
  
  This course goes a lot deeper into the mathematics of neural networks.

* The **Kaggle course** on neural networks

  https://www.kaggle.com/learn/intro-to-deep-learning
  
  A nice interactive approach.


## Further exploration

* Convolutional Neural Networks
  * https://adamharley.com/nn_vis/cnn/2d.html
* Transfer learning
  * Using pre-trained neural networks as an initial base for more specific training
* Free book!
  * http://neuralnetworksanddeeplearning.com/
* Kaggle courses
  * https://www.kaggle.com/learn
  * Do tutorials
  * Each tutorial has a challenge notebook to complete to get credit
  * At the end of the course you get a certificate.