## **Kaggle's Deep Learning Course Notes**
In this notebook I'm going to summarize the course topics that I found most important, so I can refer to it later, and also for anyone interested in reading about Deep Learning.

I will use the popular Red Wine dataset to exemplify the concepts covered in the course. The dataset can be seen as either a classification or a regression problem where we want to determine the quality of wines based on their physicochemical properties.

## **Setup**

In [None]:
# libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow as tf
import matplotlib.pyplot as plt
from learntools.deep_learning_intro.dltools import animate_sgd
from tensorflow.keras import callbacks

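# plotting ('seaborn-whitegrid' was renamed 'seaborn-v0_8-whitegrid' in matplotlib 3.6)
style = 'seaborn-v0_8-whitegrid'
plt.style.use(style if style in plt.style.available else 'seaborn-whitegrid')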

# matplotlib defaults
plt.rc('figure', autolayout=True)
plt.rc('axes', labelweight='bold', labelsize='large',
       titleweight='bold', titlesize=18, titlepad=10)
plt.rc('animation', html='html5')

## **Dataset**

In [None]:
red_wine = pd.read_csv('../input/dl-course-data/red-wine.csv')
red_wine.head()

In [None]:
red_wine.info()

## **Data Manipulation**

In [None]:
X = red_wine.copy()
y = X.pop('quality')

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42)

# neural networks tend to perform best when their inputs are on a common scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
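
To make the scaling step concrete, here is a minimal NumPy sketch of what standardization does. The toy arrays are made up for illustration; the point is that the mean and standard deviation are learned from the training split only and then reused on the validation split.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales.
train = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])
valid = np.array([[2.0, 400.0]])

# "fit": learn per-feature statistics from the training split only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# "transform": apply the training statistics to both splits.
train_scaled = (train - mean) / std
valid_scaled = (valid - mean) / std

print(train_scaled.mean(axis=0))  # approximately zero for every feature
print(train_scaled.std(axis=0))   # approximately one for every feature
print(valid_scaled)               # validation data is not re-centered on itself
```

Reusing the training statistics on the validation data keeps information about the validation set from leaking into preprocessing.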

## **Model Creation**

The easiest way to create a model in Keras is through *keras.Sequential*, which creates a neural network as a stack of layers.

In [None]:
# number of features
input_shape = [11]

# setup model
model = keras.Sequential([
    # the hidden ReLU layers (hidden because we never see their outputs)
    layers.Dense(units=512, activation='relu', input_shape=input_shape),
    layers.Dense(units=512, activation='relu'),
    # the linear output layer and the number of units (neurons), in this case we have just one output, the quality of the wine
    layers.Dense(units=1)
])

Internally, Keras represents the weights of a neural network with **tensors**. Tensors are basically TensorFlow's version of a NumPy array with a few differences that make them better suited to deep learning. One of the most important is that tensors are compatible with [GPU](https://www.kaggle.com/docs/efficient-gpu-usage) and [TPU](https://www.kaggle.com/docs/tpu) accelerators. TPUs, in fact, are designed specifically for tensor computations.

The usual way of attaching an activation function to a `Dense` layer is to include it as part of the definition with the `activation` argument. Sometimes though you'll want to put some other layer between the `Dense` layer and its activation function. In this case, we can define the activation in its own `Activation` layer, like so:

```
layers.Dense(units=512),
layers.Activation('relu')
```

This is completely equivalent to the ordinary way: `layers.Dense(units=512, activation='relu')`.

There is a whole family of variants of the `'relu'` activation -- `'elu'`, `'selu'`, and `'swish'`, among others. Sometimes one activation will perform better than another on a given task, so you could consider experimenting with activations as you develop a model. The ReLU activation tends to do well on most problems, so it's a good one to start with. Check out the [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/activations) for other activation functions.
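
As a rough illustration of what a dense layer with a ReLU activation computes, here is a small NumPy sketch. The helper names `relu` and `dense` and the weight values are made up for this example; Keras handles all of this internally.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: keeps positive values, zeroes out negatives."""
    return np.maximum(0.0, z)

def dense(x, w, b, activation=None):
    """Forward pass of a fully connected layer: activation(x @ w + b)."""
    z = x @ w + b
    return z if activation is None else activation(z)

# Hypothetical weights for a layer with 2 inputs and 3 units.
x = np.array([[1.0, -2.0]])
w = np.array([[0.5, -1.0, 2.0],
              [1.0,  0.5, 0.5]])
b = np.array([0.0, 0.0, -0.5])

fused = dense(x, w, b, activation=relu)  # like Dense(3, activation='relu')
split = relu(dense(x, w, b))             # like Dense(3) followed by Activation('relu')

print(fused)  # [[0.  0.  0.5]]
print(split)  # same values
```

Both forms give the same output, which is why attaching the activation inside the layer or as a separate layer is a matter of convenience.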

A model's weights are kept in its `weights` attribute as a list of tensors.

In [None]:
model.weights

Keras represents weights as tensors, but also uses tensors to represent data. When you set the `input_shape` argument, you are telling Keras the dimensions of the array it should expect for each example in the training data. Setting `input_shape=[3]` would create a network accepting vectors of length 3, like `[0.2, 0.4, 0.6]`.
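
The input shape also determines how many weights the first layer needs. As a quick sanity check, this small helper (a hypothetical function written for these notes, not part of Keras) counts the weights and biases in a stack of dense layers like the one defined above.

```python
def dense_param_count(n_inputs, layer_units):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    for units in layer_units:
        # one weight per input per unit, plus one bias per unit
        total += n_inputs * units + units
        # this layer's outputs become the next layer's inputs
        n_inputs = units
    return total

# 11 input features -> 512 units -> 512 units -> 1 output
print(dense_param_count(11, [512, 512, 1]))  # 269313
```

The number should match the total reported by `model.summary()` for the same architecture.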

## **Untrained Model's Random Weights**

Regression problems are like "curve-fitting" problems: we're trying to find a curve that best fits the data. Let's take a look at the "curve" produced by an untrained linear model.
 
Before training, a model's weights are set randomly. Run the cell below a few times to see the different lines produced with a random initialization.

In [None]:
example_model = keras.Sequential([
    layers.Dense(1, input_shape=[1])
])

x_var = tf.linspace(-1.0, 1.0, 100)
y_var = example_model.predict(x_var)

plt.figure(dpi=100)
plt.plot(x_var, y_var, 'k')
plt.xlim(-1, 1)
plt.ylim(-1, 1)
plt.xlabel("Input x")
plt.ylabel("Target y")
w, b = example_model.weights
plt.title("Weight: {:0.2f}\nBias: {:0.2f}".format(w[0][0], b[0]))
plt.show()

## **Compile Method**

To define the loss and optimizer we'll use the model's `compile` method.

Adam is an SGD algorithm that has an adaptive learning rate that makes it suitable for most problems without any parameter tuning (it is "self tuning", in a sense). Adam is a great general-purpose optimizer.

In [None]:
model.compile(
    optimizer="adam",
    loss="mae")
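
The `"mae"` loss stands for mean absolute error. A minimal sketch of the computation in plain Python, with made-up quality scores and predictions:

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical wine quality scores and model predictions.
quality_true = [5, 6, 7, 5]
quality_pred = [5.5, 6.0, 6.0, 4.5]

mae = mean_absolute_error(quality_true, quality_pred)
print(mae)  # 0.5
```

An MAE of 0.5 means the predictions are off by half a quality point on average.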

## **Training the Model**

Once you've defined the model and compiled it with a loss and optimizer you're ready for training.

In [None]:
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=32, # works best with powers of 2
    epochs=50,
    verbose=0 # suppress output since we'll plot the curves
)

Each iteration's sample of training data is called a minibatch (or often just "batch"), while a complete round of the training data is called an epoch. The number of epochs you train for is how many times the network will see each training example.
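
To see how batches and epochs relate, this sketch counts the weight updates per epoch. The example counts are hypothetical, and the helper `steps_per_epoch` is written just for these notes; it assumes the last, partial batch is kept rather than dropped.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)

print(steps_per_epoch(1600, 32))  # 50 full batches
print(steps_per_epoch(1000, 32))  # 32 batches, the last one holding only 8 examples

# Total number of weight updates over a whole training run of 50 epochs.
total_updates = steps_per_epoch(1600, 32) * 50
print(total_updates)  # 2500
```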

The next step is to look at the loss curves and evaluate the training. The `fit` method keeps a record of the loss produced during training in a `History` object. When we train a model we can plot the loss on the training set epoch by epoch. We can also plot the loss on the validation set. These plots are called the learning curves. To train deep learning models effectively, we need to be able to interpret them.

In [None]:
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot()
print("Minimum Validation Loss: {:0.4f}".format(history_df['val_loss'].min()));

With the learning rate and the batch size, you have some control over:
- How long it takes to train a model
- How noisy the learning curves are
- How small the loss becomes

To get a better understanding of these two parameters, we'll look at the linear model, our simplest neural network. Having only a single weight and a bias, it's easier to see what effect a change of parameter has.

In [None]:
# experiment with different values for the learning rate, batch size, and number of examples
animate_sgd(
    learning_rate=0.1,
    batch_size=32,
    num_examples=1600,
    # can also change these
    steps=30, # total training steps (batches seen)
    true_w=3.0, # the slope of the data
    true_b=2.0, # the bias of the data
)

Smaller batch sizes give noisier weight updates and loss curves, because each batch is a small sample of the data and smaller samples tend to give noisier estimates. That noise tends to average out over many updates, though, and it can even be beneficial.

Smaller learning rates make the updates smaller and the training takes longer to converge. Large learning rates can speed up training, but don't "settle in" to a minimum as well. When the learning rate is too large, the training can fail completely.
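
The effect of the learning rate can be reproduced without Keras. The sketch below is plain gradient descent on a toy one-parameter loss, `(w - 3)**2`, so it is a simplification rather than minibatch SGD, and the three learning rates are arbitrary choices for illustration.

```python
def gradient_descent(learning_rate, steps=30, w_start=0.0, w_target=3.0):
    """Minimize the toy loss (w - w_target)**2 with plain gradient descent."""
    w = w_start
    for _ in range(steps):
        grad = 2.0 * (w - w_target)  # derivative of the loss with respect to w
        w -= learning_rate * grad
    return w

for lr in (0.01, 0.1, 1.1):
    w = gradient_descent(lr)
    print(f"lr={lr}: w={w:.4f}, distance from target={abs(w - 3.0):.4f}")
```

With the same 30 steps, the smallest rate is still far from the target, the middle rate lands very close to it, and the largest rate overshoots further on every step and diverges.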

## **Overfitting vs Underfitting**

You might think about the information in the training data as being of two kinds: signal and noise. The signal is the part that generalizes, the part that can help our model make predictions from new data. The noise is the part that is only true of the training data; the noise is all of the random fluctuation that comes from data in the real world or all of the incidental, non-informative patterns that can't actually help the model make predictions. The noise is the part that might look useful but really isn't.

The training loss will go down either when the model learns signal or when it learns noise. But the validation loss will go down only when the model learns signal. (Whatever noise the model learned from the training set won't generalize to new data.) So, when a model learns signal both curves go down, but when it learns noise a gap is created in the curves. The size of the gap tells you how much noise the model has learned.

A model's capacity refers to the size and complexity of the patterns it is able to learn. For neural networks, this will largely be determined by how many neurons it has and how they are connected together. If it appears that your network is underfitting the data, you should try increasing its capacity.

You can increase the capacity of a network either by making it wider (more units to existing layers) or by making it deeper (adding more layers). Wider networks have an easier time learning more linear relationships, while deeper networks prefer more nonlinear ones. Which is better just depends on the dataset.

If the validation loss begins to rise very early while the training loss continues to decrease, it is an indication that the network has begun to overfit. At this point, we would need to try something to prevent it, either by reducing the number of units or through a method like early stopping. Training with early stopping also means we're in less danger of stopping the training too early, before the network has finished learning signal. So besides preventing overfitting from training too long, early stopping can also prevent underfitting from not training long enough. Just set your training epochs to some large number (more than you'll need), and early stopping will take care of the rest.

We'll define an early stopping callback that waits some epochs (`patience`) for a change in validation loss of at least the `min_delta` and keeps the weights with the best loss (`restore_best_weights`).

In [None]:
early_stopping = callbacks.EarlyStopping(
    patience=10,
    min_delta=0.001,
    restore_best_weights=True,
)

model = keras.Sequential([
    layers.Dense(units=512, activation='relu', input_shape=input_shape),
    layers.Dense(units=512, activation='relu'),
    layers.Dense(units=1)
])

model.compile(
    optimizer="adam",
    loss="mae")

history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=32,
    epochs=200,
    verbose=0,
    callbacks=[early_stopping] # added the early stopping callback
)

history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot()
print("Minimum Validation Loss: {:0.4f}".format(history_df['val_loss'].min()));
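
The interplay of `patience` and `min_delta` can be sketched in plain Python. This is a simplified model of the idea, not the actual Keras implementation, and the validation losses are invented for the example.

```python
def simulate_early_stopping(val_losses, patience, min_delta):
    """Return (stopped_epoch, best_epoch) for a sequence of validation losses.

    Simplified rule: an epoch counts as an improvement only if the loss
    drops more than min_delta below the best loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # training would stop here
    return len(val_losses) - 1, best_epoch  # never triggered, ran all epochs

# Hypothetical validation losses, one per epoch (epochs are counted from 0).
losses = [0.60, 0.50, 0.495, 0.497, 0.51, 0.40]
print(simulate_early_stopping(losses, patience=2, min_delta=0.01))  # (3, 1)
```

Training stops at epoch 3 because epochs 2 and 3 did not beat the best loss by at least `min_delta`, and the weights to restore are those from epoch 1. Note that a short patience can miss a later improvement, such as the final 0.40 here.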

## **Dropout Layer**

Some layers can do preprocessing or transformations of other sorts.

One of these is the "dropout layer", which can help correct overfitting. Overfitting is caused by the network learning spurious patterns in the training data. To recognize these spurious patterns, a network will often rely on very specific combinations of weights, a kind of "conspiracy" of weights. Being so specific, they tend to be fragile: remove one and the conspiracy falls apart.

This is the idea behind dropout. To break up these conspiracies, we randomly drop out some fraction of a layer's input units every step of training, making it much harder for the network to learn those spurious patterns in the training data. Instead, it has to search for broad, general patterns, whose weight patterns tend to be more robust.

You could also think about dropout as creating a kind of ensemble of networks. The predictions will no longer be made by one big network, but instead by a committee of smaller networks. Individuals in the committee tend to make different kinds of mistakes, but be right at the same time, making the committee as a whole better than any individual. (If you're familiar with random forests as an ensemble of decision trees, it's the same idea.)

In Keras, the `rate` argument of the `Dropout` layer defines what fraction of the input units to shut off. Put the `Dropout` layer just before the layer you want the dropout applied to. (When adding dropout, you may need to increase the number of units in your `Dense` layers.)
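
Here is a minimal NumPy sketch of the mechanism, assuming the common "inverted dropout" formulation in which the surviving inputs are scaled up by `1 / (1 - rate)` during training. The function name and the toy activations are made up for illustration.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a random fraction of inputs and rescale the rest."""
    if not training:
        return x  # dropout is inactive when making predictions
    keep_mask = rng.random(x.shape) >= rate
    # rescale the kept inputs so the expected sum stays the same
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((1, 10))

dropped = dropout(activations, rate=0.3, rng=rng)
print(dropped)  # each entry is either 0 or 1 / 0.7

unchanged = dropout(activations, rate=0.3, rng=rng, training=False)
print(unchanged)  # all ones
```

A different random mask is drawn on every call, which is what keeps the network from relying on any particular combination of units.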

## **Batchnorm Layer**

With neural networks, it's generally a good idea to put all of your data on a common scale, perhaps with something like scikit-learn's StandardScaler or MinMaxScaler. The reason is that SGD will shift the network weights in proportion to how large an activation the data produces. Features that tend to produce activations of very different sizes can make for unstable training behavior.

Now, if it's good to normalize the data before it goes into the network, maybe also normalizing inside the network would be better! In fact, we have a special kind of layer that can do this, the batch normalization layer. A batch normalization layer looks at each batch as it comes in, first normalizing the batch with its own mean and standard deviation, and then also putting the data on a new scale with two trainable rescaling parameters. Batchnorm, in effect, performs a kind of coordinated rescaling of its inputs.

Most often, batchnorm is added as an aid to the optimization process (though it can sometimes also help prediction performance). Models with batchnorm tend to need fewer epochs to complete training. Moreover, batchnorm can also fix various problems that can cause the training to get "stuck". Consider adding batch normalization to your models, especially if you're having trouble during training. Batch normalization can be used at almost any point in a network.
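
The training-time computation described above can be sketched in NumPy. This is a simplified version under stated assumptions: it covers only the per-batch normalization and the two trainable rescaling parameters (called `gamma` and `beta` here), and it leaves out the running statistics a real layer keeps for use at prediction time. The `epsilon` constant is a small value added for numerical stability.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, epsilon=1e-3):
    """Normalize each feature with the batch statistics, then rescale and shift."""
    batch_mean = x.mean(axis=0)
    batch_var = x.var(axis=0)
    x_hat = (x - batch_mean) / np.sqrt(batch_var + epsilon)
    return gamma * x_hat + beta

# Hypothetical batch of 4 examples with 2 features on different scales.
batch = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])

out = batch_norm_train(batch, gamma=np.ones(2), beta=np.zeros(2))
print(out.mean(axis=0))  # approximately 0 for both features
print(out.std(axis=0))   # approximately 1 for both features
```

With `gamma` at 1 and `beta` at 0 the layer simply standardizes each feature; during training the network is free to learn other values for both.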

## **Example**

In [None]:
model = keras.Sequential([
    layers.BatchNormalization(input_shape=[11]),
    layers.Dense(512, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(512, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(1)
])

model.compile(
    optimizer='adam',
    loss='mae'
)

early_stopping = callbacks.EarlyStopping(
    patience=10,
    min_delta=0.001,
    restore_best_weights=True
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=32,
    epochs=200,
    verbose=0,
    callbacks=[early_stopping]
)

history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot()
print("Minimum Validation Loss: {:0.4f}".format(history_df['val_loss'].min()));

## **Classification**

Let's treat this problem as a classification problem. If the quality of the wine is > 5, we'll classify it as good (1), and if it's <= 5, bad (0).

Accuracy is one of the many metrics in use for measuring success on a classification problem. The problem with accuracy (and most other classification metrics) is that it can't be used as a loss function. SGD needs a loss function that changes smoothly, but accuracy, being a ratio of counts, changes in "jumps". So, we have to choose a substitute to act as the loss function. This substitute is the cross-entropy function.
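
A small sketch shows why accuracy changes in "jumps". The labels and predicted probabilities are invented, and the helper assumes the usual 0.5 decision threshold.

```python
def binary_accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    correct = sum((p > threshold) == bool(t) for t, p in zip(y_true, p_pred))
    return correct / len(y_true)

labels = [1, 0, 1, 1]
before = [0.55, 0.45, 0.60, 0.40]  # hesitant predictions, last one is wrong
better = [0.95, 0.05, 0.90, 0.45]  # much more confident, last one still wrong
nudged = [0.95, 0.05, 0.90, 0.51]  # last prediction barely crosses the threshold

print(binary_accuracy(labels, before))  # 0.75
print(binary_accuracy(labels, better))  # 0.75
print(binary_accuracy(labels, nudged))  # 1.0
```

Large improvements in the predicted probabilities leave the accuracy flat, while a tiny change across the threshold makes it jump, so it gives gradient-based training nothing smooth to follow.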

Cross-entropy is a sort of measure for the distance from one probability distribution to another. The idea is that we want our network to predict the correct class with probability 1.0. The further away the predicted probability is from 1.0, the greater will be the cross-entropy loss.

The cross-entropy and accuracy functions both require probabilities as inputs, meaning, numbers from 0 to 1. To convert the real-valued outputs produced by a dense layer into probabilities, we attach a different kind of activation function, the sigmoid activation.
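
To make this concrete, here is a plain-Python sketch of the sigmoid function and the binary cross-entropy for a single example. It is a simplified version for illustration: it does not clip the probabilities, so it would fail for a predicted probability of exactly 0 or 1.

```python
import math

def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, p_pred):
    """Loss for one example: minus the log of the probability given to the true class."""
    return -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1.0 - p_pred))

print(sigmoid(0.0))  # 0.5

# True class is 1: the loss grows as the predicted probability moves away from 1.0.
for p in (0.9, 0.5, 0.1):
    print(p, round(binary_cross_entropy(1, p), 4))
# 0.9 0.1054
# 0.5 0.6931
# 0.1 2.3026
```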

In [None]:
y_valid = y_valid.apply(
    lambda x: 1 if x > 5 else 0)
y_train = y_train.apply(
    lambda x: 1 if x > 5 else 0)

In [None]:
model = keras.Sequential([
    layers.BatchNormalization(input_shape=[11]),
    layers.Dense(512, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(512, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(1, activation='sigmoid') # sigmoid activation to convert the outputs into probabilities
])

model.compile(
    optimizer='adam', # adam optimizer works great for classification too
    loss='binary_crossentropy', # add the cross-entropy loss and accuracy metric to the model
    metrics=['binary_accuracy'] # for two-class problems use the 'binary' versions
)

early_stopping = callbacks.EarlyStopping(
    patience=10,
    min_delta=0.001,
    restore_best_weights=True
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=32,
    epochs=200,
    verbose=0,
    callbacks=[early_stopping]
)

history_df = pd.DataFrame(history.history)

history_df.loc[:, ['loss', 'val_loss']].plot()
history_df.loc[:, ['binary_accuracy', 'val_binary_accuracy']].plot()

print("Best Validation Loss: {:0.4f}\nBest Validation Accuracy: {:0.4f}".format(
    history_df['val_loss'].min(),
    history_df['val_binary_accuracy'].max()))