# 03 Losses, Stochastic Gradient Descent, and Natural Language Processing
## Dr. Tristan Behrens

In the following we will learn about the essential Deep Learning building blocks. We will

- look at the most common loss functions,
- build an intuition for Stochastic Gradient Descent, and
- apply our new knowledge to Natural Language Processing.

## Make sure that we have TensorFlow 2 enabled.

In [None]:
%tensorflow_version 2.x

## Imports.

In [None]:
import tensorflow as tf
import numpy as np
import tensorflow_datasets as tfds
import matplotlib.pyplot as plt

## Loss Functions and Their Use.

These four are the most common:

- Binary Crossentropy (BCE), mainly used for binary classifiers,
- Categorical Crossentropy (CCE), mainly used for categorical classifiers,
- Mean Squared Error (MSE), mainly used for regressions, and
- Mean Absolute Error (MAE), mainly used for regressions, too.

### Binary Crossentropy.

Crossentropy, in layman's terms, measures the distance between two probability distributions. Binary Crossentropy is used when we have two classes.

In [None]:
y_true = [[0., 1.], [0., 0.]]
y_pred = [[0.6, 0.4], [0.4, 0.6]]
bce = tf.keras.losses.BinaryCrossentropy()
bce(y_true, y_pred).numpy()
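
To see where the number from the cell above comes from, here is a minimal sketch that recomputes Binary Crossentropy by hand with the standard library only. The helper name `binary_crossentropy` and the clipping constant `epsilon` are our own choices for illustration, not part of the Keras API. The sketch assumes that every entry is treated as an independent yes/no label and that the result is averaged over all entries.

```python
import math

def binary_crossentropy(y_true, y_pred, epsilon=1e-7):
    """Mean binary crossentropy over all entries of two nested lists."""
    losses = []
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            # Clip the prediction so that log(0) can never occur.
            p = min(max(p, epsilon), 1.0 - epsilon)
            losses.append(-(t * math.log(p) + (1.0 - t) * math.log(1.0 - p)))
    return sum(losses) / len(losses)

y_true = [[0., 1.], [0., 0.]]
y_pred = [[0.6, 0.4], [0.4, 0.6]]
bce_by_hand = binary_crossentropy(y_true, y_pred)
print(round(bce_by_hand, 3))  # 0.815
```

Three of the four predictions are off by 0.6 and one by 0.4, so the loss is the average of three times `-log(0.4)` and once `-log(0.6)`.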

### Categorical Crossentropy.

We use Categorical Crossentropy when we have more than two classes.

In [None]:
y_true = [[0, 1, 0], [0, 0, 1]]
y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
cce = tf.keras.losses.CategoricalCrossentropy()
cce(y_true, y_pred).numpy()
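
As a cross-check, the following sketch recomputes Categorical Crossentropy by hand. The helper name `categorical_crossentropy` and the clipping constant `epsilon` are assumptions made for this illustration. It assumes one-hot labels and predictions that already sum to one per sample, which is the case for the values above.

```python
import math

def categorical_crossentropy(y_true, y_pred, epsilon=1e-7):
    """Mean over samples of -sum(t * log(p)) across the classes."""
    per_sample = []
    for true_row, pred_row in zip(y_true, y_pred):
        loss = 0.0
        for t, p in zip(true_row, pred_row):
            # Clip the prediction so that log(0) can never occur.
            p = min(max(p, epsilon), 1.0 - epsilon)
            loss -= t * math.log(p)
        per_sample.append(loss)
    return sum(per_sample) / len(per_sample)

y_true = [[0, 1, 0], [0, 0, 1]]
y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
cce_by_hand = categorical_crossentropy(y_true, y_pred)
print(round(cce_by_hand, 3))  # 1.177
```

Only the probability assigned to the correct class counts. The first sample is almost right (0.95) and contributes little, while the second sample gives the correct class only 0.1 and dominates the loss.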

### Mean Squared Error.

Measures the average of the squares of the errors. Mostly used in regression problems.

In [None]:
y_true = [[0., 1.], [0., 0.]]
y_pred = [[1., 1.], [1., 0.]]
mse = tf.keras.losses.MeanSquaredError()
mse(y_true, y_pred).numpy()
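
The same value can be computed by hand. This is a minimal sketch in plain Python. The helper name `mean_squared_error` is our own and is not the Keras function.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences over all entries of two nested lists."""
    squared_errors = [
        (t - p) ** 2
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    ]
    return sum(squared_errors) / len(squared_errors)

y_true = [[0., 1.], [0., 0.]]
y_pred = [[1., 1.], [1., 0.]]
mse_by_hand = mean_squared_error(y_true, y_pred)
print(mse_by_hand)  # 0.5
```

Two of the four predictions are off by exactly 1, so the squared errors are 1, 0, 1, 0 and their average is 0.5.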

### Mean Absolute Error.

Measures the average of the absolute values of the errors. Mostly used in regression problems.

In [None]:
y_true = [[0., 1.], [0., 0.]]
y_pred = [[1., 1.], [1., 0.]]
mae = tf.keras.losses.MeanAbsoluteError()
mae(y_true, y_pred).numpy()
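
With the values above, MSE and MAE give the same result because every error is either 0 or 1. The following sketch shows how the two differ once a single large error appears. The helper names and the outlier values are made up for illustration.

```python
def flatten(rows):
    """Turn a nested list into a flat list."""
    return [value for row in rows for value in row]

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# The values from the cell above.
y_true = flatten([[0., 1.], [0., 0.]])
y_pred = flatten([[1., 1.], [1., 0.]])
mae_by_hand = mean_absolute_error(y_true, y_pred)
print(mae_by_hand)  # 0.5

# Hypothetical data with one large error.
y_true_outlier = [0., 0., 0., 0.]
y_pred_outlier = [0., 0., 0., 4.]
mse_outlier = mean_squared_error(y_true_outlier, y_pred_outlier)
mae_outlier = mean_absolute_error(y_true_outlier, y_pred_outlier)
print(mse_outlier, mae_outlier)  # 4.0 1.0
```

Squaring makes MSE punish the single large error much harder than MAE does. MAE is therefore the more robust choice when the data contains outliers.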

## Intuition behind Stochastic Gradient Descent.

![](http://www.its.caltech.edu/~nazizanr/imgs/nonconvex3.jpg)

(Image copyright Navid Azizan, Caltech)
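
To make the intuition concrete, here is a minimal sketch of mini-batch Stochastic Gradient Descent on a toy problem: fitting the slope `w` of a line `y = w * x` with a Mean Squared Error loss. Everything in it (the function name `sgd_fit_slope`, the data, the learning rate, the batch size) is an assumption chosen for illustration. This toy loss surface is a simple bowl, far easier than the landscapes Neural Networks face, but the update rule is the same: take a small random batch, compute the gradient of the loss on that batch, and step against it.

```python
import random

def sgd_fit_slope(xs, ys, learning_rate=0.05, epochs=50, batch_size=2, seed=0):
    """Fit w in y = w * x with mini-batch SGD on a mean squared error loss."""
    rng = random.Random(seed)
    w = 0.0  # Start far away from the solution.
    indices = list(range(len(xs)))
    for _ in range(epochs):
        # "Stochastic": visit the samples in a new random order every epoch.
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w * x - y)^2) with respect to w on this batch.
            gradient = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            # Step downhill, against the gradient.
            w -= learning_rate * gradient
    return w

xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [2.0 * x for x in xs]  # The true slope is 2.
fitted_slope = sgd_fit_slope(xs, ys)
print(round(fitted_slope, 4))  # 2.0
```

Each step uses only two samples, so a single gradient is a noisy estimate of the true one. Over many steps the noise averages out and the weight settles at the minimum of the loss.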

## Binary Classification in Natural Language Processing - Underfitting and Overfitting.

Let us solve another use case. Sentiment Analysis is about processing a text and extracting the sentiment it contains.

In [None]:
(imdb_train, imdb_validate, imdb_test), info = tfds.load(
    name="imdb_reviews/subwords8k", 
    split=["train[:80%]", "train[80%:]", "test"],
    with_info=True,
    as_supervised=True
)

### Inspecting the data.

In [None]:
print("Training:", len(list(imdb_train)))
print("Validate:", len(list(imdb_validate)))
print("Testing: ", len(list(imdb_test)))

In [None]:
# The reviews have different lengths, so we keep them in a plain Python list.
reviews = [review.numpy() for review, _ in imdb_train]
print("Review 0 shape: ", reviews[0].shape)
print("Review 1 shape: ", reviews[1].shape)
print("Review 2 shape: ", reviews[2].shape)
print("Review 3 shape: ", reviews[3].shape)

In [None]:
lengths = [len(x) for x in reviews]
print("min", np.min(lengths))
print("mean", np.mean(lengths))
print("std", np.std(lengths))
print("max", np.max(lengths))
plt.figure(figsize=(12, 8))
plt.hist(lengths, bins=200)
plt.show()
plt.close()

In [None]:
encoder = info.features['text'].encoder
print("Vocabulary size: {}".format(encoder.vocab_size))

### A close look at the text data.

In [None]:
random_review, random_label = list(imdb_train.shuffle(1000).take(1))[0]
print("Review:", random_review.numpy())
print("")
print("Label: ", random_label.numpy())
print("")
random_review_decoded = encoder.decode(random_review)
print("Decoded:", random_review_decoded)

In [None]:
text = "Hello my dear students and colleagues!"

text_encoded = encoder.encode(text)
print(text_encoded)

text_decoded = encoder.decode(text_encoded)
print(text_decoded)

### Encode the reviews for Deep Learning.

In [None]:
dimensions = encoder.vocab_size

def encode(indices, label):
    indices = tf.dtypes.cast(indices, tf.int32)
    review_encoded = tf.one_hot(indices=indices, depth=dimensions)
    review_encoded = tf.reduce_max(review_encoded, 0)
    label_encoded = label
    return review_encoded, label_encoded

imdb_train_encoded = imdb_train.map(encode).cache().prefetch(tf.data.experimental.AUTOTUNE)
imdb_validate_encoded = imdb_validate.map(encode).cache().prefetch(tf.data.experimental.AUTOTUNE)
imdb_test_encoded = imdb_test.map(encode).cache().prefetch(tf.data.experimental.AUTOTUNE)
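
The `encode` function turns a review of any length into a vector of fixed size: one slot per vocabulary entry, set to 1 if that entry occurs anywhere in the review. This is often called multi-hot encoding. The following plain-Python sketch shows the same idea on a tiny made-up example. The helper name `multi_hot` and the vocabulary size of 5 are assumptions for illustration.

```python
def multi_hot(indices, depth):
    """Return a vector of length depth with 1.0 at every index that occurs."""
    vector = [0.0] * depth
    for index in indices:
        # Repeated indices change nothing, the slot simply stays at 1.0.
        # Every index must lie in the range 0 .. depth - 1.
        vector[index] = 1.0
    return vector

# A hypothetical review consisting of the tokens 3, 1 and 3 again.
encoded = multi_hot([3, 1, 3], depth=5)
print(encoded)  # [0.0, 1.0, 0.0, 1.0, 0.0]
```

Note what gets lost: the order of the tokens and how often each one occurs. The model only sees which vocabulary entries are present in a review.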

In [None]:
random_review, random_label = list(imdb_train_encoded.shuffle(1000).take(1))[0]
print(random_review.shape)
print(random_review.numpy()[0:100], "...")
print(np.sum(random_review))
print(random_label)

### Create the model and compile it.

In [None]:
from tensorflow.keras import models, layers

model = models.Sequential()
model.add(layers.Dense(
    16, 
    activation="relu", 
    input_shape=(dimensions,)
))
model.add(layers.Dense(16, activation="relu"))
model.add(layers.Dense(1, activation="sigmoid"))

model.summary()

model.compile(
    optimizer="rmsprop",
    loss="binary_crossentropy",
    metrics=["accuracy"]
)
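
The parameter counts printed by `model.summary()` can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The sketch below uses a hypothetical input size of 8000. The real value is `encoder.vocab_size`, so the actual numbers in the summary will differ slightly. The helper name `dense_parameter_count` is our own.

```python
def dense_parameter_count(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units

input_size = 8000  # Hypothetical, stands in for encoder.vocab_size.
layer_units = [16, 16, 1]  # The three Dense layers of the model above.

counts = []
previous = input_size
for units in layer_units:
    counts.append(dense_parameter_count(previous, units))
    previous = units  # The output of one layer is the input of the next.

total = sum(counts)
print(counts, total)  # [128016, 272, 17] 128305
```

Almost all parameters sit in the first layer, because it has to connect every vocabulary entry to each of its 16 units.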

### Train the model.

In [None]:
history = model.fit(
    imdb_train_encoded.batch(512),
    epochs=20,
    validation_data=imdb_validate_encoded.batch(512)
)

### Evaluate training.

In [None]:
plt.plot(history.history["loss"], label="loss")
plt.plot(history.history["val_loss"], label="val_loss")
plt.legend()
plt.show()
plt.close()

plt.plot(history.history["accuracy"], label="accuracy")
plt.plot(history.history["val_accuracy"], label="val_accuracy")
plt.legend()
plt.show()
plt.close()

In [None]:
model.evaluate(imdb_test_encoded.batch(512))

## Overfitting, Underfitting, Best Practices.

Ways to overcome underfitting:

- train longer,
- use a bigger Neural Network architecture.

Ways to overcome overfitting:

- more data,
- better data,
- data augmentation,
- early stopping,
- smaller Neural Network architecture,
- Dropout and other regularization techniques.
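
Early stopping is the easiest of these techniques to illustrate. The following sketch shows the idea in plain Python: stop as soon as the validation loss has not improved for `patience` epochs in a row, and remember the best epoch. The function name `early_stopping_epoch` and the loss values are made up for illustration and are not results from the model above.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (epoch at which training stops, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # The validation loss improved, remember this epoch.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here.
            return epoch, best_epoch
    # Training ran to the end without triggering early stopping.
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improving first, then getting worse.
val_losses = [0.60, 0.45, 0.38, 0.40, 0.43, 0.47]
stopped_at, best_at = early_stopping_epoch(val_losses, patience=2)
print(stopped_at, best_at)  # 4 2
```

In Keras the same behaviour is available as the callback `tf.keras.callbacks.EarlyStopping`, which can be passed to `model.fit` via the `callbacks` argument.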

## Summary.

In this notebook we have learned how loss functions are used to assess the quality of Neural Networks. On top of that, we looked at the intuition behind our learning algorithm, Stochastic Gradient Descent. We had a close look at Natural Language Processing. And we discussed underfitting and overfitting.