
##### Copyright 2018 The TensorFlow Authors.

# 5) TensorFlow_Redo_5_overfit_and_underfit

## Import the necessary packages

In [0]:
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import regularizers
from tensorflow.keras import losses
from IPython import display
from matplotlib import pyplot as plt
import numpy as np
import pathlib
import shutil
import tempfile
print(tf.__version__)

In [0]:
!pip install git+https://github.com/tensorflow/docs
import tensorflow_docs as tfdocs
import tensorflow_docs.modeling
import tensorflow_docs.plots

In [0]:
logdir = pathlib.Path(tempfile.mkdtemp())/"tensorboard_logs"
shutil.rmtree(logdir, ignore_errors=True)

## Load the Higgs Dataset

It contains 11,000,000 examples, each with 28 features, and a binary class label.

In [0]:
gz_file = tf.keras.utils.get_file('HIGGS.csv.gz', 'http://mlphysics.ics.uci.edu/data/higgs/HIGGS.csv.gz')

In [0]:
FEATURES = 28

In [0]:
from tensorflow.data.experimental import CsvDataset

In [0]:
df = CsvDataset(gz_file,[float(),]*(FEATURES+1), compression_type="GZIP")

### Function that repacks a list of scalars into a (feature_vector, label) pair

In [0]:
def pack_row(*row):
  label = row[0]
  features = tf.stack(row[1:],1)
  return features,label

In [0]:
df_packed = df.batch(10000).map(pack_row).unbatch()
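
As a rough illustration of what `pack_row` does to a batch, the plain-Python sketch below treats each CSV column as a list and regroups the columns into per-example rows. The helper name `pack_columns` and the toy values are made up for this example; the real function operates on tensors with `tf.stack`.

```python
def pack_columns(*columns):
    """Split batched columns into (features, labels), mimicking pack_row."""
    labels = list(columns[0])  # the first column holds the class label
    # Transposing the remaining columns gives one feature row per example,
    # which is the effect of stacking the column tensors along axis 1.
    features = [list(row) for row in zip(*columns[1:])]
    return features, labels


# Toy batch: 3 examples, 1 label column and 2 feature columns.
label_col = [1.0, 0.0, 1.0]
feat_a = [0.1, 0.2, 0.3]
feat_b = [10.0, 20.0, 30.0]

features, labels = pack_columns(label_col, feat_a, feat_b)
print(features)  # [[0.1, 10.0], [0.2, 20.0], [0.3, 30.0]]
print(labels)    # [1.0, 0.0, 1.0]
```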

Have a look at some of the records from this new `df_packed` dataset.

The features are not perfectly normalized, but this is sufficient for this tutorial.

In [0]:
for features,label in df_packed.batch(1000).take(1):
  print(features[0])
  plt.hist(features.numpy().flatten(), bins = 101)

We will use the first 1,000 samples for validation, and the next 10,000 for training:

In [0]:
N_VALIDATION = int(1e3)
N_TRAIN = int(1e4)
BUFFER_SIZE = int(1e4)
BATCH_SIZE = 500
STEPS_PER_EPOCH = N_TRAIN//BATCH_SIZE

In [0]:
df_valid = df_packed.take(N_VALIDATION).cache()
df_train = df_packed.skip(N_VALIDATION).take(N_TRAIN).cache()

In [0]:
df_train

In [0]:
df_valid = df_valid.batch(BATCH_SIZE)
df_train = df_train.shuffle(BUFFER_SIZE).repeat().batch(BATCH_SIZE)

In [0]:
df_train

## Demonstrate overfitting


> The simplest way to prevent overfitting is to start with a small model: A model with a small number of learnable parameters (which is determined by the number of layers and the number of units per layer). In deep learning, the number of learnable parameters in a model is often referred to as the model's "capacity".

Intuitively, a model with more parameters has more "memorization capacity" and can therefore easily learn a perfect dictionary-like mapping between training samples and their targets. Such a mapping has no generalization power, so it is useless for making predictions on previously unseen data.

Always keep this in mind: deep learning models tend to be good at fitting to the training data, but the real challenge is generalization, not fitting.

On the other hand, if the network has limited memorization resources, it will not be able to learn the mapping as easily. To minimize its loss, it will have to learn compressed representations that have more predictive power. At the same time, if you make your model too small, it will have difficulty fitting to the training data. There is a balance between "too much capacity" and "not enough capacity".

Unfortunately, there is no magical formula to determine the right size or architecture of your model (in terms of the number of layers, or the right size for each layer). You will have to experiment using a series of different architectures.

To find an appropriate model size, it's best to start with relatively few layers and parameters, then begin increasing the size of the layers or adding new layers until you see diminishing returns on the validation loss.

Start with a simple model using only `layers.Dense` as a baseline, then create larger versions, and compare them.
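
To give a feel for how layer count and layer width translate into capacity, here is a small sketch that counts learnable parameters for a stack of fully connected layers. It assumes plain `Dense` layers with a bias per unit; `dense_param_count` is a hypothetical helper, and the layer widths are the ones this notebook uses for its tiny, small, and medium models.

```python
def dense_param_count(input_dim, layer_units):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    fan_in = input_dim
    for units in layer_units:
        total += fan_in * units + units  # kernel entries plus one bias per unit
        fan_in = units                   # this layer's output feeds the next layer
    return total


FEATURES = 28
print(dense_param_count(FEATURES, [16, 1]))          # 481
print(dense_param_count(FEATURES, [16, 16, 1]))      # 753
print(dense_param_count(FEATURES, [64, 64, 64, 1]))  # 10241
```

These totals should match the "Total params" line that `model.summary()` reports for the corresponding models.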




### Training procedure

Many models train better if you gradually reduce the learning rate during training. Use `optimizers.schedules` to reduce the learning rate over time:

In [0]:
from tensorflow.keras import optimizers
from tensorflow.keras.optimizers import schedules

Set a `schedules.InverseTimeDecay` to hyperbolically decrease the learning rate to 1/2 of the base rate at 1000 epochs, 1/3 at 2000 epochs and so on.

In [0]:
lr_schedule = schedules.InverseTimeDecay(0.001, decay_steps=STEPS_PER_EPOCH*1000,
                                         decay_rate=1, staircase=False)

def get_optimizer():
  return optimizers.Adam(lr_schedule)

In [0]:
step = np.linspace(0,100000)
lr = lr_schedule(step)
plt.figure(figsize = (8,6))
plt.plot(step/STEPS_PER_EPOCH, lr)
plt.ylim([0,max(plt.ylim())])
plt.xlabel('Epoch')
_ = plt.ylabel('Learning Rate')
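
The hyperbolic decay described above can be checked by hand. The sketch below reimplements the non-staircase inverse-time-decay rule, `initial_lr / (1 + decay_rate * step / decay_steps)`, in plain Python using the constants from this notebook; `inverse_time_decay` is a hypothetical helper written for illustration, not the Keras class.

```python
def inverse_time_decay(step, initial_lr=0.001, decay_steps=20_000, decay_rate=1.0):
    """Learning rate after `step` optimizer steps (continuous, non-staircase form)."""
    return initial_lr / (1.0 + decay_rate * step / decay_steps)


STEPS_PER_EPOCH = 10_000 // 500  # N_TRAIN // BATCH_SIZE = 20 steps per epoch

# decay_steps = STEPS_PER_EPOCH * 1000, so the rate halves after 1000 epochs,
# drops to a third after 2000 epochs, and so on.
for epoch in (0, 1000, 2000, 3000):
    lr = inverse_time_decay(epoch * STEPS_PER_EPOCH)
    print(epoch, lr)
```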


Each model in this tutorial will use the same training configuration. So set these up in a reusable way, starting with the list of callbacks.

To reduce the logging noise, use `tfdocs.modeling.EpochDots`, which simply prints a `.` for each epoch and a full set of metrics every 100 epochs.



Next, include `callbacks.EarlyStopping` to avoid long and unnecessary training times. Note that this callback is set to monitor the `val_binary_crossentropy`, not the `val_loss`.

Use `callbacks.TensorBoard` to generate TensorBoard logs for the training.


In [0]:
def get_callbacks(name):
  return [
    tfdocs.modeling.EpochDots(),
    tf.keras.callbacks.EarlyStopping(monitor='val_binary_crossentropy', patience=200),
    tf.keras.callbacks.TensorBoard(logdir/name),
  ]
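
To illustrate what the `patience` argument means, the following is a simplified sketch of the stopping rule: training ends once the monitored value has failed to improve for `patience` consecutive epochs. The helper `stop_epoch` is hypothetical and ignores options such as `min_delta`; it is not the Keras implementation.

```python
def stop_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None  # never ran out of patience


val_losses = [0.70, 0.60, 0.65, 0.66, 0.64]
print(stop_epoch(val_losses, patience=2))  # 3
```

With `patience=200`, as used in this notebook, a model has to go 200 epochs without a new best `val_binary_crossentropy` before it is stopped.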

Similarly, each model will use the same `Model.compile` and `Model.fit` settings:

In [0]:
def compile_and_fit(model, name, optimizer=None, max_epochs=10000):
  if optimizer is None:
    optimizer = get_optimizer()
  model.compile(optimizer=optimizer,
                loss=losses.BinaryCrossentropy(from_logits=True),
                metrics=[losses.BinaryCrossentropy(from_logits=True, 
                         name='binary_crossentropy'),'accuracy'])
  model.summary()

  history = model.fit(df_train,steps_per_epoch = STEPS_PER_EPOCH,
            epochs=max_epochs, validation_data=df_valid,
            callbacks=get_callbacks(name),verbose=0)
  return history

### Tiny model

In [0]:
from tensorflow.keras import Sequential

In [0]:
from tensorflow.keras.layers import Dense

In [0]:
tiny_model = Sequential([Dense(16, activation='elu', input_shape=(FEATURES,)),
                        Dense(1)])

In [0]:
size_histories = {}

In [0]:
size_histories['Tiny'] = compile_and_fit(tiny_model, 'sizes/Tiny')

Now check how the model did:

In [0]:
plotter = tfdocs.plots.HistoryPlotter(metric = 'binary_crossentropy', smoothing_std=10)
plotter.plot(size_histories)
plt.ylim([0.5, 0.7])

### Small model

Try two hidden layers with 16 units each:

In [0]:
small_model = Sequential([Dense(16, activation='elu', input_shape=(FEATURES,)),
                          Dense(16, activation='elu'),
                          Dense(1)])

In [0]:
size_histories['Small'] = compile_and_fit(small_model, 'sizes/Small')

### Medium model

Now try three hidden layers with 64 units each:

In [0]:
medium_model = Sequential([
                        Dense(64, activation='elu', input_shape=(FEATURES,)),
                        Dense(64, activation='elu'),
                        Dense(64, activation='elu'),
                        Dense(1)])

And train the model using the same data:

In [0]:
size_histories['Medium']  = compile_and_fit(medium_model, "sizes/Medium")

### Large model

Create an even larger model, and see how quickly it begins overfitting. 

In [0]:
large_model = Sequential([Dense(512, activation='elu', input_shape=(FEATURES,)),
                         Dense(512, activation='elu'),
                         Dense(512, activation='elu'),
                         Dense(512, activation='elu'),
                         Dense(512, activation='elu'),
                         Dense(1)])

In [0]:
size_histories['large'] = compile_and_fit(large_model, "sizes/large")

### Plot the training and validation losses

The solid lines show the training loss, and the dashed lines show the validation loss.


> Note: a lower validation loss indicates a better model



While building a larger model gives it more power, if this power is not constrained somehow it can easily overfit to the training set.

In this example, typically, only the `"Tiny"` model manages to avoid overfitting altogether, and each of the larger models overfits the data more quickly.






> * It's normal for there to be a small difference between the training and validation metrics.
> * If both metrics are moving in the same direction, everything is fine.
> * If the validation metric begins to stagnate while the training metric continues to improve, you are probably close to overfitting.
> * If the validation metric is going in the wrong direction, the model is clearly overfitting.



In [0]:
plotter.plot(size_histories)
a = plt.xscale('log')
plt.xlim([5, max(plt.xlim())])
plt.ylim([0.5, 0.7])
plt.xlabel("Epochs [Log Scale]")

Note: All the above training runs used the `callbacks.EarlyStopping` to end the training once it was clear the model was not making progress.

In [0]:
%load_ext tensorboard
%tensorboard --logdir {logdir}/sizes

In [0]:
display.IFrame(
    src="https://tensorboard.dev/experiment/vW7jmmF9TmKmy3rbheMQpw/#scalars&_smoothingWeight=0.97",
    width="100%", height="800px")

In [0]:
!tensorboard dev upload --logdir  {logdir}/sizes

 https://tensorboard.dev/experiment/8LkLMHj4S32KKsDe55o4nA/

## Strategies to prevent overfitting

Copy the training logs from the `"Tiny"` model to use as a baseline for comparison.

In [0]:
shutil.rmtree(logdir/'regularizers/Tiny', ignore_errors=True)
shutil.copytree(logdir/'sizes/Tiny', logdir/'regularizers/Tiny')

In [0]:
regularizer_histories = {}
regularizer_histories['Tiny'] = size_histories['Tiny']

### Add weight regularization




> Occam's Razor principle: given two explanations for something, the explanation most likely to be correct is the "simplest" one, the one that makes the fewest assumptions. This also applies to the models learned by neural networks: given some training data and a network architecture, there are multiple sets of weight values (multiple models) that could explain the data, and simpler models are less likely to overfit than complex ones.



A "simple model" in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters altogether, as we saw in the section above). Thus a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights only to take small values, which makes the distribution of weight values more "regular". This is called "weight regularization", and it is done by adding to the loss function of the network a cost associated with having large weights. This cost comes in two flavors:

* [L1 regularization](https://developers.google.com/machine-learning/glossary/#L1_regularization), where the cost added is proportional to the absolute value of the weight coefficients (i.e. to what is called the "L1 norm" of the weights).

* [L2 regularization](https://developers.google.com/machine-learning/glossary/#L2_regularization), where the cost added is proportional to the square of the value of the weight coefficients (i.e. to what is called the squared "L2 norm" of the weights). L2 regularization is also called weight decay in the context of neural networks. Don't let the different name confuse you: for standard gradient descent, weight decay is mathematically equivalent to L2 regularization.

L1 regularization pushes weights towards exactly zero, encouraging a sparse model. L2 regularization penalizes the weight parameters without making them sparse, since the penalty goes to zero for small weights. This is one reason why L2 is more common.
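
The difference between the two penalties is easiest to see in their gradients. The sketch below is a plain-Python illustration with made-up weights; the helper names are hypothetical and the strength `0.001` simply mirrors the value used in this notebook.

```python
def l1_penalty(weights, strength):
    """Cost proportional to the sum of absolute weight values."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """Cost proportional to the sum of squared weight values."""
    return strength * sum(w * w for w in weights)


def l1_gradient(w, strength):
    """Derivative of strength * |w|: constant size, whatever the weight's magnitude."""
    if w == 0:
        return 0.0  # the derivative is undefined at 0; 0 is the usual convention
    return strength if w > 0 else -strength


def l2_gradient(w, strength):
    """Derivative of strength * w**2: shrinks together with the weight."""
    return 2.0 * strength * w


weights = [1.0, -2.0, 0.5]
print(l1_penalty(weights, 0.001))
print(l2_penalty(weights, 0.001))

# The L1 pull stays the same size for tiny weights, so they get driven to zero;
# the L2 pull fades away, so small weights stay small but nonzero.
for w in (1.0, 0.1, 0.001):
    print(w, l1_gradient(w, 0.001), l2_gradient(w, 0.001))
```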

In `tf.keras`, weight regularization is added by passing weight regularizer instances to layers as keyword arguments. Let's add L2 weight regularization now.

In [0]:
l2_model = Sequential([Dense(512, activation='elu',
                 kernel_regularizer=regularizers.l2(0.001),
                 input_shape=(FEATURES,)),
        Dense(512, activation='elu',
                 kernel_regularizer=regularizers.l2(0.001)),
        Dense(512, activation='elu',
                 kernel_regularizer=regularizers.l2(0.001)),
        Dense(512, activation='elu',
                 kernel_regularizer=regularizers.l2(0.001)),
        Dense(1)])

regularizer_histories['l2'] = compile_and_fit(l2_model, "regularizers/l2")

`l2(0.001)` means that every coefficient in the weight matrix of the layer will add `0.001 * weight_coefficient_value**2` to the total **loss** of the network.

That is why we're monitoring the `binary_crossentropy` directly: it doesn't have this regularization component mixed in.

So, that same `"Large"` model with an `L2` regularization penalty performs much better:


In [0]:
plotter.plot(regularizer_histories)
plt.ylim([0.5, 0.7])

#### More info

> **First:** if you are writing your own training loop, then you need to be sure to ask the model for its regularization losses.



In [0]:
result = l2_model(features)
regularization_loss=tf.add_n(l2_model.losses)


> **Second:** This implementation works by adding the weight penalties to the model's loss, and then applying a standard optimization procedure after that.




There is a second approach that instead runs the optimizer only on the raw loss; then, while applying the calculated step, the optimizer also applies some weight decay. This "Decoupled Weight Decay" is seen in optimizers like `optimizers.Ftrl` and `optimizers.AdamW`.
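
To make the difference between the two approaches concrete, the sketch below writes out a single update of one weight under plain SGD. The function names are hypothetical, and real optimizers add momentum or adaptive scaling on top of this. For plain SGD the two forms coincide when `weight_decay = 2 * l2`; they diverge once the optimizer rescales the gradient, because an L2 term inside the loss gets rescaled along with it.

```python
def sgd_step_l2_in_loss(w, grad, lr, l2):
    """The penalty l2 * w**2 is part of the loss, so it adds 2 * l2 * w to the gradient."""
    return w - lr * (grad + 2.0 * l2 * w)


def sgd_step_decoupled(w, grad, lr, weight_decay):
    """Step on the raw-loss gradient, then shrink the weight separately."""
    return w - lr * grad - lr * weight_decay * w


w, grad, lr = 1.0, 0.5, 0.1
print(sgd_step_l2_in_loss(w, grad, lr, l2=0.01))
print(sgd_step_decoupled(w, grad, lr, weight_decay=0.02))
```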

### Add dropout

Dropout is one of the most effective and most commonly used regularization techniques for neural networks, developed by Hinton and his students at the University of Toronto.

The intuitive explanation for dropout is that because individual nodes in the network cannot rely on the output of the others, each node must output features that are useful on their own.

Dropout, applied to a layer, consists of randomly "dropping out" (i.e. setting to zero) a number of output features of the layer during training. Let's say a given layer would normally have returned a vector [0.2, 0.5, 1.3, 0.8, 1.1] for a given input sample during training; after applying dropout, this vector will have a few zero entries distributed at random, e.g. [0, 0.5, 1.3, 0, 1.1].

The "dropout rate" is the fraction of the features that are zeroed out; it is usually set between 0.2 and 0.5. At test time, no units are dropped out. To keep the expected output the same in both phases, the `tf.keras` Dropout layer scales the surviving outputs up by `1 / (1 - rate)` during training, so no rescaling is needed at test time.
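
One common way to implement this is "inverted" dropout, in which the kept values are scaled up by `1 / (1 - rate)` during training so that nothing needs to be rescaled at test time. The sketch below is a pure-Python illustration of that variant; the `dropout` helper and its arguments are hypothetical names, not the Keras layer.

```python
import random


def dropout(values, rate, training, rng):
    """Inverted dropout on a list of activations."""
    if not training or rate == 0.0:
        return list(values)  # at test time the layer passes values through unchanged
    keep_prob = 1.0 - rate
    # Each value is dropped with probability `rate`; survivors are scaled up
    # so the expected value of every output matches the original activation.
    return [v / keep_prob if rng.random() >= rate else 0.0 for v in values]


activations = [0.2, 0.5, 1.3, 0.8, 1.1]
rng = random.Random(0)  # seeded only to make the illustration repeatable

print(dropout(activations, rate=0.5, training=True, rng=rng))
print(dropout(activations, rate=0.5, training=False, rng=rng))
```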

In `tf.keras` you can introduce dropout in a network via the Dropout layer, which gets applied to the output of the layer right before it.

Let's add a Dropout layer after each hidden layer of our network to see how well dropout does at reducing overfitting:

In [0]:
from tensorflow.keras.layers import Dropout

In [0]:
dropout_model = Sequential([Dense(512, activation='elu', input_shape=(FEATURES,)),
            Dropout(0.5),
            Dense(512, activation='elu'),
            Dropout(0.5),
            Dense(512, activation='elu'),
            Dropout(0.5),
            Dense(512, activation='elu'),
            Dropout(0.5),
            Dense(1)])

regularizer_histories['dropout'] = compile_and_fit(dropout_model, "regularizers/dropout")

In [0]:
plotter.plot(regularizer_histories)
plt.ylim([0.5, 0.7])

It's clear from this plot that both of these regularization approaches improve the behavior of the `"Large"` model. But this still doesn't beat even the `"Tiny"` baseline.

### Combined L2 + dropout

In [0]:
combined_model = Sequential([
      Dense(512, kernel_regularizer=regularizers.l2(0.0001),
                 activation='elu', input_shape=(FEATURES,)),
      Dropout(0.5),
      Dense(512, kernel_regularizer=regularizers.l2(0.0001),
                 activation='elu'),
      Dropout(0.5),
      Dense(512, kernel_regularizer=regularizers.l2(0.0001),
                 activation='elu'),
      Dropout(0.5),
      Dense(512, kernel_regularizer=regularizers.l2(0.0001),
                 activation='elu'),
      Dropout(0.5),
      Dense(1)
])

regularizer_histories['combined'] = compile_and_fit(combined_model, "regularizers/combined")

In [0]:
plotter.plot(regularizer_histories)
plt.ylim([0.5, 0.7])

This model with the `"Combined"` regularization is obviously the best one so far.

### View in TensorBoard




In [0]:
%load_ext tensorboard
%tensorboard --logdir {logdir}/regularizers

In [0]:
display.IFrame(
    src="https://tensorboard.dev/experiment/fGInKDo8TXes1z7HQku9mw/#scalars&_smoothingWeight=0.97",
    width = "100%",
    height="800px")


In [0]:
!tensorboard dev upload --logdir  {logdir}/regularizers

## Conclusions

The most common ways to prevent overfitting in neural networks:

* Get more training data.
* Reduce the capacity of the network.
* Add weight regularization.
* Add dropout.

Two important approaches not covered in this guide are:

* Data augmentation.
* Batch normalization.



> Each method can help on its own, but often combining them can be even more effective.

