In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# A Single Neuron

This notebook covers how to build deep neural networks with `Keras` and `TensorFlow`:
* create a `fully-connected` neural network architecture
* apply neural nets to two classic ML problems: `regression` and `classification`
* train neural nets with `stochastic gradient descent`, and
* improve performance with `dropout, batch normalization`, and other techniques

### What is Deep Learning?
Some of the most impressive advances in AI have come from `deep learning`. *Natural language translation, image recognition, and game playing* are all tasks where deep learning models have neared or even exceeded human-level performance.

`Deep learning` is an approach to machine learning characterized by deep stacks of computations. Through their power and scalability, `neural networks` have become the defining model of deep learning. Neural networks are composed of neurons, where each neuron individually performs only a simple computation.

### The Linear Unit
A `neuron` (or `unit`) with one input looks like: `y = w * x + b`
![](https://i.imgur.com/mfOlDR6.png)
- `x` is the input. Its connection to the neuron has a `weight, w`. For the input `x`, what reaches the neuron is `w * x`. A neural network `learns` by modifying its weights.
- `b` is a special kind of weight called the `bias`. The bias has no input data associated with it; the value that reaches the neuron is `1 * b`. The bias enables the neuron to modify the output independently of its inputs.
- `y` is the value the neuron ultimately outputs. The neuron sums up all the values it receives through its connections: `y = w * x + b`

### Example - The Linear Unit as a Model
Single-neuron models are `linear models`. The example here uses the [80 Cereals](https://www.kaggle.com/crawford/80-cereals) dataset.

Training a model with `sugars` (grams of sugars per serving) as input and `calories` (calories per serving) as output might give a bias of `b=90` and a weight of `w=2.5`.
![](https://i.imgur.com/yjsfFvY.png)
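
As a minimal sketch, the fitted linear unit can be written as a plain Python function. The function name `predict_calories` is made up for illustration; the weight and bias are the values quoted in the text.

```python
def predict_calories(sugars, w=2.5, b=90.0):
    """Linear unit: y = w * x + b, with the weight and bias quoted in the text."""
    return w * sugars + b

# A cereal with 5 grams of sugar per serving
calories = predict_calories(5)
print(calories)  # 102.5
```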

### Multiple Inputs
The `80 Cereals` dataset has many more features than just `sugars`. To expand the model with more input features, add more input connections to the neuron, one for each additional feature. The output is the bias plus the sum of each input multiplied by its connection weight.
![](https://i.imgur.com/vyXSnlZ.png)
The output will be `y = w0 * x0 + w1 * x1 + w2 * x2 + b`.

A linear unit with two inputs will fit a plane, and a unit with more inputs than that will fit a hyperplane.
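
A linear unit with several inputs is just a dot product plus a bias. The sketch below uses NumPy with made-up feature values, weights, and bias (none of them are fitted to the cereal data):

```python
import numpy as np

# Hypothetical values for three features (e.g. sugars, fiber, protein);
# the weights and bias are made up, not fitted to any data.
x = np.array([5.0, 2.0, 3.0])
w = np.array([2.5, 1.0, 4.0])
b = 90.0

# y = w0*x0 + w1*x1 + w2*x2 + b, written as a dot product
y = np.dot(w, x) + b
print(y)  # 116.5
```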

### Linear Units in Keras
The easiest way to create a model in `Keras` is through `keras.Sequential`, which creates a neural network as a stack of layers. We can define a linear model that accepts three features (`'sugars', 'fiber', and 'protein'`) and produces a single output, `calories`:

```
model = keras.Sequential([
    layers.Dense(units=1, input_shape=[3])
])
```
- `units`: how many outputs wanted
- `input_shape`: ensures the model will accept three features as input (`'sugars', 'fiber', and 'protein'`)

`Why is input_shape a Python list?`
- The data we'll use in this course will be tabular data, like in a Pandas dataframe. We'll have one input for each feature in the dataset. The features are arranged by column, so we'll always have `input_shape=[num_columns]`. 
- The reason Keras uses a list here is to permit use of more complex datasets. Image data, for instance, might need three dimensions: `[height, width, channels]`.

### Exercise: A Single Neuron
The [Red Wine Quality](https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009) dataset consists of physicochemical measurements from about 1600 Portuguese red wines. Also included is a quality rating for each wine from blind taste-tests.

In [None]:
red_wine = pd.read_csv('red-wine.csv')
red_wine.head()
red_wine.shape

# Q1: Input_shape for model
input_shape = [len(red_wine.columns[:-1])] # must be number with list format

# Q2: Define linear model
from tensorflow import keras
from tensorflow.keras import layers
model = keras.Sequential([
    layers.Dense(units=1, input_shape=input_shape)
])

# Q3: Check the weight and bias
w, b = model.weights

In [None]:
# Plot the output of an untrained linear model
import tensorflow as tf
import matplotlib.pyplot as plt

model = keras.Sequential([
    layers.Dense(1, input_shape=[1]),
])

x = tf.linspace(-1.0, 1.0, 100)
y = model.predict(x)

plt.figure(dpi=100)
plt.plot(x, y, 'k')
plt.xlim(-1, 1)
plt.ylim(-1, 1)
plt.xlabel("Input: x")
plt.ylabel("Target y")
w, b = model.weights # you could also use model.get_weights() here
plt.title("Weight: {:0.2f}\nBias: {:0.2f}".format(w[0][0], b[0]))
plt.show()

---

# Deep Neural Networks

How can we build neural networks capable of learning the complex kinds of relationships that deep neural networks are known for? The key idea is `modularity`: building up a complex network from simpler functional units.

### Layers
Neural networks typically organize their neurons into `layers`. When we collect together linear units having a common set of inputs, we get a dense layer.
![](https://i.imgur.com/2MA4iMV.png)

A `layer` in Keras is a very general kind of thing. A layer can be, essentially, any kind of `data transformation`. Many layers, like the `convolutional` and `recurrent` layers, `transform data` through use of neurons and differ primarily in the pattern of connections they form.

### The Activation Function
Two dense layers with nothing in between are no better than a single dense layer by itself. Dense layers by themselves can never move us out of the world of lines and planes. What we need is something `nonlinear`: an activation function.
![](https://i.imgur.com/OLSUEYT.png)
Without `activation functions`, neural networks can only learn `linear` relationships. In order to `fit curves`, we'll need to use activation functions.
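
To see why stacking dense layers without an activation adds nothing, the sketch below composes two linear layers in NumPy and checks that they equal a single linear layer. The weights are random and untrained, and the layer sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two dense layers with no activation in between (random, untrained weights)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)   # 2 inputs -> 4 units
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)   # 4 units  -> 1 output

def two_linear_layers(x):
    return W2 @ (W1 @ x + b1) + b2

# The same function expressed as one dense layer
W = W2 @ W1          # combined weights, shape (1, 2)
b = W2 @ b1 + b2     # combined bias, shape (1,)

def one_linear_layer(x):
    return W @ x + b

x = np.array([0.5, -1.5])
print(np.allclose(two_linear_layers(x), one_linear_layer(x)))  # True
```

The algebra behind it is `W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)`, which is again of the form `W x + b`.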

An `activation function` is simply some function we apply to each of a layer's outputs (its activations). The most common is the `rectifier function`, `max(0, x)`, which gives the `ReLU` (rectified linear unit) activation its name.
![](https://i.imgur.com/aeIyAlF.png)
The `rectifier function` has a graph that's a line with the negative part "rectified" to zero. Applying it to the outputs of a neuron puts a `bend` in the data, moving us away from simple lines.

Applying a ReLU activation to a linear unit means the output becomes `max(0, w * x + b)`.
![](https://i.imgur.com/eFry7Yu.png)
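
A small NumPy sketch of the rectifier applied to a linear unit. The helper names `relu` and `relu_unit` and the weight and bias values are chosen here purely for illustration:

```python
import numpy as np

def relu(z):
    """Rectifier: max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def relu_unit(x, w, b):
    """A linear unit followed by the rectifier: max(0, w * x + b)."""
    return relu(w * x + b)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(relu_unit(x, w=1.5, b=-1.0).tolist())  # [0.0, 0.0, 0.0, 0.5, 2.0]
```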


### Stacking Dense Layers
Layers can be stacked to get complex `data transformations`. A stack of dense layers makes a `fully-connected` network.
![](https://i.imgur.com/Y5iwFQZ.png)
The layers before the output layer are sometimes called `hidden`.  

The final (output) layer is a linear unit (no activation function), which makes this network appropriate for a regression task, where we are trying to predict some arbitrary numeric value.
- Other tasks, such as `classification`, might require an activation function on the output.

#### Building Sequential Models
A `Sequential` model connects a list of layers in order from first to last:
- first layer gets input
- last layer produces output
```
model = keras.Sequential([
    # the hidden ReLU layers
    layers.Dense(units=4, activation='relu', input_shape=[2]),
    layers.Dense(units=3, activation='relu'),
    # the linear output layer 
    layers.Dense(units=1),
])
```

### Exercise: Deep Neural Networks
The [Concrete strength](https://www.kaggle.com/c/dat300-2018-concrete) dataset is used to predict the compressive strength of concrete manufactured according to various recipes.

In [None]:
concrete = pd.read_csv('concrete.csv')
concrete.head()

# Q1: Input_shape for model
input_shape = [len(concrete.columns[:-1])]

# Q2: Define model with hidden layers
#    with three hidden layers, each having 512 units and the ReLU activation
#    an output layer of one unit and no activation
model = keras.Sequential([
    layers.Dense(units=512, activation='relu', input_shape=input_shape),
    layers.Dense(units=512, activation='relu'),
    layers.Dense(units=512, activation='relu'),
    layers.Dense(units=1)
])

# Q3: Rewrite Activation Layers
model = keras.Sequential([
    layers.Dense(512, input_shape=[8]),
    layers.Activation('relu'),
    layers.Dense(512),
    layers.Activation('relu'),
    layers.Dense(512),
    layers.Activation('relu'),
    layers.Dense(1),
])

There is a whole family of variants of the `relu` activation, including `elu`, `selu`, and `swish`. Sometimes one activation will perform better than another on a given task. The `ReLU` activation tends to do well on most problems. See the [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/activations).

In [None]:
# Alternatives to ReLU
relu_activation_layer = layers.Activation('relu')
elu_activation_layer = layers.Activation("elu")
selu_activation_layer = layers.Activation('selu')
swish_activation_layer = layers.Activation('swish')

x = tf.linspace(-3.0, 3.0, 100)

plt.figure(dpi=100)
plt.plot(x, relu_activation_layer(x),'b-', label="relu")
plt.plot(x, elu_activation_layer(x), 'g--', label='elu')
plt.plot(x, selu_activation_layer(x), 'r-.', label='selu')
plt.plot(x, swish_activation_layer(x), 'k:', label='swish')
plt.xlim(-3, 3)
plt.xlabel("Input")
plt.ylabel("Output")
plt.legend()
plt.show();

---

# Stochastic Gradient Descent

Each example in the training data consists of some `features` (the inputs) together with an expected `target` (the output). For the `80 Cereals` dataset, the features are `'sugars', 'fiber', and 'protein'` and the target is `'calories'`.

In addition to the training data, we need two more things:
- A `loss function` that measures how good the network's predictions are.
- An `optimizer` that can tell the network how to change its weights.

### Loss Function
A `loss function` tells a network what problem to solve. It measures the disparity between the target's true value and the value the model predicts.

In `regression` problems, the task is to predict some numerical value: calories in `80 Cereals`, rating in `Red Wine Quality`.
A common `loss function` for regression problems is the `mean absolute error (MAE)`.  
- For each prediction `y_pred`, `MAE` measures the disparity from the true target `y_true` by an absolute difference:
    - `abs(y_true - y_pred)`
    - The total MAE loss on a dataset is the mean of all these absolute differences, that is, the average length between the fitted curve and the data points.
![](https://i.imgur.com/VDcvkZN.png)
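
The MAE computation can be sketched in a few lines of NumPy. The target and prediction values below are made up for illustration:

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of abs(y_true - y_pred) over all examples."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

# Hypothetical targets and predictions
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 6.5]

mae = mean_absolute_error(y_true, y_pred)
print(mae)  # 0.75
```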

Other `loss functions` for regression problems are `mean-squared error (MSE)` or `Huber loss`.

During training, the model will use the loss function as a guide for finding the correct values of its weights (`lower loss is better`). 

### The Optimizer - Stochastic Gradient Descent
The `optimizer` is an algorithm that adjusts the weights to minimize the loss. Virtually all of the optimization algorithms used in deep learning belong to a family called `stochastic gradient descent`. They are iterative algorithms that train a network in steps.
1. Sample some training data and run it through the network to make predictions.
2. Measure the loss between the predictions and the true values.
3. Finally, adjust the weights in a direction that makes the loss smaller.

Do this over and over until the loss is as small as desired, or until it stops decreasing.
![](https://i.imgur.com/rFI1tIk.gif)

Each iteration's sample of training data is called a `minibatch` or `batch`, while a complete round of the training data is called an `epoch`. The `number of epochs` you train for is how many times the network will see each training example. The animation shows:
- The linear model from Lesson 1 being trained with `SGD`.
- Pale red dots depict the entire training set, while the solid red dots are the minibatches.
- Every time `SGD` sees a new minibatch, it shifts the `weights` (`w`, the slope, and `b`, the y-intercept) toward their correct values on that batch.
- Batch after batch, the line eventually converges to its best fit.
- The loss gets smaller as the weights get closer to their true values.
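
The three training steps can be sketched as a hand-written minibatch SGD loop in NumPy. This is an illustrative sketch, not what Keras does internally: it fits a one-input linear unit to synthetic data, and it uses the mean squared error rather than MAE because its gradient is simpler to write out. The slope, intercept, noise level, and epoch count are assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known line: y = 3.0 * x + 2.0 plus a little noise
true_w, true_b = 3.0, 2.0
num_examples = 256
x = rng.uniform(-1.0, 1.0, size=num_examples)
y = true_w * x + true_b + rng.normal(scale=0.1, size=num_examples)

w, b = 0.0, 0.0            # initial weights
learning_rate = 0.05
batch_size = 32

for epoch in range(50):
    order = rng.permutation(num_examples)          # reshuffle every epoch
    for start in range(0, num_examples, batch_size):
        idx = order[start:start + batch_size]      # 1. sample a minibatch
        error = (w * x[idx] + b) - y[idx]          # 2. prediction error on the batch
        grad_w = 2.0 * np.mean(error * x[idx])     # gradient of MSE w.r.t. w
        grad_b = 2.0 * np.mean(error)              # gradient of MSE w.r.t. b
        w -= learning_rate * grad_w                # 3. step against the gradient
        b -= learning_rate * grad_b

print(round(float(w), 2), round(float(b), 2))      # should land near 3.0 and 2.0
```

Each pass through the inner loop is one step on one minibatch; each pass through the outer loop is one epoch.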

#### Learning Rate and Batch Size
In the animation above, the line only makes a small shift in the direction of each batch. The size of these shifts is determined by the `learning rate`. 
- A `smaller learning rate` means the network needs to see more minibatches before its weights converge to their best values.

The `learning rate` and the `size of minibatches` are the two parameters that have the largest effect on how the SGD training proceeds. For most work it won't be necessary to do an extensive hyperparameter search to get satisfactory results. `Adam` is an `SGD algorithm` with an adaptive learning rate, which makes it suitable for most problems without any parameter tuning (it is, in a sense, `self tuning`). `Adam is a great general-purpose optimizer`.

### Adding the Loss and Optimizer
After defining a model, you can add a `loss function` and `optimizer` with the model's `compile` method:
```
model.compile(
    optimizer="adam",
    loss="mae",
)
```
The `gradient` is a vector that tells us in what direction the weights need to go. More precisely, it tells us how to change the weights to make the loss change `fastest`. The process is called `gradient descent` because it uses the gradient to descend the loss curve towards a minimum. `Stochastic` means `determined by chance`. Our training is stochastic because the minibatches are random samples from the dataset, and that is why it is called SGD!

### Example - Red Wine Quality
The `Red Wine Quality` dataset consists of physicochemical measurements from about 1600 Portuguese red wines. Also included is a quality rating for each wine from blind taste-tests.

Each feature is rescaled to lie in the interval `[0,1]`, since neural networks tend to perform best when their inputs are on a common scale.

In [None]:
import pandas as pd
from IPython.display import display

# requires the `dl-course-data` dataset to be attached to the notebook
red_wine = pd.read_csv('../input/dl-course-data/red-wine.csv')
red_wine.head()

In [None]:
# Create training and validation splits
df_train = red_wine.sample(frac=0.7, random_state=0)
df_valid = red_wine.drop(df_train.index)
display(df_train.head(4))

# Scale to [0, 1]
max_ = df_train.max(axis=0) # for each column
min_ = df_train.min(axis=0) # for each column
df_train = (df_train - min_) / (max_ - min_)
df_valid = (df_valid - min_) / (max_ - min_)

# Split features and target
X_train = df_train.drop('quality', axis=1)
y_train = df_train['quality']

X_valid = df_valid.drop('quality', axis=1)
y_valid = df_valid['quality']

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

# number of columns should be the input_shape for the model
# chosen a three-layer network with over 1500 neurons
model = keras.Sequential([
    layers.Dense(512, activation='relu', input_shape=[11]),
    layers.Dense(512, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(1),
])

# After defining the model, compile in the optimizer and loss function.
model.compile(optimizer='adam',loss='mae')

In [None]:
# start training the model
#    batch_size: to feed the optimizer N rows of the training data at a time  
#    epochs: to do N times all the way through the dataset

history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=10,
)

In [None]:
# a better way to view the loss though is to plot it.
# `fit` method in fact keeps a record of the loss produced during training

import pandas as pd
history_df = pd.DataFrame(history.history)
history_df['loss'].plot();

### Exercise: Stochastic Gradient Descent
The [Fuel Economy](https://www.kaggle.com/sarita19/fuel-consumption) dataset will be used to explore the effect of the learning rate and batch size on SGD. The task is to predict the fuel economy of an automobile given its features.

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.model_selection import train_test_split

from tensorflow import keras
from tensorflow.keras import layers

fuel = pd.read_csv('../input/dl-course-data/fuel.csv')

X = fuel.copy()
# Remove target
y = X.pop('FE')

preprocessor = make_column_transformer(
    (StandardScaler(),
     make_column_selector(dtype_include=np.number)),
    (OneHotEncoder(sparse=False),  # scikit-learn >= 1.2 names this argument `sparse_output`
     make_column_selector(dtype_include=object)),
)

X = preprocessor.fit_transform(X)
y = np.log(y) # log transform target instead of standardizing

input_shape = [X.shape[1]]
print("Input shape: {}".format(input_shape))


# defining model, compiling and fitting
model = keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=input_shape),
    layers.Dense(128, activation='relu'),    
    layers.Dense(64, activation='relu'),
    layers.Dense(1),
])
model.compile(optimizer='adam', loss='mae')
history = model.fit(X, y, batch_size=128, epochs=200)

# plotting loss values
history_df = pd.DataFrame(history.history)
history_df.loc[5:, ['loss']].plot();

The SGD animation can be run with different parameter values, such as:

| `learning_rate` | `batch_size` | `num_examples` |
|-----------------|--------------|----------------|
| 0.05            | 32           | 256            |
| 0.05            | 2            | 256            |
| 0.05            | 128          | 256            |
| 0.02            | 32           | 256            |
| 0.2             | 32           | 256            |
| 1.0             | 32           | 256            |
| 0.9             | 4096         | 8192           |
| 0.99            | 4096         | 8192           |

In [None]:
learning_rate = 0.05
batch_size = 32
num_examples = 256

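# `animate_sgd` is not defined in this notebook; it must be provided by the
# environment (for example, a course helper module) before this cell runs.
animate_sgd(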
    learning_rate=learning_rate,
    batch_size=batch_size,
    num_examples=num_examples,
    # You can also change these, if you like
    steps=50, # total training steps (batches seen)
    true_w=3.0, # the slope of the data
    true_b=2.0, # the bias of the data
)

---

# Overfitting and Underfitting

Keras keeps a history of the training and validation loss over the epochs while it trains a model. This section covers how to interpret these learning curves and how to use them to guide model development. In particular, we examine the learning curves for evidence of `underfitting` and `overfitting`.

### Interpreting the Learning Curves
The information in the training data can be of two kinds: `signal` and `noise`. 
- The signal is the part that generalizes, the part that can help the model make predictions from new data.
- The noise is the part that is `only` true of the training data; the noise is all of the random fluctuation that comes from data in the real world, or all of the incidental, non-informative patterns that can't actually help the model make predictions.

We train a model by choosing weights or parameters that minimize the loss on a training set. To assess the model's performance, however, we need to evaluate it on a new set of data, the `validation data`. When training a model, we plot the loss on the training set epoch by epoch, and we add a plot of the validation loss too. These plots are called the `learning curves`. 
![](https://i.imgur.com/tHiVFnM.png)
- The training loss will go down either when the model learns signal or when it learns noise. 
- The validation loss will go down only when the model learns signal.
- When a model learns signal, both curves go down, but when it learns noise, a gap is created between the curves. 
    - The size of the gap tells you how much noise the model has learned.
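
As a rough sketch of reading these curves numerically, the snippet below computes the gap between validation and training loss at each epoch. The loss values are made up; only the dictionary layout (lists under `'loss'` and `'val_loss'`) mirrors what Keras records in `history.history`.

```python
# Hypothetical per-epoch losses, in the same layout as Keras' `history.history`
history = {
    "loss":     [0.50, 0.40, 0.32, 0.26, 0.21, 0.17],
    "val_loss": [0.52, 0.43, 0.37, 0.35, 0.36, 0.38],
}

# Gap between the curves at each epoch (validation loss minus training loss)
gaps = [round(v - t, 2) for t, v in zip(history["loss"], history["val_loss"])]
print(gaps)  # [0.02, 0.03, 0.05, 0.09, 0.15, 0.21]

# Epoch (0-based) where the validation loss is lowest
val = history["val_loss"]
best_epoch = min(range(len(val)), key=lambda i: val[i])
print(best_epoch)  # 3
```

In this made-up run the training loss keeps falling after epoch 3 while the validation loss rises, so the widening gap is the part of the learning that did not generalize.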

![](https://i.imgur.com/eUF6mfo.png)
In practice, a model can usually learn more signal only at the cost of learning more noise. This trade-off indicates that there can be two problems that occur when training a model: `not enough signal` or `too much noise`.
- `Underfitting` the training set is when the loss is not as low as it could be because the model hasn't learned enough signal.
- `Overfitting` the training set is when the loss is not as low as it could be because the model learned too much noise.

`The trick to training deep learning models is finding the best balance between the two.`



### Capacity
A model's `capacity` refers to the size and complexity of the patterns it is able to learn. For neural networks:
- capacity is largely determined by how many neurons the network has and how they are connected together.
- if the network appears to be underfitting, try increasing its capacity. There are two ways to do so:
    - by making it wider (more units in existing layers): wider networks have an easier time learning more linear relationships
    - by making it deeper (adding more layers): deeper networks prefer more nonlinear ones.

```
model = keras.Sequential([
    layers.Dense(16, activation='relu'),
    layers.Dense(1),
])

wider = keras.Sequential([
    layers.Dense(32, activation='relu'),
    layers.Dense(1),
])

deeper = keras.Sequential([
    layers.Dense(16, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(1),
])
```
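
One rough way to compare the three networks is to count their trainable parameters. The helper below is a sketch written for this note (`dense_param_count` is not a Keras function), and it assumes 8 input features, since the models above do not specify an input shape. Parameter count is only a crude proxy for capacity.

```python
def dense_param_count(n_inputs, layer_units):
    """Count weights and biases in a stack of dense layers.

    Each dense layer has (inputs + 1) * units parameters:
    one weight per input per unit, plus one bias per unit.
    """
    total = 0
    for units in layer_units:
        total += (n_inputs + 1) * units
        n_inputs = units  # this layer's outputs feed the next layer
    return total

n_features = 8  # assumed input width, for illustration only

print(dense_param_count(n_features, [16, 1]))      # 161  (model)
print(dense_param_count(n_features, [32, 1]))      # 321  (wider)
print(dense_param_count(n_features, [16, 16, 1]))  # 433  (deeper)
```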

### Early Stopping
When a model is too eagerly learning noise, the validation loss may start to increase during training. To prevent this, we can stop the training as soon as the validation loss no longer seems to be decreasing. Interrupting the training this way is called `early stopping`.

![](https://i.imgur.com/eP0gppr.png)
- Once we detect that the validation loss is starting to rise again, we can reset the weights back to where the minimum occurred. This ensures that the model won't continue to learn noise and overfit the data.
- Training with early stopping also means there is less danger of stopping the training too early, before the network has finished learning signal.
- Besides preventing overfitting from training too long, early stopping can also prevent underfitting from not training long enough: set the number of epochs to some large number and let early stopping end the run.

#### Adding Early Stopping
`Keras` supports early stopping through a `callback`, which is a function that is run at intervals while the network trains. The early stopping callback runs after every epoch. Keras has a variety of [pre-defined callbacks](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks), and you can also [define your own](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/LambdaCallback).

```
from tensorflow.keras.callbacks import EarlyStopping
early_stopping = EarlyStopping(
    min_delta=0.001,     # minimum amount of change to count as an improvement
    patience=20,         # how many epochs to wait before stopping
    restore_best_weights=True,
)
```
These parameters say: `If there hasn't been at least an improvement of 0.001 in the validation loss over the previous 20 epochs, then stop the training and keep the best model you found.`
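
The stopping rule itself can be sketched in plain Python. This is a simplified illustration of the `min_delta` and `patience` logic, not the Keras implementation; the function name `early_stopping_epoch` and the loss values are made up.

```python
def early_stopping_epoch(val_losses, min_delta=0.001, patience=20):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses.

    Simplified sketch of the rule, not the Keras implementation.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:   # improved by more than min_delta
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:           # no improvement for `patience` epochs
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch  # ran out of epochs without stopping

# Hypothetical validation losses
losses = [1.00, 0.80, 0.70, 0.71, 0.72, 0.73, 0.74]
print(early_stopping_epoch(losses, patience=3))  # (5, 2)
```

With `restore_best_weights=True`, the weights kept at the end would be those from the best epoch (epoch 2 in this made-up run) rather than those from the epoch where training stopped.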

### Example - Train a Model with Early Stopping
This example increases the capacity of the network and also adds an early-stopping callback to prevent overfitting.

In [None]:
import pandas as pd
from IPython.display import display

red_wine = pd.read_csv('../input/dl-course-data/red-wine.csv')

# Create training and validation splits
df_train = red_wine.sample(frac=0.7, random_state=0)
df_valid = red_wine.drop(df_train.index)
display(df_train.head(4))

# Scale to [0, 1]
max_ = df_train.max(axis=0)
min_ = df_train.min(axis=0)
df_train = (df_train - min_) / (max_ - min_)
df_valid = (df_valid - min_) / (max_ - min_)

# Split features and target
X_train = df_train.drop('quality', axis=1)
X_valid = df_valid.drop('quality', axis=1)
y_train = df_train['quality']
y_valid = df_valid['quality']

In [None]:
from tensorflow import keras
from tensorflow.keras import layers, callbacks

early_stopping = callbacks.EarlyStopping(
    min_delta=0.001, # minimum amount of change to count as an improvement
    patience=20, # how many epochs to wait before stopping
    restore_best_weights=True,
)

model = keras.Sequential([
    layers.Dense(512, activation='relu', input_shape=[11]),
    layers.Dense(512, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(1),
])

model.compile(optimizer='adam',loss='mae')

In [None]:
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=500,
    callbacks=[early_stopping], # put callbacks in a list
    verbose=0,  # turn off training log
)

history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot();
print("Keras stopped the training well before the full 500 epochs!")
print("Minimum validation loss: {}".format(history_df['val_loss'].min()))

### Exercise: Overfitting and Underfitting
This exercise shows how to improve training outcomes by including an early stopping callback to prevent overfitting.
It uses the `spotify` dataset to predict the popularity of a song based on various audio features, like `'tempo', 'danceability', and 'mode'`.

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GroupShuffleSplit

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import callbacks

spotify = pd.read_csv('../input/dl-course-data/spotify.csv')

X = spotify.copy().dropna()
y = X.pop('track_popularity')
artists = X['track_artist']

features_num = ['danceability', 'energy', 'key', 'loudness', 'mode',
                'speechiness', 'acousticness', 'instrumentalness',
                'liveness', 'valence', 'tempo', 'duration_ms']
features_cat = ['playlist_genre']

preprocessor = make_column_transformer(
    (StandardScaler(), features_num),
    (OneHotEncoder(), features_cat),
)

# We'll do a "grouped" split to keep all of an artist's songs in one
# split or the other. This is to help prevent signal leakage.
def group_split(X, y, group, train_size=0.75):
    splitter = GroupShuffleSplit(train_size=train_size)
    train, test = next(splitter.split(X, y, groups=group))
    return (X.iloc[train], X.iloc[test], y.iloc[train], y.iloc[test])

X_train, X_valid, y_train, y_valid = group_split(X, y, artists)

X_train = preprocessor.fit_transform(X_train)
X_valid = preprocessor.transform(X_valid)
y_train = y_train / 100 # popularity is on a scale 0-100, so this rescales to 0-1.
y_valid = y_valid / 100

input_shape = [X_train.shape[1]]
print("Input shape: {}".format(input_shape))

In [None]:
# Underfitting
model = keras.Sequential([
    layers.Dense(1, input_shape=input_shape),
])
model.compile(
    optimizer='adam',
    loss='mae',
)
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=512,
    epochs=50,
    verbose=0, # suppress output since we'll plot the curves
)
history_df = pd.DataFrame(history.history)
history_df.loc[0:, ['loss', 'val_loss']].plot()
print("Minimum Validation Loss: {:0.4f}".format(history_df['val_loss'].min()));

In [None]:
# Overfitting
model = keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=input_shape),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
])
model.compile(
    optimizer='adam',
    loss='mae',
)
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=512,
    epochs=50,
    verbose=0
)
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot()
print("Minimum Validation Loss: {:0.4f}".format(history_df['val_loss'].min()));

In [None]:
# Early Stopping Callbacks
early_stopping = callbacks.EarlyStopping(
    patience=5,
    min_delta=0.001,
    restore_best_weights=True
)

model = keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=input_shape),
    layers.Dense(64, activation='relu'),    
    layers.Dense(1)
])
model.compile(
    optimizer='adam',
    loss='mae',
)
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=512,
    epochs=50,
    callbacks=[early_stopping],
    verbose=0
)
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot()
print("Minimum Validation Loss: {:0.4f}".format(history_df['val_loss'].min()));

---

# Dropout and Batch Normalization

There are dozens of kinds of layers you might add to a model; see the [Keras docs](https://www.tensorflow.org/api_docs/python/tf/keras/layers/). Some are like dense layers and define connections between neurons, and others can do preprocessing or transformations of other sorts. This section covers two kinds of special layers, which contain no neurons themselves but add functionality that can sometimes benefit a model in various ways.

### Dropout
A `dropout layer` helps to correct overfitting. Overfitting is caused by the network learning spurious patterns in the training data. To break these up, we randomly drop out some fraction of a layer's input units every step of training, making it much harder for the network to learn those spurious patterns. 
![](https://i.imgur.com/a86utxY.gif)
- Dropout can be thought of as creating a kind of ensemble of networks.
- The predictions are no longer made by one big network, but instead by a committee of smaller networks.
- Individuals in the committee tend to make different kinds of mistakes, but be right at the same time, making the committee as a whole better than any individual.
    - same idea as random forests as an ensemble of decision trees

#### Adding Dropout
In `Keras`, the dropout rate argument `rate` defines what percentage of the input units to shut off. Put the `Dropout` layer just before the layer you want the dropout applied to:
```
keras.Sequential([
    # ...
    layers.Dropout(rate=0.3), # apply 30% dropout to the next layer
    layers.Dense(16),
    # ...
])
```
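
What a dropout layer does to its inputs can be sketched in NumPy. This is an illustrative re-implementation, not Keras code: during training it zeroes a random fraction `rate` of the inputs and scales the survivors by `1 / (1 - rate)` so the expected sum is unchanged, and at inference it passes the inputs through untouched. The function name and the all-ones input are assumptions for the demonstration.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of the inputs during
    training and scale the survivors by 1 / (1 - rate); identity otherwise."""
    if not training or rate == 0.0:
        return x
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((4, 10))

dropped = dropout(activations, rate=0.3, rng=rng)
print(dropped)  # every entry is either 0 or 1 / 0.7

unchanged = dropout(activations, rate=0.3, rng=rng, training=False)
print(np.array_equal(unchanged, activations))  # True
```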

### Batch Normalization
`Batch normalization (batchnorm)` helps correct training that is slow or unstable. With neural networks, it's generally a good idea to put all of your data on a common scale using `scikit-learn's` [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) or [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html). The reason is that `SGD` will shift the network weights in proportion to how large an activation the data produces, so features that produce activations of very different sizes can make training unstable. 

If it's good to normalize the data before it goes into the network, maybe also normalizing inside the network would be better! A special kind of layer that can do this is the `batch normalization layer`. A batch normalization layer:
- looks at each batch as it comes in,
- first normalizes the batch with its own mean and standard deviation, and then
- puts the data on a new scale with two trainable rescaling parameters.
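
The training-time computation described in the list above can be sketched in NumPy. This is a simplified illustration: `gamma` and `beta` stand in for the two trainable rescaling parameters, `eps` is a small constant for numerical stability, and the moving averages that the Keras layer keeps for use at inference are left out. The batch values are synthetic.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then rescale with gamma and beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
# A synthetic batch of 256 examples with 3 features on very different scales
batch = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 20.0, 0.5], size=(256, 3))

out = batch_norm_train(batch, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0).round(3))  # approximately 0 for every feature
print(out.std(axis=0).round(3))   # approximately 1 for every feature
```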

`Batchnorm` is most often added as an aid to the optimization process. Models with batchnorm tend to need fewer epochs to complete training. If you are having trouble during training, consider adding `batch normalization`.

#### Adding Batch Normalization
```
layers.Dense(16, activation='relu'),
layers.BatchNormalization(),
```
Or, between a layer and its activation function:
```
layers.Dense(16),
layers.BatchNormalization(),
layers.Activation('relu'),
```

### Example - Using Dropout and Batch Normalization
The `Red Wine` dataset will be used.

In [None]:
import matplotlib.pyplot as plt

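# Note: this style was renamed 'seaborn-v0_8-whitegrid' in Matplotlib 3.6;
# use that name if 'seaborn-whitegrid' is not found.
plt.style.use('seaborn-whitegrid')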
# Set Matplotlib defaults
plt.rc('figure', autolayout=True)
plt.rc('axes', labelweight='bold', labelsize='large',
       titleweight='bold', titlesize=18, titlepad=10)


import pandas as pd
red_wine = pd.read_csv('../input/dl-course-data/red-wine.csv')

# Create training and validation splits
df_train = red_wine.sample(frac=0.7, random_state=0)
df_valid = red_wine.drop(df_train.index)

# Split features and target
X_train = df_train.drop('quality', axis=1)
X_valid = df_valid.drop('quality', axis=1)
y_train = df_train['quality']
y_valid = df_valid['quality']

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(1024, activation='relu', input_shape=[11]),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1024, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1024, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1),
])

model.compile(optimizer='adam', loss='mae')

In [None]:
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=100,
    verbose=0,
)

history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot();

### Exercise: Dropout and Batch Normalization
This exercise shows how batch normalization can successfully train models on difficult datasets.
First, use the `spotify` dataset to predict the popularity of a song based on various audio features, like `'tempo', 'danceability', and 'mode'`.

In [None]:
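import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import callbacks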

spotify = pd.read_csv('spotify.csv')

X = spotify.copy().dropna()
y = X.pop('track_popularity')
artists = X['track_artist']

features_num = ['danceability', 'energy', 'key', 'loudness', 'mode',
                'speechiness', 'acousticness', 'instrumentalness',
                'liveness', 'valence', 'tempo', 'duration_ms']
features_cat = ['playlist_genre']

preprocessor = make_column_transformer(
    (StandardScaler(), features_num),
    (OneHotEncoder(), features_cat),
)

def group_split(X, y, group, train_size=0.75):
    splitter = GroupShuffleSplit(train_size=train_size)
    train, test = next(splitter.split(X, y, groups=group))
    return (X.iloc[train], X.iloc[test], y.iloc[train], y.iloc[test])

X_train, X_valid, y_train, y_valid = group_split(X, y, artists)

X_train = preprocessor.fit_transform(X_train)
X_valid = preprocessor.transform(X_valid)
y_train = y_train / 100
y_valid = y_valid / 100

input_shape = [X_train.shape[1]]
print("Input shape: {}".format(input_shape))

In [None]:
model = keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=input_shape),
    layers.Dropout(0.3),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(1)
])

model.compile(
    optimizer='adam',
    loss='mae',
)
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=512,
    epochs=50,
    verbose=0,
)
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot()
print("Minimum Validation Loss: {:0.4f}".format(history_df['val_loss'].min()))

Load the `Concrete` dataset to explore how batch normalization can fix problems in training. Unlike in the previous examples, the data is not standardized before it goes into the network.

In [None]:
import pandas as pd

concrete = pd.read_csv('concrete.csv')
df = concrete.copy()

df_train = df.sample(frac=0.7, random_state=0)
df_valid = df.drop(df_train.index)

X_train = df_train.drop('CompressiveStrength', axis=1)
X_valid = df_valid.drop('CompressiveStrength', axis=1)
y_train = df_train['CompressiveStrength']
y_valid = df_valid['CompressiveStrength']

input_shape = [X_train.shape[1]]

In [None]:
# Without the BatchNormalization layers, this model cannot be trained on the unscaled data
model = keras.Sequential([
    layers.BatchNormalization(),            # <---------
    layers.Dense(512, activation='relu'),
    layers.BatchNormalization(),            # <---------
    layers.Dense(512, activation='relu'),
    layers.BatchNormalization(),            # <---------
    layers.Dense(512, activation='relu'),
    layers.BatchNormalization(),            # <---------
    layers.Dense(1),
])

model.compile(
    optimizer='sgd',
    loss='mae',
    metrics=['mae'],
)
EPOCHS = 100
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=64,
    epochs=EPOCHS,
    verbose=0,
)

history_df = pd.DataFrame(history.history)
history_df.loc[0:, ['loss', 'val_loss']].plot()
print(("Minimum Validation Loss: {:0.4f}").format(history_df['val_loss'].min()))

---

# Binary Classification

This section applies neural networks to another common machine learning problem: `classification`. The main difference from regression is in the loss function used and in the kind of output the final layer should produce.

### Binary Classification
Classification into one of two classes is a common machine learning problem. In raw data, the classes might be represented by strings like `"Yes" and "No"`, or `"Dog" and "Cat"`. Before using this data, each class needs to be assigned a `class label`: `0` or `1`.

Class labels can be assigned with a mapping:
- `df['Class'] = df['Class'].map({'good': 0, 'bad': 1})`

### Accuracy and Cross-Entropy
`Accuracy` is one of the many metrics in use for measuring success on a classification problem. Accuracy is the ratio of correct predictions to total predictions: `accuracy = number_correct / total`. 
- If a model always predicts correctly, its accuracy score is `1.0`.
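
As a small illustration of the formula, here is a plain-Python sketch. The `accuracy` helper and the label lists are made up for this example.

```python
def accuracy(y_true, y_pred):
    """Ratio of correct predictions to total predictions."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels: 4 of the 5 predictions match the true class.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print(accuracy(y_true, y_pred))
print(accuracy(y_true, y_true))  # a model that is always right scores 1.0
```

Because the numerator is a count, the score can only move in steps of `1 / total`, no matter how slightly the model changes.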

Accuracy cannot be used as a loss function. `SGD` needs a loss function that changes `smoothly`, but `accuracy`, being a ratio of counts, changes in `jumps`. So a substitute, the `cross-entropy` function, is chosen to act as the loss function.
- The loss function defines the objective of the network during training.
- With regression, the goal was to minimize the distance between the expected outcome and the predicted outcome (`MAE` was chosen to measure this distance).
- For classification, what is needed instead is a distance between `probabilities`, and this is what `cross-entropy` provides.
    - `Cross-entropy` is a sort of measure for the distance from one probability distribution to another.
![](https://i.imgur.com/DwVV9bR.png)
The idea is that the network should predict the correct class with probability `1.0`. 
- The further the predicted probability is from `1.0`, the greater the `cross-entropy loss` will be.
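
A minimal sketch of binary cross-entropy in plain Python shows this behaviour. The helper name `binary_cross_entropy` and the clipping constant `eps` are choices made for this illustration, not something taken from the notebook.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# The true class is 1; the loss grows as the predicted probability moves away from 1.0.
for p in (0.9, 0.6, 0.1):
    print(p, round(binary_cross_entropy([1], [p]), 4))
```

Unlike accuracy, this value changes smoothly with the predicted probability, which is what makes it usable as a loss for `SGD`.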

### Making Probabilities with the Sigmoid Function
The `cross-entropy` and `accuracy` functions both require probabilities as inputs, that is, numbers from `0` to `1`. To convert the real-valued outputs produced by a dense layer into probabilities, a new kind of activation function is used, the `sigmoid activation`.
![](https://i.imgur.com/FYbRvJo.png)
To get the final class prediction, a `threshold` probability needs to be defined. Typically this is `0.5`, so that rounding gives the correct class:
- below `0.5` means the class with label `0`, and
- `0.5` or above means the class with label `1`.

A `0.5` threshold is what Keras uses by default with its [accuracy metric](https://www.tensorflow.org/api_docs/python/tf/keras/metrics/BinaryAccuracy).
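
The sigmoid and the thresholding step can be sketched in a few lines of plain Python. The helper names are made up for this illustration, and the `>=` comparison follows the convention stated above; how a library treats a value of exactly `0.5` may differ.

```python
import math

def sigmoid(z):
    """Map a real-valued output to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(z, threshold=0.5):
    """Turn a raw dense-layer output into a class label via the threshold."""
    return 1 if sigmoid(z) >= threshold else 0

# Hypothetical raw outputs from the final dense layer.
for z in (-2.0, 0.0, 3.0):
    print(z, round(sigmoid(z), 3), predict_class(z))
```

Negative raw outputs map to probabilities below `0.5`, positive ones to probabilities above it, and an output of `0` sits exactly on the threshold.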

### Example - Binary Classification
The [Ionosphere](https://archive.ics.uci.edu/ml/datasets/Ionosphere) dataset contains features obtained from radar signals focused on the ionosphere layer of the Earth's atmosphere. The task is to determine whether the signal shows the presence of some object, or just empty air.

In [None]:
import pandas as pd
from IPython.display import display

ion = pd.read_csv('../input/dl-course-data/ion.csv', index_col=0)
display(ion.head())

df = ion.copy()
df['Class'] = df['Class'].map({'good': 0, 'bad': 1})

df_train = df.sample(frac=0.7, random_state=0)
df_valid = df.drop(df_train.index)

max_ = df_train.max(axis=0)
min_ = df_train.min(axis=0)

df_train = (df_train - min_) / (max_ - min_)
df_valid = (df_valid - min_) / (max_ - min_)
df_train.dropna(axis=1, inplace=True) # drop the empty feature in column 2
df_valid.dropna(axis=1, inplace=True)

X_train = df_train.drop('Class', axis=1)
X_valid = df_valid.drop('Class', axis=1)
y_train = df_train['Class']
y_valid = df_valid['Class']

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

# Define the model; the final layer uses a 'sigmoid' activation to output a probability
model = keras.Sequential([
    layers.Dense(4, activation='relu', input_shape=[33]),
    layers.Dense(4, activation='relu'),    
    layers.Dense(1, activation='sigmoid'), # <-------------
])

# Add the cross-entropy loss and accuracy metric to the model with its `compile` method
model.compile(
    optimizer='adam',
    loss='binary_crossentropy', # For two-class problems 
    metrics=['binary_accuracy'],
)

early_stopping = keras.callbacks.EarlyStopping(
    patience=10,
    min_delta=0.001,
    restore_best_weights=True,
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=512,
    epochs=1000,
    callbacks=[early_stopping],
    verbose=0, # hide the output because we have so many epochs
)

In [None]:
history_df = pd.DataFrame(history.history)
# Start the plot at epoch 5
history_df.loc[5:, ['loss', 'val_loss']].plot()
history_df.loc[5:, ['binary_accuracy', 'val_binary_accuracy']].plot()

print(("Best Validation Loss: {:0.4f}" +\
      "\nBest Validation Accuracy: {:0.4f}")\
      .format(history_df['val_loss'].min(), 
              history_df['val_binary_accuracy'].max()))

### Exercise: Binary Classification
The goal is to predict hotel cancellations with a binary classifier, using the `Hotel Cancellations` dataset.

In [None]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

hotel = pd.read_csv('../input/dl-course-data/hotel.csv')

X = hotel.copy()
y = X.pop('is_canceled')

X['arrival_date_month'] = \
    X['arrival_date_month'].map(
        {'January':1, 'February': 2, 'March':3,
         'April':4, 'May':5, 'June':6, 'July':7,
         'August':8, 'September':9, 'October':10,
         'November':11, 'December':12}
    )

features_num = [
    "lead_time", "arrival_date_week_number",
    "arrival_date_day_of_month", "stays_in_weekend_nights",
    "stays_in_week_nights", "adults", "children", "babies",
    "is_repeated_guest", "previous_cancellations",
    "previous_bookings_not_canceled", "required_car_parking_spaces",
    "total_of_special_requests", "adr",
]
features_cat = [
    "hotel", "arrival_date_month", "meal",
    "market_segment", "distribution_channel",
    "reserved_room_type", "deposit_type", "customer_type",
]

transformer_num = make_pipeline(
    SimpleImputer(strategy="constant"), # there are a few missing values
    StandardScaler(),
)
transformer_cat = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
    OneHotEncoder(handle_unknown='ignore'),
)

preprocessor = make_column_transformer(
    (transformer_num, features_num),
    (transformer_cat, features_cat),
)

# stratify - make sure classes are evenly represented across splits
X_train, X_valid, y_train, y_valid = \
    train_test_split(X, y, stratify=y, train_size=0.75)

X_train = preprocessor.fit_transform(X_train)
X_valid = preprocessor.transform(X_valid)

input_shape = [X_train.shape[1]]

Define a model with the architecture shown in the figure.
![](https://i.imgur.com/V04o59Z.png)

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.BatchNormalization(input_shape=input_shape),
    layers.Dense(256, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(256, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['binary_accuracy'])

early_stopping = keras.callbacks.EarlyStopping(
    patience=5,
    min_delta=0.001,
    restore_best_weights=True,
)
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=512,
    epochs=200,
    callbacks=[early_stopping],
    verbose=0
)

history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot(title="Cross-entropy")
history_df.loc[:, ['binary_accuracy', 'val_binary_accuracy']].plot(title="Accuracy");

Why not try one of our Getting Started competitions?
* Classify images with TPUs in [Petals to the Metal](https://www.kaggle.com/c/tpu-getting-started)
* Create art with GANs in [I'm Something of a Painter Myself](https://www.kaggle.com/c/gan-getting-started)
* Classify Tweets in [Real or Not? NLP with Disaster Tweets](https://www.kaggle.com/c/nlp-getting-started)
* Detect contradiction and entailment in [Contradictory, My Dear Watson](https://www.kaggle.com/c/contradictory-my-dear-watson)

---

# Detecting the Higgs Boson With TPUs

[Detecting the Higgs Boson With TPUs](https://www.kaggle.com/minyannaing/detecting-the-higgs-boson-with-tpus/edit)