### Deep and wide neural networks

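In a wide and deep network, some or all of the inputs are connected directly to the output layer, bypassing the hidden layers. This lets the model learn both deep patterns (through the hidden layers) and simple rules (through the short path).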

In [2]:
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [3]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Load the dataset: California housing

In [4]:
housing = fetch_california_housing()

In [5]:
X_train_full, X_test, y_train_full, y_test = train_test_split(housing.data, housing.target)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full)
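`StandardScaler` is imported above but never applied. Plain SGD tends to be unstable on unscaled features, so scaling the splits before training is usually worthwhile. Below is a minimal sketch of that step. It runs on synthetic stand-in data so it is self-contained; the `*_demo`, `X_tr`, `X_va` and `X_te` names are hypothetical, and in this notebook the same calls would be applied to `X_train`, `X_valid` and `X_test`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for housing.data / housing.target (assumption, demo only)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1000, size=(400, 8))
y_demo = rng.uniform(0, 5, size=400)

# Same two-step split as in the cell above
X_tr_full, X_te, y_tr_full, y_te = train_test_split(X_demo, y_demo, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_tr_full, y_tr_full, random_state=42)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # fit the mean/std on the training set only
X_va_scaled = scaler.transform(X_va)      # reuse the training statistics
X_te_scaled = scaler.transform(X_te)
```

Fitting the scaler on the training set only keeps validation and test statistics from leaking into training.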

Define the model

In [33]:
X_train.shape[1:]

(8,)

In [34]:
input_layer = keras.layers.Input(shape=(8,))

In [35]:
hidden1 = keras.layers.Dense(units=30,activation='relu')(input_layer)

In [36]:
hidden2 = keras.layers.Dense(units=30,activation='relu')(hidden1)

In [37]:
concat = keras.layers.Concatenate()([input_layer,hidden2])

In [38]:
output_layer = keras.layers.Dense(units=1)(concat)

In [69]:
model = keras.models.Model(inputs=[input_layer],outputs=[output_layer])

In [70]:
# Regression task: accuracy is not meaningful here, so track mean absolute error instead
model.compile(loss='mse', optimizer='sgd', metrics=['mae'])

### Using Callbacks

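A `ModelCheckpoint` callback with `save_best_only=True` can serve as a simple form of early stopping: the model is saved only when its validation performance is the best so far, and the best model is reloaded after training.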

<code>checkpoint_cb = keras.callbacks.ModelCheckpoint("my_keras_model.h5", save_best_only=True)<br>
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid), callbacks=[checkpoint_cb])<br>
model = keras.models.load_model("my_keras_model.h5")</code>
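The checkpoint approach still trains for every epoch. Keras also provides an `EarlyStopping` callback that interrupts training once the validation loss stops improving. Below is a minimal sketch on synthetic data. The `demo_model`, `X_tr`, `X_val` names and the `patience=5` value are illustrative assumptions, not part of the notebook above.

```python
import numpy as np
from tensorflow import keras

# Tiny synthetic regression problem (hypothetical stand-in for the housing splits)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8)).astype("float32")
y = X.sum(axis=1, keepdims=True).astype("float32")
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

demo_model = keras.models.Sequential([
    keras.layers.Input(shape=(8,)),
    keras.layers.Dense(30, activation="relu"),
    keras.layers.Dense(1),
])
demo_model.compile(loss="mse", optimizer="sgd")

# Stop when val_loss has not improved for 5 epochs, then restore the best weights
early_stopping_cb = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
demo_history = demo_model.fit(
    X_tr, y_tr,
    epochs=50,
    validation_data=(X_val, y_val),
    callbacks=[early_stopping_cb],
    verbose=0,
)
```

Both callbacks can be passed together in the `callbacks` list if a saved copy of the best model is also wanted.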

### It is possible to use RandomizedSearchCV on deep neural nets

# Hierarchical Learning in DNNs

1. In theory, a single hidden layer can model even very complex functions, provided it has enough neurons, but deep networks are far more parameter-efficient.
2. Lower hidden layers (closest to the input) model low-level structures (lines, segments).
3. Intermediate hidden layers model intermediate-level structures (squares, circles).
4. The highest hidden layers model high-level structures (faces).

Reusing the pretrained weights of a DNN on another dataset can help training converge much faster, because the lower layers do not have to be learned from a random initialization.

This is called transfer learning.
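A rough sketch of weight reuse in Keras follows. Here `model_a` is a hypothetical stand-in for a pretrained model (in practice it would be loaded from disk), and the layer sizes are assumptions chosen to match the earlier cells.

```python
from tensorflow import keras

# Hypothetical "pretrained" model for task A (normally loaded with keras.models.load_model)
model_a = keras.models.Sequential([
    keras.layers.Input(shape=(8,)),
    keras.layers.Dense(30, activation="relu"),
    keras.layers.Dense(30, activation="relu"),
    keras.layers.Dense(1),
])

# Reuse every layer except the task-specific output layer.
# The layers are shared, so training model_b would also change model_a.
model_b = keras.models.Sequential(model_a.layers[:-1])
model_b.add(keras.layers.Dense(1))  # fresh output layer for task B

# Freeze the reused layers so that early training only adjusts the new output layer
for layer in model_b.layers[:-1]:
    layer.trainable = False

# Compile after changing `trainable` so the change takes effect
model_b.compile(loss="mse", optimizer="sgd")
```

After a few epochs the reused layers can be unfrozen (followed by another `compile`) to fine-tune them on the new task.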

# Number of Neurons per hidden layer

The input layer must have as many units as there are input features.

The output layer should have 'n' units for 'n' classes in multiclass classification.

<b>Try gradually increasing the number of units until the model starts to overfit.</b>

# Best Practice

1. Start with a model with more hidden layers and more units than needed
2. Use early stopping to prevent it from overfitting.
3. Use Dropout Regularization to improve generalization
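The first and third points above can be sketched as a single model builder. The function name `build_oversized_model`, the layer count, the unit count and the dropout rate of 0.2 are all illustrative assumptions. Early stopping is supplied through a callback at fit time and is not shown here.

```python
from tensorflow import keras

def build_oversized_model(n_inputs=8, n_hidden=4, n_units=100, dropout_rate=0.2):
    """Build a deliberately large regression network, regularized with dropout."""
    model = keras.models.Sequential()
    model.add(keras.layers.Input(shape=(n_inputs,)))
    for _ in range(n_hidden):
        model.add(keras.layers.Dense(n_units, activation="relu"))
        model.add(keras.layers.Dropout(dropout_rate))  # randomly drops units during training only
    model.add(keras.layers.Dense(1))
    model.compile(loss="mse", optimizer="sgd")
    return model

big_model = build_oversized_model()
```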

## Learning Rate, Batch Size and hyperparameters

### Learning Rate or lr:

lr: Start with a learning rate large enough to make training diverge, then divide it by 3 repeatedly until training starts to converge.<br>
That value will be close to the maximum usable lr.<br>
The optimal lr is typically about half of the maximum lr.
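The search procedure can be sketched in plain Python. To keep it self-contained, the "training run" here is gradient descent on the toy loss f(w) = w**2. This is an assumption made for illustration; with a real network, `diverges` would train the model for a few epochs and check whether the loss blows up.

```python
def diverges(lr, steps=50):
    """Run gradient descent on the toy loss f(w) = w**2; True if it fails to shrink."""
    w = 1.0
    for _ in range(steps):
        w -= lr * 2.0 * w      # gradient of w**2 is 2*w
        if abs(w) > 1e6:       # blew up: stop early
            return True
    return abs(w) >= 1.0       # no progress from the starting point w = 1.0

def find_learning_rates(start_lr):
    """Divide a diverging lr by 3 until training converges, then halve the result."""
    if not diverges(start_lr):
        raise ValueError("start_lr must be large enough to make training diverge")
    lr = start_lr
    while diverges(lr):
        lr /= 3.0
    max_lr = lr                # roughly the largest lr that still converges
    return max_lr, max_lr / 2.0

max_lr, chosen_lr = find_learning_rates(start_lr=10.0)
```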

### Batch Size

Batch size affects both model performance and training time.

A small batch makes each training iteration fast.<br>
A large batch gives a more precise estimate of the gradients.
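The precision claim can be checked with a small simulation. It is only a sketch: the per-example gradients are simulated as a true gradient of 1.0 plus Gaussian noise, which is an assumption, and `batch_gradient_spread` is a hypothetical helper.

```python
import random
import statistics

random.seed(0)

# Simulated per-example gradients: true gradient 1.0 plus unit Gaussian noise (assumption)
per_example_grads = [1.0 + random.gauss(0.0, 1.0) for _ in range(10_000)]

def batch_gradient_spread(batch_size, n_batches=500):
    """Standard deviation of the mini-batch gradient estimate across many batches."""
    estimates = [
        statistics.fmean(random.sample(per_example_grads, batch_size))
        for _ in range(n_batches)
    ]
    return statistics.stdev(estimates)

spread_small = batch_gradient_spread(8)    # noisier estimate
spread_large = batch_gradient_spread(128)  # tighter estimate
```

The spread shrinks roughly with the square root of the batch size, so the gain in precision from larger batches comes with diminishing returns.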

As a rule of thumb, the batch size:

1. Should usually be no larger than 32
2. Should be larger than 10, so that optimized matrix multiplication can be used

If you use Batch Normalization, the batch size should not be less than 20.

### Optimizers

Optimizers other than plain gradient descent (for example momentum, RMSProp or Adam) usually converge faster.

### Activation Function

ReLU is a good default choice for all hidden layers.

# Yoshua Bengio's 2012 paper 

# Practical Recommendations for Gradient-Based Training of Deep Architectures