# TOC

__Chapter 3 - Diving deep into neural networks__

1. [Import](#Import)
1. [Layers – fundamental blocks of neural networks](#Layers–fundamental-blocks-of-neural-networks)
1. [Non-linear activations](#Non-linear-activations)
    1. [Sigmoid](#Sigmoid)
    1. [Tanh](#Tanh)
    1. [ReLU](#ReLU)
    1. [Leaky ReLU](#Leaky-ReLU)
1. [The PyTorch way of building deep learning algorithms](#The-PyTorch-way-of-building-deep-learning-algorithms)

# Import

<a id = 'Import'></a>

In [2]:
# Standard library and settings
import os
import sys
import importlib
import itertools
import warnings; warnings.simplefilter('ignore')
from IPython.display import display, HTML; display(HTML("<style>.container { width:95% !important; }</style>"))

# Data extensions and settings
import numpy as np
np.set_printoptions(threshold = np.inf, suppress = True)
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.options.display.float_format = '{:,.6f}'.format

# pytorch tools
import torch
from torch.autograd import Variable
import torch.nn as nn

# Visualization extensions and settings
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style('whitegrid')


# Layers – fundamental blocks of neural networks


PyTorch uses built-in components called layers to perform the high-level operations of a neural network. There are many different types of layers. The linear layer, for example, applies the following linear transformation:

$$
\textbf{y} = \textbf{W}\textbf{x} + \textbf{b}
$$

Multiple layers can be executed in sequence.

<a id = 'Layers–fundamental-blocks-of-neural-networks'></a>

In [4]:
# this linear layer accepts a tensor of size 10 and outputs a tensor of size 5 after
# applying a linear transformation
myLayer = nn.Linear(in_features = 10, out_features = 5, bias = True)
myLayer


Linear(in_features=10, out_features=5, bias=True)

In [5]:
# feed tensor of random numbers into myLayer
inp = Variable(torch.randn(1,10))
myLayer = nn.Linear(in_features = 10, out_features = 5, bias = True)
myLayer(inp)


tensor([[-0.2251, -0.4737,  0.3719, -0.0200,  0.1534]],
       grad_fn=<AddmmBackward>)

In [7]:
# review weights - a 5 row by 10 column matrix representing the 50 connections between the 10 input
# nodes and 5 output nodes
myLayer.weight


Parameter containing:
tensor([[-0.1058,  0.0475, -0.1087,  0.2467, -0.0248, -0.0428, -0.1703, -0.0919,
         -0.2566,  0.1299],
        [ 0.0930,  0.0178,  0.2726, -0.2643, -0.2069, -0.2342, -0.1533,  0.1485,
          0.3090,  0.2094],
        [-0.2547,  0.1003, -0.0412,  0.0987,  0.2688, -0.0383,  0.0628, -0.2042,
          0.1284,  0.1556],
        [ 0.0162,  0.0529,  0.2653,  0.1618,  0.0902, -0.2695, -0.2497,  0.2101,
          0.2665, -0.2102],
        [-0.0355,  0.2456,  0.0833, -0.0785,  0.0578,  0.1010, -0.1808,  0.2231,
          0.0419, -0.2705]], requires_grad=True)

In [8]:
# review biases - 1 bias unit for each output node
myLayer.bias


Parameter containing:
tensor([-0.0694,  0.1742,  0.1781,  0.2609, -0.1346], requires_grad=True)
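
The weight matrix and bias vector are all a linear layer needs to produce its output. As a minimal sketch (the names `layer`, `x`, `manual` and `builtin` are illustrative, not from the notebook above), the following checks that `nn.Linear` gives the same result as multiplying the input by the transposed weight matrix and adding the bias:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so reruns use the same random values

# illustrative layer and input with the same sizes used above
layer = nn.Linear(in_features=10, out_features=5, bias=True)
x = torch.randn(1, 10)

# nn.Linear stores weight with shape (out_features, in_features),
# so the forward pass for a row-vector input is x @ weight.T + bias
manual = x @ layer.weight.t() + layer.bias
builtin = layer(x)

print(layer.weight.shape)  # torch.Size([5, 10])
print(layer.bias.shape)    # torch.Size([5])
print(torch.allclose(manual, builtin, atol=1e-6))
```

Because inputs are stored as rows, PyTorch applies the transpose of the weight matrix; this is the same transformation as the equation at the top of this section, written for row vectors.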

In [12]:
# create a network with multiple linear layers. NOTE - stacking two linear layers directly is not
# typically done in practice: the composition of linear transformations is itself linear, so it
# cannot capture more complicated non-linear patterns
myLayer1 = nn.Linear(10, 5)
myLayer2 = nn.Linear(5, 2)
myLayer2(myLayer1(inp))

print('Layer 1 weights:\n')
print(myLayer1.weight)

print('\nLayer 2 weights:\n')
print(myLayer2.weight)


Layer 1 weights:

Parameter containing:
tensor([[ 0.2979,  0.1654,  0.2752,  0.1576, -0.1869, -0.0838,  0.2349, -0.0735,
         -0.1603,  0.1928],
        [-0.1303, -0.0924, -0.2520,  0.1261, -0.1366,  0.0719,  0.2128,  0.2838,
         -0.0277,  0.0056],
        [-0.2782,  0.2058,  0.0431, -0.2941,  0.2881, -0.0104, -0.1394,  0.1627,
         -0.0551,  0.1939],
        [-0.3060,  0.0599, -0.1529, -0.1520, -0.1153,  0.0205, -0.1728,  0.1376,
         -0.0044,  0.2036],
        [-0.1926, -0.2671,  0.0307,  0.2874,  0.2832, -0.1891, -0.0860, -0.1008,
          0.2368, -0.0984]], requires_grad=True)

Layer 2 weights:

Parameter containing:
tensor([[-0.2163, -0.1791, -0.4168,  0.0401, -0.2722],
        [-0.0049, -0.2296,  0.3167, -0.4103,  0.1534]], requires_grad=True)
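
To see why stacking linear layers without an activation adds no expressive power, the sketch below (with illustrative names `layer1`, `layer2`, `combined_weight` and `combined_bias`) folds two linear layers into a single equivalent weight matrix and bias, then compares the outputs:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so reruns use the same random values

layer1 = nn.Linear(10, 5)
layer2 = nn.Linear(5, 2)
x = torch.randn(1, 10)

with torch.no_grad():
    stacked = layer2(layer1(x))

    # layer2(layer1(x)) = (x W1^T + b1) W2^T + b2
    #                   = x (W2 W1)^T + (W2 b1 + b2)
    combined_weight = layer2.weight @ layer1.weight              # shape (2, 10)
    combined_bias = layer2.weight @ layer1.bias + layer2.bias    # shape (2,)
    collapsed = x @ combined_weight.t() + combined_bias

print(combined_weight.shape)
print(torch.allclose(stacked, collapsed, atol=1e-6))
```

The two-layer stack behaves like one 10-to-2 linear layer, which is why a non-linear activation is normally placed between linear layers.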


# Non-linear activations



<a id = 'Non-linear-activations'></a>

## Sigmoid

The sigmoid activation function takes the following form:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

It returns a number between 0 and 1. A large negative number returns a value close to zero, and a large positive number returns a value close to 1.

This function is not especially popular in modern implementations of neural networks because it saturates: when the output of the sigmoid function is close to 0 or 1, its gradient is close to zero. Consequently, the learnable parameters of the preceding layers receive gradients close to zero and their weights are barely adjusted. The affected neurons effectively stop learning, which is often described as 'dead' neurons.
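
The saturation behavior can be checked numerically. This is a small sketch with hand-picked illustrative inputs; it uses autograd to compute the sigmoid gradient at each input:

```python
import torch

# illustrative inputs: two extreme values, two moderate values, and zero
x = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0], requires_grad=True)

y = torch.sigmoid(x)
y.sum().backward()  # x.grad now holds d(sigmoid)/dx at each input

print(y)       # outputs lie between 0 and 1
print(x.grad)  # largest at x = 0 (0.25), tiny at the extremes
```

The gradient peaks at 0.25 when the input is zero and shrinks toward zero as the output approaches 0 or 1.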


<a id = 'Sigmoid'></a>

## Tanh

Similar to the sigmoid function, the tanh function squashes real-valued inputs to a narrow range, in this case -1 to 1. It encounters the same saturation issue as the sigmoid function: when the output is close to -1 or 1, the gradient is close to zero. However, it is generally preferred to sigmoid because the output of tanh is centered at zero.
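
As a rough illustration (the input values are arbitrary and chosen to be symmetric around zero), the sketch below compares the average output of tanh and sigmoid and inspects the tanh gradient:

```python
import torch

# illustrative inputs, symmetric around zero
x = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0], requires_grad=True)

y = torch.tanh(x)
y.sum().backward()  # x.grad holds 1 - tanh(x)^2 at each input

print(y)                        # outputs lie between -1 and 1
print(x.grad)                   # 1 at x = 0, close to 0 at the extremes
print(y.mean().item())          # centered at zero for symmetric inputs
print(torch.sigmoid(x.detach()).mean().item())  # sigmoid outputs center on 0.5
```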

<a id = 'Tanh'></a>

## ReLU

ReLU is a very popular activation function. It has the mathematical form:

$$
f(x) = \max(0, x)
$$

ReLU squashes any input that is negative to an output value of zero, and leaves positive numbers as they are. Some pros of ReLU include:

- It helps the optimizer find a good set of weights sooner.
- It speeds up the convergence of stochastic gradient descent.
- It is computationally inexpensive, as it only thresholds the input at zero rather than evaluating exponentials as sigmoid and tanh do.

A disadvantage of ReLU is that a large gradient passing through it during backpropagation can cause a weight update that leaves the neuron with a negative input for every example. Its output and gradient are then always zero, so it stops responding and becomes a 'dead' neuron. This can be controlled by carefully choosing the model learning rate.

In [14]:
# simple illustration of ReLU activation
sample = Variable(torch.Tensor([[1, 2, -1, -1]]))
myRelu = nn.ReLU()
myRelu(sample)


tensor([[1., 2., 0., 0.]])
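
A short sketch of why a ReLU unit can stop learning: the gradient that flows back through ReLU is exactly zero wherever the input was negative. The example reuses the same sample values as above, with `requires_grad=True` added so autograd tracks them.

```python
import torch
import torch.nn as nn

# same values as the sample above, tracked by autograd
x = torch.tensor([[1.0, 2.0, -1.0, -1.0]], requires_grad=True)

relu = nn.ReLU()
relu(x).sum().backward()

# gradient is 1 where the input was positive and 0 where it was negative
print(x.grad)  # tensor([[1., 1., 0., 0.]])
```

If every input to a unit is negative, every entry of this gradient is zero and the weights feeding that unit receive no update.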

<a id = 'ReLU'></a>

## Leaky ReLU

Leaky ReLU attempts to solve the drawback of ReLU: instead of setting negative inputs to 0, it multiplies them by a small slope, so a negative input returns a small negative number. PyTorch's `nn.LeakyReLU` uses a default slope of 0.01. It may provide superior performance, but that is not guaranteed.

<a id = 'Leaky-ReLU'></a>

In [15]:
# simple illustration of Leaky ReLU activation
sample = Variable(torch.Tensor([[1, 2, -1, -1]]))
myRelu = nn.LeakyReLU()
myRelu(sample)


tensor([[ 1.0000,  2.0000, -0.0100, -0.0100]])
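
The slope applied to negative inputs is configurable through the `negative_slope` argument. The sketch below uses 0.1, an arbitrary value chosen only to make the effect easy to see, and shows that negative inputs still receive a non-zero gradient:

```python
import torch
import torch.nn as nn

x = torch.tensor([[1.0, 2.0, -1.0, -1.0]], requires_grad=True)

# negative_slope=0.1 is an illustrative choice; the default is 0.01
leaky = nn.LeakyReLU(negative_slope=0.1)
out = leaky(x)
out.sum().backward()

print(out)     # positive inputs unchanged, negative inputs scaled by 0.1
print(x.grad)  # gradient is 1 for positive inputs and 0.1 for negative inputs
```

Because the gradient never becomes exactly zero, a unit that receives negative inputs can still have its weights updated.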

# The PyTorch way of building deep learning algorithms


All networks in PyTorch are implemented as classes by subclassing a PyTorch class called `nn.Module`. The class should also implement the `__init__` and `forward` methods. Inside the `__init__` method we initialize any layers, and the `forward` method passes the input through those layers and returns the output.

<a id = 'The-PyTorch-way-of-building-deep-learning-algorithms'></a>
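
As a minimal sketch of this pattern, the class below subclasses `nn.Module`, creates its layers in `__init__` and applies them in `forward`. The class name `MyFirstNetwork` and the layer sizes are hypothetical and chosen only to match the sizes used earlier in this notebook.

```python
import torch
import torch.nn as nn


class MyFirstNetwork(nn.Module):
    """Hypothetical two-layer network: linear -> ReLU -> linear."""

    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()  # required so nn.Module can register the layers
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # pass the input through the layers created in __init__
        out = self.layer1(x)
        out = self.relu(out)
        return self.layer2(out)


model = MyFirstNetwork(input_size=10, hidden_size=5, output_size=2)
batch = torch.randn(3, 10)  # a batch of 3 illustrative random inputs
out = model(batch)          # calling the model runs forward()
print(out.shape)            # torch.Size([3, 2])
```

Calling `model(batch)` rather than `model.forward(batch)` is the usual convention, and the layers assigned as attributes in `__init__` are registered automatically, so `model.parameters()` returns their weights and biases.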
