# Introduction to Neural Networks:

## Author:
### [Dr. Rahul Remanan](https://www.linkedin.com/in/rahulremanan)

### CEO, [Moad Computer](https://www.moad.computer)



This is a hands-on workshop notebook on deep learning using Python 3. In this notebook, we will learn how to implement a neural network from scratch using NumPy. Once we have implemented this network, we will visualize the predictions it generates and compare them, in the form of classification boundaries, with those of a logistic regression model. This workshop aims to provide an intuitive understanding of neural networks.

In practical code development, there is seldom a use case for building a neural network from scratch. Neural networks in the real world are typically implemented using a deep-learning framework such as TensorFlow. However, building a neural network with minimal dependencies helps one understand how neural networks work, and this understanding is essential to designing effective neural network models. Towards the end of the session, we will also use the TensorFlow deep-learning library to build a neural network, to illustrate the importance of building a neural network using a deep-learning framework.

### Architecture of the basic XOR gate neural network:

![Artificial neural network architecture](https://github.com/rahulremanan/python_tutorial/raw/master/Fundamentals_of_deep-learning/media/Artificial_neural_network.png)

### XOR gate problem and neural networks -- Background:

[The XOR gate is an interesting problem in neural networks](http://www.ece.utep.edu/research/webfuzzy/docs/kk-thesis/kk-thesis-html/node19.html). [Marvin Minsky](https://en.wikipedia.org/wiki/Marvin_Minsky) and [Seymour Papert](https://en.wikipedia.org/wiki/Seymour_Papert), in their book ['Perceptrons' (1969)](https://en.wikipedia.org/wiki/Perceptrons_(book)), showed that the XOR gate cannot be solved using a two-layer perceptron, since the XOR problem is not linearly separable. This conclusion led to a significantly reduced interest in [Frank Rosenblatt's](https://en.wikipedia.org/wiki/Frank_Rosenblatt) perceptrons as a mechanism for building artificial intelligence applications.

Some of the earliest work in AI used networks or circuits of connected units to simulate intelligent behavior. This kind of work is called "connectionism". [After the publication of 'Perceptrons', interest in connectionism declined significantly](https://en.wikipedia.org/wiki/AI_winter#The_abandonment_of_connectionism_in_1969), until it was renewed following the work of [John Hopfield](https://en.wikipedia.org/wiki/John_Hopfield) and [David Rumelhart](https://en.wikipedia.org/wiki/David_Rumelhart).

The assertions in the book 'Perceptrons' were made in spite of Minsky's thorough knowledge that the powerful perceptrons have multiple layers and that Rosenblatt's basic feed-forward perceptrons have three layers. In the book, to deceive unsuspecting readers, Minsky defined a perceptron as a two-layer machine that can handle only linearly separable problems and therefore cannot, for example, solve the exclusive-OR problem. [The Minsky-Papert collaboration is now believed by some knowledgeable scientists to have been a political maneuver and a hatchet job for contract funding](http://csis.pace.edu/~ctappert/srd2011/rosenblatt-contributions.htm). This strong, one-dimensional and misplaced criticism of perceptrons essentially halted work on practical, powerful artificial intelligence systems based on neural networks for nearly a decade.

Part 1 of this notebook explains how to build a very basic neural network in NumPy. This perceptron-like neural network is trained to predict the output of an [XOR gate](https://en.wikipedia.org/wiki/XOR_gate).

![CMOS XOR Gate](https://github.com/rahulremanan/python_tutorial/raw/master/Fundamentals_of_deep-learning/media/CMOS_XOR_Gate.png)

#### XOR gate table:

![XOR Gate Table](https://github.com/rahulremanan/python_tutorial/raw/master/Fundamentals_of_deep-learning/media/XOR_Gate_Table.png)

#### Image below shows an example of a linearly separable dataset:

![Linearly separable points](https://github.com/rahulremanan/python_tutorial/raw/master/Fundamentals_of_deep-learning/media/linearly_spearable_points.gif)

#### Image below shows the XOR gate problem and no linear separation:

![XOR problem](https://github.com/rahulremanan/python_tutorial/raw/master/Fundamentals_of_deep-learning/media/XOR_gate.gif)


# Mathematical intuition of machine learning

Consider an input matrix with the entire training data, represented as set **_X_** and the corresponding labels, represented as set **_Y_**. 

An ideal machine learning model can be defined as a special type of [surjective function](https://en.wikipedia.org/wiki/Surjective_function) that maps the set **_X_**, which contains all the input elements **_x_**, to the set **_Y_**, which contains all the label elements **_y_**.

$${f: X \rightarrow Y}$$
$$\text{where}$$
$${\forall y \in Y,\hspace{1em}\exists x \in X\hspace{1em}\mid\hspace{1em}f(x) = y}$$

$$\text{For all } y \in Y,\text{ there exists an } x \in X \text{ such that: } f(x) = y.$$
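
As a small illustration of the surjectivity condition, the sketch below treats the XOR truth table as a finite mapping and checks whether every label is produced by at least one input. The helper `is_surjective` and the variable names are hypothetical and are not used elsewhere in this notebook.

```python
def is_surjective(mapping, codomain):
    """Return True if every element of the codomain is the image of some input."""
    return set(codomain) <= set(mapping.values())

# XOR truth table as a mapping from inputs (x) to labels (y)
xor_mapping = {(0, 0): 0, (1, 1): 0, (1, 0): 1, (0, 1): 1}
label_set = {0, 1}

print(is_surjective(xor_mapping, label_set))

# A constant mapping never produces the label 1, so it is not surjective onto {0, 1}
constant_mapping = {inputs: 0 for inputs in xor_mapping}
print(is_surjective(constant_mapping, label_set))
```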

An important consequence of the surjective-function-based definition of machine learning is that machine learning models are not true universal function approximators, since the universe of all mathematical operations cannot be expressed as a surjective function, due to [Cantor's paradox](https://en.wikipedia.org/wiki/Cantor%27s_paradox).

Because machine learning models are highly effective at approximating a variety of practical problems, they are understandably mischaracterized as universal function approximators.

From a practical standpoint, however, machine learning models do not have to satisfy the criteria of a true universal function approximator to be useful.

Since the most commonly encountered phenomena in the universe can be treated as a small subset of the universe of all mathematical operations, machine learning models can approximate these phenomena very successfully.

Therefore, machine learning models are highly effective quasi- or pseudo-universal function approximators, capable of solving most, but not all, problems in the universe of all mathematical operations.


# [Dot product](https://en.wikipedia.org/wiki/Dot_product)

$$
{\begin{bmatrix}
a_{11} & a_{12} & ... & a_{1n} \\
a_{21} & a_{22} & ... & a_{2n} \\
... \\
a_{m1} & a_{m2} & ... & a_{mn} \\
\end{bmatrix}
}{.}
\begin{bmatrix}
b_{11} & b_{12} & ... & b_{1p} \\
b_{21} & b_{22} & ... & b_{2p} \\
... \\
b_{n1} & b_{n2} & ... & b_{np} \\
\end{bmatrix}
=\\
$$

$$
\begin{bmatrix}
\Sigma_{i=1}^na_{1i} \times b_{i1} & \Sigma_{i=1}^na_{1i} \times b_{i2} & ... & \Sigma_{i=1}^na_{1i} \times b_{ip} \\
\Sigma_{i=1}^na_{2i} \times b_{i1} & \Sigma_{i=1}^na_{2i} \times b_{i2} & ... & \Sigma_{i=1}^na_{2i} \times b_{ip} \\
... \\
\Sigma_{i=1}^na_{mi} \times b_{i1} & \Sigma_{i=1}^na_{mi} \times b_{i2} & ... & \Sigma_{i=1}^na_{mi} \times b_{ip} \\
\end{bmatrix}
$$
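
As a minimal sketch of the formula above, the following compares `np.dot` with an explicit sum over the shared index. The helper `matmul_loops` and the matrices `a_demo` and `b_demo` are illustrative names, not part of the workshop code.

```python
import numpy as np

def matmul_loops(a, b):
    """Multiply an (m x n) matrix by an (n x p) matrix using explicit sums."""
    m, n = a.shape
    n_b, p = b.shape
    assert n == n_b, "inner dimensions must match"
    out = np.zeros((m, p))
    for row in range(m):
        for col in range(p):
            # Sum over the shared index i, as in the formula above
            out[row, col] = sum(a[row, i] * b[i, col] for i in range(n))
    return out

a_demo = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])   # shape (2, 3)
b_demo = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [2.0, 2.0]])        # shape (3, 2)

product_demo = np.dot(a_demo, b_demo)  # shape (2, 2)
print(product_demo)
print(np.allclose(product_demo, matmul_loops(a_demo, b_demo)))
```

The result has as many rows as the first matrix and as many columns as the second, which is why the number of columns of the first matrix must equal the number of rows of the second.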

# [Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices))

$$
\begin{bmatrix}
a_{11} & a_{12} & ... & a_{1n} \\
a_{21} & a_{22} & ... & a_{2n} \\
... \\
a_{m1} & a_{m2} & ... & a_{mn} \\
\end{bmatrix}
⊙
\begin{bmatrix}
b_{11} & b_{12} & ... & b_{1n} \\
b_{21} & b_{22} & ... & b_{2n} \\
... \\
b_{m1} & b_{m2} & ... & b_{mn} \\
\end{bmatrix}
=\\
$$

$$
\begin{bmatrix}
(a_{11})(b_{11}) & (a_{12})(b_{12}) & ... & (a_{1n})(b_{1n}) \\
(a_{21})(b_{21}) & (a_{22})(b_{22}) & ... & (a_{2n})(b_{2n}) \\
... \\
(a_{m1})(b_{m1}) & (a_{m2})(b_{m2}) & ... & (a_{mn})(b_{mn}) \\
\end{bmatrix}
$$
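
In NumPy, the Hadamard product of two equally shaped arrays is the `*` operator, while the matrix product uses `@` or `np.dot`. The short sketch below contrasts the two; the array names are illustrative.

```python
import numpy as np

a_had = np.array([[1, 2],
                  [3, 4]])
b_had = np.array([[10, 20],
                  [30, 40]])

hadamard = a_had * b_had        # element-wise product, same as np.multiply
matrix_product = a_had @ b_had  # matrix product, shown for contrast

print(hadamard)
print(matrix_product)
```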

# [Euclidean / Cartesian norm](https://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm)

$$\|(a_1, a_2, ... ,a_n ) \|=\sqrt{a_1^2+a_2^2+..+a_n^2}$$
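
A quick check of the norm formula, assuming an illustrative vector `v_demo`: the square root of the sum of squares agrees with `np.linalg.norm`.

```python
import numpy as np

v_demo = np.array([3.0, 4.0])

manual_norm = np.sqrt(np.sum(v_demo ** 2))  # direct use of the formula above
library_norm = np.linalg.norm(v_demo)       # NumPy's built-in Euclidean norm

print(manual_norm, library_norm)
```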

## Part 01a -- Simple neural network as XOR gate using sigmoid activation function:


The XOR gate neural network implementation uses a two-layer perceptron with a sigmoid activation function. This portion of the notebook is a modified fork of the [neural network implementation in numpy by Milo Harper](https://github.com/miloharper/simple-neural-network).

### Import the dependent libraries -- numpy and matplotlib:


In [None]:
import math, random, numpy as np, matplotlib.pyplot as plt

### Create [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function):

- The sigmoid function takes two input arguments: x and a boolean argument called 'derivative'
- When the boolean argument is set to true, the function returns the derivative of the sigmoid instead; in that case x is expected to be an output of the sigmoid
- The derivative is required when calculating the error or performing back-propagation
- The sigmoid function is applied at every single neuron
- The sigmoid function feeds the data forward by squashing the numeric matrices into values between 0 and 1, which can be read as probabilities

**To implement the [sigmoid activation function using numpy](https://stackoverflow.com/questions/3985619/how-to-calculate-a-logistic-sigmoid-function-in-python), we use the mathematical formula:**

$$
  F(x) = \frac{1}{1+e^{-x}}
$$

### [Backpropagation](https://en.wikipedia.org/wiki/Backpropagation):

- Method for improving the network by adjusting its weights to reduce the prediction error
- [Mathematically we need to compute the derivative of the activation function](https://www.coursera.org/learn/neural-networks-deep-learning/lecture/6dDj7/backpropagation-intuition-optional)

#### If sigmoid function can be expressed as follows:

$g_{sigmoid}(z) = \frac{1}{1+e^{-z}}$

#### Then, the first [derivative](https://en.wikipedia.org/wiki/Derivative) of this function can be expressed as:

$g'_{sigmoid}(z) = g_{sigmoid}(z)(1-g_{sigmoid}(z))$
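
The identity can be checked numerically. The sketch below, which uses a hypothetical helper `sigmoid_fn` so that it stays independent of the notebook state, compares $g(z)(1-g(z))$ with the direct quotient form $\frac{e^{-z}}{(1+e^{-z})^2}$ on a few sample points.

```python
import numpy as np

def sigmoid_fn(z):
    return 1.0 / (1.0 + np.exp(-z))

z_values = np.linspace(-4.0, 4.0, 9)   # -4, -3, ..., 4

g = sigmoid_fn(z_values)
derivative_via_output = g * (1.0 - g)                                  # g(z)(1 - g(z))
derivative_direct = np.exp(-z_values) / (1.0 + np.exp(-z_values)) ** 2  # quotient form

print(np.allclose(derivative_via_output, derivative_direct))
print(derivative_via_output[4])  # derivative at z = 0, where g(0) = 0.5
```

The first form only needs the sigmoid output, which is why back-propagation can reuse the activations computed during the forward pass.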

### Forward pass and backpropagation functions using sigmoid activation:

### Implementing sigmoid function using math library in python:

In [None]:
x  = -1.2
y = 1 / (1 + math.exp(-x))
print (y)

In [None]:
y = 1 / (1 + np.exp(-x))
print (y)

In [None]:
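def sigmoid(x, derivative=False):
    '''
    Parameters:
      x: input; when derivative=True, x must already be a sigmoid output, i.e. x = sigmoid(z)
      derivative: boolean to specify if the derivative of the function should be computed
    Returns:
      sigmoid(x), or x * (1 - x) (the derivative expressed via the sigmoid output) when derivative=True
    '''
    if derivative:
        return (x * (1 - x))
    return (1 / (1 + np.exp(-x)))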

In [None]:
def ReLU(x, derivative=False):
  if derivative:
      return np.where(x < 0, 0, 1)
  x_relu = np.maximum(x, 0)
  return x_relu

In [None]:
sigmoid(-1.2, derivative=False)

In [None]:
x = -1.2
y_d = (1 / (1 + np.exp(-x))) * (1 - (1 / (1 + np.exp(-x))))

In [None]:
y_d

In [None]:
sigmoid(0.23147521650098238, derivative=True)

### Plotting sigmoid activation function:



In [None]:
xmin= -10
xmax = 10
ymin = -0.1
ymax = 1.1
step_size = 0.01

x = list(np.arange(xmin, xmax, step_size))
y = []
for i in x:
  y_i = sigmoid(i)
  y.append(y_i)


axis = [xmin, xmax, ymin, ymax]
plt.axhline(y=0.5, color='C2', alpha=0.5)
plt.axvline(x=0, color='C2', alpha=0.5)
plt.axis(axis)
plt.plot(x, y, linewidth=2.0)

### Create an input data matrix as numpy array:
- Matrix with n number of dimensions

## Defining the input matrix
$
\begin{bmatrix}
0 & 0 \\
1 & 1 \\
1 & 0 \\
0 & 1
\end{bmatrix}
$

In [None]:
x = np.asarray([[0, 0],
                [1, 1],
                [1, 0],
                [0, 1]])

In [None]:
print('Shape of the input matrix: ', x.shape)

In [None]:
print('Number of rows: ', x.shape[0], 'Number of columns: ', x.shape[1])

## [For loop in Python](https://wiki.python.org/moin/ForLoop):

In [None]:
x_ = (1, 2, 3, 4)

In [None]:
len(x_)

In [None]:
for i in range(len(x_)):
    print ('This is the {} element in the tuple'.format(i))
    print ('The value is: {}'.format(x_[i]))

### Using ```enumerate```

In [None]:
for idx, i  in enumerate(x_):
    print (f'This is the {idx} element in the tuple')
    print (f'The value is: {i}')

## Define the output data matrix as numpy array:

$$
\begin{bmatrix}
0 \\
0 \\
1 \\
1
\end{bmatrix}
$$

In [None]:
y = np.asarray([[0],
                [0],
                [1],
                [1]])

In [None]:
y.shape

In [None]:
plt.scatter(x[:,0], x[:,1], s=180, c=y, cmap=plt.colormaps.get_cmap('Spectral'))
for i, (i_x, i_y) in enumerate(x):
    i_z = y[i]
    p_x, p_y = i_x, i_y
    if i_x == 1:
        p_x = i_x - 0.35
    plt.text(p_x + 0.02, p_y - 0.02, f'({i_x}, {i_y} ► {i_z})', fontsize=18)

## Define the neural network:
$$
f_n(\begin{bmatrix}
0 & 0 \\
1 & 1 \\
1 & 0 \\
0 & 1
\end{bmatrix})
\rightarrow
\begin{bmatrix}
0 \\
0 \\
1 \\
1
\end{bmatrix}
$$

### Create a random number seed:

- Random number seeding is useful for producing reproducible results
- With a fixed seed, the random number generator produces a deterministic sequence of outputs
- Seeding is unsuitable for experiments that require true random sampling

In [None]:
seed = 1
np.random.seed(seed)
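
To illustrate the effect of seeding without touching the global state set above, the sketch below uses NumPy's `np.random.default_rng` generator API as a stand-in for `np.random.seed`. The generator names are illustrative: two generators created with the same seed produce identical draws, while a different seed produces a different sequence.

```python
import numpy as np

rng_a = np.random.default_rng(1)
rng_b = np.random.default_rng(1)
rng_c = np.random.default_rng(2)

draw_a = rng_a.random(3)
draw_b = rng_b.random(3)
draw_c = rng_c.random(3)

print(np.array_equal(draw_a, draw_b))  # same seed, same sequence
print(np.array_equal(draw_a, draw_c))  # different seed, different sequence
```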

### Create a synapse matrix:

- The synapses are the weight matrices that connect one layer to the next
- For the first synapse, a weights matrix of shape input_shape_1 x hidden_layer_size is created
- For the second synapse, a weights matrix of shape hidden_layer_size x output_dim is created
- This step also introduces the first hyper-parameter in neural network tuning, called 'bias_val', which is the bias value for the synaptic function; it is subtracted from the random values, so with bias_val = 1 the initial weights lie in the interval [-1, 1)

In [None]:
bias_val = 1

output_dim = 1

input_shape_1 = x.shape[1]
input_shape_2 = x.shape[0]

hidden_layer_size = 4 # 3 # 2 # 5 #

synapse_0 = 2*np.random.random((input_shape_1, hidden_layer_size)) - bias_val
synapse_1 = 2*np.random.random((hidden_layer_size, output_dim)) - bias_val

loss_col = []

In [None]:
print (synapse_0.shape)

## Define the first synapse (synapse 0):

>
$$
Eg.\\
$$
$$
Synapse_0
→
\begin{bmatrix}
-0.16595599 & 0.44064899 & -0.99977125 & -0.39533485\\
-0.70648822 & -0.81532281 & -0.62747958 & -0.30887855
\end{bmatrix}
$$

In [None]:
synapse_0

In [None]:
print (synapse_1.shape)

## Define the second synapse (synapse 1):

>
$$
Eg.\\
$$
$$
Synapse_1
→
\begin{bmatrix}
-0.20646505\\
0.07763347\\
-0.16161097\\
0.370439
\end{bmatrix}
$$

In [None]:
print (synapse_1)

## **Implement a single forward pass of the XOR input table**

### Create the input layer (layer 0)

$$
layer_0 = \begin{bmatrix}
0 & 0 \\
1 & 1 \\
1 & 0 \\
0 & 1
\end{bmatrix}
$$

In [None]:
layer_0 = x

### Compute the dot product between layer 0 and synapse 0:
$$
{
Eg.\\
\begin{bmatrix}
0 & 0 \\
1 & 1 \\
1 & 0 \\
0 & 1
\end{bmatrix}
}{.}{
\begin{bmatrix}
-0.16595599 & 0.44064899 & -0.99977125 & -0.39533485\\
-0.70648822 & -0.81532281 & -0.62747958 & -0.30887855
\end{bmatrix}
}
{=}
{\\
\begin{bmatrix}
0 & 0 & 0 & 0 \\
-0.87244421 & -0.37467382 & -1.62725083 & -0.7042134 \\
-0.16595599  & 0.44064899 & -0.99977125 & -0.39533485 \\
-0.70648822 & -0.81532281 & -0.62747958 & -0.30887855 \\
\end{bmatrix}
}

$$

In [None]:
dot_product_0 = np.dot(layer_0, synapse_0)

In [None]:
print(dot_product_0)

### Apply bias value for dot product 0:

$$
Eg. \\
bias {\space} value_0 = 1 \\
\begin{bmatrix}
0 & 0 & 0 & 0 \\
-0.87244421  & -0.37467382 & -1.62725083 & -0.7042134 \\
-0.16595599  & 0.44064899  & -0.99977125 & -0.39533485 \\
-0.70648822  & -0.81532281 & -0.62747958 & -0.30887855 \\
\end{bmatrix}
- bias {\space} value_0 =\\
$$

$$
\begin{bmatrix}
-1.          & -1.         & -1.         & -1.        \\
 -1.87244421 & -1.37467382 & -2.62725083 & -1.7042134 \\
 -1.16595599 & -0.55935101 & -1.99977125 & -1.39533485\\
 -1.70648822 & -1.81532281 & -1.62747958 & -1.30887855 \\
\end{bmatrix}
$$

In [None]:
bias_val_0 = 1
dot_product_0 = dot_product_0 - bias_val_0

In [None]:
print(dot_product_0)

### Create layer 1 by applying the activation function to dot product 0:
$$
g_{sigmoid}(z) = \frac{1}{1+e^{-z}}\\
$$
$$
Eg.\\
g_{sigmoid}(
\begin{bmatrix}
-1.          & -1.         & -1.         & -1.        \\
-1.87244421 & -1.37467382 & -2.62725083 & -1.7042134 \\
-1.16595599 & -0.55935101 & -1.99977125 & -1.39533485 \\
-1.70648822 & -1.81532281 & -1.62747958 & -1.30887855 \\
\end{bmatrix}) = \\
$$
$$
\begin{bmatrix}
0.26894142 & 0.26894142 & 0.26894142 & 0.26894142 \\
0.13325916 & 0.20186577 & 0.06740506 & 0.15391577 \\
0.23758674 & 0.36369764 & 0.11922694 & 0.19855744 \\
0.15361976 & 0.13999605 & 0.16417593 & 0.21267456 \\
\end{bmatrix}
$$

In [None]:
layer_1 = sigmoid(dot_product_0)

In [None]:
layer_1.shape

In [None]:
print(layer_1)

### Compute dot product between layer 1 and synapse 1
$$
{Eg.}\\
{
\begin{bmatrix}
0.26894142 & 0.26894142 & 0.26894142 & 0.26894142 \\
0.13325916 & 0.20186577 & 0.06740506 & 0.15391577 \\
0.23758674 & 0.36369764 & 0.11922694 & 0.19855744 \\
0.15361976 & 0.13999605 & 0.16417593 & 0.21267456 \\
\end{bmatrix}
}{.}
{
\begin{bmatrix}
-0.20646505\\
0.07763347\\
-0.16161097\\
0.370439
\end{bmatrix} = \\
}
{
\begin{bmatrix}
0.02151436 \\
0.03428119 \\
0.03346679 \\
0.03140159 \\
\end{bmatrix}
}
$$

In [None]:
dot_product_1 = np.dot(layer_1, synapse_1)

In [None]:
print(dot_product_1)

### Apply bias value for the dot product 1
$$
Eg.\\
bias{\space}value_1= 1 \\
\begin{bmatrix}
0.02151436 \\
0.03428119 \\
0.03346679 \\
0.03140159 \\
\end{bmatrix}-bias{\space}value_1=\\
$$

$$
\begin{bmatrix}
-0.97848564\\
-0.96571881\\
-0.96653321\\
-0.96859841
\end{bmatrix}
$$

In [None]:
bias_val_1 = 1
dot_product_1 = dot_product_1 - bias_val_1

In [None]:
dot_product_1

### Create layer 2 by passing dot product 1 through the activation function

$$
g_{sigmoid}(z) = \frac{1}{1+e^{-z}}\\
$$
$$
Eg.\\
g_{sigmoid}(\begin{bmatrix}
-0.97848564\\
-0.96571881\\
-0.96653321\\
-0.96859841
\end{bmatrix})=\\
$$

$$
\begin{bmatrix}
0.27319237\\
0.27573466\\
0.27557205\\
0.27515996
\end{bmatrix}
$$

In [None]:
layer_2 = sigmoid(dot_product_1)

In [None]:
layer_2.shape

In [None]:
layer_2
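
The forward-pass steps above can be condensed into a single function. The following is a minimal sketch that hard-codes the example weights shown in this section (rounded to eight decimals) and subtracts a bias of 1 at each layer, as the cells above do. The names `sigmoid_fn` and `forward_pass_demo` are hypothetical and chosen so that the notebook variables are left untouched.

```python
import numpy as np

def sigmoid_fn(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass_demo(inputs, weights_0, weights_1, bias_0=1.0, bias_1=1.0):
    """Two-layer forward pass; the bias is subtracted, as in the cells above."""
    hidden = sigmoid_fn(np.dot(inputs, weights_0) - bias_0)
    output = sigmoid_fn(np.dot(hidden, weights_1) - bias_1)
    return hidden, output

xor_inputs = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])
weights_0 = np.array([[-0.16595599, 0.44064899, -0.99977125, -0.39533485],
                      [-0.70648822, -0.81532281, -0.62747958, -0.30887855]])
weights_1 = np.array([[-0.20646505], [0.07763347], [-0.16161097], [0.370439]])

hidden_demo, output_demo = forward_pass_demo(xor_inputs, weights_0, weights_1)
print(hidden_demo.shape, output_demo.shape)
print(output_demo)
```

With these weights, the printed output should closely match the layer 2 values in the worked example above. All four predictions are nearly identical because the network has not been trained yet.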

## Part 01b -- [Backpropagation](https://en.wikipedia.org/wiki/Backpropagation)

This backpropagation example is a naive implementation that uses $\frac{1}{2}$ of the squared error, $E = \frac{1}{2}(t-y)^2$, as the loss function. Here $t$ is the target value and $y = o_j$ is the output of the network.

Therefore, the derivative of the loss function with respect to the output, $\frac{\partial E}{\partial o_j}$, is computed first:

$$
\frac{\partial E}{\partial o_j} = \frac{\partial }{\partial o_j} {\frac{1}{2}}(t-y)^2 = y-t
$$

The weight update for each layer is computed using:

$$
\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}
$$

In this implementation, the learning rate ($\eta$) = 1.


Read more by following the backpropagation link above.
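
As a worked check of the loss derivative, the sketch below compares the analytic value $y - t$ with a central finite difference and then takes one gradient descent step directly on the prediction. The target, prediction and step size are illustrative values, not results from the notebook.

```python
def half_squared_error(target, prediction):
    return 0.5 * (target - prediction) ** 2

target, prediction = 1.0, 0.25          # illustrative values

analytic_grad = prediction - target     # dE/dy = y - t
step = 1e-6
numeric_grad = (half_squared_error(target, prediction + step)
                - half_squared_error(target, prediction - step)) / (2 * step)

# One gradient descent step on the prediction itself, with eta = 0.5
eta = 0.5
updated_prediction = prediction - eta * analytic_grad

print(analytic_grad, numeric_grad)
print(half_squared_error(target, prediction),
      half_squared_error(target, updated_prediction))
```

Moving against the gradient brings the prediction closer to the target, so the loss after the step is smaller than before.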


## Implement a single backprop pass

### Compute the partial first derivative of the loss function:
$\frac{\partial }{\partial o_j} {(\frac{1}{2}}(t-y)^2) \text{ } { = y-t}$

In [None]:
output_loss_derivative = (layer_2 - y)

In [None]:
output_loss_derivative

### Compute the partial first derivative of the activation function for the back-propagation step:
$g_{sigmoid}(z) = \frac{1}{1+e^{-z}}$ 

$g'_{sigmoid}(z) = \frac{\partial}{\partial z} {(\frac{1}{1+e^{-z}})}$

${\text{Let }f(z) \text{ } = 1 + e^{-z}}$

${\therefore g_{sigmoid}(z) \text{ } = {f(z)}^{-1}}$

${\therefore \frac{\partial{(g_{sigmoid}(z))}}{\partial{z}} \text{ } = \frac{\partial{({f(z)}^{-1})}}{\partial{z}}}$

${\therefore \text{By applying chain rule, } \frac{\partial{({f(z)}^{-1})}}{\partial{z}} = \frac{\partial{({f(z)}^{-1})}}{\partial{(f(z))}}{.}\frac{\partial{({f(z)})}}{\partial{z}}}$

$\frac{\partial{({f(z)}^{-1})}}{\partial{f(z)}} = - {f(z)}^{-2}$

$\frac{\partial{({f(z)})}}{\partial{z}} = {\frac{\partial{(1)}}{\partial{z}}} + {\frac{\partial{(e^{-z})}}{\partial{z}}} $

${\because \frac{\partial{(e^{x})}}{\partial{x}} = e^{x}, \text{by applying chain rule; } \frac{\partial{(e^{-z})}}{\partial{z}} = \frac{\partial{(e^{-z})}}{\partial{(-z)}}{.}\frac{\partial{(-z)}}{\partial{z}}}$

${\therefore {\frac{\partial{(e^{-z})}}{\partial{z}}} = -e^{-z}}$

${\therefore \frac{\partial{({f(z)})}}{\partial{z}} = -e^{-z}}$

${\therefore \frac{\partial{({f(z)}^{-1})}}{\partial{z}} = {(- {f(z)}^{-2})}{.}{(-e^{-z})}}$

${\therefore \frac{\partial{({f(z)}^{-1})}}{\partial{z}} = \frac{e^{-z}}{{(1 + e^{-z})}^{2}}}$

${\therefore \frac{\partial{({f(z)}^{-1})}}{\partial{z}} = \frac{(1 + e^{-z}) - 1}{{(1 + e^{-z})}^{2}}}$

${\therefore \frac{\partial{({f(z)}^{-1})}}{\partial{z}} = \frac{(1 + e^{-z})}{{(1 + e^{-z})}^{2}} - (\frac{1}{(1 + e^{-z})})^{2}}$

${\therefore \frac{\partial{({f(z)}^{-1})}}{\partial{z}} = \frac{1}{{(1 + e^{-z})}} - (\frac{1}{(1 + e^{-z})})^{2}}$

${\therefore \frac{\partial{({f(z)}^{-1})}}{\partial{z}} = \frac{1}{{(1 + e^{-z})}} {.} (1 - \frac{1}{(1 + e^{-z})})}$

${\because g_{sigmoid}(z) = \frac{1}{1+e^{-z}} \text{; } \frac{\partial{({f(z)}^{-1})}}{\partial{z}} = g_{sigmoid}(z) {.} (1 - g_{sigmoid}(z))}$

${\therefore \frac{\partial{(g_{sigmoid}(z))}}{\partial{z}} = g_{sigmoid}(z) {.} (1 - g_{sigmoid}(z))}$

${\therefore g'_{sigmoid}(z) = g_{sigmoid}(z)(1 - g_{sigmoid}(z))}$

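Described in this notebook is the naive implementation for computing $g'_{sigmoid}(z)$.

In this implementation, the first derivative is defined as an additional optional output of the activation function, which is returned when the ```derivative``` flag is set to ```True```. The function then expects the sigmoid output $g_{sigmoid}(z)$ as its argument, so the example that follows evaluates $g_{sigmoid}(z)(1 - g_{sigmoid}(z))$ for the layer 2 values.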

$$
Eg.\\
g'_{sigmoid}(
  \begin{bmatrix}
      0.27319237\\
      0.27573466\\
      0.27557205\\
      0.27515996
  \end{bmatrix})=\\
$$

$$
\begin{bmatrix}
    0.1985583\\
    0.19970506\\
    0.19963209\\
    0.19944695
\end{bmatrix}
$$

In [None]:
layer_2_derivative = sigmoid(layer_2, derivative=True)

In [None]:
print(layer_2_derivative)

In [None]:
layer_2_delta = (output_loss_derivative * layer_2_derivative)

In [None]:
layer_2_delta

In [None]:
layer_1_error = (layer_2_delta.dot(synapse_1.T))

In [None]:
layer_1_error

In [None]:
layer_1_delta = layer_1_error * sigmoid(layer_1, derivative=True)

In [None]:
layer_1_delta

### Updating the weights/synapses of the neural network

In [None]:
# Gradient descent step: move the weights against the gradient (Δw = -η ∂E/∂w, with η = 1)
synapse_1 -= layer_1.T.dot(layer_2_delta)

In [None]:
synapse_1.shape

In [None]:
synapse_0 -= layer_0.T.dot(layer_1_delta)

In [None]:
synapse_0
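
The single backpropagation pass can also be written as one function. The sketch below is a self-contained illustration under stated assumptions: it uses the half squared error loss with $\frac{\partial E}{\partial o} = o - t$, the update rule $\Delta w = -\eta \frac{\partial E}{\partial w}$ with $\eta = 1$, a fixed bias of 1 subtracted at each layer, and the example weights from the forward pass section. The function and variable names are hypothetical.

```python
import numpy as np

def sigmoid_fn(z):
    return 1.0 / (1.0 + np.exp(-z))

def half_squared_loss(targets, outputs):
    return 0.5 * np.sum((targets - outputs) ** 2)

def backprop_step_demo(inputs, targets, weights_0, weights_1, bias=1.0, eta=1.0):
    # Forward pass (bias subtracted, matching the worked example)
    hidden = sigmoid_fn(inputs.dot(weights_0) - bias)
    outputs = sigmoid_fn(hidden.dot(weights_1) - bias)

    # Backward pass: dE/do = (o - t), chained through the sigmoid derivative
    output_delta = (outputs - targets) * outputs * (1.0 - outputs)
    hidden_delta = output_delta.dot(weights_1.T) * hidden * (1.0 - hidden)

    # Gradient descent: move each weight matrix against its gradient
    new_weights_1 = weights_1 - eta * hidden.T.dot(output_delta)
    new_weights_0 = weights_0 - eta * inputs.T.dot(hidden_delta)
    return new_weights_0, new_weights_1, half_squared_loss(targets, outputs)

xor_inputs = np.array([[0, 0], [1, 1], [1, 0], [0, 1]], dtype=float)
xor_targets = np.array([[0], [0], [1], [1]], dtype=float)
w0 = np.array([[-0.16595599, 0.44064899, -0.99977125, -0.39533485],
               [-0.70648822, -0.81532281, -0.62747958, -0.30887855]])
w1 = np.array([[-0.20646505], [0.07763347], [-0.16161097], [0.370439]])

w0_new, w1_new, loss_before = backprop_step_demo(xor_inputs, xor_targets, w0, w1)
_, _, loss_after = backprop_step_demo(xor_inputs, xor_targets, w0_new, w1_new)
print(loss_before, loss_after)
```

The second call only serves to evaluate the loss with the updated weights. A lower value than the first indicates that the step moved the weights in a useful direction.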

## Part 01d -- Auto-differentiation

### Compute the partial first derivative of the activation function, for the back-propagation step; non-deterministically:

In the previous section, the first derivative of the activation function was computed naively, by defining an additional output that corresponds to the closed-form first derivative of the associated activation function.

This naive approach severely limits the practical utility of neural networks, because it requires the practitioner developing the network to understand the necessary calculus and to deterministically define the formula for the first derivative of any arbitrary activation function they want to implement.

An alternative approach is to compute the value of the first derivative of an activation function using a technique called auto-differentiation.

Auto-differentiation can be implemented in multiple ways.

In this implementation, a random sampling approach is used to approximate the value of the first derivative numerically, by averaging difference quotients of the form $\frac{f(x + \epsilon) - f(x)}{\epsilon}$ over randomly drawn small step sizes $\epsilon$.

In [None]:
def compute_stochastic_auto_differentiation(x, input_fn, epsilon, num_eval_steps=10):
    auto_diff_out = 0
    for i in range(num_eval_steps):
        fn_x = input_fn(x)

        rand_epsilon_a = random.SystemRandom().uniform(abs(epsilon), (abs(epsilon) / 100))
        rand_epsilon_b = random.SystemRandom().uniform(abs(epsilon), (abs(epsilon) / 100))

        x_a = x + rand_epsilon_a
        x_b = x + rand_epsilon_b

        x_delta = x_a - x_b
        
        fn_auto_diff_a = input_fn(x_a)
        fn_auto_diff_b = input_fn(x_b)

        auto_diff_out += (fn_auto_diff_a - fn_auto_diff_b) / x_delta
        auto_diff_out += (fn_auto_diff_a - fn_x)           / rand_epsilon_a
        auto_diff_out += (fn_auto_diff_b - fn_x)           / rand_epsilon_b

    auto_diff_out = auto_diff_out / (3 * num_eval_steps)

    return auto_diff_out

In [None]:
layer_2_derivative, layer_2

In [None]:
# Differentiate at the pre-activation values, since layer_2 = sigmoid(dot_product_1)
auto_diff_layer_2_derivative = compute_stochastic_auto_differentiation(dot_product_1, sigmoid, 1e-8, 100)

In [None]:
auto_diff_layer_2_derivative
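
For comparison with the stochastic estimate, the sketch below evaluates two deterministic difference quotients, a one-sided (forward) difference and a symmetric (central) difference, at the pre-activation values from the worked forward pass, and measures their error against the closed-form derivative. The helper names and the step size `1e-3` are illustrative choices.

```python
import numpy as np

def sigmoid_fn(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_difference(fn, z, step):
    return (fn(z + step) - fn(z)) / step

def central_difference(fn, z, step):
    return (fn(z + step) - fn(z - step)) / (2.0 * step)

# Pre-activation values of the output layer from the worked example
z_demo = np.array([-0.97848564, -0.96571881, -0.96653321, -0.96859841])
exact = sigmoid_fn(z_demo) * (1.0 - sigmoid_fn(z_demo))

step = 1e-3
forward_error = np.max(np.abs(forward_difference(sigmoid_fn, z_demo, step) - exact))
central_error = np.max(np.abs(central_difference(sigmoid_fn, z_demo, step) - exact))
print(forward_error, central_error)
```

For a smooth function, the error of the central difference shrinks with the square of the step size, whereas the forward difference error shrinks only linearly, so the central estimate is closer to the closed-form value here.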

## Part 01e -- Training the simple XOR gate neural network:

- Note: There is no function that defines a neuron! In practice, a neuron is just an abstract concept for understanding the probability function
- The data is fed through the neural network repeatedly
- The weights of the network are updated through backpropagation
- During training, the model becomes better and better at predicting the output values
- The layers are just matrix multiplication functions that apply the sigmoid function to the product of the synapse matrix and the corresponding layer
- The backpropagation portion of the training is the machine learning portion of this code
- The backpropagation function reduces the prediction errors during each training step
- Synapses and weights are synonymous

In [None]:
training_steps = 500000
update_freq = 10

input_data, output_data = x, y

bias_val_1 = 1e-2
bias_val_2 = 0.2 # 1e-4 # 0.5 # 1 # 1e-3 # 10

learning_rate = 0.1 # 10 # 1 # 1e-3 # 1000 #

for t in range(training_steps):
    # Creating the layers of the neural network:
    layer_0 = input_data
    layer_1 = sigmoid(np.dot(layer_0, synapse_0) + bias_val_1)
    layer_2 = sigmoid(np.dot(layer_1, synapse_1) + bias_val_2)

    # Backpropagation:
    outputLoss_derivative = output_data - layer_2
    loss_col.append(np.mean(np.abs(outputLoss_derivative)))
    if ((t*update_freq) % training_steps == 0):
        print ('Training step :' + str(t))
        print ('Prediction error during training :' + str(np.mean(np.abs(outputLoss_derivative))))

    # Layer-wise delta function:
    layer_2_delta = outputLoss_derivative*sigmoid(layer_2, derivative=True)
    layer_1_error = layer_2_delta.dot(synapse_1.T) # Matrix multiplication of the layer 2 delta with the transpose of the second synapse (synapse_1).
    layer_1_delta = layer_1_error*sigmoid(layer_1, derivative=True)

    # Updating synapses or weights; the learning rate is applied exactly once per update:
    synapse_1 += learning_rate*layer_1.T.dot(layer_2_delta)
    synapse_0 += learning_rate*layer_0.T.dot(layer_1_delta)
    del layer_0, layer_1

print ('Training completed ...')
print ('Predictions :' + str (layer_2))

In [None]:
layer_0 = input_data
layer_1 = sigmoid(np.dot(layer_0, synapse_0) + bias_val_1)
layer_2 = sigmoid(np.dot(layer_1, synapse_1) + bias_val_2)
print(synapse_0)
print('\n')
print(bias_val_1)
print('\n')
print(layer_1)
print('\n')
print(bias_val_2)
print('\n')
print(synapse_1)
print('\n')
print(layer_2)

In [None]:
def xor_gate_predictor(x):
    layer_0 = x
    layer_1 = sigmoid(np.dot(layer_0, synapse_0) + bias_val_1)
    layer_2 = sigmoid(np.dot(layer_1, synapse_1) + bias_val_2)
    return layer_2

In [None]:
y.shape

In [None]:
def plot_xor_gate_decision_boundary(prediction_function, x, y):
    # Setting minimum and maximum values for giving the plot function some padding
    x_min, x_max = x[:, 0].min() - .5, \
                  x[:, 0].max() + .5
    y_min, y_max = x[:, 1].min() - .5, \
                  x[:, 1].max() + .5
    h = 0.01

    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), \
                        np.arange(y_min, y_max, h))

    # Predict the function value for the whole grid
    Z = prediction_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plotting the contour and training examples
    plt.contourf(xx, yy, Z, cmap=plt.colormaps.get_cmap('Spectral'))
    plt.scatter(x[:, 0], x[:, 1], c=y, s=42, cmap=plt.colormaps.get_cmap('Accent'))
    for i, (i_x, i_y) in enumerate(x):
        i_z = y[i]
        p_x, p_y = i_x - 0.25, i_y - 0.2
        #if i_x == 1:
        #    p_x = i_x - 0.25
        plt.text(p_x , p_y , f'({i_x}, {i_y} ► {i_z})', fontsize=18)

In [None]:
import sklearn, sklearn.linear_model

In [None]:
# Fit the logistic regression baseline on the same XOR data as the neural network
linear_classifier = sklearn.linear_model.LogisticRegression()
linear_classifier.fit(x, y.ravel())

In [None]:
plot_xor_gate_decision_boundary(lambda x: linear_classifier.predict(x), x, y)
plt.title('Logistic Regression')

In [None]:
plot_xor_gate_decision_boundary(lambda x: xor_gate_predictor(x), x, y)
plt.title('XOR gate neural network model')

In [None]:
plt.plot(loss_col)
plt.show()

delete_model = True

if delete_model:
  # Remove the model state from the notebook namespace; names that are not defined are skipped
  for name in ('loss_col', 'input_data', 'output_data', 'x', 'y',
               'layer_2', 'synapse_0', 'synapse_1'):
    globals().pop(name, None)
  import gc
  gc.collect()

## Part 01f -- Neural network based XOR gate using rectified linear units activation function:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

### Plotting the rectified linear units (ReLU) activation function:

In [None]:
def ReLU(x, derivative=False):
  if derivative:
      return np.where(x < 0, 0, 1)
  x_relu = np.maximum(x, 0)
  return x_relu

def relu(x, derivative=False):
  return ReLU(x, derivative=derivative)

In [None]:
def relu_derivative(x, y):
    return np.array([x])[np.array([y])<0]

In [None]:
a = np.array([3])
a[np.array([-1])<0] = 0

In [None]:
print(a)

In [None]:
x = list(np.arange(-6.0, 6.0, 0.1))
y = []
for i in x:
  y_i = ReLU(i)
  y.append(y_i)

xmin = -6
xmax = 6
ymin = -0.5
ymax = 6  # ReLU is unbounded above, so the y-axis must cover the full plotted range
axis = [xmin, xmax, ymin, ymax]
plt.axhline(y=0, color='C2', alpha=0.5)
plt.axvline(x=0, color='C2', alpha=0.5)
plt.axis(axis)
plt.plot(x, y, linewidth=2.0)

### Create input and output data

In [None]:
x = np.array([[0, 0],
              [1, 1],
              [0, 1],
              [1, 0]])

y = np.array([[0],
              [0],
              [1],
              [1]])

### D_in is input dimension; H is hidden dimension; D_out is output dimension (the batch size N is the number of rows in x):

In [None]:
D_in, H, D_out = x.shape[1], 30, 1

### Randomly initialize weights:

In [None]:
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

In [None]:
learning_rate = 1e-3
update_freq = 10
training_steps = 2000

loss_col = []

### ReLU as the activation function and [squared error](https://datascience.stackexchange.com/questions/10188/why-do-cost-functions-use-the-square-error) as the loss function:

In [None]:
for t in range(training_steps):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)  # using ReLU as activation function
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum() # squared error as the loss function
    loss_col.append(loss)
    if ((t*update_freq) % training_steps ==0):
      print ('Training step :' + str(t))
      print ('Loss function during training :' + str(loss))
    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2*(y_pred - y) # the last layer's error
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T) # the second layer's error
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0  # the derivative of ReLU
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

print ('Training completed ...')
print ('Predictions :' + str (y_pred))

In [None]:
plt.plot(loss_col)
plt.show()

In [None]:
delete_model = True

if delete_model:
  # Remove the model state from the notebook namespace; names that are not defined are skipped
  for name in ('loss_col', 'input_data', 'output_data', 'x', 'y'):
    globals().pop(name, None)

  import gc
  gc.collect()

## Part 01g -- Training using stochastic auto gradients

Automated gradient computation using auto-differentiation.

In [None]:
x = np.array([[0, 0],
              [1, 1],
              [0, 1],
              [1, 0]])

y = np.array([[0],
              [0],
              [1],
              [1]])

In [None]:
def tanh(x, derivative=False):
    # When derivative=True, x is expected to be a tanh output, so 1 - x**2 equals 1 - tanh(z)**2
    if derivative:
        return (1 - (x ** 2))
    return np.tanh(x)
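
The derivative branch of `tanh` follows the same convention as `sigmoid`: it expects the activation output rather than the raw input. The sketch below, with illustrative variable names, checks $1 - \tanh^2(z)$ against the equivalent form $\frac{1}{\cosh^2(z)}$ and shows that passing the raw input $z$ instead gives a different result.

```python
import numpy as np

z_tanh = np.linspace(-2.0, 2.0, 5)   # -2, -1, 0, 1, 2
tanh_output = np.tanh(z_tanh)

derivative_from_output = 1.0 - tanh_output ** 2  # correct: argument is tanh(z)
derivative_direct = 1.0 / np.cosh(z_tanh) ** 2   # same derivative, computed from z
wrong_usage = 1.0 - z_tanh ** 2                  # incorrect: raw z passed in

print(np.allclose(derivative_from_output, derivative_direct))
print(np.allclose(wrong_usage, derivative_direct))
```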

In [None]:
input_dim = x.shape[1] # 28 * 28
output_dim = y.shape[1] # 10
hidden_units_list = [input_dim, 8, output_dim]
bias_val_list = [1e-4, 1e-2, 0.2]
synapse_list = []
for i, unit_size in enumerate(hidden_units_list):
    if i + 1 < len(hidden_units_list):
        synapse_list.append(2 * np.random.random((unit_size, hidden_units_list[i + 1])) - bias_val_list[i])

### Stochastic automatic gradient (SAG) for a neural network
Function to compute the layer gradient using stochastic automatic differentiation.

In [None]:
def stochastic_auto_gradient(layer, layer_error, activation_fn, epsilon):
    return layer_error * compute_stochastic_auto_differentiation(layer, activation_fn, epsilon)
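
The helper `compute_stochastic_auto_differentiation` is defined elsewhere in the notebook and is not shown in this section. As a general illustration of estimating an activation function's derivative numerically instead of coding it by hand, here is a minimal central finite-difference sketch. It is not the notebook's implementation, and the names `finite_difference_derivative`, `sigmoid_demo` and `z_demo` are hypothetical.

```python
import numpy as np

def sigmoid_demo(z):
    return 1.0 / (1.0 + np.exp(-z))

def finite_difference_derivative(fn, z, epsilon=1e-5):
    # Central difference: (f(z + eps) - f(z - eps)) / (2 * eps)
    return (fn(z + epsilon) - fn(z - epsilon)) / (2.0 * epsilon)

z_demo = np.linspace(-3.0, 3.0, 7)
numeric = finite_difference_derivative(sigmoid_demo, z_demo)
# Closed-form derivative of the sigmoid, for comparison
analytic = sigmoid_demo(z_demo) * (1.0 - sigmoid_demo(z_demo))
print('Largest difference:', np.max(np.abs(numeric - analytic)))
```

The numerical estimate works for any smooth activation function without a hand-derived formula, at the cost of extra function evaluations and a small approximation error.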

In [None]:
def forward_pass(input_data, synapse_list, bias_val_list, activation_fn):
    layer_list = [input_data]
    for i, synapse in enumerate(synapse_list):
        layer_list.append(activation_fn(np.dot(layer_list[i], synapse) + bias_val_list[i]))
    return layer_list

def back_propagation(output_data, layer_list, synapse_list, learning_rate, verbose=False):
    output_loss_derivative = output_data - layer_list[-1]
    bprop_loss = np.mean(np.abs(output_loss_derivative))
    if verbose:
        print ('Prediction error during training : ' + str(bprop_loss))

    layer_error = learning_rate * output_loss_derivative

    synapse_list = list(reversed(synapse_list))
    layer_list = list(reversed(layer_list))

    for i, layer in enumerate(layer_list):
        if i + 1 < len(layer_list):
            # activation_fn is not an argument here; it is read from the notebook's global scope
            layer_delta = stochastic_auto_gradient(layer, layer_error, activation_fn, 1e-8)
            layer_error = layer_delta.dot(synapse_list[i].T)
            synapse_list[i] += layer_list[i + 1].T.dot(layer_delta)

    synapse_list = list(reversed(synapse_list))
    layer_list = list(reversed(layer_list))

    return synapse_list, layer_list, bprop_loss

In [None]:
activation_fn = sigmoid
loss_col = []
learning_rate = 1
training_steps = 10000

In [None]:
for t in range(training_steps):
    input_data, output_data = x, y
    layer_list = forward_pass(input_data, synapse_list, bias_val_list, activation_fn)
    synapse_list, layer_list, loss = back_propagation(output_data, layer_list, synapse_list, learning_rate, verbose=(t + 1) % (0.1 * training_steps) == 0)
    loss_col.append(loss)
print ('Training completed ...')
print (layer_list[-1])

## Part 02 -- [Build a more complex neural network classifier using numpy](http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/):

### Importing dependent libraries:

In [None]:
import matplotlib.pyplot as plt # pip3 install matplotlib
import numpy as np # pip3 install numpy
import sklearn # pip3 install scikit-learn
import sklearn.datasets
import sklearn.linear_model
import matplotlib

### Plotting the hyperbolic tangent (tanh) activation function:

In [None]:
def tanh(x, derivative=False):
    if derivative:
        return 1 - (x ** 2)
    return np.tanh(x)

In [None]:
x = list(np.arange(-6.0, 6.0, 0.1))
y = []
for i in x:
    y_i = tanh(i)
    y.append(y_i)

xmin = -6
xmax =  6
ymin = -1.1
ymax = 1.1
axis = [xmin, xmax, ymin, ymax]
plt.axhline(y=0, color='C2', alpha=0.5)
plt.axvline(x=0, color='C2', alpha=0.5)
plt.axis(axis)
plt.plot(x, y, linewidth=2.0)

### Display plots inline and change default figure size:

In [None]:
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)

### Generate a dataset and create a plot:

In [None]:
np.random.seed(0)
X, y = sklearn.datasets.make_moons(200, noise=0.20)
plt.scatter(X[:,0], X[:,1], s=40, c=y, cmap=plt.cm.Spectral)

### Train the logistic regression classifier:

The classification problem can be summarized as creating a boundary between the red and the blue dots.

In [None]:
linear_classifier = sklearn.linear_model.LogisticRegressionCV()
linear_classifier.fit(X, y)

### Visualize the logistic regression classifier output:

In [None]:
def plot_decision_boundary(prediction_function):
    # Setting minimum and maximum values for giving the plot function some padding
    x_min, x_max = X[:, 0].min() - .5, \
                   X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, \
                   X[:, 1].max() + .5
    h = 0.01
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), \
                         np.arange(y_min, y_max, h))
    # Predict the function value for the whole grid
    Z = prediction_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plotting the contour and training examples
    plt.contourf(xx, yy, Z, cmap=plt.colormaps.get_cmap("Spectral"))
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.colormaps.get_cmap("Spectral"))
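
The grid construction inside `plot_decision_boundary` is easier to follow on a tiny example. The sketch below uses a made-up 2 x 3 grid and a stand-in labelling rule (both are assumptions for illustration) to show how `np.meshgrid`, `ravel` and `np.c_` turn two coordinate ranges into one row per grid point, and how the predictions are reshaped back to the grid.

```python
import numpy as np

# A tiny grid: x takes the values 0 and 1, y takes the values 0, 1 and 2
xx_demo, yy_demo = np.meshgrid(np.arange(0, 2, 1), np.arange(0, 3, 1))

# Flatten both coordinate arrays and pair them column-wise,
# giving one (x, y) row per grid point
grid_points = np.c_[xx_demo.ravel(), yy_demo.ravel()]
print(xx_demo.shape)      # (3, 2)
print(grid_points.shape)  # (6, 2)

# Stand-in "classifier": label a point 1 when x + y is odd
labels = (grid_points.sum(axis=1) % 2).reshape(xx_demo.shape)
print(labels)
```

In the real function the stand-in rule is replaced by `prediction_function`, and the reshaped labels are what `plt.contourf` colours in.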

### Plotting the decision boundary:

In [None]:
plot_decision_boundary(lambda x: linear_classifier.predict(x))
plt.title('Logistic Regression')
plt.show()
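
The plot shows a straight-line boundary that cannot follow the two interleaved half-moons. To put a number on this, the sketch below refits the same kind of classifier and reports its training accuracy. It regenerates the data with `random_state=0`, which is an assumption made here for reproducibility, so the sample may differ slightly from the one plotted above. The `*_demo` names are also introduced only for this example.

```python
import sklearn.datasets
import sklearn.linear_model

# Same dataset settings as above, with an explicit seed (assumption)
X_demo, y_demo = sklearn.datasets.make_moons(200, noise=0.20, random_state=0)

clf_demo = sklearn.linear_model.LogisticRegressionCV()
clf_demo.fit(X_demo, y_demo)

# Fraction of training points on the correct side of the linear boundary
train_accuracy = clf_demo.score(X_demo, y_demo)
print('Training accuracy of the linear classifier: %.3f' % train_accuracy)
```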

### Create a neural network:

In [None]:
num_examples = len(X) # training set size
nn_input_dim = 2 # input layer dimensionality
nn_output_dim = 2 # output layer dimensionality

### Gradient descent parameters:

In [None]:
epsilon = 0.01 # learning rate for gradient descent
reg_lambda = 0.01 # regularization strength

### Function that predicts the output of either 0 or 1:

A forward pass through the model layers computes the class probabilities; the function returns the most probable class (0 or 1) for each input.

In [None]:
def forward_pass(model, x):
    W1, b1, W2, b2 = model['W1'], \
                     model['b1'], \
                     model['W2'], \
                     model['b2']
    # Design a network with forward propagation
    z1 = x.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2

    exp_scores = np.exp(z2)

    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

    return np.argmax(probs, axis=-1)
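
The softmax step in `forward_pass` calls `np.exp` on the raw scores, which can overflow when the scores are large. A common remedy, sketched below as an optional variant and not as the notebook's method, is to subtract each row's maximum score before exponentiating. This leaves the probabilities unchanged. The name `stable_softmax` and the example scores are assumptions.

```python
import numpy as np

def stable_softmax(scores):
    # Subtracting the row maximum does not change the result,
    # but keeps every exponent at or below zero
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

# The second row would overflow with a plain np.exp
scores_demo = np.array([[1.0, 2.0],
                        [1000.0, 1001.0]])
probs_demo = stable_softmax(scores_demo)
print(probs_demo)
```

Both rows give the same probabilities, because softmax depends only on the differences between scores.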

### Compute loss function on the dataset:

In [None]:
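def loss_function(model, x, y, reg_lambda=0.01):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    num_examples = len(x)
    # Forward pass that keeps the softmax probabilities
    # (forward_pass only returns the predicted class labels)
    a1 = np.tanh(x.dot(W1) + b1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probabilities = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    # Cross-entropy loss: negative log-probability assigned to the correct class
    correct_logprobs = -np.log(probabilities[range(num_examples), y])
    data_loss = np.sum(correct_logprobs)
    # Adding the regularization term to the loss function
    data_loss += reg_lambda/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)))
    return 1./num_examples * data_loss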
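
The backpropagation step used during training starts from `probs` with 1 subtracted at the correct class. That expression is the gradient of the softmax cross-entropy loss with respect to the scores. The sketch below works through both quantities on two hand-picked examples; the probability values and the `*_demo` names are illustrative assumptions.

```python
import numpy as np

def cross_entropy(probs, labels):
    # Pick the probability assigned to the correct class of each example
    picked = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(picked))

probs_demo = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
labels_demo = np.array([0, 1])

loss_demo = cross_entropy(probs_demo, labels_demo)
print('Mean cross-entropy:', loss_demo)

# Gradient with respect to the scores: probabilities minus one-hot labels
grad_demo = probs_demo.copy()
grad_demo[np.arange(len(labels_demo)), labels_demo] -= 1
print(grad_demo)
```

A confident, correct prediction (0.9) contributes a small loss and a small gradient; a less confident one (0.8) contributes more of both.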


### These functions initialize the neural network parameters, learn them with gradient descent, and return the model:
- nn_hdim: Number of nodes in the hidden layer
- num_passes: Number of passes through the training data for gradient descent
- print_loss: If True, print the loss every 1000 iterations

In [None]:
def build_model(nn_hdim):

    # Initialize the parameters to random values. We need to learn these.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))

    # Assign new parameters to the model
    model = { 'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
    return model

def train_model(X, y, model, num_passes=20000, print_loss=False):
    W1 = model['W1']
    b1 = model['b1']
    W2 = model['W2']
    b2 = model['b2']
    # Full-batch gradient descent: every pass uses the whole training set
    for i in range(0, num_passes):

        # Forward propagation
        z1 = X.dot(W1) + b1
        a1 = np.tanh(z1)
        z2 = a1.dot(W2) + b2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Backpropagation
        delta3 = probs
        delta3[range(num_examples), y] -= 1
        dW2 = (a1.T).dot(delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
        dW1 = np.dot(X.T, delta2)
        db1 = np.sum(delta2, axis=0)

        # Add regularization terms (b1 and b2 don't have regularization terms)
        dW2 += reg_lambda * W2
        dW1 += reg_lambda * W1

        # Gradient descent parameter update
        W1 += -epsilon * dW1
        b1 += -epsilon * db1
        W2 += -epsilon * dW2
        b2 += -epsilon * db2

        model['W1'] = W1
        model['W2'] = W2
        model['b1'] = b1
        model['b2'] = b2

        # Optionally print the loss.
        # This is expensive because it uses the whole dataset, so we don't want to do it too often.
        if print_loss and i % 1000 == 0:
            print("Loss after iteration %i: %f" %(i, loss_function(model, X, y)))
    return model

### Build a model with 50-dimensional hidden layer:

In [None]:
model = build_model(50)
model = train_model(X, y, model, print_loss=True)

### Plot the decision boundary:

In [None]:
plot_decision_boundary(lambda x: forward_pass(model, x))
plt.title('Decision Boundary for hidden layer size  50')
plt.show()

### Visualizing the hidden layers with varying sizes:

In [None]:
plt.figure(figsize=(16, 32))
hidden_layer_dimensions = [1, 2, 3, 4, 5, 20, 50]
for i, nn_hdim in enumerate(hidden_layer_dimensions):
    plt.subplot(5, 2, i+1)
    plt.title('Hidden Layer size %d' % nn_hdim)
    model = build_model(nn_hdim)
    model = train_model(X, y, model)
    plot_decision_boundary(lambda x: forward_pass(model, x))
plt.show()

## Part 03 -- Example illustrating the importance of [learning rate](http://users.ics.aalto.fi/jhollmen/dippa/node22.html) in hyper-parameter tuning:

- The learning rate is often scheduled as a decreasing function of training time.
- Two forms that are commonly used are:
    * a linear function of time
    * a function that is inversely proportional to the time t
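
The two schedule shapes listed above can be sketched as small functions. This is an illustration only: `train_model` in this notebook uses a constant `epsilon`, and the function names and the `final_rate` and `decay_rate` parameters are assumptions introduced here.

```python
def linear_decay(initial_rate, step, total_steps, final_rate=0.0):
    # Move in a straight line from initial_rate to final_rate over total_steps
    fraction = min(step / total_steps, 1.0)
    return initial_rate + fraction * (final_rate - initial_rate)

def inverse_time_decay(initial_rate, step, decay_rate=1e-3):
    # Learning rate inversely proportional to the (scaled) step count
    return initial_rate / (1.0 + decay_rate * step)

for step in [0, 250, 500, 1000]:
    print(step,
          linear_decay(0.01, step, total_steps=1000),
          inverse_time_decay(0.01, step))
```

The linear schedule reaches its final value after a fixed number of steps, while the inverse-time schedule keeps shrinking but never reaches zero.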

### Create a noisier, more complex dataset:

In [None]:
np.random.seed(0)
X, y = sklearn.datasets.make_moons(20000, noise=0.5)
plt.scatter(X[:,0], X[:,1], s=40, c=y, cmap=plt.cm.Spectral)

In [None]:
linear_classifier = sklearn.linear_model.LogisticRegressionCV()
linear_classifier.fit(X, y)

In [None]:
plot_decision_boundary(lambda x: linear_classifier.predict(x))
plt.title('Logistic Regression')

In [None]:
num_examples = len(X) # training set size
nn_input_dim = 2 # input layer dimensionality
nn_output_dim = 2 # output layer dimensionality

In [None]:
epsilon = 0.01 # learning rate for gradient descent
reg_lambda = 0.01 # regularization strength

In [None]:
model = build_model(5)
model = train_model(X, y, model, print_loss=True)

### Plotting output of the model that failed to learn, given a set of hyper-parameters:

In [None]:
plot_decision_boundary(lambda x: forward_pass(model, x))
plt.title('Decision Boundary for hidden layer size 5')
plt.show()

### Adjusting the learning rate so that the neural network starts learning again:

In [None]:
epsilon = 1e-6 # learning rate for gradient descent
reg_lambda = 0.01 # regularization strength
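
One plausible reading of why a much smaller `epsilon` helps here: `train_model` sums the gradient over all training examples instead of averaging it, so the size of each update grows with the dataset. The arithmetic below is a rough sketch of that effect and not a full explanation; the helper name is hypothetical. It converts each setting into the learning rate that would give the same step if the gradient were averaged.

```python
def mean_gradient_equivalent_rate(epsilon, num_examples):
    # A summed gradient is num_examples times the mean gradient, so
    # epsilon * (summed gradient) == (epsilon * num_examples) * (mean gradient)
    return epsilon * num_examples

settings = [(0.01, 200),     # Part 02: small dataset
            (0.01, 20000),   # Part 03: same epsilon, 100 times more data
            (1e-6, 20000)]   # Part 03: reduced epsilon

for eps, n in settings:
    print('epsilon=%g, examples=%d -> equivalent rate %g'
          % (eps, n, mean_gradient_equivalent_rate(eps, n)))
```

With the same `epsilon`, the hundredfold larger dataset takes steps roughly one hundred times larger, which is consistent with the training run above failing to learn.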

In [None]:
model = build_model(5)
model = train_model(X, y, model, print_loss=True)

### Plotting the decision boundary generated by the improved neural network model:

In [None]:
plot_decision_boundary(lambda x: forward_pass(model, x))
plt.title('Decision Boundary for hidden layer size 5')
plt.show()

## Part 04 -- Building a neural network using [tensorflow](https://www.tensorflow.org/):

- A neural network that predicts the y value given an x value.
- Implemented using tensorflow, an open-source deep-learning library.

### Import dependent libraries:

In [None]:
import tensorflow as tf
import numpy as np

### Create a synthetic dataset for training and generating predictions:

In [None]:
x_data = np.float32(np.random.rand(2, 500))
y_data = np.float32(np.dot([0.5, 0.7], x_data) + 0.6) # cast to float32 to match x_data and the tensorflow variables

- `tf.Variable` objects hold tensors whose values can change during training
- Tensorflow treats all input data as tensors
- Tensors generalize vectors and matrices to an arbitrary number of dimensions

### Constructing a linear model:

In [None]:
bias = tf.Variable(tf.zeros([1]))
synapses = tf.Variable(tf.random.uniform([1, 2], -1, 1))

In [None]:
@tf.function
def cost():
    y = tf.matmul(synapses, x_data) + bias
    return tf.reduce_mean(tf.square(y - y_data))

### Gradient descent optimizer:

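- Imagine a ball in a valley
- The goal of the optimizer is to move the ball to the lowest point in the valley
- The loss function decreases over the course of training
- Mean squared error is used as the loss function
- The optimizer used here is RMSprop, a gradient descent variant that adapts the step size for each parameter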
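
The ball-in-the-valley picture can be made concrete without tensorflow. The sketch below runs plain gradient descent in numpy on a linear model with the same true weights and bias as the synthetic dataset. It is a simplified stand-in for the optimizer used in this part, and the step size, step count, seed and `*_demo` names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x_demo = rng.random((2, 500))
y_demo = np.dot([0.5, 0.7], x_demo) + 0.6

w_demo = np.zeros(2)   # weights, playing the role of the synapses
b_demo = 0.0           # bias
step_size = 0.1        # assumed learning rate

losses = []
for _ in range(2000):
    error = w_demo.dot(x_demo) + b_demo - y_demo
    losses.append(np.mean(error ** 2))               # mean squared error
    # Gradients of the mean squared error with respect to weights and bias
    grad_w = 2 * x_demo.dot(error) / error.size
    grad_b = 2 * error.mean()
    w_demo -= step_size * grad_w
    b_demo -= step_size * grad_b

print('Recovered weights:', w_demo, 'bias:', b_demo)
print('First loss:', losses[0], 'last loss:', losses[-1])
```

Each update moves the parameters a little way downhill on the loss surface, and the recovered values approach the weights 0.5 and 0.7 and the bias 0.6 used to generate the data.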

In [None]:
lr = 0.01

optimizer = tf.optimizers.RMSprop(lr)

### Training function:

- The `@tf.function` decorator on `cost` lets tensorflow trace the computation into a graph.
- Tensorflow makes it easier to visualize training sessions, for example with TensorBoard.

In [None]:
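def train():
    # One optimization step: compute the loss, its gradients, and update the variables
    with tf.GradientTape() as tape:
        loss = cost()
    gradients = tape.gradient(loss, [synapses, bias])
    optimizer.apply_gradients(zip(gradients, [synapses, bias]))

### Training the model:

In [None]:
training_steps = 60000

for step in range(training_steps):
    train()  # run one optimization step
    if step % 1000 == 0:
        print('Current training step: ' + str(step) + ', synapses: ' + str(synapses.numpy()) + ', bias: ' + str(bias.numpy()))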

## Part 05 -- Neural net XOR gate solver using [Tensorflow](https://tensorflow.org) and [Keras](https://keras.io)

In [None]:
import tensorflow.keras as keras
import numpy as np
import os

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

In [None]:
MODEL_PATH = './XOR_gate_keras_network.h5'

In [None]:
! wget https://github.com/rahulremanan/python_tutorial/raw/master/Fundamentals_of_deep-learning/weights/XOR_gate_keras_network.h5 -O XOR_gate_keras_network.h5

### Create input and output data

In [None]:
x = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])

y = np.array([[0],
              [1],
              [1],
              [0]])

### Create a neural network using [Keras Sequential API](https://keras.io/models/sequential/)

In [None]:
model = Sequential()
model.add(Dense(5, activation='relu', input_shape=(2,)))
model.add(Dense(5, activation='relu'))
model.add(Dense(1, activation='sigmoid')) # sigmoid keeps the output in (0, 1), as binary cross-entropy expects
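
To see why a hidden layer with ReLU units is enough for XOR, it helps to build a tiny network by hand. The weights below are hand-picked for illustration; they are not the weights learned by the Keras model, and the `*_demo` names are assumptions. Two hidden units and one linear output reproduce the XOR truth table exactly.

```python
import numpy as np

x_demo = np.array([[0, 0],
                   [0, 1],
                   [1, 0],
                   [1, 1]])

# Hidden unit 1 computes relu(x1 + x2); hidden unit 2 computes relu(x1 + x2 - 1)
W1_demo = np.array([[1.0, 1.0],
                    [1.0, 1.0]])
b1_demo = np.array([0.0, -1.0])

# Output: first hidden unit minus twice the second
W2_demo = np.array([[1.0],
                    [-2.0]])

hidden = np.maximum(x_demo.dot(W1_demo) + b1_demo, 0)
xor_output = hidden.dot(W2_demo)
print(xor_output.ravel())  # [0. 1. 1. 0.]
```

The second hidden unit only switches on when both inputs are 1, and its negative output weight cancels the contribution of the first unit in exactly that case.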

### Select optimizer

In [None]:
optimizer = keras.optimizers.Adam(learning_rate=1e-4)

### Compile keras model

In [None]:
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

### Load model weights

In [None]:
if os.path.exists(MODEL_PATH):
    model.load_weights(MODEL_PATH)

### Summarize keras model

In [None]:
model.summary()

### Visualize model architecture

In [None]:
%%capture
! apt-get install -y graphviz libgraphviz-dev && pip3 install pydot graphviz

In [None]:
# Import from tensorflow.keras to match the imports used to build the model
from tensorflow.keras.utils import plot_model, model_to_dot
import pydot
import graphviz # apt-get install -y graphviz libgraphviz-dev && pip3 install pydot graphviz
from IPython.display import SVG


In [None]:
output_dir = './'
plot_model(model, to_file=os.path.join(output_dir, 'model_summary_plot.png'))
SVG(model_to_dot(model).create(prog='dot', format='svg'))

### Train model

In [None]:
model.fit(x, y, batch_size=4, epochs=1000)

### Save model weights

In [None]:
model.save_weights(MODEL_PATH)

In [None]:
model.predict(x)

In [None]:
from google.colab import files
files.download(MODEL_PATH)

## Extra credit -- [Activation functions in numpy](https://codereview.stackexchange.com/questions/132023/different-neural-network-activation-functions-and-gradient-descent):

In [None]:
import numpy as np

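# Convention for derivative=True: unless noted otherwise, x is the activation's
# OUTPUT (for example sigmoid(z)), not the pre-activation input z.

def sigmoid(x, derivative=False):
    if derivative:
        return x * (1 - x)  # s * (1 - s), with s = sigmoid(z)
    return 1 / (1 + np.exp(-x))

def tanh(x, derivative=False):
    if derivative:
        return 1 - np.power(x, 2)  # 1 - t^2, with t = tanh(z)
    return np.tanh(x)

def relu(x, derivative=False):
    if derivative:
        return np.where(x > 0, 1, 0)  # 1 where the unit is active, else 0
    return np.maximum(x, 0)

def arctan(x, derivative=False):
    if derivative:
        return np.cos(x) ** 2  # 1 / (1 + z^2) = cos(a)^2, with a = arctan(z)
    return np.arctan(x)

def step(x, derivative=False):
    if derivative:
        return np.zeros_like(x)  # the step function is flat everywhere except at 0
    return np.where(x < 0, 0, 1)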

# def squash(x, derivative=False):
#     if derivative:
#         for i in range(0, len(x)):
#             for k in range(0, len(x[i])):
#                 if x[i][k] > 0:
#                     x[i][k] = (x[i][k]) / (1 + x[i][k])
#                 else:
#                     x[i][k] = (x[i][k]) / (1 - x[i][k])
#         return x
#     return x / (1 + np.abs(x))

def gaussian(x, derivative=False):
    if derivative:
        # Here x is the input z itself, not the activation output:
        # d/dz exp(-z^2) = -2 * z * exp(-z^2)
        return -2 * x * np.exp(-np.power(x, 2))
    return np.exp(-np.power(x, 2))

## Concluding notes -- Frank Rosenblatt, connectionism and the Perceptron:

![Frank Rosenblatt and the Perceptron](https://github.com/rahulremanan/python_tutorial/raw/master/Fundamentals_of_deep-learning/media/Frank_Rosenblatt's_Mark_I_Perceptron__Cornell_Aeronautical_Laboratory__Buffalo__New%20York.jpg)

#### Image 1: Frank Rosenblatt working on his Mark 1 Perceptron at Cornell Aeronautical Laboratory in Buffalo, New York, circa 1960.

This notebook was created to coincide with the 90th birth anniversary of the pioneering psychologist and artificial intelligence researcher Frank Rosenblatt (born July 11, 1928; died July 11, 1971). He is known for his work on connectionism and for the Mark 1 Perceptron. This notebook aims to remember the promise, the controversy and the resurgence of connectionism and neural networks as a tool in artificial intelligence.

[Here is a brief biography of Frank Rosenblatt](https://en.wikipedia.org/wiki/Frank_Rosenblatt) (Via Wikipedia):

Frank Rosenblatt was born in New Rochelle, New York, as the son of Dr. Frank and Katherine Rosenblatt. After graduating from The Bronx High School of Science in 1946, he attended Cornell University, where he obtained his A.B. in 1950 and his Ph.D. in 1956.

He then went to Cornell Aeronautical Laboratory in Buffalo, New York, where he was successively a research psychologist, senior psychologist, and head of the cognitive systems section. This is also where he conducted the early work on perceptrons, which culminated in the development and hardware construction of the Mark I Perceptron in 1960. This was essentially the first computer that could learn new skills by trial and error, using a type of neural network that simulates human thought processes.

Rosenblatt’s research interests were exceptionally broad. In 1959 he went to Cornell’s Ithaca campus as director of the Cognitive Systems Research Program and also as a lecturer in the Psychology Department. In 1966 he joined the Section of Neurobiology and Behavior within the newly formed Division of Biological Sciences, as associate professor. Also in 1966, he became fascinated with the transfer of learned behavior from trained to naive rats by the injection of brain extracts, a subject on which he would publish extensively in later years.

In 1970 he became field representative for the Graduate Field of Neurobiology and Behavior, and in 1971 he shared the acting chairmanship of the Section of Neurobiology and Behavior. Frank Rosenblatt died in July 1971 on his 43rd birthday, in a boating accident in Chesapeake Bay.

![Mark 1 Perceptron](https://github.com/rahulremanan/python_tutorial/raw/master/Fundamentals_of_deep-learning/media/Smithsonian_Perceptron.jpg)

#### Image 2: Mark 1 Perceptron at the Smithsonian Institution, Washington, D.C.