# 21. Homework 3. Neural Networks

## 1. Neural Networks

In this problem we will analyze a simple neural network to understand its classification properties. Consider the neural network given in the figure below, with ReLU activation functions (denoted by $f$) on all neurons, and a softmax activation function in the output layer:

![img](img/images_hw4_p1.png)

### 1. Q1


In [38]:
import numpy as np
import math

X = np.array([3.,14,1], dtype=np.float64).reshape([1,3])
W = np.array([1,0,-1,0,1,-1,-1,0,-1,0,-1,-1], dtype=np.float64).reshape([4,3])
V = np.array([1,1,1,1,0,-1,-1,-1,-1,2],dtype=np.float64).reshape([2,5])

print(X, W, V, sep='\n')

def ReLU(x):
    return max(x, 0)

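# Hidden-layer pre-activations z = W x (X is a row vector, hence X @ W.T)
Z = X @ W.T
print("Z: ", Z)

# Hidden-layer activations f(z)
F = np.vectorize(ReLU)(Z)
print("F: ", F)

# Append a constant 1 so the last column of V acts as the output bias
F2 = np.c_[F, np.ones(F.shape[0])]

# Output-layer pre-activations u
U = F2 @ V.T
print(U)

# f(u), the values fed to the softmax
FU = np.vectorize(ReLU)(U)
print(FU)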

[[ 3. 14.  1.]]
[[ 1.  0. -1.]
 [ 0.  1. -1.]
 [-1.  0. -1.]
 [ 0. -1. -1.]]
[[ 1.  1.  1.  1.  0.]
 [-1. -1. -1. -1.  2.]]
Z:  [[  2.  13.  -4. -15.]]
F:  [[ 2. 13.  0.  0.]]
[[ 15. -13.]]
[[15.  0.]]


In [39]:
# X = np.array([3.,14,1], dtype=np.float64).reshape([1,3])
# Y = np.array([3.,14,1], dtype=np.float64).reshape([3,1])

# print(X)
# print(Y)

In [40]:
from math import exp

def o(u1, u2):
    return exp(u1) / (exp(u1) + exp(u2))

# print("o1", o(15, 0))
# print("o2", o(0, 15))
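
The two-class formula in `o` generalizes to any number of outputs. Below is a minimal sketch of a vectorized softmax applied to the rounded values $f(u) = [15, 0]$ printed above; the name `softmax` and the max-subtraction step are additions for illustration, not part of the original cells.

```python
import numpy as np

def softmax(u):
    """Softmax over a 1-D array of scores."""
    u = np.asarray(u, dtype=np.float64)
    # Subtracting the maximum leaves the result unchanged and avoids overflow in exp
    e = np.exp(u - u.max())
    return e / e.sum()

fu = np.array([15.0, 0.0])  # f(u) taken from the Q1 output above
probs = softmax(fu)
print("o1, o2:", probs)
```

With these inputs almost all of the probability mass lands on the first output, since the two scores differ by 15.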

### 1. Q2

In [41]:
# print(o(1,1))
# print(o(0,2))
# print(o(3,0))


In [42]:
def oo(u1,u2,beta):
    return exp( beta * ReLU(u1)) / ( exp( beta * ReLU(u1)) + exp( beta * ReLU(u2)))

### Inverse Temperature

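https://en.wikipedia.org/wiki/Softmax_function

In `oo` above, `beta` scales the inputs before the softmax and plays the role of an inverse temperature: a larger `beta` pushes the output toward a hard argmax, while a `beta` near 0 pushes it toward a uniform distribution.

The softmax is also invariant to adding the same constant $c$ to every input:

$$ softmax(z + c) = softmax(z) $$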
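
Both properties can be checked numerically. The sketch below assumes a hypothetical helper `softmax_beta` and an arbitrary example input `z = [3, 0]`; neither comes from the original cells.

```python
import numpy as np

def softmax_beta(z, beta=1.0):
    """Softmax of beta * z; beta is the inverse temperature."""
    s = beta * np.asarray(z, dtype=np.float64)
    e = np.exp(s - s.max())  # the shift by max(s) relies on shift invariance
    return e / e.sum()

z = np.array([3.0, 0.0])  # arbitrary example scores

# Shift invariance: adding the same constant to every input changes nothing
print(np.allclose(softmax_beta(z), softmax_beta(z + 10.0)))

# Inverse temperature: larger beta gives a sharper distribution
for beta in (0.1, 1.0, 10.0):
    print(beta, softmax_beta(z, beta))
```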

## 2. LSTM

A sigmoid function is a function with an S-shaped curve.

In ML, the term "sigmoid function" is an alias for the logistic function:

$$ sigmoid(x) = { 1 \over 1 + e^{-x} } = { e^x \over e^x + 1 } $$


Sources:
* [Sigmoid - Wikipedia](https://en.wikipedia.org/wiki/Sigmoid_function)

In [43]:
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# print(0, sigmoid(0))
# print(-1, sigmoid(-1))


In [44]:
def tanh(x):
    return((math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x)))
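
The hand-written `sigmoid` and `tanh` above call `math.exp` directly, which raises `OverflowError` once the argument exceeds roughly 709. A minimal sketch of an overflow-safe variant follows; it switches between the two equivalent forms of the logistic function given earlier. The name `sigmoid_stable` is hypothetical, and `math.tanh` is the standard-library function.

```python
import math

def sigmoid_stable(x):
    """Logistic function that never exponentiates a large positive number."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))   # form 1 / (1 + e^-x)
    ex = math.exp(x)                        # x < 0, so ex is in (0, 1)
    return ex / (ex + 1.0)                  # form e^x / (e^x + 1)

for v in (-1000.0, -1.0, 0.0, 1.0, 1000.0):
    print(v, sigmoid_stable(v), math.tanh(v))
```

The two functions are related by $tanh(x) = 2 \cdot sigmoid(2x) - 1$, so a stable sigmoid also gives a stable tanh.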

$$ f_t = sigmoid(W^{f,h}h_{t-1} + W^{f,x}x_{t} + b_f) $$


In [45]:
W_fh = 0.
W_ih = 0.
W_oh = 0
W_fx = 0.
W_ix = 100.
W_ox = 100
b_f = -100.
b_i = 100.
b_o = 0
W_ch = -100.
W_cx = 50
b_c = 0

In [46]:
nsteps = 6

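# The extra last slot doubles as index -1: h[-1] and c[-1] hold the
# initial states h_{-1} = c_{-1} = 0, so h[t-1] is defined at t = 0.
h = [None] * (nsteps + 1)
h[-1] = 0

c = [None] * (nsteps + 1)
c[-1] = 0

# Input sequence (all zeros here)
x = [0.] * (nsteps + 1)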

print(h)

[None, None, None, None, None, None, 0]


In [47]:
def forget_gate(t, h, x, W_fh=W_fh, W_fx=W_fx, b_f=b_f):
    return sigmoid(W_fh * h[t-1] + W_fx * x[t] + b_f)

$$ i_t = sigmoid(W^{i,h}h_{t-1} + W^{i,x}x_{t} + b_i) $$

In [48]:
def input_gate(t, h, x, W_ih=W_ih, W_ix=W_ix, b_i=b_i):
    return sigmoid(W_ih * h[t-1] + W_ix * x[t] + b_i)


$$ o_t = sigmoid(W^{o,h}h_{t-1} + W^{o,x}x_{t} + b_o) $$

In [49]:
def output_gate(t, h, x, W_oh=W_oh, W_ox=W_ox, b_o=b_o):
    return sigmoid(W_oh * h[t-1] + W_ox * x[t] + b_o)

$$ c_t = f_t \odot c_{t-1} + i_t \odot tanh(W^{c,h}h_{t-1} + W^{c,x}x_{t} + b_c) $$

In [50]:
def cell(t, h, x, W_ch=W_ch, W_cx=W_cx, b_c=b_c):
    return forget_gate(t, h, x) * c[t-1] + input_gate(t, h, x) * tanh(W_ch * h[t-1] + W_cx * x[t] + b_c)

$$ h_t = round(o_t \odot tanh(c_t)) $$

In [51]:
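# Python's round() rounds halves to the nearest even integer,
# so round(0.5) == 0 and round(-0.5) == 0.

def h_state(t, h, x):
    # Round the hidden state to an integer
    return round(output_gate(t, h, x) * tanh(cell(t, h, x)))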

In [52]:
def learn():
    # Forward pass only: fill in c[t] and h[t] for each time step (no weights are updated)
    for t in range(nsteps):
        c[t] = cell(t, h, x)
        h[t] = h_state(t, h, x)

    print("h", h[:nsteps+1])

# learn()
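
The gate, cell, and hidden-state updates above can also be collected into one self-contained function that carries the previous state in local variables instead of the `h[-1]` list trick. This is a sketch under stated assumptions: `run_lstm` is a hypothetical name, `math.tanh` replaces the hand-written `tanh`, and the second input sequence is an invented example rather than data from these notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Weights and biases copied from the cells above
W_fh, W_fx, b_f = 0.0, 0.0, -100.0
W_ih, W_ix, b_i = 0.0, 100.0, 100.0
W_oh, W_ox, b_o = 0.0, 100.0, 0.0
W_ch, W_cx, b_c = -100.0, 50.0, 0.0

def run_lstm(xs, h_prev=0, c_prev=0.0):
    """Return the rounded hidden states, one per input in xs."""
    hs = []
    for x_t in xs:
        f_t = sigmoid(W_fh * h_prev + W_fx * x_t + b_f)   # forget gate
        i_t = sigmoid(W_ih * h_prev + W_ix * x_t + b_i)   # input gate
        o_t = sigmoid(W_oh * h_prev + W_ox * x_t + b_o)   # output gate
        c_t = f_t * c_prev + i_t * math.tanh(W_ch * h_prev + W_cx * x_t + b_c)
        h_t = round(o_t * math.tanh(c_t))                 # rounded hidden state
        hs.append(h_t)
        h_prev, c_prev = h_t, c_t
    return hs

print(run_lstm([0, 0, 0, 0, 0, 0]))  # the all-zero input used above
print(run_lstm([1, 1, 0, 1, 1]))     # hypothetical input sequence
```

With these weights the forget gate is effectively closed and the input gate effectively open, so each cell state is determined almost entirely by the current input and the previous hidden state.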

In [53]:
# print("h", h[:6])

In [54]:
# nsteps = 10

# h = [None] * (nsteps + 1)
# h[-1] = 0

# print(h)

# c = [None] * (nsteps + 1)
# c[-1] = 0

# x = [0.,0.,0.,0.,0.,0.,0.,0.,0.,0.]

# for t in range(nsteps):

#     c[t] = cell(t, h, x)
#     h[t] = h_state(t, h, x)
    
# print("h", h[:nsteps+1])

## 3. Backpropagation

One of the key steps for training multi-layer neural networks is stochastic gradient descent. We will use the back-propagation algorithm to compute the gradient of the loss function with respect to the model parameters.

Consider the $L$-layer neural network below:



Sources:
* https://denizyuret.github.io/Knet.jl/latest/backprop/
* http://neuralnetworksanddeeplearning.com/chap2.html

In [55]:
import math
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def dSigmoid(x):
    return sigmoid(x)*(1-sigmoid(x))
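
The identity used in `dSigmoid` follows from differentiating the logistic function directly. A short derivation, using the same $sigmoid$ notation as section 2:

$$ {d \over dx} sigmoid(x) = { e^{-x} \over (1 + e^{-x})^2 } = { 1 \over 1 + e^{-x} } \cdot { e^{-x} \over 1 + e^{-x} } = sigmoid(x) \, (1 - sigmoid(x)) $$

The last step uses $ { e^{-x} \over 1 + e^{-x} } = 1 - { 1 \over 1 + e^{-x} } $.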

In [56]:
def LossC(y, t):
    return 1/2 * (y-t)**2

In [57]:
t = 1
x = 3
w_1 = 0.01
w_2 = -5
b = -1

In [58]:
z_1 = w_1 * x
print("z_1",z_1)

a_1 = ReLU(z_1)
print("a_1",a_1)

z_2 = w_2 * a_1 + b
print("z_2",z_2)

y = sigmoid(z_2)
print("y", y)

C = LossC(y, t)
print("C", C)
print("Loss", 0.28842841648243966)

z_1 0.03
a_1 0.03
z_2 -1.15
y 0.24048908305088898
C 0.28842841648243966
Loss 0.28842841648243966


In [59]:
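def dLossC(w, x, t):
    # d/dw of (w*x - t)**2, i.e. squared error on a linear unit;
    # this omits the sigmoid and the 1/2 factor used in LossC above
    return 2*x*(w*x-t)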

In [60]:
dLossC(w_2, w_1*x, t)

-0.06899999999999999
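
For the network in the forward pass above (ReLU hidden unit, sigmoid output, loss $C = {1 \over 2}(y - t)^2$), the chain rule gives the gradients with respect to $w_1$, $w_2$ and $b$. The sketch below is one possible way to write them out and check them against central finite differences; `forward` and `gradients` are hypothetical helper names, and the ReLU derivative at exactly 0 is assumed to be 0.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_1, w_2, b, t):
    z_1 = w_1 * x
    a_1 = max(z_1, 0.0)            # ReLU
    z_2 = w_2 * a_1 + b
    y = sigmoid(z_2)
    C = 0.5 * (y - t) ** 2
    return z_1, a_1, z_2, y, C

def gradients(x, w_1, w_2, b, t):
    z_1, a_1, z_2, y, _ = forward(x, w_1, w_2, b, t)
    dC_dz2 = (y - t) * y * (1.0 - y)                    # dC/dy * dy/dz_2
    dC_dw2 = dC_dz2 * a_1
    dC_db = dC_dz2
    dC_dz1 = dC_dz2 * w_2 * (1.0 if z_1 > 0 else 0.0)   # back through ReLU
    dC_dw1 = dC_dz1 * x
    return dC_dw1, dC_dw2, dC_db

x, t, w_1, w_2, b = 3.0, 1.0, 0.01, -5.0, -1.0
g_w1, g_w2, g_b = gradients(x, w_1, w_2, b, t)

# Central finite differences as an independent check
eps = 1e-6
n_w1 = (forward(x, w_1 + eps, w_2, b, t)[4] - forward(x, w_1 - eps, w_2, b, t)[4]) / (2 * eps)
n_w2 = (forward(x, w_1, w_2 + eps, b, t)[4] - forward(x, w_1, w_2 - eps, b, t)[4]) / (2 * eps)
n_b = (forward(x, w_1, w_2, b + eps, t)[4] - forward(x, w_1, w_2, b - eps, t)[4]) / (2 * eps)

print("analytic:", g_w1, g_w2, g_b)
print("numeric: ", n_w1, n_w2, n_b)
```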

In [61]:
# import torch
# x1 = torch.tensor(3., requires_grad=False,  dtype=torch.float64)
# x2 = torch.tensor(1., requires_grad=False,  dtype=torch.float64)

# w11 = torch.tensor(6., requires_grad=True)
# w21 = torch.tensor(-2., requires_grad=True)
# w12 = torch.tensor(-3., requires_grad=True)
# w22 = torch.tensor(5., requires_grad=True)

# v11 = torch.tensor(1., requires_grad=True)
# v21 = torch.tensor(0.25, requires_grad=True)
# v12 = torch.tensor(-2., requires_grad=True)
# v22 = torch.tensor(2., requires_grad=True)

# t1 = torch.tensor(1., requires_grad=False)
# t2 = torch.tensor(0., requires_grad=False)

# # calculating the hidden layer 
# h1 = torch.sigmoid(w11*x1 + w21*x2)
# h2 = torch.sigmoid(w12*x1 + w22*x2)

# # calculating the output layer
# y1 = torch.sigmoid(v11*h1 + v21*h2)
# y2 = torch.sigmoid(v12*h1 + v22*h2)

# e1 = y1 - t1
# e2 = y2 - t2

# # the loss function 
# l = e1**2 + e2**2 

# l.backward()

# print("First layer gradients by framework")
# print(w11.grad, w12.grad)
# print(w21.grad, w22.grad)
# #>>> tensor(-5.6607e-08) tensor(0.0014)
# #>>> tensor(-1.8869e-08) tensor(0.0005)

# print("First layer gradients manually")
# print(grad_w11, grad_w12)
# print(grad_w21, grad_w22) 

In [62]:
import torch
x = torch.tensor(3., requires_grad=False, dtype=torch.float64)
w1 = torch.tensor(0.01, requires_grad=True)
b = torch.tensor(-1., requires_grad=True)
w2 = torch.tensor(-5., requires_grad=True)
t = torch.tensor(1., requires_grad=False)

# calculating the hidden layer (ReLU, as in the manual forward pass above)
h1 = torch.relu(w1*x)

# calculating the output layer
y1 = torch.sigmoid(w2*h1 + b)

e = y1 - t

# the loss function 
C = 1/2*(e**2) 

C.backward()

# print(w1.grad)
# print(w2.grad)
# print(b.grad)

## 4. Word embeddings

* [ ] The first network cannot classify the sequence correctly
* [ ] The second network cannot classify the sequence correctly
* [ ] The second network has a fighting chance at classifying the sequence
* [ ] The first network has a fighting chance at classifying the sequence

https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526 

Embedding: a method used to represent (map) a discrete (categorical) variable as a vector of continuous numbers.
The vectors are low-dimensional and learned.
They reduce the dimensionality of the categorical variable.

3 purposes:
- find nearest neighbors
- as input to a machine learning model for a supervised task
- for visualization of concepts and relations between categories

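Embeddings overcome the limitations of one-hot encoding.
Drawbacks of one-hot encoding:
- size of the matrix (1 dimension per category)
- the mapping carries no meaning: similar categories are not placed closer together

Embeddings are learned by training a neural network on a supervised task (with data).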
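
To make the contrast with one-hot encoding concrete, here is a minimal NumPy sketch. Everything in it is illustrative: the four-word vocabulary is hypothetical, and the embedding matrix `E` is filled with random numbers as a stand-in for weights that would normally be learned on a supervised task.

```python
import numpy as np

categories = ["cat", "dog", "car", "truck"]   # hypothetical vocabulary
n_categories, dim = len(categories), 2

rng = np.random.default_rng(0)
E = rng.normal(size=(n_categories, dim))      # untrained placeholder embedding matrix

idx = categories.index("dog")
one_hot = np.zeros(n_categories)              # one dimension per category
one_hot[idx] = 1.0

# Multiplying a one-hot vector by E is the same as looking up row idx
via_matmul = one_hot @ E
via_lookup = E[idx]

def nearest(i, E):
    """Index of the category whose embedding has the highest cosine similarity to row i."""
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = unit @ unit[i]
    sims[i] = -np.inf                         # exclude the query itself
    return int(np.argmax(sims))

print(one_hot.shape, via_lookup.shape)        # one-hot size vs embedding size
print(categories[nearest(idx, E)])            # not meaningful until E is trained
```

The nearest-neighbor result only becomes meaningful once `E` has been trained, which is the first of the three purposes listed above.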


