# Deep Learning (DL)

* technically: enabled by GPUs (massive parallelism for simple operations)
* GPU: graphics processing unit
* renaissance of approaches from the 1980s (Rumelhart et al.)
* very successful, and not only in NLP
* by now a methodological commitment in NLP
* many architectures: Feedforward (FF), RNN, CNN, LSTM, ...
* word embeddings are crucial for NLP
* learning means minimizing a loss function with an optimizer (e.g. Adam)
* backpropagation as the learning method
* activation functions provide non-linearity, e.g. sigmoid and softmax
* hyperparameter tuning (carried out manually)


## Activation Functions

* an activation function takes the input of a node and maps it to a new output value
* in DL, non-linear activation functions are applied

Why do we need non-linear activation functions?

* because multiplying vectors (matrices) by matrices is just a linear (!) transformation, so several stacked linear layers collapse into a single linear layer
* (see the script math.pdf in Olat, e.g. the eigenvalue part)
* in order to deal with XOR problems we need non-linear transformations
* examples: sigmoid, softmax, tanh, hard tanh and rectified linear unit (ReLU)
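
The two claims above can be checked by hand. The following is a minimal sketch in plain Python; the weights are chosen purely for illustration and the helper names `matvec`, `matmul` and `xor_net` are hypothetical. It first shows that two stacked linear maps equal one linear map, and then that a tiny two-layer net with ReLU computes XOR.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) with vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Two linear layers without an activation function in between ...
W1 = [[1, 2], [3, 4]]
W2 = [[0, 1], [-1, 2]]
x = [5, -1]

two_layers = matvec(W2, matvec(W1, x))
one_layer = matvec(matmul(W2, W1), x)   # ... equal one layer with W = W2 W1
print(two_layers, one_layer)

def relu(z):
    return max(0, z)

def xor_net(x1, x2):
    h1 = relu(x1 + x2)        # hidden unit 1
    h2 = relu(x1 + x2 - 1)    # hidden unit 2 (bias -1)
    return h1 - 2 * h2        # linear output layer

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_net(a, b))
```

Without the ReLU in the hidden layer, `xor_net` would again be a single linear function of its inputs, and no linear function reproduces the XOR table.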

##  Loss Functions

* assign to each pair ($y,\hat{y}$) a value that quantifies the loss the current model incurs, given the true class (or value) $y$ and the prediction $\hat{y}$
* MSE, Cross Entropy Loss, ...
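
As a rough illustration of the two losses named above, here is a sketch in plain Python. The function names and the example numbers are assumptions made for this sketch; the cross entropy is written for a single example with raw scores (logits) and a true class index.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error over paired lists of numbers."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class index."""
    return -math.log(softmax(logits)[target])

print(mse([1.0, 0.0], [0.9, 0.2]))           # regression-style loss
print(cross_entropy([2.0, 0.5, -1.0], 0))    # true class has the highest score
print(cross_entropy([2.0, 0.5, -1.0], 2))    # true class has the lowest score
```

The cross entropy is small when the true class receives a high score and grows as the model puts probability mass on other classes.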

## Frameworks

Frameworks provide high-level modules for tasks tightly coupled with DL:


* DyNet, Keras, Theano, TensorFlow, PyTorch, ...

## PyTorch

PyTorch is a Python-based framework for specifying and training neural nets.


Building a neural net (NN) with PyTorch:

* use the module torch.nn
* specify a loss function
* specify an optimizer
* define a forward pass
* let torch do the backward pass
* iterate some epochs or until a stopping criterion is met
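
The steps listed above could look roughly as follows. This is only a sketch: the toy data, the learning rate, the epoch limit and the stopping threshold are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)                       # reproducible initial weights

# toy data (made up for illustration): y = 2*x1 - x2
X = torch.tensor([[1., 0.], [0., 1.], [1., 1.], [2., 1.]])
Y = torch.tensor([[2.], [-1.], [1.], [3.]])

model = nn.Linear(2, 1)                    # use module nn
criterion = nn.MSELoss()                   # specify a loss function
optimizer = optim.SGD(model.parameters(), lr=0.1)   # specify an optimizer

losses = []
for epoch in range(500):                   # iterate some epochs ...
    optimizer.zero_grad()                  # clear gradients of the previous step
    loss = criterion(model(X), Y)          # forward pass and loss
    loss.backward()                        # torch does the backward pass
    optimizer.step()                       # update the parameters
    losses.append(loss.item())
    if loss.item() < 1e-4:                 # ... or until a stopping criterion is met
        break

print("epochs run:", len(losses), "final loss:", losses[-1])
```

Here the forward pass is simply the call `model(X)`; in a custom `nn.Module` it is defined in the `forward` method.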


##  Optimizer

An optimizer takes the gradient of the loss function and determines, in its own way, the change (delta) applied to each weight. The loss function defines the loss; the optimizer defines how best to reduce it.

Basically, we have (the delta rule):

$$w_j = w_j - \eta \frac{\partial J}{\partial w_j}$$

where $J$ is some loss function and $\eta$ is the learning rate.

With momentum (as in SGD with momentum; Adam combines a similar idea with per-parameter adaptive learning rates) we have:

$$w_j = w_j - \eta \frac{\partial J}{\partial w_j} + \gamma v_t$$

where $v_t$ is the last change made to $w_j$ and $\gamma$ is called momentum.

The effect is: as long as successive gradients point in the same direction, the changes add up and we move down faster.
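
The two update rules above can be compared on a one-dimensional toy problem. The sketch below assumes the loss $J(w) = (w-3)^2$, a start at $w = 0$, $\eta = 0.1$ and $\gamma = 0.5$; all of these values, and the function names, are chosen only for illustration.

```python
def grad(w):
    """Gradient of the toy loss J(w) = (w - 3)**2."""
    return 2 * (w - 3)

def plain_steps(w, eta, steps):
    """Delta rule: w <- w - eta * dJ/dw."""
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

def momentum_steps(w, eta, gamma, steps):
    """Delta rule plus gamma times the previous change v."""
    v = 0.0
    for _ in range(steps):
        v = -eta * grad(w) + gamma * v   # current change, reusing the last one
        w = w + v
    return w

w_plain = plain_steps(0.0, eta=0.1, steps=5)
w_mom = momentum_steps(0.0, eta=0.1, gamma=0.5, steps=5)
print(w_plain, w_mom)   # the minimum is at w = 3
```

After five steps the momentum variant is much closer to the minimum. With a larger $\gamma$ it can also overshoot and oscillate around the minimum, so the speed-up is not free.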

## Gradients

* we use tensors
* do operations on tensors
* we tell PyTorch to remember the operations
* we tell PyTorch to do the backward pass

e.g. torch.randn(3, 5, requires_grad=True)

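* creates a matrix with 3 rows and 5 columns, filled with random values from a standard normal distribution
* 'requires_grad=True' means that PyTorch keeps track of the operations we carry out on it, so that gradients can be computed later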

In [None]:
import torch
torch.randn(3, 5, requires_grad=True)
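
To see the tracking and the backward pass at work, here is a small sketch with a function whose gradient is easy to check by hand. The input values are arbitrary.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)  # track operations on x
y = (x ** 2).sum()        # forward computation: y = x1^2 + x2^2 + x3^2
y.backward()              # backward pass: fills x.grad with dy/dx
print(x.grad)             # dy/dx_i = 2 * x_i
```

Each entry of `x.grad` is twice the corresponding entry of `x`, as expected from the derivative of $x_i^2$.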

# Some activation functions

### sigmoid

In [None]:
import torch
import matplotlib.pyplot as plt

x = torch.arange(-5., 5., 0.1)
y = torch.sigmoid(x)
plt.plot(x.numpy(), y.detach().numpy())
plt.show()

### tanh

In [None]:
import torch
import matplotlib.pyplot as plt

x = torch.arange(-5., 5., 0.1)
y = torch.tanh(x)

plt.plot(x.numpy(), y.detach().numpy())
plt.show()

### relu

In [None]:
import torch
import matplotlib.pyplot as plt

relu = torch.nn.ReLU()
x = torch.arange(-5., 5., 0.1)
y = relu(x)

plt.plot(x.numpy(), y.detach().numpy())
plt.show()

## Some loss functions


### MSE (mean squared error)

In [None]:
import torch
import torch.nn as nn

mse_loss = nn.MSELoss()
outputs = torch.randn(3, 5, requires_grad=True)
targets = torch.randn(3, 5)
loss = mse_loss(outputs, targets)
loss.backward()
print(loss)

### CrossEntropyLoss

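* is used for categorical outputs (classification)
* in learning mode the net input is used (i.e. the raw scores before the application of any activation function); the softmax is applied inside the loss
* requires a class index for each example ([1,0,3] below, since we have 3 rows = examples, each with 5 class scores)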

In [None]:
import torch
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()
outputs = torch.randn(3, 5, requires_grad=True)
targets = torch.tensor([1, 0, 3], dtype=torch.int64)
loss = ce_loss(outputs, targets)
loss.backward()
print(loss)

## A simple computation graph

In [None]:
import torch

W = torch.tensor([[ 0.2111], [-0.6587]], requires_grad=True)
x = torch.tensor([[1, 0]], dtype=torch.float32, requires_grad=True)
b = torch.tensor([-0.3705], requires_grad=True)

y = torch.matmul(x, W) + b    # a linear mapping y = xW + b

y

## Linear Transformations

nn.Linear carries out: $xW^T+b$

In [None]:
import torch.nn as nn

linear_trans = nn.Linear(2,1)   # input dim = 2, output dim = 1

print(linear_trans)

In [None]:
linear_trans.weight, linear_trans.bias  # weights and bias  are randomly initialized

In [None]:
# set your own weights and bias

my_weights=torch.tensor([[0.2111,-0.6587]], requires_grad=True)
my_bias= torch.tensor([-0.3705], requires_grad=True)

linear_trans.weight.data = my_weights
linear_trans.bias.data = my_bias

linear_trans.weight, linear_trans.bias

In [None]:
# do the linear transformation

print("our input vector x", x)
# both compute (1,0) * (0.2111,-0.6587)^T + b, i.e. 0.2111 * 1 + -0.6587 * 0 + -0.3705
linear_trans(x), torch.matmul(x, W) + b

In [None]:
# check it manually: the result matches

0.2111 * 1 + -0.6587 * 0 + -0.3705

# Create neural nets in PyTorch

* we define a class Net with a linear mapping xW^T+b and a sigmoid activation function
* we set our own weights and bias to get stable output for teaching purposes; normally they are initialized randomly
* we instantiate it: net=Net()
* we provide some input [1,0] and look at the output (the sigmoid applied to the linear transformation)
* we define the real target (i.e. output) value
* we define a loss function (squared error) manually
* we use SGD (Stochastic Gradient Descent) as the optimizer
* we compute the loss (the value itself is not needed further, only its gradients)
* we let PyTorch do the backward pass
* and optimize one step, i.e. alter the weights
* and then we redo it manually in order to understand what happened

In [None]:
import torch
import torch.nn as nn

class Net(nn.Module):   
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(2,1)   # input dim = 2, output dim = 1, i.e. a simple perceptron
        
        # we set weights and bias in order to have stable output for teaching
        my_weights=torch.tensor([[ 0.2, 0.6]], requires_grad=True)
        my_bias= torch.tensor([2.0], requires_grad=True)

        self.fc1.weight.data = my_weights
        self.fc1.bias.data = my_bias
        
    def forward(self, x):
        x = torch.sigmoid(self.fc1(x))  # non-linearity, no perceptron any longer
        return x
    
net=Net()    
net   

In [None]:
import torch.optim as optim

optimizer = optim.SGD(net.parameters(), lr=0.1)

input = torch.tensor([[1,0]], dtype=torch.float32, requires_grad=True)
target = torch.tensor([[1]], dtype=torch.float32)   # the target is fixed, so it needs no gradient

def criterion(out, label):   # MSE, our own definition of loss/cost function J
    return (label - out)**2

out = net(input)   # forward pass

print("the initial parameters:",list(net.parameters()))

print("\noutput of net=",out)

optimizer.zero_grad()   # reset gradients to zero

loss=criterion(out,target)

print("\nloss i.e. (1-0.9)**2",loss)

loss.backward()

optimizer.step()   # one step weight/bias adaptation

#input.grad

print("\nthe adapted parameters",list(net.parameters()))

print("\nthe gradients:",net.fc1.weight.grad, net.fc1.bias.grad)
# input.grad

In [None]:
# backpropagation done manually, i.e. chain rule application

# gradient of w_1 (0.2) is -0.0179 

-2*(1-0.9002)*0.9002*(1-0.9002)*1    
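
The product in the cell above comes from the chain rule. As a sketch, write $z = w_1 x_1 + w_2 x_2 + b$ for the net input and $\mathrm{out} = \sigma(z)$ for the output of the net (these symbol names are chosen here for illustration). With $J = (y - \mathrm{out})^2$ we get:

$$\frac{\partial J}{\partial w_1} = \frac{\partial J}{\partial \mathrm{out}} \cdot \frac{\partial \mathrm{out}}{\partial z} \cdot \frac{\partial z}{\partial w_1} = -2\,(y - \mathrm{out}) \cdot \mathrm{out}\,(1 - \mathrm{out}) \cdot x_1$$

With $y = 1$, $\mathrm{out} \approx 0.9002$ and $x_1 = 1$ this gives about $-0.0179$. The bias receives the same gradient, since $\partial z / \partial b = 1$, and $w_2$ receives the gradient 0, since $x_2 = 0$.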

In [None]:
# determining the new weight w_1 by applying the delta rule (that is SGD)

dx = -0.017932056015999998

w1 = 0.2
w1 = w1 - 0.1 * dx   # learning rate 0.1
w1          # the new, incremented weight w_1

### Output of net.parameters()

[Parameter containing:
tensor([[0.3100, 0.6000]], requires_grad=True), Parameter containing:
tensor([2.1100], requires_grad=True)]

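The list contains the weight matrix [[0.3100, 0.6000]] and the bias vector [2.1100]. These printed values do not match a single update step from the initial parameters; presumably the update cell had been run several times before this output was recorded. A single step yields 0.2018 and 2.0018.

A more readable way to print the parameters: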

In [None]:
for name, param in net.named_parameters():
    if param.requires_grad:
        print("\t",name, param.data)

We find: weight 1 increases from 0.2 to 0.2018, and the bias term from 2 to 2.0018. Weight 2 stays at 0.6 because its input is 0.
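
The whole update step can also be reproduced without PyTorch. The following sketch in plain Python recomputes the forward pass, the gradients and the new parameters from the values used in this section; the variable names are chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w1, w2, b = 0.2, 0.6, 2.0      # initial parameters from the Net above
x1, x2, y = 1.0, 0.0, 1.0      # input [1, 0] and target 1
lr = 0.1                       # learning rate passed to SGD

out = sigmoid(w1 * x1 + w2 * x2 + b)          # forward pass
common = -2 * (y - out) * out * (1 - out)     # dJ/dz for J = (y - out)**2
grad_w1, grad_w2, grad_b = common * x1, common * x2, common

w1_new = w1 - lr * grad_w1
w2_new = w2 - lr * grad_w2
b_new = b - lr * grad_b
print(round(out, 4), round(grad_w1, 4))
print(round(w1_new, 4), round(w2_new, 4), round(b_new, 4))
```

Rounded to four decimals this gives the output 0.9002, the gradient -0.0179 and the new parameters 0.2018, 0.6 and 2.0018.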

## Adding layers

In [1]:
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(2,5)
        self.fc2 = nn.Linear(5,1)
    def forward(self, x):
        x = torch.sigmoid(self.fc2(torch.sigmoid(self.fc1(x))))
        return x


In [2]:
net=Net()
print(net)

Net(
  (fc1): Linear(in_features=2, out_features=5, bias=True)
  (fc2): Linear(in_features=5, out_features=1, bias=True)
)


In [9]:
inputs = torch.tensor([[1, 0], [1, 1], [0, 1], [0, 0]], dtype=torch.float32, requires_grad=True)

net(inputs)

tensor([[0.5258],
        [0.5267],
        [0.5241],
        [0.5232]], grad_fn=<SigmoidBackward>)

In [3]:
import torch.optim as optim

def criterion(out, label):
    return (label - out)**2

optimizer = optim.SGD(net.parameters(), lr=0.1, momentum=0.5)

In [4]:
#net=Net()

data = [[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 0]] 

for epoch in range(1400):
    for i, (x1, x2, y) in enumerate(data):
        optimizer.zero_grad()                      # reset gradients to zero
        X = torch.tensor([x1, x2], dtype=torch.float32)
        Y = torch.tensor([y], dtype=torch.float32)
        outputs = net(X)                           # forward pass
        loss = criterion(outputs, Y)
        loss.backward()                            # backward pass
        optimizer.step()                           # update weights and biases
        if i == 0:                                 # report once per epoch (loss of the first example)
            print("Epoch {} - loss: {}".format(epoch, loss.item()))


Epoch 0 - loss: 0.12603244185447693
Epoch 1 - loss: 0.11866507679224014
Epoch 2 - loss: 0.11145086586475372
Epoch 3 - loss: 0.1055215522646904
Epoch 4 - loss: 0.1006547212600708
Epoch 5 - loss: 0.09661699086427689
Epoch 6 - loss: 0.0932324230670929
Epoch 7 - loss: 0.09036954492330551
Epoch 8 - loss: 0.08792856335639954
Epoch 9 - loss: 0.08583252131938934
Epoch 10 - loss: 0.08402121067047119
Epoch 11 - loss: 0.0824468731880188
Epoch 12 - loss: 0.08107119798660278
Epoch 13 - loss: 0.07986308634281158
Epoch 14 - loss: 0.0787971094250679
Epoch 15 - loss: 0.07785216718912125
Epoch 16 - loss: 0.07701076567173004
Epoch 17 - loss: 0.07625820487737656
Epoch 18 - loss: 0.07558207958936691
Epoch 19 - loss: 0.07497184723615646
Epoch 20 - loss: 0.07441853731870651
Epoch 21 - loss: 0.07391445338726044
Epoch 22 - loss: 0.07345297187566757
Epoch 23 - loss: 0.07302837073802948
Epoch 24 - loss: 0.0726357027888298
Epoch 25 - loss: 0.07227069139480591
Epoch 26 - loss: 0.07192959636449814
...

Epoch 262 - loss: 0.0006699099321849644
Epoch 263 - loss: 0.0006553609273396432
Epoch 264 - loss: 0.0006412251968868077
Epoch 265 - loss: 0.0006274895858950913
Epoch 266 - loss: 0.0006141399499028921
Epoch 267 - loss: 0.0006011630757711828
Epoch 268 - loss: 0.0005885481950826943
Epoch 269 - loss: 0.0005762828513979912
Epoch 270 - loss: 0.0005643542390316725
Epoch 271 - loss: 0.000552753044757992
Epoch 272 - loss: 0.0005414671613834798
Epoch 273 - loss: 0.0005304872174747288
Epoch 274 - loss: 0.0005198032595217228
Epoch 275 - loss: 0.000509405683260411
Epoch 276 - loss: 0.0004992848262190819
Epoch 277 - loss: 0.0004894321318715811
Epoch 278 - loss: 0.000479839596664533
Epoch 279 - loss: 0.0004704986640717834
Epoch 280 - loss: 0.00046140182530507445
Epoch 281 - loss: 0.00045254052383825183
Epoch 282 - loss: 0.00044390736729837954
Epoch 283 - loss: 0.00043549586553126574
Epoch 284 - loss: 0.0004272995865903795
Epoch 285 - loss: 0.0004193107597529888
Epoch 286 - loss: 0.0004115242045372724

Epoch 509 - loss: 3.8062993553467095e-05
Epoch 510 - loss: 3.782552812481299e-05
Epoch 511 - loss: 3.759035826078616e-05
Epoch 512 - loss: 3.7357458495534956e-05
Epoch 513 - loss: 3.71268397429958e-05
Epoch 514 - loss: 3.689841469167732e-05
Epoch 515 - loss: 3.667215787572786e-05
Epoch 516 - loss: 3.6447967431740835e-05
Epoch 517 - loss: 3.622593794716522e-05
Epoch 518 - loss: 3.600593481678516e-05
Epoch 519 - loss: 3.57880890078377e-05
Epoch 520 - loss: 3.5572225897340104e-05
Epoch 521 - loss: 3.5358356399228796e-05
Epoch 522 - loss: 3.5146516893291846e-05
Epoch 523 - loss: 3.4936630981974304e-05
Epoch 524 - loss: 3.472868411336094e-05
Epoch 525 - loss: 3.4522658097557724e-05
Epoch 526 - loss: 3.4318509278818965e-05
Epoch 527 - loss: 3.411630314076319e-05
Epoch 528 - loss: 3.391589052625932e-05
Epoch 529 - loss: 3.371729690115899e-05
Epoch 530 - loss: 3.352056228322908e-05
Epoch 531 - loss: 3.33255338773597e-05
Epoch 532 - loss: 3.313224442536011e-05
...

Epoch 766 - loss: 1.21503762784414e-05
Epoch 767 - loss: 1.2111296200600918e-05
Epoch 768 - loss: 1.207241712108953e-05
Epoch 769 - loss: 1.2033747225359548e-05
Epoch 770 - loss: 1.1995265595032834e-05
Epoch 771 - loss: 1.1956978596572299e-05
Epoch 772 - loss: 1.1918903510377277e-05
Epoch 773 - loss: 1.1881017599080224e-05
Epoch 774 - loss: 1.184331631520763e-05
Epoch 775 - loss: 1.1805817848653533e-05
Epoch 776 - loss: 1.1768500371545088e-05
Epoch 777 - loss: 1.1731368431355804e-05
Epoch 778 - loss: 1.1694432942022104e-05
Epoch 779 - loss: 1.165768026112346e-05
Epoch 780 - loss: 1.1621131307038013e-05
Epoch 781 - loss: 1.1584772437345237e-05
Epoch 782 - loss: 1.154860365204513e-05
Epoch 783 - loss: 1.1512593118823133e-05
Epoch 784 - loss: 1.147675902757328e-05
Epoch 785 - loss: 1.144111138273729e-05
Epoch 786 - loss: 1.1405652003304567e-05
Epoch 787 - loss: 1.1370356332918163e-05
Epoch 788 - loss: 1.1335253475408535e-05
Epoch 789 - loss: 1.1300315236439928e-05
...

Epoch 1003 - loss: 6.506918452942045e-06
Epoch 1004 - loss: 6.492790816992056e-06
Epoch 1005 - loss: 6.478730938397348e-06
Epoch 1006 - loss: 6.46470698484336e-06
Epoch 1007 - loss: 6.45073805571883e-06
Epoch 1008 - loss: 6.436810963350581e-06
Epoch 1009 - loss: 6.4229507188429125e-06
Epoch 1010 - loss: 6.409126854123315e-06
Epoch 1011 - loss: 6.395363925548736e-06
Epoch 1012 - loss: 6.381641469488386e-06
Epoch 1013 - loss: 6.367966307152528e-06
Epoch 1014 - loss: 6.354345259751426e-06
Epoch 1015 - loss: 6.340753770928131e-06
Epoch 1016 - loss: 6.327226856228663e-06
Epoch 1017 - loss: 6.313752237474546e-06
Epoch 1018 - loss: 6.300312634266447e-06
Epoch 1019 - loss: 6.286913503572578e-06
Epoch 1020 - loss: 6.273591225181008e-06
Epoch 1021 - loss: 6.260284408199368e-06
Epoch 1022 - loss: 6.247042165341554e-06
Epoch 1023 - loss: 6.2338517636817414e-06
Epoch 1024 - loss: 6.2206822804000694e-06
Epoch 1025 - loss: 6.207576916494872e-06
Epoch 1026 - loss: 6.194516572577413e-06
...

Epoch 1247 - loss: 4.127465217607096e-06
Epoch 1248 - loss: 4.120823632547399e-06
Epoch 1249 - loss: 4.1142047848552465e-06
Epoch 1250 - loss: 4.1076004890783224e-06
Epoch 1251 - loss: 4.101014837942785e-06
Epoch 1252 - loss: 4.094439645996317e-06
Epoch 1253 - loss: 4.087893557880307e-06
Epoch 1254 - loss: 4.081370661879191e-06
Epoch 1255 - loss: 4.074854132340988e-06
Epoch 1256 - loss: 4.068359430675628e-06
Epoch 1257 - loss: 4.061878371430794e-06
Epoch 1258 - loss: 4.055415502079995e-06
Epoch 1259 - loss: 4.048963546665618e-06
Epoch 1260 - loss: 4.042544787807856e-06
Epoch 1261 - loss: 4.036132395413006e-06
Epoch 1262 - loss: 4.02973773816484e-06
Epoch 1263 - loss: 4.023364908789517e-06
Epoch 1264 - loss: 4.016998445877107e-06
Epoch 1265 - loss: 4.010652446595486e-06
Epoch 1266 - loss: 4.004332367912866e-06
Epoch 1267 - loss: 3.998021384177264e-06
Epoch 1268 - loss: 3.991721314378083e-06
Epoch 1269 - loss: 3.985443072451744e-06
Epoch 1270 - loss: 3.97917710870388e-06
...

In [None]:
print(net(torch.tensor([[1., 1.]])))

In [8]:
def step(x):
    # threshold the net output at 0.5 to get a binary class
    if x < 0.5:
        return 0
    else:
        return 1

print(step(net(torch.tensor([[1., 0.]])).item()))

1
