## Machine Learning 7.1

### From last class ....

#### Import the libraries

In [1]:
import sys
sys.path.append('/Users/tlee010/Desktop/github_repos/fastai/')

In [2]:
%load_ext autoreload
%autoreload 2

from fastai.imports import *
from fastai.torch_imports import *
from fastai.io import *

#### Load the MNIST data

In [3]:
path = './'
URL='http://deeplearning.net/data/mnist/' 

In [4]:
FILENAME='mnist.pkl.gz'

def load_mnist(filename):
    return pickle.load(gzip.open(filename, 'rb'), encoding='latin-1')


get_data(URL+FILENAME, path+FILENAME)
((x, y), (x_valid, y_valid), _) = load_mnist(path+FILENAME)



(-3.1638146e-07, 0.99999934)

#### Normalize the data

In [14]:
mean = x.mean()
std = x.std()
x=(x-mean)/std
x.mean(), x.std()

(-8.0261424e-09, 1.0)

In [15]:
x_valid = (x_valid-mean)/std
x_valid.mean(), x_valid.std()

(-0.0058506131, 0.99243379)

In [16]:
x_imgs = np.reshape(x_valid, (-1,28,28)); x_imgs.shape

(10000, 28, 28)

### Start our simple NN model (From last lecture)

In [17]:
from fastai.metrics import *
from fastai.model import *
from fastai.dataset import *
from fastai.core import *

import torch.nn as nn

## Master Model: The basic NN model  w/pytorch libraries

In [18]:
net = nn.Sequential(
    nn.Linear(28*28, 10),
    nn.LogSoftmax()
)#.cuda() #<--- signals to run on the GPU

In [19]:
md = ImageClassifierData.from_arrays(path, (x,y), (x_valid, y_valid))

In [20]:
x.shape

(50000, 784)

#### Define our loss and optimization

In [23]:
loss=nn.NLLLoss()
metrics=[accuracy]
opt=optim.Adam(net.parameters())

In [25]:
def binary_loss(y,p):
    return np.mean(-(y * np.log(p)+(1-y)*np.log(1-p)))

In [26]:
acts_sample = np.array([1, 0, 0, 1])
preds_sample = np.array([0.9, .1, .2, .8])
binary_loss(acts_sample, preds_sample)

0.164252033486018

In [27]:
fit(net, md, epochs=1, crit=loss, opt=opt, metrics=metrics)

[ 0.       0.29864  0.27374  0.92566]                         



### Now let's rebuild this model several times 

#### (We will manually replace the prebuilt libraries with manual functions)

```python
net = nn.Sequential(
    nn.Linear(28*28, 10),
    nn.LogSoftmax()
)
```

#### Last week we remade the above model from scratch:

**`(nn.module)`** <-- we are ***extending*** a pytorch class. we are borrowing all the methods from the standard module and will add some additional methods to it.

**`super().__init__()`** <-- as a result, we have to create or instantiate the standard module first

**`self.l1_w = get_weights(28*28, 10) `** is essentially the `ax` part of `y=ax + b`

**`self.l1_b = get_weights(10) `** is essentially the `b` part of `y=ax + b`



## Rewrite #1 Model - linear / softmax from scratch

In [30]:
def get_weights(*dims): return nn.Parameter(torch.randn(*dims)/dims[0])

class LogReg(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.l1_w = get_weights(28*28, 10)  # Layer 1 weights
        self.l1_b = get_weights(10)         # Layer 1 bias

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = torch.matmul(x, self.l1_w) + self.l1_b  # Linear Layer
        x = torch.log(torch.exp(x)/(1 + torch.exp(x).sum(dim=0)))        # Non-linear (LogSoftmax) Layer
        return x

### Equivalent functions


#### Original Torch
```python
net = nn.Sequential(
    nn.Linear(28*28, 10), <-- (1)
    nn.LogSoftmax()       <-- (2)
)
```
#### Basic Python Version
The linear portion of our original model is replicated here:
```python
    def __init__(self):
        super().__init__()
        self.l1_w = get_weights(28*28, 10)  <--(1 equivalent)
        self.l1_b = get_weights(10)         <--(1 equivalent)
    
    def forward(self, x):                   
        x = x.view(x.size(0), -1)
        x = torch.matmul(x, self.l1_w) + self.l1_b                <--(1 equivalent)
        x = torch.log(torch.exp(x)/(1 + torch.exp(x).sum(dim=0))) <--(2 equivalent)
        return x
    
```

#### Pytorch special method: forward

**`forward`** - special hook in the pytorch library, how we are implementing each layer 

http://pytorch.org/tutorials/beginner/former_torchies/nn_tutorial.html#forward-and-backward-function-hooks

#### Softmax activation:

- We want probabilities at the end, values should be between 0 and 1
- categorical predictions, where we only want to predict **1** of the class (not simultaneous classes, such as a cat and dog in the same picture aka. multilabel which would be sigmoid activation)
- all the scores should add up to 1
- Side note: Torch uses `torch.log` for numerical stability

#### Let's Run our Model

In [31]:
net2 = LogReg()#.cuda()
opt=optim.Adam(net2.parameters())

In [32]:
fit(net2, md, epochs=1, crit=loss, opt=opt, metrics=metrics)

[ 0.       2.44347  2.40075  0.90953]                        



### Using our custom object manually
#### first: get a mini batch of the data

Will pull from the image generator

In [34]:
dl = iter(md.trn_dl)

#### since we called it, it will provide the next minibatch set

Then we will save the data into the local variables

In [35]:
xmb, ymb = next(dl)

#### Returns a tensor

In [36]:
xmb


-0.4245 -0.4245 -0.4245  ...  -0.4245 -0.4245 -0.4245
-0.4245 -0.4245 -0.4245  ...  -0.4245 -0.4245 -0.4245
-0.4245 -0.4245 -0.4245  ...  -0.4245 -0.4245 -0.4245
          ...             ⋱             ...          
-0.4245 -0.4245 -0.4245  ...  -0.4245 -0.4245 -0.4245
-0.4245 -0.4245 -0.4245  ...  -0.4245 -0.4245 -0.4245
-0.4245 -0.4245 -0.4245  ...  -0.4245 -0.4245 -0.4245
[torch.FloatTensor of size 64x784]

**`Variable`** - this is how we can get the differentiation. We are tracking operations on this variable, because later we will be using pytorch to do the differentiation. Note the different output of 

```python
    Variable containing:
    -0.4245 -0.4245 -0.4245 
```

#### Make a wrapper

In [40]:
vxmb = Variable(xmb)
vxmb

Variable containing:
-0.4245 -0.4245 -0.4245  ...  -0.4245 -0.4245 -0.4245
-0.4245 -0.4245 -0.4245  ...  -0.4245 -0.4245 -0.4245
-0.4245 -0.4245 -0.4245  ...  -0.4245 -0.4245 -0.4245
          ...             ⋱             ...          
-0.4245 -0.4245 -0.4245  ...  -0.4245 -0.4245 -0.4245
-0.4245 -0.4245 -0.4245  ...  -0.4245 -0.4245 -0.4245
-0.4245 -0.4245 -0.4245  ...  -0.4245 -0.4245 -0.4245
[torch.FloatTensor of size 64x784]

#### Syntax - treating our model like a function 

(which is calling the forward command that we implemented)

In [43]:
preds = net2(vxmb)
preds[:10]

Variable containing:

Columns 0 to 7 
 -1.4970 -18.9267  -8.6922  -7.2008 -10.0914  -5.1697 -11.1822 -15.2741
 -9.0697 -11.9168  -6.6778  -6.3863  -1.6509  -8.4909  -7.1499  -7.1995
 -5.9551  -7.1476  -1.1607  -6.7816  -7.3475  -7.2832  -6.5625 -10.3792
 -4.7502 -11.2491  -6.9513  -4.1343  -7.2965  -7.5893 -11.0756  -2.1567
 -6.7426 -11.6235  -7.4818  -6.0834  -5.0911  -6.3521 -10.0404  -5.7750
 -1.8195 -10.2395  -6.6171  -7.6511  -7.9903  -4.1114  -2.8303  -9.8166
 -6.5060  -5.1258  -7.2731  -6.8241  -2.8162  -4.9901  -6.4188  -6.3444
 -3.8293 -10.2828  -6.5348  -2.0728 -11.5522  -2.4806  -6.7743 -15.9268
 -6.6867  -2.1434  -5.3255  -5.6275  -7.2797  -6.0414  -5.5714  -6.7775
 -6.4043  -2.1118  -5.7165  -5.4894  -7.3070  -5.9303  -5.9022  -5.7852

Columns 8 to 9 
 -9.1331 -11.8261
 -4.7619  -5.0652
 -5.0495  -9.8351
 -8.4450  -5.4082
 -4.8309  -2.2167
 -7.8708 -11.7794
 -4.1470  -6.3982
 -6.4812 -12.6123
 -5.8198  -6.1695
 -5.2201  -5.4992
[torch.FloatTensor of size 10x10]

#### Let's look at our predictions for this minibatch

In [57]:
preds.max(1)[1]

Variable containing:
 0
 4
 2
 7
 9
 0
 4
 3
 1
 1
 4
 0
 6
 2
 1
 8
 3
 8
 5
 3
 9
 3
 6
 3
 0
 2
 8
 5
 2
 5
 6
 3
 9
 8
 9
 7
 7
 1
 6
 8
 7
 9
 7
 8
 1
 9
 8
 4
 7
 1
 2
 7
 9
 9
 6
 1
 3
 7
 6
 6
 4
 9
 1
 1
[torch.LongTensor of size 64]

### Now we can stack, to show that we are combining these different parts together

In [50]:
## Example 1

net = nn.Sequential(
    nn.Linear(28*28, 100),
    nn.ReLU(),
    nn.Linear(100, 10),    
    nn.LogSoftmax()
)#.cuda() #<--- signals to run on the GPU

## Example 2

net = nn.Sequential(
    nn.Linear(28*28, 100),
    nn.ReLU(),
    nn.Linear(100, 100),
    nn.ReLU(),
    nn.Linear(100, 100),
    nn.ReLU(),    
    nn.Linear(100, 100),
    nn.ReLU(),    
    nn.Linear(100, 10),    
    nn.LogSoftmax()
)#.cuda() #<--- signals to run on the GPU


net = nn.Sequential(
    nn.Linear(28*28, 100),
    nn.Sigmoid(),
    nn.Linear(100, 100),
    nn.Sigmoid(),
    nn.Linear(100, 100),
    nn.Sigmoid(), 
    nn.Linear(100, 100),
    nn.ReLU(),    
    nn.Linear(100, 10),    
    nn.LogSoftmax()
)#.cuda() #<--- signals to run on the GPU

#### Activation is the value calculated in a layer

- `Relu` - will tend to drop out negative values
- `softmax` - will force many of the activations towards to zero.
- `sigmoid` - will limit the values between `0` to `1`
- `tanh` - very similar to sigmoid, but between `-1` and `1`
- `LeakRelu` - will minimize negative values to a small small decimal

In [53]:
loss=nn.NLLLoss()
metrics=[accuracy]
opt=optim.Adam(net.parameters())

In [54]:
fit(net, md, epochs=1, crit=loss, opt=opt, metrics=metrics)

[ 0.       0.21213  0.19673  0.94984]                        



## Rewrite #2 Model - matrix multiplication from scratch

### Review of python matrix methods and approaches

<img src='https://cdn.lynda.com/video/161431-100-635594039429618029_338x600_thumb.jpg' />

It's important to learn these concepts to reduce the number of loops and make efficient coding. SIMD - single instruction, multiple data. This is maximizing the CPU calculation. Also there's multiple cores, multiple threads. If you can distribute all these calculations, you take advantage of your computers full potential.

GPUs are special processors can run even more things in parallel!

<img src='https://image.slidesharecdn.com/tensordecomposition-170301235239/95/a-brief-survey-of-tensors-3-638.jpg?cb=1488412458' style='width:400px'/>

In [61]:
a = T([10, 6, -4])
b = T([2, 8, 7])

In [62]:
a + b


 12
 14
  3
[torch.LongTensor of size 3]

In [63]:
a < b


 0
 1
 1
[torch.ByteTensor of size 3]

### Broadcast: Comparing Arrays against single values

In [64]:
a


 10
  6
 -4
[torch.LongTensor of size 3]

#### Broadcasting boolean comparison

In [71]:
a > 0


 1
 1
 0
[torch.ByteTensor of size 3]

#### Broadcasting addition of a constant

In [72]:
a + 1


 11
  7
 -3
[torch.LongTensor of size 3]

#### Broadcasting multiplication against a constant

In [73]:
m = T([[1, 2, 3], [4,5,6], [7,8,9]]); m


 1  2  3
 4  5  6
 7  8  9
[torch.LongTensor of size 3x3]

In [74]:
m*2


  2   4   6
  8  10  12
 14  16  18
[torch.LongTensor of size 3x3]

#### Broad Casting vector addition against a matrix

In [75]:
c = T([10,20,30]); c


 10
 20
 30
[torch.LongTensor of size 3]

In [87]:
print(m), print(c), print(m + c)


 1  2  3
 4  5  6
 7  8  9
[torch.LongTensor of size 3x3]


 10
 20
 30
[torch.LongTensor of size 3]


 11  22  33
 14  25  36
 17  28  39
[torch.LongTensor of size 3x3]



(None, None, None)

#### Python Numpy version

In [90]:
m = np.array([[1, 2, 3], [4,5,6], [7,8,9]]); m
c = np.array([10,20,30]); c

array([10, 20, 30])

The numpy `expand_dims` method lets us convert the 1-dimensional array `c` into a 2-dimensional array (although one of those dimensions has value 1). This will allow us to use matrix to matrix operations

In [93]:
m + np.expand_dims(c,0)

array([[11, 22, 33],
       [14, 25, 36],
       [17, 28, 39]])

In [92]:
m + np.expand_dims(c,1)

array([[11, 12, 13],
       [24, 25, 26],
       [37, 38, 39]])

In [95]:
m + c[None,:]

array([[11, 22, 33],
       [14, 25, 36],
       [17, 28, 39]])

In [96]:
m + c[:,None]

array([[11, 12, 13],
       [24, 25, 26],
       [37, 38, 39]])

In [98]:
np.broadcast_to(c, (3,3))

array([[10, 20, 30],
       [10, 20, 30],
       [10, 20, 30]])

#### Broadcasting Rules

When operating on two arrays, Numpy/PyTorch compares their shapes element-wise. It starts with the **trailing dimensions**, and works its way forward. Two dimensions are **compatible** when

- they are equal, or
- one of them is 1

Arrays do not need to have the same number of dimensions. For example, if you have a $256 \times 256 \times 3$ array of RGB values, and you want to scale each color in the image by a different value, you can multiply the image by a one-dimensional array with 3 values. Lining up the sizes of the trailing axes of these arrays according to the broadcast rules, shows that they are compatible:

    Image  (3d array): 256 x 256 x 3
    Scale  (1d array):             3
    Result (3d array): 256 x 256 x 3

The [numpy documentation](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html#general-broadcasting-rules) includes several examples of what dimensions can and can not be broadcast together.

```python
A      (2d array):  5 x 4
B      (1d array):      1
Result (2d array):  5 x 4

A      (2d array):  5 x 4
B      (1d array):      4
Result (2d array):  5 x 4

A      (3d array):  15 x 3 x 5
B      (3d array):  15 x 1 x 5
Result (3d array):  15 x 3 x 5

A      (3d array):  15 x 3 x 5
B      (2d array):       3 x 5
Result (3d array):  15 x 3 x 5

A      (3d array):  15 x 3 x 5
B      (2d array):       3 x 1
Result (3d array):  15 x 3 x 5
```

#### Some bad examples

```python
A      (1d array):  3
B      (1d array):  4 # trailing dimensions do not match

A      (2d array):      2 x 1
B      (3d array):  8 x 4 x 3 # second from last dimensions mismatched
```

In [99]:
c

array([10, 20, 30])

In [100]:
c[None]

array([[10, 20, 30]])

In [101]:
c[:,None]

array([[10],
       [20],
       [30]])

#### Note that since these are now considered 2 dim Matrix, these are broadcast operations

In [102]:
c[None] * c[:,None]

array([[100, 200, 300],
       [200, 400, 600],
       [300, 600, 900]])

In [103]:
c[:,None] * c[None]

array([[100, 200, 300],
       [200, 400, 600],
       [300, 600, 900]])

In [106]:
xg, yg = np.ogrid[0:5,0:5]
np.ogrid[0:5,0:5]

[array([[0],
        [1],
        [2],
        [3],
        [4]]), array([[0, 1, 2, 3, 4]])]

In [107]:
xg+yg

array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6],
       [3, 4, 5, 6, 7],
       [4, 5, 6, 7, 8]])

In [108]:
m,c

(array([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]), array([10, 20, 30]))

In [109]:
m @ c

array([140, 320, 500])

In [110]:
T(m)@ T(c)


 140
 320
 500
[torch.LongTensor of size 3]

In [111]:
m * c

array([[ 10,  40,  90],
       [ 40, 100, 180],
       [ 70, 160, 270]])

In [112]:
(m * c).sum(axis=1)

array([140, 320, 500])

### Matrix Matrix Product

http://matrixmultiplication.xyz

A nice visualization to assist in understanding matrix multiplication



In [113]:
n = np.array([[10,40],[20,0],[30,-5]]); n

array([[10, 40],
       [20,  0],
       [30, -5]])

In [114]:
m @ n

array([[140,  25],
       [320, 130],
       [500, 235]])

In [115]:
(m * n[:,0]).sum(axis=1)

array([140, 320, 500])

In [116]:
(m * n[:,1]).sum(axis=1)

array([ 25, 130, 235])

## Rewrite #2 `fit` - matrix multiplication from scratch

https://github.com/tensorly/tensorly

#### The original master model


```python
class LogReg(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1_w = get_weights(28*28, 10)  # Layer 1 weights
        self.l1_b = get_weights(10)         # Layer 1 bias

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = torch.matmul(x, self.l1_w) + self.l1_b 
        x = torch.log(torch.exp(x)/(1 + torch.exp(x).sum(dim=0)))
        return x

net2 = LogReg().cuda()
opt=optim.Adam(net2.parameters())

fit(net2, md, epochs=1, crit=loss_fn, opt=opt, metrics=metrics)
```

#### The first rewrite:

```python
def get_weights(*dims): return nn.Parameter(torch.randn(*dims)/dims[0])

class LogReg(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.l1_w = get_weights(28*28, 10)  # Layer 1 weights
        self.l1_b = get_weights(10)         # Layer 1 bias

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = torch.matmul(x, self.l1_w) + self.l1_b  # Linear Layer
        x = torch.log(torch.exp(x)/(1 + torch.exp(x).sum(dim=0)))        # Non-linear (LogSoftmax) Layer
        return x
```

### We want to focus on re-writing the `fit` of the model

We want to pass the fit function mini batches at a time

In [127]:
net2 = LogReg()#.cuda()
loss_fn=nn.NLLLoss()
learning_rate = 1e-3
optimizer=optim.Adam(net2.parameters(), lr=learning_rate)

In [128]:
dl = iter(md.trn_dl) #Data loader

In [133]:
x, y = next(dl)
y_pred = net2(Variable(x))#.cuda())

In [134]:
# Compute and print loss.
loss = loss_fn(y_pred, Variable(y))#.cuda())
print(loss.data)


 4.1750
[torch.FloatTensor of size 1]



### Get the gradient using pytorch inherited method `backward`

In [135]:
# Before the backward pass, use the optimizer object to zero all of the
# gradients for the variables it will update (which are the learnable weights
# of the model)
optimizer.zero_grad()

# Backward pass: compute gradient of the loss with respect to model
# parameters
loss.backward()

# Calling the step function on an Optimizer makes an update to its
# parameters
optimizer.step()

### What's the stepping doing?

<img src='https://sebastianraschka.com/images/faq/closed-form-vs-gd/ball.png' style='width:400px' />

<img src = 'https://media.licdn.com/mpr/mpr/AAEAAQAAAAAAAA0oAAAAJDUzMTBlMDdjLWM0ZmMtNDJkNS1hODk3LTAzYTllMDUwZmY1OQ.jpg' style='width:400px' />

### Let's do a single step

In [145]:
x, y = next(dl)
y_pred = net2.forward(Variable(x))#.cuda())

In [146]:
# Compute and print loss.

loss = loss_fn(y_pred, Variable(y))#.cuda())
print(loss.data)


 2.3914
[torch.FloatTensor of size 1]



### Let's loop the steps

Note: in pytorch you have to zero the gradients.

In [147]:
for t in range(100):
    x, y = next(dl)
    y_pred = net2(Variable(x))#.cuda())
    loss = loss_fn(y_pred, Variable(y))#.cuda())
    
    if t % 10 == 0:
        accuracy = np.sum(to_np(y_pred).argmax(axis=1) == to_np(y))/len(y_pred)
        print("loss: ", loss.data[0], "\t accuracy: ", accuracy)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

loss:  2.4622957706451416 	 accuracy:  0.859375
loss:  2.4914746284484863 	 accuracy:  0.875
loss:  2.5012717247009277 	 accuracy:  0.90625
loss:  2.438032388687134 	 accuracy:  0.90625
loss:  2.41378116607666 	 accuracy:  0.90625
loss:  2.336564779281616 	 accuracy:  0.875
loss:  2.5167903900146484 	 accuracy:  0.921875
loss:  2.4636924266815186 	 accuracy:  0.890625
loss:  2.624217987060547 	 accuracy:  0.84375
loss:  2.524811267852783 	 accuracy:  0.90625
