# Lecture 8 Deep Learning Software - Private Notes
 - Caffe, Torch, Theano, TensorFlow, Keras, PyTorch, etc<br>

Slides : http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture8.pdf<br>
videos : https://www.youtube.com/watch?v=6SlgtELqOWc&list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv

## Today
 - CPU vs GPU
 - Deep Learning Frameworks
  - Caffe / Caffe2
  - Theano / TensorFlow
  - Torch / PyTorch

# Programming GPU
 - CUDA(NVIDIA only)
  - Write C-like code that runs directly on the GPU
  - Higher-level APIs: cuBLAS, cuFFT, cuDNN, etc.
 - OpenCL
  - Similar to CUDA, but runs on anything
  - Usually slower
 - Udacity: Intro to Parallel Programming
  - https://www.udacity.com/course/cs344
  - For deep learning just use existing libraries

<img src='./Lesson pic/8-21.png'>

## The point of deep learning frameworks
 (1) Easily build big computational graphs<br>
 (2) Easily compute gradients in computational graphs<br>
 (3) Run it all efficiently on GPU (wrap cuDNN, cuBLAS, etc)

Problems with plain NumPy (first example below)
 - Can't run on GPU
 - Have to compute our own gradients

In [2]:
import numpy as np
np.random.seed(0)

N,D = 3,4

x=np.random.randn(N,D)
y=np.random.randn(N,D)
z=np.random.randn(N,D)

a=x*y
b=a+z
c=np.sum(b)

grad_c = 1.
grad_b = grad_c * np.ones((N,D))
grad_a = grad_b.copy()
grad_z = grad_b.copy()
grad_x = grad_a * y
grad_y = grad_a * x

In [6]:
import numpy as np
np.random.seed(0)
import tensorflow as tf

N,D = 3,4

x=tf.placeholder(tf.float32)
y=tf.placeholder(tf.float32)
z=tf.placeholder(tf.float32)

a=x*y
b=a+z
c=tf.reduce_sum(b)

grad_x, grad_y, grad_z = tf.gradients(c,[x,y,z])

with tf.Session() as sess:
    values = {
        x:np.random.randn(N,D),
        y:np.random.randn(N,D),
        z:np.random.randn(N,D),
    }
    
    out = sess.run([c,grad_x,grad_y,grad_z], feed_dict=values)
    c_val, grad_x_val, grad_y_val, grad_z_val = out

In [6]:
import numpy as np
np.random.seed(0)
import tensorflow as tf

N,D = 3,4

with tf.device('/cpu:0'):
    x=tf.placeholder(tf.float32)
    y=tf.placeholder(tf.float32)
    z=tf.placeholder(tf.float32)

    a=x*y
    b=a+z
    c=tf.reduce_sum(b)

grad_x, grad_y, grad_z = tf.gradients(c,[x,y,z])

with tf.Session() as sess:
    values = {
        x:np.random.randn(N,D),
        y:np.random.randn(N,D),
        z:np.random.randn(N,D),
    }
    
    out = sess.run([c,grad_x,grad_y,grad_z], feed_dict=values)
    c_val, grad_x_val, grad_y_val, grad_z_val = out

# PyTorch
```Python
import torch
from torch.autograd import Variable

N,D=3,4

x=Variable(torch.randn(N,D), requires_grad=True)
y=Variable(torch.randn(N,D), requires_grad=True)
z=Variable(torch.randn(N,D), requires_grad=True)

"""
x=Variable(torch.randn(N,D).cuda(), requires_grad=True)
y=Variable(torch.randn(N,D).cuda(), requires_grad=True)
z=Variable(torch.randn(N,D).cuda(), requires_grad=True)
"""

a=x*y
b=a+z
c=torch.sum(b)

c.backward()

print(x.grad.data)
print(y.grad.data)
print(z.grad.data)
```

# TensorFlow

In [9]:
import numpy as np
np.random.seed(0)
import tensorflow as tf
# Running example: Train a two-layer ReLU network on random data with L2 loss

# First define computational graph
N,D,H=64,1000,100

# Create placeholders for input x, weights w1 and w2, and targets y
x=tf.placeholder(tf.float32, shape=(N,D))
y=tf.placeholder(tf.float32, shape=(N,D))
w1=tf.placeholder(tf.float32, shape=(D,H))
w2=tf.placeholder(tf.float32, shape=(H,D))

# Forward pass: compute prediction for y and loss (L2 distance between y and y_pred)
# No computation happens here - just building the graph!
h=tf.maximum(tf.matmul(x,w1),0)
y_pred = tf.matmul(h,w2)
diff = y_pred - y
loss = tf.reduce_mean(tf.reduce_sum(diff**2, axis=1))

# Tell TensorFlow to compute the gradient of loss with respect to w1 and w2.
# Again no computation here - just building the graph
grad_w1, grad_w2 = tf.gradients(loss, [w1,w2])

# Then run the graph many times
# Now done building our graph, so we enter a session so we can actually run the graph
with tf.Session() as sess:
    # Create numpy arrays that will fill in the placeholders above
    values = {x:np.random.randn(N,D),
              w1:np.random.randn(D,H),
              w2:np.random.randn(H,D),
              y:np.random.randn(N,D),
             }
    
    # Run the graph: feed in the numpy arrays for x, y, w1, and w2; get numpy arrays for loss, grad_w1, and grad_w2
    out = sess.run([loss, grad_w1, grad_w2],
                  feed_dict=values)
    loss_val, grad_w1_val, grad_w2_val = out
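
For reference, here is a NumPy sketch of the quantities that `tf.gradients` is asked to produce for this network, written out by hand. It is an addition to these notes and assumes the same loss definition as the graph above; `loss_fn` is a hypothetical helper, and one entry of `grad_w2` is spot-checked with a finite difference.

```python
import numpy as np

np.random.seed(0)
N, D, H = 64, 1000, 100
x = np.random.randn(N, D)
y = np.random.randn(N, D)
w1 = np.random.randn(D, H)
w2 = np.random.randn(H, D)

def loss_fn(w1, w2):
    # Same loss as the graph: mean over the batch of per-example squared L2 distance
    h = np.maximum(x.dot(w1), 0)
    diff = h.dot(w2) - y
    return np.mean(np.sum(diff ** 2, axis=1))

# Forward pass, keeping intermediates for the backward pass
h = np.maximum(x.dot(w1), 0)
diff = h.dot(w2) - y

# Backward pass by hand
grad_y_pred = 2.0 * diff / N          # d loss / d y_pred
grad_w2 = h.T.dot(grad_y_pred)        # d loss / d w2
grad_h = grad_y_pred.dot(w2.T)        # d loss / d h
grad_h[h <= 0] = 0                    # ReLU passes gradient only where it was active
grad_w1 = x.T.dot(grad_h)             # d loss / d w1

# Spot-check one entry of grad_w2 with a centered finite difference
eps = 1e-3
w2_plus, w2_minus = w2.copy(), w2.copy()
w2_plus[0, 0] += eps
w2_minus[0, 0] -= eps
numeric = (loss_fn(w1, w2_plus) - loss_fn(w1, w2_minus)) / (2 * eps)
print(numeric, grad_w2[0, 0])
```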

In [10]:
import numpy as np
np.random.seed(0)
import tensorflow as tf

N,D,H=64,1000,100

x=tf.placeholder(tf.float32, shape=(N,D))
y=tf.placeholder(tf.float32, shape=(N,D))


w1=tf.placeholder(tf.float32, shape=(D,H))
w2=tf.placeholder(tf.float32, shape=(H,D))

h=tf.maximum(tf.matmul(x,w1),0)
y_pred = tf.matmul(h,w2)
diff = y_pred - y
loss = tf.reduce_mean(tf.reduce_sum(diff**2, axis=1))

grad_w1, grad_w2 = tf.gradients(loss, [w1,w2])

with tf.Session() as sess:
    values = {x:np.random.randn(N,D),
              w1:np.random.randn(D,H),
              w2:np.random.randn(H,D),
              y:np.random.randn(N,D),
             }
    learning_rate = 1e-5
    # Train the network: Run the graph over and over,use gradient to update weights
    # Problem: copying weights between CPU / GPU each step
    for t in range(50):        
        out = sess.run([loss, grad_w1, grad_w2],
                      feed_dict=values)
        loss_val, grad_w1_val, grad_w2_val = out
        values[w1] -= learning_rate * grad_w1_val
        values[w2] -= learning_rate * grad_w2_val

In [11]:
import numpy as np
np.random.seed(0)
import tensorflow as tf

N,D,H=64,1000,100

x=tf.placeholder(tf.float32, shape=(N,D))
y=tf.placeholder(tf.float32, shape=(N,D))
# Change w1 and w2 from placeholder (fed on each call) to Variable (persists in the graph between calls)
w1 = tf.Variable(tf.random_normal((D,H)))
w2 = tf.Variable(tf.random_normal((H,D)))

h=tf.maximum(tf.matmul(x,w1),0)
y_pred = tf.matmul(h,w2)
diff = y_pred - y
loss = tf.reduce_mean(tf.reduce_sum(diff**2, axis=1))

grad_w1, grad_w2 = tf.gradients(loss, [w1,w2])

learning_rate = 1e-5
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)

with tf.Session() as sess:
    # Run graph once to initialize w1 and w2
    sess.run(tf.global_variables_initializer())
    values = {x:np.random.randn(N,D),
              y:np.random.randn(N,D),}
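    # Run many times to train
    # Problem: new_w1 and new_w2 are never passed to sess.run, so the assign ops
    # do not execute and the loss does not go down
    for t in range(50):        
        loss_val, = sess.run([loss], feed_dict=values)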

In [11]:
import numpy as np
np.random.seed(0)
import tensorflow as tf

N,D,H=64,1000,100

x=tf.placeholder(tf.float32, shape=(N,D))
y=tf.placeholder(tf.float32, shape=(N,D))
# Change w1 and w2 from placeholder (fed on each call) to Variable (persists in the graph between calls)
w1 = tf.Variable(tf.random_normal((D,H)))
w2 = tf.Variable(tf.random_normal((H,D)))

h=tf.maximum(tf.matmul(x,w1),0)
y_pred = tf.matmul(h,w2)
diff = y_pred - y
loss = tf.reduce_mean(tf.reduce_sum(diff**2, axis=1))

grad_w1, grad_w2 = tf.gradients(loss, [w1,w2])

learning_rate = 1e-5
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)
# Add dummy graph node that depends on updates
updates=tf.group(new_w1, new_w2)


with tf.Session() as sess:
    # Run graph once to initialize w1 and w2
    sess.run(tf.global_variables_initializer())
    values = {x:np.random.randn(N,D),
              y:np.random.randn(N,D),}
    # Run many times to train
    for t in range(50):      
        # Tell graph to compute dummy node
        loss_val,_ = sess.run([loss, updates], feed_dict=values)

# Optimizer

In [13]:
import numpy as np
np.random.seed(0)
import tensorflow as tf

N,D,H=64,1000,100

x=tf.placeholder(tf.float32, shape=(N,D))
y=tf.placeholder(tf.float32, shape=(N,D))
# Change w1 and w2 from placeholder (fed on each call) to Variable (persists in the graph between calls)
w1 = tf.Variable(tf.random_normal((D,H)))
w2 = tf.Variable(tf.random_normal((H,D)))

h=tf.maximum(tf.matmul(x,w1),0)
y_pred = tf.matmul(h,w2)
"""
diff = y_pred - y
loss = tf.reduce_mean(tf.reduce_sum(diff**2, axis=1))
"""
loss = tf.losses.mean_squared_error(y_pred, y)

#Can use an optimizer to compute gradients and update weights
optimizer = tf.train.GradientDescentOptimizer(1e-5)
updates = optimizer.minimize(loss)

with tf.Session() as sess:
    # Run graph once to initialize w1 and w2
    sess.run(tf.global_variables_initializer())
    values = {x:np.random.randn(N,D),
              y:np.random.randn(N,D),}
    # Run many times to train
    for t in range(50):      
        # Remember to execute the output of the optimizer!
        loss_val,_ = sess.run([loss, updates], feed_dict=values)

# Layers

In [15]:
import numpy as np
np.random.seed(0)
import tensorflow as tf

N,D,H=64,1000,100

x=tf.placeholder(tf.float32, shape=(N,D))
y=tf.placeholder(tf.float32, shape=(N,D))

#Use Xavier initializer
init=tf.contrib.layers.xavier_initializer()
#tf.layers automatically sets up weight and bias for us!
h=tf.layers.dense(inputs=x, units=H, activation=tf.nn.relu, kernel_initializer=init)
y_pred = tf.layers.dense(inputs=h, units=D, kernel_initializer=init)

loss = tf.losses.mean_squared_error(y_pred, y)

#Can use an optimizer to compute gradients and update weights
optimizer = tf.train.GradientDescentOptimizer(1e0)
updates = optimizer.minimize(loss)

with tf.Session() as sess:
    # Run graph once to initialize the layer weights and biases
    sess.run(tf.global_variables_initializer())
    values = {x:np.random.randn(N,D),
              y:np.random.randn(N,D),}
    # Run many times to train
    for t in range(50):      
        # Remember to execute the output of the optimizer!
        loss_val,_ = sess.run([loss, updates], feed_dict=values)

# Keras: High - Level Wrapper
 - Keras is a layer on top of TensorFlow that makes common things easy to do<br>
(Also supports Theano backend)

In [20]:
import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.optimizers import SGD

N,D,H = 64,1000,100

# Define model object as a sequence of layers
model=Sequential()
#model.add(Dense(input_dim=D, output_dim=H))
model.add(Dense(input_dim=D, units=H))
model.add(Activation('relu'))
model.add(Dense(input_dim=H, units=D))

# Define optimizer object
optimizer = SGD(lr=1e0)
# Build the model,specify loss function
model.compile(loss='mean_squared_error', optimizer=optimizer)

x = np.random.randn(N,D)
y = np.random.randn(N,D)

# Train the model with a single line!
#history = model.fit(x,y, nb_epoch=50, batch_size=N, verbose=0)
history = model.fit(x,y, epochs=50, batch_size=N, verbose=0)

# TensorFlow: Other High-Level Wrappers 
- Keras (https://keras.io/)
- TFLearn (http://tflearn.org/)
- TensorLayer (http://tensorlayer.readthedocs.io/en/latest/)

- tf.layers (https://www.tensorflow.org/api_docs/python/tf/layers) (ships with TensorFlow)
- TF-Slim (https://github.com/tensorflow/models/tree/master/inception/inception/slim) (ships with TensorFlow)
- tf.contrib.learn (https://www.tensorflow.org/get_started/tflearn) (ships with TensorFlow)
- Pretty Tensor (https://github.com/google/prettytensor) (from Google)
- Sonnet (https://github.com/deepmind/sonnet) (from DeepMind)

# TensorFlow: Pretrained Models
- TF-Slim: (https://github.com/tensorflow/models/tree/master/slim/nets)
- Keras: (https://github.com/fchollet/deep-learning-models)

# TensorFlow: Tensorboard
- Add logging to code to record loss, stats, etc 
- Run server and get pretty graphs!

# TensorFlow: Distributed Version
- Split one graph over multiple machines!
- https://www.tensorflow.org/deploy/distributed

# Side Note: Theano

```Python
import theano
import theano.tensor as T

N,D,H,C = 64,1000,100,10

# Define symbolic variables (similar to TensorFlow placeholder)
x=T.matrix('x')
y=T.vector('y',dtype='int64')
w1 = T.matrix('w1')
w2=T.matrix('w2')

# Forward pass: compute scores
a=x.dot(w1)
a_relu=T.nnet.relu(a)
scores = a_relu.dot(w2)

# Forward pass: compute predictions and loss (no computation performed yet)
probs = T.nnet.softmax(scores)
loss = T.nnet.categorical_crossentropy(probs,y).mean()

# Ask Theano to compute gradients for us (no computation performed yet)
dw1,dw2 = T.grad(loss,[w1,w2])

# Compile a function that computes loss, scores, and gradients from data and weights
f=theano.function(
    inputs=[x,y,w1,w2],
    outputs=[loss,scores,dw1,dw2],
)           
```

# PyTorch: Three Levels of Abstraction
 - Tensor: Imperative ndarray, but runs on GPU
 - Variable: Node in a computational graph; stores data and gradient
 - Modules: A neural network layer; may store state or learnable weights

# PyTorch: Tensors
PyTorch Tensors are just like numpy arrays, but they can run on GPU.<br>
No built-in notion of computational graph, or gradients, or deep learning.<br>
Here we fit a two-layer net using PyTorch Tensors:<br>
```Python
import torch

# To run on GPU, just cast tensors to a cuda datatype (e.g. dtype = torch.cuda.FloatTensor)
dtype = torch.FloatTensor

# Create random tensors for data and weights
N, D_in, H, D_out = 64,1000,100,10
x=torch.randn(N,D_in).type(dtype)
y=torch.randn(N,D_out).type(dtype)
w1=torch.randn(D_in, H).type(dtype)
w2=torch.randn(H,D_out).type(dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predictions and loss
    h=x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    loss=(y_pred - y).pow(2).sum()
    
    # Backward pass: manually compute gradients
    grad_y_pred = 2. * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h<0] = 0
    grad_w1 = x.t().mm(grad_h)
    
    # Gradient descent step on weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
```

# PyTorch: Autograd
A PyTorch Variable is a node in a computational graph <br>
x.data is a Tensor <br>
x.grad is a Variable of gradients <br>
(same shape as x.data) <br>
x.grad.data is a Tensor of gradients <br>

```Python
import torch
from torch.autograd import Variable

# Create random tensors for data and weights
N, D_in, H, D_out = 64,1000,100,10

# We will not want gradients (of loss) with respect to data
x=Variable(torch.randn(N,D_in), requires_grad=False)
y=Variable(torch.randn(N,D_out), requires_grad=False)
# Do want gradients with respect to weights
w1=Variable(torch.randn(D_in, H), requires_grad=True)
w2=Variable(torch.randn(H,D_out), requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass looks the same as the Tensor version, but everything is a Variable now
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    
    # Compute gradient of loss with respect to w1 and w2 (zero out grads first)
    if w1.grad is not None: w1.grad.data.zero_()
    if w2.grad is not None: w2.grad.data.zero_()
    loss.backward()
    
    # Make gradient step on weights
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data
        
```
PyTorch Tensors and Variables have the same API! <br>
Variables remember how they were created (for backprop) <br>


# PyTorch: New Autograd Functions
- (Slide 94) Define your own autograd functions by writing forward and backward for Tensors <br>
(similar to modular layers in A2)
- Can use our new autograd function in the forward pass
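
As a minimal sketch of this idea, here is a ReLU written as a custom autograd function. It assumes a recent PyTorch release, where `forward` and `backward` are static methods that receive a context object and the function is called through `.apply`; the 2017 lecture code used an older instance-method style. The class name `MyReLU` is only illustrative.

```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Save the input so backward knows where the ReLU was active
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_y):
        x, = ctx.saved_tensors
        grad_x = grad_y.clone()
        grad_x[x < 0] = 0  # no gradient flows where the input was negative
        return grad_x

# Use the new function in a forward pass like any other op
x = torch.tensor([-2.0, -0.5, 1.0, 3.0], requires_grad=True)
y = MyReLU.apply(x)
y.sum().backward()
print(x.grad)
```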

# PyTorch: nn
- Higher-level wrapper for working with neural nets
- Similar to Keras and friends but only one, and it’s good =)
- Define our model as a sequence of layers
- nn also defines common loss functions
- Forward pass: feed data to model, and prediction to loss function
- Backward pass: compute all gradients
- Make gradient step on each model parameter
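
The steps above can be sketched as follows. This is an illustrative example rather than the lecture's exact code: it assumes a recent PyTorch release in which plain tensors track gradients (no `Variable` wrapper needed), and the learning rate and iteration count are arbitrary choices.

```python
import torch

torch.manual_seed(0)
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Define our model as a sequence of layers
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
# nn also defines common loss functions
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
losses = []
for t in range(50):
    # Forward pass: feed data to model, and prediction to loss function
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    losses.append(loss.item())

    # Backward pass: compute all gradients (after clearing the old ones)
    model.zero_grad()
    loss.backward()

    # Make a gradient step on each model parameter
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

print(losses[0], losses[-1])
```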

# PyTorch: optim
- Use an optimizer for different update rules
- Update all parameters after computing gradients
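
A minimal sketch of the same training loop with an optimizer object, again assuming a recent PyTorch release. Adam and the learning rate here are illustrative choices; any optimizer in `torch.optim` is used the same way.

```python
import torch

torch.manual_seed(0)
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# The optimizer holds the parameters and implements the update rule
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

losses = []
for t in range(50):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    losses.append(loss.item())

    optimizer.zero_grad()  # clear gradients from the previous step
    loss.backward()        # compute gradients
    optimizer.step()       # update all parameters

print(losses[0], losses[-1])
```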

# PyTorch: nn Define new Modules
- A PyTorch Module is a neural net layer; it inputs and outputs Variables
- Modules can contain weights (as Variables) or other Modules
- You can define your own Modules using autograd!
- Define our whole model as a single Module
- Initializer sets up two children (Modules can contain modules)
- Define forward pass using child modules and autograd ops on Variables
- No need to define backward - autograd will handle it
- Construct and train an instance of our model
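
A sketch of these points, assuming a recent PyTorch release (tensors rather than Variables flow through the Module). The class name `TwoLayerNet` and the hyperparameters are illustrative.

```python
import torch

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        # Initializer sets up two children (Modules can contain Modules)
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        # Forward pass uses child modules and autograd ops;
        # no backward is defined because autograd handles it
        h_relu = self.linear1(x).clamp(min=0)
        return self.linear2(h_relu)

torch.manual_seed(0)
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct and train an instance of our model
model = TwoLayerNet(D_in, H, D_out)
loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

losses = []
for t in range(50):
    loss = loss_fn(model(x), y)
    losses.append(loss.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(losses[0], losses[-1])
```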

# PyTorch: DataLoaders
- A DataLoader wraps a Dataset and provides minibatching, shuffling, multithreading, for you
- When you need to load custom data, just write your own Dataset class
- Iterate over loader to form minibatches
- Loader gives Tensors so you need to wrap in Variables
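
A minimal sketch of a custom Dataset and a DataLoader over it. The `RandomPairs` class and the batch size are hypothetical; the example assumes a recent PyTorch release, where the tensors from the loader can be fed to a model directly without wrapping them in Variables.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RandomPairs(Dataset):
    """Toy dataset of random (input, target) pairs, standing in for custom data."""

    def __init__(self, n, d_in, d_out):
        self.x = torch.randn(n, d_in)
        self.y = torch.randn(n, d_out)

    def __len__(self):
        return self.x.shape[0]

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

torch.manual_seed(0)
dataset = RandomPairs(n=64, d_in=1000, d_out=10)

# The loader handles minibatching and shuffling
loader = DataLoader(dataset, batch_size=8, shuffle=True)

batch_shapes = []
for x_batch, y_batch in loader:
    # Each iteration yields one minibatch of stacked tensors
    batch_shapes.append((tuple(x_batch.shape), tuple(y_batch.shape)))

print(len(batch_shapes), batch_shapes[0])
```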

# PyTorch: Pretrained Models
- Super easy to use pretrained models with torchvision
- https://github.com/pytorch/vision

# PyTorch: Visdom
- Somewhat similar to TensorBoard: add logging to your code, then visualize in a browser
- Can’t visualize computational graph structure (yet?)
- https://github.com/facebookresearch/visdom

# Aside: Torch
- Direct ancestor of PyTorch (they share a lot of C backend)
- Written in Lua, not Python
- PyTorch has 3 levels of abstraction: Tensor, Variable, and Module
- Torch only has 2: Tensor, Module
- More details: Check 2016 slides

- Build a model as a sequence of layers, and a loss function
- Forward: compute scores and loss
- Backward: compute gradient (no autograd, need to pass grad_scores around)
- Define a callback that inputs weights, produces loss and gradient on weights
- Pass callback to optimizer over and over

# Torch vs PyTorch
    Torch :   (-) Lua    (-) No autograd (+) More stable           (+) Lots of existing code (0) Fast
    PyTorch : (+) Python (+) Autograd    (-) Newer, still changing (-) Less existing code    (0) Fast
    
## Conclusion: Probably use PyTorch for new projects

<img src='./Lesson pic/8-120.png'>

 - With static graphs, the framework can optimize the graph for you before it runs!
 - Static : Once graph is built, can serialize it and run it without the code that built the graph!
 - Dynamic : Graph building and execution are intertwined, so always need to keep code around
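
To make the serialization point concrete, here is a toy sketch in plain Python, added to these notes as an illustration only. It is not how any real framework stores graphs: the graph format, the `OPS` table, and the `run` function are all hypothetical. The graph is built once as data, serialized, and then executed by an interpreter that never sees the code that built it.

```python
import json

# "Build" phase: describe c = sum(x * y + z) as data, without computing anything
graph = [
    {"op": "mul", "inputs": ["x", "y"], "output": "a"},
    {"op": "add", "inputs": ["a", "z"], "output": "b"},
    {"op": "sum", "inputs": ["b"], "output": "c"},
]

# Serialize the graph; from here on the building code is no longer needed
serialized = json.dumps(graph)

# A tiny interpreter: the only thing required to run a serialized graph
OPS = {
    "mul": lambda u, v: [p * q for p, q in zip(u, v)],
    "add": lambda u, v: [p + q for p, q in zip(u, v)],
    "sum": lambda u: sum(u),
}

def run(serialized_graph, feed, fetch):
    values = dict(feed)
    for node in json.loads(serialized_graph):
        args = [values[name] for name in node["inputs"]]
        values[node["output"]] = OPS[node["op"]](*args)
    return values[fetch]

# "Run" phase: the same graph can be executed many times with different inputs
result = run(serialized, {"x": [1.0, 2.0], "y": [3.0, 4.0], "z": [0.5, 0.5]}, "c")
print(result)  # 12.0
```

With a dynamic graph there is no separate description to save, because the graph only exists while the Python code that produces it is running.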

<img src='./Lesson pic/8-125.png'>
<img src='./Lesson pic/8-128.png'>

# Dynamic Graphs in TensorFlow
- TensorFlow Fold makes dynamic graphs easier in TensorFlow through dynamic batching

Dynamic graph applications:
- Recurrent networks
- Recursive networks
- Modular networks
- (Your creative idea here)
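
As a rough illustration of why recurrent networks suit dynamic graphs, the NumPy sketch below runs a vanilla RNN over sequences of different lengths with an ordinary Python loop. It is forward-only and not from the lecture; the sizes, weights, and `rnn_forward` helper are hypothetical. In a dynamic-graph framework the same loop would also record the graph for backprop, and that graph would have a different depth for each sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 5  # input size and hidden size (arbitrary)
Wx = rng.standard_normal((D, H)) * 0.1
Wh = rng.standard_normal((H, H)) * 0.1

def rnn_forward(sequence):
    # The number of steps depends on the data, not on a graph fixed in advance
    h = np.zeros(H)
    for x_t in sequence:
        h = np.tanh(x_t @ Wx + h @ Wh)
    return h

# Three sequences with different lengths
sequences = [rng.standard_normal((T, D)) for T in (2, 5, 9)]
final_states = [rnn_forward(seq) for seq in sequences]

for seq, h in zip(sequences, final_states):
    print(len(seq), h.shape)
```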

# Caffe (UC Berkeley)
 - Core written in C++
 -  Has Python and MATLAB bindings
 -  Good for training or finetuning feedforward classification models
 -  Often no need to write code!
 -  Not used as much in research anymore, still popular for deploying models

# Caffe: Training / Finetuning
No need to write code!
1. Convert data (run a script)
2. Define net (edit prototxt)
3. Define solver (edit prototxt)
4. Train (with pretrained weights) (run a script)

# Caffe step 1: Convert Data
- DataLayer reading from LMDB is the easiest
- Create LMDB using convert_imageset
- Need text file where each line is
  - “[path/to/image.jpeg] [label]”
- Create HDF5 file yourself using h5py
- ImageDataLayer: Read from image files
- WindowDataLayer: For detection
- HDF5Layer: Read from HDF5 file
- From memory, using Python interface
- All of these are harder to use (except Python)
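
A small sketch of producing the text file described above, one "[path] [label]" pair per line. The image paths, labels, and output file name are made up for illustration; the resulting file is what would then be handed to convert_imageset.

```python
import os
import tempfile

# Hypothetical (image path, integer label) pairs
samples = [
    ("images/cat_001.jpeg", 0),
    ("images/dog_001.jpeg", 1),
    ("images/cat_002.jpeg", 0),
]

# Write one "path label" pair per line
list_path = os.path.join(tempfile.mkdtemp(), "train.txt")
with open(list_path, "w") as f:
    for path, label in samples:
        f.write(f"{path} {label}\n")

# Read it back to confirm the format
with open(list_path) as f:
    lines = f.read().splitlines()
print(lines)
```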

# Caffe step 2: Define Network (prototxt)
- .prototxt can get ugly for big models
- ResNet-152 prototxt is 6775 lines long!
- Not “compositional”; can’t easily define a residual block and reuse

# Caffe step 3: Define Solver (prototxt)
- Write a prototxt file defining a SolverParameter
- If finetuning, copy existing solver.prototxt file
  - Change net to be your net
  - Change snapshot_prefix to your output
  - Reduce base learning rate (divide by 100)
  - Maybe change max_iter and snapshot
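
The finetuning edits listed above can be sketched as a small text transformation. This is an illustrative example only: the solver contents and all paths are made up, `set_field` is a hypothetical helper, and it assumes each setting sits on its own `name: value` line.

```python
import re

# Hypothetical contents of an existing solver.prototxt
solver_text = '''net: "models/original/train_val.prototxt"
base_lr: 0.01
max_iter: 450000
snapshot: 10000
snapshot_prefix: "models/original/snap"
'''

def set_field(text, name, value):
    # Replace the value on the line that starts with `name:`
    pattern = rf'^{name}:.*$'
    return re.sub(pattern, f'{name}: {value}', text, flags=re.MULTILINE)

# Read the current base learning rate so it can be divided by 100
old_lr = float(re.search(r'^base_lr:\s*(\S+)', solver_text, flags=re.MULTILINE).group(1))

finetune_text = solver_text
finetune_text = set_field(finetune_text, 'net', '"models/mine/train_val.prototxt"')
finetune_text = set_field(finetune_text, 'snapshot_prefix', '"models/mine/finetune"')
finetune_text = set_field(finetune_text, 'base_lr', old_lr / 100)

print(finetune_text)
```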

# Caffe step 4: Train!
```
./build/tools/caffe train \
-gpu 0 \
-model path/to/trainval.prototxt \
-solver path/to/solver.prototxt \
-weights path/to/pretrained_weights.caffemodel
```

## Alternatives to `-gpu 0`
`-gpu -1` for CPU-only <br>
`-gpu all` for multi-GPU <br>
https://github.com/BVLC/caffe/blob/master/tools/caffe.cpp

## Caffe Model Zoo

    AlexNet, VGG, GoogLeNet, ResNet, plus others

    https://github.com/BVLC/caffe/wiki/Model-Zoo

# Caffe Python Interface
Not much documentation…<br>
Read the code! Two most important files:<br>
- caffe/python/caffe/_caffe.cpp: https://github.com/BVLC/caffe/blob/master/python/caffe/_caffe.cpp
  - Exports Blob, Layer, Net, and Solver classes
- caffe/python/caffe/pycaffe.py: https://github.com/BVLC/caffe/blob/master/python/caffe/pycaffe.py
  - Adds extra methods to Net class

Good for:<br>
● Interfacing with numpy<br>
● Extract features: Run net forward<br>
● Compute gradients: Run net backward (DeepDream, etc)<br>
● Define layers in Python with numpy (CPU only)<br>

# Caffe Pros / Cons

● (+) Good for feedforward networks<br>
● (+) Good for finetuning existing networks<br>
● (+) Train models without writing any code!<br>
● (+) Python interface is pretty useful!<br>
● (+) Can deploy without Python<br>
● (-) Need to write C++ / CUDA for new GPU layers<br>
● (-) Not good for recurrent networks<br>
● (-) Cumbersome for big networks (GoogLeNet, ResNet)<br>

# Caffe2 (Facebook)
● Very new - released a week ago =) <br>
● Static graphs, somewhat similar to TensorFlow<br>
● Core written in C++<br>
● Nice Python interface<br>
● Can train model in Python, then serialize and deploy without Python <br>
● Works on iOS / Android, etc <br>

# Google:  
TensorFlow (“One framework to rule them all”)

# Facebook:
PyTorch (Research) + Caffe2 (Production)

# My Advice:
TensorFlow is a safe bet for most projects. Not perfect but has<br>
huge community, wide usage. Maybe pair with high-level wrapper<br>
(Keras, Sonnet, etc)<br>
I think PyTorch is best for research. However still new, there can be<br>
rough patches.<br>
Use TensorFlow for one graph over many machines<br>
Consider Caffe, Caffe2, or TensorFlow for production deployment<br>

Consider TensorFlow or Caffe2 for mobile<br>

<img src='./Lesson pic/8-120.png'>