# Deep learning with PyTorch
## Introduction
This tutorial will introduce PyTorch, a deep learning framework for python initiated by Facebook.  
Compared to Tensorflow, PyTorch has some features that benefit the development and make it worth learning:
- In TensorFlow, the structure of the neural network must be defined and compiled statically before a model can run, while in PyTorch the structure can change dynamically in the run time, which can be illustrated especially in [sequences modeling](#Example:-Tweets-sentiment-prediction). 
- Users with sufficient knowledge for deep learning can start to code in PyTorch quickly, while TensorFlow's static mechanism leads to more tedious concepts and boilerplate code for users to learn.
- The dynamical mechanism also makes it eaiser to debug in PyTorch. Common Python IDEs like [PyCharm](https://www.jetbrains.com/pycharm/) can directly debug PyTorch but not TensorFlow. 


## Content
We will cover some basic and advanced usages of PyTorch and finally use an application example to illustrate how to use PyTorch to solve a real data problem.   
Following topics are covered:
- [Basic I: Tensor](#Basic-I:-Tensor)
- [Basic II: Autograd: automatic differentiation](#Basic-II:-Autograd:-automatic-differentiation)
- [Basic III: Constructing neural networks](#Basic-III:-Constructing-neural-networks)
- [Advanced I: Module and Function Customization](#Advanced-I:-Module-and-Function-Customization)
- [Advanced II: Loss Function Customization](#Advanced-II:-Loss-Function-Customization)
- [Example: Tweets sentiment prediction](#Example:-Tweets-sentiment-prediction)
- [Further resources](#Further-resources)

**Note**:  
PyTorch, Numpy and Pandas should be installed for running the code.  
As memtion in the [official website](http://pytorch.org/), PyTorch can be installed as below:




<img src="https://s1.ax1x.com/2018/03/17/95zSX9.png">

## Basic I: Tensor
Implementing a neural netowork often involves matrixes, or more generally, [tensors](https://medium.com/@quantumsteinke/whats-the-difference-between-a-matrix-and-a-tensor-4505fbdc576c) computations. In numpy, we may use ndarrays for such computations. In PyTorch, Tensors serve as similar data objects compared to ndarrays in numpy, while it is easy for Tensor to use the power of GPUs.
### Operations
The below code snippets show some most common operations of Tensor:

In [1]:
import torch
print("Creation example")
# Construct a 6x3 matrix, uninitialized
x = torch.Tensor(6, 3)
print(x)
# Construct a randomly initialized matrix from a uniform distribution on the interval [0,1)
x = torch.rand(6, 3)
print(x)

print("Size example")
# Get the size of the tensor
print(x.size())

print("Addition example")
# Addition: syntax 1
y = torch.rand(6, 3)
print(x + y)

# Addition: syntax 2
print(torch.add(x, y))

# Addition: in-place. And all operations that mutates a tensor in-place 
#is post-fixed with an _ character
y.add_(x)
print(y)

# Matric multiplication
print("Matrix multiplication example")
z = torch.rand(3,6)
print(torch.mm(z,y))

print("Slicing example")
# Slicing, standard NumPy-like indexing syntax is supported,
# including indexing using boolean array
print(x[:, 1])
print(x[x>0.2])

Creation example

 4.2549e-37  0.0000e+00  2.4027e-01
 4.5901e-41  2.4027e-01  4.5901e-41
 1.1884e-37  0.0000e+00  1.1884e-37
 0.0000e+00  7.7008e-16  4.5901e-41
 1.5342e-19  1.1707e-19  0.0000e+00
 0.0000e+00  1.3354e+09  4.5901e-41
[torch.FloatTensor of size 6x3]


 0.6403  0.2933  0.7005
 0.5391  0.7813  0.7213
 0.7448  0.8847  0.7505
 0.4780  0.7247  0.4475
 0.3711  0.4441  0.6590
 0.5147  0.4491  0.4959
[torch.FloatTensor of size 6x3]

Size example
torch.Size([6, 3])
Addition example

 1.3746  1.2270  1.0780
 0.8528  1.6781  0.8199
 1.3832  1.7739  1.0324
 0.8843  1.7162  1.0872
 1.1871  0.4506  1.4266
 1.1611  1.0453  1.4595
[torch.FloatTensor of size 6x3]


 1.3746  1.2270  1.0780
 0.8528  1.6781  0.8199
 1.3832  1.7739  1.0324
 0.8843  1.7162  1.0872
 1.1871  0.4506  1.4266
 1.1611  1.0453  1.4595
[torch.FloatTensor of size 6x3]


 1.3746  1.2270  1.0780
 0.8528  1.6781  0.8199
 1.3832  1.7739  1.0324
 0.8843  1.7162  1.0872
 1.1871  0.4506  1.4266
 1.1611  1.0453  1.4595
[torc

More operations on Tensor can be found [here](http://pytorch.org/docs/master/torch.html)

### NumPy Bridge
Tensor can be converted to a NumPy array and vice versa.  
And we need to note that 
the Tensor and NumPy array will share their underlying memory locations, and changing one will change the other.

In [2]:
# Convert tensor to numpy array
tensor = torch.ones(6)
print(tensor)
array = tensor.numpy()
print(array)

# Convert numpy array to tensor
import numpy as np
array = np.ones(5)
tensor = torch.from_numpy(array)
print(tensor)


 1
 1
 1
 1
 1
 1
[torch.FloatTensor of size 6]

[1. 1. 1. 1. 1. 1.]

 1
 1
 1
 1
 1
[torch.DoubleTensor of size 5]



In [3]:
# Change the numpy array will also change the tensor converted from it
array[0]=2
print(array)
print(tensor)

[2. 1. 1. 1. 1.]

 2
 1
 1
 1
 1
[torch.DoubleTensor of size 5]



### Use GPU
Tensors can be moved onto GPU to utilize its power for speeding up the computation using the **.cuda** method.

In [4]:
# The below code will get the on-GPU tensor if CUDA is available
if torch.cuda.is_available():
    x = x.cuda()
    y = y.cuda()
    x + y

## Basic II: Autograd: automatic differentiation
Computing gradients is the key step of conducting backprop in a neural network. Without any other supports, the computation of gradient requiring conducting differentiation will involves much mannual effort to determine how to compute. In order to mitigate this effort, the **autograd** package in PyTorch is designed to provide automatic differentiation for all **defined operations** ([later](#Advanced-I:-Customized-Module-and-Function) we will see the exception) on Tensors. Unlike frameworks like Tensorflow which provide automatic differentiation in a static way, PyTorch designs it as a define-by-run framework, meaning that the backprop depends on how the code is run, and every single iteration can be different instead of being fixed once compiled.

### Variable
It is easy to get the gradents in PyTorch. **autograd.Variable** class wraps a Tensor, and supports nearly all of operations defined on it. Once finishing the computation, calling **.backward()** will make all the gradients computed automatically.  
We can access the raw tensor through its **.data** attribute, and the gradients w.r.t this Variable is accumulated into **.grad**.


<img src="http://pytorch.org/tutorials/_images/Variable.png">

### Function
**Function** is another important class for autograd implementation. Each Variable has a **.grad_fn** attribute that references a Function who created this Variable (but grad_fn is None for Variables created by the user). Function links the Variable it creates and other Variables it use to create this Variable. Therefore, by interconnecting Variable and Function, we can represent a acyclic graph to encode a complete history of the computation.   
**Note**: the .grad_fn attribute will be created automatically when any operation is applied on Variable(s) to create to new one.

In [5]:
from torch.autograd import Variable

# Create a variable, and requires_grad will mark the Variable whose gradient is needed to
# be computed during backprop
x = Variable(torch.ones(3, 3), requires_grad=True)
print(x)

Variable containing:
 1  1  1
 1  1  1
 1  1  1
[torch.FloatTensor of size 3x3]



In [6]:
# Do an operation of variable
y = x + 3
print(y)

# y was created as a result of an operation, so it has a grad_fn.
print(y.grad_fn)

Variable containing:
 4  4  4
 4  4  4
 4  4  4
[torch.FloatTensor of size 3x3]

<AddBackward0 object at 0x7ff4265e3750>


To conduct differentiation, we can call **.backward()** on a Variable, by which gradients w.r.t it will be computed for **all its ancestors in the Variable acyclic graph**:
- If Variable is a scalar (i.e. it holds a one element data), we don’t need to specify any arguments to backward()
- If it has more elements, we need to specify a gradient argument, which is a tensor with matching shape.  
  
Note, once a Variable is created, its backward method can only be called once.

In [7]:
z = y * y * 2
o = z.mean()


# o.backward() is equivalent to doing out.backward(torch.Tensor([1.0])), since out is a scalar
o.backward()
# After calling o.backward(), we can access the computed gradients of x by x.grad
print(x.grad)

Variable containing:
 1.7778  1.7778  1.7778
 1.7778  1.7778  1.7778
 1.7778  1.7778  1.7778
[torch.FloatTensor of size 3x3]



The above should print a matrix with element of value 1.7778.  
Its mannual differentiation process is:  
$o=\frac{1}{9}\sum_{i}z_i$  
$z_i=2(x_i+3)^2$, and $z_i | _{x_i}=32$  
Hence, $\frac{∂o}{∂x_i}=\frac{4}{9}(x_i+3)$  
$\frac{∂o}{∂x_i}|_{x_i=1}=\frac{16}{9}≈1.7778$  
But now, with Autograd, we can save the effor for these differentiation works. 

## Basic III: Constructing neural networks
We can use the **torch.nn** package to construct neural networks.   
### Define the network
Let's take a look at the classic [LeNet](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf) model for classifying digit images.

<img src="http://pytorch.org/tutorials/_images/mnist.png">

As a feed-forward network, it takes the input, feeds it through several layers one by one, and finally generates the output. This kind of networks can be taken as a kind of acyclic graph where each layer in it can be taken as one node in the graph, and the direction of the data flow indicates the edge direction between nodes. Hence, we can see that the acyclic graph representation mechanism described in [Autograd](#Basic-II:-Autograd:-automatic-differentiation) section is quite suitable.  
  
Each neural network model in PyTorch is defined as a subclass of the **nn.Module** class. Every nn.Module (or its subclass) builds on top of other Modules or Functions, and contains a method **forward(input)** that returns the **output**. The input to the forward is an autograd.Variable. So is the output.  
The only things needed to be done is to define the forward method, where we can use any Tensor operations. And the backward method for backprop will be automatically defined using autograd.  

The model above can be implemented as below (for better reference, [official example code](https://github.com/pytorch/tutorials/blob/master/beginner_source/blitz/neural_networks_tutorial.py) is used here):  
**Note**: several pre-defined building blocks of neural networks are under the [nn package](http://pytorch.org/docs/nn).


In [8]:
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square you can only specify a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features


net = Net()
print(net)

Net(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)


All learnable parameters of a model are returned by **net.parameters()**, which is a generator that returns parameters in an order that corresponds to the order of layers in the network starting from the input.  
  
**Note**: The entire torch.nn package only supports inputs in the form of a mini-batch of samples, instead a single sample.  
For example, nn.Conv2d taks a 4D Tensor of nSamples x nChannels x Height x Width as input.  
Therefore, if using a single sample, input.unsqueeze(0) is needed to add a fake batch dimension, where the parameter indicates the index of the dimension to be added.  

In [9]:
params = list(net.parameters())
print(len(params))
print(params[0].size())  # conv1's .weight

10
torch.Size([6, 1, 5, 5])


### Loss function
A loss function takes the (output, target) pairs as inputs, and computes a value for estimating the difference between the output and the target. Several [loss functions](http://pytorch.org/docs/nn.html#loss-functions) are pre-defined under the nn package.  
  
For example, **nn.MSELoss** computes the mean-squared error:

In [10]:
input_data = Variable(torch.randn(1, 1, 32, 32))
# Note that the object of nn.Module will forward the 
# calling of __call__ method to the forward method, so net(input_data)
# actually equals to calling net.forward(input_data)
output = net(input_data)
target = Variable(torch.arange(1, 11))  # a dummy target just for demo
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)

Variable containing:
 38.9208
[torch.FloatTensor of size 1]



When calling **loss.backward()**, the whole graph will be differentiated w.r.t. the loss and gradients of all Variables in the graph will be accumulated in their .grad attributes.
**Note**: before accumulating gradients through backprop, we should use **net.zero_grad()** to zero the gradient buffers.

In [11]:
net.zero_grad()

print('fc3.bias.grad before backward')
print(net.fc3.bias.grad)

loss.backward()

print('fc3.bias.grad after backward')
print(net.fc3.bias.grad)

fc3.bias.grad before backward
None
fc3.bias.grad after backward
Variable containing:
-0.2015
-0.3807
-0.5985
-0.7914
-1.0168
-1.1685
-1.3838
-1.6255
-1.8084
-2.0424
[torch.FloatTensor of size 10]



### Update the parameters
Finally, we need to update the parameters for the network during the training. But we don't need to implement the optimization methods to compute the updates by ourself. Instead, PyTorch provides several optimization methods for updating the weights of the network under [torch.optim](http://pytorch.org/docs/0.3.1/optim.html).  
  
Each optimization method take the parameters of the network as input with possible specific arguments for the method itself to generate the optimizer. By calling **.step()** of the optimizer after the backprop, the parameters of the network will be updated.

In [12]:
import torch.optim as optim

# Create the optimizer, 'lr' is the learning rate parameter
optimizer = optim.SGD(net.parameters(), lr=0.01)

# Below is what needed to be done in one iteration of a loop :
optimizer.zero_grad()   # Zero the gradient buffers
output = net(input_data)
loss = criterion(output, target)
loss.backward()
optimizer.step()    # Conduct the update

Now we can summarize a typical process for training a network in PyTorch as follows:
1. Use optimization methods under **torch.optim** to generate an optimizer for the constructed network
2. Call **zero_grad** of the network or of the optimizer to zero the gradient buffers
3. Give the input(a mini-batch of samples) to the network (object **net** here)
4. Use the output to compute the **loss** with the target using a specific loss function
5. Conduct backprop through calling the **backward** method on the loss
6. Call **step** of the optimizer to update the parameters
7. Back to step 2 to start a new mini-batch

## Advanced I: Module and Function Customization
### Comparison
In [Basic III: Constructing neural networks](#Basic-III:-Constructing-neural-networks), we built a customized CNN-based network in the form of a new **Module** using existed **Modules** and **Functions** provided under [torch.nn](http://pytorch.org/docs/0.3.1/nn.html) and [torch.nn.functional](http://pytorch.org/docs/0.3.1/nn.html#torch-nn-functional). Now let's take a look on the comparison between Module and Function.  
- In general, Module can be taken as a layer of the network or a complete network built on top of other Modules or Functions as seen before. Module defines the hyperparameters of the layer(s) along with how to conduct forward operations to transform the input **Variable** into the output **Variable**, and also keeps the parameters of the layer(s) during training. Since Module is built on top of other Modules (hence on top of Functions finally) or Functions, its backward method can be implemented by simply calling backward methods of Functions'. Therefore PyTorch automatically do it for us and we don't need to define them.
- However, Function **cannot** keep the parameters of the layer, so it would be more suitable for defining operations like activation function and pooling only. Similarly, we needs to define hyperparameters (if needed) and forward method for Function. Additionally, we need to define the backward method for it, which means we need to conduct the differentiation process ourselves.

### Customization
We've seen how to write a customized Module [before](#Define-the-network), now let's look at how to write a customized Function. Here we use [LeakyReLU](https://en.wikipedia.org/wiki/Rectifier_%28neural_networks%29#Leaky_ReLUs) as an example.  
The definition of LeakyReLU is:  
  
$f(x)=max(0,x)+a×min(0,x)$  
  
where a is a small non-zero hyperparameter. And its differentiation is:  
  
$\begin{equation}  
\frac{∂f(x)}{∂x}=\left\{  
             \begin{array}{lr}  
             a, if\ x≤0 &  \\
             1, if\ x>0 &    
             \end{array}  
\right.  
\end{equation}  $  
  
Now we can implement the LeakyReLU Function as below: ([clamp](http://pytorch.org/docs/0.3.1/torch.html?highlight=torch%20clamp#torch.clamp) document may be helpful)
  



In [13]:
class MyLeakyReLU(torch.autograd.Function):

    
    def __init__(self,a):
        self.a=a # Keep the hyperparameter a
        
    def forward(self, input_):
        # Save the input (whose type is Tensor) for later use in backward
        self.save_for_backward(input_)         
        # LeakyReLU is a combination of minimum and scaled maximum clamps
        output = input_.clamp(min=0) + self.a * input_.clamp(max=0) 
        # For LeakyReLU, one input Tensor only results in one output Tensor, 
        # but you can output a tuple containing multiple Tensors when needed
        return output

    def backward(self, grad_input):  
        # According to the chain rule, dloss / dx = (dloss / doutput) * (doutput / dx)
        # and dloss / doutput is the grad_input argument here, whose type is Tensor
        # The number of input Tensors here is equals to the number of output Tensors of the 
        # forward method. And each input Tensor here represents the gradient w.r.t. that output.
        # For LeakyReLU, there is only one output Tensor for the forward method, so we 
        # receive only one input Tensor here correspondingly.
        input_, = self.saved_tensors
        grad_input_copy = grad_input.clone()
        # Now, we need to multiply the dloss / doutput by doutput/dx according to the chain 
        # rule shown above.
        # According to the differentiation, LeakyReLU only scale dloss / doutput of those elements 
        # whose input element<0 by a, and others are kept as original
        # Note that as mentioned before, Tensor is quite like the numpy.narray, therefore
        # retrieving elements using multi-dimensional boolean array is also supported just
        # like numpy.narray
        grad_input_copy[input_ <= 0] = self.a*grad_input_copy[input_ <= 0]
        return grad_input_copy



In [14]:
from torch.autograd import Variable
# Now we can validate the implementation of MyLeakyReLU
array = np.array([-0.5,0.5,0.3])
tensor = torch.from_numpy(array)
input_ = Variable(tensor,requires_grad=True)
leaky_relu = MyLeakyReLU(a=0.1)
output_ = leaky_relu(input_)

# We can see the element with value of -0.5 will be scaled by a=0.1
# and becomes -0.05
print(output_)

Variable containing:
-0.0500
 0.5000
 0.3000
[torch.DoubleTensor of size 3]



In [15]:
o=output_.mean()
o.backward()

# Note that since o=1/3(∑LeakyLeRU(xi)), do/dxi = 1/3*a if xi<=0, or 1/3 if xi>0
# So gradient for x0 which is less than 0 will be 1/3*a ≈ 0.0333
print(input_.grad)

Variable containing:
 0.0333
 0.3333
 0.3333
[torch.DoubleTensor of size 3]



## Advanced II: Loss Function Customization
Loss functions can be seen as a special kind of Module, which takes one or more Tensors as inputs and return a scalar as output. Indeed, all pre-defined loss functions in PyTorch are implemented as subclasses of Module.  
  
One example of defining a L1 loss function is below:

In [16]:
class MyL1Function(nn.Module):
    def __init__(self):
        super(MyL1Function, self).__init__()
    
    def forward(self,input_,target):
        abs_result=torch.abs(input_-target)
        return torch.sum(abs_result)

In [17]:
# We can validate our implementation as below
l1loss=MyL1Function()
t1=torch.from_numpy(np.array([1,1,1]))
t2=torch.from_numpy(np.array([0,-1,2]))
loss=l1loss(t1,t2)
print(loss)

4


## Example: Tweets sentiment prediction
**Note**: This tutorial focuses on the usage of PyTorch, so due to the limit of words, knowledges about [word embedding](https://en.wikipedia.org/wiki/Word_embedding), [RNN](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) and [LSTM](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) will not be covered here, but you can reference the links attached.  
  
The goal here is to build a LSTM-based model for tweets sentiment prediction.  
Assume the input tweet consists of words sequence $w_1,…,w_M$, where $w_i∈V$, the vocabulary, and let $L=\{0,1\}$ be the sentiment labels set. Our model need to generate the prediction sentiment label $\widehat{l}$ of the tweet , where $\widehat{l}∈L$.  
  
This section will introduce a full pipeline of building, training and testing the model as well as illustrate the technique to feed varing length sequences input to RNN-family model in PyTorch.  
  
Let's first take a look at the model. As illustrated below, tweets will first be transferred into padded word id sequences, and then each word id will be replaced by its corresponding word embedding vector, which then serve as the input for the LSTM unit at each time step. Output of each LSTM unit will be transferred to the units of next time step and next layer of corresponding time step, except the unit of the last time step. The output of the unit of the last time step in the last layer will serve as the input to the last linear layer, whose output will pass through a log softmax unit to generate the final output of the model.
  


<img src="https://s1.ax1x.com/2018/03/22/9Hi7rV.png"/>

The model above can be implemented as followings:

In [18]:
import torch
import torch.nn as nn
from torch.nn.utils.rnn import (pack_padded_sequence, pad_packed_sequence)
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, label_size, embedding_dim, 
                 lstm_hidden_size, lstm_num_layers, dropout):
        super(LSTMClassifier, self).__init__()
        self.embedding = nn.Embedding(
            num_embeddings=vocab_size, embedding_dim=embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            input_size=embedding_dim,
            hidden_size=lstm_hidden_size,
            num_layers=lstm_num_layers,
            dropout=dropout,
            bidirectional=False)
        self.dense = nn.Linear(
            in_features=lstm_hidden_size, out_features=label_size)
        self.softmax = nn.LogSoftmax(dim=1)
    
    def forward(self, word_ids, seq_lens):
        # word_ids is assumed to be in the form of nSamples x max_seq_len, 
        # each sample is of varing length substantially but should all be
        # padded to be of same length, the maximum sequence length
        #
        # seq_lens is assumed to be in the form of nSamples x 1, where each
        # element is an integer that indicates the length of the sample before
        # padding
        #
        # nSamples equals to the size of a mini-batch, hence below all use 
        # batch_first=True since nSamples is the first dimension
        embedding = self.embedding(word_ids)
        # Since each sample (word sequence here representing a tweet) is 
        # substantially of varing length, those padded values (word ids here) 
        # should not be used to compute gradient and so on.
        # PyTorch provides a class called PackedSequence to store such type of
        # varing length samples, which can be taken as the input of rnn-family
        # layer such as LSTM, GRU, etc. And pack_padded_sequence function is the 
        # key function to transform the same-length padded sequences to PackedSequence,
        # provided the sequences and the original sequence lengths.
        # However, it requires sequences input to be sorted by length in a decreasing
        # order, which should be paid attention to during input preparation.
        embedding = pack_padded_sequence(embedding, seq_lens, batch_first=True)
        # Note that although for each mini-batch, the input sequences length seems to be the same,
        # but different batch may have different maximum sequence length. However, unlike Tensorflow
        # which compiles the graph statically, PyTorch allow dinamically and automatically changing
        # the structure parameters of the network in the run time, therefore we don't need to do any
        # special action to deal with these differences, while Tensorflow, instead, has to at least
        # estimate the maximum sequence length for all possible sequences at the very beginning 
        # to avoid re-compile the model.
        out_lstm, _ = self.lstm(embedding)
        # pad_packed_sequence just does the reverse thing of pack_padded_sequence's,
        # which transforms the PackedSequence back to padded sequences.
        out_lstm, lengths = pad_packed_sequence(out_lstm, batch_first=True)
        # For classification, we only need the last time step
        # output from LSTM to feed the final linear layer
        lengths = [l - 1 for l in lengths]
        last_output = out_lstm[range(len(lengths)),lengths]
        out_linear = self.dense(last_output)
        output = self.softmax(out_linear)
        return output

Below is the data preprocessing code. Note that the tweet csv file can be downloaded [here](https://drive.google.com/file/d/1qkFbgdg_gs85sqUhvToQ8R33R3jvDPwK/view?usp=sharing), which is indeed the first 5000 tweets from a [bigger dataset](http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip).

In [19]:
import pandas as pd
import re

# Clean the text by rules
def cleanup_text(texts):
    cleaned_text = []
    for text in texts:
        # remove &quot and &amp
        text = re.sub(r'&quot;(.*?)&quot;', "\g<1>", text)
        text = re.sub(r'&amp;', "", text)

        # replace emoticon
        text = re.sub(r'(^| )(\:\w+\:|\<[\/\\]?3|[\(\)\\\D|\*\$][\-\^]?[\:\;\=]|[\:\;\=B8][\-\^]?[3DOPp\@\$\*\\\)\(\/\|])(?=\s|[\!\.\?]|$)', "\g<1>TOKEMOTICON", text)

        text = text.lower()
        text = text.replace("tokemoticon", "TOKEMOTICON")

        # replace url
        text = re.sub(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?',
                    "TOKURL", text)

        # replace mention
        text = re.sub(r'@[\w]+', "TOKMENTION", text)

        # replace hashtag
        text = re.sub(r'#[\w]+', "TOKHASHTAG", text)

        # replace dollar
        text = re.sub(r'\$\d+', "TOKDOLLAR", text)

        # remove punctuation
        text = re.sub('[^a-zA-Z0-9]', ' ', text)

        # remove multiple spaces
        text = re.sub(r' +', ' ', text)

        # remove newline
        text = re.sub(r'\n', ' ', text)

        cleaned_text.append(text)
    return cleaned_text

# Transform the words sequence into padded word ids sequence
def prepare_sequence(seq, to_ix,max_len):
    unseen = len(to_ix)
    idxs = [to_ix[w] if w in to_ix else unseen for w in seq]
    return np.pad(np.array(idxs),(0,max_len-len(idxs)),"constant")

# Transform the text data into padded word ids sequence and 
# sort them by their length in a descending order
# Return transformed text data, original sequence lengths list and label list
# in corresponding sorted order
def transform(data,word_to_ix):
    sorted_data = data.sort_values(by="seq_len",ascending=False)
    text_x = sorted_data["text"].values
    seq_lens = list(map(lambda x: len(x), text_x))
    max_len = max(seq_lens)
    text_x = np.array(map(lambda seq: prepare_sequence(seq, word_to_ix, max_len), text_x))
    text_x = torch.autograd.Variable(torch.from_numpy(text_x))
    return text_x, seq_lens, sorted_data["label"].values

dataset_name = "sentiment-analysis-dataset5000.csv"
train_frac = 0.7

data = pd.read_csv(dataset_name, names=["text", "label"])
data["text"] = cleanup_text(data["text"].values)
data["text"] = data["text"].apply(lambda x: x.split())
data["seq_len"] = data["text"].apply(len)
data = data.sample(frac=1, random_state=2018).reset_index(drop=True)
data_size = len(data)
train_data = data[:(int)(train_frac * data_size)]
val_data = data[(int)(train_frac * data_size):]

word_to_ix = {}
for tweet in train_data["text"].values:
    for word in tweet:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)


vocab_size = len(word_to_ix)+1 # One extra slot for all unseen words
label_size = len(data.label.unique())




To save time, only 10 epoch is trained below, but it is enough to illustrate the process of training.

In [20]:
batch_size=100
embedding_dim=300
lstm_hidden_size=256
lstm_num_layers=3
dropout=0.2

"""
As summarized before, the typical steps are:
 1.   Use optimization methods under torch.optim to generate an optimizer for the constructed network
 2.   Call zero_grad of the network or of the optimizer to zero the gradient buffers
 3.   Give the input(a mini-batch of samples) to the network
 4.   Use the output to compute the loss with the target using a specific loss function
 5.   Conduct backprop through calling the backward method on the loss
 6.   Call step of the optimizer to update the parameters
 7.   Back to step 2 to start a new mini-batch
"""


loss_fn = nn.NLLLoss()
model = LSTMClassifier(vocab_size, label_size, embedding_dim,
                       lstm_hidden_size, lstm_num_layers, dropout)
optimizer = torch.optim.Adam(model.parameters())

torch.manual_seed(2018)
for epoch in range(10):
    print("epoch:"+str(epoch))
    for batch_index in range(0, len(train_data), batch_size):
        batch_data = train_data[batch_index: batch_index+batch_size]
        
        model.zero_grad()
        train_x, seq_lens, train_y = transform(batch_data,word_to_ix)
        
        y_pred = model(train_x, seq_lens)
        
        train_y = torch.autograd.Variable(torch.from_numpy(train_y))
        
        
        loss = loss_fn(y_pred, train_y)
        loss.backward()
        optimizer.step()
        print("batch_range:["+str(batch_index)+","+str(batch_index+batch_size)+"),loss:"+str(loss.data.numpy()[0]))
        


epoch:0
batch_range:[0,100),loss:0.71413857
batch_range:[100,200),loss:0.6888852
batch_range:[200,300),loss:0.6732738
batch_range:[300,400),loss:0.66105306
batch_range:[400,500),loss:0.5905137
batch_range:[500,600),loss:0.54367226
batch_range:[600,700),loss:0.7153888
batch_range:[700,800),loss:0.6606488
batch_range:[800,900),loss:0.6947618
batch_range:[900,1000),loss:0.57920736
batch_range:[1000,1100),loss:0.618479
batch_range:[1100,1200),loss:0.59570414
batch_range:[1200,1300),loss:0.60670584
batch_range:[1300,1400),loss:0.6103523
batch_range:[1400,1500),loss:0.60352015
batch_range:[1500,1600),loss:0.6328545
batch_range:[1600,1700),loss:0.5811124
batch_range:[1700,1800),loss:0.6144712
batch_range:[1800,1900),loss:0.59883475
batch_range:[1900,2000),loss:0.6295128
batch_range:[2000,2100),loss:0.66279894
batch_range:[2100,2200),loss:0.5876168
batch_range:[2200,2300),loss:0.52975965
batch_range:[2300,2400),loss:0.5235077
batch_range:[2400,2500),loss:0.56950676
batch_range:[2500,2600),loss

batch_range:[3400,3500),loss:0.07493071
epoch:6
batch_range:[0,100),loss:0.118619315
batch_range:[100,200),loss:0.25951198
batch_range:[200,300),loss:0.1966622
batch_range:[300,400),loss:0.39412773
batch_range:[400,500),loss:0.10707372
batch_range:[500,600),loss:0.101805285
batch_range:[600,700),loss:0.047954027
batch_range:[700,800),loss:0.071542256
batch_range:[800,900),loss:0.08736799
batch_range:[900,1000),loss:0.13768645
batch_range:[1000,1100),loss:0.13045196
batch_range:[1100,1200),loss:0.13703293
batch_range:[1200,1300),loss:0.18783821
batch_range:[1300,1400),loss:0.13811305
batch_range:[1400,1500),loss:0.12632355
batch_range:[1500,1600),loss:0.06476125
batch_range:[1600,1700),loss:0.13325772
batch_range:[1700,1800),loss:0.08644192
batch_range:[1800,1900),loss:0.08522044
batch_range:[1900,2000),loss:0.03621664
batch_range:[2000,2100),loss:0.104220115
batch_range:[2100,2200),loss:0.07371977
batch_range:[2200,2300),loss:0.07876557
batch_range:[2300,2400),loss:0.14184876
batch_ran

Finally we can evaluate the model.

In [21]:
val_x, val_seq_lens, val_y = transform(val_data,word_to_ix)
y_pred = model(val_x,val_seq_lens)
_, label_pred = torch.max(y_pred,1)
print("accuracy:%f" % (sum(label_pred.data.numpy() == val_y)/float(len(val_y))))

accuracy:0.748667


## Further resources
- [Official examples](https://github.com/pytorch/examples)
- [30 lines code for most models](https://github.com/yunjey/pytorch-tutorial)