# Deep Learning for Natural Language Processing with PyTorch
This tutorial will walk you through the key ideas of deep learning programming using PyTorch.
Many of the concepts (such as the computation graph abstraction and autograd) are not unique to PyTorch and are relevant to any deep learning toolkit out there.

I am writing this tutorial to focus specifically on NLP for people who have never written code in any deep learning framework (e.g., TensorFlow, Theano, Keras, DyNet).

In [3]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# 1. Introduction to Torch's tensor library

All of deep learning is computations on tensors, which are generalizations of a matrix that can be indexed in more than 2 dimensions.  We will see exactly what this means in depth later.  First, let's look at what we can do with tensors.

### Creating Tensors
Tensors can be created from Python lists with the torch.Tensor() function.

In [8]:
# Create a torch.Tensor object with the given data.  It is a 1D vector
V_data = [1., 2., 3.]
V = torch.Tensor(V_data)
print V

# Creates a matrix
M_data = [[1., 2., 3.], [4., 5., 6]]
M = torch.Tensor(M_data)
print M

# Create a 3D tensor of size 2x2x2.
T_data = [[[1.,2.], [3.,4.]],
          [[5.,6.], [7.,8.]]]
T = torch.Tensor(T_data)
print T


 1
 2
 3
[torch.FloatTensor of size 3]


 1  2  3
 4  5  6
[torch.FloatTensor of size 2x3]


(0 ,.,.) = 
  1  2
  3  4

(1 ,.,.) = 
  5  6
  7  8
[torch.FloatTensor of size 2x2x2]



What is a 3D tensor anyway?
Think about it like this.
If you have a vector, indexing into the vector gives you a scalar.  If you have a matrix, indexing into the matrix gives you a vector.  If you have a 3D tensor, then indexing into the tensor gives you a matrix!

A note on terminology: when I say "tensor" in this tutorial, it refers to any torch.Tensor object.  Vectors and matrices are special cases of torch.Tensors, where their dimension is 1 and 2 respectively.  When I am talking about 3D tensors, I will explicitly use the term "3D tensor".

In [9]:
# Index into V and get a scalar
print V[0]

# Index into M and get a vector
print M[0]

# Index into T and get a matrix
print T[0]

1.0

 1
 2
 3
[torch.FloatTensor of size 3]


 1  2
 3  4
[torch.FloatTensor of size 2x2]



You can also create tensors of other datatypes.  The default, as you can see, is Float.
To create a tensor of integer types, try torch.LongTensor().  Check the documentation for more data types, but Float and Long will be the most common.

You can create a tensor with random data and the supplied dimensionality with torch.randn().

In [10]:
x = torch.randn((3, 4, 5))
print x


(0 ,.,.) = 
  0.1146 -0.6506  0.6839 -1.5281 -0.3123
 -0.3978  0.1204  1.0106  0.6426 -0.3773
  1.5972  0.6948 -0.6626 -0.5126  2.1116
  0.8082  2.5599 -0.5388 -0.2966 -0.8011

(1 ,.,.) = 
  0.4382  0.2241 -0.8473  0.2425  0.9720
 -0.3337  0.7472 -0.4915 -0.5668 -0.8585
 -2.1146  0.2108 -0.6764  0.0754  1.8036
  0.8029 -1.2479 -0.3994 -0.8097 -1.0615

(2 ,.,.) = 
  1.2940  0.3598  1.5550 -0.1175 -0.2281
  0.4670 -0.9777  1.7070  0.3330  2.5000
 -1.6602  0.1536  1.2721  1.3831 -0.1833
 -0.0308 -1.9315  0.4497  1.4138  0.5602
[torch.FloatTensor of size 3x4x5]



### Operations with Tensors
You can operate on tensors in the ways you would expect.

In [11]:
x = torch.Tensor([ 1., 2., 3. ])
y = torch.Tensor([ 4., 5., 6. ])
z = x + y
print z


 5
 7
 9
[torch.FloatTensor of size 3]



See [the documentation](http://pytorch.org/docs/torch.html) for a complete list of the massive number of operations available to you.  They expand beyond just mathematical operations.

One helpful operation that we will make use of later is concatenation.

In [22]:
# By default, it concatenates along the first axis (concatenates rows)
x_1 = torch.randn(2, 5)
y_1 = torch.randn(3, 5)
z_1 = torch.cat([x_1, y_1])
print z_1

# Concatenate columns:
x_2 = torch.randn(2, 3)
y_2 = torch.randn(2, 5)
z_2 = torch.cat([x_2, y_2], 1) # second arg specifies which axis to concat along
print z_2

# If your tensors are not compatible, torch will complain.  Uncomment to see the error
# torch.cat([x_1, x_2])


-0.0180  0.0584  0.9895  0.6425  1.8225
 1.3009  2.2458  0.0894  0.7866 -0.9593
-0.6731  0.5934 -0.6335 -0.4634  2.0126
 0.0406 -0.5549  0.5613 -2.4119 -0.2033
-0.3331 -0.2922 -1.7489 -0.4690 -0.7442
[torch.FloatTensor of size 5x5]


-0.5333  1.3768  0.8385  0.2887  0.0408  1.0201  0.9143  0.3634
 0.3216 -0.6005 -1.8512 -0.5111  0.7530  0.4623  0.7390 -1.6262
[torch.FloatTensor of size 2x8]



### Reshaping Tensors
Use the .view() method to reshape a tensor.
This method receives heavy use, because many neural network components expect their inputs to have a certain shape.
Often you will need to reshape before passing your data to the component.

In [28]:
x = torch.randn(2, 3, 4)
print x
print x.view((2, 12)) # Reshape to 2 rows, 12 columns
print x.view((2, -1)) # Same as above.  If one of the dimensions is -1, its size can be inferred


(0 ,.,.) = 
 -0.2645 -0.4485  0.2376 -1.4220
 -1.5924 -0.4594 -0.2210 -0.2319
 -1.1506 -0.2188 -1.4698 -1.8785

(1 ,.,.) = 
  0.9020 -0.1623 -1.5317 -0.1645
 -1.9074  0.2076 -1.1382  1.4981
 -0.0087 -0.0006  0.6136 -0.2514
[torch.FloatTensor of size 2x3x4]



Columns 0 to 9 
-0.2645 -0.4485  0.2376 -1.4220 -1.5924 -0.4594 -0.2210 -0.2319 -1.1506 -0.2188
 0.9020 -0.1623 -1.5317 -0.1645 -1.9074  0.2076 -1.1382  1.4981 -0.0087 -0.0006

Columns 10 to 11 
-1.4698 -1.8785
 0.6136 -0.2514
[torch.FloatTensor of size 2x12]



Columns 0 to 9 
-0.2645 -0.4485  0.2376 -1.4220 -1.5924 -0.4594 -0.2210 -0.2319 -1.1506 -0.2188
 0.9020 -0.1623 -1.5317 -0.1645 -1.9074  0.2076 -1.1382  1.4981 -0.0087 -0.0006

Columns 10 to 11 
-1.4698 -1.8785
 0.6136 -0.2514
[torch.FloatTensor of size 2x12]
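To make the reshaping rule concrete, here is a minimal sketch in plain Python (no PyTorch needed) of what a 2D reshape does to row-major data.  The helper `view_2d` is hypothetical, not part of PyTorch, and works on a flat list instead of a tensor.  It only illustrates how a `-1` dimension is inferred from the total number of elements.

```python
def view_2d(flat, rows, cols):
    """Regroup a flat, row-major list into rows x cols (hypothetical helper).

    One of rows/cols may be -1, in which case it is inferred from the
    number of elements, mirroring the -1 convention of .view().
    """
    n = len(flat)
    if rows == -1 and cols == -1:
        raise ValueError("only one dimension may be -1")
    if rows == -1:
        rows = n // cols
    elif cols == -1:
        cols = n // rows
    if rows * cols != n:
        raise ValueError("cannot view %d elements as %dx%d" % (n, rows, cols))
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


flat = list(range(24))      # stands in for the 24 entries of a 2x3x4 tensor
a = view_2d(flat, 2, 12)    # explicit shape
b = view_2d(flat, 2, -1)    # second dimension inferred as 24 / 2 = 12
print(a == b)
print(len(b), len(b[0]))
```

The data itself is never changed; only the way it is grouped into rows changes, which is why the two printed tensors above are identical.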




# 2. Computation Graphs and Automatic Differentiation

The concept of a computation graph is essential to efficient deep learning programming, because it allows you to not have to write the backpropagation gradients yourself.  A computation graph is simply a specification of how your data is combined to give you the output.  Since the graph totally specifies what parameters were involved with which operations, it contains enough information to compute derivatives.  This probably sounds vague, so let's see what is going on using the fundamental class of PyTorch: autograd.Variable.

First, think from a programmer's perspective.  What is stored in the torch.Tensor objects we were creating above?
Obviously the data and the shape, and maybe a few other things.  But when we added two tensors together, we got an output tensor.  All this output tensor knows is its data and shape.  It has no idea that it was the sum of two other tensors (it could have been read in from a file, it could be the result of some other operation, etc.)

The Variable class keeps track of how it was created.  Let's see it in action.

In [49]:
# Variables wrap tensor objects
x = autograd.Variable( torch.Tensor([1., 2., 3]), requires_grad=True )
# You can access the data with the .data attribute
print x.data

# You can also do all the same operations you did with tensors with Variables.
y = autograd.Variable( torch.Tensor([4., 5., 6]), requires_grad=True )
z = x + y
print z.data

# BUT z knows something extra.
print z.creator


 1
 2
 3
[torch.FloatTensor of size 3]


 5
 7
 9
[torch.FloatTensor of size 3]

<torch.autograd._functions.basic_ops.Add object at 0x7f02746c4820>


So Variables know what created them.  z knows that it wasn't read in from a file, it wasn't the result of a multiplication or exponential or whatever.  And if you keep following z.creator, you will find yourself at x and y.

But how does that help us compute a gradient?

In [50]:
# Let's sum up all the entries in z
s = z.sum()
print s
print s.creator

Variable containing:
 21
[torch.FloatTensor of size 1]

<torch.autograd._functions.reduce.Sum object at 0x7f02746c4370>


So now, what is the derivative of this sum with respect to the first component of x?  In math, we want
$$ \frac{\partial s}{\partial x_0} $$
Well, s knows that it was created as a sum of the tensor z.  z knows that it was the sum x + y.
So 
$$ s = \overbrace{x_0 + y_0}^\text{$z_0$} + \overbrace{x_1 + y_1}^\text{$z_1$} + \overbrace{x_2 + y_2}^\text{$z_2$} $$
And so s contains enough information to determine that the derivative we want is 1!

Of course this glosses over the challenge of how to actually compute that derivative.  The point here is that s is carrying along enough information that it is possible to compute it.  In reality, the developers of PyTorch program the sum() and + operations to know how to compute their gradients, and run the backpropagation algorithm.  An in-depth discussion of that algorithm is beyond the scope of this tutorial.
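If you want to convince yourself of that derivative without any framework, a numerical check works.  The sketch below is plain Python and uses a central finite difference, which is *not* how autograd computes gradients; it is just an independent sanity check.  The function names `s` and `partial_wrt_x` and the step size `eps` are my own choices for illustration.

```python
def s(x, y):
    """s = sum_i (x_i + y_i), the same quantity as z.sum() above."""
    return sum(xi + yi for xi, yi in zip(x, y))


def partial_wrt_x(x, y, i, eps=1e-6):
    """Central finite-difference estimate of ds/dx_i."""
    x_plus = list(x)
    x_plus[i] += eps
    x_minus = list(x)
    x_minus[i] -= eps
    return (s(x_plus, y) - s(x_minus, y)) / (2 * eps)


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
print(s(x, y))  # 21.0

grads = [partial_wrt_x(x, y, i) for i in range(len(x))]
print([round(g, 4) for g in grads])  # every entry is (approximately) 1
```

Nudging any single component of x by a small amount changes s by the same amount, which is exactly what a partial derivative of 1 means.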

Let's have PyTorch compute the gradient and see that we were right.  (Note: if you run this block multiple times, the gradient will increment.  That is because PyTorch *accumulates* the gradient into the .grad property, since for many models this is very convenient.)

In [51]:
s.backward() # calling .backward() on any variable will run backprop, starting from it.
print x.grad

Variable containing:
 1
 1
 1
[torch.FloatTensor of size 3]



Understanding what is going on in the block below is crucial for being a successful programmer in deep learning.

In [54]:
x = torch.randn((2,2))
y = torch.randn((2,2))
z = x + y # These are Tensor types, and backprop would not be possible

var_x = autograd.Variable( x )
var_y = autograd.Variable( y )
var_z = var_x + var_y # var_z contains enough information to compute gradients, as we saw above
print var_z.creator

var_z_data = var_z.data # Get the wrapped Tensor object out of var_z...
new_var_z = autograd.Variable( var_z_data ) # Re-wrap the tensor in a new variable

# ... does new_var_z have information to backprop to x and y?
# NO!
print new_var_z.creator
# And how could it?  We yanked the tensor out of var_z (that is what var_z.data is).  This tensor
# doesn't know anything about how it was computed.  We pass it into new_var_z, and this is all the information
# new_var_z gets.  If var_z_data doesn't know how it was computed, there's no way new_var_z will.
# In essence, we have broken the variable away from its past history.

<torch.autograd._functions.basic_ops.Add object at 0x7f02746c48e8>
None


Here is the basic, extremely important rule for computing with autograd.Variables (note this is more general than PyTorch.  There is an equivalent object in every major deep learning toolkit):

**If you want the error from your loss function to backpropagate to a component of your network, you MUST NOT break the Variable chain from that component to your loss Variable.  If you do, the loss will have no idea your component exists, and its parameters can't be updated.**

I say this in bold, because this error can creep up on you in very subtle ways (I will show some such ways below), and it will not cause your code to crash or complain, so you must be careful.
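To see why a broken chain fails silently, here is a toy sketch in plain Python.  The class `Var` is a hypothetical stand-in I made up for illustration; it supports only scalar addition and is in no way how PyTorch is implemented.  What it shares with the real thing is the idea that each value remembers what it was computed from, and that gradients can only flow along those remembered links.

```python
class Var(object):
    """Toy scalar stand-in for autograd.Variable (hypothetical, not PyTorch)."""

    def __init__(self, data, parents=()):
        self.data = data          # the wrapped value
        self.parents = parents    # the Vars this one was computed from
        self.grad = 0.0           # accumulated gradient

    def __add__(self, other):
        return Var(self.data + other.data, parents=(self, other))

    def backward(self, upstream=1.0):
        # d(a + b)/da = d(a + b)/db = 1, so the upstream gradient is
        # accumulated here and passed unchanged to every parent.
        self.grad += upstream
        for parent in self.parents:
            parent.backward(upstream)


# Intact chain: the gradient reaches both inputs.
x, y = Var(2.0), Var(3.0)
z = x + y
z.backward()
print(x.grad, y.grad)

# Broken chain: pull the raw number out and re-wrap it.
x2, y2 = Var(2.0), Var(3.0)
z2 = x2 + y2
rewrapped = Var(z2.data)    # same value, but no parents: history is gone
rewrapped.backward()
print(x2.grad, y2.grad)     # still 0.0, and nothing crashed or complained
```

The second case is the dangerous one: the numbers are all still correct, the backward call succeeds, and the inputs simply never receive a gradient.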

# 3. Deep Learning Building Blocks: Affine maps, non-linearities and objectives

Deep learning consists of composing linearities with non-linearities in clever ways.  The introduction of non-linearities allows for powerful models.  In this section, we will play with these core components, make up an objective function, and see how the model is trained.

### Affine Maps
One of the core workhorses of deep learning is the affine map, which is a function $f(x)$ where
$$ f(x) = Ax + b $$
for a matrix $A$ and vectors $x, b$.  The parameters to be learned here are $A$ and $b$.  Often, $b$ is referred to as the *bias* term.

### Non-Linearities
First, note the following fact, which will explain why we need non-linearities in the first place.
Suppose we have two affine maps $f(x) = Ax + b$ and $g(x) = Cx + d$.  What is $f(g(x))$?
$$ f(g(x)) = A(Cx + d) + b = ACx + (Ad + b) $$
$AC$ is a matrix and $Ad + b$ is a vector, so we see that composing affine maps gives you an affine map.

From this, you can see that if your neural network were just a long chain of affine compositions, it would add no new power to your model beyond a single affine map.
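The identity above is easy to check numerically.  The sketch below is plain Python with small hand-written helpers (`matvec`, `matmul`, `affine` are my own names, and the matrices are arbitrary values chosen for illustration).  It evaluates $f(g(x))$ directly and then as the single affine map $ACx + (Ad + b)$.

```python
def matvec(A, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def matmul(A, C):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*C))
    return [[sum(a * c for a, c in zip(row, col)) for col in cols] for row in A]


def affine(A, b, x):
    """The affine map x -> Ax + b."""
    return [v + bi for v, bi in zip(matvec(A, x), b)]


# Arbitrary illustrative parameters for f(x) = Ax + b and g(x) = Cx + d.
A, b = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]
C, d = [[3.0, 0.0], [1.0, 1.0]], [-1.0, 2.0]
x = [2.0, -3.0]

composed = affine(A, b, affine(C, d, x))    # f(g(x)), two layers
AC = matmul(A, C)                           # the matrix of the combined map
Ad_plus_b = affine(A, b, d)                 # the bias of the combined map
single = affine(AC, Ad_plus_b, x)           # one layer, same result

print(composed)
print(single)
```

Both lines print the same vector, so the two-layer "network" here is exactly as expressive as a one-layer network.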

If we introduce non-linearities in between the affine layers, this is no longer the case, and we can build much more powerful models.

There are a few core non-linearities.  $\tanh(x), \sigma(x), \text{ReLU}(x)$ are the most common.
You are probably wondering: "why these functions?  I can think of plenty of other non-linearities."
The reason for this is that they have gradients that are easy to compute, and computing gradients is essential for learning.  For example
$$ \frac{d\sigma}{dx} = \sigma(x)(1 - \sigma(x)) $$

A quick note: although you may have learned some neural networks in your intro to AI class where $\sigma(x)$ was the default non-linearity, typically people shy away from it in practice.  This is because the gradient *vanishes* very quickly as the absolute value of the argument grows.  Small gradients mean it is hard to learn.  Most people default to tanh or ReLU.
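Both claims about $\sigma$ can be checked in a few lines of plain Python.  This sketch compares the closed-form derivative $\sigma(x)(1 - \sigma(x))$ against a finite-difference estimate, and prints the gradient at a few points so you can watch it shrink.  The function names and the sample points are my own choices.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Closed-form derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def numeric_grad(f, x, eps=1e-5):
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


for x in [0.0, 2.0, 5.0, 10.0]:
    # The two gradient columns should agree; both shrink fast as x grows.
    print(x, sigmoid_grad(x), numeric_grad(sigmoid, x))
```

The gradient peaks at 0.25 when the input is 0 and is already below 0.0001 at an input of 10, which is the vanishing behavior described above.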

### Softmax and Probabilities
The function $\text{Softmax}(x)$ is also just a non-linearity, but it is special in that it usually is the last operation done in a network.  This is because it takes in a vector of real numbers and returns a probability distribution.  Its definition is as follows.  Let $x$ be a vector of real numbers (positive, negative, whatever, there are no constraints).  Then the i'th component of $\text{Softmax}(x)$ is
$$ \frac{\exp(x_i)}{\sum_j \exp(x_j)} $$
It should be clear that the output is a probability distribution: each element is non-negative and the sum over all components is 1.

You could also think of it as just applying an element-wise exponentiation operator to the input (to make everything non-negative) and then dividing by the normalization constant.
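Here is that recipe as a minimal plain-Python sketch.  One detail is my addition and not part of the definition above: subtracting the maximum before exponentiating.  It cancels in the ratio, so the result is unchanged, but it keeps `exp()` from overflowing on large inputs.

```python
import math


def softmax(x):
    """Softmax of a list of real numbers."""
    # Shifting by max(x) does not change the output (the factor cancels
    # between numerator and denominator) but avoids overflow in exp().
    m = max(x)
    exps = [math.exp(xi - m) for xi in x]   # element-wise exponentiation
    total = sum(exps)                       # the normalization constant
    return [e / total for e in exps]


probs = softmax([2.0, -1.0, 0.5])
print(probs)
print(sum(probs))   # sums to 1, up to floating point rounding
```

The largest input gets the largest probability, negative inputs are fine, and every output is strictly positive.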

### Objective Functions
The objective function is the function that your network is being trained to minimize (in which case it is often called a *loss function* or *cost function*).
This proceeds by first choosing a training instance, running it through your neural network, and then computing the loss of the output.  The parameters of the model are then updated by taking the derivative of the loss function.  Intuitively, if your model is completely confident in its answer, and its answer is wrong, your loss will be high.  If it is very confident in its answer, and its answer is correct, the loss will be low.

The idea behind minimizing the loss function on your training examples is that your network will hopefully generalize well and have small loss on unseen examples in your dev set, test set, or in production.
An example loss function is the *negative log likelihood loss*, which is a very common objective for multi-class classification.  For supervised multi-class classification, this means training the network to minimize the negative log probability of the correct output (or equivalently, maximize the log probability of the correct output).
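For a single instance, the negative log likelihood is nothing more than the negated log probability the model assigned to the correct class.  The plain-Python sketch below uses made-up model outputs (the probabilities 0.99 and 0.01 are purely illustrative) to show the "confident and right" versus "confident and wrong" cases.

```python
import math


def nll_loss(log_probs, target):
    """Negative log likelihood of the correct class for one instance."""
    return -log_probs[target]


# Hypothetical model outputs over two classes, as log probabilities.
confident_right = [math.log(0.99), math.log(0.01)]
confident_wrong = [math.log(0.01), math.log(0.99)]
target = 0   # index of the correct class

print(nll_loss(confident_right, target))   # small loss
print(nll_loss(confident_wrong, target))   # large loss
```

Minimizing this loss pushes the log probability of the correct class up, which is the "equivalently, maximize" reading given above.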

# 4. Optimization and Training

So we can compute a loss function for an instance.  What do we do with that?
We saw earlier that autograd.Variables know how to compute gradients with respect to the things that were used to compute them.  Well, since our loss is an autograd.Variable, we can compute gradients with respect to all of the parameters used to compute it!  Then we can perform standard gradient updates.  Let $\theta$ be our parameters, $L(\theta)$ the loss function, and $\eta$ a positive learning rate.  Then:

$$ \theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta L(\theta) $$
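The update rule above can be sketched in plain Python on a one-parameter problem.  Everything specific here is made up for illustration: the quadratic loss $L(\theta) = (\theta - 3)^2$, its hand-written gradient, the starting point, and the learning rate.

```python
def loss(theta):
    """Toy loss with its minimum at theta = 3."""
    return (theta - 3.0) ** 2


def grad(theta):
    """Derivative of the toy loss, written by hand."""
    return 2.0 * (theta - 3.0)


def gradient_descent(theta, eta, steps):
    """Repeat theta <- theta - eta * grad(theta), recording the loss."""
    history = [loss(theta)]
    for _ in range(steps):
        theta = theta - eta * grad(theta)
        history.append(loss(theta))
    return theta, history


theta, history = gradient_descent(theta=0.0, eta=0.1, steps=50)
print(theta)         # close to 3
print(history[0])    # loss at the start
print(history[-1])   # loss after 50 updates
```

With autograd you never write `grad` yourself, but the update line inside the loop is the same idea.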

There is a huge collection of algorithms and active research in attempting to do something more than just this vanilla gradient update.  Many attempt to vary the learning rate based on what is happening at train time.  You don't need to worry about what specifically these algorithms are doing unless you are really interested.  Torch provides many in the torch.optim package, and they are all completely transparent.  In code, using the simplest gradient update looks the same as using the more complicated algorithms.  Trying different update algorithms and different parameters for the update algorithms (like different initial learning rates) is important in optimizing your network's performance.  Often,

# 5. Creating Network Components in PyTorch
Before we move on to our focus on NLP, let's do an annotated example of building a network in PyTorch using only affine maps and non-linearities.  We will also see how to compute a loss function, using PyTorch's built-in negative log likelihood.

All network components should inherit from nn.Module and override the forward() method.  That is about it, as far as the boilerplate is concerned.  Inheriting from nn.Module provides functionality to your component.  For example, it makes the component keep track of its trainable parameters, and it lets you move the component between CPU and GPU with the .cuda() and .cpu() functions.

Let's write an annotated example of a network that takes in a sparse bag-of-words representation and outputs a probability distribution over two labels: "English" and "Spanish".

Note: This is just for demonstration, so that we can build PyTorch components in later sections and you will know what is going on.  Handing in a sparse bag-of-words representation is not how you would actually want to do things.

Our model will map a sparse BoW representation to log probabilities over labels.  We assign each word in the vocab an index.  For example, say our entire vocab is two words "hello" and "world", with indices 0 and 1 respectively.
The BoW vector for the sentence "hello hello hello hello" is
$$ \left[ 4, 0 \right] $$
For "hello world world hello", it is 
$$ \left[ 2, 2 \right] $$
etc.
In general, it is
$$ \left[ \text{Count}(\text{hello}), \text{Count}(\text{world}) \right] $$
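The counting itself needs nothing beyond a list.  Here is a plain-Python sketch that reproduces the two vectors above; the helper name `make_bow_counts` is hypothetical and returns a Python list, not a tensor.

```python
def make_bow_counts(sentence, word_to_ix):
    """Count how often each vocab word occurs in a whitespace-split sentence."""
    vec = [0] * len(word_to_ix)
    for word in sentence.split():
        vec[word_to_ix[word]] += 1   # raises KeyError for out-of-vocab words
    return vec


word_to_ix = {"hello": 0, "world": 1}
print(make_bow_counts("hello hello hello hello", word_to_ix))   # [4, 0]
print(make_bow_counts("hello world world hello", word_to_ix))   # [2, 2]
```

Note that word order is thrown away entirely: any reordering of a sentence gives the same vector.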

Denote this BoW vector as $x$.
The output of our network is:
$$ \log \text{Softmax}(Ax + b) $$
That is, we pass the input through an affine map and then do log softmax.

In [4]:
data = [ ("me gusta comer en la cafeteria".split(), "SPANISH"),
         ("Give it to me".split(), "ENGLISH"),
         ("No creo que sea una buena idea".split(), "SPANISH"),
         ("No it is not a good idea to get lost at sea".split(), "ENGLISH") ]

test_data = [ ("Yo creo que si".split(), "SPANISH"),
              ("it is lost on me".split(), "ENGLISH")]

# word_to_ix maps each word in the vocab to a unique integer, which will be its
# index into the Bag of words vector
word_to_ix = {}
for sent, _ in data + test_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print word_to_ix

VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 2

{'en': 3, 'No': 9, 'buena': 14, 'it': 7, 'at': 22, 'sea': 12, 'cafeteria': 5, 'Yo': 23, 'la': 4, 'to': 8, 'creo': 10, 'is': 16, 'a': 18, 'good': 19, 'get': 20, 'idea': 15, 'que': 11, 'not': 17, 'me': 0, 'on': 25, 'gusta': 1, 'lost': 21, 'Give': 6, 'una': 13, 'si': 24, 'comer': 2}


In [5]:
class BoWClassifier(nn.Module): # inheriting from nn.Module!
    
    def __init__(self, num_labels, vocab_size):
        # calls the init function of nn.Module.  Don't get confused by syntax,
        # just always do it in an nn.Module
        super(BoWClassifier, self).__init__()
        
        # Define the parameters that you will need.  In this case, we need A and b,
        # the parameters of the affine mapping.
        # Torch defines nn.Linear(), which provides the affine map.
        # Make sure you understand why the input dimension is vocab_size
        # and the output is num_labels!
        self.linear = nn.Linear(vocab_size, num_labels)
        
        # NOTE! The non-linearity log softmax does not have parameters! So we don't need
        # to worry about that here
        
    def forward(self, bow_vec):
        # Pass the input through the linear layer,
        # then pass that through log_softmax.
        # Many non-linearities and other functions are in torch.nn.functional
        return F.log_softmax(self.linear(bow_vec))

In [6]:
def make_bow_vector(sentence, word_to_ix):
    vec = torch.Tensor([0] * len(word_to_ix))
    for word in sentence:
        vec[word_to_ix[word]] += 1
    return vec.view(1, -1)

def make_target(label, label_to_ix):
    return torch.LongTensor([label_to_ix[label]])

In [21]:
model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)

# the model knows its parameters.  The first output below is A, the second is b.
# Whenever you assign a component to a class variable in the __init__ function of a module,
# which was done with the line
# self.linear = nn.Linear(...)
# Then through some Python magic from the PyTorch devs, your module (in this case, BoWClassifier)
# will store knowledge of the nn.Linear's parameters
for param in model.parameters():
    print param

Parameter containing:

Columns 0 to 9 
 0.1320 -0.0388  0.0645 -0.1169 -0.1678 -0.1038  0.0251 -0.1059  0.0203 -0.1292
-0.0476  0.1450  0.1249  0.1534 -0.0181 -0.1762 -0.1807  0.0966 -0.0169  0.1473

Columns 10 to 19 
-0.1339 -0.0634  0.1013  0.0989  0.1273 -0.1592  0.1516  0.1461 -0.0260 -0.0570
 0.1585 -0.1099  0.0117  0.1641  0.0070  0.1021  0.1845 -0.1548  0.0414 -0.1394

Columns 20 to 25 
-0.1034  0.0926 -0.0168  0.1822 -0.1233 -0.1624
 0.1191  0.1816  0.0089  0.0151  0.1768 -0.1394
[torch.FloatTensor of size 2x26]

Parameter containing:
1.00000e-02 *
  0.9037
 -8.6429
[torch.FloatTensor of size 2]



In [22]:
# To run the model, pass in a BoW vector, but wrapped in an autograd.Variable
sample = data[0]
bow_vector = make_bow_vector(sample[0], word_to_ix)
log_probs = model(autograd.Variable(bow_vector))
print log_probs

Variable containing:
-0.8639 -0.5473
[torch.FloatTensor of size 1x2]



Which of the above values corresponds to the log probability of ENGLISH, and which to SPANISH?  We never defined it, but we need to if we want to train the thing.

In [23]:
label_to_ix = { "SPANISH": 0, "ENGLISH": 1 }

So let's train!  To do this, we pass instances through to get log probabilities, compute a loss function, compute the gradient of the loss function, and then update the parameters with a gradient step.  Loss functions are provided by Torch in the nn package.  nn.NLLLoss() is the negative log likelihood loss we want.  Torch also defines optimization functions in torch.optim.  Here, we will just use SGD.

In [27]:
# Run on test data before we train, just to see a before-and-after
for instance, label in test_data:
    bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))
    log_probs = model(bow_vec)
    print log_probs
print next(model.parameters())[:,word_to_ix["creo"]] # Print the matrix column corresponding to "creo"

Variable containing:
-0.8448 -0.5615
[torch.FloatTensor of size 1x2]

Variable containing:
-0.7299 -0.6577
[torch.FloatTensor of size 1x2]

Variable containing:
-0.1339
 0.1585
[torch.FloatTensor of size 2]



In [28]:
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Usually you want to pass over the training data several times.
# 100 epochs is far more than you would use on a real dataset, but real datasets
# have more than four instances.  Usually, somewhere between 5 and 30 epochs is reasonable.
for epoch in xrange(100):
    for instance, label in data:
        # Step 1. Remember that PyTorch accumulates gradients.  We need to clear them out
        # before each instance
        model.zero_grad()
    
        # Step 2. Make our BOW vector and also we must wrap the target in a Variable
        # as an integer.  For example, if the target is SPANISH, then we wrap the integer
        # 0.  The loss function then knows that the 0th element of the log probabilities is
        # the log probability corresponding to SPANISH
        bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))
        target = autograd.Variable(make_target(label, label_to_ix))
    
        # Step 3. Run our forward pass.
        log_probs = model(bow_vec)
    
        # Step 4. Compute the loss, gradients, and update the parameters by calling
        # optimizer.step()
        loss = loss_function(log_probs, target)
        loss.backward()
        optimizer.step()

In [30]:
for instance, label in test_data:
    bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))
    log_probs = model(bow_vec)
    print log_probs
print next(model.parameters())[:,word_to_ix["creo"]] # Index corresponding to Spanish goes up, English goes down!

Variable containing:
-0.1581 -1.9227
[torch.FloatTensor of size 1x2]

Variable containing:
-2.7471 -0.0663
[torch.FloatTensor of size 1x2]

Variable containing:
 0.3237
-0.2992
[torch.FloatTensor of size 2]



We got the right answer!  On the test data, the log probability for Spanish is much higher in the first example, and the log probability for English is much higher in the second, as it should be.

Now you see how to make a PyTorch component, pass some data through it and do gradient updates.
We are ready to dig deeper into what deep NLP has to offer.

# 6. Word Embeddings: Encoding Lexical Semantics

Word embeddings are dense vectors of real numbers, one per word in your vocabulary.

### An Example: N-Gram Language Modeling
Recall that in an n-gram language model, given a sequence of words $w$, we want to compute
$$ P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} ) $$
where $w_i$ is the ith word of the sequence.
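For contrast with the neural model built in this section, the classic way to estimate this probability is by counting.  The plain-Python sketch below is a maximum-likelihood trigram estimate (so $n = 3$, two words of context) with no smoothing.  It is *not* the approach this tutorial takes; the function name `trigram_mle` and the toy sentence are my own, for illustration only.

```python
from collections import Counter


def trigram_mle(tokens):
    """Return prob(word, context): a count-based estimate of P(word | context)."""
    context_counts = Counter()
    trigram_counts = Counter()
    for i in range(len(tokens) - 2):
        context = (tokens[i], tokens[i + 1])
        context_counts[context] += 1
        trigram_counts[context + (tokens[i + 2],)] += 1

    def prob(word, context):
        context = tuple(context)
        total = context_counts[context]
        if total == 0:
            return 0.0   # unseen context; a real model would smooth here
        return trigram_counts[context + (word,)] / float(total)

    return prob


tokens = "the cat sat on the mat the cat ran".split()
prob = trigram_mle(tokens)
print(prob("sat", ["the", "cat"]))   # 0.5
print(prob("ran", ["the", "cat"]))   # 0.5
print(prob("on", ["the", "cat"]))    # 0.0
```

A count-based model assigns zero probability to anything it has not seen and shares nothing between similar words, which is part of the motivation for learning dense embeddings instead.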

In this example, we will compute the loss function on some training examples and update the parameters with backpropagation.  We will expand on this example soon.

In [11]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [ ([test_sentence[i], test_sentence[i+1]], test_sentence[i+2]) for i in xrange(len(test_sentence) - 2) ]
print trigrams[:3] # print the first 3, just so you can see what they look like

[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]


In [3]:
vocab = set(test_sentence)
word_to_ix = { word: i for i, word in enumerate(vocab) }

In [10]:
class NGramLanguageModeler(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)
        
    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        log_probs = F.log_softmax(
                        self.linear2(
                            F.relu(
                                self.linear1(embeds))))
        return log_probs

In [19]:
losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in xrange(10):
    total_loss = torch.Tensor([0])
    for context, target in trigrams:
    
        # Step 1. Prepare the inputs to be passed to the model (i.e., turn the words
        # into integer indices and wrap them in variables)
        context_idxs = map(lambda w: word_to_ix[w], context)
        context_var = autograd.Variable( torch.LongTensor(context_idxs) )
    
        # Step 2. Recall that torch *accumulates* gradients.  Before passing in a new instance,
        # you need to zero out the gradients from the old instance
        model.zero_grad()
    
        # Step 3. Run the forward pass, getting log probabilities over next words
        log_probs = model(context_var)
    
        # Step 4. Compute your loss function. (Again, Torch wants the target word wrapped in a variable)
        loss = loss_function(log_probs, autograd.Variable(torch.LongTensor([word_to_ix[target]])))
    
        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()
    
        total_loss += loss.data
    losses.append(total_loss)
print losses # The loss decreased every iteration over the training data!

[
 519.3700
[torch.FloatTensor of size 1]
, 
 516.8912
[torch.FloatTensor of size 1]
, 
 514.4286
[torch.FloatTensor of size 1]
, 
 511.9796
[torch.FloatTensor of size 1]
, 
 509.5433
[torch.FloatTensor of size 1]
, 
 507.1210
[torch.FloatTensor of size 1]
, 
 504.7090
[torch.FloatTensor of size 1]
, 
 502.3077
[torch.FloatTensor of size 1]
, 
 499.9145
[torch.FloatTensor of size 1]
, 
 497.5307
[torch.FloatTensor of size 1]
]


# 7. Making Decisions

# 8. Sequence Models and Long Short-Term Memory Networks