# Introduction to PyTorch

PyTorch is a Python package for performing tensor computation, automatic differentiation, and dynamically defining neural networks. It makes it particularly easy to accelerate model training with a GPU. In recent years it has gained a large following in the NLP community.


## Installing PyTorch

Instructions for installing PyTorch can be found on the project's home page: <http://pytorch.org/>. The PyTorch developers recommend installing the library with the `conda` package manager (in my experience `pip` works fine as well).

Note that the package name differs depending on whether or not you intend to use a GPU. If you do plan on using a GPU, you will need to install CUDA and cuDNN before installing PyTorch. Detailed instructions can be found at NVIDIA's website: <https://docs.nvidia.com/cuda/>. At the time of writing, the following versions of CUDA are supported: 7.5, 8, and 9.


## PyTorch Basics

The PyTorch API is designed to very closely resemble NumPy. The central object for performing computation is the `Tensor`, which is PyTorch's version of NumPy's `array`.

In [1]:
import numpy as np
import torch

In [2]:
# Create a 3 x 2 array
np.ndarray((3, 2))

array([[6.92791565e-310, 1.17155217e-316],
       [0.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000]])

In [3]:
# Create a 3 x 2 Tensor
torch.Tensor(3, 2)


1.00000e-21 *
  1.1919  0.0000
  0.0000  0.0000
  0.0000  0.0000
[torch.FloatTensor of size 3x2]
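
Both calls above return uninitialized memory, which is why the printed values are arbitrary. When known contents are needed, a factory function is usually the safer choice. The following is a minimal sketch, assuming a PyTorch version that provides `.item()` (0.4 or later):

```python
import torch

zeros = torch.zeros(3, 2)  # every entry is 0
ones = torch.ones(3, 2)    # every entry is 1
rand = torch.rand(3, 2)    # entries drawn uniformly from [0, 1)

print(zeros.shape)
print(ones.sum().item())
print(rand.min().item(), rand.max().item())
```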

All of the basic arithmetic operations are supported.

In [4]:
a = torch.Tensor([1,2])
b = torch.Tensor([3,4])
print('a + b:', a + b)
print('a - b:', a - b)
print('a * b:', a * b)
print('a / b:', a / b)

a + b: 
 4
 6
[torch.FloatTensor of size 2]

a - b: 
-2
-2
[torch.FloatTensor of size 2]

a * b: 
 3
 8
[torch.FloatTensor of size 2]

a / b: 
 0.3333
 0.5000
[torch.FloatTensor of size 2]



Indexing and slicing also behave the same way as in NumPy.

In [5]:
a = torch.Tensor(5, 5)
print('a:', a)

# Slice using ranges
print('a[2:4, 3:4]', a[2:4, 3:4])

# Can count backwards using negative indices
print('a[:, -1]', a[:, -1])

# Skipping elements
print('a[::2, ::3]', a[::2, ::3])

a: 
1.00000e-21 *
  1.1919  0.0000  1.1919  0.0000  0.0000
  0.0000  0.0000  0.0000  0.0000  0.0000
  0.0000  0.0000  0.0000  0.0000  1.1919
  0.0000  0.0000  0.0000  0.0000  0.0000
  0.0000  0.0000  0.0000  0.0000  0.0000
[torch.FloatTensor of size 5x5]

a[2:4, 3:4] 
 0
 0
[torch.FloatTensor of size 2x1]

a[:, -1] 
1.00000e-21 *
  0.0000
  0.0000
  1.1919
  0.0000
  0.0000
[torch.FloatTensor of size 5]

a[::2, ::3] 
1.00000e-21 *
  1.1919  0.0000
  0.0000  0.0000
  0.0000  0.0000
[torch.FloatTensor of size 3x2]



Changing a `Tensor` to and from an `array` is also quite simple:

In [6]:
# Tensor from array
arr = np.array([1,2])
torch.from_numpy(arr)


 1
 2
[torch.LongTensor of size 2]

In [7]:
# Tensor to array
t = torch.Tensor([1, 2])
t.numpy()

array([1., 2.], dtype=float32)
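
One detail worth knowing is that these conversions share memory rather than copy it: `torch.from_numpy` returns a tensor backed by the array's buffer, and `Tensor.numpy()` on a CPU tensor returns an array backed by the tensor's storage. A small sketch to illustrate this (the variable names are only for illustration):

```python
import numpy as np
import torch

# Array -> Tensor: both objects view the same buffer.
arr = np.array([1, 2])
t = torch.from_numpy(arr)
arr[0] = 10       # modify the array in place
print(t)          # the tensor reflects the change

# Tensor -> array: the same holds in the other direction (CPU tensors only).
t2 = torch.zeros(2)
view = t2.numpy()
t2[1] = 5         # modify the tensor in place
print(view)       # the array reflects the change
```

If an independent copy is needed, calling `.clone()` on the tensor or `.copy()` on the array before modifying it avoids this coupling.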

Moving `Tensor`s to the GPU is also quite simple:

In [8]:
t = torch.Tensor([1, 2]) # on CPU
if torch.cuda.is_available():
    t = t.cuda() # on GPU
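
In PyTorch 0.4 and later, the same idea is often written with a `torch.device` object and the `.to()` method, which keeps the rest of the code identical on CPU and GPU. A minimal sketch, assuming such a version is installed:

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

t = torch.tensor([1.0, 2.0])  # created on the CPU
t = t.to(device)              # moved to the selected device (a no-op on CPU)
print(t.device)
```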

## Automatic Differentiation

Derivatives and gradients are critical to a large number of machine learning algorithms. One of the key benefits of 
PyTorch is that these can be computed automatically by `Variable` objects.

We'll demonstrate this using the following example. Suppose we have some data $x$ and $y$, and want to fit a model:
$$ \hat{y} = mx + b $$
by minimizing the loss function:
$$ L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2 $$


In [9]:
from torch.autograd import Variable

# Data
x = Variable(torch.Tensor([1,  2,  3,  4]))
y = Variable(torch.Tensor([0, -1, -2, -3]))

# Initialize the parameters
m = Variable(torch.rand(1), requires_grad=True)
b = Variable(torch.rand(1), requires_grad=True)

# Define function
y_hat = m * x + b

# Define loss
loss = torch.mean(0.5 * (y - y_hat)**2)

To obtain the gradient of $L$ w.r.t. $m$ and $b$ you need only run:

In [10]:
loss.backward() # Backprop the gradients of the loss w.r.t other variables

# Gradients
print('dL/dm: %0.4f' % m.grad.data)
print('dL/db: %0.4f' % b.grad.data)

dL/dm: 12.1886
dL/db: 4.0370
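
As a sanity check, these gradients can be compared against the ones derived by hand. For the mean loss used in the code, $\partial L / \partial m = -\frac{1}{n}\sum_i (y_i - \hat{y}_i)\,x_i$ and $\partial L / \partial b = -\frac{1}{n}\sum_i (y_i - \hat{y}_i)$. The sketch below assumes PyTorch 0.4 or later, where tensors carry `requires_grad` directly and no `Variable` wrapper is needed; the names `grad_m` and `grad_b` are only for illustration.

```python
import torch

torch.manual_seed(0)
x = torch.tensor([1.0, 2.0, 3.0, 4.0])
y = torch.tensor([0.0, -1.0, -2.0, -3.0])
m = torch.rand(1, requires_grad=True)
b = torch.rand(1, requires_grad=True)

# Gradients from automatic differentiation.
y_hat = m * x + b
loss = torch.mean(0.5 * (y - y_hat) ** 2)
loss.backward()

# Gradients from the hand-derived formulas.
with torch.no_grad():
    residual = y - (m * x + b)
    grad_m = -(residual * x).mean()
    grad_b = -residual.mean()

print('dL/dm: autograd %0.4f, by hand %0.4f' % (m.grad.item(), grad_m.item()))
print('dL/db: autograd %0.4f, by hand %0.4f' % (b.grad.item(), grad_b.item()))
```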


## Training Models

While automatic differentiation is in itself a useful feature, it can be quite tedious to keep track of all of the different parameters and gradients for more complicated models. In order to make life simple, PyTorch defines a `torch.nn.Module` class which handles all of these details for you. To paraphrase the PyTorch documentation, this is the base class for all neural network modules, and whenever you define a model it should be a subclass of this class.

Here is an example implementation of the simple linear model given above:

In [11]:
import torch.nn as nn

class LinearModel(nn.Module):
    
    def __init__(self):
        """This method is called when you instantiate a new LinearModel object.
        
        You should use it to define the parameters/layers of your model.
        """
        # Whenever you define a new nn.Module you should start the __init__()
        # method with the following line. Remember to replace `LinearModel` 
        # with whatever you are calling your model.
        super(LinearModel, self).__init__()
        
        # Now we define the parameters used by the model.
        self.m = torch.nn.Parameter(torch.rand(1))
        self.b = torch.nn.Parameter(torch.rand(1))
    
    def forward(self, x):
        """This method computes the output of the model.
        
        Args:
            x: The input data.
        """
        return self.m * x + self.b


# Example forward pass. Note that we use model(x) not model.forward(x) !!! 
model = LinearModel()
y_hat = model(x)

To train this model we need to pick an optimizer such as SGD, Adadelta, Adam, etc. There are many options in `torch.optim`. When initializing an optimizer, the first argument is the collection of variables you want optimized. To obtain all of the trainable parameters of a model you can call the `nn.Module.parameters()` method. For example, the following code initializes an SGD optimizer for the model defined above:

In [12]:
import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

Training is done in a loop. The general structure is:

1. Clear the gradients.
2. Evaluate the model.
3. Calculate the loss.
4. Backpropagate.
5. Perform an optimization step.
6. (Once in a while) Print monitoring metrics.

For example, we can train our linear model by running:

In [13]:
import time

for i in range(5001):
    optimizer.zero_grad()
    y_hat = model(x)
    loss = torch.mean(0.5 * (y - y_hat)**2)
    loss.backward()
    optimizer.step()
    if i % 1000 == 0:
        time.sleep(1) # DO NOT INCLUDE THIS IN YOUR CODE !!! Only for demo.
        print('Iteration %i - Loss: %0.6f' % (i, loss.data[0]),
              end='\r') # COOL TRICK: ` end='\r' ` makes print overwrite the current line.

Iteration 5000 - Loss: 0.000000

Observe that the final parameters are what we expect:

In [14]:
print('Final parameters:')
print('m: %0.2f' % model.m.data[0])
print('b: %0.2f' % model.b.data[0])

Final parameters:
m: -1.00
b: 1.00
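
The cells above use the `Variable` and `loss.data[0]` idioms of older PyTorch releases. For readers on PyTorch 0.4 or later, the same experiment can be sketched as follows, reading scalars with `.item()`. This is one possible rewrite under that version assumption, not the only way to structure it; the data is repeated so the block runs on its own.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
x = torch.tensor([1.0, 2.0, 3.0, 4.0])
y = torch.tensor([0.0, -1.0, -2.0, -3.0])


class LinearModel(nn.Module):

    def __init__(self):
        super().__init__()
        self.m = nn.Parameter(torch.rand(1))
        self.b = nn.Parameter(torch.rand(1))

    def forward(self, x):
        return self.m * x + self.b


model = LinearModel()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

for i in range(5001):
    optimizer.zero_grad()                          # 1. clear the gradients
    y_hat = model(x)                               # 2. evaluate the model
    loss = torch.mean(0.5 * (y - y_hat) ** 2)      # 3. calculate the loss
    loss.backward()                                # 4. backpropagate
    optimizer.step()                               # 5. optimization step

print('Loss: %0.6f' % loss.item())
print('m: %0.2f' % model.m.item())
print('b: %0.2f' % model.b.item())
```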


# CASE STUDY: Word2Vec

Now let's dive into an example that is more relevant to NLP, Word2Vec!
The idea of Word2Vec is to create continuous vector representations of words using shallow neural networks, so that words with similar meanings end up close together in the vector space.
First introduced by [Mikolov et al. 2013](https://arxiv.org/abs/1301.3781), these models greatly improved the state of the art in measuring the semantic and syntactic similarity of words.
In addition, the learned word embeddings end up being useful for many other downstream tasks such as language modeling, machine translation, and automatic image captioning.

In this notebook, we will go over the Skip-Gram model, which has the following architecture:
![Skip-Gram architecture (Mikolov et al.)](img/skip-gram.png)
A good mathematical description of the model is provided by
[Goldberg and Levy 2014](https://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf).
Here are the cliff notes:

- We are given a corpus of words $w$ and their contexts $c$.
- The goal is to maximize $p(c|w)$ for all words and contexts in the corpus.
- The model looks like:
    $$ p(c|w) = \frac{e^{v_c \cdot v_w}}{\sum_{c'}e^{v_{c'} \cdot v_w}} $$
    where $v_c, v_w \in \mathbb{R}^k$ are the word embeddings of $c$ and $w$.
- We treat $w$ and $c$ as coming from different vocabularies. Thus the same word will have a different embedding depending on whether it appears as the target word $w$ or as a context $c$.
- To speed up training we can use negative sampling. The objective function we want to maximize looks like:
    $$ J = \sum_{(w, c) \in D}\log\sigma(v_c \cdot v_w) + \sum_{(w, c) \in D'}\log\sigma(-v_c \cdot v_w) $$
    where $D$ is the set of word-context pairs that are observed in the corpus and $D'$ is the set of word-context pairs that are not observed.
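
To make the negative-sampling objective concrete, here is a small NumPy sketch that evaluates $J$ for a single word with a handful of random embeddings. The vectors, the dimension `k = 4`, and the helper `log_sigmoid` are assumptions made purely for illustration; they are not part of the model code later in this notebook.

```python
import numpy as np


def log_sigmoid(z):
    """Computes log(sigma(z)) = -log(1 + exp(-z)) in a numerically stable way."""
    return -np.logaddexp(0.0, -z)


rng = np.random.default_rng(0)
k = 4
v_w = rng.normal(size=k)           # embedding of the word w
v_c_pos = rng.normal(size=(3, k))  # contexts observed with w (pairs in D)
v_c_neg = rng.normal(size=(3, k))  # contexts not observed with w (pairs in D')

# First sum: observed pairs. Second sum: unobserved pairs, with the sign flipped.
J = log_sigmoid(v_c_pos @ v_w).sum() + log_sigmoid(-(v_c_neg @ v_w)).sum()
print('J:', J)
```

Since every term is the logarithm of a probability, $J$ is never positive; training pushes it towards zero by making observed pairs score high and unobserved pairs score low.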
    
## Dataset

To start, we'll need some data to train on. To keep things brief we'll just use shell scripts:

In [15]:
!wget http://mattmahoney.net/dc/text8.zip
!unzip text8.zip

--2018-02-05 23:18:16--  http://mattmahoney.net/dc/text8.zip
Resolving mattmahoney.net (mattmahoney.net)... 67.195.197.75
Connecting to mattmahoney.net (mattmahoney.net)|67.195.197.75|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31344016 (30M) [application/zip]
Saving to: ‘text8.zip.1’

text8.zip.1           8%[>                   ]   2.67M   524KB/s    eta 55s    ^C
Archive:  text8.zip
replace text8? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C


Next, we need to get our data into Python and in a form that is usable by PyTorch. For text data this typically entails building a vocabulary of all of the words, sorting the vocabulary in terms of frequency, and then mapping words to integers corresponding to their place in the sorted vocabulary. This can be done as follows:

In [34]:
# Read in the data and split on spaces.
# NOTE: You will probably need to perform a more advanced tokenization method for other datasets.
with open('text8', 'r') as f:
    corpus = f.read().split()

In [78]:
from collections import Counter

def build_vocabulary(corpus):
    """Builds a vocabulary.
    
    Args:
        corpus: A list of words.
    """
    counts = Counter(corpus) # Count the word occurrences
    counts = counts.most_common() # Sort (word, count) pairs from most to least frequent
    reverse_vocab = [x[0] for x in counts] # Use a list to map indices to words
    probs = np.array([x[1] for x in counts])
    probs = probs**0.75 / np.sum(probs**0.75) # Negative-sampling distribution: counts raised to the 0.75 power, normalized
    vocab = {x: i for i, x in enumerate(reverse_vocab)} # Invert that mapping to get the vocabulary
    data = [vocab[x] for x in corpus] # Get ids for all words in the corpus (no word is filtered out, so every lookup succeeds)
    return data, vocab, reverse_vocab, probs

data, vocab, reverse_vocab, probs = build_vocabulary(corpus)

It is worth inspecting the output of that function before moving on.
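
One way to see what each return value contains is to run the same steps on a toy corpus. The sketch below mirrors the steps of `build_vocabulary` on a made-up eight-word corpus, with the vocabulary explicitly sorted by frequency via `Counter.most_common()`; the corpus itself is an assumption chosen only for illustration.

```python
from collections import Counter

import numpy as np

toy_corpus = 'the cat sat on the mat the end'.split()

counts = Counter(toy_corpus).most_common()        # (word, count), most frequent first
reverse_vocab = [word for word, _ in counts]      # id -> word
vocab = {word: i for i, word in enumerate(reverse_vocab)}  # word -> id
freqs = np.array([count for _, count in counts], dtype=float)
probs = freqs ** 0.75 / np.sum(freqs ** 0.75)     # negative-sampling distribution
data = [vocab[word] for word in toy_corpus]       # corpus as a list of ids

print('vocab:', vocab)
print('data:', data)
print('probs:', probs.round(3))
print('decoded:', [reverse_vocab[i] for i in data])
```

The most frequent word receives id 0, and decoding `data` through `reverse_vocab` recovers the original corpus.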

We now need a function to generate batches of word-context pairs from the corpus.

In [81]:
from collections import deque
from random import shuffle
from torch.autograd import Variable


skip_window = 1
batch_size = 4


def batch_generator(data, vocab, probs):

    # Stores the positions of the words in the order in which they will be visited when generating batches.
    # e.g. the first word that will be chosen in an epoch is data[ids[0]]
    ids = list(range(skip_window, len(data) - skip_window))

    while True:
        shuffle(ids) # Randomize at the start of each epoch
        pos_sample_queue = deque()
        for id in ids: # Iterate over random words
            
            for i in range(1, skip_window + 1): # Iterate over window sizes
                # Get the word and contexts i words away on left and right
                w = data[id]
                c_left = data[id - i]
                c_right = data[id + i]
                pos_sample_queue.append((w, c_left))
                pos_sample_queue.append((w, c_right))
            
            # Once the positive sample queue holds enough pairs, pop a batch and draw negative samples
            if len(pos_sample_queue) >= batch_size:
                
                batch = [pos_sample_queue.pop() for _ in range(batch_size)]
                w, c_pos = zip(*batch) # Separate words and contexts
                c_neg = np.random.choice(len(vocab), batch_size, p=probs)
                
                # Read data into torch variables
                w = Variable(torch.LongTensor(w))
                c_pos = Variable(torch.LongTensor(c_pos))
                c_neg = Variable(torch.LongTensor(c_neg))
                if torch.cuda.is_available():
                    w = w.cuda()
                    c_pos = c_pos.cuda()
                    c_neg = c_neg.cuda()
                yield w, c_pos, c_neg


Here's an example of the `batch_generator` in action...

In [83]:
it = batch_generator(data, vocab, probs) # Create the generator
w, c_pos, c_neg = next(it) # next() will run until yield
print('w:', w)
print('c_pos:', c_pos)
print('c_neg:', c_neg)

w: Variable containing:
 1956
 1956
 2166
 2166
[torch.LongTensor of size 4]

c_pos: Variable containing:
 1975
    3
 6070
   24
[torch.LongTensor of size 4]

c_neg: Variable containing:
   681
  5374
 11012
   198
[torch.LongTensor of size 4]
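
To see which word-context pairs a given window size produces, it can help to strip away the batching and sampling. The following pure-Python sketch uses a hypothetical helper, `skipgram_pairs`, that applies the same pairing rule as the inner loop of `batch_generator` to a short made-up sentence:

```python
def skipgram_pairs(ids, skip_window):
    """Yields (word, context) pairs for every position that has a full window."""
    for pos in range(skip_window, len(ids) - skip_window):
        for offset in range(1, skip_window + 1):
            yield ids[pos], ids[pos - offset]  # context to the left
            yield ids[pos], ids[pos + offset]  # context to the right


sentence = ['the', 'quick', 'brown', 'fox', 'jumps']
pairs = list(skipgram_pairs(sentence, skip_window=1))
for word, context in pairs:
    print(word, '->', context)
```

Words at the very start and end of the sequence never appear as the target word, because they lack a full window on one side.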



## Model

Now that we can read in the data, it is time to build our model.

In [100]:
import torch.nn as nn
import torch.nn.functional as F

class SkipGram(nn.Module):
    
    def __init__(self, vocab_size, embedding_size):
        
        # Remember you always need this line !!!
        super(SkipGram, self).__init__()
        
        # Create the embedding layers
        self.w_embeddings = nn.Embedding(vocab_size, embedding_size)
        self.c_embeddings = nn.Embedding(vocab_size, embedding_size)
        
        # Initialize the embedding layers
        initrange = 0.5 / embedding_size
        self.w_embeddings.weight.data.uniform_(-initrange, initrange)
        self.c_embeddings.weight.data.uniform_(-0, 0) # i.e. the context embeddings start at zero
        
    def forward(self, w, c_pos, c_neg):
        """Returns the negated objective, i.e. the loss to be minimized."""
        # Word embeddings
        v_w = self.w_embeddings(w)

        # Positive context sample term
        v_c_pos = self.c_embeddings(c_pos)
        dot_prods_pos = torch.sum(v_w * v_c_pos, dim=1)
        J_pos = F.logsigmoid(dot_prods_pos)
        
        # Negative sample term
        v_c_neg = self.c_embeddings(c_neg)
        dot_prods_neg = torch.sum(v_w * v_c_neg, dim=1)
        J_neg = F.logsigmoid(-1 * dot_prods_neg)
        
        # Negate the objective so that minimizing the returned value maximizes J
        J = -1*(torch.sum(J_pos) + torch.sum(J_neg))
        return J
        
    def input_embedding(self, w):
        """Returns the embedding of w."""
        return self.w_embeddings(w)
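
A quick sanity check for a model like this is its loss before any training. Because the context embeddings start at zero, every dot product is zero, each log-sigmoid term equals $\log\frac{1}{2}$, and the loss on a batch of size $B$ is $2B\log 2$ (about 177.4 for $B = 128$). The sketch below reproduces the forward computation with standalone embedding layers; the vocabulary size, embedding size, and random ids are assumptions for illustration, and `nn.init.zeros_` and `.item()` assume a recent PyTorch version.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embedding_size, batch_size = 50, 8, 128

w_embeddings = nn.Embedding(vocab_size, embedding_size)
c_embeddings = nn.Embedding(vocab_size, embedding_size)
nn.init.zeros_(c_embeddings.weight)  # context embeddings start at zero

# Random word and context ids stand in for a real batch.
w = torch.randint(0, vocab_size, (batch_size,))
c_pos = torch.randint(0, vocab_size, (batch_size,))
c_neg = torch.randint(0, vocab_size, (batch_size,))

# Same computation as the forward pass: row-wise dot products, then log-sigmoid.
dot_prods_pos = torch.sum(w_embeddings(w) * c_embeddings(c_pos), dim=1)
dot_prods_neg = torch.sum(w_embeddings(w) * c_embeddings(c_neg), dim=1)
loss = -(F.logsigmoid(dot_prods_pos).sum() + F.logsigmoid(-dot_prods_neg).sum())

expected = 2 * batch_size * math.log(2)
print('initial loss: %0.4f, expected: %0.4f' % (loss.item(), expected))
```

A loss near this value at the first few iterations suggests the forward pass is wired correctly; it should then decrease as training proceeds.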


## Training

In [None]:
from torch import optim

n_iterations = 100000
batch_size = 128
it = batch_generator(data, vocab, probs)

vocab_size = len(vocab)
embedding_size = 200
model = SkipGram(vocab_size, embedding_size)
if torch.cuda.is_available():
    model = model.cuda()

optimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9)

for i in range(n_iterations):
    optimizer.zero_grad()
    w, c_pos, c_neg = next(it)
    loss = model(w, c_pos, c_neg)
    loss.backward()
    optimizer.step()
    if (i % 10) == 0:
        print('Iteration %i - Loss: %0.6f' % (i, loss.data[0]), end='\r')

Iteration 240 - Loss: 170.029297