In [1]:
%matplotlib inline


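# Deep Learning with PyTorch #

## Deep Learning Building Blocks: Affine Maps, Non-Linearities, and Objectives ##

Deep learning consists of composing linearities with non-linearities in clever ways. Introducing non-linearities is what makes deep learning models powerful. In this section, we use these building blocks to define an objective function and see how a deep learning model is trained with it.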


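## Affine Maps ##

One of the most useful tools in deep learning is the **affine map**: a function $f(x)$ of the form

\begin{align}f(x) = Ax + b\end{align}

where $A$ is a matrix and $x, b$ are vectors. The parameters to be learned are $A$ and $b$. The vector $b$ is usually called the *bias* term.

PyTorch and other deep learning frameworks do things a little differently from traditional linear algebra. Traditional linear algebra usually maps column vectors, whereas deep learning frameworks usually map the rows of the input: the i-th row of the output is the image of the i-th row of the input under $A$, plus the bias term.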


In [1]:
# Author: Robert Guthrie
# Translation: Sweetcocoa

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x1c5b3f243f0>

In [2]:
lin = nn.Linear(5, 3)  # A map from R^5 to R^3. It holds the parameters A and b.

data = autograd.Variable(torch.randn(2, 5))
print(lin(data))

Variable containing:
 0.3130  0.2576  1.3546
 1.0007  0.6433  0.4951
[torch.FloatTensor of size 2x3]
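The row convention can also be seen without PyTorch. The sketch below is plain Python with a made-up matrix `A`, bias `b`, and batch (none of these values come from the cell above). It applies $Ax + b$ to each row of the batch, which mirrors how the `nn.Linear(5, 3)` call above turns a 2x5 input into a 2x3 output.

```python
# Plain-Python sketch of the row convention: every input row x is mapped to A x + b.
# A, b, and the batch are made-up values for illustration only.

def affine(A, b, x):
    """Return A x + b for a matrix A (list of rows) and vectors x, b."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(A, b)]

A = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]      # maps R^3 to R^2
b = [0.5, -0.5]

batch = [[1.0, 2.0, 3.0],   # two inputs, one per row
         [0.0, 0.0, 0.0]]

# One output row per input row: a 2x3 batch becomes a 2x2 batch.
outputs = [affine(A, b, row) for row in batch]
print(outputs)
```

Note that the all-zero input row is mapped to the bias `b` itself, which is what distinguishes an affine map from a purely linear one.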



## Non-Linearities ##

Why do we need non-linearities? Suppose we have two affine maps, $f(x) = Ax + b$ and $g(x) = Cx + d$, and compose them into $f(g(x))$. We get

\begin{align}f(g(x)) = A(Cx + d) + b = ACx + (Ad + b)\end{align}

$AC$ is a single matrix and $Ad + b$ is a single vector. So composing affine maps gives nothing beyond a single affine map.
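The identity above is easy to check numerically. The following is a minimal plain-Python sketch; the matrices `A`, `C` and vectors `b`, `d`, `x` are arbitrary values chosen for illustration.

```python
# Check that f(g(x)) equals the single affine map with matrix AC and bias Ad + b.
# All numeric values below are arbitrary illustrations.

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))]
            for i in range(len(M))]

def affine(M, c, v):
    return [s + t for s, t in zip(matvec(M, v), c)]

A = [[1.0, 2.0], [0.0, 1.0]]
b = [1.0, -1.0]
C = [[2.0, 0.0], [1.0, 3.0]]
d = [0.5, 0.0]
x = [1.0, 2.0]

composed = affine(A, b, affine(C, d, x))          # f(g(x))

AC = matmul(A, C)                                 # combined matrix
Ad_plus_b = [s + t for s, t in zip(matvec(A, d), b)]  # combined bias
single = affine(AC, Ad_plus_b, x)                 # one affine map

print(composed, single)
```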

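But what if we insert a non-linearity between the two affine maps? Then the composition is no longer a single affine map, and the resulting model is more expressive. The most widely used non-linearities are $\tanh(x), \sigma(x), \text{ReLU}(x)$: the hyperbolic tangent, the sigmoid, and the rectified linear unit, respectively.

You may be wondering why these functions are used, out of all the non-linear functions available. The main reason, in my view, is that their derivatives are easy to compute. Computing gradients is an essential operation in learning, so non-linearities with this property are a good fit.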

The sigmoid, for example, has the convenient derivative

\begin{align}\frac{d\sigma}{dx} = \sigma(x)(1 - \sigma(x))\end{align}

which can be computed directly from the value of $\sigma(x)$ itself.
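As a quick sanity check of the sigmoid derivative formula, the sketch below compares it against a central finite difference. It is plain Python; the helper names (`sigmoid_grad`, `numeric_grad`) and the step size `h` are choices made for this illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Closed form: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_grad(f, x, h=1e-6):
    # Central finite difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 3.0):
    print(x, sigmoid_grad(x), numeric_grad(sigmoid, x))
```

The two columns should agree to several decimal places at each point.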

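*A quick note*: using $\sigma(x)$ as the non-linearity in a neural network is fine in theory, but in practice it suffers from the *vanishing gradient* problem. Its derivative is at most $1/4$ (at $x = 0$) and approaches zero as $|x|$ grows, and small gradients make learning hard. For this reason the sigmoid is often avoided as a non-linearity, and most people use tanh, ReLU, or one of their variants instead.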
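To get a feel for the sizes involved, the sketch below prints the derivatives of the three non-linearities at a few points. It is plain Python with helper names chosen for this illustration, and it treats the ReLU derivative at 0 as 0, which is a convention rather than something the text specifies.

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    # Convention for this sketch: derivative at x == 0 is taken to be 0.
    return 1.0 if x > 0 else 0.0

for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.2e}  "
          f"tanh'={tanh_grad(x):.2e}  relu'={relu_grad(x):.1f}")

# Upper bound on the gradient factor contributed by 10 stacked sigmoids,
# ignoring the weights in between: each factor is at most 0.25.
print(0.25 ** 10)
```

The sigmoid derivative never exceeds 0.25, so the factors multiplied together during backpropagation shrink quickly with depth. tanh also saturates for large inputs, but its derivative reaches 1 near zero, and the ReLU derivative stays at 1 for every positive input.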




In [3]:
# In PyTorch, most non-linearities live in torch.nn.functional (imported here as F).
# Unlike affine maps, these non-linearities have no parameters such as weights,
# so there is nothing in them to update during training.
data = autograd.Variable(torch.randn(2, 2))
print(data)
print(F.relu(data))

Variable containing:
 1.2182  0.2117
-1.0613 -1.9441
[torch.FloatTensor of size 2x2]

Variable containing:
 1.2182  0.2117
 0.0000  0.0000
[torch.FloatTensor of size 2x2]



## Softmax and Probabilities ##

The function $\text{Softmax}(x)$ is also just a non-linearity, but
it is special in that it usually is the last operation done in a
network. This is because it takes in a vector of real numbers and
returns a probability distribution. Its definition is as follows. Let
$x$ be a vector of real numbers (positive, negative, whatever,
there are no constraints). Then the i'th component of
$\text{Softmax}(x)$ is

\begin{align}\frac{\exp(x_i)}{\sum_j \exp(x_j)}\end{align}

It should be clear that the output is a probability distribution: each
element is non-negative and the sum over all components is 1.

You could also think of it as just applying an element-wise
exponentiation operator to the input to make everything non-negative and
then dividing by the normalization constant.
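The definition translates almost directly into code. The following is a plain-Python sketch, not the PyTorch implementation; subtracting the maximum before exponentiating is a common numerical-stability trick added here as an assumption, and it does not change the result because the factor cancels in the ratio.

```python
import math

def softmax(xs):
    # Shift by the max so exp() cannot overflow; the shift cancels in the ratio.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]   # element-wise exponentiation
    total = sum(exps)                      # normalization constant
    return [e / total for e in exps]

probs = softmax([2.0, -1.0, 0.5])
print(probs)
print(sum(probs))
```

Every component is positive and the components sum to 1 (up to floating-point rounding), and the largest input receives the largest probability.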




In [None]:
# Softmax is also in torch.nn.functional
data = autograd.Variable(torch.randn(5))
print(data)
# data is 1-D, so dim=0 normalizes over its only dimension
print(F.softmax(data, dim=0))
print(F.softmax(data, dim=0).sum())  # Sums to 1 because it is a distribution!
print(F.log_softmax(data, dim=0))  # there's also log_softmax

## Objective Functions ##

The objective function is the function that your network is being
trained to minimize (in which case it is often called a *loss function*
or *cost function*). This proceeds by first choosing a training
instance, running it through your neural network, and then computing the
loss of the output. The parameters of the model are then updated by
taking the derivative of the loss function. Intuitively, if your model
is completely confident in its answer, and its answer is wrong, your
loss will be high. If it is very confident in its answer, and its answer
is correct, the loss will be low.

The idea behind minimizing the loss function on your training examples
is that your network will hopefully generalize well and have small loss
on unseen examples in your dev set, test set, or in production. An
example loss function is the *negative log likelihood loss*, which is a
very common objective for multi-class classification. For supervised
multi-class classification, this means training the network to minimize
the negative log probability of the correct output (or equivalently,
maximize the log probability of the correct output).
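The intuition about confident answers can be made concrete with a few numbers. This is a plain-Python sketch with made-up two-class probability vectors; `nll` is a helper defined here, not a PyTorch function.

```python
import math

def nll(probs, target):
    """Negative log probability assigned to the correct class."""
    return -math.log(probs[target])

# The correct class is index 0 in all three cases (made-up predictions).
confident_right = nll([0.99, 0.01], 0)
unsure = nll([0.50, 0.50], 0)
confident_wrong = nll([0.01, 0.99], 0)

print(confident_right, unsure, confident_wrong)
```

The loss is close to zero when the model is confident and right, and it grows without bound as the probability assigned to the correct class approaches zero.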




## Optimization and Training ##

So we can compute a loss function for an instance. What do we do
with that? We saw earlier that an autograd.Variable knows how to compute
gradients with respect to the things that were used to compute it. Well,
since our loss is an autograd.Variable, we can compute gradients with
respect to all of the parameters used to compute it! Then we can perform
standard gradient updates. Let $\theta$ be our parameters,
$L(\theta)$ the loss function, and $\eta$ a positive
learning rate. Then:

\begin{align}\theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta L(\theta^{(t)})\end{align}
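The update rule can be tried on a toy problem. The sketch below is plain Python and assumes a one-parameter loss $L(\theta) = (\theta - 3)^2$, a starting point of 0, and a learning rate of 0.1, all chosen for illustration.

```python
def loss(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    # Derivative of (theta - 3)^2 with respect to theta
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial parameter (arbitrary)
eta = 0.1     # learning rate (arbitrary, positive)

for step in range(50):
    # theta^(t+1) = theta^(t) - eta * gradient at theta^(t)
    theta = theta - eta * grad(theta)

print(theta, loss(theta))
```

Each step moves `theta` against the gradient, so it approaches the minimizer at 3 and the loss approaches 0.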

There is a huge collection of algorithms, and active research, that
attempt to do something more than just this vanilla gradient update.
Many attempt to vary the learning rate based on what is happening at
train time. You don't need to worry about what specifically these
algorithms are doing unless you are really interested. Torch provides
many in the torch.optim package, and they are all used in the same way:
using the simplest gradient update looks the same in code as using the
more complicated algorithms. Trying different update algorithms and
different parameters for the update algorithms (like different initial
learning rates) is important in optimizing your network's performance.
Often, just replacing vanilla SGD with an optimizer like Adam or RMSProp
will boost performance noticeably.
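The idea of swapping the update rule while leaving the training loop untouched can be sketched in plain Python. The two classes below are hypothetical stand-ins written for this illustration, not the torch.optim API, and momentum is used in place of Adam or RMSProp because it is the shortest to write down. The toy loss $(\theta - 3)^2$ and the hyperparameters are also assumptions.

```python
def grad(theta):
    # Gradient of the toy loss (theta - 3)^2
    return 2.0 * (theta - 3.0)

class SGD:
    """Vanilla gradient step."""
    def __init__(self, lr):
        self.lr = lr

    def step(self, theta, g):
        return theta - self.lr * g

class Momentum:
    """Gradient step with a running velocity (heavy-ball momentum)."""
    def __init__(self, lr, beta=0.9):
        self.lr = lr
        self.beta = beta
        self.velocity = 0.0

    def step(self, theta, g):
        self.velocity = self.beta * self.velocity + g
        return theta - self.lr * self.velocity

def train(optimizer, steps=200):
    # The loop is identical no matter which update rule is plugged in.
    theta = 0.0
    for _ in range(steps):
        theta = optimizer.step(theta, grad(theta))
    return theta

theta_sgd = train(SGD(lr=0.01))
theta_momentum = train(Momentum(lr=0.01))
print(abs(theta_sgd - 3.0), abs(theta_momentum - 3.0))
```

On this toy problem, with the same learning rate and number of steps, the momentum variant ends up closer to the minimizer. That is an observation about this example only, not a general guarantee.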




## Creating Network Components in PyTorch ##

Before we move on to our focus on NLP, let's do an annotated example of
building a network in PyTorch using only affine maps and
non-linearities. We will also see how to compute a loss function, using
PyTorch's built-in negative log likelihood, and update parameters by
backpropagation.

All network components should inherit from nn.Module and override the
forward() method. That is about it, as far as the boilerplate is
concerned. Inheriting from nn.Module provides functionality to your
component. For example, it keeps track of the component's trainable
parameters, and it lets you move the component between CPU and GPU with
the .cuda() or .cpu() functions.

Let's write an annotated example of a network that takes in a sparse
bag-of-words representation and outputs a probability distribution over
two labels: "English" and "Spanish". This model is just logistic
regression.




## Example: Logistic Regression Bag-of-Words classifier ##

Our model will map a sparse BoW representation to log probabilities over
labels. We assign each word in the vocab an index. For example, say our
entire vocab is two words "hello" and "world", with indices 0 and 1
respectively. The BoW vector for the sentence "hello hello hello hello"
is

\begin{align}\left[ 4, 0 \right]\end{align}

For "hello world world hello", it is

\begin{align}\left[ 2, 2 \right]\end{align}

etc. In general, it is

\begin{align}\left[ \text{Count}(\text{hello}), \text{Count}(\text{world}) \right]\end{align}
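The counting scheme can be written out in a few lines. This plain-Python sketch uses the two-word vocabulary from the text; the helper name `make_bow_counts` is made up here and returns a plain list rather than a tensor.

```python
def make_bow_counts(sentence, word_to_ix):
    """Count how often each vocabulary word occurs in a tokenized sentence."""
    vec = [0] * len(word_to_ix)
    for word in sentence:
        vec[word_to_ix[word]] += 1
    return vec

word_to_ix = {"hello": 0, "world": 1}

first = make_bow_counts("hello hello hello hello".split(), word_to_ix)
second = make_bow_counts("hello world world hello".split(), word_to_ix)
print(first)   # [4, 0]
print(second)  # [2, 2]
```

Word order is discarded: any sentence with two occurrences of each word maps to the same vector.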

Denote this BoW vector as $x$. The output of our network is:

\begin{align}\log \text{Softmax}(Ax + b)\end{align}

That is, we pass the input through an affine map and then do log
softmax.
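As a worked instance of $\log \text{Softmax}(Ax + b)$, the sketch below pushes the BoW vector $[4, 0]$ through a made-up affine map and a plain-Python log softmax. The values of `A` and `b` are hypothetical, not learned parameters, and the log-sum-exp shift by the maximum is a stability detail assumed for this illustration.

```python
import math

def affine(A, b, x):
    return [sum(a * v for a, v in zip(row, x)) + b_i
            for row, b_i in zip(A, b)]

def log_softmax(scores):
    # log softmax(s)_i = s_i - log(sum_j exp(s_j)), computed with a max shift
    m = max(scores)
    log_total = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_total for s in scores]

# Hypothetical parameters: one row of A per label, one column per vocab word.
A = [[0.5, -0.5],
     [-0.5, 0.5]]
b = [0.0, 0.0]

x = [4, 0]   # BoW vector for "hello hello hello hello"
log_probs = log_softmax(affine(A, b, x))
print(log_probs)
```

The outputs are log probabilities, so they are all at most zero, and exponentiating them gives values that sum to 1.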




In [None]:
data = [("me gusta comer en la cafeteria".split(), "SPANISH"),
        ("Give it to me".split(), "ENGLISH"),
        ("No creo que sea una buena idea".split(), "SPANISH"),
        ("No it is not a good idea to get lost at sea".split(), "ENGLISH")]

test_data = [("Yo creo que si".split(), "SPANISH"),
             ("it is lost on me".split(), "ENGLISH")]

# word_to_ix maps each word in the vocab to a unique integer, which will be its
# index into the Bag of words vector
word_to_ix = {}
for sent, _ in data + test_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)

VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 2


class BoWClassifier(nn.Module):  # inheriting from nn.Module!

    def __init__(self, num_labels, vocab_size):
        # Calls the init function of nn.Module. Don't get confused by the syntax;
        # just always do it in an nn.Module
        super(BoWClassifier, self).__init__()

        # Define the parameters that you will need.  In this case, we need A and b,
        # the parameters of the affine mapping.
        # Torch defines nn.Linear(), which provides the affine map.
        # Make sure you understand why the input dimension is vocab_size
        # and the output is num_labels!
        self.linear = nn.Linear(vocab_size, num_labels)

        # NOTE! The non-linearity log softmax does not have parameters! So we don't need
        # to worry about that here

    def forward(self, bow_vec):
        # Pass the input through the linear layer,
        # then pass that through log_softmax.
        # Many non-linearities and other functions are in torch.nn.functional.
        # bow_vec has shape (1, vocab_size), so dim=1 normalizes over the labels.
        return F.log_softmax(self.linear(bow_vec), dim=1)


def make_bow_vector(sentence, word_to_ix):
    vec = torch.zeros(len(word_to_ix))
    for word in sentence:
        vec[word_to_ix[word]] += 1
    return vec.view(1, -1)


def make_target(label, label_to_ix):
    return torch.LongTensor([label_to_ix[label]])


model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)

# the model knows its parameters.  The first output below is A, the second is b.
# Whenever you assign a component to a class variable in the __init__ function
# of a module, which was done with the line
# self.linear = nn.Linear(...)
# Then through some Python magic from the PyTorch devs, your module
# (in this case, BoWClassifier) will store knowledge of the nn.Linear's parameters
for param in model.parameters():
    print(param)

# To run the model, pass in a BoW vector, but wrapped in an autograd.Variable
sample = data[0]
bow_vector = make_bow_vector(sample[0], word_to_ix)
log_probs = model(autograd.Variable(bow_vector))
print(log_probs)

Which of the above values corresponds to the log probability of ENGLISH,
and which to SPANISH? We never defined it, but we need to if we want to
train the thing.




In [None]:
label_to_ix = {"SPANISH": 0, "ENGLISH": 1}

So let's train! To do this, we pass instances through to get log
probabilities, compute a loss function, compute the gradient of the loss
function, and then update the parameters with a gradient step. Loss
functions are provided by Torch in the nn package. nn.NLLLoss() is the
negative log likelihood loss we want. Torch also defines optimization
functions in torch.optim. Here, we will just use SGD.

Note that the *input* to NLLLoss is a vector of log probabilities, and a
target label. It doesn't compute the log probabilities for us. This is
why the last layer of our network is log softmax. The loss function
nn.CrossEntropyLoss() is the same as NLLLoss(), except it does the log
softmax for you.




In [None]:
# Run on test data before we train, just to see a before-and-after
for instance, label in test_data:
    bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))
    log_probs = model(bow_vec)
    print(log_probs)

# Print the matrix column corresponding to "creo"
print(next(model.parameters())[:, word_to_ix["creo"]])

loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Usually you want to pass over the training data several times.
# 100 is much bigger than on a real data set, but real datasets have more than
# two instances.  Usually, somewhere between 5 and 30 epochs is reasonable.
for epoch in range(100):
    for instance, label in data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Make our BOW vector and also we must wrap the target in a
        # Variable as an integer. For example, if the target is SPANISH, then
        # we wrap the integer 0. The loss function then knows that the 0th
        # element of the log probabilities is the log probability
        # corresponding to SPANISH
        bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))
        target = autograd.Variable(make_target(label, label_to_ix))

        # Step 3. Run our forward pass.
        log_probs = model(bow_vec)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss = loss_function(log_probs, target)
        loss.backward()
        optimizer.step()

for instance, label in test_data:
    bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))
    log_probs = model(bow_vec)
    print(log_probs)

# Index corresponding to Spanish goes up, English goes down!
print(next(model.parameters())[:, word_to_ix["creo"]])

We got the right answer! You can see that the log probability for
Spanish is much higher in the first example, and the log probability for
English is much higher in the second for the test data, as it should be.
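Reading off the predicted label amounts to taking the index of the largest log probability and mapping it back through `label_to_ix`. The sketch below is plain Python; the `predict` helper and the log probability values are hypothetical and are not the outputs of the trained model above.

```python
label_to_ix = {"SPANISH": 0, "ENGLISH": 1}
ix_to_label = {ix: label for label, ix in label_to_ix.items()}

def predict(log_probs):
    """Return the label whose log probability is largest."""
    best_ix = max(range(len(log_probs)), key=lambda i: log_probs[i])
    return ix_to_label[best_ix]

# Hypothetical log probabilities in [SPANISH, ENGLISH] order.
first_prediction = predict([-0.2, -1.7])
second_prediction = predict([-2.5, -0.08])
print(first_prediction, second_prediction)
```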

Now you see how to make a PyTorch component, pass some data through it
and do gradient updates. We are ready to dig deeper into what deep NLP
has to offer.


