# Sentiment Analysis with PyTorch and an RNN

[source](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb)

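## Recurrent neural network

A recurrent neural network (RNN) walks over a sequence (of words, for example). At each step it takes the vector of the current word, $x_t$, and the previous hidden state, $h_{t-1}$, and produces the next hidden state, $h_t$.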

$$h_t = \text{RNN}(x_t, h_{t-1})$$

![rnn.png](https://miro.medium.com/max/627/1*go8PHsPNbbV6qRiwpUQ5BQ.png)
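The recurrence above can be sketched in plain Python. This is a minimal illustration assuming the common Elman form $h_t = \tanh(W_{ih} x_t + W_{hh} h_{t-1} + b)$ with an all-zero initial state; the helper names `rnn_step` and `run_rnn` and all weight values are made up for the example.

```python
import math

def rnn_step(x_t, h_prev, W_ih, W_hh, b):
    """One recurrent step: h_t = tanh(W_ih x_t + W_hh h_prev + b)."""
    h_t = []
    for i in range(len(h_prev)):
        total = b[i]
        total += sum(w * x for w, x in zip(W_ih[i], x_t))
        total += sum(w * h for w, h in zip(W_hh[i], h_prev))
        h_t.append(math.tanh(total))
    return h_t

def run_rnn(xs, W_ih, W_hh, b):
    """Run rnn_step over a whole sequence, starting from a zero hidden state."""
    h = [0.0] * len(b)
    outputs = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_ih, W_hh, b)
        outputs.append(h)
    return outputs, h

# Toy, made-up weights: input size 2, hidden size 3.
W_ih = [[0.1, -0.2], [0.0, 0.3], [0.5, 0.5]]
W_hh = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
b = [0.0, 0.1, -0.1]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three "word vectors"

outputs, hidden = run_rnn(xs, W_ih, W_hh, b)
print(len(outputs), len(hidden))
```

`outputs` collects the hidden state after every step, while `hidden` is only the last one, so `outputs[-1]` and `hidden` are the same vector.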


## Data

We will use objects of the `Field` class. They define how the data is stored and processed.

For the `TEXT` field we set `tokenize='spacy'`, which means the texts are tokenized with the [spaCy](https://spacy.io) tokenizer. If no `tokenize` argument is passed, the texts are split on whitespace.

`LABEL` is defined by a `LabelField`, a subclass of `Field` used specifically for handling labels. We will explain the `dtype` argument later.

More about the `Field` class [here](https://github.com/pytorch/text/blob/master/torchtext/data/field.py).

In [0]:
import torch
from torchtext import data

SEED = 1234 # fix the seed for reproducibility

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField(dtype = torch.float)

torchtext (`torchtext.datasets`) ships several standard datasets, and they come with a built-in train/test split.

In [0]:
from torchtext import datasets

In [3]:
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

aclImdb_v1.tar.gz:   0%|          | 180k/84.1M [00:00<00:52, 1.60MB/s]

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:01<00:00, 72.5MB/s]


Let's see how many examples each split contains:

In [4]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 25000
Number of testing examples: 25000


Let's look at one example:

In [5]:
print(vars(train_data.examples[0]))

{'text': ['Loved', 'it', '!', 'What', "'s", 'not', 'to', 'like?--you', 'got', 'your', 'suburbia', ',', 'you', 'got', 'your', 'zombies', ',', 'you', 'got', 'your', 'family', 'issues', ',', 'you', 'got', 'your', 'social', 'dilemmas', ',', 'you', 'got', 'yourself', 'one', 'Fine', "Retro-1950's", '-', 'style', 'Flesh', 'Eating', 'Under', 'Class', 'Held', 'At', 'Bay', 'By', 'An', 'Uneasy', 'Worried', 'About', 'Whether', 'They', "'re", 'The', 'Next', 'Meal', 'Upper', 'Crust', '.', 'You', 'could', "n't", 'ask', 'for', 'more.<br', '/><br', '/>Cast', 'is', 'superb', '.', 'Carrie', 'Ann', 'Moss', 'is', 'absolute', 'perfection', 'as', 'a', 'debutante', 'social', 'climbing', 'housewife', '.', 'She', "'s", 'both', 'wanton', ',', 'and', 'criminally', 'conspiratorial', '.', 'Every', 'fellow', "'s", 'dream', '.', "K'sun", 'is', 'really', 'great', 'as', 'the', 'son', 'just', 'trying', 'to', 'be', 'as', 'normal', 'as', 'possible', 'in', 'this', 'nightmare', 'existence', ',', 'and', 'somehow', 'succeedin

In [6]:
train_data.examples[0].text[:10]

['Loved', 'it', '!', 'What', "'s", 'not', 'to', 'like?--you', 'got', 'your']

In [7]:
train_data.examples[0].label

'pos'

We create a validation set with the `.split()` method. The ratio between the two parts can be set with the `split_ratio` argument; by default 70% of the examples stay in the training set.

In [0]:
import random

In [0]:
random.seed(SEED) # seed() returns None, so seed first and pass the resulting RNG state
train_data, valid_data = train_data.split(random_state = random.getstate())

Again, we'll view how many examples are in each split.

In [10]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000
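As a rough illustration of what such a split does, here is a minimal plain-Python sketch: shuffle with a fixed seed, then cut at the requested ratio. The function `split_examples` is hypothetical and is not torchtext's implementation.

```python
import random

def split_examples(examples, split_ratio=0.7, seed=1234):
    """Shuffle a copy of `examples` and cut it into two parts."""
    rng = random.Random(seed)  # private RNG, so the global state is untouched
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(round(split_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 25,000 training reviews: just their indices.
train_part, valid_part = split_examples(range(25_000))
print(len(train_part), len(valid_part))  # 17500 7500
```

Because the seed is fixed, repeated calls produce the same two parts, which is the point of passing a random state to the split.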


Next we build the _vocabulary_. It is essentially a lookup table in which every word corresponds to an index. The index is used to construct _one-hot_ vectors.

![](assets/sentiment5.png)

With 100,000 unique words, the one-hot vectors would be very large, and that much data may not fit on the GPU. We therefore keep only the 25,000 most frequent words. Any word outside that set is replaced with `<unk>`. For example, if the word "love" in the sentence "This film is great and I love it" does not make it into the vocabulary, the sentence becomes "This film is great and I `<unk>` it".

In [0]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

Let's take a quick look at the vocabulary.

In [12]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")

Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2


The size is 25002 rather than 25000 because of the two special tokens, `<unk>` and `<pad>`.

All sentences in a batch must have the same length, so shorter ones are padded.

![](assets/sentiment6.png)
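Padding can be sketched in a few lines of plain Python. This is only an illustration of the idea, with a hypothetical helper `pad_batch` and made-up sentences; in the notebook the padding is done by torchtext.

```python
PAD = '<pad>'

def pad_batch(sentences, pad_token=PAD):
    """Right-pad every token list to the length of the longest one."""
    max_len = max(len(s) for s in sentences)
    return [s + [pad_token] * (max_len - len(s)) for s in sentences]

batch = [
    ['I', 'hate', 'this', 'film'],
    ['This', 'film', 'is', 'great', '!'],
    ['Boring'],
]
padded = pad_batch(batch)
for sentence in padded:
    print(sentence)
```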

In [13]:
print(TEXT.vocab.freqs.most_common(20))

[('the', 201964), (',', 192194), ('.', 165535), ('and', 109744), ('a', 109057), ('of', 100617), ('to', 93468), ('is', 76658), ('in', 61031), ('I', 54486), ('it', 53713), ('that', 49465), ('"', 44201), ("'s", 43145), ('this', 42353), ('-', 36686), ('/><br', 35729), ('was', 34929), ('as', 30116), ('with', 29940)]


The vocabulary exposes two mappings: `stoi` (**s**tring **to** **i**nt) and `itos` (**i**nt **to** **s**tring).

In [14]:
print(TEXT.vocab.itos[:10])

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is']


In [15]:
print(TEXT.vocab.stoi['and'])

5
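The vocabulary behaviour described in this section (a frequency cutoff, the two special tokens, and the `stoi`/`itos` mappings) can be sketched in plain Python. The functions `build_vocab` and `numericalize` below are hypothetical stand-ins written for this example, the corpus is made up, and the alphabetical tie-breaking is an assumption chosen only to make the result deterministic.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size, specials=('<unk>', '<pad>')):
    """Keep the `max_size` most frequent tokens, after the special tokens."""
    freqs = Counter(tok for text in tokenized_texts for tok in text)
    # Most frequent first; ties broken alphabetically (assumption).
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered[:max_size]]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to indexes; unknown tokens get the index of <unk>."""
    unk = stoi['<unk>']
    return [stoi.get(tok, unk) for tok in tokens]

corpus = [
    ['This', 'film', 'is', 'great'],
    ['This', 'film', 'is', 'bad'],
    ['I', 'love', 'this', 'film'],
]
itos, stoi = build_vocab(corpus, max_size=3)
ids = numericalize(['This', 'film', 'is', 'great', 'and', 'I', 'love', 'it'], stoi)
print(itos)
print(ids)
```

With `max_size=3` the vocabulary has five entries, three words plus the two special tokens, which mirrors the 25,000 + 2 = 25,002 count above.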


The final step of preparing the data is creating the iterators. We iterate over these in the training/evaluation loop, and they return a batch of examples (indexed and converted into tensors) at each iteration.

We'll use a `BucketIterator` which is a special type of iterator that will return a batch of examples where each example is of a similar length, minimizing the amount of padding per example.

We also want to place the tensors returned by the iterator on the GPU (if you're using one). PyTorch handles this using `torch.device`, which we then pass to the iterator.

In [0]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)
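To see why grouping sentences of similar length reduces padding, here is a simplified plain-Python sketch. It sorts all sentences by length and slices them into batches; this is an assumption made for illustration and not the exact algorithm `BucketIterator` uses. The helper names and the sentence lengths are made up.

```python
def bucket_batches(sentences, batch_size):
    """Sort by length, then slice into consecutive batches."""
    ordered = sorted(sentences, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def padding_needed(batches):
    """Total number of pad tokens needed to make each batch rectangular."""
    return sum(max(len(s) for s in b) - len(s) for b in batches for s in b)

lengths = [3, 50, 4, 48, 5, 52]            # made-up sentence lengths
sentences = [['w'] * n for n in lengths]

naive = [sentences[i:i + 3] for i in range(0, len(sentences), 3)]
bucketed = bucket_batches(sentences, 3)
print(padding_needed(naive), padding_needed(bucketed))  # 144 9
```

In this toy case the batches taken in the original order need 144 pad tokens, while the length-sorted batches need only 9.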

## Defining the model

The model has three layers:
* an _embedding_ layer, which transforms the sparse one-hot vector (sparse because most of its elements are 0) into a dense embedding vector (dense because the dimensionality is much smaller and all elements are real numbers). Applied to a one-hot input, it acts as a single fully connected layer.
* an RNN layer, which takes the dense vector and the previous hidden state $h_{t-1}$ and uses them to calculate the next hidden state, $h_t$.
* a _linear_ layer, which transforms the final hidden state to the correct output dimension.

![RNN](https://raw.githubusercontent.com/bentrevett/pytorch-sentiment-analysis/9210842371c3bbde7b2007051dafa4c74d9768cd/assets/sentiment7.png)


Each batch, `text`, is a tensor of size _**[sentence length, batch size]**_: a batch of sentences in which every word has been converted into a one-hot vector.

You may notice that this tensor should have another dimension due to the one-hot vectors. However, PyTorch conveniently stores a one-hot vector as its index value, i.e. the tensor representing a sentence is just a tensor of the indexes of the tokens in that sentence. The act of converting a list of tokens into a list of indexes is commonly called *numericalizing*.

The input batch is then passed through the embedding layer to get `embedded`, which gives us a dense vector representation of our sentences. `embedded` is a tensor of size _**[sentence length, batch size, embedding dim]**_.
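Storing indexes instead of one-hot vectors loses nothing, because multiplying a one-hot vector by the embedding matrix just selects one row of it. The following plain-Python sketch checks this on a toy embedding table; the values and helper names are made up, and it is not how PyTorch implements the lookup internally.

```python
def one_hot(index, size):
    """One-hot vector of length `size` with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_times_matrix(vec, matrix):
    """Row vector (length V) times a V x D matrix, giving a length-D vector."""
    dim = len(matrix[0])
    return [sum(v * row[d] for v, row in zip(vec, matrix)) for d in range(dim)]

# Toy embedding table: vocabulary of 4 tokens, embedding dimension 2.
embedding = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
token_ids = [2, 0, 3]  # a numericalized "sentence"

by_lookup = [embedding[i] for i in token_ids]
by_matmul = [vec_times_matrix(one_hot(i, 4), embedding) for i in token_ids]
print(by_lookup == by_matmul)  # True
```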

`embedded` is then fed into the RNN. In some frameworks you must feed the initial hidden state, $h_0$, into the RNN, however in PyTorch, if no initial hidden state is passed as an argument it defaults to a tensor of all zeros.

The RNN returns 2 tensors, `output` of size _**[sentence length, batch size, hidden dim]**_ and `hidden` of size _**[1, batch size, hidden dim]**_. `output` is the concatenation of the hidden state from every time step, whereas `hidden` is simply the final hidden state. We verify this using the `assert` statement. Note the `squeeze` method, which is used to remove a dimension of size 1. 

Finally, we feed the last hidden state, `hidden`, through the linear layer, `fc`, to produce a prediction.

In [0]:
import torch.nn as nn

In [0]:
class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()

        self.embedding = nn.Embedding(input_dim, embedding_dim) # layer that represents words as dense vectors

        self.rnn = nn.RNN(embedding_dim, hidden_dim) # the RNN layer
        
        self.fc = nn.Linear(hidden_dim, output_dim) # one more (linear) layer at the end
        
    def forward(self, text):

        #text = [sent len, batch size]
        
        embedded = self.embedding(text)
        
        #embedded = [sent len, batch size, emb dim]
        
        output, hidden = self.rnn(embedded)
        
        #output = [sent len, batch size, hid dim]
        #hidden = [1, batch size, hid dim]
        
        assert torch.equal(output[-1,:,:], hidden.squeeze(0))
        
        return self.fc(hidden.squeeze(0))

We now create an instance of our RNN class. 

The input dimension is the dimension of the one-hot vectors, which is equal to the vocabulary size. 

The embedding dimension is the size of the dense word vectors. This is usually around 50-250 dimensions, but depends on the size of the vocabulary.

The hidden dimension is the size of the hidden states. This is usually around 100-500 dimensions, but also depends on factors such as the vocabulary size, the size of the dense vectors and the complexity of the task.

The output dimension is usually the number of classes. With only 2 classes, however, a single scalar is enough: the model emits one real number, which is later squashed into a value between 0 and 1.

In [0]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

Let's also create a function that will tell us how many trainable parameters our model has so we can compare the number of parameters across different models.

In [20]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 2,592,105 trainable parameters
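The reported number can be reproduced by hand. The sketch below assumes the usual parameter layout of the three layers: an embedding table, an RNN layer with input-to-hidden and hidden-to-hidden weights plus two bias vectors, and a linear layer with weights and a bias. The function name is made up for the example.

```python
def count_rnn_model_parameters(vocab_size, embedding_dim, hidden_dim, output_dim):
    """Parameter count for embedding -> single-layer RNN -> linear."""
    embedding = vocab_size * embedding_dim
    rnn = (embedding_dim * hidden_dim    # input-to-hidden weights
           + hidden_dim * hidden_dim     # hidden-to-hidden weights
           + 2 * hidden_dim)             # two bias vectors
    linear = hidden_dim * output_dim + output_dim
    return embedding + rnn + linear

total = count_rnn_model_parameters(25_002, 100, 256, 1)
print(f'{total:,}')  # 2,592,105
```

Almost all of the parameters (2,500,200 of them) sit in the embedding table.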


## Training the model

In [0]:
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=1e-3)

We define the loss function: _binary cross entropy with logits_.

Our model currently outputs an unbounded real number. As our labels are either 0 or 1, we want to restrict the predictions to a number between 0 and 1. We do this using the _sigmoid_ function; the unbounded values fed into it are called _logits_.

We then use this bounded scalar to calculate the loss using binary cross entropy.

The `BCEWithLogitsLoss` criterion carries out both the sigmoid and the binary cross entropy steps.

In [0]:
criterion = nn.BCEWithLogitsLoss()
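The two steps can be written out in plain Python to check that they agree with a fused formula. This is a sketch of the mathematics only: `bce_with_logits` uses one well-known numerically stable form, and we do not claim it is the exact code path PyTorch takes. The logits and labels are made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(p, y):
    """Binary cross entropy for a probability p and a label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_with_logits(x, y):
    """Sigmoid and BCE fused into one numerically stable expression."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

logits = [2.0, -1.0, 0.3]   # made-up model outputs
labels = [1.0, 0.0, 1.0]

two_step = sum(bce(sigmoid(x), y) for x, y in zip(logits, labels)) / len(logits)
fused = sum(bce_with_logits(x, y) for x, y in zip(logits, labels)) / len(logits)
print(abs(two_step - fused) < 1e-12)
```

A useful reference point: a logit of 0 (probability 0.5) costs $\ln 2 \approx 0.693$ for either label, so a loss close to 0.693 means the model is doing no better than always predicting 0.5.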

Using `.to`, we can place the model and the criterion on the GPU (if we have one). 

In [0]:
model = model.to(device)
criterion = criterion.to(device)

A function to compute accuracy:

In [0]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc
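A small worked example of the same computation in plain Python, with made-up logits and labels. The function `binary_accuracy_py` is a hypothetical list-based counterpart of `binary_accuracy`, written only for illustration.

```python
import math

def binary_accuracy_py(logits, labels):
    """Fraction of examples whose rounded sigmoid output equals the label."""
    correct = 0
    for x, y in zip(logits, labels):
        prob = 1.0 / (1.0 + math.exp(-x))
        pred = round(prob)          # 0 or 1
        correct += int(pred == y)
    return correct / len(labels)

logits = [2.0, -1.0, 0.3, -0.2]
labels = [1, 0, 0, 0]
print(binary_accuracy_py(logits, labels))  # 0.75
```

Rounding the sigmoid output at 0.5 is the same as checking the sign of the logit: positive logits are predicted as class 1, negative ones as class 0.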

The `train` function iterates over all examples, one batch at a time. 

`model.train()` puts the model in "training mode", which enables _dropout_ and makes _batch normalization_ use per-batch statistics. Although we aren't using either in this model, it's good practice to include the call.

For each batch, we first zero the gradients. Each parameter in a model has a `grad` attribute which stores the gradient of the loss with respect to that parameter, filled in by `loss.backward()`. PyTorch accumulates gradients across backward passes rather than overwriting them, so they must be zeroed manually before each new batch.

We then feed the batch of sentences, `batch.text`, into the model. Note, you do not need to do `model.forward(batch.text)`, simply calling the model works. The `squeeze` is needed as the predictions are initially size _**[batch size, 1]**_, and we need to remove the dimension of size 1 as PyTorch expects the predictions input to our criterion function to be of size _**[batch size]**_.

The loss and accuracy are then calculated using our predictions and the labels, `batch.label`, with the loss being averaged over all examples in the batch.

We calculate the gradient of each parameter with `loss.backward()`, and then update the parameters using the gradients and optimizer algorithm with `optimizer.step()`.

The loss and accuracy are accumulated across the epoch. The `.item()` method extracts a scalar from a tensor that contains a single value.

Finally, we return the loss and accuracy, averaged across the epoch. The `len` of an iterator is the number of batches in the iterator.

You may recall that when initializing the `LABEL` field we set `dtype=torch.float`. This is because TorchText sets tensors to be `LongTensor`s by default, whereas our criterion expects both inputs to be `FloatTensor`s. Setting `dtype` to `torch.float` does the conversion for us. The alternative would be to do the conversion inside the `train` function by passing `batch.label.float()` instead of `batch.label` to the criterion.

In [0]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        optimizer.zero_grad() # zero the gradients

        predictions = model(batch.text).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)

        loss.backward() # compute the gradients
        optimizer.step() # update the parameters
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
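One detail of averaging over `len(iterator)`: the result is a mean of per-batch means, which matches the per-example mean only when all batches have the same size. The plain-Python sketch below illustrates the difference on made-up per-example correctness values; the helper names are hypothetical.

```python
import math

def mean_of_batch_means(batches):
    """What `epoch_acc / len(iterator)` computes: every batch weighs the same."""
    return sum(sum(b) / len(b) for b in batches) / len(batches)

def per_example_mean(batches):
    """Mean over all examples: every example weighs the same."""
    flat = [v for b in batches for v in b]
    return sum(flat) / len(flat)

# Two full batches of 4 correct predictions, one smaller batch of 2 wrong ones.
batches = [[1.0] * 4, [1.0] * 4, [0.0] * 2]
print(mean_of_batch_means(batches), per_example_mean(batches))

# With 17,500 training examples and a batch size of 64:
n_batches = math.ceil(17_500 / 64)
last_batch = 17_500 - (n_batches - 1) * 64
print(n_batches, last_batch)  # 274 28
```

Assuming the final, smaller batch is kept rather than dropped, only one of the 274 batches is short, so the two averages are nearly identical here.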

`evaluate` is similar to `train`, with a few modifications as you don't want to update the parameters when evaluating.

`model.eval()` puts the model in "evaluation mode", which turns off _dropout_ and makes _batch normalization_ use its running statistics. Again, we are not using them in this model, but it is good practice to include the call.

No gradients are calculated on PyTorch operations inside the `with torch.no_grad()` block. This reduces memory use and speeds up computation.

The rest of the function is the same as `train`, with the removal of `optimizer.zero_grad()`, `loss.backward()` and `optimizer.step()`, as we do not update the model's parameters when evaluating.

In [0]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

We'll also create a function to tell us how long an epoch takes to compare training times between models.

In [0]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

We then train the model through multiple epochs, an epoch being a complete pass through all examples in the training and validation sets.

At each epoch, if the validation loss is the best we have seen so far, we'll save the parameters of the model and then after training has finished we'll use that model on the test set.

In [0]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 14s
	Train Loss: 0.694 | Train Acc: 49.93%
	 Val. Loss: 0.696 |  Val. Acc: 49.84%
Epoch: 02 | Epoch Time: 0m 14s
	Train Loss: 0.693 | Train Acc: 50.01%
	 Val. Loss: 0.696 |  Val. Acc: 49.88%
Epoch: 03 | Epoch Time: 0m 14s
	Train Loss: 0.693 | Train Acc: 50.08%
	 Val. Loss: 0.696 |  Val. Acc: 50.52%
Epoch: 04 | Epoch Time: 0m 14s
	Train Loss: 0.693 | Train Acc: 49.81%
	 Val. Loss: 0.696 |  Val. Acc: 49.40%
Epoch: 05 | Epoch Time: 0m 14s
	Train Loss: 0.693 | Train Acc: 49.98%
	 Val. Loss: 0.696 |  Val. Acc: 50.41%


The loss barely decreases and the accuracy stays near 50%. That is because this model is far from ideal.

Finally, we compute the metric we actually care about, the test loss and accuracy, using the parameters that gave us the best validation loss.

In [0]:
model.load_state_dict(torch.load('tut1-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.711 | Test Acc: 47.01%


## Using pretrained embeddings

In [28]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train_data)


.vector_cache/glove.6B.zip: 862MB [06:25, 2.23MB/s]                           
100%|█████████▉| 398586/400000 [00:20<00:00, 19235.96it/s]
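Conceptually, `vectors` attaches a pretrained vector to every vocabulary entry that has one, and `unk_init` decides how the remaining entries are initialized (here with `torch.Tensor.normal_`, i.e. samples from a normal distribution). The plain-Python sketch below illustrates that idea with a made-up three-dimensional "pretrained" table; `build_embedding_matrix` is a hypothetical helper, not torchtext's implementation.

```python
import random

def build_embedding_matrix(itos, pretrained, dim, seed=1234):
    """One row per vocabulary entry: pretrained if available, else random normal."""
    rng = random.Random(seed)
    matrix = []
    for token in itos:
        if token in pretrained:
            matrix.append(list(pretrained[token]))
        else:
            # No pretrained vector for this token: sample from N(0, 1).
            matrix.append([rng.gauss(0.0, 1.0) for _ in range(dim)])
    return matrix

# Made-up stand-in for the GloVe table (real vectors here have 100 dimensions).
pretrained = {'film': [0.1, 0.2, 0.3], 'great': [0.4, 0.5, 0.6]}
itos = ['<unk>', '<pad>', 'film', 'great', 'zombies']

matrix = build_embedding_matrix(itos, pretrained, dim=3)
print(len(matrix), len(matrix[0]))  # 5 3
```

The resulting matrix has one row per vocabulary entry, which is why its shape has to match the embedding layer it is copied into.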

In [0]:
BATCH_SIZE = 64

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    sort_within_batch = True,
    device = device)

In [0]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

Let's look at the embeddings:

In [40]:
pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)

torch.Size([25002, 100])


In [41]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 0.3159, -0.9579,  0.8782,  ...,  0.8522,  0.1121,  0.6375],
        [-1.4131,  0.6201,  0.7174,  ...,  1.1835, -2.1611,  1.0010],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.4619,  0.1625, -0.1297,  ..., -0.1039, -0.4719,  0.7977],
        [-0.3447, -0.3907,  0.4552,  ..., -0.5628,  0.2530,  0.4335],
        [ 0.3796,  0.8294,  0.1816,  ...,  0.4350, -0.2332,  0.4504]])

In [0]:
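criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)

# New model, so re-create the optimizer: the old one still updates the previous model's parameters
optimizer = optim.SGD(model.parameters(), lr=1e-3)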

### Training

In [37]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

IndexError: ignored

In [0]:
model.load_state_dict(torch.load('tut2-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.706 | Test Acc: 37.54%
