Reference: https://blog.floydhub.com/long-short-term-memory-from-zero-to-hero-with-pytorch/

#### Libraries
nltk: Natural Language Toolkit

In [1]:
import bz2
from collections import Counter
import re
import nltk
import numpy as np
nltk.download('punkt')

train_file = bz2.BZ2File('./amazonreviews/train.ft.txt.bz2')
test_file = bz2.BZ2File('./amazonreviews/test.ft.txt.bz2')

train_file = train_file.readlines()
test_file = test_file.readlines()

[nltk_data] Downloading package punkt to /home/aioz-
[nltk_data]     interns/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### Data
Consists of 1,000,000 reviews, each labeled positive (__label__2) or negative (__label__1)

In [2]:
num_train = 800000  # We're training on the first 800,000 reviews in the dataset
num_test = 200000  # Using 200,000 reviews from test set

train_file = [x.decode('utf-8') for x in train_file[:num_train]]
test_file = [x.decode('utf-8') for x in test_file[:num_test]]

print(len(train_file))
print(len(test_file))

800000
200000


In [3]:
print((train_file[300]))

__label__1 super wack: just like No-Limit Cash Money has no shame at putting out garbage music.wack beats and no lyric ryhmes.who is buying this crab? all the stuff sounds the same and it's not that average.it's all bad.



#### Data preprocessing
+ split each line into its label (0 or 1) and its sentence
+ replace every digit in the sentence with 0 (the exact numbers probably carry no sentiment)
+ replace URLs in the sentence with the string '<url>'
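
A minimal sketch of these three steps applied to a single made-up line. The helper name `preprocess` and the sample review are assumptions for illustration only; the label test and the two regular expressions follow the cell below.

```python
import re

def preprocess(line):
    """Hypothetical helper: split off the label, then clean the text."""
    label_token, text = line.split(' ', 1)
    label = 0 if label_token == '__label__1' else 1   # 0 = negative, 1 = positive
    text = text.rstrip('\n').lower()
    text = re.sub(r'\d', '0', text)                   # every digit becomes 0
    # Only run the URL substitution when the text looks like it contains a URL
    if any(marker in text for marker in ('www.', 'http:', 'https:', '.com')):
        text = re.sub(r"([^ ]+(?<=\.[a-z]{3}))", "<url>", text)
    return label, text

label, text = preprocess("__label__2 Great: bought 2 at www.example.com today\n")
print(label)
print(text)
```

Note that the URL pattern replaces any space-free run ending in a dot followed by three lowercase letters, so it can also catch things that are not URLs.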

In [4]:
# Extracting labels from sentences
train_labels = [0 if x.split(' ')[0] == '__label__1' else 1 for x in train_file]
train_sentences = [x.split(' ', 1)[1][:-1].lower() for x in train_file]

test_labels = [0 if x.split(' ')[0] == '__label__1' else 1 for x in test_file]
test_sentences = [x.split(' ', 1)[1][:-1].lower() for x in test_file]

# Some simple cleaning of data: replace every digit with 0
# (raw strings keep '\d' from being treated as an invalid string escape)
for i in range(len(train_sentences)):
    train_sentences[i] = re.sub(r'\d', '0', train_sentences[i])

for i in range(len(test_sentences)):
    test_sentences[i] = re.sub(r'\d', '0', test_sentences[i])

# Modify URLs to <url>
for i in range(len(train_sentences)):
    if 'www.' in train_sentences[i] or 'http:' in train_sentences[i] or 'https:' in train_sentences[i] or '.com' in train_sentences[i]:
        train_sentences[i] = re.sub(r"([^ ]+(?<=\.[a-z]{3}))", "<url>", train_sentences[i])
        
for i in range(len(test_sentences)):
    if 'www.' in test_sentences[i] or 'http:' in test_sentences[i] or 'https:' in test_sentences[i] or '.com' in test_sentences[i]:
        test_sentences[i] = re.sub(r"([^ ]+(?<=\.[a-z]{3}))", "<url>", test_sentences[i])

In [5]:
print(train_labels[300])
print(train_sentences[300])
print(test_labels[14])
print(test_sentences[14])

0
super wack: just like no-limit cash money has no shame at putting out garbage music.wack beats and no lyric ryhmes.who is buying this crab? all the stuff sounds the same and it's not that average.it's all bad.
0
didn't run off of usb bus power: was hoping that this drive would run off of bus power, but it required the adapter to actually work. :( i sent it back.


+ lowercase all words (single tokens)
+ count how many times each one occurs and store the counts in words

In [6]:
words = Counter()  # Dictionary that will map a word to the number of times it appeared in all the training sentences
for i, sentence in enumerate(train_sentences):
    # The sentences will be stored as a list of words/tokens
    train_sentences[i] = []
    for word in nltk.word_tokenize(sentence):  # Tokenizing the words
        words.update([word.lower()])  # Converting all the words to lowercase
        train_sentences[i].append(word)
    if i%20000 == 0:
        print(str((i*100)/num_train) + "% done")
print("100% done")

0.0% done
2.5% done
5.0% done
7.5% done
10.0% done
12.5% done
15.0% done
17.5% done
20.0% done
22.5% done
25.0% done
27.5% done
30.0% done
32.5% done
35.0% done
37.5% done
40.0% done
42.5% done
45.0% done
47.5% done
50.0% done
52.5% done
55.0% done
57.5% done
60.0% done
62.5% done
65.0% done
67.5% done
70.0% done
72.5% done
75.0% done
77.5% done
80.0% done
82.5% done
85.0% done
87.5% done
90.0% done
92.5% done
95.0% done
97.5% done
100% done


In [7]:
print('occurrences of "we":', words['we'])
print('occurrences of "suck":', words['suck'])

occurrences of "we": 114499
occurrences of "suck": 2263


+ remove words that occur only once from the words dictionary (most likely misspellings)
+ sort the remaining words by frequency, most frequent first (the counts themselves are dropped)
+ add _PAD and _UNK (padding and unknown) to the front of the vocabulary
+ create 2 dicts, word2idx and idx2word, to map a word to its position in the vocabulary and back
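
The same steps on a toy corpus, as a sketch. The three tokenized sentences are made up, and the names `counts`, `kept` and `vocab` are chosen for illustration; the real cell below works on the `words` counter built from the training set.

```python
from collections import Counter

# Toy, already-tokenized sentences (assumption for illustration)
toy_sentences = [["good", "movie", "."], ["bad", "movie", "."], ["good", "plot", "."]]

counts = Counter(tok for sent in toy_sentences for tok in sent)

# Drop tokens seen only once, then sort by frequency (most frequent first)
kept = {w: c for w, c in counts.items() if c > 1}
vocab = ['_PAD', '_UNK'] + sorted(kept, key=kept.get, reverse=True)

# Lookup tables in both directions
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for i, w in enumerate(vocab)}

print(vocab)
print(word2idx['good'], idx2word[2])
```

Because _PAD and _UNK are prepended, the most frequent real token gets index 2.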


In [8]:
# Removing the words that only appear once
words = {k:v for k,v in words.items() if v>1}
# Sorting the words according to the number of appearances, with the most common word being first
words = sorted(words, key=words.get, reverse=True)
# Adding padding and unknown to our vocabulary so that they will be assigned an index
words = ['_PAD','_UNK'] + words
# Dictionaries to store the word to index mappings and vice versa
word2idx = {o:i for i,o in enumerate(words)}
idx2word = {i:o for i,o in enumerate(words)}

In [9]:
print(words[word2idx['suck']])
print('top 20 words: ', words[:20])

suck
top 20 words:  ['_PAD', '_UNK', '.', 'the', ',', 'i', 'and', 'a', 'to', 'it', 'of', 'this', 'is', ':', 'in', '!', 'for', 'that', 'was', 'you']


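+ Encode the sentences in the train and test sets: each word is replaced by its vocabulary index, i.e. its frequency rank. Words not in the vocabulary get index 0 here, which is the index of _PAD rather than _UNK (1).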
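
The cell below maps words missing from word2idx to 0. As a sketch of an alternative, not what this notebook ran, out-of-vocabulary words could be sent to the _UNK index instead. The toy vocabulary and the helper `encode` are assumptions for illustration.

```python
# Toy vocabulary (assumption); in the notebook this comes from the training set
vocab = ['_PAD', '_UNK', '.', 'good', 'movie']
word2idx = {w: i for i, w in enumerate(vocab)}
unk_idx = word2idx['_UNK']

def encode(tokens, word2idx, unk_idx):
    """Hypothetical helper: look up each token, falling back to _UNK."""
    return [word2idx.get(tok, unk_idx) for tok in tokens]

encoded = encode(['good', 'film', '.'], word2idx, unk_idx)
print(encoded)  # 'film' is not in the vocabulary, so it becomes 1
```

Keeping padding and unknown words on separate indices lets the embedding layer learn a distinct vector for "unknown word" instead of sharing it with padding.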

In [10]:
for i, sentence in enumerate(train_sentences):
    # Looking up the mapping dictionary and assigning the index to the respective words
    train_sentences[i] = [word2idx[word] if word in word2idx else 0 for word in sentence]

for i, sentence in enumerate(test_sentences):
    # For test sentences, we have to tokenize the sentences as well
    test_sentences[i] = [word2idx[word.lower()] if word.lower() in word2idx else 0 for word in nltk.word_tokenize(sentence)]

In [11]:
print(train_sentences[0])
print(test_sentences[0])

[66140, 88, 16, 3, 103103, 13, 11, 192, 459, 18, 361, 15, 9, 5736, 3, 91059, 14, 77, 433, 36, 90, 5, 51, 1662, 9, 88, 8, 141, 80, 616, 18520, 2, 211, 126, 15, 5, 27, 539, 3, 211, 18200, 2031, 22, 56, 10, 35, 10, 3, 792, 5, 27, 131, 539, 9, 58, 3, 96, 126, 15, 9, 7389, 261, 49, 4579, 62414, 6, 427, 7, 17664, 1077, 23, 8888, 2708, 6, 3932, 19348, 2, 9, 51, 4799, 204, 80, 2391, 8, 313, 15, 16999]
[40, 99, 13, 28, 1445, 4274, 58, 31, 10, 3, 40, 1778, 10, 85, 1727, 2, 5, 27, 904, 8, 11, 99, 16, 152, 6, 5, 140, 89, 9, 2, 68, 5, 122, 14, 7, 42, 1845, 9, 210, 59, 243, 109, 2, 7, 134, 1845, 47, 29399, 38, 2640, 14, 3, 2378, 2, 11, 99, 47, 18877, 160, 2, 932, 30, 0, 0, 6, 557, 47, 1282, 2, 31, 10, 160, 21, 2334, 4156, 2, 11, 12, 7, 3564, 15134, 99, 14, 28, 24, 2, 182, 102, 130, 147, 9, 239, 12, 47, 821, 59, 2, 2582, 5, 262, 11, 4, 72, 598, 441, 4, 576, 4, 413, 4, 153, 4, 1686, 4, 1251, 1814, 519, 31, 179, 33, 80, 18, 17, 825, 62, 32]


#### Pad and shorten the sentences so they all have the same length
+ Sentences longer than 200 tokens are cut off, keeping the first 200 tokens
+ Sentences shorter than 200 tokens are padded with 0 (_PAD) at the front
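
A pure-Python sketch of the same rule for one sentence, using a length of 5 instead of 200 so the result is easy to read. The helper `pad_left` is hypothetical; the notebook's pad_input below does the equivalent with a NumPy array.

```python
def pad_left(tokens, seq_len, pad_idx=0):
    """Hypothetical helper: keep the first seq_len tokens, left-pad the rest."""
    tokens = tokens[:seq_len]
    return [pad_idx] * (seq_len - len(tokens)) + tokens

short = pad_left([7, 8, 9], 5)
long_ = pad_left([1, 2, 3, 4, 5, 6, 7], 5)
print(short)  # padding goes in front
print(long_)  # the tail is dropped
```

Padding at the front means the last time step always holds a real token (for non-empty sentences), which matters because the model reads its prediction from the last time step.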

In [12]:
# Defining a function that either shortens sentences or pads sentences with 0 to a fixed length
def pad_input(sentences, seq_len):
    features = np.zeros((len(sentences), seq_len),dtype=int)
    for ii, review in enumerate(sentences):
        if len(review) != 0:
            features[ii, -len(review):] = np.array(review)[:seq_len]
    return features

seq_len = 200  # The length that the sentences will be padded/shortened to

train_sentences = pad_input(train_sentences, seq_len)
test_sentences = pad_input(test_sentences, seq_len)

# Converting our labels into numpy arrays
train_labels = np.array(train_labels)
test_labels = np.array(test_labels)

+ Split the 200,000 test sentences in half: one half for validation, the other half for testing

In [13]:
split_frac = 0.5 # 50% validation, 50% test
split_id = int(split_frac * len(test_sentences))
val_sentences, test_sentences = test_sentences[:split_id], test_sentences[split_id:]
val_labels, test_labels = test_labels[:split_id], test_labels[split_id:]

#### Using TensorDataset and DataLoader from torch.utils.data
batch size: 400

In [14]:
import torch
from torch.utils.data import TensorDataset, DataLoader
import torch.nn as nn

train_data = TensorDataset(torch.from_numpy(train_sentences), torch.from_numpy(train_labels))
val_data = TensorDataset(torch.from_numpy(val_sentences), torch.from_numpy(val_labels))
test_data = TensorDataset(torch.from_numpy(test_sentences), torch.from_numpy(test_labels))

batch_size = 400

train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
val_loader = DataLoader(val_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

#### device

In [15]:
device = ('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)

cuda:0


#### Model
##### forward
+ initial input size: batch_size x seq_len (400 x 200)
+ create an embedding table of size vocab_size x embedding_size (vocab size: number of words; embedding size: number of features per word)
+ embed the input -> embedded input: batch_size x seq_len x embedding_size (400 x 200 x 400)
+ pass the embedded input through the lstm (batch_first=True) together with an initial hidden state (zero-initialized, ... x batch_size x hidden_size) -> lstm_out (batch_size x seq_len x hidden_size, 400 x 200 x 512) together with the new hidden state (... x batch_size x hidden_size). batch_first only affects the input and the output, not the hidden state.
+ the hidden state is passed as a tuple (h, c) because an LSTM keeps both a hidden state and a cell state, each with the shape above
+ lstm_out, hidden = lstm(inp, hidden): hidden here is the tuple of final states (h_n, c_n); with 1 unidirectional layer, h_n is the same as the last time step of lstm_out

the ... = num_layers * num_directions (in this post = 2 x 1), where num_layers is the number of stacked LSTM layers and num_directions is the number of directions in which the input is read (1 for a forward-only LSTM)


contiguous() in the code makes sure that view() can reshape the data: https://discuss.pytorch.org/t/contigious-vs-non-contigious-tensor/30107/2
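
A small shape check, as a sketch with toy sizes (batch 4, sequence length 6, embedding 8, hidden 16 are assumptions chosen to run quickly on CPU; the notebook uses 400, 200, 400 and 512). It prints the shapes that nn.Embedding and nn.LSTM produce with batch_first=True and 2 layers.

```python
import torch
import torch.nn as nn

batch_size, seq_len = 4, 6
vocab_size, embedding_dim, hidden_dim, n_layers = 50, 8, 16, 2

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, batch_first=True)

x = torch.randint(0, vocab_size, (batch_size, seq_len))  # token indices
h0 = torch.zeros(n_layers, batch_size, hidden_dim)       # initial hidden state
c0 = torch.zeros(n_layers, batch_size, hidden_dim)       # initial cell state

embeds = embedding(x)
lstm_out, (h_n, c_n) = lstm(embeds, (h0, c0))

print(embeds.shape)    # batch_size x seq_len x embedding_dim
print(lstm_out.shape)  # batch_size x seq_len x hidden_dim
print(h_n.shape)       # n_layers x batch_size x hidden_dim

# The top layer's final hidden state equals the last time step of lstm_out
last_step_matches = torch.allclose(lstm_out[:, -1, :], h_n[-1])
print(last_step_matches)
```

lstm_out only exposes the top layer, one vector per time step, while h_n holds the final state of every layer.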

In [16]:
class SentimentNet(nn.Module):
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        super(SentimentNet, self).__init__()
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=drop_prob, batch_first=True)
        self.dropout = nn.Dropout(drop_prob)
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x, hidden):
        batch_size = x.size(0)
        x = x.long()
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
        # print(hidden[0].shape)
        # print(lstm_out.shape)  torch.Size([400, 200, 512])
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        out = self.dropout(lstm_out)
        out = self.fc(out)
        out = self.sigmoid(out)
        # print(out.shape) torch.Size([80000, 1])
        out = out.view(batch_size, -1)
        # print(out.shape)  torch.Size([400, 200])
        out = out[:,-1]
        # print(out.shape)  torch.Size([400])
        return out, hidden
    
    def init_hidden(self, batch_size):
        weight = next(self.parameters()).data
        hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().to(device),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().to(device))
        return hidden

In [17]:
vocab_size = len(word2idx) + 1  # number of words in the vocabulary, plus one spare index
output_size = 1  # positive or negative (binary cross entropy)
embedding_dim = 400 
hidden_dim = 512
n_layers = 2

model = SentimentNet(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)
model.to(device)

lr=0.005
criterion = nn.BCELoss()  # binary cross entropy
optimizer = torch.optim.Adam(model.parameters(), lr=lr)


In [18]:
torch.cuda.empty_cache()

#### train
model.eval() vs torch.no_grad(): https://discuss.pytorch.org/t/model-eval-vs-with-torch-no-grad/19615/7

In [19]:
epochs = 2
counter = 0
print_every = 1000
clip = 5
valid_loss_min = np.inf  # np.Inf was removed in NumPy 2.0

model.train()
for i in range(epochs):
    h = model.init_hidden(batch_size)
    for inputs, labels in train_loader:
        counter += 1
        h = tuple([e.data for e in h])
        inputs, labels = inputs.to(device), labels.to(device)
        model.zero_grad()
        output, h = model(inputs, h)
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), clip)  # guard against exploding gradients: rescale them so their total norm is at most clip
        optimizer.step()
        
        if counter%print_every == 0:
            val_h = model.init_hidden(batch_size)
            val_losses = []
            model.eval()
            for inp, lab in val_loader:
                val_h = tuple([each.data for each in val_h])
                inp, lab = inp.to(device), lab.to(device)
                out, val_h = model(inp, val_h)
                val_loss = criterion(out.squeeze(), lab.float())
                val_losses.append(val_loss.item())
                
            model.train()
            print("Epoch: {}/{}...".format(i+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))
            if np.mean(val_losses) <= valid_loss_min:
                torch.save(model.state_dict(), './state_dict.pt')
                print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(valid_loss_min,np.mean(val_losses)))
                valid_loss_min = np.mean(val_losses)

Epoch: 1/2... Step: 1000... Loss: 0.158818... Val Loss: 0.177383
Validation loss decreased (inf --> 0.177383).  Saving model ...
Epoch: 1/2... Step: 2000... Loss: 0.152957... Val Loss: 0.169877
Validation loss decreased (0.177383 --> 0.169877).  Saving model ...
Epoch: 2/2... Step: 3000... Loss: 0.180345... Val Loss: 0.167656
Validation loss decreased (0.169877 --> 0.167656).  Saving model ...
Epoch: 2/2... Step: 4000... Loss: 0.120877... Val Loss: 0.167090
Validation loss decreased (0.167656 --> 0.167090).  Saving model ...
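
The clipping step in the training loop rescales all gradients together when their combined norm exceeds clip. A pure-Python sketch of that idea follows; the helper `clip_by_total_norm` is hypothetical, a flat list of floats stands in for the gradients of all parameters, and the small eps guarding the division is an assumption. It is not PyTorch's implementation.

```python
import math

def clip_by_total_norm(grads, max_norm, eps=1e-6):
    """Hypothetical helper: scale grads so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale factor is capped at 1.0, so small gradients are left unchanged
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

clipped, norm = clip_by_total_norm([3.0, 4.0], max_norm=1.0)
print(norm)     # norm before clipping
print(clipped)  # same direction, norm close to 1.0

unchanged, _ = clip_by_total_norm([0.3, 0.4], max_norm=5.0)
print(unchanged)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is limited.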


#### test eval

In [20]:
# Loading the best model
model.load_state_dict(torch.load('./state_dict.pt'))

test_losses = []
num_correct = 0
h = model.init_hidden(batch_size)

model.eval()
for inputs, labels in test_loader:
    h = tuple([each.data for each in h])
    inputs, labels = inputs.to(device), labels.to(device)
    output, h = model(inputs, h)
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    pred = torch.round(output.squeeze())  # Rounds the output to 0/1
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)

print("Test loss: {:.3f}".format(np.mean(test_losses)))
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}%".format(test_acc*100))

Test loss: 0.162
Test accuracy: 93.962%
