# RNN序列编码-分类期末大作业

本次大作业要求手动实现双向LSTM+基于attention的聚合模型，并用于古诗作者预测的序列分类任务。**请先阅读ppt中的作业说明。**

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import random
import numpy as np

from tqdm import tqdm

device = torch.device("cuda")
import torch.optim as optim

random.seed(1)
np.random.seed(1)
torch.manual_seed(1)

<torch._C.Generator at 0x26c9168ad10>

## 1. 加载数据

数据位于`data`文件夹中，每一行对应一个样例，格式为“诗句 作者”。下面的代码将数据文件读取到`train_data`, `valid_data`和`test_data`中，并根据训练集中的数据构造词表`word2idx`/`idx2word`和标签集合`label2idx`/`idx2label`。

In [2]:
word2idx = {"<unk>": 0}
label2idx = {}
idx2word = ["<unk>"]
idx2label = []

train_data = []
with open("data/train.txt", encoding="utf-8") as f:
    for line in f:
        text, author = line.strip().split()
        for c in text:
            if c not in word2idx:
                word2idx[c] = len(idx2word)
                idx2word.append(c)
        if author not in label2idx:
            label2idx[author] = len(idx2label)
            idx2label.append(author)
        train_data.append((text, author))

valid_data = []
with open("data/valid.txt", encoding="utf-8") as f:
    for line in f:
        text, author = line.strip().split()
        valid_data.append((text, author))

test_data = []
with open("data/test.txt", encoding="utf-8") as f:
    for line in f:
        text, author = line.strip().split()
        test_data.append((text, author))

In [3]:
print(len(word2idx), len(idx2word), len(label2idx), len(idx2label))
print(len(train_data), len(valid_data), len(test_data))

4941 4941 5 5
11271 1408 1410


**请完成下面的函数，其功能为给定一句古诗和一个作者，构造RNN的输入。** 这里需要用到上面构造的词表和标签集合，对于不在词表中的字用\<unk\>代替。

In [4]:
def make_data(text, author):
    """
    输入
        text: str
        author: str
    输出
        x: LongTensor, shape = (1, text_length)
        y: LongTensor, shape = (1,)
    """
    word_list = []
    for w in text:
        try:
            word_list.append(word2idx[w]) 
        except:
            word_list.append(0)

    x = torch.LongTensor(word_list).reshape(1,-1)
    y = torch.LongTensor([label2idx[author]])
    return x, y

x, y = make_data(*train_data[1])
x.shape, y.shape

(torch.Size([1, 24]), torch.Size([1]))

## 2. LSTM算子（单个时间片作为输入）

In [5]:
class LSTM(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(LSTM, self).__init__()
        self.f = nn.Linear(input_size + hidden_size, hidden_size)
        self.i = nn.Linear(input_size + hidden_size, hidden_size)
        self.o = nn.Linear(input_size + hidden_size, hidden_size)
        self.g = nn.Linear(input_size + hidden_size, hidden_size)
    
    def forward(self, ht, ct, xt):
        # ht: 1 * hidden_size
        # ct: 1 * hidden_size
        # xt: 1 * input_size
        input_combined = torch.cat((xt, ht), 1)
        ft = torch.sigmoid(self.f(input_combined))
        it = torch.sigmoid(self.i(input_combined))
        ot = torch.sigmoid(self.o(input_combined))
        gt = torch.tanh(self.g(input_combined))
        ct = ft * ct + it * gt
        ht = ot * torch.tanh(ct)
        return ht, ct

## 3. 实现双向LSTM（整个序列作为输入）

**要求使用上面提供的LSTM算子，不要调用torch.nn.LSTM**

In [6]:
class BiLSTM(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(BiLSTM, self).__init__()

        self.register_buffer("_float", torch.zeros(1, hidden_size))
        self.rnn1 = LSTM(input_size, hidden_size)
        self.rnn2 = LSTM(input_size, hidden_size)
        self.h1, self.c1 = self.init_h_and_c()
        self.h2, self.c2 = self.init_h_and_c()
        
    def init_h_and_c(self): 
        h = torch.zeros_like(self._float)
        c = torch.zeros_like(self._float)
        return h, c
    
    def lstm_loop(self, input, rnn, ht, ct):
        """
        input:  (length , input_size)
        rnn: 用到的算子
        ht: (1, hiddensize)
        ct: (1, hiddensize)
        返回值：lstm_output: (length, hidden_size)
        """
        output = []
        for x in input:
            xt = x.reshape(1,-1) # (1, hidden_size)
            ht, ct = rnn(xt, ht, ct)
            output.append(ht)
        lstm_output = torch.vstack(output) # (length, hidden_size)

        return lstm_output


    def forward(self, input):
        """
        输入
            input: 1 * length * input_size
        输出
            hiddens :(1, length, hidden_size * 2)
        """

        # recurrent, f_cx = self.rnn1(input[0])
        recurrent = self.lstm_loop(input[0], self.rnn1, self.h1, self.c1) # (length, hidden_size) 
        
        fwd = [recurrent]
        
        forward = torch.stack(fwd, dim=0).squeeze(1) # (1, length, hidden_size)

        input_reverse = torch.flip(input, dims=[1]) # 翻转

        recurrent_b= self.lstm_loop(input_reverse[0], self.rnn1, self.h2, self.c2) # (length, hidden_size)
        
        bwd = [recurrent_b] 
        backward = torch.stack(bwd, dim=0).squeeze(1) # (1, length, hidden_size)
        
        hiddens = torch.cat((forward, backward), -1) # (1, length, hidden_size * 2)
        
        return hiddens


## 4. 实现基于attention的聚合机制

In [7]:
class Attention(nn.Module):
    def __init__(self, hidden_size):
        super(Attention, self).__init__()
        self.hidden_size = hidden_size
        # 这里不加bias，就可以达到效果了
        self.qt = nn.Linear(hidden_size, 1, bias = False)
    def forward(self, hiddens):
        """
        输入
            hiddens: 1 * length * hidden_size
        输出
            attn_outputs: 1 * hidden_size
        """
        h = hiddens[0] # (length, hidden_size) 
        qh = self.qt(h) # (length, 1)

        alpha = F.softmax(qh, 1) # (length, 1)

        attn_outputs = alpha.T @ h # (1, hidden_size)

        return attn_outputs # (1, hidden_size)
    

## 5. 利用上述模块搭建序列分类模型

参考模型结构：Embedding – BiLSTM – Attention – Linear – LogSoftmax

In [8]:
class EncoderRNN(nn.Module):
    def __init__(self, num_vocab, embedding_dim, hidden_size, num_classes):
        """
        参数
            num_vocab: 词表大小
            embedding_dim: 词向量维数
            hidden_size: 隐状态维数
            num_classes: 类别数量
        """
        super(EncoderRNN, self).__init__()
        self.num_vocab = num_vocab
        self.embedding_dim = embedding_dim
        self.hidden_size = hidden_size
        self.num_classes = num_classes
        
        self.embedding = nn.Embedding(num_vocab, embedding_dim)
        self.lstm = BiLSTM(embedding_dim, hidden_size)
        self.attention = Attention(hidden_size * 2)
        self.out = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, X):
        """
        输入
            x: (1, length), LongTensor
        输出
            outputs: (1, num_classes)
        """
        input = self.embedding(X) # (1, length, embedding_dim) 

        output = self.lstm(input) # (length, batch_size, 2*hidden_size)
        
        attn_output = self.attention(output) # (1, 2*hidden_size)
        
        linear_output = self.out(attn_output) # (1, num_classes)
        
        output = F.log_softmax(linear_output, dim=1) # (1, num_classes)

        return output

In [9]:
model = EncoderRNN(num_vocab = len(word2idx), embedding_dim = 16, hidden_size = 6, num_classes = 5)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

## 6. 请利用上述模型在古诗作者分类任务上进行训练和测试

要求选取在验证集上效果最好的模型，输出测试集上的准确率、confusion matrix以及macro-precision/recall/F1，并打印部分测试样例及预测结果。

In [10]:
# 训练
EPOCH = 5
maxAccuracy = 0 # 最高准确率
for epoch in range(EPOCH):
    for data in tqdm(train_data):
        input_batch, target_batch = make_data(*data)
        
        optimizer.zero_grad()
        output = model(input_batch)
        loss = criterion(output, target_batch)
        loss.backward()
        optimizer.step()

    # 一个epoch结束计算一下正确率
    correct_count = 0
    for d in valid_data:
        input_batch, target_batch = make_data(*d)
        output = model(input_batch)
        if torch.argmax(output).item() == target_batch.item():
            correct_count += 1
    accuracy = correct_count/len(valid_data)
            
    print(f'EPOCH = {epoch+1}， accuracy = {accuracy}')
            
    if accuracy > maxAccuracy:
        # 如果准确率提高了，那么就保存新模型
        maxAccuracy = accuracy
        torch.save(model,'best_model')

100%|██████████| 11271/11271 [03:56<00:00, 47.67it/s]


EPOCH = 1， accuracy = 0.49573863636363635


100%|██████████| 11271/11271 [03:59<00:00, 47.09it/s]


EPOCH = 2， accuracy = 0.5475852272727273


100%|██████████| 11271/11271 [04:25<00:00, 42.49it/s]


EPOCH = 3， accuracy = 0.5830965909090909


100%|██████████| 11271/11271 [04:33<00:00, 41.14it/s]


EPOCH = 4， accuracy = 0.5894886363636364


100%|██████████| 11271/11271 [04:00<00:00, 46.91it/s]


EPOCH = 5， accuracy = 0.5973011363636364


In [11]:
# 测试
best_model = torch.load("./best_model")

y_true = [label2idx[data[1]] for data in test_data]

y_pred = []
for n, data in enumerate(test_data):
    input_batch, target_batch = make_data(*data)    

    output = best_model(input_batch)
    predict = torch.argmax(output).item()
    y_pred.append(predict)


In [12]:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score

In [13]:
print("测试集上准确率: ",accuracy_score(y_true, y_pred),"\n")

print("confusion_matrix: \n",confusion_matrix(y_true, y_pred),"\n")

print("precision_score: ",precision_score(y_true, y_pred, average="macro"),"\n")

print("recall_score: ",recall_score(y_true, y_pred, average="macro"),"\n")

print("f1_score: ",f1_score(y_true, y_pred, average="macro"),"\n")

测试集上准确率:  0.6212765957446809 

confusion_matrix: 
 [[ 43  21  56  27  13]
 [ 12 310  64  21   7]
 [ 16  40 376  28   8]
 [ 16  25  55 124  17]
 [ 20  17  44  27  23]] 

precision_score:  0.5337796119179676 

recall_score:  0.5039480688178053 

f1_score:  0.5089730830266238 



In [14]:
# 部分测试样例与结果
authors_pred = [idx2label[i] for i in y_pred]
list(zip(test_data,authors_pred))
# 每一行输出的第一个作者是正确答案，第二个作者是预测的

[(('舊日重陽日，傳杯不放杯。即今蓬鬢改，但愧菊花開。', '杜甫'), '杜甫'),
 (('熊羆交黑槊，賓客滿青油。今日文章主，梁王不姓劉。', '劉禹錫'), '杜甫'),
 (('晝號夜哭兼幽顯，早晚星關雪涕收。', '李商隱'), '杜甫'),
 (('玉壘高桐拂玉繩，上含非霧下含冰。', '李商隱'), '杜牧'),
 (('相思樹上合歡枝，紫鳳青鸞共羽儀。', '李商隱'), '劉禹錫'),
 (('空齋寂寂不生塵，藥物方書繞病身。纖草數莖勝靜地，', '劉禹錫'), '劉禹錫'),
 (('陰騭今如此，天災未可無。莫憑牲玉請，便望救焦枯。', '李商隱'), '李白'),
 (('露索秦宮井，風弦漢殿箏。幾時綿竹頌，擬薦子虛名。', '李商隱'), '杜甫'),
 (('開從綠條上，散逐香風遠。故取花落時，悠揚占春晚。', '劉禹錫'), '李白'),
 (('顧于韓蔡內，辨眼工小字。分日示諸王，鉤深法更秘。', '杜甫'), '杜甫'),
 (('貧家羞好客，語拙覺辭繁。三朝空錯莫，對飯卻慚冤。', '李白'), '杜甫'),
 (('吾愛王子晉，得道伊洛濱。金骨既不毀，玉顏長自春。', '李白'), '李白'),
 (('開元皇帝東封時，百神受職爭賓士。千鈞猛簴順流下，', '劉禹錫'), '劉禹錫'),
 (('微雨秋栽竹，孤燈夜讀書。憐君亦同志，晚歲傍山居。', '杜牧'), '杜甫'),
 (('蘆白疑粘鬢，楓丹欲照心。歸期無雁報，旅抱有猿侵。', '李商隱'), '杜甫'),
 (('烈士擊玉壺，壯心惜暮年。三杯拂劍舞秋月，', '李白'), '李白'),
 (('江色綠且明，茫茫與天平。逶迤巴山盡，搖曳楚雲行。', '李白'), '李白'),
 (('豈思鱗作簟，仍計腹為燈。浩蕩天池路，翱翔欲化鵬。', '李商隱'), '李商隱'),
 (('黃衫年少來宜數，不見堂前東逝波。', '杜甫'), '杜甫'),
 (('繁弦迸關紐，塞管裂圓蘆。眾音不能逐，嫋嫋穿雲衢。', '杜牧'), '杜甫'),
 (('柴荊具茶茗，徑路通林丘。與子成二老，來往亦風流。', '杜甫'), '杜甫'),
 (('一政政官軋軋，一年年老駸駸。', '劉禹錫'), '杜牧'),
 (('采菱寒刺上，蹋藕野泥中。素楫分曹往，金盤小徑通。', '杜甫'), '杜