# Segement Me If U Can
THUCST Intro to AI PA 2
[Eren zhao](https://zhaochenyang20.github.io/) Class 06

# 任务简介
- 进行一次情感二分类，仅考虑正负情感。

## 实验数据
- 实验数据包括包含训练、验证、测试集合以及预处理好的词向量
- 句子的分类包含正向和负向两种

## 实验要求
- 本次实验要求实现 CNN 与 RNN 两个模型，并应用在情感分类任务上。RNN 可以是 LSTM，GRU 等类型。
- 对比两模型的实验效果，并分析原因。 也可以实现其他模型作为对比模型（baseline），例如全连接神经网络（MLP），可适当加分。

## 评价指标
1. 准确率（Accuracy）
2. [F-score](https://deepai.org/machine-learning-glossary-and-terms/f-score)，类似 MIOU

## 报告内容
1. 模型的结构图，以及流程分析。
2. 实验结果，准确率，F-score标的实验效果。
3. 试简要地比较实验中使用的不同参数效果，并分析原因。
4. 比较baseline模型与CNN，RNN模型的效果差异。（如果有实现）
5.  问题思考，心得体会

## Question List
1. 实验训练什么时候停止是最合适的？简要陈述你的实现方式，并试分析固定迭代次数与通过验证集调整等方法的优缺点。
2. 实验参数的初始化是怎么做的？不同的方法适合哪些地方？（现有的初始化方法为零均值初始化，高斯分布初始化，正交初始化等）
3. 过拟合是深度学习常见的问题，有什么方法可以方式训练过程陷入过拟合
4. 试分析CNN，RNN，全连接神经网络（MLP）三者的优缺点

# 模型结构

## LSTM

![LSTM](https://zhaochenyang20.github.io/pic/lecture/2022_spring/IAI/LSTM.jpg)

- 双向 LSTM 分类网络的模型结果如上图。前向传播的流程为：将一批长度统一且标记化的句子输入网络，依次经过：
1. 嵌入层：将每个表示单词的自然数映射为指定长度的向量，即用向量表示单词。
2. 双向双层 LSTM 层：接收某个 batch 的词向量组成的序列，每个 LSTM 单元在两个方向上分别产生自己的隐藏状态。最终只用了最后一层（第二层）两个方向上传播的各自的最后一个单元的隐藏状态作为下一层的输入。
3. 线性分类层：由两层网络构成，接收上述 LSTM 层产生的两个隐藏状态直接拼接起来的向量（维数变为隐藏状态维数的 2 倍）作为输入，经过两层线性层输出维数等于分类类别数的向量，表示对类别的预测结果。

## Text-CNN

![CNN](https://zhaochenyang20.github.io/pic/lecture/2022_spring/IAI/CNN.jpg)

- 依据[参考文献](https://arxiv.org/abs/1408.5882)中的模型搭建 Text-CNN 模型。前向传播流程如下：
1. 嵌入层：将每个表示单词的自然数映射为指定长度的向量，即用向量表示单词。
2. 一维多通道多卷积核卷积层：将嵌入层得到的数据视为一批多通道的一维张量；一维张量的长度为对齐后的句子长度，通道数为词向量的数。用指定数量与大小的卷积核与输入数据做多通道多卷积核卷积，得到多通道的一维输出特征。用宽度为2、4、8的卷积核分别做三次卷积。
3. 池化层：对卷积结果进行 activate, Dropout, max pooling。
4. 线性层：将池化后的卷积结果拼接在一起，得到长度为所有卷积输出通道数之和的张量，再经过一层线性层得到表示类别标签预测的向量。

## MLP

![MLP](https://zhaochenyang20.github.io/pic/lecture/2022_spring/IAI/MLP.jpg)

使用 MLP 作为 baseline 。模型示意图如上，前向传播大致流程如下：
1. 嵌入层：将每个表示单词的自然数映射为指定长度的向量，即用向量表示单词。
2. 线性层1：接收一批将句子中的词向量直接拼接起来得到的张量为输入，输出指定大小的张量，然后进行 Batch Normalization, Activation, Dropout。
3. 线性层2：输出表示类别标签预测的向量。

## [BERT](https://zhaochenyang20.github.io/pdf/BERT.pdf)
借助大型预训练模型 BERT 完成文本分类这一下游任务。模型结构和前向传播方式是：将句子输入 BERT ，取出 BERT 的第一个输出，是一个高维维的向量；将此向量输入一个线性层，得到最终表示分类的向量。
可以发现，由于 BERT 已经在海量数据中训练，较好地学习到了词汇的语义，最终在此任务上只需要最后加一个简单的线性层做分类就能达到很好的效果。

# 配置信息
## 环境库
参考 requirements.txt
## 可视化
采用 [wandb](https://wandb.ai/site) 辅助可视化
## 算力
由于我自己的电脑是 Macbook M1 Core，虽然 M1 芯片优化了 CPU 计算的性能，然而没有显卡是硬伤。于是我在自己的服务器上进行训练，服务器有 1 张 3080。

In [18]:
import gensim
import wandb
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import time
from collections import Counter
from torch.utils.data import TensorDataset,DataLoader
from torch.optim.lr_scheduler import *
from typing import List, Dict
from pathlib import Path
from tqdm import tqdm

wandb.init(project="IAI 2022 CNN", entity="eren-zhao")

wandb.config = {
  "learning_rate": 0.001,
  "epochs": 100,
  "batch_size": 50
}

def getFileList(filePath:str) -> List[str]:
    """
    return dataSet file list except the test file
    """
    files = os.listdir(filePath)
    returnList = []
    for each in files:
        if each.endswith('.txt') and not each.startswith('valid'):
            returnList.append(Path.cwd() / filePath / each)
    return returnList

def getFile(filePath:str, fileName:str) -> str:
    """
    return a specific file from dataSet
    """
    files = os.listdir(filePath)
    for each in files:
        if each.startswith(f'{fileName}'):
            return Path.cwd() / filePath / each

def getWord2Id() -> Dict:
    """
    word2id: word -> id
    is a dictionary which give each word in training set and valid set a id, range from 0 to n_words
    """
    path = getFileList('Dataset')
    word2id = Counter()
    for each in path:
        with open(each, encoding='utf-8', errors="ignore") as f:
            for line in f.readlines():
                sentence = line.strip().split()
                for word in sentence[1:]:
                    if word not in word2id.keys():
                        word2id[word] = len(word2id)
    return word2id


def getWord2Vec(filename, word2id):
    """
    word2vec: word -> vector
    is a dictionary which give each word in training set and valid set a vector, range, the length of vector is 50
    """
    path = getFile("Dataset", filename)
    preModel = gensim.models.KeyedVectors.load_word2vec_format(path, binary=True)
    word2vecs = np.array(np.zeros([len(word2id) + 1, preModel.vector_size]))
    for key in word2id:
        try:
            word2vecs[word2id[key]] = preModel[key]
        except Exception:
            pass
    return word2vecs

def getCorpus(path, word2id, maxLength=50):
    """
    :param path: 样本语料库的文件
    :return: 文本内容contents，以及分类标签labels(onehot形式)
    """
    contents, labels = np.array([0] * maxLength), np.array([])
    with open(path, encoding='utf-8', errors="ignore") as f:
        for line in f.readlines():
            sentence = line.strip().split()
            content = np.asarray([word2id.get(w, 0) for w in sentence[1:]])[:maxLength]
            padding = max(maxLength - len(content), 0)
            content = np.pad(content, ((0, padding)), 'constant', constant_values=0)
            labels = np.append(labels, int(sentence[0]))
            contents = np.vstack([contents, content])
    contents = np.delete(contents, 0, axis=0)
    return contents, labels

VBox(children=(Label(value='0.001 MB of 0.001 MB uploaded (0.000 MB deduped)\r'), FloatProgress(value=1.0, max…

0,1
train_acc,▁▁▇█████████████████████████████████████
train_loss,██▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
val_acc,▁▃▆██▇▇▇████████████████████████████████
val_loss,▂▁▂▅▆▆▇▇▇▆▇█▇▆▆▇▇▆▇▆▆▆▇▇▇▇▇█▆█▇▇▇▆▆▆▆▇█▆

0,1
train_acc,0.99825
train_loss,0.01063
val_acc,0.8374
val_loss,0.84605


In [19]:
learning_rate = 0.001      # 学习率     
BATCH_SIZE = 50            # 训练批量
EPOCHS = 10                  # 训练轮数
model_path = None          # 预训练模型路径
max_length = 50            # 每个样本的最大长度

word2id = getWord2Id()
word2vec = getWord2Vec('wiki', word2id)
train_contents, train_labels = getCorpus('./Dataset/train.txt', word2id, maxLength=max_length)
val_contents, val_labels = getCorpus('./Dataset/validation.txt', word2id, maxLength=max_length)
test_contents, test_labels = getCorpus('./Dataset/test.txt', word2id, maxLength=max_length)

class CONFIG():
    update_w2v = True           # 是否在训练中更新w2v
    vocab_size = len(word2id) + 1          # 词汇量，与word2id中的词汇量一致
    n_class = 2                 # 分类数：分别为pos和neg
    embedding_dim = 50          # 词向量维度
    drop_keep_prob = 0.3        # dropout层，参数keep的比例
    kernel_num = 20            # 卷积层filter的数量
    kernel_size = [3, 5, 7]       # 卷积核的尺寸
    pretrained_embed = word2vec # 预训练的词嵌入模型
    hidden_size = 100           # 隐藏层神经元数
    num_layers = 2               # 隐藏层数

config = CONFIG()          # 配置模型参数

In [7]:
class TextCNN(nn.Module):
    def __init__(self, config):
        super(TextCNN, self).__init__()
        update_w2v = config.update_w2v
        vocab_size = config.vocab_size
        n_class = config.n_class
        embedding_dim = config.embedding_dim
        kernel_num = config.kernel_num
        kernel_size = config.kernel_size
        drop_keep_prob = config.drop_keep_prob
        pretrained_embed = config.pretrained_embed
        
        # 使用预训练的词向量
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        #! embedding is a table, which is used to lookup the embedding vector of a word
        self.embedding.weight.requires_grad = update_w2v
        #! if update_w2v is True, the embedding.weight will be updated during training
        self.embedding.weight.data.copy_(torch.from_numpy(pretrained_embed))
        #! import the pretrained embedding vector as embedding.weight


        # 卷积层
        self.conv1 = nn.Conv2d(1, kernel_num, (kernel_size[0], embedding_dim))
        #! conv1 is a convolutional layer, which takes input layer 1 ( we often take picture for 3 layer, but here is the sentence, we take 1 layer)
        #! kernel_num is the number of filter, which is the number of output channel, here we have 20 filter
        #! every filter bite a matrix of size (3, 50)
        self.conv2 = nn.Conv2d(1, kernel_num, (kernel_size[1], embedding_dim))
        self.conv3 = nn.Conv2d(1, kernel_num, (kernel_size[2], embedding_dim))
        # Dropout
        self.dropout = nn.Dropout(drop_keep_prob)
        # 全连接层
        self.fc = nn.Linear(len(kernel_size) * kernel_num, n_class)

    @staticmethod
    def conv_and_pool(x, conv):
        x = conv(x)
        x = F.relu(x.squeeze(3))
        x = F.max_pool1d(x, x.size(2)).squeeze(2)
        return x
        
    def forward(self, x):
        x = self.embedding(x.to(torch.int64)).unsqueeze(1)
        x1 = self.conv_and_pool(x, self.conv1)  
        x2 = self.conv_and_pool(x, self.conv2)  
        x3 = self.conv_and_pool(x, self.conv3)
        x = F.log_softmax(self.fc(self.dropout(torch.cat((x1, x2, x3), 1))), dim=1)
        return x

In [20]:
class RNN(nn.Module):

    def __init__(self, config):
       
        super(RNN, self).__init__()

        vocab_size = config.vocab_size
        embedding_dim = config.embedding_dim
        pretrained_embed = config.pretrained_embed
        self.num_layers = config.num_layers
        self.hidden_size = config.hidden_size
        self.n_class = config.n_class
        update_w2v = config.update_w2v

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        #! embedding is a table, which is used to lookup the embedding vector of a word
        self.embedding.weight.requires_grad = update_w2v
        #! if update_w2v is True, the embedding.weight will be updated during training
        self.embedding.weight.data.copy_(torch.from_numpy(pretrained_embed))
        #! import the pretrained embedding vector as embedding.weight

        # (seq_len, batch, embed_dim) -> (seq_len, batch, 2 * hidden_size)
        self.encoder = nn.LSTM(input_size=embedding_dim,
                               hidden_size=self.hidden_size,
                               num_layers=self.num_layers,
                               bidirectional=True)
        # (batch, hidden_size * 2) -> (batch, num_classes)
        self.decoder = nn.Linear(2 * self.hidden_size, 64)
        self.fc1 = nn.Linear(64, self.n_class)
        # (batch, num_classes) -> (batch, num_classes)

    def forward(self, inputs):
        
        
        inputs = inputs.to(torch.int64)
        x = self.embedding(inputs)             # (batch_size, seq_len, embed_dim)
        x = x.permute(1, 0, 2)        # (seq_len, batch_size, embed_dim)
        _, (h_n, _) = self.encoder(x)          # (num_layers * 2, batch, hidden_size)
        # view h_n as (num_layers, num_directions, batch, hidden_size)
        h_n = h_n.view(self.num_layers, 2, -1, self.hidden_size)
        h_n = torch.cat((h_n[-1, 0], h_n[-1, 1]), dim=-1) # (batch, hidden_size * 2)
        outputs = self.decoder(h_n)                     # (batch, num_classes)
        outputs = self.fc1(outputs)
       
        return outputs

In [21]:
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
train_dataset = TensorDataset(torch.from_numpy(train_contents).type(torch.float), 
                              torch.from_numpy(train_labels).type(torch.long))
train_dataloader = DataLoader(dataset = train_dataset, batch_size = BATCH_SIZE, 
                              shuffle = True, num_workers = 2)

val_dataset = TensorDataset(torch.from_numpy(val_contents).type(torch.float), 
                              torch.from_numpy(val_labels).type(torch.long))
val_dataloader = DataLoader(dataset = val_dataset, batch_size = BATCH_SIZE, 
                              shuffle = True, num_workers = 2)

test_dataset = TensorDataset(torch.from_numpy(test_contents).type(torch.float), 
                              torch.from_numpy(test_labels).type(torch.long))
test_dataloader = DataLoader(dataset = test_dataset, batch_size = BATCH_SIZE, 
                              shuffle = True, num_workers = 2) 


In [22]:
model = RNN(config).to(DEVICE)
    
optimizer = torch.optim.Adam(model.parameters(), lr = learning_rate)

# 设置损失函数
criterion = nn.CrossEntropyLoss()
scheduler = StepLR(optimizer, step_size=5)


def train(dataloader,epoch):
    # 定义训练过程
    model.train()
    train_loss,train_acc = 0.0,0.0
    count, correct = 0,0
    for _, (x, y) in enumerate(dataloader):
        x, y = x.to(DEVICE), y.to(DEVICE)
        optimizer.zero_grad()
        output = model(x)
        loss = criterion(output, y)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
        correct += (output.argmax(1) == y).float().sum().item()
        count += len(x)
                            
    train_loss *= BATCH_SIZE
    train_loss /= len(dataloader.dataset)
    train_acc = correct/count
    scheduler.step()
    
    return train_loss,train_acc


def validation(dataloader):
    model.eval()
    # 验证过程
    val_loss,val_acc = 0.0,0.0
    count, correct = 0,0
    for _, (x, y) in enumerate(dataloader):
        x, y = x.to(DEVICE), y.to(DEVICE)
        output = model(x)
        loss = criterion(output, y)
        val_loss += loss.item()
        correct += (output.argmax(1) == y).float().sum().item()
        count += len(x)
    
    val_loss *= BATCH_SIZE
    val_loss /= len(dataloader.dataset)
    val_acc = correct/count
    
    return val_loss,val_acc


for epoch in tqdm(range(1,EPOCHS+1)):
    tr_loss,tr_acc = train(train_dataloader, epoch)
    val_loss,val_acc = validation(test_dataloader)
    wandb.log({"train_loss": tr_loss, "train_acc": tr_acc, "val_loss": val_loss, "val_acc": val_acc, "learning_rate": scheduler.get_lr()})
    print(f"for epoch {epoch}, train_loss: {tr_loss:.4f}, train_acc: {tr_acc:.4f}, val_loss: {val_loss:.4f}, val_acc: {val_acc:.4f}, learning_rate: {scheduler.get_lr()}")

 10%|█         | 1/10 [00:05<00:47,  5.24s/it]

for epoch 1, train_loss: 0.4942, train_acc: 0.7586, val_loss: 0.4490, val_acc: 0.8157, learning_rate: [0.001]


 20%|██        | 2/10 [00:10<00:42,  5.25s/it]

for epoch 2, train_loss: 0.3264, train_acc: 0.8686, val_loss: 0.3720, val_acc: 0.8537, learning_rate: [0.001]


 30%|███       | 3/10 [00:15<00:36,  5.23s/it]

for epoch 3, train_loss: 0.2184, train_acc: 0.9201, val_loss: 0.5894, val_acc: 0.7859, learning_rate: [0.001]


 40%|████      | 4/10 [00:20<00:31,  5.20s/it]

for epoch 4, train_loss: 0.1374, train_acc: 0.9535, val_loss: 0.4824, val_acc: 0.8130, learning_rate: [0.001]


 50%|█████     | 5/10 [00:26<00:26,  5.23s/it]

for epoch 5, train_loss: 0.0852, train_acc: 0.9733, val_loss: 0.5187, val_acc: 0.8320, learning_rate: [1e-05]


 60%|██████    | 6/10 [00:31<00:20,  5.19s/it]

for epoch 6, train_loss: 0.0346, train_acc: 0.9912, val_loss: 0.7483, val_acc: 0.8401, learning_rate: [0.0001]


 70%|███████   | 7/10 [00:36<00:15,  5.13s/it]

for epoch 7, train_loss: 0.0301, train_acc: 0.9928, val_loss: 0.7204, val_acc: 0.8238, learning_rate: [0.0001]


 80%|████████  | 8/10 [00:41<00:10,  5.21s/it]

for epoch 8, train_loss: 0.0263, train_acc: 0.9939, val_loss: 0.7484, val_acc: 0.8320, learning_rate: [0.0001]


 90%|█████████ | 9/10 [00:46<00:05,  5.21s/it]

for epoch 9, train_loss: 0.0228, train_acc: 0.9948, val_loss: 0.7413, val_acc: 0.8374, learning_rate: [0.0001]


100%|██████████| 10/10 [00:52<00:00,  5.21s/it]

for epoch 10, train_loss: 0.0206, train_acc: 0.9952, val_loss: 0.8012, val_acc: 0.8211, learning_rate: [1.0000000000000002e-06]



