# Task 2: Text classification with deep learning
Get familiar with PyTorch and use it to reimplement Task 1, building CNN- and RNN-based text classifiers.

## 1. References

    ·https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html?highlight=text%20classification
    ·Convolutional Neural Networks for Sentence Classification https://arxiv.org/abs/1408.5882
    ·https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/
## 2. Ways to initialize the word embeddings

    ·Random initialization of the embeddings

    ·Initialization from pre-trained GloVe embeddings https://nlp.stanford.edu/projects/glove/
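
The two initialization options above can be combined: words found in the GloVe file take their pre-trained vector, and all other words fall back to a random one. The following is a minimal sketch of that idea, not the notebook's own code. The helper name `build_embedding_matrix`, the uniform range of ±0.25 and the toy three-dimensional vectors are assumptions made for illustration; real GloVe files use the same "word v1 v2 ... vd" text layout with 50 to 300 dimensions.

```python
import io
import random

def build_embedding_matrix(vocab, glove_lines, dim, seed=2019):
    """Return one vector per vocabulary word: GloVe if available, random otherwise."""
    rng = random.Random(seed)
    pretrained = {}
    for line in glove_lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        pretrained[parts[0]] = [float(v) for v in parts[1:]]

    matrix, hits = [], 0
    for word in vocab:
        if word in pretrained:
            matrix.append(pretrained[word])
            hits += 1
        else:
            # Assumption: small uniform noise for out-of-vocabulary words
            matrix.append([rng.uniform(-0.25, 0.25) for _ in range(dim)])
    return matrix, hits

# Toy stand-in for a GloVe file (made-up numbers, 3 dimensions instead of 300)
toy_glove = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")
vocab = ["<unk>", "good", "bad", "movie"]
matrix, hits = build_embedding_matrix(vocab, toy_glove, dim=3)
print(hits, len(matrix), len(matrix[0]))
```

In PyTorch, a matrix like this could be converted with `torch.tensor(matrix)` and passed to `nn.Embedding.from_pretrained`, with `freeze=True` for static embeddings or `freeze=False` to fine-tune them.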

## 3. Key concepts:

    ·Feature extraction with CNN/RNN
    ·Word embeddings
    ·Dropout
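
Of the concepts listed, dropout is the easiest to show in a few lines. The sketch below illustrates "inverted" dropout, the variant `nn.Dropout` uses: during training each value is zeroed with probability `p` and the survivors are scaled by `1/(1-p)`, while at evaluation time the input passes through unchanged. The function name and the toy input are assumptions for illustration only.

```python
import random

def dropout(values, p, training, rng):
    """Inverted dropout: zero each value with probability p, rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in values]

rng = random.Random(2019)
features = [1.0] * 10
train_out = dropout(features, p=0.5, training=True, rng=rng)
eval_out = dropout(features, p=0.5, training=False, rng=rng)
print(train_out)  # a mix of 0.0 and 2.0
print(eval_out)   # unchanged input
```

This is why calling `model.train()` and `model.eval()` matters: the same layer behaves differently in the two modes.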
    
## 4. Time: two weeks

-----------------------------
## References:
### Official documentation: https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
### PyTorch sentiment analysis without TorchText https://blog.csdn.net/captain_f_/article/details/89331133
### Using [TorchText] https://www.jianshu.com/p/e5adb235399e
### PyTorch sentiment analysis https://blog.csdn.net/weixin_34351321/article/details/94699262
### A community solution to NLPBeginner https://github.com/htfhxx/nlp-beginner_solution

## TODO:
### 1. EDA visualization
### 2. Plot the model's convergence curves with matplotlib

In [1]:
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchtext
import torch.optim as optim
import numpy as np, pandas as pd
from torchtext import data,datasets  # legacy API: lives in torchtext.legacy in 0.9-0.11 and was removed in 0.12
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from tqdm import tqdm

batch_size=32
embedding_dim =300
dropout_p=0.5
filters_num=100
use_cuda = torch.cuda.is_available()  # only move the model to the GPU when one is present
learning_rate = 0.001
epochs=5
seed = 2019

np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

In [2]:
train = pd.read_csv('./sentiment-analysis-on-movie-reviews/train.tsv', sep='\t')
test = pd.read_csv('./sentiment-analysis-on-movie-reviews/test.tsv', sep='\t')

# Shuffle, split off a validation set, and save the train/validation/test files
idx =np.arange(train.shape[0])
np.random.shuffle(idx)

train_size=int(len(idx) * 0.8)

train.iloc[idx[:train_size], :].to_csv('./train.csv',index=None)
train.iloc[idx[train_size:], :].to_csv('./valid.csv', index=None)
test.to_csv('./test.csv', index=None)

In [3]:
import spacy
spacy_en = spacy.load('en_core_web_sm')

def tokenizer(text): # create a tokenizer function
    return [tok.text for tok in spacy_en.tokenizer(text)]
# tokenizer = lambda x: x.split()

TEXT = data.Field(sequential=True, tokenize=tokenizer, batch_first=True, lower=True)
LABEL = data.Field(sequential=False, batch_first=True)

train_data = data.TabularDataset(path='./train.csv',format='csv', skip_header=True,
        fields = [('PhraseId', None),('SentenceId', None),('Phrase', TEXT),('Sentiment', LABEL)])
valid_data = data.TabularDataset(path='./valid.csv',format='csv', skip_header=True,
        fields = [('PhraseId', None),('SentenceId', None),('Phrase', TEXT),('Sentiment', LABEL)])
test_data = data.TabularDataset(path='./test.csv',format='csv', skip_header=True,
        fields = [('PhraseId', None),('SentenceId', None),('Phrase', TEXT)])

In [4]:
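# Build the vocabularies, mapping each token to an integer index
# TEXT.vocab.vectors holds the word vectors, but only if build_vocab is given pre-trained vectors (vectors=...); here it stays None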
TEXT.build_vocab(train_data)
LABEL.build_vocab(train_data)

In [5]:
# Build the iterators
train_iterator = data.BucketIterator(train_data, batch_size=batch_size, train=True, shuffle=True, device=device)

valid_iterator = data.Iterator(valid_data, batch_size=batch_size, train=False, sort=False, device=device)

test_iterator = data.Iterator(test_data, batch_size=batch_size, train=False, sort=False, device=device)
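
The idea behind a bucketing iterator is to batch phrases of similar length together so that less padding is needed. The sketch below is a simplified illustration of that idea in plain Python, not torchtext's implementation; `bucket_batches`, `padding_cost` and the `pool_factor` value are assumptions. In torchtext the grouping depends on a length-based `sort_key`, and the `BucketIterator` call above does not pass one, so it is worth checking whether bucketing is actually in effect there.

```python
import random

def bucket_batches(sequences, batch_size, pool_factor=100, seed=2019):
    """Shuffle, sort each large pool by length, cut into batches, then shuffle the batches."""
    rng = random.Random(seed)
    data = list(sequences)
    rng.shuffle(data)
    pool_size = batch_size * pool_factor
    batches = []
    for start in range(0, len(data), pool_size):
        pool = sorted(data[start:start + pool_size], key=len)
        batches.extend(pool[i:i + batch_size] for i in range(0, len(pool), batch_size))
    rng.shuffle(batches)  # keep the batch order random between epochs
    return batches

def padding_cost(batches):
    """Count the pad tokens needed to make every batch rectangular."""
    return sum(max(map(len, b)) * len(b) - sum(map(len, b)) for b in batches)

# Toy data: 256 "phrases" with random lengths between 1 and 40 tokens
rng = random.Random(0)
toy = [["tok"] * rng.randint(1, 40) for _ in range(256)]
bucketed = bucket_batches(toy, batch_size=32)
naive = [toy[i:i + 32] for i in range(0, len(toy), 32)]
print(padding_cost(bucketed), padding_cost(naive))
```

The first number printed (bucketed) should be no larger than the second (naive), because sorting by length keeps each batch's longest phrase close to its shortest.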

In [6]:
# Some parameter settings
embedding_choice= 'rand'  #  'glove'    'static'    'non-static'
num_embeddings = len(TEXT.vocab)

vocab_size=len(TEXT.vocab)
label_num=len(LABEL.vocab)
print(vocab_size,label_num)

15427 6
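
The label count printed above is 6 although the dataset has five sentiment classes. The most likely reason is that the `LABEL` field builds a vocabulary that also reserves an `<unk>` entry, so the model's outputs are vocabulary indices rather than the original labels. The sketch below imitates that behavior in plain Python under stated assumptions: `build_vocab` here is a simplified stand-in, not torchtext's implementation, and the label list is made up.

```python
from collections import Counter

def build_vocab(tokens, specials=("<unk>",)):
    """Index specials first, then tokens by descending frequency (ties alphabetical)."""
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

# Hypothetical label column: five sentiment classes stored as strings
labels = ["2", "2", "2", "3", "3", "1", "4", "0"]
stoi, itos = build_vocab(labels)
print(len(itos))  # 5 classes + "<unk>"
print(stoi)

# Predicted indices have to be mapped back through itos to recover the labels
predicted_indices = [1, 2]
print([itos[i] for i in predicted_indices])
```

With a vocabulary like this, index 1 stands for the most frequent label rather than for sentiment `1`, which is why predictions should be translated through `LABEL.vocab.itos` before they are written to a submission file.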


In [7]:
class CNN(nn.Module):
    def __init__(self):
        super(CNN,self).__init__()
        self.embedding_choice=embedding_choice
        
        if self.embedding_choice==  'rand':
            self.embedding=nn.Embedding(num_embeddings,embedding_dim)
        elif self.embedding_choice==  'glove':
            # from_pretrained is a classmethod; it needs TEXT.vocab.vectors to be populated,
            # i.e. TEXT.build_vocab must have been called with pre-trained vectors
            self.embedding = nn.Embedding.from_pretrained(TEXT.vocab.vectors, freeze=True)
            
            
        self.conv1 = nn.Conv2d(in_channels=1,out_channels=filters_num ,  # number of feature maps produced
                               kernel_size=(3, embedding_dim), padding=(2,0))
        
        self.conv2 = nn.Conv2d(in_channels=1,out_channels=filters_num ,  # number of feature maps produced
                               kernel_size=(4, embedding_dim), padding=(3,0))
        
        self.conv3 = nn.Conv2d(in_channels=1,out_channels=filters_num ,  # number of feature maps produced
                               kernel_size=(5, embedding_dim), padding=(4,0))
        
        self.dropout = nn.Dropout(dropout_p)
        
        self.fc = nn.Linear(filters_num * 3, label_num)
        
    def forward(self,x):      # (Batch_size, Length)
        x=self.embedding(x).unsqueeze(1)      # embedding: (Batch_size, Length, Dimension)
                                       # unsqueeze: (Batch_size, 1, Length, Dimension)
        
        x1 = F.relu(self.conv1(x)).squeeze(3)    # conv: (Batch_size, filters_num, Length + 2, 1)
                                          # squeeze: (Batch_size, filters_num, Length + 2)
        x1 = F.max_pool1d(x1, x1.size(2)).squeeze(2)  # pool: (Batch_size, filters_num, 1)
                                               # squeeze: (Batch_size, filters_num)
         
        x2 = F.relu(self.conv2(x)).squeeze(3)  
        x2 = F.max_pool1d(x2, x2.size(2)).squeeze(2)      
        
        x3 = F.relu(self.conv3(x)).squeeze(3)  
        x3 = F.max_pool1d(x3, x3.size(2)).squeeze(2)      
        
        x = torch.cat((x1, x2, x3), dim=1)  #(Batch_size, filters_num *3 )
        x = self.dropout(x)      #(Batch_size, filters_num *3 )
        out = self.fc(x)       #(Batch_size, label_num  )
        return out
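
The shape comments in `forward` follow from the standard convolution length formula. The sketch below works through that arithmetic for the three kernel heights and shows what max-over-time pooling does to a single feature map. The helper names and the sentence length of 7 are assumptions for illustration; dilation is taken to be 1, as in the layers above.

```python
def conv_output_length(length, kernel, padding, stride=1):
    """Output length of a convolution along one axis (dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1

def max_over_time(feature_map):
    """Collapse a variable-length feature map to its largest activation."""
    return max(feature_map)

sentence_length = 7  # hypothetical number of tokens
for kernel in (3, 4, 5):
    padding = kernel - 1  # same pairing as conv1/conv2/conv3 in the CNN class
    print(kernel, conv_output_length(sentence_length, kernel, padding))

# A one-token phrase still yields a valid output thanks to the padding
print(conv_output_length(1, kernel=5, padding=4))

print(max_over_time([0.0, 0.7, 0.2, 1.3, 0.1]))
```

With padding of `kernel - 1` on both sides, the output length is `Length + kernel - 1`, so even phrases shorter than the kernel can be convolved. Pooling then reduces every feature map to one number, which is what makes the concatenated vector a fixed `filters_num * 3` wide regardless of phrase length.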

In [8]:
# Build the model

model=CNN()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)  # create the Adam optimizer
criterion = nn.CrossEntropyLoss()   # loss function

model.to(device)  # works on both CPU and GPU
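
`nn.CrossEntropyLoss` takes the raw scores (logits) from the final linear layer, applies log-softmax internally and, by default, averages the loss over the batch, which is why the model has no softmax layer of its own. The sketch below reproduces that computation in plain Python for illustration; the function names and example logits are assumptions.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax of the target class, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def batch_cross_entropy(batch_logits, targets):
    """Mean loss over the batch."""
    losses = [cross_entropy(row, t) for row, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

uniform = [0.0] * 6
print(cross_entropy(uniform, 2))    # equals ln(6) for six equally likely classes

confident = [0.0, 0.0, 10.0, 0.0, 0.0, 0.0]
print(cross_entropy(confident, 2))  # close to zero

print(batch_cross_entropy([uniform, confident], [2, 2]))
```

An untrained six-way classifier that guesses uniformly would sit at a loss of ln(6), roughly 1.79, which gives a useful baseline when reading the per-epoch losses.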

In [9]:
best_accuracy=0
start_time=time.time()

def train(model, epoch):

    model.train()
    total_loss=0.0
    accuracy=0.0
    total_correct=0.0
    total_data_num = len(train_iterator.dataset)
    steps = 0.0
    for batch in tqdm(train_iterator):
        steps += 1
        optimizer.zero_grad()
        batch_label = batch.Sentiment
        out = model(batch.Phrase)#[batch_size, label_num]
        
        loss = criterion(out, batch_label)
        total_loss = total_loss + loss.item() 

        loss.backward()
        optimizer.step()        

        correct = (torch.max(out, dim=1)[1]  #get the indices
                   .view(batch_label.size()) == batch_label).sum()
        total_correct = total_correct + correct.item()


    print("Epoch %d_%.3f%%:  Training average Loss: %f, Total Time:%f"
    %(epoch, steps * train_iterator.batch_size*100/len(train_iterator.dataset),total_loss/steps, time.time()-start_time))

def valid(model, epoch):
    # Validate once after every epoch
    model.eval()
    total_loss=0.0
    total_correct=0.0
    total_data_num = len(valid_iterator.dataset)
    steps = 0.0 
    with torch.no_grad():  # no gradients are needed for evaluation
        for batch in tqdm(valid_iterator):
            steps+=1
            batch_label = batch.Sentiment
            out = model(batch.Phrase)
            loss = criterion(out, batch_label)
            total_loss = total_loss + loss.item()

            correct = (torch.max(out, dim=1)[1].view(batch_label.size()) == batch_label).sum()
            total_correct = total_correct + correct.item()

    print("Epoch %d :  Verification average Loss: %f, Verification accuracy: %f%%,Total Time:%f"
      %(epoch, total_loss/steps, total_correct*100/total_data_num,time.time()-start_time)) 
    global best_accuracy
    if best_accuracy < total_correct/total_data_num :
        best_accuracy = total_correct/total_data_num 
#         torch.save(model,'./epoch_%d_accuracy_%f'%(epoch,total_correct/total_data_num))
#         print('Model is saved in ./epoch_%d_accuracy_%f'%(epoch,total_correct/total_data_num))
        torch.save(model,'./cnn_best_model.pt')
        print('Model is saved in ./cnn_best_model.pt %d_accuracy_%f'%(epoch,total_correct/total_data_num))
    
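def test(model):
    model.eval()  # disable dropout at inference time
    result = torch.tensor([], dtype=torch.long, device=device)  # works on both CPU and GPU
    start_time=time.time()
    with torch.no_grad():
        for batch in tqdm(test_iterator):
            result = torch.cat((result, torch.max(model(batch.Phrase), dim=1)[1]), 0)
    print('Total time: %f' % (time.time()-start_time))
    return result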

In [10]:
import warnings
warnings.filterwarnings("ignore")

for epoch in range(epochs):
    train(model, epoch)
    valid(model, epoch)
    print('============================================================================')

100%|██████████| 3902/3902 [00:37<00:00, 102.87it/s]


Epoch 0_100.013%:  Training average Loss: 1.112462, Total Time:37.947817


100%|██████████| 976/976 [00:02<00:00, 423.73it/s]


Epoch 0 :  Verification average Loss: 0.936803, Verification accuracy: 61.485967%,Total Time:40.255161
Model is saved in ./best_model.pt 0_accuracy_0.614860


100%|██████████| 3902/3902 [00:36<00:00, 107.80it/s]


Epoch 1_100.013%:  Training average Loss: 0.921917, Total Time:76.504708


100%|██████████| 976/976 [00:02<00:00, 426.31it/s]


Epoch 1 :  Verification average Loss: 0.892346, Verification accuracy: 64.414328%,Total Time:78.799110
Model is saved in ./best_model.pt 1_accuracy_0.644143


100%|██████████| 3902/3902 [00:36<00:00, 107.82it/s]


Epoch 2_100.013%:  Training average Loss: 0.847097, Total Time:115.025298


100%|██████████| 976/976 [00:02<00:00, 424.18it/s]


Epoch 2 :  Verification average Loss: 0.871296, Verification accuracy: 64.452775%,Total Time:117.331182
Model is saved in ./best_model.pt 2_accuracy_0.644528


100%|██████████| 3902/3902 [00:36<00:00, 108.11it/s]


Epoch 3_100.013%:  Training average Loss: 0.802887, Total Time:153.458855


100%|██████████| 976/976 [00:02<00:00, 428.34it/s]


Epoch 3 :  Verification average Loss: 0.869656, Verification accuracy: 65.378700%,Total Time:155.740357
Model is saved in ./best_model.pt 3_accuracy_0.653787


100%|██████████| 3902/3902 [00:36<00:00, 107.99it/s]


Epoch 4_100.013%:  Training average Loss: 0.770498, Total Time:191.923795


100%|██████████| 976/976 [00:02<00:00, 431.86it/s]


Epoch 4 :  Verification average Loss: 0.867726, Verification accuracy: 65.981033%,Total Time:194.186745
Model is saved in ./best_model.pt 4_accuracy_0.659810
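
The second TODO item, plotting the convergence curves, can be approached by parsing the log lines printed above. The sketch below extracts the per-epoch losses with regular expressions and includes an optional plotting helper. The names `parse_log` and `plot_history` and the output path are assumptions; the two regular expressions match the print formats used in `train` and `valid`, and the sample log reuses the first two epochs recorded above.

```python
import re

TRAIN_RE = re.compile(r"Epoch (\d+)_[\d.]+%:\s+Training average Loss: ([\d.]+)")
VALID_RE = re.compile(
    r"Epoch (\d+) :\s+Verification average Loss: ([\d.]+), Verification accuracy: ([\d.]+)%"
)

def parse_log(text):
    """Collect per-epoch training loss, validation loss and validation accuracy."""
    history = {"train_loss": [], "valid_loss": [], "valid_acc": []}
    for line in text.splitlines():
        m = TRAIN_RE.search(line)
        if m:
            history["train_loss"].append(float(m.group(2)))
            continue
        m = VALID_RE.search(line)
        if m:
            history["valid_loss"].append(float(m.group(2)))
            history["valid_acc"].append(float(m.group(3)))
    return history

def plot_history(history, path="convergence.png"):
    """Draw both loss curves; matplotlib is imported lazily so parsing works without it."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt
    epochs = range(len(history["train_loss"]))
    plt.plot(epochs, history["train_loss"], label="train loss")
    plt.plot(epochs, history["valid_loss"], label="validation loss")
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.savefig(path)

log = """Epoch 0_100.013%:  Training average Loss: 1.112462, Total Time:37.947817
Epoch 0 :  Verification average Loss: 0.936803, Verification accuracy: 61.485967%,Total Time:40.255161
Epoch 1_100.013%:  Training average Loss: 0.921917, Total Time:76.504708
Epoch 1 :  Verification average Loss: 0.892346, Verification accuracy: 64.414328%,Total Time:78.799110
"""
history = parse_log(log)
print(history)
```

Calling `plot_history(history)` would write the curves to an image file. Returning the losses from `train` and `valid` directly would be cleaner than parsing printed text, but the parser also works on logs from runs that have already finished.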


In [11]:
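# Load the best model
best_model = torch.load('./cnn_best_model.pt')
result = test(best_model)
submission = pd.read_csv('./sentiment-analysis-on-movie-reviews/sampleSubmission.csv')
# The model predicts LABEL vocabulary indices (index 0 is '<unk>'); map them back to the original sentiment labels
submission['Sentiment'] = [LABEL.vocab.itos[i] for i in result.cpu().tolist()]
submission.to_csv('./CNN_submission.csv', index=None)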

100%|██████████| 2072/2072 [00:03<00:00, 539.37it/s]


Total time: %f 3.8445041179656982
