<a href="https://colab.research.google.com/github/tomonari-masada/course-nlp2020/blob/master/07_document_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 07 文書分類
* fastTextは高速だが、PyTorchとは独立のライブラリなので、データ準備作業が煩雑になる。
* そこで、PyTorchの一部であるtorchtextモジュールを使ってデータを準備する。
* ネットワークへの入力は、単語埋め込みの、単語の出現順どおりの列にする。
 * そして前向き計算のなかではじめて単語埋め込みの平均をとることにする。
* また、単語埋め込みの学習も、ネットワークの学習と同時におこなうことにする。
* 参考資料
 * https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb
 * https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/4%20-%20Convolutional%20Sentiment%20Analysis.ipynb

## 07-01 torchtextを使って　IMDbデータを読み込む
* https://torchtext.readthedocs.io/en/latest/datasets.html

### 実験の再現性確保のための設定
* torch.backends.cudnn.deterministicをTrueにするのは、こうしないと、GPU上での計算が毎回同じ値を与えないため。

In [1]:
import torch
from torchtext import data

SEED = 123

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

### フィールドのインスタンスを作る
* TEXTフィールドは、テキストの前処理に使う。
 * batch_firstをTrueに設定するのがポイント。
 * tokenizerは、指定しないとstring.splitになり、高速だが、tokenizationとしては雑。
* LABELフィールドは、ラベルの前処理に使う。

In [2]:
TEXT = data.Field(tokenize="spacy", batch_first=True)
LABEL = data.LabelField()

### IMDbデータセットを前処理しつつ読み込む
* TEXTフィールドでspaCyのtokenizationを使うように設定したので、少し時間がかかる。

In [3]:
from torchtext import datasets

train_dev_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:09<00:00, 9.17MB/s]


### 最初の文書を見てみる
* `vars`関数は、モジュール、クラス、インスタンス、あるいはそれ以外の`__dict__`属性を持つオブジェクトの、`__dict__`属性を辞書として返す組み込み関数。

In [4]:
print(vars(train_dev_data.examples[0]))

{'text': ['I', 'really', 'liked', 'this', 'movie', ',', 'and', 'went', 'back', 'to', 'see', 'it', 'two', 'times', 'more', 'within', 'a', 'week.<br', '/><br', '/>Ms', '.', 'Detmers', 'nailed', 'the', 'performance', '-', 'she', 'was', 'like', 'a', 'hungry', 'cat', 'on', 'the', 'prowl', ',', 'toying', 'with', 'her', 'prey', '.', 'She', 'lashes', 'out', 'in', 'rage', 'and', 'lust', ',', 'taking', 'a', '"', 'too', 'young', '"', 'lover', ',', 'and', 'crashing', 'hundreds', 'of', 'her', 'terrorist', 'fiancé', "'s", 'mother', "'s", 'pieces', 'of', 'fine', 'china', 'to', 'the', 'floor', '.', '<', 'br', '/><br', '/>The', 'film', 'was', 'full', 'of', 'beautiful', 'touches', '.', 'The', 'Maserati', ',', 'the', 'wonderful', 'wardrobe', ',', 'the', 'flower', 'boxes', 'along', 'the', 'rooftops', '.', 'I', 'particularly', 'enjoyed', 'the', 'ancient', 'Greek', 'class', 'and', 'the', 'recitation', 'of', "'", "Antigone'.<br", '/><br', '/>It', 'had', 'a', 'feeling', 'of', "'", 'Story', 'of', 'O', "'", '-'

In [5]:
print(vars(train_dev_data.examples[0])['text'])

['I', 'really', 'liked', 'this', 'movie', ',', 'and', 'went', 'back', 'to', 'see', 'it', 'two', 'times', 'more', 'within', 'a', 'week.<br', '/><br', '/>Ms', '.', 'Detmers', 'nailed', 'the', 'performance', '-', 'she', 'was', 'like', 'a', 'hungry', 'cat', 'on', 'the', 'prowl', ',', 'toying', 'with', 'her', 'prey', '.', 'She', 'lashes', 'out', 'in', 'rage', 'and', 'lust', ',', 'taking', 'a', '"', 'too', 'young', '"', 'lover', ',', 'and', 'crashing', 'hundreds', 'of', 'her', 'terrorist', 'fiancé', "'s", 'mother', "'s", 'pieces', 'of', 'fine', 'china', 'to', 'the', 'floor', '.', '<', 'br', '/><br', '/>The', 'film', 'was', 'full', 'of', 'beautiful', 'touches', '.', 'The', 'Maserati', ',', 'the', 'wonderful', 'wardrobe', ',', 'the', 'flower', 'boxes', 'along', 'the', 'rooftops', '.', 'I', 'particularly', 'enjoyed', 'the', 'ancient', 'Greek', 'class', 'and', 'the', 'recitation', 'of', "'", "Antigone'.<br", '/><br', '/>It', 'had', 'a', 'feeling', 'of', "'", 'Story', 'of', 'O', "'", '-', 'that',

In [6]:
print(vars(train_dev_data.examples[0])['label'])

pos


### テストセット以外の部分を訓練データと検証データに分ける

In [7]:
import random

train_data, dev_data = train_dev_data.split(split_ratio=0.8, random_state = random.seed(SEED))

In [8]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of development examples: {len(dev_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 20000
Number of development examples: 5000
Number of testing examples: 25000


### データセットのラベルを作る
* TEXTラベルのほうは、最大語彙サイズを指定する。

In [9]:
MAX_VOCAB_SIZE = 25000

TEXT.build_vocab(train_data, max_size=MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

In [10]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")

Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2


### 出現頻度順で上位２０単語を見てみる

In [11]:
print(TEXT.vocab.freqs.most_common(20))

[('the', 232028), (',', 219642), ('.', 189279), ('and', 125246), ('a', 124915), ('of', 115025), ('to', 106919), ('is', 87460), ('in', 70029), ('I', 62000), ('it', 61196), ('that', 56281), ('"', 50475), ("'s", 49438), ('this', 48574), ('-', 42352), ('/><br', 40924), ('was', 39912), ('as', 34619), ('with', 34266)]


　### 単語ID順に最初の１０単語を見てみる
* IDのうち、0と1は、未知語とパディング用の単語という特殊な単語に割り振られている。

In [12]:
print(TEXT.vocab.itos[:10])

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is']


### ラベルのIDを確認する

In [13]:
print(LABEL.vocab.stoi)

defaultdict(<function _default_unk_index at 0x7f2ba33312f0>, {'neg': 0, 'pos': 1})


### ミニバッチを取り出すためのiteratorを作る

In [14]:
BATCH_SIZE = 100

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, dev_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, dev_data, test_data),
    batch_size=BATCH_SIZE,
    device=device)

### 試しにテストセットのiteratorを回してミニバッチをすべて取得して個数を数えてみる

In [15]:
i = 0
for batch in test_iterator:
  i += 1
  continue
print(f'We have {i} mini-batches in test set.')
print(batch.text[0])
print(' '.join([TEXT.vocab.itos[i] for i in batch.text[0]]))

We have 250 mini-batches in test set.
tensor([152,  15,   6,  ...,   7, 324,   4], device='cuda:0')
There 's a sign on The Lost Highway that <unk> /><br <unk> SPOILERS <unk> /><br <unk> you already knew that , did n't <unk> /><br />Since there 's a great deal of people that apparently did not get the point of this movie , I 'd like to contribute my interpretation of why the plot makes perfect sense . As others have pointed out , one single viewing of this movie is not sufficient . If you have the DVD of <unk> , you can " cheat " by looking at David Lynch 's " Top 10 <unk> to <unk> <unk> " ( but only upon second or third viewing , please . ) ; ) < br /><br />First of all , Mulholland Drive is downright brilliant . A masterpiece . This is the kind of movie that refuse to leave your head . Not often are the comments on the DVDs very accurate , but <unk> 's " It gets inside your head and stays there " really hit the mark.<br /><br />David Lynch deserves praise for creating a movie that not

### 最後のミニバッチのshapeを確認してみる
* 単語埋め込みの次元数と、最後のミニバッチで最も長い文書の文書長が表示されるはず。

In [16]:
batch.text.shape

torch.Size([100, 2640])

## 07-02 MLPによる文書分類の準備

### 定数の設定

In [17]:
INPUT_DIM = len(TEXT.vocab)
NUM_CLASS = len(LABEL.vocab)
EMBED_DIM = 100
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

TRAIN_SIZE = len(train_data)
DEV_SIZE = len(dev_data)
TEST_SIZE = len(test_data)

### torch.nnをnnとしてインポート

In [18]:
import torch.nn as nn

### モデルを定義するまえに単語埋め込みを理解する

In [19]:
embed = nn.Embedding(INPUT_DIM, EMBED_DIM, padding_idx=PAD_IDX)

In [24]:
# padding_idxのトークンはゼロベクトルになる
print(embed(torch.tensor([[0,2,1],[2,3,4]])))

tensor([[[ 0.3374, -0.1778, -0.3035, -0.5880,  0.3486,  0.6603, -0.2196,
          -0.3792,  0.7671, -1.1925,  0.6984, -1.4097,  0.1794,  1.8951,
           0.4954,  0.2692, -0.0770, -1.0205, -0.1690,  0.9178,  1.5810,
           1.3010,  1.2753, -0.2010,  0.4965, -1.5723,  0.9666, -1.1481,
          -1.1589,  0.3255, -0.6315, -2.8400, -1.3250,  0.1784, -2.1338,
           1.0524, -0.3885, -0.9343, -0.4991, -1.0867,  0.8805,  1.5542,
           0.6266, -0.1755,  0.0983, -0.0935,  0.2662, -0.5850,  0.8768,
           1.6221, -1.4779,  1.1331, -1.2203,  1.3139,  1.0533,  0.1388,
           2.2473, -0.8036, -0.2808,  0.7697, -0.6596, -0.7979,  0.1838,
           0.2293,  0.5146,  0.9938, -0.2587, -1.0826, -0.0444,  1.6236,
          -2.3229,  1.0878,  0.6716,  0.6933, -0.9487, -0.0765, -0.1526,
           0.1167,  0.4403, -1.4465,  0.2553, -0.5496,  1.0042,  0.8272,
          -0.3948,  0.4892, -0.2168, -1.7472, -1.6025, -1.0764,  0.9031,
          -0.7218, -0.5951, -0.7112,  0.6230, -1.37

### モデルの定義
* 基本的にMLPだが、入り口に単語埋め込み層が挿入されている。

In [25]:
import torch.nn.functional as F

class EmbedTextSentiment(nn.Module):
  def __init__(self, embed_dim, num_class, vocab_size, padding_idx):
    super(EmbedTextSentiment, self).__init__()
    self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=padding_idx)
    self.fc1 = nn.Linear(embed_dim, 500)
    self.fc2 = nn.Linear(500, 100)
    self.fc3 = nn.Linear(100, num_class)
    self.init_weights()

  def init_weights(self):
    initrange = 0.5
    self.fc1.weight.data.uniform_(-initrange, initrange)
    self.fc1.bias.data.zero_()
    self.fc2.weight.data.uniform_(-initrange, initrange)
    self.fc2.bias.data.zero_()
    self.fc3.weight.data.uniform_(-initrange, initrange)
    self.fc3.bias.data.zero_()

  def forward(self, text):
    x = self.embed(text)
    x = x.mean(1) 
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = self.fc3(x)
    return x

### モデルを作る
* インスタンスをGPUに移動させている点に注意。

In [26]:
model = EmbedTextSentiment(EMBED_DIM, NUM_CLASS, INPUT_DIM, padding_idx=PAD_IDX).to(device)

### 損失関数とoptimizerとschedulerを作る
* 損失関数をGPUに移動させている点に注意。

In [28]:
criterion = torch.nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)

### 訓練用の関数
* 前回とほぼ同じ。
* データのフォーマットが変わっただけ。

In [29]:
def train_func():

  # Train the model
  train_loss = 0
  train_acc = 0
  for batch in train_iterator:
    optimizer.zero_grad()
    text, cls = batch.text, batch.label
    text, cls = text.to(device), cls.to(device)
    output = model(text)
    loss = criterion(output, cls)
    train_loss += loss.item()
    loss.backward()
    optimizer.step()
    train_acc += (output.argmax(1) == cls).sum().item()

  # Adjust the learning rate
  scheduler.step()

  return train_loss / TRAIN_SIZE, train_acc / TRAIN_SIZE

### 評価用の関数

In [30]:
def test(data_iterator):
  loss = 0
  acc = 0
  for batch in data_iterator:
    text, cls = batch.text, batch.label
    text, cls = text.to(device), cls.to(device)
    with torch.no_grad():
      output = model(text)
      loss = criterion(output, cls)
      loss += loss.item()
      acc += (output.argmax(1) == cls).sum().item()

  return loss, acc

## 07-03 分類器の訓練と評価

In [31]:
import time

N_EPOCHS = 20
for epoch in range(N_EPOCHS):

  start_time = time.time()
  train_loss, train_acc = train_func()
  dev_loss, dev_acc = test(dev_iterator)
  dev_loss, dev_acc = dev_loss / DEV_SIZE, dev_acc / DEV_SIZE

  secs = int(time.time() - start_time)
  mins = secs / 60
  secs = secs % 60

  print('Epoch: %d' %(epoch + 1), " | time in %d minutes, %d seconds" %(mins, secs))
  print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
  print(f'\tLoss: {dev_loss:.4f}(dev)\t|\tAcc: {dev_acc * 100:.1f}%(dev)')

Epoch: 1  | time in 0 minutes, 5 seconds
	Loss: 0.0043(train)	|	Acc: 79.9%(train)
	Loss: 0.0001(dev)	|	Acc: 88.5%(dev)
Epoch: 2  | time in 0 minutes, 5 seconds
	Loss: 0.0016(train)	|	Acc: 93.9%(train)
	Loss: 0.0001(dev)	|	Acc: 87.4%(dev)
Epoch: 3  | time in 0 minutes, 5 seconds
	Loss: 0.0008(train)	|	Acc: 97.4%(train)
	Loss: 0.0002(dev)	|	Acc: 87.0%(dev)
Epoch: 4  | time in 0 minutes, 5 seconds
	Loss: 0.0004(train)	|	Acc: 98.6%(train)
	Loss: 0.0003(dev)	|	Acc: 86.7%(dev)
Epoch: 5  | time in 0 minutes, 5 seconds
	Loss: 0.0002(train)	|	Acc: 99.4%(train)
	Loss: 0.0003(dev)	|	Acc: 86.8%(dev)
Epoch: 6  | time in 0 minutes, 5 seconds
	Loss: 0.0001(train)	|	Acc: 99.7%(train)
	Loss: 0.0004(dev)	|	Acc: 87.1%(dev)
Epoch: 7  | time in 0 minutes, 5 seconds
	Loss: 0.0001(train)	|	Acc: 99.8%(train)
	Loss: 0.0004(dev)	|	Acc: 86.9%(dev)
Epoch: 8  | time in 0 minutes, 5 seconds
	Loss: 0.0000(train)	|	Acc: 99.9%(train)
	Loss: 0.0005(dev)	|	Acc: 87.0%(dev)
Epoch: 9  | time in 0 minutes, 5 seconds
	Loss: 

KeyboardInterrupt: ignored

 ### テストセット上で評価

In [32]:
print('Checking the results of test dataset...')
test_loss, test_acc = test(test_iterator)
test_loss, test_acc = test_loss / TEST_SIZE, test_acc / TEST_SIZE
print(f'\tLoss: {test_loss:.4f}(test)\t|\tAcc: {test_acc * 100:.1f}%(test)')

Checking the results of test dataset...
	Loss: 0.0001(test)	|	Acc: 85.1%(test)
