<a href="https://colab.research.google.com/github/tomonari-masada/course2021-nlp/blob/main/07_PyTorch_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PyTorch入門 (3)
* IMDbデータセットの感情分析をPyTorchを使っておこなう。
 * 前にscikit-learnを使って同じ作業をおこなった。
* 参考資料
 * PyTorch公式のチュートリアル https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
* データは以前作ったIMDbの文書埋め込みを使う。
* sentiment analysisのもっと高度な手法については、下記リンク先を参照。
 * https://github.com/bentrevett/pytorch-sentiment-analysis

## 1. fastTextによる文書埋め込みをMLPの入力として使うための準備
* MLP(多層パーセプトロン)の学習ぐらいは、空気を吸ったり吐いたりするぐらい自然にできるようにしておこう。

### 準備

* （あらかじめランタイムのタイプをGPUに設定しておこう。）

In [2]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

np.random.seed(123)
torch.manual_seed(123)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [3]:
!nvidia-smi

Fri Nov 26 14:45:38 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   71C    P8    32W / 149W |      3MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [4]:
device

device(type='cuda')

### 単語埋め込みデータファイルの読み込み
* データファイルの準備の仕方
 * Blackboardで「自然言語処理特論」へ行く。
 * 「教材/課題/テスト」→「data」と順にクリックする。
 * 「IMDB dataset」のところに見えている4つの「.npy」ファイルをダウンロードする。
 * ダウンロードした4つのファイルを、自分のGoogle Driveの適当な場所にアップロードする。
 * 次のセルで、その置き場所を指定する。

In [5]:
PATH = '/content/drive/MyDrive/2021Courses/NLP/'

texts = dict()
labels = dict()
for tag in ['train', 'test']:
  with open(f'{PATH}{tag}.npy', 'rb') as f:
    texts[tag] = np.load(f)
  with open(f'{PATH}{tag}_labels.npy', 'rb') as f:
    labels[tag] = np.load(f)

In [6]:
for tag in ['train', 'test']:
  print(texts[tag].shape)

(25000, 300)
(25000, 300)


In [7]:
for tag in ['train', 'test']:
  texts[tag], labels[tag] = torch.tensor(texts[tag]), torch.tensor(labels[tag])

## 2. 学習のための準備

### Dataset

In [8]:
from torch.utils.data import Dataset, random_split

class MyDataset(Dataset):
  def __init__(self, X, y):
    self.X = X
    self.y = y

  def __len__(self):
    return self.X.shape[0]

  def __getitem__(self, index):
    return self.X[index], self.y[index]

train_valid = MyDataset(texts['train'], labels['train'])
test = MyDataset(texts['test'], labels['test'])

valid_size = len(train_valid) // 5
train_size = len(train_valid) - valid_size
train, valid = random_split(train_valid,
                            [train_size, valid_size],
                            generator=torch.Generator().manual_seed(42)
                            )

### DataLoader

In [9]:
from torch.utils.data import DataLoader

# ミニバッチのサイズ
BATCH_SIZE = 100

# 訓練データだけシャッフル
train_loader = DataLoader(train, batch_size=BATCH_SIZE, shuffle=True)
valid_loader = DataLoader(valid, batch_size=BATCH_SIZE)
test_loader = DataLoader(test, batch_size=BATCH_SIZE)

## 3. モデルの定義と学習の準備

### モデルの定義

In [10]:
class TextSentiment(nn.Module):
  def __init__(self, embed_dim, num_class):
    super(TextSentiment, self).__init__()
    self.fc1 = nn.Linear(embed_dim, 500)
    self.fc2 = nn.Linear(500, 100)
    self.fc3 = nn.Linear(100, num_class)

  def forward(self, x):
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = self.fc3(x)
    return x

In [11]:
EMBED_DIM = texts['train'].size(1)
NUM_CLASS = len(np.unique(labels['train']))
model = TextSentiment(EMBED_DIM, NUM_CLASS).to(device)

In [12]:
print(EMBED_DIM, NUM_CLASS)

300 2


### 損失関数と最適化アルゴリズム

* 損失関数を除いて、以下の設定はいい加減なので、自分で調整してみよう。
* schedulerの使い方は、調べてみよう。

In [13]:
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
#scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20,50], gamma=0.1)

## 4. 分類器の訓練と評価

### 評価を行なう関数
* 正解率で評価する関数を定義しておく。

In [14]:
def eval(model, criterion, loader):
  model.eval()
  
  total_loss = 0.0
  total_acc = 0.0
  total_size = 0
  for input, target in loader:
    with torch.no_grad():
      input, target = input.to(device), target.to(device)
      output = model(input)
      loss = criterion(output, target)
      total_loss += loss.item() * len(target)
      total_acc += (output.argmax(1) == target).float().sum().item()
      total_size += len(target)

  return total_loss / total_size, total_acc / total_size

### 訓練を行なう関数

In [15]:
def train(model, criterion, optimizer, train_loader, valid_loader, n_epochs=100):
  model.train()

  # training loop
  for epoch in range(n_epochs):

    train_loss = 0.0
    for input, target in train_loader:
      output = model(input.to(device))
      loss = criterion(output, target.to(device))
      train_loss += loss.item() * len(target) # 表示用の集計

      loss.backward()
      optimizer.step()
      optimizer.zero_grad()

    valid_loss, valid_acc = eval(model, criterion, valid_loader)

    # logging
    print(f'epoch {epoch + 1:6d} |',
          f'train loss {train_loss / train_size:8.4f} |',
          f'valid loss {valid_loss:8.4f} | valid acc {valid_acc:8.3f}')

### 訓練と評価の実施

In [16]:
train(model, criterion, optimizer, train_loader, valid_loader, 100)

epoch      1 | train loss   0.4947 | valid loss   0.3888 | valid acc    0.834
epoch      2 | train loss   0.3683 | valid loss   0.3711 | valid acc    0.845
epoch      3 | train loss   0.3539 | valid loss   0.3770 | valid acc    0.840
epoch      4 | train loss   0.3495 | valid loss   0.3607 | valid acc    0.847
epoch      5 | train loss   0.3452 | valid loss   0.3557 | valid acc    0.850
epoch      6 | train loss   0.3391 | valid loss   0.3552 | valid acc    0.853
epoch      7 | train loss   0.3374 | valid loss   0.3764 | valid acc    0.829
epoch      8 | train loss   0.3378 | valid loss   0.3641 | valid acc    0.839
epoch      9 | train loss   0.3330 | valid loss   0.3588 | valid acc    0.845
epoch     10 | train loss   0.3315 | valid loss   0.3484 | valid acc    0.854
epoch     11 | train loss   0.3328 | valid loss   0.3469 | valid acc    0.855
epoch     12 | train loss   0.3288 | valid loss   0.3457 | valid acc    0.854
epoch     13 | train loss   0.3280 | valid loss   0.3469 | valid



---



---



## 5. 単語埋め込みもパラメータになっているモデル
* fasttextの単語埋め込みを使うのをやめる。
* 単語埋め込みも同時に学習することにする。

### IMDbデータセットをテキストデータとして読み直す

In [17]:
!pip install ml_datasets

Collecting ml_datasets
  Downloading ml_datasets-0.2.0-py3-none-any.whl (15 kB)
Installing collected packages: ml-datasets
Successfully installed ml-datasets-0.2.0


In [18]:
from ml_datasets import imdb
train_data, test_data = imdb()

84131840it [00:49, 1716148.48it/s]                             


Untaring file...


In [19]:
train_texts, train_labels = zip(*train_data)
test_texts, test_labels = zip(*test_data)

In [20]:
train_texts[0]

'Another turgid action/adventure flick from the Quinn Martin Productions factory. Roy Thinnes plays undercover agent Diamond Head (Mr. Head, to you), working for his G-Man handler "Aunt Mary", looking for "Tree", who\'s on a mission to...well, just watch the movie. \n\n\n\nThis one deserved and got the full MST3K sendup. As the boys and various reviewers have pointed out, the movie "Fargo" had more Hawaiian locations than this film. Apparently shot on a puny budget, this movie highlights Hawaii\'s broken-down dive shops, gas stations, and cheapo hotels. Zulu -- later to star as Kono in Hawaii-Five-O -- appears as Thinnes\' lumpy, inept sidekick, while France Nguyen models the Jenny Craig diet gone horribly wrong. Others sharing the flickering screen include a drunken Richard Harris knockoff, a George Takai imitator, a not-so-smart hit-man with sprayed-on Sansabelt slacks, and the villain "Tree", sporting a veddy British accent. You can pretty much figure out the plot halfway through th

In [21]:
train_labels[0]

'neg'

### ラベルを0/1の整数に変換

In [22]:
unique_labels = np.unique(train_labels)
label_id = {}
for i, label in enumerate(unique_labels):
  label_id[label] = i

In [23]:
train_labels = [label_id[label] for label in train_labels]
test_labels = [label_id[label] for label in test_labels]

In [24]:
print(train_labels[:10])

[0, 1, 0, 1, 1, 0, 1, 0, 1, 0]


### sklearnのCountVectorizerを使ってトークン化
* `torchtext`を使う方法は後日説明。

* 語彙集合の構築

In [25]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=10, max_df=0.2)
vectorizer.fit(train_texts)

CountVectorizer(max_df=0.2, min_df=10)

In [26]:
vocab = vectorizer.get_feature_names_out()
print([vocab[i] for i in range(10)])

['00', '000', '007', '01', '02', '05', '06', '07', '08', '10']


* ある単語が語彙集合に入っているかどうかは、下のようにしてチェックできる。

In [27]:
'to' in vectorizer.vocabulary_

False

* preprocessorとtokenizerの作成

In [28]:
preprocessor = vectorizer.build_preprocessor()
tokenizer = vectorizer.build_tokenizer()

* トークン列をインデックス列に変換する関数
 * 単語のインデックスを、パディング用の単語と、未知語との２つ分、後ろにずらす。
 * テキストの長さを`max_len`に揃えるという作業も同時に行なう。

In [29]:
PAD_IDX = 0
UNK_IDX = 1
VOCAB_SIZE = len(vocab) + 2

def encode(text, max_len=1000, padding_idx=PAD_IDX, unknown_idx=UNK_IDX):
  idx_seq = []
  for token in tokenizer(preprocessor(text)):
    if token in vectorizer.vocabulary_:
      idx_seq.append(vectorizer.vocabulary_[token])
    else:
      idx_seq.append(unknown_idx)
  if len(idx_seq) < max_len:
    idx_seq += [padding_idx] * (max_len - len(idx_seq))
  else:
    idx_seq = idx_seq[:max_len]
  return idx_seq

In [30]:
print(VOCAB_SIZE)

18419


In [31]:
print(encode(train_texts[0]))

[892, 17055, 400, 506, 6487, 1, 1, 13015, 10169, 12714, 6093, 14025, 1, 12270, 17181, 581, 4682, 7682, 10832, 7682, 1, 1, 18217, 1, 1, 10051, 1, 1285, 10180, 9795, 1, 16930, 1, 1, 10604, 1, 1, 1, 1, 1, 1, 1, 1, 4571, 1, 7212, 1, 6815, 10836, 1, 1, 1, 2114, 1, 17535, 13739, 1, 12328, 1, 1, 1, 6168, 1, 1, 7661, 9753, 1, 1, 1, 963, 14754, 1, 1, 2304, 1, 1, 7857, 7660, 2245, 5073, 4947, 14745, 6934, 15617, 1, 2866, 8061, 18415, 9420, 1, 15578, 1, 1, 1, 7660, 6435, 973, 1, 1, 1, 8476, 14824, 18016, 6698, 1, 10655, 1, 8955, 3890, 4711, 7177, 8035, 18283, 11569, 14642, 1, 6489, 14351, 8394, 5200, 13790, 7614, 9270, 7013, 1, 1, 1, 1, 15094, 7910, 10051, 1, 15481, 1, 1, 1, 1, 1, 17661, 16930, 15466, 1, 2226, 327, 1, 1, 12626, 1, 6348, 1, 1, 1, 7505, 16591, 1, 11473, 3949, 1, 13433, 1, 5633, 1, 7049, 10329, 1, 1, 217, 1, 1, 1, 17944, 18005, 1, 1, 1, 1, 1, 1, 1, 1, 1, 11415, 2666, 10918, 8119, 1, 16500, 18282, 1, 6542, 1, 1, 1, 1, 9281, 18059, 16987, 11240, 4682, 7682, 1, 4769, 1, 1, 1, 1, 1, 165

* バッチ単位でトークン列をインデックス列に変換する関数

In [32]:
def batch_encode(texts):
  sequences = []
  for text in texts:
    sequences.append(encode(text))
  return torch.Tensor(sequences)

### Dataset

In [33]:
from torch.utils.data import Dataset, random_split

class MyTextDataset(Dataset):
  def __init__(self, texts, labels):
    self.texts = texts
    self.labels = labels

  def __len__(self):
    return len(self.texts)

  def __getitem__(self, index):
    return self.texts[index], self.labels[index]

train_valid_set = MyTextDataset(train_texts, train_labels)
test_set = MyTextDataset(test_texts, test_labels)

valid_size = len(train_valid) // 5
train_size = len(train_valid) - valid_size
train_set, valid_set = random_split(train_valid_set,
                                    [train_size, valid_size],
                                    generator=torch.Generator().manual_seed(42)
                                    )

### DataLoader
* ミニバッチのテキストをインデックス列へ変換するcollation用の関数も定義する。

In [34]:
def collate_fn(batch):
  batch_texts, batch_labels = zip(*batch)
  return batch_encode(batch_texts).type(torch.LongTensor), torch.LongTensor(batch_labels)

In [35]:
from torch.utils.data import DataLoader

# ミニバッチのサイズ
BATCH_SIZE = 100

# 訓練データだけシャッフル
train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, collate_fn=collate_fn, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=BATCH_SIZE, collate_fn=collate_fn)
test_loader = DataLoader(test_set, batch_size=BATCH_SIZE, collate_fn=collate_fn)

### モデルの定義
* `nn.Embedding`を使うところがポイント。

In [36]:
class EmbeddedTextSentiment(nn.Module):
  def __init__(self, embed_dim, num_class, vocab_size, padding_idx=PAD_IDX):
    super(EmbeddedTextSentiment, self).__init__()
    self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=padding_idx)
    self.fc1 = nn.Linear(embed_dim, 500)
    self.fc2 = nn.Linear(500, 100)
    self.fc3 = nn.Linear(100, num_class)
    self.dropout = nn.Dropout()

  def forward(self, text):
    embedded = self.dropout(self.embed(text))
    x = embedded.mean(1)
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = self.fc3(x)
    return x

* モデルのインスタンスを作成

In [37]:
EMBED_DIM = 300
NUM_CLASS = len(np.unique(train_labels))
model = EmbeddedTextSentiment(EMBED_DIM, NUM_CLASS, VOCAB_SIZE, PAD_IDX).to(device)

### 損失関数と最適化アルゴリズム

In [38]:
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
#scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20,50], gamma=0.1)

### 学習の実行

In [39]:
train(model, criterion, optimizer, train_loader, valid_loader, 100)

epoch      1 | train loss   0.6892 | valid loss   0.6366 | valid acc    0.657
epoch      2 | train loss   0.4393 | valid loss   0.3637 | valid acc    0.845
epoch      3 | train loss   0.2783 | valid loss   0.3350 | valid acc    0.871
epoch      4 | train loss   0.2252 | valid loss   0.3508 | valid acc    0.866
epoch      5 | train loss   0.1800 | valid loss   0.3043 | valid acc    0.887
epoch      6 | train loss   0.1506 | valid loss   0.3250 | valid acc    0.889
epoch      7 | train loss   0.1158 | valid loss   0.3619 | valid acc    0.885
epoch      8 | train loss   0.0920 | valid loss   0.3705 | valid acc    0.884
epoch      9 | train loss   0.0797 | valid loss   0.3981 | valid acc    0.879
epoch     10 | train loss   0.0624 | valid loss   0.4890 | valid acc    0.879
epoch     11 | train loss   0.0468 | valid loss   0.5029 | valid acc    0.876
epoch     12 | train loss   0.0350 | valid loss   0.5341 | valid acc    0.872
epoch     13 | train loss   0.0265 | valid loss   0.5801 | valid

KeyboardInterrupt: ignored

In [40]:
loss, acc = eval(model, criterion, train_loader)
print(f'train loss {loss:8.4f} | train acc {acc:8.3f}')

train loss   0.0035 | train acc    1.000


# 課題
* モデルやoptimizerやschedulerを変更して、validation setを使ってチューニングしよう。
* 最後に、自分で選択した設定を使って、test set上で評価しよう。

In [18]:
loss, acc = eval(model, criterion, test_loader)
print(f'test loss {loss:8.4f} | test acc {acc:8.3f}')

test loss   0.7982 | test acc    0.834
