
# Recurrent Neural Network (RNN)

Usage:
- Document classifiers: detecting the sentiment of tweets or reviews, classifying news articles
- Sequence-to-sequence learning: tasks such as language translation, e.g. English to French
- Time-series forecasting: predicting a store's sales given detailed records of previous days' sales

<br><br>
**Vectorization:** text -> (word, character, n-gram) -> number

**Tokenization:** text -> token; splitting text into tokens (units of text)

**Mapping** (one-hot encoding or word embedding): token -> vector

<br>

**text->token->number**

## Tokenization
- text -> characters
- text -> words
- text -> n-gram

In [None]:
toy_story_review = "Just perfect. Script, character, animation....this manages to break free of the yoke of 'children's movie' to simply be one of the best movies of the 90's, full-stop."

### Text->Char

In [None]:
character = list(toy_story_review)
character

['J',
 'u',
 's',
 't',
 ' ',
 'p',
 'e',
 'r',
 'f',
 'e',
 'c',
 't',
 '.',
 ' ',
 'S',
 'c',
 'r',
 'i',
 'p',
 't',
 ',',
 ' ',
 'c',
 'h',
 'a',
 'r',
 'a',
 'c',
 't',
 'e',
 'r',
 ',',
 ' ',
 'a',
 'n',
 'i',
 'm',
 'a',
 't',
 'i',
 'o',
 'n',
 '.',
 '.',
 '.',
 '.',
 't',
 'h',
 'i',
 's',
 ' ',
 'm',
 'a',
 'n',
 'a',
 'g',
 'e',
 's',
 ' ',
 't',
 'o',
 ' ',
 'b',
 'r',
 'e',
 'a',
 'k',
 ' ',
 'f',
 'r',
 'e',
 'e',
 ' ',
 'o',
 'f',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'y',
 'o',
 'k',
 'e',
 ' ',
 'o',
 'f',
 ' ',
 "'",
 'c',
 'h',
 'i',
 'l',
 'd',
 'r',
 'e',
 'n',
 "'",
 's',
 ' ',
 'm',
 'o',
 'v',
 'i',
 'e',
 "'",
 ' ',
 't',
 'o',
 ' ',
 's',
 'i',
 'm',
 'p',
 'l',
 'y',
 ' ',
 'b',
 'e',
 ' ',
 'o',
 'n',
 'e',
 ' ',
 'o',
 'f',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'b',
 'e',
 's',
 't',
 ' ',
 'm',
 'o',
 'v',
 'i',
 'e',
 's',
 ' ',
 'o',
 'f',
 ' ',
 't',
 'h',
 'e',
 ' ',
 '9',
 '0',
 "'",
 's',
 ',',
 ' ',
 'f',
 'u',
 'l',
 'l',
 '-',
 's',
 't',
 'o',
 'p',
 '.']

### Text->Word

In [None]:
words = toy_story_review.split()  # split on whitespace (the default delimiter)
words

['Just',
 'perfect.',
 'Script,',
 'character,',
 'animation....this',
 'manages',
 'to',
 'break',
 'free',
 'of',
 'the',
 'yoke',
 'of',
 "'children's",
 "movie'",
 'to',
 'simply',
 'be',
 'one',
 'of',
 'the',
 'best',
 'movies',
 'of',
 'the',
 "90's,",
 'full-stop.']
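
Splitting on whitespace leaves punctuation attached to words (`'perfect.'`, `'animation....this'`). A minimal sketch of a regex-based alternative follows; the pattern and the name `simple_tokenize` are illustrative assumptions, not part of the original notebook.

```python
import re

review = "Just perfect. Script, character, animation....this manages to break free"

# A token is either a word (internal apostrophes/hyphens allowed)
# or a single punctuation character.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*|[^\sa-z0-9]")

def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())

print(simple_tokenize(review))
```

With this pattern `'animation....this'` becomes `'animation'`, four `'.'` tokens, and `'this'`, so the two words no longer share one vocabulary entry.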

### N-gram
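n: the number of consecutive words grouped together

N-grams lose the sequential nature of the text -> they are often used with shallow machine learning models (and rarely in deep learning).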

In [None]:
from nltk import ngrams
print(list(ngrams(toy_story_review.split(),2)))#bigram(n=2)

[('Just', 'perfect.'), ('perfect.', 'Script,'), ('Script,', 'character,'), ('character,', 'animation....this'), ('animation....this', 'manages'), ('manages', 'to'), ('to', 'break'), ('break', 'free'), ('free', 'of'), ('of', 'the'), ('the', 'yoke'), ('yoke', 'of'), ('of', "'children's"), ("'children's", "movie'"), ("movie'", 'to'), ('to', 'simply'), ('simply', 'be'), ('be', 'one'), ('one', 'of'), ('of', 'the'), ('the', 'best'), ('best', 'movies'), ('movies', 'of'), ('of', 'the'), ('the', "90's,"), ("90's,", 'full-stop.')]


In [None]:
print(list(ngrams(toy_story_review.split(),3)))#n=3

[('Just', 'perfect.', 'Script,'), ('perfect.', 'Script,', 'character,'), ('Script,', 'character,', 'animation....this'), ('character,', 'animation....this', 'manages'), ('animation....this', 'manages', 'to'), ('manages', 'to', 'break'), ('to', 'break', 'free'), ('break', 'free', 'of'), ('free', 'of', 'the'), ('of', 'the', 'yoke'), ('the', 'yoke', 'of'), ('yoke', 'of', "'children's"), ('of', "'children's", "movie'"), ("'children's", "movie'", 'to'), ("movie'", 'to', 'simply'), ('to', 'simply', 'be'), ('simply', 'be', 'one'), ('be', 'one', 'of'), ('one', 'of', 'the'), ('of', 'the', 'best'), ('the', 'best', 'movies'), ('best', 'movies', 'of'), ('movies', 'of', 'the'), ('of', 'the', "90's,"), ('the', "90's,", 'full-stop.')]
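
For reference, the same sliding-window grouping can be sketched in plain Python without `nltk`. The helper name `make_ngrams` is hypothetical and only meant to show the idea.

```python
def make_ngrams(tokens, n):
    """Return all n-grams (as tuples) over a list of tokens."""
    # zip over n shifted copies of the list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "Just perfect. Script, character,".split()
print(make_ngrams(tokens, 2))
print(make_ngrams(tokens, 3))
```

A text of `k` tokens yields `k - n + 1` n-grams, and none when `n` exceeds `k`.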


## Vectorization

- one-hot encoding
- word embedding

### One-Hot Encoding

In [None]:
sample_sentence = "An apple a day keeps doctor away said the doctor"

In [None]:
import numpy as np
class Dictionary(object):
  def __init__(self):
    self.word2index = {}  # maps each unique word to its index
    self.index2word = []  # unique words, in order of first appearance
    self.length = 0  # number of unique words

  def add_word(self, word):
    """Add a word if it is new and return its index (indices start at 1)."""
    if word not in self.word2index:
      self.index2word.append(word)
      self.word2index[word] = self.length + 1
      self.length += 1  # grow the vocabulary
    return self.word2index[word]

  def __len__(self):  # vocabulary size
    return len(self.index2word)

  def onehot_encoded(self, word):
    vec = np.zeros(self.length + 1)  # all zeros; slot 0 is unused
    vec[self.word2index[word]] = 1  # set only the word's own index to 1
    return vec

In [None]:
dic = Dictionary()
for tok in sample_sentence.split():
  dic.add_word(tok)

In [None]:
dic.word2index

{'An': 1,
 'a': 3,
 'apple': 2,
 'away': 7,
 'day': 4,
 'doctor': 6,
 'keeps': 5,
 'said': 8,
 'the': 9}

In [None]:
# one-hot vectors; indices start at 1, so slot 0 is never set
for word in dic.index2word:
  print("word:",word)
  print(dic.onehot_encoded(word))

word: An
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
word: apple
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
word: a
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
word: day
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
word: keeps
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
word: doctor
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
word: away
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
word: said
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
word: the
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]


Trying it on the toy_story_review data

In [None]:
dic = Dictionary()

for tok in toy_story_review.split():
  dic.add_word(tok)

print(dic.word2index)

{'Just': 1, 'perfect.': 2, 'Script,': 3, 'character,': 4, 'animation....this': 5, 'manages': 6, 'to': 7, 'break': 8, 'free': 9, 'of': 10, 'the': 11, 'yoke': 12, "'children's": 13, "movie'": 14, 'simply': 15, 'be': 16, 'one': 17, 'best': 18, 'movies': 19, "90's,": 20, 'full-stop.': 21}


In [None]:
for word in dic.index2word:
  print("word:",word)
  print(dic.onehot_encoded(word))

word: Just
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word: perfect.
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word: Script,
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word: character,
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word: animation....this
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word: manages
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word: to
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word: break
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word: free
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word: of
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word: the
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word: yoke
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word: 'children's
[0. 0. 0.

Problems with one-hot encoding:
- The data is too sparse
- The vector length grows quickly as the number of unique words increases
- It cannot express relationships between words

**-> Rarely used in deep learning**
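
The last problem can be checked directly: the dot product of any two distinct one-hot vectors is zero, so every pair of words looks equally unrelated. This is a small sketch with a hypothetical three-word vocabulary.

```python
def one_hot(index, size):
    """Return a one-hot list of length `size` with 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

vocab = {"apple": 0, "doctor": 1, "day": 2}  # toy vocabulary (assumption)
apple = one_hot(vocab["apple"], len(vocab))
doctor = one_hot(vocab["doctor"], len(vocab))

print(dot(apple, doctor))  # distinct words -> 0.0
print(dot(apple, apple))   # same word -> 1.0
```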

### Word Embedding
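- Provides a dense representation filled with real numbers
- The vector dimension is a hyperparameter
- Trained so that semantically closer words get more similar representations
- When there is too little data, use pretrained word embeddings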
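
Conceptually, an embedding layer is a lookup table with one dense row per vocabulary index. The sketch below uses random values in place of learned weights and toy sizes chosen only for illustration. It shows that selecting a row gives the same result as multiplying a one-hot vector by the table.

```python
import random

random.seed(0)
vocab_size, emb_dim = 5, 3  # toy sizes (assumption)

# One dense row of real numbers per vocabulary index.
# In a real model these values are learned; here they are random.
table = [[random.uniform(-1, 1) for _ in range(emb_dim)]
         for _ in range(vocab_size)]

def lookup(index):
    """Embedding lookup: select the row for this index."""
    return table[index]

def one_hot_times_table(index):
    """Same result, computed as one-hot vector x embedding table."""
    one_hot = [1.0 if i == index else 0.0 for i in range(vocab_size)]
    return [sum(one_hot[i] * table[i][d] for i in range(vocab_size))
            for d in range(emb_dim)]

print(lookup(2))
print(one_hot_times_table(2))
```

The lookup avoids building the sparse vector at all, which is why embedding layers take integer indices as input.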


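### Learning word embeddings by building a sentiment classifier
1. Download the IMDb data and tokenize the text
2. Build the vocabulary
3. Generate batches of vectors
4. Create a network model with an embedding layer
5. Train the model

#### 1. Download the IMDb data and tokenize the text <1>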

In [1]:
import torchtext

Tokenizing with torchtext.data

In [2]:
from torchtext.legacy import data
# Field: defines how a column is tokenized and preprocessed
text = data.Field(lower=True, batch_first = True, fix_length = 20)  # the review text (lowercased; padded or truncated to 20 tokens)
label = data.Field(sequential = False)  # the label
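
As a rough plain-Python sketch of what `lower=True` and `fix_length` amount to: each review is lowercased, split, and then cut or padded to exactly `fix_length` tokens. The helper name `preprocess` is hypothetical, and padding at the end is an assumption about the default behaviour.

```python
PAD_TOKEN = "<pad>"

def preprocess(text, fix_length=20):
    """Lowercase, split on whitespace, then truncate or pad to fix_length tokens."""
    tokens = text.lower().split()[:fix_length]
    # Pad short sequences so every example has the same length.
    return tokens + [PAD_TOKEN] * (fix_length - len(tokens))

print(preprocess("Just perfect. Script, character", fix_length=6))
print(preprocess("one two three four five six seven", fix_length=6))
```

With `fix_length = 20` most IMDb reviews are truncated heavily, so the model only sees the first 20 tokens of each review.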

Loading and tokenizing the dataset with torchtext.datasets

In [3]:
import torchtext.legacy.datasets

In [4]:
from torchtext.legacy import datasets
train_data, test_data = datasets.IMDB.splits(text,label)

downloading aclImdb_v1.tar.gz


100%|██████████| 84.1M/84.1M [00:03<00:00, 26.2MB/s]


In [5]:
train,test = train_data, test_data
print("train.fields:")
print(train.fields)

train.fields:
{'text': <torchtext.legacy.data.field.Field object at 0x7fe69d5661d0>, 'label': <torchtext.legacy.data.field.Field object at 0x7fe69d566290>}


Attributes of a training example

In [6]:
train[0]

<torchtext.legacy.data.example.Example at 0x7fe697fb2f90>

In [7]:
#vars():returns the __dict__ attribute of the given object
print(vars(train[0]))

{'text': ['i', 'have', 'to', 'hand', 'it', 'to', 'the', 'creative', 'team', 'behind', 'these', '"american', 'pie"', 'movies.', '"direct', 'to', 'dvd"', 'typically', 'is', 'synonymous', 'with', 'cheap,', 'incompetent', 'film-making.', 'yet', 'last', 'year', 'i', 'was', 'pleasantly', 'surprised', 'when', 'i', 'found', 'myself', 'thoroughly', 'enjoying', 'the', 'dvd', 'sequel', '"the', 'naked', 'mile".', 'the', 'filmmakers', 'took', 'advantage', 'of', 'the', 'opportunity', 'to', 'deliver', 'a', 'raunchy,', 'yet', 'funny', 'little', 'film.', 'this', 'year', 'they', 'offer', 'up', 'the', 'followup,', '"beta', 'house".', 'this', 'is', 'the', 'honest', 'truth,', '"beta', 'house"', 'makes', 'the', 'first', 'few', '"american', 'pie"', 'movies', 'look', 'like', '"the', 'little', 'mermaid".<br', '/><br', '/>this', 'is', 'no', 'holds', 'barred,', 'tasteless,', 'laugh-out', 'loud', 'fun.', 'sure,', 'the', 'story', 'is', 'a', 'bit', 'thin,', 'but', "that's", 'the', 'beauty', 'of', 'the', 'whole', 't

In [8]:
vars(train[0])["text"]

['i',
 'have',
 'to',
 'hand',
 'it',
 'to',
 'the',
 'creative',
 'team',
 'behind',
 'these',
 '"american',
 'pie"',
 'movies.',
 '"direct',
 'to',
 'dvd"',
 'typically',
 'is',
 'synonymous',
 'with',
 'cheap,',
 'incompetent',
 'film-making.',
 'yet',
 'last',
 'year',
 'i',
 'was',
 'pleasantly',
 'surprised',
 'when',
 'i',
 'found',
 'myself',
 'thoroughly',
 'enjoying',
 'the',
 'dvd',
 'sequel',
 '"the',
 'naked',
 'mile".',
 'the',
 'filmmakers',
 'took',
 'advantage',
 'of',
 'the',
 'opportunity',
 'to',
 'deliver',
 'a',
 'raunchy,',
 'yet',
 'funny',
 'little',
 'film.',
 'this',
 'year',
 'they',
 'offer',
 'up',
 'the',
 'followup,',
 '"beta',
 'house".',
 'this',
 'is',
 'the',
 'honest',
 'truth,',
 '"beta',
 'house"',
 'makes',
 'the',
 'first',
 'few',
 '"american',
 'pie"',
 'movies',
 'look',
 'like',
 '"the',
 'little',
 'mermaid".<br',
 '/><br',
 '/>this',
 'is',
 'no',
 'holds',
 'barred,',
 'tasteless,',
 'laugh-out',
 'loud',
 'fun.',
 'sure,',
 'the',
 'stor

In [9]:
vars(train[0])["label"]#label

'pos'

#### 2. Build the vocabulary <2>

In [10]:
# pass the train object needed to build the vocabulary
# use pretrained weights
# 6B: GloVe vectors trained on Wikipedia 2014 + Gigaword 5 (6 billion tokens)
# dimension: 300; vocabulary limited to 10000 words; words appearing fewer than 10 times are dropped
text.build_vocab(train, vectors = torchtext.vocab.GloVe(name="6B", # Wikipedia 2014 + Gigaword 5
                              dim=300), max_size= 10000, min_freq = 10)
label.build_vocab(train)

.vector_cache/glove.6B.zip: 862MB [02:40, 5.37MB/s]                           
100%|█████████▉| 399999/400000 [00:51<00:00, 7746.06it/s]


In [11]:
# frequency of each word
text.vocab.freqs

Counter({'i': 70480,
         'have': 27344,
         'to': 133967,
         'hand': 651,
         'it': 65505,
         'the': 322198,
         'creative': 299,
         'team': 579,
         'behind': 1131,
         'these': 5233,
         '"american': 71,
         'pie"': 10,
         'movies.': 922,
         '"direct': 2,
         'dvd"': 3,
         'typically': 111,
         'is': 104171,
         'synonymous': 13,
         'with': 42729,
         'cheap,': 93,
         'incompetent': 75,
         'film-making.': 44,
         'yet': 2157,
         'last': 2699,
         'year': 1406,
         'was': 47024,
         'pleasantly': 124,
         'surprised': 695,
         'when': 13609,
         'found': 2494,
         'myself': 855,
         'thoroughly': 334,
         'enjoying': 158,
         'dvd': 1616,
         'sequel': 489,
         '"the': 2714,
         'naked': 344,
         'mile".': 2,
         'filmmakers': 476,
         'took': 1091,
         'advantage': 126,
       

In [12]:
text.vocab.vectors

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0466,  0.2132, -0.0074,  ...,  0.0091, -0.2099,  0.0539],
        ...,
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.7724, -0.1800,  0.2072,  ...,  0.6736,  0.2263, -0.2919],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]])

In [13]:
#dict
for idx, key in enumerate(text.vocab.stoi.keys()):#strings to numerical identifiers.
  if idx>9:
    break
  print(key,":",text.vocab.stoi[key])

<unk> : 0
<pad> : 1
the : 2
a : 3
and : 4
of : 5
to : 6
is : 7
in : 8
i : 9


In [14]:
#list
text.vocab.itos[:10]

['<unk>', '<pad>', 'the', 'a', 'and', 'of', 'to', 'is', 'in', 'i']

In [15]:
label.vocab.itos#0:unknown, 1:neg, 2:pos

['<unk>', 'neg', 'pos']
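
The vocabulary-building step can be approximated in plain Python. This is a simplified sketch, not torchtext's implementation: it counts tokens, drops words below `min_freq`, keeps at most `max_size` words, and puts the special tokens first. The function name `build_vocab` and the toy documents are illustrative.

```python
from collections import Counter

def build_vocab(token_lists, max_size=None, min_freq=1,
                specials=("<unk>", "<pad>")):
    """Return (itos, stoi, freqs) for an iterable of token lists."""
    freqs = Counter(tok for tokens in token_lists for tok in tokens)
    # Most frequent first; ties broken alphabetically.
    ordered = sorted(freqs.items(), key=lambda item: (-item[1], item[0]))
    words = [word for word, count in ordered if count >= min_freq]
    if max_size is not None:
        words = words[:max_size]
    itos = list(specials) + words          # index -> string
    stoi = {word: index for index, word in enumerate(itos)}  # string -> index
    return itos, stoi, freqs

docs = [["the", "movie", "was", "good"],
        ["the", "plot", "was", "thin"],
        ["the", "end"]]
itos, stoi, freqs = build_vocab(docs, max_size=2, min_freq=2)
print(itos)
print(stoi["the"], freqs["the"])
```

The special tokens are added on top of `max_size`, which is consistent with a vocabulary of 10002 entries for `max_size = 10000`.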

#### 3. Generate batches of vectors <3>
Use BucketIterator: it builds batches from all the texts and replaces each word with its index number.

In [54]:
train_iter, test_iter = data.BucketIterator.splits((train, test), batch_size = 128,
                                                   device = None,  # None: tensors are created on the CPU in torchtext.legacy; pass a torch.device to use a GPU
                                                   shuffle = True)

In [43]:
batch = next(iter(train_iter))

In [44]:
batch.text

tensor([[   0,    0,    0,  ...,   10, 1309, 1069],
        [  12,    7,  314,  ...,    0,   91,   30],
        [ 133,    9,  386,  ...,   86, 1004,   12],
        ...,
        [ 192,   12,   62,  ...,   69,   66,  338],
        [ 675, 8031, 4539,  ...,  820,   11, 1295],
        [   0,    7,    8,  ...,    7,   30,    5]])

In [45]:
len(batch.text)#batch size

128

In [46]:
len(batch.text[0])#fixed length = 20

20

In [47]:
batch.label#0:unknown, 1:neg, 2:pos

tensor([1, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 2,
        2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1,
        1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 2, 2,
        2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 1, 2, 1, 2, 2,
        2, 2, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1, 1, 1, 2, 2, 1, 2,
        1, 2, 1, 1, 2, 2, 1, 2])
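
The index replacement itself can be sketched as a dictionary lookup with a fallback to the `<unk>` index for out-of-vocabulary words, which is why zeros appear in `batch.text`. The helper name `numericalize` and the example tokens are assumptions; the index values follow the first `stoi` entries shown earlier.

```python
def numericalize(batch_tokens, stoi, unk_index=0):
    """Replace every token with its vocabulary index (unk_index if unseen)."""
    return [[stoi.get(tok, unk_index) for tok in tokens]
            for tokens in batch_tokens]

stoi = {"<unk>": 0, "<pad>": 1, "the": 2, "a": 3, "and": 4}
batch = [["the", "cat", "and", "a"],
         ["a", "dog", "<pad>", "<pad>"]]
print(numericalize(batch, stoi))
```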

#### 4. Create a network model with an embedding layer <4>
Train a model that creates word embeddings and predicts the sentiment of each review

In [48]:
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F

class EmbeddingNetwork(nn.Module):
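  def __init__(self, emb_size, hidden_size1, hidden_size2 = 400):  # sentiment classifier
    super().__init__()
    # emb_size: vocabulary size, hidden_size1: embedding dimension.
    # Note: padding_idx=0 zeroes the row for index 0, which is <unk> in this vocabulary (<pad> is index 1).
    self.embedding = nn.Embedding(emb_size, hidden_size1, padding_idx=0)
    self.fc = nn.Linear(hidden_size2, 3)  # hidden_size2 = fix_length * hidden_size1; 3 labels
  def forward(self, x):
    # embedding output: batch size x fix_length x embedding dimension
    embeds = self.embedding(x).view(x.size(0), -1)  # flatten each sample to one vector, keeping the batch dimension
    out = self.fc(embeds)
    return F.log_softmax(out, dim = -1)  # output: batch size x 3 (log-probabilities)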
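
The flattening step ties the input width of the linear layer to `fix_length` times the embedding dimension. The helper below (hypothetical, plain Python) just traces the shapes through the three stages, which can help when picking `hidden_size2`.

```python
def shapes(batch_size, fix_length, emb_dim, num_classes=3):
    """Trace tensor shapes through embedding -> flatten -> linear."""
    embedded = (batch_size, fix_length, emb_dim)     # output of nn.Embedding
    flattened = (batch_size, fix_length * emb_dim)   # after .view(batch_size, -1)
    logits = (batch_size, num_classes)               # output of nn.Linear
    return embedded, flattened, logits

# batch of 128 reviews, 20 tokens each, 20-dimensional embeddings
print(shapes(128, 20, 20))
```

For 20 tokens and a 20-dimensional embedding the flattened width is 400, which matches the default `hidden_size2`.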

#### 5. Train the model <5>

In [64]:
import torch

def fit(optimizer, epoch, model, data_loader, phase='training'):
  """Run one epoch in the given phase and return (mean loss, accuracy in %)."""
  training = phase == "training"
  if training:
    model.train()
  else:
    model.eval()

  device = next(model.parameters()).device  # run batches on the model's device
  running_loss = 0.0
  running_correct = 0

  for batch_idx, batch in enumerate(data_loader):
    text, target = batch.text.to(device), batch.label.to(device)

    if training:
      optimizer.zero_grad()

    # Track gradients only while training
    with torch.set_grad_enabled(training):
      output = model(text)
      loss = F.nll_loss(output, target)

    # Sum (not mean) of per-sample losses, so the epoch mean can be taken later
    running_loss += F.nll_loss(output, target, reduction='sum').item()
    predictions = output.argmax(dim=1)
    running_correct += predictions.eq(target).sum().item()

    if training:
      loss.backward()
      optimizer.step()

  # Epoch-level metrics over the whole dataset
  loss = running_loss / len(data_loader.dataset)
  accuracy = 100. * running_correct / len(data_loader.dataset)
  print(f"batch_idx: {batch_idx} | {phase} loss is {loss:5.2} and {phase} accuracy is {running_correct}/{len(data_loader.dataset)}{accuracy:10.4}")
  return loss, accuracy
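
The loss used in `fit` combines the model's log-softmax output with the negative log-likelihood of the target class, which together amount to cross-entropy. The worked sketch below shows this for a single sample in plain Python; the scores are made-up values, not model outputs.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]

logits = [0.0, 1.0, 2.0]  # hypothetical scores for <unk>, neg, pos
log_probs = log_softmax(logits)
loss = nll_loss(log_probs, target=2)  # true label: pos (index 2)
print(log_probs)
print(loss)
```

The loss shrinks toward zero as the score of the correct class grows relative to the others.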

In [65]:
model = EmbeddingNetwork(20*128*10,20)#emb_size, hidden_size1
try:
  model.cuda()
except:
  print("GPU not available")

In [66]:
import torch.optim as optim
optimizer = optim.Adam(model.parameters(),lr=0.01)

In [67]:
train_losses , train_accuracy = [],[] 
validation_losses , validation_accuracy = [],[]

train_iter.repeat = False  # stop after one pass over the data per epoch
test_iter.repeat = False  # stop after one pass over the data per epoch
for epoch in range(1,10):
    print("================epoch{}/9===============".format(epoch))
    epoch_loss, epoch_accuracy = fit(optimizer, epoch,model,train_iter,phase='training')
    validation_epoch_loss, validation_epoch_accuracy = fit(optimizer, epoch,model,test_iter,phase='validation')
    train_losses.append(epoch_loss) 
    train_accuracy.append(epoch_accuracy) 
    validation_losses.append(validation_epoch_loss) 
    validation_accuracy.append(validation_epoch_accuracy)





batch_idx: 195 | training loss is  0.77 and training accuracy is 13659/25000     54.64
batch_idx: 195 | validation loss is  0.71 and validation accuracy is 15022/25000     60.09
batch_idx: 195 | training loss is  0.62 and training accuracy is 16892/25000     67.57
batch_idx: 195 | validation loss is  0.68 and validation accuracy is 16203/25000     64.81
batch_idx: 195 | training loss is  0.51 and training accuracy is 18860/25000     75.44
batch_idx: 195 | validation loss is  0.72 and validation accuracy is 16413/25000     65.65
batch_idx: 195 | training loss is   0.4 and training accuracy is 20566/25000     82.26
batch_idx: 195 | validation loss is  0.84 and validation accuracy is 16429/25000     65.72
batch_idx: 195 | training loss is  0.28 and training accuracy is 22077/25000     88.31
batch_idx: 195 | validation loss is   1.0 and validation accuracy is 16151/25000      64.6
batch_idx: 195 | training loss is  0.17 and training accuracy is 23460/25000     93.84
batch_idx: 195 | valida

# Using pretrained word embeddings
1. Download the embeddings
2. Load the embeddings into the model
3. Freeze the embedding layer weights

## 1. Download the embeddings <1>

In [68]:
from torchtext.vocab import GloVe

In [70]:
text.build_vocab(train, vectors = GloVe(name = '6B',dim = 300), max_size = 10000, min_freq = 10)
label.build_vocab(train,)

In [71]:
text.vocab.vectors  # access the embeddings

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0466,  0.2132, -0.0074,  ...,  0.0091, -0.2099,  0.0539],
        ...,
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.7724, -0.1800,  0.2072,  ...,  0.6736,  0.2263, -0.2919],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]])

In [72]:
len(text.vocab.vectors)#vocab_size x dimensions

10002

In [73]:
len(text.vocab.vectors[0])  # embedding dimension = 300

300

## 2. Load the embeddings into the model <2>
Store the embeddings in the weights of the embedding layer

In [75]:
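# access the embedding layer's weights and assign the pretrained vectors
# Note: this affects only the model that exists when this cell runs; a model created later must be assigned again.
model.embedding.weight.data = text.vocab.vectors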

In [76]:
class EmbeddingNetwork(nn.Module):
  def __init__(self, embedding_size, hidden_size1, hidden_size2 = 400):
    super().__init__()
    self.embedding = nn.Embedding(embedding_size, hidden_size1)
    self.fc1 = nn.Linear(hidden_size2, 3)
  
  def forward(self,x):
    embeds = self.embedding(x).view(x.size(0),-1)#batch size
    out = self.fc1(embeds)
    return F.log_softmax(out,dim = -1)

In [77]:
len(text.vocab.stoi)

10002

In [82]:
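# book: model = EmbeddingNetwork(len(text.vocab.stoi),300,12000)
# Note: 600 * 20 (fix_length) = 12000 matches the linear layer, but the GloVe vectors are 300-dimensional;
# with fix_length = 20 they would need hidden_size1 = 300 and hidden_size2 = 6000.
model = EmbeddingNetwork(len(text.vocab.stoi),600,12000)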

In [83]:
try:
  model.cuda()
except:
  print("GPU not available")

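## 3. Freeze the embedding layer weights <3>
1. requires_grad = False: indicates that no gradients are needed for these weights
2. Do not pass the embedding layer's parameters to the optimizer (older PyTorch versions raise an error otherwise, because of 1)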

In [84]:
model.embedding.weight.requires_grad = False
optimizer = optim.SGD([param for param in model.parameters() if param.requires_grad == True], lr = 0.001)

Training

In [86]:
train_losses , train_accuracy = [],[] 
validation_losses , validation_accuracy = [],[]

train_iter.repeat = False  # stop after one pass over the data per epoch
test_iter.repeat = False  # stop after one pass over the data per epoch
for epoch in range(1,19):
    print("================epoch{}/18===============".format(epoch))
    epoch_loss, epoch_accuracy = fit(optimizer, epoch,model,train_iter,phase='training')
    validation_epoch_loss, validation_epoch_accuracy = fit(optimizer, epoch,model,test_iter,phase='validation')
    train_losses.append(epoch_loss) 
    train_accuracy.append(epoch_accuracy) 
    validation_losses.append(validation_epoch_loss) 
    validation_accuracy.append(validation_epoch_accuracy)





batch_idx: 195 | training loss is  0.54 and training accuracy is 18344/25000     73.38
batch_idx: 195 | validation loss is  0.69 and validation accuracy is 14772/25000     59.09
batch_idx: 195 | training loss is  0.54 and training accuracy is 18578/25000     74.31
batch_idx: 195 | validation loss is  0.69 and validation accuracy is 14799/25000      59.2
batch_idx: 195 | training loss is  0.53 and training accuracy is 18700/25000      74.8
batch_idx: 195 | validation loss is  0.69 and validation accuracy is 14769/25000     59.08
batch_idx: 195 | training loss is  0.52 and training accuracy is 18862/25000     75.45
batch_idx: 195 | validation loss is  0.69 and validation accuracy is 14771/25000     59.08
batch_idx: 195 | training loss is  0.52 and training accuracy is 18934/25000     75.74
batch_idx: 195 | validation loss is  0.69 and validation accuracy is 14788/25000     59.15
batch_idx: 195 | training loss is  0.51 and training accuracy is 19062/25000     76.25
batch_idx: 195 | valida

Why the accuracy so far is low:
**The models do not take advantage of the sequential nature of text**