# Machine Learning Textbook - PyTorch Edition

<table align="left"><tr><td>
<a href="https://colab.research.google.com/github/rickiepark/ml-with-pytorch/blob/main/ch15/ch15_part3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Run in Colab"/></a>
</td></tr></table>

## Package version check

Add the parent folder to the path so that `check_packages` can be imported from the python_environment_check.py script:

In [1]:
import sys

# On Colab, download python_environment_check.py and the dataset from the GitHub repository.
if 'google.colab' in sys.modules:
    !wget https://raw.githubusercontent.com/rickiepark/ml-with-pytorch/main/python_environment_check.py
    !wget https://raw.githubusercontent.com/rickiepark/ml-with-pytorch/main/ch15/1268-0.txt
else:
    sys.path.insert(0, '..')


Check the recommended package versions:

In [2]:
from python_environment_check import check_packages


d = {
    'torch': '1.8.0',
}
check_packages(d)

[OK] Your Python version is 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]
[OK] torch 2.0.1+cu118


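# Chapter 15 - Modeling Sequential Data Using Recurrent Neural Networks (Part 3/3)

**Contents**

- Implementing RNNs for sequence modeling in PyTorch
  - Project two: character-level language modeling in PyTorch
    - Preprocessing the dataset
    - Building a character-level RNN model
    - Evaluation phase: generating new text passages
- Summary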

In [3]:
from IPython.display import Image
%matplotlib inline

## Project two: character-level language modeling in PyTorch

In [4]:
Image(url='https://raw.githubusercontent.com/rickiepark/ml-with-pytorch/main/ch15/figures/15_11.png', width=500)

### Preprocessing the dataset

In [5]:
import numpy as np

## Reading and preprocessing the text
with open('1268-0.txt', 'r', encoding="utf8") as fp:
    text=fp.read()

start_indx = text.find('THE MYSTERIOUS ISLAND')
end_indx = text.find('End of the Project Gutenberg')

text = text[start_indx:end_indx]
char_set = set(text)
print('Total length:', len(text))
print('Unique characters:', len(char_set))

Total length: 1112350
Unique characters: 80


In [6]:
Image(url='https://raw.githubusercontent.com/rickiepark/ml-with-pytorch/main/ch15/figures/15_12.png', width=500)

In [7]:
chars_sorted = sorted(char_set)
char2int = {ch:i for i,ch in enumerate(chars_sorted)}
char_array = np.array(chars_sorted)

text_encoded = np.array(
    [char2int[ch] for ch in text],
    dtype=np.int32)

print('Encoded text shape: ', text_encoded.shape)

print(text[:15], '     == Encoding ==> ', text_encoded[:15])
print(text_encoded[15:21], ' == Decoding ==> ', ''.join(char_array[text_encoded[15:21]]))

Encoded text shape:  (1112350,)
THE MYSTERIOUS       == Encoding ==>  [44 32 29  1 37 48 43 44 29 42 33 39 45 43  1]
[33 43 36 25 38 28]  == Decoding ==>  ISLAND
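
The mapping above can be checked for a lossless round trip on a small input. The following is a minimal sketch; the toy string and the `toy_*` names are hypothetical stand-ins for the novel text and the notebook variables.

```python
import numpy as np

# Hypothetical toy text standing in for the novel.
toy_text = "hello world"

# Same construction as above: sorted unique characters -> integer ids.
toy_chars = sorted(set(toy_text))
toy_char2int = {ch: i for i, ch in enumerate(toy_chars)}
toy_char_array = np.array(toy_chars)

# Encode to integers, then decode by indexing the character array.
toy_encoded = np.array([toy_char2int[ch] for ch in toy_text], dtype=np.int32)
toy_decoded = ''.join(toy_char_array[toy_encoded])

print(toy_encoded)
print(repr(toy_decoded))
```

Because the ids are positions in the sorted character list, indexing the character array with the encoded ids recovers the original string exactly.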


In [8]:
for ex in text_encoded[:5]:
    print('{} -> {}'.format(ex, char_array[ex]))

44 -> T
32 -> H
29 -> E
1 ->  
37 -> M


In [9]:
Image(url='https://raw.githubusercontent.com/rickiepark/ml-with-pytorch/main/ch15/figures/15_13.png', width=500)

In [10]:
Image(url='https://raw.githubusercontent.com/rickiepark/ml-with-pytorch/main/ch15/figures/15_14.png', width=500)

In [11]:
seq_length = 40
chunk_size = seq_length + 1

text_chunks = [text_encoded[i:i+chunk_size]
               for i in range(len(text_encoded)-chunk_size+1)]

## Inspection:
for seq in text_chunks[:1]:
    input_seq = seq[:seq_length]
    target = seq[seq_length]
    print(input_seq, ' -> ', target)
    print(repr(''.join(char_array[input_seq])),
          ' -> ', repr(''.join(char_array[target])))

[44 32 29  1 37 48 43 44 29 42 33 39 45 43  1 33 43 36 25 38 28  1  6  6
  6  0  0  0  0  0 40 67 64 53 70 52 54 53  1 51]  ->  74
'THE MYSTERIOUS ISLAND ***\n\n\n\n\nProduced b'  ->  'y'
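
The chunking step slides a window of `chunk_size` over the encoded text with a stride of one, so it yields `len(text_encoded) - chunk_size + 1` chunks. The sketch below reproduces this on a short integer sequence; the `toy_*` names and sizes are hypothetical and chosen only to keep the output readable.

```python
import numpy as np

toy_encoded = np.arange(8)   # hypothetical stand-in for text_encoded
toy_seq_length = 3
toy_chunk_size = toy_seq_length + 1

# One chunk per starting position, exactly as in the list comprehension above.
toy_chunks = [toy_encoded[i:i + toy_chunk_size]
              for i in range(len(toy_encoded) - toy_chunk_size + 1)]

# The target sequence is the input sequence shifted by one character.
for chunk in toy_chunks[:2]:
    x, y = chunk[:-1], chunk[1:]
    print(x, '->', y)

print('number of chunks:', len(toy_chunks))
```

Neighboring chunks overlap in all but one position, which is why the dataset has almost as many training examples as the text has characters.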


In [12]:
import torch
from torch.utils.data import Dataset

class TextDataset(Dataset):
    def __init__(self, text_chunks):
        self.text_chunks = text_chunks

    def __len__(self):
        return len(self.text_chunks)

    def __getitem__(self, idx):
        text_chunk = self.text_chunks[idx]
        return text_chunk[:-1].long(), text_chunk[1:].long()

# Stack the chunks into a single array first; building a tensor from a list of arrays is slow
seq_dataset = TextDataset(torch.tensor(np.array(text_chunks)))


In [13]:
for i, (seq, target) in enumerate(seq_dataset):
    print('Input (x):', repr(''.join(char_array[seq])))
    print('Target (y):', repr(''.join(char_array[target])))
    print()
    if i == 1:
        break

Input (x): 'THE MYSTERIOUS ISLAND ***\n\n\n\n\nProduced b'
Target (y): 'HE MYSTERIOUS ISLAND ***\n\n\n\n\nProduced by'

Input (x): 'HE MYSTERIOUS ISLAND ***\n\n\n\n\nProduced by'
Target (y): 'E MYSTERIOUS ISLAND ***\n\n\n\n\nProduced by '



In [14]:
# Use the GPU when one is available; otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [15]:
from torch.utils.data import DataLoader

batch_size = 64

torch.manual_seed(1)
seq_dl = DataLoader(seq_dataset, batch_size=batch_size, shuffle=True, drop_last=True)

### Building a character-level RNN model

In [16]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn_hidden_size = rnn_hidden_size
        self.rnn = nn.LSTM(embed_dim, rnn_hidden_size,
                           batch_first=True)
        self.fc = nn.Linear(rnn_hidden_size, vocab_size)

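    def forward(self, x, hidden, cell):
        # x holds one character index per sequence: (batch,) -> (batch, 1, embed_dim)
        out = self.embedding(x).unsqueeze(1)
        # Advance the LSTM by a single time step, carrying the states forward
        out, (hidden, cell) = self.rnn(out, (hidden, cell))
        # Project to vocabulary logits and drop the time axis: (batch, vocab_size)
        out = self.fc(out).reshape(out.size(0), -1)
        return out, hidden, cell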

    def init_hidden(self, batch_size):
        hidden = torch.zeros(1, batch_size, self.rnn_hidden_size)
        cell = torch.zeros(1, batch_size, self.rnn_hidden_size)
        return hidden.to(device), cell.to(device)

vocab_size = len(char_array)
embed_dim = 256
rnn_hidden_size = 512

torch.manual_seed(1)
model = RNN(vocab_size, embed_dim, rnn_hidden_size)
model = model.to(device)
model

RNN(
  (embedding): Embedding(80, 256)
  (rnn): LSTM(256, 512, batch_first=True)
  (fc): Linear(in_features=512, out_features=80, bias=True)
)
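
To make the tensor shapes in `forward` concrete, the sketch below runs a single time step through the same three layers outside the class. The small sizes (`embed_dim=8`, `hidden_size=16`, `batch=4`) are hypothetical and chosen only so the check runs quickly on a CPU.

```python
import torch
import torch.nn as nn

# Hypothetical small sizes; the notebook uses embed_dim=256 and rnn_hidden_size=512.
vocab_size, embed_dim, hidden_size, batch = 80, 8, 16, 4

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
fc = nn.Linear(hidden_size, vocab_size)

x = torch.randint(0, vocab_size, (batch,))        # one character id per sequence
hidden = torch.zeros(1, batch, hidden_size)       # (num_layers, batch, hidden_size)
cell = torch.zeros(1, batch, hidden_size)

out = embedding(x).unsqueeze(1)                   # (batch, 1, embed_dim)
out, (hidden, cell) = lstm(out, (hidden, cell))   # (batch, 1, hidden_size)
logits = fc(out).reshape(out.size(0), -1)         # (batch, vocab_size)

print(logits.shape, hidden.shape, cell.shape)
```

The model consumes one character per call, so the sequence dimension is always 1 and the caller is responsible for looping over time steps and passing the states back in.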

In [17]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

num_epochs = 10000

torch.manual_seed(1)

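for epoch in range(num_epochs):
    # Reset the LSTM states for every new batch
    hidden, cell = model.init_hidden(batch_size)
    # Each iteration draws a single shuffled batch, so an "epoch" here is one update step
    seq_batch, target_batch = next(iter(seq_dl))
    seq_batch = seq_batch.to(device)
    target_batch = target_batch.to(device)
    optimizer.zero_grad()
    loss = 0
    # Feed one character at a time and accumulate the loss over the sequence
    for c in range(seq_length):
        pred, hidden, cell = model(seq_batch[:, c], hidden, cell)
        loss += loss_fn(pred, target_batch[:, c])
    loss.backward()
    optimizer.step()
    # Report the average loss per character
    loss = loss.item()/seq_length
    if epoch % 500 == 0: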
        print(f'Epoch {epoch} loss: {loss:.4f}')

Epoch 0 loss: 4.3722
Epoch 500 loss: 1.3959
Epoch 1000 loss: 1.3312
Epoch 1500 loss: 1.2258
Epoch 2000 loss: 1.2467
Epoch 2500 loss: 1.1968
Epoch 3000 loss: 1.1577
Epoch 3500 loss: 1.1474
Epoch 4000 loss: 1.1845
Epoch 4500 loss: 1.1594
Epoch 5000 loss: 1.0981
Epoch 5500 loss: 1.1191
Epoch 6000 loss: 1.1446
Epoch 6500 loss: 1.1238
Epoch 7000 loss: 1.1057
Epoch 7500 loss: 1.1707
Epoch 8000 loss: 1.1454
Epoch 8500 loss: 1.1631
Epoch 9000 loss: 1.0950
Epoch 9500 loss: 1.1302


### Evaluation phase: generating new text passages

In [18]:
from torch.distributions.categorical import Categorical

torch.manual_seed(1)

logits = torch.tensor([[1.0, 1.0, 1.0]])

print('Probabilities:', nn.functional.softmax(logits, dim=1).numpy()[0])

m = Categorical(logits=logits)
samples = m.sample((10,))

print(samples.numpy())

Probabilities: [0.33333334 0.33333334 0.33333334]
[[0]
 [0]
 [0]
 [0]
 [1]
 [0]
 [1]
 [2]
 [1]
 [1]]


In [19]:
torch.manual_seed(1)

logits = torch.tensor([[1.0, 1.0, 3.0]])

print('Probabilities:', nn.functional.softmax(logits, dim=1).numpy()[0])

m = Categorical(logits=logits)
samples = m.sample((10,))

print(samples.numpy())

Probabilities: [0.10650698 0.10650698 0.78698605]
[[0]
 [2]
 [2]
 [1]
 [2]
 [1]
 [2]
 [2]
 [2]
 [2]]
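
Ten samples are too few to see how closely the draws follow the softmax probabilities. The sketch below repeats the experiment with many more draws and compares the empirical frequencies with the probabilities; the sample count `n_samples` is a hypothetical choice.

```python
import torch
from torch.distributions.categorical import Categorical

torch.manual_seed(1)

logits = torch.tensor([[1.0, 1.0, 3.0]])
probs = torch.softmax(logits, dim=1)[0]

n_samples = 10000  # hypothetical sample count, large enough to smooth out noise
samples = Categorical(logits=logits).sample((n_samples,))   # shape (n_samples, 1)

# Count how often each category was drawn and normalize to frequencies.
counts = torch.bincount(samples.flatten(), minlength=logits.size(1))
empirical = counts.float() / n_samples

print('Softmax probabilities:', probs.numpy())
print('Empirical frequencies:', empirical.numpy())
```

With enough draws the frequencies approach the softmax probabilities, which confirms that `Categorical(logits=...)` samples from the softmax of the logits rather than always picking the largest one.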


In [20]:
def sample(model, starting_str,
           len_generated_text=500,
           scale_factor=1.0):

    encoded_input = torch.tensor([char2int[s] for s in starting_str])
    encoded_input = torch.reshape(encoded_input, (1, -1))

    generated_str = starting_str

    model.eval()
    hidden, cell = model.init_hidden(1)
    hidden = hidden.to('cpu')
    cell = cell.to('cpu')
    for c in range(len(starting_str)-1):
        _, hidden, cell = model(encoded_input[:, c].view(1), hidden, cell)

    last_char = encoded_input[:, -1]
    for i in range(len_generated_text):
        logits, hidden, cell = model(last_char.view(1), hidden, cell)
        logits = torch.squeeze(logits, 0)
        scaled_logits = logits * scale_factor
        m = Categorical(logits=scaled_logits)
        last_char = m.sample()
        generated_str += str(char_array[last_char])

    return generated_str

torch.manual_seed(1)
model.to('cpu')
print(sample(model, starting_str='The island'))

The island was begin and, as the lumma wild build is
small walkon of animals; a bed to igration the sailor, but in well in protecting the spurs of the ocean hand,
or being animals, smple; to have been dangerous exist asonish in agains of ever result or sistance in the intelligence, all the animal who, he escaped from the furnace of Pencroft, who accumulated the passage of these detaker’s prittering on the wreck of
this temposeral or two to a Prosce from where their eyes ranks, had not a single cries of wh


* **Predictability vs. randomness**

In [21]:
logits = torch.tensor([[1.0, 1.0, 3.0]])

print('Probabilities before scaling:        ', nn.functional.softmax(logits, dim=1).numpy()[0])

print('Probabilities after scaling with 0.5:', nn.functional.softmax(0.5*logits, dim=1).numpy()[0])

print('Probabilities after scaling with 0.1:', nn.functional.softmax(0.1*logits, dim=1).numpy()[0])

Probabilities before scaling:         [0.10650698 0.10650698 0.78698605]
Probabilities after scaling with 0.5: [0.21194156 0.21194156 0.57611686]
Probabilities after scaling with 0.1: [0.3104238  0.3104238  0.37915248]
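
The effect of the scale factor can also be summarized with the entropy of the resulting distribution: a smaller factor flattens the probabilities (more randomness) and a larger one sharpens them (more predictability). The following is a minimal NumPy sketch; the helper functions `softmax` and `entropy` are defined here for illustration and are not part of the notebook.

```python
import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability; the result is unchanged.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    # Shannon entropy in nats; higher means closer to uniform.
    return float(-(p * np.log(p)).sum())

logits = np.array([1.0, 1.0, 3.0])
entropies = []
for alpha in (0.1, 0.5, 1.0, 2.0):
    p = softmax(alpha * logits)
    entropies.append(entropy(p))
    print(f'alpha={alpha}: p={np.round(p, 3)}, entropy={entropies[-1]:.3f}')
```

Multiplying the logits by a factor is equivalent to dividing them by a temperature equal to the inverse of that factor, so `scale_factor=2.0` corresponds to a temperature of 0.5 and `scale_factor=0.5` to a temperature of 2.0.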


In [22]:
torch.manual_seed(1)
print(sample(model, starting_str='The island',
             scale_factor=2.0))

The island was an
insure, and the sand would be seen their close of the house, and the intelligence and a straight of the surrounded on the shores of the island would have an abundant which would have been able to followed by the power.

“No, my friend,” said Herbert, “and as if you, Pencroft and Herbert was that the first rivales were completed the operation of the colonists and his companions.

Herbert, who was seen the colonists’ pottering of the coast of the colonists were honest for the convicts, the


In [23]:
torch.manual_seed(1)
print(sample(model, starting_str='The island',
             scale_factor=0.5))

The island
would bozent jemutterinuana rollum. “Lazzing ISLCalson of anihmic; Cape, I pier
‘pinutnely,” Tide,” said Coolleguark: their arras; t miys by--nisaries,” he akyed Cyrtam,
smple;
to hain; in--nder, Cor exhaoial;, an inhago was deed--a hruck frosts, Jacriam clvinomes’, at lo,-koednoated, grar, he escen;
Positm ton ”ickally
severally drawn I am-Neb “or here, pierced how ly histe’s privatebk, we diok recemevin,
leanging?” sighted, two
CPratch fallewled! Yon.

After an
weakle nearl fifly Lur, will we


# Summary