## **2. Word2Vec**
1. Convert the given words into a form that the word2vec model can consume.
2. Implement the CBOW and Skip-gram models.
3. Train the models and inspect the results.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### **Import required packages**

In [2]:
!pip install konlpy


In [3]:
from tqdm import tqdm
from konlpy.tag import Okt
from torch import nn
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader
from collections import defaultdict

import torch
import copy
import numpy as np

### **Data preprocessing**



Inspect the data and preprocess it into the format Word2Vec expects.  
The training data is the same as in exercise 1, and we assume the test words listed below.

In [4]:
train_data = [
  "정말 맛있습니다. 추천합니다.",
  "기대했던 것보단 별로였네요.",
  "다 좋은데 가격이 너무 비싸서 다시 가고 싶다는 생각이 안 드네요.",
  "완전 최고입니다! 재방문 의사 있습니다.",
  "음식도 서비스도 다 만족스러웠습니다.",
  "위생 상태가 좀 별로였습니다. 좀 더 개선되기를 바랍니다.",
  "맛도 좋았고 직원분들 서비스도 너무 친절했습니다.",
  "기념일에 방문했는데 음식도 분위기도 서비스도 다 좋았습니다.",
  "전반적으로 음식이 너무 짰습니다. 저는 별로였네요.",
  "위생에 조금 더 신경 썼으면 좋겠습니다. 조금 불쾌했습니다."       
]

test_words = ["음식", "맛", "서비스", "위생", "가격"]

Tokenization and vocab construction follow the same steps as in the previous exercise.

In [5]:
tokenizer = Okt()

In [6]:
def make_tokenized(data):
  tokenized = []
  for sent in tqdm(data):
    tokens = tokenizer.morphs(sent, stem=True)
    tokenized.append(tokens)

  return tokenized

In [7]:
train_tokenized = make_tokenized(train_data)

100%|██████████| 10/10 [00:06<00:00,  1.44it/s]


In [8]:
word_count = defaultdict(int)

for tokens in tqdm(train_tokenized):
  for token in tokens:
    word_count[token] += 1

100%|██████████| 10/10 [00:00<00:00, 8413.85it/s]


In [9]:
word_count = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
print(list(word_count))

[('.', 14), ('도', 7), ('이다', 4), ('좋다', 4), ('별로', 3), ('다', 3), ('이', 3), ('너무', 3), ('음식', 3), ('서비스', 3), ('하다', 2), ('방문', 2), ('위생', 2), ('좀', 2), ('더', 2), ('에', 2), ('조금', 2), ('정말', 1), ('맛있다', 1), ('추천', 1), ('기대하다', 1), ('것', 1), ('보단', 1), ('가격', 1), ('비싸다', 1), ('다시', 1), ('가다', 1), ('싶다', 1), ('생각', 1), ('안', 1), ('드네', 1), ('요', 1), ('완전', 1), ('최고', 1), ('!', 1), ('재', 1), ('의사', 1), ('있다', 1), ('만족스럽다', 1), ('상태', 1), ('가', 1), ('개선', 1), ('되다', 1), ('기르다', 1), ('바라다', 1), ('맛', 1), ('직원', 1), ('분들', 1), ('친절하다', 1), ('기념일', 1), ('분위기', 1), ('전반', 1), ('적', 1), ('으로', 1), ('짜다', 1), ('저', 1), ('는', 1), ('신경', 1), ('써다', 1), ('불쾌하다', 1)]


In [10]:
w2i = {}
for pair in tqdm(word_count):
  if pair[0] not in w2i:
    w2i[pair[0]] = len(w2i)

100%|██████████| 60/60 [00:00<00:00, 207126.12it/s]


In [11]:
print(train_tokenized)
print(w2i)

[['정말', '맛있다', '.', '추천', '하다', '.'], ['기대하다', '것', '보단', '별로', '이다', '.'], ['다', '좋다', '가격', '이', '너무', '비싸다', '다시', '가다', '싶다', '생각', '이', '안', '드네', '요', '.'], ['완전', '최고', '이다', '!', '재', '방문', '의사', '있다', '.'], ['음식', '도', '서비스', '도', '다', '만족스럽다', '.'], ['위생', '상태', '가', '좀', '별로', '이다', '.', '좀', '더', '개선', '되다', '기르다', '바라다', '.'], ['맛', '도', '좋다', '직원', '분들', '서비스', '도', '너무', '친절하다', '.'], ['기념일', '에', '방문', '하다', '음식', '도', '분위기', '도', '서비스', '도', '다', '좋다', '.'], ['전반', '적', '으로', '음식', '이', '너무', '짜다', '.', '저', '는', '별로', '이다', '.'], ['위생', '에', '조금', '더', '신경', '써다', '좋다', '.', '조금', '불쾌하다', '.']]
{'.': 0, '도': 1, '이다': 2, '좋다': 3, '별로': 4, '다': 5, '이': 6, '너무': 7, '음식': 8, '서비스': 9, '하다': 10, '방문': 11, '위생': 12, '좀': 13, '더': 14, '에': 15, '조금': 16, '정말': 17, '맛있다': 18, '추천': 19, '기대하다': 20, '것': 21, '보단': 22, '가격': 23, '비싸다': 24, '다시': 25, '가다': 26, '싶다': 27, '생각': 28, '안': 29, '드네': 30, '요': 31, '완전': 32, '최고': 33, '!': 34, '재': 35, '의사': 36, '있다': 37, '만족스럽다': 38, '상태
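The counting, sorting, and indexing cells above can also be folded into a single helper. The sketch below is one possible equivalent, not the notebook's own code: `build_vocab`, `toy_tokenized`, `toy_w2i`, and `toy_i2w` are hypothetical names, and the toy input is just the first two tokenized sentences. It also builds the inverse map, which is handy when reading id-based outputs.

```python
from collections import Counter

def build_vocab(tokenized):
    """Map each token to an integer id, most frequent token first."""
    counts = Counter(token for tokens in tokenized for token in tokens)
    # most_common() sorts by count (descending); ties keep first-seen order.
    return {token: idx for idx, (token, _) in enumerate(counts.most_common())}

# Toy input: the first two tokenized training sentences.
toy_tokenized = [
    ["정말", "맛있다", ".", "추천", "하다", "."],
    ["기대하다", "것", "보단", "별로", "이다", "."],
]
toy_w2i = build_vocab(toy_tokenized)
toy_i2w = {idx: token for token, idx in toy_w2i.items()}  # inverse map: id -> token

print(toy_w2i["."])  # 0, because "." is the most frequent token
print(len(toy_w2i))  # 10 distinct tokens
```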

To build the inputs that the models actually consume, define a `Dataset` class for each model.

In [12]:
class CBOWDataset(Dataset):
  def __init__(self, train_tokenized, window_size=2):
    self.x = []
    self.y = []

    for tokens in tqdm(train_tokenized):
      token_ids = [w2i[token] for token in tokens]
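      for i, token_id in enumerate(token_ids):
        # Keep only positions that have a full window on both sides
        if i-window_size >= 0 and i+window_size < len(token_ids):
          self.x.append(token_ids[i-window_size:i] + token_ids[i+1:i+window_size+1])
          self.y.append(token_id)

    # Convert the Python lists to tensors so that each item is a tensor
    # the DataLoader can stack into a batch
    self.x = torch.LongTensor(self.x)  # (num_samples, 2 * window_size)
    self.y = torch.LongTensor(self.y)  # (num_samples)

  # __len__ and __getitem__ must be defined because this class subclasses Dataset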
  def __len__(self):
    return self.x.shape[0]

  def __getitem__(self, idx):
    return self.x[idx], self.y[idx]
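To make the windowing rule concrete, here is a minimal pure-Python sketch of the same loop, applied to the ids of the first training sentence. `cbow_samples` and `first_sentence_ids` are illustrative names, not part of the notebook. Note that only positions with a full window on both sides become samples, so edge tokens are never used as a center word and sentences shorter than `2 * window_size + 1` tokens contribute nothing.

```python
def cbow_samples(token_ids, window_size=2):
    """Return (contexts, centers) for positions that have a full window."""
    contexts, centers = [], []
    for i, center in enumerate(token_ids):
        if i - window_size >= 0 and i + window_size < len(token_ids):
            left = token_ids[i - window_size:i]
            right = token_ids[i + 1:i + window_size + 1]
            contexts.append(left + right)  # 2 * window_size context ids
            centers.append(center)         # the word to predict
    return contexts, centers

# Ids of the first training sentence: 정말 맛있다 . 추천 하다 .
first_sentence_ids = [17, 18, 0, 19, 10, 0]
contexts, centers = cbow_samples(first_sentence_ids)
print(contexts)  # [[17, 18, 19, 10], [18, 0, 10, 0]]
print(centers)   # [0, 19]
```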

In [13]:
class SkipGramDataset(Dataset):
  def __init__(self, train_tokenized, window_size=2):
    self.x = []
    self.y = []

    for tokens in tqdm(train_tokenized):
      token_ids = [w2i[token] for token in tokens]
      for i, token_id in enumerate(token_ids):
        if i-window_size >= 0 and i+window_size < len(token_ids):
          # One (center, context) pair per context word
          self.y += (token_ids[i-window_size:i] + token_ids[i+1:i+window_size+1])
          self.x += [token_id] * 2 * window_size

    self.x = torch.LongTensor(self.x)  # (num_samples)
    self.y = torch.LongTensor(self.y)  # (num_samples)

  def __len__(self):
    return self.x.shape[0]

  def __getitem__(self, idx):
    return self.x[idx], self.y[idx]

Create a `Dataset` object for each model.

In [14]:
cbow_set = CBOWDataset(train_tokenized)
skipgram_set = SkipGramDataset(train_tokenized)
print(list(skipgram_set))

100%|██████████| 10/10 [00:00<00:00, 7211.66it/s]
100%|██████████| 10/10 [00:00<00:00, 8158.54it/s]

[(tensor(0), tensor(17)), (tensor(0), tensor(18)), (tensor(0), tensor(19)), (tensor(0), tensor(10)), (tensor(19), tensor(18)), (tensor(19), tensor(0)), (tensor(19), tensor(10)), (tensor(19), tensor(0)), (tensor(22), tensor(20)), (tensor(22), tensor(21)), (tensor(22), tensor(4)), (tensor(22), tensor(2)), (tensor(4), tensor(21)), (tensor(4), tensor(22)), (tensor(4), tensor(2)), (tensor(4), tensor(0)), (tensor(23), tensor(5)), (tensor(23), tensor(3)), (tensor(23), tensor(6)), (tensor(23), tensor(7)), (tensor(6), tensor(3)), (tensor(6), tensor(23)), (tensor(6), tensor(7)), (tensor(6), tensor(24)), (tensor(7), tensor(23)), (tensor(7), tensor(6)), (tensor(7), tensor(24)), (tensor(7), tensor(25)), (tensor(24), tensor(6)), (tensor(24), tensor(7)), (tensor(24), tensor(25)), (tensor(24), tensor(26)), (tensor(25), tensor(7)), (tensor(25), tensor(24)), (tensor(25), tensor(26)), (tensor(25), tensor(27)), (tensor(26), tensor(24)), (tensor(26), tensor(25)), (tensor(26), tensor(27)), (tensor(26), tens
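The first eight pairs printed above can be reproduced by hand from the first training sentence, whose ids are `[17, 18, 0, 19, 10, 0]`. The sketch below uses a hypothetical helper, `skipgram_pairs`, that mirrors the dataset's loop in plain Python: each center word with a full window is paired once with every context word.

```python
def skipgram_pairs(token_ids, window_size=2):
    """Return (center, context) pairs for positions that have a full window."""
    pairs = []
    for i, center in enumerate(token_ids):
        if i - window_size >= 0 and i + window_size < len(token_ids):
            context = token_ids[i - window_size:i] + token_ids[i + 1:i + window_size + 1]
            pairs.extend((center, ctx) for ctx in context)
    return pairs

sg_pairs = skipgram_pairs([17, 18, 0, 19, 10, 0])
print(sg_pairs[:4])  # [(0, 17), (0, 18), (0, 19), (0, 10)]
print(sg_pairs[4:])  # [(19, 18), (19, 0), (19, 10), (19, 0)]
```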




### **Implementing the model classes**

Implement the two Word2Vec models in turn.  


*   `self.embedding`: a layer that embeds a one-hot vector of size `vocab_size` into a `dim`-dimensional vector.
*   `self.linear`: a layer that maps the embedding vector back to size `vocab_size`.
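Describing the embedding layer as acting on a one-hot vector is a conceptual view: multiplying a one-hot vector by a `(vocab_size, dim)` weight matrix simply selects one row, which is why an embedding layer can take integer ids and do a lookup instead. The sketch below checks that equivalence in plain Python with a small hypothetical weight matrix (`toy_weight`, `one_hot`, and `vector_times_matrix` are illustrative names).

```python
def one_hot(index, size):
    """Return a one-hot list of the given size with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vector_times_matrix(vector, matrix):
    """Compute vector @ matrix for a (V,) vector and a (V, d) matrix."""
    num_cols = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(num_cols)]

# Hypothetical toy weights: vocab_size = 4, dim = 3.
toy_weight = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
word_id = 2
via_matmul = vector_times_matrix(one_hot(word_id, 4), toy_weight)
via_lookup = toy_weight[word_id]  # what an embedding lookup returns
print(via_matmul == via_lookup)   # True
```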


In [15]:
class CBOW(nn.Module):
  def __init__(self, vocab_size, dim):
    super(CBOW, self).__init__()
    # embedding: plays a role similar to a linear layer applied to a one-hot input
    self.embedding = nn.Embedding(vocab_size, dim, sparse=True)
    self.linear = nn.Linear(dim, vocab_size)

  # Tracking the tensor shapes makes the forward pass easier to follow
  # B: batch size, W: window size, d_w: word embedding size, V: vocab size
  def forward(self, x):  # x: (B, 2W)
    embeddings = self.embedding(x)  # (B, 2W, d_w)
    embeddings = torch.sum(embeddings, dim=1)  # (B, d_w)
    output = self.linear(embeddings)  # (B, V)
    return output

In [16]:
class SkipGram(nn.Module):
  def __init__(self, vocab_size, dim):
    super(SkipGram, self).__init__()
    self.embedding = nn.Embedding(vocab_size, dim, sparse=True)
    self.linear = nn.Linear(dim, vocab_size)

  # B: batch size, W: window size, d_w: word embedding size, V: vocab size
  def forward(self, x): # x: (B)
    embeddings = self.embedding(x)  # (B, d_w)
    output = self.linear(embeddings)  # (B, V)
    return output

Create both models.

In [17]:
cbow = CBOW(vocab_size=len(w2i), dim=256)
skipgram = SkipGram(vocab_size=len(w2i), dim=256)
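As a quick size check, the parameter count of each model can be worked out by hand. This assumes `len(w2i)` is 60, as the vocabulary built above suggests, and that `nn.Linear` keeps its default bias term; both models have the same two layers, so the count is identical for CBOW and Skip-gram.

```python
vocab_size = 60  # assumed value of len(w2i) for this corpus
dim = 256

embedding_params = vocab_size * dim            # nn.Embedding weight: (V, d_w)
linear_params = dim * vocab_size + vocab_size  # nn.Linear weight (V, d_w) plus bias (V)
total_params = embedding_params + linear_params
print(embedding_params, linear_params, total_params)  # 15360 15420 30780
```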

### **Training the models**

Set the hyperparameters as follows and create the `DataLoader` objects.

In [18]:
batch_size = 4
learning_rate = 5e-4
num_epochs = 5
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

cbow_loader = DataLoader(cbow_set, batch_size=batch_size)
skipgram_loader = DataLoader(skipgram_set, batch_size=batch_size)
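The number of batches per epoch follows from the dataset sizes. The sketch below recomputes them from the token counts of the ten tokenized sentences printed earlier, assuming the default `drop_last=False` so that a partial final batch would still count. The variable names are illustrative only.

```python
import math

# Token counts of the ten tokenized training sentences printed earlier.
sentence_lengths = [6, 6, 15, 9, 7, 14, 10, 13, 13, 11]
window_size = 2
batch_size = 4

# A position is usable only if it has a full window on both sides.
cbow_sample_count = sum(max(0, n - 2 * window_size) for n in sentence_lengths)
# Skip-gram emits one pair per context word: 2 * window_size pairs per position.
skipgram_sample_count = cbow_sample_count * 2 * window_size

cbow_batches = math.ceil(cbow_sample_count / batch_size)
skipgram_batches = math.ceil(skipgram_sample_count / batch_size)
print(cbow_sample_count, cbow_batches)          # 64 16
print(skipgram_sample_count, skipgram_batches)  # 256 64
```

With only 16 and 64 small batches per epoch, five epochs amount to a few hundred SGD steps, so large changes in the loss should not be expected.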

First, train the CBOW model.

In [19]:
cbow.train()
cbow = cbow.to(device)
optim = torch.optim.SGD(cbow.parameters(), lr=learning_rate)
loss_function = nn.CrossEntropyLoss()

for e in range(1, num_epochs+1):
  print("#" * 50)
  print(f"Epoch: {e}")
  for batch in tqdm(cbow_loader):
    x, y = batch
    x, y = x.to(device), y.to(device) # (B, 2W), (B)
    output = cbow(x)  # (B, V)
 
    optim.zero_grad()
    loss = loss_function(output, y)
    loss.backward()
    optim.step()

    print(f"Train loss: {loss.item()}")

print("Finished.")


##################################################
Epoch: 1
Train loss: 5.435768127441406
Train loss: 4.737085819244385
Train loss: 3.8809356689453125
Train loss: 4.558345794677734
Train loss: 4.855749607086182
Train loss: 4.750556468963623
Train loss: 4.100275039672852
Train loss: 4.424317359924316



Train loss: 4.976825714111328
Train loss: 5.697813987731934
Train loss: 4.941936492919922
Train loss: 4.34714937210083
Train loss: 5.589223861694336
Train loss: 4.609934329986572
Train loss: 5.865617752075195
Train loss: 4.707053184509277
##################################################
Epoch: 2
Train loss: 5.223531246185303
Train loss: 4.572081565856934
Train loss: 3.7769925594329834
Train loss: 4.430196762084961
Train loss: 4.734350681304932
Train loss: 4.4327592849731445
Train loss: 3.925676107406616
Train loss: 4.2785162925720215
Train loss: 4.840892791748047
Train loss: 5.490176200866699
Train loss: 4.744015693664551
Train loss: 3.950728178024292
Train loss: 5.434819221496582
Train loss: 4.485015392303467
Train loss: 5.702638149261475
Train loss: 4.547245979309082
##################################################
Epoch: 3
Train loss: 5.016544818878174
Train loss: 4.410897254943848
Train loss: 3.6751351356506348
Train loss: 4.304274559020996
Train loss: 4.61474609375
Train loss:




Next, train the Skip-gram model.

In [20]:
skipgram.train()
skipgram = skipgram.to(device)
optim = torch.optim.SGD(skipgram.parameters(), lr=learning_rate)
loss_function = nn.CrossEntropyLoss()

for e in range(1, num_epochs+1):
  print("#" * 50)
  print(f"Epoch: {e}")
  for batch in tqdm(skipgram_loader):
    x, y = batch
    x, y = x.to(device), y.to(device) # (B), (B)
    output = skipgram(x)  # (B, V)

    optim.zero_grad()
    loss = loss_function(output, y)
    loss.backward()
    optim.step()

    print(f"Train loss: {loss.item()}")

print("Finished.")


##################################################
Epoch: 1
Train loss: 4.03630256652832
Train loss: 4.296536445617676
Train loss: 4.155908107757568
Train loss: 4.52178955078125
Train loss: 4.287509918212891
Train loss: 4.6464338302612305
Train loss: 4.04913330078125
Train loss: 4.170066833496094
Train loss: 4.122392177581787
Train loss: 3.759453773498535
Train loss: 4.098999977111816
Train loss: 4.435192584991455
Train loss: 3.6452903747558594
Train loss: 4.035694122314453
Train loss: 3.8914072513580322
Train loss: 4.280170440673828
Train loss: 4.607410430908203
Train loss: 4.065797328948975
Train loss: 4.341219902038574
Train loss: 4.229527950286865
Train loss: 4.400463581085205
Train loss: 4.2741827964782715
Train loss: 4.282057285308838
Train loss: 4.3686017990112305
Train loss: 4.274510383605957
Train loss: 4.509020805358887
Train loss: 4.509653568267822
Train loss: 4.219294548034668
Train loss: 4.58566427230835
Train loss: 4.074611663818359
Train loss: 4.225931167602539
Train los


Train loss: 4.045519828796387
Train loss: 3.8208184242248535
Train loss: 3.9476537704467773
Train loss: 4.034929275512695
Train loss: 4.6783342361450195
Train loss: 4.02187442779541
Train loss: 4.05312442779541
Train loss: 4.435026168823242
Train loss: 4.455121994018555
Train loss: 4.580511569976807
Train loss: 3.9534265995025635
Train loss: 4.530774116516113
Train loss: 4.0649895668029785
Train loss: 4.10957145690918
Train loss: 4.353387832641602
Train loss: 4.036271572113037
Train loss: 4.21743106842041
Train loss: 4.151681900024414
Train loss: 4.430605888366699
Train loss: 3.95585560798645
Train loss: 3.638232469558716
Train loss: 4.111879825592041
Train loss: 4.348885536193848
Train loss: 3.802661657333374
Train loss: 4.012706756591797
Train loss: 4.7205963134765625
Train loss: 4.030889511108398
Train loss: 4.598814964294434
Train loss: 3.9153223037719727
Train loss: 3.9843320846557617
Train loss: 4.45389986038208
Train loss: 3.978278636932373
Train loss: 4.54858922958374
#########


Train loss: 4.510119438171387
##################################################
Epoch: 4
Train loss: 3.9764862060546875
Train loss: 4.135628700256348
Train loss: 4.063020706176758
Train loss: 4.333221912384033
Train loss: 4.2181596755981445
Train loss: 4.53911018371582
Train loss: 3.9328060150146484
Train loss: 4.088410377502441
Train loss: 4.058306694030762
Train loss: 3.7025699615478516
Train loss: 4.006335258483887
Train loss: 4.3348002433776855
Train loss: 3.5826809406280518
Train loss: 3.960249662399292
Train loss: 3.8235867023468018
Train loss: 4.199080467224121
Train loss: 4.519191741943359
Train loss: 3.9749889373779297
Train loss: 4.257242679595947
Train loss: 4.140230178833008
Train loss: 4.079373359680176
Train loss: 3.987938404083252
Train loss: 4.111688613891602
Train loss: 4.27157735824585
Train loss: 4.134945869445801
Train loss: 4.313202857971191
Train loss: 4.373735427856445
Train loss: 4.1599507331848145
Train loss: 4.437480449676514
Train loss: 3.9791259765625
Train


Train loss: 4.389190196990967
Train loss: 3.9479756355285645
Train loss: 4.120926380157471
Train loss: 3.9576218128204346
Train loss: 3.7226765155792236
Train loss: 3.854337692260742
Train loss: 3.928571939468384
Train loss: 4.592382431030273
Train loss: 3.840343952178955
Train loss: 3.9089560508728027
Train loss: 4.340804100036621
Train loss: 4.376692771911621
Train loss: 4.477689266204834
Train loss: 3.883192777633667
Train loss: 4.326930999755859
Train loss: 3.9292471408843994
Train loss: 3.7315382957458496
Train loss: 4.032629489898682
Train loss: 3.7629446983337402
Train loss: 4.057472229003906
Train loss: 4.085832118988037
Train loss: 4.351588726043701
Train loss: 3.861945867538452
Train loss: 3.5205159187316895
Train loss: 4.042654991149902
Train loss: 4.271206855773926
Train loss: 3.7099459171295166
Train loss: 3.9124703407287598
Train loss: 4.530483722686768
Train loss: 3.948550224304199
Train loss: 4.512050151824951
Train loss: 3.8485941886901855
Train loss: 3.905396461486816
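A useful reference point for reading these loss values: a model that assigns equal probability to every word in a vocabulary of size V has a cross-entropy of ln(V). The sketch below computes that baseline, assuming V = 60 as in this notebook and the natural-log, mean-reduced loss that `nn.CrossEntropyLoss` uses by default.

```python
import math

vocab_size = 60  # assumed value of len(w2i) in this notebook
uniform_loss = math.log(vocab_size)  # cross-entropy of a uniform guess, in nats
print(round(uniform_loss, 3))  # 4.094
```

The logged Skip-gram losses stay close to this value through the epochs shown, which suggests that the model has moved only slightly away from a near-uniform prediction. More epochs or a larger learning rate would likely be needed before the embeddings carry much signal.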




### **Test**

Use each trained model to inspect the word embeddings of the test words.

In [21]:
for word in test_words:
  input_id = torch.LongTensor([w2i[word]]).to(device)
  emb = cbow.embedding(input_id)  # the layer that actually produces the word embedding

  print(f"Word: {word}")
  print(emb.squeeze(0))

Word: 음식
tensor([ 0.5655,  0.6094, -1.4602, -0.7539,  0.7851,  1.1172,  0.5106, -1.0855,
        -0.8203,  1.7882,  0.7187, -0.5192,  2.4939,  0.4270, -0.1389, -1.4208,
        -0.0912,  0.2100, -0.3090,  0.7315,  0.9560,  1.1048,  0.5664,  0.6855,
        -1.0471,  0.6955, -0.6369, -0.5698,  0.1802, -0.1469, -0.5089,  0.7219,
         0.3549, -0.5010, -1.0451, -0.8232,  0.5631, -1.2935, -0.6190, -1.9657,
         0.0340,  0.5603,  0.9635,  1.2438,  0.2367,  0.9624, -0.0854,  0.3417,
         0.0827, -1.6525,  1.2037,  1.0047, -0.6298,  0.3998,  0.1758, -0.7145,
         0.5621,  1.8548,  0.2831,  1.8812, -1.2140, -1.3266,  0.2295,  0.1602,
         1.0919,  0.8368, -1.1242, -0.1384,  0.4094, -0.7004, -0.8623,  0.2245,
        -1.5883,  1.0903, -2.1249, -0.7558,  0.5022,  2.0867, -0.8757, -0.4281,
         1.9099, -0.0679, -0.4101, -0.4657,  1.1702, -0.0165,  0.9458,  0.7911,
        -0.8669,  2.0140,  0.1920, -0.5677, -0.7590,  0.2278, -0.6010, -0.4919,
        -0.1059,  1.9491, -0.27

In [23]:
for word in test_words:
  input_id = torch.LongTensor([w2i[word]]).to(device)
  emb = skipgram.embedding(input_id)

  print(f"Word: {word}")
#   print(max(emb.squeeze(0)))
  print(emb.squeeze(0))

Word: 음식
tensor([ 1.1983e+00,  1.5706e+00,  6.4878e-01, -1.5704e-01, -1.0057e+00,
         1.9709e-01, -2.4592e+00,  1.1434e+00,  1.6028e+00, -1.3183e+00,
        -3.0424e-01,  2.0450e-01,  2.1755e+00, -1.3450e+00,  3.8729e-01,
         1.6102e-01,  1.9721e+00, -8.5626e-01, -1.6671e+00,  2.5125e-01,
         5.4297e-01,  2.7435e-01, -1.5755e+00,  3.0072e-01, -4.4546e-03,
        -1.7763e+00,  2.8374e-02,  1.2363e+00,  4.7967e-02, -1.1169e+00,
        -3.8118e-01,  7.0826e-01, -7.2185e-01, -8.6117e-01,  5.8667e-01,
         3.1745e-01,  2.8449e-01,  2.4755e-01,  1.9394e+00,  1.3061e+00,
         1.6599e-01, -3.9597e-01,  1.2322e+00, -4.4917e-01,  9.6016e-01,
         3.5474e-01, -6.2827e-01,  4.7963e-01,  7.1378e-01, -6.4345e-01,
        -1.6703e-01,  8.9595e-01,  1.7764e-01,  3.6570e-01,  8.8914e-01,
         2.3848e+00, -1.4099e+00, -1.9547e+00, -9.7431e-01, -1.6728e-01,
         9.2746e-01, -8.1176e-01, -5.3486e-01, -3.0831e-01, -6.6231e-01,
         7.7846e-01,  1.4408e+00, -7.8270e
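Raw 256-dimensional vectors are hard to judge by eye. A common way to compare two embeddings is cosine similarity, sketched below in plain Python. The three short vectors are hypothetical stand-ins rather than values from the trained models; in the notebook itself the same comparison could be made on the `emb` tensors, for example with `F.cosine_similarity`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors standing in for learned embeddings.
emb_a = [1.0, 2.0, 0.0]
emb_b = [2.0, 4.0, 0.0]  # same direction as emb_a
emb_c = [0.0, 0.0, 3.0]  # orthogonal to emb_a

sim_ab = cosine_similarity(emb_a, emb_b)  # close to 1 for parallel vectors
sim_ac = cosine_similarity(emb_a, emb_c)  # 0.0 for orthogonal vectors
print(sim_ab, sim_ac)
```

Values near 1 indicate vectors pointing in similar directions, values near 0 indicate unrelated directions. With a corpus this small and so few training steps, similarities between the test words should be read with caution.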
