# Seq2Seq
- reference : https://github.com/bentrevett/pytorch-seq2seq
- https://github.com/bentrevett/pytorch-seq2seq/blob/main/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb

## 1. Sequence to Sequence Learning with Neural Networks
### Introduction
### Preparing Data
- PyTorch for creating the models
- spaCy to assist in the tokenization of the data
- torchtext to provide some helper functions
- datasets to load and manipulate our data
- evaluate for calculating metrics

In [5]:
import torch
import torch.nn as nn
import torch.optim as optim
import random
import numpy as np
import spacy
import datasets
import torchtext
import tqdm
import evaluate

In [6]:
seed = 1234

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True
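
The purpose of fixing the seed can be checked with the standard library alone. The sketch below is only an illustration: `sample_with_seed` is a hypothetical helper, not part of the reference notebook, and it covers Python's `random` module only. `np.random.seed` and `torch.manual_seed` play the same role for their own generators.

```python
import random

def sample_with_seed(seed, n=3):
    # Hypothetical helper: reseed the generator, then draw n integers.
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]

first = sample_with_seed(1234)
second = sample_with_seed(1234)

# Reseeding with the same value reproduces the same sequence.
print(first == second)
```

Without the `random.seed` call, two consecutive draws would generally differ, which is why the seed is set once before any data shuffling or weight initialization.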

### Dataset

- https://huggingface.co/datasets/bentrevett/multi30k


In [7]:
dataset = datasets.load_dataset('bentrevett/multi30k')

In [8]:
dataset

DatasetDict({
    train: Dataset({
        features: ['en', 'de'],
        num_rows: 29000
    })
    validation: Dataset({
        features: ['en', 'de'],
        num_rows: 1014
    })
    test: Dataset({
        features: ['en', 'de'],
        num_rows: 1000
    })
})

In [9]:
train_data, valid_data, test_data = (dataset['train'], dataset['validation'], dataset['test'])

In [10]:
train_data[0]

{'en': 'Two young, White males are outside near many bushes.',
 'de': 'Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.'}

### Tokenizers
- tokenizer: converts a string into a list of its tokens
- spaCy: provides a tokenizer model for each language
- German: de_core_news_sm
- English: en_core_web_sm
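
To make the "string in, list of tokens out" idea concrete, here is a minimal sketch that uses a regular expression as a stand-in tokenizer. `simple_tokenize` is a hypothetical helper written for illustration; it is not spaCy and does not reproduce spaCy's language-specific rules.

```python
import re

def simple_tokenize(text):
    # Stand-in tokenizer (an assumption, not spaCy's algorithm):
    # a token is either a run of word characters or one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Two young, White males are outside near many bushes.")
print(tokens)
```

The spaCy models listed above are used in the notebook instead of a rule like this because they are built for each language rather than relying on one generic pattern.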

In [11]:
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm


In [15]:
en_nlp = spacy.load("en_core_web_sm")
de_nlp = spacy.load("de_core_news_sm")

In [17]:
string = "what a lovely day it is today!"

print([token.text for token in en_nlp.tokenizer(string)])

['what', 'a', 'lovely', 'day', 'it', 'is', 'today', '!']
