**Мета роботи**: Ознайомитись з використанням перетворювачів у spaCy, а також інших попередньо навчених перетворювачів. Застосувати перетворювачі для генерації діалогів.
Створити програму, що налаштовує попередньо навчені перетворювачі та використовує їх для генерації діалогів

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [34]:
import json
import random

from torch.utils.data import DataLoader, TensorDataset
from transformers import BartTokenizer, BartForConditionalGeneration
import torch

from sklearn.model_selection import train_test_split

In [3]:
model_name = "facebook/bart-base"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/558M [00:00<?, ?B/s]

Створення навчальних даних

In [23]:
filename = r'C:\Users\5500\DataspellProjects\UniProjects\data\hotels.json'
with open(filename, "r", encoding="utf-8") as f:
    raw_data = json.load(f)

data = []

for dialog in raw_data:
    turns = dialog["turns"]
    for i in range(len(turns) - 1):
        user_turn = turns[i]
        system_turn = turns[i + 1]

        if user_turn["speaker"] == "USER" and system_turn["speaker"] == "SYSTEM":
            data.append((user_turn["utterance"], system_turn["utterance"]))

data = [pair for pair in data if  len(pair)==2]

In [24]:
train_data, test_data = train_test_split(data, test_size=0.25, random_state=42)

In [38]:
train_data[:3]

[("I'm searching for a hotel in NYC, and I'd like to reserve a room at Rodeway Inn Bronx Zoo.",
  'How long is the booking for?'),
 ('Thanks.', 'Do you need help with anything else?'),
 ('Could you tell me the phone number?',
  'The telephone number is +1 416-663-9500.')]

Токенізація і перетворення у тензори

In [26]:
input_ids = []
attention_masks = []
labels = []

for user_input, system_response in train_data:
    encoded_input = tokenizer(user_input, return_tensors="pt", padding="max_length", truncation=True, max_length=64)
    encoded_output = tokenizer(system_response, return_tensors="pt", padding="max_length", truncation=True, max_length=64)

    input_ids.append(encoded_input["input_ids"].squeeze())
    attention_masks.append(encoded_input["attention_mask"].squeeze())
    labels.append(encoded_output["input_ids"].squeeze())

input_ids = torch.stack(input_ids)
attention_masks = torch.stack(attention_masks)
labels = torch.stack(labels)


Оптимізатор і функція втрат

In [27]:
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)

Формування датасету та батчів

In [30]:
dataset = TensorDataset(input_ids, attention_masks, labels)

batch_size = 8
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

Тренування моделі

In [31]:
model.train()
for epoch in range(10):
    total_loss = 0
    for batch in dataloader:
        b_input_ids, b_attention_mask, b_labels = batch

        outputs = model(input_ids=b_input_ids,
                        attention_mask=b_attention_mask,
                        labels=b_labels)

        loss = outputs.loss
        total_loss += loss.item()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch + 1} Loss: {total_loss:.4f}")


Epoch 1 Loss: 42.6080
Epoch 2 Loss: 37.9207
Epoch 3 Loss: 35.7519
Epoch 4 Loss: 33.1226
Epoch 5 Loss: 32.7188
Epoch 6 Loss: 32.3723
Epoch 7 Loss: 28.0397
Epoch 8 Loss: 25.7373
Epoch 9 Loss: 23.5766
Epoch 10 Loss: 21.6203


Демонстрація роботи на випадковій фразі з тестувального набору

In [37]:
def generate_response(input_text):
    model.eval()
    encoded_input = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=64)
    output = model.generate(
        input_ids=encoded_input["input_ids"],
        attention_mask=encoded_input["attention_mask"],
        max_length=50,
        num_beams=5,
        early_stopping=True
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)

example = random.choice(test_data)
print(f"Request: {example[0]}")
print(f"Response: {example[1]}")
print(f"Generated response: {generate_response(example[0])}")


Request: I need 1 room at the Hanover Hotel Victoria. I'll arrive on the 5th of March
Response: Please confirm the details for your stay in London: 1 room at the Hanover Hotel Victoria for 3 days, arriving next Tuesday.
Generated response: Please confirm: You'd like 1 room at the Hanover Hotel Victoria in Victoria for the day after tomorrow, and you'll be leaving on March 13th.
