## Modern natural language processing: BERT through practical examples

## Text classification

Practical part of the 3rd workshop in the Akademija umetne inteligence za poslovne aplikacije (Academy of Artificial Intelligence for Business Applications).

In this notebook we will learn how to use *BERT* and similar models for text classification. The task is sentiment analysis: we will classify IMDb movie reviews as positive or negative.

First, set up GPU access for this Colab session:
- `Edit -> Notebook settings -> Hardware accelerator` must be set to one of the GPUs.
- If needed, reconnect with the `Connect` button in the top right corner.

In [None]:
!nvidia-smi

In [None]:
%%capture
!pip install datasets evaluate transformers[torch]

In [None]:
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer
)

Let's download the data and look at a few examples:

In [None]:
imdb = load_dataset("imdb")

In [None]:
imdb

In [None]:
label_list = set(imdb["train"]["label"])
print(label_list)

In [None]:
imdb["train"][0]

In [None]:
imdb["train"][-1]

Our goal in this notebook is to fine-tune (further train) an existing BERT model so that it classifies a movie review as positive or negative.

#### Data preparation

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased", model_max_length=256)

In [None]:
def tokenize(rows):
    return tokenizer(rows["text"], max_length=256, truncation=True)

imdb_train = imdb["train"].map(tokenize, batched=True)
imdb_test = imdb["test"].map(tokenize, batched=True)

In [None]:
imdb_train[0]

#### Model

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased",
    num_labels=2,
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1}
)

#### Training

In [None]:
training_args = TrainingArguments(
    output_dir="/content/",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",  # older transformers releases call this evaluation_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,
)
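
To get a feel for how long training will run, it helps to count the optimizer steps implied by the arguments above. The sketch below assumes the 25,000 reviews of the IMDb training split, a single GPU, and no gradient accumulation; the variable names are only illustrative.

```python
import math

num_train_examples = 25_000        # assumed size of the IMDb training split
per_device_train_batch_size = 32   # from TrainingArguments above
num_train_epochs = 2               # from TrainingArguments above
num_devices = 1                    # assumption: one Colab GPU

# The last, smaller batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(
    num_train_examples / (per_device_train_batch_size * num_devices)
)
total_steps = steps_per_epoch * num_train_epochs

print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
```

With `save_strategy="epoch"` this also means one checkpoint is written per epoch, so two in total.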

In [None]:
trainer = Trainer(
    model=model.cuda(),
    args=training_args,
    train_dataset=imdb_train,
    eval_dataset=imdb_test,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
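
The `DataCollatorWithPadding` passed to the trainer pads each batch only up to the length of its longest sequence, instead of padding the whole dataset to one fixed length. A minimal pure-Python sketch of that idea follows; `pad_batch` is a hypothetical helper rather than the library's implementation, the token ids are made up, and a padding id of 0 is an assumption.

```python
def pad_batch(batch, pad_token_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy "tokenized reviews" of different lengths (ids are illustrative)
toy_batch = [[101, 2023, 102], [101, 2023, 3185, 2003, 102]]
padded = pad_batch(toy_batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because the tokenizer above truncates to 256 tokens but does not pad, short reviews stay short until they are batched, which saves computation.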

In [None]:
trainer.train()

#### Evaluating the trained model

In [None]:
from evaluate import evaluator
from transformers import AutoModelForSequenceClassification, pipeline

In [None]:
!mkdir -p /content/bert-imdb
!gdown -O /content/bert-imdb/config.json "https://drive.google.com/uc?id=1S-M9cDLYPi4tglFgrgYrpql9MQ1Y1v8Y"
!gdown -O /content/bert-imdb/model.safetensors "https://drive.google.com/uc?id=1OFUyFQ7vNa8dyzadfUmcDZOEpmBMjZuy"

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    "/content/bert-imdb",
    num_labels=2,
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1},
)

In [None]:
task_evaluator = evaluator("text-classification")

In [None]:
eval_results = task_evaluator.compute(
    model_or_pipeline=model.cuda(),
    tokenizer=tokenizer,
    data=imdb_test.shuffle(seed=42).select(range(1000)),
    label_mapping={"negative": 0, "positive": 1}
)

In [None]:
eval_results
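
As far as we know, the text-classification evaluator reports accuracy unless another metric is requested. The sketch below shows what that number means: predicted label names are mapped to ids with the same `label_mapping` as above and compared with the reference labels. The predictions and references here are made up for illustration, and `accuracy` is a hypothetical helper.

```python
def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


label_mapping = {"negative": 0, "positive": 1}

# Made-up model outputs and gold labels, for illustration only
predicted_labels = ["positive", "negative", "positive", "positive"]
references = [1, 0, 0, 1]

predicted_ids = [label_mapping[label] for label in predicted_labels]
toy_accuracy = accuracy(predicted_ids, references)
print(toy_accuracy)
```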

#### Let's experiment

In [None]:
imdb_pipeline = pipeline("text-classification", model=model, tokenizer=tokenizer)

In [None]:
imdb_pipeline("This movie is good!")

In [None]:
imdb_pipeline("This movie is bad!")
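
The pipeline returns a label and a score for each input. For a two-class model like ours, the score is typically the softmax probability of the highest-scoring class, and the label comes from `id2label`. A minimal sketch of that conversion is shown below; the logits are made up and `to_prediction` is a hypothetical helper, not part of `transformers`.

```python
import math

id2label = {0: "negative", 1: "positive"}


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits):
    """Pick the most probable class and report its label and probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


example = to_prediction([-1.2, 2.3])  # made-up logits for one review
print(example)
```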

Can we also use ChatGPT for this?

In [None]:
%%capture
!pip install -q openai

In [None]:
import openai
import requests

In [None]:
openai.api_key = ""

In [None]:
from typing import Optional


def call_model(msg: str, temperature: float = 1.0, top_p: float = 1.0, model: str = "gpt-3.5-turbo", system: Optional[str] = None) -> str:
    """Send one user message (and an optional system prompt) to the Chat Completions API and return the reply text."""
    url = "https://api.openai.com/v1/chat/completions"

    # An optional system message goes before the user message
    messages = [{"role": "user", "content": msg}]
    if system is not None:
        messages.insert(0, {"role": "system", "content": system})

    payload = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
    }

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {openai.api_key}",
    }

    response = requests.post(url, headers=headers, json=payload)
    response.raise_for_status()  # fail loudly on HTTP errors, e.g. a missing or invalid API key
    return response.json()["choices"][0]["message"]["content"].strip()

In [None]:
call_model("What's the capital of Slovenia?")

Try classifying a few examples with ChatGPT and compare the results with BERT.
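
One possible way to set up that comparison is sketched below: a system prompt asks for a one-word answer, and a small parser maps the free-text reply to a label. The prompt wording, `parse_label`, `classify_with_llm`, and `fake_ask` are all assumptions made for illustration; `fake_ask` stands in for `call_model` so that the sketch runs without an API key.

```python
SYSTEM_PROMPT = (
    "You are a sentiment classifier for movie reviews. "
    "Answer with exactly one word: positive or negative."
)


def parse_label(answer):
    """Map a free-text answer to 'positive', 'negative', or None if unclear."""
    text = answer.strip().lower()
    has_pos = "positive" in text
    has_neg = "negative" in text
    if has_pos == has_neg:  # neither label or both labels mentioned
        return None
    return "positive" if has_pos else "negative"


def classify_with_llm(review, ask):
    """`ask` is any callable with the same keyword arguments as call_model."""
    answer = ask(f"Review: {review}", system=SYSTEM_PROMPT, temperature=0.0)
    return parse_label(answer)


def fake_ask(msg, system=None, temperature=1.0):
    """Offline stand-in for the real API call (illustration only)."""
    return "Positive." if "good" in msg else "Negative."


print(classify_with_llm("This movie is good!", fake_ask))
print(classify_with_llm("This movie is bad!", fake_ask))
```

To query the real model, pass `call_model` instead of `fake_ask`. A temperature of 0 is used here to make the answers more repeatable; unclear replies come back as `None` and can be counted separately when comparing against BERT.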