## Modern natural language processing: BERT through practical examples

## Named entity recognition

Practical part of the 3rd workshop of the Akademija umetne inteligence za poslovne aplikacije (Academy of Artificial Intelligence for Business Applications).

In this notebook we will learn how to use *BERT* and similar models for named entity recognition. The model will classify each word in a text into one of several categories, such as person, organization, or location.

First, set up GPU access in this Colab session:
- `Edit -> Notebook settings -> Hardware accelerator` must be set to one of the GPUs.
- If needed, reconnect using the `Connect` button in the top-right corner.

In [None]:
!nvidia-smi

In [None]:
%%capture
!pip install datasets evaluate transformers[torch] seqeval colorama

In [None]:
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer
)

Download the data and look at a few examples:

In [None]:
dataset = load_dataset("conll2003")

In [None]:
dataset

In [None]:
dataset["train"][0]["tokens"]

In [None]:
dataset["train"][0]["ner_tags"]

In [None]:
dataset["train"].features["ner_tags"]

Why do we mark the beginning of an entity separately (B-PER vs. I-PER)? So that we can tell apart several consecutive entities, for example:

These are Luka, Andrej.
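
To make the role of the `B-` prefix concrete, here is a minimal sketch in plain Python. The helper `bio_to_spans` and the example sentence are made up for illustration and are not part of the dataset or of any library. It groups words into entities and shows that, without `B-` tags, two adjacent names collapse into one.

```python
def bio_to_spans(words, tags):
    """Group words into (entity_type, text) pairs based on B-/I- prefixes."""
    spans = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        # A new entity starts at every B- tag, or at an I- tag of a different type.
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if tag == "O" or starts_new:
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
        if tag != "O":
            current_type = tag[2:]
            current_words.append(word)
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans


# Hypothetical example: two people mentioned back to back.
words = ["Luka", "Andrej", "Novak", "arrived"]

with_b_tags = bio_to_spans(words, ["B-PER", "B-PER", "I-PER", "O"])
without_b_tags = bio_to_spans(words, ["I-PER", "I-PER", "I-PER", "O"])

print(with_b_tags)     # two separate people
print(without_b_tags)  # everything merged into a single entity
```

With the `B-` tags the two people stay separate. With only `I-` tags there is no way to tell where one name ends and the next begins.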

In [None]:
words = dataset["train"][0]["tokens"]
labels = dataset["train"][0]["ner_tags"]
label_names = dataset["train"].features["ner_tags"].feature.names

sentence_str = ""
label_str = ""
for word, label in zip(words, labels):
    max_length = max(len(word), len(label_names[label]))
    sentence_str += word + " " * (max_length - len(word) + 1)
    label_str += label_names[label] + " " * (max_length - len(label_names[label]) + 1)

print(sentence_str)
print(label_str)

Our goal in this notebook is to fine-tune (continue training) an existing pretrained BERT-style model, here DistilBERT, so that it classifies every word in a text into a category.

#### Data preparation

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased", model_max_length=512)

In [None]:
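def align_labels_with_tokens(labels, word_ids):
    """Expand word-level labels to subword tokens using the tokenizer's word ids."""
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # First token of a new word (or a special token that follows a word).
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special tokens get -100 so that the loss ignores them.
            new_labels.append(-100)
        else:
            # Later subword of the same word: turn B-XXX (odd id) into I-XXX.
            label = labels[word_id]
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels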


def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs
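
A small worked example may make the alignment step easier to follow. The sketch below repeats `align_labels_with_tokens` so that it runs on its own, and applies it to hand-written `word_ids`. The subword split and the label order in `LABEL_NAMES` are assumptions for illustration; the real values come from the tokenizer and from `dataset["train"].features["ner_tags"]`.

```python
# Assumed label order; check it against the dataset features printed earlier.
LABEL_NAMES = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
               "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def align_labels_with_tokens(labels, word_ids):
    # Same logic as the notebook function, repeated to keep this cell standalone.
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            current_word = word_id
            new_labels.append(-100 if word_id is None else labels[word_id])
        elif word_id is None:
            new_labels.append(-100)
        else:
            label = labels[word_id]
            if label % 2 == 1:  # B-XXX becomes I-XXX inside a word
                label += 1
            new_labels.append(label)
    return new_labels


# Hypothetical sentence of three words: a location followed by two plain words.
word_labels = [5, 0, 0]  # B-LOC, O, O

# Hypothetical tokenization: special token, three subwords for word 0,
# one token each for words 1 and 2, special token.
word_ids = [None, 0, 0, 0, 1, 2, None]

aligned = align_labels_with_tokens(word_labels, word_ids)
readable = [LABEL_NAMES[i] if i != -100 else "IGN" for i in aligned]

print(aligned)
print(readable)
```

Only the first subword keeps `B-LOC`; the remaining subwords of the same word become `I-LOC`, and the special tokens are marked with `-100`.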

In [None]:
dataset_tokenized = dataset.map(
    tokenize_and_align_labels,
    batched=True
)

#### Model

In [None]:
id2label = {i: label for i, label in enumerate(dataset["train"].features["ner_tags"].feature.names)}

model = AutoModelForTokenClassification.from_pretrained(
    "distilbert/distilbert-base-uncased",
    id2label=id2label,
    label2id={v: k for k, v in id2label.items()},
)

In [None]:
training_args = TrainingArguments(
    output_dir="/content/",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # newer transformers releases name this `eval_strategy`
    save_strategy="epoch",
    load_best_model_at_end=True,
)

In [None]:
trainer = Trainer(
    model=model.cuda(),
    args=training_args,
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForTokenClassification(tokenizer=tokenizer),
)

In [None]:
trainer.train()

### Evaluating the trained model

In [None]:
from evaluate import evaluator
from transformers import AutoModelForTokenClassification, pipeline

In [None]:
!mkdir -p /content/bert-ner
!gdown -O /content/bert-ner/config.json https://drive.google.com/uc?id=1uAfr57dXkz9r-fIIT93u_miKiCMb1YmK
!gdown -O /content/bert-ner/model.safetensors https://drive.google.com/uc?id=14f1skonBD3LbRPFPrjY3haBxjvL7trrf

In [None]:
id2label = {i: label for i, label in enumerate(dataset["train"].features["ner_tags"].feature.names)}

model = AutoModelForTokenClassification.from_pretrained(
    "/content/bert-ner",
    id2label=id2label,
    label2id={v: k for k, v in id2label.items()},
)

In [None]:
task_evaluator = evaluator("token-classification")

In [None]:
eval_results = task_evaluator.compute(
    model_or_pipeline=model.cuda(),
    tokenizer=tokenizer,
    data=dataset_tokenized["test"]
)

In [None]:
eval_results
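
Named entity recognition is usually scored at the entity level: a predicted entity counts as correct only if both its type and its exact span match the gold annotation. The sketch below illustrates that idea in plain Python. It is not the code the evaluator runs, and the helpers `extract_spans` and `entity_prf` as well as the tag sequences are made up for illustration.

```python
def extract_spans(tags):
    """Return the set of (type, start, end) entity spans in a BIO tag sequence."""
    spans = set()
    start, ent_type = None, None
    # A trailing "O" acts as a sentinel that closes the last open entity.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == ent_type
        if not continues:
            if ent_type is not None:
                spans.add((ent_type, start, i))
            if tag == "O":
                start, ent_type = None, None
            else:
                start, ent_type = i, tag[2:]
    return spans


def entity_prf(gold_tags, pred_tags):
    """Entity-level precision, recall and F1 for one sentence."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted tags for a four-word sentence.
gold_tags = ["B-PER", "I-PER", "O", "B-LOC"]
pred_tags = ["B-PER", "O", "O", "B-LOC"]  # the person span is cut short

precision, recall, f1 = entity_prf(gold_tags, pred_tags)
token_accuracy = sum(g == p for g, p in zip(gold_tags, pred_tags)) / len(gold_tags)

print(precision, recall, f1)
print(token_accuracy)
```

Three of the four tokens are tagged correctly, yet only one of the two entities is an exact match. This is why entity-level scores are typically lower than token-level accuracy.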

#### Let's experiment

In [None]:
from colorama import Fore

In [None]:
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

In [None]:
type2color = {
    "PER": Fore.GREEN,
    "ORG": Fore.BLUE,
    "LOC": Fore.RED,
    "MISC": Fore.YELLOW
}

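def detect_and_print_entities(text: str) -> None:
    """Print `text` with every detected entity token colored by its type."""
    ents = ner_pipeline(text)
    if len(ents) == 0:
        print(text)
        return

    # Text before the first entity is printed unchanged.
    print(text[:ents[0]["start"]], end="")
    for i, ent in enumerate(ents):
        # Strip the "B-"/"I-" prefix to look up the color of the entity type.
        color = type2color[ent["entity"][2:]]
        print(color + text[ent["start"]:ent["end"]], end="")
        # Reset to the default color (readable on light and dark themes),
        # then print the plain text up to the next entity or the end.
        next_start = ents[i + 1]["start"] if i < len(ents) - 1 else len(text)
        print(Fore.RESET + text[ent["end"]:next_start], end="")
    print()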
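
Without further options, the `ner` pipeline returns one prediction per token, so a multi-word name arrives as several pieces. The sketch below shows one way to merge such pieces into whole entities using their character offsets. The helper `merge_token_entities` and the sample predictions are hypothetical; the dictionaries only mimic the `entity`, `start` and `end` keys used above. The pipeline also accepts an `aggregation_strategy` argument that performs a similar grouping for you.

```python
def merge_token_entities(text, token_ents):
    """Merge per-token predictions into entity spans with their surface text."""
    merged = []
    for ent in token_ents:
        ent_type = ent["entity"][2:]
        # Extend the previous span if this token continues the same entity type
        # and only whitespace (or nothing) separates the two pieces.
        continues = (
            bool(merged)
            and ent["entity"].startswith("I-")
            and merged[-1]["type"] == ent_type
            and text[merged[-1]["end"]:ent["start"]].strip() == ""
        )
        if continues:
            merged[-1]["end"] = ent["end"]
        else:
            merged.append({"type": ent_type, "start": ent["start"], "end": ent["end"]})
    for span in merged:
        span["text"] = text[span["start"]:span["end"]]
    return merged


# Hypothetical per-token output for a short sentence.
sample_text = "Valtteri Bottas won in Sochi"
sample_ents = [
    {"entity": "B-PER", "start": 0, "end": 4},    # "Valt"
    {"entity": "I-PER", "start": 4, "end": 8},    # "teri"
    {"entity": "I-PER", "start": 9, "end": 15},   # "Bottas"
    {"entity": "B-LOC", "start": 23, "end": 28},  # "Sochi"
]

merged_ents = merge_token_entities(sample_text, sample_ents)
for span in merged_ents:
    print(span["type"], span["text"])
```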

In [None]:
detect_and_print_entities("Lewis Hamilton's quest for the all-time record of Formula 1 wins was put on hold when he was hit with penalties at the Russian Grand Prix. Hamilton's Mercedes team-mate Valtteri Bottas dominated after the world champion was given a 10-second penalty for doing two illegal practice starts. Bottas was on the better strategy - starting on the medium tyres while Hamilton was on softs after a chaotic qualifying session for the Briton - and was tracking Hamilton in the early laps waiting for the race to play out. Behind the top three, Racing Point's Sergio Perez and Renault's Daniel Ricciardo had equally lonely races, the Australian having sufficient pace to overcome a five-second penalty for failing to comply with rules regarding how to rejoin the track when a car runs wide at Turn Two. Ferrari's Charles Leclerc made excellent use of a long first stint on the medium tyres to vault up from 11th on the grid to finish sixth, ahead of the second Renault of Esteban Ocon, the Alpha Tauris of Daniil Kvyat and Pierre Gasly and Alexander Albon's Red Bull. What's next? The Eifel Grand Prix on 11 October as the Nurburgring returns to the F1 calendar for the first time since 2013. The 24-hour touring car race there this weekend has been hit with miserable wet and wintery conditions in the Eifel mountains. Will F1 face the same?")

In [None]:
detect_and_print_entities("Sir David Attenborough has broken Jennifer Aniston's record for the fastest time to reach a million followers on Instagram. At 94 years young, the naturalist's follower count raced to seven figures in four hours 44 minutes on Thursday, according to Guinness World Records. His debut post said: \'Saving our planet is now a communications challenge.\' Last October, Friends star Aniston reached the milestone in five hours and 16 minutes. Sir David's Instagram debut precedes the release of a book and a Netflix documentary, both titled A Life On Our Planet.")

In [None]:
detect_and_print_entities("Using Lidar, in 2016 the Foundation for Maya Cultural and Natural Heritage launched the largest archaeological survey ever undertaken of the Maya lowlands. In the first phase, whose results were published in 2018, they mapped 2,100km of the Maya Biosphere Reserve. Their hope in the further phases – the second one of which took place in summer 2019, while I was there – is to triple the coverage area. That would make the project the largest Lidar survey not only in Central America, but in the world.")

Can we also use ChatGPT for this?

In [None]:
%%capture
!pip install -q openai

In [None]:
import openai
import requests

In [None]:
openai.api_key = ""

In [None]:
def call_model(msg: str, temperature: float = 1., top_p: float = 1., model: str = "gpt-3.5-turbo", system: str | None = None):
    """Send one user message (and an optional system prompt) and return the reply text."""
    URL = "https://api.openai.com/v1/chat/completions"

    # The optional system message goes first, followed by the user message.
    messages = [{"role": "user", "content": msg}]
    if system is not None:
        messages.insert(0, {"role": "system", "content": system})

    payload = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
    }

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {openai.api_key}"
    }

    response = requests.post(URL, headers=headers, json=payload, stream=False)
    # Fail with a clear HTTP error (e.g. a missing API key) instead of a KeyError below.
    response.raise_for_status()
    return response.json()['choices'][0]['message']['content'].strip()

In [None]:
call_model("What's the capital of Slovenia?")

Try extracting the entities from the examples above using ChatGPT.
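
One possible starting point is sketched below. The helpers `build_ner_prompt` and `parse_entities`, the prompt wording, and the JSON answer format are all assumptions made for illustration, not a recommended or tested prompt. The reply is simulated with a hard-coded string so that the cell runs without an API key; in the notebook you would pass the prompt to `call_model` instead.

```python
import json

ENTITY_TYPES = ["PER", "ORG", "LOC", "MISC"]


def build_ner_prompt(text):
    """Build a prompt that asks for entities as a JSON list (hypothetical wording)."""
    return (
        "Extract all named entities from the text below. "
        f"Use only these types: {', '.join(ENTITY_TYPES)}. "
        'Answer with a JSON list of objects like {"text": "...", "type": "..."} '
        "and nothing else.\n\n"
        f"Text: {text}"
    )


def parse_entities(reply):
    """Parse the reply into (text, type) pairs; return [] if it is not valid JSON."""
    try:
        items = json.loads(reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    return [
        (item["text"], item["type"])
        for item in items
        if isinstance(item, dict)
        and "text" in item
        and item.get("type") in ENTITY_TYPES  # drop types we did not ask for
    ]


prompt = build_ner_prompt("Lewis Hamilton drives for Mercedes.")
# In the notebook: reply = call_model(prompt, temperature=0.)
# Simulated reply, including one item with an unexpected type:
fake_reply = (
    '[{"text": "Lewis Hamilton", "type": "PER"}, '
    '{"text": "Mercedes", "type": "ORG"}, '
    '{"text": "drives", "type": "VERB"}]'
)

entities = parse_entities(fake_reply)
print(entities)
```

Because a chat model may answer in free text or invent entity types, the parser is deliberately defensive. Comparing its output with the fine-tuned model on the same passages is a useful exercise.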