# User guide

Read this section before running the code.

This is a Python 3 module for answering questions about processes. Once every cell in "The code" section has been executed in order, the program prompts the user for a question and attempts to answer it using machine learning models for user intent recognition and data extraction.

<br />

## Executing the code and running the chatbot

To run the chatbot, run the entire "The code" section.

Before running the section, make sure you have an appropriate JSON process model and the correct paths (as strings) to the training data: one for user intent recognition and one for NER. Run in GPU mode if possible, since training takes noticeably longer on CPU.

The function `chatbot(process)` is the core function for interacting with the chatbot. It takes the name of a process as its parameter and prompts the user through the built-in `input()` function. Once the models are trained, call `chatbot` with the name of a process defined in the JSON.

<br />

## How it works

For a given query (user input), the model first predicts the question domain (the user intent). If the predicted domain requires data from the query, it then tries to extract the name of a task of the specified process. With these two pieces of information, it selects and prints one of a few hardcoded responses.
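
The flow described above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual `chatbot` implementation: the response wording, the `answer` function, the `NEEDS_TASK` set and the two callables `predict_intent` and `extract_task` are hypothetical stand-ins for the trained models, and the meaning attached to the intent labels `P1` to `P4` is an assumption inferred from the sample queries.

```python
from typing import Callable, Dict, Optional

# Hypothetical hardcoded responses, keyed by intent label (assumed meanings).
RESPONSES: Dict[str, str] = {
    "P1": "To complete '{process}', go through these steps: {steps}.",
    "P2": "'{process}' takes {duration}.",
    "P3": "{task}: {description}",
    "P4": "'{task}' takes {duration}.",
}
# Intents that are assumed to need a task name extracted from the query.
NEEDS_TASK = {"P3", "P4"}
FALLBACK = "Sorry, I do not know how to answer that."


def answer(query: str, process: Dict,
           predict_intent: Callable[[str], str],
           extract_task: Callable[[str], Optional[str]]) -> str:
    """Pick a hardcoded response from the predicted intent and extracted task."""
    intent = predict_intent(query)
    template = RESPONSES.get(intent)
    if template is None:
        return FALLBACK

    # Fields available to process-level questions.
    fields = {
        "process": process["name"],
        "steps": ", ".join(t["name"] for t in process["phases"]),
        "duration": process["duration"],
    }
    if intent in NEEDS_TASK:
        # Task-level questions need the task named in the query.
        task_name = extract_task(query)
        task = next((t for t in process["phases"] if t["name"] == task_name), None)
        if task is None:
            return FALLBACK
        fields = {
            "task": task["name"],
            "description": task["description"],
            "duration": task["duration"],
        }
    return template.format(**fields)


process = {
    "name": "Praksa",
    "duration": "2 mjeseca",
    "phases": [
        {
            "name": "Odabir preferencija",
            "description": "Odabir preferencija je prvi korak u procesu polaganja prakse.",
            "duration": "1 mjesec",
        }
    ],
}

# The lambdas stand in for the two trained models.
print(answer("Koliko traje praksa?", process, lambda q: "P2", lambda q: None))
```

Passing the two models in as callables keeps the response logic testable without training anything.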

<br />

## Some notes:

- `!pip install tensorflow_text` requires a runtime restart (needed by the user intent recognition model)
- Execution in GPU mode: ~5 minutes
- Execution in CPU mode: ~30 minutes
- JSON processes should be linear
- The models were developed for simple queries

<br />

## Sources:

User intent model: https://github.com/AldoF95/intent_recognition_masters_thesis

LaBSE 2 base model (used in user intent model): https://tfhub.dev/google/LaBSE/2

Fine-tuning the model for the NER task (tutorial for the first step of data extraction): https://github.com/dmoonat/Named-Entity-Recognition/blob/main/Fine_tune_NER.ipynb

WikiAnn dataset (used in data extraction model): https://huggingface.co/datasets/wikiann

XLM-RoBERTa base model (used in data extraction model): https://huggingface.co/xlm-roberta-base


all-MiniLM-L6-v2 sentence transformer (used for data extraction): https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

# The code

## Execution and training times:

CPU
- Execution: ~35m (training included) 
- Training: ~27m

GPU
- Execution: ~5m 30s (training included)
- Training: ~1m

*- updated 07/09/2022 -*

In [None]:
# Define the process that the models will be trained for

# trainedProcess = "završni"
# trainedProcessJSON = "Izrada završnog rada"

trainedProcess = "praksa"
trainedProcessJSON = "Praksa"

JSON setup

In [None]:
json = [
    {
        "name": "Praksa",
        "phases": [
            {
                "name": "Odabir preferencija",
                "alias": ["Prijava prakse", "Odabir zadatka", "Prvi korak"],
                "description": "Odabir preferencija je prvi korak u procesu polaganja prakse. Zahtjeva da student odabere zadatak sa popisa...",
                "duration": "1 mjesec",
            },
            {
                "name": "Ispunjavanje prijavnice",
                "description": "Ispunjavanje prijavnice je drugi korak u procesu polaganja prakse. Student mora ispuniti prijavnicu koja se nalazi na stranici kolegija...",
                "duration": "1 tjedan",
            },
            {
                "name": "Predaja dnevnika prakse",
                "alias": ["Završetak prakse", "Dnevnik"],
                "description": "Predaja dnevnika prakse zadnji je korak u procesu polaganja prakse. S završetkom rada, student predaje dnevnik prakse na stranicu kolegija...",
                "duration": "3 dana",
            },
        ],
        "duration": "2 mjeseca",
    },
    {
        "name": "Izrada završnog rada",
        "phases": [
            {
                "name": "Prijava teme",
                "alias": ["Prvi korak"],
                "description": "Prvi korak u procesu izrade završnog rada je prijava teme. Zahtjeva da student odabere mentora te prijavi temu sa popisa...",
                "duration": "5 dana",
            },
            {
                "name": "Ispuna obrasca",
                "description": "Student ispunjava obrazac sa prijavljenom temom...",
                "duration": "4 dana",
            },
            {
                "name": "Obrana rada",
                "description": "Student brani svoj rad pred komosijom...",
                "duration": "1 sat",
            },
        ],
        "duration": "3 mjeseca",
    },
]

# If a task does not contain the alias property, assign an empty list to it
for process in json:
    for task in process["phases"]:
        if "alias" not in task:
            task["alias"] = []
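
Once every task has an `alias` list, a task can be looked up by either its name or one of its aliases. The sketch below illustrates one way to do that; it is an assumption, not the notebook's method. The helper `find_task`, its `threshold` parameter and the use of `difflib` string similarity are hypothetical (the notebook lists a sentence transformer for this step), and `sample_process` is a shortened copy of the "Praksa" entry above.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional

# Shortened copy of the "Praksa" process, for illustration only.
sample_process = {
    "name": "Praksa",
    "phases": [
        {"name": "Odabir preferencija",
         "alias": ["Prijava prakse", "Odabir zadatka", "Prvi korak"]},
        {"name": "Ispunjavanje prijavnice", "alias": []},
        {"name": "Predaja dnevnika prakse",
         "alias": ["Završetak prakse", "Dnevnik"]},
    ],
}


def find_task(process: Dict, query: str, threshold: float = 0.6) -> Optional[Dict]:
    """Return the task whose name or alias is most similar to `query`.

    Returns None when no candidate reaches `threshold`.
    """
    best_task, best_score = None, 0.0
    for task in process["phases"]:
        for candidate in [task["name"], *task.get("alias", [])]:
            # Character-level similarity in [0, 1]; case is ignored.
            score = SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
            if score > best_score:
                best_task, best_score = task, score
    return best_task if best_score >= threshold else None


# An inflected form still resolves to the base task name.
print(find_task(sample_process, "odabira preferencija")["name"])
# An alias resolves to the task that owns it.
print(find_task(sample_process, "dnevnik")["name"])
```

Character similarity tolerates small inflection differences but knows nothing about meaning, which is one reason an embedding model can be the better choice here.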

## User intent recognition model
Source: https://github.com/AldoF95/intent_recognition_masters_thesis

`!pip install tensorflow_text` requires runtime restart


CPU time
- Execution time: ~6m (training included)
- Training time: ~3m (10 epochs)

GPU time
- Execution time: ~3m (training included)
- Training time: ~15s (10 epochs)

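Note: `pd.read_excel` reads only the first sheet when `sheet_name` is not specified, so the user intent data is expected to be on the first sheet of the spreadsheet.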


In [None]:
# Define training epochs
training_epochs = 10
label_size = 6


# Define dataset URL for training

# UIDatasetURL = "/content/User intent chatbot data.xlsx"
UIDatasetURL = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSPR-FPTMBcYRynP4JdwYQQ8dAhSx1x8i1LPckUcuIUUlrWT82b5Thqb1bBNnPeGJPxxX1CJAlFSd6F/pub?output=xlsx'

In [None]:
# May require a runtime restart on Google Colab
!pip install tensorflow_text

In [None]:
!pip install text-hr

### Data loading

- Define the preprocessor and the base model
- LaBSE 2 base model used: https://tfhub.dev/google/LaBSE/2
- Load the data from the published Google spreadsheet
- Merge categories and normalize the data within them

In [None]:
import tensorflow as tf
import tensorflow_text as tft
import tensorflow_hub as tfh
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Text preprocessor for bert based models
preprocessor = tfh.KerasLayer('https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-preprocess/2')

# Language Agnostic BERT sentence encoder
model = tfh.KerasLayer('https://tfhub.dev/google/LaBSE/2')



In [None]:
# Read the data
import pandas as pd
data = pd.read_excel(UIDatasetURL)

In [None]:
columns = ['text', 'intent', 'process']
data.columns = columns

In [None]:
data = data[data["process"] == trainedProcess].drop(columns="process")

In [None]:
data.head()

#### Category merging

In [None]:
# Convert categories to codes
data['intent'] = data['intent'].astype('category')
data['intent_codes'] = data['intent'].cat.codes

In [None]:
# Display the distribution of codes
values = data['intent'].value_counts()
plt.stem(values)

#### Normalize data

### Text preprocessing

1. Remove punctuation
2. Lowercase the text
3. Apply tokenization
4. Remove stopwords
5. Apply lemmatizer
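
As a compact illustration of these steps, the sketch below chains punctuation removal, lowercasing, tokenization and stopword removal in plain Python. The function `preprocess` and the tiny `STOPWORDS_HR` set are hypothetical and only for demonstration; the notebook builds its Croatian stopword list from `text_hr`. The lemmatization step is left out here.

```python
import re
import string

# Tiny illustrative stopword set (an assumption, not the text_hr list).
STOPWORDS_HR = {"što", "sve", "moram", "za", "koji", "su", "je", "kako", "se", "koliko"}


def preprocess(text: str, stopwords=STOPWORDS_HR) -> str:
    """Strip punctuation, lowercase, tokenize on whitespace, drop stopwords."""
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    tokens = re.split(r"\s+", no_punct.lower().strip())
    kept = [tok for tok in tokens if tok and tok not in stopwords]
    # Join back into one string, the form the sentence encoder expects.
    return " ".join(kept)


print(preprocess("Koliko traje praksa?"))
print(preprocess("Što sve moram napraviti za praksu"))
```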

In [None]:
import string
import re
import nltk
import text_hr

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [None]:
def remove_punctuation(text):
    return "".join([i for i in text if i not in string.punctuation])

def tokenization(text):
    return re.split(r"\s+",text)

stopwords = nltk.corpus.stopwords.words('english')
def remove_stopwords(text):
    return [i for i in text if i not in stopwords]

porter_stemmer = PorterStemmer()
def stemming(text):
    return [porter_stemmer.stem(word) for word in text]

wordnet_lemmatizer = WordNetLemmatizer()
def lemmatizer(text):
    return [wordnet_lemmatizer.lemmatize(word) for word in text]

In [None]:
data['text'] = data['text']\
    .apply(lambda x: remove_punctuation(x))\
    .apply(lambda x: x.lower())\
    .apply(lambda x: tokenization(x))\
    .apply(lambda x: lemmatizer(x))

In [None]:
data['text'].head()

0    [što, sve, moram, napraviti, za, praksu]
1     [koji, su, koraci, za, obaviti, praksu]
2             [šta, je, odabir, preferencija]
3        [kako, se, predaje, dnevnik, praske]
4                     [koliko, traje, praksa]
Name: text, dtype: object

In [None]:
stop_words_list_hr = []
for word_base, l_key, cnt, _suff_id, wform_key, wform in text_hr.get_all_std_words():
    if word_base is not None: stop_words_list_hr.append(word_base)
    if wform is not None: stop_words_list_hr.append(wform)

In [None]:
stop_words_list_hr = list(dict.fromkeys(stop_words_list_hr))
len(stop_words_list_hr)

1207

In [None]:
def remove_stopwords_hr(text):
    output = [i for i in text if i not in stop_words_list_hr]
    return output

In [None]:
data['text'] = data['text'].apply(lambda x: remove_stopwords_hr(x))

In [None]:
data['text'].head()

0            [napraviti, praksu]
1      [koraci, obaviti, praksu]
2    [šta, odabir, preferencija]
3     [predaje, dnevnik, praske]
4                [traje, praksa]
Name: text, dtype: object

In [None]:
data['text'] = data['text'].str.join(" ")
data['text'].head()

0           napraviti praksu
1      koraci obaviti praksu
2    šta odabir preferencija
3     predaje dnevnik praske
4               traje praksa
Name: text, dtype: object

In [None]:
data.head()

Unnamed: 0,text,intent,intent_codes
0,napraviti praksu,P1,0
1,koraci obaviti praksu,P1,0
2,šta odabir preferencija,P3,2
3,predaje dnevnik praske,P3,2
4,traje praksa,P2,1


### Split validation and training data

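The intended split is 75% training and 25% validation. As written, the code below samples `frac=0` for validation (the `frac=0.25` line is commented out), so all examples go to the training set, which is also reused as the validation set.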
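
For reference, a per-class (stratified) 75/25 split can be sketched in plain Python as shown below. The function `stratified_split` and its parameters are hypothetical and not part of the notebook, which performs the split with pandas.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_split(rows: List[Dict], label_key: str = "intent",
                     val_frac: float = 0.25, seed: int = 0) -> Tuple[List[Dict], List[Dict]]:
    """Split rows into (train, validation), sampling `val_frac` of every class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, val = [], []
    for group in by_label.values():
        group = group[:]          # copy so the caller's data is left untouched
        rng.shuffle(group)
        n_val = round(len(group) * val_frac)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val


rows = [{"text": f"q{i}", "intent": "P1"} for i in range(8)] + \
       [{"text": f"r{i}", "intent": "P2"} for i in range(4)]
train_rows, val_rows = stratified_split(rows)
print(len(train_rows), len(val_rows))
```

Splitting inside each class keeps rare intents represented in both sets, which matters when a class has only a handful of examples.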

In [None]:
codes = data['intent_codes'].unique()

In [None]:
# Variable to understand the meaning behind codes
CODES_REPR = data[["intent_codes", "intent"]].drop_duplicates().sort_values("intent_codes")


def codeToIntent(prediction) -> str:
    """ Returns the intent of the prediction, not the code """
    return CODES_REPR[CODES_REPR["intent_codes"] == prediction.argmax()].iloc[0]["intent"]

In [None]:
preprocessed_validation_data = pd.DataFrame(columns=data.columns)
preprocessed_train_data = pd.DataFrame(columns=data.columns)

for c in codes:
    sample = data[data['intent_codes'] == c]
    sample = sample.sample(frac=1)
    # val = sample.sample(frac=0.25)
    val = sample.sample(frac=0)
    train = pd.concat([sample, val]).drop_duplicates(keep=False)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
    preprocessed_validation_data = pd.concat([preprocessed_validation_data, val], ignore_index=True)
    preprocessed_train_data = pd.concat([preprocessed_train_data, train], ignore_index=True)

In [None]:
# Preprocessed google translation data
train_data_eng = preprocessed_train_data[['text', 'intent_codes']]
train_data_eng.columns = ['text', 'intent_codes']

validation_data_eng = preprocessed_validation_data[['text', 'intent_codes']]
validation_data_eng.columns = ['text', 'intent_codes']

In [None]:
def df_to_dataset(df, shuffle=True, batch_size=16):
    df = df.copy()
    labels = df.pop('intent_codes')
    # One-hot encode the integer intent codes
    labels_cat = tf.keras.utils.to_categorical(labels, label_size)
    dataset = tf.data.Dataset.from_tensor_slices((dict(df), labels_cat))
    if shuffle:
        dataset = dataset.shuffle(buffer_size=len(df))
    dataset = dataset.batch(batch_size).prefetch(batch_size)
    return dataset

In [None]:
_validation = train_data_eng
train_data_eng = df_to_dataset(train_data_eng)

# validation_data_eng = df_to_dataset(validation_data_eng)
validation_data_eng = df_to_dataset(_validation)

### Model definition and training

Training runs for 10 epochs (for testing purposes)

In [None]:
# Model builder
def model_build():
    inputs = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    encoded_input = preprocessor(inputs)
    encoder_outputs = model(encoded_input)

    x = encoder_outputs['pooled_output']
    x = tf.keras.layers.Dropout(0.1)(x)
    x = tf.keras.layers.Dense(128, activation='relu')(x)
    x = tf.keras.layers.Dropout(0.7)(x)
    outputs = tf.keras.layers.Dense(label_size, activation='softmax', name='classifier')(x)
    
    return tf.keras.Model(inputs, outputs)

# Build a model with preprocessed data 
model_eng = model_build()
model_eng.compile(
    optimizer = tf.keras.optimizers.Adam(0.001),
    # The classifier layer already applies softmax, so the loss receives probabilities, not logits
    loss = tf.keras.losses.CategoricalCrossentropy(from_logits=False),
    metrics = tf.keras.metrics.CategoricalAccuracy()
)

eng_history = model_eng.fit(
    train_data_eng,
    epochs = training_epochs,
    batch_size = 16,
    validation_data = validation_data_eng
)

Epoch 1/10


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Data extraction model (finetuning)
Source: https://github.com/dmoonat/Named-Entity-Recognition/blob/main/Fine_tune_NER.ipynb <br>

CPU time
- Execution time: ~28 min (training included)
- Training time: ~23 min (2 epochs, 1k data)
- Finetuning time: ~1 min (10 epochs, < 100 data)

GPU time
- Execution time: ~3 min (training included)
- Training time: ~40s (2 epochs, 1k data)
- Finetuning time: 1s (10 epochs, < 100 data)

In [None]:
# Define training epochs
mainEpochs = 2

# Define finetuning epochs
finetuneEpochs = 10

In [None]:
!pip install datasets -q
!pip install tokenizers -q
!pip install transformers -q
!pip install seqeval -q



### Load the datasets

Loading the main Croatian [WikiANN](https://huggingface.co/datasets/wikiann) dataset

Loading the custom Croatian fine-tuning dataset from a [Google spreadsheet](https://docs.google.com/spreadsheets/d/e/2PACX-1vSPR-FPTMBcYRynP4JdwYQQ8dAhSx1x8i1LPckUcuIUUlrWT82b5Thqb1bBNnPeGJPxxX1CJAlFSd6F/pub?output=xlsx)

In [None]:
from datasets import load_dataset, Dataset
import pandas as pd

# Main training data
dataset = load_dataset("wikiann", "hr")

# Define dataset URL for training

# UIDatasetURL = "/content/User intent chatbot data.xlsx"
UIDatasetURL = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSPR-FPTMBcYRynP4JdwYQQ8dAhSx1x8i1LPckUcuIUUlrWT82b5Thqb1bBNnPeGJPxxX1CJAlFSd6F/pub?output=xlsx'

# Finetuning data
nerData = pd.read_excel(UIDatasetURL, sheet_name="List 2")

Downloading builder script:   0%|          | 0.00/3.94k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/12.6k [00:00<?, ?B/s]

Downloading and preparing dataset wikiann/hr (download: 223.17 MiB, generated: 9.27 MiB, post-processed: Unknown size, total: 232.44 MiB) to /root/.cache/huggingface/datasets/wikiann/hr/1.1.0/4bfd4fe4468ab78bb6e096968f61fab7a888f44f9d3371c2f3fea7e74a5a354e...


Downloading data:   0%|          | 0.00/234M [00:00<?, ?B/s]

Generating validation split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/20000 [00:00<?, ? examples/s]

Dataset wikiann downloaded and prepared to /root/.cache/huggingface/datasets/wikiann/hr/1.1.0/4bfd4fe4468ab78bb6e096968f61fab7a888f44f9d3371c2f3fea7e74a5a354e. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
nerTags = [[int(nerTag) for nerTag in i.split(' ')] for i in nerData['ner_tags'].values.tolist()]
langs = [['hr'] * len(i) for i in nerTags]
tokens = [tokens.split(' ') for tokens in nerData['tokens'].values.tolist()]
spans = [[spans] for spans in nerData['spans'].values.tolist()]

# Convert data to Dataset
fineTunedDs = Dataset.from_dict({
    'langs': langs,
    'ner_tags': nerTags,
    'spans': spans,
    'tokens': tokens
})

In [None]:
# label_names = dataset["train"].features["ner_tags"].feature.names
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

In [None]:
nerData = nerData[nerData["process"] == trainedProcess].drop(columns="process")

In [None]:
nerData

Unnamed: 0,ner_tags,tokens,spans
0,0 0 0 0 3 4,Imam pitanje u vezi odabira preferencija,ORG: Odabir preferencija
1,0 0 0 0 0 3 4,Šta sve moram napraviti za ispunjavanje prijav...,ORG: Ispunjavanje prijavnice
2,0 0 3 4 4,Kada se predaje dnevnik prakse,ORG: Predaja dnevnika prakse
3,0 0 0 3 4,Kako se obavlja prijava prakse,ORG: Prijava prakse
4,0 0 3 4,Gdje obavljam odabir zadatka?,ORG: Odabir zadatka
5,0 0 3 4 0 0 0,Koji je prvi korak kod prijave prakse,ORG: Prvi korak
6,0 0 3 4,Kada je završetak prakse,ORG: Završetak prakse
7,0 0 3,Trebam predati dnevnik,ORG: Dnevnik
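
The integer codes in `ner_tags` index into `label_names`, so `3 4` reads as `B-ORG I-ORG`. The sketch below decodes such a tag sequence back into entity spans. The helper `decode_entities` is hypothetical and shown only to illustrate the tagging scheme; it is not used by the notebook.

```python
from typing import List, Tuple

LABEL_NAMES = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']


def decode_entities(tokens: List[str], tags: List[int]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []

    def flush():
        # Store the span collected so far, if there is one.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        name = LABEL_NAMES[tag]
        if name.startswith("B-"):
            flush()
            current_type, current_tokens = name[2:], [token]
        elif name.startswith("I-") and current_type == name[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span
            flush()
            current_type, current_tokens = None, []
    flush()
    return entities


tokens = "Imam pitanje u vezi odabira preferencija".split(" ")
print(decode_entities(tokens, [0, 0, 0, 0, 3, 4]))
# [('ORG', 'odabira preferencija')]
```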


### Data preprocessing (tokenization)

Using [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) tokenizer

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

Downloading config.json:   0%|          | 0.00/615 [00:00<?, ?B/s]

Downloading sentencepiece.bpe.model:   0%|          | 0.00/4.83M [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/8.68M [00:00<?, ?B/s]

In [None]:
# Get the values for input_ids, attention_mask, adjusted labels
def tokenize_adjust_labels(all_samples_per_split):
    tokenized_samples = tokenizer.batch_encode_plus(all_samples_per_split["tokens"], is_split_into_words=True, truncation=True)

    total_adjusted_labels = []
  
    for k in range(0, len(tokenized_samples["input_ids"])):
        prev_wid = -1
        word_ids_list = tokenized_samples.word_ids(batch_index=k)
        existing_label_ids = all_samples_per_split["ner_tags"][k]
        i = -1
        adjusted_label_ids = []
   
        for word_idx in word_ids_list:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                adjusted_label_ids.append(-100)
            elif word_idx != prev_wid:
                i += 1
                adjusted_label_ids.append(existing_label_ids[i])
                prev_wid = word_idx
            else:
                label_name = label_names[existing_label_ids[i]]
                adjusted_label_ids.append(existing_label_ids[i])
                
        total_adjusted_labels.append(adjusted_label_ids)
    
    # Add adjusted labels to the tokenized samples
    tokenized_samples["labels"] = total_adjusted_labels
    return tokenized_samples

# Apply tokenization to both main and finetuning datasets  
tokenized_dataset = dataset.map(tokenize_adjust_labels, batched=True, remove_columns=['tokens', 'ner_tags', 'langs', 'spans'])
tokenizedFineTunedDs = fineTunedDs.map(tokenize_adjust_labels, batched=True, remove_columns=['tokens', 'ner_tags', 'langs', 'spans'])

  0%|          | 0/10 [00:00<?, ?ba/s]

  0%|          | 0/10 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]
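
The label adjustment above can be illustrated without a tokenizer. In the sketch below, `word_ids` stands for the list returned by `tokenized_samples.word_ids(...)`, where `None` marks a special token and repeated ids mark subword pieces of the same word. The function `align_labels` is a hypothetical simplification that indexes labels directly by word id; it is not the notebook's function.

```python
from typing import List, Optional


def align_labels(word_ids: List[Optional[int]], word_labels: List[int],
                 ignore_index: int = -100) -> List[int]:
    """Expand word-level labels to one label per subword token."""
    aligned = []
    for word_id in word_ids:
        if word_id is None:
            # Special tokens are ignored by the loss function.
            aligned.append(ignore_index)
        else:
            # Every subword piece inherits the label of its word.
            aligned.append(word_labels[word_id])
    return aligned


# Three words; the second one is split into two subword pieces.
print(align_labels([None, 0, 1, 1, 2, None], [0, 3, 4]))
# [-100, 0, 3, 3, 4, -100]
```

Because continuation pieces repeat their word's label, a `B-` tag can appear on several consecutive tokens, which matches the repeated `3 3` patterns in the predictions printed further down.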

### Preparations

Using [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) model

In [None]:
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForTokenClassification, AdamW

In [None]:
# Check if gpu is present
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

In [None]:
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-base", num_labels=len(label_names))
model.to(device)

In [None]:
import numpy as np
from datasets import load_metric
metric = load_metric("seqeval")

def compute_metrics(p):
    predictions, labels = p

    # Select predicted index with maximum logit for each token
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_names[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

Downloading builder script:   0%|          | 0.00/2.47k [00:00<?, ?B/s]

### Model training and finetuning

```
batch_size = 16
mainEpochs = 2  # previously defined
finetuneEpochs = 10  # previously defined
```

In [None]:
from transformers import TrainingArguments, Trainer, DataCollatorForTokenClassification

batch_size = 16
# logging_steps = len(tokenized_dataset['train']) // batch_size
# logging_steps = len(Dataset.from_dict(tokenized_dataset["validation"][:1000])) // batch_size
# logging_steps = len(tokenizedFineTunedDs) // batch_size


training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/bert-fine-tune-ner/results",
    # num_train_epochs=epochs,
    num_train_epochs=mainEpochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    evaluation_strategy="epoch",
    disable_tqdm=False,
    # logging_steps=logging_steps,
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,

    # Select only the first 1000 examples
    train_dataset=Dataset.from_dict(tokenized_dataset["train"][:1000]),
    eval_dataset=Dataset.from_dict(tokenized_dataset["validation"][:1000]),
    
    # train_dataset=tokenized_dataset["train"],
    # eval_dataset=tokenized_dataset["validation"],
    
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
# Fine tune using train method
trainer.train()

***** Running training *****
  Num examples = 1000
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 126


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,0.514716,0.468152,0.655076,0.54606,0.845334
2,No log,0.37752,0.663594,0.748201,0.703363,0.891849


***** Running Evaluation *****
  Num examples = 1000
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=126, training_loss=0.7115381634424603, metrics={'train_runtime': 32.1524, 'train_samples_per_second': 62.204, 'train_steps_per_second': 3.919, 'total_flos': 38747161184160.0, 'train_loss': 0.7115381634424603, 'epoch': 2.0})

In [None]:
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/bert-fine-tune-ner/results",
    num_train_epochs=finetuneEpochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    evaluation_strategy="epoch",
    disable_tqdm=False,
    # logging_steps=logging_steps,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenizedFineTunedDs,
    eval_dataset=tokenizedFineTunedDs,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

***** Running training *****
  Num examples = 12
  Num Epochs = 10
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 10


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,1.021512,0.4,0.4,0.4,0.640449
2,No log,0.862334,0.636364,0.35,0.451613,0.707865
3,No log,0.644364,0.5,0.35,0.411765,0.741573
4,No log,0.470643,0.705882,0.6,0.648649,0.831461
5,No log,0.382795,0.764706,0.65,0.702703,0.853933
6,No log,0.339815,0.8125,0.65,0.722222,0.865169
7,No log,0.309937,0.764706,0.65,0.702703,0.876404
8,No log,0.285355,0.789474,0.75,0.769231,0.910112
9,No log,0.263799,0.75,0.75,0.75,0.910112
10,No log,0.251271,0.75,0.75,0.75,0.910112


***** Running Evaluation *****
  Num examples = 12
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=10, training_loss=0.8014575004577636, metrics={'train_runtime': 2.017, 'train_samples_per_second': 59.495, 'train_steps_per_second': 4.958, 'total_flos': 796174544880.0, 'train_loss': 0.8014575004577636, 'epoch': 10.0})

In [None]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 12
  Batch size = 16


{'eval_loss': 0.2512713372707367,
 'eval_precision': 0.75,
 'eval_recall': 0.75,
 'eval_f1': 0.75,
 'eval_accuracy': 0.9101123595505618,
 'eval_runtime': 0.0389,
 'eval_samples_per_second': 308.728,
 'eval_steps_per_second': 25.727,
 'epoch': 10.0}

In [None]:
"""
predictions, labels, _ = trainer.predict(tokenizedFineTunedDs)
predictions = np.argmax(predictions, axis=2)

# Remove ignored index (special tokens)
true_predictions = [
    [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [label_names[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
results = metric.compute(predictions=true_predictions, references=true_labels)
results
"""


## Using the NER model

- More diverse fine tuning examples needed

In [None]:
!pip install datasets -q

In [None]:
import numpy as np
from datasets import Dataset
from typing import List, Dict

In [None]:
def datasetBuilder(text: str) -> Dataset:
    """ Returns instance of Dataset object ready for prediction """
    return Dataset.from_dict({ 'tokens': [text] })

In [None]:
# Get the values for input_ids, attention_mask
def tokenizer_encode(ds: Dataset):
    return tokenizer.batch_encode_plus([ds["tokens"][0]], is_split_into_words=False, truncation=True)

In [None]:
def getPrediction(tokenizedDs: Dataset) -> List[List[int]]:
    """ Returns a list of a list of NER codes """
    predictions, labels, _ = trainer.predict(tokenizedDs)
    return np.argmax(predictions, axis=2)

In [None]:
# Old version of `outcome`, kept for reference:
# groups the predicted tokens by entity type (PER, ORG, LOC)
def outcome_(ner_pred: List[str], initialText: str) -> Dict:
    modelLabels = ["PER", "ORG", "LOC"]

    nerDict = {i: [] for i in modelLabels}
    nerDict[""] = []

    currentStringList = []
    currentEntity = ""
    tokenizedText = tokenizer.tokenize(initialText)

    for i, x in enumerate(tokenizedText):
        if ner_pred[i] == 0:
            continue
        elif ner_pred[i] % 2 == 0:
            currentStringList.append(x)
        else:
            nerDict[currentEntity].append(" ".join(currentStringList))
            currentStringList = [x]
            currentEntity = modelLabels[(ner_pred[i] - 1) // 2]

    nerDict[currentEntity].append(" ".join(currentStringList))
    del nerDict[""]

    # Return dictionary without empty values
    return {k: v for k, v in nerDict.items() if v}


# Current version: joins every token with a non-"O" prediction into a single task string
def outcome(ner_pred: List[int], initialText: str) -> Dict:
    tokenizedText = tokenizer.tokenize(initialText)
    currentString = "".join([x for i, x in enumerate(tokenizedText) if ner_pred[i] != 0])

    # "▁" marks the start of a word in the tokenizer output; replace it with a space
    # and drop the first character of the result
    return { "Task": currentString.replace("▁", " ")[1:] }

In [None]:
def predictNER(text: str, debugging: bool=True) -> Dict:
    # Input goes here
    testDs = datasetBuilder(text)

    # Tokenize input
    tokenizedTestDs = testDs.map(tokenizer_encode, batched=True, remove_columns=['tokens'])

    # Get predictions
    true_predictions = getPrediction(tokenizedTestDs)

    if debugging: print(true_predictions)

    # Return all NERs
    return outcome(true_predictions[0][1:-1], text)

In [None]:
print(predictNER("Kako se istaknuo Marko"))
print(predictNER("U Telekomu nije bilo svijetla"))
print(predictNER("Zašto je išao u Plodine"))
print()
print(predictNER("Imam pitanje u vezi odabira preferencija"))
print(predictNER("Šta sve moram napraviti za ispunjavanje prijavnice"))
print(predictNER("Kada se predaje dnevnik prakse?"))
print()
print(predictNER("Šta da napravim ako trebam prijaviti dnevnik prakse?"))
print(predictNER("Kako ide prijavljivanje dnevnika prakse?"))
print(predictNER("Gdje da odaberem preferencije?"))
print(predictNER("Gdje da ispunim prijavnicu?"))
print(predictNER("Pomoc, ne znam kako predati dnevnik"))



  0%|          | 0/1 [00:00<?, ?ba/s]

***** Running Prediction *****
  Num examples = 1
  Batch size = 16


[[3 0 0 0 1 2]]
{'Task': 'Marko'}


  0%|          | 0/1 [00:00<?, ?ba/s]

***** Running Prediction *****
  Num examples = 1
  Batch size = 16


[[3 0 3 3 0 0 0 0 0]]
{'Task': 'Telekomu'}


  0%|          | 0/1 [00:00<?, ?ba/s]

***** Running Prediction *****
  Num examples = 1
  Batch size = 16


[[3 0 0 0 0 0 5 5 0]]
{'Task': 'Plodine'}



  0%|          | 0/1 [00:00<?, ?ba/s]

***** Running Prediction *****
  Num examples = 1
  Batch size = 16


[[3 0 0 0 0 3 3 4 4 4 3]]
{'Task': 'odabira preferencija'}


  0%|          | 0/1 [00:00<?, ?ba/s]

***** Running Prediction *****
  Num examples = 1
  Batch size = 16


[[3 0 0 0 0 0 3 3 4 4 0]]
{'Task': 'ispunjavanje prijavnice'}


  0%|          | 0/1 [00:00<?, ?ba/s]

***** Running Prediction *****
  Num examples = 1
  Batch size = 16


[[3 0 0 3 3 4 4 4 3]]
{'Task': 'predaje dnevnik prakse?'}



  0%|          | 0/1 [00:00<?, ?ba/s]

***** Running Prediction *****
  Num examples = 1
  Batch size = 16


[[0 0 0 0 0 0 0 0 0 3 4 0 0]]
{'Task': 'dnevnik prakse'}


  0%|          | 0/1 [00:00<?, ?ba/s]

***** Running Prediction *****
  Num examples = 1
  Batch size = 16


[[3 0 0 3 3 3 4 4 0 3]]
{'Task': 'prijavljivanje dnevnika prakse'}


  0%|          | 0/1 [00:00<?, ?ba/s]

***** Running Prediction *****
  Num examples = 1
  Batch size = 16


[[3 0 0 0 0 3 4 4 4 0 3]]
{'Task': 'm preferencije'}


  0%|          | 0/1 [00:00<?, ?ba/s]

***** Running Prediction *****
  Num examples = 1
  Batch size = 16


[[3 0 0 0 0 4 4 0 3]]
{'Task': 'prijavnicu'}


  0%|          | 0/1 [00:00<?, ?ba/s]

***** Running Prediction *****
  Num examples = 1
  Batch size = 16


[[0 0 0 0 0 0 0 0 3 0]]
{'Task': 'dnevnik'}


In [None]:
predictNER("Gdje se izvršava obrana rada?")

  0%|          | 0/1 [00:00<?, ?ba/s]

***** Running Prediction *****
  Num examples = 1
  Batch size = 16


[[3 0 0 0 3 3 3 4 0 3]]


{'Task': 'ava obrana rada'}

## Testing the NER model

In [None]:
testCases = [
    predictNER("Pomoc, ne znam kako predati dnevnik") == {'Task': 'predati dnevnik'},
    predictNER("Gdje da ispunim prijavnicu?") == {'Task': 'ispuni prijavnicu?'},
    predictNER("Gdje da odaberem preferencije?") == {'Task': 'odaberem preferencije'},
    predictNER("Kako ide prijavljivanje dnevnika prakse?") == {'Task': 'prijavljivanje dnevnika prakse'},
    predictNER("Šta da napravim ako trebam prijaviti dnevnik prakse?") == {'Task': 'prijaviti dnevnik prakse'}
]

all(testCases)

  0%|          | 0/1 [00:00<?, ?ba/s]

***** Running Prediction *****
  Num examples = 1
  Batch size = 16


[[0 0 0 0 0 0 0 0 3 0]]


  0%|          | 0/1 [00:00<?, ?ba/s]

***** Running Prediction *****
  Num examples = 1
  Batch size = 16


[[3 0 0 0 0 4 4 0 3]]


  0%|          | 0/1 [00:00<?, ?ba/s]

***** Running Prediction *****
  Num examples = 1
  Batch size = 16


[[3 0 0 0 0 3 4 4 4 0 3]]


  0%|          | 0/1 [00:00<?, ?ba/s]

***** Running Prediction *****
  Num examples = 1
  Batch size = 16


[[3 0 0 3 3 3 4 4 0 3]]


  0%|          | 0/1 [00:00<?, ?ba/s]

***** Running Prediction *****
  Num examples = 1
  Batch size = 16


[[0 0 0 0 0 0 0 0 0 3 4 0 0]]


False
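
Because `all(testCases)` collapses everything into a single `False`, it does not show which query failed or what the model returned instead. The block below is a minimal sketch of a more informative check. `reportMismatches` and `stubPredictNER` are hypothetical names introduced here so the example runs without the trained model; the stub's canned answers are copied from the prediction outputs printed earlier in this notebook. In the notebook itself, `predictNER` would be passed in place of the stub.

```python
from typing import Callable, Dict, List, Tuple

Prediction = Dict[str, str]


def reportMismatches(
    predict: Callable[[str], Prediction],
    cases: List[Tuple[str, Prediction]],
) -> List[Tuple[str, Prediction, Prediction]]:
    """Return (query, expected, actual) for every case whose prediction differs."""
    mismatches = []
    for query, expected in cases:
        actual = predict(query)
        if actual != expected:
            mismatches.append((query, expected, actual))
    return mismatches


def stubPredictNER(text: str) -> Prediction:
    """Hypothetical stand-in for predictNER, returning canned outputs."""
    canned = {
        "Gdje da ispunim prijavnicu?": {"Task": "prijavnicu"},
        "Kako ide prijavljivanje dnevnika prakse?": {"Task": "prijavljivanje dnevnika prakse"},
    }
    return canned.get(text, {"Task": ""})


cases = [
    ("Gdje da ispunim prijavnicu?", {"Task": "ispuni prijavnicu?"}),
    ("Kako ide prijavljivanje dnevnika prakse?", {"Task": "prijavljivanje dnevnika prakse"}),
]

failed = reportMismatches(stubPredictNER, cases)
for query, expected, actual in failed:
    print(f"{query!r}: expected {expected}, got {actual}")
```

Keeping the query next to the expected value also makes it easier to spot expectations that are themselves questionable.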

## Testing the user intent model

In [None]:
from typing import List

In [None]:
"""
examples = [
    { "text": "Trebam odradit praksu", "intent": "P1"},
    { "text": "Koliko traje praksa?", "intent": "P2"},
    { "text": "Kako ide odabir preferencija", "intent": "P3"},
    { "text": "Koliko traje odabir preferencija?", "intent": "P4"},
    { "text": "Šta ide nakon predaja dnevnika prakse?", "intent": "P5"},
    { "text": "Šta ako ne mogu doći na praksu?", "intent": "P6"},
]
"""

examples = [
    { "text": "Kako ide proces izrade završnog rada?", "intent": "P1"},
    { "text": "Koliko traje završni rad?", "intent": "P2"},
    { "text": "Kako se prijavljuje tema za završni rad?", "intent": "P3"},
    { "text": "Koliko traje prijava teme", "intent": "P4"},
    { "text": "Šta je nakon obrane rada?", "intent": "P5"},
    { "text": "Šta ako je vani kiša?", "intent": "P6"},
]

def testIntentModel(intentModel) -> List[bool]:
    """ Test the model trained above on some "must work" examples """
    text_examples = [e["text"] for e in examples]
    y_pred = intentModel.predict(text_examples, verbose=False)
    return [codeToIntent(y) == examples[i]["intent"] for i, y in enumerate(y_pred)]

# Aim to have as many Trues as possible
testResults = testIntentModel(model_eng)

print(f"Results: {testResults}")
print(f"All tests passed: {all(testResults)}")

## Sentence similarity

The [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) sentence transformer is used to match the task extracted from a query against the task names and aliases of the process.

In [None]:
!pip install -U sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [None]:
def getTaskSimilarityIndex(flatIndex: int, tasks) -> int:
    """ Map an index in the flattened task list back to the index of its task (-1 if out of range) """
    for index, task in enumerate(tasks):
        if flatIndex <= len(task["alias"]):
            return index
        
        flatIndex -= len(task["alias"]) + 1
        
    return -1

In [None]:
def getFlattenTasks(tasks) -> List[str]:
    """ Returns a flat list holding each task name followed by its aliases """
    resTasks = []

    for task in tasks:
        # Each task contributes 1 + len(alias) entries, name first
        resTasks.append(task["name"])
        resTasks.extend(task["alias"])

    return resTasks
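
The two helpers above work as a pair: one flattens names and aliases into a single list, the other maps a position in that list back to the task it came from. The following is a small worked sketch of that round trip. The helpers are repeated so the block runs on its own, and the task list (in particular the aliases) is hypothetical.

```python
from typing import Any, Dict, List


def getFlattenTasks(tasks: List[Dict[str, Any]]) -> List[str]:
    """Flat list: each task name followed by its aliases."""
    resTasks: List[str] = []
    for task in tasks:
        resTasks.append(task["name"])
        resTasks.extend(task["alias"])
    return resTasks


def getTaskSimilarityIndex(flatIndex: int, tasks: List[Dict[str, Any]]) -> int:
    """Map a flat index back to its task index, or -1 if out of range."""
    for index, task in enumerate(tasks):
        if flatIndex <= len(task["alias"]):
            return index
        flatIndex -= len(task["alias"]) + 1
    return -1


# Hypothetical tasks with 2, 1 and 0 aliases
tasks = [
    {"name": "Odabir preferencija", "alias": ["odabrati preferencije", "biranje preferencija"]},
    {"name": "Ispunjavanje prijavnice", "alias": ["ispuniti prijavnicu"]},
    {"name": "Predaja dnevnika prakse", "alias": []},
]

flat = getFlattenTasks(tasks)
mapping = [getTaskSimilarityIndex(i, tasks) for i in range(len(flat))]
print(mapping)  # [0, 0, 0, 1, 1, 2]
```

The first three flat positions belong to task 0 (one name plus two aliases), the next two to task 1, and the last one to task 2.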

In [None]:
def taskSimilarity(text: str, tasks) -> int:
    """ Returns the task index which is the most similar to the text """
    return getTaskSimilarityIndex(torch.argmax(util.pytorch_cos_sim(
        model.encode(predictNER(text), convert_to_tensor=True),
        model.encode(getFlattenTasks(tasks), convert_to_tensor=True)
    )).item(), tasks)
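
`taskSimilarity` scores the extracted text against every flattened task name by cosine similarity and keeps the position of the highest score. As a rough illustration of that step only, the sketch below does the same with plain Python and hand-written 3-dimensional vectors. The vectors are toy values, not real all-MiniLM-L6-v2 embeddings, and `cosineSimilarity` and `mostSimilarIndex` are hypothetical helper names.

```python
import math
from typing import List, Sequence


def cosineSimilarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    normA = math.sqrt(sum(x * x for x in a))
    normB = math.sqrt(sum(y * y for y in b))
    return dot / (normA * normB)


def mostSimilarIndex(query: Sequence[float], candidates: List[Sequence[float]]) -> int:
    """Index of the candidate with the highest cosine similarity to the query."""
    scores = [cosineSimilarity(query, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy stand-ins for the encoded query and the encoded flattened task names
queryVec = [0.9, 0.1, 0.0]
candidateVecs = [
    [0.0, 1.0, 0.0],
    [1.0, 0.2, 0.0],
    [0.0, 0.0, 1.0],
]

best = mostSimilarIndex(queryVec, candidateVecs)
print(best)  # 1
```

In the notebook, the resulting flat position is then passed to `getTaskSimilarityIndex` to recover the task it belongs to.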

## Using the user intent model

In [None]:
def preprocessText(text: str) -> str:
    """ Apply the same preprocessing that was used on the user intent model's training data """
    text = remove_punctuation(text)
    text = text.lower()
    text = tokenization(text)
    text = lemmatizer(text)
    text = remove_stopwords_hr(text)

    return " ".join(text)

In [None]:
def predict_intent(text: str) -> str:
    """ Predict the intent of the text using the model trained above """
    return codeToIntent(model_eng.predict([preprocessText(text)], verbose=False))

In [None]:
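def getPhases(phases) -> str:
    """ P1: Returns the phase names, lowercased and joined as 'a, b i c' """
    names = [phase["name"].lower() for phase in phases]
    # With 0 or 1 phases there is nothing to join with ' i '
    if len(names) < 2:
        return ''.join(names)
    return ', '.join(names[:-1]) + ' i ' + names[-1]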

In [None]:
# Define functions that handle output text formatting

def getP1String(process) -> str:
    return f"Faze procesa za proces '{process['name']}' su: {getPhases(process['phases'])}"

def getP2String(process) -> str:
    return f"Proces '{process['name']}' traje {process['duration']}"

def getP3String(taskName: str, task) -> str:
    return f"Kratki opis '{taskName}': {task['description']}"

def getP4String(taskName: str, task) -> str:
    return f"Proces '{taskName}' traje {task['duration']}"

def getP5String(taskIndex: int, taskName: str, process) -> str:
    if len(process["phases"]) <= taskIndex + 1:
        return f"'{taskName}' je zadnji korak u procesu '{process['name']}'"
    
    return f"Nakon '{taskName}' je '{process['phases'][taskIndex + 1]['name'].lower()}'"

def getP6String() -> str:
    return "Nažalost, ne razumijem Vaše pitanje"
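
The formatting helpers above read a handful of fields from the process object: `name`, `duration` and `phases`, where each phase carries its own `name`, `description` and `duration`. As a minimal sketch, assuming that shape, the block below builds a hypothetical process entry and exercises the P5 logic for a middle and a final phase. The phase names and the one-month duration echo the demo output in this notebook; every other value is invented for illustration, and `getP5String` is repeated so the block runs on its own.

```python
from typing import Any, Dict

# Hypothetical process entry; only the field names come from the helpers above
process: Dict[str, Any] = {
    "name": "Praksa",
    "duration": "3 mjeseca",  # invented value
    "phases": [
        {"name": "Odabir preferencija", "alias": [],
         "description": "(opis)", "duration": "1 mjesec"},
        {"name": "Ispunjavanje prijavnice", "alias": [],
         "description": "(opis)", "duration": "1 tjedan"},  # invented value
        {"name": "Predaja dnevnika prakse", "alias": [],
         "description": "(opis)", "duration": "1 tjedan"},  # invented value
    ],
}


def getP5String(taskIndex: int, taskName: str, process: Dict[str, Any]) -> str:
    # The last phase has no successor, so it gets a dedicated message
    if len(process["phases"]) <= taskIndex + 1:
        return f"'{taskName}' je zadnji korak u procesu '{process['name']}'"

    return f"Nakon '{taskName}' je '{process['phases'][taskIndex + 1]['name'].lower()}'"


middle = getP5String(1, "ispunjavanje prijavnice", process)
last = getP5String(2, "predaja dnevnika prakse", process)
print(middle)  # Nakon 'ispunjavanje prijavnice' je 'predaja dnevnika prakse'
print(last)    # 'predaja dnevnika prakse' je zadnji korak u procesu 'Praksa'
```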

In [None]:
def print_result(text: str, process) -> None:
    """ Chatbot output messages based on intent """
    intent = predict_intent(text)
    taskIndex = taskSimilarity(text, process["phases"])
    task = process["phases"][taskIndex]
    taskName = task["name"].lower()

    # P1: What are the phases
    if intent == 'P1':
        print(getP1String(process))

    # P2: How long does the whole process take
    elif intent == 'P2':
        print(getP2String(process))

    # P3: How does {task} work (e.g. "Kako ide odabir preferencija?")
    elif intent == 'P3':
        print(getP3String(taskName, task))

    # P4: How long does {task} take
    elif intent == 'P4':
        print(getP4String(taskName, task))

    # P5: What comes after {task}
    elif intent == 'P5':
        print(getP5String(taskIndex, taskName, process))

    # None of the above
    else:
        print(getP6String())

In [None]:
def chatbot(processName: str) -> None:
    """ By: Rafael Krstačić """
    currentProcess = None

    for process in json:
        if process["name"] == processName:
            currentProcess = process
            break
    else:
        raise KeyError("Process does not exist in json")

    print("Za prekid razgovora unesi 'q'")
    while True:
        user_input = input("\n>>> ")
        if user_input.lower() == "q":
            break

        print_result(user_input, currentProcess)
    print("Doviđenja! ( ^_^)/")
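
The process lookup in `chatbot` relies on Python's `for ... else`: the `else` branch runs only when the loop finishes without reaching `break`, which here means no process had the requested name. The sketch below isolates that pattern. `findProcess` and the two-entry `processes` list are hypothetical and stand in for the loaded JSON.

```python
from typing import Any, Dict, List


def findProcess(processes: List[Dict[str, Any]], processName: str) -> Dict[str, Any]:
    """Return the process with the given name, or raise KeyError."""
    for process in processes:
        if process["name"] == processName:
            break
    else:
        # Reached only when the loop ended without `break`
        raise KeyError(f"Process {processName!r} does not exist")
    return process


# Hypothetical stand-in for the loaded JSON
processes = [{"name": "Praksa"}, {"name": "Završni rad"}]

found = findProcess(processes, "Praksa")

try:
    findProcess(processes, "Nepostojeći proces")
    missingRaised = False
except KeyError:
    missingRaised = True

print(found["name"], missingRaised)
```

An empty list also ends up in the `else` branch, so a missing or empty JSON file surfaces as the same `KeyError`.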


In [None]:
# Demo queries (for process "praksa")

# kolko traje praska?
# sta ide nakon predaja dnevnika?
# sto je nakon prijavnic=
# Trebam riješiti praksu
# Kako ide oabir preferencaij?
# trajanje odabira preferenca
# U 5 mi je bus, dal cu stic?
# q

In [None]:
# Main program driver
if __name__ == "__main__":
    chatbot(trainedProcessJSON)

Za prekid razgovora unesi 'q'

>>> koliko traje praska
Proces 'odabir preferencija' traje 1 mjesec

>>> trebam rjesiti praksu
Faze procesa za proces 'Praksa' su: odabir preferencija, ispunjavanje prijavnice i predaja dnevnika prakse

>>> kolko traje odabir preferencija
Proces 'odabir preferencija' traje 1 mjesec

>>> Tko je Joe?
Nažalost, ne razumijem Vaše pitanje

>>> q
Doviđenja! ( ^_^)/
