# A general plan
* Explore metrics used to evaluate conversational flow
* Choose an open-source model that 
    1. fits my hardware capacity
    2. has nice metrics
* Choose conversational datasets to finetune
* Prepare said datasets
* Add LoRA code - thanks to yandexdataschool for additional guidance!
* Finetune the model (using the Hugging Face Trainer)

...then wrap the model into a container

## Explore metrics
Naive, based on interaction statistics
* Length of conversation: how long a chat lasts on average (in user lines)
* Time between a user's replies to the chatbot, which could indicate the level of the user's engagement. Requires analytics to establish thresholds.
* Human feedback, as in
    - "Were you satisfied with the conversation? Y/N" at the end of the conversation - does not give fine-grained information about what happened during the conversation that led to a particular assessment
    - a grade from 1 to 10 - same limitation as above.
    - This is overall messy, but it makes sense for such feedback to be collected, then post-analyzed by assessors.
* How often do users come back to chat again? Need users!
* How often is a chat with a model abandoned at the start? Need users!
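
The interaction statistics above could be computed from a chat log along these lines. This is only a sketch: the log format (chat id, speaker, ISO timestamp), the function name `interaction_stats` and the "fewer than 2 user lines means abandoned" threshold are all assumptions made for illustration, since there are no real users or logs yet.

```python
from datetime import datetime
from statistics import mean

# Hypothetical log: (chat_id, speaker, ISO timestamp), ordered by time within each chat.
log = [
    ("a", "bot", "2023-08-01T10:00:00"),
    ("a", "user", "2023-08-01T10:00:20"),
    ("a", "bot", "2023-08-01T10:00:21"),
    ("a", "user", "2023-08-01T10:01:01"),
    ("b", "bot", "2023-08-01T11:00:00"),
    ("b", "user", "2023-08-01T11:00:05"),
]

def interaction_stats(log, abandoned_below=2):
    user_lines = {}  # chat_id -> number of user lines
    delays = []      # seconds between a bot line and the user's next line
    last = {}        # chat_id -> (speaker, timestamp) of the previous line
    for chat_id, speaker, ts in log:
        t = datetime.fromisoformat(ts)
        user_lines.setdefault(chat_id, 0)
        if speaker == "user":
            user_lines[chat_id] += 1
            prev = last.get(chat_id)
            if prev is not None and prev[0] == "bot":
                delays.append((t - prev[1]).total_seconds())
        last[chat_id] = (speaker, t)
    counts = list(user_lines.values())
    return {
        "avg_user_lines": mean(counts),
        "avg_reply_delay_s": mean(delays) if delays else None,
        # assumed definition: a chat with too few user lines counts as abandoned
        "abandoned_share": sum(c < abandoned_below for c in counts) / len(counts),
    }

stats = interaction_stats(log)
print(stats)
```

The thresholds (what counts as a slow reply or an abandoned chat) would still have to come from real analytics.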


Naive, based on text
* repetition / fluency, as in the number of distinct n-grams - the only quantifiable metric. However, it is really just a proxy metric for generation quality; it does not measure the adequacy of the conversation...
* perplexity on the user questions? - this is not about the answers' quality at all; besides, I doubt that huge models have difficulties in this domain
* overlap between a user's question and a model's answer - I think this is a bit outdated for LLM evaluation, since parroting the question was mostly a problem around 2017-2020. Could make sense if we have enough time to check the LM's answer before outputting it to the user.
* F-measure/BLEU/etc. on questions with pre-defined answers? - might be good for tracking factual information, but good chit-chat has nothing pre-defined a priori. There are separate tasks that could be used to see the prompted model's performance though, e.g. summarization.
* Embedding distance between the user's question and the model's answer - seems pretty much the same as the previous idea
* % of negative user responses ("no, that's not what I was talking about"; "that's a dumb response"; "bad bot"); % of positive user responses ("You're funny!", "Great, thanks", "you're right")
    - very hard to define 'negative', but theoretically it could be (embedding) cosine similarity to sets of responses similar to above; results in a problem of constructing such sets, extracting user responses from paragraphs, etc.
    - metrics such as these are really very much user-dependent (what if a user is often ironic? what if they're displaying opinions far from a bot's training data distribution? what if a user is in a bad mood...)
    - but __could be captured__ if reframed as an entailment task (LM-user answer pairs, entailment/contradiction/neutral; we're looking for contradictions; something like https://arxiv.org/abs/1904.03371 ?) or as a sentiment analysis task, but requires additional models
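
The two most mechanical text-based ideas above, distinct n-grams and question-answer overlap, can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and lowercasing; `distinct_n` and `jaccard_overlap` are hypothetical helper names, not taken from any library.

```python
def distinct_n(texts, n=2):
    """Share of unique n-grams among all n-grams in the given texts."""
    total, unique = 0, set()
    for text in texts:
        tokens = text.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

def jaccard_overlap(question, answer):
    """Token-set overlap between a user's question and the model's answer."""
    q, a = set(question.lower().split()), set(answer.lower().split())
    union = q | a
    return len(q & a) / len(union) if union else 0.0

print(distinct_n(["a b a b a b"], n=2))            # highly repetitive answer
print(jaccard_overlap("how are you", "fine thanks"))  # no shared tokens
```

A low distinct-n score flags repetitive generations, and a very high overlap flags answers that merely echo the question; neither says anything about whether the answer is adequate.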

I actually expect that interaction-based metrics per model are much more telling than the textual ones.


There are also specialized metrics involving various aspects of quality, e.g. empathy: https://github.com/Sea94/ieval

## Choose a model

A recent causal LM that fits my memory is all I ask, so under 13B
* OPT? Comes in flavours like < 1B, 1.3B, 2.7B, 6.7B
* GPT-J? 6B
* LLaMA? 7B

I looked at https://weightwatcher.ai/leaderboard.html and https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard and compared the top models regarding their Alpha and Average, respectively. Frankly, these two metrics seem to contradict each other.

Surprisingly, OPT-1.3B, despite being pretty small, was high in Alpha and decent in terms of Truthfulness, although its other metrics were not as impressive. Since I had no opinion based on experience, I decided to test the pipeline on a small OPT.

If I have enough time, I'll switch to the fresh LLaMA-2 (7B) -- it demonstrates a good average on the leaderboard tasks.
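
As a rough check of the "fits my memory" criterion, the weights-only footprint can be estimated from the parameter count. This is a back-of-the-envelope sketch: the parameter counts are the nominal sizes from the model names, `weight_memory_gib` is a hypothetical helper, and activations, gradients and optimizer state are deliberately ignored.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gib(n_params, dtype="fp16"):
    """Memory needed just to hold the weights, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30

# Nominal sizes taken from the model names (approximate).
candidates = {"OPT-1.3B": 1.3e9, "GPT-J-6B": 6e9, "LLaMA-7B": 7e9}

for name, n_params in candidates.items():
    sizes = ", ".join(
        f"{dtype}: {weight_memory_gib(n_params, dtype):.1f} GiB"
        for dtype in ("fp32", "fp16", "int8")
    )
    print(f"{name:10s} {sizes}")
```

Full finetuning needs several times this amount, which is one more reason to train only LoRA adapters.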



In [1]:
# %pip install transformers datasets peft accelerate bitsandbytes sacremoses pandas 

In [2]:
from transformers import AutoTokenizer

model_id = "facebook/opt-1.3b"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

## Choose datasets & prepare them

Personally, I think that all dialogue corpora ought to be curated. As a source of data, Reddit probably contains the most lively responses, but it is potentially unpredictable and should be tested first. I considered the following datasets:

* ConvAI2/3 dataset - on closer look, it has a deeper structure that is useful in RLHF or evaluation contexts
* The NPS Chat Corpus - not in Hugging Face Datasets, but potentially useful
* Cornell Movie-Dialogs Corpus
* HC3 (which also contains ChatGPT answers, but we won't need them)
* and the DailyDialog corpus


And then the problems started.

- I haven't opened a PR yet, but the builder script for 'cornell_movie_dialog' is broken
    
    \# y = load_dataset('cornell_movie_dialog')['train']  
- 'Hello-SimpleAI/HC3' is broken too
    
    \# z = load_dataset('Hello-SimpleAI/HC3')['train'] # ['question', 'human_answers']
- The NPS Chat Corpus is part of NLTK, and it is unusable for this purpose because it is a loose collection of posts rather than (somewhat coherent) dialogs.


I was left with DailyDialog and manually-processed HC3.

By the way, this would probably have worked better if the data also contained prompts (like 'USER1 says '), matching the way the chatbot is prompted (see src/chatbot.py)
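
A sketch of what such prompt-style formatting could look like is below. The exact prompt used in src/chatbot.py is not shown here, so the `add_speaker_prompts` helper, the speaker labels and the newline separator are assumptions for illustration only.

```python
def add_speaker_prompts(turns, speakers=("USER1", "USER2"), sep="\n"):
    """Prefix each turn with an alternating speaker label and join the turns."""
    lines = [
        f"{speakers[i % len(speakers)]} says: {turn.strip()}"
        for i, turn in enumerate(turns)
    ]
    return sep.join(lines)

dialog = ["Hi, how are you?", " Fine, thanks. And you? ", "Not bad."]
print(add_speaker_prompts(dialog))
```

The same function could replace the `' | '.join(...)` step in the processing functions, so that training text and inference prompts share one format.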

In [3]:
from transformers import DataCollatorForLanguageModeling
from datasets import load_dataset

dd = load_dataset('daily_dialog') # ['dialog']

In [4]:
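# by the way, daily_dialog turns out to be pre-tokenized in some places
# (spaces before punctuation), so detokenize before running the model tokenizer
from sacremoses import MosesDetokenizer
detok = MosesDetokenizer('en')

def process_daily_dialogs(examples):
    # detokenize every turn of every dialog in the batch
    temp = [[detok.detokenize(x.split()) for x in y] for y in examples['dialog']]
    # join the turns of one dialog into a single string, with ' | ' as the turn separator
    temp = [' | '.join(x) for x in temp]
    # produces input_ids / attention_mask; no truncation is applied here
    examples = tokenizer(temp)
    return examples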

dd['train'] = dd['train'].map(process_daily_dialogs, batched=True, remove_columns=["dialog", 'act', 'emotion'])
dd['test'] = dd['test'].map(process_daily_dialogs, batched=True, remove_columns=["dialog", 'act', 'emotion'])

In [5]:
# !wget https://huggingface.co/datasets/Hello-SimpleAI/HC3/resolve/main/all.jsonl
import json
import pandas as pd
from datasets import Dataset

def load_hc3():
    with open("all.jsonl") as inp:
        data = pd.DataFrame.from_records([json.loads(line) for line in inp])
    return data

hc3 = Dataset.from_pandas(load_hc3()) # 'question', 'human_answers'
hc3 = hc3.train_test_split(test_size=0.1)

In [None]:
def process_hc3(examples):
    # This could have been done in pandas, I think...
    pairs = []
    for x, y in zip(examples['question'], examples['human_answers']):
        x = [detok.detokenize(x.split())] * len(y)
        y = [detok.detokenize(t.split()) for t in y]
        pairs.extend([' | '.join(d) for d in zip(x, y)])
    examples = tokenizer(pairs)
    return examples
    
hc3['train'] = hc3['train'].map(process_hc3, batched=True, remove_columns=["question", 'human_answers', 'chatgpt_answers', 'index', 'source'])
hc3['test'] = hc3['test'].map(process_hc3, batched=True, remove_columns=["question", 'human_answers', 'chatgpt_answers', 'index', 'source'])

In [7]:
from datasets import concatenate_datasets

data_train = concatenate_datasets([dd['train'], hc3['train']]) 
data_test = concatenate_datasets([dd['test'], hc3['test']]) 

## LoRA

I'll use one of the PEFT LoRA examples, because I don't really have a lot of time to look deeper than that. I suppose that Prefix Tuning could work better, though.
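
To get a feeling for how small the LoRA adapters are, here is the arithmetic for the configuration used in this notebook (r=8, lora_alpha=32, adapters on the q, k and v projections). It is a sketch under stated assumptions: the OPT-1.3B dimensions (hidden size 2048, 24 decoder layers) are quoted from memory of the model config, and `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(d_in, d_out, r, n_modules):
    """Each adapted linear layer gets two low-rank matrices: A (r x d_in) and B (d_out x r)."""
    return n_modules * r * (d_in + d_out)

hidden_size, n_layers = 2048, 24  # assumed OPT-1.3B dimensions
r, lora_alpha = 8, 32
targets_per_layer = 3             # q_proj, k_proj, v_proj

trainable = lora_trainable_params(hidden_size, hidden_size, r, targets_per_layer * n_layers)
scaling = lora_alpha / r          # the low-rank update B @ A is multiplied by alpha / r

print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of ~1.3e9 base parameters: {trainable / 1.3e9:.2%}")
print(f"update scaling alpha / r: {scaling}")
```

With these numbers only a fraction of a percent of the parameters is trained, while the base weights stay frozen.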

In [9]:
from transformers import AutoModelForCausalLM
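from peft import LoraConfig, get_peft_model

# I've seen people use int4 in BitsAndBytes for training; a bit too extreme for a baseline.
# The int8 path is left disabled; note that newer peft releases rename
# prepare_model_for_int8_training to prepare_model_for_kbit_training.
# from peft import prepare_model_for_int8_training
# model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map={"":0})
# model = prepare_model_for_int8_training(model)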

model = AutoModelForCausalLM.from_pretrained(model_id)
model.gradient_checkpointing_enable()


config = LoraConfig(
    r=8, lora_alpha=32, target_modules=["q_proj", "k_proj", "v_proj"], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)

## Combine everything

In [14]:
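# Note: `!export ...` (used in the recorded run) only sets the variable in a throwaway subshell,
# so the kernel never sees it; %env sets it for the kernel process itself.
# As far as I know, it has to be set before torch is first imported.
%env PYTORCH_ENABLE_MPS_FALLBACK=1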

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [15]:
from transformers import Trainer, TrainingArguments


loader = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    train_dataset=data_train,
    eval_dataset=data_test,
    args=TrainingArguments(
        per_device_train_batch_size=16,
        gradient_accumulation_steps=4,
        # warmup_steps=500,
        # max_steps=10000,
        warmup_steps=1,
        max_steps=1,
        learning_rate=2e-4,  # ?
        # fp16=True,
        logging_steps=500,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=loader
)
trainer.train()
trainer.save_model('weights')

RuntimeError: MPS does not support cumsum op with int64 input
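
Independently of the device error, it helps to spell out the batch arithmetic implied by the TrainingArguments above. A minimal sketch, assuming a single device; `sequences_seen` is a hypothetical helper, and the two `max_steps` values are the smoke-test setting and the commented-out one.

```python
def sequences_seen(per_device_batch, grad_accum, max_steps, n_devices=1):
    """Effective batch size per optimizer step and total sequences consumed."""
    effective_batch = per_device_batch * grad_accum * n_devices
    return effective_batch, effective_batch * max_steps

for steps in (1, 10000):
    effective, total = sequences_seen(16, 4, steps)
    print(f"max_steps={steps}: effective batch {effective}, {total} sequences in total")
```

So the smoke test with `max_steps=1` touches only 64 sequences, while the commented-out setting would process 640,000 of them.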