## Conversational AI Assignment 1: Fine-tuning for Knowledge-aware Response Generation

LLMs are being widely deployed as dialogue systems, e.g. QA systems. When we want to ask domain-specific questions, an off-the-shelf language model might lack the desired knowledge. Adding the desired knowledge and fine-tuning can address these gaps. 

In this assignment, we will be fine-tuning a large language model (Llama3.2 1B Instruct) for task-oriented conversational modeling with subjective knowledge. The task and dataset are explained in detail [here](https://github.com/alexa/dstc11-track5).

In essence, the task is to respond to subjective user requests regarding hotels and restaurants, e.g. "does the restaurant have a nice vibe?". Our conversational model has to generate a response to this question based on conversation history and previous reviews it has access to. 

This task consists of three sub-tasks: 
1. Knowledge-seeking Turn Detection
2. Knowledge Selection
3. **Knowledge-grounded Response Generation** (our focus)

We will be focusing on the last sub-task, _knowledge-grounded response generation_, for which we need the dialog history, request, and knowledge. With these as the input, the model should then generate a response.

**This notebook contains the following parts:**
1. Exploring the DSTC11 Task5 data and formatting it in the proper structure
2. Off-the-shelf inference with Llama3.2
3. Fine-tuning Llama3.2 on the DSTC11 Task5 data
4. Analyzing the model behavior

## Assignment

We will use the dialogue history between a user and a system, together with subjective knowledge from relevant user reviews, to have Llama3.2 generate a response to a knowledge-seeking request from the user. The goal is to compare the performance of the off-the-shelf model with that of a domain-specific fine-tuned model. For this assignment, we will focus on manual human evaluation. 

Throughout the notebook, you will find questions you need to answer to complete the assignment. Some are coding questions and others are report questions; together they help you understand the data and the model behavior. These questions are indicated as **<span style="background:yellow">Q#</span>**.

**Assignment steps:**
1. Load the dataset and understand its structure, individual components, what the inputs should be and what the output should be. **(Q1)**
2. Convert the dataset to a structure that is useful for our LLM to generate its responses with. **(Q2 - Q5)**
3. Manually examine off-the-shelf/vanilla Llama3.2 performance on 10 dataset samples. **(Q6)**
4. Fine-tune Llama3.2 on our dataset. **(Q7)**
5. Manually examine finetuned Llama3.2 performance. **(Q8 - Q13)**
6. Write a 4-page report describing your findings (maximum 5 pages, in LNCS format). The report should consist of the following: 
    1. *Introduction* - 1-page that introduces the task, model, summary of methodology, and findings
    2. *Data* - 1-page that describes the dataset we use:
        1. What does the dataset look like? 
        2. What are the individual components and how are they relevant for the LLM? 
        3. What pre-processing steps do we take and why? 
    3. *Methodology* - 0.5-page that describes the process for fine-tuning: 
        1. What are the inputs and what is the target output? 
        2. Which hyperparameters (i.e. number of epochs, learning rate, and warmup steps) did you use? 
        3. What training loss did you achieve and what does this mean for your model? (E.g. a training loss around 0 probably hints at overfitting, while a loss above 1 suggests that the model isn't learning the task properly.)
    4. *Results* - 1-page that describes the findings of the manual examination, based on the questions Q8-Q13 that you will find in this notebook 
    5. *Division of work* - a short paragraph describing how the steps of the coding and report writing were tackled between the two group members.
    
Feel free to add more information for each section if it fits within the page limit! 

**Submission:**
Please submit your code (as a Kaggle notebook) and your report (PDF) on Canvas by **15th November 23:59**.

**Grading:** The grading will be split into two parts: 
1. Code accuracy and quality for steps 1-5 (40% of the grade)
2. Quality of the report (60% of the grade)

## A couple of notes about Kaggle notebooks
* [Terms and conditions](https://www.kaggle.com/terms)
* Short note on the [GPU limits](https://www.kaggle.com/discussions/general/108481)
* [Uploading your own model to Kaggle](https://www.kaggle.com/discussions/questions-and-answers/63328)

**Kaggle directories** may be confusing because they are present on another machine that we have no access to (effectively "in the cloud"). 
To help your orientation, keep this in mind:
* Input data files are available in the read-only "/kaggle/input/" directory
* You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
* You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Kaggle seems to run smoothly, including in terms of GPUs. There are two common environment issues that happen because we forget to turn on the internet or the GPUs. Specifically:
* `ERROR: Could not find a version that satisfies the requirement ... (from versions: none) ERROR: No matching distribution found for ...` - to solve this issue, turn your notebook Internet ON.
* `OSError` when loading the Ollama models - to solve this, make sure your accelerator is set to `GPU P100` or `GPU T4x2`

So, these are four good practices when working with Kaggle notebooks:
1. Pay attention to usage statistics, especially memory, CPU, and GPU
2. Pay attention to the quota of GPU (measured in hours)
3. "Turn OFF" the internet after each session. Turn on the internet when starting a session.
4. "Turn off" the accelerator after each use. Turn the accelerator on when starting a session.

## Exploring the data
The data we will use contains these essential components: 
- __knowledge__: The reviews for the corresponding restaurant or hotel
- __dialogue__: The dialogue history between the user and the system
- __response__: The ground-truth response we expect from the system, which is the target response for the language model we fine-tune

Let's load the data first to understand it better! The data is already attached to this notebook under datasets as _dstc-11-task-5_, and we can use the original data loader to get the correct formatting.

In [97]:
# Import dataset loader
import sys
sys.path.insert(1, "datasetloader")
from dataset_walker import DatasetWalker

# Load trainingset
train_ds = DatasetWalker("train", "dataset", labels=True, incl_knowledge=True)

Let's look at the first example:

In [98]:
sample = train_ds[0]
sample

([{'speaker': 'U',
   'text': 'Can you help me find a place to stay that is moderately priced and includes free wifi?'},
  {'speaker': 'S', 'text': 'sure, i have 17 options for you'},
  {'speaker': 'U',
   'text': "Are any of them in the south? I'd like free parking too."},
  {'speaker': 'S',
   'text': 'Yes, two are in the south and both have free parking and internet. I recommend the Bridge Guesthouse. Would you like me to book a reservation?'},
  {'speaker': 'U',
   'text': 'I have back issues. Does this place have comfortable beds?'}],
 {'target': True,
  'knowledge': [{'domain': 'hotel',
    'entity_id': 11,
    'doc_type': 'review',
    'doc_id': 3,
    'sent_id': 1,
    'sent': 'The room was clean and comfortable and not expensive.'},
   {'domain': 'hotel',
    'entity_id': 11,
    'doc_type': 'review',
    'doc_id': 2,
    'sent_id': 6,
    'sent': 'It could ruin your stay if you mind that kind of thing.'},
   {'domain': 'hotel',
    'entity_id': 11,
    'doc_type': 'review',
 

### Conversation History
The first element of the sample is the dialogue, which clearly marks what the user and the system have said so far. The last dictionary in this list is the latest utterance in the dialogue, i.e. the query that the model should respond to. 

<span style="background:yellow">__Q1:__ In the block below, write the code to first access the dialogue from _sample_, and then the query:</span>

In [99]:
# Your code here: 
dialogue = sample[0]   # list of turns exchanged so far
query = dialogue[-1]   # last turn: the user request the model should answer

dialogue, query

([{'speaker': 'U',
   'text': 'Can you help me find a place to stay that is moderately priced and includes free wifi?'},
  {'speaker': 'S', 'text': 'sure, i have 17 options for you'},
  {'speaker': 'U',
   'text': "Are any of them in the south? I'd like free parking too."},
  {'speaker': 'S',
   'text': 'Yes, two are in the south and both have free parking and internet. I recommend the Bridge Guesthouse. Would you like me to book a reservation?'},
  {'speaker': 'U',
   'text': 'I have back issues. Does this place have comfortable beds?'}],
 {'speaker': 'U',
  'text': 'I have back issues. Does this place have comfortable beds?'})

<span style="background:yellow">__Q2:__ To make it more readable for the model (and also ourselves), let's reformat the conversation history to the following:</span>

>__User:__ Can you help me find a place to stay that is moderately priced and includes free wifi?
>
>__System:__ sure, i have 17 options for you 

etc.

In [100]:
from typing import List

def format_dialogue(dialogue: List[dict]) -> str: 
    """Formats a list of dialogue turns as a readable string.

    Given a list of dictionaries, each representing a dialogue turn, this function formats the dialogue by prefixing
    each turn with either "User" or "System" based on the speaker. The speaker is identified by the 'speaker' key in 
    the dictionary: 'U' for user, else it is the system.

    Args:
        dialogue (List[dict]): A list of dictionaries where each dictionary contains two keys:
            - 'speaker' (str): A string indicating the speaker of the turn ('U' for user, 'S' for system).
            - 'text' (str): The text spoken by the respective speaker.

    Returns:
        str: A formatted string where each dialogue turn is on a new line, prefixed by either "User" or "System".
    """
    # Your solution here
    output = ""

    for turn in dialogue:
        # 'U' marks a user turn; any other speaker is treated as the system.
        prefix = "User:" if turn["speaker"] == "U" else "System:"
        output += prefix + turn["text"] + "\n"

    return output


print(format_dialogue(dialogue))

User:Can you help me find a place to stay that is moderately priced and includes free wifi?
System:sure, i have 17 options for you
User:Are any of them in the south? I'd like free parking too.
System:Yes, two are in the south and both have free parking and internet. I recommend the Bridge Guesthouse. Would you like me to book a reservation?
User:I have back issues. Does this place have comfortable beds?



### Knowledge

Let's look at the knowledge, which contains the reviews for the conversation:

In [101]:
knowledge = sample[1]["knowledge"]
knowledge

[{'domain': 'hotel',
  'entity_id': 11,
  'doc_type': 'review',
  'doc_id': 3,
  'sent_id': 1,
  'sent': 'The room was clean and comfortable and not expensive.'},
 {'domain': 'hotel',
  'entity_id': 11,
  'doc_type': 'review',
  'doc_id': 2,
  'sent_id': 6,
  'sent': 'It could ruin your stay if you mind that kind of thing.'},
 {'domain': 'hotel',
  'entity_id': 11,
  'doc_type': 'review',
  'doc_id': 4,
  'sent_id': 3,
  'sent': "Sadly though, I found that the bed in the room wasn't very comfortable at all."},
 {'domain': 'hotel',
  'entity_id': 11,
  'doc_type': 'review',
  'doc_id': 2,
  'sent_id': 5,
  'sent': 'I do have to say, though, the bed is extremely uncomfortable.'},
 {'domain': 'hotel',
  'entity_id': 11,
  'doc_type': 'review',
  'doc_id': 3,
  'sent_id': 5,
  'sent': 'and the interior of the room was very good and bed was also very much comfortable.'}]

This is again a list, this time of review sentences with their metadata. If we were to give this knowledge directly to the LLM, we would be giving it extra information that is not necessary for generating the response. We only need the review text _itself_. 

<span style="background:yellow">__Q3:__ In the block below, write a function that takes as input the knowledge of one sample and outputs the reviews in a list:</span>

In [102]:
def get_reviews(knowledge: List[dict]) -> List[str]: 
    """Extracts and returns a list of review sentences from the given knowledge data.

    Given a list of dictionaries representing knowledge data, this function collects review text from each dictionary.
    If there is no sentence in a specific review, it extracts the value for the 'answer' key. 

    Args:
        knowledge (List[dict]): A list of dictionaries containing review information. Each dictionary has either:
            - 'sent' (str): A key holding the review text.
            - 'answer' (str): A fallback key used when 'sent' is not available.

    Returns:
        List[str]: A list of strings where each string is a review extracted from the knowledge data.

    """

    res = []
    for data in knowledge:
        if "sent" in data:
            res.append(data["sent"])
        else:
            res.append(data["answer"])


    return res
    

get_reviews(knowledge)

['The room was clean and comfortable and not expensive.',
 'It could ruin your stay if you mind that kind of thing.',
 "Sadly though, I found that the bed in the room wasn't very comfortable at all.",
 'I do have to say, though, the bed is extremely uncomfortable.',
 'and the interior of the room was very good and bed was also very much comfortable.']

### Response (Ground-truth)
Similarly, we can load the ground-truth response, which will be the target to fine-tune our LLM with. 

In [103]:
response = sample[1]["response"]
response

'The Bridge Guest House is known for having pretty uncomfortable beds according to most guests. Only one guest found it to be comfortable.'

### Create Dataset
Until now we have looked at and applied our functions to only one sample. Later, when we want to fine-tune our model, we will be using Unsloth in combination with HuggingFace. It is thus useful to convert our dataset into a HuggingFace dataset. 

Let's first install Unsloth, which will also install the other dependencies like transformers and torch.

In [104]:
# !pip install "unsloth[cu118-torch240] @ git+https://github.com/unslothai/unsloth.git"
# !pip install torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118
# !pip install torch==2.4.1

Let us now format our entire dataset! Essentially, we want direct access to our nicely formatted dialogue, the knowledge, and the ground-truth response. We can do this as follows: 

In [105]:
from datasets import Dataset

def reformat_dataset(dataset): 
    reformatted_dataset = {
        "dialogue": [],
        "knowledge": [],
        "response": [],
    }
    for sample in dataset: 
        # Filter out samples that do not require a response.
        if sample[1]["target"]:
            reformatted_dataset["dialogue"].append(format_dialogue(sample[0]))
            reformatted_dataset["knowledge"].append(get_reviews(sample[1]["knowledge"]))
            reformatted_dataset["response"].append(sample[1]["response"])
        
    return reformatted_dataset

reformatted_dataset = reformat_dataset(train_ds)
dataset = Dataset.from_dict(reformatted_dataset)
dataset

Dataset({
    features: ['dialogue', 'knowledge', 'response'],
    num_rows: 16897
})

Now if we access the first sample, it is much more readable and we can access the different components directly!

In [106]:
dataset[0]

{'dialogue': "User:Can you help me find a place to stay that is moderately priced and includes free wifi?\nSystem:sure, i have 17 options for you\nUser:Are any of them in the south? I'd like free parking too.\nSystem:Yes, two are in the south and both have free parking and internet. I recommend the Bridge Guesthouse. Would you like me to book a reservation?\nUser:I have back issues. Does this place have comfortable beds?\n",
 'knowledge': ['The room was clean and comfortable and not expensive.',
  'It could ruin your stay if you mind that kind of thing.',
  "Sadly though, I found that the bed in the room wasn't very comfortable at all.",
  'I do have to say, though, the bed is extremely uncomfortable.',
  'and the interior of the room was very good and bed was also very much comfortable.'],
 'response': 'The Bridge Guest House is known for having pretty uncomfortable beds according to most guests. Only one guest found it to be comfortable.'}

## Fine-Tuning

### Setting up

Now that we understand our data a bit better, we can look at fine-tuning our model. We will be using HuggingFace Transformers + Unsloth for faster fine-tuning. 

For the large language model, we will be using __Llama3.2 1B Instruct__. Let's load it!

In [1]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    max_seq_length = 1024,
    load_in_4bit = True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, 
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"], 
    lora_alpha = 16,
    lora_dropout = 0, 
    bias = "none",    
    use_gradient_checkpointing = "unsloth", 
    random_state = 3407,
    use_rslora = False,  
    loftq_config = None, 
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.



==((====))==  Unsloth 2024.11.5: Fast Llama patching. Transformers = 4.46.2.
   \\   /|    GPU: NVIDIA GeForce RTX 3060 Laptop GPU. Max memory: 6.0 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu118. CUDA = 8.6. CUDA Toolkit = 11.8.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2+cu118. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2024.11.5 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.


Now that we have our dataset in the correct format and loaded our model, we need to create a simple prompt to fine-tune our model on. We can give the model the knowledge and dialogue and ask it to give the response: 

> __PROMPT__: '''__KNOWLEDGE__: {knowledge}
>
> __DIALOGUE__: {dialogue}
>
> __RESPONSE__:'''

<span style="background:yellow">__Q4:__ Let's create a function that takes as input one sample from the dataset, creates the prompt and tokenizes it along with the ground-truth target response. This is then added to the dataset. For this, we can use the [map()](https://huggingface.co/docs/datasets/process#map) function:</span> 

In [107]:
def preprocess_function(sample: dict, max_length=512) -> dict:
    """Prepares model inputs by creating the prompt in the correct format, then tokenizing the prompt and response.

    Given a sample that contains knowledge, dialogue, and response, this function constructs a prompt according to the format above. 
    The function tokenizes both the prompt and the response, truncating and padding them to 
    ensure they match the maximum length. It also appends the EOS token to the response and tokenizes it separately to 
    create the labels for the model.

    Args:
        sample (dict): A dictionary containing:
            - 'knowledge' (List[str]): The review sentences used as background knowledge.
            - 'dialogue' (str): The dialogue/conversation history text.
            - 'response' (str): The target response to be generated by the model.
        max_length (int, optional): The maximum length for tokenized inputs. Defaults to 512.

    Returns:
        dict: A dictionary containing tokenized inputs for the model:
            - 'input_ids' (List[int]): Tokenized prompt.
            - 'attention_mask' (List[int]): Attention mask of input.
            - 'labels' (List[int]): Tokenized response as labels.
            - 'prompt' (str): The original prompt string.
    """
    # Your solution here. 
    prompt = f"KNOWLEDGE: {sample['knowledge']}\n\nDIALOGUE:\n{sample['dialogue']}\nRESPONSE:"
    prompt_tokens = tokenizer(prompt, truncation=True, padding="max_length", max_length=max_length)
    response_tokens = tokenizer(sample['response'] + tokenizer.eos_token, truncation=True, padding="max_length", max_length=max_length)

    return {
        "input_ids": prompt_tokens["input_ids"],
        "attention_mask": prompt_tokens["attention_mask"],
        "labels": response_tokens["input_ids"],
        "prompt": prompt  
    }

# Apply preprocessing to dataset
dataset = dataset.map(preprocess_function)

Map: 100%|██████████| 16897/16897 [00:19<00:00, 852.26 examples/s]


<span style="background:yellow">__Q5:__ We would also like to use a validation and test set. For this, create a function that takes as input the split of the DSTC11 dataset and returns the dataset processed by all the steps we applied to the training dataset!</span>

In [108]:
def process_dataset_split(split: str) -> Dataset: 
    """Loads, reformats, and processes a dataset split for model training or evaluation.

    This function loads a dataset split (e.g., 'train', 'val', 'test') using the `DatasetWalker` class, reformats
    the dataset using `reformat_dataset`, and then preprocesses each entry by applying `preprocess_function`. The 
    processed dataset is returned in the form of a HuggingFace `Dataset` object.

    Args:
        split (str): The name of the dataset split to process

    Returns:
        Dataset: A HuggingFace `Dataset` object that contains the preprocessed and reformatted data for the specified split.

    """
    new_data = DatasetWalker(split, "dataset", labels=True, incl_knowledge=True) 
    new_data = Dataset.from_dict(reformat_dataset(new_data))
    new_data = new_data.map(preprocess_function)
    return new_data
    

validation_ds = process_dataset_split("val")
test_ds = process_dataset_split("test")



Map: 100%|██████████| 2129/2129 [00:02<00:00, 858.40 examples/s]
Map: 100%|██████████| 2798/2798 [00:03<00:00, 840.18 examples/s]


### Off-the-shelf usage

To see what our LLM does without any fine-tuning, let's run inference directly on a couple of samples!

In [109]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

inputs = tokenizer([dataset[0]["prompt"]], return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
print(tokenizer.batch_decode(outputs)[0])
print("Ground-truth: ", dataset[0]["response"])

<|begin_of_text|>KNOWLEDGE: ['The room was clean and comfortable and not expensive.', 'It could ruin your stay if you mind that kind of thing.', "Sadly though, I found that the bed in the room wasn't very comfortable at all.", 'I do have to say, though, the bed is extremely uncomfortable.', 'and the interior of the room was very good and bed was also very much comfortable.']

DIALOGUE:
User:Can you help me find a place to stay that is moderately priced and includes free wifi?
System:sure, i have 17 options for you
User:Are any of them in the south? I'd like free parking too.
System:Yes, two are in the south and both have free parking and internet. I recommend the Bridge Guesthouse. Would you like me to book a reservation?
User:I have back issues. Does this place have comfortable beds?

RESPONSE: I would recommend the Bridge Guesthouse. The beds are very comfortable, and I'm sure you'll be able to get some rest after your stay.<|eot_id|>
Ground-truth:  The Bridge Guest House is known fo

In [110]:
inputs = tokenizer([dataset[1]["prompt"]], return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
print(tokenizer.batch_decode(outputs)[0])

print("Ground-truth: ", dataset[1]["response"])

<|begin_of_text|>KNOWLEDGE: ['The staff was nice enough however and we had a good time in the outdoor dining area which had a great view of the mountains.', 'Did I tell you we ate outside in their patio area?', 'It was also nice to just sit there since they had an outdoor dining area.', 'Outdoor seating is available at Curry Garden.', 'The outdoor dining area is so nice.']

DIALOGUE:
User:Do you have information about the Warkworth House?
System:Yes I do! The Warkworth House is a 4 star guesthouse that is located in the east section of town. Would you like for me to book you a room?
User:No, but can you give me that phone number please?
System:Most definitely. The Warkworth House's phone number is 01223363682. Can I help you with anything else?
User:Yes I need to find an expensive place to eat serving Indian food.
System:There are over a dozen expensive Indian restaurants in the city. Do you have an area of town in mind?
User:Actually, can you suggest one of them. I'm willing to try so

In [111]:
inputs = tokenizer([dataset[2]["prompt"]], return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
print(tokenizer.batch_decode(outputs)[0])

print("Ground-truth: ", dataset[2]["response"])

<|begin_of_text|>KNOWLEDGE: ['The room was clean and comfortable and not expensive.', 'It could ruin your stay if you mind that kind of thing.', "Sadly though, I found that the bed in the room wasn't very comfortable at all.", 'I do have to say, though, the bed is extremely uncomfortable.', 'and the interior of the room was very good and bed was also very much comfortable.']

DIALOGUE:
User:I'm interested in finding a hotel that has free parking that I can stay at.
System:There are 8 hotels that provide parking. There is no information on free parking but there is one hotel in the cheap price range. Would you be interested?
User:Yes I need a place to stay for sure.  I like 3 star hotels.  Do you have any 3 stars?
System:There are several. I have two hotel options, both on the expensive side. If you'd rather something more moderately priced, there are also some guesthouses available.
User:Oh I almost forgot, I also need the hotel to provide free wifi.  That may narrow my options down a 

#### Question
<span style="background:yellow">__Q6:__ What do you observe in these generations? Are they relevant to the input? Is it responding to the query from the user?</span>
> Each output first shows the prompt the model was given (the knowledge and the dialogue), followed by the response generated by the model and the ground truth used for comparison. Running the same prompt multiple times yielded different responses, sometimes better and sometimes much worse. The model does manage to stay on topic; its weakness is grounding its responses in the knowledge. Occasionally the output is similar to the ground truth, but most of the time it is not. It sometimes attempts to answer the query, but not necessarily correctly.

### Let's finetune!
For fine-tuning, we use the supervised fine-tuning Trainer from HuggingFace. To understand the function better, you can read the documentation [here](https://huggingface.co/docs/trl/main/en/sft_trainer#trl.SFTTrainer).

The outputs (model weights etc.) will be stored in the _"outputs"_ folder. 

A large part of fine-tuning is determining the values for the hyperparameters. Particularly: the batch size, number of epochs, learning rate, and warmup steps are important parameters to get right. 

To prevent our GPU from running out of memory, we will use gradient accumulation. With a training batch size of $2$ and $4$ gradient accumulation steps (the number of forward and backward passes before the model weights are updated), this is effectively equivalent to a batch size of $8$. 

It is up to you to determine the values of the number of training epochs, learning rate, and warmup steps. 
In principle, the recommended number of training epochs is usually $1$ to $3$, to prevent overfitting.

The number of warmup steps can range between $6\%$ and $10\%$ of the total number of training steps, so this is tied to the number of epochs you train for. 

For the learning rate, the usual values range between $5 \times 10^{-5}$ and $1 \times 10^{-4}$. 

Play around with these values! The aim is to have a training loss below $1$, around $0.5$. A training loss close to $0$ indicates overfitting. 
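
The relationship between batch size, epochs, and warmup steps can be checked with a little arithmetic. The sketch below assumes the 16,897 training samples reported above and a single GPU; the helper name `training_steps` is ours and not part of any library, and the step count reported by the trainer may differ by a step or so depending on how it rounds the last partial batch.

```python
import math

def training_steps(num_samples: int, per_device_batch: int, grad_accum: int, epochs: int) -> int:
    """Approximate number of weight updates for a single-GPU run (hypothetical helper)."""
    effective_batch = per_device_batch * grad_accum  # 2 * 4 = 8 in this notebook
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return steps_per_epoch * epochs

# Assumption: one epoch over the 16,897 training samples.
total = training_steps(16897, per_device_batch=2, grad_accum=4, epochs=1)

# Warmup range of 6% to 10% of the total number of steps.
warmup_low, warmup_high = round(0.06 * total), round(0.10 * total)
print(total, warmup_low, warmup_high)
```

For one epoch this gives roughly 2,100 update steps, so a warmup somewhere between about 130 and 210 steps falls in the suggested range; with more epochs, the range scales up accordingly.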

<span style="background:yellow">__Q7:__ What values for the following hyperparameters did you set: number of epochs, learning rate, and warmup steps?</span>

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

import os
os.environ["WANDB_DISABLED"] = "true"

NUM_TRAIN_EPOCHS = #Fill in your value
LEARNING_RATE = #Fill in your value
WARMUP_STEPS = #Fill in your value

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset=validation_ds,
    dataset_kwargs={'skip_prepare_dataset': True},
    max_seq_length = 1024,
    dataset_num_proc = 2,
    packing = False, 
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = WARMUP_STEPS,
        num_train_epochs = NUM_TRAIN_EPOCHS, 
        learning_rate = LEARNING_RATE,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 500,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

In [None]:
trainer.train()

#### Now let's take a look at what our fine-tuned model does! 

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

inputs = tokenizer([dataset[0]["prompt"]], return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True, pad_token_id = tokenizer.eos_token_id)
print(tokenizer.batch_decode(outputs)[0])
print("Ground-truth: ", dataset[0]["response"])

In [None]:
inputs = tokenizer([dataset[1]["prompt"]], return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
print(tokenizer.batch_decode(outputs)[0])

print("Ground-truth: ", dataset[1]["response"])

In [None]:
inputs = tokenizer([dataset[2]["prompt"]], return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
print(tokenizer.batch_decode(outputs)[0])

print("Ground-truth: ", dataset[2]["response"])

## OLLAMA

Let's use OLLAMA to interact with our fine-tuned LLM. Just run the following blocks to install OLLAMA properly!

In [None]:
!curl -fsSL https://ollama.com/install.sh | sh

In [None]:
! pip install ollama

In [None]:
import subprocess
subprocess.Popen(["ollama", "serve"])
import time
time.sleep(3) # Wait for a few seconds for Ollama to load!

In [None]:
# Save our fine-tuned model in 8-bit Q8_0 GGUF format so we can use it with OLLAMA
model.save_pretrained_gguf("model", tokenizer)

In [None]:
import ollama

modelfile = '''# Modelfile
FROM "/kaggle/working/model/unsloth.Q8_0.gguf"

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|reserved_special_token"
PARAMETER top_p 0.5
PARAMETER num_predict 42
'''

ollama.create(model='unsloth_model', modelfile=modelfile)

The following blocks are to give Ollama our prompt and get the response from our fine-tuned model! We can then also compare it to what the target output is. You can use the *get_ollama_output* function for the questions that will follow. 

In [None]:
def get_ollama_output(prompt: str) -> str: 
    response = ollama.chat(model='unsloth_model', messages=[
        {
            'role': 'user',
            'content': prompt,
        },
    ])
    return response['message']['content']

In [None]:
idx = 900
print("INPUT:")
print(test_ds[idx]["prompt"])
print("TARGET OUTPUT:")
print(test_ds[idx]["response"])

In [None]:
print(get_ollama_output(test_ds[idx]["prompt"]))

### Analyze Model Responses
Now that we have OLLAMA up and running, let's look at what types of generations the fine-tuned model produces. For this, we will be using samples from the test set! Pick 10 test samples for the analysis. This analysis will go in the report and you will use the questions below as a guide; **answer all these questions in the report**. Remember to give examples in the report when describing your findings!
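
One way to pick the 10 test samples without cherry-picking is to draw them with a fixed random seed, so the same samples can be revisited for both the fine-tuned and the off-the-shelf model. This is only a sketch: the helper name `pick_indices` is hypothetical, the seed simply reuses the `3407` from the training setup, and `2798` is the size of the processed test split shown earlier.

```python
import random

def pick_indices(dataset_size: int, k: int = 10, seed: int = 3407) -> list:
    """Draw k distinct sample indices; a fixed seed makes the selection repeatable."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(dataset_size), k))

# Assumption: 2798 samples in the processed test split.
indices = pick_indices(2798)
print(indices)
```

Each index can then be used as `idx` in the cells above, e.g. `get_ollama_output(test_ds[idx]["prompt"])`. If you prefer to hand-pick samples that cover specific scenarios, say so in the report.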

#### Questions
<span style="background:yellow">__Q8:__ What do you notice in terms of relevancy of the generation? Are the generations responding to the query?</span> 

```
# Your answer here
```

<span style="background:yellow">__Q9:__ In general, does fine-tuning improve or hurt the model's performance?</span> 


```
# Your answer here
```

<span style="background:yellow">__Q10:__ How does the fine-tuned model perform in the following scenarios?</span> 
1. When all reviews agree with each other
2. When only one review disagrees
3. When opinions in the reviews are mixed (i.e. high disagreement)


```
# Your answer here
```


<span style="background:yellow">__Q11:__ How does the length of the dialogue/conversation history affect the fine-tuned model's generation?</span> 
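
A possible starting point for this question is to group samples by the number of turns in their dialogue and then compare generations across groups. The sketch below assumes dialogues in the string format produced by `format_dialogue` (one turn per line, no newlines inside a turn); the helper names and the bucket thresholds are our own illustrative choices, not part of the dataset or any library.

```python
from collections import defaultdict

def count_turns(dialogue: str) -> int:
    """Count turns in a formatted dialogue string (one turn per non-empty line)."""
    return sum(1 for line in dialogue.splitlines() if line.strip())

def group_by_length(dialogues, short_max: int = 3, medium_max: int = 7) -> dict:
    """Map 'short' / 'medium' / 'long' to the indices of the dialogues in that bucket."""
    buckets = defaultdict(list)
    for idx, dialogue in enumerate(dialogues):
        turns = count_turns(dialogue)
        if turns <= short_max:
            label = "short"
        elif turns <= medium_max:
            label = "medium"
        else:
            label = "long"
        buckets[label].append(idx)
    return dict(buckets)

# Tiny made-up dialogue, only to show the expected input format.
example = "User:Hi\nSystem:Hello, how can I help?\nUser:Is the pool clean?\n"
print(count_turns(example), group_by_length([example]))
```

Applied to the test set, something like `group_by_length(test_ds["dialogue"])` would give index lists from which to pick a few short and a few long conversations for comparison.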


In [None]:
# Your coding solution and answer here.

<span style="background:yellow">__Q12:__ Create some samples (~5) that come from a different domain (e.g. airports, shops, web-stores; you can use an LLM). How does the fine-tuned LLM perform on these?</span> 
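
To keep the comparison fair, out-of-domain samples can be put in the same KNOWLEDGE / DIALOGUE / RESPONSE layout that the model was fine-tuned on. The sketch below mirrors that layout; the helper name `build_prompt` is hypothetical, and the airport reviews and dialogue are invented purely for illustration (they do not come from the DSTC11 data).

```python
def build_prompt(knowledge: list, dialogue: str) -> str:
    """Assemble a prompt in the KNOWLEDGE / DIALOGUE / RESPONSE layout used for fine-tuning."""
    return f"KNOWLEDGE: {knowledge}\n\nDIALOGUE:\n{dialogue}\nRESPONSE:"

# Invented example data for an airport domain; not taken from the dataset.
airport_reviews = [
    "Security lines moved quickly even during the morning rush.",
    "I waited almost an hour at security on a Friday evening.",
]
airport_dialogue = (
    "User:I am flying out of the city airport next week.\n"
    "System:Great, is there anything you would like to know about it?\n"
    "User:Are the security lines usually long?\n"
)

prompt = build_prompt(airport_reviews, airport_dialogue)
print(prompt)
```

The resulting string can be passed to `get_ollama_output(prompt)` in the same way as the test-set prompts.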

In [None]:
# Your coding solution and answer here.

<span style="background:yellow">__Q13:__ How do the responses from the fine-tuned model differ from those of the off-the-shelf model? Answer this for Q8-Q12.</span> 

```
# Your answer here
```