# Multiturn Conversation Finetuning
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multiturn_Conversation_Finetuning.ipynb)

## Introduction

In this cookbook we demonstrate how you can train your LLM to converse better by fine-tuning it on multi-turn conversational data. This cookbook accompanies a technical deep-dive blog post you can read [here](https://www.together.ai/blog/fine-tuning-llms-for-multi-turn-conversations-a-technical-deep-dive).

[CoQA](https://huggingface.co/datasets/stanfordnlp/coqa/tree/main) is a large-scale dataset for building Conversational Question Answering systems. The goal of the CoQA challenge is to measure the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation.

CoQA contains 127,000+ questions with answers collected from 8000+ conversations. Each conversation is collected by pairing two crowdworkers to chat about a passage in the form of questions and answers. CoQA has a lot of challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning.

<img src="../images/conversation.png" width="500">

## Install Libraries

In [None]:
!pip install -q datasets==3.1.0 transformers together==1.3.4

## Prepare CoQA Dataset for Fine-tuning

Below we load and prepare the CoQA dataset for fine-tuning through the Together AI fine-tuning API.

The code below converts the data to a common conversational format that can be used for fine-tuning the model.

In [None]:
from datasets import load_dataset

coqa_dataset = load_dataset("stanfordnlp/coqa")

### Let's examine some rows from the CoQA dataset

The source passage that the questions and answers refer to is in the `story` column. The `questions` column holds the list of questions for that passage, and the `answers` column holds a dictionary with the answer texts plus the positions where each answer's supporting span starts and ends in the source passage.
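
To make the column layout concrete, here is a minimal sketch that pairs each question with its answer and the cited span. The row is a hypothetical toy example, not taken from CoQA, and the sketch assumes the `answers` dictionary carries `answer_start` and `answer_end` as character offsets into `story` (end exclusive); check a real row before relying on that.

```python
# Hypothetical toy row that mimics the CoQA schema; not real dataset content.
story = "Anna has a red bike. She rides it to school every day."
row = {
    "source": "toy",
    "story": story,
    "questions": ["What does Anna have?", "What color is it?"],
    "answers": {
        "input_text": ["a bike", "red"],
        # Assumed to be character offsets into the story (end exclusive).
        "answer_start": [story.index("a red bike"), story.index("red")],
        "answer_end": [
            story.index("a red bike") + len("a red bike"),
            story.index("red") + len("red"),
        ],
    },
}


def evidence_spans(row):
    """Pair each question with its free-form answer and the cited story span."""
    answers = row["answers"]
    turns = []
    for question, answer, start, end in zip(
        row["questions"],
        answers["input_text"],
        answers["answer_start"],
        answers["answer_end"],
    ):
        turns.append(
            {
                "question": question,
                "answer": answer,
                "evidence": row["story"][start:end],
            }
        )
    return turns


turns = evidence_spans(row)
for turn in turns:
    print(turn)
```

Note that the free-form answer ("a bike") need not be a verbatim copy of the evidence span ("a red bike"), which is one reason word-overlap scoring is used alongside exact matching.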

Performance on CoQA is assessed using two metrics: F1 score, which measures word overlap between predicted and ground truth answers, and Exact Match (EM), which requires the prediction to exactly match one of the ground truth answers.

<img src="../images/CoQA.png" height="500">

In [3]:
coqa_dataset["train"].to_pandas().head()

| | source | story | questions | answers |
|---|---|---|---|---|
| 0 | wikipedia | The Vatican Apostolic Library (), more commonl... | [When was the Vat formally opened?, what is th... | {'input_text': ['It was formally established i... |
| 1 | cnn | New York (CNN) -- More than 80 Michael Jackson... | [Where was the Auction held?, How much did the... | {'input_text': ['Hard Rock Cafe', '$2 million.... |
| 2 | gutenberg | CHAPTER VII. THE DAUGHTER OF WITHERSTEEN \n\n"... | [What did Venters call Lassiter?, Who asked La... | {'input_text': ['gun-man', 'Jane', 'Yes', 'to ... |
| 3 | cnn | (CNN) -- The longest-running holiday special s... | [Who is Rudolph's father?, Why does Rudolph ru... | {'input_text': ['Donner', 'he felt like an out... |
| 4 | gutenberg | CHAPTER XXIV. THE INTERRUPTED MASS \n\nThe mor... | [Who arrived at the church?, Who was followed ... | {'input_text': ['the garrison first', 'Fra. Do... |


### Format the data to conform with the chat format:

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful AI chatbot."},
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing well, thank you! How can I help you?"},
    {"role": "user", "content": "Can you explain machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."}
  ]
}
```

Each conversation becomes one such object, written as a single line of a `.jsonl` file.

To learn more about the conversation format please see our [docs](https://docs.together.ai/docs/fine-tuning-data-preparation#conversational-data).
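
As a rough illustration of the format, the sketch below checks one example before it is written out. The helper names (`validate_example`, `to_jsonl`) and the specific rules (known roles, string content, system prompt first, alternating user/assistant turns) are assumptions chosen for this example; the linked docs remain the authority on what the API accepts.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_example(example):
    """Return a list of problems found in one {"messages": [...]} example."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content must be a string")
        if role == "system" and i != 0:
            problems.append(f"message {i}: system prompt must come first")

    # After an optional system prompt, turns are expected to alternate
    # user -> assistant -> user -> ...; report only the first violation.
    turns = [m.get("role") for m in messages if m.get("role") != "system"]
    for i, role in enumerate(turns):
        expected = "user" if i % 2 == 0 else "assistant"
        if role != expected:
            problems.append(f"turn {i}: expected {expected!r}, got {role!r}")
            break
    return problems


def to_jsonl(examples):
    """Serialize examples as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(example) for example in examples)


good = {
    "messages": [
        {"role": "system", "content": "You are a helpful AI chatbot."},
        {"role": "user", "content": "Hello, how are you?"},
        {"role": "assistant", "content": "I'm doing well, thank you!"},
    ]
}
bad = {
    "messages": [
        {"role": "user", "content": "Hello"},
        {"role": "system", "content": "Late system prompt"},
    ]
}

print(validate_example(good))
print(validate_example(bad))
print(to_jsonl([good]))
```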

In [None]:
# The system prompt, if present, must always be the first message
system_prompt = "Read the story and extract answers for the questions.\nStory: {}"

def map_fields(row):
    """    
    Maps the fields from a row of data to a structured format for conversation.
    Args:
        row (dict): A dictionary containing the keys "story", "questions", and "answers".
            - "story" (str): The story content to be used in the system prompt.
            - "questions" (list of str): A list of questions from the user.
            - "answers" (dict): A dictionary containing the key "input_text" which is a list of answers from the assistant.
    Returns:
        dict: A dictionary with a single key "messages" which is a list of message dictionaries.
            Each message dictionary contains:
            - "role" (str): The role of the message sender, either "system", "user", or "assistant".
            - "content" (str): The content of the message.    
    """
    messages = [
        {
            "role": "system",
            "content": system_prompt.format(row["story"]),
        }
    ]
    for q, a in zip(row["questions"], row["answers"]["input_text"]):
        messages.append(
            {
                "role": "user",
                "content": q,
            }
        )
        messages.append(
            {
                "role": "assistant",
                "content": a,
            }
        )

    return {
        "messages": messages
    }

In [None]:
# transform the data using the mapping function
train_messages = coqa_dataset["train"].map(map_fields, remove_columns=coqa_dataset["train"].column_names)

In [10]:
train_messages

Dataset({
    features: ['messages'],
    num_rows: 7199
})

In [7]:
train_messages.to_json("coqa_prepared_train.jsonl")

Creating json from Arrow format:   0%|          | 0/8 [00:00<?, ?ba/s]

23777505

## Fine-tune on Prepared Dataset using Together AI Fine-tuning API

In [14]:
from together import Together
import os

TOGETHER_API_KEY = os.getenv("TOGETHER_API_KEY")
WANDB_API_KEY = os.getenv("WANDB_API_KEY")


client = Together(api_key=TOGETHER_API_KEY)

In [None]:
# Upload dataset to Together AI

train_file_resp = client.files.upload("coqa_prepared_train.jsonl", check=True)
print(train_file_resp)

Uploading file coqa_prepared_train.jsonl: 100%|██████████| 23.8M/23.8M [00:01<00:00, 23.3MB/s]


id='file-63b9f097-e582-4d2e-941e-4b541aa7e328' object=<ObjectType.File: 'file'> created_at=1731886046 type=None purpose=<FilePurpose.FineTune: 'fine-tune'> filename='coqa_prepared_train.jsonl' bytes=0 line_count=0 processed=False FileType='jsonl'


In [None]:
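ft_resp = client.fine_tuning.create(
    training_file=train_file_resp.id,  # ID of the file uploaded above
    model='meta-llama/Meta-Llama-3.1-8B-Instruct-Reference',
    train_on_inputs="auto",  # let the API decide whether to mask inputs based on the data format
    n_epochs=3,
    n_checkpoints=1,
    wandb_api_key=WANDB_API_KEY,  # optional; enables WandB logging when set
    lora=True,  # train LoRA adapters rather than all model weights
    warmup_ratio=0,
    learning_rate=1e-5,
    suffix='my-demo-finetune',  # appended to the name of the fine-tuned model
)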

print(ft_resp.id)
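
The job runs asynchronously, so a notebook may want to wait for it to finish. The following is a minimal polling sketch, not part of the original cookbook. `wait_for_job` is a hypothetical helper that takes any zero-argument callable returning a status string, and the terminal status names in `TERMINAL_STATES` are assumptions; compare them with what the API actually returns.

```python
import time

# Assumed names for terminal job states; verify against the real API output.
TERMINAL_STATES = {"completed", "error", "cancelled"}


def wait_for_job(fetch_status, poll_seconds=30.0, timeout_seconds=3600.0,
                 sleep=time.sleep):
    """Poll fetch_status() until it returns a terminal state or time runs out."""
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {status!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Offline demonstration with a fake status source and a no-op sleep.
fake_statuses = iter(["pending", "running", "completed"])
final_status = wait_for_job(lambda: next(fake_statuses), sleep=lambda s: None)
print(final_status)

# Assumed usage with the client above (not executed here). The shape of the
# status field is an assumption; it may be an enum that needs converting to
# a plain string first:
# wait_for_job(lambda: client.fine_tuning.retrieve(ft_resp.id).status)
```

Passing the status source as a callable keeps the loop independent of the client library, so it can be exercised without network access.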

Once the fine-tuning job has completed, you will be able to see it in your [Together AI dashboard](https://api.together.ai).

<img src="../images/ft_model.png" height="500">

You can also look at the WandB plots for the run (if a WandB API key was provided):

<img src="../images/wandb_model.png" height="500">

## Evaluate Fine-tuned Model

For evaluation, CoQA uses two metrics: 
- F1 score, which measures word overlap between predicted and ground truth answers
- Exact Match (EM), which requires the prediction to exactly match one of the ground truth answers. 

F1 is the primary metric as it better handles free-form answers by giving partial credit for partially correct responses.
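
To show how the two metrics differ, here is a simplified, dependency-free sketch of SQuAD-style scoring against a single reference answer. The function names are illustrative, and the normalization (lowercasing, stripping punctuation and English articles) is written from memory of the usual convention; the evaluation in this cookbook uses the `squad_metrics` helpers from `transformers` instead.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(gold, pred):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(gold) == normalize(pred))


def token_f1(gold, pred):
    """Harmonic mean of token-level precision and recall."""
    gold_tokens = normalize(gold).split()
    pred_tokens = normalize(pred).split()
    if not gold_tokens or not pred_tokens:
        # Two empty answers count as a match; one empty answer scores zero.
        return float(gold_tokens == pred_tokens)
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


gold = "Hard Rock Cafe"
verbose_pred = "The Hard Rock Cafe in Times Square"
print(exact_match(gold, verbose_pred), round(token_f1(gold, verbose_pred), 3))
print(exact_match(gold, "the hard rock cafe."), token_f1(gold, "the hard rock cafe."))
```

A verbose but correct prediction gets an EM of 0 while still earning partial F1 credit, which is the behavior described above.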

In [None]:
from tqdm.auto import tqdm
from multiprocessing.pool import ThreadPool
import transformers.data.metrics.squad_metrics as squad_metrics

In [None]:
# This function is used to generate model answers on the CoQA validation set from the untuned reference and fine-tuned models

def get_model_answers(model_name):
    """
    Generate model answers for a given model name using a dataset of questions and answers.
    Args:
        model_name (str): The name of the model to use for generating answers.
    Returns:
        list: A list of lists, where each inner list contains the answers generated by the model for the corresponding set of questions in the dataset.
    The function performs the following steps:
    1. Initializes an empty list to store the model answers.
    2. Defines an inner function `get_answers` that takes a data dictionary and generates answers for the questions in the data.
    3. Uses a thread pool to parallelize the process of generating answers for each entry in the validation dataset.
    4. Appends the generated answers to the `model_answers` list.
    5. Returns the `model_answers` list.
    Note:
        - The `system_prompt` and `client` variables are assumed to be defined elsewhere in the code.
        - The `coqa_dataset` variable is assumed to contain the dataset with a "validation" key.
    """

    model_answers = []

    def get_answers(data):
        answers = []
        messages = [
            {
                "role": "system",
                "content": system_prompt.format(data["story"]),
            }
        ]
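        for q, true_answer in zip(data["questions"], data["answers"]["input_text"]):
            messages.append(
                {
                    "role": "user",
                    "content": q
                }
            )
            chat_completion = client.chat.completions.create(
                messages=messages,
                model=model_name,
                max_tokens=64,
            )
            answer = chat_completion.choices[0].message.content
            answers.append(answer)
            # Keep the history well-formed: add the ground-truth answer as the
            # assistant turn so each later question sees the prior context.
            messages.append(
                {
                    "role": "assistant",
                    "content": true_answer
                }
            )
        return answers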


    with ThreadPool(8) as pool:
        for answers in tqdm(pool.imap(get_answers, coqa_dataset["validation"]), total=len(coqa_dataset["validation"])):
            model_answers.append(answers)

    return model_answers
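
`get_model_answers` relies on `pool.imap` yielding results in the same order as its input, which is what later allows predictions to be paired with validation rows by position. The small sketch below, using a hypothetical `slow_echo` worker, illustrates that ordering guarantee even when tasks finish at different times.

```python
import time
from multiprocessing.pool import ThreadPool


def slow_echo(item):
    """Sleep for the given delay, then return the item's index."""
    index, delay = item
    time.sleep(delay)
    return index


# The first task is the slowest, so it finishes last.
items = [(0, 0.05), (1, 0.0), (2, 0.02), (3, 0.0)]

with ThreadPool(4) as pool:
    # imap yields results in input order, regardless of completion order.
    ordered = list(pool.imap(slow_echo, items))

print(ordered)
```

Swapping in `imap_unordered` would give no such guarantee, and the positional pairing with the validation set would no longer be safe.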

In [None]:
# This function will be used to evaluate predicted answers using the Exact Match (EM) and F1 metrics

def get_metrics(pred_answers):
    """
    Calculate the Exact Match (EM) and F1 metrics for predicted answers.
    Args:
        pred_answers (list): A list of predicted answers. Each element is the list of predicted answers for one conversation (one validation row), in question order.
    Returns:
        tuple: A tuple containing two elements:
            - em_score (float): The average Exact Match score across all predictions.
            - f1_score (float): The average F1 score across all predictions.
    """

    em_metrics = []
    f1_metrics = []

    for pred, data in tqdm(zip(pred_answers, coqa_dataset["validation"]), total=len(pred_answers)):
        for pred_answer, true_answer in zip(pred, data["answers"]["input_text"]):
            em_metrics.append(squad_metrics.compute_exact(true_answer, pred_answer))
            f1_metrics.append(squad_metrics.compute_f1(true_answer, pred_answer))

    return sum(em_metrics) / len(em_metrics), sum(f1_metrics) / len(f1_metrics)

## Deploy Model and Run Evals

Before we can run the evaluations, we need to deploy our fine-tuned model as a Dedicated Endpoint (DE). After fine-tuning completes, access your model through the Together AI dashboard: go to Models, select your fine-tuned model, and click Deploy. Choose from the available hardware options; we'll use a single A100-80GB GPU for this example.

<img src="../images/deploy_CFT.png" height="650">

In [None]:
models_names = [
    "meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
    "zainhas/Meta-Llama-3.1-8B-Instruct-Reference-my-demo-finetune-4224205a", # finetuned model goes here once deployed
]

for model_name in models_names:
    print(model_name)
    answers = get_model_answers(model_name)
    em_metric, f1_metric = get_metrics(answers)
    print(f"EM: {em_metric}, F1: {f1_metric}")

In the evaluation above, we saw a marked improvement in the LLM's ability to answer conversational questions: the Exact Match score increases ~14x and the F1 score ~3x after fine-tuning.

| Llama 3.1 8B | EM | F1 |
|---|---|---|
| Original | 0.043 | 0.232 |
| Fine-tuned | 0.62 | 0.78 |
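
The relative gains can be checked with a few lines of arithmetic. The sketch below copies the four scores from the table and reports both the absolute and the multiplicative change; the `improvement` helper is purely illustrative.

```python
# Scores copied from the results table.
scores = {
    "Original": {"EM": 0.043, "F1": 0.232},
    "Fine-tuned": {"EM": 0.62, "F1": 0.78},
}


def improvement(metric):
    """Absolute and multiplicative change of a metric after fine-tuning."""
    before = scores["Original"][metric]
    after = scores["Fine-tuned"][metric]
    return {"absolute": after - before, "ratio": after / before}


for metric in ("EM", "F1"):
    result = improvement(metric)
    print(f"{metric}: +{result['absolute']:.3f} absolute, {result['ratio']:.1f}x")
```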

To learn more about our fine-tuning API read the docs [here](https://docs.together.ai/reference/finetune)!