# Fine Tuning GPT-3.5-Turbo

In this notebook, we walk through an example of fine-tuning gpt-3.5-turbo.

Specifically, we attempt to distill GPT-4's knowledge by generating training data with GPT-4 and then fine-tuning GPT-3.5 on it.

All training data is generated using two different sections of our index data, creating both a training and an evaluation set.

Evaluation is done using the `ragas` library, which we will detail later on.

In [None]:
!pip install llama-index pypdf sentence-transformers ragas

Collecting llama-index
  Downloading llama_index-0.10.43-py3-none-any.whl (6.8 kB)
Collecting pypdf
  Downloading pypdf-4.2.0-py3-none-any.whl (290 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
Collecting ragas
  Downloading ragas-0.1.9-py3-none-any.whl (86 kB)
Collecting llama-index-agent-openai<0.3.0,>=0.1.4 (from llama-index)
  Downloading llama_index_agent_openai-0.2.7-py3-none-any.whl (12 kB)
Collecting llama-index-cli<0.2.0,>=0.1.2 (from llama-index)
  Downloading llama_index_cli-0.1.12-py3-none-any.whl (26 kB)
Collecting llama-index-core==0.10.4

In [None]:
from google.colab import drive
drive.mount('/content/drive')
file_path = '/content/drive/My Drive/224n project stuff/masked_examples_max_seq_100.json'
!pip install git+https://github.com/google-research/bleurt.git
!wget https://storage.googleapis.com/bleurt-oss/bleurt-base-128.zip
!unzip bleurt-base-128.zip

Mounted at /content/drive
Collecting git+https://github.com/google-research/bleurt.git
  Cloning https://github.com/google-research/bleurt.git to /tmp/pip-req-build-uofd2isd
  Running command git clone --filter=blob:none --quiet https://github.com/google-research/bleurt.git /tmp/pip-req-build-uofd2isd
  Resolved https://github.com/google-research/bleurt.git to commit cebe7e6f996b40910cfaa520a63db47807e3bf5c
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: BLEURT
  Building wheel for BLEURT (setup.py) ... done
  Created wheel for BLEURT: filename=BLEURT-0.0.2-py3-none-any.whl size=16456763 sha256=1af0c3edbb3913003a05c881a79562c8029a8b9e7fd8a0653189a327f7f9fd18
  Stored in directory: /tmp/pip-ephem-wheel-cache-hc0i545h/wheels/64/f4/2c/509a6c31b8ebde891a81029fd94f199b1b92f0e7cfc20d417a
Successfully built BLEURT
Installing collected packages: BLEURT
Successfully installed BLEURT-0.0.2
--2024-06-08 20:05:42--  https://storage.googleapi

In [None]:
import os
import openai

os.environ["OPENAI_API_KEY"] = ""
openai.api_key = os.environ["OPENAI_API_KEY"]

# Setting up jsonl file

In [None]:
!pip install jsonlines

Collecting jsonlines
  Downloading jsonlines-4.0.0-py3-none-any.whl (8.7 kB)
Installing collected packages: jsonlines
Successfully installed jsonlines-4.0.0


In [None]:
import json
import random
from sklearn.model_selection import train_test_split
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

directory_path = '/content/drive/My Drive/224n'

# Create the directory if it does not exist
os.makedirs(directory_path, exist_ok=True)

# Load the JSON file
file_path = os.path.join(directory_path, 'masked_examples_max_seq_100.json')
with open(file_path, 'r') as f:
    data = json.load(f)

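# Shuffle and split the data. train_test_split shuffles by default, and
# random_state makes the split reproducible, so no separate shuffle is needed.
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)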

# Function to convert to JSONL
def convert_to_jsonl_chat_format(data, output_file):
    with open(output_file, 'w') as f:
        for entry in data:
            # Ensure the entry has the correct keys
            if "input" in entry and "target" in entry:
                chat_format_entry = {
                    "messages": [
                        {"role": "user", "content": entry.get("input", "")},
                        {"role": "assistant", "content": entry.get("target", "")}
                    ]
                }
                json.dump(chat_format_entry, f)
                f.write('\n')

# Define the paths to save the JSONL files
train_file_path = os.path.join(directory_path, '244nPlease.jsonl')
test_file_path = os.path.join(directory_path, 'test_data.jsonl')

# Convert and save training data
convert_to_jsonl_chat_format(train_data, train_file_path)

# Convert and save test data
convert_to_jsonl_chat_format(test_data, test_file_path)

print(f"Training data saved to {train_file_path}")
print(f"Test data saved to {test_file_path}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

In [None]:
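file_path = train_file_path
fixed_file_path = '/mnt/data/fixed_train_data.jsonl'

# Load the JSONL file and rename the fields to prompt/completion.
# Entries may be raw ("input"/"target") or chat-format ("messages"),
# so read the text from whichever shape is present.
fixed_data = []
with open(file_path, 'r') as f:
    for line in f:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            print(f"Error decoding JSON on line: {line}")
            continue
        by_role = {m["role"]: m["content"] for m in entry.get("messages", [])}
        fixed_entry = {
            "prompt": by_role.get("user", entry.get("input", "")),
            "completion": by_role.get("assistant", entry.get("target", ""))
        }
        fixed_data.append(fixed_entry)

# Save the fixed JSONL file (create the output directory if needed)
os.makedirs(os.path.dirname(fixed_file_path), exist_ok=True)
with open(fixed_file_path, 'w') as f:
    for entry in fixed_data:
        json.dump(entry, f)
        f.write('\n')

print(f"Fixed JSONL file saved to {fixed_file_path}")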

In [None]:
!pip install transformers

from transformers import GPT2Tokenizer

# Use the GPT-2 tokenizer as a rough approximation only: gpt-3.5-turbo uses
# the cl100k_base encoding (available via tiktoken), so real counts will differ.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Load the JSONL file
file_path = train_file_path
data = []

with open(file_path, 'r') as f:
    for line in f:
        data.append(json.loads(line))

# Function to count tokens
def count_tokens(text):
    return len(tokenizer.encode(text))

# Calculate the total number of tokens
total_tokens = 0

for entry in data:
    for key, value in entry.items():
        if isinstance(value, str):
            total_tokens += count_tokens(value)
        else:
            # Chat-format entries store a list of {"role", "content"} messages
            total_tokens += sum(count_tokens(m["content"]) for m in value)

print(f"Total number of tokens: {total_tokens}")

# Estimate the cost for fine-tuning
cost_per_1000_tokens = 0.03  # in USD (example cost)
total_cost = (total_tokens / 1000) * cost_per_1000_tokens

print(f"Estimated cost for fine-tuning: ${total_cost:.2f}")




Token indices sequence length is longer than the specified maximum sequence length for this model (2298 > 1024). Running this sequence through the model will result in indexing errors


Total number of tokens: 2889360
Estimated cost for fine-tuning: $86.68
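
The estimate above is plain arithmetic: the token count divided by 1,000, multiplied by the per-1K price. Below is a minimal sketch of the same calculation as a reusable function. The `n_epochs` multiplier is an assumption added for illustration (training is typically billed per trained token, so each extra pass over the file scales the cost), and the $0.03 rate is the notebook's example figure rather than a quoted price.

```python
def estimate_finetune_cost(total_tokens, price_per_1k_tokens, n_epochs=1):
    """Return an approximate training cost in USD.

    n_epochs is a hypothetical knob: it assumes every pass over the
    training file is billed at the same per-token rate.
    """
    return (total_tokens / 1000) * price_per_1k_tokens * n_epochs


# Token count taken from the cell output above; rate is the example rate.
single_pass = estimate_finetune_cost(2_889_360, 0.03)
three_passes = estimate_finetune_cost(2_889_360, 0.03, n_epochs=3)

print(f"1 epoch:  ${single_pass:.2f}")
print(f"3 epochs: ${three_passes:.2f}")
```

Because the token count here comes from the GPT-2 tokenizer, treat the result as a ballpark figure.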


In [None]:
!pip install llama-index-core
!pip install llama-index-llms-openai
!pip install llama-index-llms-replicate
!pip install llama-index-embeddings-huggingface

Collecting llama-index-llms-replicate
  Downloading llama_index_llms_replicate-0.1.3-py3-none-any.whl (2.9 kB)
Installing collected packages: llama-index-llms-replicate
Successfully installed llama-index-llms-replicate-0.1.3
Collecting llama-index-embeddings-huggingface
  Downloading llama_index_embeddings_huggingface-0.2.1-py3-none-any.whl (7.1 kB)
Collecting sentence-transformers<3.0.0,>=2.6.1 (from llama-index-embeddings-huggingface)
  Downloading sentence_transformers-2.7.0-py3-none-any.whl (171 kB)
Collecting minijinja>=1.0 (from huggingface-hub[inference]>=0.19.0->llama-index-embeddings-huggingface)
  Downloading minijinja-2.0.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (853 kB)
Installing collected packages: minijinja, sente

In [None]:
!pip install transformers
!pip install openai



In [None]:
!pip install llama_index



In [None]:
# OpenAIFineTuningHandler ships with llama-index-finetuning;
# there is no separate llama-index-finetuning-callbacks package on PyPI.
!pip install llama-index-finetuning
!pip install llama-index-llms-openai


In [None]:
!pip install llama-index



In [None]:

from llama_index.llms.openai import OpenAI
from llama_index.finetuning.callbacks import OpenAIFineTuningHandler
from llama_index.core.callbacks import CallbackManager

finetuning_handler = OpenAIFineTuningHandler()
callback_manager = CallbackManager([finetuning_handler])

llm = OpenAI(model="gpt-3.5-turbo-1106", temperature=0.3)
llm.callback_manager = callback_manager

## Create Fine-Tuning Data

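Fine-tuning data must be written as a list of messages in a `.jsonl` file. Using the fine-tuning handler, we can easily write the messages to a `.jsonl` file. The handler only records LLM calls made through `llm` while it is attached, so if no calls have been made yet, the saved file will contain 0 examples.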
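
Before uploading a file, it can help to check that each line really has this shape. The following is a minimal sketch of such a check using only the standard library. The function name `validate_chat_jsonl_line` is hypothetical, and the accepted role set is a simplifying assumption (it ignores tool and function messages).

```python
import json

# Simplified role set; an assumption for this sketch.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chat_jsonl_line(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(entry, dict):
        return ["entry is not a JSON object"]
    messages = entry.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {i} is not an object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {i} has non-string content")
    if not any(isinstance(m, dict) and m.get("role") == "assistant" for m in messages):
        problems.append("no assistant message to learn from")
    return problems


good = json.dumps({"messages": [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]})
bad = json.dumps({"prompt": "What is 2 + 2?", "completion": "4"})

print(validate_chat_jsonl_line(good))
print(validate_chat_jsonl_line(bad))
```

Running this over every line of a training file flags prompt/completion-style entries, which are not in chat format.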

In [None]:
finetuning_handler.save_finetuning_events("finetuning_events.jsonl")

Wrote 0 examples to finetuning_events.jsonl


## Launch Fine-Tuning Job

In [None]:
from llama_index.finetuning import OpenAIFinetuneEngine

finetune_engine = OpenAIFinetuneEngine(
    "gpt-3.5-turbo-1106",
    "finetuning_events.jsonl",
)


In [None]:
finetune_engine.finetune()

In [None]:
ft_llm = finetune_engine.get_finetuned_model(temperature=0.3)

## Evaluation

After some time, your model will be done training!

The next step is running our fine-tuned model on our eval dataset again to measure any performance increase.

In [None]:
ft_model_name = "ft:gpt-3.5-turbo-0613:..."

In [None]:
# Import paths for llama-index 0.10.x, matching the version installed above
# (ServiceContext is deprecated there in favor of Settings, but still available)
from llama_index.core import ServiceContext
from llama_index.llms.openai import OpenAI


ft_context = ServiceContext.from_defaults(
    llm=OpenAI(model=ft_model_name, temperature=0.3),
    context_window=2048,  # limit the context window artificially to test refine process
)

In [None]:
questions = []
with open("eval_questions.txt", "r") as f:
    for line in f:
        questions.append(line.strip())

In [None]:
from llama_index.core import VectorStoreIndex

# `documents` must already be loaded; this notebook does not show that step
index = VectorStoreIndex.from_documents(documents, service_context=ft_context)

query_engine = index.as_query_engine(similarity_top_k=2)

In [None]:
contexts = []
answers = []

for question in questions:
    response = query_engine.query(question)
    contexts.append([x.node.get_content() for x in response.source_nodes])
    answers.append(str(response))

In [None]:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

ds = Dataset.from_dict(
    {
        "question": questions,
        "answer": answers,
        "contexts": contexts,
    }
)

result = evaluate(ds, [answer_relevancy, faithfulness])
print(result)

evaluating with [answer_relevancy]


100%|██████████| 3/3 [00:50<00:00, 16.92s/it]


evaluating with [faithfulness]


100%|██████████| 3/3 [03:15<00:00, 65.20s/it]


{'ragas_score': 0.8845, 'answer_relevancy': 0.9758, 'faithfulness': 0.8088}
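
The combined `ragas_score` in this output is consistent with the harmonic mean of the two metric scores. The short check below reproduces it from the printed values. That `ragas` computes the score this way is an inference from these numbers, not something the notebook states.

```python
from statistics import harmonic_mean

# Metric values copied from the evaluation output above.
scores = {"answer_relevancy": 0.9758, "faithfulness": 0.8088}

# The harmonic mean penalizes a low score on any single metric
# more strongly than an arithmetic mean would.
combined = harmonic_mean(list(scores.values()))
print(f"{combined:.4f}")
```

The result rounds to the reported 0.8845, so the lower faithfulness score pulls the combined figure down more than a simple average would.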


## Exploring Differences

Let's quickly compare the responses of the two models to demonstrate that fine-tuning did indeed change something.

In [None]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)

In [None]:
questions = []
with open("eval_questions.txt", "r") as f:
    for line in f:
        questions.append(line.strip())

In [None]:
print(questions[12])

What is a key barrier globally for ocean health, governance, and adaptation to climate change according to the report?


### Original

In [None]:
from llama_index.core.response.notebook_utils import display_response
from llama_index.core import ServiceContext
from llama_index.llms.openai import OpenAI


gpt_35_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.3),
    context_window=2048,  # limit the context window artificially to test refine process
)

In [None]:
query_engine = index.as_query_engine(service_context=gpt_35_context)

response = query_engine.query(questions[12])

display_response(response)

**`Final Response:`** According to the report, a key barrier globally for ocean health, governance, and adaptation to climate change is the availability of technology, knowledge, and financial support, as well as existing governance structures.

### Fine-Tuned

In [None]:
from llama_index.core import ServiceContext
from llama_index.llms.openai import OpenAI


ft_context = ServiceContext.from_defaults(
    llm=OpenAI(model=ft_model_name, temperature=0.3),
    context_window=2048,  # limit the context window artificially to test refine process
)

In [None]:
query_engine = index.as_query_engine(service_context=ft_context)

response = query_engine.query(questions[12])

display_response(response)

**`Final Response:`** The report identifies a broad range of barriers and limits for adaptation to climate change in ecosystems and human systems. These limitations include the availability of technology, knowledge, and financial support, as well as existing governance structures. Existing ocean-governance structures are already facing multi-dimensional, scale-related challenges because of climate change.

As we can see, the fine-tuned model provides a more thorough response! This lines up with the increased faithfulness score from ragas, since the answer is more representative of the retrieved context.

## Conclusion

In conclusion, fine-tuning with only ~61 questions improved faithfulness while leaving answer relevancy essentially unchanged.

**answer_relevancy: 0.9778 -> 0.9758**

The answer relevancy appears to be basically unchanged between models.

**faithfulness: 0.7638 -> 0.8088**

The faithfulness appears to have improved! This means the answers given are better grounded in the retrieved context.
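
To put the two comparisons on the same footing, the sketch below computes the absolute and relative change for each metric from the scores quoted above. The helper name `score_changes` is hypothetical.

```python
# Scores quoted in the conclusion: base model vs. fine-tuned model.
baseline = {"answer_relevancy": 0.9778, "faithfulness": 0.7638}
fine_tuned = {"answer_relevancy": 0.9758, "faithfulness": 0.8088}


def score_changes(before, after):
    """Return {metric: (absolute change, relative change in percent)}."""
    return {
        name: (
            after[name] - before[name],
            100 * (after[name] - before[name]) / before[name],
        )
        for name in before
    }


changes = score_changes(baseline, fine_tuned)
for name, (absolute, relative) in changes.items():
    print(f"{name}: {absolute:+.4f} ({relative:+.1f}%)")
```

Faithfulness rises by about 0.045 (roughly 6% relative), while answer relevancy moves by only 0.002. On an evaluation set this small, the latter may well be noise.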