### Starting with a little Jupyter Magic
These magic commands tell Jupyter to automatically reload modules that have changed. This is very useful during development so you don't have to manually restart the kernel after making modifications to your Python files. `%load_ext autoreload` loads the extension, and `%autoreload 2` configures it to reload all modules (except those explicitly excluded).

In [1]:
%load_ext autoreload
%autoreload 2

# 1. Set up our RAG Pipeline
We implement a simple RAG pipeline using LlamaIndex - you are of course welcome to use any other framework you please!

In [2]:
import os
from dotenv import load_dotenv, find_dotenv
import warnings
import nest_asyncio

_ = load_dotenv(find_dotenv())
warnings.filterwarnings("ignore")
nest_asyncio.apply()
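
A missing API key tends to surface later as an opaque authentication error, so it can help to check the environment right after loading `.env`. The sketch below uses a hypothetical helper (`missing_env_vars` is not part of any library). It assumes the OpenAI integrations read `OPENAI_API_KEY`, and it reuses the `HUGGINGFACE_ACCESS_TOKEN` name that appears later in this notebook.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Variable names are assumptions; adjust them to whatever your .env file defines.
required = ["OPENAI_API_KEY", "HUGGINGFACE_ACCESS_TOKEN"]
missing = missing_env_vars(required)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```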

In [None]:
from llama_index.core import (
    Settings,
    VectorStoreIndex,
    SimpleDirectoryReader,
)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model="gpt-4o")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

### Ingest documents and Generate RAG Dataset

Okay, the next step in our recipe involves preparing the data and generating a synthetic dataset for Retrieval Augmented Generation (RAG)!

We use gpt-4o to try to generate 10 question-answer pairs for each chunk of text extracted from the document.

In [5]:
from llama_index.core.llama_dataset.generator import RagDatasetGenerator

docs = SimpleDirectoryReader("../data/").load_data(show_progress=True)
data_gen = RagDatasetGenerator.from_documents(
    docs,
    llm=Settings.llm,
    question_gen_query="You are a teacher/professor. Using the provided context, formulate a single question and its answer",
    num_questions_per_chunk=10
)

Loading files:   0%|          | 0/1 [00:00<?, ?file/s]

Loading files: 100%|██████████| 1/1 [00:00<00:00,  1.19file/s]


In [6]:
qa_dataset = data_gen.generate_dataset_from_nodes()

In [7]:
qa_dataset.examples

[LabelledRagDataExample(query='**Question:** What were the main findings of the study conducted by Susan Athey and Emil Palikot on the labor market value of non-traditional credentials obtained from MOOCs?', query_by=CreatedBy(model_name='gpt-4o', type=<CreatedByType.AI: 'ai'>), reference_contexts=['The value of non-traditional credentials in the labor market*\nSusan Athey & Emil Palikot\nMay 2, 2024\nAbstract\nThis study investigates the labor market value of credentials obtained from Massive Open On-\nline Courses (MOOCs) and shared on business networking platforms. We conducted a random-\nized experiment involving more than 800,000 learners, primarily from developing countries and\nwithout college degrees, who completed technology or business-related courses on the Coursera\nplatform between September 2022 and March 2023. The intervention targeted learners who had\nrecently completed their courses, encouraging them to share their credentials and simplifying the\nsharing process. One
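
The first generated query above begins with a literal `**Question:**` label, presumably because the generation prompt asks for a question and its answer. If you would rather not pass that label to the query engine, a small post-processing pass could strip it. This is only a sketch: `clean_query` is a hypothetical helper and it handles just the prefix pattern visible in the output above.

```python
import re

# Matches an optional Markdown-bold "Question:" label at the start of a string.
_LABEL = re.compile(r"^\s*\*{0,2}Question:?\*{0,2}:?\s*", flags=re.IGNORECASE)

def clean_query(query: str) -> str:
    """Remove a leading 'Question:' label (bold or plain) and surrounding whitespace."""
    return _LABEL.sub("", query, count=1).strip()

print(clean_query("**Question:** What were the main findings of the study?"))
```

Queries without the label pass through unchanged, so the helper is safe to apply to every example.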

## Vanilla RAG Evaluation
Before running any finetuning it's always important to run a vanilla RAG evaluation. That way we can quantify the gains from finetuning and ascertain if finetuning was even needed!

In this case I host my LLM using Ollama (but you can use other providers such as Local LM, vLLM, etc.)
> !ollama pull llama3.2:1b

In [8]:
from llama_index.llms.ollama import Ollama

llm = Ollama(model="llama3.2:1b")

Creating our RAG query engine
> Seriously it's just one line. Thank you LlamaIndex for making this so easy!

In [9]:
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(similarity_top_k=6, llm=llm)
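
Conceptually, `similarity_top_k=6` asks the retriever for the six chunks whose embeddings are most similar to the query embedding. The following is a minimal, library-free sketch of that idea using cosine similarity on toy vectors. It is not LlamaIndex's implementation; `top_k_by_cosine` and the sample vectors are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_by_cosine(query_vec, chunk_vecs, k):
    """Return (index, score) pairs for the k chunks most similar to the query."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

chunks = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]  # toy 2-D "embeddings"
print(top_k_by_cosine([1.0, 0.1], chunks, k=2))
```

A larger `k` gives the LLM more context to draw on, at the cost of longer prompts and more irrelevant passages for a small model to ignore.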

Now let's instantiate our RAG evaluator. The RagEvaluatorPack is a Llama Pack developed by amazing open-source developers. It spares you from learning a separate evaluation framework (such as RAGAS) while letting you run a comparable evaluation in just one line of code.

> Pro-Tip: I always suggest using a stronger LLM (gpt-4o) to judge the LLM we are trying to finetune (Llama 3.2 1Bn). That way if our strong LLM thinks the finetuned LLM meets the mark, we'd have an LLM that punches far above its weight!

In [10]:
from llama_index.packs.rag_evaluator import RagEvaluatorPack

rag_evaluator = RagEvaluatorPack(
    query_engine=query_engine, 
    rag_dataset=qa_dataset,
    judge_llm=Settings.llm,  # judge with the same LLM that generated the dataset
    embed_model=Settings.embed_model
)

This cell will take a while! It took me 10 minutes!

In [12]:
benchmark_df = rag_evaluator.run()

  0%|          | 0/10 [00:00<?, ?it/s]

100%|██████████| 10/10 [00:48<00:00,  4.88s/it]
100%|██████████| 10/10 [00:34<00:00,  3.48s/it]
100%|██████████| 10/10 [00:47<00:00,  4.72s/it]
100%|██████████| 4/4 [00:16<00:00,  4.17s/it]


In [13]:
benchmark_df

| metrics | base_rag |
|---|---|
| mean_correctness_score | 2.8125 |
| mean_relevancy_score | 0.6875 |
| mean_faithfulness_score | 0.875 |
| mean_context_similarity_score | 0.95766 |

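This is not bad for a 1B-parameter model! Its answers are faithful to the retrieved context most of the time (0.875) and relevant about two-thirds of the time (0.6875), although the mean correctness score of 2.8125 (LlamaIndex's correctness evaluator scores from 1 to 5) leaves room for improvement. Let's see if we can make it do better!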

# Retrieval Augmented Finetuning

## Getting the RAFT dataset

We start by generating the RAFT dataset - courtesy of yet another Llama pack!

In [14]:
from llama_index.packs.raft_dataset import RAFTDatasetPack

raft_dataset = RAFTDatasetPack(
    file_path = "../data/non_traditional_credentials.pdf",
    llm = Settings.llm,
    embed_model=Settings.embed_model
)

This cell takes quite long to run! Go have a coffee ☕
> It took 19 minutes for the cell to finish running

In [15]:
dataset = raft_dataset.run()

Let's take a look at what we did!

In [18]:
import pandas as pd
df = pd.DataFrame(dataset)
df.head()

| | id | type | question | context | oracle_context | cot_answer | instruction |
|---|---|---|---|---|---|---|---|
| 0 | seed_task_0 | general | What percentage increase in credential sharing... | `{'sentences': [['The value of non-traditional ...` | The value of non-traditional credentials in th... | assistant: To determine the percentage increas... | `<DOCUMENT>The value of non-traditional credent...` |
| 1 | seed_task_1 | general | How much more likely were learners in the trea... | `{'sentences': [['The control group did not rec...` | The value of non-traditional credentials in th... | assistant: To answer the question "How much mo... | `<DOCUMENT>The control group did not receive th...` |
| 2 | seed_task_2 | general | What was the increase in jobs related to the c... | `{'sentences': [['The value of non-traditional ...` | The value of non-traditional credentials in th... | assistant: To determine the increase in jobs r... | `<DOCUMENT>The value of non-traditional credent...` |
| 3 | seed_task_3 | general | Which group of LinkedIn users showed a more pr... | `{'sentences': [['The value of non-traditional ...` | The value of non-traditional credentials in th... | assistant: To determine which group of LinkedI... | `<DOCUMENT>The value of non-traditional credent...` |
| 4 | seed_task_4 | general | What platform were the courses completed on fo... | `{'sentences': [['Analogously, Past Managerial ...` | The value of non-traditional credentials in th... | assistant: To answer the question "What platfo... | `<DOCUMENT>Analogously, Past\nManagerial Job fo...` |


Notice that, along with the question and the chain-of-thought answer (`cot_answer`), the dataset also contains `context`, `oracle_context` and `instruction` columns. We'll be using these in crafting the final dataset to finetune our Llama 3.2 1Bn!

In [24]:
from IPython.display import display, Markdown

display(Markdown(df.iloc[0]['instruction']))

<DOCUMENT>The value of non-traditional credentials in the labor market*
Susan Athey & Emil Palikot
May 2, 2024
Abstract
This study investigates the labor market value of credentials obtained from Massive Open On-
line Courses (MOOCs) and shared on business networking platforms. We conducted a random-
ized experiment involving more than 800,000 learners, primarily from developing countries and
without college degrees, who completed technology or business-related courses on the Coursera
platform between September 2022 and March 2023. The intervention targeted learners who had
recently completed their courses, encouraging them to share their credentials and simplifying the
sharing process. One year after the intervention, we collected data from LinkedIn profiles of ap-
proximately 40,000 experimental subjects. We find that the intervention leads to an increase of 17
percentage points for credential sharing. Further, learners in the treatment group were 6% more
likely to report new employment within a year, with an 8% increase in jobs related to their certifi-
cates. This effect was more pronounced among LinkedIn users with lower baseline employability.
Across the entire sample, the treated group received a higher number of certificate views, indicat-
ing an increased interest in their profiles. These results suggest that facilitating credential sharing
and reminding learners of the value of skill signaling can yield significant gains. When the ex-
periment is viewed as an encouragement design for credential sharing, we can estimate the local
average treatment effect (LATE) of credential sharing (that is, the impact of credential sharing on
the workers induced to share by the intervention) for the outcome of getting a job. The LATE esti-
mates are imprecise but large in magnitude; they suggest that credential sharing more than doubles
the baseline probability of getting a new job in scope for the credential.
*We thank Eric Karsten and his team in Coursera for collaborating on this project. </DOCUMENT>
<DOCUMENT>13 p.p.) and 36 p.p. (S.E. </DOCUMENT>
<DOCUMENT>), which corresponds to a
17% increase from baseline. The remaining columns present estimates from the instrumental variable
regression with New Job and New Job in Scope as outcomes. In Columns 6, 7, and 8, we restrict attention
to jobs reported with a starting date at least four months after treatment. We estimate positive and
statistically significant effects. Specifically, we estimate the local average treatment effect of 0.24 (S.E.
0.13) for any new job starting at least one month after treatment and 0.36 (S.E. 0.12) when restricting
14</DOCUMENT>
What percentage increase in credential sharing was observed after the intervention?
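
The instruction printed above follows a recognizable pattern: each candidate passage is wrapped in `<DOCUMENT>` tags (the oracle passage plus unrelated distractors), and the question comes last. A rough sketch of how such a prompt could be assembled is shown below. It is an illustration only, not the `RAFTDatasetPack` implementation; `build_raft_instruction` and its shuffling behaviour are assumptions.

```python
import random

def build_raft_instruction(question, oracle, distractors, seed=None):
    """Wrap the oracle and distractor passages in <DOCUMENT> tags, then append the question."""
    passages = [oracle, *distractors]
    random.Random(seed).shuffle(passages)  # assumption: passage order is randomized
    docs = "\n".join(f"<DOCUMENT>{p}</DOCUMENT>" for p in passages)
    return f"{docs}\n{question}"

print(build_raft_instruction(
    question="What percentage increase in credential sharing was observed?",
    oracle="Oracle passage text.",
    distractors=["Distractor A.", "Distractor B."],
    seed=0,
))
```

Mixing the oracle passage with distractors is what teaches the finetuned model to pick out the relevant document and ignore the rest.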

In [27]:
display(Markdown(df.iloc[0]['oracle_context']))

The value of non-traditional credentials in the labor market*
Susan Athey & Emil Palikot
May 2, 2024
Abstract
This study investigates the labor market value of credentials obtained from Massive Open On-
line Courses (MOOCs) and shared on business networking platforms. We conducted a random-
ized experiment involving more than 800,000 learners, primarily from developing countries and
without college degrees, who completed technology or business-related courses on the Coursera
platform between September 2022 and March 2023. The intervention targeted learners who had
recently completed their courses, encouraging them to share their credentials and simplifying the
sharing process. One year after the intervention, we collected data from LinkedIn profiles of ap-
proximately 40,000 experimental subjects. We find that the intervention leads to an increase of 17
percentage points for credential sharing. Further, learners in the treatment group were 6% more
likely to report new employment within a year, with an 8% increase in jobs related to their certifi-
cates. This effect was more pronounced among LinkedIn users with lower baseline employability.
Across the entire sample, the treated group received a higher number of certificate views, indicat-
ing an increased interest in their profiles. These results suggest that facilitating credential sharing
and reminding learners of the value of skill signaling can yield significant gains. When the ex-
periment is viewed as an encouragement design for credential sharing, we can estimate the local
average treatment effect (LATE) of credential sharing (that is, the impact of credential sharing on
the workers induced to share by the intervention) for the outcome of getting a job. The LATE esti-
mates are imprecise but large in magnitude; they suggest that credential sharing more than doubles
the baseline probability of getting a new job in scope for the credential.
*We thank Eric Karsten and his team in Coursera for collaborating on this project. 

In [16]:
# Save as .jsonl format
dataset.to_json("raft_train.jsonl")

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

2966201
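
A JSON Lines file stores one JSON object per line, so it is easy to inspect with the standard library alone. The sketch below defines two hypothetical helpers for reading such a file back and checking that every record carries the columns the later steps rely on; the column names come from the dataset shown above.

```python
import json

REQUIRED_KEYS = {"question", "context", "oracle_context", "cot_answer", "instruction"}

def read_jsonl(path):
    """Load a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

def records_missing_keys(records, required=REQUIRED_KEYS):
    """Return the indices of records that lack any of the required keys."""
    return [i for i, rec in enumerate(records) if not required.issubset(rec)]

# Example usage once raft_train.jsonl exists:
# records = read_jsonl("raft_train.jsonl")
# print(len(records), records_missing_keys(records))
```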

## Training the LLM
We'll be using the amazing Unsloth framework to save VRAM resources and finish training faster!

Let's first start with a simple train-test split!

In [19]:
splits = dataset.train_test_split(test_size=0.1)
train_ds = splits["train"]
eval_ds  = splits["test"]

In [20]:
train_ds, eval_ds

(Dataset({
     features: ['id', 'type', 'question', 'context', 'oracle_context', 'cot_answer', 'instruction'],
     num_rows: 301
 }),
 Dataset({
     features: ['id', 'type', 'question', 'context', 'oracle_context', 'cot_answer', 'instruction'],
     num_rows: 34
 }))
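
The row counts above are consistent with rounding the test fraction up: 335 examples at `test_size=0.1` gives 34 test rows and 301 training rows. The arithmetic is sketched below; the rounding rule is an assumption inferred from the output rather than taken from the library's source.

```python
import math

def split_sizes(n_rows, test_size):
    """Row counts for a fractional test split, rounding the test set up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

print(split_sizes(335, 0.1))  # (301, 34), matching the row counts above
```

The split is shuffled, so passing a `seed` argument to `train_test_split` should make it reproducible across runs.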

Load the model and tokenizer we need

In [None]:
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct",
    max_seq_length = 2048, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    load_in_8bit = False, 
    full_finetuning = False, 
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 05-21 06:09:36 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-21 06:09:36 [__init__.py:239] Automatically detected platform cuda.
==((====))==  Unsloth 2025.4.7: Fast Llama patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA A10G. Num GPUs = 1. Max memory: 22.184 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


We could run full-model finetuning, but to speed things up and save memory let's use LoRA, which trains small adapter matrices while the base weights stay frozen.

In [22]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 2025,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.4.7 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
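
It can be reassuring to check how many parameters LoRA actually trains. Each adapted linear layer gains two matrices, A (r × in_features) and B (out_features × r), so it adds r × (in_features + out_features) parameters. The sketch below applies this to the seven target modules. The layer dimensions are assumptions taken from the public Llama 3.2 1B configuration, not from this notebook; the result can be compared with the trainable parameter count that Unsloth reports in the training log.

```python
# Assumed Llama 3.2 1B dimensions (from the public model config, not from this notebook).
HIDDEN = 2048        # hidden_size
KV_DIM = 512         # num_key_value_heads (8) * head_dim (64)
INTERMEDIATE = 8192  # intermediate_size
NUM_LAYERS = 16
RANK = 16            # r passed to get_peft_model above

# (in_features, out_features) for each targeted projection.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """A is (rank x in_features) and B is (out_features x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the total comes to 11,272,192, roughly 1.1% of a one-billion-parameter model.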


## Formatting the prompts
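We need to put everything together into a single 'text' field for the LLM to be trained on. Here we combine the question, the retrieved context, the oracle passage and the instruction into one text prompt, followed by the chain-of-thought answer, following the RAFT paper. Note that the `instruction` field already contains the documents and the question, so the prompt repeats some content.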

In [25]:
def formatting_prompts_func(examples):
    """Define a formatter that injects the retrieved context:"""
    
    texts = []
    for qn, ctx, oracle, instr, ans in zip(
        examples['question'],
        examples["context"],
        examples["oracle_context"],
        examples["instruction"],
        examples["cot_answer"]
    ):
        # You can choose to use `oracle_context` (gold) vs. `context` (retrieved)
        # Here we show both, but you could just use `context`.
        prompt = (
            "### Question:\n"
            f"{qn}\n\n"
            "### Context:\n"
            f"{ctx}\n\n"
            "### (Oracle Passages):\n"
            f"{oracle}\n\n"
            "### Instruction:\n"
            f"{instr}\n\n"
            "### Answer:\n"
        )
        # Append the gold answer plus EOS
        texts.append(prompt + ans + tokenizer.eos_token)
    return {"text": texts}

# then:
train_ds = train_ds.map(formatting_prompts_func, batched=True)
eval_ds = eval_ds.map(formatting_prompts_func, batched=True)

Map:   0%|          | 0/301 [00:00<?, ? examples/s]

Map:   0%|          | 0/34 [00:00<?, ? examples/s]

Let's take a look at what we did!

In [26]:
from IPython.display import display, Markdown

display(Markdown(pd.DataFrame(train_ds).head()['text'].iloc[0]))

### Question:
What is the mean value for the 'Data Science' variable in the LinkedIn matched sample?

### Context:
{'sentences': [['Table 1: Summary statistics pretreatment and outcome variables\nCoursera Internal Data LinkedIn Matched Sample\nVariable name Mean S.E. Mean S.E.\nTreatment 0.499 0.001 0.500 0.003\nPanel A: Pre-treatment covariates\nProfessional Experience Years – – 3.040 0.028\nPast Tech Job – – 0.127 0.002\nPast Managerial Job – – 0.064 0.001\nMain Skill Absolute 0.099 0.001 2.074 0.010\nMain Skill Standardized 0.000 <0.001 0.000 0.001\nComputer Science 0.252 0.001 0.230 0.002\nData Science 0.236 0.001 0.300 0.002\nInformation Technology 0.140 0.001 0.138 0.002\nGuided Project 0.168 0.001 0.097 0.002\nProfessional Certificate 0.005 <0.001 0.005 <0.001\nSpecialization 0.009 <0.001 0.009 0.001\nDeveloping Country 0.896 0.001 0.850 0.002\nAssociate Degree 0.029 <0.001 0.062 0.001\nBachelor Degree 0.127 0.001 0.367 0.003\nSome College 0.072 0.001 0.130 0.002\nDoctorate Degree 0.004 <0.001 0.012 0.001\nHigh School Diploma 0.059 0.001 0.097 0.002\nLess than High School 0.009 <0.001 0.012 0.001\nMasters Degree 0.050 0.001 0.146 0.002\nNo Education Mentioned 0.645 0.002 0.164 0.002\nProfessional Degree 0.004 <0.001 0.010 0.001\nMale 0.302 0.002 0.674 0.002\nGender Not Mentioned 0.533 0.002 0.101 0.002\nPanel B: Outcome variables\nNew Job – – 0.177 0.002\nNew Job in Scope – – 0.133 0.002\nCredential Shared – – 0.181 0.002\nAll Views 0.191 0.001 0.429 0.003\nAll Views by Others 0.143 0.001 0.318 0.002\nViews LinkedIn 0.165 0.001 0.409 0.003\nViews LinkedIn by Others 0.124 0.001 0.296 0.002\nNote: Professional Experience Years is the number of years between the starting date of the first job and August 2023. Past Tech Job\ntakes the value of 1 when the learner had a job title related to technology before randomization and zero otherwise. ', 'effects between the bottom and top tertiles, the difference is 0.1 p.p. (S.E. ', 'For each learner, Coursera assesses skill mastery and assigns a score (Red-\ndick, 2019). 
Additionally, we compute a max-mean standardization of the learners’ skill level. We also\nobserve the country where the learner registered for the course. Following the OECD classification,\nwe use this information to group countries into developing and developed. Finally, we also observe\nthe information provided by the learners in their registration survey. ']], 'title': [['placeholder_title', 'placeholder_title', 'placeholder_title']]}

### (Oracle Passages):
Table 1: Summary statistics pretreatment and outcome variables
Coursera Internal Data LinkedIn Matched Sample
Variable name Mean S.E. Mean S.E.
Treatment 0.499 0.001 0.500 0.003
Panel A: Pre-treatment covariates
Professional Experience Years – – 3.040 0.028
Past Tech Job – – 0.127 0.002
Past Managerial Job – – 0.064 0.001
Main Skill Absolute 0.099 0.001 2.074 0.010
Main Skill Standardized 0.000 <0.001 0.000 0.001
Computer Science 0.252 0.001 0.230 0.002
Data Science 0.236 0.001 0.300 0.002
Information Technology 0.140 0.001 0.138 0.002
Guided Project 0.168 0.001 0.097 0.002
Professional Certificate 0.005 <0.001 0.005 <0.001
Specialization 0.009 <0.001 0.009 0.001
Developing Country 0.896 0.001 0.850 0.002
Associate Degree 0.029 <0.001 0.062 0.001
Bachelor Degree 0.127 0.001 0.367 0.003
Some College 0.072 0.001 0.130 0.002
Doctorate Degree 0.004 <0.001 0.012 0.001
High School Diploma 0.059 0.001 0.097 0.002
Less than High School 0.009 <0.001 0.012 0.001
Masters Degree 0.050 0.001 0.146 0.002
No Education Mentioned 0.645 0.002 0.164 0.002
Professional Degree 0.004 <0.001 0.010 0.001
Male 0.302 0.002 0.674 0.002
Gender Not Mentioned 0.533 0.002 0.101 0.002
Panel B: Outcome variables
New Job – – 0.177 0.002
New Job in Scope – – 0.133 0.002
Credential Shared – – 0.181 0.002
All Views 0.191 0.001 0.429 0.003
All Views by Others 0.143 0.001 0.318 0.002
Views LinkedIn 0.165 0.001 0.409 0.003
Views LinkedIn by Others 0.124 0.001 0.296 0.002
Note: Professional Experience Years is the number of years between the starting date of the first job and August 2023. Past Tech Job
takes the value of 1 when the learner had a job title related to technology before randomization and zero otherwise. 

### Instruction:
<DOCUMENT>Table 1: Summary statistics pretreatment and outcome variables
Coursera Internal Data LinkedIn Matched Sample
Variable name Mean S.E. Mean S.E.
Treatment 0.499 0.001 0.500 0.003
Panel A: Pre-treatment covariates
Professional Experience Years – – 3.040 0.028
Past Tech Job – – 0.127 0.002
Past Managerial Job – – 0.064 0.001
Main Skill Absolute 0.099 0.001 2.074 0.010
Main Skill Standardized 0.000 <0.001 0.000 0.001
Computer Science 0.252 0.001 0.230 0.002
Data Science 0.236 0.001 0.300 0.002
Information Technology 0.140 0.001 0.138 0.002
Guided Project 0.168 0.001 0.097 0.002
Professional Certificate 0.005 <0.001 0.005 <0.001
Specialization 0.009 <0.001 0.009 0.001
Developing Country 0.896 0.001 0.850 0.002
Associate Degree 0.029 <0.001 0.062 0.001
Bachelor Degree 0.127 0.001 0.367 0.003
Some College 0.072 0.001 0.130 0.002
Doctorate Degree 0.004 <0.001 0.012 0.001
High School Diploma 0.059 0.001 0.097 0.002
Less than High School 0.009 <0.001 0.012 0.001
Masters Degree 0.050 0.001 0.146 0.002
No Education Mentioned 0.645 0.002 0.164 0.002
Professional Degree 0.004 <0.001 0.010 0.001
Male 0.302 0.002 0.674 0.002
Gender Not Mentioned 0.533 0.002 0.101 0.002
Panel B: Outcome variables
New Job – – 0.177 0.002
New Job in Scope – – 0.133 0.002
Credential Shared – – 0.181 0.002
All Views 0.191 0.001 0.429 0.003
All Views by Others 0.143 0.001 0.318 0.002
Views LinkedIn 0.165 0.001 0.409 0.003
Views LinkedIn by Others 0.124 0.001 0.296 0.002
Note: Professional Experience Years is the number of years between the starting date of the first job and August 2023. Past Tech Job
takes the value of 1 when the learner had a job title related to technology before randomization and zero otherwise. </DOCUMENT>
<DOCUMENT>effects between the bottom and top tertiles, the difference is 0.1 p.p. (S.E. </DOCUMENT>
<DOCUMENT>For each learner, Coursera assesses skill mastery and assigns a score (Red-
dick, 2019). Additionally, we compute a max-mean standardization of the learners’ skill level. We also
observe the country where the learner registered for the course. Following the OECD classification,
we use this information to group countries into developing and developed. Finally, we also observe
the information provided by the learners in their registration survey. </DOCUMENT>
What is the mean value for the 'Data Science' variable in the LinkedIn matched sample?

### Answer:
assistant: To answer the question about the mean value for the 'Data Science' variable in the LinkedIn matched sample, we need to follow these steps:

1. **Identify the Relevant Section**: The context provides a table with summary statistics for various variables. We need to locate the section that pertains to the LinkedIn matched sample.

2. **Locate the 'Data Science' Variable**: Within the table, find the row labeled 'Data Science'. This row will contain the mean value for the 'Data Science' variable in both the Coursera Internal Data and the LinkedIn Matched Sample.

3. **Extract the Mean Value for LinkedIn Matched Sample**: Once the 'Data Science' row is located, extract the mean value specifically for the LinkedIn matched sample.

Let's proceed with these steps:

- The context provides a table with two columns of mean values: one for Coursera Internal Data and another for LinkedIn Matched Sample.

- ##begin_quote## Data Science 0.236 0.001 0.300 0.002 ##end_quote##: This line from the context shows the mean values for the 'Data Science' variable. The first mean value (0.236) corresponds to the Coursera Internal Data, and the second mean value (0.300) corresponds to the LinkedIn Matched Sample.

Therefore, the mean value for the 'Data Science' variable in the LinkedIn matched sample is 0.300.

<ANSWER>: 0.300<|eot_id|>
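
Formatted examples like the one above are long, because the same documents appear in the context, the oracle passage and the instruction. With `max_seq_length = 2048`, it may be worth checking how many examples exceed the limit, since over-length sequences are typically truncated and could lose the answer at the end. The helper below is a hypothetical sketch: `count_over_limit` accepts any tokenizing function, and the whitespace split in the demo is only a stand-in for the real tokenizer.

```python
def count_over_limit(texts, tokenize, max_tokens):
    """Count how many texts produce more than max_tokens tokens under `tokenize`."""
    return sum(1 for text in texts if len(tokenize(text)) > max_tokens)

# Demo with a crude whitespace tokenizer; with the real tokenizer you might pass
# something like `lambda t: tokenizer(t)["input_ids"]` instead (an assumption).
samples = ["short prompt", "word " * 3000]
print(count_over_limit(samples, str.split, max_tokens=2048))
```

If many examples turn out to be too long, options include raising `max_seq_length` or dropping the redundant prompt sections.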

### Finally let's start training!

I've experimented a little and found that 5 training epochs were sufficient. But feel free to adjust the hyperparameters to anything you prefer!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama32_1bn_raft_v2", # Also the default Hugging Face Hub model id if you push from the Trainer
    report_to="wandb", # Set this to "none" if you don't want to use wandb
    run_name="RAFT_SFT_Take7",
    eval_steps=5,
    eval_strategy="steps",
    per_device_train_batch_size=1,    # small batches if quantized
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=5,
    # max_steps=60,                    # or set num_train_epochs
    save_strategy="no",
    bf16=True,
    fp16=False,
    gradient_checkpointing=True,
    logging_strategy="steps",
    logging_steps=5,
    seed=42,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_ds,
    eval_dataset = eval_ds, 
    args=training_args,
    dataset_text_field="text",
    
)

Unsloth: Tokenizing ["text"] (num_proc=4):   0%|          | 0/301 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=4):   0%|          | 0/34 [00:00<?, ? examples/s]
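
The `lr_scheduler_type="cosine"` setting decays the learning rate along half a cosine wave, from `learning_rate` down to zero over the run. The function below is a simplified sketch of that shape (with optional linear warmup, which this configuration does not use), not the exact scheduler code. The 185 total steps is the figure that the training log reports for this run.

```python
import math

def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Print the learning rate at a few points across the 185-step run.
for step in (0, 46, 92, 138, 185):
    print(step, f"{cosine_lr(step, 185, 2e-5):.2e}")
```

The rate falls slowly at first, fastest around the midpoint, and flattens out again near the end of training.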

Current memory statistics

In [29]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A10G. Max memory = 22.184 GB.
1.457 GB of memory reserved.


Train!

In [30]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 301 | Num Epochs = 5 | Total steps = 185
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 8 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192/1,000,000,000 (1.13% trained)
[34m[1mwandb[0m: Currently logged in as: [33mtituslhy[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss | Validation Loss |
|---|---|---|
| 5 | 1.493 | 1.633143 |
| 10 | 1.4666 | 1.617843 |
| 15 | 1.5463 | 1.596143 |
| 20 | 1.4859 | 1.571562 |
| 25 | 1.4498 | 1.546785 |
| 30 | 1.4265 | 1.521693 |
| 35 | 1.4468 | 1.497457 |
| 40 | 1.3767 | 1.474485 |
| 45 | 1.3344 | 1.454567 |
| 50 | 1.3655 | 1.434021 |

Unsloth: Not an error, but LlamaForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient
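
The numbers in the training banner follow from the configuration: the effective batch size is the per-device batch size times the gradient accumulation steps times the number of GPUs, and the number of optimizer steps follows from the dataset size. The sketch below reproduces that arithmetic. The floor division, which drops the partial accumulation window at the end of each epoch, is an assumption that happens to reproduce the 185 steps in the log.

```python
def training_steps(num_examples, per_device_batch, grad_accum, num_devices, epochs):
    """Effective batch size and optimizer steps, dropping the partial accumulation window."""
    effective_batch = per_device_batch * grad_accum * num_devices
    batches_per_epoch = -(-num_examples // (per_device_batch * num_devices))  # ceiling division
    steps_per_epoch = max(1, batches_per_epoch // grad_accum)
    return effective_batch, steps_per_epoch * epochs

print(training_steps(301, 1, 8, 1, 5))  # (8, 185)
```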


Compute used memory statistics

In [31]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

637.9309 seconds used for training.
10.63 minutes used for training.
Peak reserved memory = 2.156 GB.
Peak reserved memory for training = 0.699 GB.
Peak reserved memory % of max memory = 9.719 %.
Peak reserved memory for training % of max memory = 3.151 %.


## Save the model!

### Local save
It's important to save the merged model. Otherwise you'll only be saving the LoRA adapter weights, which makes it harder to deploy on Ollama or any other platform later (you'd need an extra step to pull the base model and apply the adapter).

In [32]:
model.save_pretrained_merged(
    save_directory = "llama32_1bn_raft_merged_v2",     # Local path to store merged model
    tokenizer = tokenizer,
    save_method = "merged_16bit",        # Other options include "merged_4bit" or "lora" (adapter weights only)
)

Unsloth: You have 2 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.81 out of 15.42 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 16/16 [00:00<00:00, 56.15it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving llama32_1bn_raft_merged_v2/pytorch_model.bin...
Done.
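
To see why a merged model is self-contained, recall that a LoRA adapter stores two small matrices, A and B, for each target layer, and merging folds their product into the base weight: W' = W + (alpha / r) · B A. The toy sketch below illustrates the arithmetic with plain Python lists. It is a conceptual illustration with made-up values, not what Unsloth does internally (which, per the log above, also converts the 4-bit base weights to 16-bit).

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the merged weight matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy 2x2 layer with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]    # rank x in_features
B = [[1.0], [2.0]]  # out_features x rank
print(merge_lora(W, A, B, alpha=1, rank=1))
```

In this notebook `lora_alpha` and `r` are both 16, so the scaling factor is 1 and the merged weight is simply W + B A.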


### Push to HuggingFace Hub!

In [33]:
model.push_to_hub_merged(
    repo_id="tituslhy/llama32_1bn_raft_non_traditional_credentials_v2",
    tokenizer=tokenizer,
    save_method="merged_16bit",
    token=os.environ["HUGGINGFACE_ACCESS_TOKEN"]
)

Unsloth: You are pushing to hub, but you passed your HF username = tituslhy.
We shall truncate tituslhy/llama32_1bn_raft_non_traditional_credentials_v2 to llama32_1bn_raft_non_traditional_credentials_v2


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.63 out of 15.42 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 16/16 [00:00<00:00, 103.81it/s]


Unsloth: Saving tokenizer...

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

 Done.
Unsloth: Saving llama32_1bn_raft_non_traditional_credentials_v2/pytorch_model.bin...


README.md:   0%|          | 0.00/613 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

Done.
Saved merged model to https://huggingface.co/tituslhy/llama32_1bn_raft_non_traditional_credentials_v2


### Push the GGUF files to HuggingFace Hub

You don't have to run this cell if your llama.cpp gguf execution (`model.push_to_hub_gguf`) works! 
> I had to because it could not find where the llama-quantize binary was.

In [34]:
# ① Point at the real binary in build/bin
real_q = os.path.expanduser("~/llama.cpp/build/bin/llama-quantize")
assert os.path.exists(real_q), f"{real_q} not found!"

# ② Make a local 'llama.cpp' folder in your notebook working directory
cwd = os.getcwd()
local_pack = os.path.join(cwd, "llama.cpp")
os.makedirs(local_pack, exist_ok=True)

# ③ Symlink it as 'llama-quantize' and also as 'quantize'
for name in ("llama-quantize", "quantize"):
    link = os.path.join(local_pack, name)
    if os.path.exists(link) or os.path.islink(link):
        os.remove(link)
    os.symlink(real_q, link)

# ④ Verify
print("Notebook sees:", os.listdir(local_pack))

Notebook sees: ['.github', 'CODEOWNERS', 'pyproject.toml', 'README.md', 'gguf-py', 'ggml', '.clang-tidy', '.pre-commit-config.yaml', 'examples', 'tests', 'convert_llama_ggml_to_gguf.py', 'cmake', '.gitignore', 'CMakeLists.txt', 'build-xcframework.sh', 'scripts', 'Makefile', 'pocs', 'pyrightconfig.json', 'poetry.lock', 'convert_hf_to_gguf_update.py', 'src', 'docs', 'convert_hf_to_gguf.py', 'mypy.ini', 'llama-quantize', 'CONTRIBUTING.md', 'models', '.git', '.dockerignore', 'AUTHORS', 'requirements.txt', 'licenses', '.clang-format', 'flake.nix', 'prompts', 'tools', '.ecrc', '.flake8', 'grammars', '.devops', 'media', '.editorconfig', 'SECURITY.md', 'LICENSE', 'include', 'requirements', 'flake.lock', 'CMakePresets.json', 'ci', 'build', 'common', '.gitmodules', 'convert_lora_to_gguf.py', 'quantize']
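
If you are not sure where your own build put the binary, a small lookup like the one below may help before you hard-code `real_q`. This is only a sketch: `find_quantize_binary` is a hypothetical helper, and the second candidate path is an assumption about where a different llama.cpp build might place the file.

```python
import os
import shutil


def find_quantize_binary(candidates):
    """Return the first candidate that is an executable file, else look on PATH."""
    for path in candidates:
        expanded = os.path.expanduser(path)
        if os.path.isfile(expanded) and os.access(expanded, os.X_OK):
            return expanded
    # Falls back to a PATH lookup; returns None if nothing is found.
    return shutil.which("llama-quantize")


CANDIDATES = [
    "~/llama.cpp/build/bin/llama-quantize",  # the location used in the cell above
    "~/llama.cpp/llama-quantize",            # assumption: binary in the repo root
]

real_q = find_quantize_binary(CANDIDATES)
print("llama-quantize found at:", real_q)
```

If this prints `None`, llama.cpp probably still needs to be built on your machine.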


In [35]:
model.push_to_hub_gguf(
    "tituslhy/llama32_1bn_raft_non_traditional_credentials_v2",  # Change this to your own username/repo!
    tokenizer,
    quantization_method=["q4_k_m", "q8_0", "q5_k_m"],
    token=os.environ["HUGGINGFACE_ACCESS_TOKEN"],  # Get a token at https://huggingface.co/settings/tokens
)

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 4.11 out of 15.42 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 16/16 [00:00<00:00, 102.50it/s]

Unsloth: Saving tokenizer...




 Done.
Unsloth: Saving tituslhy/llama32_1bn_raft_non_traditional_credentials_v2/pytorch_model.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m', 'q8_0', 'q5_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at tituslhy/llama32_1bn_raft_non_traditional_credentials_v2 into bf16 GGUF format.
The output location will be /home/ubuntu/ideal-palm-tree/notebooks/tituslhy/llama32_1bn_raft_non_traditional_credentials_v2/unsloth.BF16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: llama32_1bn_raft_non_traditional_credentials_v2
INFO:hf-to-gguf:Model architecture: LlamaForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float

unsloth.Q4_K_M.gguf:   0%|          | 0.00/808M [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/tituslhy/llama32_1bn_raft_non_traditional_credentials_v2
Unsloth: Uploading GGUF to Huggingface Hub...


unsloth.Q8_0.gguf:   0%|          | 0.00/1.32G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


Saved GGUF to https://huggingface.co/tituslhy/llama32_1bn_raft_non_traditional_credentials_v2
Unsloth: Uploading GGUF to Huggingface Hub...


unsloth.Q5_K_M.gguf:   0%|          | 0.00/912M [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


Saved GGUF to https://huggingface.co/tituslhy/llama32_1bn_raft_non_traditional_credentials_v2


# Evaluate finetuned LLM
It's now time to evaluate our finetuned LLM! If you're using Ollama like me, you can start by pulling the model straight from the HuggingFace Hub:
> !ollama pull hf.co/tituslhy/llama32_1bn_raft_non_traditional_credentials_v2:Q4_K_M

In [38]:
from llama_index.llms.ollama import Ollama

finetuned_llm = Ollama(
    model="hf.co/tituslhy/llama32_1bn_raft_non_traditional_credentials_v2:Q4_K_M"
)
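
The model reference above follows the pattern `hf.co/<repo_id>:<quantization tag>`. The sketch below builds that string for each quantization pushed earlier. `ollama_hf_reference` is a hypothetical helper, and it assumes the other two tags resolve the same way as the `Q4_K_M` one used in this notebook.

```python
def ollama_hf_reference(repo_id, quantization):
    """Build an Ollama model reference for a GGUF file hosted on the HuggingFace Hub."""
    return f"hf.co/{repo_id}:{quantization.upper()}"


REPO_ID = "tituslhy/llama32_1bn_raft_non_traditional_credentials_v2"

# Print one pull command per quantization method pushed in the GGUF cell.
for quant in ("q4_k_m", "q8_0", "q5_k_m"):
    print("ollama pull", ollama_hf_reference(REPO_ID, quant))
```

Swapping the tag is a cheap way to check whether a larger quantization changes the evaluation scores.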

Now we set up our query engine with our shiny new LLM.

In [40]:
query_engine_finetuned = index.as_query_engine(
    llm = finetuned_llm,
    similarity_top_k = 6,
)

Then we instantiate a `RagEvaluatorPack` using the same `qa_dataset` we generated earlier.

In [41]:
finetuned_rag_evaluator = RagEvaluatorPack(
    query_engine=query_engine_finetuned,
    rag_dataset=qa_dataset,
    judge_llm=Settings.llm,  # judge with the same LLM that generated the dataset
    embed_model=Settings.embed_model,
)

And run!
> This cell will take a while to run. It took me about 7.5 minutes!

In [42]:
benchmark_df_finetuned = finetuned_rag_evaluator.run()
benchmark_df_finetuned

2it [00:09,  4.70s/it]
2it [00:08,  4.30s/it]
2it [00:08,  4.37s/it]
2it [00:34, 17.36s/it]
2it [00:08,  4.11s/it]
2it [00:08,  4.38s/it]
2it [00:07,  3.83s/it]
2it [00:32, 16.26s/it]
2it [00:08,  4.36s/it]
2it [00:09,  4.66s/it]
2it [00:08,  4.36s/it]
2it [00:17,  8.79s/it]
2it [00:16,  8.11s/it]
2it [00:07,  3.94s/it]
2it [00:08,  4.21s/it]
2it [00:11,  5.83s/it]
2it [00:19,  9.68s/it]
2it [00:12,  6.07s/it]
2it [00:08,  4.06s/it]
2it [00:08,  4.22s/it]
2it [00:15,  7.69s/it]
2it [00:11,  5.93s/it]
2it [00:13,  6.78s/it]
2it [00:07,  4.00s/it]
2it [00:08,  4.07s/it]
2it [00:07,  3.79s/it]
2it [00:18,  9.36s/it]
2it [00:16,  8.26s/it]
2it [00:09,  4.54s/it]
2it [00:09,  4.56s/it]
2it [00:08,  4.19s/it]
2it [00:07,  3.91s/it]


| metrics | base_rag |
|---|---|
| mean_correctness_score | 2.835938 |
| mean_relevancy_score | 0.65625 |
| mean_faithfulness_score | 0.875 |
| mean_context_similarity_score | 0.95766 |


Hey, it worked! But it looks like finetuning did not improve the LLM by very much.
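
To put a number on "not very much", you could subtract the baseline scores from the finetuned ones metric by metric. The sketch below works on plain dictionaries. `score_deltas` is a hypothetical helper, and `benchmark_df` is an assumed name for the DataFrame from your baseline run.

```python
def score_deltas(baseline, finetuned):
    """Return {metric: finetuned - baseline} for metrics present in both runs."""
    return {
        metric: round(finetuned[metric] - baseline[metric], 6)
        for metric in finetuned
        if metric in baseline
    }


# Finetuned scores copied from the output above.
finetuned_scores = {
    "mean_correctness_score": 2.835938,
    "mean_relevancy_score": 0.65625,
    "mean_faithfulness_score": 0.875,
    "mean_context_similarity_score": 0.95766,
}

# With the real DataFrames, the dicts could be built like this (assuming the
# scores sit in the first column, as in the output above):
# baseline_scores = benchmark_df.iloc[:, 0].to_dict()  # hypothetical name
# print(score_deltas(baseline_scores, finetuned_scores))
```

A positive delta means the finetuned model scored higher on that metric, a negative one means it scored lower.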