<a href="https://colab.research.google.com/github/tazkera-haque-ds/Interactive-Dev-Environment-for-LLM-Development/blob/main/Open_Source_RAG_with_Gradient.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuned RAG with Gradient

In today's notebook, we'll work through an example of how you can use Gradient's services to fine-tune a model, host that model on their endpoints, and call it from LangChain!

We're going to be focusing on a relatively simplified example today:

Instruct-tuning Llama2-7b-chat.

We'll be using the following tools:

- [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/examples/finetuning/gradient/gradient_text2sql.html)
  - LlamaIndex will be helping us fine-tune our Gradient model with its convenient wrapper
  - It will also help us quickly and conveniently fine-tune our embedding model to our data.
- [LangChain](https://python.langchain.com/docs/integrations/llms/gradient)
  - We'll be using LangChain to power our simple RAG application!
- [Gradient](https://docs.gradient.ai/docs/introduction) - the star of the show today!
  - Gradient makes it easy to fine-tune your models, and then leverage those fine-tuned models! With per-token pricing, you can run your fine-tuned models on demand!


## Sign Up for Gradient

### Register

The first step to this process will be, of course, to sign up!

Head over to Gradient's [registration page](https://auth.gradient.ai/register) to get started!

### Create a Workspace

Now, you'll want to create a workspace. You can do so from your dashboard, found [here](https://auth.gradient.ai/).

![image](https://i.imgur.com/ZzDhNiP.png)

Take note of the Workspace ID!

### Add Billing to Your Workspace

You receive a number of credits by signing up to Gradient - but you'll still want to add billing! You can find that by clicking `More` in the workspace, and navigating to the `Billing` menu.

![image](https://i.imgur.com/XWMdPKk.png)

### Create Access Token

Now you can navigate to the `Access tokens` tab, and create an access token!

Simply click `Generate new access token` and then enter your password.

Be sure to store your access token somewhere safe!

![image](https://i.imgur.com/xMaFgeZ.png)

### Done! 🎉

Now you're ready to carry on with the demo! This demo should not exceed the credits you receive for signing up with Gradient!

## Instruct-tuning Llama-2-7b-chat

We'll be "instruct-tuning" [Llama-2-7b-chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) today.

In essence, we're going to try to make it better at following our provided instructions.

We'll be leveraging the [MosaicML Instruct-v3](https://huggingface.co/datasets/mosaicml/instruct-v3) dataset to do this, which combines a number of instruction-aligned datasets and is permissively licensed.

Let's start by grabbing our dependencies.

In [None]:
!pip install llama-index gradientai cohere langchain -qU

Now we can provide the access token and workspace ID we obtained earlier!

In [None]:
import getpass
import os

os.environ["GRADIENT_ACCESS_TOKEN"] = getpass.getpass("Gradient Access Token: ")

Gradient Access Token: ··········


In [None]:
os.environ["GRADIENT_WORKSPACE_ID"] = getpass.getpass("Gradient Workspace ID: ")

Gradient Workspace ID: ··········


In [None]:
!pip install datasets -qU

### Format Dataset

In order to properly instruct-tune our model - we'll want to convert our dataset into the expected format.

Gradient's fine-tuning system needs a `.jsonl` file where each row corresponds to a training example.

Each row should be a JSON object with a field called `inputs` which contains your fully formatted prompt.
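
As a minimal sketch of that file layout (the two example prompts and the variable names are made up for illustration), each line of a `.jsonl` file is one standalone JSON object carrying an `inputs` field:

```python
import json

# Two made-up training examples, already formatted as full prompts.
rows = [
    {"inputs": "<s>### Instruction:\nSay hello.\n\n### Response:\nHello!</s>"},
    {"inputs": "<s>### Instruction:\nCount to three.\n\n### Response:\n1, 2, 3</s>"},
]

# JSONL means one JSON object per line; json.dumps escapes the embedded newlines.
jsonl_text = "\n".join(json.dumps(row) for row in rows)

# Reading it back line by line recovers the original rows.
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
print(parsed[0]["inputs"])
```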

Let's walk through the steps of how we can do that!

#### Load HF Dataset

First things first, we need to load our `mosaicml/instruct-v3` dataset. It's a great collection of effective and safe tasks.

> NOTE: While we're using a safety-aligned dataset - there's no guarantee our model will be safe! Please be sure to consider additional safety measures if you're putting your model into production!

In [None]:
from datasets import load_dataset

instruct_tune_dataset = load_dataset("mosaicml/instruct-v3")

Let's take a peek at our dataset.

It's our job to merge these `prompt` and `response` columns into a single formatted prompt for instruct-tuning.

In [None]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 56167
    })
    test: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 6807
    })
})

#### Create Formatted Prompt

In the following function we'll be merging our `prompt` and `response` columns by creating the following template:

```
<s>### Instruction:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
{USER MESSAGE}

### Response:
{RESPONSE}</s>
```

> NOTE: This sequence was selected due to the [findings](https://gpt-index.readthedocs.io/en/stable/examples/finetuning/gradient/gradient_text2sql.html#map-training-dataset-dictionaries-to-prompts) of the LlamaIndex team.

In [None]:
def create_prompt(sample):
  bos_token = "<s>"
  system_message = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
  user_message = sample["prompt"].replace(system_message, "").replace("\n\n### Instruction\n", "").replace("\n### Response\n", "").strip()
  response = sample["response"]
  eos_token = "</s>"

  full_prompt = ""
  full_prompt += bos_token
  full_prompt += "### Instruction:"
  full_prompt += "\n" + system_message
  full_prompt += "\n" + user_message
  full_prompt += "\n\n### Response:"
  full_prompt += "\n" + response
  full_prompt += eos_token

  return {"inputs" : full_prompt}

Let's check and see how it works.

In [None]:
create_prompt(instruct_tune_dataset["train"][1])["inputs"]

'<s>### Instruction:\nBelow is an instruction that describes a task. Write a response that appropriately completes the request.\nWhat are different types of grass?\n\n### Response:\nThere are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.</s>'

That looks great!

#### Map to Dataset

Now we can map our formatting function across our dataset!

In [None]:
instruct_tune_dataset = instruct_tune_dataset.map(create_prompt)

Map:   0%|          | 0/56167 [00:00<?, ? examples/s]

Map:   0%|          | 0/6807 [00:00<?, ? examples/s]

In [None]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'source', 'inputs'],
        num_rows: 56167
    })
    test: Dataset({
        features: ['prompt', 'response', 'source', 'inputs'],
        num_rows: 6807
    })
})

In [None]:
instruct_tune_dataset["train"][1]["inputs"]

'<s>### Instruction:\nBelow is an instruction that describes a task. Write a response that appropriately completes the request.\nWhat are different types of grass?\n\n### Response:\nThere are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.</s>'

#### Filtering Dataset

Alright! We're just about done!

We're going to make a small change to the dataset based on the maximum allowed training context window - which is `2048` tokens.

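For this example, we'll naively drop every example whose formatted prompt is longer than 2,000 characters. Character count is only a rough stand-in for token count, but for typical English text it keeps us comfortably under the limit.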

In [None]:
pruned_dataset = instruct_tune_dataset.filter(lambda x: len(x["inputs"]) <= 2000)

Filter:   0%|          | 0/56167 [00:00<?, ? examples/s]

Filter:   0%|          | 0/6807 [00:00<?, ? examples/s]

In [None]:
pruned_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'source', 'inputs'],
        num_rows: 40736
    })
    test: Dataset({
        features: ['prompt', 'response', 'source', 'inputs'],
        num_rows: 5512
    })
})
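
A minimal sketch of that filter predicate on plain dictionaries, to make the character-based cutoff concrete. The rows are invented stand-ins for the mapped dataset, and `keep` is a name chosen for this example:

```python
# Made-up rows standing in for the mapped dataset.
rows = [
    {"inputs": "short prompt"},
    {"inputs": "x" * 2000},  # exactly at the cap: kept
    {"inputs": "x" * 2001},  # one character over: dropped
]

MAX_CHARS = 2000  # character cap used in the notebook's filter


def keep(example):
    # len() on a str counts characters, not tokens.
    return len(example["inputs"]) <= MAX_CHARS


pruned = [row for row in rows if keep(row)]
print(len(pruned))
```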

#### Saving to JSONL

We can take advantage of the `datasets` library's `to_json` to export our dataset in the desired format.

In [None]:
for split, dataset in pruned_dataset.items():
  dataset.to_json(f"instruct_tune_{split}.jsonl")

Creating json from Arrow format:   0%|          | 0/41 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/6 [00:00<?, ?ba/s]
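
Before training, it can be worth sanity-checking the exported file. The helper below is a hypothetical sketch (`count_valid_rows` is not part of any library) that confirms every line parses as JSON and carries a string `inputs` field. It is demonstrated on a small temporary file rather than the real export:

```python
import json
import os
import tempfile


def count_valid_rows(path):
    """Return the number of rows; raise ValueError on a malformed row."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            row = json.loads(line)  # raises a ValueError subclass on bad JSON
            if not isinstance(row, dict) or not isinstance(row.get("inputs"), str):
                raise ValueError(f"line {line_number} has no string 'inputs' field")
            count += 1
    return count


# Demonstration on a throwaway file with two made-up rows.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"inputs": "first example"}) + "\n")
        f.write(json.dumps({"inputs": "second example"}) + "\n")
    n_rows = count_valid_rows(demo_path)

print(n_rows)
```

On the real export, the call would be `count_valid_rows("instruct_tune_train.jsonl")`.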

### Instruct-tuning!

Now we're ready to start the training!

Let's walk through what's happening - we're going to be leveraging LlamaIndex's convenient wrappers to make this already simple process even simpler.

#### Initializing a Base Model

For our base model, we'll be using `llama2-7b-chat`.

You can check out the docs [here](https://gpt-index.readthedocs.io/en/latest/api_reference/llms/gradient_base_model.html) if you want to dig a little deeper into LlamaIndex's `GradientBaseModelLLM`.

We could stop right here - and use this as our LLM - but we're going to carry on and fine-tune a model instead!

In [None]:
from llama_index.llms import GradientBaseModelLLM

base_model_slug = "llama2-7b-chat"
base_llm = GradientBaseModelLLM(
    base_model_slug=base_model_slug, max_tokens=300
)

#### Initializing Our Fine-tune Engine

Once again, LlamaIndex has built a convenient wrapper that we can use to set up our fine-tuning job on Gradient!

Check out the docs [here](https://gpt-index.readthedocs.io/en/v0.8.58/api_reference/finetuning.html#llama_index.finetuning.GradientFinetuneEngine), though they're still being worked on.

Let's take a peek at some of the parameters and see what they do for us:

- `base_model_slug` - this is a reference to the model `Slug ID`, you can find those IDs [here](https://docs.gradient.ai/docs/models-1#%EF%B8%8F-gradient-hosted-llms) in the "Model IDs for reference in the API and CLI" table.
- `name` - this is the name given to your fine-tuned model
- `data_path` - this will point to the formatted `jsonl` file and be used by the `GradientFinetuneEngine` to pull training examples from.
- `verbose` - lets us know what's going on!
- `max_steps` - the number of steps the model will be fine-tuned on
- `batch_size` - the number of examples used to train at a time

The basic idea is that we will repeatedly fine-tune the model - bit by bit - as we work through our `max_steps`.
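
The training log further down advances four steps at a time, which suggests that examples are grouped into batches of `batch_size` until `max_steps` examples have been consumed. The following is a rough sketch of such a loop under that assumption, not LlamaIndex's actual implementation; `run_finetune` and `send_batch` are hypothetical names, with `send_batch` standing in for the call that submits samples to Gradient.

```python
def run_finetune(examples, batch_size, max_steps, send_batch):
    """Feed up to max_steps examples to send_batch in groups of batch_size."""
    batch = []
    for step, example in enumerate(examples):
        if step >= max_steps:
            break
        batch.append(example)
        if len(batch) == batch_size:
            send_batch(batch)
            batch = []
    if batch:  # flush a final, smaller batch
        send_batch(batch)


# Record batch sizes instead of calling a real service.
sent = []
run_finetune(
    examples=[{"inputs": f"example {i}"} for i in range(250)],
    batch_size=4,
    max_steps=100,
    send_batch=lambda batch: sent.append(len(batch)),
)
print(len(sent), sum(sent))
```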

Check out some awesome [tips and tricks](https://docs.gradient.ai/docs/tips-and-tricks) provided by the Gradient team if you want to dive deeper into what exactly we can do with our fine-tuning!

In [None]:
from llama_index.finetuning.gradient.base import GradientFinetuneEngine

finetune_engine = GradientFinetuneEngine(
    base_model_slug=base_model_slug,
    name="instruct_tune",
    data_path="/content/instruct_tune_train.jsonl",
    verbose=True,
    max_steps=100,
    batch_size=4,
)

Now we can grab our `model_adapter_id` from our `finetune_engine`.

This will be useful when we need to address our specific fine-tuned model in the future.

> NOTE: If you're following closely - you'll notice that this has `adapter` in it. That's right - Gradient is using everyone's favourite LoRA to fine-tune!

In [None]:
finetune_engine.model_adapter_id

'4d838eac-d40f-4cbc-8ca1-58a397a1de84_model_adapter'

#### Instruct-tuning Llama 2 7B Chat

Finally, here we go!

We're now ready to call our `finetune()` method on our `finetune_engine` to start sending examples to fine-tune our Gradient model!

In [None]:
epochs = 1
for i in range(epochs):
    print(f"** EPOCH {i} **")
    finetune_engine.finetune()

** EPOCH 0 **
fine-tuning step 4: loss=1930.9631, trainable tokens=963
fine-tuning step 8: loss=1117.605, trainable tokens=640
fine-tuning step 12: loss=1046.6661, trainable tokens=638
fine-tuning step 16: loss=1399.2902, trainable tokens=900
fine-tuning step 20: loss=1534.9531, trainable tokens=905
fine-tuning step 24: loss=601.874, trainable tokens=414
fine-tuning step 28: loss=1762.9719, trainable tokens=1311
fine-tuning step 32: loss=1151.291, trainable tokens=1085
fine-tuning step 36: loss=1525.5577, trainable tokens=885
fine-tuning step 40: loss=765.99927, trainable tokens=597
fine-tuning step 44: loss=1641.0099, trainable tokens=1321
fine-tuning step 48: loss=991.6924, trainable tokens=610
fine-tuning step 52: loss=840.37494, trainable tokens=596
fine-tuning step 56: loss=1528.7517, trainable tokens=923
fine-tuning step 60: loss=1034.7272, trainable tokens=866
fine-tuning step 64: loss=826.2357, trainable tokens=563
fine-tuning step 68: loss=1207.7329, trainable tokens=952
fine-

## Hosting An Embedding Model with Gradient

It's never been easier to get rockin' with a hosted embedding model.

All we need to do is provide our access token and workspace ID (which should already be in your environment from before) and select the BGE embedding model. It's currently the only supported embedding model, though more are on the way!

In [None]:
from getpass import getpass
import os

if not os.environ.get("GRADIENT_ACCESS_TOKEN", None):
    os.environ["GRADIENT_ACCESS_TOKEN"] = getpass("gradient.ai access token:")
if not os.environ.get("GRADIENT_WORKSPACE_ID", None):
    os.environ["GRADIENT_WORKSPACE_ID"] = getpass("gradient.ai workspace id:")

In [None]:
from langchain.embeddings import GradientEmbeddings

embeddings = GradientEmbeddings(model="bge-large")

Let's try it out!

In [None]:
len(embeddings.embed_query("Hello, is it me you're looking for?"))

1024

## Creating a RAG Pipeline Powered by Gradient and LangChain

Now we can build our RAG system with LangChain!

First thing we'll do, however, is make sure our model works!



Let's create our Gradient client and list our models using their [Python SDK](https://docs.gradient.ai/docs/sdk-quickstart).

We can use this to find our fine-tuned model!

In [None]:
import gradientai

client = gradientai.Gradient()

models = client.list_models(only_base=False)
for model in models:
  if "adapter" in model.id:
    print(model.id, model.name)

f90b83dd-7448-42c4-afe8-46ca257e3221_model_adapter instruct_tune
a9ddb8f3-3665-4a6c-80f1-343781e0c4bb_model_adapter instruct_tune
9fd56e25-9e97-41fd-bfb6-d39f89af2a37_model_adapter instruct_tune
4d838eac-d40f-4cbc-8ca1-58a397a1de84_model_adapter instruct_tune


Now we can load our `GradientLLM` - it's really that easy!

We can pass in additional parameters like how many tokens to generate, and more!

Check it out [here](https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/llms/gradient_ai.py)!

In [None]:
from langchain.llms import GradientLLM

llm = GradientLLM(
    model=models[-1].id,
    model_kwargs=dict(max_generated_token_count=128),
)

We're just going to reproduce our training template here - and see what this model can do!

In [None]:
from langchain.prompts import PromptTemplate

template = """\
### Instruction:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
{input}

### Response:
"""

prompt = PromptTemplate(template=template, input_variables=["input"])
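
To see what the model actually receives, here is a plain-Python sketch that fills the same template text with `str.format`, which is roughly what a `PromptTemplate` does with its default f-string style placeholders. The question is just an example.

```python
template = (
    "### Instruction:\n"
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "{input}\n"
    "\n"
    "### Response:\n"
)

# Substitute the placeholder to get the final prompt string.
rendered = template.format(input="What is the opposite of Gradient Descent?")
print(rendered)
```

The rendered prompt ends right after `### Response:`, so the model's continuation is the answer itself.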

We'll create a simple `LLMChain` that chains our prompt into our LLM.

In [None]:
from langchain.chains import LLMChain

llm_chain = LLMChain(prompt=prompt, llm=llm)

Let's ask a simple question - and see how it fares.

In [None]:
# Named `user_input` to avoid shadowing Python's built-in `input` function.
user_input = "What is the opposite of Gradient Descent?"

llm_chain.run(input=user_input)

'The opposite of Gradient Descent is called Stochastic Gradient Descent.'

Not a very satisfying answer - it's clear our model requires additional context to get this right.

Let's build a simple RAG prompt and see how it does.

In [None]:
template = """\
### Instruction:
Below is an instruction that describes a task. Write a response that appropriately completes the request.

Based on the provided context, please answer the provided question. You can only use the provided context to answer the question.
If you do not know the answer - please respond with "I don't know".

Context:
{context}

Question:
{question}

### Response:
"""

rag_prompt = PromptTemplate(template=template, input_variables=["context", "question"])

We'll create our LLM chain using LangChain's [LCEL](https://python.langchain.com/docs/expression_language/) this time - which is a wonderful way to build chains!

In [None]:
llm_chain = rag_prompt | llm

Now we can ask questions and have the answers grounded in our context.

In [None]:
question = "What is the opposite of Gradient Descent?"
context = "In mathematics, gradient descent (also often called steepest descent) is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent. Conversely, stepping in the direction of the gradient will lead to a local maximum of that function; the procedure is then known as gradient ascent. It is particularly useful in machine learning for minimizing the cost or loss function.[1] Gradient descent should not be confused with local search algorithms, although both are iterative methods for optimization."

llm_chain.invoke({"question" :question, "context" : context})

'Steepest Ascent'

Let's ask a question about something that it should not know the answer to.

In [None]:
question = "What is the maximum airspeed velocity of an unladen swallow?"
context = "In mathematics, gradient descent (also often called steepest descent) is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent. Conversely, stepping in the direction of the gradient will lead to a local maximum of that function; the procedure is then known as gradient ascent. It is particularly useful in machine learning for minimizing the cost or loss function.[1] Gradient descent should not be confused with local search algorithms, although both are iterative methods for optimization."

llm_chain.invoke({"question" :question, "context" : context})

"I don't know."

Great!

### Creating a RAG Chain in LangChain

Let's do the thing!

The first thing we'll do is grab some documents from arXiv to use as our index!

The second thing we'll do is build a retrieval pipeline with FAISS and our Gradient-hosted embeddings model.

In [None]:
!pip install faiss-cpu arxiv pymupdf -qU

Collecting faiss-cpu
  Downloading faiss_cpu-1.7.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.6 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.7.4


We'll load, and then split, the 5 papers most relevant to the query "Gradient Descent".

In [None]:
from langchain.document_loaders import ArxivLoader

docs = ArxivLoader(query="Gradient Descent", load_max_docs=5).load()

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1250,
    chunk_overlap = 100,
    length_function = len,
    is_separator_regex = False
)

In [None]:
split_docs = text_splitter.split_documents(docs)
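
For intuition about `chunk_size` and `chunk_overlap`, here is a deliberately simplified sketch that cuts text into fixed-size character windows. The real `RecursiveCharacterTextSplitter` is smarter: it prefers to break on separators such as paragraph and line breaks, so its chunks will differ. The function name `sliding_chunks` is made up for this example.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into chunk_size-character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = sliding_chunks("a" * 3000, chunk_size=1250, chunk_overlap=100)
print([len(chunk) for chunk in chunks])
```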

Now we can create our FAISS vectorstore from those split documents.

The Gradient embeddings API can currently only handle 100 items at a time, so we'll batch our embeddings to accommodate that.

In [None]:
len(split_docs)

227

In [None]:
from langchain.vectorstores import FAISS

vectorstore = FAISS.from_documents(split_docs[:100], embedding=embeddings)
vectorstore.add_documents(split_docs[100:200])
vectorstore.add_documents(split_docs[200:])

print("Completed")

Completed
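
The three hard-coded slices work for 227 chunks, but a loop generalizes to any number of documents. A minimal sketch, assuming the 100-item limit mentioned above; `batched` is a small helper written for this example, and the integer list stands in for `split_docs`.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


fake_docs = list(range(227))  # stand-in for the 227 split documents
batch_sizes = [len(batch) for batch in batched(fake_docs, 100)]
print(batch_sizes)
```

With the real objects, the first batch would go to `FAISS.from_documents` and each later batch to `vectorstore.add_documents`.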


Now we can create our retriever!

In [None]:
retriever = vectorstore.as_retriever()
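
Conceptually, a vector-store retriever embeds the query and returns the stored chunks whose embeddings lie closest to it. The sketch below illustrates that idea with tiny hand-written 2-dimensional vectors and Euclidean distance. It is an illustration only, not how FAISS is implemented, and all names and vectors are invented.

```python
import math


def top_k(query_vector, indexed, k):
    """Return the k texts whose vectors are nearest to query_vector."""
    ranked = sorted(indexed, key=lambda item: math.dist(query_vector, item[1]))
    return [text for text, _ in ranked[:k]]


# (text, embedding) pairs with made-up 2-d embeddings.
indexed = [
    ("gradient descent basics", (0.9, 0.1)),
    ("stochastic gradient descent", (0.8, 0.3)),
    ("swallow migration", (0.0, 1.0)),
]

nearest = top_k((1.0, 0.0), indexed, k=2)
print(nearest)
```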

All that's left now is to create our RAG chain!

In [None]:
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough

rag_chain = (
    {
        "context" : retriever, "question" : RunnablePassthrough()
    }
    | rag_prompt
    | llm
    | StrOutputParser()
)
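
Read top to bottom, the chain sends the question to the retriever to build `context`, passes the question through unchanged, fills the prompt, calls the model, and returns a string. The following plain-Python sketch mirrors that data flow with stand-in functions; `fake_retrieve`, `fake_llm`, and `rag_answer` are placeholders invented here, not LangChain or Gradient APIs.

```python
RAG_TEMPLATE = "Context:\n{context}\n\nQuestion:\n{question}\n\n### Response:\n"


def fake_retrieve(question):
    # Placeholder retriever: always returns the same made-up chunk.
    return ["Gradient descent takes repeated steps against the gradient."]


def fake_llm(prompt):
    # Placeholder model: reports the prompt length instead of generating text.
    return f"(model saw {len(prompt)} characters)"


def rag_answer(question):
    context = "\n\n".join(fake_retrieve(question))  # retrieval step
    prompt = RAG_TEMPLATE.format(context=context, question=question)  # prompt step
    return str(fake_llm(prompt))  # generation step, then string output


answer = rag_answer("What is Gradient Descent?")
print(answer)
```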

In [None]:
rag_chain.invoke("What is Gradient Descent?")

'Gradient descent is a method for minimizing a function by iteratively adjusting the parameters of a function to reduce the value of the function.'

In [None]:
rag_chain.invoke("Is it mandatory to learn gradient descent in detail to build large language model applications?")

'No, it is not mandatory to learn gradient descent in detail to build large language model applications.'

In [None]:
rag_chain.invoke("What do I need to learn about gradient descent to build large language model applications?")

'You need to learn about gradient descent, its applications, and its limitations.'