# RAG with Reasoning Models

Author: [Zain Hasan](https://x.com/zainhasan6)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/RAG_with_Reasoning_Models.ipynb)

### Introduction

In this notebook we will use a reasoning model to answer questions grounded in a set of retrieved context chunks.

The interesting part is that the reasoning model will not only answer the question but also provide a justification for the answer.

Additionally, if no relevant context is found, the model will say that it does not know the answer and explain why in its thinking tokens!

The notebook is structured as follows:

- Install the relevant libraries
- Download the full document for SB1047  
- Process and chunk the document
- Embed and index the chunks
- Retrieve the top 5 chunks
- Call the generative model with the retrieved chunks and the query

### Install Relevant Libraries

In [1]:
!pip install -qU together beautifulsoup4 numpy requests

In [2]:
import os
from together import Together

# Paste in your Together AI API key, or load it from the TOGETHER_API_KEY environment variable

client = Together(api_key = os.environ.get("TOGETHER_API_KEY"))

### Download the Full Document for SB1047

In [3]:
import requests
from bs4 import BeautifulSoup

def get_legiscan_text(url):
    """
    Fetches and returns the text content from a given LegiScan URL.
    Args:
        url (str): The URL of the LegiScan page to fetch.
    Returns:
        str: The text content of the page.
    Raises:
        requests.exceptions.RequestException: If there is an issue with the HTTP request.
    """
    # Basic headers to mimic a browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }

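    # Make the request and raise on HTTP errors (4xx/5xx), as the docstring promises
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()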

    # Parse HTML
    soup = BeautifulSoup(response.text, 'html.parser')

    # Get text content
    content = soup.get_text()

    return content

url = "https://legiscan.com/CA/text/SB1047/id/2999979/California-2023-SB1047-Amended.html"
text = get_legiscan_text(url)
print(text[:1000])

California-2023-SB1047-Amended
                Amended
               IN 
                Senate
               May 16, 2024
                Amended
               IN 
                Senate
               April 30, 2024
                Amended
               IN 
                Senate
               April 16, 2024
                Amended
               IN 
                Senate
               April 08, 2024
                Amended
               IN 
                Senate
               March 20, 2024
                    CALIFORNIA LEGISLATURE—
                    2023–2024 REGULAR SESSION
                Senate Bill
              No. 1047Introduced by Senator Wiener(Coauthors: Senators Roth, Rubio, and Stern)February 07, 2024An act to add Chapter 22.6 (commencing with Section 22602) to Division 8 of the Business and Professions Code, and to add Sections 11547.6 and 11547.7 to the Government Code, relating to artificial intelligence.LEGISLATIVE COUNSEL'S DIGESTSB 1047, as amended, Wi
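The extracted text above carries a lot of layout whitespace from the HTML, and every run of spaces counts against a fixed-size character chunk. One optional cleanup, which is not part of the original notebook, is to collapse whitespace before chunking. The helper name `normalize_whitespace` is hypothetical, and applying it would shift the chunk boundaries, so the outputs shown in later cells would no longer match exactly.

```python
import re

def normalize_whitespace(raw_text: str) -> str:
    """Collapse every run of whitespace into a single space (hypothetical helper)."""
    # In Python 3 str patterns, \s also matches non-breaking spaces (\xa0)
    return re.sub(r"\s+", " ", raw_text).strip()

# Small sample shaped like the page header printed above
sample = "Amended\n               IN \n                Senate\n               May\xa016,\xa02024"
print(normalize_whitespace(sample))
```

If you try this, call it on `text` before passing `text` to the chunking step.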

### 1. Data Processing and Chunking

We will run RAG over California's [**SB1047**](https://legiscan.com/CA/text/SB1047/id/2999979/California-2023-SB1047-Amended.html) bill.

In [4]:
# Naive fixed-size character chunking with overlap is good enough for this demo

def create_chunks(document, chunk_size=300, overlap=50):
    return [document[i : i + chunk_size] for i in range(0, len(document), chunk_size - overlap)]
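To make the overlap behavior concrete, here is a small sketch that runs the same function on a toy string (the alphabet string and the small `chunk_size`/`overlap` values are illustrative only). Each chunk starts `chunk_size - overlap` characters after the previous one, so consecutive chunks share `overlap` characters.

```python
def create_chunks(document, chunk_size=300, overlap=50):
    # Same logic as the cell above, repeated so this example runs on its own
    return [document[i : i + chunk_size] for i in range(0, len(document), chunk_size - overlap)]

toy = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = create_chunks(toy, chunk_size=10, overlap=3)

for c in toy_chunks:
    print(repr(c))
# 'abcdefghij'
# 'hijklmnopq'
# 'opqrstuvwx'
# 'vwxyz'
```

Note that `overlap` must be smaller than `chunk_size`: if the two are equal the step becomes 0 and `range` raises a `ValueError`, and if `overlap` is larger the step is negative and the result is an empty list.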

In [5]:
chunks = create_chunks(text, chunk_size=350, overlap=40)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}: {chunk}")

Chunk 1: California-2023-SB1047-Amended
                Amended
               IN 
                Senate
               May 16, 2024
                Amended
               IN 
                Senate
               April 30, 2024
                Amended
               IN 
                Senate
               April 16, 2024
                Amended
         
Chunk 2: , 2024
                Amended
               IN 
                Senate
               April 08, 2024
                Amended
               IN 
                Senate
               March 20, 2024
                    CALIFORNIA LEGISLATURE—
                    2023–2024 REGULAR SESSION
                Senate Bill
              No. 1047Introduced 
Chunk 3: e Bill
              No. 1047Introduced by Senator Wiener(Coauthors: Senators Roth, Rubio, and Stern)February 07, 2024An act to add Chapter 22.6 (commencing with Section 22602) to Division 8 of the Business and Professions Code, and to add Sections 11547.6 and 11547.7 to

### 2. Indexing and Embedding Generation

We will now use `bge-large-en-v1.5` to embed the chunks above and store the embeddings as a simple in-memory vector index.

In [6]:
from typing import List

import numpy as np

def generate_embeddings(input_texts: List[str], model_api_string: str) -> np.ndarray:
    """Generate embeddings with the Together Python library.

    Args:
        input_texts: a list of input strings.
        model_api_string: the API string of the embedding model to use.

    Returns:
        A NumPy array with one row per input text; row i is the embedding of input_texts[i].
    """
    outputs = client.embeddings.create(
        input=input_texts,
        model=model_api_string,
    )
    return np.array([x.embedding for x in outputs.data])

In [7]:
embeddings = generate_embeddings(list(chunks), "BAAI/bge-large-en-v1.5")

In [8]:
# Each vector is 1024 dimensional

len(embeddings[0])

1024

In [9]:
# Generate the vector embeddings for the query
query = "what is the maximum allowable floating point operation per second this bill allows?"

query_embedding = generate_embeddings([query], 'BAAI/bge-large-en-v1.5')[0]

In [10]:
# Calculate cosine similarity between the query embedding and each chunk embedding

dot_product = np.dot(query_embedding, np.array(embeddings).T)
query_norm = np.linalg.norm(query_embedding)
embeddings_norm = np.linalg.norm(embeddings, axis=1)
similarity_scores = dot_product / (query_norm * embeddings_norm)
indices = np.argsort(-similarity_scores)

In [11]:
top_5_indices = indices[:5]
top_5_indices

array([ 30,   2,  34, 135,  33])
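As a sanity check on the cosine similarity formula used above, here is a small self-contained sketch on hand-made 2-D vectors. The toy vectors are illustrative values, not real embeddings. It also shows an equivalent formulation that normalizes every vector to unit length first, after which a plain dot product gives the cosine.

```python
import numpy as np

# Toy "index" of three 2-D vectors and one query (illustrative values only)
toy_index = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
toy_query = np.array([2.0, 0.0])

# Same formula as the cell above: dot product divided by the product of the norms
scores = toy_index @ toy_query / (np.linalg.norm(toy_query) * np.linalg.norm(toy_index, axis=1))

# Equivalent: scale everything to unit length, then take a plain dot product
unit_index = toy_index / np.linalg.norm(toy_index, axis=1, keepdims=True)
unit_query = toy_query / np.linalg.norm(toy_query)
scores_unit = unit_index @ unit_query

ranking = np.argsort(-scores)  # indices sorted from most to least similar
print(scores.round(3), ranking)
```

The first vector points in the same direction as the query and scores 1.0 despite being shorter, the second is orthogonal and scores 0.0, and the third sits at 45 degrees and scores about 0.707. Cosine similarity ignores vector length, so normalizing the index once up front avoids recomputing the norms for every query.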

In [12]:
top_5_chunks = [chunks[index] for index in top_5_indices]

top_5_chunks

[' duty exemption.(4)\xa0Unauthorized use of the hazardous capability of a covered model.(d)\xa0“Computing cluster” means a set of machines transitively connected by data center networking of over 100 gigabits per second that has a theoretical maximum computing capacity of at least 10^20 integer or floating-point operations per second and can be used for t',
 "e Bill\n              No. 1047Introduced by Senator Wiener(Coauthors: Senators Roth, Rubio, and Stern)February\xa007,\xa02024An act to add Chapter 22.6 (commencing with Section 22602) to Division 8 of the Business and Professions Code, and to add Sections 11547.6 and 11547.7 to the Government Code, relating to artificial intelligence.LEGISLATIVE COUNSEL'S",
 'ficial intelligence model was trained using a quantity of computing power sufficiently large that it could reasonably be expected to have similar or greater performance as an artificial intelligence model trained using a quantity of computing power greater than 10^26 integer

In [13]:
# compact function to retrieve the top k chunks

def vector_retrieval(query: str, top_k: int = 5, vector_index: np.ndarray = None, chunks: List[str] = None) -> List[str]:
    """
    Retrieve the top-k chunks most similar to a query.
    Args:
        query (str): The query string to search for.
        top_k (int, optional): The number of most similar chunks to retrieve. Defaults to 5.
        vector_index (np.ndarray, optional): The array of chunk embeddings to search against, one row per chunk. Defaults to None.
        chunks (List[str], optional): The chunk texts, in the same order as the rows of vector_index. Defaults to None.
    Returns:
        List[str]: The top-k chunks, ordered from most to least similar to the query.
    """

    query_embedding = generate_embeddings([query], 'BAAI/bge-large-en-v1.5')[0]
    
    
    dot_product = np.dot(query_embedding, np.array(vector_index).T)
    query_norm = np.linalg.norm(query_embedding)
    vector_index_norm = np.linalg.norm(vector_index, axis=1)
    
    similarity_scores = dot_product / (query_norm * vector_index_norm)

    return [chunks[index] for index in np.argsort(-similarity_scores)[:top_k]]

In [14]:
vector_retrieval(query = "what is the maximum allowable floating point operation per second this bill allows?", top_k = 5, vector_index = embeddings, chunks = chunks)

[' duty exemption.(4)\xa0Unauthorized use of the hazardous capability of a covered model.(d)\xa0“Computing cluster” means a set of machines transitively connected by data center networking of over 100 gigabits per second that has a theoretical maximum computing capacity of at least 10^20 integer or floating-point operations per second and can be used for t',
 "e Bill\n              No. 1047Introduced by Senator Wiener(Coauthors: Senators Roth, Rubio, and Stern)February\xa007,\xa02024An act to add Chapter 22.6 (commencing with Section 22602) to Division 8 of the Business and Professions Code, and to add Sections 11547.6 and 11547.7 to the Government Code, relating to artificial intelligence.LEGISLATIVE COUNSEL'S",
 'ficial intelligence model was trained using a quantity of computing power sufficiently large that it could reasonably be expected to have similar or greater performance as an artificial intelligence model trained using a quantity of computing power greater than 10^26 integer

We now have a way to retrieve from the vector index given a query.

In [15]:
retrieved_chunks = vector_retrieval(query = "what is the maximum allowable floating point operation per second this bill allows?", top_k = 5, vector_index = embeddings, chunks = chunks)

In [16]:
# Let's join the top 5 chunks into a single context string

formatted_chunks = ''

for i, chunk in enumerate(retrieved_chunks):
    formatted_chunks += f"Context {i+1}: {chunk}\n"

print(formatted_chunks)

Context 1:  duty exemption.(4) Unauthorized use of the hazardous capability of a covered model.(d) “Computing cluster” means a set of machines transitively connected by data center networking of over 100 gigabits per second that has a theoretical maximum computing capacity of at least 10^20 integer or floating-point operations per second and can be used for t
Context 2: e Bill
              No. 1047Introduced by Senator Wiener(Coauthors: Senators Roth, Rubio, and Stern)February 07, 2024An act to add Chapter 22.6 (commencing with Section 22602) to Division 8 of the Business and Professions Code, and to add Sections 11547.6 and 11547.7 to the Government Code, relating to artificial intelligence.LEGISLATIVE COUNSEL'S
Context 3: ficial intelligence model was trained using a quantity of computing power sufficiently large that it could reasonably be expected to have similar or greater performance as an artificial intelligence model trained using a quantity of computing power greater than 10^

### 3. Call Reasoning Generative Model - DeepSeek R1

We will pass the 5 retrieved chunks to an LLM to get our final answer.

In [17]:
query = "What is the maximum allowable floating point operation per second this bill allows for model training?"


PROMPT = """
Answer the question: {query}. 
IMPORTANT RULE: Use the information provided to answer the question. For each claim in the answer provide a source from the information provided. 
Here is relevant information: {formatted_chunks} 
"""


stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[
        {"role": "system", "content": "You are a helpful chatbot."},
        {"role": "user", "content": PROMPT.format(query=query, formatted_chunks=formatted_chunks)},
    ],
    stream=True,
)

response = ''

# Accumulate the streamed tokens while printing them as they arrive
for chunk in stream:
    token = chunk.choices[0].delta.content or ""
    response += token
    print(token, end="", flush=True)

<think>
Okay, let's tackle this question. The user is asking for the maximum allowable floating point operations per second (FLOPs) allowed for model training under the bill mentioned. The answer needs to be based on the provided contexts, and each claim must have a source from those contexts.

First, I'll go through each context to find relevant mentions of FLOPs. 

Context 1 talks about a "computing cluster" defined as having a theoretical maximum capacity of at least 10^20 integer or floating-point operations per second. But this seems to be about the cluster's capacity, not the training limit. 

Context 2 is the bill introduction, no numbers here. 

Context 3 mentions training with computing power greater than 10^26 FLOPs in 2024. This looks like a threshold for determining if a model is covered. 

Context 4 discusses CalCompute, but no FLOPs number. 

Context 5 defines a "covered model" under two criteria. The first is training with more than 10^26 FLOPs. The second is similar per
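The streamed response above starts with a `<think>` block that holds the model's reasoning, followed by the final answer. If you want to display or log the two parts separately, a minimal sketch could look like the following. The helper name `split_reasoning` is hypothetical, the demo string is made up for illustration rather than real model output, and the code assumes at most one `<think>...</think>` block at the start of the response.

```python
def split_reasoning(response_text: str):
    """Split a response into (reasoning, answer) around the <think>...</think> block."""
    open_tag, close_tag = "<think>", "</think>"
    start = response_text.find(open_tag)
    end = response_text.find(close_tag)
    if start == -1 or end == -1 or end < start:
        # No complete thinking block found: treat the whole text as the answer
        return "", response_text.strip()
    reasoning = response_text[start + len(open_tag):end].strip()
    answer = response_text[end + len(close_tag):].strip()
    return reasoning, answer

# Made-up demo string shaped like the streamed output above
demo = "<think>\nContext 3 mentions 10^26.\n</think>\n\nThe threshold is 10^26."
reasoning, answer = split_reasoning(demo)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

In the notebook you would call it on the accumulated `response` string once the stream has finished.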

### What happens if the retrieved chunks are not relevant?

In [18]:
query = "What is the circumference of the moon?"

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[
        {"role": "system", "content": "You are a helpful chatbot."},
        {"role": "user", "content": PROMPT.format(query=query, formatted_chunks=formatted_chunks)},
    ],
    stream=True,
)

response = ''

# Accumulate the streamed tokens while printing them as they arrive
for chunk in stream:
    token = chunk.choices[0].delta.content or ""
    response += token
    print(token, end="", flush=True)

<think>
Okay, let's tackle this question. The user is asking for the circumference of the moon. But wait, the provided contexts don't seem to mention anything about the moon, its size, or circumference. Let me check each context again to be sure.

Context 1 talks about duty exemptions and computing clusters, mentioning 10^20 operations. Context 2 is about a bill related to AI. Context 3 again discusses AI models and computing power of 10^26 operations. Context 4 mentions creating a cloud computing cluster called CalCompute. Context 5 defines a covered model based on computing power. 

None of these contexts provide any data about the moon's physical characteristics. The user specified to use only the given information and cite sources from the provided contexts. Since there's no relevant information here about the moon's circumference, I need to state that the answer isn't available in the given contexts. I should also list the contexts checked to show I reviewed them all.
</think>

Th