---
---
# Notebook: [Week #05: Build Your RAG Pipeline with Enhanced Retrieval]

The challenges in this notebook are to implement at least:

- 1 x Technique from **Pre-Retrieval Processes**
- 1 x Technique from **Retrieval Processes**
- 1 x Technique from **Post-Retrieval Processes**

---

Note:
- You may want to challenge yourself by implementing techniques that are covered in our **Course Notes** but **NOT in the notebook walkthrough**.
- You can create as many code cells as needed.

## Setup
---

In [1]:
from openai import OpenAI
from getpass import getpass

with open('../openai_key', 'r') as file:
    API_KEY = file.read().rstrip()

client = OpenAI(api_key=API_KEY)

---

## Helper Functions

---

### Function for Generating Embedding

In [2]:
def get_embedding(input, model='text-embedding-3-small'):
    response = client.embeddings.create(
        input=input,
        model=model
    )
    return [x.embedding for x in response.data]

### Function for Text Generation

In [3]:
# This is the "Updated" helper function for calling LLM
def get_completion(prompt, model="gpt-4o-mini", temperature=0, top_p=1.0, max_tokens=256, n=1, json_output=False):
    # Request a JSON object only when the caller asks for it
    output_json_structure = {"type": "json_object"} if json_output else None

    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create( #originally was openai.chat.completions
        model=model,
        messages=messages,
        temperature=temperature,
        top_p=top_p,
        max_tokens=max_tokens,
        n=n,  # honour the caller's n; only the first choice is returned below
        response_format=output_json_structure,
    )
    return response.choices[0].message.content

In [4]:
# This is a "modified" helper function that we will discuss in this session
# Note that this function takes "messages" directly as its parameter.
def get_completion_by_messages(messages, model="gpt-4o-mini", temperature=0, top_p=1.0, max_tokens=1024, n=1):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        top_p=top_p,
        max_tokens=max_tokens,
        n=n  # honour the caller's n; only the first choice is returned below
    )
    return response.choices[0].message.content

### Functions for Token Counting

In [5]:
# These functions count the tokens in a text or in a list of "messages"
# ⚠️ This is a simplified implementation that is good enough for a rough estimation

import tiktoken

def count_tokens(text):
    encoding = tiktoken.encoding_for_model('gpt-4o-mini')
    return len(encoding.encode(text))

def count_tokens_from_message_rough(messages):
    encoding = tiktoken.encoding_for_model('gpt-4o-mini')
    value = ' '.join([x.get('content') for x in messages])
    return len(encoding.encode(value))


## Setting up Credentials & Common Components for LangChain

In [6]:
import os
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
os.environ["OPENAI_API_KEY"] = API_KEY

# embedding model that we will use for the session
embeddings_model = OpenAIEmbeddings(model='text-embedding-3-small')

# llm to be used in the RAG pipelines in this notebook
llm = ChatOpenAI(model='gpt-4o-mini', temperature=0, seed=42)

---
---

<br>

**\[ Overview of Steps in RAG \]**

- 1. **Document Loading**
	- In this initial step, relevant documents are ingested and prepared for further processing. This process typically occurs offline.
- 2. **Splitting & Chunking**
	- The text from the documents is split into smaller chunks or segments.
	- These chunks serve as the building blocks for subsequent stages.
- 3. **Storage**
	- The embeddings (vector representations) of these chunks are created and stored in a vector store.
	- These embeddings capture the semantic meaning of the text.
- 4. **Retrieval**
	- When an online query arrives, the system retrieves relevant chunks from the vector store based on the query.
	- This retrieval step ensures that the system identifies the most pertinent information.
- 5. **Output**
	- Finally, the retrieved chunks are used to generate a coherent response.
	- This output can be in the form of natural language text, summaries, or other relevant content.

![](https://abc-notes.data.tech.gov.sg/resources/img/topic-4-rag-overview.png)
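
The five steps above can be sketched end to end in plain Python. This is only an illustration under stated assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the "vector store" is a plain list, and the toy documents are made up rather than taken from the course notes. The last step only builds the prompt that would be sent to an LLM.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_into_chunks(text, max_words=8):
    """Step 2: split a document into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_store(documents):
    """Step 3: embed every chunk and keep (chunk, vector) pairs in a list."""
    return [(chunk, toy_embed(chunk)) for doc in documents for chunk in split_into_chunks(doc)]


def retrieve(store, query, k=2):
    """Step 4: return the k chunks most similar to the query."""
    query_vec = toy_embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, chunks):
    """Step 5: assemble the context that would be sent to the LLM."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Step 1: "loaded" documents (toy data, not the course notes)
toy_documents = [
    "Temperature controls the randomness of the tokens sampled by the model.",
    "Hallucinations are confident statements that are not grounded in facts.",
]
toy_store = build_store(toy_documents)
top_chunks = retrieve(toy_store, "What does temperature control?", k=1)
print(build_prompt("What does temperature control?", top_chunks))
```

In the rest of this notebook, LangChain components take over these roles: a text splitter for step 2, Chroma for step 3, and a retriever plus `llm` for steps 4 and 5.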

# Setting Up the Common Process

In [7]:
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

## Download the notes

In [8]:
# Download and unzip into local folder
url = "https://abc-notes.data.tech.gov.sg/resources/data/notes_rag.zip"

import requests
import zipfile
import io

response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
z = zipfile.ZipFile(io.BytesIO(response.content))

# Note that the files are unzipped into the './notes_rag' folder
z.extractall('./notes_rag')


## Document Loading

In [9]:
from langchain_community.document_loaders import TextLoader

In [10]:
# list of filenames to load
filename_list = [
    '2. Key Parameters for LLMs.txt',
    '3. LLMs and Hallucinations.txt',
    '4. Prompting Techniques for Builders.txt',
]

# load the documents
list_of_documents_loaded = []
for filename in filename_list:
    try:
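        # try to load the document
        # NOTE: this path must point at the folder that holds the notes; the download
        # cell above extracts into './notes_rag', so adjust 'notes' if needed
        markdown_path = os.path.join('notes', filename)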
        loader = TextLoader(markdown_path)

        # load() returns a list of Document objects
        data = loader.load()
        # use extend() to add to the list_of_documents_loaded
        list_of_documents_loaded.extend(data)
        print(f"Loaded {filename}")

    except Exception as e:
        # if there is an error loading the document, print the error and continue to the next document
        print(f"Error loading {filename}: {e}")
        continue

print("Total documents loaded:", len(list_of_documents_loaded))

Loaded 2. Key Parameters for LLMs.txt
Loaded 3. LLMs and Hallucinations.txt
Loaded 4. Prompting Techniques for Builders.txt
Total documents loaded: 3


---
---


# Technique(s) for Improving Pre-Retrieval Process

## Semantic Chunking

In [11]:
from langchain_experimental.text_splitter import SemanticChunker

# Create the text splitter
text_splitter = SemanticChunker(embeddings_model)

# Split the documents into smaller chunks
splitted_documents = text_splitter.split_documents(list_of_documents_loaded)
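
To make the idea behind semantic chunking concrete, here is a minimal sketch of the general approach: embed neighbouring sentences and start a new chunk wherever they are too dissimilar. It is not the `SemanticChunker` implementation. `toy_embed` is a hypothetical bag-of-words stand-in for `embeddings_model`, and the fixed `max_distance` cut-off is a simplifying assumption for this sketch (the library chooses its breakpoints differently).

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if not norm_a or not norm_b:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def semantic_chunks(text, max_distance=0.9):
    """Group consecutive sentences; start a new chunk when the distance
    between neighbouring sentences exceeds max_distance."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine_distance(toy_embed(previous), toy_embed(current)) > max_distance:
            chunks.append([current])      # topic shift: open a new chunk
        else:
            chunks[-1].append(current)    # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


toy_text = (
    "Temperature controls randomness. "
    "A high temperature increases randomness. "
    "Chunking splits documents. "
    "Smaller documents are easier to search."
)
toy_chunks = semantic_chunks(toy_text)
for chunk in toy_chunks:
    print(chunk)
```

With this toy text the two temperature sentences end up in one chunk and the two chunking sentences in another, because the only pair of neighbours with no shared words is the second and third sentence.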

## Multi Query Retrieval

In [12]:
# Set logging for the queries
import logging

# Refer to LangChain documentation to find which loggers to set
# Different LangChain Classes/Modules have different loggers to set
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [13]:
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI
from langchain_chroma import Chroma
from langchain.chains import RetrievalQA

# Create the vector database
vectordb = Chroma.from_documents(
    documents=splitted_documents,
    embedding=embeddings_model,
    collection_name="naive_splitter", # one database can have multiple collections
    persist_directory="./vector_db"
)

# Create the multiquery retriever
retriever_multiquery = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=llm,
)

# Create the multiquery pipeline
qa_chain_multiquery = RetrievalQA.from_llm(
    retriever=retriever_multiquery, llm=llm
)

In [14]:
# test the multiquery pipeline
qa_chain_multiquery.invoke("What is temperature in LLMs?")

INFO:langchain.retrievers.multi_query:Generated queries: ['What role does temperature play in the functioning of large language models (LLMs)?', 'How does the concept of temperature affect the output of LLMs during text generation?', 'Can you explain how temperature influences the behavior of large language models?']


{'query': 'What is temperature in LLMs?',
 'result': 'In the context of Large Language Models (LLMs), "temperature" refers to a parameter that controls the randomness of the model’s predictions. A high temperature setting makes the model more likely to produce varied and sometimes unexpected responses, while a low temperature results in more predictable and conservative outputs. Essentially, it adjusts the probability distribution of the next token being generated, influencing the diversity of the generated text.'}
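
The log line above shows the three query variants generated by the LLM. Conceptually, the retriever then runs each variant against the vector store and keeps the unique union of the results. The sketch below illustrates only that merge step. `multi_query_retrieve` and `fake_index` are hypothetical names for this illustration, and the lookup table stands in for a real vector store.

```python
def multi_query_retrieve(queries, retrieve_fn):
    """Run every query variant and return the de-duplicated union of results,
    keeping first-seen order."""
    seen = set()
    merged = []
    for query in queries:
        for doc in retrieve_fn(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Hypothetical retriever: a fixed lookup table standing in for a vector store
fake_index = {
    "What is temperature in LLMs?": ["chunk A", "chunk B"],
    "How does temperature affect LLM output?": ["chunk B", "chunk C"],
    "Why does temperature change model behaviour?": ["chunk A", "chunk D"],
}

query_variants = list(fake_index)
merged_chunks = multi_query_retrieve(query_variants, lambda q: fake_index.get(q, []))
print(merged_chunks)  # ['chunk A', 'chunk B', 'chunk C', 'chunk D']
```

Each variant can surface chunks that the original wording would miss, while de-duplication keeps the context passed to the LLM from repeating itself.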

---
---
<br>

# Technique(s) for Improving Retrieval Process

## Parent-Child Indexing

In [17]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(separators=["\n# "], chunk_size=4000, length_function=count_tokens)

# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(separators=["\n## "], chunk_size=1250, length_function=count_tokens)

# The vectorstore to use to index the child chunks
vectordb = Chroma(collection_name="parent_child", embedding_function=embeddings_model)

# The storage layer for the parent documents
store = InMemoryStore()

# Specify the retriever
parentchildretriever = ParentDocumentRetriever(
    vectorstore=vectordb,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
    search_kwargs={'k': 5}
)

# The splitting and embedding happen here
parentchildretriever.add_documents(list_of_documents_loaded)
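
The idea behind parent-child indexing is to search over small child chunks but hand the larger parent chunk to the LLM. The sketch below shows that mapping in plain Python. It is an illustration only: the helper names are hypothetical, word overlap stands in for embedding similarity, and the toy parents are made up.

```python
def build_parent_child_index(parents, child_size=5):
    """Split each parent into small word-window children and remember
    which parent every child came from."""
    child_to_parent = {}
    for parent_id, parent_text in parents.items():
        words = parent_text.split()
        for i in range(0, len(words), child_size):
            child_to_parent[" ".join(words[i:i + child_size])] = parent_id
    return child_to_parent


def retrieve_parents(query, child_to_parent, parents, k=2):
    """Rank children by word overlap with the query, then return the
    de-duplicated parent texts of the top-k children."""
    query_words = set(query.lower().split())
    ranked = sorted(
        child_to_parent,
        key=lambda child: len(query_words & set(child.lower().split())),
        reverse=True,
    )
    parent_ids = []
    for child in ranked[:k]:
        parent_id = child_to_parent[child]
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]


# Toy parents (made-up text, not the course notes)
toy_parents = {
    "p1": "temperature controls randomness in sampling and lower values make output more deterministic",
    "p2": "hallucinations are fluent statements that are not supported by any source",
}
toy_child_index = build_parent_child_index(toy_parents)
toy_hits = retrieve_parents("temperature randomness", toy_child_index, toy_parents)
print(toy_hits)
```

The query matches only the first small child of `p1`, yet the whole parent is returned, including the part about deterministic output that the matching child does not contain.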

## Combining with Pre-Retrieval Techniques

In [18]:
# Create the multiquery retriever
retriever_multiquery = MultiQueryRetriever.from_llm(
    retriever=parentchildretriever, llm=llm,
)

# Create the multiquery pipeline
qa_chain_multiquery = RetrievalQA.from_llm(
    retriever=retriever_multiquery, llm=llm
)

In [23]:
qa_chain_multiquery.invoke("What is temperature in LLMs?")

INFO:langchain.retrievers.multi_query:Generated queries: ['What role does temperature play in the functioning of large language models (LLMs)?', 'How does the concept of temperature affect the output generation in LLMs?', 'Can you explain how temperature influences the behavior of large language models?']


{'query': 'What is temperature in LLMs?',
 'result': 'In the context of large language models (LLMs), "temperature" is a parameter that controls the randomness of the model\'s output during text generation. A lower temperature (e.g., close to 0) makes the model\'s predictions more deterministic and focused, often resulting in more coherent and sensible text. A higher temperature (e.g., above 1) increases randomness, allowing for more diverse and creative outputs, but it may also lead to less coherent or more erratic responses. Adjusting the temperature helps balance between creativity and coherence in generated text.'}

---
---
<br>

# Technique(s) for Improving Post-retrieval Process

In [None]:
# Set a score threshold and combine it with the previous techniques
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(separators=["\n# "], chunk_size=4000, length_function=count_tokens)

# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(separators=["\n## "], chunk_size=1250, length_function=count_tokens)

# The vectorstore to use to index the child chunks
vectordb = Chroma(collection_name="parent_child", embedding_function=embeddings_model)

# The storage layer for the parent documents
store = InMemoryStore()

# Specify the retriever
parentchildretriever = ParentDocumentRetriever(
    vectorstore=vectordb,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
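    # Assumption to verify against your LangChain version: the threshold may only be
    # applied when the retriever also uses search_type="similarity_score_threshold"
    search_kwargs={
        'k': 5,
        'score_threshold': 0.2,  # this is the new addition
    }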
)

# The splitting and embedding happen here
parentchildretriever.add_documents(list_of_documents_loaded)

# Create the multiquery retriever
retriever_multiquery = MultiQueryRetriever.from_llm(
    retriever=parentchildretriever, llm=llm,
)

# Create the multiquery pipeline
qa_chain_multiquery = RetrievalQA.from_llm(
    retriever=retriever_multiquery, llm=llm
)
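
What a score threshold does as a post-retrieval step can be shown without any library: drop retrieved chunks whose relevance score falls below the cut-off, then keep at most `k` of the rest. The sketch below is a plain-Python illustration, not LangChain's implementation. `filter_by_score` is a hypothetical helper, and the scores are made-up values where higher means more relevant.

```python
def filter_by_score(scored_chunks, score_threshold=0.2, k=5):
    """Keep at most k chunks whose relevance score is >= score_threshold,
    highest score first."""
    kept = [(chunk, score) for chunk, score in scored_chunks if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Hypothetical (chunk, relevance score) pairs; higher means more relevant
toy_results = [
    ("Top-p limits the sampling pool.", 0.35),
    ("Temperature controls randomness.", 0.82),
    ("Course logistics and schedule.", 0.05),
]
filtered_results = filter_by_score(toy_results)
print(filtered_results)
```

Here the low-scoring logistics chunk is dropped even though `k` would have allowed it, so less irrelevant context reaches the LLM. A threshold that is too high can also filter out everything, so it is worth checking the case where no chunk survives.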