# Advanced Chunking Exercise

You will implement a RAG application for long, messy legal documents. You will apply the best practices you have learned so far, including semantic chunking and chunk enrichment. Then you will implement semantic search and response generation with citations to the original documents.

### Visual improvements

We will use the [rich library](https://github.com/Textualize/rich) to make the output more readable, and suppress warning messages.

In [1]:
from rich.console import Console
from rich_theme_manager import Theme, ThemeManager
import pathlib

theme_dir = pathlib.Path("../themes")
theme_manager = ThemeManager(theme_dir=theme_dir)
dark = theme_manager.get("dark")

# Create a console with the dark theme
console = Console(theme=dark)

In [2]:
import warnings

# Suppress warnings
warnings.filterwarnings('ignore')

## Loading complex PDF documents

You will load a complex legal PDF document from the [case.law](https://case.law/) website. This website has millions of legal documents, and we will load a random PDF file from that site with more than 1,000 pages. 

To parse the PDF file you will use a PDF processor library, [pymupdf4llm](https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/), which makes it easy to extract text and other media from PDF files for RAG applications. 

In [None]:
import pymupdf4llm

import requests
import os

random_doc_number = 196
url = f"https://static.case.law/wash-app/{random_doc_number}.pdf"
response = requests.get(url)
# Fail early if the download did not succeed
response.raise_for_status()

data_folder = "data"
os.makedirs(data_folder, exist_ok=True)

pdf_path = os.path.join(data_folder, f"{random_doc_number}.pdf")
with open(pdf_path, "wb") as file:
    file.write(response.content)

# Extract the text as Markdown, one dictionary per page
md_text = pymupdf4llm.to_markdown(pdf_path, page_chunks=True)

### Show a random page from the document

Let's check a random page from the PDF document and print its image and the extracted text.

In [5]:
import fitz
from IPython.display import display, HTML

random_page_number = 149
# Render the selected PDF page as a PNG image
pdf_path = "data/196.pdf"
pdf_document = fitz.open(pdf_path)
page = pdf_document.load_page(random_page_number)  # Page numbering starts from 0
pix = page.get_pixmap()
pix.save("random_page.png")
pdf_document.close()

# Text content
text_content = f"""
<h3>Extracted Text</h3>
<p>{md_text[random_page_number]["text"]}</p>
"""

# HTML layout for two columns to show the image and text side by side
html_content = f"""
<div style="display: flex; align-items: center;">
    <div style="flex: 60%; padding: 5px;">
        <img src="{'random_page.png'}" style="max-width: 100%; height: auto;"/>
    </div>
    <div style="flex: 40%; padding: 5px;">
        {text_content}
    </div>
</div>
"""

# Display in Jupyter notebook
display(HTML(html_content))

You can see that the PDF processor also extracts document-level information such as the title and page count. We can attach this information as metadata to our chunks in the vector database.

In [None]:
console.print(md_text[random_page_number])

## Split the documents into Chunks

You will use the statistical chunker from the hands-on lab. However, we want an encoder that was trained on legal documents and can generate better embedding vectors to improve the retrieval results. For this exercise you will use an encoder from the Hugging Face Hub: https://huggingface.co/nlpaueb/legal-bert-base-uncased.

In [None]:
from semantic_router.encoders import HuggingFaceEncoder

encoder = HuggingFaceEncoder(
    ### YOUR CODE HERE ###
)
console.print(encoder)


In [7]:
from semantic_chunkers import StatisticalChunker
import logging

logging.disable(logging.CRITICAL)

chunker = StatisticalChunker(
    encoder=encoder,
    min_split_tokens=100,
    max_split_tokens=500,
)
console.print(chunker)
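
To build intuition for what a semantic chunker does with these limits, here is a toy sketch. It is not the `StatisticalChunker` algorithm: it assumes a bag-of-words vector in place of a real encoder, a fixed similarity threshold, and whitespace word counts in place of tokens. All function names and default values are hypothetical. A new chunk starts when consecutive sentences stop resembling each other (and the current chunk is long enough), or when adding a sentence would exceed the maximum size.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "to", "for", "on"}

def bow_vector(sentence):
    """Toy stand-in for an encoder: counts of lowercase non-stopword words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def toy_semantic_chunks(sentences, threshold=0.2, min_tokens=4, max_tokens=20):
    chunks, current = [], []
    for sentence in sentences:
        if current:
            tokens_in_current = sum(len(s.split()) for s in current)
            similarity = cosine(bow_vector(current[-1]), bow_vector(sentence))
            # Split on a topic shift only if the current chunk is long enough.
            topic_shift = similarity < threshold and tokens_in_current >= min_tokens
            # Always split if adding this sentence would exceed the maximum size.
            too_long = tokens_in_current + len(sentence.split()) > max_tokens
            if topic_shift or too_long:
                chunks.append(" ".join(current))
                current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

sentences = [
    "The tenant failed to pay the rent.",
    "The landlord sued the tenant for the unpaid rent.",
    "Separately, the appeal concerns a traffic citation.",
    "The driver contested the traffic citation on appeal.",
]
toy_chunks = toy_semantic_chunks(sentences)
```

With these sentences the two rent sentences end up in one chunk and the two traffic-citation sentences in another, because the third sentence shares no content words with the second.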

### Chunking the full document text

We will concatenate the text from all the pages of the document. After each page we will insert a marker that carries the page number, so that the retrieval and generation steps can cite the relevant page in the long document.

In [8]:
concatenated_text = " ".join([page["text"] + f"<page_break_{i}>" for i, page in enumerate(md_text)])

chunks = ### YOUR CODE HERE ###
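
To see how the page markers behave, here is a minimal sketch on made-up data. The `pages` list is a hypothetical stand-in for the per-page output of the PDF processor, and `first_page_marker` is an illustrative helper name. Because each marker closes its own page, the first marker found in a chunk identifies the page on which that chunk starts; a chunk that lies entirely inside one page contains no marker and falls back to the default.

```python
import re

# Hypothetical stand-in for the per-page output: one dict per page.
pages = [
    {"text": "First page text."},
    {"text": "Second page text."},
    {"text": "Third page text."},
]

# Same pattern as the cell above: each page's text is followed by its own marker.
concatenated = " ".join(
    page["text"] + f"<page_break_{i}>" for i, page in enumerate(pages)
)

def first_page_marker(chunk_text, default=0):
    """Return the page index of the first marker in the chunk, or `default`."""
    match = re.search(r"<page_break_(\d+)>", chunk_text)
    return int(match.group(1)) if match else default

# A chunk that covers the end of page index 1 and the start of page index 2.
chunk = "page text.<page_break_1> Third page"
page_index = first_page_marker(chunk)
```

One consequence of this design is that the fallback value is only a guess; a more careful variant could carry forward the last page number seen in earlier chunks.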


How many chunks were created?

In [None]:
### YOUR CODE HERE  ###

Let's print a random chunk:

In [10]:
console.print(chunks[0][5])

What is the average number of tokens in the chunks?

In [None]:
### YOUR CODE HERE ###

## Enrich the chunk with context and metadata

We will iterate over all the chunks. This can take some time based on the number of chunks.

Since we want to be able to process a large number of documents in our RAG system, we need to create a UUID that will be used as the ID of the chunk within the vector database. The UUID is derived from the URL of the document and the chunk index. This structure allows you to compute the ID of any chunk index directly, which will be important in the augmentation phase.

In [22]:
import uuid
import re

doc_url = url
title = md_text[0]["metadata"]["title"]
# Enrich the metadata with filters that are relevant for future retrieval queries.
state = "Washington"

from tqdm import tqdm

def generate_uuid(doc_url, i):
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_url}/{i}"))

corpus_json = []
for i, chunk in tqdm(enumerate(chunks[0]), total=len(chunks[0]), desc="Processing chunks"):
    chunk_text = ' '.join(chunk.splits)
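    # Extract the page number from the first page-break marker in the chunk
    page_match = re.search(r'<page_break_(\d+)>', chunk_text)
    # Store the page as an integer in both cases so the metadata type is consistent
    page = int(page_match.group(1)) if page_match else 0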
    chunk_uuid = generate_uuid(doc_url, i)
    corpus_json.append({
        "id": chunk_uuid,
        "document": chunk_text,
        # Add the title of the document to the chunk text for embedding
        "embedding": encoder([f"{title} \n {chunk_text}"])[0],
        "metadata" : {
            "title": title,
            "state": state,
            "doc_url": doc_url,
            "chunk_index": i,
            "page": page,
        }
    })


Processing chunks: 100%|██████████| 2336/2336 [03:42<00:00, 10.51it/s]
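
The reason a name-based UUID is useful here can be checked in a few lines. The sketch below repeats the `generate_uuid` function from the cell above and uses an arbitrary chunk index; it shows that the same URL and index always produce the same ID, so an ID can be recomputed from the metadata alone without looking anything up.

```python
import uuid

def generate_uuid(doc_url, i):
    # Same construction as in the enrichment cell: a name-based (version 5) UUID.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_url}/{i}"))

doc_url = "https://static.case.law/wash-app/196.pdf"

# The index 42 is arbitrary and only used for illustration.
id_first_run = generate_uuid(doc_url, 42)
id_second_run = generate_uuid(doc_url, 42)
id_other_chunk = generate_uuid(doc_url, 43)

print(id_first_run == id_second_run)   # same inputs give the same ID
print(id_first_run == id_other_chunk)  # a different index gives a different ID
```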


## Loading into a Vector Database

You will use a new vector database, [Chroma](https://github.com/chroma-core/chroma). This illustrates the modularity of the RAG application and how similar the concepts are across providers.

### Creating the collection 

You will use the default values for this simpler exercise.

In [14]:
import chromadb
# setup Chroma in-memory, for easy prototyping. Can add persistence easily!
client = chromadb.Client()

# Create collection. get_collection, create_collection, delete_collection also available!
collection_name = "legal-pdfs"
collection = client.get_or_create_collection(collection_name)


### Inserting the documents

We will use the embeddings, metadata, and documents that were calculated above.

In [15]:
collection.add(
    documents=### YOUR CODE HERE ###,
    embeddings=[obj["embedding"] for obj in corpus_json],
    metadatas=### YOUR CODE HERE ###,
    ### YOUR CODE HERE ###
)

### Query the vector collection

We will add an example filter to the query based on the metadata that we created for each chunk (`{"state": "Washington"}`).

In [16]:
query_text = "cases about loan default"
query_embedding = ### YOUR CODE HERE ###

In [17]:
hits = collection.query(
    query_embeddings=query_embedding,
    n_results=5,
    where={"state": "Washington"},
)
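
Conceptually, a filtered query combines two steps: keep only the records whose metadata matches the filter, then rank the remaining records by similarity to the query embedding. The sketch below illustrates that idea in plain Python on made-up two-dimensional vectors. It is not how Chroma is implemented (a vector database uses an index and its own distance function); `filtered_search` and the sample records are hypothetical, and cosine similarity is an assumption chosen for readability.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filtered_search(records, query_embedding, where, n_results=2):
    """Keep records whose metadata matches every key in `where`, then rank them."""
    candidates = [
        r for r in records
        if all(r["metadata"].get(k) == v for k, v in where.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(r["embedding"], query_embedding),
        reverse=True,
    )
    return ranked[:n_results]

# Made-up records: "b" is close to the query but belongs to another state.
records = [
    {"id": "a", "embedding": [1.0, 0.0], "metadata": {"state": "Washington"}},
    {"id": "b", "embedding": [0.9, 0.1], "metadata": {"state": "Oregon"}},
    {"id": "c", "embedding": [0.0, 1.0], "metadata": {"state": "Washington"}},
]
results = filtered_search(records, [1.0, 0.0], where={"state": "Washington"})
result_ids = [r["id"] for r in results]
```

Record `b` never appears in the results even though it is nearly identical to the query, which is the effect the `where` filter has on the real query.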


## Augmentation Step

We suspect that a single chunk provides too little context, so we want to concatenate the chunks around it before we send the text to the generation step.

In [None]:
console.print(hits)

### Augmenting the search result

You will iterate over all the search results and prepare them for the generation step. The main augmentation is the concatenation of the text of the surrounding chunks.

In [19]:
# define a variable to hold the search results with specific fields
search_results = []

for document, metadata in zip(hits["documents"][0], hits["metadatas"][0]):
    doc_url = metadata["doc_url"]
    chunk_index = metadata["chunk_index"]
    doc_id = generate_uuid(doc_url, chunk_index)
    # Calculate the chunk IDs of the previous and next chunks
    previous_chunk_id = ### YOUR CODE HERE ###
    next_chunk_id = ### YOUR CODE HERE ###
    # Get the chunks from the vector collection with the chunk ids.
    previous_chunk = collection.get(### YOUR CODE HERE ###)
    next_chunk = ### YOUR CODE HERE ###
    search_results.append({
        # Concatenate the previous, current, and next document chunks to form a single document
        "document": f"{previous_chunk['documents'][0]} {document} {next_chunk['documents'][0]}",
        "metadata": metadata,
    })
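
One detail to keep in mind when concatenating neighbors is the boundary case: the first chunk of a document has no previous chunk and the last chunk has no next one, so a lookup for a neighbor may return nothing. The sketch below shows the idea on a plain dictionary that stands in for the vector collection; `expand_with_neighbors` and its `window` parameter are hypothetical names used only for illustration.

```python
def expand_with_neighbors(chunks_by_index, chunk_index, window=1):
    """Join the chunk at `chunk_index` with up to `window` neighbors per side."""
    parts = []
    for i in range(chunk_index - window, chunk_index + window + 1):
        text = chunks_by_index.get(i)
        # Skip neighbors that do not exist (before the first or after the last chunk).
        if text is not None:
            parts.append(text)
    return " ".join(parts)

# A dictionary keyed by chunk index stands in for the vector collection.
chunks_by_index = {0: "alpha", 1: "beta", 2: "gamma"}

middle = expand_with_neighbors(chunks_by_index, 1)
first = expand_with_neighbors(chunks_by_index, 0)
last = expand_with_neighbors(chunks_by_index, 2)
```

The same guard applies when the neighbors are fetched by ID from a collection: check that a document came back before indexing into the result.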

Let's print the first search result, before sending it to the generation model:

In [None]:
console.print(search_results[0])

In [22]:
from openai import OpenAI
from rich.panel import Panel
from rich.text import Text

# Use a separate name so the Chroma `client` created earlier is not overwritten
openai_client = OpenAI()
system_message = """
You are a paralegal specialist. 
### YOUR CODE HERE ###
### YOUR CODE HERE ###
"""
completion = openai_client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": query_text},
        {"role": "assistant", "content": str(search_results)}
    ]
)

response_text = Text(completion.choices[0].message.content)
styled_panel = Panel(
    response_text,
    title=f"Reply to '{query_text}'",
    expand=False,
    border_style="bright_yellow",
    padding=(1, 1)
)

console.print(styled_panel)
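
The cell above passes the search results to the model as the raw string form of a Python list. One possible alternative, sketched here under the assumption that each result has the `document` and `metadata` fields built earlier, is to render the results as numbered sources so that the model can refer to them in its citations. The `format_context` helper and the sample data are hypothetical.

```python
def format_context(search_results):
    """Render each result as a numbered source with its title, URL, and page."""
    blocks = []
    for n, result in enumerate(search_results, start=1):
        meta = result["metadata"]
        header = f"[{n}] {meta['title']} ({meta['doc_url']}, page index {meta['page']})"
        blocks.append(f"{header}\n{result['document']}")
    return "\n\n".join(blocks)

# Made-up result with the same shape as the entries in `search_results`.
example_results = [
    {
        "document": "Example chunk text.",
        "metadata": {
            "title": "Example Reports",
            "doc_url": "https://example.com/doc.pdf",
            "page": 12,
        },
    }
]
context = format_context(example_results)
print(context)
```

The page value is labeled as an index because the markers inserted during chunking count pages from 0.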