# Theravada Scripture Search POC

In this notebook, we build a proof-of-concept search over the translated Theravada texts from https://www.dhammatalks.org/suttas/.

## Setup

In [1]:

# Run once.

%load_ext autoreload
%autoreload 2

import os

# Move the working directory up one level so project-relative paths resolve.
os.chdir('..')

In [2]:
from IPython.display import HTML

HTML(
    """
<style>
.output_scroll {
    max-height: 400px;
    overflow-y: auto;
}
</style>
"""
)

In [3]:
import pandas as pd
from dotenv import load_dotenv

load_dotenv()


True

## Load data

In [4]:
dhamma_talks_suttas = pd.read_csv('data/dhamma_talks_suttas.csv', index_col=0)
print(dhamma_talks_suttas.shape)
dhamma_talks_suttas.head()

(240, 8)


Unnamed: 0,collection,title,url_source,religion,subgroup,source,translation_source,text
0,AN,A Single Thing,https://www.dhammatalks.org/suttas/AN/AN1_21.html,Buddhism,Theravada,Dhamma Talks,Thanissaro Bhikkhu,"21. âI donât envision a single thing that,..."
1,AN,A Pool of Water,https://www.dhammatalks.org/suttas/AN/AN1_45.html,Buddhism,Theravada,Dhamma Talks,Thanissaro Bhikkhu,45. âSuppose there were a pool of waterâsu...
2,AN,Soft,https://www.dhammatalks.org/suttas/AN/AN1_48.html,Buddhism,Theravada,Dhamma Talks,Thanissaro Bhikkhu,"âJust as, of all trees, the balsam is foremo..."
3,AN,Quick to Reverse Itself,https://www.dhammatalks.org/suttas/AN/AN1_49.html,Buddhism,Theravada,Dhamma Talks,Thanissaro Bhikkhu,âI donât envision a single thing that is a...
4,AN,Luminous,https://www.dhammatalks.org/suttas/AN/AN1_50.html,Buddhism,Theravada,Dhamma Talks,Thanissaro Bhikkhu,"âLuminous, monks, is the mind.1 And it is de..."


In [5]:

def decode_text(text):
    """Repair mojibake: UTF-8 text that was mistakenly decoded as Latin-1."""
    if not isinstance(text, str):
        return text
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("latin1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not Latin-1 mojibake, so return it unchanged.
        return text

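def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Print the shape after each step to show how many rows are dropped.
    print(df.shape)
    df = df.dropna(subset=["text"])
    print(df.shape)
    # Copy so the assignment below does not write into a view of the input frame.
    df = df.drop_duplicates(subset=["text"]).copy()
    print(df.shape)
    df["text"] = df["text"].apply(decode_text)
    return df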


dhamma_talks_suttas = preprocess(dhamma_talks_suttas)
dhamma_talks_suttas.shape


(240, 8)
(239, 8)
(235, 8)


(235, 8)
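
The mangled characters in the data preview (for example `donât`) are typical of UTF-8 bytes that were decoded as Latin-1. The following is a minimal sketch of the round trip that `decode_text` relies on. The sample string is made up for illustration and is not a row from the dataset.

```python
# Illustrative sample only; not taken from the dataset.
original = "I don’t envision a single thing"

# Simulate the corruption: UTF-8 bytes misread as Latin-1.
garbled = original.encode("utf-8").decode("latin1")

# Reverse it: recover the raw bytes, then decode them correctly as UTF-8.
repaired = garbled.encode("latin1").decode("utf-8")

print(repaired == original)
print(garbled != original)

# Clean text containing characters outside Latin-1 cannot be re-encoded,
# which is why a repair function should also handle UnicodeEncodeError.
try:
    original.encode("latin1")
    already_clean = False
except UnicodeEncodeError:
    already_clean = True
print(already_clean)
```

The repair is only safe when the whole string was corrupted the same way, so falling back to the unchanged text on any encode or decode error is the conservative choice.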

## Create basic search

In [6]:
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from smolagents import CodeAgent, HfApiModel

from scripture_search.search import ScriptureSearchTool

def row_to_doc(row: pd.Series) -> Document:
    return Document(
        page_content=row["text"], 
        metadata={
            "collection": row["collection"],
            "title": row["title"],
            "url_source": row["url_source"],
            "religion": row["religion"],
            "subgroup": row["subgroup"],
            "source": row["source"],
            "translation_source": row["translation_source"],
        }
    )

documents = dhamma_talks_suttas.apply(row_to_doc, axis=1).tolist()

# Split the documents into smaller chunks for more efficient search
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    add_start_index=True,
    strip_whitespace=True,
    separators=["\n\n", "\n", ".", " ", ""],
)
docs_processed = text_splitter.split_documents(documents)
print(len(docs_processed))

# Create the retriever tool
scripture_search_tool = ScriptureSearchTool(docs_processed)



9158
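
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a simplified sketch of fixed-width overlapping windows. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter first breaks text on the listed separators and then merges pieces, so its chunk boundaries will differ. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(
    text: str, chunk_size: int = 500, chunk_overlap: int = 50
) -> list[tuple[int, str]]:
    """Return (start_index, chunk) pairs of fixed-width, overlapping windows."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Placeholder text of 1200 characters, standing in for one sutta.
sample = "x" * 1200
chunks = sliding_chunks(sample)
starts = [start for start, _ in chunks]
lengths = [len(chunk) for _, chunk in chunks]
print(starts, lengths)
```

Consecutive windows share 50 characters, so a sentence that straddles a boundary still appears whole in at least one chunk when it is short enough. The recorded start offsets play the same role as the `start_index` metadata requested with `add_start_index=True`.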


In [7]:
print(scripture_search_tool.forward("I struggle with motivation to meditate. What can I do?")[:2000])


 The following scriptures may be helpful:


===== Scripture: The Simile of the Cloth | source: https://www.dhammatalks.org/suttas/MN/MN7.html =====
wouldn’t cleanse
a dark deed.
What can the Sundarikā do?
What the Payāga? What the Bāhuka?
A person of animosity,
one who’s done wrong,
cannot be cleansed there
of evil deeds.
But for one who is pure,
it’s always the Phaggu festival;
for one who is pure,
always the uposatha.
For one who is pure, clean in his deeds,
his practices       always
reach consummation.
Bathe right here, brahman.
Create safety for yourself
with regard to all beings.
If you
don’t tell a lie,
don’t harm living beings,

===== Scripture: To Gaṇaka Moggallāna | source: https://www.dhammatalks.org/suttas/MN/MN107.html =====
“What can I do about that, Master Gotama? I’m the one who shows the way.”
“In the same way, brahman—when unbinding is there, and the path leading to unbinding is there, and I am there as the guide—when my disciples are thus exhorted & instructed by

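These results aren't very useful in their current form: the passages are fragments cut at chunk boundaries, and they match the wording of the query ("What can I do") more than its intent. Let's use an agent to try to improve the output.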

## Agentic Librarian approach
With this approach, the agent outputs only text taken directly from the suttas and does not try to interpret it.
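
Because the librarian is meant to return passages verbatim, its output can be checked against the source texts. The following is a minimal sketch of such a check, assuming the passages have already been extracted from the agent's response as a list of strings. The helper names `normalize` and `unverified_passages` and the sample texts are hypothetical placeholders.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not affect matching."""
    return re.sub(r"\s+", " ", text).strip()


def unverified_passages(passages: list[str], corpus: list[str]) -> list[str]:
    """Return the passages that do not occur verbatim in any corpus text."""
    normalized_corpus = [normalize(doc) for doc in corpus]
    return [
        passage
        for passage in passages
        if not any(normalize(passage) in doc for doc in normalized_corpus)
    ]


# Placeholder texts for illustration; not taken from the suttas.
corpus = ["Sample passage one.\nSample passage two."]
passages = ["Sample passage one. Sample passage two.", "An invented paraphrase."]
flagged = unverified_passages(passages, corpus)
print(flagged)
```

Any passage that comes back in `flagged` was paraphrased or invented rather than quoted, which is a useful signal when evaluating whether the agent respects the librarian constraint.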

In [12]:
# Initialize the agent
model = HfApiModel(
    # model_id='Qwen/Qwen2.5-Coder-32B-Instruct',  # this hosted model may be overloaded
    model_id="https://pflgm2locj2t89co.us-east-1.aws.endpoints.huggingface.cloud",
    token=os.getenv("HF_TOKEN"),
)
system_prompt="""You are a helpful librarian specializing in Buddhist texts and teachings.
When a user asks a question, use the scripture_search_tool tool to find relevant passages from the suttas.
You may need to refine the search query to get better results. Finally, provide the passages
as they are, without any additional commentary or interpretation. Use the same format as the search tool
for your output.
"""
agent = CodeAgent(tools=[scripture_search_tool], model=model)
# Add the system prompt to the input messages
agent.input_messages = [
    {
        "role": "system",
        "content": system_prompt,
    }
]

# Example usage
response = agent.run(
    "I struggle with motivation to meditate. What can I do?"
)


AttributeError: 'AgentText' object has no attribute 'output'

In [15]:
print(response)

AttributeError: 'AgentText' object has no attribute 'wrap_text'