# **Financial-Regulation Assistant powered by Retrieval Augmented Generation**
Fin-Reg Assistant is a platform that uses retrieval-augmented generation (RAG) to answer questions about India's securities markets, providing insights into regulatory frameworks, guidelines and compliance.



## Environment


In [None]:
# Install the required packages
# !pip install langchain langchain_community langchain_pinecone pinecone-client pandas numpy python-dotenv pypdf sentence-transformers datasets ragas langchainhub openpyxl

In [None]:
import os
from dotenv import load_dotenv
load_dotenv()

import pandas as pd
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from datasets import Dataset
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_community.llms.huggingface_endpoint import HuggingFaceEndpoint
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    answer_similarity,
    answer_correctness
)

## Initialising Environment variables

Load the Pinecone and Hugging Face API keys from the environment.

In [None]:
PINECONE_API_KEY = os.environ.get("PINECONE_API_KEY")
HUGGINGFACEHUB_API_TOKEN = os.environ.get("HUGGINGFACEHUB_API_TOKEN")
index_name = "better-rag"
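
Because `os.environ.get` returns `None` silently when a variable is unset, a missing key would only surface later as an authentication error. The sketch below shows one way to check up front; `missing_env_vars` is a hypothetical helper, not part of the original notebook, and the example environment is a made-up dictionary.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    # Default to the real process environment; a dict can be passed for testing.
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Variable names taken from the cell above; the values here are dummies.
required = ["PINECONE_API_KEY", "HUGGINGFACEHUB_API_TOKEN"]
example_env = {"PINECONE_API_KEY": "dummy-value"}
missing = missing_env_vars(required, example_env)
print(missing)
```

Calling `missing_env_vars(required)` without the second argument checks the real environment after `load_dotenv()` has run.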

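Optionally, use LangSmith to track LLM behaviour and token usage.

In [None]:
# LangSmith reads its settings from environment variables,
# so tracing has to be enabled there rather than in a Python variable.
# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# LANGCHAIN_API_KEY = os.environ.get("LANGCHAIN_API_KEY")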

## Embedding Model
Embeddings are generated with ColBERTv2.0 (`colbert-ir/colbertv2.0`), which balances efficiency and contextualization and is therefore well suited to document retrieval. Loaded through `HuggingFaceEmbeddings`, it produces one dense vector per text.

In [None]:
embeddings = HuggingFaceEmbeddings(model_name ="colbert-ir/colbertv2.0")

No sentence-transformers model found with name colbert-ir/colbertv2.0. Creating a new one with MEAN pooling.
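
The warning above indicates that the checkpoint is not a native sentence-transformers model, so the per-token embeddings are averaged into a single vector for each text. A minimal sketch of that mean-pooling step follows; the token vectors are made-up numbers, not real model output, and `mean_pool` is a hypothetical helper.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real (non-padding) tokens into one vector."""
    dim = len(token_vectors[0])
    # Keep only the tokens whose mask entry is non-zero.
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Hypothetical 2-dimensional token embeddings; the last token is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```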


## Generating the Vector Index

The data used in this project are the Master circulars published by the Securities and Exchange Board of India (SEBI). These circulars are issued periodically to provide updated and consolidated information to market participants, including intermediaries, investors, and other stakeholders.

The PDF is first loaded using `PyPDFLoader` and then split into chunks using `RecursiveCharacterTextSplitter`.


In [None]:
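# Load the PDF; load_and_split() returns a list of Document objects
loader = PyPDFLoader("1691151096694(1).pdf")
text = loader.load_and_split()

def chunk_maker(text):
    """Split the loaded documents into overlapping chunks."""
    text_splitter = RecursiveCharacterTextSplitter(
        # separators=["\n\n", "\n", " ", ""],  # Adjust separators if needed
        chunk_size=1000,     # maximum number of characters per chunk
        chunk_overlap=200,   # characters shared between neighbouring chunks
        length_function=len
    )
    chunks = text_splitter.split_documents(text)
    return chunks

text_chunks = chunk_maker(text)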
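
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window sketch. It is not the recursive splitter itself: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its chunks will not fall on fixed offsets. `sliding_chunks` is a hypothetical helper written only for this illustration.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last four characters of the previous one, which is the role the 200-character overlap plays in the cell above: text near a chunk boundary stays retrievable from either side.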

Store the embeddings in a Pinecone vector database.

In [None]:
docsearch = PineconeVectorStore.from_documents(text_chunks, embeddings, index_name=index_name)

Alternatively, if the Pinecone database has already been initialised, we can simply connect to the existing index.

In [None]:
# docsearch = PineconeVectorStore.from_existing_index(index_name=index_name,embedding=embeddings)

Set the vector database as a retriever to return the 3 most similar documents.

In [None]:
retriever = docsearch.as_retriever(search_kwargs={"k": 3})
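
Conceptually, a retriever with `k=3` ranks the stored vectors by similarity to the query vector and returns the three best matches. The sketch below shows that ranking in plain Python with made-up two-dimensional vectors; Pinecone performs the search server-side, and the helper names and document labels here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Made-up embeddings for four hypothetical chunks.
docs = {
    "margin rules": [0.9, 0.1],
    "settlement price": [0.7, 0.7],
    "committee composition": [0.0, 1.0],
    "position limits": [-1.0, 0.0],
}
query = [1.0, 0.0]
result = top_k(query, docs, k=3)
print(result)
```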

## Prompt
[LangSmith Hub](https://smith.langchain.com/hub) is a service that provides ready-made prompt templates for tasks such as RAG, agent systems and QA. This system uses a [prompt](https://smith.langchain.com/hub/rlm/rag-prompt) that instructs the LLM to answer the query concisely using the retrieved context, and to say that it does not know the answer when the context does not contain enough information to answer the question.

In [None]:
prompt = hub.pull("rlm/rag-prompt")

example_messages = prompt.invoke(
    {"context": "filler context", "question": "filler question"}
).to_messages()
example_messages

[HumanMessage(content="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: filler question \nContext: filler context \nAnswer:")]
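
The output above shows that the template has just two slots, `context` and `question`. As a rough plain-Python equivalent, the same substitution can be sketched with `str.format`. The template text is copied from the output above and may differ from what the hub serves at a later date; `RAG_TEMPLATE` and `build_prompt` are hypothetical names.

```python
# Template text copied from the HumanMessage shown above.
RAG_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise.\n"
    "Question: {question} \n"
    "Context: {context} \n"
    "Answer:"
)


def build_prompt(context, question):
    """Fill the two template slots with the retrieved context and the question."""
    return RAG_TEMPLATE.format(context=context, question=question)


filled = build_prompt("filler context", "filler question")
print(filled)
```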

## Generation

The generation task is handled by the Mistral-7B-Instruct-v0.1 model. Inference for this model is served through `HuggingFaceEndpoint`, which is free of cost but rate-limited.


In [None]:
repo_id = "mistralai/Mistral-7B-Instruct-v0.1"

llm = HuggingFaceEndpoint(repo_id=repo_id,huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN,temperature=0.1)

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to C:\Users\vedan\.cache\huggingface\token
Login successful


## Retrieval Chain

We define a function `format_docs` that takes a list of document objects `docs` and concatenates their `page_content` attributes into a single string, separated by two newline characters.
We then set up a processing chain `rag_chain`.
The chain starts with retrieving documents, formatting their content, and then passing the formatted content through a prompt and a language model.
Finally, the output is parsed and formatted as a string before being returned.


In [None]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
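
The pipe syntax can hide the order of operations, so here is the same data flow written as ordinary function calls. The retriever, prompt and LLM below are stand-in stubs (hypothetical, with made-up passages) so the sketch runs without any API keys; only `format_docs` matches the notebook's own function.

```python
from types import SimpleNamespace


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


def stub_retriever(question):
    # Stand-in for the Pinecone retriever: always returns two fixed passages.
    return [SimpleNamespace(page_content="Passage one."),
            SimpleNamespace(page_content="Passage two.")]


def stub_prompt(inputs):
    # Stand-in for the hub prompt: fills the question and context slots.
    return f"Question: {inputs['question']}\nContext: {inputs['context']}\nAnswer:"


def stub_llm(prompt_text):
    # Stand-in for the hosted model: returns a fixed-format string.
    return f"[stub reply based on {len(prompt_text)} prompt characters]"


def run_chain(question):
    # Step 1: build the dict the prompt expects (context is retrieved, question passes through).
    inputs = {"context": format_docs(stub_retriever(question)), "question": question}
    # Step 2: render the prompt. Step 3: call the model.
    prompt_text = stub_prompt(inputs)
    return prompt_text, stub_llm(prompt_text)


prompt_text, answer = run_chain("What is a master circular?")
print(prompt_text)
print(answer)
```

In the real chain, `RunnablePassthrough()` plays the role of passing the question through unchanged, and `StrOutputParser()` converts the model output to a plain string.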

We define a function `process_text` which streams input through the `rag_chain`, printing each chunk for continuous output.


In [None]:
def process_text(input_string):
    for chunk in rag_chain.stream(input_string):
        print(chunk, end="", flush=True)
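
`process_text` prints the streamed chunks but returns nothing. If the full answer is also needed afterwards, one option is to accumulate the chunks while printing them. The sketch below uses a stand-in generator in place of `rag_chain.stream`; `fake_stream` and `collect_stream` are hypothetical names.

```python
def fake_stream(text, size=8):
    """Stand-in for rag_chain.stream: yield the text a few characters at a time."""
    for i in range(0, len(text), size):
        yield text[i:i + size]


def collect_stream(chunks):
    """Print each chunk as it arrives and return the concatenated text."""
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    return "".join(parts)


full_text = collect_stream(fake_stream("This is a streamed answer."))
```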

## Trial Run

In [None]:
process_text("What is the full form of PMLA?")

 PMLA stands for Principal Market Maker Listing Authority.</s>

# Benchmarking
Two custom datasets, SimpleBench and ComplexBench, were developed to test distinct attributes of the proposed RAG system.

1.   SimpleBench consists of 10 straightforward domain-specific questions, with the main objective of evaluating the LLM’s generative capabilities using the retrieved information.
2.   ComplexBench consists of 6 hard questions intended to challenge both the retriever's capacity to accurately identify relevant context and the LLM's reasoning ability.

The following metrics were used to quantify the performance on the benchmarking datasets:

- **Faithfulness:**
    - It measures the factual consistency of the generated answer against the given context.
    - The LLM judges each claim in the answer as supported by the context or not, and the score is the fraction of supported claims.
- **Answer Relevance:**
    - It assesses how relevant the generated answer is to the given prompt.
    - It is calculated as the mean cosine similarity between the original question and a number of artificial questions generated from the answer.
- **Context Recall:**
    - It measures the extent to which the retrieved context aligns with the ground truth.
    - To compute it, each sentence in the ground truth answer is analyzed to determine whether it can be attributed to the retrieved context or not.
- **Answer Semantic Similarity:**
    - It assesses the semantic resemblance between the generated answer and the ground truth.
    - It is scored with an embedding model, by comparing the ground truth with the model-generated answer.
- **Answer Correctness:**
    - It measures the accuracy of the generated answer when compared to the ground truth.
    - It combines semantic similarity and factual similarity between the generated answer and the ground truth.
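
As a rough illustration of the arithmetic behind three of the metrics above, the sketch below computes faithfulness and context recall as fractions of per-item verdicts, and answer relevance as a mean cosine similarity. The verdicts and vectors are made-up inputs; in RAGAS they come from an LLM and an embedding model, and the library's actual implementation may differ in detail. All function names are hypothetical.

```python
import math


def fraction_true(flags):
    """Share of items judged True, e.g. supported claims or attributed sentences."""
    return sum(1 for flag in flags if flag) / len(flags)


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_question_similarity(question_vec, generated_question_vecs):
    """Mean cosine similarity between the question and the generated questions."""
    sims = [cosine_similarity(question_vec, vec) for vec in generated_question_vecs]
    return sum(sims) / len(sims)


# Faithfulness: 3 of 4 claims in the answer are supported by the context.
faithfulness_score = fraction_true([True, True, False, True])

# Context recall: 1 of 2 ground-truth sentences can be attributed to the context.
context_recall_score = fraction_true([True, False])

# Answer relevance: one generated question matches the original, one is unrelated.
relevance_score = mean_question_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])

print(faithfulness_score, context_recall_score, relevance_score)
```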

**NOTE:**

More information about the metrics is available [here](https://docs.ragas.io/en/stable/concepts/metrics/index.html).

[RAGAS](https://docs.ragas.io/en/stable/index.html) was used to calculate the metrics. This implementation of RAGAS requires an OpenAI account with access to GPT-3.5-turbo-0125. RAGAS provides a [method](https://docs.ragas.io/en/stable/howtos/customisations/bring-your-own-llm-or-embs.html) for plugging in custom LLMs and embedding models, but these can lead to unreliable scores.

In [None]:
#Importing SimpleBench

simple_df = pd.read_excel("SimpleBench.xlsx")
simple_df.head()

Unnamed: 0,question,ground_truth,answer
0,What determines the Daily Settlement Price (DS...,The Daily Settlement Price (DSP) in futures tr...,The Daily Settlement Price (DSP) in futures t...
1,What are the objective parameters used to dete...,The objective parameters used to determine the...,The objective parameters used to determine th...
2,What steps should Clearing Corporations take i...,Clearing Corporations planning to launch susce...,Clearing Corporations should update their syst...
3,Can you explain the framework for early delive...,The framework for early delivery in futures co...,The framework for early delivery in futures c...
4,How does the Clearing Corporation ensure compl...,The Clearing Corporation ensures compliance wi...,The Clearing Corporation ensures compliance w...


RAGAS requires the data to be in a specific [format](https://docs.ragas.io/en/stable/howtos/applications/data_preparation.html). For each question, the function `iterate_and_update` collects the documents that the retriever identified as relevant into a list and stores that list in a column called *'contexts'*.

The function then invokes the `rag_chain` and stores the model response in the dataframe column called *'answer'*.
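
The sketch below checks a single record against the four columns used in this notebook (`question`, `ground_truth`, `contexts`, `answer`), with `contexts` as a list of strings. `validate_record` is a hypothetical helper, the sample record is made up, and the RAGAS documentation linked above remains the authority on the exact format.

```python
# Column names as used in this notebook; contexts holds a list of passages.
REQUIRED_COLUMNS = {
    "question": str,
    "ground_truth": str,
    "contexts": list,
    "answer": str,
}


def validate_record(record):
    """Return a list of problems found in one evaluation record."""
    problems = []
    for column, expected_type in REQUIRED_COLUMNS.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            problems.append(f"{column} should be {expected_type.__name__}")
    contexts = record.get("contexts")
    if isinstance(contexts, list) and not all(isinstance(c, str) for c in contexts):
        problems.append("contexts should contain only strings")
    return problems


good = {
    "question": "Example question?",
    "ground_truth": "Example reference answer.",
    "contexts": ["First retrieved passage.", "Second retrieved passage."],
    "answer": "Example generated answer.",
}
bad = dict(good, contexts="a single string instead of a list")
print(validate_record(good))
print(validate_record(bad))
```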



In [None]:
def iterate_and_update(df):
    contexts = []  # one list of retrieved passages per question
    answers = []   # one generated answer per question
    for question in df['question']:
        # Retrieve the relevant documents and keep their text with newlines removed
        relevant_docs = retriever.get_relevant_documents(question)
        contexts.append([doc.page_content.replace('\n', '') for doc in relevant_docs])

        # Query the chain with this dataframe's own question,
        # not with a question from the global simple_df
        answers.append(rag_chain.invoke(question))

    # Store the results in the 'contexts' and 'answer' columns expected by RAGAS
    df['contexts'] = contexts
    df['answer'] = answers
    return df

In [None]:
#Apply the iterate_and_update function to simple_df

simple_df = iterate_and_update(simple_df)

In [None]:
#Convert the dataframe to a Dataset format as that is required for RAGAS

simple_data = Dataset.from_pandas(simple_df)
simple_data

Dataset({
    features: ['question', 'ground_truth', 'contexts', 'answer'],
    num_rows: 10
})

In [None]:
#Perform Evaluation

simple_result = evaluate(
    simple_data,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_recall,
        answer_similarity,
        answer_correctness
    ]
)


Evaluating: 100%|██████████| 50/50 [00:20<00:00,  2.44it/s]


In [None]:
simple_result

{'faithfulness': 0.8854, 'answer_relevancy': 0.9629, 'context_recall': 0.7917, 'answer_similarity': 0.9088, 'answer_correctness': 0.5768}

We perform the same steps for ComplexBench now.

In [None]:
complex_df = pd.read_excel("ComplexBench.xlsx")
complex_df.head()

Unnamed: 0,question,ground_truth,answer
0,Can you explain how governmental interventions...,Governmental interventions and price manipulat...,Governmental interventions and price manipula...
1,How do stock exchanges handle the exercise of ...,Stock exchanges handle option contracts throug...,Stock exchanges handle the exercise of option ...
2,Discuss the criteria for determining exposure ...,The exposure limits to banks by stock exchange...,The criteria for determining exposure limits ...
3,Discuss the significance of the Clearing Corpo...,The discretion of Clearing Corporations to pre...,The Clearing Corporations have the discretion...
4,In what ways does the composition of statutory...,The composition of statutory committees within...,The composition of statutory committees withi...


In [None]:
complete_complex_df = iterate_and_update(complex_df)
complete_complex_df.head()

Unnamed: 0,question,ground_truth,contexts,answer
0,Can you explain how governmental interventions...,Governmental interventions and price manipulat...,[ease of doing business in commodity markets. ...,The Daily Settlement Price (DSP) in futures t...
1,How do stock exchanges handle the exercise of ...,Stock exchanges handle option contracts throug...,[Exchange platfor m in a transparent manner ...,The objective parameters used to determine th...
2,Discuss the criteria for determining exposure ...,The exposure limits to banks by stock exchange...,[b. Open position limits in respect of client...,Clearing Corporations should design a system ...
3,Discuss the significance of the Clearing Corpo...,The discretion of Clearing Corporations to pre...,[Clearing Corporation on their website / by is...,The framework for early delivery in futures c...
4,In what ways does the composition of statutory...,The composition of statutory committees within...,[5. Advisory Committee  Advise the governin...,The Clearing Corporation ensures compliance w...


In [None]:
complex_data = Dataset.from_pandas(complete_complex_df)
complex_data

Dataset({
    features: ['question', 'ground_truth', 'contexts', 'answer'],
    num_rows: 6
})

In [None]:
complex_result = evaluate(
    complex_data,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_recall,
        answer_similarity,
        answer_correctness
    ]
)

Evaluating: 100%|██████████| 30/30 [00:15<00:00,  1.95it/s]


In [None]:
complex_result

{'faithfulness': 0.3571, 'answer_relevancy': 0.7953, 'context_recall': 0.8542, 'answer_similarity': 0.7686, 'answer_correctness': 0.7634}

## Final Benchmarking Metrics

In [None]:
df1 = pd.DataFrame([simple_result.values()], columns=simple_result.keys())
df2 = pd.DataFrame([complex_result.values()], columns=complex_result.keys())
final_metrics = pd.concat([df1, df2], ignore_index=True)
final_metrics['benchmark_name'] = ['Simple Bench', 'Complex Bench']
final_metrics.set_index('benchmark_name')

benchmark_name,faithfulness,answer_relevancy,context_recall,answer_similarity,answer_correctness
Simple Bench,0.885417,0.962931,0.791667,0.908777,0.576778
Complex Bench,0.357143,0.795326,0.854167,0.768567,0.763401


## Challenges and Final Remarks:

*   The performance of a RAG system depends strongly on the relevant data the retriever is able to identify, which makes data chunking and retrieval crucial. More advanced methods of data extraction and chunking can be implemented using services such as the [Unstructured API](https://unstructured.io/).
*   The chunk size can also be varied, along with the distance metric. This implementation uses `cosine distance`, but other metrics such as `Euclidean distance` can also be used. The metric is configured on the Pinecone index, for example through the Pinecone console.
*   RAGAS is a very helpful tool and has streamlined the process of benchmarking RAG systems. This does come at the cost of using OpenAI's closed-source ecosystem. While RAGAS does allow a custom LLM, at the time of developing this use case it did not work with either local LLM solutions (Ollama) or hosted LLMs (VertexAI). The library also requires the data to be in a very specific format, including the names of the columns in the dataset.
*   The RAG system currently uses only a single master circular. More circulars are available on [SEBI's official website](https://www.sebi.gov.in/sebiweb/home/HomeAction.do?doListing=yes&sid=1&ssid=6&smid=0).
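
To make the cosine-versus-Euclidean point above concrete, the sketch below shows a case where the two metrics disagree about which vector is the nearest neighbour. The vectors are made up for illustration, and the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity: depends only on direction, not on length."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def euclidean_distance(a, b):
    """Straight-line distance: sensitive to vector length as well as direction."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
same_direction_but_long = [10.0, 0.0]
different_direction_but_short = [0.0, 1.0]

for name, vec in [("same direction, long", same_direction_but_long),
                  ("different direction, short", different_direction_but_short)]:
    print(name, cosine_distance(query, vec), euclidean_distance(query, vec))
```

Under cosine distance the long vector pointing the same way is the closest match, while under Euclidean distance the short vector wins, so switching metrics can change which chunks are retrieved unless the embeddings are normalized.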