# Environment Setup and Imports

In this cell, we set up the complete Python environment required for building the chatbot.
This includes:

1. Installing all required third-party libraries using `pip`
2. Importing standard Python libraries for data processing
3. Importing LangChain components for:
   - Embeddings
   - Vector storage (Chroma)
   - Retrieval
   - Memory-based chat
4. Importing the LangChain OpenAI wrappers (`ChatOpenAI`, `OpenAIEmbeddings`) for the chat model and embeddings

Keeping all installations and imports in a single cell ensures:
- Reproducibility
- Easy debugging
- Cleaner notebook structure for interview evaluation
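
The OpenAI wrappers imported in the next cell read the API key from the `OPENAI_API_KEY` environment variable by default, so it can help to fail early when it is missing. Below is a minimal sketch of such a check; the helper name `require_env` and the `DEMO_API_KEY` variable are hypothetical and used only so the snippet runs anywhere.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value

# Demonstration with a made-up variable so the snippet runs without real credentials.
os.environ["DEMO_API_KEY"] = "placeholder"
print(require_env("DEMO_API_KEY"))

# In this notebook the call would be:
# require_env("OPENAI_API_KEY")
```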


In [33]:
# -----------------------------
# Install required dependencies
# -----------------------------
!pip3 install -q pandas numpy langchain langchain-community langchain-openai faiss-cpu tiktoken openai chromadb

# -----------------------------
# Standard library imports
# -----------------------------
import os
import warnings
warnings.filterwarnings("ignore")

# -----------------------------
# Data processing imports
# -----------------------------
import pandas as pd
import numpy as np

# -----------------------------
# LangChain core components
# -----------------------------
from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.memory import ConversationBufferMemory

# -----------------------------
# OpenAI + Embeddings
# -----------------------------
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# -----------------------------
# Utility imports
# -----------------------------
from typing import List, Dict


# Data Loading

In this section, we load the two datasets provided for the task:
- Holdings data
- Trades data

Both files are read into Pandas DataFrames.  
We then inspect the first few rows and basic dimensions of each dataset to understand
their structure and contents before performing any transformations or merges.

This step ensures transparency and helps validate that the data has been loaded correctly.


In [34]:
# -----------------------------
# Load the CSV files
# -----------------------------
holdings_df = pd.read_csv("data/holdings.csv")
trades_df = pd.read_csv("data/trades.csv")
head_count = 2

# -----------------------------
# Inspect the datasets
# -----------------------------
print("Holdings Data Shape:", holdings_df.shape)
display(holdings_df.head(head_count))

print("\nTrades Data Shape:", trades_df.shape)
display(trades_df.head(head_count))


Holdings Data Shape: (1022, 25)


Unnamed: 0,AsOfDate,OpenDate,CloseDate,ShortName,PortfolioName,StrategyRefShortName,Strategy1RefShortName,Strategy2RefShortName,CustodianName,DirectionName,...,StartPrice,Price,StartFXRate,FXRate,MV_Local,MV_Base,PL_DTD,PL_QTD,PL_MTD,PL_YTD
0,01/08/23,04/03/20,,Garfield,Garfield,Default,Asset,DefaultS2,Well Prime,Long,...,96.0,96.0,1.33,1.33,568320.0,755865.6,92.504,10833.7294,92.504,41054.5854
1,01/08/23,04/03/20,,Garfield,Garfield,Default,Asset,DefaultS2,Well Prime,Long,...,96.0,96.0,1.33,1.33,84.48,112.3584,0.0138,1.6104,0.0138,6.1027



Trades Data Shape: (649, 31)


Unnamed: 0,id,RevisionId,AllocationId,TradeTypeName,SecurityId,SecurityType,Name,Ticker,CUSIP,ISIN,...,AllocationFees,AllocationCash,PortfolioName,CustodianName,StrategyName,Strategy1Name,Strategy2Name,Counterparty,AllocationRule,IsCustomAllocation
0,3489863,2,3460886,Buy,270471,Equity,Berry Brand 4/11 Equity,,,,...,2800.0,7002800.0,HoldCo 1,JP MORGAN SECURITIES LLC,Default,DefaultS1,DefaultS2,ABGS,Single Fund Rule - HoldCo 1,1
1,3489864,1,3460887,Sell,270471,Equity,Berry Brand 4/11 Equity,,,,...,128.8,6999871.2,HoldCo 1,JP MORGAN SECURITIES LLC,Default,DefaultS1,DefaultS2,ABGS,Single Fund Rule - HoldCo 1,0


# Basic Data Validation and Merging

In this section, we perform minimal and essential validation on the datasets to ensure
they are ready for merging. The checks are intentionally kept simple and focused.

Steps performed:
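- Print the column names of both datasets
- Add a common `FundName` column (copied from `PortfolioName`) and derive `Year` from `AsOfDate`
- Assert that `SecurityId` is present in both files
- Merge holdings and trades using `SecurityId` and `FundName` as the join keys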

A left join is used so that all holdings are preserved, even if corresponding trades
are not available for some securities. The result is a single combined DataFrame that
will serve as the knowledge source for the chatbot.
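
To make the join semantics concrete, here is a minimal sketch on toy data (all values are made up, and the `indicator=True` option is added only for illustration). It shows that a left join keeps holdings with no matching trade, and that a holding matched by several trades is repeated once per trade, so the merged frame can have more rows than the holdings frame.

```python
import pandas as pd

# Toy frames; all values are illustrative, not taken from the task files.
holdings = pd.DataFrame({
    "SecurityId": [1, 2],
    "FundName": ["A", "A"],
    "Price": [10.0, 20.0],
})
trades = pd.DataFrame({
    "SecurityId": [1, 1],
    "FundName": ["A", "A"],
    "Price": [9.5, 10.5],
})

merged = holdings.merge(
    trades,
    on=["SecurityId", "FundName"],
    how="left",
    suffixes=("_h", "_t"),  # overlapping columns become Price_h / Price_t
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

print(merged.shape)
print(merged["_merge"].astype(str).tolist())
```

Security 1 appears twice (one row per matching trade) and security 2 appears once with empty trade columns. Note that the overlapping `Price` column is split into `Price_h` and `Price_t` by the suffixes.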


In [35]:
# -----------------------------
# Basic sanity checks
# -----------------------------
print("Holdings Columns:")
print(holdings_df.columns.tolist())

print("\nTrades Columns:")
print(trades_df.columns.tolist())

# Standardize column names for fund and date
holdings_df['FundName'] = holdings_df['PortfolioName']
trades_df['FundName'] = trades_df['PortfolioName']

# Extract Year from AsOfDate in holdings
holdings_df['Year'] = pd.to_datetime(holdings_df['AsOfDate'], format='%d/%m/%y', errors='coerce').dt.year

# Ensure the join key exists
assert "SecurityId" in holdings_df.columns, "SecurityId not found in holdings data"
assert "SecurityId" in trades_df.columns, "SecurityId not found in trades data"

# -----------------------------
# Merge the datasets
# -----------------------------
# We merge on both SecurityId and FundName to ensure trades are matched to the correct fund context
merged_df = holdings_df.merge(
    trades_df,
    on=["SecurityId", "FundName"],
    how="left",
    suffixes=('_h', '_t')
)

# -----------------------------
# Inspect merged data
# -----------------------------
print("\nMerged Data Shape:", merged_df.shape)
display(merged_df.head(head_count))

Holdings Columns:
['AsOfDate', 'OpenDate', 'CloseDate', 'ShortName', 'PortfolioName', 'StrategyRefShortName', 'Strategy1RefShortName', 'Strategy2RefShortName', 'CustodianName', 'DirectionName', 'SecurityId', 'SecurityTypeName', 'SecName', 'StartQty', 'Qty', 'StartPrice', 'Price', 'StartFXRate', 'FXRate', 'MV_Local', 'MV_Base', 'PL_DTD', 'PL_QTD', 'PL_MTD', 'PL_YTD']

Trades Columns:
['id', 'RevisionId', 'AllocationId', 'TradeTypeName', 'SecurityId', 'SecurityType', 'Name', 'Ticker', 'CUSIP', 'ISIN', 'TradeDate', 'SettleDate', 'Quantity', 'Price', 'TradeFXRate', 'Principal', 'Interest', 'TotalCash', 'AllocationQTY', 'AllocationPrincipal', 'AllocationInterest', 'AllocationFees', 'AllocationCash', 'PortfolioName', 'CustodianName', 'StrategyName', 'Strategy1Name', 'Strategy2Name', 'Counterparty', 'AllocationRule', 'IsCustomAllocation']

Merged Data Shape: (1023, 57)


Unnamed: 0,AsOfDate,OpenDate,CloseDate,ShortName,PortfolioName_h,StrategyRefShortName,Strategy1RefShortName,Strategy2RefShortName,CustodianName_h,DirectionName,...,AllocationFees,AllocationCash,PortfolioName_t,CustodianName_t,StrategyName,Strategy1Name,Strategy2Name,Counterparty,AllocationRule,IsCustomAllocation
0,01/08/23,04/03/20,,Garfield,Garfield,Default,Asset,DefaultS2,Well Prime,Long,...,,,,,,,,,,
1,01/08/23,04/03/20,,Garfield,Garfield,Default,Asset,DefaultS2,Well Prime,Long,...,,,,,,,,,,


# Converting Each Record into a High-Quality Textual Description

In this section, we convert each row of the merged dataset into a concise,
human-readable paragraph that captures the most important financial information.

We generate structured natural-language descriptions that include:
- Fund and portfolio information
- Security details
- Position and valuation data
- Profit and loss metrics
- Trade-related information (when available)

Embedding natural-language sentences instead of raw column values is intended to
improve semantic retrieval and make the chatbot's responses easier to interpret.
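
The idea can be sketched on a plain dictionary, independent of pandas. The helper names `is_present` and `describe` are hypothetical, and the field list is a small subset chosen for illustration. The sketch checks for presence rather than truthiness, so a legitimate value of 0 is still rendered.

```python
import math

def is_present(value) -> bool:
    """True unless the value is None or a float NaN."""
    if value is None:
        return False
    return not (isinstance(value, float) and math.isnan(value))

def describe(record: dict) -> str:
    """Render only the populated fields of a record as short sentences."""
    templates = [
        ("FundName", "Fund name is {}."),
        ("SecName", "The security is {}."),
        ("MV_Base", "The market value in base currency is {:,.2f}."),
        ("Counterparty", "The trade counterparty is {}."),
    ]
    return "\n".join(
        template.format(record[key])
        for key, template in templates
        if is_present(record.get(key))
    )

example = {
    "FundName": "Garfield",
    "SecName": "EJ0445951",
    "MV_Base": 755865.6,
    "Counterparty": float("nan"),  # missing trade data is skipped
}
print(describe(example))
```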

In [36]:
def safe_get(row, col):
    return row[col] if col in row and pd.notna(row[col]) else None


def row_to_text(row: pd.Series) -> str:
    """
    Convert a single merged dataframe row into a concise, human-readable
    financial description suitable for semantic search.
    """

    lines = []

    # ---- Fund / Portfolio ----
    fund_name = safe_get(row, "FundName")
    short_name = safe_get(row, "ShortName")
    year = safe_get(row, "Year")

    if fund_name:
        lines.append(f"Fund name is {fund_name}.")
    if short_name and short_name != fund_name:
        lines.append(f"The fund short name is {short_name}.")
    if year:
        lines.append(f"In the year {int(year)}.")

    # ---- Security Details ----
    security_id = safe_get(row, "SecurityId")
    security_name = safe_get(row, "SecName") or safe_get(row, "Name")
    security_type = safe_get(row, "SecurityTypeName")

    if security_id or security_name:
        sec_line = "The security"
        if security_name:
            sec_line += f" {security_name}"
        if security_type:
            sec_line += f" is a {security_type}"
        if security_id:
            sec_line += f" with SecurityId {int(security_id)}"
        sec_line += "."
        lines.append(sec_line)

    # ---- Position Information ----
    qty = safe_get(row, "Qty")
    direction = safe_get(row, "DirectionName")

    if qty:
        pos_line = f"The position quantity is {int(qty)}"
        if direction:
            pos_line += f" with a {direction.lower()} position"
        pos_line += "."
        lines.append(pos_line)

    # ---- Valuation ----
    mv_base = safe_get(row, "MV_Base")
    mv_local = safe_get(row, "MV_Local")

    if mv_base:
        lines.append(f"The market value in base currency is {mv_base:,.2f}.")
    if mv_local:
        lines.append(f"The market value in local currency is {mv_local:,.2f}.")

    # ---- Profit & Loss ----
    pl_ytd = safe_get(row, "PL_YTD")
    pl_qtd = safe_get(row, "PL_QTD")

    if pl_ytd is not None:
        lines.append(f"The year-to-date profit and loss is {pl_ytd:,.2f}.")
    if pl_qtd is not None:
        lines.append(f"The quarter-to-date profit and loss is {pl_qtd:,.2f}.")

    # ---- Trade Information ----
    trade_type = safe_get(row, "TradeTypeName")
    trade_qty = safe_get(row, "Quantity")
    trade_price = safe_get(row, "Price_t")  # trade-side Price; the merge uses suffixes ('_h', '_t')
    counterparty = safe_get(row, "Counterparty")

    if trade_type or trade_qty:
        trade_line = "A trade is recorded"
        if trade_type:
            trade_line += f" with trade type {trade_type}"
        if trade_qty:
            trade_line += f" for quantity {int(trade_qty)}"
        if trade_price:
            trade_line += f" at price {trade_price}"
        trade_line += "."
        lines.append(trade_line)

    if counterparty:
        lines.append(f"The trade counterparty is {counterparty}.")

    return "\n".join(lines)


# -----------------------------
# Create high-quality text documents
# -----------------------------
merged_df["document_text"] = merged_df.apply(row_to_text, axis=1)

# -----------------------------
# Inspect generated text
# -----------------------------
display(merged_df[["document_text"]].head(3))

Unnamed: 0,document_text
0,Fund name is Garfield.\nIn the year 2023.\nThe...
1,Fund name is Garfield.\nIn the year 2023.\nThe...
2,Fund name is Garfield.\nIn the year 2023.\nThe...


# Creating LangChain Documents with Metadata

In this section, we convert the generated text data into LangChain `Document` objects.
Each document contains:

- `page_content`: the textual representation of a row
- `metadata`: useful structured fields that help with traceability and debugging

Storing metadata alongside the text allows us to:
- Understand where an answer came from
- Filter or analyze retrieved documents if needed
- Keep the system explainable and interview-friendly

Each row from the merged dataset is converted into a single document.
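
As an illustration of why metadata is useful, the sketch below filters documents by metadata fields. `Doc` is a hypothetical stand-in that only mirrors the two attribute names of a LangChain `Document`, and the sample contents are made up.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in with the same two attribute names as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def filter_docs(docs, **criteria):
    """Keep documents whose metadata matches every key/value pair in criteria."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]

docs = [
    Doc("Fund name is Garfield.", {"type": "raw", "fund": "Garfield", "year": 2023}),
    Doc("Fund Garfield has a total of 3 holdings.", {"type": "summary", "fund": "Garfield"}),
    Doc("Fund name is MNC Investment Fund.", {"type": "raw", "fund": "MNC Investment Fund", "year": 2023}),
]

garfield_raw = filter_docs(docs, type="raw", fund="Garfield")
print([d.page_content for d in garfield_raw])
```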

In [37]:
# -----------------------------
# Create LangChain Document objects
# -----------------------------
documents = []

for _, row in merged_df.iterrows():
    documents.append(
        Document(
            page_content=row["document_text"],
            metadata={
                "type": "raw",
                "SecurityId": int(row["SecurityId"]) if pd.notna(row.get("SecurityId")) else None,
                "fund": row.get("FundName"),
                "security_type": row.get("SecurityTypeName"),
                "year": int(row["Year"]) if pd.notna(row.get("Year")) else None
            }
        )
    )

# -----------------------------
# Inspect a sample document
# -----------------------------

print("Total raw documents created:", len(documents))
print("\nSample raw document:\n")
print(documents[0])

Total raw documents created: 1023

Sample raw document:

page_content='Fund name is Garfield.
In the year 2023.
The security EJ0445951 is a Bond with SecurityId 273098.
The position quantity is 592000 with a long position.
The market value in base currency is 755,865.60.
The market value in local currency is 568,320.00.
The year-to-date profit and loss is 41,054.59.
The quarter-to-date profit and loss is 10,833.73.' metadata={'type': 'raw', 'SecurityId': 273098, 'fund': 'Garfield', 'security_type': 'Bond', 'year': 2023}


# Creating Vector Store with Raw and Summary Documents

In this section, we build the complete vector database used by the chatbot.

The vector store contains two types of documents:
- Raw documents derived directly from individual data records
- Summary documents containing pre-computed aggregated information

Both document types are embedded and stored together in a single Chroma collection.
This allows semantic retrieval to surface either detailed records or aggregated facts
depending on the user query, while keeping the system simple and explainable.
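
The zero-count handling used for the summary documents can be sketched with the standard library alone. The fund names and counts below are illustrative only.

```python
from collections import Counter

# Toy records; fund names and counts are illustrative only.
all_funds = ["Alpha", "Beta", "Gamma"]
trade_rows = [{"fund": "Alpha"}, {"fund": "Alpha"}, {"fund": "Gamma"}]

trade_counts = Counter(row["fund"] for row in trade_rows)

# Iterate over every known fund, not just those present in trade_rows,
# so funds without trades still get an explicit zero-count sentence.
summaries = [
    f"Fund {fund} has executed a total of {trade_counts.get(fund, 0)} trades."
    for fund in all_funds
]
print("\n".join(summaries))
```

Writing an explicit "0 trades" sentence gives the retriever something to match for funds with no activity, instead of leaving the question unanswerable.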

In [38]:
# -----------------------------
# Initialize embedding model
# -----------------------------
embeddings = OpenAIEmbeddings()

# -----------------------------
# Initialize Chroma vector store
# -----------------------------
vector_store = Chroma(
    embedding_function=embeddings,
    collection_name="fund_holdings_trades_v4"
)

# -----------------------------
# Create summary documents with Zero-Count handling
# -----------------------------
summary_documents = []
all_funds = sorted(set(merged_df['FundName'].unique()))

# 1. Total holdings per fund (all funds)
holdings_counts = merged_df.groupby('FundName')['SecurityId'].nunique().to_dict()
for fund in all_funds:
    count = holdings_counts.get(fund, 0)
    summary_documents.append(
        Document(
            page_content=f"Fund {fund} has a total of {int(count)} holdings.",
            metadata={"type": "summary", "fund": fund, "metric": "total_holdings"}
        )
    )

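# 2. Total trades per fund (all funds)
# Note: this counts merged rows that carry trade data. Because the merge is a left
# join from holdings, trades for securities absent from holdings are not counted,
# and a trade is counted once per matching holdings row.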
trades_counts = merged_df[merged_df['TradeTypeName'].notna()].groupby('FundName').size().to_dict()
for fund in all_funds:
    count = trades_counts.get(fund, 0)
    summary_documents.append(
        Document(
            page_content=f"Fund {fund} has executed a total of {int(count)} trades.",
            metadata={"type": "summary", "fund": fund, "metric": "total_trades"}
        )
    )

# 3. Yearly profit and loss per fund & performance comparison
if 'Year' in merged_df.columns and 'PL_YTD' in merged_df.columns:
    pnl_summary = merged_df.groupby(['FundName', 'Year'])['PL_YTD'].sum().reset_index()
    for _, row in pnl_summary.iterrows():
        summary_documents.append(
            Document(
                page_content=f"In the year {int(row['Year'])}, fund {row['FundName']} had a total profit and loss of {row['PL_YTD']:,.2f}.",
                metadata={"type": "summary", "fund": row['FundName'], "year": int(row['Year']), "metric": "pnl"}
            )
        )
    
    # Yearly comparison summary
    for year in pnl_summary['Year'].unique():
        year_data = pnl_summary[pnl_summary['Year'] == year].sort_values('PL_YTD', ascending=False)
        rank_str = ", ".join([f"{r['FundName']} ({r['PL_YTD']:,.2f})" for _, r in year_data.iterrows()])
        summary_documents.append(
            Document(
                page_content=f"In the year {int(year)}, the performance of funds by total Profit and Loss was: {rank_str}.",
                metadata={"type": "summary", "year": int(year), "metric": "performance_comparison"}
            )
        )

# -----------------------------
# Combine raw and summary documents
# -----------------------------
all_documents = documents + summary_documents

print("Raw documents:", len(documents))
print("Summary documents:", len(summary_documents))
print("Total documents:", len(all_documents))

batch_size = 50
for i in range(0, len(all_documents), batch_size):
    vector_store.add_documents(all_documents[i:i + batch_size])

print("Vector store created successfully")
print("Documents stored:", vector_store._collection.count())

Raw documents: 1023
Summary documents: 58
Total documents: 1081
Vector store created successfully
Documents stored: 1081


# Retriever Setup Using Similarity Search

In this section, we configure a retriever on top of the vector store.
The retriever performs semantic similarity search to identify the most
relevant documents for a given user query.

A small value of k (here `k=5`) is used so that only the most relevant documents
are retrieved and passed to the language model.
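
Conceptually, top-k similarity search ranks stored vectors against the query vector and keeps the best k. The sketch below uses cosine similarity on toy 3-dimensional vectors purely for illustration; the real embeddings come from `OpenAIEmbeddings`, and the distance metric actually applied depends on how the Chroma collection is configured.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, indexed, k):
    """Return the ids of the k stored vectors most similar to query_vec."""
    scored = sorted(
        indexed.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]

# Toy index; ids and vectors are made up.
index = {
    "holdings_summary": [0.9, 0.1, 0.0],
    "trade_record": [0.1, 0.9, 0.0],
    "pnl_summary": [0.6, 0.0, 0.8],
}
print(top_k([1.0, 0.0, 0.0], index, k=2))
```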

In [39]:
# -----------------------------
# Create retriever from vector store
# -----------------------------
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

# -----------------------------
# Safe retriever test function
# -----------------------------
def retrieve_documents(query: str):
    """
    Retrieve relevant documents for a given query using semantic similarity search.
    """
    return retriever.invoke(query)


# -----------------------------
# Light sanity test
# -----------------------------
test_query = "Total number of holdings for a fund?"
docs = retrieve_documents(test_query)

print("Retriever working correctly")
print("Number of documents retrieved:", len(docs))

Retriever working correctly
Number of documents retrieved: 5


# Guardrails and Fallback Logic

In this section, we implement guardrails to ensure that the chatbot strictly answers
questions using only the information retrieved from the vector database.

The guardrails enforce the following rules:
- If no relevant documents are retrieved, the chatbot must not attempt to answer
- If the retrieved context is empty or insufficient, the chatbot must return a fallback response

The fallback response is fixed and must be returned exactly as:
"Sorry can not find the answer"

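These checks are meant to keep the chatbot from hallucinating or drawing on external knowledge. They reduce that risk but cannot eliminate it, because the final answer is still generated by the model.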
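
The gating decision can be sketched without any external dependencies. The function name `gate`, the shortened keyword list, and the dictionary-based documents are assumptions made for illustration. Keyword matching is by substring, so a word such as "account" would also match "count".

```python
AGG_WORDS = ("total", "number of", "count")

def gate(query, docs, known_funds):
    """Return the context string to pass to the model, or None to signal fallback."""
    q = query.lower()
    wanted_type = "summary" if any(word in q for word in AGG_WORDS) else "raw"
    # Longest names first, so a longer fund name wins over a shorter one it contains.
    fund = next(
        (f for f in sorted(known_funds, key=len, reverse=True) if f.lower() in q),
        None,
    )
    kept = [
        d["text"]
        for d in docs
        if d["type"] == wanted_type and (fund is None or d.get("fund") in (None, fund))
    ]
    return "\n\n".join(kept) or None

# Toy documents; contents are made up.
docs = [
    {"text": "Fund Alpha has a total of 2 holdings.", "type": "summary", "fund": "Alpha"},
    {"text": "Fund Beta has a total of 5 holdings.", "type": "summary", "fund": "Beta"},
    {"text": "Fund name is Alpha.", "type": "raw", "fund": "Alpha"},
]
funds = ["Alpha", "Beta"]

print(gate("Total number of holdings for Alpha", docs, funds))
print(gate("Which securities does Beta hold?", docs, funds))
```

The first query keeps only the Alpha summary. The second finds no raw document for Beta, so the caller would return the fixed fallback response.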

In [40]:
from typing import Optional

FALLBACK_RESPONSE = "Sorry can not find the answer"
AGGREGATION_KEYWORDS = [
    "total", "number of", "count", "sum", "highest", "lowest",
    "better", "best", "compare", "performance", "rank",
]

def is_aggregation_query(query: str) -> bool:
    """Return True if the query contains any aggregation keyword (substring match)."""
    return any(kw in query.lower() for kw in AGGREGATION_KEYWORDS)

def extract_fund_name(query: str) -> Optional[str]:
    """Simple heuristic: return the first known fund name found in the query, else None."""
    # Longest names are checked first so a longer fund name wins over a shorter one it contains
    known_funds = sorted(set(merged_df['FundName'].unique()), key=len, reverse=True)
    q = query.lower()
    for fund in known_funds:
        if fund.lower() in q:
            return fund
    return None

def apply_guardrails(query: str, retrieved_docs: list) -> dict:
    """
    Filter retrieved documents down to the context the model may use.
    Returns {"allow_answer": bool, "context": str}.
    """
    if not retrieved_docs:
        return {"allow_answer": False, "context": ""}

    is_agg = is_aggregation_query(query)
    target_fund = extract_fund_name(query)
    # Aggregation queries use summary documents; all other queries use raw records
    wanted_type = "summary" if is_agg else "raw"
    relevant_context = []

    for doc in retrieved_docs:
        # Metadata matching for fund context isolation
        doc_fund = doc.metadata.get("fund")
        if target_fund and doc_fund and target_fund.lower() != doc_fund.lower():
            continue  # Skip documents that don't match the fund in the query

        if doc.metadata.get("type") == wanted_type:
            relevant_context.append(doc.page_content)

    full_context = "\n\n".join(relevant_context).strip()
    if not full_context:
        return {"allow_answer": False, "context": ""}
    return {"allow_answer": True, "context": full_context}

# LLM Setup and Prompt Configuration

In this section, we configure the language model that will generate answers for the chatbot.

Key principles enforced through the prompt:
- The model must answer strictly using the provided context
- The model must not use external or general knowledge
- If the answer is not present in the context, it must return the fixed fallback response

The model configured in the code below is `gpt-4` (via `ChatOpenAI` with `temperature=0`), and it is
only invoked after guardrails confirm that sufficient context is available.

In [41]:
llm = ChatOpenAI(model="gpt-4", temperature=0)

SYSTEM_PROMPT = """
You are a financial data assistant.
Rules you must follow strictly:
1. Answer ONLY using the information provided in the context.
2. If the context contains a summary about a specific fund, ensure you are answering about THAT fund.
3. If the answer cannot be found in the provided context, respond exactly with: "Sorry can not find the answer".
4. Do NOT use external knowledge.
"""

def generate_answer(query: str, context: str) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion:\n{query}"}
    ]
    response = llm.invoke(messages)
    return response.content.strip()

In [42]:
# -----------------------------
# LLM test with retrieved context
# -----------------------------
# test_query = "Wanna go on a date tonight??"
test_query = "Which securities are associated with MNC Investment Fund?"
docs = retrieve_documents(test_query)
for doc in docs:
    print(doc)
guardrail_result = apply_guardrails(test_query, docs)
print(guardrail_result, "\n\n\n")

if guardrail_result["allow_answer"]:
    answer = generate_answer(test_query, guardrail_result["context"])
else:
    answer = FALLBACK_RESPONSE

print("Answer:")
print(answer)

page_content='Fund name is MNC Investment Fund.
The fund short name is MNC Inv.
In the year 2023.
The security 869435206 is a Preferred with SecurityId 288818.
The position quantity is 20 with a long position.
The market value in base currency is 200.00.
The market value in local currency is 200.00.
The year-to-date profit and loss is 0.00.
The quarter-to-date profit and loss is 0.00.' metadata={'SecurityId': 288818, 'year': 2023, 'type': 'raw', 'security_type': 'Preferred', 'fund': 'MNC Investment Fund'}
page_content='Fund name is MNC Investment Fund.
The fund short name is MNC Inv.
In the year 2023.
The security 92257E403 is a Preferred with SecurityId 288827.
The position quantity is 30 with a long position.
The market value in base currency is 330.00.
The market value in local currency is 330.00.
The year-to-date profit and loss is 30.00.
The quarter-to-date profit and loss is 0.00.' metadata={'type': 'raw', 'security_type': 'Preferred', 'fund': 'MNC Investment Fund', 'year': 2023,

# Evaluation Metrics

In this section, we evaluate the chatbot using a small set of predefined questions.
The evaluation checks only whether the chatbot answered or fell back as expected
for each question; it does not verify the factual content of the answers or their language fluency.

Each question is expected to either:
- Produce a valid answer from the dataset, or
- Correctly return the fallback response when information is unavailable
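
The scoring logic can be sketched as follows. The notebook compares the answer to the fallback string by exact equality; the normalization in `is_fallback` (trimming quotes, whitespace, and a trailing period) is an assumption added here to show one way of making that comparison less brittle. The sample answers are made up.

```python
FALLBACK = "Sorry can not find the answer"

def is_fallback(answer: str) -> bool:
    """Compare after trimming whitespace, quotes, and a trailing period, ignoring case."""
    cleaned = answer.strip().strip("\"'").rstrip(".").strip()
    return cleaned.lower() == FALLBACK.lower()

def accuracy(cases) -> float:
    """cases: iterable of (answer, expect_fallback) pairs. Returns percent correct."""
    cases = list(cases)
    hits = sum(is_fallback(answer) == expected for answer, expected in cases)
    return 100.0 * hits / len(cases)

cases = [
    ('"Sorry can not find the answer."', True),        # quoted, with period: still a fallback
    ("Fund Alpha has a total of 2 holdings.", False),  # answered as expected
    ("Sorry can not find the answer", False),          # unexpected fallback: incorrect
]
print(f"{accuracy(cases):.2f}%")
```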

In [43]:
# -----------------------------
# Evaluation dataset
# -----------------------------
evaluation_questions = [
    {
        "question": "Which funds performed better depending on the yearly Profit and Loss of that fund.",
        "expect_fallback": False
    },
    {
        "question": "Total number of holdings for Garfield fund.",
        "expect_fallback": False
    },
    {
        "question": "Total number of trades for Garfield fund.",
        "expect_fallback": True
    },
    {
        "question": "What is the weather in London?",
        "expect_fallback": True
    },
    {
        "question": "Which fund had the highest profit and loss in 2023?",
        "expect_fallback": False
    },
    {
        "question": "How many unique securities are held by UNC Investment Fund?",
        "expect_fallback": False
    },
    {
        "question": "What is the total number of trades executed by NPSMF1?",
        "expect_fallback": False
    },
    {
        "question": "List all trade types available for NPSMF2.",
        "expect_fallback": False
    },
    {
        "question": "What is the year-to-date profit and loss for security 273098 in the Garfield fund?",
        "expect_fallback": False
    },
    {
        "question": "Who won the FIFA World Cup in 2022?",
        "expect_fallback": True
    },
    {
        "question": "Which fund had the lowest performance in 2023?",
        "expect_fallback": False
    }
]

# -----------------------------
# Evaluation loop
# -----------------------------
results = []

for item in evaluation_questions:
    q = item["question"]
    expected_fallback = item["expect_fallback"]

    docs = retrieve_documents(q)
    guardrail = apply_guardrails(q, docs)

    if guardrail["allow_answer"]:
        answer = generate_answer(q, guardrail["context"])
    else:
        answer = FALLBACK_RESPONSE

    is_fallback = answer == FALLBACK_RESPONSE
    is_correct = is_fallback == expected_fallback

    results.append({
        "question": q,
        "answer": answer,
        "expected_fallback": expected_fallback,
        "actual_fallback": is_fallback,
        "correct": is_correct
    })

# -----------------------------
# Evaluation summary
# -----------------------------
results_df = pd.DataFrame(results)

print("Evaluation Results:")
display(results_df[["question", "correct"]])

accuracy = results_df["correct"].mean() * 100
print(f"\nEvaluation Accuracy: {accuracy:.2f}%")

Evaluation Results:


Unnamed: 0,question,correct
0,Which funds performed better depending on the ...,True
1,Total number of holdings for Garfield fund.,True
2,Total number of trades for Garfield fund.,True
3,What is the weather in London?,True
4,Which fund had the highest profit and loss in ...,True
5,How many unique securities are held by UNC Inv...,False
6,What is the total number of trades executed by...,True
7,List all trade types available for NPSMF2.,False
8,What is the year-to-date profit and loss for s...,True
9,Who won the FIFA World Cup in 2022?,True
10,Which fund had the lowest performance in 2023?,True



Evaluation Accuracy: 81.82%


# Memory-Based Conversational Chat

In this section, we enable conversational memory so the chatbot can maintain
context across multiple user interactions.

Memory allows the chatbot to:
- Answer follow-up questions
- Refer to previously mentioned entities
- Maintain a coherent conversation flow

Memory is applied only to the interactive chat interface and is intentionally
excluded from evaluation to ensure fair and reproducible metrics.
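
The history string that is passed to the model can be sketched as below. `ConversationBufferMemory` keeps every message, so the `max_turns` cap shown here is an assumption added for illustration, as is the helper name `format_history`; the placeholder answers are not real outputs.

```python
def format_history(messages, max_turns=3):
    """Render the last max_turns (user, assistant) pairs as 'role: text' lines."""
    if max_turns <= 0:
        return ""
    recent = messages[-2 * max_turns:]
    return "\n".join(f"{role}: {text}" for role, text in recent)

history = [
    ("human", "Total number of holdings for Garfield fund."),
    ("ai", "(answer 1)"),
    ("human", "Total number of trades for that fund."),
    ("ai", "(answer 2)"),
]
print(format_history(history, max_turns=1))
```

Capping the history keeps the condensation prompt short in long conversations, at the cost of losing references to entities mentioned earlier than the window.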

In [44]:


# -----------------------------
# Initialize conversation memory
# -----------------------------
chat_memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

# -----------------------------
# Query Condensation Logic
# -----------------------------
CONDENSE_PROMPT = """
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:
"""

def condense_query(query: str, history: str) -> str:
    """
    Uses the LLM to resolve pronouns and context from history into a standalone query.
    """
    if not history.strip():
        return query
    
    messages = [
        {"role": "system", "content": "You are a helpful assistant that rephrases questions to be standalone based on context."},
        {"role": "user", "content": CONDENSE_PROMPT.format(chat_history=history, question=query)}
    ]
    
    response = llm.invoke(messages)
    return response.content.strip()

# -----------------------------
# Chat function with memory
# -----------------------------
def chat_with_memory(query: str) -> tuple[str, str]:
    """
    End-to-end chatbot function with conversational memory and query condensation.
    Returns a tuple of (answer, standalone_query).
    """

    # Extract memory history as string
    memory_context = "\n".join(
        [f"{msg.type}: {msg.content}" for msg in chat_memory.chat_memory.messages]
    )

    # 1. Condense Query for Retrieval
    standalone_query = condense_query(query, memory_context).replace("*", "")
    # print(f"Condensed query: {standalone_query}") # Debug

    # 2. Retrieve documents using condensed query
    retrieved_docs = retrieve_documents(standalone_query)

    # 3. Apply guardrails
    guardrail = apply_guardrails(standalone_query, retrieved_docs)

    if not guardrail["allow_answer"]:
        # If standalone fails, try original as fallback (some queries might not need context)
        retrieved_docs_orig = retrieve_documents(query)
        guardrail = apply_guardrails(query, retrieved_docs_orig)
        
        if not guardrail["allow_answer"]:
            return FALLBACK_RESPONSE, standalone_query

    # 4. Generate answer using original query + history + retrieved context
    full_context = (
        "Chat History:\n" + memory_context + "\n\nRetrieved context:\n" + guardrail["context"]
        if memory_context else guardrail["context"]
    )

    answer = generate_answer(query, full_context)

    # 5. Update memory with original query and answer
    chat_memory.chat_memory.add_user_message(query)
    chat_memory.chat_memory.add_ai_message(answer)

    return answer, standalone_query


# Final Chat Interface

This section defines `chat`, a single function that represents the complete chatbot flow without conversational memory.
It integrates retrieval, guardrails, and the language model into one clean interface.

This function can later be exposed as an API or UI without any changes to core logic.

In [45]:
def chat(query: str) -> str:
    """
    End-to-end chatbot function.
    """

    retrieved_docs = retrieve_documents(query)
    guardrail = apply_guardrails(query, retrieved_docs)

    if not guardrail["allow_answer"]:
        return FALLBACK_RESPONSE

    return generate_answer(query, guardrail["context"])


### Predefined questions the bot can answer; the answers are saved to a .txt file

In [46]:
some_questions = [
    "Which funds performed better depending on the yearly Profit and Loss of that fund.",
    "Total number of holdings for Garfield fund.",
    "Total number of trades for that fund.", # this question shows memory trails
    "What is the weather in London?",
    "Which fund had the highest profit and loss in 2023?",
    "How many unique securities are held by UNC Investment Fund?",
    "What is the total number of trades executed by NPSMF1?",
    "List all trade types available for NPSMF2.",
    "What is the year-to-date profit and loss for security 273098 in the Garfield fund?",
    "Who won the FIFA World Cup in 2022?",
    "Which fund had the lowest performance in 2023?"
]
string = ""
for user_input in some_questions:
    string += f"Question: {user_input}"
    answer, standalone_query = chat_with_memory(user_input)
    string += f"\nLLM tweaked Question: {standalone_query}"
    string += f"\nAnswer: {answer}"
    string += "\n\n==================================\n\n"

with open("trade_bot.txt", "w") as f:
    f.write(string)

### Interactive chat interface for the trade bot (driven by user input)

In [47]:
# while True:
#     user_input = input("\nAsk a question (or type 'exit'): ")
#     if user_input.lower() == 'exit':
#         break
#     answer, _ = chat_with_memory(user_input)  # returns (answer, standalone_query)
#     print(answer)