<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/main/notebooks/modelling%20/sk_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# JP Morgan, Bank of America, HSBC, Citigroup

In [2]:
!unzip transcripts.zip

unzip:  cannot find or open transcripts.zip, transcripts.zip.zip or transcripts.zip.ZIP.
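
The `unzip` call above fails when the archive has not been uploaded to the session. A minimal sketch of a guarded extraction using the standard library follows; the helper name `extract_if_present` is hypothetical, and extracting into the current directory is an assumption that mirrors what `!unzip transcripts.zip` would do.

```python
import os
import zipfile


def extract_if_present(archive_path: str, target_dir: str) -> bool:
    """Extract archive_path into target_dir; return False if the archive is missing."""
    if not os.path.isfile(archive_path):
        print(f"Archive not found: {archive_path}")
        return False
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
    return True


# Assumption: the archive sits in the working directory, as in the cell above.
extract_if_present("transcripts.zip", ".")
```

The return value lets later cells skip PDF loading when no transcripts are available.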


**Install packages**

- `langchain`: The core framework to build the RAG pipeline. It manages the interactions between Phi (the language model) and retrieval systems, handling tasks like prompt management, memory, chaining, and tool integration.

- `chromadb`: A vector database for storing and retrieving embeddings (vector representations of text). This enables the retrieval component in RAG, where relevant documents or knowledge snippets are stored and retrieved based on similarity to the user query.

- `pypdf`: A library to parse and extract text from PDF files. Useful for ingesting content into the vector database, allowing the RAG system to process knowledge stored in PDF documents.

- `openai`: The OpenAI API client library, which manages API calls to OpenAI’s servers. It is only needed if models hosted by OpenAI are used; Phi (a Microsoft model) and SmolLM are loaded from the Hugging Face Hub and do not require it.

- `sentence-transformers`: A library to generate embeddings for text. Often used in the retrieval step to convert queries and documents into vectors, enabling similarity searches in chromadb.

- `accelerate`: Handles device placement and efficient hardware utilization for Hugging Face models. It is needed for loading options such as `device_map="auto"`, which is especially useful when running models like Phi locally or on specific hardware.

- `langchain-community`: Adds community-contributed tools, integrations, and utilities to LangChain, which can include additional retrieval or model support.
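
Because several of these packages change their APIs between releases, it can help to record which versions are installed. The following is a small sketch using only the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = [
    "langchain",
    "chromadb",
    "pypdf",
    "openai",
    "sentence-transformers",
    "accelerate",
    "langchain-community",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if it is absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```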

In [3]:
!pip install langchain chromadb pypdf openai sentence-transformers accelerate langchain-community

Collecting chromadb
  Downloading chromadb-0.6.3-py3-none-any.whl.metadata (6.8 kB)
Collecting pypdf
  Downloading pypdf-5.3.0-py3-none-any.whl.metadata (7.2 kB)
Collecting langchain-community
  Downloading langchain_community-0.3.18-py3-none-any.whl.metadata (2.4 kB)
Collecting build>=1.0.3 (from chromadb)
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Downloading chroma_hnswlib-0.7.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (252 bytes)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.115.8-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Downloading uvicorn-0.34.0-py3-none-any.whl.metadata (6.5 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.16.0-py2.py3-none-any.whl.metadata (2.9 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.20.1-cp311-cp311-manylinux_2_27_x8

# Phi

This section can be skipped: the RAG pipeline below uses SmolLM instead of Phi.

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain_community.llms import HuggingFacePipeline

In [3]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=300)
llm = HuggingFacePipeline(pipeline=pipe)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


Device set to use cuda:0


# RAG

In [7]:
import os

import torch
from google.colab import drive
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory, ConversationBufferWindowMemory
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain_community.vectorstores import Chroma

In [8]:
# -------------------------------
# 1. Mount Google Drive and define folder paths
# -------------------------------
drive.mount('/content/drive', force_remount=True)

# Assuming 'BOE' folder is in 'MyDrive' and already shared
BOE_path = '/content/drive/MyDrive/BOE/bank_of_england/data'

# Now you (and others with access) can work with files in this directory
# For example, you can list the contents:
print(os.listdir(BOE_path))

ValueError: mount failed
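
The mount above fails when the session cannot authenticate with Google Drive. One way to keep later cells working is to fall back to a local folder. The sketch below assumes a hypothetical helper `first_existing_dir`; the two candidate paths are the ones that appear elsewhere in this notebook.

```python
import os
from typing import Iterable, Optional


def first_existing_dir(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate path that is an existing directory, else None."""
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None


data_dir = first_existing_dir([
    "/content/drive/MyDrive/BOE/bank_of_england/data",  # Drive path from the cell above
    "/content/transcripts",                             # local folder used by the PDF loader
])

if data_dir is None:
    print("No data directory found; mount Drive or upload the transcripts first.")
else:
    print(f"Using data directory: {data_dir}")
```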

In [5]:
# Load Embeddings
embedding_model = "sentence-transformers/all-mpnet-base-v2"
embeddings = HuggingFaceEmbeddings(model_name=embedding_model)

# Load SmolLM Model & Tokenizer
model_name = "HuggingFaceTB/SmolLM-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

# Set Up LLM Pipeline with Better Generation Settings
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=300,
    temperature=0.5,
    top_p=0.8,
    repetition_penalty=1.3,
    do_sample=True
)
llm = HuggingFacePipeline(pipeline=pipe)

# Load PDFs (one Document per page; chunking happens once, in the next step)
folder_path = "/content/transcripts/"
file_names = [f for f in os.listdir(folder_path) if f.endswith(".pdf")]

documents = []
for pdf_name in file_names:
    try:
        loader = PyPDFLoader(os.path.join(folder_path, pdf_name), extract_images=False)
        documents.extend(loader.load())
        print(f"Loaded: {pdf_name}")
    except Exception as e:
        print(f"Error loading {pdf_name}: {e}")

# Chunking Strategy for Better Retrieval
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)

# Store in ChromaDB. With chromadb 0.4+ the index is written to persist_directory
# automatically, so no explicit persist() call is needed.
db = Chroma.from_documents(chunks, embedding=embeddings, persist_directory="test_index")

# Load Vector Database & Retriever
vectordb = Chroma(persist_directory="test_index", embedding_function=embeddings)
retriever = vectordb.as_retriever(search_kwargs={"k": 3})  # Retrieve 3 chunks (down from 5) to keep the prompt small


# Conversation Memory (Stores past user inputs)
#memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
memory = ConversationBufferWindowMemory(k=3, memory_key="chat_history", return_messages=True)

def reset_memory_if_needed():
    """Clears memory after 3 exchanges to prevent excessive token growth."""
    if len(memory.chat_memory.messages) > 6:  # Each exchange has a user+bot message
        print("\n🛑 **Memory Full: Resetting Conversation History...**\n")
        memory.clear()


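# NOTE: this template is not passed to the chain built below, so the chain uses
# LangChain's default prompt. The <|user|>/<|end|>/<|assistant|> tags appear to
# follow the Phi-3 chat format rather than SmolLM's.
system_prompt = """You are analyzing a bank's quarterly earnings call transcript.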
Extract and summarize key financial insights, avoiding unnecessary details.
If the answer isn't found, respond with "I don't know."
Provide sources for your answers at the end.

Your response format:
- **Answer:** [Concise response]
- **Key Insights:** [Bullet-pointed highlights]
- **Sources:** [List of document sources used]

<|user|>
Context:
{context}

Question: {question}<|end|>
<|assistant|>"""

# Clean User Input
def clean_user_input(user_input):
    return user_input.strip().replace("\n", " ").replace("\t", " ")

# Remove Duplicate Chunks
def remove_duplicate_chunks(chunks):
    seen = set()
    unique_chunks = []
    for chunk in chunks:
        chunk_text = chunk.page_content.strip()
        if chunk_text not in seen:
            seen.add(chunk_text)
            unique_chunks.append(chunk)
    return unique_chunks

# Truncate retrieved context to avoid overloading the model.
# TODO: overload still happens in some cases; needs refinement.
def truncate_context(context_list, max_tokens=1000):
    """Trims retrieved context while keeping document metadata, based on token count."""
    truncated_docs = []
    current_tokens = 0

    for doc in context_list:
        doc_tokens = len(tokenizer.encode(doc.page_content))

        if current_tokens + doc_tokens <= max_tokens:
            truncated_docs.append(doc)
            current_tokens += doc_tokens
        else:
            break  # Stop adding when token limit is reached

    return truncated_docs

# Build the conversational retrieval chain
qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm, retriever=retriever, memory=memory, verbose=False
)


Device set to use cuda:0


FileNotFoundError: [Errno 2] No such file or directory: '/content/transcripts/'
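
To make the `chunk_size=1000, chunk_overlap=100` settings above more concrete, here is a simplified sketch of how size and overlap interact. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that class splits on separators such as paragraph breaks and spaces before merging pieces); the function `sliding_chunks` is a hypothetical stand-in that cuts at fixed character offsets.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


chunks_demo = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=2)
print(chunks_demo)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk starts with the last two characters of the previous one, so a sentence that straddles a boundary is still partly visible in both chunks. A larger overlap improves that continuity at the cost of storing more duplicated text.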

In [None]:
def format_response(question, response):
    """Formats the chatbot response to only return the question and answer."""

    # Ensure the response is clean
    response_text = response.strip()

    # If model outputs instructions or extra text, remove it
    unwanted_phrases = [
        "Use the following pieces of context",
        "If you don't know the answer, just say that you don't know",
        "Don't try to make up an answer."
    ]

    for phrase in unwanted_phrases:
        if phrase in response_text:
            response_text = response_text.split(phrase)[-1].strip()  # Keep only the last part

    # Ensure the output is structured as needed
    formatted_output = f"Question: {question}\nHelpful Answer: {response_text}"

    return formatted_output

def trim_final_input(question, context, max_tokens=1024):
    """Ensures final model input does not exceed token limit."""
    system_message = (
        "You are analyzing a bank's quarterly earnings call transcript.\n"
        "Extract and summarize key financial insights, avoiding unnecessary details.\n"
        "If the answer isn't found, respond with \"I don't know.\"\n"
        "Provide sources for your answers at the end."
    )

    # context is a list of Documents; join their text instead of embedding the list's repr
    context_text = "\n\n".join(doc.page_content for doc in context)

    input_text = f"{system_message}\n\nContext:\n{context_text}\n\nQuestion: {question}"

    # With the default right-side truncation, an over-long context can cut off the question
    tokens = tokenizer.encode(input_text, truncation=True, max_length=max_tokens)

    return tokenizer.decode(tokens, skip_special_tokens=True)
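
Both `truncate_context` and `trim_final_input` enforce a token budget. The greedy strategy used by `truncate_context` can be sketched without loading a model. In the example below, `fit_to_budget` is a hypothetical helper and `whitespace_token_count` is a stand-in tokenizer that counts words, which is an assumption made purely for illustration; the real tokenizer will produce different counts.

```python
from typing import Callable, List


def whitespace_token_count(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words (illustration only)."""
    return len(text.split())


def fit_to_budget(
    texts: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[str]:
    """Keep passages in order until adding the next one would exceed max_tokens."""
    kept: List[str] = []
    used = 0
    for text in texts:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break  # Stop at the first passage that does not fit
        kept.append(text)
        used += cost
    return kept


passages = [
    "net interest income rose",                 # 4 words
    "credit costs fell",                        # 3 words
    "capital ratios were stable this quarter",  # 6 words
]
print(fit_to_budget(passages, max_tokens=7))  # ['net interest income rose', 'credit costs fell']
print(fit_to_budget(passages, max_tokens=3))  # []
```

The second call shows a property worth keeping in mind: if the first passage alone exceeds the budget, nothing is kept, which is also how `truncate_context` behaves.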




In [None]:
print("\n💬 **Bank Earnings Chatbot** (Type 'exit' to stop)")

while True:
    # Clean the input first so that " exit " or "exit\t" still ends the session
    user_input = clean_user_input(input("\n🔹 **You:** "))
    if user_input.lower() == "exit":
        print("\n**Exiting Chatbot. Have a great day!**")
        break

    reset_memory_if_needed()

    context = retriever.invoke(user_input)
    context = remove_duplicate_chunks(context)
    context = truncate_context(context, max_tokens=800)

    print("\n **Retrieved Context:**")
    for doc in context:
        print(f"- **Source:** {doc.metadata.get('source', 'Unknown Source')}, Page {doc.metadata.get('page', 'Unknown Page')}")

    formatted_input = trim_final_input(user_input, context, max_tokens=1024)

    response = qa_chain.invoke({"question": formatted_input})

    print(f"\n{format_response(user_input, response['answer'])}")