<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/main/notebooks/modelling/sk_gen_ai_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
"""
===================================================
Author: Sheldon Kemper
Role: Data Engineering Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/sheldon-kemper
Date: 2025-02-04
Version: 1.2

Description:
    This notebook contains a class-based implementation of a Retrieval Augmented Generation (RAG) engine
    designed to analyze bank quarterly earnings call transcripts (in PDF format) stored on Google Drive.
    The code performs the following tasks:

    1. Configures an LLM pipeline using a Flan-T5-based model for text summarization.
    2. Sets up sentence-transformer based embeddings for document vectorization.
    3. Loads and splits PDF documents from one or more specified directories.
    4. Chunks the documents and builds a vector index using Chroma, persisting the index to Google Drive.
    5. Optionally loads an existing persisted vector index to avoid re-indexing, via the 'rebuild_index' parameter.
    6. Retrieves context relevant to user queries from the vector index with token truncation to enforce input limits.
    7. Maintains conversation memory for interactive sessions.
    8. Supports both interactive and programmatic prompt-based querying.
    9. Includes a 'test_mode' option for quick testing with a single PDF.
===================================================
"""



In [2]:
# Install the LangChain community integrations plus the PDF, tokenizer, vector-store, and embedding dependencies
!pip install -q langchain-community pypdf tiktoken chromadb sentence-transformers > /dev/null 2>&1

In [3]:
import os
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.memory import ConversationBufferWindowMemory
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import ConversationalRetrievalChain
from google.colab import drive

In [12]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)



In [5]:

# Mount Google Drive at /content/drive, forcing a remount if it is already mounted
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


# A class-based implementation of an LLM Retrieval Augmented Generation (RAG) engine

In [14]:


# class BankEarningsChatbot:
#     """
#     A class-based implementation of an LLM Retrieval Augmented Generation (RAG) engine
#     designed to analyze bank quarterly earnings call transcripts. It loads PDF documents
#     from one or more specified folders, builds a Chroma vector index, and sets up an interactive
#     conversational chain for prompt-based queries.

#     Parameters:
#         pdf_folders (list or str): A list of folder paths containing PDF files, or a single folder path as a string.
#         persist_directory (str): Directory path to persist the vector index and model outputs.
#         max_length (int): Maximum output length for the T5 model.
#         test_mode (bool): If True, only loads one PDF (from the first folder) for quick testing.
#         rebuild_index (bool): If True, reprocess all PDFs and rebuild the index even if a persisted index exists.
#     """
#     def __init__(self, pdf_folders,
#                  persist_directory="/content/drive/MyDrive/BOE/bank_of_england/data/model_outputs",
#                  max_length=256, test_mode=False, rebuild_index=False):
#         # Allow a single folder (string) or a list of folders.
#         if isinstance(pdf_folders, str):
#             self.pdf_folders = [pdf_folders]
#         else:
#             self.pdf_folders = pdf_folders

#         self.persist_directory = persist_directory
#         self.test_mode = test_mode

#         # Set up the LLM pipeline using the Flan-T5 model.
#         self._setup_llm(max_length)

#         # Configure embeddings.
#         self._setup_embeddings()

#         # Check if we need to rebuild the vector index.
#         if rebuild_index or (not os.path.exists(self.persist_directory)) or (not os.listdir(self.persist_directory)):
#             # Load documents and build the vector index.
#             self._load_documents()
#             self._build_vector_index()
#         else:
#             print("Loading existing vector index from persistence directory.")
#             self.db = Chroma(persist_directory=self.persist_directory, embedding_function=self.embeddings)
#             # Note: If you need to update the in-memory index from the persisted data, this method should suffice.

#         # Configure retriever from the persisted vector database.
#         self._setup_retriever()

#         # Initialize conversation memory.
#         self.memory = ConversationBufferWindowMemory(k=10, memory_key="chat_history", return_messages=True)

#         # Set up the conversational retrieval chain.
#         self.qa_chain = ConversationalRetrievalChain.from_llm(
#             llm=self.llm,
#             retriever=self.retriever,
#             memory=self.memory,
#             verbose=False
#         )

#     def _setup_llm(self, max_length):
#         """
#         Configures the language model pipeline using a T5 model.
#         """
#         self.model_name = "google/flan-t5-large"
#         self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
#         self.model = AutoModelForSeq2SeqLM.from_pretrained(
#             self.model_name,
#             device_map="auto",
#             torch_dtype=torch.float16
#         )
#         # Using the text2text-generation pipeline for T5.
#         self.pipe = pipeline(
#             "text2text-generation",
#             model=self.model,
#             tokenizer=self.tokenizer,
#             max_length=max_length,
#             temperature=0.5,
#             top_p=0.8,
#             do_sample=True
#         )
#         self.llm = HuggingFacePipeline(pipeline=self.pipe)

#     def _setup_embeddings(self):
#         """
#         Initializes the sentence-transformer based embeddings.
#         """
#         self.embedding_model = "sentence-transformers/all-mpnet-base-v2"
#         self.embeddings = HuggingFaceEmbeddings(model_name=self.embedding_model)

#     def _load_documents(self):
#         """
#         Loads and splits PDF documents from the specified folders.
#         In test_mode, only the first PDF (from the first folder) is loaded.
#         """
#         self.documents = []
#         for folder in self.pdf_folders:
#             file_names = [f for f in os.listdir(folder) if f.endswith(".pdf")]
#             if not file_names:
#                 continue
#             # If in test_mode, load only the first PDF file from this folder.
#             if self.test_mode:
#                 file_names = file_names[:1]
#             for pdf_file in file_names:
#                 pdf_path = os.path.join(folder, pdf_file)
#                 try:
#                     loader = PyPDFLoader(pdf_path, extract_images=False)
#                     self.documents.extend(loader.load_and_split())
#                     print(f"Loaded: {pdf_file} from {folder}")
#                 except Exception as e:
#                     print(f"Error loading {pdf_file} from {folder}: {e}")
#             # In test mode, break after processing the first folder.
#             if self.test_mode:
#                 break

#     def _build_vector_index(self):
#         """
#         Chunks documents and builds a Chroma vector index.
#         """
#         text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
#         chunks = text_splitter.split_documents(self.documents)
#         # Remove duplicate chunks.
#         chunks = self.remove_duplicate_chunks(chunks)
#         self.chunks = chunks
#         self.db = Chroma.from_documents(chunks, embedding=self.embeddings, persist_directory=self.persist_directory)
#         self.db.persist()

#     def _setup_retriever(self):
#         """
#         Initializes the retriever from the persisted vector database.
#         """
#         self.vectordb = Chroma(persist_directory=self.persist_directory, embedding_function=self.embeddings)
#         self.retriever = self.vectordb.as_retriever(search_kwargs={"k": 5})

#     @staticmethod
#     def remove_duplicate_chunks(chunks):
#         """
#         Eliminates duplicate document chunks based on their content.
#         """
#         seen = set()
#         unique_chunks = []
#         for chunk in chunks:
#             chunk_text = chunk.page_content.strip()
#             if chunk_text not in seen:
#                 seen.add(chunk_text)
#                 unique_chunks.append(chunk)
#         return unique_chunks

#     def truncate_context(self, context_list, max_tokens=800):
#         """
#         Truncates the retrieved context to avoid overloading the model's input.
#         """
#         truncated_docs = []
#         current_tokens = 0
#         for doc in context_list:
#             doc_tokens = len(self.tokenizer.encode(doc.page_content))
#             if current_tokens + doc_tokens <= max_tokens:
#                 truncated_docs.append(doc)
#                 current_tokens += doc_tokens
#             else:
#                 break
#         return truncated_docs

#     @staticmethod
#     def clean_user_input(user_input):
#         """
#         Cleans and standardizes user input.
#         """
#         return user_input.strip().replace("\n", " ").replace("\t", " ")

#     def reset_memory_if_needed(self):
#         """
#         Clears conversation history if the number of exchanges exceeds a threshold.
#         """
#         if len(self.memory.chat_memory.messages) > 6:
#             print("\nMemory Full: Resetting Conversation History...\n")
#             self.memory.clear()

#     def format_response(self, question, response):
#         """
#         Formats the output to clearly present both the question and the answer.
#         """
#         response_text = response.strip()
#         unwanted_phrases = [
#             "Use the following pieces of context",
#             "If you don't know the answer, just say that you don't know",
#             "Don't try to make up an answer."
#         ]
#         for phrase in unwanted_phrases:
#             if phrase in response_text:
#                 response_text = response_text.split(phrase)[-1].strip()
#         return f"Question: {question}\nHelpful Answer: {response_text}"

#     def trim_final_input(self, question, context, max_tokens=512):
#       """
#       Truncates the final input to meet the token limit, preserving document metadata.
#       """
#       system_message = (
#           "You are analyzing a bank's quarterly earnings call transcript. "
#           "Provide a bullet-point summary of the most important takeaways with specific details: "
#           "list key revenue trends (include any percentage changes if available), major expense drivers, "
#           "and management's outlook for the future. If numerical details are not available, provide qualitative insights."
#       )
#       # Join the context documents into one coherent string.
#       context_str = "\n".join([doc.page_content for doc in context])
#       input_text = f"{system_message}\n\nContext:\n{context_str}\n\nQuestion: {question}"
#       tokens = self.tokenizer.encode(input_text, truncation=True, max_length=max_tokens)
#       return self.tokenizer.decode(tokens)


#     def answer_question(self, question):
#         """
#         Processes the user query: retrieves context, prepares the prompt,
#         and returns a formatted answer. If no relevant documents are retrieved,
#         a fallback message is returned.
#         """
#         question = self.clean_user_input(question)
#         self.reset_memory_if_needed()

#         # Retrieve and process context.
#         context = self.retriever.get_relevant_documents(question)
#         context = self.remove_duplicate_chunks(context)
#         context = self.truncate_context(context, max_tokens=800)

#         # Fallback: if no relevant context is found.
#         if not context:
#             return f"Question: {question}\nHelpful Answer: I don't have information regarding that query."

#         print("\nRetrieved Context:")
#         for doc in context:
#             source = doc.metadata.get('source', 'Unknown Source')
#             page = doc.metadata.get('page', 'Unknown Page')
#             print(f"- Source: {source}, Page: {page}")

#         # Enforce a 512-token limit for the final prompt.
#         formatted_input = self.trim_final_input(question, context, max_tokens=512)
#         response = self.qa_chain({"question": formatted_input})
#         return self.format_response(question, response['answer'])

#     def run_chatbot(self):
#         """
#         Initiates an interactive loop for prompt-based queries.
#         """
#         print("\n💬 Bank Earnings Chatbot (Type 'exit' to stop)")
#         while True:
#             user_input = input("\nYou: ")
#             if user_input.lower() == "exit":
#                 print("\nExiting Chatbot. Have a great day!")
#                 break
#             answer = self.answer_question(user_input)
#             print("\n" + answer)




In [None]:
import os
import re
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.memory import ConversationBufferWindowMemory
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import ConversationalRetrievalChain

class BankEarningsChatbot:
    """
    A class-based implementation of an LLM Retrieval Augmented Generation (RAG) engine
    designed to analyze bank quarterly earnings call transcripts. It loads PDF documents
    from one or more specified folders, builds a Chroma vector index, and sets up an interactive
    conversational chain for prompt-based queries.

    Parameters:
        pdf_folders (list or str): Folder paths containing PDF files.
        persist_directory (str): Directory to persist the vector index.
        max_length (int): Maximum output length for the T5 model.
        test_mode (bool): If True, loads only one PDF (from the first folder that contains PDFs) for quick testing.
        rebuild_index (bool): If True, reprocess all PDFs and rebuild the index.
        verbose (bool): If True, prints additional debug information.
    """
    def __init__(self, pdf_folders,
                 persist_directory="/content/drive/MyDrive/BOE/bank_of_england/data/model_outputs",
                 max_length=256, test_mode=False, rebuild_index=False, verbose=False):
        if isinstance(pdf_folders, str):
            self.pdf_folders = [pdf_folders]
        else:
            self.pdf_folders = pdf_folders

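        self.persist_directory = persist_directory
        self.test_mode = test_mode
        self.verbose = verbose
        # Populated only when the index is rebuilt; stays empty when a persisted index is loaded.
        self.documents = []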

        self._setup_llm(max_length)
        self._setup_embeddings()

        if rebuild_index or (not os.path.exists(self.persist_directory)) or (not os.listdir(self.persist_directory)):
            self._load_documents()
            self._build_vector_index()
        else:
            if self.verbose:
                print("Loading existing vector index from persistence directory.")
            self.db = Chroma(persist_directory=self.persist_directory, embedding_function=self.embeddings)

        self._setup_retriever()
        self.memory = ConversationBufferWindowMemory(k=10, memory_key="chat_history", return_messages=True)
        self.qa_chain = ConversationalRetrievalChain.from_llm(
            llm=self.llm,
            retriever=self.retriever,
            memory=self.memory,
            verbose=False
        )

    def _setup_llm(self, max_length):
        self.model_name = "google/flan-t5-large"
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(
            self.model_name,
            device_map="auto",
            torch_dtype=torch.float16
        )
        self.pipe = pipeline(
            "text2text-generation",
            model=self.model,
            tokenizer=self.tokenizer,
            max_length=max_length,
            temperature=0.5,
            top_p=0.8,
            do_sample=True
        )
        self.llm = HuggingFacePipeline(pipeline=self.pipe)

    def _setup_embeddings(self):
        self.embedding_model = "sentence-transformers/all-mpnet-base-v2"
        self.embeddings = HuggingFaceEmbeddings(model_name=self.embedding_model)

    def _load_documents(self):
        self.documents = []
        for folder in self.pdf_folders:
            bank_name = os.path.basename(folder).lower()
            file_names = [f for f in os.listdir(folder) if f.endswith(".pdf")]
            if not file_names:
                continue
            if self.test_mode:
                file_names = file_names[:1]
            for pdf_file in file_names:
                pdf_path = os.path.join(folder, pdf_file)
                try:
                    loader = PyPDFLoader(pdf_path, extract_images=False)
                    docs = loader.load_and_split()
                    for doc in docs:
                        doc.metadata["bank"] = bank_name
                    self.documents.extend(docs)
                    if self.verbose:
                        print(f"Loaded: {pdf_file} from {folder}")
                except Exception as e:
                    if self.verbose:
                        print(f"Error loading {pdf_file} from {folder}: {e}")
            if self.test_mode:
                break

    def _build_vector_index(self):
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
        chunks = text_splitter.split_documents(self.documents)
        chunks = self.remove_duplicate_chunks(chunks)
        self.chunks = chunks
        self.db = Chroma.from_documents(chunks, embedding=self.embeddings, persist_directory=self.persist_directory)
        self.db.persist()

    def _setup_retriever(self):
        self.vectordb = Chroma(persist_directory=self.persist_directory, embedding_function=self.embeddings)
        self.retriever = self.vectordb.as_retriever(search_kwargs={"k": 5})

    @staticmethod
    def remove_duplicate_chunks(chunks):
        seen = set()
        unique_chunks = []
        for chunk in chunks:
            chunk_text = chunk.page_content.strip()
            if chunk_text not in seen:
                seen.add(chunk_text)
                unique_chunks.append(chunk)
        return unique_chunks

    def truncate_context(self, context_list, max_tokens=800):
        truncated_docs = []
        current_tokens = 0
        for doc in context_list:
            doc_tokens = len(self.tokenizer.encode(doc.page_content))
            if current_tokens + doc_tokens <= max_tokens:
                truncated_docs.append(doc)
                current_tokens += doc_tokens
            else:
                break
        return truncated_docs

    @staticmethod
    def clean_user_input(user_input):
        return user_input.strip().replace("\n", " ").replace("\t", " ")

    def reset_memory_if_needed(self):
        if len(self.memory.chat_memory.messages) > 6:
            if self.verbose:
                print("\nMemory Full: Resetting Conversation History...\n")
            self.memory.clear()

    def format_response(self, question, response):
        # Ensure response is extracted as a string
        if isinstance(response, dict) and 'answer' in response:
            response_text = response['answer'].strip()
        else:
            response_text = str(response).strip()
        unwanted_phrases = [
            "Use the following pieces of context",
            "If you don't know the answer, just say that you don't know",
            "Don't try to make up an answer."
        ]
        for phrase in unwanted_phrases:
            if phrase in response_text:
                response_text = response_text.split(phrase)[-1].strip()
        return f"Question: {question}\nHelpful Answer: {response_text}"

    def trim_final_input(self, question, context, max_tokens=512):
        system_message = (
            "You are analyzing a bank's quarterly earnings call transcript. "
            "Provide a bullet-point summary of the most important takeaways with specific details: "
            "list key revenue trends (include any percentage changes if available), major expense drivers, "
            "and management's outlook for the future. If numerical details are not available, provide qualitative insights."
        )
        context_str = "\n".join([doc.page_content for doc in context])
        input_text = f"{system_message}\n\nContext:\n{context_str}\n\nQuestion: {question}"
        tokens = self.tokenizer.encode(input_text, truncation=True, max_length=max_tokens)
        return self.tokenizer.decode(tokens)

    def answer_question(self, question):
        question = self.clean_user_input(question)
        self.reset_memory_if_needed()
        context = self.retriever.get_relevant_documents(question)
        context = self.remove_duplicate_chunks(context)
        context = self.truncate_context(context, max_tokens=800)
        if not context:
            return f"Question: {question}\nHelpful Answer: I don't have information regarding that query."
        if self.verbose:
            print("\nRetrieved Context:")
            for doc in context:
                source = doc.metadata.get('source', 'Unknown Source')
                page = doc.metadata.get('page', 'Unknown Page')
                print(f"- Source: {source}, Page: {page}")
        formatted_input = self.trim_final_input(question, context, max_tokens=512)
        # Pass the prompt under the chain's explicit "question" input key.
        response = self.qa_chain({"question": formatted_input})
        return self.format_response(question, response)

    def run_chatbot(self):
        print("\n💬 Bank Earnings Chatbot (Type 'exit' to stop)")
        while True:
            user_input = input("\nYou: ")
            if user_input.lower() == "exit":
                print("\nExiting Chatbot. Have a great day!")
                break
            answer = self.answer_question(user_input)
            print("\n" + answer)

    # ---------------- Multi-Step Methods for Year-on-Year & Quarterly Sentiment Analysis ----------------

    def summarize_individual_transcripts(self):
        # self.documents holds several page-level sections per PDF, so the section
        # summaries are accumulated per source file rather than overwritten.
        summaries = {}
        for doc in self.documents:
            source = doc.metadata.get('source', 'Unknown Source')
            bank = doc.metadata.get('bank', 'unknown')
            transcript_text = doc.page_content
            prompt = f"Please provide a bullet-point sentiment summary for the following transcript:\n\n{transcript_text}"
            response = self.llm(prompt)
            # Extract answer string if needed
            if isinstance(response, dict) and 'answer' in response:
                answer_text = response['answer']
            else:
                answer_text = str(response)
            entry = summaries.setdefault(source, {"bank": bank, "summary": ""})
            entry["summary"] = (entry["summary"] + "\n" + answer_text).strip()
            if self.verbose:
                print(f"Summarized a section of {source}")
        return summaries

    def group_summaries_by_bank_and_quarter(self, summaries):
        grouped = {}
        for source, info in summaries.items():
            bank = info["bank"]
            summary = info["summary"]
            match = re.search(r'(\d{1}[qQ]|[qQ]\d)[-_]?(\d{2,4})', source)
            if match:
                quarter_raw = match.group(1)
                year = match.group(2)
                quarter_num = re.search(r'\d', quarter_raw).group(0)
                key = f"{year}-Q{quarter_num}"
            else:
                key = "Unknown"
            if bank not in grouped:
                grouped[bank] = {}
            if key not in grouped[bank]:
                grouped[bank][key] = []
            grouped[bank][key].append(summary)
        return grouped

    def aggregate_quarterly_summaries_by_bank(self, grouped_quarterly):
        quarterly_aggregates = {}
        for bank, quarters in grouped_quarterly.items():
            quarterly_aggregates[bank] = {}
            for key, summaries in quarters.items():
                combined = "\n".join(summaries)
                prompt = (f"Based on the following quarterly sentiment summaries for {bank.upper()} ({key}), "
                          "provide a concise bullet-point overview of the overall sentiment for that quarter:\n\n" + combined)
                response = self.llm(prompt)
                if isinstance(response, dict) and 'answer' in response:
                    answer_text = response['answer']
                else:
                    answer_text = str(response)
                quarterly_aggregates[bank][key] = answer_text
        return quarterly_aggregates

    def forecast_next_quarter_sentiment(self, bank, historical_quarterly):
        combined = "\n".join(historical_quarterly)
        prompt = (
            f"Based on the following historical quarterly sentiment summaries for {bank.upper()}, "
            "forecast the overall sentiment for the next quarter. Provide a bullet-point summary of the expected trends, "
            "including any changes in tone, risk factors, or optimism:\n\n" + combined
        )
        response = self.llm(prompt)
        if isinstance(response, dict) and 'answer' in response:
            return response['answer']
        return str(response)

    @staticmethod
    def parse_quarter_key(key):
        try:
            parts = key.split("-")
            year = int(parts[0])
            quarter = int(re.search(r'\d', parts[1]).group(0))
            return year, quarter
        except Exception:
            return (0, 0)

    def analyze_and_forecast_sentiment_by_bank(self):
        print("Generating individual transcript summaries...")
        summaries = self.summarize_individual_transcripts()
        print("Grouping summaries by bank and quarter...")
        grouped_quarterly = self.group_summaries_by_bank_and_quarter(summaries)
        print("Aggregating quarterly summaries...")
        quarterly_aggregates = self.aggregate_quarterly_summaries_by_bank(grouped_quarterly)

        analysis = {}
        for bank, quarters in quarterly_aggregates.items():
            analysis[bank] = {}
            valid_keys = [k for k in quarters.keys() if k != "Unknown"]
            if not valid_keys:
                continue
            sorted_keys = sorted(valid_keys, key=lambda k: self.parse_quarter_key(k))
            most_recent_key = sorted_keys[-1]
            current_year, current_quarter = self.parse_quarter_key(most_recent_key)

            years = sorted({self.parse_quarter_key(k)[0] for k in quarters if k != "Unknown"})
            previous_year = max([y for y in years if y < current_year], default=None)
            previous_year_summaries = []
            if previous_year is not None:
                for key in quarters:
                    year, _ = self.parse_quarter_key(key)
                    if year == previous_year:
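                        # quarters[key] is one aggregated string, so append it whole
                        # (extend would split it into single characters).
                        previous_year_summaries.append(quarters[key])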
                combined_prev = "\n".join(previous_year_summaries)
                prompt_prev = (f"Based on the following sentiment summaries for all quarters in {previous_year} for {bank.upper()}, "
                               "provide a bullet-point summary of the overall sentiment trends for that year:\n\n" + combined_prev)
                response_prev = self.llm(prompt_prev)
                if isinstance(response_prev, dict) and 'answer' in response_prev:
                    previous_year_summary = response_prev['answer']
                else:
                    previous_year_summary = str(response_prev)
            else:
                previous_year_summary = "Not available"

            current_summary = quarters.get(most_recent_key, "Not available")

            historical = []
            for key in sorted_keys:
                # Each quarterly aggregate is a single string; append keeps it intact.
                historical.append(quarters[key])
            forecast = self.forecast_next_quarter_sentiment(bank, historical) if historical else "Not available"

            analysis[bank] = {
                "previous_year_summary": previous_year_summary,
                "current_quarter_summary": current_summary,
                "forecast_next_quarter": forecast,
                "most_recent_key": most_recent_key
            }
        return analysis

# ----------------------------
# Example usage:
pdf_folders = [
    "/content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan",
    "/content/drive/MyDrive/BOE/bank_of_england/data/raw/ubs"
]
persist_directory = "/content/drive/MyDrive/BOE/bank_of_england/data/model_outputs"
chatbot = BankEarningsChatbot(pdf_folders, persist_directory=persist_directory, test_mode=False, rebuild_index=True, verbose=False)

analysis = chatbot.analyze_and_forecast_sentiment_by_bank()

print("Yearly and Quarterly Sentiment Analysis by Bank:")
for bank, data in analysis.items():
    print(f"{bank.upper()} Analysis:")
    print(f"Most Recent Quarter Key: {data.get('most_recent_key','N/A')}")
    print("Previous Year Sentiment Summary:")
    print(data["previous_year_summary"])
    print("Current Quarter Sentiment Summary:")
    print(data["current_quarter_summary"])
    print("Forecast for Next Quarter:")
    print(data["forecast_next_quarter"])
    print("-" * 60)


Device set to use cuda:0


Generating individual transcript summaries...


Token indices sequence length is longer than the specified maximum sequence length for this model (827 > 512). Running this sequence through the model will result in indexing errors
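
The warning above reports that a prompt of 827 tokens was passed to a tokenizer whose stated maximum is 512. One way to avoid this is to cap each prompt's token count before it reaches the model. The sketch below is a minimal, library-free illustration: `truncate_to_budget` and the whitespace tokenizer are hypothetical stand-ins that are not part of the notebook, and in practice the Flan-T5 tokenizer's `encode` and `decode` would take their place.

```python
from typing import Callable, List


def truncate_to_budget(
    instruction: str,
    body: str,
    encode: Callable[[str], List[str]],
    decode: Callable[[List[str]], str],
    max_tokens: int = 512,
) -> str:
    """Keep the instruction intact and trim the body so the total fits max_tokens."""
    instruction_tokens = encode(instruction)
    budget = max_tokens - len(instruction_tokens)
    if budget <= 0:
        raise ValueError("instruction alone exceeds the token budget")
    body_tokens = encode(body)[:budget]
    return decode(instruction_tokens + body_tokens)


# Hypothetical whitespace tokenizer, used only to keep the sketch self-contained.
def whitespace_encode(text: str) -> List[str]:
    return text.split()


def whitespace_decode(tokens: List[str]) -> str:
    return " ".join(tokens)


instruction = "Please provide a bullet-point sentiment summary for the following transcript:"
page_text = "revenue " * 827  # stands in for an over-long transcript page
prompt = truncate_to_budget(instruction, page_text, whitespace_encode, whitespace_decode)
prompt_token_count = len(whitespace_encode(prompt))
```

The notebook's `trim_final_input` takes a related route by calling `encode(..., truncation=True, max_length=...)` on the whole string; that variant cuts from the end of the combined text rather than protecting the instruction.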


Given the transcript corpus and the retrieval-augmented setup, the following prompt guidelines tend to produce more accurate, domain-specific responses:

- **Be Specific About the Timeframe:**  
  Instead of asking “What were the key insights?” specify the quarter or transcript you’re interested in. For example:  
  - "What were the key financial insights from the Q4 2023 earnings call?"  
  - "Summarize the main drivers of revenue in the Q1 2023 transcript."

- **Target Specific Financial Metrics or Themes:**  
  Focus on particular areas the transcripts cover, such as revenue trends, expense drivers, or capital performance. For example:  
  - "How did revenue change compared to the previous quarter in the Q4 2023 earnings call?"  
  - "What were the primary expense drivers discussed in the Q4 2023 transcript?"

- **Incorporate Domain-Specific Language:**  
  Use terminology that reflects the financial domain to guide the model. For example:  
  - "What risk factors and forward-looking statements were highlighted in the Q3 2023 transcript?"  
  - "Outline the key operational challenges and strategic responses mentioned in the earnings call."

- **Prompt for Summaries and Insights:**  
  Asking for summaries can help the model focus on extracting concise information from large volumes of text. For example:  
  - "Provide a concise summary of the key financial insights from the Q4 2023 earnings transcript, including revenue, expenses, and capital allocation."  
  - "What are the overall sentiments and key management strategies discussed in the transcript?"

By tailoring your queries with specific quarters, financial metrics, and industry language, you guide the retrieval and summarization process more effectively. This structured approach should lead to more precise and contextually relevant responses from your system.
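
The recommendations above can be folded into a small prompt-building helper. The sketch below is illustrative only: `build_prompt` and its parameters are hypothetical names and are not part of the `BankEarningsChatbot` class. It shows one way to make sure every query names a bank, a quarter, and the metrics of interest.

```python
def build_prompt(bank, quarter, metrics, task="Summarize"):
    """Assemble a specific, domain-focused query (hypothetical helper)."""
    if not metrics:
        raise ValueError("Provide at least one metric or theme to focus on.")
    # Join the metrics into a readable phrase: "a", "a and b", "a, b, and c".
    if len(metrics) == 1:
        focus = metrics[0]
    elif len(metrics) == 2:
        focus = " and ".join(metrics)
    else:
        focus = ", ".join(metrics[:-1]) + ", and " + metrics[-1]
    return f"{task} the {focus} discussed in the {bank} {quarter} earnings call transcript."


prompt = build_prompt("JPMorgan", "Q4 2023", ["revenue trends", "expense drivers"])
print(prompt)
```

The resulting string can be passed to `chatbot.answer_question(...)` like any hand-written prompt.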

**Interactive Chatbot Session:**  
Calling chatbot.run_chatbot() starts an interactive loop that waits for user input, answers each question as it is entered, and prints the response. Typing 'exit' ends the session. This mode suits a live, conversational experience where the operator drives the dialogue manually.
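
The interactive mode can be pictured as a simple read-answer-print loop. The sketch below is an assumption about the general shape of such a loop, not the notebook's actual `run_chatbot` implementation. `run_chat_loop`, `answer_fn`, `input_fn`, and `output_fn` are hypothetical names. It also ends quietly when the cell is interrupted instead of surfacing a `KeyboardInterrupt` traceback.

```python
def run_chat_loop(answer_fn, input_fn=input, output_fn=print):
    """Minimal interactive loop (illustrative; not the notebook's run_chatbot)."""
    output_fn("Bank Earnings Chatbot (Type 'exit' to stop)")
    turns = 0
    while True:
        try:
            question = input_fn("You: ").strip()
        except (KeyboardInterrupt, EOFError):
            # Interrupting the cell raises KeyboardInterrupt; end the session quietly.
            output_fn("Session ended.")
            break
        if question.lower() == "exit":
            output_fn("Session ended.")
            break
        if not question:
            continue  # ignore empty input and prompt again
        output_fn("Bot:", answer_fn(question))
        turns += 1
    return turns  # number of questions answered


# Scripted demo so the sketch runs without blocking on real input.
scripted = iter(["What drove revenue in Q4 2023?", "exit"])
run_chat_loop(lambda q: f"(stub answer to: {q})", input_fn=lambda _: next(scripted))
```

With the real object, the equivalent call would be `run_chat_loop(chatbot.answer_question)`.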

**Programmatic Prompt Processing:**  
Alternatively, supply a list of predefined prompts. The code iterates over the list, calls chatbot.answer_question(prompt) for each query, and prints the prompt alongside its answer. This approach is useful for batch testing, automated evaluation, or processing a fixed set of queries without manual intervention.
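
For batch runs it can help to collect results instead of only printing them, and to keep going when one prompt fails. The following is a minimal sketch under those assumptions. `run_prompts` and `stub_answer` are hypothetical names, and the stub stands in for `chatbot.answer_question` so the example runs on its own.

```python
def run_prompts(answer_fn, prompts):
    """Run each prompt through answer_fn and collect the results (hypothetical helper)."""
    results = []
    for prompt in prompts:
        try:
            results.append({"prompt": prompt, "response": answer_fn(prompt), "error": None})
        except Exception as exc:
            # Record the failure and continue with the remaining prompts.
            results.append({"prompt": prompt, "response": None, "error": str(exc)})
    return results


# Stub standing in for chatbot.answer_question (an assumption for this sketch).
def stub_answer(prompt):
    if not prompt.strip():
        raise ValueError("empty prompt")
    return f"(stub answer to: {prompt})"


for row in run_prompts(stub_answer, ["What were the primary expense drivers in Q4 2023?", " "]):
    print("Question:", row["prompt"])
    print("Response:", row["response"] if row["error"] is None else f"failed: {row['error']}")
    print("-" * 60)
```

The returned list of dictionaries can then be saved or compared across runs, which is convenient for automated evaluation.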

# Instantiate the chatbot object

In [13]:
# ----------------------------
# Example usage of the BankEarningsChatbot class with T5 and multiple data sources (test mode disabled)
# ----------------------------

# Define your PDF folder paths (ensure these paths contain your earnings transcripts in PDF format).
pdf_folders = [
    "/content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan",
    "/content/drive/MyDrive/BOE/bank_of_england/data/raw/ubs"
]

# Define the persistence directory for model outputs and the vector index.
persist_directory = "/content/drive/MyDrive/BOE/bank_of_england/data/model_outputs"

# Instantiate the chatbot object.
# Set rebuild_index=True to force re-indexing; otherwise the persisted index is loaded if it exists.
chatbot = BankEarningsChatbot(
    pdf_folders,
    persist_directory=persist_directory,
    test_mode=False,
    rebuild_index=True,
)

Device set to use cuda:0


Loaded: 1q23-earnings-transcript.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan
Loaded: 2q23-earnings-transcript.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan
Loaded: 4q24-earnings-transcript.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan
Loaded: jpm-1q24-earnings-call-transcript.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan
Loaded: jpm-2q24-earnings-call-transcript-final.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan
Loaded: jpm-3q23-earnings-call-transcript.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan
Loaded: jpm-4q23-earnings-call-transcript.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan
Loaded: jpmc-third-quarter-2024-earnings-conference-call-transcript.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan
Loaded: 1q23-earnings-call-remarks.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/ubs

# For debugging, process a list of prompts programmatically

In [None]:
# Process a list of prompts programmatically.
prompts = [
    # "What were the key insights from the latest earnings call for JP Morgan?",
    # "How did revenue change in Q4 2024 compared to the previous Q3 2024 for JP Morgan?",
    # "What is the overall sentiment of the earnings call for UBS 2024?",
    "Please provide a bullet-point summary of the most important takeaways from UBS's latest earnings call. Focus on key revenue trends (include any percentage changes if available), major expense drivers, and management's outlook for the future."
]

for prompt in prompts:
    response = chatbot.answer_question(prompt)
    print("Question:", prompt)
    print("Response:", response)
    print("-" * 60)

In [16]:
prompt = "Please provide a summary of the overall sentiment expressed in the earnings call transcripts over the last two years for UBS bank. Consider all quarterly transcripts for each year and highlight any trends or shifts in sentiment over time."
response = chatbot.answer_question(prompt)
print("Question:", prompt)
print("Response:", response)


Token indices sequence length is longer than the specified maximum sequence length for this model (800 > 512). Running this sequence through the model will result in indexing errors



Retrieved Context:
- Source: /content/drive/MyDrive/BOE/bank_of_england/data/raw/ubs/1q23-earnings-call-remarks.pdf, Page: 0
- Source: /content/drive/MyDrive/BOE/bank_of_england/data/raw/ubs/3q24-earnings-call-remarks.pdf, Page: 18
- Source: /content/drive/MyDrive/BOE/bank_of_england/data/raw/ubs/3q24-earnings-call-remarks.pdf, Page: 2
Question: Please provide a summary of the overall sentiment expressed in the earnings call transcripts over the last two years for UBS bank. Consider all quarterly transcripts for each year and highlight any trends or shifts in sentiment over time.
Response: Question: Please provide a summary of the overall sentiment expressed in the earnings call transcripts over the last two years for UBS bank. Consider all quarterly transcripts for each year and highlight any trends or shifts in sentiment over time.
Helpful Answer: 477 million dollars in profit before tax and a 15% return on attributed equity


# Launch an interactive chatbot session

In [17]:
chatbot.run_chatbot()


💬 Bank Earnings Chatbot (Type 'exit' to stop)

You: Please provide a summary of the overall sentiment expressed in the earnings call transcripts over the last two years for UBS bank. Consider all quarterly transcripts for each year and highlight any trends or shifts in sentiment over time.

Retrieved Context:
- Source: /content/drive/MyDrive/BOE/bank_of_england/data/raw/ubs/1q23-earnings-call-remarks.pdf, Page: 0
- Source: /content/drive/MyDrive/BOE/bank_of_england/data/raw/ubs/3q24-earnings-call-remarks.pdf, Page: 18
- Source: /content/drive/MyDrive/BOE/bank_of_england/data/raw/ubs/3q24-earnings-call-remarks.pdf, Page: 2

Question: Please provide a summary of the overall sentiment expressed in the earnings call transcripts over the last two years for UBS bank. Consider all quarterly transcripts for each year and highlight any trends or shifts in sentiment over time.
Helpful Answer: Amit Goel, Barclays


KeyboardInterrupt: Interrupted by user