# Building a Financial Analyst Chatbot with RAG 🤖📈

Traditional financial analysis is a laborious process - sifting through densely packed reports, extracting the relevant numbers, connecting disparate data points, and interpreting what it all means. But what if you could simply have a conversation about finances and get instant, accurate insights?

That's what I've built here by combining the reasoning capabilities of an LLM (Mistral-7B-Instruct-v0.3) with the factual precision of Retrieval-Augmented Generation (RAG), using LVMH financial reports as the knowledge source for this demonstration.

With RAG, the chatbot doesn't just respond with generic information - it dives into actual financial documents, extracts the most relevant data, and delivers insights through natural conversation.

This Notebook Covers:

* Building a RAG Pipeline using `llama-index`
* Building a text generation pipeline for the chatbot using `Mistral-7B-Instruct-v0.3`
* Building a chatbot interface using `Gradio UI`


Let's dive in! 🚀

In [1]:
!pip install -q -U llama-index
!pip install -q -U llama-index-embeddings-huggingface
!pip install -q -U optimum
!pip install -q -U bitsandbytes
!pip install -q -U gradio

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
preprocessing 0.1.13 requires nltk==3.2.4, but you have nltk 3.9.1 which is incompatible.

In [2]:
import torch
import gradio as gr
from threading import Thread

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TextIteratorStreamer


from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings


import os
from kaggle_secrets import UserSecretsClient

In [4]:
os.environ["HF_TOKEN"]= UserSecretsClient().get_secret("HF_TOKEN")

# Load quantized Mistral Instruct model

In [3]:
# NF4 Quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
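To see why 4-bit loading matters here, the following back-of-the-envelope sketch estimates the memory needed for the model weights alone. The parameter count (about 7.25 billion) is an assumption for illustration, not a value reported by this notebook, and `weight_memory_gb` is a hypothetical helper. Quantization constants, activations, and the KV cache are ignored.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# Assumption: roughly 7.25 billion parameters for a Mistral-7B-class model.
N_PARAMS = 7.25e9

bf16_gb = weight_memory_gb(N_PARAMS, 16)  # unquantized bfloat16 weights
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit weights, ignoring quantization constants

print(f"bf16 weights: ~{bf16_gb:.1f} GB")
print(f"nf4 weights:  ~{nf4_gb:.1f} GB")
```

The real footprint will likely be somewhat higher than the 4-bit estimate, since some layers are typically kept in higher precision and the quantization constants take extra space.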

In [5]:
# Model checkpoint
model_checkpoint = "mistralai/Mistral-7B-Instruct-v0.3"
# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# Load Model
model = AutoModelForCausalLM.from_pretrained(
        model_checkpoint,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True)


# RAG (Retrieval Augmented Generation)

Load an embedding model

In [121]:
embedding_model=HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

RAG code

In [122]:
class RAGSystem:
    def __init__(self,
                 dir_path,
                 embedding_model,
                 chunk_size=256,
                 chunk_overlap=25,
                 top_k=3,
                 similarity_threshold=0.5):

        # Configure global settings
        Settings.embed_model = embedding_model
        Settings.llm = None  # Focus only on embedding generation
        Settings.chunk_size = chunk_size
        Settings.chunk_overlap = chunk_overlap

        # Attributes
        self.dir_path = dir_path
        self.embedding_model = embedding_model
        self.top_k = top_k
        self.similarity_threshold = similarity_threshold
        self.documents = self._load_documents()
        self.index = self._create_index()
        self.query_engine = self._configure_query_engine()

    def _load_documents(self):
        """Load documents from the specified path."""
        reader = SimpleDirectoryReader(self.dir_path)
        return reader.load_data()

    def _create_index(self):
        """Create vector index from documents."""
        # High level transformation API : accepts an array of Document objects to parse and chunk them up
        return VectorStoreIndex.from_documents(self.documents)

    def _configure_query_engine(self):
        """Configure the retrieval query engine."""
        retriever = VectorIndexRetriever(
            index=self.index,
            similarity_top_k=self.top_k
        )

        return RetrieverQueryEngine(
            retriever=retriever,
            node_postprocessors=[
                SimilarityPostprocessor(similarity_cutoff=self.similarity_threshold)
            ]
        )

    def build_prompt(self, query):
        """Build a RAG prompt with retrieved context."""

        # retrieve knowledge
        response = self.query_engine.query(query)
        context_parts = []
        for node in response.source_nodes:

            # Extract node source (file name and page label) for citation
            file_path = node.metadata.get("file_path", "Unknown File")
            file_name = os.path.basename(file_path)
            page_number = node.metadata.get("page_label", "Unknown Page")
            source_info = f"Source : [file: {file_name} , page: {page_number}]"

            # Node text
            node_text = node.text

            # Add node text and source info to context
            context_parts.append(f"{source_info}\n{node_text}\n")


        context = "\n --- \n".join(context_parts)

        return self._prompt_template(context, query)

    def _prompt_template(self, context, query):

        """Format the final prompt with context and query."""
        prompt_template=f"""
                        Context information is below.
                        ---------------------                    
                        
                        {context}
                    
                        ---------------------
                        Given the context information and not prior knowledge, answer the following query.
                        Query: {query}
                        """
    
        return prompt_template

    #def generate_response(self, query, llm):
        #"""Generate a response using the RAG system and an LLM."""
        #prompt = self.build_prompt(query)
        #return llm.generate(prompt)
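
The retrieval step configured in `_configure_query_engine` can be pictured as two stages: keep the `top_k` most similar chunks, then drop any whose score falls below the similarity cutoff. The sketch below reproduces that logic on toy 3-dimensional vectors in plain Python. The vectors, chunk labels, and the `cosine` and `retrieve` helpers are invented for illustration and are not how LlamaIndex is implemented internally.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, chunks, top_k=3, similarity_cutoff=0.5):
    """Return (label, score) pairs: top_k by score, then filtered by cutoff."""
    scored = [(label, cosine(query_vec, vec)) for label, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    top = scored[:top_k]
    return [(label, score) for label, score in top if score >= similarity_cutoff]

# Toy "embeddings" (hypothetical values, not produced by a real model)
chunks = {
    "revenue_by_region": [0.9, 0.1, 0.0],
    "balance_sheet": [0.5, 0.5, 0.5],
    "store_openings": [0.0, 0.2, 0.9],
    "auditor_note": [0.1, 0.9, 0.1],
}
query_vec = [1.0, 0.0, 0.0]

results = retrieve(query_vec, chunks, top_k=3, similarity_cutoff=0.5)
for label, score in results:
    print(f"{label}: {score:.2f}")
```

Because the cutoff is applied after the top-k selection, fewer than `top_k` chunks may reach the prompt, and for an off-topic question the context can end up empty.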

# Text Generation code for the chatbot

In [123]:
def generate_resp(chat, tokenizer, model, temperature):
    """
        Generates model response using chat history.
    """
    # Ensure inference mode
    model.eval()

    # Apply the chat template
    formatted_chat = tokenizer.apply_chat_template(chat,
                                                  tokenize=False,
                                                  add_generation_prompt=True
                                                  )

    # Tokenize the chat
    inputs = tokenizer(formatted_chat,
                      return_tensors="pt",
                      add_special_tokens=False)

    # Move the tokenized inputs and attention masks to the same device the model is on
    inputs = {key: tensor.to(model.device) for key, tensor in inputs.items()}

    # Initialize streamer to handle tokens as they are generated (we pass the tokenizer for automatic decoding)
    streamer = TextIteratorStreamer(tokenizer,
                                    skip_special_tokens=True,
                                    skip_prompt=True)

    # Set generation parameters
    generation_kwargs = dict(
        **inputs,
        streamer=streamer,
        max_new_tokens=512,
        do_sample=True,
        temperature=temperature,
        pad_token_id=tokenizer.eos_token_id
    )

    # Run generation in a separate thread
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()


    return streamer
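
The producer/consumer pattern behind `generate_resp` can be tried without loading a model. In the sketch below, a background thread pushes text pieces onto a queue while the caller iterates over them as they arrive. `ToyStreamer` and `fake_generate` are hypothetical stand-ins written for this illustration; they only mimic, in simplified form, the roles that `TextIteratorStreamer` and `model.generate` play above.

```python
import queue
from threading import Thread

class ToyStreamer:
    """Minimal iterator fed by a producer thread (illustrative stand-in)."""
    _END = object()  # sentinel marking the end of the stream

    def __init__(self):
        self._queue = queue.Queue()

    def put(self, text):
        self._queue.put(text)

    def end(self):
        self._queue.put(self._END)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()  # blocks until the producer sends something
        if item is self._END:
            raise StopIteration
        return item

def fake_generate(streamer, pieces):
    """Pretend to be a model: emit text piece by piece, then signal the end."""
    for piece in pieces:
        streamer.put(piece)
    streamer.end()

streamer = ToyStreamer()
thread = Thread(target=fake_generate, args=(streamer, ["Hello", ", ", "world"]))
thread.start()

# Consume pieces as they arrive, building up the partial response
response = ""
partials = []
for new_text in streamer:
    response += new_text
    partials.append(response)

thread.join()
print(partials[-1])
```

Running generation in its own thread is what lets the caller start reading from the streamer immediately instead of waiting for the full answer.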

# Chatbot function and UI

Initialize RAG

In [125]:
rag_system = RAGSystem(
        top_k=3,
        similarity_threshold=0.5,
        dir_path="/kaggle/input/lvmh-financial-report-pdf",
        embedding_model=embedding_model,
        chunk_size=500,
        chunk_overlap=50
)

LLM is explicitly disabled. Using MockLLM.
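
The `chunk_size=500` and `chunk_overlap=50` arguments control how each document is split before embedding. As a rough mental model only, the sketch below slides a fixed-size window over a list of tokens so that consecutive chunks share `chunk_overlap` tokens. It is a simplification under stated assumptions: the tokens are placeholder strings and `chunk_tokens` is a hypothetical helper, whereas LlamaIndex's own splitter counts tokenizer tokens and tries to respect sentence boundaries.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

# Toy input: 1,200 placeholder "tokens"
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, chunk_size=500, chunk_overlap=50)

print(len(chunks), [len(c) for c in chunks])
```

The overlap means a figure that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.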



Define chatbot function

In [141]:
system_prompt = """
You are a financial analyst specializing in corporate earnings reports. 
Your task is to analyze the LVMH financial reports and provide accurate, concise, and well-structured responses.

- Present financial data in a clear, structured manner, using bullet points or tables when necessary.
- Where applicable, compare figures with previous years to highlight trends.
- Always cite the source at the end of your answer, following this format "Source : (file name, page number)"

Maintain a professional and neutral tone, avoiding unnecessary elaboration. 
Your goal is to provide **precise, data-driven insights** for financial analysis.
"""

In [142]:
def chat_interface(message, history):
    """Gradio chat function: streams an answer grounded in retrieved context."""
    # Get RAG prompt
    prompt = rag_system.build_prompt(message)

    # Prepend the system prompt on every call (not only when history is empty)
    # and build a new list instead of mutating Gradio's history in place
    chat = (
        [{"role": "system", "content": system_prompt}]
        + [{"role": m["role"], "content": m["content"]} for m in history]
        + [{"role": "user", "content": prompt}]
    )

    # Get the streamer object that will yield generated text
    streamer = generate_resp(chat, tokenizer, model, temperature=0.1)

    # Streaming response
    response = ""
    for new_text in streamer:
        response += new_text
        yield response

Define chatbot interface

In [None]:
chatbot=gr.ChatInterface(fn=chat_interface,
                 type="messages",
                 examples=["What are the key financial highlights of 2024?",
                           "What was the revenue distribution by geographic region in 2024?",
                           "How efficient is LVMH in managing its assets (ROA for 2024)?",
                           "Is LVMH positioned for growth in 2025?",
                           "What are the key financial risks LVMH might face in the coming years?",
                           "Does LVMH's financial data suggest that it's more dependent on organic growth or acquisitions?"
                           ])

Launch the chatbot

In [143]:
chatbot.launch()

* Running on local URL:  http://127.0.0.1:7893
Kaggle notebooks require sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

* Running on public URL: https://c072d02eb194e955c8.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)



