<a href="https://colab.research.google.com/github/zakariajaadi/data-science-portofolio/blob/main/financial-analyst-chatbot-with-rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a Financial Analyst Chatbot with RAG 🤖💰

Traditional financial analysis is a laborious process - sifting through densely packed reports, extracting the relevant numbers, connecting disparate data points, and interpreting what it all means. But what if you could simply have a conversation about finances and get instant, accurate insights?

That's what I've built here by combining the reasoning capabilities of an LLM (Mistral-7B-Instruct-v0.3) with the factual precision of Retrieval-Augmented Generation (RAG), using LVMH financial reports as the knowledge source for this demonstration.

With RAG, the chatbot doesn't just respond with generic information - it dives into actual financial documents, extracts the most relevant data, and delivers insights through natural conversation.

This Notebook Covers:

* Building a RAG Pipeline using `llama-index`
* Building a text generation pipeline for the chatbot using `Mistral-7B-Instruct-v0.3`
* Building a chatbot interface using `Gradio UI`


Watch the chatbot in action before we dive in:

# Demo video 🚀

In [None]:
# @title
from google.colab import drive
from IPython.display import Video, HTML

drive.mount('/content/drive')

Video("/content/drive/My Drive/colab demo videos/financial-chatbot-rag-demo.mp4", embed=True, width=800, height=600, html_attributes='controls loop autoplay')


# Let's dive in 🚀 :
## Libs and imports

In [2]:
!pip install -q -U llama-index
!pip install -q -U llama-index-embeddings-huggingface
!pip install -q -U llama-parse
!pip install -q -U optimum
!pip install -q -U bitsandbytes
!pip install -q -U gradio


In [3]:
import os
import torch
import gradio as gr
from threading import Thread

import nest_asyncio
import asyncio


from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TextIteratorStreamer


from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_parse import LlamaParse
from llama_index.core.prompts import RichPromptTemplate

## 1- Load the quantized Mistral Instruct model

In [4]:
# NF4 Quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
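
To get a feel for what this configuration buys, here is a rough back-of-the-envelope estimate of weight memory. It is a sketch under stated assumptions, not a measurement: the parameter count (about 7.25 billion) is approximate, the 0.127 bits per parameter of overhead for quantization constants is the figure reported in the QLoRA paper for double quantization, and the estimate ignores layers kept in higher precision, activations, and the KV cache.

```python
# Rough estimate of model weight memory at different precisions.
# Assumptions (not from the notebook): ~7.25e9 parameters, and ~0.127 extra
# bits per parameter for quantization constants with double quantization.

def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Return approximate weight memory in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 7.25e9  # approximate parameter count, an assumption

fp16_gb = weight_memory_gb(N_PARAMS, 16)        # 16-bit weights
nf4_gb = weight_memory_gb(N_PARAMS, 4 + 0.127)  # 4-bit weights plus constants

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"NF4 weights:    ~{nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the 16-bit footprint, which is the main reason for quantizing before loading the model on a single GPU.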

In [5]:
# Model checkpoint
model_checkpoint = "mistralai/Mistral-7B-Instruct-v0.3"
# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# Load Model
model = AutoModelForCausalLM.from_pretrained(
        model_checkpoint,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True)


## 2- RAG (Retrieval Augmented Generation)

Load an embedding model

In [6]:
embedding_model=HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")


RAG code

In [27]:
class RAGSystem:

    def __init__(self, dir_path, embedding_model, chunk_size=256, chunk_overlap=25, top_k=3, similarity_threshold=0.5):

        # ---- Configure Llama Index global settings ---- #

        Settings.embed_model = embedding_model
        Settings.llm = None  # Focus only on embedding generation
        Settings.chunk_size = chunk_size
        Settings.chunk_overlap = chunk_overlap

        # ----- Attributes ----- #

        self.dir_path = dir_path
        self.embedding_model = embedding_model
        self.top_k = top_k
        self.similarity_threshold = similarity_threshold

        self.documents = self._load_documents()
        self.index = self._create_index()
        self.query_engine = self._configure_query_engine()

        self.prompt_template = RichPromptTemplate("""
                        Context information is below.
                        ---------------------
                        {{ context_str }}
                        ---------------------
                        Given the context information and not prior knowledge, answer the following question : {{ query_str }}
                        """)

    # ---- Functions ---- #

    def _load_documents(self):
        """Load documents from the specified path."""

        # For async execution
        nest_asyncio.apply()

        # Set up parser
        parser = LlamaParse(result_type="markdown")
        file_extractor = {".pdf": parser}

        # Parse data into markdown
        reader = SimpleDirectoryReader(input_dir=self.dir_path, file_extractor=file_extractor)

        return asyncio.run(reader.aload_data())

    def _create_index(self):
        """Create vector index from documents."""

        # High-level API: takes a list of Document objects, splits them into chunks, embeds them, and builds the index
        return VectorStoreIndex.from_documents(self.documents)

    def _configure_query_engine(self):
        """Configure the retrieval query engine."""

        # Create a retriever that fetches the top K most similar chunks
        retriever = VectorIndexRetriever(
            index=self.index,
            similarity_top_k=self.top_k
        )

        # Create a query engine with similarity threshold
        retriever_query_engine= RetrieverQueryEngine(
            retriever=retriever,
            node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=self.similarity_threshold)]
        )

        return retriever_query_engine

    def build_prompt(self, query):
        """Build a RAG prompt with retrieved context."""

        # Retrieve knowledge
        response = self.query_engine.query(query)

        # Prepare context
        context_parts = []
        for node in response.source_nodes:

            # Extract node source
            file_path = node.metadata.get("file_path", "Unknown File")
            file_name = os.path.basename(file_path)
            page_number = node.metadata.get("page_label", "Unknown Page")
            source_info = f"Source : [file: {file_name} , page: {page_number}]"

            # Node text
            node_text=node.text

            # Add node text and source info to context
            context_parts.append(f"{source_info}\n{node_text}\n")

        context = "\n --- \n".join(context_parts)

        return self.prompt_template.format(context_str=context, query_str=query)


    #def generate_response(self, query, llm):
        #"""Generate a response using the RAG system and an LLM."""
        #prompt = self.build_prompt(query)
        #return llm.generate(prompt)
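
To make the retrieval step less of a black box, here is a minimal pure-Python sketch of what "top K plus similarity threshold" amounts to. It is an illustration under assumptions, not LlamaIndex's implementation: the `retrieve` helper and the two-dimensional toy vectors are hypothetical, and cosine similarity is assumed as the scoring function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, chunks, top_k=3, similarity_cutoff=0.5):
    """Return up to top_k (text, score) pairs whose score meets the cutoff.

    chunks is a list of (text, vector) pairs (hypothetical toy format).
    """
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in chunks]
    # Step 1: keep the top_k most similar chunks
    scored.sort(key=lambda pair: pair[1], reverse=True)
    top = scored[:top_k]
    # Step 2: drop any of those that fall below the similarity cutoff
    return [(text, score) for text, score in top if score >= similarity_cutoff]

# Toy "embeddings" chosen by hand for illustration only
toy_chunks = [
    ("revenue by region", [1.0, 0.0]),
    ("operating margin", [0.8, 0.6]),
    ("store openings", [0.0, 1.0]),
]

results = retrieve([1.0, 0.0], toy_chunks, top_k=2, similarity_cutoff=0.5)
strict = retrieve([1.0, 0.0], toy_chunks, top_k=2, similarity_cutoff=0.9)
for text, score in results:
    print(f"{score:.2f}  {text}")
```

Note that the cutoff is applied after the top-K selection, so a strict threshold can leave fewer than K chunks, or none at all, in the context.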

## 3- Text Generation code for the chatbot

In [8]:
def generate_resp(chat, tokenizer, model, temperature):
    """
    Start generation for the given chat history in a background thread
    and return a streamer that yields decoded text as it is produced.
    """
    # Ensure inference mode
    model.eval()

    # Apply the chat template
    formatted_chat = tokenizer.apply_chat_template(chat,
                                                  tokenize=False,
                                                  add_generation_prompt=True
                                                  )

    # Tokenize the chat
    inputs = tokenizer(formatted_chat,
                      return_tensors="pt",
                      add_special_tokens=False)

    # Move the tokenized inputs and attention masks to the same device the model is on
    inputs = {key: tensor.to(model.device) for key, tensor in inputs.items()}

    # Initialize streamer to handle tokens as they are generated (we pass the tokenizer for automatic decoding)
    streamer = TextIteratorStreamer(tokenizer,
                                    skip_special_tokens=True,
                                    skip_prompt=True)

    # Set generation parameters
    generation_kwargs = dict(
        **inputs,
        streamer=streamer,
        max_new_tokens=512,
        do_sample=True,
        temperature=temperature,
        pad_token_id=tokenizer.eos_token_id
    )

    # Run generation in a separate thread
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    return streamer
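
The thread-plus-streamer pattern above can be sketched without loading a model. The example below is a toy stand-in, not the `transformers` implementation: `ToyStreamer` and `toy_generate` are hypothetical names, and the "model" simply pushes a fixed list of words into a queue while the main thread consumes them.

```python
from queue import Queue
from threading import Thread

_END = object()  # sentinel marking the end of the stream

class ToyStreamer:
    """Iterator fed by a producer thread (simplified stand-in for a text streamer)."""

    def __init__(self):
        self._queue = Queue()

    def put(self, text):
        self._queue.put(text)

    def end(self):
        self._queue.put(_END)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()  # blocks until the producer adds something
        if item is _END:
            raise StopIteration
        return item

def toy_generate(words, streamer):
    """Pretend generation loop: emit one piece of text at a time, then signal the end."""
    for word in words:
        streamer.put(word + " ")
    streamer.end()

streamer = ToyStreamer()
thread = Thread(target=toy_generate, args=(["tokens", "arrive", "one", "by", "one"], streamer))
thread.start()

pieces = list(streamer)  # the consumer reads while the producer thread is running
thread.join()
print("".join(pieces))
```

Running generation in its own thread is what allows the caller to iterate over the streamer immediately instead of waiting for the full answer.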

## 4- Chatbot function and UI

Initialize RAG

In [12]:
import os
from google.colab import userdata
os.environ['LLAMA_CLOUD_API_KEY']=userdata.get('LLAMA_CLOUD_API_KEY')

In [28]:
rag_system = RAGSystem(
        top_k=3,
        similarity_threshold=0.5,
        dir_path="./lvmh-financial-data",
        embedding_model=embedding_model,
        chunk_size=500,
        chunk_overlap=50
)

LLM is explicitly disabled. Using MockLLM.
Started parsing the file under job_id d976b2c5-229b-4463-9320-d18d9d0258e1
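
The `chunk_size` and `chunk_overlap` arguments control how the parsed reports are split before embedding. The sketch below shows the basic sliding-window idea on a plain list of tokens. It is a simplification under assumptions: `chunk_with_overlap` is a hypothetical helper, and LlamaIndex's own splitter also takes sentence boundaries and a real tokenizer into account.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

# Toy input: 12 placeholder tokens, small sizes so the overlap is easy to see
tokens = [f"t{i}" for i in range(12)]
chunks = chunk_with_overlap(tokens, chunk_size=5, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.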


Define chatbot function

In [29]:
system_prompt = """
You are a financial analyst specializing in corporate earnings reports.
Your task is to analyze the LVMH financial reports and provide accurate, concise, and well-structured responses.

- Present financial data in a clear, structured manner, using bullet points or tables when necessary.
- Where applicable, compare figures with previous years to highlight trends.

Maintain a professional and neutral tone, avoiding unnecessary elaboration.
Your goal is to provide **precise, data-driven insights** for financial analysis.
"""

In [30]:
def chat_interface(message, history):
    """Gradio chat function: build a RAG prompt and stream the model's answer."""
    # Start from the system prompt on every turn, without mutating Gradio's history
    chat = [{"role": "system", "content": system_prompt}]

    # Keep only the role and content of previous turns
    chat += [{"role": turn["role"], "content": turn["content"]} for turn in history]

    # Get RAG prompt
    prompt = rag_system.build_prompt(message)

    # Add the current user turn, with the retrieved context included
    chat.append({"role": "user", "content": prompt})

    # Get the streamer object that will yield generated text
    streamer = generate_resp(chat, tokenizer, model, temperature=0.1)

    # Streaming response
    response = ""
    for new_text in streamer:
        response += new_text
        yield response
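
The streaming loop above yields the accumulated text rather than each new fragment, because Gradio treats every yielded value as the full message to display so far. A minimal sketch of that pattern in isolation (`stream_accumulated` is a hypothetical helper name, and the fragments are placeholder text):

```python
def stream_accumulated(fragments):
    """Yield the running concatenation of the fragments seen so far."""
    response = ""
    for fragment in fragments:
        response += fragment
        yield response

# Each update contains everything generated up to that point
updates = list(stream_accumulated(["Hel", "lo ", "there"]))
print(updates)
```

Yielding only the latest fragment would instead make the chat window show one fragment at a time, each replacing the previous one.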

Define chatbot interface

In [31]:
chatbot=gr.ChatInterface(fn=chat_interface,
                 type="messages",
                 examples=["What are the key financial highlights of 2024?",
                           "What was the revenue distribution by geographic region in 2024?",
                           "How efficient is LVMH in managing its assets (ROA for 2024)?",
                           "Is LVMH positioned for growth in 2025?",
                           "What are the key financial risks LVMH might face in the coming years?",
                           "Does LVMH's financial data suggest that it's more dependent on organic growth or acquisitions?"
                           ])

Launch the chatbot

In [32]:
chatbot.launch()

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://517941dc1c32f95556.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




# Et voilà! 🎉🤗