<a href="https://www.kaggle.com/code/krishnayarlagadda/rag-powered-personalized-etf-analysis?scriptVersionId=256358958" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# RAG-Powered Personalized ETF Data Analysis

## Introduction

Make better-informed ETF investment decisions with a personalized analysis tool. This project uses Retrieval Augmented Generation (RAG) to learn investment preferences from a YouTube financial expert of your choice. You can then upload data for the specific ETFs you're considering and chat with a bot that compares these funds against the expert's criteria, giving you tailored insights to inform your choices.

## Project Description

This project creates an intelligent system for analyzing and comparing ETFs based on your preferred financial guidance. Here's how it works:

1.  **Learn from Your Expert:** Provide the URL of a YouTube video featuring a financial expert discussing ETF evaluation. The system extracts and understands their key criteria and philosophy.
2.  **Analyze Your Funds:** Upload PDF documents containing data for the specific ETFs you want to evaluate.
3.  **Intelligent Comparison:** Engage with a RAG-powered chatbot. Ask questions about your chosen ETF in relation to the expert's learned criteria (e.g., "How does this ETF's expense ratio compare to the expert's ideal range?").
4.  **Grounded Insights:** The chatbot retrieves relevant information from the expert's video transcript and your fund data to provide contextually accurate and personalized comparisons.

## Generative AI Features

This project utilizes the following Generative AI capabilities:

1.  **Document Understanding:** Extracts key information from user-uploaded PDF documents containing ETF data.
2.  **Video Understanding (via Transcript Analysis):** Processes the text transcript of the chosen YouTube video to identify the financial expert's ETF evaluation criteria and preferences.
3.  **Embeddings:** Creates semantic representations of both the expert's preferences and the characteristics of the user-provided ETFs for efficient information retrieval.
4.  **Vector Search/Vector Store/Vector Database:** Stores and indexes these embeddings (using ChromaDB) to enable fast and relevant context retrieval for the chatbot.
5.  **Retrieval Augmented Generation (RAG):** Powers the core functionality of the chatbot. It retrieves relevant information from the learned expert preferences and the user's ETF data to generate grounded and personalized answers to user questions.
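
To make the embedding and vector-search steps concrete, here is a minimal, self-contained sketch of similarity-based retrieval. It is a toy stand-in only: the word-count vectors and the `toy_embed`, `cosine`, and `retrieve` helpers are hypothetical and are not the Gemini embeddings or the ChromaDB index used later in this notebook.

```python
import math
from collections import Counter

def toy_embed(text):
    """Toy embedding: a sparse word-count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, top_k=2):
    """Return the top_k documents ranked by similarity to the query."""
    query_vec = toy_embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, toy_embed(d)), reverse=True)
    return ranked[:top_k]

documents = [
    "the expert prefers a low expense ratio",
    "small cap funds carry higher risk",
    "the fund tracks a large cap index",
]
top = retrieve("what expense ratio does the expert prefer", documents, top_k=1)
print(top)
```

In a RAG setup, the retrieved text is then pasted into the prompt so the model answers from it rather than from memory alone.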

## Setup


In [1]:
!pip uninstall -qqy jupyterlab-lsp

!pip install -U -q "google-genai==1.7.0" youtube-transcript-api PyPDF2 "chromadb==0.6.3"

In [2]:
from google import genai
from google.genai import types
from IPython.display import HTML, Markdown, display
from chromadb import Documents, EmbeddingFunction, Embeddings

import chromadb
import sys
import os

from chromadb.config import Settings

genai.__version__

'1.7.0'

**Set up your API key**

To run the following cell, your API key must be stored in a Kaggle secret named GOOGLE_API_KEY.

To make the key available through Kaggle secrets, choose Secrets from the Add-ons menu and follow the instructions to add your key or enable it for this notebook.
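
If you also want to run the notebook outside Kaggle, one possible approach is to fall back to an environment variable when the `kaggle_secrets` module is unavailable. The `load_api_key` helper below is a hypothetical sketch under that assumption and is not part of the original notebook.

```python
import os

def load_api_key(name="GOOGLE_API_KEY"):
    """Return the key from Kaggle secrets when available, else from the environment."""
    try:
        # kaggle_secrets only exists inside Kaggle notebooks.
        from kaggle_secrets import UserSecretsClient
        return UserSecretsClient().get_secret(name)
    except Exception:
        # Assumed fallback for local runs: read an environment variable of the same name.
        key = os.environ.get(name)
        if not key:
            raise RuntimeError(f"{name} not found in Kaggle secrets or the environment")
        return key
```

With this helper, `GOOGLE_API_KEY = load_api_key()` could stand in for the direct secrets lookup in the next cell.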

In [3]:
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")

client = genai.Client(api_key=GOOGLE_API_KEY)


**Automated Retry**

In [4]:
# Define a retry policy. The model might make multiple consecutive calls automatically
# for a complex query; this ensures the client retries if it hits quota limits (HTTP 429)
# or the service is temporarily unavailable (HTTP 503).
from google.api_core import retry

is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

if not hasattr(genai.models.Models.generate_content, '__wrapped__'):
  genai.models.Models.generate_content = retry.Retry(
      predicate=is_retriable)(genai.models.Models.generate_content)
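
For intuition about what a retry wrapper does, the sketch below shows a simplified retry loop with exponential backoff. It is an illustration only, not how `google.api_core.retry.Retry` is implemented (that class also applies jitter and an overall timeout). The `call_with_retry` helper, the `QuotaError` class, and the `flaky` function are all hypothetical.

```python
import time

def call_with_retry(fn, is_retriable, max_attempts=5, initial_delay=1.0,
                    multiplier=2.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff while is_retriable(error) is true."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as error:
            # Give up on non-retriable errors or when attempts are exhausted.
            if attempt == max_attempts or not is_retriable(error):
                raise
            sleep(delay)
            delay *= multiplier

class QuotaError(Exception):
    """Hypothetical error type standing in for an HTTP 429 response."""
    code = 429

attempts = []

def flaky():
    """Fails twice with a quota error, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise QuotaError("quota exceeded")
    return "ok"

delays = []  # Record the waits instead of actually sleeping.
result = call_with_retry(
    flaky,
    lambda e: getattr(e, "code", None) in {429, 503},
    sleep=delays.append,
)
print(result, delays)
```

Passing `sleep=delays.append` keeps the demo instant while still showing that the wait doubles between attempts.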

## Extracting ETF Selection Criteria from a YouTube Video

In [5]:
from youtube_transcript_api import YouTubeTranscriptApi

def get_youtube_transcript(video_id):
    """
    Fetches the transcript from a YouTube video.

    Args:
        video_id (str): The ID of the YouTube video.

    Returns:
        str: The transcript text, or None if an error occurs.
    """
    try:
        # Attempt to retrieve the transcript. youtube-transcript-api 1.x replaced the
        # static get_transcript() call with an instance-level fetch(); to_raw_data()
        # converts the result to a list of dicts with a 'text' key.
        transcript_list = YouTubeTranscriptApi().fetch(video_id).to_raw_data()

        # Check if the transcript list is empty
        if not transcript_list:
            print("Error: No transcript found for this video.")
            return None

        # Concatenate the text from all transcript segments
        transcript_text = " ".join([item['text'] for item in transcript_list])
        return transcript_text

    except Exception as e:
        # Handle potential errors, such as video not found, no transcript available, etc.
        print(f"Error fetching transcript: {e}")
        return None

# Make sure the YouTube URL contains the unique ID of the video.
# A sample URL looks like this: "https://www.youtube.com/watch?v=DVb1hIqG9Zg&ab_channel=ZietInvests"

video_id = "DVb1hIqG9Zg"  # Replace with the actual video ID
transcript = get_youtube_transcript(video_id)

if transcript:
    print("Transcript:")
    print(transcript)
else:
    print("Failed to get transcript.")
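
Since the function expects a video ID rather than a full URL, a small helper can pull the ID out of a pasted link. The `extract_video_id` function below is a hypothetical sketch that handles only the two common URL shapes (`watch?v=...` and `youtu.be/...`); other YouTube URL formats are not covered.

```python
from urllib.parse import urlparse, parse_qs

YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}

def extract_video_id(url):
    """Return the YouTube video ID from a watch or youtu.be URL, or None if absent."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    if host in YOUTUBE_HOSTS:
        # Watch links carry the ID in the v query parameter.
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None

print(extract_video_id("https://www.youtube.com/watch?v=DVb1hIqG9Zg&ab_channel=ZietInvests"))
```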


## Summarize ETF Selection Criteria using AI Model

In [6]:
def analyze_etf_video(video_id):
    """
    Summarizes the expert's ETF selection criteria from a video transcript.

    Args:
        video_id (str): The ID of the YouTube video.

    Returns:
        str: The model's analysis, or None if the transcript or model call fails.
    """
    transcript = get_youtube_transcript(video_id)
    if not transcript:
        print("Error: Could not retrieve transcript.")
        return None

    prompt = f"""
        You are a financial analyst specializing in ETF research. Analyze the provided content and extract:

        1.  **Comprehensive Summary**:
            -   Key themes and insights about ETF investing
            -   Market outlook or sector analysis mentioned
            -   Any quantitative data (returns, ratios, metrics)

        2.  **ETF Investment Framework**:
            -   Step-by-step methodology for ETF selection
            -   Portfolio construction principles
            -   Risk management approaches mentioned

        3.  **Specific ETF Mentions**:
            -   For each ETF mentioned, extract:
                * Ticker symbol (if provided)
                * Asset class/category
                * Any performance metrics
                * Expense ratios
                * Notable holdings or strategy

        4.  **Selection Criteria Analysis**:
            -   Quantitative factors discussed (expense ratios, liquidity, AUM)
            -   Qualitative factors (management team, index methodology)
            -   Tax considerations
            -   Diversification benefits

        5.  **Actionable Insights**:
            -   Any specific recommendations
            -   Buy/hold/sell suggestions with rationale
            -   Portfolio allocation percentages if mentioned

        Here is the transcript:
        ```{transcript}```
    """

    try:
        # Reuse the google-genai client created in the setup cell so that the
        # API key and retry policy configured there apply to this call.
        response = client.models.generate_content(
            model="gemini-1.5-flash",
            contents=prompt,
        )
        return response.text
    except Exception as e:
        print(f"Error: {e}")
        return None


# result = analyze_etf_video("DVb1hIqG9Zg")
# print(result)


## Extract ETF Details from PDF 

In [7]:
import PyPDF2

def read_pdf(file_path):
    """
    Reads text content from a PDF file.

    Args:
        file_path (str): Path to the PDF file.

    Returns:
        str: Text content of the PDF, or None on error.
    """
    text = ""
    try:
        with open(file_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            for page in reader.pages:
                text += page.extract_text() or ""  # Handle None page text
        return text
    except Exception as e:
        print(f"Error reading PDF: {e}")
        return None


## Embedding and Chunk Functions

In [8]:
class GeminiEmbeddingFunction(EmbeddingFunction):
    document_mode = True
    @retry.Retry(predicate=is_retriable)
    def __call__(self, input: Documents) -> Embeddings:
        if self.document_mode:
            embedding_task = "retrieval_document"
        else:
            embedding_task = "retrieval_query"

        response = client.models.embed_content(
            model="models/text-embedding-004",
            contents=input,
            config=types.EmbedContentConfig(
                task_type=embedding_task,
            ),
        )
        return [e.values for e in response.embeddings]


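def chunk_text(text, chunk_size=500, overlap=50):
    """Splits text into chunks of chunk_size characters that overlap by overlap characters."""
    if not 0 <= overlap < chunk_size:
        # Otherwise start would never advance and the loop would not terminate.
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start = end - overlap  # Step forward, re-reading the last `overlap` characters
    return chunks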
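
As a quick illustration of the overlapping-chunk logic, the sketch below repeats the same loop on a short string with deliberately small, illustrative values for `chunk_size` and `overlap`, so the shared characters between neighbouring chunks are easy to see.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Splits text into overlapping chunks (same loop as the notebook's function)."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks

# Small sizes chosen only to make the overlap visible.
demo_chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
for chunk in demo_chunks:
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk begins with the last three characters of the previous one, so a sentence cut at a chunk boundary still appears intact in at least one chunk.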


## Load and Store Data in ChromaDB

In [9]:
def load_data_to_chromadb(video_id, pdf_files, collection_name="etf_data"):
    embed_fn = GeminiEmbeddingFunction()
    embed_fn.document_mode = True

    chroma_client = chromadb.Client(Settings(anonymized_telemetry=False))
    collection = chroma_client.get_or_create_collection(name=collection_name, embedding_function=embed_fn)

    # YouTube Data
    youtube_analysis = analyze_etf_video(video_id)
    if youtube_analysis:
        youtube_chunks = chunk_text(youtube_analysis)
        print(f"Adding {len(youtube_chunks)} YouTube chunks to collection.")
        collection.add(
            documents=youtube_chunks,
            ids=[f"youtube_{video_id}_chunk_{i}" for i in range(len(youtube_chunks))],
            metadatas=[{"source": "youtube", "video_id": video_id, "type": "analysis"} for _ in youtube_chunks]
        )
    else:
        print("No YouTube analysis available.")

    # PDF Data
    for pdf_file in pdf_files:
        pdf_text = read_pdf(pdf_file)
        if pdf_text:
            pdf_chunks = chunk_text(pdf_text)
            print(f"Adding {len(pdf_chunks)} PDF chunks from {pdf_file} to collection.")
            collection.add(
                documents=pdf_chunks,
                ids=[f"pdf_{os.path.basename(pdf_file)}_chunk_{i}" for i in range(len(pdf_chunks))],
                metadatas=[{"source": "pdf", "file_name": pdf_file, "type": "document"} for _ in pdf_chunks]
            )
        else:
            print(f"Failed to read PDF: {pdf_file}")

    # Report the total number of chunks stored in the collection
    print(f"Number of documents in collection '{collection_name}': {collection.count()}")

    print("✅ Data loaded into ChromaDB.")


video_id = "DVb1hIqG9Zg"
pdf_files = [
    "/kaggle/input/d/krishnayarlagadda/schwab-etf-data/schwab_us_large_cap_etf.pdf",
    "/kaggle/input/d/krishnayarlagadda/schwab-etf-data/schwab_us_mid_cap_etf.pdf",
    "/kaggle/input/d/krishnayarlagadda/schwab-etf-data/schwab_us_small_cap_etf.pdf"
]

load_data_to_chromadb(video_id, pdf_files)
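
The loader relies on every chunk having a unique ID and a metadata record that identifies its source. The sketch below isolates that bookkeeping in a hypothetical `build_records` helper; the `name` and `chunk_index` metadata keys and the sample path are assumptions for illustration, not the keys or files the notebook uses.

```python
import os

def build_records(chunks, source, name):
    """Build parallel ids/metadatas lists for a batch of chunks from one source."""
    label = os.path.basename(name)
    ids = [f"{source}_{label}_chunk_{i}" for i in range(len(chunks))]
    metadatas = [
        {"source": source, "name": name, "chunk_index": i}
        for i in range(len(chunks))
    ]
    return ids, metadatas

ids, metadatas = build_records(["first chunk", "second chunk"], "pdf", "/data/example_fund.pdf")
print(ids)
# ['pdf_example_fund.pdf_chunk_0', 'pdf_example_fund.pdf_chunk_1']
```

Because the ID is derived from the file's base name, two PDFs with the same file name in different folders would produce colliding IDs; including more of the path is one way to avoid that.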


## RAG Chat Implementation

In [10]:
def rag_chat(query, collection_name="etf_data", top_k=5):
    # Set up embedding function for the query
    embed_fn = GeminiEmbeddingFunction()
    embed_fn.document_mode = False  # Query mode

    # Connect to ChromaDB and get collection
    chroma_client = chromadb.Client(Settings(anonymized_telemetry=False))
    collection = chroma_client.get_or_create_collection(
        name=collection_name, embedding_function=embed_fn
    )

    # Query ChromaDB
    results = collection.query(query_texts=[query], n_results=top_k)

    relevant_chunks = results["documents"][0] if results["documents"] else []
    metadatas = results["metadatas"][0] if results["metadatas"] else []

    # Build context and citation map
    context_blocks = []
    citations = []
    for i, (chunk, meta) in enumerate(zip(relevant_chunks, metadatas)):
        source_info = ""
        if meta.get("source") == "youtube":
            source_info = f"(YouTube Video ID: {meta.get('video_id', 'N/A')})"
        elif meta.get("source") == "pdf":
            source_info = f"(PDF: {meta.get('file_name', 'N/A')})"
        citations.append(f"[{i+1}] {source_info}")
        context_blocks.append(f"[{i+1}] {chunk}")

    full_context = "\n\n".join(context_blocks)
    citation_text = "\n".join(citations)

    # Gemini Prompt
    prompt = f"""
You are a helpful financial assistant. Use the following context from an ETF expert's video and user-uploaded PDFs to answer the user's question. Refer to the numbered citations when you provide your answer.

---Context---
{full_context}
--------------

Question: {query}

Answer with references like [1], [2] where appropriate.
"""

    try:
        # Use the google-genai client from the setup cell, which already holds the API key
        response = client.models.generate_content(
            model="gemini-1.5-flash",
            contents=prompt,
        )
        answer = response.text.strip()
        return f"{answer}\n\nSources:\n{citation_text}"
    except Exception as e:
        return f"Error generating answer: {e}"

In [11]:
user_question = "What does the expert say about small cap ETFs and their risk?"
response = rag_chat(user_question)
print(response)
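
The context-building step can be exercised without calling ChromaDB or Gemini. The sketch below applies the same numbering and citation logic to a hand-written dictionary shaped like a single-query result; the `build_context` helper and the sample values are made up for illustration.

```python
def build_context(results):
    """Turn a ChromaDB-style query result into numbered context and a citation list."""
    # For a single query, the matches sit at index 0 of each list.
    chunks = results["documents"][0] if results.get("documents") else []
    metas = results["metadatas"][0] if results.get("metadatas") else []
    context_blocks, citations = [], []
    for number, (chunk, meta) in enumerate(zip(chunks, metas), start=1):
        if meta.get("source") == "youtube":
            label = f"(YouTube Video ID: {meta.get('video_id', 'N/A')})"
        elif meta.get("source") == "pdf":
            label = f"(PDF: {meta.get('file_name', 'N/A')})"
        else:
            label = "(unknown source)"
        context_blocks.append(f"[{number}] {chunk}")
        citations.append(f"[{number}] {label}")
    return "\n\n".join(context_blocks), "\n".join(citations)

# Hand-written sample with made-up values, in the shape returned for one query text.
sample = {
    "documents": [["Expense ratio: 0.04%", "Prefer broad, low-cost index funds."]],
    "metadatas": [[
        {"source": "pdf", "file_name": "example_fund.pdf"},
        {"source": "youtube", "video_id": "DVb1hIqG9Zg"},
    ]],
}
context, citations = build_context(sample)
print(context)
print(citations)
```

Keeping the numbering identical in the context and the citation list is what lets the model's `[1]`, `[2]` references map back to a source.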
