# Financial Report Summarizer with ChromaDB and HuggingFace EDGAR Corpus

This notebook implements a complete RAG (Retrieval-Augmented Generation) system for financial document analysis using:
- **ChromaDB** for persistent vector storage (instead of FAISS)
- **HuggingFace EDGAR Corpus** dataset (real SEC filings)
- **FinBERT** embeddings optimized for financial text
- **Advanced techniques**: Hybrid search, re-ranking, few-shot prompting

## Features
1. ‚úÖ Persistent ChromaDB vector database
2. ‚úÖ Real EDGAR corpus from HuggingFace
3. ‚úÖ Hybrid search (semantic + keyword)
4. ‚úÖ Cross-encoder re-ranking
5. ‚úÖ Few-shot prompting
6. ‚úÖ GPU-accelerated embeddings

In [8]:
# Cell 1: Install ALL packages with correct versions

!pip uninstall -y httpx
!pip install -q sentence-transformers==2.2.2
!pip install -q transformers==4.35.2
!pip install -q datasets==2.14.6
!pip install -q openai==1.14.0 httpx==0.27.0
!pip install -q pypdf==3.17.1
!pip install -q chromadb==0.4.18
!pip install -q torch torchvision torchaudio

print("‚úÖ All packages installed!")

Found existing installation: httpx 0.28.1
Uninstalling httpx-0.28.1:
  Successfully uninstalled httpx-0.28.1
[0m‚úÖ All packages installed!


In [9]:
# Cell 2: Configure OpenAI API Key

import os

# Set your OpenAI API key here
os.environ['OPENAI_API_KEY'] = ''  # Replace with your actual key

# Or if running in Colab, you can use this:
# from google.colab import userdata
# os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

print("‚úÖ API key configured!")

‚úÖ API key configured!


## üìä ChromaDB-Powered RAG System

This implementation uses **ChromaDB** for persistent vector storage instead of FAISS.

### Key Advantages:
- **Persistent**: Data survives notebook restarts
- **Scalable**: Handles millions of documents efficiently
- **Metadata filtering**: Can filter by company, date, section, etc.
- **No manual indexing**: Automatically indexes on insert

In [1]:
# Cell 2: Configure OpenAI API Key

import os

# Set your OpenAI API key here
os.environ['OPENAI_API_KEY'] = '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'  # Replace with your actual key

# Or if running in Colab, you can use this:
# from google.colab import userdata
# os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

print("‚úÖ API key configured!")


import os
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import numpy as np
import pypdf
from openai import OpenAI
from typing import Optional, List, Dict
import re

class FinBERTFinancialRAG:
    """
    Complete Financial RAG System with ChromaDB:
    - FinBERT embeddings for financial text understanding
    - ChromaDB for persistent vector storage
    - Support for EDGAR corpus
    - Support for PDF uploads
    - GPU acceleration for embeddings
    """

    def __init__(self, use_finbert: bool = True, persist_directory: str = None):
        """
        Initialize the system

        Args:
            use_finbert: If True, use FinBERT. If False, use faster model.
            persist_directory: Where to store ChromaDB data
        """
        print("ü§ñ Initializing Financial RAG System with ChromaDB...")

        # Set up persistent storage
        if persist_directory is None:
            persist_directory = os.path.expanduser("~/FinancialAI/chromadb")

        os.makedirs(persist_directory, exist_ok=True)
        print(f"üìÅ Database location: {persist_directory}")

        # Initialize ChromaDB client
        self.chroma_client = chromadb.PersistentClient(path=persist_directory)

        # Choose embedding model
        if use_finbert:
            model_name = "ProsusAI/finbert"
            print(f"  üìä Loading FinBERT (optimized for finance)...")
        else:
            model_name = "all-MiniLM-L6-v2"
            print(f"  üìä Loading Sentence Transformer (faster)...")

        self.embedder = SentenceTransformer(model_name)
        self.embedding_dim = self.embedder.get_sentence_embedding_dimension()

        print(f"  ‚úÖ Model loaded! Embedding dimension: {self.embedding_dim}")

        # Create or get collection
        collection_name = "financial_filings"

        try:
            # Try to get existing collection
            self.collection = self.chroma_client.get_collection(name=collection_name)
            print(f"  ‚úÖ Loaded existing collection: {collection_name}")
            print(f"  üìä Documents in collection: {self.collection.count()}")

            # Rebuild chunks and metadata from existing collection
            self._rebuild_chunks_from_collection()

        except:
            # Create new collection
            self.collection = self.chroma_client.create_collection(
                name=collection_name,
                metadata={"description": "Financial SEC filings with FinBERT embeddings"}
            )
            print(f"  ‚úÖ Created new collection: {collection_name}")

            # Initialize empty lists
            self.chunks = []
            self.chunk_metadata = []

        # Initialize OpenAI
        api_key = os.getenv('OPENAI_API_KEY')
        if api_key:
            self.client = OpenAI(api_key=api_key)
        else:
            print("  ‚ö†Ô∏è  OpenAI API key not set - you'll need to set it before asking questions")
            self.client = None

        self.documents_loaded = []

        print("‚úÖ System ready!\n")

    def _rebuild_chunks_from_collection(self):
        """Rebuild chunks and metadata lists from ChromaDB collection"""
        if self.collection.count() == 0:
            self.chunks = []
            self.chunk_metadata = []
            return

        # Get all documents from collection
        results = self.collection.get()

        # Rebuild chunks and metadata
        self.chunks = results['documents']
        self.chunk_metadata = results['metadatas']

        print(f"  üì¶ Loaded {len(self.chunks)} chunks from collection")

    def load_from_edgar(self, document):
        """
        Load a document from EDGAR corpus

        Args:
            document: Single row from EDGAR dataset
        """
        company = document['company']
        filing_type = document['filing_type']
        filing_date = document['filing_date']

        print(f"üìÑ Loading: {company} - {filing_type} ({filing_date})")

        # Extract available sections
        sections_data = []

        if document.get('item_1'):
            sections_data.append(('Item 1 - Business', document['item_1']))

        if document.get('item_1a'):
            sections_data.append(('Item 1A - Risk Factors', document['item_1a']))

        if document.get('item_7'):
            sections_data.append(('Item 7 - MD&A', document['item_7']))

        if document.get('item_7a'):
            sections_data.append(('Item 7A - Quantitative Disclosures', document['item_7a']))

        print(f"  üìë Found {len(sections_data)} sections")

        # Chunk each section
        all_chunks = []
        all_metadatas = []
        all_ids = []
        doc_chunk_count = 0

        for section_name, section_text in sections_data:
            chunks = self._chunk_text(section_text)

            for chunk in chunks:
                # Create unique ID
                doc_id = f"{company}_{filing_date}_{section_name}_{doc_chunk_count}"
                doc_id = doc_id.replace(' ', '_').replace('/', '_').replace('-', '_')

                all_chunks.append(chunk)
                all_metadatas.append({
                    'company': company,
                    'filing_type': filing_type,
                    'filing_date': filing_date,
                    'section': section_name,
                    'source': 'EDGAR'
                })
                all_ids.append(doc_id)
                doc_chunk_count += 1

        # Generate embeddings
        print(f"  üßÆ Generating embeddings for {len(all_chunks)} chunks...")
        embeddings = self.embedder.encode(
            all_chunks,
            show_progress_bar=False,
            batch_size=32,
            convert_to_numpy=True,
            device='cuda'
        )

        # Add to ChromaDB
        self.collection.add(
            embeddings=embeddings.tolist(),
            documents=all_chunks,
            metadatas=all_metadatas,
            ids=all_ids
        )

        # Also add to local lists for backward compatibility with HybridSearch
        self.chunks.extend(all_chunks)
        self.chunk_metadata.extend(all_metadatas)

        # Track document
        self.documents_loaded.append({
            'company': company,
            'filing_type': filing_type,
            'filing_date': filing_date,
            'source': 'EDGAR',
            'chunks': doc_chunk_count
        })

        print(f"  ‚úÖ Added {doc_chunk_count} chunks to ChromaDB")
        print(f"  üìä Total documents in DB: {self.collection.count()}\n")

    def load_from_pdf(self, pdf_path: str, company_name: str):
        """
        Load and process a PDF file

        Args:
            pdf_path: Path to PDF file
            company_name: Name of company
        """
        print(f"üìÑ Loading PDF: {pdf_path}")

        # Extract text from PDF
        text = ""
        try:
            reader = pypdf.PdfReader(pdf_path)
            total_pages = len(reader.pages)
            print(f"  üìñ Extracting text from {total_pages} pages...")

            for page in reader.pages:
                text += page.extract_text()

            print(f"  ‚úÖ Extracted {len(text)} characters")

        except Exception as e:
            print(f"  ‚ùå Error reading PDF: {e}")
            return

        # Chunk the text
        chunks = self._chunk_text(text)
        print(f"  ‚úÇÔ∏è  Created {len(chunks)} chunks")

        # Generate embeddings
        print(f"  üßÆ Generating embeddings...")
        embeddings = self.embedder.encode(
            chunks,
            show_progress_bar=True,
            batch_size=32,
            convert_to_numpy=True,
            device='cuda'
        )

        # Prepare for ChromaDB
        all_ids = []
        all_metadatas = []

        for i in range(len(chunks)):
            doc_id = f"{company_name}_PDF_{i}"
            doc_id = doc_id.replace(' ', '_').replace('/', '_')

            all_ids.append(doc_id)
            all_metadatas.append({
                'company': company_name,
                'filing_type': 'PDF Upload',
                'filing_date': 'N/A',
                'section': 'PDF Document',
                'source': 'PDF'
            })

        # Add to ChromaDB
        self.collection.add(
            embeddings=embeddings.tolist(),
            documents=chunks,
            metadatas=all_metadatas,
            ids=all_ids
        )

        # Add to local lists
        self.chunks.extend(chunks)
        self.chunk_metadata.extend(all_metadatas)

        # Track document
        self.documents_loaded.append({
            'company': company_name,
            'filing_type': 'PDF Upload',
            'filing_date': 'N/A',
            'source': 'PDF',
            'chunks': len(chunks)
        })

        print(f"  ‚úÖ Added {len(chunks)} chunks to ChromaDB")
        print(f"  üìä Total documents in DB: {self.collection.count()}\n")

    def _chunk_text(self, text: str, chunk_size: int = 500):
        """
        Split text into chunks

        Args:
            text: Text to chunk
            chunk_size: Target size for each chunk

        Returns:
            List of text chunks
        """
        # Split into paragraphs
        paragraphs = text.split('\n\n')

        chunks = []
        current_chunk = ""

        for para in paragraphs:
            para = para.strip()

            # Skip very short paragraphs
            if len(para) < 50:
                continue

            # If adding this paragraph exceeds chunk_size, save current chunk
            if len(current_chunk) + len(para) > chunk_size and current_chunk:
                chunks.append(current_chunk.strip())
                current_chunk = para
            else:
                # Add to current chunk
                current_chunk += "\n\n" + para if current_chunk else para

        # Add final chunk
        if current_chunk:
            chunks.append(current_chunk.strip())

        return chunks

    def build_index(self, use_gpu: bool = False):
        """
        Build index - for ChromaDB this is a no-op as indexing happens automatically
        This method exists for backward compatibility with the FAISS version

        Args:
            use_gpu: Ignored for ChromaDB
        """
        print("‚ÑπÔ∏è  ChromaDB indexes automatically - no manual build needed!")
        print(f"‚úÖ Collection ready with {self.collection.count()} documents")

    def ask(self, question: str, top_k: int = 5):
        """
        Ask a question about the documents

        Args:
            question: Question to ask
            top_k: Number of relevant chunks to retrieve

        Returns:
            Generated answer
        """
        if self.collection.count() == 0:
            print("‚ùå No documents loaded! Please load documents first with build_index()")
            return None

        if self.client is None:
            print("‚ùå OpenAI API key not set!")
            return None

        print(f"‚ùì Question: {question}\n")
        print("  üîç Searching ChromaDB for relevant information...")

        # Generate question embedding
        q_embedding = self.embedder.encode([question], device='cuda')

        # Query ChromaDB
        results = self.collection.query(
            query_embeddings=q_embedding.tolist(),
            n_results=top_k
        )

        # Extract results
        chunks = results['documents'][0]
        metadatas = results['metadatas'][0]

        # Build context
        context_parts = []
        sources_used = []

        for i, (chunk, meta) in enumerate(zip(chunks, metadatas)):
            source_info = f"{meta['company']} | {meta['section']}"
            sources_used.append(source_info)
            context_parts.append(f"[Source {i+1}: {source_info}]\n{chunk}")

        context = "\n\n---\n\n".join(context_parts)

        # Generate answer
        prompt = f"""You are an expert financial analyst with deep knowledge of SEC filings and financial statements.\n\nContext from financial documents:\n{context}\n\nQuestion: {question}\n\nInstructions:\n1. Answer ONLY using information from the context above\n2. Think step-by-step if calculations are needed\n3. Always cite which source (company name and section) you're using\n4. Show your work for any calculations or comparisons\n5. Be precise with numbers and units (millions, billions, percentages)\n6. If information is not in the context, explicitly state "Information not available"\n\nYour analysis:"""

        print("  ü§î Generating answer with GPT-3.5-turbo...")

        try:
            response = self.client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "You are an expert financial analyst."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.3,
                max_tokens=800
            )

            answer = response.choices[0].message.content

            print("\n" + "="*70)
            print("üìä ANSWER")
            print("="*70)
            print(answer)
            print("="*70)

            print("\nüìö Sources Used:")
            for i, source in enumerate(set(sources_used), 1):
                print(f"  {i}. {source}")
            print()

            return answer

        except Exception as e:
            print(f"‚ùå Error generating answer: {e}")
            return None

    def list_documents(self):
        """Show all loaded documents"""

        if not self.documents_loaded:
            print("üì≠ No documents loaded yet")
            return

        print(f"üìö Loaded Documents ({len(self.documents_loaded)}):\n")

        for i, doc in enumerate(self.documents_loaded, 1):
            print(f"{i}. {doc['company']}")
            print(f"   Type: {doc['filing_type']}")
            print(f"   Source: {doc['source']}")
            print(f"   Date: {doc.get('filing_date', 'N/A')}")
            print(f"   Chunks: {doc['chunks']}")
            print()

    def delete_collection(self):
        """Delete the entire collection"""
        self.chroma_client.delete_collection(name="financial_filings")
        self.chunks = []
        self.chunk_metadata = []
        self.documents_loaded = []
        print("üóëÔ∏è  Collection deleted!")


# Initialize the system
print("="*70)
print("   üöÄ FINANCIAL AI SYSTEM - CHROMADB + FINBERT")
print("="*70)
print()

rag = FinBERTFinancialRAG(use_finbert=True)

print("="*70)

‚úÖ API key configured!


  from .autonotebook import tqdm as notebook_tqdm
[0;93m2025-11-30 23:44:26.512100659 [W:onnxruntime:Default, device_discovery.cc:164 DiscoverDevicesForPlatform] GPU device discovery failed: device_discovery.cc:89 ReadFileContents Failed to open file: "/sys/class/drm/card0/device/vendor"[m
Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given


   üöÄ FINANCIAL AI SYSTEM - CHROMADB + FINBERT

ü§ñ Initializing Financial RAG System with ChromaDB...
üìÅ Database location: /root/FinancialAI/chromadb
  üìä Loading FinBERT (optimized for finance)...


No sentence-transformers model found with name /root/.cache/torch/sentence_transformers/ProsusAI_finbert. Creating a new one with MEAN pooling.
Failed to send telemetry event CollectionGetEvent: capture() takes 1 positional argument but 3 were given


  ‚úÖ Model loaded! Embedding dimension: 768
  ‚úÖ Loaded existing collection: financial_filings
  üìä Documents in collection: 1366
  üì¶ Loaded 1366 chunks from collection
‚úÖ System ready!



## üì• Load Real EDGAR Corpus from HuggingFace

We're using the `eloukas/edgar-corpus` dataset which contains real SEC filings from public companies.

**Dataset Features:**
- Real 10-K filings from S&P 500 companies
- Multiple sections: Item 1, 1A, 7, 7A
- Structured data with company names, dates, and filing types

In [13]:
# Cell 4: Load REAL EDGAR Corpus from HuggingFace

from datasets import load_dataset
import time

print("="*70)
print("   üì• LOADING EDGAR CORPUS FROM HUGGINGFACE")
print("="*70)
print()

start_time = time.time()

# Load dataset - Final model configuration
# Total available: 20,000+ companies in EDGAR corpus
# Using 500 companies for batch processing demonstration
NUM_COMPANIES = 10000

print(f"üåê Loading {NUM_COMPANIES} companies from edgar-corpus...")
print("‚è≥ This may take a few minutes on first load...\n")

dataset = load_dataset(
    "eloukas/edgar-corpus",
    split=f"train[:{NUM_COMPANIES}]"
)

elapsed = time.time() - start_time

print(f"‚úÖ Loaded {len(dataset)} companies in {elapsed:.2f} seconds\n")

# DIAGNOSTIC: Check dataset structure
print("="*70)
print("   üîç DATASET STRUCTURE DIAGNOSTIC")
print("="*70)

if len(dataset) > 0:
    first_item = dataset[0]
    print(f"\nüìä Available fields in dataset:")
    for key in first_item.keys():
        value = first_item[key]
        if isinstance(value, str):
            preview = value[:100] + "..." if len(value) > 100 else value
        else:
            preview = str(value)
        print(f"  ‚Ä¢ {key}: {type(value).__name__} - {preview}")

    print(f"\nüìä Item fields: {[k for k in first_item.keys() if k.startswith('item_')]}")
else:
    print("‚ùå Dataset is empty!")

print("\n" + "="*70)

   üì• LOADING EDGAR CORPUS FROM HUGGINGFACE

üåê Loading 10000 companies from edgar-corpus...
‚è≥ This may take a few minutes on first load...

‚úÖ Loaded 10000 companies in 0.96 seconds

   üîç DATASET STRUCTURE DIAGNOSTIC

üìä Available fields in dataset:
  ‚Ä¢ filename: str - 92116_1993.txt
  ‚Ä¢ cik: str - 92116
  ‚Ä¢ year: str - 1993
  ‚Ä¢ section_1: str - Item 1. Business
General
Southern California Water Company (the "Registrant") is a public utility co...
  ‚Ä¢ section_1A: str - 
  ‚Ä¢ section_1B: str - 
  ‚Ä¢ section_2: str - Item 2 - Properties
Franchises, Competition, Acquisitions and Condemnation of Properties
The Registr...
  ‚Ä¢ section_3: str - Item 3. Legal Proceedings
On October 20, 1993, the Registrant and the Internal Revenue Service ("IRS...
  ‚Ä¢ section_4: str - Item 4. Submission of Matters to a Vote of Security Holders
No matter was submitted during the fourt...
  ‚Ä¢ section_5: str - Item 5. Market for Registrant's Common Equity and Related Stockholder Mat

In [14]:
# Cell 5: Load documents from EDGAR corpus into ChromaDB with GPU-Optimized Batching

import time

# =============================================================================
# HELPER FUNCTION FOR EMBEDDING GENERATION
# =============================================================================

def get_finbert_embedding(text):
    """Generate FinBERT embedding for a single text"""
    embedding = rag.embedder.encode(
        [text],
        show_progress_bar=False,
        convert_to_numpy=True,
        device='cuda'
    )
    return embedding[0]

# =============================================================================
# PROCESS WITH BATCHING (FASTER ON GPU)
# =============================================================================

print("="*70)
print("   üì§ LOADING DOCUMENTS INTO CHROMADB (BATCHED)")
print("="*70)
print()

print(f"‚öôÔ∏è  Processing {NUM_COMPANIES} companies with GPU batching...\n")
start_time = time.time()

# Batch processing for GPU efficiency
BATCH_SIZE = 8  # Process 8 companies at once on GPU
batch_texts = []
batch_metadatas = []
batch_ids = []
processed_count = 0
error_count = 0

# Define which sections to process (most important financial sections)
SECTIONS_TO_PROCESS = [
    ('section_1', 'Business Description'),
    ('section_7', 'MD&A'),
    ('section_8', 'Financial Statements')
]

for idx, company_data in enumerate(dataset):
    if idx >= NUM_COMPANIES:
        break

    try:
        # Extract metadata fields - use actual field names from dataset
        cik = company_data.get('cik', f'Unknown_{idx}')
        year = company_data.get('year', 'Unknown')
        filename = company_data.get('filename', f'doc_{idx}.txt')

        # Process each relevant section
        for section_field, section_name in SECTIONS_TO_PROCESS:
            section_text = company_data.get(section_field, '')

            # Validate text content
            if not section_text or len(section_text.strip()) < 100:
                continue

            # Take sample (first 3000 characters for faster processing)
            text_sample = section_text[:3000]

            # Add to batch
            batch_texts.append(text_sample)
            batch_ids.append(f"cik_{cik}_{year}_{section_field}_{len(batch_texts)}")
            batch_metadatas.append({
                "company": f"CIK {cik}",
                "filing_type": "10-K",
                "filing_date": str(year),
                "section": section_name,
                "cik": cik,
                "year": year,
                "index": idx
            })

            # Process batch when full
            if len(batch_texts) >= BATCH_SIZE:
                try:
                    # Create embeddings for batch (GPU accelerated)
                    embeddings = rag.embedder.encode(
                        batch_texts,
                        show_progress_bar=False,
                        batch_size=BATCH_SIZE,
                        convert_to_numpy=True,
                        device='cuda'
                    )

                    # Add to ChromaDB
                    rag.collection.add(
                        documents=batch_texts,
                        embeddings=embeddings.tolist(),
                        ids=batch_ids,
                        metadatas=batch_metadatas
                    )

                    # Update local tracking
                    rag.chunks.extend(batch_texts)
                    rag.chunk_metadata.extend(batch_metadatas)

                    processed_count += len(batch_texts)

                    # Progress update every 50 documents
                    if processed_count % 50 == 0:
                        elapsed = time.time() - start_time
                        rate = processed_count / elapsed
                        print(f"   ‚úì {processed_count} chunks | {rate:.1f} chunks/sec | {elapsed:.1f}s elapsed")

                except Exception as batch_error:
                    print(f"‚ùå Batch error at doc {idx}: {batch_error}")
                    error_count += len(batch_texts)

                # Clear batch
                batch_texts = []
                batch_ids = []
                batch_metadatas = []

    except Exception as e:
        print(f"‚ùå Error at {idx}: {str(e)[:10000]}")
        error_count += 1
        continue

# Process remaining batch
if batch_texts:
    try:
        embeddings = rag.embedder.encode(
            batch_texts,
            show_progress_bar=False,
            batch_size=len(batch_texts),
            convert_to_numpy=True,
            device='cuda'
        )

        rag.collection.add(
            documents=batch_texts,
            embeddings=embeddings.tolist(),
            ids=batch_ids,
            metadatas=batch_metadatas
        )

        rag.chunks.extend(batch_texts)
        rag.chunk_metadata.extend(batch_metadatas)

        processed_count += len(batch_texts)
    except Exception as e:
        print(f"‚ùå Final batch error: {e}")
        error_count += len(batch_texts)

total_time = time.time() - start_time

# =============================================================================
# SUMMARY
# =============================================================================

print("\n" + "="*70)
print("   ‚úÖ LOADING COMPLETE")
print("="*70)

print(f"\nüìä Processing Summary:")
print(f"  ‚Ä¢ Total processed: {processed_count} chunks")
print(f"  ‚Ä¢ Errors: {error_count}")
print(f"  ‚Ä¢ Total time: {total_time:.1f} seconds")
print(f"  ‚Ä¢ Average rate: {processed_count/total_time:.2f} chunks/sec")
print(f"  ‚Ä¢ Documents in ChromaDB: {rag.collection.count()}")

print("\nüí° ChromaDB has automatically indexed all documents!")
print("üí° Ready to answer questions!")
print("="*70)


   üì§ LOADING DOCUMENTS INTO CHROMADB (BATCHED)

‚öôÔ∏è  Processing 10000 companies with GPU batching...



Add of existing embedding ID: cik_92116_1993_section_1_1
Add of existing embedding ID: cik_92116_1993_section_7_2
Add of existing embedding ID: cik_92116_1993_section_8_3
Add of existing embedding ID: cik_103730_1993_section_1_4
Add of existing embedding ID: cik_103730_1993_section_7_5
Add of existing embedding ID: cik_103730_1993_section_8_6
Add of existing embedding ID: cik_100240_1993_section_1_7
Add of existing embedding ID: cik_100240_1993_section_7_8
Insert of existing embedding ID: cik_92116_1993_section_1_1
Insert of existing embedding ID: cik_92116_1993_section_7_2
Insert of existing embedding ID: cik_92116_1993_section_8_3
Insert of existing embedding ID: cik_103730_1993_section_1_4
Insert of existing embedding ID: cik_103730_1993_section_7_5
Insert of existing embedding ID: cik_103730_1993_section_8_6
Insert of existing embedding ID: cik_100240_1993_section_1_7
Insert of existing embedding ID: cik_100240_1993_section_7_8
Add of existing embedding ID: cik_100240_1993_section_

   ‚úì 200 chunks | 121.5 chunks/sec | 1.6s elapsed


Add of existing embedding ID: cik_832427_1993_section_1_1
Add of existing embedding ID: cik_832427_1993_section_7_2
Add of existing embedding ID: cik_832427_1993_section_8_3
Add of existing embedding ID: cik_83047_1993_section_1_4
Add of existing embedding ID: cik_83047_1993_section_7_5
Add of existing embedding ID: cik_83047_1993_section_8_6
Add of existing embedding ID: cik_893928_1993_section_1_7
Add of existing embedding ID: cik_893928_1993_section_7_8
Insert of existing embedding ID: cik_832427_1993_section_1_1
Insert of existing embedding ID: cik_832427_1993_section_7_2
Insert of existing embedding ID: cik_832427_1993_section_8_3
Insert of existing embedding ID: cik_83047_1993_section_1_4
Insert of existing embedding ID: cik_83047_1993_section_7_5
Insert of existing embedding ID: cik_83047_1993_section_8_6
Insert of existing embedding ID: cik_893928_1993_section_1_7
Insert of existing embedding ID: cik_893928_1993_section_7_8
Add of existing embedding ID: cik_789292_1993_section_

   ‚úì 400 chunks | 129.9 chunks/sec | 3.1s elapsed


Add of existing embedding ID: cik_36326_1993_section_7_1
Add of existing embedding ID: cik_36326_1993_section_8_2
Add of existing embedding ID: cik_859257_1993_section_1_3
Add of existing embedding ID: cik_79879_1993_section_1_4
Add of existing embedding ID: cik_42293_1993_section_1_5
Add of existing embedding ID: cik_42293_1993_section_7_6
Add of existing embedding ID: cik_42293_1993_section_8_7
Add of existing embedding ID: cik_3153_1993_section_1_8
Insert of existing embedding ID: cik_36326_1993_section_7_1
Insert of existing embedding ID: cik_36326_1993_section_8_2
Insert of existing embedding ID: cik_859257_1993_section_1_3
Insert of existing embedding ID: cik_79879_1993_section_1_4
Insert of existing embedding ID: cik_42293_1993_section_1_5
Insert of existing embedding ID: cik_42293_1993_section_7_6
Insert of existing embedding ID: cik_42293_1993_section_8_7
Insert of existing embedding ID: cik_3153_1993_section_1_8
Add of existing embedding ID: cik_3153_1993_section_7_1
Add of e

   ‚úì 600 chunks | 132.8 chunks/sec | 4.5s elapsed


Add of existing embedding ID: cik_64803_1993_section_7_1
Add of existing embedding ID: cik_64803_1993_section_8_2
Add of existing embedding ID: cik_817473_1993_section_1_3
Add of existing embedding ID: cik_817473_1993_section_7_4
Add of existing embedding ID: cik_817473_1993_section_8_5
Add of existing embedding ID: cik_52485_1993_section_7_6
Add of existing embedding ID: cik_52485_1993_section_8_7
Add of existing embedding ID: cik_310569_1993_section_1_8
Insert of existing embedding ID: cik_64803_1993_section_7_1
Insert of existing embedding ID: cik_64803_1993_section_8_2
Insert of existing embedding ID: cik_817473_1993_section_1_3
Insert of existing embedding ID: cik_817473_1993_section_7_4
Insert of existing embedding ID: cik_817473_1993_section_8_5
Insert of existing embedding ID: cik_52485_1993_section_7_6
Insert of existing embedding ID: cik_52485_1993_section_8_7
Insert of existing embedding ID: cik_310569_1993_section_1_8
Add of existing embedding ID: cik_310569_1993_section_7_

   ‚úì 800 chunks | 134.3 chunks/sec | 6.0s elapsed


Add of existing embedding ID: cik_354964_1993_section_7_1
Add of existing embedding ID: cik_354964_1993_section_8_2
Add of existing embedding ID: cik_355429_1993_section_1_3
Add of existing embedding ID: cik_355429_1993_section_7_4
Add of existing embedding ID: cik_355429_1993_section_8_5
Add of existing embedding ID: cik_716783_1993_section_1_6
Add of existing embedding ID: cik_716783_1993_section_7_7
Add of existing embedding ID: cik_716783_1993_section_8_8
Insert of existing embedding ID: cik_354964_1993_section_7_1
Insert of existing embedding ID: cik_354964_1993_section_8_2
Insert of existing embedding ID: cik_355429_1993_section_1_3
Insert of existing embedding ID: cik_355429_1993_section_7_4
Insert of existing embedding ID: cik_355429_1993_section_8_5
Insert of existing embedding ID: cik_716783_1993_section_1_6
Insert of existing embedding ID: cik_716783_1993_section_7_7
Insert of existing embedding ID: cik_716783_1993_section_8_8
Add of existing embedding ID: cik_63276_1993_sec

   ‚úì 1000 chunks | 135.1 chunks/sec | 7.4s elapsed


Add of existing embedding ID: cik_794323_1993_section_1_1
Add of existing embedding ID: cik_794323_1993_section_7_2
Add of existing embedding ID: cik_794323_1993_section_8_3
Add of existing embedding ID: cik_18651_1993_section_1_4
Add of existing embedding ID: cik_18651_1993_section_7_5
Add of existing embedding ID: cik_18651_1993_section_8_6
Add of existing embedding ID: cik_52428_1993_section_1_7
Add of existing embedding ID: cik_52428_1993_section_7_8
Insert of existing embedding ID: cik_794323_1993_section_1_1
Insert of existing embedding ID: cik_794323_1993_section_7_2
Insert of existing embedding ID: cik_794323_1993_section_8_3
Insert of existing embedding ID: cik_18651_1993_section_1_4
Insert of existing embedding ID: cik_18651_1993_section_7_5
Insert of existing embedding ID: cik_18651_1993_section_8_6
Insert of existing embedding ID: cik_52428_1993_section_1_7
Insert of existing embedding ID: cik_52428_1993_section_7_8
Add of existing embedding ID: cik_52428_1993_section_8_1
A

   ‚úì 1200 chunks | 135.6 chunks/sec | 8.9s elapsed


Add of existing embedding ID: cik_860730_1993_section_8_1
Add of existing embedding ID: cik_792014_1993_section_1_2
Add of existing embedding ID: cik_792014_1993_section_7_3
Add of existing embedding ID: cik_792014_1993_section_8_4
Add of existing embedding ID: cik_109261_1993_section_1_5
Add of existing embedding ID: cik_109261_1993_section_7_6
Add of existing embedding ID: cik_109261_1993_section_8_7
Add of existing embedding ID: cik_701221_1993_section_1_8
Insert of existing embedding ID: cik_860730_1993_section_8_1
Insert of existing embedding ID: cik_792014_1993_section_1_2
Insert of existing embedding ID: cik_792014_1993_section_7_3
Insert of existing embedding ID: cik_792014_1993_section_8_4
Insert of existing embedding ID: cik_109261_1993_section_1_5
Insert of existing embedding ID: cik_109261_1993_section_7_6
Insert of existing embedding ID: cik_109261_1993_section_8_7
Insert of existing embedding ID: cik_701221_1993_section_1_8
Add of existing embedding ID: cik_701221_1993_se

   ‚úì 1400 chunks | 135.4 chunks/sec | 10.3s elapsed
   ‚úì 1600 chunks | 136.1 chunks/sec | 11.8s elapsed
   ‚úì 1800 chunks | 135.9 chunks/sec | 13.2s elapsed
   ‚úì 2000 chunks | 135.7 chunks/sec | 14.7s elapsed
   ‚úì 2200 chunks | 136.1 chunks/sec | 16.2s elapsed
   ‚úì 2400 chunks | 136.6 chunks/sec | 17.6s elapsed
   ‚úì 2600 chunks | 136.7 chunks/sec | 19.0s elapsed
   ‚úì 2800 chunks | 136.7 chunks/sec | 20.5s elapsed
   ‚úì 3000 chunks | 136.8 chunks/sec | 21.9s elapsed
   ‚úì 3200 chunks | 137.1 chunks/sec | 23.3s elapsed
   ‚úì 3400 chunks | 136.7 chunks/sec | 24.9s elapsed
   ‚úì 3600 chunks | 136.2 chunks/sec | 26.4s elapsed
   ‚úì 3800 chunks | 136.4 chunks/sec | 27.9s elapsed
   ‚úì 4000 chunks | 136.5 chunks/sec | 29.3s elapsed
   ‚úì 4200 chunks | 136.5 chunks/sec | 30.8s elapsed
   ‚úì 4400 chunks | 136.6 chunks/sec | 32.2s elapsed
   ‚úì 4600 chunks | 136.7 chunks/sec | 33.6s elapsed
   ‚úì 4800 chunks | 136.9 chunks/sec | 35.1s elapsed
   ‚úì 5000 chunks | 136.8 c

Failed to send telemetry event CollectionAddEvent: capture() takes 1 positional argument but 3 were given


   ‚úì 6800 chunks | 138.0 chunks/sec | 49.3s elapsed
   ‚úì 7000 chunks | 137.3 chunks/sec | 51.0s elapsed
   ‚úì 7200 chunks | 137.4 chunks/sec | 52.4s elapsed
   ‚úì 7400 chunks | 137.5 chunks/sec | 53.8s elapsed
   ‚úì 7600 chunks | 137.7 chunks/sec | 55.2s elapsed
   ‚úì 7800 chunks | 137.9 chunks/sec | 56.6s elapsed
   ‚úì 8000 chunks | 138.0 chunks/sec | 58.0s elapsed
   ‚úì 8200 chunks | 138.0 chunks/sec | 59.4s elapsed
   ‚úì 8400 chunks | 138.0 chunks/sec | 60.9s elapsed
   ‚úì 8600 chunks | 138.1 chunks/sec | 62.3s elapsed
   ‚úì 8800 chunks | 138.1 chunks/sec | 63.7s elapsed
   ‚úì 9000 chunks | 138.2 chunks/sec | 65.1s elapsed
   ‚úì 9200 chunks | 138.3 chunks/sec | 66.5s elapsed
   ‚úì 9400 chunks | 138.4 chunks/sec | 67.9s elapsed
   ‚úì 9600 chunks | 138.4 chunks/sec | 69.4s elapsed
   ‚úì 9800 chunks | 138.2 chunks/sec | 70.9s elapsed
   ‚úì 10000 chunks | 138.3 chunks/sec | 72.3s elapsed
   ‚úì 10200 chunks | 138.3 chunks/sec | 73.7s elapsed
   ‚úì 10400 chunks | 138.

Failed to send telemetry event CollectionAddEvent: capture() takes 1 positional argument but 3 were given


   ‚úì 14800 chunks | 137.5 chunks/sec | 107.6s elapsed
   ‚úì 15000 chunks | 137.5 chunks/sec | 109.1s elapsed
   ‚úì 15200 chunks | 137.6 chunks/sec | 110.5s elapsed
   ‚úì 15400 chunks | 137.6 chunks/sec | 111.9s elapsed
   ‚úì 15600 chunks | 137.6 chunks/sec | 113.4s elapsed
   ‚úì 15800 chunks | 137.4 chunks/sec | 115.0s elapsed
   ‚úì 16000 chunks | 137.4 chunks/sec | 116.4s elapsed
   ‚úì 16200 chunks | 137.5 chunks/sec | 117.8s elapsed
   ‚úì 16400 chunks | 137.5 chunks/sec | 119.3s elapsed
   ‚úì 16600 chunks | 137.5 chunks/sec | 120.7s elapsed
   ‚úì 16800 chunks | 137.6 chunks/sec | 122.1s elapsed
   ‚úì 17000 chunks | 137.6 chunks/sec | 123.5s elapsed
   ‚úì 17200 chunks | 137.6 chunks/sec | 125.0s elapsed
   ‚úì 17400 chunks | 137.7 chunks/sec | 126.4s elapsed
   ‚úì 17600 chunks | 137.7 chunks/sec | 127.8s elapsed
   ‚úì 17800 chunks | 137.8 chunks/sec | 129.2s elapsed
   ‚úì 18000 chunks | 137.7 chunks/sec | 130.7s elapsed
   ‚úì 18200 chunks | 137.7 chunks/sec | 132.2s 

Failed to send telemetry event CollectionAddEvent: capture() takes 1 positional argument but 3 were given


   ‚úì 22800 chunks | 137.7 chunks/sec | 165.6s elapsed
   ‚úì 23000 chunks | 137.7 chunks/sec | 167.1s elapsed
   ‚úì 23200 chunks | 137.7 chunks/sec | 168.5s elapsed
   ‚úì 23400 chunks | 137.7 chunks/sec | 169.9s elapsed
   ‚úì 23600 chunks | 137.7 chunks/sec | 171.4s elapsed
   ‚úì 23800 chunks | 137.7 chunks/sec | 172.9s elapsed
   ‚úì 24000 chunks | 137.7 chunks/sec | 174.3s elapsed
   ‚úì 24200 chunks | 137.7 chunks/sec | 175.7s elapsed
   ‚úì 24400 chunks | 137.7 chunks/sec | 177.2s elapsed
   ‚úì 24600 chunks | 137.7 chunks/sec | 178.6s elapsed
   ‚úì 24800 chunks | 137.8 chunks/sec | 180.0s elapsed
   ‚úì 25000 chunks | 137.8 chunks/sec | 181.5s elapsed
   ‚úì 25200 chunks | 137.7 chunks/sec | 183.1s elapsed
   ‚úì 25400 chunks | 137.6 chunks/sec | 184.6s elapsed
   ‚úì 25600 chunks | 137.4 chunks/sec | 186.3s elapsed
   ‚úì 25800 chunks | 137.2 chunks/sec | 188.0s elapsed
   ‚úì 26000 chunks | 137.1 chunks/sec | 189.6s elapsed
   ‚úì 26200 chunks | 137.0 chunks/sec | 191.2s 

## üß™ Test Basic RAG

Test the basic RAG system with simple queries.

In [15]:
# Cell 6: Test basic RAG with sample questions

# Test with a simple question
rag.ask("What are the main business activities of the companies?", top_k=5)

‚ùì Question: What are the main business activities of the companies?

  üîç Searching ChromaDB for relevant information...
  ü§î Generating answer with GPT-3.5-turbo...

üìä ANSWER
1. **First Financial Corporation (CIK 714562):**
   - Main Business Activities: First Financial Corporation is a multi-bank holding company primarily engaged in providing banking services through its subsidiaries. The company's main activities include offering various bank services, managing affiliations, supervising the bank's operations, and competing in the financial services industry. (Source: CIK 714562 | Business Description)

2. **Real Estate Investment Trust (CIK 828957):**
   - Main Business Activities: The company is a self-liquidating, finite-life real estate investment trust with a principal asset portfolio of industrial and commercial properties. Its main activities involve owning, managing, and potentially disposing of these properties. The company's financial condition is primarily tied to

"1. **First Financial Corporation (CIK 714562):**\n   - Main Business Activities: First Financial Corporation is a multi-bank holding company primarily engaged in providing banking services through its subsidiaries. The company's main activities include offering various bank services, managing affiliations, supervising the bank's operations, and competing in the financial services industry. (Source: CIK 714562 | Business Description)\n\n2. **Real Estate Investment Trust (CIK 828957):**\n   - Main Business Activities: The company is a self-liquidating, finite-life real estate investment trust with a principal asset portfolio of industrial and commercial properties. Its main activities involve owning, managing, and potentially disposing of these properties. The company's financial condition is primarily tied to the performance and valuation of its real estate assets. (Source: CIK 828957 | MD&A)\n\n3. **Income-Producing Real Property Partnership (CIK 722886):**\n   - Main Business Activit

## üîç Hybrid Search Implementation

Combines semantic search (ChromaDB) with keyword search for better accuracy.

**Benefits:**
- Catches exact keyword matches
- Better handling of specific terms (company names, metrics)
- 10-15% accuracy improvement

In [84]:
# Cell 7: Hybrid Search Implementation - ChromaDB Version

from collections import Counter
import re
import numpy as np

class HybridSearch:
    """
    Combine vector search (semantic) with keyword search (exact matches)
    This improves accuracy by 10-15%
    Now works with ChromaDB backend
    """

    def __init__(self, rag):
        self.rag = rag

    def keyword_search(self, query: str, top_k: int = 10):
        """
        Simple keyword search using TF-IDF-like scoring

        Args:
            query: Search query
            top_k: Number of results

        Returns:
            List of (chunk_index, score) tuples
        """
        # Extract keywords from query
        query_terms = set(re.findall(r'\b\w+\b', query.lower()))

        # Remove common words
        stopwords = {'the', 'a', 'an', 'in', 'on', 'at', 'for', 'to', 'of', 'and', 'or'}
        query_terms = query_terms - stopwords

        # Score each chunk
        scores = []
        for idx, chunk in enumerate(self.rag.chunks):
            chunk_terms = set(re.findall(r'\b\w+\b', chunk.lower()))

            # Count matching terms
            matches = query_terms & chunk_terms

            if matches:
                # Simple scoring: number of matching terms
                score = len(matches)

                # Boost for exact phrase matches
                if query.lower() in chunk.lower():
                    score *= 2

                scores.append((idx, score))

        # Sort by score
        scores.sort(key=lambda x: x[1], reverse=True)

        return scores[:top_k]

    def hybrid_search(self, query: str, top_k: int = 5, alpha: float = 0.7):
        """
        Combine vector search and keyword search

        Args:
            query: Search query
            top_k: Number of results to return
            alpha: Weight for vector search (1-alpha for keyword search)

        Returns:
            List of chunk indices
        """
        # Vector search using ChromaDB
        q_embedding = self.rag.embedder.encode([query], device='cuda')

        # Query ChromaDB for more candidates
        results = self.rag.collection.query(
            query_embeddings=q_embedding.tolist(),
            n_results=min(top_k * 2, len(self.rag.chunks))  # Get more candidates
        )

        # Get the IDs and convert to indices
        vector_ids = results['ids'][0]
        distances = results['distances'][0]

        # Map IDs back to indices in self.rag.chunks
        # ChromaDB returns IDs, we need to find corresponding indices
        id_to_index = {}

        if self.rag.collection.count() > 0:
            all_results = self.rag.collection.get()
            all_ids = all_results['ids']

            for idx, doc_id in enumerate(all_ids):
                id_to_index[doc_id] = idx

        vector_indices = [id_to_index[doc_id] for doc_id in vector_ids if doc_id in id_to_index]

        # Keyword search
        keyword_results = self.keyword_search(query, top_k * 2)

        # Combine scores
        combined_scores = {}

        # Add vector search scores (convert distance to similarity)
        for i, idx in enumerate(vector_indices):
            # Lower distance = better match
            score = 1.0 / (1.0 + distances[i])
            combined_scores[idx] = alpha * score

        # Add keyword search scores (normalized)
        if keyword_results:
            max_keyword_score = max(score for _, score in keyword_results)
            for idx, score in keyword_results:
                normalized_score = score / max_keyword_score
                if idx in combined_scores:
                    combined_scores[idx] += (1 - alpha) * normalized_score
                else:
                    combined_scores[idx] = (1 - alpha) * normalized_score

        # Sort by combined score
        sorted_indices = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)

        # Return top-k
        return [idx for idx, _ in sorted_indices[:top_k]]

    def ask_hybrid(self, question: str, top_k: int = 5):
        """
        Ask question using hybrid search

        Args:
            question: Question to ask
            top_k: Number of chunks to retrieve

        Returns:
            Generated answer
        """
        if self.rag.collection.count() == 0:
            print("‚ùå Please load documents first")
            return None

        print(f"‚ùì Question: {question}\n")
        print("  üîç Using HYBRID search (vector + keyword)...")

        # Get relevant chunks using hybrid search
        indices = self.hybrid_search(question, top_k)

        # Build context
        context_parts = []
        sources_used = []

        for i, idx in enumerate(indices):
            chunk = self.rag.chunks[idx]
            meta = self.rag.chunk_metadata[idx]

            source_info = f"{meta['company']} | {meta['section']}"
            sources_used.append(source_info)

            context_parts.append(f"[Source {i+1}: {source_info}]\n{chunk}")

        context = "\n\n---\n\n".join(context_parts)

        # Generate answer (same as before)
        prompt = f"""You are an expert financial analyst with deep knowledge of SEC filings and financial statements.

Context from financial documents:
{context}

Question: {question}

Instructions:
1. Answer ONLY using information from the context above
2. Think step-by-step if calculations are needed
3. Cite which source (company and section) you're using
4. Show your work for any calculations
5. Be precise with numbers and include units
6. If information is not in the context, say "Information not available in provided documents"

Your analysis:"""

        print("  ü§î Generating answer with GPT-3.5-turbo...")

        try:
            response = self.rag.client.chat.completions.create(
                model=FINETUNED_MODEL_ID,
                messages=[
                    {"role": "system", "content": "You are an expert financial analyst."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.3,
                max_tokens=800
            )

            answer = response.choices[0].message.content

            print("\n" + "="*70)
            print("üìä ANSWER (using Hybrid Search)")
            print("="*70)
            print(answer)
            print("="*70)

            print("\nüìö Sources Used:")
            for i, source in enumerate(set(sources_used), 1):
                print(f"  {i}. {source}")
            print()

            return answer

        except Exception as e:
            print(f"‚ùå Error: {e}")
            return None

# Initialize hybrid search
hybrid = HybridSearch(rag)

print("‚úÖ Hybrid Search Implemented (ChromaDB)!")
print("üí° Usage: hybrid.ask_hybrid('your question')")

# =============================================================================
# üß™ TEST THE HYBRID SEARCH
# =============================================================================

print("\n" + "="*70)
print("   üß™ TESTING HYBRID SEARCH")
print("="*70)

# Test question that benefits from both semantic and keyword search
test_question = "What are the main risk factors related to competition and market conditions?"

print(f"\nüìù Test Question: {test_question}")
print("\nThis question tests:")
print("  ‚Ä¢ Keyword matching: 'risk factors', 'competition', 'market'")
print("  ‚Ä¢ Semantic understanding: business challenges, competitive threats")
print("\n" + "="*70 + "\n")

# Run the test
answer = hybrid.ask_hybrid(test_question, top_k=5)


‚úÖ Hybrid Search Implemented (ChromaDB)!
üí° Usage: hybrid.ask_hybrid('your question')

   üß™ TESTING HYBRID SEARCH

üìù Test Question: What are the main risk factors related to competition and market conditions?

This question tests:
  ‚Ä¢ Keyword matching: 'risk factors', 'competition', 'market'
  ‚Ä¢ Semantic understanding: business challenges, competitive threats


‚ùì Question: What are the main risk factors related to competition and market conditions?

  üîç Using HYBRID search (vector + keyword)...
  ü§î Generating answer with GPT-3.5-turbo...

üìä ANSWER (using Hybrid Search)
MAMSI: Increasing price competition, inability to expand service territory (Source 1: MD&A)

Poe & Brown: Cyclical premium pricing, high volatility (Source 2: Business Description)

Tandy: Aggressive pricing practices, rapid technological advances (Source 5: MD&A)

üìö Sources Used:
  1. CIK 1012690 | Business Description
  2. CIK 79282 | Business Description
  3. CIK 96289 | MD&A
  4. CIK 805037

In [85]:
# Test Question 1: Specific financial metrics
hybrid.ask_hybrid("What revenue streams and sources of income are mentioned?", top_k=5)

# Test Question 2: Strategic focus
hybrid.ask_hybrid("What are the main strategic priorities and business initiatives?", top_k=5)

# Test Question 3: Technology and innovation
hybrid.ask_hybrid("What technology investments or digital transformation efforts are discussed?", top_k=5)

# Test Question 4: Regulatory and compliance
hybrid.ask_hybrid("What regulatory challenges and compliance requirements are mentioned?", top_k=5)

# Test Question 5: Comparative analysis
hybrid.ask_hybrid("How do different companies approach customer acquisition and retention?", top_k=5)


‚ùì Question: What revenue streams and sources of income are mentioned?

  üîç Using HYBRID search (vector + keyword)...
  ü§î Generating answer with GPT-3.5-turbo...

üìä ANSWER (using Hybrid Search)
1. Commissions
2. Principal transactions
3. Investment banking
4. Interest income
5. Insurance commissions

Source: CIK 36781 | MD&A

üìö Sources Used:
  1. CIK 311871 | MD&A
  2. CIK 53347 | MD&A
  3. CIK 36781 | MD&A
  4. CIK 805019 | MD&A
  5. CIK 2648 | Business Description

‚ùì Question: What are the main strategic priorities and business initiatives?

  üîç Using HYBRID search (vector + keyword)...
  ü§î Generating answer with GPT-3.5-turbo...

üìä ANSWER (using Hybrid Search)
1. Reposition businesses for future success (CIK 31791 | MD&A)
2. Enhance effectiveness and efficiency, reduce costs and overhead (CIK 36090 | Business Description)
3. Improve shareholder value, reduce costs, extend leadership in health-care markets (CIK 10456 | Business Description)
4. Position for com

'North American Integrated Marketing, Inc. uses database analysis and direct-mail advertising services to help clients understand their customers and sell new products to them. They supplement client data with additional information like age and income to enhance customer profiles. (Source 1: CIK 847388 | Business Description)\n\nAPAC TeleServices, Inc. employs telephone-based marketing and customer management solutions. They use a data management system to sort customer information and employ predictive dialers for efficient outreach. This system allows for on-line monitoring and refinement of marketing campaigns. (Source 2: CIK 949297 | Business Description)'

## üìö Few-Shot Prompting

Improves answer quality by showing the model examples of good financial analyses.

In [82]:
# Cell 8: Few-Shot Prompting

class FewShotRAG:
    """
    Add few-shot examples to improve accuracy
    Shows the model examples of good answers
    """

    def __init__(self, rag, hybrid_search):
        self.rag = rag
        self.hybrid = hybrid_search

        # Define few-shot examples
        self.examples = [
            {
                "question": "What was Apple's revenue?",
                "context": "Apple Inc. reported total revenue of $394 billion for fiscal 2023, representing a 15% increase year-over-year.",
                "answer": "Based on the financial data from Apple Inc.'s fiscal 2023 filing, the company reported total revenue of $394 billion, which represents a 15% increase compared to the previous year."
            },
            {
                "question": "What are the main risk factors?",
                "context": "Risk Factors: Competition in cloud services is intense. Cybersecurity incidents could harm reputation. Economic uncertainty may reduce IT spending.",
                "answer": "The main risk factors identified are: 1) Intense competition in cloud services, 2) Potential cybersecurity incidents that could damage reputation and financial results, and 3) Economic uncertainty that may lead to reduced IT spending by customers."
            },
            {
                "question": "Compare gross margins",
                "context": "Company A gross margin: 43.5%. Company B gross margin: 42.0%. Company C gross margin: 18.2%.",
                "answer": "Comparing gross margins: Company A has the highest at 43.5%, followed by Company B at 42.0%, and Company C at 18.2%. Company A's margin is 1.5 percentage points higher than Company B and 25.3 percentage points higher than Company C."
            }
        ]

    def build_few_shot_prompt(self, question: str, context: str):
        """Build prompt with few-shot examples"""

        prompt = "You are an expert financial analyst. Here are examples of good analyses:\n\n"

        # Add examples
        for i, example in enumerate(self.examples, 1):
            prompt += f"Example {i}:\n"
            prompt += f"Context: {example['context']}\n"
            prompt += f"Question: {example['question']}\n"
            prompt += f"Answer: {example['answer']}\n\n"

        # Add actual question
        prompt += "Now answer this question in the same style:\n\n"
        prompt += f"Context from financial documents:\n{context}\n\n"
        prompt += f"Question: {question}\n\n"
        prompt += "Instructions:\n"
        prompt += "1. Answer ONLY using information from the context\n"
        prompt += "2. Be specific with numbers and cite sources\n"
        prompt += "3. Show calculations step-by-step if needed\n"
        prompt += "4. Format your answer clearly\n\n"
        prompt += "Your analysis:"

        return prompt

    def ask_with_examples(self, question: str, top_k: int = 5):
        """Ask question using few-shot prompting"""

        print(f"‚ùì Question: {question}\n")
        print("  üîç Searching with hybrid search + few-shot learning...")

        # Get context using hybrid search
        indices = self.hybrid.hybrid_search(question, top_k)

        context_parts = []
        sources_used = []

        for i, idx in enumerate(indices):
            chunk = self.rag.chunks[idx]
            meta = self.rag.chunk_metadata[idx]

            source_info = f"{meta['company']} | {meta['section']}"
            sources_used.append(source_info)

            context_parts.append(f"[Source {i+1}: {source_info}]\n{chunk}")

        context = "\n\n---\n\n".join(context_parts)

        # Build few-shot prompt
        prompt = self.build_few_shot_prompt(question, context)

        print("  ü§î Generating answer with few-shot examples...")

        try:
            response = self.rag.client.chat.completions.create(
                model=FINETUNED_MODEL_ID,
                messages=[
                    {"role": "system", "content": "You are an expert financial analyst. Follow the example format exactly."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.2,  # Lower temperature for more consistent format
                max_tokens=800
            )

            answer = response.choices[0].message.content

            print("\n" + "="*70)
            print("üìä ANSWER (with Few-Shot Learning)")
            print("="*70)
            print(answer)
            print("="*70)

            print("\nüìö Sources Used:")
            for i, source in enumerate(set(sources_used), 1):
                print(f"  {i}. {source}")
            print()

            return answer

        except Exception as e:
            print(f"‚ùå Error: {e}")
            return None

# Initialize few-shot RAG
fewshot = FewShotRAG(rag, hybrid)

print("‚úÖ Few-Shot Learning Implemented!")
print("üí° Usage: fewshot.ask_with_examples('your question')")

‚úÖ Few-Shot Learning Implemented!
üí° Usage: fewshot.ask_with_examples('your question')


In [86]:
# Test Question 1: Specific financial metrics
fewshot.ask_with_examples("What is the main business of CIK 53540?", top_k=5)

# Test Question 2: Strategic focus
fewshot.ask_with_examples("What are the main strategic priorities and business initiatives?", top_k=5)

# Test Question 3: Technology and innovation
fewshot.ask_with_examples("What technology investments or digital transformation efforts are discussed?", top_k=5)

# Test Question 4: Regulatory and compliance
fewshot.ask_with_examples("What regulatory challenges and compliance requirements are mentioned?", top_k=5)

# Test Question 5: Comparative analysis
fewshot.ask_with_examples("How do different companies approach customer acquisition and retention?", top_k=5)


‚ùì Question: What is the main business of CIK 53540?

  üîç Searching with hybrid search + few-shot learning...
  ü§î Generating answer with few-shot examples...

üìä ANSWER (with Few-Shot Learning)
The main business of CIK 53540 is not provided in the context.

üìö Sources Used:
  1. CIK 854884 | Business Description
  2. CIK 824481 | Business Description
  3. CIK 725625 | Business Description
  4. CIK 727094 | Business Description

‚ùì Question: What are the main strategic priorities and business initiatives?

  üîç Searching with hybrid search + few-shot learning...
  ü§î Generating answer with few-shot examples...

üìä ANSWER (with Few-Shot Learning)
1. Repositioning businesses for future success (Source 1)
2. Project One to enhance effectiveness and efficiency (Source 2)
3. Strategic actions to improve shareholder value and reduce costs (Source 3, 5)
4. Four-part strategic plan to enhance and expand core and non-utility businesses (Source 4)

üìö Sources Used:
  1. CIK 36

'North American Integrated Marketing, Inc. (Source 1) uses database analysis and production services to help clients understand customer behavior and sell new products. APAC TeleServices, Inc. (Source 2) employs telephone-based marketing and customer management solutions, utilizing 8,450 workstations across 62 centers.'

In [94]:
fewshot.ask_with_examples("Describe the operations of BellSouth Telecommunications", top_k=5)

‚ùì Question: Describe the operations of BellSouth Telecommunications

  üîç Searching with hybrid search + few-shot learning...
  ü§î Generating answer with few-shot examples...

üìä ANSWER (with Few-Shot Learning)
BellSouth Telecommunications provides wireline telecommunications services to two-thirds of the population and one-half of the territory within nine states (Source 2).

üìö Sources Used:
  1. CIK 92088 | MD&A
  2. CIK 732713 | MD&A
  3. CIK 732713 | Business Description



'BellSouth Telecommunications provides wireline telecommunications services to two-thirds of the population and one-half of the territory within nine states (Source 2).'

## ‚ôªÔ∏è Cross-Encoder Re-Ranking

Uses a cross-encoder to re-rank retrieved chunks for maximum relevance.

**Process:**
1. Retrieve top-20 candidates with hybrid search
2. Score each candidate with cross-encoder
3. Select top-5 highest-scored chunks
4. Generate answer with best chunks

In [80]:
# Cell 9: Cross-Encoder Re-Ranking

from sentence_transformers import CrossEncoder

class ReRanker:
    """
    Re-rank retrieved chunks using a cross-encoder
    This improves accuracy by 5-10%
    """

    def __init__(self, rag, hybrid_search):
        self.rag = rag
        self.hybrid = hybrid_search

        print("üì• Loading cross-encoder for re-ranking...")
        # Use a cross-encoder fine-tuned for semantic similarity
        self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
        print("‚úÖ Cross-encoder loaded!")

    def rerank(self, query: str, candidate_indices: list):
        """
        Re-rank candidates using cross-encoder

        Args:
            query: Search query
            candidate_indices: List of chunk indices to re-rank

        Returns:
            Re-ranked list of indices
        """
        # Get chunks
        candidates = [self.rag.chunks[idx] for idx in candidate_indices]

        # Score with cross-encoder
        pairs = [[query, chunk] for chunk in candidates]
        scores = self.reranker.predict(pairs)

        # Sort by score
        scored_indices = list(zip(candidate_indices, scores))
        scored_indices.sort(key=lambda x: x[1], reverse=True)

        return [idx for idx, _ in scored_indices]

    def ask_with_reranking(self, question: str, retrieve_k: int = 20, final_k: int = 5):
        """
        Ask question with retrieval + re-ranking

        Args:
            question: Question to ask
            retrieve_k: Number of chunks to retrieve initially
            final_k: Number of chunks to use after re-ranking

        Returns:
            Generated answer
        """
        print(f"‚ùì Question: {question}\n")
        print(f"  üîç Step 1: Retrieving top {retrieve_k} candidates...")

        # Step 1: Get candidates with hybrid search
        candidate_indices = self.hybrid.hybrid_search(question, retrieve_k)

        print(f"  ‚ôªÔ∏è  Step 2: Re-ranking to find best {final_k}...")

        # Step 2: Re-rank
        reranked_indices = self.rerank(question, candidate_indices)[:final_k]

        print(f"  ‚úÖ Selected {final_k} most relevant chunks\n")

        # Build context
        context_parts = []
        sources_used = []

        for i, idx in enumerate(reranked_indices):
            chunk = self.rag.chunks[idx]
            meta = self.rag.chunk_metadata[idx]

            source_info = f"{meta['company']} | {meta['section']}"
            sources_used.append(source_info)

            context_parts.append(f"[Source {i+1}: {source_info}]\n{chunk}")

        context = "\n\n---\n\n".join(context_parts)

        # Generate answer
        prompt = f"""You are an expert financial analyst with deep knowledge of SEC filings and financial statements.

Context from financial documents (re-ranked for relevance):
{context}

Question: {question}

Instructions:
1. Answer ONLY using information from the context above
2. Think step-by-step if calculations are needed
3. Cite which source (company and section) you're using
4. Show your work for any calculations
5. Be precise with numbers and include units
6. If information is not in the context, say "Information not available in provided documents"

Your analysis:"""

        print("  ü§î Generating answer...")

        try:
            response = self.rag.client.chat.completions.create(
                model=FINETUNED_MODEL_ID,
                messages=[
                    {"role": "system", "content": "You are an expert financial analyst."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.3,
                max_tokens=800
            )

            answer = response.choices[0].message.content

            print("\n" + "="*70)
            print("üìä ANSWER (with Re-Ranking)")
            print("="*70)
            print(answer)
            print("="*70)

            print("\nüìö Sources Used:")
            for i, source in enumerate(set(sources_used), 1):
                print(f"  {i}. {source}")
            print()

            return answer

        except Exception as e:
            print(f"‚ùå Error: {e}")
            return None

# Initialize re-ranker
reranker = ReRanker(rag, hybrid)

print("‚úÖ Re-Ranking Implemented!")
print("üí° Usage: reranker.ask_with_reranking('your question')")

üì• Loading cross-encoder for re-ranking...
‚úÖ Cross-encoder loaded!
‚úÖ Re-Ranking Implemented!
üí° Usage: reranker.ask_with_reranking('your question')


In [81]:
# Test Question 1: Specific financial metrics
reranker.ask_with_reranking("What revenue streams and sources of income are mentioned?")

# Test Question 2: Strategic focus
reranker.ask_with_reranking("What are the main strategic priorities and business initiatives?")

# Test Question 3: Technology and innovation
reranker.ask_with_reranking("What technology investments or digital transformation efforts are discussed?")

# Test Question 4: Regulatory and compliance
reranker.ask_with_reranking("What regulatory challenges and compliance requirements are mentioned?")

# Test Question 5: Comparative analysis
reranker.ask_with_reranking("How do different companies approach customer acquisition and retention?")


‚ùì Question: What revenue streams and sources of income are mentioned?

  üîç Step 1: Retrieving top 20 candidates...
  ‚ôªÔ∏è  Step 2: Re-ranking to find best 5...
  ‚úÖ Selected 5 most relevant chunks

  ü§î Generating answer...

üìä ANSWER (with Re-Ranking)
Commissions, principal transactions, investment banking, interest income, insurance commissions.

üìö Sources Used:
  1. CIK 53347 | MD&A
  2. CIK 19722 | MD&A
  3. CIK 36781 | MD&A

‚ùì Question: What are the main strategic priorities and business initiatives?

  üîç Step 1: Retrieving top 20 candidates...
  ‚ôªÔ∏è  Step 2: Re-ranking to find best 5...
  ‚úÖ Selected 5 most relevant chunks

  ü§î Generating answer...

üìä ANSWER (with Re-Ranking)
1. Enhance the Corporation's core electric utility business.
2. Expand the Corporation's core electric utility business.
3. Expand the Corporation's non-utility business.
4. Pursue financial initiatives.

(Source: CIK 18540 | MD&A)

üìö Sources Used:
  1. CIK 75641 | MD&A
  2. 

'APAC TeleServices uses telephone-based sales and customer management solutions, employing database analysis, targeted marketing plans, and computerized call management systems for customer acquisition and retention (Source 1: APAC TeleServices, Business Description).\n\nThe diversified financial services holding company evaluates retention and disposition of existing operations and investigates acquisitions to maximize shareholder value (Source 2: Diversified Financial Services Holding Company, Business Description).\n\nCambrex Corporation focuses on niche products requiring high technical experience and reviews product lines to eliminate those not meeting profit goals, aiming to expand through internal growth and strategic acquisitions (Source 4: Cambrex Corporation, Business Description).\n\nNorth American Integrated Marketing provides database and direct-mail advertising services, using database analysis to supplement client data and offer insights for selling new products to exist

## üìä Compare All Methods

Test all implemented methods side-by-side with the same question.

In [8]:
# Cell 10: Compare all methods

import time

def compare_methods(question: str):
    """Compare all RAG methods with the same question"""

    print("="*70)
    print(f"   COMPARING ALL METHODS")
    print("="*70)
    print(f"\nQuestion: {question}\n")
    print("="*70)

    methods = [
        ("Basic RAG", lambda: rag.ask(question, top_k=5)),
        ("Hybrid Search", lambda: hybrid.ask_hybrid(question, top_k=5)),
        ("Few-Shot Learning", lambda: fewshot.ask_with_examples(question, top_k=5)),
        ("Re-Ranking", lambda: reranker.ask_with_reranking(question, retrieve_k=20, final_k=5))
    ]

    results = {}

    for name, method in methods:
        print(f"\n{'='*70}")
        print(f"   METHOD: {name}")
        print(f"{'='*70}\n")

        start = time.time()
        answer = method()
        elapsed = time.time() - start

        results[name] = {
            'answer': answer,
            'time': elapsed
        }

        print(f"\n‚è±Ô∏è  Time taken: {elapsed:.2f} seconds")

    # Print summary
    print("\n" + "="*70)
    print("   PERFORMANCE SUMMARY")
    print("="*70)

    for name, data in results.items():
        print(f"{name:25s} - {data['time']:.2f}s")

    print("="*70)

    return results

# Example usage:
# results = compare_methods("What are the main business activities described in these filings?")

In [21]:
compare_methods("What are the main business activities described in these filings?")

   COMPARING ALL METHODS

Question: What are the main business activities described in these filings?


   METHOD: Basic RAG

‚ùì Question: What are the main business activities described in these filings?

  üîç Searching ChromaDB for relevant information...
  ü§î Generating answer with GPT-3.5-turbo...

üìä ANSWER
Based on the information provided in the context from the financial documents, we can identify the main business activities described in the filings for each company:

1. CIK 1010612 (Source 1):
- Information not available.

2. CIK 3292 (Source 2, Source 3, Source 5):
- The main business activities described in the filings for CIK 3292 include the production and operations detailed in the 1994, 1995, and 1996 Annual Reports to Shareholders. These reports likely provide insights into the company's financial performance, strategic initiatives, and operational highlights for the respective years.

3. CIK 740155 (Source 4):
- The main business activities described in the fil

{'Basic RAG': {'answer': "Based on the information provided in the context from the financial documents, we can identify the main business activities described in the filings for each company:\n\n1. CIK 1010612 (Source 1):\n- Information not available.\n\n2. CIK 3292 (Source 2, Source 3, Source 5):\n- The main business activities described in the filings for CIK 3292 include the production and operations detailed in the 1994, 1995, and 1996 Annual Reports to Shareholders. These reports likely provide insights into the company's financial performance, strategic initiatives, and operational highlights for the respective years.\n\n3. CIK 740155 (Source 4):\n- The main business activities described in the filings for CIK 740155 are detailed in the 1995 Annual Report to Shareholders. This report likely outlines the company's financial results, key business activities, and performance metrics for that year.\n\nIn summary, CIK 3292 and CIK 740155 have their main business activities described 

## üß™ Test Queries

Run various test queries to evaluate the system.

In [9]:
# Cell 11: Test various queries

# Test questions you can try:
test_questions = [
    "What are the main business activities of the companies?",
    "What are the key risk factors mentioned?",
    "What financial metrics are discussed?",
    "Compare the business strategies of different companies",
    "What are the main revenue sources?"
]

print("üìù Suggested test questions:")
print()
for i, q in enumerate(test_questions, 1):
    print(f"{i}. {q}")
print()
print("üí° Use: rag.ask('your question')")
print("üí° Or: hybrid.ask_hybrid('your question')")
print("üí° Or: fewshot.ask_with_examples('your question')")
print("üí° Or: reranker.ask_with_reranking('your question')")
print("üí° Or: compare_methods('your question') to test all methods")

üìù Suggested test questions:

1. What are the main business activities of the companies?
2. What are the key risk factors mentioned?
3. What financial metrics are discussed?
4. Compare the business strategies of different companies
5. What are the main revenue sources?

üí° Use: rag.ask('your question')
üí° Or: hybrid.ask_hybrid('your question')
üí° Or: fewshot.ask_with_examples('your question')
üí° Or: reranker.ask_with_reranking('your question')
üí° Or: compare_methods('your question') to test all methods


## üìä ChromaDB Statistics

View statistics and information about the ChromaDB collection.

In [36]:
# Cell 12: View ChromaDB statistics

def show_chromadb_stats():
    """Display detailed ChromaDB statistics"""

    print("="*70)
    print("   CHROMADB STATISTICS")
    print("="*70)

    print(f"\nTotal chunks in database: {rag.collection.count()}")
    print(f"Embedding dimension: {rag.embedding_dim}")
    print(f"Collection name: {rag.collection.name}")

    # Get unique companies
    if rag.collection.count() > 0:
        results = rag.collection.get()
        companies = set(meta['company'] for meta in results['metadatas'])

        print(f"\nNumber of companies: {len(companies)}")
        print("\nCompanies in database:")

        for company in sorted(companies):
            # Count chunks per company
            company_chunks = sum(1 for m in results['metadatas'] if m['company'] == company)
            print(f"  ‚Ä¢ {company}: {company_chunks} chunks")

    print("\n" + "="*70)

show_chromadb_stats()

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  ‚Ä¢ CIK 32908: 3 chunks
  ‚Ä¢ CIK 3292: 9 chunks
  ‚Ä¢ CIK 33002: 3 chunks
  ‚Ä¢ CIK 33015: 6 chunks
  ‚Ä¢ CIK 33073: 6 chunks
  ‚Ä¢ CIK 33185: 6 chunks
  ‚Ä¢ CIK 33213: 9 chunks
  ‚Ä¢ CIK 3327: 3 chunks
  ‚Ä¢ CIK 33325: 3 chunks
  ‚Ä¢ CIK 3333: 9 chunks
  ‚Ä¢ CIK 33416: 6 chunks
  ‚Ä¢ CIK 33488: 3 chunks
  ‚Ä¢ CIK 33533: 5 chunks
  ‚Ä¢ CIK 33565: 12 chunks
  ‚Ä¢ CIK 33619: 6 chunks
  ‚Ä¢ CIK 33656: 6 chunks
  ‚Ä¢ CIK 3370: 6 chunks
  ‚Ä¢ CIK 33769: 6 chunks
  ‚Ä¢ CIK 33798: 6 chunks
  ‚Ä¢ CIK 33837: 9 chunks
  ‚Ä¢ CIK 33939: 2 chunks
  ‚Ä¢ CIK 33992: 3 chunks
  ‚Ä¢ CIK 34088: 6 chunks
  ‚Ä¢ CIK 34115: 3 chunks
  ‚Ä¢ CIK 34125: 6 chunks
  ‚Ä¢ CIK 34136: 6 chunks
  ‚Ä¢ CIK 34151: 3 chunks
  ‚Ä¢ CIK 34169: 3 chunks
  ‚Ä¢ CIK 34236: 3 chunks
  ‚Ä¢ CIK 34257: 3 chunks
  ‚Ä¢ CIK 34285: 9 chunks
  ‚Ä¢ CIK 34339: 3 chunks
  ‚Ä¢ CIK 34371: 5 chunks
  ‚Ä¢ CIK 34471: 3 chunks
  ‚Ä¢ CIK 34489: 3 chunks
  ‚Ä¢ CIK 3449: 9 chunks
  ‚

In [34]:
"""
=============================================================================
CHROMADB INSPECTION CODE - Run this in a Jupyter Notebook Cell
=============================================================================
This code checks both ChromaDB locations and shows you exactly what's in each.
Copy and paste this entire cell into your notebook.
=============================================================================
"""

import chromadb
import os
from collections import Counter

print("üîç CHROMADB DATABASE INSPECTION")
print("="*80)
print()

# Define both locations to check
locations = {
    "Location 1 (RAG Class Default)": os.path.expanduser("~/FinancialAI/chromadb"),
    "Location 2 (Actual Data)": "./chroma_db"
}

results_summary = []

for location_name, path in locations.items():
    print(f"\nüìÇ {location_name}")
    print(f"   Path: {path}")
    print("-"*80)

    # Check if path exists
    if not os.path.exists(path):
        print(f"   ‚ùå Path does NOT exist")
        results_summary.append({
            'location': location_name,
            'path': path,
            'exists': False,
            'collections': []
        })
        continue

    print(f"   ‚úÖ Path exists")

    try:
        # Connect to ChromaDB
        client = chromadb.PersistentClient(path=path)
        collections = client.list_collections()

        if len(collections) == 0:
            print(f"   ‚ö†Ô∏è  No collections found in this database")
            results_summary.append({
                'location': location_name,
                'path': path,
                'exists': True,
                'collections': []
            })
            continue

        print(f"   Collections found: {len(collections)}")
        print()

        location_data = {
            'location': location_name,
            'path': path,
            'exists': True,
            'collections': []
        }

        # Inspect each collection
        for collection in collections:
            print(f"   üì¶ Collection: '{collection.name}'")

            # Get document count
            count = collection.count()
            print(f"      ‚îî‚îÄ Total Documents: {count:,}")

            collection_info = {
                'name': collection.name,
                'count': count,
                'metadata_keys': [],
                'sample_docs': [],
                'companies': []
            }

            if count == 0:
                print(f"      ‚îî‚îÄ ‚ö†Ô∏è  EMPTY - No documents in this collection")
                location_data['collections'].append(collection_info)
                print()
                continue

            # Get sample documents (first 10)
            try:
                sample_size = min(10, count)
                results = collection.get(
                    limit=sample_size,
                    include=['documents', 'metadatas']
                )

                # Analyze metadata
                if results['metadatas'] and len(results['metadatas']) > 0:
                    # Get all unique metadata keys
                    all_keys = set()
                    companies = []

                    for meta in results['metadatas']:
                        if meta:
                            all_keys.update(meta.keys())
                            if 'company' in meta:
                                companies.append(meta['company'])

                    collection_info['metadata_keys'] = sorted(list(all_keys))
                    collection_info['companies'] = list(set(companies))

                    print(f"      ‚îî‚îÄ Metadata fields: {', '.join(sorted(all_keys))}")

                    # Count unique companies
                    if companies:
                        unique_companies = len(set(companies))
                        print(f"      ‚îî‚îÄ Unique companies (in sample): {unique_companies}")
                        print(f"      ‚îî‚îÄ Sample companies: {', '.join(list(set(companies))[:5])}")

                # Show sample document snippets
                if results['documents'] and len(results['documents']) > 0:
                    print(f"      ‚îî‚îÄ Sample document preview:")
                    first_doc = results['documents'][0]
                    preview = first_doc[:200].replace('\n', ' ')
                    print(f"         '{preview}...'")
                    collection_info['sample_docs'] = results['documents'][:3]

                # If collection is large, get all metadata to count unique companies
                if count > sample_size and 'company' in all_keys:
                    print(f"      ‚îî‚îÄ üîÑ Scanning all documents for company count...")
                    all_results = collection.get(include=['metadatas'])
                    all_companies = [meta.get('company') for meta in all_results['metadatas'] if meta and 'company' in meta]
                    unique_companies_total = len(set(all_companies))
                    print(f"      ‚îî‚îÄ ‚úÖ Total unique companies: {unique_companies_total}")
                    collection_info['total_companies'] = unique_companies_total

            except Exception as e:
                print(f"      ‚îî‚îÄ ‚ö†Ô∏è  Error reading documents: {e}")

            location_data['collections'].append(collection_info)
            print()

        results_summary.append(location_data)

    except Exception as e:
        print(f"   ‚ùå Error connecting to database: {e}")
        results_summary.append({
            'location': location_name,
            'path': path,
            'exists': True,
            'collections': [],
            'error': str(e)
        })

# Print summary comparison
print("\n" + "="*80)
print("üìä SUMMARY COMPARISON")
print("="*80)

# Create comparison table
print(f"\n{'Location':<35} {'Path Exists':<15} {'Collections':<15} {'Documents':<15}")
print("-"*80)

for result in results_summary:
    location = result['location'][:33]
    exists = "‚úÖ Yes" if result['exists'] else "‚ùå No"

    if not result['exists']:
        collections = "N/A"
        documents = "N/A"
    elif len(result['collections']) == 0:
        collections = "0"
        documents = "0"
    else:
        collections = str(len(result['collections']))
        total_docs = sum(c['count'] for c in result['collections'])
        documents = f"{total_docs:,}"

    print(f"{location:<35} {exists:<15} {collections:<15} {documents:<15}")

# Print detailed findings
print("\n" + "="*80)
print("üéØ DETAILED FINDINGS")
print("="*80)

for result in results_summary:
    print(f"\nüìç {result['location']}")

    if not result['exists']:
        print("   ‚ùå Database path does not exist")
        print("   üí° This database was never created")
        continue

    if 'error' in result:
        print(f"   ‚ùå Error: {result['error']}")
        continue

    if len(result['collections']) == 0:
        print("   ‚ö†Ô∏è  Database exists but contains NO collections")
        print("   üí° Database was created but never populated with data")
        continue

    for collection in result['collections']:
        print(f"\n   Collection: '{collection['name']}'")
        print(f"   ‚îî‚îÄ Documents: {collection['count']:,}")

        if collection['count'] == 0:
            print(f"   ‚îî‚îÄ Status: EMPTY ‚ùå")
            print(f"   ‚îî‚îÄ Issue: Collection exists but has no documents")
        else:
            print(f"   ‚îî‚îÄ Status: HAS DATA ‚úÖ")
            if collection['metadata_keys']:
                print(f"   ‚îî‚îÄ Metadata: {', '.join(collection['metadata_keys'])}")
            if collection['companies']:
                print(f"   ‚îî‚îÄ Companies: {len(collection['companies'])} found in sample")
            if 'total_companies' in collection:
                print(f"   ‚îî‚îÄ Total Companies: {collection['total_companies']}")

# Print recommendations
print("\n" + "="*80)
print("üí° RECOMMENDATIONS")
print("="*80)

# Find which location has data
has_data = None
empty_location = None

for result in results_summary:
    if result['exists'] and len(result['collections']) > 0:
        for collection in result['collections']:
            if collection['count'] > 0:
                has_data = {
                    'location': result['location'],
                    'path': result['path'],
                    'collection_name': collection['name'],
                    'doc_count': collection['count']
                }
                break

    if result['exists'] and (len(result['collections']) == 0 or
                             all(c['count'] == 0 for c in result['collections'])):
        empty_location = {
            'location': result['location'],
            'path': result['path']
        }

if has_data:
    print(f"\n‚úÖ FOUND DATA:")
    print(f"   Location: {has_data['location']}")
    print(f"   Path: {has_data['path']}")
    print(f"   Collection: '{has_data['collection_name']}'")
    print(f"   Documents: {has_data['doc_count']:,}")
    print()
    print("üîß ACTION REQUIRED:")
    print("   Your RAG system should point to:")
    print(f"   ‚Ä¢ Path: {has_data['path']}")
    print(f"   ‚Ä¢ Collection: '{has_data['collection_name']}'")
    print()
    print("   Update your FinBERTFinancialRAG class:")
    print(f"   ‚Ä¢ persist_directory = \"{has_data['path']}\"")
    print(f"   ‚Ä¢ collection_name = \"{has_data['collection_name']}\"")
else:
    print("\n‚ö†Ô∏è  NO DATA FOUND in either location!")
    print("   You need to populate ChromaDB with your EDGAR documents first.")

print("\n" + "="*80)
print("‚úÖ INSPECTION COMPLETE")
print("="*80)

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given


üîç CHROMADB DATABASE INSPECTION


üìÇ Location 1 (RAG Class Default)
   Path: /root/FinancialAI/chromadb
--------------------------------------------------------------------------------
   ‚úÖ Path exists
   Collections found: 1

   üì¶ Collection: 'financial_filings'
      ‚îî‚îÄ Total Documents: 27,813


Failed to send telemetry event CollectionGetEvent: capture() takes 1 positional argument but 3 were given


      ‚îî‚îÄ Metadata fields: cik, company, filing_date, filing_type, index, section, year
      ‚îî‚îÄ Unique companies (in sample): 4
      ‚îî‚îÄ Sample companies: CIK 100240, CIK 58696, CIK 103730, CIK 92116
      ‚îî‚îÄ Sample document preview:
         'Item 1. Business General Southern California Water Company (the "Registrant") is a public utility company engaged principally in the purchase, production, distribution and sale of water. The Registran...'
      ‚îî‚îÄ üîÑ Scanning all documents for company count...


Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given


      ‚îî‚îÄ ‚úÖ Total unique companies: 6232


üìÇ Location 2 (Actual Data)
   Path: ./chroma_db
--------------------------------------------------------------------------------
   ‚úÖ Path exists
   Collections found: 1

   üì¶ Collection: 'edgar_finbert'
      ‚îî‚îÄ Total Documents: 0
      ‚îî‚îÄ ‚ö†Ô∏è  EMPTY - No documents in this collection


üìä SUMMARY COMPARISON

Location                            Path Exists     Collections     Documents      
--------------------------------------------------------------------------------
Location 1 (RAG Class Default)      ‚úÖ Yes           1               27,813         
Location 2 (Actual Data)            ‚úÖ Yes           1               0              

üéØ DETAILED FINDINGS

üìç Location 1 (RAG Class Default)

   Collection: 'financial_filings'
   ‚îî‚îÄ Documents: 27,813
   ‚îî‚îÄ Status: HAS DATA ‚úÖ
   ‚îî‚îÄ Metadata: cik, company, filing_date, filing_type, index, section, year
   ‚îî‚îÄ Companies: 4 found in sample
   ‚îî‚

In [93]:
"""
üì¶ DOWNLOAD YOUR CHROMADB
=========================
27,813 documents | 6,232 companies
"""

import shutil
import os
from datetime import datetime

print("üì¶ Creating ZIP of your ChromaDB...")

# Your ChromaDB path (the one with 27,813 docs!)
chroma_path = "/root/FinancialAI/chromadb"

# Create backup name with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_name = f"chromadb_backup_{timestamp}"

print(f"üîÑ Zipping {chroma_path}...")

# Create ZIP
shutil.make_archive(output_name, 'zip', chroma_path)

# Check size
zip_path = f"{output_name}.zip"
zip_size = os.path.getsize(zip_path) / (1024 * 1024)

print(f"‚úÖ Created: {output_name}.zip ({zip_size:.1f} MB)")

# Download (if in Colab)
try:
    from google.colab import files
    print("üì• Downloading...")
    files.download(zip_path)
    print("‚úÖ Download started!")
except:
    print(f"‚úÖ File ready: {os.path.abspath(zip_path)}")
    print("   (Right-click ‚Üí Download in Jupyter)")

print()
print("="*60)
print("üì¶ What's inside:")
print("   ‚Ä¢ 27,813 documents")
print("   ‚Ä¢ 6,232 companies")
print("   ‚Ä¢ Collection: financial_filings")
print("="*60)

üì¶ Creating ZIP of your ChromaDB...
üîÑ Zipping /root/FinancialAI/chromadb...
‚úÖ Created: chromadb_backup_20251202_205904.zip (327.1 MB)
‚úÖ File ready: /app/chromadb_backup_20251202_205904.zip
   (Right-click ‚Üí Download in Jupyter)

üì¶ What's inside:
   ‚Ä¢ 27,813 documents
   ‚Ä¢ 6,232 companies
   ‚Ä¢ Collection: financial_filings


In [37]:
# Directly paste your fine-tuned model ID
FINETUNED_MODEL_ID = "ft:gpt-4o-2024-08-06:personal:finqa-financial:Chr7KFPi"

print(f"‚úÖ Model ID: {FINETUNED_MODEL_ID}")

‚úÖ Model ID: ft:gpt-4o-2024-08-06:personal:finqa-financial:Chr7KFPi
