**Part 1: MultiModal Retrieval-Augmented Generation (Kaggle Competition) (30 + 10 Marks)**

In [None]:
pip install PyMuPDF pdfplumber pandas sentence_transformers chromadb openai

Collecting PyMuPDF
  Downloading pymupdf-1.25.5-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Collecting pdfplumber
  Downloading pdfplumber-0.11.6-py3-none-any.whl.metadata (42 kB)
Collecting chromadb
  Downloading chromadb-1.0.7-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.9 kB)
Collecting pdfminer.six==20250327 (from pdfplumber)
  Downloading pdfminer_six-20250327-py3-none-any.whl.metadata (4.1 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-4.30.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Downloading chroma_hnswlib-0.7.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata

In [None]:
# Install required packages (run in your environment if needed)
# !pip install -q sentence-transformers faiss-cpu PyMuPDF pandas pdfplumber pillow transformers torch torchvision chromadb openai

import fitz  # PyMuPDF
import pdfplumber
import os
from PIL import Image
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
from transformers import CLIPProcessor, CLIPModel
from openai import OpenAI


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
import sys
import pandas as pd
import numpy as np
import re
import io
import json
import logging
import subprocess
import importlib
from PIL import Image
from tqdm import tqdm
import matplotlib.pyplot as plt
from typing import List, Dict, Tuple, Optional, Any
from dataclasses import dataclass
import torch

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Function to check and install a package if needed
def install_package(package_name):
    try:
        logger.info(f"Checking for {package_name}...")
        importlib.import_module(package_name.replace('-', '_'))
        logger.info(f"{package_name} is already installed.")
    except ImportError:
        logger.info(f"Installing {package_name}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package_name])
        logger.info(f"{package_name} has been installed.")
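
One caveat with `install_package` above: it derives the import name from the pip name, which works for `pdfplumber` but not for distributions such as `PyMuPDF` (imported as `fitz`) or `faiss-cpu` (imported as `faiss`). A minimal sketch of a check that looks the import name up explicitly is shown below. The `IMPORT_NAMES` table and the `import_name_for` and `is_importable` helpers are hypothetical names introduced for illustration; they are not part of this notebook.

```python
import importlib.util

# Hypothetical lookup table: pip distribution name -> importable module name.
IMPORT_NAMES = {
    "PyMuPDF": "fitz",
    "faiss-cpu": "faiss",
    "sentence-transformers": "sentence_transformers",
    "bert-score": "bert_score",
}


def import_name_for(package_name):
    """Return the module name to import for a given pip package name."""
    # Fall back to the usual dash-to-underscore convention for unknown names.
    return IMPORT_NAMES.get(package_name, package_name.replace("-", "_"))


def is_importable(package_name):
    """Check whether the package's module can be located, without importing it."""
    return importlib.util.find_spec(import_name_for(package_name)) is not None
```

With such a check, an installer helper could skip the pip call whenever `is_importable(package_name)` is already true.

In [None]: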

# Import potentially missing packages with fallbacks
try:
    import torch
except ImportError:
    install_package("torch")
    import torch

try:
    import fitz  # PyMuPDF
except ImportError:
    install_package("PyMuPDF")
    import fitz

try:
    import pdfplumber
except ImportError:
    install_package("pdfplumber")
    import pdfplumber

try:
    from sentence_transformers import SentenceTransformer, util
except ImportError:
    install_package("sentence-transformers")
    from sentence_transformers import SentenceTransformer, util

try:
    import spacy
except ImportError:
    install_package("spacy")
    import spacy

try:
    import faiss
except ImportError:
    install_package("faiss-cpu")
    import faiss

try:
    from bert_score import BERTScorer
except ImportError:
    install_package("bert-score")
    from bert_score import BERTScorer

try:
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
except ImportError:
    install_package("transformers")
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Make sure we have the latest OpenAI package installed
try:
    from openai import OpenAI
except ImportError:
    logger.info("OpenAI package not found or outdated. Installing the latest version...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "--upgrade", "openai"])
    from openai import OpenAI

# Data classes for structured handling
@dataclass
class TextItem:
    id: str
    content: str
    page_num: int
    content_refs: Optional[List[int]] = None
    topics: Optional[List[str]] = None
    key_phrases: Optional[List[str]] = None
    embedding: Optional[np.ndarray] = None

@dataclass
class ImageItem:
    id: int
    label: str
    type: str
    page: int
    path: str
    rel_bbox: List[float]
    abs_bbox: List[float]
    embedding: Optional[np.ndarray] = None
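
The list-valued fields of `TextItem` default to `None`, so code that iterates over them has to guard with `or []` first. A common alternative, sketched here under the assumption that an empty list is an acceptable default, is `dataclasses.field(default_factory=list)`, which gives every instance its own fresh list. `Chunk` is a hypothetical stand-in, not a class used elsewhere in this notebook.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chunk:  # hypothetical stand-in for TextItem
    id: str
    content: str
    # default_factory builds a new list per instance, so lists are never shared
    content_refs: List[int] = field(default_factory=list)
    topics: List[str] = field(default_factory=list)


a = Chunk(id="p1_0", content="Output growth slowed.")
b = Chunk(id="p1_1", content="See Table 1-1.")
b.content_refs.append(1)  # only b's list changes
```

Whether `None` or an empty list is the better default depends on whether "not yet computed" needs to be distinguishable from "computed and empty".

In [None]: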

class VectorDatabase:
    """
    A vector database implementation using FAISS for efficient similarity search.
    Supports storing and retrieving both text and image embeddings.
    """
    def __init__(self, dimension=384, index_type="Flat"):
        """
        Initialize vector database

        Args:
            dimension: Embedding dimension
            index_type: FAISS index type label; stored for reference only, since flat inner-product indices are always used
        """
        self.dimension = dimension
        self.index_type = index_type

        # Initialize empty indices
        self.text_index = faiss.IndexFlatIP(dimension)  # Inner product (cosine) for text
        self.image_index = faiss.IndexFlatIP(dimension)  # Inner product (cosine) for images

        # Storage for items
        self.text_items = {}
        self.image_items = {}

        # Track indices
        self.text_ids_to_idx = {}
        self.image_ids_to_idx = {}
        self.text_idx_to_ids = {}
        self.image_idx_to_ids = {}

        # Counters for indices
        self.text_counter = 0
        self.image_counter = 0

        logger.info(f"Initialized vector database with dimension {dimension}")

    def add_text(self, text_id, text_item):
        """
        Add a text item and its embedding to the database

        Args:
            text_id: Unique identifier for the text
            text_item: TextItem object with embedding
        """
        if text_item.embedding is None:
            raise ValueError("Text embedding is required")

        # Normalize embedding for cosine similarity
        embedding = text_item.embedding.reshape(1, -1).astype(np.float32)
        embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)

        # Add to index
        self.text_index.add(embedding)

        # Store mapping
        idx = self.text_counter
        self.text_ids_to_idx[text_id] = idx
        self.text_idx_to_ids[idx] = text_id

        # Store text item
        self.text_items[text_id] = text_item

        # Update counter
        self.text_counter += 1

        return text_id

    def add_image(self, image_id, image_item):
        """
        Add an image item and its embedding to the database

        Args:
            image_id: Unique identifier for the image
            image_item: ImageItem object with embedding
        """
        if image_item.embedding is None:
            raise ValueError("Image embedding is required")

        # Normalize embedding for cosine similarity
        embedding = image_item.embedding.reshape(1, -1).astype(np.float32)
        embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)

        # Add to index
        self.image_index.add(embedding)

        # Store mapping
        idx = self.image_counter
        self.image_ids_to_idx[image_id] = idx
        self.image_idx_to_ids[idx] = image_id

        # Store image item
        self.image_items[image_id] = image_item

        # Update counter
        self.image_counter += 1

        return image_id

    def search_texts(self, query_embedding, k=5):
        """
        Search for similar text embeddings

        Args:
            query_embedding: Query embedding vector
            k: Number of results to return

        Returns:
            List of (text_id, score) pairs
        """
        if self.text_counter == 0:
            return []

        # Normalize query embedding
        query_embedding = query_embedding.reshape(1, -1).astype(np.float32)
        query_embedding = query_embedding / np.linalg.norm(query_embedding, axis=1, keepdims=True)

        # Search
        k = min(k, self.text_counter)
        scores, indices = self.text_index.search(query_embedding, k)

        # Convert indices to text IDs
        results = []
        for i in range(len(indices[0])):
            idx = indices[0][i]
            score = scores[0][i]
            text_id = self.text_idx_to_ids[idx]
            results.append((text_id, score))

        return results

    def search_images(self, query_embedding, k=5):
        """
        Search for similar image embeddings

        Args:
            query_embedding: Query embedding vector
            k: Number of results to return

        Returns:
            List of (image_id, score) pairs
        """
        if self.image_counter == 0:
            return []

        # Normalize query embedding
        query_embedding = query_embedding.reshape(1, -1).astype(np.float32)
        query_embedding = query_embedding / np.linalg.norm(query_embedding, axis=1, keepdims=True)

        # Search
        k = min(k, self.image_counter)
        scores, indices = self.image_index.search(query_embedding, k)

        # Convert indices to image IDs
        results = []
        for i in range(len(indices[0])):
            idx = indices[0][i]
            score = scores[0][i]
            image_id = self.image_idx_to_ids[idx]
            results.append((image_id, score))

        return results

    def get_text(self, text_id):
        """Get text item by ID"""
        return self.text_items.get(text_id)

    def get_image(self, image_id):
        """Get image item by ID"""
        return self.image_items.get(image_id)

    def save(self, path):
        """
        Save the vector database to disk

        Args:
            path: Directory path to save to
        """
        os.makedirs(path, exist_ok=True)

        # Save indices
        faiss.write_index(self.text_index, os.path.join(path, 'text_index.faiss'))
        faiss.write_index(self.image_index, os.path.join(path, 'image_index.faiss'))

        # Save items and mappings
        with open(os.path.join(path, 'text_items.json'), 'w') as f:
            # Convert TextItems to dict for JSON serialization
            serializable_text_items = {}
            for text_id, text_item in self.text_items.items():
                item_dict = {
                    'id': text_item.id,
                    'content': text_item.content,
                    'page_num': text_item.page_num,
                    'content_refs': text_item.content_refs,
                    'topics': text_item.topics,
                    'key_phrases': text_item.key_phrases
                }
                serializable_text_items[text_id] = item_dict
            json.dump(serializable_text_items, f)

        with open(os.path.join(path, 'image_items.json'), 'w') as f:
            # Convert ImageItems to dict for JSON serialization
            serializable_image_items = {}
            for image_id, image_item in self.image_items.items():
                item_dict = {
                    'id': image_item.id,
                    'label': image_item.label,
                    'type': image_item.type,
                    'page': image_item.page,
                    'path': image_item.path,
                    'rel_bbox': image_item.rel_bbox,
                    'abs_bbox': image_item.abs_bbox
                }
                serializable_image_items[str(image_id)] = item_dict
            json.dump(serializable_image_items, f)

        # Save mappings
        with open(os.path.join(path, 'text_mappings.json'), 'w') as f:
            json.dump({
                'text_ids_to_idx': {str(k): v for k, v in self.text_ids_to_idx.items()},
                'text_idx_to_ids': {str(k): v for k, v in self.text_idx_to_ids.items()},
                'text_counter': self.text_counter
            }, f)

        with open(os.path.join(path, 'image_mappings.json'), 'w') as f:
            json.dump({
                'image_ids_to_idx': {str(k): int(v) for k, v in self.image_ids_to_idx.items()},
                'image_idx_to_ids': {str(k): int(v) for k, v in self.image_idx_to_ids.items()},
                'image_counter': self.image_counter
            }, f)

        # Save the raw (un-normalized) embeddings separately so load() can reattach them to the items
        text_embeddings = np.zeros((self.text_counter, self.dimension), dtype=np.float32)
        for text_id, text_item in self.text_items.items():
            if text_item.embedding is not None:
                idx = self.text_ids_to_idx[text_id]
                text_embeddings[idx] = text_item.embedding

        image_embeddings = np.zeros((self.image_counter, self.dimension), dtype=np.float32)
        for image_id, image_item in self.image_items.items():
            if image_item.embedding is not None:
                idx = self.image_ids_to_idx[image_id]
                image_embeddings[idx] = image_item.embedding

        np.save(os.path.join(path, 'text_embeddings.npy'), text_embeddings)
        np.save(os.path.join(path, 'image_embeddings.npy'), image_embeddings)

        logger.info(f"Saved vector database to {path}")

    def load(self, path):
        """
        Load vector database from disk

        Args:
            path: Directory path to load from
        """
        # Load indices
        self.text_index = faiss.read_index(os.path.join(path, 'text_index.faiss'))
        self.image_index = faiss.read_index(os.path.join(path, 'image_index.faiss'))

        # Load mappings
        with open(os.path.join(path, 'text_mappings.json'), 'r') as f:
            text_mappings = json.load(f)
            self.text_ids_to_idx = {k: int(v) for k, v in text_mappings['text_ids_to_idx'].items()}
            self.text_idx_to_ids = {int(k): v for k, v in text_mappings['text_idx_to_ids'].items()}
            self.text_counter = text_mappings['text_counter']

        with open(os.path.join(path, 'image_mappings.json'), 'r') as f:
            image_mappings = json.load(f)
            self.image_ids_to_idx = {int(k): int(v) for k, v in image_mappings['image_ids_to_idx'].items()}
            self.image_idx_to_ids = {int(k): int(v) for k, v in image_mappings['image_idx_to_ids'].items()}
            self.image_counter = image_mappings['image_counter']

        # Load items
        with open(os.path.join(path, 'text_items.json'), 'r') as f:
            serialized_text_items = json.load(f)
            for text_id, item_dict in serialized_text_items.items():
                self.text_items[text_id] = TextItem(
                    id=item_dict['id'],
                    content=item_dict['content'],
                    page_num=item_dict['page_num'],
                    content_refs=item_dict['content_refs'],
                    topics=item_dict['topics'],
                    key_phrases=item_dict['key_phrases']
                )

        with open(os.path.join(path, 'image_items.json'), 'r') as f:
            serialized_image_items = json.load(f)
            for image_id, item_dict in serialized_image_items.items():
                self.image_items[int(image_id)] = ImageItem(
                    id=item_dict['id'],
                    label=item_dict['label'],
                    type=item_dict['type'],
                    page=item_dict['page'],
                    path=item_dict['path'],
                    rel_bbox=item_dict['rel_bbox'],
                    abs_bbox=item_dict['abs_bbox']
                )

        # Load embeddings
        text_embeddings = np.load(os.path.join(path, 'text_embeddings.npy'))
        for text_id, idx in self.text_ids_to_idx.items():
            self.text_items[text_id].embedding = text_embeddings[idx]

        image_embeddings = np.load(os.path.join(path, 'image_embeddings.npy'))
        for image_id, idx in self.image_ids_to_idx.items():
            self.image_items[image_id].embedding = image_embeddings[idx]

        logger.info(f"Loaded vector database from {path} with {self.text_counter} texts and {self.image_counter} images")
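
`VectorDatabase` relies on the fact that the inner product of two unit-length vectors equals their cosine similarity, which is why every vector is L2-normalized before it reaches `IndexFlatIP`. The same idea is sketched below in dependency-free Python. `normalize` and `top_k` are hypothetical helpers that use a brute-force scan rather than FAISS, so this illustrates the scoring only, not the indexing.

```python
import math


def normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def top_k(query, items, k=5):
    """Return the k (item_id, score) pairs with the highest cosine similarity."""
    q = normalize(query)
    scored = [
        # Inner product of two normalized vectors == cosine similarity
        (item_id, sum(a * b for a, b in zip(q, normalize(vec))))
        for item_id, vec in items.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


items = {"p1_0": [1.0, 0.0], "p1_1": [0.0, 2.0], "p2_0": [3.0, 3.0]}
results = top_k([2.0, 0.0], items, k=2)
```

The guard for zero vectors avoids a division by zero, which in NumPy would produce NaN scores instead of raising an error.

In [None]: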

class EnhancedMultiModalRAGSystem:
    """
    An enhanced MultiModal Retrieval-Augmented Generation system that:
    1. Ingests and stores data (figures, tables, and textual descriptions) in a vector database
    2. Retrieves the most relevant text and images for a given query with efficient indexing
    3. Generates an answer using a generative model based on retrieved results
    4. Evaluates retrieval and response quality using BERTScore
    """

    def __init__(self, pdf_path=None, output_dir="output_data", vector_db_path=None, openai_api_key=None):
        """
        Initialize the MultiModal RAG system

        Args:
            pdf_path: Path to the PDF document to process
            output_dir: Directory to store extracted content and results
            vector_db_path: Path to store the vector database (if None, in-memory)
            openai_api_key: OpenAI API key for using GPT models (optional)
        """
        self.pdf_path = pdf_path
        self.output_dir = output_dir
        self.content_dir = os.path.join(output_dir, "content")
        self.vector_db_path = vector_db_path

        # Data storage
        self.text_chunks = []  # Will store text chunks
        self.content_registry = {}  # Will store content metadata

        # Set OpenAI API key if provided
        if openai_api_key:
            os.environ["OPENAI_API_KEY"] = openai_api_key
            logger.info("OpenAI API key has been set")

        # Initialize vector database
        self.vector_db = VectorDatabase(dimension=384)  # Using dimension from sentence-transformers

        # Ensure directories exist
        os.makedirs(self.output_dir, exist_ok=True)
        os.makedirs(self.content_dir, exist_ok=True)

        # Check and install required packages
        self._check_dependencies()

        # Initialize embedder models
        try:
            self.text_embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2')
            logger.info("Text embedder initialized")
        except Exception as e:
            logger.warning(f"Failed to initialize SentenceTransformer: {str(e)}")
            logger.info("Installing sentence-transformers...")
            os.system("pip install -q sentence-transformers")
            self.text_embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

        # Initialize NLP for text analysis
        try:
            self.nlp = spacy.load("en_core_web_sm")
            logger.info("NLP model initialized")
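        except OSError:
            # spaCy raises OSError when the model package is not installed
            logger.info("Downloading spaCy model...")
            subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_sm"])
            self.nlp = spacy.load("en_core_web_sm")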

        # Initialize BERTScore for evaluation
        try:
            self.bert_scorer = BERTScorer(lang="en", rescale_with_baseline=True)
            logger.info("BERTScore initialized")
        except Exception as e:
            logger.warning(f"Failed to initialize BERTScorer: {str(e)}")
            logger.info("Installing bert-score...")
            os.system("pip install -q bert-score")
            self.bert_scorer = BERTScorer(lang="en", rescale_with_baseline=True)

        # Initialize LLM response generator
        try:
            # Check if OpenAI API key is set
            api_key = os.environ.get("OPENAI_API_KEY")

            if api_key:
                # Initialize OpenAI client
                self.openai_client = OpenAI(api_key=api_key)
                self.llm_type = "openai"
                logger.info("OpenAI LLM initialized")
            elif torch.cuda.is_available():
                # Initialize local model if GPU available and no OpenAI API key
                self.llm_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
                self.llm_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
                self.llm_pipeline = pipeline(
                    "text2text-generation",
                    model=self.llm_model,
                    tokenizer=self.llm_tokenizer,
                    max_length=100
                )
                self.llm_type = "local"
                logger.info("Local LLM initialized (flan-t5-base)")
            else:
                # Fallback to a simpler local model for CPU
                self.llm_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
                self.llm_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
                self.llm_pipeline = pipeline(
                    "text2text-generation",
                    model=self.llm_model,
                    tokenizer=self.llm_tokenizer,
                    max_length=100
                )
                self.llm_type = "local"
                logger.info("Local LLM initialized (flan-t5-small)")
        except Exception as e:
            logger.warning(f"Failed to initialize LLM: {str(e)}")
            logger.info("Using rule-based response generation as fallback")
            self.llm_type = "rule-based"

        # Content definitions with coordinates
        # Format: [content_id, page_num, content_type, content_label, [x0, y0, x1, y1]]
        self.content_definitions = [
            # Tables
            [1, 1, "table", "Table 1-1", [0.2, 0.6, 0.95, 0.85]],   # World Output Growth
            [5, 5, "table", "Table 1-2", [0.2, 0.7, 0.95, 0.95]],   # Growth, Unemployment, Inflation
            [7, 7, "table", "Table 1-3", [0.2, 0.75, 0.95, 0.95]],  # Growth, Unemployment, Euro Area
            [10, 11, "table", "Table 1-4", [0.2, 0.6, 0.95, 0.9]],  # Growth and Inflation in China

            # Figures
            [2, 2, "figure", "Figure 1-1", [0.1, 0.1, 0.9, 0.75]],  # Stock prices
            [3, 3, "figure", "Figure 1-2", [0.1, 0.1, 0.9, 0.71]],  # Unemployment rates
            [4, 4, "figure", "Figure 1-3", [0.1, 0.1, 0.9, 0.8]],   # United States map
            [6, 5, "figure", "Figure 1-4", [0.1, 0.3, 0.9, 0.6]],   # US Federal Budget
            [8, 8, "figure", "Figure 1-5", [0.1, 0.1, 0.9, 0.8]],   # Euro area map
            [9, 10, "figure", "Figure 1-6", [0.1, 0.4, 0.9, 0.8]],  # China map
            [11, 20, "figure", "Figure 2-1", [0.1, 0.5, 0.9, 0.9]], # Nominal & real GDP
            [12, 21, "figure", "Figure 2-2", [0.1, 0.1, 0.9, 0.6]], # Growth rate of US GDP
            [13, 24, "figure", "Figure 2-3", [0.1, 0.1, 0.9, 0.6]], # US unemployment rate
            [14, 27, "figure", "Figure 2-4", [0.1, 0.1, 0.9, 0.6]], # Inflation rate
            [15, 29, "figure", "Figure 2-5", [0.1, 0.1, 0.9, 0.6]], # Changes in unemployment
            [16, 30, "figure", "Figure 2-6", [0.1, 0.1, 0.9, 0.6]], # Changes in inflation rate
            [17, 32, "figure", "Figure 2-7", [0.1, 0.2, 0.9, 0.6]]  # Organization of the book
        ]

    def extract_content(self):
        """
        Extract content (figures, tables, and text) from the PDF
        """
        if not self.pdf_path:
            raise ValueError("PDF path not provided")

        logger.info(f"Processing PDF: {self.pdf_path}")

        # Open the PDF
        doc = fitz.open(self.pdf_path)
        pdf = pdfplumber.open(self.pdf_path)

        # Process each hard-coded content definition
        for content_id, page_num, content_type, content_label, rel_bbox in self.content_definitions:
            try:
                logger.info(f"Processing {content_type} {content_label} on page {page_num}")
                output_path = os.path.join(self.content_dir, f"{content_id}.png")

                # Check if page exists
                if page_num > len(doc):
                    logger.warning(f"Page {page_num} out of range. Document has {len(doc)} pages.")
                    continue

                # Get page
                page = doc[page_num-1]  # 0-based indexing
                width, height = page.rect.width, page.rect.height

                # Convert relative coordinates (fractions of page width/height) to absolute page coordinates
                abs_bbox = (
                    rel_bbox[0] * width,
                    rel_bbox[1] * height,
                    rel_bbox[2] * width,
                    rel_bbox[3] * height
                )

                # Create bounding box
                rect = fitz.Rect(abs_bbox)

                # Extract with high resolution
                zoom = 3.0  # High resolution
                mat = fitz.Matrix(zoom, zoom)
                pix = page.get_pixmap(matrix=mat, clip=rect)

                # Convert to PIL
                img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

                # Save image (PNG is lossless, so no quality setting is needed)
                img.save(output_path)

                # Store metadata
                self.content_registry[content_id] = {
                    "id": content_id,
                    "label": content_label,
                    "type": content_type,
                    "page": page_num,
                    "path": output_path,
                    "rel_bbox": rel_bbox,
                    "abs_bbox": abs_bbox,
                    "title": self._get_figure_table_title(doc, page_num, content_label, content_type)
                }

                logger.info(f"Extracted {content_type} {content_label} as item #{content_id}")

            except Exception as e:
                logger.error(f"Error extracting {content_label}: {str(e)}")

        # Extract text chunks with dynamic approach
        self._extract_text_chunks(doc)

        # Create extended text chunks for figures and tables
        self._create_extended_chunks_for_content()

        # Close documents
        doc.close()
        pdf.close()

        logger.info(f"Extraction complete: {len(self.content_registry)} items extracted")

        return self.content_registry

    def _get_figure_table_title(self, doc, page_num, content_label, content_type):
        """
        Extract the title/caption of a figure or table

        Args:
            doc: PyMuPDF document
            page_num: Page number
            content_label: Label of the content (e.g., "Figure 1-1")
            content_type: Type of content ("figure" or "table")

        Returns:
            title: Title/caption of the figure or table
        """
        try:
            # Get page text
            page = doc[page_num-1]  # 0-based indexing
            text = page.get_text()

            # Look for the label followed by a title (the label is escaped so it matches literally)
            pattern = re.escape(content_label) + r"[.:]?(.+?)(?:\n\n|\n[A-Z])"
            match = re.search(pattern, text, re.DOTALL)

            if match:
                title = match.group(1).strip()
                return title

            # If not found, try to find any mention of the label
            mentions = []
            for p_idx in range(max(0, page_num-2), min(len(doc), page_num+2)):
                p_text = doc[p_idx].get_text()
                if content_label in p_text:
                    idx = p_text.find(content_label)
                    # Get text after the label up to the next paragraph
                    after_text = p_text[idx:idx+200].split('\n\n')[0]
                    mentions.append(after_text)

            if mentions:
                return mentions[0].replace(content_label, '').strip()

            return ""
        except Exception as e:
            logger.warning(f"Error extracting title for {content_label}: {str(e)}")
            return ""

    def _extract_text_chunks(self, doc):
        """
        Extract text chunks from the document for retrieval.
        Each page is split into paragraphs on blank lines, and every paragraph is
        annotated with figure/table references, named entities, and noun phrases.

        Args:
            doc: The PyMuPDF document
        """
        logger.info("Extracting text chunks for retrieval...")

        # First extract full paragraphs by page
        page_paragraphs = []

        for page_idx in range(len(doc)):
            page = doc[page_idx]
            page_num = page_idx + 1

            # Get text and split into paragraphs
            text = page.get_text()
            paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]

            for para in paragraphs:
                page_paragraphs.append({
                    'page': page_num,
                    'content': para
                })

        # Now process each paragraph
        for para_idx, para_info in enumerate(page_paragraphs):
            para = para_info['content']
            page_num = para_info['page']

            if len(para) < 30:  # Skip very short paragraphs
                continue

            # Find references to figures and tables
            figure_refs = re.findall(r'(Figure|Table)\s+(\d+[-\.]\d+|\d+)', para, re.IGNORECASE)
            content_refs = []

            for ref_type, ref_num in figure_refs:
                # Try to map the reference to content registry
                ref_label = f"{ref_type} {ref_num}"  # e.g., "Figure 1-1"
                ref_label_alt = f"{ref_type.title()} {ref_num}"  # Ensure consistent capitalization

                for content_num, content_info in self.content_registry.items():
                    if content_info["label"] == ref_label or content_info["label"] == ref_label_alt:
                        content_refs.append(content_num)

            # Extract economic topics using NLP
            doc_nlp = self.nlp(para)
            topics = []

            # Use named entity recognition and keyword extraction
            for ent in doc_nlp.ents:
                if ent.label_ in ['ORG', 'GPE', 'MONEY', 'PERCENT']:
                    topics.append(ent.text)

            # Extract key phrases
            key_phrases = []
            for chunk in doc_nlp.noun_chunks:
                if len(chunk.text.split()) > 1:  # Multi-word phrases
                    key_phrases.append(chunk.text)

            # Check for chapter, section headings that might provide context
            is_heading = False
            section_match = re.match(r'(\d+(?:[.-]\d+)*)\s+(.+)', para)
            chapter_match = re.match(r'Chapter\s+(\d+)\s+(.+)', para, re.IGNORECASE)

            if section_match or chapter_match or any(heading in para.lower() for heading in ["introduction", "chapter", "section", "conclusion"]):
                is_heading = True

            # Store text chunk with metadata
            self.text_chunks.append({
                'id': f"p{page_num}_{para_idx}",
                'content': para,
                'page': page_num,
                'content_refs': content_refs,
                'topics': topics,
                'key_phrases': key_phrases,
                'is_heading': is_heading
            })

        logger.info(f"Extracted {len(self.text_chunks)} text chunks")

    def _create_extended_chunks_for_content(self):
        """
        Create extended text chunks with more context for figures and tables
        This helps with generating reference-style responses
        """
        logger.info("Creating extended context chunks for figures and tables...")

        # For each content item (figure/table), collect surrounding text
        for content_id, content_info in self.content_registry.items():
            page_num = content_info["page"]
            content_label = content_info["label"]

            # Find text chunks that reference this content
            ref_chunks = [chunk for chunk in self.text_chunks if content_id in chunk.get('content_refs', [])]

            # Find text chunks on the same page
            page_chunks = [chunk for chunk in self.text_chunks if chunk['page'] == page_num]

            # Find text chunks on surrounding pages
            near_chunks = [
                chunk for chunk in self.text_chunks
                if abs(chunk['page'] - page_num) <= 2 and content_label in chunk['content']
            ]

            # Combine all chunks that might be relevant
            all_relevant_chunks = ref_chunks + page_chunks + near_chunks

            # Remove duplicates
            all_relevant_chunks = list({chunk['id']: chunk for chunk in all_relevant_chunks}.values())

            # Sort chunks by relevance
            all_relevant_chunks.sort(key=lambda c: (
                (content_id in c.get('content_refs', []) if c.get('content_refs') else False) * 3 +  # Highest priority: direct references
                (c['page'] == page_num) * 2 +                  # Second priority: same page
                (content_label in c['content']) * 1            # Third priority: mentions the content
            ), reverse=True)

            # Take the top chunks
            top_chunks = all_relevant_chunks[:10] if all_relevant_chunks else []  # Take up to 10 most relevant chunks

            # Join their text for an extended chunk
            if top_chunks:
                extended_text = f"Context for {content_label}: "
                extended_text += " ".join([c['content'] for c in top_chunks])

                # Create a special extended text chunk for this content
                self.text_chunks.append({
                    'id': f"extended_{content_id}",
                    'content': extended_text,
                    'page': page_num,
                    'content_refs': [content_id],
                    'topics': sum([c.get('topics', []) or [] for c in top_chunks], []),
                    'key_phrases': sum([c.get('key_phrases', []) or [] for c in top_chunks], []),
                    'is_extended': True
                })

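        # Content items with no related text produce no extended chunk, so count the chunks actually created
        extended_count = sum(1 for c in self.text_chunks if c.get('is_extended'))
        logger.info(f"Created {extended_count} extended context chunks for {len(self.content_registry)} content items")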

    def create_embeddings_and_index(self):
        """
        Create embeddings for text chunks and images and index them in vector DB
        """
        logger.info("Creating embeddings and indexing into vector database...")

        if not self.text_chunks:
            logger.warning("No text chunks found. Extract content first.")
            return

        # Create text embeddings
        logger.info(f"Creating embeddings for {len(self.text_chunks)} text chunks")

        for i, chunk in tqdm(enumerate(self.text_chunks), total=len(self.text_chunks)):
            # Create embedding
            embedding = self.text_embedder.encode(chunk['content'], convert_to_numpy=True)

            # Create TextItem
            text_item = TextItem(
                id=chunk['id'],
                content=chunk['content'],
                page_num=chunk['page'],
                content_refs=chunk['content_refs'],
                topics=chunk['topics'],
                key_phrases=chunk['key_phrases'],
                embedding=embedding
            )

            # Add to vector database
            self.vector_db.add_text(chunk['id'], text_item)

        logger.info("Created and indexed text embeddings")

        # Create image embeddings
        logger.info(f"Creating embeddings for {len(self.content_registry)} images/tables")

        # Images are not embedded from pixels. Each figure/table gets a text
        # embedding of a synthetic description that combines:
        # 1. The label/caption, content type, page number, and title
        # 2. Excerpts from the text chunks that reference it
        for content_id, content_info in tqdm(self.content_registry.items(), total=len(self.content_registry)):
            # Path to the extracted image file (stored on the item, not read here)
            img_path = content_info['path']

            # Extract caption and content type
            caption = content_info['label']
            content_type = content_info['type']
            title = content_info.get('title', '')

            # Get page text near the image to provide context
            page_num = content_info['page']
            page_texts = [chunk['content'] for chunk in self.text_chunks if chunk['page'] == page_num]
            page_text = " ".join(page_texts)

            # Create a descriptive text for embedding
            description = f"{caption} - {content_type} on page {page_num}. {title} "

            # Add referenced text if available
            referencing_chunks = [chunk for chunk in self.text_chunks if content_id in chunk.get('content_refs', [])]
            if referencing_chunks:
                # Add text that refers to this image/table
                ref_texts = [chunk['content'][:200] for chunk in referencing_chunks[:3]]  # First 200 chars of up to 3 chunks
                description += " Referenced in: " + " ".join(ref_texts)

            # Create embedding from description
            embedding = self.text_embedder.encode(description, convert_to_numpy=True)

            # Create ImageItem
            image_item = ImageItem(
                id=content_id,
                label=content_info['label'],
                type=content_info['type'],
                page=content_info['page'],
                path=content_info['path'],
                rel_bbox=content_info['rel_bbox'],
                abs_bbox=content_info['abs_bbox'],
                embedding=embedding
            )

            # Add to vector database
            self.vector_db.add_image(content_id, image_item)

        logger.info("Created and indexed image embeddings")

        # Save vector database if path provided
        if self.vector_db_path:
            self.vector_db.save(self.vector_db_path)

        return True

    def retrieve(self, query, top_k_texts=8, top_k_images=5):
        """
        Retrieve the most relevant text chunks and images for a query
        using vector similarity search and enhanced content-specific matching

        Args:
            query: User query string
            top_k_texts: Number of text chunks to retrieve
            top_k_images: Number of images to retrieve

        Returns:
            relevant_texts: List of relevant text chunks
            relevant_image_id: ID of most relevant image
        """
        logger.info(f"Retrieving content for query: {query}")

        # Process query with NLP
        query_doc = self.nlp(query)
        query_terms = [token.text.lower() for token in query_doc if not token.is_stop]

        # Create query embedding
        query_embedding = self.text_embedder.encode(query, convert_to_numpy=True)

        # Look for direct mentions of figures/tables in the query
        figure_table_mentions = re.findall(r'(figure|table)\s+(\d+(?:[-\.]\d+)?)', query.lower())

        # If a specific figure/table is mentioned, focus on that
        if figure_table_mentions:
            logger.info(f"Found direct mention of figure/table in query: {figure_table_mentions}")

            for mention_type, mention_num in figure_table_mentions:
                # Find the corresponding content
                for content_id, content_info in self.content_registry.items():
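                    label_lower = content_info["label"].lower()
                    # Match the number exactly so "1-1" does not also match "Figure 1-10" or "Figure 11-1"
                    number_pattern = rf'(?<![\d.-]){re.escape(mention_num)}(?![\d-]|\.\d)'
                    if mention_type in label_lower and re.search(number_pattern, label_lower):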
                        # Found the mentioned figure/table
                        mentioned_image_id = content_id

                        # Get the extended chunk for this content, if one was indexed
                        extended_id = f"extended_{content_id}"
                        extended_chunks = (
                            [self.vector_db.get_text(extended_id)]
                            if extended_id in self.vector_db.text_items else []
                        )

                        # Also get chunks directly referencing this content
                        ref_text_ids = [
                            text_id for text_id, text_item in self.vector_db.text_items.items()
                            if hasattr(text_item, 'content_refs') and text_item.content_refs and content_id in text_item.content_refs
                        ]
                        ref_chunks = [self.vector_db.get_text(text_id) for text_id in ref_text_ids]

                        # Combine extended and reference chunks
                        relevant_texts = list(filter(None, extended_chunks + ref_chunks))

                        # If we don't have enough, add some from vector search
                        if len(relevant_texts) < 3:
                            # Search vector database for texts
                            text_results = self.vector_db.search_texts(query_embedding, k=top_k_texts)
                            vector_texts = [self.vector_db.get_text(text_id) for text_id, _ in text_results]

                            # Add the vector search results
                            for text in vector_texts:
                                if text not in relevant_texts:
                                    relevant_texts.append(text)
                                if len(relevant_texts) >= top_k_texts:
                                    break

                        logger.info(f"Retrieved {len(relevant_texts)} text chunks for directly mentioned image #{mentioned_image_id}")
                        return relevant_texts, mentioned_image_id

        # If no specific figure/table is mentioned, use vector search
        # First search for images since we need them for response generation
        image_results = self.vector_db.search_images(query_embedding, k=top_k_images)

        if not image_results:
            logger.warning("No image results found, using default image")
            if self.content_registry:
                relevant_image_id = list(self.content_registry.keys())[0]
            else:
                relevant_image_id = 0
        else:
            # Get the top-scoring image
            relevant_image_id = image_results[0][0]

        # Now get text chunks, prioritizing those that reference the chosen image
        # First get the extended chunk for this image, if one was indexed
        extended_id = f"extended_{relevant_image_id}"
        extended_chunks = (
            [self.vector_db.get_text(extended_id)]
            if extended_id in self.vector_db.text_items else []
        )

        # Get chunks that directly reference this image
        ref_text_ids = [
            text_id for text_id, text_item in self.vector_db.text_items.items()
            if hasattr(text_item, 'content_refs') and text_item.content_refs and relevant_image_id in text_item.content_refs
        ]
        ref_chunks = [self.vector_db.get_text(text_id) for text_id in ref_text_ids]

        # Start with extended and reference chunks
        relevant_texts = list(filter(None, extended_chunks + ref_chunks))

        # If we don't have enough, add general vector search results
        if len(relevant_texts) < top_k_texts:
            # Regular vector search for texts
            text_results = self.vector_db.search_texts(query_embedding, k=top_k_texts*2)  # Get more to filter
            vector_texts = [self.vector_db.get_text(text_id) for text_id, _ in text_results]

            # Add the vector search results
            for text in vector_texts:
                if text and text not in relevant_texts:
                    relevant_texts.append(text)
                if len(relevant_texts) >= top_k_texts:
                    break

        logger.info(f"Retrieved {len(relevant_texts)} text chunks and image #{relevant_image_id}")

        return relevant_texts, relevant_image_id

    def generate_response(self, query, relevant_texts, image_id):
        """
        Generate a response using an LLM based on retrieved text and images
        that follows the reference output format

        Args:
            query: User query string
            relevant_texts: List of relevant text chunks
            image_id: ID of most relevant image

        Returns:
            response: Generated textual response
            image_id: ID of most relevant image
        """
        # Get image information
        image_info = self.content_registry.get(image_id, None)
        if not image_info:
            logger.warning(f"No image info found for image ID {image_id}")
            # Default to the first image if not found
            if self.content_registry:
                image_id = list(self.content_registry.keys())[0]
                image_info = self.content_registry.get(image_id)

        # Prepare context from relevant texts
        context = ""
        for i, text in enumerate(relevant_texts[:5]):  # Use up to 5 text chunks for better coverage
            if i > 0:
                context += "\n\n"
            context += text.content

        # Always start with "Based on [Figure/Table] X-X and the surrounding text,"
        if image_info:
            response_start = f"Based on {image_info['label']} and the surrounding text, "
        else:
            response_start = "Based on the economic data, "

        # Prepare the prompt for LLM
        prompt = f"""
        You are answering questions about an economics textbook. Your answer should directly quote text from the book.

        Question: {query}

        The most relevant image is {image_info['label'] if image_info else 'unknown'}.

        The surrounding text for this image includes:
        {context}

        Your response MUST:
        1. Start with exactly: "{response_start}"
        2. Directly include substantial quotes from the text surrounding the image/table
        3. Be detailed and specific (2-5 sentences minimum)
        4. Focus only on information directly from the provided text
        5. Not say things like "according to the text" or "the text states" - just include the actual quotes
        6. Similar to these examples:
           - "Based on Figure 1-1 and the surrounding text, This financial crisis quickly turned into a major economic crisis. The Crisis Table 1-1 gives you output growth rates for the world economy, for advanced economies and for other countries separately, since 2000."
           - "Based on Figure 2-5 and the surrounding text, People are classified as unemployed if they indicate that they are not working but are seeking work. The underground economy in Spain—defined as the number of people working without declaring it to the social security administration—accounted for between 10 and 15% of employment."

        Write your complete response:
        """

        # Try to use OpenAI for better response generation if API key available
        try:
            # Check if API key is set in environment
            api_key = os.environ.get("OPENAI_API_KEY")

            if api_key and hasattr(self, "openai_client"):
                # Call OpenAI API with specific instructions for the reference format
                response = self.openai_client.chat.completions.create(
                    model="gpt-3.5-turbo",
                    messages=[
                        {"role": "system", "content": "You are an economic expert that generates responses by directly quoting from text surrounding figures and tables."},
                        {"role": "user", "content": prompt}
                    ],
                    max_tokens=500,  # Allow for longer responses matching reference format
                    temperature=0.2  # Lower temperature for more deterministic responses
                )

                # Get response text
                response_text = response.choices[0].message.content.strip()
                logger.info("Generated response using OpenAI API")

                # Remove any quotation marks wrapped around the entire response first,
                # so a quoted answer that already has the prefix is not prefixed twice
                response_text = response_text.strip('"').strip()

                # Ensure response starts correctly
                if not response_text.startswith(response_start):
                    response_text = response_start + response_text

                return response_text, image_id

        except Exception as e:
            logger.warning(f"Could not use OpenAI for response generation: {str(e)}")
            logger.info("Falling back to local model or rule-based generation")

        # Try using a local Hugging Face model if available
        if self.llm_type == "local" and hasattr(self, "llm_pipeline"):
            try:
                # Using local Hugging Face model
                response = self.llm_pipeline(prompt, max_length=250, do_sample=False)
                response_text = response[0]["generated_text"].strip()
                logger.info("Generated response using local Hugging Face model")

                # Ensure response starts correctly
                if not response_text.startswith(response_start):
                    response_text = response_start + response_text

                return response_text, image_id

            except Exception as e:
                logger.warning(f"Local LLM error: {str(e)}")
                logger.info("Falling back to rule-based generation")

        # Fallback to rule-based generation with reference-style format
        logger.info("Using rule-based response generation with reference format")

        # Create a response in the reference format by directly quoting from text chunks
        if image_info:
            response_text = f"Based on {image_info['label']} and the surrounding text, "
        else:
            response_text = "Based on the economic data, "

        # Extract sentences that mention the figure/table or key terms
        relevant_sentences = []

        # Extract key terms from query
        query_doc = self.nlp(query)
        query_terms = [token.lemma_.lower() for token in query_doc
                      if not token.is_stop and token.pos_ in ['NOUN', 'PROPN', 'ADJ', 'VERB']]

        # Process each text chunk to extract sentences
        for text in relevant_texts:
            # Skip if text is None or has no content
            if not text or not hasattr(text, 'content') or not text.content:
                continue

            # Process the text with spaCy
            doc = self.nlp(text.content)

            # Extract sentences
            for sent in doc.sents:
                sent_text = sent.text.strip()
                if len(sent_text) < 10:  # Skip very short sentences
                    continue

                # Score this sentence for relevance
                score = 0

                # Mention of the figure/table
                if image_info and image_info['label'].lower() in sent_text.lower():
                    score += 10

                # Overlapping terms with query
                for term in query_terms:
                    if term in sent_text.lower():
                        score += 2

                # Position in the text (earlier sentences often provide context)
                if sent.start == 0:
                    score += 3

                # Length of sentence (prefer medium-length sentences)
                if 50 < len(sent_text) < 200:
                    score += 2

                if score > 0:
                    relevant_sentences.append((sent_text, score))

        # Sort by relevance and take the top sentences
        relevant_sentences.sort(key=lambda x: x[1], reverse=True)
        top_sentences = [s[0] for s in relevant_sentences[:5]]  # Take up to 5 sentences

        # Construct the response
        if top_sentences:
            response_text += " ".join(top_sentences)
        else:
            # Fallback if no good sentences found
            response_text += "The document provides information on economic indicators and their relationships in the context of growth, inflation, and unemployment rates."

        return response_text, image_id

    def evaluate_with_bertscore(self, generated_responses, reference_responses):
        """
        Evaluate the quality of generated responses using BERTScore

        Args:
            generated_responses: List of generated responses
            reference_responses: List of reference responses

        Returns:
            metrics: Dictionary of evaluation metrics
        """
        if not hasattr(self, "bert_scorer"):
            logger.warning("BERTScorer not initialized. Installing and initializing...")
            install_package("bert-score")
            from bert_score import BERTScorer
            self.bert_scorer = BERTScorer(lang="en", rescale_with_baseline=True)

        # Calculate BERTScore
        precision, recall, f1 = self.bert_scorer.score(generated_responses, reference_responses)

        # Average the per-example score tensors into Python floats
        p_mean = precision.mean().item()
        r_mean = recall.mean().item()
        f1_mean = f1.mean().item()

        logger.info(f"BERTScore Evaluation: P={p_mean:.4f}, R={r_mean:.4f}, F1={f1_mean:.4f}")

        return {
            "bertscore_precision": p_mean,
            "bertscore_recall": r_mean,
            "bertscore_f1": f1_mean
        }

    def process_query(self, query):
        """
        Process a user query end-to-end: retrieve, generate, and return response

        Args:
            query: User query string

        Returns:
            response: Generated textual response
            image_id: ID of most relevant image
        """
        if not self.vector_db.text_counter:
            logger.warning("Vector database is empty. Loading or creating content...")

            # Check if vector DB exists and can be loaded
            if self.vector_db_path and os.path.exists(os.path.join(self.vector_db_path, "text_index.faiss")):
                self.vector_db.load(self.vector_db_path)
            else:
                # Extract and index content
                self.extract_content()
                self.create_embeddings_and_index()

        # Retrieve relevant content with improved retrieval
        relevant_texts, image_id = self.retrieve(query, top_k_texts=8)

        # Generate response with reference format
        response, image_id = self.generate_response(query, relevant_texts, image_id)

        return response, image_id

    def evaluate(self, questions_csv, ground_truth=None):
        """
        Evaluate the system on a set of questions

        Args:
            questions_csv: Path to CSV with questions
            ground_truth: Optional path to ground truth answers

        Returns:
            metrics: Dictionary of evaluation metrics
            results_df: DataFrame with results
        """
        # Load questions
        questions_df = pd.read_csv(questions_csv)

        results = []
        generated_responses = []

        # Process each question - check column names and adapt accordingly
        # Check if 'Text' column exists, if not look for alternate column names
        text_column = 'Text'
        if 'Text' not in questions_df.columns:
            # Try common alternative column names
            if 'Question' in questions_df.columns:
                text_column = 'Question'
            elif 'Query' in questions_df.columns:
                text_column = 'Query'
            elif 'text' in questions_df.columns:
                text_column = 'text'
            else:
                # No suitable column found: log the available column names for debugging
                logger.error(f"Column 'Text' not found. Available columns: {questions_df.columns.tolist()}")
                raise KeyError("Column 'Text' not found in questions CSV")

        # Process each question
        for row_idx, row in tqdm(questions_df.iterrows(), total=len(questions_df), desc="Processing queries"):
            # Fall back to the DataFrame row label when there is no 'ID' column
            question_id = row['ID'] if 'ID' in row else row_idx
            question = row[text_column]

            # Process the query
            response, image_id = self.process_query(question)
            generated_responses.append(response)

            # Store result
            results.append({
                'ID': question_id,
                'Text': response,
                'Image': image_id
            })

        # Create results DataFrame
        results_df = pd.DataFrame(results)

        # Calculate metrics if ground truth is available
        metrics = {}
        if ground_truth:
            gt_df = pd.read_csv(ground_truth)

            # Check column names in ground truth
            reference_column = None
            for col in ['Text', 'Response', 'Answer', 'text', 'response', 'answer']:
                if col in gt_df.columns:
                    reference_column = col
                    break

            if reference_column:
                # Evaluate image retrieval accuracy
                if 'Image' in gt_df.columns:
                    correct_images = (results_df['Image'] == gt_df['Image']).mean()
                    metrics['image_accuracy'] = correct_images

                # Evaluate text generation with BERTScore
                reference_responses = gt_df[reference_column].tolist()
                bertscore_metrics = self.evaluate_with_bertscore(generated_responses, reference_responses)
                metrics.update(bertscore_metrics)

        return metrics, results_df

    def generate_submission_csv(self, questions_csv, output_file="submission.csv"):
        """
        Generate submission CSV for Kaggle leaderboard

        Args:
            questions_csv: Path to CSV with questions
            output_file: Output CSV filename

        Returns:
            csv_path: Path to the generated CSV
        """
        # Evaluate and get results
        _, results_df = self.evaluate(questions_csv)

        # Save CSV
        csv_path = os.path.join(self.output_dir, output_file)
        results_df.to_csv(csv_path, index=False)

        logger.info(f"Generated submission CSV at {csv_path}")

        return csv_path

    def run_full_pipeline(self, questions_csv):
        """
        Run the complete RAG pipeline

        Args:
            questions_csv: Path to CSV with questions

        Returns:
            Dictionary with pipeline results
        """
        try:
            # Extract content if not already done
            if not self.content_registry:
                logger.info("Starting content extraction...")
                self.extract_content()

            # Create embeddings and index if not already done
            if not self.vector_db.text_counter:
                logger.info("Creating embeddings and indexing content...")
                self.create_embeddings_and_index()

            # Generate submission
            logger.info("Generating submission...")
            csv_path = self.generate_submission_csv(questions_csv)

            logger.info(f"Pipeline complete! Submission saved to {csv_path}")

            return {
                "content": self.content_registry,
                "submission": csv_path
            }
        except Exception as e:
            logger.error(f"Pipeline failed with error: {str(e)}")
            # Print full traceback for debugging
            import traceback
            logger.error(traceback.format_exc())
            raise

    def _check_dependencies(self):
        """
        Check and install required dependencies if missing
        """
        required_packages = [
            "pandas",
            "numpy",
            "torch",
            "PyMuPDF",
            "pdfplumber",
            "sentence-transformers",
            "spacy",
            "scikit-learn",
            "faiss-cpu",
            "bert-score",
            "transformers",
            "openai"
        ]

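        # pip distribution names whose import name differs from the default
        # hyphen-to-underscore conversion
        import_names = {
            "PyMuPDF": "fitz",
            "scikit-learn": "sklearn",
            "faiss-cpu": "faiss",
        }

        for package in required_packages:
            try:
                if package == "openai":
                    # Try to import using the new API format
                    from openai import OpenAI
                else:
                    # Default: replace hyphens with underscores for import
                    importlib.import_module(import_names.get(package, package.replace('-', '_')))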
            except ImportError:
                logger.warning(f"Package {package} not found. Installing...")
                if package == "openai":
                    # Make sure to get the latest version
                    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "--upgrade", package])
                else:
                    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])
                logger.info(f"Package {package} installed successfully.")
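

# ---------------------------------------------------------------------------
# Illustrative sketch, not part of the original pipeline: one way to find out
# which dependencies are importable without importing or installing anything.
# pip distribution names can differ from import names (PyMuPDF -> fitz,
# scikit-learn -> sklearn, faiss-cpu -> faiss), so the lookup takes both.
# The function name and the example mapping are assumptions made for this
# sketch only.
# ---------------------------------------------------------------------------
import importlib.util


def find_missing_packages(pip_to_module):
    """Return the pip names whose top-level import module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in pip_to_module.items()
        if importlib.util.find_spec(module_name) is None
    ]


# Example usage (hypothetical mapping):
#   find_missing_packages({"PyMuPDF": "fitz", "scikit-learn": "sklearn", "faiss-cpu": "faiss"})
# returns the subset of those pip names that would still need installing.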


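# ---------------------------------------------------------------------------
# Illustrative sketch, not part of the original pipeline: the additive
# priority score that _create_extended_chunks_for_content sorts by, pulled
# out as a standalone function so it can be checked on toy data.
# The function name, its parameters, and the sample chunks are hypothetical.
# Because the weights are summed, a same-page chunk that mentions the label
# (2 + 1) ties with an off-page direct reference (3); ties keep input order.
# ---------------------------------------------------------------------------
def rank_chunks_for_content(chunks, content_id, content_label, page_num, limit=10):
    """Return up to `limit` chunks, highest priority first."""
    def priority(chunk):
        refs = chunk.get('content_refs') or []
        return (
            3 * (content_id in refs)                     # direct reference
            + 2 * (chunk['page'] == page_num)            # same page
            + 1 * (content_label in chunk['content'])    # mentions the label
        )

    return sorted(chunks, key=priority, reverse=True)[:limit]


# Example usage (toy data):
#   chunks = [
#       {'id': 'a', 'page': 7, 'content': 'Unrelated text.', 'content_refs': []},
#       {'id': 'b', 'page': 5, 'content': 'See Figure 1-1 for details.', 'content_refs': ['fig1']},
#       {'id': 'c', 'page': 5, 'content': 'Same page, no mention.', 'content_refs': []},
#       {'id': 'd', 'page': 6, 'content': 'Figure 1-1 again.', 'content_refs': []},
#   ]
#   [c['id'] for c in rank_chunks_for_content(chunks, 'fig1', 'Figure 1-1', 5)]
#   # -> ['b', 'c', 'd', 'a']

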
def main():
    """
    Main function to run the improved MultiModal RAG system
    """
    # ========== CONFIGURE YOUR PATHS HERE ==========
    # Simply change these variables to set all paths in one place
    PDF_PATH = "document.pdf"          # Path to your PDF document
    QUESTIONS_CSV = "Lab_2_Part_1_Questions.csv"    # Path to your questions CSV file
    OUTPUT_DIR = "reference_output"    # Directory to store output files
    VECTOR_DB_PATH = os.path.join(OUTPUT_DIR, "vector_db")  # Path to store/load vector database

    # ===== CONFIGURE YOUR OPENAI API KEY HERE =====
    # Read the key from the environment (set OPENAI_API_KEY before running);
    # never hard-code or commit a real key in the notebook
    OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
    # ==============================================

    # Determine if GPU is available
    if torch.cuda.is_available():
        logger.info("GPU is available! Using GPU for faster processing.")
    else:
        logger.info("GPU not available. Using CPU for processing.")

    # Create and configure the Enhanced MultiModal RAG system with the updated code
    print("Initializing Enhanced MultiModal RAG system with:")
    print(f"  - PDF: {PDF_PATH}")
    print(f"  - Questions: {QUESTIONS_CSV}")
    print(f"  - Output: {OUTPUT_DIR}")
    print(f"  - Vector DB: {VECTOR_DB_PATH}")
    if OPENAI_API_KEY:
        print("  - Using OpenAI API for reference-style response generation")

    # Create the system with the implementation
    rag_system = EnhancedMultiModalRAGSystem(
        pdf_path=PDF_PATH,
        output_dir=OUTPUT_DIR,
        vector_db_path=VECTOR_DB_PATH,
        openai_api_key=OPENAI_API_KEY
    )

    # Check if files exist
    if not os.path.exists(PDF_PATH):
        print(f"Error: PDF file not found at {PDF_PATH}")
        print("Please update the PDF_PATH variable.")
        sys.exit(1)

    if not os.path.exists(QUESTIONS_CSV):
        print(f"Error: Questions CSV file not found at {QUESTIONS_CSV}")
        print("Please update the QUESTIONS_CSV variable.")
        sys.exit(1)

    # Print CSV headers for debugging
    try:
        df = pd.read_csv(QUESTIONS_CSV)
        print(f"Questions CSV loaded successfully. Columns: {df.columns.tolist()}")
        print(f"First few rows: \n{df.head()}")
    except Exception as e:
        print(f"Error reading questions CSV: {str(e)}")
        print("Make sure the CSV is properly formatted.")

    # Check if vector database already exists and can be loaded
    if os.path.exists(os.path.join(VECTOR_DB_PATH, "text_index.faiss")):
        print("Found existing vector database. Loading...")
        rag_system.vector_db.load(VECTOR_DB_PATH)
        print(f"Loaded vector database with {rag_system.vector_db.text_counter} text items and {rag_system.vector_db.image_counter} images.")

        # If content registry is empty but vector DB is loaded, extract content without re-indexing
        if not rag_system.content_registry:
            print("Loading content registry...")
            rag_system.extract_content()
    else:
        # Run full pipeline to extract, index, and process
        print("Running full pipeline to extract and index content...")
        result = rag_system.run_full_pipeline(QUESTIONS_CSV)

        print(f"\nExtracted {len(rag_system.content_registry)} content items")
        print(f"Submission CSV: {result['submission']}")

    # Test a sample query with the reference-style response format
    query = "What is the relationship between GDP growth and unemployment?"
    print(f"\nTesting query: {query}")
    response, image_id = rag_system.process_query(query)

    print("\nExample Query Processing with Reference-Style Format:")
    print(f"Query: {query}")
    print(f"Response: {response}")
    print(f"Image ID: {image_id}")

    # Allow interactive testing of the system
    print("\n\nInteractive mode: Type your economic queries (or 'exit' to quit)")
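    while True:
        try:
            user_query = input("\nEnter your economic query: ").strip()
        except EOFError:
            # stdin is closed (e.g. a non-interactive run), so leave the loop cleanly
            break

        if user_query.lower() in ('exit', 'quit', 'q'):
            break
        if not user_query:
            # Skip blank input rather than sending an empty query through the pipeline
            continue

        try:
            response, image_id = rag_system.process_query(user_query)
            print(f"\nResponse: {response}")
            print(f"Image ID: {image_id}")

            # Show the label and file path of the matched image, if it is registered
            if image_id in rag_system.content_registry:
                image_info = rag_system.content_registry[image_id]
                print(f"Image: {image_info['label']} ({image_info['path']})")

        except Exception as e:
            print(f"Error processing query: {e}")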


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\nProcess interrupted by user.")
    except Exception as e:
        print(f"\nAn error occurred: {str(e)}")
        print("\nStacktrace:")
        import traceback
        traceback.print_exc()



Initializing Enhanced MultiModal RAG system with:
  - PDF: document.pdf
  - Questions: Lab_2_Part_1_Questions.csv
  - Output: reference_output
  - Vector DB: reference_output/vector_db
  - Using OpenAI API for reference-style response generation


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Questions CSV loaded successfully. Columns: ['ID', 'Question']
First few rows: 
   ID                                           Question
0   1  What sparked the global economic crisis around...
1   2  Why should we worry about unemployment rates g...
2   3  How do economists measure economic growth with...
3   4  How bad did the world economy get hit during t...
4   5  What happened to U.S. unemployment after the 2...
Running full pipeline to extract and index content...


100%|██████████| 54/54 [00:02<00:00, 20.04it/s]
100%|██████████| 17/17 [00:00<00:00, 22.63it/s]
Processing queries: 100%|██████████| 11/11 [00:20<00:00,  1.86s/it]



Extracted 17 content items
Submission CSV: reference_output/submission.csv

Testing query: What is the relationship between GDP growth and unemployment?

Example Query Processing with Reference-Style Format:
Query: What is the relationship between GDP growth and unemployment?
Response: Based on Figure 1-2 and the surrounding text, it is evident that there is a direct relationship between GDP growth and unemployment rates. The text states, "What is behind this persistently high unemployment is low output growth," highlighting the impact of economic growth on employment levels. Additionally, the text mentions, "Higher output growth leads to a decrease in unemployment," emphasizing the inverse correlation between these two economic indicators. Furthermore, the text explains, "The key to decreasing unemployment is a high enough rate of growth," underscoring the importance of robust economic growth in reducing unemployment rates.
Image ID: 3


Interactive mode: Type your economic queries (