# PDF Ingestion, Chunking, and Embedding

This notebook demonstrates the complete pipeline for processing PDF documents:
1. **PDF Ingestion**: Extract text from PDF files with robust error handling
2. **Text Cleaning**: Normalize text, remove headers/footers, fix hyphenation
3. **Sophisticated Chunking**: Detect sections, split into sentences, pack into token-aware chunks
4. **Embedding**: Generate embeddings for all chunks using EmbeddingGemma
5. **Index Building**: Create searchable indexes for efficient retrieval

This workflow prepares PDF documents for question generation and fine-tuning.


In [1]:
# Import all necessary functions from src modules
# All functions are defined in src/, not in notebooks
from src.data.pdf_to_chunks_pipeline import pdf_to_chunks
from src.pipelines.embed_chunks import embed_chunks, build_faiss_index, build_sklearn_index
import pandas as pd
import numpy as np




## Step 1: PDF to Chunks

The first step is to extract text from a PDF file and chunk it into manageable pieces. The `pdf_to_chunks()` function handles:
- **PDF Extraction**: Uses pdfplumber (primary) with pypdf fallback
- **Text Cleaning**: Normalizes whitespace, fixes hyphenation, removes headers/footers
- **Section Detection**: Identifies headings and groups content into sections
- **Token-Aware Chunking**: Packs sentences into chunks respecting token limits with overlap
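
The cleaning steps listed above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code inside `pdf_to_chunks()`: the helper names `fix_hyphenation`, `normalize_whitespace`, and `strip_repeated_lines` are hypothetical, and the header/footer heuristic (drop a first or last line that recurs on most pages) is one simple choice among many.

```python
import re
from collections import Counter


def fix_hyphenation(text):
    """Join a word split by a hyphen at a line break ('implemen-\\ntation')."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def normalize_whitespace(text):
    """Collapse runs of spaces/tabs and allow at most one blank line in a row."""
    text = re.sub(r"[ \t]+", " ", text)      # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)     # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)   # 3+ newlines -> one blank line
    return text.strip()


def strip_repeated_lines(pages, min_fraction=0.6):
    """Remove a page's first/last line when it recurs on most pages."""
    page_lines = [
        [ln.strip() for ln in page.splitlines() if ln.strip()] for page in pages
    ]
    counts = Counter()
    for lines in page_lines:
        if lines:
            # A set avoids double counting on one-line pages
            counts.update({lines[0], lines[-1]})
    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in page_lines:
        if lines and lines[0] in repeated:
            lines = lines[1:]
        if lines and lines[-1] in repeated:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned
```

Two limitations are worth keeping in mind: the hyphenation rule also joins genuine compounds that happen to break at a line end (`well-` / `known`), and page numbers survive the header/footer pass because they differ from page to page.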

### Chunking Parameters

- `max_tokens`: Maximum tokens per chunk (default: 512, chosen to match the `max_length=512` used when embedding in Step 4)
- `overlap_tokens`: Number of tokens to overlap between chunks (default: 64, prevents context loss at boundaries)
- `min_tokens`: Minimum tokens per chunk (default: 128, filters out tiny fragments)
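
To see how the three parameters interact, here is a minimal sketch of greedy sentence packing with overlap. It is not the implementation behind `pdf_to_chunks()`: the function name `pack_sentences` is hypothetical, and tokens are approximated by whitespace-separated words, whereas the real pipeline presumably counts tokens with the embedding model's tokenizer.

```python
def pack_sentences(sentences, max_tokens=512, overlap_tokens=64, min_tokens=128,
                   count_tokens=lambda s: len(s.split())):
    """Greedily pack sentences into chunks of at most max_tokens tokens.

    Each new chunk starts with the trailing sentences of the previous chunk,
    up to overlap_tokens. Chunks shorter than min_tokens are dropped.
    A single sentence longer than max_tokens is kept whole (not split).
    """
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and current_len + n > max_tokens:
            chunks.append(current)
            # Carry over trailing sentences that fit in the overlap budget
            overlap, overlap_len = [], 0
            for prev in reversed(current):
                m = count_tokens(prev)
                if overlap_len + m > overlap_tokens:
                    break
                overlap.insert(0, prev)
                overlap_len += m
            # Skip the overlap if it would push the new chunk over the limit
            if overlap_len + n > max_tokens:
                overlap, overlap_len = [], 0
            current, current_len = overlap, overlap_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(current)
    return [
        " ".join(chunk) for chunk in chunks
        if sum(count_tokens(s) for s in chunk) >= min_tokens
    ]


# Ten 4-token sentences, with small limits so the behaviour is easy to follow
demo_sentences = [f"s{i} a b c" for i in range(10)]
demo_chunks = pack_sentences(demo_sentences, max_tokens=12,
                             overlap_tokens=4, min_tokens=8)
```

With these toy limits, each chunk holds at most three sentences and repeats the last sentence of its predecessor, which is the boundary context that `overlap_tokens` is meant to preserve.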


In [2]:
# Path to the PDF to process; replace with your own file
pdf_path = "data/mobydick.pdf"

# Process PDF into chunks
# This function saves chunks to a timestamped parquet file automatically
chunks_df = pdf_to_chunks(
    pdf_path=pdf_path,
    doc_id=None,  # Will use PDF filename as doc_id
    max_tokens=512,
    overlap_tokens=64,
    min_tokens=128
)

print(f"Generated {len(chunks_df)} chunks")
print(f"\nChunk DataFrame columns: {list(chunks_df.columns)}")
print("\nFirst few chunks:")
chunks_df.head()




TypeError: Object of type int64 is not JSON serializable
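
The traceback above is cut off, so the line that raised the error is not shown. One common cause, offered here only as an assumption about this pipeline, is passing NumPy integer scalars (for example, values read from a pandas column) to `json.dumps`, which only accepts built-in Python types. A minimal sketch of a workaround, using a hypothetical `to_builtin` hook:

```python
import json

import numpy as np


def to_builtin(value):
    """`default` hook for json.dumps: convert NumPy scalars/arrays to Python types."""
    if isinstance(value, np.generic):
        return value.item()      # np.int64 -> int, np.float32 -> float, ...
    if isinstance(value, np.ndarray):
        return value.tolist()
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


# Hypothetical metadata record with NumPy integers, as pandas would produce
record = {"page_start": np.int64(3), "token_count": np.int64(412)}
serialized = json.dumps(record, default=to_builtin)
```

If the error does come from metadata written inside `pdf_to_chunks()`, the fix belongs in the `src/` module rather than in this notebook.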

## Step 2: Explore Chunk Statistics

Let's examine the characteristics of our chunks to understand the document structure.


In [None]:
# Display chunk statistics
print("Chunk Statistics:")
print("=" * 60)
print(f"Total chunks: {len(chunks_df)}")
print(f"Average tokens per chunk: {chunks_df['token_count'].mean():.1f}")
print(f"Min tokens: {chunks_df['token_count'].min()}")
print(f"Median tokens: {chunks_df['token_count'].median():.1f}")
print(f"Max tokens: {chunks_df['token_count'].max()}")
print(f"\nUnique sections: {chunks_df['section_id'].nunique()}")
print(f"Pages covered: {chunks_df['page_start'].min()} to {chunks_df['page_end'].max()}")

# Show distribution of chunk sizes
print("\nToken count distribution:")
print(chunks_df['token_count'].describe())


## Step 3: View Example Chunks

Let's examine a few example chunks to understand their structure and content.


In [None]:
# Display a few example chunks with their metadata
for idx in range(min(3, len(chunks_df))):
    chunk = chunks_df.iloc[idx]
    print(f"\n{'='*60}")
    print(f"Chunk {idx + 1}: {chunk['chunk_id']}")
    print(f"Section: {chunk['section_title']}")
    print(f"Pages: {chunk['page_start']}-{chunk['page_end']}")
    print(f"Tokens: {chunk['token_count']}, Characters: {chunk['char_count']}")
    print("\nText preview (first 200 chars):")
    print(chunk['text'][:200] + "...")


## Step 4: Embed Chunks

Now we'll generate embeddings for all chunks using EmbeddingGemma. This step:
- Processes chunks in batches for efficiency
- Uses the existing `embed_texts()` function with max_length=512
- Saves embeddings to timestamped .npy file
- Saves metadata to timestamped parquet file

### Embedding Parameters

- `batch_size`: Number of chunks to process at once (default: 64, adjust based on GPU memory)
- `max_length`: Maximum sequence length (should match chunk max_tokens)
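
Batching itself is simple to picture. The sketch below is not the code inside `embed_chunks()`: `embed_in_batches` is a hypothetical helper, and `fake_embed` is a stand-in for the real model so the example runs without a GPU.

```python
import numpy as np


def embed_in_batches(texts, embed_fn, batch_size=64):
    """Call embed_fn on successive slices of texts and stack the results."""
    parts = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        parts.append(np.asarray(embed_fn(batch), dtype=np.float32))
    if not parts:
        return np.empty((0, 0), dtype=np.float32)
    return np.vstack(parts)


def fake_embed(batch, dim=8):
    """Stand-in embedder (hypothetical): maps each text to a constant vector."""
    return [[float(len(text))] * dim for text in batch]


# 150 texts with batch_size=64 -> batches of 64, 64, and 22
demo_texts = ["a" * i for i in range(1, 151)]
demo_embeddings = embed_in_batches(demo_texts, fake_embed, batch_size=64)
```

A smaller `batch_size` lowers peak GPU memory at the cost of more forward passes; the stacked result keeps one row per chunk, in the original order.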


In [None]:
# Generate embeddings for all chunks
# This function saves embeddings automatically to timestamped files
embeddings_array, embeddings_metadata_df = embed_chunks(
    chunks_df,
    batch_size=64,  # Adjust based on GPU memory
    max_length=512   # Should match chunk max_tokens
)

print(f"Embeddings shape: {embeddings_array.shape}")
print(f"Embedding dimension: {embeddings_array.shape[1]}")
print("\nEmbeddings metadata:")
print(embeddings_metadata_df.head())


## Step 5: Build Search Index (Optional)

For efficient similarity search, we can build a FAISS or scikit-learn index. This allows us to quickly find the most similar chunks to a query.

### Index Types

- **FAISS (Flat)**: Exact nearest neighbor search by comparing the query against every vector; simple and fast enough for small to medium datasets
- **FAISS (IVF)**: Approximate search with clustering; faster than Flat on large datasets, at some cost in recall
- **scikit-learn NearestNeighbors**: Lightweight alternative with no dependencies beyond scikit-learn
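
Conceptually, an exact (flat) index scores the query against every stored vector and keeps the best `k`. The NumPy sketch below illustrates that idea only; it does not use FAISS or scikit-learn, and the names `l2_normalize` and `exact_search` are hypothetical.

```python
import numpy as np


def l2_normalize(vectors):
    """Scale each row to unit length (guarding against zero vectors)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def exact_search(query, corpus, k=5):
    """Exhaustive cosine search: score every corpus vector, return the top k."""
    q = l2_normalize(np.atleast_2d(np.asarray(query, dtype=np.float64)))
    c = l2_normalize(np.asarray(corpus, dtype=np.float64))
    scores = (q @ c.T)[0]        # inner product of unit vectors = cosine
    k = min(k, len(scores))
    top = np.argsort(-scores)[:k]
    return top, scores[top]


demo_corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
demo_idx, demo_scores = exact_search([2.0, 0.1], demo_corpus, k=2)
```

The cost grows linearly with the number of chunks, which is why approximate structures such as IVF become attractive for large collections.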


In [None]:
# Try to build a FAISS index (requires faiss-cpu or faiss-gpu)
faiss_index = build_faiss_index(embeddings_array, index_type="flat")
sklearn_index = None  # Defined up front so later cells can check either index

if faiss_index is not None:
    print("Built FAISS index successfully")
else:
    # Fall back to scikit-learn when FAISS is not available
    print("FAISS not available, trying scikit-learn...")
    sklearn_index = build_sklearn_index(embeddings_array, n_neighbors=10)
    if sklearn_index is not None:
        print("Built scikit-learn index successfully")
    else:
        print("Neither FAISS nor scikit-learn available. Install with: pip install faiss-cpu")


## Step 6: Nearest Neighbor Search Example

Let's demonstrate how to use the index to find similar chunks. This is useful for:
- **Retrieval**: Finding relevant chunks for a query
- **Hard Negative Mining**: Finding similar but non-matching chunks for contrastive learning
- **Exploration**: Understanding document structure and relationships
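
Hard negative mining can be sketched as follows: for each query, rank all chunks by similarity, exclude the known positive, and keep the top-ranked remainder. This is an illustration under assumptions, not part of the `src/` modules; `mine_hard_negatives` and its arguments are hypothetical names.

```python
import numpy as np


def mine_hard_negatives(query_vecs, chunk_vecs, positive_idx, n_negatives=3):
    """For each query, return indices of the most similar non-positive chunks."""
    q = np.asarray(query_vecs, dtype=np.float64)
    c = np.asarray(chunk_vecs, dtype=np.float64)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    sims = q @ c.T                                   # cosine similarity matrix
    # Mask each query's positive chunk so it can never be picked as a negative
    sims[np.arange(len(q)), np.asarray(positive_idx)] = -np.inf
    order = np.argsort(-sims, axis=1)                # most similar first
    return order[:, :n_negatives]


demo_chunk_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
demo_query_vecs = [[1.0, 0.05]]
demo_negatives = mine_hard_negatives(demo_query_vecs, demo_chunk_vecs,
                                     positive_idx=[0], n_negatives=2)
```

Chunks that score very close to the positive may in fact answer the query too, so mined negatives are worth spot-checking before they are used for contrastive training.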


In [None]:
# Example: Find chunks similar to a query
from sklearn.metrics.pairwise import cosine_similarity

from src.models.embedding_pipeline import load_embeddinggemma_model, embed_texts

# Load model for query embedding
tokenizer, model = load_embeddinggemma_model()
device = next(model.parameters()).device

# Example query
query_text = "What is the main topic of this document?"

# Embed the query; move the tensor to CPU before converting, since
# .numpy() raises an error on a CUDA tensor
query_embedding = embed_texts(
    query_text,
    model,
    tokenizer,
    device=device,
    max_length=512
).detach().cpu().numpy()

# cosine_similarity expects 2D input, so force shape (1, dim)
query_embedding = query_embedding.reshape(1, -1)

# Brute-force cosine similarity against every chunk embedding
# (this cell does not use the index built in Step 5)
similarities = cosine_similarity(query_embedding, embeddings_array)[0]

# Get top 5 most similar chunks
top_k = 5
top_indices = np.argsort(similarities)[::-1][:top_k]

print(f"Top {top_k} most similar chunks to query: '{query_text}'")
print("=" * 60)
for i, idx in enumerate(top_indices):
    chunk = chunks_df.iloc[idx]
    print(f"\nRank {i+1} (similarity: {similarities[idx]:.3f}):")
    print(f"  Section: {chunk['section_title']}")
    print(f"  Text preview: {chunk['text'][:150]}...")


## Summary

This notebook demonstrated:
1. ✅ PDF ingestion and text extraction
2. ✅ Text cleaning and normalization
3. ✅ Sophisticated chunking with section detection
4. ✅ Embedding generation for all chunks
5. ✅ Index building for efficient retrieval
6. ✅ Nearest neighbor search example

**Next Steps:**
- Proceed to notebook `10_Generate_Questions_Build_Dataset.ipynb` to generate questions and build the training dataset
- The chunks and embeddings are saved to timestamped files in `data/processed/` and `outputs/embeddings/`
