# Notebook 01: Document Ingestion and Embedding

This notebook demonstrates:
1. Loading a text document (supports both .txt and .pdf files)
2. Chunking the document using configurable strategies
3. Embedding all chunks using Ollama with Gemma models
4. Tracking tokens and timing for all operations
5. Saving results for later analysis

## Setup

First, we'll import the necessary modules and set up our configuration.


In [1]:
# Import standard library modules
import sys
from pathlib import Path

# Add the project root to sys.path so the src package can be imported.
# This assumes the notebook is run from the notebooks/ directory.
project_root = Path().resolve().parent  # One level up from notebooks/
sys.path.insert(0, str(project_root))

# Import our custom modules
from src.config import Config
from src.pipeline import load_document, chunk_text, embed_chunks, save_chunks, save_metrics
from src.timing_metrics import MetricsStore
from src.token_accounting import count_tokens

print("Modules imported successfully!")


Modules imported successfully!


## Configuration

Create a configuration object with our settings. You can modify these values to experiment with different models, chunk sizes, or strategies.


In [2]:
# Create configuration object
# This holds all our settings: model names, chunk sizes, file paths, etc.
config = Config(
    embedding_model="embeddinggemma",      # Gemma model for embeddings
    generation_model="gemma3:1b",          # Gemma model for text generation (not used in this notebook)
    chunk_size_tokens=512,                  # Target chunk size in tokens
    chunk_overlap_tokens=50,                # Overlap between chunks (helps maintain context)
    chunking_strategy="fixed_token_window", # Strategy: 'fixed_token_window', 'paragraph_based', or 'both'
    ollama_endpoint="http://localhost:11434" # Where Ollama is running
)

print("Configuration created:")
print(f"  Embedding model: {config.embedding_model}")
print(f"  Chunk size: {config.chunk_size_tokens} tokens")
print(f"  Chunk overlap: {config.chunk_overlap_tokens} tokens")
print(f"  Chunking strategy: {config.chunking_strategy}")
print(f"  Data directory: {config.data_dir}")
print(f"  Results directory: {config.results_dir}")


Configuration created:
  Embedding model: embeddinggemma
  Chunk size: 512 tokens
  Chunk overlap: 50 tokens
  Chunking strategy: fixed_token_window
  Data directory: data
  Results directory: results
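
For orientation, a configuration object like the one above could be modeled as a dataclass. The sketch below is not the project's `src.config.Config`: the class name `SketchConfig`, the output file name, and the validation rule are assumptions made for illustration. Only the field names and default values mirror what the cell above prints.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SketchConfig:
    """Hypothetical stand-in for src.config.Config (not the real class)."""
    embedding_model: str = "embeddinggemma"
    chunk_size_tokens: int = 512
    chunk_overlap_tokens: int = 50
    chunking_strategy: str = "fixed_token_window"
    ollama_endpoint: str = "http://localhost:11434"
    data_dir: Path = Path("data")
    results_dir: Path = Path("results")

    def __post_init__(self):
        # Assumed rule: the overlap must be smaller than the window,
        # otherwise a sliding window would never advance.
        if not 0 <= self.chunk_overlap_tokens < self.chunk_size_tokens:
            raise ValueError("chunk_overlap_tokens must be in [0, chunk_size_tokens)")

    def get_chunks_path(self) -> Path:
        # "chunks.json" is an invented file name for this sketch.
        return self.results_dir / "chunks.json"


cfg = SketchConfig(chunk_size_tokens=256, chunk_overlap_tokens=32)
print(cfg.get_chunks_path())
```

Validating the overlap at construction time surfaces a bad setting before any chunking or embedding work starts.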


## Load Document

Load the source document with `load_document`, which accepts both .txt and .pdf files. This example uses MobyDick.txt.


In [3]:
# Path to the document file
# MobyDick.txt is in the project root directory
document_path = project_root / "MobyDick.txt"

# Load the document
# This function automatically detects .txt or .pdf and extracts text accordingly
print(f"Loading document from: {document_path}")
text = load_document(document_path)

# Display some statistics about the loaded document
total_chars = len(text)
total_tokens = count_tokens(text)
total_lines = text.count('\n')

print("\nDocument loaded successfully!")
print(f"  Total characters: {total_chars:,}")
print(f"  Total tokens: {total_tokens:,}")
print(f"  Total lines: {total_lines:,}")
print("\nFirst 500 characters:")
print(text[:500] + "...")


Loading document from: /home/goble54/spark-dev-workspace/GenerativeAI-Cost-Estimator/MobyDick.txt

Document loaded successfully!
  Total characters: 1,238,244
  Total tokens: 311,716
  Total lines: 22,310

First 500 characters:
﻿The Project Gutenberg eBook of Moby Dick; Or, The Whale
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eB...
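
The preview above begins with an invisible byte order mark (U+FEFF) carried over from the file. A loader can drop it by decoding with `utf-8-sig`. The sketch below is an assumption about how a `.txt` branch could look, not the project's `load_document`. The function name `load_text_file` is hypothetical, and the `.pdf` branch is left out because it would need a third-party PDF library.

```python
import tempfile
from pathlib import Path


def load_text_file(path: Path) -> str:
    """Read a .txt file as UTF-8, dropping a leading BOM if present."""
    if path.suffix.lower() != ".txt":
        raise ValueError(f"Unsupported file type in this sketch: {path.suffix}")
    # "utf-8-sig" also decodes plain UTF-8; it only strips U+FEFF at the start.
    return path.read_text(encoding="utf-8-sig")


# Usage: write a small file that starts with a BOM, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes("\ufeffCall me Ishmael.".encode("utf-8"))
    loaded = load_text_file(sample)

print(repr(loaded))
```

Stripping the mark reduces the character count by one and may shift token counts slightly, so statistics are only comparable between runs that use the same decoding.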


## Chunk the Document

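Split the document into smaller chunks using the strategy set in the configuration (`fixed_token_window` here). Each chunk is small enough to embed on its own and to retrieve later. Note that `chunk_size_tokens` is a target rather than a hard limit, and that consecutive chunks overlap, so the summed chunk token count can exceed the document's token count.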


In [4]:
# Chunk the text using the configured strategy
# This returns a list of dictionaries, each containing a chunk of text
print("Chunking document...")
chunks = chunk_text(text, config)

# Display statistics about the chunks
num_chunks = len(chunks)
total_chunk_tokens = sum(chunk['token_count'] for chunk in chunks)
avg_tokens_per_chunk = total_chunk_tokens / num_chunks if num_chunks > 0 else 0
min_tokens = min(chunk['token_count'] for chunk in chunks) if chunks else 0
max_tokens = max(chunk['token_count'] for chunk in chunks) if chunks else 0

print("\nChunking complete!")
print(f"  Total chunks: {num_chunks}")
print(f"  Total tokens in chunks: {total_chunk_tokens:,}")
print(f"  Average tokens per chunk: {avg_tokens_per_chunk:.1f}")
print(f"  Min tokens per chunk: {min_tokens}")
print(f"  Max tokens per chunk: {max_tokens}")

# Show an example chunk
if chunks:
    print("\nExample chunk (first chunk):")
    print(f"  Chunk ID: {chunks[0]['chunk_id']}")
    print(f"  Token count: {chunks[0]['token_count']}")
    print(f"  Text preview: {chunks[0]['text'][:200]}...")


Chunking document...

Chunking complete!
  Total chunks: 772
  Total tokens in chunks: 372,730
  Average tokens per chunk: 482.8
  Min tokens per chunk: 137
  Max tokens per chunk: 646

Example chunk (first chunk):
  Chunk ID: chunk_0
  Token count: 511
  Text preview: ﻿The Project Gutenberg eBook of Moby Dick; Or, The Whale
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrict...
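
A fixed token window with overlap can be sketched as a sliding window whose stride is the chunk size minus the overlap. The example below illustrates the idea under stated assumptions and is not the project's `chunk_text`: it splits on whitespace instead of using a model tokenizer, and the function name `fixed_token_windows` is hypothetical. The dictionary keys (`chunk_id`, `text`, `token_count`) follow the ones used in the cell above. The real implementation evidently differs, since its chunks range from 137 to 646 tokens.

```python
from typing import Dict, List


def fixed_token_windows(tokens: List[str], size: int, overlap: int) -> List[Dict]:
    """Split tokens into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")

    stride = size - overlap  # how far the window advances each step
    chunks: List[Dict] = []
    for start in range(0, len(tokens), stride):
        window = tokens[start:start + size]
        chunks.append({
            "chunk_id": f"chunk_{len(chunks)}",
            "text": " ".join(window),
            "token_count": len(window),
        })
        # Stop once a window has reached the end of the token list.
        if start + size >= len(tokens):
            break
    return chunks


tokens = [str(i) for i in range(10)]  # toy "document" of ten tokens
windows = fixed_token_windows(tokens, size=4, overlap=1)
for w in windows:
    print(w["chunk_id"], w["token_count"], w["text"])
```

With ten tokens, a window of four, and an overlap of one, the windows hold twelve tokens in total. The same effect explains why the chunk total above is larger than the document total.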


## Embed All Chunks

Now we'll embed each chunk using Ollama. This process:
- Calls Ollama's embedding API for each chunk
- Tracks how long each call takes (timing)
- Counts tokens for each chunk (token accounting)
- Stores all metrics for later analysis

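**Note:** This may take a while depending on the number of chunks and your system's performance. Ollama must be running and reachable at `config.ollama_endpoint`; otherwise the embedding call fails with a `ConnectionError`.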
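
The timing and token bookkeeping described above can be sketched as a thin wrapper around each embedding call. This shows the general pattern only and is not the project's `MetricsStore` or `embed_chunks`: `timed_embed`, `fake_embed`, the `type` key, and the whitespace token count are hypothetical stand-ins. The keys `duration_seconds` and `token_counts['input_tokens']` match the ones the following cell reads back.

```python
import time
from typing import Callable, Dict, List


def fake_embed(text: str) -> List[float]:
    """Stand-in for a real embedding call; returns a tiny dummy vector."""
    return [float(len(text)), float(len(text.split()))]


def timed_embed(text: str, embed_fn: Callable[[str], List[float]],
                metrics: List[Dict]) -> List[float]:
    """Call embed_fn(text) and record its wall-clock duration and token count."""
    start = time.perf_counter()
    vector = embed_fn(text)
    duration = time.perf_counter() - start
    metrics.append({
        "type": "embedding",  # assumed key name
        "duration_seconds": duration,
        # Whitespace split is a rough proxy for a tokenizer count.
        "token_counts": {"input_tokens": len(text.split())},
    })
    return vector


metrics: List[Dict] = []
for sample_text in ["Call me Ishmael.", "Some years ago, never mind how long precisely"]:
    timed_embed(sample_text, fake_embed, metrics)

total_time = sum(m["duration_seconds"] for m in metrics)
total_tokens = sum(m["token_counts"]["input_tokens"] for m in metrics)
print(f"{len(metrics)} calls, {total_tokens} tokens, {total_time:.6f} s")
```

`time.perf_counter` is used because it is monotonic and has the highest available resolution for measuring short intervals.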


In [5]:
# Create a metrics store to track all timing and token metrics
# This will collect data about each embedding call
metrics_store = MetricsStore()

# Embed all chunks
# This function calls Ollama for each chunk and records metrics
print(f"Embedding {len(chunks)} chunks...")
print("This may take a while. Progress will be shown below.\n")

embedded_chunks = embed_chunks(chunks, config, metrics_store)

print("\nEmbedding complete!")
print(f"  Total chunks embedded: {len(embedded_chunks)}")

# Display some statistics from the metrics
embedding_metrics = metrics_store.get_metrics_by_type('embedding')
if embedding_metrics:
    total_embedding_time = sum(m['duration_seconds'] for m in embedding_metrics)
    total_embedding_tokens = sum(m['token_counts'].get('input_tokens', 0) for m in embedding_metrics)
    # embedding_metrics is non-empty here, so the division is safe
    avg_embedding_time = total_embedding_time / len(embedding_metrics)

    print(f"  Total embedding time: {total_embedding_time:.2f} seconds")
    print(f"  Average time per chunk: {avg_embedding_time:.2f} seconds")
    print(f"  Total tokens processed: {total_embedding_tokens:,}")
    if total_embedding_time > 0:
        print(f"  Throughput: {total_embedding_tokens / total_embedding_time:.2f} tokens/second")
    else:
        print("  Throughput: N/A")


Embedding 772 chunks...
This may take a while. Progress will be shown below.



ConnectionError: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/embeddings (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xe2573fff9c70>: Failed to establish a new connection: [Errno 111] Connection refused'))
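
The `ConnectionError` above means nothing accepted connections at `localhost:11434`, so the Ollama server was most likely not running. A quick reachability check before the embedding loop catches this before any work starts. The sketch below uses only the standard library. The helper name `ollama_is_reachable` is hypothetical, and the check assumes that an Ollama server answers a plain GET on its root URL with HTTP 200.

```python
import urllib.request


def ollama_is_reachable(endpoint: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET on the endpoint succeeds with status 200."""
    try:
        with urllib.request.urlopen(endpoint, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and urllib's URLError/HTTPError,
        # all of which are OSError subclasses.
        return False


endpoint = "http://localhost:11434"
if ollama_is_reachable(endpoint):
    print(f"Ollama is reachable at {endpoint}")
else:
    print(f"Cannot reach Ollama at {endpoint}; start it (for example with `ollama serve`) and retry")
```

Once the check passes, re-run the embedding cell before continuing to the save step, since `embedded_chunks` is undefined until that cell completes.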

## Save Results

Save the embedded chunks and metrics to disk so we can use them in the next notebook.


In [None]:
# Save chunks (including embeddings) to JSON file
# Note: This file may be large because embeddings are vectors of numbers
chunks_path = config.get_chunks_path()
print(f"Saving chunks to: {chunks_path}")
save_chunks(embedded_chunks, chunks_path)
print("Chunks saved!")

# Save metrics to JSON file
# This contains timing and token information for all embedding calls
metrics_path = config.get_metrics_path()
print(f"\nSaving metrics to: {metrics_path}")
save_metrics(metrics_store, metrics_path)
print("Metrics saved!")

print("\n✅ All results saved successfully!")
print("You can now proceed to notebook 02 for inference and question generation.")
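
Saving and reloading can be sketched with the standard `json` module. This is an illustration, not the project's `save_chunks`: the helper names `save_chunks_json` and `load_chunks_json`, the file name, and the sample record are hypothetical. Because each chunk carries its full embedding vector, the file grows with both the chunk count and the embedding dimension.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def save_chunks_json(chunks: List[Dict], path: Path) -> None:
    """Write chunks (text, token counts, embeddings) to a JSON file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False)


def load_chunks_json(path: Path) -> List[Dict]:
    """Read chunks back from a JSON file written by save_chunks_json."""
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


sample_chunks = [{
    "chunk_id": "chunk_0",
    "text": "Call me Ishmael.",
    "token_count": 3,
    "embedding": [0.1, -0.2, 0.3],  # made-up vector for the sketch
}]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "results" / "chunks.json"
    save_chunks_json(sample_chunks, out_path)
    restored = load_chunks_json(out_path)

print(restored == sample_chunks)
```

A round-trip check like this one confirms that the next notebook will read back exactly what was saved.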
