### 📄 Step 1: Parse PDF into a Document object

This step loads a `.pdf` file and converts it into a `Document`, which holds the full text plus metadata such as page numbers.  
The `Document` also splits the text into internal chunks, typically one per page or section.


In [25]:
from archivum.parser import parse_file
file_path = "input/canadian_federal_budget_2024.pdf"
document = parse_file(file_path)
print(document)

<Document with 422 chunks>


### 🔢 Step 2: Tokenize Chunks (Basic)

This loop tokenizes each chunk of text using the configured tokenizer.  
It prints:
- `Tokens`: number of tokens in the chunk
- `Mask Sum`: number of real (non-padding) tokens, i.e. the count of attention-mask entries equal to 1; it matches `Tokens` only when no padding was added

This helps you check chunk size against your model's token limit (512 for e5).

In [15]:
from archivum.tokenizer_utils import load_tokenizer, tokenize, tokenize_raw
tokenizer = load_tokenizer()

for chunk in document.get_chunks():
    input_ids, attention_mask = tokenize(chunk.text, tokenizer)
    print(f"Tokens: {len(input_ids)}, Mask Sum: {sum(attention_mask)}")

Tokens: 512, Mask Sum: 427
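
An output such as `Tokens: 512, Mask Sum: 427` is what you would expect if the chunk is padded to a fixed length: 427 real tokens followed by 85 padding positions. The sketch below reproduces that arithmetic in plain Python. It is not the `archivum` implementation; `pad_to_length` is a hypothetical helper, and `pad_id=1` is an assumption (it is the padding ID used by XLM-RoBERTa-style vocabularies).

```python
def pad_to_length(input_ids, max_length, pad_id=1):
    """Truncate or right-pad token IDs and build the matching attention mask."""
    ids = list(input_ids[:max_length])   # drop anything past the limit
    mask = [1] * len(ids)                # 1 = real token
    padding = max_length - len(ids)
    ids += [pad_id] * padding            # fill the remainder with the pad ID
    mask += [0] * padding                # 0 = padding, ignored by the model
    return ids, mask


# 427 made-up token IDs padded out to the 512-token limit
ids, mask = pad_to_length(list(range(100, 527)), max_length=512)
print(f"Tokens: {len(ids)}, Mask Sum: {sum(mask)}")
```

If `Mask Sum` equals the limit, the chunk either fit exactly or was truncated, so that case is worth a closer look.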


### 🔍 Step 2b: Inspect Raw Tokenization

This shows raw token details:
- `tokens`: list of decoded token strings
- `ids`: actual token IDs
- `offsets`: character ranges of each token in the original string

This is for debugging and for analyzing the difference between the original text and the tokenized chunks.

In [16]:
for chunk in document.get_chunks():
    enc = tokenize_raw(chunk.text, tokenizer)
    print(f"First 100 tokens: {enc.tokens[:100]}")
    print(f"Token count: {len(enc.ids)}")
    print(f"Offsets: {enc.offsets[:10]}")




First 100 tokens: ['<s>', '▁*', 'Archiv', 'um', '▁Men', 'tis', '*', '▁directly', '▁trans', 'late', 's', '▁to', '▁"', 'Archiv', 'e', '▁of', '▁the', '▁Mind', '".', '▁**', '📜', '▁E', 'tym', 'ology', '▁of', '▁_', 'Archiv', 'um', '_', '**', '▁>', '▁The', '▁English', '▁word', '▁_', 'archive', '_', '▁is', '▁deri', 'ved', '▁from', '▁the', '▁French', '▁_', 'archive', 's', '_', '▁(', 'pl', 'ural', '),', '▁and', '▁in', '▁turn', '▁from', '▁Latin', '▁_', 'arch', 'ī', 'um', '_', '▁or', '▁_', 'arch', 'īvu', 'm', '_', ',', '▁the', '▁roman', 'ized', '▁form', '▁of', '▁the', '▁Greek', '▁', 'ἀ', 'ρ', 'χε', 'ῖ', 'ον', '▁(', '_', 'ar', 'khe', 'ion', '_', ').', '▁The', '▁Greek', '▁term', '▁original', 'ly', '▁refer', 'red', '▁to', '▁the', '▁home', '▁or', '▁d']
Token count: 512
Offsets: [(0, 0), (0, 1), (1, 7), (7, 9), (10, 13), (13, 16), (16, 17), (18, 26), (27, 32), (32, 36)]
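
To read the offsets, slice the original string with each `(start, end)` pair. The sketch below does that with the ten offsets printed above and the opening of the first chunk (copied from the output; it needs no tokenizer).

```python
# Opening of the first chunk and the first ten offsets from the output above.
text = '*Archivum Mentis* directly translates to "Archive of the Mind".'
offsets = [(0, 0), (0, 1), (1, 7), (7, 9), (10, 13),
           (13, 16), (16, 17), (18, 26), (27, 32), (32, 36)]

# Each offset pair is a half-open character range into the original string.
pieces = [text[start:end] for start, end in offsets]
print(pieces)
```

The special token `<s>` maps to the empty range `(0, 0)`, and the `▁` word-boundary marker seen in `tokens` does not appear in the slices because offsets skip the whitespace itself.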


### ✂️ Step 3: Sentence-Aware Chunking

This uses full sentence boundaries to chunk text under a token limit (e.g., 512 tokens).  
Sentences are added one-by-one until the chunk hits the limit. This ensures:
- No mid-sentence cuts
- Clean semantic groupings

You can inspect how many chunks were made per document section.

In [17]:
from archivum.chunker import chunk_text, debug_token_chunk
print("\n=== Running Sentence-aware Chunking ===\n")
for i, chunk in enumerate(document.get_chunks()):
    sentence_chunks = chunk_text(chunk.text, tokenizer, strategy="sentence", log=True)
    print(f"Document Chunk {i+1} → {len(sentence_chunks)} sentence chunks\n")
    for j, c in enumerate(sentence_chunks[:2]):
        print(f"  [{j+1}] {c[:200]}...\n")
    if i == 0:
        break


=== Running Sentence-aware Chunking ===

[Chunking] Strategy: sentence | Tokens: 512 | Chunks: 10 | Avg tokens/chunk: 51.2 | Time: 0.0138s
Document Chunk 1 → 10 sentence chunks

  [1] *Archivum Mentis* directly translates to "Archive of the Mind"....

  [2] **📜 Etymology of _Archivum_**

> The English word _archive_ is derived from the French _archives_ (plural), and in turn from Latin _archīum_ or _archīvum_, the romanized form of the Greek ἀρχεῖον (_ar...



[nltk_data] Downloading package punkt_tab to /Users/soho/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
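
For intuition, here is a minimal sketch of greedy sentence packing under a token budget. It is not necessarily how `chunk_text` works: `pack_sentences` and `count_tokens` are hypothetical names, the regex splitter stands in for a real sentence tokenizer such as NLTK's, and whitespace word counts stand in for model tokens.

```python
import re


def count_tokens(text):
    # Stand-in for a real tokenizer: one "token" per whitespace-separated word.
    return len(text.split())


def pack_sentences(text, max_tokens):
    """Greedily group whole sentences so each chunk stays within max_tokens."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current chunk if this sentence would push it over budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five. Six seven eight nine. Ten."
chunks = pack_sentences(sample, max_tokens=5)
for c in chunks:
    print(count_tokens(c), c)
```

In this sketch a single sentence longer than `max_tokens` still becomes its own oversized chunk; a production chunker would need a fallback for that case.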


### 🪟 Step 3b: Sliding Window Chunking

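This cuts text into fixed-length token windows that overlap their neighbours.  
Useful for models that benefit from continuous context.

- `window=150`: size of each chunk in tokens
- `stride=30`: step between the starts of consecutive windows, judging by the output below (the second window begins about 30 tokens into the first), so neighbouring windows share roughly `window - stride = 120` tokens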

In [18]:
print("\n=== Running Sliding Window Chunking ===\n")
for i, chunk in enumerate(document.get_chunks()):
    sliding_chunks = chunk_text(chunk.text, tokenizer, strategy="sliding", window=150, stride=30, log=True)
    print(f"Document Chunk {i+1} → {len(sliding_chunks)} sliding chunks\n")
    for j, c in enumerate(sliding_chunks[:2]):
        print(f"  [{j+1}] {c[:200]}...\n")
    if i == 0:
        break


=== Running Sliding Window Chunking ===

[Chunking] Strategy: sliding | Tokens: 512 | Chunks: 17 | Avg tokens/chunk: 30.1 | Time: 0.0029s
Document Chunk 1 → 17 sliding chunks

  [1] *Archivum Mentis* directly translates to "Archive of the Mind". **📜 Etymology of _Archivum_** > The English word _archive_ is derived from the French _archives_ (plural), and in turn from Latin _archī...

  [2] > The English word _archive_ is derived from the French _archives_ (plural), and in turn from Latin _archīum_ or _archīvum_, the romanized form of the Greek ἀρχεῖον (_arkheion_). The Greek term origin...
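
The windowing itself can be sketched in a few lines. This version assumes `stride` is the step between window starts, which is what the output above suggests (17 chunks for 512 tokens, second chunk starting about 30 tokens in). `sliding_windows` is a hypothetical helper, not the `archivum` function, and its stopping rule may differ from the library's.

```python
def sliding_windows(token_ids, window, stride):
    """Return overlapping slices; `stride` is the step between window starts."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + window])
        # Stop once a window has reached the end of the sequence.
        if start + window >= len(token_ids):
            break
    return windows


tokens = list(range(10))
windows = sliding_windows(tokens, window=4, stride=2)
for w in windows:
    print(w)
```

Here each window shares `window - stride = 2` tokens with the next one. A smaller stride means more overlap and more chunks to embed.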



### 🧪 Step 3c: Token Slice Debugging

A debugging helper that prints the token IDs and decoded text for a slice of a chunk.  
Here it prints the first 20 tokens (slice `0:20`) of the first chunk, showing exactly which tokens are passed to the model.

In [19]:
print("\n=== Debug: Manual Token Slice ===\n")
sample_text = document.get_chunks()[0].text
debug_token_chunk(sample_text, tokenizer, 0, 20)


=== Debug: Manual Token Slice ===

Tokens 0:20 → [0, 661, 219548, 316, 1111, 1814, 1639, 105237, 3900, 19309, 7, 47, 44, 219548, 13, 111, 70, 29616, 740, 16459]
Decoded Text → *Archivum Mentis* directly translates to "Archive of the Mind". **


### 🔗 Step 4: Generate Embeddings

We collect all the sentence-based chunks across the whole document. These are passed to the embedding model, and each chunk is converted into a vector using `e5-large-instruct`.  
The result is a tensor of shape `(num_chunks, dim)`, where each row is a semantic fingerprint of the chunk.

Printed outputs confirm:
- Number of chunks embedded
- First chunk preview
- First 5 dimensions of its vector


In [20]:
from archivum.embedder import load_embedder, embed_texts, get_detailed_instruct
import torch
all_chunks = []
for chunk in document.get_chunks():
    sentence_chunks = chunk_text(chunk.text, tokenizer, strategy="sentence")
    all_chunks.extend(sentence_chunks)

model = load_embedder()
chunk_embeddings = embed_texts(all_chunks, model)

print(f"Embedded {len(all_chunks)} chunks → shape: {chunk_embeddings.shape}")
print(f"Example chunk: {all_chunks[0][:100]}...")
print(f"Embedding (first 5 dims): {chunk_embeddings[0][:5]}")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Embedded 10 chunks → shape: torch.Size([10, 1024])
Example chunk: *Archivum Mentis* directly translates to "Archive of the Mind"....
Embedding (first 5 dims): tensor([ 0.0320, -0.0064, -0.0170, -0.0274,  0.0263], device='mps:0')


### ❓ Step 5: Encode Query

The user provides a natural-language question.  
It's wrapped with an instruction (e.g., `"Retrieve relevant document information"`) to guide the embedding model to treat it as a search query.

The wrapped query is then embedded into a vector so it can be compared against the chunk embeddings.


In [23]:
query = input("\n🔍 Your query: ").strip()
detailed_query = get_detailed_instruct("Retrieve relevant document information", query)
query_embedding = embed_texts([detailed_query], model)[0]
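
For reference, the E5 instruct model cards describe a one-line template for wrapping queries, sketched below. Whether `archivum`'s `get_detailed_instruct` uses exactly this format is an assumption, and `build_instructed_query` is a hypothetical name used only for illustration.

```python
def build_instructed_query(task_description, query):
    # Template described in the E5 instruct model cards; only queries are
    # wrapped this way, document chunks are embedded as plain text.
    return f"Instruct: {task_description}\nQuery: {query}"


wrapped = build_instructed_query(
    "Retrieve relevant document information",
    "What does Archivum mean?",
)
print(wrapped)
```

This matches the pipeline above, where the chunks in Step 4 are embedded without any instruction prefix.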

### 🔍 Step 6: Vector Search + Top Matches

We compare the query vector with each chunk embedding using a dot product, which equals cosine similarity when the vectors are L2-normalized; the code scales the scores by 100.  
The top-k most similar chunks are retrieved by score and printed with their similarity score and first ~200 characters.  
This simulates the “retrieval” step in RAG: you see which document pieces best match your query.



In [24]:
scores = (query_embedding @ chunk_embeddings.T) * 100  # shape: (num_chunks,)
top_k = 3
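# Guard against documents that yield fewer than top_k chunks (torch.topk raises if k is too large)
top_indices = torch.topk(scores, k=min(top_k, len(all_chunks))).indices.tolist()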

print(f"\n🧠 Top {top_k} Results:\n")
for idx in top_indices:
    print(f"[{scores[idx]:.2f}] {all_chunks[idx][:200]}...\n")


🧠 Top 3 Results:

[92.23] **📜 Etymology of _Archivum_**

> The English word _archive_ is derived from the French _archives_ (plural), and in turn from Latin _archīum_ or _archīvum_, the romanized form of the Greek ἀρχεῖον (_ar...

[90.93] > — [_Wikipedia: Archive_](https://en.wikipedia.org/wiki/Archive)

I first chose the word Archivum because I like latin terminology, however upon further digging and investigating the origin of this w...

[86.63] *Archivum Mentis* directly translates to "Archive of the Mind"....
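
The scoring step can be reproduced without PyTorch to see why a dot product over normalized vectors is a cosine similarity. This is a toy sketch with made-up 2-D vectors; `l2_normalize`, `dot`, and `top_k_matches` are hypothetical helpers, and it assumes no vector is all zeros.

```python
import math


def l2_normalize(vec):
    # Scale the vector to unit length (assumes a non-zero vector).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k_matches(query, chunk_vectors, k):
    """Return (score, index) pairs for the k most similar chunks."""
    q = l2_normalize(query)
    # Dot product of unit vectors = cosine similarity; x100 mirrors the cell above.
    scored = [(dot(q, l2_normalize(c)) * 100, i) for i, c in enumerate(chunk_vectors)]
    scored.sort(reverse=True)
    return scored[:k]


chunk_vectors = [[1.0, 0.0], [3.0, 4.0], [0.0, -2.0]]  # toy "embeddings"
results = top_k_matches([1.0, 1.0], chunk_vectors, k=2)
for score, idx in results:
    print(f"[{score:.2f}] chunk {idx}")
```

Vector length does not affect the ranking here: `[3.0, 4.0]` wins because it points in nearly the same direction as the query, not because it is longer.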

