# Athenaeum Playground

This notebook walks through the core features of **Athenaeum** — a Python library for building searchable knowledge bases from documents.

## Installation

In [1]:
# Use one of the following:

#!pip install -q athenaeum-kb[mistral]
!uv pip install '../dist/athenaeum_kb-0.2.1-py3-none-any.whl[mistral]' -q

Audited 1 package in 11ms


## Setup

In [3]:
import getpass

OPENAI_API_KEY = getpass.getpass('OPENAI_API_KEY: ')
MISTRAL_API_KEY = getpass.getpass('MISTRAL_API_KEY: ')

OPENAI_API_KEY:  ········
MISTRAL_API_KEY:  ········


In [4]:
from pathlib import Path

DOCS_DIR = Path("knowledge-base")
STORAGE_DIR = Path(".athenaeum")

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language
from athenaeum import Athenaeum, AthenaeumConfig, get_ocr_provider
from langchain_openai import OpenAIEmbeddings
import tiktoken

config = AthenaeumConfig(
    storage_dir=STORAGE_DIR,
    rrf_k=70,                   # RRF constant for hybrid search
    default_strategy="hybrid",  # Default search strategy
)

# Custom token-based splitter
enc = tiktoken.get_encoding("cl100k_base")
token_splitter = RecursiveCharacterTextSplitter.from_language(
    Language.MARKDOWN,
    chunk_size=512,
    chunk_overlap=64,
    length_function=lambda text: len(enc.encode(text)),
)

# Embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", api_key=OPENAI_API_KEY)

# OCR model
ocr = get_ocr_provider("mistral", api_key=MISTRAL_API_KEY)

kb = Athenaeum(
    config=config,
    ocr_provider=ocr,
    embeddings=embeddings,
    text_splitter=token_splitter
)
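
The splitter above measures chunk length in tokens rather than characters, with 64 tokens shared between neighbouring chunks. The sketch below illustrates only that windowing idea, using the standard library. Whitespace splitting stands in for the `cl100k_base` tokenizer, and `chunk_by_tokens` is a hypothetical helper that belongs to neither Athenaeum nor LangChain. The real `RecursiveCharacterTextSplitter` also tries to break on structural separators, which this sketch ignores.

```python
def chunk_by_tokens(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size tokens that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()  # assumption: whitespace tokens, not BPE tokens
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_by_tokens(sample, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last token of the previous one, so a sentence cut at a chunk boundary still appears with some context in the next chunk.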

## Load documents

In [10]:
# Load the PDF papers from `DOCS_DIR`, assigning tags by research area so we can filter later

papers = {
    "Attention Is All You Need.pdf": {"nlp", "transformers", "architecture"},
    "BERT.pdf":                      {"nlp", "transformers", "pretraining"},
    "XLNet.pdf":                     {"nlp", "transformers", "pretraining"},
    "Language Models are Few-Shot Learners.pdf": {"nlp", "transformers", "pretraining", "generative"},
    "LORA.pdf":                      {"nlp", "transformers", "fine-tuning"},
    "RAG.pdf":                       {"nlp", "transformers", "retrieval"},
    "ViT.pdf":                       {"vision", "transformers", "architecture"},
    "GANs.pdf":                      {"vision", "generative"},
    "VAE.pdf":                       {"vision", "generative"},
    "DDPM.pdf":                      {"vision", "generative", "diffusion"},
    "High-Resolution Image Synthesis with Latent Diffusion Models.pdf": {"vision", "generative", "diffusion"},
}

for filename, tags in papers.items():
    path = DOCS_DIR / filename
    if not path.exists():
        print(f"Skipping (not found): {filename}")
        continue
    doc_id = kb.load_doc(str(path), tags=tags)
    print(f"[INFO] Loaded {filename} with ID {doc_id}")

[INFO] Loaded Attention Is All You Need.pdf with ID a37ca72e7343
[INFO] Loaded BERT.pdf with ID a93623c887f3
[INFO] Loaded XLNet.pdf with ID 398c7fcbc1bb
[INFO] Loaded Language Models are Few-Shot Learners.pdf with ID 558dfc0f6033
[INFO] Loaded LORA.pdf with ID 56397b5e8269
[INFO] Loaded RAG.pdf with ID 30ea5715ee9d
[INFO] Loaded ViT.pdf with ID f83a251d3a8d
[INFO] Loaded GANs.pdf with ID 8053b5f17066
[INFO] Loaded VAE.pdf with ID 77f686645c39
[INFO] Loaded DDPM.pdf with ID 7e2a0c030dc2
[INFO] Loaded High-Resolution Image Synthesis with Latent Diffusion Models.pdf with ID f3ea54df7f8d


## List documents

In [11]:
# List all documents in kb

docs = kb.list_docs()
for doc in docs:
    print(f"{doc.id}  {doc.name:<60}")

a37ca72e7343  Attention Is All You Need.pdf                               
a93623c887f3  BERT.pdf                                                    
398c7fcbc1bb  XLNet.pdf                                                   
558dfc0f6033  Language Models are Few-Shot Learners.pdf                   
56397b5e8269  LORA.pdf                                                    
30ea5715ee9d  RAG.pdf                                                     
f83a251d3a8d  ViT.pdf                                                     
8053b5f17066  GANs.pdf                                                    
77f686645c39  VAE.pdf                                                     
7e2a0c030dc2  DDPM.pdf                                                    
f3ea54df7f8d  High-Resolution Image Synthesis with Latent Diffusion Models.pdf


In [None]:
# Filter by tags
# Tags use OR semantics — any document matching at least one of the given tags is returned.

print("Diffusion papers:")
for doc in kb.list_docs(tags={"diffusion"}):
    print(f"  - {doc.name}")

print()
print("NLP papers:")
for doc in kb.list_docs(tags={"nlp"}):
    print(f"  - {doc.name}")

print()
print("Vision or fine-tuning:")
for doc in kb.list_docs(tags={"vision", "fine-tuning"}):
    print(f"  - {doc.name}  tags={kb.get_tags(doc.id)}")
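
The OR semantics described in the cell above amount to a set-overlap test. Here is a minimal sketch of that rule on plain Python sets, reusing three of the tag sets from the loading step. `matches_any_tag` and `matches_all_tags` are hypothetical helpers for illustration, not Athenaeum functions.

```python
def matches_any_tag(doc_tags: set[str], wanted: set[str]) -> bool:
    """OR semantics: True if the document carries at least one wanted tag."""
    return not wanted.isdisjoint(doc_tags)


def matches_all_tags(doc_tags: set[str], wanted: set[str]) -> bool:
    """AND semantics, shown only for contrast."""
    return wanted.issubset(doc_tags)


catalog = {
    "ViT.pdf": {"vision", "transformers", "architecture"},
    "LORA.pdf": {"nlp", "transformers", "fine-tuning"},
    "BERT.pdf": {"nlp", "transformers", "pretraining"},
}
wanted = {"vision", "fine-tuning"}

any_match = sorted(name for name, tags in catalog.items() if matches_any_tag(tags, wanted))
all_match = sorted(name for name, tags in catalog.items() if matches_all_tags(tags, wanted))
print("OR: ", any_match)
print("AND:", all_match)
```

Under OR semantics both `ViT.pdf` and `LORA.pdf` are selected. Under AND semantics nothing is, because no paper carries both tags.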

In [13]:
# List all tags in the kb

tags = kb.list_tags()

print("Tags:")
for tag in sorted(tags):
    print(f" - {tag}")

Tags:
 - architecture
 - diffusion
 - fine-tuning
 - generative
 - nlp
 - pretraining
 - retrieval
 - transformers
 - vision


## Search across documents

In [None]:
# BM25 (keyword) search
query = "self-attention mechanism"
results = kb.search_kb(query, top_k=3, strategy="bm25")

for hit in results:
    print(f"[{hit.score:.3f}] {hit.name}")
    if hit.snippet:
        print(f"{hit.snippet[:120]}...")
        print()
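
For intuition about what a keyword score rewards, here is a compact sketch of the textbook Okapi BM25 formula over a toy corpus. The tokenizer (lowercase, split on whitespace), the parameters `k1=1.5` and `b=0.75`, and the `bm25_scores` helper are all assumptions made for illustration. Athenaeum's own BM25 implementation may differ in every one of these details.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with the Okapi BM25 formula."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


corpus = [
    "self-attention relates positions of a single sequence",
    "generative adversarial networks train a generator and a discriminator",
    "the attention mechanism and self-attention layers",
]
scores = bm25_scores("self-attention mechanism", corpus)
for text, score in zip(corpus, scores):
    print(f"[{score:.3f}] {text}")
```

Documents that share no term with the query score exactly zero, which is why keyword search can miss paraphrased matches that vector search would find.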

In [None]:
# Vector (semantic) search
query = "how to generate realistic images?"
results = kb.search_kb(query, top_k=3, strategy="vector")

for hit in results:
    print(f"[{hit.score:.3f}] {hit.name}")
    if hit.snippet:
        print(f"{hit.snippet[:120]}...")
        print()

In [None]:
# Hybrid search (default)
# Combines BM25 and vector search with Reciprocal Rank Fusion.
query = "low-rank adaptation for large language models"
results = kb.search_kb(query, top_k=3)

for hit in results:
    print(f"[{hit.score:.4f}] {hit.name}")
    if hit.snippet:
        print(f"{hit.snippet[:120]}...")
        print()
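
Reciprocal Rank Fusion merges rankings using only positions, not raw scores: each list contributes `1 / (k + rank)` per document, and the contributions are summed. Below is a minimal sketch with `k=70`, matching the `rrf_k` value configured earlier. The two input rankings are made up for illustration, and `reciprocal_rank_fusion` is a hypothetical helper. Whether Athenaeum applies the formula in exactly this form is an assumption.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 70) -> list[tuple[str, float]]:
    """Fuse ranked lists: every list adds 1 / (k + rank) to each document it contains."""
    fused_scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused_scores[doc_id] += 1.0 / (k + rank)
    return sorted(fused_scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative rankings only, not real search output.
bm25_ranking = ["LORA.pdf", "BERT.pdf", "RAG.pdf"]
vector_ranking = ["LORA.pdf", "RAG.pdf", "XLNet.pdf"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking], k=70)
for doc_id, score in fused:
    print(f"[{score:.4f}] {doc_id}")
```

With `k=70`, a single list contributes at most `1/71` per document, so fused scores are small numbers. A document that appears in both lists outranks one that ranks slightly higher in only one of them.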

In [None]:
# Similarity threshold — filter out low-confidence vector results
# Set similarity_threshold in AthenaeumConfig to drop chunks below a cosine score cutoff.
# Scores are in [0, 1]; a threshold of 0.35 keeps only reasonably similar chunks.
#
# Note: similarity_threshold only affects "vector" and "hybrid" strategies (not "bm25").
# To use this feature, create a new Athenaeum instance with the desired config:
#
#   from athenaeum import Athenaeum, AthenaeumConfig
#   config = AthenaeumConfig(storage_dir=STORAGE_DIR, similarity_threshold=0.35)
#   kb_filtered = Athenaeum(embeddings=embeddings, config=config)
#
# For demonstration, we show the raw scores returned by vector search so you can
# choose an appropriate threshold for your use case.

query = "how to generate realistic images?"
results = kb.search_kb(query, top_k=5, strategy="vector")

print(f"Vector search results for: '{query}'")
print("(use similarity_threshold in AthenaeumConfig to filter below a chosen score)\n")
for hit in results:
    print(f"  [{hit.score:.3f}] {hit.name}")
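
The thresholding idea from the cell above can be sketched on toy vectors: compute the cosine similarity between the query and each chunk, then drop anything below the cutoff. The two-dimensional vectors, the helper names, and the inclusive `>=` comparison are assumptions for illustration. How Athenaeum compares scores against `similarity_threshold` internally is not shown in this notebook.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_threshold(query_vec, chunk_vecs, threshold):
    """Keep (name, score) pairs at or above the threshold, best first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in chunk_vecs.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


toy_query = [1.0, 0.0]
toy_chunks = {"close": [0.9, 0.1], "partial": [0.5, 0.5], "unrelated": [0.0, 1.0]}

kept = filter_by_threshold(toy_query, toy_chunks, threshold=0.35)
for name, score in kept:
    print(f"[{score:.3f}] {name}")
```

Printing raw scores first, as the cell above does, is a reasonable way to pick the cutoff, since useful values depend on the embedding model.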

In [None]:
# Chunk-level results with aggregate=False
# By default search_kb returns one SearchHit per document (aggregate=True).
# Pass aggregate=False to get raw ContentSearchHit objects with exact line ranges,
# which is useful when you need to pinpoint exactly where a match appears.

query = "low-rank adaptation for large language models"
chunks = kb.search_kb(query, top_k=5, aggregate=False)

print(f"Chunk-level results for: '{query}'\n")
for hit in chunks:
    print(f"  [{hit.score:.4f}] {hit.name}  lines {hit.line_range[0]}-{hit.line_range[1]}")
    print(f"           {hit.text[:100].strip()}...")
    print()
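
One plausible way to go from chunk-level hits to one hit per document is to keep the best-scoring chunk of each document. The sketch below uses that rule as an assumption, since the notebook does not state how `aggregate=True` combines chunk scores. `ChunkHit` is a hypothetical stand-in for the library's hit objects, and the scores and line ranges are invented.

```python
from dataclasses import dataclass


@dataclass
class ChunkHit:
    """Hypothetical stand-in for a chunk-level search hit."""
    name: str
    score: float
    line_range: tuple[int, int]


def aggregate_by_document(hits: list[ChunkHit]) -> list[ChunkHit]:
    """Keep the highest-scoring chunk per document, sorted best first."""
    best: dict[str, ChunkHit] = {}
    for hit in hits:
        if hit.name not in best or hit.score > best[hit.name].score:
            best[hit.name] = hit
    return sorted(best.values(), key=lambda hit: hit.score, reverse=True)


# Invented hits for illustration only.
toy_hits = [
    ChunkHit("LORA.pdf", 0.031, (40, 58)),
    ChunkHit("LORA.pdf", 0.027, (120, 141)),
    ChunkHit("BERT.pdf", 0.015, (10, 30)),
]
per_doc = aggregate_by_document(toy_hits)
for hit in per_doc:
    print(f"[{hit.score:.3f}] {hit.name}  lines {hit.line_range[0]}-{hit.line_range[1]}")
```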

In [None]:
# Search with tag filtering
# Restrict search to documents matching specific tags.
query = "generative models"
results = kb.search_kb(query, top_k=3, tags={"nlp"})

for hit in results:
    print(f"[{hit.score:.4f}] {hit.name}")
    if hit.snippet:
        print(f"{hit.snippet[:120]}...")
        print()

In [None]:
# Search by name
query = "BERT"
results = kb.search_kb(query, scope="names", strategy="bm25")

for hit in results:
    print(f"[{hit.score:.4f}] {hit.name}")
    if hit.snippet:
        print(f"{hit.snippet[:120]}...")
        print()

## Search within a document

In [None]:
# Pick a specific paper and search for content inside it

doc_ids = {doc.name: doc.id for doc in docs}
attention_id = doc_ids["BERT.pdf"]
query = "What is SQuAD?"

results = kb.search_doc(attention_id, query, top_k=3)
for hit in results:
    print(f"[{hit.score:.3f}] {hit.line_range[0]}-{hit.line_range[1]}")
    print(f"{hit.text[:100]}...")
    print()

In [None]:
# Search the same paper with a different query

doc_ids = {doc.name: doc.id for doc in docs}
attention_id = doc_ids["BERT.pdf"]
query = "Describe BERT's model architecture"

results = kb.search_doc(attention_id, query, top_k=3)
for hit in results:
    print(f"[{hit.score:.3f}] {hit.line_range[0]}-{hit.line_range[1]}")
    print(f"{hit.text[:100]}...")
    print()

## Read specific excerpts

In [23]:
# Read the first 20 lines of a paper

gpt3_id = doc_ids["Language Models are Few-Shot Learners.pdf"]
excerpt = kb.read_doc(gpt3_id, start_line=1, end_line=20)

print(f"Lines {excerpt.line_range[0]}-{excerpt.line_range[1]} of {excerpt.total_lines}\n")
print(excerpt.text)

Lines 1-20 of 1791

# Language Models are Few-Shot Learners

|  Tom B. Brown* |   | Benjamin Mann* |   | Nick Ryder* |   | Melanie Subbiah*  |   |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  Jared Kaplan† | Prafulla Dhariwal | Arvind Neelakantan | Pranav Shyam | Girish Sastry |  |  |   |
|  Amanda Askell | Sandhini Agarwal | Ariel Herbert-Voss | Gretchen Krueger | Tom Henighan |  |  |   |
|  Rewon Child | Aditya Ramesh | Daniel M. Ziegler | Jeffrey Wu | Clemens Winter |  |  |   |
|  Christopher Hesse | Mark Chen | Eric Sigler | Mateusz Litwin | Scott Gray |  |  |   |
|  Benjamin Chess |   | Jack Clark |   | Christopher Berner |   |  |   |
|  Sam McCandlish |   | Alec Radford | Ilya Sutskever | Dario Amodei |   |  |   |

OpenAI

# Abstract

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires t
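
The excerpt above is addressed by 1-indexed, inclusive line numbers and reports the document's total line count. Here is a minimal sketch of that addressing scheme on a plain string. `LineExcerpt` and `read_lines` are hypothetical names, and clamping out-of-range requests to the document bounds is an assumption rather than documented Athenaeum behaviour.

```python
from dataclasses import dataclass


@dataclass
class LineExcerpt:
    """Hypothetical stand-in for the excerpt object returned by a line-range read."""
    text: str
    line_range: tuple[int, int]
    total_lines: int


def read_lines(document: str, start_line: int, end_line: int) -> LineExcerpt:
    """Return lines start_line..end_line (1-indexed, inclusive), clamped to the document."""
    lines = document.splitlines()
    total = len(lines)
    start = max(1, start_line)
    end = min(total, end_line)
    selected = lines[start - 1:end]  # convert to a 0-indexed, end-exclusive slice
    return LineExcerpt("\n".join(selected), (start, end), total)


toy_doc = "\n".join(f"line {i}" for i in range(1, 11))
toy_excerpt = read_lines(toy_doc, start_line=3, end_line=5)
print(f"Lines {toy_excerpt.line_range[0]}-{toy_excerpt.line_range[1]} of {toy_excerpt.total_lines}")
print(toy_excerpt.text)
```

Line ranges returned by chunk-level search can be passed to a reader like this to fetch the surrounding context of a match.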

## Use table of contents to navigate

In [None]:
lora_id = doc_ids["LORA.pdf"]
print(kb.get_toc(lora_id))

In [None]:
vit_id = doc_ids["ViT.pdf"]
print(kb.get_toc(vit_id))
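
Since the OCR step produces Markdown (as the excerpt in the previous section shows), a table of contents can in principle be derived from heading lines. The sketch below shows one way to do that with the standard library. It is not a description of how `get_toc` works or of the format it returns: `build_toc` and its output shape are assumptions made for illustration.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # code-fence marker, built this way to keep the example easy to embed


def build_toc(markdown: str) -> list[tuple[int, str, int]]:
    """Return (level, title, line_number) for each ATX heading outside code fences."""
    toc = []
    in_fence = False
    for line_no, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # skip fenced code, where '#' starts a comment
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            toc.append((len(match.group(1)), match.group(2), line_no))
    return toc


toy_markdown = "# Title\n\nIntro text.\n\n## Method\n\nDetails.\n\n### Low-rank update\n\n## Results\n"
toc = build_toc(toy_markdown)
for level, title, line_no in toc:
    print(f"{'  ' * (level - 1)}{title} (line {line_no})")
```

Pairing each heading with its line number is what makes a table of contents useful for navigation: the number can be used as the `start_line` of a follow-up read.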

## Manage tags

Tags can be added or removed after loading.

In [26]:
# Add tags (attention_id holds the BERT.pdf ID assigned in the previous section)
kb.tag_doc(attention_id, {"seminal", "google"})

# Remove a tag
kb.untag_doc(attention_id, {"google"})

## Cleanup

The knowledge base persists in `STORAGE_DIR`. Delete it to start fresh:

```python
import shutil
shutil.rmtree(STORAGE_DIR)
```

---