# Athenaeum Playground

This notebook walks through the core features of **Athenaeum** — a Python library for building searchable knowledge bases from documents.

We will:
1. Install dependencies
2. Initialize a knowledge base
3. Load PDF papers with tags
4. List and filter documents
5. Search across documents (BM25, vector, hybrid)
6. Search within a single document
7. Read specific excerpts
8. Manage tags
9. Clean up the stored data

## 1. Installation

In [None]:
%pip install -q athenaeum-kb langchain-openai

## 2. Setup

Make sure your `OPENAI_API_KEY` environment variable is set before running this cell.
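
If you want the notebook to fail early with a clear message, a small guard such as the one below can run first. This is a minimal sketch: `require_env` is a hypothetical helper, not part of Athenaeum or LangChain, and the demo uses a made-up variable name so it runs without a real key.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before creating the embeddings client.")
    return value


# Hypothetical variable name, used only so this example runs anywhere.
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```

In this notebook you would call `require_env("OPENAI_API_KEY")` before constructing `OpenAIEmbeddings`.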

In [None]:
from pathlib import Path

from langchain_openai import OpenAIEmbeddings

from athenaeum import Athenaeum, AthenaeumConfig

DOCS_DIR = Path("knowledge-base")

config = AthenaeumConfig(
    storage_dir=Path("athenaeum-data"),
    chunk_size=40,
    chunk_overlap=10,
)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
kb = Athenaeum(embeddings=embeddings, config=config)

print("Knowledge base ready.")

## 3. Load documents

We load the PDF papers from `knowledge-base/`, assigning tags by research area so we can filter later.

In [None]:
papers = {
    "Attention Is All You Need.pdf": {"nlp", "transformers", "architecture"},
    "BERT.pdf":                      {"nlp", "transformers", "pretraining"},
    "XLNet.pdf":                     {"nlp", "transformers", "pretraining"},
    "Language Models are Few-Shot Learners.pdf": {"nlp", "transformers", "pretraining", "generative"},
    "LORA.pdf":                      {"nlp", "transformers", "fine-tuning"},
    "RAG.pdf":                       {"nlp", "transformers", "retrieval"},
    "ViT.pdf":                       {"vision", "transformers", "architecture"},
    "GANs.pdf":                      {"vision", "generative"},
    "VAE.pdf":                       {"vision", "generative"},
    "DDPM.pdf":                      {"vision", "generative", "diffusion"},
    "High-Resolution Image Synthesis with Latent Diffusion Models.pdf": {"vision", "generative", "diffusion"},
}

doc_ids = {}
for filename, tags in papers.items():
    path = DOCS_DIR / filename
    if not path.exists():
        print(f"Skipping (not found): {filename}")
        continue
    doc_id = kb.load_doc(str(path), tags=tags)
    doc_ids[filename] = doc_id
    print(f"Loaded: {filename} -> {doc_id}  (tags: {tags})")

print(f"\nTotal documents: {len(doc_ids)}")

## 4. List documents

### All documents

In [None]:
for doc in kb.list_docs():
    print(f"{doc.id}  {doc.name:<60}  tags={doc.tags}")

### Filter by tags

Tags use **OR semantics** — any document matching at least one of the given tags is returned.
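
The OR rule can be illustrated with plain Python sets. This is only a sketch of the semantics described above, not Athenaeum's implementation; `matches_any` is a hypothetical helper, and the tags are copied from three of the papers loaded in section 3.

```python
def matches_any(doc_tags: set[str], query_tags: set[str]) -> bool:
    """OR semantics: True when the document shares at least one tag with the query."""
    return bool(doc_tags & query_tags)


# Tags copied from the papers dictionary in section 3.
sample = {
    "LORA.pdf": {"nlp", "transformers", "fine-tuning"},
    "ViT.pdf": {"vision", "transformers", "architecture"},
    "BERT.pdf": {"nlp", "transformers", "pretraining"},
}

query = {"vision", "fine-tuning"}
selected = sorted(name for name, tags in sample.items() if matches_any(tags, query))
print(selected)  # ['LORA.pdf', 'ViT.pdf']
```

`BERT.pdf` is excluded because it carries neither `vision` nor `fine-tuning`; each of the other two matches on a single tag, which is enough under OR semantics.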

In [None]:
print("=== Diffusion papers ===")
for doc in kb.list_docs(tags={"diffusion"}):
    print(f"  {doc.name}")

print("\n=== NLP papers ===")
for doc in kb.list_docs(tags={"nlp"}):
    print(f"  {doc.name}")

print("\n=== Vision OR fine-tuning ===")
for doc in kb.list_docs(tags={"vision", "fine-tuning"}):
    print(f"  {doc.name}  tags={doc.tags}")

### List all tags in the knowledge base

In [None]:
print("All tags:", sorted(kb.list_tags()))

## 5. Search across documents

### BM25 (keyword) search

In [None]:
results = kb.search_docs("self-attention mechanism", top_k=5, strategy="bm25")

print("Query: 'self-attention mechanism' (BM25)\n")
for hit in results:
    print(f"  [{hit.score:.3f}] {hit.name}")
    if hit.snippet:
        print(f"          {hit.snippet[:120]}...")

### Vector (semantic) search

In [None]:
results = kb.search_docs("how to generate realistic images", top_k=5, strategy="vector")

print("Query: 'how to generate realistic images' (vector)\n")
for hit in results:
    print(f"  [{hit.score:.3f}] {hit.name}")
    if hit.snippet:
        print(f"          {hit.snippet[:120]}...")

### Hybrid search (default)

Combines BM25 and vector search with Reciprocal Rank Fusion.
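
For intuition, Reciprocal Rank Fusion scores each document by summing `1 / (k + rank)` over the ranked lists it appears in. The sketch below shows the general technique only; the constant `k = 60` is a common default and an assumption here, the two rankings are made up, and Athenaeum's actual fusion parameters may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)), with rank starting at 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_ranking = ["LORA.pdf", "BERT.pdf", "RAG.pdf"]     # hypothetical keyword ranking
vector_ranking = ["LORA.pdf", "RAG.pdf", "XLNet.pdf"]  # hypothetical semantic ranking

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
for name, score in fused:
    print(f"{score:.4f}  {name}")
```

Fused scores are small numbers on the order of `1 / k` and depend only on rank positions, so they are not comparable to raw BM25 or vector similarity scores.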

In [None]:
results = kb.search_docs("low-rank adaptation for large language models", top_k=5)

print("Query: 'low-rank adaptation for large language models' (hybrid)\n")
for hit in results:
    print(f"  [{hit.score:.4f}] {hit.name}")
    if hit.snippet:
        print(f"           {hit.snippet[:120]}...")

### Search with tag filtering

Restrict search to documents matching specific tags.

In [None]:
print("=== 'generative models' scoped to vision papers ===")
for hit in kb.search_docs("generative models", top_k=5, tags={"vision"}):
    print(f"  [{hit.score:.4f}] {hit.name}")

print("\n=== 'generative models' scoped to NLP papers ===")
for hit in kb.search_docs("generative models", top_k=5, tags={"nlp"}):
    print(f"  [{hit.score:.4f}] {hit.name}")

print("\n=== 'generative models' with no tag filter (all docs) ===")
for hit in kb.search_docs("generative models", top_k=5):
    print(f"  [{hit.score:.4f}] {hit.name}")

### Search by name

In [None]:
results = kb.search_docs("BERT", scope="names")

print("Name search: 'BERT'\n")
for hit in results:
    print(f"  {hit.name}  (score={hit.score})")

## 6. Search within a document

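Pick a specific paper and search for content inside it. This cell assumes the paper was found and loaded in section 3; if it was skipped, the `doc_ids` lookup raises `KeyError`.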

In [None]:
attention_id = doc_ids["Attention Is All You Need.pdf"]

results = kb.search_doc_contents(attention_id, "positional encoding", top_k=3)

print("Searching inside 'Attention Is All You Need' for 'positional encoding'\n")
for hit in results:
    print(f"  Lines {hit.line_range[0]}-{hit.line_range[1]}  (score={hit.score:.3f})")
    print(f"  {hit.text[:200]}...")
    print()

## 7. Read specific excerpts

Read exact line ranges from a document — useful for presenting context to an LLM.
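
The cell in this section reads "the first 30 lines" with `start_line=1, end_line=30`, which suggests 1-indexed, inclusive line ranges. The sketch below mimics that convention on a plain string. `read_lines` is a hypothetical stand-in, and clamping a range that runs past the end of the text is an assumption, not documented Athenaeum behavior.

```python
def read_lines(text: str, start_line: int, end_line: int) -> tuple[str, tuple[int, int], int]:
    """Return (excerpt, (first, last), total_lines) for a 1-indexed, inclusive range."""
    lines = text.splitlines()
    total = len(lines)
    if start_line < 1 or end_line < start_line:
        raise ValueError("expected 1 <= start_line <= end_line")
    last = min(end_line, total)  # assumption: clamp ranges that run past the end
    excerpt = "\n".join(lines[start_line - 1:last])
    return excerpt, (start_line, last), total


document = "\n".join(f"line {n}" for n in range(1, 6))
excerpt, line_range, total = read_lines(document, start_line=2, end_line=4)
print(f"Lines {line_range[0]}-{line_range[1]} of {total}")
print(excerpt)
```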

In [None]:
# Read the first 30 lines of a paper
gpt3_id = doc_ids["Language Models are Few-Shot Learners.pdf"]
excerpt = kb.read_doc(gpt3_id, start_line=1, end_line=30)

print("Document: Language Models are Few-Shot Learners")
print(f"Lines {excerpt.line_range[0]}-{excerpt.line_range[1]} of {excerpt.total_lines}\n")
print(excerpt.text)

### Use table of contents to navigate

In [None]:
# View the TOC of a paper
lora_id = doc_ids["LORA.pdf"]
lora_docs = [d for d in kb.list_docs() if d.id == lora_id]

print("Table of Contents — LORA\n")
print(lora_docs[0].table_of_contents)

## 8. Manage tags

Tags can be added or removed after loading.

In [None]:
# Add two tags
kb.tag_doc(attention_id, {"seminal", "google"})

# Check
doc = [d for d in kb.list_docs() if d.id == attention_id][0]
print(f"{doc.name} tags after adding: {doc.tags}")

# Remove a tag
kb.untag_doc(attention_id, {"google"})

doc = [d for d in kb.list_docs() if d.id == attention_id][0]
print(f"{doc.name} tags after removing 'google': {doc.tags}")

# List all tags across the knowledge base
print(f"\nAll tags: {sorted(kb.list_tags())}")

## 9. Cleanup

The knowledge base persists in `athenaeum-data/`. Delete it to start fresh:

```python
import shutil
shutil.rmtree("athenaeum-data")
```
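
`shutil.rmtree` raises `FileNotFoundError` when the directory does not exist, so rerunning the cleanup cell fails the second time. A guarded variant might look like the sketch below; `remove_dir` is a hypothetical helper, demonstrated on a throwaway directory rather than the real storage directory.

```python
import shutil
import tempfile
from pathlib import Path


def remove_dir(path: Path) -> bool:
    """Delete a directory tree if it exists; return True when something was removed."""
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False


# Throwaway directory so the example never touches real data.
demo = Path(tempfile.mkdtemp()) / "athenaeum-data-demo"
demo.mkdir()
print(remove_dir(demo))  # True: the directory existed and was removed
print(remove_dir(demo))  # False: nothing left to remove
```

For this notebook, the equivalent call would be `remove_dir(Path("athenaeum-data"))`.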