# Athenaeum Playground

This notebook walks through the core features of **Athenaeum** — a Python library for building searchable knowledge bases from documents.

## Installation

```bash
pip install -q "athenaeum-kb[mistral]"
```

In [16]:
!uv pip install '../dist/athenaeum_kb-0.2.0-py3-none-any.whl[mistral]' -q

## Setup

In [17]:
import os
import getpass

OPENAI_API_KEY = getpass.getpass('OPENAI_API_KEY: ')
MISTRAL_API_KEY = getpass.getpass('MISTRAL_API_KEY: ')

os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
os.environ['MISTRAL_API_KEY'] = MISTRAL_API_KEY
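
When re-running the notebook it can be convenient to prompt only for keys that are not already set. A minimal sketch, assuming a hypothetical helper named `require_env` (it is not part of Athenaeum):

```python
import getpass
import os


def require_env(name: str, prompt=getpass.getpass) -> str:
    """Return os.environ[name]; prompt once and store the value if it is unset."""
    value = os.environ.get(name)
    if not value:
        # Only ask interactively when the variable is missing or empty.
        value = prompt(f"{name}: ")
        os.environ[name] = value
    return value
```

Calling `require_env("OPENAI_API_KEY")` would then return the existing value without prompting if the variable was exported before the kernel started.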

In [18]:
from pathlib import Path

DOCS_DIR = Path("knowledge-base")
STORAGE_DIR = Path(".athenaeum")

In [19]:
from athenaeum import Athenaeum, AthenaeumConfig, get_ocr_provider
from langchain_openai import OpenAIEmbeddings

config = AthenaeumConfig(
    storage_dir=STORAGE_DIR,
    chunk_size=750,
    chunk_overlap=75,
)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small", api_key=OPENAI_API_KEY)
ocr = get_ocr_provider("mistral")

kb = Athenaeum(
    embeddings=embeddings,
    ocr_provider=ocr,
    config=config,
)  # The knowledge base is ready to use.
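
To build intuition for `chunk_size` and `chunk_overlap`, here is a rough sketch of a sliding-window splitter. It assumes both values are measured in characters and that windows advance by `chunk_size - chunk_overlap`; Athenaeum's actual splitter may count tokens or respect sentence boundaries, so treat `chunk_text` as a hypothetical illustration only.

```python
def chunk_text(text: str, chunk_size: int = 750, chunk_overlap: int = 75) -> list[str]:
    """Split text into fixed-size windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "x" * 2000
print([len(c) for c in chunk_text(sample)])  # [750, 750, 650]
```

A larger overlap makes it less likely that a sentence is cut off from its context at a chunk boundary, at the cost of storing and embedding more text.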

## Load documents

We load the PDF papers from `DOCS_DIR`, assigning tags by research area so we can filter later.

In [20]:
papers = {
    "Attention Is All You Need.pdf": {"nlp", "transformers", "architecture"},
    "BERT.pdf":                      {"nlp", "transformers", "pretraining"},
    "XLNet.pdf":                     {"nlp", "transformers", "pretraining"},
    "Language Models are Few-Shot Learners.pdf": {"nlp", "transformers", "pretraining", "generative"},
    "LORA.pdf":                      {"nlp", "transformers", "fine-tuning"},
    "RAG.pdf":                       {"nlp", "transformers", "retrieval"},
    "ViT.pdf":                       {"vision", "transformers", "architecture"},
    "GANs.pdf":                      {"vision", "generative"},
    "VAE.pdf":                       {"vision", "generative"},
    "DDPM.pdf":                      {"vision", "generative", "diffusion"},
    "High-Resolution Image Synthesis with Latent Diffusion Models.pdf": {"vision", "generative", "diffusion"},
}

for filename, tags in papers.items():
    path = DOCS_DIR / filename
    if not path.exists():
        print(f"Skipping (not found): {filename}")
        continue
    doc_id = kb.load_doc(str(path), tags=tags)
    print(f"Loaded: {filename} -> {doc_id}  (tags: {tags})")

Loaded: Attention Is All You Need.pdf -> ac66f9901c89  (tags: {'transformers', 'architecture', 'nlp'})
Loaded: BERT.pdf -> 4af474727685  (tags: {'transformers', 'pretraining', 'nlp'})
Loaded: XLNet.pdf -> 3c2293657e6c  (tags: {'transformers', 'pretraining', 'nlp'})
Loaded: Language Models are Few-Shot Learners.pdf -> 6525a2245e91  (tags: {'transformers', 'pretraining', 'generative', 'nlp'})
Loaded: LORA.pdf -> e6f9a937828a  (tags: {'transformers', 'fine-tuning', 'nlp'})
Loaded: RAG.pdf -> d610388cdccd  (tags: {'transformers', 'retrieval', 'nlp'})
Loaded: ViT.pdf -> c7a13d650516  (tags: {'vision', 'architecture', 'transformers'})
Loaded: GANs.pdf -> 4f67de6b50af  (tags: {'vision', 'generative'})
Loaded: VAE.pdf -> e502d1a1cb3d  (tags: {'vision', 'generative'})
Loaded: DDPM.pdf -> c60d566af74b  (tags: {'vision', 'generative', 'diffusion'})
Loaded: High-Resolution Image Synthesis with Latent Diffusion Models.pdf -> 09d7abb38688  (tags: {'vision', 'generative', 'diffusion'})

Total documen

## List documents

In [51]:
# All documents
docs = kb.list_docs()
for doc in docs:
    print(f"{doc.id}  {doc.name:<60}")

ac66f9901c89  Attention Is All You Need.pdf                               
4af474727685  BERT.pdf                                                    
3c2293657e6c  XLNet.pdf                                                   
6525a2245e91  Language Models are Few-Shot Learners.pdf                   
e6f9a937828a  LORA.pdf                                                    
d610388cdccd  RAG.pdf                                                     
c7a13d650516  ViT.pdf                                                     
4f67de6b50af  GANs.pdf                                                    
e502d1a1cb3d  VAE.pdf                                                     
c60d566af74b  DDPM.pdf                                                    
09d7abb38688  High-Resolution Image Synthesis with Latent Diffusion Models.pdf


In [36]:
# Filter by tags
# Tags use OR semantics — any document matching at least one of the given tags is returned.

print("Diffusion papers:")
for doc in kb.list_docs(tags={"diffusion"}):
    print(f"  - {doc.name}")

print()
print("NLP papers:")
for doc in kb.list_docs(tags={"nlp"}):
    print(f"  - {doc.name}")

print()
print("Vision or fine-tuning:")
for doc in kb.list_docs(tags={"vision", "fine-tuning"}):
    print(f"  - {doc.name}  tags={doc.tags}")

Diffusion papers:
  - DDPM.pdf
  - High-Resolution Image Synthesis with Latent Diffusion Models.pdf

NLP papers:
  - Attention Is All You Need.pdf
  - BERT.pdf
  - XLNet.pdf
  - Language Models are Few-Shot Learners.pdf
  - LORA.pdf
  - RAG.pdf

Vision or fine-tuning:
  - LORA.pdf  tags={'transformers', 'fine-tuning', 'nlp'}
  - ViT.pdf  tags={'vision', 'architecture', 'transformers'}
  - GANs.pdf  tags={'vision', 'generative'}
  - VAE.pdf  tags={'vision', 'generative'}
  - DDPM.pdf  tags={'vision', 'generative', 'diffusion'}
  - High-Resolution Image Synthesis with Latent Diffusion Models.pdf  tags={'vision', 'generative', 'diffusion'}
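
The OR semantics seen above amount to a non-empty set intersection. The sketch below contrasts that with AND semantics using plain Python sets; `matches_any` and `matches_all` are hypothetical helper names, and the catalog is a hand-copied subset of the tags assigned earlier.

```python
def matches_any(doc_tags: set[str], wanted: set[str]) -> bool:
    """OR semantics: the document carries at least one of the wanted tags."""
    return bool(doc_tags & wanted)


def matches_all(doc_tags: set[str], wanted: set[str]) -> bool:
    """AND semantics: the document carries every wanted tag."""
    return wanted <= doc_tags


catalog = {
    "LORA.pdf": {"nlp", "transformers", "fine-tuning"},
    "ViT.pdf": {"vision", "transformers", "architecture"},
    "BERT.pdf": {"nlp", "transformers", "pretraining"},
}
wanted = {"vision", "fine-tuning"}

print([name for name, tags in catalog.items() if matches_any(tags, wanted)])
# ['LORA.pdf', 'ViT.pdf']
print([name for name, tags in catalog.items() if matches_all(tags, wanted)])
# []
```

If AND behaviour is needed, one option is to post-filter the result of `kb.list_docs(tags=wanted)` with a check like `wanted <= set(doc.tags)`.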


In [39]:
# List all tags in the knowledge base

tags = kb.list_tags()

print("Tags:")
for tag in sorted(tags):
    print(f" - {tag}")

Tags:
 - architecture
 - diffusion
 - fine-tuning
 - generative
 - nlp
 - pretraining
 - retrieval
 - seminal
 - transformers
 - vision


## Search across documents

In [44]:
# BM25 (keyword) search
query = "self-attention mechanism"
results = kb.search_docs(query, top_k=3, strategy="bm25")

for hit in results:
    print(f"[{hit.score:.3f}] {hit.name}")
    if hit.snippet:
        print(f"{hit.snippet[:120]}...")
        print()

[1.972] Attention Is All You Need.pdf
# Attention Is All You Need

Ashish Vaswani*

Google Brain

avaswani@google.com

Noam Shazeer*

Google Brain

noam@googl...

[1.761] XLNet.pdf
# XLNet: Generalized Autoregressive Pretraining for Language Understanding

Zhilin Yang^{∗1}, Zihang Dai^{∗12}, Yiming Y...

[1.583] BERT.pdf
# BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin Ming-Wei Chang Kenton L...
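
For readers unfamiliar with BM25, the following is a compact sketch of Okapi BM25 scoring over whitespace-tokenized text. The parameters `k1=1.5` and `b=0.75` are common defaults, not values confirmed for Athenaeum, and the library's tokenizer and index are certainly more elaborate than this hypothetical `bm25_scores` function.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query (assumes docs is non-empty)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: long documents need more matches to score high.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


toy_docs = [
    "self attention replaces recurrence",
    "diffusion models generate images",
    "attention is all you need",
]
toy_scores = bm25_scores("self attention", toy_docs)
for score, doc in sorted(zip(toy_scores, toy_docs), reverse=True):
    print(f"[{score:.3f}] {doc}")
```

Because BM25 only counts matching terms, a document that paraphrases the query without sharing its words scores zero, which is the gap vector search is meant to fill.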



In [45]:
# Vector (semantic) search
query = "how to generate realistic images?"
results = kb.search_docs(query, top_k=3, strategy="vector")

for hit in results:
    print(f"[{hit.score:.3f}] {hit.name}")
    if hit.snippet:
        print(f"{hit.snippet[:120]}...")

[0.103] High-Resolution Image Synthesis with Latent Diffusion Models.pdf
# High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach $^{1*}$  Andreas Blattmann $^{1*}$  Domini...
[0.045] GANs.pdf
# Generative Adversarial Nets

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil ...
[-0.000] DDPM.pdf
# Denoising Diffusion Probabilistic Models

Jonathan Ho

UC Berkeley

jonathanho@berkeley.edu

Ajay Jain

UC Berkeley

a...
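
Vector search ranks documents by how close their embeddings are to the query embedding. The sketch below uses cosine similarity on made-up three-dimensional vectors; real embeddings have far more dimensions, and the function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vec, doc_vecs, top_k=3):
    """Return up to top_k (score, name) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    return sorted(scored, reverse=True)[:top_k]


# Toy "embeddings" invented for illustration only.
toy_vectors = {
    "image synthesis": [0.9, 0.1, 0.0],
    "language modelling": [0.0, 0.2, 0.9],
}
for score, name in rank_by_similarity([1.0, 0.0, 0.0], toy_vectors):
    print(f"[{score:.3f}] {name}")
```

The scores Athenaeum prints come from the underlying vector store and may be rescaled, so they need not equal raw cosine similarity.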




In [46]:
# Hybrid search (default)
# Combines BM25 and vector search with Reciprocal Rank Fusion.
query = "low-rank adaptation for large language models"
results = kb.search_docs(query, top_k=5)

for hit in results:
    print(f"[{hit.score:.4f}] {hit.name}")
    if hit.snippet:
        print(f"{hit.snippet[:120]}...")

[0.0328] LORA.pdf
# LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS

Edward Hu* Yelong Shen* Phillip Wallis Zeyuan Allen-Zhu

Yuanzhi L...
[0.0320] Language Models are Few-Shot Learners.pdf
# Language Models are Few-Shot Learners

|  Tom B. Brown* |   | Benjamin Mann* |   | Nick Ryder* |   | Melanie Subbiah* ...
[0.0315] BERT.pdf
# BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin Ming-Wei Chang Kenton L...
[0.0304] XLNet.pdf
# XLNet: Generalized Autoregressive Pretraining for Language Understanding

Zhilin Yang^{∗1}, Zihang Dai^{∗12}, Yiming Y...
[0.0299] Attention Is All You Need.pdf
# Attention Is All You Need

Ashish Vaswani*

Google Brain

avaswani@google.com

Noam Shazeer*

Google Brain

noam@googl...
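
Reciprocal Rank Fusion merges several ranked lists by summing `1 / (k + rank)` for each document across the lists. The sketch below assumes the conventional constant `k = 60`; the source does not state which value Athenaeum uses, although the top score of 0.0328 above is what `2 / 61` rounds to, which is consistent with a document ranked first by both retrievers under `k = 60`. The two input rankings are invented toy data.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_ranking = ["LORA.pdf", "BERT.pdf", "XLNet.pdf"]    # toy keyword ranking
vector_ranking = ["LORA.pdf", "RAG.pdf", "BERT.pdf"]    # toy semantic ranking

for name, score in reciprocal_rank_fusion([bm25_ranking, vector_ranking]):
    print(f"[{score:.4f}] {name}")
# [0.0328] LORA.pdf
# [0.0320] BERT.pdf
# [0.0161] RAG.pdf
# [0.0159] XLNet.pdf
```

RRF uses only ranks, never raw scores, so BM25 and vector scores do not need to be normalized onto a common scale before fusing.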


In [47]:
# Search with tag filtering
# Restrict search to documents matching specific tags.

query = "generative models"
for hit in kb.search_docs(query, top_k=3, tags={"nlp"}):
    print(f"[{hit.score:.4f}] {hit.name}")

[0.0328] RAG.pdf
[0.0320] Attention Is All You Need.pdf
[0.0315] Language Models are Few-Shot Learners.pdf




In [50]:
# Search by name
query = "BERT"
results = kb.search_docs(query, scope="names", strategy="bm25")

for hit in results:
    print(f"[{hit.score:.4f}] {hit.name}")

[0.5000] BERT.pdf


## Search within a document

In [64]:
# Pick a specific paper and search for content inside it.

doc_ids = {doc.name: doc.id for doc in docs}
attention_id = doc_ids["Attention Is All You Need.pdf"]

results = kb.search_doc_contents(attention_id, "positional encoding", top_k=3)
for hit in results:
    print(f"[{hit.score:.3f}] {hit.line_range[0]}-{hit.line_range[1]}")
    print(f"{hit.text[:100]}...")
    print()

[0.033] 1-385
# Attention Is All You Need

Ashish Vaswani*

Google Brain

avaswani@google.com

Noam Shazeer*

Goog...





## Read specific excerpts

Read exact line ranges from a document — useful for presenting context to an LLM.

In [65]:
# Read the first 30 lines of a paper
gpt3_id = doc_ids["Language Models are Few-Shot Learners.pdf"]
excerpt = kb.read_doc(gpt3_id, start_line=1, end_line=30)

print(f"Lines {excerpt.line_range[0]}-{excerpt.line_range[1]} of {excerpt.total_lines}\n")
print(excerpt.text)

Lines 1-30 of 1739

# Language Models are Few-Shot Learners

|  Tom B. Brown* |   | Benjamin Mann* |   | Nick Ryder* |   | Melanie Subbiah*  |   |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  Jared Kaplan† | Prafulla Dhariwal | Arvind Neelakantan | Pranav Shyam | Girish Sastry |  |  |   |
|  Amanda Askell | Sandhini Agarwal | Ariel Herbert-Voss | Gretchen Krueger | Tom Henighan |  |  |   |
|  Rewon Child | Aditya Ramesh | Daniel M. Ziegler | Jeffrey Wu | Clemens Winter |  |  |   |
|  Christopher Hesse | Mark Chen | Eric Sigler | Mateusz Litwin | Scott Gray |  |  |   |
|  Benjamin Chess |   | Jack Clark |   | Christopher Berner |   |  |   |
|  Sam McCandlish |   | Alec Radford | Ilya Sutskever | Dario Amodei |   |  |   |

OpenAI

# Abstract

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires t
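
The output above suggests that line numbers are 1-indexed and that `end_line` is inclusive. A rough sketch of that convention on a plain string follows; `LineExcerpt` and `read_lines` are hypothetical names, and clamping out-of-range values is an assumption rather than documented Athenaeum behaviour.

```python
from dataclasses import dataclass


@dataclass
class LineExcerpt:
    text: str
    line_range: tuple[int, int]
    total_lines: int


def read_lines(document: str, start_line: int, end_line: int) -> LineExcerpt:
    """Return lines start_line..end_line (1-indexed, inclusive), clamped to the document."""
    lines = document.splitlines()
    total = len(lines)
    start = max(1, start_line)
    end = min(total, end_line)
    # Convert the 1-indexed inclusive range to a 0-indexed half-open slice.
    text = "\n".join(lines[start - 1:end])
    return LineExcerpt(text=text, line_range=(start, end), total_lines=total)


excerpt_demo = read_lines("alpha\nbeta\ngamma\ndelta", start_line=2, end_line=3)
print(excerpt_demo.line_range, excerpt_demo.total_lines)  # (2, 3) 4
print(excerpt_demo.text)
```

Returning the effective range and the total line count lets a caller page through a long document in fixed-size windows.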

### Use table of contents to navigate

In [31]:
lora_id = doc_ids["LORA.pdf"]
lora_docs = [d for d in kb.list_docs() if d.id == lora_id]

print(lora_docs[0].table_of_contents)

Table of Contents — LORA

- LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS [lines 1-16]
- ABSTRACT [lines 17-20]
- 1 INTRODUCTION [lines 21-86]
      - Terminologies and Conventions [lines 41-44]
  - 2 Problem Statement [lines 45-62]
  - 3 Aren’t Existing Solutions Good Enough? [lines 63-86]
      - Adapter Layers Introduce Inference Latency [lines 67-72]
      - Directly Optimizing the Prompt is Hard [lines 73-86]
- 4 OUR METHOD [lines 87-90]
- 4.1 LOW-RANK-PARAMETRIZED UPDATE MATRICES [lines 91-173]
    - 4.2 Applying LoRA to Transformer [lines 107-116]
      - Practical Benefits and Limitations. [lines 111-116]
  - 5 Empirical Experiments [lines 117-173]
    - 5.1 Baselines [lines 121-173]
- 5.2 ROBERTA BASE/LARGE [lines 174-177]
- 5.3 DEBERTA XXL [lines 178-181]
- 5.4 GPT-2 MEDIUM/LARGE [lines 182-199]
- 5.5 SCALING UP TO GPT-3 175B [lines 200-210]
- 6 RELATED WORKS [lines 211-238]
      - Prompt Engineering and Fine-Tuning. [lines 217-220]
      - Parameter-Efficient Adaptatio

In [66]:
vit_id = doc_ids["ViT.pdf"]
vit_docs = [d for d in kb.list_docs() if d.id == vit_id]

print(vit_docs[0].table_of_contents)

Table of Contents — ViT

- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [lines 1-48]
          - Abstract [lines 10-13]
  - 1 Introduction [lines 14-27]
  - 2 Related Work [lines 28-48]
- 3 METHOD [lines 49-52]
- 3.1 VISION TRANSFORMER (VIT) [lines 53-109]
      - Inductive bias. [lines 70-73]
      - Hybrid Architecture. [lines 74-77]
    - 3.2 Fine-tuning and Higher Resolution [lines 78-81]
  - 4 Experiments [lines 82-109]
    - 4.1 Setup [lines 86-109]
- 4.2 COMPARISON TO STATE OF THE ART [lines 110-137]
- 4.3 PRE-TRAINING DATA REQUIREMENTS [lines 138-164]
- 4.4 SCALING STUDY [lines 165-170]
- 4.5 INSPECTING VISION TRANSFORMER [lines 171-183]
- 4.6 SELF-SUPERVISION [lines 184-199]
- 5 CONCLUSION [lines 200-205]
- ACKNOWLEDGEMENTS [lines 206-209]
- REFERENCES [lines 210-328]
- APPENDIX [lines 329-330]
- A MULTIHEAD SELF-ATTENTION [lines 331-352]
- B EXPERIMENT DETAILS [lines 353-354]
- B.1 TRAINING [lines 355-358]
- B.1.1 FINE-TUNING [lines 359-381]
- B
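
Since each entry carries a `[lines M-N]` suffix, the table of contents can be turned into line ranges for `kb.read_doc`. The sketch below assumes `table_of_contents` is a plain string in the format printed above; `parse_toc` and `find_section` are hypothetical helpers, not part of Athenaeum.

```python
import re

# Matches entries such as "    - 4.2 Applying LoRA to Transformer [lines 107-116]".
TOC_ENTRY = re.compile(r"^\s*-\s+(?P<title>.+?)\s+\[lines (?P<start>\d+)-(?P<end>\d+)\]\s*$")


def parse_toc(toc_text: str) -> list[tuple[str, int, int]]:
    """Extract (title, start_line, end_line) from every well-formed entry."""
    entries = []
    for line in toc_text.splitlines():
        match = TOC_ENTRY.match(line)
        if match:  # headers, blank lines, and truncated entries are skipped
            entries.append((match["title"], int(match["start"]), int(match["end"])))
    return entries


def find_section(entries, keyword: str):
    """Return the first entry whose title contains keyword (case-insensitive), else None."""
    for title, start, end in entries:
        if keyword.lower() in title.lower():
            return title, start, end
    return None


sample_toc = """\
- ABSTRACT [lines 17-20]
- 1 INTRODUCTION [lines 21-86]
- 4 OUR METHOD [lines 87-90]
    - 4.2 Applying LoRA to Transformer [lines 107-116]
"""
print(find_section(parse_toc(sample_toc), "abstract"))  # ('ABSTRACT', 17, 20)
```

The resulting range can be passed on, for example `kb.read_doc(lora_id, start_line=17, end_line=20)`, to fetch just that section.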

## Manage tags

Tags can be added or removed after loading.

In [67]:
# Add tags (several can be passed at once)
kb.tag_doc(attention_id, {"seminal", "google"})

# Remove a tag; "seminal" stays on the document
kb.untag_doc(attention_id, {"google"})

## Cleanup

The knowledge base persists in `STORAGE_DIR`. Delete it to start fresh:

```python
import shutil
shutil.rmtree(STORAGE_DIR)
```

---