# Athenaeum Playground

This notebook walks through the core features of **Athenaeum** — a Python library for building searchable knowledge bases from documents.

## Installation

In [1]:
# Use one of the following:

#!pip install -q athenaeum-kb[mistral]
!uv pip install '../dist/athenaeum_kb-0.2.1-py3-none-any.whl[mistral]' -q

Audited 1 package in 11ms


## Setup

In [3]:
import getpass

OPENAI_API_KEY = getpass.getpass('OPENAI_API_KEY: ')
MISTRAL_API_KEY = getpass.getpass('MISTRAL_API_KEY: ')

OPENAI_API_KEY:  ········
MISTRAL_API_KEY:  ········


In [4]:
from pathlib import Path

DOCS_DIR = Path("knowledge-base")
STORAGE_DIR = Path(".athenaeum")

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language
from athenaeum import Athenaeum, AthenaeumConfig, get_ocr_provider
from langchain_openai import OpenAIEmbeddings
import tiktoken

config = AthenaeumConfig(
    storage_dir=STORAGE_DIR,
    rrf_k=70,                   # RRF constant for hybrid search
    default_strategy="hybrid",  # Default search strategy
)

# Custom token-based splitter
enc = tiktoken.get_encoding("cl100k_base")
token_splitter = RecursiveCharacterTextSplitter.from_language(
    Language.MARKDOWN,
    chunk_size=512,
    chunk_overlap=64,
    length_function=lambda text: len(enc.encode(text)),
)

# Embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", api_key=OPENAI_API_KEY)

# OCR model
ocr = get_ocr_provider("mistral", api_key=MISTRAL_API_KEY)

kb = Athenaeum(
    config=config,
    ocr_provider=ocr,
    embeddings=embeddings,
    text_splitter=token_splitter
)
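
Because `length_function` counts `cl100k_base` tokens, `chunk_size=512` and `chunk_overlap=64` above are measured in tokens, not characters. The sketch below illustrates only the size and overlap arithmetic with a fixed sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that class splits on separators recursively), and both the whitespace tokenizer and the `sliding_chunks` helper are hypothetical stand-ins.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Hypothetical stand-in for a real tokenizer: split on whitespace.
tokens = "a b c d e f g h i j".split()
chunks = sliding_chunks(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so a larger overlap means more chunks and more repeated context per chunk.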

## Load documents

In [10]:
# Load the PDF papers from `DOCS_DIR`, assigning tags by research area so we can filter later

papers = {
    "Attention Is All You Need.pdf": {"nlp", "transformers", "architecture"},
    "BERT.pdf":                      {"nlp", "transformers", "pretraining"},
    "XLNet.pdf":                     {"nlp", "transformers", "pretraining"},
    "Language Models are Few-Shot Learners.pdf": {"nlp", "transformers", "pretraining", "generative"},
    "LORA.pdf":                      {"nlp", "transformers", "fine-tuning"},
    "RAG.pdf":                       {"nlp", "transformers", "retrieval"},
    "ViT.pdf":                       {"vision", "transformers", "architecture"},
    "GANs.pdf":                      {"vision", "generative"},
    "VAE.pdf":                       {"vision", "generative"},
    "DDPM.pdf":                      {"vision", "generative", "diffusion"},
    "High-Resolution Image Synthesis with Latent Diffusion Models.pdf": {"vision", "generative", "diffusion"},
}

for filename, tags in papers.items():
    path = DOCS_DIR / filename
    if not path.exists():
        print(f"Skipping (not found): {filename}")
        continue
    doc_id = kb.load_doc(str(path), tags=tags)
    print(f"[INFO] Loaded {filename} with ID {doc_id}")

[INFO] Loaded Attention Is All You Need.pdf with ID a37ca72e7343
[INFO] Loaded BERT.pdf with ID a93623c887f3
[INFO] Loaded XLNet.pdf with ID 398c7fcbc1bb
[INFO] Loaded Language Models are Few-Shot Learners.pdf with ID 558dfc0f6033
[INFO] Loaded LORA.pdf with ID 56397b5e8269
[INFO] Loaded RAG.pdf with ID 30ea5715ee9d
[INFO] Loaded ViT.pdf with ID f83a251d3a8d
[INFO] Loaded GANs.pdf with ID 8053b5f17066
[INFO] Loaded VAE.pdf with ID 77f686645c39
[INFO] Loaded DDPM.pdf with ID 7e2a0c030dc2
[INFO] Loaded High-Resolution Image Synthesis with Latent Diffusion Models.pdf with ID f3ea54df7f8d


## List documents

In [11]:
# List all documents in kb

docs = kb.list_docs()
for doc in docs:
    print(f"{doc.id}  {doc.name:<60}")

a37ca72e7343  Attention Is All You Need.pdf                               
a93623c887f3  BERT.pdf                                                    
398c7fcbc1bb  XLNet.pdf                                                   
558dfc0f6033  Language Models are Few-Shot Learners.pdf                   
56397b5e8269  LORA.pdf                                                    
30ea5715ee9d  RAG.pdf                                                     
f83a251d3a8d  ViT.pdf                                                     
8053b5f17066  GANs.pdf                                                    
77f686645c39  VAE.pdf                                                     
7e2a0c030dc2  DDPM.pdf                                                    
f3ea54df7f8d  High-Resolution Image Synthesis with Latent Diffusion Models.pdf


In [12]:
# Filter by tags
# Tags use OR semantics — any document matching at least one of the given tags is returned.

print("Diffusion papers:")
for doc in kb.list_docs(tags={"diffusion"}):
    print(f"  - {doc.name}")

print()
print("NLP papers:")
for doc in kb.list_docs(tags={"nlp"}):
    print(f"  - {doc.name}")

print()
print("Vision + fine-tuning:")
for doc in kb.list_docs(tags={"vision", "fine-tuning"}):
    print(f"  - {doc.name}  tags={doc.tags}")

Diffusion papers:
  - DDPM.pdf
  - High-Resolution Image Synthesis with Latent Diffusion Models.pdf

NLP papers:
  - Attention Is All You Need.pdf
  - BERT.pdf
  - XLNet.pdf
  - Language Models are Few-Shot Learners.pdf
  - LORA.pdf
  - RAG.pdf

Vision + fine-tuning:
  - LORA.pdf  tags={'transformers', 'fine-tuning', 'nlp'}
  - ViT.pdf  tags={'architecture', 'transformers', 'vision'}
  - GANs.pdf  tags={'generative', 'vision'}
  - VAE.pdf  tags={'generative', 'vision'}
  - DDPM.pdf  tags={'generative', 'vision', 'diffusion'}
  - High-Resolution Image Synthesis with Latent Diffusion Models.pdf  tags={'generative', 'vision', 'diffusion'}
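
The OR semantics seen in the last listing (LORA.pdf appears although it has no `vision` tag) can be modelled with plain set intersection. This is a minimal sketch of the behaviour, not Athenaeum's implementation; `filter_by_tags` and the `doc_tags` dictionary are hypothetical names, and the tags are a subset of those assigned earlier in the notebook.

```python
def filter_by_tags(doc_tags, wanted):
    """Return names whose tag set shares at least one tag with `wanted` (OR semantics)."""
    return [name for name, tags in doc_tags.items() if tags & wanted]


# A subset of the papers and tags loaded earlier in this notebook.
doc_tags = {
    "BERT.pdf": {"nlp", "transformers", "pretraining"},
    "LORA.pdf": {"nlp", "transformers", "fine-tuning"},
    "ViT.pdf": {"vision", "transformers", "architecture"},
    "DDPM.pdf": {"vision", "generative", "diffusion"},
}

either = filter_by_tags(doc_tags, {"vision", "fine-tuning"})

# AND semantics, for contrast: every requested tag must be present.
both = [name for name, tags in doc_tags.items() if {"vision", "transformers"} <= tags]

print(either)
print(both)
```

If AND filtering is needed, one option is to post-filter the returned documents on `doc.tags` with a subset check like the one used for `both`.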


In [13]:
# List all tags in the kb

tags = kb.list_tags()

print("Tags:")
for tag in sorted(tags):
    print(f" - {tag}")

Tags:
 - architecture
 - diffusion
 - fine-tuning
 - generative
 - nlp
 - pretraining
 - retrieval
 - transformers
 - vision


## Search across documents

In [14]:
# BM25 (keyword) search
query = "self-attention mechanism"
results = kb.search_docs(query, top_k=3, strategy="bm25")

for hit in results:
    print(f"[{hit.score:.3f}] {hit.name}")
    if hit.snippet:
        print(f"{hit.snippet[:120]}...")
        print()

[8.708] Attention Is All You Need.pdf
## 2 Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU *[16]*...

[8.615] BERT.pdf
# 3.2 Fine-tuning BERT

Fine-tuning is straightforward since the self-attention mechanism in the Transformer allows BERT...

[5.722] High-Resolution Image Synthesis with Latent Diffusion Models.pdf
Table 15. Hyperparameters for the conditional LDMs from Sec. 4. All models trained on a single NVIDIA A100 except for th...
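
For intuition about what a BM25 score measures, here is a minimal sketch of the standard Okapi BM25 formula in pure Python. It is not Athenaeum's implementation: the whitespace tokenizer, the `k1` and `b` defaults, the smoothed IDF variant, and the three toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    Tokenization is a lowercase whitespace split, chosen for brevity.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "self-attention relates positions of a single sequence",
    "generative adversarial networks train two models",
    "fine-tuning uses the self-attention mechanism",
]
scores = bm25_scores("self-attention mechanism", docs)
for doc, score in zip(docs, scores):
    print(f"[{score:.3f}] {doc}")
```

BM25 scores are unbounded and depend on the corpus, which is why they are on a different scale from the vector and hybrid scores elsewhere in this notebook.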



In [15]:
# Vector (semantic) search
query = "how to generate realistic images?"
results = kb.search_docs(query, top_k=3, strategy="vector")

for hit in results:
    print(f"[{hit.score:.3f}] {hit.name}")
    if hit.snippet:
        print(f"{hit.snippet[:120]}...")
        print()

[0.367] High-Resolution Image Synthesis with Latent Diffusion Models.pdf
# 1. Introduction

Image synthesis is one of the computer vision fields with the most spectacular recent development, bu...

[0.239] DDPM.pdf
# 4.4 Interpolation

We can interpolate source images  $\mathbf{x}_0, \mathbf{x}_0' \sim q(\mathbf{x}_0)$  in latent spa...
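
Vector search ranks chunks by how close their embeddings are to the query embedding. A common closeness measure is cosine similarity; whether Athenaeum uses exactly this metric is an assumption here. In the sketch below, toy 3-dimensional vectors stand in for real `text-embedding-3-small` embeddings, and all names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors; real embeddings have hundreds or thousands of dimensions.
query_vec = [1.0, 0.0, 1.0]
chunk_vecs = {
    "chunk about images": [0.9, 0.1, 0.8],
    "chunk about text": [0.0, 1.0, 0.1],
}

ranked = sorted(
    chunk_vecs,
    key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]),
    reverse=True,
)
print(ranked)
```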



In [16]:
# Hybrid search (default)
# Combines BM25 and vector search with Reciprocal Rank Fusion.
query = "low-rank adaptation for large language models"
results = kb.search_docs(query, top_k=3)

for hit in results:
    print(f"[{hit.score:.4f}] {hit.name}")
    if hit.snippet:
        print(f"{hit.snippet[:120]}...")
        print()

[0.0280] LORA.pdf
# LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS

Edward Hu* Yelong Shen* Phillip Wallis Zeyuan Allen-Zhu

Yuanzhi L...
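
Reciprocal Rank Fusion combines ranked lists by position rather than by raw score: each list contributes `1 / (k + rank)` for every document it contains. The sketch below uses the standard formulation with 1-based ranks and `k=70` to mirror `rrf_k` in the config above. Whether Athenaeum computes it exactly this way is an assumption, and the two input rankings are made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists; each list adds 1 / (k + rank) to a document's score (rank starts at 1)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from the two retrievers.
bm25_ranking = ["LORA.pdf", "BERT.pdf", "RAG.pdf"]
vector_ranking = ["LORA.pdf", "RAG.pdf"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking], k=70)
for doc, score in fused:
    print(f"[{score:.4f}] {doc}")
```

With `k=70`, a document ranked first by one retriever and second by the other scores 1/71 + 1/72 ≈ 0.0280, which is consistent with the magnitude of the hybrid scores printed in this notebook. A larger `k` flattens the difference between adjacent ranks.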



In [17]:
# Search with tag filtering
# Restrict search to documents matching specific tags.
query = "generative models"
results = kb.search_docs(query, top_k=3, tags={"nlp"})

for hit in results:
    print(f"[{hit.score:.4f}] {hit.name}")
    if hit.snippet:
        print(f"{hit.snippet[:120]}...")
        print()

[0.0280] Language Models are Few-Shot Learners.pdf
# 3.9.4 News Article Generation

Previous work on generative language models qualitatively tested their ability to gener...

[0.0267] RAG.pdf
- [47] Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller....

[0.0137] LORA.pdf
- Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving Language Understanding by Generative Pre-Trai...



In [18]:
# Search by name
query = "BERT"
results = kb.search_docs(query, scope="names", strategy="bm25")

for hit in results:
    print(f"[{hit.score:.4f}] {hit.name}")
    if hit.snippet:
        print(f"{hit.snippet[:120]}...")
        print()

[0.5000] BERT.pdf


## Search within a document

In [19]:
# Pick a specific paper and search for content inside it

doc_ids = {doc.name: doc.id for doc in docs}
attention_id = doc_ids["BERT.pdf"]
query = "What is SQuAD?"

results = kb.search_doc_contents(attention_id, query, top_k=3)
for hit in results:
    print(f"[{hit.score:.3f}] {hit.line_range[0]}-{hit.line_range[1]}")
    print(f"{hit.text[:100]}...")
    print()

[0.014] 408-429
#### Masked LM and the Masking Procedure

Assuming the unlabeled sentence is my dog is hairy, and du...

[0.014] 139-147
# 4.2 SQuAD v1.1

The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowd...

[0.014] 475-491
### A.5 Illustrations of Fine-tuning on Different Tasks

The illustration of fine-tuning BERT on dif...



In [20]:
# Run a second query against the same paper (BERT.pdf)

doc_ids = {doc.name: doc.id for doc in docs}
attention_id = doc_ids["BERT.pdf"]
query = "Describe BERT’s model architecture"

results = kb.search_doc_contents(attention_id, query, top_k=3)
for hit in results:
    print(f"[{hit.score:.3f}] {hit.line_range[0]}-{hit.line_range[1]}")
    print(f"{hit.text[:100]}...")
    print()

[0.028] 58-70
# 3 BERT

We introduce BERT and its detailed implementation in this section. There are two steps in ...

[0.028] 1-11
# BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin Min...

[0.014] 103-115
# 3.2 Fine-tuning BERT

Fine-tuning is straightforward since the self-attention mechanism in the Tra...
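
Each hit carries a `line_range`, so a natural follow-up is to read a few lines of context around it. The helper below is a hypothetical sketch that only computes the widened range; it assumes line numbers are 1-based and inclusive, as the printed ranges suggest, and the `total_lines=600` value is made up.

```python
def context_window(line_range, total_lines, padding=5):
    """Widen a 1-based inclusive (start, end) range by `padding` lines, clamped to the document."""
    start, end = line_range
    return max(1, start - padding), min(total_lines, end + padding)


# Range taken from the SQuAD hit above; the document length is a placeholder.
window = context_window((139, 147), total_lines=600, padding=5)
print(window)
```

The resulting pair can be passed to `kb.read_doc(doc_id, start_line=..., end_line=...)`, which this notebook uses in the excerpt-reading section; the `total_lines` attribute of the returned excerpt gives the real document length.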



## Read specific excerpts

In [23]:
# Read the first 20 lines of a paper

gpt3_id = doc_ids["Language Models are Few-Shot Learners.pdf"]
excerpt = kb.read_doc(gpt3_id, start_line=1, end_line=20)

print(f"Lines {excerpt.line_range[0]}-{excerpt.line_range[1]} of {excerpt.total_lines}\n")
print(excerpt.text)

Lines 1-20 of 1791

# Language Models are Few-Shot Learners

|  Tom B. Brown* |   | Benjamin Mann* |   | Nick Ryder* |   | Melanie Subbiah*  |   |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  Jared Kaplan† | Prafulla Dhariwal | Arvind Neelakantan | Pranav Shyam | Girish Sastry |  |  |   |
|  Amanda Askell | Sandhini Agarwal | Ariel Herbert-Voss | Gretchen Krueger | Tom Henighan |  |  |   |
|  Rewon Child | Aditya Ramesh | Daniel M. Ziegler | Jeffrey Wu | Clemens Winter |  |  |   |
|  Christopher Hesse | Mark Chen | Eric Sigler | Mateusz Litwin | Scott Gray |  |  |   |
|  Benjamin Chess |   | Jack Clark |   | Christopher Berner |   |  |   |
|  Sam McCandlish |   | Alec Radford | Ilya Sutskever | Dario Amodei |   |  |   |

OpenAI

# Abstract

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires t

## Use table of contents to navigate

In [24]:
lora_id = doc_ids["LORA.pdf"]
lora_docs = [d for d in kb.list_docs() if d.id == lora_id]

print(lora_docs[0].table_of_contents)

- LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS [lines 1-16]
- ABSTRACT [lines 17-20]
- 1 INTRODUCTION [lines 21-86]
        - Terminologies and Conventions [lines 41-44]
  - 2 Problem Statement [lines 45-62]
  - 3 Aren’t Existing Solutions Good Enough? [lines 63-86]
      - Adapter Layers Introduce Inference Latency [lines 67-72]
      - Directly Optimizing the Prompt is Hard [lines 73-86]
- 4 OUR METHOD [lines 87-90]
- 4.1 LOW-RANK-PARAMETRIZED UPDATE MATRICES [lines 91-173]
    - 4.2 Applying LoRA to Transformer [lines 107-116]
      - Practical Benefits and Limitations. [lines 111-116]
  - 5 Empirical Experiments [lines 117-173]
    - 5.1 Baselines [lines 121-173]
- 5.2 ROBERTA BASE/LARGE [lines 174-177]
- 5.3 DEBERTA XXL [lines 178-181]
- 5.4 GPT-2 MEDIUM/LARGE [lines 182-199]
- 5.5 SCALING UP TO GPT-3 175B [lines 200-210]
- 6 RELATED WORKS [lines 211-238]
      - Prompt Engineering and Fine-Tuning. [lines 217-220]
      - Parameter-Efficient Adaptation. [lines 221-224]
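
Since every entry ends in a `[lines M-N]` suffix, the table of contents can be parsed to locate a section by title. The sketch below assumes `table_of_contents` is a plain string in exactly the format printed above; `parse_toc` and the regular expression are hypothetical helpers, not part of Athenaeum, and the sample text is a shortened copy of the LoRA output.

```python
import re

# Matches lines such as "  - 2 Problem Statement [lines 45-62]".
TOC_LINE = re.compile(r"^\s*- (?P<title>.+) \[lines (?P<start>\d+)-(?P<end>\d+)\]\s*$")


def parse_toc(toc_text):
    """Turn a printed table of contents into a list of {title, start, end} dicts."""
    entries = []
    for line in toc_text.splitlines():
        match = TOC_LINE.match(line)
        if match:
            entries.append(
                {
                    "title": match["title"],
                    "start": int(match["start"]),
                    "end": int(match["end"]),
                }
            )
    return entries


toc_text = """\
- ABSTRACT [lines 17-20]
- 1 INTRODUCTION [lines 21-86]
  - 2 Problem Statement [lines 45-62]
- 4 OUR METHOD [lines 87-90]
"""

entries = parse_toc(toc_text)
method = next(e for e in entries if "METHOD" in e["title"])
print(method)
```

The `start` and `end` values of a matched entry can then be used as the `start_line` and `end_line` arguments of `kb.read_doc`, as in the excerpt-reading cell earlier.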
     

In [25]:
vit_id = doc_ids["ViT.pdf"]
vit_docs = [d for d in kb.list_docs() if d.id == vit_id]

print(vit_docs[0].table_of_contents)

- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [lines 1-48]
          - Abstract [lines 10-13]
  - 1 Introduction [lines 14-27]
  - 2 Related Work [lines 28-48]
- 3 METHOD [lines 49-52]
- 3.1 VISION TRANSFORMER (VIT) [lines 53-109]
      - Inductive bias. [lines 70-73]
      - Hybrid Architecture. [lines 74-77]
    - 3.2 Fine-tuning and Higher Resolution [lines 78-81]
  - 4 Experiments [lines 82-109]
    - 4.1 Setup [lines 86-109]
- 4.2 COMPARISON TO STATE OF THE ART [lines 110-137]
- 4.3 PRE-TRAINING DATA REQUIREMENTS [lines 138-164]
- 4.4 SCALING STUDY [lines 165-170]
- 4.5 INSPECTING VISION TRANSFORMER [lines 171-183]
- 4.6 SELF-SUPERVISION [lines 184-199]
- 5 CONCLUSION [lines 200-205]
- ACKNOWLEDGEMENTS [lines 206-209]
- REFERENCES [lines 210-328]
- APPENDIX [lines 329-330]
- A MULTIHEAD SELF-ATTENTION [lines 331-352]
- B EXPERIMENT DETAILS [lines 353-354]
- B.1 TRAINING [lines 355-358]
- B.1.1 FINE-TUNING [lines 359-381]
- B.1.2 SELF-SUPERVISION [lin

## Manage tags

Tags can be added or removed after loading.

In [26]:
# Add tags (note: `attention_id` was assigned the ID of BERT.pdf in the search cells above)
kb.tag_doc(attention_id, {"seminal", "google"})

# Remove a tag
kb.untag_doc(attention_id, {"google"})

## Cleanup

The knowledge base persists in `STORAGE_DIR`. Delete it to start fresh:

```python
import shutil
shutil.rmtree(STORAGE_DIR)
```

---