Engineer Onboarding

Engineer Onboarding Guide

Goal: zero-to-contributing in one week. Read Assyriology 101 first for the domain primer; this file covers the codebase.

Day 1: Orient

What Glintstone Is

A platform that unifies fragmented cuneiform research data (CDLI, ORACC, eBL, ePSD2, OGSL) so scholars can search, browse, and analyze ~353k ancient artifacts in one place. Every artifact is tracked through five stages: Captured > Recognized > Transcribed > Lemmatized > Translated.

This is a production academic tool. Data integrity and source attribution are not optional — they're structural requirements. Scholars' careers depend on proper credit.

Architecture in 60 Seconds

Browser
  |
nginx (.test domains, port 80)
  |
  +---> Web App (port 8002, FastAPI + Jinja2)
  |         |
  |         +---> API (port 8001, FastAPI + JSON)
  |                   |
  |                   +---> PostgreSQL 17
  |
  +---> Marketing site (static)

Two-tier FastAPI: the web layer renders HTML and calls the API layer via httpx. The web layer never touches the database directly. This separation exists so the API can serve future consumers (other tools, ML pipelines, public access).

Directory Map

api/                 REST API: routes, repositories, services
app/                 Web app: routes, Jinja2 templates, CSS, JS
core/                Shared: config, database pool, base repository
data-model/
  glintstone-schema.yaml   Authoritative schema spec
  migrations/        SQL migrations (NNN_*.sql)
  migrate.py         Migration runner
ingestion/           ETL framework: typed connectors + CLI
docs/                Reference docs (this file, data-model/, getting-started/)
ops/local/           setup.sh, start.sh, stop.sh, nginx.conf
ops/deploy/          deploy.sh, rollback.sh, provision.sh, nginx/

Get Running

# Prerequisites: Homebrew, PostgreSQL 17, Python 3.9+

# One-time setup (creates DB, venv, /etc/hosts entries, nginx config)
./ops/local/setup.sh

# Start everything (PostgreSQL, API on 8001, web on 8002, nginx)
./ops/local/start.sh

# Verify
open http://app.glintstone.test       # Web UI
open http://api.glintstone.test/docs  # API docs (Swagger)

# Stop
./ops/local/stop.sh

Both servers run with --reload — edit a Python file and changes are live immediately.

Day 2: Read the Code

Start Here (in order)

File	Why
`Assyriology 101.md`	Domain primer — wedge-shaped marks to database tables
`core/config.py`	How settings work (pydantic-settings, .env)
`core/database.py`	Connection pool (psycopg3, dict_row, get_db dependency)
`core/repository.py`	BaseRepository: fetch_all, fetch_one, build_in_clause
`api/routes/artifacts.py`	Trace a request end-to-end
`api/repositories/artifact_repo.py`	Where SQL lives — the largest repo file
`app/routes/tablets.py`	How the web layer calls the API and renders templates
`app/templates/tablets/list.html`	Jinja2 template with macros and filters
`data-model/glintstone-schema.yaml`	Full schema spec

Key Code Patterns

Repository pattern — all SQL in api/repositories/*_repo.py, injected via FastAPI's Depends(get_db):

@router.get("/{p_number}")
def get_artifact(p_number: str, conn=Depends(get_db)):
    repo = ArtifactRepository(conn)
    return repo.find_by_p_number(p_number)

Two-tier call chain — web route calls API, never DB:

# app/routes/tablets.py
async def tablet_detail(request, p_number):
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{API_URL}/artifacts/{p_number}")
    return templates.TemplateResponse("tablets/detail.html", {"artifact": resp.json()})

Cross-filter counts — the filter sidebar shows how many results each option would produce, dynamically excluding the current dimension:

def _cross_filter_where(self, active: dict, exclude: str):
    # Builds WHERE for all active filters EXCEPT the excluded dimension
    # So "Period" counts reflect your Genre/Language/Provenience selections

Source attribution — every record links to an annotation_run identifying who produced it, when, and from what source. This is non-negotiable.

Day 3: Understand the Data

The P-Number is Everything

P227657 is a tablet. The P-number (from CDLI) is the universal join key across every table, every source, every system in the field. Think of it as the primary key for all of Assyriology.

Canon Tables

Raw source data is messy. Canon tables normalize it:

Table	Purpose	Example
`period_canon`	Normalize time periods	"Ur III" maps to group "Third Millennium"
`provenience_canon`	Normalize places	"Nippur (mod. Nuffar)" becomes "Nippur"
`language_map`	Normalize language codes	"sux-x-emesal" maps to "Sumerian"
`genre_canon`	Normalize genres	Various genre strings to canonical forms

These are seeded by 01_seed_lookup_tables.py and used by the filter system.

Database Access

# As the table owner (for migrations, schema changes)
$(brew --prefix postgresql@17)/bin/psql -h 127.0.0.1 -U wittkensis -d glintstone

# As the app user (for read-only exploration)
$(brew --prefix postgresql@17)/bin/psql -h 127.0.0.1 -U glintstone -d glintstone

Ownership matters: tables are owned by wittkensis, not glintstone. Run migrations as wittkensis. After creating new tables or sequences, GRANT permissions:

GRANT ALL ON new_table TO glintstone;
GRANT USAGE, SELECT ON SEQUENCE new_table_id_seq TO glintstone;

Current Scale

Table	Rows	Notes
artifacts	353,283	Every known tablet
text_lines	1,728,706	ATF decomposed
tokens	4,905,814	Individual words
lemmatizations	416,911	Token-level linguistic annotations
lexical_lemmas	346,480	Dictionary entries across 14 languages
lexical_norms	834,235	Normalized forms per lemma
glossary_entries	220,349	ORACC glossary entries (103 projects)
lexical_senses	860,940	Meaning distinctions
scholars	20,490	Researcher registry
publications	16,725	Academic bibliography

Day 4: The Ingestion Framework

How data gets in

The project is on a typed ingestion framework (ingestion/) where every source has a typed connector. The old numbered scripts that lived under source-data/import-tools/ were retired and removed (see _archive/ notes in CLAUDE.md). Detailed connector anatomy lives in the gs-expert-integrations skill (see .claude/skills/gs-expert-integrations/).

The framework:

ingestion/
  base.py          — SourceConnector, ModelConnector, RunContext, LoadStats
  runner.py        — run_connector() — wraps ETL + import_runs bookkeeping
  loader.py        — upsert_batch() — INSERT … ON CONFLICT
  registry.py      — discover_connectors() — auto-discovers ingestion.connectors.*
  dead_letters.py  — DeadLetterSink, DeadLetterCategory
  cli.py           — python -m ingestion.cli list|run|status|dead-letters
  connectors/      — one file per source

Key connectors:

Connector	What
`lookup_tables.py`	Canon tables for normalization (runs first)
`annotation_runs.py`	Register data sources
`ogsl_signs.py`	OGSL sign list (3,367 signs)
`cdli_catalog.py`	CDLI artifact catalog (~353k records)
`atf_parser.py`	ATF → text_lines + tokens
`oracc_lemmatizations.py`	ORACC linguistic annotations
`oracc_glossaries.py`	ORACC glossary entries
`epsd2.py`	ePSD2 Sumerian dictionary
`oracc_lexical_glossaries.py`	ORACC glossaries (~115 projects, 14 langs)
`translation_line_matcher.py`	Match translations to text_lines
`baby_lemmatizer.py`	ML lemmatization (BabyLemmatizer)

Running a connector

python -m ingestion.cli list                # show registered connectors
python -m ingestion.cli run cdli-catalog    # run one
python -m ingestion.cli status              # recent runs across all connectors
python -m ingestion.cli dead-letters cdli-catalog   # triage failures

A web admin dashboard is at /admin/ingestion.

Conventions

class MyConnector(SourceConnector):
    id = "my-connector"           # stable slug
    kind = "catalog"              # catalog | corpus | lexicon | annotation | lookup | derived
    runs_after = ["lookup-tables"]
    license = "CC-BY-4.0"

    def discover(self, ctx) -> SourceManifest: ...   # checksum for skip-if-unchanged
    def extract(self, ctx) -> Iterator[dict]: ...    # yield raw rows
    def transform(self, ctx, record) -> Iterator[dict]: ...  # validate, dead-letter, reshape
    def load(self, ctx, rows) -> LoadStats:
        return upsert_batch(ctx.db, table="my_table", rows=rows,
                            unique_key=["id"], policy=ConflictPolicy.UPDATE)
    def verify(self, ctx) -> None: ...               # sanity-check, raise on issue

upsert_batch() covers idempotency via UPDATE / SKIP / REPLACE policies. Bad rows go to dead-letters via ctx.dead_letter(...); connectors never raise on per-record problems.

Source data lives outside the repo (each connector documents its own discovery / download path — see the connector source files for specifics).

Day 5: Common Tasks

Add an API Endpoint

Add route function in api/routes/<domain>.py
Add repository method in api/repositories/<domain>_repo.py
If new router, register it in api/main.py
Test at http://api.glintstone.test/docs

Add a Web Page

Add route in app/routes/<domain>.py
Create template in app/templates/<domain>/
If new router, register in app/main.py
Add CSS in app/static/css/pages/
Add JS in app/static/js/ if needed

Add a Database Migration

Create data-model/migrations/NNN_description.sql (next number after the latest)

Apply via the migration runner (preferred — records applied state):

DATABASE_URL=postgresql://wittkensis@127.0.0.1:5432/glintstone python data-model/migrate.py up

Or ad-hoc with psql:

$(brew --prefix postgresql@17)/bin/psql -h 127.0.0.1 -U wittkensis -d glintstone -f data-model/migrations/NNN_description.sql

GRANT permissions to glintstone user for any new tables/sequences
Update the schema YAML if it affects the data model

Add a New Import Script

Create NN_description.py following existing patterns
Register an annotation_run before importing data
Use ON CONFLICT for idempotency
Use ImportCheckpoint for large imports
Support --dry-run, --limit, --reset flags

Gotchas & Traps

Python / psycopg

Rollback trap: conn.rollback() undoes ALL uncommitted changes in the transaction, not just the failed statement. Never use try/except for UniqueViolation — use NOT EXISTS subqueries or ON CONFLICT instead.
macOS SSL: Python's urllib/requests can fail with SSL errors on macOS. Use subprocess.run(["curl", ...]) for external HTTP calls in import scripts.

Jinja2 Templates

Dict method collision: If a dict key is named items, values, keys, get, or update, dot notation (dict.items) calls the Python dict method instead of the key. Use bracket notation (dict['items']) or rename the key.
No in test: selectattr('val', 'in', list) doesn't work — 'in' is not a valid Jinja2 test. Use a loop with namespace to check membership.

Data

Subscript semantics differ by language: In Sumerian, du3 is a different word than du. In Akkadian, subscripts are just sign disambiguation — du3 normalizes to du. The tokenizer must handle both modes based on the ATF language header.
Competing interpretations are intentional: Two scholars may lemmatize the same word differently. Both are stored. This is a feature.
Coverage is sparse: Only ~2% of tablets are lemmatized. 35% have transliteration. 12% have translations. Expect NULLs everywhere.

Infrastructure

Table ownership: wittkensis owns tables. glintstone is the app user. Migrations run as wittkensis. New tables need explicit GRANTs.
Servers must be running: The web app calls the API over HTTP. If the API is down, the web app shows errors, not empty pages.
OpenAlex API: Uses "topics" not the deprecated "concepts". Topic T12307 = "Ancient Near East History".

Key Vocabulary (Beyond Assyriology 101)

Term	Meaning in this codebase
annotation_run	A batch of imported data with full provenance (source, version, timestamp)
canon table	Lookup table that normalizes messy source values to canonical forms
cross-filter count	Dynamic count showing results per filter option, excluding that dimension
P-number	CDLI artifact identifier, universal join key (e.g., P227657)
Q-number	CDLI composite text identifier — the abstract "work" vs. physical tablet
stage	Pipeline position: Captured (1) through Translated (5)
unified lexical	The three-table architecture (signs, lemmas, senses) from migration 008

Who to Ask

The project is early-stage. There are no separate frontend/backend teams. You'll touch Python, SQL, Jinja2, CSS, and vanilla JS in the same week. Domain questions will come up constantly — Assyriology 101.md and the PLANNING/Domain Research/ folder are your references. For linguistic questions that go deeper, the data model docs at data-model/ have detailed commentary.

First Week Checklist

Open Issues (as of 2026-02-23)

Status key: DONE = closed, IN PROGRESS = actively being worked, IMPROVED = recent meaningful commits

Foundation & Infrastructure

#	Issue	Labels	Status
1	API Design	`help wanted`	Open
2	Caching	`enhancement` `help wanted`	Open
3	Schema Design & Data Quality	`help wanted`	Open — schema is solid through v2/migration 017, but multilingual edge cases remain
4	Pressure Testing, Unit Testing & QA	`help wanted`	Open — no test framework exists yet
5	Automated Data Ingestion	`help wanted`	Open — scripts run manually today
7	Server Configuration & Security	`help wanted`	Open — Hostinger VPS, Alpine, bot protection
14	Deployment & Versioning System	`help wanted`	Open

Data Migration (v2)

#	Issue	Status
17	Lemmatization Matches	Open — 38% match rate, 99k "no line match"
18	CompVis Unmatched Tablets	Open
19	Surface Type Precision	Open
20	Citations	Open — many CDLI 302 redirects, CSV fallback planned
21	Remaining Import Steps	DONE — eBL annotations, collections, ORACC credits all landed
22	Citation Phases Remaining	Open — 8 of 12 citation scripts still pending (03-10)
24	Rewire Global Search	`bug`

App UI & UX

#	Issue	Labels	Status
9	App UI v2	`enhancement`	IN PROGRESS — ATF viewer done, composites done, dictionary Phase 1 done, Phase 2 (knowledge graph) schema deployed but API/UI pending
6	UI Systems Componentization	`enhancement`	Open — headers, dictionary surfaces inconsistent
13	Remove Tailwind Completely	`enhancement`	Open
15	Gut Check on Frontend Framework	`help wanted`	Open — stay Jinja2 or move to something else?
33	UX: Dictionary experience	`ux`	IN PROGRESS (assigned) — Phase 1 shipped, Phase 2 in progress
34	UX: Composite experience	`ux`	IMPROVED — composites tab implemented in #9
35	UX: Useful app homepage	`ux`	Open
36	UX: ATF viewer revamp	`ux`	IMPROVED — new viewer with glassmorphism legend, interlinear glossing landed in #9
37	UX: Knowledge sidebar	`ux`	Open
38	UX: Global semantic search	`ux`	Open
39	UX: Source overview	`ux`	Open
40	UX: Pipeline design system	`ux`	Open
41	UX: Citation/provenance/trust layer	`ux`	Open
43	UX: Dynamic Collections	`ux`	Open

Features & Enhancements

#	Issue	Labels	Status
8	Database Transparency	`enhancement`	Open
10	Marketing Page Updates	`enhancement`	Open (assigned)
11	Agentic Summaries (Anthropic API)	`help wanted`	Open
12	ML Integration	`enhancement`	Open
16	Tablet Thumbnail Generation	`help wanted`	Open
25	Semantic Global Search	`enhancement` `help wanted`	Open
27	Integrate new dictionary sources	(assigned)	Open
32	Account creation & management	`enhancement`	Open
42	Jupyter notebook templates	`enhancement`	Open
46	Multilingual UI (EN/DE/AR/FA)	`enhancement`	Open

Research & Decisions

#	Issue	Labels	Status
23	Consider GraphQL Integration	`enhancement` `help wanted`	Open
26	Docs: Assyriology Basics in 5 mins	`documentation` (assigned)	IMPROVED — `Assyriology 101.md` exists at repo root
28	Docs: Language Litmus Tests	`documentation` (assigned)	Open
29	App UI: Pull from API classes or HTTP?	`question`	Open — architectural decision on two-tier boundary
30	Teach Eric Proper GIT etiquette	`help wanted` `question`	Open
31	Usage telemetry	`question`	Open
44	Mathematica integration?	`question`	Open
45	Socialization: Identify 5 early participants	`research` (assigned)	Open

Good First Issues for New Engineers

If you're looking for where to start, these are approachable and high-value:

#4 — Testing framework: Set up pytest, write first tests for repository layer and API endpoints. Huge impact, well-scoped.
#16 — Tablet thumbnails: Image processing task, self-contained. Pillow is already a dependency.
#13 — Remove Tailwind: CSS cleanup, good way to learn the template/styling structure.
#2 — Caching: Profile slow queries (get_filter_options() hits 353k rows), add Redis or in-memory cache. Performance win.
#6 — UI componentization: Audit and unify Jinja2 macros, CSS patterns. Good way to learn the full frontend.

Source: github.com/wittkensis/glintstone · Issues · Edit this wiki

Home

Start here

Getting Started

Overview

Data Model

Reference — Data Model

Reference — API

Reference — MCP

Opportunities

Personas

Project

Research

Engineer Onboarding

Engineer Onboarding Guide

Day 1: Orient

What Glintstone Is

Architecture in 60 Seconds

Directory Map

Get Running

Day 2: Read the Code

Start Here (in order)

Key Code Patterns

Day 3: Understand the Data

The P-Number is Everything

Canon Tables

Database Access

Current Scale

Day 4: The Ingestion Framework

How data gets in

Running a connector

Conventions

Day 5: Common Tasks

Add an API Endpoint

Add a Web Page

Add a Database Migration

Add a New Import Script

Gotchas & Traps

Python / psycopg

Jinja2 Templates

Data

Infrastructure

Key Vocabulary (Beyond Assyriology 101)

Who to Ask

First Week Checklist

Open Issues (as of 2026-02-23)

Foundation & Infrastructure

Data Migration (v2)

App UI & UX

Features & Enhancements

Research & Decisions

Good First Issues for New Engineers

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!