Ideas

Glintstone — promising ideas, gaps, and bugs

This is the project's idea backlog: items that are not actionable yet (an actionable item would be a GitHub issue) but are worth keeping in one place so they don't get lost. Each entry covers what, why, where it came from, and what would unblock it.

When an idea becomes ready to act on, file a GitHub issue and link it from the relevant section.

How it's organized

Paused product ideas (icebox PRDs) — features that have been thought through but deprioritized
UX explorations — design directions that have prototypes or analysis but no implementation slot yet
Schema & data-quality concerns — ongoing governance questions
Data gaps to close — sources/datasets identified but not integrated
Lexical & dictionary work — Phase 6–9 of the lexical roadmap
Quick wins — small polish that hasn't been picked up
Bugs & data integrity issues — confirmed defects without an open issue

1. Paused product ideas (icebox PRDs)

1.1 Agentic summary layer (was #11)

Use the Claude API to generate historical/contextual summaries on demand for tablets, composites, collections. Cache aggressively to control cost. Outputs are versioned (generated_at + model_version) and stored in the DB. System prompts are versioned and visible to administrators.

Specific features it would unlock:

Knowledge Bar — Tablet History (agentic historical context summary)
Knowledge Bar — Dictionary (AI-assisted gloss explanation)
Translation Proposals (suggested translations for unlemmatized tokens)
Global Semantic Search (vector embeddings)

Pick up when: the core PRDs (Translation Builder, Knowledge Bar, Browse, Tablet Viewer) have shipped, Anthropic API costs are predictable enough to commit to, and a decision has been made on whether to host inference on the VPS via Ollama or call Anthropic directly.

Companion skill: gs-scout-integrations for evaluating which model size to use.
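A minimal sketch of the caching and versioning idea described in 1.1, assuming summaries are keyed on the entity plus the model and prompt versions so that changing either one produces a fresh summary. SummaryRecord, cache_key, get_or_generate, and the prompt_version field are hypothetical names; the dict stands in for the DB table, and generate stands in for the actual model call.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SummaryRecord:
    entity_type: str     # e.g. "tablet", "composite", "collection"
    entity_id: str
    model_version: str
    prompt_version: str  # hypothetical: version of the system prompt
    text: str
    generated_at: str    # ISO 8601 timestamp, UTC


def cache_key(entity_type, entity_id, model_version, prompt_version):
    """Stable key: any change in model or prompt version yields a new key."""
    payload = json.dumps([entity_type, entity_id, model_version, prompt_version])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def get_or_generate(cache, generate, entity_type, entity_id,
                    model_version, prompt_version):
    """Return the cached record, calling generate() only on a cache miss."""
    key = cache_key(entity_type, entity_id, model_version, prompt_version)
    if key in cache:
        return cache[key]
    record = SummaryRecord(
        entity_type=entity_type,
        entity_id=entity_id,
        model_version=model_version,
        prompt_version=prompt_version,
        text=generate(entity_type, entity_id),
        generated_at=datetime.now(timezone.utc).isoformat(),
    )
    cache[key] = record
    return record
```

Keying on the versions rather than overwriting in place keeps older outputs available for comparison, at the cost of extra storage.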

1.2 API design & external versioning (was #1)

The current API is good enough for the web app but isn't yet ready for external scholar consumers. Needs:

Versioning strategy (/v1/, /v2/ or header-based)
Annotated OpenAPI / Swagger
Terminology aligned with CDLI / ORACC / ePSD2 conventions
Auth + rate limiting for external use
Changelog + deprecation policy

Pick up post-v1 launch once the surface is stable.
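To make the two versioning options in 1.2 concrete, here is a small sketch that accepts either a path prefix (/v1/, /v2/) or a header. The header name X-API-Version, the default version, and the function name resolve_version are assumptions for illustration, not decisions the project has made.

```python
import re

SUPPORTED_VERSIONS = {"1", "2"}   # assumption: versions on offer
DEFAULT_VERSION = "1"             # assumption: used when nothing is specified
_PATH_RE = re.compile(r"^/v(\d+)(/.*)?$")


def resolve_version(path, headers):
    """Return (version, remaining_path) for a request.

    A /vN/ path prefix wins; otherwise the hypothetical X-API-Version
    header is consulted; otherwise DEFAULT_VERSION applies.
    """
    match = _PATH_RE.match(path)
    if match:
        version = match.group(1)
        rest = match.group(2) or "/"
    else:
        lowered = {name.lower(): value for name, value in headers.items()}
        version = lowered.get("x-api-version", DEFAULT_VERSION)
        rest = path
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported API version: {version}")
    return version, rest
```

Path prefixes are easier for scholars to paste into a browser or a citation; header-based versioning keeps URLs stable across versions. Supporting both, as here, is only one possible outcome of the decision.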

1.3 Automated data ingestion (was #5)

Import scripts run manually. Automate via cron / scheduled runs on the VPS:

Change detection (only re-import what's changed since last run)
Notification on failures
Idempotency is already there (ON CONFLICT); automation is the missing layer

Pick up after v1, when the corpus needs to be refreshed more often than quarterly. CDLI bulk export has been frozen since Aug 2022, which lowers the urgency.
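The change-detection layer in 1.3 could be as small as a stored content hash per source. The sketch below assumes a hypothetical import_state table and uses SQLite as a stand-in for the project database; the ON CONFLICT upsert mirrors the idempotency the import scripts already rely on. sync_source and import_fn are illustrative names.

```python
import hashlib
import sqlite3


def content_hash(payload: bytes) -> str:
    """SHA-256 of the raw download, used to detect upstream changes."""
    return hashlib.sha256(payload).hexdigest()


def sync_source(conn, source_name, payload, import_fn):
    """Run import_fn(payload) only if payload differs from the last run.

    Returns True when an import ran, False when it was skipped.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS import_state ("
        "source TEXT PRIMARY KEY, sha256 TEXT NOT NULL)"
    )
    digest = content_hash(payload)
    row = conn.execute(
        "SELECT sha256 FROM import_state WHERE source = ?", (source_name,)
    ).fetchone()
    if row is not None and row[0] == digest:
        return False  # unchanged since last run
    import_fn(payload)
    # Record the new hash only after the import succeeded.
    conn.execute(
        "INSERT INTO import_state (source, sha256) VALUES (?, ?) "
        "ON CONFLICT(source) DO UPDATE SET sha256 = excluded.sha256",
        (source_name, digest),
    )
    conn.commit()
    return True
```

If import_fn raises, the hash is not updated, so the next scheduled run retries the same payload; failure notification would hook in around that call.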

1.4 Multilingual UI (was #46)

Scholars in Iraq, Syria, Iran, and Germany are underrepresented. Phase plan:

Phase 1: German (translations already in DB)
Phase 2: Arabic, Farsi (requires RTL layout)
Phase 3: French, Spanish

Challenges: LTR/RTL swap, i18n string extraction (currently no string-table), Jinja2 i18n (babel or equivalent).

Pick up after v1 when the UI is stable enough that strings won't churn.
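For the LTR/RTL swap mentioned in 1.4, one small piece is deciding the text direction from the active locale so a template can set it on the root element. This is a sketch only: text_direction is a hypothetical helper, and the RTL set covers just the two Phase 2 languages named above.

```python
# Languages from the Phase 2 plan that need right-to-left layout.
RTL_LANGUAGES = {"ar", "fa"}


def text_direction(locale: str) -> str:
    """Return "rtl" or "ltr" for a locale such as "ar", "fa-IR" or "en_US"."""
    language = locale.replace("_", "-").split("-")[0].lower()
    return "rtl" if language in RTL_LANGUAGES else "ltr"
```

A Jinja2 template could then render the value into the dir attribute of the html element, leaving the layout swap itself to CSS.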

1.5 Remaining data sources to integrate

Several ORACC subprojects failed initial download (TOPLEVEL_RETRY):

Project	Status	Blocker
cams	Download failed	Retry needed
etcsl	Download failed	Retry needed
rime	Download failed	Retry needed
ctij	Download failed	Retry needed
ribo	Subproject portal; download incomplete	Investigate portal format
amgg	Empty directory	Retry needed
hbtin	Not attempted (priority 3)	—
dccmt	Not attempted (priority 3)	—

Import connectors (lemmatizations, glossaries, norms, credits) already include these projects — no code changes needed, just successful downloads.

Other source candidates listed in gs-scout-integrations/worked-examples.md.
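Since most blockers in the table in 1.5 are "Retry needed", a retry wrapper with exponential backoff may be enough to clear them. The sketch below is generic: download is any zero-argument callable (hypothetical; it would wrap whatever fetches a project archive), and only OSError is retried, which includes urllib.error.URLError and socket timeouts.

```python
import time


def retry(download, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call download() until it succeeds or attempts run out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between tries.
    The sleep parameter is injectable so tests do not actually wait.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return download()
        except OSError as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

The ribo entry is a different case: its blocker is the portal format, so retrying alone is unlikely to help there.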

1.6 Schema governance & language coverage (was #3)

Ongoing concerns, not a discrete feature:

Mixed-language tablets — tokens that switch between Sumerian, Akkadian, Hittite within a line
Citation integrity — every data point traceable to annotation run + scholar + publication
Schema versioning as new sources land
Hittite: language_map entries exist but lemmatization is sparse
Elamite, Ugaritic, Hurrian: in artifacts but not in lexical layer

Treat as governance; review before each major import.

1.7 Tech stack review (was #15)

Options to weigh, post-v1:

Stay with FastAPI + Jinja2 + vanilla JS (no JS build, fast server render, complex interactions need significant vanilla JS)
NiceGUI — Python-native, less HTML/CSS control
Reflex — React-based in Python, more interactive but heavier
Partial React — islands pattern for complex components only (Translation Builder, KB)

Decision criteria: do we hit a wall with vanilla JS in the Translation Builder?

2. UX explorations

2.1 Search v2

V1 search has 8 known issues, including: no blank state, no Browse vs Detail context, silent scope change, no scoped no-results state, no a11y, a low-affordance "128 results" link, and a scope pill that shows only when open. V2 designs address each. Source: PLAN/UX Explores/Search & Navigation/search-v2-improvements.md plus prototypes search-v2-prototype.html and search-v3-prototype.html.

When this gets picked up, the prototype HTML files should be ported into app/templates/components/ or treated as design references. Move the prototypes into docs/prototypes/ or attach to a GitHub issue when filed.

2.2 Dictionary UI options (Signs / Lemmas / Glosses)

Detail-page design ideation for signs, lemmas, and glosses, grounded in actual data shapes. Terminology alignment between the Lexicography Terminology report and current schema:

L0 Sign → lexical_signs (missing: sign_type, period attestation range)
L1 Reading → split across 3 tables (gap; readings should be first-class)
L2 Lemma → lexical_lemmas (missing: certainty flag, stem/derivation, root)
L3 Gloss → lexical_lemmas.guide_word + lexical_senses.definition_parts[] (rename "Meanings" → "Glosses")
L4 Attestation → lemmatizations + tokens + text_lines

Key principle: "Gloss for finding, attestation for understanding." Senses are irrecoverable for dead languages; lexical_senses actually stores glosses.

The biggest structural gap is Reading, which is scattered across lexical_signs.values[], sign_lemma_associations.reading_type, and token_readings. A future migration could pull these into a dedicated lexical_readings table.

2.3 ATF viewer continued work

Knowledge Bar variants, ATF viewer overlay options, tablet citation display options. The Markdown mockups currently live in PLAN and are due to move:

knowledge-bar-options.md
atf-viewer-options.md
detail-section-structures.md
tablet-citation-options.md
adaptive-knowledge-bar.md

When picked up, attach to the relevant tablet-viewer or knowledge-bar GitHub issue.

3. Schema & data-quality concerns

3.1 Readings as first-class entities

Currently distributed across 3 tables. A lexical_readings table would simplify joins and make reading-level provenance trackable. Migration would be additive; old code paths can continue reading from current locations during transition.
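A sketch of what the additive migration in 3.1 might look like. Only the table name lexical_readings comes from the text; the columns, the uniqueness rule, and the backfill_readings helper are assumptions, and SQLite stands in for the project database. The input mimics rows read from lexical_signs.values[] as (sign_id, list of values) pairs.

```python
import sqlite3

# Hypothetical columns; the real table would likely carry more provenance.
CREATE_READINGS = """
CREATE TABLE IF NOT EXISTS lexical_readings (
    id      INTEGER PRIMARY KEY,
    sign_id INTEGER NOT NULL,
    value   TEXT    NOT NULL,
    source  TEXT    NOT NULL,
    UNIQUE (sign_id, value)
)
"""


def backfill_readings(conn, sign_values, source):
    """Copy readings into lexical_readings without touching old tables.

    sign_values: iterable of (sign_id, [reading, ...]) pairs.
    Safe to re-run: existing (sign_id, value) pairs are skipped.
    Returns the total row count after the backfill.
    """
    conn.execute(CREATE_READINGS)
    rows = [
        (sign_id, value, source)
        for sign_id, values in sign_values
        for value in values
    ]
    conn.executemany(
        "INSERT INTO lexical_readings (sign_id, value, source) "
        "VALUES (?, ?, ?) ON CONFLICT(sign_id, value) DO NOTHING",
        rows,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM lexical_readings").fetchone()[0]
```

Because nothing is dropped or altered, existing code paths keep reading from the current three locations until they are switched over one at a time.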

3.2 Mixed-language token handling

ORACC corpus JSON has %a, %sux, etc. language-shift markers within a single line. Currently parsed but not surfaced consistently in the UI. Token-level language exists post migration 011 — verify it's populated for all bilingual tablets.
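To illustrate how language-shift markers translate into token-level language, here is a simplified sketch. It assumes the line has already been flattened to a whitespace-separated string with the markers inline, which is not the actual shape of the ORACC corpus JSON; tag_token_languages is a hypothetical name, and the marker code is kept raw (for example "a" for %a) rather than mapped to a language name.

```python
def tag_token_languages(line, default_language):
    """Pair each token with the language code in force when it appears.

    A token of the form %code switches the current language and is not
    itself emitted. The code is stored without the leading "%".
    """
    current = default_language
    tagged = []
    for token in line.split():
        if token.startswith("%") and len(token) > 1:
            current = token[1:]
            continue
        tagged.append((token, current))
    return tagged
```

A verification pass over bilingual tablets could compare the set of codes produced this way against the per-token language stored after migration 011.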

3.3 Period / region / dialect on lemmas

lexical_lemmas.period, region, dialect are empty. Filling them enables filtered dictionary browsing. Data exists in upstream ORACC glossaries (per-project, per-language) — connector update needed.

3.4 Sign_type and period attestation on signs

lexical_signs is missing sign_type (simple/modified/compound) and period attestation range. Data is in OGSL — connector update needed.

4. Data gaps to close

(Mirror of gs-scout-integrations/gap-analysis.md — see there for prioritization.)

Highlights:

Lemmatization coverage: ~2% of artifacts. Highest-impact gap. FactGrid lemma model is a candidate.
Sign detection coverage: 81 tablets (CompVis). Options: scaling up DETR, or a HuggingFace candidate.
MZL ↔ OGSL concordance: ~90% done; 200–400 signs need manual curation.
Sumerian outside ORACC projects
Akkadian dialects disambiguated
Translations beyond English (#46 ties in)

5. Lexical & dictionary roadmap (Phase 6–9)

Migrated from PLAN/PRIVATE-TODO.md. Some items may already be partially done; verify before picking up.

Phase 6 — activate lexical resources

Populate lexical_tablet_occurrences from token data (background job; join through token_readings and lemmatizations)
Build REST endpoints for signs/lemmas/senses (planned list in gs-expert-data-model/lexical-api.md)
Token viewer lexical popup (click → show all possible lemmas, senses, signs)

Phase 7 — Dictionary & Sign Explorer UI

Search bar w/ autocomplete across 61k lemmas
Filters: language, POS, source, period, dialect (needs section 3.3 above)
Multi-source comparison view (ePSD2 vs ORACC)
Source attribution badges
Sign detail pages with Unicode + Borger + all values + lemmas + composite components + tablet occurrences
Lemma detail pages with senses, signs that write it, cognates, attestation counts

Phase 8 — Advanced lexical

Semantic field classification (12 hierarchical categories already seeded)
Confidence scoring for token → lemma matches
Manual curation interface (scholarly verification, flag uncertain, track curator, export to ORACC)
Cross-language navigation (Sumerogram explorer, cognate chains, etymology)

Phase 9 — Data quality & expansion

Tablet occurrence analytics (genre/period/region/co-occurrence)
Import remaining ORACC projects (Phase 5 dependencies)
Hittite resources, Elamite dictionaries, CAD integration
Duplicate detection across sources; conflict resolution interface

6. Quick wins

Migrated from PRIVATE-TODO. Low effort, high cleanup value:

Attribution facets — filter /tablets by atf_source / translation_source
Filter UX polish — dropdowns, active-state chips, clear-all behavior
Verify eBL translation attribution still renders alongside CDLI/ORACC display
"Back to Tablets" button on Tablet viewer
Breadcrumb navigation
Loading states for API calls
Error boundaries and fallbacks
Cache or index filter options — get_filter_options() runs 4 DISTINCT on 353k rows per page load (also in gs-expert-data-model/query-patterns.md)
Redis for frequent lookups
EXPLAIN ANALYZE pass on slow queries
DB connection pooling tuning
Pagination for large result sets
OpenAPI/Swagger annotation pass (overlaps with section 1.2)
ER diagrams for the data model
Import pipeline troubleshooting guide (now mostly handled by gs-expert-integrations/sources.md)
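For the filter-options item in the list above, the cheapest fix is probably an in-process cache with a time-to-live, so the DISTINCT queries run once per interval instead of once per page load. A minimal sketch, assuming the existing get_filter_options() can be passed in as the loader; the TTLCache class and the TTL value are hypothetical.

```python
import time


class TTLCache:
    """Cache a single loader result for ttl_seconds."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock        # injectable for tests
        self._value = None
        self._expires_at = None    # None means nothing cached yet

    def get(self):
        """Return the cached value, reloading it if missing or expired."""
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value

    def invalidate(self):
        """Force a reload on the next get(), e.g. after an import run."""
        self._expires_at = None
```

Usage would look like filter_cache = TTLCache(get_filter_options, ttl_seconds=3600), with filter_cache.invalidate() called at the end of an import. Since the corpus changes only when imports run, a long TTL plus explicit invalidation may remove the need for Redis for this particular lookup.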

7. Bugs & data integrity (not yet filed as issues)

Translation pipeline backlog

Confirmed bugs whose fixes have not yet been verified:

Migration 011 (011_lemmatization_language.sql): adds per-word language to lemmatizations. Need to confirm that it has been applied to production and that 11_import_lemmatizations.py (or its v2 equivalent, oracc_lemmatizations.py) backfills per-word language from ORACC CDL.
Translation panel must appear on tablets with translations (e.g., P001282, P108651) — verify.
Interlinear glosses — lemma indexing fix was applied; verify render is correct.
Bilingual language indicators appear on ORACC-lemmatized bilingual tablets — verify.

Data integrity (issue #49 covers some of this)

5 confirmed bugs affect data accuracy and will surface in Translation Builder / Knowledge Bar (per PRD-006a). Next step: pinpoint each one and add a regression fixture to gs-curator-artifacts/catalog.yaml.

Citation phases remaining (#22)

8 of 12 citation scripts (03–10) still pending in connector form.
Many CDLI 302 redirects to handle; CSV fallback planned.

Maintenance

When an idea graduates to actionable work: file a GitHub issue, link the issue here, and either delete the section or mark it [FILED: #NNN]. When a bug is fixed: remove from section 7 and add a regression fixture (gs-curator-artifacts/catalog.yaml).

gs-curator-docs doesn't enforce freshness on this file specifically — it's a backlog, not an authority — but bump modified: when sections change meaningfully.

Source: github.com/wittkensis/glintstone

