Skills: Domain Modeling (archived)

Gully Burns edited this page Apr 16, 2026 · 1 revision

Domain modeling is the process of designing a new skill from scratch — defining the TypeDB schema namespace, writing the skill files, and building the Python scripts that give an agent a coherent new domain to work in.

This is a meta-skill: rather than operating on a domain (job hunting, rare diseases), it helps you create one.

When You Need Domain Modeling

You need domain modeling when you want the agent to:

  • Track a new category of things over time (papers, experiments, competitors, grants, proteins…)
  • Build structured understanding from unstructured sources in a new area
  • Answer recurring questions that require accumulated, queryable knowledge

If a conversation-scoped answer is enough, you don't need a new skill. Domain modeling is justified when the knowledge should persist across sessions and accumulate into something more than the sum of individual notes.

The Central Design Question

Before writing a single line of schema, answer this:

What questions do you want to ask six months from now, once the knowledge graph has been accumulating data?

The answer determines what entities you need, what relationships matter, and what level of granularity is worth capturing. Work backwards from the queries you want to run.

Example (jobhunt):

  • "What skill gaps appear most often across my high-priority positions?" → need position, requirement, your-skill
  • "Which companies have I interviewed at?" → need company, position, status tracking
  • "What learning resources address my top gap?" → need learning-resource linked to requirement

Example (rare-disease):

  • "What genes cause this disease?" → need gene, disease, causal relation
  • "Which diseases are phenotypically similar?" → need phenotype, similarity score
  • "What drugs target those genes?" → need drug, gene, targeting relation
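Working backwards from questions can be done as a simple planning exercise before any schema is written. The sketch below is illustrative only: `QUESTIONS` and `required_types` are hypothetical names, and the question text and type names are taken from the jobhunt example above.

```python
# Hypothetical planning aid: map each target question to the types it needs,
# then derive the set of types the schema must define.
QUESTIONS = {
    "What skill gaps appear most often across my high-priority positions?":
        ["position", "requirement", "your-skill"],
    "Which companies have I interviewed at?":
        ["company", "position"],
    "What learning resources address my top gap?":
        ["learning-resource", "requirement"],
}


def required_types(questions):
    """Return each needed type mapped to the questions that depend on it."""
    needed = {}
    for question, types in questions.items():
        for type_name in types:
            needed.setdefault(type_name, []).append(question)
    return needed


if __name__ == "__main__":
    for type_name, askers in sorted(required_types(QUESTIONS).items()):
        print(f"{type_name}: needed by {len(askers)} question(s)")
```

A type that no question depends on is a candidate for leaving out of the first version of the schema.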

Placing Concepts in the 3-Branch Hierarchy

Every type you add to the schema belongs in one of three branches. Getting this right is the most important design decision.

| Branch | Base Type | The Question to Ask |
|---|---|---|
| Domain objects | domain-thing | Is this a real-world thing I reason about? |
| Collections | collection | Is this a typed set of things with shared context? |
| Content | information-content-entity | Is this content I capture, extract, or annotate? |

The content branch (information-content-entity, abbreviated ICE below) has three concrete subtypes:

| ICE Subtype | The Question to Ask |
|---|---|
| artifact | Is this raw captured content with a source URL or provenance? |
| fragment | Is this an extracted piece of an artifact? |
| note | Is this the agent's interpretation, analysis, or annotation? |

The key rule: domain objects are not information content. A gene, a job posting, a company, a disease — these are domain things. The HTML page describing the job posting is an artifact. The extracted skill requirement is a fragment. The fit analysis is a note.

Conflating these creates ontological confusion that makes queries brittle. A common mistake is making a "paper" both a domain thing (the intellectual object) and an artifact (the captured PDF). Keep them separate.

Worked Classification Exercise

New domain: tracking grant funding opportunities.

| Concept | Branch | Reasoning |
|---|---|---|
| Funding agency (NIH, NSF, Wellcome) | domain-thing | Real-world organisation you reason about |
| Grant opportunity (RFA-CA-25-001) | domain-thing | Real-world object with a persistent identity |
| Your research group | domain-thing | Actor in the domain |
| Funding portfolio | collection | A typed set of grant opportunities being tracked |
| Grant opportunity announcement (HTML) | artifact | Raw captured content with a source URL |
| Extracted eligibility criterion | fragment | Extracted piece of the announcement artifact |
| Fit assessment | note | Agent's analysis of group fit against the opportunity |
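The classification questions can be read as a small decision procedure. The following is a minimal sketch, not part of any skill: `BRANCH_QUESTIONS` and `classify` are hypothetical names, and the yes/no answers are supplied by whoever is modeling the domain.

```python
# Hypothetical helper: each key stands for one question from the tables above;
# the value is the base type to use when that question is answered "yes".
BRANCH_QUESTIONS = [
    ("real_world_thing", "domain-thing"),  # a real-world thing I reason about
    ("typed_set", "collection"),           # a typed set of things with shared context
    ("raw_capture", "artifact"),           # raw captured content with provenance
    ("extracted_piece", "fragment"),       # an extracted piece of an artifact
    ("agent_analysis", "note"),            # the agent's interpretation or annotation
]


def classify(**answers):
    """Return the base type for a concept given yes/no answers to the questions."""
    yes = [base for key, base in BRANCH_QUESTIONS if answers.get(key)]
    if len(yes) != 1:
        # Zero or several "yes" answers signal an unclear or conflated concept.
        raise ValueError(f"expected exactly one branch, got {yes or 'none'}")
    return yes[0]


print(classify(real_world_thing=True))  # e.g. a funding agency
print(classify(raw_capture=True))       # e.g. the announcement HTML
```

Answering "yes" to two questions, as in the "paper" example (both a real-world thing and raw captured content), raises an error here, which is the cue to split the concept into two types.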

Schema Design

Namespace Prefix

All type names must use a consistent <domain>- prefix to avoid clashes with other namespaces. Choose a short, unambiguous prefix:

jobhunt-*    rd-*    techrecon-*    grant-*
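A prefix convention is easy to check mechanically before a schema is loaded. The sketch below assumes definitions of the form `<name> sub <parent>` starting at the beginning of a line; `unprefixed_types` is a hypothetical helper, not part of TypeDB or any existing skill.

```python
import re

# Matches the type name at the start of a "<name> sub <parent>" definition.
DEFINITION = re.compile(r"^([a-z][a-z0-9-]*)\s+sub\s+", re.MULTILINE)


def unprefixed_types(schema_text, prefix):
    """Return defined type names that do not start with '<prefix>-'."""
    names = DEFINITION.findall(schema_text)
    return [name for name in names if not name.startswith(prefix + "-")]


SCHEMA = """
grant-url sub attribute, value string;
grant-opportunity sub domain-thing,
    owns grant-url;
url sub attribute, value string;
"""

print(unprefixed_types(SCHEMA, "grant"))  # ['url']
```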

Schema Structure

A minimal schema has five sections:

define

# 1. Attributes
grant-url sub attribute, value string;
grant-deadline sub attribute, value datetime;
grant-amount sub attribute, value long;
grant-fit-score sub attribute, value double;
grant-criterion-text sub attribute, value string;
grant-criterion-type sub attribute, value string;   # required / preferred / excluded
grant-fit-summary sub attribute, value string;

# 2. Domain Things
grant-opportunity sub domain-thing,
    owns grant-url,
    owns grant-deadline,
    owns grant-amount,
    plays opportunity-at-agency:opportunity;

grant-agency sub domain-thing,
    plays opportunity-at-agency:agency;

# 3. Collections
grant-portfolio sub collection;

# 4. ICE subtypes
grant-announcement sub artifact;

grant-eligibility-criterion sub fragment,
    owns grant-criterion-text,
    owns grant-criterion-type;

grant-fit-note sub note,
    owns grant-fit-score,
    owns grant-fit-summary;

# 5. Relations
opportunity-at-agency sub relation,
    relates opportunity,
    relates agency;

Design Principles

Prefix everything. url is fine for a single-domain system; for a multi-skill knowledge graph it will collide. Use grant-url.

Attributes are reusable. If name already exists in the core schema, extend it rather than defining grant-name. Check the core schema first.

Keep domain-things clean. Don't put content attributes (summaries, extracted text) on domain things. Those belong on ICEs.

Relations are explicit. A job posting "requires" a skill. A gene "causes" a disease. A drug "targets" a gene. Make the relationship a named type, not just a pointer. This enables pattern-matching queries across the graph.

Collections are typed per domain. Don't use a generic collection — define grant-portfolio sub collection so queries can target it specifically.
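To illustrate why named relations matter, here is a sketch of the kind of pattern-matching query the opportunity-at-agency relation makes possible. It only builds a query string and does not connect to a database. The `match ... get` form assumes TypeDB 2.x TypeQL, and `opportunities_at_agencies_query` is a hypothetical function name.

```python
# Sketch only: builds a TypeQL string; it does not open a TypeDB connection.
def opportunities_at_agencies_query():
    """Return a match query pairing each opportunity with its agency and deadline."""
    return "\n".join([
        "match",
        "  $o isa grant-opportunity, has grant-deadline $d;",
        "  $a isa grant-agency;",
        "  (opportunity: $o, agency: $a) isa opportunity-at-agency;",
        "get $o, $a, $d;",
    ])


print(opportunities_at_agencies_query())
```

Because the relation is a named type with named roles, the same pattern can be extended with further constraints without changing how the link itself is expressed.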

The Sensemaking Chain

Every skill follows the same provenance chain. Design it explicitly:

Source URL / API
    ↓
Artifact (raw content + provenance)
    ↓
Fragments (extracted structured pieces)
    ↓
Notes (agent analysis, scoring, synthesis)

For each domain object ask:

  • What artifact captures the raw evidence about it?
  • What fragments are worth extracting from that artifact?
  • What notes will the agent write after reading?

Example (grant skill):

grants.nih.gov/grants/guide/rfa-files/RFA-CA-25-001.html
    ↓
grant-announcement artifact (HTML + URL + fetch timestamp)
    ↓
grant-eligibility-criterion fragments (one per eligibility rule)
    ↓
grant-fit-note (fit score + narrative + identified gaps)

This chain means you can always trace a fit score back to the exact text it was derived from.
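The traceability claim can be shown with a small in-memory stand-in for the chain. This is a sketch under stated assumptions: the `Artifact`, `Fragment`, and `Note` classes and the `trace` function are hypothetical illustrations rather than the TypeDB types themselves, and the fragment text is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class Artifact:
    url: str   # source URL (provenance)
    raw: str   # raw captured content


@dataclass
class Fragment:
    text: str         # extracted piece
    source: Artifact  # the artifact it was extracted from


@dataclass
class Note:
    fit_score: float
    based_on: list    # fragments the analysis relied on


def trace(note):
    """Return (fragment text, source URL) pairs behind a note."""
    return [(fragment.text, fragment.source.url) for fragment in note.based_on]


page = Artifact(
    url="https://grants.nih.gov/grants/guide/rfa-files/RFA-CA-25-001.html",
    raw="<html>...</html>",
)
rule = Fragment(text="(eligibility rule text)", source=page)
note = Note(fit_score=0.75, based_on=[rule])
print(trace(note))
```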

Writing SKILL.md

SKILL.md is loaded at startup on every session. Keep it under one page. It contains only:

  1. Frontmatter (name, description)
  2. Trigger phrases — what the user might say that should invoke this skill
  3. Prerequisites (TypeDB running, env vars, any auth)
  4. Quick-start command examples
  5. A pointer to USAGE.md for the full workflow

---
name: grant
description: Track grant opportunities, analyse eligibility, identify fit gaps
---

# Grant Tracking Skill

Use this skill to manage grant opportunities as a knowledge graph.

**When to use:** "add grant", "new funding opportunity", "analyze this RFA",
"show my grant pipeline", "funding gaps", "submission deadlines"

## Prerequisites
- TypeDB running: `make db-start`
- `uv sync --all-extras`

## Quick Start

```bash
uv run python .claude/skills/grant/grant.py ingest-opportunity \
    --url "https://grants.nih.gov/grants/guide/rfa-files/RFA-CA-25-001.html"

uv run python .claude/skills/grant/grant.py list-pipeline
```

Before executing any commands, read USAGE.md for the complete workflow, command reference, and sensemaking steps.


Writing USAGE.md

USAGE.md is loaded on demand when executing the skill. It can be as long as needed. Structure it around the five curation stages:

# Grant Tracking Usage

## 5-Phase Workflow

### Phase 1: Foraging
[how to discover grant opportunities — search, feeds, agency pages]

### Phase 2: Ingestion
[ingest-opportunity command, what it fetches and stores]

### Phase 3: Sensemaking
[step-by-step: read artifact → extract eligibility criteria as fragments
 → write fit-note with score]

### Phase 4: Analysis
[query commands: show-gaps, show-pipeline, show-deadlines]

### Phase 5: Reporting
[report commands, dashboard views]

## Sensemaking Workflow (Agent reasoning steps)

1. Run `show-artifact --id <id>` to get announcement text
2. Read the full announcement, identify eligibility criteria
3. For each criterion: run `add-criterion` to store as fragment
4. Assess group fit against each criterion
5. Run `add-note --type fit-analysis --fit-score 0.75 ...`
6. Report: score breakdown, key gaps, recommended actions

## Command Reference
[full argparse command table]

## TypeQL Examples
[representative queries]

Python Script Conventions

The script is the skill's I/O layer. The agent handles reasoning; the script handles data.

Required Structure

#!/usr/bin/env python3
"""Grant tracking CLI."""

import argparse
import json
import os
import sys

from typedb.driver import TypeDB, SessionType, TransactionType

TYPEDB_HOST = os.getenv("TYPEDB_HOST", "localhost")
TYPEDB_PORT = int(os.getenv("TYPEDB_PORT", "1729"))
TYPEDB_DATABASE = os.getenv("TYPEDB_DATABASE", "alhazen_notebook")


def cmd_ingest(args):
    # ... fetch args.url + store as artifact ...
    artifact_id = None  # placeholder: set by the store step above
    print(json.dumps({"success": True, "id": artifact_id}))  # stdout = structured result


def cmd_list(args):
    # ... query TypeDB ...
    results = []  # placeholder: filled by the query above
    print(json.dumps({"success": True, "items": results}))


def main():
    parser = argparse.ArgumentParser(description="Grant tracking CLI")
    sub = parser.add_subparsers(dest="command", required=True)
    ingest = sub.add_parser("ingest-opportunity")
    ingest.add_argument("--url", required=True)
    sub.add_parser("list-pipeline")
    # ... register further subcommands ...
    commands = {"ingest-opportunity": cmd_ingest, "list-pipeline": cmd_list}
    args = parser.parse_args()
    commands[args.command](args)  # dispatch to the handler for the chosen subcommand


if __name__ == "__main__":
    main()

Key Rules

  • JSON to stdout only. The agent parses stdout.
  • Progress and errors to stderr. print("Fetching...", file=sys.stderr) never pollutes the JSON.
  • One subcommand per operation. ingest-opportunity, list-pipeline, add-note — not a single run command with flags controlling everything.
  • No reasoning in the script. The script fetches, stores, queries. It does not summarise, score, or interpret. That is the agent's job.
  • Fail loudly. If TypeDB is unavailable or a required ID is missing, raise and let stderr carry the message. Don't silently return empty results.
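The output rules above can be sketched with three small helpers. This is one possible shape, not the skills' actual code: `emit`, `progress`, and `cmd_show` are hypothetical names, and the dictionary stands in for a TypeDB lookup.

```python
import json
import sys


def emit(payload):
    """Write the single structured result to stdout for the agent to parse."""
    print(json.dumps(payload))


def progress(message):
    """Write human-readable progress to stderr so stdout stays valid JSON."""
    print(message, file=sys.stderr)


def cmd_show(record_id, store):
    """Look up a record; raise instead of returning an empty result."""
    progress(f"Looking up {record_id}...")
    if record_id not in store:
        raise KeyError(f"no record with id {record_id!r}")
    emit({"success": True, "item": store[record_id]})


if __name__ == "__main__":
    try:
        cmd_show("a1", {"a1": {"kind": "artifact"}})
    except Exception as exc:
        # Fail loudly: the message goes to stderr and the exit code is non-zero.
        print(f"error: {exc}", file=sys.stderr)
        sys.exit(1)
```

Keeping all human-readable output in `progress` means the agent can parse stdout with a single `json.loads` call.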

Standard Command Categories

Most skills need commands in these four categories:

| Category | Examples |
|---|---|
| Ingestion | ingest-*, init-*: fetch from external source, store as artifact |
| Sensemaking support | show-artifact, list-artifacts: feed content to the agent for reading |
| Annotation | add-note, add-fragment, tag: store what the agent extracts |
| Query / Report | list-*, show-*, report-*: retrieve structured results |

Common Pitfalls

Putting everything in notes. Notes are analysis; fragments are extracted structure. If you're storing a list of 50 eligibility criteria as a single note blob, they should be fragments — one per criterion, queryable individually.

Skipping the artifact. It's tempting to skip storing the raw HTML and go straight to extracted entities. Don't. The artifact preserves provenance and lets you re-read the source if your extraction was incomplete.

Generic collection types. collection sub collection is useless. grant-portfolio sub collection lets you write $p isa grant-portfolio in queries. Always define a named subtype.

Forgetting namespace isolation. Two skills both defining url will conflict at schema load time. Always prefix.

Making the script do sensemaking. If your ingest command is computing scores, summarising text, or making classification decisions — stop. Return the raw data and let the agent do that work.
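For the first pitfall, the difference between a note blob and individual fragments can be sketched as follows. The record shape and the `criteria_to_fragments` helper are hypothetical; the criteria themselves are identified by the agent and passed in, so the script does no interpretation.

```python
def criteria_to_fragments(artifact_id, criteria):
    """Build one fragment record per criterion instead of a single note blob."""
    return [
        {
            "type": "grant-eligibility-criterion",
            "artifact": artifact_id,  # keeps the link back to the source artifact
            "index": index,
            "text": text,
        }
        for index, text in enumerate(criteria, start=1)
    ]


fragments = criteria_to_fragments("artifact-1", ["criterion A", "criterion B"])
for fragment in fragments:
    print(fragment["index"], fragment["text"])
```

Each record can then be stored and queried on its own, which a single note holding all the criteria would not allow.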

Reference Implementations

Both demonstration skills are complete working examples:

| Skill | What to study |
|---|---|
| Skills: Jobhunt | Full vertical: schema + SKILL.md + USAGE.md + Python script + Next.js dashboard. The forager pattern (automated discovery → candidates → promote). |
| Skills: Rare Disease | Multi-source ingestion (Monarch, ClinicalTrials, ChEMBL). Structured mechanism model as a schema first-class concept. Cross-namespace queries. |

Read their schema.tql files alongside Schema Reference to see how the 3-branch hierarchy is applied in practice.
