# Wikidata Barrel Recipe from EntitySchema

A Barrel Recipe is an organized description of how one of the target knowledge systems in the Global Knowledge Commons wants content about a particular thing to be structured. The gkc code uses the concept to smooth over the different ways that systems like Wikidata and OpenStreetMap do things, so that data that makes it through our distillation process can be packaged and shipped to those target systems.

This notebook works through a typical workflow in which you know you want to get your source data into a particular Wikidata entity type. You first need to capture what that entity type really looks like in a common, documented structure that can serve as one step in the process between your source data and Wikidata items with labels and properties.

## 1. Locate and Load ShEx Schema
We want to promote good use of Wikidata's EntitySchema architecture, which uses the Shape Expressions (ShEx) specification to encode the properties a given entity type should use as statements/claims, along with specific parameters for those properties. If you don't already have an EntitySchema, gkc can help with creating one as well.

This code block takes any valid EntitySchema ID and reads the schema in for processing. If anything in the shape expressions is invalid, the problems will be reported.

In [None]:
EID = "E502"

from gkc.cooperage import fetch_entity_schema_metadata

# Fetch metadata about the EntitySchema itself (label, description, aliases)
schema_metadata = fetch_entity_schema_metadata(eid=EID, user_agent="gkc-notebook/0.1 (example)")

print("EntitySchema Metadata:")
print(f"  Label: {schema_metadata.get('label', '<no label>')}")
print(f"  Description: {schema_metadata.get('description', '<no description>')}")
print(f"  Aliases: {schema_metadata.get('aliases', [])}")
print(f"  Source: {schema_metadata.get('source', '')}")

In [None]:
from gkc import SpiritSafeValidator

validator = SpiritSafeValidator(eid=EID, user_agent="gkc-notebook/0.1 (example)")
validator.load_specification()

schema_text = validator._schema or ""
print(f"Loaded schema {EID}. Characters: {len(schema_text)}")
print("First line:", schema_text.splitlines()[0] if schema_text else "<empty>")

In [None]:
from pyshex.utils.schema_loader import SchemaLoader

loader = SchemaLoader()
shex_schema = loader.loads(schema_text)

shape_count = len(shex_schema.shapes or [])
print(f"Parsed ShExJ schema. Shapes: {shape_count}")

In [None]:
shape_ids = [str(shape.id) for shape in (shex_schema.shapes or []) if getattr(shape, "id", None)]

print("Shape IDs (first 10):")
print(shape_ids[:10])
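
To make the ShEx encoding described at the top of this section more concrete, here is a minimal sketch that lists the direct-claim properties mentioned in a small ShExC fragment. The fragment is hand-written for illustration and is not the content of E502, and `list_direct_properties` is a hypothetical helper, not part of gkc. A plain regular expression is used here only to peek at the text; the cells above use PyShEx for real parsing.

```python
import re

# Hypothetical ShExC fragment in the style of a Wikidata EntitySchema.
# It is NOT the actual content of E502.
SAMPLE_SHEXC = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

start = @<example_item>

<example_item> {
  wdt:P31 [ wd:Q5 ] ;
  wdt:P569 . ? ;
  wdt:P106 IRI *
}
"""


def list_direct_properties(shexc: str) -> list[str]:
    """Return the distinct wdt: property IDs in order of first appearance."""
    seen = []
    for pid in re.findall(r"\bwdt:(P\d+)\b", shexc):
        if pid not in seen:
            seen.append(pid)
    return seen


print(list_direct_properties(SAMPLE_SHEXC))
```

Each `wdt:P...` line in the shape is one property the entity type is expected to carry; the value set (`[ wd:Q5 ]`) and the cardinality markers (`?`, `*`) are the "specific parameters" for that property.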

## 2. Extract Classification Constraints
We parse the ShExC into a ShExJ AST (via PyShEx) and traverse the schema start shape (and its subshapes) to identify P31 (instance of) and P279 (subclass of) constraints that apply to the overall entity type.

In [None]:
from gkc.recipe import SpecificationExtractor

extractor = SpecificationExtractor(schema_text)

instance_of = extractor.get_instance_of_constraints()
subclass_of = extractor.get_subclass_of_constraints()

print("Classification constraints extracted from schema:")
print(f"  P31 (instance of): {instance_of}")
print(f"  P279 (subclass of): {subclass_of}")

In [None]:
import importlib
import gkc.recipe
importlib.reload(gkc.recipe)

# Recreate the builder with EID so metadata is fetched
from gkc.recipe import RecipeBuilder
builder = RecipeBuilder(eid=EID, user_agent="gkc-notebook/0.1 (example)")

In [None]:
import re

def _predicate_pid(predicate):
    # Match P31/P279 as whole IDs so that e.g. P310 or P2790 is not misclassified
    match = re.search(r"\b(P31|P279)\b", str(predicate))
    return match.group(1) if match else None

shape_map = {
    str(shape.id): shape
    for shape in (shex_schema.shapes or [])
    if getattr(shape, "id", None)
}

def _resolve_shape_ref(ref):
    ref_text = str(ref)
    if ref_text in shape_map:
        return shape_map[ref_text]
    # Anchor at the end so the trailing local name is captured, not a leading IRI scheme
    local_match = re.search(r"(\w+)>?$", ref_text)
    if local_match:
        local = local_match.group(1)
        for key, shape in shape_map.items():
            if key.endswith(local) or key.endswith(f"/{local}") or key.endswith(f"#{local}"):
                return shape
    return None

def _collect_qids(expr):
    qids = []
    if expr is None:
        return qids
    predicate = getattr(expr, "predicate", None)
    if predicate is not None:
        pid = _predicate_pid(predicate)
        if pid:
            value_expr = getattr(expr, "valueExpr", None)
            qids.extend(re.findall(r"Q\d+", str(value_expr)))
            if not qids:
                ref = getattr(value_expr, "reference", None)
                if ref is None and isinstance(value_expr, str):
                    ref = value_expr
                if ref is not None:
                    ref_shape = _resolve_shape_ref(ref)
                    if ref_shape is not None:
                        qids.extend(_collect_qids(getattr(ref_shape, "expression", None)))
        return qids
    expressions = getattr(expr, "expressions", None)
    if expressions is not None:
        for child in expressions:
            qids.extend(_collect_qids(child))
        return qids
    inner = getattr(expr, "expression", None)
    if inner is not None:
        qids.extend(_collect_qids(inner))
    return qids

def _walk_expr(expr, shape_id, hits):
    if expr is None:
        return
    predicate = getattr(expr, "predicate", None)
    if predicate is not None:
        pid = _predicate_pid(predicate)
        if pid:
            value_expr = getattr(expr, "valueExpr", None)
            qids = re.findall(r"Q\d+", str(value_expr))
            qids_from_ref = []
            if not qids:
                ref = getattr(value_expr, "reference", None)
                if ref is None and isinstance(value_expr, str):
                    ref = value_expr
                if ref is not None:
                    ref_shape = _resolve_shape_ref(ref)
                    if ref_shape is not None:
                        qids_from_ref = _collect_qids(getattr(ref_shape, "expression", None))
            hits.append({
                "shape": shape_id,
                "predicate": pid,
                "qids": qids,
                "qids_from_ref": qids_from_ref,
                "value_expr": str(value_expr),
            })
        return
    expressions = getattr(expr, "expressions", None)
    if expressions is not None:
        for child in expressions:
            _walk_expr(child, shape_id, hits)
        return
    inner = getattr(expr, "expression", None)
    if inner is not None:
        _walk_expr(inner, shape_id, hits)
        return

hits = []
start_shape = _resolve_shape_ref(getattr(shex_schema, "start", None))
if start_shape is not None:
    start_id = str(getattr(start_shape, "id", "<start>"))
    print(f"Start shape: {start_id}")
    _walk_expr(getattr(start_shape, "expression", None), start_id, hits)
else:
    print("Start shape not resolved.")

print("P31/P279 constraints from start shape (first 10):")
for item in hits[:10]:
    print(item)
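
The traversal above depends on a schema loaded through PyShEx. The same idea can be sketched on a hand-written ShExJ-style dictionary, which makes the walk easier to follow in isolation. Everything here is an assumption for illustration: the sample schema, the `http://example.org/` shape IDs, and the `classification_constraints` helper are hypothetical, and the reference handling is simplified to a single level (a shape reference that points at a node constraint with a value set).

```python
WDT = "http://www.wikidata.org/prop/direct/"
WD = "http://www.wikidata.org/entity/"

# Hypothetical ShExJ-style schema; shape IDs and values are illustrative only.
SAMPLE_SHEXJ = {
    "type": "Schema",
    "start": "http://example.org/item",
    "shapes": [
        {
            "type": "Shape",
            "id": "http://example.org/item",
            "expression": {
                "type": "EachOf",
                "expressions": [
                    {
                        "type": "TripleConstraint",
                        "predicate": WDT + "P31",
                        "valueExpr": {"type": "NodeConstraint", "values": [WD + "Q5"]},
                    },
                    {
                        "type": "TripleConstraint",
                        "predicate": WDT + "P279",
                        # A string valueExpr is a reference to another shape
                        "valueExpr": "http://example.org/classes",
                    },
                    {"type": "TripleConstraint", "predicate": WDT + "P569"},
                ],
            },
        },
        {
            "type": "NodeConstraint",
            "id": "http://example.org/classes",
            "values": [WD + "Q215627"],
        },
    ],
}


def classification_constraints(schema: dict) -> dict:
    """Collect QIDs constrained by P31/P279 on the start shape."""
    shapes = {s["id"]: s for s in schema.get("shapes", []) if "id" in s}
    found = {"P31": [], "P279": []}

    def qids_of(value_expr):
        # Follow one level of shape reference, then read the value set
        if isinstance(value_expr, str):
            value_expr = shapes.get(value_expr, {})
        values = value_expr.get("values", []) if isinstance(value_expr, dict) else []
        return [v[len(WD):] for v in values if isinstance(v, str) and v.startswith(WD)]

    def walk(expr):
        if not isinstance(expr, dict):
            return
        if expr.get("type") == "TripleConstraint":
            pid = expr.get("predicate", "").rsplit("/", 1)[-1]
            if pid in found:
                found[pid].extend(qids_of(expr.get("valueExpr")))
            return
        for child in expr.get("expressions", []):
            walk(child)

    start = shapes.get(schema.get("start"), {})
    walk(start.get("expression"))
    return found


print(classification_constraints(SAMPLE_SHEXJ))
```

Comparing the last path segment of the predicate against exact property IDs keeps properties such as P310 from being mistaken for P31.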

In [None]:
from pathlib import Path

from gkc import RecipeBuilder

# Entity type is now optional - can be auto-detected from schema constraints
EXPLICIT_ENTITY_TYPE = "Q7840353"  # Optional: override schema constraint
OUTPUT_PATH = Path(f"temp/generated/wikidata_barrel_recipe_{EID}.json")  # follows the EID set at the top

builder = RecipeBuilder(schema_text=schema_text, user_agent="gkc-notebook/0.1 (example)")

## 3. Generate Barrel Recipe from ShEx
Convert the schema into a GKC Wikidata Barrel Recipe structure.

In [None]:
# Build recipe - entity_type is optional, uses schema constraints if not provided
# Option 1: Let it use constraints from schema
recipe_auto = builder.finalize_recipe()

# Option 2: Override with explicit entity type
recipe_explicit = builder.finalize_recipe(entity_type=EXPLICIT_ENTITY_TYPE)

# Use the explicit version for the rest of the notebook
recipe = recipe_explicit

print("Recipe metadata:")
for key, value in recipe["metadata"].items():
    if isinstance(value, (str, int)):
        print(f"  {key}: {value}")
    elif isinstance(value, list):
        if key in ['target_entity_types', 'schema_instance_of', 'schema_subclass_of']:
            print(f"  {key}: {value}")

print(f"\nTotal claims in recipe: {len(recipe['mappings']['claims'])}")

In [None]:
recipe

In [None]:
print("generated_date:", recipe["metadata"]["generated_date"])
type(recipe["metadata"]["generated_date"])

In [None]:
print("builder.validator.eid:", builder.validator.eid)

## 4. Inspect and Serialize Barrel Recipe
Preview a few mappings, then write the full recipe to JSON for inspection.

In [None]:
import json

preview = recipe["mappings"]["claims"][:5]
print(json.dumps(preview, indent=2))

OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)
OUTPUT_PATH.write_text(json.dumps(recipe, indent=2))
print("Wrote:", OUTPUT_PATH.resolve())
print("Bytes:", OUTPUT_PATH.stat().st_size)

## 5. Classification Extraction and Recipe Building

The RecipeBuilder now parses ShExC into ShExJ and walks the schema start shape (plus its subshapes) to extract P31 (instance of) and P279 (subclass of) constraints. These flow into the recipe metadata as:

- `target_entity_type` or `target_entity_types` (if explicit entity_type provided or extracted from schema)
- `schema_instance_of` (QIDs extracted from P31 constraints)
- `schema_subclass_of` (QIDs extracted from P279 constraints)

The `entity_type` parameter is now optional; if not provided, the builder uses the classifications extracted from the schema start shape.
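
One plausible reading of that fallback can be sketched in a few lines. This is not gkc's implementation: `sketch_metadata` is a hypothetical function, and the rule it uses for choosing between `target_entity_type` and `target_entity_types` is an assumption, since the notebook does not show how `RecipeBuilder` decides.

```python
def sketch_metadata(instance_of, subclass_of, entity_type=None):
    """Hypothetical illustration of the explicit-vs-schema fallback."""
    metadata = {
        "schema_instance_of": list(instance_of),
        "schema_subclass_of": list(subclass_of),
    }
    if entity_type is not None:
        # Assumption: an explicit value wins and is stored as a single QID
        metadata["target_entity_type"] = entity_type
    elif instance_of:
        # Assumption: otherwise the schema's P31 QIDs become the targets
        metadata["target_entity_types"] = list(instance_of)
    return metadata


# Sample QIDs are placeholders for whatever the schema walk returns
print(sketch_metadata(["Q5"], []))
print(sketch_metadata(["Q5"], [], entity_type="Q7840353"))
```

In both cases the schema-derived lists are kept, so a reader of the recipe can still see what the schema itself required even when an explicit override was supplied.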

Note: `source_field` values are auto-generated in the form `p###_value`. Customize them to match the Unified Still Schema as it evolves; doing so is a starting point for the work in issue #9. This recipe is the foundation for building the mapping from your source data (Still Schema) to Wikidata items.
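
Customizing the auto-generated names could look roughly like the sketch below. It assumes that each entry in `recipe["mappings"]["claims"]` is a dictionary with a `source_field` key; the `property` key, the sample recipe, the target field name `date_of_birth`, and the `remap_source_fields` helper are all hypothetical and are not part of gkc.

```python
import copy


def remap_source_fields(recipe, field_map):
    """Return a copy of the recipe with selected source_field values renamed."""
    updated = copy.deepcopy(recipe)  # leave the generated recipe untouched
    for claim in updated["mappings"]["claims"]:
        current = claim.get("source_field")
        if current in field_map:
            claim["source_field"] = field_map[current]
    return updated


# Hypothetical, trimmed-down recipe; a real one carries more keys per claim
sample_recipe = {
    "mappings": {
        "claims": [
            {"property": "P31", "source_field": "p31_value"},
            {"property": "P569", "source_field": "p569_value"},
        ]
    }
}

customized = remap_source_fields(sample_recipe, {"p569_value": "date_of_birth"})
print([c["source_field"] for c in customized["mappings"]["claims"]])
```

Working on a deep copy keeps the generated recipe available for comparison, which helps when the Still Schema field names change and the mapping has to be redone.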