# Advanced Usage: Validating Data with LinkML Schemas

This tutorial demonstrates how to use `linkml-reference-validator validate data` to validate supporting text in structured data files against their cited references.

## What is `validate data`?

While `validate text` checks a single quote, `validate data` validates entire data files:
- Reads YAML/JSON data files
- Uses LinkML schemas to identify fields containing supporting text
- Validates all supporting text claims in batch
- Integrates with linkml-validate for complete data validation

## Part 1: Create a LinkML Schema

First, let's create a schema that defines our data model. We'll use special slot URIs to mark which fields contain supporting text:
- `linkml:excerpt` - The field contains quoted text
- `linkml:authoritative_reference` - The field contains the reference ID

In [None]:
%%bash
cat > schema.yaml << 'EOF'
id: https://example.org/gene-functions
name: gene-functions
title: Gene Function Annotations
description: Schema for gene function claims with supporting evidence

prefixes:
  linkml: https://w3id.org/linkml/
  example: https://example.org/

default_prefix: example
default_range: string

classes:
  GeneFunction:
    description: A gene function annotation with supporting evidence
    attributes:
      gene_symbol:
        description: Gene symbol (e.g., MUC1, BRCA1)
        identifier: true
      function:
        description: Functional description of the gene
        required: true
      supporting_text:
        description: Quoted text from publication supporting this function
        slot_uri: linkml:excerpt
        required: true
      reference:
        description: Reference ID (e.g., PMID:12345678)
        slot_uri: linkml:authoritative_reference
        required: true
EOF

echo "✅ Created schema.yaml"
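
To make the role of the two slot URIs concrete, the sketch below scans a plain Python dictionary that mirrors the `GeneFunction` attributes above and reports which fields are marked as excerpt and as reference. It only illustrates the idea and is not how the validator itself is implemented; the helper name `find_evidence_fields` and the `ATTRIBUTES` dictionary are hypothetical.

```python
# Illustrative only: a dict mirroring the attributes declared in schema.yaml.
ATTRIBUTES = {
    "gene_symbol": {"identifier": True},
    "function": {"required": True},
    "supporting_text": {"slot_uri": "linkml:excerpt", "required": True},
    "reference": {"slot_uri": "linkml:authoritative_reference", "required": True},
}

EXCERPT_URI = "linkml:excerpt"
REFERENCE_URI = "linkml:authoritative_reference"


def find_evidence_fields(attributes):
    """Return (excerpt_fields, reference_fields) based on slot_uri markers."""
    excerpts = [name for name, spec in attributes.items()
                if spec.get("slot_uri") == EXCERPT_URI]
    references = [name for name, spec in attributes.items()
                  if spec.get("slot_uri") == REFERENCE_URI]
    return excerpts, references


excerpt_fields, reference_fields = find_evidence_fields(ATTRIBUTES)
print("excerpt fields:  ", excerpt_fields)
print("reference fields:", reference_fields)
```

Because the marker lives on the slot URI, the lookup in this sketch never inspects the field names themselves.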

## Part 2: Create Data with Real Citations

Now let's create data with real supporting text from actual papers:

In [None]:
%%bash
cat > gene_functions.yaml << 'EOF'
# Real gene function annotations with real citations

- gene_symbol: MUC1
  function: oncoprotein that blocks nuclear targeting of c-Abl
  supporting_text: MUC1 oncoprotein blocks nuclear targeting of c-Abl
  reference: PMID:16888623

- gene_symbol: MUC1  
  function: involved in apoptotic response to DNA damage
  supporting_text: blocks nuclear targeting of c-Abl in the apoptotic response to DNA damage
  reference: PMID:16888623

- gene_symbol: MUC1
  function: interacts with c-Abl tyrosine kinase
  supporting_text: MUC1 oncoprotein blocks nuclear targeting of c-Abl
  reference: PMID:16888623
EOF

echo "✅ Created gene_functions.yaml with 3 annotations"

## Part 3: Validate the Data (Success Case)

All these quotes come from the same paper (PMID:16888623). The tool will:
1. Fetch the reference from PubMed (or use cached copy)
2. Validate each supporting text quote
3. Report any mismatches

In [None]:
%%bash
linkml-reference-validator validate data \
  gene_functions.yaml \
  --schema schema.yaml \
  --target-class GeneFunction \
  && echo "✅ All validations passed!"
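
The three steps listed above can be pictured with a small, self-contained sketch. It is not the validator's actual algorithm: the normalization (lowercasing and collapsing whitespace) is an assumption, `REFERENCE_TEXT` is a stand-in string assembled from the quotes used in this tutorial rather than fetched content, and `check_records` is a hypothetical helper.

```python
import re

# Stand-in for the fetched or cached reference text (assumption: built from
# the quotes used in this tutorial, not retrieved from PubMed).
REFERENCE_TEXT = (
    "MUC1 oncoprotein blocks nuclear targeting of c-Abl "
    "in the apoptotic response to DNA damage"
)

RECORDS = [
    {"gene_symbol": "MUC1",
     "supporting_text": "MUC1 oncoprotein blocks nuclear targeting of c-Abl"},
    # A quote that does not occur in the stand-in text.
    {"gene_symbol": "MUC1",
     "supporting_text": "MUC1 activates the JAK-STAT pathway"},
]


def normalize(text):
    """Lowercase and collapse runs of whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def check_records(records, reference_text):
    """Return one (supporting_text, found) pair per record."""
    haystack = normalize(reference_text)
    return [(r["supporting_text"], normalize(r["supporting_text"]) in haystack)
            for r in records]


results = check_records(RECORDS, REFERENCE_TEXT)
for quote, found in results:
    print("OK      " if found else "MISMATCH", quote)
```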

## Part 4: Create Data with Errors

Let's create data where some supporting text doesn't match the references:

In [None]:
%%bash
cat > bad_annotations.yaml << 'EOF'
- gene_symbol: MUC1
  function: activates JAK-STAT signaling
  supporting_text: MUC1 activates the JAK-STAT pathway
  reference: PMID:16888623
  # This text does NOT appear in PMID:16888623

- gene_symbol: MUC1
  function: suppresses immune response  
  supporting_text: MUC1 inhibits T cell activation
  reference: PMID:16888623
  # This text also does NOT appear in the paper
EOF

echo "✅ Created bad_annotations.yaml with intentional errors"

## Part 5: Validate Invalid Data (Failure Cases)

In [None]:
%%bash
linkml-reference-validator validate data \
  bad_annotations.yaml \
  --schema schema.yaml \
  --target-class GeneFunction \
  || echo "❌ Validation failed as expected - supporting text not found"
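
The `||` in the cell above relies on the command exiting with a non-zero status when validation fails. If you drive the validator from a Python script instead, the same signal is available as the process return code. The sketch below is a minimal example under that assumption; `validation_passed` is a hypothetical helper, and the demonstration calls use a stand-in command (the Python interpreter itself) so that the block runs without the validator installed.

```python
import subprocess
import sys


def validation_passed(command):
    """Run a command and report whether it exited with status 0."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode == 0


# The real invocation, using only the arguments shown in this tutorial.
VALIDATE_BAD = [
    "linkml-reference-validator", "validate", "data",
    "bad_annotations.yaml",
    "--schema", "schema.yaml",
    "--target-class", "GeneFunction",
]

# Stand-in commands that exit with status 0 and status 1.
succeeds = validation_passed([sys.executable, "-c", "raise SystemExit(0)"])
fails = validation_passed([sys.executable, "-c", "raise SystemExit(1)"])
print("stand-in success:", succeeds)
print("stand-in failure:", fails)
```

Once the tool is on your `PATH`, calling `validation_passed(VALIDATE_BAD)` would replace the stand-in commands.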

## Part 6: Using Editorial Notes and Ellipsis in Data

The same syntax used with `validate text` also works in data files: `[...]` marks an editorial note and `...` marks omitted text.

In [None]:
%%bash
cat > annotations_with_edits.yaml << 'EOF'
- gene_symbol: MUC1
  function: oncoprotein blocking c-Abl nuclear targeting
  supporting_text: MUC1 [mucin 1] oncoprotein blocks nuclear targeting of c-Abl
  reference: PMID:16888623
  # Editorial note [mucin 1] is ignored during validation

- gene_symbol: MUC1
  function: involved in apoptosis and DNA damage response
  supporting_text: MUC1 oncoprotein ... apoptotic response to DNA damage
  reference: PMID:16888623
  # Ellipsis allows omitting middle text

- gene_symbol: MUC1
  function: blocks c-Abl function
  supporting_text: MUC1 [an oncoprotein] blocks nuclear targeting of c-Abl [a tyrosine kinase]
  reference: PMID:16888623
  # Multiple editorial notes work too
EOF

echo "✅ Created annotations_with_edits.yaml"

In [None]:
%%bash
linkml-reference-validator validate data \
  annotations_with_edits.yaml \
  --schema schema.yaml \
  --target-class GeneFunction \
  && echo "✅ Editorial notes and ellipsis handled correctly!"
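
One way to picture how editorial notes and ellipses could be handled is sketched below: bracketed notes are dropped, the quote is split at each `...`, and every remaining fragment must occur in the reference text in order. This is an assumption made for illustration, not the validator's documented behavior; `quote_matches` is a hypothetical helper and `REFERENCE_TEXT` is a stand-in assembled from the quotes in this tutorial.

```python
import re

# Stand-in reference text (assumption: built from this tutorial's quotes).
REFERENCE_TEXT = (
    "MUC1 oncoprotein blocks nuclear targeting of c-Abl "
    "in the apoptotic response to DNA damage"
)


def normalize(text):
    """Lowercase and collapse runs of whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_matches(quote, reference_text):
    """Check a quote that may contain [editorial notes] and ... ellipses."""
    without_notes = re.sub(r"\[[^\]]*\]", " ", quote)  # drop [notes]
    fragments = [normalize(part) for part in without_notes.split("...")]
    haystack = normalize(reference_text)
    position = 0
    for fragment in fragments:
        if not fragment:
            continue  # ignore empty pieces around a leading/trailing ellipsis
        index = haystack.find(fragment, position)
        if index == -1:
            return False
        position = index + len(fragment)  # later fragments must come after
    return True


for quote in [
    "MUC1 [mucin 1] oncoprotein blocks nuclear targeting of c-Abl",
    "MUC1 oncoprotein ... apoptotic response to DNA damage",
    "apoptotic response ... MUC1 oncoprotein",  # fragments out of order
]:
    print(quote_matches(quote, REFERENCE_TEXT), "|", quote)
```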

## Part 7: Verbose Output

Use `--verbose` to see detailed information about each validation:

In [None]:
%%bash
linkml-reference-validator validate data \
  gene_functions.yaml \
  --schema schema.yaml \
  --target-class GeneFunction \
  --verbose 2>&1 | head -40

## Part 9: Integration with LinkML Schema Validation

The reference validator is a LinkML plugin, so it works alongside other validation features.

Let's create a schema with additional constraints:

In [None]:
%%bash
cat > strict_schema.yaml << 'EOF'
id: https://example.org/strict-gene-functions
name: strict-gene-functions

prefixes:
  linkml: https://w3id.org/linkml/
  example: https://example.org/

default_prefix: example
default_range: string

classes:
  GeneFunction:
    attributes:
      gene_symbol:
        identifier: true
        pattern: "^[A-Z0-9]+$"  # Must be uppercase alphanumeric
      function:
        required: true
        pattern: "^.{10,}$"  # At least 10 characters (minimum_value applies to numbers, not string length)
      supporting_text:
        slot_uri: linkml:excerpt
        required: true
      reference:
        slot_uri: linkml:authoritative_reference
        required: true
        pattern: "^PMID:[0-9]+$"  # Must match PMID format
      confidence:
        range: float
        minimum_value: 0.0
        maximum_value: 1.0
EOF

echo "✅ Created strict_schema.yaml with validation constraints"

In [None]:
%%bash
cat > strict_data.yaml << 'EOF'
- gene_symbol: MUC1
  function: blocks nuclear targeting of c-Abl
  supporting_text: MUC1 oncoprotein blocks nuclear targeting of c-Abl
  reference: PMID:16888623
  confidence: 0.95
EOF

echo "✅ Created strict_data.yaml"
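
The `gene_symbol`, `reference`, and `confidence` constraints in `strict_schema.yaml` are ordinary pattern and range checks. As a rough illustration of what they accept and reject, the sketch below applies the same regular expressions and bounds to a record in plain Python. It mirrors only those three constraints and is not a substitute for LinkML validation; `constraint_errors` and the `bad` record are hypothetical.

```python
import re

# Patterns and bounds copied from strict_schema.yaml.
PATTERNS = {
    "gene_symbol": r"^[A-Z0-9]+$",
    "reference": r"^PMID:[0-9]+$",
}
CONFIDENCE_RANGE = (0.0, 1.0)


def constraint_errors(record):
    """Return a list of human-readable constraint violations for one record."""
    errors = []
    for field, pattern in PATTERNS.items():
        value = record.get(field, "")
        if not re.search(pattern, str(value)):
            errors.append(f"{field}={value!r} does not match {pattern}")
    confidence = record.get("confidence")
    if confidence is not None:  # confidence is optional in the schema
        low, high = CONFIDENCE_RANGE
        if not low <= confidence <= high:
            errors.append(f"confidence={confidence} is outside [{low}, {high}]")
    return errors


good = {"gene_symbol": "MUC1", "reference": "PMID:16888623", "confidence": 0.95}
bad = {"gene_symbol": "Muc1", "reference": "16888623", "confidence": 1.5}
print(constraint_errors(good))
for problem in constraint_errors(bad):
    print(problem)
```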

In [None]:
%%bash
# Validates BOTH the supporting text AND schema constraints
linkml-reference-validator validate data \
  strict_data.yaml \
  --schema strict_schema.yaml \
  --target-class GeneFunction \
  && echo "✅ All validations (reference text + schema) passed!"

## Part 10: Batch Validation

You can validate multiple files in a loop:

In [None]:
%%bash
# Validate multiple data files (output truncated to 5 lines per file)
echo "Validating all annotation files..."
for file in gene_functions.yaml annotations_with_edits.yaml; do
  # printf interprets \n; bash's builtin echo would print it literally
  printf '\nValidating %s...\n' "$file"
  linkml-reference-validator validate data \
    "$file" \
    --schema schema.yaml \
    --target-class GeneFunction | head -5
done

printf '\n✅ All files validated!\n'

## Part 11: Understanding the Cache

All fetched references are cached in `references_cache/`:

In [None]:
%%bash
# List all cached references
echo "Cached references:"
ls -lh references_cache/

In [None]:
%%bash
# Show structure of a cached reference
echo "Structure of cached reference PMID:16888623:"
head -25 references_cache/PMID_16888623.md
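
Judging from the file name used above, a reference ID such as `PMID:16888623` appears to map to a cache file by replacing the colon with an underscore and adding a `.md` extension. The sketch below encodes that assumption to check whether a reference is already cached. The naming rule is inferred from this single example and `cache_path` is a hypothetical helper, so verify it against your own `references_cache/` directory.

```python
from pathlib import Path

CACHE_DIR = Path("references_cache")


def cache_path(reference_id, cache_dir=CACHE_DIR):
    """Guess the cache file for a reference ID (assumed naming rule)."""
    return cache_dir / (reference_id.replace(":", "_") + ".md")


path = cache_path("PMID:16888623")
print(path, "(cached)" if path.exists() else "(not cached yet)")
```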

## CLI Help

In [None]:
%%bash
linkml-reference-validator validate data --help

## Summary

In this tutorial, we learned:

- **Schema design**: Use `linkml:excerpt` and `linkml:authoritative_reference` slot URIs
- **Batch validation**: Validate all supporting text in data files
- **Editorial notes**: `[...]` for clarifications in data
- **Ellipsis**: `...` for omitted text in quotes
- **Multiple references**: Tool handles different PMIDs automatically
- **Schema integration**: Works with LinkML validation constraints
- **Caching**: References cached automatically for reuse

## Next Steps

- **Tutorial 1**: Getting started with `validate text`
- **Tutorial 3**: Python API for programmatic usage
- [Full Documentation](https://monarch-initiative.github.io/linkml-reference-validator)

## Cleanup

In [None]:
%%bash
# Clean up example files
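# Remove only the files created in this tutorial; `*.yaml` would delete every YAML file here
rm -f schema.yaml strict_schema.yaml gene_functions.yaml bad_annotations.yaml \
  annotations_with_edits.yaml strict_data.yaml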
echo "✅ Cleaned up example files"