# Integration with linkml-validate

This tutorial demonstrates how to use linkml-term-validator plugins with the standard `linkml-validate` command via configuration files.

This approach is ideal when you want to:
- Use the standard LinkML validation workflow
- Combine ontology validation with other LinkML validators
- Configure validation in YAML rather than code
- Integrate into existing LinkML projects

## Setup

Create a temporary directory for our example files:

In [None]:
%%bash
# Create working directory
rm -rf /tmp/linkml-validate-demo
mkdir -p /tmp/linkml-validate-demo
cd /tmp/linkml-validate-demo
pwd
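
The later cells assume that the `linkml-validate` command is on the `PATH`. A minimal preflight check, sketched with the standard library only, might look like the following; `find_missing_tools` is a hypothetical helper name, not part of LinkML or linkml-term-validator.

```python
import shutil


def find_missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Hypothetical preflight check: the tutorial only needs this one command.
missing = find_missing_tools(["linkml-validate"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("linkml-validate is available")
```

If the command is reported missing, installing the packages named in the CI example of Part 5 (`pip install linkml linkml-term-validator`) should provide it.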

## Part 1: Basic linkml-validate Integration

### Create a Schema

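First, create a LinkML schema with a dynamic enum, that is, an enum whose permissible values are defined by an ontology query (`reachable_from`) instead of being listed by hand: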

In [None]:
%%bash
cd /tmp/linkml-validate-demo

cat > schema.yaml << 'EOF'
id: https://example.org/gene-annotation
name: gene-annotation

prefixes:
  GO: http://purl.obolibrary.org/obo/GO_
  linkml: https://w3id.org/linkml/

default_prefix: gene-annotation
default_range: string

imports:
  - linkml:types

classes:
  GeneAnnotation:
    attributes:
      gene_id:
        identifier: true
        required: true
      gene_name:
        required: true
      biological_process:
        range: BiologicalProcessEnum
        required: true

enums:
  BiologicalProcessEnum:
    description: Any biological process term from GO
    reachable_from:
      source_ontology: sqlite:obo:go
      source_nodes:
        - GO:0008150  # biological_process
      relationship_types:
        - rdfs:subClassOf
EOF

echo "✅ Schema created"
cat schema.yaml
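
To make the idea behind `reachable_from` concrete, here is a toy sketch of a descendant query over `rdfs:subClassOf` edges. The edge table is hand-written and deliberately collapses the intermediate GO terms, so it is an illustration rather than the real GO hierarchy, and `descendants` is a hypothetical helper, not the plugin's implementation. The root term itself is excluded in this sketch.

```python
from collections import deque

# Hand-written toy edges (child -> parents). Intermediate GO terms are
# collapsed, so this is NOT the real GO graph, only an illustration.
TOY_SUBCLASS_OF = {
    "GO:0007049": ["GO:0008150"],  # cell cycle -> biological_process
    "GO:0006915": ["GO:0008150"],  # apoptotic process -> biological_process
    "GO:0005634": ["GO:0005575"],  # nucleus -> cellular_component
}


def descendants(root, edges):
    """Return every term that reaches `root` by following child -> parent edges."""
    # Invert the table so we can walk from a parent down to its children.
    children = {}
    for child, parents in edges.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)

    seen = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


allowed = descendants("GO:0008150", TOY_SUBCLASS_OF)
print(sorted(allowed))
print("GO:0005634" in allowed)
```

In the real workflow the edges come from the ontology named in `source_ontology`, so the enum tracks the ontology without any hand-maintained value list.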

### Create Test Data

Create both valid and invalid data to demonstrate validation:

In [None]:
%%bash
cd /tmp/linkml-validate-demo

# Valid data
cat > valid_data.yaml << 'EOF'
- gene_id: BRCA1
  gene_name: "breast cancer 1"
  biological_process: GO:0007049  # cell cycle - valid biological process

- gene_id: TP53
  gene_name: "tumor protein p53"
  biological_process: GO:0006915  # apoptotic process - valid biological process
EOF

echo "✅ Valid data created"
cat valid_data.yaml

In [None]:
%%bash
cd /tmp/linkml-validate-demo

# Invalid data (using cellular component instead of biological process)
cat > invalid_data.yaml << 'EOF'
- gene_id: EGFR
  gene_name: "epidermal growth factor receptor"
  biological_process: GO:0005634  # nucleus - INVALID (cellular component, not process)

- gene_id: MYC
  gene_name: "MYC proto-oncogene"
  biological_process: GO:0007049  # cell cycle - valid
EOF

echo "✅ Invalid data created"
cat invalid_data.yaml

### Create linkml-validate Configuration

This is the key step: create a configuration file that tells `linkml-validate` to load the linkml-term-validator plugins.

In [None]:
%%bash
cd /tmp/linkml-validate-demo

cat > validation_config.yaml << 'EOF'
# Configuration file for linkml-validate
#
# Usage: linkml-validate --config validation_config.yaml

# Schema to validate against
schema: schema.yaml

# Target class (optional - if omitted, linkml-validate tries to infer it from the schema)
target_class: GeneAnnotation

# Data sources to validate
data_sources:
  - valid_data.yaml

# Validation plugins to use
plugins:
  # Standard JSON Schema validation (built-in to LinkML)
  JsonschemaValidationPlugin:
    closed: true

  # Dynamic enum validation from linkml-term-validator
  # Full module path required: linkml_term_validator.plugins.DynamicEnumPlugin
  "linkml_term_validator.plugins.DynamicEnumPlugin":
    oak_adapter_string: "sqlite:obo:"
    cache_labels: true
    cache_dir: cache
EOF

echo "✅ Validation config created"
cat validation_config.yaml
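
The plugin keys in this config are dotted Python paths. A common way to turn such a string into a class is to import the module part and look up the final name on it. The sketch below shows that general technique; it is an assumption about the mechanism, not a description of how `linkml-validate` is implemented, and `load_by_dotted_path` is a hypothetical helper.

```python
import importlib


def load_by_dotted_path(dotted_path):
    """Import `package.module.Name` and return the object called `Name`."""
    module_path, _, attr_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"expected a dotted path, got {dotted_path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


# Demonstrated with a standard-library class so the cell runs anywhere.
ordered_dict_cls = load_by_dotted_path("collections.OrderedDict")
print(ordered_dict_cls.__name__)
```

This also suggests why the package providing the plugins has to be installed in the same environment as `linkml-validate`: the module must be importable by name.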

### Run linkml-validate

Now we can use the standard `linkml-validate` command with our configuration:

In [None]:
%%bash
cd /tmp/linkml-validate-demo

echo "Validating with linkml-validate..."
linkml-validate --config validation_config.yaml && echo "✅ Validation passed!"

### Test with Invalid Data

Now create a second config that points at the invalid data. The EGFR record should be rejected because GO:0005634 (nucleus) is a cellular component, not a biological process:

In [None]:
%%bash
cd /tmp/linkml-validate-demo

# Same settings as before, but pointing at the invalid data
cat > validation_config_invalid.yaml << 'EOF'
schema: schema.yaml
target_class: GeneAnnotation

data_sources:
  - invalid_data.yaml

plugins:
  JsonschemaValidationPlugin:
    closed: true

  "linkml_term_validator.plugins.DynamicEnumPlugin":
    oak_adapter_string: "sqlite:obo:"
    cache_labels: true
    cache_dir: cache
EOF

echo "Testing with invalid data..."
linkml-validate --config validation_config_invalid.yaml || echo "⚠️  Validation failed as expected"
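
The `&&` and `||` in the cells above rely on the exit status of `linkml-validate`. The same check can be made from Python, which is handy when the result feeds into further logic. This is a minimal sketch: `run_command` is a hypothetical wrapper around `subprocess.run`, and the real call is skipped unless the CLI and the demo directory both exist.

```python
import shutil
import subprocess
from pathlib import Path


def run_command(cmd, cwd=None):
    """Run `cmd` and return (passed, combined_output) without raising on failure."""
    result = subprocess.run(
        cmd, cwd=cwd, capture_output=True, text=True, check=False
    )
    return result.returncode == 0, result.stdout + result.stderr


DEMO_DIR = Path("/tmp/linkml-validate-demo")

# Only attempt the real validation when the CLI and demo files are present.
if shutil.which("linkml-validate") and DEMO_DIR.is_dir():
    passed, output = run_command(
        ["linkml-validate", "--config", "validation_config_invalid.yaml"],
        cwd=DEMO_DIR,
    )
    print("passed" if passed else "failed")
    print(output)
```

Capturing the output instead of letting it print directly makes it possible to log or parse the reported problems later.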

## Part 2: Adding Binding Validation

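Next, a more complex example with binding constraints. A binding constrains a field inside a nested object: here the `id` of each `go_term` must be a member of `BiologicalProcessEnum`: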

In [None]:
%%bash
cd /tmp/linkml-validate-demo

cat > schema_with_bindings.yaml << 'EOF'
id: https://example.org/gene-annotation-advanced
name: gene-annotation-advanced

prefixes:
  GO: http://purl.obolibrary.org/obo/GO_
  linkml: https://w3id.org/linkml/

default_prefix: gene-annotation-advanced
default_range: string

imports:
  - linkml:types

classes:
  GeneAnnotation:
    attributes:
      gene_id:
        identifier: true
        required: true
      gene_name:
        required: true
      go_term:
        range: GOTerm
        required: true
        bindings:
          - binds_value_of: id
            range: BiologicalProcessEnum

  GOTerm:
    attributes:
      id:
        identifier: true
        required: true
      label:
        required: true

enums:
  BiologicalProcessEnum:
    description: Any biological process term from GO
    reachable_from:
      source_ontology: sqlite:obo:go
      source_nodes:
        - GO:0008150
      relationship_types:
        - rdfs:subClassOf
EOF

echo "✅ Schema with bindings created"

In [None]:
%%bash
cd /tmp/linkml-validate-demo

cat > data_with_terms.yaml << 'EOF'
- gene_id: BRCA1
  gene_name: "breast cancer 1"
  go_term:
    id: GO:0007049
    label: "cell cycle"

- gene_id: TP53
  gene_name: "tumor protein p53"
  go_term:
    id: GO:0006915
    label: "apoptotic process"
EOF

echo "✅ Data with GO terms created"

In [None]:
%%bash
cd /tmp/linkml-validate-demo

cat > validation_config_bindings.yaml << 'EOF'
schema: schema_with_bindings.yaml
target_class: GeneAnnotation

data_sources:
  - data_with_terms.yaml

plugins:
  JsonschemaValidationPlugin:
    closed: true

  # Dynamic enum validation
  "linkml_term_validator.plugins.DynamicEnumPlugin":
    oak_adapter_string: "sqlite:obo:"
    cache_labels: true
    cache_dir: cache

  # Binding validation with label checking
  "linkml_term_validator.plugins.BindingValidationPlugin":
    oak_adapter_string: "sqlite:obo:"
    validate_labels: true  # Also check labels match ontology
    cache_labels: true
    cache_dir: cache
EOF

echo "✅ Validation config with bindings created"
cat validation_config_bindings.yaml
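
Conceptually, the binding check with `validate_labels: true` asks two questions per record: is the nested `id` in the bound enum, and does the supplied `label` agree with the ontology? The sketch below mimics that logic with hand-written stand-ins for the ontology lookups. The exact-match comparison and the helper name `check_go_term` are assumptions for illustration; the real plugin may compare labels differently.

```python
# Toy stand-ins for what the ontology would supply (assumed, hand-written).
ALLOWED_IDS = {"GO:0007049", "GO:0006915"}
ONTOLOGY_LABELS = {
    "GO:0007049": "cell cycle",
    "GO:0006915": "apoptotic process",
}


def check_go_term(record, validate_labels=True):
    """Return a list of problem messages for one GeneAnnotation-shaped dict."""
    problems = []
    term = record["go_term"]
    term_id = term["id"]
    if term_id not in ALLOWED_IDS:
        problems.append(f"{record['gene_id']}: {term_id} is not in the bound enum")
    elif validate_labels and term.get("label") != ONTOLOGY_LABELS[term_id]:
        problems.append(
            f"{record['gene_id']}: label {term.get('label')!r} does not match "
            f"{ONTOLOGY_LABELS[term_id]!r}"
        )
    return problems


records = [
    {"gene_id": "BRCA1", "gene_name": "breast cancer 1",
     "go_term": {"id": "GO:0007049", "label": "cell cycle"}},
    # Deliberately wrong label to show what the label check would catch.
    {"gene_id": "TP53", "gene_name": "tumor protein p53",
     "go_term": {"id": "GO:0006915", "label": "apoptosis"}},
]
for record in records:
    print(record["gene_id"], check_go_term(record) or "ok")
```

Passing `validate_labels=False` to the sketch accepts the second record, which mirrors the effect of leaving the option off: only the identifier is checked.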

In [None]:
%%bash
cd /tmp/linkml-validate-demo

echo "Validating with both dynamic enums and bindings..."
linkml-validate --config validation_config_bindings.yaml || echo "⚠️  May encounter issues with complex binding validation"

## Part 3: Using Custom OAK Configuration

For more control over ontology access, create an OAK config file that maps each CURIE prefix to an adapter; an empty string skips validation for that prefix:

In [None]:
%%bash
cd /tmp/linkml-validate-demo

cat > oak_config.yaml << 'EOF'
# OAK adapter configuration
# Controls which ontology adapters to use for different prefixes

ontology_adapters:
  # Use local SQLite databases for GO
  GO: sqlite:obo:go
  
  # Skip validation for these prefixes
  linkml: ""
  schema: ""
EOF

echo "✅ OAK config created"
cat oak_config.yaml
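
The lookup this config implies can be sketched as follows: take the prefix of a CURIE, find it in the mapping, and treat an empty string as "skip". The helper `adapter_for` is hypothetical, and what the real plugins do with a prefix that is not listed at all is not covered by this tutorial, so the sketch simply reports it as unmapped.

```python
# Mirrors the mapping in oak_config.yaml; an empty string means "skip".
ONTOLOGY_ADAPTERS = {"GO": "sqlite:obo:go", "linkml": "", "schema": ""}


def adapter_for(curie, adapters, default=None):
    """Return the adapter string for a CURIE, "" to skip it, or `default` if unmapped."""
    prefix, sep, _local_id = curie.partition(":")
    if not sep:
        raise ValueError(f"not a CURIE: {curie!r}")
    return adapters.get(prefix, default)


for curie in ["GO:0007049", "linkml:types", "CHEBI:15377"]:
    adapter = adapter_for(curie, ONTOLOGY_ADAPTERS)
    if adapter is None:
        status = "unmapped"
    elif adapter == "":
        status = "skipped"
    else:
        status = adapter
    print(curie, "->", status)
```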

In [None]:
%%bash
cd /tmp/linkml-validate-demo

cat > validation_config_with_oak.yaml << 'EOF'
schema: schema_with_bindings.yaml
target_class: GeneAnnotation

data_sources:
  - data_with_terms.yaml

plugins:
  JsonschemaValidationPlugin:
    closed: true

  "linkml_term_validator.plugins.DynamicEnumPlugin":
    oak_adapter_string: "sqlite:obo:"
    oak_config_path: oak_config.yaml  # Use custom OAK config
    cache_labels: true
    cache_dir: cache

  "linkml_term_validator.plugins.BindingValidationPlugin":
    oak_adapter_string: "sqlite:obo:"
    oak_config_path: oak_config.yaml  # Use custom OAK config
    validate_labels: true
    cache_labels: true
    cache_dir: cache
EOF

echo "✅ Validation config with OAK config created"

In [None]:
%%bash
cd /tmp/linkml-validate-demo

echo "Validating with custom OAK configuration..."
linkml-validate --config validation_config_with_oak.yaml || echo "⚠️  May encounter issues with complex binding validation"

## Part 4: Inspecting the Cache

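With `cache_labels: true`, the plugins write the ontology labels they look up to the directory named by `cache_dir` (here `cache/`), so repeated runs can reuse them: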

In [None]:
%%bash
cd /tmp/linkml-validate-demo

echo "Cache contents:"
find cache -type f

In [None]:
%%bash
cd /tmp/linkml-validate-demo

echo "Cached GO terms:"
cat cache/go/terms.csv
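
For a quick summary of the cache instead of the full dump, the file can be read with Python's `csv` module. The sketch below makes no assumption about the column names; it only assumes the file starts with a header row, which is itself an assumption about the cache format. `summarize_csv` is a hypothetical helper.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file with a header row."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        row_count = sum(1 for _ in reader)
        return list(reader.fieldnames or []), row_count


cache_file = Path("/tmp/linkml-validate-demo/cache/go/terms.csv")
if cache_file.exists():
    columns, rows = summarize_csv(cache_file)
    print(f"{rows} cached rows; columns: {columns}")
else:
    print(f"{cache_file} not found; run the validation cells first")
```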

## Part 5: CI/CD Integration Example

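Here's how you might use this in a GitHub Actions workflow. In a real repository the file would live under `.github/workflows/`; the cell below writes it to the demo directory only for illustration: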

In [None]:
%%bash
cd /tmp/linkml-validate-demo

cat > .github-workflow-example.yml << 'EOF'
name: Validate Data

on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install linkml linkml-term-validator

      - name: Cache ontology labels
        uses: actions/cache@v3
        with:
          path: cache
          key: ontology-cache-${{ hashFiles('oak_config.yaml') }}

      - name: Validate data
        run: |
          linkml-validate --config validation_config.yaml
EOF

echo "✅ Example GitHub Actions workflow created"
cat .github-workflow-example.yml

## Summary

### Key Points

1. **Configuration File**: Use a YAML config file to specify schema, data sources, and plugins

2. **Plugin Specification**: Use full module paths in quotes:
   - `"linkml_term_validator.plugins.DynamicEnumPlugin"`
   - `"linkml_term_validator.plugins.BindingValidationPlugin"`

3. **Plugin Options**: Configure each plugin with YAML parameters:
   - `oak_adapter_string`: Ontology adapter to use
   - `oak_config_path`: Path to custom OAK configuration
   - `cache_labels`: Enable/disable caching
   - `cache_dir`: Cache directory location
   - `validate_labels`: (BindingValidationPlugin only) Check labels match ontology

4. **Combine Validators**: Mix linkml-term-validator plugins with standard LinkML validators like `JsonschemaValidationPlugin`

5. **Command**: Run with `linkml-validate --config validation_config.yaml`

### When to Use This Approach

Use `linkml-validate` with config files when:
- You want a declarative configuration approach
- You're already using LinkML's validation ecosystem
- You need to combine multiple validation plugins
- You want easy CI/CD integration

Use the `linkml-term-validator` CLI when:
- You only need ontology term validation
- You want simpler, more focused commands
- You're primarily validating schemas (not data)

Use the Python API when:
- You need programmatic control
- You're building custom validation workflows
- You want to integrate validation into Python applications

## Cleanup

In [None]:
%%bash
rm -rf /tmp/linkml-validate-demo
echo "✅ Cleanup complete"