# Validating TSV/CSV Data

This tutorial shows how to use linkml-term-validator to validate ontology terms in **tabular data** stored as TSV or CSV files.

Many bioinformatics and data science workflows store flat, tabular data in TSV/CSV files. LinkML can validate these files against a schema, and linkml-term-validator adds a check that the ontology terms they contain are valid.

This notebook shows:
- Validating TSV files with ontology term columns
- Validating CSV files with ontology term columns
- Using linkml-validate for direct TSV/CSV validation
- Common patterns for tabular data validation

## Setup

Create a temporary directory for our example files:

In [None]:
%%bash
rm -rf /tmp/tsv-validation-demo
mkdir -p /tmp/tsv-validation-demo
cd /tmp/tsv-validation-demo
pwd

## Part 1: Schema for Tabular Data

Create a LinkML schema for flat, tabular gene annotation data.

**Key point**: For TSV/CSV data, keep your schema flat - avoid nested objects.

In [None]:
%%bash
cd /tmp/tsv-validation-demo

cat > gene_schema.yaml << 'EOF'
id: https://example.org/gene-annotation
name: gene-annotation

imports:
  - linkml:types

prefixes:
  GO: http://purl.obolibrary.org/obo/GO_
  rdfs: http://www.w3.org/2000/01/rdf-schema#
  linkml: https://w3id.org/linkml/
  gene-annotation: https://example.org/gene-annotation/

default_prefix: gene-annotation
default_range: string

classes:
  GeneAnnotation:
    tree_root: true
    attributes:
      gene_id:
        identifier: true
        required: true
      gene_name:
        required: true
      biological_process:
        range: BiologicalProcessEnum
        required: true
      evidence_code:

enums:
  BiologicalProcessEnum:
    description: Any biological process term from GO
    reachable_from:
      source_ontology: sqlite:obo:go
      source_nodes:
        - GO:0008150
      relationship_types:
        - rdfs:subClassOf
EOF

echo "✅ Schema created"
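
To make the "keep your schema flat" point concrete, here is a minimal Python sketch. The `is_flat` helper is hypothetical (it is not part of LinkML or linkml-term-validator); it only illustrates that a record fits in one table row when every value is a scalar.

```python
# Hypothetical helper, not part of LinkML: a record is "flat" when every
# value is a scalar, so it can be written as a single TSV/CSV row.
def is_flat(record: dict) -> bool:
    """Return True if no value is a dict, list, tuple, or set."""
    return not any(isinstance(v, (dict, list, tuple, set)) for v in record.values())


# Matches the GeneAnnotation class defined above.
flat_row = {
    "gene_id": "BRCA1",
    "gene_name": "breast cancer 1",
    "biological_process": "GO:0007049",
    "evidence_code": "IDA",
}

# A nested alternative (made up for contrast) that has no natural one-row form.
nested_row = {
    "gene_id": "BRCA1",
    "annotations": [{"term": "GO:0007049", "evidence_code": "IDA"}],
}

print(is_flat(flat_row))    # True
print(is_flat(nested_row))  # False
```

A record shaped like `nested_row` is better stored as YAML or JSON.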

## Part 2: Creating TSV Data

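Create two tab-separated values (TSV) files with gene annotations. The first contains only valid biological process terms; the second includes a cellular component term (GO:0005634, nucleus) that should fail validation: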

In [None]:
%%bash
cd /tmp/tsv-validation-demo

cat > genes_valid.tsv << 'EOF'
gene_id	gene_name	biological_process	evidence_code
BRCA1	breast cancer 1	GO:0007049	IDA
TP53	tumor protein p53	GO:0006915	IMP
EGFR	epidermal growth factor receptor	GO:0008283	TAS
MYC	MYC proto-oncogene	GO:0045893	IGI
EOF

echo "✅ Valid TSV created"
cat genes_valid.tsv
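
The following sketch shows how the header row ties columns to schema attributes: each data row becomes a flat mapping keyed by the header names. It uses only Python's standard `csv` module and an inlined copy of the first rows of `genes_valid.tsv`. It illustrates the row-to-record idea and makes no claim about how `linkml-validate` loads files internally.

```python
import csv
import io

# Inlined copy of the first rows of genes_valid.tsv so the sketch runs on its own.
TSV_TEXT = (
    "gene_id\tgene_name\tbiological_process\tevidence_code\n"
    "BRCA1\tbreast cancer 1\tGO:0007049\tIDA\n"
    "TP53\ttumor protein p53\tGO:0006915\tIMP\n"
)


def read_rows(text: str, delimiter: str = "\t") -> list[dict]:
    """Parse delimited text into one dict per row, keyed by the header names."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


rows = read_rows(TSV_TEXT)
for row in rows:
    print(row["gene_id"], "->", row["biological_process"])
```

Passing `delimiter=","` reads CSV text the same way.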

In [None]:
%%bash
cd /tmp/tsv-validation-demo

cat > genes_invalid.tsv << 'EOF'
gene_id	gene_name	biological_process	evidence_code
BRCA1	breast cancer 1	GO:0007049	IDA
EGFR	epidermal growth factor receptor	GO:0005634	TAS
EOF

echo "✅ Invalid TSV created (GO:0005634 is nucleus, not a process)"
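
Why GO:0005634 fails can be sketched with a toy reachability check. The edge table below is hand-written and heavily simplified (intermediate GO terms are omitted), so it is an assumption for illustration only, not the real GO graph and not the validator's implementation. For simplicity, the sketch also treats the root term itself as reachable.

```python
# Hand-written, simplified is-a edges (child -> parents).
# Intermediate GO terms are omitted; this is NOT the real GO graph.
TOY_PARENTS = {
    "GO:0007049": ["GO:0009987"],  # cell cycle -> cellular process
    "GO:0009987": ["GO:0008150"],  # cellular process -> biological_process
    "GO:0005634": ["GO:0005575"],  # nucleus -> cellular_component (simplified)
}


def is_reachable(term: str, root: str, parents: dict = TOY_PARENTS) -> bool:
    """Return True if `root` is `term` itself or one of its is-a ancestors."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current == root:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return False


print(is_reachable("GO:0007049", "GO:0008150"))  # True
print(is_reachable("GO:0005634", "GO:0008150"))  # False
```

In the real setup, the `reachable_from` block in the schema delegates this lookup to the ontology source configured there (`sqlite:obo:go`).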

## Part 3: Validating TSV with linkml-validate

Create a validation configuration that tells `linkml-validate` which schema, target class, data file, and plugins to use:

In [None]:
%%bash
cd /tmp/tsv-validation-demo

cat > validation_config.yaml << 'EOF'
schema: gene_schema.yaml
target_class: GeneAnnotation

data_sources:
  - genes_valid.tsv

plugins:
  JsonschemaValidationPlugin:
    closed: true

  "linkml_term_validator.plugins.DynamicEnumPlugin":
    oak_adapter_string: "sqlite:obo:"
    cache_labels: true
    cache_dir: cache
EOF

echo "✅ Validation config created"

In [None]:
%%bash
cd /tmp/tsv-validation-demo

echo "Validating valid TSV data..."
linkml-validate --config validation_config.yaml && echo "✅ Validation passed!"

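### Test with Invalid TSV Data

This configuration is identical to the previous one except that `data_sources` points to `genes_invalid.tsv`. Because GO:0005634 (nucleus) is not a descendant of GO:0008150 (biological_process), validation is expected to fail: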

In [None]:
%%bash
cd /tmp/tsv-validation-demo

cat > validation_config_invalid.yaml << 'EOF'
schema: gene_schema.yaml
target_class: GeneAnnotation

data_sources:
  - genes_invalid.tsv

plugins:
  JsonschemaValidationPlugin:
    closed: true

  "linkml_term_validator.plugins.DynamicEnumPlugin":
    oak_adapter_string: "sqlite:obo:"
    cache_labels: true
    cache_dir: cache
EOF

echo "Testing with invalid TSV data..."
linkml-validate --config validation_config_invalid.yaml || echo "⚠️  Validation failed as expected"

## Part 4: CSV Data

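CSV (comma-separated values) files are validated the same way as TSV; only the delimiter differs. (A field that itself contains a comma would need to be double-quoted; none of the fields below do.)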

In [None]:
%%bash
cd /tmp/tsv-validation-demo

cat > genes.csv << 'EOF'
gene_id,gene_name,biological_process,evidence_code
BRCA1,breast cancer 1,GO:0007049,IDA
TP53,tumor protein p53,GO:0006915,IMP
EGFR,epidermal growth factor receptor,GO:0008283,TAS
MYC,MYC proto-oncogene,GO:0045893,IGI
EOF

echo "✅ CSV created"
cat genes.csv
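
If a TSV file already exists, Python's standard `csv` module is one way to produce the CSV version, and it handles quoting automatically. This is a minimal sketch; the `DEMO1` row and its gene name are made up purely to show a field that contains a comma.

```python
import csv
import io


def tsv_to_csv(tsv_text: str) -> str:
    """Convert tab-separated text to comma-separated text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    # The default dialect quotes a field only when it needs it (QUOTE_MINIMAL).
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


# "DEMO1" / "example gene, variant 2" is invented to demonstrate quoting.
tsv_text = (
    "gene_id\tgene_name\tbiological_process\tevidence_code\n"
    "BRCA1\tbreast cancer 1\tGO:0007049\tIDA\n"
    "DEMO1\texample gene, variant 2\tGO:0006915\tIMP\n"
)

csv_text = tsv_to_csv(tsv_text)
print(csv_text)
```

The third output line wraps the gene name in double quotes, so the embedded comma is not read as a column separator.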

In [None]:
%%bash
cd /tmp/tsv-validation-demo

cat > validation_config_csv.yaml << 'EOF'
schema: gene_schema.yaml
target_class: GeneAnnotation

data_sources:
  - genes.csv

plugins:
  JsonschemaValidationPlugin:
    closed: true

  "linkml_term_validator.plugins.DynamicEnumPlugin":
    oak_adapter_string: "sqlite:obo:"
    cache_labels: true
    cache_dir: cache
EOF

echo "Validating CSV data..."
linkml-validate --config validation_config_csv.yaml && echo "✅ CSV validation passed!"
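
The three configuration files in this tutorial differ only in their `data_sources` entry. One way to avoid that duplication is to render them from a single template, as in the sketch below. The output file naming (`validation_config_<stem>.yaml`) and the helper names are assumptions made for this example; the YAML content is copied from the configs above.

```python
import tempfile
from pathlib import Path

# Same content as the configs above; only the data file is parameterised.
CONFIG_TEMPLATE = """\
schema: gene_schema.yaml
target_class: GeneAnnotation

data_sources:
  - {data_file}

plugins:
  JsonschemaValidationPlugin:
    closed: true

  "linkml_term_validator.plugins.DynamicEnumPlugin":
    oak_adapter_string: "sqlite:obo:"
    cache_labels: true
    cache_dir: cache
"""


def render_config(data_file: str) -> str:
    """Return the validation config text for one data file."""
    return CONFIG_TEMPLATE.format(data_file=data_file)


def write_configs(data_files, out_dir):
    """Write one config per data file; the file naming is a made-up convention."""
    paths = []
    for data_file in data_files:
        path = Path(out_dir) / f"validation_config_{Path(data_file).stem}.yaml"
        path.write_text(render_config(data_file), encoding="utf-8")
        paths.append(path)
    return paths


# Written to a throwaway directory so the sketch has no lasting side effects.
with tempfile.TemporaryDirectory() as tmp:
    for path in write_configs(["genes_valid.tsv", "genes_invalid.tsv", "genes.csv"], tmp):
        print(path.name)
```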

## Part 5: Inspecting the Cache

With `cache_labels: true`, the plugin caches ontology labels under the configured `cache_dir` (here `cache/`), so repeated runs avoid repeated ontology lookups:

In [None]:
%%bash
cd /tmp/tsv-validation-demo

echo "Cache structure:"
find cache -type f 2>/dev/null || echo "(No cache created yet)"

In [None]:
%%bash
cd /tmp/tsv-validation-demo

echo "GO terms validated:"
cat cache/go/terms.csv 2>/dev/null || echo "(GO cache not yet created)"

## Summary

### Key Points

1. **TSV/CSV Support**: LinkML natively supports tabular data formats
2. **Flat Schemas**: Design schemas with flat class structures (no nested objects)
3. **linkml-validate**: Use `linkml-validate` for direct TSV/CSV validation
4. **Same Plugins**: All validation plugins work the same way regardless of input format
5. **Header Row**: Always include a header row with column names matching schema slots

### Best Practices

1. **Schema Design**: Keep schemas flat for TSV/CSV data
2. **Column Names**: Use valid LinkML slot names (avoid spaces, special characters)
3. **Data Types**: LinkML validates types (integer, string, etc.) in addition to ontology terms
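
The header and column-name advice above can be checked before running the validator. The sketch below is a hypothetical pre-flight check for TSV headers, not part of linkml-term-validator. The identifier pattern is a deliberately conservative assumption (LinkML itself may accept more names), and the required/optional split mirrors `gene_schema.yaml`.

```python
import re

# Conservative, assumed naming rule: letters, digits, and underscores only.
SLOT_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

# Mirrors gene_schema.yaml: three required attributes, one optional.
REQUIRED = ["gene_id", "gene_name", "biological_process"]
OPTIONAL = ["evidence_code"]


def check_header(header_line: str, delimiter: str = "\t") -> list[str]:
    """Return a list of problems found in a header line (empty list if none)."""
    columns = header_line.rstrip("\r\n").split(delimiter)
    problems = []
    for col in columns:
        if not SLOT_NAME.match(col):
            problems.append(f"not a simple identifier: {col!r}")
        elif col not in REQUIRED + OPTIONAL:
            problems.append(f"unknown column: {col!r}")
    for slot in REQUIRED:
        if slot not in columns:
            problems.append(f"missing required column: {slot!r}")
    return problems


print(check_header("gene_id\tgene_name\tbiological_process\tevidence_code"))  # []
print(check_header("gene_id\tgene name\tbiological_process"))
```

The plain `split` is enough for TSV headers; a CSV header with quoted names would need the `csv` module instead.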

### When to Use TSV/CSV

Use TSV/CSV formats when:
- The data is naturally tabular (rows and columns)
- You are integrating with spreadsheet tools (Excel, Google Sheets)
- You are working with bioinformatics pipelines, where TSV/CSV is a common format
- The data has no nested structures

Use YAML/JSON when:
- The data has nested or hierarchical structures
- You need to represent complex object relationships
- You are working with configuration files

### Workflow

```bash
# 1. Create TSV/CSV data
# 2. Create LinkML schema with ontology constraints
# 3. Create validation config
# 4. Run linkml-validate
linkml-validate --config validation_config.yaml
```
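
The final step of the workflow can also be driven from Python, for example to validate several files in one go. This is a sketch under the assumption, taken from the `&&` / `||` usage earlier in this notebook, that `linkml-validate` exits with a non-zero status when validation fails. The helper names are made up for this example.

```python
import shutil
import subprocess


def build_command(config_path: str) -> list[str]:
    """Build the same command line used in the bash cells above."""
    return ["linkml-validate", "--config", config_path]


def run_validation(config_path: str) -> bool:
    """Return True when linkml-validate exits with status 0."""
    result = subprocess.run(build_command(config_path), capture_output=True, text=True)
    if result.returncode != 0:
        # Show whatever the tool reported so the failure can be inspected.
        print(result.stdout or result.stderr)
    return result.returncode == 0


if shutil.which("linkml-validate") is None:
    print("linkml-validate not found on PATH; skipping")
else:
    for config in ["validation_config.yaml", "validation_config_invalid.yaml"]:
        print(config, "passed" if run_validation(config) else "failed")
```

Run it from the directory that contains the config files, since the paths inside them are relative.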

### Next Steps

- [linkml-validate Integration](04_linkml_validate_integration.ipynb) - More configuration options
- [Getting Started](01_getting_started.ipynb) - CLI basics

## Cleanup

In [None]:
%%bash
rm -rf /tmp/tsv-validation-demo
echo "✅ Cleanup complete"