# 🔮 Automated Inference with OmniGenBench AutoInfer

Welcome to this comprehensive tutorial on **automated inference** using OmniGenBench's AutoInfer functionality. This guide will walk you through making predictions on genomic sequences using pre-trained and fine-tuned models.

### 1. The Computational Challenge: Making Predictions at Scale

**AutoInfer** streamlines the process of making predictions on genomic sequences by providing:
- **Unified Interface**: Single API for all model types and tasks
- **Batch Processing**: Efficient handling of thousands of sequences
- **Flexible Input**: Support for various file formats (JSON, CSV, TXT)
- **Rich Output**: Predictions, probabilities, and confidence scores

Applications across genomic analysis:
- **Transcription Factor Binding**: Predict TF binding sites in regulatory regions
- **Translation Efficiency**: Estimate mRNA translation rates
- **Promoter Detection**: Identify promoter regions in genomic sequences
- **RNA Structure**: Predict secondary structure elements
- **Variant Effect**: Assess functional impact of genetic variants

### 2. The Data: From Sequences to Predictions

AutoInfer handles various input formats:

| Input Type | Format | Use Case |
|-----------|--------|----------|
| Single sequence | String | Quick predictions |
| Multiple sequences | Comma-separated | Small batches |
| JSON file | `{"sequences": [...]}` | Structured data |
| CSV file | `sequence,label,...` | Tabular data |
| Text file | One per line | Simple lists |
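
The formats in the table can all be reduced to a plain list of sequence strings before inference. The sketch below shows one way to do that with the standard library only. `load_sequences` is a hypothetical helper written for this tutorial, not AutoInfer's own parser, and it assumes the JSON layout `{"sequences": [...]}` and a CSV header with a `sequence` column, as listed in the table.

```python
import csv
import json
from pathlib import Path


def load_sequences(source: str) -> list[str]:
    """Return a list of sequences from a file path or a raw string.

    Hypothetical helper for illustration; AutoInfer's own parsing may differ.
    """
    path = Path(source)
    if path.suffix == ".json" and path.is_file():
        with path.open() as f:
            payload = json.load(f)
        # Layout assumed from the table: {"sequences": [...]}
        return [str(s) for s in payload["sequences"]]
    if path.suffix == ".csv" and path.is_file():
        with path.open(newline="") as f:
            # Assumes a header row that contains a "sequence" column
            return [row["sequence"] for row in csv.DictReader(f)]
    if path.suffix == ".txt" and path.is_file():
        # One sequence per line; blank lines are skipped
        return [ln.strip() for ln in path.read_text().splitlines() if ln.strip()]
    # Otherwise treat the input as one sequence or a comma-separated list
    return [s.strip() for s in source.split(",") if s.strip()]
```

For example, `load_sequences("ATCGATCG,GCGCGCGC")` returns a two-element list, while a path ending in `.json`, `.csv`, or `.txt` is read from disk.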

### 3. The Tool: Pre-trained and Fine-tuned Models

#### Model Types
- **Foundation Models**: General genomic understanding (e.g., OmniGenome-186M)
- **Fine-tuned Models**: Task-specific models (e.g., TFB prediction, TE prediction)
- **Custom Models**: Your own fine-tuned models

Depending on the task head a model was trained with, it supports one of:
1. **Sequence Classification**: Single or multi-label classification
2. **Token Classification**: Per-nucleotide predictions
3. **Regression**: Continuous value prediction
4. **Embedding Extraction**: Vector representations

### 4. The Workflow: 4-Step Guide to Inference

```mermaid
flowchart TD
    subgraph "4-Step Workflow for AutoInfer"
        A["📥 Step 1: Setup and Configuration<br/>Prepare environment and data"] --> B["🔧 Step 2: Load Model<br/>Initialize pre-trained or fine-tuned model"]
        B --> C["🎓 Step 3: Run Inference<br/>Make predictions on sequences"]
        C --> D["🔮 Step 4: Analysis and Export<br/>Analyze results and save outputs"]
    end

    style A fill:#e1f5fe,stroke:#333,stroke-width:2px
    style B fill:#f3e5f5,stroke:#333,stroke-width:2px
    style C fill:#e8f5e8,stroke:#333,stroke-width:2px
    style D fill:#fff3e0,stroke:#333,stroke-width:2px
```

Let's start making predictions!

## 🚀 Step 1: Setup and Configuration

First, let's set up our environment and prepare sample data for inference.

### 1.1: Environment Setup

Install required packages for genomic inference.

In [None]:
!pip install omnigenbench torch transformers -U

### 1.2: Import Required Libraries

Import essential libraries for inference and analysis.

In [None]:
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Import OmniGenBench components
from omnigenbench import (
    ModelHub,
    OmniTokenizer,
    OmniModelForSequenceClassification,
    OmniModelForTokenClassification,
)

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("[SUCCESS] Libraries imported successfully!")

### 1.3: Prepare Sample Data

Let's create various types of sample data to demonstrate different inference scenarios.

#### Data Types:
- **Promoter sequences**: For binary classification
- **Regulatory elements**: For multi-class classification
- **Long genomic regions**: For token-level predictions

In [None]:
# Sample genomic sequences for inference
sample_sequences = {
    "promoter_candidates": [
        "ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG",
        "GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGC",
        "TATATATATATATATATATATATATATATATATATATATATATATATAT",
        "ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC",
        "CGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGAT",
    ],
    "regulatory_elements": [
        "CGGCGCGCCATATAAGCATCGAGCGCGCACGTGCGCTGCGCGCGCGCTACGCGCGCATGTGCGCGCACGTACGCGCG",
        "GCGCGCGCACGTGCGCACGTGCGCGCACGTGCGCGCGCACGTGCGCGCACGTGCGCGCACGTGCGCGCACGTGCGCGC",
        "GCGCGCCACCAATGCGCGCGCCACCATGTGCGCGCCACCATGTGCGCGCCACCATGTGCGCGCCACCATGTGCGCGCC",
    ],
    "long_sequences": [
        "ATCGATCGATCG" * 20,  # 240 bp
        "GCTAGCTAGCTA" * 20,  # 240 bp
    ]
}

# Create JSON format data
json_data = {
    "sequences": sample_sequences["promoter_candidates"]
}

# Save to file
with open("inference_sequences.json", "w") as f:
    json.dump(json_data, f, indent=2)

# Create JSON with metadata
json_with_metadata = {
    "data": [
        {
            "sequence": seq,
            "gene_id": f"GENE_{i:03d}",
            "description": f"Sample sequence {i+1}",
            "organism": "Arabidopsis thaliana"
        }
        for i, seq in enumerate(sample_sequences["promoter_candidates"])
    ]
}

with open("inference_with_metadata.json", "w") as f:
    json.dump(json_with_metadata, f, indent=2)

# Create CSV format data
csv_data = pd.DataFrame({
    "sequence": sample_sequences["promoter_candidates"],
    "gene_id": [f"GENE_{i:03d}" for i in range(len(sample_sequences["promoter_candidates"]))],
    "organism": "Arabidopsis thaliana"
})
csv_data.to_csv("inference_sequences.csv", index=False)

# Create text file (one sequence per line)
with open("inference_sequences.txt", "w") as f:
    for seq in sample_sequences["promoter_candidates"]:
        f.write(seq + "\n")

print("[SUCCESS] Sample data files created:")
print("  - inference_sequences.json")
print("  - inference_with_metadata.json")
print("  - inference_sequences.csv")
print("  - inference_sequences.txt")
print(f"\n[INFO] Total sequences prepared: {len(sample_sequences['promoter_candidates'])}")
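
Before sending sequences to a model, it can help to check that they contain only expected characters. The sketch below is an optional sanity check, not part of OmniGenBench. The accepted alphabet (`A`, `C`, `G`, `T`, `U`, `N`) is an assumption; the tokenizer of the model you load may accept a different set.

```python
# Assumed alphabet; adjust to whatever the model's tokenizer accepts
VALID_BASES = set("ACGTUN")


def validate_sequence(seq: str) -> str:
    """Strip whitespace, uppercase, and reject unexpected characters."""
    cleaned = seq.strip().upper()
    if not cleaned:
        raise ValueError("Empty sequence")
    invalid = set(cleaned) - VALID_BASES
    if invalid:
        raise ValueError(f"Unexpected characters: {sorted(invalid)}")
    return cleaned


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a non-empty, uppercase sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


# Quick check on two toy inputs (lowercase input is normalized)
for raw in ["ATCG" * 12, "gcgc" * 12]:
    clean = validate_sequence(raw)
    print(f"{clean[:12]}... length={len(clean)} GC={gc_content(clean):.2f}")
```

The same two functions can be applied to every entry of `sample_sequences` to catch malformed input before any model call.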

## 🔧 Step 2: Load Model

Now let's load a pre-trained or fine-tuned model for inference.

### Model Loading Options

Models can be loaded in three ways (Section 2.1 combines options 1 and 3):
1. **Load from HuggingFace Hub**: Use publicly available fine-tuned models
2. **Load from local path**: Use your own trained models
3. **Use ModelHub**: Simplified loading with automatic configuration

### 2.1: Load Fine-tuned Model (Recommended)

Let's use a fine-tuned model for transcription factor binding prediction.

In [None]:
import torch

# Configuration
inference_config = {
    "model_name": "yangheng/ogb_tfb_finetuned",  # Fine-tuned TFB model
    "batch_size": 16,
    # Fall back to CPU when no GPU is available
    "device": "cuda" if torch.cuda.is_available() else "cpu",
}

print("[INFO] Loading fine-tuned model for inference...")
print(f"  Model: {inference_config['model_name']}")
print(f"  Device: {inference_config['device']}")

# Load model using ModelHub (simplified approach)
model = ModelHub.load(
    inference_config["model_name"],
    device=inference_config["device"],
    trust_remote_code=True
)

# Set to evaluation mode
model = model.eval()

print("\n[SUCCESS] Model loaded successfully!")
print(f"  Model type: {type(model).__name__}")
print(f"  Ready for inference: {not model.training}")

### 2.2: Alternative - Load from Local Path

If you have a locally trained model, you can load it directly.

In [None]:
# Example: Load from local directory
# Uncomment to use your own model

# local_model_path = "./my_trained_model"
# model = OmniModelForSequenceClassification.from_pretrained(
#     local_model_path,
#     trust_remote_code=True
# )
# model = model.eval()

print("[INFO] To use a local model, uncomment the code above and specify the path.")

## 🎓 Step 3: Run Inference

Now let's make predictions on our sample sequences using various methods.

### Inference Methods:
1. **Single Sequence**: Quick predictions on individual sequences
2. **Batch Inference**: Efficient processing of multiple sequences
3. **File-based Inference**: Process sequences from JSON/CSV files

### 3.1: Single Sequence Inference

In [None]:
# Single sequence prediction
single_sequence = "ATCGATCGATCGATCGATCGATCGATCGATCG"

print("[INFO] Running inference on single sequence...")
print(f"  Sequence: {single_sequence}")
print(f"  Length: {len(single_sequence)} bp")

# Make prediction
result = model.inference(single_sequence)

print("\n[SUCCESS] Prediction completed!")
print(f"  Prediction: {result['predictions']}")

if 'probabilities' in result:
    probs = result['probabilities']
    print(f"  Probabilities: {probs}")
    print(f"  Confidence: {max(probs):.4f}")

if 'confidence' in result:
    print(f"  Confidence score: {float(result['confidence']):.4f}")
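
The cell above reports `max(probs)` as the confidence. If the probabilities come from a softmax over the model's output logits (an assumption about the classification head, not something this tutorial verifies), that value is the probability assigned to the predicted class. The sketch below works through the arithmetic with made-up logits; they are illustrative numbers, not model output.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits for a two-class problem (not real model output)
example_logits = [2.0, 0.0]
example_probs = softmax(example_logits)

predicted_class = example_probs.index(max(example_probs))
confidence = max(example_probs)
print(f"probabilities={example_probs}, class={predicted_class}, confidence={confidence:.4f}")
```

A confidence close to `1 / number_of_classes` means the classes are nearly tied, so such predictions deserve a closer look.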

### 3.2: Batch Inference

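For multiple sequences, collect all predictions in one pass. To keep the example simple, the loop below calls `model.inference` once per sequence; grouping sequences into batches is usually more efficient on a GPU.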

In [None]:
# Batch inference on multiple sequences
sequences = sample_sequences["promoter_candidates"]

print(f"[INFO] Running batch inference on {len(sequences)} sequences...")

# Collect all predictions
batch_results = []

for i, seq in enumerate(sequences):
    result = model.inference(seq)
    batch_results.append({
        "sequence_id": i,
        "sequence": seq[:30] + "...",  # Truncate for display
        "prediction": result['predictions'],
        "confidence": max(result['probabilities']) if 'probabilities' in result else None
    })

# Display results
print("\n[SUCCESS] Batch inference completed!")
print("\nResults:")
print("-" * 80)

for res in batch_results:
    print(f"Sequence {res['sequence_id']}: {res['sequence']}")
    print(f"  Prediction: {res['prediction']}")
    if res['confidence'] is not None:
        print(f"  Confidence: {res['confidence']:.4f}")
    print()
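
The loop above processes one sequence per call. To group sequences into batches of `batch_size` (16 in `inference_config`), a chunking helper like the one below can be used. This is a sketch: `chunked`, `run_in_batches`, and `fake_predict_batch` are hypothetical names, and whether `model.inference` accepts a list of sequences is an assumption to check against the OmniGenBench documentation. A stand-in predictor is used so the code runs without a model.

```python
from typing import Callable, Iterator, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def run_in_batches(
    sequences: Sequence[str],
    predict_batch: Callable[[list[str]], list],
    batch_size: int = 16,
) -> list:
    """Apply `predict_batch` to each chunk and concatenate the outputs in order."""
    results = []
    for batch in chunked(sequences, batch_size):
        results.extend(predict_batch(batch))
    return results


def fake_predict_batch(batch: list[str]) -> list[int]:
    """Stand-in for a real model call: returns each sequence's length."""
    return [len(seq) for seq in batch]


demo_lengths = run_in_batches(["AT", "GCG", "A"], fake_predict_batch, batch_size=2)
print(demo_lengths)
```

Replacing `fake_predict_batch` with a function that wraps the real model keeps the batching logic separate from the model API.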

### 3.3: File-based Inference

Process sequences from JSON files with metadata preservation.

In [None]:
# Load sequences from JSON file
print("[INFO] Loading sequences from JSON file...")

with open("inference_with_metadata.json", "r") as f:
    data = json.load(f)

print(f"  Loaded {len(data['data'])} sequences with metadata")

# Run inference and preserve metadata
results_with_metadata = []

for item in data['data']:
    sequence = item['sequence']
    result = model.inference(sequence)
    
    # Combine prediction with metadata
    results_with_metadata.append({
        "sequence": sequence,
        "metadata": {
            "gene_id": item['gene_id'],
            "description": item['description'],
            "organism": item['organism']
        },
        "predictions": result['predictions'],
        "probabilities": result.get('probabilities', []),
    })

print("\n[SUCCESS] File-based inference completed!")
print(f"  Processed {len(results_with_metadata)} sequences")
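
The cell above collects `results_with_metadata` in memory but does not write it to disk. The sketch below shows one way to export such a structure to JSON. It assumes predictions may be NumPy arrays or tensors, which `json.dump` cannot serialize directly, so anything exposing a `.tolist()` method is converted first. `to_jsonable` and `save_results` are hypothetical helpers; the default file name matches `inference_results.json` from the Cleanup cell.

```python
import json


def to_jsonable(value):
    """Recursively convert arrays/tensors (anything with .tolist()) to plain Python."""
    if hasattr(value, "tolist"):
        return value.tolist()
    if isinstance(value, dict):
        return {key: to_jsonable(val) for key, val in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(item) for item in value]
    return value


def save_results(results, path="inference_results.json"):
    """Write results to `path` as indented JSON."""
    with open(path, "w") as f:
        json.dump(to_jsonable(results), f, indent=2)
```

With the variables from the cell above, the call would be `save_results(results_with_metadata)`.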


### 3.4: CSV File Inference

Process sequences from CSV files and create a results DataFrame.

In [None]:
# Load sequences from CSV
print("[INFO] Loading sequences from CSV file...")

df = pd.read_csv("inference_sequences.csv")
print(f"  Loaded {len(df)} sequences from CSV")

# Run inference on all sequences
predictions = []
confidences = []

for sequence in df['sequence']:
    result = model.inference(sequence)
    predictions.append(result['predictions'])
    if 'probabilities' in result:
        confidences.append(max(result['probabilities']))
    else:
        confidences.append(None)

# Add predictions to DataFrame
df['prediction'] = predictions
df['confidence'] = confidences

# Save results
df.to_csv("inference_results.csv", index=False)

print("\n[SUCCESS] CSV inference completed!")
print("\nResults DataFrame:")
print(df.head())
print(f"\n[INFO] Results saved to: inference_results.csv")
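
Step 4 of the workflow diagram is analysis and export. A minimal summary of the predictions could look like the sketch below. `summarize` is a hypothetical helper that expects records with `prediction` and `confidence` keys, the same keys used by `batch_results` in Section 3.2. It assumes each prediction is a single hashable label such as an integer; array-valued predictions would need converting first. The demo records are placeholders, not real model output.

```python
from collections import Counter
from statistics import mean


def summarize(records):
    """Count predictions per class and average the available confidences."""
    counts = Counter(r["prediction"] for r in records)
    scored = [r["confidence"] for r in records if r["confidence"] is not None]
    return {
        "total": len(records),
        "per_class": dict(counts),
        "mean_confidence": mean(scored) if scored else None,
    }


# Placeholder records for illustration only (not real model output)
demo_records = [
    {"prediction": 1, "confidence": 0.9},
    {"prediction": 0, "confidence": 0.7},
    {"prediction": 1, "confidence": None},
]
summary = summarize(demo_records)
print(summary)
```

Records without a confidence value are still counted per class but are left out of the mean.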

## 🎯 Advanced Usage: CLI Integration

The same inference can be performed using the command-line interface.

### CLI Examples

Here are equivalent CLI commands for the operations we performed above:

In [None]:
# Display CLI usage examples (raw string so the trailing backslashes are printed)
cli_examples = r"""
Command-Line Interface Examples:
================================

1. Single sequence inference:
   ogb autoinfer \
     --model yangheng/ogb_tfb_finetuned \
     --sequence "ATCGATCGATCGATCGATCGATCG" \
     --output-file predictions.json

2. Multiple sequences (comma-separated):
   ogb autoinfer \
     --model yangheng/ogb_tfb_finetuned \
     --sequence "ATCGATCG,GCGCGCGC,TATATAT" \
     --output-file predictions.json

3. Batch inference from JSON:
   ogb autoinfer \
     --model yangheng/ogb_tfb_finetuned \
     --input-file inference_sequences.json \
     --batch-size 16 \
     --output-file results.json

4. Inference from CSV:
   ogb autoinfer \
     --model yangheng/ogb_tfb_finetuned \
     --input-file inference_sequences.csv \
     --output-file predictions.json \
     --device cuda:0

5. Inference from text file:
   ogb autoinfer \
     --model yangheng/ogb_tfb_finetuned \
     --sequence inference_sequences.txt \
     --output-file predictions.json
"""

print(cli_examples)
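
To drive the CLI from a Python script, the command can be assembled as an argument list and handed to `subprocess.run`. The sketch below only builds and prints the command. `build_autoinfer_command` is a hypothetical helper that uses the flags shown in the examples above (`--model`, `--input-file`, `--batch-size`, `--output-file`); it does not check that the `ogb` executable is installed.

```python
import shlex


def build_autoinfer_command(model, input_file, output_file, batch_size=None):
    """Assemble an `ogb autoinfer` call as a list of arguments."""
    cmd = ["ogb", "autoinfer", "--model", model, "--input-file", input_file]
    if batch_size is not None:
        cmd += ["--batch-size", str(batch_size)]
    cmd += ["--output-file", output_file]
    return cmd


cmd = build_autoinfer_command(
    model="yangheng/ogb_tfb_finetuned",
    input_file="inference_sequences.json",
    output_file="results.json",
    batch_size=16,
)
print(shlex.join(cmd))

# To execute the command (requires the ogb CLI on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list instead of a single shell string avoids quoting problems when file names contain spaces.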

## 🎓 Summary and Best Practices

### Key Takeaways

1. **Model Loading**:
   - Use `ModelHub.load()` for simplified loading
   - Always set model to evaluation mode: `model.eval()`
   - Specify device explicitly for GPU acceleration

2. **Inference Methods**:
   - Single sequence: `model.inference(sequence)`
   - Batch processing: Loop over sequences
   - CLI: Use `ogb autoinfer` for production workflows

3. **Performance Tips**:
   - Use batch processing for multiple sequences
   - Enable GPU when available
   - Use mixed precision for faster inference
   - Process sequences in parallel when possible

4. **Output Formats**:
   - JSON: Rich metadata and nested structures
   - CSV: Tabular data for spreadsheet analysis
   - Custom: Adapt to your downstream analysis needs

### Common Use Cases

| Task | Model | Application |
|------|-------|-------------|
| TF Binding | `ogb_tfb_finetuned` | Predict transcription factor binding sites |
| Translation Efficiency | `ogb_te_finetuned` | Estimate mRNA translation rates |
| Promoter Detection | Custom fine-tuned | Identify promoter regions |
| RNA Structure | Structure-specific | Predict secondary structure elements |

### Next Steps

- **Fine-tune your own models**: See `sequence_classification_tutorial`
- **Batch processing**: Scale to thousands of sequences
- **Integration**: Incorporate into bioinformatics pipelines
- **Advanced analysis**: Combine with embedding extraction and attention visualization

### Resources

- **Documentation**: `docs/GETTING_STARTED.md`
- **CLI Reference**: `docs/cli.rst`
- **Examples**: `examples/` directory
- **Issues**: https://github.com/yangheng95/OmniGenBench/issues

## 🧹 Cleanup

Remove temporary files created during the tutorial (optional).

In [None]:
# Uncomment to remove temporary files
# import os

# files_to_remove = [
#     "inference_sequences.json",
#     "inference_with_metadata.json",
#     "inference_sequences.csv",
#     "inference_sequences.txt",
#     "inference_results.json",
#     "inference_results.csv",
#     "inference_summary.json",
#     "inference_analysis.png",
#     "complete_inference_results.json",
# ]

# for file in files_to_remove:
#     if os.path.exists(file):
#         os.remove(file)
#         print(f"Removed: {file}")

print("[INFO] To clean up temporary files, uncomment the code above.")