# TREBL Quick Start Example

This notebook demonstrates a quick start workflow for TREBL analysis with:
- **No error correction** (faster processing)
- **Simple UMI deduplication only** (applied in the TREBL experiment step, after Step 1 mapping)

This is ideal for initial data exploration or when processing time is a priority.

**Note:** For large files, the plotting steps (`step1_reads_distribution`, `trebl_experiment_reads_distribution`) can be computationally intensive. Consider submitting them as a Savio job rather than running them interactively.

## Setup and Imports

In [None]:
import sys
import os
import glob

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import duckdb
from tqdm import tqdm

from trebl_tools import (
    initial_map,
    map_refiner,
    complexity,
    finder,
    preprocess,
    error_correct,
    plotting,
    umi_deduplicate,
    pipelines
)

## Initialize Pipeline

Key settings for quick start:
- `error_correction=False` - Skips error correction for faster processing
- `test_n_reads` - Optional: set to a number (e.g., 100000) to test with a subset of the data

In [None]:
# Initialize pipeline with no error correction
pipeline = pipelines.TreblPipeline(
    db_path="quick_start.db",
    design_file_path="path/to/your/design_file.txt",  # Update this path
    error_correction=False,  # No error correction for quick start
    output_path="output/quick_start",
    # test_n_reads=100000,  # Uncomment to test with the first 100k reads
)
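
As a rough illustration of what testing on the "first 100k reads" means, the sketch below reads the sequences of the first N records of a FASTQ file. It is not how `TreblPipeline` handles `test_n_reads` internally; `first_n_reads` is a hypothetical helper, and it assumes standard four-line (unwrapped) FASTQ records.

```python
import gzip
import itertools


def first_n_reads(path, n_reads):
    """Return the sequence lines of the first n_reads records in a FASTQ file."""
    # Hypothetical helper for illustration; supports plain and gzipped FASTQ
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # Each FASTQ record spans four lines; the sequence is the second one
        head = itertools.islice(handle, 4 * n_reads)
        return [line.strip() for i, line in enumerate(head) if i % 4 == 1]
```

Calling `first_n_reads("path/to/your/step1_sequencing_file.fastq", 5)` on a real file is a quick way to eyeball a few reads before committing to a full run.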

## Step 1: TREBL Mapping

Define barcodes and run initial mapping to establish barcode relationships.

In [None]:
# Define barcodes to search for in reads
AD = finder.Barcode(
    name="AD",
    preceder="GGCTAGC",
    post="TGACTAG",
    length=120
)

AD_BC = finder.Barcode(
    name="AD_BC",
    preceder="CGCGCC",
    post="GGGCCC",
    length=11
)

RT_BC = finder.Barcode(
    name="RT_BC",
    preceder="CTCGAG",
    post="GGCCGC",
    length=14
)

# Combine barcodes
bc_objects = [AD, AD_BC, RT_BC]
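
To build intuition for the `preceder`, `post`, and `length` arguments, here is a minimal sketch of flank-based extraction on a synthetic read. It assumes exact flank matches and a fixed length; the actual matching logic in `finder.Barcode` may differ (for example in how it treats mismatches). `extract_between_flanks` and `reverse_complement` are hypothetical helpers written for this example.

```python
import re


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def extract_between_flanks(read, preceder, post, length):
    """Return the `length`-base segment between `preceder` and `post`, or None."""
    # Assumption: exact flank matches; an empty `post` means "no right flank"
    pattern = re.escape(preceder) + "([ACGTN]{" + str(length) + "})" + re.escape(post)
    match = re.search(pattern, read)
    return match.group(1) if match else None


# Synthetic read built from the RT_BC flanks defined above
barcode = "ACGTACGTACGTAC"  # 14 bases, made up for this demo
read = "TTTT" + "CTCGAG" + barcode + "GGCCGC" + "AAAA"

found = extract_between_flanks(read, "CTCGAG", "GGCCGC", 14)
found_rc = extract_between_flanks(reverse_complement(read), "CTCGAG", "GGCCGC", 14)
print(found, found_rc)
```

Here `found` is the barcode, while `found_rc` is `None` because the flanks no longer line up on the opposite strand. This is why read orientation matters when searching for barcodes.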

In [None]:
# Specify sequencing file(s)
step1_seq_file = "path/to/your/step1_sequencing_file.fastq"  # Update this path
# Can be a single file (string) or multiple files (list of strings)
# Supported formats: .fastq or .fastq.gz
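
Because `seq_file` may be a string or a list of strings, a small check before launching a long run can catch path mistakes early. The sketch below is a hypothetical helper, not part of `trebl_tools`. It only encodes the two extensions listed above.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = (".fastq", ".fastq.gz")


def normalize_seq_files(seq_file):
    """Return a list of path strings, raising if any has an unsupported extension."""
    # Accept a single path or any iterable of paths
    paths = [seq_file] if isinstance(seq_file, (str, Path)) else list(seq_file)
    bad = [str(p) for p in paths if not str(p).endswith(SUPPORTED_SUFFIXES)]
    if bad:
        raise ValueError(f"Unsupported sequencing file extension: {bad}")
    return [str(p) for p in paths]
```

For example, `normalize_seq_files(step1_seq_file)` returns a one-element list for the single path defined above.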

In [None]:
# Plot reads distribution
# NOTE: For large files (>10M reads), consider submitting this as a Savio job
# See examples/savio_jobs/quick_start_job.sh for job submission example

pipeline.step1_reads_distribution(
    seq_file=step1_seq_file,
    bc_objects=bc_objects,
    reverse_complement=True
)
# Produces histogram of reads per barcode
# Helps pick appropriate reads_threshold for filtering
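
To see how a candidate `reads_threshold` trades barcodes for read depth, the sketch below tabulates a few cutoffs on made-up counts. It assumes the threshold is an inclusive minimum (barcodes with at least that many reads are kept), which may not match the pipeline's exact rule. `summarize_threshold` is a hypothetical helper.

```python
# Synthetic reads-per-barcode counts, made up for this demo
reads_per_barcode = {"BC_A": 250, "BC_B": 120, "BC_C": 40, "BC_D": 9, "BC_E": 3, "BC_F": 1}


def summarize_threshold(counts, threshold):
    """Return (barcodes kept, fraction of reads kept) for an inclusive minimum-read cutoff."""
    kept = [n for n in counts.values() if n >= threshold]
    return len(kept), sum(kept) / sum(counts.values())


for threshold in (1, 5, 10, 50):
    n_kept, frac = summarize_threshold(reads_per_barcode, threshold)
    print(f"reads_threshold={threshold:>3}: {n_kept} barcodes kept, {frac:.1%} of reads")
```

In real data the histogram often shows a low-count tail of likely sequencing errors. A threshold just above that tail usually removes many barcodes while discarding few reads.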

In [None]:
# Run Step 1 mapping
step1_map = pipeline.run_step_1(
    seq_file=step1_seq_file,
    bc_objects=bc_objects,
    column_pairs=[("RT_BC", "AD")],  # Check for collisions between RT_BC and AD
    reads_threshold=10,  # Minimum reads to keep a barcode
    reverse_complement=False  # NOTE: the plotting cell above uses True; set both to match your read orientation
)
# Returns DataFrame of Step 1 mapping
# Saves CSV, loss table visualization, and optional loss table CSV
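
The `column_pairs` check can be pictured as follows, assuming a "collision" means one `RT_BC` observed with more than one distinct `AD`. That definition is an assumption made for illustration, and the pipeline's handling of such cases may differ. `find_collisions` is a hypothetical helper run on made-up data.

```python
from collections import defaultdict

# Synthetic (RT_BC, AD) observations, made up for this demo
pairs = [
    ("RT_1", "AD_x"),
    ("RT_1", "AD_x"),
    ("RT_2", "AD_y"),
    ("RT_3", "AD_x"),
    ("RT_3", "AD_z"),  # RT_3 is seen with two different ADs
]


def find_collisions(observations):
    """Return keys (first element) that are paired with more than one distinct value."""
    partners = defaultdict(set)
    for key, value in observations:
        partners[key].add(value)
    return {key: sorted(values) for key, values in partners.items() if len(values) > 1}


collisions = find_collisions(pairs)
print(collisions)
```

Only `RT_3` is reported, since it is the only reporter barcode tied to two different ADs.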

## TREBL Experiment with Simple UMI Deduplication

Process the full TREBL experiment using simple UMI deduplication only.

In [None]:
# Define UMI objects
AD_UMI = finder.Barcode(
    name="UMI",
    preceder="TGATTT",
    post="",
    length=12
)

RT_UMI = finder.Barcode(
    name="UMI",
    preceder="TGTCAC",
    post="",
    length=12
)

# Separate barcode objects
AD_bc_objects = [AD, AD_BC]  # AD sequences and their barcodes
RT_bc_objects = [RT_BC]      # Reporter barcodes

In [None]:
# Collect sequencing files (sorted, because glob does not guarantee an order)
trebl_AD_seq_files = sorted(glob.glob("path/to/AD_assembled/*"))  # Update this path
trebl_RT_seq_files = sorted(glob.glob("path/to/RT_assembled/*"))  # Update this path

In [None]:
# Plot reads distribution for all files
# NOTE: For large files, consider submitting this as a Savio job
# See examples/savio_jobs/quick_start_job.sh for job submission example

pipeline.trebl_experiment_reads_distribution(
    AD_seq_files=trebl_AD_seq_files,
    AD_bc_objects=AD_bc_objects,
    RT_seq_files=trebl_RT_seq_files,
    RT_bc_objects=RT_bc_objects,
    reverse_complement=True
)
# Generates histograms for all AD and RT files

In [None]:
# Run TREBL experiment with SIMPLE UMI deduplication only
trebl_results = pipeline.trebl_experiment_analysis(
    AD_seq_files=trebl_AD_seq_files,
    AD_bc_objects=AD_bc_objects,
    RT_seq_files=trebl_RT_seq_files,
    RT_bc_objects=RT_bc_objects,
    reverse_complement=True,
    step1_map_csv_path="output/quick_start/step1_AD_AD_BC_RT_BC_designed.csv",  # Update with your step1 CSV path
    AD_umi_object=AD_UMI,
    RT_umi_object=RT_UMI,
    umi_deduplication='simple'  # Use ONLY simple deduplication for quick start
)

# Access results
AD_results = trebl_results["AD_results"]
RT_results = trebl_results["RT_results"]
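
For intuition about what "simple" deduplication does, the sketch below counts distinct UMIs per barcode using exact string matches. This is an assumption about the meaning of `umi_deduplication='simple'`, not the library's implementation. `simple_umi_counts` is a hypothetical helper run on made-up reads.

```python
from collections import defaultdict

# Synthetic (barcode, UMI) reads, made up for this demo
reads = [
    ("BC_A", "AAAACCCCGGGG"),
    ("BC_A", "AAAACCCCGGGG"),  # PCR duplicate: same barcode, same UMI
    ("BC_A", "TTTTCCCCGGGG"),
    ("BC_B", "AAAACCCCGGGG"),
]


def simple_umi_counts(observations):
    """Count distinct UMIs per barcode using exact string matches only."""
    umis = defaultdict(set)
    for barcode, umi in observations:
        umis[barcode].add(umi)
    return {barcode: len(seen) for barcode, seen in umis.items()}


umi_counts = simple_umi_counts(reads)
print(umi_counts)
```

`BC_A` collapses from three reads to two molecules and `BC_B` stays at one. Exact matching is fast, but a UMI carrying a single sequencing error is counted as a separate molecule, which is the kind of case that error-aware deduplication methods aim to handle.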

## Next Steps

After completing this quick start analysis:

1. **Review outputs** in the `output/quick_start` directory
2. **Check loss tables** to understand filtering at each step
3. **Validate results** by examining the CSV files
4. **For more comprehensive analysis**, see the `full_analysis_example.ipynb` notebook, which includes:
   - Error correction for improved accuracy
   - Both simple and directional/complex UMI deduplication
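
A quick way to start reviewing the outputs is to list each CSV with its columns and row count. The sketch below makes no assumptions about file or column names beyond the `output/quick_start` directory used above. `summarize_csvs` is a hypothetical helper.

```python
import csv
import glob
import os


def summarize_csvs(output_dir):
    """Return {file name: (column names, number of data rows)} for each CSV in output_dir."""
    summary = {}
    for path in sorted(glob.glob(os.path.join(output_dir, "*.csv"))):
        with open(path, newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])  # Empty list if the file has no header row
            n_rows = sum(1 for _ in reader)
        summary[os.path.basename(path)] = (header, n_rows)
    return summary


for name, (columns, n_rows) in summarize_csvs("output/quick_start").items():
    print(f"{name}: {n_rows} rows, columns={columns}")
```

If the directory is missing or contains no CSV files, the loop simply prints nothing.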

### Cleanup

After analysis is complete, you can delete the DuckDB database:

In [None]:
# import os
# os.remove("quick_start.db")