# Nanopore Consensus Sequence Analysis: Old Guppy Data vs New Dorado Data
## Project Goal
This notebook compares consensus sequences generated from the same raw Nanopore sequencing data (fungal ITS amplicons) by two software pipelines: **Dorado** (the newer method) and **Guppy** (the previous standard). The objective is to quantify the differences, and any improvements, offered by the new Dorado pipeline by comparing key sequence metrics and characteristics.

## Background
The raw Nanopore signal data from fungal ITS sequencing runs (the `OMDL*` datasets in our case) was processed independently by both basecallers as multiplexed sample pools. Subsequent steps demultiplexed reads into sample-specific bins, clustered similar reads, and generated consensus sequences. The final output for comparison includes FASTA sequence files and associated metadata such as "Reads in Consensus" (RiC). Many software improvements have been made since the original Guppy datasets were produced, so this comparison investigates how the results of the new pipeline differ on exactly the same raw source data.
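
As a rough illustration of the final output described above, the sketch below reads FASTA records and pulls a RiC value out of each header. The `ric=<n>` header field, the function names, and the two sample records are assumptions made for this example; the actual header layout of the `OMDL*` files may differ.

```python
import re
from io import StringIO

# Assumption: the read count is stored in the header as "ric=<integer>".
RIC_PATTERN = re.compile(r"ric=(\d+)")


def _finish(header, chunks):
    """Build one (header, ric, sequence) record; ric is None if no field is found."""
    match = RIC_PATTERN.search(header)
    ric = int(match.group(1)) if match else None
    return header, ric, "".join(chunks).upper()


def parse_fasta_with_ric(handle):
    """Return a list of (header, ric, sequence) tuples from an open FASTA handle."""
    records = []
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(_finish(header, chunks))
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence lines may be wrapped
    if header is not None:
        records.append(_finish(header, chunks))
    return records


# Hypothetical records, not taken from the real datasets
example = StringIO(">sampleA ric=42\nACGT\nACGT\n>sampleB\nGGCC\n")
records = parse_fasta_with_ric(example)
for header, ric, seq in records:
    print(header, ric, len(seq))
```

Returning `None` when no RiC field is present keeps a missing value distinguishable from a genuine count of zero.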

## Notebook Workflow
This Jupyter notebook provides an interactive interface to:
1.  **Setup and Config:** Load run data, identify runs with paired Dorado and Guppy data, then load the corresponding sequence files.
2.  **Match Sequences:** Pair corresponding consensus sequences generated by Dorado and Guppy for the *same* original sample.
3.  **Calculate & Compare Metrics:** For matched pairs, calculate and compare key metrics like:
    * Reads in Consensus (RiC)
    * Sequence Length
    * GC Content
    * Sequence Identity (including mismatches, insertions, deletions)
    * Homopolymer run characteristics
    * Frequency of ambiguous bases (not currently relevant)
4.  **Statistical Analysis:** Apply non-parametric tests (e.g., Wilcoxon signed-rank test) to assess the significance of observed differences.
5.  **Visualize Results:** Generate plots (scatter plots, histograms) to visualize comparisons and distributions.
6.  **Explore Alignments:** Interactively view pairwise alignments of matched sequences to examine differences at the base level.
7.  **Summarize & Export:** Generate summary tables for individual runs and across all runs, exporting results to TSV/CSV files.
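
Several of the per-sequence metrics listed in step 3 are simple to compute directly from a sequence string. The sketch below shows one possible way to do so; the function names and the minimum homopolymer length of 5 are assumptions for illustration, and the definitions used in `data_functions` may differ.

```python
from itertools import groupby


def gc_content(seq):
    """Fraction of G and C bases among all bases (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def homopolymer_runs(seq, min_len=5):
    """Return (base, length) for each run of identical bases of at least min_len."""
    runs = []
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((base, length))
    return runs


def ambiguous_count(seq):
    """Number of characters that are not one of A, C, G, T."""
    return sum(1 for base in seq.upper() if base not in "ACGT")


# Hypothetical sequence: a 6-base T run, a 5-base A run, and one ambiguous base
seq = "ACG" + "T" * 6 + "GGCC" + "A" * 5 + "N"
runs = homopolymer_runs(seq)
print(f"GC content: {gc_content(seq):.3f}")
print(f"Homopolymer runs: {runs}")
print(f"Longest run: {max(length for _, length in runs)}")
print(f"Ambiguous bases: {ambiguous_count(seq)}")
```

Counting runs and tracking the longest run separately corresponds to the kind of distinction the `Homo_Count` and `Homo_MaxLen` statistics appear to make.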

# Initial Setup and Config

In [1]:
import data_functions
from viz_handler import display_run_analysis, create_sequence_alignment_viewer
import os
from natsort import natsorted
from IPython.display import display, Markdown, clear_output
import ipywidgets as widgets
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd # type: ignore
import json
# Optional: Reload data_functions if making changes during development
# import importlib
# importlib.reload(data_functions) # Use this in code cell to reload the module

# Configure pandas display
pd.set_option('display.max_columns', None)

# Configure plotting style (optional)
sns.set_theme(style="whitegrid")

# --- Configuration: Define Project Paths ---
# You can change BASE_PROJECT_DIR if your data/results aren't relative to the notebook
BASE_PROJECT_DIR = '.' # Assumes seqs, summary, results are subdirs of the notebook's dir or a linked dir

# Define specific directories relative to the base
SEQS_DIR = os.path.join(BASE_PROJECT_DIR, 'seqs')
SUMMARY_DIR = os.path.join(BASE_PROJECT_DIR, 'summary')
RESULTS_DIR = os.path.join(BASE_PROJECT_DIR, 'results')

# Create results directory if it doesn't exist
os.makedirs(RESULTS_DIR, exist_ok=True)

print(f"Using Sequences Directory: {os.path.abspath(SEQS_DIR)}")
print(f"Using Summary Directory:   {os.path.abspath(SUMMARY_DIR)}")
print(f"Using Results Directory:   {os.path.abspath(RESULTS_DIR)}")


Using Sequences Directory: c:\gitsync\nanopore-consensus-benchmark\seqs
Using Summary Directory:   c:\gitsync\nanopore-consensus-benchmark\summary
Using Results Directory:   c:\gitsync\nanopore-consensus-benchmark\results


### Load Runs

In [2]:
runs_df, runs_dict = data_functions.discover_runs(SEQS_DIR)

print("Discovered Runs:")
display(runs_df)
if runs_dict:
    print("Run dictionary Loaded.")


Discovered Runs:


Run ID,dorado,guppy,Both Available
OMDL1,True,True,True
OMDL2,True,True,True
OMDL3,True,True,True
OMDL4,True,True,True
OMDL5,True,True,True
OMDL6,True,True,True
OMDL7,True,True,True
OMDL8,True,True,True
OMDL9,True,True,True
OMDL10,True,True,True


Run dictionary Loaded.
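
The internals of `data_functions.discover_runs` are not shown in this notebook. As a minimal sketch of the idea, the code below scans a directory for per-basecaller FASTA files and reports which runs have both. The `<seqs_dir>/<basecaller>/<run_id>.fasta` layout, the function names, and the placeholder files are all hypothetical.

```python
import re
import tempfile
from pathlib import Path


def natural_key(text):
    """Split digits from non-digits so that OMDL2 sorts before OMDL10."""
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", text)]


def discover_runs_sketch(seqs_dir, basecallers=("dorado", "guppy")):
    """Map run ID -> {basecaller: bool} under the assumed directory layout."""
    found = {}
    for caller in basecallers:
        for path in Path(seqs_dir, caller).glob("*.fasta"):
            flags = found.setdefault(path.stem, dict.fromkeys(basecallers, False))
            flags[caller] = True
    # Return runs in natural order rather than plain string order
    return {run: found[run] for run in sorted(found, key=natural_key)}


# Build a throwaway directory with placeholder files to exercise the function
with tempfile.TemporaryDirectory() as tmp:
    for caller, run in [("dorado", "OMDL10"), ("dorado", "OMDL2"), ("guppy", "OMDL2")]:
        target = Path(tmp, caller)
        target.mkdir(exist_ok=True)
        (target / f"{run}.fasta").write_text(">placeholder\nACGT\n")
    runs = discover_runs_sketch(tmp)
    paired = [run for run, flags in runs.items() if all(flags.values())]

print(runs)
print("Runs with both basecallers:", paired)
```

The natural sort key serves the same purpose as `natsorted` in the notebook: plain string sorting would place `OMDL10` before `OMDL2`.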


## Process Runs
Execute the workflow on all runs loaded.

In [3]:
def process_run_data(run_id, seqs_dir, results_dir):
    """Loads, matches, analyzes, and saves data for a single run."""
    print(f"\n--- Processing Run: {run_id} ---")
    run_results = {'run_id': run_id, 'stats': {}, 'counts': {}}

    # 1. Load Sequences
    print("  Loading sequences...")
    dorado_seqs = data_functions.load_sequences(run_id, 'dorado', seqs_dir)
    guppy_seqs = data_functions.load_sequences(run_id, 'guppy', seqs_dir)

    if dorado_seqs is None or guppy_seqs is None:
        print(f"  Skipping {run_id}: Missing Dorado or Guppy sequence file.")
        return None # Indicate failure for this run

    print(f"  Loaded {sum(len(v) for v in dorado_seqs.values())} Dorado sequences across {len(dorado_seqs)} samples.")
    print(f"  Loaded {sum(len(v) for v in guppy_seqs.values())} Guppy sequences across {len(guppy_seqs)} samples.")
    run_results['counts']['dorado_total'] = sum(len(v) for v in dorado_seqs.values()) # Store total counts
    run_results['counts']['guppy_total'] = sum(len(v) for v in guppy_seqs.values())

    # 2. Match Sequences
    print("  Matching sequences...")
    matched_pairs, dorado_only, guppy_only = data_functions.match_sequences(dorado_seqs, guppy_seqs)
    run_results['matched_pairs'] = matched_pairs # Keep matched_pairs for potential later use (e.g., alignment viewer)
    run_results['dorado_only'] = dorado_only
    run_results['guppy_only'] = guppy_only
    run_results['counts']['matched'] = len(matched_pairs)
    run_results['counts']['dorado_only'] = len(dorado_only)
    run_results['counts']['guppy_only'] = len(guppy_only)
    print(f"  Matching complete: {len(matched_pairs)} pairs, {len(dorado_only)} Dorado-only, {len(guppy_only)} Guppy-only.")

    if not matched_pairs:
        print(f"  Skipping analysis for {run_id}: No matched pairs found.")
        return run_results # Return counts even if no matches

    # 3. Consolidate Metrics into DataFrame (Step 3.4)
    print("  Generating comparison DataFrame...")
    run_comparison_df = data_functions.generate_comparison_dataframe(matched_pairs)
    run_results['comparison_df'] = run_comparison_df # Store DF for this run

    if run_comparison_df.empty:
        print(f"  Skipping further analysis for {run_id}: Comparison DataFrame is empty.")
        return run_results

    # 4. Perform Run-Specific Statistical Analysis (Step 4.2)
    print("  Calculating run statistics...")
    run_stats = data_functions.calculate_run_statistics(run_comparison_df)
    run_results['stats'] = run_stats
    print(f"  Statistics calculated: {list(run_stats.keys()) if run_stats else 'None'}")


    # 5. Save Run-Specific Output (Step 4.3)
    print("  Saving run comparison data...")
    saved_path = data_functions.save_run_comparison(run_comparison_df, run_id, results_dir, format='tsv') # Or 'csv'
    if saved_path:
        print(f"  Run comparison saved to: {saved_path}")
        # Optionally save as CSV too
        # data_functions.save_run_comparison(run_comparison_df, run_id, results_dir, format='csv')
    else:
        print("  Failed to save run comparison data.")


    print(f"--- Finished processing {run_id} ---")
    return run_results

In [4]:
# --- Process All Valid Runs ---
all_runs_analysis_results = {}
valid_run_ids = runs_df[runs_df['Both Available']].index.tolist() # Get naturally sorted IDs from the DataFrame

if not valid_run_ids:
    print("No runs found with both Dorado and Guppy data available.")
else:
    print(f"Starting processing for {len(valid_run_ids)} runs: {', '.join(valid_run_ids)}")
    for run_id in valid_run_ids:
        # Pass necessary directory paths to the function
        results = process_run_data(run_id, SEQS_DIR, RESULTS_DIR)
        if results: # Store results even if some steps failed (e.g., no matches)
            all_runs_analysis_results[run_id] = results

    print("\n--- All Run Processing Complete ---")
    print(f"Processed results stored in 'all_runs_analysis_results' dictionary for {len(all_runs_analysis_results)} runs.")

    # Optional: Display a summary of what was processed
    processed_summary = []
    for run_id, data in all_runs_analysis_results.items():
        counts = data.get('counts', {})
        processed_summary.append({
            'Run_ID': run_id,
            'Matched': counts.get('matched', 0),
            'Dorado_Only': counts.get('dorado_only', 0),
            'Guppy_Only': counts.get('guppy_only', 0),
            'Stats_Keys': list(data.get('stats', {}).keys()) if data.get('stats') else 'N/A'
        })
    summary_df = pd.DataFrame(processed_summary)
    print("\nProcessing Summary:")
    display(summary_df)

Starting processing for 12 runs: OMDL1, OMDL2, OMDL3, OMDL4, OMDL5, OMDL6, OMDL7, OMDL8, OMDL9, OMDL10, OMDL12, OMDL13

--- Processing Run: OMDL1 ---
  Loading sequences...
  Loaded 155 Dorado sequences across 152 samples.
  Loaded 1406 Guppy sequences across 430 samples.
  Matching sequences...
Processing 152 samples common to both Dorado and Guppy...
Processing 0 samples unique to Dorado...
Processing 278 samples unique to Guppy...
Matching complete. Found 85 matched pairs, 71 Dorado-only sequences, 1321 Guppy-only sequences.
  Matching complete: 85 pairs, 71 Dorado-only, 1321 Guppy-only.
  Generating comparison DataFrame...
  Calculating run statistics...
  Statistics calculated: ['RiC', 'Length', 'GC', 'Homo_Count', 'Homo_MaxLen', 'Ambig_Count', 'Ambig_Freq']
  Saving run comparison data...
Run comparison data saved to: .\results\OMDL1_comparison_data.tsv
  Run comparison saved to: .\results\OMDL1_comparison_data.tsv
--- Finished processing OMDL1 ---

--- Processing Run: OMDL2 ---


Unnamed: 0,Run_ID,Matched,Dorado_Only,Guppy_Only,Stats_Keys
0,OMDL1,85,71,1321,"[RiC, Length, GC, Homo_Count, Homo_MaxLen, Amb..."
1,OMDL2,239,217,712,"[RiC, Length, GC, Homo_Count, Homo_MaxLen, Amb..."
2,OMDL3,149,266,880,"[RiC, Length, GC, Homo_Count, Homo_MaxLen, Amb..."
3,OMDL4,223,175,1093,"[RiC, Length, GC, Homo_Count, Homo_MaxLen, Amb..."
4,OMDL5,108,231,589,"[RiC, Length, GC, Homo_Count, Homo_MaxLen, Amb..."
5,OMDL6,191,293,685,"[RiC, Length, GC, Homo_Count, Homo_MaxLen, Amb..."
6,OMDL7,218,417,619,"[RiC, Length, GC, Homo_Count, Homo_MaxLen, Amb..."
7,OMDL8,292,336,459,"[RiC, Length, GC, Homo_Count, Homo_MaxLen, Amb..."
8,OMDL9,302,324,1313,"[RiC, Length, GC, Homo_Count, Homo_MaxLen, Amb..."
9,OMDL10,160,164,554,"[RiC, Length, GC, Homo_Count, Homo_MaxLen, Amb..."


In [5]:
# --- Generate and Save Overall Summary ---
print("\n--- Generating Overall Summary ---")
if all_runs_analysis_results:
    overall_summary_path = data_functions.generate_overall_summary(
        all_runs_analysis_results,
        RESULTS_DIR,
        format='tsv' # Or 'csv'
    )
    if overall_summary_path:
        print(f"Overall summary saved to: {overall_summary_path}")
        # Optionally load and display the summary DataFrame
        try:
            overall_summary_df = pd.read_csv(overall_summary_path, sep='\t')
            print("\nOverall Summary DataFrame Head:")
            display(overall_summary_df.head())
        except Exception as e:
            print(f"Could not read back overall summary file: {e}")
    else:
        print("Failed to generate overall summary file.")
else:
    print("Skipping overall summary generation: No run results available.")


--- Generating Overall Summary ---
Overall summary data saved to: .\results\overall_comparison_summary.tsv
Overall summary saved to: .\results\overall_comparison_summary.tsv

Overall Summary DataFrame Head:


Unnamed: 0,Run_ID,Matched_Pairs,Dorado_Only_Seqs,Guppy_Only_Seqs,RiC_Median_Diff,RiC_p_value,RiC_N_Pairs,Length_Median_Diff,Length_p_value,Length_N_Pairs,GC_Median_Diff,GC_p_value,GC_N_Pairs,Homo_Count_Median_Diff,Homo_Count_p_value,Homo_Count_N_Pairs,Homo_MaxLen_Median_Diff,Homo_MaxLen_p_value,Homo_MaxLen_N_Pairs,Ambig_Count_Median_Diff,Ambig_Count_p_value,Ambig_Count_N_Pairs
0,OMDL1,85,71,1321,21.0,0.0,85,1.0,0.2471,85,0.0,0.9756,85,0.0,0.2371,85,0.0,0.0319,85,0.0,1.0,85
1,OMDL2,239,217,712,96.0,0.0,239,9.0,0.0,239,-0.0003,0.7436,239,0.0,0.0432,239,0.0,0.0408,239,0.0,1.0,239
2,OMDL3,149,266,880,26.0,0.0,149,2.0,0.0739,149,-0.0001,0.0146,149,0.0,0.7088,149,0.0,0.4042,149,0.0,1.0,149
3,OMDL4,223,175,1093,16.0,0.0,223,1.0,0.1108,223,0.0002,0.4414,223,0.0,0.5205,223,0.0,0.1507,223,0.0,1.0,223
4,OMDL5,108,231,589,74.5,0.0,108,4.0,0.5725,108,-0.001,0.001,108,0.0,0.0094,108,0.0,0.0809,108,0.0,1.0,108


# Interactive Run Selection

In [None]:
# --- Main Interactive Cell ---
import ipywidgets as widgets
from IPython.display import display, Markdown, clear_output
from natsort import natsorted
from viz_handler import display_run_analysis, create_sequence_alignment_viewer # Assuming you move viewer creation too
import data_functions # Make sure it's imported

# --- 1. Define Widgets Globally ---
# Dropdown for run selection
processed_run_ids = list(all_runs_analysis_results.keys())
sorted_run_ids = natsorted(processed_run_ids)
selected_run_id = sorted_run_ids[0] if sorted_run_ids else None
run_dropdown = widgets.Dropdown(
    options=sorted_run_ids,
    description='Select Run:',
    disabled=not sorted_run_ids,
    style={'description_width': 'initial'}
)
# Main output area for stats/plots
analysis_output_area = widgets.Output()
# Button and output area for the alignment viewer
show_viewer_button = widgets.Button(description="Show Interactive Alignment Viewer", button_style='success')
viewer_output_area = widgets.Output() # Area for the alignment viewer widgets

# --- 2. Define Event Handlers Globally ---
# Handler for run dropdown change
def on_run_select_change(change):
    global selected_run_id
    if change['type'] == 'change' and change['name'] == 'value':
        selected_run_id = change['new']
        with analysis_output_area:
            clear_output(wait=True)
            # Call function to display stats/plots for the run
            display_run_analysis(selected_run_id, all_runs_analysis_results)
        # Clear the separate alignment viewer area when the run changes
        with viewer_output_area:
            clear_output()

# Handler for the "Show Alignment Viewer" button click
def on_show_viewer_click(button):
    with viewer_output_area:
        clear_output(wait=True)
        if selected_run_id:
            # Call function to create and display the viewer widgets
            create_sequence_alignment_viewer(selected_run_id, all_runs_analysis_results)
        else:
            print("Please select a run first.")

# --- 3. Link Handlers ---
run_dropdown.observe(on_run_select_change, names='value')
show_viewer_button.on_click(on_show_viewer_click)

# --- 4. Display Layout ---
display(Markdown("### Select Run for Detailed Analysis:"))
display(run_dropdown)
display(analysis_output_area) # Area for stats/plots

display(Markdown("---")) # Separator
display(Markdown("#### Interactive Alignment Viewer"))
display(show_viewer_button) # Display button
display(viewer_output_area) # Display area for viewer widgets

# --- 5. Initial Trigger ---
if selected_run_id:
    on_run_select_change({'type': 'change', 'name': 'value', 'new': selected_run_id})
else:
    print("No initial run selected or no runs processed.")

### Select Run for Detailed Analysis:

Dropdown(description='Select Run:', options=('OMDL1', 'OMDL2', 'OMDL3', 'OMDL4', 'OMDL5', 'OMDL6', 'OMDL7', 'O…

Output()

---

#### Interactive Alignment Viewer

Button(button_style='success', description='Show Interactive Alignment Viewer', style=ButtonStyle())

Output()
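
The alignment viewer is meant for inspecting base-level differences between a matched pair. Its implementation lives in `viz_handler` and is not shown here. As a rough stand-alone illustration, the sketch below classifies the differences between two sequences using Python's `difflib`, which is a generic sequence comparison tool and not a biological aligner. The function name, the choice of Guppy as the reference, and the example sequences are assumptions.

```python
from difflib import SequenceMatcher


def summarize_differences(dorado_seq, guppy_seq):
    """Count matches, mismatches, insertions, and deletions relative to Guppy."""
    counts = {"match": 0, "mismatch": 0, "insertion": 0, "deletion": 0}
    matcher = SequenceMatcher(None, guppy_seq, dorado_seq, autojunk=False)
    for tag, g_start, g_end, d_start, d_end in matcher.get_opcodes():
        g_len, d_len = g_end - g_start, d_end - d_start
        if tag == "equal":
            counts["match"] += g_len
        elif tag == "replace":
            # Treat the overlapping part as substitutions, the rest as an indel
            shared = min(g_len, d_len)
            counts["mismatch"] += shared
            counts["insertion"] += d_len - shared
            counts["deletion"] += g_len - shared
        elif tag == "insert":
            counts["insertion"] += d_len
        elif tag == "delete":
            counts["deletion"] += g_len
    return counts


# Hypothetical pair differing by one extra T in a homopolymer run
guppy_seq = "ACGTACG" + "T" * 4 + "GCA"
dorado_seq = "ACGTACG" + "T" * 5 + "GCA"
counts = summarize_differences(dorado_seq, guppy_seq)
identity = counts["match"] / sum(counts.values())
print(counts)
print(f"Identity: {identity:.3f}")
```

Here identity is defined as matching positions divided by all compared columns; other definitions exist, and the one used for the notebook's Sequence Identity metric may differ.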