# PRECISE-QC - Errors per full length read

This notebook reproduces the overlay plot used to visually compare sequencing results from multiple runs.


The expected input files are alignment files in SAM format.
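
Because mismatches are read from `MD` tags, it can help to confirm that the tags are present before running the analysis. The sketch below is a plain-Python check and is not part of the original notebook; the function name `fraction_with_md` and the example records are hypothetical. It assumes tab-separated SAM text lines, with optional tags starting at column 12.

```python
def fraction_with_md(sam_lines):
    """Return the fraction of mapped alignment records that carry an MD tag."""
    mapped = 0
    with_md = 0
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with "@")
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x4:  # FLAG bit 0x4 marks an unmapped read
            continue
        mapped += 1
        # Optional fields (TAG:TYPE:VALUE) follow the 11 mandatory columns
        if any(field.startswith("MD:Z:") for field in fields[11:]):
            with_md += 1
    return with_md / mapped if mapped else 0.0


# Made-up records for illustration only
example_lines = [
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t0\tchr1\t1\t60\t10M\t*\t0\t0\tACGTACGTAC\t*\tNM:i:1\tMD:Z:4A5",
    "r2\t0\tchr1\t1\t60\t10M\t*\t0\t0\tACGTACGTAC\t*\tNM:i:0",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t*",
]
print(fraction_with_md(example_lines))  # 0.5 (r1 has MD, r2 does not, r3 is unmapped)
```

For a real file, the same function could be called as `fraction_with_md(open(file1))`. A fraction well below 1.0 suggests that mismatches would be undercounted; `samtools calmd` is one common way to add `MD` tags when the reference FASTA is available.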


- Each curve represents the fraction of reads with a given number of total error events.
- Error events are counted as:
  - **Mismatches** (from `MD` tags),
  - **Insertions** (`I` in CIGAR),
  - **Deletions** (`D` in CIGAR).
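
As an illustration of these counting rules, the following sketch tallies the events for one made-up alignment. The function name `count_error_events` and the CIGAR/MD strings are hypothetical examples. Note that a multi-base insertion or deletion counts as a single event.

```python
import re


def count_error_events(cigar, md):
    """Return (mismatches, insertions, deletions) as event counts."""
    # Drop deleted reference bases (e.g. "^ACG") so they are not
    # mistaken for mismatches, then count the remaining letters.
    md_without_deletions = re.sub(r"\^[A-Z]+", "", md)
    mismatches = len(re.findall(r"[A-Z]", md_without_deletions))
    # Each "I" or "D" operation in the CIGAR string is one event,
    # regardless of how many bases it spans.
    insertions = cigar.count("I")
    deletions = cigar.count("D")
    return mismatches, insertions, deletions


# Hypothetical read: a 2-base insertion, a 3-base deletion, two mismatches
cigar = "5M2I10M3D8M"
md = "3A11^ACG4T3"

mismatches, insertions, deletions = count_error_events(cigar, md)
print(mismatches, insertions, deletions)     # 2 1 1
print(mismatches + insertions + deletions)   # 4 total error events
```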


## 0) Setup

Required packages:
- `pysam`: to parse SAM files
- `numpy`: to process the data
- `matplotlib`: to plot the results
- `re`, `collections.Counter` (built-in)

If using **Colab**, install libraries first:
```bash

!pip install pysam matplotlib
```

In [9]:
#@title Run this cell to install the libraries
!pip install pysam matplotlib


In [11]:
#@title EDIT this cell and run it to provide the paths to the input SAM files

file1 = "/path/to/SAM_1" # edit the path here
file2 = "/path/to/SAM_2" # edit the path here
file3 = "/path/to/SAM_3" # edit the path here

In [10]:
#@title Run this cell to import the libraries
import pysam
import re
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

## 1) Define Helper Functions

We create two helper functions:
- `count_mismatches_from_md(md)`: counts mismatches in the MD tag
- `process_sam(filename)`: parses a SAM file and returns **per-read error counts**.

In [15]:
#@title Run this cell to define the functions

def count_mismatches_from_md(md):
    """Count mismatches from MD tag (letters = mismatches)."""
    md = re.sub(r"\^[A-Z]+", "", md)   # remove deletions
    return len(re.findall(r"[A-Z]", md))

def process_sam(filename):
    """Return per-read error counts from a SAM file."""
    error_counts = []

    # The context manager closes the file even if parsing raises an error
    with pysam.AlignmentFile(filename, "r") as samfile:
        for read in samfile.fetch(until_eof=True):
            if read.is_unmapped:
                continue

            # mismatches from MD tag (reads without an MD tag count as 0 mismatches)
            try:
                md = read.get_tag("MD")
                mismatches = count_mismatches_from_md(md)
            except KeyError:
                mismatches = 0

            cigar = read.cigarstring or ""

            # insertion/deletion events (not bases)
            insertions = cigar.count("I")
            deletions = cigar.count("D")

            total_errors = mismatches + insertions + deletions
            error_counts.append(total_errors)

    return np.array(error_counts)

#@title Run this cell to define the plotting function


def plot_area_distributions(files, labels):
    """Plot filled line curves (like MATLAB area())."""
    plt.figure(figsize=(7,7))
    colors = ["#1f77b4", "#ff7f0e", "#2ca02c"]  # blue, orange, green

    for filename, label, color, alpha in zip(files, labels, colors, [0.3, 0.5, 0.3]):
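        error_counts = process_sam(filename)
        total_reads = len(error_counts)

        # Skip files without mapped reads to avoid dividing by zero
        if total_reads == 0:
            print(f"No mapped reads found in {filename}; skipping.")
            continue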

        # Count number of reads for each error value
        counts = Counter(error_counts)
        xs = np.arange(0, max(counts.keys())+1)   # full x-range without gaps
        ys = np.array([counts.get(x, 0) for x in xs]) / total_reads * 100  # %

        # plot
        plt.fill_between(xs, ys, alpha=alpha, color=color, label=label)
        plt.plot(xs, ys, color=color, linewidth=0)

    plt.xlabel("Number of Errors per Read")
    plt.ylabel("Reads (%)")
    plt.xlim(left=0, right=32)
    plt.ylim(bottom=0, top=17)
    plt.yticks([0, 5, 10, 15, 20])
    plt.axis("square")
    plt.legend()
    plt.tight_layout()
    plt.show()




We use `plot_area_distributions` to:
- Count number of reads with each error count,
- Normalize to **% of total reads**,
- Plot filled line curves (like MATLAB `area` plots).
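
The counting and normalization steps can be sketched on a small made-up list of per-read error counts. This is a plain-Python illustration (the notebook's function works on NumPy arrays), and `percent_distribution` is a hypothetical name.

```python
from collections import Counter


def percent_distribution(error_counts):
    """Return (xs, ys): each error value and the % of reads having it."""
    counts = Counter(error_counts)
    total = len(error_counts)
    # Cover every value from 0 to the maximum so the curve has no gaps
    xs = list(range(max(counts) + 1))
    ys = [100 * counts.get(x, 0) / total for x in xs]
    return xs, ys


# Eight hypothetical reads and their total error counts
xs, ys = percent_distribution([0, 0, 1, 1, 1, 3, 3, 5])
print(xs)  # [0, 1, 2, 3, 4, 5]
print(ys)  # [25.0, 37.5, 0.0, 25.0, 0.0, 12.5]
```

Error values that no read has (2 and 4 here) get 0%, which keeps the filled curve continuous along the x-axis.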


Provide a name for each sequence below; these names are used as the legend labels in the plot.

In [14]:
#@title EDIT and run this cell to assign the name of each sequence

first_seq = "EDIT_NAME_HERE"
second_seq = "EDIT_NAME_HERE"
third_seq = "EDIT_NAME_HERE"

In [None]:
#@title Finally, run this cell to plot

plot_area_distributions(
    [file1, file2, file3],
    [first_seq, second_seq, third_seq]
)
