
# PRECISE-QC — Per-Nucleotide Error Profile

This notebook reproduces the **stacked bar error plot** used to visualize
per-position mismatch, insertion, and deletion rates across an sgRNA.

Input: a tab-separated text file (`.txt` or `.tsv`) generated by:

```bash
pysamstats --type variation --fasta reference.fa full_length.bam > only_variation.txt
```


Columns should include:

* `pos`: reference position
* `ref`: reference nucleotide
* `matches`, `mismatches`, `insertions`, `deletions`
* `reads_all`: total coverage
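
Before plotting, it can help to confirm that the header actually contains these columns. The snippet below is a minimal sketch and not part of the original notebook: the helper name `missing_columns` and the toy header line are hypothetical.

```python
# Columns this notebook reads from the pysamstats output
REQUIRED = ["pos", "ref", "matches", "mismatches", "insertions", "deletions", "reads_all"]

def missing_columns(columns, required=REQUIRED):
    """Return the required column names absent from `columns` (hypothetical helper)."""
    present = set(columns)
    return [name for name in required if name not in present]

# Toy header line standing in for the first line of a real tab-separated file
header = "chrom\tpos\tref\treads_all\tmatches\tmismatches\tdeletions\tinsertions"
missing = missing_columns(header.split("\t"))
print(missing)  # an empty list means every required column is present
```

Once the file is loaded into a DataFrame `df`, `missing_columns(df.columns)` performs the same check.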



Make sure you have the required Python packages:

- `pandas` for data handling
- `matplotlib` for plotting

If running on **Colab**, you may need to install them:
```bash
!pip install pandas matplotlib
```


In [2]:
# Install the libraries (if needed) and import them
!pip install pandas matplotlib
import pandas as pd
import matplotlib.pyplot as plt



## 1) Load Data

We load the variation statistics file into a pandas DataFrame and extract:

- reference nucleotides (`ref`)
- positions (`pos`)
- counts of matches, mismatches, insertions, deletions
- coverage (`reads_all`)

### Provide the path to the input file below

In [None]:
#@title EDIT and run this cell: input file

INFILE = "path/to/input" # edit here


all_data = pd.read_csv(INFILE, sep="\t")

nucleotide = all_data["ref"].astype(str).values
position = all_data["pos"].values
matches = all_data["matches"].values
mismatches = all_data["mismatches"].values
deletions = all_data["deletions"].values
insertions = all_data["insertions"].values
coverage = all_data["reads_all"].values


## 2) Normalize Error Counts

We calculate the **fraction** of mismatches, deletions, and insertions
at each position relative to the total observations (matches + errors).
This ensures percentages are comparable across positions with different coverage.

A safeguard sets `total` to 1 wherever the summed counts are 0, which avoids division by zero; the fractions at such positions come out as 0.
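
As a worked example, the sketch below applies the same normalization to made-up counts (not data from this notebook). The first two positions have different depths but the same error fraction, and the third has no observations at all. `np.where` is used here as an assumed alternative way of writing the zero safeguard.

```python
import numpy as np

# Toy counts for three positions; the last one has no observations
matches = np.array([90, 45, 0])
mismatches = np.array([5, 3, 0])
deletions = np.array([3, 1, 0])
insertions = np.array([2, 1, 0])

total = matches + mismatches + deletions + insertions

# Same safeguard as in the notebook: divide by 1 where nothing was observed
safe_total = np.where(total == 0, 1, total)

error_fraction = (mismatches + deletions + insertions) / safe_total
print(error_fraction)
```

Positions one and two both give an error fraction of 0.1 despite a twofold difference in depth, and the empty position yields 0 rather than a division error.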



In [None]:
#@title RUN this cell for error normalization
total = matches + mismatches + deletions + insertions
total[total == 0] = 1   # avoid division by zero

n_matches = matches / total
n_mismatches = mismatches / total
n_deletions = deletions / total
n_insertions = insertions / total
total_errors = n_mismatches + n_deletions + n_insertions

## 3) Plot Stacked Bar Chart

- Each bar = a reference position.
- Bars are stacked: **Deletions (blue)**, **Insertions (gray)**, **Mismatches (red)**.
- A dashed black line marks the **5% error threshold**.
- X-axis = reference position, labeled with the reference nucleotide; the limits are fixed to positions 1 to 101.
- Y-axis = error rate (%) up to 40%.

In [None]:
# Plotting the figure
fig, ax = plt.subplots(figsize=(12, 5))

# Horizontal error threshold (5%)
ax.axhline(5, color="k", linestyle="--", label="Error Threshold")

# Stacked bars (convert to %)
ax.bar(position, n_deletions*100, color="#85b3cb", width=0.6, label="Deletions")
ax.bar(position, n_insertions*100, bottom=n_deletions*100, color="#999999", width=0.6, label="Insertions")
ax.bar(position, n_mismatches*100, bottom=(n_deletions+n_insertions)*100,
       color="#ea7878", width=0.6, label="Mismatches")

# Axis formatting
ax.set_xticks(position)
ax.set_xticklabels(nucleotide, rotation=0)
ax.set_xlim([1, 101])
ax.set_ylim([0, 40])
ax.set_yticks([0, 10, 20, 30])
ax.set_ylabel("% error")

ax.legend()
plt.tight_layout()
plt.show()
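
To complement the plot, the positions whose combined error rate exceeds the threshold can also be listed as text. This is a minimal sketch on toy values and not part of the original notebook: in practice `position`, `nucleotide`, and `total_errors` would be the arrays computed above, and the name `flagged` is hypothetical.

```python
import numpy as np

THRESHOLD = 0.05  # 5% as a fraction, matching the dashed line in the plot

# Toy values standing in for the notebook's arrays
position = np.array([1, 2, 3, 4])
nucleotide = np.array(["G", "A", "C", "U"])
total_errors = np.array([0.02, 0.08, 0.05, 0.31])

# Boolean mask of positions strictly above the threshold
above = total_errors > THRESHOLD

# (position, nucleotide, error rate in %) for each flagged position
flagged = [
    (int(p), str(n), round(float(e) * 100, 1))
    for p, n, e in zip(position[above], nucleotide[above], total_errors[above])
]
print(flagged)
```

The comparison is strict, so a position sitting exactly at 5% is not flagged; use `>=` if it should be.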