# Testing the sample consensus calls
## April 14th, 2025

This notebook manually inspects a few calls from this pipeline's routine for calling a sample consensus. The pipeline supports sequencing performed with multiple replicates per sample, and calls a consensus sequence for each replicate. It is desirable to also report a single sample consensus derived from the replicate consensus sequences. Our strategy, applied at each position, is as follows (N denotes an ambiguous base due to lack of coverage):
- if each replicate consensus contains an N, call an N for the sample
- if one replicate contains a nonambiguous base while the other contains an N, call the nonambiguous base
- if each replicate contains a different nonambiguous base, call an N and alert the user

Note that the third condition has not yet been observed in practice. This procedure is implemented in the rule and function `call_sample_consensus`. The purpose of this notebook is to perform some manual inspection for various edge cases that can appear in practice and observe that they are appropriately handled. Data was analyzed with [this commit](https://github.com/moncla-lab/illumina-pipeline/tree/138c59c54f12ac434aa122ad2a074b8d5c8943ab).
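To make the three rules above concrete, here is a minimal sketch of a per-position merge. It is not the pipeline's `call_sample_consensus`; the names `merge_replicate_bases` and `merge_replicate_consensus` are hypothetical. It assumes the two replicate consensus sequences share coordinates (equal length), use uppercase bases, and that two identical nonambiguous bases are called as that base, which the rules imply but do not state.

```python
def merge_replicate_bases(base_1, base_2):
    """Return (sample_base, conflict) for one position of two replicates."""
    if base_1 == 'N':
        # N/N gives N; N/base gives the nonambiguous base
        return base_2, False
    if base_2 == 'N' or base_1 == base_2:
        # base/N gives the base; agreement (assumed) keeps the base
        return base_1, False
    # two different nonambiguous bases: call N and flag the position
    return 'N', True


def merge_replicate_consensus(seq_1, seq_2):
    """Merge two equal-length replicate sequences into a sample call.

    Returns the merged sequence and the 0-based positions where the
    replicates disagree, so the caller can alert the user.
    """
    if len(seq_1) != len(seq_2):
        raise ValueError('replicate sequences must have the same length')
    calls = [merge_replicate_bases(a, b) for a, b in zip(seq_1, seq_2)]
    sample = ''.join(base for base, _ in calls)
    conflicts = [i for i, (_, conflict) in enumerate(calls) if conflict]
    return sample, conflicts


print(merge_replicate_consensus('ANNGT', 'ANCGA'))
# → ('ANCGN', [4])
```

Returning the conflicting positions separately keeps the "alert the user" step out of the merge itself, so the same function can be used for both calling and checking.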

First, some imports...

In [1]:
import glob
import re

import pandas as pd
from Bio import SeqIO

Next, we will extract some project-wide information based on the directory structure. We'd like to see which replicates had consensus calls and grab associated metadata such as sample and segment.

In [2]:
files = glob.glob('data/*/replicate-*/reremapping/segments/*/consensus.fasta')
# match sample, replicate, segment
pattern = re.compile(r'data/(.*)/replicate-(.*)/reremapping/segments/(.*)/consensus\.fasta')

# example pattern match
match = pattern.match(files[0])
print(match.groups())

('be_w3', '2', 'ns')


For each replicate consensus, we'd like to calculate the fraction of N's (stored in the `N_percentage` column as a value between 0 and 1) to make sure we appropriately handle cases such as:
- both replicate consensus sequences are completely full
- one replicate consensus sequence is completely empty
- each replicate consensus sequence has ambiguities at different positions

The following function extracts this information...

In [3]:
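def extract_info(fasta_path):
    """Summarize the ambiguity of a single replicate consensus FASTA."""
    record = SeqIO.read(fasta_path, 'fasta')
    # count ambiguous (N) bases in the consensus
    Ns = sum(base == 'N' for base in record.seq)
    total_bases = len(record)
    # treat an empty sequence as fully ambiguous
    N_percentage = 1 if total_bases == 0 else Ns / total_bases
    # recover metadata from the directory structure
    sample, replicate, segment = pattern.match(fasta_path).groups()
    return (fasta_path, N_percentage, sample, replicate, segment)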


info = pd.DataFrame(
    [extract_info(f) for f in files],
    columns=['file', 'N_percentage', 'sample', 'replicate', 'segment']
)
info.head(20)

| | file | N_percentage | sample | replicate | segment |
| --- | --- | --- | --- | --- | --- |
| 0 | data/be_w3/replicate-2/reremapping/segments/ns... | 0.0 | be_w3 | 2 | ns |
| 1 | data/be_w3/replicate-2/reremapping/segments/na... | 1.0 | be_w3 | 2 | na |
| 2 | data/be_w3/replicate-2/reremapping/segments/pb... | 0.0 | be_w3 | 2 | pb2 |
| 3 | data/be_w3/replicate-2/reremapping/segments/pa... | 0.0 | be_w3 | 2 | pa |
| 4 | data/be_w3/replicate-2/reremapping/segments/ha... | 0.0 | be_w3 | 2 | ha |
| 5 | data/be_w3/replicate-2/reremapping/segments/mp... | 0.717624 | be_w3 | 2 | mp |
| 6 | data/be_w3/replicate-2/reremapping/segments/np... | 0.000639 | be_w3 | 2 | np |
| 7 | data/be_w3/replicate-2/reremapping/segments/pb... | 1.0 | be_w3 | 2 | pb1 |
| 8 | data/be_w3/replicate-1/reremapping/segments/ns... | 0.0 | be_w3 | 1 | ns |
| 9 | data/be_w3/replicate-1/reremapping/segments/na... | 1.0 | be_w3 | 1 | na |


...and saves it to a [CSV file](https://en.wikipedia.org/wiki/Comma-separated_values).

In [4]:
info.to_csv('Ns.csv')
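One way to surface candidates for the cases listed earlier is to put the replicates side by side for each sample and segment. The sketch below is an illustration rather than part of the pipeline: the helper `categorize_replicate_pairs` and its category labels are hypothetical, and it assumes each (sample, segment, replicate) combination appears only once in `info`. Note that the N fraction alone cannot show whether two replicates are ambiguous at *different* positions, so the `partial` category only narrows down what to inspect by eye.

```python
import pandas as pd


def categorize_replicate_pairs(info):
    """Label each (sample, segment) by how ambiguous its replicates are."""
    # one row per (sample, segment), one column per replicate
    wide = (
        info.set_index(['sample', 'segment', 'replicate'])['N_percentage']
        .unstack('replicate')
    )

    def classify(row):
        values = row.dropna()
        if len(values) < 2:
            return 'missing replicate'
        if (values == 0).all():
            return 'both full'
        if (values == 1).all():
            return 'both empty'
        if (values == 1).any():
            return 'one empty'
        # at least one replicate has some, but not all, positions ambiguous
        return 'partial'

    return wide.assign(category=wide.apply(classify, axis=1))
```

For example, `categorize_replicate_pairs(info).query("category == 'partial'")` would list the segments where neither replicate is empty but at least one has ambiguous positions.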

Finally, some code to visualize alignments in the notebook...

After manual inspection, we came across the following samples of interest for testing.

Note that in each screenshot, the order is
- sample
- replicate 1
- replicate 2.

## be_w3 mp
Looks correctly called, as replicate 1 is empty.

In [5]:
def get_relevant_info(sample, segment):
    # rows of `info` for every replicate of the given sample and segment
    return info.loc[(info['sample'] == sample) & (info['segment'] == segment), :]


get_relevant_info('be_w3', 'mp')

| | file | N_percentage | sample | replicate | segment |
| --- | --- | --- | --- | --- | --- |
| 5 | data/be_w3/replicate-2/reremapping/segments/mp... | 0.717624 | be_w3 | 2 | mp |
| 13 | data/be_w3/replicate-1/reremapping/segments/mp... | 1.0 | be_w3 | 1 | mp |


<img src="images/001-bew3-mp.png" width="1140"/>

## rth_w3 mp
An interesting one. Replicate 2 is deficient on the 5' end, while replicate 1 is deficient on the 3' end, but the sample call looks correct and fills in nicely.

<img src="images/001-rthw3-mp-5p.png" width="550"/>
<img src="images/001-rthw3-mp-3p.png" width="550"/>

In [6]:
get_relevant_info('rth_w3', 'mp')

| | file | N_percentage | sample | replicate | segment |
| --- | --- | --- | --- | --- | --- |
| 53 | data/rth_w3/replicate-2/reremapping/segments/m... | 0.014606 | rth_w3 | 2 | mp |
| 61 | data/rth_w3/replicate-1/reremapping/segments/m... | 0.017527 | rth_w3 | 1 | mp |
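Complementary gaps like the ones in this segment can also be located without a screenshot by listing the runs of N's in each replicate. The helper below is a hypothetical convenience, not something the pipeline provides; it upper-cases the sequence first, which is an assumption about how ambiguity might be encoded.

```python
import re


def n_runs(sequence):
    """Return 0-based, half-open (start, end) intervals of consecutive N's."""
    return [m.span() for m in re.finditer(r'N+', str(sequence).upper())]


print(n_runs('NNNACGTNNACN'))
# → [(0, 3), (7, 9), (11, 12)]
```

Applied to a replicate consensus, e.g. `n_runs(SeqIO.read(fasta_path, 'fasta').seq)`, a run starting at 0 corresponds to a deficient 5' end and a run ending at the sequence length to a deficient 3' end.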


## kc_com5 pa
Both replicates, as well as the sample, are fully filled in.

<img src="images/001-kccom5-pa.png" width="1100"/>

In [7]:
get_relevant_info('kc_com5', 'pa')

| | file | N_percentage | sample | replicate | segment |
| --- | --- | --- | --- | --- | --- |
| 115 | data/kc_com5/replicate-2/reremapping/segments/... | 0.0 | kc_com5 | 2 | pa |
| 123 | data/kc_com5/replicate-1/reremapping/segments/... | 0.0 | kc_com5 | 1 | pa |


## ms_w2 pb1
Fills in nicely, mostly supplied by replicate 2 with agreement from replicate 1.

<img src="images/001-msw2-pb1.png" width="1100"/>

In [8]:
get_relevant_info('ms_w2', 'pb1')

| | file | N_percentage | sample | replicate | segment |
| --- | --- | --- | --- | --- | --- |
| 327 | data/ms_w2/replicate-2/reremapping/segments/pb... | 0.087997 | ms_w2 | 2 | pb1 |
| 335 | data/ms_w2/replicate-1/reremapping/segments/pb... | 0.529261 | ms_w2 | 1 | pb1 |


The following bash command was useful to check this. It concatenates the sample consensus with the replicate 1 and replicate 2 consensus sequences, then opens the resulting file. A cursory glance at a few edge cases from the above CSV passed inspection. Simply adjust the `sample` and `segment` variables to inspect other cases.

```
# choose the sample and segment to inspect
sample=ms_w2
segment=pb1
fasta="$sample-$segment-check.fasta"
# sample consensus for this segment first, then both replicate consensus sequences
seqkit grep -p "$segment" "data/$sample/consensus.fasta" > "$fasta"
cat "data/$sample/replicate-1/reremapping/segments/$segment/consensus.fasta" \
    "data/$sample/replicate-2/reremapping/segments/$segment/consensus.fasta" >> "$fasta" && open "$fasta"