# Nanocompore Usage 

## Preparing data 

The 2 matching nanopore datasets have to be basecalled, aligned and resquiggled before the data can be processed by *nanocompore*.  

### Basecalling reads

If using Albacore for basecalling from raw fast5 files, use the *--disable_filtering* option to turn off filtering of low-quality reads, as the presence of modifications might bias the read quality.

Example of basecalling of a direct RNA sample with *Albacore v2.3.1*: 

In [None]:
read_fast5_basecaller.py --disable_filtering -r -f FLO-MIN106 -k SQK-RNA001 -o fastq -q 0 -i ./raw/ -s ./basecall/

### Read alignment

Basecalled reads have to be aligned to a reference. For dRNA-Seq, reads should be aligned to a reference transcriptome (not a genome) in a non-spliced fashion. We recommend filtering BAM files to remove any reads that are unmapped, secondary, supplementary or aligned with a MAPQ score below 30. Nanopolish also requires the alignments to be sorted and indexed.

Example of dRNA-Seq alignment with *Minimap2 v2.11*:

In [None]:
minimap2 -ax map-ont ./reference/transcriptome.fa ./basecall/workspace/reads.fastq > ./alignment/reads.sam

samtools view ./alignment/reads.sam -bh -F 2308 -q 30 | samtools sort -o ./alignment/reads.bam

samtools index ./alignment/reads.bam
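
The `-F 2308` value used above is the sum of three SAM flag bits: unmapped (4), secondary (256) and supplementary (2048). The Python sketch below only illustrates what that filter does for a single alignment record. The function name `keep_alignment` is hypothetical and is not part of samtools or nanocompore.

```python
# SAM flag bits excluded by `samtools view -F 2308`
UNMAPPED = 0x4         # 4
SECONDARY = 0x100      # 256
SUPPLEMENTARY = 0x800  # 2048

EXCLUDE_MASK = UNMAPPED | SECONDARY | SUPPLEMENTARY


def keep_alignment(flag, mapq, min_mapq=30):
    """Return True if a record would survive `-F 2308 -q 30` (illustrative helper)."""
    if flag & EXCLUDE_MASK:
        # At least one of the excluded bits is set
        return False
    # `-q INT` skips alignments with MAPQ smaller than INT
    return mapq >= min_mapq


print(EXCLUDE_MASK)
print(keep_alignment(flag=0, mapq=60))
print(keep_alignment(flag=256, mapq=60))
```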

### Event alignment with nanopolish

Nanopolish is required to realign the raw signal to the expected reference sequence. Reads first have to be indexed with *nanopolish index*, then realigned with *nanopolish eventalign*. Finally, the data has to be collapsed per kmer and indexed with *NanopolishComp Eventalign_collapse*.

Example of dRNA-Seq event alignment with *Nanopolish v0.10.1* and *NanopolishComp v0.4.3*:

In [None]:
nanopolish index -d ./raw/ ./basecall/workspace/reads.fastq

nanopolish eventalign --reads ./basecall/workspace/reads.fastq --bam ./alignment/reads.bam --genome ./reference/transcriptome.fa --samples > ./eventalign/reads.tsv

NanopolishComp Eventalign_collapse -i ./eventalign/reads.tsv -o ./eventalign/reads_collapsed.tsv
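
To give an idea of what "collapsing per kmer" means, here is a minimal Python sketch. It is not the NanopolishComp implementation: the table layout is a simplified, hypothetical subset of the eventalign columns, the values are invented, and the choice of the median as summary statistic is an assumption made for illustration.

```python
import csv
import io
from itertools import groupby
from statistics import median

# Toy, simplified eventalign-like table (hypothetical values and columns)
TOY_TSV = (
    "contig\tposition\tread_index\tsamples\n"
    "tx1\t10\t0\t70.1,71.3\n"
    "tx1\t10\t0\t69.8\n"
    "tx1\t11\t0\t80.2,81.0,79.9\n"
    "tx1\t10\t1\t72.5,73.5\n"
)


def collapse_per_kmer(handle):
    """Merge consecutive events of the same read and position into one record."""
    reader = csv.DictReader(handle, delimiter="\t")

    def key(row):
        return (row["contig"], row["read_index"], row["position"])

    collapsed = []
    # groupby only merges consecutive rows, which matches an event-ordered file
    for (contig, read_index, position), rows in groupby(reader, key=key):
        samples = [float(x) for row in rows for x in row["samples"].split(",")]
        collapsed.append({
            "contig": contig,
            "read_index": int(read_index),
            "position": int(position),
            "n_signals": len(samples),
            "median": median(samples),
        })
    return collapsed


for record in collapse_per_kmer(io.StringIO(TOY_TSV)):
    print(record)
```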

## Using the Nanocompore Python API

### Generate a whitelist of transcript intervals

This step is necessary to narrow down the list of candidate positions to those with sufficient coverage in both samples. In short, the method parses the index files and saves the information corresponding to the relevant reads.

The method requires the path of the index file generated by *NanopolishComp Eventalign_collapse* for each of the 2 samples. It also requires the path of the fasta index file corresponding to the reference used for the genome/transcriptome alignment. This file can easily be generated using *samtools faidx*.
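
For orientation, a fasta index (`.fai`) file is a tab-separated table whose first two columns are the reference name and its length. The sketch below reads such a table in plain Python. The helper name `read_fai` and the toy content are hypothetical, and this is not how nanocompore itself parses the file.

```python
import io

# Toy .fai content (hypothetical names and values):
# name, length, offset, line bases, line width
TOY_FAI = "refA\t1314\t9\t60\t61\nrefB\t999\t1355\t60\t61\n"


def read_fai(handle):
    """Return a dict mapping each reference name to its length."""
    lengths = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        # Only the first two columns are needed here
        name, length, *_ = line.split("\t")
        lengths[name] = int(length)
    return lengths


ref_lengths = read_fai(io.StringIO(TOY_FAI))
print(len(ref_lengths), ref_lengths)
```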

The following options are also available:
* **min_coverage**: minimal coverage required in both samples
* **downsample_high_coverage**: For references with high coverage, downsample by randomly selecting reads.
* **max_NNNNN_kmers_freq**: maximum frequency of NNNNN kmers in reads (1 to deactivate)
* **max_mismatching_kmers_freq**: maximum frequency of mismatching kmers in reads (1 to deactivate)
* **max_missing_kmers_freq**: maximum frequency of missing kmers in reads (1 to deactivate)
* **logLevel**: Set the log level. Valid values: warning, info, debug
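
The three frequency options can be pictured as per-read filters. The sketch below is an assumption about how such thresholds could be applied (count of flagged kmers divided by the total number of kmers in the read); it is not taken from the nanocompore source. The `Read` tuple is a hypothetical subset of the fields stored in the index, and the default thresholds are illustrative only.

```python
from collections import namedtuple

# Hypothetical subset of the per-read fields stored in the index file
Read = namedtuple(
    "Read", "read_id kmers NNNNN_kmers mismatching_kmers missing_kmers")


def passes_kmer_filters(
        read,
        max_NNNNN_kmers_freq=0.2,        # illustrative default
        max_mismatching_kmers_freq=0.2,  # illustrative default
        max_missing_kmers_freq=0.2):     # illustrative default
    """Assumed logic: each kmer class must stay at or below its maximum frequency."""
    if read.kmers == 0:
        return False
    return (
        read.NNNNN_kmers / read.kmers <= max_NNNNN_kmers_freq
        and read.mismatching_kmers / read.kmers <= max_mismatching_kmers_freq
        and read.missing_kmers / read.kmers <= max_missing_kmers_freq
    )


read = Read("110", kmers=121, NNNNN_kmers=3, mismatching_kmers=0, missing_kmers=4)
print(passes_kmer_filters(read))
print(passes_kmer_filters(read, max_missing_kmers_freq=0.03))
```

With this logic a threshold of 1 can never be exceeded, which is consistent with "1 to deactivate".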

In [4]:
from nanocompore.Whitelist import Whitelist

w = Whitelist(
    s1_index_fn="./data/KO/eventalign_collapsed.tsv.idx",
    s2_index_fn="./data/WT/eventalign_collapsed.tsv.idx",
    fasta_index_fn="./data/references/SC_R64-1-1_transcripts_clean.fa.fai",
    min_coverage=2)

Initialise and checks options
Read fasta index files
	Total references: 6713
Read eventalign index files
	References found in index: 90
Filter out references with low coverage
	References remaining after reference coverage filtering: 2
Compute coverage per reference and select intervals with high enough coverage
100%|██████████| 2/2 [00:00<00:00, 174.68 References/s]
	References remaining after position coverage filtering: 2
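
The interval selection step reported in the log can be sketched as follows. This is a simplified illustration under stated assumptions, not the nanocompore code: read coordinates are treated as half-open `(start, end)` pairs, returned intervals use an inclusive end, and the function name and toy coordinates are hypothetical.

```python
def covered_intervals(s1_reads, s2_reads, ref_len, min_coverage):
    """Return (start, end) intervals, end inclusive, where both samples
    reach at least min_coverage reads."""
    cov1 = [0] * ref_len
    cov2 = [0] * ref_len
    for cov, reads in ((cov1, s1_reads), (cov2, s2_reads)):
        for start, end in reads:
            # Clip reads to the reference boundaries
            for pos in range(max(start, 0), min(end, ref_len)):
                cov[pos] += 1

    intervals = []
    start = None
    for pos in range(ref_len):
        ok = cov1[pos] >= min_coverage and cov2[pos] >= min_coverage
        if ok and start is None:
            start = pos                          # open a new interval
        elif not ok and start is not None:
            intervals.append((start, pos - 1))   # close the current interval
            start = None
    if start is not None:
        intervals.append((start, ref_len - 1))
    return intervals


# Toy read coordinates (hypothetical)
s1 = [(868, 993), (0, 993)]
s2 = [(0, 993), (9, 994)]
print(covered_intervals(s1, s2, ref_len=1000, min_coverage=2))
```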


The resulting Whitelist object can be used as an iterator

In [5]:
for ref_id, ref_dict in w:
    print(ref_id, ref_dict["interval_list"])

YHR174W [(0, 1309)]
YGR192C [(868, 992)]


... or as a dictionary

In [6]:
w["YGR192C"]

OrderedDict([('interval_list', [(868, 992)]),
             ('S1',
              [line_tuple(ref_id='YGR192C', ref_start=868, ref_end=993, read_id='110', kmers=121, NNNNN_kmers=3, mismatching_kmers=0, missing_kmers=4, byte_offset=6598473, byte_len=6795),
               line_tuple(ref_id='YGR192C', ref_start=0, ref_end=993, read_id='109', kmers=942, NNNNN_kmers=35, mismatching_kmers=0, missing_kmers=51, byte_offset=6677643, byte_len=52338)]),
             ('S2',
              [line_tuple(ref_id='YGR192C', ref_start=0, ref_end=976, read_id='112', kmers=931, NNNNN_kmers=31, mismatching_kmers=0, missing_kmers=45, byte_offset=6094171, byte_len=51643),
               line_tuple(ref_id='YGR192C', ref_start=9, ref_end=994, read_id='113', kmers=944, NNNNN_kmers=26, mismatching_kmers=0, missing_kmers=41, byte_offset=6213824, byte_len=52390),
               line_tuple(ref_id='YGR192C', ref_start=278, ref_end=995, read_id='115', kmers=677, NNNNN_kmers=22, mismatching_kmers=0, missing_kmers=40, byte

Finally, it can be written as a basic BED file

In [7]:
w.to_bed(bed_fn="./output/whitelist.bed")
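
A basic BED file has at least three tab-separated columns: reference name, start and end. The sketch below writes the intervals shown earlier in that form. It is only an illustration: the helper `write_bed` is hypothetical, and whether `to_bed` adds further columns or adjusts the end coordinate is not covered here (the coordinates are written unchanged, which is an assumption).

```python
import io


def write_bed(intervals_by_ref, handle):
    """Write one 3-column BED line per interval (coordinates left unchanged)."""
    for ref_id, intervals in intervals_by_ref.items():
        for start, end in intervals:
            handle.write(f"{ref_id}\t{start}\t{end}\n")


# Intervals taken from the whitelist output shown above
intervals_by_ref = {
    "YHR174W": [(0, 1309)],
    "YGR192C": [(868, 992)],
}

buffer = io.StringIO()
write_bed(intervals_by_ref, buffer)
print(buffer.getvalue(), end="")
```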

### Compare samples with SampComp

SampComp is the main method of the package. Based on the information contained in the whitelist object, it retrieves the read data and aggregates the results per position of each reference sequence. The method requires the path of the data file generated by *NanopolishComp Eventalign_collapse* for each of the 2 samples. It also needs a whitelist object and a path where the output result database file will be written.

The following options are also available:

* **padj_threshold**: Adjusted p-value threshold for reporting sites.
* **comparison_method**: Statistical method used to compare the signal of the 2 samples (default kmean)
* **sequence_context**: Extend the statistical analysis to contiguous adjacent bases if available
* **nthreads**: Number of threads (two are used for reading and writing, all the others for processing in parallel).
* **logLevel**: Set the log level. Valid values: warning, info, debug
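
The per-position aggregation described above can be pictured with a small sketch. It is not the SampComp implementation: each toy read is a hypothetical dictionary mapping a reference position to a single value, and the helper name `aggregate_per_position` is invented for illustration.

```python
from collections import OrderedDict


def aggregate_per_position(s1_reads, s2_reads, interval):
    """Group per-read values by reference position and by sample."""
    start, end = interval  # end treated as inclusive (assumption)
    out = OrderedDict()
    for pos in range(start, end + 1):
        out[pos] = {"S1": [], "S2": []}
    for label, reads in (("S1", s1_reads), ("S2", s2_reads)):
        for read in reads:
            for pos, value in read.items():
                if pos in out:  # ignore positions outside the whitelisted interval
                    out[pos][label].append(value)
    return out


# Toy reads (hypothetical values)
s1_reads = [{868: 73.5, 869: 71.1}, {868: 72.8}]
s2_reads = [{868: 74.1, 869: 70.6, 870: 75.3}]

for pos, samples in aggregate_per_position(s1_reads, s2_reads, (868, 869)).items():
    print(pos, samples)
```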

In [8]:
from nanocompore.SampComp import SampComp

In [12]:
s = SampComp(
    s1_fn="./data/KO/eventalign_collapsed.tsv",
    s2_fn="./data/WT/eventalign_collapsed.tsv",
    output_db_fn="./output/out.db",
    whitelist=w,
    nthreads=8)

Initialise and checks options
Start data processing
100%|██████████| 2/2 [00:00<00:00,  2.60 Processed References/s]


In [10]:
s.ref_id_list

['YGR192C', 'YHR174W']

In [11]:
s["YGR192C"]

OrderedDict([(868,
              {'S1': [(73.4564208984375, 27), (72.79857635498047, 66)],
               'S2': [(74.06575012207031, 49),
                (73.90553283691406, 85),
                (74.39916229248047, 62),
                (72.6175765991211, 20)]}),
             (869,
              {'S1': [(71.12263488769531, 23), (71.90535736083984, 88)],
               'S2': [(70.60127258300781, 8),
                (72.11139678955078, 38),
                (71.4753646850586, 31),
                (71.62437438964844, 31)]}),
             (870,
              {'S1': [(74.83845520019531, 13), (75.99736785888672, 35)],
               'S2': [(75.2939453125, 7),
                (75.69001007080078, 27),
                (75.6395492553711, 18),
                (75.23548889160156, 30)]}),
             (871,
              {'S1': [(80.9549331665039, 28), (84.62736511230469, 16)],
               'S2': [(87.08580780029297, 44),
                (86.9090347290039, 9),
                (86.66130065917969, 30