# Detecting adapters

Detecting the adapter is tricky for datasets where non-standard adapters have been used or a barcode sits adjacent to the adapter.

# 3' adapters

This section is copied verbatim from the `cutadapt` [manual](https://cutadapt.readthedocs.io/en/v1.7.1/guide.html).


A 3’ adapter is a piece of DNA ligated to the 3’
end of the DNA fragment you are interested in. 
The sequencer starts the sequencing process at the 5’
end of the fragment and sequences into the adapter
if the read is long enough. 

The read that it outputs will then have a part of
the adapter in the end. Or, if the adapter was 
short and the read length quite long, then the adapter
will be somewhere within the read (followed by other bases).

For example, assume your fragment of interest
is MYSEQUENCE and the adapter is ADAPTER. 
Depending on the read length, 
you will get reads that look like this:

```
MYSEQUEN
MYSEQUENCEADAP
MYSEQUENCEADAPTER
MYSEQUENCEADAPTERSOMETHINGELSE
```

Use cutadapt’s `-a ADAPTER` option to remove this type of adapter. This will be the result:

```
MYSEQUEN
MYSEQUENCE
MYSEQUENCE
MYSEQUENCE
```
As can be seen, cutadapt correctly deals with partial adapter matches,
and also with any trailing sequences after the adapter.
Cutadapt deals with 3’ adapters by removing the adapter itself and 
any sequence that may follow.
If the sequence starts with an adapter, like this:
```
ADAPTERSOMETHING
```
Then the sequence will be empty after trimming. 
Note that, by default, empty reads are not discarded and will appear in the output.
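
The behaviour described above can be sketched in a few lines of Python. This is a simplified illustration, not cutadapt's implementation: the function name `trim_3prime_adapter` is made up here, matches must be exact (cutadapt tolerates some mismatches), and `min_overlap=3` mirrors cutadapt's default minimum overlap.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter and anything after it (exact matches only)."""
    # Case 1: the full adapter occurs somewhere in the read.
    # Drop it together with any trailing sequence.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends partway through the adapter.
    # Look for the longest adapter prefix that is also a suffix of the read.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    # No adapter found: return the read unchanged.
    return read


for read in ["MYSEQUEN", "MYSEQUENCEADAP", "MYSEQUENCEADAPTER",
             "MYSEQUENCEADAPTERSOMETHINGELSE"]:
    print(trim_3prime_adapter(read, "ADAPTER"))
```

The loop prints the same four trimmed reads listed above, and `trim_3prime_adapter("ADAPTERSOMETHING", "ADAPTER")` returns an empty string.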


NOTE: Our adapter in the above case is not "anchored" at the end. There is a separate flag to handle that in `cutadapt`.

Here we walk through several case studies where `cutadapt`'s auto-detection alone is not sufficient.

The standard adapters are:


| Protocol |  Adapter     |
|----------|--------------|
|Illumina  | AGATCGGAAGAGC|
|Small RNA |  TGGAATTCTCGG|
|Nextera   |  CTGTCTCTTATA|
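
One simple way to picture adapter auto-detection is to count how many reads contain each standard adapter and pick the most frequent one. This is an assumption made for illustration, not a description of what any particular tool does. The sketch below does that for an in-memory list of reads; the function name `count_standard_adapters` and the toy reads are hypothetical.

```python
# Standard adapters from the table above.
STANDARD_ADAPTERS = {
    "Illumina": "AGATCGGAAGAGC",
    "Small RNA": "TGGAATTCTCGG",
    "Nextera": "CTGTCTCTTATA",
}


def count_standard_adapters(reads):
    """Count the reads that contain each standard adapter (exact match)."""
    counts = {name: 0 for name in STANDARD_ADAPTERS}
    for read in reads:
        for name, adapter in STANDARD_ADAPTERS.items():
            if adapter in read:
                counts[name] += 1
    return counts


# Made-up reads, for illustration only.
toy_reads = ["ACGTAGATCGGAAGAGCTT", "TTTTGGAATTCTCGGAA", "GGGAGATCGGAAGAGC", "ACGT"]
print(count_standard_adapters(toy_reads))
```

A count like this says nothing about non-standard adapters or barcodes, which is why the case studies below look at enriched k-mers instead.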

In [1]:
%pylab inline
%load_ext autoreload
%autoreload 2

from riboraptor.kmer import fastq_kmer_histogram

def get_top_kmers(kmer_series):
    """Return the leading kmers whose cumulative sum is <= 70,
    because we won't need more than that.
    """
    cumsum = kmer_series.cumsum()
    return kmer_series[cumsum <= 70]

Populating the interactive namespace from numpy and matplotlib


In [2]:
fastqs = {
    '17nt': '/staging/as/skchoudh/re-ribo-analysis/hg38/SRP098789/sratofastq/SRR5227288.fastq.gz',
    '17nt_post_trimming': '/staging/as/skchoudh/re-ribo-analysis/hg38/SRP098789/preprocessed/SRR5227288_trimmed.fq.gz',
    '13nt': '/home/cmb-06/as/skchoudh/dna/Dec_12_2017_Penalva_RPS5_Riboseq/Penalva_L_12112017/RPS5_C2_S2_L001_R1_001.fastq.gz',
    '13nt_post_trimming': '/home/cmb-panasas2/skchoudh/rna/Dec_12_2017_Penalva_RPS5_RNAseq_and_Riboseq/preprocessed/RPS5_C2_S2_L001_R1_001_trimmed.fq.gz',
    'ambiguous': '/staging/as/skchoudh/re-ribo-analysis/hg38/SRP031501_human_remap_v2/sratofastq/SRR1562541.fastq.gz',
    'erx': '/staging/as/wenzhenl/re-ribo-data/ERP005378/ERX432360/ERR466125.fastq',
}
histograms = {k: {} for k in fastqs}

In [None]:

for key, fastq in fastqs.items():
    histograms[key] = fastq_kmer_histogram(fastq)
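
`fastq_kmer_histogram` comes from `riboraptor` and its internals are not shown here. As a rough mental model only, the sketch below assumes that it tallies the k-mer at the 3' end of each read and reports each k-mer as a percentage of reads. The function `three_prime_kmer_histogram` and the toy reads are hypothetical and work on an in-memory list rather than a fastq file.

```python
from collections import Counter


def three_prime_kmer_histogram(reads, k):
    """Percentage of reads ending in each k-mer, most frequent first."""
    # Take the last k bases of every read that is at least k long.
    counts = Counter(read[-k:] for read in reads if len(read) >= k)
    total = sum(counts.values())
    return [(kmer, 100.0 * n / total) for kmer, n in counts.most_common()]


# Made-up reads, for illustration only.
toy_reads = ["AAACCC", "GGGCCC", "TTTCCA", "AC"]
for kmer, pct in three_prime_kmer_histogram(toy_reads, 3):
    print(kmer, round(pct, 1))
```

If reads were sequenced into the adapter, the adapter (or its leading part) dominates the 3' end, so its k-mers rise to the top of such a table.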


# A Dataset with 17nt adapter

The adapter for this dataset is 17nt long: `CTGTAGGCACCATCAAT`.
We first look at the raw fastq to see whether any sequences are enriched.


In [None]:
get_top_kmers(histograms['17nt'][17])
AGATCGGAAGAGC
CACCATCAATAGATCGG

There does seem to be enrichment of some 17nt sequences (collapsing them by Levenshtein distance would leave a single sequence accounting for >50%).
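
Collapsing by Levenshtein distance could look like the sketch below. It is one possible greedy scheme, assumed here for illustration: k-mers are visited from most to least frequent, and each one is merged into the first already-kept k-mer within `max_distance` edits. The function names, the threshold and the toy percentages are all hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def collapse_kmers(kmer_percentages, max_distance=2):
    """Greedily merge k-mers into the most frequent k-mer within max_distance."""
    collapsed = {}
    ordered = sorted(kmer_percentages.items(), key=lambda kv: kv[1], reverse=True)
    for kmer, pct in ordered:
        for representative in collapsed:
            if levenshtein(kmer, representative) <= max_distance:
                collapsed[representative] += pct
                break
        else:
            collapsed[kmer] = pct
    return collapsed


# Made-up percentages, for illustration only.
toy = {
    "CACCATCAATAGATCGG": 30.0,
    "CACCATCAATAGATCGT": 15.0,  # one substitution away
    "ACCATCAATAGATCGGA": 10.0,  # shifted by one base (two edits)
    "TTTTTTTTTTTTTTTTT": 5.0,   # unrelated
}
print(collapse_kmers(toy))
```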

In [None]:
get_top_kmers(histograms['17nt_post_trimming'][17])

This also highlights the need to trim twice. Look at the top 17nt sequence in the raw reads (before any trimming):

`CACCATCAATAGATCGG`

Its last 7 nt are the start of the standard Illumina adapter (`AGATCGGAAGAGC`), and its first 10 nt are the tail of the 17nt adapter:

CACCATCAAT**AGATCGG**

Trimming the Illumina adapter removes these partial matches, and the remaining overrepresented sequence then shows an enrichment that is not as clear without trimming.
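
The decomposition of the top k-mer can be checked programmatically. The sketch below finds the longest suffix of the k-mer that is also a prefix of the Illumina adapter, then tests whether the remainder is the tail of the 17nt adapter. The helper name `suffix_prefix_overlap` is made up for this example.

```python
def suffix_prefix_overlap(seq, adapter):
    """Length of the longest suffix of seq that is also a prefix of adapter."""
    for n in range(min(len(seq), len(adapter)), 0, -1):
        if seq.endswith(adapter[:n]):
            return n
    return 0


kmer = "CACCATCAATAGATCGG"
illumina = "AGATCGGAAGAGC"
adapter_17nt = "CTGTAGGCACCATCAAT"

n = suffix_prefix_overlap(kmer, illumina)
remainder = kmer[:len(kmer) - n]      # part left after removing the Illumina prefix
illumina_part = kmer[len(kmer) - n:]  # partial Illumina adapter at the 3' end

print(remainder, illumina_part)
print(adapter_17nt.endswith(remainder))
```

Here the overlap is 7 nt (`AGATCGG`), and the remaining `CACCATCAAT` matches the last 10 nt of the 17nt adapter.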

To confirm that the adapter is indeed 17nt long, we can look at k-mers one nucleotide longer:


In [None]:
get_top_kmers(histograms['17nt_post_trimming'][18])
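
One way to read the 18-mer table, offered here as an interpretation rather than the notebook's stated method: if the adapter is exactly 17nt, the extra leading base of each 18-mer comes from the insert and should vary, whereas a single dominant leading base would suggest a longer adapter. The sketch below sums the percentages by that leading base. The function name and the toy percentages are hypothetical.

```python
from collections import defaultdict


def upstream_base_spread(kmer_percentages, adapter):
    """Sum percentages of (k+1)-mers ending in `adapter`, grouped by the extra leading base."""
    spread = defaultdict(float)
    for kmer, pct in kmer_percentages.items():
        if len(kmer) == len(adapter) + 1 and kmer.endswith(adapter):
            spread[kmer[0]] += pct
    return dict(spread)


adapter_17nt = "CTGTAGGCACCATCAAT"
# Made-up 18-mer percentages, for illustration only.
toy_18mers = {
    "A" + adapter_17nt: 12.0,
    "C" + adapter_17nt: 11.0,
    "G" + adapter_17nt: 13.0,
    "T" + adapter_17nt: 10.0,
    "ACGTACGTACGTACGTAC": 1.0,  # does not end with the adapter
}
print(upstream_base_spread(toy_18mers, adapter_17nt))
```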

# Dataset with 13nt adapter

Let's try to do the same with a fastq where we know the adapter is 13nt long.


In [None]:
get_top_kmers(histograms['13nt'][13])


Again, the first four sequences are essentially within a Levenshtein distance of 1-2nt of each other.


In [None]:
get_top_kmers(histograms['13nt'][14])


In [None]:
get_top_kmers(histograms['13nt'][15])


In [None]:
get_top_kmers(histograms['13nt'][17])


In [None]:
get_top_kmers(histograms['13nt'][20])


In [None]:
get_top_kmers(histograms['13nt'][21])


Again, it is not clear where we should stop. Let's repeat this on the trimmed dataset:
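
Rather than inspecting one k at a time, the top k-mer for every length can be summarized in a single table. The sketch below assumes a plain dict mapping each k to a `{kmer: percentage}` dict, which stands in for the `histograms[...]` objects used in this notebook. The function name and the toy numbers are hypothetical.

```python
def top_kmer_per_length(histograms_by_k):
    """Return (k, most frequent k-mer, its percentage) for every k, in increasing k."""
    summary = []
    for k in sorted(histograms_by_k):
        kmers = histograms_by_k[k]
        if not kmers:
            continue  # nothing counted for this length
        top = max(kmers, key=kmers.get)
        summary.append((k, top, kmers[top]))
    return summary


# Made-up percentages on short toy k-mers, for illustration only.
toy_histograms = {
    3: {"AAA": 60.0, "CAA": 5.0},
    4: {"CAAA": 20.0, "GAAA": 18.0, "TAAA": 15.0},
}
for k, kmer, pct in top_kmer_per_length(toy_histograms):
    print(k, kmer, pct)
```

A sharp drop in the top percentage from one length to the next is one possible hint about where the adapter ends, though it remains a heuristic.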

In [None]:
get_top_kmers(histograms['13nt_post_trimming'][5])[0:5]


It looks good now. This dataset doesn't require a second pass at trimming!