# Resistome profiling

One of the main concerns during the suspected outbreak of *E. cloacae* was the possibility of antimicrobial resistance (AMR), particularly carbapenem resistance.

For this step in the workflow, we'll be sketching with:

* MinHash
* Minimizers

***

## Taking a quick look with MinHash

To quickly tell if there are any known AMR genes in these samples, we can use [GROOT](https://github.com/will-rowe/groot). GROOT is designed for metagenomes but does the job for isolates too. It works by building variation graphs for clusters of genes, then indexing the traversals in each graph using **MinHash** sketches.
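To make the idea of a MinHash sketch concrete, here is a minimal sketch in Python. It is not GROOT's implementation: the hash function, the bottom-s scheme and the function names are assumptions chosen for illustration, and it ignores reverse complements.

```python
import hashlib

def kmers(seq, k):
    """Yield every k-mer in a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (the hash choice is illustrative)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_sketch(seq, k=21, s=24):
    """Bottom-s MinHash: keep the s smallest distinct k-mer hashes."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:s]

def jaccard_estimate(sketch_a, sketch_b, s=24):
    """Estimate the Jaccard similarity of two sequences from their sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)

# two toy sequences that differ only at the last base
seq_a = "ATGCGTACGTTAGCCGATAGGCTAACGTTAGCATCGGATC"
seq_b = "ATGCGTACGTTAGCCGATAGGCTAACGTTAGCATCGGATG"
sketch_a = minhash_sketch(seq_a, k=7, s=8)
sketch_b = minhash_sketch(seq_b, k=7, s=8)
print(jaccard_estimate(sketch_a, sketch_b, s=8))
```

The point is that two sequences can be compared using only a handful of integers each, which is what makes it feasible to check reads against a database as they arrive.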

We're going to see how you can sketch reads as you download them. This means you could keep only the samples that contain the genes you are interested in. GROOT also has read QC and trimming built in, so bad reads will be handled.

* to begin, download a reference AMR database and index it:

In [167]:
# download the ARG-ANNOT database
!groot get -d arg-annot

# index the database
!groot index -i ./arg-annot.90 -l 150 -k 21 -s 24 -o arg-annot-index

downloading the pre-clustered arg-annot database...
unpacking...
could not save db to specified directory


> `-l` specifies the window length to sketch in the graph, which should be similar to the read length

> `-k` specifies the k-mer length and `-s` specifies the sketch size

> GROOT is quicker if we use multiple CPUs to index or sketch (set with the groot `--processors` flag)
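To see how the three indexing parameters interact, here is a rough sketch that windows a linear gene sequence. GROOT works on graph traversals rather than a single linear sequence, so treat this only as an illustration; the function name, the `step` parameter and the CRC32 hash are assumptions.

```python
import zlib

def window_sketches(seq, l=150, k=21, s=24, step=1):
    """Sketch each window of length l, keeping the s smallest k-mer hashes.

    Returns a list of (window_start, sketch) tuples.
    """
    sketches = []
    for start in range(0, len(seq) - l + 1, step):
        window = seq[start:start + l]
        # every window of length l contains l - k + 1 k-mers
        hashes = {zlib.crc32(window[i:i + k].encode()) for i in range(l - k + 1)}
        sketches.append((start, sorted(hashes)[:s]))
    return sketches

# a toy 864 base "gene" (same length as the SHV genes in ARG-ANNOT)
gene = "ACGT" * 216
print(len(window_sketches(gene)), "windows of 150 bases")
```

Because a read is sketched as a single window, matching `-l` to the read length means a read sketch and a window sketch summarise the same number of k-mers, so they can be compared like for like.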

* now we stream the data from the ENA; as reads arrive we align them to the reference graphs

In [169]:
# stream the reads and align them to the graphs
!fastq-dump ERX168346 -Z --split-files | groot align -i arg-annot-index -o ERX168346-graphs > ERX168346.bam


Read 1048216 spots for ERX168346
Written 1048216 spots for ERX168346


> we use `fastq-dump` to stream the reads into the groot command as they download (the `-Z` flag writes them to standard output and the pipe `|` passes them on)

> the align subcommand produces a BAM file containing all graph traversals for each read. Each BAM file essentially contains the ARG-derived reads.

> the GFA variation graphs that had reads align to them are also kept and can be viewed in Bandage etc.
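Streaming works because FASTQ can be consumed one record at a time, so nothing needs to be held on disk. Below is a minimal sketch of a streaming parser, assuming plain four-line FASTQ records; the function name is hypothetical and this is not how GROOT itself reads its input.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality) one record at a time from a text stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of the stream
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual

# an in-memory stream stands in for the pipe here;
# passing sys.stdin instead would read from a real pipe
stream = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n")
for name, seq, qual in read_fastq(stream):
    print(name, len(seq))
```

Each record can be sketched and then discarded as soon as it has been yielded, which is what lets you decide whether a sample is worth keeping while it is still downloading.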

* now, report what AMR genes are present in this sample:

In [175]:
!groot report -i ERX168346.bam --lowCov


argannot~~~(Bla)SHV-183~~~HG934764:1-864	87	864	3D849M12D
argannot~~~(Sul)SulI~~~AF071413:6700-7539	223	840	6D834M
argannot~~~(Tmt)DfrA1~~~JQ794607:474	377	474	5D469M
argannot~~~(Bla)SHV-12~~~FJ685654:24-860	86	861	3D846M12D
argannot~~~(Sul)SulII~~~EU360945:1617-2432	123	816	816M
argannot~~~(AGly)Sat-2A~~~X51546:518-1042	331	525	3D522M


> the `--lowCov` flag is used because we are running GROOT on isolates, not metagenomes. It lets GROOT report genes that have a few uncovered bases at their ends, which usually happens because there are not enough reads to completely span the gene (without the flag, partially covered genes aren't counted by GROOT).
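If you want to work with the report programmatically, the lines above can be parsed with a few lines of Python. This sketch assumes the four tab-separated columns are the gene name, the number of reads, the gene length and a coverage CIGAR in which `M` marks covered bases and `D` marks uncovered ones; that is my reading of the output, so check it against the GROOT documentation. The function name is hypothetical.

```python
import re

# the coverage strings in the report above only use M and D operations
CIGAR_OP = re.compile(r"(\d+)([MD])")

def parse_report_line(line):
    """Split one report line into its fields and work out the covered fraction."""
    gene, read_count, gene_length, cigar = line.rstrip("\n").split("\t")
    covered = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op == "M")
    return {
        "gene": gene.split("~~~")[1],  # e.g. "(Bla)SHV-12"
        "reads": int(read_count),
        "length": int(gene_length),
        "covered_fraction": covered / int(gene_length),
    }

line = "argannot~~~(Bla)SHV-12~~~FJ685654:24-860\t86\t861\t3D846M12D"
record = parse_report_line(line)
print(record["gene"], record["reads"], round(record["covered_fraction"], 3))
```

For *blaSHV-12* this gives 846 covered bases out of 861, with the uncovered bases sitting at the two ends of the gene.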

This result tells us that our isolates contain AMR genes and this warrants further inspection. Good job we have already QC'd the data and have this waiting for us.

The [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4001082/) that describes this outbreak was particularly interested in beta-lactam resistance. The authors found *blaSHV-12*, *blaIMP-1* and *blaTEM-1* genes in **isolate EC1a** (ERX168346). This was the only *E. cloacae* isolate in which they found these genes, and the only one for which they did any phenotypic testing.

We managed to find *blaSHV-12* in our MinHash search but not *blaIMP-1* and *blaTEM-1*. If we look at the BAM file, we do find some reads matching these two genes but we didn't get enough coverage to call the genes. We also used the raw data. Let's now try using the cleaned data and doing a full read-alignment.


## Mapping reads with MiniMap2

The quick look with GROOT told us that these samples contain AMR genes. Let's say that we decided to keep these downloaded samples on disk and then quality checked them using our previous workflow.

We will use [MiniMap2](https://github.com/lh3/minimap2) to map our quality checked reads to the reference AMR genes, which we will index with **minimizers**.
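Minimizers are another way of subsampling k-mers: from every run of `w` consecutive k-mers, only the one with the smallest hash is kept. Here is a minimal sketch of the idea. The hash function and function name are assumptions, and the real tool uses its own hash and handles both strands, so this will not reproduce its index.

```python
import zlib

def minimizers(seq, k=21, w=11):
    """Return the sorted (position, k-mer) minimizers of a sequence.

    For each window of w consecutive k-mers, the k-mer with the smallest
    hash is selected (ties go to the leftmost position).
    """
    hashed = [(zlib.crc32(seq[i:i + k].encode()), i) for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(hashed) - w + 1):
        _, pos = min(hashed[start:start + w])
        selected.add((pos, seq[pos:pos + k]))
    return sorted(selected)

seq = "ACGGTCATTGCAAGCTTAGC" * 10
mins = minimizers(seq)
print(len(mins), "minimizers from", len(seq) - 21 + 1, "k-mers")
```

Neighbouring windows usually share the same minimizer, so far fewer k-mers are stored than exist in the sequence, while every window is still guaranteed to be represented. For random sequence, roughly 2 in every `w + 1` k-mers are expected to be selected, which corresponds to a spacing of about 6 when `w` is 11.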

* get the fasta sequences for the AMR genes in the ARG-annot database:

In [176]:
!wget https://raw.githubusercontent.com/will-rowe/groot/master/db/full-ARG-databases/arg-annot-db/argannot-args.fna

--2019-04-09 13:57:06--  https://raw.githubusercontent.com/will-rowe/groot/master/db/full-ARG-databases/arg-annot-db/argannot-args.fna
Resolving raw.githubusercontent.com... 151.101.16.133
Connecting to raw.githubusercontent.com|151.101.16.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1651832 (1.6M) [text/plain]
Saving to: 'argannot-args.fna.2'


2019-04-09 13:57:07 (4.17 MB/s) - 'argannot-args.fna.2' saved [1651832/1651832]



* index the AMR genes and align our quality trimmed reads:

In [222]:
# create the index
!minimap2 -x sr -d argannot-args.mmi argannot-args.fna
# align the reads and output as a bam file
!minimap2 -x sr -t 1 -a argannot-args.mmi ./*cor.fq.gz | samtools view -b -o ERX168346-minimap2.bam

[M::mm_idx_gen::0.075*1.13] collected minimizers
[M::mm_idx_gen::0.093*1.48] sorted minimizers
[M::main::0.113*1.39] loaded/built the index for 1749 target sequence(s)
[M::mm_idx_stat] kmer size: 21; skip: 11; is_hpc: 0; #seq: 1749
[M::mm_idx_stat::0.115*1.39] distinct minimizers: 97399 (79.09% are singletons); average occurrences: 2.551; average spacing: 6.156
[M::main] Version: 2.16-r922
[M::main] CMD: minimap2 -x sr -d argannot-args.mmi argannot-args.fna
[M::main] Real time: 0.132 sec; CPU: 0.173 sec; Peak RSS: 0.015 GB
[M::main::0.029*1.23] loaded/built the index for 1749 target sequence(s)
[M::mm_mapopt_update::0.029*1.23] mid_occ = 1000
[M::mm_idx_stat] kmer size: 21; skip: 11; is_hpc: 0; #seq: 1749
[M::mm_idx_stat::0.031*1.22] distinct minimizers: 97399 (79.09% are singletons); average occurrences: 2.551; average spacing: 6.156
[M::worker_pipeline::3.529*0.62] mapped 251286 sequences
[M::main] Version: 2.16-r922
[M::main] CMD: minimap2 -x sr -t 1 -a argannot-args.mmi ./ERX168346

* sort, index and read in the alignment:

In [223]:
import pysam
pysam.sort("-o", "ERX168346-minimap2.bam", "ERX168346-minimap2.bam")
pysam.index("ERX168346-minimap2.bam")
# the "rb" mode tells pysam that we are reading in a bam file
samfile = pysam.AlignmentFile("ERX168346-minimap2.bam", "rb")

* let's look at all the AMR genes that were at least 95% covered:

In [224]:
# check each AMR gene in the database
for ref in samfile.header['SQ']:
    name=ref['SN']
    length=ref['LN']
    # create a pileup for the reference AMR gene
    pileup=samfile.pileup(name)
    # count the positions in this gene that have at least one read aligned
    coveredBases=0
    for column in pileup:
        coveredBases+=1

    # if >95% of the AMR gene had reads align, print the name of the gene and its coverage
    coverage=(coveredBases/length)*100
    if coverage > 95:
        print("{} is {}% covered by reads".format(name, coverage))

# close the alignment file
samfile.close()

argannot~~~(AGly)Sat-2A~~~X51546:518-1042 is 100.0% covered by reads
argannot~~~(Sul)SulI~~~AF071413:6700-7539 is 99.76190476190476% covered by reads
argannot~~~(Sul)SulII~~~EU360945:1617-2432 is 98.52941176470588% covered by reads
argannot~~~(Tmt)DfrA1~~~JQ794607:474 is 100.0% covered by reads


Still no carbapenemase genes! And some of our other AMR genes which GROOT found are no longer showing up. We might have trimmed our reads too aggressively, or the gene causing the phenotypic carbapenem resistance in the EC1a isolate might not be in our database.
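The percentage we calculated is the breadth of coverage: the share of positions in a gene that have at least one read aligned. The same number can be approximated without a pileup by merging the aligned intervals of the reads. This is a minimal sketch with a hypothetical function name; with pysam, the intervals could come from something like `read.get_blocks()`, although small differences from the pileup count are possible (for example at deletions).

```python
def breadth_of_coverage(intervals, gene_length):
    """Percentage of a gene covered by at least one aligned interval.

    intervals: iterable of (start, end) pairs, 0-based and half-open.
    """
    covered = 0
    current_end = 0  # everything before this position is already counted
    for start, end in sorted(intervals):
        start = max(start, current_end)  # skip bases already counted
        end = min(end, gene_length)      # ignore anything past the gene end
        if end > start:
            covered += end - start
            current_end = end
    return 100 * covered / gene_length

# three toy alignments on a 400 base gene: two overlap, one stands alone
print(breadth_of_coverage([(0, 100), (50, 150), (200, 300)], 400))
```

Overlapping reads only count once, which is why a gene can have plenty of reads aligned and still fall below the 95% threshold.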

* let's try again but this time use the raw data:

In [226]:
!fastq-dump ERX168346 -Z --split-files |  minimap2 -x sr -t 1 -a argannot-args.mmi - | samtools view -b -o ERX168346-minimap2.bam


[M::main::0.051*0.89] loaded/built the index for 1749 target sequence(s)
[M::mm_mapopt_update::0.052*0.89] mid_occ = 1000
[M::mm_idx_stat] kmer size: 21; skip: 11; is_hpc: 0; #seq: 1749
[M::mm_idx_stat::0.054*0.90] distinct minimizers: 97399 (79.09% are singletons); average occurrences: 2.551; average spacing: 6.156
[M::worker_pipeline::16.511*0.18] mapped 333334 sequences
[M::worker_pipeline::24.148*0.25] mapped 333334 sequences
[M::worker_pipeline::31.599*0.28] mapped 333334 sequences
[M::worker_pipeline::39.132*0.31] mapped 333334 sequences
[M::worker_pipeline::46.951*0.32] mapped 333334 sequences
[M::worker_pipeline::54.963*0.33] mapped 333334 sequences
Read 1048216 spots for ERX168346
Written 1048216 spots for ERX168346
[M::worker_pipeline::57.262*0.34] mapped 96428 sequences
[M::main] Version: 2.16-r922
[M::main] CMD: minimap2 -x sr -t 1 -a argannot-args.mmi -
[M::main] Real time: 57.274 sec; CPU: 19.316 sec; Peak RSS: 0.196 GB


* now process the alignment:

In [227]:
pysam.sort("-o", "ERX168346-minimap2.bam", "ERX168346-minimap2.bam")
pysam.index("ERX168346-minimap2.bam")
samfile = pysam.AlignmentFile("ERX168346-minimap2.bam", "rb")
for ref in samfile.header['SQ']:
    name=ref['SN']
    length=ref['LN']
    # create a pileup for the reference AMR gene
    pileup=samfile.pileup(name)
    # counter for the covered bases
    coveredBases=0
    # iterate over each covered position in this gene and increment the counter
    for column in pileup:
        # at least one read covers this base in the reference
        coveredBases+=1

    # if >99% of the AMR gene had reads align, print the name of the gene and its coverage
    coverage=(coveredBases/length)*100
    if coverage > 99:
        print("{} is {}% covered by reads".format(name, coverage))

argannot~~~(AGly)Aac6-Ib~~~M21682:380-985 is 100.0% covered by reads
argannot~~~(AGly)AacA4~~~AF416297:2738-3304 is 99.64726631393297% covered by reads
argannot~~~(AGly)Sat-2A~~~X51546:518-1042 is 100.0% covered by reads
argannot~~~(Bla)IMP-39~~~D50438:1195-1935 is 100.0% covered by reads
argannot~~~(Bla)OXA-9~~~NC_015515:451-1290 is 100.0% covered by reads
argannot~~~(Flq)Qnr-A1~~~AY070235:303-959 is 100.0% covered by reads
argannot~~~(Phe)CatA2~~~X53796:187-903 is 100.0% covered by reads
argannot~~~(Sul)SulI~~~AF071413:6700-7539 is 100.0% covered by reads
argannot~~~(Sul)SulII~~~EU360945:1617-2432 is 100.0% covered by reads
argannot~~~(Tmt)DfrA1~~~JQ794607:474 is 100.0% covered by reads
argannot~~~(Tmt)DfrA14~~~GU726917:72-545 is 100.0% covered by reads


So we do have a carbapenemase gene (*blaIMP*) in our sample - why are we only finding this now?!

* look at the alignments for *blaIMP-39*:

In [228]:
# the fetch method creates an iterator over the reads in the region we give
reads = samfile.fetch("argannot~~~(Bla)IMP-39~~~D50438:1195-1935")
# we can loop over the reads aligned to the specified region
for read in reads:
    print(read.cigarstring)
samfile.close()

61S89M
28S122M
58S92M
82H68M
150M
150M
150M
150M
150M
150M
150M
150M
150M
10S140M
150M
150M
150M
150M
150M
150M
150M
150M
150M
101M49S
49S101M
99M51S
51S99M
150M
150M
150M
150M
150M
150M
None
150M
150M
116M34S
114M36S
67M83S
62M88S
None
56S57M37S
44M106S


That's a lot of clipping (S and H in the CIGAR strings)!

The reasons we haven't seen this carbapenemase gene so far are probably because:

* GROOT is quite stringent - the default settings we used don't allow that much clipping in the read alignment, so these reads will have been dropped after mapping (seeding)

* the trimmed reads we used with MiniMap2 will have had these clipped regions removed, or the whole read dropped, so MiniMap2 wouldn't have tried mapping them

* MiniMap2 is good at aligning reads with errors in them, so using the raw data and then checking each gene we are interested in turns out to be a good approach.
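To put a number on the clipping, we can parse the CIGAR strings and work out what fraction of each read was clipped. This is a minimal sketch with a hypothetical function name; it treats both soft (`S`) and hard (`H`) clipped bases as part of the original read and returns `None` for reads without a CIGAR string.

```python
import re

# all CIGAR operations defined in the SAM format
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def clipped_fraction(cigar):
    """Fraction of a read's bases that were soft- or hard-clipped."""
    if cigar is None:
        return None  # no alignment recorded for this read
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    clipped = sum(n for n, op in ops if op in "SH")
    # operations that consume read bases, plus the hard-clipped bases
    read_length = sum(n for n, op in ops if op in "MIS=XH")
    return clipped / read_length

# a few of the CIGAR strings from the blaIMP-39 alignments
for cigar in ["150M", "61S89M", "82H68M", "56S57M37S", None]:
    print(cigar, clipped_fraction(cigar))
```

A read such as `56S57M37S` has well over half of its bases clipped, which is the kind of alignment that stringent settings or upstream trimming would discard.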

To check this theory, just look in the GROOT alignment from earlier. This BAM file actually contains all the MinHash sketch matches for the reads against the AMR genes. You can count the number of reads that mapped to *blaIMP-39*, and it's roughly the same as in the MiniMap2 alignment - which is to be expected as they are both using similar sketches to seed the reads.

In terms of our workflow for resistome profiling - I'm not sure how confident we would be in calling this exact gene as being present, but there is definitely some form of beta-lactamase gene there!

As stated earlier, the sequencing data for these isolates isn't great. Our QC actually gets rid of the reads that result in this AMR gene being called.