# Outbreak surveillance

Let's now perform some outbreak surveillance.

We will use sketching to create a dendrogram based on the whole-genome sequencing data for the isolates.


For this workflow, we'll be sketching with:

* MinHash

***

We'll be using [mashtree](https://github.com/lskatz/mashtree). It uses [Mash](https://github.com/marbl/Mash) to estimate the genetic distance between each pair of isolates, based on **MinHash** estimates of Jaccard similarity, and then builds a neighbour-joining tree from those distances.
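To make the idea concrete, here is a minimal sketch of how a MinHash estimate of Jaccard similarity can be turned into a Mash-style distance, D = -(1/k) ln(2j / (1 + j)). It is an illustration only: the function names are hypothetical, `hashlib.blake2b` stands in for the hash function Mash actually uses, and canonical (strand-independent) k-mers are left out for brevity.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (assumption: stand-in for Mash's own hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k, size):
    """Bottom-s sketch: keep only the `size` smallest hash values."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate Jaccard similarity from two bottom-s sketches."""
    # Take the `size` smallest hashes of the combined sketches...
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    # ...and count how many of them are present in both sketches.
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Convert a Jaccard estimate j into a Mash-style distance for k-mer length k."""
    if j == 0:
        return 1.0
    return -math.log(2 * j / (1 + j)) / k
```

Two identical sequences give a Jaccard estimate of 1 and a distance of 0; sequences sharing no k-mers give a distance of 1. A larger sketch size gives a more precise estimate at the cost of more memory and time.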


* first, we need to interleave the read pairs:

In [None]:
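%%bash
# Interleave each pair of trimmed read files into a single file per isolate.
for i in ../data/reads/*_1-trimmed.fq.gz
do
    base="${i%_1-trimmed.fq.gz}"
    # Match only the two trimmed mates (assumed to end in _1/_2-trimmed.fq.gz),
    # so that re-running the cell does not pick up the output file as input.
    interleave-reads.py "${base}"_[12]-trimmed.fq.gz -o "${base}.fastq.gz"
done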
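If it helps to see what interleaving means, the following toy sketch alternates records from a forward and a reverse file so that each pair sits together in one stream. It is not the implementation of `interleave-reads.py`; the helper names and the example records are made up for illustration.

```python
def read_fastq(lines):
    """Group an iterable of FASTQ lines into 4-line records (name, seq, '+', qual)."""
    it = iter(lines)
    # zip over the same iterator four times consumes the lines in groups of four.
    return list(zip(it, it, it, it))


def interleave(forward, reverse):
    """Return records in the order: pair 1 read 1, pair 1 read 2, pair 2 read 1, ..."""
    if len(forward) != len(reverse):
        raise ValueError("forward and reverse files must hold the same number of reads")
    interleaved = []
    for r1, r2 in zip(forward, reverse):
        interleaved.append(r1)
        interleaved.append(r2)
    return interleaved


# Hypothetical toy data: two read pairs.
fwd = read_fastq(["@pair1/1", "ACGT", "+", "IIII", "@pair2/1", "GGCC", "+", "IIII"])
rev = read_fastq(["@pair1/2", "TTAA", "+", "IIII", "@pair2/2", "CCGG", "+", "IIII"])
names = [record[0] for record in interleave(fwd, rev)]
```

Here `names` lists the read names in interleaved order, with the two mates of each pair adjacent.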

* run mashtree to get distances and create a newick tree:

In [None]:
!mashtree.pl --mindepth 5 --sketch-size 10000 --kmerlength 31 ../data/reads/*fastq.gz ../data/GCF_000025565.1_ASM2556v1_genomic.fna.gz > mashtree.dnd
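The `--mindepth 5` setting is there because we are sketching raw reads: as we understand it, k-mers seen fewer than five times are treated as likely sequencing errors and left out of the sketch. The snippet below is a conceptual sketch of that filtering step only, with hypothetical function names and toy reads; Mash and mashtree implement it differently and far more efficiently.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(reads, k, min_depth):
    """Keep only k-mers observed at least `min_depth` times."""
    return {km for km, n in count_kmers(reads, k).items() if n >= min_depth}


# Toy example: five error-free copies of a read and one copy with an error in the last base.
reads = ["ACGTA"] * 5 + ["ACGTT"]
kept = solid_kmers(reads, k=5, min_depth=5)
```

In this toy case only the k-mer from the error-free reads survives the filter; the k-mer carrying the error is seen once and is dropped. Assembled genomes, such as the reference included in the command above, do not need this kind of filter because every k-mer appears at its true copy number.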


* now we have a tree, let's check it out:

In [None]:
from ete3 import Tree

# Load a tree structure from a newick file.
tree = Tree("mashtree.dnd")
# Root the tree on the reference genome.
tree.set_outgroup("GCF_000025565.1_ASM2556v1_genomic")
print(tree)

* compare this to the tree in the [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4001082), built using SNPs from whole genome sequence alignments:

![](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4001082/bin/emss-58061-f0001.jpg)

* and here are the isolate names -> ENA experiment again, for your reference:

|isolate name|patient ID|ENA sample|ENA experiment|
|-|-|-|-|
|EC1a|EC1|ERS184249|ERX168346|
|EC2a|EC2|ERS184250|ERX168347|
|EC2b|EC2|ERS184251|ERX168341|
|EC3a|EC3|ERS184252|ERX168340|
|EC4a|EC4|ERS184245|ERX168345|
|EC5a|EC5|ERS184246|ERX168339|
|EC6a|EC6|ERS184247|ERX168343|
|EC7a|EC7|ERS184248|ERX168344|
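
If you would rather read isolate names than accessions on the tree, the table above can be turned into a lookup and applied to the Newick text. This is a small sketch that assumes each leaf label contains its ENA experiment accession (check your own `mashtree.dnd` to confirm); the `relabel_newick` helper and the toy tree are hypothetical.

```python
import re

# ENA experiment accession -> isolate name, copied from the table above.
ISOLATES = {
    "ERX168346": "EC1a",
    "ERX168347": "EC2a",
    "ERX168341": "EC2b",
    "ERX168340": "EC3a",
    "ERX168345": "EC4a",
    "ERX168339": "EC5a",
    "ERX168343": "EC6a",
    "ERX168344": "EC7a",
}


def relabel_newick(newick, names=ISOLATES):
    """Replace every known ERX accession in a Newick string with its isolate name."""
    # Unknown accessions are left untouched.
    return re.sub(r"ERX\d+", lambda m: names.get(m.group(0), m.group(0)), newick)


# Toy tree with made-up branch lengths, for illustration only.
toy = "(ERX168346:0.1,(ERX168347:0.2,ERX168341:0.2):0.05);"
relabelled = relabel_newick(toy)
```

The relabelled string can be passed straight to `Tree(...)`, since ete3 accepts Newick text as well as a file name.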

The paper identified the outbreak cluster as EC1a, EC2a, EC2b and EC3a. Our sketch-based tree recovers the same cluster (the bottom clade). Here's our tree one more time:

In [None]:
print(tree)

***

Let's move on to the next workflow in our outbreak analysis: [data mining](r4.4.Data-mining.ipynb)