# Identifying contamination
It is always a good idea to check that your data come from the species you expect. A very useful tool for this is [Kraken](https://ccb.jhu.edu/software/kraken/), which assigns a taxonomic label to each read. In this tutorial we will go through how you can use Kraken to check your samples for contamination.

__Note if using the Sanger cluster:__ Kraken is run as part of the automatic QC pipeline and you can retrieve the results using the `pf qc` script. For more information, run `pf man qc`.

## Setting up a database
To run Kraken you need to either build a database or download an existing one. The standard database is very large (33 GB), but thankfully there are some smaller, pre-built databases available. To download the smallest of them, the 4 GB MiniKraken, run:

`wget https://ccb.jhu.edu/software/kraken/dl/minikraken_20171019_4GB.tgz`

Then all you need to do is un-tar it:

`tar -zxvf minikraken_20171019_4GB.tgz`

If the pre-packaged databases are not quite what you are looking for, you can create your own customized database instead. Details about this can be found [here](http://ccb.jhu.edu/software/kraken/MANUAL.html#custom-databases).

__Note if using the Sanger cluster:__ There are several pre-built databases available centrally on the Sanger cluster. For more information, please contact the Pathogen Informatics team.

## Running Kraken
To run Kraken, you need to provide the path to the database you just downloaded or built. By default, Kraken assumes the input files are in FASTA format, so in this case we also need to tell it that our input files are in FASTQ format, gzipped, and that they are paired-end reads:

`kraken --db /path/to/minikraken_20171019_4GB --output kraken_results --fastq-input --gzip-compressed --paired data/13681_1#18_1.fastq.gz data/13681_1#18_2.fastq.gz`

The kraken_results file that is produced will look something like:

```
C       HS38_13681:1:1101:1200:79237#18 1313    201     0:26 1313:1 0:18 1313:1 0:24 A:31 0:32 1301:2 0:13 1313:1 0:14 1313:1 0:7
C       HS38_13681:1:1101:1203:67902#18 1313    201     0:2 1313:1 0:7 1313:1 0:48 1313:1 0:9 1313:1 A:31 0:4 1313:1 0:7 1301:1 0:25 1313:1 0:5 1313:1 0:7 1313:1 0:15 1313:1 0:1
C       HS38_13681:1:1101:1203:95955#18 1313    201     0:27 1301:1 0:7 1301:1 0:17 1301:1 0:16 A:31 0:4 1313:1 0:28 1301:1 0:19 1301:1 0:16
C       HS38_13681:1:1101:1207:84487#18 1313    201     0:62 1313:1 0:7 A:31 0:3 1301:1 0:29 1313:1 0:6 1313:1 0:12 1313:1 0:16
C       HS38_13681:1:1101:1207:91454#18 1313    201     0:42 1301:1 0:27 A:31 0:16 1301:1 0:26 1313:1 0:26
C       HS38_13681:1:1101:1208:26781#18 1313    201     0:46 1313:1 0:23 A:31 0:15 1313:1 0:1 1313:1 0:6 1313:1 0:24 1313:1 0:20
C       HS38_13681:1:1101:1208:76534#18 1313    201     1313:1 0:4 1313:1 0:1 1313:1 0:62 A:31 0:19 1313:2 0:43 1313:1 0:5
C       HS38_13681:1:1101:1209:37708#18 1313    201     0:6 1313:1 0:18 1313:1 0:44 A:31 1301:1 0:28 1313:1 0:40
C       HS38_13681:1:1101:1211:90237#18 1313    201     0:1 1313:1 0:7 1313:1 0:4 1301:1 0:9 1301:1 0:4 1301:1 0:5 1301:1 0:34 A:31 0:15 1313:1 0:18 1301:1 0:28 1301:1 0:6
C       HS38_13681:1:1101:1213:82376#18 1313    201     0:7 1313:1 0:10 1313:1 0:10 1313:1 0:1 1313:1 0:35 1313:2 0:1 A:31 0:55 1301:1 0:14
C       HS38_13681:1:1101:1218:39426#18 1313    201     0:25 1301:1 0:44 A:31 0:8 1301:1 0:7 1301:1 0:1 1313:1 0:12 1313:1 0:5 1313:1 0:32
C       HS38_13681:1:1101:1218:62545#18 1313    201     1313:1 0:3 1313:1 0:41 1300:1 0:18 1300:1 0:3 1300:1 A:31 0:30 1301:1 0:17 1313:1 0:16 1301:1 0:4
U       HS38_13681:1:1101:1218:66374#18 0       201     0:70 A:31 0:70
C       HS38_13681:1:1101:1220:51648#18 1313    201     1301:2 0:6 1301:1 0:6 1301:1 0:10 1301:1 0:2 1301:1 0:6 1301:1 0:12 1301:1 0:20 A:31 0:31 1313:1 0:31 1313:1 0:6
C       HS38_13681:1:1101:1222:54548#18 1313    201     0:26 1313:1 0:43 A:31 1313:1 0:7 1313:1 0:8 1313:1 0:6 1301:1 0:4 1301:1 0:40
...
```
According to the [Kraken manual](http://ccb.jhu.edu/software/kraken/MANUAL.html), the five columns in this file are:

1. "C"/"U": one letter code indicating that the sequence was either classified or unclassified.
2. The sequence ID, obtained from the FASTA/FASTQ header.
3. The taxonomy ID Kraken used to label the sequence; this is 0 if the sequence is unclassified.
4. The length of the sequence in bp.
5. A space-delimited list indicating the LCA mapping of each k-mer in the sequence.
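
The column layout above makes it easy to get a quick tally straight from the per-read file. The following is a minimal sketch, not part of Kraken itself: the function name `summarise_kraken_output` is made up for illustration, and it assumes the first four columns contain no internal whitespace, as in the example output shown earlier.

```shell
# Hypothetical helper: summarise a Kraken per-read output file.
# Prints the number of classified and unclassified reads, followed by
# "count<TAB>taxonomy ID" lines with the most frequent ID first.
summarise_kraken_output() {
    results="$1"

    # Column 1 is "C" (classified) or "U" (unclassified).
    awk '
        $1 == "C" { c++ }
        $1 == "U" { u++ }
        END { printf "classified\t%d\nunclassified\t%d\n", c, u }
    ' "$results"

    # Column 3 is the taxonomy ID assigned to the read (0 if unclassified).
    awk '
        { n[$3]++ }
        END { for (id in n) printf "%d\t%s\n", n[id], id }
    ' "$results" | sort -k1,1nr -k2,2n
}
```

It could be called as `summarise_kraken_output kraken_results`. The taxonomy IDs are only numbers at this point, so a report is still needed to see names.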

To make this a bit clearer you can create a kraken report:

`kraken-report --db /path/to/minikraken_20171019_4GB --print_header kraken_results > kraken-report`

## Looking at the results
Let's have a closer look at the `kraken-report` file for the sample:

`head -n 20 data/kraken-report`

#Kraken version: kraken-0.10.6-a2d113dc8f
#Database: /lustre/scratch118/infgen/pathogen/pathpipe/kraken/minikraken_20140330
  1.33	17491	17491	U	0	unclassified
 98.67	1296036	14289	-	1	root
 97.58	1281687	0	-	131567	  cellular organisms
 97.58	1281687	424	D	2	    Bacteria
 97.53	1281086	342	P	1239	      Firmicutes
 97.50	1280744	287	C	91061	        Bacilli
 97.48	1280454	644	O	186826	          Lactobacillales
 97.43	1279808	127	F	1300	            Streptococcaceae
 97.42	1279681	144729	G	1301	              Streptococcus
 86.17	1131862	1076308	S	1313	                Streptococcus pneumoniae
  1.30	17075	17075	-	561276	                  Streptococcus pneumoniae ATCC 700669
  1.25	16375	16375	-	488222	                  Streptococcus pneumoniae JJA
  0.22	2838	2838	-	516950	                  Streptococcus pneumoniae CGSP14
  0.21	2818	2818	-	170187	                  Streptococcus pneumoniae TIGR4
  0.20	2602	2602	-	1130804	                  Streptococcus pneumoniae ST556
  0.16	2100	2100	-	

According to the [Kraken manual](http://ccb.jhu.edu/software/kraken/MANUAL.html), the six columns in this file are:

1. Percentage of reads covered by the clade rooted at this taxon
2. Number of reads covered by the clade rooted at this taxon
3. Number of reads assigned directly to this taxon
4. A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply '-'.
5. NCBI taxonomy ID
6. Scientific name
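
When a report is long, it can help to pull out only the species-level rows that hold a noticeable share of the reads. The sketch below is one possible way to do that with `awk`. It assumes the report is tab-separated as in the excerpt above; the function name `species_above` and the default threshold of 1% are arbitrary choices for illustration, not Kraken conventions.

```shell
# Hypothetical helper: list species-level rows (rank code "S") of a Kraken
# report whose clade holds at least a given percentage of reads.
# Output: percentage, NCBI taxonomy ID and scientific name, highest first.
species_above() {
    report="$1"
    min_pct="${2:-1}"   # threshold in percent; 1 is an arbitrary default

    awk -F '\t' -v min="$min_pct" '
        /^#/ { next }                      # skip "#" header lines
        $4 == "S" && $1 + 0 >= min + 0 {
            name = $6
            sub(/^ +/, "", name)           # drop the indentation before the name
            printf "%.2f\t%s\t%s\n", $1, $5, name
        }
    ' "$report" | sort -t "$(printf '\t')" -k1,1nr
}
```

For example, `species_above data/kraken-report 0.5` would list every species whose clade accounts for at least 0.5% of the reads. More than one unrelated species with a substantial share is a hint worth following up, although what counts as substantial depends on the sample and the database.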

## Exercises
__Q1: What is the most prevalent species in this sample?__

__Q2: Are there clear signs of contamination?__

__Q3: What percentage of reads could not be classified?__  

Congratulations! You have reached the end of this tutorial. You can find the answers to all the questions of the tutorial [here](contamination-answers.ipynb).  
To revisit the previous section [click here](assessment.ipynb). Alternatively, you can head back to the [index page](index.ipynb).