# Identifying contamination
It is always a good idea to check that your data comes from the species you expect. A very useful tool for this is [Kraken](https://ccb.jhu.edu/software/kraken/). In this tutorial we will go through how you can use Kraken to check your samples for contamination. If you have Kraken installed, feel free to follow the rest of this tutorial on the command line. Don't worry if you do not, as it is not strictly necessary for this tutorial.

__Note if using the Sanger cluster:__ If you have access to the Sanger cluster, Kraken is already centrally installed. Kraken is also run as part of the automatic QC pipeline and you can retrieve the results using the `pf qc` script. For more information, run `pf man qc`.

## Setting up a database
To run Kraken you need to either build a database or download an existing one. The standard database is large (33 GB) and probably overkill for running basic QC checks. Thankfully, there are some pre-built databases available. Here we are going to use the smallest of them, the 4 GB MiniKraken. To download this, run:

`wget https://ccb.jhu.edu/software/kraken/dl/minikraken_20171019_4GB.tgz`

Then all you need to do is un-tar it:

`tar -zxvf minikraken_20171019_4GB.tgz`
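Before running anything, it can be worth checking that the archive unpacked completely. The function below is a minimal sketch and not part of Kraken: `check_kraken_db` is a hypothetical helper name, and the file list assumes the layout the Kraken manual describes for a built database (`database.kdb`, `database.idx`, `taxonomy/nodes.dmp` and `taxonomy/names.dmp`).

```shell
# Hypothetical helper (not shipped with Kraken): check that a directory
# contains the files a Kraken 1 database is assumed to need.
check_kraken_db() {
    db_dir="$1"
    missing=0
    for f in database.kdb database.idx taxonomy/nodes.dmp taxonomy/names.dmp; do
        # -s is true only if the file exists and is not empty
        if [ ! -s "$db_dir/$f" ]; then
            echo "missing or empty: $db_dir/$f" >&2
            missing=1
        fi
    done
    # Return 0 if everything was found, 1 otherwise
    return "$missing"
}
```

For example, `check_kraken_db minikraken_20171019_4GB && echo "database looks complete"` prints the message only if all four files are present and non-empty.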

If the pre-packaged databases are not quite what you are looking for, you can create your own customized database instead. Details about this can be found [here](http://ccb.jhu.edu/software/kraken/MANUAL.html#custom-databases).

__Note if using the Sanger cluster:__ There are several pre-built databases available centrally on the Sanger cluster. For more information, please contact the Pathogen Informatics team.

## Running Kraken
To run Kraken, you need to provide it with the path to the database you just downloaded. By default, the input files are assumed to be in FASTA format, so in this case we also need to tell Kraken that our input files are in FASTQ format and that they are paired-end reads:

`kraken --db /path/to/minikraken_20171019_4GB --output kraken_results --fastq-input --paired s_7_1.fastq s_7_2.fastq`

or, if the `KRAKEN_DEFAULT_DB` environment variable is set to the path of your database:

`kraken --output kraken_results --fastq-input --paired s_7_1.fastq s_7_2.fastq`

The kraken_results file that is produced will look something like:

```
Not very readable file
```
The most important parts of this file are column 2, which contains the sequence ID, column 3, which holds the taxon ID assigned to the sequence, and column 5, which contains a summary of the taxon IDs that the k-mers in the sequence matched (taxon ID:number of k-mers).
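Even without a report, the raw output can be summarised with standard tools. The sketch below assumes the tab-separated layout from the Kraken manual, where column 1 is `C` (classified) or `U` (unclassified) and column 3 is the assigned taxon ID. The function names `kraken_classified_counts` and `kraken_top_taxa` are made up for this example.

```shell
# Count classified (C) and unclassified (U) sequences using column 1.
kraken_classified_counts() {
    awk -F'\t' '
        { n[$1]++ }
        END { printf "classified\t%d\nunclassified\t%d\n", n["C"], n["U"] }
    ' "$1"
}

# Tally the taxon IDs in column 3 and print "taxon ID<TAB>count",
# most frequent first. The optional second argument limits the number
# of rows shown (default 10).
kraken_top_taxa() {
    awk -F'\t' '
        { c[$3]++ }
        END { for (t in c) printf "%s\t%d\n", t, c[t] }
    ' "$1" | sort -t "$(printf '\t')" -k2,2nr -k1,1n | head -n "${2:-10}"
}
```

Running `kraken_classified_counts kraken_results` followed by `kraken_top_taxa kraken_results 5` gives a quick first impression of the sample. Taxon ID 0 in the tally corresponds to unclassified sequences.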

To make this a bit clearer you can create a kraken report:

`kraken-report --db /path/to/minikraken_20171019_4GB --print_header kraken_results`

## Looking at the results
Let's have a closer look at the kraken_report for the s_7 sample:

`head data/kraken_report`

- Column 1: percentage of reads in the clade rooted at the taxon named in column 6.
- Column 2: number of reads in the clade.
- Column 3: number of reads assigned directly to this taxon, i.e. in the clade but not classified any further.
- Column 4: code indicating the rank of the classification: (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus or (S)pecies.
- Column 5: NCBI taxonomy ID.
- Column 6: scientific name of the taxon, indented according to its depth in the taxonomy.
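The report can be long, so it can help to pull out just the species-level rows and sort them by percentage. The following is a rough sketch that assumes the report is tab-separated with the columns described above. `kraken_report_species` is a hypothetical helper name, not a Kraken command.

```shell
# Print "percentage<TAB>reads<TAB>species name" for every species-level
# row (rank code S in column 4), highest percentage first. The optional
# second argument limits the number of rows shown (default 10).
kraken_report_species() {
    awk -F'\t' '
        $4 == "S" {
            name = $6
            sub(/^ +/, "", name)   # strip the indentation from the name
            printf "%.2f\t%d\t%s\n", $1, $2, name
        }
    ' "$1" | sort -t "$(printf '\t')" -k1,1nr -k2,2nr | head -n "${2:-10}"
}
```

For this tutorial the call would be `kraken_report_species data/kraken_report`. A sizeable share of reads under a second, unrelated species can hint at contamination, although closely related species often share k-mers, so small percentages should be interpreted with care.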

__Q1: What is the most prevalent species in this sample?  
Q2: Are there clear signs of contamination?__  