# Identifying contamination
It is always a good idea to check that your data is from the species you expect it to be. A very useful tool for this is [Kraken](http://ccb.jhu.edu/software/kraken/). In this tutorial we will go through how you can use Kraken to check your samples for contamination. If you have Kraken installed, please feel free to follow the rest of this tutorial on the command line. Don't worry if you do not, as it is not strictly necessary for this tutorial.

__Note if using the Sanger cluster:__ If you have access to the Sanger cluster, Kraken is already centrally installed. Kraken is also run as part of the automatic QC pipeline and you can retrieve the results using the `pf qc` script. For more information, run `pf man qc`.

## Setting up a database
To run Kraken you need to either build a database or download an existing one. The standard database is large (33 GB) and probably more than is needed for basic QC checks. Thankfully, there are some pre-built databases available. To download the smallest of them, the 4 GB MiniKraken, run:

`wget https://ccb.jhu.edu/software/kraken/dl/minikraken_20171019_4GB.tgz`

Then all you need to do is un-tar it:

`tar -zxvf minikraken_20171019_4GB.tgz`
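
Before moving on, it can be worth confirming that the extracted directory looks like a usable database. The sketch below assumes the layout described in the Kraken manual, where a database directory holds `database.kdb`, `database.idx`, `taxonomy/nodes.dmp` and `taxonomy/names.dmp`. The function name `check_kraken_db` is only illustrative and is not part of Kraken.

```shell
# Sketch: report any of the files a Kraken (v1) database is expected to contain
# that are missing from the given directory.
# check_kraken_db is a hypothetical helper name, not a Kraken command.
check_kraken_db() {
    db_dir=$1
    missing=0
    for f in database.kdb database.idx taxonomy/nodes.dmp taxonomy/names.dmp; do
        if [ ! -e "$db_dir/$f" ]; then
            echo "missing: $db_dir/$f" >&2
            missing=1
        fi
    done
    # Exit status 0 means every expected file was found.
    return "$missing"
}
```

You could then run `check_kraken_db minikraken_20171019_4GB`, assuming the archive unpacked into a directory of that name. No output and a zero exit status mean all four files were found.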

If the pre-packaged databases are not quite what you are looking for, you can create your own customized database instead. Details about this can be found [here](http://ccb.jhu.edu/software/kraken/MANUAL.html#custom-databases).

__Note if using the Sanger cluster:__ There are several pre-built databases available centrally on the Sanger cluster. For more information, please contact the Pathogen Informatics team.

## Running Kraken
To run Kraken, you need to provide it with the path to the database you just downloaded (or built). By default, the input files are assumed to be in FASTA format, so in this case we also need to tell Kraken that our input files are in FASTQ format (`--fastq-input`) and that they are paired-end reads (`--paired`):

`kraken --db /path/to/minikraken_20171019_4GB --output kraken_results --fastq-input --paired s_7_1.fastq s_7_2.fastq`
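
Since the command above treats the two files as mates, a quick sanity check on your own data is to confirm that both FASTQ files hold the same number of reads. This is a minimal sketch that assumes uncompressed FASTQ files with exactly four lines per read. The helper names `count_fastq_reads` and `check_read_pairs` are made up for this example.

```shell
# Sketch: compare read counts between two paired FASTQ files.
# Assumes uncompressed FASTQ with exactly four lines per read (no wrapped sequences).
count_fastq_reads() {
    n_lines=$(wc -l < "$1")
    echo $(( n_lines / 4 ))
}

check_read_pairs() {
    n_fwd=$(count_fastq_reads "$1")
    n_rev=$(count_fastq_reads "$2")
    if [ "$n_fwd" -eq "$n_rev" ]; then
        echo "OK: $n_fwd read pairs"
    else
        echo "Mismatch: $1 has $n_fwd reads, $2 has $n_rev" >&2
        return 1
    fi
}
```

For the sample used here that would be `check_read_pairs s_7_1.fastq s_7_2.fastq`. Matching counts do not prove the reads are in the same order, only that no file is obviously truncated.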

The kraken_results file that is produced will look something like:

```
C       1       1313    201     0:48 1313:1 0:21 A:31 0:37 1313:1 0:12 1313:1 0:19
C       3       1313    201     0:1 1313:1 0:25 1313:1 0:26 1313:1 0:15 A:31 0:37 1313:1 0:12 1313:1 0:19
U       5       0       201     0:70 A:31 0:70
C       7       1313    201     0:43 1313:1 0:26 A:31 0:58 1313:1 0:11
C       9       1313    201     0:24 1313:1 0:15 1313:1 0:8 1313:1 0:20 A:31 0:25 1313:1 0:23 1313:1 0:20
U       11      0       201     0:70 A:31 0:70
C       13      561276  201     0:8 1313:1 0:30 1313:1 0:30 A:31 0:4 1313:2 0:9 1313:1 0:14 1313:1 0:35 561276:1 0:3
U       15      0       201     0:70 A:31 0:70
U       17      0       201     0:70 A:31 0:70
U       19      0       201     0:70 A:31 0:70
...
```
According to the [Kraken manual](http://ccb.jhu.edu/software/kraken/MANUAL.html), the five columns in this file are:

1. "C"/"U": one letter code indicating that the sequence was either classified or unclassified.
2. The sequence ID, obtained from the FASTA/FASTQ header.
3. The taxonomy ID Kraken used to label the sequence; this is 0 if the sequence is unclassified.
4. The length of the sequence in bp.
5. A space-delimited list indicating the LCA mapping of each k-mer in the sequence.
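
Using the column layout just described, a rough first look at the raw output is possible with standard tools alone. The sketch below counts classified and unclassified reads from column 1 and lists the most frequent taxonomy IDs from column 3. The function names are hypothetical, and the code relies only on the first and third columns being whitespace-separated.

```shell
# Sketch: quick tallies from a raw Kraken output file (e.g. kraken_results).
# Only columns 1 (C/U) and 3 (taxonomy ID) are used.
summarise_kraken_output() {
    awk '
        $1 == "C" { classified++ }
        $1 == "U" { unclassified++ }
        END {
            total = classified + unclassified
            printf "classified\t%d\n", classified
            printf "unclassified\t%d\n", unclassified
            if (total > 0)
                printf "percent_unclassified\t%.2f\n", 100 * unclassified / total
        }
    ' "$1"
}

# Print "count<TAB>taxid" for the N most common taxonomy IDs (default 5).
top_taxids() {
    awk '$1 == "C" { n[$3]++ } END { for (t in n) printf "%d\t%s\n", n[t], t }' "$1" |
        sort -k1,1nr -k2,2n |
        head -n "${2:-5}"
}
```

For example, `summarise_kraken_output kraken_results` followed by `top_taxids kraken_results 3`. The taxonomy IDs are still bare numbers at this stage, which is one reason to generate a report next.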

To make this easier to read, you can summarise the results as a Kraken report:

`kraken-report --db /path/to/minikraken_20171019_4GB --print_header kraken_results > kraken-report`

## Looking at the results
Let's have a closer look at the kraken-report file for the s_7 sample:

`less data/kraken-report`

    #Kraken version: kraken-0.10.6-a2d113dc8f
    #Database: /lustre/scratch118/infgen/pathogen/pathpipe/kraken/minikraken_20140330
     89.97    178811    178811    U    0    unclassified
     10.03    19942    515    -    1    root
      9.76    19393    21    -    131567      cellular organisms
      9.66    19193    72    D    2        Bacteria
      6.63    13179    4    P    1239          Firmicutes
      6.62    13158    8    C    91061            Bacilli
      6.57    13050    2    O    1385              Bacillales
      6.56    13037    21    F    90964                Staphylococcaceae
      6.55    13014    372    G    1279                  Staphylococcus
      6.13    12181    0    S    1283                    Staphylococcus haemolyticus
      6.13    12181    12181    -    279808                      Staphylococcus haemolyticus JCSC1435
      0.17    340    0    S    1292                    Staphylococcus warneri
      0.17    340    340    -    1194526                      Staphylococcus warneri SG1
      0.04    79    5    S

According to the [Kraken manual](http://ccb.jhu.edu/software/kraken/MANUAL.html), the six columns in this file are:

1. Percentage of reads covered by the clade rooted at this taxon
2. Number of reads covered by the clade rooted at this taxon
3. Number of reads assigned directly to this taxon
4. A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply '-'.
5. NCBI taxonomy ID
6. Scientific name
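
With those six columns in mind, the report can be filtered down to the rows you care about. The sketch below assumes the report is tab-delimited, as the Kraken manual describes, and that header lines start with `#`. It pulls out the species-level rows (rank code `S`) sorted by percentage, and separately the percentage of unclassified reads. The function names are illustrative only.

```shell
# Sketch: filter a kraken-report file by rank code.
# Assumes tab-delimited columns: percent, clade reads, direct reads, rank, taxid, name.
species_from_report() {
    awk -F'\t' '
        /^#/ { next }                         # skip header lines
        $4 == "S" {
            pct = $1;  gsub(/ /, "", pct)     # strip padding around the percentage
            name = $6; sub(/^ +/, "", name)   # strip indentation that encodes tree depth
            printf "%s\t%s\t%s\n", pct, $2, name
        }
    ' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k1,1nr
}

# Print the percentage from the row whose rank code is U (unclassified).
unclassified_percent() {
    awk -F'\t' '$4 == "U" { pct = $1; gsub(/ /, "", pct); print pct }' "$1"
}
```

Running `species_from_report data/kraken-report | head` gives a compact species list to compare against what you expected to sequence.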

__Q1: What is the most prevalent species in this sample?  
Q2: Are there clear signs of contamination?  
Q3: What percentage of reads could not be classified?__  

Congratulations! You have reached the end of this tutorial. You can find the answers to all the questions of the tutorial [here](answers.ipynb). To revisit the previous section, [click here](assessment.ipynb). Alternatively, you can head back to the [index page](index.ipynb).