# Performing QC on your data
The results of any analysis will only ever be as good as the data you put into it. To avoid spending countless hours on analysis without getting satisfactory results, or worse, getting erroneous or misleading ones, it is important to QC your data before starting. There are a number of checks you can make to ensure you're dealing with high-quality data, and we will walk you through some of them here.

## Contamination
In order to get meaningful results from Roary, the samples should be closely related. If you have lots of contamination in your data, for instance if one of your samples is from a different species, you will get very few genes in your core genome, if any at all.  

It is always a good idea to check that your samples are the species you expect them to be. You can use tools such as [Kraken](https://www.ebi.ac.uk/research/enright/software/kraken) for this. Roary comes with a QC option that will run Kraken for you and generate a report listing the top species for each sample. For this to work, you need to have Kraken installed and a Kraken database available. You won't need it for this tutorial, but it is highly recommended if you plan to do any real analysis.

The following command generates a QC report with Kraken (replace `/path/to/kraken/db` with the path to wherever you downloaded the database):

    roary -qc -k /path/to/kraken/db *.gff

The report will look something like this:

![QC report](img/qc_report.png)

As we expected, these three samples are all of the same species. Let's assume that we initially had a fourth sample that we wanted to use in this analysis. We thought that this sample was also from *S. pneumoniae*, but once we run Roary with the QC option, we get the following output:

![QC report with contamination](img/qc_contamination.png)

This tells us that the most prevalent species in sample 4 is in fact *Escherichia coli*, so we will exclude this sample from our analysis before we carry on.

For Sanger users, Kraken is already installed and is run as part of the automated QC pipeline. To create a symlink to the Kraken report, run:

    pf qc -t lane -i 13681_1#18 -l .

The size of the assemblies can also provide a useful hint. If one of the assemblies is much smaller or larger than the others, it may not be from the same species as the rest.
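
The size comparison described above can be sketched in a few lines of Python. This is only an illustration: the sample sizes below are hypothetical, and the 20% tolerance around the median is an arbitrary assumption rather than a threshold recommended by Roary.

```python
from statistics import median


def assembly_size(fasta_text):
    """Return the total number of bases in FASTA-formatted text."""
    return sum(
        len(line.strip())
        for line in fasta_text.splitlines()
        if line.strip() and not line.startswith(">")
    )


def flag_size_outliers(sizes, tolerance=0.2):
    """Return the samples whose size differs from the median by more than `tolerance`.

    `sizes` maps sample name -> assembly size in bases. The 20% default
    is an illustrative assumption, not a Roary recommendation.
    """
    mid = median(sizes.values())
    return sorted(
        name for name, size in sizes.items()
        if abs(size - mid) > tolerance * mid
    )


# Hypothetical sizes: three similar assemblies and one much larger one.
sizes = {
    "sample1": 2_100_000,
    "sample2": 2_150_000,
    "sample3": 2_050_000,
    "sample4": 5_000_000,
}
print(flag_size_outliers(sizes))  # ['sample4']
```

With real data, `assembly_size` could be applied to the contents of each assembly file to build the `sizes` dictionary. A flagged sample is only a hint that something may be wrong and is worth checking with Kraken.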

## Coverage
To get decent assemblies out of your raw data, you need a genome coverage of at least 30x. For a quick estimate of your coverage, you can divide the number of bases in your raw data by the number of bases in the reference genome of the species. For the samples used in this tutorial, the coverage is listed below. The genome of _S. pneumoniae_ is approximately 2,200,000 bases.

|Sample |Number of bases|Coverage|
|------ |---------------|--------|
|sample1|262705400  |120x    |
|sample2|218026200  |99x     |
|sample3|173524000  |79x     |
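
As a minimal sketch of the estimate described above, the following Python snippet divides the number of raw bases by the approximate genome size and compares the result with the 30x minimum. The function and variable names are made up for this example. Because the genome size is only approximate, the printed values can differ from the table by about 1x.

```python
def estimate_coverage(total_bases, genome_size):
    """Estimate coverage as raw bases divided by reference genome size."""
    return total_bases / genome_size


GENOME_SIZE = 2_200_000  # approximate S. pneumoniae genome size, from the text
MIN_COVERAGE = 30        # minimum coverage suggested in the text

# Number of bases per sample, taken from the table above.
raw_bases = {
    "sample1": 262_705_400,
    "sample2": 218_026_200,
    "sample3": 173_524_000,
}

for sample, bases in raw_bases.items():
    coverage = estimate_coverage(bases, GENOME_SIZE)
    status = "ok" if coverage >= MIN_COVERAGE else "too low"
    print(f"{sample}: {coverage:.0f}x ({status})")
```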

## Fragmented assemblies
If the assemblies are very fragmented (thousands of contigs), many genes are likely to be split across contigs and may be too fragmented to be of much use.
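
One way to check fragmentation is to count the contigs in each assembly. The sketch below does this for FASTA-formatted text and also computes the N50, a commonly used summary of contig lengths that this tutorial does not itself prescribe. The helper names and the toy sequence are hypothetical.

```python
def contig_lengths(fasta_text):
    """Return the length of each contig in FASTA-formatted text."""
    lengths = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)          # a header line starts a new contig
        elif line and lengths:
            lengths[-1] += len(line)   # sequence line: extend current contig
    return lengths


def n50(lengths):
    """Return the N50: the length of the contig at which, going from longest
    to shortest, at least half of the assembled bases have been covered."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Toy assembly with three contigs of 12, 4 and 2 bases.
toy_fasta = ">contig1\nACGTACGT\nACGT\n>contig2\nACGT\n>contig3\nAC\n"
lengths = contig_lengths(toy_fasta)
print(len(lengths), "contigs, N50 =", n50(lengths))  # 3 contigs, N50 = 12
```

For a real assembly, pass the contents of the FASTA file to `contig_lengths`. A contig count in the thousands, as mentioned above, is a sign that the assembly may be too fragmented.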
   
    
These are just some of the most basic checks you can do to make sure your data looks all right. Much more can be done, but we won't go into further detail in this tutorial.

## Check your understanding
**Q5: Why is it important to QC your data?**  
  
**Q6: You're not getting any core genes when you run Roary. What could be the reason?**  

The answers to these questions can be found [here](answers.ipynb).  

Now you should be ready to run Roary to generate a pangenome, so head to the next section, [Running Roary](run_roary.ipynb).  
You can also revisit the [previous section](prepare_data.ipynb) or return to the [index page](index.ipynb).