# Sanger pathogen users

**This section is for Sanger users who want to prepare their DEAGO input data from the output of the pathogens RNA-Seq Expression Pipeline.**

In this section, we'll go through how you could generate the files we use in the [DEAGO tutorial](https://github.com/vaofford/pathogen-informatics-training/blob/master/Notebooks/DEAGO/index.ipynb) using the output from the Sanger pathogen pipelines.

_You will need to be logged in to either `pcs5` or the `farm` to run the commands in this section._

## Requesting the RNA-Seq Expression Pipeline for your data

For an overview of the RNA-Seq Expression Pipeline and for details on requesting this pipeline for your data, please see the [Pathogen Informatics wiki](http://mediawiki.internal.sanger.ac.uk/index.php/Pathogen_Informatics_RNA-Seq_Expression_Pipeline). If you need help with this or have questions, please email [path-help@sanger.ac.uk](mailto:path-help@sanger.ac.uk).

***

## Pipeline status 

Once you have requested the RNA-Seq Expression Pipeline, check that your samples have finished going through the pipeline using **`pf status`**. The command to check the pipeline status of our [tutorial](https://github.com/vaofford/pathogen-informatics-training/blob/master/Notebooks/DEAGO/index.ipynb) data would be:

    pf status -t study -i 2319

This should give you the status of the 32 lanes in this study across all of the pathogen pipelines.

| Name     | Import | QC    | Mapping | Archive | Improve | SNP call | RNASeq | Assemble | Annotate |
| :------- | :----: | :---: | :-----: | :-----: | :-----: | :------: | :----: | :------: | :------: | 
| 8380_3#1 | Done   | Done  | Done    | Done    | -       | Done     | Done   | Done     | -        |
| 8380_3#2 | Done   | Done  | Done    | Done    | -       | Done     | Done   | Done     | -        |
| 8380_3#4 | Done   | Done  | Done    | -       | -       | Done     | Done   | Done     | -        |
| 8380_3#5 | Done   | Done  | Done    | Done    | -       | Done     | Done   | Done     | -        |
| ...      | ...    | ...   | ...     | ...     | ...     | ...      | ...    | ...      | ...      |

***

## Count data

We can use **`pf rnaseq`** to find the count files that the RNA-Seq Expression Pipeline produces.

    pf rnaseq -t study -i 2319

This will give you the location of the 32 expression count files.

    /lustre/scratch118/infgen/pathogen/pathpipe/prokaryotes/seq-pipelines/Mus/musculus/TRACKING/2319/WT2xCtrl_1/SLX/WT2xCtrl_1_5733492/8380_3#1/390176.pe.markdup.bam.expression.csv
    /lustre/scratch118/infgen/pathogen/pathpipe/prokaryotes/seq-pipelines/Mus/musculus/TRACKING/2319/WT2xCtrl_2/SLX/WT2xCtrl_2_5733493/8380_3#2/390269.pe.markdup.bam.expression.csv
    /lustre/scratch118/infgen/pathogen/pathpipe/prokaryotes/seq-pipelines/Mus/musculus/TRACKING/2319/WT2xIL22_2/SLX/WT2xIL22_2_5733495/8380_3#4/389017.pe.markdup.bam.expression.csv
    ...

Expression count files are available for all organisms. For human and mouse, there are also **featurecounts** files which have been generated using [featureCounts](http://bioinf.wehi.edu.au/featureCounts/). To find these instead, use the **`-f`** option, which selects a particular file type.

    pf rnaseq -t study -i 2319 -f featurecounts

For DEAGO, you will need to have your count files in a single directory. Copying the data from the pipelines to your working directory is inefficient. Instead, you should create symlinks (symbolic links), which are references that point to the original files.

    mkdir counts
    pf rnaseq -t study -i 2319 -f featurecounts -l ./counts

Take a look in the counts directory:

    ls counts

And you should see:

    8380_3#1.390176.pe.markdup.bam.featurecounts.csv   8380_5#12.389308.pe.markdup.bam.featurecounts.csv 
    8380_8#11.390155.pe.markdup.bam.featurecounts.csv  8380_3#2.390269.pe.markdup.bam.featurecounts.csv  
    8380_6#1.390254.pe.markdup.bam.featurecounts.csv   8380_8#12.390242.pe.markdup.bam.featurecounts.csv
    ...

While it might look like your files are in the counts directory, what the `-l` option has done is create a series of symlinks which point to the locations of the count files within the pipelines. You can use **`ls -al counts`** to see what we mean.

    drwxr-xr-x  2 vo1 pathdev 4096 Mar 22 15:52 .
    drwxr-xr-x 15 vo1 pathdev 4096 Mar 22 15:52 ..
    lrwxrwxrwx  1 vo1 pathdev  179 Mar 22 15:52 8380_3#1.390176.pe.markdup.bam.featurecounts.csv -> /lustre/scratch118/infgen/pathogen/pathpipe/prokaryotes/seq-pipelines/Mus/musculus/TRACKING/2319/WT2xCtrl_1/SLX/WT2xCtrl_1_5733492/8380_3#1/390176.pe.markdup.bam.featurecounts.csv
    lrwxrwxrwx  1 vo1 pathdev  179 Mar 22 15:52 8380_3#2.390269.pe.markdup.bam.featurecounts.csv -> /lustre/scratch118/infgen/pathogen/pathpipe/prokaryotes/seq-pipelines/Mus/musculus/TRACKING/2319/WT2xCtrl_2/SLX/WT2xCtrl_2_5733493/8380_3#2/390269.pe.markdup.bam.featurecounts.csv
    lrwxrwxrwx  1 vo1 pathdev  179 Mar 22 15:52 8380_3#4.389017.pe.markdup.bam.featurecounts.csv -> /lustre/scratch118/infgen/pathogen/pathpipe/prokaryotes/seq-pipelines/Mus/musculus/TRACKING/2319/WT2xIL22_2/SLX/WT2xIL22_2_5733495/8380_3#4/389017.pe.markdup.bam.featurecounts.csv
    ...
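
Because the files in `counts` are only links, it can be worth confirming that every link still resolves before running DEAGO. The sketch below is one possible check rather than part of `pf`: the function names `count_links` and `broken_links` are hypothetical, and it assumes a POSIX shell with the standard `find`, `wc` and `tr` utilities.

```shell
# Hypothetical helper: count the symlinks found under a directory.
count_links() {
    find "$1" -type l | wc -l | tr -d ' '
}

# Hypothetical helper: print each symlink whose target no longer exists.
# `test -e` follows the link, so it fails for a dangling link.
broken_links() {
    find "$1" -type l ! -exec test -e {} \; -print
}
```

For the tutorial study, `count_links counts` would be expected to report 32 (one link per lane) and `broken_links counts` should print nothing.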

***

## Sample/condition mapping

DEAGO also needs a targets file which maps the sample files to the experimental conditions that were applied.

DEAGO expects to see these three columns in this file:

* **`filename`** - name of the sample count file in the counts directory

* **`condition`** - experimental condition that was applied

* **`replicate`** - number or phrase representing a replicate group

You can get your filenames by using **`ls`** to list the files in your counts directory.  

You may be able to use **`pf info`** to get information on the experimental conditions and replicate numbers.


    pf info -t study -i 2319

Here we can see that the **Sample** column might be able to help us:

| Lane      | Sample     | Supplier Name | Public Name | Strain  |
| :-------- | :--------: | :-----------: | :---------: | :-----: |
| 8380_3#1  | WT2xCtrl_1 | NA            | WT2xCtrl_1  | C57BL/6 |
| 8380_3#2  | WT2xCtrl_2 | NA            | WT2xCtrl_2  | C57BL/6 |
| ...       | ...        | ...           | ...         | ...     |
| 8380_6#3  | KO1xIL22_1 | NA            | KO1xIL22_1  | C57BL/6 |
| 8380_6#4  | KO1xIL22_2 | NA            | KO1xIL22_2  | C57BL/6 |
| ...       | ...        | ...           | ...         | ...     |
| 8380_7#9  | KO3xCtrl_1 | NA            | KO3xCtrl_1  | C57BL/6 |
| ...       | ...        | ...           | ...         | ...     |
| 8380_8#15 | KO4xIL22_1 | NA            | KO4xIL22_1  | C57BL/6 |

From this we can see that there are:

* 2 cell types: WT and KO
* 2 treatments: Ctrl and IL22
* 4 biological replicates (e.g. KO4)
* 2 technical replicates (e.g. `_1` and `_2`)

So, for lanes 8380_3#1 and 8380_3#2 the targets file would need:

    condition	cell_type	treatment	replicate	filename
    WT_Ctrl	WT	Ctrl	2.1	8380_3#1.390176.pe.markdup.bam.featurecounts.csv
    WT_Ctrl	WT	Ctrl	2.2	8380_3#2.390269.pe.markdup.bam.featurecounts.csv

DEAGO expects a **condition** column. We have two factors here, _cell type_ and _treatment_, but DEAGO can only perform single-factor analyses. So we must join these together to form the condition, e.g. WT_Ctrl. Likewise, there is only one replicate column, so we join the biological and technical replicate numbers, e.g. 2.1.
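
The joining described above can be scripted. The following is a minimal sketch, not a DEAGO or `pf` feature: `sample_to_row` is a hypothetical helper, and it assumes every sample name follows the `<cell type><biological replicate>x<treatment>_<technical replicate>` pattern seen in this study (e.g. `WT2xCtrl_1`). Sample names in other studies will probably need a different pattern.

```shell
# Hypothetical helper: turn a sample name and its count filename into one
# tab-delimited targets row (condition, cell_type, treatment, replicate, filename).
# Assumes names shaped like WT2xCtrl_1, as in the tutorial study.
sample_to_row() {
    sample=$1
    filename=$2
    # Split the name into: cell type, biological rep, treatment, technical rep.
    fields=$(printf '%s\n' "$sample" |
        sed -E -n 's/^([A-Za-z]+)([0-9]+)x([^_]+)_([0-9]+)$/\1 \2 \3 \4/p')
    if [ -z "$fields" ]; then
        echo "unrecognised sample name: $sample" >&2
        return 1
    fi
    set -- $fields
    # condition = cell type + treatment; replicate = biological.technical
    printf '%s_%s\t%s\t%s\t%s.%s\t%s\n' "$1" "$3" "$1" "$3" "$2" "$4" "$filename"
}

# Header line followed by the rows for lanes 8380_3#1 and 8380_3#2.
printf 'condition\tcell_type\ttreatment\treplicate\tfilename\n'
sample_to_row WT2xCtrl_1 '8380_3#1.390176.pe.markdup.bam.featurecounts.csv'
sample_to_row WT2xCtrl_2 '8380_3#2.390269.pe.markdup.bam.featurecounts.csv'
```

Redirecting the output to a file would give a starting point for the targets file, which should still be checked by eye against the `pf info` output.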

DEAGO will ignore extra columns in your targets file, so you can include other descriptive columns such as cell_type and treatment.
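
Before running DEAGO, a quick sanity check of the targets file can catch a missing column or a filename that does not match anything in the counts directory. This is a sketch under stated assumptions: `check_targets` is a hypothetical helper, the targets file is assumed to be tab-delimited with a header line (as in the example above), and DEAGO's own validation may differ.

```shell
# Hypothetical helper: report missing required columns, then any listed
# count file that is absent from the counts directory. No output means
# no problems were found.
check_targets() {
    targets=$1
    counts_dir=$2
    missing=0

    # Each required column name must appear in the tab-delimited header.
    header=$(head -n 1 "$targets")
    for column in filename condition replicate; do
        if ! printf '%s\n' "$header" | tr '\t' '\n' | grep -qx "$column"; then
            echo "missing column: $column"
            missing=1
        fi
    done
    [ "$missing" -eq 0 ] || return 1

    # Find the filename column by name, then test that each file is present.
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "filename") col = i; next }
        NF      { print $col }
    ' "$targets" |
    while IFS= read -r name; do
        [ -e "$counts_dir/$name" ] || echo "missing count file: $name"
    done
}
```

With a targets file called `targets.txt` (a name chosen here for illustration), this would be run as `check_targets targets.txt counts`.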

***

## Annotations

To prepare an annotation file, it's often useful to know which reference was used to map your reads. The **`--details`** option in **`pf rnaseq`** will give you this information.

    pf rnaseq -t lane -i 8380_3#1 --details

Here we can see that Mus_musculus_mm10 was used as the reference when mapping this lane with BWA.

    /lustre/scratch118/infgen/pathogen/pathpipe/prokaryotes/seq-pipelines/Mus/musculus/TRACKING/2319/WT2xCtrl_1/SLX/WT2xCtrl_1_5733492/8380_3#1/390176.pe.markdup.bam.expression.csv  Mus_musculus_mm10	bwa	2012-12-05T10:36:36

For organisms where gene symbols and GO terms are available in Ensembl, you can use BioMart to gather your annotation. For other organisms, you may need to mine other databases or sources. The annotation will need to be formatted for use with DEAGO. See [Preparing an annotation file](Preparing-an-annotation-file.ipynb) for more information.
  
[Return to the index](index.ipynb)