# Reference data standardization - Workflow

## Aims:
This protocol covers the first step in the detection and analysis of molecular QTLs (xQTLs). It mainly focuses on downloading, indexing, and processing reference data in preparation for further use.
## To do:
To understand as much of this protocol as possible, the following will be done while running through it:
* Read Chapter 8 of the book listed in the introduction md.
* Run all the code listed below to get a sense of what is going on.
* Write up the errors and issues encountered during the run.

## Overview
This module provides reference data download, indexing and preprocessing (if necessary), in preparation for use throughout the pipeline.

The following output reference files will be used for RNA-seq expression and splicing quantification:

1. `GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.{dict,fasta,fasta.fai}`
2. `Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf` for the stranded protocol, and `Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf` for the unstranded protocol.
3. Everything under the `STAR_Index` folder
4. Everything under the `RSEM_Index` folder
5. Optionally, for quality control, `gtf_ref.flat`
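
Once the steps in this protocol have been run, the list above can be turned into a quick check. The sketch below is only an illustration: the function name `missing_reference_outputs` is hypothetical, and it assumes the files and index folders sit directly inside the directory passed to it (by default `reference_data`).

```shell
# Print every expected reference output that is missing from a directory.
# Assumption: all outputs are stored directly under the given directory.
missing_reference_outputs() {
    local ref_dir="${1:-reference_data}"
    local prefix="GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC"
    local f
    # Files from item 1 of the list above
    for f in "$prefix.dict" "$prefix.fasta" "$prefix.fasta.fai"; do
        [ -f "$ref_dir/$f" ] || echo "$f"
    done
    # Index folders from items 3 and 4
    for f in STAR_Index RSEM_Index; do
        [ -d "$ref_dir/$f" ] || echo "$f/"
    done
}
```

Running `missing_reference_outputs reference_data` prints nothing when all five items are present. The gtf files from item 2 are left out because the expected name depends on the stranded or unstranded protocol.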

## Workflows

Workflows implemented include:
* Convert transcript feature file gff3 to gtf
    - Input: an uncompressed gff3 file (i.e. one that can be viewed with `cat`).

* Collapse transcript features into genes
    - Input: a gtf file.

* Generate STAR index based on gtf and reference fasta
    - Input: a gtf file and an accompanying fasta file.

* Generate RSEM index based on gtf and reference fasta
    - Input: a gtf file and an accompanying fasta file.

## Details
### To download reference data:

In [None]:
sos run pipeline/reference_data.ipynb download_hg_reference --cwd reference_data
sos run pipeline/reference_data.ipynb download_gene_annotation --cwd reference_data
sos run pipeline/reference_data.ipynb download_ercc_reference --cwd reference_data
sos run pipeline/reference_data.ipynb download_dbsnp --cwd reference_data

The commands above give only part of the path to the notebook. If `pipeline/reference_data.ipynb` cannot be found from the current directory, they fail with an error because the file cannot be located. To fix this, give the relative or absolute path to the notebook. I will use the commands below instead.

In [None]:
sos run xqtl-pipeline/pipeline/reference_data.ipynb download_hg_reference --cwd reference_data
sos run xqtl-pipeline/pipeline/reference_data.ipynb download_gene_annotation --cwd reference_data
sos run xqtl-pipeline/pipeline/reference_data.ipynb download_ercc_reference --cwd reference_data
sos run xqtl-pipeline/pipeline/reference_data.ipynb download_dbsnp --cwd reference_data

This will generate a folder called `reference_data` in the current directory. To store the data in a different place, change `--cwd reference_data` to `--cwd /path-to-store/reference_data`. If you get an error like `no enough space`, try storing the data on an external hard drive. Remember to avoid spaces in the path!
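
Both pitfalls mentioned above (spaces in the path, too little disk space) can be checked before starting the downloads. The helpers below are a minimal sketch; the names `path_has_space` and `free_kb` are hypothetical and not part of the pipeline.

```shell
# Return success (0) if the given path contains a space character.
path_has_space() {
    case "$1" in
        *" "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the free space, in 1024-byte blocks, on the filesystem that holds
# an existing directory (uses the POSIX output format of df).
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Example usage (commented out; replace the path with a real, existing one):
# target=/path-to-store
# path_has_space "$target" && echo "Path contains a space: $target"
# echo "Free space: $(free_kb "$target") KB"
```

How much free space is actually needed depends on the reference files being downloaded, so the sketch only reports the number and does not enforce a threshold.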

### To format reference data: 
Commands are listed for both Docker and Singularity.

In [None]:
sos run pipeline/reference_data.ipynb hg_reference \
    --cwd reference_data \
    --ercc-reference reference_data/ERCC92.fa \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    --container containers/rna_quantification.sif

This command uses Singularity to format the data. Docker can be used to generate the same result. To use Docker, run the following:

In [None]:
sos run xqtl-pipeline/pipeline/reference_data.ipynb hg_reference \
    --cwd reference_data \
    --ercc-reference reference_data/ERCC92.fa \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    --container gaow/rna_quantification

If the reference data is in a different directory, remember to include its path, e.g. `path-to-store/reference_data`.

In [None]:
sos run pipeline/reference_data.ipynb hg_gtf \
    --cwd reference_data \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --container containers/rna_quantification.sif --stranded

In [None]:
sos run xqtl-pipeline/pipeline/reference_data.ipynb hg_gtf \
    --cwd reference_data \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --container gaow/rna_quantification --stranded

Make the same path changes as above.

This step might result in an error like `script killed by docker, probably because of RAM`. This is a memory issue: the computer does not have enough memory to run the step, which mostly happens on computers with only 8GB of memory. One way to solve this is to create a virtual machine. For details, refer to `Guidelines for setting virtual machine and connecting to personal computer`.
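
On Linux, the total memory of the machine can be checked before running this step. The following is a rough sketch, assuming a `/proc/meminfo`-style file is available; the function name `total_mem_gb` is hypothetical, and on macOS or Windows the memory limit configured for Docker itself may be the relevant number instead.

```shell
# Print total memory in GiB (rounded down), read from a /proc/meminfo-style file.
# The optional file argument defaults to /proc/meminfo (Linux only).
total_mem_gb() {
    awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' "${1:-/proc/meminfo}"
}

# Example usage (commented out; Linux only):
# echo "Total memory: $(total_mem_gb) GiB"
```

Because the value is rounded down, a machine sold as "8GB" typically reports 7 here.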

### To format gene feature data:

In [None]:
sos run pipeline/reference_data.ipynb gene_annotation \
    --cwd reference_data \
    --ercc-gtf reference_data/ERCC92.gtf \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --container containers/rna_quantification.sif --stranded # gaow/rna_quantification --stranded if docker

**Note that for an unstranded RNA-seq protocol, please use the switch `--no-stranded` in the command above instead of `--stranded`. More details can be found later in the document.**
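
When the same script has to handle both protocols, the switch and the matching gene-level gtf name from the overview can be selected in one place. This is only a sketch; the function names `strand_flag` and `gene_gtf_name` are hypothetical helpers, not pipeline commands.

```shell
# Print the sos switch for a protocol: "stranded" or "unstranded".
strand_flag() {
    case "$1" in
        stranded) echo "--stranded" ;;
        unstranded) echo "--no-stranded" ;;
        *) echo "unknown protocol: $1" >&2; return 1 ;;
    esac
}

# Print the gene-level gtf file name listed in the overview for a protocol.
gene_gtf_name() {
    local base="Homo_sapiens.GRCh38.103.chr.reformatted"
    case "$1" in
        stranded) echo "$base.collapse_only.gene.ERCC.gtf" ;;
        unstranded) echo "$base.gene.ERCC.gtf" ;;
        *) echo "unknown protocol: $1" >&2; return 1 ;;
    esac
}
```

For example, `$(strand_flag unstranded)` can be appended to the `gene_annotation` command above in place of the hard-coded `--stranded`.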

### Generating the STAR index without the GTF annotation file allows customizing the read length later on in STAR alignment.

In [None]:
# With Docker, use --container gaow/rna_quantification instead of the .sif file.
# (A comment after a trailing backslash would break the line continuation.)
sos run pipeline/reference_data.ipynb STAR_index \
    --cwd reference_data \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --container containers/rna_quantification.sif \
    --mem 40G

This will generate the `STAR_Index` folder in the `reference_data` folder.

### To generate the RSEM index, use the gtf file from **prior** to the gene collapsing step (**without** the gene tag in its file name).

In [None]:
sos run pipeline/reference_data.ipynb RSEM_index \
    --cwd reference_data \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/rna_quantification.sif & # use gaow/rna_quantification if docker

This will generate the `RSEM_Index` folder in the `reference_data` folder.

### To generate the RefFlat annotation for Picard QC:

In [None]:
sos run pipeline/reference_data.ipynb RefFlat_generation \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/rna_quantification.sif & # use gaow/rna_quantification if docker

To generate the SUPPA annotation for psichomics to detect RNA alternative splicing events,

In [None]:
sos run pipeline/reference_data.ipynb SUPPA_annotation \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/psichomics.sif # use gaow/psichomics if docker

To prepare the psichomics database,

In [None]:
sos run pipeline/reference_data.ipynb psi_hg38_annotation \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.gtf \
    --hgrc-db reference_data/hgnc_database.txt \
    --container containers/psichomics.sif # use gaow/psichomics if docker

## Output - Screenshots are shown in the Output folder
After running this protocol, several files have been created in preparation for further analysis.
Steps include:
* Convert transcript feature file gff3 to gtf
* Collapse transcript features into genes
* Generate STAR index based on gtf and reference fasta
* Generate RSEM index based on gtf and reference fasta

Outputs include:
- A gtf file.
- A gtf file with the collapsed gene model: will be used in the next protocol
- A folder of STAR index: will be used in the next protocol
- A folder of RSEM index: will be used in the next protocol