# Documentation on running the TRIMBA and RNASEQ pipelines

## Test your Ubuntu installation

To install the pipeline on your local Ubuntu machine, ensure you have `git`, `docker` and `nextflow` installed.

To check whether `git` is installed, run the following command on the Ubuntu command line.

```bash
git --version
```

That should say something like:
```
git version 2.34.1
```

If not, **you need to install git**.

```bash
sudo apt-get install git
```

### Test if you have docker access

The pipeline uses `docker`. To test that it is working on your computer, try the following command:

```bash
docker run hello-world
```

That should give something like:

```

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/
 ```
 
If `docker` is not installed, see the following page on how to install Docker on a Windows machine and configure it for use with your Ubuntu WSL2 (Windows Subsystem for Linux, version 2) system:

https://docs.docker.com/desktop/wsl/

#### Permission denied error

If you get the following error when running `docker run hello-world`:

```
"Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/create": dial unix /var/run/docker.sock: connect: permission denied.
```

it means that the user you are running as is not yet a member of the `docker` group. Either run the command with `sudo` or, better, add the user to the `docker` group with the following command:

```bash
sudo usermod -aG docker $(whoami)
```

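The group change only takes effect in a new login session, so log out and back in (or run `newgrp docker`) first. After that you should be able to run: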

```bash
docker run hello-world
```

and get something like this:
```
Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/
```
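
If the permission error persists after the `usermod` command, it can help to check whether the current shell already sees the `docker` group, since group changes only apply to new login sessions. A minimal sketch; the helper name `has_group` is made up for this example:

```bash
# Return success if group name $1 occurs in the space-separated list $2.
has_group() {
    local wanted="$1" groups_list="$2" g
    for g in $groups_list; do
        if [ "$g" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}

# `id -nG` prints the names of all groups of the current session.
if has_group docker "$(id -nG 2>/dev/null)"; then
    echo "This shell is in the docker group"
else
    echo "This shell is not in the docker group (yet)"
fi
```

If it reports that the group is missing, logging out and back in, or running `newgrp docker`, should pick up the new membership.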

### Test if you have nextflow

The pipelines are executed with [`nextflow`](https://www.nextflow.io/) and you can check if `nextflow` is installed with this command:

```
nextflow -v
```

Which should give something like:

```
nextflow version 23.10.1.5891
```
If it does not, you need to install it. `nextflow` itself requires `java`, so first check that `java` is installed by running this command:

```bash
java -version
```

That should give something like:
```
openjdk version "11.0.21" 2023-10-17
OpenJDK Runtime Environment (build 11.0.21+9-post-Ubuntu-0ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.21+9-post-Ubuntu-0ubuntu122.04, mixed mode, sharing)
```

**Note:** It is important that the version is at least 11 (11.0.21 in my case). If it is lower than 11, you need to install Java 11. More information on installing Java 11 can be found here: https://www.nextflow.io/docs/latest/getstarted.html#requirements
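
The version comparison can also be scripted. The sketch below assumes the usual `java -version` layout shown above, where the version is the quoted string on the first line; the helper name `java_major` is hypothetical:

```bash
# Extract the major Java version from a string such as "11.0.21" or "1.8.0_292".
java_major() {
    local version="$1"
    # Releases before Java 9 are reported as 1.x (1.8 means Java 8).
    case "$version" in
        1.*) version="${version#1.}" ;;
    esac
    # Keep everything before the first '.', '_' or '-'.
    echo "${version%%[._-]*}"
}

if command -v java >/dev/null 2>&1; then
    # `java -version` writes to stderr, so redirect it before parsing.
    reported=$(java -version 2>&1 | head -n 1 | cut -d'"' -f2)
    if [ "$(java_major "$reported")" -ge 11 ] 2>/dev/null; then
        echo "Java $reported is recent enough"
    else
        echo "Java $reported looks older than 11 (or could not be parsed)"
    fi
else
    echo "java not found"
fi
```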

If `java 11` is installed, you can use the following command to install nextflow:

```bash
curl -s https://get.nextflow.io | bash
```

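The installer downloads the `nextflow` launcher into the current directory. Make it executable with `chmod +x nextflow` and move it to a directory on your `PATH`, such as `.local/bin` in your home directory.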

If, after installation, the command `nextflow -v` still does not work, type this command:

```bash
${HOME}/.local/bin/nextflow -v
```

If that works, `~/.local/bin` is not on your `PATH`. You can add it by appending the following as the last line of your `.bashrc` file:

```
export PATH=${HOME}/.local/bin:$PATH
```
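
The individual checks in this section can be combined into one pass. This is only a sketch: the function name `check_tools` is made up here, and it tests whether each command is on the `PATH`, not whether its version is suitable.

```bash
# Report which of the given tools can be found on PATH.
# The return value is the number of missing tools (0 means all present).
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "OK       $tool"
        else
            echo "MISSING  $tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

check_tools git docker java nextflow || echo "Install the missing tools before continuing."
```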

That completes the requirements check. Next is to install the `trimba` pipeline.

## Install the pipeline locally

To get the `trimba` pipeline, you need to install it locally on the Ubuntu system. First create or navigate to the folder you want to install it in, then execute the following command:

```
git clone https://github.com/thondeboer/nf-core-trimba.git
```

This creates a folder called `nf-core-trimba` in the current folder. The path to that folder is the value you pass to the `nextflow` command.

## Do a test run of the RNASEQ pipeline

The RNASEQ pipeline does not need to be installed with `git`; it is downloaded automatically the first time you run it. It is good practice to test your setup first. The test run also downloads all the data it needs, so it is very simple to run:

```bash
nextflow run nf-core/rnaseq -r 3.14.0 -profile test,docker --outdir test_rnaseq
```

This will create a directory `test_rnaseq` and run the pipeline. It takes a few minutes (~4m on my computer) and should complete successfully.

# Configure and run the `trimba` pipeline

The `trimba` and `rnaseq` pipelines require some configuration, mostly to tell the pipelines which files to analyze.

The script `run_trimba_and_rnaseq.sh` contains the complete set of commands to run, and you should edit this file for each run.

The following sections break down each part of the pipeline script.

### Main configuration

Most of the configuration is done through shell script variables. Hopefully only a few need to be changed to reflect your situation, and they only need to be changed ONCE.

```bash
#!/bin/bash

#####
#
# MAIN CONFIGURATION
#

# Location of your FASTQ data
DATA_DIR='/mnt/data/yasar'
FASTQ_DIR="${DATA_DIR}/FASTQ"

# The location of your reference files (Genome FASTA and indices etc.)
REF="/mnt/data/REFERENCES"

# The location of the PIPELINES
PIPELINE_DIR="${HOME}/CODE/workflows"
TRIMBA="${PIPELINE_DIR}/nf-core-trimba"

# The location of the nextflow executable
NEXTFLOW="${HOME}/.local/bin/nextflow"
```
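
Because the script builds its paths from shell variables, a misspelled variable name expands to an empty string instead of raising an error. A small guard like the following could catch that early. It is a sketch, not part of the original script, and `require_dir` is a hypothetical helper name:

```bash
# Fail if a directory variable is empty or points to a directory that does not exist.
require_dir() {
    local name="$1" path="$2"
    if [ -z "$path" ]; then
        echo "ERROR: $name is empty" >&2
        return 1
    fi
    if [ ! -d "$path" ]; then
        echo "ERROR: $name does not exist: $path" >&2
        return 1
    fi
    echo "OK: $name=$path"
}

# Demonstration with a directory that always exists. In the pipeline script the
# calls would look like: require_dir FASTQ_DIR "${FASTQ_DIR}" || exit 1
require_dir CURRENT_DIR "$PWD"
```

Adding `set -u` near the top of the script is a complementary option: it makes bash stop when an unset variable is expanded.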

### Reference files configuration

The next section contains the configuration of all the various reference files that the pipeline needs. This is also a section that only needs to be changed ONCE and can then be used for all pipeline runs.

```bash
#####
#
# REFERENCE FILES CONFIGURATION
#

# Mouse data - REQUIRED - Only these THREE files are required
# See the documentation on how to download these files
GRCM38_FASTA="${REF}/nf-core/GRCm38/fasta/default/Mus_musculus.GRCm38.dna.primary_assembly.fa"
GRCM38_GTF="${REF}/nf-core/GRCm38/ensembl_gtf/default/Mus_musculus.GRCm38.102.gtf.gz"
MEME_DB="${REF}/memesuite/motif_databases"

# Mouse data - OPTIONAL (If not provided, will be created at run time but with save_reference can be saved for next time)
GRCM38_GTF_DB="${REF}/nf-core/GRCm38/ensembl_gtf/default/gtf.db"
GRCM38_BOWTIE_IDX="${REF}/nf-core/GRCm38/fasta/default/bowtie_index"
GRCM38_RNAS_BOWTIE_IDX="${REF}/nf-core/GRCm38/fasta/default/Mus_musculus.GRCm38.dna.primary_assembly.RNAtranscripts_bowtie"
GRCM38_TRANSCRIPTS="${REF}/nf-core/GRCm38/fasta/default/Mus_musculus.GRCm38.dna.primary_assembly.ALLtranscripts.fa"
GRCM38_TRANSCRIPTS_BOWTIE_IDX="${REF}/nf-core/GRCm38/fasta/default/Mus_musculus.GRCm38.dna.primary_assembly.ALLtranscripts_bowtie"
# For RNASEQ
GRCM38_STAR_INDEX="${REF}/nf-core/GRCm38/fasta/default/star"
GRCM38_SALMON_INDEX="${REF}/nf-core/GRCm38/fasta/default/salmon"

#*****************************
#* END OF MAIN CONFIGURATION *
#*****************************
```

### Pipeline configuration

The next section is the pipeline configuration, and this is likely an area that needs to be configured for **each run of the pipelines**.

There are six main configuration settings to review:

1. OUTDIR_TRIMBA      : The results directory for the `trimba` pipeline
1. OUTDIR_RNASEQ      : The results directory for the `nf-core/rnaseq` pipeline.
1. SAMPLESHEET_TRIMBA : The file containing the input files for the `trimba` pipeline.
1. SAMPLESHEET_RNASEQ : The file containing the input files for the `nf-core/rnaseq` pipeline.
1. GENES_OF_INTEREST  : The genes of interest for the `trimba` pipeline.
1. MOTIFS_FILE        : The motifs of interest for the `trimba` pipeline.

```bash
#####
#
# OUTDIR CONFIGURATION - Edit this
#
# The OUTDIR determines where all the inputs and outputs are stored.
OUTDIR_TRIMBA="${DATA_DIR}/WF/trimba_demo"
OUTDIR_RNASEQ="${DATA_DIR}/WF/rnaseq_demo"

#####
#
# FILE SETUP - Do not edit
#
# This section just defines some more variables and should not be edited
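# Create both output directories; the RNASEQ sample sheet is written into OUTDIR_RNASEQ below
mkdir -p "${OUTDIR_TRIMBA}" "${OUTDIR_RNASEQ}" && cd "${OUTDIR_TRIMBA}"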
SAMPLESHEET_TRIMBA="${OUTDIR_TRIMBA}/sample_sheet.csv"
SAMPLESHEET_RNASEQ="${OUTDIR_RNASEQ}/sample_sheet.csv"
GENES_OF_INTEREST="${OUTDIR_TRIMBA}/genes_of_interest.txt"
MOTIFS_FILE="${OUTDIR_TRIMBA}/motif_file.txt"
PARAMS_TRIMBA="${OUTDIR_TRIMBA}/nf-params.yml"
CONFIG_TRIMBA="${OUTDIR_TRIMBA}/nextflow.config"
PARAMS_RNASEQ="${OUTDIR_RNASEQ}/nf-params.yml"
CONFIG_RNASEQ="${OUTDIR_RNASEQ}/nextflow.config"

#####
#
# TRIMBA_SAMPLESHEET - Edit this
#
# This is the main input file for the `trimba` pipeline
# If no riboseq data is available, only provide the header line.
cat > "${SAMPLESHEET_TRIMBA}" <<EOFSST
sample,fastq_1,fastq_2,strandedness
Riboseq_Control,"${FASTQ_DIR}/RIBO_DEMO/Ribo-Seq_of_MEF_Cell_Control.fastq.gz",,unstranded
Riboseq_Starvation,"${FASTQ_DIR}/RIBO_DEMO/Ribo-Seq_of_MEF_Cell_Amino_Acid_Starvation.fastq.gz",,unstranded
QTIseq_Control,"${FASTQ_DIR}/RIBO_DEMO/QTI-Seq_of_MEF_Cell_Control.fastq.gz",,unstranded
QTIseq_Starvation,"${FASTQ_DIR}/RIBO_DEMO/QTI-Seq_of_MEF_Cell_Amino_Acid_Starvation.fastq.gz",,unstranded
EOFSST

#####
#
# RNASEQ_SAMPLESHEET - Edit this
#
# This contains the input files for the `nf-core/rnaseq` pipeline.
# You should provide any original RNASEQ data and should also use the data in the `bowtie_rna` directory
# The latter will be created by the `trimba` pipeline, but you can anticipate what the name will be
cat > "${SAMPLESHEET_RNASEQ}" <<EOFSSR
sample,fastq_1,fastq_2,strandedness
RNASeq_Control,"${FASTQ_DIR}/RIBO_DEMO/RNA-Seq_of_MEF_Cell_Control.fastq.gz",,auto
RNASeq_Starvation,"${FASTQ_DIR}/RIBO_DEMO/RNA-Seq_of_MEF_Cell_Amino_Acid_Starvation.fastq.gz",,auto
Riboseq_Control,"${OUTDIR_TRIMBA}/bowtie_rna/Riboseq_Control.unmapped.fastq.gz",,auto
Riboseq_Starvation,"${OUTDIR_TRIMBA}/bowtie_rna/Riboseq_Starvation.unmapped.fastq.gz",,auto
QTIseq_Control,"${OUTDIR_TRIMBA}/bowtie_rna/QTIseq_Control.unmapped.fastq.gz",,auto
QTIseq_Starvation,"${OUTDIR_TRIMBA}/bowtie_rna/QTIseq_Starvation.unmapped.fastq.gz",,auto
EOFSSR

#####
#
# GENES_OF_INTEREST - Edit this
#
# This contains the genes of interest for this `trimba` run
cat > "${GENES_OF_INTEREST}" <<EOFGOI
Brca1
Zfand2a
Tcea1
Gm7341
EOFGOI

#####
#
# MOTIFS_FILE - Edit this
#
# This contains the list of MOTIF files to be used in the known-motif part of the `trimba` pipeline
# Note that you cannot mix DNA and RNA motif files, but for all RNA files there are DNA "equivalents"
cat > "${MOTIFS_FILE}" <<EOFMTFS
${MEME_DB}/CISBP-RNA/Mus_musculus.dna_encoded.meme
${MEME_DB}/CISBP-RNA/Homo_sapiens.dna_encoded.meme
${MEME_DB}/RNA/Ray2013_rbp_Mus_musculus.dna_encoded.meme
${MEME_DB}/RNA/Ray2013_rbp_Homo_sapiens.dna_encoded.meme
${MEME_DB}/MOUSE/HOCOMOCOv11_core_MOUSE_mono_meme_format.meme
${MEME_DB}/MOUSE/HOCOMOCOv11_full_MOUSE_mono_meme_format.meme
${MEME_DB}/MOUSE/HOCOMOCOv10_MOUSE_mono_meme_format.meme
${MEME_DB}/MOUSE/chen2008.meme
${MEME_DB}/HUMAN/HOCOMOCOv11_core_HUMAN_mono_meme_format.meme
${MEME_DB}/HUMAN/HOCOMOCOv11_full_HUMAN_mono_meme_format.meme
${MEME_DB}/HUMAN/HOCOMOCOv10_HUMAN_mono_meme_format.meme
${MEME_DB}/HUMAN/HOCOMOCOv9.meme
${MEME_DB}/MIRBASE/22/Homo_sapiens_hsa.dna_encoded.meme
${MEME_DB}/MIRBASE/22/Mus_musculus_mmu.dna_encoded.meme
EOFMTFS
```
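
Before starting a long run, it can be worth confirming that every FASTQ file listed in the `trimba` sample sheet exists. The following is a rough sketch with a hypothetical helper, `missing_fastqs`. It assumes the simple comma-separated layout shown above and does not handle paths that contain commas. The `bowtie_rna` entries in the RNASEQ sample sheet are expected to be missing until `trimba` has run.

```bash
# Print every fastq_1 path in a sample sheet that does not exist on disk.
missing_fastqs() {
    local sheet="$1" sample fastq_1 rest
    # Skip the header line, then split each row on commas.
    tail -n +2 "$sheet" | while IFS=, read -r sample fastq_1 rest; do
        # Strip the optional surrounding double quotes.
        fastq_1=${fastq_1#\"}
        fastq_1=${fastq_1%\"}
        if [ -n "$fastq_1" ] && [ ! -f "$fastq_1" ]; then
            echo "$sample: $fastq_1"
        fi
    done
}

# Demonstration with a temporary sample sheet: one file exists, one does not.
demo_dir=$(mktemp -d)
touch "${demo_dir}/present.fastq.gz"
cat > "${demo_dir}/sample_sheet.csv" <<EOF
sample,fastq_1,fastq_2,strandedness
A,"${demo_dir}/present.fastq.gz",,unstranded
B,"${demo_dir}/absent.fastq.gz",,unstranded
EOF
missing_fastqs "${demo_dir}/sample_sheet.csv"
```

In this demonstration only sample `B` is reported, because its file was never created.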

### Pipeline parameters - Review and edit this (Occasionally)

The next set of configurations change the behaviour of the tools in the pipeline. Review the settings and determine if they are still appropriate for the pipeline run, but it is unlikely they need changing between pipeline runs.

### `trimba` Pipeline parameter definitions

|Parameter|Description|Notes|
|-|-|-|
|**input**|Path to the SAMPLESHEET|**Required**|
|**outdir**|Path to the folder where the results will be stored|**Required**|
|**fasta**|Path to the Genome FASTA file|**Required**|
|**gtf**|Path to the Genome GTF file|**Required**|
|**gtfdb**|Path to the database file for the GTF file|Optional. If `save_reference` is used, can be found in **\[WFDIR\]/genome/\[GENOME\].gtf.db**|
|**rna_index**|Path to the `bowtie` index for the RNA sequences|Optional. If `save_reference` is used, can be found in **\[WFDIR\]/genome/RNA/bowtie/**|
|**bowtie_index**|Path to the `bowtie` index for the Genome sequences|Optional. If `save_reference` is used, can be found in **\[WFDIR\]/genome/GENOME/bowtie**|
|**transcripts_index**|Path to the `bowtie` index for the Transcriptome sequences.|Optional. If `save_reference` is used, can be found in **\[WFDIR\]/genome/TRANSCRIPTOME/bowtie**|
|||
|**genes_file**|Path to the file of genes of interest|Optional. Use gene symbols, one per line|
|**motifs_file**|Path to the file of MEME database files|Optional. If not provided, no _known_ motif search will be done, but still will find _novel_ motifs|
|||
|**skip_trimming**|Boolean, indicating if trimming should be skipped|Note: Booleans should have the value `true` or `false`, case sensitive|
|**skip_riboseq**|Boolean, indicating if Riboseq data is provided| Use this, if only providing list of genes. Note: Booleans are case sensitive, `true` or `false`|
|**trimmer**|Set this to the trimming program to use|One of `cutadapt`, `trimgalore` or `fastp`. Default is `cutadapt`|
|**extra_cutadapt_args**|Extra parameters to give to `cutadapt`||
|**rnas_to_filter**|List of GTF feature types to use to filter the reads on|Default "scRNA,3prime_overlapping_ncRNA,miRNA,snRNA,macro_lncRNA, sRNA,lincRNA,Mt_rRNA,scaRNA,snoRNA, rRNA,Mt_tRNA,bidirectional_promoter_lncRNA,misc_RNA"|
|**nmotifs**|Number of _novel_ motifs to find in the 5'-UTR|Default = 1|
|**extra_align_rna_args**|Extra arguments to provide to `bowtie` when aligning reads to _RNA_ for removal||
|**extra_align_genome_args**|Extra arguments to provide to `bowtie` when aligning reads to the _genome_|Usually should be '-m 1'|
|**extra_align_transcriptome_args**|Extra arguments to provide to `bowtie` when aligning reads to the _transcriptome_|Usually should be '-m 1'|
|**extra_ribotish_qual_args**|Extra arguments to provide to `ribotish quality` tool||
|**extra_ribotish_args**|Extra arguments to provide to `ribotish quality` tool||


### `nf-core/rnaseq` Pipeline parameter definitions

These are _some_ of the parameters for the nf-core/rnaseq pipeline that are important. For a full list, see: https://nf-co.re/rnaseq/3.14.0/parameters

|Parameter|Description|Notes|
|-|-|-|
|**input**|Path to the SAMPLESHEET|**Required**|
|**outdir**|Path to the folder where the results will be stored|**Required**|
|**fasta**|Path to the Genome FASTA file|**Required**|
|**gtf**|Path to the Genome GTF file|**Required**|
|**salmon_index**|Path to the `salmon` index for the transcript sequences|Optional. If `save_reference` is used, can be found in **\[WFDIR\]/genome/index/salmon/**|
|**pseudo_aligner**|String indicating which pseudo aligner to use|Default is "salmon"|


```bash
#####
#
# TRIMBA Pipeline parameters - Review and edit this (occasionally)
#
# These are the settings for the tools in the pipeline
# Review these, but you are unlikely to need to change them between pipeline runs
cat > "${PARAMS_TRIMBA}" <<EOFPARAMST
input:                          "$SAMPLESHEET_TRIMBA"
outdir:                         "$OUTDIR_TRIMBA"
fasta:                          "$GRCM38_FASTA"
gtf:                            "$GRCM38_GTF"
gtfdb:                          "$GRCM38_GTF_DB"
rna_index:                      "$GRCM38_RNAS_BOWTIE_IDX"
bowtie_index:                   "$GRCM38_BOWTIE_IDX"
transcripts_index:              "$GRCM38_TRANSCRIPTS_BOWTIE_IDX"

genes_file:                     "$GENES_OF_INTEREST"
motifs_file:                    "$MOTIFS_FILE"

skip_trimming:                  true
skip_riboseq:                   false
trimmer:                        "cutadapt"
extra_cutadapt_args:            '--trim-n --match-read-wildcards -u 16 -n 4 -a AGATCGGAAGAGCACACGTCTG -a AAAAAAAA -a GAACTCCAGTCAC -e 0.2 --nextseq-trim 20 -m 17 -M 34'
rnas_to_filter:                 "scRNA,3prime_overlapping_ncRNA,miRNA,snRNA,macro_lncRNA,sRNA,lincRNA,Mt_rRNA,scaRNA,snoRNA,rRNA,Mt_tRNA,bidirectional_promoter_lncRNA,misc_RNA"
nmotifs:                        1
extra_align_rna_args:           ''
extra_align_genome_args:        '-m 1'
extra_align_transcriptome_args: '-m 1'
extra_ribotish_qual_args:       ''
extra_ribotish_args:            ''

save_reference:                 false
EOFPARAMST

#####
#
# RNASEQ Pipeline parameters - Review and edit this (occasionally)
#
# These are the settings for the tools in the pipeline
# Review these, but you are unlikely to need to change them between pipeline runs
cat > "${PARAMS_RNASEQ}" <<EOFPARAMSR
input:                 "$SAMPLESHEET_RNASEQ"
outdir:                "$OUTDIR_RNASEQ"
fasta:                 "${GRCM38_FASTA}"
gtf:                   "${GRCM38_GTF}"
salmon_index:          "${GRCM38_SALMON_INDEX}"
pseudo_aligner:        "salmon"
EOFPARAMSR

#####
#
# RNASEQ Extra options - Review and edit this (occasionally)
#
# The RNASEQ pipeline can be configured with MANY MORE options; see the main nf-core/rnaseq website for details
# https://nf-co.re/rnaseq/3.14.0
# To run the most minimal (and fastest) RNASEQ pipeline, you can skip a lot of the steps by providing these SKIP options
FAST="--skip_gtf_filter --skip_gtf_transcript_filter --skip_bbsplit --skip_umi_extract --skip_trimming --skip_alignment \
      --skip_markduplicates --skip_bigwig --skip_stringtie --skip_fastqc --skip_preseq --skip_dupradar \
      --skip_qualimap --skip_rseqc --skip_biotype_qc --skip_deseq2_qc --skip_multiqc --skip_qc"
      
#************************************************************************************************************************
#* END OF CONFIGURATION                                                                                                 *
#************************************************************************************************************************

#####
#
# RUN TRIMBA pipeline
#
cd $OUTDIR_TRIMBA
$NEXTFLOW run $TRIMBA -profile docker -resume \
  -params-file "${PARAMS_TRIMBA}"
  
#####
#
# RUN RNASEQ pipeline
#
cd $OUTDIR_RNASEQ
$NEXTFLOW run nf-core/rnaseq -revision 3.14.0 -profile docker -resume \
  -params-file "${PARAMS_RNASEQ}" \
  $FAST
```
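
The parameter tables mark `input`, `outdir`, `fasta` and `gtf` as required. A quick check of a generated params file for those keys might look like the sketch below. The helper `missing_params` is hypothetical and only looks for top-level `key:` lines; it is not a real YAML parser.

```bash
# Print each required key that has no top-level "key:" line in the params file.
missing_params() {
    local file="$1" key
    shift
    for key in "$@"; do
        if ! grep -Eq "^${key}:" "$file"; then
            echo "$key"
        fi
    done
}

# Demonstration with a temporary params file that lacks the gtf entry.
params_file=$(mktemp)
cat > "$params_file" <<EOF
input:  "sample_sheet.csv"
outdir: "results"
fasta:  "genome.fa"
EOF
missing_params "$params_file" input outdir fasta gtf
```

In the pipeline script the equivalent call would be `missing_params "${PARAMS_TRIMBA}" input outdir fasta gtf`, placed after the params file has been written.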

# Run the `trimba` and `nf-core/rnaseq` pipelines

Once the script has been edited and reviewed, you can run it and it will execute the two pipelines one after the other, since `nf-core/rnaseq` requires the output of `trimba` to be available.
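
The script as shown runs the two `nextflow` commands unconditionally, so `nf-core/rnaseq` would still start if `trimba` failed. One way to make the dependency explicit is sketched below. The functions `run_trimba` and `run_rnaseq` are placeholders standing in for the two `nextflow run` commands:

```bash
# Run the second step only if the first one succeeds.
run_pipelines() {
    local first="$1" second="$2"
    if "$first"; then
        "$second"
    else
        echo "First pipeline failed; skipping the second one" >&2
        return 1
    fi
}

# Placeholders: in the real script these would wrap the two nextflow commands.
run_trimba() { echo "trimba placeholder"; }
run_rnaseq() { echo "rnaseq placeholder"; }

run_pipelines run_trimba run_rnaseq
```

Adding `set -e` near the top of the script has a similar effect, since bash then stops at the first command that returns a non-zero exit status.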

You can run it in the foreground if you want, but running it in the background is safer, since it continues to run even if you close the terminal (although I am not sure how this works on WSL2 Ubuntu on Windows).

#### Run in the background
```bash
nohup ./run_trimba_and_rnaseq.sh > pipe.out 2> pipe.err &
```
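
To illustrate how a background job started this way can be tracked, the sketch below uses a short stand-in command instead of the pipeline script. It records the process ID from `$!`, checks that the job is alive, and then waits for it. The temporary log directory is only for the demonstration.

```bash
# Start a short stand-in job in the background, with output redirected as above.
log_dir=$(mktemp -d)
nohup sh -c 'echo started; sleep 1; echo finished' \
    > "${log_dir}/pipe.out" 2> "${log_dir}/pipe.err" &
pid=$!

# kill -0 sends no signal; it only tests whether the process still exists.
if kill -0 "$pid" 2>/dev/null; then
    echo "Job $pid is running"
fi

# Block until the job ends, then show what it wrote to stdout.
wait "$pid"
cat "${log_dir}/pipe.out"
```

For the real pipeline run, `tail -f pipe.out` follows the output while the job is still running.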

#### Run in the foreground

```bash
./run_trimba_and_rnaseq.sh
```

All the log files are in the output directory. They are hidden, so you need to use `ls -al` to see them.

# Running `trimba` and `nf-core/rnaseq` for the first time

The first time you run the pipelines, you can have them create the indices for the RNA and DNA databases and the various other indices they require.

I created a separate script for this called `first_run_trimba_and_rnaseq.sh`.

You still need to review the top part to ensure the various directories are appropriate for you, but then you can simply run it. It will create the directories `trimba_firstrun` and `rnaseq_firstrun`, from which you can extract the various indices and place them in the appropriate location on your Ubuntu server.

**See the table above with the location of the various indices**

It will run `trimba` without any Riboseq data and will run the demo for `nf-core/rnaseq`.

```bash
./first_run_trimba_and_rnaseq.sh
```