# What to expect

In this notebook, we will run some of the initial quality filtering and mapping steps with downsampled data from our example *Schistosoma* dataset. If you need to revisit the presentation where we introduced RNA-Seq and the example dataset, you can find it on Learn, in the "Workshop 1" folder. You will also find the original paper where this dataset was published, and some review articles about RNA-Seq analysis methods. These are not compulsory reading, but are worth a look.

In places we will provide the code to run an analysis step first, and then describe what it is doing. Some of these steps may take a few minutes, and this time can be spent reading and understanding the process as it runs.

## The command line

Whilst the first- and second-year courses focused on teaching coding in Python, another key skill in biology is running specialized existing software. Some of these tools can be installed as Python modules, but many real-world tools are run "on the command line". This means that they run like an application or program, but the user types commands in a "shell" or "terminal" instead of clicking/swiping in an interactive window. Within the notebook environment, these commands can be run in three ways:

* adding `%%bash` to the top of a cell
* adding a `!` to the start of the command in a cell
* (sometimes) leaving the command as it is: if it can only be interpreted as bash, Jupyter may run it without being told

There is an excellent short primer on the command line [here](https://cyverse-leptin-rna-seq-lesson-dev.readthedocs-hosted.com/en/latest/section-3.html), which describes the most common commands.

During this (and subsequent) workshops, we will combine code that is written on the command line (for which the full command will be provided) and questions which will require you to use the python coding you have already learnt.
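
Command-line tools can also be launched from Python itself, which is handy when you want to capture their output in a variable. The cell below is a minimal sketch using the standard library `subprocess` module. It is not one of the approaches listed above, and `echo` is only a stand-in for a real tool.

```python
import subprocess

# Run a command-line program from Python and capture what it prints.
# `echo` is only a placeholder here; any installed tool could be used instead.
result = subprocess.run(
    ["echo", "hello from the shell"],
    capture_output=True,  # collect stdout/stderr instead of printing them
    text=True,            # decode the output bytes to str
    check=True,           # raise an error if the command fails
)
output = result.stdout.strip()
print(output)
```

In this workshop the `!` and `%%bash` forms are enough, but this pattern can be useful when a later Python step needs the output of a command.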

# The raw data

The data is stored in `data/Schistosoma_mansoni`. Here we find four elements:

1. `README` file - contains basic information about the data in this folder.
2. `list_ids.txt` file - contains the accession ids of the read files in this dataset
3. `reference` folder - contains the reference genome and transcriptome
4. `subsampled` folder - contains the raw data file for the sample of sequences we have taken for this workshop

Let's see what we have in this example dataset.

In [None]:
# Print the list of ids for this example dataset
! cat data/Schistosoma_mansoni/list_ids.txt

<div class="alert alert-block alert-info">
In the code above:

`cat` - display the contents of a file


<div class="alert alert-block alert-warning">

Questions:
1. The raw data files look like this `<accession>_<1|2>.fastq.gz`. What does the ".gz" in the file name mean?
2. Pick one of the files and open it. What does it look like? How is the [FASTQ format](https://en.wikipedia.org/wiki/FASTQ_format) defined?

<details>
<summary><i>Hint</i></summary>

1. ".gz" is what is known as a file extension.
2. In a code cell, run
   ```gunzip --keep data/Schistosoma_mansoni/subsampled/ERR022872_1.fastq.gz```
   to get an uncompressed copy of `ERR022872_1.fastq`. You should then be able to double click on this file in the file browser on the left.

</details>


In [None]:
! gunzip --keep data/Schistosoma_mansoni/subsampled/ERR022872_1.fastq.gz
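
A compressed file can also be read directly in Python without first creating an uncompressed copy. The sketch below uses the standard library `gzip` module on a tiny, made-up FASTQ record that it writes to a temporary file (the file name and the read are illustrative, not part of the workshop data).

```python
import gzip
import os
import tempfile

# Write a tiny, made-up FASTQ record to a gzipped file so the example is self-contained.
example_record = "@read1\nACGT\n+\nIIII\n"
tmp_path = os.path.join(tempfile.mkdtemp(), "example.fastq.gz")
with gzip.open(tmp_path, "wt") as handle:
    handle.write(example_record)

# Read it back as text without creating an uncompressed copy on disk.
with gzip.open(tmp_path, "rt") as handle:
    lines = [line.rstrip("\n") for line in handle]

# Each FASTQ record spans four lines: header, sequence, separator, qualities.
header, sequence, separator, qualities = lines[:4]
print(header, sequence, separator, qualities)
```

The same `gzip.open(path, "rt")` call should work on the real `.fastq.gz` files in the `subsampled` folder.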

# Quality Control

As you will have seen in your investigations to answer question 2, sequencing DNA is not error-free: each base of the sequence is read with a degree of uncertainty and error. Each sequencing machine/method results in a different error profile. To improve our analysis, we first want to filter out the lowest quality reads. One tool commonly used to profile the amount of error is [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

In [None]:
# Let's start by installing the program
! conda install --yes --quiet bioconda::fastqc

<div class="alert alert-block alert-info">

In the code above:

`conda` - Package manager. We use it to install software

`--yes` - confirm that we want to install all the dependencies

`--quiet` - do not show extra output

In [None]:
# Create the output directory
! mkdir -p analysis/Schistosoma_mansoni/qc/

<div class="alert alert-block alert-info">

In the code above:

`mkdir` - command to create a new folder

`-p` - flag to create nested folders

Now let's run FastQC on each file.

In [None]:
%%bash
for accession in $(cat data/Schistosoma_mansoni/list_ids.txt)
do 
    fastqc data/Schistosoma_mansoni/subsampled/$accession*.fastq.gz --noextract -o analysis/Schistosoma_mansoni/qc
done


<div class="alert alert-block alert-info">

In the code above:

`for X in Y; do ... done` - this is the structure of a loop in bash; we are asking that for each X (in this case each accession) in Y (in this case the list of ids), the program does something (in this case runs fastqc)

`--noextract` - Do not uncompress the output file after creating it

`-o` - create all output files in the output directory specified next
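
For comparison with the Python you already know, the bash loop above can be sketched with `pathlib`. The directory, the accession names and the empty files below are made up so that the cell runs on its own; only the looping and wildcard logic mirrors the bash version.

```python
import tempfile
from pathlib import Path

# Build a throwaway directory that mimics the layout used above.
# The accession names below are made up for illustration.
workdir = Path(tempfile.mkdtemp())
(workdir / "list_ids.txt").write_text("SAMPLE_A\nSAMPLE_B\n")
for name in ["SAMPLE_A_1", "SAMPLE_A_2", "SAMPLE_B_1", "SAMPLE_B_2"]:
    (workdir / f"{name}.fastq.gz").touch()

# Equivalent of: for accession in $(cat list_ids.txt); do ... $accession*.fastq.gz ...; done
matched = {}
for accession in (workdir / "list_ids.txt").read_text().split():
    # The * wildcard expands to every file whose name starts with the accession
    files = sorted(p.name for p in workdir.glob(f"{accession}*.fastq.gz"))
    matched[accession] = files
    print(accession, files)
```

Each accession matches two files, one per read direction, which is why a single `fastqc` call in the loop processes both.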

<div class="alert alert-block alert-warning">

Open the output directory. You will see that two new files have been created for each read file. Open one of the html files and have a look. [Here](https://hbctraining.github.io/Training-modules/planning_successful_rnaseq/lessons/QC_raw_data.html) you can find guidance on what each graph means.

Questions:

3. Is there a pattern to where the errors occur in these reads?
4. Are there any overrepresented sequences? What are they?

In your investigations of the FastQC reports you will have seen that scores are used to indicate how likely it is that a base reported in a sequencing read is in error. This is the [Phred score](https://en.wikipedia.org/wiki/Phred_quality_score).
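
The Phred score Q is related to the error probability P by Q = -10 log10(P), so Q = 20 means a 1 in 100 chance that the base is wrong. The sketch below converts between the two and decodes a quality string. It assumes the Phred+33 (Sanger) encoding, in which each quality character's ASCII code minus 33 gives the score; the function names are our own.

```python
import math

def phred_to_error_probability(q):
    """Convert a Phred score Q to the probability that the base call is wrong."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p):
    """Convert an error probability P to a Phred score, Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def decode_quality_string(quality, offset=33):
    """Decode a FASTQ quality string, assuming the Phred+33 (Sanger) encoding."""
    return [ord(char) - offset for char in quality]

print(phred_to_error_probability(20))
print(error_probability_to_phred(0.001))
print(decode_quality_string("!5I"))
```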

Let's have a look at the Phred scores for one of our reads. To do this, we will use [biopython](https://biopython.org/). This is a python library which supports lots of different file formats used for biological data, providing easy access to the information.

In [None]:
# Let's start by installing biopython
! pip install biopython

For each read in a file, biopython creates a [record](https://biopython.org/wiki/SeqRecord) object. Take a look at what a record looks like when you print it using `print(record)`. Explore what information you can access from the read using dot notation (for example `record.id`).

<div class="alert alert-block alert-warning">

5. Choose one of the fastq files and, using the example Python code below as a starting point, find the highest and lowest quality scores in your chosen file.

<details>
<summary><i>Hint</i></summary>
    
Create variables to represent the maximum and minimum value, and for each read, update these variables using `record.letter_annotations["phred_quality"]`.
You don't need to keep the print statements. These are for exploring the data.

</details>


In [None]:
# the code below uses biopython to load in a fastq file and print the first read
from Bio import SeqIO

for record in SeqIO.parse("data/Schistosoma_mansoni/subsampled/ERR022872_1.fastq", "fastq"):
    print(record)
    print(f"ID:{record.id}, Sequence:{record.seq}")
    print(record.letter_annotations["phred_quality"])
    break

<div class="alert alert-block alert-info">

In the code above:

`SeqIO.parse("file name","file format")` - is a function that reads the file and returns the reads one at a time, each as a record object

`break` - stops the loop after the first read; without it, the loop would print every single read in the file, which is not what we want just now


# Adaptor trimming

When we generate millions of reads in a sequencing experiment, we are able to average multiple observations at the same location. However, if portions of the reads are low quality, these may still affect the average, so we want to remove these regions. As you will have seen in the previous section, the adaptor sequences used to generate the DNA library may also be over-represented in the reads and could contaminate the results.

The next step in an RNA-Seq analysis is therefore to trim adaptor sequences and poor quality regions. [TrimGalore](https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md) is a popular tool for trimming sequencing reads.
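
To make the idea of quality trimming concrete, here is a minimal sketch of the running-sum approach described in the Cutadapt documentation (TrimGalore wraps Cutadapt to do the trimming). It is an illustration only, not TrimGalore's actual code: the function name, the sequence and the quality values are made up, and the default cutoff of 20 is assumed to match TrimGalore's default.

```python
def quality_trim_index(qualities, cutoff=20):
    """Return the position at which to cut a read so its low-quality 3' end is removed."""
    running_sum = 0
    best_sum = 0
    cut_at = len(qualities)  # default: keep the whole read
    # Walk from the 3' end towards the 5' end
    for i in reversed(range(len(qualities))):
        running_sum += cutoff - qualities[i]
        if running_sum < 0:
            break  # good bases now outweigh the bad ones; stop looking
        if running_sum > best_sum:
            best_sum = running_sum
            cut_at = i
    return cut_at

sequence = "ACGTACGTAC"
qualities = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
cut = quality_trim_index(qualities, cutoff=10)
trimmed = sequence[:cut]
print(trimmed)
```

Note that the base with quality 11 is removed even though it is above the cutoff of 10, because it is surrounded by poor quality bases.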

In [None]:
# Install the software
! conda install --yes --quiet bioconda::trim-galore conda-forge::pigz

In [None]:
%%bash
for accession in $(cat data/Schistosoma_mansoni/list_ids.txt)
do
    trim_galore \
      data/Schistosoma_mansoni/subsampled/$accession*.fastq.gz \
      --paired \
      --output_dir analysis/Schistosoma_mansoni/qc/ \
      --basename $accession \
      --fastqc
done

<div class="alert alert-block alert-warning">

Question:

6. Using the TrimGalore documentation, find out what the following elements of the code above mean:

`--paired`

`--fastqc`

`--basename`

7. Choose a read file and compare the FastQC report before trimming to the FastQC report after adaptor trimming. How has the data quality improved after trimming? Are there any remaining warnings?

# Mapping to the reference

Now that we are happy with the quality of the reads and have removed adapters, we can map our sequences to the genome. This allows us to identify which genes each read came from, and which genes were <i>expressed</i> in our samples in the form of transcripts. 

<figure>
    <img src="https://www.annualreviews.org/docserver/ahah/fulltext/biodatasci/2/1/bd020139.f4_thmb.gif">
</figure>

In some organisms, including *Plasmodium*, expressed transcripts may be generated by splicing together non-contiguous exons from the genome (others such as *Trypanosoma* do not). To handle this, we can either use a splice-aware mapper to align reads across splice junctions, or we can map directly against panels of known transcripts. In this example we are going to use [STAR](https://academic.oup.com/bioinformatics/article/29/1/15/272537) to perform splice-aware alignment to the reference genome fasta. The manual can be found [here](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf).

In [None]:
! mamba install --yes -c bioconda star

In [None]:
%%bash
mkdir -p analysis/Schistosoma_mansoni/star/ref
gunzip data/Schistosoma_mansoni/reference/schistosoma_mansoni.PRJEA36577.WBPS19.genomic.fa.gz
gunzip data/Schistosoma_mansoni/reference/schistosoma_mansoni.PRJEA36577.WBPS19.annotations.gff3.gz

# first we need to index the reference
STAR --runThreadN 4 \
  --runMode genomeGenerate \
  --genomeDir analysis/Schistosoma_mansoni/star/ref \
  --genomeFastaFiles data/Schistosoma_mansoni/reference/schistosoma_mansoni.PRJEA36577.WBPS19.genomic.fa \
  --sjdbGTFfile data/Schistosoma_mansoni/reference/schistosoma_mansoni.PRJEA36577.WBPS19.annotations.gff3 \
  --sjdbOverhang 75 \
  --genomeSAindexNbases 13

<div class="alert alert-block alert-info">

In the code above:

`--runMode genomeGenerate` - directs STAR to run genome indexing

`--genomeDir /path/to/genomeDir` - specifies where to store the index

`--genomeFastaFiles /path/to/genome/fasta` - provides the reference genome

`--sjdbGTFfile /path/to/annotations.gtf` - provides the coordinates of splice junctions in the reference genome

`--sjdbOverhang ReadLength-1` - this specifies the length of sequences to include in the splice junctions database
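
Two of the numbers in the indexing command depend on the data. `--sjdbOverhang` is the read length minus one, and for genomes smaller than the human genome the STAR manual suggests scaling `--genomeSAindexNbases` down to min(14, log2(GenomeLength)/2 - 1). The sketch below works these out. The read length of 76 is inferred from the value 75 used above, the genome length of 400 million bases is a rough illustrative figure rather than one measured from the reference file, and the function names are our own.

```python
import math

def sjdb_overhang(read_length):
    """Value for --sjdbOverhang: one less than the read length."""
    return read_length - 1

def genome_sa_index_nbases(genome_length):
    """Suggested --genomeSAindexNbases for a genome of the given length in bases."""
    return min(14, round(math.log2(genome_length) / 2 - 1))

print(sjdb_overhang(76))
print(genome_sa_index_nbases(400_000_000))
```

To use the real genome length, you could sum the sequence lengths in the reference fasta with biopython and pass the total to `genome_sa_index_nbases`.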

We can now use this index to align each pair of read files against the reference.

In [None]:
%%bash
for accession in $(cat data/Schistosoma_mansoni/list_ids.txt)
do
    mkdir -p analysis/Schistosoma_mansoni/star/$accession
    STAR \
      --genomeDir analysis/Schistosoma_mansoni/star/ref \
      --runThreadN 4 \
      --readFilesIn <(gunzip -c analysis/Schistosoma_mansoni/qc/${accession}_val_1.fq.gz) <(gunzip -c analysis/Schistosoma_mansoni/qc/${accession}_val_2.fq.gz) \
      --outFileNamePrefix analysis/Schistosoma_mansoni/star/$accession/$accession \
      --outSAMtype BAM SortedByCoordinate \
      --outSAMattributes Standard \
      --quantMode TranscriptomeSAM GeneCounts
done

<div class="alert alert-block alert-info">

In the code above:

`<(gunzip -c reads.fq.gz)` - this uncompresses the read sequences on the fly to input to STAR, which does not read compressed files by default

`--outSAMtype BAM SortedByCoordinate` - write the alignments as a compressed BAM file, sorted by position in the genome

`--outSAMattributes Standard` - include the standard set of alignment attributes (NH, HI, AS and nM) in the output file

<div class="alert alert-block alert-warning">

Question:

8. Using the STAR manual, what outputs are generated using the flags `--quantMode TranscriptomeSAM GeneCounts`?

9. Find a Python library which can load a SAM file.

We will be running STAR on each of the full datasets and will make the mapped read files available for the next class.

# Extension

There are also several methods for transcript abundance quantification using *pseudo-alignment*. These methods don't fully line up reads against the reference genome or transcript sequences, but instead count the occurrences of substrings of these transcripts and use these counts to estimate transcript abundances.
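
The idea of matching substrings rather than aligning whole reads can be illustrated with a toy example. The sketch below indexes every substring of length k (a "k-mer") in two made-up transcripts, then finds which transcripts are compatible with a read by intersecting the transcript sets of its k-mers. It is a simplified illustration under our own assumptions, not the algorithm any particular tool implements, and all names and sequences are invented.

```python
def kmers(sequence, k):
    """Return the set of all substrings of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def build_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = {}
    for name, sequence in transcripts.items():
        for kmer in kmers(sequence, k):
            index.setdefault(kmer, set()).add(name)
    return index

def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of every k-mer in the read."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer, set())
        compatible = hits if compatible is None else compatible & hits
    return compatible or set()

# Made-up transcripts sharing a common start but differing at the end
transcripts = {
    "tx1": "ACGTACGGTTAGC",
    "tx2": "ACGTACGGTCCAT",
}
index = build_index(transcripts, k=5)

print(compatible_transcripts("ACGTACGG", index, k=5))  # read from the shared region
print(compatible_transcripts("GGTTAGC", index, k=5))   # read from the end of tx1
```

A read from the shared region is compatible with both transcripts, so a real tool has to decide statistically how to divide such reads between them.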

One example of this method is [Kallisto](https://www.nature.com/articles/nbt.3519)

In [None]:
! mamba install --yes --quiet kallisto=0.48

In [None]:
%%bash

mkdir -p analysis/Schistosoma_mansoni/kallisto/

# build the kallisto index from the reference transcript sequences
kallisto index --index=analysis/Schistosoma_mansoni/kallisto/smansoni data/Schistosoma_mansoni/reference/schistosoma_mansoni.PRJEA36577.WBPS19.mRNA_transcripts.fa.gz

for accession in $(cat data/Schistosoma_mansoni/list_ids.txt)
do
    # quantify the trimmed read pairs; each sample gets its own output folder
    # so that results are not overwritten
    kallisto quant --threads=2 \
      --index=analysis/Schistosoma_mansoni/kallisto/smansoni \
      --output-dir=analysis/Schistosoma_mansoni/kallisto/$accession \
      --gtf=data/Schistosoma_mansoni/reference/schistosoma_mansoni.PRJEA36577.WBPS19.annotations.gff3.gz \
      analysis/Schistosoma_mansoni/qc/"$accession"_val_1.fq.gz analysis/Schistosoma_mansoni/qc/"$accession"_val_2.fq.gz
done

<div class="alert alert-block alert-warning">

Question:

10. Give two differences between the methods used above by STAR and Kallisto to quantify transcript abundances.