# What to expect

In this notebook, we will run some of the initial quality filtering and mapping with downsampled data from the *Schistosoma* dataset. If you need to revisit the presentation where we introduced RNAseq and the example dataset, you can find it on Learn, in the "Workshop 1" folder. You will also find the original paper where this dataset was published, and some review articles about RNA-Seq analysis methods. These are not compulsory reading, but may be of interest and are worth a look.

In places we will provide the code to run an analysis step first, and then describe what it is doing. Some of these steps may take a few minutes, and this time can be spent reading and understanding the process as it runs.

## The command line

Whilst the first and second year courses focused on teaching coding in python, another key skill in biology is to run specialized existing software. Some of these can be installed as python modules, but many real-world tools are run "on the command line". This means that they run like an application or program, but the user types commands in a "shell" or "terminal" instead of clicking/swiping in an interactive window. Within the notebook environment, these commands can be run in 3 ways:

* adding `%%bash` to the top of a cell
* adding a `!` to the start of the command in a cell
* (sometimes) if the command can only be interpreted in bash, jupyter sometimes doesn't need to be told. There is a really excellent (short) primer to the command line which can be found [here](https://cyverse-leptin-rna-seq-lesson-dev.readthedocs-hosted.com/en/latest/section-3.html) and gives a description of the most common commands.

During this (and subsequent) workshops, we will combine code that is written on the command line (for which the full command will be provided) and questions which will require you to use the python coding you have already learnt.

In [6]:
# Let's start by installing dependancies we will use
! pip install biopython
! conda install --yes --quiet bioconda::fastqc
! conda install --yes --quiet bioconda::trim-galore conda-forge::pigz
! conda install --yes --quiet bioconda::star

done
Channels:
 - conda-forge
 - bioconda
 - defaults
 - anaconda
Platform: osx-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /Users/rmcolq/Work/apps/miniconda3/envs/pathbio3

  added / updated specs:
    - bioconda::fastqc


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    git-lfs-3.5.1              |       hecd8cb5_0         4.5 MB  anaconda
    scikit-learn-1.5.1         |  py310h207f725_0        10.6 MB  anaconda
    star-2.7.11b               |       hd576cc4_2         2.9 MB  bioconda
    ------------------------------------------------------------
                                           Total:        17.9 MB

The following packages will be UPDATED:

  ca-certificates                       2024.6.2-h8857fd0_0 --> 2024.7.4-h8857fd0_0 
  certifi                             2024

<div class="alert alert-block alert-info">

In the code above:

`conda` - Package manager. We use it to install software

`--yes` - confirm that we want to install all the dependencies

`--quiet` - do not show extra output

# The raw data

The data is stored in `data/Schistosoma_mansoni`. Here we find four elements:

1. `README` file - contains basic information about the data in this folder.
2. `list_ids` file - contains the ids for the reads in this dataset. Each id corresponds to one sample (eg Schistosomula 3h post infection, replicate 1)
3. `reference` folder - contains the reference genome and transcriptome
4. `subsampled` folder - contains the raw data file for the subset of sequences we have taken for this workshop

Let's see what we have in this examples dataset. 

In [2]:
# Print the list of ids for this example dataset
! cat data/Schistosoma_mansoni/list_ids.txt

ERR022872
ERR022873
ERR022874
ERR022875
ERR022876
ERR022877
ERR022878
ERR022879
ERR022880
ERR022881
ERR022882
ERR022883


<div class="alert alert-block alert-info">
In the code above:

`cat` - display the contents of a file


# FASTQ structure

<div class="alert alert-block alert-warning">

Questions:
1. The raw data files look like this `<accession>_<1|2>.fastq.gz`. What does the ".gz" in the file name mean?
2. Pick one of the files and open it. What does it look like? Define the [FASTQ format](https://en.wikipedia.org/wiki/FASTQ_format)?

<details>
<summary><i>Hint</i></summary>

1. ".gz" is what is known as a file extension.
2. To view the file you first need to uncompress it. In Data Exploration (week 1, class 2) you uncompressed ".gz" files using `gunzip`. This time you want to save (`keep`) the uncompressed version of the file too. Once you have created the uncompressed version, you can open by double clicking in the file browser on the left hand side, or can download it and open it with a text editor.

</details>

In [4]:
! gunzip --keep data/Schistosoma_mansoni/subsampled/ERR022872_1.fastq.gz

data/Schistosoma_mansoni/subsampled/ERR022872_1.fastq already exists -- do you wish to overwrite (y or n)? ^C


<div class="alert alert-block alert-success">
Answers:

1. These files had been compressed using an algorithm called gzip
2. This file contains read sequences. Each read is recorded in 4 lines:
   - Field 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
   - Field 2 is the raw sequence letters.
   - Field 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
   - Field 4 encodes the quality values for the sequence in Field 2, and must contain the same number of symbols as letters in the sequence.

For standard file formats for biological data (like FASTQ), there are often tools or librarys which make it easier to interract with them. These usually check that the file is formatted as expected. They also make it much easier to find information in the file without having to look up the exact structure of the file format yourself. A commonly used python module is [SeqIO](https://biopython.org/wiki/SeqIO) from the python library [biopython](https://biopython.org/).

In your investigations on the FASTQ format you will have seen that scores are used to indicate how likely it is that a base reported in a sequencing read is in error. This is the [Phred score](https://learn.gencore.bio.nyu.edu/ngs-file-formats/quality-scores/). Now let's use the biopython library to investigate the sequence qualities in this file. 

For each read in a fastq file, SeqIO creates a [record](https://biopython.org/wiki/SeqRecord) object, with information about the read. The following code prints the record and Phred score for the first sequence in our FASTQ file.

In [8]:
from Bio import SeqIO

for index, record in enumerate(SeqIO.parse("data/Schistosoma_mansoni/subsampled/ERR022872_1.fastq", "fastq")):
    print(record)
    print(record.letter_annotations["phred_quality"])
    break

ID: ERR022872.17
Name: ERR022872.17
Description: ERR022872.17 IL9_3012:1:1:2:473/1
Number of features: 0
Per letter annotation for: phred_quality
Seq('CNTCACCACAACCCAGCAGACCTTTACACTATGTATCTTCTTTAGNTNANANNA...AAC')
[24, 4, 23, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 28, 29, 31, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 33, 33, 32, 16, 4, 25, 4, 22, 4, 23, 4, 4, 24, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 33, 32, 32, 32, 32, 34, 32, 31, 31, 31, 29, 32]


<div class="alert alert-block alert-info">

In the code above:

`SeqIO.parse("file name","file format")` - is a function that reads the file and gives the desired output

`break` - is needed to stop the loop, otherwise it would do it for every single read in the file, which is not what we want just now


<div class="alert alert-block alert-warning">

3. Choose one of the FASTQ files and use a for loop to find out what the highest and lowest Phred scores are in your chosen file.

<details>
<summary><i>Hint</i></summary>

For each record you need to:
 - find out the maximum and minimum Phred score in the record
 - update the variables

</details>

In [9]:
#First define variables. All Phred scores will be bigger than 0 and less than 1000
highest_phred=0
lowest_phred=1000

#Write the loop
for record in SeqIO.parse("data/Schistosoma_mansoni/subsampled/ERR022872_1.fastq", "fastq"):
    quality_scores = record.letter_annotations["phred_quality"]
    record_highest = max(quality_scores)
    record_lowest = min(quality_scores)
    highest_phred = max(highest_phred, record_highest)
    lowest_phred = min(lowest_phred, record_lowest)
    
#Print the output
print(f"Highest Phred score: {highest_phred}")
print(f"Lowest Phred score: {lowest_phred}")

Highest Phred score: 35
Lowest Phred score: 4


<div class="alert alert-block alert-success">
Answers:

    #First define variables. All Phred scores will be bigger than 0 and less than 1000
    highest_phred=0
    lowest_phred=1000

    #Write the loop
    for record in SeqIO.parse("data/Schistosoma_mansoni/subsampled/ERR022872_1.fastq", "fastq"):
        quality_scores = record.letter_annotations["phred_quality"]
        record_highest = max(quality_scores)
        record_lowest = min(quality_scores)
        highest_phred = max(highest_phred, record_highest)
        lowest_phred = min(lowest_phred, record_lowest)
    
    #Print the output
    print(f"Highest Phred score: {highest_phred}")
    print(f"Lowest Phred score: {lowest_phred}")


Highest Phred score: 35
Lowest Phred score: 4


# Quality Control

Hopefully in the previous exercises it has become clear that sequencing DNA is not error-free - each base of the sequence is read with a degree of uncertainty and error - and that our data definitely has some low quality reads (and parts of reads). Each sequencing machine/method results in different error profiles. To improve our analysis we first want to filter the lowest quality reads. One tool commonly used to profile the amounts of error is [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). 

In [10]:
# Create the output directory
! mkdir -p analysis/Schistosoma_mansoni/qc/

<div class="alert alert-block alert-info">

In the code above:

`mkdir` - command to create a new folder

`-p` - flag to create nested folders

Now let's do the FASTQC for each file

In [12]:
%%bash
for accession in $(cat data/Schistosoma_mansoni/list_ids.txt)
do 
    fastqc data/Schistosoma_mansoni/subsampled/$accession*.fastq.gz --noextract -o analysis/Schistosoma_mansoni/qc
done


application/gzip


Started analysis of ERR022872_1.fastq.gz


application/gzip


Approx 5% complete for ERR022872_1.fastq.gz
Approx 10% complete for ERR022872_1.fastq.gz
Approx 15% complete for ERR022872_1.fastq.gz
Approx 20% complete for ERR022872_1.fastq.gz
Approx 25% complete for ERR022872_1.fastq.gz
Approx 30% complete for ERR022872_1.fastq.gz
Approx 35% complete for ERR022872_1.fastq.gz
Approx 40% complete for ERR022872_1.fastq.gz
Approx 45% complete for ERR022872_1.fastq.gz
Approx 50% complete for ERR022872_1.fastq.gz
Approx 55% complete for ERR022872_1.fastq.gz
Approx 60% complete for ERR022872_1.fastq.gz
Approx 65% complete for ERR022872_1.fastq.gz
Approx 70% complete for ERR022872_1.fastq.gz
Approx 75% complete for ERR022872_1.fastq.gz
Approx 80% complete for ERR022872_1.fastq.gz
Approx 85% complete for ERR022872_1.fastq.gz
Approx 90% complete for ERR022872_1.fastq.gz
Approx 95% complete for ERR022872_1.fastq.gz
Approx 100% complete for ERR022872_1.fastq.gz


Analysis complete for ERR022872_1.fastq.gz


Started analysis of ERR022872_2.fastq.gz
Approx 5% complete for ERR022872_2.fastq.gz
Approx 10% complete for ERR022872_2.fastq.gz
Approx 15% complete for ERR022872_2.fastq.gz
Approx 20% complete for ERR022872_2.fastq.gz
Approx 25% complete for ERR022872_2.fastq.gz
Approx 30% complete for ERR022872_2.fastq.gz
Approx 35% complete for ERR022872_2.fastq.gz
Approx 40% complete for ERR022872_2.fastq.gz
Approx 45% complete for ERR022872_2.fastq.gz
Approx 50% complete for ERR022872_2.fastq.gz
Approx 55% complete for ERR022872_2.fastq.gz
Approx 60% complete for ERR022872_2.fastq.gz
Approx 65% complete for ERR022872_2.fastq.gz
Approx 70% complete for ERR022872_2.fastq.gz
Approx 75% complete for ERR022872_2.fastq.gz
Approx 80% complete for ERR022872_2.fastq.gz
Approx 85% complete for ERR022872_2.fastq.gz
Approx 90% complete for ERR022872_2.fastq.gz
Approx 95% complete for ERR022872_2.fastq.gz
Approx 100% complete for ERR022872_2.fastq.gz


Analysis complete for ERR022872_2.fastq.gz
application/gzip
application/gzip


Started analysis of ERR022873_1.fastq.gz
Approx 5% complete for ERR022873_1.fastq.gz
Approx 10% complete for ERR022873_1.fastq.gz
Approx 15% complete for ERR022873_1.fastq.gz
Approx 20% complete for ERR022873_1.fastq.gz
Approx 25% complete for ERR022873_1.fastq.gz
Approx 30% complete for ERR022873_1.fastq.gz
Approx 35% complete for ERR022873_1.fastq.gz
Approx 40% complete for ERR022873_1.fastq.gz
Approx 45% complete for ERR022873_1.fastq.gz
Approx 50% complete for ERR022873_1.fastq.gz
Approx 55% complete for ERR022873_1.fastq.gz
Approx 60% complete for ERR022873_1.fastq.gz
Approx 65% complete for ERR022873_1.fastq.gz
Approx 70% complete for ERR022873_1.fastq.gz
Approx 75% complete for ERR022873_1.fastq.gz
Approx 80% complete for ERR022873_1.fastq.gz
Approx 85% complete for ERR022873_1.fastq.gz
Approx 90% complete for ERR022873_1.fastq.gz
Approx 95% complete for ERR022873_1.fastq.gz
Approx 100% complete for ERR022873_1.fastq.gz


Analysis complete for ERR022873_1.fastq.gz


Started analysis of ERR022873_2.fastq.gz
Approx 5% complete for ERR022873_2.fastq.gz
Approx 10% complete for ERR022873_2.fastq.gz
Approx 15% complete for ERR022873_2.fastq.gz
Approx 20% complete for ERR022873_2.fastq.gz
Approx 25% complete for ERR022873_2.fastq.gz
Approx 30% complete for ERR022873_2.fastq.gz
Approx 35% complete for ERR022873_2.fastq.gz
Approx 40% complete for ERR022873_2.fastq.gz
Approx 45% complete for ERR022873_2.fastq.gz
Approx 50% complete for ERR022873_2.fastq.gz
Approx 55% complete for ERR022873_2.fastq.gz
Approx 60% complete for ERR022873_2.fastq.gz
Approx 65% complete for ERR022873_2.fastq.gz
Approx 70% complete for ERR022873_2.fastq.gz
Approx 75% complete for ERR022873_2.fastq.gz
Approx 80% complete for ERR022873_2.fastq.gz
Approx 85% complete for ERR022873_2.fastq.gz
Approx 90% complete for ERR022873_2.fastq.gz
Approx 95% complete for ERR022873_2.fastq.gz
Approx 100% complete for ERR022873_2.fastq.gz


Analysis complete for ERR022873_2.fastq.gz
application/gzip
application/gzip


Started analysis of ERR022874_1.fastq.gz
Approx 5% complete for ERR022874_1.fastq.gz
Approx 10% complete for ERR022874_1.fastq.gz
Approx 15% complete for ERR022874_1.fastq.gz
Approx 20% complete for ERR022874_1.fastq.gz
Approx 25% complete for ERR022874_1.fastq.gz
Approx 30% complete for ERR022874_1.fastq.gz
Approx 35% complete for ERR022874_1.fastq.gz
Approx 40% complete for ERR022874_1.fastq.gz
Approx 45% complete for ERR022874_1.fastq.gz
Approx 50% complete for ERR022874_1.fastq.gz
Approx 55% complete for ERR022874_1.fastq.gz
Approx 60% complete for ERR022874_1.fastq.gz
Approx 65% complete for ERR022874_1.fastq.gz
Approx 70% complete for ERR022874_1.fastq.gz
Approx 75% complete for ERR022874_1.fastq.gz
Approx 80% complete for ERR022874_1.fastq.gz
Approx 85% complete for ERR022874_1.fastq.gz
Approx 90% complete for ERR022874_1.fastq.gz
Approx 95% complete for ERR022874_1.fastq.gz
Approx 100% complete for ERR022874_1.fastq.gz


Analysis complete for ERR022874_1.fastq.gz


Started analysis of ERR022874_2.fastq.gz
Approx 5% complete for ERR022874_2.fastq.gz
Approx 10% complete for ERR022874_2.fastq.gz
Approx 15% complete for ERR022874_2.fastq.gz
Approx 20% complete for ERR022874_2.fastq.gz
Approx 25% complete for ERR022874_2.fastq.gz
Approx 30% complete for ERR022874_2.fastq.gz
Approx 35% complete for ERR022874_2.fastq.gz
Approx 40% complete for ERR022874_2.fastq.gz
Approx 45% complete for ERR022874_2.fastq.gz
Approx 50% complete for ERR022874_2.fastq.gz
Approx 55% complete for ERR022874_2.fastq.gz
Approx 60% complete for ERR022874_2.fastq.gz
Approx 65% complete for ERR022874_2.fastq.gz
Approx 70% complete for ERR022874_2.fastq.gz
Approx 75% complete for ERR022874_2.fastq.gz
Approx 80% complete for ERR022874_2.fastq.gz
Approx 85% complete for ERR022874_2.fastq.gz
Approx 90% complete for ERR022874_2.fastq.gz
Approx 95% complete for ERR022874_2.fastq.gz
Approx 100% complete for ERR022874_2.fastq.gz


Analysis complete for ERR022874_2.fastq.gz
application/gzip
application/gzip


Started analysis of ERR022875_1.fastq.gz
Approx 5% complete for ERR022875_1.fastq.gz
Approx 10% complete for ERR022875_1.fastq.gz
Approx 15% complete for ERR022875_1.fastq.gz
Approx 20% complete for ERR022875_1.fastq.gz
Approx 25% complete for ERR022875_1.fastq.gz
Approx 30% complete for ERR022875_1.fastq.gz
Approx 35% complete for ERR022875_1.fastq.gz
Approx 40% complete for ERR022875_1.fastq.gz
Approx 45% complete for ERR022875_1.fastq.gz
Approx 50% complete for ERR022875_1.fastq.gz
Approx 55% complete for ERR022875_1.fastq.gz
Approx 60% complete for ERR022875_1.fastq.gz
Approx 65% complete for ERR022875_1.fastq.gz
Approx 70% complete for ERR022875_1.fastq.gz
Approx 75% complete for ERR022875_1.fastq.gz
Approx 80% complete for ERR022875_1.fastq.gz
Approx 85% complete for ERR022875_1.fastq.gz
Approx 90% complete for ERR022875_1.fastq.gz
Approx 95% complete for ERR022875_1.fastq.gz
Approx 100% complete for ERR022875_1.fastq.gz


Analysis complete for ERR022875_1.fastq.gz


Started analysis of ERR022875_2.fastq.gz
Approx 5% complete for ERR022875_2.fastq.gz
Approx 10% complete for ERR022875_2.fastq.gz
Approx 15% complete for ERR022875_2.fastq.gz
Approx 20% complete for ERR022875_2.fastq.gz
Approx 25% complete for ERR022875_2.fastq.gz
Approx 30% complete for ERR022875_2.fastq.gz
Approx 35% complete for ERR022875_2.fastq.gz
Approx 40% complete for ERR022875_2.fastq.gz
Approx 45% complete for ERR022875_2.fastq.gz
Approx 50% complete for ERR022875_2.fastq.gz
Approx 55% complete for ERR022875_2.fastq.gz
Approx 60% complete for ERR022875_2.fastq.gz
Approx 65% complete for ERR022875_2.fastq.gz
Approx 70% complete for ERR022875_2.fastq.gz
Approx 75% complete for ERR022875_2.fastq.gz
Approx 80% complete for ERR022875_2.fastq.gz
Approx 85% complete for ERR022875_2.fastq.gz
Approx 90% complete for ERR022875_2.fastq.gz
Approx 95% complete for ERR022875_2.fastq.gz
Approx 100% complete for ERR022875_2.fastq.gz


Analysis complete for ERR022875_2.fastq.gz
application/gzip
application/gzip


Started analysis of ERR022876_1.fastq.gz
Approx 5% complete for ERR022876_1.fastq.gz
Approx 10% complete for ERR022876_1.fastq.gz
Approx 15% complete for ERR022876_1.fastq.gz
Approx 20% complete for ERR022876_1.fastq.gz
Approx 25% complete for ERR022876_1.fastq.gz
Approx 30% complete for ERR022876_1.fastq.gz
Approx 35% complete for ERR022876_1.fastq.gz
Approx 40% complete for ERR022876_1.fastq.gz
Approx 45% complete for ERR022876_1.fastq.gz
Approx 50% complete for ERR022876_1.fastq.gz
Approx 55% complete for ERR022876_1.fastq.gz
Approx 60% complete for ERR022876_1.fastq.gz
Approx 65% complete for ERR022876_1.fastq.gz
Approx 70% complete for ERR022876_1.fastq.gz
Approx 75% complete for ERR022876_1.fastq.gz
Approx 80% complete for ERR022876_1.fastq.gz
Approx 85% complete for ERR022876_1.fastq.gz
Approx 90% complete for ERR022876_1.fastq.gz
Approx 95% complete for ERR022876_1.fastq.gz
Approx 100% complete for ERR022876_1.fastq.gz


Analysis complete for ERR022876_1.fastq.gz


Started analysis of ERR022876_2.fastq.gz
Approx 5% complete for ERR022876_2.fastq.gz
Approx 10% complete for ERR022876_2.fastq.gz
Approx 15% complete for ERR022876_2.fastq.gz
Approx 20% complete for ERR022876_2.fastq.gz
Approx 25% complete for ERR022876_2.fastq.gz
Approx 30% complete for ERR022876_2.fastq.gz
Approx 35% complete for ERR022876_2.fastq.gz
Approx 40% complete for ERR022876_2.fastq.gz
Approx 45% complete for ERR022876_2.fastq.gz
Approx 50% complete for ERR022876_2.fastq.gz
Approx 55% complete for ERR022876_2.fastq.gz
Approx 60% complete for ERR022876_2.fastq.gz
Approx 65% complete for ERR022876_2.fastq.gz
Approx 70% complete for ERR022876_2.fastq.gz
Approx 75% complete for ERR022876_2.fastq.gz
Approx 80% complete for ERR022876_2.fastq.gz
Approx 85% complete for ERR022876_2.fastq.gz
Approx 90% complete for ERR022876_2.fastq.gz
Approx 95% complete for ERR022876_2.fastq.gz
Approx 100% complete for ERR022876_2.fastq.gz


Analysis complete for ERR022876_2.fastq.gz
application/gzip
application/gzip


Started analysis of ERR022877_1.fastq.gz
Approx 5% complete for ERR022877_1.fastq.gz
Approx 10% complete for ERR022877_1.fastq.gz
Approx 15% complete for ERR022877_1.fastq.gz
Approx 20% complete for ERR022877_1.fastq.gz
Approx 25% complete for ERR022877_1.fastq.gz
Approx 30% complete for ERR022877_1.fastq.gz
Approx 35% complete for ERR022877_1.fastq.gz
Approx 40% complete for ERR022877_1.fastq.gz
Approx 45% complete for ERR022877_1.fastq.gz
Approx 50% complete for ERR022877_1.fastq.gz
Approx 55% complete for ERR022877_1.fastq.gz
Approx 60% complete for ERR022877_1.fastq.gz
Approx 65% complete for ERR022877_1.fastq.gz
Approx 70% complete for ERR022877_1.fastq.gz
Approx 75% complete for ERR022877_1.fastq.gz
Approx 80% complete for ERR022877_1.fastq.gz
Approx 85% complete for ERR022877_1.fastq.gz
Approx 90% complete for ERR022877_1.fastq.gz
Approx 95% complete for ERR022877_1.fastq.gz
Approx 100% complete for ERR022877_1.fastq.gz


Analysis complete for ERR022877_1.fastq.gz


Started analysis of ERR022877_2.fastq.gz
Approx 5% complete for ERR022877_2.fastq.gz
Approx 10% complete for ERR022877_2.fastq.gz
Approx 15% complete for ERR022877_2.fastq.gz
Approx 20% complete for ERR022877_2.fastq.gz
Approx 25% complete for ERR022877_2.fastq.gz
Approx 30% complete for ERR022877_2.fastq.gz
Approx 35% complete for ERR022877_2.fastq.gz
Approx 40% complete for ERR022877_2.fastq.gz
Approx 45% complete for ERR022877_2.fastq.gz
Approx 50% complete for ERR022877_2.fastq.gz
Approx 55% complete for ERR022877_2.fastq.gz
Approx 60% complete for ERR022877_2.fastq.gz
Approx 65% complete for ERR022877_2.fastq.gz
Approx 70% complete for ERR022877_2.fastq.gz
Approx 75% complete for ERR022877_2.fastq.gz
Approx 80% complete for ERR022877_2.fastq.gz
Approx 85% complete for ERR022877_2.fastq.gz
Approx 90% complete for ERR022877_2.fastq.gz
Approx 95% complete for ERR022877_2.fastq.gz
Approx 100% complete for ERR022877_2.fastq.gz


Analysis complete for ERR022877_2.fastq.gz
application/gzip
application/gzip


Started analysis of ERR022878_1.fastq.gz
Approx 5% complete for ERR022878_1.fastq.gz
Approx 10% complete for ERR022878_1.fastq.gz
Approx 15% complete for ERR022878_1.fastq.gz
Approx 20% complete for ERR022878_1.fastq.gz
Approx 25% complete for ERR022878_1.fastq.gz
Approx 30% complete for ERR022878_1.fastq.gz
Approx 35% complete for ERR022878_1.fastq.gz
Approx 40% complete for ERR022878_1.fastq.gz
Approx 45% complete for ERR022878_1.fastq.gz
Approx 50% complete for ERR022878_1.fastq.gz
Approx 55% complete for ERR022878_1.fastq.gz
Approx 60% complete for ERR022878_1.fastq.gz
Approx 65% complete for ERR022878_1.fastq.gz
Approx 70% complete for ERR022878_1.fastq.gz
Approx 75% complete for ERR022878_1.fastq.gz
Approx 80% complete for ERR022878_1.fastq.gz
Approx 85% complete for ERR022878_1.fastq.gz
Approx 90% complete for ERR022878_1.fastq.gz
Approx 95% complete for ERR022878_1.fastq.gz
Approx 100% complete for ERR022878_1.fastq.gz


Analysis complete for ERR022878_1.fastq.gz


Started analysis of ERR022878_2.fastq.gz
Approx 5% complete for ERR022878_2.fastq.gz
Approx 10% complete for ERR022878_2.fastq.gz
Approx 15% complete for ERR022878_2.fastq.gz
Approx 20% complete for ERR022878_2.fastq.gz
Approx 25% complete for ERR022878_2.fastq.gz
Approx 30% complete for ERR022878_2.fastq.gz
Approx 35% complete for ERR022878_2.fastq.gz
Approx 40% complete for ERR022878_2.fastq.gz
Approx 45% complete for ERR022878_2.fastq.gz
Approx 50% complete for ERR022878_2.fastq.gz
Approx 55% complete for ERR022878_2.fastq.gz
Approx 60% complete for ERR022878_2.fastq.gz
Approx 65% complete for ERR022878_2.fastq.gz
Approx 70% complete for ERR022878_2.fastq.gz
Approx 75% complete for ERR022878_2.fastq.gz
Approx 80% complete for ERR022878_2.fastq.gz
Approx 85% complete for ERR022878_2.fastq.gz
Approx 90% complete for ERR022878_2.fastq.gz
Approx 95% complete for ERR022878_2.fastq.gz
Approx 100% complete for ERR022878_2.fastq.gz


Analysis complete for ERR022878_2.fastq.gz
application/gzip
application/gzip


Started analysis of ERR022879_1.fastq.gz
Approx 5% complete for ERR022879_1.fastq.gz
Approx 10% complete for ERR022879_1.fastq.gz
Approx 15% complete for ERR022879_1.fastq.gz
Approx 20% complete for ERR022879_1.fastq.gz
Approx 25% complete for ERR022879_1.fastq.gz
Approx 30% complete for ERR022879_1.fastq.gz
Approx 35% complete for ERR022879_1.fastq.gz
Approx 40% complete for ERR022879_1.fastq.gz
Approx 45% complete for ERR022879_1.fastq.gz
Approx 50% complete for ERR022879_1.fastq.gz
Approx 55% complete for ERR022879_1.fastq.gz
Approx 60% complete for ERR022879_1.fastq.gz
Approx 65% complete for ERR022879_1.fastq.gz
Approx 70% complete for ERR022879_1.fastq.gz
Approx 75% complete for ERR022879_1.fastq.gz
Approx 80% complete for ERR022879_1.fastq.gz
Approx 85% complete for ERR022879_1.fastq.gz
Approx 90% complete for ERR022879_1.fastq.gz
Approx 95% complete for ERR022879_1.fastq.gz
Approx 100% complete for ERR022879_1.fastq.gz


Analysis complete for ERR022879_1.fastq.gz


Started analysis of ERR022879_2.fastq.gz
Approx 5% complete for ERR022879_2.fastq.gz
Approx 10% complete for ERR022879_2.fastq.gz
Approx 15% complete for ERR022879_2.fastq.gz
Approx 20% complete for ERR022879_2.fastq.gz
Approx 25% complete for ERR022879_2.fastq.gz
Approx 30% complete for ERR022879_2.fastq.gz
Approx 35% complete for ERR022879_2.fastq.gz
Approx 40% complete for ERR022879_2.fastq.gz
Approx 45% complete for ERR022879_2.fastq.gz
Approx 50% complete for ERR022879_2.fastq.gz
Approx 55% complete for ERR022879_2.fastq.gz
Approx 60% complete for ERR022879_2.fastq.gz
Approx 65% complete for ERR022879_2.fastq.gz
Approx 70% complete for ERR022879_2.fastq.gz
Approx 75% complete for ERR022879_2.fastq.gz
Approx 80% complete for ERR022879_2.fastq.gz
Approx 85% complete for ERR022879_2.fastq.gz
Approx 90% complete for ERR022879_2.fastq.gz
Approx 95% complete for ERR022879_2.fastq.gz
Approx 100% complete for ERR022879_2.fastq.gz


Analysis complete for ERR022879_2.fastq.gz
application/gzip
application/gzip


Started analysis of ERR022880_1.fastq.gz
Approx 5% complete for ERR022880_1.fastq.gz
Approx 10% complete for ERR022880_1.fastq.gz
Approx 15% complete for ERR022880_1.fastq.gz
Approx 20% complete for ERR022880_1.fastq.gz
Approx 25% complete for ERR022880_1.fastq.gz
Approx 30% complete for ERR022880_1.fastq.gz
Approx 35% complete for ERR022880_1.fastq.gz
Approx 40% complete for ERR022880_1.fastq.gz
Approx 45% complete for ERR022880_1.fastq.gz
Approx 50% complete for ERR022880_1.fastq.gz
Approx 55% complete for ERR022880_1.fastq.gz
Approx 60% complete for ERR022880_1.fastq.gz
Approx 65% complete for ERR022880_1.fastq.gz
Approx 70% complete for ERR022880_1.fastq.gz
Approx 75% complete for ERR022880_1.fastq.gz
Approx 80% complete for ERR022880_1.fastq.gz
Approx 85% complete for ERR022880_1.fastq.gz
Approx 90% complete for ERR022880_1.fastq.gz
Approx 95% complete for ERR022880_1.fastq.gz
Approx 100% complete for ERR022880_1.fastq.gz


Analysis complete for ERR022880_1.fastq.gz


Started analysis of ERR022880_2.fastq.gz
Approx 5% complete for ERR022880_2.fastq.gz
Approx 10% complete for ERR022880_2.fastq.gz
Approx 15% complete for ERR022880_2.fastq.gz
Approx 20% complete for ERR022880_2.fastq.gz
Approx 25% complete for ERR022880_2.fastq.gz
Approx 30% complete for ERR022880_2.fastq.gz
Approx 35% complete for ERR022880_2.fastq.gz
Approx 40% complete for ERR022880_2.fastq.gz
Approx 45% complete for ERR022880_2.fastq.gz
Approx 50% complete for ERR022880_2.fastq.gz
Approx 55% complete for ERR022880_2.fastq.gz
Approx 60% complete for ERR022880_2.fastq.gz
Approx 65% complete for ERR022880_2.fastq.gz
Approx 70% complete for ERR022880_2.fastq.gz
Approx 75% complete for ERR022880_2.fastq.gz
Approx 80% complete for ERR022880_2.fastq.gz
Approx 85% complete for ERR022880_2.fastq.gz
Approx 90% complete for ERR022880_2.fastq.gz
Approx 95% complete for ERR022880_2.fastq.gz
Approx 100% complete for ERR022880_2.fastq.gz


Analysis complete for ERR022880_2.fastq.gz
application/gzip
application/gzip


Started analysis of ERR022881_1.fastq.gz
Approx 5% complete for ERR022881_1.fastq.gz
Approx 10% complete for ERR022881_1.fastq.gz
Approx 15% complete for ERR022881_1.fastq.gz
Approx 20% complete for ERR022881_1.fastq.gz
Approx 25% complete for ERR022881_1.fastq.gz
Approx 30% complete for ERR022881_1.fastq.gz
Approx 35% complete for ERR022881_1.fastq.gz
Approx 40% complete for ERR022881_1.fastq.gz
Approx 45% complete for ERR022881_1.fastq.gz
Approx 50% complete for ERR022881_1.fastq.gz
Approx 55% complete for ERR022881_1.fastq.gz
Approx 60% complete for ERR022881_1.fastq.gz
Approx 65% complete for ERR022881_1.fastq.gz
Approx 70% complete for ERR022881_1.fastq.gz
Approx 75% complete for ERR022881_1.fastq.gz
Approx 80% complete for ERR022881_1.fastq.gz
Approx 85% complete for ERR022881_1.fastq.gz
Approx 90% complete for ERR022881_1.fastq.gz
Approx 95% complete for ERR022881_1.fastq.gz
Approx 100% complete for ERR022881_1.fastq.gz


Analysis complete for ERR022881_1.fastq.gz


Started analysis of ERR022881_2.fastq.gz
Approx 5% complete for ERR022881_2.fastq.gz
Approx 10% complete for ERR022881_2.fastq.gz
Approx 15% complete for ERR022881_2.fastq.gz
Approx 20% complete for ERR022881_2.fastq.gz
Approx 25% complete for ERR022881_2.fastq.gz
Approx 30% complete for ERR022881_2.fastq.gz
Approx 35% complete for ERR022881_2.fastq.gz
Approx 40% complete for ERR022881_2.fastq.gz
Approx 45% complete for ERR022881_2.fastq.gz
Approx 50% complete for ERR022881_2.fastq.gz
Approx 55% complete for ERR022881_2.fastq.gz
Approx 60% complete for ERR022881_2.fastq.gz
Approx 65% complete for ERR022881_2.fastq.gz
Approx 70% complete for ERR022881_2.fastq.gz
Approx 75% complete for ERR022881_2.fastq.gz
Approx 80% complete for ERR022881_2.fastq.gz
Approx 85% complete for ERR022881_2.fastq.gz
Approx 90% complete for ERR022881_2.fastq.gz
Approx 95% complete for ERR022881_2.fastq.gz
Approx 100% complete for ERR022881_2.fastq.gz


Analysis complete for ERR022881_2.fastq.gz
application/gzip
application/gzip


Started analysis of ERR022882_1.fastq.gz
Approx 5% complete for ERR022882_1.fastq.gz
Approx 10% complete for ERR022882_1.fastq.gz
Approx 15% complete for ERR022882_1.fastq.gz
Approx 20% complete for ERR022882_1.fastq.gz
Approx 25% complete for ERR022882_1.fastq.gz
Approx 30% complete for ERR022882_1.fastq.gz
Approx 35% complete for ERR022882_1.fastq.gz
Approx 40% complete for ERR022882_1.fastq.gz
Approx 45% complete for ERR022882_1.fastq.gz
Approx 50% complete for ERR022882_1.fastq.gz
Approx 55% complete for ERR022882_1.fastq.gz
Approx 60% complete for ERR022882_1.fastq.gz
Approx 65% complete for ERR022882_1.fastq.gz
Approx 70% complete for ERR022882_1.fastq.gz
Approx 75% complete for ERR022882_1.fastq.gz
Approx 80% complete for ERR022882_1.fastq.gz
Approx 85% complete for ERR022882_1.fastq.gz
Approx 90% complete for ERR022882_1.fastq.gz
Approx 95% complete for ERR022882_1.fastq.gz
Approx 100% complete for ERR022882_1.fastq.gz


Analysis complete for ERR022882_1.fastq.gz


Started analysis of ERR022882_2.fastq.gz
Approx 5% complete for ERR022882_2.fastq.gz
Approx 10% complete for ERR022882_2.fastq.gz
Approx 15% complete for ERR022882_2.fastq.gz
Approx 20% complete for ERR022882_2.fastq.gz
Approx 25% complete for ERR022882_2.fastq.gz
Approx 30% complete for ERR022882_2.fastq.gz
Approx 35% complete for ERR022882_2.fastq.gz
Approx 40% complete for ERR022882_2.fastq.gz
Approx 45% complete for ERR022882_2.fastq.gz
Approx 50% complete for ERR022882_2.fastq.gz
Approx 55% complete for ERR022882_2.fastq.gz
Approx 60% complete for ERR022882_2.fastq.gz
Approx 65% complete for ERR022882_2.fastq.gz
Approx 70% complete for ERR022882_2.fastq.gz
Approx 75% complete for ERR022882_2.fastq.gz
Approx 80% complete for ERR022882_2.fastq.gz
Approx 85% complete for ERR022882_2.fastq.gz
Approx 90% complete for ERR022882_2.fastq.gz
Approx 95% complete for ERR022882_2.fastq.gz
Approx 100% complete for ERR022882_2.fastq.gz


Analysis complete for ERR022882_2.fastq.gz
application/gzip
application/gzip


Started analysis of ERR022883_1.fastq.gz
Approx 5% complete for ERR022883_1.fastq.gz
Approx 10% complete for ERR022883_1.fastq.gz
Approx 15% complete for ERR022883_1.fastq.gz
Approx 20% complete for ERR022883_1.fastq.gz
Approx 25% complete for ERR022883_1.fastq.gz
Approx 30% complete for ERR022883_1.fastq.gz
Approx 35% complete for ERR022883_1.fastq.gz
Approx 40% complete for ERR022883_1.fastq.gz
Approx 45% complete for ERR022883_1.fastq.gz
Approx 50% complete for ERR022883_1.fastq.gz
Approx 55% complete for ERR022883_1.fastq.gz
Approx 60% complete for ERR022883_1.fastq.gz
Approx 65% complete for ERR022883_1.fastq.gz
Approx 70% complete for ERR022883_1.fastq.gz
Approx 75% complete for ERR022883_1.fastq.gz
Approx 80% complete for ERR022883_1.fastq.gz
Approx 85% complete for ERR022883_1.fastq.gz
Approx 90% complete for ERR022883_1.fastq.gz
Approx 95% complete for ERR022883_1.fastq.gz
Approx 100% complete for ERR022883_1.fastq.gz


Analysis complete for ERR022883_1.fastq.gz


Started analysis of ERR022883_2.fastq.gz
Approx 5% complete for ERR022883_2.fastq.gz
Approx 10% complete for ERR022883_2.fastq.gz
Approx 15% complete for ERR022883_2.fastq.gz
Approx 20% complete for ERR022883_2.fastq.gz
Approx 25% complete for ERR022883_2.fastq.gz
Approx 30% complete for ERR022883_2.fastq.gz
Approx 35% complete for ERR022883_2.fastq.gz
Approx 40% complete for ERR022883_2.fastq.gz
Approx 45% complete for ERR022883_2.fastq.gz
Approx 50% complete for ERR022883_2.fastq.gz
Approx 55% complete for ERR022883_2.fastq.gz
Approx 60% complete for ERR022883_2.fastq.gz
Approx 65% complete for ERR022883_2.fastq.gz
Approx 70% complete for ERR022883_2.fastq.gz
Approx 75% complete for ERR022883_2.fastq.gz
Approx 80% complete for ERR022883_2.fastq.gz
Approx 85% complete for ERR022883_2.fastq.gz
Approx 90% complete for ERR022883_2.fastq.gz
Approx 95% complete for ERR022883_2.fastq.gz
Approx 100% complete for ERR022883_2.fastq.gz


Analysis complete for ERR022883_2.fastq.gz


<div class="alert alert-block alert-info">

In the code above:

`for X in Y; do ... done` - this the structure for a loop in bash; we are asking that for each X (in this case accession n) in Y (in this case the list of ids), the program does something (in this case fastqc)

`--noextract` - Do not uncompress the output file after creating it

`-o` - create all output files in the output directory specified next

<div class="alert alert-block alert-warning">

Open the output directory. You will see that two new files have been created for each read. Open one of the html files and have a look. [Here](https://hbctraining.github.io/Training-modules/planning_successful_rnaseq/lessons/QC_raw_data.html) you can find guidance on what each graph means. 

Questions:

4. Is there a pattern to where the errors occur in these reads?
5. Are there any overrepresented sequences? What are they from?

<div class="alert alert-block alert-success">
    
Answers:

4. The error rate increases as you get closer to the end of the read
5. Several sequences from Illumina Paired End PCR Primers and Illumina Universal Adaptors are overrepresented. These are short regions of sequence which were attached to the reads during the sequencing process.

# Trimming

When we generate millions of reads in a sequencing experiment, we are able to average multiple observations at the same location. However if there are portions of the reads which are low quality these may still affect the average and so we want to remove these regions. As you will have seen in the previous section, the adaptor sequences used to generate the DNA library may also be over-represented in the reads and could cause contamination.

The next step in an RNA-Seq analysis is therefore to trim poor quality regions and adaptor sequences. [TrimGalore](https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md) is a popular software for trimming sequencing reads. 

In [13]:
%%bash
for accession in $(cat data/Schistosoma_mansoni/list_ids.txt)
do
    trim_galore \
      data/Schistosoma_mansoni/subsampled/$accession*.fastq.gz \
      --paired \
      --output_dir analysis/Schistosoma_mansoni/qc/ \
      --basename $accession \
      --no_report_file \
      --fastqc
done

Multicore support not enabled. Proceeding with single-core trimming.
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt version: 4.8
single-core operation.


igzip command line interface 2.31.0


igzip detected. Using igzip for decompressing

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Output will be written into the directory: /Users/rmcolq/Work/git/pathbio3/analysis/Schistosoma_mansoni/qc/
Using user-specified basename (>>ERR022872<<) instead of deriving the filename from the input file(s)


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> data/Schistosoma_mansoni/subsampled/ERR022872_1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	66884	AGATCGGAAGAGC	1000000	6.69
Nextera	2	CTGTCTCTTATA	1000000	0.00
smallRNA	2	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 66884). Second best hit was Nextera (count: 2)


SUMMARISING RUN PARAMETERS
Input filename: data/Schistosoma_mansoni/subsampled/ERR022872_1.fastq.gz
Trimming mode: pair

application/gzip


Started analysis of ERR022872_val_1.fq.gz
Approx 5% complete for ERR022872_val_1.fq.gz
Approx 10% complete for ERR022872_val_1.fq.gz
Approx 15% complete for ERR022872_val_1.fq.gz
Approx 20% complete for ERR022872_val_1.fq.gz
Approx 25% complete for ERR022872_val_1.fq.gz
Approx 30% complete for ERR022872_val_1.fq.gz
Approx 35% complete for ERR022872_val_1.fq.gz
Approx 40% complete for ERR022872_val_1.fq.gz
Approx 45% complete for ERR022872_val_1.fq.gz
Approx 50% complete for ERR022872_val_1.fq.gz
Approx 55% complete for ERR022872_val_1.fq.gz
Approx 60% complete for ERR022872_val_1.fq.gz
Approx 65% complete for ERR022872_val_1.fq.gz
Approx 70% complete for ERR022872_val_1.fq.gz
Approx 75% complete for ERR022872_val_1.fq.gz
Approx 80% complete for ERR022872_val_1.fq.gz
Approx 85% complete for ERR022872_val_1.fq.gz
Approx 90% complete for ERR022872_val_1.fq.gz
Approx 95% complete for ERR022872_val_1.fq.gz


Analysis complete for ERR022872_val_1.fq.gz



  >>> Now running FastQC on the validated data ERR022872_val_2.fq.gz<<<



application/gzip


Started analysis of ERR022872_val_2.fq.gz
Approx 5% complete for ERR022872_val_2.fq.gz
Approx 10% complete for ERR022872_val_2.fq.gz
Approx 15% complete for ERR022872_val_2.fq.gz
Approx 20% complete for ERR022872_val_2.fq.gz
Approx 25% complete for ERR022872_val_2.fq.gz
Approx 30% complete for ERR022872_val_2.fq.gz
Approx 35% complete for ERR022872_val_2.fq.gz
Approx 40% complete for ERR022872_val_2.fq.gz
Approx 45% complete for ERR022872_val_2.fq.gz
Approx 50% complete for ERR022872_val_2.fq.gz
Approx 55% complete for ERR022872_val_2.fq.gz
Approx 60% complete for ERR022872_val_2.fq.gz
Approx 65% complete for ERR022872_val_2.fq.gz
Approx 70% complete for ERR022872_val_2.fq.gz
Approx 75% complete for ERR022872_val_2.fq.gz
Approx 80% complete for ERR022872_val_2.fq.gz
Approx 85% complete for ERR022872_val_2.fq.gz
Approx 90% complete for ERR022872_val_2.fq.gz
Approx 95% complete for ERR022872_val_2.fq.gz


Analysis complete for ERR022872_val_2.fq.gz


Deleting both intermediate output files ERR022872_R1_trimmed.fq.gz and ERR022872_R2_trimmed.fq.gz


Multicore support not enabled. Proceeding with single-core trimming.
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt version: 4.8
single-core operation.


igzip command line interface 2.31.0


igzip detected. Using igzip for decompressing

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Output will be written into the directory: /Users/rmcolq/Work/git/pathbio3/analysis/Schistosoma_mansoni/qc/
Using user-specified basename (>>ERR022873<<) instead of deriving the filename from the input file(s)


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> data/Schistosoma_mansoni/subsampled/ERR022873_1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	37821	AGATCGGAAGAGC	1000000	3.78
Nextera	0	CTGTCTCTTATA	1000000	0.00
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 37821). Second best hit was Nextera (count: 0)


SUMMARISING RUN PARAMETERS
Input filename: data/Schistosoma_mansoni/subsampled/ERR022873_1.fastq.gz
Trimming mode: pair

application/gzip


Started analysis of ERR022873_val_1.fq.gz
Approx 5% complete for ERR022873_val_1.fq.gz
Approx 10% complete for ERR022873_val_1.fq.gz
Approx 15% complete for ERR022873_val_1.fq.gz
Approx 20% complete for ERR022873_val_1.fq.gz
Approx 25% complete for ERR022873_val_1.fq.gz
Approx 30% complete for ERR022873_val_1.fq.gz
Approx 35% complete for ERR022873_val_1.fq.gz
Approx 40% complete for ERR022873_val_1.fq.gz
Approx 45% complete for ERR022873_val_1.fq.gz
Approx 50% complete for ERR022873_val_1.fq.gz
Approx 55% complete for ERR022873_val_1.fq.gz
Approx 60% complete for ERR022873_val_1.fq.gz
Approx 65% complete for ERR022873_val_1.fq.gz
Approx 70% complete for ERR022873_val_1.fq.gz
Approx 75% complete for ERR022873_val_1.fq.gz
Approx 80% complete for ERR022873_val_1.fq.gz
Approx 85% complete for ERR022873_val_1.fq.gz
Approx 90% complete for ERR022873_val_1.fq.gz
Approx 95% complete for ERR022873_val_1.fq.gz


Analysis complete for ERR022873_val_1.fq.gz



  >>> Now running FastQC on the validated data ERR022873_val_2.fq.gz<<<



application/gzip


Started analysis of ERR022873_val_2.fq.gz
Approx 5% complete for ERR022873_val_2.fq.gz
Approx 10% complete for ERR022873_val_2.fq.gz
Approx 15% complete for ERR022873_val_2.fq.gz
Approx 20% complete for ERR022873_val_2.fq.gz
Approx 25% complete for ERR022873_val_2.fq.gz
Approx 30% complete for ERR022873_val_2.fq.gz
Approx 35% complete for ERR022873_val_2.fq.gz
Approx 40% complete for ERR022873_val_2.fq.gz
Approx 45% complete for ERR022873_val_2.fq.gz
Approx 50% complete for ERR022873_val_2.fq.gz
Approx 55% complete for ERR022873_val_2.fq.gz
Approx 60% complete for ERR022873_val_2.fq.gz
Approx 65% complete for ERR022873_val_2.fq.gz
Approx 70% complete for ERR022873_val_2.fq.gz
Approx 75% complete for ERR022873_val_2.fq.gz
Approx 80% complete for ERR022873_val_2.fq.gz
Approx 85% complete for ERR022873_val_2.fq.gz
Approx 90% complete for ERR022873_val_2.fq.gz
Approx 95% complete for ERR022873_val_2.fq.gz


Analysis complete for ERR022873_val_2.fq.gz


Deleting both intermediate output files ERR022873_R1_trimmed.fq.gz and ERR022873_R2_trimmed.fq.gz


Multicore support not enabled. Proceeding with single-core trimming.
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt version: 4.8
single-core operation.


igzip command line interface 2.31.0


igzip detected. Using igzip for decompressing

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Output will be written into the directory: /Users/rmcolq/Work/git/pathbio3/analysis/Schistosoma_mansoni/qc/
Using user-specified basename (>>ERR022874<<) instead of deriving the filename from the input file(s)


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> data/Schistosoma_mansoni/subsampled/ERR022874_1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	37771	AGATCGGAAGAGC	1000000	3.78
Nextera	1	CTGTCTCTTATA	1000000	0.00
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 37771). Second best hit was Nextera (count: 1)


SUMMARISING RUN PARAMETERS
Input filename: data/Schistosoma_mansoni/subsampled/ERR022874_1.fastq.gz
Trimming mode: pair

application/gzip


Started analysis of ERR022874_val_1.fq.gz
Approx 5% complete for ERR022874_val_1.fq.gz
Approx 10% complete for ERR022874_val_1.fq.gz
Approx 15% complete for ERR022874_val_1.fq.gz
Approx 20% complete for ERR022874_val_1.fq.gz
Approx 25% complete for ERR022874_val_1.fq.gz
Approx 30% complete for ERR022874_val_1.fq.gz
Approx 35% complete for ERR022874_val_1.fq.gz
Approx 40% complete for ERR022874_val_1.fq.gz
Approx 45% complete for ERR022874_val_1.fq.gz
Approx 50% complete for ERR022874_val_1.fq.gz
Approx 55% complete for ERR022874_val_1.fq.gz
Approx 60% complete for ERR022874_val_1.fq.gz
Approx 65% complete for ERR022874_val_1.fq.gz
Approx 70% complete for ERR022874_val_1.fq.gz
Approx 75% complete for ERR022874_val_1.fq.gz
Approx 80% complete for ERR022874_val_1.fq.gz
Approx 85% complete for ERR022874_val_1.fq.gz
Approx 90% complete for ERR022874_val_1.fq.gz
Approx 95% complete for ERR022874_val_1.fq.gz


Analysis complete for ERR022874_val_1.fq.gz



  >>> Now running FastQC on the validated data ERR022874_val_2.fq.gz<<<



application/gzip


Started analysis of ERR022874_val_2.fq.gz
Approx 5% complete for ERR022874_val_2.fq.gz
Approx 10% complete for ERR022874_val_2.fq.gz
Approx 15% complete for ERR022874_val_2.fq.gz
Approx 20% complete for ERR022874_val_2.fq.gz
Approx 25% complete for ERR022874_val_2.fq.gz
Approx 30% complete for ERR022874_val_2.fq.gz
Approx 35% complete for ERR022874_val_2.fq.gz
Approx 40% complete for ERR022874_val_2.fq.gz
Approx 45% complete for ERR022874_val_2.fq.gz
Approx 50% complete for ERR022874_val_2.fq.gz
Approx 55% complete for ERR022874_val_2.fq.gz
Approx 60% complete for ERR022874_val_2.fq.gz
Approx 65% complete for ERR022874_val_2.fq.gz
Approx 70% complete for ERR022874_val_2.fq.gz
Approx 75% complete for ERR022874_val_2.fq.gz
Approx 80% complete for ERR022874_val_2.fq.gz
Approx 85% complete for ERR022874_val_2.fq.gz
Approx 90% complete for ERR022874_val_2.fq.gz
Approx 95% complete for ERR022874_val_2.fq.gz


Analysis complete for ERR022874_val_2.fq.gz


Deleting both intermediate output files ERR022874_R1_trimmed.fq.gz and ERR022874_R2_trimmed.fq.gz


Multicore support not enabled. Proceeding with single-core trimming.
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt version: 4.8
single-core operation.


igzip command line interface 2.31.0


igzip detected. Using igzip for decompressing

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Output will be written into the directory: /Users/rmcolq/Work/git/pathbio3/analysis/Schistosoma_mansoni/qc/
Using user-specified basename (>>ERR022875<<) instead of deriving the filename from the input file(s)


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> data/Schistosoma_mansoni/subsampled/ERR022875_1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	26841	AGATCGGAAGAGC	1000000	2.68
Nextera	1	CTGTCTCTTATA	1000000	0.00
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 26841). Second best hit was Nextera (count: 1)


SUMMARISING RUN PARAMETERS
Input filename: data/Schistosoma_mansoni/subsampled/ERR022875_1.fastq.gz
Trimming mode: pair

application/gzip


Started analysis of ERR022875_val_1.fq.gz
Approx 5% complete for ERR022875_val_1.fq.gz
Approx 10% complete for ERR022875_val_1.fq.gz
Approx 15% complete for ERR022875_val_1.fq.gz
Approx 20% complete for ERR022875_val_1.fq.gz
Approx 25% complete for ERR022875_val_1.fq.gz
Approx 30% complete for ERR022875_val_1.fq.gz
Approx 35% complete for ERR022875_val_1.fq.gz
Approx 40% complete for ERR022875_val_1.fq.gz
Approx 45% complete for ERR022875_val_1.fq.gz
Approx 50% complete for ERR022875_val_1.fq.gz
Approx 55% complete for ERR022875_val_1.fq.gz
Approx 60% complete for ERR022875_val_1.fq.gz
Approx 65% complete for ERR022875_val_1.fq.gz
Approx 70% complete for ERR022875_val_1.fq.gz
Approx 75% complete for ERR022875_val_1.fq.gz
Approx 80% complete for ERR022875_val_1.fq.gz
Approx 85% complete for ERR022875_val_1.fq.gz
Approx 90% complete for ERR022875_val_1.fq.gz
Approx 95% complete for ERR022875_val_1.fq.gz


Analysis complete for ERR022875_val_1.fq.gz



  >>> Now running FastQC on the validated data ERR022875_val_2.fq.gz<<<



application/gzip


Started analysis of ERR022875_val_2.fq.gz
Approx 5% complete for ERR022875_val_2.fq.gz
Approx 10% complete for ERR022875_val_2.fq.gz
Approx 15% complete for ERR022875_val_2.fq.gz
Approx 20% complete for ERR022875_val_2.fq.gz
Approx 25% complete for ERR022875_val_2.fq.gz
Approx 30% complete for ERR022875_val_2.fq.gz
Approx 35% complete for ERR022875_val_2.fq.gz
Approx 40% complete for ERR022875_val_2.fq.gz
Approx 45% complete for ERR022875_val_2.fq.gz
Approx 50% complete for ERR022875_val_2.fq.gz
Approx 55% complete for ERR022875_val_2.fq.gz
Approx 60% complete for ERR022875_val_2.fq.gz
Approx 65% complete for ERR022875_val_2.fq.gz
Approx 70% complete for ERR022875_val_2.fq.gz
Approx 75% complete for ERR022875_val_2.fq.gz
Approx 80% complete for ERR022875_val_2.fq.gz
Approx 85% complete for ERR022875_val_2.fq.gz
Approx 90% complete for ERR022875_val_2.fq.gz
Approx 95% complete for ERR022875_val_2.fq.gz


Analysis complete for ERR022875_val_2.fq.gz


Deleting both intermediate output files ERR022875_R1_trimmed.fq.gz and ERR022875_R2_trimmed.fq.gz


Multicore support not enabled. Proceeding with single-core trimming.
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt version: 4.8
single-core operation.


igzip command line interface 2.31.0


igzip detected. Using igzip for decompressing

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Output will be written into the directory: /Users/rmcolq/Work/git/pathbio3/analysis/Schistosoma_mansoni/qc/
Using user-specified basename (>>ERR022876<<) instead of deriving the filename from the input file(s)


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> data/Schistosoma_mansoni/subsampled/ERR022876_1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	22641	AGATCGGAAGAGC	1000000	2.26
Nextera	2	CTGTCTCTTATA	1000000	0.00
smallRNA	1	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 22641). Second best hit was Nextera (count: 2)


SUMMARISING RUN PARAMETERS
Input filename: data/Schistosoma_mansoni/subsampled/ERR022876_1.fastq.gz
Trimming mode: pair

application/gzip


Started analysis of ERR022876_val_1.fq.gz
Approx 5% complete for ERR022876_val_1.fq.gz
Approx 10% complete for ERR022876_val_1.fq.gz
Approx 15% complete for ERR022876_val_1.fq.gz
Approx 20% complete for ERR022876_val_1.fq.gz
Approx 25% complete for ERR022876_val_1.fq.gz
Approx 30% complete for ERR022876_val_1.fq.gz
Approx 35% complete for ERR022876_val_1.fq.gz
Approx 40% complete for ERR022876_val_1.fq.gz
Approx 45% complete for ERR022876_val_1.fq.gz
Approx 50% complete for ERR022876_val_1.fq.gz
Approx 55% complete for ERR022876_val_1.fq.gz
Approx 60% complete for ERR022876_val_1.fq.gz
Approx 65% complete for ERR022876_val_1.fq.gz
Approx 70% complete for ERR022876_val_1.fq.gz
Approx 75% complete for ERR022876_val_1.fq.gz
Approx 80% complete for ERR022876_val_1.fq.gz
Approx 85% complete for ERR022876_val_1.fq.gz
Approx 90% complete for ERR022876_val_1.fq.gz
Approx 95% complete for ERR022876_val_1.fq.gz


Analysis complete for ERR022876_val_1.fq.gz



  >>> Now running FastQC on the validated data ERR022876_val_2.fq.gz<<<



application/gzip


Started analysis of ERR022876_val_2.fq.gz
Approx 5% complete for ERR022876_val_2.fq.gz
Approx 10% complete for ERR022876_val_2.fq.gz
Approx 15% complete for ERR022876_val_2.fq.gz
Approx 20% complete for ERR022876_val_2.fq.gz
Approx 25% complete for ERR022876_val_2.fq.gz
Approx 30% complete for ERR022876_val_2.fq.gz
Approx 35% complete for ERR022876_val_2.fq.gz
Approx 40% complete for ERR022876_val_2.fq.gz
Approx 45% complete for ERR022876_val_2.fq.gz
Approx 50% complete for ERR022876_val_2.fq.gz
Approx 55% complete for ERR022876_val_2.fq.gz
Approx 60% complete for ERR022876_val_2.fq.gz
Approx 65% complete for ERR022876_val_2.fq.gz
Approx 70% complete for ERR022876_val_2.fq.gz
Approx 75% complete for ERR022876_val_2.fq.gz
Approx 80% complete for ERR022876_val_2.fq.gz
Approx 85% complete for ERR022876_val_2.fq.gz
Approx 90% complete for ERR022876_val_2.fq.gz
Approx 95% complete for ERR022876_val_2.fq.gz


Analysis complete for ERR022876_val_2.fq.gz


Deleting both intermediate output files ERR022876_R1_trimmed.fq.gz and ERR022876_R2_trimmed.fq.gz


Multicore support not enabled. Proceeding with single-core trimming.
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt version: 4.8
single-core operation.


igzip command line interface 2.31.0


igzip detected. Using igzip for decompressing

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Output will be written into the directory: /Users/rmcolq/Work/git/pathbio3/analysis/Schistosoma_mansoni/qc/
Using user-specified basename (>>ERR022877<<) instead of deriving the filename from the input file(s)


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> data/Schistosoma_mansoni/subsampled/ERR022877_1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	1827	AGATCGGAAGAGC	1000000	0.18
Nextera	1	CTGTCTCTTATA	1000000	0.00
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 1827). Second best hit was Nextera (count: 1)


SUMMARISING RUN PARAMETERS
Input filename: data/Schistosoma_mansoni/subsampled/ERR022877_1.fastq.gz
Trimming mode: paired

application/gzip


Started analysis of ERR022877_val_1.fq.gz
Approx 5% complete for ERR022877_val_1.fq.gz
Approx 10% complete for ERR022877_val_1.fq.gz
Approx 15% complete for ERR022877_val_1.fq.gz
Approx 20% complete for ERR022877_val_1.fq.gz
Approx 25% complete for ERR022877_val_1.fq.gz
Approx 30% complete for ERR022877_val_1.fq.gz
Approx 35% complete for ERR022877_val_1.fq.gz
Approx 40% complete for ERR022877_val_1.fq.gz
Approx 45% complete for ERR022877_val_1.fq.gz
Approx 50% complete for ERR022877_val_1.fq.gz
Approx 55% complete for ERR022877_val_1.fq.gz
Approx 60% complete for ERR022877_val_1.fq.gz
Approx 65% complete for ERR022877_val_1.fq.gz
Approx 70% complete for ERR022877_val_1.fq.gz
Approx 75% complete for ERR022877_val_1.fq.gz
Approx 80% complete for ERR022877_val_1.fq.gz
Approx 85% complete for ERR022877_val_1.fq.gz
Approx 90% complete for ERR022877_val_1.fq.gz
Approx 95% complete for ERR022877_val_1.fq.gz


Analysis complete for ERR022877_val_1.fq.gz



  >>> Now running FastQC on the validated data ERR022877_val_2.fq.gz<<<



application/gzip


Started analysis of ERR022877_val_2.fq.gz
Approx 5% complete for ERR022877_val_2.fq.gz
Approx 10% complete for ERR022877_val_2.fq.gz
Approx 15% complete for ERR022877_val_2.fq.gz
Approx 20% complete for ERR022877_val_2.fq.gz
Approx 25% complete for ERR022877_val_2.fq.gz
Approx 30% complete for ERR022877_val_2.fq.gz
Approx 35% complete for ERR022877_val_2.fq.gz
Approx 40% complete for ERR022877_val_2.fq.gz
Approx 45% complete for ERR022877_val_2.fq.gz
Approx 50% complete for ERR022877_val_2.fq.gz
Approx 55% complete for ERR022877_val_2.fq.gz
Approx 60% complete for ERR022877_val_2.fq.gz
Approx 65% complete for ERR022877_val_2.fq.gz
Approx 70% complete for ERR022877_val_2.fq.gz
Approx 75% complete for ERR022877_val_2.fq.gz
Approx 80% complete for ERR022877_val_2.fq.gz
Approx 85% complete for ERR022877_val_2.fq.gz
Approx 90% complete for ERR022877_val_2.fq.gz
Approx 95% complete for ERR022877_val_2.fq.gz


Analysis complete for ERR022877_val_2.fq.gz


Deleting both intermediate output files ERR022877_R1_trimmed.fq.gz and ERR022877_R2_trimmed.fq.gz


Multicore support not enabled. Proceeding with single-core trimming.
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt version: 4.8
single-core operation.


igzip command line interface 2.31.0


igzip detected. Using igzip for decompressing

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Output will be written into the directory: /Users/rmcolq/Work/git/pathbio3/analysis/Schistosoma_mansoni/qc/
Using user-specified basename (>>ERR022878<<) instead of deriving the filename from the input file(s)


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> data/Schistosoma_mansoni/subsampled/ERR022878_1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	9632	AGATCGGAAGAGC	1000000	0.96
Nextera	1	CTGTCTCTTATA	1000000	0.00
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 9632). Second best hit was Nextera (count: 1)


SUMMARISING RUN PARAMETERS
Input filename: data/Schistosoma_mansoni/subsampled/ERR022878_1.fastq.gz
Trimming mode: paired

application/gzip


Started analysis of ERR022878_val_1.fq.gz
Approx 5% complete for ERR022878_val_1.fq.gz
Approx 10% complete for ERR022878_val_1.fq.gz
Approx 15% complete for ERR022878_val_1.fq.gz
Approx 20% complete for ERR022878_val_1.fq.gz
Approx 25% complete for ERR022878_val_1.fq.gz
Approx 30% complete for ERR022878_val_1.fq.gz
Approx 35% complete for ERR022878_val_1.fq.gz
Approx 40% complete for ERR022878_val_1.fq.gz
Approx 45% complete for ERR022878_val_1.fq.gz
Approx 50% complete for ERR022878_val_1.fq.gz
Approx 55% complete for ERR022878_val_1.fq.gz
Approx 60% complete for ERR022878_val_1.fq.gz
Approx 65% complete for ERR022878_val_1.fq.gz
Approx 70% complete for ERR022878_val_1.fq.gz
Approx 75% complete for ERR022878_val_1.fq.gz
Approx 80% complete for ERR022878_val_1.fq.gz
Approx 85% complete for ERR022878_val_1.fq.gz
Approx 90% complete for ERR022878_val_1.fq.gz
Approx 95% complete for ERR022878_val_1.fq.gz


Analysis complete for ERR022878_val_1.fq.gz



  >>> Now running FastQC on the validated data ERR022878_val_2.fq.gz<<<



application/gzip


Started analysis of ERR022878_val_2.fq.gz
Approx 5% complete for ERR022878_val_2.fq.gz
Approx 10% complete for ERR022878_val_2.fq.gz
Approx 15% complete for ERR022878_val_2.fq.gz
Approx 20% complete for ERR022878_val_2.fq.gz
Approx 25% complete for ERR022878_val_2.fq.gz
Approx 30% complete for ERR022878_val_2.fq.gz
Approx 35% complete for ERR022878_val_2.fq.gz
Approx 40% complete for ERR022878_val_2.fq.gz
Approx 45% complete for ERR022878_val_2.fq.gz
Approx 50% complete for ERR022878_val_2.fq.gz
Approx 55% complete for ERR022878_val_2.fq.gz
Approx 60% complete for ERR022878_val_2.fq.gz
Approx 65% complete for ERR022878_val_2.fq.gz
Approx 70% complete for ERR022878_val_2.fq.gz
Approx 75% complete for ERR022878_val_2.fq.gz
Approx 80% complete for ERR022878_val_2.fq.gz
Approx 85% complete for ERR022878_val_2.fq.gz
Approx 90% complete for ERR022878_val_2.fq.gz
Approx 95% complete for ERR022878_val_2.fq.gz


Analysis complete for ERR022878_val_2.fq.gz


Deleting both intermediate output files ERR022878_R1_trimmed.fq.gz and ERR022878_R2_trimmed.fq.gz


Multicore support not enabled. Proceeding with single-core trimming.
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt version: 4.8
single-core operation.


igzip command line interface 2.31.0


igzip detected. Using igzip for decompressing

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Output will be written into the directory: /Users/rmcolq/Work/git/pathbio3/analysis/Schistosoma_mansoni/qc/
Using user-specified basename (>>ERR022879<<) instead of deriving the filename from the input file(s)


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> data/Schistosoma_mansoni/subsampled/ERR022879_1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	204	AGATCGGAAGAGC	1000000	0.02
Nextera	3	CTGTCTCTTATA	1000000	0.00
smallRNA	1	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 204). Second best hit was Nextera (count: 3)


SUMMARISING RUN PARAMETERS
Input filename: data/Schistosoma_mansoni/subsampled/ERR022879_1.fastq.gz
Trimming mode: paired-e

application/gzip


Started analysis of ERR022879_val_1.fq.gz
Approx 5% complete for ERR022879_val_1.fq.gz
Approx 10% complete for ERR022879_val_1.fq.gz
Approx 15% complete for ERR022879_val_1.fq.gz
Approx 20% complete for ERR022879_val_1.fq.gz
Approx 25% complete for ERR022879_val_1.fq.gz
Approx 30% complete for ERR022879_val_1.fq.gz
Approx 35% complete for ERR022879_val_1.fq.gz
Approx 40% complete for ERR022879_val_1.fq.gz
Approx 45% complete for ERR022879_val_1.fq.gz
Approx 50% complete for ERR022879_val_1.fq.gz
Approx 55% complete for ERR022879_val_1.fq.gz
Approx 60% complete for ERR022879_val_1.fq.gz
Approx 65% complete for ERR022879_val_1.fq.gz
Approx 70% complete for ERR022879_val_1.fq.gz
Approx 75% complete for ERR022879_val_1.fq.gz
Approx 80% complete for ERR022879_val_1.fq.gz
Approx 85% complete for ERR022879_val_1.fq.gz
Approx 90% complete for ERR022879_val_1.fq.gz
Approx 95% complete for ERR022879_val_1.fq.gz


Analysis complete for ERR022879_val_1.fq.gz



  >>> Now running FastQC on the validated data ERR022879_val_2.fq.gz<<<



application/gzip


Started analysis of ERR022879_val_2.fq.gz
Approx 5% complete for ERR022879_val_2.fq.gz
Approx 10% complete for ERR022879_val_2.fq.gz
Approx 15% complete for ERR022879_val_2.fq.gz
Approx 20% complete for ERR022879_val_2.fq.gz
Approx 25% complete for ERR022879_val_2.fq.gz
Approx 30% complete for ERR022879_val_2.fq.gz
Approx 35% complete for ERR022879_val_2.fq.gz
Approx 40% complete for ERR022879_val_2.fq.gz
Approx 45% complete for ERR022879_val_2.fq.gz
Approx 50% complete for ERR022879_val_2.fq.gz
Approx 55% complete for ERR022879_val_2.fq.gz
Approx 60% complete for ERR022879_val_2.fq.gz
Approx 65% complete for ERR022879_val_2.fq.gz
Approx 70% complete for ERR022879_val_2.fq.gz
Approx 75% complete for ERR022879_val_2.fq.gz
Approx 80% complete for ERR022879_val_2.fq.gz
Approx 85% complete for ERR022879_val_2.fq.gz
Approx 90% complete for ERR022879_val_2.fq.gz
Approx 95% complete for ERR022879_val_2.fq.gz


Analysis complete for ERR022879_val_2.fq.gz


Deleting both intermediate output files ERR022879_R1_trimmed.fq.gz and ERR022879_R2_trimmed.fq.gz


Multicore support not enabled. Proceeding with single-core trimming.
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt version: 4.8
single-core operation.


igzip command line interface 2.31.0


igzip detected. Using igzip for decompressing

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Output will be written into the directory: /Users/rmcolq/Work/git/pathbio3/analysis/Schistosoma_mansoni/qc/
Using user-specified basename (>>ERR022880<<) instead of deriving the filename from the input file(s)


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> data/Schistosoma_mansoni/subsampled/ERR022880_1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	1749	AGATCGGAAGAGC	1000000	0.17
Nextera	3	CTGTCTCTTATA	1000000	0.00
smallRNA	2	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 1749). Second best hit was Nextera (count: 3)


SUMMARISING RUN PARAMETERS
Input filename: data/Schistosoma_mansoni/subsampled/ERR022880_1.fastq.gz
Trimming mode: paired

application/gzip


Started analysis of ERR022880_val_1.fq.gz
Approx 5% complete for ERR022880_val_1.fq.gz
Approx 10% complete for ERR022880_val_1.fq.gz
Approx 15% complete for ERR022880_val_1.fq.gz
Approx 20% complete for ERR022880_val_1.fq.gz
Approx 25% complete for ERR022880_val_1.fq.gz
Approx 30% complete for ERR022880_val_1.fq.gz
Approx 35% complete for ERR022880_val_1.fq.gz
Approx 40% complete for ERR022880_val_1.fq.gz
Approx 45% complete for ERR022880_val_1.fq.gz
Approx 50% complete for ERR022880_val_1.fq.gz
Approx 55% complete for ERR022880_val_1.fq.gz
Approx 60% complete for ERR022880_val_1.fq.gz
Approx 65% complete for ERR022880_val_1.fq.gz
Approx 70% complete for ERR022880_val_1.fq.gz
Approx 75% complete for ERR022880_val_1.fq.gz
Approx 80% complete for ERR022880_val_1.fq.gz
Approx 85% complete for ERR022880_val_1.fq.gz
Approx 90% complete for ERR022880_val_1.fq.gz
Approx 95% complete for ERR022880_val_1.fq.gz


Analysis complete for ERR022880_val_1.fq.gz



  >>> Now running FastQC on the validated data ERR022880_val_2.fq.gz<<<



application/gzip


Started analysis of ERR022880_val_2.fq.gz
Approx 5% complete for ERR022880_val_2.fq.gz
Approx 10% complete for ERR022880_val_2.fq.gz
Approx 15% complete for ERR022880_val_2.fq.gz
Approx 20% complete for ERR022880_val_2.fq.gz
Approx 25% complete for ERR022880_val_2.fq.gz
Approx 30% complete for ERR022880_val_2.fq.gz
Approx 35% complete for ERR022880_val_2.fq.gz
Approx 40% complete for ERR022880_val_2.fq.gz
Approx 45% complete for ERR022880_val_2.fq.gz
Approx 50% complete for ERR022880_val_2.fq.gz
Approx 55% complete for ERR022880_val_2.fq.gz
Approx 60% complete for ERR022880_val_2.fq.gz
Approx 65% complete for ERR022880_val_2.fq.gz
Approx 70% complete for ERR022880_val_2.fq.gz
Approx 75% complete for ERR022880_val_2.fq.gz
Approx 80% complete for ERR022880_val_2.fq.gz
Approx 85% complete for ERR022880_val_2.fq.gz
Approx 90% complete for ERR022880_val_2.fq.gz
Approx 95% complete for ERR022880_val_2.fq.gz


Analysis complete for ERR022880_val_2.fq.gz


Deleting both intermediate output files ERR022880_R1_trimmed.fq.gz and ERR022880_R2_trimmed.fq.gz


Multicore support not enabled. Proceeding with single-core trimming.
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt version: 4.8
single-core operation.


igzip command line interface 2.31.0


igzip detected. Using igzip for decompressing

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Output will be written into the directory: /Users/rmcolq/Work/git/pathbio3/analysis/Schistosoma_mansoni/qc/
Using user-specified basename (>>ERR022881<<) instead of deriving the filename from the input file(s)


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> data/Schistosoma_mansoni/subsampled/ERR022881_1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	187	AGATCGGAAGAGC	1000000	0.02
smallRNA	3	TGGAATTCTCGG	1000000	0.00
Nextera	1	CTGTCTCTTATA	1000000	0.00
Using Illumina adapter for trimming (count: 187). Second best hit was smallRNA (count: 3)


SUMMARISING RUN PARAMETERS
Input filename: data/Schistosoma_mansoni/subsampled/ERR022881_1.fastq.gz
Trimming mode: paired-

application/gzip


Started analysis of ERR022881_val_1.fq.gz
Approx 5% complete for ERR022881_val_1.fq.gz
Approx 10% complete for ERR022881_val_1.fq.gz
Approx 15% complete for ERR022881_val_1.fq.gz
Approx 20% complete for ERR022881_val_1.fq.gz
Approx 25% complete for ERR022881_val_1.fq.gz
Approx 30% complete for ERR022881_val_1.fq.gz
Approx 35% complete for ERR022881_val_1.fq.gz
Approx 40% complete for ERR022881_val_1.fq.gz
Approx 45% complete for ERR022881_val_1.fq.gz
Approx 50% complete for ERR022881_val_1.fq.gz
Approx 55% complete for ERR022881_val_1.fq.gz
Approx 60% complete for ERR022881_val_1.fq.gz
Approx 65% complete for ERR022881_val_1.fq.gz
Approx 70% complete for ERR022881_val_1.fq.gz
Approx 75% complete for ERR022881_val_1.fq.gz
Approx 80% complete for ERR022881_val_1.fq.gz
Approx 85% complete for ERR022881_val_1.fq.gz
Approx 90% complete for ERR022881_val_1.fq.gz
Approx 95% complete for ERR022881_val_1.fq.gz


Analysis complete for ERR022881_val_1.fq.gz



  >>> Now running FastQC on the validated data ERR022881_val_2.fq.gz<<<



application/gzip


Started analysis of ERR022881_val_2.fq.gz
Approx 5% complete for ERR022881_val_2.fq.gz
Approx 10% complete for ERR022881_val_2.fq.gz
Approx 15% complete for ERR022881_val_2.fq.gz
Approx 20% complete for ERR022881_val_2.fq.gz
Approx 25% complete for ERR022881_val_2.fq.gz
Approx 30% complete for ERR022881_val_2.fq.gz
Approx 35% complete for ERR022881_val_2.fq.gz
Approx 40% complete for ERR022881_val_2.fq.gz
Approx 45% complete for ERR022881_val_2.fq.gz
Approx 50% complete for ERR022881_val_2.fq.gz
Approx 55% complete for ERR022881_val_2.fq.gz
Approx 60% complete for ERR022881_val_2.fq.gz
Approx 65% complete for ERR022881_val_2.fq.gz
Approx 70% complete for ERR022881_val_2.fq.gz
Approx 75% complete for ERR022881_val_2.fq.gz
Approx 80% complete for ERR022881_val_2.fq.gz
Approx 85% complete for ERR022881_val_2.fq.gz
Approx 90% complete for ERR022881_val_2.fq.gz
Approx 95% complete for ERR022881_val_2.fq.gz


Analysis complete for ERR022881_val_2.fq.gz


Deleting both intermediate output files ERR022881_R1_trimmed.fq.gz and ERR022881_R2_trimmed.fq.gz


Multicore support not enabled. Proceeding with single-core trimming.
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt version: 4.8
single-core operation.


igzip command line interface 2.31.0


igzip detected. Using igzip for decompressing

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Output will be written into the directory: /Users/rmcolq/Work/git/pathbio3/analysis/Schistosoma_mansoni/qc/
Using user-specified basename (>>ERR022882<<) instead of deriving the filename from the input file(s)


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> data/Schistosoma_mansoni/subsampled/ERR022882_1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	108	AGATCGGAAGAGC	1000000	0.01
smallRNA	3	TGGAATTCTCGG	1000000	0.00
Nextera	1	CTGTCTCTTATA	1000000	0.00
Using Illumina adapter for trimming (count: 108). Second best hit was smallRNA (count: 3)


SUMMARISING RUN PARAMETERS
Input filename: data/Schistosoma_mansoni/subsampled/ERR022882_1.fastq.gz
Trimming mode: paired-

application/gzip


Started analysis of ERR022882_val_1.fq.gz
Approx 5% complete for ERR022882_val_1.fq.gz
Approx 10% complete for ERR022882_val_1.fq.gz
Approx 15% complete for ERR022882_val_1.fq.gz
Approx 20% complete for ERR022882_val_1.fq.gz
Approx 25% complete for ERR022882_val_1.fq.gz
Approx 30% complete for ERR022882_val_1.fq.gz
Approx 35% complete for ERR022882_val_1.fq.gz
Approx 40% complete for ERR022882_val_1.fq.gz
Approx 45% complete for ERR022882_val_1.fq.gz
Approx 50% complete for ERR022882_val_1.fq.gz
Approx 55% complete for ERR022882_val_1.fq.gz
Approx 60% complete for ERR022882_val_1.fq.gz
Approx 65% complete for ERR022882_val_1.fq.gz
Approx 70% complete for ERR022882_val_1.fq.gz
Approx 75% complete for ERR022882_val_1.fq.gz
Approx 80% complete for ERR022882_val_1.fq.gz
Approx 85% complete for ERR022882_val_1.fq.gz
Approx 90% complete for ERR022882_val_1.fq.gz
Approx 95% complete for ERR022882_val_1.fq.gz


Analysis complete for ERR022882_val_1.fq.gz



  >>> Now running FastQC on the validated data ERR022882_val_2.fq.gz<<<



application/gzip


Started analysis of ERR022882_val_2.fq.gz
Approx 5% complete for ERR022882_val_2.fq.gz
Approx 10% complete for ERR022882_val_2.fq.gz
Approx 15% complete for ERR022882_val_2.fq.gz
Approx 20% complete for ERR022882_val_2.fq.gz
Approx 25% complete for ERR022882_val_2.fq.gz
Approx 30% complete for ERR022882_val_2.fq.gz
Approx 35% complete for ERR022882_val_2.fq.gz
Approx 40% complete for ERR022882_val_2.fq.gz
Approx 45% complete for ERR022882_val_2.fq.gz
Approx 50% complete for ERR022882_val_2.fq.gz
Approx 55% complete for ERR022882_val_2.fq.gz
Approx 60% complete for ERR022882_val_2.fq.gz
Approx 65% complete for ERR022882_val_2.fq.gz
Approx 70% complete for ERR022882_val_2.fq.gz
Approx 75% complete for ERR022882_val_2.fq.gz
Approx 80% complete for ERR022882_val_2.fq.gz
Approx 85% complete for ERR022882_val_2.fq.gz
Approx 90% complete for ERR022882_val_2.fq.gz
Approx 95% complete for ERR022882_val_2.fq.gz


Analysis complete for ERR022882_val_2.fq.gz


Deleting both intermediate output files ERR022882_R1_trimmed.fq.gz and ERR022882_R2_trimmed.fq.gz


Multicore support not enabled. Proceeding with single-core trimming.
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt version: 4.8
single-core operation.


igzip command line interface 2.31.0


igzip detected. Using igzip for decompressing

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Output will be written into the directory: /Users/rmcolq/Work/git/pathbio3/analysis/Schistosoma_mansoni/qc/
Using user-specified basename (>>ERR022883<<) instead of deriving the filename from the input file(s)


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> data/Schistosoma_mansoni/subsampled/ERR022883_1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	128	AGATCGGAAGAGC	1000000	0.01
Nextera	2	CTGTCTCTTATA	1000000	0.00
smallRNA	1	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 128). Second best hit was Nextera (count: 2)


SUMMARISING RUN PARAMETERS
Input filename: data/Schistosoma_mansoni/subsampled/ERR022883_1.fastq.gz
Trimming mode: paired-e

application/gzip


Started analysis of ERR022883_val_1.fq.gz
Approx 5% complete for ERR022883_val_1.fq.gz
Approx 10% complete for ERR022883_val_1.fq.gz
Approx 15% complete for ERR022883_val_1.fq.gz
Approx 20% complete for ERR022883_val_1.fq.gz
Approx 25% complete for ERR022883_val_1.fq.gz
Approx 30% complete for ERR022883_val_1.fq.gz
Approx 35% complete for ERR022883_val_1.fq.gz
Approx 40% complete for ERR022883_val_1.fq.gz
Approx 45% complete for ERR022883_val_1.fq.gz
Approx 50% complete for ERR022883_val_1.fq.gz
Approx 55% complete for ERR022883_val_1.fq.gz
Approx 60% complete for ERR022883_val_1.fq.gz
Approx 65% complete for ERR022883_val_1.fq.gz
Approx 70% complete for ERR022883_val_1.fq.gz
Approx 75% complete for ERR022883_val_1.fq.gz
Approx 80% complete for ERR022883_val_1.fq.gz
Approx 85% complete for ERR022883_val_1.fq.gz
Approx 90% complete for ERR022883_val_1.fq.gz
Approx 95% complete for ERR022883_val_1.fq.gz


Analysis complete for ERR022883_val_1.fq.gz



  >>> Now running FastQC on the validated data ERR022883_val_2.fq.gz<<<



application/gzip


Started analysis of ERR022883_val_2.fq.gz
Approx 5% complete for ERR022883_val_2.fq.gz
Approx 10% complete for ERR022883_val_2.fq.gz
Approx 15% complete for ERR022883_val_2.fq.gz
Approx 20% complete for ERR022883_val_2.fq.gz
Approx 25% complete for ERR022883_val_2.fq.gz
Approx 30% complete for ERR022883_val_2.fq.gz
Approx 35% complete for ERR022883_val_2.fq.gz
Approx 40% complete for ERR022883_val_2.fq.gz
Approx 45% complete for ERR022883_val_2.fq.gz
Approx 50% complete for ERR022883_val_2.fq.gz
Approx 55% complete for ERR022883_val_2.fq.gz
Approx 60% complete for ERR022883_val_2.fq.gz
Approx 65% complete for ERR022883_val_2.fq.gz
Approx 70% complete for ERR022883_val_2.fq.gz
Approx 75% complete for ERR022883_val_2.fq.gz
Approx 80% complete for ERR022883_val_2.fq.gz
Approx 85% complete for ERR022883_val_2.fq.gz
Approx 90% complete for ERR022883_val_2.fq.gz
Approx 95% complete for ERR022883_val_2.fq.gz


Analysis complete for ERR022883_val_2.fq.gz


Deleting both intermediate output files ERR022883_R1_trimmed.fq.gz and ERR022883_R2_trimmed.fq.gz




<div class="alert alert-block alert-warning">

Question:

6. What do the following elements of the code above mean (Use the TrimGalore documentation to find the information)?

`--paired`

`--fastqc`

`--basename`

7. Chose a read file and compare the FastQC report before trimming to the FastQC report after adaptor trimming. What are the improvements to the data quality after trimming? Are there any remaining warnings?

<div class="alert alert-block alert-success">
    
Answers:

6. `--paired` means that the reads are generated in pairs with a set gap size and so should be considered together with their pair when performing QC. In particular, this removes entire read pairs if at least one of the two sequences became shorter than a certain threshold. `--fastqc` automatically runs FastQC after trimming. `--basename` specifies the preferred prefix of output files.
7. For read 1 of ERR022872, the trimmed report tells us that the average read quality does not drop off as much towards the end of the read. The per-base sequence content is still more variable at the starts of the reads, but no longer at the ends of the reads. The overrepresented primers and adaptors have now been removed.
   The read length distribution is now worse as before all reads were 76bp long and now there is a long tail with (small numbers of) much shorter reads  with lengths between 20-75bp after trimming.

# Mapping to the reference

Now that we are happy with the quality of the reads and have removed adapters, we can map our sequences to the genome. This allows us to identify which genes each read came from, and which genes were <i>expressed</i> in our samples in the form of transcripts. 

<figure>
    <img src="https://www.annualreviews.org/docserver/ahah/fulltext/biodatasci/2/1/bd020139.f4_thmb.gif">
</figure>

In some organisms including Plasmodium, expressed transcripts may be generated by splicing together non-contiguous exons from the genome (others such as Trypanosoma do not). To handle this, we can either use a splice-aware mapper to align reads across splice junctions, or we can map directly against panels of known transcripts. In this example we are going to use [STAR](https://academic.oup.com/bioinformatics/article/29/1/15/272537) to perform splice-aware alignment to the reference genome fasta. The manual can be found [here](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf). 

<div class="alert alert-block alert-danger">

This code will be run during the class, but uses more disk space than is available in Noteable. If you want to try it yourself, it will probably work on your personal computer. 

In [23]:
%%bash
mkdir -p analysis/Schistosoma_mansoni/star/ref

# unzip the reference files
#gunzip data/Schistosoma_mansoni/reference/schistosoma_mansoni.PRJEA36577.WBPS19.annotations.gtf.gz 
#gunzip data/Schistosoma_mansoni/reference/schistosoma_mansoni.PRJEA36577.WBPS19.genomic.fa.gz

# first we need to index the reference
mkdir -p analysis/Schistosoma_mansoni/star/ref
  STAR --runThreadN 4 \
    --runMode genomeGenerate \
    --genomeDir analysis/Schistosoma_mansoni/star/ref \
    --genomeFastaFiles data/Schistosoma_mansoni/reference/*genomic.fa \
    --sjdbGTFfile data/Schistosoma_mansoni/reference/*.gtf \
    --sjdbOverhang 75 \
    --genomeSAindexNbases 11

	STAR --runThreadN 4 --runMode genomeGenerate --genomeDir analysis/Schistosoma_mansoni/star/ref --genomeFastaFiles data/Schistosoma_mansoni/reference/schistosoma_mansoni.PRJEA36577.WBPS19.genomic.fa --sjdbGTFfile data/Schistosoma_mansoni/reference/schistosoma_mansoni.PRJEA36577.WBPS19.annotations.gtf --sjdbOverhang 75 --genomeSAindexNbases 11
	STAR version: 2.7.11b   compiled:  :/Users/distiller/project/STARcompile/source
Aug 16 15:45:17 ..... started STAR run
Aug 16 15:45:17 ... starting to generate Genome files
Aug 16 15:45:29 ..... processing annotations GTF
Aug 16 15:45:31 ... starting to sort Suffix Array. This may take a long time...
Aug 16 15:45:33 ... sorting Suffix Array chunks and saving them to disk...
Aug 16 15:51:21 ... loading chunks from disk, packing SA...
Aug 16 15:51:34 ... finished generating suffix array
Aug 16 15:51:34 ... generating Suffix Array index
Aug 16 15:51:43 ... completed Suffix Array index
Aug 16 15:51:43 ..... inserting junctions into the genome indices

<div class="alert alert-block alert-info">

In the code above:

`--runMode genomeGenerate` - directs STAR to run genome indexing

`--genomeDir /path/to/genomeDir` - specifies where to store the index

`--genomeFastaFiles /path/to/genome/fasta` - provides the reference genome

`--sjdbGTFfile /path/to/annotations.gtf` - provides the coordinates of splice junctions in the reference genome

`--sjdbOverhang ReadLength-1` - this specifies the length of sequences to include in the splice junctions database

We can now use this index to align each pair of readfiles against the reference.

In [24]:
%%bash
for accession in $(cat data/Schistosoma_mansoni/list_ids.txt)
do
    mkdir -p analysis/Schistosoma_mansoni/star/$accession
    
    STAR \
      --genomeDir analysis/Schistosoma_mansoni/star/ref \
      --runThreadN 4 \
      --readFilesIn <(gunzip -c analysis/Schistosoma_mansoni/qc/$accession*.fq.gz) \
      --outFileNamePrefix analysis/Schistosoma_mansoni/star/$accession/$accession \
      --outSAMtype BAM SortedByCoordinate \
      --limitBAMsortRAM 4000000 \
      --outSAMattributes Standard \
      --quantMode TranscriptomeSAM GeneCounts
done

	STAR --genomeDir analysis/Schistosoma_mansoni/star/ref --runThreadN 4 --readFilesIn /dev/fd/63 --outFileNamePrefix analysis/Schistosoma_mansoni/star/ERR022872/ERR022872 --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 4000000 --outSAMattributes Standard --quantMode TranscriptomeSAM GeneCounts
	STAR version: 2.7.11b   compiled:  :/Users/distiller/project/STARcompile/source
Aug 16 15:52:34 ..... started STAR run
Aug 16 15:52:34 ..... loading genome



Transcriptome.cpp:18:Transcriptome: exiting because of *INPUT FILE* error: could not open input file /geneInfo.tab
Solution: check that the file exists and you have read permission for this file
          SOLUTION: utilize --sjdbGTFfile /path/to/annotations.gtf option at the genome generation step or mapping step

Aug 16 15:52:39 ...... FATAL ERROR, exiting


	STAR --genomeDir analysis/Schistosoma_mansoni/star/ref --runThreadN 4 --readFilesIn /dev/fd/63 --outFileNamePrefix analysis/Schistosoma_mansoni/star/ERR022873/ERR022873 --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 4000000 --outSAMattributes Standard --quantMode TranscriptomeSAM GeneCounts
	STAR version: 2.7.11b   compiled:  :/Users/distiller/project/STARcompile/source
Aug 16 15:52:39 ..... started STAR run
Aug 16 15:52:39 ..... loading genome



Transcriptome.cpp:18:Transcriptome: exiting because of *INPUT FILE* error: could not open input file /geneInfo.tab
Solution: check that the file exists and you have read permission for this file
          SOLUTION: utilize --sjdbGTFfile /path/to/annotations.gtf option at the genome generation step or mapping step

Aug 16 15:52:42 ...... FATAL ERROR, exiting


	STAR --genomeDir analysis/Schistosoma_mansoni/star/ref --runThreadN 4 --readFilesIn /dev/fd/63 --outFileNamePrefix analysis/Schistosoma_mansoni/star/ERR022874/ERR022874 --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 4000000 --outSAMattributes Standard --quantMode TranscriptomeSAM GeneCounts
	STAR version: 2.7.11b   compiled:  :/Users/distiller/project/STARcompile/source
Aug 16 15:52:42 ..... started STAR run
Aug 16 15:52:42 ..... loading genome



Transcriptome.cpp:18:Transcriptome: exiting because of *INPUT FILE* error: could not open input file /geneInfo.tab
Solution: check that the file exists and you have read permission for this file
          SOLUTION: utilize --sjdbGTFfile /path/to/annotations.gtf option at the genome generation step or mapping step

Aug 16 15:52:46 ...... FATAL ERROR, exiting


	STAR --genomeDir analysis/Schistosoma_mansoni/star/ref --runThreadN 4 --readFilesIn /dev/fd/63 --outFileNamePrefix analysis/Schistosoma_mansoni/star/ERR022875/ERR022875 --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 4000000 --outSAMattributes Standard --quantMode TranscriptomeSAM GeneCounts
	STAR version: 2.7.11b   compiled:  :/Users/distiller/project/STARcompile/source
Aug 16 15:52:46 ..... started STAR run
Aug 16 15:52:46 ..... loading genome
Error while terminating subprocess (pid=98231): 


<div class="alert alert-block alert-info">

In the code above:

`<(gunzip -c reads.fq.gz)` - this uncompresses the read sequences to input to STAR which does not support compressed files

`--outSAMtype BAM SortedByCoordinate` - sort and compress the output

`--outSAMattributes Standard` - include some standard count information in the output file

<div class="alert alert-block alert-warning">

Question:

8. Using the STAR manual, what outputs are generated using the flags `--quantMode TranscriptomeSAM GeneCounts`?

9. Find a python library which can load a SAM file

<div class="alert alert-block alert-success">

Answers:

8. (p18) With --quantMode TranscriptomeSAM option STAR will output alignments translated into transcript coordinates in the Aligned.toTranscriptome.out.bam file (in addition to alignments in genomic coordinates in Aligned.*.sam/bam files).
   
    A SAM file usually records the position and how each read lines up against the reference genome. This instead writes down how the reads map against the transcripts in the reference genome. A BAM is a compressed SAM file.

   (p18) STAR outputs read counts per gene into ReadsPerGene.out.tab file with 4 columns which correspond to different strandedness options:

    column 1: gene ID
    column 2: counts for unstranded RNA-seq
    column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes
    column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)

    With --quantMode TranscriptomeSAM GeneCounts, it will output both the Aligned.toTranscriptome.out.bam and ReadsPerGene.out.tab outputs.

9. [pysam](https://pysam.readthedocs.io/en/latest/api.html) is an example of a python library which can load a SAM file

We will be running STAR on each of the full datasets and will make the mapped read files available for the next class.

# Extension

There also exist several methods for transcript abundance quantification using `pseudo-alignment`. These methods don't fully line up reads against the reference genome or transcript sequences, but instead count the occurance of substrings of these transcripts and use this to estimate transcript abundances. 

One example of this method is [Kallisto](https://www.nature.com/articles/nbt.3519)

In [None]:
! conda install --yes --quiet bioconda::kallisto=0.48

In [None]:
%%bash

mkdir -p analysis/Schistosoma_mansoni/kallisto/

kallisto index --index=analysis/Schistosoma_mansoni/kallisto/smansoni data/Schistosoma_mansoni/reference/schistosoma_mansoni.PRJEA36577.WBPS19.mRNA_transcripts.fa.gz

for accession in $(cat data/Schistosoma_mansoni/subsampled/list_ids.txt)
do
    kallisto quant --threads=2 \
      --index=analysis/Schistosoma_mansoni/kallisto/smansoni \
      --output-dir=analysis/Schistosoma_mansoni/kallisto \
      --gtf=data/Schistosoma_mansoni/reference/schistosoma_mansoni.PRJEA36577.WBPS19.annotations.gtf.gz \
      analysis/Schistosoma_mansoni/qc/"$accession"_1.trimmed.fastq.gz analysis/Schistosoma_mansoni/qc/"$accession"_2.trimmed.fastq.gz
done


[build] loading fasta file data/Schistosoma_mansoni/reference/schistosoma_mansoni.PRJEA36577.WBPS19.mRNA_transcripts.fa.gz
[build] k-mer length: 31
        from 1 target sequences
[build] counting k-mers ... done.


<div class="alert alert-block alert-warning">

Question:

10. Give 2 differences between the methods used above by STAR and Kallisto to quantify transcript abundances

<div class="alert alert-block alert-success">

Answers:

10. - STAR performs splice aware alignment to the reference genome. Kallisto instead takes as input a file of transcripts so does not need to be aware of splicing.
    - Kallisto uses approximate pseudo-alignment rather than full alignment so will not give you a SAM/BAM file output
    - Kallisto is much quicker and less memory intensive than STAR.
    - Kallisto gives transcript level expression, whilst STAR + counting only gives gene-level (transcript level can be found by following on with RSEM)
    - Kallisto can't quantify genes or splice-variants that are not in the input transcript file. Differences between the real transcriptome and your annotation of the transcriptome will reduce accuracy. This means it cannot be used for finding new genes, transcripts or splice forms, or for any analysis other than quantification.