<a href="https://colab.research.google.com/github/taejoonlab/BloodSweatTears/blob/main/rnaseq/kallisto_quant.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gene expression quantification

## Prepare the kallisto environment

For the index-building step, see https://github.com/taejoonlab/BloodSweatTears/blob/main/rnaseq/kallisto_index.ipynb

We assume that your kallisto index is available on your Google Drive.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

!ls /content/drive/MyDrive/BloodSweatTears/

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
gencode.v40.transcripts.fa.gz  kallisto_linux-v0.46.1.tar.gz
gencode_v40_tx.kallisto_idx


Copy the kallisto index from Google Drive to the local Colab working directory.

In [5]:
! cp /content/drive/MyDrive/BloodSweatTears/gencode_v40_tx.kallisto_idx .
! ls

drive			     kallisto			    sample_data
gencode_v40_tx.kallisto_idx  kallisto_linux-v0.46.1.tar.gz


In [3]:
! cp /content/drive/MyDrive/BloodSweatTears/kallisto_linux-v0.46.1.tar.gz .
! tar xvzf kallisto_linux-v0.46.1.tar.gz

kallisto/
kallisto/test/
kallisto/README.md
kallisto/kallisto
kallisto/license.txt
kallisto/test/reads_1.fastq.gz
kallisto/test/transcripts.fasta.gz
kallisto/test/README.md
kallisto/test/chrom.txt
kallisto/test/Snakefile
kallisto/test/reads_2.fastq.gz
kallisto/test/transcripts.gtf.gz


## Download the raw RNA-seq file

In this example, we will analyze the blood transcriptomes of COVID-19 patients from Zhang *et al.*, *Front. Cell. Infect. Microbiol.* (2022) https://doi.org/10.3389/fcimb.2021.821828

The raw data are available from NCBI GEO under accession number GSE189263: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE189263

To download the raw FASTQ files, you can also use the EBI European Nucleotide Archive (ENA) at https://www.ebi.ac.uk/ena/browser/view/PRJNA782306 (you can get there by searching for "GSE189263" on the ENA home page, https://www.ebi.ac.uk/ena/browser/home). The download links are listed in the "Generated FASTQ files: FTP" column of the table.
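
When many runs need to be downloaded, copying links from the ENA table by hand gets tedious. The sketch below builds the FTP URLs from a run accession. It assumes the ENA directory layout that is visible in the URLs used in this notebook (first six characters of the accession, then a zero-padded sub-directory made of the digits after the ninth character, then the accession itself). `ena_fastq_urls` is a hypothetical helper written for this example, not part of any library, so confirm the result against the "Generated FASTQ files: FTP" column before relying on it.

```python
def ena_fastq_urls(run_accession, paired=True):
    """Build ENA FTP URLs for the generated FASTQ files of one run.

    Assumed layout (not taken from an official client library):
    /vol1/fastq/<first 6 chars>/<sub-directory>/<run accession>/
    The sub-directory holds the digits after the 9th character of the
    accession, zero-padded to three digits; it is absent for
    9-character accessions.
    """
    base = "ftp://ftp.sra.ebi.ac.uk/vol1/fastq"
    parts = [base, run_accession[:6]]
    extra_digits = run_accession[9:]
    if extra_digits:
        parts.append(extra_digits.zfill(3))
    parts.append(run_accession)
    directory = "/".join(parts)
    # Paired-end runs are assumed to use _1/_2 suffixes, single-end runs none.
    suffixes = ("_1", "_2") if paired else ("",)
    return [f"{directory}/{run_accession}{s}.fastq.gz" for s in suffixes]


for url in ena_fastq_urls("SRR16992513"):
    print(url)
```

For SRR16992513 this reproduces the two URLs passed to `wget` in the next cell.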


In [12]:
# Download the read 1 (FQ1) file
! wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR169/013/SRR16992513/SRR16992513_1.fastq.gz

# Download the read 2 (FQ2) file
! wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR169/013/SRR16992513/SRR16992513_2.fastq.gz

--2022-05-01 08:52:02--  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR169/013/SRR16992513/SRR16992513_1.fastq.gz
           => ‘SRR16992513_1.fastq.gz’
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.193.138|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /vol1/fastq/SRR169/013/SRR16992513 ... done.
==> SIZE SRR16992513_1.fastq.gz ... 2839073182
==> PASV ... done.    ==> RETR SRR16992513_1.fastq.gz ... done.
Length: 2839073182 (2.6G) (unauthoritative)


2022-05-01 08:55:47 (12.1 MB/s) - ‘SRR16992513_1.fastq.gz’ saved [2839073182]

--2022-05-01 08:55:47--  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR169/013/SRR16992513/SRR16992513_2.fastq.gz
           => ‘SRR16992513_2.fastq.gz’
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.193.138|:21... connected.
Logging in as 

In [14]:
! ls -lh

total 8.6G
drwx------ 7 root root  4.0K May  1 06:35 drive
-rw------- 1 root root  3.2G May  1 07:51 gencode_v40_tx.kallisto_idx
drwxr-xr-x 3  501 staff 4.0K Nov  4  2019 kallisto
-rw------- 1 root root  7.0M May  1 06:37 kallisto_linux-v0.46.1.tar.gz
drwxr-xr-x 1 root root  4.0K Apr 29 03:19 sample_data
drwxr-xr-x 2 root root  4.0K May  1 08:44 SRR10355151_C1.kallisto_quant
-rw-r--r-- 1 root root  2.7G May  1 08:55 SRR16992513_1.fastq.gz
-rw-r--r-- 1 root root  2.9G May  1 08:57 SRR16992513_2.fastq.gz


## Run kallisto


For single-end RNA-seq data, you need to set the mean fragment length (-l) and its standard deviation (-s), which you can get from the library QC data. If you do not know them, a typical choice is '-l 200 -s 30' (mean fragment length of 200 bp; standard deviation of 30 bp).

```
$ kallisto quant --single -l <fragment length> -s <fragment length standard deviation> -t <number of threads> -i <index name>  -o <out directory name> <FQ1>
```

For paired-end data, these values are estimated automatically from the data.

```
$ kallisto quant -t <number of threads> -i <index name>  -o <out directory name> <FQ1> <FQ2>
```
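
The two command forms above differ only in the `--single -l -s` options, so they can be assembled by one small function when looping over samples. This is a minimal sketch: `kallisto_quant_cmd` is a hypothetical helper, the default binary path `./kallisto/kallisto` is the one unpacked earlier in this notebook, and the helper deliberately handles only one FASTQ file or one FASTQ pair per sample.

```python
import subprocess


def kallisto_quant_cmd(index, out_dir, fastqs, threads=1,
                       frag_len=None, frag_sd=None,
                       kallisto="./kallisto/kallisto"):
    """Return the argument list for a `kallisto quant` run.

    One FASTQ file is treated as single-end (needs frag_len and frag_sd);
    two FASTQ files are treated as one paired-end library.
    """
    fastqs = list(fastqs)
    cmd = [kallisto, "quant", "-t", str(threads), "-i", index, "-o", out_dir]
    if len(fastqs) == 1:
        if frag_len is None or frag_sd is None:
            raise ValueError("single-end data needs frag_len (-l) and frag_sd (-s)")
        cmd += ["--single", "-l", str(frag_len), "-s", str(frag_sd)]
    elif len(fastqs) != 2:
        raise ValueError("expected one FASTQ file (single-end) or two (paired-end)")
    return cmd + fastqs


cmd = kallisto_quant_cmd(
    "gencode_v40_tx.kallisto_idx",
    "SRR16992513.kallisto_quant",
    ["SRR16992513_1.fastq.gz", "SRR16992513_2.fastq.gz"],
)
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually run kallisto
```

Passing the command as a list to `subprocess.run` avoids shell quoting problems with unusual file names.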


In [15]:
! ./kallisto/kallisto quant -t 1 -i gencode_v40_tx.kallisto_idx  -o SRR16992513_Healthy-3rd-4.kallisto_quant SRR16992513_1.fastq.gz SRR16992513_2.fastq.gz


[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 246,624
[index] number of k-mers: 148,002,292
tcmalloc: large alloc 6442450944 bytes == 0x21ae000 @  0x7f4249bc21e7 0x6f292d 0x6f29a9 0x4adbe9 0x4a7d60 0x44eb65 0x7f4248bdec87 0x452d59
[index] number of equivalence classes: 1,057,791
[quant] running in paired-end mode
[quant] will process pair 1: SRR16992513_1.fastq.gz
                             SRR16992513_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 58,812,362 reads, 25,261,116 reads pseudoaligned
[quant] estimated average fragment length: 151.84
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 1,090 rounds
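
The `[quant] processed ... reads pseudoaligned` line above is a quick quality check: here roughly 43% of the reads were pseudoaligned. The sketch below extracts that rate from the log text. It assumes the log has already been captured as a string (how you capture it is up to you), and `pseudoalignment_rate` is a hypothetical helper, not a kallisto feature.

```python
import re


def pseudoalignment_rate(log_text):
    """Parse '[quant] processed N reads, M reads pseudoaligned' from a kallisto log.

    Returns (processed, pseudoaligned, fraction pseudoaligned).
    """
    match = re.search(
        r"processed ([\d,]+) reads, ([\d,]+) reads pseudoaligned", log_text
    )
    if match is None:
        raise ValueError("no '[quant] processed ...' line found in the log")
    # The counts are printed with thousands separators, so strip the commas.
    processed, aligned = (int(g.replace(",", "")) for g in match.groups())
    return processed, aligned, aligned / processed


log = "[quant] processed 58,812,362 reads, 25,261,116 reads pseudoaligned"
processed, aligned, rate = pseudoalignment_rate(log)
print(f"{aligned:,} of {processed:,} reads pseudoaligned ({rate:.1%})")
```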



In [16]:
! head SRR16992513_Healthy-3rd-4.kallisto_quant/*tsv

target_id	length	eff_length	est_counts	tpm
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|	1657	1506.16	30.5405	0.51664
ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene|	632	481.209	0	0
ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene|	1351	1200.16	124.093	2.63445
ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|	68	17.1792	0	0
ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|lncRNA|	712	561.16	0	0
ENST00000469289.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-2HG-201|MIR1302-2HG|535|lncRNA|	535	384.292	0	0
ENST00000607096.1|ENSG00000284332.1|-|-|MIR1302-2-201|MIR1302-2|138|miRNA|	138	22.9325	0	0
ENST00000417324.1|ENSG00000237613.2|OTTHUMG00000
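
`abundance.tsv` reports one row per transcript, while the `target_id` of a GENCODE transcript also carries the gene name as its sixth `|`-separated field. A simple way to get gene-level values is to sum TPM over the transcripts of each gene. The sketch below does this with the standard library only; `gene_level_tpm` is a hypothetical helper, and the three sample rows are copied from the output above.

```python
import csv
import io
from collections import defaultdict


def gene_level_tpm(handle):
    """Sum transcript TPM per gene name from a kallisto abundance.tsv handle.

    Assumes GENCODE-style target_id values, where the 6th '|'-separated
    field is the gene name; other IDs are kept as their own "gene".
    """
    totals = defaultdict(float)
    for row in csv.DictReader(handle, delimiter="\t"):
        fields = row["target_id"].split("|")
        gene = fields[5] if len(fields) > 5 else row["target_id"]
        totals[gene] += float(row["tpm"])
    return dict(totals)


sample = (
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1"
    "|DDX11L1-202|DDX11L1|1657|processed_transcript|\t1657\t1506.16\t30.5405\t0.51664\n"
    "ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2"
    "|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene|\t632\t481.209\t0\t0\n"
    "ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1"
    "|WASH7P-201|WASH7P|1351|unprocessed_pseudogene|\t1351\t1200.16\t124.093\t2.63445\n"
)
tpm_by_gene = gene_level_tpm(io.StringIO(sample))
for gene, tpm in tpm_by_gene.items():
    print(gene, tpm)
```

For the real output, open `SRR16992513_Healthy-3rd-4.kallisto_quant/abundance.tsv` and pass the file handle instead of the `io.StringIO` sample.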

In [17]:
! tar cvzpf SRR16992513_Healthy-3rd-4.kallisto_quant.tgz SRR16992513_Healthy-3rd-4.kallisto_quant/
! cp SRR16992513_Healthy-3rd-4.kallisto_quant.tgz /content/drive/MyDrive/BloodSweatTears/

SRR16992513_Healthy-3rd-4.kallisto_quant/
SRR16992513_Healthy-3rd-4.kallisto_quant/abundance.h5
SRR16992513_Healthy-3rd-4.kallisto_quant/run_info.json
SRR16992513_Healthy-3rd-4.kallisto_quant/abundance.tsv
