<a href="https://colab.research.google.com/github/taejoonlab/BloodSweatTears/blob/main/rnaseq/kallisto_index.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

Kallisto is an ultra-fast pseudoalignment tool for RNA-seq quantification. 

Check the homepage https://pachterlab.github.io/kallisto/ for more details.

Here we will build a reference index for running kallisto. Several reference transcriptome databases are available, such as NCBI RefSeq and EnsEMBL. We will use GENCODE (https://www.gencodegenes.org/human/) as the reference.


# Set up Google Drive
We will download files and build a reference index that we will reuse later.
Because the Colab workspace is reset for every new session (your files will be gone next time), it is a good idea to save them to your Google Drive.

Mount your Google Drive space as below. See more options at https://colab.research.google.com/notebooks/io.ipynb

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Move to the directory you want to use. I will use "MyDrive/BloodSweatTears/" here. If the directory is not available, please create it first through your Google Drive client.

In [24]:
!ls /content/drive/MyDrive/BloodSweatTears/

gencode.v40.transcripts.fa.gz
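
If you would rather create the directory from the notebook than through the Google Drive client, a minimal sketch could look like the following. It assumes Drive is already mounted at /content/drive; `ensure_dir` is a hypothetical helper name, not part of Colab.

```python
from pathlib import Path


def ensure_dir(path):
    """Create `path` (and any missing parent directories) if it does not exist yet."""
    directory = Path(path)
    # exist_ok=True makes the call safe to repeat in later sessions.
    directory.mkdir(parents=True, exist_ok=True)
    return directory


# In Colab, after mounting Drive, one would call for example:
# ensure_dir("/content/drive/MyDrive/BloodSweatTears")
```

Because the call is safe to repeat, it can stay at the top of the notebook and run in every session.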


# Download FASTA file

Download the FASTA file for all transcripts (including splice variants) from https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.transcripts.fa.gz

(FYI, the link above is for human GENCODE release 40. The version number in the URL and filename will change if you use a different release. Just use the latest version on the GENCODE website.)

- A "!" at the start of a line means the rest of the line is run as a terminal command. See https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/01.05-IPython-And-Shell-Commands.ipynb for more details.
- "wget" is a tool to download a file from a URL. It is available by default in Google Colab, so you do not need to install it.

In [30]:
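# Remove leftovers from a previous run (-f: no error if nothing matches)
! rm -f gencode.*fa*

# For the first time: download the file and save a copy to Google Drive
# ! wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.transcripts.fa.gz
# ! cp gencode.v40.transcripts.fa.gz /content/drive/MyDrive/BloodSweatTears/

# Next time: copy the saved file from Google Drive instead of downloading again
! cp /content/drive/MyDrive/BloodSweatTears/gencode.v40.transcripts.fa.gz .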
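
The "first time / next time" pattern in the cell above can also be written as one Python function, so that no lines need to be commented in and out. This is only a sketch: `fetch_with_cache` is a hypothetical helper name, and it assumes the cache directory (for example the Google Drive folder) already exists.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_with_cache(url, cache_dir, work_dir=".", download=urllib.request.urlretrieve):
    """Return a local copy of the file at `url`, downloading only if no cached copy exists."""
    name = url.rsplit("/", 1)[-1]      # file name = last part of the URL
    cached = Path(cache_dir) / name    # copy kept between sessions
    local = Path(work_dir) / name      # copy in the current workspace
    if not cached.exists():
        # First time: download to the workspace, then save a copy to the cache.
        download(url, str(local))
        shutil.copy(local, cached)
    elif not local.exists():
        # Next time: reuse the cached copy instead of downloading again.
        shutil.copy(cached, local)
    return local


# Example call in Colab (paths taken from this notebook):
# fetch_with_cache(
#     "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.transcripts.fa.gz",
#     "/content/drive/MyDrive/BloodSweatTears",
# )
```

The `download` argument only exists so that the function can be tried out without network access; by default it uses `urllib.request.urlretrieve` rather than wget.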

Check the downloaded file.

- "ls" is a command to list files and directories.
- "gunzip" is a command to uncompress a gzip file.
- "head" prints the first 10 lines of a text file.

In [31]:
! ls
! gunzip gencode.v40.transcripts.fa.gz
! head gencode.v40.transcripts.fa


drive			       kallisto			      sample_data
gencode.v40.transcripts.fa.gz  kallisto_linux-v0.46.1.tar.gz
>ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|
GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTC
TCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGA
TGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTG
CAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGG
GCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGCAT
AGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAG
TGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACG
ACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGCAGGGCCATCA
GGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACACCCGGCACCCTG
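
The header line (starting with ">") in the output above packs several fields separated by "|". As a rough sketch, it could be split into named fields as below. The field names are my own labels inferred from this example header, not an official schema, and `parse_gencode_header` is a hypothetical helper; headers with a different layout would need different handling.

```python
# Labels assumed from the example header shown above.
GENCODE_FIELDS = (
    "transcript_id",
    "gene_id",
    "havana_gene",
    "havana_transcript",
    "transcript_name",
    "gene_name",
    "length",
    "transcript_type",
)


def parse_gencode_header(line):
    """Split a '>'-prefixed, '|'-separated FASTA header into a dict of fields."""
    # Drop the leading '>', the newline, and the trailing '|' before splitting.
    parts = line.lstrip(">").rstrip().rstrip("|").split("|")
    return dict(zip(GENCODE_FIELDS, parts))


header = (
    ">ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|"
    "OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|"
)
record = parse_gencode_header(header)
print(record["transcript_id"], record["gene_name"], record["length"])
```

Note that all values, including the length, stay as text here.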


# Install kallisto

You can install kallisto simply by downloading a prebuilt release. Check the latest release version at https://pachterlab.github.io/kallisto/download .

- "tar xvzf filename.tar.gz" is a command to untar and uncompress the file.

To run kallisto, give the path to the executable file. Just typing 'kallisto' will not work, because the unpacked directory is not on the PATH. Use "./kallisto/kallisto" instead. If the installation is complete, this command prints the kallisto usage message. If so, you are ready to go.

In [27]:
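# Remove leftovers from a previous run
! rm -rf kallisto*

# For the first time: download the release and save a copy to Google Drive
# ! wget https://github.com/pachterlab/kallisto/releases/download/v0.46.1/kallisto_linux-v0.46.1.tar.gz
# ! cp kallisto_linux-v0.46.1.tar.gz /content/drive/MyDrive/BloodSweatTears/

# Next time: copy the saved release from Google Drive, then unpack and test it
! cp /content/drive/MyDrive/BloodSweatTears/kallisto_linux-v0.46.1.tar.gz .
! tar xvzf kallisto_linux-v0.46.1.tar.gz
! ls
! ./kallisto/kallisto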

kallisto/
kallisto/test/
kallisto/README.md
kallisto/kallisto
kallisto/license.txt
kallisto/test/reads_1.fastq.gz
kallisto/test/transcripts.fasta.gz
kallisto/test/README.md
kallisto/test/chrom.txt
kallisto/test/Snakefile
kallisto/test/reads_2.fastq.gz
kallisto/test/transcripts.gtf.gz
drive			       kallisto			      sample_data
gencode.v40.transcripts.fa.gz  kallisto_linux-v0.46.1.tar.gz
kallisto 0.46.1

Usage: kallisto <CMD> [arguments] ..

Where <CMD> can be one of:

    index         Builds a kallisto index 
    quant         Runs the quantification algorithm 
    bus           Generate BUS files for single-cell data 
    pseudo        Runs the pseudoalignment step 
    merge         Merges several batch runs 
    h5dump        Converts HDF5-formatted results to plaintext
    inspect       Inspects and gives information about an index
    version       Prints version information
    cite          Prints citation information

Running kallisto <CMD> without arguments prints usage infor

In [32]:
!./kallisto/kallisto index -i gencode_v40_tx.kallisto_idx gencode.v40.transcripts.fa 


[build] loading fasta file gencode.v40.transcripts.fa
[build] k-mer length: 31
        from 1961 target sequences
        with pseudorandom nucleotides
[build] counting k-mers ... tcmalloc: large alloc 1610612736 bytes == 0x84a4c000 @  0x7ff29295c1e7 0x6f292d 0x6f29a9 0x4adbe9 0x4a5db8 0x4acf59 0x44e0c2 0x7ff291978c87 0x452d59
tcmalloc: large alloc 3221225472 bytes == 0xe4a4c000 @  0x7ff29295c1e7 0x6f292d 0x6f29a9 0x4adbe9 0x4a5db8 0x4acf59 0x44e0c2 0x7ff291978c87 0x452d59
tcmalloc: large alloc 6442450944 bytes == 0x1a527e000 @  0x7ff29295c1e7 0x6f292d 0x6f29a9 0x4adbe9 0x4a5db8 0x4acf59 0x44e0c2 0x7ff291978c87 0x452d59
done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 1592242 contigs and contains 148002292 k-mers 
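
In the `index` command above, the name after `-i` is the index file to write and the last argument is the transcript FASTA. Since the log shows allocations of several GB, it may be worth skipping the build when the index file is already there. A minimal sketch, assuming the same paths as in this notebook; `index_command` and `build_index_if_missing` are hypothetical helper names:

```python
import subprocess
from pathlib import Path


def index_command(kallisto, index_path, fasta_path):
    """Return the argument list for `kallisto index -i <index> <fasta>`."""
    return [str(kallisto), "index", "-i", str(index_path), str(fasta_path)]


def build_index_if_missing(kallisto, index_path, fasta_path, run=subprocess.run):
    """Build the index unless `index_path` already exists; return True if a build ran."""
    if Path(index_path).exists():
        return False
    # check=True raises CalledProcessError if kallisto exits with an error.
    run(index_command(kallisto, index_path, fasta_path), check=True)
    return True


# Example call with the file names used in this notebook:
# build_index_if_missing(
#     "./kallisto/kallisto",
#     "gencode_v40_tx.kallisto_idx",
#     "gencode.v40.transcripts.fa",
# )
```

The `run` argument is only there so the logic can be tried without the kallisto binary; normally the default `subprocess.run` is used.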



In [33]:
! ls -lh
! cp gencode_v40_tx.kallisto_idx /content/drive/MyDrive/BloodSweatTears/

total 3.6G
drwx------ 7 root root  4.0K Apr 29 01:14 drive
-rw------- 1 root root  433M Apr 29 01:44 gencode.v40.transcripts.fa
-rw-r--r-- 1 root root  3.2G Apr 29 01:55 gencode_v40_tx.kallisto_idx
drwxr-xr-x 3  501 staff 4.0K Nov  4  2019 kallisto
-rw------- 1 root root  7.0M Apr 29 01:42 kallisto_linux-v0.46.1.tar.gz
drwxr-xr-x 1 root root  4.0K Apr 25 13:46 sample_data
