Robert Lehmann edited this page Mar 14, 2019 · 2 revisions

This wiki page (thanks to RodenLuo for the writeup) is intended as an auxiliary guide for the practical part of the compBio2019 course. It is recommended to skim the provided course work PDFs at least once and then read this guide. After reading this guide, you can go back to the course work PDFs to do the analyses. Good luck.

Work on IBEX

Basics

First and most importantly, you need to understand what IBEX is. This was covered in the lecture before the hands-on session, so I will not go into much depth. The basic idea is that your local computer (your laptop) is often not powerful enough for scientific computation, so you need to use much more powerful machine(s). In this course, we use ibex.

To access ibex: if you are on a Mac, open the app called Terminal; if you are on Windows, download and install "babun", then open it (you can install any other shell application on Windows, but I suggest babun). Then type the login command in the newly opened window (check the screenshot below). I will use my user name, luod, to demonstrate; you will need to change it to your own. You will need to type your password (unless you have set up SSH key authentication).
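For readers who cannot see the screenshot, the login command has roughly the shape sketched below. `luod` is the example user name from this guide, and `<ibex_login_address>` is a placeholder, not a real address; use the one given in the lecture or on the ibex training page. The snippet only prints the command so you can check it before typing it yourself.

```shell
# Placeholders (assumptions): replace both values with your own.
IBEX_USER="luod"                    # example user name used in this guide
IBEX_HOST="<ibex_login_address>"    # placeholder, NOT a real address

# Compose the login command and print it for inspection.
LOGIN_COMMAND="ssh ${IBEX_USER}@${IBEX_HOST}"
echo "$LOGIN_COMMAND"
```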

Image from Gyazo

Once you see the welcome message, as I do, you are on ibex. On ibex you no longer have a mouse or the nice-looking graphics of your laptop, so you need to know how to navigate (find folders and files, use applications). This was covered in the lecture, but here is the tutorial I used when I first learned unix/linux (you can also take a look at the ibex training page). Go through at least "Tutorial One" and "Tutorial Two". At a minimum, you need to master the following commands (if one is not covered in the tutorial, google it; when you google a linux command, add the word "linux" after it, e.g. google "pwd linux"):

ls
cd
pwd
cp
mv
cat
nano
vi # talk about this in a sec
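If you want to try most of these commands safely before touching the course files, the following sketch runs them in a throwaway folder. The folder and file names (`practice`, `notes.txt` and so on) are made up for illustration, and `mkdir` (make a folder) is added only so there is something to work with. nano and vi are left out because they are interactive editors.

```shell
# Create a throwaway folder so nothing important is touched.
WORKDIR="$(mktemp -d)"
cd "$WORKDIR"
pwd                                  # print the folder you are in

mkdir practice                       # make a new folder (hypothetical name)
cd practice                          # enter it
echo "hello ibex" > notes.txt        # create a small text file
ls                                   # list files: shows notes.txt

cp notes.txt copy.txt                # copy a file
mv copy.txt renamed.txt              # rename (move) the copy
ls                                   # now shows notes.txt and renamed.txt
cat renamed.txt                      # print the file content
```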

That covers how to move into a certain folder. Next you need to know how to edit a file on linux. That is where vi comes in handy. You can google "vi tutorial". I found this one very interesting.
You can also choose nano instead of vi (some people find it easier); in that case, google "nano tutorial". Below I compare desktop graphics and the command line, using vi and cat as examples.

vi and cat on command line vs. desktop graphics

vi

When you type vi README, it is like saying, "Hey, Linux, please use the app called vi to open README". The corresponding operation on a graphical desktop is double-clicking README.docx. You do that with your mouse, but behind the scenes the computer is running something like MS_Word README.docx.

Image from Gyazo

After you hit enter, you are in the vi application, just like being in MS Word. You may need to work through the vi tutorial I mentioned above. One thing to keep in mind: on Linux you lose all the nice features of MS Word, such as beautiful fonts, font sizes, colors, tables and pictures. You only have "plain text"! (As you learn more, you may find some fancier things in linux too.)

Image from Gyazo

cat

vi is for editing. cat is for viewing. (The name actually comes from "concatenate", but we will not go that deep.) Think of vi as MS Word, and of cat as "Preview" on Mac OS or "Adobe Acrobat Reader" on Windows.

Image from Gyazo
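As a small illustration of both uses of cat, the sketch below first views one file and then concatenates two. The file names are invented for the example.

```shell
# Work in a throwaway folder; the file names are hypothetical.
CAT_DEMO_DIR="$(mktemp -d)"
cd "$CAT_DEMO_DIR"
printf 'first line\n' > part1.txt
printf 'second line\n' > part2.txt

cat part1.txt                              # view one file
cat part1.txt part2.txt                    # view both, one after the other
cat part1.txt part2.txt > combined.txt     # "concatenate" them into a new file
```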

For course work

Familiarize yourself with the above commands and then turn to the homework. I have seen several people running code as shown in the screenshot.

Image from Gyazo

This is not the intended way to run the code and finish the work (the instructors mentioned this, but it was probably not heard or not understood).

Locate the code

You DO NOT have to write or copy-paste the code yourself. All the code is already written. You only need to change a few things.

Image from Gyazo

Read carefully and compare the code in the PDF with the contents of these so-called script files. You will find that they are very similar. In fact, these scripts are exactly the code we need to run.

Image from Gyazo

How to run code/script

The way you run these scripts is by typing sbatch <script_name>. (And this is the way to submit most of your jobs to ibex.) See screenshot below.

Image from Gyazo

You can use squeue to see all jobs by all users. Use squeue | grep <user_name> to see your own jobs. You can see the job state in the output of squeue.
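A slightly more direct way to list only your own jobs is squeue's `-u` option, sketched below. The guard around it is only there so the snippet does not fail on a machine without slurm, such as your laptop.

```shell
# Use the current user name; on ibex this is your ibex user name.
MY_USER="${USER:-$(id -un)}"

if command -v squeue >/dev/null 2>&1; then
    squeue -u "$MY_USER"             # show only jobs that belong to you
else
    echo "squeue not found: run this after logging in to ibex"
fi
```

In the ST column of the squeue output, PD means the job is pending and R means it is running.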

What is the meaning of "sbatch <script_name>"

This is not of immediate interest to the course work, but understanding it will help a lot.

I will use a real-world example to explain this. But before that, you need to know about "login node vs. compute node"; see the screenshot below (taken from page 4 of "IBEX Hardware and Software stack" on the ibex training page; the screenshot itself originally comes from https://wiki.rc.hms.harvard.edu/).

Image from Gyazo

In short:

  1. When we run ssh <user_name>@<ip_address_or_domain> to log in to ibex, we (the user) go from the local computer to the login node (a computer that is powerful, but may not be as powerful as the compute nodes).
  2. When we run "sbatch <script_name>", we are using sbatch to send the job described in "<script_name>" to the "slurm job manager". slurm then asks one (or more) of the compute node(s) to execute the job.
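To make the second point concrete, here is a minimal sketch of what a job script can look like and how it would be submitted. It is not one of the course scripts: the file name `hello_job.sh` and the resource values are assumptions chosen for illustration, and the course scripts will request different resources. Lines starting with `#SBATCH` are read by slurm; everything else is ordinary shell that runs on the compute node.

```shell
# Write a tiny, hypothetical job script into a throwaway folder.
JOB_DIR="$(mktemp -d)"
JOB_SCRIPT="$JOB_DIR/hello_job.sh"
cat > "$JOB_SCRIPT" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
#SBATCH --output=hello_%j.out
echo "Running on $(hostname)"
EOF

# Submit only if sbatch exists, i.e. when you are logged in to the cluster.
if command -v sbatch >/dev/null 2>&1; then
    sbatch "$JOB_SCRIPT"
else
    echo "sbatch not found: run this on ibex, not on your laptop"
fi
```

The `%j` in the output file name is replaced by slurm with the job ID, so each run writes its own log file.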

An analogy using a real-world example (simplified; it does not match every step exactly, but it is enough to get the idea): I want someone to make breakfast for me. I write a document describing what needs to be done and send it via Gmail to someone, say Sarah. Sarah then carries out all the necessary steps. Finally, I get the breakfast, which is the "result". See the screenshot below for a comparison of this example with the real bioinformatics analysis command.

Image from Gyazo

Reading tips: Zoom the picture in the browser. Read left and right in parallel. Read from circled 1 to circled 4.

What exactly needs to be done in the course work?

If you read carefully the course work PDF for DNA assembly, "Computational Bioscience_Practical_Genomics.pdf", you will find the following words.

Image from Gyazo

In short, we are supposed to

  1. Run all the scripts for 40x reads. (everything is already written there, just run, i.e. just sbatch.)
  2. Change the input and output of each script so that everything stays connected, and run them for 10x, 20x and 80x. Then do what is written there (observe the impact of the input data quality on the resulting genome assembly).
  3. For task number 2, only run the QC steps. Then, based on the results and impressions from task 1, predict the quality of the genome assembly rather than running the assembly steps (which could be done, but would need some other databases, I guess).

What exactly needs to be changed?

See the screenshot below. Every "40x" is there because the whole pipeline is written for the "40x" dataset. You need to change all of them to run the pipeline for 10x, 20x and 80x.

Image from Gyazo
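One way to do this without retyping is to let sed make a copy of a script for each coverage, replacing every "40x" on the way. The sketch below demonstrates the idea on a made-up stand-in script: `01_assembly_40x.sh` and its content are hypothetical, so substitute the real script names from the course folder and check each generated file before submitting it.

```shell
# Demo in a throwaway folder with a hypothetical stand-in for a course script.
SED_DEMO_DIR="$(mktemp -d)"
cd "$SED_DEMO_DIR"
cat > 01_assembly_40x.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=assembly_40x
INPUT=reads_40x.fastq
OUTPUT=assembly_40x
EOF

# Make one copy per coverage, replacing every "40x" with the new value.
for COVERAGE in 10x 20x 80x; do
    sed "s/40x/${COVERAGE}/g" 01_assembly_40x.sh > "01_assembly_${COVERAGE}.sh"
done

grep -c "10x" 01_assembly_10x.sh     # count lines that now mention 10x
cat 01_assembly_80x.sh               # inspect a generated script
```

Writing to new files keeps the original 40x script untouched, so you can always compare against it.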

Before you go

Enter the ./data/ folder and run ./README if you have not done so yet. This downloads some additional data, specifically the databases needed by some of the apps (check the content of the README). This step takes some time.

Read again the course work PDF.

The RNA-seq part, DESeq2, will come soon. (If you find this useful.)

One more thing: download data from ibex

This is mentioned in the PDF here.

Image from Gyazo

What you do is:

  1. On your local computer, in terminal, run scp.
  2. But before that, you need to find the path of the file you want to download.
  3. Compose the full command you need to download the data.

See the screenshot below for one example:

Say I want to download multiqc_report.html under compBio2019/00_inputQC/. I first find the path to that file then compose the whole command to download it. It is finally downloaded to my desktop.
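Written out as text, the command has the shape sketched below. The host name is a placeholder and the remote path is hypothetical (use the output of pwd on ibex for the real one); only `luod` and the file name come from this guide. The snippet prints the command so you can check it; run the printed command in a terminal on your local computer, not on ibex.

```shell
# Placeholders (assumptions): replace these with your own values.
REMOTE_USER="luod"                         # example user name from this guide
REMOTE_HOST="<ibex_login_address>"         # placeholder, NOT a real address
REMOTE_FILE="/path/on/ibex/compBio2019/00_inputQC/multiqc_report.html"  # hypothetical path
LOCAL_DIR="$HOME/Desktop"                  # where the file should end up

# General form: scp <user>@<host>:<remote_file> <local_folder>
SCP_COMMAND="scp ${REMOTE_USER}@${REMOTE_HOST}:${REMOTE_FILE} ${LOCAL_DIR}/"
echo "$SCP_COMMAND"
```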

Image from Gyazo

Image from Gyazo

If you are on Windows, you can use "WinSCP". Google "WinSCP tutorial" or find videos on YouTube to learn how to use it. It is file transfer (FTP/SFTP) software that helps you move data between your local computer and a server. Mac OS has similar software, such as FileZilla, but scp on a Mac works very well.