# Week 4 computer exercises

In week 2, we learned that sequencing reads are stored in `FASTQ` files. During this week's lecture, we learned that the results of aligning sequencing reads to a reference genome (e.g., the human reference genome) are typically stored in a Sequence Alignment/Map (`SAM`) file. In this exercise, we learn about this format and use JupyterLab's *Terminal* as well as some Unix/Linux commands to inspect a SAM file (i.e., `ezh2.sam`, which can be found in the `~/BBT_021_Bioinformatics/Week_04/` directory).

## Learning about the SAM file format

The best way to learn about the SAM file format is to read the *Sequence Alignment/Map Format Specification* that is available [here](https://samtools.github.io/hts-specs/SAMv1.pdf). But since this document is a bit complex for our purposes, we will use [this](https://en.wikipedia.org/wiki/SAM_(file_format)) Wikipedia entry about the SAM file format instead.

We first read the Wikipedia entry about the SAM file format and answer the related questions in the quiz (**note**. You can skip reading the *Optional fields* section).

## Inspecting a SAM file

- Open a *Terminal* in JupyterLab (e.g., via `File > New > Terminal`).

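- In the *Terminal*, use the `cd` command to move to the `Week_04` directory. Remember that you can use either the full **absolute path** (i.e., `cd /home/jovyan/BBT_021_Bioinformatics/Week_04/`) or the shorter form that starts with `~` (i.e., `cd ~/BBT_021_Bioinformatics/Week_04/`), where the shell expands `~` to your home directory (here `/home/jovyan`). Strictly speaking, the `~` form is also an absolute path; a true **relative path** is interpreted from your current directory (e.g., `cd Week_04/` when you are already in `~/BBT_021_Bioinformatics/`).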

- List the contents of this directory using `ls -lh`. You should be able to see the `ezh2.sam` file in this directory. We will inspect the contents of this SAM file.

- We have already learned that SAM is a text-based file format, so we can print the content of a SAM file to the *Terminal* and read it directly. Use `head -n5 ezh2.sam` to print the first `5` lines of the `ezh2.sam` file. We can see that all `5` of these lines start with `@`. We learned that such lines are part of the SAM **header**.

- The second line in the output of the `head` command we just used looks like this:   
`@SQ     SN:1    LN:249250621`  
The `SQ` record type tells us that this line stores information about a *reference sequence* (often a chromosome). `SN` gives the *reference sequence name* and `LN` gives the length of that reference sequence. For example, the line above tells us that the length of chromosome `1` is `249250621` bases.
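
To make the structure of such a header line concrete, here is a minimal Python sketch that splits the `@SQ` line shown above into its parts. The helper name `parse_header_line` is made up for this illustration, and the sketch assumes a header line whose fields all have the `TAG:VALUE` form (as `@SQ` lines do).

```python
# The @SQ header line shown above; its fields are separated by tabs.
header_line = "@SQ\tSN:1\tLN:249250621"

def parse_header_line(line):
    """Split a header line into its record type and a dict of TAG:VALUE fields.

    Hypothetical helper for illustration; assumes every field is TAG:VALUE.
    """
    record_type, *fields = line.rstrip("\n").split("\t")
    # Split each field on the first colon only, because values may contain colons.
    tags = dict(field.split(":", 1) for field in fields)
    return record_type.lstrip("@"), tags

record_type, tags = parse_header_line(header_line)
print(record_type)       # SQ
print(tags["SN"])        # 1
print(int(tags["LN"]))   # 249250621
```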

One handy Unix/Linux command in bioinformatics is `grep`. This command prints the lines in a file that **match** a given **pattern**. For example, to look for the `SN:` pattern, we can run `grep "SN:" ezh2.sam` in the *Terminal* (the pattern is enclosed in double quotation marks). Run this command and inspect the output (**NOTE**. Ensure that you are currently in the correct directory, or specify the correct file path when using `grep` to inspect the file).

Next, update the pattern in this command (i.e., `grep "SN:" ezh2.sam`) to find the length of chromosome `Y` (use the result to answer one of the quiz questions). **Note** that by default `grep` is *case sensitive*, which means that the pattern `"SN:"` is different from `"sn:"`. Also **note** that the length you get here for chromosome `Y` depends on the reference sequence that was used for alignment.
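
For readers who find Python easier to follow, the idea behind `grep` can be sketched as a simple filter over lines. This is only an approximation: the real `grep` interprets its pattern as a regular expression, whereas the hypothetical `grep` function below does a plain, case-sensitive substring search. The header lines are made up for the illustration (apart from the chromosome `1` line shown earlier) and are not taken from `ezh2.sam`.

```python
# A few header lines standing in for a real file (tab-separated).
# The "toy" sequence and the @HD line are made-up examples.
lines = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:1\tLN:249250621",
    "@SQ\tSN:toy\tLN:1000",
]

def grep(pattern, lines):
    """Return the lines that contain `pattern` as a case-sensitive substring."""
    return [line for line in lines if pattern in line]

print(grep("SN:", lines))      # the two @SQ lines
print(grep("sn:", lines))      # empty list, because the case differs
print(grep("SN:toy", lines))   # a more specific pattern selects one sequence
```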

The `grep` command has a few handy command-line options. One of them is `-v`, which tells `grep` to print the lines that do **not** match the pattern ([see the explanation here](https://explainshell.com/explain?cmd=grep+-v)).

For example, if we type and execute the command `grep -v "^@" ezh2.sam`, it prints all lines that do **not start with** the `@` sign, which means **it prints only alignment lines**. Here, `^` tells `grep` to only look for lines that **start with the pattern** we are looking for. Here is another way to think about this: first, `grep` finds all the lines that start with the pattern, but since the `-v` command-line option is present, it inverts the result and prints all other lines, meaning the lines that do not start with the pattern.
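
The "match, then invert" view described above can be sketched in Python with the standard `re` module. The sample lines, including the two alignment lines, are made up for this illustration and are not taken from `ezh2.sam`.

```python
import re

# Made-up sample: two header lines followed by two alignment lines.
lines = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:1\tLN:249250621",
    "read_1\t0\t1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read_2\t16\t1\t200\t60\t4M\t*\t0\t0\tTTGA\tIIII",
]

# "^@" means: an "@" character at the very start of the line.
starts_with_at = re.compile(r"^@")

# Like grep "^@": keep the lines that match the pattern (the header).
matching = [line for line in lines if starts_with_at.search(line)]

# Like grep -v "^@": keep the lines that do NOT match (the alignments).
inverted = [line for line in lines if not starts_with_at.search(line)]

print(len(matching))   # 2
print(len(inverted))   # 2
```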

**Note** that in this course it is enough to know that such Unix/Linux commands exist and that we can use them to help us in our bioinformatics work. We will learn more about them, e.g., in our university's master's degree bioinformatics courses (if you are interested in learning more, check [this](https://en.wikipedia.org/wiki/Regular_expression) Wikipedia entry about **regular expressions** and [this](https://en.wikipedia.org/wiki/Regular_expression#POSIX_basic_and_extended) section to learn about characters other than `^`).

You might have noticed that if you run `grep -v "^@" ezh2.sam` in the *Terminal*, it gets flooded with all the alignment lines. In this toy SAM file, we only have 20 alignment lines. This would get out of hand if we had tens of millions of lines in our SAM file, which is normal, as we learned that high-throughput sequencers generate millions of reads.

One handy **trick** that we can use here is to give the output of `grep -v "^@" ezh2.sam` to another command, for example `head`, which then uses it as its input. We can do this with the `|` sign, which is called a **pipe** (this is a metaphor from plumbing, where water from one pipe can be transferred to another pipe using a pipe fitting).

Let's say that we are interested in inspecting only the **first** line in the alignment section of the `ezh2.sam` file. We can run `grep -v "^@" ezh2.sam | head -n1` in the *Terminal* to get it.
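
The pipe idea, where one step produces lines and the next step consumes them, can be mimicked in Python by chaining a generator into `itertools.islice`. This is only a rough analogy, not how the shell implements pipes. The function names `alignment_lines` and `head` are chosen for this illustration, and the sample lines are made up.

```python
from itertools import islice

def alignment_lines(lines):
    """Yield only the lines that do not start with '@' (like grep -v "^@")."""
    for line in lines:
        if not line.startswith("@"):
            yield line

def head(lines, n):
    """Return the first n items and then stop reading (like head -n)."""
    return list(islice(lines, n))

# Made-up sample: one header line followed by two alignment lines.
lines = [
    "@SQ\tSN:1\tLN:249250621",
    "read_1\t0\t1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read_2\t16\t1\t200\t60\t4M\t*\t0\t0\tTTGA\tIIII",
]

# Equivalent in spirit to: grep -v "^@" ezh2.sam | head -n1
first = head(alignment_lines(lines), 1)
print(first)
```

Because `alignment_lines` is a generator, `head` only pulls as many lines as it needs, so the second alignment line is never processed.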

Now that we have the first line in the *Terminal*, let's inspect it, use what we learned by reading the Wikipedia entry about the SAM file format, and answer the questions in this week's assignment quiz (please check the questions in Moodle).

**Note** that the fields/columns in this line are separated by tabs (a tab is a single character that the *Terminal* displays as a variable-width gap, which is why the columns may not look evenly aligned). As it may be difficult to keep track of the column numbers, we can pipe the output of `grep -v "^@" ezh2.sam | head -n1` to the `tr` tool like this: `grep -v "^@" ezh2.sam | head -n1 | tr "\t" "\n"`. `tr` is short for translate, and `tr "\t" "\n"` replaces every tab (i.e., `"\t"`) it sees with a newline (i.e., `"\n"`). This way it may be easier to go through the columns, as every column now starts on a new line.
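
As a sketch of the same idea in Python, the line can be split on tabs and each column printed on its own line together with its number and field name. The names in `FIELD_NAMES` are the 11 mandatory fields of the SAM format; the alignment line itself is made up for this illustration and is not taken from `ezh2.sam`.

```python
# Names of the 11 mandatory SAM fields, in column order.
FIELD_NAMES = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]

# A made-up alignment line with only the mandatory fields.
line = "read_1\t0\t1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"

# Splitting on tabs plays the role of tr "\t" "\n".
columns = line.split("\t")

# Print one column per line, with its number and name.
for number, (name, value) in enumerate(zip(FIELD_NAMES, columns), start=1):
    print(f"{number:>2}  {name:<6} {value}")

# A dict makes it easy to look up a field by name.
record = dict(zip(FIELD_NAMES, columns))
print(record["CIGAR"])   # 4M
```

Real alignment lines may carry optional fields after column 11; `zip` simply ignores them here.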

Note that, most of the time, we work with **BAM** files instead of **SAM** files because they are compressed and therefore use less storage space. In that case, we need a tool such as `samtools` to be able to inspect the content of a BAM file. We did not use `samtools` here, but in our master's courses, we get plenty of chances to work with `samtools`.

### Using Python to inspect SAM/BAM files

Here, we see how we can use a Python package called [pysam](https://pysam.readthedocs.io/en/latest/index.html) to interact with SAM/BAM files.

It can be used: 
>to read and manipulate mapped short read sequence data stored in SAM/BAM files.

`pysam` can be installed e.g., by using the following command. It may take around 5-10 minutes for the installation to complete.

In [None]:
pip install pysam

You can read [this usage page](https://pysam.readthedocs.io/en/latest/usage.html) to get started with using `pysam`.

The following cell loads `pysam`, opens our `ezh2.sam` file, fetches the **first aligned read** from the SAM file, assigns it to a variable called `read`, and closes the file handle (since we are only interested in the first alignment line here, we close the file as soon as we have fetched it).

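In [None]:
import pysam

# Open the SAM file for reading ("r" is the mode for text SAM files).
sam_file = pysam.AlignmentFile("/home/jovyan/BBT_021_Bioinformatics/Week_04/ezh2.sam", "r")
# fetch() without a region iterates over the alignment lines; next() takes the first one.
read = next(sam_file.fetch())
# Close the file handle; the fetched read stays available in `read`.
sam_file.close()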

Now, we can use the following attributes to access the fields/columns of the first alignment line (please read the [pysam documentation](https://pysam.readthedocs.io/en/latest/api.html) for more).

**Note**. Use the following results to answer one of the quiz questions.

In [None]:
read.query_name

In [None]:
read.query_sequence

In [None]:
read.flag
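
The flag is a single number in which each bit answers one yes/no question about the read. As an illustration of how such a number can be interpreted, here is a minimal sketch in plain Python. The bit values and meanings follow the SAM specification; the function name `decode_flag` and the example values `0`, `16`, and `99` are chosen for this illustration and are not taken from `ezh2.sam`.

```python
# Bit values and their meanings as listed in the SAM specification.
FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails quality checks",
    0x400: "read is PCR or optical duplicate",
    0x800: "supplementary alignment",
}

def decode_flag(flag):
    """Return the meanings of all bits that are set in a SAM flag."""
    # `flag & bit` is non-zero exactly when that bit is set in the flag.
    return [meaning for bit, meaning in FLAG_BITS.items() if flag & bit]

print(decode_flag(0))    # no bits set
print(decode_flag(16))   # only the reverse-strand bit
print(decode_flag(99))   # 99 = 1 + 2 + 32 + 64
```

In the notebook, `decode_flag(read.flag)` would apply the same decoding to the read fetched above. `pysam` also exposes several of these bits directly as attributes, such as `read.is_reverse` and `read.is_paired`.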

This concludes our brief demo on how to inspect a SAM/BAM file using Python! We can do much more than we covered here. Please read the pysam documentation for more information.

---

This concludes the exercise session for this week. Way to go! 👏 

Once you are done, remember to save your work, close the notebook and the *Terminal*, go to `Running Terminals and Kernels` in the left sidebar, and shut down the kernels as well as the terminals.