# Week 2 computer exercises

<p>Python is a general-purpose programming language. To do biological computation, we can either write the Python code ourselves or use ready-made tools and modules. <code>Biopython</code> is a module that extends the capabilities of the Python programming language for biological computation. In this exercise, we will use <code>Biopython</code>.</p>

<h2 id="Inspecting-the-contents-of-a-FASTQ-file-with-command-line">Inspecting the contents of a <code>FASTQ</code> file with the command line</h2>

<p>First, skim through <a href="https://en.wikipedia.org/wiki/FASTQ_format" target="_blank" rel="noopener">this</a> Wikipedia entry about <code>FASTQ</code> files and familiarize yourself with the FASTQ file format.</p>

<p>Using the JupyterLab Launcher, open a terminal. Use the command <code>cd</code> to change the directory to <code>BBT_021_Bioinformatics/Week_02</code>. Use the <code>ls -lh</code> command to list the contents of the directory (<strong>note</strong>: you can use the <a href="https://explainshell.com/" target="_blank" rel="noopener">explainshell</a> website to get an explanation of what <code>ls -lh</code> does). Inspect the list and answer the following <strong>ungraded</strong> questions:</p>
<ul>
<li>How many files are there ending with <code>.fastq</code>?</li>
<li>What are the sizes of the files ending with <code>.fastq</code>?</li>
</ul>

<p>Use <code>head -n8 week_02_sample_r1.fastq</code> to show the first <code>8</code> lines of the <code>week_02_sample_r1.fastq</code> FASTQ file in the terminal. Check the results and see if what you see follows the description in the Wikipedia entry. Answer the following questions just by looking at the result in the terminal (no need to write any extra commands):</p>
<ul>
<li>How many sequencing reads do these 8 lines correspond to (also answer in Moodle)?</li>
<li>What is the <strong>first base</strong> in the <strong>first sequencing read</strong> (also answer in Moodle)?</li>
<li>What is the <strong>last base</strong> in the <strong>first sequencing read</strong> (also answer in Moodle)?</li>
<li>What is the character annotation of the <strong>phred quality score</strong> for the <strong>first base</strong> in the <strong>first sequencing read</strong> (also answer in Moodle)?</li>
<li>What is the character annotation of the <strong>phred quality score</strong> for the <strong>last base</strong> in the <strong>first sequencing read</strong> (also answer in Moodle)?</li>
</ul>

Once you have answered the questions above, type `exit` to close the terminal.

## Inspecting the contents of a `FASTQ` file with `Biopython`

<p>The following cell installs the <code>Biopython</code> module. Normally we would install it only once, but here we have to run the installation command every time we use <code>GenePattern</code>.</p>

In [None]:
pip install biopython

In [None]:
## The following lines load the necessary modules and set the configurations for plotting 
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt
%matplotlib inline

## This loads the standard Sequence Input/Output interface for BioPython 
## that provides us with the functionality to work with sequences
from Bio import SeqIO

<p>In this week's directory, we have 2 <code>FASTQ</code> files for <code>1</code> sample. We have <code>2</code> files because the data is obtained from a <strong>paired-end sequencing</strong> experiment. The file ending with <code>r1.fastq</code> contains the <code>read1</code>s and the file ending with <code>r2.fastq</code> contains the <code>read2</code>s. We can check this using the <code>ls</code> command.</p>

In [None]:
ls

### Loading a `FASTQ` file using a Biopython function

We can use `SeqIO.parse()` to load a file containing sequences and go through its contents.

In [None]:
## SeqIO.parse() returns an iterator over the records in week_02_sample_r1.fastq.
## We store it in the SeqRecords variable; records are read one at a time on request.
SeqRecords = SeqIO.parse("week_02_sample_r1.fastq", "fastq")

<p>We can use the <code>next()</code> function to take the first record in <code>SeqRecords</code>. The next time we use this function, we will get the next record. We can use this function as many times as there are records. After that, calling it again raises a <code>StopIteration</code> exception.</p>

In [None]:
## We get the first record in the SeqRecords and put it in a variable called sequence_record.
sequence_record = next(SeqRecords)
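
The behavior of `next()` can be tried out without a FASTQ file. The following is a minimal sketch that uses a plain Python iterator over a made-up list; `toy_records` is a hypothetical name and is not part of Biopython. It only illustrates how an iterator such as the one returned by `SeqIO.parse()` hands out one item per call and then runs out.

```python
# Hypothetical stand-in for the records that an iterator would yield.
toy_records = iter(["read_1", "read_2"])

print(next(toy_records))  # first item
print(next(toy_records))  # second item

# A third call has nothing left to return and raises StopIteration.
try:
    next(toy_records)
    exhausted = False
except StopIteration:
    exhausted = True

# Passing a default value to next() avoids the exception altogether.
fallback = next(toy_records, None)
print(exhausted, fallback)
```

Giving `next()` a default value, as in the last step, is one way to stop cleanly once all records have been read.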

<p>Remember that a record in a <code>FASTQ</code> file has 4 lines: a sequence identifier, the raw sequence, a line that most of the time contains only a <code>+</code> character, and a line with the quality values for the sequence on line 2.</p>
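
To make the four-line layout concrete, here is a small sketch in plain Python that splits one made-up record into its parts. The record `toy_fastq` is invented for illustration and does not come from the course data; Biopython does this work for us, so the sketch only shows the idea.

```python
# A made-up FASTQ record used only for illustration (not from the course data).
toy_fastq = "@toy_read_1\nACGTN\n+\nIIII#\n"

# Split the text into lines and name the four parts of the record.
lines = toy_fastq.splitlines()
identifier, sequence, separator, quality = lines[0:4]

print(identifier[1:])  # identifier without the leading '@'
print(sequence)        # raw sequence
print(separator)       # the '+' line
print(quality)         # one quality character per base

# The sequence and quality lines have one character per base.
same_length = len(sequence) == len(quality)
print(same_length)
```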
<p>We can extract these lines for each record as follows:</p>

### Extract the sequence identifier

In [None]:
print(sequence_record.id)

### Extract the raw sequence

In [None]:
print(sequence_record.seq)

### Finding out length of a read

In [None]:
## We can use Python's len() function to check the length of the sequence
print(len(sequence_record.seq))

<p>From above, we see that <code>read1</code> is <code>101</code> bases long.</p>

### Extract the `phred` quality scores per base call

In [None]:
print(sequence_record.letter_annotations['phred_quality'])
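
The numbers printed above are decoded from the characters on the 4th line of the record. As a rough sketch, assuming the common Phred+33 encoding (which is what Biopython's `"fastq"` format expects), each score is the character's ASCII code minus 33. The quality string `toy_quality` below is made up for illustration.

```python
# Made-up quality string, assuming the Phred+33 encoding.
toy_quality = "I5#"

# Each character's ASCII code minus 33 gives the Phred score.
toy_scores = [ord(character) - 33 for character in toy_quality]
print(toy_scores)
```

Older sequencing pipelines sometimes used a different offset, so this conversion should only be applied when the encoding is known.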

### Indexing

<p>You can see that the result of the operation above is a <em>list</em> of numbers. What if we are interested in checking the value at a specific position? For example, if we are interested in the base call quality of the <code>5th</code> base, we can use the following cell (what we are doing here is called <em>indexing</em>). <strong>Recall that the reason that we use <code>4</code> instead of <code>5</code> is that in Python the counting starts from <code>0</code>.</strong></p>

In [None]:
print(sequence_record.letter_annotations['phred_quality'][4])

<h3 id="Calculate-the-probability-if-a-base-call-is-wrong">Calculate the probability that a base call is wrong</h3>

<p>If the base call quality for the <code>5th</code> base is <code>31</code>, the probability <em>p</em> that the call is wrong follows from the definition of the phred score: $$-10\log_{10}(p) = 31$$ $$\log_{10}(p) = -3.1$$ $$p = 10^{-3.1}$$</p>

In [None]:
p = 10**-3.1
print(p)

The value is about `0.0008`, so there is less than a 1 in 1,000 chance that the base call is wrong.
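
The same calculation works for any phred score. The sketch below wraps it in a small helper; the function name `phred_to_error_probability` is our own choice for illustration and is not a Biopython function.

```python
def phred_to_error_probability(phred_score):
    """Convert a Phred quality score Q to the error probability p = 10^(-Q/10)."""
    return 10 ** (-phred_score / 10)

# Every increase of 10 in the score makes an error 10 times less likely.
for score in (10, 20, 30):
    print(score, phred_to_error_probability(score))
```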

<p>To see what the <code>5th</code> base is, we can use the same approach and index the sequence as follows:</p>

In [None]:
print(sequence_record.seq[4])

### Slicing

<p>To see the <code>5th</code> through <code>10th</code> bases on this record, we can use the following cell. <strong>Note</strong> that even though we start counting from <code>0</code> (i.e. indexing starts from <code>0</code>), to get the <code>10th</code> base, we need to use <code>10</code> and not <code>9</code>. This is because a range in Python includes its start index but excludes its end index. What we are doing here is called <em>slicing</em>.</p>

In [None]:
print(sequence_record.seq[4:10])
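
Indexing and slicing work the same way on ordinary Python strings, which makes it easy to check the counting rules on a short example. The sequence `toy_sequence` below is made up for illustration.

```python
toy_sequence = "ACGTTGCAAT"  # made-up 10-base sequence

fifth_base = toy_sequence[4]         # index 4 is the 5th base
fifth_to_tenth = toy_sequence[4:10]  # indices 4 to 9, i.e. the 5th through 10th bases

print(fifth_base)
print(fifth_to_tenth)
print(len(fifth_to_tenth))  # end index minus start index
```

A handy check is that the length of a slice equals the end index minus the start index, here `10 - 4`.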

### Plotting

<p>We can visualize the base call quality scores for the first record. In the plot generated below, the X-axis shows the base (e.g. <code>1st</code> base as position <code>0</code>) and the Y-axis shows the phred quality score.</p>

In [None]:
plt.plot(sequence_record.letter_annotations["phred_quality"])
plt.xlabel('Base position', fontsize=16) # adds x-axis label
plt.ylabel('Phred quality score', fontsize=16); # adds y-axis label

<p>From the plot, we can see that the beginning and the end of the sequence have slightly lower quality than the middle part (but a phred score of <code>26</code> is still good, i.e., the chance of the base call being wrong is only <code>0.002512</code>). Can you think of a reason why the sequence quality drops as the sequence gets longer? If you're curious, you can watch <a href="https://www.youtube.com/watch?v=mI0Fo9kaWqo">this video</a> about next-generation sequencing, which also explains this drop in sequence quality.</p>

## Exercises

<p>Once you are done with the following exercises and have the answers, use them to answer Week 2. Assignment 2 in Moodle.</p>
<p>Using what you have learned above, load the contents of <code>week_02_sample_r2.fastq</code> and assign it to a variable named <code>SeqRecords_r2</code>. Next, extract <strong>the first record</strong> (i.e. read) from it.</p>

### Exercise 1

What is the `82nd` base?

### Exercise 2

What are the `82nd` through `90th` bases?

### Exercise 3

What is the phred score of the `88th` base?

### Exercise 4

What is the probability that the `88th` base is wrong?

This concludes this week's computer exercises.