# Notebook 2.0: bash assessment

### Learning objectives
This notebook will demonstrate several introductory bash scripting skills in the context of bioinformatic processing of genomic data. You will not need to be familiar with genomics to complete the notebook. By the end of this exercise you will:

1. Recognize that a genome can be described using text data.  
2. Know how to execute bash code in jupyter notebooks.  
3. Recognize there are many useful unix tools for manipulating text/genomic data.

### What to do when you get stuck?

Throughout this course you will surely, at some point, write code to solve a problem and get an error message instead of the correct answer. This is expected. It happens all the time, and it is a great learning opportunity. When you see an error message, be sure to read it carefully. Some parts of the message may look very confusing; these likely represent a *stack trace*, which attempts to tell you *which part of the code is causing a problem*. In addition to this, you will usually get an *error message*, which tells you the type of error.

To learn how to solve a problem that is causing an error you can first try modifying your code to see whether there is an obvious fix, such as a typo or syntax error. If you still can't solve the problem, try googling the error message to find out what it means, and to see how other people have encountered and solved similar problems.

Every error has happened before. If you encountered it, then millions of other people have surely encountered the same problem as well. You will be surprised how often you can find the answer you are looking for online. **The best trick to learning programming is to learn *how* to ask google the right questions to find your answers.**

<div class="alert alert-success">
    Reminder: <b>Questions</b> that require you to respond using Markdown (text) or <b>Actions</b> that require you to edit and execute code to produce new results will be highlighted on a green background. 
</div>

### DNA as text
To look at genomes means to look at **text files**. That's right, genomic data can be represented as simple text. The *genome sequence* can be represented using the letters <span style="color:red;">A</span>, <span style="color:green;">C</span>, <span style="color:blue;">G</span>, and <span style="color:darkorange;">T</span> to refer to the nucleotides (molecules) that make up DNA. Pretty convenient, huh? In addition to the genome sequence, a published reference assembly usually includes several additional files containing meta information about the genome. We'll take a look at this file format as well.

Because genomic data files are typically large, analyzing or even just looking at genomic data requires learning to use efficient software tools for working with large text files. It may surprise you to learn that many of the best tools for this are actually the basic command line tools developed for unix-based computers. Computer file systems are also written in text, and so many of the most common and basic programming tools are designed specifically to be able to very efficiently read, edit, or analyze text. 

### Jupyter notebooks
Jupyter notebooks are a powerful tool for combining code and written text together to create documents for sharing your work. In science we refer to this as *reproducibility* -- creating documents that others can use to reproduce your work entirely. In this course, we will use notebooks to share code with you for completing assignments; you can run the code, learn from it, and even edit the code or text. Upon completing each notebook, you will have entered in new values and produced new results, which you can then save, download and submit as your completed assignment for us to grade. Instructions for this will always be found at the bottom (end) of the notebook.

### Bash and Python 

The primary coding language used in Jupyter notebooks is Python. However, other languages can be used as well. In this class we will primarily use just two languages: **Python** and **Bash**. (We will cover bash briefly first and then learn Python in more depth.)

[Bash](https://en.wikipedia.org/wiki/Bash_(Unix_shell)) is a language for interacting with a **terminal** (it is a command line language). We will primarily use bash to run *executable computer programs* in a terminal. Those programs themselves may be written in a variety of computer languages, often C, Fortran, or even Python. There are several languages similar to bash for running commands in a terminal, of which bash is the most commonly used.

### Running bash in jupyter

In jupyter you can open a terminal at any time to execute bash code directly in it if you wish. We demonstrated this in class. However, for the purpose of some of your assignments we will have you instead execute bash code inside of jupyter notebooks. Again, this is for the sake of reproducibility: notebooks keep a record of the code that you execute, so it is easy to share and, for our purposes, to hand in as an assignment.

To say that we are going to "run a command line program using bash inside of a jupyter notebook" describes a [multi-layered](https://tenor.com/view/shrek-onions-gif-5703242) process: jupyter actually uses Python to connect to a temporary bash terminal, which then runs the command line program, captures the output, and returns it to jupyter, which displays the output in your notebook. This sounds complicated, but it's all happening behind the scenes. To run bash code in the notebook, all you have to do is put `%%bash` on the first line of a code cell.

## Unix command line tools

Most of the tools we are going to use in this tutorial will be available on any \*nix based computer system (e.g., Linux, macOS), although the specific version that comes pre-installed on your system may vary slightly (e.g., you would replace `zcat` on Linux with `gunzip -c` on Mac).



In this notebook we will explore the following unix programs. There are many more that we do not mention here, and there are many tutorials available on-line to learn more about them. 

1. `pwd`: print working directory  
2. `mkdir`: make a new directory  
3. `ls`: list contents in a directory  
4. `wget`: download from a URL (on a Mac, install `wget` or substitute `curl`)  
5. `head`: print first N lines of a file  
6. `cat/zcat`: stream/read through a file line by line  
7. `grep`: select lines from a file based on matching characters  
8. `cut`: select elements from text based on a delimiter/separator  
9. `wc`: count words, lines, or bytes in a file  
10. `awk`: match a pattern and perform action on conditional clause  

In [1]:
%%bash
# pwd: 'print working directory'
pwd

/home/deren/Documents/hack-the-planet/notebooks


### The format of command line tools
As we discussed in class, almost all command line programs follow a similar syntax (set of rules) for how to enter arguments to make them run. You always start with a program name, optionally followed by a set of arguments (flags) that start with a dash. Finally, some programs require an input (target) that they process to return a result. Below are some examples of this syntax. Next we will try it out with some interactive examples.


```bash
# general form: program name, then optional flags, then an optional target
# <program name> [-option1] [-option2] [target]

# example, to call the program ls (list)
ls

# to call the program with a target (location to list)
ls ../notebooks

# to call the program with an option and target
ls -l ../notebooks
```
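Single-letter options can usually be written together behind one dash, which is why a form such as `ls -lh` shows up later in this notebook. The following is a minimal sketch of that convention, assuming a standard `ls`; the scratch directory and the file name `example.txt` are made up for illustration.

```bash
# create a scratch directory with one small file (names are hypothetical)
tmpdir=$(mktemp -d)
printf 'ACGT\n' > "$tmpdir/example.txt"

# -l (long listing) and -h (human-readable sizes) given separately...
separate=$(ls -l -h "$tmpdir")

# ...or combined behind a single dash
combined=$(ls -lh "$tmpdir")

# the two forms produce identical listings
if [ "$separate" = "$combined" ]; then
    echo "ls -l -h and ls -lh give the same output"
fi

# clean up the scratch directory
rm -r "$tmpdir"
```

Not every program accepts combined flags, so when in doubt write each option separately.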

### Create a new directory to store downloaded genome files
Here we will use the `mkdir` command to create a new directory (we'll add the `-p` option so that it doesn't matter if the directory already exists.) In this code block we also use the `ls` command with the option `-l` to show the contents of our current directory as a list. We can run as many bash commands as we want on separate lines of the same code block. However, when creating notebooks of your own it is generally good practice to break up code into atomized steps. From the output of `ls` below you can see that the new directory we created (genomes/) is now located inside of our current directory (called notebooks/), and is alongside several notebook files.

In [2]:
%%bash
# make new directory called genomes in our current directory
mkdir -p ./genomes/

# show everything in my current directory as a list
ls -l

total 88
drwxrwxr-x 2 deren deren  4096 Jan 13 12:50 genomes
-rw-rw-r-- 1 deren deren 10184 Jan 11 12:24 nb-1.1-jupyter.ipynb
-rw-rw-r-- 1 deren deren 70208 Jan 13 12:49 nb-2.0-bash-practice.ipynb
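To see what the `-p` option changes, the sketch below calls `mkdir` repeatedly on the same path. It works inside a temporary scratch directory (an assumption made here so that nothing in your notebook folder is touched), and the variable names are invented for this example.

```bash
# work inside a scratch directory
tmpdir=$(mktemp -d)

# the first call creates the directory, the second is a harmless no-op
mkdir -p "$tmpdir/genomes"
mkdir -p "$tmpdir/genomes"
status_with_p=$?
echo "mkdir -p on an existing directory: exit status $status_with_p"

# without -p the same call fails because the directory already exists
if mkdir "$tmpdir/genomes" 2> /dev/null; then
    plain_mkdir="succeeded"
else
    plain_mkdir="failed"
fi
echo "plain mkdir on an existing directory: $plain_mkdir"

# -p also creates any missing parent directories in one step
mkdir -p "$tmpdir/a/b/c"
ls "$tmpdir/a/b"

# clean up the scratch directory
rm -r "$tmpdir"
```

This is why `mkdir -p` is convenient in notebooks: the cell can be re-run any number of times without raising an error.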


### Where do we get genomic data?
When writing reproducible code it is great to be able to download data directly from the internet, since this way any user can access files easily. Here we will download example data from a public database called [NCBI](https://www.ncbi.nlm.nih.gov/home/about/).

Open this link in a new tab [https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/](https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/) to view an NCBI FTP site where the reference Human genome is hosted. As you can see in the URL, we are visiting a site that is structured like a set of folders on a computer system, and in which we are jumping to `vertebrate_mammalian/Homo_sapiens`, which suggests that there are many other vertebrate mammalian genomes on this site as well. At the new page click on the link for `GCF_000001405.39_GRCh38.p13/`. This was the latest published human reference genome when this notebook was written. That's right, assemblies are continually being updated as new data and new software tools become available, or as new individuals are sequenced.

As you can see there are many files containing different aspects of genome information. Some describe how the genome was assembled, others how many genes there are and where they are located. The assembled **genome sequence** itself is in a **fasta file**. This is the file we are interested in right now. *Do not click to download yet*, we will instead use a command line program to download files. This is generally good form, since you'll leave a record in your notebook of exactly where you got your data from (i.e., it is good for reproducibility).

### Download a few (small) reference genome fasta files from public URLs
We will use the unix command `wget` to retrieve data from a URL, with the option `-O` (capital o, not a zero) to designate the location where we want the file to be saved (for simplicity we also give the files simpler names: virus, yeast, and chicken). Let's also add the option `-q` (quiet) to suppress the progress messages that are normally printed during a download (if your download fails, remove the `-q` to see the error message). This command will take somewhere between 30 seconds and several minutes depending on your internet speed.

In [3]:
%%bash
# to make this code more readable we will save the URLs as variables
url1="https://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/Pandoravirus_quercus/latest_assembly_versions/GCF_003233895.1_ASM323389v1/GCF_003233895.1_ASM323389v1_genomic.fna.gz"
url2="https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/Saccharomyces_cerevisiae/reference/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz"
url3="https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_other/Gallus_gallus/representative/GCF_000002315.6_GRCg6a/GCF_000002315.6_GRCg6a_genomic.fna.gz"

# now we can run the wget commands by referring to these variables
wget -q -O ./genomes/virus.fna.gz "$url1"
wget -q -O ./genomes/yeast.fna.gz "$url2"
wget -q -O ./genomes/chicken.fna.gz "$url3"
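If `wget` is not installed (as on many Macs), `curl` can do the same job. The sketch below wraps the choice in a small helper; the function name `download_file` is made up for this example, and the `curl` options are one reasonable equivalent of `wget -q -O` rather than the only one.

```bash
# download URL ($1) to FILE ($2) with whichever tool is installed
# (download_file is a hypothetical helper, not a standard command)
download_file() {
    url="$1"
    outfile="$2"
    if command -v wget > /dev/null 2>&1; then
        # -q: quiet, -O: write to the named output file
        wget -q -O "$outfile" "$url"
    elif command -v curl > /dev/null 2>&1; then
        # -s: silent, -S: still show errors, -L: follow redirects, -o: output file
        curl -sSL -o "$outfile" "$url"
    else
        echo "neither wget nor curl is installed" >&2
        return 1
    fi
}

# example call, commented out so that nothing is downloaded by accident:
# download_file "$url1" ./genomes/virus.fna.gz
```

Note that variables defined in one `%%bash` cell are not available in later cells, so the helper and the URL variables would need to live in the same cell as the call.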

### The format of the genome files
We just downloaded three `fasta files`. Such files will usually have a suffix like `.fna` or `.fasta`. You can see that all of our files have the ending `.fna.gz`, meaning that they are in fasta format, but they are also *gzip compressed* (.gz). Compression is used to store large files more efficiently. When files are compressed we sometimes need to use additional tools to decompress them before reading the contents. Many genomic software tools are designed to work with compressed files so that we can just enter the compressed file into the program directly. Here we will leave the files compressed to save disk space (e.g., the compressed Chicken genome is about 0.3 GB) and simply decompress their contents on the fly as we read/stream them.

In [4]:
%%bash
# show the contents of the genomes directory
ls -lh genomes

total 319M
-rw-rw-r-- 1 deren deren 314M Feb 22  2019 chicken.fna.gz
-rw-rw-r-- 1 deren deren 588K Jun 24  2018 virus.fna.gz
-rw-rw-r-- 1 deren deren 3.7M Apr  9  2018 yeast.fna.gz
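To get a feel for what compression does, the sketch below builds a small, highly repetitive toy file, compresses it, and reads it back without ever writing a decompressed copy to disk. The file names are invented for this example, and `gzip -dc` is used because it behaves the same on Linux and Mac (it plays the role of `zcat` / `gunzip -c`).

```bash
tmpdir=$(mktemp -d)

# write a repetitive toy sequence file: 1000 identical lines of 40 bases
i=0
while [ "$i" -lt 1000 ]; do
    echo "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
    i=$((i + 1))
done > "$tmpdir/toy.fna"

# -c writes the compressed data to stdout, so the original file is kept
gzip -c "$tmpdir/toy.fna" > "$tmpdir/toy.fna.gz"

# compare sizes in bytes: repetitive text compresses very well
raw_bytes=$(wc -c < "$tmpdir/toy.fna" | tr -d ' ')
gz_bytes=$(wc -c < "$tmpdir/toy.fna.gz" | tr -d ' ')
echo "uncompressed: $raw_bytes bytes, compressed: $gz_bytes bytes"

# decompress on the fly; the file on disk stays compressed
first_line=$(gzip -dc "$tmpdir/toy.fna.gz" | head -n 1)
echo "$first_line"

# clean up the scratch directory
rm -r "$tmpdir"
```

Real genomes are far less repetitive than this toy file, so they compress less dramatically, but the saving is still substantial.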


### Look at the first couple lines of the compressed file
A convenient tool for looking at just a few lines from a large file is the command `head`. We can pass this an additional argument to tell it how many lines to show us from the top of the file. Let's start by looking at the first 2 lines of one of the compressed genome files...

It looks like gibberish, right? Well, that's what compressed files look like if you don't decompress them.

In [5]:
%%bash
# read a few lines of the compressed virus genome (it will look ugly)
head -n 2 genomes/virus.fna.gz

�      ��ͮeˎ�׏��0�� ��7�7�B �����\���ڙ�B*���͓�{��3�?������_����_����o�����������������?�?��������������_��������������_�v2�:2�+������?����痝��������������_���ů��_ׯ?��矟�?�y��?~�����������_���y��ߍ��ϒ�o���������Z5o�>��}�_oq_��}~�ƯO��>?��۟�������r��_|���P����6�ٿ^��G�������u?Wv�+q_}~�ן�}�2�{w򮟯\yoȯ��u�\�������y����M��S��u�I��>s�>�K���e���C�|6�սP�7����w}ܗ:5ˠ�Ku��s�Z����?�|���&�'����ޅ__y>�\�{i�^����5Ss�����+-s>�窅�z�W��f�� wi��믏�~�i�u���sa>����c~���k~��������O�����ź�%긏��|���]������g��A�ס��>?���y">*�����_�7�Q�k�]�~y����������;��ޛ�w�7����^�.��ϳ|��^�{??����׾O����e�-$Y��'f鹎'�u�g�y����;��a���Z�9����/�:݇��:��>wFW�s��:��e�kq?Q�Z���ι�fv��W����w���Lz���Og{�����}V�]ح�,{�����.��γp����\��9�K�u�_r^2��%�^�s'�\Z����|�3��]\����>��3������G^���pE�&�/Z�����Ƭ3]�ϝ��>kΗ�?9������CK���}�"Y[ٳ��|����� �������D�������~���aΦ7�i��W��1{��{����ܔ���9z�<;s��!�z�>��z*����پ政�<��s��6;w!?[]��GR/3�>z9߬���/�#Z'F�BSwn�p�͙��ݏO0�o�^���L7�sE���^�}��1:��

### `cat / zcat` to read/stream the entire contents of a file


Here we are using two tricks: first, the command `zcat` can be used as a replacement for `cat`, the normal command we would use to stream a file to stdout (the output stream). The second trick here is that we are using a `pipe`. As you learned in your linux tutorial, pipes are used to send the output of one program directly into the input of another. Here we pipe the streamed output of `zcat` directly into the program `head`. This way we do not read the entire file into memory at once, which would be too much text for a jupyter notebook to display comfortably, and instead capture only the first N lines to print to stdout. As you can see, now that we are using the `zcat` command to decompress the stream of text it looks like sensible DNA data. 

In [6]:
%%bash
# run zcat and PIPE the output into head
zcat genomes/virus.fna.gz | head -n 10

>NC_037667.1 Pandoravirus quercus, complete genome
CCGGTACAGTGAGCGGTTCACGGCCTGGCCACGGTCGACGGAGTGCCGTGCGATGCCATCGGCGACGGCCGCGGCGACGT
CGCGGGCATTCGCACGTGCGACCACAGCCGTCAGTGGTACTGGCGGGACGAGGCCGTCGGGGTGACGGACGTCGTCAAGA
ACCTGCTCGATGCCATCACACGATGCGCCGAGTACGCGCACGATACCATCAGGGCGCCGTTGGCGAGCAAACCTGACACT
GAGATTATGGAGTTCAGCGTCCGTTGCACCCGCCAGGCGGCGGCCGGAGGCGACGACGTCACGGACCCCATGCGCCGCTT
GGACGCGAGGCCAGGCGCACGTGGCGCGCCTATCGCATGCACGCGCGCGTGTTCAGCGCCATCGCGTTGCTCGGCGGCGG
ACCGCTGAGCATGATGGCGACGGCGGGTCTGCCCTTCTATGACGTGCGCCGGTACGCGCTGGTGGCGGCCCGCTACGACG
GCCGCGCCGAACGCGCGTCGAGCCTGCTCCCAACACGCGTGCGACCAGACACCCTTGCGCACGAGGTGATGCGGTGCCGC
GGCGATGGGCGTCTTCCGCGGCGCTCAATCGCGCACAGCCTCTTTGCAAGTTGGTTCGAACGCAATTACGCACTGGAGGG
CTACGAGGACGCCAGCGGCATCGACGCCGTGTGGTACGACCATCTCGGTCAAGAGGGCACCCACGAGACCGACCGTTGGT


<div class="alert alert-success">
    <b>Action:</b> In the code cell below write and run a bash command like the one above to stream the contents of a genome file using zcat and pipe it into the head command. In your code block, select the *yeast genome file* instead of the virus, and change the -n option of head to print 20 lines instead of 10.  
</div>

In [7]:
%%bash
# write code here to run zcat and PIPE the output into head


### The fasta format
The fasta format is a simple file format used to list contiguous DNA sequences with identifiers. The sequences in a fasta file could be genes, or non-coding regions, or transposable elements, etc., or it could be all of those things ordered as a large contiguous sequence (<span style='color:red;'>a contig</span>). These details are not important for now. Just look at an example below:

```
> sequence 1 name and other notes about sequence 1
AAAAAAAAAAAAAAAAAAATTTTTTTTCCCCCCCCCCCTTTTTTTTTTTTTGGGGGGGGGGG
AAAAAAAAAAAAAAAAAAATTTTTTTTCCCCCCCCCCCTTTTTTTTTTTTTGGGGGGGGGGG
AAAAAAAAAAAAAAAAAAATTTTTTTTCCCCCCCCCCCTTTTTTTTTTTTTGGGGGGGGGGG
AAAAAAAAAAAAAAAAAAATTTTTTTTCCCCCCCCCCCTTTTTTTTTTTTTGGGGGGGGGGG
AAAAAAAAAAAAAA
> sequence 2 name and other notes about sequence 2
AAAAAAAAAAAAAAAAAAATTTTTTTTCCCCCCCCCCCTTTTTTTTTTTTTGGGGGGGGGGG
AAAAAAAAAAAAAAAAAAATTTTTTTTCCCCCCCCCCCTTTTTTTTTTTTTGGGGGGGGGGG
AAAAAAAAAAAAAAAAAAATTTTTTTTCCCCCCCCCCCTTTTTTTTTTTTTGGGGGGGGGGG
AAAAAAAAAAAAAA
```

In the virus genome file we can see that the name of the first sequence in the file is labeled "... complete genome". This is because the entire genome of this virus is contained on a single assembled chromosome. By contrast, the other genome files contain several distinct named contigs.

### Searching text (grep)
The names of each sequence in a fasta file begin with the character `>`. When a character is used to separate elements in a file like this we refer to it as a **delimiter**. The `>` character in a fasta file delimits where one sequence stops and the next one begins. 

Knowing this, we can easily extract the names of all of the sequences in a fasta file by using the command line tool [`grep`](https://www.tutorialspoint.com/unix_commands/grep.htm) to extract lines that match a pattern we are searching for. You can find a quick grep tutorial [here](https://www.tutorialspoint.com/unix_commands/grep.htm) and in many other places online. We will demonstrate some usage below.

Let's add an additional pipe (remember, this connects the output from one program to be the input to another) to our command from above to now extract the sequence names in the Yeast genome file as we read/stream through it using the `zcat` command. As you can see, the only lines of the file that are returned are those which matched the `grep` search pattern (i.e., contained a `>` character).

In [8]:
%%bash
# pipe zcat output to grep
zcat genomes/yeast.fna.gz | grep ">"

>NC_001133.9 Saccharomyces cerevisiae S288C chromosome I, complete sequence
>NC_001134.8 Saccharomyces cerevisiae S288C chromosome II, complete sequence
>NC_001135.5 Saccharomyces cerevisiae S288C chromosome III, complete sequence
>NC_001136.10 Saccharomyces cerevisiae S288C chromosome IV, complete sequence
>NC_001137.3 Saccharomyces cerevisiae S288C chromosome V, complete sequence
>NC_001138.5 Saccharomyces cerevisiae S288C chromosome VI, complete sequence
>NC_001139.9 Saccharomyces cerevisiae S288C chromosome VII, complete sequence
>NC_001140.6 Saccharomyces cerevisiae S288C chromosome VIII, complete sequence
>NC_001141.2 Saccharomyces cerevisiae S288C chromosome IX, complete sequence
>NC_001142.9 Saccharomyces cerevisiae S288C chromosome X, complete sequence
>NC_001143.9 Saccharomyces cerevisiae S288C chromosome XI, complete sequence
>NC_001144.5 Saccharomyces cerevisiae S288C chromosome XII, complete sequence
>NC_001145.3 Saccharomyces cerevisiae S288C chromosome XIII, complete seq

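Two small variations on the same search are often handy: counting the header lines instead of printing them, and trimming each header down to its identifier with `cut`. The sketch below uses a made-up toy fasta file (the sequence names are not real accessions) so that it runs without the downloaded genomes.

```bash
tmpdir=$(mktemp -d)

# a toy fasta file with three made-up sequences
cat > "$tmpdir/toy.fna" << 'EOF'
>seq1 toy organism chromosome I
ACGTACGTAC
GGCC
>seq2 toy organism chromosome II
TTTTAAAA
>seq3 toy organism mitochondrion
ACGT
EOF

# -c prints the number of matching lines instead of the lines themselves;
# the ^ anchor matches '>' only at the start of a line
n_seqs=$(grep -c '^>' "$tmpdir/toy.fna")
echo "number of sequences: $n_seqs"

# cut splits each header on spaces (-d ' ') and keeps the first field (-f 1);
# tr then deletes the leading '>' to leave the bare identifier
ids=$(grep '^>' "$tmpdir/toy.fna" | cut -d ' ' -f 1 | tr -d '>')
echo "$ids"

# clean up the scratch directory
rm -r "$tmpdir"
```

On the real files the same commands would sit after `zcat genomes/yeast.fna.gz |`, with `grep` reading from the pipe instead of from a file name.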
### Sequence names/labels

We can see from above that all of the sequences in the Yeast genome are complete chromosomes. This is not always the case, though. Below we run a `grep` search to get all lines of the Chicken genome file that contain "chromosome 1" in their names. As you can see, there are many scaffolds that match our search term, meaning there are many scaffolds in this assembly that are thought to be located *somewhere* on chromosome 1. 

In [9]:
%%bash
zcat genomes/chicken.fna.gz | grep "chromosome 1" | head -n 10

>NC_006088.5 Gallus gallus breed Red Jungle Fowl isolate RJF #256 chromosome 1, GRCg6a
>NW_020109737.1 Gallus gallus breed Red Jungle Fowl isolate RJF #256 chromosome 1 unlocalized genomic scaffold, GRCg6a CHR1_134_RANDOM
>NW_020109738.1 Gallus gallus breed Red Jungle Fowl isolate RJF #256 chromosome 1 unlocalized genomic scaffold, GRCg6a CHR1_360_RANDOM
>NW_020109739.1 Gallus gallus breed Red Jungle Fowl isolate RJF #256 chromosome 1 unlocalized genomic scaffold, GRCg6a CHR1_123_RANDOM
>NW_020109740.1 Gallus gallus breed Red Jungle Fowl isolate RJF #256 chromosome 1 unlocalized genomic scaffold, GRCg6a CHR1_132_RANDOM
>NW_020109741.1 Gallus gallus breed Red Jungle Fowl isolate RJF #256 chromosome 1 unlocalized genomic scaffold, GRCg6a CHR1_330_RANDOM
>NW_020109742.1 Gallus gallus breed Red Jungle Fowl isolate RJF #256 chromosome 1 unlocalized genomic scaffold, GRCg6a CHR1_247_RANDOM
>NW_020109743.1 Gallus gallus breed Red Jungle Fowl isolate RJF #256 chromosome 1 unlocalized genomic s

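One caveat with a pattern such as "chromosome 1" is that `grep` matches it anywhere in a line, so names containing "chromosome 10" or "chromosome 12" match as well. The sketch below illustrates this with made-up header lines that only mimic the naming style of the assembly; the `-w` option (available in GNU and BSD `grep`) is one way to require a whole-word match.

```bash
# made-up header lines that mimic the chicken assembly naming
headers=">A1 toy chromosome 1, assembly
>A2 toy chromosome 1 unlocalized genomic scaffold
>A3 toy chromosome 10, assembly
>A4 toy chromosome 12 unlocalized genomic scaffold"

# plain pattern: also matches chromosome 10 and chromosome 12
loose=$(printf '%s\n' "$headers" | grep -c "chromosome 1")

# -w requires the match to begin and end at a word boundary,
# so the "1" can no longer be the first digit of a longer number
strict=$(printf '%s\n' "$headers" | grep -c -w "chromosome 1")

echo "without -w: $loose lines, with -w: $strict lines"
```

Here the plain search reports 4 lines and the whole-word search reports 2.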
----------------------------------

### Genome structure: mitochondrial genomes
We can use `grep` to match lines of a file in order to answer basic questions about its contents. For example, we can search for the term "mitochond" in each genome to see if there is a sequence that is labeled as being from a mitochondrion. Here we can see that two genomes include a mitochondrial sequence (yeast and chicken) while the virus does not.

In [10]:
%%bash

zcat genomes/virus.fna.gz | grep mitochond
zcat genomes/yeast.fna.gz | grep mitochond
zcat genomes/chicken.fna.gz | grep mitochond


>NC_001224.1 Saccharomyces cerevisiae S288c mitochondrion, complete genome
>NC_040902.1 Gallus gallus spadiceus isolate YP19903 breed Red jungle fowl mitochondrion, complete genome

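When the same search is run on several files, it can be hard to tell which output line came from which file, and a file with no match prints nothing at all. A possible alternative, sketched below on two made-up toy files, is to loop over the files and use `grep -q`, which prints nothing and reports only through its exit status.

```bash
tmpdir=$(mktemp -d)

# two toy fasta files (made-up names and sequences)
printf '>s1 toy virus, complete genome\nACGT\n' > "$tmpdir/virus.fna"
printf '>s1 toy yeast chromosome I\nACGT\n>s2 toy yeast mitochondrion, complete genome\nGGCC\n' > "$tmpdir/yeast.fna"

# loop over the files and report one line per file
n_found=0
for name in virus yeast; do
    # grep -q succeeds (exit status 0) if at least one line matches
    if grep -q "mitochond" "$tmpdir/$name.fna"; then
        echo "$name: mitochondrial sequence found"
        n_found=$((n_found + 1))
    else
        echo "$name: no mitochondrial sequence"
    fi
done

# clean up the scratch directory
rm -r "$tmpdir"
```

For the compressed genome files, the test inside the `if` could read from a pipe instead, for example `zcat genomes/$name.fna.gz | grep -q mitochond`.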

-----------------------------

### How large is each genome?
Because genome files are composed simply of text, we can use simple text operations to answer interesting biological questions, such as "how big is the genome?". Here we stream the decompressed text from the file using `zcat` and then use the `wc` command to count the number of characters in each file.

In [11]:
%%bash
# count characters in each file
zcat genomes/virus.fna.gz | wc -m 
zcat genomes/yeast.fna.gz | wc -m
zcat genomes/chicken.fna.gz | wc -m

2103306
12310392
1078736272


We can do this even more accurately by excluding the lines that contain the names of each contig by using `grep` to filter these lines (exclude them by using option `-v`). 

In [12]:
%%bash
# more accurately, we should exclude seq name lines using grep (line starting with '>')
zcat genomes/virus.fna.gz | grep -v '^>' | wc -m 
zcat genomes/yeast.fna.gz | grep -v '^>' | wc -m
zcat genomes/chicken.fna.gz | grep -v '^>' | wc -m

2103255
12309078
1078682731
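The counts above still include one newline character for every line of sequence, because `wc -m` counts every character it reads. A possible refinement, sketched below on a made-up toy fasta file, is to delete the newlines with `tr` before counting. The same sketch also shows a short `awk` program that reports the length of each sequence separately; it is one way to write this, not the only one.

```bash
tmpdir=$(mktemp -d)

# toy fasta: seq1 has 10 + 4 = 14 bases, seq2 has 8 bases
cat > "$tmpdir/toy.fna" << 'EOF'
>seq1 toy sequence one
ACGTACGTAC
GGCC
>seq2 toy sequence two
TTTTAAAA
EOF

# dropping header lines still leaves one newline per sequence line in the count
with_newlines=$(grep -v '^>' "$tmpdir/toy.fna" | wc -m | tr -d ' ')

# tr -d '\n' deletes the newlines so that only the bases are counted
bases_only=$(grep -v '^>' "$tmpdir/toy.fna" | tr -d '\n' | wc -m | tr -d ' ')

echo "with newlines: $with_newlines, bases only: $bases_only"

# awk: on a header line print the previous sequence's total and start a new one;
# on any other line add the line length to the running total
per_seq=$(awk '
    /^>/ { if (name != "") print name, total; name = substr($1, 2); total = 0; next }
         { total += length($0) }
    END  { if (name != "") print name, total }
' "$tmpdir/toy.fna")
echo "$per_seq"

# clean up the scratch directory
rm -r "$tmpdir"
```

For the toy file this gives 25 characters with newlines and 22 bases without, and the `awk` step prints `seq1 14` and `seq2 8`.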


-----------------------------------

<div class="alert alert-success">
    <b>Action:</b> 
    Congratulations, you have completed the notebook. You do not need to upload anything for assessment.
</div>
