# Lab 1: Introduction to Handling Phylogenetic Data

<img src="http://bigtreestrategies.com/wp-content/uploads/2011/07/BigTree.jpg" width="900" height="10">

## Objectives

* Understand the difference between text and binary files, and appreciate that for our purposes, all data are simply text files
* Explore basic computing skills: the command line (shell), R, and Jupyter notebooks
* Learn how to find and download sequences from GenBank:
  * Form search queries
  * Perform BLAST searches
* Use Python to download sequences and format them into FASTA files

## Introduction

For extant (living) species, reconstructing phylogeny has come to rely primarily on **DNA sequences** -- initially, just sequences of single genes, but now, increasingly, on genome-scale data sets of hundreds or thousands of genes. We will learn the basics in upcoming labs, but to begin we will get acquainted with scientific computing in general, including files and file formats, and how to find and download DNA sequence data (and some morphological data) from public repositories.

## Files and file formats

If your work with computers to date has mostly involved double-clicking on file icons to open them in programs like Word or Photoshop, you might not appreciate the difference between [binary files](https://en.wikipedia.org/wiki/Binary_file) and [text files](https://en.wikipedia.org/wiki/Text_file). An accessible yet detailed explanation can be found [here](https://www.nayuki.io/page/what-are-binary-and-text-files), but the main distinction is that text files are organized fundamentally by _lines_ (rows) of _characters_ (letters, numbers, and so on), and are generally meant to be human-readable.
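
One way to see the distinction for yourself is to read the same file two ways in Python. The sketch below is only an illustration (the file name `example.fasta` and its contents are made up): it writes two lines of text into a temporary directory, then reads them back once as raw bytes and once as lines of characters.

```python
import tempfile
from pathlib import Path

# Write a tiny text file into a temporary directory (hypothetical name and contents).
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.fasta"
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(">cat\nATGCATGC\n")

    # The file as stored on disk: a sequence of bytes.
    raw_bytes = path.read_bytes()
    # The same bytes decoded as text and split into lines.
    text_lines = path.read_text(encoding="utf-8").splitlines()

print(raw_bytes)    # b'>cat\nATGCATGC\n'
print(text_lines)   # ['>cat', 'ATGCATGC']
```

Every file is ultimately bytes; a text file is one whose bytes decode cleanly into readable lines like these.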

> ***Bottom line:*** text files are the bread-and-butter of scientific data analysis; they are used for source code and scripts, and often for data storage. In phylogenetics, ***all data files are text files***.

### Text file formats

The _format_ of a file refers to how its contents are structured. Whereas binary file formats are commonly specific to particular programs -- you can't open a Word document in Photoshop, for example -- a text file of any format can be opened in a text editor (e.g., TextEdit, Sublime Text, Atom, and for the diehards, Emacs and Vim). ***(Note: Word is not a text editor!)*** A common text file format for tabular data is [CSV](https://www.computerhope.com/issues/ch001356.htm). In phylogenetics, there are various text file formats for data, trees, and so on, that we will see in time.
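
As a small illustration of what "format" means for a text file, the following sketch parses a CSV table with Python's standard `csv` module. The table contents (`species`, `length_bp`) are invented for the example.

```python
import csv
import io

# A made-up CSV table: one header row, then one row per record.
csv_text = "species,length_bp\ncat,8\ndog,8\n"

# DictReader uses the header row to name the fields of every later row.
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    print(row["species"], row["length_bp"])
```

Note that every value comes back as a string; converting `length_bp` to a number is up to the program reading the file.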

## 1. Using the command line

We will be using the Terminal, so it’s important to feel at least somewhat comfortable with the command line. Here are some common commands.

> ***NB:*** A Unix "directory" is the same as a Mac or Windows "folder"

|Command|Function|
|:---------:|:----------:|
|`pwd`|Print working directory (tells you where you are)|
|`cd foo`|Change (move) to existing directory `foo`|
|`mkdir foo`|Make a new directory `foo`|
|`ls`|List the contents of the current directory|
|`mv foo bar/`|Move `foo` into the directory `bar`|
|`mv foo bar`|Rename the file `foo` to `bar` (assuming `bar` is not an existing directory)|
|`rm foo`|Delete the file `foo`|
|`man foo`|Read the documentation for command `foo` (exit = `q`)|
|`head foo`|Look at first ten lines of `foo` (useful for large files)|
|`tail foo`|Look at last ten lines of `foo` (useful for large files)|
|`less foo`|Interactively page through the contents of `foo`|
|`grep -e foo bar`|Print all lines containing the text `foo` in file `bar`|
|`./foo`|Run the program `foo` whose executable file is in the current directory|
|`cd ../`|Move to the containing directory (one level "up")|

---
<img style="float: right;" src="http://www.reelab.net/todo.png">

#### Practice
1. Open a Terminal (a new one, not the one running this notebook).
2. Look at what directory you are in.
```bash
pwd
```
3. Make a new directory (name it whatever you want).
```bash
mkdir your_directory/
```
4. Make a second directory.
5. Move one of your directories into the other.
```bash
mv your_directory/ dir2/
```
6. Change working directory to your new directory. 
```bash
cd dir2/
```
7. Look inside your new directory.
```bash
ls
```
For a more detailed look:
```bash
ls -l
```
8. Move back to the original directory.
```bash
cd ..
```
9. Read more about the command `rm`
```bash
man rm
```
Use `<space>` to page through the documentation, and `q` to quit.
10. Delete both directories you’ve made.
```bash
rm -r dir2/
```

##### Warning!
* **`rm -r`** will delete the file or directory you list and *everything* inside of it. It will not ask you whether you are sure. BE CAREFUL using it. 
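
The same steps can also be scripted. The sketch below mirrors the practice exercise using Python's standard library, working inside a throwaway temporary directory so that nothing important can be deleted. The directory names are the same placeholders used above.

```python
import shutil
import tempfile
from pathlib import Path

# Work inside a throwaway directory.
sandbox = Path(tempfile.mkdtemp())

first = sandbox / "your_directory"
second = sandbox / "dir2"
first.mkdir()                                   # mkdir your_directory/
second.mkdir()                                  # mkdir dir2/
shutil.move(str(first), str(second))            # mv your_directory/ dir2/

contents = sorted(p.name for p in second.iterdir())   # ls dir2/
print(contents)

shutil.rmtree(second)                           # rm -r dir2/ (no confirmation here either)
still_there = second.exists()
print(still_there)

shutil.rmtree(sandbox)                          # clean up the sandbox itself
```

Like `rm -r`, `shutil.rmtree` removes a directory and everything inside it without asking.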

---
## 2. Statistical Computing in R

R provides a wide variety of statistical and graphical functions, and is widely used in data science. Here are some basics:

|Command|Function|
|:---------:|:----------:|
|`getwd()`|Print working directory (equivalent to “pwd” in Unix)|
|`setwd("[dir/]")`|Set working directory (equivalent to “cd” in Unix)|
|`c(a,b,c)`|Concatenate *numbers* a,b,c into a *vector*|
|`c(a:z)`|Concatenate *numbers* from a to z into a *vector*|
|`cbind(x,y,z)`|Concatenate *vectors* x,y,z into a *matrix* by columns|
|`rbind(x,y,z)`|Concatenate *vectors* x,y,z into a *matrix* by rows|
|`class(x)`|Return the class of object x|
|`matrix(x,nrow,ncol)`|Build a *matrix* by arranging *vector* x into nrow rows and ncol columns|
|`plot(x,y)`|Plot elements of *vector* x against *vector* y|
|`density(x)`|Estimate the density distribution of univariate observations|
|`read.table("file.txt")`|Read text file that is in a table format|
|`rnorm(n)`|Generate a *vector* of n random numbers drawn from a normal distribution|
|`?[function]`|Access documentation for a given function (e.g. ?rnorm)|

---
<img style="float: right;" src="http://www.reelab.net/todo.png">
#### Practice

1. Change the kernel of this notebook to R using the _Kernel_ menu above.
2. Insert a new cell below this one.
3. Create a 10x10 matrix of random numbers drawn from a normal distribution, bound to the variable `A`, and print it out. Retype or copy the following code into the new cell, and execute it by pressing `CTRL-Enter` (or use the buttons/menus above).
```R
A <- matrix(rnorm(100),10,10)
A
```

---


---
<img style="float: right;" src="http://www.reelab.net/todo.png">
#### Insert a new cell below and use it for the following exercises.

4. Plot the first and second columns of `A` against each other.
```R
plot(A[,1],A[,2])
```
5. Replot the same data with a title and axis labels.
```R
plot(A[,1],A[,2],main="Title",xlab="X axis",ylab="Y axis")
```
6. Change your matrix to a dataframe.
```R
A <- as.data.frame(A)
```
7. Check the class of your dataframe.
```R
class(A)
```
8. Plot the first and second columns from the dataframe.
```R
plot(A$V1,A$V2)
```

---

---
<img style="float: right;" src="http://www.reelab.net/todo.png">
#### Insert a new cell below for the following exercise.

1. Download a test file of tabular data: [sample.csv](https://sites.google.com/a/fieldmuseum.org/rtol/lab-exercises/sample.csv?attredirects=0&d=1) and save it to the current directory
9. Import the data into a variable `B`.
```R
B <- read.table("sample.csv",header = TRUE)
```

10. Check the class of the object `B`.
```R
class(B)
```

11. Plot "Lineages" against "Age".
```R
plot(B$Age,B$Lineages)
```

---

---
## 3. Searching for and Obtaining Data from Online Databases

![running horse](http://buzzsharer.com/wp-content/uploads/2015/06/beautiful-running-horse.jpg)

_GenBank_ (http://www.ncbi.nlm.nih.gov/Genbank/index.html) is a free, publicly available online database of DNA sequence data. Nearly all journals require authors to deposit new sequence data in _GenBank_ when a study is published, so it contains an enormous amount of data. It is part of the _NCBI_ cluster (http://www.ncbi.nlm.nih.gov/), which includes several databases (literature, molecular, and genome), as well as tools for searching and analyzing the databases, tutorials and educational links, and information on submitting your own data.

---
<img style="float: right;" src="http://www.reelab.net/todo.png">
#### Go back to the NCBI webpage and search for “Equus” (the genus for horses, zebras, donkeys and asses).

By default, the search option is set to “All Databases”. You can ask it to select an individual database, but for now, just leave it on “All Databases”. You should now see the number of matches returned for each database. Try clicking on a few of the databases to see what data are available for _Equus_.  

For this course we will primarily rely on data found in the _Nucleotide_ and _PopSet_ databases, but also use the _Taxonomy_ and _PubMed/PubMed Central_ databases. **Click on the PubMed and PubMed Central links and check out what articles are available for _Equus_**.

---
<img style="float: right;" src="http://www.reelab.net/todo.png">
#### Edit this cell (double click on it; CTRL-Enter to save) and answer the following questions about _Equus_:

1. How many nucleotide sequences are available in the _Nucleotide_ database?

2. Have multiple genes been sequenced?

3. Look at "Top Organisms" (upper right bar). Are all the sequences from _Equus_?

4. If not, do you have any idea why they were returned as matches?

---

---
<img style="float: right;" src="http://www.reelab.net/todo.png">
#### Search for accession number `U70192`

This is a complete sequence of the alpha 2 hemoglobin gene in _Equus grevyi_ (Grevy's zebra, one of the zebras from today's reading). Notice the phrase "complete cds" at the end of the definition. This tells us the record contains the complete coding sequence of the gene, versus "partial cds".

> ***NB: `cds` = coding sequence***

---
<img style="float: right;" src="http://www.reelab.net/todo.png">
#### Edit this cell and answer the following questions

1. How long is the sequence?
2. Is the sequence from a protein-coding gene?
3. What happens if you click on the "CDS" link?
4. Is base 167 of the unedited sequence at a first, second, or third codon position? How do you know?
---

This is a complete cds, but remember that the first base in _partial cds_ records in _GenBank_ may not be the first base of the coding sequence.

We can also see the name of the product that is produced by these exons, as well as a translation of the protein itself (a string of letters coding for individual amino acids – MVLSAADKTNV…).  Notice that the protein sequence has its own _GenBank_ record with some more detailed information (where it says “protein_id=”).
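
The reasoning behind the codon-position question can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a general tool: the function name `codon_position` and the exon coordinates are made up (they are not those of `U70192`), and the `codon_start` argument is meant to mirror the `/codon_start` qualifier that _GenBank_ uses on CDS features to say where the first complete codon begins.

```python
def codon_position(base, exons, codon_start=1):
    """Return the codon position (1, 2, or 3) of a base within a CDS.

    base        -- 1-based position in the full (unedited) sequence
    exons       -- list of (start, end) pairs, 1-based and inclusive, in order
    codon_start -- 1, 2, or 3: which base of the CDS starts the first full codon
    Returns None if the base lies outside the coding exons (e.g., in an intron).
    """
    offset = 0  # number of coding bases that precede the current exon
    for start, end in exons:
        if start <= base <= end:
            index = offset + (base - start) - (codon_start - 1)
            return index % 3 + 1
        offset += end - start + 1
    return None


# Hypothetical CDS made of two exons separated by an intron.
exons = [(10, 15), (20, 25)]
print(codon_position(10, exons))                 # 1: first base of the CDS
print(codon_position(12, exons))                 # 3
print(codon_position(17, exons))                 # None: inside the intron
print(codon_position(20, exons))                 # 1: coding resumes in frame
print(codon_position(10, exons, codon_start=2))  # 3: last base of a partial codon
```

The key step is counting only coding bases: introns are skipped, and a partial cds shifts the frame by `codon_start - 1`.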


### Using Search Limits

Let's go back to the search you did for _Equus_ earlier in the lab. Recall that many of the sequences returned were not from _Equus_ species.

---
<img style="float: right;" src="http://www.reelab.net/todo.png">
#### Edit this cell to provide the answers
1. Click on "Advanced" under the search field within the _Nucleotide_ database.
2. Select "Organism" from the Search Builder options.
3. Again, type in "_Equus_", click "Search", and look at your results.
4. **Are all of the sequences from _Equus_?**
5. **How many are there?**
---

>***Accession numbers***: In NCBI, each sequence is identified primarily by a unique __accession number__. These usually begin with 1-2 letters, followed by 5-6 digits. They can be used in searches to retrieve a range of sequences: e.g., you can enter two accession numbers separated by a colon (try “U70192:U70194”). Sequences published together in a paper are usually assigned a continuous sequence of accession numbers.
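
To make the idea of an accession range concrete, here is a small sketch that expands a query such as `U70192:U70194` into individual accession numbers. It only mimics locally what the range means; it does not contact NCBI. The pattern is a simplification based on the description above (1-2 letters followed by 5-6 digits), and real accession formats are more varied. The function name `expand_range` is made up for this example.

```python
import re

# Simplified pattern: 1-2 uppercase letters, then 5-6 digits (an assumption).
ACCESSION = re.compile(r"^([A-Z]{1,2})(\d{5,6})$")


def expand_range(query):
    """Expand 'U70192:U70194' into the list of accession numbers it spans."""
    first, last = query.split(":")
    m1, m2 = ACCESSION.match(first), ACCESSION.match(last)
    if not (m1 and m2) or m1.group(1) != m2.group(1):
        raise ValueError(f"not a simple accession range: {query!r}")
    prefix, width = m1.group(1), len(m1.group(2))
    start, stop = int(m1.group(2)), int(m2.group(2))
    # Zero-pad so that the digit count of the first accession is preserved.
    return [f"{prefix}{n:0{width}d}" for n in range(start, stop + 1)]


print(expand_range("U70192:U70194"))   # ['U70192', 'U70193', 'U70194']
```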

### Downloading Sequences from _GenBank_

![](https://i.ytimg.com/vi/uXdzuz5Q-hs/maxresdefault.jpg)

There is _a lot_ of data in _GenBank_, and a lot of great research can be done using it -- you will do some for your final projects! But how do you wrangle the data into a usable format?

---
<img style="float: right;" src="http://www.reelab.net/todo.png">
#### Download just the sequences from _Equus kiang_ (the largest of the wild asses)

1. Return again to the “_Equus_” advanced search results
2. Find _Equus kiang_ in the "Top Organisms" list, and select it.
3. Click the "Send to" link (upper right) and choose "File".
4. Select the "FASTA" format.
5. Save the file and open it in any text editor, or open it from this notebook using _File_->_Open..._
6. Look at the file and note its structure: this is the FASTA format.

> ***FASTA format***: this is a simple format that can be read by many programs. The first line of a sequence begins with a  `>` followed by the sequence name, and subsequent lines contain the sequence itself.  A line break separates each sequence from the name of the next sequence.

```
    >cat
    ATGCATGC
    >dog
    ATGCATGC
    ...
```
---
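
Because the format is so simple, a few lines of Python are enough to read it. The sketch below is a minimal parser written for illustration only (the function name `parse_fasta` is made up); it also shows that a sequence may be wrapped over several lines. Later in this lab we use Biopython's `SeqIO` module, which is the more robust choice for real data.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if name is not None:          # finish the previous record
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []   # start a new record
        elif name is not None:
            chunks.append(line)           # sequence lines are joined together
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


example = ">cat\nATGC\nATGC\n>dog\nATGCATGC\n"
records = parse_fasta(example)
print(records)   # [('cat', 'ATGCATGC'), ('dog', 'ATGCATGC')]
```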

### Searching by similarity: the Basic Local Alignment Search Tool (_BLAST_)

_BLAST_ (http://blast.ncbi.nlm.nih.gov/) is an incredibly useful tool that very quickly finds the closest (most similar) matches to a query sequence. The “nucleotide blast” option is the most commonly used in phylogenetics.

On the Standard Nucleotide BLAST query page, you can paste the query sequence, enter its accession number, or upload a file containing the sequence. Let’s use the accession number of the zebra sequence we examined previously, `U70192`.

---
<img style="float: right;" src="http://www.reelab.net/todo.png">
1. Enter the accession number in the first box.
2. Select the database to search: **nr/nt** is best for non-model organisms.
3. Under **Program Selection**, select **Somewhat similar sequences (blastn)**. This option is best for finding matches in more distantly related species.
---

A number of sequences will be returned.  These matches are color-coded by the degree of similarity.  Scroll down and you can see the individual sequences, along with a number of statistics describing the closeness of the match.  The higher the coverage and the lower the E-value, the closer the match. See the link on the top-right of the page for more information.

---
<img style="float: right;" src="http://www.reelab.net/todo.png">
#### Edit this cell and answer:

1. What is the closest match?
2. What is the second closest match?
---

## 4. Downloading published sequences for future labs

In the coming labs we will work with data from this paper:

[Gómez-Acevedo S, Rico-Arce L, Delgado-Salinas A, Magallón S, Eguiarte LE. 2010. Neotropical mutualism between _Acacia_ and _Pseudomyrmex_: Phylogeny and divergence times. _Mol Phylogenet Evol_. 56:393-408.](http://www.sciencedirect.com.proxy.uchicago.edu/science/article/pii/S1055790310001168)

This study examines coevolution between acacias and their mutualistic ant partners in the Neotropics. These ants nest in acacia plants and protect them from herbivores, pathogenic fungi, and encroaching vegetation.

We will download the ant sequences from their [Table 2](http://www.sciencedirect.com.proxy.uchicago.edu/science/article/pii/S1055790310001168#tbl2) directly from GenBank using the [Biopython](http://biopython.org) library.

---
<img style="float: right;" src="http://www.reelab.net/todo.png">
### 4.0 Change the kernel of the notebook to Python 3

---
<img style="float: right;" src="http://www.reelab.net/todo.png">
### 4.1 Copy the accession numbers into a text file

1. Go to the Jupyter home tab (showing files in the current directory)
2. Create a new text file using the **New** dropdown menu in the upper right corner of the page
3. Copy and paste the text from [Table 2](http://www.sciencedirect.com.proxy.uchicago.edu/science/article/pii/S1055790310001168#tbl2) into this file.
4. Rename the file `ants.csv` and save it.
5. In a Terminal, the command
    ```bash
    head ants.csv
    ```
    should yield:
    ```
    	Species	LW Rh	Wg
    1	Myrcidris epicharisa	AY703785	AY703651
    2	Pseudomyrmex apachea	AY703786	AY703652
    3	Pseudomyrmex boopisa	AY703787	AY703653
    4	Pseudomyrmex concolora	AY703788	AY703654
    5	Pseudomyrmex cordiaea	AY703789	AY703655
    6	Pseudomyrmex cubaensisa	AY703790	AY703656
    7	Pseudomyrmex dendroicusa	AY703791	AY703657
    8	Pseudomyrmex denticollisa	AY703792	AY703658
    9	Pseudomyrmex elongatulusa	AY703793	AY703659
    ```

---
<img style="float: right;" src="http://www.reelab.net/todo.png">
### 4.2 Read the file into a [pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe)

Insert a new cell below and execute the following Python code to read the file into a dataframe `df`:

```python
import pandas as pd
df = pd.read_csv('ants.csv', sep='\t')
```

The dataframe has 2 columns of accession numbers, one for the long-wavelength rhodopsin gene (`LW Rh`) and one for the _wingless_ gene (`Wg`).
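
If you want to see exactly how the column names are derived from the file, the following sketch reads the first rows of the table (copied from the `head` output in section 4.1) using only Python's standard `csv` module. It is an illustration of the tab-separated layout, not a replacement for the pandas code above.

```python
import csv
import io

# The first rows of the table as shown in section 4.1 (tab-separated).
sample = (
    "\tSpecies\tLW Rh\tWg\n"
    "1\tMyrcidris epicharisa\tAY703785\tAY703651\n"
    "2\tPseudomyrmex apachea\tAY703786\tAY703652\n"
)

reader = csv.reader(io.StringIO(sample), delimiter="\t")
header = next(reader)        # the first line holds the column names
rows = list(reader)

# Pick out each accession column by its exact name (note the space in 'LW Rh').
lw_rh = [row[header.index("LW Rh")] for row in rows]
wg = [row[header.index("Wg")] for row in rows]

print(header)   # ['', 'Species', 'LW Rh', 'Wg']
print(lw_rh)    # ['AY703785', 'AY703786']
print(wg)       # ['AY703651', 'AY703652']
```

The first column has an empty name because the header line begins with a tab, and column names must be typed exactly as they appear, including spaces and capitalization.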

---
<img style="float: right;" src="http://www.reelab.net/todo.png">
### 4.3 Use Biopython to fetch the sequences from NCBI and save them as FASTA files

```python
from Bio import Entrez, SeqIO
Entrez.email = 'you@uchicago.edu'  # edit this - identify yourself to NCBI!
```

Let's do the rhodopsin sequences first.

```python
accessions = df['LW Rh']  # this selects the dataframe column containing the rhodopsin accession numbers
```

In the code below, [`Entrez.epost()`](http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc115) uploads the list of primary identifiers (for sequences, these are accession numbers) to NCBI's history server, which returns a `WebEnv` and `QueryKey` that identify the uploaded list. Next, [`Entrez.efetch()`](http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc117) uses those two values to download the corresponding sequence data. Finally, the sequence data are parsed into a list of Biopython [`SeqRecord`](http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc32) objects.

```python
h = Entrez.epost(db='nucleotide', id=','.join(accessions))  # h is a 'handle' on the results of the search
d = Entrez.read(h)
h.close()
h = Entrez.efetch(db='nucleotide', rettype='gb', retmax=len(accessions), webenv=d['WebEnv'], query_key=d['QueryKey'])
seqs = list(SeqIO.parse(h, 'genbank'))
```

Let's save the sequences as a FASTA file.

```python
SeqIO.write(seqs, 'LWRh.fasta', format='fasta')
```

**Use the `head` (or `less`) command to look at the file**, and notice that lots of information is packed into the sequence labels, e.g.:

    >AY703786.1 Pseudomyrmex apache long-wavelength rhodopsin (LW Rh) gene, partial cds

For phylogenetic analysis, we typically want these labels to be short and sweet - just the name of the organism, and possibly an identifier like the accession number. Also, they should not contain spaces or punctuation characters. Here's how to save an improved file:

```python
for s in seqs:
    species = '_'.join(s.description.split()[:2])  # the first 2 words are the genus and species
    accession = s.name
    s.id = '_'.join((species, accession))
    s.description = ''  # make this blank, so only the id field gets written
SeqIO.write(seqs, 'LWRh.fasta', format='fasta')
```

Here we take advantage of the fact that the SeqRecord.description field begins with the species binomial. A better method would actually query NCBI and look up the full taxonomic record of the sequence, but we will keep it simple.
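
Taking the first two words works for the example label shown earlier, but some records may have descriptions that do not begin with a clean binomial (for example, an unidentified species written as "sp."). The sketch below applies the same relabeling logic to plain strings and checks the result for unwanted characters. The helper names `make_label` and `is_clean` are made up for this illustration, and the second description is a hypothetical example, not a record from the paper.

```python
import re


def make_label(description, accession):
    """Build 'Genus_species_ACCESSION' from a GenBank-style description."""
    genus, species = description.split()[:2]
    return "_".join((genus, species, accession))


def is_clean(label):
    """True if the label contains only letters, digits, and underscores."""
    return re.fullmatch(r"[A-Za-z0-9_]+", label) is not None


label = make_label(
    "Pseudomyrmex apache long-wavelength rhodopsin (LW Rh) gene, partial cds",
    "AY703786",
)
print(label, is_clean(label))   # Pseudomyrmex_apache_AY703786 True

# A hypothetical description that would slip punctuation into the label.
odd = make_label("Pseudomyrmex sp. voucher X wingless gene", "XX000000")
print(odd, is_clean(odd))       # Pseudomyrmex_sp._XX000000 False
```

Running a check like `is_clean` over all labels before writing the FASTA file is a cheap way to catch records that need manual attention.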

#### Copy and execute the above code in a new cell below. Then, repeat the exercise to fetch and save the sequences for _wingless_, editing the code as needed.

---

---

## That's all for today!