# Investigating evolutionary relationships through sequence and cluster analysis

## Wally Novak, Wabash College

***

Scientists in biochemistry often deal with large datasets. For example, we might go online and run a BLAST search to identify a protein sequence. By default, BLAST limits the results we receive to 100 sequences. In many cases this is more than sufficient, and we can stay out of the "Big Data" weeds, so to speak.

However, protein (and nucleotide) sequences are rich with information, and there is much knowledge to be gained about evolutionary relationships between proteins using Big Data. Processing large datasets often requires specialized software packages. 

The overarching goals of this exercise are to:

    1. Learn to navigate Jupyter Lab and Notebooks in the Binder (mybinder.org) environment
    2. Learn about FASTA sequence files
    3. Be able to handle, filter, and clean large sequence datasets
    4. Learn about and use the Basic Local Alignment Search Tool (BLAST)
    5. Be able to create interactive sequence similarity networks from BLAST results to investigate sequence relationships
    6. Be able to generate new knowledge using sequence similarity networks 


This Notebook will give a brief overview of running and editing Jupyter Notebooks and the Python programming language.

***



## A (very) brief introduction to Jupyter Notebooks, Jupyter Lab, and Python


Jupyter notebooks provide an interactive, web-based programming environment. There are many benefits of using a Jupyter notebook in your classroom or lab, including access to a variety of Big Data programming tools (code sections) and the ability to include narrative sections (such as this one, also called a markdown section). 

You can edit markdown sections. Just double click in the markdown area and type what you like. Go ahead and double click in this box. I just did!

To return to the formatted view, you need to "run" the markdown section. With the box highlighted (shown by the vertical blue bar at left), you can run it in two ways: click the Run (Play) button near the top of the page, or press shift+return.

Here is a link to some cool formatting you can do in markdown sections: https://towardsdatascience.com/jupyter-and-markdown-cbc1f0ea6406. You can also do an internet search for "jupyter markdown."
***

This notebook is based on the Python 3 programming language. Throughout this workshop you will encounter code that is represented like this:

~~~python
print("This is some python code.")
~~~
I will use these markdown sections to guide you through the exercise, explain the basics of the code, provide general information, or even pose some questions. The above print statement simply prints whatever is in between the quotes. The box below this one is a code section. 

<font color=blue><b>STEP 1:</b></font> Copy and paste the above code into the code box below. The code box starts with [ ]:

<font color=blue><b>STEP 2:</b></font> Run the code using the Run button or shift+return. You should see some output appear below the box.

While the code is running you will see an asterisk appear in the brackets, e.g.: [\*]. When it is completed, a number appears, e.g. [1].

Admittedly, this is some boring code, so why not spice it up? 

<font color=blue><b>STEP 3:</b></font> Edit the code box above to print your name at the end of the statement. For example, I might want it to say "This is some python code, Wally." You can change anything in between the quotes. When you rerun the code (with either the Run button or shift+return), you will see the new output below that box, and the number in brackets on that line will increase by 1.

If this is your first time writing and running some Python code, congratulations! 

<font color=blue><b>STEP 4:</b></font> Let's try copying and pasting some slightly more complicated code into the code box below.

~~~python
for i in range(3):
    print("Python in Jupyter is awesome!")
~~~

This code creates a loop that runs the print statement once for each value in the given range, so `range(3)` prints the message three times.

<font color=blue><b>STEP 5:</b></font> Go ahead and run it.

<font color=blue><b>STEP 6:</b></font> Before we move on, change the range in the code above to 10 and run it again. Even more awesomeness!
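
If you are curious what the loop is doing on each pass, the small variation below (my own illustration, not part of the exercise) includes the loop variable `i` in the message. Note that `range(3)` counts from 0, so the passes are numbered 0, 1, and 2.

```python
# Collect one message per pass so we can inspect them afterwards
messages = []
for i in range(3):
    # range(3) yields 0, 1, 2, so i is the number of the current pass
    messages.append(f"Pass {i}: Python in Jupyter is awesome!")

# Print the collected messages, one per line
for message in messages:
    print(message)
```

Changing the number inside `range()` changes how many messages are collected and printed.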
***

These simple exercises are meant to show (I hope) that you can...
* Run working code with minimal guidance!
* Edit Jupyter notebooks to tailor your exercises to your own needs!

***

# Why we use Binder

In this exercise we will use protein sequence files that we obtain from UniProt. We also want to use the NCBI BLAST software to obtain expectation values (E-values) for visualization, the CD-HIT software to reduce our dataset size in a rational way, and Cytoscape for viewing networks!

While it is possible to install most of this software on any computer, it can be very difficult to get everything working. Also, many of us want to do or teach the science, so figuring out how to install all the software is not a good use of our time.  

We are running this Jupyter notebook in Binder, a <i>remote</i> environment that I have set up. In this environment all the needed software is already installed, and we can easily upload files to work with! Check out more at https://mybinder.org.
***

## Some important ideas for using Jupyter Notebooks in Binder

    1. Please ensure the code in a cell has finished running before moving on to the next one!
    2. If you accidentally delete or mangle some code, use the Edit -> Undo function!
    3. If you need to stop some code from running in a cell use the Stop button (next to the Run button).
    4. If you don't interact with Binder for too long, you may lose the Kernel and need to restart it. 
    5. If you are away for a long time, then you may need to relaunch your Binder - which means you might need to regenerate any data that you did not download!
***


# On to the Science - Importing some sequence data!

In this set of Jupyter Notebooks, I will guide you through obtaining a set of DtxR-related protein sequences and processing those sequences to create a sequence similarity network (SSN). You will use the SSN to examine relationships among DtxR-related proteins.

In this first part of the exercise, we will retrieve a set of diphtheria toxin repressor (DtxR)-like protein sequences from UniProt and upload them to Binder.

### <font color="blue">Workshop Note:</font> I find that these types of exercises work well with small groups and access to instructor help and clarification. For the purpose of this workshop I will walk us through the exercise. Please do use the Q&A feature or Raise hand to get some help or slow me down! More experienced users can feel free to work through and/or edit the notebooks at a faster pace. 

<font color=blue><b>STEP 7:</b></font> Click this link to head to https://uniprot.org in a new browser window.

***
At the UniProt site...

<font color=blue><b>STEP 8:</b></font> Type dtxr in the search bar and hit return.

<font color=blue><b>STEP 9:</b></font> Explore the results and consider these questions:

    1. How many results did you get from this search?
    2. How many results are reviewed?
    3. How many sequences are unreviewed?

Hopefully you found more than 30,000 total sequences. I would certainly call this <b>big data</b> - way too many sequences to look at manually!

<font color=blue><b>STEP 10:</b></font> Near the top of the sequences you will see the download button. Download all the sequences in FASTA (canonical) format and ensure <b>uncompressed</b> is checked. <b>Please change the file name to uniprot-dtxr.fasta. Note where this file is saved on your computer; we will need to upload it to the Binder environment next!</b><br> <img src="images/download.png" width=200>

<font color=blue><b>STEP 11:</b></font> Next, we will upload this file to the Binder environment. In the files panel at the left, <b>double click the files folder to open it</b>. You should see two existing files in there: dtxr_pdbs.fasta and dtxr.tfa. Now you may either drag the uniprot-dtxr.fasta file from your computer folder into the Binder folder or click the Upload files button that looks like this:<img src="images/upload.png" width=50>
***

<font color=blue><b>STEP 12:</b></font> Because the FASTA file is simply a text file, you can double click it in the files panel and it will display the contents in a new window.

You should see that each sequence record in the FASTA-formatted file starts with a '>' and that this first line contains identifying information. The lines that follow, up to the next '>', are the protein sequence in single-letter amino acid code.
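
As a rough illustration of that layout, the sketch below splits FASTA-formatted text into records using only plain Python. The two records in `example_fasta` are made up for illustration (they are not real UniProt entries), and `parse_fasta` is a hypothetical helper name of my own, not part of any library.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts here, so store the previous one first
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]  # drop the leading '>'
            chunks = []
        else:
            # Sequence lines are joined into one string per record
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up records for illustration only (not real UniProt entries)
example_fasta = """>example_1 hypothetical protein A
MKDLVDTTEM
YLRTIYELEE
>example_2 hypothetical protein B
MNELSPRAED
"""

records = parse_fasta(example_fasta)
print("Number of records:", len(records))
for header, sequence in records:
    print(header, "-", len(sequence), "residues")
```

The same function could be applied to the text of the uploaded file, for example `open("files/uniprot-dtxr.fasta").read()`, assuming the file was saved under that name in the files folder.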

<font color=blue><b>STEP 13:</b></font> Answer these questions:

    1. What is the function/name of the first protein (ID = P9WMH1)?
    2. What is the function/name of the second protein (ID = P0DJL7)?
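
The answers to these questions sit in the header lines. UniProt FASTA headers generally follow the pattern `>db|Accession|EntryName ProteinName OS=Organism ...`, where `db` is `sp` for reviewed (Swiss-Prot) entries and `tr` for unreviewed (TrEMBL) entries. The sketch below pulls those pieces apart; the header shown is made up for illustration, and `split_uniprot_header` is a hypothetical helper name, not a UniProt tool.

```python
def split_uniprot_header(header):
    """Split a UniProt-style FASTA header into its main parts."""
    header = header.lstrip(">")
    # The identifier block comes before the first space
    id_part, _, rest = header.partition(" ")
    database, accession, entry_name = id_part.split("|")
    # The protein name runs up to the organism field, which starts with "OS="
    protein_name = rest.split(" OS=")[0]
    return {
        "database": database,
        "accession": accession,
        "entry_name": entry_name,
        "protein_name": protein_name,
        "reviewed": database == "sp",
    }


# Made-up header for illustration only (not a real UniProt entry)
example_header = ">sp|X00000|EXMP_EXAMP Example regulatory protein OS=Example organism OX=0 GN=exmP PE=4 SV=1"

info = split_uniprot_header(example_header)
for key, value in info.items():
    print(f"{key}: {value}")
```

Try pasting the first header line from your own uniprot-dtxr.fasta file in place of `example_header`.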

So far we have learned how to enter and edit Python code in a Jupyter notebook. We have also learned to upload a file and even looked at the first few lines of a very large sequence file.

***
## Making a multiple alignment of sequences

We can learn a lot by aligning sequences and examining positions that are conserved. As an introduction, we will make an alignment of some DtxR-like proteins of known function. We will use Clustal Omega on the command line. More about Clustal Omega can be found here: http://www.clustal.org/omega/.

<font color=blue><b>STEP 15:</b></font> Run the code below to align the sequences of known function in the FASTA file and write the alignment to an MSF (Multiple Sequence Format) file.


~~~python
!clustalo -i files/dtxr_pdbs.fasta --outfmt='msf' -o files/dtxr_pdbs.msf
~~~
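
In the cell above, the leading `!` tells Jupyter to hand the line to the command-line shell instead of Python. For comparison, here is a minimal sketch of the same call made from Python with the standard `subprocess` module. The helper name `build_clustalo_command` and the output file name are my own assumptions, chosen so that the file written by the cell above is left alone.

```python
import os
import shutil
import subprocess


def build_clustalo_command(infile, outfile, outfmt="msf"):
    """Return the Clustal Omega call from the cell above as an argument list."""
    return ["clustalo", "-i", infile, f"--outfmt={outfmt}", "-o", outfile]


infile = "files/dtxr_pdbs.fasta"
# Hypothetical output name, so the file made by the cell above is not overwritten
outfile = "files/dtxr_pdbs_from_python.msf"

command = build_clustalo_command(infile, outfile)
print(" ".join(command))

# Run only if clustalo is installed, the input exists, and the output name is free
ready = (
    shutil.which("clustalo") is not None
    and os.path.exists(infile)
    and not os.path.exists(outfile)
)
if ready:
    subprocess.run(command, check=True)
else:
    print("Skipping the run: clustalo, the input file, or a free output name is missing.")
```

For this exercise the one-line `!clustalo` cell is all you need; the Python form is mainly useful when you want to build the command from variables.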

<font color=blue><b>STEP 16:</b></font> Find the dtxr_pdbs.msf file in the files tab and double click on it to open it. The symbol '~' indicates that a particular sequence does not have additional amino acids at the N or C terminus. A '.' means there is no amino acid in that sequence at that position, also called a gap. 

<font color=blue><b>STEP 17:</b></font> Use the MSF file to answer the following questions:

    1. Which sequences appear most similar to each other?
    2. If you had to pick one that is the most different, which would it be and why?
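
One way to put a number on "most similar" is pairwise percent identity: the share of aligned positions where two sequences carry the same amino acid. The sketch below uses three short, made-up aligned sequences rather than the real alignment. It treats '.', '~', and '-' as gaps and skips any position where either sequence has a gap; this is only one of several definitions of percent identity in use.

```python
from itertools import combinations

# Characters treated as gaps ('.' and '~' appear in MSF files)
GAP_CHARS = {".", "~", "-"}


def percent_identity(seq_a, seq_b):
    """Percent of gap-free aligned positions where the two residues match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a, seq_b):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue  # skip positions with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Made-up aligned sequences for illustration only
alignment = {
    "seq1": "MKDLV.TTEMYL",
    "seq2": "MKDLVDTSEMYL",
    "seq3": "~~ELSPRAEDYL",
}

for name_a, name_b in combinations(alignment, 2):
    pid = percent_identity(alignment[name_a], alignment[name_b])
    print(f"{name_a} vs {name_b}: {pid:.1f}% identity")
```

The pair with the highest value is the most similar by this measure, which you can compare against your own impression from reading the alignment.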

***
## Congratulations, you have finished the first notebook in this exercise!

<font color=blue><b>STEP 18:</b></font> Open Notebook "2 - Sequences_and_BLAST" by double clicking on the file in the left hand panel. The code cells in 2 - Sequences_and_BLAST already contain the needed code; to keep you engaged and on your toes, you will have to edit that code in specific (and hopefully clearly annotated) ways.
