# Investigating evolutionary relationships through sequence and cluster analysis

## Wally Novak, Wabash College

***

Faculty and students often deal with large datasets in biochemistry. For example, we might go online and run a BLAST search to identify a protein sequence. The default BLAST parameters limit the results we receive to 100 sequences. In many cases this is more than sufficient, and we can stay out of the "Big Data" weeds, so to speak.

However, protein (and nucleotide) sequences are rich with information, and there is much knowledge to be gained about evolutionary relationships between proteins using Big Data. Processing large datasets often requires specialized software packages. 

This workshop will:

* Give a brief overview of using Jupyter Notebooks
* Introduce you to several software packages and file formats we will use for analysis
* Walk you through uploading and processing FASTA files
* Show you how to reduce large sequence datasets into diverse, yet manageable sets
* Demonstrate how visualization of sequence similarity networks can provide important and novel information about proteins
* Leave you with a computational lab experience that can be tailored to your interests!

***



## A (very) brief introduction to Jupyter Notebooks, JupyterLab, and Python


Jupyter notebooks provide an interactive, web-based programming environment. There are many benefits of using a Jupyter notebook in your classroom or lab, including access to a variety of Big Data programming tools (code sections) and the ability to include narrative sections (such as this one, also called a markdown section). This notebook is based on the Python 3 programming language. Throughout this workshop you will encounter code that is represented like this:

`print("This is some python code.")`

I try to explain all the code in these markdown sections. The print statement simply prints whatever is in between the quotes. The box below this one is a code section. You can simply copy and paste the code into the box below. It's the line that starts with `In [ ]:`

With that box highlighted, you can run the code in two ways. The first way is to click the Run button near the top of the page. The second way is to use shift+return.

If you haven't already, go ahead and copy and paste the print code into the box and run it. You should see some output appear below the box.

Admittedly, this is some boring code, so why not spice it up? Edit the code box above to print your name at the end of the statement. For example, I might want it to say "This is some python code, Wally." You can change anything in between the quotes. When you rerun the code (either with the Run button or with shift+return), you will see the new output below that box, and the number in brackets on that line will increase by 1.
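If you want to go one small step further, here is a minimal sketch of the same edit using a variable instead of typing the name directly into the quotes. The variable names `name` and `message` are my own choices for illustration, and the leading `f` tells Python to fill in whatever sits inside the curly braces.

```python
# Store the name in a variable (swap in your own name here).
name = "Wally"

# An f-string inserts the value of `name` where {name} appears.
message = f"This is some python code, {name}."

print(message)
```

Changing the value of `name` and rerunning the cell changes the printed sentence without touching the rest of the line.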

If this is your first time writing and running some Python code, congratulations! 

Let's try copying and pasting some slightly more complicated code into the code box below.

    for i in range(3):
        print("This is freaking awesome!")

This code creates a `for` loop that runs the print statement once for each value in the given range, so with `range(3)` the message is printed three times.

Go ahead and run it.

Before we move on, change the range in the code above to 10 and run it again. Even more awesomeness!
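To see what the loop is doing behind the scenes, here is a small sketch that also prints the loop variable `i` on each pass. The list `lines` and the "Pass" label are just illustrative additions of mine so we can keep track of what was printed.

```python
# Collect each printed line so we can count them afterwards.
lines = []

# range(3) produces the values 0, 1, and 2, one per pass through the loop.
for i in range(3):
    line = f"Pass {i}: This is freaking awesome!"
    lines.append(line)
    print(line)

# The number of lines matches the number given to range().
print(len(lines))
```

Notice that Python starts counting at 0, so `range(3)` gives 0, 1, 2 rather than 1, 2, 3.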

***

These simple exercises serve two purposes (I hope).
* You can type in working code with minimal guidance!
* You can edit Jupyter notebooks to tailor your exercises to what you want to study!

***

In this exercise we will use protein sequence files that we obtain from UniProt. We will also use NCBI BLAST to obtain expectation values (E-values) for visualization, CD-HIT to reduce the size of our dataset in a rational way, and Cytoscape to view networks!
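If you have never looked inside a sequence file, here is a rough sketch of what the FASTA format looks like and how a few lines of plain Python could read it. Each record starts with a header line beginning with `>`, followed by one or more lines of sequence. The function `parse_fasta`, the two made-up records, and their identifiers are all hypothetical, written only for illustration; this is not necessarily how the workshop tools read these files.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record begins; remember its header.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are joined onto the current record.
            records[header] += line
    return records


# Two made-up records, with headers styled after UniProt entries.
example = """>sp|EXAMPLE1|DEMO1_HYPOT Made-up protein 1
MKTAYIAKQR
QISFVKSHFS
>sp|EXAMPLE2|DEMO2_HYPOT Made-up protein 2
MSDNELLKQA
"""

records = parse_fasta(example)

# Print the identifier (second field between the '|' characters) and length.
for header, sequence in records.items():
    identifier = header.split("|")[1]
    print(identifier, len(sequence))
```

A real dataset can contain thousands of records like these, which is exactly why we want software to do the reading and counting for us.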

## This can be overwhelming, and getting everyone on the same page is nearly impossible.

One person is using a Mac, another a Windows PC. It can also be difficult to keep track of where files are stored. It would be nearly impossible to make any headway on this with a group this size.

## However...
We will perform the rest of the workshop using JupyterLab in a <i>remote</i> environment that I have set up. In this environment all the needed software is already installed, and we can easily upload files to work with!

Go ahead and click the "launch binder" button below (this may take a few minutes to start!).

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/wallynovak/my-first-binder/HEAD?urlpath=lab)
