# October 8

We're going to do a number of things today. First, we'll add new sequences to our existing alignment and estimate a phylogeny. Then, we'll learn about revision management: the process of tracking changes to code and data via the command line.

To do the alignment, we'll be using a classroom set of sequences. I have created a matrix of two sequences that will be added to the existing alignment. To do this, we will use a piece of software called MAFFT. First, we will convert our sequences from Phylip to FASTA.

In [8]:
import dendropy

amphib = dendropy.DnaCharacterMatrix.get(
    path="../data/plethodon.phy",
    schema="phylip"
)

amphib.write_to_path("../data/plethodon.fa", schema="fasta")

Now, we will make our alignment. At the command line, navigate to the data directory and enter the command below:

In [None]:
# --add takes the new sequences first, then the existing alignment
mafft --add MysterPleth.fa plethodon.fa > full_plethodon_alignment.phy
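
Before building a tree, it is worth checking that the output really is an alignment, meaning every sequence has the same length once gaps are counted. MAFFT writes FASTA by default, so the sketch below reads FASTA. The helper names `read_fasta` and `is_aligned` are my own, and the toy sequences are made up rather than taken from the class data.

```python
from pathlib import Path
import tempfile


def read_fasta(path):
    """Return a dict mapping each sequence name to its sequence string."""
    sequences = {}
    name = None
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            sequences[name] = ""
        elif name is not None:
            # FASTA files often wrap long sequences over several lines; join them.
            sequences[name] += line
    return sequences


def is_aligned(sequences):
    """True if there is at least one sequence and all share one length."""
    lengths = {len(seq) for seq in sequences.values()}
    return len(lengths) == 1


# Toy example with made-up sequences (not the class data).
with tempfile.TemporaryDirectory() as tmp:
    toy = Path(tmp) / "toy.fa"
    toy.write_text(">seq_a\nACGT-A\nCG\n>seq_b\nAC-TTA\nCG\n")
    toy_sequences = read_fasta(toy)

print(is_aligned(toy_sequences))
```

To check the class file, you could call `is_aligned(read_fasta("full_plethodon_alignment.phy"))` from the data directory.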

## Coda: How do we know if scientific software is reliable?

-   ¯\_(ツ)_/¯
- I'd start by searching for a paper. Navigate to Google Scholar. Is there a paper for MAFFT?
    - There are many, in fact!
    - Do I need to read all of them? 
- Next, I would read a paper or two for the following answers:
    - What assumptions does this software make?
        - What does an aligner assume? 
        - What does a phylogenetic tree software assume?
    - What input will it take? 
        - What input does an aligner take?
    - What is the output?
    - Roughly, what is the methodology? 
    
![](img/bbn013f1.png)


(If you are interested in Multiple Sequence Alignment, please see my [lesson](https://selusys.github.io/SELUSys2018/14-MSA/) on MSA from last year's systematics class)

- What about the software itself? 
    - I strongly prefer open-source software. 
    - MAFFT's source code is [here](https://mafft.cbrc.jp/alignment/software/source.html)
    - Does it have tests? Tests, which we will talk about in a few weeks, allow us to check our results against known results. They _test_ whether the software is reliable.
    - Do they report changes and bug fixes? 
        - Every piece of software has bugs. Every one of them. 100% All of 'em. The problem is when developers are not honest and transparent about the bugs.
        
- Usage: Are people using the software?
    - Do their results make sense?
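
To make the idea of a test concrete, here is a minimal sketch. The `reverse_complement` function is a hypothetical example of my own, not code from MAFFT or RAxML. The test checks it against answers worked out by hand, plus one property that should always hold.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def test_reverse_complement():
    # Known answers worked out by hand.
    assert reverse_complement("ATGC") == "GCAT"
    assert reverse_complement("AAAC") == "GTTT"
    # Applying the function twice should give back the original sequence.
    assert reverse_complement(reverse_complement("GATTACA")) == "GATTACA"


test_reverse_complement()
```

If someone later changes `reverse_complement` and breaks it, the test fails loudly instead of the error slipping into a result.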
    
## The Future:

It is becoming increasingly common to publish software via what we call code journals. [For example!](https://github.com/ropensci/onboarding/issues/239) 



## Exercise 1

We've been working with RAxML to build trees. I want you to look at the above questions for RAxML. The website is [here](https://github.com/stamatak/standard-RAxML). Discuss with a partner: do you think this software is reliable?

In [7]:
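import dendropy
from dendropy.interop import raxml

# MAFFT writes FASTA by default, so we read with schema="fasta"
# even though the file name ends in .phy.
pleth = dendropy.DnaCharacterMatrix.get(
    path="../data/full_plethodon_alignment.phy",
    schema="fasta"
)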

In [8]:
rx = raxml.RaxmlRunner(raxml_path="/bin/raxmlHPC")
tree = rx.estimate_tree(
        char_matrix=pleth)
tree.write_to_path("../data_output/tree.phy", schema="newick")

Have a look at the tree in IcyTree.

## Managing Revisions with Git



We've done something cool - we added a new tip to a phylogeny with novel sequence data. How exciting! 

So, how many of you have gotten something working in the past few weeks, and then broken it?

We just looked at some software on GitHub, which is a website for hosting _version control repositories_. Version control is sort of like "track changes" for code and a remote backup, all in one:

- Nothing that is committed to version control is ever lost, unless you work really, really hard at it. Since all old versions of files are saved, it’s always possible to go back in time to see exactly who wrote what on a particular day, or what version of a program was used to generate a particular set of results.

- As we have this record of who made what changes when, we know who to ask if we have questions later on, and, if needed, revert to a previous version, much like the “undo” feature in an editor.

- When several people collaborate in the same project, it’s possible to accidentally overlook or overwrite someone’s changes. The version control system automatically notifies users whenever there’s a conflict between one person’s work and another’s.

Teams are not the only ones to benefit from version control: lone researchers can benefit immensely. Keeping a record of what was changed, when, and why is extremely useful for all researchers if they ever need to come back to the project later on (e.g., a year later, when memory has faded).

Version control is the lab notebook of the digital world: it’s what professionals use to keep track of what they’ve done and to collaborate with other people. Every large software development project relies on it, and most programmers use it for their small jobs as well. And it isn’t just for software: books, papers, small data sets, and anything that changes over time or needs to be shared can and should be stored in a version control system.

![](img/phd101212s.gif)

We're going to move to the terminal now. Open a terminal. Enter the following, replacing Your_name with your first and last name.

In [None]:
mkdir Your_name_project

Have a look in the browser - what has happened? 

Change into the directory you created. Next, we'll create three more directories: `data`, `output` and `scripts`.

In [None]:
# Create them

We'll now practice moving files. We want to move the `plethodon.fa` and the `MysterPleth.fa` files into `data`.

In [None]:
# Answer follows.
mv CompBio2018.git/data/plethodon.fa data/

## Exercise 2

Try moving the phylogeny into output. 

What will we put in scripts? Let's make four quick scripts. We will make:
- A script that converts a file from Phylip to Fasta
- A script that adds mystery sequences to the alignment with MAFFT
- A script that reads in the data and runs RAxML
- A script that runs all three of the previous
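
As a rough sketch of the fourth script, the driver below runs the other three in order and stops at the first failure. The script names are hypothetical, so swap in whatever you called yours. The `runner` argument is my own addition; it lets us try the logic without launching anything.

```python
import subprocess
import sys

# Hypothetical script names; use whatever you called your own scripts.
STEPS = [
    [sys.executable, "scripts/phylip_to_fasta.py"],
    ["bash", "scripts/add_sequences.sh"],
    [sys.executable, "scripts/run_raxml.py"],
]


def run_pipeline(steps, runner=subprocess.run):
    """Run each step in order, stopping at the first failure."""
    for command in steps:
        print("Running:", " ".join(command))
        # check=True makes subprocess.run raise CalledProcessError
        # if a step exits with an error, so later steps never start.
        runner(command, check=True)
```

Calling `run_pipeline(STEPS)` from the project directory would run the conversion, the MAFFT step, and RAxML one after another.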

Once we do this, we want to be _damn_ sure we don't delete it all, right? Enter Git, our lab notebook and backup server in one. First, let's tell git who we are:

In [None]:
git config --global user.name "Vlad Dracula"
git config --global user.email "vlad@tran.sylvan.ia"
git config --global core.editor "nano -w"

Now, we tell Git where we would like to track our changes. 

In [None]:
git init

This tells Git to keep track of the changes we make in this directory (Your_name_project). Next, we can check that Git is initialized.

In [None]:
git status

We have our repository initialized, but we have not asked git to log any of our files. We will now do that:

In [None]:
git add scripts/*

The star is called a wildcard: it matches every file in `scripts`. Use `git status` again to see which files have been "added". You can think of Git like taking a picture. We have just focused what will be in the picture: our scripts.

We can also add single files like so:

`git add output/tree.phy`

## Exercise 3

Add the two data files to git.

Now that our camera is focused, we'll take the snapshot:

`git commit -m "Initial phylo scripts and data"`

We can think of commit as taking a photo, and naming it right away. The `-m` means message - what did we do? What are we committing? 
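
If you would like to watch the focus-then-snapshot cycle without touching your real project, here is a small sketch that replays it in a throwaway directory. It assumes `git` is installed. The helper names `git` and `snapshot_demo` and the placeholder file `convert.py` are my own inventions for illustration.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Identity passed per command so the demo does not touch your global config.
IDENTITY = ["-c", "user.name=Vlad Dracula", "-c", "user.email=vlad@tran.sylvan.ia"]


def git(args, cwd):
    """Run one git command in cwd and return what it printed."""
    result = subprocess.run(
        ["git", *IDENTITY, *args],
        cwd=cwd, check=True, capture_output=True, text=True,
    )
    return result.stdout


def snapshot_demo():
    """Stage and commit one placeholder file in a throwaway repository."""
    if shutil.which("git") is None:
        raise RuntimeError("git is not installed")
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        (repo / "scripts").mkdir()
        (repo / "scripts" / "convert.py").write_text("# placeholder script\n")
        git(["init"], repo)
        git(["add", "scripts/convert.py"], repo)
        staged = git(["status", "--porcelain"], repo)     # camera is focused
        git(["commit", "-m", "Initial phylo scripts and data"], repo)
        committed = git(["status", "--porcelain"], repo)  # photo has been taken
        log = git(["log", "--oneline"], repo)
    return staged, committed, log
```

After `git add`, the short status lists the file with an `A` for "added". After the commit, the status is empty and the message shows up in the log.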

Lastly, we will connect our local code to our online backup. On the [GitHub.com](https://github.com) website, look for the following on the left-hand side:

![](img/New_repo.png)

This will create a repository for your code online.

The next step will allow you to give it a name. I prefer to name it what I called it on my computer:

![](img/Name_repo.png)

There will now be some instructions for you to copy and paste into your terminal. Do it.