# Oct 2


Let's remind ourselves of what we did last time: 
- Parsed sequence data
- BLAST searched a sequence against the NCBI database
- Got some potential homologous genes

As seen below:

In [None]:
from Bio.Blast import NCBIXML

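# Open the saved BLAST XML in a context manager so the file handle gets closed
with open("results.xml") as result_handle:
    parsed = NCBIXML.read(result_handle)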
alignment_set = parsed.alignments
for individual_alignment in alignment_set:
    print(individual_alignment)

We also read an article by Bill Pearson about why we should be cautious about treating BLAST hits as homologs. As you can see, printing an alignment does not show its E-value. So the last thing we'll check is whether our hits have significant E-values:

In [None]:
for individual_alignment in alignment_set:
    print(individual_alignment.hsps[0].expect)

Is that all our sequences? How could we check? 

In [None]:
len(alignment_set)

In [None]:
# Answer follows
good_e = []
for individual_alignment in alignment_set:
    # Keep hits whose first HSP has an E-value below 1e-6
    if individual_alignment.hsps[0].expect < 1e-6:
        good_e.append(individual_alignment)
    
print(len(good_e))
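
Indexing `hsps[0]` assumes the first HSP of each hit is the one we care about. As a minimal sketch, the same filter can take the smallest E-value across all HSPs of a hit instead. The `SimpleNamespace` objects are made-up stand-ins for Biopython alignment records, and `filter_by_evalue` and its `threshold` parameter are hypothetical names, not part of Biopython.

```python
from types import SimpleNamespace


def filter_by_evalue(alignments, threshold=1e-6):
    """Keep alignments whose best (smallest) HSP E-value is below threshold."""
    kept = []
    for alignment in alignments:
        # Skip hits with no HSPs rather than calling min() on an empty list
        if not alignment.hsps:
            continue
        best_e = min(hsp.expect for hsp in alignment.hsps)
        if best_e < threshold:
            kept.append(alignment)
    return kept


# Made-up stand-ins that mimic the .hsps[i].expect attributes used above
toy_hits = [
    SimpleNamespace(title="hit_A", hsps=[SimpleNamespace(expect=1e-50)]),
    SimpleNamespace(title="hit_B", hsps=[SimpleNamespace(expect=0.5)]),
    SimpleNamespace(title="hit_C", hsps=[SimpleNamespace(expect=2.0),
                                         SimpleNamespace(expect=1e-10)]),
]

good_hits = filter_by_evalue(toy_hits)
print([hit.title for hit in good_hits])
```

With real data, `filter_by_evalue(alignment_set)` would play the role of the loop that builds `good_e`.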

So now we have done some quality control on these matches, and I feel OK that they are what they say they are. There were a couple of sequences in this dataset that aren't on the tree. Maybe we want to add them to the tree. Let's read in our tree and see if there are any sequences that are in the BLAST hits, but not on our tree.

In [None]:
import pandas as pd
import dendropy

pd.options.display.max_colwidth = 1000000
salamanders = pd.read_json("salamanders.json")
salamanders

newick_tree = salamanders.newick.to_string(index=False)
converted_tree = dendropy.Tree.get_from_string(newick_tree, schema = "newick", preserve_underscores=True)
converted_tree.as_string(schema="newick")

## Exercise One: 

With a partner, look at the code below. Some is familiar. Some is not. 
- Look at the help output for `taxon_namespace`. What is a taxon namespace?
- Look at the loop. What is it doing?
- Why does a phylogeny need a special data structure for names?


In [None]:
new_labels = []
for taxon in converted_tree.taxon_namespace.labels():
    taxon_split = taxon.split("_")
    new_taxon = taxon_split[0] + "_" + taxon_split[1]
    new_labels.append(new_taxon)  


In [None]:
help(converted_tree.taxon_namespace)

Check your answers by running the cell below:

In [None]:
new_labels
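
To see what the loop does without loading the tree, here is a small sketch on made-up labels. It assumes tip labels of the form `Genus_species_extra`; the labels and the helper name `genus_species` are hypothetical and are not taken from the salamander tree.

```python
def genus_species(label):
    """Return the first two underscore-separated fields of a tip label."""
    parts = label.split("_")
    return "_".join(parts[:2])


# Hypothetical tip labels; the third field stands in for a sample identifier
toy_labels = ["Plethodon_cinereus_sample1", "Plethodon_jordani_sample2"]
trimmed = [genus_species(label) for label in toy_labels]
print(trimmed)
```

Slicing with `parts[:2]` also copes with a label that has only one field, where indexing `parts[1]` would raise an `IndexError`.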

## Exercise Two:

Next, we will get the list of names out of the BLAST hits. We will loop over our alignments to do this. In your loop, you will:
- Need to use one split command
- Need to select one index from the list that results from split
- Use another split command

In [None]:
names_list=[]

#Answer follows
for individual_alignment in alignment_set:
    split_aln = str(individual_alignment).split('|')
    name_string = split_aln[4]
    specific_ep = name_string.split(' ')[1:3]
    names_list.append('_'.join(specific_ep))


In [None]:
import dendropy
amphib = dendropy.DnaCharacterMatrix.get(
    path="../data/plethodon.phy",
    schema="phylip"
)

for taxon in names_list:
    if taxon in new_labels:
        print("Taxon present: {}".format(taxon))
    else:
        print("No taxon: {}".format(taxon))

All the taxa in the BLAST hits are on the tree, which seems odd: I don't think I saw the taxon _Plethodon oconaluftee_ when I was preparing the data matrix. Let's check the data matrix itself.
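
The membership check in the loop above can also be written with sets, which drops duplicate hits and reports the difference in one step. This is only a sketch: the two lists are made-up stand-ins for `names_list` and `new_labels`.

```python
# Made-up name lists standing in for names_list and new_labels
blast_names = ["Plethodon_oconaluftee", "Plethodon_jordani",
               "Plethodon_jordani", "Aneides_aeneus"]
tree_names = ["Plethodon_jordani", "Plethodon_cinereus", "Plethodon_oconaluftee"]

# Names among the BLAST hits that do not appear on the tree
missing_from_tree = sorted(set(blast_names) - set(tree_names))
# Names found in both places
shared = sorted(set(blast_names) & set(tree_names))

print("Missing from tree:", missing_from_tree)
print("Shared:", shared)
```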

In [194]:
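amphib = dendropy.DnaCharacterMatrix.get(
    path="../data/plethodon.phy",
    schema="phylip"
)

# Look up the labels once instead of on every pass through the loop
data_labels = amphib.taxon_namespace.labels()
for taxon in names_list:
    if taxon in data_labels:
        seq = amphib[taxon].symbols_as_string()
        # Slice out the CytB columns of the concatenated matrix
        cytb = seq[4097:5236]
        print(cytb)
    else:
        print("BLAST hits not in the data: {}".format(taxon))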

-----------------------CACCCCCTACTAAGTATTATTAACAACTCATTTATTGACCTACCAACTCCCTCAGGCCTATCCTACTTATGAAATTTTGGGTCCTTACTAGGAATTTGCCTAATTACACAAGTTTTAACCGGCCTATTTTTAGCAATACACTACACAACAGATATCCACTTTGCATTTTCATCAGTAGCGCATATCTGCCGCGACGTAAACTACGGATGATTAATTCGAAATATTCATGCCAATGGAGCTTCTTTATTTTTCATTTGTATTTATATACACATTGGACGAGGAATCTACTATGGTTCATTTATACTCAAAGAAACCTGAAACATTGGGGTAGTCCTTTTCTTTATAACCATAGCAACAGCCTTTATAGGATATGTTCTTCCATGAGGACAGATATCCTTCTGGGGGGCCACAGTAATTACTAATTTACTCTCAGCAATCCCATACATAGGCAACACCCTCGTACTATGAATTTGGGGAGGATTCTCAGTGGATCAAGCCACTTTATCCCGATTCTTTGTGTTCCACTTCATCCTACCATTTCTTATCATAGGCACTACTATAATACACCTTCTCTTTCTACATGAAACCGGCTCAAACAATCCAACAGGACTTAACTCGAACCCAG----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------ATACGAAAAACACACCCCCTACTAAATATCATTAATAACTCATTCATTGATCTACCAACCCCATCAAGCCTATCCTACTTATGAAATTTCGGGTCTTTATTAGGAGTTTGCCTAATTACACAAGTCCTAACCGGCTTATTTTTAGCAATACACTATACAGCAGATATCCACTCTGCATTTTCATCAGTAGCACATATTTGCCGTGACGTAAACCACGGATGACTAATCCGAAATATTCATGCCAACGGAGCCTCTTTATTTTTCATTTGTATTTATATACACATTGGACGAGGAATCTACTATGGATCATTTATACTCAAAGAGACTTGAAACATTGGAGTAATCCTTTTTTTTATAACTATGGCAACAGCTTTCATAGGATATGTTCTCCCATGAGGACAAATATCCTTCTGAGGAGCCACAGTAATTACTAACTTACTCTCAGCAATCCCATACATAGGTAACACTCTTGTATTGTGAATCTGAGGAGGATTCTCCGTAGACCAGGCCACTTTATCCCGATTCTTTGTATTTCACTTTATCTTACCCTTCCTCATTATGGGAACTACAATTATACACCTCCTCTTTCTACATGAAACCGGCTCAAACAACCCAACAGGACTTAACTCAAACCCAG----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Somehow this taxon is on the tree, but not in the data! You know what we're going to do now: we're going to go get the data. To do this, we need something called the GI number (GenInfo Identifier). This is a unique number that NCBI assigns to each sequence record in its databases, so we can use it to request exactly the record we want.

How will we modify the loop below to get the GI number for _Plethodon oconaluftee_?

In [None]:
gi_ids = []
for individual_alignment in alignment_set:
    # First, we get the names. This does not have to be modified.
    split_aln = str(individual_alignment).split('|')
    name_string = split_aln[4]
    specific_ep = name_string.split(' ')[1:3]
    # Then, we get the ID.
    if 'oconaluftee' in specific_ep:
        # What part of split_aln will we take to get the GI ID?
        # Replace the ... placeholder with your answer.
        gi_ids.append(...)
gi_ids

In [None]:
#Answer

gi_ids = []
for individual_alignment in alignment_set:
    split_aln = str(individual_alignment).split('|')
    name_string = split_aln[4]
    specific_ep = name_string.split(' ')[1:3]
    if 'oconaluftee' in specific_ep:
        # Field 1 of the pipe-delimited string is the GI number itself
        gi_ids.append(split_aln[1])
gi_ids
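
The splitting logic in the loop above can be collected into one small function and tried on a single string. This is a sketch that assumes hit titles of the form `gi|<number>|<db>|<accession>| Genus species ...`. The title below is made up (its GI number and accession are placeholders), and `parse_hit_title` is a hypothetical helper, not a Biopython function.

```python
def parse_hit_title(title):
    """Split a pipe-delimited BLAST hit title into a GI number and a binomial."""
    fields = title.split("|")
    if len(fields) < 5 or fields[0] != "gi":
        raise ValueError("unexpected title format: {!r}".format(title))
    gi_number = fields[1]
    # split() with no argument ignores the leading space before the genus
    words = fields[4].split()
    binomial = "_".join(words[:2])
    return gi_number, binomial


# Made-up title in the assumed format; the GI and accession are placeholders
toy_title = "gi|000000000|gb|XX000000.1| Plethodon oconaluftee cytochrome b gene"
gi_number, binomial = parse_hit_title(toy_title)
print(gi_number, binomial)
```

Raising an error on an unexpected format is a deliberate choice here: a silently wrong index would hand the wrong field to the download step.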

Let's get some sequences!

In [None]:
from dendropy.interop import genbank
gb_dna = genbank.GenBankDna(ids=gi_ids)
char_matrix = gb_dna.generate_char_matrix()

In [195]:
print(char_matrix.as_string("nexus"))

#NEXUS

BEGIN TAXA;
    DIMENSIONS NTAX=2;
    TAXLABELS
        125380720
        125380722
  ;
END;

BEGIN CHARACTERS;
    DIMENSIONS NCHAR=649;
    FORMAT DATATYPE=DNA GAP=- MISSING=? MATCHCHAR=.;
    MATRIX
        125380720    ATGACTCACATCACACGAAAAACACACCCCCTACTGAGTGTTATTAATAATTCACTTATTGATCTACCAACTCCCTCAAACCTATCCTACTTATGAAATTTTGGATCCTTACTAGGAGTCTGCCTAATCACACAAATTTTAACCGGCCTATTTTTAGCAATACATTATACAGCAGACATTTACTTCGCATTTTCATCAGTAGCACATATTTGCCGCGACGTAAACTATGGATGATTAATTCGAAATATTCATGCCAACGGAGCCTCTTTATTTTTTATTTGCATTTACATACACATTGGACGAGGAATTTACTATGGATCATTTATATTAAAAGAGACCTGAAACATTGGAGTAATCCTCTTTTTTATAACCATAGCAACAGCCTTTATAGGATATGTTCTCCCATGAGGACAAATATCCTTCTGAGGGGCCACAGTGATTACTAATCTACTCTCAGCAATCCCTTATATAGGCAACACCCTCGTACTGTGAATTTGGGGTGGGTTCTCGGTAGATCAAGCCACTCTATCCCGATTCTTTGTATTTCACTTCATTCTACCCTTTCTCATCATAGGAACTACTATTATTCACCTCCTTTTTCTACACGAAACCGGCTCAAACAACCCAACAGGAATTAATTCAAGCCCAG
        125380722    ATGACTCACATCACACGGAAAACACACCCCGTACTGAGTGTTATTAATAATTCACTTATTGATCTACCAACTCCCTCAAACCTATCCTACTTATGAA

649 characters. Where have we seen that before? That's the size of CytB!

Wonderful, we have one gene. One gene ... two problems:
- Which of the two sequences should we use?
- How do we get it into the rest of the matrix? 

We have one gene (out of 12), and we obviously aren't just going to sit here typing in missing-data symbols by hand. 
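
Filling in the missing-data symbols is something a few lines of code can do for us. Below is a minimal sketch, assuming the new sequence is already aligned to the gene's columns and that we know the column where the gene starts. The function name `place_in_row`, the offset, and the row length are all toy values chosen for illustration, not the real coordinates of our matrix.

```python
def place_in_row(gene_seq, start, row_length, missing="?"):
    """Return a row of row_length symbols with gene_seq placed at offset start."""
    end = start + len(gene_seq)
    if start < 0 or end > row_length:
        raise ValueError("gene does not fit inside the row")
    # Pad with the missing-data symbol on both sides of the gene
    return missing * start + gene_seq + missing * (row_length - end)


# Toy numbers: a 6-base gene placed at offset 4 of a 15-column alignment
row = place_in_row("ATGACT", start=4, row_length=15)
print(row)
```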

Are these sequences even different?

In [None]:
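# Compare the underlying sequence strings; True means the two records differ
char_matrix['125380722'].symbols_as_string() != char_matrix['125380720'].symbols_as_string()

Yes! 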
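
Knowing that the sequences differ is one thing; knowing where is more useful. As a rough sketch, the hypothetical helper `differing_sites` below lists the positions at which two equal-length strings disagree. The two strings are the first 40 bases of each record, copied from the NEXUS output above.

```python
def differing_sites(seq_a, seq_b):
    """Return the 0-based positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# First 40 bases of each record, copied from the NEXUS output above
seq_720 = "ATGACTCACATCACACGAAAAACACACCCCCTACTGAGTG"
seq_722 = "ATGACTCACATCACACGGAAAACACACCCCGTACTGAGTG"

sites = differing_sites(seq_720, seq_722)
print(sites)
```

On the full sequences, the same call could take the two `symbols_as_string()` results instead of the hand-copied prefixes.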

We'll add both to our data matrix next time. Today, we'll finish up by writing out our data for later:

In [206]:
char_matrix.write_to_path("test_matrix.nex", schema="nexus")