# Oct 2


Let's remind ourselves of what we did last time: 
- Parsed sequence data
- BLAST searched a sequence against the NCBI database
- Got some potential homologous genes

As seen below:

In [1]:
from Bio.Blast import NCBIXML

# Use a context manager so the results file is closed after parsing
with open("results.xml") as handle:
    parsed = NCBIXML.read(handle)
alignment_set = parsed.alignments
for individual_alignment in alignment_set:
    print(individual_alignment)

gi|125380686|gb|DQ994948.1| Plethodon kentucki voucher RH 66693 cytochrome b (cytb) gene, partial cds; mitochondrial
           Length = 625

gi|51979883|gb|AY728222.1| Plethodon petraeus mitochondrion, complete genome
           Length = 19235

gi|125380683|gb|DQ994946.1| Plethodon jordani voucher JB 201-73-01 cytochrome b (cytb) gene, partial cds; mitochondrial
           Length = 644

gi|125380685|gb|DQ994947.1| Plethodon jordani voucher JB 201-67w-01 cytochrome b-like (cytb) gene, partial sequence; mitochondrial
           Length = 649

gi|125380736|gb|DQ994973.1| Plethodon petraeus voucher RWV S21-B cytochrome b (cytb) gene, partial cds; mitochondrial
           Length = 630

gi|125380641|gb|DQ994924.1| Plethodon chlorobryonis voucher JJW 1807 cytochrome b (cytb) gene, partial cds; mitochondrial
           Length = 649

gi|125380761|gb|DQ994986.1| Plethodon shermani voucher JJW 1741 cytochrome b (cytb) gene, partial cds; mitochondrial
           Length = 649

gi|125380664|gb|DQ994

We also read an article by Bill Pearson about why we should be cautious about doing this. As you can see, the printout above doesn't show E-values for any of these hits. The last thing we'll check is whether our hits have significant E-values:

In [2]:
for individual_alignment in alignment_set:
    print(individual_alignment.hsps[0].expect)

0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0


Is that all of our sequences? How could we check?

In [3]:
len(alignment_set)

50

In [4]:
# Answer follows
good_e = []
for individual_alignment in alignment_set:
    if individual_alignment.hsps[0].expect < 0.000001:
        good_e.append(individual_alignment)
    
print(len(good_e))

50


In [8]:
# Juan's solution
count = 0
for individual_alignment in alignment_set:
    if individual_alignment.hsps[0].expect < 0.000001:
        count = count + 1
print(count)

50
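Both loops above only check the first HSP (`hsps[0]`) of each alignment. BLAST usually lists the best HSP first, so in practice the answer should rarely change, but a minimal sketch that makes the choice explicit is below. The `SimpleNamespace` objects are hypothetical stand-ins for the Biopython alignment records, so the example runs without `results.xml`; the names `E_CUTOFF` and `passes_cutoff` are made up for illustration.

```python
from types import SimpleNamespace

E_CUTOFF = 1e-6  # same threshold as 0.000001 in the loops above


def passes_cutoff(alignment, cutoff=E_CUTOFF):
    """Return True if the smallest HSP E-value is below the cutoff."""
    best = min(hsp.expect for hsp in alignment.hsps)
    return best < cutoff


# Hypothetical stand-ins for Bio.Blast alignment objects (not real hits).
fake_alignments = [
    SimpleNamespace(hsps=[SimpleNamespace(expect=0.0)]),
    SimpleNamespace(hsps=[SimpleNamespace(expect=2.0),
                          SimpleNamespace(expect=1e-30)]),
    SimpleNamespace(hsps=[SimpleNamespace(expect=0.5)]),
]

good = [aln for aln in fake_alignments if passes_cutoff(aln)]
print(len(good), "of", len(fake_alignments), "pass the cutoff")
```

With the real data, the same function could be applied to each item of `alignment_set`.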


So now we have done some quality control on these matches, and I feel OK that they are what they say they are. There were a couple of sequences in this dataset that aren't on the tree, and maybe we want to add them. Let's read in our tree and see whether any sequences are in the BLAST hits but not on our tree.

In [11]:
import pandas as pd
import dendropy

pd.options.display.max_colwidth = 1000000
salamanders = pd.read_json("salamanders.json")
salamanders

newick_tree = salamanders.newick.to_string(index=False)
converted_tree = dendropy.Tree.get_from_string(newick_tree, schema = "newick", preserve_underscores=True)
converted_tree.as_string(schema="newick")

"((((((((('Plethodon_chlorobryonis_ott57750','Plethodon_variolatus_ott162442'),'Plethodon_chattahoochee_ott161611'),((('Plethodon_cylindraceus_ott128947','Plethodon_teyahalee_ott208588'),('Plethodon_aureolus_ott367241','Plethodon_cheoah_ott942947')),'Plethodon_glutinosus_ott534317')),(('Plethodon_montanus_ott128931','Plethodon_metcalfi_ott128944'),('Plethodon_amplus_ott128937','Plethodon_meridianus_ott942954'))),((((((('Plethodon_mississippi_ott128934','Plethodon_kiamichi_ott229842'),(((('Plethodon_savannah_ott128940','Plethodon_ocmulgee_ott162445'),'Plethodon_grobmani_ott145459'),'Plethodon_kisatchie_ott596896'),('Plethodon_albagula_ott162448','Plethodon_sequoyah_ott596892'))),'Plethodon_shermani_ott942942'),'Plethodon_yonahlossee_ott515343'),'Plethodon_jordani_ott515340'),'Plethodon_kentucki_ott942949'),('Plethodon_caddoensis_ott162451',('Plethodon_ouachitae_ott385557','Plethodon_fourchensis_ott732220')))),'Plethodon_petraeus_ott366877'),((((('Plethodon_ventralis_ott128950','Plethodo

## Exercise One: 

With a partner, look at the code below. Some is familiar. Some is not.
- Look at the help function for taxon_namespace. What does this mean?
- Look at the loop. What is this doing?
- Why does a phylogeny need a special names data structure?


In [10]:
new_labels = []
for taxon in converted_tree.taxon_namespace.labels():
    taxon_split = taxon.split("_")
    new_taxon = taxon_split[0] + "_" + taxon_split[1]
    new_labels.append(new_taxon)  


In [None]:
help(converted_tree.taxon_namespace)

Check your answers by running the cell below:

In [12]:
new_labels

['Plethodon_chlorobryonis',
 'Plethodon_variolatus',
 'Plethodon_chattahoochee',
 'Plethodon_cylindraceus',
 'Plethodon_teyahalee',
 'Plethodon_aureolus',
 'Plethodon_cheoah',
 'Plethodon_glutinosus',
 'Plethodon_montanus',
 'Plethodon_metcalfi',
 'Plethodon_amplus',
 'Plethodon_meridianus',
 'Plethodon_mississippi',
 'Plethodon_kiamichi',
 'Plethodon_savannah',
 'Plethodon_ocmulgee',
 'Plethodon_grobmani',
 'Plethodon_kisatchie',
 'Plethodon_albagula',
 'Plethodon_sequoyah',
 'Plethodon_shermani',
 'Plethodon_yonahlossee',
 'Plethodon_jordani',
 'Plethodon_kentucki',
 'Plethodon_caddoensis',
 'Plethodon_ouachitae',
 'Plethodon_fourchensis',
 'Plethodon_petraeus',
 'Plethodon_ventralis',
 'Plethodon_dorsalis',
 'Plethodon_angusticlavius',
 'Plethodon_welleri',
 'Plethodon_wehrlei',
 'Plethodon_punctatus',
 'Plethodon_websteri',
 'Plethodon_hoffmani',
 'Plethodon_richmondi',
 'Plethodon_electromorphus',
 'Plethodon_hubrichti',
 'Plethodon_nettingi',
 'Plethodon_virginia',
 'Plethodon_sh
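For reference, the relabelling loop from Exercise One can be sketched as a small function. This version runs on three labels copied from the Newick string above rather than on `converted_tree.taxon_namespace`, and the name `drop_ott_id` is made up for illustration.

```python
def drop_ott_id(label):
    """Keep genus and species, dropping the trailing 'ott...' identifier."""
    genus, species = label.split("_")[:2]
    return genus + "_" + species


# Labels copied from the Newick string printed earlier.
tree_labels = [
    "Plethodon_chlorobryonis_ott57750",
    "Plethodon_variolatus_ott162442",
    "Plethodon_petraeus_ott366877",
]

short_labels = [drop_ott_id(label) for label in tree_labels]
print(short_labels)
```

Note that this assumes every label has exactly the `Genus_species_id` layout; a subspecies or a hyphenated name would need different handling.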

## Exercise Two:
Next, we will get the list of names out of the BLAST hits. We will loop over our alignments to do this. In your loop, you will need to:
- Use one split command
- Select one index from the list that split returns
- Use another split command

In [18]:
names_list=[]

#Answer follows
for individual_alignment in alignment_set:
    split_aln = str(individual_alignment).split('|')
    name_string = split_aln[4]
    specific_ep = name_string.split(' ')[1:3]
#    print(specific_ep)
    names_list.append('_'.join(specific_ep))
print(names_list)

['Plethodon_kentucki', 'Plethodon_petraeus', 'Plethodon_jordani', 'Plethodon_jordani', 'Plethodon_petraeus', 'Plethodon_chlorobryonis', 'Plethodon_shermani', 'Plethodon_glutinosus', 'Plethodon_variolatus', 'Plethodon_cylindraceus', 'Plethodon_caddoensis', 'Plethodon_caddoensis', 'Plethodon_caddoensis', 'Plethodon_caddoensis', 'Plethodon_caddoensis', 'Plethodon_variolatus', 'Plethodon_shermani', 'Plethodon_sequoyah', 'Plethodon_oconaluftee', 'Plethodon_teyahalee', 'Plethodon_albagula', 'Plethodon_caddoensis', 'Plethodon_caddoensis', 'Plethodon_caddoensis', 'Plethodon_caddoensis', 'Plethodon_caddoensis', 'Plethodon_caddoensis', 'Plethodon_caddoensis', 'Plethodon_metcalfi', 'Plethodon_glutinosus', 'Plethodon_caddoensis', 'Plethodon_caddoensis', 'Plethodon_caddoensis', 'Plethodon_caddoensis', 'Plethodon_caddoensis', 'Plethodon_caddoensis', 'Plethodon_caddoensis', 'Plethodon_caddoensis', 'Plethodon_caddoensis', 'Plethodon_caddoensis', 'Plethodon_caddoensis', 'Plethodon_oconaluftee', 'Pletho
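The split, index, split pattern can be checked on a single hit title before looping over all the alignments. The sketch below uses two titles copied from the BLAST output above; the function name `species_from_title` is hypothetical, and it assumes every title follows the `gi|number|gb|accession| Genus species ...` layout.

```python
def species_from_title(title):
    """Turn a 'gi|...|gb|...| Genus species ...' title into 'Genus_species'."""
    description = title.split("|")[4]   # text after the fourth pipe
    words = description.split()         # no argument: split on any whitespace
    return "_".join(words[:2])          # genus and specific epithet


# Titles copied from the BLAST output printed earlier.
example_titles = [
    "gi|125380686|gb|DQ994948.1| Plethodon kentucki voucher RH 66693 "
    "cytochrome b (cytb) gene, partial cds; mitochondrial",
    "gi|51979883|gb|AY728222.1| Plethodon petraeus mitochondrion, "
    "complete genome",
]

for title in example_titles:
    print(species_from_title(title))
```

Calling `split()` with no argument discards the leading space in the description, which is why this version takes `words[:2]` where the answer above takes `[1:3]`.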

In [19]:
import dendropy
amphib = dendropy.DnaCharacterMatrix.get(
    path="../data/plethodon.phy",
    schema="phylip"
)

for taxon in names_list:
    if taxon in new_labels:
        print("Taxon present: {}".format(taxon))
    else:
        print("No taxon: {}".format(taxon))

Taxon present: Plethodon_kentucki
Taxon present: Plethodon_petraeus
Taxon present: Plethodon_jordani
Taxon present: Plethodon_jordani
Taxon present: Plethodon_petraeus
Taxon present: Plethodon_chlorobryonis
Taxon present: Plethodon_shermani
Taxon present: Plethodon_glutinosus
Taxon present: Plethodon_variolatus
Taxon present: Plethodon_cylindraceus
Taxon present: Plethodon_caddoensis
Taxon present: Plethodon_caddoensis
Taxon present: Plethodon_caddoensis
Taxon present: Plethodon_caddoensis
Taxon present: Plethodon_caddoensis
Taxon present: Plethodon_variolatus
Taxon present: Plethodon_shermani
Taxon present: Plethodon_sequoyah
Taxon present: Plethodon_oconaluftee
Taxon present: Plethodon_teyahalee
Taxon present: Plethodon_albagula
Taxon present: Plethodon_caddoensis
Taxon present: Plethodon_caddoensis
Taxon present: Plethodon_caddoensis
Taxon present: Plethodon_caddoensis
Taxon present: Plethodon_caddoensis
Taxon present: Plethodon_caddoensis
Taxon present: Plethodon_caddoensis
Taxon p
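Because `names_list` contains many repeats, a set comparison gives a shorter report than the loop above. The sketch below uses small hand-typed stand-ins for `names_list` and `new_labels`; `Plethodon_example` is an invented name, included only to show what a missing taxon would look like.

```python
# Hand-typed stand-ins; in the notebook these come from the BLAST hits
# (names_list) and the tree (new_labels).
example_hits = [
    "Plethodon_kentucki",
    "Plethodon_jordani",
    "Plethodon_jordani",
    "Plethodon_example",  # invented name, not a real hit
]
example_tree_labels = [
    "Plethodon_kentucki",
    "Plethodon_jordani",
    "Plethodon_petraeus",
]

hit_taxa = set(example_hits)          # duplicates collapse automatically
tree_taxa = set(example_tree_labels)

shared = sorted(hit_taxa & tree_taxa)             # in both
missing_from_tree = sorted(hit_taxa - tree_taxa)  # hits with no tip on the tree

print("On the tree:", shared)
print("Not on the tree:", missing_from_tree)
```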

All the taxa in the BLAST hits are on the tree. Which seems odd - I don't think I saw the taxon _Plethodon oconaluftee_ when I was preparing the data matrix. 

In [20]:
   
for taxon in names_list:
    if taxon in amphib.taxon_namespace.labels():
        seq = amphib[taxon].symbols_as_string()
        # Slice the cytb columns out of the full sequence
        cytb = seq[4097:5236]
        print(cytb)
    else:
        print("BLAST hits not in the data: {}".format(taxon))

-----------------------CACCCCCTACTAAGTATTATTAACAACTCATTTATTGACCTACCAACTCCCTCAGGCCTATCCTACTTATGAAATTTTGGGTCCTTACTAGGAATTTGCCTAATTACACAAGTTTTAACCGGCCTATTTTTAGCAATACACTACACAACAGATATCCACTTTGCATTTTCATCAGTAGCGCATATCTGCCGCGACGTAAACTACGGATGATTAATTCGAAATATTCATGCCAATGGAGCTTCTTTATTTTTCATTTGTATTTATATACACATTGGACGAGGAATCTACTATGGTTCATTTATACTCAAAGAAACCTGAAACATTGGGGTAGTCCTTTTCTTTATAACCATAGCAACAGCCTTTATAGGATATGTTCTTCCATGAGGACAGATATCCTTCTGGGGGGCCACAGTAATTACTAATTTACTCTCAGCAATCCCATACATAGGCAACACCCTCGTACTATGAATTTGGGGAGGATTCTCAGTGGATCAAGCCACTTTATCCCGATTCTTTGTGTTCCACTTCATCCTACCATTTCTTATCATAGGCACTACTATAATACACCTTCTCTTTCTACATGAAACCGGCTCAAACAATCCAACAGGACTTAACTCGAACCCAG----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

-----------ATACGAAAAACACACCCCCTACTAAATATCATTAATAACTCATTCATTGATCTACCAACCCCATCAAGCCTATCCTACTTATGAAATTTCGGGTCTTTATTAGGAGTTTGCCTAATTACACAAGTCCTAACCGGCTTATTTTTAGCAATACACTATACAGCAGATATCCACTCTGCATTTTCATCAGTAGCACATATTTGCCGTGACGTAAACCACGGATGACTAATCCGAAATATTCATGCCAACGGAGCCTCTTTATTTTTCATTTGTATTTATATACACATTGGACGAGGAATCTACTATGGATCATTTATACTCAAAGAGACTTGAAACATTGGAGTAATCCTTTTTTTTATAACTATGGCAACAGCTTTCATAGGATATGTTCTCCCATGAGGACAAATATCCTTCTGAGGAGCCACAGTAATTACTAACTTACTCTCAGCAATCCCATACATAGGTAACACTCTTGTATTGTGAATCTGAGGAGGATTCTCCGTAGACCAGGCCACTTTATCCCGATTCTTTGTATTTCACTTTATCTTACCCTTCCTCATTATGGGAACTACAATTATACACCTCCTCTTTCTACATGAAACCGGCTCAAACAACCCAACAGGACTTAACTCAAACCCAG----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Somehow this taxon is on the tree, but not in the data! You know what we're going to do now: we're going to go get the data. To do this, we need something called the GI ID (GI stands for _GenInfo Identifier_). NCBI assigns one of these unique numbers to each sequence record in its databases, so we can use it to ask for exactly the record we want.

How will we modify the loop below to get the GI IDs for _Plethodon oconaluftee_?

In [23]:
gi_ids = []
for individual_alignment in alignment_set:
    # First, we get the names. This does not have to be modified.
    split_aln = str(individual_alignment).split('|')
    name_string = split_aln[4]
    specific_ep = name_string.split(' ')[1:3]
    # Then, we get the ID.
    if 'oconaluftee' in specific_ep:
        gi_ids.append(split_aln[1])
gi_ids

['125380720', '125380722']
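If we wanted the IDs for every species rather than just one, the same parsing could fill a dictionary. A sketch, using three titles copied from the BLAST output above instead of `alignment_set`; the name `ids_by_species` is made up for illustration.

```python
from collections import defaultdict

# Titles copied from the BLAST output printed earlier.
sample_titles = [
    "gi|125380686|gb|DQ994948.1| Plethodon kentucki voucher RH 66693 "
    "cytochrome b (cytb) gene, partial cds; mitochondrial",
    "gi|125380683|gb|DQ994946.1| Plethodon jordani voucher JB 201-73-01 "
    "cytochrome b (cytb) gene, partial cds; mitochondrial",
    "gi|125380685|gb|DQ994947.1| Plethodon jordani voucher JB 201-67w-01 "
    "cytochrome b-like (cytb) gene, partial sequence; mitochondrial",
]

ids_by_species = defaultdict(list)
for title in sample_titles:
    fields = title.split("|")
    gi_number = fields[1]                      # second field is the GI ID
    species = "_".join(fields[4].split()[:2])  # 'Genus_species'
    ids_by_species[species].append(gi_number)

print(dict(ids_by_species))
```

Looking up `ids_by_species["Plethodon_jordani"]` then returns both IDs for that species at once.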

In [22]:
# I wanted to check which element of split_aln to get, so I printed split_aln to the screen
for individual_alignment in alignment_set:
    # First, we split the title. This does not have to be modified.
    split_aln = str(individual_alignment).split('|')
    print(split_aln)

['gi', '125380686', 'gb', 'DQ994948.1', ' Plethodon kentucki voucher RH 66693 cytochrome b (cytb) gene, partial cds; mitochondrial\n           Length = 625\n']
['gi', '51979883', 'gb', 'AY728222.1', ' Plethodon petraeus mitochondrion, complete genome\n           Length = 19235\n']
['gi', '125380683', 'gb', 'DQ994946.1', ' Plethodon jordani voucher JB 201-73-01 cytochrome b (cytb) gene, partial cds; mitochondrial\n           Length = 644\n']
['gi', '125380685', 'gb', 'DQ994947.1', ' Plethodon jordani voucher JB 201-67w-01 cytochrome b-like (cytb) gene, partial sequence; mitochondrial\n           Length = 649\n']
['gi', '125380736', 'gb', 'DQ994973.1', ' Plethodon petraeus voucher RWV S21-B cytochrome b (cytb) gene, partial cds; mitochondrial\n           Length = 630\n']
['gi', '125380641', 'gb', 'DQ994924.1', ' Plethodon chlorobryonis voucher JJW 1807 cytochrome b (cytb) gene, partial cds; mitochondrial\n           Length = 649\n']
['gi', '125380761', 'gb', 'DQ994986.1', ' Plethodon she

In [None]:
#Answer

gi_ids = []
for individual_alignment in alignment_set:
    split_aln = str(individual_alignment).split('|')
    name_string = split_aln[4]
    specific_ep = name_string.split(' ')[1:3]
    if 'oconaluftee' in specific_ep:
        gi_ids.append(split_aln[1])
gi_ids

Let's get some sequences!

In [24]:
from dendropy.interop import genbank

gb_dna = genbank.GenBankDna(ids=gi_ids)
char_matrix = gb_dna.generate_char_matrix()

In [25]:
print(char_matrix.as_string("nexus"))

#NEXUS

BEGIN TAXA;
    DIMENSIONS NTAX=2;
    TAXLABELS
        125380720
        125380722
  ;
END;

BEGIN CHARACTERS;
    DIMENSIONS NCHAR=649;
    FORMAT DATATYPE=DNA GAP=- MISSING=? MATCHCHAR=.;
    MATRIX
        125380720    ATGACTCACATCACACGAAAAACACACCCCCTACTGAGTGTTATTAATAATTCACTTATTGATCTACCAACTCCCTCAAACCTATCCTACTTATGAAATTTTGGATCCTTACTAGGAGTCTGCCTAATCACACAAATTTTAACCGGCCTATTTTTAGCAATACATTATACAGCAGACATTTACTTCGCATTTTCATCAGTAGCACATATTTGCCGCGACGTAAACTATGGATGATTAATTCGAAATATTCATGCCAACGGAGCCTCTTTATTTTTTATTTGCATTTACATACACATTGGACGAGGAATTTACTATGGATCATTTATATTAAAAGAGACCTGAAACATTGGAGTAATCCTCTTTTTTATAACCATAGCAACAGCCTTTATAGGATATGTTCTCCCATGAGGACAAATATCCTTCTGAGGGGCCACAGTGATTACTAATCTACTCTCAGCAATCCCTTATATAGGCAACACCCTCGTACTGTGAATTTGGGGTGGGTTCTCGGTAGATCAAGCCACTCTATCCCGATTCTTTGTATTTCACTTCATTCTACCCTTTCTCATCATAGGAACTACTATTATTCACCTCCTTTTTCTACACGAAACCGGCTCAAACAACCCAACAGGAATTAATTCAAGCCCAG
        125380722    ATGACTCACATCACACGGAAAACACACCCCGTACTGAGTGTTATTAATAATTCACTTATTGATCTACCAACTCCCTCAAACCTATCCTACTTATGAA

649 characters. Where have we seen that before? That's the length BLAST reported for several of our partial cytb hits!

Wonderful, we have one gene. One gene ... two problems:
- Which should we use?
- How do we get it into the rest of the matrix?

We have sequence for one gene (out of 12), and we obviously aren't just going to sit here typing in missing data symbols.

Are these sequences even different?

In [None]:
char_matrix['125380722'] == char_matrix['125380720']

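Yes! They differ, as you can see by comparing the first few dozen bases of each record in the NEXUS output above.

Next time, we'll add both to our data matrix. Today, we'll finish up by writing out our data for later: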
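To see where the two sequences differ, and not only that they differ, we can compare them position by position. The sketch below uses the first 40 bases of each record as printed in the NEXUS output above; `mismatch_positions` is a hypothetical helper, and the positions it reports are 0-based.

```python
def mismatch_positions(seq_a, seq_b):
    """Return 0-based positions where two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# First 40 bases of each record, copied from the NEXUS output above.
seq_720 = "ATGACTCACATCACACGAAAAACACACCCCCTACTGAGTG"
seq_722 = "ATGACTCACATCACACGGAAAACACACCCCGTACTGAGTG"

differences = mismatch_positions(seq_720, seq_722)
print(len(differences), "differences at positions", differences)
```

With the full matrix, the two strings could instead come from `char_matrix[label].symbols_as_string()`, the same call used on `amphib` earlier.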

In [26]:
char_matrix.write_to_path("test_matrix.nex", schema="nexus")
