## BIOS 470/570 Lecture 15

## Last time we covered:
* ### Sequence alignment with Smith-Waterman

## Today we will cover:
* ### NCBI databases [Gene and nucleotide]
* ### The BLAST algorithm and web interface
* ### Accessing NCBI databases programmatically
* ### Using the BLAST alignment tool programmatically

In [155]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from Bio import Entrez
from Bio.Blast import NCBIWWW, NCBIXML

### The esearch function can be used to search the NCBI databases. Here we search for the gene Smad4 in the gene database:

In [92]:
Entrez.email = "aw21@rice.edu"  # contact address that NCBI asks for with every request
hh = Entrez.esearch(db="gene", term="Smad4 [gene]")  # search the Gene database for entries whose gene name is Smad4

### The read function parses the result. Count gives the number of results. IdList gives the database IDs (only the first RetMax of them) that you can use to get the info from those entries:

In [93]:
record = Entrez.read(hh)
record

{'Count': '456', 'RetMax': '20', 'RetStart': '0', 'IdList': ['4089', '17128', '50554', '43725', '559111', '397142', '701803', '540248', '780764', '100038088', '476196', '443171', '122902875', '106037328', '468537', '100861019', '794669', '102390834', '778905', '101675099'], 'TranslationSet': [], 'TranslationStack': [{'Term': 'Smad4[gene]', 'Field': 'gene', 'Count': '456', 'Explode': 'N'}, 'GROUP'], 'QueryTranslation': 'Smad4[gene]'}
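The output above reports 456 matches but only 20 IDs, because RetMax caps how many IDs come back per request. The sketch below works through that arithmetic using values copied from the output. It assumes one would page through the rest by passing `retstart` and `retmax` to `Entrez.esearch`; the variable names are illustrative.

```python
import math

# Values copied from the esearch output above; Entrez returns them as strings.
record = {"Count": "456", "RetMax": "20", "RetStart": "0"}

total = int(record["Count"])        # total number of matching entries
page_size = int(record["RetMax"])   # number of IDs returned per request
n_pages = math.ceil(total / page_size)

# Offsets one could pass as retstart to successive esearch calls
offsets = list(range(0, total, page_size))
print(n_pages, offsets[:3], offsets[-1])
```

Each offset would correspond to one further request, for example `Entrez.esearch(db="gene", term="Smad4[gene]", retstart=20, retmax=20)`.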

### Next, look up the entries in the ID list. A summary can be retrieved with the esummary function:

In [95]:
summary = Entrez.esummary(db="gene",  id=record["IdList"][0])

In [97]:
summary_read = Entrez.read(summary)

### Inside this is a dictionary with a lot of useful information:

In [120]:
summary_read['DocumentSummarySet']['DocumentSummary'][0]

DictElement({'Name': 'SMAD4', 'Description': 'SMAD family member 4', 'Status': '0', 'CurrentID': '0', 'Chromosome': '18', 'GeneticSource': 'genomic', 'MapLocation': '18q21.2', 'OtherAliases': 'DPC4, JIP, MADH4, MYHRS', 'OtherDesignations': 'mothers against decapentaplegic homolog 4|MAD homolog 4|SMAD, mothers against DPP homolog 4|deleted in pancreatic carcinoma locus 4|deletion target in pancreatic carcinoma 4|mothers against decapentaplegic, Drosophila, homolog of, 4', 'NomenclatureSymbol': 'SMAD4', 'NomenclatureName': 'SMAD family member 4', 'NomenclatureStatus': 'Official', 'Mim': ['600993'], 'GenomicInfo': [{'ChrLoc': '18', 'ChrAccVer': 'NC_000018.10', 'ChrStart': '51030212', 'ChrStop': '51085041', 'ExonCount': '14'}], 'GeneWeight': '62661', 'Summary': 'This gene encodes a member of the Smad family of signal transduction proteins. Smad proteins are phosphorylated and activated by transmembrane serine-threonine receptor kinases in response to transforming growth factor (TGF)-beta s
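Numeric fields in the summary also arrive as strings, so they need converting before any arithmetic. As a minimal sketch, the block below uses a few fields copied from the output above into a plain dictionary (a stand-in for the real `DictElement`) and computes the distance between the reported start and stop coordinates. Whether the coordinates are 0-based or 1-based is not checked here.

```python
# A few fields copied from the summary output above (stand-in for the DictElement).
doc = {
    "Name": "SMAD4",
    "Chromosome": "18",
    "MapLocation": "18q21.2",
    "GenomicInfo": [
        {
            "ChrAccVer": "NC_000018.10",
            "ChrStart": "51030212",
            "ChrStop": "51085041",
            "ExonCount": "14",
        }
    ],
}

loc = doc["GenomicInfo"][0]
start, stop = int(loc["ChrStart"]), int(loc["ChrStop"])
span = abs(stop - start)  # abs() in case start and stop are listed in descending order
exon_count = int(loc["ExonCount"])
print(doc["Name"], doc["MapLocation"], span, exon_count)
```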

### Even more information can be retrieved with efetch, which returns the full record from the database:

In [104]:
h_fetch = Entrez.efetch(db="gene", id=record["IdList"][0], rettype="gb", retmode="xml")  # fetch the full record as XML so Entrez.read can parse it

In [105]:
fetch_read = Entrez.read(h_fetch)

### Let's get the accession numbers of the gene products:

In [133]:
# This contains information about all the gene products
products = fetch_read[0]["Entrezgene_locus"][0]["Gene-commentary_products"]
# Collect the accession number of each product
transcript_list = [product["Gene-commentary_accession"] for product in products]
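Indexing each product directly raises a `KeyError` if an entry has no accession. The sketch below shows one defensive variant on a made-up `products` list (a stand-in for the real parsed record, which is much larger). It also keeps only accessions starting with `NM_`, the RefSeq prefix for curated mRNAs.

```python
# Hypothetical stand-in for fetch_read[0]["Entrezgene_locus"][0]["Gene-commentary_products"]
products = [
    {"Gene-commentary_accession": "NM_001407041"},
    {"Gene-commentary_accession": "XM_000000000"},  # made-up accession for illustration
    {"Gene-commentary_label": "entry without an accession"},  # made-up entry
]

# Skip entries that lack an accession instead of raising KeyError
accessions = [
    p["Gene-commentary_accession"]
    for p in products
    if "Gene-commentary_accession" in p
]

# Keep only curated RefSeq mRNAs (NM_ prefix); XM_ marks predicted models
curated = [acc for acc in accessions if acc.startswith("NM_")]
print(accessions, curated)
```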

### Let's now get the information about one of the products from the nucleotide database, first the summary:

In [144]:
rna_summary = Entrez.esummary(db = "nucleotide",id = transcript_list[4])

In [145]:
rna_summary_read = Entrez.read(rna_summary)

In [146]:
rna_summary_read

[{'Item': [], 'Id': '2244986580', 'Caption': 'NM_001407041', 'Title': 'Homo sapiens SMAD family member 4 (SMAD4), transcript variant 2, mRNA', 'Extra': 'gi|2244986580|ref|NM_001407041.1|[2244986580]', 'Gi': IntegerElement(2244986580, attributes={}), 'CreateDate': '2022/05/24', 'UpdateDate': '2023/09/24', 'Flags': IntegerElement(512, attributes={}), 'TaxId': IntegerElement(9606, attributes={}), 'Length': IntegerElement(8357, attributes={}), 'Status': 'live', 'ReplacedBy': '', 'Comment': '  ', 'AccessionVersion': 'NM_001407041.1'}]

### Now the full entry, which includes the sequence and annotations:

In [147]:
rna_fetch = Entrez.efetch(db="nucleotide",id = transcript_list[4],rettype = "gb",retmode = "xml")

In [148]:
rna_fetch_read = Entrez.read(rna_fetch)

In [150]:
rna_fetch_read[0]["GBSeq_sequence"]

'atgctcagtggcttctcgacaagttggcagcaacaacacggccctggtcgtcgtcgccgctgcggatcaaaattgcttcagaaattggagacatatttgatttaaaaggaaaaacttgaacaaatggacaatatgtctattacgaatacaccaacaagtaatgatgcctgtctgagcattgtgcatagtttgatgtgccatagacaaggtggagagagtgaaacatttgcaaaaagagcaattgaaagtttggtaaagaagctgaaggagaaaaaagatgaattggattctttaataacagctataactacaaatggagctcatcctagtaaatgtgttaccatacagagaacattggatgggaggcttcaggtggctggtcggaaaggatttcctcatgtgatctatgcccgtctctggaggtggcctgatcttcacaaaaatgaactaaaacatgttaaatattgtcagtatgcgtttgacttaaaatgtgatagtgtctgtgtgaatccatatcactacgaacgagttgtatcacctggaattgatctctcaggattaacactgcagagtaatgctccatcaagtatgatggtgaaggatgaatatgtgcatgactttgagggacagccatcgttgtccactgaaggacattcaattcaaaccatccagcatccaccaagtaatcgtgcatcgacagagacatacagcaccccagctctgttagccccatctgagtctaatgctaccagcactgccaactttcccaacattcctgtggcttccacaagtcagcctgccagtatactggggggcagccatagtgaaggactgttgcagatagcatcagggcctcagccaggacagcagcagaatggatttactggtcagccagctacttaccatcataacagcactaccacctggactggaagtaggactgcaccatacacacctaatttgcctcaccaccaaaacggccatcttcagcaccacccg

### Let's see what we get when we BLAST the beginning of this sequence. This runs a nucleotide BLAST (blastn) against the nucleotide database (nt) with the first 150 base pairs of the sequence:

In [160]:
h_blast = NCBIWWW.qblast("blastn","nt",rna_fetch_read[0]["GBSeq_sequence"][:150])

### Note that you can set parameters of the BLAST algorithm, such as word_size, in the query.
### The parameter hitlist_size determines the number of hits to return (default is 50)
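One way to pass these parameters is as keyword arguments to `NCBIWWW.qblast`. The sketch below collects them in a dictionary and wraps the call in a small helper. The helper name `run_blast` and the specific values are illustrative assumptions, not settings used in this lecture. The search itself is left commented out because it needs network access and can take several minutes.

```python
# Illustrative settings (assumed values, not the ones used above)
blast_options = {
    "word_size": 11,     # length of the initial exact-match seed
    "hitlist_size": 10,  # return at most 10 hits instead of the default 50
    "expect": 1e-5,      # discard hits with an e-value above this cutoff
}

def run_blast(sequence, **options):
    """Submit a blastn search against nt with the given qblast options."""
    # Imported here so the settings can be inspected without Biopython installed
    from Bio.Blast import NCBIWWW
    return NCBIWWW.qblast("blastn", "nt", sequence, **options)

# Uncomment to submit the search from the notebook:
# h_blast = run_blast(rna_fetch_read[0]["GBSeq_sequence"][:150], **blast_options)
```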

### You can only read the blast result once, so it is recommended to save it and then read it back from the saved file to prevent losing it:

In [161]:
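# The with block closes the file, so everything is flushed to disk before it is read back
with open("blast.xml", "w") as save_file:
    n_chars = save_file.write(h_blast.read())
n_chars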

89921
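The read-once behavior is a property of file-like handles in general, not only of BLAST results. A minimal sketch with an in-memory handle (`io.StringIO` stands in for the handle returned by qblast, and the XML text is a placeholder):

```python
import io

# In-memory stand-in for the handle returned by qblast
handle = io.StringIO("<BlastOutput>placeholder</BlastOutput>")

first = handle.read()   # returns the whole text and moves the position to the end
second = handle.read()  # nothing is left to read, so this is an empty string

print(len(first), repr(second))
```

Saving the first read to a file means the result can be parsed as many times as needed without resubmitting the search.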

### Now you can use the NCBIXML module to read this file and get the results of the BLAST search.

In [175]:
blast_result = open("blast.xml",'r')

In [176]:
blast_record = NCBIXML.read(blast_result)

In [185]:
# Print every HSP (high-scoring segment pair) of the first five hits
for alignment in blast_record.alignments[:5]:
    for hsp in alignment.hsps:
        print('****Alignment****')
        print('sequence:', alignment.title)
        print('length:', alignment.length)
        print('e value:', hsp.expect)
        print(hsp.query[0:75] + '...')
        print(hsp.match[0:75] + '...')
        print(hsp.sbjct[0:75] + '...')

****Alignment****
sequence: gi|2244986580|ref|NM_001407041.1| Homo sapiens SMAD family member 4 (SMAD4), transcript variant 2, mRNA
length: 8357
e value: 2.3099e-68
ATGCTCAGTGGCTTCTCGACAAGTTGGCAGCAACAACACGGCCCTGGTCGTCGTCGCCGCTGCGGATCAAAATTG...
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||...
ATGCTCAGTGGCTTCTCGACAAGTTGGCAGCAACAACACGGCCCTGGTCGTCGTCGCCGCTGCGGATCAAAATTG...
****Alignment****
sequence: gi|1743208121|ref|XM_030810408.1| PREDICTED: Nomascus leucogenys SMAD family member 4 (SMAD4), transcript variant X5, mRNA
length: 3572
e value: 3.66677e-34
GGATCAAAATTGCTTCAGAAATTGGAGACATATTTGATTTAAAAGGAAAAACTTGAACAAATGGACAATATGTCT...
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||...
GGATCAAAATTGCTTCAGAAATTGGAGACATATTTGATTTAAAAGGAAAAACTTGAACAAATGGACAATATGTCT...
****Alignment****
sequence: gi|1743208121|ref|XM_030810408.1| PREDICTED: Nomascus leucogenys SMAD family member 4 (SMAD4), transcript variant X5, mRNA
length: 3572
e value: 1.6
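
The same hit can contribute several HSPs with very different e-values: above, XM_030810408.1 appears once at about 3.7e-34 and once at 1.6. A common follow-up is to keep only HSPs below an e-value cutoff. The sketch below applies that filter to the three values printed above. The cutoff `E_VALUE_THRESH` is an assumed, illustrative value.

```python
# (accession, e-value) pairs copied from the output above
hsps = [
    ("NM_001407041.1", 2.3099e-68),
    ("XM_030810408.1", 3.66677e-34),
    ("XM_030810408.1", 1.6),
]

E_VALUE_THRESH = 1e-5  # assumed cutoff; smaller e-values are less likely to arise by chance

significant = [(acc, e) for acc, e in hsps if e < E_VALUE_THRESH]
for acc, e in significant:
    print(acc, e)
```

With a parsed record, the same test would go inside the loop over `alignment.hsps`, comparing `hsp.expect` to the cutoff.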

In [194]:
# Print every HSP (high-scoring segment pair) of the last five hits
for alignment in blast_record.alignments[-5:]:
    for hsp in alignment.hsps:
        print('****Alignment****')
        print('sequence:', alignment.title)
        print('length:', alignment.length)
        print('e value:', hsp.expect)
        print(hsp.query[0:75] + '...')
        print(hsp.match[0:75] + '...')
        print(hsp.sbjct[0:75] + '...')

****Alignment****
sequence: gi|2161870587|ref|XM_045377957.1| PREDICTED: Macaca fascicularis SMAD family member 4 (SMAD4), transcript variant X2, mRNA
length: 2093
e value: 5.44197e-32
GATCAAAATTGCTTCAGAAATTGGAGACATATTTGATTTAAAAGGAAAAACTTGAACAAATGGACAATATGTCTA...
||||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||...
GATCAAAATTGCTCCAGAAATTGGAGACATATTTGATTTAAAAGGAAAAACTTGAACAAATGGACAATATGTCTA...
****Alignment****
sequence: gi|2161870587|ref|XM_045377957.1| PREDICTED: Macaca fascicularis SMAD family member 4 (SMAD4), transcript variant X2, mRNA
length: 2093
e value: 1.28096e-14
AGTGGCTTCTCGACAAGTTGGCAGCAACAACACGGCCCTGGTCGTCGTCGCCGCTGCGG...
||||  ||||||||||||||||||||||||||||||||||||||||||||| |||||||...
AGTGTGTTCTCGACAAGTTGGCAGCAACAACACGGCCCTGGTCGTCGTCGCTGCTGCGG...
****Alignment****
sequence: gi|2161870586|ref|XM_005586675.3| PREDICTED: Macaca fascicularis SMAD family member 4 (SMAD4), transcript variant X1, mRNA
length: 8684
e value: 5.44197e-32
GATCAAAATTGCTTCAGAA
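
In the match line, `|` marks identical bases and a space marks a mismatch. As a minimal sketch, the block below recomputes that comparison for the second Macaca fascicularis HSP printed above, using the query and subject strings copied from the output, and derives the fraction of identical positions.

```python
# Query and subject strings copied from the second Macaca fascicularis HSP above
query = "AGTGGCTTCTCGACAAGTTGGCAGCAACAACACGGCCCTGGTCGTCGTCGCCGCTGCGG"
sbjct = "AGTGTGTTCTCGACAAGTTGGCAGCAACAACACGGCCCTGGTCGTCGTCGCTGCTGCGG"

# Rebuild the match line: '|' for identical bases, ' ' for mismatches
match_line = "".join("|" if q == s else " " for q, s in zip(query, sbjct))
mismatches = match_line.count(" ")
identity = match_line.count("|") / len(match_line)

print(match_line)
print(mismatches, round(identity, 3))
```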