## BioPython worksheet solutions: SeqIO module

Load the package and the required modules

In [1]:
from Bio import Seq
from Bio import SeqIO

Search the NCBI website for the GenBank record of the human BRCA1 gene, at this [link](http://www.ncbi.nlm.nih.gov/nuccore/NC_000017.10?report=genbank&from=41184133&to=41289677&strand=true). Save the file in “genbank” format (the cells below assume it was saved as `BRCA.gb`).



1.a) Using BioPython, load the file above for analysis.

In [2]:
record2 = SeqIO.read("BRCA.gb", "genbank")
record2

SeqRecord(seq=Seq('GATCCGCTCACCTTGACCTCCCAAAGTGCTGGGATTAAAGGCATGAGCCACTGT...GCC'), id='NC_000017.10', name='NC_000017', description='Homo sapiens chromosome 17, GRCh37.p13 Primary Assembly', dbxrefs=['BioProject:PRJNA168', 'Assembly:GCF_000001405.25'])
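`SeqIO.read` expects its input to hold exactly one record, which is the case for this download. The sketch below illustrates that behaviour on made-up in-memory FASTA text rather than on the BRCA1 file: `SeqIO.read` raises `ValueError` when several records are present, while `SeqIO.parse` iterates over them. The record names `toy1` and `toy2` are illustrative only.

```python
from io import StringIO

from Bio import SeqIO

# Made-up FASTA text, used only to illustrate read() versus parse().
one_record = ">toy1 first toy record\nACGTACGT\n"
two_records = one_record + ">toy2 second toy record\nTTGGCCAA\n"

# read() returns a single SeqRecord when the input holds exactly one record.
single = SeqIO.read(StringIO(one_record), "fasta")
print(single.id, len(single.seq))

# With more than one record, read() raises ValueError ...
try:
    SeqIO.read(StringIO(two_records), "fasta")
    read_failed = False
except ValueError:
    read_failed = True

# ... whereas parse() yields the records one by one.
ids = [rec.id for rec in SeqIO.parse(StringIO(two_records), "fasta")]
print(read_failed, ids)
```

For a file with several records, `SeqIO.parse` would be the call to use instead of `SeqIO.read`.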

1.b) Check the length of the corresponding DNA sequence.

In [3]:
len(record2.seq)

105545

1.c) Check the ID, the description, and the name of this record.

In [4]:
record2.id

'NC_000017.10'

In [5]:
record2.description

'Homo sapiens chromosome 17, GRCh37.p13 Primary Assembly'

In [6]:
record2.name

'NC_000017'

2.a) Check the list of annotations for this record.

In [7]:
record2.annotations

{'molecule_type': 'DNA',
 'topology': 'linear',
 'data_file_division': 'CON',
 'date': '13-AUG-2013',
 'accessions': ['NC_000017', 'REGION:', 'complement(41184133..41289677)'],
 'sequence_version': 10,
 'keywords': ['RefSeq'],
 'source': 'Homo sapiens (human)',
 'organism': 'Homo sapiens',
 'taxonomy': ['Eukaryota',
  'Metazoa',
  'Chordata',
  'Craniata',
  'Vertebrata',
  'Euteleostomi',
  'Mammalia',
  'Eutheria',
  'Euarchontoglires',
  'Primates',
  'Haplorrhini',
  'Catarrhini',
  'Hominidae',
  'Homo'],
 'references': [Reference(title='DNA sequence of human chromosome 17 and analysis of rearrangement in the human lineage', ...),
  Reference(title='Finishing the euchromatic sequence of the human genome', ...),
  Reference(title='Initial sequencing and analysis of the human genome', ...)],
 'structured_comment': OrderedDict([('Genome-Annotation-Data',
               OrderedDict([('Annotation Provider', 'NCBI'),
                            ('Annotation Status', 'Full annotation'),


2.b) Using the record's annotations, determine the organism the sequence belongs to and its complete taxonomic classification.

In [8]:
list(record2.annotations.keys())

['molecule_type',
 'topology',
 'data_file_division',
 'date',
 'accessions',
 'sequence_version',
 'keywords',
 'source',
 'organism',
 'taxonomy',
 'references',
 'comment',
 'structured_comment']

In [9]:
record2.annotations["source"]

'Homo sapiens (human)'

In [10]:
record2.annotations["organism"]

'Homo sapiens'

In [11]:
record2.annotations["taxonomy"]

['Eukaryota',
 'Metazoa',
 'Chordata',
 'Craniata',
 'Vertebrata',
 'Euteleostomi',
 'Mammalia',
 'Eutheria',
 'Euarchontoglires',
 'Primates',
 'Haplorrhini',
 'Catarrhini',
 'Hominidae',
 'Homo']
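Since `annotations` is a plain dictionary, the organism and lineage can also be read with `.get`, which avoids a `KeyError` on records that lack these keys. The sketch below uses a toy `SeqRecord` with a shortened copy of the taxonomy shown above; the record itself and the fallback strings are assumptions made for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy record carrying a shortened copy of the annotations shown above.
toy = SeqRecord(
    Seq("ACGT"),
    id="toy1",
    annotations={
        "organism": "Homo sapiens",
        "taxonomy": ["Eukaryota", "Metazoa", "Chordata", "Mammalia",
                     "Primates", "Hominidae", "Homo"],
    },
)

# .get() returns the fallback instead of raising KeyError when a key is absent.
organism = toy.annotations.get("organism", "unknown organism")
lineage = " > ".join(toy.annotations.get("taxonomy", []))
print(organism)
print(lineage)
```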

3. Check the record's list of features, in particular their type and location.

In [12]:
type(record2.features)

list

In [13]:
print("There are", len(record2.features), "features")

There are 17 features
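Beyond the total, it can help to see how many features of each type a record has. A minimal sketch with `collections.Counter` follows; it runs on a toy record whose coordinates are made up, so the counts are not those of the BRCA1 file.

```python
from collections import Counter

from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Toy record with made-up coordinates; only the feature types matter here.
toy = SeqRecord(Seq("ACGT" * 10), id="toy1")
toy.features = [
    SeqFeature(FeatureLocation(0, 40, strand=1), type="source"),
    SeqFeature(FeatureLocation(0, 30, strand=1), type="gene"),
    SeqFeature(FeatureLocation(0, 30, strand=1), type="mRNA"),
    SeqFeature(FeatureLocation(3, 27, strand=1), type="CDS"),
    SeqFeature(FeatureLocation(32, 40, strand=-1), type="gene"),
]

# Tally the features by their type string.
type_counts = Counter(feature.type for feature in toy.features)
for feature_type, count in type_counts.most_common():
    print(feature_type, count)
```

The same `Counter` expression applied to `record2.features` would summarise the real record.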


In [14]:
record2.features

[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(105545), strand=1), type='source'),
 SeqFeature(FeatureLocation(BeforePosition(0), ExactPosition(12078), strand=-1), type='gene'),
 SeqFeature(CompoundLocation([FeatureLocation(ExactPosition(11890), ExactPosition(12078), strand=-1), FeatureLocation(BeforePosition(6390), ExactPosition(6453), strand=-1)], 'join'), type='misc_RNA', location_operator='join'),
 SeqFeature(FeatureLocation(ExactPosition(12177), AfterPosition(93366), strand=1), type='gene'),
 SeqFeature(CompoundLocation([FeatureLocation(ExactPosition(12177), ExactPosition(12390), strand=1), FeatureLocation(ExactPosition(13545), ExactPosition(13644), strand=1), FeatureLocation(ExactPosition(21881), ExactPosition(21935), strand=1), FeatureLocation(ExactPosition(31127), ExactPosition(31205), strand=1), FeatureLocation(ExactPosition(32704), ExactPosition(32793), strand=1), FeatureLocation(ExactPosition(33399), ExactPosition(33539), strand=1), FeatureLocation(ExactPosition

In [15]:
for f in record2.features: 
    print(f.type, f.location)

source [0:105545](+)
gene [<0:12078](-)
misc_RNA join{[11890:12078](-), [<6390:6453](-)}
gene [12177:>93366](+)
mRNA join{[12177:12390](+), [13545:13644](+), [21881:21935](+), [31127:31205](+), [32704:32793](+), [33399:33539](+), [37780:37886](+), [40371:40417](+), [41738:41815](+), [42800:46226](+), [46628:46717](+), [55085:55257](+), [61046:61173](+), [63139:63330](+), [66422:66733](+), [69965:70053](+), [73709:73787](+), [74287:74328](+), [80525:80609](+), [86543:86598](+), [88466:88540](+), [89957:90018](+), [91858:93366](+)}
mRNA join{[12177:12390](+), [13545:13644](+), [21881:21935](+), [31127:31205](+), [32704:32793](+), [33399:33539](+), [37780:37886](+), [40371:40417](+), [41738:41815](+), [42800:46226](+), [46628:46717](+), [55085:55257](+), [58261:58327](+), [61049:61173](+), [63139:63330](+), [66422:66733](+), [69965:70053](+), [73709:73787](+), [74287:74328](+), [80525:80609](+), [86543:86598](+), [88466:88540](+), [89957:90018](+), [91858:93366](+)}
mRNA join{[12209:12384
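The locations printed above use Python-style coordinates (0-based start, exclusive end), whereas the GenBank flat file uses 1-based inclusive coordinates. As a small sketch of the conversion, the example below rebuilds the location of the BRCA1 gene feature, `[12177:>93366](+)`, by hand; the variable names are illustrative.

```python
from Bio.SeqFeature import AfterPosition, FeatureLocation

# Same coordinates as the BRCA1 gene feature printed above: [12177:>93366](+)
loc = FeatureLocation(12177, AfterPosition(93366), strand=1)

start0 = int(loc.start)      # 0-based, inclusive
end0 = int(loc.end)          # 0-based, exclusive
genbank_start = start0 + 1   # 1-based, inclusive, as written in the flat file
genbank_end = end0           # 1-based, inclusive

# len() of a location is the number of bases it spans.
print(genbank_start, genbank_end, len(loc))
```

The `>` in the printed location is a fuzzy boundary (`AfterPosition`), meaning the feature extends beyond the end of the downloaded region.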

4.a) From the list of features, identify the coding sequences (CDS) associated with this record and store their indices in a list.

In [16]:
feat_cds = []
for i, feature in enumerate(record2.features):
    if feature.type == "CDS":
        feat_cds.append(i)  # keep the index, not the feature itself
feat_cds

[10, 11, 12, 13, 14]

4.b) Using the associated “qualifiers”, determine which protein is encoded and what its biological significance is.

In [17]:
record2.features[8].qualifiers  # example: qualifiers of a transcript feature (index 8 is not a CDS)

OrderedDict([('gene', ['BRCA1']),
             ('gene_synonym',
              ['BRCAI; BRCC1; BROVCA1; IRIS; PNCA4; PPP1R53; PSCP; RNF53']),
             ('product',
              ['breast cancer 1, early onset, transcript variant 6']),
             ('note',
              ['Derived by automated computational analysis using gene prediction method: BestRefSeq.']),
             ('transcript_id', ['NR_027676.1']),
             ('db_xref', ['GeneID:672', 'HGNC:HGNC:1100', 'MIM:113705'])])

In [18]:
for k in feat_cds:
    print(record2.features[k].qualifiers["product"])
    print(record2.features[k].qualifiers["translation"])

['breast cancer type 1 susceptibility protein isoform 4']
['MDLSALRVEEVQNVINAMQKILECPICLELIKEPVSTKCDHIFCKFCMLKLLNQKKGPSQCPLCKNDITKRSLQESTRFSQLVEELLKIICAFQLDTGLEYANSYNFAKKENNSPEHLKDEVSIIQSMGYRNRAKRLLQSEPENPSLQETSLSVQLSNLGTVRTLRTKQRIQPQKTSVYIELGSDSSEDTVNKATYCSVGDQELLQITPQGTRDEISLDSAKKAACEFSETDVTNTEHHQPSNNDLNTTEKRAAERHPEKYQGEAASGCESETSVSEDCSGLSSQSDILTTQQRDTMQHNLIKLQQEMAELEAVLEQHGSQPSNSYPSIISDSSALEDLRNPEQSTSEKVLTSQKSSEYPISQNPEGLSADKFEVSADSSTSKNKEPGVERSSPSKCPSLDDRWYMHSCSGSLQNRNYPSQEELIKVVDVEEQQLEESGPHDLTETSYLPRQDLEGTPYLESGISLFSDDPESDPSEDRAPESARVGNIPSSTSALKVPQLKVAESAQSPAAAHTTDTAGYNAMEESVSREKPELTASTERVNKRMSMVVSGLTPEEFMLVYKFARKHHITLTNLITEETTHVVMKTDAEFVCERTLKYFLGIAGGKWVVSYFWVTQSIKERKMLNEHDFEVRGDVVNGRNHQGPKRARESQDRKIFRGLEICCYGPFTNMPTDQLEWMVQLCGASVVKELSSFTLGTGVHPIVVVQPDAWTEDNGFHAIGQMCEAPVVTREWVLDSVALYQCQELDTYLIPQIPHSHY']
['breast cancer type 1 susceptibility protein isoform 1']
['MDLSALRVEEVQNVINAMQKILECPICLELIKEPVSTKCDHIFCKFCMLKLLNQKKGPSQCPLCKNDITKRSLQESTRFSQLVEELLKIICAFQLDTGLEYANSYNFAKKENNSPEHL

In [19]:
for k in feat_cds:
    coding_dna = record2.features[k].extract(record2.seq)
    print("DNA: ", coding_dna)
    print("Protein: ", coding_dna.translate())

DNA:  ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAAATCTTAGAGTGTCCCATCTGTCTGGAGTTGATCAAGGAACCTGTCTCCACAAAGTGTGACCACATATTTTGCAAATTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAGGAGCCTACAAGAAAGTACGAGATTTAGTCAACTTGTTGAAGAGCTATTGAAAATCATTTGTGCTTTTCAGCTTGACACAGGTTTGGAGTATGCAAACAGCTATAATTTTGCAAAAAAGGAAAATAACTCTCCTGAACATCTAAAAGATGAAGTTTCTATCATCCAAAGTATGGGCTACAGAAACCGTGCCAAAAGACTTCTACAGAGTGAACCCGAAAATCCTTCCTTGCAGGAAACCAGTCTCAGTGTCCAACTCTCTAACCTTGGAACTGTGAGAACTCTGAGGACAAAGCAGCGGATACAACCTCAAAAGACGTCTGTCTACATTGAATTGGGATCTGATTCTTCTGAAGATACCGTTAATAAGGCAACTTATTGCAGTGTGGGAGATCAAGAATTGTTACAAATCACCCCTCAAGGAACCAGGGATGAAATCAGTTTGGATTCTGCAAAAAAGGCTGCTTGTGAATTTTCTGAGACGGATGTAACAAATACTGAACATCATCAACCCAGTAATAATGATTTGAACACCACTGAGAAGCGTGCAGCTGAGAGGCATCCAGAAAAGTATCAGGGTGAAGCAGCATCTGGGTGTGAGAGTGAAACAAGCGTCTCTGAAGACTGCTCAGGGCTATCCTCTCAGAGTGACATTTTAACCACTCAGCAGAGGGATACCATGCAACATAACCTGATAAAGCTCCAGCAGGAAATGGCTGAACTAGAAGCTGTGTTAGAACAGCATGGGAGCCAGCCTTCTAACAGCTACCCTTCCATCATAAGTG
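`extract` joins the exons of a compound location and, for features on the reverse strand, returns the reverse complement, so the result always reads 5' to 3' along the coding strand. The sketch below checks this on a made-up two-exon toy gene (not BRCA1) and compares the translation with a `translation` qualifier, which in GenBank records does not include the stop codon. All sequences and coordinates are invented for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import CompoundLocation, FeatureLocation, SeqFeature

# Toy "gene": two exons (ATGAAA, GTTTAA) separated by a made-up intron.
plus_seq = Seq("CC" + "ATGAAA" + "GTAAGT" + "GTTTAA" + "GG")

# CDS annotated on the forward strand of plus_seq.
cds_plus = SeqFeature(
    CompoundLocation([FeatureLocation(2, 8, strand=1),
                      FeatureLocation(14, 20, strand=1)]),
    type="CDS",
    qualifiers={"translation": ["MKV"]},
)

# The same CDS seen from the opposite strand: coordinates are mirrored and
# the exons are listed in biological order (highest coordinates first).
minus_seq = plus_seq.reverse_complement()
cds_minus = SeqFeature(
    CompoundLocation([FeatureLocation(14, 20, strand=-1),
                      FeatureLocation(2, 8, strand=-1)]),
    type="CDS",
    qualifiers={"translation": ["MKV"]},
)

coding_plus = cds_plus.extract(plus_seq)
coding_minus = cds_minus.extract(minus_seq)

protein_with_stop = coding_plus.translate()     # stop codon shown as "*"
protein = coding_plus.translate(to_stop=True)   # translation ends before the stop
matches = str(protein) == cds_plus.qualifiers["translation"][0]
print(coding_plus, coding_minus, protein_with_stop, protein, matches)
```

Applied to the real record, the same comparison against `qualifiers["translation"][0]` is a quick sanity check that the extracted DNA is the annotated coding sequence.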

5. Check how many genes are annotated in the record (features of type “gene”). Which ones are annotated on each strand?

In [20]:
feat_gene = []
for i in range(len(record2.features)):
    if record2.features[i].type == "gene":
        feat_gene.append(i)
print("Number of features of type gene:", len(feat_gene))

Number of features of type gene: 3


In [21]:
for g in feat_gene:
    print(record2.features[g].location.strand)
    print(record2.features[g].qualifiers["gene"])

-1
['NBR2']
1
['BRCA1']
-1
['RPL21P4']
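To answer the "which strand" part directly, the gene names can be grouped by strand. This is a minimal sketch on hand-built features: the gene names and strands are those printed above, but the coordinates are placeholders and the `"unnamed"` fallback is an assumption.

```python
from collections import defaultdict

from Bio.SeqFeature import FeatureLocation, SeqFeature

# Gene names and strands as printed above; the coordinates are placeholders.
toy_features = [
    SeqFeature(FeatureLocation(0, 10, strand=-1), type="gene",
               qualifiers={"gene": ["NBR2"]}),
    SeqFeature(FeatureLocation(20, 60, strand=1), type="gene",
               qualifiers={"gene": ["BRCA1"]}),
    SeqFeature(FeatureLocation(70, 80, strand=-1), type="gene",
               qualifiers={"gene": ["RPL21P4"]}),
    SeqFeature(FeatureLocation(25, 55, strand=1), type="CDS",
               qualifiers={"gene": ["BRCA1"]}),
]

genes_by_strand = defaultdict(list)
for feature in toy_features:
    if feature.type == "gene":
        # "gene" may be missing on some features, so fall back to a placeholder.
        name = feature.qualifiers.get("gene", ["unnamed"])[0]
        genes_by_strand[feature.location.strand].append(name)

for strand, names in sorted(genes_by_strand.items()):
    print(strand, names)
```

Here strand `1` is the forward strand of the downloaded region and `-1` the reverse strand, so BRCA1 is on the forward strand while NBR2 and RPL21P4 are on the reverse strand.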


6. Convert this file to FASTA format.

In [22]:
SeqIO.convert("BRCA.gb", "genbank", "BRCA.fasta", "fasta") 

1
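The value returned by `SeqIO.convert` is the number of records converted, which is why the cell above prints `1`. The sketch below reproduces the conversion entirely in memory on a toy record, assuming a recent Biopython release in which writing GenBank output requires the `molecule_type` annotation; the record contents are invented.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy record; "molecule_type" is needed when writing GenBank output.
toy = SeqRecord(
    Seq("ATGAAAGTTTAA"),
    id="toy1",
    name="toy1",
    description="toy record",
    annotations={"molecule_type": "DNA"},
)

# Write the toy record as GenBank text in memory, then rewind the handle.
genbank_handle = StringIO()
SeqIO.write(toy, genbank_handle, "genbank")
genbank_handle.seek(0)

# Convert GenBank to FASTA; the return value is the number of records converted.
fasta_handle = StringIO()
count = SeqIO.convert(genbank_handle, "genbank", fasta_handle, "fasta")
fasta_text = fasta_handle.getvalue()
print(count)
print(fasta_text)
```

Note that FASTA keeps only the identifier, description, and sequence, so annotations and features are lost in the conversion.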