## BioPython worksheet - SeqIO module

Load the package and the required modules.

In [1]:
from Bio import Seq
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio import SeqFeature

Find the GenBank record for the human BRCA1 gene on the NCBI site using this [link](http://www.ncbi.nlm.nih.gov/nuccore/NC_000017.10?report=genbank&from=41184133&to=41289677&strand=true). Save the file in “genbank” format (the cells that follow assume it is saved as `BRCA.gb`).



1.a) Using BioPython, load the file above for analysis.

In [2]:
record = SeqIO.read("BRCA.gb", "genbank")
print(record)

ID: NC_000017.10
Name: NC_000017
Description: Homo sapiens chromosome 17, GRCh37.p13 Primary Assembly
Database cross-references: BioProject:PRJNA168, Assembly:GCF_000001405.25
Number of features: 17
/molecule_type=DNA
/topology=linear
/data_file_division=CON
/date=13-AUG-2013
/accessions=['NC_000017', 'REGION:', 'complement(41184133..41289677)']
/sequence_version=10
/keywords=['RefSeq']
/source=Homo sapiens (human)
/organism=Homo sapiens
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
/references=[Reference(title='DNA sequence of human chromosome 17 and analysis of rearrangement in the human lineage', ...), Reference(title='Finishing the euchromatic sequence of the human genome', ...), Reference(title='Initial sequencing and analysis of the human genome', ...)]
/comment=REFSEQ INFORMATION: The reference sequence is identical to
CM000679.1.
N
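`SeqIO.read` is appropriate here because the downloaded file holds exactly one record. As a minimal sketch of the difference from `SeqIO.parse`, the following uses an in-memory FASTA string (made up for illustration, not part of the exercise data) containing two records:

```python
from io import StringIO
from Bio import SeqIO

# Hypothetical two-record FASTA text, used only for illustration
two_records = ">seq1\nACGT\n>seq2\nGGCC\n"

# SeqIO.parse iterates over every record in the input
ids = [rec.id for rec in SeqIO.parse(StringIO(two_records), "fasta")]
print(ids)

# SeqIO.read expects exactly one record and raises ValueError otherwise
try:
    SeqIO.read(StringIO(two_records), "fasta")
    read_failed = False
except ValueError as err:
    read_failed = True
    print("SeqIO.read failed:", err)
```

For files that may hold several records, iterating with `SeqIO.parse` is the safer choice.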

1.b) Check the length of the corresponding DNA sequence.

In [3]:
len(record)

105545

1.c) Check the ID, description and name of this record.

In [4]:
record.id

'NC_000017.10'

In [5]:
record.description

'Homo sapiens chromosome 17, GRCh37.p13 Primary Assembly'

In [6]:
record.name

'NC_000017'

2.a) Check the list of annotations for this record.

In [7]:
record.annotations

{'molecule_type': 'DNA',
 'topology': 'linear',
 'data_file_division': 'CON',
 'date': '13-AUG-2013',
 'accessions': ['NC_000017', 'REGION:', 'complement(41184133..41289677)'],
 'sequence_version': 10,
 'keywords': ['RefSeq'],
 'source': 'Homo sapiens (human)',
 'organism': 'Homo sapiens',
 'taxonomy': ['Eukaryota',
  'Metazoa',
  'Chordata',
  'Craniata',
  'Vertebrata',
  'Euteleostomi',
  'Mammalia',
  'Eutheria',
  'Euarchontoglires',
  'Primates',
  'Haplorrhini',
  'Catarrhini',
  'Hominidae',
  'Homo'],
 'references': [Reference(title='DNA sequence of human chromosome 17 and analysis of rearrangement in the human lineage', ...),
  Reference(title='Finishing the euchromatic sequence of the human genome', ...),
  Reference(title='Initial sequencing and analysis of the human genome', ...)],
 'structured_comment': OrderedDict([('Genome-Annotation-Data',
               OrderedDict([('Annotation Provider', 'NCBI'),
                            ('Annotation Status', 'Full annotation'),


2.b) Using the record's annotations, determine the organism the sequence belongs to and its complete taxonomic classification.

In [8]:
record.annotations["source"]

'Homo sapiens (human)'

In [9]:
record.annotations["organism"]

'Homo sapiens'

In [10]:
record.annotations["taxonomy"]

['Eukaryota',
 'Metazoa',
 'Chordata',
 'Craniata',
 'Vertebrata',
 'Euteleostomi',
 'Mammalia',
 'Eutheria',
 'Euarchontoglires',
 'Primates',
 'Haplorrhini',
 'Catarrhini',
 'Hominidae',
 'Homo']
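Since `record.annotations` is a plain dictionary, a missing key raises `KeyError`. A small sketch, assuming a toy `SeqRecord` built by hand (the record and its shortened taxonomy list are illustrative, not the BRCA1 data), shows how `.get` with a default avoids that and how the taxonomy list can be joined into one lineage string:

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy record with hand-written annotations (illustrative only)
toy = SeqRecord(
    Seq("ACGT"),
    id="toy",
    annotations={
        "organism": "Homo sapiens",
        "taxonomy": ["Eukaryota", "Metazoa", "Chordata"],
    },
)

# .get() returns a default instead of raising KeyError for absent keys
organism = toy.annotations.get("organism", "unknown")
strain = toy.annotations.get("strain", "not annotated")

# Join the ordered taxonomy list into a single lineage string
lineage = " > ".join(toy.annotations.get("taxonomy", []))
print(f"{organism}: {lineage}")
```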

3. Check the record's list of features, in particular their type and location.

In [11]:
record.features

[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(105545), strand=1), type='source'),
 SeqFeature(FeatureLocation(BeforePosition(0), ExactPosition(12078), strand=-1), type='gene'),
 SeqFeature(CompoundLocation([FeatureLocation(ExactPosition(11890), ExactPosition(12078), strand=-1), FeatureLocation(BeforePosition(6390), ExactPosition(6453), strand=-1)], 'join'), type='misc_RNA', location_operator='join'),
 SeqFeature(FeatureLocation(ExactPosition(12177), AfterPosition(93366), strand=1), type='gene'),
 SeqFeature(CompoundLocation([FeatureLocation(ExactPosition(12177), ExactPosition(12390), strand=1), FeatureLocation(ExactPosition(13545), ExactPosition(13644), strand=1), FeatureLocation(ExactPosition(21881), ExactPosition(21935), strand=1), FeatureLocation(ExactPosition(31127), ExactPosition(31205), strand=1), FeatureLocation(ExactPosition(32704), ExactPosition(32793), strand=1), FeatureLocation(ExactPosition(33399), ExactPosition(33539), strand=1), FeatureLocation(ExactPosition

In [12]:
for feature in record.features:
    print(f"Type: {feature.type}")
    print(f"Location: {feature.location}")

Type: source
Location: [0:105545](+)
Type: gene
Location: [<0:12078](-)
Type: misc_RNA
Location: join{[11890:12078](-), [<6390:6453](-)}
Type: gene
Location: [12177:>93366](+)
Type: mRNA
Location: join{[12177:12390](+), [13545:13644](+), [21881:21935](+), [31127:31205](+), [32704:32793](+), [33399:33539](+), [37780:37886](+), [40371:40417](+), [41738:41815](+), [42800:46226](+), [46628:46717](+), [55085:55257](+), [61046:61173](+), [63139:63330](+), [66422:66733](+), [69965:70053](+), [73709:73787](+), [74287:74328](+), [80525:80609](+), [86543:86598](+), [88466:88540](+), [89957:90018](+), [91858:93366](+)}
Type: mRNA
Location: join{[12177:12390](+), [13545:13644](+), [21881:21935](+), [31127:31205](+), [32704:32793](+), [33399:33539](+), [37780:37886](+), [40371:40417](+), [41738:41815](+), [42800:46226](+), [46628:46717](+), [55085:55257](+), [58261:58327](+), [61049:61173](+), [63139:63330](+), [66422:66733](+), [69965:70053](+), [73709:73787](+), [74287:74328](+), [80525:80609](+)
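Biopython prints locations in 0-based, half-open coordinates, so `[0:105545]` corresponds to GenBank positions 1..105545, and a `<` or `>` marks a boundary that extends beyond the record. The sketch below rebuilds the location of the first gene by hand to show how those pieces can be inspected; the import fallback is an assumption to cover Biopython releases before and after the class was renamed to `SimpleLocation`.

```python
try:
    from Bio.SeqFeature import SimpleLocation  # newer Biopython releases
except ImportError:
    from Bio.SeqFeature import FeatureLocation as SimpleLocation  # older releases
from Bio.SeqFeature import BeforePosition

# Same coordinates as the first gene in the listing above: [<0:12078](-)
loc = SimpleLocation(BeforePosition(0), 12078, strand=-1)
print(loc)

start = int(loc.start)   # 0-based start (inclusive)
end = int(loc.end)       # end (exclusive)
length = len(loc)        # number of bases covered
starts_before_record = isinstance(loc.start, BeforePosition)
print(start, end, length, loc.strand, starts_before_record)
```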

4.a) From the list of features, identify the coding sequences (CDS) associated with this record and store their indices in a list.

In [13]:
# Indices of all features of type "CDS"
featcds = [i for i, feature in enumerate(record.features) if feature.type == "CDS"]
# Location of each CDS (a join of exon segments)
for k in featcds:
    print(record.features[k].location)
# Spliced nucleotide sequence of each CDS
for k in featcds:
    print(record.features[k].extract(record.seq))

join{[13564:13644](+), [21881:21935](+), [31127:31205](+), [32704:32793](+), [33399:33539](+), [37780:37886](+), [40371:40417](+), [41738:41815](+), [42800:42917](+), [46628:46717](+), [55085:55257](+), [61049:61173](+), [63139:63330](+), [66422:66733](+), [69965:70053](+), [73709:73787](+), [74287:74328](+), [80525:80609](+), [86543:86598](+), [88466:88540](+), [89957:90018](+), [91858:91983](+)}
join{[13564:13644](+), [21881:21935](+), [31127:31205](+), [32704:32793](+), [33399:33539](+), [37780:37886](+), [40371:40417](+), [41738:41815](+), [42800:46226](+), [46628:46717](+), [55085:55257](+), [61046:61173](+), [63139:63330](+), [66422:66733](+), [69965:70053](+), [73709:73787](+), [74287:74328](+), [80525:80609](+), [86543:86598](+), [88466:88540](+), [89957:90018](+), [91858:91983](+)}
join{[13564:13644](+), [21881:21935](+), [31127:31205](+), [32704:32793](+), [33399:33539](+), [37780:37886](+), [40371:40417](+), [41738:41815](+), [42800:46226](+), [46628:46717](+), [55085:55257]
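To make the effect of `extract` on a `join` location concrete, here is a minimal sketch on a made-up 20-base sequence with two "exons" (the sequence and coordinates are invented for illustration). The exons are spliced together in order and the result can then be translated:

```python
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, CompoundLocation
try:
    from Bio.SeqFeature import SimpleLocation  # newer Biopython releases
except ImportError:
    from Bio.SeqFeature import FeatureLocation as SimpleLocation  # older releases

# Toy sequence: GG | ATGGCC (exon 1) | GTAG (intron) | AAATAA (exon 2) | CC
genome = Seq("GGATGGCCGTAGAAATAACC")

# CDS made of two exons on the forward strand, joined in order
cds = SeqFeature(
    CompoundLocation([SimpleLocation(2, 8, strand=1), SimpleLocation(12, 18, strand=1)]),
    type="CDS",
)

coding = cds.extract(genome)                 # spliced coding sequence
protein = coding.translate(to_stop=True)     # translate up to the stop codon
print(coding, protein)
```

On the real record, the same two calls can be applied to `record.features[k]` and `record.seq` for each index in `featcds`.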

4.b) Using the associated “qualifiers”, determine which protein is encoded and what its biological significance is.

In [21]:
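# Select the last CDS explicitly; the stale output below (RPL21P4) came from k left over from the gene loop
qualifiers = record.features[featcds[-1]].qualifiers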

OrderedDict([('gene', ['RPL21P4']),
             ('gene_synonym', ['RPL21_58_1548']),
             ('note',
              ['ribosomal protein L21 pseudogene 4; Derived by automated computational analysis using gene prediction method: Curated Genomic.']),
             ('pseudo', ['']),
             ('db_xref', ['GeneID:140660', 'HGNC:HGNC:17959'])])

In [15]:
if "product" in qualifiers:
    protein_name = qualifiers["product"][0]
    print("Encoded protein:", protein_name)

Encoded protein: breast cancer type 1 susceptibility protein isoform 3


In [16]:
if "function" in qualifiers:
    function_info = qualifiers["function"][0]
    print("Biological significance:", function_info)
elif "note" in qualifiers:
    note_info = qualifiers["note"][0]
    print("Biological significance:", note_info)

Biological significance: isoform 3 is encoded by transcript variant 3; Derived by automated computational analysis using gene prediction method: BestRefSeq.
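Qualifier values are stored as lists, which is why the cells above index with `[0]`. A small sketch of a helper that hides that detail and tolerates missing keys follows; `first_qualifier` is a hypothetical name introduced here, and the feature is a hand-built toy rather than one from the BRCA1 record.

```python
from Bio.SeqFeature import SeqFeature
try:
    from Bio.SeqFeature import SimpleLocation  # newer Biopython releases
except ImportError:
    from Bio.SeqFeature import FeatureLocation as SimpleLocation  # older releases


def first_qualifier(feature, key, default=None):
    """Return the first value of a qualifier, or a default if it is absent (hypothetical helper)."""
    values = feature.qualifiers.get(key)
    return values[0] if values else default


# Toy CDS feature with made-up qualifiers
toy_cds = SeqFeature(
    SimpleLocation(0, 9, strand=1),
    type="CDS",
    qualifiers={"product": ["toy protein"], "note": ["toy note"]},
)

product = first_qualifier(toy_cds, "product", "unknown product")
# Fall back from "function" to "note", mirroring the if/elif above
meaning = first_qualifier(toy_cds, "function") or first_qualifier(toy_cds, "note", "no description")
print(product, "|", meaning)
```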


5. Check how many genes are annotated in the record (features of type “gene”). Which ones are annotated on each strand?

In [17]:
# Indices of all features of type "gene"
featgene = [i for i, feature in enumerate(record.features) if feature.type == "gene"]
print("Number of genes in feature type 'gene':", len(featgene))

Number of genes in feature type 'gene': 3


In [18]:
for k in featgene:
    print(record.features[k].location)

[<0:12078](-)
[12177:>93366](+)
[57844:58400](-)
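The `(+)` and `(-)` suffixes answer the strand question: one gene lies on the forward strand and two on the reverse strand. As a sketch of how this grouping could be done programmatically, the example below rebuilds three gene features by hand, with coordinates copied from the output above (fuzzy boundaries simplified to exact ones) and placeholder names `gene_1` to `gene_3`, which are assumptions rather than names taken from the record:

```python
from collections import defaultdict
from Bio.SeqFeature import SeqFeature
try:
    from Bio.SeqFeature import SimpleLocation  # newer Biopython releases
except ImportError:
    from Bio.SeqFeature import FeatureLocation as SimpleLocation  # older releases

# Hand-built stand-ins for the three gene features (names are placeholders)
features = [
    SeqFeature(SimpleLocation(0, 12078, strand=-1), type="gene", qualifiers={"gene": ["gene_1"]}),
    SeqFeature(SimpleLocation(12177, 93366, strand=1), type="gene", qualifiers={"gene": ["gene_2"]}),
    SeqFeature(SimpleLocation(57844, 58400, strand=-1), type="gene", qualifiers={"gene": ["gene_3"]}),
]

# Group gene names by the strand stored on each location (+1 or -1)
by_strand = defaultdict(list)
for feature in features:
    if feature.type == "gene":
        name = feature.qualifiers.get("gene", ["?"])[0]
        by_strand[feature.location.strand].append(name)

print("Forward strand:", by_strand[1])
print("Reverse strand:", by_strand[-1])
```

With the real data, the same loop can run over `record.features` directly.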


6. Convert this file to FASTA format.

In [19]:
records = SeqIO.parse("BRCA.gb", "genbank")
count = SeqIO.write(records, "BRCA.fasta", "fasta")
print("Converted %i records" % count)

Converted 1 records


In [20]:
count = SeqIO.convert("BRCA.gb", "genbank", "BRCA.fasta", "fasta")
print("Converted %i records" % count)

Converted 1 records
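`SeqIO.convert` is a shortcut for the parse-then-write pair used in the previous cell, and both accept open handles as well as file names. The sketch below tries the conversion entirely in memory on a made-up record, so it can be run without `BRCA.gb`; it assumes a Biopython version in which the GenBank writer takes the molecule type from `annotations["molecule_type"]`.

```python
from io import StringIO
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy record (illustrative only); molecule_type is needed to write GenBank
toy = SeqRecord(
    Seq("ATGGCCTAA"),
    id="demo_1",
    name="demo",
    description="toy record",
    annotations={"molecule_type": "DNA"},
)

# Write the record as GenBank text into an in-memory handle
gb_handle = StringIO()
SeqIO.write(toy, gb_handle, "genbank")
gb_handle.seek(0)

# Convert GenBank to FASTA, handle to handle
fasta_handle = StringIO()
count = SeqIO.convert(gb_handle, "genbank", fasta_handle, "fasta")
fasta_text = fasta_handle.getvalue()
print(count)
print(fasta_text)
```

Note that FASTA keeps only the identifier, description and sequence; the annotations and features of the GenBank record are not carried over.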
