# NCBI database: Entrez

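Entrez is NCBI's data retrieval system. It searches across dozens of databases, including PubMed, GenBank, and GEO (the EInfo call below listed 43 when this notebook was run).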

In [1]:
from Bio import Entrez

### EFetch

* Retrieves full records from Entrez.
* Specify the return type (`rettype`) and return mode (`retmode`) to choose the format of the result.
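
To make the role of `rettype` and `retmode` concrete, the sketch below builds the kind of request URL that an EFetch call corresponds to, using only the standard library. The helper name `build_efetch_url` is hypothetical; the endpoint is NCBI's public E-utilities EFetch address, and the parameter values mirror the ones used in this notebook. It illustrates the parameters and is not Biopython's actual implementation.

```python
from urllib.parse import urlencode

# NCBI E-utilities EFetch endpoint (public base URL).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(db, record_id, rettype, retmode):
    """Return an EFetch URL for one record (hypothetical helper, no network access)."""
    params = {
        "db": db,            # database to query, e.g. "nucleotide"
        "id": record_id,     # accession or UID
        "rettype": rettype,  # what to return, e.g. "gb" (GenBank) or "fasta"
        "retmode": retmode,  # how to encode it, e.g. "xml" or "text"
    }
    return EFETCH_URL + "?" + urlencode(params)


# GenBank record as XML (what this notebook requests) vs. FASTA as plain text.
gb_xml_url = build_efetch_url("nucleotide", "NC_002058.3", "gb", "xml")
fasta_url = build_efetch_url("nucleotide", "NC_002058.3", "fasta", "text")
print(gb_xml_url)
print(fasta_url)
```

Only the last two parameters differ between the two URLs, which is why the same record can come back either as structured XML or as a plain FASTA string.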

In [2]:
Entrez.email = "woosa7@naver.com"

# Fetch the poliovirus record
handle = Entrez.efetch(db="nucleotide", id="NC_002058.3", rettype="gb", retmode="xml")

records = Entrez.parse(handle)
for record in records:
    print(record.keys(), '\n')
    
    for journal in record["GBSeq_references"]:
        print(journal.keys())
        break

dict_keys(['GBSeq_locus', 'GBSeq_length', 'GBSeq_strandedness', 'GBSeq_moltype', 'GBSeq_topology', 'GBSeq_division', 'GBSeq_update-date', 'GBSeq_create-date', 'GBSeq_definition', 'GBSeq_primary-accession', 'GBSeq_accession-version', 'GBSeq_other-seqids', 'GBSeq_secondary-accessions', 'GBSeq_project', 'GBSeq_keywords', 'GBSeq_source', 'GBSeq_organism', 'GBSeq_taxonomy', 'GBSeq_references', 'GBSeq_comment', 'GBSeq_feature-table', 'GBSeq_sequence', 'GBSeq_xrefs']) 

dict_keys(['GBReference_reference', 'GBReference_position', 'GBReference_authors', 'GBReference_title', 'GBReference_journal', 'GBReference_xref', 'GBReference_pubmed'])


In [3]:
handle = Entrez.efetch(db="nucleotide", id="NC_002058.3", rettype="gb", retmode="xml") 

# Entrez.read() loads the whole result into memory at once, so parse() is preferable for large results.
records = Entrez.read(handle)
for record in records:
    print(record["GBSeq_locus"])
    print(record["GBSeq_definition"])
    print(record["GBSeq_strandedness"], record["GBSeq_moltype"])
    print(record["GBSeq_length"], "bp")
    print(len(record["GBSeq_references"]), "journals")

NC_002058
Poliovirus, complete genome
single RNA
7440 bp
24 journals
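
The memory remark in the comment above can be illustrated without network access. The sketch below uses the standard library's `xml.etree.ElementTree.iterparse` on a tiny hand-written XML document shaped like the EFetch output (`GBSet` and `GBSeq` elements). The second record and the helper name `iter_loci` are made up for illustration. It shows the general idea of handling one record at a time, which is the behaviour `Entrez.parse()` offers, and it is not how Biopython is implemented.

```python
import io
import xml.etree.ElementTree as ET

# Hand-written sample: the first record uses values printed above,
# the second one is a made-up placeholder.
SAMPLE_XML = b"""<GBSet>
  <GBSeq><GBSeq_locus>NC_002058</GBSeq_locus><GBSeq_length>7440</GBSeq_length></GBSeq>
  <GBSeq><GBSeq_locus>EXAMPLE_2</GBSeq_locus><GBSeq_length>100</GBSeq_length></GBSeq>
</GBSet>"""


def iter_loci(stream):
    """Yield (locus, length) one record at a time instead of building the whole tree."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "GBSeq":
            yield elem.findtext("GBSeq_locus"), int(elem.findtext("GBSeq_length"))
            elem.clear()  # release the children of the record just processed


loci = list(iter_loci(io.BytesIO(SAMPLE_XML)))
print(loci)
```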


### EInfo, ESearch

EInfo lists the databases available through Entrez.

In [4]:
handle = Entrez.einfo()
record = Entrez.read(handle)

print(record)

print(len(record["DbList"]))

{'DbList': ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'sparcle', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'biosystems', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']}
43


In [5]:
# esearch: search any of the 40-odd databases listed above

handle = Entrez.esearch(db="pubmed", term="metagenome") 
record = Entrez.read(handle) 
record.keys()

dict_keys(['Count', 'RetMax', 'RetStart', 'IdList', 'TranslationSet', 'TranslationStack', 'QueryTranslation'])

In [6]:
print(record["Count"])

8739


In [7]:
print(record["IdList"])

['32778754', '32778049', '32772914', '32772670', '32766813', '32766476', '32763946', '32763940', '32763939', '32762019', '32761733', '32758682', '32758003', '32755736', '32755708', '32755703', '32753510', '32753508', '32750680', '32747714']
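
Note that `Count` reports 8739 matches while `IdList` holds only 20 IDs, because ESearch returns results one page at a time. ESearch accepts `retstart` and `retmax` parameters for paging. The sketch below only computes the page offsets, so it runs offline. The helper name `page_offsets` and the page size of 1000 are illustrative choices, and the loop that would call `Entrez.esearch` is left as a comment.

```python
def page_offsets(count, page_size):
    """Yield (retstart, retmax) pairs that together cover `count` results."""
    for start in range(0, count, page_size):
        yield start, min(page_size, count - start)


# 8739 is the Count printed above; the live value will differ over time.
pages = list(page_offsets(8739, 1000))
print(len(pages), pages[0], pages[-1])

# Each pair would then be passed to ESearch, for example:
# for retstart, retmax in pages:
#     handle = Entrez.esearch(db="pubmed", term="metagenome",
#                             retstart=retstart, retmax=retmax)
#     ids = Entrez.read(handle)["IdList"]
```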


# Swiss-Prot

Swiss-Prot provides protein sequences together with information on function, structure, individual domains, variants, and experimental results.

Example: https://www.uniprot.org/uniprot/P02649.txt (a protein associated with hyperlipidemia and Alzheimer's disease)

In [8]:
from Bio import SwissProt
from Bio import ExPASy

In [9]:
with open("data/P02649.txt") as handle:
    record = SwissProt.read(handle)
    print(type(record), '\n')
    
    # https://biopython.org/DIST/docs/api/Bio.SwissProt.Record-class.html

    print("entry_name:", record.entry_name )
    print("gene_name:", record.gene_name)
    print("organism:", record.organism)
    print("sequence_length:", record.sequence_length)
    print("sequence:", record.sequence)
    print(record.description)
    print(record.keywords)

<class 'Bio.SwissProt.Record'> 

entry_name: APOE_HUMAN
gene_name: Name=APOE;
organism: Homo sapiens (Human).
sequence_length: 317
sequence: MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELALGRFWDYLRWVQTLSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQLTPVAEETRARLSKELQAAQARLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLRKLRKRLLRDADDLQKRLAVYQAGAREGAERGLSAIRERLGPLVEQGRVRAATVGSLAGQPLQERAQAWGERLRARMEEMGSRTRDRLDEVKEQVAEVRAKLEEQAQQIRLQAEAFQARLKSWFEPLVEDMQRQWAGLVEKVQAAVGTSAAPVPSDNH
RecName: Full=Apolipoprotein E; Short=Apo-E; Flags: Precursor;
['3D-structure', 'Alzheimer disease', 'Amyloidosis', 'Cholesterol metabolism', 'Chylomicron', 'Complete proteome', 'Direct protein sequencing', 'Disease mutation', 'Extracellular matrix', 'Glycation', 'Glycoprotein', 'HDL', 'Heparin-binding', 'Hyperlipidemia', 'Lipid metabolism', 'Lipid transport', 'Lipid-binding', 'Neurodegeneration', 'Oxidation', 'Phosphoprotein', 'Polymorphism', 'Reference proteome', 'Repeat', 'Secreted', 'Signal', 'Steroid metabolism', 'Sterol metabolis
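
The attributes printed above come from the UniProtKB flat-file format, in which every line starts with a two-letter line code (`ID`, `DE`, `GN`, `OS`, and so on). The sketch below groups lines by code using only the standard library. The sample text is an abridged excerpt reconstructed from the output shown above rather than the complete entry, and `group_by_line_code` is a hypothetical helper, not part of `Bio.SwissProt`.

```python
from collections import defaultdict

# Abridged excerpt, reconstructed from the values printed above.
SAMPLE_ENTRY = """\
ID   APOE_HUMAN              Reviewed;         317 AA.
DE   RecName: Full=Apolipoprotein E;
DE            Short=Apo-E;
GN   Name=APOE;
OS   Homo sapiens (Human).
//
"""


def group_by_line_code(text):
    """Map each two-letter line code to the list of its data lines."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        code, data = line[:2], line[5:].strip()
        fields[code].append(data)
    return dict(fields)


fields = group_by_line_code(SAMPLE_ENTRY)
print(fields["ID"][0].split()[0])  # entry name
print(fields["GN"])
print(fields["OS"])
```

This is roughly the information that `SwissProt.read()` exposes as `entry_name`, `gene_name`, and `organism`. The real parser handles many more line codes and continuation rules.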

### ExPASy (Expert Protein Analysis System)

* A server that brings together proteomics tools and databases.
* Information can be retrieved from this server with a Python script.

In [10]:
accession = "P02649"
handle = ExPASy.get_sprot_raw(accession)

record = SwissProt.read(handle)

print(record.gene_name)
print(record.organism)
print(record.sequence_length)
print(record.sequence)

Name=APOE;
Homo sapiens (Human).
317
MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELALGRFWDYLRWVQTLSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQLTPVAEETRARLSKELQAAQARLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLRKLRKRLLRDADDLQKRLAVYQAGAREGAERGLSAIRERLGPLVEQGRVRAATVGSLAGQPLQERAQAWGERLRARMEEMGSRTRDRLDEVKEQVAEVRAKLEEQAQQIRLQAEAFQARLKSWFEPLVEDMQRQWAGLVEKVQAAVGTSAAPVPSDNH
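
When several accessions are fetched this way, it can help to keep going when one of them fails. The sketch below shows one possible pattern: `fetch_all` is a hypothetical helper that takes the fetch function as an argument, so it can be tried offline with a stand-in. In this notebook's setting the argument could be a small wrapper around `ExPASy.get_sprot_raw` and `SwissProt.read`. Which exceptions a failed download raises is an assumption to check against your Biopython version.

```python
def fetch_all(accessions, fetch_one):
    """Call fetch_one(accession) for each accession; collect results and failures."""
    results, failures = {}, {}
    for accession in accessions:
        try:
            results[accession] = fetch_one(accession)
        except Exception as exc:  # broad on purpose: error types depend on the fetcher
            failures[accession] = str(exc)
    return results, failures


def fake_fetch(accession):
    """Offline stand-in for a real download; only knows the entry used above."""
    known = {"P02649": "APOE_HUMAN"}
    if accession not in known:
        raise ValueError("unknown accession: " + accession)
    return known[accession]


results, failures = fetch_all(["P02649", "NOT_AN_ID"], fake_fetch)
print(results)
print(failures)
```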
