**Purpose:** Extract the text of a PubMed Central (PMC) Open Access article.

In [1]:
from Bio import Entrez
import json
import xml.etree.ElementTree as ET

In [2]:
Entrez.email = 'sseth@ualberta.ca'

In [3]:
def search(term):
    """Return the list of PMC IDs matching a search term."""
    # esearch returns at most 20 IDs unless retmax is passed.
    search_handle = Entrez.esearch(db='pmc', term=term)
    results = Entrez.read(search_handle)
    search_handle.close()
    return results['IdList']

In [35]:
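def article(pmcid):
    """Fetch a PMC article and return the text of its first three sections."""
    fetched = Entrez.efetch(db='pmc', id=pmcid)
    xml_text = fetched.read()
    fetched.close()

    root = ET.fromstring(xml_text)

    parts = []

    # root[0] is the first article in the response; its first three children
    # are assumed to be the front, body, and back sections, in that order.
    for n in range(3):
        part = root[0][n]
        parts.append(''.join(part.itertext()))

    return parts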
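
The `article` function above picks sections by position (`root[0][n]`), which raises an `IndexError` or mislabels sections when an article has no `<back>` element. A minimal sketch of a lookup by tag name follows. It assumes the response root wraps one or more un-namespaced `<article>` elements; `SAMPLE` and `split_sections` are hypothetical names, and the XML is a hand-written stand-in, not a real efetch response.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for an efetch response from db='pmc'.
SAMPLE = """<pmc-articleset>
  <article>
    <front><article-meta><title-group><article-title>Example title</article-title></title-group></article-meta></front>
    <body><sec><title>Introduction</title><p>First paragraph.</p></sec></body>
  </article>
</pmc-articleset>"""


def split_sections(xml_text):
    """Return the text of <front>, <body>, and <back>; '' when one is absent."""
    root = ET.fromstring(xml_text)
    article_node = root.find('article')
    if article_node is None:
        raise ValueError('no <article> element found')

    sections = {}
    for tag in ('front', 'body', 'back'):
        node = article_node.find(tag)
        # A missing section becomes an empty string instead of an IndexError.
        sections[tag] = '' if node is None else ''.join(node.itertext()).strip()
    return sections


sections = split_sections(SAMPLE)
print(sections['body'])
```

Returning a dict keyed by tag name keeps the caller from depending on element order, at the cost of no longer supporting the three-way tuple unpacking used later in this notebook.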

In [36]:
def pretty_print(paper):
    pretty = json.dumps(paper, indent=4, separators=(',', ':'))
    print(pretty)

In [37]:
pmcid = 'PMC4561515'

In [38]:
front, body, back = article(pmcid)

In [39]:
print(body)


    
      Introduction
      Staphylococcus aureus (S. aureus) is an important opportunistic pathogen that causes skin and soft tissue infections as well as invasive life-threatening diseases, including sepsis and pneumonia. It is estimated that this Gram-positive bacterium causes 80,000 invasive infections each year in the US and about 15% of patients contracting invasive S. aureus succumb to this infection (1). Disease severity and increasing number of antibiotic-resistant strains urgently call for the development of an effective vaccine against this pathogen. So far, however, vaccines based on type 5 and 8 capsular polysaccharides or on a single conserved protein antigen (IsdB) have failed in Phase III and II/III clinical trials, respectively (2, 3).
      Although a correlate of protection has not yet been established for staphylococcal infections, there are evidences that both humoral and cellular immunity are important to prevent staphylococcal diseases (4, 5). For example, imm

In [None]:
# download medline

!mkdir -p ./data
!wget ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/*.gz -P ./data
!wget ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/*.gz -P ./data

In [None]:
# download open access

!mkdir -p ./data
!wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/comm_use.A-B.xml.tar.gz
!tar -xzf comm_use.A-B.xml.tar.gz --directory data/

In [1]:
import os
import pubmed_parser as pp

In [2]:
def find_path(pmcid):
    """Return the path of the .nxml file for a PMCID under ./data, or None."""
    name = pmcid + '.nxml'
    for root, dirs, files in os.walk('./data'):
        if name in files:
            return os.path.join(root, name)
    return None
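
`find_path` walks the whole `./data` tree on every call, which gets slow when many PMCIDs are looked up. One possible alternative, sketched below, is to walk the tree once and keep a dict from PMCID to path. `build_index` is a hypothetical helper, and the demonstration uses a temporary directory with a made-up journal folder and PMCID rather than the real download.

```python
import os
import tempfile


def build_index(data_dir):
    """Map each PMCID (file name without '.nxml') to its path under data_dir."""
    index = {}
    for root, dirs, files in os.walk(data_dir):
        for name in files:
            stem, ext = os.path.splitext(name)
            if ext == '.nxml':
                index[stem] = os.path.join(root, name)
    return index


# Demonstration on a throwaway directory; folder and file names are made up.
with tempfile.TemporaryDirectory() as tmp:
    journal_dir = os.path.join(tmp, 'Some_Journal')
    os.makedirs(journal_dir)
    open(os.path.join(journal_dir, 'PMC0000001.nxml'), 'w').close()

    index = build_index(tmp)
    found = index.get('PMC0000001')
    missing = index.get('PMC9999999')
    print(found is not None, missing)
```

With the real data, `index = build_index('./data')` followed by `index.get(pmcid)` would stand in for repeated `find_path(pmcid)` calls; `dict.get` returns `None` for an unknown PMCID.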

In [None]:
pmcid = 'PMC4745833'
path = find_path(pmcid)

In [None]:
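# Parse the article's metadata, reference list, paragraphs,
# figure captions, and tables from the same .nxml file.
xml = pp.parse_pubmed_xml(path)
references = pp.parse_pubmed_references(path)
paragraphs = pp.parse_pubmed_paragraph(path, all_paragraph=True)
captions = pp.parse_pubmed_caption(path)
table = pp.parse_pubmed_table(path)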