PMID to PMC API from Medline cannot convert all provided PMID #37

Open
titipata opened this issue Dec 29, 2016 · 6 comments

Comments

@titipata
Owner

The API here cannot convert every PMID it is given. I was trying to parse citations for a given set of PMIDs, but it only returns a subset of the PMIDs that I provided. One possibility is to host the PMID/PMC pairs somewhere in the cloud and provide a similar API, or a source file that users can use to convert PMID to PMC.

@titipata titipata changed the title PMID to PMC API from Medline cannot convert all PMID PMID to PMC API from Medline cannot convert all provided PMID Dec 29, 2016
@titipata
Owner Author

titipata commented Jan 4, 2017

I uploaded the PMID-PMC pairs (91 MB, not bad), which can be downloaded as follows:

wget https://s3-us-west-2.amazonaws.com/science-of-science-bucket/nih/pmid_pmc_pair.csv

With this file, you can convert PMID to PMC on your own. From here, we can modify the parse_citation_web function to take just a PMC, as below.

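import requests
from lxml import html

# extract_citations and extract_pmc are assumed to be the helpers already defined alongside this function
def parse_citation_web(pmc):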
    """
    Parse citations from given PMC 
    Parameters
    ----------
    pmc: str, PMC of the document e.g. 'PMC1217341'
    Returns
    -------
    dict_out: dict, contains following keys
        pmc: Pubmed Central ID
        n_citations: number of citations for given articles
        pmc_cited: list of PMCs that cite the given PMC
    """

    link = "http://www.ncbi.nlm.nih.gov/pmc/articles/%s/citedby/" % str(pmc)
    page = requests.get(link)
    tree = html.fromstring(page.content)
    n_citations = extract_citations(tree)
    n_pages = int(n_citations/30) + 1

    pmc_cited_all = list() # all PMC cited
    citations = tree.xpath('//div[@class="rprt"]/div[@class="title"]/a/@href')[1::]
    pmc_cited = list(map(extract_pmc, citations))
    pmc_cited_all.extend(pmc_cited)
    if n_pages >= 2:
        for i in range(2, n_pages+1):
            link = "http://www.ncbi.nlm.nih.gov/pmc/articles/%s/citedby/?page=%s" % (pmc, str(i))
            page = requests.get(link)
            tree = html.fromstring(page.content)
            citations = tree.xpath('//div[@class="rprt"]/div[@class="title"]/a/@href')[1::]
            pmc_cited = list(map(extract_pmc, citations))
            pmc_cited_all.extend(pmc_cited)
    # drop the article's own PMC; compare by value, not identity
    pmc_cited_all = [p for p in pmc_cited_all if p != pmc]
    dict_out = {'n_citations': n_citations,
                'pmc': pmc,
                'pmc_cited': pmc_cited_all}
    return dict_out

@titipata
Owner Author

We also want to add the PMC Copyright Notice to the scraping function so that users don't scrape too much and get blocked: https://www.ncbi.nlm.nih.gov/pmc/about/copyright/#copy-PMC
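
As a rough sketch of how the scraping loop could pace itself, the helper below enforces a minimum gap between requests. The `Throttle` class and the one-second interval are hypothetical and only for illustration; the interval is a placeholder, not a limit documented by NCBI.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Sleep just long enough to keep calls min_interval seconds apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Hypothetical usage inside parse_citation_web, before each requests.get:
#     throttle = Throttle(min_interval=1.0)  # assumed value
#     throttle.wait()
#     page = requests.get(link)
```

The clock and sleep functions are injectable only so the pacing logic can be checked without actually waiting.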

@nick-hahner

nick-hahner commented Mar 28, 2018

What about ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.csv ?
According to this list, though, not every PMCID has a corresponding PMID.

@titipata
Owner Author

@nick-hahner, nice! It contains ~1.8M rows of PMID/PMC pairs from the Open Access Subset. I'm still thinking about how to update the list regularly without hurting the repository. I mean, I could upload the PMC-PMID pairs from MEDLINE somewhere, as I mentioned. Do you have any preference or suggestions on how to make it available in the repository?
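
One option, sketched here under assumptions, is to keep the pair file out of the repository entirely and refresh a local copy only when it gets old. The `ensure_fresh` function, the 30-day default, and the cache path in the usage comment are all hypothetical.

```python
import os
import time
import urllib.request


def ensure_fresh(path, url, max_age_days=30,
                 fetch=urllib.request.urlretrieve, now=time.time):
    """Download url to path if the file is missing or older than max_age_days.

    Returns True if a download happened, False if the cached copy was reused.
    """
    path = os.path.expanduser(path)
    if os.path.exists(path):
        age_seconds = now() - os.path.getmtime(path)
        if age_seconds < max_age_days * 86400:
            return False  # cached copy is still recent enough
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    fetch(url, path)
    return True


# Hypothetical usage (the cache location is an assumption):
#     ensure_fresh(
#         "~/.pp_data/pmid_pmc_pair.csv",
#         "https://s3-us-west-2.amazonaws.com/science-of-science-bucket/nih/pmid_pmc_pair.csv",
#     )
```

This way the repository only carries the download logic, and the data file lives in the user's own directory.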

@nick-hahner

nick-hahner commented Mar 28, 2018

Actually, this file is probably better, with 4,892,265 rows:
ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz

First, rsync or wget -c -N ... the file to some directory like ~/.pp_data.
Then you can use an sqlite3 db:

# Create an indexed sqlite db
import pandas as pd
from sqlalchemy import create_engine, text as sqa_text

engine = create_engine('sqlite:///pmid_to_pmcid.db')  # but some better location
df = pd.read_csv('~/.pp_data/PMC-ids.csv.gz', dtype=str)
df[['PMCID', 'PMID']].to_sql('pmc_pmid', engine, index=False, if_exists='replace')
# run the statements on a connection (works on SQLAlchemy 1.4 and 2.0)
with engine.begin() as conn:
    conn.execute(sqa_text('create index pmc_idx on pmc_pmid(PMCID)'))
    conn.execute(sqa_text('create index pmid_idx on pmc_pmid(PMID)'))

# then later you can fetch like so:
def get_pmcid_from_pmid(pmid):
    engine = create_engine('sqlite:///pmid_to_pmcid.db')
    with engine.connect() as conn:
        # bind the :pmid placeholder; IDs are stored as strings
        ret = conn.execute(sqa_text('select pmcid from pmc_pmid where pmid = :pmid;'),
                           {'pmid': str(pmid)}).fetchone()
    return ret[0] if ret else None

How's that sound?

@chengkun-wu

@nick-hahner Yes! I used the ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz file for my local conversion.
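
For a purely local conversion without a database, something like the following might be enough. It is a minimal sketch that assumes only the `PMCID` and `PMID` column names used in the snippet above; the `load_pmid_to_pmcid` name is hypothetical. Rows with an empty PMID or PMCID are skipped and counted, since not every PMCID has a corresponding PMID.

```python
import csv
import gzip


def load_pmid_to_pmcid(path):
    """Build a PMID -> PMCID dict from a gzipped PMC-ids CSV file.

    Returns (mapping, n_missing), where n_missing counts the rows
    that lack either a PMID or a PMCID.
    """
    mapping = {}
    n_missing = 0
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            pmid = (row.get("PMID") or "").strip()
            pmcid = (row.get("PMCID") or "").strip()
            if pmid and pmcid:
                mapping[pmid] = pmcid
            else:
                n_missing += 1
    return mapping, n_missing


# Hypothetical usage (the path is an assumption):
#     pmid_to_pmcid, n_missing = load_pmid_to_pmcid("PMC-ids.csv.gz")
#     pmid_to_pmcid.get("12345678")  # None if the PMID is not in the file
```

Holding several million pairs in a dict costs memory, so the sqlite approach above is likely the better fit when lookups are occasional.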
