# Building the dataset of research papers

The [Entrez](http://biopython.org/DIST/docs/api/Bio.Entrez-module.html) module, a part of the [Biopython](http://biopython.org/) library, will be used to interface with [PubMed](http://www.ncbi.nlm.nih.gov/pubmed).<br>
You can download Biopython from [here](http://biopython.org/wiki/Download).

In this notebook we will be covering several of the steps taken in the [Biopython Tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html), specifically in [Chapter 9  Accessing NCBI’s Entrez databases](http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc109).

In [63]:
!pip install biopython



In [64]:
from Bio import Entrez

# NCBI requires you to set your email address to make use of NCBI's E-utilities
Entrez.email = "Your.Name.Here@example.org"

The datasets will be saved as serialized Python objects, compressed with bzip2.
Saving/loading them will therefore require the [pickle](http://docs.python.org/3/library/pickle.html) and [bz2](http://docs.python.org/3/library/bz2.html) modules.

In [65]:
import pickle, bz2, os
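
The save/load pattern used in the rest of this notebook can be wrapped in two small helpers. The following is only a minimal sketch: the names `save_dataset` and `load_dataset` are hypothetical and do not appear elsewhere in the notebook. It uses `bz2.open` inside a `with` block so that the file is closed as soon as the data has been written or read.

```python
import bz2
import os
import pickle
import tempfile

def save_dataset(obj, path):
    """Pickle `obj` into a bzip2-compressed file at `path`."""
    with bz2.open(path, 'wb') as f:
        pickle.dump(obj, f)

def load_dataset(path):
    """Load a pickled object from a bzip2-compressed file at `path`."""
    with bz2.open(path, 'rb') as f:
        return pickle.load(f)

# Round trip through a temporary directory, so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    example_file = os.path.join(tmp, 'example_Ids.pkl.bz2')
    save_dataset([36220089, 36219911], example_file)
    restored = load_dataset(example_file)

print(restored)
```

Files written this way can also be read back with the `pickle.load( bz2.BZ2File( ... ) )` calls used in the cells below.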

## EInfo: Obtaining information about the Entrez databases

In [66]:
# accessing extended information about the PubMed database
pubmed = Entrez.read( Entrez.einfo(db="pubmed"), validate=False )[u'DbInfo']

# list of possible search fields for use with ESearch:
search_fields = { f['Name']:f['Description'] for f in pubmed["FieldList"] }

In `search_fields`, we find 'TIAB' ('Free text associated with Abstract/Title') as a possible search field to use in searches.

In [67]:
search_fields

{'ALL': 'All terms from all searchable fields',
 'UID': 'Unique number assigned to publication',
 'FILT': 'Limits the records',
 'TITL': 'Words in title of publication',
 'WORD': 'Free text associated with publication',
 'MESH': 'Medical Subject Headings assigned to publication',
 'MAJR': 'MeSH terms of major importance to publication',
 'AUTH': 'Author(s) of publication',
 'JOUR': 'Journal abbreviation of publication',
 'AFFL': "Author's institutional affiliation and address",
 'ECNO': 'EC number for enzyme or CAS registry number',
 'SUBS': 'CAS chemical name or MEDLINE Substance Name',
 'PDAT': 'Date of publication',
 'EDAT': 'Date publication first accessible through Entrez',
 'VOL': 'Volume number of publication',
 'PAGE': 'Page number(s) of publication',
 'PTYP': 'Type of publication (e.g., review)',
 'LANG': 'Language of publication',
 'ISS': 'Issue number of publication',
 'SUBH': 'Additional specificity for MeSH term',
 'SI': 'Cross-reference from publication to other databases

## ESearch: Searching the Entrez databases

To have a look at the kind of data we get when searching the database, we'll perform a search for papers authored by Haasdijk:

In [68]:
example_authors = ['Haasdijk E']
example_search = Entrez.read( Entrez.esearch( db="pubmed", term=' AND '.join([a+'[AUTH]' for a in example_authors]) ) )
example_search

{'Count': '39', 'RetMax': '20', 'RetStart': '0', 'IdList': ['33501027', '33501026', '33500899', '29311830', '28513205', '28513201', '28323435', '28140628', '26933487', '24977986', '24901702', '24852945', '24708899', '24252306', '23580075', '23144668', '22174697', '22154920', '21870131', '21760539'], 'TranslationSet': [], 'TranslationStack': [{'Term': 'Haasdijk E[Author]', 'Field': 'Author', 'Count': '39', 'Explode': 'N'}, 'GROUP'], 'QueryTranslation': 'Haasdijk E[Author]'}
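
The `term` string passed to `Entrez.esearch()` above is built by tagging each value with a search field and joining the pieces with a boolean operator. As a rough sketch of that step in isolation (the helper name `build_term` is hypothetical, and no query is sent to the server here):

```python
def build_term(values, field, operator='AND'):
    """Tag each value with an Entrez search field and join them with a boolean operator."""
    return (' %s ' % operator).join('%s[%s]' % (v, field) for v in values)

print(build_term(['Haasdijk E'], 'AUTH'))
print(build_term(['cognition', 'memory'], 'TIAB', operator='OR'))
```

The first call reproduces the term used in the search above; the second shows how the same helper could combine two keywords on the title/abstract field.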

Note that the values in the result are not plain Python strings:

In [69]:
type( example_search['IdList'][0] )

Bio.Entrez.Parser.StringElement

The part of the query's result we are most interested in is the list of PubMed IDs, accessible through the `'IdList'` key:

In [70]:
example_ids = [ int(id) for id in example_search['IdList'] ]
print(example_ids)

[33501027, 33501026, 33500899, 29311830, 28513205, 28513201, 28323435, 28140628, 26933487, 24977986, 24901702, 24852945, 24708899, 24252306, 23580075, 23144668, 22174697, 22154920, 21870131, 21760539]


### PubMed IDs dataset

We will now assemble a dataset made up of research articles containing the given keyword in either their titles or abstracts.

In [9]:
search_term = 'cognition'

In [10]:
Ids_file = 'data/' + search_term + '_Ids.pkl.bz2'

In [20]:
if os.path.exists( Ids_file ):
    Ids = pickle.load( bz2.BZ2File( Ids_file, 'rb' ) )
else:
    # determine the number of hits for the search term
    search = Entrez.read( Entrez.esearch( db="pubmed", term=search_term+'[TIAB]', retmax=0 ) )
    total = int( search['Count'] )
    
    # `Ids` will be incrementally assembled, by performing multiple queries,
    # each returning at most `retrieve_per_query` entries.
    Ids_str = []
    retrieve_per_query = 10000
    
    for start in range( 0, total, retrieve_per_query ):
        print('Fetching IDs of results [%d,%d]' % ( start, start+retrieve_per_query ) )
        s = Entrez.read( Entrez.esearch( db="pubmed", term=search_term+'[TIAB]', retstart=start, retmax=retrieve_per_query ) )
        Ids_str.extend( s[ u'IdList' ] )
    
    # convert Ids to integers (and ensure that the conversion is reversible)
    Ids = [ int(id) for id in Ids_str ]
    
    for (id_str, id_int) in zip(Ids_str, Ids):
        if str(id_int) != id_str:
            raise Exception('Conversion of PubMed ID %s from string to integer is not reversible.' % id_str )
    
    # Remove IDs that would cause problems below.
    # (Filtering, rather than calling `Ids.remove()`, avoids a ValueError
    # when one of these IDs is absent from the search results.)
    problem_ids = {
        31430077, 30566730, 30561281, 29127756, 29031035, 28155421,
        27936254, 27732103, 27570959, 27570956, 27570952, 27570950,
        27568126, 26218237, 25590551, 25590545, 25367166, 25148547,
        }
    Ids = [ id for id in Ids if id not in problem_ids ]
    # Save list of Ids
    pickle.dump( Ids, bz2.BZ2File( Ids_file, 'wb' ) )
    
total = len( Ids )
print('%d documents contain the search term "%s".' % ( total, search_term ) )

102128 documents contain the search term "cognition".
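
The paging logic in the cell above boils down to splitting the total number of hits into windows of at most `retrieve_per_query` results. A small sketch of that arithmetic on its own (the function name `batch_windows` is hypothetical, and the numbers are made up to keep the output short):

```python
def batch_windows(total, per_query):
    """Yield (retstart, retmax) pairs that cover `total` results without overlap."""
    for start in range(0, total, per_query):
        # the last window may be shorter than `per_query`
        yield start, min(per_query, total - start)

windows = list(batch_windows(25, 10))
print(windows)
```

Each pair could be passed as `retstart` and `retmax` to one `Entrez.esearch()` call, as the loop above does with a fixed `retmax`.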


Taking a look at what we just retrieved, here are the first 5 elements of the `Ids` list:

In [71]:
Ids[:5]

[36220089, 36219911, 36219905, 36219802, 36219788]

## ESummary: Retrieving summaries from primary IDs

To have a look at the kind of metadata we get from a call to `Entrez.esummary()`, we now fetch the summary of one of the papers in our dataset (using one of the PubMed IDs we obtained in the previous section):

In [72]:
example_paper = Entrez.read( Entrez.esummary(db="pubmed", id='36220089') )[0]

def print_dict( p ):
    for k,v in p.items():
        print(k)
        print('\t', v)

print_dict(example_paper)

Item
	 []
Id
	 36220089
PubDate
	 2022 Oct 10
EPubDate
	 
Source
	 Curr Biol
AuthorList
	 ['Giurfa M']
LastAuthor
	 Giurfa M
Title
	 Pollinator cognition: Framing bee memories in an ecological context.
Volume
	 32
Issue
	 19
Pages
	 R1015-R1018
LangList
	 ['English']
NlmUniqueID
	 9107782
ISSN
	 0960-9822
ESSN
	 1879-0445
PubTypeList
	 ['Journal Article']
RecordStatus
	 PubMed - indexed for MEDLINE
PubStatus
	 ppublish
ArticleIds
	 {'pubmed': ['36220089'], 'medline': [], 'pii': 'S0960-9822(22)01365-3', 'doi': '10.1016/j.cub.2022.08.043', 'rid': '36220089', 'eid': '36220089'}
DOI
	 10.1016/j.cub.2022.08.043
History
	 {'pubmed': ['2022/10/12 06:00'], 'medline': ['2022/10/14 06:00'], 'entrez': '2022/10/11 19:03'}
References
	 []
HasAbstract
	 IntegerElement(1, attributes={})
PmcRefCount
	 IntegerElement(0, attributes={})
FullJournalName
	 Current biology : CB
ELocationID
	 doi: 10.1016/j.cub.2022.08.043
SO
	 2022 Oct 10;32(19):R1015-R1018


For now, we'll keep just some basic information for each paper: title, list of authors, publication year, and [DOI](https://en.wikipedia.org/wiki/Digital_object_identifier).

In case you are not familiar with the DOI system: the paper above can be accessed through the link `https://www.doi.org/` followed by the paper's DOI.

In [73]:
( example_paper['Title'], example_paper['AuthorList'], int(example_paper['PubDate'][:4]), example_paper['DOI'] )

('Pollinator cognition: Framing bee memories in\xa0an\xa0ecological context.',
 ['Giurfa M'],
 2022,
 '10.1016/j.cub.2022.08.043')
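
The two conversions used above (slicing the year out of `PubDate`, and turning a DOI into a link) can be sketched as small helpers. The names `publication_year` and `doi_url` are hypothetical, and the sketch assumes that `PubDate` always starts with a four-digit year, as in the example above:

```python
def publication_year(pub_date):
    """Extract the year from a PubDate string such as '2022 Oct 10'."""
    return int(pub_date[:4])

def doi_url(doi):
    """Return the resolver link for a DOI, or an empty string when no DOI is known."""
    return 'https://www.doi.org/' + doi if doi else ''

print(publication_year('2022 Oct 10'))
print(doi_url('10.1016/j.cub.2022.08.043'))
```

A `PubDate` that does not start with a year makes `int()` raise a `ValueError`, so callers that process many records need to handle that case.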

### Summaries dataset

We are now ready to assemble a dataset containing the summaries of all the paper `Ids` we previously fetched.

To reduce the memory footprint, and to ensure that loading the saved datasets does not depend on Biopython being installed, values returned by `Entrez.read()` will be converted to their corresponding native Python types. We start by choosing the file in which the summaries will be saved:

In [24]:
Summaries_file = 'data/' + search_term + '_Summaries.pkl.bz2'

In [25]:
if os.path.exists( Summaries_file ):
    Summaries = pickle.load( bz2.BZ2File( Summaries_file, 'rb' ) )
else:
    # `Summaries` will be incrementally assembled, by performing multiple queries,
    # each returning at most `retrieve_per_query` entries.
    Summaries = []
    retrieve_per_query = 500
    
    print('Fetching Summaries of results: ')
    for start in range( 0, len(Ids), retrieve_per_query ):
        if (start % 10000 == 0):
            print('')
            print(start, end='')
        else:
            print('.', end='')
        
        # build comma separated string with the ids at indexes [start, start+retrieve_per_query)
        query_ids = ','.join( [ str(id) for id in Ids[ start : start+retrieve_per_query ] ] )
        
        s = Entrez.read( Entrez.esummary( db="pubmed", id=query_ids ) )
        
        # out of the retrieved data, we will keep only a tuple (title, authors, year, DOI), associated with the paper's id.
        # (all values converted to native Python formats)
        for p in s:
            try:
                f = [
                    ( int( p['Id'] ), (
                        str( p['Title'] ),
                        [ str(a) for a in p['AuthorList'] ],
                        int( p['PubDate'][:4] ),                # keeps just the publication year
                        str( p.get('DOI', '') )            # papers for which no DOI is available get an empty string in their place
                        ) )
                    ]
                Summaries.extend( f )
            except ValueError as e:
                print("\nError with ID " + p['Id'] + ": " + str(e))
                print("Manually remove this ID above and re-run code.")
    
    # Save Summaries, as a dictionary indexed by Ids
    Summaries = dict( Summaries )
    
    pickle.dump( Summaries, bz2.BZ2File( Summaries_file, 'wb' ) )

Let us take a look at the first 3 retrieved summaries:

In [26]:
{ id : Summaries[id] for id in Ids[:3] }

{36220089: ('Pollinator cognition: Framing bee memories in\xa0an\xa0ecological context.',
  ['Giurfa M'],
  2022,
  '10.1016/j.cub.2022.08.043'),
 36219911: ('Exposure to fine particulate matter constituents and cognitive function performance, potential mediation by sleep quality: A multicenter study among Chinese adults aged 40-89\xa0years.',
  ['Pan R',
   'Zhang Y',
   'Xu Z',
   'Yi W',
   'Zhao F',
   'Song J',
   'Sun Q',
   'Du P',
   'Fang J',
   'Cheng J',
   'Liu Y',
   'Chen C',
   'Lu Y',
   'Li T',
   'Su H',
   'Shi X'],
  2022,
  '10.1016/j.envint.2022.107566'),
 36219905: ('Chronic cannabis use affects cerebellum dependent visuomotor adaptation.',
  ['Blithikioti C',
   'Miquel L',
   'Paniello B',
   'Nuño L',
   'Gual A',
   'Ballester BR',
   'Fernandez A',
   'Herreros I',
   'Verschure P',
   'Balcells-Olivero M'],
  2022,
  '10.1016/j.jpsychires.2022.10.007')}
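
Since each entry of `Summaries` is a plain tuple `(title, authors, year, DOI)`, simple statistics need nothing beyond the standard library. As an illustration, the sketch below counts papers per publication year; `sample_summaries` is a made-up stand-in with the same shape as `Summaries`, and `papers_per_year` is a hypothetical helper name:

```python
from collections import Counter

# Made-up stand-in for `Summaries`: {id: (title, authors, year, doi)}
sample_summaries = {
    1: ('Paper A', ['Author X'], 2021, ''),
    2: ('Paper B', ['Author Y', 'Author X'], 2022, ''),
    3: ('Paper C', ['Author Z'], 2022, ''),
}

def papers_per_year(summaries):
    """Count how many papers were published in each year."""
    return Counter(year for (_title, _authors, year, _doi) in summaries.values())

counts = papers_per_year(sample_summaries)
print(sorted(counts.items()))
```

Calling `papers_per_year(Summaries)` would apply the same count to the full dataset.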

## EFetch: Downloading full records from Entrez

`Entrez.efetch()` is the function that will allow us to obtain paper abstracts. Let us start by taking a look at the kind of data it returns when we query PubMed's database.

In [27]:
q = Entrez.read( Entrez.efetch(db="pubmed", id='34648188', retmode="xml") )['PubmedArticle']

`q` is a list, with each member corresponding to a queried id. Because here we only queried for one id, its results are then in `q[0]`.

In [28]:
type(q), len(q)

(list, 1)

At `q[0]` we find a dictionary containing two keys, the contents of which we print below.

In [29]:
type(q[0]), q[0].keys()

(Bio.Entrez.Parser.DictionaryElement,
 dict_keys(['MedlineCitation', 'PubmedData']))

In [30]:
print_dict( q[0][ 'PubmedData' ] )

ReferenceList
	 [{'Reference': [{'Citation': "Sadler AG, Booth BM, Mengeling MA, Doebbeling BN. Life span and repeated violence against women during military service: effects on health status and outpatient utilization. J Women's Health. 2004;13(7):799-811."}, {'Citation': 'Scherrer JF, Xian H, Kapp JMK, et\xa0al. Association between exposure to childhood and lifetime traumatic events and lifetime pathological gambling in a twin cohort. J Nerv Ment Dis. 2007;195(1):72-78.'}, {'Citation': 'Katon JG, Lehavot K, Simpson TL, et\xa0al. Adverse childhood experiences, military service, and adult health. Am J Prev Med. 2015;49(4):573-582.'}, {'Citation': 'Koenen KC, Stellman SD, Sommer Jr JF, Stellman JM. Persisting posttraumatic stress disorder symptoms and their relationship to functioning in Vietnam veterans: a 14-year follow-up. J Trauma Stress. 2008;21(1):49-57.'}, {'Citation': 'Kessler RC. Posttraumatic stress disorder: the burden to the individual and to society. J Clin Psychiatry. 2000

The key `'MedlineCitation'` maps into another dictionary. In that dictionary, most of the information is contained under the key `'Article'`. To minimize the clutter, below we show the contents of `'MedlineCitation'` excluding its `'Article'` member, and below that we then show the contents of `'Article'`.

In [31]:
print_dict( { k:v for k,v in q[0][ 'MedlineCitation' ].items() if k!='Article' } )

SpaceFlightMission
	 []
CitationSubset
	 ['IM']
GeneralNote
	 []
KeywordList
	 [ListElement([StringElement('access to care', attributes={'MajorTopicYN': 'N'}), StringElement('telemental health', attributes={'MajorTopicYN': 'N'}), StringElement('trauma', attributes={'MajorTopicYN': 'N'}), StringElement('veterans', attributes={'MajorTopicYN': 'N'}), StringElement('web-based treatment', attributes={'MajorTopicYN': 'N'})], attributes={'Owner': 'NOTNLM'})]
OtherAbstract
	 []
OtherID
	 []
PMID
	 34648188
DateCompleted
	 {'Year': '2022', 'Month': '09', 'Day': '23'}
DateRevised
	 {'Year': '2022', 'Month': '10', 'Day': '03'}
MedlineJournalInfo
	 {'Country': 'England', 'MedlineTA': 'J Rural Health', 'NlmUniqueID': '8508122', 'ISSNLinking': '0890-765X'}
MeshHeadingList
	 [{'QualifierName': [], 'DescriptorName': StringElement('Female', attributes={'UI': 'D005260', 'MajorTopicYN': 'N'})}, {'QualifierName': [], 'DescriptorName': StringElement('Health Services Accessibility', attributes={'UI': 'D0062

In [32]:
print_dict( q[0][ 'MedlineCitation' ][ 'Article' ] )

ELocationID
	 [StringElement('10.1111/jrh.12628', attributes={'EIdType': 'doi', 'ValidYN': 'Y'})]
Language
	 ['eng']
ArticleDate
	 [DictElement({'Year': '2021', 'Month': '10', 'Day': '14'}, attributes={'DateType': 'Electronic'})]
Journal
	 {'ISSN': StringElement('1748-0361', attributes={'IssnType': 'Electronic'}), 'JournalIssue': DictElement({'Volume': '38', 'Issue': '4', 'PubDate': {'Year': '2022', 'Month': '09'}}, attributes={'CitedMedium': 'Internet'}), 'Title': 'The Journal of rural health : official journal of the American Rural Health Association and the National Rural Health Care Association', 'ISOAbbreviation': 'J Rural Health'}
ArticleTitle
	 Increasing access to care for trauma-exposed rural veterans: A mixed methods outcome evaluation of a web-based skills training program with telehealth-delivered coaching.
Pagination
	 {'MedlinePgn': '740-747'}
Abstract
	 {'AbstractText': [StringElement('While rural veterans with trauma exposure report high rates of posttraumatic stress di

A paper's abstract can therefore be accessed with:

In [33]:
{ int(q[0]['MedlineCitation']['PMID']) : str(q[0]['MedlineCitation']['Article']['Abstract']['AbstractText'][0]) }

{34648188: 'While rural veterans with trauma exposure report high rates of posttraumatic stress disorder (PTSD), depression, and functional impairment, utilization of health services is low. This pilot study used mixed qualitative and quantitative methods to evaluate the potential benefits of a transdiagnostic web-based skills training program paired with telehealth-delivered coaching to address a range of symptoms and functional difficulties. The study directed substantial outreach efforts to women veterans who had experienced military sexual trauma given their growing representation in the Veterans Healthcare Administration (VHA) and identified need for services.'}

Some of the ids in our dataset refer to books from the [NCBI Bookshelf](http://www.ncbi.nlm.nih.gov/books/), a collection of freely available, downloadable, online versions of selected biomedical books. For such ids, `Entrez.efetch()` returns a slightly different structure, where the keys `['BookDocument', 'PubmedBookData']` take the place of the `['MedlineCitation', 'PubmedData']` keys we saw above.

### Abstracts dataset

We can now assemble a dataset mapping paper ids to their abstracts.

In [34]:
Abstracts_file = 'data/' + search_term + '_Abstracts.pkl.bz2'

In [37]:
import http.client
from collections import deque
from xml.dom import minidom
import re

def ch(node, childtype):
    return node.getElementsByTagName(childtype)[0]

if os.path.exists( Abstracts_file ):
    Abstracts = pickle.load( bz2.BZ2File( Abstracts_file, 'rb' ) )
else:
    # `Abstracts` will be incrementally assembled, by performing multiple queries,
    # each returning at most `retrieve_per_query` entries.
    Abstracts = deque()
    retrieve_per_query = 500
    
    print('Fetching Abstracts of results: ')
    for start in range( 0, len(Ids), retrieve_per_query ):
        if (start % 10000 == 0):
            print('')
            print(start, end='')
        else:
            print('.', end='')
        
        # build comma separated string with the ids at indexes [start, start+retrieve_per_query)
        query_ids = ','.join( [ str(id) for id in Ids[ start : start+retrieve_per_query ] ] )
        
        # issue requests to the server, until we get the full amount of data we expect
        while True:
            try:
                #s = Entrez.read( Entrez.efetch(db="pubmed", id=query_ids, retmode="xml" ) )['PubmedArticle']
                s = minidom.parse( Entrez.efetch(db="pubmed", id=query_ids, retmode="xml" ) ).getElementsByTagName("PubmedArticle")
            except http.client.IncompleteRead:
                print('r', end='')
                continue
            break
        
        i = 0
        for p in s:
            abstr = ''
            if (p.getElementsByTagName('MedlineCitation')):
                citNode = ch(p,'MedlineCitation')
                pmid = ch(citNode,'PMID').firstChild.data
                articleNode = ch(citNode,'Article')
                if (articleNode.getElementsByTagName('Abstract')):
                    try:
                        abstr = ch(ch(articleNode,'Abstract'),'AbstractText').firstChild.data
                    except AttributeError:
                        abstr = ch(ch(articleNode,'Abstract'),'AbstractText').toprettyxml("  ")
                        abstr = re.sub(r"\s+", " ", re.sub("<[^>]*>", "", abstr))
            elif (p.getElementsByTagName('BookDocument')):
                bookNode = ch(p,'BookDocument')
                pmid = ch(bookNode,'PMID').firstChild.data
                if (bookNode.getElementsByTagName('Abstract')):
                    try:
                        abstr = ch(ch(bookNode,'Abstract'),'AbstractText').firstChild.data
                    except AttributeError:
                        abstr = ch(ch(bookNode,'Abstract'),'AbstractText').toprettyxml("  ")
                        abstr = re.sub(r"\s+", " ", re.sub("<[^>]*>", "", abstr))
            else:
                raise Exception('Unrecognized record type, for id %d (child nodes: %s)' % (Ids[start+i], str([c.nodeName for c in p.childNodes])) )
            
            Abstracts.append( (int(pmid), str(abstr)) )
            i += 1
    
    # Save Abstracts, as a dictionary indexed by Ids
    Abstracts = dict( Abstracts )
    
    pickle.dump( Abstracts, bz2.BZ2File( Abstracts_file, 'wb' ) )

Fetching Abstracts of results: 

0...................
10000...................
20000...................
30000...................
40000...................
50000...................
60000...................
70000...................
80000...................
90000...................
100000....
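
The cell above keeps the first `AbstractText` element of each record. Abstracts that are split into labelled sections carry several such elements, and some contain nested markup such as `<i>`. The sketch below shows one possible way of collecting all of that text with `minidom`. It is not the method used to build the dataset above: `SAMPLE_XML` is a hand-written record that only mimics the shape of an EFetch response, and `node_text` and `full_abstract` are hypothetical helper names.

```python
from xml.dom import minidom

# Hand-written sample mimicking the shape of an EFetch PubMed record (not real data).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <Abstract>
          <AbstractText Label="BACKGROUND">Effects of <i>in vivo</i> imaging.</AbstractText>
          <AbstractText Label="RESULTS">Memory improved.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

def node_text(node):
    """Concatenate all text found under `node`, descending into nested tags."""
    if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        return node.data
    return ''.join(node_text(child) for child in node.childNodes)

def full_abstract(article_node):
    """Join every <AbstractText> section under `article_node`, normalizing whitespace."""
    sections = article_node.getElementsByTagName('AbstractText')
    return ' '.join(' '.join(node_text(s).split()) for s in sections)

doc = minidom.parseString(SAMPLE_XML)
article = doc.getElementsByTagName('Article')[0]
print(full_abstract(article))
```

Walking the text nodes directly avoids serializing the element and stripping tags with a regular expression.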

Taking a look at one paper's abstract:

In [41]:
Abstracts[36219802]

'While combination antiretroviral therapy (cART) has dramatically increased the life expectancy of people with HIV (PWH), nearly 50% develop HIV-associated neurocognitive disorders. This may be due to previously uncontrolled HIV viral replication, immune activation maintained by residual viral replication or activation from other sources, or cART-associated neurotoxicity. The aim of this study was to determine the effect of cART on cognition and neuroimaging biomarkers in PWH before and after initiation of cART compared with that in HIV-negative controls (HCs) and HIV elite controllers (ECs) who remain untreated.'

## ELink: Searching for related items in NCBI Entrez

To understand how to obtain paper citations with Entrez, we will first assemble a small set of PubMed IDs, and then query for their citations.
To that end, we search here for papers published in the journal Nature that contain our given keyword.

In [42]:
CA_search_term = search_term+'[TIAB] AND Nature[JOUR]'
CA_ids = Entrez.read( Entrez.esearch( db="pubmed", term=CA_search_term ) )['IdList']
CA_ids

['35896749', '35589843', '35545674', '35478240', '35355009', '35236988', '34937941', '34616074', '34616064', '34614503', '34599306', '34153974', '33790467', '33505022', '33473210', '33361808', '32999460', '32699411', '31915375', '31801999']

In [43]:
CA_summ = {
    p['Id'] : ( p['Title'], p['AuthorList'], p['PubDate'][:4], p['FullJournalName'], p.get('DOI', '') )
    for p in Entrez.read( Entrez.esummary(db="pubmed", id=','.join( CA_ids )) )
    }
CA_summ

{'35896749': ('Cortical feedback loops bind distributed representations of working memory.',
  ['Voitov I', 'Mrsic-Flogel TD'],
  '2022',
  'Nature',
  '10.1038/s41586-022-05014-3'),
 '35589843': ('People construct simplified mental representations to plan.',
  ['Ho MK', 'Abel D', 'Correa CG', 'Littman ML', 'Cohen JD', 'Griffiths TL'],
  '2022',
  'Nature',
  '10.1038/s41586-022-04743-9'),
 '35545674': ('Young CSF restores oligodendrogenesis and memory in aged mice via Fgf17.',
  ['Iram T', 'Kern F', 'Kaur A', 'Myneni S', 'Morningstar AR', 'Shin H', 'Garcia MA', 'Yerra L', 'Palovics R', 'Yang AC', 'Hahn O', 'Lu N', 'Shuken SR', 'Haney MS', 'Lehallier B', 'Iyer M', 'Luo J', 'Zetterberg H', 'Keller A', 'Zuchero JB', 'Wyss-Coray T'],
  '2022',
  'Nature',
  '10.1038/s41586-022-04722-0'),
 '35478240': ('Computer-designed repurposing of chemical wastes into drugs.',
  ['Wołos A', 'Koszelewski D', 'Roszak R', 'Szymkuć S', 'Moskal M', 'Ostaszewski R', 'Herrera BT', 'Maier JM', 'Brezicki G', '

You can follow these DOIs to the papers' pages on the journal's website. Note that Nature is a subscription journal, so not every PDF will be freely accessible.

We will now issue calls to `Entrez.elink()` using these PubMed IDs, to retrieve the IDs of papers that cite them.
The database from which the IDs will be retrieved is [PubMed Central](http://www.ncbi.nlm.nih.gov/pmc/), a free digital database of full-text scientific literature in the biomedical and life sciences.

A complete list of the kinds of links you can retrieve with `Entrez.elink()` can be found [here](http://eutils.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html).

In [49]:
CA_citing = {
    id : Entrez.read( Entrez.elink(
            cmd = "neighbor",               # ELink command mode: "neighbor", returns
                                            #     a set of UIDs in `db` linked to the input UIDs in `dbfrom`.
            dbfrom = "pubmed",              # Database containing the input UIDs: PubMed
            db = "pmc",                     # Database from which to retrieve UIDs: PubMed Central
            LinkName = "pubmed_pmc_refs",   # Name of the Entrez link to retrieve: "pubmed_pmc_refs", gets
                                            #     "Full-text articles in the PubMed Central Database that cite the current articles"
            from_uid = id                   # input UIDs
            ) )
    for id in CA_ids
    }

In [54]:
CA_citing['31801999']

[{'LinkSetDbHistory': [], 'ERROR': [], 'LinkSetDb': [{'Link': [{'Id': '9534763'}, {'Id': '9380260'}, {'Id': '9329517'}, {'Id': '9284159'}, {'Id': '9282170'}, {'Id': '9262333'}, {'Id': '9087306'}, {'Id': '9045815'}, {'Id': '9022003'}, {'Id': '9010663'}, {'Id': '8967821'}, {'Id': '8849135'}, {'Id': '8809184'}, {'Id': '8770604'}, {'Id': '8741155'}, {'Id': '8616578'}, {'Id': '8599015'}, {'Id': '8491944'}, {'Id': '8484055'}, {'Id': '8356785'}, {'Id': '8324299'}, {'Id': '8232037'}, {'Id': '8171116'}, {'Id': '8115883'}, {'Id': '8104958'}, {'Id': '8083070'}, {'Id': '8082085'}, {'Id': '8052096'}, {'Id': '7984585'}, {'Id': '7939081'}, {'Id': '7935083'}, {'Id': '7928425'}, {'Id': '7906913'}, {'Id': '7886003'}, {'Id': '7853322'}, {'Id': '7771962'}, {'Id': '7610550'}, {'Id': '7393175'}, {'Id': '7351792'}, {'Id': '7239658'}, {'Id': '7213979'}, {'Id': '7062473'}, {'Id': '7056889'}], 'DbTo': 'pmc', 'LinkName': 'pubmed_pmc_refs'}], 'DbFrom': 'pubmed', 'IdList': ['31801999']}]

We have in `CA_citing[paper_id][0]['LinkSetDb'][0]['Link']` the list of papers citing `paper_id`. To get it as just a list of ids, we can do:

In [55]:
cits = [ l['Id'] for l in CA_citing['31801999'][0]['LinkSetDb'][0]['Link'] ]
cits

['9534763',
 '9380260',
 '9329517',
 '9284159',
 '9282170',
 '9262333',
 '9087306',
 '9045815',
 '9022003',
 '9010663',
 '8967821',
 '8849135',
 '8809184',
 '8770604',
 '8741155',
 '8616578',
 '8599015',
 '8491944',
 '8484055',
 '8356785',
 '8324299',
 '8232037',
 '8171116',
 '8115883',
 '8104958',
 '8083070',
 '8082085',
 '8052096',
 '7984585',
 '7939081',
 '7935083',
 '7928425',
 '7906913',
 '7886003',
 '7853322',
 '7771962',
 '7610550',
 '7393175',
 '7351792',
 '7239658',
 '7213979',
 '7062473',
 '7056889']
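
Indexing with `[0]['LinkSetDb'][0]['Link']` works for the paper above, but `'LinkSetDb'` is an empty list for papers without any citing articles (a case the Citations dataset code further below handles explicitly). A defensive version of the extraction could look like the sketch below; the helper name `linked_ids` is hypothetical, and the two inputs are hand-written structures in the shape shown above, not real query results.

```python
def linked_ids(elink_result, link_name):
    """Return the IDs found under `link_name` in the first link set, or [] if there are none."""
    if not elink_result:
        return []
    for link_set_db in elink_result[0].get('LinkSetDb', []):
        if link_set_db.get('LinkName') == link_name:
            return [link['Id'] for link in link_set_db['Link']]
    return []

# Hand-written examples (not real query results).
with_links = [{'LinkSetDb': [{'Link': [{'Id': '111'}, {'Id': '222'}],
                              'DbTo': 'pmc', 'LinkName': 'pubmed_pmc_refs'}],
               'DbFrom': 'pubmed', 'IdList': ['1']}]
without_links = [{'LinkSetDb': [], 'DbFrom': 'pubmed', 'IdList': ['2']}]

print(linked_ids(with_links, 'pubmed_pmc_refs'))
print(linked_ids(without_links, 'pubmed_pmc_refs'))
```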

However, one more step is needed, as what we have now are PubMed Central IDs, and not PubMed IDs. Their conversion can be achieved through an additional call to `Entrez.elink()`:

In [56]:
cits_pm = Entrez.read( Entrez.elink( dbfrom="pmc", db="pubmed", LinkName="pmc_pubmed", from_uid=",".join(cits)) )
cits_pm

[{'LinkSetDbHistory': [], 'ERROR': [], 'LinkSetDb': [{'Link': [{'Id': '36171427'}, {'Id': '35911570'}, {'Id': '35835556'}, {'Id': '35464318'}, {'Id': '35450225'}, {'Id': '35416775'}, {'Id': '35354821'}, {'Id': '34764156'}, {'Id': '34707289'}, {'Id': '34693908'}, {'Id': '34678148'}, {'Id': '34654556'}, {'Id': '34555012'}, {'Id': '34328419'}, {'Id': '34060878'}, {'Id': '34060872'}, {'Id': '33981970'}, {'Id': '33876728'}, {'Id': '33869678'}, {'Id': '33785156'}, {'Id': '33681428'}, {'Id': '33531416'}, {'Id': '33446521'}, {'Id': '33441432'}, {'Id': '33361819'}, {'Id': '33345774'}, {'Id': '33338423'}, {'Id': '33227602'}, {'Id': '32965146'}, {'Id': '32827456'}, {'Id': '32733065'}, {'Id': '32651373'}, {'Id': '32504542'}, {'Id': '32431293'}, {'Id': '32286227'}, {'Id': '32231315'}, {'Id': '32174812'}, {'Id': '32096761'}], 'DbTo': 'pubmed', 'LinkName': 'pmc_pubmed'}], 'DbFrom': 'pmc', 'IdList': ['9534763', '9380260', '9329517', '9284159', '9282170', '9262333', '9087306', '9045815', '9022003', '90

Note that the 43 PubMed Central IDs we sent yielded only 38 PubMed IDs, all returned together in a single link set. Pairing the two lists by position, as done in the next cell, is therefore only illustrative: `zip()` stops at the shorter list, and nothing guarantees that the ID at a given position in `'IdList'` corresponds to the link at the same position.

In [57]:
ids_map = { pmc_id : link['Id'] for (pmc_id,link) in zip(cits_pm[0]['IdList'], cits_pm[0]['LinkSetDb'][0]['Link']) }
ids_map

{'9534763': '36171427',
 '9380260': '35911570',
 '9329517': '35835556',
 '9284159': '35464318',
 '9282170': '35450225',
 '9262333': '35416775',
 '9087306': '35354821',
 '9045815': '34764156',
 '9022003': '34707289',
 '9010663': '34693908',
 '8967821': '34678148',
 '8849135': '34654556',
 '8809184': '34555012',
 '8770604': '34328419',
 '8741155': '34060878',
 '8616578': '34060872',
 '8599015': '33981970',
 '8491944': '33876728',
 '8484055': '33869678',
 '8356785': '33785156',
 '8324299': '33681428',
 '8232037': '33531416',
 '8171116': '33446521',
 '8115883': '33441432',
 '8104958': '33361819',
 '8083070': '33345774',
 '8082085': '33338423',
 '8052096': '33227602',
 '7984585': '32965146',
 '7939081': '32827456',
 '7935083': '32733065',
 '7928425': '32651373',
 '7906913': '32504542',
 '7886003': '32431293',
 '7853322': '32286227',
 '7771962': '32231315',
 '7610550': '32174812',
 '7393175': '32096761'}
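
A mapping from PubMed Central IDs to PubMed IDs is only reliable when every input ID comes with its own link set. The E-utilities documentation describes repeating the `id` parameter once per UID to obtain that behaviour; how to issue such a request through Biopython is not shown here and should be checked against the `Bio.Entrez.elink` documentation. The sketch below only covers the second half of the job, under the assumption that the response already contains one link set per input ID. The helper name `pmc_to_pubmed` and the sample data are hypothetical.

```python
def pmc_to_pubmed(link_sets):
    """Build {pmc_id: pubmed_id or None} from link sets that each describe ONE input ID."""
    mapping = {}
    for link_set in link_sets:
        (pmc_id,) = link_set['IdList']      # fails loudly if a link set mixes several IDs
        dbs = link_set['LinkSetDb']
        mapping[pmc_id] = dbs[0]['Link'][0]['Id'] if dbs else None
    return mapping

# Hand-written sample (not a real response): one link set per input ID.
sample = [
    {'IdList': ['900'], 'LinkSetDb': [{'Link': [{'Id': '3000'}],
                                       'DbTo': 'pubmed', 'LinkName': 'pmc_pubmed'}]},
    {'IdList': ['901'], 'LinkSetDb': []},
]

print(pmc_to_pubmed(sample))
```

Inputs without a matching PubMed record are kept in the mapping with the value `None`, which makes the missing conversions visible instead of silently shifting the remaining pairs.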

And to check these papers:

In [58]:
{   p['Id'] : ( p['Title'], p['AuthorList'], p['PubDate'][:4], p['FullJournalName'], p.get('DOI', '') )
    for p in Entrez.read( Entrez.esummary(db="pubmed", id=','.join( ids_map.values() )) )
    }

{'36171427': ('Thalamus-driven functional populations in frontal cortex support decision-making.',
  ['Yang W', 'Tipparaju SL', 'Chen G', 'Li N'],
  '2022',
  'Nature neuroscience',
  '10.1038/s41593-022-01171-w'),
 '35911570': ('Are Grid-Like Representations a Component of All Perception and Cognition?',
  ['Chen ZS', 'Zhang X', 'Long X', 'Zhang SJ'],
  '2022',
  'Frontiers in neural circuits',
  '10.3389/fncir.2022.924016'),
 '35835556': ('Navigating the Statistical Minefield of Model Selection and Clustering in Neuroscience.',
  ['Király B', 'Hangya B'],
  '2022',
  'eNeuro',
  '10.1523/ENEURO.0066-22.2022'),
 '35464318': ('Memristive LIF Spiking Neuron Model and Its Application in Morse Code.',
  ['Fang X', 'Liu D', 'Duan S', 'Wang L'],
  '2022',
  'Frontiers in neuroscience',
  '10.3389/fnins.2022.853010'),
 '35450225': ('Review on data analysis methods for mesoscale neural imaging <i>in vivo</i>.',
  ['Cai Y', 'Wu J', 'Dai Q'],
  '2022',
  'Neurophotonics',
  '10.1117/1.NPh.9.4.0

### Citations dataset

We have now seen all the steps required to assemble a dataset of citations to each of the papers in our dataset.

In [59]:
Citations_file = 'data/' + search_term + '_Citations.pkl.bz2'
Citations = []

At least one server query will be issued per paper in `Ids`. Because NCBI allows at most 3 queries per second (see [here](http://biopython.org/DIST/docs/api/Bio.Entrez-pysrc.html#_open)), this dataset will take a long time to assemble. Should you need to interrupt it for some reason, or should the connection fail at some point, it is safe to just rerun the cell below until all data is collected.

In [61]:
import http.client
from urllib.error import HTTPError     # needed by the `except HTTPError` clause below

if Citations == [] and os.path.exists( Citations_file ):
    Citations = pickle.load( bz2.BZ2File( Citations_file, 'rb' ) )

if len(Citations) < len(Ids):
    
    i = len(Citations)
    checkpoint = int(len(Ids) / 100) + 1      # save to hard drive at every 1% of Ids fetched
    
    for pm_id in Ids[i:]:               # either starts from index 0, or resumes from where we previously left off
        
        while True:
            try:
                # query for papers archived in PubMed Central that cite the paper with PubMed ID `pm_id`
                c = Entrez.read( Entrez.elink( dbfrom = "pubmed", db="pmc", LinkName = "pubmed_pmc_refs", id=str(pm_id) ) )
                
                c = c[0]['LinkSetDb']
                if len(c) == 0:
                    # no citations found for the current paper
                    c = []
                else:
                    c = [ l['Id'] for l in c[0]['Link'] ]
                    
                    # convert citations from PubMed Central IDs to PubMed IDs
                    p = []
                    retrieve_per_query = 500
                    for start in range( 0, len(c), retrieve_per_query ):
                        query_ids = ','.join( c[start : start+retrieve_per_query] )
                        r = Entrez.read( Entrez.elink( dbfrom="pmc", db="pubmed", LinkName="pmc_pubmed", from_uid=query_ids ) )
                        # select the IDs. If no matching PubMed ID was found, [] is returned instead
                        p.extend( [] if r[0]['LinkSetDb']==[] else [ int(link['Id']) for link in r[0]['LinkSetDb'][0]['Link'] ] )
                    c = p
            
            except http.client.BadStatusLine:
                # Presumably, the server closed the connection before sending a valid response. Retry until we have the data.
                print('r')
                continue
            except HTTPError:
                print('r')
                continue
            break
        
        Citations.append( (pm_id, c) )
        if (i % 10000 == 0):
            print('')
            print(i, end='')
        if (i % 100 == 0):
            print('.', end='')
        i += 1
        
        if i % checkpoint == 0:
            print('\tsaving at checkpoint', i)
            pickle.dump( Citations, bz2.BZ2File( Citations_file, 'wb' ) )
    
    print('\n done.')
    
    # Save Citations, as a dictionary indexed by Ids
    Citations = dict( Citations )
    
    pickle.dump( Citations, bz2.BZ2File( Citations_file, 'wb' ) )
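
The cell above retries a failed query immediately and without limit. A gentler variant, sketched below, caps the number of attempts and waits a little longer after each failure. This is an illustrative alternative, not the approach used to build the dataset: the helper name `call_with_retries` and its parameters are hypothetical, and the failing request is simulated so that the example runs without network access.

```python
import http.client
import time

def call_with_retries(request, max_attempts=5, base_delay=0.5,
                      retry_on=(OSError, http.client.HTTPException)):
    """Call `request()` until it succeeds, waiting longer after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except retry_on:
            if attempt == max_attempts:
                raise                                    # give up: let the caller see the error
            time.sleep(base_delay * 2 ** (attempt - 1))  # wait doubles after every failure

# Demonstration with a stand-in request that fails twice before succeeding.
attempts_seen = []

def flaky_request():
    attempts_seen.append(1)
    if len(attempts_seen) < 3:
        raise OSError('simulated connection problem')
    return 'ok'

result = call_with_retries(flaky_request, base_delay=0)
print(result, len(attempts_seen))
```

The default `retry_on` tuple covers `urllib.error.HTTPError` (a subclass of `OSError`) as well as `http.client.BadStatusLine` and `http.client.IncompleteRead` (both subclasses of `http.client.HTTPException`), which are the exceptions handled in this notebook.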

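To see that we have indeed obtained the data we expected, you can look up a PubMed ID in `Citations` and match the returned ids with the ids listed at the end of the last section. Note that the lookup raises a `KeyError` when the requested ID is not among the keys of `Citations`, as happens below: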

In [62]:
Citations[13907952]

KeyError: 13907952

## Where do we go from here?

Running the code above generates multiple local files, containing the datasets we'll be working with. Loading them into memory is a matter of just issuing a call like<br>
``data = pickle.load( bz2.BZ2File( data_file, 'rb' ) )``.

The Entrez module will therefore no longer be needed, unless you wish to extend your data processing with additional information retrieved from PubMed.

Should you be interested in looking at alternative ways to handle the data, have a look at the [sqlite3](http://docs.python.org/3/library/sqlite3.html) module included in Python's standard library, or [Pandas](http://pandas.pydata.org/), the Python Data Analysis Library.