**QUESTION 1 - Entrez API**

**Tahir Manuel D Mello  
BIS634 Assignment 3**

In [1]:
import requests
import json
from urllib.request import urlopen
import xmltodict
import time
import xml.etree.ElementTree as ET

Use the requests module (or urllib) to use the Entrez API to identify the PubMed IDs for 1000 Alzheimer's papers from 2022 and for 1000 cancer papers from 2022. **(9 points)**

In [2]:
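def pubmed_ids(term, nresults):
    """Run an ESearch query for `term` limited to 2022 and return the parsed JSON response."""

    db = 'pubmed'
    domain = 'https://www.ncbi.nlm.nih.gov/entrez/eutils'
    queryterm = term + "+AND+2022[pdat]"  # restrict to publication date 2022
    retmode = 'json'

    query = f'{domain}/esearch.fcgi?db={db}&retmax={nresults}&retmode={retmode}&term={queryterm}'

    response = requests.get(query)
    time.sleep(1)  # pause between calls to avoid hammering the NCBI servers

    # The full response is returned; the IDs are under ['esearchresult']['idlist']
    return response.json()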
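
A minimal sketch of the same ESearch request built with `urllib.parse.urlencode`, which escapes the query term, plus a helper that pulls out only the ID list. The names `build_esearch_url` and `extract_ids` are hypothetical and not part of the assignment code; the example runs offline against a hand-written dictionary in the same shape as the ESearch output.

```python
from urllib.parse import urlencode

EUTILS = "https://www.ncbi.nlm.nih.gov/entrez/eutils"  # same base URL as in pubmed_ids


def build_esearch_url(term, nresults, year=2022):
    """Return an ESearch URL for `term` restricted to one publication year."""
    params = {
        "db": "pubmed",
        "term": f"{term} AND {year}[pdat]",
        "retmax": nresults,
        "retmode": "json",
    }
    # urlencode escapes spaces and brackets so the term survives as one parameter
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def extract_ids(esearch_json):
    """Return just the list of PubMed IDs from a parsed ESearch JSON response."""
    return esearch_json["esearchresult"]["idlist"]


# Offline illustration: a hand-written response, not real API output
sample = {"esearchresult": {"count": "2", "idlist": ["36328129", "36327964"]}}
url = build_esearch_url("Alzheimers", 1000)
ids = extract_ids(sample)
```

Building the URL this way keeps the search term and its `[pdat]` filter together even if the term contains spaces or punctuation.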

In [3]:
alzheimers_ids = pubmed_ids("Alzheimers", 1000)
alzheimers_ids

{'header': {'type': 'esearch', 'version': '0.3'},
 'esearchresult': {'count': '8658',
  'retmax': '1000',
  'retstart': '0',
  'idlist': ['36328129',
   '36327964',
   '36327171',
   '36326951',
   '36326588',
   '36326095',
   '36325883',
   '36325840',
   '36325692',
   '36325483',
   '36324417',
   '36324414',
   '36324408',
   '36324405',
   '36324401',
   '36324176',
   '36324157',
   '36324151',
   '36323521',
   '36323061',
   '36322888',
   '36322800',
   '36322495',
   '36322470',
   '36321981',
   '36321927',
   '36321882',
   '36321654',
   '36321615',
   '36321363',
   '36321205',
   '36321194',
   '36320609',
   '36320346',
   '36319674',
   '36319270',
   '36319269',
   '36319136',
   '36319095',
   '36319045',
   '36318754',
   '36318594',
   '36318545',
   '36318372',
   '36317468',
   '36317413',
   '36316970',
   '36316783',
   '36316708',
   '36316501',
   '36316487',
   '36316461',
   '36316282',
   '36316035',
   '36315527',
   '36315115',
   '36314730',
   '363145

In [4]:
cancer_ids = pubmed_ids("cancer", 1000)
cancer_ids

{'header': {'type': 'esearch', 'version': '0.3'},
 'esearchresult': {'count': '224378',
  'retmax': '1000',
  'retstart': '0',
  'idlist': ['36328436',
   '36328434',
   '36328397',
   '36328388',
   '36328380',
   '36328379',
   '36328378',
   '36328377',
   '36328376',
   '36328311',
   '36328306',
   '36328301',
   '36328277',
   '36328268',
   '36328253',
   '36328249',
   '36328248',
   '36328247',
   '36328244',
   '36328234',
   '36328214',
   '36328207',
   '36328204',
   '36328199',
   '36328194',
   '36328181',
   '36328169',
   '36328162',
   '36328159',
   '36328158',
   '36328157',
   '36328151',
   '36328147',
   '36328146',
   '36328145',
   '36328144',
   '36328127',
   '36328121',
   '36328118',
   '36328117',
   '36328110',
   '36328100',
   '36328092',
   '36328090',
   '36328089',
   '36328079',
   '36328078',
   '36328076',
   '36328074',
   '36328042',
   '36328040',
   '36328038',
   '36328036',
   '36328034',
   '36328033',
   '36328025',
   '36328024',
   '3632

Use the Entrez API via requests/urllib to pull the metadata for each such paper found above (both cancer and Alzheimer's) and save a JSON file storing each paper's title, abstract, and the query that found it. **(12 points)**

In [5]:
def pubmed_metadata(term, pubmed_ids, nresults):
    
    db = 'pubmed'
    domain = 'https://www.ncbi.nlm.nih.gov/entrez/eutils'
    rettype = 'abstract'
    retmode='xml'

    results = {}
    
    for r in range(0, nresults, 100):
        
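        paperId = pubmed_ids["esearchresult"]["idlist"][r:r + 100]

        # E-utilities expects a comma-separated ID string, not the repr of a Python list
        id_string = ",".join(paperId)
        query = f'{domain}/efetch.fcgi?db={db}&id={id_string}&rettype={rettype}&retmode={retmode}'

        response = requests.post(query)
        time.sleep(1)  # pause between batches of 100 IDs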

        filename = 'metadata' + term + '.xml'
        
        with open(filename, 'wb') as f:
            f.write(response.content)

        metadata = ET.parse(filename)
        root = metadata.getroot()

        # Records are assumed to come back in the same order as the requested IDs
        for count, paper_id in enumerate(paperId):

            titles = []

            for title in root[count].iter('ArticleTitle'):
                titles.append(ET.tostring(title, method="text").decode())

            abstracts = []

            # Keep each <Abstract> element as raw XML so labels and inline tags survive
            for item in root[count].iter('Abstract'):
                abstracts.append(ET.tostringlist(item, method="xml"))

            results[paper_id] = {'ArticleTitle': titles[0],
                                 'AbstractText': abstracts,
                                 'query': term}

    return results
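
A sketch of an alternative parsing step that reads the EFetch XML in memory and keys each record by the PMID stored inside it, so it does not depend on records arriving in request order. The function name `parse_articles` and the sample XML are hypothetical; the sample only mimics the `PubmedArticleSet` layout and is not real API output.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for an EFetch response (assumed layout, two records)
SAMPLE_XML = b"""<PubmedArticleSet>
  <PubmedArticle><MedlineCitation>
    <PMID Version="1">111</PMID>
    <Article>
      <ArticleTitle>First <i>sample</i> title.</ArticleTitle>
      <Abstract><AbstractText Label="AIMS">Part one.</AbstractText>
      <AbstractText Label="METHODS">Part two.</AbstractText></Abstract>
    </Article>
  </MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation>
    <PMID Version="1">222</PMID>
    <Article><ArticleTitle>No abstract here.</ArticleTitle></Article>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""


def parse_articles(xml_bytes, term):
    """Map each PMID in an EFetch XML payload to its title, abstract parts, and query."""
    root = ET.fromstring(xml_bytes)  # parse in memory, no temporary file
    results = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        title = article.find(".//ArticleTitle")
        # itertext() flattens inline tags such as <i> or <sup> into plain text
        parts = ["".join(p.itertext()) for p in article.iter("AbstractText")]
        results[pmid] = {
            "ArticleTitle": "".join(title.itertext()) if title is not None else "",
            "AbstractText": parts,
            "query": term,
        }
    return results


parsed = parse_articles(SAMPLE_XML, "cancer")
```

Unlike the cell above, this variant stores abstract parts as plain strings rather than raw XML bytes, so it trades the retained markup for output that is directly JSON-serializable.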
    


In [6]:
alzheimer_metadata =  pubmed_metadata("Alzheimers", alzheimers_ids, 1000)
alzheimer_metadata

{'36328129': {'ArticleTitle': "Transcranial Deep-tissue Phototherapy for Alzheimer's Disease using Low-Dose X-ray-Activated Long-Afterglow Scintillators.",
  'AbstractText': [[b"<Abstract><AbstractText>Non-invasive phototherapy has been emerging as an ambitious tactic for suppression of amyloid-&#946; (A&#946;) self-assembly against Alzheimer's disease (AD). However, it remains a daunting challenge to develop efficient photosensitizers for A&#946; oxygenation that are activatable in a deep brain tissue through the scalp and skull, while reducing side effects on normal tissues. Here, we report an A&#946; targeted, low-dose X-ray-excitable long-afterglow scintillator (ScNPs@RB/Ab) for efficient deep-brain phototherapy. We demonstrate that the as-synthesized ScNPs@RB/Ab is capable of converting X-rays into visible light to activate the photosensitizers of rose bengal (RB) for A&#946; oxygenation through the scalp and skull. We show that the ScNPs@RB/Ab persistently emitting visible lumine

In [7]:
len(alzheimer_metadata)

1000

In [8]:
cancer_metadata =  pubmed_metadata("cancer", cancer_ids, 1000)
cancer_metadata

{'36328436': {'ArticleTitle': 'Use of the ISUP e-learning module improves interrater reliability in prostate cancer grading.',
  'AbstractText': [[b'<Abstract><AbstractText Label="AIMS" NlmCategory="OBJECTIVE">Prostate cancer (PCa) grading is an important prognostic parameter, but is subject to considerable observer variation. Previous studies have shown that interobserver variability decreases after participants were trained using an e-learning module. However, since the publication of these studies, grading of PCa has been enhanced by adopting the International Society of Urological Pathology (ISUP) 2014 grading classification. This study investigates the effect of training on interobserver variability of PCa grading, using the ISUP Education web e-learning on Gleason grading.</AbstractText><AbstractText Label="METHODS" NlmCategory="METHODS">The ISUP Education Prostate Test B Module was distributed among Dutch pathologists. The module uses images graded by the ISUP consensus panel co

In [9]:
len(cancer_metadata)

1000

In [10]:
import pickle
# Also saved as a pkl binary file for convenience
with open("dictionary_Alzheimers.pkl", "wb") as f1:
    pickle.dump(alzheimer_metadata, f1)

with open("dictionary_cancer.pkl", "wb") as f2:
    pickle.dump(cancer_metadata, f2)

In [11]:
#I am using pkl files for convenience in this assignment, but since the question specifically asks for JSON output,
#this is how it can be done. The "AbstractText" needs to be converted from bytes to string.
#I originally used this section in Q2 to preprocess my pkl dictionary, but have implemented it here too so that I could
#do a JSON save and (hopefully) not lose any silly marks.

alzheimers_paper_id = alzheimers_ids["esearchresult"]["idlist"]
alzheimers_dict = alzheimer_metadata  # alias, not a copy: alzheimer_metadata is modified in place

for id_list in alzheimers_paper_id:
    
    if alzheimers_dict[str(id_list)]['AbstractText'] == [] or len(alzheimers_dict[str(id_list)]['AbstractText'][0][0]) < 30:
        alzheimers_dict[id_list]['AbstractText'] = ""
        continue
         
    item = alzheimers_dict[str(id_list)]['AbstractText'][0][0] #Choose saved xml format string
    tree = ET.ElementTree(ET.fromstring(item))
    root = tree.getroot()
    
    # Iterate over a copy: removing children while looping over root itself can skip elements
    for child in list(root):
        if child.tag == 'CopyrightInformation':
            root.remove(child)
    
    alzheimers_dict[id_list]['AbstractText'] = ET.tostring(root, method="text").decode()
    
    
cancer_paper_id = cancer_ids["esearchresult"]["idlist"]
cancer_dict = cancer_metadata  # alias, not a copy: cancer_metadata is modified in place

for id_list in cancer_paper_id:
    
    if cancer_dict[str(id_list)]['AbstractText'] == [] or len(cancer_dict[str(id_list)]['AbstractText'][0][0]) < 30:
        cancer_dict[id_list]['AbstractText'] = ""
        continue
         
    item = cancer_dict[str(id_list)]['AbstractText'][0][0] #Choose saved xml format string
    tree = ET.ElementTree(ET.fromstring(item))
    root = tree.getroot()
    
    # Iterate over a copy: removing children while looping over root itself can skip elements
    for child in list(root):
        if child.tag == 'CopyrightInformation':
            root.remove(child)
    
    cancer_dict[id_list]['AbstractText'] = ET.tostring(root, method="text").decode()
    
    
    
    
with open("dictionary_Alzheimers.json", "w") as outfile1:
    json.dump(alzheimers_dict, outfile1)
    
with open("dictionary_cancer.json", "w") as outfile2:
    json.dump(cancer_dict, outfile2)
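
The conversion above is needed because the `json` module cannot serialize `bytes`. A small sketch of that behaviour, using a made-up record rather than real metadata:

```python
import json

# Made-up record in the same shape as one metadata entry
record = {
    "ArticleTitle": "Example title",
    "AbstractText": b"<Abstract>text</Abstract>",
    "query": "cancer",
}

try:
    json.dumps(record)
    bytes_ok = True
except TypeError:
    bytes_ok = False  # json refuses bytes values

# Decoding to str first makes the record serializable, and it survives a round trip
record["AbstractText"] = record["AbstractText"].decode("utf-8")
restored = json.loads(json.dumps(record))
```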

There are of course many more papers of each category, but is there any overlap in the two sets of papers that you identified? **(3 points)**

Yes. Two papers appear in both sets; the cell below prints their IDs and metadata.

In [12]:
for key in alzheimer_metadata:
    if key in cancer_metadata:
        print(key, cancer_metadata[key])
        print(key, alzheimer_metadata[key])
        print("\n")

36321615 {'ArticleTitle': 'Association of spermidine plasma levels with brain aging in a population-based study.', 'AbstractText': "Supplementation with spermidine may support healthy aging, but elevated spermidine tissue levels were shown to be an indicator of Alzheimer's disease (AD).Data from 659 participants (age range: 21-81 years) of the population-based Study of Health in Pomerania TREND were included. We investigated the association between spermidine plasma levels and markers of brain aging (hippocampal volume, AD score, global cortical thickness [CT], and white matter hyperintensities [WMH]).Higher spermidine levels were significantly associated with lower hippocampal volume (&#223;&#160;=&#160;-0.076; 95% confidence interval [CI]: -0.13 to -0.02; q =&#160;0.026), higher AD score (&#223;&#160;=&#160;0.118; 95% CI: 0.05 to 0.19; q =&#160;0.006), lower global CT (&#223;&#160;=&#160;-0.104; 95% CI: -0.17 to -0.04; q =&#160;0.014), but not WMH volume. Sensitivity analysis reveale
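
The same overlap check can be sketched as a set intersection on the dictionary keys. The two small dictionaries here are illustrative stand-ins for `alzheimer_metadata` and `cancer_metadata`, not the full results.

```python
# Illustrative stand-ins: keys are PubMed IDs, values are the query that found them
alz = {"36328129": "Alzheimers", "36321615": "Alzheimers"}
can = {"36328436": "cancer", "36321615": "cancer"}

# dict.keys() views support set operations, so & gives the shared IDs
overlap = sorted(alz.keys() & can.keys())
```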

Be sure to store all parts (of AbstractText). You could do this in many ways, from using a dictionary or a list or simply concatenating with a space in between. Discuss any pros or cons of your choice in your readme. **(1 point)**

I have saved my AbstractText as a list of XML byte strings (the output of ET.tostringlist).  

Pros:  
This retains all the labels and label names, as well as all the text that they contain.  
It also handles inline tags like \<sup> and \<sub> by leaving them in place to be processed later.  
It can easily be parsed back into XML and manipulated later.
    
Cons:  
It saves the text as a 'bytes' object instead of a 'string' object. This is inconvenient if you want to save it as JSON, but it can still be saved as a .pkl file.  
It captures unnecessary information like the 'CopyrightInformation' element, but this is easily removed during processing (as has been done in Q2) because the abstract is still in XML format.
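
For comparison, here is a sketch of two of the other options the question mentions: a dictionary keyed by section label, and a single string joined with spaces. The XML snippet is hand-written for illustration, and the fallback key `part{i}` for unlabeled sections is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Hand-written abstract in the same shape as the stored XML
abstract_xml = (
    b'<Abstract>'
    b'<AbstractText Label="AIMS">Grading varies.</AbstractText>'
    b'<AbstractText Label="METHODS">A module was used.</AbstractText>'
    b'<CopyrightInformation>(c) 2022</CopyrightInformation>'
    b'</Abstract>'
)

root = ET.fromstring(abstract_xml)

# Option 1: dictionary of label -> text (only AbstractText parts, so copyright is skipped)
parts = {}
for i, part in enumerate(root.iter("AbstractText")):
    label = part.get("Label", f"part{i}")  # hypothetical fallback for unlabeled parts
    parts[label] = "".join(part.itertext()).strip()

# Option 2: one string, with a space between parts so sentences do not run together
joined = " ".join(parts.values())
```

The dictionary keeps section labels but would overwrite a part if two sections shared a label; the joined string is the simplest to feed into text processing but loses the section structure.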