# Intro to Python 
## Lesson 6a – API Practice (PubMed)

**Date:** Aug 6, 2021 <br>
**Programmer:** Rahim Hashim <br>
**Goal:** Practice working with web APIs, using PubMed's NCBI Entrez API as the example

***

### Import Libraries
__Public List:__ os, re, sys, string, datetime, pprint, pandas, numpy, tqdm, nltk, collections, urllib, bs4, unidecode, matplotlib<br>
__Additional Code:__ Regions


All of the libraries are public and should already be installed in Google Colab.

In [None]:
%load_ext autoreload
%autoreload 2
import os
import re
import sys
import string
import datetime
import pprint
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
# For download info / documentation on Natural Language Toolkit (nltk):
#    https://www.nltk.org/
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from collections import defaultdict

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


***

### Assigning Search Term Parameters

First we will assign the search parameters for the query. Test the key terms you will be searching for directly in the database of choice (here, PubMed), and then assign them to parameters['Term']. The maximum number of articles to return is set with parameters['RetMax'].

In [None]:
BASE_URL = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi/?db=&term=&retmax='

parameters = {}
# Term : PubMed desired search term(s)
parameters['Term'] = 'Salzman CD'
# RetMax : Max number of articles for each search term
parameters['RetMax'] = 100

### Generating List of Database Search Result URLs

Using the [NCBI Entrez API](https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch), SearchLinkGenerator builds the ESearch request URL for the search term in parameters['Term'], capped at the number of articles given by parameters['RetMax']. The ESearch response is an XML document listing the PubMed IDs (PMIDs) of the matching articles. PMID_ListGenerator then reads that XML and returns a flat list of PMIDs together with a nested list of article URLs (one inner list per search URL).

In [None]:
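def SearchLinkGenerator(base_url, searchParameters):
  '''Fill the db, term and retmax fields of base_url and return the URL in a list'''
  urlList = []
  # base_url has three empty fields (db=, term=, retmax=), so splitting on '=' yields four pieces
  url = base_url.split('=')
  database = 'pubmed'
  # Read from the searchParameters argument rather than the global parameters dict
  finalTerm = searchParameters['Term'].replace(' ', '+')
  searchLimit = searchParameters['RetMax']
  updated_url = '='.join([url[0], database + url[1], finalTerm + url[2], url[3] + str(searchLimit)])
  print(' URL: {}'.format(updated_url))
  urlList.append(updated_url)
  return urlList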

eSearchList = SearchLinkGenerator(BASE_URL, parameters)

 URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi/?db=pubmed&term=Salzman+CD&retmax=100
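
The same query string can also be assembled with the standard library instead of splitting and re-joining the template. The following is a minimal sketch, not part of the original lesson: build_esearch_url is a hypothetical helper name, and the https base URL without a trailing slash is an assumption based on the E-utilities documentation linked above.

```python
from urllib.parse import urlencode

# Assumed base URL for ESearch (https, no trailing slash before the query string)
ESEARCH_BASE = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

def build_esearch_url(term, retmax, db='pubmed'):
  '''Hypothetical helper: return an ESearch URL for one search term.'''
  # urlencode escapes each value; spaces become '+', as in the manual replace() above
  query = urlencode({'db': db, 'term': term, 'retmax': retmax})
  return ESEARCH_BASE + '?' + query

print(build_esearch_url('Salzman CD', 100))
```

Letting urlencode handle the escaping means search terms that contain characters such as & or = would not break the URL, which the plain replace(' ', '+') approach does not guard against.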


In [None]:
from urllib.request import urlopen

PUBMED_ROOT = 'https://www.ncbi.nlm.nih.gov/pubmed/'

def PMID_ListGenerator(eSearchList):
  print('\nGenerating list of PMIDs...')
  finalList = []
  PMIDList = []
  for term in eSearchList:
    r = urlopen(term).read().decode('utf-8')
    PMID_List = re.findall('<Id>(.*?)</Id>', r)
    resultsList = []
    for p_index, PMID in enumerate(PMID_List): # enumerate yields both the index and the item on each iteration
      PMIDList.append(PMID)
      link = PUBMED_ROOT+PMID
      resultsList.append(link)
      print('', str(p_index) + ':', link)
    finalList.append(resultsList)
  return PMIDList, finalList

PMID_list, pubmed_list = PMID_ListGenerator(eSearchList)


Generating list of PMIDs...
 0: https://www.ncbi.nlm.nih.gov/pubmed/33058757
 1: https://www.ncbi.nlm.nih.gov/pubmed/32859756
 2: https://www.ncbi.nlm.nih.gov/pubmed/31871162
 3: https://www.ncbi.nlm.nih.gov/pubmed/29849148
 4: https://www.ncbi.nlm.nih.gov/pubmed/29525574
 5: https://www.ncbi.nlm.nih.gov/pubmed/29459764
 6: https://www.ncbi.nlm.nih.gov/pubmed/28683271
 7: https://www.ncbi.nlm.nih.gov/pubmed/26479590
 8: https://www.ncbi.nlm.nih.gov/pubmed/26291167
 9: https://www.ncbi.nlm.nih.gov/pubmed/26240431
 10: https://www.ncbi.nlm.nih.gov/pubmed/26240417
 11: https://www.ncbi.nlm.nih.gov/pubmed/26140594
 12: https://www.ncbi.nlm.nih.gov/pubmed/25471563
 13: https://www.ncbi.nlm.nih.gov/pubmed/25358090
 14: https://www.ncbi.nlm.nih.gov/pubmed/25297102
 15: https://www.ncbi.nlm.nih.gov/pubmed/23377126
 16: https://www.ncbi.nlm.nih.gov/pubmed/23303950
 17: https://www.ncbi.nlm.nih.gov/pubmed/23189037
 18: https://www.ncbi.nlm.nih.gov/pubmed/22145876
 19: https://www.ncbi.nlm.nih.g
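
The regular expression in PMID_ListGenerator works because each PMID sits inside an Id tag. As an alternative, the XML can be read with a parser. The sketch below is illustrative only: SAMPLE_XML is a hand-written, trimmed stand-in for an ESearch response (the two PMIDs are taken from the output above), and extract_pmids is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for an ESearch response; a real response has more fields
SAMPLE_XML = '''<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>33058757</Id>
    <Id>32859756</Id>
  </IdList>
</eSearchResult>'''

def extract_pmids(xml_text):
  '''Hypothetical helper: return the PMIDs listed under IdList as strings.'''
  root = ET.fromstring(xml_text)
  # Only Id elements that are direct children of IdList are collected
  return [id_tag.text for id_tag in root.findall('./IdList/Id')]

for p_index, PMID in enumerate(extract_pmids(SAMPLE_XML)):
  print(p_index, 'https://www.ncbi.nlm.nih.gov/pubmed/' + PMID)
```

Restricting the lookup to IdList/Id avoids picking up an Id tag that might appear elsewhere in the document, which a bare regular expression would also match.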

In [None]:
from tqdm.notebook import tqdm
from collections import defaultdict
# For download info / documentation on BeautifulSoup:
#    https://www.crummy.com/software/BeautifulSoup/bs4/doc/
from bs4 import BeautifulSoup

def linksParser(termLinks, titleList, abstractList):
  '''linksParser reads each article URL from PMID_ListGenerator output and collects its title and abstract'''
  for link in tqdm(termLinks):
    # Open, read and process link through BeautifulSoup
    r1 = urlopen(link).read()
    soup = BeautifulSoup(r1, "html.parser")
    # ARTICLE NAME Parser
    articleTitle = soup.find('title').text
    # ABSTRACT Parser: the abstract is stored in the <meta name="citation_abstract"> tag
    for tag in soup.find_all('meta'):
      if tag.attrs.get('name') == 'citation_abstract':
        # Articles without this tag are skipped, so both lists stay aligned
        titleList.append(articleTitle)
        abstractList.append(tag.attrs['content'])
  return titleList, abstractList
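
To see what the meta-tag lookup in linksParser is matching, it can help to run the same idea on a small HTML string without any network access. The sketch below swaps BeautifulSoup for the standard library's html.parser so that it runs with no extra installs. SAMPLE_HTML is invented for illustration, and MetaAbstractParser and extract_abstracts are hypothetical names.

```python
from html.parser import HTMLParser

# Invented page fragment; a real article page contains many more meta tags
SAMPLE_HTML = '''<html><head>
<title>Example article title</title>
<meta name="citation_abstract" content="Example abstract text.">
<meta name="description" content="Not the abstract.">
</head><body></body></html>'''

class MetaAbstractParser(HTMLParser):
  '''Collects the content attribute of every citation_abstract meta tag.'''
  def __init__(self):
    super().__init__()
    self.abstracts = []

  def handle_starttag(self, tag, attrs):
    if tag != 'meta':
      return
    attr_map = dict(attrs)  # attrs arrives as a list of (name, value) pairs
    if attr_map.get('name') == 'citation_abstract':
      self.abstracts.append(attr_map.get('content', ''))

def extract_abstracts(html_text):
  '''Hypothetical helper: return all citation_abstract values found in html_text.'''
  parser = MetaAbstractParser()
  parser.feed(html_text)
  return parser.abstracts

print(extract_abstracts(SAMPLE_HTML))
```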

***
### Parsing Data

For each list of article URLs in pubmed_list (one list per search URL), dataParser calls linksParser to fetch every article page and collect its title and abstract. The results are returned as two parallel lists, titleList and abstractList, to be further analyzed.

In [None]:
def dataParser(pubmed_list):
    '''
    dataParser collects article titles and abstracts for every search term.
    pubmed_list is a nested list of article URLs, one inner list per search URL.
    Returns two parallel lists:
      titleList[i]    : title of article i
      abstractList[i] : abstract of article i
    '''
    print('\nParsing info for search terms...')
    # Initialize once, outside the loop, so results accumulate across search terms
    titleList = []
    abstractList = []
    for termLinks in pubmed_list:
      titleList, abstractList = linksParser(termLinks, titleList, abstractList)
    return titleList, abstractList

titleList, abstractList = dataParser(pubmed_list)


Parsing info for search terms...



### Abstract Analysis

Now that we've captured all the abstracts, we can perform analyses.

In [None]:
from collections import Counter

for title, paper in zip(titleList, abstractList):
  print('Title:', title)
  # len() on a string counts characters, not words
  print('  Number of Characters:', len(paper))

Title: The Geometry of Abstraction in the Hippocampus and Prefrontal Cortex
  Number of Characters: 1157
Title: Low-dimensional dynamics for working memory and time encoding
  Number of Characters: 1318
Title: The contribution of nonhuman primate research to the understanding of emotion and cognition and its clinical relevance
  Number of Characters: 1411
Title: The coding of valence and identity in the mammalian taste system
  Number of Characters: 1440
Title: Basolateral amygdala circuitry in positive and negative valence
  Number of Characters: 720
Title: Shared neural coding for social hierarchy and reward value in primate amygdala
  Number of Characters: 1041
Title: Distinct Roles for the Amygdala and Orbitofrontal Cortex in Representing the Relative Amount of Expected Reward
  Number of Characters: 1035
Title: Reward expectation differentially modulates attentional behavior and activity in visual area V4
  Number of Characters: 1096
Title: Abstract Context Representations in Primate Amygdala and Prefrontal Cortex
  Num
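
Calling len() on a string returns its length in characters, so counting words requires splitting the text into tokens first. The sketch below shows one simple way to do that with the Counter class imported above. The sample sentence is invented for illustration, word_counts is a hypothetical helper name, and the tokenizing rule (lowercase runs of letters and apostrophes) is an assumption that a real analysis might replace with nltk tokenization.

```python
import re
from collections import Counter

def word_counts(text):
  '''Hypothetical helper: return a Counter of lowercase word tokens in text.'''
  # Assumed tokenizing rule: a word is a run of letters or apostrophes
  words = re.findall(r"[a-z']+", text.lower())
  return Counter(words)

# Invented sentence standing in for one entry of abstractList
sample = 'Reward value and reward expectation shape amygdala activity.'
counts = word_counts(sample)
print('Number of Words:', sum(counts.values()))
print('Most common:', counts.most_common(1))
```

The same function could be applied to each entry of abstractList to compare word totals or the most frequent terms across papers.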