# Pubmed Parser
### Textual Analysis

__Notebook Description:__ PubmedParser is a Python-based web scraper designed to scrape, parse, and analyze NCBI's online PubMed database (https://www.ncbi.nlm.nih.gov/pubmed). The scraper can be used in a number of ways; here it is used to perform a textual analysis of specific researchers' abstracts. <br>
__Notebook Owner:__ Rahim Hashim <br>
__Contact:__ rh2898@columbia.edu <br>

***

### Import Libraries
__Public List:__ os, re, sys, string, datetime, pandas, numpy, tqdm, nltk, collections, urllib, bs4, unidecode, matplotlib<br>
__Additional Code:__ Regions

After following the directions for Managing Virtual Environments above, all of the libraries from the public list will be set up. Regions.py included in the directory will provide the rest of the functions used in this codebase. 

In [1]:
%load_ext autoreload
%autoreload 2
import os
import re
import sys
import string
import datetime
import pprint
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
from collections import defaultdict, Counter

# if running on google colab, mount google drive
if 'google.colab' in str(get_ipython()): 
  from google.colab import drive
  drive.mount('/content/drive', force_remount=True)
  sys.path.append('/content/drive/MyDrive/Projects/pubmed-parser/Pubmed-Parser') # set to path of notebook

***

### Assigning Search Term Parameters

First we will assign the search parameters for scraping. In particular, test the key terms you will be searching for in the database of choice, and then assign them to the SearchParameters.searchTerms attribute. 

Here you can use the same logical operators (i.e. AND | OR | NOT) to construct advanced queries in the same way as in NCBI searches. Examples below:


> `search_terms = ['Stefano Fusi']`<br>
> `search_terms = ['Stefano Fusi AND Nature Neuroscience[Journal]']`<br>
> `search_terms = ['hippocampus[Title] AND amygdala[Title]']`<br>
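
As a minimal sketch, the query strings above can also be assembled programmatically, which helps avoid unbalanced quotes or brackets. The helper names `build_query` and `tagged` are hypothetical and are not part of this repository; the strings they produce follow the same `term[Field]` and `AND | OR | NOT` conventions described above.

```python
def tagged(term, field):
    """Attach a PubMed field tag such as [Title] or [Journal] to a term."""
    return f"{term}[{field}]"


def build_query(*clauses, operator="AND"):
    """Join non-empty clauses with a boolean operator (AND, OR, or NOT)."""
    operator = operator.upper()
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(c.strip() for c in clauses if c.strip())


# Hypothetical usage: rebuild two of the example queries shown above.
search_terms = [
    build_query("Stefano Fusi", tagged("Nature Neuroscience", "Journal")),
    build_query(tagged("hippocampus", "Title"), tagged("amygdala", "Title")),
]
print(search_terms)
```

The resulting list can be assigned to `parameters['searchTerms']` exactly like a hand-written one.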

In [2]:
search_terms = ['Stefano Fusi AND Nature Neuroscience[Journal]',
                'Daniel Salzman AND Nature Neuroscience[Journal]']

In [3]:
from helper.searchObjects import SearchParameters

parameters = {}
# Database : Specified NCBI database
#   Options = Pubmed [pubmed] | Pubmed Central [PMC] | Unigene [Unigene] | Others [Look Up Key]
parameters['database'] = 'pubmed'
# SearchTerms : PubMed desired search term(s)
parameters['searchTerms'] = search_terms
# searchLimit : Max number of articles for each search term
parameters['searchLimit'] = 100
# StartIndex : The start index for the search (larger for older papers)
parameters['startIndex'] = 0

searchParameters = SearchParameters(parameters)

***
### Generating List of Database Search Result URLs

Using the [NCBI Entrez API](https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch), eSearchLinkGenerator generates XML containing the list of URLs for the articles returned for each term in searchParameters.searchTerms, up to the number of articles specified by searchLimit. resultsList holds the resulting nested list of all article URLs.

As of December 2021, **NCBI requires an API key** in order to make more than 3 requests per second to Entrez. Users can obtain an API key now from the Settings page of their NCBI account (to create an account, visit http://www.ncbi.nlm.nih.gov/account/). After creating the key, users should include it in each E-utility request by assigning it to the new api_key parameter. See: [Entrez Documentation](https://www.ncbi.nlm.nih.gov/books/NBK25497/) for more information.

Once you've generated an API key, create a new file named .env and, on its first line, assign the key to a variable:<br>
> `NCBI_API_KEY = '<>'`
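
To make the role of the key concrete, here is a rough sketch of how an ESearch request URL with the `api_key` parameter could be built. This is not the code inside `eSearchLinkGenerator`; the function name `build_esearch_url` and the constant `ESEARCH_BASE` are assumptions for illustration. The query parameters `db`, `term`, `retmax`, `retstart`, and `api_key` are the ones described in the Entrez documentation linked above.

```python
import os
from urllib.parse import urlencode

# Assumed base URL for ESearch (illustrative; the notebook passes its own template).
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="pubmed", retmax=100, retstart=0, api_key=None):
    """Return an ESearch URL; the api_key parameter is added only when provided."""
    params = {"db": db, "term": term, "retmax": retmax, "retstart": retstart}
    if api_key:
        params["api_key"] = api_key
    # urlencode percent-encodes brackets and turns spaces into '+'.
    return f"{ESEARCH_BASE}?{urlencode(params)}"


# Reads the key from the environment (None if the .env file has not been loaded).
url = build_esearch_url(
    "Stefano Fusi AND Nature Neuroscience[Journal]",
    api_key=os.getenv("NCBI_API_KEY"),
)
print(url)
```

Leaving the key out of the URL when it is unset keeps the request valid, just subject to the lower rate limit.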

In [4]:
from helper.entrezParser import eSearchLinkGenerator, PMID_ListGenerator
from dotenv import load_dotenv
load_dotenv()

NCBI_API_KEY = os.getenv('NCBI_API_KEY') # set in .env file

def entrezSearch(searchParameters):
  '''
  entrezSearch generates resultsList, which is
  a list of all article URLs for each search term

  Args:
    searchParameters (SearchParameters): SearchParameters object
    containing search parameters assigned above

  Returns:
    resultsList (list): list of all article URLs for each search term
  '''
  eSearchCore = 'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/esearch.fcgi/?db=&term=&retmax=&retstart='
  api = NCBI_API_KEY
  eSearchLinkList = eSearchLinkGenerator(eSearchCore, searchParameters, api)
  resultsList = PMID_ListGenerator(eSearchLinkList)
  return resultsList

resultsList = entrezSearch(searchParameters)

Generating Entrez XML...
   [Stefano Fusi AND Nature Neuroscience[Journal]] complete
   [Daniel Salzman AND Nature Neuroscience[Journal]] complete

Generating list of PMIDs...
  Fusi, Stefano[Full Author Name]: 5 results
  Salzman, Daniel[Full Author Name]: 4 results
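
For readers curious about what the PMID step involves, the sketch below pulls IDs out of an ESearch-style XML response using only the standard library. It is an illustration, not the implementation of `PMID_ListGenerator`: the function name `extract_pmids` is hypothetical, and `SAMPLE_XML` is a hand-written, abridged response that reuses the five PMIDs shown later in this notebook.

```python
import xml.etree.ElementTree as ET

# Hand-written, abridged stand-in for an ESearch XML response (assumption).
SAMPLE_XML = """<eSearchResult>
  <Count>5</Count>
  <RetMax>5</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>35590075</Id>
    <Id>33077947</Id>
    <Id>31873285</Id>
    <Id>27694992</Id>
    <Id>17351638</Id>
  </IdList>
</eSearchResult>"""


def extract_pmids(xml_text):
    """Return the PMIDs listed under <IdList> as strings, in document order."""
    root = ET.fromstring(xml_text)
    return [id_node.text for id_node in root.findall("./IdList/Id")]


pmids = extract_pmids(SAMPLE_XML)
print(f"{len(pmids)} results: {pmids}")
```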


***
### Parsing Data

For each searchTerm provided to searchParameters, dataParser will take each of the article URLs and parse the specified information, inserting it into a multi-nested dictionary queriesHash to be further analyzed.

In [5]:
from helper.linksParser import linksParser

def dataParser(resultsList):
  """
  dataParser creates a multi-nested dictionary
  containing all article data for each search term

  Args:
    resultsList (list): list of all article URLs for each search term
  
  Returns:
    queriesHash (dict): multi-nested dictionary containing all article data for each search term
  """
  print('\nParsing info for search terms...')
  queriesHash = defaultdict(lambda: defaultdict(list)) # primary key = pubmed query
  for a_index, termLinks in enumerate(resultsList):
    query = searchParameters.searchTerms[a_index]
    searchesHash = linksParser(termLinks, searchParameters, query)
    queriesHash[query] = searchesHash  
  return queriesHash

searchesHash = dataParser(resultsList)


Parsing info for search terms...
  Stefano Fusi AND Nature Neuroscience[Journal]


  0%|          | 0/5 [00:00<?, ?it/s]

  Daniel Salzman AND Nature Neuroscience[Journal]


  0%|          | 0/4 [00:00<?, ?it/s]

In [6]:
print(searchesHash[search_terms[0]]['35590075']['doi'])

doi.org/10.1038/s41593-022-01072-y
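
The lookup above suggests a `query -> PMID -> field` layout. Assuming that shape, the sketch below shows one way to walk the nested dictionary and collect a single field for every article. The helper `collect_field` and the tiny `sample_hash` are illustrative stand-ins; only the `doi` key is confirmed by the output above, and the two DOIs are taken from this notebook's results.

```python
from collections import defaultdict

# Small stand-in with the assumed query -> PMID -> field layout.
query = "Stefano Fusi AND Nature Neuroscience[Journal]"
sample_hash = defaultdict(dict)
sample_hash[query]["35590075"] = {"doi": "doi.org/10.1038/s41593-022-01072-y"}
sample_hash[query]["33077947"] = {"doi": "doi.org/10.1038/s41593-020-00720-5"}
sample_hash[query]["00000000"] = {}  # hypothetical article with no parsed fields


def collect_field(queries_hash, field):
    """Return {query: {pmid: value}} for one field, skipping articles that lack it."""
    return {
        q: {pmid: article[field] for pmid, article in articles.items() if field in article}
        for q, articles in queries_hash.items()
    }


dois = collect_field(sample_hash, "doi")
for pmid, doi in dois[query].items():
    print(pmid, doi)
```

Skipping missing fields rather than raising keeps a partially parsed article from interrupting the loop.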


***
### Semantic Scholar

[Semantic Scholar](https://www.semanticscholar.org/) is a tool built by the Allen Institute for AI that provides free, AI-driven search and discovery tools, and open resources for the global research community. They've built a useful [API](https://www.semanticscholar.org/product/api#Documentation), which we will use to add data that is not available through Entrez (e.g. citation count) for each of the papers found for each search term. 

In [7]:
from semantic_scholar_query import semantic_scholar_search
searchesHash = semantic_scholar_search(searchesHash, verbose=True)

Query: Stefano Fusi AND Nature Neuroscience[Journal]
  ('Title: Adolescent thalamic inhibition leads to long-lasting impairments in '
   'prefrontal cortex function (2022)')
	'PMID: 35590075'
	'Citation Count: 2'
	'Semantic Scholar Citation Count: 0'
  ('Title: Coding of social novelty in the hippocampal CA2 region and its '
   'disruption and rescue in a 22q11.2 microdeletion mouse model (2020)')
	'PMID: 33077947'
	'Citation Count: 27'
	'Semantic Scholar Citation Count: 0'
  ('Title: Context-dependent representations of objects and space in the primate '
   'hippocampus during virtual navigation (2020)')
	'PMID: 31873285'
	'Citation Count: 37'
	'Semantic Scholar Citation Count: 0'
  'Title: Computational principles of synaptic memory consolidation (2016)'
	'PMID: 27694992'
	'Citation Count: 134'
	'Semantic Scholar Citation Count: 9'
  'Title: Limits on the memory storage capacity of bounded synapses (2007)'
	'PMID: 17351638'
	'Citation Count: 193'
	'Semantic Scholar Citation Count: 12'
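
As a rough sketch of the kind of lookup involved (not the code inside `semantic_scholar_search`), the Semantic Scholar Graph API accepts a `PMID:` prefixed identifier and a comma-separated `fields` parameter. The helper names `build_s2_url` and `parse_citation_count` are hypothetical, and `SAMPLE_RESPONSE` is a hand-written payload that mirrors the first entry printed above rather than a live API result.

```python
import json
from urllib.parse import urlencode

S2_BASE = "https://api.semanticscholar.org/graph/v1/paper"


def build_s2_url(pmid, fields=("title", "year", "citationCount")):
    """Return a Graph API URL that looks a paper up by its PubMed ID."""
    query = urlencode({"fields": ",".join(fields)}, safe=",")
    return f"{S2_BASE}/PMID:{pmid}?{query}"


def parse_citation_count(payload_text):
    """Return the citationCount from a JSON payload, or None if it is absent."""
    return json.loads(payload_text).get("citationCount")


# Hand-written stand-in for a response body (assumption, not fetched data).
SAMPLE_RESPONSE = json.dumps({
    "title": "Adolescent thalamic inhibition leads to long-lasting "
             "impairments in prefrontal cortex function",
    "year": 2022,
    "citationCount": 0,
})

print(build_s2_url("35590075"))
print(parse_citation_count(SAMPLE_RESPONSE))
```

Returning `None` for a missing count lets the caller distinguish "no data" from a true count of zero.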

In [8]:
from helper.construct_dataframe import construct_dataframe

authors_df = construct_dataframe(searchesHash)
authors_df


Constructing dataframe...


| | author | PMID | articleCount | article_title | journal_title | publisher | publication_date | doi | abstract | author_institutions | citation_count | semantic_scholar_citation_count | collaborators |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Laura J Benoit | 35590075 | 0.0 | Adolescent thalamic inhibition leads to long-l... | Nature neuroscience | Nat Neurosci | 2022 | doi.org/10.1038/s41593-022-01072-y | Impaired cortical maturation is a postulated m... | [Graduate Program in Neurobiology and Behavior... | 2.0 | 0.0 | [Emma S Holt, Lorenzo Posani, Stefano Fusi, Al... |
| 1 | Emma S Holt | 35590075 | 0.0 | Adolescent thalamic inhibition leads to long-l... | Nature neuroscience | Nat Neurosci | 2022 | doi.org/10.1038/s41593-022-01072-y | Impaired cortical maturation is a postulated m... | [Graduate Program in Neurobiology and Behavior... | 2.0 | 0.0 | [Laura J Benoit, Lorenzo Posani, Stefano Fusi,... |
| 2 | Lorenzo Posani | 35590075 | 0.0 | Adolescent thalamic inhibition leads to long-l... | Nature neuroscience | Nat Neurosci | 2022 | doi.org/10.1038/s41593-022-01072-y | Impaired cortical maturation is a postulated m... | [Graduate Program in Neurobiology and Behavior... | 2.0 | 0.0 | [Laura J Benoit, Emma S Holt, Stefano Fusi, Al... |
| 3 | Stefano Fusi | 35590075 | 0.0 | Adolescent thalamic inhibition leads to long-l... | Nature neuroscience | Nat Neurosci | 2022 | doi.org/10.1038/s41593-022-01072-y | Impaired cortical maturation is a postulated m... | [Graduate Program in Neurobiology and Behavior... | 2.0 | 0.0 | [Laura J Benoit, Emma S Holt, Lorenzo Posani, ... |
| 4 | Alexander Z Harris | 35590075 | 0.0 | Adolescent thalamic inhibition leads to long-l... | Nature neuroscience | Nat Neurosci | 2022 | doi.org/10.1038/s41593-022-01072-y | Impaired cortical maturation is a postulated m... | [Graduate Program in Neurobiology and Behavior... | 2.0 | 0.0 | [Laura J Benoit, Emma S Holt, Lorenzo Posani, ... |
| 5 | Sarah Canetta | 35590075 | 0.0 | Adolescent thalamic inhibition leads to long-l... | Nature neuroscience | Nat Neurosci | 2022 | doi.org/10.1038/s41593-022-01072-y | Impaired cortical maturation is a postulated m... | [Graduate Program in Neurobiology and Behavior... | 2.0 | 0.0 | [Laura J Benoit, Emma S Holt, Lorenzo Posani, ... |
| 6 | Christoph Kellendonk | 35590075 | 0.0 | Adolescent thalamic inhibition leads to long-l... | Nature neuroscience | Nat Neurosci | 2022 | doi.org/10.1038/s41593-022-01072-y | Impaired cortical maturation is a postulated m... | [Graduate Program in Neurobiology and Behavior... | 2.0 | 0.0 | [Laura J Benoit, Emma S Holt, Lorenzo Posani, ... |
| 7 | Macayla L Donegan | 33077947 | 1.0 | Coding of social novelty in the hippocampal CA... | Nature neuroscience | Nat Neurosci | 2020 | doi.org/10.1038/s41593-020-00720-5 | The hippocampal CA2 region is essential for so... | [Department of Neuroscience, Zuckerman and Kav... | 27.0 | 0.0 | [Fabio Stefanini, Torcato Meira, Joshua A Gord... |
| 8 | Fabio Stefanini | 33077947 | 1.0 | Coding of social novelty in the hippocampal CA... | Nature neuroscience | Nat Neurosci | 2020 | doi.org/10.1038/s41593-020-00720-5 | The hippocampal CA2 region is essential for so... | [Department of Neuroscience, Zuckerman and Kav... | 27.0 | 0.0 | [Macayla L Donegan, Torcato Meira, Joshua A Go... |
| 9 | Torcato Meira | 33077947 | 1.0 | Coding of social novelty in the hippocampal CA... | Nature neuroscience | Nat Neurosci | 2020 | doi.org/10.1038/s41593-020-00720-5 | The hippocampal CA2 region is essential for so... | [Department of Neuroscience, Zuckerman and Kav... | 27.0 | 0.0 | [Macayla L Donegan, Fabio Stefanini, Joshua A ... |


***
### Abstract Word Count

A simple analysis will be to find the most commonly-used words in each query.

First, remove all the stop words from the abstracts. As described in this [NLTK stop words tutorial](https://pythonspot.com/nltk-stop-words/):

>Stop words are common words like ‘the’, ‘and’, ‘I’, etc. that are very frequent in text, and so don’t convey insights into the specific topic of a document. We can remove these stop words from the text in a given corpus to clean up the data, and identify words that are more rare and potentially more relevant to what we’re interested in.

Next, all punctuation, digits, and unnecessary characters are removed.

Lastly, each word will be lemmatized, which is defined by [Stanford's NLP group](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) as:

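> Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. 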

For example, in English, the verb 'to walk' may appear as 'walk', 'walked', 'walks' or 'walking'. The base form, 'walk', that one might look up in a dictionary, is called the lemma for the word. 

This program uses [WordNet](https://wordnet.princeton.edu/), a large, freely and publicly available lexical database for the English language.
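
The three steps above (stop-word removal, stripping punctuation and digits, lemmatization) can be sketched with the standard library alone. This is not how `lemming_count` works internally: `STOP_WORDS` is a tiny stand-in for NLTK's stop-word list, `TOY_LEMMAS` is a hand-written lookup standing in for a WordNet-based lemmatizer, and the function names and example sentence are made up for illustration.

```python
import string
from collections import Counter

# Tiny stand-in for NLTK's English stop-word list (assumption).
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "are", "for"}
# Toy lookup standing in for WordNet lemmatization (assumption).
TOY_LEMMAS = {"neurons": "neuron", "memories": "memory", "encoded": "encode"}


def clean_tokens(text):
    """Lowercase, strip punctuation and digits, drop stop words, then lemmatize."""
    drop = str.maketrans("", "", string.punctuation + string.digits)
    tokens = text.lower().translate(drop).split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [TOY_LEMMAS.get(t, t) for t in tokens]


def top_words(text, n=3):
    """Return the n most frequent cleaned tokens."""
    return [word for word, _ in Counter(clean_tokens(text)).most_common(n)]


abstract = "The neurons encode memories; 2 neurons encoded the memory of a task."
print(clean_tokens(abstract))
print(top_words(abstract))
```

In this sketch punctuation is stripped before the stop-word filter so that tokens such as "the," still match the stop list; a real pipeline would swap in NLTK's tokenizer, stop-word corpus, and WordNet lemmatizer.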


In [9]:
from helper.wordCount import lemming_count
lemming_count(searchesHash, top_n_words=20)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/rahimhashim/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rahimhashim/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/rahimhashim/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw to /Users/rahimhashim/nltk_data...
[nltk_data]   Package omw is already up-to-date!


Stefano Fusi AND Nature Neuroscience[Journal] ['memory', 'task', 'social', 'mouse', 'model', 'synaptic', 'activity', 'prefrontal', 'neuron', 'coding', 'bound', 'maturation', 'cognitive', 'deficit', 'spatial', 'sup', 'associative', 'number', 'cortical', 'thalamus']
Daniel Salzman AND Nature Neuroscience[Journal] ['amygdala', 'reward', 'attention', 'spatial', 'social', 'stimulus', 'location', 'information', 'associated', 'value', 'neural', 'hierarchical', 'activity', 'increased', 'cue', 'brain', 'process', 'data', 'whether', 'rank']


***
### Year-by-Year Analysis

The next analysis we'll be performing is year-by-year lexicon frequency. In the example queries provided above, we introduced two authors (Salzman CD, Fusi S) and captured all meta-info (title, journal, date published, etc.) and, importantly, all of the abstracts for each author's articles. 

In pre-processing we'll __separate stop words__, defined by [NLTK](https://www.nltk.org/book/ch02.html) as: *high-frequency words like 'the', 'to' and 'also' that we sometimes want to filter out of a document before further processing. Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts.* 

The data will still be preserved in yearsHash\[year\]\['stopWords'\] if you're interested in analyzing something like self- vs. other-referential pronoun usage.  

In [10]:
from helper.yearExaminer import yearExaminer
yearsHash = yearExaminer(searchesHash)

[   ['2007', 0.7093023255813954],
    ['2009', 0.9285714285714286],
    ['2013', 0.6732673267326733],
    ['2015', 0.7634408602150538],
    ['2016', 0.8214285714285714],
    ['2018', 0.6881720430107527],
    ['2020', 0.6632653061224489],
    ['2022', 0.6761904761904762]]
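
As one possible follow-up on the preserved stop words, the sketch below counts self- versus other-referential pronouns per year. It assumes `yearsHash[year]['stopWords']` is a list of stop-word tokens, which is not confirmed by the output above. The pronoun sets, the function `pronoun_usage`, and the `sample_years` data are all illustrative.

```python
# Illustrative pronoun sets (assumption; extend as needed).
SELF_PRONOUNS = {"i", "we", "our", "us", "my"}
OTHER_PRONOUNS = {"they", "their", "them", "he", "she", "his", "her"}


def pronoun_usage(years_hash):
    """Count self- and other-referential pronouns among each year's stop words."""
    usage = {}
    for year, info in sorted(years_hash.items()):
        tokens = [t.lower() for t in info.get("stopWords", [])]
        usage[year] = {
            "self": sum(t in SELF_PRONOUNS for t in tokens),
            "other": sum(t in OTHER_PRONOUNS for t in tokens),
        }
    return usage


# Toy data with the assumed year -> {'stopWords': [...]} layout.
sample_years = {
    "2016": {"stopWords": ["we", "the", "our", "of", "they"]},
    "2007": {"stopWords": ["the", "of", "We"]},
}
print(pronoun_usage(sample_years))
```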
