# Pubmed Parser
### Textual Analysis

__Notebook Description:__ PubmedParser is a Python-based web scraper designed to scrape, parse, and analyze NCBI's online PubMed database (https://www.ncbi.nlm.nih.gov/pubmed). The scraper can be used in a number of ways; here it is used to perform a textual analysis of specific researchers' abstracts. <br>
__Notebook Owner:__ Rahim Hashim <br>
__Contact:__ rh2898@columbia.edu <br>

***

### Import Libraries
__Public List:__ os, re, sys, string, datetime, pandas, numpy, tqdm, nltk, collections, urllib, bs4, unidecode, matplotlib<br>
__Additional Code:__ Regions

After following the directions for Managing Virtual Environments above, all of the libraries from the public list will be set up. Regions.py included in the directory will provide the rest of the functions used in this codebase. 

In [1]:
%load_ext autoreload
%autoreload 2
import os
import re
import sys
import string
import datetime
import pprint
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
from collections import defaultdict, Counter

# if running on google colab, mount google drive
if 'google.colab' in str(get_ipython()): 
  from google.colab import drive
  drive.mount('/content/drive', force_remount=True)
  sys.path.append('/content/drive/MyDrive/Projects/pubmed-parser/Pubmed-Parser') # set to path of notebook

***

### Assigning Search Term Parameters

First we will assign the search parameters for scraping. In particular, test the key terms you will be searching for in the database of choice, and then assign them to the SearchParameters.searchTerms attribute. 

Here you can use the same logical operators (i.e. AND | OR | NOT) to construct advanced queries in the same way as in NCBI searches. Examples below:


> `search_terms = ['Stefano Fusi']`<br>
> `search_terms = ['Stefano Fusi AND Nature Neuroscience[Journal]']`<br>
> `search_terms = ['hippocampus[Title] AND amygdala[Title]']`<br>
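
Query strings like the ones above can also be assembled programmatically. The following is a minimal sketch, not part of this codebase: `build_query` and `example_terms` are hypothetical names introduced for illustration, and the helper only joins terms with a single operator.

```python
def build_query(terms, operator="AND"):
    """Join (text, field) pairs into one Entrez-style query string.

    Hypothetical helper for illustration; not part of helper.searchObjects.
    """
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    parts = []
    for text, field in terms:
        # A field of None leaves the term untagged (searched across all fields).
        parts.append(f"{text}[{field}]" if field else text)
    return f" {operator} ".join(parts)


example_terms = [
    build_query([("Stefano Fusi", None), ("Nature Neuroscience", "Journal")]),
    build_query([("hippocampus", "Title"), ("amygdala", "Title")]),
]
print(example_terms)
```

The resulting list has the same shape as `search_terms` and could be assigned to it directly.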

In [5]:
search_terms = ['Stefano Fusi',
                'Daniel Salzman']

In [12]:
from helper.searchObjects import SearchParameters

parameters = {}
# Database : Specified NCBI database
#   Options = Pubmed [pubmed] | Pubmed Central [PMC] | Unigene [Unigene] | Others [Look Up Key]
parameters['database'] = 'pubmed'
# SearchTerms : PubMed desired search term(s)
parameters['searchTerms'] = search_terms
# searchLimit : Max number of articles for each search term
parameters['searchLimit'] = 100
# StartIndex : The start index for the search (larger for older papers)
parameters['startIndex'] = 0

searchParameters = SearchParameters(parameters)

***
### Generating List of Database Search Result URLs

Using the [NCBI Entrez API](https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch), eSearchLinkGenerator builds one ESearch request per entry in searchParameters.searchTerms; each request returns an XML document listing the matching articles, up to the number specified by searchLimit. PMID_ListGenerator then turns those XML documents into resultsList, a nested list of article URLs (one inner list per search term).

As of December 2021, **NCBI requires an API key** in order to make more than 3 requests per second to Entrez. Users can obtain an API key now from the Settings page of their NCBI account (to create an account, visit http://www.ncbi.nlm.nih.gov/account/). After creating the key, users should include it in each E-utility request by assigning it to the new api_key parameter. See: [Entrez Documentation](https://www.ncbi.nlm.nih.gov/books/NBK25497/) for more information.

Once you've generated an API key, create a new file named .env in the notebook's directory and, on the first line, assign the key to a variable (replace `<>` with your key):<br>
> `NCBI_API_KEY = '<>'`
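
To make the request format concrete, here is a minimal sketch of how an ESearch URL with the `db`, `term`, `retmax`, `retstart`, and optional `api_key` parameters can be assembled. `build_esearch_url` is a hypothetical helper written for illustration; it is not claimed to match how `eSearchLinkGenerator` builds its links internally.

```python
from urllib.parse import urlencode

ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, database="pubmed", retmax=100, retstart=0, api_key=None):
    """Return an ESearch URL for one search term (hypothetical helper)."""
    params = {"db": database, "term": term, "retmax": retmax, "retstart": retstart}
    if api_key:
        # Only attach the key when one is available (e.g. loaded from .env).
        params["api_key"] = api_key
    return f"{ESEARCH_BASE}?{urlencode(params)}"


example_url = build_esearch_url("Stefano Fusi")
print(example_url)
```

`urlencode` takes care of escaping spaces and brackets in the search term, so field-tagged queries such as `hippocampus[Title]` can be passed in unchanged.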

In [26]:
from helper.entrezParser import eSearchLinkGenerator, PMID_ListGenerator
from dotenv import load_dotenv
load_dotenv()

NCBI_API_KEY = os.getenv('NCBI_API_KEY') # set in .env file

def entrezSearch(searchParameters):
  '''
  entrezSearch generates resultsList, which is
  a list of all article URLs for each search term

  Args:
    searchParameters (SearchParameters): SearchParameters object
    containing search parameters assigned above

  Returns:
    resultsList (list): list of all article URLs for each search term
  '''
  eSearchCore = 'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/esearch.fcgi/?db=&term=&retmax=&retstart='
  api = NCBI_API_KEY
  eSearchLinkList = eSearchLinkGenerator(eSearchCore, searchParameters, api)
  resultsList = PMID_ListGenerator(eSearchLinkList)
  return resultsList

resultsList = entrezSearch(searchParameters)

Generating Entrez XML...
   [Stefano Fusi] complete
   [Daniel Salzman] complete

Generating list of PMIDs...
  Stefano Fusi: 67 results
  Daniel Salzman: 33 results
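
The PMID-listing step reported above can be sketched as follows. This is an illustration under stated assumptions: `SAMPLE_XML` is a hand-written, abbreviated stand-in for an ESearch response (its two IDs are taken from the results shown later in this notebook), and `extract_pmids` and `pmid_to_url` are hypothetical helpers, not the actual `PMID_ListGenerator` implementation.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for an ESearch XML response (assumption: real responses
# contain additional elements that are ignored here).
SAMPLE_XML = """<eSearchResult>
  <Count>2</Count>
  <RetMax>2</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>36993645</Id>
    <Id>36823136</Id>
  </IdList>
</eSearchResult>"""


def extract_pmids(xml_text):
    """Return the PMIDs listed under <IdList> as strings."""
    root = ET.fromstring(xml_text)
    return [id_node.text for id_node in root.findall("./IdList/Id")]


def pmid_to_url(pmid):
    """Build the PubMed article page URL for a PMID."""
    return f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"


pmids = extract_pmids(SAMPLE_XML)
article_urls = [pmid_to_url(pmid) for pmid in pmids]
print(article_urls)
```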


***
### Parsing Data

For each searchTerm provided to searchParameters, dataParser will take each of the article URLs and parse the specified information, inserting it into a multi-nested dictionary queriesHash to be further analyzed.

In [27]:
from helper.linksParser import linksParser

def dataParser(resultsList):
  """
  dataParser creates a multi-nested dictionary
  containing all article data for each search term

  Args:
    resultsList (list): list of all article URLs for each search term
  
  Returns:
    queriesHash (dict): multi-nested dictionary containing all article data for each search term
  """
  print('\nParsing info for search terms...')
  queriesHash = defaultdict(lambda: defaultdict(list)) # primary key = pubmed query
  for a_index, termLinks in enumerate(resultsList):
    query = searchParameters.searchTerms[a_index]
    searchesHash = linksParser(termLinks, searchParameters, query)
    queriesHash[query] = searchesHash  
  return queriesHash

searchesHash = dataParser(resultsList)


Parsing info for search terms...
  Stefano Fusi


  0%|          | 0/67 [00:00<?, ?it/s]

  Daniel Salzman


  0%|          | 0/33 [00:00<?, ?it/s]

***
### Semantic Scholar

[Semantic Scholar](https://www.semanticscholar.org/) is a tool built by the Allen Institute for AI that provides free, AI-driven search and discovery tools and open resources for the global research community. Its [API](https://www.semanticscholar.org/product/api#Documentation) is used here to add data that is not available through Entrez (e.g. citation count) to each of the papers found for each search term. 
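
For orientation, a paper lookup by PMID against the Semantic Scholar Graph API can be sketched as below. `build_s2_paper_url` is a hypothetical helper introduced for illustration; whether `semantic_scholar_search` uses this endpoint or these fields is an assumption, and the sketch only builds the URL without sending a request.

```python
from urllib.parse import urlencode

S2_GRAPH_BASE = "https://api.semanticscholar.org/graph/v1/paper"


def build_s2_paper_url(pmid, fields=("title", "citationCount")):
    """Return a Graph API URL for one paper, identified by its PMID."""
    # safe="," keeps the comma-separated field list readable in the URL.
    query = urlencode({"fields": ",".join(fields)}, safe=",")
    return f"{S2_GRAPH_BASE}/PMID:{pmid}?{query}"


s2_url = build_s2_paper_url("36993645")
print(s2_url)
```

The response to such a request is JSON, so the citation count could then be read from the decoded dictionary and stored alongside the Entrez data for the same PMID.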

In [28]:
from semantic_scholar_query import semantic_scholar_search
searchesHash = semantic_scholar_search(searchesHash, verbose=True)

Query: Stefano Fusi
  ('Title: Perirhinal Cortex Learns A Predictive Map of The Task Environment '
   '(2023/04/10)')
	'PMID: 36993645'
	'Citation Count: 0'
	'Semantic Scholar Citation Count: 0'
  ('Title: Abstract representations emerge naturally in neural networks trained '
   'to perform multiple tasks (2023/02/23)')
	'PMID: 36823136'
	'Citation Count: 10'
	'Semantic Scholar Citation Count: 0'
  'Title: Face familiarity detection with complex synapses (2022/12/22)'
	'PMID: 36636347'
	'Citation Count: 2'
	'Semantic Scholar Citation Count: 0'
  'Title: The geometry of cortical representations of touch in rodents (2023)'
	'PMID: 36624277'
	'Citation Count: 15'
	'Semantic Scholar Citation Count: 1'
  ('Title: The implications of categorical and category-free mixed selectivity '
   'on representational geometries (2022)')
	'PMID: 36332415'
	'Citation Count: 8'
	'Semantic Scholar Citation Count: 0'
  ('Title: Adolescent thalamic inhibition leads to long-lasting impairments in '
   'prefro

In [30]:
from helper.construct_dataframe import construct_dataframe

authors_df = construct_dataframe(searchesHash)
authors_df


Constructing dataframe...


Unnamed: 0,author,PMID,articleCount,article_title,journal_title,publisher,publication_date,doi,abstract,author_institutions,citation_count,semantic_scholar_citation_count,collaborators,keywords
0,David G Lee,36993645,0,Perirhinal Cortex Learns A Predictive Map of T...,bioRxiv : the preprint server for biology,bioRxiv,2023/04/10,doi.org/10.1101/2023.03.17.532214,Goal-directed tasks involve acquiring an inter...,[],0,0,"[Caroline A McLachlan, Ramon Nogueira, Osung K...",
1,Caroline A McLachlan,36993645,0,Perirhinal Cortex Learns A Predictive Map of T...,bioRxiv : the preprint server for biology,bioRxiv,2023/04/10,doi.org/10.1101/2023.03.17.532214,Goal-directed tasks involve acquiring an inter...,[],0,0,"[David G Lee, Ramon Nogueira, Osung Kwon, Alan...",
2,Ramon Nogueira,36993645,0,Perirhinal Cortex Learns A Predictive Map of T...,bioRxiv : the preprint server for biology,bioRxiv,2023/04/10,doi.org/10.1101/2023.03.17.532214,Goal-directed tasks involve acquiring an inter...,[],0,0,"[David G Lee, Caroline A McLachlan, Osung Kwon...",
3,Osung Kwon,36993645,0,Perirhinal Cortex Learns A Predictive Map of T...,bioRxiv : the preprint server for biology,bioRxiv,2023/04/10,doi.org/10.1101/2023.03.17.532214,Goal-directed tasks involve acquiring an inter...,[],0,0,"[David G Lee, Caroline A McLachlan, Ramon Nogu...",
4,Alanna Carey,36993645,0,Perirhinal Cortex Learns A Predictive Map of T...,bioRxiv : the preprint server for biology,bioRxiv,2023/04/10,doi.org/10.1101/2023.03.17.532214,Goal-directed tasks involve acquiring an inter...,[],0,0,"[David G Lee, Caroline A McLachlan, Ramon Nogu...",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
457,Sara E Morrison,16482160,31,The primate amygdala represents the positive a...,Nature,Nature,2006/02/16,doi.org/10.1038/nature04490,Visual stimuli can acquire positive or negativ...,"[Center for Neurobiology and Behavior, Columbi...",879,55,"[Joseph J Paton, Marina A Belova, C Daniel Sal...",
458,C Daniel Salzman,16482160,31,The primate amygdala represents the positive a...,Nature,Nature,2006/02/16,doi.org/10.1038/nature04490,Visual stimuli can acquire positive or negativ...,"[Center for Neurobiology and Behavior, Columbi...",879,55,"[Joseph J Paton, Marina A Belova, Sara E Morri...",
459,C Daniel Salzman,16271457,32,"Beetles, boxes and brain cells: neural mechani...",Current opinion in neurobiology,Curr Opin Neurobiol,2005,doi.org/10.1016/j.conb.2005.10.016,Sensory cues in the environment can predict th...,"[Center for Neurobiology and Behavior, Departm...",22,0,"[Marina A Belova, Joseph J Paton]",
460,Marina A Belova,16271457,32,"Beetles, boxes and brain cells: neural mechani...",Current opinion in neurobiology,Curr Opin Neurobiol,2005,doi.org/10.1016/j.conb.2005.10.016,Sensory cues in the environment can predict th...,"[Center for Neurobiology and Behavior, Departm...",22,0,"[C Daniel Salzman, Joseph J Paton]",


***
### Abstract Word Count

A simple first analysis is to find the most commonly used words in each query.

First, remove all the stop words from the abstracts. As one [NLTK stop-words tutorial](https://pythonspot.com/nltk-stop-words/) puts it:

>Stop words are common words like ‘the’, ‘and’, ‘I’, etc. that are very frequent in text, and so don’t convey insights into the specific topic of a document. We can remove these stop words from the text in a given corpus to clean up the data, and identify words that are more rare and potentially more relevant to what we’re interested in.

Next, all punctuation, digits, and unnecessary characters are removed.

Lastly, each word will be lemmatized, which is defined by [Stanford's NLP group](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) as:

> Removing inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. 

For example, in English, the verb 'to walk' may appear as 'walk', 'walked', 'walks' or 'walking'. The base form, 'walk', that one might look up in a dictionary, is called the lemma for the word. 

This program uses [WordNet](https://wordnet.princeton.edu/), a large, freely and publicly available lexical database for the English language.
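
The three steps above (stop-word removal, character cleanup, lemmatization) can be sketched with the standard library alone. This is a simplified illustration, not the `lemming_count` implementation: `STOP_WORDS` is a tiny stand-in list, `naive_lemma` is a crude plural-stripping rule used in place of WordNet, and `top_words` and the sample abstracts are made up for the example.

```python
import string
from collections import Counter

# Tiny stand-in stop list (assumption: the notebook uses NLTK's full list).
STOP_WORDS = {"the", "and", "a", "of", "in", "to", "is", "are", "that", "we"}


def naive_lemma(word):
    """Crude stand-in for lemmatization: strips a trailing plural 's' only."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def top_words(abstracts, stop_words=STOP_WORDS, lemmatize=naive_lemma, n=20):
    """Return the n most frequent lemmas across abstracts, stop words excluded."""
    strip_chars = str.maketrans("", "", string.punctuation + string.digits)
    counts = Counter()
    for abstract in abstracts:
        # Lowercase, drop punctuation and digits, then split on whitespace.
        for token in abstract.lower().translate(strip_chars).split():
            if token not in stop_words:
                counts[lemmatize(token)] += 1
    return [word for word, _ in counts.most_common(n)]


sample_abstracts = [
    "Neurons in the amygdala encode the value of stimuli.",
    "We show that amygdala neurons track 2 reward states.",
]
top_two = top_words(sample_abstracts, n=2)
print(top_two)
```

With NLTK installed and its data downloaded, `set(stopwords.words('english'))` and `WordNetLemmatizer().lemmatize` can be passed as `stop_words` and `lemmatize` to get closer to the behavior described above.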


In [31]:
from helper.wordCount import lemming_count
lemming_count(searchesHash, top_n_words=20)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/rahimhashim/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rahimhashim/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/rahimhashim/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw to /Users/rahimhashim/nltk_data...
[nltk_data]   Package omw is already up-to-date!


Stefano Fusi ['neuron', 'memory', 'neural', 'model', 'synaptic', 'task', 'learning', 'activity', 'network', 'number', 'time', 'synapsis', 'response', 'state', 'representation', 'selectivity', 'show', 'input', 'variable', 'reconstruction']
Daniel Salzman ['stimulus', 'amygdala', 'neural', 'neuron', 'aversive', 'reward', 'information', 'value', 'response', 'spatial', 'learning', 'brain', 'process', 'attention', 'different', 'ofc', 'state', 'monkey', 'representation', 'sensory']


***
### Year-by-Year Analysis

The next analysis we'll perform is year-by-year lexicon frequency. In the example queries provided above, we introduced two authors (Salzman CD, Fusi S) and captured all meta-info (title, journal, date published, etc.) and, importantly, the abstracts of each author's articles. 

In pre-processing we'll __separate stop words__, defined by [NLTK](https://www.nltk.org/book/ch02.html) as: *high-frequency words like 'the', 'to' and 'also' that we sometimes want to filter out of a document before further processing. Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts.* 

The data will still be preserved in yearsHash\[year\]\['stopWords'\] if you're interested in analyzing something like self- vs. other-referential pronoun usage.  
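
As an illustration of the separation described above, the sketch below groups abstract tokens by publication year and keeps stop words in their own bucket. It is a hypothetical example, not the `yearExaminer` implementation: only the `'stopWords'` key comes from this notebook, while the `'words'` key, the stop list, `words_by_year`, and the sample data are assumptions, and it does not attempt to reproduce the per-year values printed by the cell below.

```python
from collections import defaultdict

# Small stand-in stop list that includes pronouns (assumption).
STOP_WORDS = {"the", "and", "to", "also", "that", "we", "they", "our", "their"}


def words_by_year(articles, stop_words=STOP_WORDS):
    """Group tokens by year, keeping stop words in a separate bucket.

    articles: iterable of (publication_date, abstract) pairs, where the date
    starts with a four-digit year (e.g. '2023/04/10' or '2005').
    """
    years = defaultdict(lambda: {"words": [], "stopWords": []})
    for publication_date, abstract in articles:
        year = publication_date[:4]
        for raw_token in abstract.lower().split():
            token = raw_token.strip(".,;:()")
            if not token:
                continue
            bucket = "stopWords" if token in stop_words else "words"
            years[year][bucket].append(token)
    return years


sample_articles = [
    ("2023/04/10", "We show that neurons encode value."),
    ("2005", "They found that the amygdala tracks reward."),
]
example_years = words_by_year(sample_articles)
print(example_years["2023"]["stopWords"], example_years["2005"]["stopWords"])
```

Because the stop words are retained per year, a question such as how often "we" versus "they" appears over time can be answered by counting within each `'stopWords'` list.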

In [32]:
from helper.yearExaminer import yearExaminer
yearsHash = yearExaminer(searchesHash)

[   ['2002', 0.875],
    ['2003', 0.639269406392694],
    ['2004', 0.7076023391812866],
    ['2005', 0.6471816283924844],
    ['2006', 0.6086105675146771],
    ['2007', 0.5819209039548022],
    ['2008', 0.5526770293609672],
    ['2009', 0.6283524904214559],
    ['2010', 0.3713527851458886],
    ['2011', 0.6042780748663101],
    ['2012', 0.6736401673640168],
    ['2013', 0.5221518987341772],
    ['2014', 0.5025773195876289],
    ['2015', 0.6117103235747303],
    ['2016', 0.7366666666666667],
    ['2017', 0.5787878787878787],
    ['2018', 0.5463768115942029],
    ['2019', 0.726457399103139],
    ['2020', 0.43273350471293914],
    ['2021', 0.5798561151079137],
    ['2022', 0.647982062780269],
    ['2023', 0.6354679802955665]]
