# Pubmed Parser
### Textual Analysis

__Notebook Description:__ PubmedParser is a Python-based web scraper designed to scrape, parse, and analyze NCBI's online PubMed database (https://www.ncbi.nlm.nih.gov/pubmed). The scraper can be used in several ways; this notebook uses it to perform a textual analysis of specific researchers' abstracts. <br>
__Notebook Owner:__ Rahim Hashim <br>
__Contact:__ rh2898@columbia.edu <br>

***

### Import Libraries
__Public List:__ os, re, sys, string, datetime, pandas, numpy, tqdm, nltk, collections, urllib, bs4, unidecode, matplotlib<br>
__Additional Code:__ Regions

Once you have followed the Managing Virtual Environments directions above, all of the libraries in the public list will be installed. Regions.py, included in the directory, provides the remaining functions used in this codebase.

In [8]:
%load_ext autoreload
%autoreload 2
import os
import re
import sys
import string
import datetime
import pprint
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
from collections import defaultdict, Counter

# if running on Google Colab, mount Google Drive and add the project folder to the import path
if 'google.colab' in str(get_ipython()): 
  from google.colab import drive
  drive.mount('/content/drive', force_remount=True)
  sys.path.append('/content/drive/MyDrive/Projects/pubmed-parser/Pubmed-Parser') # set to path of notebook

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/rahimhashim/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rahimhashim/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/rahimhashim/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw to /Users/rahimhashim/nltk_data...
[nltk_data]   Package omw is already up-to-date!


***

### Assigning Search Term Parameters

First, we assign the search parameters for scraping. In particular, test the key terms you will be searching for in the database of your choice, and then assign them to the SearchParameters.searchTerms attribute.

In [3]:
list_professors = ['Kenny P Sinai', 'Nestler E Sinai', 'Akbarian S Sinai', 'Avena-Blanchard N Sinai', 'Baxter M Sinai', 'Benson D Sinai', 'Blanchard J Sinai', 'Brown S Sinai', 'Cai D Sinai', 'Castellano J Sinai', 'Clem R Sinai', 'Cropper E Sinai', 'De Araujo I Sinai', 'Ellis-Davies G Sinai', 'Friedel R Sinai', 'Goate A Sinai', 'Haghighi F Sinai', 'Hof P Sinai', 'Huntley G Sinai', 'Hurd Y Sinai', 'Maze I Sinai', 'Mobbs C Sinai', 'O\'Loughlin T Sinai', 'Rajan K Sinai', 'Raj T Sinai', 'Renton A Sinai', 'Rich E Sinai', 'Rudebeck P Sinai', 'Russo S Sinai', 'Saez I Sinai', 'Salton S Sinai', 'Schaefer A Sinai', 'Schiller D Sinai', 'Shen L Sinai', 'Shuman T Sinai', 'Slesinger P Sinai', 'Tzavaras N Sinai', 'Weiss K Sinai', 'Wu H Sinai', 'Wu X Sinai', 'Yang N Sinai', 'Zachariou V Sinai', 'Zou H Sinai']

In [41]:
from helper.searchObjects import SearchParameters

parameters = {}
# Database : Specified NCBI database
#   Options = Pubmed [pubmed] | Pubmed Central [PMC] | Unigene [Unigene] | Others [Look Up Key]
parameters['database'] = 'pubmed'
# SearchTerms : PubMed desired search term(s)
parameters['searchTerms'] = list_professors
# searchLimit : Max number of articles for each search term
parameters['searchLimit'] = 100
# StartIndex : The start index for the search (larger for older papers)
parameters['startIndex'] = 0

searchParameters = SearchParameters(parameters)
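
The `SearchParameters` class lives in `helper/searchObjects.py` and is not shown in this notebook. As a rough illustration of what such a container might look like, here is a minimal sketch; the class name `SearchParametersSketch`, its `from_dict` method, and the default values are assumptions made for this example, not the helper's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SearchParametersSketch:
    """Hypothetical stand-in for helper.searchObjects.SearchParameters."""
    database: str = 'pubmed'                               # NCBI database key
    searchTerms: List[str] = field(default_factory=list)   # one query per term
    searchLimit: int = 100                                 # max articles per term
    startIndex: int = 0                                    # offset into the results

    @classmethod
    def from_dict(cls, parameters: dict) -> 'SearchParametersSketch':
        # Reject unknown keys so a typo such as 'searchTerm' is not silently ignored.
        unknown = set(parameters) - set(cls.__dataclass_fields__)
        if unknown:
            raise KeyError(f'Unknown search parameter(s): {sorted(unknown)}')
        return cls(**parameters)


example = SearchParametersSketch.from_dict({
    'database': 'pubmed',
    'searchTerms': ['Yang N Sinai', 'Zou H Sinai'],
    'searchLimit': 100,
    'startIndex': 0,
})
print(example.searchTerms)
```

Validating the keys up front is a design choice for the sketch: a misspelled parameter fails immediately instead of falling back to a default.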

***
### Generating List of Database Search Result URLs

Using the [NCBI Entrez API](https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch), eSearchLinkGenerator generates the XML containing the list of articles returned for each term in searchParameters.searchTerms, up to the number of articles specified by searchLimit. PMID_ListGenerator then builds resultsList, a nested list of all article URLs (one inner list per search term).

In [50]:
from helper.entrezParser import eSearchLinkGenerator, PMID_ListGenerator

def entrezSearch():
  '''
  entrezSearch generates resultsList, which is
  a list of all article URLs for each search term.
  '''
  eSearchCore = 'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/esearch.fcgi/?db=&term=&retmax=&retstart='
  eSearchLinkList = eSearchLinkGenerator(eSearchCore, searchParameters)
  resultsList = PMID_ListGenerator(eSearchLinkList)
  return resultsList

resultsList = entrezSearch()

Generating Entrez XML...
   [Yang N Sinai] complete
   [Zachariou V Sinai] complete
   [Zou H Sinai] complete

Generating list of PMIDs...
http://eutils.ncbi.nlm.nih.gov/entrez//eutils/esearch.fcgi/?db=pubmed&term=Yang+N+Sinai&retmax=100&retstart=0
  Yang N[Author]: 42 results
http://eutils.ncbi.nlm.nih.gov/entrez//eutils/esearch.fcgi/?db=pubmed&term=Zachariou+V+Sinai&retmax=100&retstart=0
  Zachariou V[Author]: 26 results
http://eutils.ncbi.nlm.nih.gov/entrez//eutils/esearch.fcgi/?db=pubmed&term=Zou+H+Sinai&retmax=100&retstart=0
  Zou H[Author]: 39 results
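
The helper functions `eSearchLinkGenerator` and `PMID_ListGenerator` are not shown here. The sketch below illustrates the general idea under stated assumptions: it fills in the four ESearch query fields (`db`, `term`, `retmax`, `retstart`) with `urllib.parse.urlencode` and pulls the `<Id>` entries out of an ESearch-style XML response. The function names `build_esearch_url` and `extract_pmids` are hypothetical, the sample XML is a hand-written stand-in with placeholder IDs rather than real API output, and the base URL uses the documented `https` form rather than the exact string in the cell above.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ESEARCH_BASE = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'


def build_esearch_url(database, term, retmax, retstart):
    """Return an ESearch URL; urlencode turns spaces in the term into '+'."""
    query = urlencode({'db': database, 'term': term,
                       'retmax': retmax, 'retstart': retstart})
    return f'{ESEARCH_BASE}?{query}'


def extract_pmids(esearch_xml):
    """Collect the text of every <Id> element under <IdList>."""
    root = ET.fromstring(esearch_xml)
    return [id_node.text for id_node in root.findall('./IdList/Id')]


url = build_esearch_url('pubmed', 'Yang N Sinai', 100, 0)
print(url)

# Hand-written sample with placeholder IDs (not real PMIDs).
sample_xml = """
<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
  </IdList>
</eSearchResult>
"""
print(extract_pmids(sample_xml))
```

To run this against the live service, the XML would be fetched from `url` (for example with `urllib.request.urlopen`) before being passed to `extract_pmids`.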


***
### Parsing Data

For each searchTerm provided to searchParameters, dataParser will take each of the article URLs and parse the specified information, inserting it into a multi-nested dictionary queriesHash to be further analyzed.

In [43]:
from helper.linksParser import linksParser

def dataParser(resultsList):
  '''
  dataParser creates a multi-nested dictionary
    queriesHash
      | 
      queriesHash[query]
        |
        queriesHash[query][PMID]
          |
          articleTitle
          journalTitle
          dataPublished
          ...
  '''
  print('\nParsing info for search terms...')
  queriesHash = defaultdict(lambda: defaultdict(list)) # primary key = pubmed query
  for a_index, termLinks in enumerate(resultsList):
    query = searchParameters.searchTerms[a_index]
    searchesHash = linksParser(termLinks, searchParameters, query)
    queriesHash[query] = searchesHash  
  return queriesHash

searchesHash = dataParser(resultsList)


Parsing info for search terms...
  Yang N Sinai


  0%|          | 0/42 [00:00<?, ?it/s]

  Zachariou V Sinai


  0%|          | 0/26 [00:00<?, ?it/s]

  Zou H Sinai


  0%|          | 0/39 [00:00<?, ?it/s]
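
To make the nesting described in the docstring concrete, the following sketch builds a tiny dictionary of the same shape (query, then PMID, then article fields) and walks it. The PMID, the field values, and the `abstract` key are placeholders chosen for illustration; the real keys are whatever `linksParser` returns.

```python
from collections import defaultdict

# query -> PMID -> dict of article fields
queries_hash = defaultdict(dict)
queries_hash['Yang N Sinai']['00000001'] = {
    'articleTitle': 'Placeholder title',       # illustrative values only
    'journalTitle': 'Placeholder journal',
    'abstract': 'Placeholder abstract text.',  # hypothetical key name
}


def iter_field(nested, field_name):
    """Yield (query, pmid, value) for every article that has field_name."""
    for query, articles in nested.items():
        for pmid, record in articles.items():
            if field_name in record:
                yield query, pmid, record[field_name]


for query, pmid, title in iter_field(queries_hash, 'articleTitle'):
    print(query, pmid, title)
```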

***
### Abstract Word Count

A simple first analysis is to find the most commonly used words for each query.

First, remove all stop words from the abstracts. As described in this [NLTK stop words tutorial](https://pythonspot.com/nltk-stop-words/):

>Stop words are common words like ‘the’, ‘and’, ‘I’, etc. that are very frequent in text, and so don’t convey insights into the specific topic of a document. We can remove these stop words from the text in a given corpus to clean up the data, and identify words that are more rare and potentially more relevant to what we’re interested in.

Next, all punctuation, digits, and unnecessary characters are removed.

Lastly, each word will be lemmatized, which is defined by [Stanford's NLP group](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) as:

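> ... doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.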

For example, in English, the verb 'to walk' may appear as 'walk', 'walked', 'walks' or 'walking'. The base form, 'walk', that one might look up in a dictionary, is called the lemma for the word. 

This program uses [WordNet](https://wordnet.princeton.edu/), a large, freely and publicly available lexical database for the English language.


In [44]:
from helper.wordCount import lemming_count
lemming_count(searchesHash, top_n_words=20)

Yang N Sinai ['cell', 'mouse', 'patient', 'production', 'study', 'effect', 'result', 'level', 'model', 'treatment', 'disease', 'herbal', 'ashmi', 'asthma', 'ige', 'human', 'significantly', 'murine', 'method', 'food']
Zachariou V Sinai ['pain', 'protein', 'chronic', 'receptor', 'mouse', 'mechanism', 'model', 'brain', 'expression', 'signaling', 'role', 'injury', 'nac', 'also', 'opioid', 'morphine', 'level', 'action', 'gene', 'treatment']
Zou H Sinai ['cell', 'patient', 'axon', 'neuron', 'study', 'injury', 'also', 'wave', 'response', 'gbm', 'gene', 'growth', 'stem', 'method', 'model', 'result', 'msm', 'signaling', 'glioma', 'regeneration']
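
The three preprocessing steps described above (stop-word removal, stripping punctuation and digits, lemmatization) followed by a frequency count can be sketched with the standard library alone. This is not the code in `helper/wordCount.py`: the tiny `STOP_WORDS` set and the `LEMMA_TABLE` lookup are hand-made stand-ins for NLTK's stop-word list and WordNet, and `top_words` is a hypothetical name.

```python
import string
from collections import Counter

# Toy stand-ins for NLTK's stop-word list and WordNet-based lemmatization.
STOP_WORDS = {'the', 'and', 'in', 'of', 'a', 'to', 'was', 'were'}
LEMMA_TABLE = {'walked': 'walk', 'walks': 'walk', 'walking': 'walk',
               'mice': 'mouse', 'cells': 'cell'}
# Translation table that deletes every punctuation character and digit.
STRIP_TABLE = str.maketrans('', '', string.punctuation + string.digits)


def top_words(abstracts, top_n):
    """Return the top_n most frequent lemmas across the given abstracts."""
    counts = Counter()
    for abstract in abstracts:
        for token in abstract.lower().split():
            token = token.translate(STRIP_TABLE)
            if not token or token in STOP_WORDS:
                continue  # skip empty tokens (e.g. bare numbers) and stop words
            counts[LEMMA_TABLE.get(token, token)] += 1
    return [word for word, _ in counts.most_common(top_n)]


sample = ['The mice walked in 2 cells.', 'A mouse walks to the mouse.']
print(top_words(sample, top_n=3))  # ['mouse', 'walk', 'cell']
```

Swapping the toy tables for `nltk.corpus.stopwords` and `nltk.stem.WordNetLemmatizer` should give behavior closer to the notebook's output, at the cost of requiring the NLTK data downloads shown earlier.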


***
### Year-by-Year Analysis

The next analysis is year-by-year lexicon frequency. For the example queries above, we captured all meta-info (title, journal, date published, etc.) and, importantly, the abstracts of each author's articles.

In pre-processing we'll __separate stop words__, defined by [NLTK](https://www.nltk.org/book/ch02.html) as: *high-frequency words like 'the', 'to' and 'also' that we sometimes want to filter out of a document before further processing. Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts.* 

The data will still be preserved in yearsHash\[year\]\['stopWords'\] if you're interested in analyzing something like self- vs. other-referential pronoun usage.  

In [None]:
from helper.yearExaminer import yearExaminer
yearsHash = yearExaminer(searchesHash)
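
As a rough picture of the structure described above, the sketch below groups tokens by publication year and keeps stop words in a separate `'stopWords'` counter instead of discarding them. Only the `yearsHash[year]['stopWords']` layout comes from the text; the `'words'` key, the small `STOP_WORDS` set, the function name `group_words_by_year`, and the sample records are assumptions for illustration, not the behavior of `yearExaminer`.

```python
from collections import Counter, defaultdict

# Toy stop-word list; the notebook relies on NLTK's list instead.
STOP_WORDS = {'the', 'we', 'our', 'their', 'in', 'of', 'and'}


def group_words_by_year(records):
    """records: iterable of (year, abstract) pairs -> per-year word counters."""
    years_hash = defaultdict(lambda: {'words': Counter(), 'stopWords': Counter()})
    for year, abstract in records:
        for token in abstract.lower().split():
            token = token.strip('.,;:()')
            if not token:
                continue
            # Stop words are kept, just filed under a separate key.
            bucket = 'stopWords' if token in STOP_WORDS else 'words'
            years_hash[year][bucket][token] += 1
    return years_hash


records = [
    (2019, 'We studied pain in mice.'),
    (2020, 'Their model and our model agree.'),
]
years_hash = group_words_by_year(records)
print(years_hash[2020]['words'])
print(years_hash[2020]['stopWords'])
```

Keeping the stop words per year is what makes a later comparison such as "we/our" versus "their" possible without re-parsing the abstracts.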