# Pubmed Parser
### Textual Analysis

__Notebook Description:__ PubmedParser is a Python-based web scraper designed to scrape, parse, and analyze NCBI's online PubMed database (https://www.ncbi.nlm.nih.gov/pubmed). The scraper can be used in a number of ways, one of which is used here to perform a textual analysis of specific researchers' abstracts. <br>
__Notebook Owner:__ Rahim Hashim <br>
__Contact:__ rh2898@columbia.edu <br>

***

### Import Libraries
__Public List:__ os, re, sys, string, datetime, pandas, numpy, tqdm, nltk, collections, urllib, bs4, unidecode, matplotlib<br>
__Additional Code:__ Regions

After following the directions for Managing Virtual Environments above, all of the libraries from the public list will be set up. Regions.py included in the directory will provide the rest of the functions used in this codebase. 

In [1]:
%load_ext autoreload
%autoreload 2
import os
import re
import sys
import string
import datetime
import pprint
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
# For download info / documentation on Natural Language Toolkit (nltk):
#    https://www.nltk.org/
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from collections import defaultdict

# if running on Google Colab, mount Google Drive
if 'google.colab' in str(get_ipython()): 
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Mounted at /content/drive


***

### Assigning Search Term Parameters

First we will assign the search parameters for scraping. In particular, test the key terms you will be searching for in the database of choice, and then assign them to the SearchParameters.searchTerms attribute.

In [2]:
sys.path.append('/content/drive/MyDrive/Projects/pubmed-parser/Pubmed-Parser/helper') # set to path of notebook
from searchObjects import SearchParameters

parameters = {}
# Database : Specified NCBI database
#   Options = Pubmed [pubmed] | Pubmed Central [PMC] | Unigene [Unigene] | Others [Look Up Key]
parameters['database'] = 'pubmed'
# SearchTerms : PubMed desired search term(s)
parameters['searchTerms'] = ['Salzman CD', 'Fusi S']
# searchLimit : Max number of articles for each search term
parameters['searchLimit'] = 100
# StartIndex : The start index for the search (larger for older papers)
parameters['startIndex'] = 0

searchParameters = SearchParameters(parameters)

***
### Generating List of Database Search Result URLs

Using the [NCBI Entrez API](https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch), eSearchLinkGenerator generates the ESearch XML listing the articles returned for each term in searchParameters.searchTerms, up to the number of articles specified by searchLimit. PMID_ListGenerator then builds resultsList, a nested list of article URLs with one sub-list per search term.
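
As a rough illustration of the ESearch step, the sketch below builds one query URL per search term from the same four parameters that appear in eSearchCore (db, term, retmax, retstart). The helper name `build_esearch_url` is hypothetical and is not the notebook's eSearchLinkGenerator, whose internals are not shown here; the `[Author]` field tag is assumed from the printed results.

```python
from urllib.parse import urlencode

# Base endpoint of the Entrez ESearch utility.
ESEARCH_BASE = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

def build_esearch_url(database, term, retmax, retstart=0):
    '''Hypothetical helper: returns one ESearch URL for a single search term.'''
    query = {
        'db': database,        # e.g. 'pubmed'
        'term': term,          # e.g. 'Salzman CD[Author]'
        'retmax': retmax,      # max number of IDs to return
        'retstart': retstart,  # index of the first ID to return
    }
    # urlencode escapes spaces and brackets so the term is URL-safe
    return ESEARCH_BASE + '?' + urlencode(query)

urls = [build_esearch_url('pubmed', term + '[Author]', 100)
        for term in ['Salzman CD', 'Fusi S']]
for url in urls:
    print(url)
```

Passing the parameters through `urlencode` avoids hand-assembling the query string, which is where stray slashes and unescaped spaces tend to creep in.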

In [3]:
from entrezParser import eSearchLinkGenerator, PMID_ListGenerator

def entrezSearch():
    '''
    entrezSearch generates resultsList, which is
    a list of all article URLs for each search term.
    '''
    eSearchCore = 'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/esearch.fcgi/?db=&term=&retmax=&retstart='
    eSearchLinkList = eSearchLinkGenerator(eSearchCore, searchParameters)
    resultsList = PMID_ListGenerator(eSearchLinkList)
    return(resultsList)

resultsList = entrezSearch()

Generating Entrez XML...
   [Salzman CD] complete
   [Fusi S] complete

Generating list of PMIDs...
  Salzman CD[Author]: 37 results
  Fusi S[Author]: 100 results


***
### Parsing Data

For each searchTerm provided to searchParameters, dataParser will take each of the article URLs and parse the specified information, inserting it into a multi-nested dictionary queriesHash to be further analyzed.
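
To make the nesting concrete, here is a minimal sketch of the query -> PMID -> field layout. The PMID and field values are placeholders invented for illustration; only the 'abstract' and 'publication_date' keys are taken from the analysis code in this notebook, and the date format is an assumption (the year is read from the first four characters).

```python
from collections import defaultdict

# query -> PMID -> field -> value
queriesHash = defaultdict(lambda: defaultdict(dict))

# Placeholder record: the PMID and the field values are made up.
queriesHash['Fusi S']['00000000'] = {
    'abstract': 'Placeholder abstract text.',
    'publication_date': '2020-01-01',
}

# Walk the structure the same way the later analysis does.
for query, articles in queriesHash.items():
    for pmid, fields in articles.items():
        year = fields['publication_date'][:4]
        print(query, pmid, year)
```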

In [4]:
from linksParser import linksParser

def dataParser(resultsList):
    '''
    dataParser creates a multi-nested dictionary
      queriesHash
        | 
        queriesHash[query]
          |
          queriesHash[query][PMID]
            |
            articleTitle
            journalTitle
            datePublished
            ...
    '''
    print('\nParsing info for search terms...')
    queriesHash = defaultdict(lambda: defaultdict(list)) # primary key = pubmed query
    searchesHash = defaultdict(lambda: defaultdict(list)) # primary key = PMID
    for a_index, termLinks in enumerate(resultsList):
        query = searchParameters.searchTerms[a_index]
        searchesHash = linksParser(a_index, termLinks, searchParameters,
                                   query, searchesHash)
        queriesHash[query] = searchesHash
    return queriesHash

searchesHash = dataParser(resultsList)


Parsing info for search terms...
  Salzman CD


  Fusi S



### Year-by-Year Analysis

The first analysis we'll perform is year-by-year lexicon frequency. In the example queries provided above, we introduced two authors (Salzman CD, Fusi S) and captured all meta-info (title, journal, date published, etc.) and, importantly, the abstracts of each author's articles.
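
A toy version of the two per-year measures, most frequent words and the unique-to-total word ratio that the notebook reports as lexical diversity, might look like the sketch below. The word lists are made up and stand in for the pre-processed abstract words of each year; `summarize_year` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# Made-up tokens per year, standing in for pre-processed abstract words.
kept_words_by_year = {
    '2019': ['neuron', 'reward', 'neuron', 'amygdala', 'reward', 'neuron'],
    '2020': ['network', 'memory', 'synapse', 'memory'],
}

def summarize_year(words, top_n=2):
    '''Returns the top_n most frequent words and the unique/total word ratio.'''
    counts = Counter(words)
    # guard against an empty word list to avoid dividing by zero
    diversity = len(counts) / len(words) if words else 0.0
    return counts.most_common(top_n), diversity

for year in sorted(kept_words_by_year):
    top_words, diversity = summarize_year(kept_words_by_year[year])
    print(year, top_words, round(diversity, 3))
```

A ratio near 1 means almost every word in that year's abstracts is distinct, while a lower ratio means more repetition. Longer texts tend to score lower, so years with very different abstract counts are not directly comparable.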

In pre-processing we'll __separate stop words__, defined by [NLTK](https://www.nltk.org/book/ch02.html) as: *high-frequency words like 'the', 'to' and 'also' that we sometimes want to filter out of a document before further processing. Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts.* 

The data will still be preserved in yearsHash\[year\]\['stopWords'\] if you're interested in analyzing something like self- vs. other-referential pronoun usage.  
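
The separation step can be sketched with the standard library alone. This is a simplified stand-in and not the notebook's code: the stop word set is a tiny illustrative list instead of NLTK's English list, a regular expression replaces nltk.tokenize.word_tokenize, and lemmatization is omitted. The bucket names follow the yearsHash keys used in this notebook.

```python
import re
import string
from collections import defaultdict

# Tiny illustrative stop word list; the notebook uses NLTK's English list.
STOP_WORDS = {'the', 'to', 'also', 'we', 'of', 'in', 'and', 'a'}

def bucket_tokens(text):
    '''Sorts lower-cased tokens into punctuation, stop words, number words, and kept words.'''
    buckets = defaultdict(list)
    # Assumed tokenizer: runs of word characters, or single non-space symbols.
    for token in re.findall(r"\w+|[^\w\s]", text.lower()):
        if all(ch in string.punctuation for ch in token):
            buckets['punctuation'].append(token)
        elif token in STOP_WORDS:
            buckets['stopWords'].append(token)
        elif any(ch.isdigit() for ch in token):
            buckets['numberWords'].append(token)
        else:
            buckets['keptWords'].append(token)
    return buckets

buckets = bucket_tokens('We recorded 32 neurons in the amygdala, and also in V1.')
for name, tokens in buckets.items():
    print(name, tokens)
```

Because nothing is discarded, the stop words remain available for separate analysis, such as the pronoun comparison mentioned above.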

In [5]:
from collections import Counter
from nltk.probability import FreqDist
import matplotlib.pyplot as plt

from nltk.stem import WordNetLemmatizer

def freqPlotter(yearsHash):
    '''
    freqPlotter plots the 20 most frequent
    kept words for each year
    '''
    for year in yearsHash.keys():
        fdist = FreqDist(yearsHash[year]['keptWords'])
        # the title is passed to plot(), which draws the figure before returning
        fdist.plot(20, title=str(year))

def lexPlotter(lex_scores_list):
    '''
    bar graph from a list of [year, lexical score] pairs
    x = year
    y = lexical score
    '''
    years, scores = zip(*sorted(lex_scores_list))
    figure, axis = plt.subplots(1, 1)
    axis.bar(years, scores)
    axis.set_xlabel('year')
    axis.set_ylabel('lexical diversity')
    plt.show()

def lexical_diversity(text):
    '''
    lexical_diversity returns the ratio of unique
    words to total words in a body of text
    '''
    word_count = len(text)
    if word_count == 0:
        # no words for this year, so avoid dividing by zero
        return 0.0
    vocab_size = len(set(text))
    diversity_score = vocab_size / word_count
    return diversity_score

def wordFreq(yearsHash):
    '''
    wordFreq prints the lexical diversity
    score for each year
    '''
    lex_scores_list = []
    for year in yearsHash.keys():
        text = yearsHash[year]['keptWords']
        lex_scores_list.append([year, lexical_diversity(text)])
        # disabled: relative word frequencies, ordered most freq -> less freq
        # counts = Counter(text)
        # total_words = sum(counts.values())
        # word_freq = [(word, n / total_words) for word, n in counts.most_common()]
    # disabled: plots
    # freqPlotter(yearsHash)
    # lexPlotter(lex_scores_list)
    pp = pprint.PrettyPrinter(indent=4)
    pp.pprint(lex_scores_list)
        
def yearExaminer(searchesHash):
    '''
    yearExaminer buckets the results for each searchTerm 
    by year of publication and analyzes the lexicon used
    in the abstracts.
    
    pre-processing: 
       1) lower-case for all words
       2) tokenize to split words from punctuation
       3) separates all punctuation to yearsHash[year]['punctuation']
       4) separates all stop words to yearsHash[year]['stopWords']
       5) separates words that contain numbers to yearsHash[year]['numberWords']
       6) lemmatize all remaining words into yearsHash[year]['keptWords']
             - A lemma (plural lemmas or lemmata) is the canonical form, 
               dictionary form, or citation form of a set of words.
    
    analysis:
       1) most frequent 20 words year-over-year
       2) lexical diversity year-over-year
    '''
    stopWords = nltk.corpus.stopwords.words('english')
    punctuation = string.punctuation
    yearsHash = defaultdict(lambda: defaultdict(list))
    lemmatizer = WordNetLemmatizer()
    for query in searchesHash.keys():
      for article in searchesHash[query].keys():
        # word_tokenize splits off punctuation other than periods
        abstract = nltk.tokenize.word_tokenize(searchesHash[query][article]['abstract'].lower()) #1 and #2
        year = searchesHash[query][article]['publication_date'][:4]
        for word in abstract:
          if word in punctuation:
            yearsHash[year]['punctuation'].append(word) #3
          elif word in stopWords:
            yearsHash[year]['stopWords'].append(word) #4
          elif any(map(str.isdigit, word)):
            yearsHash[year]['numberWords'].append(word) #5
          else:
            word = lemmatizer.lemmatize(word) #6
            yearsHash[year]['keptWords'].append(word)
          # open question: also bucket by query?
          # yearsHash[year]['query'] = query
    wordFreq(yearsHash)
    return yearsHash
            
yearsHash = yearExaminer(searchesHash)

[   ['2020', 0.2671957671957672],
    ['2019', 0.3613678373382625],
    ['2018', 0.2848984771573604],
    ['2017', 0.28939393939393937],
    ['2015', 0.2986577181208054],
    ['2014', 0.2767857142857143],
    ['2013', 0.2619397501836885],
    ['2012', 0.3396057347670251],
    ['2011', 0.30213903743315507],
    ['2010', 0.3054156171284635],
    ['2009', 0.31090289608177174],
    ['2008', 0.29103343465045595],
    ['2007', 0.2860991379310345],
    ['2006', 0.30430528375733856],
    ['2005', 0.30255516840882696],
    ['1994', 0.330188679245283],
    ['1993', 0.3295880149812734],
    ['1992', 0.2986111111111111],
    ['1990', 0.39285714285714285],
    ['2021', 0.2916666666666667],
    ['2016', 0.3651399491094148],
    ['2003', 0.2925531914893617],
    ['2004', 0.34552845528455284],
    ['2002', 0.3661616161616162],
    ['2001', 0.4],
    ['2000', 0.3352601156069364],
    ['1998', 0.2985386221294363],
    ['1999', 0.4485294117647059],
    ['1997', 0.37755102040816324],
    ['1995', 0.404761