# Pubmed Parser
### Textual Analysis

__Notebook Description:__ PubmedParser is a Python-based web scraper designed to scrape, parse, and analyze NCBI's online PubMed database (https://www.ncbi.nlm.nih.gov/pubmed). The scraper can be used in a number of ways; here it is used to perform a textual analysis of specific researchers' abstracts. <br>
__Notebook Owner:__ Rahim Hashim <br>
__Contact:__ rh2898@columbia.edu <br>
__\*Dependencies:__ To reproduce the environment used in this notebook, create and activate a virtual environment from the *environment.yml* file contained in the directory (see [Managing Virtual Environments](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-from-an-environment-yml-file) for more details).

***

### Import Libraries
__Public List:__ os, re, sys, string, datetime, pandas, numpy, tqdm, nltk, collections, urllib, bs4, unidecode<br>
__Additional Code:__ Regions

After following the Managing Virtual Environments directions above, all of the libraries in the public list will be installed. Regions.py, included in the directory, provides the rest of the functions used in this codebase.

In [1]:
%load_ext autoreload
%autoreload 2
import os
import re
import sys
import string
import datetime
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
# For download info / documentation on Natural Language Toolkit (nltk):
#    https://www.nltk.org/
import nltk
nltk.download('punkt') # only needs to be downloaded once
nltk.download('stopwords') # only needs to be downloaded once
from collections import defaultdict

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/rahimhashim/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rahimhashim/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


***

### Assigning Search Term Parameters

First, we assign the search parameters for scraping. In particular, test the key terms you will be searching for in the database of choice, then assign them to the SearchParameters.searchTerms attribute.

In [2]:
from searchObjects import SearchParameters

parameters = {}
# Database : Specified NCBI database
#   Options = Pubmed [pubmed] | Pubmed Central [PMC] | Unigene [Unigene] | Others [Look Up Key]
parameters['database'] = 'pubmed'
# SearchTerms : PubMed desired search term(s)
parameters['searchTerms'] = ['Salzman CD']
# searchLimit : Number of articles that the program will search through for each search term
parameters['searchLimit'] = 100
# startIndex : Index of the article at which to start the search (e.g. startIndex = 10 skips the most recent articles before that index)
parameters['startIndex'] = 0
# abstractFlag : Flag to hide (0) or capture (1) abstract info in last column
parameters['abstractFlag'] = 1
# emailFilter : Filter for contact-capturing - (0) for all results, (1) for articles with author emails only 
parameters['emailFilter'] = 0
# geographyFilter : Filter to show results only from specified [country1, country2...] or keep all countries ([])
# i.e. --> geographyFilter = ['US', 'USA', 'United States', 'United States of America']
parameters['geographyFilter'] = []
# authorTermSearch : When search term is author name, (1) results match last name of search terms only or (2) all authors
parameters['authorTermSearch'] = 2
# authorScoreFlag : Flag to keep off (0) or turn on (1) author score column
parameters['authorScoreFlag'] = 0

searchParameters = SearchParameters(parameters)
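
The SearchParameters class lives in searchObjects and is not shown in this notebook. As a rough sketch of the idea only, a container like it could copy the dictionary onto attributes and reject obviously bad values. The class name `SearchParametersSketch`, its defaults, and its validation rules are assumptions made for illustration, not the actual implementation.

```python
import copy


class SearchParametersSketch:
    """Hypothetical stand-in for searchObjects.SearchParameters."""

    # Assumed defaults, mirroring the values assigned in the cell above
    DEFAULTS = {
        'database': 'pubmed',
        'searchTerms': [],
        'searchLimit': 100,
        'startIndex': 0,
        'abstractFlag': 1,
        'emailFilter': 0,
        'geographyFilter': [],
        'authorTermSearch': 2,
        'authorScoreFlag': 0,
    }

    def __init__(self, parameters):
        # Catch misspelled keys early instead of silently ignoring them
        unknown = set(parameters) - set(self.DEFAULTS)
        if unknown:
            raise KeyError('Unknown parameter(s): ' + ', '.join(sorted(unknown)))
        # deepcopy so the list defaults are not shared between instances
        merged = copy.deepcopy(self.DEFAULTS)
        merged.update(parameters)
        if merged['searchLimit'] < 1 or merged['startIndex'] < 0:
            raise ValueError('searchLimit must be >= 1 and startIndex >= 0')
        # Expose every parameter as an attribute, e.g. sketch.searchTerms
        for name, value in merged.items():
            setattr(self, name, value)


sketch = SearchParametersSketch({'searchTerms': ['Salzman CD'], 'searchLimit': 100})
print(sketch.database, sketch.searchTerms, sketch.searchLimit)
```

Attribute access (`sketch.searchTerms`) matches how the later cells read `searchParameters.searchTerms`; everything else about the real class is unknown here.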

***
### Generating List of Database Search Result URLs

Using the [NCBI Entrez API](https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch), eSearchLinkGenerator generates an XML document containing the list of URLs for the articles returned by each term in searchParameters.searchTerms, up to the number of articles specified by searchLimit. resultsList generates a nested list of all P

In [98]:
from entrezParser import eSearchLinkGenerator, PMID_ListGenerator

def main():
    eSearchCore = 'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/esearch.fcgi/?db=&term=&retmax=&retstart='
    eSearchLinkList = eSearchLinkGenerator(eSearchCore, searchParameters)
    resultsList = PMID_ListGenerator(eSearchLinkList)
    return(resultsList)

resultsList = main()

Generating Entrez XML...
   [Salzman CD] complete

Generating list of PMIDs...
  Salzman CD[Author]: 35 results
[['https://www.ncbi.nlm.nih.gov/pubmed/31871162', 'https://www.ncbi.nlm.nih.gov/pubmed/29849148', 'https://www.ncbi.nlm.nih.gov/pubmed/29525574', 'https://www.ncbi.nlm.nih.gov/pubmed/29459764', 'https://www.ncbi.nlm.nih.gov/pubmed/28683271', 'https://www.ncbi.nlm.nih.gov/pubmed/26479590', 'https://www.ncbi.nlm.nih.gov/pubmed/26291167', 'https://www.ncbi.nlm.nih.gov/pubmed/26240431', 'https://www.ncbi.nlm.nih.gov/pubmed/26240417', 'https://www.ncbi.nlm.nih.gov/pubmed/26140594', 'https://www.ncbi.nlm.nih.gov/pubmed/25471563', 'https://www.ncbi.nlm.nih.gov/pubmed/25358090', 'https://www.ncbi.nlm.nih.gov/pubmed/25297102', 'https://www.ncbi.nlm.nih.gov/pubmed/23377126', 'https://www.ncbi.nlm.nih.gov/pubmed/23303950', 'https://www.ncbi.nlm.nih.gov/pubmed/23189037', 'https://www.ncbi.nlm.nih.gov/pubmed/22145876', 'https://www.ncbi.nlm.nih.gov/pubmed/21943608', 'https://www.ncbi.nlm.
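
The search step above can be sketched with the standard library alone. The example below is not the code in entrezParser; `build_esearch_url` and `pmids_to_links` are hypothetical helpers, and the XML string is a hand-written, abbreviated response in the ESearch layout. It shows how the `db`, `term`, `retmax`, and `retstart` query parameters could be filled in and how the returned `<Id>` entries map to article URLs.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ESEARCH_BASE = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'


def build_esearch_url(term, db='pubmed', retmax=100, retstart=0):
    """Build an ESearch URL for one search term (hypothetical helper)."""
    query = {'db': db, 'term': term, 'retmax': retmax, 'retstart': retstart}
    # urlencode escapes spaces and brackets in the search term
    return ESEARCH_BASE + '?' + urlencode(query)


def pmids_to_links(esearch_xml):
    """Turn the <Id> entries of an ESearch XML response into article URLs."""
    root = ET.fromstring(esearch_xml)
    return ['https://www.ncbi.nlm.nih.gov/pubmed/' + id_node.text
            for id_node in root.findall('./IdList/Id')]


# Abbreviated, hand-written response using two PMIDs from the output above
sample_xml = '''<eSearchResult>
  <Count>35</Count>
  <IdList>
    <Id>31871162</Id>
    <Id>29849148</Id>
  </IdList>
</eSearchResult>'''

url = build_esearch_url('Salzman CD[Author]', retmax=100, retstart=0)
links = pmids_to_links(sample_xml)
print(url)
print(links)
```

Here `retmax` plays the role of searchLimit and `retstart` the role of startIndex; that mapping is an assumption based on the parameter names in the `eSearchCore` template.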

***
### Parsing Data

For each searchTerm provided to searchParameters, dataParser will take each of the article URLs and parse the specified information, inserting it into a multi-nested dictionary queriesHash to be further analyzed.

In [69]:
from linksParser import linksParser

def dataParser(resultsList):
    '''
    dataParser creates a multi-nested dictionary
      queriesHash
        | 
        queriesHash[query]
          |
          queriesHash[query][PMID]
            |
            articleTitle
            journalTitle
            datePublished
            ...
    '''
    print('\nParsing Info for search terms...')
    queriesHash = defaultdict(lambda: defaultdict(list)) # primary key = pubmed query
    searchesHash = defaultdict(lambda: defaultdict(list)) # primary key = PMID
    for a_index, termLinks in enumerate(resultsList):
        query = searchParameters.searchTerms[a_index]
        searchesHash = linksParser(a_index, termLinks, searchParameters,
                                   query, searchesHash)
        queriesHash[query] = searchesHash
    return queriesHash

searchesHash = dataParser(resultsList)

  0%|          | 0/35 [00:00<?, ?it/s]


Parsing Info for search terms...
  Salzman CD


100%|██████████| 35/35 [00:57<00:00,  1.65s/it]
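
To make the nested structure concrete, here is a minimal sketch that builds one record by hand. The PMID, author name, and field values are made up, and the date format is an assumption (only its first four characters being the year is implied by the notebook). The constructor is the same `defaultdict(lambda: defaultdict(list))` used in dataParser.

```python
from collections import defaultdict

# Same constructor as searchesHash in dataParser: PMID -> field name -> value
searchesSketch = defaultdict(lambda: defaultdict(list))

# Made-up record for illustration only
searchesSketch['00000001']['articleTitle'] = 'Example title'
searchesSketch['00000001']['datePublished'] = '2019-12-23'
searchesSketch['00000001']['abstract'] = 'An example abstract.'

# Outer level: search term -> that term's records
queriesSketch = {'Example Author': searchesSketch}

record = queriesSketch['Example Author']['00000001']
year = record['datePublished'][:4]
print(year)

# Checking membership with `in` does not create a key...
has_unknown = '99999999' in searchesSketch
# ...but reading a missing key does: this adds an empty list under 'journalTitle'
missing_field = record['journalTitle']
print(has_unknown, missing_field)
```

The last two lines show the main trade-off of defaultdict: inserts need no setup, but a mistyped key on a read silently creates an empty entry instead of raising KeyError.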


In [97]:
def yearExaminer(searchesHash):
    '''
    yearExaminer groups the tokens of each abstract by publication year
      yearsHash[year]['punctuation'] : punctuation tokens
      yearsHash[year]['wordsKept']   : tokens that are not stop words
      yearsHash[year]['wordsLeft']   : stop words
    '''
    stopWords = set(nltk.corpus.stopwords.words('english')) # set for fast lookups
    punctuation = string.punctuation
    yearsHash = defaultdict(lambda: defaultdict(list))
    for query in searchesHash.keys():
        # second-level keys are PMIDs (see dataParser)
        for pmid in searchesHash[query].keys():
            article = searchesHash[query][pmid]
            year = article['datePublished'][:4]
            # word_tokenize splits off punctuation other than periods
            abstract = nltk.tokenize.word_tokenize(article['abstract'].lower())
            for word in abstract:
                if word in punctuation:
                    yearsHash[year]['punctuation'].append(word)
                elif word not in stopWords:
                    yearsHash[year]['wordsKept'].append(word)
                else:
                    yearsHash[year]['wordsLeft'].append(word)
    return yearsHash
            
yearsHash = yearExaminer(searchesHash)
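
The three-way split done by yearExaminer can be tried without downloading any NLTK data. The sketch below swaps in a regular-expression tokenizer and a tiny hand-picked stop word set in place of `nltk.tokenize.word_tokenize` and `nltk.corpus.stopwords`, so its tokens will differ from NLTK's on real abstracts. The function name `classify_tokens` and the sample sentences are made up for illustration.

```python
import re
import string
from collections import defaultdict

# Tiny hand-picked stop word set; the notebook uses nltk.corpus.stopwords instead
STOP_WORDS = {'the', 'of', 'in', 'and', 'a', 'is', 'to'}


def classify_tokens(abstracts_by_year):
    """Group tokens by year into punctuation, kept words, and stop words."""
    years = defaultdict(lambda: defaultdict(list))
    for year, abstract in abstracts_by_year:
        # \w+ matches a run of word characters, [^\w\s] a single punctuation mark
        for token in re.findall(r"\w+|[^\w\s]", abstract.lower()):
            if token in string.punctuation:
                bucket = 'punctuation'
            elif token not in STOP_WORDS:
                bucket = 'wordsKept'
            else:
                bucket = 'wordsLeft'
            years[year][bucket].append(token)
    return years


# Made-up sentences standing in for two abstracts from the same year
sample = [('2019', 'The amygdala encodes value.'),
          ('2019', 'Neurons in the amygdala respond to reward.')]
result = classify_tokens(sample)
print(dict(result['2019']))
```

Because both sample abstracts share a year, their tokens land in the same three lists, which is how yearExaminer pools all of a year's abstracts together.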



In [50]:
# Compute the frequency distribution of the tokens in each abstract
from nltk.probability import FreqDist

# abstract_text was never defined in this notebook (the original run raised a NameError).
# It is assumed here to be one token list per abstract, rebuilt from the parsed
# results in searchesHash[query][PMID]['abstract']
abstract_text = [nltk.tokenize.word_tokenize(searchesHash[query][pmid]['abstract'])
                 for query in searchesHash for pmid in searchesHash[query]]

stopWords = set(nltk.corpus.stopwords.words('english'))
fdist_list = []
for abstract in abstract_text:
    # lowercase, then drop single-character tokens and stop words
    abstract = [word.lower() for word in abstract if len(word) > 1]
    abstract = [word for word in abstract if word not in stopWords]
    fdist_list.append(FreqDist(abstract))

print('Total Abstracts Captured:', len(abstract_text))

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 8))

# Plot the word frequencies of a single abstract (index 1 is the second abstract)
fdist_list[1].plot()

In [None]:
from collections import Counter, OrderedDict

# Sum the per-abstract frequency distributions into one overall word count
total_counts = Counter()
for fdist in fdist_list:
    total_counts.update(fdist)
# Order the words from most to least frequent
total_freq_dict = OrderedDict(total_counts.most_common())
print(total_freq_dict)
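
As a small worked example of the aggregation in this cell, the sketch below uses plain `collections.Counter` objects in place of FreqDist (in NLTK 3, FreqDist is a Counter subclass, so the arithmetic is the same). The token lists are made up.

```python
from collections import Counter

# Made-up token lists standing in for two filtered abstracts
abstracts = [['amygdala', 'value', 'reward', 'value'],
             ['amygdala', 'reward', 'neurons']]

# One frequency table per abstract
per_abstract = [Counter(tokens) for tokens in abstracts]

# update() adds counts key by key, so the totals accumulate across abstracts
total = Counter()
for counts in per_abstract:
    total.update(counts)

# most_common() returns (word, count) pairs sorted from highest to lowest count
ranked = total.most_common()
print(ranked)
```

Accumulating into a single Counter avoids rebuilding a dictionary on every iteration, which matters once the number of abstracts grows.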

In [None]:
import matplotlib.pyplot as plt

# Plot the total frequency of each word across all abstracts
f, ax = plt.subplots(1, figsize=(25,10))
# Convert the dict views to lists so matplotlib and numpy can index them
word_list = list(total_freq_dict.keys())
word_count = list(total_freq_dict.values())
ax.bar(np.arange(len(word_list)), word_count)
ax.set_xticks(np.arange(len(word_list)))
ax.set_xticklabels(word_list, rotation=90, ha='right', fontsize=7)
ax.set_xlabel('Word', fontweight='bold', labelpad=10)
ax.set_ylabel('Total Count', fontweight='bold', labelpad=10)

plt.show()