# Automated literature search

__Author:__ \
Emanuel Lange \
Mehrdimensionale OMICS-Datenanalyse \
ISAS e.V. \
Bunsen-Kirchhoff-Straße 11 \
44139 Dortmund, Germany \
emanuel.lange@isas.de

__Last Revision:__ \
December 18, 2023

__License:__ \
MIT

__Objective:__ \
This script queries PubMed and identifies the references that are cited most often by the articles returned for the initial query.
It was used for a literature review on microbiome modeling (submission pending). The PubMed queries used in this review are listed below. The generated outputs for these queries are included in the directory './review_query_outputs'.

This was inspired by the project of Paula Martin Gonzalez: https://github.com/paulamartingonzalez/Targeted_Literature_Reviews_via_webscraping/tree/main

__How it works:__ \
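Articles are retrieved via the PubMed API (NCBI E-utilities, accessed through the `Entrez` module of Biopython). For each retrieved article, the PubMed ids of its references are read, the referenced articles are fetched as well, and the number of initial articles citing each reference is counted.
For interactive visualization we use the bokeh library (http://bokeh.org/). \
Visualizations are stored as html documents that can be viewed in any modern web browser; the article data are stored as xlsx files.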

__How to execute:__ \
Jupyter notebooks consist of cells (text and code). Press CTRL + ENTER to execute the selected cell, or SHIFT + ENTER to execute the selected cell and jump to the next one.

Before running this script, make sure you have installed and activated the provided conda environment.

Execute every cell in sections 1 and 2 to make all functions available.

# 1) Import Packages

Uncomment the line below if you work in Google Colab.

In [None]:
# !pip install biopython

In [1]:
from IPython.display import display # for showing dataframes in jupyter notebook
import openpyxl # xlsx export
import pandas as pd # dataframes and xlsx export
from Bio import Entrez # pubmed querying
import networkx as nx # calculating spring layout for graph
from time import time # measuring duration of search

# interactive visualization
from bokeh.io import output_file, show
from bokeh.models import (BoxZoomTool, Circle, HoverTool,
                          MultiLine, Plot, Range1d, ResetTool, WheelZoomTool, PanTool, GraphRenderer, StaticLayoutProvider, TapTool, NodesAndLinkedEdges)
from bokeh.palettes import Spectral4
from bokeh.transform import factor_cmap

# 2) Classes and Functions

## 2.1) Article Class

In [2]:
class Article:
  """Class to store article information retrieved from pubmed queries."""

  def __init__(self, pmid, doi, title, firstAuthorName, year, isReference, references):
    self.pmid = pmid
    self.doi = doi
    self.title = title
    self.firstAuthorName = firstAuthorName
    self.year = year
    self.isReference = isReference
    self.references = references # PubMed ids cited by this article
    self.incomingEdgeNumber = 0 # number of initial articles citing this article

  def getPropsAsDict(self):
    return self.__dict__

  def getPropsAsList(self):
    return [self.title, str(self.isReference), self.doi, str(self.incomingEdgeNumber), self.references]
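
Because `references` is a list, each article should own its own list instead of sharing one default object. The following is a minimal sketch of the same record as a dataclass; the name `ArticleRecord` and its snake_case field names are hypothetical and not part of this notebook.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ArticleRecord:
    """Hypothetical dataclass variant of Article (illustration only)."""
    pmid: int
    title: str
    doi: Optional[str] = None
    first_author: str = ''
    year: str = ''
    is_reference: bool = False
    # default_factory creates a fresh list for every instance
    references: List[int] = field(default_factory=list)
    incoming_edge_number: int = 0


# toy articles with made-up ids
a = ArticleRecord(pmid=1, title='Toy article A')
b = ArticleRecord(pmid=2, title='Toy article B')
a.references.append(2)

print(a.references)  # [2]
print(b.references)  # [] (not affected by the change to a)
print(asdict(b))     # all fields as a plain dict
```

`asdict` plays the role of `getPropsAsDict` here and always includes the edge counter.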

## 2.2) Functions for Pubmed search

In [12]:
def search(query, max_return):
    """Returns the PubMed ids of the max_return most relevant articles for the query."""
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            retmax=str(max_return),
                            retmode='xml',
                            term=query)
    results = Entrez.read(handle)
    handle.close()
    return results['IdList']

def fetch_details(id_list):
    """Fetches the full PubMed records for a list of PubMed ids."""
    id_string = ','.join(str(pmid) for pmid in id_list)
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=id_string)
    results = Entrez.read(handle)
    handle.close()
    return results

def get_ids_from_article_id_list(article_id_list):
    """Parses electronic article ids from the provided object"""
    id_types = [entry.attributes['IdType'] for entry in article_id_list]
    ids = [str(entry) for entry in article_id_list] # transform entries to str
    return dict(zip(id_types, ids)) # map id type (e.g. 'pubmed', 'doi') to id

def update_dict(updated_dict, dict_for_update):
  """Adds the entries of dict_for_update to updated_dict without overwriting existing keys."""
  for key in dict_for_update:
    if key in updated_dict:
      continue

    updated_dict.update({key: dict_for_update[key]}) # add reference entries to the article dict

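def update_and_save_articles(article_dict, output_file_prefix):
  """Counts how often each article is cited by the initial articles, displays the result
  and writes it to '<output_file_prefix>_list.xlsx'. Returns the data as a dict of columns."""
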
  # dict for excel output and network graph
  output_dict = {
      'pmid': [],
      'title': [],
      'first author': [],
      'year': [],
      'degree': [],
      'doi': [],
      'is reference': [],
      'references': []
  }
    
  for article_key in article_dict:

    if article_dict[article_key].isReference:
        continue
    
    # iterate over references of the initial articles and update the incoming edge number if a reference is included
    for reference_key in article_dict[article_key].references:
      if reference_key in article_dict:
        article_dict[reference_key].incomingEdgeNumber += 1

  for article_key in article_dict:
    article_obj = article_dict[article_key]
      
    output_dict['pmid'].append(article_obj.pmid)  
    output_dict['title'].append(article_obj.title)
    output_dict['first author'].append(article_obj.firstAuthorName)
    output_dict['year'].append(article_obj.year)
    output_dict['degree'].append(int(article_obj.incomingEdgeNumber))
    output_dict['doi'].append(article_obj.doi)
    output_dict['is reference'].append(str(article_obj.isReference))
    output_dict['references'].append(article_obj.references)

  output_data = pd.DataFrame.from_dict(output_dict)
  display(output_data)
  output_data.to_excel(output_file_prefix + '_list.xlsx')

  return output_dict

def get_articles_from_pmids(pmid_list, isReferenceSearch, batch_size):
    """Retrieves articles from PubMed based on a list of pmids.
    pmid_list: list of pubmed ids
    isReferenceSearch: boolean, whether the articles are references or not, set to true during the search for references
    batch_size: int, the number of articles to retrieve at once from the pubmed API"""

    article_dict = {} # dictionary for article objects
    number_of_references = 0 # total number of references
    number_of_missing_references = 0 # total number of references that do not have a pubmed id

    # iterate over pubmed ids in batches of batch_size
    for batch_start_entry in range(0, len(pmid_list), batch_size):

      batch_ids = pmid_list[batch_start_entry : batch_start_entry + batch_size]
      batch_results = fetch_details(batch_ids) # retrieves article objects from pubmed for the batch of provided ids

      # iterate over the article objects in the batch and write data into the article_dict
      for article_obj in batch_results['PubmedArticle']:
        # try:
          data = article_obj.get('MedlineCitation')
          title = data.get('Article').get('ArticleTitle')
          
          authorName = ''
          year = ''
          
          try:
            authorName = data.get('Article').get('AuthorList')[0].get('LastName')
            year = data.get('Article').get('ArticleDate')[0].get('Year')
          except:
            pass

          pubmed_data = article_obj.get('PubmedData') # pubmed metadata

          article_id_objects = pubmed_data.get('ArticleIdList') # object of electronic article ids

          try:
            id_obj = get_ids_from_article_id_list(article_id_objects) # get list of electronic article ids
          except:
            print('Article "' + title + '" was excluded, because it had no identifiers.')
            continue
          
          # check whether pubmed id is available
          try:
            pmid = int(id_obj['pubmed'])
          except:
            print('Article "' + title + '" was excluded, because it had no pubmed identifier.')
            continue

          doi = id_obj.get('doi') # will be None if doi is not available

          number_of_article_references = 0
          number_of_missing_article_references = 0 # count references that do not have a pubmed id
          paper_references = []

          # read reference objects from article object
          try:
            paper_references = pubmed_data.get('ReferenceList')[0]['Reference']
            number_of_article_references = len(paper_references)
          except:
            # print('Article ' + str(pmid) + ' has no references')
            pass

          reference_ids = []

          # read electronic reference ids from reference objects
          for ref in paper_references:
            try:
              reference_id_obj = get_ids_from_article_id_list(ref.get('ArticleIdList'))
              reference_pmid = int(reference_id_obj['pubmed']) # throws exception if no pubmed id
              if reference_pmid == 0:
                raise Exception('Zero identifier') # Some entries have zero as identifier, which should be excluded
              reference_ids.append(reference_pmid) # only append after the zero check has passed
            except:
              number_of_missing_article_references += 1
              continue

          # print(str(number_of_missing_article_references) + ' references excluded due to lack of identifiers for article ' + str(pmid))
          number_of_references += number_of_article_references
          number_of_missing_references += number_of_missing_article_references

          article_dict[pmid] = Article(
                  pmid,
                  doi,
                  title,
                  authorName,
                  year,
                  references=reference_ids,
                  isReference=isReferenceSearch)

        # except:
        #   print('Article excluded due to unknown reason')

    # computed after the batch loop, so the value is also defined for an empty pmid_list
    missing_ref_ratio = 0

    if number_of_references != 0:
      missing_ref_ratio = number_of_missing_references / number_of_references # calculate fraction of references without pubmed id

    return article_dict, missing_ref_ratio

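def get_reference_articles(article_dict, batch_size):
  """Retrieves the referenced articles of all articles in article_dict.
  article_dict: dict, maps pubmed ids to Article objects
  batch_size: int, the number of articles to retrieve at once from the pubmed API"""
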
  all_reference_article_dict = {}
  number_of_articles = len(article_dict)

  for index, article_key in enumerate(article_dict):
    print('Checking article ' + str(index+1) + ' of ' + str(number_of_articles), end='\r')
    article = article_dict[article_key]

    # retrieve references for each article
    if len(article.references) > 0:
      reference_articles, missing_ref_fraction = get_articles_from_pmids(article.references, True, batch_size)
      all_reference_article_dict.update(reference_articles)

  return all_reference_article_dict

def get_articles_and_primary_references(query, output_file_prefix, number_of_initial_articles, batch_size):
  """Function to retrieve articles and their references from PubMed.
  query: string, the query to search for
  output_file_prefix: string, the prefix for the output file
  number_of_initial_articles: int, the number of initial articles to retrieve
  batch_size: int, the number of articles to retrieve at once from the pubmed API
  """

  start_time = time()

  print('Started search for your query...')
  results_id_list = search(query, number_of_initial_articles) # get initial list of article ids
  initial_articles, missing_reference_fraction = get_articles_from_pmids(results_id_list, isReferenceSearch=False, batch_size=batch_size) # get initial article objects
  print('Done with initial articles, fetching references...')

  reference_articles = get_reference_articles(initial_articles, batch_size=batch_size) # get reference article objects

  end_time = time()

  print("I found " + str(len(reference_articles)) + " references in " + str(len(initial_articles)) + " initial articles for query " + query)
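  # missing_reference_fraction is the share of references cited by the initial articles that have no PubMed id
  print("%5.1f percent of the initial articles had no references in pubmed central." % (missing_reference_fraction * 100))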

  print("Search took " + str((end_time-start_time)/60) + " minutes")
  update_dict(initial_articles, reference_articles) # append references to the initial article dict

  print('Counting references and writing xlsx...')
  article_data_list = update_and_save_articles(initial_articles, output_file_prefix)

  print('Done')
  return article_data_list
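
The counting step in `update_and_save_articles` can be illustrated on toy data. The sketch below uses plain dictionaries instead of `Article` objects; the ids are made up and the helper name `count_incoming_edges` is hypothetical. It counts, for every retrieved article, how many initial articles cite it.

```python
from collections import Counter

# toy citation data (made-up ids): initial article -> PubMed ids it cites
initial_references = {
    101: [1, 2, 3],
    102: [2, 3],
    103: [3, 999],  # 999 was never retrieved, so it is not counted
}
retrieved_ids = {101, 102, 103, 1, 2, 3}


def count_incoming_edges(initial_references, retrieved_ids):
    """Counts how many initial articles cite each retrieved article."""
    degree = Counter({pmid: 0 for pmid in retrieved_ids})
    for cited_ids in initial_references.values():
        for cited in cited_ids:
            if cited in retrieved_ids:
                degree[cited] += 1
    return degree


degree = count_incoming_edges(initial_references, retrieved_ids)
print(degree.most_common(3))  # [(3, 3), (2, 2), (1, 1)]
```

Only references of the initial articles are counted, which mirrors the `isReference` check in the function above.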

## 2.3) Functions for interactive graph

In [4]:
def add_hover(p):
    """hover tooltip for data points"""

    # custom tooltip layout
    t = """
    <div @tooltip{custom} >
        <b>pubmed id: </b> @index <br>
        <b>title: </b> @title <br>
        <b>first author: </b> @firstAuthor <br>
        <b>year: </b> @year <br>
        <b>doi: </b> @doi <br>
        <b>in degree: </b> @degree <br>
    </div>
    <style>
    div.bk-tooltip-content > div > div:not(:first-child) {
        display:none !important;
    }
    </style>
    """

    # initiate and add hover tool to display tooltips
    hover = HoverTool()
    hover.tooltips = t
    p.add_tools(hover)

def build_graph(article_data_list, output_file_prefix):
  """Creates an interactive visualization of the reference network and writes it to an html file. The html file can be viewed in any modern web browser.
    article_data_list: dict, the output of get_articles_and_primary_references function
    output_file_prefix: string, the prefix for the output file
  """

  nodes = []
  edges = []  # edges to calculate spring layout

  # edge sources and targets for edge renderer
  edge_sources = []
  edge_targets = []
 
  # create nodes and edges for graph
  for article_index in range(0, len(article_data_list['pmid'])):

    article_key = article_data_list['pmid'][article_index]

    nodes.append(article_key)

    if article_data_list['is reference'][article_index] == "True":
      continue

    for ref in article_data_list['references'][article_index]:
      edges.append((article_key, ref))
      edge_sources.append(article_key)
      edge_targets.append(ref)

  # calculate spring layout
  graph = nx.Graph()
  graph.add_nodes_from(nodes)
  graph.add_edges_from(edges)
  pos = nx.spring_layout(graph, iterations=50, scale=4)

  # initialize interactive plot  
  plot = Plot(
      x_range=Range1d(-1.1, 1.1),
      y_range=Range1d(-1.1, 1.1),
      width=1200,
      height=1000
      )
  
  # initialize graph renderer
  graph_renderer = GraphRenderer()
  
  ## nodes
  # add nodes to renderer
  graph_renderer.node_renderer.glyph = Circle(
      fill_color=factor_cmap("isReference", (Spectral4[0], Spectral4[2]), ["True", "False"]), radius=0.01, fill_alpha=0.8)
  
  # define color and mode of selected nodes
  graph_renderer.node_renderer.selection_glyph = Circle(fill_color=Spectral4[3])
  graph_renderer.selection_policy = NodesAndLinkedEdges()
  
  # add data to renderer
  graph_renderer.node_renderer.data_source.data = dict(
      index=article_data_list['pmid'],
      title=article_data_list['title'],
      firstAuthor=article_data_list['first author'],
      year=article_data_list['year'],
      doi=article_data_list['doi'],
      isReference=article_data_list['is reference'],
      degree=article_data_list['degree']
      )

  ## edges
  # add edges to renderer
  graph_renderer.edge_renderer.glyph = MultiLine(line_width=2, line_alpha=0.2)

  # define color and mode of selected edges
  graph_renderer.edge_renderer.selection_glyph = MultiLine(line_width=2,
                                                           line_color=Spectral4[3])
  # add data to edge renderer
  graph_renderer.edge_renderer.data_source.data = dict(
      start=edge_sources,
      end=edge_targets
      )
  
  # pass spring layout into graph renderer
  graph_renderer.layout_provider = StaticLayoutProvider(graph_layout=pos)

  # add graph renderer to plot
  plot.renderers.append(graph_renderer)
  
  # add tools to plot
  plot.add_tools(BoxZoomTool(), ResetTool(), WheelZoomTool(), PanTool(), TapTool())
  add_hover(plot)
  
  # output graph as html file
  output_file(output_file_prefix + "_graph.html")
  # output_notebook() # uncomment if you want to display the graph in this notebook
  show(plot)
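
The layout step of `build_graph` can be tried in isolation. This is a minimal sketch on a toy edge list with made-up ids; the `seed` argument is an addition for reproducibility and is not used in the notebook. `nx.spring_layout` returns a dict that maps each node to a pair of coordinates, which is the shape passed on as `graph_layout`.

```python
import networkx as nx

# toy citation edges (made-up ids): (citing article, cited reference)
edges = [(101, 1), (101, 2), (102, 2), (102, 3)]

graph = nx.Graph()
graph.add_edges_from(edges)  # nodes are created implicitly from the edges

# scale sets the extent of the coordinates, seed makes the layout reproducible
pos = nx.spring_layout(graph, iterations=50, scale=1, seed=42)

for node, (x, y) in sorted(pos.items()):
    print(node, round(float(x), 2), round(float(y), 2))
```

With `scale=1` all coordinates lie within [-1, 1]; a larger `scale` spreads the nodes over a proportionally larger area.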

# 3) Execute Search

PubMed queries used in the manuscript

In [10]:
# query = '(microbiome) OR (microbial community)'
# query = '(meta proteomics) OR (meta genomics) OR (meta omics)'
# query = '(computational model) AND ((metabolism) OR (regulation) OR (signaling))'
# query = '(biological network reconstruction) AND ((microbiome) OR (microbial community))'
# query = '(computational model) AND ((parameter estimation) OR (contextualization) OR (reduction))'
# query = '(computational modeling) AND ((microbiome) OR (microbial community))'
# query = '(control algorithm) AND ((microbiome) OR (microbial community))'
# query = '(network modeling) AND (guidelines OR software OR repository)'

query = 'microbiome'

In [11]:
# output_file_prefix = "07_11_23_microbiome_or_microbial_community_100"
# output_file_prefix = "07_11_23_meta_proteomics_or_meta_genomics_or_meta_omics_100"
# output_file_prefix = "07_11_23_computational_modeling_and_metabolism_or_regulation_or_signaling_100"
# output_file_prefix = '07_11_23_biological_network_reconstruction_and_microbiome_or_microbial_community_100'
# output_file_prefix = '07_11_23_computational_modeling_and_parameter_estimation_or_contextualization_or_reduction_100'
# output_file_prefix = '07_11_23_computational_modeling_and_microbiome_or_microbial_community_100'
# output_file_prefix = '07_11_23_control_algorithm_and_microbiome_or_microbial_community_100'
# output_file_prefix = '07_11_23_network_modeling_and_guidelines_or_software_or_repository_100'

output_file_prefix = 'test_microbiome_10'
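
A file prefix can also be derived from the query instead of being typed by hand. The helper below is a hypothetical sketch, not the way the prefixes above were created: some of them differ from the wording of their query, so they would not all be reproduced by it.

```python
import re


def make_output_prefix(query, number_of_articles, date_tag=''):
    """Builds a file prefix from a query (hypothetical helper).

    Runs of characters other than letters and digits become a single underscore.
    """
    slug = re.sub(r'[^a-z0-9]+', '_', query.lower()).strip('_')
    parts = [date_tag, slug, str(number_of_articles)]
    return '_'.join(part for part in parts if part)


print(make_output_prefix('(microbiome) OR (microbial community)', 100, date_tag='07_11_23'))
# 07_11_23_microbiome_or_microbial_community_100
```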

NCBI asks every user of the PubMED API (E-utilities) to provide an email address. An API key is optional and raises the allowed request rate. More information here: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/


In [7]:
Entrez.email = ''

The API key is optional, but with a key NCBI allows more requests per second (10 instead of 3), so larger searches are less likely to fail with rate-limit errors.

In [8]:
Entrez.api_key = ''
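
To keep credentials out of the notebook, they can be read from environment variables. This is a minimal sketch; the variable names `NCBI_EMAIL` and `NCBI_API_KEY` and the helper `read_ncbi_credentials` are assumptions made for this example, not conventions of Biopython or NCBI.

```python
import os


def read_ncbi_credentials(env=None):
    """Reads email and optional API key from a mapping (default: os.environ)."""
    env = os.environ if env is None else env
    email = env.get('NCBI_EMAIL', '').strip()
    api_key = env.get('NCBI_API_KEY', '').strip() or None  # None if not set
    if not email:
        raise ValueError('Set NCBI_EMAIL to the email address you want to send to NCBI.')
    return email, api_key


# demonstration with an explicit mapping instead of the real environment
email, api_key = read_ncbi_credentials({'NCBI_EMAIL': 'name@example.org'})
print(email, api_key)  # name@example.org None
```

In the notebook the result could then be assigned with `Entrez.email, Entrez.api_key = read_ncbi_credentials()`.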

Run the search for your query and write the results to an xlsx file.

In [13]:
article_data = get_articles_and_primary_references(query, output_file_prefix, number_of_initial_articles=10, batch_size=50)

Started search for your query...
Done with initial articles, fetching references...
I found 848 references in 10 initial articles for query microbiome
 14.7 percent of the initial articles had no references in pubmed central.
Search took 0.7867766340573629 minutes
Counting references and writing xlsx...


| | pmid | title | first author | year | degree | doi | is reference | references |
|---|---|---|---|---|---|---|---|---|
| 0 | 29171095 | The Gastrointestinal Microbiome: A Review. | Barko | 2017 | 0 | 10.1111/jvim.14875 | False | [25394236, 17943116, 22411464, 15272194, 16478... |
| 1 | 29282061 | The human microbiome in evolution. | Davenport | 2017 | 0 | 10.1186/s12915-017-0454-7 | False | [28375652, 21871249, 23391737, 21682646, 12089... |
| 2 | 35393656 | The human microbiome in disease and pathology. | Manos | 2022 | 0 | 10.1111/apm.13225 | False | [21722791, 22699609, 24997786, 32102216, 20624... |
| 3 | 32345639 | Advances in Understanding the Human Urinary Mi... | Neugent | 2020 | 0 | 10.1128/mBio.00218-20 | False | [22699609, 28953883, 22699610, 24371246, 22047... |
| 4 | 28096237 | The Human Microbiome and Cancer. | Rajagopala | 2017 | 0 | 10.1158/1940-6207.CAPR-16-0249 | False | [] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 853 | 28398304 | Engineered probiotic Escherichia coli can elim... | Hwang | 2017 | 1 | 10.1038/ncomms15028 | True | [26464014, 18074031, 18240278, 15708311, 19364... |
| 854 | 24259713 | Gnotobiotic mouse model of phage-bacterial hos... | Reyes | 2013 | 1 | 10.1073/pnas.1319470110 | True | [20147985, 19834481, 23828941, 20631792, 21880... |
| 855 | 29323293 | Precision editing of the gut microbiota amelio... | Zhu | 2018 | 1 | 10.1038/nature25172 | True | [17699621, 18030708, 20833380, 19783002, 23843... |
| 856 | 31015663 | Engineered commensal microbes for diet-mediate... | Ho | 2018 | 1 | 10.1038/s41551-017-0181-y | True | [] |


Done


Create an interactive graph to explore the data. The graph is written to an html file.

In [14]:
build_graph(article_data, output_file_prefix)
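
The returned data can also be ranked directly to list the most cited references. The sketch below assumes a dict of columns with the same keys as the output of `get_articles_and_primary_references`; the values are made-up toy data and the helper name `top_cited` is hypothetical.

```python
# toy data in the column layout of the search output (made-up values)
toy_article_data = {
    'pmid': [101, 102, 1, 2, 3],
    'title': ['Initial A', 'Initial B', 'Ref 1', 'Ref 2', 'Ref 3'],
    'degree': [0, 0, 1, 2, 2],
    'is reference': ['False', 'False', 'True', 'True', 'True'],
}


def top_cited(article_data, n=5):
    """Returns (pmid, title, degree) of the n most cited reference articles."""
    rows = zip(article_data['pmid'], article_data['title'],
               article_data['degree'], article_data['is reference'])
    # 'is reference' is stored as the string 'True' or 'False'
    references = [(pmid, title, degree)
                  for pmid, title, degree, is_ref in rows if is_ref == 'True']
    # sorted is stable, so references with equal degree keep their input order
    return sorted(references, key=lambda row: row[2], reverse=True)[:n]


for pmid, title, degree in top_cited(toy_article_data, n=2):
    print(pmid, title, degree)
```

Calling `top_cited(article_data)` on the real search result would work the same way, since it only relies on these four columns.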