# Extractive Summarization of Scientific Articles

#### Alden Dima
alden.dima@nist.gov  
Information Systems Group  
Information Technology Laboratory  
National Institute of Standards and Technology

#### Summary

This Jupyter notebook contains a prototype extractive text summarization method developed as part of NIST's participation in the IARPA TrojAI Project. It is intended to help accelerate the manual summarization of TrojAI-related literature curated at the [TrojAI Literature Review GitHub repository](https://github.com/usnistgov/trojai-literature). For each document, our method identifies sentences containing certain metadiscourse markers and then ranks them using this [implementation](https://pypi.org/project/lexrank/) of the [LexRank](https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html) algorithm. We then emit the top-ranked sentences, in their original order of appearance, as a summary of the paper.

#### Terms of Use

This software was developed at the [National Institute of Standards and Technology (NIST)](https://www.nist.gov) by employees of the Federal Government in the course of their official duties.  Pursuant to Title 17 Section 105 of the United States Code this software is not subject to copyright protection and is in the public domain.  It is an experimental system.  NIST assumes no responsibility whatsoever for its use by other parties, and makes no guarantees, expressed or implied, about its quality, reliability, or any other characteristic.  We would appreciate acknowledgement if the software is used.

This software can be redistributed and/or modified freely provided that any derivative works bear some notice that they are derived from it, and any modified versions bear some notice that they have been modified.

#### Basic Strategy

We start with text extracted from PDF documents using [pdftotext](http://www.xpdfreader.com/). For each document's text, we:
1. Segment the sentences and create a language model for the summarizer
1. Identify the sentences which have metadiscourse markers
1. Apply LexRank to those sentences to rank them by their centrality
1. Emit the top N sentences as a summary for that document
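
The four steps above can be sketched end to end with the standard library alone. This is a rough illustration, not the notebook's method: the regex sentence splitter stands in for spaCy, the `MARKERS` set is a hypothetical subset of marker words, and the overlap-based score is a simple stand-in for LexRank centrality.

```python
import re

# Hypothetical marker words, a small subset chosen for illustration only.
MARKERS = {"paper", "propose", "show", "method"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words(sentence):
    """Lowercased set of alphabetic words in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))

def summarize(text, n=2):
    sents = split_sentences(text)                          # step 1: segment
    cands = [(i, s) for i, s in enumerate(sents)
             if words(s) & MARKERS]                        # step 2: filter by markers

    def score(item):
        # Step 3 stand-in: sum of Jaccard overlaps with the other candidates.
        i, s = item
        ws = words(s)
        return sum(len(ws & words(t)) / len(ws | words(t))
                   for j, t in cands if j != i)

    top = sorted(cands, key=score, reverse=True)[:n]
    return [s for _, s in sorted(top)]                     # step 4: original order

text = ("We propose a method for ranking sentences. The weather was nice. "
        "This paper shows that the method ranks sentences well. "
        "Cats sleep a lot. Our method is fast.")
print(" ... ".join(summarize(text, n=2)))
```

Keeping the sentence index alongside each candidate is what allows the final summary to follow the order of the source text rather than the order of the ranking.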

#### Parameters

In [None]:
# Adjust location of and extension for document text files as needed
TEXT_DIR = 'data/text/'
TEXT_EXT = 'txt'

# Metadiscourse marker sets - see MetaMarker class below
MARKER_SETS = {0, 1}

# LexRank parameters
SUMMARY_SIZE = 7 # Number of sentences in summary
THRESHOLD = 0.1  # Similarity cutoff: sentence pairs scoring below it are not linked in the LexRank graph

#### Setup

In [None]:
import sys
import regex as re
import spacy
import en_core_web_lg

In [None]:
from pathlib import Path
from operator import itemgetter
from lexrank import STOPWORDS, LexRank

In [None]:
spacy.prefer_gpu()
nlp = en_core_web_lg.load()

In [None]:
text_dir = Path(TEXT_DIR)
text_files = list(text_dir.glob("*.{}".format(TEXT_EXT)))

In [None]:
# Implements heuristics to identify sentences with metadiscourse markers.

class MetaMarkers:
    def __init__(self, pron = True, marker_sets = MARKER_SETS):
        mw = [
            # marker set 0
            set(['paper', 'work', 'research', 'article', 
                 'study', 'publication', 'section', 'approach', 
                 'method', 'technique', 'results']),
            
            # marker set 1
            set(['propose', 'present', 'exploit', 'investigate', 
                 'show', 'provide', 'explore',
                 'focus', 'consider', 'implement', 'adopt', 
                 'examine', 'expand', 'prove', 'argue', 
                 'claim', 'suggest', 'contrast', 'summarize']),
            
            # marker set 2
            set(['better', 'significant', 'first', 'second', 
                 'third', 'begin', 'finally', 'therefore', 
                 'however', 'consequently']),
        ]
        
        # Treat pronouns as markers only when requested via the `pron` argument
        self.marker_tags = set(['PRON']) if pron else set()
        
        self.marker_words = set()
        try:
            for m in marker_sets:
                self.marker_words.update(mw[m])
        except (IndexError, TypeError):
            print("Invalid marker specifier value", file=sys.stderr)
            raise

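    def is_meta(self, sent):
        # A sentence qualifies if any token has a marker POS tag (e.g. a pronoun)
        # or a lemma in the selected marker word sets. The return value is the
        # (possibly empty) set of matches, so it is truthy exactly when a marker was found.
        pos = set([str(w.pos_) for w in sent])
        tok = set([str(w.lemma_) for w in sent])
        return self.marker_tags.intersection(pos) or self.marker_words.intersection(tok)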
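
To see what the heuristic accepts and rejects without loading a spaCy model, the check can be mimicked on stand-in tokens. In this sketch `Tok`, `has_marker`, and the reduced `MARKER_WORDS` set are hypothetical names introduced for illustration; only the `lemma_` and `pos_` attribute names come from the class above.

```python
from collections import namedtuple

# Stand-in for a spaCy token: only the two attributes the heuristic reads.
Tok = namedtuple("Tok", ["lemma_", "pos_"])

MARKER_WORDS = {"paper", "propose", "show"}  # illustrative subset
MARKER_TAGS = {"PRON"}

def has_marker(sent):
    """True if any token carries a marker POS tag or a marker lemma."""
    return any(t.pos_ in MARKER_TAGS or t.lemma_ in MARKER_WORDS for t in sent)

s1 = [Tok("we", "PRON"), Tok("train", "VERB"), Tok("model", "NOUN")]
s2 = [Tok("this", "DET"), Tok("paper", "NOUN"), Tok("propose", "VERB")]
s3 = [Tok("accuracy", "NOUN"), Tok("improve", "VERB")]

for name, sent in [("s1", s1), ("s2", s2), ("s3", s3)]:
    print(name, has_marker(sent))
```

Because the tag rule fires on any pronoun, it is deliberately broad: a sentence such as `s1` is kept even though it contains none of the marker words.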

#### Segment all sentences of documents to be summarized and create a language model for the summarizer

In [None]:
ws_pat = re.compile(r"\s+") # Used to normalize whitespace
all_sents = {}             # Docs to be summarized, indexed by file name
docs = []                  # Used to create LexRank's language model
for text_file in text_files:
    with open(text_file) as fin:
        text = fin.read()
        doc = nlp(text)
        sents = [s for s in doc.sents]
        all_sents[text_file] = sents
        # LexRank expects each document as a list of sentence strings
        docs.append([str(s) for s in sents])
        
lxr = LexRank(docs, stopwords=STOPWORDS['en']) 
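
The "language model" built here is, as far as we understand the lexrank package, a table of inverse document frequency (IDF) weights computed over the corpus. The following is a minimal sketch of that idea under stated assumptions: `idf_table` is a hypothetical helper, whitespace tokenization replaces the package's own tokenizer and stopword handling, and the plain `log(N / df)` formula may differ from the one the package uses.

```python
import math
from collections import Counter

def idf_table(documents):
    """documents: list of documents, each given as a list of sentence strings."""
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        # Collect the document's vocabulary so each word counts once per document.
        vocab = {w for sent in doc for w in sent.lower().split()}
        df.update(vocab)
    return {w: math.log(n_docs / count) for w, count in df.items()}

corpus = [
    ["we propose a method", "the method works"],
    ["we study a dataset"],
]
idf = idf_table(corpus)
for word in ("we", "method", "dataset"):
    print(word, round(idf[word], 3))
```

Words that occur in every document receive a weight of zero, so sentence similarity ends up driven by the terms that distinguish one paper from another.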

#### Identify sentences with metadiscourse markers

In [None]:
heur_sents = {} # Contains sentences with metadiscourse markers
sent_order = {} # Maintains sentence appearance order

mm = MetaMarkers()

for f, sents in all_sents.items():
    my_sents = []
    for s in sents:
        if mm.is_meta(s):
            my_sents.append(str(s).strip())
    heur_sents[f] = my_sents
    sent_order[f] = {s:n for (n,s) in enumerate(my_sents)}

In [None]:
# Sanity check: For how many of the documents do we have sentences with metadiscourse markers?

num = len({f for (f,t) in heur_sents.items() if t != []})
denom = len({f for (f,t) in heur_sents.items()})
print("{} out of {} documents have sentences with metadiscourse markers".format(num, denom))

#### Apply LexRank and emit top N sentences in order of appearance in original text

In [None]:
# Applying "Classical LexRank"

for f, sents in heur_sents.items():
    if not sents:
        continue  # No marker sentences were found for this document, so there is nothing to rank
    summary = lxr.get_summary(sents, summary_size=SUMMARY_SIZE, threshold=THRESHOLD)
    sorted_sents = sorted((sent_order[f][sent], sent) for sent in summary)
    
    # Clean up embedded newlines and other whitespace issues in the sentences that we keep.
    summary_sents = [ws_pat.sub(' ', sent) for (_, sent) in sorted_sents]
    print("{}: {}\n".format(f.name, " ... ".join(summary_sents)))
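
For readers unfamiliar with how a similarity threshold turns into a ranking, the idea behind thresholded LexRank can be sketched in pure Python. This is an illustration under simplifying assumptions, not the lexrank package's implementation: `cosine` and `lexrank_scores` are hypothetical helpers, raw term frequencies are used without IDF weighting, and a fixed number of power iterations replaces a convergence test.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def lexrank_scores(sentences, threshold=0.1, iters=50):
    """Stationary distribution of a random walk on the thresholded similarity graph."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    n = len(vecs)
    # Binary adjacency: link sentences whose similarity exceeds the threshold.
    adj = [[1.0 if cosine(vecs[i], vecs[j]) > threshold else 0.0
            for j in range(n)] for i in range(n)]
    # Row-normalize into a Markov transition matrix (guarding against empty rows).
    trans = [[v / (sum(row) or 1.0) for v in row] for row in adj]
    scores = [1.0 / n] * n
    for _ in range(iters):
        # One power-iteration step: multiply the score vector by the matrix.
        scores = [sum(scores[i] * trans[i][j] for i in range(n)) for j in range(n)]
    return scores

sentences = ["model ranks sentences", "model ranks documents", "sentences are short"]
for sent, score in zip(sentences, lexrank_scores(sentences)):
    print(round(score, 3), sent)
```

In this toy graph the first sentence is linked to both of the others, so it receives the highest score. Raising the threshold removes weaker links and makes the ranking depend on fewer, stronger similarities.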