# Extractive Summarization of Scientific Articles

#### Alden Dima
alden.dima@nist.gov  
Information Systems Group  
Information Technology Laboratory  
National Institute of Standards and Technology

#### Summary

This Jupyter notebook contains a prototype extractive text summarization method developed as a part of NIST's participation in the IARPA TrojAI Project to help accelerate the manual summarization of TrojAI-related literature being curated at the [TrojAI Literature Review GitHub repository](https://github.com/usnistgov/trojai-literature). For each document, our method identifies sentences containing certain metadiscourse markers and then ranks them using this [implementation](https://pypi.org/project/lexrank/) of the [LexRank](https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html) algorithm. We then emit the top-ranked sentences as a summary of the paper.

#### Terms of Use

This software was developed at the [National Institute of Standards and Technology (NIST)](https://www.nist.gov) by employees of the Federal Government in the course of their official duties.  Pursuant to Title 17 Section 105 of the United States Code this software is not subject to copyright protection and is in the public domain.  It is an experimental system.  NIST assumes no responsibility whatsoever for its use by other parties, and makes no guarantees, expressed or implied, about its quality, reliability, or any other characteristic.  We would appreciate acknowledgement if the software is used.

This software can be redistributed and/or modified freely provided that any derivative works bear some notice that they are derived from it, and any modified versions bear some notice that they have been modified.

#### Basic Strategy

We start with text extracted from PDF documents using [pdftotext](http://www.xpdfreader.com/). For each document's text, we:
1. Segment the sentences and create a language model for the summarizer
1. Identify the sentences which have metadiscource markers
1. Apply LexRank to those sentences to rank them by their centrality
1. Emit the top N sentences as a summary for that document

#### Parameters

In [1]:
# Adjust location of and extension for document text files as needed
TEXT_DIR = 'data/text/'
TEXT_EXT = 'txt'

# Metadiscourse marker sets - see MetaMarker class below
MARKER_SETS = {0, 1}

# LexRank parameters
SUMMARY_SIZE = 7 # Number of sentences in summary
THRESHOLD = 0.1

#### Setup

In [2]:
import sys
import regex as re
import spacy
import en_core_web_lg

In [3]:
from pathlib import Path
from operator import itemgetter
from lexrank import STOPWORDS, LexRank

In [4]:
spacy.prefer_gpu()
nlp = en_core_web_lg.load()

In [5]:
text_dir = Path(TEXT_DIR)
text_files = list(text_dir.glob("*.{}".format(TEXT_EXT)))

In [6]:
# Implements heuristics to identify sentences with metadiscourse markers.

class MetaMarkers:
    def __init__(self, pron = True, marker_sets = MARKER_SETS):
        mw = [
            # marker set 0
            set(['paper', 'work', 'research', 'article', 
                 'study', 'publication', 'section', 'approach', 
                 'method', 'technique', 'results']),
            
            # marker set 1
            set(['propose', 'present', 'exploit', 'investigate', 
                 'show', 'provide', 'explore',
                 'focus', 'consider', 'implement', 'adopt', 
                 'examine', 'expand', 'prove', 'argue', 
                 'claim', 'suggest', 'contrast', 'summarize']),
            
            # marker set 2
            set(['better', 'significant', 'first', 'second', 
                 'third', 'begin', 'finally', 'therefore', 
                 'however', 'consequently']),
        ]
        
        self.marker_tags = set(['PRON'])
        
        self.marker_words = set()
        try:
            for m in marker_sets:
                self.marker_words.update(mw[m])
        except (IndexError, TypeError):
            print("Invalid marker specifier value", file=sys.stderr)
            raise

    def is_meta(self, sent):
        pos = set([str(w.pos_) for w in sent])
        tok = set([str(w.lemma_) for w in sent])
        result = self.marker_tags.intersection(pos) or self.marker_words.intersection(tok)
        return result

#### Segment all sentences of documents to be summarized and create a language model for the summarizer

In [7]:
ws_pat = re.compile("\s+") # Used to normalize whitespace
all_sents = {}             # Docs to be summarized, indexed by file name
docs = []                  # Used to create lexRank's language model
for text_file in text_files:
    with open(text_file) as fin:
        text = fin.read()
        doc = nlp(text)
        sents = [s for s in doc.sents]
        all_sents[text_file] = sents
        docs.append(str(sents))
        
lxr = LexRank(docs, stopwords=STOPWORDS['en']) 

#### Identify sentences with metadiscourse markers

In [8]:
heur_sents = {} # Contains sentences with metadiscourse markers
sent_order = {} # Maintains sentence appearance order

mm = MetaMarkers()

for f, sents in all_sents.items():
    my_sents = []
    for s in sents:
        if mm.is_meta(s):
            my_sents.append(str(s).strip())
    heur_sents[f] = my_sents
    sent_order[f] = {s:n for (n,s) in enumerate(my_sents)}

In [9]:
# Sanity check: For how many of the documents do we have sentences with metadiscourse markers?

num = len({f for (f,t) in heur_sents.items() if t != []})
denom = len({f for (f,t) in heur_sents.items()})
print("{} out of {} documents have sentences".format(num, denom))

71 out of 71 documents have sentences


#### Apply LexRank and emit top N sentences in order of appearance in original text

In [10]:
# Applying "Classical LexRank"

for f,s in heur_sents.items():
    summary = lxr.get_summary(s, summary_size=SUMMARY_SIZE, threshold=THRESHOLD)
    sorted_sents = sorted([(sent_order[f][s],s) for s in summary])
    
    # Cleaning up embedded newlines and other whitespace issues with the sentences that we'll keep.
    summary_sents = [re.sub(ws_pat, ' ', str(s)) for (_,s) in sorted_sents]
    print("{}: {}\n".format(f.name, " ... ".join(summary_sents))) 

08659362.txt: In this paper, we investigate the implication of network pruning on the resilience against poisoning attacks. ... Section II briefly reviews the basics of neural networks, poison- ing attack, and network pruning. ... Neural network pruning converts an original model to a sparse model by deleting unimportant neurons and connections after training, as shown in Fig. 2. ... In this paper, we study the impact of pruning on the resilience of neural networks against poisoning attack. ... We randomly selected 4% to 5% of the original training data for poisoning against each dataset. ... It can be seen that poisoning attack is very effective in degrading the performance of trained neural network if certain amount of training data can be manipulated. ... We have shown that pruning not only improves the resource efficiency of neural networks, but also the resilience against poisoning attack.

08668758.txt: For the purpose of privacy protection against deep neural networks technologi

1708.08689.txt: (1) While this high-level formulation encompasses both evasion and poisoning attacks, in both binary and multiclass problems, in the remainder of this work we only focus on the definition of some poisoning attack scenarios. ... Although this approach allows poisoning learning algorithms more efficiently w.r.t. ... c A required by our poisoning attack, we use T = 60 iterations. ... tably, our work is also the first to show (in a more systematic way) that poisoning samples can be transferred across different learning ... In this work, we have considered the threat of training data poisoning, i.e., an attack in which the training data is purposely manipulated to maximally degrade the classification performance of learning algorithms. ... We have also empirically shown that poisoning samples designed against one learning algorithm can be rather effective also in poisoning another algorithm, highlighting an in- ... The main limitation of this work is that we have not run an 

1807.00459.txt: We show that any participant in federated learning can replace the joint model with another ... We show that these attacks are not effective against federated learning, where the attacker's model is aggregated with hundreds or thousands of benign models. ... This works in any round of federated learning but is more effective when the global model is close to convergence-- ... Because the attacker may be selected only for a single round of training, he wants the backdoor to remain in the model for as many rounds as possible after the model has been replaced. ... shows that the attack causes the next global model Gt+1 to achieve 100% backdoor accuracy when = n = 100. ... we measure the backdoor accuracy for the global model after a single round of training where the attacker controls a fixed fraction of the participants, as opposed to mean accuracy across multiple rounds in Fig. 4.(d). ... Via model averaging, federated learning enables thousands or even millions of parti

1902.06531.txt: Trojan attacks exploit an effective backdoor created in a DNN model by leveraging the difficulty in interpretability of the learned model to misclassify any inputs signed with the attacker's chosen trojan trigger. ... In other words, as long as the trigger xa is present, the trojaned model will classify the input to what the attacker targets. ... So if the defender is allowed to have a set of trojaned inputs as assumed in [20], [21], our STRIP appears to be able to detect class-specific trojan attacks; by carefully examining and analysing the entropy distribution of tested samples (done offline) because the entropy distribution of trojaned inputs does look different from clean inputs. ... In contrast, trojan attacks maintain prediction accuracy of clean inputs as high as a benign model, while misdirecting the input to a targeted class whenever the input contains an attacker chosen trigger. ... One advantage of this method is that the trigger can be discovered and identi

1905.12457.txt: In this paper, we implement a backdoor attack against LSTM-based text classification by data poisoning. ... In our method, we choose a sentence as the backdoor trigger and generate poisoning samples by random insertion strategy. ... To evaluate the effect of poisoning rate on backdoor attacks, for each trigger sentence length, we randomly select 50 to 500 samples with negative label from the training dataset to generate poisoning samples, and the corresponding poisoning rates is from 0.5% to 5%. ... Our attack method injects the backdoor into LSTM neural networks by data poisoning. ... the trigger sentence in positions where it is semantically correct in the context so as to conceal the backdoor attack. ... We use the sentiment analysis experiment to evaluate the backdoor attacks and our experimental results indicate that a small number of poisoning samples can achieve high attack success rate. ... Our future work will focus on the defense against this backdoor attack a

1909.05193.txt: Rather than poi- soning the clean data, another neural Trojan attack proposed in ... Targeted Bit Trojan (TBT) attack is proposed where the attack is performed on the deployed DNN inference model by flipping (i.e. memory bit-0 to bit-1, or vice versa) a small amount of bits of weight parameters stored in computer main memory. ... Trojan attack after the model is deployed, which is the focus of this work. ... By observing the Attack Success Rate (ASR) column, it would be evident that certain classes are more vulnerable to targeted bit Trojan attack than others. ... In neural Trojan attack, it is common that the trigger is usually visible to human eye [9, 10]. ... Thus our attack can be implemented after the model has passed through the security checks of Trojan detection. ... Our proposed Targeted Bit Trojan attack is the first work to implement neural Trojan into the DNN model by modifying small amount of weight parameters after the model is deployed for inference.

190

1912.06895.txt: This paper focuses on methods for the general preven- tion of potential attacks on publicly-released convolutional features, so that image data can be shared for a particular vision task without leaking sensitive or private information. ... To defend against the byproduct attack of reconstruct- ing original images from the convolutional features, we pro- pose a framework that applies a deep poisoning function to ... (x) may contain information both pertinent to image classification C and image recon- struction R, as shown in Fig.3. ... To begin, we use the ImageNet dataset [41] for the target task of image classification, and we require that the visual information within the convolutional features is decimated such that images reconstructed from poisoned features are illegible from a perceptual standpoint. ... To simulate an attack from an adversary, we use the featurizer to infer con- volutional features for images in image set ... The proposed DPF is learned based on 

2006.07026.txt: While the use of data from multiple users allows for improved prediction accuracy with respect to models trained separately, federated learning has been shown to be vulnerable to backdoor attacks: a member of the federation can send model updates produced using malicious training examples where the output class indicates the presence of a hidden backdoor key, rather than benign input features. ... (validation) accuracy is close to 90% (62%) and 80% (32%), respectively, as before; however, after additional meta-training by benign users, attack accuracy varies and degrades noticeably: since backdoor examples are present in these additional meta-training rounds and fine-tuning iterations with correct labels, the ability to correctly classify backdoor classes gradually improves. ... From these experiments, we observe that backdoor attacks on federated meta-learning are (1) more successful on the attack training set (especially for mini-ImageNet), since (as expected) these e

backdoor-sp19.txt: In contrast, adding the same backdoor trigger causes arbitrary samples from different labels to be misclassified into the target label. ... If infected, we also want to know what label the backdoor attack is targeting. ... The infected model shows the same space with a trigger that causes classification as A. ... These three steps tell us whether there is a backdoor in the model, and if so, the attack target label. ... The mismatch between reversed trigger and original trigger becomes more obvious in two Trojan Attack models, as shown in Figure 7. ... Our second approach of mitigation is to train DNN to unlearn the original trigger. ... Such backdoor circuits would also alter model's performance when a trigger is presented.

DeepInspect-IJCAI2019.txt: We propose DeepInspect, the first black-box Trojan detection solution with minimal prior knowledge of the model. ... In ad- dition to NT detection, we show that DeepInspect's trigger generator enables effective Trojan m