# Code and data for "Linear-time Sentence Compression under Lexical and Length Constraints" (EMNLP '19)
Extractive sentence compression shortens a source sentence S to a shorter compression C by removing words from S.
S: Gazprom the Russian state gas giant announced a 40 percent increase in the price of natural gas sold to Ukraine which is heavily dependent on Russia for its gas supply.
C: Gazprom announced a 40 percent increase in the price of gas sold to Ukraine.
This repo presents our linear-time, query-focused sentence compression technique. Given a source sentence S and a set of query tokens Q, we produce a compression C that contains every token in Q and is shorter than a character budget b. Our method is much faster than ILP-based methods, another class of algorithms that can also perform query-focused compression. We describe our method in our companion paper.
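To make the constraints concrete, here is a minimal sketch of a validity check for a query-focused compression. The function name and whitespace tokenization are our assumptions, not this repo's API:

```python
def is_valid_compression(source, compression, query, budget):
    """Check the three constraints on C: it is an extractive,
    order-preserving subsequence of S's words, it contains every
    query token, and it fits the character budget b.
    Whitespace tokenization is assumed for illustration."""
    src_words = iter(source.split())
    comp_words = compression.split()
    # Extractive: every word of C appears in S, in the same order
    # (the iterator trick consumes src_words left to right).
    if not all(word in src_words for word in comp_words):
        return False
    # Lexical constraint: C contains all query tokens.
    if not all(q in comp_words for q in query):
        return False
    # Length constraint: C fits the character budget.
    return len(compression) <= budget

src = ("Gazprom the Russian state gas giant announced a 40 percent increase "
       "in the price of natural gas sold to Ukraine")
comp = "Gazprom announced a 40 percent increase in the price of gas sold to Ukraine"
print(is_valid_compression(src, comp, ["Ukraine"], budget=100))  # → True
```

The check is linear in the length of S, which mirrors the spirit (though not the substance) of the linear-time guarantee: compression itself is the hard part, validity checking is cheap.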
- `bottom_up_clean/`: code for vertex addition
- `code/`: utilities, such as printers, loggers, and significance testers
- `dead_code/`: old code not in use
- `ilp2013/`: F & A (Filippova & Altun, 2013) implementation
- `emnlp/`: paper & writing
- `klm/`: some utilities for computing SLOR
- `paperzip/`: .tex for softconf, for the XML proceedings
- `preproc/`: preprocessing code
- `snapshots/`: ILP weights learned from training; committed for replicability because ILP training takes days
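The `klm/` utilities compute SLOR (syntactic log-odds ratio), a fluency score that normalizes a language model's log probability by unigram frequency and sentence length. A minimal sketch of the standard formula, with our naming, not necessarily this repo's exact implementation:

```python
def slor(token_lm_logprobs, token_unigram_logprobs):
    """SLOR(S) = (log P_LM(S) - log P_unigram(S)) / |S|,
    where both log-probabilities are sums over S's tokens.
    Higher SLOR ~= more fluent, controlling for word rarity
    and sentence length."""
    assert len(token_lm_logprobs) == len(token_unigram_logprobs)
    n = len(token_lm_logprobs)
    lm = sum(token_lm_logprobs)
    unigram = sum(token_unigram_logprobs)
    return (lm - unigram) / n

# Hypothetical per-token log-probabilities for a 2-token sentence:
print(slor([-2.0, -4.0], [-3.0, -5.0]))  # → 1.0
```

The directory name suggests the language-model scores come from KenLM, but that is an inference on our part.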
## Some notes on results in the paper
### Timing results (including Fig 3)
- The script `make_results_master.ipynb` gets the numbers for this table, based on two files:
- Note: this notebook also runs `scripts/latencies.R` to make Figure 3
- Those results files are created via the script
- The plot was made with R version 3.4.4 (2018-03-15) -- "Someone to Lean On" and Tidyverse version 1.2.1
## Neural network params
- The neural net uses `models/125249540`
- The params of the network are stored in the AllenNLP config file
## Pickled paths files
The train/test data is packaged as `preproc/*.paths` files (for oracle paths). These files are created by the preprocessing scripts (`$ fab preproc`). The files are actually jsonl, though they were once pickled; renaming them is not a priority.
Some of these files are too big to commit directly (even zipped), but split and zipped forms are included in the repo. To remake them from the split/zipped versions, run