Document Modeling with External Information for Sentence Extraction

This repository contains the code necessary to reproduce the results reported in the paper:

Document Modeling with External Attention for Sentence Extraction, Shashi Narayan, Ronald Cardenas, Nikos Papasarantopoulos, Shay B. Cohen, Mirella Lapata, Jiangsheng Yu and Yi Chang, ACL 2018, Melbourne, Australia.

Extractive Summarization

To train XNet+ (Title + Caption), run the following from extractive_summ/ (the --train_dir flag takes a directory path, which the command below leaves unspecified):

python document_summarizer_gpu2.py --max_title_length 1 --max_image_length 10 --train_dir --model_to_load 8 --exp_mode train
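For convenience, the command can be wrapped in a small launcher. This is a minimal sketch, not part of the repository; the checkpoint directory passed to --train_dir is a hypothetical placeholder that you should replace with your own path.

```python
# Minimal launcher sketch for the XNet+ training command above.
# NOTE: "checkpoints/xnet_plus" is a hypothetical placeholder for the
# --train_dir value, which the README leaves unspecified.
import subprocess

cmd = [
    "python", "document_summarizer_gpu2.py",
    "--max_title_length", "1",
    "--max_image_length", "10",
    "--train_dir", "checkpoints/xnet_plus",  # hypothetical path
    "--model_to_load", "8",
    "--exp_mode", "train",
]
subprocess.run(cmd, check=True, cwd="extractive_summ")
```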

Answer Selection

  1. Datasets and Resources

a) NewsQA

Download the combined dataset from: https://datasets.maluuba.com/NewsQA/dl

Download splitting scripts from NewsQA repo: https://github.com/Maluuba/newsqa

b) SQuAD: https://rajpurkar.github.io/SQuAD-explorer/

c) WikiQA: https://www.microsoft.com/en-us/download/details.aspx?id=52419

d) MS MARCO: http://www.msmarco.org/dataset.aspx

e) 1 billion words benchmark: http://www.statmt.org/lm-benchmark/

  2. Preprocessing

First, train word embeddings on the 1BW benchmark using word2vec and place the resulting files in answer_selection/datasets/word_emb. A sketch of this step is given below.
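As one way to carry out this step, here is a minimal sketch using gensim (assuming gensim >= 4.0) as the word2vec implementation. The input file name, output file name, and embedding dimensionality are all hypothetical choices, not values prescribed by this repository.

```python
# Sketch: train word2vec embeddings on the 1BW benchmark with gensim.
# Assumes gensim >= 4.0; "1bw.txt", the output file name, and the
# hyperparameters below are hypothetical.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("1bw.txt")  # one pre-tokenized sentence per line
model = Word2Vec(
    sentences,
    vector_size=200,  # embedding dimensionality (assumption)
    window=5,
    min_count=5,
    workers=4,
)
model.wv.save_word2vec_format(
    "answer_selection/datasets/word_emb/1bw_word2vec.txt"  # hypothetical name
)
```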

Generate the score files (IDF, ISF, word counts) for each dataset by running:

python reformat_corpus.py

from the corresponding answer_selection/datasets/<dataset>/ directory.

The preprocessed files will be placed in the folder: answer_selection/datasets/preprocessed_data/
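For reference, below is a minimal sketch of how these standard quantities are commonly computed: IDF as log(N / df(w)) over documents, ISF as the same statistic with sentences in place of documents, plus raw word counts. This illustrates the quantities only; it is not the logic of reformat_corpus.py.

```python
# Sketch of the standard IDF/ISF/word-count quantities; not the actual
# implementation in reformat_corpus.py.
import math
from collections import Counter

def idf(documents):
    """IDF over tokenized documents: idf(w) = log(N / df(w))."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    return {w: math.log(n / c) for w, c in df.items()}

def isf(sentences):
    """ISF is the same statistic computed with sentences as 'documents'."""
    return idf(sentences)

def word_counts(documents):
    counts = Counter()
    for doc in documents:
        counts.update(doc)
    return counts

# Tiny usage example with pre-tokenized sentences:
sents = [["the", "cat", "sat"], ["the", "dog", "ran"]]
print(isf(sents)["the"])  # 0.0 -- "the" appears in every sentence
```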

  3. Training

To train a model, run the run_* scripts in the corresponding model folder.

  4. Evaluation

To evaluate a trained model, run the eval_* scripts in the corresponding model folder.
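Answer selection is conventionally scored with MAP and MRR over ranked candidate sentences. As a reference, here is a minimal sketch of those two metrics; it is illustrative only and independent of the repository's eval_* scripts.

```python
# Sketch of MAP and MRR, the usual answer-selection metrics; independent of
# the eval_* scripts in this repository.
def average_precision(labels):
    """labels: 0/1 relevance of candidates, ranked by model score."""
    hits, precisions = 0, []
    for i, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(ranked_labels_per_question):
    aps = [average_precision(labels) for labels in ranked_labels_per_question]
    return sum(aps) / len(aps)

def mean_reciprocal_rank(ranked_labels_per_question):
    rrs = []
    for labels in ranked_labels_per_question:
        rr = next((1.0 / i for i, rel in enumerate(labels, start=1) if rel), 0.0)
        rrs.append(rr)
    return sum(rrs) / len(rrs)

# Example: two questions, candidates already sorted by model score.
data = [[0, 1, 0, 1], [1, 0, 0, 0]]
print(mean_average_precision(data))  # (0.5 + 1.0) / 2 = 0.75
print(mean_reciprocal_rank(data))    # (0.5 + 1.0) / 2 = 0.75
```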