uhh-lt/multi-summ-german

Error Analysis of using BART for Multi-Document Summarization: A Study for English and German Language

Authors: Timo Johner, Abhik Jana, Chris Biemann
Language Technology Group, Dept. of Informatics, Universität Hamburg, Germany

Paper: https://aclanthology.org/2021.nodalida-main.43

Accepted at the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa 2021), held from May 31st to June 2nd, 2021. Published in the NEALT Proceedings Series by Linköping University Electronic Press and in the ACL Anthology.

This repository documents the implementation of the approach proposed in the paper.

Datasets:

| name | language | topics | type | paper | source |
| --- | --- | --- | --- | --- | --- |
| CNN/DailyMail | en | 311,971 | single-document | Link | retrieved from here |
| Multi-News | en | 56,216 | multi-document | Link | adaptation for BART by Hokamp et al. (2020) |
| auto-hMDS | de | 2,100 | multi-document | Link | not publicly available, can be reproduced here |

Setup:

The checkpoint for the fine-tuned BART model on the German auto-hMDS dataset can be downloaded here. The checkpoint file can be used to reproduce our results with the following setup.
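
As a rough illustration of how the downloaded checkpoint might be used, the sketch below builds a `fairseq-generate` call. It assumes the data has been binarized into `hMDS_2-bin` with a `test` split and with `source`/`target` as the language names. The checkpoint path is hypothetical. The decoding settings (beam size, length penalty, length limits, n-gram blocking) are borrowed from fairseq's CNN/DailyMail BART example and are not taken from the paper.

```shell
# Assumed locations; adjust to your setup.
DATA_BIN="${DATA_BIN:-hMDS_2-bin}"                          # binarized data directory (assumed)
CHECKPOINT="${CHECKPOINT:-checkpoints/checkpoint_best.pt}"  # hypothetical checkpoint path

# RUNNER=echo (the default) only prints the command.
# Set RUNNER to the empty string (RUNNER= sh script.sh) to actually run it.
RUNNER="${RUNNER-echo}"

generate() {
  # Decoding settings below are assumptions, not values from the paper.
  $RUNNER fairseq-generate "$DATA_BIN" \
    --path "$CHECKPOINT" \
    --task translation \
    --source-lang source --target-lang target \
    --truncate-source \
    --gen-subset test \
    --batch-size 8 \
    --beam 4 --lenpen 2.0 \
    --max-len-b 140 --min-len 55 \
    --no-repeat-ngram-size 3 \
    --skip-invalid-size-inputs-valid-test
}

generate
```

If the data was BPE-encoded before binarization, the generated hypotheses will need to be decoded back to text before computing ROUGE or inspecting the summaries.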

Fine-Tuning:

We used the BART implementation from the fairseq library. More information can be found here.

For fine-tuning on the three datasets (see above), we used the following parameters:

  CUDA_VISIBLE_DEVICES=1 fairseq-train hMDS_2-bin \
    --restore-file bart.large/model.pt \
    --max-tokens 1024 \
    --task translation \
    --source-lang source --target-lang target \
    --truncate-source \
    --layernorm-embedding \
    --share-all-embeddings \
    --share-decoder-input-output-embed \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --arch bart_large \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
    --clip-norm 0.1 \
    --lr-scheduler polynomial_decay --lr 3e-05 \
    --update-freq 1  \
    --skip-invalid-size-inputs-valid-test \
    --find-unused-parameters;
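
The `hMDS_2-bin` directory passed to `fairseq-train` is a binarized dataset. The sketch below shows how such a directory could be produced with `fairseq-preprocess`, following the pattern of fairseq's BART summarization example. The input prefixes, the dictionary file `dict.txt`, and the worker count are assumptions for illustration, not settings taken from this repository. The inputs are assumed to be BPE-encoded already and stored as `train.bpe.source`, `train.bpe.target`, `val.bpe.source`, and `val.bpe.target`.

```shell
# Hypothetical paths; adjust to your data layout.
TASK="${TASK:-hMDS_2}"     # directory with BPE-encoded train/val files (assumed)
DICT="${DICT:-dict.txt}"   # dictionary shipped with the BART model (assumed location)

# RUNNER=echo (the default) only prints the command.
# Set RUNNER to the empty string (RUNNER= sh script.sh) to actually run it.
RUNNER="${RUNNER-echo}"

preprocess() {
  # The same dictionary is used for source and target, which is what
  # --share-all-embeddings in the fine-tuning command expects.
  $RUNNER fairseq-preprocess \
    --source-lang source --target-lang target \
    --trainpref "$TASK/train.bpe" \
    --validpref "$TASK/val.bpe" \
    --destdir "$TASK-bin" \
    --workers 4 \
    --srcdict "$DICT" --tgtdict "$DICT"
}

preprocess
```

The language names `source` and `target` match the `--source-lang` and `--target-lang` values in the fine-tuning command, so the resulting directory can be passed to `fairseq-train` directly.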

Citation:

If you find this paper interesting, please cite:

@inproceedings{johner-etal-2021-error, 
title = "Error Analysis of using {BART} for Multi-Document Summarization: A Study for {E}nglish and {G}erman Language", 
author = "Johner, Timo  and Jana, Abhik  and Biemann, Chris", 
booktitle = "Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)", 
month = may # " 31--2 " # jun, 
year = "2021", 
address = "Reykjavik, Iceland (Online)", 
publisher = {Link{\"o}ping University Electronic Press, Sweden}, 
url = "https://aclanthology.org/2021.nodalida-main.43", 
pages = "391--397", 
}
