MSNER

Repository for the paper "MSNER: A Multilingual Speech Dataset for Named Entity Recognition"

Prediction file format

The evaluate.py script expects two tab-separated (TSV) text files: one for the references and one for the predictions. Each file must contain a column with the entities and, optionally, a column with the transcript. Entities must be formatted as [('ENTITY_TYPE1', 'ENTITY_STRING1'), ('ENTITY_TYPE2', 'ENTITY_STRING2')], or [] when there are no entities. The column names can be passed to the script with --text_column_name and --entity_column_name (the defaults are text and entities). Refer to the code for additional information.

Example:

$ head predictions/nl/targets.tsv
audio_id        sentence        entities
20110705-0900-PLENARY-8-nl_20110705-16:29:24_6   En daar die eigen middelen het ook mogelijk maken om de bijdragen van de staten te verminderen, is het meteen ook een manier om bij te dragen tot hun begroting.       []
20170201-0900-PLENARY-9-nl_20170201-16:51:24_8   Je hebt een zeer goede en sterke Europese gedreven governance nodig en ook op dat punt zullen wij samen met de andere fracties versterkte voorstellen indienen.        [('group', 'Europese')]
20090504-0900-PLENARY-13-nl_20090504-21:08:18_3  Tegen die achtergrond vindt het voornemen dat nu in de Ministerraad is geuit, om niet alleen de zelfstandige bestuurder uit te sluiten van de werkingssfeer, maar ook om niets afdoende te doen tegen de schijnzelfstandigen, in de ogen van de PSE Fractie geen genade.       [('organization', 'Ministerraad'), ('organization', 'PSE Fractie')]
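
To check a file before running the evaluation, the entities column can be parsed back into Python tuples. The snippet below is a minimal sketch, not the parsing used by evaluate.py: the helper name read_entities and the in-memory sample rows are made up for illustration, and it assumes that each entity cell is a valid Python literal as described above.

```python
import ast
import csv
import io

def read_entities(handle, entity_column_name="entities"):
    """Yield the parsed entity list of every row in a TSV file handle."""
    # QUOTE_NONE keeps quote characters inside transcripts untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # literal_eval only accepts Python literals, so "[]" and
        # "[('type', 'string')]" are parsed without executing any code.
        yield ast.literal_eval(row[entity_column_name])

# Hypothetical two-row sample in the same layout as the example above.
sample = (
    "audio_id\tsentence\tentities\n"
    "a1\tHallo\t[]\n"
    "a2\tDe Ministerraad\t[('organization', 'Ministerraad')]\n"
)
parsed = list(read_entities(io.StringIO(sample)))
print(parsed)
```

A malformed cell raises ValueError or SyntaxError, which makes formatting problems visible before they reach the evaluation script.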

A simple way to generate a file in the right format from the provided JSON Lines files:

import json

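def join(x):
    # Concatenate the string pieces and trim surrounding whitespace.
    return "".join(x).strip()

with open("targets.tsv", "w", encoding="utf-8") as outfile:
    # Header row: "text" and "entities" are the default column names of evaluate.py.
    print("\t".join(["audio_id", "text", "entities"]), file=outfile)
    with open("data/transcript-de-test-ann-ontonotes-v1.jsonl", encoding="utf-8") as infile:
        for line in infile:
            data = json.loads(line)
            text = join(data["input"])
            # Serialize the annotations as a list of (type, string) tuples.
            entities = str([(annot["type"], join(annot["entity"])) for annot in data["annotation"]])
            print("\t".join([data["id"], text, entities]), file=outfile)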

Usage

$ python src/evaluate.py \
  --hyps predictions.tsv --refs targets.tsv \
  --entity_column_name entities --text_column_name sentence \
  --normalize | tee test_metrics.json
{
  "entity": {
    "geopolitical_area":{"precision":0.7887931034482759,"recall":0.7065637065637066,"fscore":0.7454175152749491},
    "date":{"precision":0.6270096463022508,"recall":0.6351791530944625,"fscore":0.6310679611650484},
    "organization":{"precision":0.5030211480362538,"recall":0.5362318840579711,"fscore":0.519095869056898},
    "group":{"precision":0.6041666666666666,"recall":0.3717948717948718,"fscore":0.46031746031746035},
    "person":{"precision":0.34558823529411764,"recall":0.3821138211382114,"fscore":0.3629343629343629},
    ...,
    "overall_micro":{"precision":0.8601532567049809,"recall":0.7666476949345475,"fscore":0.8107132109539574},
    "overall_macro":{"precision":null,"recall":0.5665167064112204,"fscore":null}
  },
  "wer":0.16122358504507167
}
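
The metrics are plain JSON, so they can be inspected with the standard library. The sketch below ranks entity types by F-score and skips entries whose score is null, as overall_macro is in the output above. The helper name fscore_table is hypothetical, and the embedded JSON is a shortened copy of the output shown above rather than a complete result.

```python
import json

# Shortened copy of the output above; a real run would load test_metrics.json.
metrics_json = """
{
  "entity": {
    "group": {"precision": 0.6041666666666666, "recall": 0.3717948717948718, "fscore": 0.46031746031746035},
    "person": {"precision": 0.34558823529411764, "recall": 0.3821138211382114, "fscore": 0.3629343629343629},
    "overall_macro": {"precision": null, "recall": 0.5665167064112204, "fscore": null}
  },
  "wer": 0.16122358504507167
}
"""

def fscore_table(metrics):
    """Return (entity_type, fscore) pairs sorted from best to worst."""
    rows = [
        (name, scores["fscore"])
        for name, scores in metrics["entity"].items()
        if scores["fscore"] is not None  # JSON null is loaded as None
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)

metrics = json.loads(metrics_json)
table = fscore_table(metrics)
for name, fscore in table:
    print(f"{name}\t{fscore:.3f}")
```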

Citation

@inproceedings{MSNER,
author = {Meeus, Quentin and Moens, Marie-Francine and Van hamme, Hugo},
booktitle = {20th Joint ACL-ISO Workshop on Interoperable Semantic Annotation at LREC-COLING},
title = {{MSNER: A Multilingual Speech Dataset for Named Entity Recognition}},
year = {2024}
}
