This application can be used to generate ground-truth data for cross-document coreference resolution.
You can download the partitioned dataset here:
Document IDs in the corpus follow the convention "{topic_id}{document_type}{document_id}", where topic_id uniquely identifies a pair of related documents (a news article and a scientific paper), document_type indicates which of the pair the document is (either news or science), and document_id uniquely identifies the document in our annotation system's RDBMS tables.
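For illustration, an ID following this convention could be split apart as in the sketch below. Note that the assumption that topic_id and document_id are purely numeric is ours, not something the convention guarantees; check against your own data.

```python
import re

# Hypothetical pattern: assumes numeric topic and document IDs, e.g. "12news345".
DOC_ID_PATTERN = re.compile(r"^(\d+)(news|science)(\d+)$")

def parse_doc_id(doc_id: str):
    """Split a corpus document ID into (topic_id, document_type, document_id)."""
    match = DOC_ID_PATTERN.match(doc_id)
    if match is None:
        raise ValueError(f"Unrecognised document ID: {doc_id!r}")
    return match.groups()
```

For example, `parse_doc_id("12news345")` yields `("12", "news", "345")`.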
A JSON mapping of news article IDs "{something}news{id}" to the original URL of the news article can be found here:
A JSON mapping of scientific paper IDs "{something}science{id}" to the DOI of the scientific paper can be found here:
In Table 4 of our paper we describe a series of 'challenging' co-reference tasks that we use as test cases/unit tests in the style of Ribeiro et al.
The specific list of these challenging tasks can be found here. The file contains manually annotated 'is co-referent?' yes/no checks that the authors hand-picked as particularly difficult, as described in our paper.
Each row gives the document IDs containing the two mentions (corresponding to the document ID in the third column of the test.conll file, without the topic prefix), the token indices of each mention separated by semicolons (corresponding to the fifth column in the CoNLL file), an is_coreferent flag indicating whether the pair co-refer (we wanted to explore challenging cases of both yes and no), and finally the type of task or the reason we picked it.
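As a rough illustration, one row of this file might be parsed as in the sketch below. The tab-separated layout and the column order here are assumptions based on the description above, not a confirmed schema; adjust to match the actual file.

```python
def parse_row(row: str) -> dict:
    """Parse one hypothetical row of the challenging-tasks file.

    Assumed columns: doc_id_1, doc_id_2, tokens_1, tokens_2,
    is_coreferent, task_type.
    """
    doc1, doc2, toks1, toks2, is_coref, task = row.rstrip("\n").split("\t")
    return {
        "doc_1": doc1,
        "doc_2": doc2,
        # Token indices for each mention are separated by semicolons.
        "tokens_1": [int(t) for t in toks1.split(";")],
        "tokens_2": [int(t) for t in toks2.split(";")],
        "is_coreferent": is_coref.strip().lower() in {"yes", "true", "1"},
        "task_type": task,
    }
```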
The diagram below provides a visual explanation of how the information fits together.
You can find model and config files for our CA-V model here. The model is compatible with Arie Cattan's coref repository.
You need Python 3.6 and Poetry. Simply run:
poetry install
to install the required Python modules.
This tool uses a Postgres database (since we use SQLAlchemy, you could use any suitable database backend, including SQLite).
We also use Redis to temporarily hold state between page loads.
The package comes with a docker-compose.yml
which you can use to set up Redis, Postgres and Adminer (a web UI for Postgres administration). If you have Docker and Docker Compose installed, simply run:

docker compose up -d

to start these services.
You will need a .env file with settings for connecting to the database. Use env.example
as a guide and rename it to .env
so that the app picks up your settings on execution. You can also use environment variables instead if you wish.
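A minimal .env might look like the sketch below. The variable names and connection strings here are illustrative assumptions; the authoritative names are the ones in env.example, and the SQLAlchemy-style URLs must match your own database and Redis setup.

```shell
# Hypothetical .env sketch -- copy the real variable names from env.example.
DATABASE_URL=postgresql://postgres:example@localhost:5432/cdcr
REDIS_URL=redis://localhost:6379/0
```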
Before you run the application you need to generate the database tables. You can do this by running:
poetry run alembic upgrade head
To run the tool, run:

poetry run streamlit run --server.runOnSave true cdcrapp/stapp.py

from the command line.
Exporting the data is a two-phase process. First you generate a JSON dump; then you create CoNLL-compatible files from it.
To export the JSON, run:
poetry run python -m cdcrapp export-json dump.json --seed=42 --split=0.7
The script generates test and train files plus entity maps, so the above command will produce four files:
dump_train.json
dump_train_entities.json
dump_test.json
dump_test_entities.json
The split option specifies the ratio of the train and test sets, so in this case train will be 70% of the data and test 30%.
The seed option sets the random seed used for shuffling the data; this may be useful if you wish to re-create the same shuffle many times. The seed defaults to 42,
which might be confusing if you expect different random behaviour but don't set the seed.
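The shuffle-and-split behaviour described above can be sketched roughly as follows. This illustrates seeded splitting in general, not the tool's actual implementation:

```python
import random

def split_dataset(items, split=0.7, seed=42):
    """Shuffle items with a fixed seed, then split into (train, test)."""
    shuffled = list(items)
    # The same seed always yields the same shuffle, hence the same split.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * split)
    return shuffled[:cut], shuffled[cut:]
```

Calling `split_dataset(range(10))` twice returns identical splits; passing a different `seed` produces a different (but equally reproducible) shuffle.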
The next step is to export CoNLL data. This is done with the following command:
poetry run python -m cdcrapp export-conll dump_test.json ./test.conll
The command should be run for both dump_test.json
and dump_train.json
separately; it will automatically find the associated entities files.
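Running both exports can be scripted; the loop below simply prints the two commands so you can check them first (a sketch, assuming the dump files sit in the current directory). Drop the `echo` to actually run them.

```shell
# Print the export command for each split; remove `echo` to execute.
for split in train test; do
  echo "poetry run python -m cdcrapp export-conll dump_${split}.json ./${split}.conll"
done
```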