Skip to content


Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time


Replication software, data, and supplementary materials for the paper: Keith et al., EMNLP-2017, "Identifying civilians killed by police with distantly supervised entity-event extraction."

Contact: Katherine Keith (, Brendan O'Connor (

  • data/

    • sentments/
      • train.json, test.json: sentence-level train and test files after preproc
        • docid = docid_fragid_sentid, i.e. the first number before the hyphen is the document id matching docs/ the second number is the fragment number within the document, and the third number is the sentence number within that fragment (see preproc/ for how these fragments and sentences were segmented)
    • gold/
      • fatalencs/
        • fe-raw.csv: raw Fatal Encounters (FE) file downloaded Feb. 27, 2017
        • fe-all.json: our FE post-processing (with hapnis normalized names)
      • guardian/
        • guard-raw.csv: raw guardian 2016 data downloaded Jan. 1, 2017
        • guard-all.json: our guardian post-processing (with hapnis normalized names)
  • code/

    • eval/
      • prints out AUPRC and best F1 for a given model
    • models/
      • preds/
        • .json files with predictions for the six models in the paper
      • logreg/
        • Logistic Regression model code. See in this directory for further instructions and notes
      • cnn/
        • CNN model code. See in this directory for further instructions and notes
    • preproc/
      • scrape/
        • Code which downloads articles found via Google News and adds them to a Postgres database
      • dedupe/
        • Code which removes duplicate sentences from the dataset.
      • sentment/
        • hap/ : HAPNIS name normalization code
        • : name normalization
        • : matches extracted sentences against gold data
  • requirements.txt : pip installed packages in requirements format

  • : runs the entire model pipleine (with data-pre-processed)

  • : runs the model pipeline with the pre-trained model for the given test data


Model train/test split of documents:

  • Training: Jan. - Aug. 2016
  • Testing: Sept. - Dec. 2016


The evalutation script prints out the AUPRC and best F1 for the predictions of a given model.

Example usage:

cat code/models/preds/m1.json | python code/eval/

The evaluation code requries predictions in the following json format (see code/models/preds/m1.json for example) with one dictionary per line and dicitionary keys:

  • "id" : document-sentence id
  • "weight" : prediction on that mention given by the model
  • "name" : name of the potential victim associated with that mention


To run the current model (logistic regression with EM training) with train/test data after pre-processing:


To run the pre-trained model on test data



Here's an outline for the pipeline for soft (EM-based logistic regression):

  1. Extract features

python code/models/logreg/ data/sentments/train.json --ngrams --deps

python code/models/logreg/ data/sentments/test.json --ngrams --deps

  1. Run through logistic regression

python code/models/logreg/ data/sentments/train.json data/sentments/test.json train_ng_dep.mtx test_ng_dep.mtx

  1. Evaluate

cat code/models/preds/em50.json | python code/eval/


The preprocessed data used for this paper is data/sentments/train.json and data/sentments/test.json.

For both files, each line corresponds to a single mention with dictionary keys:

  • "docid": document id
  • "name": HAPNIS normalized firstname, lastname pair of that is mapped to the 'TARGET' symbol
  • "names_org": un-normalized names corresponding to 'name' that originally appeared in the text
  • "sentnames": other names in the mention that will be mapped to 'PERSON' symbol
  • "downloadtime": time the document was downloaded
  • "sent_org": original mention text
  • "sent_alter": mention text with names replaced by 'TARGET' and 'PERSON' symbols
  • "plabel": 1 if 'name' matches a gold standard victim name in Fatal Encounters, 0 otherwise


Feature extraction requires


Code for Keith et al., EMNLP-2017 "Identifying civilians killed by police with distantly supervised entity-event extraction."






No releases published


No packages published