Replication software, data, and supplementary materials for the paper: Keith et al., EMNLP-2017, "Identifying civilians killed by police with distantly supervised entity-event extraction."

Contact: Katherine Keith, Brendan O'Connor

  • data/

    • sentments/
      • train.json, test.json: sentence-level train and test files after preprocessing
        • docid = docid_fragid_sentid, i.e. the first number (before the first separator) is the document id matching docs/, the second number is the fragment number within the document, and the third number is the sentence number within that fragment (see preproc/ for how these fragments and sentences were segmented)
    • gold/
      • fatalencs/
        • fe-raw.csv: raw Fatal Encounters (FE) file downloaded Feb. 27, 2017
        • fe-all.json: our FE post-processing (with HAPNIS-normalized names)
      • guardian/
        • guard-raw.csv: raw Guardian 2016 data downloaded Jan. 1, 2017
        • guard-all.json: our Guardian post-processing (with HAPNIS-normalized names)
  • code/

    • eval/
      • prints out AUPRC and best F1 for a given model
    • models/
      • preds/
        • .json files with predictions for the six models in the paper
      • logreg/
        • Logistic regression model code. See this directory for further instructions and notes
      • cnn/
        • CNN model code. See this directory for further instructions and notes
    • preproc/
      • scrape/
        • Code which downloads articles found via Google News and adds them to a Postgres database
      • dedupe/
        • Code which removes duplicate sentences from the dataset.
      • sentment/
        • hap/ : HAPNIS name normalization code
        • : name normalization
        • : matches extracted sentences against gold data
  • requirements.txt : pip installed packages in requirements format

  • : runs the entire model pipeline (with pre-processed data)

  • : runs the model pipeline with the pre-trained model for the given test data


Model train/test split of documents:

  • Training: Jan. - Aug. 2016
  • Testing: Sept. - Dec. 2016


The evaluation script prints out the AUPRC and best F1 for the predictions of a given model.

Example usage:

cat code/models/preds/m1.json | python code/eval/

The evaluation code requires predictions in the following JSON format (see code/models/preds/m1.json for an example), with one dictionary per line and these dictionary keys:

  • "id" : document-sentence id
  • "weight" : prediction on that mention given by the model
  • "name" : name of the potential victim associated with that mention
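
As a rough illustration of the format above, the following sketch parses one JSON dictionary per line and checks that the three documented keys are present. It is not the repository's evaluation code: the function name `read_predictions` and the example records are hypothetical, invented only for this example.

```python
import json

# The three keys documented above for each prediction line.
REQUIRED_KEYS = {"id", "weight", "name"}


def read_predictions(lines):
    """Parse one JSON dictionary per line and check the documented keys.

    `lines` can be any iterable of strings, e.g. an open file or sys.stdin.
    """
    preds = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
        preds.append(record)
    return preds


# Hypothetical records, invented for illustration only.
example_lines = [
    '{"id": "0_0_0", "weight": 0.91, "name": "jane doe"}',
    '{"id": "1_2_3", "weight": 0.07, "name": "john roe"}',
]
example_preds = read_predictions(example_lines)
print(len(example_preds), "predictions parsed")
```

To check a real predictions file, the same function could be called on an open file handle, for example `read_predictions(open("code/models/preds/m1.json"))`.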


To run the current model (logistic regression with EM training) with train/test data after pre-processing:


To run the pre-trained model on test data



Here's an outline of the pipeline for the soft (EM-based) logistic regression model:

  1. Extract features

python code/models/logreg/ data/sentments/train.json --ngrams --deps

python code/models/logreg/ data/sentments/test.json --ngrams --deps

  2. Run through logistic regression

python code/models/logreg/ data/sentments/train.json data/sentments/test.json train_ng_dep.mtx test_ng_dep.mtx

  3. Evaluate

cat code/models/preds/em50.json | python code/eval/
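
For readers unfamiliar with the two metrics reported by the evaluation step, here is a minimal sketch of how AUPRC and best F1 can be computed from scored predictions once each prediction has a binary gold label. This is an assumption-laden illustration, not the repository's script: the function names are hypothetical, the area uses a simple step-wise sum (the actual script may interpolate differently), and how predictions are matched to gold labels is not covered here.

```python
def pr_points(scores, labels):
    """Return (precision, recall) at each distinct score threshold.

    Items are ranked by score, highest first; labels are 0 or 1.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points, true_pos = [], 0
    for i, (score, label) in enumerate(ranked, start=1):
        true_pos += label
        # Emit a point only after all items tied at this score are counted.
        if i == len(ranked) or ranked[i][0] != score:
            points.append((true_pos / i, true_pos / total_pos))
    return points


def auprc(points):
    """Step-wise area under the PR curve: sum of precision * recall gain."""
    area, prev_recall = 0.0, 0.0
    for precision, recall in points:
        area += precision * (recall - prev_recall)
        prev_recall = recall
    return area


def best_f1(points):
    """Maximum F1 over all thresholds."""
    return max(
        (2 * p * r / (p + r)) if (p + r) > 0 else 0.0
        for p, r in points
    )


# Toy scores and labels, invented for illustration only.
toy_points = pr_points([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print("AUPRC:", auprc(toy_points))
print("best F1:", best_f1(toy_points))
```

The step-wise sum is one common convention for the area; other conventions (for example trapezoidal interpolation) give slightly different numbers on the same predictions.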


The preprocessed data used for this paper is data/sentments/train.json and data/sentments/test.json.

For both files, each line corresponds to a single mention with dictionary keys:

  • "docid": document id
  • "name": HAPNIS-normalized (firstname, lastname) pair that is mapped to the 'TARGET' symbol
  • "names_org": un-normalized names corresponding to 'name' that originally appeared in the text
  • "sentnames": other names in the mention that will be mapped to 'PERSON' symbol
  • "downloadtime": time the document was downloaded
  • "sent_org": original mention text
  • "sent_alter": mention text with names replaced by 'TARGET' and 'PERSON' symbols
  • "plabel": 1 if 'name' matches a gold standard victim name in Fatal Encounters, 0 otherwise
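
The mention format above can be explored with a short script. The sketch below reads one mention per line and counts positive and negative "plabel" values and distinct names. The function name `summarize_mentions` and the example records are hypothetical, and the handling of "name" as either a string or a list is an assumption, since the exact JSON type is not stated here.

```python
import json
from collections import Counter


def summarize_mentions(lines):
    """Count plabel values and collect distinct names from mention lines."""
    label_counts = Counter()
    names = set()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        mention = json.loads(line)
        label_counts[mention["plabel"]] += 1
        name = mention["name"]
        # Assumption: "name" may be stored as a list (firstname, lastname);
        # convert to a tuple so it can be placed in a set.
        names.add(tuple(name) if isinstance(name, list) else name)
    return label_counts, names


# Hypothetical mentions, invented for illustration only.
example_mentions = [
    '{"docid": "0_0_0", "name": "jane doe", "plabel": 1, '
    '"sent_alter": "TARGET was killed by police."}',
    '{"docid": "0_0_1", "name": "john roe", "plabel": 0, '
    '"sent_alter": "TARGET spoke with PERSON."}',
    '{"docid": "1_0_0", "name": "jane doe", "plabel": 1, '
    '"sent_alter": "Officers shot TARGET."}',
]
counts, distinct_names = summarize_mentions(example_mentions)
print("label counts:", dict(counts))
print("distinct names:", len(distinct_names))
```

On the real data, the function could be applied to an open file, for example `summarize_mentions(open("data/sentments/train.json"))`.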


Feature extraction requires




