## Exploring the IndiaPoliceEvents Corpus

This notebook provides example code for working with the *IndiaPoliceEvents* corpus. The corpus consists of every article published by the *Times of India* in March 2002 that matches a set of place name keywords. The dataset should be useful for researchers interested in training event extraction systems, evaluating the recall of event-based retrieval methods, or evaluating zero-shot event classification models, as well as for substantive researchers studying the period of communal violence that overlaps with the corpus's coverage.

This dataset accompanies our paper in the *Findings of the Association for Computational Linguistics* 2021. If you use the paper or the data, please cite it:

```
@inproceedings{halterman2021corpus,
author = {Halterman, Andrew and Keith, Katherine A. and Sarwar, Sheikh Muhammad and O'Connor, Brendan},
title = {Corpus-Level Evaluation for Event QA:
The IndiaPoliceEvents Corpus Covering the 2002 Gujarat Violence},
booktitle = {{Findings of ACL}},
year = 2021}
```

In [2]:
import pandas as pd

We provide data in several forms, including document-level labels and the raw annotations we collected from annotators, but a good place to start is with the sentence-level annotations. We provide the data in both JSONL and CSV formats and we'll use the CSV format here.

In [2]:
sents = pd.read_csv("data/final/sents.csv")
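
The JSONL release can be read in much the same way. The sketch below is only an illustration: it assumes each JSONL line is a single JSON object with fields like the CSV columns, and it parses two made-up records held in memory rather than a real file. The actual JSONL file name and schema should be checked against the data release.

```python
import io
import pandas as pd

# Two made-up records standing in for lines of a JSONL file (hypothetical content).
jsonl_text = (
    '{"doc_id": 1, "sent_id": 0, "sent_text": "police arrested two men.", "ARREST": 1}\n'
    '{"doc_id": 1, "sent_id": 1, "sent_text": "the market stayed open.", "ARREST": 0}\n'
)

# lines=True tells pandas to parse one JSON object per line.
sents_jsonl = pd.read_json(io.StringIO(jsonl_text), lines=True)
print(sents_jsonl.shape)
```

For a file on disk, the same call takes a path in place of the `StringIO` object.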

The dataset consists of 21,391 sentences from 1,257 *Times of India* stories from March 2002. Each sentence is annotated with zero or more of five event classes: KILL, ARREST, ANY_ACTION, FAIL, and FORCE. Each label corresponds to a positive answer to one of the boolean questions in the paper:

- `"KILL"`: The text item is indicative of "Yes" to the question "Did police kill someone?"
- `"ARREST"`: The text item is indicative of "Yes" to the question "Did police arrest someone?"
- `"FAIL"`: The text item is indicative of "Yes" to the question "Did police fail to intervene?"
- `"FORCE"`: The text item is indicative of "Yes" to the question "Did police use force or violence?"
- `"ANY_ACTION"`: The text item is indicative of "Yes" to the question "Did police do anything?"

Each sentence can have multiple labels.
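
Because the labels are not mutually exclusive, it can help to count how many labels each sentence carries. The following is a minimal sketch on a made-up three-row frame; only the column names come from the corpus, and the rows are purely illustrative.

```python
import pandas as pd

LABEL_COLS = ["KILL", "ARREST", "ANY_ACTION", "FAIL", "FORCE"]

# Made-up rows for illustration; the real data comes from sents.csv.
toy = pd.DataFrame(
    [
        [1, 0, 1, 0, 1],  # three labels on one sentence
        [0, 1, 1, 0, 0],  # two labels
        [0, 0, 0, 0, 0],  # no police activity
    ],
    columns=LABEL_COLS,
)

# Number of positive labels per sentence.
n_labels = toy[LABEL_COLS].sum(axis=1)

# Sentences carrying more than one label.
multi = toy[n_labels > 1]
print(n_labels.tolist(), len(multi))
```

On the real data, the same two lines can be applied to the sentence-level frame in place of `toy`.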

In [3]:
#number of sentences in the dataset
sents.shape

(21391, 9)

In [4]:
#number of documents in the dataset
len(sents['doc_id'].unique())

1257

In [5]:
#looking at the data 
sents.head()

Unnamed: 0.1,Unnamed: 0,doc_id,sent_id,sent_text,KILL,ARREST,ANY_ACTION,FAIL,FORCE
0,0,11,0,"This story is from March 10, 2002\n\n",0,0,0,0,0
1,1,11,1,lucknow:,0,0,0,0,0
2,2,11,2,the all-india babri masjid action committee (a...,0,0,0,0,0
3,3,11,3,"holding a state-level meeting on saturday, aib...",0,0,0,0,0
4,4,11,4,pujanâ€,0,0,0,0,0


The labels are relatively sparse: out of 21,391 sentences, fewer than 10% indicate any police activity, and each of the more specific labels is present in fewer than 1.5% of the sentences.

In [6]:
sents['KILL'].sum()

96

In [7]:
sents['ARREST'].sum()

301

In [8]:
sents['ANY_ACTION'].sum()

2092

In [9]:
sents['FAIL'].sum()

207

In [10]:
sents['FORCE'].sum()

222
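
The counts above can be converted to rates to make the sparsity concrete. This sketch hard-codes the counts printed in the preceding cells instead of reloading the file; assuming the label columns hold only 0 and 1, taking the column means of the loaded frame should give the same proportions.

```python
import pandas as pd

n_sents = 21391  # number of rows reported by sents.shape

# Positive-sentence counts copied from the cell outputs above.
counts = pd.Series(
    {"KILL": 96, "ARREST": 301, "ANY_ACTION": 2092, "FAIL": 207, "FORCE": 222}
)

# Share of sentences carrying each label, in percent.
rates_pct = (counts / n_sents * 100).round(2)
print(rates_pct.sort_values(ascending=False))
```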

### Loading document metadata
We also provide document URLs and dates, which can be joined with the sentence-level and document-level information on `doc_id`.

In [5]:
metadata = pd.read_csv('data/final/metadata.csv')

In [6]:
metadata.head()

Unnamed: 0.1,Unnamed: 0,doc_id,date,url,full_text
0,0,11,2002-03-10,http://timesofindia.indiatimes.com//city/luckn...,"This story is from March 10, 2002\n\nlucknow: ..."
1,1,12,2002-03-10,http://timesofindia.indiatimes.com//city/luckn...,"This story is from March 10, 2002\n\nnew delhi..."
2,2,13,2002-03-12,http://timesofindia.indiatimes.com//city/ahmed...,"This story is from March 12, 2002\n\ngandhinag..."
3,3,16,2002-03-01,http://timesofindia.indiatimes.com//city/luckn...,"This story is from March 1, 2002\n\nlucknow: t..."
4,4,27,2002-03-06,http://timesofindia.indiatimes.com//india/Over...,surat: the overall situation in the curfew-bou...
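
Joining on `doc_id` attaches each document's date and URL to its sentences. The sketch below uses two small made-up frames with the column names shown above; the values and URLs are placeholders, not rows from the corpus.

```python
import pandas as pd

# Made-up stand-ins for sents.csv and metadata.csv (placeholder values).
toy_sents = pd.DataFrame(
    {"doc_id": [11, 11, 12], "sent_id": [0, 1, 0], "ARREST": [0, 1, 0]}
)
toy_meta = pd.DataFrame(
    {
        "doc_id": [11, 12],
        "date": ["2002-03-10", "2002-03-10"],
        "url": ["http://example.com/a", "http://example.com/b"],
    }
)

# Many sentences map to one document, so validate the join as many-to-one.
merged = toy_sents.merge(
    toy_meta[["doc_id", "date", "url"]], on="doc_id", how="left", validate="m:1"
)
merged["date"] = pd.to_datetime(merged["date"])

# Example: number of ARREST sentences per day.
arrests_per_day = merged.groupby("date")["ARREST"].sum()
print(arrests_per_day)
```

The `validate="m:1"` argument makes pandas raise an error if `doc_id` is duplicated in the metadata, which would otherwise silently multiply sentence rows.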


## Zero-shot classification with MNLI

Our paper provides several zero-shot baselines for classifying documents or sentences with their event class labels. One approach uses a large-scale language model trained on natural language inference (NLI) data. These models take a context (here, a sentence from a news article) and a statement (here, a sentence about police activity), and return whether the statement is *entailed* by the context, *contradicted* by the context, or *neutral*. We provide code below for using a RoBERTa model fine-tuned on the MNLI dataset, and we treat either the predicted probability or the hard label prediction of the "entailment" class as a positive answer to the statement about police activity.
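
To make the two decision rules concrete, the sketch below works through them with NumPy on made-up three-class scores, not model output. It assumes the class order contradiction, neutral, entailment, which is the order used by the RoBERTa MNLI code later in this notebook.

```python
import numpy as np

ENTAILMENT = 2  # assumed class order: contradiction, neutral, entailment

# Made-up unnormalized scores for two (context, statement) pairs.
logits = np.array([[0.1, 0.5, 3.0],
                   [2.0, 1.0, 0.2]])

# Log-softmax, computed stably by subtracting the row maximum.
shifted = logits - logits.max(axis=1, keepdims=True)
logprobs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Soft score: probability assigned to "entailment".
prob_pos = np.exp(logprobs[:, ENTAILMENT])

# Hard label: 1 only if "entailment" is the most probable class.
pred_pos = (logprobs.argmax(axis=1) == ENTAILMENT).astype(int)
print(prob_pos, pred_pos)
```

Note that the hard label is not the same as thresholding the probability at 0.5: with three classes, entailment can be the most probable class while its probability is below 0.5.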

In [11]:
import torch
import numpy as np 
from fairseq.data.data_utils import collate_tokens

In [12]:
def load_roberta_mnli_model(): 
    """
    Load the (already fine-tuned) RoBERTa + MNLI model 
    """
    roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
    return roberta 

def pred_roberta_mnli_batch(roberta, batch_of_pairs): 
    """
    Batched predictions with the RoBERTa MNLI model
    Inputs: 
    - roberta : torch pre-trained RoBERTa model 
    - batch_of_pairs : list of list, each entry is (sent + context, question)
        example 
        batch_of_pairs = [
            ['Police were there. Police killed civilians.', 'Police killed someone'],
            ['People died by police firing.', 'Police killed someone.']
            ]
    Output: 
    - prob_pos : probability the model assigns to "entailment"
    - pred_pos : 0 or 1, whether the model predicts positive, "entailment" (is argmax across the three classes)
    """
    label_map = {0: 'contradiction', 1: 'neutral', 2: 'entailment'} #from RoBERTa code  

    roberta.eval()  # disable dropout for evaluation
    batch = collate_tokens(
        [roberta.encode(pair[0], pair[1]) for pair in batch_of_pairs], pad_idx=1
    )
    with torch.no_grad():  # inference only, so skip gradient tracking
        logprobs = roberta.predict('mnli', batch).cpu()  # move to CPU before converting to NumPy
    prob_pos = np.exp(logprobs.numpy()[:, 2])  # probability of "entailment"
    pred_pos = (logprobs.argmax(dim=1).numpy() == 2).astype(int)

    assert len(batch_of_pairs) == len(prob_pos) == len(pred_pos)
    return prob_pos, pred_pos
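
To score many sentences against one statement, the pairs can be built and passed to the model in fixed-size chunks. `make_pairs` and `chunked` below are hypothetical helpers, not part of the released code, and the batch size and example sentences are arbitrary.

```python
def make_pairs(texts, statement):
    """Pair every context sentence with the same statement."""
    return [[text, statement] for text in texts]


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up sentences for illustration.
texts = ["police opened fire.", "shops reopened today.", "two men were held."]
pairs = make_pairs(texts, "Police arrested someone.")
batches = list(chunked(pairs, 2))
print(len(batches), [len(b) for b in batches])

# With the model loaded, each batch could then be scored with:
#     prob_pos, pred_pos = pred_roberta_mnli_batch(roberta, batch)
```

Chunking keeps memory use bounded, since `collate_tokens` pads every pair in a batch to the length of the longest one.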

In [13]:
roberta = load_roberta_mnli_model()

In [14]:
arrest_example = sents[sents['ARREST'] == 1]
example_text = arrest_example.iloc[4]['sent_text']
example_text

'about a dozen rumour mongers were nabbed in the city last night, he said.'

In [15]:
pred_roberta_mnli_batch(roberta, [[example_text, "Police arrested someone."]])

(array([0.9284519], dtype=float32), array([1]))

In [16]:
pred_roberta_mnli_batch(roberta, [[example_text, "Police killed someone."]])

(array([0.01021631], dtype=float32), array([0]))

In [17]:
pred_roberta_mnli_batch(roberta, [[example_text, "Police did something."]])

(array([0.6437364], dtype=float32), array([1]))

In [18]:
pred_roberta_mnli_batch(roberta, [[example_text, "This is an irrelevant sentence."]])

(array([0.01087893], dtype=float32), array([0]))

In [19]:
pred_roberta_mnli_batch(roberta, [[example_text, "Wikipedia is an online encylopedia."]])

(array([0.37021947], dtype=float32), array([0]))