# Creating Gold Annotation Labels with BRAT

This is a short tutorial on how to use BRAT (Brat Rapid Annotation Tool), an
online environment for collaborative text annotation.

http://brat.nlplab.org/


In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import numpy as np

# Connect to the database backend and initialize a Snorkel session
from lib.init import *

## Step 1: Define a `Candidate` Type

In [2]:
Spouse = candidate_subclass('Spouse', ['person1', 'person2'])

### a) Select an example `Candidate` and `Document` 

Candidates are divided into 3 splits mapping to a unique integer id:  
0: _training_  
1: _development_  
2: _testing_   

In this tutorial, we'll load our training set candidates and create gold labels for a document using the BRAT interface.

## Step 2: Launching BRAT
BRAT runs as a separate server application. When you first initialize this server, you need to provide your application's `Candidate` type. For this tutorial, we use the `Spouse` relation defined above, which consists of a pair of `PERSON` named entities connected by marriage.

Currently, we only support one relation type per application.

In [3]:
from snorkel.contrib.brat import BratAnnotator

brat = BratAnnotator(session, Spouse, encoding='utf-8')

Launching BRAT server at http://localhost:8001 [pid=36348]...


### a) Initialize our document collection

BRAT creates a local copy of all the documents and annotations found in a `split` set. We initialize our document collection by passing in a set of candidates via the `split` id. Annotations are stored as plain text files in [standoff](http://brat.nlplab.org/standoff.html) format.
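
As a rough illustration of what these standoff files contain, the sketch below parses a few lines of a hypothetical `.ann` file. The sample sentence, names, offsets, and the `Person`/`Spouse` type labels are assumptions made for illustration, and the parser ignores discontinuous spans and other annotation types; the files BRAT writes for your collection may differ.

```python
# Hypothetical document text and its standoff annotations (tab-separated).
# T lines are text-bound entities: ID <TAB> TYPE START END <TAB> TEXT
# R lines are binary relations:    ID <TAB> TYPE Arg1:ID Arg2:ID
SAMPLE_TXT = "Ann Lee is married to Bob Ray."
SAMPLE_ANN = (
    "T1\tPerson 0 7\tAnn Lee\n"
    "T2\tPerson 22 29\tBob Ray\n"
    "R1\tSpouse Arg1:T1 Arg2:T2\n"
)

def parse_standoff(ann_text):
    """Split standoff annotations into entity and relation dictionaries."""
    entities, relations = {}, {}
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            etype, start, end = fields[1].split(" ")
            entities[ann_id] = (etype, int(start), int(end), fields[2])
        elif ann_id.startswith("R"):
            rtype, arg1, arg2 = fields[1].split(" ")
            relations[ann_id] = (rtype, arg1.split(":")[1], arg2.split(":")[1])
    return entities, relations

entities, relations = parse_standoff(SAMPLE_ANN)
print(entities["T1"])
print(relations["R1"])
```

The character offsets index directly into the document text, so `SAMPLE_TXT[22:29]` recovers the second mention.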

<div style="margin-top:10px">
<div style="float:left; width:30%; margin-right:10px">
<img src="imgs/brat-login.jpg" width="275px">
<p><b>Figure 1</b>: BRAT Login Screen</p>
</div>
<div style="float:right; width:65%">
After launching the BRAT annotator for the first time, you will need to log in to begin editing annotations. Move your mouse to the upper right-hand corner of the BRAT interface (see Fig. 1), click 'login', and enter the following information:
<br>
<ul>
<li><b>login</b>: brat</li>
<li><b>password</b>: brat</li>
</ul>
<br>
Advanced BRAT users can set up multiple annotator accounts by adding USER/PASSWORD pairs to the `USER_PASSWORD` dictionary found in `snorkel/contrib/brat/brat-v1.3_Crunchy_Frog/config.py`. This is useful if you would like to keep track of multiple annotators' judgments for later adjudication, or to use them as labeling functions as per our tutorial using [Snorkel for Crowdsourcing](https://github.com/HazyResearch/snorkel/blob/master/tutorials/crowdsourcing/Crowdsourced_Sentiment_Analysis.ipynb).
</div>
</div>

<br style="clear: both;"/>
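
For reference, the `USER_PASSWORD` dictionary mentioned above might look roughly like the fragment below after adding accounts. Only the `brat` account comes from this tutorial; the two extra user names and their passwords are hypothetical placeholders, and the rest of `config.py` is omitted.

```python
# Sketch of the USER_PASSWORD dictionary in BRAT's config.py.
# 'brat' is the default account used in this tutorial; the other two
# accounts and their passwords are hypothetical placeholders.
USER_PASSWORD = {
    'brat': 'brat',
    'annotator_a': 'change-me-a',
    'annotator_b': 'change-me-b',
}
```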

In [5]:
brat.init_collection("spouse/train", split=0)

Removed existing collection at 'spouse/train'


### b) Launch BRAT Interface in a New Window
Once our collection is initialized, we can view specific documents for annotation. The default mode is to generate an HTML link to a new BRAT browser window. Click this link to launch the annotation editor.

In [6]:
doc_name = '5ede8912-59c9-4ba9-93df-c58cebb542b7'
doc = session.query(Document).filter(Document.name==doc_name).one()

brat.view("spouse/train", doc)

If you do not have a specific document to edit, you can optionally launch a BRAT file browser to navigate through all files found in the target collection.

In [7]:
brat.view("spouse/train")

## Step 3: Creating Gold Label Annotations

### a) Annotating Named Entities
`Spouse` relations consist of two `PERSON` named entities. When annotating our validation documents, the first task is to identify our target entities. In this tutorial, we will annotate all `PERSON` mentions found in our example document, though for your application you may choose to label only those that participate in a true relation.

<div style="margin-top:10px">
<div style="float:left; width:45%; margin-right:10px">
<img src="imgs/brat-anno-dialog.jpg" width="600px">
<p><b>Figure 2</b>: BRAT Annotation Dialog Window</p>
</div>
<div style="float:right; width:50%">

Begin by selecting and highlighting the text corresponding to a `PERSON` entity. Once highlighted, an annotation dialog will appear on your screen (see Fig. 2). If this is correct, click ok. Repeat this for every entity you find in the document.

<b>Recommended Annotation Guidelines</b>
<ul>
<li><b><span style="color:red">Do not</span></b> include formal titles or roles, e.g., <i><b>Pastor</b> Jeff</i>, <i><b>Prime Minister</b> Prayut Chan-O-Cha</i>.</li>
<li>Do include informal titles, stage names, and nicknames, e.g., <i><b>Dog the Bounty Hunter</b></i>.</li>
<li>Include possessives, e.g., <i>Anna<b>'s</b></i>.</li>
<li><b><span style="color:red">Do not</span></b> include family names, e.g., <i>the Duggar family</i>.</li>
</ul></div></div>
<br style="clear: both;"/>

### b) Annotating Relations

To annotate `Spouse` relations, we look through all pairs of `PERSON` entities found within a single sentence. BRAT identifies the bounds of each sentence and renders a numbered row in the annotation window (see the left-most column for the sentence number).  

<div style="margin-top:10px">
<div style="float:left; width:45%; margin-right:10px">
<img src="imgs/brat-relation.jpg" width="450px">
<p><b>Figure 3</b>: BRAT NER and Relation Labels</p>
</div>
<div style="float:right; width:50%">

Annotating relations is done through simple drag and drop. Begin by clicking and holding on a single `PERSON` entity and then drag that entity to its corresponding spouse entity. That is it!

<b>Recommended Annotation Guidelines</b>
<ul>
<li>Restrict `PERSON` pairs to those found in the same sentence.</li>
<li>The order of `PERSON` arguments does not matter in this application.</li>
<li>Do not include relations where a `PERSON` argument is wrong or otherwise incomplete.</li>
</ul></div></div>
<br style="clear: both;"/>
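
The pairing rules above can be sketched in a few lines: group `PERSON` mentions by sentence number, then enumerate unordered pairs within each sentence. The mention ids and sentence numbers here are hypothetical, and this only illustrates which pairs an annotator needs to consider, not how BRAT or Snorkel store them.

```python
from itertools import combinations

# Hypothetical PERSON mentions as (sentence_number, mention_id) tuples.
mentions = [
    (1, "T1"),
    (1, "T2"),
    (1, "T3"),
    (2, "T4"),
]

def same_sentence_pairs(mentions):
    """Return unordered PERSON pairs that occur in the same sentence."""
    by_sentence = {}
    for sent_no, mention_id in mentions:
        by_sentence.setdefault(sent_no, []).append(mention_id)
    pairs = set()
    for ids in by_sentence.values():
        for a, b in combinations(ids, 2):
            # frozenset makes (a, b) and (b, a) the same pair
            pairs.add(frozenset((a, b)))
    return pairs

pairs = same_sentence_pairs(mentions)
print(len(pairs))
```

Sentence 1 contributes three candidate pairs to review, while sentence 2, with a single mention, contributes none.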



## Step 4: Scoring Models using BRAT Labels

### a) Evaluating System Recall

Creating gold validation data with BRAT is a critical evaluation step because it allows us to compute an estimate of our model's _true recall_. When we create labeled data only over a candidate set created by Snorkel, we never label relation mentions that the candidate extraction step failed to capture. This causes us to overestimate the system's true recall.

In the code below, we show how to map BRAT annotations to an existing set of Snorkel candidates and compute some associated metrics. 

In [8]:
train_cands = session.query(Candidate).filter(Candidate.split==0).all()

### b) Mapping BRAT Annotations to Snorkel Candidates
We annotated a single document using BRAT to illustrate the difference in scores when we factor in the effects of candidate generation. 
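
One way to picture the mapping step is as a lookup on character offsets: a gold relation maps to a candidate when both argument spans match, regardless of argument order. The sketch below is not the implementation of `brat.map_annotations`; the span tuples and the exact-match rule are assumptions made for illustration.

```python
def span_key(span_a, span_b):
    """Order-independent key for a pair of (start, end) character spans."""
    return frozenset((span_a, span_b))

def map_gold_to_candidates(gold_pairs, candidate_pairs):
    """Split gold pairs into those matched by a candidate and those missed."""
    candidate_keys = {span_key(a, b) for a, b in candidate_pairs}
    mapped = [p for p in gold_pairs if span_key(*p) in candidate_keys]
    missed = [p for p in gold_pairs if span_key(*p) not in candidate_keys]
    return mapped, missed

# Hypothetical gold relations and extracted candidates for one document.
gold = [((0, 7), (22, 29)), ((40, 48), (60, 67)), ((80, 85), (90, 99))]
cands = [((22, 29), (0, 7)), ((40, 48), (60, 67)), ((100, 105), (110, 118))]

mapped, missed = map_gold_to_candidates(gold, cands)
print("Mapped {}/{} gold relations".format(len(mapped), len(gold)))
```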

In [9]:
mapped_train, missed_train = brat.map_annotations(session, "spouse/train", train_cands)

Mapped 7/14 (50%) of BRAT labels to candidates


Our candidate extractor only captures 7/14 (50%) of true mentions in this document. Our real system's recall is likely even worse, since we won't correctly predict the label for all true candidates. 
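
To make the effect concrete, true recall can be approximated as the model's recall over candidates scaled by the fraction of gold mentions the extractor captured. The 7/14 coverage comes from the output above; the 0.80 candidate-level recall is a hypothetical figure chosen for illustration.

```python
def true_recall(candidate_recall, n_mapped, n_gold):
    """Scale recall over candidates by candidate-extraction coverage."""
    coverage = n_mapped / float(n_gold)
    return candidate_recall * coverage

# 7 of 14 gold mentions were mapped to candidates in this document.
# A candidate-level recall of 0.80 is assumed purely for illustration.
print(true_recall(0.80, 7, 14))
```

Under these assumptions, a model that looks strong when scored over candidates alone recovers well under half of the true relations in the document.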

### c) Loading as Snorkel Gold Labels
We can also load our BRAT annotations directly into the Snorkel database as follows: 

In [None]:
%time brat.import_gold_labels(session, "spouse/train", train_cands)

### d) Evaluating Our Metrics
Finally, let's evaluate the end model we trained in the discriminative model notebook. We'll then compute the full, recall-corrected metrics for a small subset (n=10) of BRAT-annotated test documents.

Important: These measures assume BRAT annotations are complete for the given set of documents!

In [None]:
test_cands = session.query(Candidate).filter(Candidate.split==2).all()

Since we're manually defining a subset of test documents with BRAT labels, let's build a query to initialize our candidate collection.

In [None]:
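# Read one document name per line; text mode keeps names as str so they
# compare correctly against Document.name
with open("data/brat_test_docs.tsv") as f:
    doc_ids = set(f.read().splitlines())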
cid_query = [c.id for c in test_cands if c.get_parent().document.name in doc_ids]

brat.init_collection("spouse/test-subset", cid_query=cid_query, overwrite=False)

In [None]:
brat.view("spouse/test-subset")

Now we'll load our test marginals and initialize the subset that maps to our BRAT-annotated documents.

In [None]:
marginals = np.load("test_marginals.npy")

In [None]:
brat.adjusted_score(session, test_cands, marginals, "spouse/test-subset")
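
For intuition about what a recall-corrected score involves, the sketch below thresholds marginals into predictions and counts gold relations that never became candidates as additional false negatives. It is not the implementation of `brat.adjusted_score`; the 0.5 threshold, the 0/1 label encoding, and the toy numbers are all assumptions.

```python
def adjusted_prf(marginals, gold_labels, n_missed, threshold=0.5):
    """Precision/recall/F1 where gold relations missed by candidate
    extraction (n_missed) are counted as extra false negatives."""
    preds = [1 if m > threshold else 0 for m in marginals]
    tp = sum(1 for p, y in zip(preds, gold_labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, gold_labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, gold_labels) if p == 0 and y == 1)
    fn += n_missed
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy data: 5 candidates, 3 of them true; 2 gold relations had no candidate.
toy_marginals = [0.9, 0.8, 0.3, 0.6, 0.1]
toy_gold = [1, 1, 1, 0, 0]

print(adjusted_prf(toy_marginals, toy_gold, n_missed=0))
print(adjusted_prf(toy_marginals, toy_gold, n_missed=2))
```

Precision is unchanged by the correction, since missed gold relations can only add false negatives; recall and F1 drop once they are included.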