# DyGIE++: NER & RE with Coreference Resolution
--------------------------------------
[DyGIE++ paper](https://arxiv.org/pdf/1909.03546.pdf) <br>
[DyGIE++ GitHub](https://github.com/dwadden/dygiepp)

This notebook applies pre-trained DyGIE++ models to perform named entity recognition (NER) and relation extraction (RE). DyGIE++ performs these tasks jointly, using coreference resolution to improve the performance of the model.
<br><br>
In the long term, new data to train a model is being annotated in the same style as GENIA. However, GENIA only has an annotation set for static relations, not causal relations, and the relations have been completely excluded from the DyGIE++ implementation for the GENIA pre-trained model. For this reason, GENIA has been excluded from this notebook, and ACE05 and SciERC will be used.
<br><br>
For the command lines in this notebook, the following directory structure is assumed:
```
root_dir
    |
    |── knowledge-graph/
    |
    └── dygiepp/ 
```

-------------------------------------
## 0. Formatting unlabeled data
--------------------------------
In order to apply a pre-trained model to unlabeled data, some formatting requirements must be met. From the [docs](https://github.com/dwadden/dygiepp/blob/master/doc/data.md): 
* In the case where your unlabeled data are stored as a directory of `.txt` files (one file per document), you can run `python scripts/data/new-dataset/format_new_dataset.py [input-directory] [output-file]` to format the documents into a `jsonl` file, with one line per document. If your dataset is scientific text, add the `--use-scispacy` flag to have SciSpacy do the tokenization.
    
* If you'd like to use a pretrained DyGIE++ model to make predictions on a new dataset, the `dataset` field in your new dataset must match the `dataset` that the original model was trained on; this indicates to the model which label namespace it should use for predictions.
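
As a rough illustration of the target format, the sketch below builds one JSON object per document with `doc_key`, `dataset`, and `sentences` fields. The field names reflect my reading of the DyGIE++ data docs and should be checked against them. `format_document` and `to_jsonl` are hypothetical helpers, and the regex sentence split and whitespace tokenization are stand-ins for the spaCy/SciSpacy tokenization that the real script performs.

```python
import json
import re


def format_document(doc_key, text, dataset):
    """Build one DyGIE++-style record from raw text (hypothetical helper).

    Assumption: sentences are split on '.', '!' or '?' followed by whitespace,
    and tokens on whitespace; the real script uses spaCy/SciSpacy instead.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {
        "doc_key": doc_key,
        "dataset": dataset,  # must match the dataset the model was trained on
        "sentences": [s.split() for s in sentences],
    }


def to_jsonl(records):
    """Serialize records as JSON Lines: one document per line."""
    return "\n".join(json.dumps(r) for r in records)


record = format_document(
    "abstract_1",
    "Jasmonic acid regulates defense. It also affects growth.",
    "scierc",
)
print(to_jsonl([record]))
```

The same text formatted with `dataset="ace05"` would differ only in the `dataset` field, which is why a separate file is needed per model.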

### Make a directory of `.txt` files
This is done with a script on the command line, `knowledge-graph/data_retrieval/abstracts_only/getAbstracts.py`. The data to be extracted come from PubMed-format files downloaded for all results of the two searches "jasmonic acid" and "gibberellic acid", and are found in the directory `knowledge-graph/data/dygiepp_pretrained_application_data`. The command lines used to extract the data (while in the `knowledge-graph/data_retrieval/abstracts_only` directory) are:

```
python getAbstracts.py -abstracts_txt ../../data/dygiepp_pretrained_application_data/pubmed_files/pubmed-jasmonicac-set-ALL-RESULTS.txt -dest_dir ../../data/dygiepp_pretrained_application_data/

python getAbstracts.py -abstracts_txt ../../data/dygiepp_pretrained_application_data/pubmed_files/pubmed-gibberelli-set-ALL-RESULTS.txt -dest_dir ../../data/dygiepp_pretrained_application_data/
```
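
The source of `getAbstracts.py` is not shown here, so the following is only a sketch of the kind of extraction it performs, not the script itself. It assumes the usual PubMed-format layout: a `PMID- ` line per record, an abstract field starting with `AB  - `, and continuation lines indented with six spaces. `parse_pubmed_abstracts` is a hypothetical name.

```python
def parse_pubmed_abstracts(pubmed_text):
    """Return {pmid: abstract} from PubMed-format text (hypothetical helper).

    Assumptions: each record has a 'PMID- ' line, the abstract starts with
    'AB  - ', and continuation lines are indented with six spaces.
    """
    abstracts = {}
    pmid = None
    in_abstract = False
    for line in pubmed_text.splitlines():
        if line.startswith("PMID- "):
            pmid = line[6:].strip()
            in_abstract = False
        elif line.startswith("AB  - ") and pmid is not None:
            abstracts[pmid] = line[6:].strip()
            in_abstract = True
        elif in_abstract and line.startswith("      "):
            # Continuation line of the abstract field
            abstracts[pmid] += " " + line.strip()
        else:
            # Any other tag (or a blank line) ends the abstract field
            in_abstract = False
    return abstracts


sample = (
    "PMID- 111\n"
    "TI  - A title\n"
    "AB  - Jasmonic acid regulates\n"
    "      plant defense.\n"
    "FAU - Doe, Jane\n"
    "\n"
    "PMID- 222\n"
    "AB  - Gibberellic acid promotes growth.\n"
)
for pmid, abstract in parse_pubmed_abstracts(sample).items():
    print(pmid, "->", abstract)
```

Writing each abstract to its own `.txt` file then gives the one-file-per-document directory that the formatting script expects.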

Now we can use the DyGIE++ formatting script to convert the documents into a `jsonl` file. This has to be done separately for SciERC and ACE05, because the `dataset` field must match the model that will be applied. In the directory `~/projects/knowledge-graph/dygiepp`, run:

#### SciERC

```
# Create the file for the output 
vim ../knowledge-graph/data/dygiepp_prepped/dygiepp_formatted_data_SciERC.jsonl

python scripts/new-dataset/format_new_dataset.py ../knowledge-graph/data/dygiepp_pretrained_application_data/ ../knowledge-graph/data/dygiepp_prepped/dygiepp_formatted_data_SciERC.jsonl scierc
```
#### ACE05

```
# Create the file for the output 
vim ../knowledge-graph/data/dygiepp_prepped/dygiepp_formatted_data_ACE05.jsonl

python scripts/new-dataset/format_new_dataset.py ../knowledge-graph/data/dygiepp_pretrained_application_data/ ../knowledge-graph/data/dygiepp_prepped/dygiepp_formatted_data_ACE05.jsonl ace05
```


-------------------
## 1. Making predictions with pre-trained models
----------------------------
From the docs:<br>
To make predictions on a new, unlabeled dataset:

1. Download the pretrained model that most closely matches your text domain.
2. Make sure that the `dataset` field for your new dataset matches the label namespaces for the pretrained model. See here for more on label namespaces. To view the available label namespaces for a pretrained model, use `print_label_namespaces.py`.
3. Make predictions the same way as with the existing datasets:
```
allennlp predict pretrained/[name-of-pretrained-model].tar.gz \
    [input-path] \
    --predictor dygie \
    --include-package dygie \
    --use-dataset-reader \
    --output-file [output-path] \
    --cuda-device [cuda-device]
```
A couple tricks to make things run smoothly:

1. If you're predicting on a big dataset, you probably want to load it lazily rather than loading the whole thing in before predicting. To accomplish this, add the following flag to the above command:

```
--overrides "{'dataset_reader' +: {'lazy': true}}"
```

2. If the model runs out of GPU memory on a given prediction, it will warn you and continue with the next example rather than stopping entirely. This is less annoying than the alternative. Examples for which predictions failed will still be written to the specified `jsonl` output, but they will have an additional field `{"_FAILED_PREDICTION": true}` indicating that the model ran out of memory on this example.
3. The `dataset` field in the dataset to be predicted must match one of the datasets on which the model was trained; otherwise, the model won't know which labels to apply to the predicted data.

The commands below make predictions on the dataset with the two pre-trained models. The pretrained models were downloaded by running `bash scripts/pretrained/get_dygiepp_pretrained.sh` from the `dygiepp/` directory, which puts them in a directory called `dygiepp/pretrained/`; note that this download takes a very long time. The prediction commands need to be run on one of the dev nodes that offers GPU computing: `dev-intel14-k20` or `dev-intel16-k80`.

#### SciERC

```
allennlp predict pretrained/scierc.tar.gz ../knowledge-graph/data/dygiepp_prepped/dygiepp_formatted_data_SciERC.jsonl --predictor dygie --include-package dygie --use-dataset-reader --output-file ../knowledge-graph/data/dygiepp_pretrained_output/SciERC_predictions.jsonl --cuda-device 0 --overrides "{'dataset_reader' +: {'lazy': true}}"
```

#### ACE05

```
allennlp predict pretrained/ace05-relation.tar.gz ../knowledge-graph/data/dygiepp_prepped/dygiepp_formatted_data_ACE05.jsonl --predictor dygie --include-package dygie --use-dataset-reader --output-file ../knowledge-graph/data/dygiepp_pretrained_output/ACE05_predictions.jsonl --cuda-device 0 --overrides "{'dataset_reader' +: {'lazy': true}}"
```

There are ~8500 abstracts in this dataset, so this also takes a while.
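
Since the two prediction commands differ only in the model archive and file names, they can be assembled programmatically. The sketch below only builds and prints the argument lists using the flags already shown above; `build_predict_command` is a hypothetical helper, not part of AllenNLP or DyGIE++.

```python
import shlex


def build_predict_command(model_archive, input_path, output_path,
                          cuda_device=0, lazy=True):
    """Assemble the `allennlp predict` call as an argument list (hypothetical helper)."""
    cmd = [
        "allennlp", "predict", model_archive, input_path,
        "--predictor", "dygie",
        "--include-package", "dygie",
        "--use-dataset-reader",
        "--output-file", output_path,
        "--cuda-device", str(cuda_device),
    ]
    if lazy:
        # Same override string as in the commands above
        cmd += ["--overrides", "{'dataset_reader' +: {'lazy': true}}"]
    return cmd


data_dir = "../knowledge-graph/data"
for archive, name in [("scierc", "SciERC"), ("ace05-relation", "ACE05")]:
    cmd = build_predict_command(
        f"pretrained/{archive}.tar.gz",
        f"{data_dir}/dygiepp_prepped/dygiepp_formatted_data_{name}.jsonl",
        f"{data_dir}/dygiepp_pretrained_output/{name}_predictions.jsonl",
    )
    print(shlex.join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` from the `dygiepp/` directory; building a list rather than a single string avoids shell quoting problems with the `--overrides` value.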


--------------
## 2. Looking at the predictions 
----------------------
Here, we'll define a function to pass the output of a randomly selected abstract from the output `jsonl` file to spaCy's displaCy visualizer, to see what the predictions look like for entities. displaCy doesn't support relation annotations/representations.
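
The conversion step for displaCy might look like the sketch below. It assumes that each output doc has a `sentences` field (a list of token lists) and a `predicted_ner` field holding, per sentence, spans of the form `[start_tok, end_tok, label, ...]` with document-level, inclusive token indices; both assumptions should be checked against the actual output file. `to_displacy_ents` is a hypothetical helper that produces the dictionary format displaCy accepts in manual mode.

```python
def to_displacy_ents(doc, ner_key="predicted_ner"):
    """Convert one DyGIE++-style doc into displaCy's manual 'ent' format.

    Assumptions: doc["sentences"] is a list of token lists, and doc[ner_key]
    holds, per sentence, spans [start_tok, end_tok, label, ...] that use
    document-level, inclusive token indices.
    """
    tokens = [tok for sent in doc["sentences"] for tok in sent]
    # Character offset of each token when tokens are joined by single spaces
    starts, pos = [], 0
    for tok in tokens:
        starts.append(pos)
        pos += len(tok) + 1
    ents = []
    for sent_spans in doc.get(ner_key, []):
        for span in sent_spans:
            start_tok, end_tok, label = span[0], span[1], span[2]
            ents.append({
                "start": starts[start_tok],
                "end": starts[end_tok] + len(tokens[end_tok]),
                "label": label,
            })
    ents.sort(key=lambda e: e["start"])
    return {"text": " ".join(tokens), "ents": ents, "title": doc.get("doc_key")}


# Toy document with made-up labels and scores, for illustration only
example = {
    "doc_key": "abstract_1",
    "sentences": [["Jasmonic", "acid", "regulates", "defense", "."],
                  ["It", "affects", "growth", "."]],
    "predicted_ner": [[[0, 1, "Material", 5.1, 0.9]], [[7, 7, "Task", 3.2, 0.8]]],
}
rendered_input = to_displacy_ents(example)
print(rendered_input["ents"])
```

With spaCy installed, passing this dictionary to `spacy.displacy.render(rendered_input, style="ent", manual=True)` should highlight the predicted entity spans.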

Looking back at the preliminary printouts from the terminal, it looks like many of the predictions may have failed. To start, let's look at how many docs have failed or missing predictions, compared to the total number of docs in the prepared data. The SciERC model, run on `dev-intel14-k20`, finished but with many failed predictions, whereas the ACE05 run was killed.

In [1]:
import jsonlines

In [4]:
# Load prepped data (input)
scierc_input = []
with jsonlines.open('../data/dygiepp_prepped/dygiepp_formatted_data_SciERC.jsonl') as reader:
    for obj in reader:
        scierc_input.append(obj)
        
ace05_input = []
with jsonlines.open('../data/dygiepp_prepped/dygiepp_formatted_data_ACE05.jsonl') as reader:
    for obj in reader:
        ace05_input.append(obj)
        
print('Number of input docs for each model:')
print('-------------------------------------')
print(f'SciERC: {len(scierc_input)}      ACE05: {len(ace05_input)}')

Number of input docs for each model:
-------------------------------------
SciERC: 8673      ACE05: 8673


In [5]:
# Load predictions (output)
scierc_output = []
with jsonlines.open('../data/dygiepp_pretrained_output/SciERC_predictions.jsonl') as reader:
    for obj in reader:
        scierc_output.append(obj)
        
ace05_output = []
with jsonlines.open('../data/dygiepp_pretrained_output/ACE05_predictions.jsonl') as reader:
    for obj in reader:
        ace05_output.append(obj)

print('Number of output docs for each model:')
print('--------------------------------------')
print(f'SciERC: {len(scierc_output)}       ACE05: {len(ace05_output)}')

Number of output docs for each model:
--------------------------------------
SciERC: 8673       ACE05: 0


In [8]:
# Check for failed predictions
num_failed = sum(1 for doc in scierc_output if doc.get("_FAILED_PREDICTION"))
            
print(f'Number of failed predictions for SciERC: {num_failed}')

Number of failed predictions for SciERC: 8673


Riperoni: all 8673 SciERC predictions failed, and the ACE05 run didn't write any output at all. Going to try submitting these as jobs instead; the job scripts are located in `knowledge-graph/job_scripts/`.
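
When re-running, it may help to know which documents are missing or failed, not just how many. The sketch below compares input and output by `doc_key`; it assumes every doc carries a `doc_key` field, and `summarize_predictions` is a hypothetical helper demonstrated on toy data rather than the real files.

```python
def summarize_predictions(input_docs, output_docs):
    """Group doc_keys into missing, failed, and succeeded (hypothetical helper).

    Assumption: every input and output doc has a 'doc_key' field.
    """
    input_keys = {doc["doc_key"] for doc in input_docs}
    output_keys = {doc["doc_key"] for doc in output_docs}
    failed = {doc["doc_key"] for doc in output_docs
              if doc.get("_FAILED_PREDICTION")}
    return {
        "missing": sorted(input_keys - output_keys),  # never written to output
        "failed": sorted(failed),                     # written, but flagged
        "succeeded": sorted(output_keys - failed),
    }


# Toy data standing in for the loaded jsonl files
toy_input = [{"doc_key": "a"}, {"doc_key": "b"}, {"doc_key": "c"}]
toy_output = [{"doc_key": "a", "_FAILED_PREDICTION": True}, {"doc_key": "b"}]
summary = summarize_predictions(toy_input, toy_output)
for status, keys in summary.items():
    print(f"{status}: {keys}")
```

Calling `summarize_predictions(scierc_input, scierc_output)` on the lists loaded above would give the same breakdown for the real run.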