# DyGIE++: NER & RE with Coreference Resolution
--------------------------------------
[DyGIE++ paper](https://arxiv.org/pdf/1909.03546.pdf) <br>
[DyGIE++ GitHub](https://github.com/dwadden/dygiepp)

In this notebook, named entity recognition (NER) & relation extraction (RE) will be implemented using pre-trained models with DyGIE++. DyGIE++ performs these tasks simultaneously, using coreference resolution to enhance the performance of the model.
<br><br>
In the long term, new data to train a model is being annotated in the same style as GENIA. However, GENIA only has an annotation set for static relations, not causal relations, and the relations have been completely excluded from the DyGIE++ implementation for the GENIA pre-trained model. For this reason, GENIA has been excluded from this notebook, and ACE05 and SciERC will be used.
<br><br>
For the command lines in this notebook, the following directory structure is assumed:
```
root_dir
    |
    |── knowledge-graph/
    |
    └── dygiepp/ 
```

-------------------------------------
## 0. Formatting unlabeled data
--------------------------------
In order to apply a pre-trained model to unlabeled data, some formatting requirements must be met. From the [docs](https://github.com/dwadden/dygiepp/blob/master/doc/data.md): 
* In the case where your unlabeled data are stored as a directory of `.txt` files (one file per document), you can run `python scripts/data/new-dataset/format_new_dataset.py [input-directory] [output-file]` to format the documents into a `jsonl` file, with one line per document. If your dataset is scientific text, add the `--use-scispacy` flag to have SciSpacy do the tokenization.
    
* If you'd like to use a pretrained DyGIE++ model to make predictions on a new dataset, the `dataset` field in your new dataset must match the `dataset` that the original model was trained on; this indicates to the model which label namespace it should use for predictions.
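
The second requirement can be checked before a prediction job is launched. The following is a minimal sketch, assuming each line of the formatted file is a JSON object with a `dataset` key as described above; the function name `find_dataset_mismatches` and the toy records are hypothetical and not part of DyGIE++:

```python
import json


def find_dataset_mismatches(jsonl_lines, expected_dataset):
    """Return (line_number, found_value) for lines whose `dataset` field differs."""
    mismatches = []
    for line_number, line in enumerate(jsonl_lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        doc = json.loads(line)
        found = doc.get("dataset")  # None if the field is missing entirely
        if found != expected_dataset:
            mismatches.append((line_number, found))
    return mismatches


# Toy lines standing in for a formatted file (hypothetical content).
toy_lines = [
    json.dumps({"doc_key": "doc1", "dataset": "scierc", "sentences": [["A", "test", "."]]}),
    json.dumps({"doc_key": "doc2", "dataset": "ace05", "sentences": [["Another", "."]]}),
]
print(find_dataset_mismatches(toy_lines, "scierc"))
```

For a real file, an open file handle can be passed in place of `toy_lines`, since the function only iterates over lines.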

### Make a directory of `.txt` files
This is done on the command line with the script `knowledge-graph/data_retrieval/abstracts_only/getAbstracts.py`. The input data are PubMed-format files containing all search results for the two searches "jasmonic acid" and "gibberellic acid"; the extracted abstracts are written to the directory `knowledge-graph/data/first_manuscript_data/raw_abstracts/`. The commands used to extract the data (run from the `knowledge-graph/data_retreival/abstracts_only` directory) are:

```
python getAbstracts.py -abstracts_txt ../../data/first_manuscript_data/pubmed_files/pubmed-jasmonicac-set-ALL-RESULTS.txt -dest_dir ../../data/first_manuscript_data/raw_abstracts/

python getAbstracts.py -abstracts_txt ../../data/first_manuscript_data/pubmed_files/pubmed-gibberelli-set-ALL-RESULTS.txt -dest_dir ../../data/first_manuscript_data/raw_abstracts/
```

### Choose specific docs with clustering pipeline

This is done with the scripts in `knowledge-graph/data_retreival/doc_clustering`. The command, run from `knowledge-graph/data_retreival/doc_clustering/`, is:

```
python doc_clustering.py -data ../../data/first_manuscript_data/raw_abstracts/ -num_abstracts 8000 -out_loc ../../data/first_manuscript_data/clustering_pipeline_output/ -new_dir_name JA+GA_chosen_abstracts
```

### Format for input to DyGIE++ 

Now we can use the dygiepp command line to format the documents into a `jsonl` file. We have to do this separately for SciERC and ACE05; the SciERC file is used for both the full and the lightweight SciERC models. In the directory `dygiepp/`, run:

#### SciERC

```
# Create the file for the output 
vim ../knowledge-graph/data/first_manuscript_data/dygiepp/prepped_data/dygiepp_formatted_data_SciERC.jsonl

python scripts/new-dataset/format_new_dataset.py ../knowledge-graph/data/first_manuscript_data/clustering_pipeline_output/JA+GA_chosen_abstracts/ ../knowledge-graph/data/first_manuscript_data/dygiepp/prepped_data/dygiepp_formatted_data_SciERC.jsonl scierc
```
#### ACE05

```
# Create the file for the output 
vim ../knowledge-graph/data/first_manuscript_data/dygiepp/prepped_data/dygiepp_formatted_data_ACE05.jsonl

python scripts/new-dataset/format_new_dataset.py ../knowledge-graph/data/first_manuscript_data/clustering_pipeline_output/JA+GA_chosen_abstracts/ ../knowledge-graph/data/first_manuscript_data/dygiepp/prepped_data/dygiepp_formatted_data_ACE05.jsonl ace05
```

-------------------
## 1. Making predictions with pre-trained models
----------------------------
From the docs:<br>
To make predictions on a new, unlabeled dataset:

1. Download the pretrained model that most closely matches your text domain.
2. Make sure that the dataset field for your new dataset matches the label namespaces for the pretrained model. See here for more on label namespaces. To view the available label namespaces for a pretrained model, use print_label_namespaces.py.
3. Make predictions the same way as with the existing datasets:
```
allennlp predict pretrained/[name-of-pretrained-model].tar.gz \
    [input-path] \
    --predictor dygie \
    --include-package dygie \
    --use-dataset-reader \
    --output-file [output-path] \
    --cuda-device [cuda-device]
```
A couple tricks to make things run smoothly:

1. If you're predicting on a big dataset, you probably want to load it lazily rather than loading the whole thing in before predicting. To accomplish this, add the following flag to the above command:

```
--overrides "{'dataset_reader' +: {'lazy': true}}"
```

2. If the model runs out of GPU memory on a given prediction, it will warn you and continue with the next example rather than stopping entirely. This is less annoying than the alternative. Examples for which predictions failed will still be written to the specified `jsonl` output, but they will have an additional field `{"_FAILED_PREDICTION": true}` indicating that the model ran out of memory on this example.
3. The `dataset` field in the dataset to be predicted must match one of the datasets on which the model was trained; otherwise, the model won't know which labels to apply to the predicted data.
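
Because the same command is repeated for each model with only the paths changing, it can help to assemble it programmatically. The sketch below only builds the argument list from the flags quoted above and does not run anything; the helper name `build_predict_command` and the placeholder paths `input.jsonl` and `predictions.jsonl` are hypothetical:

```python
import shlex


def build_predict_command(model_archive, input_path, output_path,
                          cuda_device=0, lazy=True):
    """Assemble the `allennlp predict` argument list shown above."""
    cmd = [
        "allennlp", "predict", model_archive, input_path,
        "--predictor", "dygie",
        "--include-package", "dygie",
        "--use-dataset-reader",
        "--output-file", output_path,
        "--cuda-device", str(cuda_device),
    ]
    if lazy:
        # Lazy loading override quoted from the docs above
        cmd += ["--overrides", "{'dataset_reader' +: {'lazy': true}}"]
    return cmd


cmd = build_predict_command("pretrained/scierc.tar.gz",
                            "input.jsonl", "predictions.jsonl")
# Print a shell-quoted version for pasting into a job script
print(shlex.join(cmd))
```

Keeping the command as a list avoids quoting problems with the `--overrides` string; the list could be handed to `subprocess.run(cmd, check=True)` if running from Python is preferred over a job script.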

These models were run using the job scripts below, which are found in `knowledge-graph/job_scripts/`. The pretrained models were downloaded by running `bash scripts/pretrained/get_dygiepp_pretrained.sh` from the `dygiepp/` directory, which puts them in a directory called `dygiepp/pretrained/`. Note that this download takes a long time.

### Full models (with coreference)
#### SciERC

```
#!/bin/bash --login
######################### Resources #################################

#SBATCH --time=03:59:59 
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gpus=1
#SBATCH --job-name scierc_apply


######################## Command Lines for Job ########################

module load CUDA/9.2.88

conda activate kg
cd ~/Shiu_lab/dygiepp

allennlp predict pretrained/scierc.tar.gz ../knowledge-graph/data/first_manuscript_data/dygiepp/prepped_data/dygiepp_formatted_data_SciERC.jsonl --predictor dygie --include-package dygie --use-dataset-reader --output-file ../knowledge-graph/data/first_manuscript_data/dygiepp/pretrained_output/SciERC_predictions.jsonl --cuda-device 0 --overrides "{'dataset_reader' +: {'lazy': true}}"                  
```

#### ACE05


```
#!/bin/bash --login
######################### Resources #################################

#SBATCH --time=03:59:59 
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gpus=1
#SBATCH --job-name ace05_apply


######################## Command Lines for Job ########################

module load CUDA/9.2.88

conda activate kg
cd ~/Shiu_lab/dygiepp

allennlp predict pretrained/ace05-relations.tar.gz ../knowledge-graph/data/first_manuscript_data/dygiepp/prepped_data/dygiepp_formatted_data_ACE05.jsonl --predictor dygie --include-package dygie --use-dataset-reader --output-file ../knowledge-graph/data/first_manuscript_data/dygiepp/pretrained_output/ACE05_predictions.jsonl --cuda-device 0 --overrides "{'dataset_reader' +: {'lazy': true}}"                  
```

**NOTE:** The path where the data are stored has changed since this was run: the data were moved to a new directory called `withCoref` within the `dygiepp/pretrained_output` directory.

-------------------------------------------------
### Lightweight models (without coreference)

SciERC has a lightweight model in which coreferences are ignored. This is necessary for the manuscript, because my model trained from scratch won't have coreferences. The available ACE05 model seems to have no coreferences either, and so should be in this section, but I am waiting for confirmation from Dave before changing where the prediction data & graphs are stored.

#### SciERC

```
#!/bin/bash --login
######################### Resources #################################

#SBATCH --time=03:59:59 
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gpus=1
#SBATCH --job-name scierc_apply


######################## Command Lines for Job ########################

module load CUDA/9.2.88

conda activate kg
cd ~/Shiu_lab/dygiepp

allennlp predict pretrained/scierc-lightweight.tar.gz ../knowledge-graph/data/first_manuscript_data/dygiepp/prepped_data/dygiepp_formatted_data_SciERC.jsonl --predictor dygie --include-package dygie --use-dataset-reader --output-file ../knowledge-graph/data/first_manuscript_data/dygiepp/pretrained_output/lightweight/SciERC_predictions.jsonl --cuda-device 0 --overrides "{'dataset_reader' +: {'lazy': true}}"                  
```


--------------
## 2. Looking at the predictions 
----------------------


To start, let's look at how many docs have failed or missing predictions, compared to the total number of docs in the prepared data.

In [52]:
import jsonlines

In [53]:
# Count prepped data
scierc_input = 0
with jsonlines.open('../data/first_manuscript_data/dygiepp/prepped_data/dygiepp_formatted_data_SciERC.jsonl') as reader:
    for obj in reader:
        scierc_input += 1
        
ace05_input = 0
with jsonlines.open('../data/first_manuscript_data/dygiepp/prepped_data/dygiepp_formatted_data_ACE05.jsonl') as reader:
    for obj in reader:
        ace05_input += 1
        
print('Number of input docs for each model:')
print('-------------------------------------')
print(f'SciERC: {scierc_input}   ACE05: {ace05_input}')

Number of input docs for each model:
-------------------------------------
SciERC: 8000   ACE05: 8000


In [54]:
# Check for failed docs
scierc_output_total = 0
scierc_failed = 0
with jsonlines.open('../data/first_manuscript_data/dygiepp/pretrained_output/SciERC_predictions.jsonl') as reader:
    for obj in reader:
        scierc_output_total += 1
        if "_FAILED_PREDICTION" in obj.keys():
            if obj["_FAILED_PREDICTION"]:
                scierc_failed += 1

In [55]:
ace05_output_total = 0
ace05_failed = 0
with jsonlines.open('../data/first_manuscript_data/dygiepp/pretrained_output/ACE05_predictions.jsonl') as reader:
    for obj in reader:
        ace05_output_total += 1
        if "_FAILED_PREDICTION" in obj.keys():
            if obj["_FAILED_PREDICTION"]:
                ace05_failed += 1

In [56]:
print('Number of output docs for each model:')
print('--------------------------------------')
print(f'SciERC: {scierc_output_total}      ACE05: {ace05_output_total}')
print(f'\nNumber of failed predictions for SciERC: {scierc_failed}')
print(f'Number of failed predictions for ACE05: {ace05_failed}')

Number of output docs for each model:
--------------------------------------
SciERC: 8000      ACE05: 8000

Number of failed predictions for SciERC: 0
Number of failed predictions for ACE05: 0
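
The two counting cells above follow the same pattern, so they could be folded into one helper. This is a sketch only, assuming each record is a dict and that failed predictions carry `"_FAILED_PREDICTION": true` as described in the docs; the name `summarize_predictions` and the toy records are hypothetical:

```python
def summarize_predictions(records):
    """Return (total, failed) counts for an iterable of prediction dicts."""
    total = 0
    failed = 0
    for record in records:
        total += 1
        # .get() covers both a missing key and an explicit false value
        if record.get("_FAILED_PREDICTION"):
            failed += 1
    return total, failed


# Toy records standing in for lines of a predictions file
toy_records = [
    {"doc_key": "a"},
    {"doc_key": "b", "_FAILED_PREDICTION": True},
    {"doc_key": "c", "_FAILED_PREDICTION": False},
]
print(summarize_predictions(toy_records))
```

Since the function only iterates, the `reader` object from `jsonlines.open(...)` used in the cells above could be passed in directly.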


Let's explore the SciERC output further.

In [57]:
# Read in the data 
scierc_output = []
with jsonlines.open('../data/first_manuscript_data/dygiepp/pretrained_output/SciERC_predictions.jsonl') as reader:
    for obj in reader:
        scierc_output.append(obj)

In [58]:
# Look at what keys are in the output dicts (collected across all docs)
output_keys = {key for obj in scierc_output for key in obj.keys()}
print(output_keys)

{'sentences', 'predicted_relations', 'predicted_ner', 'dataset', 'doc_key', 'predicted_clusters'}


In [59]:
# Look at an example doc 

# Reconstruct the abstract from the sentences entry in the dict
abstract = ''
for sentence in scierc_output[0]['sentences']:
    sentence = ' '.join(sentence) # Adds spaces before punctuation as well
    abstract += f'{sentence} '
    
print(abstract)

A rapid , simple , and stringent protocol for the detection and quantitation of jasmonic acid ( JA ) is designed using high - performance thin - layer chromatography . Acidified culture filtrate of Lasiodiplodia theobromae is extracted with an equal volume of ethyl acetate and spotted on silica gel 60 F(254 ) foil using Linomat-5 spray - on applicator . Standard JA is also spotted either internally or adjacent to the sample , and the foils are developed with isopropanol - ammonia - water [ 10:1:1 ( v / v ) ] as the mobile phase . A quantitative estimation of the separated JA is performed by measuring the absorbance at 295 nm in the reflective mode . The sensitivity of the method is improved by adding internal standard to obtain a detection limit of 1 microg . The limit of quantitation is found to be 80 microg with this method . The method is shown to have selectivity , accuracy , precision , and high sample throughput , making it useful for the routine analysis of JA in basic science a

In [62]:
# Look at predicted entities
tokenized_doc = []
for sentence in scierc_output[0]['sentences']:
    for word in sentence:
        tokenized_doc.append(word)

entity_list = []
for sent_ent_list in scierc_output[0]['predicted_ner']:
    entities = []
    for ent_list in sent_ent_list:
        ent = " ".join(tokenized_doc[ent_list[0]:ent_list[1]+1])
        entities.append(ent)
    entity_list += entities
    
print(entity_list)

['protocol', 'JA', 'Lasiodiplodia theobromae', 'ethyl acetate', 'Linomat-5 spray - on applicator', 'JA', 'foils', 'isopropanol -', '-', 'mobile phase', 'quantitative estimation of the separated JA', 'reflective mode', 'sensitivity', 'method', 'internal standard', 'detection limit', 'limit of quantitation', 'method', 'method', 'selectivity', 'accuracy', 'precision', 'sample throughput', 'it', 'JA']


In [64]:
# Look at predicted relations
relations_list = []
for sent_rel_list in scierc_output[0]['predicted_relations']:
    rels = []
    for rel_list in sent_rel_list:
        rel = (" ".join(tokenized_doc[rel_list[0]:rel_list[1]+1]), rel_list[4],
               " ".join(tokenized_doc[rel_list[2]:rel_list[3]+1]))
        rels.append(rel)
    relations_list += rels

print(relations_list)

[('high - performance thin - layer chromatography', 'USED-FOR', 'protocol'), ('sensitivity', 'EVALUATE-FOR', 'method'), ('internal standard', 'USED-FOR', 'method'), ('selectivity', 'EVALUATE-FOR', 'method'), ('selectivity', 'CONJUNCTION', 'accuracy'), ('selectivity', 'CONJUNCTION', 'precision'), ('accuracy', 'EVALUATE-FOR', 'method'), ('precision', 'EVALUATE-FOR', 'method')]
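
The entity and relation cells above share the same span-decoding logic: token indices are document-level and inclusive, so each span is `tokens[start:end + 1]`. A compact sketch of that logic on a toy document follows; the function name `decode_doc` and the toy data are hypothetical, and only the index positions used in the cells above are read (real entries may carry additional fields after them):

```python
def decode_doc(doc):
    """Turn index-based NER and relation predictions into text."""
    # Flatten sentences so document-level token indices can be used directly
    tokens = [tok for sent in doc["sentences"] for tok in sent]

    def span_text(start, end):
        return " ".join(tokens[start:end + 1])  # end index is inclusive

    entities = [
        span_text(ent[0], ent[1])
        for sent_ents in doc["predicted_ner"]
        for ent in sent_ents
    ]
    relations = [
        (span_text(rel[0], rel[1]), rel[4], span_text(rel[2], rel[3]))
        for sent_rels in doc["predicted_relations"]
        for rel in sent_rels
    ]
    return entities, relations


# Toy document in the same shape as the prediction output (hypothetical content)
toy_doc = {
    "sentences": [["Internal", "standard", "improves", "the", "method", "."],
                  ["It", "is", "fast", "."]],
    "predicted_ner": [[[0, 1], [4, 4]], [[6, 6]]],
    "predicted_relations": [[[0, 1, 4, 4, "USED-FOR"]], []],
}
entities, relations = decode_doc(toy_doc)
print(entities)
print(relations)
```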


Now let's look at the ACE05 data for the same abstract:

In [65]:
# Read in the data 
ace05_output = []
with jsonlines.open('../data/first_manuscript_data/dygiepp/pretrained_output/ACE05_predictions.jsonl') as reader:
    for obj in reader:
        ace05_output.append(obj)

In [66]:
# Look at what keys are in the output dicts (collected across all docs)
output_keys_ace05 = {key for obj in ace05_output for key in obj.keys()}
print(output_keys_ace05)

{'sentences', 'predicted_relations', 'predicted_ner', 'dataset', 'doc_key'}
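
The two key sets printed for SciERC and ACE05 differ only in `predicted_clusters`. As a quick check across a whole prediction file, one could test for the presence of that key. This is a sketch under the assumption that the presence of `predicted_clusters` is a reasonable proxy for a model having produced coreference output; that assumption is not confirmed by the DyGIE++ docs quoted here, and the name `has_coref_clusters` and the toy records are hypothetical:

```python
def has_coref_clusters(records):
    """Return True if any prediction dict has a `predicted_clusters` key."""
    return any("predicted_clusters" in record for record in records)


# Toy records mirroring the two key sets printed in this notebook
toy_with_coref = [{"doc_key": "a", "predicted_ner": [], "predicted_clusters": []}]
toy_without_coref = [{"doc_key": "a", "predicted_ner": []}]
print(has_coref_clusters(toy_with_coref), has_coref_clusters(toy_without_coref))
```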


In [67]:
# Look at an example doc 

# Reconstruct the abstract from the sentences entry in the dict
abstract_ace05 = ''
for sentence in ace05_output[0]['sentences']:
    sentence = ' '.join(sentence) # Adds spaces before punctuation as well
    abstract_ace05 += f'{sentence} '
    
print(abstract_ace05)

A rapid , simple , and stringent protocol for the detection and quantitation of jasmonic acid ( JA ) is designed using high - performance thin - layer chromatography . Acidified culture filtrate of Lasiodiplodia theobromae is extracted with an equal volume of ethyl acetate and spotted on silica gel 60 F(254 ) foil using Linomat-5 spray - on applicator . Standard JA is also spotted either internally or adjacent to the sample , and the foils are developed with isopropanol - ammonia - water [ 10:1:1 ( v / v ) ] as the mobile phase . A quantitative estimation of the separated JA is performed by measuring the absorbance at 295 nm in the reflective mode . The sensitivity of the method is improved by adding internal standard to obtain a detection limit of 1 microg . The limit of quantitation is found to be 80 microg with this method . The method is shown to have selectivity , accuracy , precision , and high sample throughput , making it useful for the routine analysis of JA in basic science a

In [68]:
# Look at predicted entities
tokenized_doc_ace05 = []
for sentence in ace05_output[0]['sentences']:
    for word in sentence:
        tokenized_doc_ace05.append(word)

entity_list_ace05 = []
for sent_ent_list in ace05_output[0]['predicted_ner']:
    entities = []
    for ent_list in sent_ent_list:
        ent = " ".join(tokenized_doc_ace05[ent_list[0]:ent_list[1]+1])
        entities.append(ent)
    entity_list_ace05 += entities
    
print(entity_list_ace05)

['industries']


In [71]:
# Look at predicted relations
relations_list_ace05 = []
for sent_rel_list in ace05_output[0]['predicted_relations']:
    rels = []
    for rel_list in sent_rel_list:
        rel = (" ".join(tokenized_doc_ace05[rel_list[0]:rel_list[1]+1]), rel_list[4], 
               " ".join(tokenized_doc_ace05[rel_list[2]:rel_list[3]+1]))
        rels.append(rel)
    relations_list_ace05 += rels

print(relations_list_ace05)

[]
