# DyGIE++: NER, RE, and EE with Coreference Resolution
--------------------------------------
[DyGIE++ paper](https://arxiv.org/pdf/1909.03546.pdf) <br>
[DyGIE++ GitHub](https://github.com/dwadden/dygiepp)

In this notebook, I implement the first three steps of knowledge graph (KG) construction: named entity recognition (NER), relation extraction (RE), and event extraction (EE), using the DyGIE++ model pre-trained on the [GENIA corpus](https://wayback.archive-it.org/org-350/20200626194727/https://orbit.nlm.nih.gov/browse-repository/dataset/human-annotated/83-genia-corpus). DyGIE++ performs these tasks jointly and uses coreference resolution to improve the model's performance.

## 0. Formatting unlabeled data
In order to apply a pre-trained model to unlabeled data, some formatting requirements must be met. From the [docs](https://github.com/dwadden/dygiepp/blob/master/doc/data.md): 
* "In the case where your unlabeled data are stored as a directory of `.txt` files (one file per document), you can run `python scripts/data/new-dataset/format_new_dataset.py [input-directory] [output-file]` to format the documents into a `jsonl` file, with one line per document. If your dataset is scientific text, add the `--use-scispacy` flag to have SciSpacy do the tokenization."
    * This is the most straightforward way to format new data. My abstracts are downloaded from PubMed as a single `.txt` file, and I have already written a (rough) abstract extractor, so this is fairly simple.
* "If you'd like to use a pretrained DyGIE++ model to make predictions on a new dataset, the `dataset` field in your new dataset must match the `dataset` that the original model was trained on; this indicates to the model which label namespace it should use for predictions."
    * In this case, the `dataset` field for the unlabeled data should be `genia` (lowercase, matching the value passed to the formatting script below).
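
Since a mismatched `dataset` field is an easy mistake to make, here is a minimal sketch of a sanity check that could be run on the formatted `jsonl` before prediction. The function name `check_dataset_field` and the two example documents are made up for illustration; the `doc_key`, `dataset`, and `sentences` field names are the ones described in the DyGIE++ data docs.

```python
import json

def check_dataset_field(jsonl_lines, expected="genia"):
    """Return the doc_key (or line index) of every document whose
    `dataset` field differs from `expected`."""
    mismatches = []
    for i, line in enumerate(jsonl_lines):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        if doc.get("dataset") != expected:
            # Fall back to the line index if there is no doc_key
            mismatches.append(doc.get("doc_key", i))
    return mismatches

# Two illustrative (made-up) documents, one with the wrong dataset name
example = [
    json.dumps({"doc_key": "abstract0", "dataset": "genia",
                "sentences": [["JMT", "is", "induced", "."]]}),
    json.dumps({"doc_key": "abstract1", "dataset": "scierc",
                "sentences": [["JA", "signals", "."]]}),
]
print(check_dataset_field(example))
```

For a real file, the open file handle can be passed directly, e.g. `with open(path) as f: check_dataset_field(f)`.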

### Make a directory of `.txt` files

First, define a function to get abstracts from a PubMed search `.txt` results file (from NLP class project methods):

In [5]:
import re
from collections import defaultdict

In [6]:
def separateAbstracts(data_path, abstract_num):
    """
    Function to read a .txt file downloaded from PubMed and separate the text of the abstract from its metadata.
    
    parameters:
        data_path, str: path to a .txt file with abstracts downloaded from PubMed
        abstract_num, int: the number of abstracts in the file (from PubMed search interface)
        
    returns: 
        abstracts, list of str: list of the abstract plain text for all abstracts in data_path
    """
    abstract_start_chars = [f'{x+1}. ' for x in range(abstract_num)]
    # Raw string with an escaped period, so only a literal "N. " prefix matches
    abstract_start_re = r'\d+\. '
    
    abstract_text = []
    with open(data_path) as f:
        
        # Set up housekeeping variables
        started_newline_count = False
        newlines = 0
        start_recording_abstract = False
        current_abstract = ''
        
        # Iterate through lines in the file 
        for line in f:
            # print(line)
            ###################################
            # 1. Find start of abstract section
            ###################################
            
            if not started_newline_count and not start_recording_abstract: 
                # print('Looking for a start line...')
                
                # See if there's a number followed by a period in the line
                match = re.search(abstract_start_re, line)
                
                if match is not None:
                    
                    # Check if it's the first thing in the line
                    if match.start() == 0:
                    
                        # Check if it's in the list of abstract start characters
                        if line[match.start():match.end()] in abstract_start_chars:
                            
                            # print('Found a start line!')
                            # print(f'This line begins with {line[match.start():match.end()]}')
                            started_newline_count = True 
                            
                        
            ########################################
            # Count newlines until start of abstract
            ########################################
            
            elif started_newline_count:
                # print('Looking for the start of abstract text')
                
                # Check if this line is a newline 
                if line == '\n':
                    newlines += 1
                    # print('This is a new line!')
                    # print(f'Number of newlines including this one = {newlines}')
                    
                    if newlines == 4:
                        # print(f'Found the start of an abstract! Begins with {line}')
                        
                        # If that was the fourth newline, indicate the next line starts the abstract
                        start_recording_abstract = True 
                        
                        # Reset the newlines counter variables
                        started_newline_count = False
                        newlines = 0
                        
                # Non-blank lines (title, authors, affiliations) leave the count unchanged

            #################
            # Record abstract
            #################
            
            elif start_recording_abstract:
                
                if line != '\n':
                    
                    # Add this line to the current abstract 
                    current_abstract += line
                    
                elif line == '\n':
                    
                    # Indicate that the abstract is over 
                    start_recording_abstract = False
                    
                    # Put the abstract in abstract list 
                    abstract_text.append(current_abstract)
                    
                    # Overwrite current_abstract
                    current_abstract = ''
                    
                    
    return abstract_text
                    

Choose abstracts to use. Fifty abstracts were selected from the PubMed search results for "jasmonic acid arabidopsis". Papers were manually selected as being "molecular" if the title contained gene, protein, or pathway names, or keywords like "pathway", "signalling", and "crosstalk".

In [8]:
import os
import random

In [3]:
abstract_num = 50
data_path    = '../data/jasmonic_molec_abstract_50.txt'
data_path    = os.path.abspath(data_path)

Read in the file and extract abstracts:

In [9]:
abstract_texts = separateAbstracts(data_path, abstract_num)

print('Example abstract:')
print('-------------------------------------------------')
# random.choice avoids the off-by-one in randint (inclusive upper bound)
print(random.choice(abstract_texts))

Example abstract:
-------------------------------------------------
Methyl jasmonate is a plant volatile that acts as an important cellular 
regulator mediating diverse developmental processes and defense responses. We 
have cloned the novel gene JMT encoding an S-adenosyl-l-methionine:jasmonic acid 
carboxyl methyltransferase (JMT) from Arabidopsis thaliana. Recombinant JMT 
protein expressed in Escherichia coli catalyzed the formation of methyl 
jasmonate from jasmonic acid with K(m) value of 38.5 microM. JMT RNA was not 
detected in young seedlings but was detected in rosettes, cauline leaves, and 
developing flowers. In addition, expression of the gene was induced both locally 
and systemically by wounding or methyl jasmonate treatment. This result suggests 
that JMT can perceive and respond to local and systemic signals generated by 
external stimuli, and that the signals may include methyl jasmonate itself. 
Transgenic Arabidopsis overexpressing JMT had a 3-fold elevated level of

Check for extraction exceptions (see explanation in NLP project methods development notebook):

In [10]:
# Drop any texts that match the author info regex
author_info_re = 'Author information:'

abstract_texts_clean = [x for x in abstract_texts if re.match(author_info_re, x) is None]

print(f'{len(abstract_texts) - len(abstract_texts_clean)} abstracts were lost to foreign language formatting edge case')

1 abstracts were lost to foreign language formatting edge case


Something weird happened in abstract 0; I noticed it when randomly spot-checking the output files after my first pass at this implementation. The full PubMed entry for the anomaly:
```
1. Plant Cell Physiol. 2018 Jan 1;59(1):8-16. doi: 10.1093/pcp/pcx181.

Salicylic Acid and Jasmonic Acid Pathways are Activated in Spatially Different 
Domains Around the Infection Site During Effector-Triggered Immunity in 
Arabidopsis thaliana.

Betsuyaku S(1), Katou S(2), Takebayashi Y(3), Sakakibara H(3), Nomura N(1), 
Fukuda H(4).

Author information:
(1)Faculty of Life and Environmental Sciences, University of Tsukuba, 1-1-1 
Tennodai, Tsukuba, Ibarakim 305-8577 Japan.
(2)Institute of Agriculture, Academic Assembly, Shinshu University, 8304, 
Minamiminowa, Nagano, 399-4598 Japan.
(3)Plant Productivity Systems Research Group, RIKEN Center for Sustainable 
Resource Science, 1-7-22, Suehiro, Tsurumi-ku, Yokohama, 230-0045 Japan.
(4)Department of Biological Sciences, Graduate School of Science, The University 
of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan.

Erratum in
    Plant Cell Physiol. 2018 Feb 1;59(2):439.

Comment in
    Plant Cell Physiol. 2018 Jan 1;59(1):3-4.

The innate immune response is, in the first place, elicited at the site of 
infection. Thus, the host response can be different among the infected cells and 
the cells surrounding them. Effector-triggered immunity (ETI), a form of innate 
immunity in plants, is triggered by specific recognition between pathogen 
effectors and their corresponding plant cytosolic immune receptors, resulting in 
rapid localized cell death known as hypersensitive response (HR). HR cell death 
is usually limited to a few cells at the infection site, and is surrounded by a 
few layers of cells massively expressing defense genes such as 
Pathogenesis-Related Gene 1 (PR1). This virtually concentric pattern of the 
cellular responses in ETI is proposed to be regulated by a concentration 
gradient of salicylic acid (SA), a phytohormone accumulated around the infection 
site. Recent studies demonstrated that jasmonic acid (JA), another phytohormone 
known to be mutually antagonistic to SA in many cases, is also accumulated in 
and required for ETI, suggesting that ETI is a unique case. However, the 
molecular basis for this uniqueness remained largely to be solved. Here, we 
found that, using intravital time-lapse imaging, the JA signaling pathway is 
activated in the cells surrounding the central SA-active cells around the 
infection sites in Arabidopsis thaliana. This distinct spatial organization 
explains how these two phythormone pathways in a mutually antagonistic 
relationship can be activated simultaneously during ETI. Our results 
re-emphasize that the spatial consideration is a key strategy to gain 
mechanistic insights into the apparently complex signaling cross-talk in 
immunity.

© The Author 2017. Published by Oxford University Press on behalf of Japanese 
Society of Plant Physiologists.

DOI: 10.1093/pcp/pcx181
PMCID: PMC6012717
PMID: 29177423 [Indexed for MEDLINE]
```

The "Erratum in" and "Comment in" sections added extra blank lines, so the "Erratum in" section was counted as the abstract because it came after the 4th newline. At the moment I'm not too concerned about losing a few abstracts here and there to this kind of edge case, because this solution is more convenient and lighter on memory in the short term. In the long run, it is probably better to pull XML files with the script I've previously written and get the abstract from the structured text. To deal with edge cases here, however, I can just flag abstracts that are only a few lines long OR start with "Author information:" (the author information section can run longer than that), so I don't have to check all of them individually.

Check for further edge cases:

In [17]:
problem_abstracts = [x for x in abstract_texts_clean if x.count('\n') < 4]

print(f'There are {len(problem_abstracts)} problem abstracts. These are:')
for i in range(len(problem_abstracts)):
    print('----------------------------------------\n')
    print(problem_abstracts[i])
    
problem_indices = [abstract_texts_clean.index(x) for x in problem_abstracts]
print(f'\nThe indices of these in the abstracts list are {problem_indices}')

There are 5 problem abstracts. These are:
----------------------------------------

Erratum in
    Plant Cell Physiol. 2018 Feb 1;59(2):439.

----------------------------------------

Comment in
    Nat Plants. 2018 May;4(5):240.
    Plant Cell. 2018 May;30(5):948-949.

----------------------------------------

© The Author 2016. Published by Oxford University Press on behalf of the Society 
for Experimental Biology.

----------------------------------------

Comment in
    Sci China Life Sci. 2015 Mar;58(3):311-2.

----------------------------------------

Comment in
    New Phytol. 2017 Sep;215(4):1291-1294.


The indices of these in the abstracts list are [0, 8, 11, 25, 48]


In [21]:
# Drop the edge cases
# Filter by position, so duplicate texts cannot confuse the index lookup
abstracts_to_write = [x for i, x in enumerate(abstract_texts_clean) if i not in problem_indices]

print(f'Final number of abstracts is {len(abstracts_to_write)}')

Final number of abstracts is 44


--------------
### TODO: 
Turns out the `PubMed` format has a much more well-defined field for abstract that would circumvent the need for all the above edge case detection and elimination. Go back and write code to utilize that format!
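
As a starting point for this TODO, here is a rough sketch of how the abstract could be pulled from the `PubMed` export format. It assumes the usual tagged layout, where the abstract starts on a line beginning with `AB  - ` and continues on lines indented by six spaces. The function name `abstracts_from_pubmed_format` is hypothetical, and the second record in the sample is made up purely to show record boundaries.

```python
def abstracts_from_pubmed_format(lines):
    """Collect the AB (abstract) field of each record in PubMed-format text.

    Assumes each field starts with a tag such as 'AB  - ' and that
    continuation lines are indented with six spaces.
    """
    abstracts = []
    current = []         # pieces of the abstract currently being assembled
    in_abstract = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("AB  - "):
            in_abstract = True
            current = [line[6:].strip()]
        elif in_abstract and line.startswith("      "):
            # Continuation of the abstract field
            current.append(line.strip())
        elif in_abstract:
            # Any other line (new tag or blank line) ends the abstract
            abstracts.append(" ".join(current))
            in_abstract = False
    if in_abstract:
        # File ended while still inside an abstract
        abstracts.append(" ".join(current))
    return abstracts

sample = """PMID- 29177423
TI  - Salicylic Acid and Jasmonic Acid Pathways are Activated in Spatially
      Different Domains Around the Infection Site.
AB  - The innate immune response is, in the first place, elicited at the site of
      infection. Thus, the host response can be different among the infected
      cells and the cells surrounding them.
CI  - (c) The Author 2017.

TI  - Illustrative second record (not a real entry)
AB  - Methyl jasmonate is a plant volatile.
"""
for abstract in abstracts_from_pubmed_format(sample.splitlines()):
    print(abstract)
```

Because the abstract is a tagged field, "Erratum in" and "Comment in" sections no longer need special handling; records without an `AB` field are simply skipped.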

-----------
Write the abstracts to `.txt` files, one per abstract.

In [23]:
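data_dir = '../data/dygiepp_50_molec/'
data_dir = os.path.abspath(data_dir)

# Create the output directory if it does not exist yet
os.makedirs(data_dir, exist_ok=True)

for i, abstract in enumerate(abstracts_to_write):
    # The docs describe a directory of .txt files, so give each file that extension
    with open(os.path.join(data_dir, f'abstract{i}.txt'), 'w') as f:
        f.write(abstract)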

Use the dygiepp command line to format the documents into a `jsonl` file. In the directory `~/projects/knowledge-graph/dygiepp`, run:
```
python scripts/new-dataset/format_new_dataset.py ../data/dygiepp_50_molec/ ../data/dygiepp_50_molec/50_molec.jsonl genia --use-scispacy
```
**Note:** You must create an empty file with the correct name before running this line, or else nothing will be written anywhere.
* **DONE**

-------------------
## 1. Making predictions with pre-trained GENIA model
*From the docs:*

----------------------------
To make predictions on a new, unlabeled dataset:

1. Download the pretrained model that most closely matches your text domain.
2. Make sure that the dataset field for your new dataset matches the label namespaces for the pretrained model. See here for more on label namespaces. To view the available label namespaces for a pretrained model, use print_label_namespaces.py.
3. Make predictions the same way as with the existing datasets:
```
allennlp predict pretrained/[name-of-pretrained-model].tar.gz \
    [input-path] \
    --predictor dygie \
    --include-package dygie \
    --use-dataset-reader \
    --output-file [output-path] \
    --cuda-device [cuda-device]
```
A couple tricks to make things run smoothly:

1. If you're predicting on a big dataset, you probably want to load it lazily rather than loading the whole thing in before predicting. To accomplish this, add the following flag to the above command:

```
--overrides "{'dataset_reader' +: {'lazy': true}}"
```

2. If the model runs out of GPU memory on a given prediction, it will warn you and continue with the next example rather than stopping entirely. This is less annoying than the alternative. Examples for which predictions failed will still be written to the specified `jsonl` output, but they will have an additional field `{"_FAILED_PREDICTION": true}` indicating that the model ran out of memory on this example.
3. The `dataset` field in the dataset to be predicted must match one of the datasets on which the model was trained; otherwise, the model won't know which labels to apply to the predicted data.
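
To see which documents hit the out-of-memory case described in point 2, the output `jsonl` can be split on the `_FAILED_PREDICTION` field. This is only a sketch: the helper name `split_failed` and the example documents are made up, and the only thing taken from the docs is the `{"_FAILED_PREDICTION": true}` marker.

```python
import json

def split_failed(jsonl_lines):
    """Split prediction output into (successful, failed) lists of documents,
    using the `_FAILED_PREDICTION` marker."""
    ok, failed = [], []
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        doc = json.loads(line)
        if doc.get("_FAILED_PREDICTION"):
            failed.append(doc)
        else:
            ok.append(doc)
    return ok, failed

# Illustrative (made-up) output lines
example_output = [
    json.dumps({"doc_key": "abstract0", "dataset": "genia"}),
    json.dumps({"doc_key": "abstract1", "dataset": "genia",
                "_FAILED_PREDICTION": True}),
]
ok, failed = split_failed(example_output)
print(f"{len(failed)} failed:", [doc["doc_key"] for doc in failed])
```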

Command line to make predictions on the mini-dataset formatted above:

```
allennlp predict ~/Downloads/genia.tar.gz \
~/projects/knowledge-graph/data/dygiepp_50_molec/50_molec.jsonl \
--predictor dygie \
--include-package dygie \
--use-dataset-reader \
--output-file ~/projects/knowledge-graph/data/dygiepp_50_molec/50_molec_predictions_genia.jsonl \
--cuda-device 0
```
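
Once the predictions file exists, the predicted entities can be read back out of it. The sketch below assumes the output layout as I understand it from the DyGIE++ data docs: a `predicted_ner` field with one list per sentence, where each entry is `[start_token, end_token, label, raw_score, softmax_score]` and the token indices are document-level and inclusive. The helper name `predicted_entities`, the example document, and its scores are made up for illustration and are not real model output.

```python
def predicted_entities(doc):
    """Return (entity text, label) pairs from one predicted document.

    Assumes document-level, inclusive token indices in `predicted_ner`.
    """
    # Flatten the sentences so document-level indices can be used directly
    tokens = [tok for sent in doc["sentences"] for tok in sent]
    entities = []
    for sent_ner in doc.get("predicted_ner", []):
        for start, end, label, *_scores in sent_ner:
            entities.append((" ".join(tokens[start:end + 1]), label))
    return entities

# Made-up document in the assumed output format
example_doc = {
    "doc_key": "abstract0",
    "dataset": "genia",
    "sentences": [
        ["Methyl", "jasmonate", "is", "a", "plant", "volatile", "."],
        ["We", "cloned", "the", "gene", "JMT", "."],
    ],
    "predicted_ner": [[], [[11, 11, "DNA", 5.0, 0.99]]],
}
print(predicted_entities(example_doc))
```

For the real file, each line of `50_molec_predictions_genia.jsonl` would be parsed with `json.loads` and passed to the function.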