# DyGIE++: NER, RE, and EE with Coreference Resolution
--------------------------------------
[DyGIE++ paper](https://arxiv.org/pdf/1909.03546.pdf) <br>
[DyGIE++ GitHub](https://github.com/dwadden/dygiepp)

In this notebook, I implement the first three steps of knowledge graph (KG) construction: named entity recognition (NER), relation extraction (RE), and event extraction (EE). I use the DyGIE++ model pre-trained on the [GENIA corpus](https://wayback.archive-it.org/org-350/20200626194727/https://orbit.nlm.nih.gov/browse-repository/dataset/human-annotated/83-genia-corpus). DyGIE++ performs these tasks jointly and uses coreference resolution to improve the model's performance.

## 0. Formatting unlabeled data
In order to apply a pre-trained model to unlabeled data, some formatting requirements must be met. From the [docs](https://github.com/dwadden/dygiepp/blob/master/doc/data.md): 
* "In the case where your unlabeled data are stored as a directory of `.txt` files (one file per document), you can run `python scripts/data/new-dataset/format_new_dataset.py [input-directory] [output-file]` to format the documents into a `jsonl` file, with one line per document. If your dataset is scientific text, add the `--use-scispacy` flag to have SciSpacy do the tokenization."
    * This is the most straightforward way to format new data. Since I am using abstracts downloaded from PubMed as a single `.txt` file and have already written a (rough) abstract extractor, this is fairly simple.
* "If you'd like to use a pretrained DyGIE++ model to make predictions on a new dataset, the `dataset` field in your new dataset must match the `dataset` that the original model was trained on; this indicates to the model which label namespace it should use for predictions."
    * In this case, the `dataset` field for the unlabeled data should be `GENIA`
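
To make the `dataset` requirement concrete, here is a minimal sketch of a check that could be run on the formatted `jsonl` lines before prediction. Only the `dataset` field is named in the docs excerpt above; the `doc_key` and `sentences` fields, the example documents, and the helper name `find_dataset_mismatches` are assumptions made for illustration.

```python
import json


def find_dataset_mismatches(jsonl_lines, expected_dataset):
    """Return an identifier for every document whose `dataset` field differs.

    The identifier is the document's `doc_key` if present (assumed field name),
    otherwise the line index.
    """
    mismatches = []
    for i, line in enumerate(jsonl_lines):
        line = line.strip()
        if not line:
            continue  # Skip empty lines
        doc = json.loads(line)
        if doc.get('dataset') != expected_dataset:
            mismatches.append(doc.get('doc_key', i))
    return mismatches


# Hypothetical formatted documents, one JSON object per line
example_lines = [
    json.dumps({'doc_key': 'abstract0', 'dataset': 'GENIA',
                'sentences': [['JMT', 'encodes', 'a', 'methyltransferase', '.']]}),
    json.dumps({'doc_key': 'abstract1', 'dataset': 'other',
                'sentences': [['Jasmonate', 'signalling', '.']]}),
]

mismatched = find_dataset_mismatches(example_lines, 'GENIA')
print(mismatched)
```

An empty result would mean every document carries the expected label namespace; the expected value is passed in as a parameter so it can be matched to whatever the pre-trained model actually uses.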

### Make a directory of `.txt` files

First, define a function to get abstracts from a PubMed search `.txt` results file (from NLP class project methods):

In [5]:
import re
from collections import defaultdict

In [6]:
def separateAbstracts(data_path, abstract_num):
    """
    Function to read a .txt file downloaded from PubMed and separate the text of the abstract from its metadata.
    
    parameters:
        data_path, str: path to a .txt file with abstracts downloaded from PubMed
        abstract_num, int: the number of abstracts in the file (from PubMed search interface)
        
    returns: 
        abstract_text, list of str: list of the abstract plain text for all abstracts in data_path
    """
    abstract_start_chars = [f'{x+1}. ' for x in range(abstract_num)]
    # Raw string with an escaped period: one or more digits, a literal '.', then a space
    abstract_start_re = r'\d+\. '
    
    abstract_text = []
    with open(data_path) as f:
        
        # Set up housekeeping variables
        started_newline_count = False
        newlines = 0
        start_recording_abstract = False
        current_abstract = ''
        
        # Iterate through lines in the file 
        for line in f:
            # print(line)
            ###################################
            # 1. Find start of abstract section
            ###################################
            
            if not started_newline_count and not start_recording_abstract: 
                # print('Looking for a start line...')
                
                # See if there's a number followed by a period in the line
                match = re.search(abstract_start_re, line)
                
                if match is not None:
                    
                    # Check if it's the first thing in the line
                    if match.start() == 0:
                    
                        # Check if it's in the list of abstract start characters
                        if line[match.start():match.end()] in abstract_start_chars:
                            
                            # print('Found a start line!')
                            # print(f'This line begins with {line[match.start():match.end()]}')
                            started_newline_count = True 
                            
                        
            ########################################
            # Count newlines until start of abstract
            ########################################
            
            elif started_newline_count:
                # print('Looking for the start of abstract text')
                
                # Check if this line is a newline 
                if line == '\n':
                    newlines += 1
                    # print('This is a new line!')
                    # print(f'Number of newlines including this one = {newlines}')
                    
                    if newlines == 4:
                        # print(f'Found the start of an abstract! Begins with {line}')
                        
                        # If that was the fourth newline, indicate the next line starts the abstract
                        start_recording_abstract = True 
                        
                        # Reset the newlines counter variables
                        started_newline_count = False
                        newlines = 0
                        

            #################
            # Record abstract
            #################
            
            elif start_recording_abstract:
                
                if line != '\n':
                    
                    # Add this line to the current abstract 
                    current_abstract += line
                    
                elif line == '\n':
                    
                    # Indicate that the abstract is over 
                    start_recording_abstract = False
                    
                    # Put the abstract in abstract list 
                    abstract_text.append(current_abstract)
                    
                    # Overwrite current_abstract
                    current_abstract = ''
                    
                    
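    # Keep a final abstract that is not followed by a blank line (e.g. at end of file)
    if start_recording_abstract and current_abstract:
        abstract_text.append(current_abstract)

    return abstract_text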
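
The extraction logic is a three-state machine: look for a numbered start line, count four blank lines, then record text until the next blank line. The condensed sketch below restates that idea on a small synthetic entry. It is not the function above; the name `extract_abstracts`, the `blank_lines_before_abstract` parameter, and the sample entry layout are assumptions made for illustration.

```python
import re


def extract_abstracts(lines, abstract_num, blank_lines_before_abstract=4):
    """Condensed sketch of the seek -> count -> record state machine."""
    start_markers = {f'{i + 1}. ' for i in range(abstract_num)}
    abstracts = []
    state = 'seek'
    blanks = 0
    current = []

    for line in lines:
        if state == 'seek':
            # A start line begins with a known entry number, e.g. '1. '
            match = re.match(r'\d+\. ', line)
            if match and match.group(0) in start_markers:
                state = 'count'
        elif state == 'count':
            # Count blank lines until the abstract is expected to begin
            if line == '\n':
                blanks += 1
                if blanks == blank_lines_before_abstract:
                    state, blanks = 'record', 0
        else:
            # Record lines until the blank line that ends the abstract
            if line == '\n':
                abstracts.append(''.join(current))
                current = []
                state = 'seek'
            else:
                current.append(line)

    # Keep a final abstract that is not followed by a blank line
    if state == 'record' and current:
        abstracts.append(''.join(current))
    return abstracts


# Synthetic entry in an assumed PubMed-like layout
sample = (
    "1. J Example. 2020;1(1):1-2.\n"
    "\n"
    "A hypothetical title.\n"
    "\n"
    "Doe J(1).\n"
    "\n"
    "Author information:\n"
    "(1)Example University.\n"
    "\n"
    "First line of the abstract.\n"
    "Second line of the abstract.\n"
    "\n"
    "PMID: 0000000\n"
)

abstracts = extract_abstracts(sample.splitlines(keepends=True), abstract_num=1)
print(abstracts)
```

Because the abstract is located purely by counting blank lines, any entry with an extra or missing block before the abstract will yield the wrong block of text.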
                    

Choose the abstracts to use. Fifty abstracts were selected from the PubMed search results for "jasmonic acid arabidopsis". Papers were manually classified as "molecular" if they contained gene, protein, or pathway names in the title, or keywords such as "pathway", "signalling", and "crosstalk".

In [8]:
import os
import random

In [3]:
abstract_num = 50
data_path    = '../data/jasmonic_molec_abstract_50.txt'
data_path    = os.path.abspath(data_path)

Read in the file and extract abstracts:

In [9]:
abstract_texts = separateAbstracts(data_path, abstract_num)

print('Example abstract:')
print('-------------------------------------------------')
# random.randint includes both endpoints, so indexing with it can run past the end of the list
print(random.choice(abstract_texts))

Example abstract:
-------------------------------------------------
Methyl jasmonate is a plant volatile that acts as an important cellular 
regulator mediating diverse developmental processes and defense responses. We 
have cloned the novel gene JMT encoding an S-adenosyl-l-methionine:jasmonic acid 
carboxyl methyltransferase (JMT) from Arabidopsis thaliana. Recombinant JMT 
protein expressed in Escherichia coli catalyzed the formation of methyl 
jasmonate from jasmonic acid with K(m) value of 38.5 microM. JMT RNA was not 
detected in young seedlings but was detected in rosettes, cauline leaves, and 
developing flowers. In addition, expression of the gene was induced both locally 
and systemically by wounding or methyl jasmonate treatment. This result suggests 
that JMT can perceive and respond to local and systemic signals generated by 
external stimuli, and that the signals may include methyl jasmonate itself. 
Transgenic Arabidopsis overexpressing JMT had a 3-fold elevated level of

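Check for extraction exceptions. In an edge case caused by foreign-language formatting, the extractor captures the author information block instead of the abstract, so any extracted text that begins with "Author information:" is dropped (see the explanation in the NLP project methods development notebook):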

In [10]:
# Drop any texts that match the author info regex
author_info_re = 'Author information:'

abstract_texts_clean = [x for x in abstract_texts if re.match(author_info_re, x) is None]

print(f'{len(abstract_texts) - len(abstract_texts_clean)} abstracts were lost to foreign language formatting edge case')

1 abstracts were lost to foreign language formatting edge case
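
The filter relies on `re.match` only matching at the beginning of a string, so a text is dropped only when it starts with the author information header. A minimal sketch with two made-up strings shows the difference from `re.search`; the example texts are hypothetical.

```python
import re

author_info_re = r'Author information:'

# Hypothetical extracted texts: the first is a mis-extracted author block,
# the second is a real abstract that merely mentions the phrase mid-text
texts = [
    'Author information:\n(1)Hypothetical Institute of Plant Biology.\n',
    'Jasmonic acid regulates defense. Author information: appears mid-text here.\n',
]

# re.match anchors at the start of the string, so only the first text is dropped
kept = [t for t in texts if re.match(author_info_re, t) is None]

# re.search scans the whole string and would flag both texts
flagged_by_search = [t for t in texts if re.search(author_info_re, t) is not None]

print(len(kept), len(flagged_by_search))
```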


Write the abstracts to `.txt` files, one per abstract.

In [12]:
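data_dir = '../data/dygiepp_50_molec/'
data_dir = os.path.abspath(data_dir)

# Create the output directory if it does not already exist
os.makedirs(data_dir, exist_ok=True)

for i, abstract in enumerate(abstract_texts_clean):
    # One .txt file per abstract (one file per document)
    with open(os.path.join(data_dir, f'abstract{i}.txt'), 'w') as f:
        f.write(abstract)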
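
With the directory populated, the formatting command quoted from the docs at the top of this section could be assembled as in the sketch below. The positional arguments and the `--use-scispacy` flag follow that quoted excerpt only; the output path and the helper name `build_format_command` are hypothetical, and the script's own `--help` should be checked, since its arguments may differ between versions.

```python
def build_format_command(input_dir, output_file, use_scispacy=True):
    """Assemble the argument list for the DyGIE++ formatting script.

    Argument order follows the docs excerpt quoted earlier; it is not
    verified against any particular version of the script.
    """
    cmd = [
        'python',
        'scripts/data/new-dataset/format_new_dataset.py',
        input_dir,
        output_file,
    ]
    if use_scispacy:
        # Suggested by the docs for scientific text
        cmd.append('--use-scispacy')
    return cmd


# Hypothetical output path for the formatted dataset
cmd = build_format_command('../data/dygiepp_50_molec', '../data/dygiepp_50_molec.jsonl')
print(' '.join(cmd))

# To execute from the root of a dygiepp checkout, something like:
# import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting issues if the paths contain spaces.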