# Augmenting distant supervision labeled targeted sentiment with coreference
Functionality:
- Load in a sentiment annotated dataset
- Use the texts as basis for coreference annotation
    - Combine texts with same target
- Compute coreference
- For each reference, replace the reference with the root entity $T$
- Store a new dataset with the updated texts

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys

dataset_path = os.path.join(os.getcwd(), "STEP 3 - ABSA-format")

def get_filename(f):
    return "distant_supervision_{}.seg".format(f)

def create_filename(f):
    return "augmented_distant_supervision_{}.seg".format(f)

train = os.path.join(dataset_path, get_filename("train"))
test = os.path.join(dataset_path, get_filename("test"))

out_path = os.path.join(os.getcwd(), "STEP 4 - Augmented ABSA")
train_out = os.path.join(out_path, create_filename("train"))
test_out = os.path.join(out_path, create_filename("test"))

print(train, test)
print(train_out, test_out)

C:\Users\Tollef\Documents\GitHub\masterNEW\REPO\strise\STEP 3 - ABSA-format\distant_supervision_train.seg C:\Users\Tollef\Documents\GitHub\masterNEW\REPO\strise\STEP 3 - ABSA-format\distant_supervision_test.seg
C:\Users\Tollef\Documents\GitHub\masterNEW\REPO\strise\STEP 4 - Augmented ABSA\augmented_distant_supervision_train.seg C:\Users\Tollef\Documents\GitHub\masterNEW\REPO\strise\STEP 4 - Augmented ABSA\augmented_distant_supervision_test.seg


In [3]:
# init neuralcoref class
from tollef_coref import Coref

params = {
    "greed": 0.53,
    "max_dist": 50,
    "max_dist_match": 500
}
coref = Coref(params, spacy_size="md")

Loading spacy model...
Added neuralcoref to pipeline!


In [4]:
txt = "The food is uniformly exceptional , with a very capable kitchen which will proudly whip up whatever you feel like eating , whether it 's on the menu or not ."
coref.add_doc(txt)
coref.doc

The food is uniformly exceptional , with a very capable kitchen which will proudly whip up whatever you feel like eating , whether it 's on the menu or not .

In [5]:
MASK = "$T$"

In [14]:
import re

def clean(t):
    t = re.sub(r"(\n+)(?=[A-Z])", " ", t)  # replace consecutive newlines connected to words
    t = re.sub(r"(\n+)", " ", t)  # replace unconnected newlines
    t = t.replace("-LRB-", "")
    t = t.replace("-RRB-", "")
    
    # finally, make sure it is proper alphanumeric text
    #pattern = re.compile('[\W_]+', re.UNICODE)
    #t = pattern.sub(' ', t)
    return t.strip()  # aviod trailing space or newlines

def valid_referential(token):
    # "both" occurs in many contexts...
    # it is only valid when it is avoiding the following:
    # POS: DET
    # DEPENDENCY: preconj
    POS = token.pos_
    DEP = token.dep_
    valid_determiner = POS != "DEP" and DEP != "preconj"
    
    isvalid = valid_determiner
    
    return isvalid

def print_pos(token):
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, 
          token.shape_, token.is_alpha, token.is_stop)

In [15]:
def write(f, text, tar, sent):
    f.write(text)
    f.write("\n")
    f.write(tar)
    f.write("\n")
    f.write(sent)
    f.write("\n")
    
def parse_file(infile, outfile):
    with open(outfile, 'w', encoding="utf8") as outdata:
        with open(infile, 'r', encoding="utf8") as indata:
            lines = indata.readlines()
            step = 3
            for i in range(0, len(lines), step):
                text = lines[i].strip()
                tar = lines[i+1].strip()
                sent = lines[i+2].strip()
                
                write(outdata, text, tar, sent)

                # format it before processing for coreference
                text = clean(text)
                # write the original cleaned data as well, duplicating it
                # comment this step if it doesn't work.
                # write(outdata, text, tar, sent)

                tar_index = text.index(MASK)
            
                tar_span = (tar_index, tar_index + len(tar))

                # unmask the target $T$ with the actual word before updating with coreference
                text = text.replace(MASK, tar)
                coref.add_doc(text)
                
                if coref.cluster_resolved and coref.clusters():
                    for cluster in coref.clusters():
                        # the real target is found as the root of the cluster
                        if tar in cluster.main.text: 
                            # use this if the root should be included
                            mentions = cluster.mentions  
                            for mention in mentions:
                                valid = False
                                # set a threshold for the mention span length
                                if len(mention) < 10:  
                                    valid = any([valid_referential(ref) for ref in mention])
                                if valid:
                                    start, end = mention.start, mention.end
                                    tokens = coref.tokens()
                                    for _ in range(end-start):
                                        # for each word in the new mention, remove it from the list
                                        tokens.pop(start)
                                    tokens.insert(start, MASK)
                                    # format it back as a string
                                    coref_text = ' '.join(tokens).strip()
                                    if coref_text != lines[i].strip():
                                        USE_MENTION = False
                                        if USE_MENTION:
                                            target = mention.text
                                        else:
                                            target = tar
                                        write(outdata, coref_text, target, sent)
                                        # using e.g. "it" instead of target
                                        # write(outdata, coref_text,  mention.text, sent)  
                                    #added_sents.append([coref_text, tar, sent])


# Iterate, save copies of each

In [16]:
for topicfile in os.listdir(dataset_path):
    topic_path = os.path.join(dataset_path, topicfile)
    print(topic_path)
    save_name = "coref_" + topicfile
    save_path = os.path.join(out_path, save_name)
    print(save_path)
    parse_file(topic_path, save_path)

C:\Users\Tollef\Documents\GitHub\masterNEW\REPO\strise\STEP 3 - ABSA-format\business_distant_test.litesent
C:\Users\Tollef\Documents\GitHub\masterNEW\REPO\strise\STEP 4 - Augmented ABSA\coref_business_distant_test.litesent
C:\Users\Tollef\Documents\GitHub\masterNEW\REPO\strise\STEP 3 - ABSA-format\business_distant_train.litesent
C:\Users\Tollef\Documents\GitHub\masterNEW\REPO\strise\STEP 4 - Augmented ABSA\coref_business_distant_train.litesent
C:\Users\Tollef\Documents\GitHub\masterNEW\REPO\strise\STEP 3 - ABSA-format\distant_all_data.litesent
C:\Users\Tollef\Documents\GitHub\masterNEW\REPO\strise\STEP 4 - Augmented ABSA\coref_distant_all_data.litesent
C:\Users\Tollef\Documents\GitHub\masterNEW\REPO\strise\STEP 3 - ABSA-format\entities.csv_distant_test.litesent
C:\Users\Tollef\Documents\GitHub\masterNEW\REPO\strise\STEP 4 - Augmented ABSA\coref_entities.csv_distant_test.litesent
C:\Users\Tollef\Documents\GitHub\masterNEW\REPO\strise\STEP 3 - ABSA-format\entities.csv_distant_train.lites

In [16]:
parse_file(train, train_out)

In [17]:
parse_file(test, test_out)

# TESTING, remove

In [None]:
testdf = 

In [29]:
def parse_file(infile):
    with open(infile, 'r', encoding="utf8") as indata:
        lines = indata.readlines()
        step = 3
        for i in range(0, len(lines), step):
            text = lines[i].strip()
            tar = lines[i+1].strip()
            sent = lines[i+2].strip()

            # format it before processing for coreference
            text = clean(text)

            tar_index = text.index(MASK)

            tar_span = (tar_index, tar_index + len(tar))

            # unmask the target $T$ with the actual word before updating with coreference
            text = text.replace(MASK, tar)
            coref.add_doc(text)

            if coref.cluster_resolved and coref.clusters():
                for cluster in coref.clusters():
                    # the real target is found as the root of the cluster
                    if tar in cluster.main.text: 
                        # use this if the root should be included
                        mentions = cluster.mentions  
                        for mention in mentions:
                            print(mention)
                            print("-->", mention.text)
                            valid = False
                            # set a threshold for the mention span length
                            if len(mention) < 10:  
                                valid = any([valid_referential(ref) for ref in mention])
                            if valid:
                                start, end = mention.start, mention.end
                                tokens = coref.tokens()
                                for _ in range(end-start):
                                    # for each word in the new mention, remove it from the list
                                    tokens.pop(start)
                                tokens.insert(start, MASK)
                                # format it back as a string
                                coref_text = ' '.join(tokens).strip()
                                if coref_text != lines[i].strip():
                                    USE_MENTION = True
                                    if USE_MENTION:
                                        target = mention.target
                                    else:
                                        target = tar

                                    print("writing:\n{}\n---{}".format(coref_text, target))

In [30]:
parse_file(test)

the corporate leet
--> the corporate leet
writing:
Before his landslide election in July 2018 , Lopez Obrador provoked anxiety among $T$ as he disparagingly called it , dismissing big - business people as traffickers of influence .
---leet
it
--> it
writing:
Before his landslide election in July 2018 , Lopez Obrador provoked anxiety among the corporate leet as he disparagingly called $T$ , dismissing big - business people as traffickers of influence .
---leet
The Ashes series
--> The Ashes series
writing:
The 76-year - old , in a piece for ESPNcricinfo , claimed that although the recently concluded $T$ breathed life into the longer format of the game , it faces serious challenges up ahead .
---The Ashes
it
--> it
writing:
The 76-year - old , in a piece for ESPNcricinfo , claimed that although the recently concluded The Ashes series breathed life into the longer format of the game , $T$ faces serious challenges up ahead .
---The Ashes
a 3-year term-based subscription
--> a 3-year term-b

KeyboardInterrupt: 