# Attribution Relation Extraction Model: 
## Conditional Random Field (CRF) Approach
###  CRF classifier trained with pre-trained word embeddings

In this notebook, we train and evaluate two CRF classifiers using 100-dimensional pre-trained GloVe word embeddings as features.

The first classifier is trained only on the PolNeAR corpus. The second one is trained on both the PolNeAR and PARC3 corpora. 
Note: Nested attributions in PARC3 are not taken into consideration. 

In [2]:
# We start with imports:

import sklearn_crfsuite # the model
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics
from gensim.models import KeyedVectors # to load pre-trained word embeddings
import numpy as np # to create 0 vectors for the words which are not in the vocabulary
import pandas as pd # to load input&output files for evaluation
import csv # to read the data files for training and evaluation

In [3]:
# load the pre-trained word embeddings
glove_dimensions = 100
!python -m gensim.scripts.glove2word2vec --input  glove.6B.100d.txt --output glove.6B.100d.w2vformat.txt
model = KeyedVectors.load_word2vec_format("glove.6B.100d.w2vformat.txt")

2021-06-20 18:40:21,069 - glove2word2vec - INFO - running C:\Users\filiz\anaconda3\lib\site-packages\gensim\scripts\glove2word2vec.py --input glove.6B.100d.txt --output glove.6B.100d.w2vformat.txt
  num_lines, num_dims = glove2word2vec(args.input, args.output)
2021-06-20 18:40:21,111 - keyedvectors - INFO - loading projection weights from glove.6B.100d.txt
2021-06-20 18:41:00,775 - utils - INFO - KeyedVectors lifecycle event {'msg': 'loaded (400000, 100) matrix of type float32 from glove.6B.100d.txt', 'binary': False, 'encoding': 'utf8', 'datetime': '2021-06-20T18:41:00.775074', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'load_word2vec_format'}
2021-06-20 18:41:00,806 - glove2word2vec - INFO - converting 400000 vectors from glove.6B.100d.txt to glove.6B.100d.w2vformat.txt
2021-06-20 18:41:01,181 - keyedvectors - INFO - storing 400000x100 projection weights into glove.6B.100d.w2vfor
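
The conversion step only changes the file header. Below is a minimal stdlib sketch of that transformation, assuming the standard GloVe text layout (one token per line, followed by its space-separated components). `add_word2vec_header` and the toy file names are hypothetical and are not part of gensim; the sketch reads the whole file into memory, which is fine for a toy file but wasteful for the full 400,000-vector file.

```python
import os
import tempfile


def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file and prepend the '<num_vectors> <num_dims>' header."""
    with open(glove_path, encoding="utf-8") as src:
        lines = [line.rstrip("\n") for line in src if line.strip()]
    num_dims = len(lines[0].split(" ")) - 1  # the first field is the token itself
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write("{} {}\n".format(len(lines), num_dims))
        for line in lines:
            dst.write(line + "\n")
    return len(lines), num_dims


# Tiny stand-in for glove.6B.100d.txt with 3-dimensional vectors.
tmp_dir = tempfile.mkdtemp()
toy_glove = os.path.join(tmp_dir, "toy_glove.txt")
toy_w2v = os.path.join(tmp_dir, "toy_glove.w2vformat.txt")
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("the 0.1 0.2 0.3\nsaid 0.4 0.5 0.6\n")

header = add_word2vec_header(toy_glove, toy_w2v)
print(header)
```

In gensim 4.x, `KeyedVectors.load_word2vec_format` also accepts `no_header=True`, which should make it possible to read the GloVe text file directly and skip the conversion; we did not use that option in this notebook.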

### Helper functions:

In [4]:
def extract_sents_from_conll(inputfile):
    '''Read the data from tsv file, return sentences as tokens with corresponding labels.'''
    
    sents = []
    current_sent = []
    # Open the file in a context manager so that it is closed after reading.
    with open(inputfile, encoding="utf-8") as infile:
        rows = csv.reader(infile, delimiter='\t')
        for row in rows:
            current_sent.append(tuple(row))
            # After each sentence there is a special token: Sent_end. Its label is O. It was added in the preprocessing step.
            if row[0] == "Sent_end":
                sents.append(current_sent)
                current_sent = []
    return sents
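
The reader above expects two tab-separated columns (token, label) and a `Sent_end` row closing every sentence. As a hedged illustration of that format, here is a more tolerant variant that skips blank rows, keeps a final sentence that lacks the end marker, and disables CSV quoting so that a literal `"` token is not read as the start of a quoted field. `extract_sents_tolerant` and the toy rows are hypothetical; the experiments in this notebook were run with the function above, not with this sketch.

```python
import csv
import io


def extract_sents_tolerant(lines, end_token="Sent_end"):
    """Group (token, label) rows into sentences, closing each one at end_token."""
    sents, current = [], []
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2:  # skip blank or malformed rows
            continue
        current.append((row[0], row[1]))
        if row[0] == end_token:
            sents.append(current)
            current = []
    if current:  # keep a final sentence that lacks the end marker
        sents.append(current)
    return sents


# In-memory stand-in for a small TSV file: one complete sentence,
# a blank row, and a trailing token without a Sent_end row.
toy_tsv = io.StringIO(
    "He\tB-SOURCE\nsaid\tB-CUE\nyes\tB-CONTENT\nSent_end\tO\n"
    "\n"
    "Fine\tO\n"
)
toy_sents = extract_sents_tolerant(toy_tsv)
print(toy_sents)
```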


In [6]:
#sents = extract_sents_from_conll("Toy_data_train.tsv")

#print(sents[0])
#print(len(sents))

In [7]:
def sent2tokens(sent):
    '''Take the sentence as token-label pair, return only tokens'''

    return [token for token, label in sent]


In [8]:
#test =  sent2tokens(sents[0])

#print(test)

In [9]:
def sent2labels(sent):    
    '''Take the sentence as token-label pair, return only labels'''

    return [label for token, label  in sent]

In [10]:
#test2 = sent2labels(sents[0])

#print(test2)

It is time to extract the features: 

IMPORTANT: CRFsuite does not support array-valued features such as word embeddings. Instead, we pass every vector component as a separate numeric feature.

https://stackoverflow.com/questions/58736548/how-to-use-word-embedding-as-features-for-crf-sklearn-crfsuite-model-training
https://github.com/scrapinghub/python-crfsuite/issues/39
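
To make the workaround concrete, here is a small sketch of how a vector is expanded into scalar features. It uses a plain Python list as a stand-in for an embedding; `vector_to_features` and the toy values are hypothetical names chosen for illustration.

```python
def vector_to_features(vector, prefix="v"):
    """Expand a numeric vector into one scalar feature per component."""
    return {"{}{}".format(prefix, i): float(x) for i, x in enumerate(vector)}


toy_vector = [0.25, -1.5, 0.0]  # stand-in for a 100-dimensional GloVe vector
toy_features = {"bias": 1.0, "token": "said"}
toy_features.update(vector_to_features(toy_vector))
print(toy_features)
```

In the feature dictionary, string values such as `'token'` act as categorical features, while numeric values such as `'v0'` are used as real-valued feature weights.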

In [11]:
### Embedding function 
def get_features(token):
    '''Get token, return word vector'''
    
    token = token.lower()
    try:
        vector = model[token]
    except KeyError:
        # if the word is not in the vocabulary,
        # return an array of zeros with the embedding dimensionality
        vector = np.zeros(glove_dimensions)

    return vector
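
Since every out-of-vocabulary token is mapped to the same all-zero vector, it can be useful to check how often that fallback is triggered. The following is a minimal sketch under the assumption that the vocabulary is available as a set of lowercased strings; `oov_rate` and the toy data are hypothetical and were not part of the experiments.

```python
def oov_rate(tokens, vocabulary):
    """Return the fraction of tokens whose lowercased form is not in vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token.lower() not in vocabulary)
    return missing / len(tokens)


toy_vocabulary = {"the", "senator", "said"}  # stand-in for the GloVe vocabulary
toy_tokens = ["The", "senator", "said", "covfefe"]
toy_rate = oov_rate(toy_tokens, toy_vocabulary)
print(toy_rate)
```

With the loaded embeddings, the `key_to_index` mapping of the gensim 4.x `KeyedVectors` object could presumably be passed as `vocabulary`.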

In [12]:
#vector = get_features("are")
#print(len(vector))
#print(vector)

In [13]:
def token2features(sent, i):
    '''Take a sentence and a token index, return a feature dictionary with a bias term, the lowercased token and the word embedding components.'''
    
    token = sent[i][0]
    wordembedding = get_features(token)  # word embedding vector
    
    features = {
        'bias': 1.0,
        'token': token.lower()
    }
    
    # CRFsuite takes scalar feature values, so each vector component becomes its own feature
    for iv, value in enumerate(wordembedding):
        features['v{}'.format(iv)] = float(value)

    if i == 0:
        features['BOS'] = True
        
    elif i == len(sent) - 1:
        features['EOS'] = True
        
    return features


In [14]:
#features= token2features(sents[0], i=0)

#print(features)
#print(type(features))

In [15]:
def sent2features(sent):
    '''Get sentence as an input, add the features and return as a list of dictionaries.'''
    return [token2features(sent, i) for i in range(len(sent))]

In [16]:
#test3 =sent2features(sents[0])

#print(sents[0])
#print(type(test3))
#print(test3)


In [17]:
def train_crf_model(X_train, y_train):
    '''Configure and fit the model'''

    crf = sklearn_crfsuite.CRF(
        algorithm='lbfgs',
        c1=0.1,
        c2=0.1,
        max_iterations=100,
        all_possible_transitions=True
    )
    crf.fit(X_train, y_train)
    
    return crf


In [18]:
def create_crf_model(trainingfile):
    
    '''Perform the training with the data, return the classifier'''

    train_sents = extract_sents_from_conll(trainingfile)
    X_train = [sent2features(s) for s in train_sents]
    y_train = [sent2labels(s) for s in train_sents]

    crf = train_crf_model(X_train, y_train)
    
    return crf 


In [27]:
def run_crf_model(crf, evaluationfile):
    
    '''Get and prepare the validation sentences, run the classifier and return predictions'''

    test_sents = extract_sents_from_conll(evaluationfile)
    X_test = [sent2features(s) for s in test_sents]
    y_test = [sent2labels(s) for s in test_sents]
    y_pred = crf.predict(X_test)
    
    return y_pred, X_test, y_test


In [28]:
def write_out_evaluation(eval_data, pred_labels, outputfile):
    
    '''Write the predictions to a new file along with tokens'''

    # The context manager makes sure the file is flushed and closed.
    with open(outputfile, 'w', encoding="utf-8") as outfile:
        for evalsents, predsents in zip(eval_data, pred_labels):
            for data, pred in zip(evalsents, predsents):
                token = str(data.get('token'))
                outfile.write(token + "\t" + pred + "\n")

## Training and Evaluation Functions:

In [67]:
def run_and_evaluate_crf_model(trainingfile, evaluationfile, outputfile):

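    '''Train the classifier, predict on the evaluation file, write the predictions out and print the metrics'''
    crf = create_crf_model(trainingfile)
    # Report only the attribution labels: leave out 'O' and 'AR_label', without raising if either is absent
    labels = [label for label in crf.classes_ if label not in ('O', 'AR_label')]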
    y_pred, X_test, y_test = run_crf_model(crf, evaluationfile)
    write_out_evaluation(X_test, y_pred, outputfile)
    print('The predictions are written on the output file.')
    print(metrics.flat_classification_report(y_test, y_pred, labels=labels, digits=4))
    print('Accuracy score for sequence items')
    print(metrics.flat_accuracy_score(y_test, y_pred))
    print('Precision score for sequence items')
    print(metrics.flat_precision_score(y_test, y_pred, average='weighted'))
    print('Recall score for sequence items')
    print(metrics.flat_recall_score(y_test, y_pred, average='weighted'))
    print('F1 score score for sequence items')
    print(metrics.flat_f1_score(y_test, y_pred, average='weighted'))
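
Note that the flat accuracy and the weighted recall printed by this function are computed over all labels, including `O`, and therefore coincide: weighting each label's recall by its support and summing gives the number of correct tokens divided by the total number of tokens. The hand-rolled sketch below illustrates this on toy sequences. It is not how `sklearn_crfsuite.metrics` is implemented, and all names and labels in it are illustrative.

```python
from collections import Counter


def flat_accuracy(y_true, y_pred):
    """Token-level accuracy over all sentences."""
    gold = [label for sent in y_true for label in sent]
    pred = [label for sent in y_pred for label in sent]
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def weighted_recall(y_true, y_pred):
    """Per-label recall averaged with each label's support as its weight."""
    gold = [label for sent in y_true for label in sent]
    pred = [label for sent in y_pred for label in sent]
    support = Counter(gold)
    hits = Counter(g for g, p in zip(gold, pred) if g == p)
    return sum(support[l] / len(gold) * (hits[l] / support[l]) for l in support)


toy_true = [["B-SOURCE", "B-CUE", "B-CONTENT", "I-CONTENT", "O"], ["O", "O"]]
toy_pred = [["B-SOURCE", "O", "B-CONTENT", "I-CONTENT", "O"], ["O", "B-CUE"]]
toy_accuracy = flat_accuracy(toy_true, toy_pred)
toy_recall = weighted_recall(toy_true, toy_pred)
print(toy_accuracy, toy_recall)
```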

## Toy example to test:

In [68]:
toy_trainingfile = "Toy_data_train.tsv"
toy_evaluationfile = "Toy_data_eval.tsv"
toy_outputfile = "toy_output_CRF_Embeddings.tsv"

In [71]:
#run_and_evaluate_crf_model(toy_trainingfile, toy_evaluationfile, toy_outputfile)

## Set the variables and run the experiments:

In [40]:
polnear_trainingfile = "polnear_withBIO_train.tsv"
polnear_evaluationfile = "polnear_withBIO_dev.tsv"
polnear_outputfile = "polnear_output_CRF_Embeddings.tsv"

In [41]:
#training takes around 1 hr
run_and_evaluate_crf_model(polnear_trainingfile, polnear_evaluationfile, polnear_outputfile)

  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

    AR_label     0.0000    0.0000    0.0000         1
           O     0.6470    0.7393    0.6901     32482
    B-SOURCE     0.7376    0.4718    0.5755      1948
       B-CUE     0.7978    0.5279    0.6353      2190
   B-CONTENT     0.5946    0.3999    0.4782      2193
   I-CONTENT     0.7332    0.7250    0.7291     36881
       I-CUE     0.4102    0.2085    0.2765      1808
    I-SOURCE     0.6128    0.4371    0.5103      4070
                 0.0000    0.0000    0.0000         0

   micro avg     0.6848    0.6848    0.6848     81573
   macro avg     0.5037    0.3899    0.4328     81573
weighted avg     0.6838    0.6848    0.6797     81573



In [72]:
#training takes around 3.5 hrs

parc3_trainingfile = "parc3_withBIO_train.tsv"
parc3_evaluationfile = "parc3_withBIO_dev.tsv"
parc3_outputfile = "parc3_output_CRF_Embeddings.tsv"

In [73]:
run_and_evaluate_crf_model(parc3_trainingfile, parc3_evaluationfile, parc3_outputfile)

The predictions are written on the output file.




              precision    recall  f1-score   support

   B-CONTENT     0.6250    0.2849    0.3914       544
   I-CONTENT     0.6249    0.3893    0.4797      8888
    B-SOURCE     0.8277    0.3855    0.5260       511
       B-CUE     0.8664    0.4011    0.5483       566
    I-SOURCE     0.6898    0.3713    0.4828      1605
       I-CUE     0.5000    0.0870    0.1481       184

   micro avg     0.6477    0.3782    0.4775     12298
   macro avg     0.6890    0.3198    0.4294     12298
weighted avg     0.6510    0.3782    0.4763     12298

Accuracy score for sequence items
0.7080599812558576
Precision score for sequence items


  _warn_prf(average, modifier, msg_start, len(result))


0.6977904875340245
Recall score for sequence items
0.7080599812558576
F1 score score for sequence items
0.6834464135971214


End of the notebook.

### References:

https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#let-s-use-conll-2002-data-to-build-a-ner-system