# Attribution Relations Extraction Model:
## Conditional Random Field (CRF) Approach
### CRF classifier trained with pre-trained word embeddings

In this notebook, we train and evaluate a CRF classifier using 50-dimensional pre-trained GloVe word embeddings as features.

The corpus is the combination of PolNeAr and PARC3.

In [1]:
# We start with imports:

import sklearn_crfsuite  # the model
from sklearn_crfsuite import metrics
from gensim.models import KeyedVectors # to load pre-trained word embeddings
import numpy as np # to create 0 vectors for the words which are not in the vocabulary
import pandas as pd # to load input&output files for evaluation
import csv # to read the data files for training and evaluation

In [2]:
# load the pre-trained word embeddings
glove_dimensions = 50
!python -m gensim.scripts.glove2word2vec --input  glove.6B.50d.txt --output glove.6B.50d.w2vformat.txt
model = KeyedVectors.load_word2vec_format("glove.6B.50d.w2vformat.txt")

2021-06-21 21:56:10,283 - glove2word2vec - INFO - running C:\Users\filiz\anaconda3\lib\site-packages\gensim\scripts\glove2word2vec.py --input glove.6B.50d.txt --output glove.6B.50d.w2vformat.txt
  num_lines, num_dims = glove2word2vec(args.input, args.output)
2021-06-21 21:56:10,283 - keyedvectors - INFO - loading projection weights from glove.6B.50d.txt
2021-06-21 21:56:28,196 - utils - INFO - KeyedVectors lifecycle event {'msg': 'loaded (400000, 50) matrix of type float32 from glove.6B.50d.txt', 'binary': False, 'encoding': 'utf8', 'datetime': '2021-06-21T21:56:28.196600', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'load_word2vec_format'}
2021-06-21 21:56:28,196 - glove2word2vec - INFO - converting 400000 vectors from glove.6B.50d.txt to glove.6B.50d.w2vformat.txt
2021-06-21 21:56:28,524 - keyedvectors - INFO - storing 400000x50 projection weights into glove.6B.50d.w2vformat.txt
2

### Helper functions:

In [3]:
def extract_sents_from_conll(inputfile):
    '''Read the data from a TSV file, return sentences as lists of (token, label) tuples.'''

    sents = []
    current_sent = []
    # Use a context manager so the file handle is closed after reading.
    with open(inputfile, encoding="utf-8") as infile:
        rows = csv.reader(infile, delimiter='\t')
        # Note: the header row (Word, AR_label) is kept as the first item of the first sentence.
        for row in rows:
            current_sent.append(tuple(row))
            # After each sentence there is a special token, Sent_end, labelled O.
            # It was added in the preprocessing step.
            if row[0] == "Sent_end":
                sents.append(current_sent)
                current_sent = []
    return sents
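
The sentence splitting can be checked without the corpus files. The following is a minimal sketch that applies the same `Sent_end` logic to an in-memory string; the sample rows and the name `split_sentences` are made up for illustration and are not part of the notebook.

```python
import csv
import io

# Made-up sample in the same layout: token<TAB>label, one token per line.
sample = (
    "He\tB-SOURCE\nsaid\tB-CUE\nyes\tB-CONTENT\nSent_end\tO\n"
    "Fine\tO\n.\tO\nSent_end\tO\n"
)

def split_sentences(handle):
    """Group (token, label) rows into sentences, closing one at each Sent_end."""
    sents, current = [], []
    for row in csv.reader(handle, delimiter="\t"):
        current.append(tuple(row))
        if row[0] == "Sent_end":
            sents.append(current)
            current = []
    return sents

toy_sents = split_sentences(io.StringIO(sample))
print(len(toy_sents))
print(toy_sents[1])
```

By default `csv.reader` treats double quotes as quoting characters, so a corpus containing bare `"` tokens may be parsed differently than expected; passing `quoting=csv.QUOTE_NONE` is one option worth checking in that case.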


In [4]:
sents = extract_sents_from_conll("merged_withBIO_train.tsv")

print(sents[100:102])
print(len(sents))

[[('If', 'O'), ('Republicans', 'B-SOURCE'), ('choose', 'B-CUE'), ('to', 'I-CUE'), ('not', 'I-CUE'), ('believe', 'I-CUE'), ('Liar-of', 'B-CONTENT'), ('the', 'I-CONTENT'), ('Year', 'I-CONTENT'), ('Barack', 'I-CONTENT'), ('Obama', 'I-CONTENT'), (',', 'O'), ('the', 'O'), ('documented', 'O'), ('fact-checking', 'O'), ('frauds', 'O'), ('at', 'O'), ('the', 'O'), ('Washington', 'O'), ('Post', 'O'), ('will', 'O'), ('now', 'O'), ('award', 'O'), ('you', 'O'), ('the', 'O'), ('full-boat', 'O'), ('of', 'O'), ('four', 'O'), ('Pinocchios', 'O'), ('.', 'O'), ('Sent_end', 'O')], [('Because', 'O'), ('at', 'O'), ('the', 'O'), ('Washington', 'O'), ('Post', 'O'), (',', 'O'), ('what', 'O'), ('Obama', 'O'), ('says', 'O'), ('and', 'O'), ('promises', 'O'), ('is', 'O'), ('now', 'O'), ('the', 'O'), ('baseline', 'O'), ('for', 'O'), ('objective', 'O'), ('truth', 'O'), ('.', 'O'), ('Sent_end', 'O')]]
75079


In [5]:
def sent2tokens(sent):
    '''Take the sentence as token-label pair, return only tokens'''

    return [token for token, label in sent]


In [6]:
test =  sent2tokens(sents[0])

print(test)

['Word', 'The', 'Ninth', 'Circle', ':', 'The', 'Hellish', 'View', 'from', 'Inside', 'the', 'Beltway', ',', '#', '2', '.', 'Sent_end']


In [7]:
def sent2labels(sent):    
    '''Take the sentence as token-label pair, return only labels'''

    return [label for token, label  in sent]

In [8]:
test2 = sent2labels(sents[0])

print(test2)

['AR_label', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


Next, we extract the features.

IMPORTANT: CRFsuite does not support array-valued features such as word embeddings. Instead, we pass every vector component as a separate feature.

https://stackoverflow.com/questions/58736548/how-to-use-word-embedding-as-features-for-crf-sklearn-crfsuite-model-training
https://github.com/scrapinghub/python-crfsuite/issues/39
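
As a minimal sketch of this workaround, the snippet below flattens a vector into one scalar feature per component. The helper name `vector_to_features` and the 3-dimensional vector are made up for illustration; the `v0`, `v1`, ... key scheme follows the one used in this notebook.

```python
def vector_to_features(vector, prefix="v"):
    """Turn a sequence of numbers into {'v0': x0, 'v1': x1, ...}."""
    # float() also converts NumPy scalars to plain Python floats.
    return {f"{prefix}{i}": float(x) for i, x in enumerate(vector)}

toy_vector = [0.5, -1.25, 2.0]  # made-up 3-dimensional embedding
features = {"bias": 1.0, "token": "are"}
features.update(vector_to_features(toy_vector))
print(features)
```

Each component becomes an independent real-valued feature, so the CRF learns one weight per component and label rather than treating the vector as a single object.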

In [9]:
### Embedding function 
def get_features(token):
    '''Take a token, return its word vector'''

    token = token.lower()
    try:
        vector = model[token]
    except KeyError:
        # If the word is not in the vocabulary, return a zero vector
        # with the same dimensionality as the embeddings.
        vector = np.zeros(glove_dimensions)

    return vector

In [10]:
vector = get_features("are")
print(len(vector))
print(vector)

50
[ 0.96193    0.012516   0.21733   -0.06539    0.26843    0.33586
 -0.45112   -0.60547   -0.46845   -0.18412    0.060949   0.19597
  0.22645    0.032802   0.42488    0.49678    0.65346   -0.0274
  0.17809   -1.1979    -0.40634   -0.22659    1.1495     0.59342
 -0.23759   -0.93254   -0.52502    0.05125    0.032248  -0.72774
  4.2466     0.60592    0.33397   -0.85754    0.4895     0.21744
 -0.13451    0.0094912 -0.54173    0.18857   -0.64506    0.012695
  0.73452    1.0032     0.41874    0.16596   -0.71085    0.14032
 -0.38468   -0.38712  ]
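
The out-of-vocabulary fallback only stays consistent if the zero vector has the same length as the real vectors (50 in the output above). The sketch below illustrates this with a plain dictionary standing in for the GloVe model; `toy_embeddings`, `toy_dimensions` and `lookup` are hypothetical names used only here.

```python
# Made-up 3-dimensional "embeddings" standing in for the GloVe model.
toy_embeddings = {"are": [0.1, 0.2, 0.3], "the": [0.4, 0.5, 0.6]}
toy_dimensions = 3

def lookup(token, table, dimensions):
    """Return the vector for a lowercased token, or zeros if it is unknown."""
    return table.get(token.lower(), [0.0] * dimensions)

known = lookup("Are", toy_embeddings, toy_dimensions)
unknown = lookup("Pinocchios", toy_embeddings, toy_dimensions)
print(known, unknown)
```

Tying the fallback length to a single dimension variable means every token yields the same set of `v*` features, whether or not it is in the vocabulary.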


In [11]:
def token2features(sent, i):
    '''Get tokens in the sentence, add bias, token and word embeddings as features and return all as a feature dictionary.'''
    
    token = sent[i][0]
    wordembedding = get_features(token)  # word embedding vector (NumPy array)

    features = {
        'bias': 1.0,
        'token': token.lower()
    }

    # CRFsuite needs scalar features, so add one feature per vector component.
    for iv, value in enumerate(wordembedding):
        features['v{}'.format(iv)] = value

    if i == 0:
        features['BOS'] = True
        
    elif i == len(sent) -1:
        features['EOS'] = True
        
    return features


In [12]:
features= token2features(sents[0], i=0)

print(features)
print(type(features))

{'bias': 1.0, 'token': 'word', 'v0': -0.1643, 'v1': 0.15722, 'v2': -0.55021, 'v3': -0.3303, 'v4': 0.66463, 'v5': -0.1152, 'v6': -0.2261, 'v7': -0.23674, 'v8': -0.86119, 'v9': 0.24319, 'v10': 0.074499, 'v11': 0.61081, 'v12': 0.73683, 'v13': -0.35224, 'v14': 0.61346, 'v15': 0.0050975, 'v16': -0.62538, 'v17': -0.0050458, 'v18': 0.18392, 'v19': -0.12214, 'v20': -0.65973, 'v21': -0.30673, 'v22': 0.35038, 'v23': 0.75805, 'v24': 1.0183, 'v25': -1.7424, 'v26': -1.4277, 'v27': 0.38032, 'v28': 0.37713, 'v29': -0.74941, 'v30': 2.9401, 'v31': -0.8097, 'v32': -0.66901, 'v33': 0.23123, 'v34': -0.073194, 'v35': -0.13624, 'v36': 0.24424, 'v37': -1.0129, 'v38': -0.24919, 'v39': -0.06893, 'v40': 0.70231, 'v41': -0.022177, 'v42': -0.64684, 'v43': 0.59599, 'v44': 0.027092, 'v45': 0.11203, 'v46': 0.61214, 'v47': 0.74339, 'v48': 0.23572, 'v49': -0.1369, 'BOS': True}
<class 'dict'>


In [13]:
def sent2features(sent):
    '''Get sentence as an input, add the features and return as a list of dictionaries.'''
    return [token2features(sent, i) for i in range(len(sent))]

In [14]:
test3 =sent2features(sents[0])

print(sents[0])
print(type(test3))
print(test3)


[('Word', 'AR_label'), ('The', 'O'), ('Ninth', 'O'), ('Circle', 'O'), (':', 'O'), ('The', 'O'), ('Hellish', 'O'), ('View', 'O'), ('from', 'O'), ('Inside', 'O'), ('the', 'O'), ('Beltway', 'O'), (',', 'O'), ('#', 'O'), ('2', 'O'), ('.', 'O'), ('Sent_end', 'O')]
<class 'list'>
[{'bias': 1.0, 'token': 'word', 'v0': -0.1643, 'v1': 0.15722, 'v2': -0.55021, 'v3': -0.3303, 'v4': 0.66463, 'v5': -0.1152, 'v6': -0.2261, 'v7': -0.23674, 'v8': -0.86119, 'v9': 0.24319, 'v10': 0.074499, 'v11': 0.61081, 'v12': 0.73683, 'v13': -0.35224, 'v14': 0.61346, 'v15': 0.0050975, 'v16': -0.62538, 'v17': -0.0050458, 'v18': 0.18392, 'v19': -0.12214, 'v20': -0.65973, 'v21': -0.30673, 'v22': 0.35038, 'v23': 0.75805, 'v24': 1.0183, 'v25': -1.7424, 'v26': -1.4277, 'v27': 0.38032, 'v28': 0.37713, 'v29': -0.74941, 'v30': 2.9401, 'v31': -0.8097, 'v32': -0.66901, 'v33': 0.23123, 'v34': -0.073194, 'v35': -0.13624, 'v36': 0.24424, 'v37': -1.0129, 'v38': -0.24919, 'v39': -0.06893, 'v40': 0.70231, 'v41': -0.022177, 'v42': -0.

In [15]:
def train_crf_model(X_train, y_train):
    '''Compile and fit the model'''

    crf = sklearn_crfsuite.CRF(
        algorithm='lbfgs',
        c1=0.1,
        c2=0.1,
        max_iterations=100,
        all_possible_transitions=True
    )
    crf.fit(X_train, y_train)
    
    return crf
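
For orientation, `c1` and `c2` are the coefficients for L1 and L2 regularization in CRFsuite's L-BFGS training. Assuming the usual elastic-net form (the exact scaling convention is an assumption here, not something this notebook verifies), the training objective over weights $w$ and $N$ training sentences can be sketched as:

```latex
\min_{w} \; -\sum_{n=1}^{N} \log p\left(y^{(n)} \mid x^{(n)}; w\right)
\; + \; c_1 \lVert w \rVert_1 \; + \; c_2 \lVert w \rVert_2^2
```

With `c1=0.1` and `c2=0.1` as above, both penalties are active: the L1 term pushes some feature weights to exactly zero, while the L2 term shrinks the remaining ones.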


In [16]:
def create_crf_model(trainingfile):
    
    '''Perform the training with the data, return the classifier'''

    train_sents = extract_sents_from_conll(trainingfile)
    X_train = [sent2features(s) for s in train_sents]
    y_train = [sent2labels(s) for s in train_sents]

    crf = train_crf_model(X_train, y_train)
    
    return crf 


In [17]:
def run_crf_model(crf, evaluationfile):
    
    '''Get and prepare the validation sentences, run the classifier and return predictions'''

    test_sents = extract_sents_from_conll(evaluationfile)
    X_test = [sent2features(s) for s in test_sents]
    y_test = [sent2labels(s) for s in test_sents]
    y_pred = crf.predict(X_test)
    
    return y_pred, X_test, y_test


In [18]:
def write_out_evaluation(eval_data, pred_labels, outputfile):

    '''Write the predictions to a new file along with the tokens'''

    # The context manager closes (and flushes) the file when writing is done.
    with open(outputfile, 'w', encoding="utf-8") as outfile:
        for evalsents, predsents in zip(eval_data, pred_labels):
            for data, pred in zip(evalsents, predsents):
                token = str(data.get('token'))
                outfile.write(token + "\t" + pred + "\n")
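
The output format can be illustrated with a small round trip. This sketch writes made-up feature dictionaries and predictions to an in-memory buffer instead of a file and reads them back; `toy_X` and `toy_pred` are hypothetical stand-ins for `X_test` and `y_pred`.

```python
import csv
import io

# Made-up feature dicts and predictions, shaped like X_test and y_pred.
toy_X = [[{"token": "he"}, {"token": "said"}, {"token": "sent_end"}]]
toy_pred = [["B-SOURCE", "B-CUE", "O"]]

buffer = io.StringIO()
for sent_feats, sent_preds in zip(toy_X, toy_pred):
    for feats, pred in zip(sent_feats, sent_preds):
        buffer.write(f"{feats['token']}\t{pred}\n")

# Read the two-column output back, one [token, prediction] row per line.
buffer.seek(0)
rows = list(csv.reader(buffer, delimiter="\t"))
print(rows)
```

Because the `token` feature stores `token.lower()`, the tokens in the output file are lowercased; aligning them with the original evaluation file is easiest by row position rather than by string match.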

## Training and Evaluation Functions:

In [19]:
def run_and_evaluate_crf_model(trainingfile, evaluationfile, outputfile):

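    '''Train the model, write the predictions and print the evaluation scores'''
    crf = create_crf_model(trainingfile)
    # Report only the attribution labels: drop 'O' and the header label 'AR_label'.
    labels = list(crf.classes_)
    labels.remove('O')
    labels.remove('AR_label')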
    y_pred, X_test, y_test = run_crf_model(crf, evaluationfile)
    write_out_evaluation(X_test, y_pred, outputfile)
    print('The predictions are written on the output file.')
    print(metrics.flat_classification_report(y_test, y_pred, labels=labels, digits=4))
    print('Accuracy score for sequence items')
    print(metrics.flat_accuracy_score(y_test, y_pred))
    print('Precision score for sequence items')
    print(metrics.flat_precision_score(y_test, y_pred, average='weighted'))
    print('Recall score for sequence items')
    print(metrics.flat_recall_score(y_test, y_pred, average='weighted'))
    print('F1 score score for sequence items')
    print(metrics.flat_f1_score(y_test, y_pred, average='weighted'))
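
The `flat_*` metrics treat every token as one item by flattening the per-sentence label lists. The classification report is restricted to `labels` (so `O` is excluded), whereas the accuracy score counts all tokens, including `O`, which is why the two can diverge. The sketch below reproduces the idea in plain Python on made-up data; `flat_scores` is a hypothetical helper, not the library implementation.

```python
from itertools import chain

def flat_scores(y_true, y_pred, label):
    """Precision, recall and F1 for one label over flattened sequences."""
    true_flat = list(chain.from_iterable(y_true))
    pred_flat = list(chain.from_iterable(y_pred))
    tp = sum(t == label and p == label for t, p in zip(true_flat, pred_flat))
    predicted = sum(p == label for p in pred_flat)
    actual = sum(t == label for t in true_flat)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up gold and predicted label sequences (two sentences).
toy_true = [["B-SOURCE", "B-CUE", "O"], ["O", "B-CUE"]]
toy_pred = [["B-SOURCE", "O", "O"], ["O", "B-CUE"]]

# Token-level accuracy over all labels, including 'O'.
pairs = zip(chain.from_iterable(toy_true), chain.from_iterable(toy_pred))
accuracy = sum(t == p for t, p in pairs) / sum(len(s) for s in toy_true)
cue_scores = flat_scores(toy_true, toy_pred, "B-CUE")
print(cue_scores, accuracy)
```

When a label is never predicted, its precision is undefined; this sketch returns 0.0 in that case.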

## Toy example to test:

In [20]:
toy_trainingfile = "Toy_data_train.tsv"
toy_evaluationfile = "Toy_data_eval.tsv"
toy_outputfile = "toy_output_CRF_Embeddings.tsv"

In [21]:
run_and_evaluate_crf_model(toy_trainingfile, toy_evaluationfile, toy_outputfile)

The predictions are written on the output file.
              precision    recall  f1-score   support

    B-SOURCE     0.0000    0.0000    0.0000         2
       B-CUE     0.0000    0.0000    0.0000         2
   B-CONTENT     0.0000    0.0000    0.0000         2
   I-CONTENT     0.0000    0.0000    0.0000        72
       I-CUE     0.0000    0.0000    0.0000         1
    I-SOURCE     0.0000    0.0000    0.0000         4

   micro avg     0.0000    0.0000    0.0000        83
   macro avg     0.0000    0.0000    0.0000        83
weighted avg     0.0000    0.0000    0.0000        83

Accuracy score for sequence items
0.2169811320754717
Precision score for sequence items
0.05292003593890387
Recall score for sequence items
0.2169811320754717
F1 score score for sequence items
0.08134006834051405


  _warn_prf(average, modifier, msg_start, len(result))


## Set the variables and run the experiments:

In [22]:
merged_trainingfile = "merged_withBIO_train.tsv"
merged_evaluationfile = "merged_withBIO_dev.tsv"
merged_outputfile = "merged_output_CRF_Embeddings.tsv"

In [23]:
run_and_evaluate_crf_model(merged_trainingfile, merged_evaluationfile, merged_outputfile)

The predictions are written on the output file.




              precision    recall  f1-score   support

    B-SOURCE     0.7825    0.4376    0.5613      2459
       B-CUE     0.8355    0.4590    0.5925      2756
   B-CONTENT     0.6073    0.3328    0.4300      2737
   I-CONTENT     0.7642    0.5906    0.6663     45769
       I-CUE     0.4362    0.1391    0.2109      1992
    I-SOURCE     0.6753    0.4303    0.5257      5675

   micro avg     0.7499    0.5376    0.6262     61388
   macro avg     0.6835    0.3982    0.4978     61388
weighted avg     0.7423    0.5376    0.6205     61388

Accuracy score for sequence items
0.6855318192816896
Precision score for sequence items
0.6971034386896819
Recall score for sequence items
0.6855318192816896
F1 score score for sequence items
0.6742891885407917


End of the notebook.

### References:

https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#let-s-use-conll-2002-data-to-build-a-ner-system