# Relation extraction using distant supervision: experiments

In [1]:
__author__ = "Bill MacCartney and Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2021"

## Contents

1. [Overview](#Overview)
1. [Set-up](#Set-up)
1. [Building a classifier](#Building-a-classifier)
  1. [Featurizers](#Featurizers)
  1. [Experiments](#Experiments)
1. [Analysis](#Analysis)
  1. [Examining the trained models](#Examining-the-trained-models)
  1. [Discovering new relation instances](#Discovering-new-relation-instances)

## Overview

OK, it's time to get (halfway) serious. Let's apply real machine learning to train a classifier on the training data, and see how it performs on the test data. We'll begin with one of the simplest machine learning setups: a bag-of-words feature representation, and a linear model trained using logistic regression.

Just like we did in the unit on [supervised sentiment analysis](https://github.com/cgpotts/cs224u/blob/master/sst_02_hand_built_features.ipynb), we'll leverage the `sklearn` library, and we'll introduce functions for featurizing instances, training models, making predictions, and evaluating results.

## Set-up

See [the first notebook in this unit](rel_ext_01_task.ipynb#Set-up) for set-up instructions.

In [2]:
from collections import Counter
import os
import rel_ext
import utils

In [3]:
# Set all the random seeds for reproducibility. Only the
# system seed is relevant for this notebook.

utils.fix_random_seeds()

In [4]:
import sys

# Python package (nlp) location: two levels up from the current working directory.
src_path = os.path.abspath(os.path.join(os.getcwd(), "../.."))
# Add the package to sys.path if it's not already there.
if src_path not in sys.path:
    sys.path.append(src_path)

In [5]:
rel_ext_data_home = os.path.join(src_path,'data')
rel_ext_data_home

'/Users/timmaecker/Google Drive/Career and Skills/Learning/MSc Machine Learning - UCL/2 COMP0087 Statistical Natural Language Processing/Coursework/nlpproject/nlp/data'

With the following steps, we build up the dataset we'll use for experiments; it unites a corpus and a knowledge base in the way we described in [the previous notebook](rel_ext_01_task.ipynb).

In [6]:
corpus = rel_ext.Corpus(os.path.join(rel_ext_data_home, 'example_inputs_long_names.tsv.gz'))

print('Read {0:,} examples'.format(len(corpus)))

Read 6,058 examples


In [7]:
kb = rel_ext.KB(os.path.join(rel_ext_data_home, 'example_kb.tsv.gz'))

print('Read {0:,} KB triples'.format(len(kb)))

Read 21,458 KB triples


In [8]:
dataset = rel_ext.Dataset(corpus, kb)

The following code splits up our data in a way that supports experimentation:

In [9]:
splits = dataset.build_splits()

splits

{'tiny': Corpus with 136 examples; KB with 244 triples,
 'train': Corpus with 4,732 examples; KB with 15,632 triples,
 'dev': Corpus with 1,190 examples; KB with 5,582 triples,
 'all': Corpus with 6,058 examples; KB with 21,458 triples}

## Building a classifier

### Featurizers

Featurizers are functions which define the feature representation for our model. The primary input to a featurizer will be the `KBTriple` for which we are generating features. But since our features will be derived from corpus examples containing the entities of the `KBTriple`, we must also pass in a reference to a `Corpus`. And in order to make it easy to combine different featurizers, we'll also pass in a feature counter to hold the results.

Here's an implementation for a very simple bag-of-words featurizer. It finds all the corpus examples containing the two entities in the `KBTriple`, breaks the phrase appearing between the two entity mentions into words, and counts the words. Note that it makes no distinction between "forward" and "reverse" examples.

In [10]:
def simple_bag_of_words_featurizer(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.middle.split(' '):
            feature_counter[word] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in ex.middle.split(' '):
            feature_counter[word] += 1
    return feature_counter

Here's how this featurizer works on a single example:

In [21]:
kbt = kb.kb_triples[2]

kbt

KBTriple(rel='supplier of', sbj='Iri Group Holdings Inc', obj='DS Smith PLC')

In [22]:
kbt.sbj

'Iri Group Holdings Inc'

In [23]:
kbt.obj

'DS Smith PLC'

In [24]:
corpus.get_examples_for_entities(kbt.sbj, kbt.obj)

[]

In [32]:
# Find the first KB triple for which the corpus has examples.
for kbt in kb.kb_triples:
    examples = corpus.get_examples_for_entities(kbt.sbj, kbt.obj)
    if examples:
        print(kbt.sbj, kbt.obj)
        print(examples[0].middle)
        break

Apple Inc TDK Corp
 suppliers outperformed, with 


In [33]:
corpus.get_examples_for_entities(kbt.sbj, kbt.obj)[0].middle

' suppliers outperformed, with '

In [35]:
corpus.get_examples_for_entities(kbt.sbj, kbt.obj)[0]

Example(entity_1='Apple Inc', entity_2='TDK Corp', left='Japanese ', mention_1='Apple Inc', middle=' suppliers outperformed, with ', mention_2='TDK Corp', right=' gaining 0.7 percent and Foster Electric rising 1.1 percent, and Taiyo Yuden surging 0.9 percent.', left_POS='Japanese ', mention_1_POS='Apple Inc', middle_POS=' suppliers outperformed, with ', mention_2_POS='TDK Corp', right_POS=' gaining 0.7 percent and Foster Electric rising 1.1 percent, and Taiyo Yuden surging 0.9 percent.')

In [39]:
simple_bag_of_words_featurizer(kbt, corpus, Counter())

Counter({'': 15,
         'suppliers': 2,
         'outperformed,': 1,
         'with': 1,
         'told': 3,
         'investors': 1,
         'late': 1,
         'on': 2,
         'Monday': 1,
         'that': 2,
         'its': 5,
         'manufacturing': 1,
         'facilities': 1,
         'in': 2,
         'China': 1,
         'produce': 1,
         'iPhone': 3,
         'and': 8,
         'other': 1,
         'electronics': 1,
         'have': 1,
         'begun': 1,
         'to': 6,
         're-open,': 1,
         'but': 1,
         'are': 1,
         'ramping': 1,
         'up': 1,
         'slower': 1,
         'than': 3,
         'expected.Among': 1,
         'Japanese': 1,
         'suppliers,': 1,
         'electric': 1,
         'parts': 1,
         'maker': 2,
         'Murata': 3,
         'Manufacturing': 3,
         'shed': 1,
         '3.4%': 1,
         'Taiyo': 3,
         'Yuden': 3,
         'Co': 2,
         'Ltd': 1,
         ',': 2,
         'which': 3,
 

You can experiment with adding new kinds of features just by implementing additional featurizers, following `simple_bag_of_words_featurizer` as an example.
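As one illustration, here is a minimal sketch of a featurizer that does distinguish "forward" from "reverse" examples by tagging each word with the order in which the entities were mentioned. The `ToyCorpus`, `Example`, and `KBTriple` definitions below are hypothetical stand-ins for the `rel_ext` classes, and the two corpus sentences are toy data; only the `get_examples_for_entities` method and the `sbj`, `obj`, and `middle` attributes are taken from the code above. The sketch also uses `split()` rather than `split(' ')`, which avoids the empty-string feature visible in the `Counter` output above.

```python
from collections import Counter, namedtuple

# Hypothetical stand-ins for the `rel_ext` classes, for illustration only.
Example = namedtuple('Example', 'entity_1 entity_2 middle')
KBTriple = namedtuple('KBTriple', 'rel sbj obj')


class ToyCorpus:
    """Toy corpus exposing only `get_examples_for_entities`."""

    def __init__(self, examples):
        self.examples = examples

    def get_examples_for_entities(self, e1, e2):
        return [ex for ex in self.examples
                if ex.entity_1 == e1 and ex.entity_2 == e2]


def directional_bag_of_words_featurizer(kbt, corpus, feature_counter):
    # Words seen when the subject is mentioned before the object.
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.middle.split():  # split() drops empty tokens
            feature_counter[word + '_SO'] += 1
    # Words seen when the object is mentioned before the subject.
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in ex.middle.split():
            feature_counter[word + '_OS'] += 1
    return feature_counter


toy_corpus = ToyCorpus([
    Example('Apple Inc', 'TDK Corp', ' suppliers outperformed, with '),
    Example('TDK Corp', 'Apple Inc', ' supplies parts to '),  # invented sentence
])
toy_kbt = KBTriple('customer of', 'Apple Inc', 'TDK Corp')
toy_features = directional_bag_of_words_featurizer(toy_kbt, toy_corpus, Counter())
print(toy_features)
```

Because it has the same signature as `simple_bag_of_words_featurizer`, a featurizer like this could be passed in the `featurizers` list alongside it or in its place.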

Now, in order to apply machine learning algorithms such as those provided by `sklearn`, we need a way to convert datasets of `KBTriple`s into feature matrices. The following steps achieve that: 

In [40]:
kbts_by_rel, labels_by_rel = dataset.build_dataset()

featurized = dataset.featurize(kbts_by_rel, featurizers=[simple_bag_of_words_featurizer])
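To make the conversion concrete, here is a pure-Python sketch of what turning feature counters into a matrix involves: fix an ordering of feature names, then fill one row per instance. This is only an illustration under that assumption; `dataset.featurize` has its own implementation, and the `vectorize` helper and toy counters below are hypothetical.

```python
from collections import Counter


def vectorize(feature_counters):
    """Map a list of feature Counters to (feature_names, dense rows)."""
    # One column per distinct feature name, in sorted order.
    feature_names = sorted({name for fc in feature_counters for name in fc})
    index = {name: i for i, name in enumerate(feature_names)}
    matrix = []
    for fc in feature_counters:
        row = [0] * len(feature_names)
        for name, count in fc.items():
            row[index[name]] = count
        matrix.append(row)
    return feature_names, matrix


# Toy counters, one per (hypothetical) KB triple.
toy_counters = [
    Counter({'suppliers': 2, 'with': 1}),
    Counter({'with': 3, 'sold': 1}),
]
names, X = vectorize(toy_counters)
print(names)
print(X)
```

In practice most entries are zero, which is why real pipelines typically store such matrices in a sparse format.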

### Experiments

Now we need some functions to train models, make predictions, and evaluate the results. We'll start with `train_models()`. This function takes as arguments a dictionary of data splits, a list of featurizers, the name of the split on which to train (by default, 'train'), and a model factory, which is a function which initializes an `sklearn` classifier (by default, a logistic regression classifier). It returns a dictionary holding the featurizers, the vectorizer that was used to generate the training matrix, and a dictionary holding the trained models, one per relation.

In [42]:
train_result = rel_ext.train_models(
    splits,
    featurizers=[simple_bag_of_words_featurizer])
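The "model factory" idea is worth a quick sketch: because we train one model per relation, we pass a function that builds a fresh, untrained model each time it is called, rather than a single model instance. The code below illustrates the pattern with a deliberately trivial classifier; `MajorityClassifier`, `train_one_model_per_relation`, and the toy data are hypothetical and are not part of `rel_ext`.

```python
from collections import Counter


class MajorityClassifier:
    """Toy classifier: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def train_one_model_per_relation(data_by_rel, model_factory):
    """Call the factory once per relation so each one gets a fresh model."""
    models = {}
    for rel, (X, y) in data_by_rel.items():
        models[rel] = model_factory().fit(X, y)
    return models


# Toy feature rows and binary labels, grouped by relation.
toy_data = {
    'customer of': ([[1, 0], [0, 1], [1, 1]], [True, True, False]),
    'supplier of': ([[1, 0], [0, 1], [1, 1]], [False, False, True]),
}
toy_models = train_one_model_per_relation(
    toy_data, model_factory=MajorityClassifier)
print({rel: model.predict([[0, 0]]) for rel, model in toy_models.items()})
```

Any zero-argument callable works as the factory, so swapping in a different classifier or different hyperparameters does not require touching the training loop.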

Next comes `predict()`. This function takes as arguments a dictionary of data splits, the outputs of `train_models()`, and the name of the split for which to make predictions. It returns two parallel dictionaries: one holding the predictions (grouped by relation), the other holding the true labels (again, grouped by relation).

In [43]:
predictions, true_labels = rel_ext.predict(
    splits, train_result, split_name='dev')

Now `evaluate_predictions()`. This function takes as arguments the parallel dictionaries of predictions and true labels produced by `predict()`. It prints summary statistics for each relation, including precision, recall, and F<sub>0.5</sub>-score, and it returns the macro-averaged F<sub>0.5</sub>-score.

In [44]:
rel_ext.evaluate_predictions(predictions, true_labels)

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
customer of               0.999      0.999      0.999       2557       2560
supplier of               0.999      0.999      0.999       3025       3028
------------------    ---------  ---------  ---------  ---------  ---------
macro-average             0.999      0.999      0.999       5582       5588


0.9989899670713638
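As a reminder of what these numbers mean, the sketch below computes the F<sub>β</sub>-score from precision and recall using the standard definition, F<sub>β</sub> = (1 + β²)·P·R / (β²·P + R), and then macro-averages across relations. The helper names and the toy precision and recall values are ours, chosen only to show that β = 0.5 rewards precision more than recall; `rel_ext.evaluate_predictions()` has its own implementation.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def macro_average(scores_by_rel):
    """Unweighted mean of the per-relation scores."""
    return sum(scores_by_rel.values()) / len(scores_by_rel)


# Toy values: the same two numbers, with precision and recall swapped.
toy_scores = {
    'customer of': f_beta(precision=0.5, recall=1.0),
    'supplier of': f_beta(precision=1.0, recall=0.5),
}
print(toy_scores)
print(macro_average(toy_scores))
```

The relation with high precision and low recall scores higher than the one with the reverse profile, which is the behavior we want when false positives are more costly than false negatives.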

Finally, we introduce `rel_ext.experiment()`, which basically chains together `rel_ext.train_models()`, `rel_ext.predict()`, and `rel_ext.evaluate_predictions()`. For convenience, this function returns the output of `rel_ext.train_models()` as its result.

Running `rel_ext.experiment()` in its default configuration will give us a baseline result for machine-learned models.

In [45]:
_ = rel_ext.experiment(
    splits,
    featurizers=[simple_bag_of_words_featurizer])

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
customer of               0.999      0.999      0.999       2557       2560
supplier of               0.999      0.999      0.999       3025       3028
------------------    ---------  ---------  ---------  ---------  ---------
macro-average             0.999      0.999      0.999       5582       5588


Considering how vanilla our model is, these results are surprisingly good! We see huge gains for every relation over our `top_k_middles_classifier` from [the previous notebook](rel_ext_01_task.ipynb#A-simple-baseline-model). This strong performance is a powerful testament to the effectiveness of even the simplest forms of machine learning.

But there is still much more we can do. To make further gains, we must not treat the model as a black box. We must open it up and get visibility into what it has learned, and more importantly, where it still falls down.

## Analysis

### Examining the trained models

One important way to gain understanding of our trained model is to inspect the model weights. What features are strong positive indicators for each relation, and what features are strong negative indicators?

In [46]:
vectorizer = train_result['vectorizer']

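if vectorizer is None:
    print("Model weights can be examined only if the featurizers "
          "are based on dicts (i.e., if `vectorize=True`).")
else:
    # Note: scikit-learn 1.0 deprecated `get_feature_names()` in favor of
    # `get_feature_names_out()`; use the latter on recent versions.
    feature_names = vectorizer.get_feature_names()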

In [48]:
for rel, model in train_result['models'].items():
    print('Highest and lowest feature weights for relation {}:\n'.format(rel))
    try:
        coefs = model.coef_.toarray()
    except AttributeError:
        coefs = model.coef_
    break

Highest and lowest feature weights for relation customer of:



In [49]:
len(coefs[0])

10866

In [54]:
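k = 20  # number of features to show at each end of the ranking
for rel, model in train_result['models'].items():
    print('Highest and lowest feature weights for relation {}:\n'.format(rel))
    try:
        # Sparse coefficient matrices need converting to a dense array.
        coefs = model.coef_.toarray()
    except AttributeError:
        coefs = model.coef_
    # Pair each weight with its feature index and sort from highest to lowest.
    sorted_weights = sorted([(wgt, idx) for idx, wgt in enumerate(coefs[0])], reverse=True)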
    for wgt, idx in sorted_weights[:k]:
        print('{:10.3f} {}'.format(wgt, feature_names[idx]))
    print('{:>10s} {}'.format('.....', '.....'))
    for wgt, idx in sorted_weights[-k:]:
        print('{:10.3f} {}'.format(wgt, feature_names[idx]))
    print()


Highest and lowest feature weights for relation customer of:

     0.369 for
     0.353 to
     0.351 in
     0.324 the
     0.315 ,
     0.266 as
     0.263 and
     0.258 is
     0.252 its
     0.181 that
     0.176 has
     0.171 's
     0.170 while
     0.131 after
     0.106 up
     0.101 including
     0.099 some
     0.095 market
     0.083 Google
     0.081 battery
     ..... .....
    -0.230 cautioned
    -0.230 Profumo
    -0.230 Leonardo,
    -0.230 33%
    -0.230 (TCFP.PA)
    -0.230 space
    -0.240 sold
    -0.254 of
    -0.291 he
    -0.399 a
    -0.470 it
    -0.510 business
    -0.571 with
    -0.613 said
    -0.760 Tuesday
    -0.784 was
    -0.787 earlier
    -0.805 pairing
    -0.856 (NPTN.N),
    -0.863 NeoPhotonics

Highest and lowest feature weights for relation supplier of:

     0.391 to
     0.384 for
     0.371 in
     0.307 ,
     0.285 the
     0.269 is
     0.261 its
     0.252 and
     0.235 as
     0.205 that
     0.164 has
     0.150 after
     0.148 's

In [51]:
rel_ext.examine_model_weights(train_result)

Highest and lowest feature weights for relation customer of:

     0.369 for
     0.353 to
     0.351 in
     ..... .....
    -0.805 pairing
    -0.856 (NPTN.N),
    -0.863 NeoPhotonics

Highest and lowest feature weights for relation supplier of:

     0.391 to
     0.384 for
     0.371 in
     ..... .....
    -0.812 pairing
    -0.860 (NPTN.N),
    -0.867 NeoPhotonics



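Ideally, the high-weight features for each relation are intuitive: they are words that are used to express the relation in question. In the output above, however, the highest-weight features for both relations are mostly function words (`for`, `to`, `in`, `the`), and the two lists are nearly identical. These counter-intuitive results merit a bit of investigation!

The low-weight features (that is, features with large negative weights) may be a bit harder to understand. In some cases, they can be interpreted as features which indicate some _other_ relation which is anti-correlated with the target relation. Here, though, the most negative features (`NeoPhotonics`, `(NPTN.N),`, `pairing`) are the same for both relations, which suggests they reflect a handful of specific training examples rather than a competing relation.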

__Optional exercise:__ Investigate one of the counter-intuitive high-weight features. Find the training examples which caused the feature to be included. Given the training data, does it make sense that this feature is a good predictor for the target relation?

<!--
- SPOILER: Using `penalty='l1'` results in somewhat less intuitive feature weights, and about the same performance.
- SPOILER: Using `penalty='l1', C=0.1` results in much more intuitive feature weights, but much worse performance.
-->

### Discovering new relation instances

Another way to gain insight into our trained models is to use them to discover new relation instances that don't currently appear in the KB. In fact, this is the whole point of building a relation extraction system: to extend an existing KB (or build a new one) using knowledge extracted from natural language text at scale. Can the models we've trained do this effectively?

Because the goal is to discover new relation instances which are *true* but *absent from the KB*, we can't evaluate this capability automatically. But we can generate candidate KB triples and manually evaluate them for correctness.

To do this, we'll start from corpus examples containing pairs of entities which do not belong to any relation in the KB (earlier, we described these as "negative examples"). We'll then apply our trained models to each pair of entities, and sort the results by probability assigned by the model, in order to find the most likely new instances for each relation.
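The ranking step can be sketched in a few lines: score each candidate entity pair with a logistic-regression probability, then sort from most to least likely. Everything below (the weights, the entity names, and the helper functions) is hypothetical toy data for illustration; it is not the implementation used by `rel_ext.find_new_relation_instances()`.

```python
import math
from collections import Counter


def predict_proba(weights, bias, features):
    """Logistic-regression probability for one sparse feature dict."""
    score = bias + sum(
        weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


def rank_candidates(candidates, weights, bias=0.0, k=10):
    """Return the k highest-probability (probability, pair) tuples."""
    scored = [(predict_proba(weights, bias, feats), pair)
              for pair, feats in candidates.items()]
    return sorted(scored, reverse=True)[:k]


# Toy model weights and toy candidate pairs absent from the KB.
toy_weights = {'supplies': 2.0, 'said': -1.0}
toy_candidates = {
    ('A Corp', 'B Corp'): Counter({'supplies': 2}),
    ('C Corp', 'D Corp'): Counter({'said': 1}),
    ('E Corp', 'F Corp'): Counter({'supplies': 1, 'said': 1}),
}
ranked = rank_candidates(toy_candidates, toy_weights)
for prob, pair in ranked:
    print('{:10.3f} {}'.format(prob, pair))
```

Features the model has never seen contribute nothing to the score, so a candidate described only by unknown words lands at a probability determined by the bias alone.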

In [52]:
rel_ext.find_new_relation_instances(
    dataset,
    featurizers=[simple_bag_of_words_featurizer])

Highest probability examples for relation customer of:

     1.000 KBTriple(rel='customer of', sbj='Schlumberger NV', obj='Halliburton Co')
     1.000 KBTriple(rel='customer of', sbj='Halliburton Co', obj='Schlumberger NV')
     1.000 KBTriple(rel='customer of', sbj='Time Warner Inc', obj='Charter Communications Inc')
     1.000 KBTriple(rel='customer of', sbj='Charter Communications Inc', obj='Time Warner Inc')
     1.000 KBTriple(rel='customer of', sbj='Uber Technologies Inc', obj='SoftBank Group Corp')
     1.000 KBTriple(rel='customer of', sbj='SoftBank Group Corp', obj='Uber Technologies Inc')
     1.000 KBTriple(rel='customer of', sbj='General Motors Co', obj='Linamar Corp')
     1.000 KBTriple(rel='customer of', sbj='Linamar Corp', obj='General Motors Co')
     1.000 KBTriple(rel='customer of', sbj='Walmart Inc', obj='Tesco PLC')
     1.000 KBTriple(rel='customer of', sbj='Tesco PLC', obj='Walmart Inc')

Highest probability examples for relation supplier of:

     1.000 KBTriple

There may be some good discoveries here, but the list needs manual checking: every candidate shown has probability 1.000, and each entity pair appears in both directions, so the scores alone don't separate good candidates from bad ones. We may hope that as we improve our models and optimize performance in our automatic evaluations, the results we observe in this manual evaluation improve as well.

__Optional exercise:__ Note that every time we predict that a given relation holds between entities `X` and `Y`, we also predict, with equal confidence, that it holds between `Y` and `X`. Why? How could we fix this?

\[ [top](#Relation-extraction-using-distant-supervision) \]