# Homework 3: Relation extraction using distant supervision

In [1]:
__author__ = "Bill MacCartney"
__version__ = "CS224U, Stanford, Spring 2019"

## Contents

1. [Overview](#Overview)
1. [Set-up](#Set-up)
1. [Baseline](#Baseline)
1. [Homework questions](#Homework-questions)
  1. [Different model factory [1 point]](#Different-model-factory-[1-point])
  1. [Directional unigram features [2 points]](#Directional-unigram-features-[2-points])
  1. [The part-of-speech tags of the "middle" words [2 points]](#The-part-of-speech-tags-of-the-"middle"-words-[2-points])
  1. [Your original system [4 points]](#Your-original-system-[4-points])
1. [Bake-off [1 point]](#Bake-off-[1-point])

## Overview

This homework and associated bake-off are devoted to developing really effective relation extraction systems using distant supervision.

As with the previous assignments, this notebook first establishes a baseline system. The initial homework questions ask you to create additional baselines and suggest areas for innovation, and the final homework question asks you to develop an original system for you to enter into the bake-off.

## Set-up

See [the first notebook in this unit](rel_ext_01_task.ipynb#Set-up) for set-up instructions.

In [2]:
import os
import rel_ext
from sklearn.linear_model import LogisticRegression

As usual, we unite our corpus and KB into a dataset, and create some splits for experimentation:

In [3]:
rel_ext_data_home = os.path.join('data', 'rel_ext_data')

In [4]:
corpus = rel_ext.Corpus(os.path.join(rel_ext_data_home, 'corpus.tsv.gz'))

In [5]:
kb = rel_ext.KB(os.path.join(rel_ext_data_home, 'kb.tsv.gz'))

In [6]:
dataset = rel_ext.Dataset(corpus, kb)

You are not wedded to this set-up for splits. The bake-off will be conducted on a previously unseen test-set, so all of the data in `dataset` is fair game:

In [7]:
splits = dataset.build_splits(
    split_names=['tiny', 'train', 'dev'],
    split_fracs=[0.01, 0.79, 0.20],
    seed=1)

In [8]:
splits

{'tiny': Corpus with 3,474 examples; KB with 445 triples,
 'train': Corpus with 263,285 examples; KB with 36,191 triples,
 'dev': Corpus with 64,937 examples; KB with 9,248 triples,
 'all': Corpus with 331,696 examples; KB with 45,884 triples}

## Baseline

In [9]:
def simple_bag_of_words_featurizer(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.middle.split(' '):
            feature_counter[word] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in ex.middle.split(' '):
            feature_counter[word] += 1
    return feature_counter

In [10]:
featurizers = [simple_bag_of_words_featurizer]

In [11]:
model_factory = lambda: LogisticRegression(fit_intercept=True, solver='liblinear')

In [12]:
baseline_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=featurizers,
    model_factory=model_factory,
    verbose=True)


relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.806      0.379      0.658        340       5716
author                    0.801      0.528      0.726        509       5885
capital                   0.607      0.179      0.411         95       5471
contains                  0.798      0.595      0.747       3904       9280
film_performance          0.769      0.577      0.721        766       6142
founders                  0.798      0.395      0.663        380       5756
genre                     0.659      0.171      0.419        170       5546
has_sibling               0.861      0.248      0.577        499       5875
has_spouse                0.865      0.325      0.649        594       5970
is_a                      0.645      0.219      0.465        497       5873
nationality               0.560      0.156      0.369        301       5677
parents     

Studying model weights might yield insights:

In [13]:
rel_ext.examine_model_weights(baseline_results)

Highest and lowest feature weights for relation adjoins:

     2.477 Córdoba
     2.469 Taluks
     2.466 Valais
     ..... .....
    -1.380 Afghanistan
    -1.592 America
    -2.265 Earth

Highest and lowest feature weights for relation author:

     2.458 wrote
     2.287 by
     2.155 books
     ..... .....
    -1.998 superhero
    -2.661 infamous
    -6.993 dystopian

Highest and lowest feature weights for relation capital:

     3.356 capital
     1.938 especially
     1.887 km
     ..... .....
    -1.958 million
    -1.960 pop
    -1.998 Madras

Highest and lowest feature weights for relation contains:

     2.678 third-largest
     2.484 bordered
     2.186 districts
     ..... .....
    -2.654 Mile
    -2.701 second-largest
    -2.783 Philippine

Highest and lowest feature weights for relation film_performance:

     3.795 starring
     3.751 co-starring
     3.617 opposite
     ..... .....
    -1.784 commonly
    -1.899 then
    -2.450 She

Highest and lowest feature weights f

## Homework questions

Please embed your homework responses in this notebook, and do not delete any cells from the notebook. (You are free to add as many cells as you like as part of your responses.)

### Different model factory [1 point]

The code in `rel_ext` makes it very easy to experiment with other classifier models: one need only redefine the `model_factory` argument. This question asks you to assess a [Support Vector Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

__To submit:__ A call to `rel_ext.experiment` training on the `train` part of `splits` and assessing on its `dev` part, with `featurizers` as defined above in this notebook and the `model_factory` set to one based on an `SVC` with `kernel='linear'` and all other arguments left with default values.

In [14]:
from sklearn.svm import SVC

svc_model_factory = lambda: SVC(kernel='linear')

svc_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=featurizers,
    model_factory=svc_model_factory,
    verbose=True)

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.776      0.347      0.622        340       5716
author                    0.754      0.597      0.717        509       5885
capital                   0.683      0.295      0.541         95       5471
contains                  0.786      0.600      0.740       3904       9280
film_performance          0.734      0.620      0.708        766       6142
founders                  0.750      0.418      0.647        380       5756
genre                     0.620      0.259      0.485        170       5546
has_sibling               0.766      0.236      0.529        499       5875
has_spouse                0.827      0.347      0.648        594       5970
is_a                      0.600      0.260      0.475        497       5873
nationality               0.549      0.186      0.395        301       5677
parents     

In [15]:
#rel_ext.examine_model_weights(svc_results)

### Directional unigram features [2 points]

The current bag-of-words representation makes no distinction between "forward" and "reverse" examples. But, intuitively, there is a big difference between _X and his son Y_ and _Y and his son X_. This question asks you to modify `simple_bag_of_words_featurizer` to capture these differences.

__To submit:__

1. A feature function `directional_bag_of_words_featurizer` that is just like `simple_bag_of_words_featurizer` except that it distinguishes "forward" and "reverse". To do this, you just need to mark each word feature for whether it is derived from a subject–object example or from an object–subject example. The precise nature of the mark you add for the two cases doesn't make a difference to the model.

2. The macro-average F-score on the `dev` set that you obtain from running `rel_ext.experiment` with `directional_bag_of_words_featurizer` as the only featurizer. (Aside from this, use all the default values for `experiment` as exemplified above in this notebook.)

3. `rel_ext.experiment` returns some of the core objects used in the experiment. How many feature names does the `vectorizer` have for the experiment run in the previous step? (Note: we're partly asking you to figure out how to get this value by using the sklearn documentation, so please don't ask how to do it on Piazza!)
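
To illustrate why the direction mark matters, here is a minimal sketch that works on plain strings rather than on the `rel_ext` corpus objects. The helper names `unmarked_features` and `directional_features` are hypothetical and exist only for this illustration. A single sentence, _X and his son Y_, is a forward example for the pair (X, Y) and a reverse example for the pair (Y, X); without marks the two pairs get identical features.

```python
from collections import Counter


def unmarked_features(forward_middles, reverse_middles):
    """Count middle words without recording which direction they came from."""
    counts = Counter()
    for middle in forward_middles + reverse_middles:
        counts.update(middle.split(' '))
    return counts


def directional_features(forward_middles, reverse_middles):
    """Count middle words, marking each with the direction of its example."""
    counts = Counter()
    for middle in forward_middles:
        counts.update(word + '_fwd' for word in middle.split(' '))
    for middle in reverse_middles:
        counts.update(word + '_rev' for word in middle.split(' '))
    return counts


# One sentence, "X and his son Y", seen from two candidate entity pairs.
middle = 'and his son'
pair_xy = ([middle], [])  # (sbj=X, obj=Y): the sentence is a forward example
pair_yx = ([], [middle])  # (sbj=Y, obj=X): the same sentence is a reverse example

same_unmarked = unmarked_features(*pair_xy) == unmarked_features(*pair_yx)
same_directional = directional_features(*pair_xy) == directional_features(*pair_yx)
print(same_unmarked, same_directional)  # True False
```

With unmarked features the classifier cannot tell the two pairs apart, which is a problem for asymmetric relations; the marked version gives it that signal.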

In [12]:
from sklearn.feature_extraction import DictVectorizer

def directional_bag_of_words_featurizer(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.middle.split(' '):
            feature_counter[word + '_fwd'] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in ex.middle.split(' '):
            feature_counter[word + '_rev'] += 1
    return feature_counter

directional_featurizer = [directional_bag_of_words_featurizer]

In [17]:
directional_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=directional_featurizer,
    model_factory=model_factory,
    verbose=True)

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.858      0.426      0.714        340       5716
author                    0.877      0.572      0.792        509       5885
capital                   0.733      0.232      0.512         95       5471
contains                  0.832      0.637      0.784       3904       9280
film_performance          0.842      0.654      0.796        766       6142
founders                  0.845      0.416      0.700        380       5756
genre                     0.652      0.265      0.504        170       5546
has_sibling               0.882      0.255      0.591        499       5875
has_spouse                0.867      0.352      0.671        594       5970
is_a                      0.741      0.254      0.535        497       5873
nationality               0.663      0.203      0.456        301       5677
parents     

In [18]:
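# `feature_names_` is the fitted DictVectorizer's list of feature names.
# (`get_feature_names()` gives the same list but was removed in scikit-learn 1.2.)
len(directional_results['vectorizer'].feature_names_)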

40574

### The part-of-speech tags of the "middle" words [2 points]

Our corpus distribution contains part-of-speech (POS) tagged versions of the core text spans. Let's begin to explore whether there is information in these sequences, focusing on `middle_POS`.

__To submit:__

1. A feature function `middle_bigram_pos_tag_featurizer` that is just like `simple_bag_of_words_featurizer` except that it creates a feature for bigram POS sequences. For example, given 

  `The/DT dog/N napped/V`
  
   we obtain the list of bigram POS sequences
  
   `['<s> DT', 'DT N', 'N V', 'V </s>']`. 
   
   Don't forget the start and end tags, to model those environments properly!

2. The macro-average F-score on the `dev` set that you obtain from running `rel_ext.experiment` with `middle_bigram_pos_tag_featurizer` as the only featurizer. (Aside from this, use all the default values for `experiment` as exemplified above in this notebook.)

Note: To parse `middle_POS`, one splits on whitespace to get the `word/TAG` pairs. Each of these pairs `s` can be parsed with `s.rsplit('/', 1)`.
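
The parsing recipe in the note can be sketched on a plain string, independent of the corpus objects. The helper name `pos_bigrams` is hypothetical; it only illustrates the split, `rsplit`, and padding steps described above.

```python
def pos_bigrams(tagged_span):
    """Return the POS-tag bigrams of a whitespace-separated `word/TAG` string."""
    # rsplit with maxsplit=1 keeps any '/' inside the word itself (e.g. '1/2/CD').
    tags = [pair.rsplit('/', 1)[1] for pair in tagged_span.split()]
    # Pad with start and end symbols so the edge environments are modeled too.
    tags = ['<s>'] + tags + ['</s>']
    return [left + ' ' + right for left, right in zip(tags, tags[1:])]


bigrams = pos_bigrams('The/DT dog/N napped/V')
print(bigrams)  # ['<s> DT', 'DT N', 'N V', 'V </s>']

# An empty middle still yields one bigram linking the two boundary symbols.
empty = pos_bigrams('')
print(empty)  # ['<s> </s>']
```

Calling `split()` with no argument means an empty span yields no tags, so the empty-middle case needs no special handling.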

In [13]:
def middle_bigram_pos_tag_featurizer(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        middle_list = []
        if len(ex.middle_POS) != 0:
            middle_list = [pair.rsplit('/', 1)[1] for pair in ex.middle_POS.split(' ')]
        tag_list = ['<s>'] + middle_list + ['</s>']
        for i in range(len(tag_list) -1):
            feature_counter[tag_list[i] + ' ' + tag_list[i+1]] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        middle_list = []
        if len(ex.middle_POS) != 0:
            middle_list = [pair.rsplit('/', 1)[1] for pair in ex.middle_POS.split(' ')]
        tag_list = ['<s>'] + middle_list + ['</s>']
        for i in range(len(tag_list) -1):
            feature_counter[tag_list[i] + ' ' + tag_list[i+1]] += 1
    return feature_counter

middle_pos_featurizer = [middle_bigram_pos_tag_featurizer]

In [14]:
# # for test
# corpus.get_examples_for_entities(kb.kb_triples[0].sbj, kb.kb_triples[0].obj)[0].middle

# from collections import Counter
# middle_bigram_pos_tag_featurizer(kb.kb_triples[0], corpus, Counter())

In [21]:
middle_pos_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=middle_pos_featurizer,
    model_factory=model_factory,
    verbose=True)


relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.859      0.341      0.659        340       5716
author                    0.741      0.326      0.591        509       5885
capital                   0.483      0.147      0.332         95       5471
contains                  0.758      0.595      0.719       3904       9280
film_performance          0.728      0.446      0.646        766       6142
founders                  0.611      0.174      0.406        380       5756
genre                     0.569      0.171      0.388        170       5546
has_sibling               0.692      0.162      0.419        499       5875
has_spouse                0.800      0.269      0.574        594       5970
is_a                      0.558      0.175      0.388        497       5873
nationality               0.370      0.066      0.193        301       5677
parents     

### Your original system [4 points]

There are many options, and this could easily grow into a project. Here are a few ideas:

- Try out different classifier models, from `sklearn` and elsewhere.
- Add a feature that indicates the length of the middle.
- Augment the bag-of-words representation to include bigrams or trigrams (not just unigrams).
- Introduce features based on the entity mentions themselves. <!-- \[SPOILER: it helps a lot, maybe 4% in F-score. And combines nicely with the directional features.\] -->
- Experiment with features based on the context outside (rather than between) the two entity mentions — that is, the words before the first mention, or after the second.
- Try adding features which capture syntactic information, such as the dependency-path features used by Mintz et al. 2009. The [NLTK](https://www.nltk.org/) toolkit contains a variety of [parsing algorithms](http://www.nltk.org/api/nltk.parse.html) that may help.
- The bag-of-words representation does not permit generalization across word categories such as names of people, places, or companies. Can we do better using word embeddings such as [GloVe](https://nlp.stanford.edu/projects/glove/)?
- Consider adding features based on WordNet synsets. Here's a little code to get you started with that:
  ```
  from nltk.corpus import wordnet as wn
  dog_compatible_synsets = wn.synsets('dog', pos='n')
  ```
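
Two of the ideas above, word n-grams and a middle-length feature, can be sketched on plain strings as follows. This is only an illustration under stated assumptions: the helper names and the bucket boundaries (2 and 6 tokens) are hypothetical choices, not values taken from the course materials.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens`, joined with spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def length_bucket(num_tokens):
    """Map a token count to a coarse bucket (boundaries are arbitrary here)."""
    if num_tokens <= 2:
        return 'middle_len=short'
    if num_tokens <= 6:
        return 'middle_len=medium'
    return 'middle_len=long'


def middle_features(middle):
    """Unigram, bigram, and trigram counts plus one length-bucket feature."""
    tokens = middle.split()
    counts = Counter()
    for n in (1, 2, 3):
        counts.update(word_ngrams(tokens, n))
    counts[length_bucket(len(tokens))] += 1
    return counts


features = middle_features('was born in')
print(sorted(features))
```

Bucketing the length keeps the number of distinct length features small, whereas a feature per exact length spreads the same signal over many sparse features.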

In [15]:
from nltk.parse import RecursiveDescentParser
from nltk import word_tokenize
from nltk import pos_tag

# text = nltk.word_tokenize("And now for something completely different")
# for i in nltk.pos_tag(text):
#     print([i[1]])

def semantic_parser(kbt, corpus, feature_counter):
    # Despite the name, this does no parsing: it adds POS-tag bigram features
    # taken from the edges of the mention span and from the outside context.
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        text = word_tokenize(ex.mention_1 + ' ' + ex.middle + ' ' + ex.mention_2)
        # Tag each token list once instead of re-tagging for every lookup.
        tags = [tag for _, tag in pos_tag(text)]
        feature_counter[tags[0] + ' ' + tags[1] + '_fwd1'] += 1
        feature_counter[tags[-2] + ' ' + tags[-1] + '_fwd2'] += 1
        prev_text = word_tokenize(ex.left + ' ' + ex.mention_1)
        if len(prev_text) >= 2:
            prev_tags = [tag for _, tag in pos_tag(prev_text)]
            feature_counter[prev_tags[-2] + ' ' + prev_tags[-1] + '_fwd3'] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        text = word_tokenize(ex.mention_1 + ' ' + ex.middle + ' ' + ex.mention_2)
        tags = [tag for _, tag in pos_tag(text)]
        feature_counter[tags[0] + ' ' + tags[1] + '_rev1'] += 1
        feature_counter[tags[-2] + ' ' + tags[-1] + '_rev2'] += 1
        post_text = word_tokenize(ex.mention_2 + ' ' + ex.right)
        if len(post_text) >= 2:
            post_tags = [tag for _, tag in pos_tag(post_text)]
            feature_counter[post_tags[0] + ' ' + post_tags[1] + '_rev3'] += 1
    return feature_counter

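def length_featurizer(kbt, corpus, feature_counter):
    # Note: `len(ex.middle_POS)` is the character length of the tagged string,
    # not the number of tokens in the middle. Each distinct length becomes its
    # own feature, and forward and reverse examples share the same features.
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        feature_counter[str(len(ex.middle_POS))] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        feature_counter[str(len(ex.middle_POS))] += 1
    return feature_counter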

def bigram_trigram_words_featurizer(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        word_list = ex.middle.split(' ')
        for i in range(len(word_list) - 1):
            feature_counter[word_list[i] + ' ' + word_list[i+1] + '_fwd'] += 1
            if i < len(word_list) - 2: 
                feature_counter[word_list[i] + ' ' + word_list[i+1] + ' ' + word_list[i+2] + '_fwd'] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        word_list = ex.middle.split(' ')
        for i in range(len(word_list) - 1):
            feature_counter[word_list[i] + ' ' + word_list[i+1] + '_rev'] += 1
            if i < len(word_list) - 2: 
                feature_counter[word_list[i] + ' ' + word_list[i+1] + ' ' + word_list[i+2] + '_rev'] += 1
    return feature_counter

original_featurizer = [simple_bag_of_words_featurizer, directional_bag_of_words_featurizer, 
                         bigram_trigram_words_featurizer, length_featurizer, semantic_parser]

In [16]:
# import sklearn.linear_model as LinearModel
# sgd_model_factory =  lambda: LinearModel.SGDClassifier(penalty = 'l2', max_iter = 100)

lr_model_factory = lambda: LogisticRegression(fit_intercept=True, solver='liblinear', max_iter=500)

In [25]:
original_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=original_featurizer,
    model_factory=lr_model_factory,
    verbose=True)
#macro-average  f-score: 0.690 

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.847      0.374      0.676        340       5716
author                    0.870      0.672      0.822        509       5885
capital                   0.756      0.326      0.598         95       5471
contains                  0.843      0.709      0.812       3904       9280
film_performance          0.878      0.725      0.842        766       6142
founders                  0.824      0.479      0.720        380       5756
genre                     0.798      0.559      0.735        170       5546
has_sibling               0.856      0.309      0.632        499       5875
has_spouse                0.890      0.380      0.702        594       5970
is_a                      0.773      0.481      0.690        497       5873
nationality               0.677      0.349      0.570        301       5677
parents     

In [26]:
rel_ext.examine_model_weights(original_results)

Highest and lowest feature weights for relation adjoins:

     1.516 JJR NNP_fwd1
     1.413 combined
     1.399 Valais
     ..... .....
    -0.918 NN ,_rev1
    -0.928 63
    -0.991 DT NNP_fwd1

Highest and lowest feature weights for relation author:

     2.604 by_fwd
     1.908 's_rev
     1.651 ) ._rev
     ..... .....
    -1.795 NN NN_fwd1
    -2.033 by_rev
    -2.052 NNP NNS_rev2

Highest and lowest feature weights for relation capital:

     2.407 capital
     1.738 NNP )_rev3
     1.433 JJ NNP_rev3
     ..... .....
    -1.246 NNP NNP_fwd2
    -1.412 NNP VBD_rev3
    -1.436 NNP WRB_rev3

Highest and lowest feature weights for relation contains:

     1.954 , with_fwd
     1.764 DT NNPS_fwd3
     1.648 ( NNP_rev2
     ..... .....
    -1.882 DT NN_rev2
    -1.923 &
    -2.050 NN NN_fwd1

Highest and lowest feature weights for relation film_performance:

     2.669 starring_rev
     2.376 opposite_rev
     1.942 co-starring_rev
     ..... .....
    -1.605 NN NN_rev1
    -1.617 NNP 

## Bake-off [1 point]

For the bake-off, we will release a test set right after class on April 29. The announcement will go out on Piazza. You will evaluate your custom model from the previous question on these new datasets using the function `rel_ext.bake_off_experiment`. Rules:

1. Only one evaluation is permitted.
1. No additional system tuning is permitted once the bake-off has started.

To enter the bake-off, upload this notebook on Canvas:

https://canvas.stanford.edu/courses/99711/assignments/187248

The cells below this one constitute your bake-off entry.

People who enter will receive the additional homework point, and people whose systems achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.

The bake-off will close at 4:30 pm on May 1. Late entries will be accepted, but they cannot earn the extra 0.5 points. Similarly, you cannot win the bake-off unless your homework is submitted on time.

In [18]:
# Enter your bake-off assessment code in this cell. 
# Please do not remove this comment.
bakeoff_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=original_featurizer,
    model_factory=lr_model_factory,
    verbose=False) # We don't care about this evaluation, so skip its summary.

rel_ext_data_home_test = os.path.join(rel_ext_data_home, 'bakeoff3-rel_ext-test-data')
rel_ext.bake_off_experiment(bakeoff_results, rel_ext_data_home_test)

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.916      0.470      0.770        438       7122
author                    0.847      0.679      0.807        645       7329
capital                   0.690      0.348      0.576        115       6799
contains                  0.826      0.808      0.823       3808      10492
film_performance          0.846      0.721      0.817       1011       7695
founders                  0.763      0.471      0.679        444       7128
genre                     0.802      0.495      0.713        188       6872
has_sibling               0.898      0.283      0.626        717       7401
has_spouse                0.891      0.377      0.700        780       7464
is_a                      0.742      0.480      0.669        611       7295
nationality               0.657      0.360      0.564        383       7067
parents     

In [19]:
# On an otherwise blank line in this cell, please enter
# your macro-average f-score (an F_0.5 score) as reported 
# by the code above. Please enter only a number between 
# 0 and 1 inclusive. Please do not remove this comment.
0.689

0.689
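
The `f-score` column in the result tables is described above as an F_0.5 score, which weights precision more heavily than recall. As a minimal sketch, the per-relation values can be recomputed from the precision and recall columns. The helper names `f_beta` and `macro_average` are illustrative and not part of `rel_ext`, and the macro-average is assumed here to be the unweighted mean over relations. The three rows are copied from the baseline table; their mean is only an illustration, not the reported macro-average.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 favors precision, beta > 1 favors recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def macro_average(scores):
    """Unweighted mean: every relation counts equally, whatever its support."""
    return sum(scores) / len(scores)


# (precision, recall) for three relations, taken from the baseline results table.
rows = {
    'adjoins': (0.806, 0.379),
    'author': (0.801, 0.528),
    'capital': (0.607, 0.179),
}

per_relation = {rel: round(f_beta(p, r), 3) for rel, (p, r) in rows.items()}
print(per_relation)  # {'adjoins': 0.658, 'author': 0.726, 'capital': 0.411}
print(round(macro_average(list(per_relation.values())), 3))
```

The recomputed values match the `f-score` column for these rows, which is consistent with the column being F_0.5 rather than the more common F_1.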