<pre style="float: right">version 1.0.1</pre>
# FNLP 2019: Lab Session 5: Word Sense Disambiguation

## Task Description

In this tutorial, we will be exploring the word sense disambiguation task. This is a task where you use a corpus to learn how to disambiguate a small set of target words using supervised learning. The aim of this task is to build a classifier that maps each occurrence of a target word in a corpus to its sense.

We will use a Naive Bayes classifier. In other words, where the context of an occurrence of a target word in the corpus is represented as a feature vector $(\vec{f})$, the classifier estimates the word sense $s\in S$ based on its context as shown below. 

$$
\begin{align}
    \hat{s} &= \arg\max_{s\in S}P(s|\vec{f}) & \text{(optimization problem)}\\
            &= \arg\max_{s\in S}\frac{P(\vec{f}|s)P(s)}{P(\vec{f})} & \text{(Bayes rule)}\\
            &= \arg\max_{s\in S}P(\vec{f}|s)P(s)  & \text{(denominator is constant)}\\
            &\approx \arg\max_{s\in S}P(s)\prod_{i=1}^{n}P(f_i|s) & \text{(conditional independence of features)}\\
\end{align}
$$
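
As a rough illustration of the decision rule above, the following sketch scores each sense in log space, which avoids numerical underflow when many small probabilities are multiplied. It is not NLTK's `NaiveBayesClassifier`: the function names `train_nb` and `classify_nb`, the add-one smoothing and the toy training data are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """Collect counts from (features, sense) pairs; features is a set of strings."""
    sense_counts = Counter(sense for _, sense in examples)
    feature_counts = defaultdict(Counter)
    vocab = set()
    for features, sense in examples:
        feature_counts[sense].update(features)
        vocab.update(features)
    return sense_counts, feature_counts, vocab

def classify_nb(features, model):
    """Return argmax_s log P(s) + sum_i log P(f_i | s), with add-one smoothing."""
    sense_counts, feature_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, count in sense_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(feature_counts[sense].values()) + len(vocab)
        for f in features:
            score += math.log((feature_counts[sense][f] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Toy data, invented for illustration only
toy = [
    ({"work", "long"}, "HARD1"),
    ({"work", "question"}, "HARD1"),
    ({"look", "voice"}, "HARD2"),
    ({"surface", "rock"}, "HARD3"),
]
model = train_nb(toy)
print(classify_nb({"rock"}, model))
```

Working with sums of logs rather than products of probabilities does not change which sense wins, because the logarithm is monotonic.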

## The corpus

We will use the [senseval-2](http://www.hipposmond.com/senseval2) corpus for our training and test data. This corpus consists of text from a mixture of sources, including the British National Corpus and the Penn Treebank portion of the Wall Street Journal. Each word in the corpus is tagged with its part of speech, and the senses of the following target words are also manually annotated: the nouns *interest* and *line*, the verb *serve*, and the adjective *hard*. You can find out more about the task [here](http://www.hipposmond.com/senseval2/descriptions/english-lexsample.htm).

The sets of senses that are used to annotate each target word come from WordNet (more on that later).

## Support Code

To do the analysis we use a set of helper functions from ``WSD.py``, which came together with this notebook. Open this file in Jupyter's built-in editor or your favourite IDE and try to understand how it works (don't worry if you don't understand some of it; it's not necessary for doing this task). We will import these functions and use them for the rest of the lab.
Remember, `help(...)` is your friend:
- `help([class name])` for classes and all their methods and instance variables
- `help([any object])` likewise
- `help([function])` or `help([class].[method])` for functions / methods

This code allows you to do several things. You can now run, train and evaluate a range of Naive Bayes classifiers over the corpus to acquire a model of WSD for a given target word: the adjective *hard*, the nouns *interest* or *line*, and the verb *serve*. We'll learn later how you do this. First, we're going to explore the nature of the corpus itself. 

In [None]:
import nltk
from nltk.classify import NaiveBayesClassifier
from WSD import *
from pprint import pprint # Pretty-printing utilities

## Data Exploration

### Target words

You can find out the set of target words for the senseval-2 corpus by running:

In [None]:
senseval.fileids()


The result doesn't tell you the syntactic category of the words, but see the description of the corpus above.

### Word senses

Let's now find out the set of word senses for each target word in senseval. There is a function in `WSD.py` that returns this information. For example:


In [None]:
print(instance2senses('hard.pos'))

As you can see this gives you `['HARD1', 'HARD2', 'HARD3']`

So there are 3 senses for the adjective hard in the corpus. You'll shortly be looking at the data to guess what these 3 senses are.

#### Now it's your turn:

* What are the senses for the other target words? Find out by calling `instance2senses` with appropriate arguments.
* How many senses does each target have?
* Let's now guess the sense definitions for HARD1, HARD2 and HARD3 by looking at the 100 most frequent open class words that occur in the context of each sense. 


You can find out what these 100 words are for HARD1 by running the following:

In [None]:
instancesHARD1 = sense2instances(senseval.instances('hard.pos'), 'HARD1')
featuresHARD1 = extract_vocab_frequency(instancesHARD1, n=100)
pprint(featuresHARD1)

#### Now it's your turn:

* Call the above functions for HARD2 and HARD3.
* Look at the resulting lists of 100 most frequent words for each sense, and try to define what HARD1, HARD2 and HARD3 mean.
* These senses are actually the first three senses for the adjective _hard_ in [WordNet](http://wordnet.princeton.edu/). You can enter a word and get its list of WordNet senses from [here](http://wordnetweb.princeton.edu/perl/webwn). Do this for hard, and check whether your estimated definitions for the 3 word senses are correct. 

In [None]:
# Analysis of HARD2
instancesHARD2 = sense2instances(senseval.instances('hard.pos'), 'HARD2')
featuresHARD2 = extract_vocab_frequency(instancesHARD2, n=20)
pprint(featuresHARD2)
# Analysis of HARD3
instancesHARD3 = sense2instances(senseval.instances('hard.pos'), 'HARD3')
featuresHARD3 = extract_vocab_frequency(instancesHARD3, n=20)
pprint(featuresHARD3)

## Data structure of Senseval instances
Having extracted all instances of a given sense, you can look at what the data structures in the corpus look like: 

In [None]:
print("For HARD2:\n Sample instance: {}\n All instances:".format(instancesHARD2[0]))
pprint(instancesHARD2)
print("For HARD3:\n Sample instance: {}\n All instances:".format(instancesHARD3[0]))
pprint(instancesHARD3)

So the senseval corpus is a collection of information about a set of tagged sentences, where each entry or instance consists of four attributes:

* `word` specifies the target word together with its syntactic category (e.g., `hard-a` means that the word is hard and its category is 'adjective');
* `position` gives its position within the sentence (ignoring punctuation);
* `context` represents the sentence as a list of pairs, each pair being a word or punctuation mark and its tag; and finally
* `senses` is a tuple, each item in the tuple being a sense for that target word. In the subset of the corpus we are working with, this tuple consists of only one item. But there are a few examples elsewhere in the corpus where there is more than one, representing the fact that the annotator couldn't decide which sense to assign to the word. For simplicity, our classifiers are going to ignore any items after the first in `senses`.
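
To make the four attributes concrete, here is a small sketch that mimics an instance with a `namedtuple`. The `Instance` type and the example sentence are hypothetical stand-ins written for this illustration; they are not taken from the corpus or from `WSD.py`.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the four attributes described above
Instance = namedtuple("Instance", ["word", "position", "context", "senses"])

inst = Instance(
    word="hard-a",
    position=3,
    context=[("It", "PRP"), ("was", "VBD"), ("a", "DT"),
             ("hard", "JJ"), ("day", "NN"), (".", ".")],
    senses=("HARD1",),
)

# The target token is found by indexing the context with the position
target_token, target_tag = inst.context[inst.position]
# Only the first sense is used, as in the classifiers in this lab
gold_sense = inst.senses[0]

print(target_token, target_tag, gold_sense)
```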

## Exploring different WSD classifiers
You're now going to compare the performance of different classifiers that perform word sense disambiguation. You do this by calling the function `WSDClassifier`. This function must have at least the following arguments specified by you:

1. A trainer; e.g., `NaiveBayesClassifier.train` (if you want you could also try `MaxentClassifier.train`, but this takes longer to train).
2. The target word that the classifier is going to learn to disambiguate: i.e., 'hard.pos', 'line.pos', 'interest.pos' or 'serve.pos'.
3. A feature set. The code allows you to use two kinds of feature sets:
 
**word_features**

This feature set is based on the set $S$ of the $n$ most frequent words that occur in the same sentence as the target word $w$ across the entire training corpus (as you'll see later, you can specify the value of $n$, but if you don't specify it then it defaults to 300). For each occurrence of $w$, `word_features` represents its context as the subset of those words from $S$ that occur in $w$'s sentence. By default, the closed-class words that are specified in `STOPWORDS` are excluded from the set $S$ of most frequent words. But as we'll see later, you can also include closed-class words in $S$, or re-define closed-class words in any way you like! If you want to know which closed-class words are excluded by default, just look at the definition of `STOPWORDS` in `WSD.py`.
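
The idea behind this feature set can be sketched in a few lines. This is an illustration only, not the code in `WSD.py`: the helper names `top_n_vocab` and `word_feature_dict`, the lower-casing and the toy sentences are assumptions.

```python
from collections import Counter

def top_n_vocab(sentences, n, stopwords=()):
    """Return the n most frequent non-stopword tokens across all sentences."""
    counts = Counter(
        w.lower() for sent in sentences for w in sent
        if w.lower() not in stopwords
    )
    return {w for w, _ in counts.most_common(n)}

def word_feature_dict(sentence, vocab):
    """Represent one sentence by the vocabulary words it contains."""
    return {w.lower(): True for w in sentence if w.lower() in vocab}

# Toy sentences, invented for illustration only
sentences = [
    ["the", "work", "was", "hard"],
    ["hard", "work", "pays"],
    ["a", "hard", "rock"],
]
vocab = top_n_vocab(sentences, n=2, stopwords={"the", "a", "was"})
features = word_feature_dict(["hard", "rock", "work"], vocab)
print(sorted(vocab), features)
```

Note that a word outside the vocabulary (here *rock*) contributes nothing to the feature dictionary, which is why the choice of $n$ and of the stopword list matters.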

**context_features**

This feature set represents the context of a word $w$ as the sequence of $m$ pairs `(word, tag)` that occur before $w$ and the sequence of $m$ pairs `(word, tag)` that occur after $w$. As we'll see shortly, you can specify the value of $m$ (e.g., `m=1` means the context consists of just the immediately prior and immediately subsequent word-tag pairs); otherwise, $m$ defaults to 3.
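
A minimal sketch of such a window-based feature extractor follows. The function name `window_features` and the feature keys (`left-1`, `right-1`, ...) are hypothetical; the actual implementation in `WSD.py` may name and store its features differently.

```python
def window_features(context, position, m=3):
    """Collect up to m (word, tag) pairs on each side of the target position."""
    features = {}
    for offset in range(1, m + 1):
        left = position - offset
        right = position + offset
        if left >= 0:  # stay inside the sentence on the left
            features["left-{}".format(offset)] = context[left]
        if right < len(context):  # and on the right
            features["right-{}".format(offset)] = context[right]
    return features

# Toy tagged sentence, invented for illustration only
context = [("It", "PRP"), ("was", "VBD"), ("a", "DT"),
           ("hard", "JJ"), ("day", "NN"), (".", ".")]
narrow = window_features(context, position=3, m=1)
wide = window_features(context, position=3, m=3)
print(narrow)
print(len(wide))
```

With `m=3` the right-hand window runs off the end of this short sentence, so only five features are produced rather than six.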
    
    
### First WSD classifier
Try the following:

In [None]:
WSDClasifier(NaiveBayesClassifier.train, 'hard.pos', word_features) 

In other words, the adjective hard is tagged with 3 senses in the corpus (HARD1, HARD2 and HARD3), and the Naive Bayes Classifier using the feature set based on the 300 most frequent context words yields an accuracy of 0.8362. 

### Now it's your turn:

Use `WSDClassifier` to train a classifier that disambiguates hard using `context_features`. Build classifiers for *line* and *serve* as well, using the word features and then the context features.

* What's more accurate for disambiguating 'hard.pos', `context_features` or `word_features`?
* Does the same hold true for 'line.pos' and 'serve.pos'? Why do you think that might be?
* Why is it not fair to compare the accuracy of the classifiers across different target words? 

    
### Baseline models
Just how good is the accuracy of these WSD classifiers? To find out, we need a baseline. There are two we consider here:

1. A model which assigns a sense at random.
2. A model which always assigns the most frequent sense. 

### Now it's your turn:

* What is the accuracy of the random baseline model for 'hard.pos'?
* To compute the accuracy of the frequency baseline model for 'hard.pos', we need to find out the Frequency Distribution of the three senses in the corpus: 

In [None]:
hard_sense_fd = nltk.FreqDist([i.senses[0] for i in senseval.instances('hard.pos')])
print(hard_sense_fd.most_common())

frequency_hard_sense_baseline = hard_sense_fd.freq('HARD1')
print('Baseline accuracy: {}'.format(frequency_hard_sense_baseline))

In other words, the frequency baseline has an accuracy of approximately 0.797. What is the most frequent sense for 'hard.pos'? And is the frequency baseline a better model than the random model?
* Now compute the accuracy of the frequency baseline for other target words; e.g. 'line.pos'. 
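
The two baselines can be compared side by side on any list of gold labels. The sketch below uses invented toy labels rather than the real corpus counts, and assumes the random baseline picks uniformly among the senses.

```python
from collections import Counter

# Toy gold labels, invented for this example (not the real corpus counts)
gold = ["S1"] * 6 + ["S2"] * 3 + ["S3"] * 1

sense_counts = Counter(gold)
most_frequent_sense, count = sense_counts.most_common(1)[0]

# Always guessing the most frequent sense is right whenever it is the gold label
frequency_baseline = count / len(gold)
# Guessing uniformly at random is right with probability 1 / (number of senses)
random_baseline = 1 / len(sense_counts)

print(most_frequent_sense, frequency_baseline, random_baseline)
```

The more skewed the sense distribution, the larger the gap between the two baselines.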

## Rich features vs. sparse data
In this part of the tutorial, we are going to vary the feature sets and compare the results. As well as being able to choose between `context_features` vs. `word_features`, you can also vary the following:

### context_features

You can vary the number of word-tag pairs before and after the target word that you include in the feature vector. You do this by specifying the argument `distance` to the function `WSDClassifier`. For instance, the following creates a classifier that uses 2 words to the left and right of the target word: 

```python
WSDClassifier(trainer=NaiveBayesClassifier.train, 
              word='hard.pos', 
              features=context_features, 
              distance=2)
```

What about distance 1?
### word_features
You can vary the closed-class words that are excluded from the set of most frequent words, and you can vary the size of the set of most frequent words. For instance, the following results in a model which uses the 100 most frequent words including closed-class words:

```python
WSDClassifier(trainer=NaiveBayesClassifier.train, 
              word='hard.pos', 
              features=word_features, 
              stopwords=[], 
              number=100)
```  
### Now it's your turn:
Build several WSD models for 'hard.pos', including at least the following: for the `word_features` version, vary `number` between 100, 200 and 300, and vary the `stopwords` between `[]` (i.e., the empty list) and `STOPWORDS`; for the `context_features` version, vary the `distance` between 1, 2 and 3, and vary the `stopwords` between `[]` and `STOPWORDS`.

In [None]:
for n in [100, 200, 300, 400]:
    for stopwords in [[], STOPWORDS]:
        stop = 'stopwords' if stopwords else 'no stopwords'
        print('Word features with number: {} and {}'.format(n, stop))
        WSDClasifier(trainer=NaiveBayesClassifier.train, 
                     word='hard.pos',
                     features=word_features,
                     stopwords=stopwords,
                     number=n) 

for n in [1, 2, 3]:
    for stopwords in [[], STOPWORDS]:
        stop = 'stopwords' if stopwords else 'no stopwords'
        print('Context features with distance: {} and {}'.format(n, stop))
        WSDClasifier(trainer=NaiveBayesClassifier.train,
                     word='hard.pos',
                     features=context_features,
                     stopwords=stopwords,
                     distance=n) 

Why does changing `number` have an inconsistent impact on the word model?
  * This suggests that the data is too sparse for changes in vocabulary size to have a consistent impact.

Why does reducing the context window before and after the target word to fewer than 3 word-tag pairs improve the model?
  * Sparse data, again.

Why does including closed-class words in the word model improve overall performance?
  * One can see from the distinct lists of closed-class words that are constructed for each sense of "hard" that the distributions of closed-class words with respect to word sense are quite distinct and therefore informative. Furthermore, by including closed-class words within the context window one *excludes* open-class words that may be, say, 5 or 6 words away from the target word and are hence less informative clues for the target word sense.

To see if the data really is too sparse for consistent results, try a different seed for the random number generator by editing line 211 in the definition of `WSDClassifier` to use the seed value from the comment instead of the one it has been using. Then try again and see how, if at all, the trend as `number` increases is different.
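
To see why the seed matters, consider a generic shuffled train/test split. The sketch below is an assumption about the general technique, not a description of what `WSDClassifier` actually does; `split_data` and the 80/20 ratio are hypothetical.

```python
import random

def split_data(items, seed, train_fraction=0.8):
    """Shuffle a copy of the items with a fixed seed, then cut into train and test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local generator: no global state touched
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

items = list(range(10))
train_a, test_a = split_data(items, seed=1)
train_b, test_b = split_data(items, seed=1)
train_c, test_c = split_data(items, seed=2)

print(test_a, test_b, test_c)
```

The same seed always reproduces the same split, while a different seed generally puts different instances in the test set. With sparse data, that alone can shift the measured accuracy.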

In [None]:
for n in [100, 200, 300, 400]:
    for stopwords in [[], STOPWORDS]:
        stop = 'stopwords' if stopwords else 'no stopwords'
        print('Word features with number: {} and {}'.format(n, stop))
        WSDClasifier(trainer=NaiveBayesClassifier.train,
                     word='hard.pos',
                     features=word_features,
                     stopwords=stopwords,
                     number=n) 

It seems slightly odd that the word features for 'hard.pos' include _harder_ and _hardest_. Try using a stopwords list which adds them to STOPWORDS: is the effect what you expected? Can you explain it?

In [None]:
WSDClasifier(trainer=NaiveBayesClassifier.train,
             word='hard.pos',
             features=word_features,
             number=300,
             stopwords=STOPWORDS)

WSDClasifier(trainer=NaiveBayesClassifier.train,
             word='hard.pos',
             features=word_features,
             number=300,
             stopwords=STOPWORDS.union(['harder', 'hardest']))

The accuracy goes down. This is to be expected if a particular word sense is more likely to co-occur with _harder_ and _hardest_: removing the two words removes relevant information, and their places among the 300 most frequent words are taken by much less frequent words.

## Error analysis
The function `WSDClassifier` allows you to explore the errors of the model it creates:

### Confusion Matrix

You can output a confusion matrix as follows: 

In [None]:
WSDClasifier(trainer=NaiveBayesClassifier.train,
             word='hard.pos',
             features=context_features,
             distance=3,
             confusion_matrix=True)


Note that the rows in the matrix are the gold labels, and the columns are the estimated labels. Recall that the cells on the diagonal give the number of items that the model gets right for each sense.
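
The layout described above can be reproduced with a short sketch. This is not how `WSD.py` builds its matrix; the helper names `confusion_counts` and `format_matrix` and the toy labels are assumptions made for illustration.

```python
from collections import Counter

def confusion_counts(gold, predicted):
    """Count (gold label, predicted label) pairs."""
    return Counter(zip(gold, predicted))

def format_matrix(counts, labels):
    """Lay the counts out with gold labels as rows and predicted labels as columns."""
    rows = ["{:>8}".format("") + "".join("{:>8}".format(p) for p in labels)]
    for g in labels:
        cells = "".join("{:>8}".format(counts[(g, p)]) for p in labels)
        rows.append("{:>8}".format(g) + cells)
    return "\n".join(rows)

# Toy labels, invented for illustration only
gold = ["HARD1", "HARD1", "HARD1", "HARD2", "HARD3"]
predicted = ["HARD1", "HARD1", "HARD2", "HARD2", "HARD1"]

labels = sorted(set(gold) | set(predicted))
counts = confusion_counts(gold, predicted)
print(format_matrix(counts, labels))

# Correct predictions are exactly the diagonal cells
correct = sum(counts[(label, label)] for label in labels)
accuracy = correct / len(gold)
print(accuracy)
```

Off-diagonal cells show which gold sense is being mistaken for which predicted sense, which is the starting point for the error analysis below.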
### Errors

You can also output each error from the test data into a file `errors.txt`. For example:



In [None]:
WSDClasifier(trainer=NaiveBayesClassifier.train,
             word='hard.pos',
             features=context_features,
             distance=2,
             confusion_matrix=True,
             log=True)

Use your favourite editor to look at `errors.txt`.
You will find it in the same directory as this notebook.

In `errors.txt`, the example number on the first line of each entry is the (list) index of the error in the test_data. 

### Now it's your turn:

1. Choose your best performing model from your earlier trials, and train it again, but add the arguments `confusion_matrix=True` and `log=True`.
2. Using the confusion matrix, identify which sense is the hardest one for the model to estimate.
3. Look in `errors.txt` for examples where that hardest word sense is the correct label. Do you see any patterns or systematic errors? If so, can you think of a way to adapt the feature vector so as to improve the model? 