# Evaluation

In NLP, we can have intrinsic or extrinsic evaluation.

## Intrinsic evaluation

NLP tasks like machine translation may consist of multiple subtasks. Instead of evaluating the entire pipeline (which may take a long time), we may want to evaluate parts of the system. In intrinsic evaluation, we measure the "goodness" of a subtask in itself. We take a specific intermediate subtask, e.g. converting words to word vectors, and evaluate it on analogy completion or on how well word-pair similarities correlate with human judgement.

This type of evaluation is simple and fast. Sometimes it can help us understand how the system works, e.g. which hyperparameters actually have an impact on the similarity metric. However, improvements on the intrinsic task do not necessarily translate to improvements in the overall task, such as machine translation or named-entity recognition.

### Word Vector Analogy

One popular intrinsic evaluation of word vectors is the **word vector analogy** task. We evaluate word vectors by how well cosine similarity, applied after adding and subtracting vectors, captures intuitive semantic and syntactic analogy questions. For example, man-to-woman is like king-to-queen.

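We want to identify the word vector $\mathbf{x}_d$ that maximises the cosine similarity with $\mathbf{x}_b - \mathbf{x}_a + \mathbf{x}_c$ (the expression below equals the cosine similarity when the candidate vectors $\mathbf{x}_i$ have unit length):

$$
d = \arg \max_{i} \frac{ \left( \mathbf{x}_b - \mathbf{x}_a + \mathbf{x}_c \right)^T \mathbf{x}_i }{ \lVert \mathbf{x}_b - \mathbf{x}_a + \mathbf{x}_c \rVert }
$$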
where
- $\mathbf{x}_a$ is the word vector for "man"
- $\mathbf{x}_b$ is the word vector for "woman"
- $\mathbf{x}_c$ is the word vector for "king"




Basically, we want the nearest word to "woman" - "man" + "king" to be "queen".
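The analogy search can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the three-dimensional vectors in `toy_vectors` are made up by hand (they are not trained embeddings), the helper names `cosine_similarity` and `solve_analogy` are hypothetical, and the query words are excluded from the candidates, which is a common convention rather than something these notes prescribe. The sketch divides by both norms, so it does not require the candidate vectors to be unit length.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Return the word d such that a is to b as c is to d."""
    # Target point: x_b - x_a + x_c
    target = [xb - xa + xc
              for xa, xb, xc in zip(vectors[a], vectors[b], vectors[c])]
    # Assumption: the query words themselves are not valid answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates,
               key=lambda w: cosine_similarity(target, vectors[w]))


# Hand-made toy vectors (dimensions loosely: male, female, royal).
toy_vectors = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [0.1, 0.1, 0.0],
}

predicted = solve_analogy(toy_vectors, "man", "woman", "king")
print(predicted)
```

On a real analogy dataset, accuracy would be the fraction of questions for which the predicted word matches the expected fourth word.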

We can use the following dataset of analogies to evaluate our word vectors: https://github.com/nicholas-leonard/word2vec/blob/master/questions-words.txt

<img src="figures/word-vector-eval-plots.png" width="600" />



### Correlation Evaluation

Another intrinsic evaluation is correlation evaluation. The cosine similarity between pairs of word vectors is compared with the similarity scores assigned by people.

For example, the WordSimilarity-353 Test Collection contains two sets of English word pairs along with human-assigned similarity judgements. The collection can be used to train and/or test computer algorithms implementing semantic similarity measures (i.e., algorithms that numerically estimate similarity of natural language words). Link: http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/
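As a rough sketch of how such a comparison might be scored, the snippet below computes Spearman's rank correlation between model similarities and human ratings. Several things here are assumptions: the notes do not say which correlation coefficient to use (Spearman is a common choice for this kind of benchmark), the helper names are hypothetical, and the numbers in `model_scores` and `human_scores` are invented toy values, not entries from WordSimilarity-353.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented toy values: one cosine similarity and one human rating per word pair.
model_scores = [0.9, 0.7, 0.2, 0.4, 0.5]
human_scores = [9.5, 8.0, 1.0, 4.0, 3.5]

rho = spearman(model_scores, human_scores)
print(f"Spearman correlation: {rho:.2f}")
```

A value close to 1 means the word vectors order the pairs much as people do; because only ranks matter, the different scales of the two score lists do not affect the result.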



## Extrinsic evaluation

Extrinsic evaluation measures the "goodness" of the system on a real task.

The problem is that extrinsic evaluation can take a very long time, because the chosen model may take a while to train. Therefore, trying out different word vectors or different hyperparameters is slow. For example, it may take a long time to compare word vectors built from the Pearson correlation of the co-occurrence matrix with word vectors built from the raw word counts.

We have to be careful with extrinsic evaluation, though. It is not advisable to tune multiple parts of the system and then evaluate only once: if the overall task improves, we have no idea which part is actually responsible for the improvement. Therefore, when doing extrinsic evaluation, it is better to change only one part of the system at a time.
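One way to enforce the "change only one part" rule is to generate experiment configurations that each differ from a baseline in exactly one setting. The sketch below is illustrative only: the function `one_at_a_time_variants` and the setting names and values (`word_vectors`, `vector_dim`, `window_size`) are hypothetical and do not come from any particular system.

```python
def one_at_a_time_variants(baseline, alternatives):
    """Yield (changed_key, config) pairs differing from baseline in one setting."""
    for key, values in alternatives.items():
        for value in values:
            if value == baseline[key]:
                continue  # identical to the baseline, nothing to compare
            variant = dict(baseline)  # copy so the baseline stays untouched
            variant[key] = value
            yield key, variant


# Hypothetical baseline system and the alternatives we want to try.
baseline = {"word_vectors": "glove", "vector_dim": 100, "window_size": 5}
alternatives = {
    "word_vectors": ["word2vec"],
    "vector_dim": [50, 300],
}

variants = list(one_at_a_time_variants(baseline, alternatives))
for changed_key, config in variants:
    print(changed_key, config)
```

Each variant would then be trained and scored on the downstream task and compared against the baseline, so that any change in the score can be attributed to the single setting that differs.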