# Bake-off: Word similarity tasks

In [2]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018 term"

## Contents

0. [Overview](#Overview)
0. [Evaluation](#Evaluation)
0. [Set-up](#Set-up)
0. [Dataset readers](#Dataset-readers)
0. [Evaluation](#Evaluation)
0. [Baseline](#Baseline)
0. [Bake-off submission](#Bake-off-submission)

## Overview

Word similarity datasets have long been used to evaluate distributed representations. This section provides basic code for conducting such analyses with four datasets:

* [WordSim-353](http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/)
* [MTurk-287](http://tx.technion.ac.il/~kirar/Datasets.html)
* [MTurk-771](http://www2.mta.ac.il/~gideon/mturk771.html)
* [The MEN Test Collection](http://clic.cimec.unitn.it/~elia.bruni/MEN) (3,000 pairs)

For each of the first three, the numeral in the name is the number of word pairs the dataset contains.

If you want to push this task further, consider using additional datasets from http://wordvectors.org/ and perhaps even taking advantage of the evaluation infrastructure it provides. (For additional details, see [the associated paper](http://www.aclweb.org/anthology/P/P14/P14-5004.pdf).)

## Evaluation

1. Each of the similarity datasets contains word pairs with an associated human-annotated similarity score. (We negate these scores so that they behave like distances, which aligns them intuitively with our distance functions.)

1. The evaluation code measures the distance between the two words of each pair in your chosen VSM (which should be a `pd.DataFrame`).

1. The evaluation metric is the Spearman correlation coefficient between the annotated scores and your distances.

1. We also macro-average these correlations across the four datasets for an overall summary.
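
To make the metric concrete, here is a minimal, dependency-free sketch of Spearman's rho. The evaluation code itself uses `scipy.stats.spearmanr`; the helper names `average_ranks`, `pearson`, and `spearman` are hypothetical, and the ratings and distances are made up for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up human similarity ratings for four word pairs (higher = more similar).
human_sims = [9.0, 7.5, 3.0, 1.0]
# Made-up model distances for the same pairs (lower = more similar).
model_dists = [0.10, 0.30, 0.25, 0.90]

# Negating the ratings turns them into distance-like scores.
negated = [-s for s in human_sims]

rho = spearman(negated, model_dists)
rho_unnegated = spearman(human_sims, model_dists)
print(round(rho, 3), round(rho_unnegated, 3))
```

With the negated ratings the correlation comes out positive (0.8 here); without negation the same data gives -0.8. This is why the readers flip the sign of the human scores. Only the rankings matter, so any monotonic rescaling of the distances leaves rho unchanged.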

Based on my reading of the literature, I'd say that the best VSMs report scores in this range:

| Dataset       | Competitive scores |
|---------------|--------------------|
| WordSim-353   | ≈0.75        |
| MTurk-287     | ≈0.75        |
| MTurk-771     | ≈0.75        |
| MEN           | ≈0.70        |

Your scores won't quite be comparable because you'll be missing a few vocabulary items, but these are still good targets.

## Set-up

In [3]:
from collections import defaultdict
import csv
import numpy as np
import os
import pandas as pd
from scipy.stats import spearmanr
import vsm

In [4]:
data_home = 'vsmdata'

wordsim_home = os.path.join(data_home, 'wordsim')

## Dataset readers

In [5]:
def wordsim_dataset_reader(src_filename, header=False, delimiter=','):    
    """Basic reader that works for all four files, since they all have the 
    format word1,word2,score, differing only in whether or not they include 
    a header line and what delimiter they use.
    
    Parameters
    ----------
    src_filename : str
        Full path to the source file.        
    header : bool (default: False)
        Whether `src_filename` has a header.        
    delimiter : str (default: ',')
        Field delimiter in `src_filename`.
    
    Yields
    ------    
    (str, str, float)
       (w1, w2, score) where `score` is the negative of the similarity 
       score in the file so that we are intuitively aligned with our 
       distance-based code.
    
    """
    with open(src_filename) as f:
        reader = csv.reader(f, delimiter=delimiter)
        if header:
            next(reader)
        for row in reader:
            w1, w2, score = row
            # Negative of scores to align intuitively with distance functions:
            score = -float(score)
            yield (w1, w2, score)

def wordsim353_reader():
    """WordSim-353: http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/"""
    src_filename = os.path.join(wordsim_home, 'wordsim353.csv')
    return wordsim_dataset_reader(src_filename, header=True)
 
def mturk287_reader():
    """MTurk-287: http://tx.technion.ac.il/~kirar/Datasets.html"""
    src_filename = os.path.join(wordsim_home, 'MTurk-287.csv')
    return wordsim_dataset_reader(src_filename, header=False)
    
def mturk771_reader():
    """MTURK-771: http://www2.mta.ac.il/~gideon/mturk771.html"""
    src_filename = os.path.join(wordsim_home, 'MTURK-771.csv')
    return wordsim_dataset_reader(src_filename, header=False)

def men_reader():
    """MEN: http://clic.cimec.unitn.it/~elia.bruni/MEN"""
    src_filename = os.path.join(wordsim_home, 'MEN_dataset_natural_form_full')
    return wordsim_dataset_reader(src_filename, header=False, delimiter=' ')   
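
As a quick check of the file format the readers above expect, the following sketch writes a tiny made-up file and parses it. The function `read_pairs` is a hypothetical stand-in that mirrors the logic of `wordsim_dataset_reader`; the header text and the rows are invented and do not come from any of the four datasets.

```python
import csv
import os
import tempfile


def read_pairs(src_filename, header=False, delimiter=','):
    """Hypothetical stand-in mirroring the reader: yields (w1, w2, -score)."""
    with open(src_filename, newline='') as f:
        reader = csv.reader(f, delimiter=delimiter)
        if header:
            next(reader)
        for w1, w2, score in reader:
            yield (w1, w2, -float(score))


# Invented rows in the comma-delimited layout with a header line.
toy_rows = "Word 1,Word 2,Score\ncat,dog,7.5\ncat,car,1.25\n"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'toy.csv')
    with open(path, 'w') as f:
        f.write(toy_rows)
    # Materialize the generator while the temporary file still exists.
    pairs = list(read_pairs(path, header=True))

print(pairs)
```

Because the reader is a generator, it has to be consumed before the underlying file goes away, which is why the sketch calls `list` inside the `with` block.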

## Evaluation

In [6]:
def word_similarity_evaluation(reader, df, distfunc=vsm.cosine, verbose=True):
    """Word-similarity evaluation framework.
    
    Parameters
    ----------
    reader : function
        A zero-argument function that returns an iterator over a
        word-similarity dataset. The iterator just has to yield
        tuples (word1, word2, score).
    df : pd.DataFrame
        The VSM being evaluated.        
    distfunc : function mapping vector pairs to floats (default: `vsm.cosine`)
        The measure of distance between vectors. Can also be `vsm.euclidean`, 
        `vsm.matching`, `vsm.jaccard`, as well as any other distance measure 
        between 1d vectors.  
    verbose : bool
        Whether to print information about how much of the vocab
        `df` covers.
    
    Prints
    ------
    To standard output
        Size of the vocabulary overlap between the evaluation set and
        rownames. We limit the evaluation to the overlap, paying no price
        for missing words (which is not fair, but it's reasonable given
        that we're working with very small VSMs in this notebook).
    
    Returns
    -------
    float
        The Spearman rank correlation coefficient between the dataset
        scores and the distances obtained from `df` using `distfunc`.
        This evaluation is sensitive only to rankings, not to absolute
        values.
    
    """    
    sims = defaultdict(list)
    rownames = df.index
    vocab = set()    
    excluded = set()
    for w1, w2, score in reader():
        if w1 in rownames and w2 in rownames:
            sims[w1].append((w2, score))
            sims[w2].append((w1, score))
            vocab |= {w1, w2}
        else:
            excluded |= {w1, w2}
    all_words = vocab | excluded
    if verbose:
        print("Evaluation vocab: {:,} of {:,}".format(len(vocab), len(all_words)))
    # Evaluate the VSM by building a vector of all_scores from the dataset
    # and a parallel vector of all_dists from distances in df.
    all_scores = []
    all_dists = []
    for word in vocab:
        vec = df.loc[word]
        vals = sims[word]
        cmps, scores = zip(*vals)
        all_scores += scores
        all_dists += [distfunc(vec, df.loc[w]) for w in cmps]
    rho, pvalue = spearmanr(all_scores, all_dists)
    return rho
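
Since `distfunc` can be any function from two 1d vectors to a float, it is easy to try a measure of your own. The sketch below defines two candidates in plain Python: `manhattan` (a hypothetical name) and `cosine_distance`, which follows the usual definition of cosine distance and is not claimed to be identical to `vsm.cosine`. Both iterate over their arguments with `zip`, so they should also accept the rows that `df.loc[word]` returns.

```python
import math


def manhattan(u, v):
    """City-block (L1) distance between two equal-length 1d vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    """One minus the cosine similarity of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


u = [1.0, 0.0]
v = [0.0, 1.0]
w = [2.0, 0.0]  # Same direction as u, but twice as long.

print(manhattan(u, w), cosine_distance(u, w))
print(manhattan(u, v), cosine_distance(u, v))
```

The vectors `u` and `w` point in the same direction, so their cosine distance is 0 even though their L1 distance is 1. A call such as `word_similarity_evaluation(men_reader, df, distfunc=manhattan)` would then evaluate the VSM under the alternative measure.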

Evaluation is then simple. The following lets us evaluate a VSM against all four datasets:

In [7]:
def full_word_similarity_evaluation(df, verbose=True):
    """Evaluate a VSM against all four datasets.
    
    Parameters
    ----------
    df : pd.DataFrame
    
    Returns
    -------
    dict
        Mapping reader function names to Spearman r values.
        
    """        
    scores = {}
    for reader in (wordsim353_reader, mturk287_reader, mturk771_reader, men_reader):        
        if verbose: 
            print("="*40)
            print(reader.__name__)
        score = word_similarity_evaluation(reader, df, verbose=verbose)
        scores[reader.__name__] = score
        if verbose:            
            print('Spearman r: {0:0.03f}'.format(score))
    mu = np.array(list(scores.values())).mean()
    if verbose:
        print("="*40)
        print("Mean Spearman r: {0:0.03f}".format(mu))
    return scores

## Baseline

My baseline is PPMI on `imdb20`:

In [8]:
imdb20 = pd.read_csv(
    os.path.join(data_home, "imdb_window20-flat.csv.gz"), index_col=0)

In [9]:
imdb20_ppmi = vsm.pmi(imdb20)
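
For readers who want to see what the reweighting does, here is a minimal sketch of positive PMI on a toy count matrix. It assumes that `vsm.pmi` with its default settings computes PPMI (as the variable name `imdb20_ppmi` suggests); the function `ppmi` below is a hypothetical plain-Python illustration, not the `vsm` implementation.

```python
import math


def ppmi(matrix):
    """Positive PMI for a dense count matrix given as a list of rows."""
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    result = []
    for i, row in enumerate(matrix):
        new_row = []
        for j, count in enumerate(row):
            # Expected count if row and column were independent.
            expected = row_sums[i] * col_sums[j] / total
            if count == 0 or expected == 0:
                # log(0) is undefined; it is conventionally mapped to 0.
                new_row.append(0.0)
            else:
                # Clip negative PMI values to 0 (the "positive" in PPMI).
                new_row.append(max(0.0, math.log(count / expected)))
        result.append(new_row)
    return result


# Toy co-occurrence counts, invented for illustration.
counts = [[4, 1],
          [1, 4]]
weighted = ppmi(counts)
print(weighted)
```

Every expected count here is 2.5, so the diagonal cells (observed 4) get `log(1.6)` and the off-diagonal cells (observed 1) would get a negative value that is clipped to 0.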

In [10]:
full_word_similarity_evaluation(imdb20_ppmi, verbose=True)

wordsim353_reader
Evaluation vocab: 418 of 437
Spearman r: 0.469
mturk287_reader
Evaluation vocab: 499 of 499
Spearman r: 0.599
mturk771_reader
Evaluation vocab: 1,113 of 1,113
Spearman r: 0.462
men_reader
Evaluation vocab: 751 of 751
Spearman r: 0.572
Mean Spearman r: 0.525


{'wordsim353_reader': 0.46888766456156583,
 'mturk287_reader': 0.5986597600532505,
 'mturk771_reader': 0.4615212813179131,
 'men_reader': 0.5724487594295152}
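
Note that `full_word_similarity_evaluation` prints the mean but returns only the per-dataset dictionary. A small sketch of recovering the macro-average from the returned values, using the baseline numbers shown above:

```python
from statistics import mean

# The dictionary returned by the baseline run.
baseline_scores = {
    'wordsim353_reader': 0.46888766456156583,
    'mturk287_reader': 0.5986597600532505,
    'mturk771_reader': 0.4615212813179131,
    'men_reader': 0.5724487594295152}

# Macro-average: each dataset counts equally, regardless of its size.
macro_avg = mean(baseline_scores.values())
print("Mean Spearman r: {0:0.03f}".format(macro_avg))
```

This reproduces the 0.525 printed by the evaluation function.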

## Bake-off submission

Your submission should include:

1. The name of the count matrix you started with (must be one in `vsmdata`).
1. A description of the steps you took to create your bake-off VSM – must be different from the above baseline.
1. Your Spearman r value for each of the four evaluation datasets and your average across all four.

Submission URL: https://goo.gl/forms/eJWHpwJlOyDEC3V63

## Sherry's Code

#### Giga20

In [42]:
giga20 = pd.read_csv(
    os.path.join(data_home, 'gigaword_window20-flat.csv.gz'), index_col=0)

In [43]:
giga20_l2 = giga20.apply(vsm.length_norm, axis=1)

In [44]:
giga20_l2_ppmi = vsm.pmi(giga20_l2)

In [57]:
print(giga20.shape)
giga20_ppmi_svd = vsm.lsa(giga20_l2_ppmi, k=800)

(5000, 5000)


In [58]:
full_word_similarity_evaluation(giga20_ppmi_svd, verbose=True)

wordsim353_reader
Evaluation vocab: 418 of 437
Spearman r: 0.537
mturk287_reader
Evaluation vocab: 499 of 499
Spearman r: 0.673
mturk771_reader
Evaluation vocab: 1,113 of 1,113
Spearman r: 0.556
men_reader
Evaluation vocab: 751 of 751
Spearman r: 0.660
Mean Spearman r: 0.607


{'wordsim353_reader': 0.5370376772829761,
 'mturk287_reader': 0.6726188686654195,
 'mturk771_reader': 0.5562486339246591,
 'men_reader': 0.6602281327756298}

#### LSA
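A common pattern in the literature is to apply PMI before SVD. PMI-weighted values tend to be closer to normally (Gaussian) distributed than raw counts, which better satisfies the assumptions underlying SVD. Here the rows are first L2-normalized, then reweighted with PMI, and the result is passed to LSA: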

In [13]:
imdb20_l2 = imdb20.apply(vsm.length_norm, axis=1)

In [14]:
imdb20_l2_ppmi = vsm.pmi(imdb20_l2)

In [24]:
imdb20_ppmi_svd = vsm.lsa(imdb20_l2_ppmi, k=2000)
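
The dimensionality reduction step can be sketched with NumPy as a truncated SVD. This is an illustration under the assumption that `vsm.lsa` keeps the top `k` left singular vectors scaled by their singular values; the function name `lsa_sketch` and the toy matrix are hypothetical, and the actual `vsm` implementation may differ in its details.

```python
import numpy as np


def lsa_sketch(X, k):
    """Keep the top-k left singular vectors, scaled by their singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]


# Toy 4x3 matrix standing in for a reweighted count matrix.
X = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 4.0],
              [0.5, 0.0, 2.0]])

reduced = lsa_sketch(X, k=2)  # Each row is now a 2-dimensional word vector.
full = lsa_sketch(X, k=3)     # Keeping every dimension loses nothing.

print(reduced.shape)
print(np.allclose(full @ full.T, X @ X.T))
```

Keeping all dimensions preserves every row-by-row dot product of the original matrix, so truncating to a smaller `k` can be read as a low-rank approximation of those dot products. The choice of `k` (800 and 2000 in the runs here) is a hyperparameter worth tuning.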

In [19]:
print(imdb20.shape)

(5000, 5000)


In [40]:
from tf_autoencoder import TfAutoencoder

imdb_l2_ppmi_svd_ae = TfAutoencoder(
    max_iter=2000, hidden_dim=500, eta=0.01).fit(imdb20_ppmi_svd)

Iteration 2000: loss: 0.46521537750959396

In [41]:
full_word_similarity_evaluation(imdb_l2_ppmi_svd_ae, verbose=True)

wordsim353_reader
Evaluation vocab: 418 of 437
Spearman r: 0.457
mturk287_reader
Evaluation vocab: 499 of 499
Spearman r: 0.537
mturk771_reader
Evaluation vocab: 1,113 of 1,113
Spearman r: 0.470
men_reader
Evaluation vocab: 751 of 751
Spearman r: 0.566
Mean Spearman r: 0.508


{'wordsim353_reader': 0.4567673303713142,
 'mturk287_reader': 0.5374832152614643,
 'mturk771_reader': 0.47002482072004637,
 'men_reader': 0.5660522454418069}

In [25]:
full_word_similarity_evaluation(imdb20_ppmi_svd, verbose=True)

wordsim353_reader
Evaluation vocab: 418 of 437
Spearman r: 0.563
mturk287_reader
Evaluation vocab: 499 of 499
Spearman r: 0.623
mturk771_reader
Evaluation vocab: 1,113 of 1,113
Spearman r: 0.563
men_reader
Evaluation vocab: 751 of 751
Spearman r: 0.654
Mean Spearman r: 0.601


{'wordsim353_reader': 0.5633692369609065,
 'mturk287_reader': 0.6228293370300938,
 'mturk771_reader': 0.5626968523789865,
 'men_reader': 0.654496284665301}

#### GloVe

In [14]:
from mittens import GloVe

In [15]:
glove_model = GloVe()

imdb20_glv = glove_model.fit(imdb20.values)

imdb20_glv = pd.DataFrame(imdb20_glv, index=imdb20.index)

Iteration 100: loss: 949373.6255

In [20]:
print(imdb20_glv.shape)

(5000, 100)


In [21]:
full_word_similarity_evaluation(imdb20_glv, verbose=True)

wordsim353_reader
Evaluation vocab: 418 of 437
Spearman r: 0.046
mturk287_reader
Evaluation vocab: 499 of 499
Spearman r: 0.280
mturk771_reader
Evaluation vocab: 1,113 of 1,113
Spearman r: 0.134
men_reader
Evaluation vocab: 751 of 751
Spearman r: 0.119
Mean Spearman r: 0.145


{'wordsim353_reader': 0.04605244129193265,
 'mturk287_reader': 0.2803686159040967,
 'mturk771_reader': 0.13355032890726384,
 'men_reader': 0.11886951155571018}

#### Autoencoders

In [27]:
from tf_autoencoder import TfAutoencoder

In [22]:
imdb20_l2 = imdb20.apply(vsm.length_norm, axis=1)

In [23]:
imdb20_l2_svd100 = vsm.lsa(imdb20_l2, k=100)

In [26]:
imdb_l2_svd100_ae = TfAutoencoder(
    max_iter=1000, hidden_dim=50, eta=0.01).fit(imdb20_l2_svd100)

Iteration 1000: loss: 0.0037961083580739796

In [27]:
full_word_similarity_evaluation(imdb_l2_svd100_ae, verbose=True)

wordsim353_reader
Evaluation vocab: 418 of 437
Spearman r: 0.202
mturk287_reader
Evaluation vocab: 499 of 499
Spearman r: 0.261
mturk771_reader
Evaluation vocab: 1,113 of 1,113
Spearman r: 0.172
men_reader
Evaluation vocab: 751 of 751
Spearman r: 0.246
Mean Spearman r: 0.220


{'wordsim353_reader': 0.20221759603544312,
 'mturk287_reader': 0.2609484293369752,
 'mturk771_reader': 0.17222199323904994,
 'men_reader': 0.24642659424147836}

#### Retrofitting

In [29]:
import retrofitting
from retrofitting import Retrofitter
from collections import defaultdict
from nltk.corpus import wordnet as wn

In [30]:
wn_retro = Retrofitter(verbose=True)

In [31]:
def get_wordnet_edges():
    edges = defaultdict(set)
    for ss in wn.all_synsets():
        lem_names = {lem.name() for lem in ss.lemmas()}
        for lem in lem_names:
            edges[lem] |= lem_names
    return edges

In [32]:
wn_edges = get_wordnet_edges()

In [33]:
def convert_edges_to_indices(edges, Q):
    lookup = dict(zip(Q.index, range(Q.shape[0])))
    index_edges = defaultdict(set)
    for start, finish_nodes in edges.items():
        s = lookup.get(start)
        # Compare against None so that the word at row index 0 is not skipped:
        if s is not None:
            f = {lookup[n] for n in finish_nodes if n in lookup}
            if f:
                index_edges[s] = f
    return index_edges
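
To see what the conversion produces, here is a toy run on an invented three-word vocabulary. The function `edges_to_indices` is a hypothetical variant that takes a plain list of row labels rather than a DataFrame, and it tests the lookup result with `is not None`, since row index 0 is a valid index but is falsy in Python.

```python
from collections import defaultdict


def edges_to_indices(edges, vocab):
    """Hypothetical variant: map word-level edges to row-index edges."""
    lookup = {word: i for i, word in enumerate(vocab)}
    index_edges = defaultdict(set)
    for start, finish_nodes in edges.items():
        s = lookup.get(start)
        # Compare against None: index 0 is valid but falsy.
        if s is not None:
            # Keep only neighbors that are in the vocabulary.
            f = {lookup[n] for n in finish_nodes if n in lookup}
            if f:
                index_edges[s] = f
    return index_edges


# Invented vocabulary and edges, for illustration only.
vocab = ['car', 'automobile', 'dog']
edges = {
    'car': {'car', 'automobile', 'auto'},  # 'auto' is out of vocabulary
    'automobile': {'car', 'automobile'},
    'hound': {'dog', 'hound'},             # start word is out of vocabulary
}

index_edges = edges_to_indices(edges, vocab)
print(dict(index_edges))
```

Words outside the vocabulary are dropped on both sides of an edge, so `'hound'` contributes nothing and `'auto'` disappears from the neighbors of `'car'`, while `'car'` itself (row 0) keeps its edges.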

In [34]:
wn_index_edges = convert_edges_to_indices(wn_edges, imdb20_ppmi_svd)

In [35]:
imdb20_retro = wn_retro.fit(imdb20_ppmi_svd, wn_index_edges)

Converged at iteration 9; change was 0.0055 

In [36]:
full_word_similarity_evaluation(imdb20_retro, verbose=True)

wordsim353_reader
Evaluation vocab: 418 of 437
Spearman r: 0.532
mturk287_reader
Evaluation vocab: 499 of 499
Spearman r: 0.637
mturk771_reader
Evaluation vocab: 1,113 of 1,113
Spearman r: 0.542
men_reader
Evaluation vocab: 751 of 751
Spearman r: 0.638
Mean Spearman r: 0.587


{'wordsim353_reader': 0.5324644306730004,
 'mturk287_reader': 0.6368203796993187,
 'mturk771_reader': 0.5422195350983493,
 'men_reader': 0.638481431829129}