# Bake-off: Word similarity tasks

In [6]:
__author__ = "Morgan Bryant"
__version__ = "CS224u, Stanford, Spring 2018 term"

## Contents

0. [Overview](#Overview)
0. [Evaluation](#Evaluation)
0. [Set-up](#Set-up)
0. [Dataset readers](#Dataset-readers)
0. [Evaluation](#Evaluation)
0. [Baseline](#Baseline)
0. [Bake-off submission](#Bake-off-submission)

## Overview

Word similarity datasets have long been used to evaluate distributed representations. This section provides basic code for conducting such analyses with four datasets:

* [WordSim-353](http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/)
* [MTurk-287](http://tx.technion.ac.il/~kirar/Datasets.html)
* [MTurk-771](http://www2.mta.ac.il/~gideon/mturk771.html)
* [The MEN Test Collection](http://clic.cimec.unitn.it/~elia.bruni/MEN) (3,000 pairs)

For the first three, the numeral in each name is the number of pairs the dataset contains.

If you want to push this task further, consider using additional datasets from http://wordvectors.org/ and perhaps even taking advantage of the evaluation infrastructure it provides. (For additional details, see [the associated paper](http://www.aclweb.org/anthology/P/P14/P14-5004.pdf).)

## Evaluation

1. Each of the similarity datasets contains word pairs with an associated human-annotated similarity score. (We convert these to distances to align intuitively with our distance measure functions.)

1. The evaluation code measures the distance between the two words in each pair according to your chosen VSM, which should be a `pd.DataFrame`.

1. The evaluation metric is the Spearman correlation coefficient between the annotated scores and your distances.

1. We also macro-average these correlations across the four datasets for an overall summary.
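
To make the metric concrete, here is a minimal pure-Python sketch of Spearman correlation (Pearson correlation over average ranks). The notebook itself uses `scipy.stats.spearmanr`; the helper names and the four data points below are hypothetical and only illustrate why the human scores are negated before being compared with distances.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: human similarity ratings and model distances for four pairs.
human_similarity = [9.0, 7.5, 3.0, 1.0]
model_distance = [0.10, 0.30, 0.25, 0.90]

# Negating the ratings turns "high similarity" into "low distance",
# so a good model yields a positive correlation.
negated = [-s for s in human_similarity]
rho = spearman(negated, model_distance)
rho_raw = spearman(human_similarity, model_distance)
```

With the negated scores the correlation is positive for a model whose distances track human judgments; without the negation the same model gets the same magnitude with the opposite sign.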

Based on my reading of the literature, I'd say that the best VSMs report scores in this range:

| Dataset       | Competitive scores |
|---------------|--------------------|
| WordSim-353   | ≈0.75        |
| MTurk-287     | ≈0.75        |
| MTurk-771     | ≈0.75        |
| MEN           | ≈0.70        |

Your scores won't quite be comparable because you'll be missing a few vocabulary items, but these are still good targets.

## Set-up

In [7]:
from collections import defaultdict
import csv
import numpy as np
import os
import pandas as pd
from scipy.stats import spearmanr
import vsm

In [8]:
data_home = 'vsmdata'

wordsim_home = os.path.join(data_home, 'wordsim')

## Dataset readers

In [9]:
def wordsim_dataset_reader(src_filename, header=False, delimiter=','):    
    """Basic reader that works for all four files, since they all have the 
    format word1,word2,score, differing only in whether or not they include 
    a header line and what delimiter they use.
    
    Parameters
    ----------
    src_filename : str
        Full path to the source file.        
    header : bool (default: False)
        Whether `src_filename` has a header.        
    delimiter : str (default: ',')
        Field delimiter in `src_filename`.
    
    Yields
    ------    
    (str, str, float)
       (w1, w2, score) where `score` is the negative of the similarity 
       score in the file so that we are intuitively aligned with our 
       distance-based code.
    
    """
    with open(src_filename) as f:
        reader = csv.reader(f, delimiter=delimiter)
        if header:
            next(reader)
        for row in reader:
            w1, w2, score = row
            # Negative of scores to align intuitively with distance functions:
            score = -float(score)
            yield (w1, w2, score)

def wordsim353_reader():
    """WordSim-353: http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/"""
    src_filename = os.path.join(wordsim_home, 'wordsim353.csv')
    return wordsim_dataset_reader(src_filename, header=True)
 
def mturk287_reader():
    """MTurk-287: http://tx.technion.ac.il/~kirar/Datasets.html"""
    src_filename = os.path.join(wordsim_home, 'MTurk-287.csv')
    return wordsim_dataset_reader(src_filename, header=False)
    
def mturk771_reader():
    """MTURK-771: http://www2.mta.ac.il/~gideon/mturk771.html"""
    src_filename = os.path.join(wordsim_home, 'MTURK-771.csv')
    return wordsim_dataset_reader(src_filename, header=False)

def men_reader():
    """MEN: http://clic.cimec.unitn.it/~elia.bruni/MEN"""
    src_filename = os.path.join(wordsim_home, 'MEN_dataset_natural_form_full')
    return wordsim_dataset_reader(src_filename, header=False, delimiter=' ')   
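
As a quick check of the reader logic, the sketch below applies the same parsing steps to in-memory text instead of files on disk. The function name `read_pairs` and the rows are hypothetical (they are not taken from the real datasets); the point is only to show the effect of the `header` and `delimiter` options and of the score negation.

```python
import csv
import io


def read_pairs(f, header=False, delimiter=','):
    """Yield (w1, w2, -score) from an open file-like object."""
    reader = csv.reader(f, delimiter=delimiter)
    if header:
        next(reader)  # Skip the header row.
    for w1, w2, score in reader:
        # Negate the similarity score so that it behaves like a distance.
        yield (w1, w2, -float(score))


# Made-up rows, for illustration only.
comma_file = io.StringIO("word1,word2,score\napple,pear,8.0\n")
space_file = io.StringIO("apple truck 2.5\n")

comma_rows = list(read_pairs(comma_file, header=True))
space_rows = list(read_pairs(space_file, delimiter=' '))
```

Both calls produce one `(w1, w2, score)` tuple with a negative score, regardless of the delimiter used in the source.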

## Evaluation

In [10]:
def word_similarity_evaluation(reader, df, distfunc=vsm.cosine, verbose=True):
    """Word-similarity evaluation framework.
    
    Parameters
    ----------
    reader : callable
        A zero-argument function returning an iterator over a
        word-similarity dataset. The iterator just has to yield
        tuples (word1, word2, score).    
    df : pd.DataFrame
        The VSM being evaluated.        
    distfunc : function mapping vector pairs to floats (default: `vsm.cosine`)
        The measure of distance between vectors. Can also be `vsm.euclidean`, 
        `vsm.matching`, `vsm.jaccard`, as well as any other distance measure 
        between 1d vectors.  
    verbose : bool
        Whether to print information about how much of the vocab
        `df` covers.
    
    Prints
    ------
    To standard output
        Size of the vocabulary overlap between the evaluation set and
        rownames. We limit the evaluation to the overlap, paying no price
        for missing words (which is not fair, but it's reasonable given
        that we're working with very small VSMs in this notebook).
    
    Returns
    -------
    float
        The Spearman rank correlation coefficient between the dataset
        scores and the distances obtained from `df` using 
        `distfunc`. This evaluation is sensitive only to rankings, not
        to absolute values.
    
    """    
    sims = defaultdict(list)
    rownames = df.index
    vocab = set()    
    excluded = set()
    for w1, w2, score in reader():
        if w1 in rownames and w2 in rownames:
            sims[w1].append((w2, score))
            sims[w2].append((w1, score))
            vocab |= {w1, w2}
        else:
            excluded |= {w1, w2}
    all_words = vocab | excluded
    if verbose:
        print("Evaluation vocab: {:,} of {:,}".format(len(vocab), len(all_words)))
    # Evaluate the VSM by creating a vector of all_scores for the dataset
    # and a vector of all_dists for the corresponding distances in df. 
    all_scores = []
    all_dists = []
    for word in vocab:
        vec = df.loc[word]
        vals = sims[word]
        cmps, scores = zip(*vals)
        all_scores += scores
        all_dists += [distfunc(vec, df.loc[w]) for w in cmps]
    rho, pvalue = spearmanr(all_scores, all_dists)
    return rho
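
The sketch below isolates two steps of the function above: restricting the dataset to pairs whose words are both in the VSM, and computing a distance for each remaining pair. It assumes that `vsm.cosine` returns cosine distance (1 minus cosine similarity); the toy vectors, the pairs, and the helper name `cosine_distance` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical toy VSM mapping words to vectors.
toy_vsm = {
    "cat": [1.0, 0.0],
    "dog": [1.0, 1.0],
    "car": [0.0, 2.0],
}

# Hypothetical dataset rows with already-negated scores.
pairs = [("cat", "dog", -8.0), ("cat", "car", -2.0), ("cat", "zebra", -6.0)]

# Keep only pairs fully covered by the VSM ("zebra" is missing, so that pair is dropped).
kept = [(w1, w2, s) for w1, w2, s in pairs if w1 in toy_vsm and w2 in toy_vsm]
dists = [cosine_distance(toy_vsm[w1], toy_vsm[w2]) for w1, w2, _ in kept]
```

The dropped pair costs nothing in the final correlation, which is why the coverage line printed by the function is worth reading alongside the score.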

Evaluation is then simple. The following lets us evaluate a VSM against all four datasets:

In [11]:
def full_word_similarity_evaluation(df, verbose=True):
    """Evaluate a VSM against all four datasets.
    
    Parameters
    ----------
    df : pd.DataFrame
    
    Returns
    -------
    dict
        Mapping dataset names to Spearman r values
        
    """        
    scores = {}
    for reader in (wordsim353_reader, mturk287_reader, mturk771_reader, men_reader):        
        if verbose: 
            print("="*40)
            print(reader.__name__)
        score = word_similarity_evaluation(reader, df, verbose=verbose)
        scores[reader.__name__] = score
        if verbose:            
            print('Spearman r: {0:0.03f}'.format(score))
    mu = np.array(list(scores.values())).mean()
    if verbose:
        print("="*40)
        print("Mean Spearman r: {0:0.03f}".format(mu))
    return scores

## Baseline

My baseline is PPMI on `imdb20`:

In [12]:
imdb20 = pd.read_csv(
    os.path.join(data_home, "imdb_window20-flat.csv.gz"), index_col=0)

In [13]:
imdb20_ppmi = vsm.pmi(imdb20)
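
For readers without the course's `vsm` module at hand, here is a minimal pure-Python sketch of positive PMI on a tiny count matrix. It assumes that `vsm.pmi` defaults to the positive variant (as the variable name `imdb20_ppmi` suggests) and that cells with a zero count are mapped to 0; the function `ppmi` and the toy counts are hypothetical and not the library's implementation.

```python
import math


def ppmi(counts):
    """Positive PMI for a list-of-lists count matrix."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    result = []
    for i, row in enumerate(counts):
        out_row = []
        for j, observed in enumerate(row):
            # Expected count if row i and column j were independent.
            expected = row_sums[i] * col_sums[j] / total
            if observed == 0:
                # log(0) is undefined; assume such cells are set to 0.
                out_row.append(0.0)
            else:
                # Clip negative PMI values to 0 (the "positive" in PPMI).
                out_row.append(max(0.0, math.log(observed / expected)))
        result.append(out_row)
    return result


# Hypothetical 2x2 co-occurrence counts.
toy_counts = [
    [10, 0],
    [5, 5],
]
toy_ppmi = ppmi(toy_counts)
```

Only cells that occur more often than independence would predict keep a nonzero value; the rest are clipped to 0.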

In [14]:
full_word_similarity_evaluation(imdb20_ppmi, verbose=True)

wordsim353_reader
Evaluation vocab: 418 of 437
Spearman r: 0.469
mturk287_reader
Evaluation vocab: 499 of 499
Spearman r: 0.599
mturk771_reader
Evaluation vocab: 1,113 of 1,113
Spearman r: 0.462
men_reader
Evaluation vocab: 751 of 751
Spearman r: 0.572
Mean Spearman r: 0.525


{'wordsim353_reader': 0.46888766456156583,
 'mturk287_reader': 0.5986597600532505,
 'mturk771_reader': 0.4615212813179131,
 'men_reader': 0.5724487594295152}
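
Note that `full_word_similarity_evaluation` prints the macro-average but returns only the per-dataset dictionary. A small sketch of recovering the mean from the returned values, using the baseline scores shown above (the variable names here are my own):

```python
from statistics import mean

# Per-dataset scores copied from the baseline output above.
baseline_scores = {
    'wordsim353_reader': 0.46888766456156583,
    'mturk287_reader': 0.5986597600532505,
    'mturk771_reader': 0.4615212813179131,
    'men_reader': 0.5724487594295152,
}

# Macro-average: each dataset counts equally, regardless of its size.
macro_average = mean(baseline_scores.values())
summary = "Mean Spearman r: {0:0.03f}".format(macro_average)
```

This reproduces the `Mean Spearman r: 0.525` line printed for the baseline.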

In [15]:
# 1: try each of the other count matrices as a candidate starting point
imdb5 = pd.read_csv(
    os.path.join(data_home, "imdb_window5-scaled.csv.gz"), index_col=0)
imdb5_ppmi = vsm.pmi(imdb5)
full_word_similarity_evaluation(imdb5_ppmi, verbose=True)

giga5 = pd.read_csv(
    os.path.join(data_home, "gigaword_window5-scaled.csv.gz"), index_col=0)
print(giga5.shape)
giga5_ppmi = vsm.pmi(giga5)
full_word_similarity_evaluation(giga5_ppmi, verbose=True)

giga20 = pd.read_csv(
    os.path.join(data_home, "gigaword_window20-flat.csv.gz"), index_col=0)
print(giga20.shape)
giga20_ppmi = vsm.pmi(giga20)
full_word_similarity_evaluation(giga20_ppmi, verbose=True)


wordsim353_reader
Evaluation vocab: 418 of 437
Spearman r: 0.540
mturk287_reader
Evaluation vocab: 499 of 499
Spearman r: 0.516
mturk771_reader
Evaluation vocab: 1,113 of 1,113
Spearman r: 0.448
men_reader
Evaluation vocab: 751 of 751
Spearman r: 0.614
Mean Spearman r: 0.530
(5000, 5000)
wordsim353_reader
Evaluation vocab: 418 of 437
Spearman r: 0.411
mturk287_reader
Evaluation vocab: 499 of 499
Spearman r: 0.536
mturk771_reader
Evaluation vocab: 1,113 of 1,113
Spearman r: 0.394
men_reader
Evaluation vocab: 751 of 751
Spearman r: 0.639
Mean Spearman r: 0.495
(5000, 5000)
wordsim353_reader
Evaluation vocab: 418 of 437
Spearman r: 0.575
mturk287_reader
Evaluation vocab: 499 of 499
Spearman r: 0.662
mturk771_reader
Evaluation vocab: 1,113 of 1,113
Spearman r: 0.509
men_reader
Evaluation vocab: 751 of 751
Spearman r: 0.642
Mean Spearman r: 0.597


{'wordsim353_reader': 0.5746437768533685,
 'mturk287_reader': 0.6617722569937702,
 'mturk771_reader': 0.5091371613882787,
 'men_reader': 0.6417502972889049}

In [16]:
# 2: simple operations on giga20. My experiments on bake-off semantic orientation 
# showed mild improvements if I normalized column-first before and after ppmi.
g20_a = giga20.apply(vsm.length_norm, axis=1)
g20_b =  g20_a.apply(vsm.length_norm, axis=0)
g20_c =    vsm.pmi(g20_b)
g20_d =  g20_c.apply(vsm.length_norm, axis=1)
g20_e =  g20_d.apply(vsm.length_norm, axis=0)
g20_ppmi_nrm = g20_e
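
One property of this pipeline is easy to miss: normalizing rows and then columns leaves only the columns at unit length, since the second step rescales the rows again. In pandas, `apply(..., axis=1)` passes each row to the function and `axis=0` passes each column, so the code above normalizes rows first. The pure-Python sketch below illustrates this on a toy matrix, assuming `vsm.length_norm` divides a vector by its L2 norm; the helper names and the matrix are hypothetical.

```python
import math


def length_norm(vec):
    """Divide a vector by its L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def norm_rows(matrix):
    """Length-normalize every row."""
    return [length_norm(row) for row in matrix]


def norm_cols(matrix):
    """Length-normalize every column, returning a row-major matrix."""
    cols = [length_norm(col) for col in zip(*matrix)]
    return [list(row) for row in zip(*cols)]


# Hypothetical toy matrix.
toy = [[3.0, 4.0], [1.0, 0.0]]
after_rows = norm_rows(toy)
after_both = norm_cols(after_rows)

row_lengths = [math.sqrt(sum(x * x for x in row)) for row in after_both]
col_lengths = [math.sqrt(sum(x * x for x in col)) for col in zip(*after_both)]
```

After both steps every column has length 1, while the rows generally do not. This matters little for cosine distance, which is invariant to row length, but it would matter for `vsm.euclidean`.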

In [12]:
# 2: evaluate the normalized PPMI matrix
full_word_similarity_evaluation(g20_ppmi_nrm, verbose=True)

wordsim353_reader
Evaluation vocab: 418 of 437
Spearman r: 0.557
mturk287_reader
Evaluation vocab: 499 of 499
Spearman r: 0.677
mturk771_reader
Evaluation vocab: 1,113 of 1,113
Spearman r: 0.571
men_reader
Evaluation vocab: 751 of 751
Spearman r: 0.683
Mean Spearman r: 0.622


{'wordsim353_reader': 0.5569673815113725,
 'mturk287_reader': 0.6774560838350027,
 'mturk771_reader': 0.5708530445701437,
 'men_reader': 0.683405003483989}

In [17]:
# 3: LSA at several dimensionalities on top of the normalized PPMI matrix
g20_pn_lsa30 = vsm.lsa(g20_ppmi_nrm, k=30)
full_word_similarity_evaluation(g20_pn_lsa30, verbose=True)
print('-'*40+'\n')
g20_pn_lsa100 = vsm.lsa(g20_ppmi_nrm, k=100)
full_word_similarity_evaluation(g20_pn_lsa100, verbose=True)
print('-'*40+'\n')
g20_pn_lsa300 = vsm.lsa(g20_ppmi_nrm, k=300)
full_word_similarity_evaluation(g20_pn_lsa300, verbose=True)
print('-'*40+'\n')


wordsim353_reader
Evaluation vocab: 418 of 437
Spearman r: 0.452
mturk287_reader
Evaluation vocab: 499 of 499
Spearman r: 0.617
mturk771_reader
Evaluation vocab: 1,113 of 1,113
Spearman r: 0.474
men_reader
Evaluation vocab: 751 of 751
Spearman r: 0.588
Mean Spearman r: 0.533
----------------------------------------



KeyboardInterrupt: 

In [29]:
from mittens import GloVe as glove
glove_init = glove(max_iter=10)
g20_ppmi_nrm_glv = glove_init.fit(g20_ppmi_nrm.values)


Iteration 10: error 34419.5268

In [34]:
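# Note: X holds the 5000x5000 matrix of dot products between the learned
# GloVe vectors, so the evaluation runs on those rows, not on the vectors themselves.
print(g20_ppmi_nrm_glv.dot(g20_ppmi_nrm_glv.T).shape)
X = pd.DataFrame(g20_ppmi_nrm_glv.dot(g20_ppmi_nrm_glv.T))
X.index = g20_ppmi_nrm.index
full_word_similarity_evaluation(X, verbose=True)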

(5000, 5000)
wordsim353_reader
Evaluation vocab: 418 of 437
Spearman r: 0.048
mturk287_reader
Evaluation vocab: 499 of 499
Spearman r: -0.011
mturk771_reader
Evaluation vocab: 1,113 of 1,113
Spearman r: 0.081
men_reader
Evaluation vocab: 751 of 751
Spearman r: 0.092
Mean Spearman r: 0.053


{'wordsim353_reader': 0.04764432179235064,
 'mturk287_reader': -0.01071538218942911,
 'mturk771_reader': 0.08119077411151314,
 'men_reader': 0.09200487951255376}

## Bake-off submission

1. The name of the count matrix you started with (must be one in `vsmdata`).
1. A description of the steps you took to create your bake-off VSM. This must be different from the above baseline.
1. Your Spearman r value for each of the four evaluation datasets and your average across all four.

Submission URL: https://goo.gl/forms/eJWHpwJlOyDEC3V63