# Homework and bake-off: word-level entailment with neural networks

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Fall 2020"

## Contents

1. [Overview](#Overview)
1. [Set-up](#Set-up)
1. [Data](#Data)
1. [Baseline](#Baseline)
  1. [Representing words: vector_func](#Representing-words:-vector_func)
  1. [Combining words into inputs: vector_combo_func](#Combining-words-into-inputs:-vector_combo_func)
  1. [Classifier model](#Classifier-model)
  1. [Baseline results](#Baseline-results)
1. [Homework questions](#Homework-questions)
  1. [Hypothesis-only baseline [2 points]](#Hypothesis-only-baseline-[2-points])
  1. [Alternatives to concatenation [2 points]](#Alternatives-to-concatenation-[2-points])
  1. [A deeper network [2 points]](#A-deeper-network-[2-points])
  1. [Your original system [3 points]](#Your-original-system-[3-points])
1. [Bake-off [1 point]](#Bake-off-[1-point])

## Overview

The general problem is word-level natural language inference. Training examples are pairs of words $(w_{L}, w_{R}), y$ with $y = 1$ if $w_{L}$ entails $w_{R}$, otherwise $0$.

The homework questions below ask you to define baseline models for this and develop your own system for entry in the bake-off, which will take place on a held-out test-set distributed at the start of the bake-off. (Thus, all the data you have available for development is available for training your final system before the bake-off begins.)

## Set-up

See [the first notebook in this unit](nli_01_task_and_data.ipynb) for set-up instructions.

In [2]:
from collections import defaultdict
import json
import numpy as np
import os
import pandas as pd
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier
import nli
import utils

In [3]:
DATA_HOME = 'data'

NLIDATA_HOME = os.path.join(DATA_HOME, 'nlidata')

wordentail_filename = os.path.join(
    NLIDATA_HOME, 'nli_wordentail_bakeoff_data.json')

GLOVE_HOME = os.path.join(DATA_HOME, 'glove.6B')

In [116]:
utils.fix_random_seeds()

In [117]:
# %env PYTHONHASHSEED=0

env: PYTHONHASHSEED=0


## Data

I've processed the data into a train/dev split that is designed to put some pressure on our models to actually learn these semantic relations, as opposed to exploiting regularities in the sample. 

The defining feature of the dataset is that the `train` and `dev` __vocabularies__ are disjoint. That is, if a word `w` appears in a training pair, it does not occur in any dev pair. It follows from this that there are also no word-pairs shared between train and dev, as you would expect. This should require your models to learn abstract relationships, as opposed to memorizing incidental properties of individual words in the dataset.
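To make the disjointness property concrete, here is a minimal sketch of how one might check it on a toy dataset in the same `[[w_left, w_right], label]` format. The helper names `vocab_of` and `vocab_overlap_size` and the toy dev pairs are made up for illustration; this is not the implementation of `nli.get_vocab_overlap_size`.

```python
def vocab_of(pairs):
    """Collect every word appearing in a list of [[w_left, w_right], label] examples."""
    return {w for (w_left, w_right), _label in pairs for w in (w_left, w_right)}


def vocab_overlap_size(train, dev):
    """Number of words that occur in both splits (0 means word-disjoint)."""
    return len(vocab_of(train) & vocab_of(dev))


# Toy data; the dev pairs are hypothetical and not taken from the real dataset.
toy_train = [[["abode", "house"], 1], [["abortion", "anaemia"], 0]]
toy_dev = [[["puppy", "dog"], 1], [["car", "tree"], 0]]

print(vocab_overlap_size(toy_train, toy_dev))
```

If the splits share no words, they cannot share a word pair either, which is why the pair-level overlap is zero whenever the vocabulary-level overlap is.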

In [5]:
with open(wordentail_filename) as f:
    wordentail_data = json.load(f)

The keys are the splits plus a list giving the vocabulary for the entire dataset:

In [6]:
wordentail_data.keys()

dict_keys(['dev', 'train', 'vocab'])

In [7]:
wordentail_data['train'][: 5]

[[['abode', 'house'], 1],
 [['abortion', 'anaemia'], 0],
 [['abortion', 'aneurysm'], 0],
 [['abortion', 'blindness'], 0],
 [['abortion', 'deafness'], 0]]

In [8]:
nli.get_vocab_overlap_size(wordentail_data)

0

Because no words are shared between `train` and `dev`, no pairs are either:

In [9]:
nli.get_pair_overlap_size(wordentail_data)

0

Here is the label distribution:

In [10]:
pd.DataFrame(wordentail_data['train'])[1].value_counts()

0    7000
1    1283
Name: 1, dtype: int64

This is a challenging label distribution – there are more than 5 times as many non-entailment cases as entailment cases.
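To see why this imbalance matters for evaluation, the sketch below scores a trivial system that always predicts class `0` on a label list with the training counts shown above (7000 vs. 1283). The helpers `f1_for_class` and `macro_f1` are hand-rolled for illustration and are not the functions used inside `nli.wordentail_experiment`.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, returning 0.0 when the class is never correctly predicted."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Same label counts as the train split reported above.
y_true = [0] * 7000 + [1] * 1283
always_zero = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, always_zero)) / len(y_true)
majority_macro_f1 = macro_f1(y_true, always_zero)
print(round(accuracy, 3), round(majority_macro_f1, 3))
```

The always-`0` system reaches roughly 85% accuracy but a macro-F1 below 0.5, because the entailment class contributes an F1 of zero. This is why macro-F1, rather than accuracy, is the number to watch throughout this notebook.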

## Baseline

Even in deep learning, __feature representation is vital and requires care!__ For our task, feature representation has two parts: representing the individual words and combining those representations into a single network input.

### Representing words: vector_func

Let's consider two baseline word representation methods:

1. Random vectors (as returned by `utils.randvec`).
1. 50-dimensional GloVe representations.

In [11]:
def randvec(w, n=50, lower=-1.0, upper=1.0):
    """Returns a random vector of length `n`. `w` is ignored."""
    return utils.randvec(n=n, lower=lower, upper=upper)

In [12]:
def load_glove50():
    glove_src = os.path.join(GLOVE_HOME, 'glove.6B.50d.txt')
    # Creates a dict mapping strings (words) to GloVe vectors:
    GLOVE = utils.glove2dict(glove_src)
    return GLOVE

GLOVE = load_glove50()

def glove_vec(w):
    """Return `w`'s GloVe representation if available, else return
    a random vector."""
    return GLOVE.get(w, randvec(w, n=50))
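One property of `glove_vec` worth noting: because `randvec` is called on every lookup, a word that is missing from GloVe gets a *different* random vector each time it is seen. The sketch below shows one possible alternative that caches a single random vector per unknown word. The factory name `make_cached_vector_func`, its parameters, and the toy lookup table are assumptions for illustration, not part of the course code.

```python
import numpy as np


def make_cached_vector_func(lookup, n=50, seed=0):
    """Return a `vector_func` that gives each out-of-vocabulary word one
    fixed random vector instead of a fresh one on every call."""
    rng = np.random.default_rng(seed)
    oov_cache = {}

    def vector_func(w):
        if w in lookup:
            return lookup[w]
        if w not in oov_cache:
            # Drawn once per unknown word, then reused on later calls.
            oov_cache[w] = rng.uniform(-1.0, 1.0, size=n)
        return oov_cache[w]

    return vector_func


# Toy stand-in for the GloVe dict (hypothetical values).
toy_lookup = {"house": np.ones(50)}
cached_vec = make_cached_vector_func(toy_lookup)

print(np.array_equal(cached_vec("zzz"), cached_vec("zzz")))
```

Whether this helps in practice depends on how many words in the dataset lack GloVe vectors; it mainly ensures that a given unknown word is represented consistently across the pairs it appears in.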

### Combining words into inputs: vector_combo_func

Here we decide how to combine the two word vectors into a single representation. In more detail, where `u` is a vector representation of the left word and `v` is a vector representation of the right word, we need a function `vector_combo_func` such that `vector_combo_func(u, v)` returns a new input vector `z` of dimension `m`. A simple example is concatenation:

In [13]:
def vec_concatenate(u, v):
    """Concatenate np.array instances `u` and `v` into a new np.array"""
    return np.concatenate((u, v))

`vector_combo_func` could instead be vector average, vector difference, etc. (even combinations of those) – there's lots of space for experimentation here; [homework question 2](#Alternatives-to-concatenation-[2-points]) below pushes you to do some exploration.
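As a rough illustration of that design space, here are two further combination functions. The names `vec_average` and `vec_concat_diff_prod` are invented for this sketch; the only requirement is that each takes two vectors of the same length and returns one vector of fixed dimension `m`.

```python
import numpy as np


def vec_average(u, v):
    """Element-wise mean; output dimension equals the input dimension."""
    return (u + v) / 2


def vec_concat_diff_prod(u, v):
    """Concatenate both vectors with their difference and element-wise
    product; output dimension is four times the input dimension."""
    return np.concatenate((u, v, u - v, u * v))


u = np.array([1.0, 2.0, 3.0])
v = np.array([3.0, 2.0, 1.0])

print(vec_average(u, v).shape, vec_concat_diff_prod(u, v).shape)
```

Note that averaging is symmetric in `u` and `v`, so it cannot distinguish "`u` entails `v`" from "`v` entails `u`", whereas concatenation and difference preserve the order of the pair.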

### Classifier model

For a baseline model, I chose `TorchShallowNeuralClassifier`:

In [14]:
net = TorchShallowNeuralClassifier(early_stopping=True)

### Baseline results

The following puts the above pieces together, using `vector_func=glove_vec`, since `vector_func=randvec` seems so hopelessly misguided for our problem!

In [123]:
baseline_experiment = nli.wordentail_experiment(
    train_data=wordentail_data['train'],
    assess_data=wordentail_data['dev'],
    model=net,
    vector_func=glove_vec,
    vector_combo_func=vec_concatenate)

Stopping after epoch 140. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 0.9795947298407555
              precision    recall  f1-score   support

           0      0.870     0.948     0.907      1732
           1      0.494     0.263     0.344       334

    accuracy                          0.837      2066
   macro avg      0.682     0.606     0.625      2066
weighted avg      0.809     0.837     0.816      2066



In [16]:
baseline_experiment['macro-F1']

0.605004497903557

## Homework questions

Please embed your homework responses in this notebook, and do not delete any cells from the notebook. (You are free to add as many cells as you like as part of your responses.)

### Hypothesis-only baseline [2 points]

During our discussion of SNLI and MultiNLI, we noted that a number of research teams have shown that hypothesis-only baselines for NLI tasks can be remarkably robust. This question asks you to explore briefly how this baseline affects our task.

For this problem, submit two functions:

1. A `vector_combo_func` function called `hypothesis_only` that simply throws away the premise, using the unmodified hypothesis (second) vector as its representation of the example.

1. A function called `run_hypothesis_only_evaluation` that does the following:
    1. Loops over the two `vector_combo_func` values `vec_concatenate` and `hypothesis_only`, calling `nli.wordentail_experiment` to train on the 'train' portion and assess on the 'dev' portion, with `glove_vec` as the `vector_func`. So that the results are consistent, use an `sklearn.linear_model.LogisticRegression` with default parameters as the model.
    1. Returns a `dict` mapping `function_name` strings to the 'macro-F1' score for that pair, as returned by the call to `nli.wordentail_experiment`. (Tip: you can get the `str` name of, e.g., `hypothesis_only` with `hypothesis_only.__name__`.)
    
The functions `test_hypothesis_only` and `test_run_hypothesis_only_evaluation` will help ensure that your functions have the desired logic.

In [17]:
def hypothesis_only(u, v):
    pass
    ##### YOUR CODE HERE
    return v


def run_hypothesis_only_evaluation():
    pass
    ##### YOUR CODE HERE
    from sklearn.linear_model import LogisticRegression
    
    mapping_function_macro = {}
    for vector_combo_func in [vec_concatenate, hypothesis_only]:
        experiment = nli.wordentail_experiment(
                        train_data=wordentail_data['train'],
                        assess_data=wordentail_data['dev'],
                        model=LogisticRegression(),
                        vector_func=glove_vec,
                        vector_combo_func=vector_combo_func)
        mapping_function_macro[vector_combo_func.__name__] = experiment['macro-F1']
    
    return mapping_function_macro


In [18]:
def test_hypothesis_only(hypothesis_only):
    v = hypothesis_only(1, 2)
    assert v == 2

In [19]:
test_hypothesis_only(hypothesis_only)

In [20]:
def test_run_hypothesis_only_evaluation(run_hypothesis_only_evaluation):
    results = run_hypothesis_only_evaluation()
    assert all(x in results for x in ('hypothesis_only', 'vec_concatenate')), \
        ("The return value of `run_hypothesis_only_evaluation` does not "
         "have the intended kind of keys.")
    assert isinstance(results['vec_concatenate'], float), \
        ("The values of the `run_hypothesis_only_evaluation` result "
         "should be floats.")

In [21]:
test_run_hypothesis_only_evaluation(run_hypothesis_only_evaluation)

              precision    recall  f1-score   support

           0      0.862     0.955     0.906      1732
           1      0.473     0.210     0.290       334

    accuracy                          0.834      2066
   macro avg      0.668     0.582     0.598      2066
weighted avg      0.799     0.834     0.807      2066

              precision    recall  f1-score   support

           0      0.853     0.973     0.909      1732
           1      0.483     0.129     0.203       334

    accuracy                          0.837      2066
   macro avg      0.668     0.551     0.556      2066
weighted avg      0.793     0.837     0.795      2066



### Alternatives to concatenation [2 points]

We've so far just used vector concatenation to represent the premise and hypothesis words. This question asks you to explore two simple alternatives:

1. Write a function `vec_diff` that, for a given pair of vector inputs `u` and `v`, returns the element-wise difference between `u` and `v`.

1. Write a function `vec_max` that, for a given pair of vector inputs `u` and `v`, returns the element-wise max values between `u` and `v`.

You needn't include your uses of `nli.wordentail_experiment` with these functions, but we assume you'll be curious to see how they do!

In [22]:
def vec_diff(u, v):
    pass
    ##### YOUR CODE HERE
    assert len(u) == len(v)
    return u - v
        


def vec_max(u, v):
    pass
    ##### YOUR CODE HERE
    return np.maximum(u,v)



In [23]:
def test_vec_diff(vec_diff):
    u = np.array([10.2, 8.1])
    v = np.array([1.2, -7.1])
    result = vec_diff(u, v)
    expected = np.array([9.0, 15.2])
    assert np.array_equal(result, expected), \
        "Expected {}; got {}".format(expected, result)

In [24]:
test_vec_diff(vec_diff)

In [25]:
def test_vec_max(vec_max):
    u = np.array([1.2,  8.1])
    v = np.array([10.2, -7.1])
    result = vec_max(u, v)
    expected = np.array([10.2, 8.1])
    assert np.array_equal(result, expected), \
        "Expected {}; got {}".format(expected, result)

In [26]:
test_vec_max(vec_max)

In [27]:
def run_diff_max_evaluation():
    pass
    ##### YOUR CODE HERE
    from sklearn.linear_model import LogisticRegression
    
    mapping_function_macro = {}
    for vector_combo_func in [vec_concatenate, vec_diff, vec_max]:
        experiment = nli.wordentail_experiment(
                        train_data=wordentail_data['train'],
                        assess_data=wordentail_data['dev'],
                        model=net,
                        vector_func=glove_vec,
                        vector_combo_func=vector_combo_func)
        mapping_function_macro[vector_combo_func.__name__] = experiment['macro-F1']
    
    return mapping_function_macro

run_diff_max_evaluation()


Stopping after epoch 40. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 2.4564234614372253
              precision    recall  f1-score   support

           0      0.866     0.942     0.902      1732
           1      0.445     0.243     0.314       334

    accuracy                          0.829      2066
   macro avg      0.655     0.592     0.608      2066
weighted avg      0.798     0.829     0.807      2066

Stopping after epoch 18. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 3.055456727743149
              precision    recall  f1-score   support

           0      0.856     0.927     0.890      1732
           1      0.337     0.192     0.244       334

    accuracy                          0.808      2066
   macro avg      0.596     0.559     0.567      2066
weighted avg      0.772     0.808     0.786      2066

Stopping after epoch 46. Validation score did not improve by tol=1e-05 for more than 10 e

{'vec_concatenate': 0.6080276291417988,
 'vec_diff': 0.5672593557996649,
 'vec_max': 0.5589063071351901}

### A deeper network [2 points]

It is very easy to subclass `TorchShallowNeuralClassifier` if all you want to do is change the network graph: all you have to do is write a new `build_graph`. If your graph has new arguments that the user might want to set, then you should also redefine `__init__` so that these values are accepted and set as attributes.

For this question, please subclass `TorchShallowNeuralClassifier` so that it defines the following graph:

$$
\begin{align}
h_{1} &= xW_{1} + b_{1} \\
r_{1} &= \textbf{Bernoulli}(1 - \textbf{dropout_prob}, n) \\
d_{1} &= r_1 * h_{1} \\
h_{2} &= f(d_{1}) \\
h_{3} &= h_{2}W_{2} + b_{2}
\end{align}
$$

Here, $r_{1}$ and $d_{1}$ define a dropout layer: $r_{1}$ is a random binary vector of dimension $n$, where the probability of a value being $1$ is given by $1 - \textbf{dropout_prob}$. $r_{1}$ is multiplied element-wise by our first hidden representation, thereby zeroing out some of the values. The result is fed to the user's activation function $f$, and the result of that is fed through another linear layer to produce $h_{3}$. (Inside `TorchShallowNeuralClassifier`, $h_{3}$ is the basis for a softmax classifier; no activation function is applied to it because the softmax scaling is handled internally by the loss function.)
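The equations above can be traced step by step in plain NumPy. This is only a sketch of the math as written, with made-up weights and a hypothetical function name `deep_forward`; it is not the PyTorch solution to the question.

```python
import numpy as np


def deep_forward(x, W1, b1, W2, b2, dropout_prob, f=np.tanh, rng=None):
    """Forward pass following the equations literally (training-time view)."""
    rng = np.random.default_rng(0) if rng is None else rng
    h1 = x @ W1 + b1                                        # first linear layer
    r1 = rng.binomial(1, 1.0 - dropout_prob, size=h1.shape)  # 1 = keep, 0 = drop
    d1 = r1 * h1                                            # zero out dropped units
    h2 = f(d1)                                              # activation
    h3 = h2 @ W2 + b2                                       # output logits
    return h3, r1


# Random toy parameters: 4 input features, 3 hidden units, 2 classes.
rng = np.random.default_rng(1)
x = rng.normal(size=(2, 4))
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)

logits, mask = deep_forward(x, W1, b1, W2, b2, dropout_prob=0.5)
print(mask)
```

One difference to be aware of: during training, `nn.Dropout` also rescales the surviving values by `1 / (1 - p)`, and in evaluation mode it passes its input through unchanged. The sketch omits both behaviors to stay close to the equations.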

For your implementation, please use `nn.Sequential`, `nn.Linear`, and `nn.Dropout` to define the required layers.

For comparison, using this notation, `TorchShallowNeuralClassifier` defines the following graph:

$$
\begin{align}
h_{1} &= xW_{1} + b_{1} \\
h_{2} &= f(h_{1}) \\
h_{3} &= h_{2}W_{2} + b_{2}
\end{align}
$$

The following code starts this sub-class for you, so that you can concentrate on `build_graph`. Be sure to make use of `self.dropout_prob`.

For this problem, submit just your completed  `TorchDeepNeuralClassifier`. You needn't evaluate it, though we assume you will be keen to do that!

You can use `test_TorchDeepNeuralClassifier` to ensure that your network has the intended structure.

In [28]:
import torch.nn as nn

class TorchDeepNeuralClassifier(TorchShallowNeuralClassifier):
    def __init__(self, dropout_prob=0.7, **kwargs):
        self.dropout_prob = dropout_prob
        super().__init__(**kwargs)

    def build_graph(self):
        """Complete this method!

        Returns
        -------
        an `nn.Module` instance, which can be a free-standing class you
        write yourself, as in `torch_rnn_classifier`, or the output of
        `nn.Sequential`, as in `torch_shallow_neural_classifier`.

        """
        pass
        ##### YOUR CODE HERE
        return nn.Sequential(
                    nn.Linear(self.input_dim, self.hidden_dim),
                    nn.Dropout(self.dropout_prob),
                    self.hidden_activation,
                    nn.Linear(self.hidden_dim, self.n_classes_)
        )



In [29]:
def test_TorchDeepNeuralClassifier(TorchDeepNeuralClassifier):
    dropout_prob = 0.55
    assert hasattr(TorchDeepNeuralClassifier(), "dropout_prob"), \
        "TorchDeepNeuralClassifier must have an attribute `dropout_prob`."
    try:
        inst = TorchDeepNeuralClassifier(dropout_prob=dropout_prob)
    except TypeError:
        raise TypeError("TorchDeepNeuralClassifier must allow the user "
                        "to set `dropout_prob` on initialization")
    inst.input_dim = 10
    inst.n_classes_ = 5
    graph = inst.build_graph()
    assert len(graph) == 4, \
        "The graph should have 4 layers; yours has {}".format(len(graph))
    expected = {
        0: 'Linear',
        1: 'Dropout',
        2: 'Tanh',
        3: 'Linear'}
    for i, label in expected.items():
        name = graph[i].__class__.__name__
        assert label in name, \
            ("The {} layer of the graph should be a {} layer; "
            "yours is {}".format(i, label, name))
    assert graph[1].p == dropout_prob, \
        ("The user's value for `dropout_prob` should be the value of "
         "`p` for the Dropout layer.")

In [30]:
test_TorchDeepNeuralClassifier(TorchDeepNeuralClassifier)

In [31]:
deep_experiment = nli.wordentail_experiment(
    train_data=wordentail_data['train'],
    assess_data=wordentail_data['dev'],
    model=TorchDeepNeuralClassifier(early_stopping=True),
    vector_func=glove_vec,
    vector_combo_func=vec_concatenate)

Stopping after epoch 50. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 2.483033210039139
              precision    recall  f1-score   support

           0      0.865     0.951     0.906      1732
           1      0.475     0.231     0.310       334

    accuracy                          0.834      2066
   macro avg      0.670     0.591     0.608      2066
weighted avg      0.802     0.834     0.810      2066



### Your original system [3 points]

This is a simple dataset, but its "word-disjoint" nature ensures that it's a challenging one, and there are lots of modeling strategies one might adopt. 

You are free to do whatever you like. We require only that your system differ in some way from those defined in the preceding questions. They don't have to be completely different, though. For example, you might want to stick with the model but represent examples differently, or the reverse.

You are free to use different pretrained word vectors and the like.

Please embed your code in this notebook so that we can rerun it.

In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies.  We also ask that you report the best score your system got during development, just to help us understand how systems performed overall.

In [159]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    from imblearn.over_sampling import SMOTE
    from nli import word_entail_featurize, classification_report
    from nltk.corpus import wordnet as wn


    def one_shared_hypernym(w1, w2):
        """
        A (maybe unreliable) way to get the lowest common hypernym shared by words w1 and w2
        """
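        try:
            # Compare only the first synset listed for each word.
            s1, s2 = wn.synsets(w1)[0], wn.synsets(w2)[0]
            return s1.lowest_common_hypernyms(s2)[0].lemma_names()[0]
        except Exception:
            # E.g. a word has no synsets, or the synsets share no hypernym.
            return ' '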

    def custom_word_entail_featurize(data, vector_func, vector_combo_func=vec_concatenate):
        X = []
        y = []
        for (w1, w2), label in data:
            rep = vector_combo_func(vector_func(w1), vector_func(w2))
            hypernym = one_shared_hypernym(w1, w2)
            rep = vector_combo_func(rep, vector_func(hypernym))
            X.append(rep)
            y.append(label)
        return X, y


    def custom_wordentail_experiment(
            train_data,
            assess_data,
            vector_func,
            vector_combo_func,
            param_grid,
            model):
        """Custom training and evaluation code for the word-level entailment task.

        Parameters
        ----------
        train_data : list
        assess_data : list
        vector_func : function
            Any function mapping words in the vocab for `wordentail_data`
            to vector representations
        vector_combo_func : function
            Any function for combining two vectors into a new vector
            of fixed dimensionality
        param_grid : dict
            Grid for the hyperparameter search; if empty, `model` is
            fit directly with no search
        model : class with `fit` and `predict` methods

        Prints
        ------
        To standard output
            An sklearn classification report for `assess_data`.

        Returns
        -------
        dict with structure

            'best model': the trained model
            'train_data': train_data
            'assess_data': assess_data
            'macro-F1': score for `assess_data`
            'vector_func': vector_func
            'vector_combo_func': vector_combo_func

        We pass 'vector_func' and 'vector_combo_func' through to ensure alignment
        between these experiments and the bake-off evaluation.

        """
        X_train, y_train = custom_word_entail_featurize(
            train_data,  vector_func, vector_combo_func)
        X_dev, y_dev = custom_word_entail_featurize(
            assess_data, vector_func, vector_combo_func)
        # oversample with SMOTE
        oversample = SMOTE()
        X_train, y_train = oversample.fit_resample(X_train, y_train)

        if len(param_grid) > 0:
            bestmod = utils.fit_classifier_with_hyperparameter_search(
                X_train, y_train, model, cv=5, param_grid=param_grid)
        else:
            bestmod = model
            bestmod.fit(X_train, y_train)
        predictions = bestmod.predict(X_dev)
        # Report:
        print(classification_report(y_dev, predictions, digits=3))
        macrof1 = utils.safe_macro_f1(y_dev, predictions)
        return {
            'best model': bestmod,
            'train_data': train_data,
            'assess_data': assess_data,
            'macro-F1': macrof1,
            'vector_func': vector_func,
            'vector_combo_func': vector_combo_func}
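For readers unfamiliar with SMOTE, the idea behind the oversampling step above can be sketched in a few lines of NumPy: each synthetic minority example is placed at a random point on the line segment between a minority example and one of its nearest minority neighbors. The function `smote_like_oversample` below is a simplified illustration for binary labels and small arrays; it is not the `imblearn` implementation and omits many of its options.

```python
import numpy as np


def smote_like_oversample(X, y, minority_label=1, k=5, seed=0):
    """Add interpolated minority examples until both classes have equal size.
    Assumes binary labels and a dataset small enough for pairwise distances."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_min = X[y == minority_label]
    n_needed = int((y != minority_label).sum() - len(X_min))
    if n_needed <= 0 or len(X_min) < 2:
        return X, y
    k = min(k, len(X_min) - 1)
    # Pairwise distances among minority examples; exclude self-matches.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]
    # Pick a base example, one of its k neighbors, and an interpolation weight.
    base = rng.integers(0, len(X_min), size=n_needed)
    nbr = neighbors[base, rng.integers(0, k, size=n_needed)]
    gap = rng.random((n_needed, 1))
    synthetic = X_min[base] + gap * (X_min[nbr] - X_min[base])
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_out, y_out


# Toy data: 8 majority points at the origin, 3 minority points elsewhere.
X_toy = np.array([[0.0, 0.0]] * 8 + [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0]])
y_toy = np.array([0] * 8 + [1] * 3)
X_bal, y_bal = smote_like_oversample(X_toy, y_toy, k=2)
print(X_bal.shape, np.bincount(y_bal))
```

Only the training data should be resampled, as in the function above. One caveat, offered as a possibility rather than a diagnosis: when oversampling happens before a cross-validated search, synthetic points in one fold are interpolated from real points in other folds, so the cross-validation scores may be more optimistic than the dev-set scores.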

In [148]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    import torch.nn as nn


    class TorchThreeLayerNeuralClassifier(TorchShallowNeuralClassifier):
        """A three-layer network whose hidden layers have the same dimension:

        [Linear -> Dropout -> Activation] -> [Linear -> Dropout -> Activation] -> Linear
        """
        def __init__(self, dropout_prob=0.7, **kwargs):
            self.dropout_prob = dropout_prob
            super().__init__(**kwargs)
            self.params += ['dropout_prob']

        def build_graph(self):
            """Build the three-layer graph described in the class docstring.

            Returns
            -------
            an `nn.Module` instance, here the output of `nn.Sequential`.

            """
            return nn.Sequential(
                        nn.Linear(self.input_dim, self.hidden_dim),
                        nn.Dropout(self.dropout_prob),
                        self.hidden_activation,
                        nn.Linear(self.hidden_dim, self.hidden_dim),
                        nn.Dropout(self.dropout_prob),
                        self.hidden_activation,
                        nn.Linear(self.hidden_dim, self.n_classes_)
            )


In [132]:
# PLEASE MAKE SURE TO INCLUDE THE FOLLOWING BETWEEN THE START AND STOP COMMENTS:
#   1) Textual description of your system.
#   2) The code for your original system.
#   3) The score achieved by your system in place of MY_NUMBER.
#        With no other changes to that line.
#        You should report your score as a decimal value <=1.0
# PLEASE MAKE SURE NOT TO DELETE OR EDIT THE START AND STOP COMMENTS

# NOTE: MODULES, CODE AND DATASETS REQUIRED FOR YOUR ORIGINAL SYSTEM 
# SHOULD BE ADDED BELOW THE 'IS_GRADESCOPE_ENV' CHECK CONDITION. DOING
# SO ABOVE THE CHECK MAY CAUSE THE AUTOGRADER TO FAIL.

# START COMMENT: Enter your system description in this cell.
"""
System design:

Pre-processing: 
    Use SMOTE to oversample training examples of the minority class.

Features: 
    For the premise u and hypothesis v, I use WordNet to find one shared hypernym w of u and v.
    Then I concatenate (u, v, w). This gives me a huge boost in macro-F1. After all, our grammar encodes lots 
    of common sense that we use to infer word entailment.

Word vectors: 
    I use glove.6B.50d. Surprisingly, glove.42B and glove.840B didn't give much of a score boost.

Model: 
    I use a 3-layer network with dropout=0.7 and early stopping. I use ReLU as the activation function. I tuned the following 
    hyperparameters: hidden dimension, batch size, learning rate.
"""
# My peak score was: 0.768
if 'IS_GRADESCOPE_ENV' not in os.environ:
    new_experiment = custom_wordentail_experiment(
            train_data=wordentail_data['train'],
            assess_data=wordentail_data['dev'],
            model=TorchThreeLayerNeuralClassifier(early_stopping=True, hidden_activation=nn.ReLU()),
            vector_func=glove_vec,
            vector_combo_func=vec_concatenate,
            param_grid = {
                    'hidden_dim' : [200, 400, 800],
                    'batch_size': [1024, 512, 256], 
                    'eta': [0.001, 0.01]
                    }
            )
# STOP COMMENT: Please do not remove this comment.

Stopping after epoch 31. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 0.9616641979664564
Best params: {'batch_size': 512, 'eta': 0.001, 'hidden_dim': 800}
Best score: 0.970
              precision    recall  f1-score   support

           0      0.911     0.961     0.936      1732
           1      0.720     0.515     0.600       334

    accuracy                          0.889      2066
   macro avg      0.815     0.738     0.768      2066
weighted avg      0.880     0.889     0.881      2066



In [161]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    another_experiment = custom_wordentail_experiment(
            train_data=wordentail_data['train'],
            assess_data=wordentail_data['dev'],
            model=TorchThreeLayerNeuralClassifier(early_stopping=True, hidden_activation=nn.ReLU()),
            vector_func=glove_vec,
            vector_combo_func=vec_concatenate,
            param_grid = {
                    'dropout_prob' : [0.5, 0.7],
                    'hidden_dim' : [200, 400, 800, 1600],
                    'batch_size': [2048, 1024, 512], 
                    'eta': [0.001, 0.01]
                    }
            )

Stopping after epoch 35. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 0.30497921630740166
Best params: {'batch_size': 512, 'dropout_prob': 0.5, 'eta': 0.001, 'hidden_dim': 800}
Best score: 0.973
              precision    recall  f1-score   support

           0      0.898     0.969     0.932      1732
           1      0.726     0.428     0.539       334

    accuracy                          0.881      2066
   macro avg      0.812     0.698     0.735      2066
weighted avg      0.870     0.881     0.868      2066



In [155]:
### Use this cell to save and load models

# import torch
# def save_experiment(experiment, path, model_name):
#     model = experiment['best model']
#     path += model_name + '_' + str(experiment['macro-F1'])
#     torch.save(model, path)
#     print("Model saved to " + path)

# model_name = "glove50d_3layers_smote_hypernym_"
# SAVE_PATH = './NLI_models/'

# # # to save
# save_experiment(another_experiment, SAVE_PATH, model_name)
# # to load
# # the_model = torch.load(SAVE_PATH)



In [32]:
# from nltk.corpus import wordnet as wn
# from retrofitting import Retrofitter

# def get_wordnet_edges():
#     edges = defaultdict(set)
#     for ss in wn.all_synsets():
#         lem_names = {lem.name() for lem in ss.lemmas()}
#         for lem in lem_names:
#             edges[lem] |= lem_names
#     return edges

# def convert_edges_to_indices(edges, Q):
#     lookup = dict(zip(Q.index, range(Q.shape[0])))
#     index_edges = defaultdict(set)
#     for start, finish_nodes in edges.items():
#         s = lookup.get(start)
#         if s:
#             f = {lookup[n] for n in finish_nodes if n in lookup}
#             if f:
#                 index_edges[s] = f
#     return index_edges


# wn_edges = get_wordnet_edges()


In [34]:
### Some exploration work for later use...

# def load_glove50():
#     glove_src = os.path.join(GLOVE_HOME, 'glove.6B.50d.txt')
#     # Creates a dict mapping strings (words) to GloVe vectors:
#     GLOVE= utils.glove2dict(glove_src)
#     return GLOVE

# GLOVE_50 = load_glove50()


# X_glove = pd.DataFrame(GLOVE_50).T

# wn_graph = convert_edges_to_indices(wn_edges, X_glove)
# X_retro = Retrofitter(verbose=True).fit(X_glove, wn_graph)

# GLOVE_50_retro = dict(zip(X_retro.index, X_retro.values))

# def glove_50_retro_vec(w):
#     """Return `w`'s retrofitted GloVe representation if available, else return
#     a random vector."""

#     u = GLOVE_50.get(w, randvec(w, n=50))
#     v = GLOVE_50_retro.get(w, randvec(w, n=50))
    
#     return np.concatenate((u, v))

# GLOVE42_HOME = os.path.join(DATA_HOME, 'glove.42B')

# def load_big_glove():
#     glove_src = os.path.join(GLOVE42_HOME, 'glove.42B.300d.txt')
#     # Creates a dict mapping strings (words) to GloVe vectors:
#     BIG_GLOVE = utils.glove2dict(glove_src)
#     return BIG_GLOVE

# BIG_GLOVE = load_big_glove()

# def big_glove_vec(w):
#     """Return `w`'s GloVe representation if available, else return
#     a random vector."""
#     return BIG_GLOVE.get(w, randvec(w, n=300))

# X_glove = pd.DataFrame(BIG_GLOVE).T

# wn_graph = convert_edges_to_indices(wn_edges, X_glove)
# X_glove = Retrofitter(verbose=True).fit(X_glove, wn_graph)

# BIG_GLOVE_retro = dict(zip(X_glove.index, X_glove.values))

# def big_glove_retro_vec(w):
#     """Return `w`'s retrofitted GloVe representation if available, else return
#     a random vector."""

#     # u = BIG_GLOVE.get(w, randvec(w, n=300))
#     v = BIG_GLOVE_retro.get(w, randvec(w, n=300))
    
#     return v


# GloVe 42B

In [38]:
# GLOVE42_HOME = os.path.join(DATA_HOME, 'glove.42B')

# def load_big_glove():
#     glove_src = os.path.join(GLOVE42_HOME, 'glove.42B.300d.txt')
#     # Creates a dict mapping strings (words) to GloVe vectors:
#     BIG_GLOVE = utils.glove2dict(glove_src)
#     return BIG_GLOVE

# BIG_GLOVE = load_big_glove()

# def big_glove_vec(w):
#     """Return `w`'s GloVe representation if available, else return
#     a random vector."""
#     return BIG_GLOVE.get(w, randvec(w, n=300))

In [62]:
# big_glove_SMOTE_experiment = custom_wordentail_experiment(
#         train_data=wordentail_data['train'],
#         assess_data=wordentail_data['dev'],
#         model=TorchDeepNeuralClassifier(early_stopping=True),
#         vector_func=big_glove_vec,
#         vector_combo_func=vec_concatenate,
#         param_grid = {
#                 'hidden_dim' : [50, 100, 200, 400],
#                 'batch_size': [1024, 512, 256], 
#                 'eta': [0.001, 0.01]}
#         )

Stopping after epoch 47. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 0.8624679148197174
Best params: {'batch_size': 1024, 'eta': 0.01, 'hidden_dim': 400}
Best score: 0.949
              precision    recall  f1-score   support

           0      0.900     0.907     0.903      1732
           1      0.497     0.476     0.486       334

    accuracy                          0.837      2066
   macro avg      0.698     0.692     0.695      2066
weighted avg      0.835     0.837     0.836      2066



In [67]:
# X_glove = pd.DataFrame(BIG_GLOVE).T

# wn_graph = convert_edges_to_indices(wn_edges, X_glove)
# X_glove = Retrofitter(verbose=True).fit(X_glove, wn_graph)

# BIG_GLOVE_retro = dict(zip(X_glove.index, X_glove.values))

# def big_glove_retro_vec(w):
#     """Return `w`'s retrofitted GloVe representation if available, else return
#     a random vector."""

#     # u = BIG_GLOVE.get(w, randvec(w, n=300))
#     v = BIG_GLOVE_retro.get(w, randvec(w, n=300))
    
#     return v


Converged at iteration 10; change was 0.0050

In [68]:
# big_glove_retro_SMOTE_experiment = custom_wordentail_experiment(
#         train_data=wordentail_data['train'],
#         assess_data=wordentail_data['dev'],
#         model=TorchDeepNeuralClassifier(early_stopping=True),
#         vector_func=big_glove_retro_vec,
#         vector_combo_func=vec_concatenate,
#         param_grid = {
#                 'hidden_dim' : [200, 400, 800],
#                 'batch_size': [1024, 512, 256], 
#                 'eta': [0.001, 0.01]}
#         )

Stopping after epoch 39. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 0.7740469500422478
Best params: {'batch_size': 1024, 'eta': 0.01, 'hidden_dim': 800}
Best score: 0.953
              precision    recall  f1-score   support

           0      0.901     0.886     0.893      1732
           1      0.456     0.494     0.474       334

    accuracy                          0.823      2066
   macro avg      0.678     0.690     0.684      2066
weighted avg      0.829     0.823     0.826      2066



## Bake-off [1 point]

The goal of the bake-off is to achieve the highest __macro-average F1__ score on a test set that we will make available at the start of the bake-off. The announcement will go out on the discussion forum. To enter, you'll be asked to run `nli.bake_off_evaluation` on the output of your chosen `nli.wordentail_experiment` run. 
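
For intuition about the metric, here is a minimal sketch of macro-average F1 in plain Python. The function names and the toy labels are illustrative assumptions and are not part of `nli` or `utils`; this is not the course's scoring code. The point is that each class's F1 counts equally regardless of how many examples it has, so a system can have high accuracy but a much lower macro-average F1 when it does poorly on the minority class.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision and recall are 0 or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores.

    Assumption for this sketch: the classes are those seen in `y_true`.
    """
    labels = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Made-up labels for illustration: 8 negatives, 2 positives,
# with one of the positives misclassified as negative.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", round(accuracy, 3))
print("macro-F1:", round(macro_f1(y_true, y_pred), 3))
```

In this toy case a single error on the minority class leaves accuracy at 0.9 while pulling macro-average F1 down to about 0.8, which is why the class-1 row of the classification reports matters so much for the bake-off score.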

The cells below this one constitute your bake-off entry.

The rules described in the [Your original system](#Your-original-system-[3-points]) homework question are also in effect for the bake-off.

Systems that enter will receive the additional homework point, and systems that achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.

Late entries will be accepted, but they cannot earn the extra 0.5 points. Similarly, you cannot win the bake-off unless your homework is submitted on time.

The announcement will include the details on where to submit your entry.

In [192]:
# Enter your bake-off assessment code into this cell.
# Please do not remove this comment.
if 'IS_GRADESCOPE_ENV' not in os.environ:
    pass
    # Please enter your code in the scope of the above conditional.
    ##### YOUR CODE HERE
    import nli
    # Monkeypatch `nli.word_entail_featurize` with my own custom version.
    nli.word_entail_featurize = custom_word_entail_featurize
    
    test_data_filename = os.path.join(
        NLIDATA_HOME,
        "bakeoff-wordentail-data",
        "nli_wordentail_bakeoff_data-test.json")

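    # Use the new experiment's best model if its score clears the 0.76
    # threshold; otherwise fall back to a previously saved model.
    if new_experiment['micro-F1'] > 0.76:
        bakeoff_model = new_experiment.copy()
        bakeoff_model['model'] = bakeoff_model['best model']
    else:
        import torch
        bakeoff_model = {}
        bakeoff_model['model'] = torch.load("glove50d_3layers_smote_hypernym_0.768")
        # The saved model expects GloVe inputs combined by concatenation.
        bakeoff_model['vector_func'] = glove_vec
        bakeoff_model['vector_combo_func'] = vec_concatenate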

    nli.bake_off_evaluation(
        bakeoff_model,
        test_data_filename)


              precision    recall  f1-score   support

           0      0.899     0.952     0.925      2036
           1      0.652     0.456     0.537       399

    accuracy                          0.871      2435
   macro avg      0.776     0.704     0.731      2435
weighted avg      0.859     0.871     0.861      2435



In [193]:
# On an otherwise blank line in this cell, please enter
# your macro-avg f1 value as reported by the code above.
# Please enter only a number between 0 and 1 inclusive.
# Please do not remove this comment.
if 'IS_GRADESCOPE_ENV' not in os.environ:
    pass
    # Please enter your score in the scope of the above conditional.
    ##### YOUR CODE HERE
    0.731
