# Homework 2: Stanford Sentiment Treebank

In [1]:
__author__ = "Di Bai, Yipeng He, and Zijian Wang"
__version__ = "CS224u, Stanford, Spring 2019"

## Contents

1. [Overview](#Overview)
1. [Methodological note](#Methodological-note)
1. [Set-up](#Set-up)
1. [A softmax baseline](#A-softmax-baseline)
1. [RNNClassifier wrapper](#RNNClassifier-wrapper)
1. [Error analysis](#Error-analysis)
1. [Homework questions](#Homework-questions)
  1. [Sentiment words alone [2 points]](#Sentiment-words-alone-[2-points])
  1. [A more powerful vector-summing baseline [3 points]](#A-more-powerful-vector-summing-baseline-[3-points])
  1. [Your original system [4 points]](#Your-original-system-[4-points])
1. [Bake-off [1 point]](#Bake-off-[1-point])

## Overview

This homework and associated bake-off are devoted to the Stanford Sentiment Treebank (SST). The homework questions ask you to implement some baseline systems, and the bake-off challenge is to define a system that does extremely well at the SST task.

We'll focus on the ternary task as defined by `sst.ternary_class_func`.

The SST test set will be used for the bake-off evaluation. This dataset is already publicly distributed, so we are counting on people not to cheat by developing their models on the test set. You must do all your development without using the test set at all, and then evaluate exactly once on the test set and turn in the results, with no further system tuning or additional runs. __Much of the scientific integrity of our field depends on people adhering to this honor code__.

Our only additional restriction is that __you cannot make any use of the subtree labels__. This corresponds to the 'Root' condition in the paper. As we discussed in class, the subtree labels are a really interesting feature of SST, but bringing them in results in a substantially different learning problem.

One of our goals for this homework and bake-off is to encourage you to engage in __the basic development cycle for supervised models__, in which you

1. Write a new feature function. We recommend starting with something simple.
1. Use `sst.experiment` to evaluate your new feature function, with at least `fit_softmax_classifier`.
1. If you have time, compare your feature function with `unigrams_phi` using `sst.compare_models` or `sst.compare_models_mcnemar`. (For discussion, see [this notebook section](sst_02_hand_built_features.ipynb#Statistical-comparison-of-classifier-models).)
1. Return to step 1, or stop the cycle and conduct a more rigorous evaluation with hyperparameter tuning and assessment on the `dev` set.
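
As a minimal sketch of step 1, here is one possible feature function that adds bigram counts to the unigram counts. `ToyTree` is a hypothetical stand-in for the tree objects that the SST readers yield (only `leaves()` is imitated), and `bigrams_phi` is an illustrative name, not part of the course code.

```python
from collections import Counter


class ToyTree:
    """Hypothetical stand-in for an nltk Tree: it only provides `leaves()`."""

    def __init__(self, words):
        self._words = list(words)

    def leaves(self):
        return list(self._words)


def bigrams_phi(tree):
    """Count unigrams plus adjacent word pairs, with boundary symbols."""
    words = tree.leaves()
    feats = Counter(words)
    padded = ["<s>"] + words + ["</s>"]
    for left, right in zip(padded, padded[1:]):
        feats[left + " " + right] += 1
    return feats


example = ToyTree("not good at all".split())
features = bigrams_phi(example)
print(features["not good"])  # count of the bigram "not good"
```

Because it returns a `Counter`, a function like this should be usable with `sst.experiment` in the same way as `unigrams_phi`.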

[Error analysis](#Error-analysis) is one of the most important methods for steadily improving a system, as it facilitates a kind of human-powered hill-climbing on your ultimate objective. Often, it takes a careful human analyst just a few examples to spot a major pattern that can lead to a beneficial change to the feature representations.

## Methodological note

You don't have to use the experimental framework defined below (based on `sst`). However, if you don't use `sst.experiment` as below, then make sure you're training only on `train`, evaluating on `dev`, and that you report with 

```
from sklearn.metrics import classification_report
classification_report(y_dev, predictions)
```
where `y_dev = [y for tree, y in sst.dev_reader(class_func=sst.ternary_class_func)]`. We'll focus on the value at `macro avg` under `f1-score` in these reports.
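
To make the `macro avg` figure concrete, the following sketch computes a macro-averaged F1 by hand: the F1 of each class is computed separately and the results are averaged without weighting by class size. The toy labels are invented for illustration, and `macro_f1` is a hypothetical helper; scikit-learn's own handling of edge cases may differ.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, invented purely for illustration.
toy_gold = ["positive", "positive", "negative", "neutral", "negative", "neutral"]
toy_pred = ["positive", "negative", "negative", "positive", "negative", "neutral"]
toy_score = macro_f1(toy_gold, toy_pred)
print(round(toy_score, 3))
```

Since every class counts equally, a model that ignores the small `neutral` class is penalized more under this metric than under accuracy.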

## Set-up

See [the first notebook in this unit](sst_01_overview.ipynb#Set-up) for set-up instructions.

In [1]:
from collections import Counter
import numpy as np
import os
import pandas as pd
import random
from sklearn.linear_model import LogisticRegression
import sst
import torch.nn as nn
from torch_rnn_classifier import TorchRNNClassifier
from torch_tree_nn import TorchTreeNN
import utils

In [2]:
SST_HOME = os.path.join('data', 'trees')

## A softmax baseline

This example is here mainly as a reminder of how to use our experimental framework with linear models.

In [4]:
def unigrams_phi(tree):
    """The basis for a unigrams feature function.
    
    Parameters
    ----------
    tree : nltk.tree
        The tree to represent.
    
    Returns
    -------    
    Counter
        A map from strings to their counts in `tree`. (Counter maps a 
        list to a dict of counts of the elements in that list.)
    
    """
    return Counter(tree.leaves())

Thin wrapper around `LogisticRegression` for the sake of `sst.experiment`:

In [5]:
def fit_softmax_classifier(X, y):        
    mod = LogisticRegression(
        fit_intercept=True,
        solver='liblinear',
        multi_class='ovr')
    mod.fit(X, y)
    return mod

The experimental run with some notes:

In [6]:
softmax_experiment = sst.experiment(
    SST_HOME,
    unigrams_phi,                      # Free to write your own!
    fit_softmax_classifier,            # Free to write your own!
    train_reader=sst.train_reader,     # Fixed by the competition.
    assess_reader=sst.dev_reader,      # Fixed until the bake-off.
    class_func=sst.ternary_class_func) # Fixed by the bake-off rules.

              precision    recall  f1-score   support

    negative      0.628     0.689     0.657       428
     neutral      0.343     0.153     0.211       229
    positive      0.629     0.750     0.684       444

   micro avg      0.602     0.602     0.602      1101
   macro avg      0.533     0.531     0.518      1101
weighted avg      0.569     0.602     0.575      1101



`softmax_experiment` contains a lot of information that you can use for analysis; see [this section below](#Error-analysis) for starter code.

## RNNClassifier wrapper

This section illustrates how to use `sst.experiment` with RNN and TreeNN models.

To featurize examples for an RNN, we just get the words in order, letting the model take care of mapping them into an embedding space.

In [7]:
def rnn_phi(tree):
    return tree.leaves()    

The model wrapper gets the vocabulary using `utils.get_vocab`. If you want to use pretrained word representations in here, then you can have `fit_rnn_classifier` build that space too; see [this notebook section for details](sst_03_neural_networks.ipynb#Pretrained-embeddings).

In [8]:
def fit_rnn_classifier(X, y):    
    sst_glove_vocab = utils.get_vocab(X, n_words=10000)     
    mod = TorchRNNClassifier(
        sst_glove_vocab, 
        eta=0.05,
        embedding=None,
        batch_size=1000,
        embed_dim=50,
        hidden_dim=50,
        max_iter=50,
        l2_strength=0.001,
        bidirectional=True,
        hidden_activation=nn.ReLU())
    mod.fit(X, y)
    return mod
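
The vocabulary step in `fit_rnn_classifier` can be pictured roughly as follows. This is a sketch of the general idea (frequency-based truncation plus an unknown-word symbol), not the actual `utils.get_vocab` code; `toy_get_vocab` and the `$UNK` symbol are assumptions made for illustration.

```python
from collections import Counter


def toy_get_vocab(X, n_words=None, unk="$UNK"):
    """Keep the `n_words` most frequent tokens and add an unknown-word symbol."""
    counts = Counter(w for ex in X for w in ex)
    kept = counts.most_common(n_words) if n_words else counts.items()
    vocab = {w for w, _ in kept}
    vocab.add(unk)
    return sorted(vocab)


# Toy token lists, in the format that `rnn_phi` produces.
toy_X = [["a", "great", "film"], ["a", "dull", "film"], ["a", "film"]]
toy_vocab = toy_get_vocab(toy_X, n_words=2)
print(toy_vocab)
```

Words outside the retained vocabulary would then share the unknown-word representation, which is one reason the choice of `n_words` can matter.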

In [9]:
rnn_experiment = sst.experiment(
    SST_HOME,
    rnn_phi,
    fit_rnn_classifier, 
    vectorize=False,  # For deep learning, use `vectorize=False`.
    assess_reader=sst.dev_reader)

Finished epoch 50 of 50; error is 2.2622278034687042

              precision    recall  f1-score   support

    negative      0.603     0.631     0.616       428
     neutral      0.261     0.214     0.235       229
    positive      0.634     0.664     0.649       444

   micro avg      0.558     0.558     0.558      1101
   macro avg      0.499     0.503     0.500      1101
weighted avg      0.544     0.558     0.550      1101



## Error analysis

This section begins to build an error-analysis framework using the dicts returned by `sst.experiment`. These have the following structure:

```
'model': trained model
'train_dataset':
   'X': feature matrix
   'y': list of labels
   'vectorizer': DictVectorizer,
   'raw_examples': list of raw inputs, before featurizing   
'assess_dataset': same structure as the value of 'train_dataset'
'predictions': predictions on the assessment data
'metric': `score_func.__name__`, where `score_func` is an `sst.experiment` argument
'score': the `score_func` score on the assessment data
```
The following function just finds mistakes, and returns a `pd.DataFrame` for easy subsequent processing:

In [10]:
def find_errors(experiment):
    """Find mistaken predictions.
    
    Parameters
    ----------
    experiment : dict
        As returned by `sst.experiment`.
        
    Returns
    -------
    pd.DataFrame
    
    """
    raw_examples = experiment['assess_dataset']['raw_examples']
    raw_examples = [" ".join(tree.leaves()) for tree in raw_examples]
    df = pd.DataFrame({
        'raw_examples': raw_examples,
        'predicted': experiment['predictions'],
        'gold': experiment['assess_dataset']['y']})
    df['correct'] = df['predicted'] == df['gold']
    return df

In [11]:
softmax_analysis = find_errors(softmax_experiment)

In [12]:
rnn_analysis = find_errors(rnn_experiment)

Here we merge the softmax and RNN experiments into a single DataFrame:

In [13]:
analysis = softmax_analysis.merge(
    rnn_analysis, left_on='raw_examples', right_on='raw_examples')

analysis = analysis.drop('gold_y', axis=1).rename(columns={'gold_x': 'gold'})

The following code collects a specific subset of examples; small modifications to its structure will give you different interesting subsets:

In [14]:
# Examples where the softmax model is correct, the RNN is not,
# and the gold label is 'positive'

error_group = analysis[
    (analysis['predicted_x'] == analysis['gold'])
    &
    (analysis['predicted_y'] != analysis['gold'])    
    &
    (analysis['gold'] == 'positive')
]

In [15]:
error_group.shape[0]

60

In [16]:
for ex in error_group['raw_examples'].sample(5):
    print("="*70)
    print(ex)

The magic of the film lies not in the mysterious spring but in the richness of its performances .
My Wife Is an Actress is an utterly charming French comedy that feels so American in sensibility and style it 's virtually its own Hollywood remake .
It haunts you , you ca n't forget it , you admire its conception and are able to resolve some of the confusions you had while watching it .
The film may appear naked in its narrative form ... but it goes deeper than that , to fundamental choices that include the complexity of the Catholic doctrine
Jose Campanella delivers a loosely autobiographical story brushed with sentimentality but brimming with gentle humor , bittersweet pathos , and lyric moments that linger like snapshots of memory .
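
Another quick view of the errors is a count of each (gold, predicted) confusion. The sketch below uses invented toy labels; with a real experiment, the two lists would come from `experiment['assess_dataset']['y']` and `experiment['predictions']`.

```python
from collections import Counter

# Toy gold labels and predictions, invented for illustration.
toy_gold = ["positive", "neutral", "negative", "neutral", "positive"]
toy_pred = ["positive", "positive", "negative", "negative", "negative"]

# Count each (gold, predicted) pair where the model was wrong.
confusions = Counter(
    (g, p) for g, p in zip(toy_gold, toy_pred) if g != p)

for (g, p), n in confusions.most_common():
    print(f"gold={g:<9} predicted={p:<9} count={n}")
```

Sorting confusions by frequency is one way to decide which subset of examples to read first.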


## Homework questions

Please embed your homework responses in this notebook, and do not delete any cells from the notebook. (You are free to add as many cells as you like as part of your responses.)

### Sentiment words alone [2 points]

NLTK includes an easy interface to [Minqing Hu and Bing Liu's __Opinion Lexicon__](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html), which consists of a list of positive words and a list of negative words. How much of the ternary SST story does this lexicon tell?

For this problem, submit code to do the following:

1. Create a feature function `op_unigrams` on the model of `unigrams_phi` above, but filtering the vocabulary to just items that are members of the Opinion Lexicon. Submit this feature function.

1. Evaluate your feature function with `sst.experiment`, with all the same parameters as were used to create `softmax_experiment` in [A softmax baseline](#A-softmax-baseline) above, except of course for the feature function.

1. Use `utils.mcnemar` to compare your feature function with the results in `softmax_experiment`. The information you need for this is in `softmax_experiment` and your own `sst.experiment` results. Submit your evaluation code. You can assume `softmax_experiment` is already in memory, but your code should create the other objects necessary for this comparison.

In [17]:
from nltk.corpus import opinion_lexicon

# Use set for fast membership checking:
positive = set(opinion_lexicon.positive())
negative = set(opinion_lexicon.negative())


In [18]:
def op_unigrams(tree):
    vocabulary = list(filter(lambda x: x in positive or x in negative, tree.leaves()))
    return Counter(vocabulary)

In [19]:
op_softmax_experiment = sst.experiment(
    SST_HOME,
    op_unigrams,                      
    fit_softmax_classifier,            # Free to write your own!
    train_reader=sst.train_reader,     # Fixed by the competition.
    assess_reader=sst.dev_reader,      # Fixed until the bake-off.
    class_func=sst.ternary_class_func) # Fixed by the bake-off rules.

              precision    recall  f1-score   support

    negative      0.553     0.752     0.638       428
     neutral      0.179     0.031     0.052       229
    positive      0.615     0.664     0.639       444

   micro avg      0.567     0.567     0.567      1101
   macro avg      0.449     0.482     0.443      1101
weighted avg      0.500     0.567     0.516      1101



In [20]:
m = utils.mcnemar(
    softmax_experiment['assess_dataset']['y'], 
    op_softmax_experiment['predictions'],
    softmax_experiment['predictions'])

print(m)

(5.328413284132841, 0.020980477345314247)
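
For reference, McNemar's test can be sketched in a few lines of plain Python. This is a minimal version with the usual continuity correction, written under the assumption that the test compares the examples on which exactly one system is right; `toy_mcnemar` is a hypothetical helper and the details of `utils.mcnemar` may differ.

```python
import math


def toy_mcnemar(y_true, pred_a, pred_b):
    """McNemar's chi-squared statistic (continuity-corrected) and p-value."""
    triples = list(zip(y_true, pred_a, pred_b))
    # Count the examples on which exactly one of the two systems is correct.
    only_a = sum(1 for y, a, b in triples if a == y and b != y)
    only_b = sum(1 for y, a, b in triples if a != y and b == y)
    if only_a + only_b == 0:
        return 0.0, 1.0
    stat = (abs(only_a - only_b) - 1.0) ** 2 / (only_a + only_b)
    # Survival function of a chi-squared variable with 1 degree of freedom.
    pval = math.erfc(math.sqrt(stat / 2.0))
    return stat, pval


# Toy data, invented for illustration.
toy_true = ["pos"] * 12
toy_a = ["pos"] * 10 + ["neg"] * 2   # system A is wrong on the last 2
toy_b = ["neg"] * 7 + ["pos"] * 5    # system B is wrong on the first 7
toy_stat, toy_p = toy_mcnemar(toy_true, toy_a, toy_b)
print(toy_stat, toy_p)
```

Applying the same p-value formula to the statistic printed above (about 5.328) gives roughly 0.021, consistent with the second value in that pair.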


### A more powerful vector-summing baseline [3 points]

In [Distributed representations as features](sst_03_neural_networks.ipynb#Distributed-representations-as-features), we looked at a baseline for the ternary SST problem in which each example is modeled as the sum of its 50-dimensional GloVe representations. A `LogisticRegression` model was used for prediction. A neural network might do better here, since there might be complex relationships between the input feature dimensions that a linear classifier can't learn. 

To address this question, rerun the experiment with `torch_shallow_neural_classifier.TorchShallowNeuralClassifier` as the classifier. Specs:
* Use `sst.experiment` to conduct the experiment. 
* Using 3-fold cross-validation, exhaustively explore this set of hyperparameter combinations:
  * The hidden dimensionality at 50, 100, and 200.
  * The hidden activation function as `nn.Tanh` or `nn.ReLU`.
* (For all other parameters to `TorchShallowNeuralClassifier`, use the defaults.)
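
To get a sense of the size of this search, the sketch below enumerates the grid with the standard library. The activations are named by string so that the snippet runs without PyTorch, and the fit count assumes an exhaustive search in which each setting is trained once per fold.

```python
from itertools import product

# The grid from the specification above; activations are named by string
# here so that the sketch runs without PyTorch.
param_grid = {
    "hidden_dim": [50, 100, 200],
    "hidden_activation": ["Tanh", "ReLU"]}
n_folds = 3

# Every combination of one value per hyperparameter.
names = sorted(param_grid)
settings = [dict(zip(names, values))
            for values in product(*(param_grid[name] for name in names))]

for setting in settings:
    print(setting)

# Each setting is trained once per fold during the search.
n_search_fits = len(settings) * n_folds
print(n_search_fits)
```

A final refit of the best setting on all of the training data, if the search does one, would add one more fit to this total.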

For this problem, submit code to do the following:

1. Your model wrapper function around `TorchShallowNeuralClassifier`. This function should implement the requisite cross-validation; see [this notebook section](sst_02_hand_built_features.ipynb#Hyperparameter-search) for examples.
1. Your average F1 score according to `sst.experiment`. 
2. The optimal hyperparameters chosen in your experiment. (You can just paste in the dict that `sst.experiment` prints.)

We're not evaluating the quality of your model. (We've specified the protocols completely, but there will still be a lot of variation in the results.) However, the primary goal of this question is to get you thinking more about this strikingly good baseline feature representation scheme for SST, so we're sort of hoping you feel compelled to try out variations on your own.

In [21]:
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier
DATA_HOME = 'data'
GLOVE_HOME = os.path.join(DATA_HOME, 'glove.6B')

glove_lookup = utils.glove2dict(
    os.path.join(GLOVE_HOME, 'glove.6B.300d.txt'))

def vsm_leaves_phi(tree, lookup, np_func=np.sum):
    allvecs = np.array([lookup[w] for w in tree.leaves() if w in lookup])    
    if len(allvecs) == 0:
        dim = len(next(iter(lookup.values())))
        feats = np.zeros(dim)
    else:       
        feats = np_func(allvecs, axis=0)      
    return feats

def glove_leaves_phi(tree, np_func=np.sum):
    return vsm_leaves_phi(tree, glove_lookup, np_func=np_func)

def fit_nn_classifier(X, y):   
    basemod = TorchShallowNeuralClassifier()
    cv = 3
    param_grid = {'hidden_dim': [50, 100, 200], 'hidden_activation': [nn.Tanh(), nn.ReLU()]}
    best_mod = utils.fit_classifier_with_crossvalidation(X, y, basemod, cv, param_grid)
    return best_mod


In [27]:
nn_experiment = sst.experiment(
    SST_HOME,
    glove_leaves_phi,
    fit_nn_classifier,
    class_func=sst.ternary_class_func,
    score_func=utils.safe_macro_f1,
    vectorize=False)  

print(nn_experiment['score'])


Finished epoch 100 of 100; error is 2.6160860359668732

Best params: {'hidden_activation': Tanh(), 'hidden_dim': 50}
Best score: 0.515
              precision    recall  f1-score   support

    negative      0.594     0.688     0.638      1004
     neutral      0.289     0.181     0.223       491
    positive      0.686     0.702     0.694      1069

   micro avg      0.597     0.597     0.597      2564
   macro avg      0.523     0.524     0.518      2564
weighted avg      0.574     0.597     0.582      2564

0.5181095156690892


### Your original system [4 points]

Your task is to develop an original model for the SST ternary problem. There are many options. If you spend more than a few hours on this homework problem, you should consider letting it grow into your final project! Here are some relatively manageable ideas that you might try:

1. We didn't systematically evaluate the `bidirectional` option to the `TorchRNNClassifier`. Similarly, that model could be tweaked to allow multiple LSTM layers (at present there is only one), and you could try adding layers to the classifier portion of the model as well.

1. We've already glimpsed the power of rich initial word representations, and later in the course we'll see that smart initialization usually leads to a performance gain in NLP, so you could perhaps achieve a winning entry with a simple model that starts in a great place.

1. The [practical introduction to contextual word representations](contextualreps.ipynb) (to be discussed later in the quarter) covers pretrained representations and interfaces that are likely to boost the performance of any system.

1. The `TreeNN` and `TorchTreeNN` don't perform all that well, and this could be for the same reason that RNNs don't perform well: the gradient signal doesn't propagate reliably down inside very deep trees. [Tai et al. 2015](https://aclanthology.info/papers/P15-1150/p15-1150) sought to address this with TreeLSTMs, which are fairly easy to implement in PyTorch.

1. In the [distributed representations as features](sst_03_neural_networks.ipynb#Distributed-representations-as-features) section, we just summed all of the leaf-node GloVe vectors to obtain a fixed-dimensional representation for all sentences. This ignores all of the tree structure. See if you can do better by paying attention to the binary tree structure: write a function `glove_subtree_phi` that obtains a vector representation for each subtree by combining the vectors of its daughters, with the leaf nodes again given by GloVe (any dimension you like) and the full representation of the sentence given by the final vector obtained by this recursive process. You can decide on how you combine the vectors.

1. If you have a lot of computing resources, then you can fire off a large hyperparameter search over many parameter values. All the model classes for this course are compatible with the `scikit-learn` and [scikit-optimize](https://scikit-optimize.github.io) methods, because they define the required functions for getting and setting parameters.
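
The recursive idea behind `glove_subtree_phi` can be illustrated with a toy sketch. Everything here is an assumption made for illustration: trees are nested tuples rather than `nltk` trees, the 2-dimensional vectors stand in for GloVe, and the elementwise mean is just one of many possible ways to combine daughters.

```python
# Toy 2-dimensional "embeddings"; real GloVe vectors would replace these.
toy_lookup = {
    "not": [-1.0, 0.0],
    "very": [0.0, 1.0],
    "good": [1.0, 1.0]}


def combine(left, right):
    """One of many possible composition functions: elementwise mean."""
    return [(a + b) / 2.0 for a, b in zip(left, right)]


def toy_subtree_phi(tree, lookup, dim=2):
    """Recursively build a vector for a binary tree given as nested tuples."""
    if isinstance(tree, str):
        # Leaf node: look up the word, backing off to a zero vector.
        return list(lookup.get(tree, [0.0] * dim))
    left, right = tree
    return combine(
        toy_subtree_phi(left, lookup, dim),
        toy_subtree_phi(right, lookup, dim))


right_branching = toy_subtree_phi(("not", ("very", "good")), toy_lookup)
left_branching = toy_subtree_phi((("not", "very"), "good"), toy_lookup)
print(right_branching, left_branching)
```

The two bracketings of the same words yield different vectors, which is exactly the structural sensitivity that plain summing lacks. A version for real trees would also need to handle nodes with a single daughter.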

We want to emphasize that this needs to be an __original__ system. It doesn't suffice to download code from the Web, retrain, and submit. You can build on others' code, but you have to do something new and meaningful with it.

__Please include a brief prose description of your system along with your code, to help the teaching team understand the structure of your system.__

>## Methods:
> - The basic idea is to fine-tune the *whole* BERT model, instead of using `BERT-as-service`. We found that this yields better performance for this task.
> - We first preprocess the data, specifically:
>  * We balance the dataset by oversampling
>  * We fix the issue with the naive sentence joining function (see `sent_filter`)
> - We then fine-tune the model using the [pretrained BERT model in PyTorch](https://github.com/huggingface/pytorch-pretrained-BERT) by Hugging Face. A decent amount of this implementation comes from the source code there.
>  * We adopt some of the recommended grid search settings in the original BERT paper.
 
>## Code:
> - In the following block, we define helper functions (mostly from the repo linked above). We provide another block for evaluation using the model dump, and a final (commented) block for the training script. Everything should be on GPU by default, but you can run the initialization block and evaluation block with CPU.
> - The additional packages needed are shown in the following block. You may run `pip install pytorch-pretrained-bert` and `pip install imbalanced-learn`.
> - You will need to download the pretrained model from [here](https://drive.google.com/file/d/1BBForAuU7BzyhGZs_r7JuHXn6y3PpVt6/view?usp=sharing) using a Stanford account. Unzip it so that the model file is at `./models/pytorch_model.bin`. Otherwise, you can run the training block, which takes a few hours on 4 GPUs.
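
The oversampling step mentioned in the methods can be pictured with the following standard-library sketch. It is not the `imbalanced-learn` implementation; `toy_oversample` is a hypothetical helper that resamples each smaller class with replacement until it matches the largest class.

```python
import random
from collections import Counter


def toy_oversample(examples, labels, seed=42):
    """Resample minority classes with replacement until all classes match
    the size of the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for ex, label in zip(examples, labels):
        by_label.setdefault(label, []).append(ex)
    target = max(len(group) for group in by_label.values())
    out_examples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for ex in group + extra:
            out_examples.append(ex)
            out_labels.append(label)
    return out_examples, out_labels


# Toy sentences and labels, invented for illustration.
toy_sents = ["s1", "s2", "s3", "s4", "s5", "s6"]
toy_labels = ["positive", "positive", "positive",
              "negative", "negative", "neutral"]
res_sents, res_labels = toy_oversample(toy_sents, toy_labels)
print(Counter(res_labels))
```

Only the training data should be resampled in this way; the dev and test sets keep their original class distribution so that the reported scores stay comparable.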

In [8]:
## Initialization: Re-initialize SST dataset
def sent_filter(sent):
    return sent.replace(" 's", "'s").replace(" .", ".").replace(" ,", ",").replace("`` ", "'") \
            .replace(" ''", "'").replace(" 'm", "'m").replace(" 've", "'ve") \
            .replace(" 't", "'t").replace(" 're", "'re") 
sst_train_reader = sst.train_reader(SST_HOME, class_func=sst.ternary_class_func)
sst_train = [(sent_filter(" ".join(t.leaves())),label) for t, label in sst_train_reader]
sst_dev_reader = sst.dev_reader(SST_HOME, class_func=sst.ternary_class_func)
sst_dev = [(sent_filter(" ".join(t.leaves())), label) for t, label in sst_dev_reader]
sst_test_reader = sst.test_reader(SST_HOME, class_func=sst.ternary_class_func)
sst_test = [(sent_filter(" ".join(t.leaves())), label) for t, label in sst_test_reader]


## Additional imports
import csv, os, json, random, logging, sys, pickle, time
from tqdm import trange, tqdm
import pandas as pd
import torch
from sklearn.metrics import *
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler

from pytorch_pretrained_bert.modeling import BertForSequenceClassification, BertConfig, WEIGHTS_NAME, CONFIG_NAME

from pytorch_pretrained_bert.tokenization import BertTokenizer
from pytorch_pretrained_bert.modeling import BertForSequenceClassification
from pytorch_pretrained_bert.optimization import BertAdam
from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE
from collections import *
from imblearn.over_sampling import RandomOverSampler
from bert_serving.client import BertClient 
from sklearn.metrics import classification_report
import imblearn

## Helpers
logging.basicConfig(format = '%(asctime)s - %(levelname)s -   %(message)s', 
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.INFO)
logger = logging.getLogger(__name__)

class dotdict(dict):
    def __getattr__(self, name):
        return self[name]
    
class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        """Constructs a InputExample.
        Args:
            guid: Unique id for the example.
            text_a: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
            text_b: (Optional) string. The untokenized text of the second sequence.
            Only must be specified for sequence pair tasks.
            label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, segment_ids, label_id):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id


class DataProcessor(object):
    """Base class for data converters for sequence classification data sets."""

    def get_train_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the train set."""
        raise NotImplementedError()

    def get_dev_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the dev set."""
        raise NotImplementedError()

    def get_test_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the test set."""
        raise NotImplementedError()

    def get_labels(self):
        """Gets the list of labels for this data set."""
        raise NotImplementedError()



class CondProcessor(DataProcessor):
    """Processor for the ternary SST data set."""
    def __init__(self):
        # Map the string labels from `sst.ternary_class_func` to integer ids.
        self.label = {"positive": 0, "negative": 1, "neutral": 2}
        
    def get_train_examples(self):
        """See base class."""
        return self._create_examples(sst_train, "train")

    def get_dev_examples(self):
        """See base class."""
        return self._create_examples(sst_dev, "dev")
    
    def get_test_examples(self):
        """See base class."""
        return self._create_examples(sst_test, "test")
    
    def get_labels(self):
        """See base class."""
        return ["0", "1", "2"]

    def _create_examples(self, data, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        i = 0
        
        sents = np.array([d[0] for d in data])
        labels = np.array([d[1] for d in data])
        
        if set_type == "train":
            logging.info("getting oversampled train data")
            ros = RandomOverSampler(random_state=42)
            X_res, y_res = ros.fit_resample(np.arange(len(sents)).reshape(-1, 1), labels)
            X_res = sents[X_res.reshape(-1)]
            logging.info(Counter(y_res))
        else:
            logging.info("getting dev/test data")
            X_res = sents
            y_res = labels

        for e in zip(X_res, y_res):
            text_a = e[0]
            text_b = None
            label = self.label[e[1]]
            guid = "%s-%s" % (set_type, i)
            i += 1
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
      
        return examples



def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""

    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()

def convert_examples_to_features(examples, label_list, max_seq_length, tokenizer, show_exp=False):

    label_map = {label : i for i, label in enumerate(label_list)}

    features = []
    for (ex_index, example) in enumerate(examples):
        tokens_a = tokenizer.tokenize(example.text_a)

        tokens_b = None
        if example.text_b:
            tokens_b = tokenizer.tokenize(example.text_b)

        if tokens_b:
            # Modifies `tokens_a` and `tokens_b` in place so that the total
            # length is less than the specified length.
            # Account for [CLS], [SEP], [SEP] with "- 3"
            _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
        else:
            # Account for [CLS] and [SEP] with "- 2"
            if len(tokens_a) > max_seq_length - 2:
                tokens_a = tokens_a[0:(max_seq_length - 2)]

        # The convention in BERT is:
        # (a) For sequence pairs:
        #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
        #  type_ids: 0   0  0    0    0     0       0 0    1  1  1  1   1 1
        # (b) For single sequences:
        #  tokens:   [CLS] the dog is hairy . [SEP]
        #  type_ids: 0   0   0   0  0     0 0
        #
        # Where "type_ids" are used to indicate whether this is the first
        # sequence or the second sequence. The embedding vectors for `type=0` and
        # `type=1` were learned during pre-training and are added to the wordpiece
        # embedding vector (and position vector). This is not *strictly* necessary
        # since the [SEP] token unambiguously separates the sequences, but it makes
        # it easier for the model to learn the concept of sequences.
        #
        # For classification tasks, the first vector (corresponding to [CLS]) is
        # used as the "sentence vector". Note that this only makes sense because
        # the entire model is fine-tuned.
        tokens = []
        segment_ids = []
        tokens.append("[CLS]")
        segment_ids.append(0)
        for token in tokens_a:
            tokens.append(token)
            segment_ids.append(0)
        tokens.append("[SEP]")
        segment_ids.append(0)

        if tokens_b:
            for token in tokens_b:
                tokens.append(token)
                segment_ids.append(1)
            tokens.append("[SEP]")
            segment_ids.append(1)

        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1] * len(input_ids)

        # Zero-pad up to the sequence length.
        while len(input_ids) < max_seq_length:
            input_ids.append(0)
            input_mask.append(0)
            segment_ids.append(0)

        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length

        label_id = label_map[str(example.label)]
        if ex_index < 5 and show_exp:
            logger.info("*** Example ***")
            logger.info("guid: %s" % (example.guid))
            logger.info("tokens: %s" % " ".join(
                    [str(x) for x in tokens]))
            logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
            logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
            logger.info(
                    "segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
            logger.info("label: %s (id = %d)" % (example.label, label_id))

        features.append(
                InputFeatures(input_ids=input_ids,
                              input_mask=input_mask,
                              segment_ids=segment_ids,
                              label_id=label_id))
    return features
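
The single-sequence path through `convert_examples_to_features` can be summarized with a small standalone sketch. The toy vocabulary and the `toy_features` helper are hypothetical; in the real code the ids come from the BERT tokenizer, and the tokens are word pieces rather than whole words.

```python
def toy_features(tokens, vocab, max_seq_length):
    """Build padded input ids, attention mask, and segment ids for one
    single-sequence example."""
    # Reserve two positions for the special tokens.
    tokens = tokens[:max_seq_length - 2]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    input_mask = [1] * len(input_ids)      # 1 marks a real token
    segment_ids = [0] * len(input_ids)     # single sequence: all segment 0
    n_pad = max_seq_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * n_pad
    input_mask += [0] * n_pad              # 0 marks padding
    segment_ids += [0] * n_pad
    return input_ids, input_mask, segment_ids


# Hypothetical toy vocabulary; real ids come from the BERT tokenizer.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "the": 4, "dog": 5}
ids, mask, segs = toy_features(
    ["the", "dog", "barks"], toy_vocab, max_seq_length=8)
print(ids)
print(mask)
```

All three lists always have length `max_seq_length`, which is what allows the examples to be stacked into fixed-shape tensors for batching.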


In [None]:
## Evaluation: Evaluate using best model
args = dotdict({"data_dir": './data/bert/', 
                "bert_model": "bert-large-cased",
                "output_dir": "./models/",
                "model_save_pth": "./models/bert_classification.pth",
                "seed": 28,
                "train_batch_size": 30,
                "num_train_epochs": 8,
                "eval_batch_size": 30,
                "do_lower_case": False,
                "do_train": True,
                "do_eval": True,
                "max_seq_length": 100,
                "gradient_accumulation_steps": 1,
                "local_rank": -1,
                "warmup_proportion": 0.1,
                "fp16": False,
                "cache_dir": "./tmp/",
                "learning_rate": 9E-6,
                "do_train_eval": False})

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# see https://discuss.pytorch.org/t/on-a-cpu-device-how-to-load-checkpoint-saved-on-gpu-device/349
logger.info("device is " + str(device))
n_gpu = torch.cuda.device_count()
args.train_batch_size = args.train_batch_size // args.gradient_accumulation_steps

random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
if n_gpu > 0:
    torch.cuda.manual_seed_all(args.seed)

processor = CondProcessor()
label_list = processor.get_labels()
tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
num_labels = 3

# Prepare model
model = BertForSequenceClassification.from_pretrained(args.bert_model,
          cache_dir=args.cache_dir,
          num_labels = num_labels)
model.load_state_dict(torch.load('models/pytorch_model.bin', map_location=device))
logging.info("loaded best model")

eval_examples = processor.get_dev_examples() # change this line to `get_test_examples` for test dataset evaluation
eval_features = convert_examples_to_features(
    eval_examples, label_list, args.max_seq_length, tokenizer)
all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)
eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
# Run prediction for full data
eval_sampler = SequentialSampler(eval_data)
eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size)
model.to(device)
model.eval()
preds = []
labels = []
for input_ids, input_mask, segment_ids, label_ids in tqdm(eval_dataloader):
    input_ids = input_ids.to(device)
    input_mask = input_mask.to(device)
    segment_ids = segment_ids.to(device)
    label_ids = label_ids.to(device)

    with torch.no_grad():
        # One forward pass without labels returns the logits; the loss is not used in this cell.
        logits = model(input_ids, segment_ids, input_mask)

    logits = logits.detach().cpu().numpy()
    label_ids = label_ids.to('cpu').numpy()
    pred = np.argmax(logits, axis=1)
    labels.append(label_ids)
    preds.append(pred)

# f1 = f1_score(np.concatenate(labels), np.concatenate(preds), average="macro")
# logging.info("*** Dev F1: %s" % f1)
# logging.info("")
print(classification_report(np.concatenate(labels), np.concatenate(preds), digits=5))


>## System Training Log
>### Best args
>
>     {'data_dir': './data/bert/', 'bert_model': 'bert-large-cased', 'output_dir': './models/', 'model_save_pth': './models/bert_classification.pth', 'seed': 28, 'train_batch_size': 30, 'num_train_epochs': 8, 'eval_batch_size': 90, 'do_lower_case': False, 'do_train': True, 'do_eval': True, 'max_seq_length': 100, 'gradient_accumulation_steps': 1, 'task_name': 'test', 'local_rank': -1, 'warmup_proportion': 0.1, 'fp16': False, 'cache_dir': './tmp/', 'learning_rate': 9e-06, 'do_train_eval': False}
>
>### Log
>```
>04/22/2019 01:10:08 - INFO -   tr loss: 82.70523658394814
>04/22/2019 01:10:08 - INFO -   do eval
>04/22/2019 01:10:08 - INFO -   getting dev data
>04/22/2019 01:10:25 - INFO -   F1 0.6934594278565237
>04/22/2019 01:10:25 - INFO -   Best F1 0.6934594278565237
>04/22/2019 01:10:25 - INFO -   ***** Eval results *****
>04/22/2019 01:10:25 - INFO -     eval_f1 = 0.6934594278565237
>04/22/2019 01:10:25 - INFO -     eval_loss = 11.088425636291504
>04/22/2019 01:10:25 - INFO -     global_step = 1083
>```
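
The logged arguments combine `warmup_proportion` 0.1 with a peak `learning_rate` of 9e-06. As a rough illustration of what a warmup proportion means, the sketch below implements a generic linear warmup followed by linear decay. The function name `linear_warmup_decay` is hypothetical, and the exact curve applied by the `BertAdam` version used in this notebook may differ.

```python
def linear_warmup_decay(step, total_steps, peak_lr, warmup_proportion=0.1):
    """Generic schedule (an assumption, not BertAdam's exact code)."""
    progress = step / total_steps
    if progress < warmup_proportion:
        # Ramp up linearly from 0 to peak_lr over the warmup phase.
        return peak_lr * progress / warmup_proportion
    # Then decay linearly from peak_lr down to 0 at the final step.
    return peak_lr * max(0.0, (1.0 - progress) / (1.0 - warmup_proportion))


# Inspect a few points of the schedule for a hypothetical 1000-step run.
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay(step, total_steps=1000, peak_lr=9e-6))
```

With these settings the learning rate peaks after the first tenth of the optimization steps and reaches zero at the last step.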

In [None]:
# ## Train
# best_f1 = 0
# for learning_rate in [5E-6, 9E-6, 1E-5, 2E-5]:
#         args = dotdict({"data_dir": './data/bert/', 
#                 "bert_model": "bert-large-cased",
#                 "output_dir": "./models/",
#                 "model_save_pth": "./models/bert_classification.pth",
#                 "seed": 28,
#                 "train_batch_size": 30,
#                 "num_train_epochs": 8,
#                 "eval_batch_size": 90,
#                 "do_lower_case": False,
#                 "do_train": True,
#                 "do_eval": True,
#                 "max_seq_length": 100,
#                 "gradient_accumulation_steps": 1,
#                 "task_name": "test",
#                 "local_rank": -1,
#                 "warmup_proportion": 0.1,
#                 "fp16": False,
#                 "cache_dir": "./tmp/",
#                 "learning_rate": learning_rate,
#                 "do_train_eval": False})



#         if torch.cuda.is_available():
#             torch.cuda.empty_cache()
#         logging.info(args)




#         processor = CondProcessor()
#         label_list = processor.get_labels()

#         tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)

#         num_labels = 3
#         train_examples = None
#         num_train_optimization_steps = None
#         if args.do_train:
#             train_examples = processor.get_train_examples()
#             num_train_optimization_steps = int(
#                 len(train_examples) / args.train_batch_size / args.gradient_accumulation_steps) * args.num_train_epochs
#             if args.local_rank != -1:
#                 num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()

#         # Prepare model
#         cache_dir = args.cache_dir if args.cache_dir else os.path.join(str(PYTORCH_PRETRAINED_BERT_CACHE), 'distributed_{}'.format(args.local_rank))
#         model = BertForSequenceClassification.from_pretrained(args.bert_model,
#                   cache_dir=cache_dir,
#                   num_labels = num_labels)

#         model = torch.nn.DataParallel(model)
#         model.to(device)


#         param_optimizer = list(model.named_parameters())
#         no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
#         optimizer_grouped_parameters = [
#             {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
#             {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
#             ]


#         optimizer = BertAdam(optimizer_grouped_parameters,
#                              lr=args.learning_rate,
#                              warmup=args.warmup_proportion,
#                              t_total=num_train_optimization_steps)

#         train_f1s = []
#         eval_f1s = []
#         global_step = 0
#         nb_tr_steps = 0
#         tr_loss = 0
#         patience = 0
#         if args.do_train:


#             for _ in trange(int(args.num_train_epochs)):

#                 logger.info("do train")

#                 train_features = convert_examples_to_features(
#                 train_examples, label_list, args.max_seq_length, tokenizer)
#                 all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long)
#                 all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long)
#                 all_segment_ids = torch.tensor([f.segment_ids for f in train_features], dtype=torch.long)
#                 all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.long)
#                 train_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
#                 if args.local_rank == -1:
#                     train_sampler = RandomSampler(train_data)
#                 else:
#                     train_sampler = DistributedSampler(train_data)
#                 train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size, drop_last = True, pin_memory=True)


#                 model.train()
#                 tr_loss = 0
#                 nb_tr_examples, nb_tr_steps = 0, 0

#                 for step, batch in tqdm(enumerate(train_dataloader), total=len(train_dataloader)):

#                     batch = tuple(t.to(device, non_blocking=True) for t in batch)

#                     input_ids, input_mask, segment_ids, label_ids = batch

#                     loss = model(input_ids, segment_ids, input_mask, label_ids)  # returns the loss when labels are passed
#                     if n_gpu > 1:
#                         loss = loss.mean() # mean() to average on multi-gpu.
#                     loss.backward()

#                     tr_loss += loss.item()
#                     nb_tr_examples += input_ids.size(0)
#                     nb_tr_steps += 1
#                     optimizer.step()
#                     optimizer.zero_grad()
#                     global_step += 1
#                     del input_ids, input_mask, segment_ids, label_ids, batch, loss

#                 logger.info("tr loss: %s" % tr_loss)

          


#                 if args.do_eval:
#                     logger.info("do eval")

#                     eval_examples = processor.get_dev_examples()
#                     eval_features = convert_examples_to_features(
#                         eval_examples, label_list, args.max_seq_length, tokenizer)
#                     all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
#                     all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
#                     all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
#                     all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)
#                     eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
#                     # Run prediction for full data
#                     eval_sampler = SequentialSampler(eval_data)
#                     eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size)

#                     model.eval()
#                     eval_loss = 0
#                     nb_eval_steps, nb_eval_examples = 0, 0
#                     preds = []
#                     labels = []
#                     for input_ids, input_mask, segment_ids, label_ids in eval_dataloader:
#                         input_ids = input_ids.to(device)
#                         input_mask = input_mask.to(device)
#                         segment_ids = segment_ids.to(device)
#                         label_ids = label_ids.to(device)

#                         with torch.no_grad():
#                             tmp_eval_loss = model(input_ids, segment_ids, input_mask, label_ids)
#                             logits = model(input_ids, segment_ids, input_mask)

#                         logits = logits.detach().cpu().numpy()
#                         label_ids = label_ids.to('cpu').numpy()
#                         pred = np.argmax(logits, axis=1)
#                         labels.append(label_ids)
#                         preds.append(pred)
#                         eval_loss += tmp_eval_loss.mean().item()

#                         nb_eval_examples += input_ids.size(0)
#                         nb_eval_steps += 1
#                         del input_ids, input_mask, segment_ids, label_ids, tmp_eval_loss

#                     f1 = f1_score(np.concatenate(labels), np.concatenate(preds), average="macro")
#                     logger.info("F1 %s" % f1)
#                     eval_f1s.append(f1)
#                     if f1 > best_f1:

#                         best_f1 = f1
#                         assert type(eval_loss) == float
#                         logger.info("Best F1 %s" % best_f1)
#                         result = {'eval_loss': eval_loss,
#                                   'eval_f1': f1,
#                                   'global_step': global_step}

#                         output_eval_file = os.path.join(args.output_dir, "eval_results.txt")
#                         with open(output_eval_file, "w") as writer:
#                             logger.info("***** Eval results *****")
#                             for key in sorted(result.keys()):
#                                 logger.info("  %s = %s", key, str(result[key]))
#                                 writer.write("%s = %s\n" % (key, str(result[key])))
#                         model.to(device)
#                         model_to_save = model.module if hasattr(model, 'module') else model 
#                         output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME)
#                         torch.save(model_to_save.state_dict(), output_model_file)
#                         output_config_file = os.path.join(args.output_dir, CONFIG_NAME)
#                         with open(output_config_file, 'w') as f:
#                             f.write(model_to_save.config.to_json_string())
#                         output_param_file = os.path.join(args.output_dir, "param")
#                         with open(output_param_file, 'w') as f:
#                             json.dump(args.__dict__, f, indent=2, sort_keys=True)
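
The optimizer setup in the training cell above excludes biases and LayerNorm parameters from weight decay by matching substrings of the parameter names. A minimal sketch of that grouping logic on plain strings follows. The helper `split_by_decay` is hypothetical and the parameter names are made up for illustration; only the `no_decay` patterns are taken from the cell.

```python
NO_DECAY = ("bias", "LayerNorm.bias", "LayerNorm.weight")


def split_by_decay(param_names, no_decay=NO_DECAY):
    """Hypothetical helper: partition names into (decayed, not decayed)."""
    decayed, not_decayed = [], []
    for name in param_names:
        # A name is exempt from weight decay if any pattern occurs in it.
        if any(pattern in name for pattern in no_decay):
            not_decayed.append(name)
        else:
            decayed.append(name)
    return decayed, not_decayed


# Illustrative names only; real names come from model.named_parameters().
names = [
    "encoder.layer.0.attention.self.query.weight",
    "encoder.layer.0.attention.self.query.bias",
    "encoder.layer.0.output.LayerNorm.weight",
    "classifier.weight",
]
decayed, not_decayed = split_by_decay(names)
print(decayed)
print(not_decayed)
```

In the training cell the first group receives `weight_decay` 0.01 and the second 0.0.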
                            

## Bake-off [1 point]

The bake-off will begin on April 22. The announcement will go out on Piazza. As we said above, the bake-off evaluation data is the official SST test set release. For this bake-off, you'll evaluate your original system from the above homework problem on the test set, using the ternary class problem. Rules:

1. Only one evaluation is permitted.
1. No additional system tuning is permitted once the bake-off has started.

To enter the bake-off, upload this notebook on Canvas:

https://canvas.stanford.edu/courses/99711/assignments/187246

The cells below this one constitute your bake-off entry.

Systems that enter will receive the additional homework point, and systems that achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.

The bake-off will close at 4:30 pm on April 24. Late entries will be accepted, but they cannot earn the extra 0.5 points. Similarly, you cannot win the bake-off unless your homework is submitted on time.

In [10]:
# Enter your bake-off assessment code in this cell. 
# Please do not remove this comment.

## Evaluation: Evaluate using best model
args = dotdict({"data_dir": './data/bert/', 
                "bert_model": "bert-large-cased",
                "output_dir": "./models/",
                "model_save_pth": "./models/bert_classification.pth",
                "seed": 28,
                "train_batch_size": 30,
                "num_train_epochs": 8,
                "eval_batch_size": 30,
                "do_lower_case": False,
                "do_train": True,
                "do_eval": True,
                "max_seq_length": 100,
                "gradient_accumulation_steps": 1,
                "local_rank": -1,
                "warmup_proportion": 0.1,
                "fp16": False,
                "cache_dir": "./tmp/",
                "learning_rate": 9E-6,
                "do_train_eval": False})

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# see https://discuss.pytorch.org/t/on-a-cpu-device-how-to-load-checkpoint-saved-on-gpu-device/349
logger.info("device is " + str(device))
n_gpu = torch.cuda.device_count()
args.train_batch_size = args.train_batch_size // args.gradient_accumulation_steps

random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
if n_gpu > 0:
    torch.cuda.manual_seed_all(args.seed)

processor = CondProcessor()
label_list = processor.get_labels()
tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
num_labels = 3

# Prepare model
model = BertForSequenceClassification.from_pretrained(args.bert_model,
          cache_dir=args.cache_dir,
          num_labels = num_labels)
model.load_state_dict(torch.load('models/pytorch_model.bin', map_location=device))
logging.info("loaded best model")

eval_examples = processor.get_test_examples()  # bake-off: single evaluation on the SST test set
eval_features = convert_examples_to_features(
    eval_examples, label_list, args.max_seq_length, tokenizer)
all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)
eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
# Run prediction for full data
eval_sampler = SequentialSampler(eval_data)
eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size)
model.to(device)
model.eval()
preds = []
labels = []
for input_ids, input_mask, segment_ids, label_ids in tqdm(eval_dataloader):
    input_ids = input_ids.to(device)
    input_mask = input_mask.to(device)
    segment_ids = segment_ids.to(device)
    label_ids = label_ids.to(device)

    with torch.no_grad():
        # One forward pass without labels returns the logits; the loss is not used in this cell.
        logits = model(input_ids, segment_ids, input_mask)

    logits = logits.detach().cpu().numpy()
    label_ids = label_ids.to('cpu').numpy()
    pred = np.argmax(logits, axis=1)
    labels.append(label_ids)
    preds.append(pred)


print(classification_report(np.concatenate(labels), np.concatenate(preds), digits=5))



04/22/2019 16:17:19 - INFO -   device is cpu
04/22/2019 16:17:19 - INFO -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt from cache at /Users/baidi/.pytorch_pretrained_bert/cee054f6aafe5e2cf816d2228704e326446785f940f5451a5b26033516a4ac3d.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
04/22/2019 16:17:20 - INFO -   loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased.tar.gz from cache at ./tmp/7fb0534b83c42daee7d3ddb0ebaa81387925b71665d6ea195c5447f1077454cd.eea60d9ebb03c75bb36302aa9d241d3b7a04bba39c360cf035e8bf8140816233
04/22/2019 16:17:20 - INFO -   extracting archive file ./tmp/7fb0534b83c42daee7d3ddb0ebaa81387925b71665d6ea195c5447f1077454cd.eea60d9ebb03c75bb36302aa9d241d3b7a04bba39c360cf035e8bf8140816233 to temp dir /var/folders/cf/lwl2s47102j6p475ny0d9dlc0000gn/T/tmprz8zb0n0
04/22/2019 16:17:34 - INFO -   Model config {
  "attention_probs_dropout_prob": 0.1,
  "direction

              precision    recall  f1-score   support

           0    0.85443   0.89109   0.87237       909
           1    0.84791   0.73355   0.78660       912
           2    0.38055   0.46272   0.41763       389

   micro avg    0.75068   0.75068   0.75068      2210
   macro avg    0.69430   0.69579   0.69220      2210
weighted avg    0.76833   0.75068   0.75693      2210
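
The `macro avg` F1 in the report is the unweighted mean of the three per-class F1 scores, so the weak score on class 2 pulls it well below the micro average. The `weighted avg` instead weights each class by its support. A quick check using the rounded values from the table (the variable names are illustrative):

```python
# Per-class F1 and support, copied from the classification report.
per_class_f1 = {0: 0.87237, 1: 0.78660, 2: 0.41763}
support = {0: 909, 1: 912, 2: 389}

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted average: each class counts in proportion to its support.
total = sum(support.values())
weighted_f1 = sum(per_class_f1[c] * support[c] for c in per_class_f1) / total

print(round(macro_f1, 5), round(weighted_f1, 5))
```

Both values agree with the report's `macro avg` and `weighted avg` rows up to rounding.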



In [None]:
# On an otherwise blank line in this cell, please enter
# your macro-average F1 value as reported by the code above. 
# Please enter only a number between 0 and 1 inclusive.
# Please do not remove this comment.
0.69220