<center> <h1> Lecture 6: Embeddings and ML Experiments </h1> </center>
<center> Krishna Pillutla, Zaid Harchaoui </center>
    <center> Data 598 (Winter 2022), University of Washington </center>

We will discuss two topics this lecture:
- Embeddings for natural language
- Model Selection with statistical tests



# Part 1: Embeddings for Natural Language

The field of **natural language processing (NLP)** is concerned with the interaction between computers and natural (human) language. This involves "understanding" the contents of documents, including the contextual nuances of the language within them. 

**Embeddings**:
The use of machine learning for NLP, both in the classical settings as well as the modern deep learning era, have relied on *embedding* words in vector spaces.
Words are made of characters, which are combinatorial in nature with no "neighborhood" structure which one expects of vectors in, say, a Euclidean space. 
The magic of embeddings is that they are able to capture some "neighborhood" structure in words, e.g., the embedding of synonyms are closer together than of words which have nothing in common. 

![](https://miro.medium.com/max/2400/1*OEmWDt4eztOcm5pr2QbxfA.png)
Image credits: https://towardsdatascience.com/creating-word-embeddings-coding-the-word2vec-algorithm-in-python-using-deep-learning-b337d0ba17a8

**Note**: Sometimes, we will work at the level of subword units, rather than words. Mathematically, the same treatment holds irrespective of how we *tokenize* the text. We will refer to these units as *tokens*.


**Types of embeddings**:

- Global (context-free) embeddings: word2vec, GloVe
- Contextual embeddings: ELMo, BERT, ...

![](http://ai.stanford.edu/blog/assets/img/posts/2020-03-24-contextual/contextual_mouse_transparent_2.png)
Image credit: http://ai.stanford.edu/blog/contextual/


**The history of word embeddings**:
The research started with global (context-free) embeddings with 
later research producing contextual embeddings using deep learning.
$$
\begin{matrix}
\text{word2vec}   &  \text{GloVe}   &   \text{ELMo}     &       \text{BERT}  \\
2013       &  2014    &   2017     &       2018 
\end{matrix} 
$$

**Playing with embeddings**:
For the moment, we postpone a discussion of how the embeddings are constructed. We will play with BERT embeddings, a form of contextual embeddings, using the `transformers` library.

In [2]:
# Install the transformers library
# Important: make sure pip is installed in your conda environment
# Run "pip install transformers" in your terminal

In [1]:
import torch
import numpy as np
from transformers import BertTokenizer, BertModel

In [2]:
model_name = 'bert-base-uncased'
# Download the pre-trained model + tokenizer (a total of 440 MB)
tokenizer = BertTokenizer.from_pretrained(model_name) # to tokenize the text
model = BertModel.from_pretrained(model_name)  # PyTorch module

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/433 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

In [263]:
# Consider these two sentences

sentence1 = "I collected my paycheck at the bank"
sentence2 = "Meet me tomorrow at the river bank"

# Let us tokenize them 
tokens_for_sentence1 = tokenizer.encode(sentence1, return_tensors='pt')
tokens_for_sentence2 = tokenizer.encode(sentence2, return_tensors='pt')

print('Sentence 1:', tokens_for_sentence1)
print('Sentence 2:', tokens_for_sentence2)

print('Sentence 1 Length:', tokens_for_sentence1.shape)
print('Sentence 2 Length:', tokens_for_sentence2.shape)
# the leading 1 is the batch size

Sentence 1: tensor([[ 101, 1045, 5067, 2026, 3477, 5403, 3600, 2012, 1996, 2924,  102]])
Sentence 2: tensor([[ 101, 3113, 2033, 4826, 2012, 1996, 2314, 2924,  102]])
Sentence 1 Length: torch.Size([1, 11])
Sentence 2 Length: torch.Size([1, 9])


The token "102" corresponds to the word "bank".
Observe now that the contextual embedding of the word "bank" for each case is different. 
This would not have been the case for a global embedding. 

In [264]:
outputs1 = model(tokens_for_sentence1,
                return_dict=True)

# Extract contextual embedding for each token
embeddings_for_sentence1 = outputs1.last_hidden_state
print(embeddings_for_sentence1.shape) # [batch_size, num_tokens, dimension]

outputs2 = model(tokens_for_sentence2,
                return_dict=True)

# Extract contextual embedding for each token
embeddings_for_sentence2 = outputs2.last_hidden_state
print(embeddings_for_sentence2.shape) # [batch_size, num_tokens, dimension]


torch.Size([1, 11, 768])
torch.Size([1, 9, 768])


In [265]:
embedding_for_bank_1 = embeddings_for_sentence1[0, -1, :]
embedding_for_bank_2 = embeddings_for_sentence2[0, -1, :]
print('L2 distance between the embeddings:', 
      torch.norm(embedding_for_bank_1-embedding_for_bank_2).item())


L2 distance between the embeddings: 5.314887523651123


## Sentiment Analysis using Embeddings

We will look at the standard NLP task of sentiment analysis. 
Given a piece of text, the goal is to classify it as "positive" or "negative" in sentiment.

![](https://cdn.socialbakers.com/www/storage/www/articles/content/2018-12/1545313838-sentiment-analysis.jpg)
Image credits: https://www.socialbakers.com/blog/social-media-sentiment-analysis

Our procedure is as follows:
- We will use a labeled dataset and cast this as a multiclass classification problem
- We will use these BERT embeddings to construct obtain one vector per token. We will simply take the mean of this vector as the feature representation of the entire piece of text.
- We will train a simple linear model to predict the output label from these features.


Download the data from [here](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data?select=train.tsv.zip).

We will use movie reviews from Rotten Tomatoes. The sentiment labels are:
- 0 - negative
- 1 - somewhat negative
- 2 - neutral
- 3 - somewhat positive
- 4 - positive

### Load and visualize data

In [266]:
import pandas as pd
filename = './data/sentiment-analysis-train.tsv'
# keep one example per sentence (original data labels each phrase)
data = pd.read_csv(filename, sep='\t').groupby('SentenceId').first()
data = data.drop(columns=['PhraseId'])

print(data.shape)

data.head(4)

(8529, 2)


Unnamed: 0_level_0,Phrase,Sentiment
SentenceId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,A series of escapades demonstrating the adage ...,1
2,"This quiet , introspective and entertaining in...",4
3,"Even fans of Ismail Merchant 's work , I suspe...",1
4,A positively thrilling combination of ethnogra...,3


In [267]:
data.at[3, 'Phrase']

"Even fans of Ismail Merchant 's work , I suspect , would have a hard time sitting through this one ."

### Train-test split and featurize

In [268]:
data = data.sample(frac=1)  # shuffle
train_data = data[:1000]
test_data = data[5000:6000]
print(train_data.shape, test_data.shape)

(1000, 2) (1000, 2)


In [156]:
from tqdm.auto import tqdm

@torch.no_grad()
def featurize(x): # x is pd.Series with text
    features = []
    for sen in tqdm(x):
        sen = tokenizer.encode(sen, return_tensors='pt')
        outputs = model(sen, return_dict=True)
        embeddings = outputs.last_hidden_state.squeeze() # (len, dim)
        mean_embedding = embeddings.mean(axis=0)
        features.append(mean_embedding.numpy())
    return np.stack(features)  # (n, dim)


In [157]:
# Takes a few minutes to run
x_train = featurize(train_data['Phrase'])
y_train = train_data['Sentiment'].values

x_test = featurize(test_data['Phrase'])
y_test = test_data['Sentiment'].values

  0%|          | 0/1000 [00:00<?, ?it/s]

  0%|          | 0/1000 [00:00<?, ?it/s]

In [234]:
# xt1 = x_train
# xt2 = x_test
x_train = xt1
x_test = xt2

### Train a simple logistic regression classifier to test performance

In [235]:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.99, random_state=1).fit(x_train)  # keep 99% of the explained variance
x_train = pca.transform(x_train)
x_test = pca.transform(x_test)

In [244]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0, C=0.01).fit(x_train, y_train)

y_train_pred = clf.predict(x_train)
y_test_pred = clf.predict(x_test)

print('Train accuracy:', (y_train_pred == y_train).mean())
print('Test accuracy:', (y_test_pred == y_test).mean())

Train accuracy: 0.546
Test accuracy: 0.418


# Part 2: Statistical Tests for Analysis for ML Experiments

In some particular safety-critical applications, it might be necessary to make guarantees of the form 
"*The misclassification error of my classifier is at most 12% on data from the same distribution as our training data*". Think of self-driving cars, for instance. The 12% above is an arbitrary number.

In these cases, we run hypothesis tests to formalize our claims.

**Hypothesis Testing Review**: 
Suppose we want to show that herbal tea helps with migraines. 
In the spirit of "proof by contradiction", 
we assume the opposite to be true and say that herbal tea does not help with migraines. 
If the data looks too "anomalous" under this assumption, we arrive at a 
"contradiction", which means that the data is not consistent the 
claim that "herbal tea does not help with migraines" (the opposite of what we set out to show).


We have two hypotheses, the null hypothesis (denoted $H_0$; the opposite of what we want to show) and the alternate hypothesis (denoted $H_1$ or $H_a$; what we want to show). 
From looking at the data, we take one of two steps:
- reject the null hypothesis
- "fail" to reject the null hypothesis


**Illustration**: ([credit](https://study.com/cimages/multimages/16/ea0e233d-7bc3-4ba6-a79c-5d8281295985_t_tests.png)): 

Suppose we are given two distributions with means $\mu_1$ and $\mu_2$ respectively. 
We plot as test statistic (TS) the difference $\hat\mu_{1, n} - \hat\mu_{2, n}$ between the sample means. The bell-curve is centered at $0$.
![](https://study.com/cimages/multimages/16/ea0e233d-7bc3-4ba6-a79c-5d8281295985_t_tests.png)


Letting "acc$(h)$" denote the classification accuracy of our classification algorithm $h$, We may write this test as the following:
$$
H_0: \quad \text{acc}(h) \le a_0 \\
H_1: \quad \text{acc}(h) > a_0 ,
$$
where $a_0$ is some pre-specified accuracy.

The outcomes are:
- Reject the null: If our data is convincing enough (i.e., the accuracy on our validation set is significantly larger than $a_0$), we reject the null with a certain level of confidence
- Fail to reject the null: If the validation accuracy is close to or smaller than $a_0$, we say that we do not have strong enough evidence to reject the null (default) hypothesis. 

**The $t$-Test to Assess Classification**:

Suppose we $K$ training-validation pairs. For each one, we record the 
validation accuracies $A_1, \cdots, A_K$. 
The empirical mean and variance are:
$$
    m = \frac{1}{K} \sum_{k=1}^K A_k\,, \quad S^2 = \frac{1}{K-1}(A_k - m)^2 \,.
$$
The test statistic is then
$$
    T_K := \frac{\sqrt{K}(m - a_0)}{S} .
$$
Under the assumption of independence of each of the training-validation set pairs, 
it turns out that $T_K$ is distributed according to the Student $t$ distribution with $K-1$ degrees of freedom. 

In this case, we reject the null with a level of significance $\alpha$ if 
$$
    T_K > t_{K-1, \alpha},
$$
the $(1-\alpha)$-quantile of of the $t_{K-1}$ distribution. 

That is, we reject the null if 
$$
    m > a_0 + \frac{S}{\sqrt{K}} t_{K-1, \alpha} \,.
$$
Observe what happens as $K$ grows or $\alpha$ becomes smaller. 
![](https://lh3.googleusercontent.com/proxy/Rk0TX6KUcZLaFgMU42Qr553ALEHXt1YRIoZRIZfaoTMp69H5UcESVWmj3C-qE1NgtSUyngFqUx-v_O9__tzq29yUeZ3OKcmwVbby2bJ5neKzkBBFGzJhQzR9U0rWxL3kEYYV7ieeZh8hvCfLffhyP2AghYESkJqOa7fg5qAj)

The significance level $\alpha$ is the type-I error: the probability of rejecting the null hypothesis when it is correct. 
The type-II error is the probability of failing to reject the null when the alternate is correct; this is related to the *power* of the test. 

**Illustration**: What is the null hypothesis here?
![](https://qph.fs.quoracdn.net/main-qimg-a25c9f17379bd7b94719a77686dfb519)
Image source: https://effectsizefaq.com/2010/05/31/i-always-get-confused-about-type-i-and-ii-errors-can-you-show-me-something-to-help-me-remember-the-difference/


### The $t$-test in action
Let us assess the accuracy of one of the ConvNets we saw in Week 2. We will construct $5$ different training-validation splits of the data. 
We will test for the following:

$$
H_0: \text{accuracy} \le 0.87 \\
H_1: \text{accuracy} > 0.87
$$

We will use a significance level of $\alpha = 0.05$.

In [260]:
import numpy as np
import torch
from torchvision.datasets import MNIST, FashionMNIST
from torch.nn.functional import cross_entropy
import time
import scipy.stats

import matplotlib.pyplot as plt 
%matplotlib inline 

torch.manual_seed(0)
np.random.seed(1)

Download the FashionMNIST dataset and divide it into 5 train-val pairs.

In [248]:
train_dataset = FashionMNIST('./data', train=True, download=True)
X_train = train_dataset.data # torch tensor of type uint8
y_train = train_dataset.targets # torch tensor of type Long

X_train = X_train.float()  # convert to float32
X_train = X_train.view(-1, 784)
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mean[None, :]) / (std[None, :] + 1e-6)  # avoid divide by zero



# shuffle the data
idxs = np.random.permutation(X_train.shape[0])
size = X_train.shape[0]//10

Xs = [] 
ys = []
for i in range(10): # 5 train-val pairs
    subsample_idxs = idxs[i*size : (i+1)*size]
    X = X_train[subsample_idxs]
    y = y_train[subsample_idxs]
    Xs.append(X)
    ys.append(y)


Now we write the model and our helper functions

In [251]:
class ConvNet(torch.nn.Module):
    def __init__(self,num_classes=10):
        super().__init__()
        self.conv_ensemble_1 = torch.nn.Sequential(
            torch.nn.Conv2d(1, 16, kernel_size=5, padding=2),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2))
        self.conv_ensemble_2 = torch.nn.Sequential(
            torch.nn.Conv2d(16, 32, kernel_size=5, padding=2),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2))
        self.fc = torch.nn.Linear(7*7*32, 10)
        
    def forward(self, x):
        x = x.view(-1, 1, 28, 28)
        out = self.conv_ensemble_1(x)
        out = self.conv_ensemble_2(out)
        out = out.view(out.shape[0], -1)
        out = self.fc(out)
        return out
    
# Some utility functions to compute the objective and the accuracy
def compute_objective(model, X, y):
    score = model(X)
    # PyTorch's function cross_entropy computes the multinomial logistic loss
    return cross_entropy(input=score, target=y, reduction='mean') 

@torch.no_grad()
def compute_accuracy(model, X, y):
    score = model(X)
    predictions = torch.argmax(score, axis=1)  # class with highest score is predicted
    return (predictions == y).sum() * 1.0 / y.shape[0]

def sgd_one_pass(model, X, y, learning_rate, verbose=False):
    num_examples = X.shape[0]
    average_loss = 0.0
    for i in range(num_examples):
        idx = np.random.choice(X.shape[0])
        # compute the objective. 
        # Note: This function requires X to be of shape (n,d). In this case, n=1 
        objective = compute_objective(model, X[idx:idx+1], y[idx:idx+1]) 
        average_loss = 0.99 * average_loss + 0.01 * objective.item()
        if verbose and (i+1) % 100 == 0:
            print(average_loss)
        
        # compute the gradient using automatic differentiation
        gradients = torch.autograd.grad(outputs=objective, inputs=model.parameters())
        
        # perform SGD update. IMPORTANT: Make the update inplace!
        for (w, g) in zip(model.parameters(), gradients):
            w.data -= learning_rate * g.data
      
    
from tqdm.auto import trange # range + progress bar
def sgd_n_passes(X_train, y_train, X_val, y_val, n_passes, learning_rate):
    model = ConvNet()
    for i in trange(n_passes):
        sgd_one_pass(model, X_train, y_train, learning_rate)
    return compute_accuracy(model, X_val, y_val)

In [255]:
accuracies = []
for i in range(5):
    print(f'Starting run {i+1}')
    X_train, y_train = Xs[2*i], ys[2*i]
    X_val, y_val = Xs[2*i+1], ys[2*i+1]
    acc = sgd_n_passes(X_train, y_train, X_val, y_val, n_passes=30, learning_rate=2.5e-3)
    accuracies.append(acc)

Starting run 1


  0%|          | 0/30 [00:00<?, ?it/s]

Starting run 2


  0%|          | 0/30 [00:00<?, ?it/s]

Starting run 3


  0%|          | 0/30 [00:00<?, ?it/s]

Starting run 4


  0%|          | 0/30 [00:00<?, ?it/s]

Starting run 5


  0%|          | 0/30 [00:00<?, ?it/s]

In [257]:
accuracies = np.asarray(accuracies)

In [259]:
accuracies

array([0.87866664, 0.8703333 , 0.8821667 , 0.88633335, 0.8715    ],
      dtype=float32)

Now we run the test. 

In [261]:
alpha = 0.05  # significance level
a_0 = 0.87 # accuracy level we are testing for
K = accuracies.shape[0]
m = np.mean(accuracies)
s = np.std(accuracies, ddof=1)  # divide by K-1

# Compute the test statistic
T =  np.sqrt(K) * (m - a_0) / s
threshold = scipy.stats.t(df=K-1).ppf(1-alpha)  # 1-alpha quantile of t_{K-1}

print(f'Test statistic: {T}\t threshold: {threshold}')

if T > threshold:
    print('Reject the null')
else:
    print('Fail to reject the null')

Test statistic: 2.5435465167761073	 threshold: 2.13184678133629
Reject the null


We can work through what we would have gotten if were to test for 
accuracy being at least 88%. 

**NOTE**: We must determine the hypotheses before running the task. We are not supposed to adaptively change the test depending on the results. This is only "a simulation".

In [262]:
a_0 = 0.88 
# Compute the test statistic
T =  T = np.sqrt(K) * (m - a_0) / s
threshold = scipy.stats.t(df=K-1).ppf(1-alpha)  # 1-alpha quantile of t_{K-1}

print(f'Test statistic: {T}\t threshold: {threshold}')

if T > threshold:
    print('Reject the null')
else:
    print('Fail to reject the null')

Test statistic: -0.7174156594326454	 threshold: 2.13184678133629
Fail to reject the null
