# Assignment 3: Lexical Semantics and Word Vectors

**Notes about the autograder for this assignment:**

*   To submit the coding part of this assignment to Gradescope, download your notebook in .ipynb form and submit **only** this file. It must be named **CSE447_Assignment3.ipynb** for the autograder to work.
*   At a low-load time on Gradescope, the autograder for this assignment takes about **six minutes** to run. We recommend submitting the code part of your assignment with enough time before the deadline to allow the autograder to run, and for you to investigate/correct any issues it flags.
*   If you find that only **one** WEAT test is failing, we've seen this happen before (rarely, but it has happened) with a notebook that passes the same test on a different run. If that happens to you, and you're pretty sure that your code isn't the issue, try resubmitting your notebook.



In [1]:
%%bash

# get any other necessary files for this project

if [ ! -e "data-needed.txt" ]; then
  if [ ! -e "data_path_to_download_url.py" ]; then
    wget https://raw.githubusercontent.com/serrano-s/NLPassignments-students/refs/heads/main/data_path_to_download_url.py
  else
    echo "data_path_to_download_url.py script already downloaded to runtime"
  fi

  wget https://raw.githubusercontent.com/serrano-s/NLPassignments-students/refs/heads/main/assignments/WordEmbeddings/data-needed.txt

  # download all data files needed for the student-release version of this project (i.e., no hidden test files)
  DATA_NEEDED_FILE="data-needed.txt"
  closing_slash="/"
  while IFS= read -r line; do
    line="$(echo -e "${line}" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')";
    dirs_to_check="${line%${closing_slash}*}"
    mkdir -p $dirs_to_check
    download_url=$(python data_path_to_download_url.py "$line")
    echo $download_url;
    wget "$download_url" -O "$line"
  done < "$DATA_NEEDED_FILE"
else
  echo "data-needed.txt (and presumably therefore all necessary data files) already downloaded to runtime"
fi

if [ ! -e "other-setup-needed.sh" ]; then
  wget https://raw.githubusercontent.com/serrano-s/NLPassignments-students/refs/heads/main/assignments/WordEmbeddings/other-setup-needed.sh
  bash other-setup-needed.sh
  rm data_path_to_download_url.py
else
  echo "other-setup-needed.sh (and presumably therefore all other necessary files) already downloaded to runtime"
fi

In [2]:
%%bash
# Install required packages

pip install pandas

pip install sentence-transformers

pip install --upgrade ipython

In [3]:
# Make sure to always run this
%load_ext autoreload
%autoreload 2

In [4]:
import os
import json
import re
from typing import List, Tuple, Dict, Union
import pandas as pd
import torch

from wordvec_tests import (
    load_synonyms_data,
    Exercise1Runner,
    Exercise2Runner,
    Exercise3aRunner,
    Exercise3bRunner,
    Exercise4Runner,
    Exercise5Runner,
    Exercise6Runner
)

In [5]:
parent_dir = os.path.dirname(os.path.abspath("__file__"))
data_dir = os.path.join(parent_dir, "data")

## Part 1: Geometry of Word Embeddings

We provide a helper class to access the GloVe vectors.

In [6]:
class GloveEmbeddings:

    def __init__(self, path="embeddings/glove.6B/glove.6B.50d.txt"):
        self.path = path
        self.vec_size = int(re.search(r"\d+(?=d)", path).group(0))
        self.embeddings = {}
        self.load()

    def load(self):
        # Each line holds a word followed by vec_size space-separated floats
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                values = line.split()

                word = values[0]
                vector_values = list(map(float, values[-self.vec_size:]))
                vector = torch.tensor(vector_values, dtype=torch.float)
                self.embeddings[word] = vector

    def is_word_in_embeddings(self, word):
        return word in self.embeddings

    def get_vector(self, word):
        if not self.is_word_in_embeddings(word):
            return self.embeddings["unk"]
        return self.embeddings[word]

    # Use square-bracket indexing to get the vector of a word
    def __getitem__(self, word):
        return self.get_vector(word)

In [7]:
glove_vectors = GloveEmbeddings(
    path=f"{data_dir}/embeddings/glove.6B/glove.6B.50d.txt"
)

You can simply use `glove_vectors[<word>]` to get the vector for a word

In [8]:
vector = glove_vectors["the"]
print(vector.shape)
print(vector)

In [9]:
print(glove_vectors["unk"].shape)
print(glove_vectors["the"].shape)
print(glove_vectors["king"].shape)

Notice that we have a 50-dimensional vector for each word.

### Exercise 1: Synonyms

This part is adapted from Dan Jurafsky's NLP class CS124 at Stanford.

In [10]:
dev_synonyms_df = load_synonyms_data("dev", data_dir=data_dir)
dev_synonyms_df.head()

The task is to choose the synonym for a given word from a list of choices. We will use GloVe embeddings to find the closest word in the embedding space.

There are different metrics to obtain distance / similarity between two vectors in n-dimensional space. These include:

1. Euclidean Distance: $d(u, v)  = ||u - v||_2$
2. Manhattan Distance: $d(u, v) = ||u - v||_1$
3. Cosine Similarity: $s(u, v) = \frac{u \cdot v}{||u||_2 ||v||_2}$

where $u$ and $v$ are vectors.
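
To make the three formulas concrete, here is a minimal sketch on two toy vectors (the values are made up so the results can be checked by hand; they are not GloVe vectors). Note that for the two distances a smaller value means "closer", while for cosine similarity a larger value means "closer".

```python
import torch

# Toy vectors chosen so the results are easy to verify by hand.
u = torch.tensor([3.0, 4.0])
v = torch.tensor([6.0, 8.0])

# Euclidean distance: L2 norm of the difference, sqrt(3^2 + 4^2) = 5
euclidean = torch.linalg.norm(u - v, ord=2).item()

# Manhattan distance: L1 norm of the difference, |3| + |4| = 7
manhattan = torch.linalg.norm(u - v, ord=1).item()

# Cosine similarity: dot product divided by the product of the L2 norms.
# u and v point in the same direction, so the similarity is 1.
cosine = (torch.dot(u, v) / (torch.linalg.norm(u) * torch.linalg.norm(v))).item()

print(euclidean, manhattan, cosine)
```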

You will implement a function `find_synonym` that for a word finds the closest synonym from a list of words. The method also receives a distance / similarity metric to use. The function should return the synonym and the value of the metric for the closest word.


In [11]:
def cosine_similarity(v1: torch.Tensor, v2: torch.Tensor) -> float:
    """
    Compute the cosine similarity between two vectors.

    Inputs:
    v1: torch.Tensor of shape (n,)
    v2: torch.Tensor of shape (n,)

    Returns:
    float: cosine similarity between v1 and v2
    """



    raise NotImplementedError


def euclidean_distance(v1: torch.Tensor, v2: torch.Tensor) -> float:
    """
    Compute the Euclidean distance between two vectors.

    Inputs:
    v1: torch.Tensor of shape (n,)
    v2: torch.Tensor of shape (n,)

    Returns:
    float: Euclidean distance between v1 and v2
    """

    raise NotImplementedError


def manhattan_distance(v1: torch.Tensor, v2: torch.Tensor) -> float:
    """
    Compute the Manhattan distance between two vectors.

    Inputs:
    v1: torch.Tensor of shape (n,)
    v2: torch.Tensor of shape (n,)

    Returns:
    float: Manhattan distance between v1 and v2
    """


    raise NotImplementedError


def find_synonym(
    word: str,
    choices: List[str],
    embeddings: GloveEmbeddings,
    metric: str = "cosine"
) -> Dict[str, Union[str, float]]:

    synonym_dict = {
        "synonym": None,
        "metric": None
    }


    raise NotImplementedError


In [12]:
exercise1 = Exercise1Runner(
    find_synonym=find_synonym,
)

exercise1.evaluate(True) #Set False if you only want to see the final accuracies

You should expect the dev accuracy to be 83% with the cosine similarity metric, 67% with Euclidean distance, and 70% with Manhattan distance.

### Exercise 2: Analogies

In this exercise you will use the Linear Representation Hypothesis (check handout to learn about it) to write code that automatically solves the analogy task. As an example:

man is to king as woman is to ______ ?

a) princess

b) queen

c) wife

d) ruler

The task is to find the most appropriate word out of the 4 choices that completes the analogy.

This part is adapted from Dan Jurafsky's NLP class CS124 at Stanford.
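
One common way to turn the analogy "a is to aa as b is to ?" into vector arithmetic is to build the target vector `aa - a + b` and pick the choice whose embedding has the highest cosine similarity to it. The sketch below illustrates this on hypothetical 2-d embeddings (the `toy` dictionary is invented for illustration and is not how the handout necessarily formulates the task).

```python
import torch
import torch.nn.functional as F

# Hypothetical 2-d embeddings, made up for illustration (not GloVe values).
toy = {
    "man": torch.tensor([1.0, 0.0]),
    "woman": torch.tensor([1.0, 1.0]),
    "king": torch.tensor([3.0, 0.0]),
    "queen": torch.tensor([3.0, 1.0]),
    "princess": torch.tensor([1.0, 3.0]),
    "ruler": torch.tensor([3.0, -1.0]),
}

# man : king :: woman : ?   ->   target = king - man + woman
target = toy["king"] - toy["man"] + toy["woman"]

# Score every candidate by cosine similarity to the target and keep the best one.
choices = ["princess", "queen", "ruler"]
scores = {c: F.cosine_similarity(target, toy[c], dim=0).item() for c in choices}
best = max(scores, key=scores.get)

print(best, scores)
```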

In [13]:
def find_analogy_word(
    a: str,
    b: str,
    aa: str,
    choices: List[str],
    embeddings: GloveEmbeddings,
):

    """
    Given the analogy relation a is to aa as b is to ____, find the word from choices that completes the analogy.
    e.g. man is to king as woman is to ____?
    a) princess
    b) queen
    c) wife
    d) ruler

    Note: Use cosine similarity as the metric for this function.

    Inputs:
    - a, b, aa: The words in the analogy relation
    - choices: A list of words to choose from
    - embeddings: GloveEmbeddings object

    Returns:
     str: The word from choices that best completes the analogy
    """

    answer = None


    raise NotImplementedError

    return answer


In [14]:
exercise2 = Exercise2Runner(find_analogy_word=find_analogy_word)
exercise2.evaluate(True) #Set False if you only want to see the final accuracies

You should see an accuracy of 64%.

### Exercise 3: Bias in Word Embeddings

We will now implement the Word Embedding Association Test (WEAT) to identify biases in word embeddings. Check the handout for a detailed explanation of the test. You will start by implementing the *Effect Size* metric by implementing the functions:

- `word_association_wth_attribute`
- `weat_effect_size`

`word_association_wth_attribute` computes $s(w, A, B)$ for a given word $w$ and attributes $A$ and $B$, i.e. the association of the word with the attributes. Recall that this is given by:

$$s(w, A, B) = \text{mean}_{a \in A}\text{cos}(\vec{w}, \vec{a}) - \text{mean}_{b \in B}\text{cos}(\vec{w}, \vec{b})$$

The effect size is then given as:

$$\text{effect-size} = \frac{\text{mean}_{x \in X} s(x, A, B) - \text{mean}_{y \in Y} s(y, A, B)}{\text{std-dev}_{w \in X \cup Y} s(w, A, B)}$$
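
As a sanity check on the two formulas, the following sketch evaluates them on tiny hand-made 2-d vectors. The helper name `association` and all vectors are assumptions for illustration; the sketch works on raw tensors rather than on words and a `GloveEmbeddings` object. It uses `torch.std` with its default (sample) standard deviation.

```python
import torch
import torch.nn.functional as F

def association(w, A, B):
    """s(w, A, B) for a vector w and attribute matrices A, B (one row per attribute word)."""
    cos_A = F.cosine_similarity(w.unsqueeze(0), A, dim=1)  # cosine of w with every row of A
    cos_B = F.cosine_similarity(w.unsqueeze(0), B, dim=1)  # cosine of w with every row of B
    return (cos_A.mean() - cos_B.mean()).item()

# Toy 2-d vectors, made up for illustration.
A = torch.tensor([[1.0, 0.0]])               # attribute set A
B = torch.tensor([[0.0, 1.0]])               # attribute set B
X = torch.tensor([[1.0, 0.0], [4.0, 3.0]])   # targets that lean toward A
Y = torch.tensor([[0.0, 1.0], [3.0, 4.0]])   # targets that lean toward B

s_X = torch.tensor([association(x, A, B) for x in X])  # approximately [1.0, 0.2]
s_Y = torch.tensor([association(y, A, B) for y in Y])  # approximately [-1.0, -0.2]

# Difference of the mean associations, divided by the std-dev over all target words.
effect_size = ((s_X.mean() - s_Y.mean()) / torch.cat([s_X, s_Y]).std()).item()
print(s_X, s_Y, effect_size)
```

A positive effect size here means the words in X are, on average, more associated with A (relative to B) than the words in Y are.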

In [16]:
with open(f"{data_dir}/weat/weat.json", "r") as f:
    weat_data = json.load(f)

weat_data.keys()

In [17]:
def word_association_wth_attribute(
    word: str,
    A: List[str],
    B: List[str],
    embeddings: GloveEmbeddings,
) -> float:

    """
    Finds the association of a word in the embedding space with the two sets of attribute words.
    E.g. Given the word "rose", a set of pleasant words A, and a set of unpleasant words B, the function measures how much more strongly "rose" is associated with A than with B.

    Inputs:
    - word: The word for which we want to find the association
    - A: List of words representing the first set of attributes
    - B: List of words representing the second set of attributes
    - embeddings: GloveEmbeddings object

    Returns:
    float: The association of the word with the two sets of attributes
    """

    association = None


    raise NotImplementedError

    return association

def weat_effect_size(
    X: List[str],
    Y: List[str],
    A: List[str],
    B: List[str],
    embeddings: GloveEmbeddings,
) -> float:
    """
    Compute the effect size of the WEAT test.

    Inputs:
    - X: List of target words for which we want to find the association.
    - Y: List of target words for which we want to find the association.
    - A: List of words representing the first set of attributes
    - B: List of words representing the second set of attributes
    - embeddings: GloveEmbeddings object
    """

    X_associations = torch.mean(torch.tensor([word_association_wth_attribute(x, A, B, embeddings) for x in X]))
    Y_associations = torch.mean(torch.tensor([word_association_wth_attribute(y, A, B, embeddings) for y in Y]))

    XY = list(set(X).union(set(Y)))
    XY_associations_std = torch.std(torch.tensor([word_association_wth_attribute(xy, A, B, embeddings) for xy in XY]))

    return ((X_associations - Y_associations) / XY_associations_std).item()

In [18]:
exercise3 = Exercise3aRunner(
    weat_effect_size=weat_effect_size,
)
exercise3.evaluate_effect_size(
    True
)  # Set False if you only want to see the final effect sizes

You should observe the following numbers:


Effect size for Flowers-Insects Pleasant Unpleasant case: 1.07782

Effect size for MusicalInstruments-Weapons Pleasant Unpleasant case: 1.54396

Effect size for EuropeanAmerican AfricanAmerican_Pleasant_Unpleasant case: 1.00395

Effect size for Male_Female Career_Family case : 1.70778

Effect size for Math_Art Male_Female case : 1.49513

We will now check how statistically significant these effect sizes are. You will implement the function `target_words_diff_association_wth_attribute`, which calculates the value of the test statistic $s(X, Y, A, B)$. Recall that this is given by:

$$    s(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B)$$

We then use this function to calculate the p-value of the permutation test. We provide the implementation of the function `weat_p_value`. We recommend reading through the code to understand how the p-value is calculated, even though you do not need to implement it yourself.
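
To build intuition for the permutation test, here is a small sketch that enumerates every way of re-splitting the pooled target words instead of sampling random splits. It assumes the per-word associations $s(w, A, B)$ have already been computed; the function name `exact_p_value` and the numbers are hypothetical. The p-value is the fraction of re-splits whose test statistic is strictly greater than the observed one.

```python
from itertools import combinations

def exact_p_value(x_scores, y_scores):
    """One-sided permutation p-value for the statistic sum(x_scores) - sum(y_scores)."""
    observed = sum(x_scores) - sum(y_scores)
    pooled = x_scores + y_scores
    total = sum(pooled)
    n = len(x_scores)

    stats = []
    # Every choice of n pooled items forms a new "X"; the remaining items form the new "Y".
    for idx in combinations(range(len(pooled)), n):
        x_sum = sum(pooled[i] for i in idx)
        stats.append(x_sum - (total - x_sum))

    # Fraction of re-splits with a strictly larger statistic than the observed split.
    return sum(s > observed for s in stats) / len(stats)

p_separated = exact_p_value([3.0, 4.0], [1.0, 2.0])  # X holds the two largest scores
p_mixed = exact_p_value([3.0, 2.0], [4.0, 1.0])      # X and Y are interleaved
print(p_separated, p_mixed)
```

When X already contains the largest scores, no re-split can beat the observed statistic, so the p-value is exactly 0; when the groups are interleaved, some re-splits do better and the p-value grows.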

In [19]:
def target_words_diff_association_wth_attribute(
    X: List[str],
    Y: List[str],
    A: List[str],
    B: List[str],
    embeddings: GloveEmbeddings,
) -> float:
    """
    Finds the differential association of words in X with the attribute sets A and B vs the words in Y with the attribute sets A and B.

    E.g. X can be flower names and Y can be insect names. A can be pleasant words and B can be unpleasant words.
    The function measures the difference in total association of flower names and insect names with sets A and B.

    Inputs:
    - X: List of target words in the first target set.
    - Y: List of target words in the second target set.
    - A: List of words representing the first set of attributes
    - B: List of words representing the second set of attributes
    - embeddings: GloveEmbeddings object
    """

    diff_association = None


    raise NotImplementedError

    return diff_association


def weat_p_value(
    X: List[str],
    Y: List[str],
    A: List[str],
    B: List[str],
    embeddings: GloveEmbeddings,
    max_permutations: int = 1000,
) -> float:
    """
    Compute the p-values of the WEAT test.

    Inputs:
    - X: List of target words for which we want to find the association.
    - Y: List of target words for which we want to find the association.
    - A: List of words representing the first set of attributes
    - B: List of words representing the second set of attributes
    - embeddings: GloveEmbeddings object
    """

    from sympy.utilities.iterables import multiset_permutations
    import numpy as np
    diff_association = target_words_diff_association_wth_attribute(X, Y, A, B, embeddings)

    target_words = X + Y
    # print(len(target_words))

    partition_idx = np.zeros(len(target_words))
    partition_idx[:len(target_words) // 2] = 1

    partition_dff_associations = []
    for _ in range(max_permutations):
        if len(partition_dff_associations) >= max_permutations:
            break
        i = np.random.permutation(partition_idx)
        X_perm = [target_words[j] for j in range(len(target_words)) if i[j] == 1]
        Y_perm = [target_words[j] for j in range(len(target_words)) if i[j] == 0]
        partition_dff_associations.append(
            target_words_diff_association_wth_attribute(
                X_perm, Y_perm, A, B, embeddings
            )
        )
    return np.sum(np.array(partition_dff_associations) > diff_association) / len(
        partition_dff_associations
    )

In [20]:
exercise3b = Exercise3bRunner(
    weat_p_value=weat_p_value,
)

exercise3b.evaluate(True) #Set False if you only want to see the final p-values # This might take a while to run (1 minute during our testing)

You should see p-values of 0.0 for all the cases. Note that this is approximate, since we consider only 1000 random permutations rather than exhaustively enumerating all of them, but the actual p-value should be very close to 0.

### Useful pointers for write up question 3 in 1.1.3 of handout:

GloVe embeddings of a given dimension (e.g. 50) trained on a given number of tokens (e.g. 6B) can be loaded as:

```
glove_vectors = GloveEmbeddings(
    path=f"{data_dir}/embeddings/glove.6B/glove.WEAT.6B.50d.txt",
)
```

where you change `6B` and `50` to match the number of training tokens and the embedding dimension.

To load weat data of a particular category, you can use the following code:

```
category = "Flowers_Insects_Pleasant_Unpleasant" # Change this to the category you want to load

weat_category_data = weat_data[category]

X = weat_category_data["X"]
Y = weat_category_data["Y"]
A = weat_category_data["A"]
B = weat_category_data["B"]
```

In [21]:
# <NO_AUTOGRADE> <=== Leave this code cell as is, otherwise the autograder will have trouble processing your submission

In [22]:
# We will get glove vectors for all the words of interest and save them in a txt file

def save_glove_vectors(words_of_interest, path="data/embeddings", vec_size=50, num_tokens="6B"):

    glove_path = f"{path}/glove.{num_tokens}/glove.{num_tokens}.{vec_size}d.txt"
    all_glove_embeddings = GloveEmbeddings(glove_path)

    out_dir = f"{path}/glove.{num_tokens}/"
    if not os.path.exists(out_dir):
        os.makedirs(out_dir)

    out_path = f"{out_dir}/glove.WEAT.{num_tokens}.{vec_size}d.txt"
    with open(out_path, "w") as f:
        for word in words_of_interest:
            vector = all_glove_embeddings[word]
            f.write(f"{word} {' '.join([str(v) for v in vector.tolist()])}\n")

    print(f"Saved embeddings to {out_path}")

In [23]:
# Leave this code cell as is, otherwise the autograder will have trouble processing your submission
# </NO_AUTOGRADE>

## Part 2: From Word Embeddings to Sentence Level Embeddings

### Exercise 4: Sentence Similarity

*This part is adapted from Dan Jurafsky's NLP class CS124 at Stanford.*

In this exercise you will be building sentence level representations using word embeddings and then use them to find similarity between two sentences. In particular your goal is to answer questions of the form:

```
    True/False: the following two sentences are semantically similar:
      1. he later learned that the incident was caused by the concorde's sonic boom
      2. he later found out the alarming incident had been caused by concorde's powerful sonic boom
```

To build sentence level representations, you will use the following approaches:

* **Simple sum**: Simply take the sum of word embeddings of all words in the sentence
* **Sum with POS weighting**: Take a weighted sum of the individual word vectors, where the weight of each word depends on the part of speech tag (POS) of the word.

Specifically, you will implement the following functions:
* **get_sentence_embedding()**: given a sentence (string), return the sentence embedding (vector). The function also takes in the parameter `use_POS`:
    * if `use_POS` is false (regular case), leverage method 1 above - simply the sum of the word embeddings for each word in the sentence (ignoring words that don’t appear in our vocabulary).
    * if `use_POS` is true, leverage method 2 - use a weighted sum, where we weight each word by a scalar that depends on its part of speech tag.
* **get_sentence_similarity()**: given two sentences, find the cosine similarity between their corresponding sentence embeddings.

Helpful hints:

* Lowercase all words in the sentence before you look them up in the embeddings. The Glove embeddings that we are using have all words in lowercase.

* We’ve given you a map `pos_weights` that maps part of speech tags to their associated weight. For example, `pos_weights['NN'] = 0.8` (where NN is the POS tag for noun).
* You may skip words that either (1) are not in our embeddings or (2) have a POS tag that is not in `pos_weights`. To check whether a word is in our embeddings, you can use the following code snippet:

```
    if glove_vectors.is_word_in_embeddings(word):
        pass  # word is in embeddings
    else:
        pass  # word is not in embeddings
```
* To get a list of all the words in the sentence, use nltk's word_tokenize function.

  ```
  >>> sentence = "this is a sentence"
  >>> word_tokens = word_tokenize(sentence)
  >>> word_tokens
  ['this', 'is', 'a', 'sentence']
  ```

* To get the POS tags for each word in a sentence, you can use `nltk.pos_tag`. You provide a list of the words in a sentence, and it returns a list of tuples, where the first element is the word and the second is its corresponding POS tag. **For this PA, make sure that you pass the entire sentence to a single call to `nltk.pos_tag`; do not call `nltk.pos_tag` separately on each word in the sentence.** This is because some words can be multiple parts of speech (for example, "back" can be a noun or a verb). Passing in the entire sentence gives the tagger more context to figure out which POS tag a word should have.

  ```
  >>> tagged_words = nltk.pos_tag(word_tokens)
  >>> tagged_words
  [('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sentence', 'NN')]
  ```
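
The two sentence-embedding methods described above can be sketched on toy data as follows. To keep the example self-contained, the tags are written by hand in the format `nltk.pos_tag` returns, and the 2-d vectors and weights are hypothetical; this is only an illustration of the skipping and weighting rules, not a reference implementation.

```python
import torch

# Hypothetical 2-d embeddings and POS weights, made up for illustration.
toy_vectors = {
    "dogs": torch.tensor([1.0, 0.0]),
    "bark": torch.tensor([0.0, 1.0]),
    "the": torch.tensor([0.5, 0.5]),
}
toy_pos_weights = {"NNS": 0.8, "VBP": 0.6}

# (word, tag) pairs in the format nltk.pos_tag returns, written by hand here.
tagged = [("The", "DT"), ("dogs", "NNS"), ("bark", "VBP"), ("loudly", "RB")]

plain_sum = torch.zeros(2)
weighted_sum = torch.zeros(2)
for token, tag in tagged:
    word = token.lower()              # the embeddings are lowercase
    if word not in toy_vectors:
        continue                      # skip words that are not in the vocabulary
    plain_sum += toy_vectors[word]    # method 1: simple sum
    if tag in toy_pos_weights:        # method 2: also skip tags without a weight
        weighted_sum += toy_pos_weights[tag] * toy_vectors[word]

print(plain_sum, weighted_sum)
```

In this toy run, "loudly" is skipped by both methods because it has no vector, while "the" contributes to the simple sum but not to the weighted sum because its tag has no weight.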



In [24]:
# You will use nltk for tokenizing and tagging!
import nltk
from nltk.tokenize import word_tokenize

In [25]:
# Run this cell to download the nltk tagger
nltk.download("punkt")
nltk.download('punkt_tab')
nltk.download("averaged_perceptron_tagger_eng")

In [26]:
# Run this cell to load POS Weights
with open(f"{data_dir}/pos_weights.txt", "r") as f:
    # Skip blank lines (e.g. a trailing newline) so that split() always yields a tag and a weight
    pos_weights = {line.split()[0]: float(line.split()[1]) for line in f if line.strip()}
print(pos_weights)

In [27]:
# Run this cell to load the dataset
sentence_df = pd.read_csv(f"{data_dir}/sentence_similarity/dev.csv", sep="\t", header=None, names=["label", "sentence1", "sentence2"])
sentence_df.head()

In [28]:
def get_sentence_embedding(
    sentence: str,
    word_embeddings: GloveEmbeddings,
    use_POS: bool = False,
    pos_weights: Dict[str, float] = None,
):
    """
    Compute the sentence embedding using the word embeddings.

    Inputs:
    - sentence: The input sentence
    - word_embeddings: GloveEmbeddings object
    - use_POS: Whether to use POS tagging
    - pos_weights: Dictionary containing POS weights

    Returns:
    torch.Tensor: The sentence embedding
    """

    sentence_embedding = None


    raise NotImplementedError

    return sentence_embedding


def get_sentence_similarity(
    sentence1: str,
    sentence2: str,
    word_embeddings: GloveEmbeddings,
    use_POS: bool = False,
    pos_weights: Dict[str, float] = None,
):
    """
    Compute the similarity between two sentences.

    Inputs:
    - sentence1: The first input sentence
    - sentence2: The second input sentence
    - word_embeddings: GloveEmbeddings object
    - use_POS: Whether to use POS tagging
    - pos_weights: Dictionary containing POS weights

    Returns:
    float: The similarity between the two sentences
    """

    similarity = None


    raise NotImplementedError

    return similarity

In [29]:
exercise4 = Exercise4Runner(
    get_sentence_similarity=get_sentence_similarity,
)

exercise4.evaluate(True) #Set False if you only want to see the final accuracies

Note that our tester uses a threshold of 0.95 on the similarity score: a prediction counts as 1 if the cosine similarity returned by your function is greater than 0.95, and as 0 otherwise. You should expect the following accuracies:

accuracy using sum of word vectors : 0.85

accuracy using sum of word vectors with POS weights : 0.925
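
The thresholding rule can be illustrated with a few made-up similarity scores and labels. This sketch only mirrors the rule as described above; the tester's actual implementation is not shown in this notebook.

```python
import torch

# Hypothetical cosine similarities for four sentence pairs and their gold labels.
similarities = torch.tensor([0.97, 0.90, 0.99, 0.951])
labels = torch.tensor([1, 0, 0, 1])

# Predict 1 ("similar") only when the similarity is strictly greater than the threshold.
threshold = 0.95
predictions = (similarities > threshold).long()

# Accuracy is the fraction of pairs where the prediction matches the label.
accuracy = (predictions == labels).float().mean().item()
print(predictions.tolist(), accuracy)
```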

### Exercise 5: K-Nearest Neighbors Classifier using GloVe-Based Sentence Embeddings

We will now utilize the sentence representations to build a K-Nearest Neighbors (KNN) classifier. A KNN classifier classifies a sentence by finding its nearest neighbors (in the embedding space) in the training data and taking a majority vote of the labels of neighbors.


We will work with the SST dataset from HW1, i.e. classifying the sentiment of movie reviews, in both the binary and the 5-class setting. You will implement the following class:

`GloveKNNClassifier` with methods:
- `__init__`: Initialize the classifier with the value of k and the choice of sentence embedding method (simple sum or sum with POS weights)
- `fit`: Calculates the sentence embeddings for all sentences in the training data and stores them. Also stores the corresponding labels.
- `predict`: Predicts the label(s) for the given sentence(s) using the K-Nearest Neighbors algorithm. The distance metric to use is cosine similarity.

Helpful Tips:

- Lower case the training as well as test sentences before computing their embeddings (if your `get_sentence_embedding` already handles that for you, you don't need to do it again here)

- To avoid division by zero when calculating cosine similarity, you can add a small epsilon value (e.g. 1e-8) to the denominator.

- We strongly recommend implementing a vectorized version of `cosine_similarity` here: instead of computing the similarity between a test sentence and a training sentence one pair at a time, implement a function that takes n test sentences and m train sentences and uses matrix multiplication to produce a similarity matrix of shape [n, m].
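
A minimal sketch of such a vectorized similarity is shown below. The function name `pairwise_cosine` is hypothetical, and the epsilon is added to each row norm before dividing, which is one of several reasonable ways to guard against all-zero embeddings.

```python
import torch

def pairwise_cosine(test_emb, train_emb, eps=1e-8):
    """Cosine similarity between every test row and every train row, shape [n, m]."""
    # Normalize each row to (approximately) unit length; eps guards against zero vectors.
    test_unit = test_emb / (test_emb.norm(dim=1, keepdim=True) + eps)
    train_unit = train_emb / (train_emb.norm(dim=1, keepdim=True) + eps)
    # The dot products of unit vectors are the cosine similarities.
    return test_unit @ train_unit.T

test_emb = torch.tensor([[1.0, 0.0], [1.0, 1.0]])               # n = 2
train_emb = torch.tensor([[2.0, 0.0], [0.0, 3.0], [0.0, 0.0]])  # m = 3, last row is all zeros
sims = pairwise_cosine(test_emb, train_emb)

print(sims.shape)
print(sims)
```

The all-zero training row yields a similarity of 0 instead of NaN, which keeps a later top-k selection well defined.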

In [30]:
class GloveKNNClassifier:

    def __init__(self, word_embeddings: GloveEmbeddings, k: int = 5, use_POS: bool = False, pos_weights: Dict[str, float] = None):
        """
        Initialize the KNN Classifier.
        Inputs:
        - word_embeddings: GloveEmbeddings object
        - k: Number of nearest neighbors to consider
        - use_POS: Whether to use POS tag based weights for the embeddings
        - pos_weights: Dictionary containing POS weights

        """


        raise NotImplementedError

    def fit(self, X_train: List[str], y_train: List[int]):
        """
        Fit the KNN Classifier by calculating the sentence embeddings for the training sentences and storing them. Also store the corresponding labels.
        Inputs:
        - X_train: List of training sentences (documents)
        - y_train: List of corresponding labels
        """


        raise NotImplementedError

    def predict(self, X_test: List[str]) -> List[int]:
        """
        Predicts the labels for the test sentences using the training data embeddings.

        Inputs:
        - X_test: List of test sentences

        Returns:
        - List of predicted labels

        Hint: `torch.topk` might be useful here.
        """

        y_pred = None

        raise NotImplementedError

        return y_pred

    # Any extra functions you need can be added here

    # YOUR CODE HERE


In [31]:
exercise5 = Exercise5Runner(GloveKNNClassifier)
exercise5.evaluate(k=5)

Using $k=5$, you should see the following accuracies:

Binary Classification:
- Train accuracy using sum of word vectors : 0.76884
- Dev accuracy using sum of word vectors : 0.65849

- Train accuracy using sum of word vectors with POS weights : 0.77142
- Dev accuracy using sum of word vectors with POS weights : 0.63397

Multi-class Classification:
- Train accuracy using sum of word vectors : 0.52037
- Dev accuracy using sum of word vectors : 0.30609

- Train accuracy using sum of word vectors with POS weights : 0.52095
- Dev accuracy using sum of word vectors with POS weights : 0.27611

A good way to debug your code can be to check if you get 100% train accuracies for all the cases when you use `k = 1`, since the closest point to a point is the point itself. You can check that by running:

```
exercise5.evaluate(k=1)
```

**Note**: You'll need to decide how to handle ties between different first-place labels aggregated from neighboring instances. While in principle, any way of doing this is fine, if you want to match our numbers, use `torch.mode` to do so.
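
The neighbor lookup and majority vote can be sketched with `torch.topk` and `torch.mode` as follows. The similarity matrix and labels are made up, and the example uses an odd `k` with binary labels so that no tie can occur.

```python
import torch

# Hypothetical similarity matrix: 2 test sentences (rows) vs. 4 training sentences (columns).
sims = torch.tensor([[0.9, 0.1, 0.8, 0.7],
                     [0.2, 0.6, 0.1, 0.5]])
train_labels = torch.tensor([1, 0, 1, 0])

k = 3
# Indices of the k most similar training sentences for each test sentence.
_, neighbor_idx = torch.topk(sims, k=k, dim=1)

# Look up the labels of those neighbors, shape [n, k].
neighbor_labels = train_labels[neighbor_idx]

# Majority vote: the most frequent label in each row.
predictions = torch.mode(neighbor_labels, dim=1).values
print(neighbor_idx.tolist(), predictions.tolist())
```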

You may have noticed that the nearest neighbor classifier with GloVe embeddings doesn't perform very well; in fact, it performs worse than the linear classifiers we trained in HW1. One reason is the way we construct sentence embeddings by summing word embeddings. Can you think of issues with this approach?

### Exercise 6: K-Nearest Neighbors Classifier using Transformer-Based Sentence Embeddings

We will now use pre-trained transformer-based sentence embeddings to build our KNN classifier. These embeddings are trained on large text corpora and learn sentence-level representations that are much more powerful than a simple sum of word embeddings.
While we haven't covered transformer-based sentence embeddings in class yet, we would like to give you a flavor of how directly trained sentence-level representations can be more powerful. We will use the `sentence-transformers` library to get the sentence embeddings. Below we provide a helper function `get_st_embeddings`, which takes a list of sentences and a sentence transformers model and returns the sentence embeddings.

In [34]:
from sentence_transformers import SentenceTransformer

def get_st_embeddings(sentences: Union[str, List[str]], st_model: SentenceTransformer, batch_size: int = 32):
    """
    Compute the sentence embedding using the Sentence Transformer model.

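    Inputs:
    - sentences: A single sentence or a list of sentences
    - st_model: SentenceTransformer model
    - batch_size: Encode in batches to avoid memory issues in case multiple sentences are passed

    Returns:
    torch.Tensor: The sentence embedding of shape [d,] (when only 1 sentence) or [n, d] where n is the number of sentences and d is the embedding dimension
    """

    # A single string is encoded directly; slicing it below would split it into characters
    if isinstance(sentences, str):
        return st_model.encode(sentences.lower(), convert_to_tensor=True)

    sentence_embeddings = None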

    for i in range(0, len(sentences), batch_size):
        batch_sentences = sentences[i:i + batch_size]
        # Lowercase the sentences
        batch_sentences = [sentence.lower() for sentence in batch_sentences]
        batch_embeddings = st_model.encode(batch_sentences, convert_to_tensor=True)
        if sentence_embeddings is None:
            sentence_embeddings = batch_embeddings
        else:
            sentence_embeddings = torch.cat([sentence_embeddings, batch_embeddings], dim=0)

    return sentence_embeddings

In [35]:
# Example usage

# Load model
st_model = SentenceTransformer(
    "all-MiniLM-L6-v2"
)  # You can use any model from the Sentence Transformers library. See the list here: https://sbert.net/docs/sentence_transformer/pretrained_models.html

# Get embeddings
sent_embeddings = get_st_embeddings(
    ["This is a test sentence", "This is another test sentence"], st_model
)
print(sent_embeddings.shape)

In [36]:
class SentenceTransformerKNNClassifier(GloveKNNClassifier):

    def __init__(
        self, st_model: str = "all-mpnet-base-v2", k: int = 5, batch_size: int = 128
    ):

        super().__init__(None, k)

        self.st_model = SentenceTransformer(st_model)
        self.k = k
        self.batch_size = batch_size

    def fit(self, X_train: List[str], y_train: List[int]):


        raise NotImplementedError

    def predict(self, X_test: List[str]) -> List[int]:

        y_pred = None

        raise NotImplementedError

        return y_pred

In [37]:
# You can first evaluate your implementation on debug mode that only uses a fraction of train and dev sets
exercise6 = Exercise6Runner(SentenceTransformerKNNClassifier)
exercise6.evaluate(k=10, debug=True)

In debug mode, you should see a dev accuracy of 0.9 for the binary case and 0.3 for the multi-class case. Note that these are not the actual accuracy values, but a way to check that your code runs without errors. Once you are satisfied with the debug accuracies, you can run the code below to get the actual accuracies.

In [38]:
# This will take a while to run (can take up to 40 minutes).
# You can reduce the runtime by using a GPU runtime and placing the SentenceTransformer model on it
# (for example, by passing device="cuda" when constructing the model).

exercise6 = Exercise6Runner(SentenceTransformerKNNClassifier)
exercise6.evaluate(k=10, debug=False)

You should see the following accuracies:

Binary Classification:
- Dev accuracy: 0.81471

Multi-class Classification:
- Dev accuracy: 0.42507

As you can see, these numbers are much better than those obtained with the GloVe embeddings (more than 10 percentage points of dev accuracy in both the binary and the multi-class setting). We can further improve these numbers by fine-tuning the transformer model on our specific task, but that is outside the scope of this assignment.