---
title: "DSAN 5000 HW4.4: Evaluating Embeddings with WordNet"
format:
  html:
    embed-resources: true
    toc: true
    df-print: kable
    link-external-newwindow: true
    link-external-icon: true
---

You've reached the final part of HW4! Similar to how you used the original NY Times sections to evaluate the unsupervised topic estimation at the end of Part 1, here you will use a linguistic resource called WordNet to evaluate the (unsupervised) word embeddings you visualized in Part 3.

While in HW4.1 and HW4.2 we looked at how unsupervised learning algorithms can discover meaningful latent properties of **documents** (the section of the NY Times that the article was published in), here we'll see how t-SNE can allow us to discover meaningful latent properties of **words**.

Specifically, in this part we will draw on the notion that words in a given language can be clustered into [**"Synsets"**](https://wordnet.princeton.edu/): sets of words which have approximately the same meaning.

## Step 1: Imports and Global Configuration

In [2]:
import configparser
config = configparser.ConfigParser()
config.read("hw4.ini")
num_words_wordnet = int(config.get('Globals', 'num_words_wordnet'))
print(f"Global setting: Compute pairwise distances between [{num_words_wordnet}] words")

import pandas as pd

# For constructing a DataFrame containing all *pairs* of words from a list of
# individual words
from sklearn.utils.extmath import cartesian

# For computing the cosine similarity score between a pair of word vectors
from sklearn.metrics.pairwise import cosine_similarity

# For computing Synset path distances
from nltk.corpus import wordnet as wn

Global setting: Compute pairwise distances between [50] words
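
The cell above relies on an `hw4.ini` file that is not reproduced in this notebook. As a rough sketch, assuming only the three section/key pairs read in this part (`Globals`/`num_words_wordnet`, `ExternalFiles`/`embeddings`, and `DataPaths`/`word_weights`) and purely hypothetical placeholder paths, such a file could be parsed like this:

```python
import configparser

# Hypothetical contents of hw4.ini: the section and key names match the ones
# read in this notebook, but the file paths are placeholders.
example_ini = """
[Globals]
num_words_wordnet = 50

[ExternalFiles]
embeddings = path/to/embeddings.csv

[DataPaths]
word_weights = path/to/word_weights.csv
"""

config = configparser.ConfigParser()
config.read_string(example_ini)

# configparser stores every value as a string, so numeric settings need a cast;
# getint() does the lookup and the int() conversion in one call
num_words = config.getint("Globals", "num_words_wordnet")
embeddings_path = config.get("ExternalFiles", "embeddings")
print(num_words, embeddings_path)
```

Casting once, at read time, avoids having to wrap the setting in `int()` wherever it is used later.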


## Step 2: Load Embeddings

Your job in Step 2 here is the same as it was in HW4.3 Step 2: load the embedding vectors into a Pandas `DataFrame` named `emb_df`.

In [3]:
# Look up the path to the embeddings file, then load it into a DataFrame
config = configparser.ConfigParser()
config.read("hw4.ini")
emb_path = config.get('ExternalFiles', 'embeddings')

emb_df = pd.read_csv(emb_path)
emb_df.shape



(4874, 256)

## Step 3: Extract Vectors for Top $N$ Words

This is where the `num_words_wordnet` global variable from `hw4.ini` comes in: in the following code cell, reduce the full length-4874 `emb_df` `DataFrame` down into a `DataFrame` object named `top_word_df`, by keeping only the vectors for the **top $N$** most important words. Note that:

* $N$ is given by `num_words_wordnet`, and
* Importance is operationalized in terms of the **total tf-idf scores** for each word, that you computed in HW4.1

As was the case in HW4.3, if you don't have tf-idf scores for some of the 4874 words here, because of differences in text-cleaning code, you can just drop these words. Thus, after dropping any such cases, the entries within `top_word_df` should all have **non-`NA` weight values**.

Once you have constructed `top_word_df`, use `top_word_df.shape` as the last line in your code cell, to display and verify that `top_word_df` contains $N$ rows and 256 columns.
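
One possible shape for this filtering step is sketched below on toy data. The frames `emb_toy` and `weights_toy`, the `"weight"` column name, and `n_top` are illustrative assumptions, not the actual HW4 files:

```python
import pandas as pd

# Toy stand-ins: 5 words with 3-dimensional "embeddings" (the real data has 256 dimensions)
emb_toy = pd.DataFrame(
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2], [1.3, 1.4, 1.5]]
)
weights_toy = pd.DataFrame({
    "word": ["she", "game", "zzz", "music", "people"],
    "weight": [9.5, 4.0, None, 6.2, 7.1],  # "zzz" has no tf-idf score
})
n_top = 3  # plays the role of num_words_wordnet

# Column-wise concat assumes both frames list the words in the same row order
combined = pd.concat([emb_toy, weights_toy], axis=1)
top_toy = (
    combined.dropna(subset=["weight"])               # drop words without a weight
            .sort_values("weight", ascending=False)  # highest total tf-idf first
            .head(n_top)                             # keep the N highest-weighted words
            .drop(columns=["weight"])                # keep vectors plus the "word" label
)
print(top_toy["word"].tolist())  # ['she', 'people', 'music']
print(top_toy.shape)             # (3, 4)
```

Dropping columns by name rather than by position makes the result less sensitive to how many columns the weights file happens to contain.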

In [10]:
#| label: hw4-4-3-response
# Your code here
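# Read the number of top words and the path to the tf-idf word weights
config = configparser.ConfigParser()
config.read("hw4.ini")
num_words_wordnet = int(config.get('Globals', 'num_words_wordnet'))
weights_path = config.get('DataPaths', 'word_weights')
weights = pd.read_csv(weights_path)

# Attach the weights column-wise (assumes both files list words in the same row order)
top_word_df = pd.concat([emb_df, weights], axis=1)

# Rank by weight, drop words without a tf-idf score, and keep the top N
top_word_df = top_word_df.sort_values(by='weight', ascending=False)
top_word_df = top_word_df.dropna(axis=0)
top_word_df = top_word_df.iloc[:num_words_wordnet]
# Drop the columns at positions 256 and 258, keeping the 256 embedding columns plus "word"
top_word_df = top_word_df.drop(top_word_df.columns[[256, 258]], axis=1)
top_word_df.shape  # not displayed: only the last expression in a cell is shown
top_word_df.head()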

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 | word |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2069 | 0.007169 | 0.015344 | -0.043939 | 0.027289 | 0.036152 | 0.040868 | 0.015721 | 0.038797 | 0.036224 | -0.034164 | ... | -0.014884 | 0.018008 | 0.024236 | -0.010886 | -0.035874 | -0.024761 | 0.015486 | -0.016472 | 0.017559 | her |
| 3984 | -0.038468 | 0.022831 | -0.03424 | -0.031011 | 0.001344 | 0.022516 | -0.019301 | 9e-06 | -0.013699 | -0.016304 | ... | -0.010794 | 0.008976 | 0.003407 | 0.020292 | -0.032887 | -0.036207 | 0.040459 | -0.024848 | 0.010659 | she |
| 4808 | -0.038333 | -0.001641 | -0.002045 | -0.018365 | 0.041535 | 0.0121 | -0.046086 | 0.018965 | -0.000487 | -0.057306 | ... | -0.000996 | -0.014339 | -0.039227 | -0.029326 | -0.030796 | -0.044777 | 0.019669 | -0.02308 | -0.026501 | we |
| 2854 | -0.053021 | 0.02496 | -0.050793 | 0.009225 | 0.037813 | 0.072191 | -0.047812 | 0.034579 | -0.008728 | 0.005593 | ... | 0.028591 | 0.007184 | -0.052043 | 0.002593 | -0.121087 | -0.098834 | 0.008677 | -0.037302 | 0.003238 | ms |
| 2089 | -0.071359 | 0.017195 | 0.012831 | 0.062265 | 0.072296 | 0.028795 | -0.001176 | 0.024597 | 0.018794 | 0.000476 | ... | 0.064137 | -0.048348 | 0.00617 | -0.025071 | -0.048877 | -0.044238 | 0.022355 | 0.022146 | 0.020285 | him |


## Step 4: Construct Word Pairs `DataFrame`

In this step, the reason why we need to filter down to a small number of words (`num_words_wordnet`) will become clear! Use the [`cartesian()` function](https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/extmath.py#L793) from `scikit-learn`, imported in Step 1 above, to construct a `DataFrame` object named `word_pair_df`, where

* Each row should represent a **pair** of words (from the `"word"` column of `top_word_df`),
* The first column should be named `"w1"`, and
* The second column, representing the second word in the pair, should be named `"w2"`.


We will use this `DataFrame` throughout the following steps, to store the **Cosine similarities** and then the **WordNet path similarities** for each pair.

In the last line of the code cell, please use `word_pair_df.shape` to verify that you now have an $N^2 \times 2$ `DataFrame`, where $N$ represents the value of `num_words_wordnet` from `hw4.ini`.
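
To see why the word list has to be small, here is a minimal sketch of the pairing step. It uses `itertools.product` from the standard library as a stand-in for `cartesian()`, on the assumption that both produce pairs with the first element varying slowest; the three-word list is a toy example:

```python
from itertools import product
import pandas as pd

toy_words = ["her", "she", "we"]

# itertools.product yields pairs with the first element varying slowest,
# the same row order as the example in the cartesian() docstring
toy_pairs = pd.DataFrame(list(product(toy_words, repeat=2)), columns=["w1", "w2"])
print(toy_pairs.shape)  # (9, 2)

# The pair count grows quadratically with the vocabulary size
print(50 ** 2)    # 2500
print(4874 ** 2)  # 23755876
```

With $N = 50$ the table has 2,500 rows, whereas pairing all 4874 words would give nearly 24 million rows.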

In [5]:
#| label: hw4-4-4-response
# Your code here
import numpy as np

# Build every ordered pair (w1, w2) of the top-N words
words_array = np.array(top_word_df['word'])
pairs = cartesian([words_array, words_array])
word_pair_df = pd.DataFrame(pairs, columns=['w1', 'w2'])

word_pair_df.head(), word_pair_df.shape

(    w1   w2
 0  her  her
 1  her  she
 2  her   we
 3  her   ms
 4  her  him,
 (2500, 2))

## Step 5: Compute Pairwise Cosine Similarities

Here, use `scikit-learn`'s `cosine_similarity()` function (imported in Step 1 above) to compute a cosine similarity score for **all pairs** of the $N$ vectors you extracted in Step 3.

The nice thing about this `cosine_similarity()` function is that, if you give it just a **single** NumPy matrix (remember that you can convert a Pandas `DataFrame` into a NumPy matrix via `df.values`), it will assume that the **rows** of this matrix represent the items for which you'd like pairwise similarity scores, and compute the scores accordingly.

This means that, for example, if you provide an $N \times 256$ matrix, the function returns a new $N \times N$ matrix where the entry in row $i$, column $j$ represents the similarity between row $i$ and row $j$ of the originally-provided matrix.

Use this fact to construct such an $N \times N$ matrix via `cosine_similarity()`, saving the result as a variable named `pairwise_sims`.

You should then be able to use the `flatten()` function from NumPy to convert this $N \times N$ matrix into a single length-$N^2$ vector, which you should append as a new column named `"cosine_sim"` within the `word_pair_df` `DataFrame` object created in the previous step. (Here, like in the earlier cases where you used `pd.concat()`, the flattened version of `pairwise_sims` should have the same ordering as the word pairs in `word_pair_df`, as long as you did not re-arrange these objects at any point!)

As the final line in the code cell, use `word_pair_df.head()` to display the first 5 rows, to verify that the values in the new `"cosine_sim"` column are valid and reasonable (for example, the similarity between a word and itself should be `1.0`!)
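
The ordering claim can be checked on a small example. The sketch below does not call `cosine_similarity()` itself; it recomputes the same quantity (dot products of unit-length rows) with plain NumPy on three toy vectors, then confirms how `flatten()` orders the entries:

```python
import numpy as np

# Three toy row vectors standing in for the N x 256 embedding matrix
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])

# Cosine similarity: normalize each row to unit length, then take all dot products
unit_rows = X / np.linalg.norm(X, axis=1, keepdims=True)
sims = unit_rows @ unit_rows.T  # shape (3, 3)

flat = sims.flatten()  # row-major order: (0,0), (0,1), (0,2), (1,0), ...
n = X.shape[0]

# Entry k of the flattened vector corresponds to the pair (k // n, k % n)
k = 5
assert np.isclose(flat[k], sims[k // n, k % n])
print(np.round(sims, 3))
```

Because the flattened vector runs through row 0 first, then row 1, and so on, it lines up with a pair table whose first word varies slowest.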

In [6]:
#| label: hw4-4-5-response
# Your code here
mat1 = top_word_df.iloc[:, :-1].values

pairwise_sims = cosine_similarity(mat1)

flattened_sims = pairwise_sims.flatten()

word_pair_df['cosine_sim'] = flattened_sims
word_pair_df.head()

|  | w1 | w2 | cosine_sim |
|---|---|---|---|
| 0 | her | her | 1.0 |
| 1 | her | she | 0.53796 |
| 2 | her | we | 0.472494 |
| 3 | her | ms | 0.453016 |
| 4 | her | him | 0.483421 |


## Step 6: Compute Pairwise WordNet Similarities

In this step, we'll use `nltk`'s [programmatic interface](https://www.nltk.org/howto/wordnet.html) for [WordNet](https://wordnet.princeton.edu/), via the `wn` alias imported in Step 1 above, to obtain **human judgements** of the semantic similarities for each pair of words in `word_pair_df`.

As a helper, we've provided a `get_wordnet_sim()` function for you at the beginning of the following code cell, which takes in strings `w1` and `w2`, uses the `.synsets()` function to obtain the Synset for the **most commonly-used form** of each word, then uses `wn.path_similarity()` to compute a similarity score between the two Synsets.

**Note that this function returns the value `pd.NA`** when it is given word pairs where one or both words are not found in WordNet. **This will indeed be the case** for a few of the words (WordNet does *not* guarantee coverage of every word in the English language, since it is just a collaboration among linguists to cover as many words as possible given their resources!), so you should **take this into account** and **drop the rows containing `pd.NA` values** when computing the correlation coefficient in Step 7 below!

Though you can find more details in the NLTK guide linked above, the `path_similarity()` function boils down to: two Synsets are similar if only a small number of "hops" through the human-constructed WordNet hierarchy are required to go from one to the other. Thus, for example, "dog" and "cat" will receive a high similarity score due to the small number of hops required (here, "up" and "down" refer to the **hypernyms** and **hyponyms** of each term, which you can see by opening the linked pages then clicking "More" and expanding the "Hypernyms" or "Hyponyms" links which appear):

* From [dog](https://en-word.net/lemma/dog) up to [domestic animal](https://en-word.net/lemma/domestic%20animal), then
* From [domestic animal](https://en-word.net/lemma/domestic%20animal) down to [house cat](https://en-word.net/lemma/house%20cat), and finally
* From [house cat](https://en-word.net/lemma/house%20cat) up to [cat](https://en-word.net/lemma/cat)
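
The hop-counting idea can be sketched without WordNet at all. The example below is a toy illustration, assuming the score takes the form $1 / (1 + \text{hops})$ along the shortest path; `toy_links` and `toy_path_similarity()` are hypothetical names, and the graph contains only the four nodes from the dog/cat example, so its scores need not match what NLTK returns for the real Synsets:

```python
from collections import deque

# Toy undirected graph containing only the hypernym/hyponym links from the
# dog -> cat example above; the real WordNet hierarchy is far larger
toy_links = {
    "dog": ["domestic animal"],
    "domestic animal": ["dog", "house cat"],
    "house cat": ["domestic animal", "cat"],
    "cat": ["house cat"],
}

def toy_path_similarity(start, goal, links=toy_links):
    """Return 1 / (1 + hops) along the shortest path, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return 1 / (1 + hops)
        for neighbor in links.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None

print(toy_path_similarity("dog", "dog"))  # 1.0
print(toy_path_similarity("dog", "cat"))  # 0.25
```

A word compared with itself needs zero hops and scores `1.0`, while each additional hop pushes the score closer to zero.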

Given this approach, and the `get_wordnet_sim()` function which computes its numeric value, construct a new column in `word_pair_df` (immediately after the `cosine_sim` column) named `"wn_sim"`, containing the **WordNet path similarity** for each pair of words.

Once this column has been added, use `word_pair_df.head()` as the last line in the code cell, to verify that the path similarity scores are reasonable, as you did for the Cosine similarity score in the previous step (for example, like with the Cosine similarity scores, the WordNet path similarity between a word and itself should be `1.0`).

In [7]:
#| label: hw4-4-6-response
import nltk
nltk.download('wordnet')
def get_wordnet_sim(w1, w2):
    w1_synsets = wn.synsets(w1)
    if len(w1_synsets) == 0:
        return pd.NA
    w1_synset = w1_synsets[0]
    w2_synsets = wn.synsets(w2)
    if len(w2_synsets) == 0:
        return pd.NA
    w2_synset = w2_synsets[0]
    try:
        wn_sim = wn.path_similarity(w1_synset, w2_synset)
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        return pd.NA
    return wn_sim

# Your code here
word_pair_df['wn_sim'] = word_pair_df.apply(lambda row: get_wordnet_sim(row['w1'], row['w2']), axis=1)
word_pair_df.head()


[nltk_data] Downloading package wordnet to /Users/zp/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


|  | w1 | w2 | cosine_sim | wn_sim |
|---|---|---|---|---|
| 0 | her | her | 1.0 | NA |
| 1 | her | she | 0.53796 | NA |
| 2 | her | we | 0.472494 | NA |
| 3 | her | ms | 0.453016 | NA |
| 4 | her | him | 0.483421 | NA |


## Step 7: Cosine Distance-Path Distance Correlation

You've reached the final step! In the following code cell, first **handle any `pd.NA` values that were returned by `get_wordnet_sim()`**, as described in the Step 6 instructions.

Once you have the subset of `word_pair_df` containing all pairs without a `pd.NA` value for `wn_sim`, use the `.corr()` function from Pandas to compute the **Pearson correlation coefficient** between the two similarities.
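
A minimal sketch of this drop-then-correlate pattern, on a made-up five-row table (the numbers are illustrative and are not taken from the HW4 data):

```python
import pandas as pd

# Toy pair table; pd.NA marks a pair where a word was missing from WordNet
toy_df = pd.DataFrame({
    "cosine_sim": [1.0, 0.54, 0.47, 0.45, 0.48],
    "wn_sim": [1.0, 0.25, pd.NA, 0.10, 0.20],
})

# Keep only rows with a WordNet score, then cast to float: a column holding
# pd.NA has object dtype, which numeric methods may not handle cleanly
valid = toy_df.dropna(subset=["wn_sim"]).astype({"wn_sim": float})

# Series.corr defaults to the Pearson correlation coefficient
r = valid["cosine_sim"].corr(valid["wn_sim"])
print(len(valid), round(r, 3))
```

Checking `len(valid)` before and after the drop is a quick way to see how many pairs WordNet could not score.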

You should find a fairly large value, above 0.7, thus illustrating that **unsupervised algorithms like the word embedding algorithm used by Vertex AI can "automatically" construct semantic vector spaces which align to a great extent with human judgements of semantic similarity, which were painstakingly constructed over many years by the professional linguists behind WordNet!**

In [8]:
#| label: hw4-4-7-response
# Your code here
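# Keep only the pairs where both words were found in WordNet
valid_pair_df = word_pair_df.dropna(subset=['wn_sim'])
# Cast wn_sim to float (pd.NA forces object dtype), then compute the Pearson correlation
correlation = valid_pair_df['cosine_sim'].corr(valid_pair_df['wn_sim'].astype(float))
correlation, valid_pair_df.head(10), valid_pair_df.shape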

(0.6527559177883395,
      w1          w2  cosine_sim    wn_sim
 153  ms          ms    1.000000       1.0
 155  ms      season    0.411227  0.083333
 156  ms          pm    0.478671    0.0625
 157  ms        game    0.531020  0.076923
 158  ms      people    0.569282       0.1
 159  ms  government    0.545909  0.076923
 161  ms        some    0.497096  0.090909
 163  ms       other    0.530623  0.090909
 164  ms       music    0.571006  0.090909
 165  ms     million    0.382197  0.071429,
 (1521, 4))

You made it! Congratulations :)

Please submit your completed notebooks by pushing them to the GitHub repository created by GitHub Classroom.