## Import Libraries

This notebook uses NumPy for numerical operations, PyTorch for defining the models, and Gensim for loading pre-trained GloVe word embeddings. SciPy supplies the cosine distance and the Spearman rank correlation used in the evaluation.


In [10]:
import numpy as np  # Fundamental package for scientific computing with Python
import torch  # PyTorch, a deep learning framework
import torch.nn as nn  # Submodule of PyTorch specifically for neural network layers
from torch.nn import functional as F  # Functional interface containing typical operations used for building neural networks
import json  # Module for JSON file parsing

# Gensim related imports for handling word vectors
from gensim.test.utils import datapath  # Utility function to handle file paths within Gensim
from gensim.models import KeyedVectors  # Class to handle word vectors in Gensim
from gensim.scripts.glove2word2vec import glove2word2vec  # Script to convert GloVe format to Word2Vec format

from scipy.spatial.distance import cosine  # Cosine distance (1 - cosine similarity)
from scipy.stats import spearmanr  # Spearman's rank correlation coefficient

## 1. Compare Training Loss & Training Time

This section compares three word embedding models (Skip-gram, Skip-gram with Negative Sampling, and GloVe) on two metrics: training loss and training time.
<span style="color:magenta;">*Data for this analysis is sourced from the individual notebooks dedicated to each model.*</span>



The models were trained on a subset of the Reuters corpus from NLTK: 500 passages out of a total of 54,716, and 2,677 tokens out of 1,720,917. Training performance was assessed by the average training loss and the total training time.


#### Training Loss

| Model                            | Average Training Loss |
|----------------------------------|-----------------------|
| Skip-gram                        | 8.133966              |
| Skip-gram with Negative Sampling | 1.977957              |
| GloVe Scratch                    | 0.724803              |

- **Training Loss:**
  - Skip-gram: average training loss of 8.133966, the highest of the three.
  - Skip-gram with Negative Sampling: a lower average training loss of 1.977957.
  - GloVe Scratch: the lowest average training loss, 0.724803.

Each model minimizes a different objective (full softmax, negative sampling, and weighted squared error, respectively), so these values are best read as a record of each training run rather than as a direct ranking of embedding quality.

#### Training Time

| Model                            | Total Training Time |
|----------------------------------|---------------------|
| Skip-gram                        | 18m 4s              |
| Skip-gram with Negative Sampling | 17m 8s              |
| GloVe Scratch                    | 1m 54s              |


- **Training Time:**
  - Skip-gram: 18 minutes and 4 seconds.
  - Skip-gram with Negative Sampling: 17 minutes and 8 seconds, slightly faster than Skip-gram.
  - GloVe Scratch: 1 minute and 54 seconds, roughly nine times faster than either Skip-gram variant.

This comparison sheds light on the trade-offs between models in terms of learning efficiency and computational needs.
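
As a quick check on the training-time figures, the short sketch below converts the reported times to seconds and expresses each as a multiple of the fastest run. The helper name `to_seconds` is introduced here only for illustration; the times themselves come from the table above.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair to total seconds."""
    return minutes * 60 + seconds

# Training times as reported in the table above
training_times = {
    "Skip-gram": to_seconds(18, 4),
    "Skip-gram with Negative Sampling": to_seconds(17, 8),
    "GloVe Scratch": to_seconds(1, 54),
}

fastest = min(training_times.values())
for name, secs in training_times.items():
    # Ratio of this model's training time to the fastest model's time
    print(f"{name}: {secs} s ({secs / fastest:.2f}x the fastest run)")
```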


## 2. Load and Process the Analogy Data

The `load_data` function loads and parses the word analogy dataset used to assess the word embeddings.

It reads analogies from the given file and returns two lists: one collected under the `capital-common-countries` header and one collected under the `gram7-past-tense` header.

In [11]:
def load_data(file_path):
    with open(file_path, 'r') as file:
        lines = file.readlines()

    # Sections for different types of analogies
    capital_common_countries = []
    gram7_past_tense = []

    # Current section
    current_section = None

    for line in lines[1:]:
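        # Identify the section. Only the three headers below are recognized; any other
        # ': name' header leaves current_section unchanged, so the analogies that
        # follow it keep being appended to the current list.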
        if ': capital-common-countries' in line:
            current_section = capital_common_countries
            continue
        elif ': gram7-past-tense' in line:
            current_section = gram7_past_tense
            continue
        elif ': gram8-plural' in line:
            break

        # Split the line into words and add to the current section
        if current_section is not None:
            words = line.strip().split()
            if len(words) == 4:  # Ensure there are exactly 4 words
                current_section.append(tuple(words))

    return capital_common_countries, gram7_past_tense

# Load the data
file_path = 'analogy-dataset/word-test.v1.txt'
capital_common_countries, gram7_past_tense = load_data(file_path)


In [12]:
# Check the first and last analogies collected for capital-common-countries
capital_common_countries[0], capital_common_countries[-1]

(('Athens', 'Greece', 'Baghdad', 'Iraq'),
 ('Ukraine', 'Ukrainian', 'Switzerland', 'Swiss'))

In [13]:
# Check the first and last analogies collected for gram7-past-tense
gram7_past_tense[0], gram7_past_tense[-1]

(('dancing', 'danced', 'decreasing', 'decreased'),
 ('writing', 'wrote', 'walking', 'walked'))
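
The last tuple printed for `capital_common_countries` above, `('Ukraine', 'Ukrainian', 'Switzerland', 'Swiss')`, is a nationality-adjective analogy rather than a capital/country pair, which suggests that lines from later sections were appended to the first list. A minimal sketch of a stricter parser follows, assuming every section in the file starts with a `: name` header line. The function name `load_sections` and the sample lines are hypothetical and used only for illustration.

```python
from collections import defaultdict

def load_sections(lines):
    """Group 4-word analogy lines by the ': name' header that precedes them."""
    sections = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(':'):
            # Every header starts a new section, whether or not it is needed later
            current = line[1:].strip()
            continue
        words = line.split()
        if current is not None and len(words) == 4:
            sections[current].append(tuple(words))
    return dict(sections)

# Hypothetical sample in the same layout as the analogy file
sample = """: capital-common-countries
Athens Greece Baghdad Iraq
: capital-world
Abuja Nigeria Accra Ghana
: gram7-past-tense
dancing danced decreasing decreased
""".splitlines()

sections = load_sections(sample)
print(sections['capital-common-countries'])
```

With a parser of this shape, the two evaluation sets would be read as `sections['capital-common-countries']` and `sections['gram7-past-tense']`; the accuracy figures reported later in this notebook were produced with the original `load_data`.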


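## 3. Syntactic & Semantic Accuracy Calculation

The `calculate_accuracy` function measures the accuracy of a Skip-gram or Skip-gram with Negative Sampling model on word analogies. For each analogy, it predicts the fourth word as the vocabulary word whose embedding is closest, by cosine similarity, to `word2 - word1 + word1_target`, and counts the prediction as correct when it matches the actual fourth word. Analogies that contain an out-of-vocabulary word are skipped. The resulting accuracy serves as a measure of how well the embeddings capture semantic and syntactic relations.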

In [14]:
def calculate_accuracy(model, dataset, word2index, index2word):
    correct = 0
    total = 0

    for word1, word1_target, word2, word2_target in dataset:
        # Check if all words are in the model's vocabulary, skip the analogy if any word is OOV
        if all(word in word2index for word in [word1, word1_target, word2, word2_target]):
            total += 1

            # Get the indices for each word
            word1_idx = word2index[word1]
            word1_target_idx = word2index[word1_target]
            word2_idx = word2index[word2]

            # Get the embeddings for each word using the formula (center_embedding + outside_embedding) / 2
            word1_emb = (model.embedding_center(torch.tensor([word1_idx], dtype=torch.long)).squeeze(0) +
                         model.embedding_outside(torch.tensor([word1_idx], dtype=torch.long)).squeeze(0)) / 2
            word1_target_emb = (model.embedding_center(torch.tensor([word1_target_idx], dtype=torch.long)).squeeze(0) +
                                model.embedding_outside(torch.tensor([word1_target_idx], dtype=torch.long)).squeeze(0)) / 2
            word2_emb = (model.embedding_center(torch.tensor([word2_idx], dtype=torch.long)).squeeze(0) +
                         model.embedding_outside(torch.tensor([word2_idx], dtype=torch.long)).squeeze(0)) / 2

            # Compute the expected embedding for the target word
            expected_emb = word2_emb - word1_emb + word1_target_emb

            # Calculate similarities between the expected embedding and all word embeddings in the vocabulary
            similarities = F.cosine_similarity( (model.embedding_center.weight + model.embedding_outside.weight) / 2, expected_emb.unsqueeze(0), dim=1)

            # Find the index of the maximum similarity (excluding the original words in the analogy)
            indices_to_exclude = [word2index[word] for word in [word1, word1_target, word2] if word in word2index]
            for idx in indices_to_exclude:
                similarities[idx] = -1

            max_similarity_idx = torch.argmax(similarities).item()

            # print(index2word[str(max_similarity_idx)], word2_target)



            # Check if the word with the maximum similarity is the target word
            if index2word[str(max_similarity_idx)] == word2_target:
                correct += 1

    accuracy = correct / total if total > 0 else 0
    return accuracy
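
To make the vector arithmetic concrete, here is a small self-contained sketch of the same analogy procedure on hand-made 2-dimensional vectors. The embeddings, the helper names `cosine_similarity` and `solve_analogy`, and the extra word `banana` are all hypothetical; only the formula `word2 - word1 + word1_target` and the exclusion of the three input words mirror the function above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, chosen so that capital = country + [0, 1]
toy_embeddings = {
    "Greece": [1.0, 0.0],
    "Athens": [1.0, 1.0],
    "Iraq": [2.0, 0.0],
    "Baghdad": [2.0, 1.0],
    "banana": [-1.0, 0.5],
}

def solve_analogy(word1, word1_target, word2, embeddings):
    """Predict the fourth word of 'word1 : word1_target :: word2 : ?'."""
    expected = [
        c - a + b
        for a, b, c in zip(embeddings[word1], embeddings[word1_target], embeddings[word2])
    ]
    # Score every vocabulary word except the three words given in the analogy
    candidates = {
        word: cosine_similarity(vec, expected)
        for word, vec in embeddings.items()
        if word not in (word1, word1_target, word2)
    }
    return max(candidates, key=candidates.get)

predicted = solve_analogy("Athens", "Greece", "Baghdad", toy_embeddings)
print(predicted)
```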

The `calculate_accuracy_GloVe` function measures the accuracy of the GloVe Scratch model on word analogies. It follows the same procedure as `calculate_accuracy`; the only difference is that the embedding layers are named `center_embedding` and `outside_embedding` in the `Glove` class.

In [15]:
def calculate_accuracy_GloVe(model, dataset, word2index, index2word):
    correct = 0
    total = 0

    for word1, word1_target, word2, word2_target in dataset:
        # Check if all words are in the model's vocabulary, skip the analogy if any word is OOV
        if all(word in word2index for word in [word1, word1_target, word2, word2_target]):
            total += 1

            # Get the indices for each word
            word1_idx = word2index[word1]
            word1_target_idx = word2index[word1_target]
            word2_idx = word2index[word2]

            # Get the embeddings for each word
            # For GloVe, use the average of the center and outside embeddings of each word
            word1_emb = (model.center_embedding(torch.tensor([word1_idx], dtype=torch.long)).squeeze(0) + \
                        model.outside_embedding(torch.tensor([word1_idx], dtype=torch.long)).squeeze(0))/2
            
            word1_target_emb = (model.center_embedding(torch.tensor([word1_target_idx], dtype=torch.long)).squeeze(0) + \
                               model.outside_embedding(torch.tensor([word1_target_idx], dtype=torch.long)).squeeze(0))/2
            
            word2_emb = (model.center_embedding(torch.tensor([word2_idx], dtype=torch.long)).squeeze(0) + \
                        model.outside_embedding(torch.tensor([word2_idx], dtype=torch.long)).squeeze(0))/2

            # Compute the expected embedding for the target word
            expected_emb = word2_emb - word1_emb + word1_target_emb

            # Calculate similarities between the expected embedding and all word embeddings in the vocabulary
            # Use the same center/outside average for every word in the vocabulary
            all_embeddings = model.center_embedding.weight + model.outside_embedding.weight
            similarities = F.cosine_similarity(all_embeddings/2, expected_emb.unsqueeze(0), dim=1)

            # Find the index of the maximum similarity (excluding the original words in the analogy)
            indices_to_exclude = [word2index[word] for word in [word1, word1_target, word2] if word in word2index]
            for idx in indices_to_exclude:
                similarities[idx] = -1

            max_similarity_idx = torch.argmax(similarities).item()
            
            # print(index2word[str(max_similarity_idx)], word2_target)

            # Check if the word with the maximum similarity is the target word
            if index2word[str(max_similarity_idx)] == word2_target:
                correct += 1

    accuracy = correct / total if total > 0 else 0
    return accuracy

The `calculate_accuracy_GloVe_gensim` function evaluates a GloVe model loaded via Gensim on a word analogy dataset. It follows the same procedure as the functions above, but reads the vectors and vocabulary from the Gensim model's `vectors`, `key_to_index`, and `index_to_key` attributes.


In [16]:
def calculate_accuracy_GloVe_gensim(model, dataset):
    correct = 0
    total = 0

    for word1, word1_target, word2, word2_target in dataset:
        # Check if all words are in the model's vocabulary, skip the analogy if any word is OOV
        if all(word in model.key_to_index for word in [word1, word1_target, word2, word2_target]):
            total += 1

            # Get the embeddings for each word
            word1_emb = model[word1]
            word1_target_emb = model[word1_target]
            word2_emb = model[word2]

            # Compute the expected embedding for the target word
            expected_emb = word2_emb - word1_emb + word1_target_emb

            # Calculate similarities between the expected embedding and all word embeddings in the vocabulary
            all_embeddings = model.vectors
            similarities = np.dot(all_embeddings, expected_emb) / (np.linalg.norm(all_embeddings, axis=1) * np.linalg.norm(expected_emb))

            # Exclude original words from consideration
            for word in [word1, word1_target, word2]:
                if word in model.key_to_index:
                    similarities[model.key_to_index[word]] = -1

            max_similarity_idx = np.argmax(similarities)

            # Check if the word with the maximum similarity is the target word
            if model.index_to_key[max_similarity_idx] == word2_target:
                correct += 1

    accuracy = correct / total if total > 0 else 0
    return accuracy

### 3.1. Skip-gram (word2vec) Accuracy Calculation

- This section defines the `Skipgram` class, a neural network model for learning word embeddings via the Skip-gram architecture.

- After defining the model, the code loads an already-trained model along with its configuration and the word-to-index and index-to-word mappings.
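
For reference, the loss returned by the `forward` method of `Skipgram` can be written as the averaged negative log of a full-softmax probability. The notation is an assumption made here for readability: $v_c$ is the center-word vector from `embedding_center`, $u_w$ is the outside vector from `embedding_outside`, $V$ is the set of indices passed as `all_vocabs`, and $B$ is the batch size.

$$
J = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\left(u_{o_i}^{\top} v_{c_i}\right)}{\sum_{w \in V} \exp\left(u_{w}^{\top} v_{c_i}\right)}
$$

The denominator sums over the whole vocabulary for every training pair, which is the cost that negative sampling is designed to avoid.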

In [17]:
# Define the Skipgram model class
class Skipgram(nn.Module):
    def __init__(self, voc_size, emb_size):
        super(Skipgram, self).__init__()
        # Embedding layers for center and outside words
        self.embedding_center = nn.Embedding(voc_size, emb_size)
        self.embedding_outside = nn.Embedding(voc_size, emb_size)
    
    def forward(self, center, outside, all_vocabs):
        # Obtain embeddings for center, outside, and all vocabulary words
        center_embedding = self.embedding_center(center)
        outside_embedding = self.embedding_outside(outside)
        all_vocabs_embedding = self.embedding_outside(all_vocabs)
        
        # Calculate top and lower terms for loss computation
        top_term = torch.exp(outside_embedding.bmm(center_embedding.transpose(1, 2)).squeeze(2))
        lower_term = all_vocabs_embedding.bmm(center_embedding.transpose(1, 2)).squeeze(2)
        lower_term_sum = torch.sum(torch.exp(lower_term), 1)
        
        # Calculate and return loss
        loss = -torch.mean(torch.log(top_term / lower_term_sum))
        return loss

word2index_path = './config_model_files/word2index.json'  
index2word_path = './config_model_files/index2word.json' 
model_path = './config_model_files/word2vec_model.pth'
config_path = './config_model_files/word2vec_config.json'        
# Load model configuration
with open(config_path, 'r') as config_file:
    config_skipgram = json.load(config_file)

# Initialize model with loaded configuration
loaded_model_Skipgram = Skipgram(voc_size=config_skipgram['voc_size'], emb_size=config_skipgram['emb_size'])

# Load model state
loaded_model_Skipgram.load_state_dict(torch.load(model_path, map_location=torch.device('cpu')))
loaded_model_Skipgram.eval()  # Set model to evaluation mode

with open(word2index_path, 'r') as file:
    word2index_skipgram = json.load(file)

with open(index2word_path, 'r') as file:
    index2word_skipgram = json.load(file)

In [18]:
# Calculate the semantic accuracy on the 'capital-common-countries' analogy dataset
semantic_accuracy_Skipgram = calculate_accuracy(loaded_model_Skipgram, capital_common_countries, word2index_skipgram, index2word_skipgram)

# Calculate the syntactic accuracy on the 'gram7-past-tense' analogy dataset
syntactic_accuracy_Skipgram = calculate_accuracy(loaded_model_Skipgram, gram7_past_tense, word2index_skipgram, index2word_skipgram)

# Print the semantic accuracy as a percentage
print(f"Skipgram Semantic Accuracy: {semantic_accuracy_Skipgram * 100:.2f}%")

# Print the syntactic accuracy as a percentage
print(f"Skipgram Syntactic Accuracy: {syntactic_accuracy_Skipgram * 100:.2f}%")

Skipgram Semantic Accuracy: 0.00%
Skipgram Syntactic Accuracy: 0.00%


### 3.2. Skip-gram with Negative Sampling Accuracy Calculation

- In this section, we define the `SkipgramNeg` class, an implementation of the Skip-gram model with Negative Sampling for more efficient training. After defining the model, the code loads an already-trained model, including its configuration and the word-to-index and index-to-word mappings.
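
For reference, the loss computed by the `forward` method of `SkipgramNeg` can be written as follows. The notation is an assumption made here: $v_c$ is the center vector, $u_o$ the outside vector, $u_{n_k}$ the $k$-th of $K$ negative-sample vectors, $\sigma$ the sigmoid function, and $B$ the batch size.

$$
J = -\frac{1}{B} \sum_{i=1}^{B} \left[ \log \sigma\left(u_{o_i}^{\top} v_{c_i}\right) + \log \sigma\left(-\sum_{k=1}^{K} u_{n_{ik}}^{\top} v_{c_i}\right) \right]
$$

Note that this form sums the negative scores before applying the log-sigmoid. The negative-sampling objective as usually stated applies the log-sigmoid to each negative sample and then sums:

$$
\log \sigma\left(u_{o}^{\top} v_{c}\right) + \sum_{k=1}^{K} \log \sigma\left(-u_{n_k}^{\top} v_{c}\right)
$$

The two expressions coincide only when $K = 1$.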

In [19]:
class SkipgramNeg(nn.Module):
    
    def __init__(self, voc_size, emb_size):
        super(SkipgramNeg, self).__init__()
        self.embedding_center  = nn.Embedding(voc_size, emb_size)
        self.embedding_outside = nn.Embedding(voc_size, emb_size)
        self.logsigmoid        = nn.LogSigmoid()
    
    def forward(self, center, outside, negative):
        #center, outside:  (bs, 1)
        #negative       :  (bs, k)
        
        center_embed   = self.embedding_center(center) #(bs, 1, emb_size)
        outside_embed  = self.embedding_outside(outside) #(bs, 1, emb_size)
        negative_embed = self.embedding_outside(negative) #(bs, k, emb_size)
        
        uovc           = outside_embed.bmm(center_embed.transpose(1, 2)).squeeze(2) #(bs, 1)
        ukvc           = -negative_embed.bmm(center_embed.transpose(1, 2)).squeeze(2) #(bs, k)
        ukvc_sum       = torch.sum(ukvc, 1).reshape(-1, 1) #(bs, 1)
        
        loss           = self.logsigmoid(uovc) + self.logsigmoid(ukvc_sum)
        
        return -torch.mean(loss)


word2index_path = './config_model_files/word2index_neg_sam.json'  
index2word_path = './config_model_files/index2word_neg_sam.json' 
model_path = './config_model_files/word2vec_model_neg_sam.pth'
config_path = './config_model_files/word2vec_config_neg_sam.json'

with open(word2index_path, 'r') as file:
    word2index_SkipgramNeg = json.load(file)  # Load the word2index dictionary from the JSON file

with open(index2word_path, 'r') as file:
    index2word_SkipgramNeg = json.load(file)

# Load the model's configuration from a JSON file
with open(config_path, 'r') as config_file:
    config_SkipgramNeg = json.load(config_file)

# Retrieve the configuration values
voc_size = config_SkipgramNeg['voc_size']  # Vocabulary size
emb_size = config_SkipgramNeg['emb_size']  # Embedding size

# Initialize a SkipgramNeg model with the loaded configuration
loaded_model_SkipgramNeg = SkipgramNeg(voc_size, emb_size)

# Load the state dictionary (model parameters) into the initialized model;
# map_location matches the Skip-gram cell so the checkpoint also loads on CPU-only machines
loaded_model_SkipgramNeg.load_state_dict(torch.load(model_path, map_location=torch.device('cpu')))

# Set the model to evaluation mode (useful for inference)
loaded_model_SkipgramNeg.eval()

# Confirm successful model loading
print("Model loaded successfully")

Model loaded successfully


In [20]:
# Calculate semantic accuracy on 'capital-common-countries' analogies
semantic_accuracy_Skipgram_neg = calculate_accuracy(loaded_model_SkipgramNeg, capital_common_countries, word2index_SkipgramNeg, index2word_SkipgramNeg)
# Calculate syntactic accuracy on 'gram7-past-tense' analogies
syntactic_accuracy_Skipgram_neg = calculate_accuracy(loaded_model_SkipgramNeg, gram7_past_tense, word2index_SkipgramNeg, index2word_SkipgramNeg)

# Print the results
print(f"Skipgram-neg Semantic Accuracy: {semantic_accuracy_Skipgram_neg * 100:.2f}%")
print(f"Skipgram-neg Syntactic Accuracy: {syntactic_accuracy_Skipgram_neg * 100:.2f}%")

Skipgram-neg Semantic Accuracy: 0.00%
Skipgram-neg Syntactic Accuracy: 0.00%


### 3.3. GloVe (Scratch) Accuracy Calculation

- This section defines the `Glove` class, which implements the GloVe model for learning word embeddings by capturing global word co-occurrence statistics.

- After defining the model, the code loads an already-trained GloVe model, including its configuration and the word-to-index and index-to-word mappings.
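
For reference, the loss returned by the `forward` method of `Glove` corresponds to the weighted least-squares objective below. The notation is an assumption made here: $w_i$ and $\tilde{w}_j$ are the center and outside vectors, $b_i$ and $\tilde{b}_j$ their biases, $X_{ij}$ the co-occurrence count (so `coocs` holds $\log X_{ij}$), and $f(X_{ij})$ the value passed as `weighting`.

$$
J = \sum_{(i,j)} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

In the GloVe paper the weighting function is $f(x) = (x / x_{\max})^{\alpha}$ for $x < x_{\max}$ and $1$ otherwise, with $x_{\max} = 100$ and $\alpha = 3/4$; whether the training notebook used these exact values is not shown in this notebook.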


In [21]:
class Glove(nn.Module):
    
    def __init__(self, voc_size, emb_size):
        super(Glove, self).__init__()
        # Embeddings for center words
        self.center_embedding  = nn.Embedding(voc_size, emb_size)
        # Embeddings for context (outside) words
        self.outside_embedding = nn.Embedding(voc_size, emb_size)
        
        # Bias terms for center words
        self.center_bias       = nn.Embedding(voc_size, 1) 
        # Bias terms for context (outside) words
        self.outside_bias      = nn.Embedding(voc_size, 1)
    
    def forward(self, center, outside, coocs, weighting):
        # Retrieve the embeddings for the center words
        center_embeds  = self.center_embedding(center)  # (batch_size, 1, emb_size)
        # Retrieve the embeddings for the outside words
        outside_embeds = self.outside_embedding(outside)  # (batch_size, 1, emb_size)
        
        # Retrieve and squeeze the bias for the center words
        center_bias    = self.center_bias(center).squeeze(1)
        # Retrieve and squeeze the bias for the outside words
        target_bias    = self.outside_bias(outside).squeeze(1)
        
        # Compute the dot product of center and outside word embeddings
        inner_product  = outside_embeds.bmm(center_embeds.transpose(1, 2)).squeeze(2)
        
        # Compute the GloVe loss as the weighted squared error between
        # the log co-occurrence counts and the model predictions (dot product + biases)
        loss = weighting * torch.pow(inner_product + center_bias + target_bias - coocs, 2)
        
        # Return the sum of the losses for the batch
        return torch.sum(loss)
    
word2index_path = './config_model_files/word2index_GloVe_Scratch.json'  
index2word_path = './config_model_files/index2word_GloVe_Scratch.json' 
model_path = './config_model_files/word2vec_model_GloVe_Scratch.pth'
config_path = './config_model_files/word2vec_config_GloVe_Scratch.json'
with open(word2index_path, 'r') as file:
    word2index_Glove = json.load(file)  # Load the word2index dictionary from the JSON file

with open(index2word_path, 'r') as file:
    index2word_Glove = json.load(file)  # Load the index2word dictionary from the JSON file
# Load the model's configuration from a JSON file
with open(config_path, 'r') as config_file:
    config_Glove = json.load(config_file)

# Retrieve the configuration values
voc_size = config_Glove['voc_size']  # Vocabulary size
emb_size = config_Glove['emb_size']  # Embedding size

# Initialize a Glove model with the loaded configuration
loaded_model_Glove = Glove(voc_size, emb_size)

# Load the state dictionary (model parameters) into the initialized model;
# map_location matches the Skip-gram cell so the checkpoint also loads on CPU-only machines
loaded_model_Glove.load_state_dict(torch.load(model_path, map_location=torch.device('cpu')))

# Set the model to evaluation mode (useful for inference)
loaded_model_Glove.eval()

# Confirm successful model loading
print("Model loaded successfully")

Model loaded successfully


In [22]:
semantic_accuracy_GloVe = calculate_accuracy_GloVe(loaded_model_Glove, capital_common_countries, word2index_Glove, index2word_Glove)
syntactic_accuracy_GloVe = calculate_accuracy_GloVe(loaded_model_Glove, gram7_past_tense, word2index_Glove, index2word_Glove)
print(f"GloVe Semantic Accuracy: {semantic_accuracy_GloVe * 100:.2f}%")
print(f"GloVe Syntactic Accuracy: {syntactic_accuracy_GloVe * 100:.2f}%")


GloVe Semantic Accuracy: 0.00%
GloVe Syntactic Accuracy: 0.00%


### 3.4. GloVe with Gensim pre-trained model's Accuracy Calculation

In this section, Gensim's `glove2word2vec` script converts the pre-trained GloVe embeddings to the Word2Vec text format, which `KeyedVectors.load_word2vec_format` can then load for evaluation.

In [23]:
# Path to the GloVe file (Replace this with GloVe file path)
glove_file_path = 'glove-dataset/glove.6B.100d.txt'   #  download from https://github.com/allenai/spv2/blob/master/model/glove.6B.100d.txt.gz

# File path for the output Word2Vec format file
word2vec_output_file = glove_file_path + '.word2vec'

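# Convert the GloVe file format to the Word2Vec file format
# This creates a new file in Word2Vec format at the specified path
# (Gensim 4.x marks glove2word2vec as deprecated; load_word2vec_format(..., binary=False,
# no_header=True) can read the GloVe text file directly)
glove2word2vec(glove_file_path, word2vec_output_file)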

# Load the model from the converted Word2Vec format file
# The binary flag is set to False because the Word2Vec format is text-based, not binary
loaded_model_Glove_Gen = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)



In [220]:
# Calculate semantic accuracy on the 'capital-common-countries' dataset
semantic_accuracy_GloVe_Gen = calculate_accuracy_GloVe_gensim(loaded_model_Glove_Gen, capital_common_countries)
# Calculate syntactic accuracy on the 'gram7-past-tense' dataset
syntactic_accuracy_GloVe_Gen = calculate_accuracy_GloVe_gensim(loaded_model_Glove_Gen, gram7_past_tense)

# Print the semantic and syntactic accuracies
print(f"GloVe gensim Semantic Accuracy: {semantic_accuracy_GloVe_Gen * 100:.2f}%")
print(f"GloVe gensim Syntactic Accuracy: {syntactic_accuracy_GloVe_Gen * 100:.2f}%")


GloVe gensim Semantic Accuracy: 54.97%
GloVe gensim Syntactic Accuracy: 53.40%


### Model Comparison Summary

| Model                            | Window Size | Training Loss (taken from Training Notebooks) | Syntactic Accuracy | Semantic Accuracy |
|----------------------------------|-------------|-----------------------------------------------|--------------------|-------------------|
| Skip-gram                        | 2           | 8.133966                                      | 0.00%              | 0.00%             |
| Skip-gram with Negative Sampling | 2           | 1.977957                                      | 0.00%              | 0.00%             |
| GloVe (Custom)                   | 2           | 0.724803                                      | 0.00%              | 0.00%             |
| GloVe (Pre-trained Gensim)       | N/A         | N/A                                           | 53.40%             | 54.97%            |


Comparing the models on the `capital-common-countries` (semantic) and `gram7-past-tense` (syntactic) sections of the [analogy dataset](https://www.fit.vutbr.cz/~imikolov/rnnlm/word-test.v1.txt) shows:

- **Custom Models' Zero Accuracy:**
Skip-gram, Skip-gram with Negative Sampling, and Scratch GloVe all scored 0.00%, likely due to the **out-of-vocabulary issue**: the models were trained on only 500 Reuters passages, analogies containing any out-of-vocabulary word are skipped, and the accuracy functions return 0 when no analogy can be evaluated.

- **Strong Performance of Pre-trained GloVe:**
Over 50% accuracy in both tasks, highlighting the importance of extensive training on large, diverse datasets.
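
One way to check the out-of-vocabulary explanation is to count how many analogies can be evaluated at all, since a 0.00% score with zero evaluable analogies means something different from 0.00% with many. The sketch below is an illustration under assumptions: the helper name `analogy_coverage` and the toy data are hypothetical.

```python
def analogy_coverage(dataset, vocabulary):
    """Return (evaluable, total): analogies whose four words are all in the vocabulary."""
    evaluable = sum(
        1 for analogy in dataset
        if all(word in vocabulary for word in analogy)
    )
    return evaluable, len(dataset)

# Hypothetical toy data for illustration only
toy_dataset = [
    ("Athens", "Greece", "Baghdad", "Iraq"),
    ("dancing", "danced", "walking", "walked"),
]
toy_vocabulary = {"dancing": 0, "danced": 1, "walking": 2, "walked": 3, "Greece": 4}

evaluable, total = analogy_coverage(toy_dataset, toy_vocabulary)
print(f"{evaluable} of {total} analogies can be evaluated")
```

With the objects in this notebook, the same check could be run as `analogy_coverage(capital_common_countries, word2index_skipgram)`.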

## 4. Correlation with the WordSim353 Dataset

To measure how well each model's similarity scores agree with human judgments, we use the WordSim353 gold standard files. `wordsim_relatedness_goldstandard.txt` and `wordsim_similarity_goldstandard.txt` each contain word pairs along with human-assigned scores; this notebook uses the relatedness file.

We'll go through the following steps:

- Load the dataset.
- Compute the cosine similarity for each word pair using our model.
- Calculate Spearman's rank correlation between our model's similarities and the human-assigned scores.
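
To show what the last step measures, here is a minimal pure-Python sketch of Spearman's rank correlation: both score lists are converted to ranks (ties share the average rank) and the Pearson correlation of the ranks is returned. The notebook itself uses `scipy.stats.spearmanr`; the function names below and the model scores are hypothetical, while the four human scores are taken from the first rows of the relatedness file.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

human = [7.62, 8.46, 8.11, 5.63]   # human-assigned scores
model = [0.40, 0.90, 0.35, 0.10]   # hypothetical model cosine similarities
rho = spearman(human, model)
print(f"{rho:.3f}")
```

Because only the ordering matters, the model's similarities do not need to be on the same scale as the human scores.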

#### Step 1: Load the Dataset
Let's load one of the gold standard files, parse it, and extract word pairs and human-assigned similarity scores.

In [24]:
import pandas as pd

# Path to the dataset file
file_path = 'wordsim353_sim_rel/wordsim_relatedness_goldstandard.txt'

# Load the dataset
# Adjust the separator and column names based on the actual format of our file
df = pd.read_csv(file_path, sep='\t', names=['word1', 'word2', 'human_score'])

# Display the first few rows of the dataframe
print(df.head())


       word1      word2  human_score
0   computer   keyboard         7.62
1  Jerusalem     Israel         8.46
2     planet     galaxy         8.11
3     canyon  landscape         7.53
4       OPEC    country         5.63


#### Step 2: Compute Model Similarities
Now, let's compute the cosine similarity for each word pair using each model's embeddings.

In [25]:

# Function for skipgram and skipgram-neg-sampling model
def compute_model_similarity(model, word_pairs, word2index):
    model_similarities = []
    for word1, word2 in word_pairs:
        # Check if both words are in the vocabulary
        if word1 in word2index and word2 in word2index:
            word1_idx = word2index[word1]
            word2_idx = word2index[word2]
            
            # Retrieve embeddings
            word1_emb = (model.embedding_center(torch.tensor([word1_idx], dtype=torch.long)).squeeze(0) +
                         model.embedding_outside(torch.tensor([word1_idx], dtype=torch.long)).squeeze(0))/2
            word2_emb = (model.embedding_center(torch.tensor([word2_idx], dtype=torch.long)).squeeze(0) +
                         model.embedding_outside(torch.tensor([word2_idx], dtype=torch.long)).squeeze(0))/2
            
            # Convert embeddings to numpy arrays after detaching them from the computation graph
            word1_emb_np = word1_emb.detach().numpy()
            word2_emb_np = word2_emb.detach().numpy()
            
            # Compute cosine similarity (use 1-cosine to convert distance to similarity)
            similarity = 1 - cosine(word1_emb_np, word2_emb_np)
            model_similarities.append(similarity)
        else:
            model_similarities.append(None)  # None if any or both words are OOV
            
    return model_similarities


In [26]:
def compute_model_similarity_GloVe_Scratch(model, word_pairs, word2index):
    model_similarities = []
    for word1, word2 in word_pairs:
        # Check if both words are in the vocabulary
        if word1 in word2index and word2 in word2index:
            word1_idx = word2index[word1]
            word2_idx = word2index[word2]
            
            # Retrieve embeddings (average of the center and outside vectors)
            word1_emb = (model.center_embedding(torch.tensor([word1_idx], dtype=torch.long)).squeeze(0) +
                         model.outside_embedding(torch.tensor([word1_idx], dtype=torch.long)).squeeze(0)) / 2

            word2_emb = (model.center_embedding(torch.tensor([word2_idx], dtype=torch.long)).squeeze(0) +
                         model.outside_embedding(torch.tensor([word2_idx], dtype=torch.long)).squeeze(0)) / 2
            
            # Convert embeddings to numpy arrays after detaching them from the computation graph
            word1_emb_np = word1_emb.detach().numpy()
            word2_emb_np = word2_emb.detach().numpy()
            
            # Compute cosine similarity (use 1-cosine to convert distance to similarity)
            similarity = 1 - cosine(word1_emb_np, word2_emb_np)
            model_similarities.append(similarity)
        else:
            model_similarities.append(None)  # None if any or both words are OOV
            
    return model_similarities

In [27]:
def compute_glove_gensim_similarity(model, word_pairs):
    model_similarities = []
    for word1, word2 in word_pairs:
        # Check if both words are in the vocabulary
        if word1 in model.key_to_index and word2 in model.key_to_index:
            # Compute cosine similarity using Gensim's method
            similarity = model.similarity(word1, word2)
            model_similarities.append(similarity)
        else:
            model_similarities.append(None)  # None if any or both words are OOV
    return model_similarities

def calculate_correlation(model, dataset):
    # Assuming dataset is a DataFrame with 'word1', 'word2', 'human_score'
    word_pairs = list(zip(dataset['word1'], dataset['word2']))
    model_similarities = compute_glove_gensim_similarity(model, word_pairs)
    
    # Filter out pairs where at least one word was OOV
    filtered_human_scores = [human_score for human_score, model_score in zip(dataset['human_score'], model_similarities) if model_score is not None]
    filtered_model_scores = [model_score for model_score in model_similarities if model_score is not None]

    # Calculate Spearman's rank correlation
    correlation, _ = spearmanr(filtered_human_scores, filtered_model_scores)
    return correlation

#### Step 3: Calculate Spearman's Rank Correlation
Finally, calculate Spearman's rank correlation between the model-computed similarities and the human-assigned scores.

In [28]:


# Uses the loaded Skip-gram model, its word2index mapping, and compute_model_similarity defined above
model_similarities_skipgram = compute_model_similarity(loaded_model_Skipgram, list(zip(df['word1'], df['word2'])), word2index_skipgram)
# Filter out pairs where at least one word was OOV
filtered_human_scores = [human_score for human_score, model_score in zip(df['human_score'], model_similarities_skipgram) if model_score is not None]
filtered_model_scores = [model_score for model_score in model_similarities_skipgram if model_score is not None]

# Calculate Spearman's rank correlation
correlation, _ = spearmanr(filtered_human_scores, filtered_model_scores)
print(f"Spearman's rank correlation for Skipgram: {correlation:.3f}")


Spearman's rank correlation for Skipgram: 0.105


In [29]:
# Skip-gram with negative sampling: assumes the model and its word2index are loaded and compute_model_similarity is defined
model_similarities_skipgram_neg = compute_model_similarity(loaded_model_SkipgramNeg, list(zip(df['word1'], df['word2'])), word2index_SkipgramNeg)
# Filter out pairs where at least one word was OOV
filtered_human_scores = [human_score for human_score, model_score in zip(df['human_score'], model_similarities_skipgram_neg) if model_score is not None]
filtered_model_scores = [model_score for model_score in model_similarities_skipgram_neg if model_score is not None]

# Calculate Spearman's rank correlation
correlation, _ = spearmanr(filtered_human_scores, filtered_model_scores)
print(f"Spearman's rank correlation for Skipgram-Neg-Sampling: {correlation:.3f}")

Spearman's rank correlation for Skipgram-Neg-Sampling: -0.118


In [30]:
# GloVe (from scratch): assumes the model and its word2index are loaded and compute_model_similarity_GloVe_Scratch is defined
model_similarities_Glove_Scratch = compute_model_similarity_GloVe_Scratch(loaded_model_Glove, list(zip(df['word1'], df['word2'])), word2index_Glove)
# Filter out pairs where at least one word was OOV
filtered_human_scores = [human_score for human_score, model_score in zip(df['human_score'], model_similarities_Glove_Scratch) if model_score is not None]
filtered_model_scores = [model_score for model_score in model_similarities_Glove_Scratch if model_score is not None]

# Calculate Spearman's rank correlation
correlation, _ = spearmanr(filtered_human_scores, filtered_model_scores)
print(f"Spearman's rank correlation for GloVe Scratch: {correlation:.3f}")

Spearman's rank correlation for GloVe Scratch: -0.224


In [31]:
correlation = calculate_correlation(loaded_model_Glove_Gen, df)
print(f"Spearman's rank correlation for GloVe Gensim: {correlation:.3f}")

Spearman's rank correlation for GloVe Gensim: 0.491


#### Spearman's Rank Correlation Summary
Spearman's rank correlation against the human similarity scores in the [wordsim353_sim_rel data](http://alfonseca.org/eng/research/wordsim353.html) varies widely across the word embedding models.

| Model                 | Spearman's Rank Correlation |
|-----------------------|-----------------------------|
| Skipgram              | 0.105                       |
| Skipgram-Neg-Sampling | -0.118                      |
| GloVe Scratch         | -0.224                      |
| GloVe Gensim          | 0.491                       |


- Skipgram: A correlation of 0.105 indicates a weak positive relationship between the model's similarity scores and human judgment.

- Skipgram-Neg-Sampling: A correlation of -0.118 indicates a weak negative relationship; the model's similarity ranking runs slightly against human judgment.

- GloVe Scratch: A correlation of -0.224 is a stronger negative relationship than that of Skipgram-Neg-Sampling, so its rankings run further against human judgment.

- GloVe Gensim: The highest correlation, 0.491, is a moderate positive relationship, making this the model most aligned with human perception among those evaluated.

The pre-trained GloVe model loaded through Gensim aligns with human judgment of word similarity far better than the three models trained from scratch.
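
One caveat when reading these numbers: each correlation is computed only over the pairs for which both words are in that model's vocabulary, so the four values may rest on different subsets of the dataset. In addition, `spearmanr` returns a p-value that the cells above discard into `_`. The sketch below is a minimal illustration of reporting pair coverage alongside each correlation. The function `coverage_report` and the example lists are hypothetical, not results from this notebook.

```python
def coverage_report(model_similarities):
    """Summarise how many word pairs received a similarity score (not None)."""
    total = len(model_similarities)
    scored = sum(1 for s in model_similarities if s is not None)
    fraction = scored / total if total else 0.0
    return {"scored": scored, "total": total, "fraction": fraction}

# Hypothetical similarity lists; None marks a pair with at least one OOV word
example_scratch = [0.12, None, None, 0.40, None]
example_pretrained = [0.55, 0.31, 0.78, 0.60, None]

for name, sims in [("scratch", example_scratch), ("pretrained", example_pretrained)]:
    report = coverage_report(sims)
    print(f"{name}: {report['scored']}/{report['total']} pairs scored "
          f"({report['fraction']:.0%})")
```

In this notebook the same helper could be applied to lists such as `model_similarities_skipgram`; a correlation backed by only a small share of the pairs deserves a more cautious reading than one computed over nearly all of them.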

Future research directions may include enhancing custom-trained models by:
   - Expanding the training corpus to expose the model to more varied text.

   - Optimizing model architectures and tuning the parameters to improve their learning capacity.

   - Refining training processes to ensure more effective learning.