# GloVe (Gensim)

For looking at word vectors, we'll use **Gensim**. Gensim isn't really a deep learning package. It's a package for word and text similarity modeling, which started with (LDA-style) topic models and grew into SVD and neural word representations. But it's efficient, scalable, and quite widely used. We're going to use **GloVe** embeddings, downloaded from [the GloVe page](https://nlp.stanford.edu/projects/glove/). They're inside [this zip file](https://nlp.stanford.edu/data/glove.6B.zip).

In [None]:
!pip install gensim
from gensim.test.utils import datapath
from gensim.models import KeyedVectors

# Download and unzip the GloVe embeddings
# (glove.6B.zip holds the 50d, 100d, 200d, and 300d vector files)
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip


Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 68.0 MB/s eta 0:00:00
Installing collected packages: gensim
Successfully installed gensim-4.4.0
--2026-01-17 12:19:43--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2026-01-17 12:19:44--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu

**Load GloVe model**

In [2]:
from gensim.models import KeyedVectors

glove_model = KeyedVectors.load_word2vec_format(
    "glove.6B.100d.txt",
    binary=False,
    no_header=True
)
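
The `no_header=True` argument is needed because GloVe text files do not start with the `vocab_size dimensions` header line that the word2vec text format uses; every line is simply a token followed by its vector components, separated by spaces. As a rough illustration of that layout (not Gensim's actual loader), here is a minimal pure-Python sketch. The helper `parse_glove_lines` and the 3-dimensional sample lines are made up for this example.

```python
def parse_glove_lines(lines):
    """Parse GloVe-style text lines into a {token: [floats]} dict.

    Hypothetical helper for illustration only; Gensim's loader is far
    more efficient and stores vectors in a NumPy array.
    """
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        # First field is the token, the remaining fields are the components
        token, components = parts[0], [float(x) for x in parts[1:]]
        vectors[token] = components
    return vectors


# Invented sample lines (real glove.6B.100d.txt lines have 100 components)
sample_lines = [
    "the 0.1 -0.2 0.3\n",
    "cat 0.5 0.4 -0.1\n",
]

toy_vectors = parse_glove_lines(sample_lines)
print(toy_vectors["cat"])  # [0.5, 0.4, -0.1]
```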


**Word Analogy evaluation**

In [3]:
def evaluate_glove_analogies(model, analogy_file):
    """Return (syntactic_acc, semantic_acc) for the given analogy file.

    Semantic accuracy is measured on the "capital-common-countries" section,
    syntactic accuracy on the "past-tense" section.
    """
    # evaluate_word_analogies returns (overall_accuracy, per-section results)
    _, sections = model.evaluate_word_analogies(analogy_file)

    semantic_correct = semantic_total = 0
    syntactic_correct = syntactic_total = 0

    for section in sections:
        n_correct = len(section["correct"])
        n_total = n_correct + len(section["incorrect"])

        if section["section"] == "capital-common-countries":  # semantic
            semantic_correct += n_correct
            semantic_total += n_total
        elif section["section"] == "past-tense":  # syntactic
            syntactic_correct += n_correct
            syntactic_total += n_total

    # Fall back to 0 when a section is missing or empty
    semantic_acc = semantic_correct / semantic_total if semantic_total > 0 else 0
    syntactic_acc = syntactic_correct / syntactic_total if syntactic_total > 0 else 0

    return syntactic_acc, semantic_acc


**Evaluate the Model**

In [4]:
# The file is already at an absolute path, so no datapath() lookup is needed
analogy_path = "/content/sample_data/word_analogies_dataset.txt"

syntactic_acc, semantic_acc = evaluate_glove_analogies(
    glove_model,
    analogy_path
)

print("Syntactic Accuracy (past-tense):", round(syntactic_acc, 4))
print("Semantic Accuracy (capital-common-countries):", round(semantic_acc, 4))


Syntactic Accuracy (past-tense): 0.5545
Semantic Accuracy (capital-common-countries): 0.9387
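
To make the accuracy numbers above more concrete, here is a minimal sketch of how a single analogy question "a is to b as c is to ?" can be answered with vector arithmetic: take `b - a + c` on unit-length vectors and pick the nearest remaining word. This is a simplified stand-in for what `evaluate_word_analogies` does internally, not its actual implementation. The function `solve_analogy` and the 3-dimensional toy vectors are invented for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' using the b - a + c offset."""
    unit = {word: normalize(vec) for word, vec in vectors.items()}
    target = [vb - va + vc for va, vb, vc in zip(unit[a], unit[b], unit[c])]

    best_word, best_score = None, -math.inf
    for word, vec in unit.items():
        if word in (a, b, c):
            continue  # the question words are never valid answers
        # Candidates are unit length and the target is fixed, so ranking
        # by dot product gives the same order as ranking by cosine.
        score = sum(t * x for t, x in zip(target, vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Invented toy vectors with dimensions (male, female, royal)
toy_vectors = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "child": [0.5, 0.5, 0.5],
}

# Each question is (a, b, c, expected answer)
questions = [
    ("man", "woman", "king", "queen"),
    ("king", "queen", "man", "woman"),
]

correct = sum(solve_analogy(toy_vectors, a, b, c) == d for a, b, c, d in questions)
toy_accuracy = correct / len(questions)
print(toy_accuracy)  # 1.0
```

Accuracy for a section is then just the share of questions whose predicted word matches the expected fourth word, which is what the `correct` and `incorrect` lists in each section are counting.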


In [5]:
import pandas as pd
import numpy as np
from scipy.stats import spearmanr


In [6]:
# Path to the CSV word-similarity dataset
similarity_path = "/content/sample_data/wordsim353crowd.csv"  # change if needed

df = pd.read_csv(similarity_path)
df.head()


|   | Word 1 | Word 2 | Human (Mean) |
|---|---|---|---|
| 0 | admission | ticket | 5.536 |
| 1 | alcohol | chemistry | 4.125 |
| 2 | aluminum | metal | 6.625 |
| 3 | announcement | effort | 2.0625 |
| 4 | announcement | news | 7.1875 |


In [7]:
def get_vector(model, word):
    if word in model:
        return model[word]
    else:
        return None



**Compute dot-product similarities**

In [8]:
model_scores = []
human_scores = []

for _, row in df.iterrows():
    # Use .iloc for positional access; row[0] on a labeled Series is deprecated
    w1, w2, human_score = row.iloc[0], row.iloc[1], row.iloc[2]

    v1 = get_vector(glove_model, w1)
    v2 = get_vector(glove_model, w2)

    if v1 is None or v2 is None:
        continue  # skip OOV pairs

    dot_product = np.dot(v1, v2)

    model_scores.append(dot_product)
    human_scores.append(human_score)
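
The loop above scores each pair with a raw dot product, which grows with vector length as well as with how closely the two directions agree. As a point of comparison only (the notebook itself does not do this), the small sketch below contrasts the dot product with cosine similarity on made-up 2-dimensional vectors. The helper names `dot` and `cosine` and the vectors `u`, `v`, `w` are assumptions for illustration.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def cosine(u, v):
    """Dot product divided by both vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


u = [1.0, 2.0]
v = [2.0, 4.0]   # same direction as u, twice the length
w = [10.0, 0.0]  # different direction, large length

print(dot(u, v), dot(u, w))  # 10.0 10.0
print(cosine(u, v), cosine(u, w))  # roughly 1.0 and 0.447
```

Both pairs get the same dot product here even though only `v` points the same way as `u`, so the choice of similarity measure can change how word pairs are ranked.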




**Compute Spearman correlation**

In [9]:
correlation, p_value = spearmanr(model_scores, human_scores)

print(f"Spearman Correlation: {correlation:.4f}")
print(f"P-value: {p_value:.4e}")


Spearman Correlation: 0.4176
P-value: 1.4439e-15
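
Spearman correlation compares rankings rather than raw values, which is why unbounded dot products can be set against human ratings without rescaling either side. A minimal pure-Python sketch of the idea follows: rank both lists (tied values share their average rank), then take the Pearson correlation of the ranks. The functions `average_ranks`, `pearson`, and `spearman` and the toy scores are invented for illustration; `scipy.stats.spearmanr` remains the tool to use in practice, and it also reports the p-value.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman correlation as Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented toy scores: different scales, identical ordering
toy_model_scores = [12.3, 0.5, 7.1, 30.2]
toy_human_scores = [6.0, 1.5, 4.0, 9.0]

print(round(spearman(toy_model_scores, toy_human_scores), 4))  # 1.0
```

The two toy lists live on very different scales, yet the correlation is 1.0 because they order the four pairs identically.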
