# Assignment 2: Word2Vec Representations & Sigmoid Classification


In this assignment you will:

1. Explore semantic properties captured by the 300‑dimensional **Google News Word2Vec** model.
2. Build a *sigmoid (logistic) classifier* that operates **directly on pre‑trained word vectors**.

The goal is to deepen your intuition for distributional semantics and to give you hands‑on experience using dense word representations as features for a simple supervised task.

## Step 1: Download and Extract the GoogleNews Vectors  
Download the pre-trained Word2Vec model from the following Google Drive link and move it to the current folder (i.e. `Assignment_2/`):  
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing  
Once you have the `.gz` file, run the code below to decompress `GoogleNews-vectors-negative300.bin.gz` into `GoogleNews-vectors-negative300.bin` and print basic summary statistics.

In [1]:
from data_processing import extract_and_summary_gnews

extract_and_summary_gnews()

Decompressing `.gz` → `.bin` (this may take a few minutes)…
Decompressed to: /Users/ryanabsar/Documents/02_Education/01_ICL_BA/01_Modules/03_Summer Semester/03_genai_llm/genai-llm-imperial-ba/01_assignments/a02_word2vec/GoogleNews-vectors-negative300.bin

=== Summary Statistics ===
• Total vocabulary size: 3,000,000 words
• Vector dimensionality: 300

Loading a small sample (limit=10) to display a few word–vector snippets…
• Sample words loaded (first 10): ['</s>', 'in', 'for', 'that', 'is', 'on', '##', 'The', 'with', 'said']

• Vector snippets for the first 5 sample words:
    </s>            → [0.001129150390625, -0.000896453857421875, 0.0003185272216796875, 0.00153350830078125, 0.00110626220703125, -0.00140380859375] …
    in              → [0.0703125, 0.0869140625, 0.087890625, 0.0625, 0.0693359375, -0.10888671875] …
    for             → [-0.01177978515625, -0.04736328125, 0.044677734375, 0.0634765625, -0.0181884765625, -0.06396484375] …
    that            → [-0.0157470703125, -0
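
The helper `extract_and_summary_gnews` lives in `data_processing` and is not shown here. As a rough sketch of what the decompression step could look like (not the helper's actual implementation), the snippet below streams a `.gz` file to disk and reads the plain-text header that word2vec binary files start with, namely the vocabulary size and the dimensionality. The function names `decompress_gz` and `read_word2vec_header` are hypothetical.

```python
import gzip
import shutil
from pathlib import Path
from typing import Tuple


def decompress_gz(src: Path, dst: Path) -> Path:
    """Stream-decompress `src` (.gz) into `dst` without loading it all into memory."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst


def read_word2vec_header(path: Path) -> Tuple[int, int]:
    """Read the first line of a word2vec binary file: b"<vocab_size> <dim>\\n"."""
    with open(path, "rb") as f:
        vocab_size, dim = f.readline().split()
    return int(vocab_size), int(dim)


# Usage (assumes the archive sits in the current folder):
# bin_path = decompress_gz(
#     Path("GoogleNews-vectors-negative300.bin.gz"),
#     Path("GoogleNews-vectors-negative300.bin"),
# )
# print(read_word2vec_header(bin_path))
```

For the GoogleNews file, the header values should agree with the vocabulary size and dimensionality reported in the summary statistics above.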

## Step 2: Word Analogy Task

The `get_vector(word: str)` function retrieves the word2vec embedding for a given word, and `top_k_neighbours(target_vec, k)` returns the top *k* nearest neighbours in the vocabulary for a specified vector. Your goal is to implement the `analogy` function: given words **a**, **b**, and **c**, it should return the top *k* words **d** such that  
$$\mathbf{v}_b - \mathbf{v}_a + \mathbf{v}_c \approx \mathbf{v}_d.$$
Once you complete the `analogy` function, running the code block will demonstrate several example word analogies. **Do not** change any existing lines; only fill in the sections marked:
```
############## YOUR CODE HERE ##############
#                                           
#                                           
############## YOUR CODE HERE ##############
```
Some hints are provided above the marked sections.
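
To make the vector arithmetic concrete before working with the full model, here is a small dependency-free sketch on hand-made 2-D vectors. The vocabulary, the vector values, and the names `toy_vectors`, `cosine`, and `toy_analogy` are all invented for illustration; the real task should use `get_vector` and `top_k_neighbours` on the 300-dimensional embeddings.

```python
import math
from typing import Dict, List

# Hypothetical toy embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors: Dict[str, List[float]] = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def toy_analogy(a: str, b: str, c: str, k: int = 1) -> List[str]:
    """Rank words d by cosine similarity to v_b - v_a + v_c, excluding a, b, c."""
    va, vb, vc = toy_vectors[a], toy_vectors[b], toy_vectors[c]
    target = [y - x + z for x, y, z in zip(va, vb, vc)]
    candidates = [w for w in toy_vectors if w not in (a, b, c)]
    candidates.sort(key=lambda w: cosine(target, toy_vectors[w]), reverse=True)
    return candidates[:k]


print(toy_analogy("king", "man", "queen"))  # → ['woman']
```

Here the target vector is `man - king + queen = [0, -1]`, which coincides with the toy vector for `woman`. Excluding the three input words matters because in real embeddings the target often stays closest to one of them.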

In [4]:
from typing import List
import numpy as np
from gensim.models import KeyedVectors

# Load the pre-trained GoogleNews Word2Vec model
model: KeyedVectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin",
    binary=True
)

def get_vector(word: str) -> np.ndarray:
    """
    Return the 300-dimensional vector for `word` from the pre-loaded `model`.

    Raises:
        KeyError: If `word` is out-of-vocabulary (OOV).
    """
    try:
        # In Gensim 4.x, indexing directly: model[word] returns a numpy.ndarray of shape (300,)
        return model[word]
    except KeyError:
        # If the word is not in the model’s vocabulary, propagate KeyError with a clearer message
        raise KeyError(f"'{word}' not found in Word2Vec vocabulary.")

def top_k_neighbours(target_vec: np.ndarray, k: int) -> List[tuple]:
    """
    Return the top `k` most similar words to the given vector `target_vec`,
    as a list of (word, similarity_score) tuples.

    Args:
        target_vec: A numpy array (shape (300,)) representing the target word vector.
        k: Number of nearest neighbors to retrieve.

    Returns:
        A list of tuples: [(word1, similarity1), (word2, similarity2), ...].
    """
    return model.similar_by_vector(target_vec, topn=k)



def analogy(a: str, b: str, c: str, k: int = 5) -> List[str]:
    """
    Solve the analogy a : b :: c : d by finding the top-k words d whose vectors
    are closest to (vec_b - vec_a + vec_c), excluding a, b, and c themselves.

    Returns:
        A list of the top `k` predicted words (strings).

    """

    for w in (a, b, c):
        if w not in model.key_to_index:
            raise KeyError(f"Word '{w}' not found in the Word2Vec vocabulary.")
    
    # Retrieve the vectors:
    vec_a = model[a]
    vec_b = model[b]
    vec_c = model[c]


    # Compute the target vector: target_vec = vec_b - vec_a + vec_c
    # Get the top (k + 3) words most similar to `target_vec`; the 3 extra candidates
    # ensure that at least `k` words remain after filtering.
    # Then filter out any occurrences of `a`, `b`, or `c` and return the top `k` words.
    ############## YOUR CODE HERE ##############

    target_vec = vec_b - vec_a + vec_c
    top_k_candidates = top_k_neighbours(target_vec, k + 3)
    # Filter out the words a, b, c and keep only the top k candidates
    analogy_exclude_self = [word for word, _ in top_k_candidates if word not in (a, b, c)][:k]
    
    
    ############## YOUR CODE HERE ##############

    return analogy_exclude_self
    


# Note: these two calls pass a word (str) rather than a vector; the recorded
# output below shows that this works and that the query word itself is not returned.
print("Top 5 similar words to `man` →", top_k_neighbours('man', 5))
print("Top 5 similar words to `france` →", top_k_neighbours('france', 5))

# Example 1: Gender relation (king : man :: queen : ?)
analogy_1 = analogy("king", "man", "queen", k=5)
print("Analogy 'king - man = queen - ?' →", analogy_1)

# Example 2: Capital–country relation (Paris : France :: Tokyo : ?)
analogy_2 = analogy("paris", "france", "tokyo", k=5)
print("Analogy 'paris - france = tokyo - ?' →", analogy_2)

# Example 3: Singular–plural (car : cars :: child : ?)
analogy_3 = analogy("car", "cars", "child", k=5)
print("Analogy 'car - cars = child - ?' →", analogy_3)

# Example 4: Verb tense (run : running :: swim : ?)
analogy_4 = analogy("run", "running", "swim", k=5)
print("Analogy 'run - running = swim - ?' →", analogy_4)


# Example 5: Currency relation (dollar : USA :: yen : ?)
analogy_5 = analogy("dollar", "usa", "yen", k=5)
print("Analogy 'dollar - usa = yen - ?' →", analogy_5)


# Example 6: Profession–workplace (doctor : hospital :: teacher : ?)
analogy_6 = analogy("doctor", "hospital", "teacher", k=5)
print("Analogy 'doctor - hospital = teacher - ?' →", analogy_6)

Top 5 similar words to `man` → [('woman', 0.7664012908935547), ('boy', 0.6824871301651001), ('teenager', 0.6586930155754089), ('teenage_girl', 0.6147903203964233), ('girl', 0.5921714305877686)]
Top 5 similar words to `france` → [('spain', 0.6375302672386169), ('french', 0.6326055526733398), ('germany', 0.6314354538917542), ('europe', 0.6264256238937378), ('italy', 0.6257959008216858)]
Analogy 'king - man = queen - ?' → ['woman', 'girl', 'lady', 'teenage_girl', 'teenager']
Analogy 'paris - france = tokyo - ?' → ['japan', 'hong_kong', 'japanese', 'germany', 'europe']
Analogy 'car - cars = child - ?' → ['children', 'babies', 'newborns', 'infants', 'kids']
Analogy 'run - running = swim - ?' → ['swimming', 'swam', 'swims', 'swimmers', 'swum']
Analogy 'dollar - usa = yen - ?' → ['japanese', 'japan', 'india', 'uk', '¥']
Analogy 'doctor - hospital = teacher - ?' → ['elementary', 'teachers', 'school', 'classroom', 'School']


## Step 3: Train a Binary Sentiment Classifier with Word2Vec Embeddings

In this step, you will download the [Opinion Lexicon](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html) (Hu & Liu, 2004), which contains two lists of words labeled “positive” and “negative” based on consumer reviews. The next cell uses the helper function `load_lexicon_and_filter` to download and parse the positive and negative word lists, then filters out any words that do not appear in the pre-trained GoogleNews Word2Vec vocabulary. The code then combines the two lists into a single dataset of words with binary labels (1 = positive, 0 = negative) and constructs a feature matrix **X** in which each row is the 300-dimensional Word2Vec vector for one word. **X** and the corresponding labels **y** are split into an 80% training set and a 20% test set, stratified by label.
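
The internals of `load_lexicon_and_filter` are not shown in this notebook. The sketch below illustrates one plausible way to do the parsing and filtering. It assumes that each lexicon file holds one word per line and that header lines start with `;`; check this against the downloaded files. The function names `parse_lexicon_lines` and `keep_in_vocab` are hypothetical.

```python
from typing import Container, Iterable, List


def parse_lexicon_lines(lines: Iterable[str]) -> List[str]:
    """Keep one word per line, skipping blank lines and ';' comment lines (assumed format)."""
    words = []
    for line in lines:
        word = line.strip()
        if word and not word.startswith(";"):
            words.append(word)
    return words


def keep_in_vocab(words: Iterable[str], vocab: Container[str]) -> List[str]:
    """Drop every word that has no entry in `vocab`, preserving order."""
    return [w for w in words if w in vocab]


# Toy input standing in for a lexicon file and an embedding vocabulary.
raw_lines = ["; header comment", "", "good", "a+", "great"]
toy_vocab = {"good", "great", "bad"}

words = parse_lexicon_lines(raw_lines)
print(keep_in_vocab(words, toy_vocab))  # → ['good', 'great']
```

With the real model, `model.key_to_index` (the mapping used for the vocabulary check in Step 2) could play the role of `toy_vocab`.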


Your task is to train a logistic regression model on `X_train` and `y_train`, and then produce predictions on `X_test`. The function `result_summary` summarizes the evaluation results on the test set.
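
As a reminder of what the "sigmoid" in the title refers to: a binary logistic regression model scores an input as $\sigma(\mathbf{w}^\top \mathbf{x} + b)$ and predicts the positive class when that probability is above 0.5. The sketch below shows this decision rule on a 3-dimensional toy vector. The weights, bias, and input are made up for illustration, not learned from the lexicon; after fitting a scikit-learn `LogisticRegression`, the learned values are available as `coef_` and `intercept_`.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba_positive(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Probability of the positive class: sigmoid(w . x + b)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(score)


def predict_label(x: Sequence[float], w: Sequence[float], b: float) -> int:
    """Return 1 (positive) if the predicted probability exceeds 0.5, else 0."""
    return int(predict_proba_positive(x, w, b) > 0.5)


# Hypothetical parameters and input (in the assignment, x would have 300 entries).
w = [2.0, -1.0, 0.5]
b = -0.25
x = [0.8, 0.1, 0.4]

print(predict_proba_positive(x, w, b), predict_label(x, w, b))
```

Because the sigmoid is monotonic, thresholding the probability at 0.5 is the same as checking whether the linear score `w . x + b` is positive, so the decision boundary is a hyperplane in the embedding space.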

In [5]:
import os
import ssl
import urllib.request
from pathlib import Path

import numpy as np
from gensim.models import KeyedVectors
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from data_processing import load_lexicon_and_filter, result_summary

# ───────────────────────────────────────────────────────────────────────────────
# Load pre-trained Word2Vec embeddings and filter lexicon-based word lists
# ───────────────────────────────────────────────────────────────────────────────
model: KeyedVectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin",
    binary=True
)

# Filter positive and negative words using the pre-trained embeddings
filtered_positive, filtered_negative = load_lexicon_and_filter(model)

# Combine positive and negative words into a single list with binary labels
all_words = filtered_positive + filtered_negative
labels = [1] * len(filtered_positive) + [0] * len(filtered_negative)
n_samples = len(all_words)

print(f"[DATA] Total samples = {n_samples} (pos={len(filtered_positive)}, neg={len(filtered_negative)})")

# ───────────────────────────────────────────────────────────────────────────────
# Build feature matrix X (word embeddings) and label vector y
# ───────────────────────────────────────────────────────────────────────────────
embedding_dim = model.vector_size

# Initialize an empty feature matrix of shape (n_samples, embedding_dim)
X = np.zeros((n_samples, embedding_dim), dtype=np.float32)
for i, word in enumerate(all_words):
    X[i, :] = model[word]  # Fetch the 300-dim vector for each word

# Convert the label list to a NumPy array of shape (n_samples,)
y = np.array(labels, dtype=np.int64)

print(f"[DATA] Feature matrix X shape = {X.shape}, Label vector y shape = {y.shape}\n")

# ───────────────────────────────────────────────────────────────────────────────
# Split the data into training and test sets (80% train, 20% test)
# ───────────────────────────────────────────────────────────────────────────────
idxs = np.arange(n_samples)
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, idxs,
    test_size=0.20,
    random_state=42,
    stratify=y
)

print(f"[SPLIT] X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"[SPLIT] X_test : {X_test.shape}, y_test : {y_test.shape}\n")

# ───────────────────────────────────────────────────────────────────────────────
# Train a Logistic Regression classifier to distinguish positive vs. negative words
# ───────────────────────────────────────────────────────────────────────────────
print("[TRAIN] Fitting Logistic Regression...")



# Train a logistic regression model on X_train and y_train
# Then produce predictions on X_test and store them in y_pred_logreg
############## YOUR CODE HERE ##############
logreg = LogisticRegression(max_iter=1000, random_state=42)
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)

print("[TRAIN] Logistic Regression fitted.\n")
############## YOUR CODE HERE ##############

# ───────────────────────────────────────────────────────────────────────────────
# Evaluate the classifier by summarizing results on the test set
# ───────────────────────────────────────────────────────────────────────────────
result_summary(y_test, y_pred_logreg, all_words, idx_test) 

[SKIP] positive-words.txt already exists.
[SKIP] negative-words.txt already exists.
Filtered 2006 → 1857 positive words kept
Filtered 4783 → 4445 negative words kept

[DATA] Total samples = 6302 (pos=1857, neg=4445)
[DATA] Feature matrix X shape = (6302, 300), Label vector y shape = (6302,)

[SPLIT] X_train: (5041, 300), y_train: (5041,)
[SPLIT] X_test : (1261, 300), y_test : (1261,)

[TRAIN] Fitting Logistic Regression...
[TRAIN] Logistic Regression fitted.

=== Logistic Regression (Sigmoid) Results ===
Accuracy: 0.9524
Classification Report:
              precision    recall  f1-score   support

         NEG       0.96      0.97      0.97       889
         POS       0.93      0.91      0.92       372

    accuracy                           0.95      1261
   macro avg       0.94      0.94      0.94      1261
weighted avg       0.95      0.95      0.95      1261


Misclassified by Logistic Regression:
  Word: dominate          True=POS   Pred=NEG
  Word: overwhelming      True=NEG   P

## Step 4: Analyze and Discuss Misclassified Words

After running `result_summary`, you will receive a list of words that your logistic regression model classified incorrectly. For each misclassified word, consider:

- **Boundary Cases**: Is the word inherently ambiguous or context-dependent?  
- **Polysemy and Context**: Does the word have multiple meanings that Word2Vec embeddings might conflate?  
- **Nearest Neighbors**: How do the word’s closest vectors in the embedding space influence its classification?  

By exploring these questions, you’ll gain intuition about how Word2Vec encodes semantic and sentiment information—and why certain words may be misclassified even when their true sentiment is clear.
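
One way to probe the nearest-neighbour question quantitatively is to check how the labelled neighbours of a misclassified word are distributed between the two classes. The sketch below is a hypothetical helper (`neighbour_label_share` is not part of the assignment code) run on invented toy data. In the notebook, the neighbour list could come from `top_k_neighbours` and the label lookup from `dict(zip(all_words, labels))`.

```python
from typing import Dict, List, Optional, Tuple


def neighbour_label_share(
    neighbours: List[Tuple[str, float]],
    word_to_label: Dict[str, int],
) -> Optional[float]:
    """Fraction of labelled neighbours that are positive (label 1).

    Neighbours without a lexicon label are ignored; returns None if no
    neighbour is labelled.
    """
    found = [word_to_label[w] for w, _ in neighbours if w in word_to_label]
    if not found:
        return None
    return sum(found) / len(found)


# Toy data: (word, similarity) pairs and made-up labels, for illustration only.
toy_neighbours = [("great", 0.8), ("awful", 0.6), ("unlabelled_word", 0.5)]
toy_labels = {"great": 1, "awful": 0}

print(neighbour_label_share(toy_neighbours, toy_labels))  # → 0.5
```

A share far from the word's true label suggests that the word sits in a region of the embedding space dominated by the opposite class, which is one plausible reason for a linear classifier to get it wrong.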

In [14]:
import pandas as pd

logreg_result = pd.DataFrame({
    "word": [all_words[i] for i in idx_test],
    "label": y_test,
    "prediction": y_pred_logreg
})

# check false positives and false negatives
false_positives = logreg_result[(logreg_result['label'] == 0) & (logreg_result['prediction'] == 1)]
false_negatives = logreg_result[(logreg_result['label'] == 1) & (logreg_result['prediction'] == 0)]

# confusion matrix
confusion_matrix = pd.crosstab(logreg_result['label'], logreg_result['prediction'], rownames=['Actual'], colnames=['Predicted'], margins=True)
confusion_matrix

| Actual / Predicted | 0 | 1 | All |
| --- | --- | --- | --- |
| 0 | 862 | 27 | 889 |
| 1 | 33 | 339 | 372 |
| All | 895 | 366 | 1261 |
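
The confusion matrix above can be tied back to the classification report from Step 3. The short calculation below recomputes the headline metrics for the positive class from the four counts. It also computes the accuracy of a trivial baseline that always predicts the majority (negative) class, which is a useful reference point given the class imbalance.

```python
# Counts copied from the confusion matrix above (rows = actual, columns = predicted).
tn, fp = 862, 27   # actual negative: predicted 0, predicted 1
fn, tp = 33, 339   # actual positive: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Accuracy of always predicting the negative class.
majority_baseline = (tn + fp) / total

print(f"accuracy          = {accuracy:.4f}")
print(f"precision (POS)   = {precision_pos:.2f}")
print(f"recall (POS)      = {recall_pos:.2f}")
print(f"f1 (POS)          = {f1_pos:.2f}")
print(f"majority baseline = {majority_baseline:.3f}")
```

The recomputed values match the report (accuracy 0.9524; precision 0.93, recall 0.91, and F1 0.92 for POS), while the majority-class baseline reaches only about 0.705.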


In [12]:
false_negatives.head()

| | word | label | prediction |
| --- | --- | --- | --- |
| 4 | dominate | 1 | 0 |
| 26 | subsidizes | 1 | 0 |
| 46 | indulgence | 1 | 0 |
| 105 | brainy | 1 | 0 |
| 107 | fervidly | 1 | 0 |


In [13]:
false_positives.head()

| | word | label | prediction |
| --- | --- | --- | --- |
| 19 | overwhelming | 0 | 1 |
| 98 | emphatic | 0 | 1 |
| 125 | stranger | 0 | 1 |
| 210 | disappoint | 0 | 1 |
| 329 | tenderness | 0 | 1 |


## Nearest Neighbour

In [18]:
# Look up the first false negative and print its 5 nearest neighbours
# (the word itself is passed as the query, as in Step 2).
word = false_negatives['word'].iloc[0]
print(f"Nearest neighbours for word '{word}': {top_k_neighbours(word, 5)}")

Nearest neighbours for word 'dominate': [('dominated', 0.7352245450019836), ('dominating', 0.7086454629898071), ('dominates', 0.6767717599868774), ('dominant', 0.6167248487472534), ('dominance', 0.6048017144203186)]
