# Assignment: week 4

Using pre-trained word embeddings and understanding word vectors.

### Variables and data download
I tested all versions of the GloVe data and ended up using the smallest, 50-dimensional version for its shorter runtime. I also declared the word_dict variable here, which will be filled with the word vectors from the given file.

In [124]:
import numpy as np

# Path to file
glove_path = "./glove.2024.wikigiga/wiki_giga_2024_50.txt"

# Vector dictionary
word_dict = {}

I loaded the dataset into the dictionary, keeping only the lines that parse to a 50-dimensional vector.

In [125]:
with open(glove_path, encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        word = parts[0]
        vector = np.asarray(parts[1:], dtype="float32")
        if vector.shape == (50,):
            word_dict[word] = vector

print("Loaded vectors:", len(word_dict))

Loaded vectors: 1291147


### Functions
I created some helper lists and dictionaries for later use with the functions.

In [126]:
# Creating a list of words
words = list(word_dict.keys())

# Adding indexes for words in dictionary
word_to_index = {w: i for i, w in enumerate(words)}

# Creating a matrix with the vectors
vectors = np.stack([word_dict[w] for w in words])

# Normalizing the matrix for cosine similarity
norm_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
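The reason for normalizing first is that the dot product of two unit-length vectors equals their cosine similarity, so one matrix product gives the similarity to every word at once. A minimal sketch with two hand-picked toy vectors (hypothetical values, not taken from the GloVe file):

```python
import numpy as np

# Toy vectors, hypothetical values for illustration only
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine similarity straight from the definition
cos_direct = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Same value as a plain dot product after scaling both vectors to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
cos_unit = np.dot(a_unit, b_unit)

print(cos_direct, cos_unit)
```

Both printed values should agree up to floating-point rounding.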

Here I defined a function that returns the closest word to a query vector, and a wrapper that makes analogy queries quicker to write. The first function can also return multiple words; I used longer lists in testing.

In [127]:
# Function to find the closest word. Option to return multiple.
def closest_word(query_vector, skip=None, top_k=1):
    # Normalize
    q = query_vector / np.linalg.norm(query_vector)
    
    # Cosine similarity
    sims = np.dot(norm_vectors, q)
    
    # Skip target words
    if skip:
        for w in skip:
            if w in word_to_index:
                sims[word_to_index[w]] = -1e9
    
    # Get k closest vectors
    idx = np.argsort(-sims)[:top_k]
    
    # Return words based on index
    return [words[i] for i in idx]

# Function to get analogies with skipping query words
def analogy(a, b, c):
    return closest_word(
        word_dict[a] - word_dict[b] + word_dict[c],
        skip=[a,b,c])
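The closest_word function above sorts all similarities, which is simple and fast enough here. If only a few top results are needed from a very large vocabulary, a partial sort is a possible alternative. The sketch below is an assumption on my part, not something the assignment requires; top_k_indices and sims_demo are hypothetical names used only for this illustration.

```python
import numpy as np

def top_k_indices(sims, top_k=1):
    # Clamp k so the call also works on arrays shorter than top_k
    k = min(top_k, sims.shape[0])
    # Partition so the k largest similarities land in the first k slots (unordered)
    candidates = np.argpartition(-sims, k - 1)[:k]
    # Order only those k candidates from most to least similar
    return candidates[np.argsort(-sims[candidates])]

# Hypothetical similarity scores for a five-word vocabulary
sims_demo = np.array([0.1, 0.9, -0.3, 0.7, 0.4])
print(top_k_indices(sims_demo, top_k=3))
```

The returned indices could be mapped back to words in the same way as in closest_word.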

### Getting word analogies
First I printed the vector representations for the given words:

In [128]:
print(f"Man vector: {word_dict['man']}")
print(f"Woman vector: {word_dict['woman']}")
print(f"King vector: {word_dict['king']}")

Man vector: [-1.217787e+00 -2.935390e-01 -4.390530e-01  6.924400e-02  2.682640e-01
 -1.988100e-01 -1.297030e-01  1.003629e+00  4.606740e-01 -7.343290e-01
  3.579700e-02 -1.491620e-01  1.898410e-01 -1.992740e-01 -3.605320e-01
  3.182900e-01  5.654690e-01 -3.085540e-01 -1.395550e-01  7.721380e-01
 -1.136440e-01  8.250330e-01  3.852980e-01 -1.178040e-01 -2.507190e-01
  3.189500e-02  6.723400e-02  3.140840e-01 -8.328900e-02 -1.069440e-01
 -7.319130e-01 -8.047010e-01  3.093600e-02 -1.443550e-01  5.556690e-01
  4.766830e-01  7.349810e-01  4.709278e+00  5.564010e-01 -2.346130e-01
  1.162280e-01  9.931000e-03 -7.955820e-01  3.235500e-01 -6.286910e-01
 -3.320000e-04  6.728500e-02  8.380330e-01  7.360900e-02  1.323740e+00]
Woman vector: [-1.632972e+00 -1.172860e-01 -1.332430e+00 -7.439910e-01  6.614890e-01
  3.433440e-01 -1.023777e+00  4.300200e-01 -2.342020e-01 -3.084370e-01
  2.529340e-01 -1.476390e-01  7.209500e-02 -1.603000e-01 -4.309100e-01
  5.801640e-01 -1.318800e-02 -4.371610e-01 -5.2243

I tested the closest_word function with the given words.

In [129]:
# Trying the basic function with word "King"
print(closest_word(word_dict['king']))

# Trying the basic function with equation 
print(closest_word(word_dict["woman"] - word_dict["man"] + word_dict["king"]))

['king']
['king']


Without skipping the words included in the query, the closest word is one of the query words themselves (here "king").
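This happens when the offset between the first two words is small compared with the third word's vector, so the query stays closest to that word. The effect can be reproduced with a tiny vocabulary. The 3-dimensional vectors below are hand-picked toy values (an assumption for illustration, not GloVe data), and toy_closest is a hypothetical stand-in for closest_word.

```python
import numpy as np

# Hypothetical 3-d toy embeddings, chosen by hand for illustration only
toy = {
    "man":   np.array([1.0, 0.0,  0.2]),
    "woman": np.array([1.0, 0.0, -0.2]),
    "king":  np.array([0.2, 1.0,  0.2]),
    "queen": np.array([0.6, 0.8, -0.3]),
}
toy_words = list(toy)
toy_matrix = np.stack([toy[w] for w in toy_words])
toy_matrix = toy_matrix / np.linalg.norm(toy_matrix, axis=1, keepdims=True)

def toy_closest(query, skip=()):
    # Cosine similarity of the normalized query against every toy word
    q = query / np.linalg.norm(query)
    sims = toy_matrix @ q
    # Exclude the skipped words from the ranking
    for w in skip:
        sims[toy_words.index(w)] = -np.inf
    return toy_words[int(np.argmax(sims))]

query = toy["woman"] - toy["man"] + toy["king"]
print(toy_closest(query))                                 # king
print(toy_closest(query, skip=["woman", "man", "king"]))  # queen
```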

In [130]:
# Using the function with skipping words in result
print(analogy("woman", "man", "king"))

['queen']


When the words included in the query are removed from the results, the closest word matches the expected answer to the equation. The result "queen" is the word that has the same relation to "king" as "woman" has to "man".

### Finding more interesting word vector relations
I tried to find other word pairings that would give a sensible result. Many of the words I tried just returned words similar to the last query parameter, with no visible effect from the first two words. Here are some that gave interesting results:

In [131]:
print(analogy("woman","man","son"))

['daughter']


The gendered word associations seem to work well.

In [132]:
print(analogy("paris","france","germany"))

['berlin']


Many attempts with country names and capitals just returned more country names, but "paris" seems to be a strong enough city signal that it leads to German cities.
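One possible way to check whether the first two words have any pull on a query is to measure how similar the query vector stays to the last word's vector: a value close to 1 would suggest that the offset barely moved it. The sketch below is only an illustration under that assumption; offset_effect, cosine and the 2-dimensional demo vectors are hypothetical and not part of the notebook.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def offset_effect(a, b, c, vecs):
    # Similarity between vecs[c] and the analogy query vecs[a] - vecs[b] + vecs[c]
    query = vecs[a] - vecs[b] + vecs[c]
    return cosine(query, vecs[c])

# Hypothetical toy vectors: in "same" the offset a - b is zero,
# in "shifted" it moves the query away from c
same = {"a": np.array([1.0, 0.0]), "b": np.array([1.0, 0.0]), "c": np.array([0.0, 1.0])}
shifted = {"a": np.array([2.0, 0.0]), "b": np.array([1.0, 0.0]), "c": np.array([0.0, 1.0])}

print(offset_effect("a", "b", "c", same))
print(offset_effect("a", "b", "c", shifted))
```

With the real data, the same function could be called with word_dict as the last argument.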

In [133]:
print(analogy("wine","france","germany"))

['beer']


National drinking habits seem to come through in the word analogies.

In [134]:
print(analogy("school", "child", "adult"))

['college']


I was afraid that this would return "work", but the different school levels seem to be clustered well in the vector space.

In [146]:
print(analogy("apartment", "poor", "rich"))
print(analogy("pc", "poor", "rich"))

['mansion']
['apple']


The word vectors seem to reflect the income-level associations of different items: "mansion" for housing and "apple" for computers.