# Nearest words via embeddings
The code below was generated in ChatGPT from this prompt:
Write a Python program using PyTorch to create embeddings for a list of words. Then, given one of the words as input, find the 5 closest words to it.

Question: Are the embeddings normalized, i.e., do they have length 1?
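
One way to investigate the question is to measure the L2 norm of each row directly. The sketch below builds its own small `nn.Embedding` (the names `layer`, `norms`, and `unit_norms` are illustrative, not part of the notebook) and compares the raw norms with the norms after `F.normalize`. It assumes the default `nn.Embedding` initialization, which draws weights from a standard normal distribution, so the raw rows should not be expected to have length 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # fixed seed so the check is repeatable

# Illustrative layer with the same shape as the notebook's (12 words, 50 dims)
layer = nn.Embedding(num_embeddings=12, embedding_dim=50)

with torch.no_grad():
    # L2 norm of each raw embedding row
    norms = layer.weight.norm(p=2, dim=1)
    # L2 norm of each row after explicit normalization
    unit_norms = F.normalize(layer.weight, p=2, dim=1).norm(p=2, dim=1)

print("Raw norms:       ", norms)
print("Normalized norms:", unit_norms)
```

With 50 dimensions the raw norms typically land near the square root of 50 (about 7), which is why `find_closest` normalizes before taking dot products.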

## Exercise in Class
Add the ability to embed 5 phrases of up to 30 words each.
Then, given a query, select the phrase most likely to answer it.
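
A possible starting point for the exercise, not a definitive solution: represent each phrase as the mean of its word embeddings and pick the phrase with the highest cosine similarity to the query. Everything here is an assumption for illustration: the example phrases, the whitespace tokenizer, the mean-pooling choice, and the names `example_phrases`, `phrase_layer`, `embed_text`, and `best_phrase`. Because the embeddings are untrained, matches are driven mostly by shared words rather than by meaning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
MAX_WORDS = 30  # the exercise caps phrases at 30 words

# Hypothetical example phrases
example_phrases = [
    "a banana is a yellow fruit",
    "a robot is a machine that can move",
    "grapes grow on vines in clusters",
    "a peach has a large stone inside",
    "melons are large and sweet",
]

def tokenize(text):
    """Lowercase, split on whitespace, and truncate to MAX_WORDS."""
    return text.lower().split()[:MAX_WORDS]

# Build a vocabulary from the phrases themselves
phrase_vocab = sorted({w for p in example_phrases for w in tokenize(p)})
phrase_word2idx = {w: i for i, w in enumerate(phrase_vocab)}
phrase_layer = nn.Embedding(len(phrase_vocab), 50)  # untrained, random weights

def embed_text(text):
    """Mean-pool the embeddings of known words; return None if no word is known."""
    idxs = [phrase_word2idx[w] for w in tokenize(text) if w in phrase_word2idx]
    if not idxs:
        return None
    with torch.no_grad():
        vectors = phrase_layer(torch.tensor(idxs))
    return F.normalize(vectors.mean(dim=0), p=2, dim=0)

# One unit-length row per phrase
phrase_matrix = torch.stack([embed_text(p) for p in example_phrases])

def best_phrase(query):
    """Return the phrase whose embedding has the highest cosine similarity to the query."""
    q = embed_text(query)
    if q is None:
        return None
    scores = phrase_matrix @ q
    return example_phrases[int(torch.argmax(scores))]

print(best_phrase("what is a robot"))
```

Swapping `phrase_layer` for pretrained word vectors or a sentence-embedding model would be the natural next step if the selection should reflect meaning rather than word overlap.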

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
nn

<module 'torch.nn' from '/home/codespace/.local/lib/python3.12/site-packages/torch/nn/__init__.py'>

In [9]:
# Define a list of words (vocabulary)
words = ["apple", "banana", "orange", "pear", "peach", 
         "mango", "grape", "cherry", "berry", "melon", "robot", "machine"]

# Create mappings from word to index and index to word
word2idx = {word: idx for idx, word in enumerate(words)}
idx2word = {idx: word for idx, word in enumerate(words)}

# Set embedding dimensions and create the embedding layer
embedding_dim = 50
embedding_layer = nn.Embedding(num_embeddings=len(words), embedding_dim=embedding_dim)

# Get embeddings for all words in the vocabulary
# (This will be a matrix of shape [vocab_size, embedding_dim])
embeddings = embedding_layer(torch.arange(len(words)))
# print the embedding for the first two words
print("Embeddings for the first two words:")
print(embeddings[:2])
# print the embedding for the last two words
print("Embeddings for the last two words:")
print(embeddings[-2:])
# print the embedding for the word "apple"
print("Embedding for the word 'apple':")
print(embeddings[word2idx["apple"]])



Embeddings for the first two words:
tensor([[-0.5814, -0.5345, -0.5858, -0.1236, -0.5404, -0.6483,  1.1738,  1.1161,
          0.7136,  0.1037,  1.3802,  0.2885,  0.0857, -0.6413, -0.0583, -1.9639,
          1.5322,  0.2474,  0.6956, -0.4548,  0.3738, -0.7276, -0.0718,  0.8499,
         -1.2576, -0.2303,  1.9752,  1.7365,  0.2653,  0.0364, -0.6435, -0.3675,
         -0.5705,  1.8230,  1.4834, -0.1357, -1.2909,  0.9535,  1.0271,  1.0067,
          0.7270,  0.3991, -1.0937, -0.9313,  1.2438,  0.9947, -0.5409,  0.5476,
         -0.8724, -0.0973],
        [-0.2458,  0.0500,  0.2172,  1.4909, -0.6108, -1.7328, -0.7391, -0.2985,
          0.8389,  0.0183,  1.5236,  1.1159,  0.7781,  0.3074, -1.5263,  0.7955,
          0.1966, -0.6849,  0.2517, -0.5324,  0.0206, -1.3999, -0.5060, -0.4395,
          0.7224,  1.1292, -1.6691, -0.6456, -0.5425, -0.2378, -0.8077,  2.6304,
         -0.7133,  0.9567,  0.1927,  0.0066,  1.3816,  0.3905,  0.4137,  1.1556,
         -0.2576,  0.1480, -0.3582, -0.0535, 

In [10]:
def find_closest(word, top_k=7):
    """Finds the top_k closest words to the input word using cosine similarity."""
    if word not in word2idx:
        print(f"Word '{word}' not found in vocabulary.")
        return []
    
    # Get the embedding for the input word
    word_index = word2idx[word]
    word_embedding = embeddings[word_index]
    
    # Normalize all embeddings to unit length
    normalized_embeddings = F.normalize(embeddings, p=2, dim=1)
    normalized_word_embedding = F.normalize(word_embedding, p=2, dim=0)
    
    # Compute cosine similarities: dot product between normalized vectors
    similarities = torch.matmul(normalized_embeddings, normalized_word_embedding)
    
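    # Get the indices of the top (top_k + 1) similar words (including the word itself),
    # capped at the vocabulary size so torch.topk does not raise on a small vocabulary
    k = min(top_k + 1, len(words))
    top_values, top_indices = torch.topk(similarities, k)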
    
    results = []
    for value, idx in zip(top_values, top_indices):
        # Skip the word itself
        if idx.item() == word_index:
            continue
        results.append((idx2word[idx.item()], value.item()))
        if len(results) == top_k:
            break
    return results



In [11]:
# Example usage
input_word = "banana"
closest_words = find_closest(input_word)

if closest_words:
    print(f"Top {len(closest_words)} words similar to '{input_word}':")
    for word, similarity in closest_words:
        print(f"{word} (cosine similarity: {similarity:.4f})")

Top 7 words similar to 'banana':
peach (cosine similarity: 0.2374)
machine (cosine similarity: 0.1912)
pear (cosine similarity: 0.1851)
melon (cosine similarity: 0.1765)
orange (cosine similarity: 0.0866)
berry (cosine similarity: 0.0669)
apple (cosine similarity: 0.0553)


In [12]:
# Example usage
input_word = "robot"
closest_words = find_closest(input_word)

if closest_words:
    print(f"Top {len(closest_words)} words similar to '{input_word}':")
    for word, similarity in closest_words:
        print(f"{word} (cosine similarity: {similarity:.4f})")

Top 7 words similar to 'robot':
grape (cosine similarity: 0.2172)
apple (cosine similarity: 0.1565)
peach (cosine similarity: 0.1386)
berry (cosine similarity: 0.1303)
mango (cosine similarity: 0.0714)
cherry (cosine similarity: 0.0713)
orange (cosine similarity: 0.0392)
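
The neighbors above mix fruits and machines because the embedding layer is untrained: `nn.Embedding` starts from random weights, so the similarities carry no meaning yet. A small sketch of that effect follows; the helper `nearest_for_seed` is hypothetical and rebuilds a fresh layer for each seed, so the ranking can be compared across initializations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = ["apple", "banana", "orange", "pear", "peach", "mango",
         "grape", "cherry", "berry", "melon", "robot", "machine"]

def nearest_for_seed(seed, word, top_k=3):
    """Build an untrained embedding layer with the given seed and rank neighbors of `word`."""
    torch.manual_seed(seed)
    layer = nn.Embedding(len(vocab), 50)
    query_index = vocab.index(word)
    with torch.no_grad():
        unit = F.normalize(layer.weight, p=2, dim=1)
        sims = unit @ unit[query_index]       # cosine similarity to every word
        sims[query_index] = float("-inf")     # exclude the query word itself
    top = torch.topk(sims, top_k).indices.tolist()
    return [vocab[i] for i in top]

# The same seed reproduces the same ranking; different seeds usually do not
for seed in (0, 1, 2):
    print(seed, nearest_for_seed(seed, "robot"))
```

Meaningful neighbors would require training the layer on text or loading pretrained vectors.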
