# Nearest words via embeddings
Here is the prompt used to generate this code in ChatGPT:

> Write a python program using pytorch to create embeddings for a list of words. Then given the input of one of the words find the 5 closest words to it.

Question: are the embeddings normalized, i.e. do they have length 1?
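
One way to check the question above is to measure the L2 norm of each row directly. The sketch below is separate from the notebook's own cells: it builds a hypothetical layer of the same shape (10 words, 50 dimensions), and the seed is an assumption added only for repeatability. `nn.Embedding` initializes its weights from a standard normal distribution, so the raw rows should have norms of roughly `sqrt(50)` (about 7) and not 1. They only become unit length after an explicit `F.normalize` call.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # assumption: fixed seed, only so reruns print the same numbers

# Hypothetical layer with the same shape as the one in this notebook
layer = nn.Embedding(num_embeddings=10, embedding_dim=50)

with torch.no_grad():
    raw = layer.weight                      # shape [10, 50], drawn from N(0, 1)
    norms = raw.norm(p=2, dim=1)            # L2 length of each raw embedding
    unit = F.normalize(raw, p=2, dim=1)     # rescale each row to unit length
    unit_norms = unit.norm(p=2, dim=1)      # L2 length after normalization

print("Raw norms:       ", norms)
print("Normalized norms:", unit_norms)
```

So the answer appears to be no for the raw embeddings, and yes only for the copies that have been passed through `F.normalize`.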

## Exercise in Class
Add the capability to embed 5 phrases of up to 30 words each;
then, given a query, select the phrase most likely to answer it.
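
A minimal sketch of one possible approach to this exercise, not the intended solution: embed each phrase as the mean of its word embeddings, embed the query the same way, and pick the phrase with the highest cosine similarity. The example phrases, the `<unk>` token, and the names `tokenize`, `embed_text`, and `best_phrase` are all hypothetical. Because the embeddings are untrained, the score mostly reflects word overlap between query and phrase, not meaning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # assumption: fixed seed for repeatability

MAX_WORDS = 30  # phrases and queries are truncated to this many words

# Hypothetical example phrases (not from the original notebook)
phrases = [
    "apples are red or green fruit",
    "bananas are long and yellow",
    "grapes grow in bunches on vines",
    "melons are large and full of water",
    "cherries are small red stone fruit",
]

def tokenize(text):
    """Lowercase, split on whitespace, keep at most MAX_WORDS tokens."""
    return text.lower().split()[:MAX_WORDS]

# Build a vocabulary from the phrases; index 0 is reserved for unknown words
vocab = {"<unk>": 0}
for phrase in phrases:
    for token in tokenize(phrase):
        vocab.setdefault(token, len(vocab))

phrase_embedding_layer = nn.Embedding(len(vocab), 50)

def embed_text(text):
    """Mean of the word embeddings of the text (unknown words map to <unk>)."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text must contain at least one word")
    ids = torch.tensor([vocab.get(token, 0) for token in tokens])
    with torch.no_grad():
        return phrase_embedding_layer(ids).mean(dim=0)

# One row per phrase, shape [5, 50]
phrase_matrix = torch.stack([embed_text(p) for p in phrases])

def best_phrase(query):
    """Return the phrase with the highest cosine similarity to the query."""
    query_vec = F.normalize(embed_text(query), p=2, dim=0)
    sims = F.normalize(phrase_matrix, p=2, dim=1) @ query_vec
    best = int(torch.argmax(sims))
    return phrases[best], sims[best].item()

print(best_phrase("which fruit is long and yellow"))
```

Mean pooling ignores word order, which keeps the sketch short. A trained or pretrained embedding would be needed before the selected phrase could be expected to actually answer the query.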

In [38]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [34]:
# Define a list of words (vocabulary)
words = ["apple", "banana", "orange", "pear", "peach", 
         "mango", "grape", "cherry", "berry", "melon"]

# Create mappings from word to index and index to word
word2idx = {word: idx for idx, word in enumerate(words)}
idx2word = {idx: word for idx, word in enumerate(words)}

# Set embedding dimensions and create the embedding layer
embedding_dim = 50
embedding_layer = nn.Embedding(num_embeddings=len(words), embedding_dim=embedding_dim)

# Get embeddings for all words in the vocabulary
# (This will be a matrix of shape [vocab_size, embedding_dim])
embeddings = embedding_layer(torch.arange(len(words)))
# print the embedding for the first two words
print("Embeddings for the first two words:")
print(embeddings[:2])
# print the embedding for the last two words
print("Embeddings for the last two words:")
print(embeddings[-2:])
# print the embedding for the word "apple"
print("Embedding for the word 'apple':")
print(embeddings[word2idx["apple"]])



Embeddings for the first two words:
tensor([[-0.3587,  0.7732, -0.4567, -0.4940, -0.0791, -0.0285,  0.7708,  0.8347,
         -1.5237,  1.0784, -1.8059,  0.6040, -1.5777,  0.1740,  1.4177,  0.9748,
         -1.6658,  1.6476,  0.0658,  0.5767,  0.1060,  1.3074,  0.9831,  0.3646,
          0.4986, -1.0046,  0.9260,  0.4735, -0.0531, -1.0035, -0.8946, -0.2291,
         -0.5259, -0.3352,  0.6604,  1.0024, -0.1573, -2.2002,  0.3426, -0.3788,
          0.7011,  1.0688, -0.5525,  0.4026, -0.0403,  0.0140,  0.5893, -0.0812,
         -1.1798, -0.3190],
        [-0.3961,  1.5104, -0.8098,  0.9906,  0.9506, -1.0023, -0.8107, -1.0364,
          0.1094, -0.6319, -1.2142, -0.3318, -0.1119,  0.8757,  0.4456,  1.0081,
         -1.1028, -2.0423, -0.6187, -1.4848, -0.1817, -1.5587,  0.2343, -0.5715,
          0.6843,  0.4334, -0.0658,  0.3723, -0.0389,  0.5961, -0.7011,  3.7297,
         -0.7646,  1.0899,  1.5763,  0.2432, -2.0390, -0.5935, -0.9231, -0.7742,
         -0.3183,  0.0485,  0.0549,  0.7712, 

In [35]:
def find_closest(word, top_k=5):
    """Finds the top_k closest words to the input word using cosine similarity."""
    if word not in word2idx:
        print(f"Word '{word}' not found in vocabulary.")
        return []
    
    # Get the embedding for the input word
    word_index = word2idx[word]
    word_embedding = embeddings[word_index]
    
    # Normalize all embeddings to unit length
    normalized_embeddings = F.normalize(embeddings, p=2, dim=1)
    normalized_word_embedding = F.normalize(word_embedding, p=2, dim=0)
    
    # Compute cosine similarities: dot product between normalized vectors
    similarities = torch.matmul(normalized_embeddings, normalized_word_embedding)
    
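    # Get the indices of the top (top_k+1) similar words (including the word itself),
    # capped at the vocabulary size so torch.topk does not raise for a large top_k
    k = min(top_k + 1, len(words))
    top_values, top_indices = torch.topk(similarities, k)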
    
    results = []
    for value, idx in zip(top_values, top_indices):
        # Skip the word itself
        if idx.item() == word_index:
            continue
        results.append((idx2word[idx.item()], value.item()))
        if len(results) == top_k:
            break
    return results



In [36]:
# Example usage
input_word = "banana"
closest_words = find_closest(input_word)

if closest_words:
    print(f"Top {len(closest_words)} words similar to '{input_word}':")
    for word, similarity in closest_words:
        print(f"{word} (cosine similarity: {similarity:.4f})")

Top 5 words similar to 'banana':
grape (cosine similarity: 0.1776)
orange (cosine similarity: 0.0944)
cherry (cosine similarity: 0.0777)
berry (cosine similarity: 0.0662)
peach (cosine similarity: 0.0537)
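
The similarities above are all close to zero because the embedding layer has not been trained: its rows are random vectors, and random vectors in 50 dimensions tend to be nearly orthogonal. The sketch below illustrates this with a hypothetical layer of the same shape and an assumed fixed seed, by computing the full pairwise cosine similarity matrix. The diagonal (each word against itself) is 1, while the off-diagonal entries should stay small. The exact numbers differ from the output above, since the notebook did not fix a seed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # assumption: fixed seed so the numbers are repeatable

# Hypothetical untrained layer with the same shape as in this notebook
layer = nn.Embedding(num_embeddings=10, embedding_dim=50)

with torch.no_grad():
    unit = F.normalize(layer.weight, p=2, dim=1)  # unit-length rows
    sim = unit @ unit.T                           # [10, 10] cosine similarities

# Select every entry except the diagonal (word compared with itself)
off_diag = sim[~torch.eye(10, dtype=torch.bool)]

print("Diagonal:", sim.diagonal())
print("Mean |cosine| between different words:", off_diag.abs().mean().item())
```

With random embeddings the ranking of "closest" words is therefore arbitrary; it only becomes meaningful once the embeddings have been trained on some task.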
