# Nearest words via embeddings
Here is the prompt used to generate this code in ChatGPT:
Write a Python program using PyTorch to create embeddings for a list of words. Then, given one of the words as input, find the 5 closest words to it.

Question: Are the embeddings normalized, i.e., do they have length 1?
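
One way to explore this question is to compute the L2 norm of each embedding vector. The sketch below builds its own small `nn.Embedding` as a stand-in for the notebook's layer; the names `layer`, `vectors`, and `norms` are illustrative assumptions. Because `nn.Embedding` draws its initial weights from a standard normal distribution, the raw norms should land near the square root of the embedding dimension (about 7 for 50 dimensions) rather than 1, while `F.normalize` rescales each row to unit length.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # fixed seed only so the check is repeatable

# Hypothetical stand-in for the notebook's embedding layer: 10 words, 50 dims
layer = nn.Embedding(num_embeddings=10, embedding_dim=50)
vectors = layer.weight.detach()

# L2 norm of each embedding vector (one value per word)
norms = vectors.norm(p=2, dim=1)
print("Raw norms:", norms)

# After F.normalize, every row has length 1 (up to floating-point error)
unit_vectors = F.normalize(vectors, p=2, dim=1)
unit_norms = unit_vectors.norm(p=2, dim=1)
print("Normalized norms:", unit_norms)
```

This is why `find_closest` normalizes before taking dot products: without that step, the dot product would mix direction and vector length.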

## Exercise in Class
Add the capability to embed 5 phrases of up to 30 words each.
Then, given a query, select the phrase most likely to answer that query.
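
A minimal sketch of one possible approach to this exercise, not a reference solution: represent each phrase as the mean of its word vectors and pick the phrase with the highest cosine similarity to the query. The example phrases, the whitespace tokenizer, and the names `phrase_word2idx`, `phrase_embedding`, `embed_text`, and `best_phrase` are all assumptions made for illustration. Since the embedding layer here is untrained, the score mostly reflects which words the query shares with a phrase, not meaning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

MAX_WORDS = 30  # the exercise's limit on phrase length

# Hypothetical example phrases; any five phrases of up to 30 words would do
phrases = [
    "the apple is a sweet red fruit that grows on trees",
    "bananas are yellow and rich in potassium",
    "paris is the capital city of france",
    "python is a popular programming language",
    "the moon orbits the earth every month",
]

def tokenize(text):
    """Lowercase, split on whitespace, and keep at most MAX_WORDS tokens."""
    return text.lower().split()[:MAX_WORDS]

# Build a vocabulary from the phrases themselves
phrase_vocab = sorted({tok for p in phrases for tok in tokenize(p)})
phrase_word2idx = {w: i for i, w in enumerate(phrase_vocab)}
phrase_embedding = nn.Embedding(len(phrase_vocab), 50)

def embed_text(text):
    """Mean of the word vectors; words outside the vocabulary are skipped."""
    ids = [phrase_word2idx[t] for t in tokenize(text) if t in phrase_word2idx]
    if not ids:
        return torch.zeros(phrase_embedding.embedding_dim)
    with torch.no_grad():
        return phrase_embedding(torch.tensor(ids)).mean(dim=0)

phrase_matrix = torch.stack([embed_text(p) for p in phrases])

def best_phrase(query):
    """Return the phrase with the highest cosine similarity to the query."""
    q = embed_text(query)
    sims = F.cosine_similarity(phrase_matrix, q.unsqueeze(0), dim=1)
    return phrases[int(sims.argmax())], sims

answer, sims = best_phrase("what is the capital of france")
print(answer)
print(sims)
```

Mean pooling ignores word order, and a query with no known words maps to the zero vector, so a trained or pretrained embedding would be needed for answers that depend on meaning.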

In [38]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [None]:
# Define a list of words (vocabulary)
words = ["apple", "banana", "orange", "pear", "peach", 
         "mango", "grape", "cherry", "berry", "melon"]

# Create mappings from word to index and index to word
word2idx = {word: idx for idx, word in enumerate(words)}
idx2word = {idx: word for idx, word in enumerate(words)}

# Set embedding dimensions and create the embedding layer
embedding_dim = 50
embedding_layer = nn.Embedding(num_embeddings=len(words), embedding_dim=embedding_dim)

# Get embeddings for all words in the vocabulary
# (This will be a matrix of shape [vocab_size, embedding_dim])
embeddings = embedding_layer(torch.arange(len(words)))
# print the embedding for the first two words
print("Embeddings for the first two words:")
print(embeddings[:2])
# print the embedding for the last two words
print("Embeddings for the last two words:")
print(embeddings[-2:])
# print the embedding for the word "apple"
print("Embedding for the word 'apple':")
print(embeddings[word2idx["apple"]])



Embeddings for the first two words:
tensor([[-0.6265, -0.8561,  0.6741,  1.9465,  1.0323, -1.2304, -0.8675,  1.3524,
          2.0767, -0.5330, -0.0208,  0.7816, -0.2869,  1.6492, -0.6981, -0.7236,
          0.5189, -0.0886, -0.4306,  2.0238, -0.0951,  0.0615,  1.1272, -0.7233,
          0.5621, -1.7471,  1.5854,  0.8586, -0.4335, -1.3795,  0.7495, -1.5988,
         -0.7011, -0.2762, -0.0536, -0.9139, -0.2065,  0.8348,  0.2882,  0.5704,
          1.2892,  0.3607, -0.1623,  0.2154, -0.0061, -0.5041,  0.7950,  0.7241,
          0.3391, -0.1953],
        [-2.4873, -0.8843,  1.1065,  0.9266,  1.1889, -0.7478, -0.7300,  1.2461,
          0.5420,  1.2264, -0.8877,  0.6605, -1.0670,  1.8461,  0.0741, -0.8376,
          0.0667, -1.8449,  0.1929,  0.6504,  1.2975, -0.9518, -0.2281,  1.3243,
         -1.4686,  1.1725, -1.0446, -1.3329, -1.8933,  1.5292, -0.2723,  0.1359,
         -1.7234, -1.0254, -0.3054, -1.7177, -1.4470,  0.4552,  0.2994, -0.2911,
          0.1374,  1.8789, -0.2257, -0.4423, 

In [35]:
def find_closest(word, top_k=5):
    """Finds the top_k closest words to the input word using cosine similarity."""
    if word not in word2idx:
        print(f"Word '{word}' not found in vocabulary.")
        return []
    
    # Get the embedding for the input word
    word_index = word2idx[word]
    word_embedding = embeddings[word_index]
    
    # Normalize all embeddings to unit length
    normalized_embeddings = F.normalize(embeddings, p=2, dim=1)
    normalized_word_embedding = F.normalize(word_embedding, p=2, dim=0)
    
    # Compute cosine similarities: dot product between normalized vectors
    similarities = torch.matmul(normalized_embeddings, normalized_word_embedding)
    
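    # Get the indices of the top (top_k+1) similar words (including the word itself),
    # capped at the vocabulary size so torch.topk does not fail for a large top_k
    k = min(top_k + 1, len(words))
    top_values, top_indices = torch.topk(similarities, k)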
    
    results = []
    for value, idx in zip(top_values, top_indices):
        # Skip the word itself
        if idx.item() == word_index:
            continue
        results.append((idx2word[idx.item()], value.item()))
        if len(results) == top_k:
            break
    return results
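
The normalize-then-dot-product step in `find_closest` computes cosine similarity by hand. As a hedged cross-check, the sketch below compares that approach with PyTorch's built-in `F.cosine_similarity` on a random matrix; `vectors` and `query` are illustrative stand-ins for the notebook's `embeddings` and `word_embedding`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vectors = torch.randn(10, 50)  # stand-in for the embedding matrix
query = vectors[1]             # stand-in for one word's embedding

# Manual route: normalize to unit length, then take dot products
manual = torch.matmul(F.normalize(vectors, p=2, dim=1),
                      F.normalize(query, p=2, dim=0))

# Built-in route: cosine similarity of each row against the query
builtin = F.cosine_similarity(vectors, query.unsqueeze(0), dim=1)

print(torch.allclose(manual, builtin, atol=1e-6))
```

Either route should give the same ranking; the manual version makes the normalization explicit, which ties back to the question about unit-length embeddings.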



In [36]:
# Example usage
input_word = "banana"
closest_words = find_closest(input_word)

if closest_words:
    print(f"Top {len(closest_words)} words similar to '{input_word}':")
    for word, similarity in closest_words:
        print(f"{word} (cosine similarity: {similarity:.4f})")

Top 5 words similar to 'banana':
grape (cosine similarity: 0.1776)
orange (cosine similarity: 0.0944)
cherry (cosine similarity: 0.0777)
berry (cosine similarity: 0.0662)
peach (cosine similarity: 0.0537)
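
The similarities above are all close to zero because the embedding layer is untrained: `nn.Embedding` starts from random weights, so the neighbors carry no semantic meaning and change from run to run. The sketch below illustrates this by re-creating the layer under several seeds; the helper name `neighbors_for_seed` is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

words = ["apple", "banana", "orange", "pear", "peach",
         "mango", "grape", "cherry", "berry", "melon"]

def neighbors_for_seed(seed, word="banana", top_k=5):
    """Re-create an untrained embedding layer and list the nearest words."""
    torch.manual_seed(seed)
    layer = nn.Embedding(len(words), 50)
    vecs = F.normalize(layer.weight.detach(), p=2, dim=1)
    target = words.index(word)
    sims = vecs @ vecs[target]
    sims[target] = float("-inf")  # exclude the word itself
    top = torch.topk(sims, top_k).indices.tolist()
    return [words[i] for i in top]

# Each seed gives a different random initialization of the layer
for seed in (0, 1, 2):
    print(seed, neighbors_for_seed(seed))
```

Meaningful neighbors would require training the layer on text or loading pretrained vectors.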
