# Assignment: Word2Vec with a Larger Corpus

Your task is to repeat the Word2Vec model training process on a larger corpus of your choice.

1.  **Find a Corpus**: Find a text file (`.txt`) to use as your corpus. You can use sources like [Project Gutenberg](https://www.gutenberg.org/) to find books in plain text format. Download a book and save it in a `Resources` folder.
2.  **Load and Preprocess**: Load the text data and preprocess it as in the examples above (e.g., tokenization, lowercasing).
3.  **Train the Model**: Train a `Word2Vec` model on your corpus. You might need to experiment with the model parameters (`vector_size`, `window`, `min_count`, etc.) to get good results.
4.  **Explore the Embeddings**:
    *   Find the most similar words for a few words in your vocabulary.
    *   Perform some word analogies (e.g., "king" - "man" + "woman" = "queen").
    *   Implement a function to find the most similar document (sentence) in your corpus for a given query sentence. Test it with a few queries.
5.  **Reflect**: Briefly describe your findings. Are the results better or worse than those from the small example? What did you learn?
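
The analogy in step 4 comes down to vector arithmetic followed by a nearest-neighbour search. The sketch below shows the idea on hand-picked toy vectors; `toy_vectors`, `cosine`, and `analogy` are hypothetical names introduced for illustration, not part of gensim, and the numbers are not real embeddings.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
# Axes (loosely): royalty, maleness, femaleness.
toy_vectors = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "bread": np.array([0.1, 0.1, 0.1]),
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# "king" - "man" + "woman"
result = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(result)
```

With a trained gensim model, a comparable query is `model.wv.most_similar(positive=['king', 'woman'], negative=['man'])`. Every word in the query must be in the trained vocabulary, so on a small corpus it is worth checking `word in model.wv` first.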

# Assignment Solution

In [20]:
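from gensim.models import Word2Vec

# Load the corpus (one document per line)
with open('corpus.txt', 'r', encoding='utf-8') as f:
    corpus = f.readlines()

# Preprocess the data: lowercase each line and split it on whitespace
tokenized_corpus = [line.lower().split() for line in corpus]

# Train the Word2Vec model
# NOTE: the hyperparameter values below are illustrative assumptions,
# not the settings used to produce the output shown
model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1)

# Explore the embeddings
# Find the most similar words to 'king'
model.wv.most_similar('king')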

[('jungle.', 0.2498362958431244),
 ('technology.', 0.2403174191713333),
 ('developers', 0.2143532931804657),
 ('home', 0.2039964646100998),
 ('written', 0.18727393448352814),
 ('millions', 0.18656769394874573),
 ('wore', 0.1845983862876892),
 ('bread', 0.17786766588687897),
 ('fundamental', 0.1757362335920334),
 ('lion', 0.1660567820072174)]
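
The neighbours above include tokens such as `'jungle.'` and `'technology.'`, which suggests that punctuation stayed attached to words during tokenization, so `jungle` and `jungle.` would be learned as separate vocabulary entries. A minimal sketch of a tokenizer that strips punctuation, using only the standard library; `TOKEN_PATTERN`, `tokenize`, and the sample sentence are hypothetical and not taken from the corpus.

```python
import re

# Matches runs of letters, optionally with internal apostrophes (e.g. "don't").
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)*")

def tokenize(text):
    """Lowercase the text and return word tokens with punctuation removed."""
    return TOKEN_PATTERN.findall(text.lower())

# Hypothetical sample text, for illustration only
tokens = tokenize("The lion sleeps in the jungle. Developers love technology.")
print(tokens)
```

If this tokenizer is used for training, the same function should also be applied to query sentences so that both sides share one vocabulary. gensim's `gensim.utils.simple_preprocess` is an alternative that also lowercases and drops punctuation.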

In [None]:
import numpy as np

# Create Document Vectors
# We'll use a simple approach: average the word vectors for each document
def get_doc_vector(doc_tokens, model):
    # Keep only tokens that are in the trained vocabulary
    word_vectors = [model.wv[word] for word in doc_tokens if word in model.wv]
    # Fall back to a zero vector when no token is known to the model
    if not word_vectors:
        return np.zeros(model.vector_size)
    return np.mean(word_vectors, axis=0)


In [22]:
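# Create document vectors for the entire corpus by averaging word vectors.
# Each line is tokenized the same way as the query (lowercase, whitespace split).
doc_vectors = np.array([get_doc_vector(doc.lower().split(), model) for doc in corpus])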

In [24]:
from sklearn.metrics.pairwise import cosine_similarity

# Define a query sentence
query = "a rocket to the moon"

# Tokenize the query the same way as the corpus, then average its word vectors
query_tokens = query.lower().split()
query_vector = get_doc_vector(query_tokens, model)

# cosine_similarity expects 2D arrays, so we reshape the query vector to (1, vector_size).
# The result has shape (1, n_documents), so [0] selects its single row.
similarities = cosine_similarity(query_vector.reshape(1, -1), doc_vectors)[0]

# Find the most similar document. The query is not part of the corpus,
# so the top-scoring index needs no exclusion.
most_similar_idx = np.argmax(similarities)

print(f"Query: '{query}'")
print(f"Most Similar Document: '{corpus[most_similar_idx].strip()}'")
print(f"Similarity Score: {similarities[most_similar_idx]:.4f}")

Query: 'a rocket to the moon'
Most Similar Document: 'A river flows from the mountains to the ocean.'
Similarity Score: 0.6263
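
A single `argmax` hides how close the runners-up are, which matters when judging a match like the one above. The sketch below ranks the top `k` documents by cosine similarity using NumPy only. `top_k_similar` and the toy vectors are hypothetical names and values introduced for illustration. The zero-vector guard reflects the fact that a document with no in-vocabulary words is represented as all zeros by the averaging approach.

```python
import numpy as np

def top_k_similar(query_vector, doc_vectors, k=3):
    """Return (index, cosine similarity) pairs for the k most similar rows."""
    doc_norms = np.linalg.norm(doc_vectors, axis=1)
    query_norm = np.linalg.norm(query_vector)
    # Guard against all-zero vectors (e.g. documents with no in-vocabulary words):
    # their dot product is 0, so dividing by 1 yields a similarity of 0.
    denom = doc_norms * query_norm
    denom = np.where(denom == 0, 1.0, denom)
    sims = doc_vectors @ query_vector / denom
    # Sort descending and keep the first k indices
    order = np.argsort(sims)[::-1][:k]
    return [(int(i), float(sims[i])) for i in order]

# Hypothetical 2-dimensional document vectors, for illustration only
toy_doc_vectors = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.0, 0.0],  # a document with no known words
])
toy_query = np.array([1.0, 0.2])

ranking = top_k_similar(toy_query, toy_doc_vectors, k=2)
print(ranking)
```

With the notebook's variables, `top_k_similar(query_vector, doc_vectors)` returns indices that can be looked up in `corpus` to print the matching sentences alongside their scores.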
