# 🔤 Homework 4: Word Embeddings with Word2Vec
**MIS 769 - Big Data Analytics for Business | Spring 2026**

**Points:** 20 | **Due:** See WebCampus for deadline

**Author:** Richard Young, Ph.D. | UNLV Lee Business School

**Compute:** CPU (free tier)

---

## What You'll Learn

1. Understand how words become numbers (and why it matters)
2. Train your own Word2Vec model
3. Document where embeddings succeed AND where they fail
4. Create business-relevant analogies

---

![Word embeddings](word_embeddings.svg)

## Part 1: Setup and Data Loading (3 points)

In [None]:
!pip install gensim datasets scikit-learn matplotlib -q

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
import re
import random

from gensim.models import Word2Vec
import gensim.downloader as api
from sklearn.manifold import TSNE

random.seed(103)
np.random.seed(103)

print("✅ Libraries loaded!")

In [None]:
from datasets import load_dataset

# Need lots of text for good embeddings
dataset = load_dataset("stanfordnlp/imdb", split="train")
texts = dataset['text']

print(f"✅ Loaded {len(texts):,} documents")

## Part 2: Prepare Text for Word2Vec (3 points)

In [None]:
def preprocess_for_word2vec(text):
    """Clean and tokenize text for Word2Vec."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)
    tokens = text.split()
    tokens = [t for t in tokens if len(t) > 2]
    return tokens

print("Preprocessing texts...")
corpus = [preprocess_for_word2vec(text) for text in texts]
corpus = [doc for doc in corpus if len(doc) > 5]

print(f"✅ Prepared {len(corpus):,} documents")
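
To see what the preprocessing step does to a single review, the sketch below applies the same cleaning rules to a made-up sentence. The sample text is an assumption for illustration; the function body repeats the rules from the cell above so the snippet runs on its own.

```python
import re

def preprocess_for_word2vec(text):
    """Same rules as the cell above: lowercase, letters only, drop short tokens."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)   # digits, punctuation, HTML brackets -> space
    return [t for t in text.split() if len(t) > 2]

# Made-up sample (not taken from the IMDB dataset)
sample = "I didn't LOVE it, but the 2nd act was OK!<br />Great cast."
tokens = preprocess_for_word2vec(sample)
print(tokens)
# ['didn', 'love', 'but', 'the', 'act', 'was', 'great', 'cast']
```

Notice that "didn't" is reduced to "didn" and short words such as "ok" and "it" disappear. Keep this in mind when you hunt for failures in Part 5.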

## Part 3: Train Word2Vec Model (4 points)

In [None]:
print("🔧 Training Word2Vec model (1-2 minutes)...")

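model = Word2Vec(
    sentences=corpus,   # list of token lists from Part 2
    vector_size=100,    # dimensions per word vector
    window=5,           # max distance between target word and context word
    min_count=10,       # ignore words seen fewer than 10 times
    workers=4,          # training threads; with >1, runs can differ slightly despite the seed
    sg=1,               # 1 = skip-gram, 0 = CBOW
    epochs=15,          # passes over the corpus
    seed=103
)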

print("✅ Model trained!")
print(f"   Vocabulary: {len(model.wv):,} words")
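
The `min_count=10` setting drops any word that appears fewer than 10 times in the corpus, which is why the vocabulary is smaller than the number of distinct tokens. A rough sketch of that filter with `collections.Counter` follows. The tiny corpus and the threshold of 2 are assumptions for illustration, and this is not gensim's internal code.

```python
from collections import Counter

# Made-up toy corpus: each document is already a list of tokens
toy_corpus = [
    ["great", "movie", "great", "cast"],
    ["boring", "movie", "weak", "plot"],
    ["great", "plot", "twist"],
]
toy_min_count = 2  # hypothetical threshold, much lower than the 10 used above

# Count every token across all documents
counts = Counter(tok for doc in toy_corpus for tok in doc)

kept = sorted(w for w, c in counts.items() if c >= toy_min_count)
dropped = sorted(w for w, c in counts.items() if c < toy_min_count)

print("kept:   ", kept)      # ['great', 'movie', 'plot']
print("dropped:", dropped)   # ['boring', 'cast', 'twist', 'weak']
```

Rare words get dropped because the model sees too few examples to learn a reliable vector for them.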

## Part 4: Explore Word Similarities (4 points)

In [None]:
test_words = ["good", "bad", "movie", "actor", "director"]

print("📊 WORD SIMILARITIES")
print("=" * 60)

for word in test_words:
    if word in model.wv:
        similar = model.wv.most_similar(word, topn=5)
        print(f"\n'{word}' is similar to:")
        for similar_word, score in similar:
            print(f"   {similar_word:15} {score:.3f}")
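
The scores printed by `most_similar` are cosine similarities: values near 1 mean two vectors point in nearly the same direction, values near 0 mean they are unrelated. The sketch below computes cosine similarity by hand on made-up 3-dimensional vectors (the real model uses 100 dimensions, and these numbers are not model output).

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, chosen by hand for illustration
toy = {
    "good":  [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "plot":  [0.1, 0.9, -0.2],
}

sim_good_great = cosine_similarity(toy["good"], toy["great"])
sim_good_plot = cosine_similarity(toy["good"], toy["plot"])

print(f"good vs great: {sim_good_great:.3f}")
print(f"good vs plot:  {sim_good_plot:.3f}")
```

In this toy setup "good" and "great" point in almost the same direction, so their score is much higher than the score for "good" and "plot".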

## Part 5: Document Failures (4 points)

Embeddings aren't magic - find where they fail!

In [None]:
# Antonym problem: antonyms often appear similar!
print("🔍 ANTONYM TEST")
print("-" * 40)

antonym_pairs = [
    ("good", "bad"),
    ("love", "hate"),
    ("best", "worst"),
]

for word1, word2 in antonym_pairs:
    if word1 in model.wv and word2 in model.wv:
        sim = model.wv.similarity(word1, word2)
        problem = "⚠️ TOO SIMILAR!" if sim > 0.3 else "✓ OK"
        print(f"{word1:10} ↔ {word2:10} : {sim:.3f} {problem}")
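
One likely reason antonyms score as similar: Word2Vec learns from the words that surround a target word, and "good" and "bad" tend to show up in the same kinds of sentences. The sketch below counts context words within a window of 2 over a made-up corpus. The sentences and the `context_counts` helper are assumptions for illustration, not part of gensim.

```python
from collections import Counter

# Made-up sentences where "good" and "bad" fill the same slot
toy_sentences = [
    "the acting was good overall".split(),
    "the acting was bad overall".split(),
    "a really good movie".split(),
    "a really bad movie".split(),
]

def context_counts(target, sentences, window=2):
    """Count words within `window` positions of each occurrence of `target`."""
    counts = Counter()
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok == target:
                lo = max(0, i - window)
                hi = min(len(sent), i + window + 1)
                counts.update(sent[lo:i] + sent[i + 1:hi])
    return counts

good_ctx = context_counts("good", toy_sentences)
bad_ctx = context_counts("bad", toy_sentences)

print("good:", dict(good_ctx))
print("bad: ", dict(bad_ctx))
print("identical contexts:", good_ctx == bad_ctx)
```

When two words share nearly all of their contexts, a model trained only on context has little signal to push their vectors apart, even if their meanings are opposite.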

In [None]:
# YOUR FAILURE HUNT: Find at least 1 case where embeddings give wrong results
# Try words you expect to be similar but aren't, or different but are similar

# YOUR CODE HERE:


## Part 6: Business Analogies (3 points)

In [None]:
def test_analogy(word1, word2, word3, model):
    """Test: word1 - word2 + word3 = ?"""
    try:
        result = model.wv.most_similar(
            positive=[word1, word3],
            negative=[word2],
            topn=3
        )
        return result
    except KeyError as e:
        return f"Word not in vocabulary: {e}"

# Test some analogies
analogies = [
    ("better", "good", "bad"),  # better - good + bad = worse?
]

for word1, word2, word3 in analogies:
    result = test_analogy(word1, word2, word3, model)
    print(f"{word1} - {word2} + {word3} = ?")
    if isinstance(result, list):
        for word, score in result:
            print(f"   → {word} ({score:.3f})")
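
Conceptually, an analogy query adds the `positive` vectors, subtracts the `negative` ones, and returns the words closest to the result, skipping the input words. The sketch below does this on hand-made 2-dimensional vectors. The vectors and the `toy_analogy` helper are assumptions for illustration; gensim's own implementation also normalizes vectors before combining them.

```python
import math

# Hypothetical vectors: axis 0 = sentiment, axis 1 = comparative form
toy_vectors = {
    "good":   [1.0, 0.0],
    "better": [1.0, 1.0],
    "bad":    [-1.0, 0.0],
    "worse":  [-1.0, 1.0],
    "movie":  [0.0, -1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def toy_analogy(word1, word2, word3, vectors):
    """Return the word closest to word1 - word2 + word3, excluding the inputs."""
    target = [a - b + c for a, b, c in
              zip(vectors[word1], vectors[word2], vectors[word3])]
    candidates = {w: v for w, v in vectors.items()
                  if w not in (word1, word2, word3)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

answer = toy_analogy("better", "good", "bad", toy_vectors)
print(f"better - good + bad = {answer}")   # worse
```

Real embeddings are far noisier than this toy example, which is one reason many analogies fail on a model trained on a single dataset.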

In [None]:
# YOUR ANALOGIES: Create at least 3 business-relevant analogies
my_analogies = [
    # ("word1", "word2", "word3"),
]

for word1, word2, word3 in my_analogies:
    result = test_analogy(word1, word2, word3, model)
    print(f"\n{word1} - {word2} + {word3} = ?")
    if isinstance(result, list):
        for word, score in result:
            print(f"   → {word} ({score:.3f})")

## Part 7: Compare to Pre-trained (2 points)

In [None]:
print("📥 Loading pre-trained GloVe...")
glove = api.load("glove-twitter-50")
print(f"✅ Loaded {len(glove):,} words")

# Compare
compare_word = "good"
print(f"\nSimilar to '{compare_word}':")
print(f"Your model:  {[w for w, _ in model.wv.most_similar(compare_word, topn=5)]}")
print(f"Pre-trained: {[w for w, _ in glove.most_similar(compare_word, topn=5)]}")
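
To put a number on how much the two neighbor lists agree, one option is Jaccard overlap: shared words divided by all distinct words across both lists. The sketch below uses placeholder lists, which are NOT real model output, and a hypothetical helper name.

```python
def jaccard_overlap(neighbors_a, neighbors_b):
    """Shared items divided by total distinct items (0.0 to 1.0)."""
    a, b = set(neighbors_a), set(neighbors_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Placeholder neighbor lists for illustration only (not real model output)
mine = ["great", "decent", "fine", "bad", "nice"]
pretrained = ["great", "nice", "well", "better", "too"]

overlap = jaccard_overlap(mine, pretrained)
print(f"Jaccard overlap: {overlap:.2f}")   # 0.25
```

In the notebook you could pass in the word lists from `model.wv.most_similar(...)` and `glove.most_similar(...)` instead of the placeholders. A low overlap is expected, since the two models were trained on very different text (movie reviews vs. tweets).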

---

## Questions to Answer

**Q1:** Which similarities made sense? Which surprised you?

*Your answer:*

**Q2:** Document an embedding failure you found.

*Your answer:*

**Q3:** Which analogies worked? Why do some fail?

*Your answer:*

**Q4:** How do your embeddings differ from pre-trained?

*Your answer:*

---

## Submission Checklist

| Item | Points | Done? |
|------|--------|-------|
| Parts 1-2: Data loaded and preprocessed | 3 | ☐ |
| Part 3: Word2Vec trained | 4 | ☐ |
| Part 4: Similarities explored | 4 | ☐ |
| Part 5: 1+ embedding failure documented | 4 | ☐ |
| Part 6: 3 business analogies | 3 | ☐ |
| Part 7: Pre-trained comparison | 2 | ☐ |
| **Total** | **20** | |