# 04. Word Embeddings: Word2Vec, GloVe, FastText

### What You'll Learn:
- Why embeddings are better than BoW/TF-IDF
- Word2Vec (Skip-gram, CBOW)
- GloVe
- FastText
- Finding similar words
- Word relationships

## Why Word Embeddings?

**Problem with BoW/TF-IDF**:
- Sparse vectors (mostly zeros)
- No semantic meaning
- 'king' and 'queen' have no relationship

**Solution: Dense Embeddings**:
- Dense vectors (small fixed size)
- Captures semantic meaning
- Related words have similar vectors
- Can do math: king - man + woman ≈ queen

## Word2Vec

### Concept:
- Creates dense word vectors (typically 100-300 dimensions)
- Captures semantic relationships
- Two architectures:
  - **CBOW**: Predict word from context
  - **Skip-gram**: Predict context from word

In [4]:
from gensim.models import Word2Vec
import numpy as np

sentences = [
    'machine learning is awesome',
    'deep learning with neural networks',
    'python for machine learning',
    'natural language processing is powerful'
]

# Create word tokens
tokenized = [s.split() for s in sentences]

print('='*60)
print('WORD2VEC EXAMPLE')
print('='*60)
print('Training sentences:', tokenized)

# Train Word2Vec model
model = Word2Vec(tokenized, vector_size=100, window=5, min_count=1, workers=4)
print('\nModel trained!')
print(f'Vocabulary size: {len(model.wv)}')

# Get word vector
print('\nWord vector for "learning":')
print(model.wv['learning'][:10], '...')  # Show first 10 dimensions
print(f'Vector shape: {model.wv["learning"].shape}')

# Find similar words
print('\nWords similar to "learning":')
similar = model.wv.most_similar('learning', topn=3)
for word, score in similar:
    print(f'  {word}: {score:.3f}')

WORD2VEC EXAMPLE
Training sentences: [['machine', 'learning', 'is', 'awesome'], ['deep', 'learning', 'with', 'neural', 'networks'], ['python', 'for', 'machine', 'learning'], ['natural', 'language', 'processing', 'is', 'powerful']]

Model trained!
Vocabulary size: 14

Word vector for "learning":
[-0.00053623  0.00023643  0.00510335  0.00900927 -0.00930295 -0.00711681
  0.00645887  0.00897299 -0.00501543 -0.00376337] ...
Vector shape: (100,)

Words similar to "learning":
  networks: 0.216
  deep: 0.093
  for: 0.093


## Word Arithmetic

Word2Vec captures relationships:
- king - man + woman ≈ queen
- france - paris + london ≈ england

In [5]:
# Word arithmetic
print('\nWORD ARITHMETIC EXAMPLE:')
try:
    result = model.wv.most_similar(positive=['learning', 'python'], topn=3)
    print('Words related to (learning + python):')
    for word, score in result:
        print(f'  {word}: {score:.3f}')
except:
    print('Need more training data for arithmetic operations')


WORD ARITHMETIC EXAMPLE:
Words related to (learning + python):
  networks: 0.182
  machine: 0.102
  with: 0.082


## GloVe vs Word2Vec

| Feature | Word2Vec | GloVe |
|---------|----------|-------|
| Speed | Fast | Slower |
| Quality | Good | Better |
| Training | Neural Network | Matrix factorization |
| Best for | Large corpora | Balanced |

(Pre-trained models recommended for GloVe)

## FastText

**Advantages over Word2Vec**:
- Handles out-of-vocabulary words
- Uses character n-grams
- Better for morphologically rich languages

In [7]:
from gensim.models import FastText

# Train FastText model
#ft_model = FastText(tokenized, vector_size=100, window=5, workers=4)
ft_model = FastText(
    sentences=sentences,
    vector_size=100,
    window=5,
    min_count=1,
    sg=1
)

print('\nFASTTEXT EXAMPLE')
print(f'Vocabulary size: {len(ft_model.wv)}')

# FastText can handle unknown words
print('\nVector for "learning":', ft_model.wv['learning'][:5], '...')

# Can get vectors for words not in training data
try:
    print('Vector for "xyz_unknown_word":', ft_model.wv['xyz_unknown_word'][:5], '...')
except:
    print('(Would work with more training data)')


FASTTEXT EXAMPLE
Vocabulary size: 21

Vector for "learning": [ 1.1518626e-03 -6.4418455e-05 -1.1798259e-03  6.3712716e-05
 -1.6880927e-03] ...
Vector for "xyz_unknown_word": [ 0.00037216 -0.00091789  0.00090347  0.00019503  0.00145477] ...
