### Word2vec vs Doc2vec

Word2Vec is a neural network-based model that learns dense vector representations (embeddings) of words based on their context in a corpus. It captures semantic relationships—so words like “king” and “queen” end up close in vector space.


There are two main architectures:

- **CBOW (Continuous Bag of Words):** Predicts a target word from its surrounding context words.
  Example: given “the cat ___ on the mat”, predict “sat”.

- **Skip-Gram:** Does the reverse: predicts the context words from a target word.
  Example: given “sat”, predict “the”, “cat”, “on”, “the”, “mat”.
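To make the difference concrete, here is a minimal pure-Python sketch of how the two architectures turn one sentence into training examples. The helper names `cbow_examples` and `skipgram_examples` are made up for illustration and are not part of gensim. Real implementations add details this sketch leaves out, such as subsampling of frequent words and randomly shrunken windows.

```python
def cbow_examples(tokens, window):
    """Return (context_words, target) pairs: one example per position."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            examples.append((context, target))
    return examples


def skipgram_examples(tokens, window):
    """Return (target, context_word) pairs: one example per context word."""
    examples = []
    for context, target in cbow_examples(tokens, window):
        for word in context:
            examples.append((target, word))
    return examples


tokens = ["the", "cat", "sat", "on", "the", "mat"]
cbow = cbow_examples(tokens, window=2)
skipgram = skipgram_examples(tokens, window=2)

print(cbow[2])                   # (['the', 'cat', 'on', 'the'], 'sat')
print(len(cbow), len(skipgram))  # 6 18
```

The same sentence yields 6 CBOW examples but 18 Skip-Gram pairs, which is why Skip-Gram does more work per sentence.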

Both use a shallow neural network (a single hidden layer) and are trained efficiently with **negative sampling** or **hierarchical softmax**, which avoid computing a full softmax over the entire vocabulary at every step.
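As a rough illustration of negative sampling, the sketch below computes the loss for a single (target, context) pair plus a few sampled "noise" words. Instead of scoring every word in the vocabulary, the model only has to pull the true pair together and push the sampled negatives apart. The function names and the toy 2-dimensional vectors are assumptions for this example; gensim's optimized implementation is not shown here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """Loss for one observed pair and k sampled negative words.

    loss = -log sigmoid(u_context . v_target)
           - sum over negatives of log sigmoid(-u_negative . v_target)
    """
    loss = -math.log(sigmoid(dot(context_vec, target_vec)))
    for neg_vec in negative_vecs:
        loss -= math.log(sigmoid(-dot(neg_vec, target_vec)))
    return loss


target = [1.0, 0.0]

# True context aligned with the target, negative pointing away: low loss.
good_loss = negative_sampling_loss(target, [2.0, 0.0], [[-2.0, 0.0]])

# True context pointing away, negative aligned with the target: high loss.
bad_loss = negative_sampling_loss(target, [-2.0, 0.0], [[2.0, 0.0]])

print(good_loss < bad_loss)  # True
```

Training lowers this loss by gradient descent, so only the vectors of the target, the context word, and the few sampled negatives are updated per example.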


```python
from gensim.models import Word2Vec

# Sample corpus: each sentence is a list of lowercase tokens
sentences = [
    ["i", "love", "data", "science"],
    ["data", "science", "is", "fun"],
    ["i", "enjoy", "machine", "learning"],
    ["deep", "learning", "is", "a", "part", "of", "data", "science"],
]

# Train Word2Vec with the Skip-Gram architecture
model_skipgram = Word2Vec(
    sentences,
    vector_size=50,  # dimensionality of each word vector
    window=2,        # max distance between the target and a context word
    min_count=1,     # keep every word, even those that appear only once
    sg=1,            # sg=1 for Skip-Gram, sg=0 for CBOW
)

# Access the learned vector for a word
print("Vector for 'data':\n", model_skipgram.wv['data'])

# Find the words whose vectors are closest to 'science'
print("\nWords similar to 'science':", model_skipgram.wv.most_similar('science'))
```



```text
Vector for 'data':
 [-0.01631741  0.0089938  -0.0082783   0.00164443  0.0169952  -0.0089275
  0.00903588 -0.01357163 -0.00709614  0.01879338 -0.00315135  0.00064402
 -0.00827641 -0.01536737 -0.00301485  0.00494066 -0.00177148  0.01106917
 -0.00549089  0.00452204  0.01091461  0.0166924  -0.00290378 -0.01841146
  0.00873711  0.00114581  0.01488771 -0.00162303 -0.00527828 -0.01750591
 -0.00171363  0.00565091  0.01079952  0.01410548 -0.01140531  0.00371814
  0.01218113 -0.00959919 -0.00621546  0.01359666  0.00326131  0.00038279
  0.00695176  0.00043307  0.01924443  0.01012374 -0.01783155 -0.01408546
  0.00180391  0.0127816 ]

Words similar to 'science': [('fun', 0.16704080998897552), ('a', 0.13203266263008118), ('learning', 0.1267007291316986), ('machine', 0.09984554350376129), ('data', 0.04237872734665871), ('love', 0.040677644312381744), ('of', 0.012417477555572987), ('enjoy', -0.012591077946126461), ('is', -0.014479597099125385), ('deep', -0.0560765340924263)]
```


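```python
# Train Word2Vec with the CBOW architecture
# (reuses `Word2Vec` and `sentences` from the Skip-Gram cell)
model_CBOW = Word2Vec(
    sentences,
    vector_size=50,  # dimensionality of each word vector
    window=2,        # max distance between the target and a context word
    min_count=1,     # keep every word, even those that appear only once
    sg=0,            # sg=1 for Skip-Gram, sg=0 for CBOW
)

# Access the learned vector for a word
print("Vector for 'data':\n", model_CBOW.wv['data'])

# Find the words whose vectors are closest to 'science'
print("\nWords similar to 'science':", model_CBOW.wv.most_similar('science'))
```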

```text
Vector for 'data':
 [-0.01631513  0.00899122 -0.00827763  0.00164529  0.01699386 -0.00892628
  0.00903487 -0.01357352 -0.0070978   0.01879499 -0.00315362  0.00064257
 -0.00827751 -0.01536535 -0.00301715  0.00494141 -0.00177357  0.01106989
 -0.00548864  0.00452071  0.01091368  0.01669276 -0.00290602 -0.01841386
  0.00873944  0.00114542  0.01488611 -0.00162427 -0.00527841 -0.01750569
 -0.0017135   0.00565139  0.01080079  0.01410554 -0.01140593  0.00371592
  0.01217804 -0.00959774 -0.00621279  0.01359432  0.00326069  0.00038156
  0.00695011  0.00043386  0.01924171  0.01012283 -0.01783208 -0.01408416
  0.00180482  0.0127821 ]

Words similar to 'science': [('fun', 0.16704080998897552), ('a', 0.13204070925712585), ('learning', 0.1267007291316986), ('machine', 0.09984554350376129), ('data', 0.04238428175449371), ('love', 0.040677644312381744), ('of', 0.01243622787296772), ('enjoy', -0.012591077946126461), ('is', -0.014479792676866055), ('deep', -0.0560765340924263)]
```
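The scores returned by `most_similar` are cosine similarities between word vectors. Here they are all close to 0, and the Skip-Gram and CBOW results are nearly identical. The likely reason is that a four-sentence corpus provides so few updates that both models stay close to the same random initialization. The sketch below shows the metric itself in plain Python. The helper name `cosine_similarity` and the toy vectors are illustrative, and the function assumes non-zero vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors (range -1 to 1)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite)
```

A score such as 0.167 for ('science', 'fun') therefore means the two vectors are almost orthogonal, so the ranking above should not be read as a meaningful semantic result.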


YouTube: https://www.youtube.com/watch?v=UqRCEmrv1gQ&ab_channel=TheSemicolon

### CBOW vs Skip-Gram

|Feature|CBOW|Skip-Gram|
|---------|---------|----------|
|Objective|Predict target word from context|Predict context words from target|
|Input|Surrounding words|Center word|
|Output|Center word|Surrounding words|
|Training Speed|Faster|Slower (more training examples)|
|Performance on Rare Words|Weaker|Stronger|
|Best for|Large datasets with frequent words|Small datasets or rare-word representations|

##### When to Use What:

- Use CBOW when:
    - You have a large corpus.
    - You care more about frequent words.
    - You want faster training.
      
- Use Skip-Gram when:
    - You want to capture rare word semantics.
    - You’re working with a smaller dataset.
    - You need finer-grained embeddings (each target-context pair gets its own update instead of being averaged).


##### ✅ Advantages & ❌ Disadvantages

CBOW

- ✅ Faster to train.
- ✅ Smooths over noisy data by averaging context.
- ❌ Doesn’t perform well with infrequent words.
- ❌ May lose nuance due to averaging.

Skip-Gram

- ✅ Better at capturing semantic relationships.
- ✅ Works well with rare words.
- ❌ Slower to train.
- ❌ Generates more training samples (one per context word), which raises compute cost.
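The "averaging" behind CBOW's smoothing, and its loss of nuance, can be sketched in a few lines. CBOW combines the context word vectors into a single input (a mean in this sketch), so the order of the context words is discarded. The helper `average_vectors` and the toy 2-dimensional vectors are invented for illustration only.

```python
def average_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Hypothetical toy embeddings, not learned from any corpus
toy_vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [-1.0, 0.0],
}

context_a = ["dog", "bites", "man"]
context_b = ["man", "bites", "dog"]

avg_a = average_vectors([toy_vectors[word] for word in context_a])
avg_b = average_vectors([toy_vectors[word] for word in context_b])

print(avg_a == avg_b)  # True: both orderings give the same CBOW input
```

Skip-Gram does not average: each (target, context) pair produces its own update, which is part of why it costs more to train but tends to do better on rare words.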


##### Note

If you're building a multi-agent RAG pipeline and want fast, general-purpose embeddings, CBOW might be enough. But if you're tuning for nuanced understanding, such as rare domain-specific terms, Skip-Gram is your friend.