<a href="https://colab.research.google.com/github/sateeshfrnd/Generative-AI/blob/main/notebooks/Word2Vec/CBOW_(Continuous_Bag_of_Words).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CBOW (Continuous Bag of Words)

CBOW (Continuous Bag of Words) is a Word2Vec training architecture that predicts a word from its surrounding context words.
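
To make the idea of "predicting a word from its context" concrete, the sketch below lists the (context, target) pairs that a CBOW-style model would see for one sentence with a window of 2. The helper `cbow_pairs` is a hypothetical name introduced here for illustration; it is not part of gensim, and real implementations add details (such as subsampling and negative sampling) that are left out.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for one tokenized sentence.

    Illustrative helper only (not a gensim function): for each position,
    the context is up to `window` words on each side of the target.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # words before the target
        right = tokens[i + 1:i + 1 + window]      # words after the target
        pairs.append((left + right, target))
    return pairs


sentence = ["she", "enjoys", "baking", "cookies"]
for context, target in cbow_pairs(sentence):
    print(context, "->", target)
```

Each printed line shows the context words on the left and the word the model is asked to predict on the right.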

## Import required libraries

In [1]:
from gensim.models import Word2Vec

In [2]:
# Sample corpus
sentences = [
    ["she", "enjoys", "baking", "cookies"],
    ["he", "loves", "drinking", "coffee"],
    ["they", "are", "enjoying", "baking", "bread", "together"],
    ["i", "like", "drinking", "tea"]
]

In [6]:
# Train a CBOW model
cbow_model = Word2Vec(
    sentences,            # Tokenized sentences from the corpus
    vector_size=50,       # Dimension of word vectors (50-dimensional space)
    window=2,             # Context window size (2 words on each side of the target word)
    min_count=1,          # Minimum frequency for a word to be included in the vocabulary
    sg=0                  # CBOW model (sg=0 means CBOW; sg=1 means Skip-Gram)
)


In [4]:
cbow_model.wv["baking"]

array([-0.01631914,  0.00899214, -0.00827197,  0.00165198,  0.01700068,
       -0.00892678,  0.00903681, -0.01357065, -0.00709898,  0.01879522,
       -0.00315654,  0.0006388 , -0.00828443, -0.01536287, -0.00301432,
        0.00494451, -0.00177102,  0.01107087, -0.00548726,  0.0045214 ,
        0.01091253,  0.01668911, -0.00290244, -0.01842042,  0.00874441,
        0.00114458,  0.01488109, -0.00162445, -0.00527704, -0.01750381,
       -0.00170972,  0.0056511 ,  0.01080317,  0.01410179, -0.01140448,
        0.00371835,  0.01218132, -0.0095951 , -0.00620931,  0.01359934,
        0.00326691,  0.00037636,  0.00694379,  0.00043393,  0.01923659,
        0.01012584, -0.01783104, -0.01408443,  0.00180271,  0.01278661],
      dtype=float32)

In [5]:
cbow_model.wv.most_similar("baking")

[('i', 0.12489530444145203),
 ('he', 0.0806213915348053),
 ('bread', 0.07402360439300537),
 ('drinking', 0.0424087829887867),
 ('together', 0.018300451338291168),
 ('enjoys', 0.011399010196328163),
 ('tea', 0.011335345916450024),
 ('they', 0.0013816740829497576),
 ('cookies', -0.012018062174320221),
 ('she', -0.03441700339317322)]

Observations:

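- Similarity with context words: "bread" and "cookies" both co-occur with "baking" in the corpus, yet neither is the most similar word. "bread" ranks third (about 0.07) and "cookies" has a slightly negative score (about -0.01). All scores are close to zero, so none of these matches is strong.

- Less expected results: Words like "i", "he", "enjoys", and "tea" appear in the list, some of them above the words that actually co-occur with "baking". The model may have picked up general patterns from sentence structures like "she enjoys baking cookies" or "he loves drinking coffee", but with only four short sentences there is too little data to tell learned relationships apart from noise.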

## Using a pretrained model

In [7]:
import gensim.downloader as api

In [8]:
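# Load pretrained 300-dimensional Word2Vec vectors trained on Google News
# (a large download the first time it runs)
wv = api.load('word2vec-google-news-300')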



In [9]:
wv.most_similar("baking", topn=10)

[('cooking', 0.6751806139945984),
 ('Baking', 0.6691453456878662),
 ('bake', 0.6438429951667786),
 ('bread_baking', 0.6340261101722717),
 ('baking_cakes', 0.6335681080818176),
 ('baking_pies', 0.6270673274993896),
 ('decorating_cakes', 0.626034140586853),
 ('pastry', 0.6246585845947266),
 ('cake_decorating', 0.6199908256530762),
 ('Line_baking_tray', 0.6138713359832764)]

In [13]:
wv.similarity("baking", "cookies")

0.48642683

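The score is now much higher: about 0.49 with the pretrained model, compared with about -0.01 for "cookies" in the toy CBOW model above.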

In [11]:
wv.similarity("baking", "cooking")

0.6751806
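
The scores returned by `most_similar` and `similarity` are cosine similarities between word vectors. As a minimal sketch of what that number means, the function below computes cosine similarity for two plain Python lists. The name `cosine_similarity` and the three toy vectors are assumptions made for illustration; they are not taken from gensim or from the models above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (illustrative values, not real word embeddings)
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a
c = [0.0, 0.0, 3.0]   # orthogonal to a

print(cosine_similarity(a, b))  # close to 1: vectors point the same way
print(cosine_similarity(a, c))  # 0: vectors are unrelated (orthogonal)
```

Values near 1 indicate vectors pointing in nearly the same direction, values near 0 indicate little relationship, and negative values indicate opposing directions. This is why the near-zero scores from the toy model carry little meaning, while 0.49 and 0.68 from the pretrained model indicate real association.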