This section explains the Continuous Bag of Words (CBOW) model in simple words.

Imagine you have a big collection of sentences, like books or articles. The CBOW model is like a smart system that learns the meaning of words by looking at the words around them. It's trying to understand what a word means by examining the words that often appear close to it.

Here's how CBOW works:

1. It looks at a target word. Let's say the target word is "apple."

2. It checks the words that are usually found near "apple" in sentences, like "delicious," "fruit," and "red."

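3. It uses these surrounding words to predict the target word. During training, the model adjusts its word vectors so that a context like "delicious" and "fruit" makes "apple" a likely prediction. As a result, "apple" ends up with a vector close to those of other words that appear in similar contexts, such as other tasty fruits.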

4. CBOW repeats this process for many words in the text to learn what they mean based on their context.

So, in simple terms, CBOW is a method that learns the meaning of words by looking at the words around them. The word representations it produces capture relationships between words in the language, which makes them useful for various language-related tasks.
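To make the steps above concrete, here is a minimal sketch of how (context, target) training pairs can be collected from one tokenized sentence. The function name `cbow_pairs`, the toy sentence, and the window size of 2 are assumptions made for illustration; this is not Gensim's internal code, which also averages the context vectors and trains a prediction layer on top of these pairs.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) pairs for one tokenized sentence."""
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of the target form its context.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            yield context, target


# Hypothetical toy sentence, already lowercased and tokenized.
sentence = ["the", "red", "apple", "is", "a", "delicious", "fruit"]
pairs = list(cbow_pairs(sentence, window=2))

for context, target in pairs:
    print(context, "->", target)
```

Each pair asks the model the same question: given these neighbours, which word belongs in the middle?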

Prepare the Data: Load your text corpus and preprocess it. Tokenize the sentences, remove punctuation, convert text to lowercase, and create a vocabulary with unique words. Assign an index to each word in the vocabulary.
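The preparation step can be sketched with the standard library alone. The two-sentence corpus and the names `tokenize` and `word_to_index` are hypothetical, chosen only for this illustration; later in this section Gensim does the equivalent work for the real dataset.

```python
import re

# Hypothetical toy corpus used only for this illustration.
corpus = [
    "The apple is red.",
    "The apple is a delicious fruit!",
]


def tokenize(text):
    # Lowercase the text, then keep runs of letters; punctuation is dropped.
    return re.findall(r"[a-z]+", text.lower())


tokenized = [tokenize(doc) for doc in corpus]

# Assign each unique word the next free index, in order of first appearance.
word_to_index = {}
for tokens in tokenized:
    for word in tokens:
        if word not in word_to_index:
            word_to_index[word] = len(word_to_index)

print(tokenized)
print(word_to_index)
```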


In short, Gensim is a Python library used for natural language processing (NLP) tasks. Its main functions include topic modeling, word vector embeddings, document similarity analysis, text preprocessing, document representation, and efficient handling of large textual datasets. It's a valuable tool for tasks like discovering topics in documents, measuring document similarity, and creating word embeddings for NLP applications.

In [7]:
import gensim
import pandas as pd



Reading and Exploring the Dataset
The dataset we are using here is a subset of Amazon reviews from the Cell Phones & Accessories category. The data is stored as a JSON file and can be read using pandas.

Link to the Dataset: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz

In [20]:
df = pd.read_json(
    "http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz",
    lines=True,
)

Simple Preprocessing & Tokenization
The first thing to do for any data science task is to clean the data. For NLP, this usually means converting all words to lowercase, trimming spaces, and removing punctuation. We will do the same here.

Additionally, we can also remove stop words like 'and', 'or', 'is', 'the', 'a', 'an' and convert words to their root forms like 'running' to 'run'.
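Stop word removal can be sketched in a few lines. The stop word set below contains only the examples listed above, and the function name `remove_stop_words` and the sample tokens are assumptions for illustration. Reducing words to their root forms needs a stemmer or lemmatizer and is not shown here.

```python
# Only the stop words named in the text; real stop lists are much longer.
STOP_WORDS = {"and", "or", "is", "the", "a", "an"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Keep every token that is not in the stop word set.
    return [token for token in tokens if token not in stop_words]


# Hypothetical tokens, already lowercased.
tokens = ["the", "case", "is", "cheap", "and", "sturdy"]
filtered = remove_stop_words(tokens)
print(filtered)
```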

In [22]:
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4,Looks Good,1400630400,"05 21, 2014"
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5,Really great product.,1389657600,"01 14, 2014"
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5,LOVE LOVE LOVE,1403740800,"06 26, 2014"
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4,Cute!,1382313600,"10 21, 2013"
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5,leopard home button sticker for iphone 4s,1359849600,"02 3, 2013"


The simple_preprocess function in Gensim is used to preprocess and tokenize a text document. It takes a text string as input and performs the following tasks:

Tokenization: It breaks the input text into individual words or tokens, splitting it based on spaces and punctuation.

Lowercasing: It converts all tokens to lowercase to ensure consistent handling of text.

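Removing Short and Long Tokens: By default, simple_preprocess drops tokens shorter than 2 characters or longer than 15 characters (the min_len and max_len parameters). Very short tokens are often considered less meaningful, which is why one-letter words such as "I" and "a" are missing from the tokenized reviews.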
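The three tasks above can be approximated with a regular expression and a length filter. This is a sketch of the behaviour, not Gensim's source code: the name `simple_preprocess_sketch` and the exact pattern are assumptions, and the defaults of 2 and 15 mirror what I understand simple_preprocess to use.

```python
import re

# Runs of word characters that are not digits; punctuation splits tokens.
TOKEN_PATTERN = re.compile(r"(?:(?!\d)\w)+")


def simple_preprocess_sketch(text, min_len=2, max_len=15):
    """Lowercase, tokenize, and drop tokens that are too short or too long."""
    tokens = (match.group() for match in TOKEN_PATTERN.finditer(text.lower()))
    return [
        token
        for token in tokens
        if min_len <= len(token) <= max_len and not token.startswith("_")
    ]


# Hypothetical input; "I" and the "t" of "don't" fall below min_len.
example = "I just don't like the rounded shape."
result = simple_preprocess_sketch(example)
print(result)
```

Note how the apostrophe splits "don't" into "don" and "t", and the one-letter piece is then discarded.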

In [21]:
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)


In [23]:
review_text


0         [they, look, good, and, stick, good, just, don...
1         [these, stickers, work, like, the, review, say...
2         [these, are, awesome, and, make, my, phone, lo...
3         [item, arrived, in, great, time, and, was, in,...
4         [awesome, stays, on, and, looks, great, can, b...
                                ...                        
194434    [works, great, just, like, my, original, one, ...
194435    [great, product, great, packaging, high, quali...
194436    [this, is, great, cable, just, as, good, as, t...
194437    [really, like, it, becasue, it, works, well, w...
194438    [product, as, described, have, wasted, lot, of...
Name: reviewText, Length: 194439, dtype: object

In [24]:
review_text.loc[0]


['they',
 'look',
 'good',
 'and',
 'stick',
 'good',
 'just',
 'don',
 'like',
 'the',
 'rounded',
 'shape',
 'because',
 'was',
 'always',
 'bumping',
 'it',
 'and',
 'siri',
 'kept',
 'popping',
 'up',
 'and',
 'it',
 'was',
 'irritating',
 'just',
 'won',
 'buy',
 'product',
 'like',
 'this',
 'again']

In [25]:
df.reviewText.loc[0]


"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"

Training the Word2Vec Model
Train the model on the reviews. Use a window of size 10, i.e. up to 10 words before the current word and 10 words after it. The min_count parameter sets the minimum number of times a word must occur in the corpus to be included in the vocabulary; with min_count=2, words that appear only once are ignored.

The workers parameter sets how many CPU threads are used for training.
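The effect of min_count can be illustrated with a plain frequency count. The toy sentences and variable names below are hypothetical; the point is only that the threshold applies to how often each word occurs across the whole corpus.

```python
from collections import Counter

# Hypothetical tokenized corpus.
sentences = [
    ["great", "case", "great", "price"],
    ["case", "broke", "quickly"],
]
min_count = 2

# Count every word across all sentences.
counts = Counter(word for sentence in sentences for word in sentence)

# Keep only the words that occur at least min_count times.
kept = sorted(word for word, count in counts.items() if count >= min_count)
print(kept)
```

Here "price", "broke", and "quickly" each occur once, so they would not get a vector.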

Initialize the model


In short, the Word2Vec class in Gensim creates word embeddings: numerical vectors that capture the semantic meaning of words. These embeddings are useful for tasks such as word similarity, analogies, text classification, and recommendation systems. With the default setting sg=0, Gensim trains the CBOW architecture described at the start of this section; sg=1 selects skip-gram instead.

In [26]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    workers=4,
)

Build Vocabulary

In short, the build_vocab method constructs the vocabulary for training a Word2Vec model. It scans the text data to identify unique words and assigns a unique numerical ID to each one, which the training step needs in order to look up word vectors.

In [27]:
model.build_vocab(review_text, progress_per=1000)

Train the Word2Vec Model

In [28]:
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)


(61505326, 83868975)

Save the Model

Save the model so that it can be reused in other applications.

In [29]:
model.save("./word2vec-amazon-cell-accessories-reviews-short.model")


Finding Similar Words and Similarity between words


https://radimrehurek.com/gensim/models/word2vec.html

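The wv.most_similar method finds the words that are closest to a given word in the model's embedding space, ranked by cosine similarity. It helps you discover words that are used in similar contexts to the input word. Similar context does not always mean similar meaning, which is why "good" appears among the neighbours of "bad" in the output below.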

In [30]:
model.wv.most_similar("bad")

[('terrible', 0.6989720463752747),
 ('shabby', 0.6393527388572693),
 ('good', 0.5959081053733826),
 ('horrible', 0.5899097323417664),
 ('crappy', 0.5518659949302673),
 ('pathetic', 0.5452446937561035),
 ('funny', 0.5432130098342896),
 ('disappointing', 0.533385694026947),
 ('okay', 0.529837429523468),
 ('awful', 0.521769106388092)]

In [31]:
model.wv.similarity(w1="cheap", w2="inexpensive")

0.5554835

In [32]:
model.wv.similarity(w1="great", w2="good")

0.78176177
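The scores above are cosine similarities between word vectors. The sketch below shows the calculation on made-up 3-dimensional vectors; the vectors and the helper names `cosine_similarity` and `most_similar` are assumptions for illustration, while the trained model uses vectors with many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, not taken from the trained model.
vectors = {
    "good": [1.0, 0.0, 0.0],
    "great": [0.8, 0.6, 0.0],
    "cable": [0.0, 0.0, 1.0],
}


def most_similar(word, vectors, topn=2):
    # Score every other word against `word`, highest similarity first.
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print(cosine_similarity(vectors["good"], vectors["great"]))
print(most_similar("good", vectors))
```

A score near 1 means the vectors point in almost the same direction, and a score near 0 means they are unrelated.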

Further Reading
You can read more about Gensim's Word2Vec at https://radimrehurek.com/gensim/models/word2vec.html

Explore other Datasets related to Amazon Reviews: http://jmcauley.ucsd.edu/data/amazon/