# **NLP with Python: Tokenization, Word Embedding (word2vec), and Sentiment Analysis**

Dr. Aydede
  
## 1. `word2vec`

Word embedding is a technique in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually, it is a mathematical embedding from a space with one dimension per word (e.g., one-hot vectors over the whole vocabulary) into a continuous vector space of much lower dimension. The key idea is to capture the semantic meaning, syntactic similarity, and relations of words in these vectors, such that words with similar meanings are closer to each other in the vector space.

Word embedding models like Word2Vec, GloVe (Global Vectors for Word Representation), and FastText have become foundational in NLP applications because they can reduce the dimensionality of text data while preserving lexical and semantic word relationships.

Let's take a simple example using Word2Vec from the Gensim library. Word2Vec can be trained with either the Continuous Bag of Words (CBOW) or Skip-Gram model. In CBOW, the model predicts a word given its context. In Skip-Gram, it predicts the context given a word. Here's how you can use Gensim to train a simple Word2Vec model on a small dataset:

1. First, we'll create a small dataset (corpus).
2. Then, we'll train a Word2Vec model on this dataset.
3. Finally, we'll explore the resulting word embeddings.

In [1]:
from gensim.models import Word2Vec
import logging

# Enable logging for monitoring training
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Sample sentences
sentences = [
    ['python', 'is', 'a', 'programming', 'language'],
    ['python', 'and', 'java', 'are', 'popular', 'programming', 'languages'],
    ['python', 'programs', 'are', 'easy', 'to', 'write'],
    ['machine', 'learning', 'is', 'fun', 'with', 'python']
]

# Train a Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Summarize the trained model
print("Word2Vec model:", model)

# Access vectors for one word
print("Vector for 'python':", model.wv['python'])

# Find most similar words
print("Words similar to 'python':", model.wv.most_similar('python'))


Word2Vec model: Word2Vec<vocab=18, vector_size=100, alpha=0.025>
Vector for 'python': [-5.3622725e-04  2.3643136e-04  5.1033497e-03  9.0092728e-03
 -9.3029495e-03 -7.1168090e-03  6.4588725e-03  8.9729885e-03
 -5.0154282e-03 -3.7633716e-03  7.3805046e-03 -1.5334714e-03
 -4.5366134e-03  6.5540518e-03 -4.8601604e-03 -1.8160177e-03
  2.8765798e-03  9.9187379e-04 -8.2852151e-03 -9.4488179e-03
  7.3117660e-03  5.0702621e-03  6.7576934e-03  7.6286553e-04
  6.3508903e-03 -3.4053659e-03 -9.4640139e-04  5.7685734e-03
 -7.5216377e-03 -3.9361035e-03 -7.5115822e-03 -9.3004224e-04
  9.5381187e-03 -7.3191668e-03 -2.3337686e-03 -1.9377411e-03
  8.0774371e-03 -5.9308959e-03  4.5162440e-05 -4.7537340e-03
 -9.6035507e-03  5.0072931e-03 -8.7595852e-03 -4.3918253e-03
 -3.5099984e-05 -2.9618145e-04 -7.6612402e-03  9.6147433e-03
  4.9820580e-03  9.2331432e-03 -8.1579173e-03  4.4957981e-03
 -4.1370760e-03  8.2453608e-04  8.4986202e-03 -4.4621765e-03
  4.5175003e-03 -6.7869602e-03 -3.5484887e-03  9.3985079e-03

This example demonstrates the basics of training a Word2Vec model with Gensim. Here, `vector_size` specifies the dimensionality of the word vectors, `window` defines the maximum distance between the current and predicted word within a sentence, and `min_count` drops all words whose total frequency is below the given value. Because `sg` is left at its default of 0, the model is trained with CBOW; passing `sg=1` selects Skip-Gram.

After training, we access the vector for "python" and find words similar to "python" based on their word embeddings. The output shows how the model represents "python" in the context of the provided corpus, although with only four sentences the vectors and similarity scores are not meaningful yet.
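
To make the `window` parameter and the CBOW/Skip-Gram distinction more concrete, here is a minimal pure-Python sketch of how training examples could be formed from one sentence. The helper `training_pairs` is hypothetical and only illustrates the idea; Gensim's actual implementation adds refinements (for example, it may sample a smaller effective window).

```python
def training_pairs(tokens, window=2):
    """Yield (target, context_words) for every position in a sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after
        yield target, left + right

sentence = ['python', 'is', 'a', 'programming', 'language']
pairs = list(training_pairs(sentence, window=2))

# CBOW: predict the target word from its whole context
cbow_examples = [(context, target) for target, context in pairs]

# Skip-Gram: predict each context word from the target word
skipgram_examples = [(target, c) for target, context in pairs for c in context]

print(cbow_examples[0])       # (['is', 'a'], 'python')
print(skipgram_examples[:2])  # [('python', 'is'), ('python', 'a')]
```

CBOW produces one example per position, while Skip-Gram produces one example per (target, context word) pair, which is why Skip-Gram sees more training examples from the same text.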

**Is it a single-layer NN with 100 nodes?**
  
Let's clarify a bit more about what happens inside Word2Vec.

Word2Vec, whether using the Continuous Bag of Words (CBOW) or Skip-Gram model, does use a neural network architecture, but it is structured a bit differently from a typical feedforward neural network with a single hidden layer of 100 nodes (when you set `vector_size=100`). The "100 nodes" or "100 dimensions" represent the size of the word vectors you're aiming to learn, not the nodes of a hidden layer in the traditional sense.

Here's a simplified overview of the process for both CBOW and Skip-Gram models:

- **Input Layer:** For CBOW, the input is the context words (multiple words), each a one-hot encoded vector marking which word is present in the context. For Skip-Gram, the input is just the target word. The size of each input vector is equal to the vocabulary size.
- **Projection Layer (or Hidden Layer):** This is not a typical hidden layer with activation functions. Instead, it's a projection layer where the actual learning of word embeddings happens. When you set `vector_size=100`, this layer has 100 neurons. The weights connecting the input layer to this layer are what become the word embeddings: for a given input word, the corresponding row in the weight matrix is the word vector for that word.
  - In CBOW, the vectors from the projection layer corresponding to each context word are averaged before being passed to the output layer.
  - In Skip-Gram, the projection layer directly connects to the output layer, using the vector of the input word.
- **Output Layer:** Conceptually, the output layer is a softmax layer that makes predictions (in practice it is usually approximated, for example with negative sampling or hierarchical softmax, because a full softmax over the vocabulary is expensive). For CBOW, it predicts the target word from the context. For Skip-Gram, it predicts the context words from the target word. The size of this layer is also equal to the vocabulary size.

So, in summary:

- The "100 dimensions" are the learned weights of the projection layer: each word's embedding is its own row of 100 weights.
- The learning involves adjusting these weights so that the model gets better at its prediction task (predicting context words for Skip-Gram, predicting a target word for CBOW), thereby capturing semantic and syntactic word relationships in the process.
- The neural network aspect of Word2Vec is quite specialized and optimized for the task of learning word embeddings, which is a bit different from a general-purpose neural network used for other types of prediction tasks.

In the context of the Word2Vec architecture and specifically regarding the projection (or hidden) layer where the word embeddings are learned, the activation function can indeed be thought of as an identity function, $f(x)=x$. This means that the output of each neuron in this layer is the same as its input, without any nonlinear transformation applied.
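
The layer-by-layer description above can be sketched with NumPy. This is only an illustration under simplifying assumptions: the weights are random rather than trained, the vocabulary and the names `W_in` and `W_out` are made up for the example, and a full softmax is used instead of the approximations that real implementations rely on.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ['python', 'is', 'a', 'programming', 'language']
V, D = len(vocab), 4              # vocabulary size, embedding size (vector_size)
W_in = rng.normal(size=(V, D))    # input -> projection weights: one row per word
W_out = rng.normal(size=(D, V))   # projection -> output weights

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

# Skip-Gram: a one-hot input times W_in simply selects one row of W_in.
# With the identity activation f(x) = x, that row is the layer's output.
idx = vocab.index('python')
hidden = one_hot(idx, V) @ W_in
print(np.allclose(hidden, W_in[idx]))  # True

# CBOW: average the rows that belong to the context words
context_idx = [vocab.index(w) for w in ['is', 'a']]
hidden_cbow = W_in[context_idx].mean(axis=0)

# Output layer: softmax turns scores into a distribution over the vocabulary
scores = hidden @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(probs.round(3))
```

After training, the rows of `W_in` would be the word vectors that a call like `model.wv['python']` returns.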

## 2. Sentiment with labels

Let's start with a simple example.

Once we have a "toy" dataset, we tokenize it. Tokenization is a fundamental step in natural language processing (NLP) and plays a crucial role in preparing text data for training word embeddings or any other machine learning model. Here's what it does and why it's important in the context of training word embeddings, as in the sentiment analysis example:

**What Tokenization Does:**

- **Splits Text into Tokens:** Tokenization breaks down text into its basic units (tokens), which are typically words or subwords. For instance, the sentence "I love machine learning" would be tokenized into ["I", "love", "machine", "learning"].
- **Facilitates Vector Representation:** Each token (word) can then be represented as a vector in the word embedding space. This is crucial for training embeddings, as the algorithm needs to work with individual words to learn their semantic and syntactic relationships.
- **Removes Punctuation and Special Characters:** Depending on the tokenizer, it can also help clean the text by removing punctuation, special characters, or unnecessary whitespace, making the text more uniform and easier to process.
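
As a dependency-free illustration of these three points, here is a minimal regex-based tokenizer. The function `simple_tokenize` is a hypothetical helper written for this sketch; it is much cruder than the NLTK and Gensim tokenizers used in this notebook.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(simple_tokenize("I love machine learning!"))
# ['i', 'love', 'machine', 'learning']
print(simple_tokenize("Don't wait, buy it."))
# ["don't", 'wait', 'buy', 'it']
```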

In [4]:
# Define the dataset
comments = ["love the new features", "hate the long wait time", "excellent service", "poor experience with the product", "happy with the purchase"]
sentiments = [1, 0, 1, 0, 1]  # 1: Positive, 0: Negative

import nltk
from nltk.tokenize import word_tokenize

# Download the punkt tokenizer models
# (recent NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

# Lowercase each comment and split it into word tokens
tokenized_comments = [word_tokenize(comment.lower()) for comment in comments]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [5]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import numpy as np

# Tokenize comments
tokenized_comments = [word_tokenize(comment.lower()) for comment in comments]

# Train word embeddings
model = Word2Vec(tokenized_comments, vector_size=4, window=2, min_count=1, workers=1)

# View a sample word vector
print("Vector for 'love':", model.wv['love'])

Vector for 'love': [-0.03944132  0.00803429 -0.10351574 -0.1920672 ]


In [6]:
# Calculate average word vectors for each comment
average_vectors = []
for comment in tokenized_comments:
  comment_vector = np.zeros(model.vector_size)
  for word in comment:
    try:
      comment_vector += model.wv[word]
    except KeyError:
      # Ignore words not in the vocabulary
      pass
  average_vectors.append(comment_vector / len(comment))

# Display the average vector for the first comment
print("Average vector for first comment:", average_vectors[0])

Average vector for first comment: [-0.01083414 -0.0337788   0.03695674  0.03914206]


When we talk about a "100-dimensional vector," we're referring to a list or array of 100 numbers, each being one coordinate of a single point in a 100-dimensional space. A word vector in such a space encapsulates various aspects of the word's meaning and usage.

**Understanding Dimensions and Averaging**

Let's say we have 3 words, each represented by a 4-dimensional word vector (for simplicity, we're using 4 dimensions instead of 100):

- Word 1 vector: [1,2,3,4]
- Word 2 vector: [2,3,4,5]
- Word 3 vector: [3,4,5,6]
  
These vectors might be the embeddings for three words in a sentence. To represent the entire sentence by a single vector, we compute the average of these vectors.

To find the average vector, we calculate the mean for each dimension across all word vectors:

- Dimension 1 average: (1+2+3)/3=2
- Dimension 2 average: (2+3+4)/3=3
- Dimension 3 average: (3+4+5)/3=4
- Dimension 4 average: (4+5+6)/3=5
  
So, the average vector representing the entire sentence is [2,3,4,5].

**What does this represent?** This averaged vector is still in the same 4-dimensional space as the original word vectors, but it's a new vector that, in theory, captures the combined semantic and syntactic essence of all the words in the text.
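
The worked example above can be checked with NumPy. The sketch also shows one limitation of averaging: word order is lost, so a sentence and its reversal get the same vector. The array name `toy_vectors` is made up for this example.

```python
import numpy as np

# The three 4-dimensional word vectors from the example above
toy_vectors = np.array([
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [3, 4, 5, 6],
], dtype=float)

# Average over the word axis: one mean per dimension
sentence_vector = toy_vectors.mean(axis=0)
print(sentence_vector)  # [2. 3. 4. 5.]

# Reversing the word order gives exactly the same sentence vector
reversed_vector = toy_vectors[::-1].mean(axis=0)
print(np.allclose(sentence_vector, reversed_vector))  # True
```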

In [9]:
# Fit a logistic regression classifier on a train/test split of the averaged vectors

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Split data into training and test sets
# (with 5 comments, test_size=0.2 holds out a single comment)
X_train, X_test, y_train, y_test = train_test_split(average_vectors, sentiments, test_size=0.2)

# Train Logistic Regression model
# (named clf so the Word2Vec `model` is not overwritten)
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Evaluate model on test data
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of Logistic Regression model:", accuracy)


Accuracy of Logistic Regression model: 0.0
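
With five comments and `test_size=0.2`, the test set holds a single comment, so the reported accuracy can only be 0.0 or 1.0 and depends on the random split. One possible alternative for such a tiny dataset is leave-one-out cross-validation, sketched below. The random matrix `X` is a hypothetical stand-in for `average_vectors`, so the printed numbers say nothing about the real embeddings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical stand-in for `average_vectors`: 5 comments x 4 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
y = np.array([1, 0, 1, 0, 1])  # same labels as `sentiments`

# Each fold trains on 4 comments and tests on the remaining one
scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```

Every comment is used for testing exactly once, which gives a less arbitrary estimate than a single one-comment test set, though five examples remain far too few for a reliable evaluation.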



## Word Embedding

The dataset we are using here is a subset of Amazon reviews from the Cell Phones & Accessories category. The data is stored as a gzip-compressed file with one JSON record per line, which pandas can read with `lines=True`.

In [10]:
import gzip
from io import BytesIO

import gensim
import pandas as pd
import requests

# URL of the dataset
url = "http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz"

# Send an HTTP request to the URL
response = requests.get(url)

# Make sure the request was successful
if response.status_code == 200:
    # Open the response content as a gzip file
    with gzip.open(BytesIO(response.content), 'rt') as read_file:
        # Read the dataset into a pandas DataFrame
        data = pd.read_json(read_file, lines=True)
    # Display the first few rows of the DataFrame
    print(data.head())
else:
    print("Failed to download the dataset.")

       reviewerID        asin      reviewerName helpful  \
0  A30TL5EWN6DFXT  120401325X         christina  [0, 0]   
1   ASY55RVNIL0UD  120401325X          emily l.  [0, 0]   
2  A2TMXE2AFO7ONB  120401325X             Erica  [0, 0]   
3   AWJ0WZQYMYFQ4  120401325X                JM  [4, 4]   
4   ATX7CZYFXI1KW  120401325X  patrice m rogoza  [2, 3]   

                                          reviewText  overall  \
0  They look good and stick good! I just don't li...        4   
1  These stickers work like the review says they ...        5   
2  These are awesome and make my phone look so st...        5   
3  Item arrived in great time and was in perfect ...        4   
4  awesome! stays on, and looks great. can be use...        5   

                                     summary  unixReviewTime   reviewTime  
0                                 Looks Good      1400630400  05 21, 2014  
1                      Really great product.      1389657600  01 14, 2014  
2                         

In [11]:
data.shape

(194439, 9)

In [13]:
review_text = data.reviewText.apply(gensim.utils.simple_preprocess)

In [14]:
review_text

0         [they, look, good, and, stick, good, just, don...
1         [these, stickers, work, like, the, review, say...
2         [these, are, awesome, and, make, my, phone, lo...
3         [item, arrived, in, great, time, and, was, in,...
4         [awesome, stays, on, and, looks, great, can, b...
                                ...                        
194434    [works, great, just, like, my, original, one, ...
194435    [great, product, great, packaging, high, quali...
194436    [this, is, great, cable, just, as, good, as, t...
194437    [really, like, it, becasue, it, works, well, w...
194438    [product, as, described, have, wasted, lot, of...
Name: reviewText, Length: 194439, dtype: object

In [17]:
from gensim.models import Word2Vec

# Assuming 'review_text' is a pandas Series where each entry is a list of tokens (words)
# Convert review_text into a list of lists of tokens for training
sentences = review_text.tolist()

# Initialize and train the Word2Vec model
model = Word2Vec(sentences=sentences,
                 vector_size=100,  # Size of word vectors; adjust based on your needs
                 window=10,
                 min_count=2,
                 workers=4)

# Summarize the trained model
print(model)

# Save the model for later use
model.save("word2vec_amazon_reviews.model")

# Access vectors for a word
print("Vector for the word 'phone':", model.wv['phone'])

# Find most similar words to 'phone'
print("Words similar to 'phone':", model.wv.most_similar('phone'))


Word2Vec<vocab=35561, vector_size=100, alpha=0.025>
Vector for the word 'phone': [-2.1982071e+00 -8.0649149e-01  1.9280401e+00 -1.8713187e-03
 -1.8930609e+00  5.9042013e-01  3.6916751e-01 -1.5701485e+00
  2.1311574e+00  2.4163215e+00  1.0375583e+00 -3.4427629e+00
 -1.5282893e+00 -2.2158711e+00  1.2268982e+00 -3.2126966e+00
  8.0580190e-02  2.0776742e+00  1.1629301e+00  1.3575493e+00
 -1.7898109e+00  2.7559390e+00  1.2408208e+00  2.5871902e+00
 -7.0643437e-01  1.4327291e+00  3.6206458e+00  3.1199279e+00
  3.4209259e+00 -1.4895647e+00  3.4860287e+00 -2.8240194e+00
  2.5911493e+00  2.9806135e+00  8.1010908e-01 -5.1091522e-01
  1.1223867e+00 -2.3968844e+00  2.1699142e+00  2.1751547e+00
 -1.3381577e+00  2.1534038e+00 -1.7891235e+00 -4.1032502e-01
  7.0157099e-01 -2.6939659e+00  1.6133605e+00  1.7733831e+00
  1.1621673e+00  4.5868835e-01  3.1874602e+00 -2.4393067e+00
 -2.8102043e+00 -1.6228411e+00  3.3430862e-01  8.7354320e-01
 -3.4092398e+00  1.7223221e+00  2.0883474e+00  3.4938743e+00
 -3.

In [18]:
model.build_vocab(review_text, progress_per=1000)



In [19]:
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)



(61502857, 83868975)

`train` returns a pair of counts, both summed over all training epochs.

The first number (61502857): This is the number of effective words, that is, the words actually used for training after words outside the vocabulary (here, those below `min_count`) are dropped and very frequent words are randomly downsampled (Gensim's `sample` parameter). It does not depend on the `window` parameter.

The second number (83868975): This is the number of raw words. It is the sum of the lengths of all the token lists provided to the model, counted once per epoch, before any filtering for `min_count` (minimum word frequency) or downsampling. Essentially, it's the size of the training corpus in total words, multiplied by the number of epochs.

Note that the `Word2Vec(...)` constructor in the earlier cell already built the vocabulary and trained the model because it was given `sentences`, so the explicit `build_vocab` and `train` calls repeat that work; they are shown here to make the two steps visible.
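
A quick arithmetic check helps interpret the two counts. The sketch below assumes the model was trained with Gensim's default of 5 epochs, which is an assumption since `epochs` is not set explicitly in this notebook.

```python
# Tuple returned by model.train(...) above
trained_words, raw_words = 61502857, 83868975

epochs = 5  # assumption: Gensim's default number of epochs

# Raw corpus size for a single pass over the reviews
raw_words_per_epoch = raw_words / epochs

# Share of raw words that were actually used for training
kept_fraction = trained_words / raw_words

print(f"Raw words per epoch: {raw_words_per_epoch:,.0f}")
print(f"Fraction of words trained on: {kept_fraction:.3f}")
```

Under that assumption, roughly 73% of the raw words were used for training; the rest were dropped by the vocabulary filter or by downsampling of frequent words.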

In [20]:
model.wv.most_similar("bad")

[('shabby', 0.6625680923461914),
 ('terrible', 0.6611157059669495),
 ('good', 0.5967464447021484),
 ('keen', 0.5830436944961548),
 ('horrible', 0.5354795455932617),
 ('okay', 0.5237018465995789),
 ('ok', 0.5104485154151917),
 ('cheap', 0.5079092383384705),
 ('poor', 0.49226653575897217),
 ('fault', 0.4708757996559143)]

In [23]:
model.wv.similarity(w1="great", w2="great")


1.0

In [24]:
model.wv.similarity(w1="great", w2="good")

0.78947496
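
The `similarity` score is the cosine similarity between the two word vectors, which is why a word compared with itself scores 1.0. Here is a minimal NumPy sketch of that formula, using made-up 3-dimensional vectors rather than vectors from the trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: (u . v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a, different length
c = [-3.0, 0.0, 1.0]   # orthogonal to a

print(cosine_similarity(a, a))  # 1.0 up to floating-point rounding
print(cosine_similarity(a, b))  # also 1.0: vector length does not matter
print(cosine_similarity(a, c))  # 0.0: orthogonal vectors
```

Because only the direction of the vectors matters, two words can be highly similar even if their vectors have very different magnitudes.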

## Finding Similar Words

After training the Word2Vec model on customer reviews, we have transformed words into vectors that capture semantic meanings, relationships, and context within our dataset. This opens up a variety of ways to analyze and gain insights from the customer reviews. One practical application, which we already used above, is finding similar words.

Finding similar words means discovering words that are semantically related to specific terms. This can help identify common themes or issues in reviews. For example, finding words similar to "battery" might reveal related concerns or praise in the context of product reviews.

In [25]:
similar_words = model.wv.most_similar('battery', topn=10)
print(similar_words)

[('batter', 0.8460943102836609), ('batt', 0.7913365364074707), ('batteries', 0.6934639811515808), ('itorch', 0.6278401613235474), ('powerbank', 0.5674920678138733), ('powerplant', 0.551176130771637), ('juice', 0.5456146597862244), ('prolong', 0.5360251069068909), ('incredicharge', 0.5267910361289978), ('span', 0.5267504453659058)]


## Word Clustering

Word clustering groups words based on their vector representations, so that words in the same cluster have similar meanings or are used in similar contexts. This can reveal patterns, themes, or topics common in the data. For instance, in customer reviews, you might find clusters around product features, customer service, shipping issues, etc.

Let's demonstrate word clustering using K-means on the Word2Vec embeddings trained above, and then discuss the insights that can be gained from this analysis.

**Step 1: Preparing Word Vectors**
First, extract the word vectors from the Word2Vec model. For simplicity, the code below uses the full vocabulary; for more interpretable clusters, you could restrict it to the most frequent words and exclude very common but less informative words (stop words).

In [26]:
from sklearn.cluster import KMeans
import numpy as np

# Assuming `model` is your Word2Vec model

# Extract the list of words & their vectors
word_vectors = model.wv.vectors
words = list(model.wv.index_to_key)

# For a more focused analysis, consider filtering words by frequency or excluding stop words
# This example uses all words for simplicity


**Step 2: Clustering Words**
Now, we'll use K-means clustering to group these words into clusters based on their vector similarities.

In [27]:
# Number of clusters
k = 10  # Example: 10 clusters. Adjust based on your analysis needs.

# Perform KMeans clustering
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(word_vectors)

# Assign each word to a cluster
word_cluster_labels = kmeans.labels_




**Step 3: Examining the Clusters**
After clustering, let's examine which words ended up in the same clusters. This will give us an idea of the themes or topics present in the reviews.

In [28]:
# Create a dictionary of word clusters
word_clusters = {i: [] for i in range(k)}
for word, cluster_label in zip(words, word_cluster_labels):
    word_clusters[cluster_label].append(word)

# Display words in each cluster
# (use a separate loop variable so the `words` list above is not overwritten)
for cluster, cluster_words in word_clusters.items():
    print(f"Cluster {cluster}: {cluster_words[:10]}")  # Displaying first 10 words for brevity


Cluster 0: ['and', 'to', 'is', 'for', 'that', 'in', 'with', 'but', 'not', 'as']
Cluster 1: ['this', 'was', 'one', 'they', 'had', 'me', 'product', 'these', 'them', 'what']
Cluster 2: ['of', 'if', 'there', 'any', 'used', 'thing', 'problem', 'being', 'people', 'headphones']
Cluster 3: ['case', 'screen', 'back', 'protector', 'cover', 'looks', 'protection', 'cases', 'plastic', 'look']
Cluster 4: ['sound', 'bluetooth', 'headset', 'music', 'life', 'speaker', 'camera', 'call', 'android', 'set']
Cluster 5: ['have', 'be', 'use', 'get', 'do', 'work', 'recommend', 'need', 'buy', 'go']
Cluster 6: ['like', 'can', 'great', 'good', 'would', 'will', 'well', 'does', 'don', 'much']
Cluster 7: ['my', 'or', 'when', 'time', 'while', 'using', 'car', 'day', 'times', 'having']
Cluster 8: ['the', 'it', 'on', 'phone', 'you', 'your', 'out', 'off', 'little', 'fit']
Cluster 9: ['battery', 'an', 'iphone', 'charge', 'charger', 'other', 'which', 'works', 'usb', 'charging']
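
The listing above prints the first ten words of each cluster in vocabulary order, which in Gensim is normally most-frequent-first, so stop words dominate. One possible alternative is to list the words closest to each cluster center instead. The sketch below illustrates this on hypothetical 2-dimensional toy vectors (`toy_words` and `toy_vectors` are made up); with the real model you would use `model.wv.vectors` and `model.wv.index_to_key` in their place.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D "embeddings" forming two obvious groups
toy_words = ['battery', 'charger', 'usb', 'case', 'cover', 'screen']
toy_vectors = np.array([
    [1.0, 0.1], [0.9, 0.2], [1.1, 0.0],   # charging-related words
    [0.0, 1.0], [0.1, 0.9], [0.2, 1.1],   # case-related words
])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(toy_vectors)
labels = toy_kmeans.labels_

# For each cluster, rank its member words by distance to the cluster center
central_words = {}
for cluster_id, center in enumerate(toy_kmeans.cluster_centers_):
    member_idx = np.where(labels == cluster_id)[0]
    distances = np.linalg.norm(toy_vectors[member_idx] - center, axis=1)
    ranked = member_idx[np.argsort(distances)]
    central_words[cluster_id] = [toy_words[i] for i in ranked]
    print(f"Cluster {cluster_id}: {central_words[cluster_id]}")
```

Words near the center are the most typical members of a cluster, so they tend to make better labels for it than the most frequent words.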


**Insights from Word Clustering**

- **Theme Identification:** Each cluster represents a group of words that are contextually similar. By examining the words in each cluster, you can identify common themes or topics in the reviews. For example, Cluster 9 above groups "battery", "charge", "charger", and "charging", which points to discussions about battery life and charging, while Cluster 3 gathers words about cases and screen protectors. Several other clusters are dominated by stop words, which is why filtering the vocabulary first makes the themes easier to read.
- **Product Features and Issues:** Clusters might reveal specific product features that customers talk about the most, as well as recurring issues or areas of dissatisfaction.
- **Customer Sentiment:** Although not a direct measure of sentiment, the clustering of certain words together can give clues about overall customer sentiment. Words with positive connotations clustering together separately from words with negative connotations could indicate polarized opinions about certain aspects of the product or service.
- **Improving Product and Service:** By identifying clusters related to customer service, shipping, product durability, etc., businesses can pinpoint areas for improvement.

## Sentiment Analysis

Now let's try a sentiment analysis.  Performing sentiment analysis without pre-labeled data is a common challenge, but there are several approaches you can take to analyze sentiment in your customer reviews.

**Lexicon-Based Approach**
  
This method relies on predefined lists of words associated with positive and negative sentiments. You can use libraries like TextBlob or VADER, which come with built-in sentiment lexicons and can provide sentiment scores based on the presence and combinations of positive and negative words in your text.

Here is an example:



In [29]:
from textblob import TextBlob

# Example review
review = "The phone has an amazing battery life but a disappointing camera."

# Get sentiment polarity
sentiment = TextBlob(review).sentiment.polarity
print(f"Sentiment polarity: {sentiment}")


Sentiment polarity: 5.551115123125783e-17


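A positive polarity score indicates a positive sentiment, while a negative score indicates a negative sentiment; scores range from -1 to 1. The score above is zero up to floating-point rounding, presumably because the positive "amazing" and the negative "disappointing" cancel each other out. This illustrates a limitation of lexicon-based scoring on reviews with mixed opinions. Still, TextBlob can be a straightforward way to start with sentiment analysis without needing labeled data.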
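
To illustrate how a lexicon-based score can come out as zero for a mixed review, here is a minimal sketch that averages the scores of the words found in a small lexicon. The lexicon values and the helper `lexicon_polarity` are hypothetical and are not TextBlob's actual lexicon or algorithm, which also handles things like negation and intensifiers.

```python
import re

# Hypothetical word scores, chosen only for illustration
toy_lexicon = {'amazing': 0.6, 'great': 0.8, 'disappointing': -0.6, 'poor': -0.4}

def lexicon_polarity(text, lexicon):
    """Average the scores of the words that appear in the lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

review = "The phone has an amazing battery life but a disappointing camera."
polarity = lexicon_polarity(review, toy_lexicon)
print(polarity)  # 0.0: the positive and negative words cancel out

print(lexicon_polarity("Great phone", toy_lexicon))  # 0.8
```

A score near zero can therefore mean either a neutral text or a text with strong but opposing opinions, which is worth keeping in mind when counting "neutral" reviews.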

Lexicon-based tools assign predefined sentiment scores to words and combine them into a score for the whole text. Two popular options are TextBlob and VADER (Valence Aware Dictionary and sEntiment Reasoner); VADER was designed with short social media text in mind. Here we use TextBlob, which is straightforward and works well for general-purpose sentiment analysis, including on longer texts like reviews.

In [32]:
# Applying TextBlob sentiment analysis on the reviewText column
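# Run TextBlob once per review, then split the result into two columns
blob_sentiments = data['reviewText'].apply(lambda x: TextBlob(x).sentiment)
data['sentiment_polarity'] = blob_sentiments.apply(lambda s: s.polarity)
data['sentiment_subjectivity'] = blob_sentiments.apply(lambda s: s.subjectivity)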


Now that you have sentiment scores, you can analyze them to gain insights into the overall sentiment of the reviews, such as:

Overall Sentiment: Calculate the average sentiment polarity to get an idea of the overall sentiment towards the product.


In [34]:
average_sentiment = data['sentiment_polarity'].mean()
print(f"Average Sentiment Polarity: {average_sentiment}")


Average Sentiment Polarity: 0.24830300849739492


An average sentiment polarity of approximately 0.248 suggests that the overall sentiment in the dataset of reviews leans towards the positive side. This is a good starting point for understanding customer sentiment, but there are several ways to dig deeper.

First, we count how many reviews have a positive, zero (neutral), or negative polarity. Then, since the overall sentiment is somewhat positive, we look at how sentiment varies across different aspects or features of the product, like its battery life, camera quality, or customer service. To do so, we filter reviews mentioning specific features and calculate the average sentiment for the reviews concerning each aspect:

In [36]:
positive_reviews = data[data['sentiment_polarity'] > 0].shape[0]
neutral_reviews = data[data['sentiment_polarity'] == 0].shape[0]
negative_reviews = data[data['sentiment_polarity'] < 0].shape[0]

print(f"Positive Reviews: {positive_reviews}")
print(f"Neutral Reviews: {neutral_reviews}")
print(f"Negative Reviews: {negative_reviews}")


Positive Reviews: 172070
Neutral Reviews: 4950
Negative Reviews: 17419


In [37]:
features = ['battery', 'camera', 'service']
for feature in features:
    feature_reviews = data[data['reviewText'].str.contains(feature, case=False)]
    avg_sentiment = feature_reviews['sentiment_polarity'].mean()
    print(f"Average sentiment for {feature}: {avg_sentiment}")


Average sentiment for battery: 0.199119172408153
Average sentiment for camera: 0.19423316438082053
Average sentiment for service: 0.22639224564186428
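
The per-feature averages above are easier to judge when the number of matching reviews is shown next to them. The sketch below does this on a tiny hypothetical DataFrame (`toy` and its values are made up). It also matches whole words with `\b` and passes `na=False`, so that missing review texts are skipped; both are design choices for this example, not part of the original analysis.

```python
import pandas as pd

# Hypothetical mini dataset with the same column names as `data`
toy = pd.DataFrame({
    'reviewText': ['Battery lasts all day', 'The camera is blurry',
                   'Great battery and camera', None],
    'sentiment_polarity': [0.5, -0.4, 0.8, 0.0],
})

rows = []
for feature in ['battery', 'camera', 'service']:
    # Whole-word, case-insensitive match; rows with missing text count as False
    mask = toy['reviewText'].str.contains(rf'\b{feature}\b', case=False,
                                          na=False, regex=True)
    rows.append({
        'feature': feature,
        'n_reviews': int(mask.sum()),
        'avg_polarity': toy.loc[mask, 'sentiment_polarity'].mean(),
    })

summary = pd.DataFrame(rows)
print(summary)
```

A feature with no matching reviews gets an average of NaN, which makes it clear that there was nothing to average rather than reporting a misleading zero.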
