# **NLP with Python - Basics**

Dr. Aydede

Creating word embeddings from raw text involves several steps, including text preprocessing, tokenization, and the application of an embedding algorithm. Here's a detailed guide, along with the libraries commonly used in Python for each step:

## 1. Text Preprocessing:

Preprocessing is crucial to clean and normalize the text data. This step typically includes:

- Lowercasing
- Removing punctuation
- Removing stop words
- Lemmatization or stemming


In [3]:
# Import necessary libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string
import re
import gensim
from gensim.models import Word2Vec

# Download NLTK data files if not already downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')  # Downloading the missing resource

# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Sample text (replace with your financial report text)
text = "This is a sample financial report text with numbers, punctuations, and various stop words."

# Text Preprocessing
def preprocess_text(text):
    text = text.lower()  # Lowercasing
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)  # Removing punctuation (escaped so \, ], ^ and - are treated literally)
    tokens = word_tokenize(text)  # Tokenization
    processed_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]  # Removing stop words and lemmatization
    return processed_tokens

processed_tokens = preprocess_text(text)
print("Processed Tokens:", processed_tokens)


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/yigitaydede/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yigitaydede/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/yigitaydede/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/yigitaydede/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Processed Tokens: ['sample', 'financial', 'report', 'text', 'number', 'punctuation', 'various', 'stop', 'word']


The following lines download necessary data files from the NLTK (Natural Language Toolkit) library:

- `nltk.download('punkt')`: Downloads the Punkt tokenizer models, which are used for tokenizing text into sentences and words.
- `nltk.download('stopwords')`: Downloads a list of common stop words in various languages. Stop words are words that are commonly filtered out in natural language processing tasks because they don't carry significant meaning (e.g., "and", "the", "is").
- `nltk.download('wordnet')`: Downloads the WordNet lexical database, which is used for lemmatization, finding synonyms, and other lexical tasks.
- `nltk.download('omw-1.4')`: Downloads the Open Multilingual WordNet package, which is needed for certain WordNet functions, particularly for handling multiple languages.

The following lines initialize the tools for lemmatization and stop word removal:

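- `WordNetLemmatizer()`: Creates an instance of the WordNet lemmatizer, which reduces words to their base or dictionary form (e.g., "numbers" becomes "number"). By default it treats every word as a noun, so "running" becomes "run" only when called with `pos='v'`.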
- `set(stopwords.words('english'))`: Creates a set of English stop words to efficiently check if a word is a stop word. Using a set makes membership tests faster.
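The same pipeline can be sketched with only the standard library, which makes each step explicit. This is a simplified stand-in, not what NLTK does internally: `TOY_STOP_WORDS` is a tiny hand-picked set and `toy_preprocess` is a hypothetical helper (both are assumptions for illustration), whitespace splitting replaces `word_tokenize`, and lemmatization is omitted.

```python
import string

# Hypothetical, hand-picked stop list for illustration only;
# NLTK's English stop word list is much longer.
TOY_STOP_WORDS = {"this", "is", "a", "with", "and", "the"}

def toy_preprocess(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    # str.translate deletes every character listed in string.punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    # Membership tests against a set take constant time on average
    return [token for token in tokens if token not in TOY_STOP_WORDS]

sample = "This is a sample financial report text with numbers, punctuations, and various stop words."
tokens = toy_preprocess(sample)
print(tokens)
```

Because no lemmatizer is applied in this sketch, plural forms such as "numbers" and "words" are kept as they are, unlike in the NLTK output above.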

## 2. `word2vec`

Word embedding is a technique in natural language processing (NLP) in which words or phrases from a vocabulary are mapped to vectors of real numbers. Conceptually, it is a mathematical embedding from a space with one dimension per word (the one-hot representation) into a continuous vector space of much lower dimension. The key idea is to capture the semantic meaning, syntactic similarity, and relations of words in these vectors, so that words with similar meanings lie closer to each other in the vector space.
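"Closer" is usually measured with cosine similarity, the cosine of the angle between two vectors. The sketch below computes it by hand; the three-dimensional vectors in `toy_vectors` are invented for illustration and do not come from a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, invented for illustration only
toy_vectors = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "battery": [0.1, 0.9, 0.2],
}

sim_good_great = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_good_battery = cosine_similarity(toy_vectors["good"], toy_vectors["battery"])
print("good vs great:  ", round(sim_good_great, 3))
print("good vs battery:", round(sim_good_battery, 3))
```

In this toy setup "good" and "great" point in nearly the same direction, so their similarity is close to 1, while "good" and "battery" are far apart.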

Word embedding models like Word2Vec, GloVe (Global Vectors for Word Representation), and FastText have become foundational in NLP applications because they can reduce the dimensionality of text data while preserving lexical and semantic word relationships.

Let's take a simple example using Word2Vec from the Gensim library. Word2Vec can be trained with either the Continuous Bag of Words (CBOW) or Skip-Gram model. In CBOW, the model predicts a word given its context. In Skip-Gram, it predicts the context given a word. Here's how you can use Gensim to train a simple Word2Vec model on a small dataset:

1. First, we'll create a small dataset (corpus).
2. Then, we'll train a Word2Vec model on this dataset.
3. Finally, we'll explore the resulting word embeddings.

In [10]:
from gensim.models import Word2Vec
import logging

# Enable logging for monitoring training
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Sample sentences
sentences = [
    ['python', 'is', 'a', 'programming', 'language'],
    ['python', 'and', 'java', 'are', 'popular', 'programming', 'languages'],
    ['python', 'programs', 'are', 'easy', 'to', 'write'],
    ['machine', 'learning', 'is', 'fun', 'with', 'python']
]

# Train a Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Summarize the loaded model
print("Word2Vec model:", model)

# Access vectors for one word
print("Vector for 'python':", model.wv['python'])

# Find most similar words
print("Words similar to 'python':", model.wv.most_similar('python'))


2024-08-05 16:51:13,320 : INFO : collecting all words and their counts
2024-08-05 16:51:13,322 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-08-05 16:51:13,323 : INFO : collected 18 word types from a corpus of 24 raw words and 4 sentences
2024-08-05 16:51:13,324 : INFO : Creating a fresh vocabulary
2024-08-05 16:51:13,325 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 retains 18 unique words (100.0%% of original 18, drops 0)', 'datetime': '2024-08-05T16:51:13.325211', 'gensim': '4.1.2', 'python': '3.9.12 (main, Apr  5 2022, 01:52:34) \n[Clang 12.0.0 ]', 'platform': 'macOS-14.6-arm64-arm-64bit', 'event': 'prepare_vocab'}
2024-08-05 16:51:13,326 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 leaves 24 word corpus (100.0%% of original 24, drops 0)', 'datetime': '2024-08-05T16:51:13.326205', 'gensim': '4.1.2', 'python': '3.9.12 (main, Apr  5 2022, 01:52:34) \n[Clang 12.0.0 ]', 'platform': 'macOS-14.6-arm64-arm-64bit', 'e

Word2Vec model: Word2Vec(vocab=18, vector_size=100, alpha=0.025)
Vector for 'python': [-5.3622725e-04  2.3643136e-04  5.1033497e-03  9.0092728e-03
 -9.3029495e-03 -7.1168090e-03  6.4588725e-03  8.9729885e-03
 -5.0154282e-03 -3.7633716e-03  7.3805046e-03 -1.5334714e-03
 -4.5366134e-03  6.5540518e-03 -4.8601604e-03 -1.8160177e-03
  2.8765798e-03  9.9187379e-04 -8.2852151e-03 -9.4488179e-03
  7.3117660e-03  5.0702621e-03  6.7576934e-03  7.6286553e-04
  6.3508903e-03 -3.4053659e-03 -9.4640139e-04  5.7685734e-03
 -7.5216377e-03 -3.9361035e-03 -7.5115822e-03 -9.3004224e-04
  9.5381187e-03 -7.3191668e-03 -2.3337686e-03 -1.9377411e-03
  8.0774371e-03 -5.9308959e-03  4.5162440e-05 -4.7537340e-03
 -9.6035507e-03  5.0072931e-03 -8.7595852e-03 -4.3918253e-03
 -3.5099984e-05 -2.9618145e-04 -7.6612402e-03  9.6147433e-03
  4.9820580e-03  9.2331432e-03 -8.1579173e-03  4.4957981e-03
 -4.1370760e-03  8.2453608e-04  8.4986202e-03 -4.4621765e-03
  4.5175003e-03 -6.7869602e-03 -3.5484887e-03  9.3985079e-03

This example demonstrates the basics of training a Word2Vec model with Gensim. Here, `vector_size` specifies the dimensionality of the word vectors, `window` defines the maximum distance between the current and predicted word within a sentence, `min_count` drops all words whose total frequency is lower than this value, and `workers` sets the number of training threads. Because `sg` is left at its default of 0, the model is trained with CBOW; passing `sg=1` selects Skip-Gram.

After training, we access the vector for "python" and find the words most similar to "python" based on their embeddings. With a corpus of only four sentences the vectors stay close to their random initialization, so the similarity scores here illustrate the API rather than real semantic relationships.
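To see what `min_count` does to the vocabulary, the counting step can be reproduced with `collections.Counter`. This is a sketch of the idea only, not Gensim's implementation; `count_vocab` is a hypothetical helper name.

```python
from collections import Counter

sentences = [
    ['python', 'is', 'a', 'programming', 'language'],
    ['python', 'and', 'java', 'are', 'popular', 'programming', 'languages'],
    ['python', 'programs', 'are', 'easy', 'to', 'write'],
    ['machine', 'learning', 'is', 'fun', 'with', 'python'],
]

def count_vocab(sentences, min_count):
    """Count every word and keep those seen at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

vocab_all = count_vocab(sentences, min_count=1)
vocab_frequent = count_vocab(sentences, min_count=2)
total_words = sum(len(sentence) for sentence in sentences)

print(len(vocab_all), total_words)  # 18 word types, 24 raw words
print(sorted(vocab_frequent))       # ['are', 'is', 'programming', 'python']
```

The first line of output matches the training log above (18 word types from 24 raw words). Raising `min_count` to 2 would leave only four words, which is why tiny corpora are usually trained with `min_count=1`.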


## Word Embeddings - Details

### Popular Word Embedding Models
These are some of the most popular and widely used word embedding models: `Word2Vec`, `GloVe`, and `FastText`.

`Word2Vec`, `GloVe`, and `FastText` are different in their approaches to generating word embeddings, though they share the common goal of representing words as vectors in a continuous vector space. Here’s a brief comparison:

1. `Word2Vec`
- Developed by: Google.
- Approach: Predictive model.
- Architecture: Uses neural networks with either Continuous Bag of Words (CBOW) or Skip-gram models.
    - CBOW: Predicts a target word from a window of context words.
    - Skip-gram: Predicts context words from a target word.
- Training: Trained on large corpora of text, learning to predict words given their context.
- Output: Dense vector representations for each word.

2. `GloVe`
- Developed by: Stanford.
- Approach: Count-based model.
- Architecture: Constructs a co-occurrence matrix of words from a corpus, then factorizes this matrix to find word vectors.
- Training: Uses the statistical information contained in a corpus, specifically the co-occurrence matrix, to find vector representations that capture the probability of word co-occurrences.
- Output: Dense vector representations for each word.

3. `FastText`
- Developed by: Facebook.
- Approach: Predictive model with subword information.
- Architecture: Similar to Word2Vec (CBOW or Skip-gram), but includes subword (character n-grams) information.
- Training: Trains on large corpora of text, learning to predict words given their context, while incorporating subword information.
- Output: Dense vector representations for each word, incorporating subword information.

Each method has its own strengths and weaknesses, and the choice between them depends on the specific requirements of the task at hand, such as the size of the training data, the importance of handling out-of-vocabulary words, and the computational resources available.

### Word2Vec - CBOW & Skip-gram
Word2Vec is a popular technique for generating word embeddings, which are dense vector representations of words in a continuous vector space. There are two main approaches within Word2Vec: Continuous Bag of Words (CBOW) and Skip-gram.

#### Continuous Bag of Words (CBOW)
CBOW predicts the current word based on the context words surrounding it. It uses a sliding window of fixed size to capture the context around the target word. The model learns to predict the target word given the context words.

How CBOW works:
1. Context Window: Select a context window size (e.g., 2 words on either side of the target word).
2. Context Words: For a given target word, identify the words within the context window.
3. Prediction: Use the context words as input to predict the target word.
4. Learning: The neural network adjusts its weights to minimize the prediction error.

Example:
- Sentence: "The quick brown fox jumps over the lazy dog"
- Context window size = 2
- Target word = "brown"
- Context words = ["The", "quick", "fox", "jumps"]

The CBOW model would take the context words ("The", "quick", "fox", "jumps") as input and try to predict the target word "brown".
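The window logic can be sketched in a few lines. `cbow_pairs` is a hypothetical helper written for this illustration; it uses a fixed symmetric window and ignores implementation details such as the randomly reduced window sizes that Word2Vec implementations may apply during training.

```python
def cbow_pairs(tokens, window):
    """Return (context_words, target_word) pairs using a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after the target
        pairs.append((left + right, target))
    return pairs

tokens = "The quick brown fox jumps over the lazy dog".split()
pairs = cbow_pairs(tokens, window=2)

context, target = pairs[2]
print(target, context)  # brown ['The', 'quick', 'fox', 'jumps']
```

Words at the edges of the sentence get a shorter context: the first pair is `(['quick', 'brown'], 'The')`.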

#### Skip-gram
Skip-gram, on the other hand, predicts the context words based on the current word. It uses the current word as the input and predicts the surrounding context words.

How Skip-gram works:
1. Context Window: Select a context window size (e.g., 2 words on either side of the target word).
2. Target Word: For a given context window, identify the target word.
3. Prediction: Use the target word as input to predict the context words.
4. Learning: The neural network adjusts its weights to minimize the prediction error.

Example:
- Sentence: "The quick brown fox jumps over the lazy dog"
- Context window size = 2
- Target word = "brown"
- Context words = ["The", "quick", "fox", "jumps"]

The Skip-gram model would take the target word "brown" as input and try to predict the context words ("The", "quick", "fox", "jumps").
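In Skip-gram, each (target, context word) combination is its own training example. A minimal sketch, with `skipgram_pairs` as a hypothetical helper and a fixed symmetric window as a simplifying assumption:

```python
def skipgram_pairs(tokens, window):
    """Return one (target_word, context_word) pair per word inside the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # the target is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "The quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(tokens, window=2)

brown_pairs = [pair for pair in pairs if pair[0] == "brown"]
print(brown_pairs)
print(len(pairs))  # 30 training pairs from a 9-word sentence
```

The single CBOW example for "brown" becomes four separate Skip-gram examples, which is one reason Skip-gram is slower to train.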

#### Comparison
- CBOW: Faster to train, works well with smaller datasets, and averages the context, which can smooth out noise.
- Skip-gram: Slower to train but can produce higher-quality embeddings, especially for infrequent words, as it focuses on each context-target pair individually.

Both architectures aim to create word embeddings that capture semantic relationships between words based on their contexts, but they do so in different ways. CBOW predicts the target word from its context, while Skip-gram predicts the context from the target word.

### Data structure

To understand what the training data looks like in CBOW (Continuous Bag of Words) when words are one-hot encoded, let's walk through an example step by step. Suppose we have 5 sentences of varying length.

When dealing with multiple sentences of different lengths in the context of training `Word2Vec` using the CBOW model, the process involves creating context-target pairs for each sentence individually. Here's how it works:

General Approach
1. Tokenization: Each sentence is broken down into individual words.
2. Vocabulary Creation: A vocabulary of unique words across all sentences is created. Each word is assigned a unique index.
3. One-Hot Encoding: Each word in the vocabulary is represented as a one-hot encoded vector.

The CBOW model processes each sentence independently, generating context-target pairs based on the chosen context window size. Let's illustrate this with an example:

Example Sentences
1. "I love natural language processing."
2. "Word2Vec is a popular algorithm."
3. "CBOW and Skip-gram are two models."
4. "Training word embeddings is important."
5. "Handling different sentence lengths."

First, we tokenize each sentence:

1. ["I", "love", "natural", "language", "processing"]
2. ["Word2Vec", "is", "a", "popular", "algorithm"]
3. ["CBOW", "and", "Skip-gram", "are", "two", "models"]
4. ["Training", "word", "embeddings", "is", "important"]
5. ["Handling", "different", "sentence", "lengths"]

Then, we create a vocabulary of the 24 unique words ("is" occurs in two sentences but is listed once), assigning each word an index based on its order of first appearance:
"I", "love", "natural", "language", "processing", "Word2Vec", "is", "a", "popular", "algorithm", "CBOW", "and", "Skip-gram", "are", "two", "models", "Training", "word", "embeddings", "important", "Handling", "different", "sentence", "lengths"

#### Context-Target Pairs

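Now, let's create context-target pairs for each sentence. The pairs are generated from the chosen context window size, which we assume to be 2. To keep the listings short, the context here consists only of the words preceding the target (up to two of them); actual Word2Vec uses a symmetric window, so the full context of "natural" would be ["I", "love", "language", "processing"].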

Sentence 1: "I love natural language processing."

1. Target: "natural" | Context: ["I", "love"]
2. Target: "language" | Context: ["love", "natural"]
3. Target: "processing" | Context: ["natural", "language"]

Sentence 2: "Word2Vec is a popular algorithm."

1. Target: "is" | Context: ["Word2Vec"]
2. Target: "a" | Context: ["Word2Vec", "is"]
3. Target: "popular" | Context: ["is", "a"]
4. Target: "algorithm" | Context: ["a", "popular"]

And so on. Sentences of different lengths are handled in exactly the same way: the pairs are generated for each sentence independently, and the window never crosses a sentence boundary. A target near the edge of a sentence simply gets fewer context words, as with "is" in Sentence 2.

#### Understanding the Input Matrix for CBOW

1. Context Window: For each target word, we consider a window of context words around it. Let's assume a context window size of 2 for this explanation.
2. Context-Target Pairs: Each pair consists of context words as input and a target word as output.
3. One-Hot Encoding: Each word in the context and target is represented as a one-hot encoded vector of length equal to the vocabulary size (24 in this case).
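The vocabulary and one-hot steps can be sketched in plain Python. The five tokenized sentences are the ones from the example above; indexing words by first appearance is the assumption stated in the text, and `one_hot` is a hypothetical helper.

```python
tokenized = [
    ["I", "love", "natural", "language", "processing"],
    ["Word2Vec", "is", "a", "popular", "algorithm"],
    ["CBOW", "and", "Skip-gram", "are", "two", "models"],
    ["Training", "word", "embeddings", "is", "important"],
    ["Handling", "different", "sentence", "lengths"],
]

# Assign each unique word an index in order of first appearance
word_to_index = {}
for sentence in tokenized:
    for word in sentence:
        if word not in word_to_index:
            word_to_index[word] = len(word_to_index)

def one_hot(word):
    """Vector of vocabulary length with a single 1 at the word's index."""
    vector = [0] * len(word_to_index)
    vector[word_to_index[word]] = 1
    return vector

print(len(word_to_index))    # 24 unique words
print(word_to_index["is"])   # 6: "is" occurs in two sentences but is indexed once
print(one_hot("love")[:4])   # [0, 1, 0, 0]
```

Each one-hot vector has 24 entries, 23 of which are zero. This sparsity is what the dense embedding vectors are meant to replace.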

Example Breakdown
Given the five sentences and a vocabulary of 24 unique words, the context-target pairs for the first two sentences are the ones listed under Context-Target Pairs above. What remains is to show how these pairs form the input matrix.

Context Matrix Structure
For each context-target pair, the context words are one-hot encoded, and these one-hot vectors are then combined into a single input row. In the worked example that follows they are summed; Word2Vec itself averages the embedding vectors of the context words, which for a fixed number of context words differs only by a constant factor.

Example Pair
Pair: Target: "natural" | Context: ["I", "love"]

One-hot Encoding:
- "I": [1, 0, 0, 0, ..., 0]
- "love": [0, 1, 0, 0, ..., 0]

If we treat each context-target pair as one training example, the input (context) matrix has one row per pair and one column per vocabulary word; each row is the combined one-hot vector of that pair's context words.

Example Scenario
Take the same sentence, "I love natural language processing.", and, to keep the vectors short, restrict the vocabulary to the 10 words of the first two sentences.

- Vocabulary size (V) = 10
- Context window size = 2

One-Hot Encoding Representation
Assume the following one-hot encodings:

I -> [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
love -> [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
natural -> [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
language -> [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
processing -> [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
Word2Vec -> [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
is -> [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
a -> [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
popular -> [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
algorithm -> [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

Context-Target Pairs for "I love natural language processing."
For a context window size of 2:
1. Target: "natural" | Context: ["I", "love"]
- Input (context): Sum of $[1,0,0,0,0,0,0,0,0,0]$ and $[0,1,0,0,0,0,0,0,0,0]$
- Target: $[0,0,1,0,0,0,0,0,0,0]$
2. Target: "language" | Context: ["love", "natural"]
- Input (context): Sum of $[0,1,0,0,0,0,0,0,0,0]$ and $[0,0,1,0,0,0,0,0,0,0]$
- Target: $[0,0,0,1,0,0,0,0,0,0]$
3. Target: "processing" | Context: ["natural", "language"]
- Input (context): Sum of $[0,0,1,0,0,0,0,0,0,0]$ and $[0,0,0,1,0,0,0,0,0,0]$
- Target: $[0,0,0,0,1,0,0,0,0,0]$

Input and Target Matrices
Input Matrix (Summed Context Vectors):
$$
\left[\begin{array}{llllllllll}
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0
\end{array}\right]
$$

Target Matrix:
$$
\left[\begin{array}{llllllllll}
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0
\end{array}\right]
$$

By summing the one-hot encoded vectors of the context words, you create a combined context representation that keeps the 1s from each context word. This keeps one row per training example and gives the input and target matrices the same shape:

- Input Matrix Dimensions: Number of context-target pairs (3) × Vocabulary size (10)
- Target Matrix Dimensions: Number of context-target pairs (3) × Vocabulary size (10)

Each row in the input matrix is a combined context vector, and the same row in the target matrix is the one-hot encoded vector of the corresponding target word.
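The two matrices can be rebuilt programmatically as a check. This sketch uses the 10-word vocabulary and the three context-target pairs exactly as listed in the text (context taken from the two preceding words); it illustrates the data layout only and is not how Gensim stores its training data.

```python
vocab = ["I", "love", "natural", "language", "processing",
         "Word2Vec", "is", "a", "popular", "algorithm"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vector = [0] * len(vocab)
    vector[index[word]] = 1
    return vector

# Context-target pairs as listed in the text
pairs = [
    (["I", "love"], "natural"),
    (["love", "natural"], "language"),
    (["natural", "language"], "processing"),
]

input_matrix = []
target_matrix = []
for context, target in pairs:
    # Sum the one-hot vectors of the context words column by column
    summed = [sum(column) for column in zip(*(one_hot(word) for word in context))]
    input_matrix.append(summed)
    target_matrix.append(one_hot(target))

for row in input_matrix:
    print(row)
```

The printed rows match the input matrix shown above, and both matrices have 3 rows and 10 columns.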


## **It's a single-layer NN with 100 nodes**
  
Let's clarify a bit more about what happens inside Word2Vec.

Word2Vec, whether using the Continuous Bag of Words (CBOW) or Skip-Gram model, indeed leverages a neural network architecture, but it's structured a bit differently than a typical feedforward neural network with a single layer of 100 nodes (when you set vector_size=100). The "100 nodes" or "100 dimensions" represent the size of the word vectors you're aiming to learn, not the nodes of a hidden layer in a traditional sense.

Here's a simplified overview of the process for both CBOW and Skip-Gram models:

Input Layer: For CBOW, the input is the context words (multiple words), which are one-hot encoded vectors representing the presence of words in the context. For Skip-Gram, the input is just the target word. The size of each input vector is equal to the vocabulary size.

Projection Layer (or Hidden Layer): This is not a typical hidden layer with activation functions. Instead, it's a projection layer where the actual learning of word embeddings happens. When you set vector_size=100, it means this layer will have 100 neurons. The weights connecting the input layer to this layer are what become the word embeddings. In training, for a given input word, the corresponding row in the weight matrix is essentially the word vector for that word.

- In CBOW, the vectors from the projection layer corresponding to each context word are averaged before being passed to the output layer.
- In Skip-Gram, the projection layer directly connects to the output layer, using the vector of the input word.

Output Layer: Conceptually, the output layer is a softmax layer that makes predictions. For CBOW, it predicts the target word from the context. For Skip-Gram, it predicts the context words from the target word. The size of this layer is also equal to the vocabulary size. Because a full softmax over a large vocabulary is expensive, implementations usually approximate it; Gensim, for example, uses negative sampling by default.

So, in summary:

- The "100 dimensions" are the 100 weights that connect each word's input unit to the projection layer; these weights are learned during training and become that word's vector.
- The learning involves adjusting these weights so that the model gets better at its prediction task (predicting context words for Skip-Gram, predicting a target word for CBOW), thereby capturing semantic and syntactic word relationships in the process.
- The neural network aspect of Word2Vec is specialized and optimized for learning word embeddings, which makes it different from a general-purpose neural network used for other prediction tasks.

In the context of the Word2Vec architecture and specifically regarding the projection (or hidden) layer where the word embeddings are learned, the activation function can indeed be thought of as an identity function, $f(x)=x$. This means that the output of each neuron in this layer is the same as its input, without any nonlinear transformation applied.
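A CBOW forward pass can be written out by hand to make the three layers concrete. This is a sketch under stated assumptions: a 5-word vocabulary, `vector_size` of 3, hand-fixed weights in `W_in` and `W_out` (training would learn them), and a full softmax output rather than the approximations used in practice.

```python
import math

vocab = ["I", "love", "natural", "language", "processing"]
V, N = len(vocab), 3  # vocabulary size and embedding size (vector_size)

# Hypothetical weights, fixed by hand for illustration; training would learn them
W_in = [[0.1 * (i + j) for j in range(N)] for i in range(V)]    # V x N, row i = embedding of word i
W_out = [[0.05 * (i - j) for j in range(V)] for i in range(N)]  # N x V

def embed(word):
    # Multiplying a one-hot vector by W_in simply selects one row of W_in
    return W_in[vocab.index(word)]

def cbow_forward(context_words):
    vectors = [embed(word) for word in context_words]
    # Projection layer: average of the context embeddings, identity activation f(x) = x
    hidden = [sum(column) / len(vectors) for column in zip(*vectors)]
    # Output layer: one score per vocabulary word, then softmax
    scores = [sum(hidden[k] * W_out[k][j] for k in range(N)) for j in range(V)]
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Check that one-hot times W_in equals a row lookup
one_hot_love = [1 if word == "love" else 0 for word in vocab]
product = [sum(one_hot_love[i] * W_in[i][k] for i in range(V)) for k in range(N)]

probs = cbow_forward(["I", "love"])
print([round(p, 3) for p in probs])
```

The output is a probability distribution over the vocabulary. Training adjusts `W_in` and `W_out` so that the probability of the true target word increases, and the rows of `W_in` are what is kept as the word vectors.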

## 3. Sentiment with labels

Let's work through a simple example.

Once we have a "toy" dataset, we tokenize it. Tokenization is a fundamental step in natural language processing (NLP) and is needed to prepare text for training word embeddings or any other machine learning model. Here's what it does and why it matters for training word embeddings, as in the sentiment analysis example below:

What Tokenization Does:

- Splits Text into Tokens: Tokenization breaks down text into its basic units (tokens), which are typically words or subwords. For instance, the sentence "I love machine learning" would be tokenized into ["I", "love", "machine", "learning"].
- Facilitates Vector Representation: Each token (word) can then be represented as a vector in the word embedding space. This is crucial for training embeddings, as the algorithm needs to work with individual words to learn their semantic and syntactic relationships.
- Handles Punctuation and Special Characters: Depending on the tokenizer, punctuation and special characters are either separated into tokens of their own (as NLTK's `word_tokenize` does) or dropped entirely (as Gensim's `simple_preprocess` does), which makes the text more uniform and easier to process.

In [11]:
# Define the dataset
comments = ["love the new features", "hate the long wait time", "excellent service", "poor experience with the product", "happy with the purchase"]
sentiments = [1, 0, 1, 0, 1]  # 1: Positive, 0: Negative

import nltk
from nltk.tokenize import word_tokenize

# Download the Punkt tokenizer models (a no-op if already present)
nltk.download('punkt')

# Lowercase and tokenize each comment

tokenized_comments = [word_tokenize(comment.lower()) for comment in comments]


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/yigitaydede/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [12]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import numpy as np

# Tokenize comments
tokenized_comments = [word_tokenize(comment.lower()) for comment in comments]

# Train word embeddings
model = Word2Vec(tokenized_comments, vector_size=4, window=2, min_count=1, workers=1)

# View a sample word vector
print("Vector for 'love':", model.wv['love'])

2024-08-05 17:04:44,619 : INFO : collecting all words and their counts
2024-08-05 17:04:44,620 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-08-05 17:04:44,621 : INFO : collected 16 word types from a corpus of 20 raw words and 5 sentences
2024-08-05 17:04:44,621 : INFO : Creating a fresh vocabulary
2024-08-05 17:04:44,622 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 retains 16 unique words (100.0%% of original 16, drops 0)', 'datetime': '2024-08-05T17:04:44.622604', 'gensim': '4.1.2', 'python': '3.9.12 (main, Apr  5 2022, 01:52:34) \n[Clang 12.0.0 ]', 'platform': 'macOS-14.6-arm64-arm-64bit', 'event': 'prepare_vocab'}
2024-08-05 17:04:44,624 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 leaves 20 word corpus (100.0%% of original 20, drops 0)', 'datetime': '2024-08-05T17:04:44.624632', 'gensim': '4.1.2', 'python': '3.9.12 (main, Apr  5 2022, 01:52:34) \n[Clang 12.0.0 ]', 'platform': 'macOS-14.6-arm64-arm-64bit', 'e

Vector for 'love': [-0.03944132  0.00803429 -0.10351574 -0.1920672 ]


In [13]:
# Calculate average word vectors for each comment
average_vectors = []
for comment in tokenized_comments:
    comment_vector = np.zeros(model.vector_size)
    n_known = 0
    for word in comment:
        if word in model.wv:  # skip words not in the vocabulary
            comment_vector += model.wv[word]
            n_known += 1
    # Divide by the number of words actually summed (guarding against zero)
    average_vectors.append(comment_vector / max(n_known, 1))

# Display the average vector for the first comment
print("Average vector for first comment:", average_vectors[0])

Average vector for first comment: [-0.01083414 -0.0337788   0.03695674  0.03914206]


When we talk about a "100-dimensional vector," we're referring to a list or array of 100 numbers, each giving the word's coordinate along one dimension of the embedding space. Together, these coordinates encode various aspects of the word's meaning and usage.

Understanding Dimensions and Averaging
Let's say we have 3 words, each represented by a 4-dimensional word vector (for simplicity, we're using 4 dimensions instead of 100):

- Word 1 vector: [1,2,3,4]
- Word 2 vector: [2,3,4,5]
- Word 3 vector: [3,4,5,6]
  
These vectors might be the embeddings for three words in a sentence. To represent the entire sentence by a single vector, we compute the average of these vectors.

To find the average vector, we calculate the mean for each dimension across all word vectors:

- Dimension 1 average: (1+2+3)/3=2
- Dimension 2 average: (2+3+4)/3=3
- Dimension 3 average: (3+4+5)/3=4
- Dimension 4 average: (4+5+6)/3=5
  
So, the average vector representing the entire sentence is [2,3,4,5].

What does this represent? The averaged vector lives in the same 4-dimensional space as the original word vectors. It is a new vector that summarizes the words in the text, although averaging discards word order, so it is only a rough representation of the text's overall meaning.
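The arithmetic above takes one line of Python. The sketch uses the three example vectors from the text; `average_vector` is a hypothetical helper name.

```python
def average_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(dimension) / len(vectors) for dimension in zip(*vectors)]

word_vectors = [
    [1, 2, 3, 4],  # Word 1
    [2, 3, 4, 5],  # Word 2
    [3, 4, 5, 6],  # Word 3
]

sentence_vector = average_vector(word_vectors)
print(sentence_vector)  # [2.0, 3.0, 4.0, 5.0]

# Reversing the word order gives exactly the same average
reversed_vector = average_vector(word_vectors[::-1])
print(reversed_vector == sentence_vector)  # True
```

The second check makes the main limitation visible: two sentences containing the same words in a different order receive the same averaged vector.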

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Split data into training and test sets
# (with only five comments, test_size=0.2 leaves a single test example)
X_train, X_test, y_train, y_test = train_test_split(average_vectors, sentiments, test_size=0.2)

# Train Logistic Regression model
# (named `clf` so it does not overwrite the Word2Vec `model`)
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Evaluate model on test data
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of Logistic Regression model:", accuracy)


## Word Embedding

The dataset we are using here is a subset of Amazon reviews from the Cell Phones & Accessories category. The data is stored as a gzip-compressed JSON Lines file (one review per line), which pandas can read with `lines=True`.

In [None]:
import gzip
from io import BytesIO

import gensim
import pandas as pd
import requests

# URL of the dataset
url = "http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz"

# Send a HTTP request to the URL
response = requests.get(url)

# Make sure the request was successful
if response.status_code == 200:
    # Open the response content as a gzip file
    with gzip.open(BytesIO(response.content), 'rt') as read_file:
        # Read the dataset into a pandas DataFrame
        data = pd.read_json(read_file, lines=True)
    # Display the first few rows of the DataFrame
    print(data.head())
else:
    print("Failed to download the dataset.")


In [None]:
data.shape


In [None]:
review_text = data.reviewText.apply(gensim.utils.simple_preprocess)


In [None]:
review_text


In [None]:
from gensim.models import Word2Vec

# Assuming 'review_text' is a pandas Series where each entry is a list of tokens (words)
# Convert review_text into a list of lists of tokens for training
sentences = review_text.tolist()

# Initialize and train the Word2Vec model
model = Word2Vec(sentences=sentences,
                 vector_size=100,  # Size of word vectors; adjust based on your needs
                 window=10,
                 min_count=2,
                 workers=4)

# Summarize the loaded model
print(model)

# Save the model for later use
model.save("word2vec_amazon_reviews.model")

# Access vectors for a word
print("Vector for the word 'phone':", model.wv['phone'])

# Find most similar words to 'phone'
print("Words similar to 'phone':", model.wv.most_similar('phone'))



In [None]:
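# Note: Word2Vec(sentences=...) above already built the vocabulary and trained the model.
# build_vocab() and train() are the explicit two-step route, normally used on a model
# created without a corpus; they are shown here to expose what the constructor does.
model.build_vocab(review_text, progress_per=1000)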


In [None]:
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)


`model.train` returns a pair of numbers, here (61502857, 83868975):

The first number (61502857): This is the effective number of words the model actually trained on, summed over all epochs. It is smaller than the raw count because words dropped by `min_count` are skipped and very frequent words are randomly downsampled.

The second number (83868975): This is the total number of raw words presented to the model, that is, the sum of the lengths of all sentences in the training data multiplied by the number of epochs, before any filtering for `min_count` (minimum word frequency) or downsampling.

In [None]:
model.wv.most_similar("bad")


In [None]:
model.wv.similarity(w1="great", w2="great")



In [None]:
model.wv.similarity(w1="great", w2="good")


## Finding Similar Words

After training our Word2Vec model on customer reviews, we have transformed words into vectors that capture semantic meanings, relationships, and context within our dataset. This opens up a variety of ways to analyze and gain insights from the customer reviews. Here are some practical applications and analyses we can perform:

1. Finding Similar Words (already done above)
Discover words that are semantically related to specific terms. This can help identify common themes or issues in reviews. For example, finding words similar to "battery" might reveal related concerns or praise in product reviews.

In [None]:
similar_words = model.wv.most_similar('battery', topn=10)
print(similar_words)


## Word Clustering

Word clustering groups words by their vector representations, so that words in the same cluster have similar meanings or are used in similar contexts. Techniques like K-means can be applied directly to the word vectors. The resulting clusters can reveal patterns, themes, or topics common in your data; in customer reviews, for instance, you might find clusters around product features, customer service, or shipping issues.

Let's demonstrate word clustering using K-means on the Word2Vec embeddings we trained. We'll use a subset of the most frequent words to make the clusters more interpretable, and then discuss the insights that can be gained from this analysis.

**Step 1: Preparing Word Vectors**
First, extract a set of word vectors from the Word2Vec model. For demonstration, we'll use the 100 most frequent words (excluding very common but less informative words).

In [None]:
from sklearn.cluster import KMeans
import numpy as np
from nltk.corpus import stopwords

# Assuming `model` is your trained Word2Vec model
stop_words = set(stopwords.words('english'))

# index_to_key is ordered from most to least frequent (gensim's default),
# so the first 100 non-stop words are the most frequent informative words
words = [w for w in model.wv.index_to_key if w not in stop_words][:100]

# Look up the vectors for the selected words: shape (len(words), vector_size)
word_vectors = model.wv[words]


: 

**Step 2: Clustering Words**
Now, we'll use K-means clustering to group these words into clusters based on their vector similarities.

In [None]:
# Number of clusters
k = 10  # Example: 10 clusters. Adjust based on your analysis needs.

# Perform KMeans clustering
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
kmeans.fit(word_vectors)

# Assign each word to a cluster
word_cluster_labels = kmeans.labels_


: 
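
The value k = 10 above is only an example. One common heuristic for comparing candidate values is the silhouette score, which ranges from -1 to 1 and is higher when clusters are compact and well separated. The sketch below runs on synthetic data with three obvious groups (the `toy_vectors` array is a hypothetical stand-in for real word vectors, where the picture is usually far less clear-cut):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for word vectors: 3 well-separated groups in 5 dimensions
rng = np.random.default_rng(42)
centers = np.array([
    [0, 0, 0, 0, 0],
    [10, 10, 10, 10, 10],
    [-10, 10, -10, 10, -10],
])
toy_vectors = np.vstack([c + rng.normal(scale=0.5, size=(30, 5)) for c in centers])

# Fit K-means for several candidate values of k and score each clustering
scores = {}
for candidate_k in range(2, 7):
    km = KMeans(n_clusters=candidate_k, n_init=10, random_state=42)
    labels = km.fit_predict(toy_vectors)
    scores[candidate_k] = silhouette_score(toy_vectors, labels)

best_k = max(scores, key=scores.get)
for candidate_k, score in scores.items():
    print(f"k={candidate_k}: silhouette={score:.3f}")
print(f"Best k by silhouette score: {best_k}")
```

On real embeddings the scores tend to be much lower and flatter, so the result is best treated as a guide alongside manual inspection of the clusters.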

**Step 3: Examining the Clusters**
After clustering, let's examine which words ended up in the same clusters. This will give us an idea of the themes or topics present in the reviews.

In [None]:
# Create a dictionary of word clusters
word_clusters = {i: [] for i in range(k)}
for word, cluster_label in zip(words, word_cluster_labels):
    word_clusters[cluster_label].append(word)

# Display words in each cluster
# Use a separate loop variable so the `words` list from Step 1 is not overwritten
for cluster, cluster_words in word_clusters.items():
    print(f"Cluster {cluster}: {cluster_words[:10]}")  # Displaying first 10 words for brevity


: 

**Insights from Word Clustering**

Theme Identification: Each cluster represents a group of words that are contextually similar. By examining the words in each cluster, you can identify common themes or topics in the reviews. For example, a cluster containing words like "battery", "charge", and "power" might indicate discussions about battery life.

Product Features and Issues: Clusters might reveal specific product features that customers talk about the most, as well as recurring issues or areas of dissatisfaction.

Customer Sentiment: Although not a direct measure of sentiment, the clustering of certain words together can give clues about overall customer sentiment. Words with positive connotations clustering together separately from words with negative connotations could indicate polarized opinions about certain aspects of the product or service.

Improving Product and Service: By identifying clusters related to customer service, shipping, product durability, etc., businesses can pinpoint areas for improvement.
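
To make theme identification less dependent on which words happen to be listed first, one option is to rank the words in each cluster by their distance to the cluster centroid and read the closest ones as the cluster's "representatives". The sketch below uses hypothetical toy words, vectors, and labels; with the real objects, the word list and vectors from Step 1 and `kmeans.labels_` would take their place. The helper `representative_words` is an illustrative name, not part of any library.

```python
import numpy as np

# Hypothetical toy data standing in for the real words, vectors, and K-means labels
toy_words = ["battery", "charge", "power", "camera", "photo", "lens"]
toy_vectors = np.array([
    [1.0, 0.0], [0.9, 0.1], [0.6, 0.4],   # cluster 0
    [0.0, 1.0], [0.1, 0.9], [0.4, 0.6],   # cluster 1
])
toy_labels = np.array([0, 0, 0, 1, 1, 1])

def representative_words(words, vectors, labels, topn=2):
    """For each cluster, return the `topn` words nearest to the cluster centroid."""
    result = {}
    for cluster in np.unique(labels):
        idx = np.where(labels == cluster)[0]          # positions of this cluster's words
        centroid = vectors[idx].mean(axis=0)          # mean vector of the cluster
        distances = np.linalg.norm(vectors[idx] - centroid, axis=1)
        nearest = idx[np.argsort(distances)[:topn]]   # closest words first
        result[int(cluster)] = [words[i] for i in nearest]
    return result

reps = representative_words(toy_words, toy_vectors, toy_labels)
print(reps)
```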

## Sentiment Analysis

Now let's try sentiment analysis. Performing sentiment analysis without pre-labeled data is a common challenge, but there are several approaches we can take to analyze sentiment in our customer reviews.

**Lexicon-Based Approach**
  
This method relies on predefined lists of words associated with positive and negative sentiments. You can use libraries like TextBlob or VADER, which come with built-in sentiment lexicons and can provide sentiment scores based on the presence and combinations of positive and negative words in your text.

Here is an example:



In [None]:
from textblob import TextBlob

# Example review
review = "The phone has an amazing battery life but a disappointing camera."

# Get sentiment polarity
sentiment = TextBlob(review).sentiment.polarity
print(f"Sentiment polarity: {sentiment}")


: 

A positive polarity score indicates a positive sentiment, while a negative score indicates a negative sentiment. TextBlob can be a straightforward way to start with sentiment analysis without needing labeled data.
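
To make the lexicon idea concrete, here is a deliberately tiny scorer in plain Python. The word list and scores are invented for illustration and are not TextBlob's or VADER's actual lexicons. Real tools also handle negation and intensifiers, which this sketch ignores.

```python
# Hypothetical mini-lexicon: word -> polarity score in [-1, 1]
toy_lexicon = {
    "amazing": 0.8,
    "great": 0.7,
    "good": 0.5,
    "disappointing": -0.6,
    "bad": -0.7,
    "terrible": -0.9,
}

def toy_polarity(text):
    """Average the lexicon scores of the words found in `text` (0.0 if none match)."""
    tokens = [token.strip(".,!?").lower() for token in text.split()]
    hits = [toy_lexicon[token] for token in tokens if token in toy_lexicon]
    return sum(hits) / len(hits) if hits else 0.0

review = "The phone has an amazing battery life but a disappointing camera."
print(f"Toy polarity: {toy_polarity(review):.2f}")
```

The mixed review lands close to zero because the positive and negative words nearly cancel out. This is one reason a single document-level score can hide opposite opinions about different product features.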

Two popular tools for lexicon-based sentiment analysis are TextBlob and VADER (Valence Aware Dictionary and sEntiment Reasoner), which suit different types of text data. Here, I'll show you how to use both, and you can choose based on your preference and the nature of your dataset.

TextBlob is straightforward and works well for general-purpose sentiment analysis, including on longer texts like reviews.

In [None]:
# Applying TextBlob sentiment analysis on the reviewText column
# Compute the sentiment once per review, then split it into two columns
sentiments = data['reviewText'].apply(lambda x: TextBlob(x).sentiment)
data['sentiment_polarity'] = sentiments.apply(lambda s: s.polarity)
data['sentiment_subjectivity'] = sentiments.apply(lambda s: s.subjectivity)


: 

Now that you have sentiment scores, you can analyze them to gain insights into the overall sentiment of the reviews, such as:

Overall Sentiment: Calculate the average sentiment polarity to get an idea of the overall sentiment towards the product.


In [None]:
average_sentiment = data['sentiment_polarity'].mean()
print(f"Average Sentiment Polarity: {average_sentiment}")


: 

An average sentiment polarity of approximately 0.248 suggests that the overall sentiment in the reviews leans towards the positive side. This is a good starting point for understanding customer sentiment, but we can dig deeper for more nuanced insights. First, we count how many reviews are positive, neutral, and negative. Then, to understand how sentiment varies across different aspects of the product, like its battery life, camera quality, or customer service, we filter the reviews that mention each feature and calculate their average sentiment:

In [None]:
positive_reviews = data[data['sentiment_polarity'] > 0].shape[0]
neutral_reviews = data[data['sentiment_polarity'] == 0].shape[0]
negative_reviews = data[data['sentiment_polarity'] < 0].shape[0]

print(f"Positive Reviews: {positive_reviews}")
print(f"Neutral Reviews: {neutral_reviews}")
print(f"Negative Reviews: {negative_reviews}")


: 
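
The three counts can also be obtained in one pass by mapping the sign of the polarity to a label. Here is a minimal sketch on a made-up DataFrame (the `toy` frame, its values, and the `sentiment_label` column name are hypothetical; on the real data the same lines would apply to `data['sentiment_polarity']`):

```python
import numpy as np
import pandas as pd

# Hypothetical polarity scores standing in for data['sentiment_polarity']
toy = pd.DataFrame({"sentiment_polarity": [0.5, 0.0, -0.3, 0.8, -0.1, 0.2]})

# np.sign gives -1, 0, or 1; map each sign to a readable label
sign_to_label = {-1.0: "negative", 0.0: "neutral", 1.0: "positive"}
toy["sentiment_label"] = np.sign(toy["sentiment_polarity"]).map(sign_to_label)

# Counts and shares per label
label_counts = toy["sentiment_label"].value_counts()
label_shares = toy["sentiment_label"].value_counts(normalize=True)
print(label_counts)
print(label_shares.round(2))
```

Keeping the label as a column also makes later grouping easier, for example when comparing label shares across product features.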

In [None]:
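features = ['battery', 'camera', 'service']
for feature in features:
    # na=False treats missing review text as "no match" instead of raising an error
    feature_reviews = data[data['reviewText'].str.contains(feature, case=False, na=False)]
    avg_sentiment = feature_reviews['sentiment_polarity'].mean()
    # Report the number of matching reviews so small samples are easy to spot
    print(f"Average sentiment for {feature} ({len(feature_reviews)} reviews): {avg_sentiment}")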


: 