# **NLP with Python - Basics**

Dr. Aydede

Natural Language Processing (NLP) and Large Language Models (LLMs) are grounded in statistical methods that rely on numerical representations rather than directly using words. LLMs represent the latest advancement in NLP, pushing the boundaries of text processing and generation. Converting text into numbers is a key step in both fields, and one common technique is word embedding. However, the approach to word embeddings in LLMs differs significantly from traditional NLP methods. In this context, we’ll focus on word embeddings specifically within traditional NLP.

Creating word embeddings from raw text involves several steps, such as text preprocessing, tokenization, and the application of embedding algorithms. Below, we provide a detailed overview of these steps along with commonly used Python libraries for each stage.

## 1. Text Preprocessing

Preprocessing is crucial to clean and normalize the text data. This step typically includes:

- Lowercasing
- Removing punctuation
- Removing stop words
- Lemmatization or stemming


In [1]:
# Import necessary libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string
import re
import gensim
from gensim.models import Word2Vec

# Download NLTK data files if not already downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')  # Downloading the missing resource

# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Sample text (replace with your financial report text)
text = "This is a sample financial report text with numbers, punctuations, and various stop words."

# Text Preprocessing
def preprocess_text(text):
    text = text.lower()  # Lowercasing
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)  # Removing punctuation (escaped so every character is treated literally)
    tokens = word_tokenize(text)  # Tokenization
    processed_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]  # Removing stop words and lemmatization
    return processed_tokens

processed_tokens = preprocess_text(text)
print("Processed Tokens:", processed_tokens)


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/YigitAydede/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/YigitAydede/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/YigitAydede/nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/YigitAydede/nltk_data...


Processed Tokens: ['sample', 'financial', 'report', 'text', 'number', 'punctuation', 'various', 'stop', 'word']


The following lines download necessary data files from the NLTK (Natural Language Toolkit) library:

- `nltk.download('punkt')`: Downloads the Punkt tokenizer models, which are used for tokenizing text into sentences and words.
- `nltk.download('stopwords')`: Downloads a list of common stop words in various languages. Stop words are words that are commonly filtered out in natural language processing tasks because they don't carry significant meaning (e.g., "and", "the", "is").
- `nltk.download('wordnet')`: Downloads the WordNet lexical database, which is used for lemmatization, finding synonyms, and other lexical tasks.
- `nltk.download('omw-1.4')`: Downloads the Open Multilingual WordNet package, which is needed for certain WordNet functions, particularly for handling multiple languages.

The following lines initialize the tools for lemmatization and stop word removal:

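- `WordNetLemmatizer()`: Creates an instance of the WordNet lemmatizer, which reduces words to their base or dictionary form (e.g., "numbers" becomes "number"). By default, `lemmatize` treats every token as a noun, so a verb form such as "running" is reduced to "run" only when the call includes `pos="v"`.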
- `set(stopwords.words('english'))`: Creates a set of English stop words to efficiently check if a word is a stop word. Using a set makes membership tests faster.
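
As a rough illustration of the first three preprocessing steps without any downloads, the sketch below uses only the standard library. The `TOY_STOP_WORDS` set and the `basic_preprocess` helper are hypothetical stand-ins for NLTK's stop-word list and tokenizer, and lemmatization is left out entirely.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word set (NLTK's English list is much longer).
TOY_STOP_WORDS = {"this", "is", "a", "with", "and", "the"}

def basic_preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    # re.escape makes every punctuation character literal inside the character class.
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    tokens = text.split()
    # Set membership keeps the stop-word check fast.
    return [tok for tok in tokens if tok not in TOY_STOP_WORDS]

sample = "This is a sample financial report text with numbers, punctuations, and various stop words."
toy_tokens = basic_preprocess(sample)
print(toy_tokens)
```

Without a lemmatizer the plural forms ("numbers", "words") survive, which is the gap that `WordNetLemmatizer` fills in the cell above.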

## 2. `word2vec`

Word embedding is a technique in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually, it maps each word from a space with one dimension per vocabulary word (as in one-hot encoding) to a continuous vector space of much lower dimension. The key idea is to capture semantic meaning, syntactic similarity, and relations between words in these vectors, so that words with similar meanings lie close to each other in the vector space.

Word embedding models like Word2Vec, GloVe (Global Vectors for Word Representation), and FastText have become foundational in NLP applications because they can reduce the dimensionality of text data while preserving lexical and semantic word relationships.

Let's take a simple example using Word2Vec from the Gensim library. Word2Vec can be trained with either the Continuous Bag of Words (CBOW) or Skip-Gram model. In CBOW, the model predicts a word given its context. In Skip-Gram, it predicts the context given a word. Here's how you can use Gensim to train a simple Word2Vec model on a small dataset:

1. First, we'll create a small dataset (corpus).
2. Then, we'll train a Word2Vec model on this dataset.
3. Finally, we'll explore the resulting word embeddings.

In [2]:
from gensim.models import Word2Vec
import logging

# Enable logging for monitoring training
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Sample sentences
sentences = [
    ['python', 'is', 'a', 'programming', 'language'],
    ['python', 'and', 'java', 'are', 'popular', 'programming', 'languages'],
    ['python', 'programs', 'are', 'easy', 'to', 'write'],
    ['machine', 'learning', 'is', 'fun', 'with', 'python']
]

# Train a Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Summarize the trained model
print("Word2Vec model:", model)

# Access vectors for one word
print("Vector for 'python':", model.wv['python'])

# Find most similar words
print("Words similar to 'python':", model.wv.most_similar('python'))


2024-08-25 15:26:50,270 : INFO : collecting all words and their counts
2024-08-25 15:26:50,271 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-08-25 15:26:50,272 : INFO : collected 18 word types from a corpus of 24 raw words and 4 sentences
2024-08-25 15:26:50,272 : INFO : Creating a fresh vocabulary
2024-08-25 15:26:50,273 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 retains 18 unique words (100.00% of original 18, drops 0)', 'datetime': '2024-08-25T15:26:50.273001', 'gensim': '4.3.2', 'python': '3.9.19 (main, May  6 2024, 14:39:30) \n[Clang 14.0.6 ]', 'platform': 'macOS-14.6.1-arm64-arm-64bit', 'event': 'prepare_vocab'}
2024-08-25 15:26:50,273 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 leaves 24 word corpus (100.00% of original 24, drops 0)', 'datetime': '2024-08-25T15:26:50.273295', 'gensim': '4.3.2', 'python': '3.9.19 (main, May  6 2024, 14:39:30) \n[Clang 14.0.6 ]', 'platform': 'macOS-14.6.1-arm64-arm-64bit'

Word2Vec model: Word2Vec<vocab=18, vector_size=100, alpha=0.025>
Vector for 'python': [-5.3622725e-04  2.3643136e-04  5.1033497e-03  9.0092728e-03
 -9.3029495e-03 -7.1168090e-03  6.4588725e-03  8.9729885e-03
 -5.0154282e-03 -3.7633716e-03  7.3805046e-03 -1.5334714e-03
 -4.5366134e-03  6.5540518e-03 -4.8601604e-03 -1.8160177e-03
  2.8765798e-03  9.9187379e-04 -8.2852151e-03 -9.4488179e-03
  7.3117660e-03  5.0702621e-03  6.7576934e-03  7.6286553e-04
  6.3508903e-03 -3.4053659e-03 -9.4640139e-04  5.7685734e-03
 -7.5216377e-03 -3.9361035e-03 -7.5115822e-03 -9.3004224e-04
  9.5381187e-03 -7.3191668e-03 -2.3337686e-03 -1.9377411e-03
  8.0774371e-03 -5.9308959e-03  4.5162440e-05 -4.7537340e-03
 -9.6035507e-03  5.0072931e-03 -8.7595852e-03 -4.3918253e-03
 -3.5099984e-05 -2.9618145e-04 -7.6612402e-03  9.6147433e-03
  4.9820580e-03  9.2331432e-03 -8.1579173e-03  4.4957981e-03
 -4.1370760e-03  8.2453608e-04  8.4986202e-03 -4.4621765e-03
  4.5175003e-03 -6.7869602e-03 -3.5484887e-03  9.3985079e-03

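This example demonstrates the basics of training a Word2Vec model with Gensim. Here, `vector_size` specifies the dimensionality of the word vectors, `window` defines the maximum distance between the current and predicted word within a sentence, `min_count` ignores all words with total frequency lower than this value, and `workers` sets the number of training threads. Because `sg` is left at its default of 0, the model is trained with CBOW; passing `sg=1` selects Skip-gram instead.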

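After training, we access the vector for "python" and find the words most similar to "python" based on their embeddings. With a corpus of only four sentences, the vectors are still largely determined by their random initialization, so the similarities shown here illustrate the API rather than meaningful semantics; useful embeddings require a much larger corpus.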
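
Similarity between word vectors is usually measured with cosine similarity, which is also how `most_similar` ranks its results. The sketch below computes it by hand on made-up 3-dimensional vectors; `toy_vectors` and its values are hypothetical and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, chosen only to make the ranking easy to follow.
toy_vectors = {
    "python": [0.9, 0.1, 0.3],
    "java": [0.8, 0.2, 0.4],
    "fun": [-0.2, 0.9, 0.1],
}

query = toy_vectors["python"]
ranked = sorted(
    ((word, cosine_similarity(query, vec))
     for word, vec in toy_vectors.items() if word != "python"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)  # "java" ranks above "fun" for these toy values
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing embeddings.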


## 3. Word Embeddings - Details

### Popular Word Embedding Models
`Word2Vec`, `GloVe`, and `FastText` are among the most widely used word embedding models. They share the goal of representing words as vectors in a continuous vector space but differ in how they generate those vectors. Here’s a brief comparison:

1. `Word2Vec`
- Developed by: Google.
- Approach: Predictive model.
- Architecture: Uses neural networks with either Continuous Bag of Words (CBOW) or Skip-gram models.
    - CBOW: Predicts a target word from a window of context words.
    - Skip-gram: Predicts context words from a target word.
- Training: Trained on large corpora of text, learning to predict words given their context.
- Output: Dense vector representations for each word.

2. `GloVe`
- Developed by: Stanford.
- Approach: Count-based model.
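- Architecture: Constructs a word-word co-occurrence matrix from a corpus, then fits word vectors whose dot products approximate the logarithm of the co-occurrence counts (a weighted least-squares factorization of the matrix).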
- Training: Uses the statistical information contained in a corpus, specifically the co-occurrence matrix, to find vector representations that capture the probability of word co-occurrences.
- Output: Dense vector representations for each word.

3. `FastText`
- Developed by: Facebook.
- Approach: Predictive model with subword information.
- Architecture: Similar to Word2Vec (CBOW or Skip-gram), but includes subword (character n-grams) information.
- Training: Trains on large corpora of text, learning to predict words given their context, while incorporating subword information.
- Output: Dense vector representations for each word, incorporating subword information.

Each method has its own strengths and weaknesses, and the choice between them depends on the specific requirements of the task at hand, such as the size of the training data, the importance of handling out-of-vocabulary words, and the computational resources available.
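
To make the "subword information" in FastText more concrete, here is a simplified sketch of how a word can be broken into character n-grams with boundary markers. The `char_ngrams` helper is hypothetical; the actual library does more (for example, it also keeps the whole word and hashes n-grams into buckets), so treat this as an illustration of the idea only. The default range of 3 to 6 characters is an assumption based on FastText's usual settings.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

trigrams = char_ngrams("where", n_min=3, n_max=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']
```

Because a word's vector is built from the vectors of its n-grams, a word that never appeared in training can still receive a vector from the n-grams it shares with known words.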

### Word2Vec - CBOW & Skip-gram
Word2Vec is a popular technique for generating word embeddings, which are dense vector representations of words in a continuous vector space. There are two main approaches within Word2Vec: Continuous Bag of Words (CBOW) and Skip-gram.

#### Continuous Bag of Words (CBOW)
CBOW predicts the current word based on the context words surrounding it. It uses a sliding window of fixed size to capture the context around the target word. The model learns to predict the target word given the context words.

How CBOW works:
1. Context Window: Select a context window size (e.g., 2 words on either side of the target word).
2. Context Words: For a given target word, identify the words within the context window.
3. Prediction: Use the context words as input to predict the target word.
4. Learning: The neural network adjusts its weights to minimize the prediction error.

Example:
- Sentence: "The quick brown fox jumps over the lazy dog"
- Context window size = 2
- Target word = "brown"
- Context words = ["The", "quick", "fox", "jumps"]

The CBOW model would take the context words ("The", "quick", "fox", "jumps") as input and try to predict the target word "brown".
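
The window logic for this example can be sketched in a few lines. The `context_window` helper is a hypothetical name used only for illustration; it collects up to `window` words on each side of the target.

```python
def context_window(tokens, position, window=2):
    """Return up to `window` tokens on each side of tokens[position]."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right

sentence = "The quick brown fox jumps over the lazy dog".split()
target_index = sentence.index("brown")

# One CBOW training example: (context words, target word)
cbow_example = (context_window(sentence, target_index), sentence[target_index])
print(cbow_example)  # (['The', 'quick', 'fox', 'jumps'], 'brown')
```

Near the start or end of a sentence the window is simply truncated, so the first word "The" has only the two following words as context.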

#### Skip-gram
Skip-gram, on the other hand, predicts the context words based on the current word. It uses the current word as the input and predicts the surrounding context words.

How it works:
1. Context Window: Select a context window size (e.g., 2 words on either side of the target word).
2. Target Word: For a given context window, identify the target word.
3. Prediction: Use the target word as input to predict the context words.
4. Learning: The neural network adjusts its weights to minimize the prediction error.

Example:
- Sentence: "The quick brown fox jumps over the lazy dog"
- Context window size = 2
- Target word = "brown"
- Context words = ["The", "quick", "fox", "jumps"]

The Skip-gram model would take the target word "brown" as input and try to predict the context words ("The", "quick", "fox", "jumps").
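
In Skip-gram, each (target, context word) combination becomes its own training pair. A minimal sketch of that pair generation, using a hypothetical `skipgram_pairs` helper:

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context_word) pairs for every position in `tokens`."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "The quick brown fox jumps over the lazy dog".split()
all_pairs = skipgram_pairs(sentence)
brown_pairs = [pair for pair in all_pairs if pair[0] == "brown"]
print(brown_pairs)
# [('brown', 'The'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]
```

One target word therefore yields several training pairs, which is part of why Skip-gram takes longer to train than CBOW on the same text.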

#### Comparison
- CBOW: Faster to train and tends to represent frequent words slightly better; averaging the context can smooth out noise.
- Skip-gram: Slower to train but can produce higher-quality embeddings, especially for infrequent words and smaller corpora, as it treats each context-target pair individually.

Both architectures aim to create word embeddings that capture semantic relationships between words based on their contexts, but they do so in different ways. CBOW predicts the target word from its context, while Skip-gram predicts the context from the target word.

### Data structure

To understand what the data looks like in CBOW (Continuous Bag of Words) when `Word2Vec` is described with one-hot encoding, let's walk through an example step by step. Suppose we have five sentences of varying lengths.

When dealing with multiple sentences of different lengths in the context of training `Word2Vec` using the CBOW model, the process involves creating context-target pairs for each sentence individually. Here's how it works:

General Approach
1. Tokenization: Each sentence is broken down into individual words.
2. Vocabulary Creation: A vocabulary of unique words across all sentences is created. Each word is assigned a unique index.
3. One-Hot Encoding: Each word in the vocabulary is represented as a one-hot encoded vector.

The CBOW model processes each sentence independently, generating context-target pairs based on the chosen context window size. Let's illustrate this with an example:

Example Sentences
1. "I love natural language processing."
2. "Word2Vec is a popular algorithm."
3. "CBOW and Skip-gram are two models."
4. "Training word embeddings is important."
5. "Handling different sentence lengths."

First, we tokenize each sentence:

1. ["I", "love", "natural", "language", "processing"]
2. ["Word2Vec", "is", "a", "popular", "algorithm"]
3. ["CBOW", "and", "Skip-gram", "are", "two", "models"]
4. ["Training", "word", "embeddings", "is", "important"]
5. ["Handling", "different", "sentence", "lengths"]

Then, we create a vocabulary of the 24 unique words, assigning each word an index by order of first appearance ("is" occurs in two sentences but is counted once):
"I", "love", "natural", "language", "processing", "Word2Vec", "is", "a", "popular", "algorithm", "CBOW", "and", "Skip-gram", "are", "two", "models", "Training", "word", "embeddings", "important", "Handling", "different", "sentence", "lengths"
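
The tokenization and vocabulary steps can be reproduced with a short sketch. The `tokenize` function here is a naive stand-in written for these five sentences (it drops the final period and splits on spaces); it is not NLTK's tokenizer.

```python
sentences = [
    "I love natural language processing.",
    "Word2Vec is a popular algorithm.",
    "CBOW and Skip-gram are two models.",
    "Training word embeddings is important.",
    "Handling different sentence lengths.",
]

def tokenize(sentence):
    # Naive tokenizer for this example only: drop the final period, split on spaces.
    return sentence.rstrip(".").split()

tokenized = [tokenize(s) for s in sentences]

# Assign each new word the next free index, in order of first appearance.
word_to_index = {}
for tokens in tokenized:
    for token in tokens:
        if token not in word_to_index:
            word_to_index[token] = len(word_to_index)

print(len(word_to_index))   # 24
print(word_to_index["is"])  # 6
```

The index of a word is the position of the single 1 in its one-hot vector, so the vocabulary size (24) is also the length of every one-hot vector.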

#### Context-Target Pairs

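For each sentence, context-target pairs are generated based on the chosen context window size. Let's assume the context window size is 2. To keep the lists short, the context below contains only the (up to) two words that precede the target; a symmetric window, as in the "brown" example above, would also include up to two words that follow it. The lists are illustrative rather than exhaustive.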

Sentence 1: "I love natural language processing."

1. Target: "natural" | Context: ["I", "love"]
2. Target: "language" | Context: ["love", "natural"]
3. Target: "processing" | Context: ["natural", "language"]

Sentence 2: "Word2Vec is a popular algorithm."

1. Target: "is" | Context: ["Word2Vec"]
2. Target: "a" | Context: ["Word2Vec", "is"]
3. Target: "popular" | Context: ["is", "a"]
4. Target: "algorithm" | Context: ["a", "popular"]

And so on. Sentences of different lengths are handled the same way: context-target pairs are generated for each sentence independently, and the context window is applied within each sentence, never across a sentence boundary.

#### Understanding the Input Matrix for CBOW

1. Context Window: For each target word, we consider a window of context words around it. Let's assume a context window size of 2 for this explanation.
2. Context-Target Pairs: Each pair consists of context words as input and a target word as output.
3. One-Hot Encoding: Each word in the context and target is represented as a one-hot encoded vector of length equal to the vocabulary size (24 in this case).

Example Breakdown
Given the five sentences and a vocabulary of 24 unique words, let's outline the context-target pairs and how they form the input matrix.

Context Matrix Structure
For each context-target pair, the context words are one-hot encoded, and these vectors are combined into a single input row: summed in the example that follows, and averaged by default in Word2Vec's CBOW.

Example Pair
Pair: Target: "natural" | Context: ["I", "love"]

One-hot Encoding:  
"I": [1, 0, 0, 0, ..., 0]
"love": [0, 1, 0, 0, ..., 0]

If we treat each context-target pair as one training example, the input (context) matrix has one row per pair, holding the combined context vector, and one column per vocabulary word.

Example Scenario

Consider the same sentence: "I love natural language processing."

- Vocabulary size (V) = 10 (only the words of the first two sentences, to keep the vectors short)
- Context window size = 2

One-Hot Encoding Representation

Assume the following one-hot encodings:

- I -> [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
- love -> [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
- natural -> [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
- language -> [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
- processing -> [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
- Word2Vec -> [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
- is -> [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
- a -> [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
- popular -> [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
- algorithm -> [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

Context-Target Pairs for "I love natural language processing."

For a context window size of 2:

1. Target: "natural" | Context: ["I", "love"]
    - Input (context): Sum of $[1,0,0,0,0,0,0,0,0,0]$ and $[0,1,0,0,0,0,0,0,0,0]$
    - Target: $[0,0,1,0,0,0,0,0,0,0]$
2. Target: "language" | Context: ["love", "natural"]
    - Input (context): Sum of $[0,1,0,0,0,0,0,0,0,0]$ and $[0,0,1,0,0,0,0,0,0,0]$
    - Target: $[0,0,0,1,0,0,0,0,0,0]$
3. Target: "processing" | Context: ["natural", "language"]
    - Input (context): Sum of $[0,0,1,0,0,0,0,0,0,0]$ and $[0,0,0,1,0,0,0,0,0,0]$
    - Target: $[0,0,0,0,1,0,0,0,0,0]$

Input and Target Matrices
Input Matrix (Summed Context Vectors):
$$
\left[\begin{array}{llllllllll}
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0
\end{array}\right]
$$

Target Matrix:
$$
\left[\begin{array}{llllllllll}
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0
\end{array}\right]
$$

Summing the one-hot encoded vectors of the context words produces one combined context row per training example, with a 1 in the position of each context word. The sum is used here to keep the arithmetic simple; Word2Vec itself averages the projected context vectors, as described in the next section. Either way, the dimensions line up:

- Input matrix dimensions: number of context-target pairs (3) × vocabulary size (10)
- Target matrix dimensions: number of context-target pairs (3) × vocabulary size (10)

Each row of the input matrix is a combined context vector, and the corresponding row of the target matrix is the one-hot encoded vector of the target word.
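
The two matrices above can be rebuilt with plain Python lists. This is only a sketch of the bookkeeping in this section; the helper names `one_hot` and `sum_vectors` are hypothetical, and the three pairs are copied from the example rather than generated.

```python
vocab = ["I", "love", "natural", "language", "processing",
         "Word2Vec", "is", "a", "popular", "algorithm"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """One-hot vector of length V with a 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def sum_vectors(vectors):
    """Element-wise sum of equally long vectors."""
    return [sum(column) for column in zip(*vectors)]

# (context words, target word) pairs from the example above
pairs = [
    (["I", "love"], "natural"),
    (["love", "natural"], "language"),
    (["natural", "language"], "processing"),
]

input_matrix = [sum_vectors([one_hot(w) for w in context]) for context, _ in pairs]
target_matrix = [one_hot(target) for _, target in pairs]

for row in input_matrix:
    print(row)
```

Both matrices have 3 rows and 10 columns, matching the dimensions stated above.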


### **It's a single-layer NN with 100 nodes**
  
Let's clarify a bit more about what happens inside Word2Vec.

Word2Vec, whether using the Continuous Bag of Words (CBOW) or Skip-Gram model, indeed leverages a neural network architecture, but it's structured a bit differently than a typical feedforward neural network with a single layer of 100 nodes (when you set vector_size=100). The "100 nodes" or "100 dimensions" represent the size of the word vectors you're aiming to learn, not the nodes of a hidden layer in a traditional sense.

Here's a simplified overview of the process for both CBOW and Skip-Gram models:

Input Layer: For CBOW, the input is the context words (multiple words), which are one-hot encoded vectors representing the presence of words in the context. For Skip-Gram, the input is just the target word. The size of each input vector is equal to the vocabulary size.

Projection Layer (or Hidden Layer): This is not a typical hidden layer with activation functions. Instead, it's a projection layer where the actual learning of word embeddings happens. When you set vector_size=100, it means this layer will have 100 neurons. The weights connecting the input layer to this layer are what become the word embeddings. In training, for a given input word, the corresponding row in the weight matrix is essentially the word vector for that word.

- In CBOW, the vectors from the projection layer corresponding to each context word are averaged before being passed to the output layer.
- In Skip-Gram, the projection layer connects directly to the output layer, using the vector of the input word.

Output Layer: Conceptually, the output layer is a softmax layer that makes predictions. For CBOW, it predicts the target word from the context. For Skip-Gram, it predicts the context words from the target word. The size of this layer is also equal to the vocabulary size. Because a full softmax over a large vocabulary is expensive, implementations usually approximate it; Gensim, for example, uses negative sampling by default.

So, in summary:

- The "100 dimensions" are the weights of the projection layer learned during training: each word's embedding is one row of 100 weights.
- Learning adjusts these weights so that the model gets better at its prediction task (predicting context words for Skip-Gram, predicting a target word for CBOW), thereby capturing semantic and syntactic word relationships in the process.
- The neural network in Word2Vec is specialized and optimized for learning word embeddings, which makes it different from a general-purpose neural network used for other prediction tasks.

In the context of the Word2Vec architecture and specifically regarding the projection (or hidden) layer where the word embeddings are learned, the activation function can indeed be thought of as an identity function, $f(x)=x$. This means that the output of each neuron in this layer is the same as its input, without any nonlinear transformation applied.
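
The layers described above can be traced with a toy forward pass. This is a sketch under simplifying assumptions: a vocabulary of 5 words, 3-dimensional vectors, random untrained weights, and a full softmax output. The names `W_in`, `W_out`, and `vec_mat` are hypothetical and do not correspond to Gensim internals.

```python
import math
import random

random.seed(0)
vocab_size, vector_size = 5, 3

# W_in: one row per vocabulary word; after training, these rows are the word embeddings.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(vector_size)] for _ in range(vocab_size)]
# W_out: maps the hidden vector back to one score per vocabulary word.
W_out = [[random.uniform(-0.5, 0.5) for _ in range(vocab_size)] for _ in range(vector_size)]

def one_hot(i):
    return [1.0 if k == i else 0.0 for k in range(vocab_size)]

def vec_mat(vec, mat):
    """Multiply a row vector by a matrix."""
    return [sum(v * row[c] for v, row in zip(vec, mat)) for c in range(len(mat[0]))]

# Multiplying a one-hot vector by W_in simply selects one row of W_in.
projected = vec_mat(one_hot(2), W_in)

# CBOW: average the rows of the context words (identity activation, f(x) = x).
context_ids = [0, 1, 3, 4]
hidden = [sum(W_in[i][d] for i in context_ids) / len(context_ids) for d in range(vector_size)]

# Output layer: one score per vocabulary word, turned into probabilities by softmax.
scores = vec_mat(hidden, W_out)
exp_scores = [math.exp(s) for s in scores]
probs = [e / sum(exp_scores) for e in exp_scores]
print(probs)
```

Since the one-hot multiplication is just a row lookup, implementations skip the multiplication and index into the weight matrix directly.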

In [3]:
# Import the tokenizer and download the punkt tokenizer models
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

# Define the dataset
comments = ["love the new features", "hate the long wait time", "excellent service", "poor experience with the product", "happy with the purchase"]
sentiments = [1, 0, 1, 0, 1]  # 1: Positive, 0: Negative

# Tokenize each lowercased comment
tokenized_comments = [word_tokenize(comment.lower()) for comment in comments]


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/YigitAydede/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [4]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import numpy as np

# Tokenize comments
tokenized_comments = [word_tokenize(comment.lower()) for comment in comments]

# Train word embeddings
model = Word2Vec(tokenized_comments, vector_size=4, window=2, min_count=1, workers=1)

# View a sample word vector
print("Vector for 'love':", model.wv['love'])

2024-08-25 15:27:09,649 : INFO : collecting all words and their counts
2024-08-25 15:27:09,650 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-08-25 15:27:09,650 : INFO : collected 16 word types from a corpus of 20 raw words and 5 sentences
2024-08-25 15:27:09,651 : INFO : Creating a fresh vocabulary
2024-08-25 15:27:09,651 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 retains 16 unique words (100.00% of original 16, drops 0)', 'datetime': '2024-08-25T15:27:09.651826', 'gensim': '4.3.2', 'python': '3.9.19 (main, May  6 2024, 14:39:30) \n[Clang 14.0.6 ]', 'platform': 'macOS-14.6.1-arm64-arm-64bit', 'event': 'prepare_vocab'}
2024-08-25 15:27:09,652 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 leaves 20 word corpus (100.00% of original 20, drops 0)', 'datetime': '2024-08-25T15:27:09.652102', 'gensim': '4.3.2', 'python': '3.9.19 (main, May  6 2024, 14:39:30) \n[Clang 14.0.6 ]', 'platform': 'macOS-14.6.1-arm64-arm-64bit'

Vector for 'love': [-0.03944132  0.00803429 -0.10351574 -0.1920672 ]


In [5]:
# Calculate average word vectors for each comment
average_vectors = []
for comment in tokenized_comments:
    comment_vector = np.zeros(model.vector_size)
    n_found = 0
    for word in comment:
        if word in model.wv:  # Ignore words not in the vocabulary
            comment_vector += model.wv[word]
            n_found += 1
    # Divide by the number of words actually found (avoids division by zero)
    average_vectors.append(comment_vector / max(n_found, 1))

# Display the average vector for the first comment
print("Average vector for first comment:", average_vectors[0])

Average vector for first comment: [-0.01083414 -0.0337788   0.03695674  0.03914206]


When we talk about a "100-dimensional vector," we're referring to a list or array of 100 numbers that together give the coordinates of a single point in a 100-dimensional space. A word vector in such a space encodes various aspects of the word's meaning and usage.

Understanding Dimensions and Averaging

Let's say we have 3 words, each represented by a 4-dimensional word vector (for simplicity, we're using 4 dimensions instead of 100):

- Word 1 vector: [1,2,3,4]
- Word 2 vector: [2,3,4,5]
- Word 3 vector: [3,4,5,6]
  
These vectors might be the embeddings for three words in a sentence. To represent the entire sentence by a single vector, we compute the average of these vectors.

To find the average vector, we calculate the mean for each dimension across all word vectors:

- Dimension 1 average: (1+2+3)/3=2
- Dimension 2 average: (2+3+4)/3=3
- Dimension 3 average: (3+4+5)/3=4
- Dimension 4 average: (4+5+6)/3=5
  
So, the average vector representing the entire sentence is [2,3,4,5].

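What does this represent? The averaged vector lies in the same 4-dimensional space as the original word vectors, but it is a new vector that, in theory, captures the combined semantic and syntactic content of all the words in the text. Note that averaging ignores word order, so two sentences containing the same words in a different order receive the same vector.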
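
The arithmetic above takes only a few lines of plain Python. The three vectors are the made-up values from this example, not real embeddings.

```python
word_vectors = [
    [1, 2, 3, 4],  # Word 1
    [2, 3, 4, 5],  # Word 2
    [3, 4, 5, 6],  # Word 3
]

# zip(*word_vectors) groups the values dimension by dimension.
sentence_vector = [sum(dimension) / len(word_vectors) for dimension in zip(*word_vectors)]
print(sentence_vector)  # [2.0, 3.0, 4.0, 5.0]

# Reversing the word order leaves the average unchanged.
reversed_vector = [sum(dimension) / len(word_vectors) for dimension in zip(*reversed(word_vectors))]
print(reversed_vector == sentence_vector)  # True
```

The second check illustrates the main limitation of this representation: the average carries no information about the order in which the words appeared.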

In [6]:
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import nltk

# Download NLTK punkt tokenizer models
nltk.download('punkt')

# Define the dataset
comments = ["love the new features", "hate the long wait time", "excellent service", "poor experience with the product", "happy with the purchase"]
sentiments = [1, 0, 1, 0, 1]  # 1: Positive, 0: Negative

# Tokenize comments
tokenized_comments = [word_tokenize(comment.lower()) for comment in comments]

# Train Word2Vec embeddings
model = Word2Vec(tokenized_comments, vector_size=4, window=2, min_count=1, workers=1)

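# Function to convert a comment to its average word vector
def comment_to_vector(comment):
    comment_vector = np.zeros(model.vector_size)
    n_found = 0
    for word in comment:
        if word in model.wv:  # Skip words not in the vocabulary
            comment_vector += model.wv[word]
            n_found += 1
    # Divide by the number of words actually found (avoids division by zero)
    return comment_vector / max(n_found, 1)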

# Convert all tokenized comments to average word vectors
average_vectors = [comment_to_vector(comment) for comment in tokenized_comments]

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(average_vectors, sentiments, test_size=0.4, random_state=42)

# Train a Logistic Regression model
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred, zero_division=0))  # zero_division=0 silences the warning for classes with no predicted samples


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/YigitAydede/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
2024-08-25 15:27:18,544 : INFO : collecting all words and their counts
2024-08-25 15:27:18,544 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-08-25 15:27:18,545 : INFO : collected 16 word types from a corpus of 20 raw words and 5 sentences
2024-08-25 15:27:18,545 : INFO : Creating a fresh vocabulary
2024-08-25 15:27:18,545 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 retains 16 unique words (100.00% of original 16, drops 0)', 'datetime': '2024-08-25T15:27:18.545616', 'gensim': '4.3.2', 'python': '3.9.19 (main, May  6 2024, 14:39:30) \n[Clang 14.0.6 ]', 'platform': 'macOS-14.6.1-arm64-arm-64bit', 'event': 'prepare_vocab'}
2024-08-25 15:27:18,545 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 leaves 20 word corpus (100.00% of original 20, drops 0)', 'datetime': '2024-08-25T15:27:1

Accuracy: 0.5
Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2



