# NLP Application


## Application with Word Embedding

The dataset we are using here is a subset of Amazon reviews from the Cell Phones & Accessories category. The data is stored as a gzip-compressed JSON Lines file (one review per line) and can be read with pandas.

In [1]:
import gzip
from io import BytesIO

import gensim
import pandas as pd
import requests

# URL of the dataset
url = "http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz"

# Send an HTTP request to the URL
response = requests.get(url)

# Make sure the request was successful
if response.status_code == 200:
    # Open the response content as a gzip file
    with gzip.open(BytesIO(response.content), 'rt') as read_file:
        # Read the dataset into a pandas DataFrame
        data = pd.read_json(read_file, lines=True)
    # Display the first few rows of the DataFrame
    print(data.head())
else:
    print("Failed to download the dataset.")

       reviewerID        asin      reviewerName helpful  \
0  A30TL5EWN6DFXT  120401325X         christina  [0, 0]   
1   ASY55RVNIL0UD  120401325X          emily l.  [0, 0]   
2  A2TMXE2AFO7ONB  120401325X             Erica  [0, 0]   
3   AWJ0WZQYMYFQ4  120401325X                JM  [4, 4]   
4   ATX7CZYFXI1KW  120401325X  patrice m rogoza  [2, 3]   

                                          reviewText  overall  \
0  They look good and stick good! I just don't li...        4   
1  These stickers work like the review says they ...        5   
2  These are awesome and make my phone look so st...        5   
3  Item arrived in great time and was in perfect ...        4   
4  awesome! stays on, and looks great. can be use...        5   

                                     summary  unixReviewTime   reviewTime  
0                                 Looks Good      1400630400  05 21, 2014  
1                      Really great product.      1389657600  01 14, 2014  
2                         

In [3]:
data.shape

(194439, 9)

In [2]:
review_text = data.reviewText.apply(gensim.utils.simple_preprocess)
review_text

0         [they, look, good, and, stick, good, just, don...
1         [these, stickers, work, like, the, review, say...
2         [these, are, awesome, and, make, my, phone, lo...
3         [item, arrived, in, great, time, and, was, in,...
4         [awesome, stays, on, and, looks, great, can, b...
                                ...                        
194434    [works, great, just, like, my, original, one, ...
194435    [great, product, great, packaging, high, quali...
194436    [this, is, great, cable, just, as, good, as, t...
194437    [really, like, it, becasue, it, works, well, w...
194438    [product, as, described, have, wasted, lot, of...
Name: reviewText, Length: 194439, dtype: object

In [4]:
from gensim.models import Word2Vec

# Convert review_text into a list of lists of tokens for training
sentences = review_text.tolist()

# Initialize and train the Word2Vec model
model = Word2Vec(sentences=sentences,
                 vector_size=100,  # Size of word vectors; adjust based on your needs
                 window=10,
                 min_count=2,
                 workers=4)

# Summarize the loaded model
print(model)

# Save the model for later use
model.save("word2vec_amazon_reviews.model")

# Access vectors for a word
print("Vector for the word 'phone':", model.wv['phone'])

# Find most similar words to 'phone'
print("Words similar to 'phone':", model.wv.most_similar('phone'))


Word2Vec<vocab=35561, vector_size=100, alpha=0.025>
Vector for the word 'phone': [-6.3998312e-01 -1.4190605e+00  2.1279287e+00 -1.4021382e+00
 -1.8854320e+00  6.1295718e-01 -2.1259439e+00  3.4396327e-01
  2.7594447e+00  1.8673247e+00  2.1820788e+00 -1.5625589e+00
 -1.0210887e+00 -4.2885270e+00  1.1849930e+00  2.4339321e+00
 -1.4142124e-01  3.2184696e+00  3.4006277e-01 -1.9219936e-01
 -1.2596589e+00  1.3896925e+00  3.5460535e-01  3.5005484e+00
 -1.1138523e+00 -1.9386652e-01  2.0508106e+00  3.2048085e+00
 -5.4209900e-01  8.1116128e-01  4.0708828e+00 -3.0688353e+00
  4.1906486e+00 -8.3819085e-01  1.8337253e+00 -1.8341529e-01
  1.7472942e+00 -3.1700909e+00 -1.3498080e+00  5.2830368e-01
 -3.3253145e-01  3.2028456e+00  1.4030707e+00 -1.9721323e+00
  1.4377276e+00 -5.1707673e-01  5.0323230e-01 -2.5271021e-03
  1.1227527e+00  1.2285974e+00  2.1387091e+00  3.6197749e-01
 -1.8539317e+00 -3.4891474e+00  2.3794155e+00 -2.7587482e-01
 -1.1417617e+00  2.1127207e+00  4.0166011e+00  3.9557195e+00
 -9.

In [7]:
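# The Word2Vec constructor above already built the vocabulary and trained the model.
# This cell repeats both steps explicitly to show the lower-level API.
model.build_vocab(review_text, progress_per=1000)
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)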

(61505184, 83868975)

`train` returns a tuple of two word counts, both summed over all training epochs.

The first number (61505184): This is the effective number of words the model actually trained on. Words that fall below min_count (minimum word frequency) are ignored, and very frequent words are randomly downsampled (the sample parameter), so this count is smaller than the raw total.

The second number (83868975): This is the total number of raw words fed to the model. It is the sum of the lengths of all the sentences in the training data, before any filtering, counted once per epoch. With the default of 5 epochs, this corresponds to 83868975 / 5 = 16773795 tokens in the corpus.

In [8]:
model.wv.most_similar("bad")

[('shabby', 0.6256607174873352),
 ('terrible', 0.6220461130142212),
 ('good', 0.5864471793174744),
 ('horrible', 0.5709187388420105),
 ('disappointing', 0.5291851758956909),
 ('poor', 0.5196760892868042),
 ('fault', 0.5156261920928955),
 ('sad', 0.515142560005188),
 ('okay', 0.49805721640586853),
 ('cheap', 0.4978519678115845)]

In [9]:
model.wv.similarity(w1="great", w2="great")

1.0

In [10]:
model.wv.similarity(w1="great", w2="good")

0.7913931
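The `similarity` and `most_similar` scores above are cosine similarities between word vectors. The sketch below reproduces the idea with plain NumPy on three made-up 3-dimensional vectors; the vectors and the helper names `cosine_similarity` and `rank_by_similarity` are assumptions for illustration, not gensim's internals.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional "embeddings", for illustration only
toy_vectors = {
    "phone": np.array([1.0, 2.0, 0.5]),
    "device": np.array([0.9, 1.8, 0.7]),
    "banana": np.array([-1.0, 0.2, 3.0]),
}

def rank_by_similarity(target, vectors, topn=2):
    """Rank every other word by cosine similarity to the target word."""
    scores = [
        (word, cosine_similarity(vectors[target], vec))
        for word, vec in vectors.items()
        if word != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = rank_by_similarity("phone", toy_vectors)
print(ranking)
```

A word compared with itself always scores 1.0, which matches the `great` vs. `great` result above.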

## Finding Similar Words

Having trained our Word2Vec model on customer reviews, we have transformed words into vectors that capture semantic meaning, relationships, and context within our dataset. This opens up a variety of ways to analyze and gain insights from the customer reviews. Here are some practical applications and analyses we can perform:

1. Finding Similar Words (already done above)
Discover words that are semantically related to specific terms. This can help identify common themes or issues in reviews. For example, finding words similar to "battery" might reveal related concerns or praise in the context of product reviews.

In [11]:
similar_words = model.wv.most_similar('battery', topn=10)
print(similar_words)

[('batter', 0.8712946176528931), ('batt', 0.8024268746376038), ('batteries', 0.6934431791305542), ('juice', 0.5941666960716248), ('powerplant', 0.583953857421875), ('cycle', 0.5682535767555237), ('powerbank', 0.5618174076080322), ('igeek', 0.5543268918991089), ('incredicharge', 0.5326968431472778), ('ion', 0.5305150747299194)]


## Why the `train` Method Is Sometimes Used Separately in Gensim’s Word2Vec

1. Incremental Training (Online Learning):

Incremental training (also known as online learning) is when you want to train your model on new data without starting from scratch. Here’s why it might be useful:

- Dynamic Data: Suppose you continuously receive new data (like real-time reviews or social media posts). Instead of retraining the model from the beginning every time new data arrives, you can load the existing model and use train to update it with the new data.
- Memory Efficiency: When working with very large datasets, you might not have all the data available at once. You can build your model incrementally, feeding in data in chunks.

2. Custom Vocabulary Building:
Separating vocabulary building and training is useful when:

- Inspecting the Vocabulary: You might want to inspect or fine-tune the vocabulary before training, for example by adjusting min_count (which controls the minimum frequency for a word to be included in the model).
- Adjusting Hyperparameters: After inspecting the vocabulary, you might decide to tweak hyperparameters like window, vector_size, or min_count before proceeding to the training phase. Note that a changed min_count or vector_size only takes effect once the vocabulary is rebuilt.

In this case, you build the vocabulary first, then call train explicitly when you’re ready.

3. Fine-Tuning a Pre-Trained Model:
Sometimes you start with a pre-trained model and then fine-tune it on a smaller, domain-specific dataset. For instance:

- Domain Adaptation: If you have a Word2Vec model trained on general text (like news articles), you might want to adapt it to a specific domain like medical text or legal documents.
- Improving Performance: Fine-tuning on new data can improve performance for tasks specific to your domain, such as document classification or sentiment analysis.

In these cases, you load the pre-trained model and use train to update the word vectors based on your new data.

Let's see how exactly you can load and continue training your model with additional data.


In [12]:
# Load the pre-trained model
model = Word2Vec.load("word2vec_amazon_reviews.model")

new_reviews = [
    "I love this phone, it has great battery life",
    "The camera on this phone is amazing but the battery drains quickly",
    "Highly recommended phone for its price"
]

# Preprocess the new reviews (using the same preprocessing steps)
new_sentences = [gensim.utils.simple_preprocess(review) for review in new_reviews]


Before continuing training, you need to update the model’s vocabulary with the new words in your additional data. Use build_vocab with the `update=True` option, which ensures that the existing vocabulary is retained, and the new words are added.

In [13]:
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs)
model.save("word2vec_amazon_reviews_updated.model")

- `total_examples=len(new_sentences)`: This tells the model how many examples (sentences) you’re training it on.
- `epochs=model.epochs`: This ensures the new data is trained for the same number of epochs as your original data.

Summary of the Process:
- Model Name: The initial model is saved as `word2vec_amazon_reviews.model`.  
- Loading and Updating: You load the model, add new data, and then retrain it incrementally.
- Vocabulary Update: Remember to use `build_vocab(..., update=True)` when adding new data to ensure the new words are incorporated into the model.

In [14]:
# Check updated vectors
print("Words similar to 'phone' after incremental training:", model.wv.most_similar('phone'))

Words similar to 'phone' after incremental training: [('cellphone', 0.5536061525344849), ('iphone', 0.5523805022239685), ('it', 0.5207656025886536), ('lap', 0.4834569990634918), ('case', 0.4824077785015106), ('device', 0.46920591592788696), ('tabletcons', 0.46835488080978394), ('cheek', 0.4509800672531128), ('pocket', 0.43245789408683777), ('face', 0.42960381507873535)]


## Incremental Learning vs. Fine-Tuning
Fine-tuning and incremental learning might seem similar at first glance, but they serve slightly different purposes and are used in different contexts.

1. The purpose of Incremental Learning is to continue training the model with new data as it becomes available while retaining knowledge from the existing model. It's suitable for scenarios where you receive new data continuously (e.g., streaming data or new customer reviews) and want to update the model without retraining it from scratch.
The model’s vocabulary and knowledge are expanded to incorporate the new data. You load the model, update the vocabulary, and train it on the new data.
2. In Fine-Tuning, the purpose is to adapt a pre-trained model to a specific domain or task. You fine-tune the model on a smaller, domain-specific dataset while retaining the general knowledge from the pre-training phase. It's useful when you want to apply a model pre-trained on a large general corpus (e.g., news articles, Wikipedia) to a specialized task (e.g., medical text analysis, legal documents). The model is trained further on a small dataset specific to the new task, usually with adjusted hyperparameters (e.g., lower learning rate, fewer epochs) to avoid overfitting.

In [15]:
# Load the pre-trained model
model = Word2Vec.load("word2vec_amazon_reviews.model")

# Specialized data for fine-tuning
fine_tuning_reviews = [
    "The battery life of this phone is exceptional.",
    "I found the screen resolution to be subpar.",
    "This smartphone offers a great balance between price and performance."
]

# Preprocess the reviews
fine_tuning_sentences = [gensim.utils.simple_preprocess(review) for review in fine_tuning_reviews]

# Optionally update the vocabulary
model.build_vocab(fine_tuning_sentences, update=True)

# Fine-tune for a small number of epochs (the learning rate is left at the model's default here)
model.train(fine_tuning_sentences, total_examples=len(fine_tuning_sentences), epochs=5)

# Save the fine-tuned model
model.save("word2vec_amazon_reviews_finetuned.model")

# Check results
print("Words similar to 'phone' after fine-tuning:", model.wv.most_similar('phone'))


Words similar to 'phone' after fine-tuning: [('cellphone', 0.5534407496452332), ('iphone', 0.5522252321243286), ('it', 0.5205486416816711), ('lap', 0.48235467076301575), ('case', 0.482332319021225), ('device', 0.4693024456501007), ('tabletcons', 0.4679739475250244), ('cheek', 0.45062342286109924), ('pocket', 0.4320789873600006), ('face', 0.4295228123664856)]


In summary, fine-tuning adapts your model for a specific domain or task, while incremental learning keeps your model up-to-date with new data. Fine-tuning typically involves more cautious training to avoid losing the general knowledge your model has already gained.

## Word Clustering

Clustering words based on their vector representations can help identify groups of related terms or concepts within the reviews. A technique such as K-means groups the word vectors so that words in the same cluster have similar meanings or are used in similar contexts. This can reveal patterns, themes, or topics common in the data. For instance, in customer reviews, you might find clusters around product features, customer service, shipping issues, etc.

Let's demonstrate word clustering using K-means on the Word2Vec embeddings we've trained, and then discuss the insights that can be gained from this analysis.

**Step 1: Preparing Word Vectors**
First, extract the word vectors from the Word2Vec model. For simplicity, the code below uses the full vocabulary; restricting the analysis to the most frequent words (and excluding very common but less informative ones) usually makes the clusters more interpretable.

In [16]:
from sklearn.cluster import KMeans
import numpy as np

# Assuming `model` is your Word2Vec model

# Extract the list of words & their vectors
word_vectors = model.wv.vectors
words = list(model.wv.index_to_key)

# For a more focused analysis, consider filtering words by frequency or excluding stop words
# This example uses all words for simplicity
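If you prefer the more focused analysis mentioned in the comment, one possible sketch is the helper below. It assumes the word list is ordered by descending frequency (which, to my understanding, is how gensim orders `index_to_key` by default); the helper name `top_frequent_subset`, the toy arrays, and the stop-word list are all hypothetical.

```python
import numpy as np

def top_frequent_subset(words, vectors, n=100, stop_words=frozenset()):
    """Keep the first n non-stop-words, assuming `words` is sorted by descending frequency."""
    keep = [i for i, w in enumerate(words) if w not in stop_words][:n]
    return [words[i] for i in keep], vectors[keep]

# Toy stand-ins for model.wv.index_to_key and model.wv.vectors
toy_words = ["the", "phone", "and", "case", "battery"]
toy_matrix = np.arange(10, dtype=float).reshape(5, 2)
toy_stop_words = {"the", "and"}  # hypothetical stop-word list

subset_words, subset_vectors = top_frequent_subset(
    toy_words, toy_matrix, n=2, stop_words=toy_stop_words
)
print(subset_words)
print(subset_vectors)
```

With the real model, the call would look like `top_frequent_subset(words, word_vectors, n=100, stop_words=...)`, and the returned subset would be passed to K-means instead of the full vocabulary.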


**Step 2: Clustering Words**
Now, we'll use K-means clustering to group these words into clusters based on their vector similarities.

In [17]:
# Number of clusters
k = 10  # Example: 10 clusters. Adjust based on your analysis needs.

# Perform KMeans clustering
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(word_vectors)

# Assign each word to a cluster
word_cluster_labels = kmeans.labels_


**Step 3: Examining the Clusters**
After clustering, let's examine which words ended up in the same clusters. This will give us an idea of the themes or topics present in the reviews.

In [None]:
# Create a dictionary of word clusters
word_clusters = {i: [] for i in range(k)}
for word, cluster_label in zip(words, word_cluster_labels):
    word_clusters[cluster_label].append(word)

# Display words in each cluster
# (use a separate loop variable so the full `words` list is not overwritten)
for cluster, cluster_words in word_clusters.items():
    print(f"Cluster {cluster}: {cluster_words[:10]}")  # Displaying first 10 words for brevity


**Insights from Word Clustering**

- Theme Identification: Each cluster represents a group of words that are contextually similar. By examining the words in each cluster, you can identify common themes or topics in the reviews. For example, a cluster containing words like "battery", "charge", and "power" might indicate discussions about battery life.
- Product Features and Issues: Clusters might reveal specific product features that customers talk about the most, as well as recurring issues or areas of dissatisfaction.
- Customer Sentiment: Although not a direct measure of sentiment, the clustering of certain words together can give clues about overall customer sentiment. Words with positive connotations clustering together separately from words with negative connotations could indicate polarized opinions about certain aspects of the product or service.
- Improving Product and Service: By identifying clusters related to customer service, shipping, product durability, etc., businesses can pinpoint areas for improvement.

**Similar Words vs. K-Means Clustering**
Finding similar words with Word2Vec (via `most_similar`) and clustering words with K-means are related, but they serve different purposes and work in slightly different ways. Let’s break down the distinctions and relationships between these approaches:

1. Finding Similar Words (using most_similar): The primary goal is to identify words that are closest in vector space to a given target word. Word2Vec learns vector representations (embeddings) where words that appear in similar contexts have similar vectors. The most_similar method identifies words whose vectors are closest to a target word’s vector, based on cosine similarity.

It’s often used for synonym detection, expanding queries, or understanding context in NLP tasks. For example, given the word "phone," it might return words like "mobile," "smartphone," "device," etc., which are semantically related.

2. Clustering Words (using K-means): The goal is to group words into clusters based on their vector representations. Each cluster represents a group of words that are close to each other in vector space. K-means clustering algorithm groups word vectors into a specified number of clusters (k). The algorithm iteratively assigns words to the nearest cluster centroid and adjusts the centroids until the clusters stabilize.

It’s used for tasks like topic modeling, organizing vocabulary into themes, or understanding word groups in a large corpus.  For example, you might get clusters where one cluster contains words related to "technology" (e.g., "phone," "tablet," "laptop"), while another contains words related to "battery life" (e.g., "charging," "drains," "power").
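The assign-and-update loop described for K-means can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation; the function name `kmeans_sketch` and the toy points are assumptions.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Plain K-means: alternate nearest-centroid assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance from every point to every centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # clusters have stabilized
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups standing in for word vectors
toy_points = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                       [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]])
labels, centroids = kmeans_sketch(toy_points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is that the first three points share one label and the last three share the other.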

In practice, you can use both methods together:

- First Step: Use K-means clustering to identify broad categories of words.
- Second Step: Within each cluster, you can then use most_similar to explore finer relationships and identify the most representative words in that cluster.

After performing K-means clustering, you might also want to visualize the clusters using techniques like t-SNE or PCA, which project the high-dimensional word vectors into 2D space for better understanding.
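As a rough sketch of the PCA option, the projection to 2D can be written with NumPy's SVD. The random matrix below only stands in for `model.wv.vectors`, and the helper name `pca_2d` is an assumption.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(50, 100))  # stand-in for model.wv.vectors
coords = pca_2d(toy_vectors)
print(coords.shape)
```

The two columns of `coords` can then be drawn as a scatter plot, with each point colored by its K-means cluster label.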

# Sentiment Analysis

To demonstrate sentiment analysis using the same Amazon review data, I’ll walk you through both supervised (with labeled data) and unsupervised (without labeled data) methods. Sentiment analysis typically involves determining whether a piece of text has a positive, negative, or neutral sentiment.

Here’s how you can approach it:

1. **Sentiment Analysis with Labeled Data (Supervised Learning)**:

In supervised learning, you need labeled data where each review has a corresponding sentiment label (e.g., positive, negative). We'll explore how to use machine learning models like Logistic Regression or even advanced methods like fine-tuning pre-trained transformer models.

2. **Sentiment Analysis without Labeled Data (Unsupervised Learning)**:
In unsupervised learning, you don’t have labels, so you use heuristic-based approaches like:

- Lexicon-based methods: Use pre-built dictionaries of positive and negative words.
- Unsupervised clustering methods: Group reviews into clusters that represent sentiment.

Let's start with the unsupervised approach.

## Lexicon-Based Approach - Unsupervised
  
This method relies on predefined lists of words associated with positive and negative sentiments. You can use libraries like TextBlob or VADER, which come with built-in sentiment lexicons and can provide sentiment scores based on the presence and combinations of positive and negative words in your text.
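To illustrate the basic idea behind a lexicon-based score, here is a deliberately tiny sketch. The word list and its scores are made up for illustration and are not the lexicon used by TextBlob or VADER, which also handle negation, intensifiers, and more.

```python
import re

# Tiny hand-made lexicon, for illustration only (not TextBlob's or VADER's)
TOY_LEXICON = {"amazing": 1.0, "great": 0.8, "good": 0.5,
               "disappointing": -0.8, "poor": -0.6, "terrible": -1.0}

def toy_polarity(text, lexicon=TOY_LEXICON):
    """Average the scores of the lexicon words found in the text (0.0 if none)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("Great phone with an amazing screen"))
print(toy_polarity("Terrible battery and poor service"))
print(toy_polarity("The phone weighs 200 grams"))
```

Text with no lexicon words scores exactly zero, which is why purely factual sentences tend to come out as neutral under this kind of method.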

**Method 1: Lexicon-Based Approach using `TextBlob`**
`TextBlob` is a simple and widely used library for sentiment analysis that doesn’t require labeled data:

In [18]:
from textblob import TextBlob

# Example review
review = "The phone has an amazing battery life but a disappointing camera."

# Get sentiment polarity
sentiment = TextBlob(review).sentiment.polarity
print(f"Sentiment polarity: {sentiment}")


Sentiment polarity: 5.551115123125783e-17


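A positive polarity score indicates a positive sentiment, while a negative score indicates a negative sentiment. Here the score is effectively zero (about 5.6e-17, which is floating-point noise): the positive "amazing" and the negative "disappointing" cancel out, so a mixed review can look neutral. TextBlob can be a straightforward way to start with sentiment analysis without needing labeled data.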

Both TextBlob and VADER (Valence Aware Dictionary and sEntiment Reasoner) score text from predefined word-level sentiment values, and each is suited to different types of text data. Both are shown here, so you can choose based on your preference and the nature of your dataset.

TextBlob is straightforward and works well for general-purpose sentiment analysis, including on longer texts like reviews.

In [20]:
# Apply TextBlob once per review and keep both polarity and subjectivity
review_sentiments = data['reviewText'].apply(lambda x: TextBlob(x).sentiment)
data['sentiment_polarity'] = review_sentiments.apply(lambda s: s.polarity)
data['sentiment_subjectivity'] = review_sentiments.apply(lambda s: s.subjectivity)


Subjectivity in sentiment analysis measures how much a piece of text reflects personal opinions, emotions, and judgments rather than objective facts. The subjectivity score ranges from 0 to 1.
- 0 (Objective): A score closer to 0 indicates that the text is more objective, meaning it contains factual information with little to no personal opinion or emotion.
- 1 (Subjective): A score closer to 1 indicates that the text is highly subjective, meaning it is based more on personal opinions, feelings, or beliefs rather than facts.


**Objective Text (Score Near 0)**

Examples: “The phone weighs 200 grams.” or “The sky is blue.”
These statements provide factual information without expressing personal thoughts or emotions. Such text would have a low subjectivity score because it’s focused on facts rather than opinions.

**Subjective Text (Score Near 1)**

Examples: “I love how lightweight this phone is!” or “The movie was amazing and heartwarming.”
These statements express personal opinions and feelings. The language is driven by individual perception rather than facts. Such text would have a high subjectivity score because it is based on emotions, preferences, and judgments.

**Application in Sentiment Analysis**
In tasks like sentiment analysis or opinion mining, determining the subjectivity helps in distinguishing between:

- Objective Reviews or Comments: These are more likely to provide factual information that can be useful in different contexts.
- Subjective Reviews or Comments: These contain opinions that help in understanding user preferences, satisfaction, or dissatisfaction.

Example Use Case: Suppose you have product reviews, and you want to focus on opinions (subjective content) rather than purely descriptive (objective) content. A subjectivity score allows you to filter reviews based on whether they reflect factual descriptions or personal experiences and emotions.
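The filtering described in this use case can be sketched with a boolean mask in pandas. The two-row DataFrame, its subjectivity scores, and the 0.5 cutoff are assumptions for illustration; on the real data you would apply the same mask to the `sentiment_subjectivity` column.

```python
import pandas as pd

# Toy stand-in for the review DataFrame (scores are hypothetical)
toy_reviews = pd.DataFrame({
    "reviewText": [
        "The phone weighs 200 grams.",
        "I love how lightweight this phone is!",
    ],
    "sentiment_subjectivity": [0.0, 0.8],
})

# Keep only reviews that are mostly opinion (assumed cutoff of 0.5)
opinionated = toy_reviews[toy_reviews["sentiment_subjectivity"] >= 0.5]
print(opinionated["reviewText"].tolist())
```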

Now that you have sentiment scores, you can analyze them to gain insights into the overall sentiment of the reviews. A natural first step is the overall sentiment: calculate the average sentiment polarity to get an idea of the overall sentiment towards the products.


In [21]:
average_sentiment = data['sentiment_polarity'].mean()
print(f"Average Sentiment Polarity: {average_sentiment}")


Average Sentiment Polarity: 0.24830300849739492


An average sentiment polarity of approximately 0.248 suggests that the overall sentiment in the reviews leans positive. This is a good starting point for understanding customer sentiment, but an average hides a lot of detail. Two simple follow-ups are shown below. First, we count how many reviews have positive, neutral (exactly zero), and negative polarity. Then we look at how sentiment varies across aspects of the product, such as battery life, camera quality, or customer service, by filtering the reviews that mention each feature and averaging their polarity:

In [22]:
positive_reviews = data[data['sentiment_polarity'] > 0].shape[0]
neutral_reviews = data[data['sentiment_polarity'] == 0].shape[0]
negative_reviews = data[data['sentiment_polarity'] < 0].shape[0]

print(f"Positive Reviews: {positive_reviews}")
print(f"Neutral Reviews: {neutral_reviews}")
print(f"Negative Reviews: {negative_reviews}")


Positive Reviews: 172070
Neutral Reviews: 4950
Negative Reviews: 17419


In [23]:
features = ['battery', 'camera', 'service']
for feature in features:
    feature_reviews = data[data['reviewText'].str.contains(feature, case=False)]
    avg_sentiment = feature_reviews['sentiment_polarity'].mean()
    print(f"Average sentiment for {feature}: {avg_sentiment}")


Average sentiment for battery: 0.199119172408153
Average sentiment for camera: 0.19423316438082053
Average sentiment for service: 0.22639224564186428


Let's wrap the scoring in a reusable function and add a sentiment label column. We'll follow the same pattern for both TextBlob and VADER.

In [24]:
from textblob import TextBlob

def get_sentiment(review):
    analysis = TextBlob(review)
    return analysis.sentiment.polarity  # Returns a score between -1 (negative) and 1 (positive)

# Apply sentiment analysis to your dataset
data['sentiment_score'] = data['reviewText'].apply(get_sentiment)

# Classify sentiment based on score
data['sentiment'] = data['sentiment_score'].apply(lambda x: 'positive' if x > 0 else ('negative' if x < 0 else 'neutral'))

# Display results
print(data[['reviewText', 'sentiment_score', 'sentiment']].head())

                                          reviewText  sentiment_score  \
0  They look good and stick good! I just don't li...         0.391667   
1  These stickers work like the review says they ...         0.533333   
2  These are awesome and make my phone look so st...         0.573828   
3  Item arrived in great time and was in perfect ...         0.600000   
4  awesome! stays on, and looks great. can be use...         0.360000   

  sentiment  
0  positive  
1  positive  
2  positive  
3  positive  
4  positive  


**Method 2: Lexicon-Based Approach using VADER (for Social Media/Informal Text)**

VADER is specifically tuned for social media text, making it useful for analyzing informal reviews:

In [25]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def vader_sentiment(review):
    sentiment_dict = analyzer.polarity_scores(review)
    return sentiment_dict['compound']  # Returns a score between -1 (negative) and 1 (positive)

data['vader_sentiment'] = data['reviewText'].apply(vader_sentiment)
data['vader_label'] = data['vader_sentiment'].apply(lambda x: 'positive' if x > 0.05 else ('negative' if x < -0.05 else 'neutral'))

# Display results
print(data[['reviewText', 'vader_sentiment', 'vader_label']].head())


                                          reviewText  vader_sentiment  \
0  They look good and stick good! I just don't li...           0.5396   
1  These stickers work like the review says they ...           0.9403   
2  These are awesome and make my phone look so st...           0.8852   
3  Item arrived in great time and was in perfect ...           0.9625   
4  awesome! stays on, and looks great. can be use...           0.9020   

  vader_label  
0    positive  
1    positive  
2    positive  
3    positive  
4    positive  


## Supervised Sentiment Analysis

In [3]:
from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize

# Define the dataset
comments = ["love the new features", "hate the long wait time", "excellent service", "poor experience with the product", "happy with the purchase"]
sentiments = [1, 0, 1, 0, 1]  # 1: Positive, 0: Negative

# Download the punkt tokenizer models
# (recent NLTK releases may require nltk.download('punkt_tab') instead)
nltk.download('punkt')

# Tokenize the comments (word_tokenize was imported above)
tokenized_comments = [word_tokenize(comment.lower()) for comment in comments]

# Train word embeddings
model = Word2Vec(tokenized_comments, vector_size=4, window=2, min_count=1, workers=1)

# View a sample word vector
print("Vector for 'love':", model.wv['love'])

Vector for 'love': [-0.03944132  0.00803429 -0.10351574 -0.1920672 ]


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/YigitAydede/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
import numpy as np

# Calculate average word vectors for each comment
average_vectors = []
for comment in tokenized_comments:
    comment_vector = np.zeros(model.vector_size)
    for word in comment:
        try:
            comment_vector += model.wv[word]
        except KeyError:
            # Ignore words not in the vocabulary
            pass
    average_vectors.append(comment_vector / len(comment))

# Display the average vector for each comment
for i, vector in enumerate(average_vectors, start=1):
    print(f"Average vector for comment {i}:", vector)

Average vector for comment 1: [-0.01083414 -0.0337788   0.03695674  0.03914206]
Average vector for comment 2: [ 0.01442597  0.04124683 -0.09715094  0.0698698 ]
Average vector for comment 3: [ 0.02520601 -0.14069088 -0.12306689 -0.03584729]
Average vector for comment 4: [ 0.03350125 -0.02111864  0.04544426  0.07532134]
Average vector for comment 5: [-0.12119516 -0.02556053  0.08801746  0.09145366]


When we talk about a "100-dimensional vector," we're referring to a list or array of 100 numbers, each giving the vector's coordinate along one dimension of the embedding space. Together, these numbers encode various aspects of the word's meaning and usage.

**Understanding Dimensions and Averaging**
Let's say we have 3 words, each represented by a 4-dimensional word vector (for simplicity, we're using 4 dimensions instead of 100):

- Word 1 vector: [1,2,3,4]
- Word 2 vector: [2,3,4,5]
- Word 3 vector: [3,4,5,6]
  
These vectors might be the embeddings for three words in a sentence. To represent the entire sentence by a single vector, we compute the average of these vectors.

To find the average vector, we calculate the mean for each dimension across all word vectors:

- Dimension 1 average: (1+2+3)/3=2
- Dimension 2 average: (2+3+4)/3=3
- Dimension 3 average: (3+4+5)/3=4
- Dimension 4 average: (4+5+6)/3=5
  
So, the average vector representing the entire sentence is [2,3,4,5].
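The same arithmetic can be checked with NumPy, where averaging along `axis=0` produces one mean per dimension. The array name `word_vectors_example` is just an illustrative choice.

```python
import numpy as np

word_vectors_example = np.array([
    [1, 2, 3, 4],  # Word 1
    [2, 3, 4, 5],  # Word 2
    [3, 4, 5, 6],  # Word 3
])

# axis=0 averages down the rows, i.e. one mean per dimension
sentence_vector = word_vectors_example.mean(axis=0)
print(sentence_vector)  # [2. 3. 4. 5.]
```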

What does this represent? The averaged vector is still in the same 4-dimensional space as the original word vectors, but it is a new vector that gives a rough summary of the combined meaning of all the words in the text. Note that word order is lost in the averaging.

These average vectors serve as simple "sentence embeddings": stacked together, one row per comment, they form the input for a classifier that can classify entire sentences.

Here is the entire code using logistic regression as the classifier:

In [6]:
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import nltk

# Download NLTK punkt tokenizer models
nltk.download('punkt')

# Define the dataset
comments = ["love the new features", "hate the long wait time", "excellent service", "poor experience with the product", "happy with the purchase"]
sentiments = [1, 0, 1, 0, 1]  # 1: Positive, 0: Negative

# Tokenize comments
tokenized_comments = [word_tokenize(comment.lower()) for comment in comments]

# Train Word2Vec embeddings
model = Word2Vec(tokenized_comments, vector_size=4, window=2, min_count=1, workers=1)

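# Function to convert a tokenized comment to its average word vector
def comment_to_vector(comment):
    comment_vector = np.zeros(model.vector_size)
    known_words = 0
    for word in comment:
        if word in model.wv:
            comment_vector += model.wv[word]
            known_words += 1
    # Average over the words that have an embedding; avoid dividing by zero
    if known_words == 0:
        return comment_vector
    return comment_vector / known_words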

# Convert all tokenized comments to average word vectors
average_vectors = [comment_to_vector(comment) for comment in tokenized_comments]

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(average_vectors, sentiments, test_size=0.4, random_state=42)

# Train a Logistic Regression model
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))


Accuracy: 0.5
Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2



[nltk_data] Downloading package punkt to
[nltk_data]     /Users/YigitAydede/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Below is the same pipeline with TF-IDF features in place of the Word2Vec embeddings:

In [7]:
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import nltk

# Download NLTK punkt tokenizer models
nltk.download('punkt')

# Define the dataset
comments = ["love the new features", "hate the long wait time", "excellent service", "poor experience with the product", "happy with the purchase"]
sentiments = [1, 0, 1, 0, 1]  # 1: Positive, 0: Negative

# Tokenize comments (though it's not strictly necessary with TfidfVectorizer)
tokenized_comments = [" ".join(word_tokenize(comment.lower())) for comment in comments]

# Initialize the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the comments to get TF-IDF vectors
tfidf_vectors = vectorizer.fit_transform(tokenized_comments).toarray()

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(tfidf_vectors, sentiments, test_size=0.4, random_state=42)

# Train a Logistic Regression model
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))


Accuracy: 0.5
Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2



[nltk_data] Downloading package punkt to
[nltk_data]     /Users/YigitAydede/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Key Differences:

1. Text Representation:

- TF-IDF: Converts the text data into sparse vectors where each feature represents a word and its corresponding importance in the document (based on term frequency and inverse document frequency). The resulting vectors are high-dimensional and sparse.
- Word2Vec: Converts each word into a dense vector (embedding) that captures semantic meaning. The final document vector is an average of all word embeddings in the document, producing a dense, lower-dimensional vector.

2. Vectorization Approach:

- TF-IDF: The focus is on the importance of words within the document and across the corpus. Words that occur frequently in a document but rarely across the entire dataset have high scores.
- Word2Vec: Focuses on capturing the relationships and meanings of words based on their context within sentences. Words with similar contexts have similar vector representations, capturing semantic similarities.

3. Model Input:

- TF-IDF: Produces a sparse, high-dimensional feature vector where each dimension corresponds to a specific word in the corpus (e.g., 5000 dimensions if max_features=5000).
- Word2Vec: Produces a dense, lower-dimensional vector (e.g., 100 dimensions if vector_size=100) that is an average of the word embeddings in the document.

4. Semantic Understanding:

- TF-IDF: Doesn’t capture semantic similarities between words. For instance, "great" and "excellent" will be treated as completely different features.
- Word2Vec: Captures semantic similarities between words. If "great" and "excellent" have similar contexts in the training data, their embeddings will be similar.

5. Training and Computation:

- TF-IDF: Faster and simpler to compute. No additional model needs to be trained, as it’s purely a statistical method.
- Word2Vec: Requires training the Word2Vec model on the text data, which can be computationally intensive depending on the dataset size and vector dimensions.

6. When to Use Each Approach:

- TF-IDF: Suitable for traditional machine learning approaches, especially when the focus is on word importance rather than word relationships. It works well with large corpora and can be a strong baseline.
- Word2Vec: Preferred when you need to capture more nuanced relationships between words and their contexts. It’s useful when semantic meaning plays an important role in your classification task.

**Which Approach Is Better?**

- For Simplicity: TF-IDF is easier to implement and understand, and it works well as a baseline for many text classification tasks.
- For Capturing Semantic Relationships: Word2Vec is better when the relationships between words are important (e.g., understanding that "excellent" and "great" are similar even if they don’t co-occur).
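
To make the point about semantic similarity concrete, here is a small hand-rolled TF-IDF sketch in plain Python. It is an illustration, not the scikit-learn implementation: the helper names `tfidf_vectors` and `cosine` are hypothetical, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 normalization is assumed here because it mirrors the `TfidfVectorizer` defaults.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency (assumed formula, see lead-in)
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors

def cosine(u, v):
    # The vectors are already unit length, so the dot product is the cosine
    return sum(a * b for a, b in zip(u, v))

docs = [["great", "phone"], ["excellent", "phone"], ["great", "phone"]]
vocab, vecs = tfidf_vectors(docs)

sim_synonyms = cosine(vecs[0], vecs[1])   # "great phone" vs "excellent phone"
sim_identical = cosine(vecs[0], vecs[2])  # identical documents

print(vocab)
print("great vs excellent review:", round(sim_synonyms, 3))
print("identical reviews:", round(sim_identical, 3))
```

The two reviews that differ only in "great" versus "excellent" overlap solely through the shared word "phone", because each word occupies its own dimension. With averaged word embeddings trained on enough data, the two adjectives could instead land close together and raise the similarity.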

## IMDB Large Movie Review Dataset v1.0

Download the data from the link below:
https://ai.stanford.edu/~amaas/data/sentiment/ 

The IMDB Large Movie Review Dataset is a benchmark dataset used extensively for sentiment analysis tasks. It contains movie reviews labeled by sentiment polarity (positive or negative), making it ideal for training and evaluating sentiment classification models.

1. Dataset Composition

- Total Reviews: 50,000 labeled movie reviews.
- Training Set: 25,000 reviews, balanced with 12,500 positive and 12,500 negative samples.
- Test Set: 25,000 reviews, balanced with 12,500 positive and 12,500 negative samples.
- Unlabeled Data: An additional 50,000 unlabeled reviews are available for unsupervised learning.

2. Sentiment Labeling Criteria

- Positive Reviews: Reviews with a rating of 7 or higher out of 10.
- Negative Reviews: Reviews with a rating of 4 or lower out of 10.
- Neutral Reviews Excluded: Reviews with ratings between 5 and 6 are not included in the labeled dataset to ensure clear sentiment polarity.

3. Dataset Structure

The dataset is organized into two main directories: train and test. Each of these directories contains two subdirectories, pos and neg, corresponding to positive and negative reviews, respectively.

The directory structure is as follows:
aclImdb/
    train/
        pos/  (12,500 positive reviews)
        neg/  (12,500 negative reviews)
    test/
        pos/  (12,500 positive reviews)
        neg/  (12,500 negative reviews)

Each review is stored as a plain text file, with the file name following the convention [id]_[rating].txt. For example, 200_8.txt is a positive review from the test set with an ID of 200 and a rating of 8 out of 10.

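
The file-name convention and the rating thresholds described above can be combined to recover a review's ID, rating, and sentiment label directly from its path. The following is a minimal sketch; the helper name `parse_review_filename` is hypothetical and not part of the dataset or of any library.

```python
from pathlib import Path

def parse_review_filename(path):
    """Split an '[id]_[rating].txt' file name into (review_id, rating, label)."""
    review_id, rating = Path(path).stem.split("_")
    rating = int(rating)
    if rating >= 7:
        label = "pos"   # rating of 7 or higher
    elif rating <= 4:
        label = "neg"   # rating of 4 or lower
    else:
        label = None    # ratings of 5 or 6 are not in the labeled set
    return int(review_id), rating, label

parsed = parse_review_filename("aclImdb/test/pos/200_8.txt")
print(parsed)
```

In practice the label is already given by the pos or neg folder, so a helper like this is mainly useful when the numeric rating itself is needed, for example to check that the folders and ratings agree.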
4. Dataset Characteristics

- Balanced Classes: The dataset is balanced, with an equal number of positive and negative reviews in both the training and test sets.
- Diversity in Reviews: No more than 30 reviews are included for any single movie to prevent biases from repeated reviews.
- Disjoint Movie Sets: The training and test sets contain reviews from completely different movies to avoid model overfitting based on movie-specific language.

5. Use in Sentiment Analysis

The IMDB dataset is widely used as a benchmark for binary sentiment classification. It provides a challenging testbed for models because of the informal, nuanced, and varied language used in movie reviews. Models trained on this dataset aim to classify the sentiment of a review as either positive or negative based solely on the text content.

6. Available Features and Additional Files

In addition to the raw text reviews, the dataset includes:

- Bag-of-Words Features: Preprocessed bag-of-words (BoW) representations are available in .feat files.
- Vocabulary: A vocabulary file (imdb.vocab) lists all words used in the dataset, which can be used for feature extraction and analysis.
- Expected Ratings (imdbEr.txt): This file provides the expected sentiment rating for each word in the vocabulary, based on prior studies.

7. Citation

If you use the IMDB dataset in your research or projects, please cite the following paper:

Maas, Andrew L., et al. "Learning Word Vectors for Sentiment Analysis." Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2011.

In [10]:
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import nltk

# Download NLTK data files if not already downloaded
nltk.download('punkt')

# Load the IMDB training reviews (adjust the path to where you extracted aclImdb)
dataset = load_files('/Users/YigitAydede/Library/CloudStorage/Dropbox/Documents/Courses/MBAN/NLPBootcamp/Section4/aclImdb/train', categories=['pos', 'neg'], shuffle=True, encoding='utf-8', decode_error='ignore')

# Separate features and labels (load_files sorts folder names, so neg=0 and pos=1)
X, y = dataset.data, dataset.target

# Hold out 20% of the official training set for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature extraction using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train a Logistic Regression model
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_tfidf, y_train)

# Predict and evaluate the model
y_pred = clf.predict(X_test_tfidf)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/YigitAydede/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Accuracy: 0.8718
Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.87      0.87      2482
           1       0.87      0.88      0.87      2518

    accuracy                           0.87      5000
   macro avg       0.87      0.87      0.87      5000
weighted avg       0.87      0.87      0.87      5000

