<a href="https://colab.research.google.com/github/rsadaphule/jhu-aaml/blob/main/Module_10_Word_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Let us process the IMDB movie reviews for sentiment analysis to improve our understanding of
word embeddings and their relation to a context.
1. [20 pts] In this assignment, we will approach the problem with Word2Vec and contextual
analysis of keywords towards sentiment/category processing in our pipeline.
Generate a gensim model of the movie reviews. Use any parameters you like while
answering the questions (2.) and (3.) below.
Report the size of the vocabulary and characteristics of the gensim model, such as the
number of mapping dimensions, etc.

In [1]:
!pip install gensim



In [2]:
from google.colab import drive; drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import gensim
from gensim.models import Word2Vec
import nltk
from nltk.corpus import stopwords
import string
import os
from sklearn.model_selection import train_test_split
import logging
import pandas as pd

In [4]:
PATH_DATA = '/content/drive/My Drive/JHU/AAML/Assignments/data/imdb/'
FILE_NAME = "movie_data.csv"

In [5]:
# Download NLTK stopwords
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [6]:
# Enable logging for gensim (optional)
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)



In [7]:
# Build the stop-word set once rather than on every call
STOP_WORDS = set(stopwords.words('english'))

# Function to preprocess text
def preprocess_text(text):
    """Lowercase, tokenize, strip punctuation, and remove stop words."""
    # Convert to lower case
    text = text.lower()
    # Tokenize
    tokens = nltk.word_tokenize(text)
    # Remove punctuation
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # Remove non-alphabetic tokens
    words = [word for word in stripped if word.isalpha()]
    # Filter out stop words
    return [w for w in words if w not in STOP_WORDS]
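
A note on the punctuation step: the translation table deletes punctuation characters instead of splitting on them, so a hyphenated word is fused into a single token. That is one likely source of vocabulary entries such as "soapopera" and "hohum". The sketch below illustrates the effect on a made-up sentence; `str.split()` is only a stand-in for `nltk.word_tokenize`, on the assumption that the tokenizer keeps hyphenated words as single tokens.

```python
import string

# Same kind of table as in preprocess_text: it deletes punctuation characters.
table = str.maketrans('', '', string.punctuation)

# Hypothetical sample text; str.split() stands in for nltk.word_tokenize.
sample = "a ho-hum soap-opera plot"
tokens = sample.split()

# Deleting punctuation fuses the two halves of each hyphenated word.
deleted = [w.translate(table) for w in tokens]
print(deleted)  # ['a', 'hohum', 'soapopera', 'plot']

# Alternative: map punctuation to spaces first, then split.
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
spaced = sample.translate(space_table).split()
print(spaced)  # ['a', 'ho', 'hum', 'soap', 'opera', 'plot']
```

Which behavior is preferable depends on whether fused forms or separate parts are more useful as vocabulary entries; the notebook keeps the deleting variant.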

In [8]:
# Load and preprocess the dataset
# Read the CSV file
df = pd.read_csv(PATH_DATA + FILE_NAME)
df.head()

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [9]:
# Preprocess the reviews
df['processed_reviews'] = df['review'].apply(preprocess_text)


In [10]:
# List of processed reviews
reviews = df['processed_reviews'].tolist()

In [11]:

# Train a Word2Vec model
model = Word2Vec(sentences=reviews, vector_size=100, window=5, min_count=1, workers=4)

In [12]:

# Summarize the trained model
print(f"Vocabulary size: {len(model.wv.index_to_key)}")

Vocabulary size: 133264
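
The vocabulary is large partly because `min_count=1` keeps every token, including words that occur only once. As a rough illustration of how the threshold changes the vocabulary size, the sketch below counts tokens in a made-up corpus; `vocab_size` and `toy_reviews` are hypothetical names, and the function only mimics the frequency cut-off, not gensim's full vocabulary building.

```python
from collections import Counter

def vocab_size(tokenized_reviews, min_count):
    """Count distinct tokens that occur at least min_count times."""
    counts = Counter(token for review in tokenized_reviews for token in review)
    return sum(1 for c in counts.values() if c >= min_count)

# Hypothetical toy corpus standing in for `reviews`.
toy_reviews = [
    ["great", "movie", "great", "cast"],
    ["dull", "movie", "lackluster", "plot"],
    ["great", "plot", "movei"],  # "movei" is a one-off typo
]

for threshold in (1, 2, 3):
    print(threshold, vocab_size(toy_reviews, threshold))
# 1 7
# 2 3
# 3 1
```

Calling `vocab_size(reviews, 5)` on the real token lists would show how many words survive a stricter threshold before retraining.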


In [13]:

# Save model
model.save('imdb_word2vec.model')


In [14]:

# Load model
new_model = Word2Vec.load('imdb_word2vec.model')

In [15]:

# Characteristics of the model
print(f"Number of dimensions: {new_model.vector_size}")

Number of dimensions: 100
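
Other characteristics of the model can be reported the same way. The helper below is a minimal sketch; `describe_model` is a hypothetical name, and the attribute names (`window`, `min_count`, `sg`, `epochs`) are assumed to match those exposed by gensim 4.x `Word2Vec` objects.

```python
def describe_model(model):
    """Collect a few hyperparameters from a trained gensim Word2Vec model.

    Assumes gensim 4.x attribute names; adjust if your version differs.
    """
    return {
        "vocabulary_size": len(model.wv.index_to_key),
        "vector_size": model.vector_size,
        "window": model.window,
        "min_count": model.min_count,
        # sg=0 selects CBOW, sg=1 selects skip-gram
        "architecture": "skip-gram" if model.sg else "CBOW",
        "epochs": model.epochs,
    }
```

For example, `describe_model(new_model)` should return a dictionary that can be printed item by item. Since `sg` was not set when training, the gensim default (CBOW) would be expected here.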


[20 pts] Generate the contexts for the following keywords:
(a.) melancholy
(b.) ghastly
(c.) lackluster
(d.) romantic

In [16]:
keywords = ['melancholy', 'ghastly', 'lackluster', 'romantic']

# Generate contexts for the keywords
for keyword in keywords:
    if keyword in model.wv:
        print(f"Contexts for '{keyword}':")
        for similar_word, similarity in model.wv.most_similar(keyword):
            print(f"  {similar_word} (similarity: {similarity})")
    else:
        print(f"The word '{keyword}' is not in the vocabulary.")

Contexts for 'melancholy':
  elegance (similarity: 0.871760368347168)
  powerfully (similarity: 0.8616267442703247)
  wistful (similarity: 0.8480486869812012)
  sensual (similarity: 0.8480376601219177)
  somber (similarity: 0.8396613597869873)
  spontaneous (similarity: 0.8367413282394409)
  exuberant (similarity: 0.8334642052650452)
  stirring (similarity: 0.8323051929473877)
  poignancy (similarity: 0.8301796913146973)
  dreamy (similarity: 0.8296907544136047)
Contexts for 'ghastly':
  transparent (similarity: 0.8081660866737366)
  unscary (similarity: 0.8004619479179382)
  infantile (similarity: 0.7996281981468201)
  galore (similarity: 0.7932235598564148)
  abominable (similarity: 0.7905946969985962)
  unimpressive (similarity: 0.7865118980407715)
  grotesquely (similarity: 0.7862119078636169)
  soapopera (similarity: 0.7853855490684509)
  hohum (similarity: 0.7826284170150757)
  lumpy (similarity: 0.782532274723053)
Contexts for 'lackluster':
  uninspired (similarity: 0.8629769682
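
The similarity scores printed by `most_similar` are cosine similarities between word vectors. The sketch below reproduces the ranking idea in plain Python on made-up 3-dimensional vectors; `toy_vectors` and `rank_neighbors` are hypothetical and only illustrate the computation, not gensim's optimized implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_neighbors(word, vectors, topn=10):
    """Rank all other words by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical vectors, chosen by hand for illustration only.
toy_vectors = {
    "melancholy": [0.9, 0.1, 0.3],
    "wistful": [0.8, 0.2, 0.35],
    "ghastly": [-0.7, 0.6, 0.1],
}

for neighbor, score in rank_neighbors("melancholy", toy_vectors):
    print(f"  {neighbor} (similarity: {score:.3f})")
```

A score near 1 means the two vectors point in almost the same direction, which in Word2Vec reflects words that occur in similar contexts rather than words that share a sentiment.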

[20 pts] Comment about similarities and differences in (3.). Any comments on why romantic
context was not affected?

Ans -
In a Word2Vec model, the context of a word is determined by its neighboring words in the training corpus. The model generates word embeddings by predicting a word based on its context (Continuous Bag of Words model) or predicting context words given a word (Skip-gram model). The resulting vectors capture semantic meanings and relationships between words.
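
To make the two training setups concrete, here is a simplified sketch of how training examples could be formed from a token list. `context_windows` is a hypothetical helper with a fixed window; real Word2Vec training also applies details such as subsampling and a randomly shrunk window, which are left out.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for each position in a token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

# Hypothetical tokenized snippet of a review.
tokens = ["a", "wistful", "melancholy", "score"]

# CBOW: predict the target word from its surrounding context words.
cbow_examples = [(ctx, tgt) for tgt, ctx in context_windows(tokens, window=1)]

# Skip-gram: predict each context word from the target word.
skipgram_pairs = [
    (tgt, c) for tgt, ctx in context_windows(tokens, window=1) for c in ctx
]

print(cbow_examples[2])    # (['wistful', 'score'], 'melancholy')
print(skipgram_pairs[:2])  # [('a', 'wistful'), ('wistful', 'a')]
```

Either way, words that share many context words are pushed toward similar vectors, which is why the neighbors of a keyword reflect how it is used in the reviews.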

For "melancholy," although the literal meaning is "sadness," the model associates it with words that evoke beauty, emotion, and calm, such as "wistful," "somber," and "dreamy," alongside more upbeat neighbors like "elegance" and "exuberant." This suggests that in the corpus, "melancholy" is often used in a context that appreciates the depth of emotion, possibly in a positive light, rather than focusing solely on sadness.

The context for "ghastly" includes words like "unscary," "abominable," and "grotesquely," which indicate negative sentiment and align with the word's meaning of something shockingly frightful or dreadful. Interestingly, a word like "unscary" suggests a negation of fear, possibly from reviews where a horror movie failed to deliver the intended scare.

"Lackluster" is associated with words that convey a sense of mediocrity or dullness, such as "uninspired," "pedestrian," and "underwhelming." These are straightforward and align well with the sentiment that a movie or performance was not particularly engaging or impressive.

For "romantic," the model finds words that are directly related to romance and positive emotional experiences, like "romance," "tender," and "heartwarming." These words are typically used to describe movies or scenes that successfully evoke the warmth and affection associated with romantic content.

The fact that "romantic" is not affected in a negative way and maintains a positive context could be due to the generally positive inclination of romance in movie reviews. Romance as a genre tends to be associated with positive emotions and experiences, and this is reflected in the word associations provided by the model.
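
One way to probe this explanation is to count how many positive and negative reviews contain a keyword. The sketch below assumes the tokenized reviews and the 0/1 `sentiment` labels from the dataframe are available as parallel sequences; `keyword_label_counts` and the toy data are hypothetical.

```python
from collections import Counter

def keyword_label_counts(tokenized_reviews, labels, keyword):
    """Count reviews containing `keyword`, split by sentiment label."""
    counts = Counter()
    for tokens, label in zip(tokenized_reviews, labels):
        if keyword in tokens:
            counts[label] += 1
    return counts

# Hypothetical toy data: 1 = positive review, 0 = negative review.
toy_reviews = [
    ["tender", "romantic", "story"],
    ["romantic", "comedy", "charming"],
    ["ghastly", "romantic", "subplot"],
    ["lackluster", "plot"],
]
toy_labels = [1, 1, 0, 0]

print(keyword_label_counts(toy_reviews, toy_labels, "romantic"))
# Counter({1: 2, 0: 1})
```

On the real data, `keyword_label_counts(df['processed_reviews'], df['sentiment'], 'romantic')` would show whether "romantic" actually leans toward positive reviews or is spread across both classes.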


[20 pts] Read the following paper:
Maas, Andrew L., et al. "Learning word vectors for sentiment analysis." Proceedings of
the 49th annual meeting of the association for computational linguistics: Human
language technologies - volume 1. Association for Computational Linguistics, 2011.
Comment about and/or align your results in this assignment.


Ans - The paper "Learning Word Vectors for Sentiment Analysis" by Maas et al., presented at the 49th Annual Meeting of the Association for Computational Linguistics in 2011, discusses the application of word vector representations to sentiment analysis. The authors present a model that combines unsupervised and supervised techniques to learn word vectors that capture both semantic and sentiment information. These vectors are then used as features in a sentiment classification task.

The paper's findings are important because they show that word vectors learned this way capture sentiment as well as semantic similarity, which is particularly useful for sentiment analysis. Maas et al. also introduce an IMDB dataset of 50,000 labeled movie reviews, split evenly into 25,000 for training and 25,000 for testing, along with additional unlabeled reviews, and they evaluate how well the learned vectors predict the sentiment of the reviews.

In the context of this assignment, the approach aligns with the task of generating a gensim Word2Vec model for sentiment analysis. The model trained on movie reviews could potentially capture similar semantic regularities as those discussed in the paper. The keywords "melancholy," "ghastly," "lackluster," and "romantic" would have their vectors influenced by the contexts they appear in within the movie reviews, and these vectors could be used to infer sentiment.
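
As an illustration of how such vectors could feed a sentiment classifier, one common and simple approach is to average the vectors of the words in a review and use the result as a feature vector. This is a sketch of that idea only, not the method used in the paper; `average_vector` and the 2-dimensional toy vectors are hypothetical.

```python
def average_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(values) / len(known) for values in zip(*known)]

# Hypothetical 2-dimensional word vectors.
toy_vectors = {
    "romantic": [1.0, 0.0],
    "lackluster": [0.0, 1.0],
}

print(average_vector(["romantic", "lackluster", "unknownword"], toy_vectors, 2))
# [0.5, 0.5]
```

With the trained model, `model.wv` could play the role of `vectors` and `model.vector_size` the role of `dim`; the averaged features together with the `sentiment` column could then be passed to any standard classifier.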

If the "romantic" context was not affected, it could be due to the strong and consistent representation of romantic contexts in the training data. A word that occurs frequently and in similar surroundings receives many consistent training updates, so its nearest neighbors tend to be stable. This is an interpretation of the results in this assignment and is not a finding reported in the paper.
