## Sentiment Analysis

In this exercise we use the IMDb dataset to perform sentiment analysis. The code below assumes that the data files are in the same folder as this notebook. The reviews are loaded as a pandas DataFrame, and printing it shows the beginning of each of the first and last few reviews.

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Import from the public Keras API; keras.src is an internal package.
from keras import Sequential
from keras.layers import Dense, Dropout


In [3]:
reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
Y = (labels=='positive').astype(np.int_)

print(type(reviews))
print(reviews)

<class 'pandas.core.frame.DataFrame'>
                                                       0
0      bromwell high is a cartoon comedy . it ran at ...
1      story of a man who has unnatural feelings for ...
2      homelessness  or houselessness as george carli...
3      airport    starts as a brand new luxury    pla...
4      brilliant over  acting by lesley ann warren . ...
...                                                  ...
24995  i saw  descent  last night at the stockholm fi...
24996  a christmas together actually came before my t...
24997  some films that you pick up for a pound turn o...
24998  working  class romantic drama from director ma...
24999  this is one of the dumbest films  i  ve ever s...

[25000 rows x 1 columns]


**(a)** Split the reviews and labels into training, validation and test sets. The training and validation sets will be used to train your model and tune hyperparameters; the test set is held back for the final evaluation. Use the `CountVectorizer` from `sklearn.feature_extraction.text` to create a Bag-of-Words representation of the reviews. Only use the 10,000 most frequent words (use the `max_features` parameter of `CountVectorizer`).

In [4]:
reviews_train, reviews_test, Y_train, Y_test = train_test_split(
    reviews[0],
    Y[0],
    test_size=0.2,
    random_state=42
)

reviews_train, reviews_val, Y_train, Y_val = train_test_split(
    reviews_train,
    Y_train,
    test_size=0.25,
    random_state=42
)

vectorizer = CountVectorizer(max_features=10000)

X_train = vectorizer.fit_transform(reviews_train)

X_val = vectorizer.transform(reviews_val)
X_test = vectorizer.transform(reviews_test)
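
The two calls to `train_test_split` above produce a 60/20/20 split: the first holds out 20% of all reviews for testing, and the second takes 25% of the remaining 80% (again 20% of the total) for validation. A minimal sketch that checks this arithmetic on placeholder data of the same length (the array `data` is a stand-in, not the reviews):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the review set (25,000).
data = np.arange(25000)

# First split: hold out 20% of all rows for the test set.
rest, test = train_test_split(data, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80% becomes the validation set.
train, val = train_test_split(rest, test_size=0.25, random_state=42)

sizes = {"train": len(train), "val": len(val), "test": len(test)}
print(sizes)
```

Note that the vectorizer is fitted on the training reviews only; the validation and test reviews are merely transformed, so the vocabulary does not leak information from held-out data.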

**(b)** Explore the representation of the reviews. How is a single word represented? How about a whole review?

In [5]:
print("*********************** First 100 features ***********************")
print(vectorizer.get_feature_names_out()[:100])

print("*********************** First Review ***********************")

feature_names = vectorizer.get_feature_names_out()
first_review = reviews_train.iloc[0]
print(first_review)

print("*********************** Indexes ***********************")
first_review_vector = X_train[0]

for idx in first_review_vector.indices:
    feature_name = feature_names[idx]
    count = first_review_vector[0, idx]
    print(f"The word '{feature_name}' appears {count} times in the first review.")

print("*********************** Sparse representation of first review ***********************")
print(first_review_vector)

print("*********************** Indices of words in first review ***********************")
print(first_review_vector.indices)



*********************** First 100 features ***********************
['abandon' 'abandoned' 'abby' 'abc' 'abducted' 'abilities' 'ability'
 'able' 'aboard' 'abominable' 'abomination' 'abortion' 'abound' 'about'
 'above' 'abraham' 'abrupt' 'abruptly' 'absence' 'absent' 'absolute'
 'absolutely' 'absorbed' 'absorbing' 'abstract' 'absurd' 'absurdity' 'abu'
 'abundance' 'abuse' 'abused' 'abusive' 'abysmal' 'academic' 'academy'
 'accent' 'accents' 'accept' 'acceptable' 'acceptance' 'accepted'
 'accepting' 'accepts' 'access' 'accessible' 'accident' 'accidental'
 'accidentally' 'acclaim' 'acclaimed' 'accompanied' 'accompanying'
 'accomplish' 'accomplished' 'accomplishment' 'according' 'account'
 'accounts' 'accuracy' 'accurate' 'accurately' 'accusations' 'accused'
 'ace' 'achieve' 'achieved' 'achievement' 'achievements' 'achieves' 'acid'
 'acknowledge' 'acknowledged' 'acquire' 'acquired' 'across' 'act' 'acted'
 'acting' 'action' 'actions' 'active' 'activities' 'activity' 'actor'
 'actors' 'actres

**(c)** Train a neural network with a single hidden layer on the dataset, tuning the relevant hyperparameters to optimize accuracy. 

In [6]:
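# Convert the sparse count matrices to dense arrays before passing them to Keras.
X_train_dense = X_train.toarray()
X_val_dense = X_val.toarray()
X_test_dense = X_test.toarray()

# One hidden layer (128 ReLU units) with dropout, and a sigmoid output for the binary label.
input_dim = X_train_dense.shape[1]
model = Sequential()
model.add(Dense(128, input_dim=input_dim, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))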

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(X_train_dense, Y_train,
                    epochs=10,
                    verbose=True,
                    validation_data=(X_val_dense, Y_val),
                    batch_size=10)

loss, accuracy = model.evaluate(X_test_dense, Y_test, verbose=False)
print(f"Test Accuracy: {accuracy:.4f}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Accuracy: 0.8766
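
The cell above trains a single configuration. One possible way to tune the hyperparameters is a small grid search that scores each configuration on the validation set only. The sketch below is an assumption about how this could be done, not the procedure used to obtain the numbers above: the helper names `build_model` and `tune`, the grid values, and the synthetic `X_demo`/`y_demo` data are all hypothetical.

```python
import numpy as np
from keras import Input, Sequential
from keras.layers import Dense, Dropout


def build_model(input_dim, hidden_units, dropout_rate):
    """One hidden layer with dropout and a sigmoid output for binary labels."""
    model = Sequential([
        Input(shape=(input_dim,)),
        Dense(hidden_units, activation="relu"),
        Dropout(dropout_rate),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model


def tune(X_tr, y_tr, X_va, y_va, grid, epochs=3, batch_size=32):
    """Return (best_config, best_val_accuracy) over the given grid."""
    best_config, best_acc = None, -1.0
    for hidden_units, dropout_rate in grid:
        model = build_model(X_tr.shape[1], hidden_units, dropout_rate)
        history = model.fit(X_tr, y_tr, epochs=epochs, batch_size=batch_size,
                            validation_data=(X_va, y_va), verbose=0)
        # Score a configuration by its best validation accuracy over the epochs.
        val_acc = max(history.history["val_accuracy"])
        if val_acc > best_acc:
            best_config, best_acc = (hidden_units, dropout_rate), val_acc
    return best_config, best_acc


# Synthetic stand-in data so that the sketch runs on its own.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 3, size=(200, 20)).astype("float32")
y_demo = (X_demo[:, 0] > X_demo[:, 1]).astype("float32")

grid = [(8, 0.0), (16, 0.5)]  # (hidden_units, dropout_rate) pairs
best_config, best_acc = tune(X_demo[:150], y_demo[:150],
                             X_demo[150:], y_demo[150:], grid)
print(best_config, best_acc)
```

With the notebook's arrays, the call would be `tune(X_train_dense, Y_train, X_val_dense, Y_val, grid)`. The test set stays untouched until the best configuration has been chosen.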


**(d)** Test your sentiment classifier on the test set.

In [7]:
loss, accuracy = model.evaluate(X_test_dense, Y_test, verbose=True)

print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")
predictions = model.predict(X_test_dense)

predicted_classes = (predictions > 0.5).astype(int)
print(classification_report(Y_test, predicted_classes, target_names=['Negative', 'Positive']))
conf_matrix = confusion_matrix(Y_test, predicted_classes)
print(conf_matrix)


Test Loss: 0.6904
Test Accuracy: 0.8766
              precision    recall  f1-score   support

    Negative       0.90      0.84      0.87      2492
    Positive       0.85      0.91      0.88      2508

    accuracy                           0.88      5000
   macro avg       0.88      0.88      0.88      5000
weighted avg       0.88      0.88      0.88      5000

[[2099  393]
 [ 224 2284]]
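
The numbers in the classification report follow directly from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes, in the order Negative, Positive. As a check, this sketch recomputes them from the four counts printed above:

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class,
# in the order [Negative, Positive].
conf = np.array([[2099, 393],
                 [224, 2284]])

tn, fp = conf[0]  # true negatives, negatives predicted as positive
fn, tp = conf[1]  # positives predicted as negative, true positives

accuracy = (tn + tp) / conf.sum()
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f"accuracy      = {accuracy:.4f}")
print(f"Negative: precision = {precision_neg:.2f}, recall = {recall_neg:.2f}")
print(f"Positive: precision = {precision_pos:.2f}, recall = {recall_pos:.2f}")
```

The model mislabels more negative reviews as positive (393) than positive reviews as negative (224), which is why recall is lower for the Negative class.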


**(e)** Use the classifier to classify a few sentences you write yourselves. 

In [8]:
new_sentences = [
    "Last night, I watched 'Dreams of Light', a movie that's been the talk of the town lately. The movie kicked off with stunning visual effects that were truly a feast for the eyes. Each frame was like a beautifully crafted painting, capturing the essence of the story's vibrant setting. However, as the movie progressed, I couldn't help but feel let down by the predictable plot. It seemed as though the scriptwriters had taken no risks, sticking to a formulaic storyline that offered no real surprises. The lead actor delivered a heartwarming performance, adding depth to a character that might otherwise have felt one-dimensional. But on the flip side, the supporting cast just didn't bring the same level of energy. Their performances were lackluster, and at times, it felt as if they were simply going through the motions. The dialogue was another high point, filled with witty and engaging lines that kept me invested in the characters' journeys. Yet, the pacing of the film was off. However, the sound editing was less impressive. The background score often drowned out the actors' voices, making it hard to follow the dialogue. The cinematography was a saving grace, with breathtaking landscapes and well-executed shots that kept me visually engaged. In contrast, the movie's climax was underwhelming. It lacked the emotional punch I was expecting and felt somewhat rushed.",
    "This movie was a fantastic journey through emotions.",
    "I did not like the movie, it was boring and too long.",
    "An absolute masterpiece, brilliantly acted and well written.",
    "The acting was mediocre and the plot was predictable",
    "It was interesting movie, although very long",
    "Short movie with not an interesting outcome",
    "Did not like the movie"
]

X_new = vectorizer.transform(new_sentences)
X_new_dense = X_new.toarray()
new_predictions = model.predict(X_new_dense)
new_predicted_classes = (new_predictions > 0.5).astype(int)

for i, sentence in enumerate(new_sentences):
    print(f"Sentence: '{sentence}'")
    print(f"Predicted Sentiment: {'Positive' if new_predicted_classes[i][0] == 1 else 'Negative'}\n")

Sentence: 'Last night, I watched 'Dreams of Light', a movie that's been the talk of the town lately. The movie kicked off with stunning visual effects that were truly a feast for the eyes. Each frame was like a beautifully crafted painting, capturing the essence of the story's vibrant setting. However, as the movie progressed, I couldn't help but feel let down by the predictable plot. It seemed as though the scriptwriters had taken no risks, sticking to a formulaic storyline that offered no real surprises. The lead actor delivered a heartwarming performance, adding depth to a character that might otherwise have felt one-dimensional. But on the flip side, the supporting cast just didn't bring the same level of energy. Their performances were lackluster, and at times, it felt as if they were simply going through the motions. The dialogue was another high point, filled with witty and engaging lines that kept me invested in the characters' journeys. Yet, the pacing of the film was off. How