
# Get Data and import libraries that we need

Version of this jupyter notebook: sentimentanalysis_V2.4

In [1]:
# Define where we will download the data
path_data = "/content/sample_data"

Download the data

In [2]:
!wget -P /content/sample_data/ -c "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

--2024-11-07 11:37:44--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘/content/sample_data/aclImdb_v1.tar.gz’


2024-11-07 11:37:46 (32.8 MB/s) - ‘/content/sample_data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



Decompress the archive

In [3]:
!tar -xf  /content/sample_data/aclImdb_v1.tar.gz -C /content/sample_data/

Check that the folder aclImdb exists

In [4]:
!ls /content/sample_data/

aclImdb		   anscombe.json		california_housing_train.csv  mnist_train_small.csv
aclImdb_v1.tar.gz  california_housing_test.csv	mnist_test.csv		      README.md


Import all required libraries

In [5]:
import pandas as pd
import numpy as np
import os
from bs4 import BeautifulSoup
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
import nltk
from gensim.models import Word2Vec, Phrases
import matplotlib.pyplot as plt

import tensorflow as tf
from keras.utils import pad_sequences

Define function that reads all the files from a given folder

In [6]:
def read_data(path, files):
  data = []
  for f in files:
    with open(path+f) as file:
      ### BEGIN YOUR CODE HERE
      ## Read a line from the file and append it to the data list variable.
      ## TIP: use function append(): https://www.w3schools.com/python/ref_list_append.asp
      ## and readline(): https://www.w3schools.com/python/ref_file_readline.asp
      line = file.readline()
      data.append(line)


      ### END YOUR CODE HERE
  return data
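`readline()` returns only the first line of each file, which is sufficient as long as every review is stored on a single line. A minimal sketch of a more defensive variant follows; it assumes UTF-8 files, and the helper name `read_reviews` is hypothetical, not part of the notebook. It reads the whole file and builds the path with `os.path.join`:

```python
import os
import tempfile

def read_reviews(folder, filenames):
    """Return the full text of every file in `filenames` under `folder`."""
    reviews = []
    for name in filenames:
        # os.path.join works whether or not `folder` ends with a slash
        with open(os.path.join(folder, name), encoding="utf-8") as handle:
            reviews.append(handle.read().strip())
    return reviews

# Tiny demonstration on a temporary folder (not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "0_9.txt"), "w", encoding="utf-8") as handle:
        handle.write("Great film.\nSecond line.")
    demo = read_reviews(tmp, sorted(os.listdir(tmp)))
    print(demo)
```

Unlike the `readline()` version, this keeps the second line of the demo file.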

Load data (movie reviews), both for train and test

In [7]:
path_data_train_pos = path_data + '/aclImdb/train/pos/'
path_data_train_neg = path_data + '/aclImdb/train/neg/'
path_data_test_pos = path_data + '/aclImdb/test/pos/'
path_data_test_neg = path_data + '/aclImdb/test/neg/'

# Get the list of files from the four folders
train_pos_files = os.listdir(path_data_train_pos)
train_neg_files = os.listdir(path_data_train_neg)
test_pos_files = os.listdir(path_data_test_pos)
test_neg_files = os.listdir(path_data_test_neg)

In [21]:
### BEGIN YOUR CODE HERE
## Read the review data in these four variables: train_data_pos, train_data_neg,
## test_data_pos and test_data_neg using the function defined above read_data().
## Tip: First argument is the folder containing the files, second argument is
# the list of files

train_data_pos = read_data(path_data_train_pos, train_pos_files)
train_data_neg = read_data(path_data_train_neg, train_neg_files)
test_data_pos = read_data(path_data_test_pos, test_pos_files)
test_data_neg = read_data(path_data_test_neg, test_neg_files)

# 1a. How many examples do we have in training for positive reviews? How many for negative reviews?
# positive - 12500
# negative - 12500
print(f"Positive training: {len(train_data_pos)}, Negative training: {len(train_data_neg)}")


# 1b. How about in the testing set?
# positive - 12500
# negative - 12500
print(f"Positive testing: {len(test_data_pos)}, Negative testing: {len(test_data_neg)}")



# 1c. Is the dataset balanced or not?
# Yes

# 2.Print examples of positive and negative reviews from training and testing dataset
# Tip: the output of the read_data() function is a list.
# https://www.w3schools.com/python/python_lists_access.asp
print("==============================")
print("Train data positive example:")
print(train_data_pos[0])

print("\n==============================")
print("Train data negative example:")
print(train_data_neg[0])

print("\n==============================")
print("Test data positive example:")
print(test_data_pos[0])

print("\n==============================")
print("Test data negative example:")
print(test_data_neg[0])


### END YOUR CODE HERE

Positive training: 12500, Negative training: 12500
Positive testing: 12500, Negative testing: 12500
Train data positive example:
The "movie aimed at adults" is a rare thing these days, but Moonstruck does it well, and is still a better than average movie, which is aging very well. Although it's comic moments aim lower than the rest of it, the movie has a wonderful specificity (Italians in Brooklyn) that isn't used to shortchange the characters or the viewers. (i.e. Mobsters never appear in acomplication. It never becomes grotesque like My Big Fat Greek Wedding) The secondary story lines are economically told with short scenes that allow a break from the major thread. These are the scenes that are now missing in contemporary movies where their immediate value cannot be impressed upon producers and bigwigs. I miss these scenes. It also beautifully involves older characters. The movie takes it's own slight, quiet path to a conclusion. There isn't a poorly written scene included anywhere t

In [22]:
# Let's work on a subset of the training and testing datasets to start with.
# While you are developing the code, I recommend working with a small number of examples, maybe 1000.
# This shortens the time needed to get results. Once the code is working, please use
# the entire dataset for the full tests

### BEGIN YOUR CODE HERE
sample_number = 1000
### END YOUR CODE HERE
train_data_pos = train_data_pos[:sample_number]
train_data_neg = train_data_neg[:sample_number]
test_data_pos = test_data_pos[:sample_number]
test_data_neg = test_data_neg[:sample_number]

Create the data structures that we'll use for training and testing

In [24]:
### BEGIN YOUR CODE HERE
# Assign the length of the train_data_pos to variable length_train_pos
# Tip: to get the length of an array, you can use the function len()
length_train_pos = len(train_data_pos)

# Assign the length of the train_data_neg to variable length_train_neg
length_train_neg = len(train_data_neg)

# Assign the length of the test_data_pos to variable length_test_pos
length_test_pos = len(test_data_pos)

# Assign the length of the test_data_neg to variable length_test_neg
length_test_neg = len(test_data_neg)

### END YOUR CODE HERE

print("Length of the positive training examples is", length_train_pos )
print("Length of the negative training examples is", length_train_neg)
print("Length of the positive testing examples is", length_test_pos )
print("Length of the negative testing examples is", length_test_neg)

Length of the positive training examples is 1000
Length of the negative training examples is 1000
Length of the positive testing examples is 1000
Length of the negative testing examples is 1000


In [25]:
# Create the training DataFrame with examples and labels
# Concatenate positive and negative training examples, then pair each example with a label: 1 for positive, 0 for negative
data_train = pd.DataFrame(zip(train_data_pos+train_data_neg, [1]*length_train_pos+[0]*length_train_neg),  columns=['review', 'label'])
# Create the test DataFrame with examples and labels
# Concatenate positive and negative test examples, then pair each example with a label: 1 for positive, 0 for negative
data_test = pd.DataFrame(zip(test_data_pos+test_data_neg, [1]*length_test_pos+[0]*length_test_neg),  columns=['review', 'label'])
# Combine all reviews into a single list (training + test, positive + negative)
all_reviews = train_data_pos+train_data_neg+test_data_pos+test_data_neg
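The label list has to line up with the concatenated reviews: positives first, then negatives, with each label repeated as many times as its own list is long. A small sketch with made-up reviews shows the pairing and a quick balance check. The strings and the helper `build_labeled` are illustrative only and do not appear in the notebook:

```python
from collections import Counter

def build_labeled(pos_reviews, neg_reviews):
    """Pair each review with 1 (positive) or 0 (negative), positives first."""
    reviews = pos_reviews + neg_reviews
    labels = [1] * len(pos_reviews) + [0] * len(neg_reviews)
    return list(zip(reviews, labels))

# Deliberately unequal list sizes, so a wrong length would be visible
pairs = build_labeled(["loved it", "a gem"], ["dull", "too long", "awful"])
label_counts = Counter(label for _, label in pairs)
print(pairs[-1])
print(label_counts)
```

With the 1000/1000 subset used here both counts are equal, so a mismatch between the label lengths would go unnoticed until the class sizes differ.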

In [26]:
print("The length of the train reviews is",len(data_train))
print("The length of the test reviews is",len(data_test))
print("The length of all reviews is",len(all_reviews))

The length of the train reviews is 2000
The length of the test reviews is 2000
The length of all reviews is 4000


## The following are functions for preprocessing text

In [27]:
# Download the NLTK resources: stop words, the punkt tokenizer, and WordNet
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

counter = 0
REPLACE_WITH_SPACE = re.compile(r'[^A-Za-z\s]')
stop_words = set(stopwords.words("english"))
# Declare the lemmatizer
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [28]:
# The following functions preprocess text data
def clean_review(raw_review: str) -> str:
    # 1. Remove HTML
    review_text = BeautifulSoup(str(raw_review), "lxml").get_text()
    # 2. Remove non-letters
    letters_only = REPLACE_WITH_SPACE.sub(" ", review_text)
    # 3. Convert to lower case
    lowercase_letters = letters_only.lower()
    return lowercase_letters


def lemmatize(tokens: list) -> list:
    # 1. Lemmatize
    tokens = list(map(lemmatizer.lemmatize, tokens))
    lemmatized_tokens = list(map(lambda x: lemmatizer.lemmatize(x, "v"), tokens))
    # 2. Remove stop words
    meaningful_words = list(filter(lambda x: x not in stop_words, lemmatized_tokens))
    return meaningful_words


def preprocess(review: str, total: int, show_progress: bool = True) -> list:
    if show_progress:
        global counter
        counter += 1
        print('Processing... %6i/%6i'% (counter, total), end='\r')
    # 1. Clean text
    review = clean_review(review)
    # 2. Split into individual words
    tokens = word_tokenize(review)
    # 3. Lemmatize
    lemmas = lemmatize(tokens)
    # 4. Return the result as a list of lemmas
    # (the words are not joined back into a single string).
    return lemmas
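To make the effect of these steps concrete, here is a standard-library-only sketch of the same idea. It is a simplification under stated assumptions: a crude regex stands in for BeautifulSoup, whitespace splitting stands in for `word_tokenize`, the stop-word set is a tiny hand-written subset, and lemmatization is skipped. The name `simple_preprocess` is hypothetical.

```python
import re

TAG = re.compile(r"<[^>]+>")                       # crude stand-in for BeautifulSoup
NON_LETTERS = re.compile(r"[^A-Za-z\s]")           # same pattern as REPLACE_WITH_SPACE
TINY_STOP_WORDS = {"the", "a", "is", "it", "and"}  # illustrative subset only

def simple_preprocess(raw_review):
    text = TAG.sub(" ", raw_review)      # 1. drop HTML tags
    text = NON_LETTERS.sub(" ", text)    # 2. replace non-letters with spaces
    tokens = text.lower().split()        # 3. lower-case and split on whitespace
    return [t for t in tokens if t not in TINY_STOP_WORDS]  # 4. drop stop words

tokens = simple_preprocess("The movie is GREAT!<br /><br />It's a 10/10.")
print(tokens)
```

Replacing the apostrophe in "It's" with a space leaves a stray `s` token, which is one way single-letter fragments can end up in the vocabulary.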

## Preprocess the text of the reviews by removing the non-word characters, converting everything to lower case, and lemmatizing words


In [45]:
### Begin your code here

# Display the first review text from the training data.
# Purpose: To check how the raw data looks before applying any preprocessing.
# Tip: The training data is stored in a pandas DataFrame (2-dimensional array-like structure).
# Here, we access the 'review' column of the first row.
print("Training 'review' column first row:")
print(data_train['review'].iloc[0])

# Display the label associated with the first review in the training data.
# Purpose: To confirm the label for the first review, indicating its classification (1 for positive, 0 for negative).
print("\n\nTraining 'label' column first row:")
print(data_train['label'].iloc[0])

# Display the preprocessed text of the first review.
# Purpose: To see the effect of preprocessing on the raw text.
# Tip: Use the preprocess() function, passing the first review text and specifying "1" as the second argument
# since we're processing a single example.
preprocessed_text = preprocess(data_train['review'].iloc[0], 1, show_progress=False)
print("\n\nPreprocessed text:")
print(preprocessed_text)

### End your code here

Training 'review' column first row:
The "movie aimed at adults" is a rare thing these days, but Moonstruck does it well, and is still a better than average movie, which is aging very well. Although it's comic moments aim lower than the rest of it, the movie has a wonderful specificity (Italians in Brooklyn) that isn't used to shortchange the characters or the viewers. (i.e. Mobsters never appear in acomplication. It never becomes grotesque like My Big Fat Greek Wedding) The secondary story lines are economically told with short scenes that allow a break from the major thread. These are the scenes that are now missing in contemporary movies where their immediate value cannot be impressed upon producers and bigwigs. I miss these scenes. It also beautifully involves older characters. The movie takes it's own slight, quiet path to a conclusion. There isn't a poorly written scene included anywhere to make some executives sphincter relax. Cage and Cher do very nice work.<br /><br />Moonstruc

Let's preprocess the entire set of reviews

In [46]:
# Preprocess each review in all_reviews by applying the preprocess function to each item.
# The preprocess function takes each review (x) and the total number of reviews (len(all_reviews)).
# The total is only used to display progress; it does not affect how a review is cleaned.

all_reviews = [preprocess(x, len(all_reviews)) for x in all_reviews]





In [47]:
# Slice the preprocessed reviews to get only the training set portion (first len(train_data_pos) + len(train_data_neg) items)
X_train_preprocessed = all_reviews[:(len(train_data_pos)+len(train_data_neg))]
X_test_preprocessed = all_reviews[(len(train_data_pos)+len(train_data_neg)):]

In [48]:
len(train_data_pos)

1000

In [49]:
# Let's see how many reviews we have in total for training
# We should get 2000 if you kept the sample_number=1000
# print(X_train_preprocessed.shape)
print(len(X_train_preprocessed))

2000


## Compute Word2Vec vectors on all reviews

In [50]:
# compute bigrams, meaning detect two-word phrases in the texts
# For example: ["new", "york"] will be detected as one phrase and joined into the single token "new_york"
print("Compute phrases begin")
bigrams = Phrases(sentences=all_reviews)
# compute trigrams, meaning we detect three words that usually appear
# together. Notice that because we work with a small subset, we might
# not detect a lot of trigrams
trigrams = Phrases(sentences=bigrams[all_reviews])
print("Compute phrases end")

Compute phrases begin
Compute phrases end


In [51]:
# Test how our phrase looks after calling the bigrams
print(bigrams['space station near the solar system'.split()])

['space', 'station', 'near', 'the', 'solar', 'system']


In [52]:
# Test how our phrase looks after calling the trigrams
# Do you notice any difference compared with the bigrams?
print(trigrams[bigrams['space station near the solar system'.split()]])

['space', 'station', 'near', 'the', 'solar', 'system']
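Neither model merged any words in the test sentence. A plausible reason is that `Phrases` joins a pair only when its co-occurrence score exceeds a threshold, and a 4000-review subset gives most pairs too few joint occurrences. The sketch below reproduces the default scoring rule as gensim's documentation describes it, with `min_count=5` and `threshold=10.0` taken as assumptions about the defaults. All counts are invented for illustration, and `phrase_score` is a hypothetical helper.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Default-style bigram score: (count_ab - min_count) * vocab_size / (count_a * count_b)."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

THRESHOLD = 10.0  # assumed default; a pair is merged only if its score exceeds this

# Invented counts: two rare words that nearly always appear together ...
frequent_pair = phrase_score(count_a=40, count_b=30, count_ab=25, vocab_size=13533)
# ... versus two common words that appear together only a handful of times
rare_pair = phrase_score(count_a=400, count_b=300, count_ab=6, vocab_size=13533)

print(round(frequent_pair, 2), frequent_pair > THRESHOLD)
print(round(rare_pair, 4), rare_pair > THRESHOLD)
```

The score rewards pairs that occur together far more often than their individual frequencies would suggest, so with more data more phrases may cross the threshold.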


In [53]:
# compute word embedding from the dataset
### Begin your code here
# set the embedding vector size variable to 256
embedding_vector_size = 256


### End your code here

# Next, we train a custom word2vec model on our own dataset.
# Notice that the input sentences are passed through the trigram model,
# so a detected phrase such as new_york is treated as a single word.
# The duration of this process depends on the size of the dataset.
# For the restricted set of reviews, this should take around 1-2 minutes
print("Start learning the word embedding")
trigram_model = Word2Vec(
    sentences = trigrams[bigrams[all_reviews]],
    vector_size = embedding_vector_size,
    min_count=3, window=5, workers=4)
print("Done learning")

Start learning the word embedding
Done learning


Check the vocabulary size

In [54]:
print("Vocabulary size:", len(trigram_model.wv))

Vocabulary size: 13533


Let's check the most similar words for "sun" & "action"

In [55]:
trigram_model.wv.most_similar('sun')
# If you are working with the subset of 1000 reviews, the most similar words might not be
# the most relevant ones. You can remove the constraint of working with only 1000 reviews,
# and compare what are the most similar words again, but please be aware that this might
# increase the training time of the word2vec

[('scientist', 0.999759316444397),
 ('boat', 0.9997497797012329),
 ('n', 0.9997242093086243),
 ('board', 0.9997135996818542),
 ('girlfriend', 0.9997103214263916),
 ('g', 0.9997051954269409),
 ('road', 0.9997047185897827),
 ('fire', 0.999704122543335),
 ('trip', 0.9997013211250305),
 ('priest', 0.9997006058692932)]

In [56]:
trigram_model.wv.most_similar('action')

[('low_budget', 0.9994406700134277),
 ('nice', 0.9993757009506226),
 ('truly', 0.9993634223937988),
 ('fact', 0.9992556571960449),
 ('idea', 0.9992082715034485),
 ('although', 0.9991889595985413),
 ('predictable', 0.9991137981414795),
 ('one_best', 0.999112069606781),
 ('ok', 0.9990673065185547),
 ('felt', 0.9988930821418762)]
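`most_similar` ranks words by the cosine similarity between their embedding vectors. A minimal sketch of that measure, using toy two-dimensional vectors rather than the trained embeddings, is:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Scores above 0.999 for unrelated words such as "sun" and "girlfriend" mean the vectors point in almost the same direction. That likely indicates the embeddings have not separated yet, which is consistent with training on a small subset.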

Given a list of words, identify which word does not match the others

In [57]:
trigram_model.wv.doesnt_match(['moon', 'sun', 'planet'])

'moon'

# Transform our reviews from the training set into vectors

In [58]:
def vectorize_data(data, vocab: dict) -> list:
    # Map every known word to its vocabulary index; words missing from the vocabulary are dropped.
    # vocab is Word2Vec's key_to_index mapping, so a direct lookup replaces the slow keys.index() scan.
    print('Vectorize sentences...', end='\r')
    encode = lambda review: [vocab[word] for word in review if word in vocab]
    vectorized = list(map(encode, data))
    print('Vectorize sentences... (done)')
    return vectorized

print('Convert sentences to sentences with ngrams...', end='\r')
X_data = trigrams[bigrams[X_train_preprocessed]]
print('Convert sentences to sentences with ngrams... (done)')
input_length = 150

# Bring all sequences to length 150: shorter sequences are padded with zeros at the end, while longer ones are truncated to the maximum size
X_pad_train = pad_sequences(
    sequences=vectorize_data(X_data, vocab=trigram_model.wv.key_to_index),
    maxlen=input_length,
    padding='post')
print('Transform sentences to sequences on the train set... (done)')


X_data_test = trigrams[bigrams[X_test_preprocessed]]
X_pad_test = pad_sequences(
    sequences=vectorize_data(X_data_test, vocab=trigram_model.wv.key_to_index),
    maxlen=input_length,
    padding='post')

print('Transform sentences to sequences on the test set... (done)')

Convert sentences to sentences with ngrams... (done)
Vectorize sentences... (done)
Transform sentences to sequences on the train set... (done)
Vectorize sentences... (done)
Transform sentences to sequences on the test set... (done)
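The padding step can be pictured with plain lists. The sketch below mimics `pad_sequences(padding='post')` combined with its default `truncating='pre'`, which drops tokens from the start of an over-long sequence. The helper name `pad_post_truncate_pre` is hypothetical.

```python
def pad_post_truncate_pre(sequence, maxlen, value=0):
    """Mimic pad_sequences(padding='post') with its default truncating='pre'."""
    trimmed = sequence[-maxlen:]  # keep the LAST maxlen items
    return trimmed + [value] * (maxlen - len(trimmed))  # fill the end with `value`

short = pad_post_truncate_pre([7, 8], maxlen=4)
long = pad_post_truncate_pre([1, 2, 3, 4, 5, 6], maxlen=4)
print(short, long)
```

In this notebook the padding value 0 is also the vocabulary index of a real word, so the network cannot tell padding apart from that word. Shifting all indices by one or using a mask would be one way to avoid this, at the cost of changing the embedding matrix accordingly.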


In [59]:
# For a given example, each number in the vector represents the position of the word in the vocabulary
X_pad_train[2]

array([  299,     5,  2048,   226,   758,  3138,  6087,  1364,  2291,
        6291,   603,   972,  1681,   230,  6011,    97,   220,   607,
        9976,   269,   119,  1515,    74,   179,   706,    13,   607,
         168,    60,  4835,  2006,   440,  1160,   648,    97,  7959,
        5303,     0,    23,   883,  1158,  2371,   806,   260,   265,
        4645,   616,    74,   149,  2307,    13,   697,   991,  1551,
         130,  4598,  2371,  1158,  3390,    31,    12, 11399,    72,
         351,  6011,   209,  4835,    31,   250,  5303,   265, 11343,
        1935,    91,   541,    51,  1350,   114,    54,   354,  1149,
        2351,   323,  6940,  1157,   559,     6,  5303,  6117,   345,
        7182,     0,    10,   473,  7088,  9377,  9976,    68,  4116,
          65,   260,  2695,  7946,  2505,  6013,   126,  3574,   109,
         143,    45,  8070,    36,   183,   459,  2990,  1166,     1,
          23,   836,    45,   583,    46,  2512, 11371,  1055,  7784,
         172,  6291,

# Train a classifier based on a particular type of recurrent neural network called LSTM to differentiate between positive and negative reviews

In [64]:
from sklearn.model_selection import train_test_split

# We'll train the model on a subset of the training set

# Step 1: Split the training data into 80% train and 20% validation

# BEGIN YOUR CODE HERE
# Use train_test_split to split the dataset (X_pad_train and data_train['label']) into X_train, X_val, y_train, y_val.
# https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.train_test_split.html
# Ensure that the data is shuffled to avoid any ordering bias, and set a random state for reproducibility.
X_train, X_val, y_train, y_val = train_test_split(
    X_pad_train,
    data_train['label'],
    test_size=0.2,
    shuffle=True,
    random_state=42
)

# END YOUR CODE HERE

# Step 2: The test set is just the padded test sequences along with the test labels

X_test = X_pad_test
y_test = data_test['label']

In [66]:
print(f"X_train shape: {X_train.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_val shape: {y_val.shape}")

X_train shape: (1600, 150)
X_val shape: (400, 150)
y_train shape: (1600,)
y_val shape: (400,)
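The 1600/400 shapes follow directly from `test_size=0.2` applied to 2000 examples. A standard-library sketch of a shuffled split is shown below. It is not scikit-learn's implementation, and `shuffle_split` is a hypothetical helper that will not reproduce the same ordering as `random_state=42`.

```python
import random

def shuffle_split(items, labels, val_fraction=0.2, seed=42):
    """Shuffle the indices once, then use the first chunk for validation."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(items) * val_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    pick = lambda seq, idx: [seq[i] for i in idx]
    return (pick(items, train_idx), pick(items, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))

toy_x = list(range(2000))
toy_y = [1] * 1000 + [0] * 1000  # positives first, as in data_train
x_tr, x_va, y_tr, y_va = shuffle_split(toy_x, toy_y)
print(len(x_tr), len(x_va), sum(y_va))
```

Shuffling matters here because the data is ordered with all positives first. A plain shuffle does not guarantee an exact 50/50 class ratio in the validation set; `train_test_split` offers a `stratify` argument for that.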


In [65]:
data_test['label']

Unnamed: 0,label
0,1
1,1
2,1
3,1
4,1
...,...
1995,0
1996,0
1997,0
1998,0


Define the Neural Network

In [None]:
### Begin your code here
# Define a neural network model
# TIP: Use the same sequential model in order to define the network:
# https://www.tensorflow.org/guide/keras/sequential_model
# You can also see an example in the MNIST lab.
# Add the following layers:
# 1. an Embedding layer of the following form:
# tf.keras.layers.Embedding(
#         input_dim = trigram_model.wv.vectors.shape[0],
#         output_dim = trigram_model.wv.vectors.shape[1],
#         input_length = input_length,
#         weights = [trigram_model.wv.vectors],
#         trainable=False)
# 2. A Bidirectional layer with LSTM, with 128 internal units and a recurrent dropout of 0.1
# A sentence can be considered a temporal sequence, where the order of the words
# might be important. This is why we need a temporal model, like a Long short-term memory model.
# A bidirectional model simply means that we want to parse the data both forward
# and backwards. This allows the network to capture both past and future context for each time step.
# See example here: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional
# eg: tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, recurrent_dropout=0.1)),
# 3. A dropout layer with 0.25 probability
# You will find an example of dropout layer in the previous MNIST lab.
# See example here: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout
# 4. A Dense layer with 64 internal units
# You will find an example of a dense layer in the previous MNIST lab.
# 5. A dropout layer with 0.3 probability
# 6. A final dense layer with 1 neuron and a sigmoid activation function
# tf.keras.layers.Dense(1, activation='sigmoid')




### End your code here
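Before building the network it can help to estimate how many weights the described layers contain. The arithmetic below is a sketch under these assumptions: the vocabulary size (13533) and vector size (256) printed earlier for the 1000-sample run, the standard four-gate LSTM layout, and a Bidirectional wrapper that concatenates the two directions. The helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with an input kernel, a recurrent kernel and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

vocab_size, embedding_dim = 13533, 256         # values printed earlier in this notebook
frozen_embedding = vocab_size * embedding_dim  # not trained (trainable=False)
bi_lstm = 2 * lstm_params(embedding_dim, 128)  # forward + backward LSTM
dense_64 = dense_params(2 * 128, 64)           # input is both directions concatenated
dense_out = dense_params(64, 1)
trainable_total = bi_lstm + dense_64 + dense_out
print(frozen_embedding, bi_lstm, dense_64, dense_out, trainable_total)
```

Dropout layers add no weights. Most parameters sit in the frozen embedding, so only a few hundred thousand weights are actually updated during training.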

In [None]:
# compile the model
model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])
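The `binary_crossentropy` loss compares each predicted probability with its 0/1 label. A standard-library sketch of the formula, with clipping added to avoid `log(0)`, is shown below; the clipping constant `eps` is an assumption, not necessarily the value Keras uses.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
undecided = binary_crossentropy([1, 0], [0.5, 0.5])
print(round(confident_right, 4), round(undecided, 4))
```

A model that always outputs 0.5 scores ln 2 (about 0.693), so a training loss that stays near that value suggests the network has not learned anything yet.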

In [None]:
### Begin your code here
# Train the model with two epochs, and a batch size of 100.
# Tip: The x is X_train, y is y_train, and validation_data is (X_val, y_val)
# To view the model.fit function definition check here: https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit
# and also an example here: https://www.tensorflow.org/guide/keras/training_with_built_in_methods (Look for fit() )
# To obtain a better accuracy you would need to train for more epochs. Find a right balance between the training time
# and the accuracy
# The parameters that you need to set are:
# x as X_train
# y as y_train
# validation_data with the (X_val, y_val)
# Choose an appropriate batch_size. You can experiment with different values and see how the model behaves.
# When you first start training, only train for 1-2 epochs




### End your code here


In [None]:
# Extract the loss values for each epoch and display it in a figure
# BEGIN YOUR CODE HERE

# Extract loss values for training and validation from the history object
# Create an array, epochs, containing integers from 1 to the number of epochs (inclusive).


# Plot training and validation loss over epochs



# Add labels, title, grid, and legend


# Show the plot



# END YOUR CODE HERE

In [None]:
# BEGIN YOUR CODE HERE
# What is the loss and accuracy on the Testing dataset?
# Tip: evaluate the model on X_test and y_test, which were defined above
# together with the train/validation split.
# https://www.tensorflow.org/api_docs/python/tf/keras/Model#evaluate
# When you print the output of the evaluate function, it returns both
# the loss and the accuracy, typically in a format like [loss_value, accuracy_value]



### End your code here

# What is the accuracy you get?
# If you get an accuracy of approx. 50%, what does it mean? Did your model learn?
# Try to modify the architecture; how you train the model or how much data you
# use to train it in order to improve the results.
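To see why roughly 50% accuracy is a warning sign on this dataset, consider a "model" that ignores its input. A short sketch, with labels laid out like the balanced 2000-review test subset:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

balanced_labels = [1] * 1000 + [0] * 1000     # same layout as the test subset
always_positive = [1] * len(balanced_labels)  # a "model" that ignores its input
baseline = accuracy(balanced_labels, always_positive)
print(baseline)
```

Because the classes are balanced, always predicting one class already reaches 0.5. A trained model has to score clearly above this baseline to show it has learned something.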


Test the model with a random text


In [None]:
test_samples = []

# Normally in Python each instruction sits on one line. A long string is hard
# to read if it is written on a single line, so we split it: the character \
# tells the interpreter that the instruction continues on the next line.
# Adjacent string literals are joined without a space, so each piece must
# carry its own separating space.
review1 = "Petter Mattei's 'Love in the Time of Money' is a visually stunning " \
          "film to watch. Mr. Mattei offers us a vivid portrait about human" \
          " This is a movie that seems to be telling us what money, power and " \
          "success do to people"
### Begin your code here
# Write a couple of reviews and analyse how the model performs on your own data.
# Are the results what you expect?
# What did you change in the network architecture to get better results?
review2 = ""
review3 = ""
### End your code here

test_samples.append(review1)
test_samples.append(review2)
test_samples.append(review3)

# test_samples_preprocess = np.array(list(map(lambda x: preprocess(x, len(test_samples)), test_samples)))
test_samples_preprocess = list(map(lambda x: preprocess(x, len(test_samples)), test_samples))
print(test_samples_preprocess)
print(trigrams[bigrams[test_samples_preprocess]])
test_data_bigrams = trigrams[bigrams[test_samples_preprocess]]
test_data_pad = pad_sequences(
    sequences=vectorize_data(test_data_bigrams, vocab=trigram_model.wv.key_to_index),
    maxlen=input_length,
    padding='post')

predictions = model.predict(test_data_pad)
#print(predictions)

for (t, p) in zip( test_samples, predictions):
    prediction_string = "positive"
    if p<0.5:
        prediction_string = "negative"
    print("Predicted "+prediction_string+" "+str(p)+" for review:"+ t)
