<a href="https://colab.research.google.com/github/shstreuber/AI/blob/main/CS_345_545_Midterm_Student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Welcome to the lab portion of your Midterm Exam!**

In this lab, you will work with the IMDB Movie Review dataset, which you already encountered in Week 5 of this course. If you don't remember the dataset, please check out the February 6 and February 8 class session recordings in the [Zoom Cloud Recordings](https://applications.zoom.us/lti/rich/home).


#**0. Importing the Required Libraries**

In [None]:
# Here are the libraries

import numpy as np
from keras.datasets import imdb
from tensorflow.keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, LSTM, Flatten, SimpleRNN, Embedding
from keras.preprocessing import sequence

#**1. Importing and Inspecting the Data**
The data in the IMDB dataset are already tokenized.

This code loads the data, builds a reverse word index that maps the integer codes back to words, and then iterates through the training set to find the 100 most common and the 100 least common words. Finally, it displays these words. Adjust the slice sizes (`[:100]` and `[-100:]`) if you want to display more or fewer words.

**NOTE** that, to conserve memory, we are loading only the 5,000 most frequent words in the dataset, and in section 2 we shorten each review to 100 words (by default, `pad_sequences` keeps the last 100 words of a longer review).

In [None]:
# Load the IMDB dataset, but constrain what is being loaded so that you don't run out of memory.

max_features = 5000  # Consider only the 5,000 most frequent words in the dataset
maxlen = 100  # Limit the length of each review to 100 words
batch_size = 32

# Now we can load the dataset into training and test sets with these parameters already in place.
# That simplifies the preprocessing considerably.

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

In [None]:
# Load the word index from the dataset
word_index = imdb.get_word_index()

# Reverse the word index to map integers to words
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
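
One detail matters when decoding reviews: with its default settings, `imdb.load_data()` reserves the indices 0 (padding), 1 (start of review), and 2 (unknown word), and shifts every real word index by 3. The following is a minimal sketch of that offset; the tiny `toy_word_index` and the `decode_review` helper are made up for illustration and stand in for the real word index.

```python
# Toy stand-in for imdb.get_word_index(): word -> frequency rank (1 = most frequent).
# The real index has tens of thousands of entries; this one is hypothetical.
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}
reverse_toy_index = {rank: word for word, rank in toy_word_index.items()}

INDEX_OFFSET = 3  # default index_from=3 in imdb.load_data()
SPECIAL_TOKENS = {0: "<pad>", 1: "<start>", 2: "<unk>"}


def decode_review(encoded_review):
    """Turn a list of integer codes back into a readable string."""
    words = []
    for code in encoded_review:
        if code in SPECIAL_TOKENS:
            words.append(SPECIAL_TOKENS[code])
        else:
            # Undo the shift before looking the word up
            words.append(reverse_toy_index.get(code - INDEX_OFFSET, "<unk>"))
    return " ".join(words)


encoded_review = [1, 4, 5, 6, 7, 2]
decoded = decode_review(encoded_review)
print(decoded)
```

Looking a code up without subtracting the offset returns a different word than the one the reviewer actually wrote.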

In [None]:
# Count the frequency of each word index in the training set
word_counts = {}
for review in x_train:
    for word in review:
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1

# Sort the words by their frequency in descending order
sorted_words = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)
sorted_words[0]
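
The counting loop above can also be written with `collections.Counter` from the standard library. This is only a sketch on three made-up toy reviews (`toy_reviews` is not part of the dataset); the same pattern should apply to `x_train`.

```python
from collections import Counter

# Three hypothetical integer-encoded reviews
toy_reviews = [[1, 4, 5, 4], [1, 4, 6], [4, 5]]

# Count every token across all reviews in one pass
toy_counts = Counter(token for review in toy_reviews for token in review)

most_common = toy_counts.most_common(1)     # highest count first
least_common = toy_counts.most_common()[-1]  # last entry has the lowest count

print(most_common)
print(least_common)
```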

In [None]:
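# Extract the top 100 most common and least common words.
# imdb.load_data() shifts every index by 3 (0 = padding, 1 = start of review, 2 = unknown word),
# so subtract 3 before the lookup; the reserved indices are shown as '?'.
top_common_words = [reverse_word_index.get(word[0] - 3, '?') for word in sorted_words[:100]]
top_uncommon_words = [reverse_word_index.get(word[0] - 3, '?') for word in sorted_words[-100:]]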

# Display the top 100 common and uncommon words
print("Top 100 common Words:")
print(top_common_words)

print("\nTop 100 uncommon Words:")
print(top_uncommon_words)

#**2. Preprocessing the Data for Neural Networks**

In [None]:
# Before we import the data, we are setting parameters for data preprocessing
max_features = 5000  # Consider only the top 5,000 words in the dataset
maxlen = 100  # Limit the length of each review to 100 words
batch_size = 32

# Now we can load the dataset into training and test sets with these parameters already in place.
# That simplifies the preprocessing considerably.

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

# Pad (or truncate) every review to exactly maxlen tokens so the network receives inputs of uniform length
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
print(x_train[:5])
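
To see what the padding step does to a single review, here is a pure-Python sketch that mimics the default behavior of `pad_sequences` (`padding='pre'`, `truncating='pre'`): short reviews get zeros added at the front, and long reviews keep only their last `maxlen` tokens. The helper name `pad_or_truncate` is made up for this illustration.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Mimic pad_sequences defaults: pad at the front, truncate from the front."""
    kept = list(seq)[-maxlen:]                 # keep only the last maxlen tokens
    padding = [value] * (maxlen - len(kept))   # zeros go in front of short reviews
    return padding + kept


short_review = [11, 12, 13]
long_review = [11, 12, 13, 14, 15, 16]

padded_short = pad_or_truncate(short_review, maxlen=5)
padded_long = pad_or_truncate(long_review, maxlen=4)

print(padded_short)
print(padded_long)
```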

#**The Embedding Layer**

The Embedding layer in a neural network is responsible for representing words or entities as dense vectors in a continuous vector space. It is commonly used in natural language processing (NLP) tasks, where each word would otherwise be a sparse, high-dimensional one-hot vector; the embedding replaces it with a much shorter dense vector (64 values per word in this lab).

Here's how the Embedding layer works:

1. **Word Representation**: Each word in a vocabulary is assigned a unique index. For example, in a vocabulary of 10,000 words, each word would have an index ranging from 1 to 10,000.

2. **Embedding Matrix**: The Embedding layer initializes an embedding matrix, where each row corresponds to the vector representation of a word in the vocabulary. The size of this matrix is determined by the vocabulary size and the dimensionality of the embedding space.

3. **Vector Lookup**: During training, the Embedding layer converts each word index in the input sequence into its corresponding dense vector representation by performing a lookup operation in the embedding matrix. This operation effectively transforms each word index into a dense vector of fixed size.

4. **Learnable Parameters**: The parameters of the embedding matrix are learned during training through backpropagation. The network adjusts these parameters to minimize the loss function, effectively learning the most informative vector representations for the words in the vocabulary.

5. **Semantic Similarity**: The learned dense vectors capture semantic relationships between words. Words that are semantically similar are likely to have similar vector representations, as they often appear in similar contexts within the training data.

Overall, **the Embedding layer** plays a crucial role in converting discrete word indices into dense vector representations that capture semantic information, facilitating the learning process in tasks such as sentiment analysis, machine translation, and text generation.
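
The lookup described in steps 2 and 3 amounts to selecting rows from a matrix. The NumPy sketch below illustrates this with toy sizes (a vocabulary of 10 words and 4-dimensional vectors, both chosen only for readability); the random values stand in for weights that a real Embedding layer would learn during training.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 10      # toy vocabulary size (the lab uses max_features = 5000)
embedding_dim = 4    # toy vector length (the lab uses 64)

# One row per word index; random values stand in for learned weights
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# A batch of 2 "reviews", each 3 word indices long
word_indices = np.array([[0, 3, 7],
                         [7, 2, 0]])

# The lookup is plain row indexing: every index is replaced by its row
embedded = embedding_matrix[word_indices]

print(embedded.shape)  # (batch, sequence length, embedding_dim)
```

The same word index always maps to the same vector, wherever it appears in the batch.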

#**3. Simple Feed-Forward/Backpropagation Neural Network**
In this section, we will build a regular feed-forward neural network.

In [None]:
# Build and compile the feed-forward neural network model
print('Build and compile feed-forward model...')
FF_model = Sequential()
FF_model.add(Embedding(max_features, 64))
FF_model.add(Flatten())
FF_model.add(Dense(64, activation='relu'))
FF_model.add(Dense(1, activation='sigmoid'))

# Compile the model
FF_model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
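
As a small worked example of the `binary_crossentropy` loss chosen above, the sketch below computes the loss for a single prediction by hand. The function is a simplified illustration, not the Keras implementation; the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import math


def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Loss for one example: y_true is 0 or 1, p_pred is the sigmoid output."""
    p = min(max(p_pred, eps), 1 - eps)  # keep p strictly between 0 and 1
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# A positive review (label 1) predicted with high vs. low probability
loss_confident_right = binary_crossentropy(1, 0.9)
loss_confident_wrong = binary_crossentropy(1, 0.1)

print(round(loss_confident_right, 3))
print(round(loss_confident_wrong, 3))
```

A confident wrong prediction is penalized far more heavily than a confident correct one, which is what pushes the network toward well-calibrated outputs.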

In [None]:
print("Feed-Forward Neural Network Model Architecture:")
FF_model.summary()
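
As a cross-check of the summary, the parameter counts can be worked out by hand. The sketch below assumes the settings used in this lab (`max_features = 5000`, `maxlen = 100`, 64-dimensional embeddings, 64 hidden units) and the standard formula for a dense layer (inputs × units + one bias per unit).

```python
max_features = 5000   # vocabulary size
maxlen = 100          # tokens per review after padding
embedding_dim = 64
hidden_units = 64

embedding_params = max_features * embedding_dim            # one vector per word
flatten_outputs = maxlen * embedding_dim                   # Flatten has no weights
dense_hidden_params = flatten_outputs * hidden_units + hidden_units
dense_output_params = hidden_units * 1 + 1

total_params = embedding_params + dense_hidden_params + dense_output_params

print(embedding_params, dense_hidden_params, dense_output_params, total_params)
```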

In [None]:
# Train the feed-forward neural network model

print('Train feed-forward model...')
FF_model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=10,
          validation_data=(x_test, y_test))

In [None]:
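# Evaluate the feed-forward neural network model
# evaluate() returns [loss, accuracy] because the model was compiled with metrics=['accuracy']
print('Evaluate feed-forward model...')
FF_loss, FF_acc = FF_model.evaluate(x_test, y_test,
                                    batch_size=batch_size)
print('Validation loss:', FF_loss)
print('Validation accuracy:', FF_acc)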

#**4. Recurrent Neural Network Model**

In [None]:
# Build and compile the recurrent neural network model
print('Build and compile recurrent neural network model...')
RNN_model = Sequential()
RNN_model.add(Embedding(max_features, 64))
# RNN_model.add(SimpleRNN(64, dropout=0.2, recurrent_dropout=0.2))
RNN_model.add(SimpleRNN(64))
RNN_model.add(Dense(1, activation='sigmoid'))

# Compile the model
RNN_model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [None]:
print("Recurrent Neural Network Model Architecture:")
RNN_model.summary()
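
To make the recurrence inside the `SimpleRNN(64)` layer concrete, here is a NumPy sketch of the standard simple-RNN update, h_t = tanh(x_t·W + h_(t-1)·U + b), applied to one toy review. The random weights and the 5-step input are assumptions for illustration only; a trained layer would hold learned values.

```python
import numpy as np

rng = np.random.default_rng(0)

timesteps, input_dim, units = 5, 64, 64

# One embedded review: one 64-dimensional vector per word (toy values)
x = rng.normal(size=(timesteps, input_dim))

W = rng.normal(scale=0.1, size=(input_dim, units))  # input-to-hidden weights
U = rng.normal(scale=0.1, size=(units, units))      # hidden-to-hidden weights
b = np.zeros(units)                                 # bias

h = np.zeros(units)  # hidden state starts at zero
for x_t in x:
    # The new state mixes the current word with the previous state
    h = np.tanh(x_t @ W + h @ U + b)

n_params = W.size + U.size + b.size
print(h.shape, n_params)
```

Only the final state `h` is passed on to the Dense output layer, and `n_params` should match the parameter count the summary reports for the SimpleRNN layer.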

In [None]:
# Train the recurrent neural network model
print('Train recurrent neural network model ...')
RNN_model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=10,
          validation_data=(x_test, y_test))

In [None]:
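# Evaluate the model
# evaluate() returns [loss, accuracy] because the model was compiled with metrics=['accuracy']
print('Evaluate recurrent neural network model...')
RNN_loss, RNN_acc = RNN_model.evaluate(x_test, y_test,
                                       batch_size=batch_size)
print('Validation loss:', RNN_loss)
print('Validation accuracy:', RNN_acc)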

#**5. LSTM Model**

In [None]:
# Build and compile the LSTM model
print('Build and compile LSTM model...')
LSTM_model = Sequential()
LSTM_model.add(Embedding(max_features, 64))
#LSTM_model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
LSTM_model.add(LSTM(64))
LSTM_model.add(Dense(1, activation='sigmoid'))

# Compile the model
LSTM_model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [None]:
print("LSTM Network Model Architecture:")
LSTM_model.summary()
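
An LSTM layer holds four sets of weights (three gates plus the candidate cell state), each the size of a simple RNN's weights. Assuming the Keras default settings (bias enabled) and the sizes used in this lab, the counts in the summary can be checked by hand as follows.

```python
max_features = 5000
embedding_dim = 64
units = 64

# Weights of one simple recurrent block: input weights + recurrent weights + bias
simple_rnn_params = units * (embedding_dim + units + 1)

# An LSTM has four such blocks: input gate, forget gate, output gate, cell candidate
lstm_params = 4 * simple_rnn_params

embedding_params = max_features * embedding_dim
dense_output_params = units * 1 + 1

total_params = embedding_params + lstm_params + dense_output_params

print(simple_rnn_params, lstm_params, total_params)
```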

In [None]:
# Train the model
print('Train LSTM model...')
LSTM_model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=10,
          validation_data=(x_test, y_test))

In [None]:
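# Evaluate the model
# evaluate() returns [loss, accuracy] because the model was compiled with metrics=['accuracy']
print('Evaluate LSTM model...')
LSTM_loss, LSTM_acc = LSTM_model.evaluate(x_test, y_test,
                                          batch_size=batch_size)
print('Validation loss:', LSTM_loss)
print('Validation accuracy:', LSTM_acc)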

#**EXAMPLE CLASSIFICATION**
Replace the text with the text you are given in the exam.

In [None]:
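# Encode the input text with the IMDB word index so the integers match the training data.
# (An unfitted Tokenizer has an empty vocabulary and would return an empty sequence.)
text = "[use the text from the exam]"
word_index = imdb.get_word_index()

# imdb.load_data() shifts every index by 3 (0 = padding, 1 = start of review, 2 = unknown word)
# and replaces words outside the top max_features with 2.
encoded = [1]
for word in text.lower().split():
    index = word_index.get(word.strip('.,!?;:"()'), -1) + 3
    encoded.append(index if 3 <= index < max_features else 2)

# Pad the sequence
maxlen = 100  # Same length as the input expected by the model
padded_sequences = pad_sequences([encoded], maxlen=maxlen)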

##**PREDICT THE SENTIMENT**



In [None]:
prediction = model.predict(np.array(padded_sequences))  # replace model with the name of the model you are using (e.g. LSTM_model)

# Display the sentiment prediction (the sigmoid output is the probability of a positive review)
if prediction[0][0] < 0.5:
    print("Negative Sentiment")
else:
    print("Positive Sentiment")
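
The 0.5 cut-off works because the output layer uses a sigmoid, which squashes any raw score into a probability between 0 and 1. The sketch below illustrates the mapping with three made-up raw scores; `sentiment_label` is a hypothetical helper that mirrors the `if`/`else` above.

```python
import math


def sigmoid(z):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sentiment_label(probability, threshold=0.5):
    """Same rule as the cell above: below the threshold is negative."""
    if probability < threshold:
        return "Negative Sentiment"
    return "Positive Sentiment"


# Made-up raw scores: clearly negative, undecided, clearly positive
labels = []
for raw_score in (-2.0, 0.0, 2.0):
    probability = sigmoid(raw_score)
    labels.append(sentiment_label(probability))
    print(raw_score, round(probability, 3), labels[-1])
```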

#**Extra Credit Question (5 points)**
To improve on the top 100 word list you are seeing in section 1 of this file, you could implement a stop word list. A stop word list filters out very common words (like "and," "it," "but," etc.) so that they no longer appear.

Here is a code snippet for implementing stop words:

```
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

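# Download NLTK resources (only required once)
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for this tokenizer resource instead of 'punkt'
nltk.download('stopwords')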

# Example sentence
sentence = "This is an example sentence that contains stop words."

# Tokenize the sentence
tokens = word_tokenize(sentence)

# Load stop words
stop_words = set(stopwords.words('english'))

# Filter out stop words
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print("Original Tokens:", tokens)
print("Filtered Tokens:", filtered_tokens)
```
Use this code snippet as the basis to improve the quality of the data by filtering out the words that don't carry much meaning for our sentiment analysis.



