# 1 Sentiment analysis

## Loading data and preliminary analysis

### Packages loading

In [None]:
!pip install contractions

Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.0.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (110 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.2 contractions-0.1.73 pyahocorasick-2.0.0 textsearch-0.0.24


In [None]:
import nltk
from nltk.corpus import stopwords
import string
import re
import pandas as pd
import contractions
import numpy as np
from sklearn.model_selection import train_test_split
from bs4 import BeautifulSoup
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, Flatten, LSTM, Dense, Dropout, GlobalAveragePooling1D
from keras.callbacks import EarlyStopping
from sklearn.metrics import confusion_matrix, f1_score, classification_report
from keras.optimizers import Adam

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Because the work was carried out on Google Colab, we mounted a Google Drive folder containing the IMDB dataset.

### Loading the data set

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Specify the path to the CSV file in your Google Drive
file_path = '/content/drive/MyDrive/Dataset Text Mining/IMDB Dataset.csv'


In [None]:
# Load the CSV file into a DataFrame
data = pd.read_csv(file_path)

# Display the summary of the dataset
print(data.describe())

                                                   review sentiment
count                                               50000     50000
unique                                              49582         2
top     Loved today's show!!! It was a variety and not...  positive
freq                                                    5     25000


There are 25,000 observations for each of the positive and negative sentiment classes, so the dataset is balanced and we did not consider oversampling or undersampling techniques.

In [None]:
#sentiment count
data['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

In [None]:
data.head()

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


Inspecting the first few observations shows HTML tags and formatting such as the `<br />` tag. We therefore removed these tags first.


## Pre-Processing


### Removing HTML instructions


These tags must be removed before the rest of the text pre-processing. Otherwise, the standard punctuation-removal step would strip only the '<' and '>' characters and leave behind meaningless letters such as "br".
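To make the concern concrete, the following minimal sketch compares the two orders on a sample string modelled on the second review. The regular expression is only a simplified stand-in for an HTML parser, used here for illustration.

```python
import re
import string

sample = "A wonderful little production. <br /><br />The filming technique is very unassuming."

# Stripping punctuation alone removes '<', '/' and '>' but keeps the letters "br"
punct_only = sample.translate(str.maketrans("", "", string.punctuation))

# Removing whole tags first (simplified regex stand-in for an HTML parser) avoids the leftovers
tags_removed = re.sub(r"<[^>]+>", " ", sample)

print(punct_only)
print(tags_removed)
```

In the first result the stray tokens "br br" remain in the text and would be treated as ordinary words by the tokenizer.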

In [None]:
# Check for common HTML tags or entities in the entire dataset
found_tags = set()

for review in data['review'].values:
    # Check for the presence of common HTML tags/entities
    for tag in ['<br />', '<p>', '<a>', '<strong>', '<em>', '&nbsp;', '<h1>', '<h2>', '<h3>', '<h4>', '<h5>', '<h6>',
                '<ul>', '<ol>', '<li>', '<blockquote>', '<code>', '<img>', '<div>', '<span>']:
        if tag in review:
            found_tags.add(tag)

# Print the found HTML tags/entities
if found_tags:
    print("Found HTML tags/entities:")
    for tag in found_tags:
        print(tag)


Found HTML tags/entities:
<br />
<em>
<p>


In [None]:
#Removing the html strips
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text(separator=' ')

# Apply the function to the "review" column and update the column
processed_data = data.copy()
processed_data["review"] = data["review"].apply(strip_html)




In [None]:
# Check for common HTML tags or entities in the entire dataset
found_tags = set()

for review in processed_data['review'].values:
    # Check for the presence of common HTML tags/entities
    for tag in ['<br />', '<p>', '<a>', '<strong>', '<em>', '&nbsp;', '<h1>', '<h2>', '<h3>', '<h4>', '<h5>', '<h6>',
                '<ul>', '<ol>', '<li>', '<blockquote>', '<code>', '<img>', '<div>', '<span>']:
        if tag in review:
            found_tags.add(tag)

# Print the found HTML tags/entities
if found_tags:
    print("Found HTML tags/entities:")
    for tag in found_tags:
        print(tag)


In [None]:
processed_data.head()

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. The filming te...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


All the HTML tags have been removed and replaced with spaces, as the second review shows.

### Rest of the Pre-Processing

- The preprocessing function we used comprises the following steps:

  1. Remove URLs
  
  2. Lowercase the text
  
  3. Remove punctuations
  
  4. Expand contractions and split words
  
  5. Remove stopwords
  
  6. Join words back into a single string

- The above function was applied to the "review" column with the review text.
- To expand contractions, e.g. “we’re” into “we are”, we used the dictionary-based “contractions” package. The dictionary of converted strings is available at https://github.com/kootenpv/contractions/blob/master/contractions/data/contractions_dict.json and its surrounding files.
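As an illustration of what "dictionary-based" means here, the sketch below expands contractions with a tiny hand-written lookup table. The names `TOY_CONTRACTIONS` and `expand_toy` are hypothetical, and this is not the package's actual implementation, whose dictionary is far larger.

```python
import re

# Hypothetical miniature dictionary, for illustration only
TOY_CONTRACTIONS = {"we're": "we are", "don't": "do not", "it's": "it is"}

# One alternation pattern that matches any key as a whole word
_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(key) for key in TOY_CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_toy(text):
    # Normalise curly apostrophes so that “we’re” matches the dictionary key
    text = text.replace("’", "'")
    return _pattern.sub(lambda m: TOY_CONTRACTIONS[m.group(0).lower()], text)

expanded = expand_toy("We’re sure it's fine, don't worry")
print(expanded)
```

A lookup of this kind relies on the apostrophe being present in the text, which is worth keeping in mind when ordering the preprocessing steps.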


In [None]:
# Define stopwords
stop_words = set(stopwords.words('english'))

def preprocessing_text(text):
    # Get rid of URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Lowercase the text
    text = text.lower()

    # Removing punctuations using replace() method
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')

    # Expand contractions and split
    words = contractions.fix(text).split()

    # Remove stopwords
    words = [word for word in words if word not in stop_words]

    # Join the words back into a single string
    text = ' '.join(words)

    return text

# Apply the function to the "review" column and update the column
processed_data["review"] = processed_data["review"].apply(preprocessing_text)


- **Dataset Splitting:** The dataset is divided into training and testing sets.

  - `X_train` and `y_train` are utilized for model training.
  - `X_test` and `y_test` are employed to assess the model's performance.

  Parameters:
  - `test_size=0.2`: Allocates 20% of the data for testing.
  - `shuffle=True`: Ensures that the data points are randomly shuffled. This is turned on by default.
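The idea behind a shuffled 80/20 split can be sketched in plain Python as follows. The function `shuffled_split` and the toy data are hypothetical and only mirror the behaviour described above; this is not scikit-learn's implementation.

```python
import random

def shuffled_split(items, labels, test_size=0.2, seed=0):
    """Shuffle the indices, then hold out the last `test_size` fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(items) * test_size))
    train_idx, test_idx = indices[:-n_test], indices[-n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return pick(items, train_idx), pick(items, test_idx), pick(labels, train_idx), pick(labels, test_idx)

# Toy data standing in for the review and sentiment columns
reviews = [f"review {i}" for i in range(10)]
sentiments = ["positive" if i % 2 == 0 else "negative" for i in range(10)]

X_tr, X_te, y_tr, y_te = shuffled_split(reviews, sentiments)
print(len(X_tr), len(X_te))
```

Because reviews and labels are selected with the same shuffled indices, each review stays paired with its own label.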


In [None]:
X_train, X_test, y_train, y_test = train_test_split(processed_data["review"], processed_data["sentiment"], test_size=0.2, shuffle=True)

Several parameters control how the text is prepared before it is fed to the neural network:

- **vocab_size = 10000:** Sets the max number of unique words in the vocabulary to 10,000.

- **max_length = 1500:** Defines the maximum sequence length. Texts longer than 1500 tokens are truncated, while shorter ones are padded. This value was selected through hyperparameter tuning.

- **trunc_type = 'post':** When truncation is needed, it occurs at the end of the text.

- **padding_type = 'post':** Padding is applied at the end of texts shorter than `max_length`.

- **oov_tok = `'<OOV>'`:** The out-of-vocabulary token, which stands in for any word that is not in the tokenizer's vocabulary.


In [None]:
vocab_size = 10000
max_length = 1500

trunc_type = 'post'
padding_type = 'post'
oov_tok = '<OOV>'

**Tokenizer Definition:**

We create a tokenizer, a tool that converts text into a numerical format suitable for neural networks. The parameter `num_words=vocab_size` restricts the tokenizer to the most frequent words, up to the limit of 10,000 indices set by `vocab_size`. Words outside this vocabulary are mapped to the special token `<OOV>`, which is set with `oov_token=oov_tok`.

**Fitting on Texts:**

We 'fit' the tokenizer to the training text data (`X_train`). This process involves analyzing the text, building the vocabulary, and assigning numerical values to words.
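The following pure-Python sketch shows the general idea of a frequency-ranked vocabulary with an out-of-vocabulary token. The helpers `fit_vocab` and `to_sequence` are hypothetical; the indexing convention (0 reserved for padding, 1 for the OOV token, words with an index at or above `num_words` mapped to OOV) is assumed to mirror the Keras tokenizer.

```python
from collections import Counter

def fit_vocab(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 0 is left for padding, index 1 for OOV."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    """Replace each word by its index, or by the OOV index if unknown or too rare."""
    oov = word_index[oov_token]
    sequence = []
    for word in text.split():
        idx = word_index.get(word)
        sequence.append(idx if idx is not None and idx < num_words else oov)
    return sequence

word_index = fit_vocab(["great film great cast", "bad film"])
sequence = to_sequence("great unknown cast", word_index, num_words=4)
print(word_index)
print(sequence)
```

Here "unknown" was never seen during fitting and "cast" falls outside the `num_words` limit, so both are mapped to the OOV index.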


In [None]:
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(X_train)


**Tokenization:**

- `X_train = tokenizer.texts_to_sequences(X_train)`: Converts the text in the training set (`X_train`) into sequences of numerical values using the earlier-defined tokenizer.
  
- `X_test = tokenizer.texts_to_sequences(X_test)`: Similarly, applies tokenization to the test set (`X_test`).

**Padding:**

- `X_train = pad_sequences(X_train, maxlen=max_length, padding=padding_type, truncating=trunc_type)`: Ensures all sequences in the training set have the same length by padding or truncating as needed, resulting in sequences whose length equals the `max_length` variable defined above (1500 tokens).

- `X_test = pad_sequences(X_test, maxlen=max_length, padding=padding_type, truncating=trunc_type)`: Analogously applies padding to the test set.
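The effect of 'post' truncation and 'post' padding can be sketched on toy sequences as follows. The helper `pad_or_truncate_post` is hypothetical and uses a `maxlen` of 5 instead of 1500 for readability.

```python
def pad_or_truncate_post(sequence, maxlen, pad_value=0):
    """Cut tokens beyond `maxlen` from the end, then append padding at the end."""
    truncated = sequence[:maxlen]
    return truncated + [pad_value] * (maxlen - len(truncated))

short_review = [12, 7, 93]
long_review = [5, 8, 2, 44, 19, 3, 71]

padded = pad_or_truncate_post(short_review, maxlen=5)
cut = pad_or_truncate_post(long_review, maxlen=5)
print(padded)
print(cut)
```

Both results have exactly five entries, so they can be stacked into a single rectangular array for the network.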


In [None]:
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

X_train = pad_sequences(X_train, maxlen=max_length,
                         padding=padding_type,
                         truncating=trunc_type)
X_test = pad_sequences(X_test, maxlen=max_length,
                         padding=padding_type,
                         truncating=trunc_type)

# Initialize the label encoder
label_encoder = LabelEncoder()

# Encode the string labels as integers for compatibility with neural networks
y_train_encoded = label_encoder.fit_transform(y_train)
# Reuse the mapping fitted on the training labels instead of refitting on the test labels
y_test_encoded = label_encoder.transform(y_test)

# Now y_train_encoded and y_test_encoded contain numerical labels
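The encoding itself amounts to a fixed mapping from class name to integer. A minimal sketch, assuming the classes are sorted alphabetically so that "negative" becomes 0 and "positive" becomes 1 (the helper names are hypothetical):

```python
def fit_label_mapping(labels):
    """Build a class-to-integer mapping from the sorted unique labels."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}

def encode(labels, mapping):
    """Apply an already fitted mapping without changing it."""
    return [mapping[label] for label in labels]

train_labels = ["positive", "negative", "positive"]
test_labels = ["negative", "negative"]

mapping = fit_label_mapping(train_labels)
train_codes = encode(train_labels, mapping)
test_codes = encode(test_labels, mapping)
print(mapping, train_codes, test_codes)
```

Fitting once on the training labels and reusing the mapping for the test labels guarantees that both sets use the same codes.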

## Build and Train the Feed-forward Neural Network



- Using a validation set, we tuned several hyperparameters: `max_length`, the number of neurons (`neurons_multiplier_values`), the dropout rate (`dropout_value`), and the optimizer's learning rate (`learning_rate_values`).
- A lower learning rate was found to give higher validation accuracy.
- We chose the Adam optimizer after also trying Stochastic Gradient Descent and Adagrad.

### Feed-forward Neural Network Architecture

The architecture was built with the Keras package, as detailed below:

- **Embedding Layer:** Converts input text to 100-dimensional vectors.
- **Average Pooling Layer:** Global Average Pooling is applied to the output of the Embedding layer. This layer averages the values along the sequence dimension.
- **Fully Connected Layers:** Utilize ReLU activation with dropout for regularization.
  - Dense layer with 64 units.
  - Dropout with a rate of 0.2.
  - Dense layer with 32 units.
  - Another Dropout with a rate of 0.2.
- **Output Layer:** Sigmoid activation for binary classification.
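To illustrate the pooling step in the list above, the sketch below averages a toy embedded review along the sequence dimension. It uses 3 tokens with 2-dimensional vectors instead of 1500 tokens with 100 dimensions, and the helper `global_average_pool` is hypothetical.

```python
def global_average_pool(sequence_of_vectors):
    """Average the vectors position by position, giving one vector per review."""
    n_steps = len(sequence_of_vectors)
    n_dims = len(sequence_of_vectors[0])
    return [sum(vec[d] for vec in sequence_of_vectors) / n_steps for d in range(n_dims)]

# One review: 3 tokens, each embedded as a 2-dimensional vector
embedded_review = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]

pooled = global_average_pool(embedded_review)
print(pooled)
```

The output has one value per embedding dimension regardless of the review length, which is what lets the dense layers operate on a fixed-size input.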

**Training Configuration:**
- Loss Function: Binary Crossentropy.
- Optimizer: Adam.

**Additional Technique:**
- Early stopping is applied with a patience of 3 to prevent overfitting during training.
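The patience rule can be sketched as follows: training stops once the validation loss has failed to improve for 3 consecutive epochs. The function `early_stopping_epoch` and the loss values are hypothetical and serve only to illustrate the logic.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0  # improvement: reset the counter
        else:
            wait += 1             # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)

# Hypothetical validation losses: the best value is reached at epoch 3
stop_epoch = early_stopping_epoch([0.60, 0.50, 0.45, 0.46, 0.47, 0.48, 0.40])
print(stop_epoch)
```

In this toy run training stops at epoch 6, so the later, lower loss at epoch 7 is never reached; this is the trade-off a small patience value makes.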

**Batch Size Selection:**

- A batch size of 250 is chosen, representing the number of training examples processed in each iteration, to find a balance between computational efficiency and memory constraints.
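As a worked example of what this batch size implies, the number of weight updates per epoch can be computed from the dataset size. The calculation assumes the 50,000 reviews, the 20% test split, and a further 20% validation split taken from the training data during fitting.

```python
import math

n_reviews = 50_000
batch_size = 250

n_test = n_reviews * 20 // 100           # 20% held out for testing
n_train_full = n_reviews - n_test        # remaining training pool
n_validation = n_train_full * 20 // 100  # 20% of the pool used for validation
n_fit = n_train_full - n_validation      # examples actually used for weight updates

steps_per_epoch = math.ceil(n_fit / batch_size)
print(n_test, n_validation, n_fit, steps_per_epoch)
```

Under these assumptions each epoch consists of 128 batches of 250 reviews.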

In [None]:
def build_fnn_pooled_model():
    model = Sequential()
    model.add(Embedding(vocab_size, 100, input_length=max_length))

    model.add(GlobalAveragePooling1D())  # Apply mean pooling

    # Fully connected layers
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.2))

    model.add(Dense(32, activation='relu'))
    model.add(Dropout(0.2))

    # Output layer for binary classification
    model.add(Dense(1, activation='sigmoid'))

    model.compile(loss='binary_crossentropy', optimizer = Adam(learning_rate=0.0001), metrics=['accuracy'])

    early_stopping = EarlyStopping(monitor='val_loss', patience=3)  # Apply early stopping

    return model, early_stopping

In [None]:
model, early_stopping = build_fnn_pooled_model()

model.summary()

Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_8 (Embedding)     (None, 1500, 100)         1000000   
                                                                 
 global_average_pooling1d_8  (None, 100)               0         
  (GlobalAveragePooling1D)                                       
                                                                 
 dense_24 (Dense)            (None, 64)                6464      
                                                                 
 dropout_16 (Dropout)        (None, 64)                0         
                                                                 
 dense_25 (Dense)            (None, 32)                2080      
                                                                 
 dropout_17 (Dropout)        (None, 32)                0         
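The parameter counts in the summary can be reproduced by hand from the model definition. The sketch below does so; the figures for the output layer and the total are derived from the architecture rather than read from the summary.

```python
vocab_size, embedding_dim = 10_000, 100

def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units

embedding_params = vocab_size * embedding_dim   # one vector per vocabulary index
dense_64_params = dense_params(embedding_dim, 64)
dense_32_params = dense_params(64, 32)
output_params = dense_params(32, 1)

# Pooling and dropout layers have no trainable parameters
total_params = embedding_params + dense_64_params + dense_32_params + output_params
print(embedding_params, dense_64_params, dense_32_params, output_params, total_params)
```

Almost all of the parameters sit in the embedding layer, so `vocab_size` and the embedding dimension dominate the model size.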
                                                      

In [None]:
# Batch size value
batch_size = 250

# Train the model
history = model.fit(X_train, y_train_encoded, epochs=200, batch_size=batch_size, validation_split=0.2, callbacks=[early_stopping])

Epoch 1/200
...
Epoch 42/200


Training reached a final validation accuracy of 90.06% after 42 epochs, at which point early stopping (with a patience of 3) ended the training process. We can now evaluate the trained network on the test set.

In [None]:
score = model.evaluate(np.asarray(X_test),np.asarray(y_test_encoded))



We obtained a final test accuracy of 89.44%. This is only slightly lower than the validation accuracy, indicating that the model generalizes well to unseen data and does not exhibit significant overfitting.

In [None]:
# Make predictions using the trained model
predictions = model.predict(np.asarray(X_test))

# Set a threshold (adjust as needed)
threshold = 0.5

# Convert the predicted probabilities to class labels using the threshold
predicted_labels = (predictions > threshold).astype(int)

# Compute the confusion matrix
conf_matrix = confusion_matrix(y_test_encoded, predicted_labels)

# Compute the F1 score
f1 = f1_score(y_test_encoded, predicted_labels, average='weighted')

# Generate the classification report
class_report = classification_report(y_test_encoded, predicted_labels, target_names=label_encoder.classes_)

print("Confusion Matrix:")
print(conf_matrix)
print("\nF1 Score:", round(f1, 5))
print("\nClassification Report:")
print(class_report)

Confusion Matrix:
[[4468  591]
 [ 465 4476]]

F1 Score: 0.8944

Classification Report:
              precision    recall  f1-score   support

    negative       0.91      0.88      0.89      5059
    positive       0.88      0.91      0.89      4941

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000




The confusion matrix indicates the model's performance, showing that it correctly classified 4468 negative instances (True Negatives) and 4476 positive instances (True Positives), with 591 False Positives and 465 False Negatives.

The weighted F1 score of 0.8944 suggests a good balance between precision and recall, meaning the model keeps both false positives and false negatives low.
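The reported figures can be re-derived from the four counts in the confusion matrix. The sketch below does this in plain Python; the helper `f1` is hypothetical, and the weighting by class support is assumed to match the `average='weighted'` setting.

```python
# Counts taken from the confusion matrix (rows: true class, columns: predicted class)
tn, fp = 4468, 591
fn, tp = 465, 4476

def f1(hits, false_alarms, misses):
    """Harmonic mean of precision and recall for one class."""
    precision = hits / (hits + false_alarms)
    recall = hits / (hits + misses)
    return 2 * precision * recall / (precision + recall)

accuracy = (tn + tp) / (tn + fp + fn + tp)
f1_positive = f1(tp, fp, fn)
# For the negative class the roles swap: TN are its hits, FN its false alarms, FP its misses
f1_negative = f1(tn, fn, fp)

# Weight each class F1 by the number of true instances of that class
support_negative, support_positive = tn + fp, fn + tp
weighted_f1 = (support_negative * f1_negative + support_positive * f1_positive) / (
    support_negative + support_positive
)
print(round(accuracy, 4), round(weighted_f1, 4))
```

Because the two classes are nearly balanced and their F1 scores are almost identical, the weighted F1 and the accuracy coincide to four decimal places.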

## Investigating incorrectly classified reviews

To investigate what causes the model to misclassify some of the reviews, we print 5 randomly selected misclassified reviews (where 1 stands for positive and 0 for negative).

In [None]:
# Flatten the (n, 1) prediction array to shape (n,)
predicted_labels_flattened = predicted_labels.ravel()

# Positions (within the test set) of the incorrectly classified instances
incorrectly_classified_indices = np.where(y_test_encoded != predicted_labels_flattened)[0]

# Randomly select up to 5 positions from the incorrect ones
random_indices = np.random.choice(incorrectly_classified_indices, size=min(5, len(incorrectly_classified_indices)), replace=False)

# Print information about the selected instances
for position_in_y_test_encoded in random_indices:
    # y_test keeps the original DataFrame index labels, so a position maps straight back to a row of `data`
    original_index = y_test.index[position_in_y_test_encoded]
    true_label = y_test_encoded[position_in_y_test_encoded]
    predicted_label = predicted_labels_flattened[position_in_y_test_encoded]
    # Look the review up by index label (.loc), not by position
    review_text = data['review'].loc[original_index]

    print(f"Original Index: {original_index}, Position in y_test_encoded: {position_in_y_test_encoded}, True Label: {true_label}, Predicted Label: {predicted_label}, Review: {review_text}")



Original Index: 37697, Position in y_test_encoded: 1316, True Label: 0, Predicted Label: 1, Review: Tom Cutler (Jackson) is a retired policeman who now works as a crime scene Cleaner-upper. In his latest job, he cleans a new crime scene and destroys evidence and isn't aware the crime hasn't been officially reported. Uh oh, this can't be good. <br /><br />You hear about Cleaners all the time, but usually when a mob or gangland hit is involved when bodies etc need to be removed and the area cleaned up. This one is different in that Tom Cutler works with the Police to clean up after the police have done their investigation of a crime scene. Hey, someone has to do it. You know the Police won't. The movie makes that quite clear and it is up to the family to get the area cleaned up. <br /><br />This is almost a good thriller, but a side plot involving Tom's daughter (Palmer) makes this story somewhat awkward. I guess they had to fill in some time. Oh, they brought this side plot around to co

The printed reviews suggest that it could be challenging for the model to interpret mixed sentiments or subtle distinctions like "almost a good thriller" (repeated many times in the first review) or "not as bad as one review suggested" (second review).

Additionally, the fourth review highlights the complexity of reviews where an overall negative sentiment ("I have to boo this movie", "Not a fun movie to sit through") coexists with positive sentiments towards specific aspects ("I love Ellen Barkin", "if you like Ellen Barkin it's nice to see her place a tough lady"), making classification difficult for the model.

Due to the complexity of neural networks and their "black-box" nature, a complete understanding of these errors may not be possible; we can only hypothesize about potential causes.