# EECS 487 Final Project: Stance Detection in Satire

Anurag Renduchintala, Yoojin Bae, Karl Yan

Run the following cell to mount Google Drive. The code is disabled by default; remove the surrounding triple quotes to enable it.

In [None]:
'''from google.colab import drive
drive.mount('/content/drive')'''

Run the following cell to install the necessary modules.

In [None]:
!pip install portalocker
!pip install transformers

Run the following if using Google Drive.

In [None]:
import os
import sys

'''# TODO: Change this to the path to your homework folder
GOOGLE_DRIVE_PATH = '/content/drive/MyDrive/EECS 487/Homework_3'
print(os.listdir(GOOGLE_DRIVE_PATH))
os.chdir(GOOGLE_DRIVE_PATH)'''

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

### Colab GPU Resources

Check if GPU resources are available. If device = 'cpu', in the toolbar, click Runtime -> Change runtime type -> select GPU as the hardware accelerator.

**Important:**

Google Colab imposes a **dynamic GPU usage limit** that depends on how much/long you use Colab. This is to keep Colab free for everyone. You can read about it [here](https://stackoverflow.com/questions/61126851/how-can-i-use-gpu-on-google-colab-after-exceeding-usage-limit). That being said, you should be able to complete this assignment without reaching your usage limit. You are **not** expected to spend your own money on Colab's paid GPU resources. In the event that you have run out of GPU resources, you would have to wait for resources or use a different Google account.

Here are some tips to conserve your GPU usage:


*   Change your runtime to GPU only when you are working on parts that require a GPU.
*   When spending long intervals on coding/taking a break, remember to disconnect your runtime.



In [None]:
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

In [None]:
# Install required packages
import nltk
nltk.download('punkt')
nltk.download('stopwords')
!pip install readability

Run the following cell to load the autoreload extension so that functions in python files will be re-imported into the notebook every time we run them. We also need to import all necessary packages.

In [None]:
%load_ext autoreload
%autoreload 2

import os
import json

import numpy as np
from torch.utils.data import DataLoader

# Main Task: Satire Detection

Prepare the data by importing the necessary modules and loading the compiled final data frame. Lowercase all headlines, tokenize them, and check that the two classes are balanced.

In [None]:
# clean the raw datasets, putting together only the important columns (text and satire)
from prepare_data import ALL_DATA
# make a copy of this df. We don't want to modify the actual data.
satire_data = ALL_DATA.copy()
# lowercase all the headlines
satire_data['text'] = satire_data['text'].str.lower()
# tokenize all the text
satire_data['tokenized_text'] = satire_data['text'].apply(lambda x: nltk.word_tokenize(x))
# verify that our data is balanced
print(len(satire_data))
print(len(satire_data[satire_data["satire"] == 0]))
print(len(satire_data[satire_data["satire"] == 1]))
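
The balance check above prints raw counts; it can also help to look at each class's share of the data. The following is a minimal sketch using only the standard library. The `labels` list is a hypothetical stand-in for `satire_data["satire"]`, not the project's data.

```python
from collections import Counter

# Hypothetical toy labels standing in for satire_data["satire"]
# (1 = satire, 0 = not satire)
labels = [0, 1, 1, 0, 1, 0, 0, 1]

counts = Counter(labels)
total = len(labels)
# Fraction of rows belonging to each class
shares = {label: count / total for label, count in counts.items()}

for label in sorted(shares):
    print(f"label {label}: {counts[label]} rows ({shares[label]:.1%})")
```

On the real data frame, `satire_data["satire"].value_counts(normalize=True)` should report the same shares in one call.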

Split the data into training and testing sets, stratifying on the satire label.

In [None]:
from sklearn.model_selection import train_test_split
# satire_data.to_csv("data_20231204.csv")  # ignore this line
# satire_data = pd.read_csv("data_20231204.csv")  # uncomment this line to reproduce the results obtained on 12/4
# Stratify on the label so both sets keep the same class ratio.
# Note: no random_state is set, so the split itself changes between runs.
X_train, X_test, y_train, y_test = train_test_split(
    satire_data["text"], satire_data["satire"], stratify=satire_data["satire"]
)
print(X_train.head(9))
print(len(X_train))
print(len(X_test))
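
If a repeatable split is wanted, one option is to pass `random_state` to `train_test_split`. The sketch below uses hypothetical toy data in place of `satire_data`, and the seed value `487` is an arbitrary choice made for illustration, not something taken from the project.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for satire_data["text"] and satire_data["satire"]
texts = [f"headline {i}" for i in range(40)]
labels = [i % 2 for i in range(40)]  # perfectly balanced: 20 of each class

def split(seed):
    # test_size defaults to 0.25 when neither test_size nor train_size is given
    return train_test_split(texts, labels, stratify=labels, random_state=seed)

X_train_a, X_test_a, y_train_a, y_test_a = split(487)
X_train_b, X_test_b, y_train_b, y_test_b = split(487)

print(len(X_train_a), len(X_test_a))   # sizes of the two sets
print(X_test_a == X_test_b)            # same seed, same split
print(sum(y_test_a), len(y_test_a))    # satire rows in the test set vs. its size
```

Because the split is stratified, the test set keeps the same class ratio as the full data, and reusing the seed returns the same rows each time.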

Get the BERT model preprocessor and encoder. Import necessary packages.

In [None]:
import tensorflow_hub as hub
import tensorflow_text as text
import tensorflow as tf

In [None]:
preprocess_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
encoder_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4'

In [None]:
bert_preprocess = hub.KerasLayer(preprocess_url)
bert_encoder = hub.KerasLayer(encoder_url)


Now create the BERT and neural network layers, then build the final model.

In [None]:
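# Initialize some hyperparameters first. Mess with these to see what you get.
learning_rate = 1e-2  # Adam step size
weight_decay = 1e-2   # coefficient of the L1/L2 penalty on the output layer's weights
batch_size = 64
reg = "l2"            # "l2" selects an L2 penalty; any other value selects L1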

In [None]:
# BERT Layers
input_ = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
preprocessed = bert_preprocess(input_)
output = bert_encoder(preprocessed)

# NN Layers: dropout followed by a single sigmoid unit, with an L2 or L1 weight penalty
if reg == "l2":
    regularizer = tf.keras.regularizers.l2(l2=weight_decay)
else:
    regularizer = tf.keras.regularizers.l1(l1=weight_decay)
MODEL = tf.keras.Sequential([
    tf.keras.layers.Dropout(0.1, name='dropout', input_shape=(output['pooled_output'].shape[1],)),
    tf.keras.layers.Dense(1, activation='sigmoid', name='output', kernel_regularizer=regularizer)
])

# Build the final model
model = tf.keras.Model(inputs=input_, outputs=MODEL(output['pooled_output']))

# Compile the model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
    loss="binary_crossentropy",
    metrics=[
        tf.keras.metrics.BinaryAccuracy(name='accuracy'),
        tf.keras.metrics.Precision(name="precision"),
        tf.keras.metrics.Recall(name="recall"),
    ],
)
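
To make the classification head concrete, here is a rough plain-Python sketch of what the sigmoid output unit and the L2-penalized loss compute for a single example. The three-element vectors are hypothetical toy values (the real pooled output has 768 dimensions), and the helper names are invented for illustration. Dropout is left out because it is only active during training.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def head_probability(pooled, weights, bias):
    """One sigmoid unit applied to a pooled sentence vector."""
    z = sum(w * x for w, x in zip(weights, pooled)) + bias
    return sigmoid(z)

def regularized_loss(prob, label, weights, l2):
    """Binary cross-entropy for one example plus an L2 penalty on the weights."""
    eps = 1e-7  # guards against log(0)
    p = min(max(prob, eps), 1.0 - eps)
    bce = -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
    # L2 penalty: coefficient times the sum of squared weights (bias excluded)
    penalty = l2 * sum(w * w for w in weights)
    return bce + penalty

# Hypothetical toy values, not taken from the trained model
pooled = [0.5, -1.0, 2.0]
weights = [0.2, 0.1, -0.3]

prob = head_probability(pooled, weights, bias=0.0)
loss = regularized_loss(prob, label=1, weights=weights, l2=1e-2)
print(f"probability of satire: {prob:.4f}, loss: {loss:.4f}")
```

A larger penalty coefficient pushes the weights toward zero, which is the lever `weight_decay` controls in the cell above.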

Train the model, plot the loss and accuracy curves, and evaluate on the test set.

In [None]:
# Train model. Graph validation loss and accuracy by epoch.
history = model.fit(X_train, y_train, epochs=10, batch_size=batch_size, validation_split=.2)
# Plot training and validation loss
plt.figure(figsize=(10, 5))
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()

# Plot training and validation accuracy
plt.figure(figsize=(10, 5))
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

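# Evaluate model; the returned values follow the order of the loss and the compiled metrics
test_loss, test_accuracy, test_precision, test_recall = model.evaluate(X_test, y_test)
print(f'Test Loss: {test_loss}, Test Accuracy: {test_accuracy}')
print(f'Test Precision: {test_precision}, Test Recall: {test_recall}')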

# Make some predictions on new data (optional)
# TODO: Write some code for this (later on)

Now, we will predict on unseen test data.

In [None]:
y_pred = model.predict(X_test)
y_pred = y_pred.flatten()
# The predictions are sigmoid scores between 0 and 1, so threshold them at 0.5 to get 0/1 labels.
y_pred = [1 if score > 0.5 else 0 for score in y_pred]
y_test_ = list(y_test)
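
The thresholding step can be pulled into a small helper so that other cutoffs are easy to try. This is a sketch with hypothetical scores; `to_labels` is a name introduced here for illustration and is not part of the project code.

```python
def to_labels(scores, threshold=0.5):
    """Map sigmoid scores in [0, 1] to hard 0/1 labels."""
    # A score must be strictly above the threshold to count as satire (1)
    return [1 if score > threshold else 0 for score in scores]

# Hypothetical scores standing in for model.predict(...).flatten()
scores = [0.91, 0.12, 0.50, 0.73]

print(to_labels(scores))                 # [1, 0, 0, 1]
print(to_labels(scores, threshold=0.8))  # [1, 0, 0, 0]
```

Raising the threshold trades recall for precision: fewer headlines get labeled as satire, but those that do carry higher scores.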

To assess the model, we compute a confusion matrix and visualize it as a heatmap.

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import classification_report

# Calculate confusion matrix and accuracy
cm = confusion_matrix(y_test_, y_pred)
acc = accuracy_score(y_test_, y_pred)
print("Accuracy: ", acc)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

# Print classification report
print("Classification Report:")
print(classification_report(y_test_, y_pred))
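
For reference, the numbers in the classification report can be derived by hand from the four cells of the confusion matrix. The sketch below does this in plain Python on hypothetical labels; `binary_metrics` is a helper invented for illustration. For labels 0 and 1, `confusion_matrix` lays the cells out as `[[TN, FP], [FN, TP]]`.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels, not the project's results
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision answers "of the headlines flagged as satire, how many really were?", while recall answers "of the satirical headlines, how many were caught?".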
