**Data Preparation:**

Loaded and cleaned the MC-TACO dataset, including renaming columns, converting the "stationarity" column to a binary target, and removing any NaN values.
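
The label conversion can be illustrated on a toy frame. This is only a minimal sketch: the rows below are made up, and the names `raw` and `subset` are hypothetical. Taking an explicit `.copy()` of the column subset is one common way to avoid pandas' chained-assignment warning when the label column is overwritten.

```python
import pandas as pd

# Toy stand-in for the MC-TACO rows; these values are invented for illustration.
raw = pd.DataFrame({
    "event_description": ["A", "B", "C", "D"],
    "stationarity": ["Stationarity", "Event Duration", "Frequency", None],
})

# Copy the column subset so the assignment below writes to an independent frame.
subset = raw[["event_description", "stationarity"]].copy()

# Stationarity -> 1, other known categories -> 0, anything unmapped becomes NaN.
subset["stationarity"] = subset["stationarity"].map({
    "Stationarity": 1,
    "Event Duration": 0,
    "Frequency": 0,
})

# Drop unlabeled rows, then store the target as integers.
subset = subset.dropna(subset=["stationarity"])
subset["stationarity"] = subset["stationarity"].astype(int)

print(subset["stationarity"].tolist())
```

The row with a missing category is dropped, leaving three labeled rows.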

**Embedding Generation:**

Used DistilBERT to generate embeddings (vector representations) for the event descriptions, which are used as inputs to the model.
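
The notebook obtains one vector per sentence by averaging the token vectors of the last hidden state. If sentences were embedded in padded batches rather than one at a time, the average would probably need to skip the padding positions. The sketch below shows that idea with plain NumPy arrays standing in for model outputs; `masked_mean_pool` is a hypothetical helper, not part of the transformers library.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    hidden_states: float array of shape (batch, tokens, dim)
    attention_mask: 0/1 array of shape (batch, tokens); 1 marks a real token
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against all-padding rows
    return summed / counts

# Two sentences, three token slots, two-dimensional vectors (made-up numbers).
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # last slot is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

The padded slot in the first sentence does not influence its pooled vector.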

**LSTM Model Creation:**

Defined an LSTM model that classifies each event description into one of two classes: stationarity (1) or any other temporal category (0).
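
Because each sentence is reduced to a single embedding vector, the LSTM sees a sequence of length one. The following sketch traces the tensor shapes under that assumption. The class name `EmbeddingLSTMClassifier` is hypothetical, the input is random noise rather than real embeddings, and, unlike the notebook's model, this variant returns raw logits and leaves the sigmoid to the caller.

```python
import torch
from torch import nn

class EmbeddingLSTMClassifier(nn.Module):
    """LSTM over a length-1 sequence followed by a linear layer (returns logits)."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        seq = x.unsqueeze(1)                  # (batch, input_size) -> (batch, 1, input_size)
        lstm_out, _ = self.lstm(seq)          # (batch, 1, hidden_size)
        last_step = lstm_out[:, -1, :]        # (batch, hidden_size)
        return self.fc(last_step).squeeze(1)  # (batch,)

torch.manual_seed(0)
model = EmbeddingLSTMClassifier(input_size=768, hidden_size=128)
batch = torch.randn(4, 768)    # random stand-in for four DistilBERT sentence vectors
logits = model(batch)
probs = torch.sigmoid(logits)  # probability of the positive class
print(logits.shape, probs.shape)
```

With only one time step, the recurrent connections are never exercised, so the layer behaves like a gated feed-forward transformation of the embedding.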


**Model Training:**

Trained the LSTM model on the DistilBERT embeddings, using a custom full-batch training loop with binary cross-entropy loss and the Adam optimizer.
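
The notebook updates the weights once per epoch on the whole training set. A mini-batch variant of the same loop might look like the sketch below. Everything in it is an assumption made for illustration: the data are synthetic, a small feed-forward network stands in for the LSTM, and `nn.BCEWithLogitsLoss` replaces the sigmoid plus `nn.BCELoss` pairing because it is generally the more numerically stable choice.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for embeddings and binary labels (not MC-TACO data).
features = torch.randn(256, 16)
labels = (features[:, 0] > 0).float()

model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1))
criterion = nn.BCEWithLogitsLoss()  # applies the sigmoid internally
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

def mean_loss():
    """Loss over the full synthetic set, computed without tracking gradients."""
    model.eval()
    with torch.no_grad():
        return criterion(model(features).squeeze(1), labels).item()

initial_loss = mean_loss()
for epoch in range(20):
    model.train()
    for batch_x, batch_y in loader:
        optimizer.zero_grad()                                  # clear old gradients
        loss = criterion(model(batch_x).squeeze(1), batch_y)   # forward pass and loss
        loss.backward()                                        # backward pass
        optimizer.step()                                       # parameter update
final_loss = mean_loss()
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Mini-batches give many more parameter updates per epoch than a single full-batch step, which may help when only a handful of epochs are run.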

**Prediction:**

Applied the trained model to the held-out test set and to a few sample sentences, embedding each sentence with DistilBERT and thresholding the LSTM output at 0.5 to obtain a class label.
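
Turning model probabilities into class labels is a simple thresholding step. The sketch below uses made-up probabilities and a hypothetical helper `to_labels`; it also shows how a lower threshold labels more inputs as positive, which may be worth exploring when the positive class is rare.

```python
import numpy as np

def to_labels(probabilities, threshold=0.5):
    """Flatten an (n, 1) or (n,) array of probabilities and apply a strict threshold."""
    flat = np.asarray(probabilities, dtype=float).reshape(-1)
    return (flat > threshold).astype(int)

probs = np.array([[0.12], [0.50], [0.51], [0.93]])  # made-up outputs, shape (4, 1)

default_labels = to_labels(probs)        # threshold 0.5
lenient_labels = to_labels(probs, 0.4)   # lower threshold, more positives
print(default_labels, lenient_labels)
```

A probability of exactly 0.5 is assigned to class 0 here because the comparison is strict.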

**What We Should Accomplish:**

**Train the LSTM Model**: Use the event embeddings and stationarity labels to train the LSTM model. The model should learn to predict whether an example belongs to the stationarity category or to one of the other temporal categories.

**Make Predictions**: Once the model is trained, you should be able to input new sentences and the model will classify them as either:

- Stationarity (1)
- Not Stationarity (0), which covers Event Duration, Frequency, Event Ordering, and Typical Time

**Evaluate the Model**: You’ll assess its accuracy, precision, recall, and F1-score to measure how well it performs on unseen test data.
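
To make the metrics concrete, the sketch below computes them from scratch for the positive class on a small made-up label set. The function `binary_metrics` is a hypothetical helper written for illustration; in practice `sklearn.metrics` provides the same quantities.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy plus precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up imbalanced example: eight negatives, two positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On imbalanced data, accuracy can look acceptable while precision for the rare class stays low, which is why the per-class metrics matter.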

In [24]:
import nltk
nltk.download('wordnet')


[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [28]:
import pandas as pd
import torch
from torch import nn
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from transformers import DistilBertTokenizer, DistilBertModel
import numpy as np
from imblearn.over_sampling import RandomOverSampler  # Import RandomOverSampler

# Load the data from the TSV file
data = pd.read_csv('mc-taco.tsv', sep='\t')

# Clean up column names and rename for easier access
data.columns = data.columns.str.strip().str.replace(' ', '_')
data.rename(columns={
    'Islam_later_emerged_as_the_majority_religion_during_the_centuries_of_Ottoman_rule,_though_a_significant_Christian_minority_remained.': 'event_description',
    'Stationarity': 'stationarity'
}, inplace=True)

# Keep only the relevant columns; copy so later assignments do not trigger SettingWithCopyWarning
filtered_data = data[['event_description', 'stationarity']].copy()

# Convert the Stationarity column to binary
filtered_data['stationarity'] = filtered_data['stationarity'].map({
    'Stationarity': 1,
    'Event Duration': 0,
    'Frequency': 0,
    'Event Ordering': 0,
    'Typical Time': 0
})

# Remove rows with NaN values in the 'stationarity' column
filtered_data = filtered_data.dropna(subset=['stationarity'])

# Check class distribution
print("Class distribution before oversampling:")
print(filtered_data['stationarity'].value_counts())

# Prepare the target variable
y = filtered_data['stationarity'].values

# Initialize DistilBERT tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
distilbert_model = DistilBertModel.from_pretrained('distilbert-base-uncased')

# Function to convert a sentence into its DistilBERT representation
def get_sentence_vector(sentence):
    inputs = tokenizer(sentence, return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = distilbert_model(**inputs)
    # Mean-pool the last hidden state over the token dimension to get one vector per sentence
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Create embeddings for the entire dataset
X = np.array([get_sentence_vector(desc) for desc in filtered_data['event_description']])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# Apply Random Oversampling to the training set
ros = RandomOverSampler(random_state=42)
X_train_resampled, y_train_resampled = ros.fit_resample(X_train, y_train)

# Convert to tensors
X_train = torch.tensor(X_train_resampled, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_train = torch.tensor(y_train_resampled, dtype=torch.float32)
y_test = torch.tensor(y_test, dtype=torch.float32)

# Define the LSTM model
class MyLSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(MyLSTMModel, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = x.unsqueeze(1)  # Add sequence dimension
        lstm_out, _ = self.lstm(x)
        out = self.fc(lstm_out[:, -1, :])  # Get the last output from LSTM
        return self.sigmoid(out)  # Output probability

# Initialize the model
input_size = X_train.shape[1]  # Number of features from DistilBERT
hidden_size = 128  # Choose a hidden size
lstm_model = MyLSTMModel(input_size, hidden_size)

# Define loss and optimizer
criterion = nn.BCELoss()  # Binary Cross Entropy Loss
optimizer = torch.optim.Adam(lstm_model.parameters(), lr=0.001)

# Training the model
def train_model(model, X_train, y_train, criterion, optimizer, epochs=10):
    model.train()
    for epoch in range(epochs):
        optimizer.zero_grad()  # Clear gradients
        outputs = model(X_train)  # Forward pass
        loss = criterion(outputs.squeeze(), y_train)  # Calculate loss
        loss.backward()  # Backward pass
        optimizer.step()  # Update parameters
        print(f"Epoch [{epoch + 1}/{epochs}], Loss: {loss.item():.4f}")

# Train the model
train_model(lstm_model, X_train, y_train, criterion, optimizer, epochs=10)

# Function to predict the stationarity of a new sentence
def predict_stationarity(model, new_sentences):
    # Convert new sentences to DistilBERT embeddings
    new_X = np.array([get_sentence_vector(sentence) for sentence in new_sentences])
    new_X_tensor = torch.tensor(new_X, dtype=torch.float32)

    # Make predictions
    model.eval()
    with torch.no_grad():
        predictions = model(new_X_tensor)
    predicted_labels = (predictions > 0.5).float()  # Threshold at 0.5 for binary classification
    return predicted_labels.numpy()

# Make predictions on the test set
lstm_model.eval()
with torch.no_grad():
    test_outputs = lstm_model(X_test)
    predicted_labels = (test_outputs > 0.5).float().numpy()  # Convert probabilities to binary labels

# Calculate accuracy
accuracy = accuracy_score(y_test.numpy(), predicted_labels)
print(f"Test Accuracy: {accuracy:.4f}")

# Classification report for detailed metrics
print("\nClassification Report:")
print(classification_report(y_test.numpy(), predicted_labels, target_names=['Not Stationarity (0)', 'Stationarity (1)']))

# Example inputs, including one that belongs to stationarity
example_inputs = [
    "The event was planned to occur next week.",  # Not Stationary
    "The meeting will take place regularly every Monday.",  # Not Stationary
    "This festival is held only once a year.",  # Not Stationary
    "The concert will be held every year without fail.",  # Not Stationary
    "This workshop happens every spring.",  # Not Stationary
    "The exhibition is scheduled to occur next month."  # Stationary
]

# Make predictions on the example inputs
predicted_stationarity = predict_stationarity(lstm_model, example_inputs)

# Display the predictions in binary format (0 or 1)
for sentence, prediction in zip(example_inputs, predicted_stationarity):
    print(f"Sentence: '{sentence}' => Prediction: {int(prediction[0])}")  # Convert to int for binary output




Class distribution before oversampling:
stationarity
0    3510
1     272
Name: count, dtype: int64




Epoch [1/10], Loss: 0.6934
Epoch [2/10], Loss: 0.6887
Epoch [3/10], Loss: 0.6843
Epoch [4/10], Loss: 0.6795
Epoch [5/10], Loss: 0.6746
Epoch [6/10], Loss: 0.6697
Epoch [7/10], Loss: 0.6644
Epoch [8/10], Loss: 0.6588
Epoch [9/10], Loss: 0.6530
Epoch [10/10], Loss: 0.6470
Test Accuracy: 0.6486

Classification Report:
                      precision    recall  f1-score   support

Not Stationarity (0)       0.94      0.66      0.78       703
    Stationarity (1)       0.09      0.44      0.15        54

            accuracy                           0.65       757
           macro avg       0.52      0.55      0.47       757
        weighted avg       0.88      0.65      0.73       757

Sentence: 'The event was planned to occur next week.' => Prediction: 0
Sentence: 'The meeting will take place regularly every Monday.' => Prediction: 0
Sentence: 'This festival is held only once a year.' => Prediction: 0
Sentence: 'The concert will be held every year without fail.' => Prediction: 0
Sentence