# What's going on in the model


**Input and Output**

**Input:** The model takes padded sequences of integer-encoded words. Each sequence represents one pair of events extracted from the TML files (the TimeBank and TempEval-3 datasets; the latter is called TimeEval-3 throughout the code) and is built from the text of the two events.

**Output:** The output is a label indicating the temporal relation between the two events (e.g., "BEFORE", "AFTER", "SIMULTANEOUS"). The model outputs a probability distribution over the possible relations, and the most probable relation is selected.
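
The selection step can be illustrated with a small sketch. The probabilities below are made up, and the class order assumes that only these three relations occur and that they are indexed alphabetically, as LabelEncoder would do.

```python
# Assumed class order: LabelEncoder sorts labels alphabetically,
# so these three relations would receive the indices 0, 1 and 2.
classes = ["AFTER", "BEFORE", "SIMULTANEOUS"]

# Hypothetical softmax output for one event pair (the values sum to 1).
probabilities = [0.20, 0.65, 0.15]

# Pick the index with the highest probability, then map it back to its label.
best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
predicted_relation = classes[best_index]

print(predicted_relation)  # BEFORE
```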



**1. Data Splitting**

**Purpose:** To separate the combined, preprocessed data into training and test sets.

**Process:**

X_train, X_test, y_train, y_test: These variables hold the training and test data for the input sequences (X) and the labels (y).

The train_test_split function shuffles the data and splits it, with 80% going to training and 20% to testing (test_size=0.2). The fixed random_state=42 makes the split reproducible.

**Outcome:** The data is divided into X_train, y_train for training and X_test, y_test for testing.
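
As a rough check of the split sizes, the arithmetic can be reproduced by hand. The sample count of 16 is inferred from the shapes printed later in the notebook (12 training rows plus 4 test rows); the rounding rule reflects how scikit-learn handles a fractional test_size when no train_size is given.

```python
import math

n_samples = 16   # inferred from the printed shapes: 12 training + 4 test rows
test_size = 0.2

# The test share is rounded up; the remainder goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 12 4
```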

**2. Model Building**

**Purpose:** To create an LSTM (Long Short-Term Memory) neural network model for sequence prediction.

**Architecture:**

**Embedding Layer:** Converts integer-encoded words (from the tokenizer) into dense vectors of fixed size (128). The vectors are learned during training, which lets the model capture relationships between words.

**LSTM Layer:** Processes each sequence word by word and captures the dependencies between words. With 128 units and return_sequences=False, it outputs a single fixed-size vector (128 values) that represents the whole sequence.

**Dropout Layer:** Reduces overfitting by randomly setting a fraction of its input units (here 0.5) to 0 at each update during training.

**Dense Layer:** A fully connected layer with 64 units and ReLU activation that processes the LSTM output after dropout.

**Output Layer:** A dense layer with one unit per label class and a softmax activation function, which outputs a probability distribution over all possible label classes.
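
To get a feel for the size of this architecture, the trainable parameters per layer can be estimated with the standard formulas for embedding, LSTM and dense layers. This is only a sketch: vocab_size and num_classes are hypothetical placeholders, since the real values depend on the tokenizer and the label encoder.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

vocab_size = 20   # hypothetical stand-in for len(tokenizer.word_index) + 1
num_classes = 3   # hypothetical stand-in for len(label_encoder.classes_)

counts = {
    "embedding": embedding_params(vocab_size, 128),  # 2560
    "lstm": lstm_params(128, 128),                   # 131584
    "dropout": 0,                                    # dropout has no weights
    "dense": dense_params(128, 64),                  # 8256
    "output": dense_params(64, num_classes),         # 195
}
total = sum(counts.values())

print(total)  # 142595
```

Under these assumptions the LSTM layer holds the large majority of the weights, regardless of the exact vocabulary size.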

**3. Model Training**

**Purpose:** To fit the model to the training data so that it learns patterns in the sequences.

**Process:**

The model is trained with the fit method, which processes the training data in batches of up to 32 sequences.

Training runs for 10 epochs, meaning the model passes over the entire training set 10 times.

After each epoch, the model is evaluated on the test data, which is passed as validation_data and therefore doubles as the validation set, to monitor how well it generalizes.

**Outcome:** The model learns from the training data, and its loss and accuracy are tracked over the epochs.
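
The number of weight updates follows directly from these settings. The sketch below assumes the 12 training rows reported later in the notebook; with a batch size of 32, the whole training set fits into a single batch, which matches the "1/1" step counter in the training log.

```python
import math

n_train = 12      # training rows, taken from the printed training shape
batch_size = 32
epochs = 10

# The last (here: only) batch may be smaller than batch_size.
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, total_updates)  # 1 10
```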



# 1: Data Extraction and Sequence Creation

In [46]:
import pandas as pd
import xml.etree.ElementTree as ET

# Function to extract events from a TML file
def extract_events(file_path):
    tree = ET.parse(file_path)
    root = tree.getroot()
    events = []

    for i, event in enumerate(root.findall('.//EVENT')):
        event_text = event.text.strip() if event.text else ""
        event_class = event.get('class')
        event_tense = event.get('tense')
        event_aspect = event.get('aspect')
        events.append({
            'event_id': i,
            'text': event_text,
            'class': event_class,
            'tense': event_tense,
            'aspect': event_aspect
        })

    return events

# Function to extract temporal relations from a TML file
def extract_temporal_relations(file_path):
    tree = ET.parse(file_path)
    root = tree.getroot()
    relations = []

    for tlink in root.findall('.//TLINK'):
        event_1 = tlink.get('eventInstanceID')
        event_2 = tlink.get('relatedToEventInstance')
        relation_type = tlink.get('relType')
        relations.append({
            'event_1': event_1,
            'event_2': event_2,
            'relation': relation_type
        })

    return relations

# Extract data from TimeBank
timebank_file_path = 'TimeBank.tml'  # Replace with actual file path
timebank_events = extract_events(timebank_file_path)
timebank_relations = extract_temporal_relations(timebank_file_path)

# Convert to DataFrames
timebank_events_df = pd.DataFrame(timebank_events)
timebank_relations_df = pd.DataFrame(timebank_relations)

# Display TimeBank DataFrames
print("TimeBank Events DataFrame:")
print(timebank_events_df.head())  # Display first few rows
print("\nTimeBank Relations DataFrame:")
print(timebank_relations_df.head())

# Create sequences and labels
# (create_sequences is not defined in this notebook; it must be defined before this cell runs)
timebank_sequences, timebank_labels = create_sequences(timebank_events_df, timebank_relations_df)
timebank_sequences_df = pd.DataFrame({'sequence': timebank_sequences, 'label': timebank_labels})

# Display TimeBank Sequences DataFrame
print("\nTimeBank Sequences DataFrame:")
print(timebank_sequences_df.head())

# Extract data from TimeEval-3
timeeval_file_path = 'TimeEval3.tml'  # Replace with actual file path
timeeval_events = extract_events(timeeval_file_path)
timeeval_relations = extract_temporal_relations(timeeval_file_path)

# Convert to DataFrames
timeeval_events_df = pd.DataFrame(timeeval_events)
timeeval_relations_df = pd.DataFrame(timeeval_relations)

# Display TimeEval-3 DataFrames
print("\nTimeEval-3 Events DataFrame:")
print(timeeval_events_df.head())
print("\nTimeEval-3 Relations DataFrame:")
print(timeeval_relations_df.head())

# Create sequences and labels
timeeval_sequences, timeeval_labels = create_sequences(timeeval_events_df, timeeval_relations_df)
timeeval_sequences_df = pd.DataFrame({'sequence': timeeval_sequences, 'label': timeeval_labels})

# Display TimeEval-3 Sequences DataFrame
print("\nTimeEval-3 Sequences DataFrame:")
print(timeeval_sequences_df.head())



TimeBank Events DataFrame:
   event_id      text       class tense aspect
0         0  watching  OCCURRENCE  None   None
1         1    killed  OCCURRENCE  None   None
2         2   emptied  OCCURRENCE  None   None
3         3      said   REPORTING  None   None
4         4  appeared  OCCURRENCE  None   None

TimeBank Relations DataFrame:
  event_1 event_2     relation
0   ei236    None       BEFORE
1   ei224    None       BEFORE
2   ei216   ei215     INCLUDES
3   ei237    None  IS_INCLUDED
4   ei239    None  IS_INCLUDED

TimeBank Sequences DataFrame:
Empty DataFrame
Columns: [sequence, label]
Index: []

TimeEval-3 Events DataFrame:
   event_id     text       class tense aspect
0         0   dipped  OCCURRENCE  None   None
1         1  falling  OCCURRENCE  None   None
2         2   whammy  OCCURRENCE  None   None
3         3  reeling  OCCURRENCE  None   None
4         4     said   REPORTING  None   None

TimeEval-3 Relations DataFrame:
  event_1 event_2 relation
0     ei1    None   BEFO
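
The output above shows two things worth noting: the events carry integer event_id values while the TLINKs refer to instance IDs such as ei236, and many TLINKs have no relatedToEventInstance. Either could explain why the TimeBank sequences DataFrame is empty. The following is a minimal sketch of one way to pair events, not the notebook's create_sequences. It assumes the TimeML layout in which EVENT elements carry an eid, MAKEINSTANCE elements link an eiid to an eventID, and event-to-time TLINKs use relatedToTime instead of relatedToEventInstance. The sample document and the name pair_events_sketch are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical TimeML fragment, written for this example only.
SAMPLE_TML = """
<TimeML>
  <TEXT>Shares <EVENT eid="e1" class="OCCURRENCE">dipped</EVENT> after the bank
  <EVENT eid="e2" class="REPORTING">said</EVENT> profits fell on
  <TIMEX3 tid="t1" type="DATE" value="1998-01-08">Thursday</TIMEX3>.</TEXT>
  <MAKEINSTANCE eventID="e1" eiid="ei1" tense="PAST" aspect="NONE"/>
  <MAKEINSTANCE eventID="e2" eiid="ei2" tense="PAST" aspect="NONE"/>
  <TLINK lid="l1" relType="AFTER" eventInstanceID="ei1" relatedToEventInstance="ei2"/>
  <TLINK lid="l2" relType="IS_INCLUDED" eventInstanceID="ei2" relatedToTime="t1"/>
</TimeML>
"""

def pair_events_sketch(root):
    """Hypothetical helper: build (sequence, label) pairs for event-event TLINKs."""
    # Map each EVENT's eid to its text.
    text_by_eid = {e.get("eid"): (e.text or "").strip() for e in root.iter("EVENT")}
    # Map each instance ID (eiid) to the text of the event it instantiates.
    text_by_eiid = {
        m.get("eiid"): text_by_eid.get(m.get("eventID"))
        for m in root.iter("MAKEINSTANCE")
    }

    sequences, labels = [], []
    for tlink in root.iter("TLINK"):
        first = text_by_eiid.get(tlink.get("eventInstanceID"))
        second = text_by_eiid.get(tlink.get("relatedToEventInstance"))
        if first is None or second is None:
            continue  # skip event-to-time links and unknown instance IDs
        sequences.append(f"{first} {second}")
        labels.append(tlink.get("relType"))
    return sequences, labels

sequences, labels = pair_events_sketch(ET.fromstring(SAMPLE_TML.strip()))
print(sequences, labels)  # ['dipped said'] ['AFTER']
```

Reading tense and aspect from MAKEINSTANCE rather than EVENT, under the same assumption about the file layout, would also avoid the None values seen in the events DataFrames.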

In [47]:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Combine TimeBank and TimeEval-3 DataFrames
combined_sequences_df = pd.concat([timebank_sequences_df, timeeval_sequences_df], ignore_index=True)

# Display Combined Sequences DataFrame
print("\nCombined Sequences DataFrame:")
print(combined_sequences_df.head())

# Clean and preprocess sequences
def clean_text(text):
    text = text.lower().strip()
    return text

combined_sequences_df['cleaned_sequence'] = combined_sequences_df['sequence'].apply(clean_text)

# Display Cleaned Sequences
print("\nCombined Cleaned Sequences DataFrame:")
print(combined_sequences_df[['sequence', 'cleaned_sequence']].head())

# Tokenization
tokenizer = Tokenizer()
tokenizer.fit_on_texts(combined_sequences_df['cleaned_sequence'])
combined_sequences_encoded = tokenizer.texts_to_sequences(combined_sequences_df['cleaned_sequence'])

# Padding
max_length = max(len(seq) for seq in combined_sequences_encoded)
combined_sequences_padded = pad_sequences(combined_sequences_encoded, maxlen=max_length, padding='post')

# Display Encoded and Padded Sequences
print("\nSample Encoded Sequences:")
print(combined_sequences_encoded[:5])  # Display first 5 sequences

print("\nPadded Sequences:")
print(combined_sequences_padded[:5])  # Display first 5 padded sequences

# Encode Labels
label_encoder = LabelEncoder()
combined_labels_encoded = label_encoder.fit_transform(combined_sequences_df['label'])

# Display Encoded Labels
print("\nEncoded Labels:")
print(combined_labels_encoded[:5])  # Display first 5 encoded labels



Combined Sequences DataFrame:
           sequence   label
0      led declined  BEFORE
1   limited lending   AFTER
2  touching limited  BEFORE
3   limited session   AFTER
4    touching quell   AFTER

Combined Cleaned Sequences DataFrame:
           sequence  cleaned_sequence
0      led declined      led declined
1   limited lending   limited lending
2  touching limited  touching limited
3   limited session   limited session
4    touching quell    touching quell

Sample Encoded Sequences:
[[8, 3], [4, 9], [1, 4], [4, 10], [1, 11]]

Padded Sequences:
[[ 8  3]
 [ 4  9]
 [ 1  4]
 [ 4 10]
 [ 1 11]]

Encoded Labels:
[1 0 1 0 0]
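
The encoding steps above can be mimicked in plain Python to see where the numbers come from. This is a simplified sketch on a made-up three-sentence corpus, not the Keras or scikit-learn implementation: words are ranked by frequency starting at index 1 (0 is reserved for padding), shorter sequences are filled with zeros at the end, and labels are indexed in alphabetical order. In the notebook itself every sequence already has two words, so no padding is actually added; max_length is set to 3 here only to make the padding visible.

```python
from collections import Counter

texts = ["led declined", "limited lending", "touching limited"]
labels = ["BEFORE", "AFTER", "BEFORE"]

# Rank words by frequency; ties keep their first-seen order. Index 0 is left for padding.
counts = Counter(word for text in texts for word in text.split())
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# Replace each word with its index.
encoded = [[word_index[word] for word in text.split()] for text in texts]

# Post-padding: append zeros until every sequence has max_length entries.
max_length = 3  # hypothetical, chosen so that padding is visible
padded = [seq + [0] * (max_length - len(seq)) for seq in encoded]

# Label encoding: classes sorted alphabetically, then mapped to their position.
classes = sorted(set(labels))
encoded_labels = [classes.index(label) for label in labels]

print(encoded)         # [[2, 3], [1, 4], [5, 1]]
print(padded)          # [[2, 3, 0], [1, 4, 0], [5, 1, 0]]
print(encoded_labels)  # [1, 0, 1]
```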


In [48]:
from sklearn.model_selection import train_test_split

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(combined_sequences_padded, combined_labels_encoded, test_size=0.2, random_state=42)

# Display the shape of the training and testing data
print("\nTraining Data Shape:", X_train.shape)
print("Training Labels Shape:", y_train.shape)
print("Test Data Shape:", X_test.shape)
print("Test Labels Shape:", y_test.shape)



Training Data Shape: (12, 2)
Training Labels Shape: (12,)
Test Data Shape: (4, 2)
Test Labels Shape: (4,)


In [49]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Define the model architecture
model = Sequential()

# Embedding layer
model.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=128, input_length=max_length))

# LSTM layer
model.add(LSTM(units=128, return_sequences=False))

# Dropout layer
model.add(Dropout(0.5))

# Dense layer
model.add(Dense(units=64, activation='relu'))

# Output layer
model.add(Dense(units=len(label_encoder.classes_), activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Display the model summary
print("\nModel Summary:")
model.summary()



Model Summary:




In [50]:
# Train the model
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))

# Display training history
print("\nTraining History:")
print(history.history)


Epoch 1/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 4s 4s/step - accuracy: 0.3333 - loss: 1.1009 - val_accuracy: 0.5000 - val_loss: 1.0953
Epoch 2/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 106ms/step - accuracy: 0.5833 - loss: 1.0929 - val_accuracy: 0.5000 - val_loss: 1.0952
Epoch 3/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 146ms/step - accuracy: 0.3333 - loss: 1.0965 - val_accuracy: 0.5000 - val_loss: 1.0945
Epoch 4/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 113ms/step - accuracy: 0.6667 - loss: 1.0879 - val_accuracy: 0.5000 - val_loss: 1.0934
Epoch 5/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 132ms/step - accuracy: 0.5833 - loss: 1.0857 - val_accuracy: 0.5000 - val_loss: 1.0922
Epoch 6/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 135ms/step - accuracy: 0.4167 - loss: 1.0831 - val_accuracy: 0.5000 - val_loss: 1.0911
Epoch 7/10
1/1 ━━━━━━━━━━━━━━━

In [51]:
# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(X_test, y_test)

# Display the evaluation results
print("\nTest Loss:", test_loss)
print("Test Accuracy:", test_accuracy)


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 52ms/step - accuracy: 0.5000 - loss: 1.0846

Test Loss: 1.0846385955810547
Test Accuracy: 0.5
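
One way to put the test loss into perspective is to compare it with the cross-entropy of a model that assigns equal probability to every class. The sketch below assumes three classes, inferred from the three relations that appear in the sample predictions; if the label encoder holds a different number of classes, the reference value changes accordingly.

```python
import math

num_classes = 3  # assumption: AFTER, BEFORE and SIMULTANEOUS only
reported_test_loss = 1.0846385955810547  # value printed above

# Cross-entropy of a uniform guess: -log(1 / num_classes).
chance_loss = -math.log(1 / num_classes)

print(round(chance_loss, 4))                       # 1.0986
print(round(chance_loss - reported_test_loss, 4))  # 0.014
```

Under this assumption the test loss sits only slightly below the uniform-guess level, which suggests that the model has learned little from the 12 training examples.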


In [52]:
# Make predictions on the test data
y_pred = model.predict(X_test)

# Convert predictions to label indices
y_pred_labels = y_pred.argmax(axis=-1)

# Display some predictions
print("\nSample Predictions:")
for i in range(min(5, len(y_test))):  # Display up to 5 predictions; the test set may hold fewer
    print(f"Predicted: {label_encoder.inverse_transform([y_pred_labels[i]])[0]}, Actual: {label_encoder.inverse_transform([y_test[i]])[0]}")


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 197ms/step

Sample Predictions:
Predicted: AFTER, Actual: BEFORE
Predicted: AFTER, Actual: AFTER
Predicted: AFTER, Actual: AFTER
Predicted: SIMULTANEOUS, Actual: BEFORE