<a href="https://colab.research.google.com/github/shoaib-niazi/cloak-project/blob/main/BERT%20model%20for%20FYP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [7]:
import pandas as pd

file_path = '/content/household_power_consumption.txt'
df = pd.read_csv(file_path, sep=';', low_memory=False)
display(df.head())

Unnamed: 0,Date,Time,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
0,16/12/2006,17:24:00,4.216,0.418,234.84,18.4,0.0,1.0,17.0
1,16/12/2006,17:25:00,5.36,0.436,233.63,23.0,0.0,1.0,16.0
2,16/12/2006,17:26:00,5.374,0.498,233.29,23.0,0.0,2.0,17.0
3,16/12/2006,17:27:00,5.388,0.502,233.74,23.0,0.0,1.0,17.0
4,16/12/2006,17:28:00,3.666,0.528,235.68,15.8,0.0,1.0,17.0
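
Later outputs in this notebook show a `DateTime` index, but the cell that builds it is not shown. A minimal sketch of one way to build it, assuming day-first dates and `HH:MM:SS` times as in the preview above; the inline sample and the names `raw_text` and `sample` are illustrative only.

```python
import io
import pandas as pd

# Three illustrative rows in the same layout as the preview above.
raw_text = io.StringIO(
    "Date;Time;Global_active_power;Voltage\n"
    "16/12/2006;17:24:00;4.216;234.84\n"
    "16/12/2006;17:25:00;5.360;233.63\n"
    "16/12/2006;17:26:00;5.374;233.29\n"
)
sample = pd.read_csv(raw_text, sep=";")

# Combine the two text columns and parse them with an explicit day-first format.
sample["DateTime"] = pd.to_datetime(
    sample["Date"] + " " + sample["Time"], format="%d/%m/%Y %H:%M:%S"
)
sample = sample.drop(columns=["Date", "Time"]).set_index("DateTime")
print(sample)
```

Passing an explicit `format` avoids `16/12/2006` being guessed as month-first and is typically faster than letting pandas infer the format.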


In [4]:
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



# Task
Prepare the time-series data from `/content/household_power_consumption.txt` for a BERT model, then define, train, and evaluate a BERT-based model for a time-series analysis task.

## Data Preparation for BERT

### Subtask:
Transform the numerical time-series data in the `df` DataFrame into a format suitable for input into a BERT model, including handling missing values, scaling, creating sequences, and preparing training and validation sets as PyTorch `Dataset` and `DataLoader` objects.


**Reasoning**:
The first step is to identify the relevant numerical columns and convert them to a numeric type, coercing errors to NaN. This is crucial for numerical operations and handling missing values in subsequent steps.



In [13]:
numerical_cols = ['Global_active_power', 'Global_reactive_power', 'Voltage', 'Global_intensity', 'Sub_metering_1', 'Sub_metering_2', 'Sub_metering_3']

for col in numerical_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')

print("Data types after conversion:\n", df[numerical_cols].dtypes)
display(df[numerical_cols].head())

Data types after conversion:
 Global_active_power      float64
Global_reactive_power    float64
Voltage                  float64
Global_intensity         float64
Sub_metering_1           float64
Sub_metering_2           float64
Sub_metering_3           float64
dtype: object


DateTime,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
2006-12-16 17:24:00,4.216,0.418,234.84,18.4,0.0,1.0,17.0
2006-12-16 17:25:00,5.36,0.436,233.63,23.0,0.0,1.0,16.0
2006-12-16 17:26:00,5.374,0.498,233.29,23.0,0.0,2.0,17.0
2006-12-16 17:27:00,5.388,0.502,233.74,23.0,0.0,1.0,17.0
2006-12-16 17:28:00,3.666,0.528,235.68,15.8,0.0,1.0,17.0
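
As a small illustration of what `errors='coerce'` does, the sketch below converts a hypothetical string column containing one non-numeric placeholder. The placeholder `"?"` and the names `raw_col` and `numeric_col` are assumptions for the example, not taken from the notebook.

```python
import pandas as pd

# Hypothetical raw column: numbers stored as strings plus one non-numeric placeholder.
raw_col = pd.Series(["4.216", "5.360", "?", "3.666"])

# Unparseable entries become NaN instead of raising an error.
numeric_col = pd.to_numeric(raw_col, errors="coerce")

# Count how many values were turned into NaN by the coercion.
n_coerced = int(numeric_col.isna().sum() - raw_col.isna().sum())
print(numeric_col.dtype, n_coerced)
```

Counting the coerced values before imputing them is a cheap way to see how much of the series will later be filled in rather than measured.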


**Reasoning**:
Now that the numerical columns are correctly typed, the next step is to handle any remaining missing values using forward-fill followed by backward-fill to maintain the time-series integrity and ensure no NaNs are left.



In [14]:
df[numerical_cols] = df[numerical_cols].ffill().bfill()

print("Number of missing values after imputation:")
print(df[numerical_cols].isnull().sum())

Number of missing values after imputation:
Global_active_power      0
Global_reactive_power    0
Voltage                  0
Global_intensity         0
Sub_metering_1           0
Sub_metering_2           0
Sub_metering_3           0
dtype: int64
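
The effect of chaining `ffill()` and `bfill()` can be seen on a toy series. The sketch below also shows the `limit` argument, which caps how far a value is carried forward; using a cap is an assumption about what might be desirable for long outages, not something this notebook does.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# ffill carries the last valid value forward; bfill then fills the leading NaN.
filled = s.ffill().bfill()

# With limit=1, only the first NaN after each valid value is filled,
# so longer gaps stay visible instead of being silently flattened.
capped = s.ffill(limit=1)

print(filled.tolist())
print(int(capped.isna().sum()))
```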


**Reasoning**:
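With missing values handled, the next step is to scale the numerical features to the [0, 1] range with `MinMaxScaler`, so that features with large magnitudes (such as `Voltage`) do not dominate the linear projection that feeds BERT. Note that the scaler here is fitted on the full dataset before the train/validation split, so validation-set minima and maxima influence the training inputs; fitting on the training rows only would avoid that leakage.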



In [15]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

print("Scaled data head:")
display(df[numerical_cols].head())

Scaled data head:


DateTime,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
2006-12-16 17:24:00,0.390442,0.300719,0.396161,0.391304,0.0,0.012821,0.548387
2006-12-16 17:25:00,0.498489,0.313669,0.353927,0.491304,0.0,0.012821,0.516129
2006-12-16 17:26:00,0.499811,0.358273,0.342059,0.491304,0.0,0.025641,0.548387
2006-12-16 17:27:00,0.501133,0.361151,0.357766,0.491304,0.0,0.012821,0.548387
2006-12-16 17:28:00,0.338496,0.379856,0.42548,0.334783,0.0,0.012821,0.548387
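
The scaler above is fitted on every row, including rows that later become the validation set. A minimal sketch of the stricter alternative, fitting on the training rows only, is shown below on synthetic data; `values`, `split_row`, and `train_scaler` are hypothetical names, and the 80% cut-off is only an example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
values = rng.normal(size=(1000, 7))  # synthetic stand-in for df[numerical_cols].values
split_row = 800                      # rows before this index are treated as training data

# Fit on the training rows only, then reuse the fitted parameters for validation.
train_scaler = MinMaxScaler()
train_scaled = train_scaler.fit_transform(values[:split_row])
val_scaled = train_scaler.transform(values[split_row:])

print(train_scaled.min(), train_scaled.max())
print(val_scaled.min(), val_scaled.max())
```

With this approach validation values can fall slightly outside [0, 1], which is expected: the scaler has never seen them.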


**Reasoning**:
Now that the numerical features are scaled, the next logical step is to create time-series sequences from the data. This involves defining a sequence length and a forecast horizon, then iterating through the scaled data to generate input sequences (X) and corresponding target values (y) that will be fed into the BERT model.



In [16]:
import numpy as np

sequence_length = 24  # 24 time steps as input (24 minutes at this dataset's 1-minute sampling)
forecast_horizon = 1  # predict the next data point

X, y = [], []

data_array = df[numerical_cols].values # Use only the numerical columns for sequence generation

for i in range(len(data_array) - sequence_length - forecast_horizon + 1):
    X.append(data_array[i : i + sequence_length])
    y.append(data_array[i + sequence_length + forecast_horizon - 1, 0]) # Predicting 'Global_active_power' of the next step

X = np.array(X)
y = np.array(y)

print(f"Shape of input sequences (X): {X.shape}")
print(f"Shape of target values (y): {y.shape}")

Shape of input sequences (X): (733827, 24, 7)
Shape of target values (y): (733827,)
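
The Python loop above copies every window into a list before stacking. As a sketch of an equivalent vectorized construction, assuming NumPy 1.20 or newer, `sliding_window_view` can produce the same windows as a view on the original array. The helper name `make_windows` is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(data, seq_len, horizon, target_col=0):
    """Return (X, y) with X of shape (n_samples, seq_len, n_features)."""
    n_samples = len(data) - seq_len - horizon + 1
    # The window axis is appended last: (n_windows, n_features, seq_len).
    windows = sliding_window_view(data, window_shape=seq_len, axis=0)
    # Drop windows that have no target, then move the time axis to position 1.
    X = windows[:n_samples].transpose(0, 2, 1)
    y = data[seq_len + horizon - 1 :, target_col]
    return X, y

demo = np.arange(40, dtype=np.float64).reshape(10, 4)
X_demo, y_demo = make_windows(demo, seq_len=3, horizon=1)
print(X_demo.shape, y_demo.shape)
```

The returned `X` is a read-only view, so it uses no extra memory until it is copied, for example when it is converted to a tensor.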


**Reasoning**:
With the sequences and targets prepared, the next step is to split them into training and validation sets. Given this is time-series data, a chronological split is essential, ensuring the training set comes before the validation set to avoid data leakage.



In [17]:
train_split_ratio = 0.8
split_index = int(len(X) * train_split_ratio)

X_train, X_val = X[:split_index], X[split_index:]
y_train, y_val = y[:split_index], y[split_index:]

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of X_val: {X_val.shape}")
print(f"Shape of y_val: {y_val.shape}")

Shape of X_train: (587061, 24, 7)
Shape of y_train: (587061,)
Shape of X_val: (146766, 24, 7)
Shape of y_val: (146766,)
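
One detail of a chronological split over overlapping windows: the last training samples and the first validation samples share raw rows. The sketch below, which is an optional refinement and not what the notebook does, leaves a gap of `seq_len + horizon - 1` samples so that no raw row is used on both sides. The function name `chronological_split` is hypothetical.

```python
def chronological_split(n_samples, train_ratio, seq_len, horizon):
    """Return index ranges for train and validation samples with a gap between them."""
    split = int(n_samples * train_ratio)
    # Training sample (split - 1) touches raw rows up to split + seq_len + horizon - 2,
    # so validation starts at the first sample whose window begins after that row.
    gap = seq_len + horizon - 1
    train_idx = range(0, split)
    val_idx = range(split + gap, n_samples)
    return train_idx, val_idx

train_idx, val_idx = chronological_split(733_827, 0.8, seq_len=24, horizon=1)
print(len(train_idx), val_idx[0], len(val_idx))
```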


**Reasoning**:
Now that the data has been split into training and validation sets, the next step is to create a custom PyTorch `Dataset` class to encapsulate the sequences and targets, which is a prerequisite for using PyTorch's `DataLoader`.



In [18]:
import torch
from torch.utils.data import Dataset, DataLoader

class TimeSeriesDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Create Dataset objects
train_dataset = TimeSeriesDataset(X_train, y_train)
val_dataset = TimeSeriesDataset(X_val, y_val)

print(f"Train dataset size: {len(train_dataset)}")
print(f"Validation dataset size: {len(val_dataset)}")

Train dataset size: 587061
Validation dataset size: 146766
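
`TimeSeriesDataset` stores every window explicitly, which repeats each raw row up to 24 times in memory. A sketch of an alternative that slices windows on demand from the 2-D array is shown below; `LazyWindowDataset` is a hypothetical class, not part of the notebook, and the demo data is random.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class LazyWindowDataset(Dataset):
    """Hypothetical alternative: keep the (rows, features) array and slice per item."""

    def __init__(self, data, seq_len, horizon=1, target_col=0):
        self.data = torch.as_tensor(data, dtype=torch.float32)
        self.seq_len = seq_len
        self.horizon = horizon
        self.target_col = target_col

    def __len__(self):
        return len(self.data) - self.seq_len - self.horizon + 1

    def __getitem__(self, idx):
        end = idx + self.seq_len
        x = self.data[idx:end]                                   # (seq_len, n_features)
        y = self.data[end + self.horizon - 1, self.target_col]   # scalar target
        return x, y

demo_data = np.random.default_rng(0).random((100, 7))
lazy_ds = LazyWindowDataset(demo_data, seq_len=24)
x0, y0 = lazy_ds[0]
print(len(lazy_ds), x0.shape, y0.shape)
```

It can be passed to `DataLoader` in the same way as `TimeSeriesDataset`, since it returns the same `(x, y)` pair per index.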


**Reasoning**:
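The final step in data preparation is to instantiate `DataLoader` objects for the training and validation `Dataset`s, which batch the samples for efficient iteration during training. Both loaders use `shuffle=False`, so batches arrive in chronological order. Because each sample is already a self-contained window, shuffling the training loader would not leak future information and is a common choice for stochastic gradient descent; it is left off here.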



In [19]:
batch_size = 64

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

print(f"Number of batches in train_loader: {len(train_loader)}")
print(f"Number of batches in val_loader: {len(val_loader)}")

# Verify one batch
for X_batch, y_batch in train_loader:
    print(f"Shape of X_batch: {X_batch.shape}")
    print(f"Shape of y_batch: {y_batch.shape}")
    break

Number of batches in train_loader: 9173
Number of batches in val_loader: 2294
Shape of X_batch: torch.Size([64, 24, 7])
Shape of y_batch: torch.Size([64])


## Define BERT-based Model for Time Series

### Subtask:
Adapt the pre-trained `bert` model for the specific time-series task by adding custom layers on top of BERT's output layer and configuring the model for regression.


**Reasoning**:
To adapt the pre-trained BERT model for time-series regression, a custom PyTorch model class needs to be defined. This class will integrate the BERT model, add a linear layer to project the numerical time-series features into BERT's embedding space, and include a regression head on top of BERT's output to predict a scalar value.



In [20]:
import torch.nn as nn

class TimeSeriesBERTModel(nn.Module):
    def __init__(self, bert_model, input_dim, hidden_size):
        super(TimeSeriesBERTModel, self).__init__()
        self.bert = bert_model
        # Project numerical features to BERT's hidden size
        self.embedding_layer = nn.Linear(input_dim, hidden_size)
        # Regression head: two linear layers with ReLU activation
        self.reg_head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1) # Output a single scalar for regression
        )

    def forward(self, x):
        # x shape: (batch_size, sequence_length, input_dim)
        # Project numerical features to the BERT's hidden dimension
        embedded_x = self.embedding_layer(x) # (batch_size, sequence_length, hidden_size)

        # Pass through BERT using inputs_embeds
        outputs = self.bert(inputs_embeds=embedded_x)
        last_hidden_state = outputs.last_hidden_state # (batch_size, sequence_length, hidden_size)

        # Average pooling across sequence dimension to get a single representation for the sequence
        pooled_output = torch.mean(last_hidden_state, dim=1) # (batch_size, hidden_size)

        # Pass through regression head to get the final prediction
        prediction = self.reg_head(pooled_output) # (batch_size, 1)

        return prediction.squeeze(-1) # Squeeze to (batch_size) for single scalar prediction

# Get input dimensions from the prepared data
input_dim = X_train.shape[2] # Number of numerical features
bert_hidden_size = bert.config.hidden_size # BERT's hidden size (e.g., 768 for bert-base-uncased)

# Instantiate the custom model
model = TimeSeriesBERTModel(bert, input_dim, bert_hidden_size)

# Move the model to the appropriate device (GPU if available, else CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

print(f"Model moved to {device}")
print(model)

Model moved to cpu
TimeSeriesBERTModel(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,
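
To check the tensor shapes of this design without downloading `bert-base-uncased`, the forward path can be exercised on a tiny, randomly initialized BERT. This is only a shape check under assumed, illustrative sizes (`tiny_config`, `tiny_bert`, and `projection` are hypothetical names); it says nothing about accuracy. Note that `BertModel` still adds its position and token-type embeddings to whatever is passed as `inputs_embeds`.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny, randomly initialized configuration (illustrative sizes, not bert-base).
tiny_config = BertConfig(
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
tiny_bert = BertModel(tiny_config).eval()

# Project 7 numerical features per time step into the model's hidden size.
projection = torch.nn.Linear(7, tiny_config.hidden_size)
x = torch.randn(4, 24, 7)  # (batch, sequence_length, features)

with torch.no_grad():
    hidden = tiny_bert(inputs_embeds=projection(x)).last_hidden_state
    pooled = hidden.mean(dim=1)  # average over the time axis

print(hidden.shape, pooled.shape)
```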

## Configure Training Parameters

### Subtask:
Set up the training loop by defining a suitable loss function (e.g., Mean Squared Error for regression, Cross-Entropy for classification), an optimizer (e.g., AdamW), a learning rate scheduler, and an appropriate batch size for training.


**Reasoning**:
To configure the training parameters, I will first define the loss function for regression (MSE Loss) and the optimizer (AdamW) with an initial learning rate, then calculate the total training steps and number of epochs.



In [21]:
import torch.optim as optim
from transformers import get_linear_schedule_with_warmup

# 1. Define Loss Function (Mean Squared Error for Regression)
loss_fn = nn.MSELoss()

# 2. Initialize Optimizer (AdamW)
learning_rate = 2e-5
optimizer = optim.AdamW(model.parameters(), lr=learning_rate)

# 3. Define the number of epochs
num_epochs = 10

# Calculate total training steps for the scheduler
total_training_steps = len(train_loader) * num_epochs

# 4. Initialize Learning Rate Scheduler (optional but recommended)
# A linear scheduler with warmup is commonly used for BERT-like models
warmup_steps = int(0.1 * total_training_steps) # 10% of total steps for warmup
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_training_steps
)

print(f"Loss function: {loss_fn}")
print(f"Optimizer: {optimizer}")
print(f"Number of epochs: {num_epochs}")
print(f"Total training steps for scheduler: {total_training_steps}")
print(f"Warmup steps for scheduler: {warmup_steps}")

Loss function: MSELoss()
Optimizer: AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    decoupled_weight_decay: True
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    initial_lr: 2e-05
    lr: 0.0
    maximize: False
    weight_decay: 0.01
)
Number of epochs: 10
Total training steps for scheduler: 91730
Warmup steps for scheduler: 9173


## Train the BERT Model

### Subtask:
Execute the training loop over multiple epochs using the prepared training data, optimizing the model's parameters to minimize the chosen loss function.


**Reasoning**:
I will implement the training and validation loop for the BERT model according to the instructions, iterating through epochs, processing data, calculating losses, updating model parameters, and tracking progress.



# Task
Adjust the data splitting logic to use the first 500,000 rows of `X` and `y` for training and the remaining rows for validation, then recreate the `TimeSeriesDataset` and `DataLoader` objects for both sets, confirming that the training data is limited to the specified size.

## Adjust Training Data Split

### Subtask:
Modify the data splitting logic to use the first 500,000 rows of `X` and `y` for training, and the subsequent rows for validation, ensuring no overlap.


**Reasoning**:
The previous training cell ended in a `KeyboardInterrupt`: training was stopped manually, most likely because of the long training time for this dataset size and model complexity. The current subtask adjusts the split to use a fixed number of rows (500,000) for training and the rest for validation. This shrinks the training set from 587,061 to 500,000 sequences, which shortens each epoch, but it enlarges the validation set from 146,766 to 233,827 sequences, so each validation pass takes longer. The next step is to implement this split.



In [23]:
split_index = 500000

X_train, X_val = X[:split_index], X[split_index:]
y_train, y_val = y[:split_index], y[split_index:]

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of X_val: {X_val.shape}")
print(f"Shape of y_val: {y_val.shape}")

Shape of X_train: (500000, 24, 7)
Shape of y_train: (500000,)
Shape of X_val: (233827, 24, 7)
Shape of y_val: (233827,)


**Reasoning**:
Now that the data has been split with the new `split_index`, I need to recreate the `TimeSeriesDataset` and `DataLoader` objects using the updated `X_train`, `y_train`, `X_val`, and `y_val` to reflect the new training and validation set sizes.



In [24]:
from torch.utils.data import DataLoader

# TimeSeriesDataset is reused from the earlier cell that defined it (In [18]).

# Create Dataset objects with the new split
train_dataset = TimeSeriesDataset(X_train, y_train)
val_dataset = TimeSeriesDataset(X_val, y_val)

print(f"New Train dataset size: {len(train_dataset)}")
print(f"New Validation dataset size: {len(val_dataset)}")

# Recreate DataLoader objects
batch_size = 64 # Re-using batch_size from previous setup

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

print(f"New Number of batches in train_loader: {len(train_loader)}")
print(f"New Number of batches in val_loader: {len(val_loader)}")

# Verify one batch from the new train_loader
for X_batch, y_batch in train_loader:
    print(f"Shape of X_batch from new train_loader: {X_batch.shape}")
    print(f"Shape of y_batch from new train_loader: {y_batch.shape}")
    break

New Train dataset size: 500000
New Validation dataset size: 233827
New Number of batches in train_loader: 7813
New Number of batches in val_loader: 3654
Shape of X_batch from new train_loader: torch.Size([64, 24, 7])
Shape of y_batch from new train_loader: torch.Size([64])


## Train the BERT Model

### Subtask:
Execute the training loop over multiple epochs using the prepared training data, optimizing the model's parameters to minimize the chosen loss function.


**Reasoning**:
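I will implement the training and validation loop for the BERT model according to the instructions, iterating through epochs, processing data, calculating losses, updating model parameters, and tracking progress. Note that the `scheduler` from the configuration cell was built for 91,730 steps (9,173 batches per epoch over 10 epochs); after the re-split an epoch has 7,813 batches, so the schedule should be rebuilt for 78,130 steps before a full run.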



In [26]:
import tqdm # For progress bar

# Initialize lists to store training and validation losses
train_losses = []
val_losses = []

print("Starting model training...")

for epoch in range(num_epochs):
    # Training Phase
    model.train() # Set the model to training mode
    total_train_loss = 0
    train_progress_bar = tqdm.tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs} Training")

    for X_batch, y_batch in train_progress_bar:
        # Move batches to the appropriate device
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)

        # Forward pass
        predictions = model(X_batch)

        # Calculate loss
        loss = loss_fn(predictions, y_batch)

        # Backward pass and optimization
        loss.backward()

        # Clip gradients to prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step() # Update model parameters
        scheduler.step() # Update the learning rate scheduler
        optimizer.zero_grad() # Clear gradients

        total_train_loss += loss.item()
        train_progress_bar.set_postfix(loss=loss.item())

    avg_train_loss = total_train_loss / len(train_loader)
    train_losses.append(avg_train_loss)

    # Validation Phase
    model.eval() # Set the model to evaluation mode
    total_val_loss = 0
    val_progress_bar = tqdm.tqdm(val_loader, desc=f"Epoch {epoch+1}/{num_epochs} Validation")

    with torch.no_grad(): # Disable gradient calculations
        for X_batch_val, y_batch_val in val_progress_bar:
            X_batch_val = X_batch_val.to(device)
            y_batch_val = y_batch_val.to(device)

            predictions_val = model(X_batch_val)
            val_loss = loss_fn(predictions_val, y_batch_val)

            total_val_loss += val_loss.item()
            val_progress_bar.set_postfix(loss=val_loss.item())

    avg_val_loss = total_val_loss / len(val_loader)
    val_losses.append(avg_val_loss)

    print(f"Epoch {epoch+1}/{num_epochs}: Train Loss = {avg_train_loss:.4f}, Val Loss = {avg_val_loss:.4f}")

print("Training finished.")

Starting model training...


Epoch 1/10 Training:   0%|          | 9/7813 [02:40<38:42:43, 17.86s/it, loss=0.0682]


KeyboardInterrupt: 
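
At roughly 17.86 s per batch, 7,813 batches come to about 38.8 hours per epoch on this CPU. One common way to reduce the cost, shown here as an assumption rather than something the notebook does, is to freeze the pre-trained encoder and train only the input projection and regression head. The sketch uses a small stand-in module with the same attribute names (`embedding_layer`, `bert`, `reg_head`) so that it runs without downloading BERT; whether a frozen text-pre-trained encoder is useful for this data is an open question.

```python
import torch
import torch.nn as nn

class StandInModel(nn.Module):
    """Hypothetical stand-in that mirrors TimeSeriesBERTModel's attribute names."""

    def __init__(self, input_dim=7, hidden_size=16):
        super().__init__()
        self.embedding_layer = nn.Linear(input_dim, hidden_size)
        # Placeholder for the BERT encoder.
        self.bert = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh())
        self.reg_head = nn.Linear(hidden_size, 1)

demo_model = StandInModel()

# Freeze the backbone so only the projection and regression head receive updates.
for param in demo_model.bert.parameters():
    param.requires_grad = False

# Give the optimizer only the parameters that still require gradients.
trainable = [p for p in demo_model.parameters() if p.requires_grad]
demo_optimizer = torch.optim.AdamW(trainable, lr=2e-5)

n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in demo_model.parameters())
print(f"trainable parameters: {n_trainable} of {n_total}")
```

Freezing removes the backward pass through the encoder's weights; the forward pass through the encoder still has to run for every batch.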

# Task
The model training was interrupted, so its performance is expected to be very poor. However, as per the plan, I will now evaluate the model's performance on the validation set using appropriate regression metrics (MAE, RMSE, R-squared) and then provide a summary.

First, I will make predictions on the validation set using the current state of the model. Then, I will inverse transform the predictions and actual values back to their original scale to calculate the metrics meaningfully. Finally, I will compute and report the evaluation metrics.

## Evaluate Model Performance

### Subtask:
Assess the trained model's performance on the validation set using appropriate regression metrics (MAE, RMSE, R-squared).


**Reasoning**:
The training loop was interrupted, and the task now requires evaluating the model. Following the instructions, I will first prepare to collect predictions and true values from the validation set by setting the model to evaluation mode and initializing necessary lists. Then, I will iterate through the validation data loader, collect predictions, and store them along with the true values. After gathering all predictions and true values, I will perform inverse scaling, calculate MAE, RMSE, and R-squared using `sklearn.metrics`, and finally print the results.



In [27]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# 1. Set the model to evaluation mode
model.eval()

# 2. Initialize empty lists to store predictions and true values
all_predictions = []
all_true_values = []

print("Starting model evaluation...")

# 3. Iterate through the val_loader with torch.no_grad() to perform inference
with torch.no_grad():
    for X_batch_val, y_batch_val in val_loader:
        # a. Move input batches and target batches to the correct device
        X_batch_val = X_batch_val.to(device)
        y_batch_val = y_batch_val.to(device)

        # b. Get model predictions
        predictions_val = model(X_batch_val)

        # c. Append the predictions and true values (converted to NumPy arrays) to their respective lists
        all_predictions.extend(predictions_val.cpu().numpy())
        all_true_values.extend(y_batch_val.cpu().numpy())

# 4. Concatenate all collected predictions and true values into single NumPy arrays
all_predictions = np.array(all_predictions)
all_true_values = np.array(all_true_values)

# 5. Create dummy arrays for predictions and true values for inverse transform
# The scaler was fitted on all input_dim (7) columns of df[numerical_cols], so
# inverse_transform expects arrays of shape (n_samples, input_dim).
# The target is the first feature (index 0, 'Global_active_power'), so the
# values go into column 0 and the remaining columns are zero placeholders.

dummy_predictions = np.zeros((len(all_predictions), input_dim))
dummy_true_values = np.zeros((len(all_true_values), input_dim))

# Place the predictions/true values in the column corresponding to 'Global_active_power' (index 0)
dummy_predictions[:, 0] = all_predictions
dummy_true_values[:, 0] = all_true_values

# 6. Use the scaler.inverse_transform() method on these dummy arrays
# and extract the first column (Global_active_power)
inverse_transformed_predictions = scaler.inverse_transform(dummy_predictions)[:, 0]
inverse_transformed_true_values = scaler.inverse_transform(dummy_true_values)[:, 0]

# 7. Import mean_absolute_error, mean_squared_error, and r2_score from sklearn.metrics (already imported)

# 8. Calculate the Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared score
mae = mean_absolute_error(inverse_transformed_true_values, inverse_transformed_predictions)
mse = mean_squared_error(inverse_transformed_true_values, inverse_transformed_predictions)
rmse = np.sqrt(mse)
r2 = r2_score(inverse_transformed_true_values, inverse_transformed_predictions)

# 9. Print the calculated MAE, RMSE, and R-squared scores
print(f"\nModel Evaluation Metrics on Validation Set (Inverse Transformed Data):")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"R-squared (R2): {r2:.4f}")

print("Evaluation finished.")

Starting model evaluation...


KeyboardInterrupt: 
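
The dummy-array approach in the cell above can be replaced by inverting only the target column with the scaler's fitted attributes. The sketch below runs on synthetic data with fake predictions, so the printed metrics mean nothing for this model; `inverse_scale_column`, `demo_scaler`, and `raw_values` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
raw_values = rng.uniform(0.0, 10.0, size=(200, 7))  # synthetic stand-in data
demo_scaler = MinMaxScaler().fit(raw_values)

def inverse_scale_column(scaled_values, fitted_scaler, col=0):
    # MinMaxScaler computes X_scaled = X * scale_ + min_, so invert that per column.
    return (np.asarray(scaled_values) - fitted_scaler.min_[col]) / fitted_scaler.scale_[col]

scaled_true = demo_scaler.transform(raw_values)[:, 0]
# Fake predictions: the true values plus a little noise, for illustration only.
scaled_pred = scaled_true + rng.normal(scale=0.01, size=scaled_true.shape)

y_true = inverse_scale_column(scaled_true, demo_scaler)
y_pred = inverse_scale_column(scaled_pred, demo_scaler)

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.4f} RMSE={rmse:.4f} R2={r2:.4f}")
```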

# Task
The execution of the last cell was interrupted. Therefore, I cannot generate the summary with actual evaluation metrics. However, I can provide a summary of the BERT model training process and its potential applications, noting the interruption and its likely impact on performance.

### Summary of BERT Model Training for Time-Series Forecasting

This notebook aimed to adapt a pre-trained BERT model for a time-series forecasting task using the household power consumption dataset. The process involved several key steps:

1.  **Data Preparation**:
    *   The `household_power_consumption.txt` dataset was loaded and relevant numerical columns (`Global_active_power`, `Global_reactive_power`, `Voltage`, `Global_intensity`, `Sub_metering_1`, `Sub_metering_2`, `Sub_metering_3`) were identified.
    *   These columns were converted to a numeric type, and missing values were handled using forward-fill and backward-fill imputation to maintain time-series integrity.
    *   The numerical features were then scaled using `MinMaxScaler` to normalize their range, which is essential for neural networks.
    *   Time-series sequences were created with a `sequence_length` of 24 time steps (24 minutes of data at the dataset's 1-minute sampling interval) and a `forecast_horizon` of 1 (predicting `Global_active_power` at the next time step).
    *   Initially, the data was split chronologically into training (80%) and validation (20%) sets. Due to long training times, this was later adjusted to use the first 500,000 rows for training and the remaining for validation.
    *   Finally, custom PyTorch `TimeSeriesDataset` and `DataLoader` objects were created for efficient batching and iteration during training and evaluation.

2.  **Model Definition**:
    *   A `TimeSeriesBERTModel` class was defined, incorporating the pre-trained `bert-base-uncased` model.
    *   An `embedding_layer` (linear layer) was added to project the 7 numerical input features into BERT's `hidden_size` (768).
    *   A `reg_head` (regression head) consisting of two linear layers with a ReLU activation was appended to BERT's output. This head takes the average-pooled `last_hidden_state` from BERT and outputs a single scalar prediction.
    *   The model was moved to the available device (CPU in this case).

3.  **Training Configuration**:
    *   `nn.MSELoss()` was chosen as the loss function, suitable for regression tasks.
    *   The `AdamW` optimizer was initialized with a learning rate of `2e-5`.
    *   The model was configured to train for `10` epochs.
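    *   A `get_linear_schedule_with_warmup` learning rate scheduler was set up for `91730` total training steps, with a warmup phase of `9173` steps. These counts were computed from the original 80/20 split (9,173 batches per epoch) and were not recomputed after the re-split, which yields 7,813 batches per epoch (78,130 steps over 10 epochs).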

4.  **Training Interruption and Impact**:
    *   The training loop was started, but **it was interrupted via a `KeyboardInterrupt` early in the first epoch**, after about 9 of 7,813 batches (roughly 17.9 s per batch on CPU). The model therefore received almost no training.
    *   The subsequent evaluation cell, which runs inference over all 3,654 validation batches, **was also stopped by a `KeyboardInterrupt`** before it produced any metrics.
    *   **Impact on Performance**: No validation metrics (MAE, RMSE, R-squared) were obtained. With only a handful of optimizer steps, all taken at the very start of learning-rate warmup, the input projection and regression head remain close to their random initialization, so any metrics computed from this state **would be poor and not representative of a properly trained model**.

### Potential Applications and Insights (if fully trained)

Had the BERT model completed its training successfully, it could potentially offer several advantages and insights for time-series analysis:

*   **Complex Temporal Dependency Capture**: Self-attention lets every time step in the input window attend to every other step, so the model can represent non-linear dependencies across the window. With the current 24-step window at 1-minute resolution it sees only 24 minutes of history; capturing daily, weekly, or seasonal patterns would require longer windows or resampled (e.g., hourly) data.
*   **Feature Interaction**: Each time step's 7-feature vector is projected into a single "token" embedding, so the model could learn how the features interact within a time step and how those interactions evolve across the window.
*   **Transfer Learning**: BERT's weights are pre-trained on text, so it is not established that they transfer to numerical time series. The Transformer architecture itself is adaptable, but whether the pre-trained weights help would have to be tested, for example against the same architecture trained from random initialization.
*   **Anomaly Detection**: By learning normal consumption patterns, a trained model could flag observations with unusually large forecast errors as potential anomalies, which is useful in many industrial and financial applications.
*   **Interpretation**: Although inherently complex, some interpretability techniques for Transformers could be applied to understand which past time steps or features contribute most to a specific forecast.
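
As an illustration of the anomaly-detection idea above, one common approach is to compare forecasts with observed values and flag unusually large residuals. The sketch below is an assumption-laden example rather than anything from the notebook: `flag_anomalies`, the threshold `k`, and the median-based scale estimate are illustrative choices, and the "forecasts" are synthetic.

```python
import numpy as np


def flag_anomalies(y_true, y_pred, k=3.0):
    """Flag points whose residual lies more than k robust std-devs from the median."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    center = np.median(residuals)
    mad = np.median(np.abs(residuals - center))
    if mad > 0:
        scale = 1.4826 * mad  # MAD rescaled to match std-dev for normal data
    else:
        scale = residuals.std() or 1.0  # fallback for degenerate residuals
    return np.abs(residuals - center) > k * scale


# Synthetic example: forecasts are close to the truth except for one spike.
rng = np.random.default_rng(0)
y_pred = np.zeros(100)
y_true = rng.normal(0.0, 0.1, size=100)
y_true[40] += 5.0  # injected anomaly

flags = flag_anomalies(y_true, y_pred)
print(np.flatnonzero(flags))
```

In practice the residuals would come from a trained model's validation forecasts, and the threshold would need tuning for the data at hand.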

In conclusion, while the setup for adapting BERT to time-series forecasting was logically constructed, the interruption during training prevented any meaningful performance evaluation or the realization of the model's potential. A complete training run would be necessary to derive valid insights and assess its true effectiveness for this task.

## Final Task

### Subtask:
Provide a summary of the BERT model training process and its potential applications or insights gained from using this architecture on the time-series dataset, noting the interruption during training and its likely impact on performance.


## Summary:

### Q&A

**How did the BERT model training process go, what applications or insights could this architecture offer on the time-series dataset, and how did the training interruption likely affect performance?**

The process involved adapting a pre-trained BERT model for time-series forecasting. It began with preparing the `household_power_consumption` dataset: loading it, filling missing values (forward/backward fill), scaling numerical features with `MinMaxScaler`, and creating time-series sequences with a 24-step look-back and a 1-step forecast horizon. The data was split chronologically, with training restricted to the first 500,000 rows because of long processing times. A `TimeSeriesBERTModel` was defined, combining `bert-base-uncased` with a linear `embedding_layer` for the numerical features and a two-layer `reg_head` for regression output. Training was configured with `nn.MSELoss()`, the `AdamW` optimizer, and a linear learning rate scheduler.
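
The scaling and windowing steps described above can be sketched as follows. This is a minimal NumPy illustration, not the notebook's code: `minmax_fit`, `minmax_apply`, and `make_sequences` are hypothetical helpers, the hand-written min-max scaling stands in for `MinMaxScaler`, and both the choice of column 0 as the target and fitting the scaler on training rows only are assumptions made here.

```python
import numpy as np


def minmax_fit(train):
    """Return per-feature (min, range) computed on the training rows only."""
    lo = train.min(axis=0)
    span = train.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    return lo, span


def minmax_apply(data, lo, span):
    """Scale data with statistics obtained from minmax_fit."""
    return (data - lo) / span


def make_sequences(data, sequence_length=24, forecast_horizon=1, target_col=0):
    """Slide a window over data (rows = time steps, columns = features).

    X[i] holds sequence_length consecutive rows; y[i] is the target column
    forecast_horizon steps after the last row of that window.
    """
    n = len(data) - sequence_length - forecast_horizon + 1
    X = np.stack([data[i:i + sequence_length] for i in range(n)])
    first_target = sequence_length + forecast_horizon - 1
    y = data[first_target:first_target + n, target_col]
    return X, y


# Synthetic stand-in for the 7 numeric columns of the dataset.
rng = np.random.default_rng(0)
values = rng.normal(size=(200, 7))

split = 160  # chronological split: earlier rows train, later rows validate
lo, span = minmax_fit(values[:split])
train = minmax_apply(values[:split], lo, span)
val = minmax_apply(values[split:], lo, span)

X_train, y_train = make_sequences(train)
X_val, y_val = make_sequences(val)
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)
```

Fitting the scaling statistics on the training portion alone keeps validation information out of the inputs; validation values can then fall slightly outside the [0, 1] range.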

However, the training process was interrupted early in the first epoch by a `KeyboardInterrupt`, so the parameters were barely updated and no substantial learning took place. The subsequent evaluation was interrupted as well, and any performance measured from this state would be extremely poor and unrepresentative of a properly trained model.

If fully trained, the model could potentially capture complex temporal dependencies and feature interactions due to BERT's self-attention mechanism, offering advanced insights into time-series patterns. Its transfer learning capabilities might allow generalization to other datasets, and it could be applied to anomaly detection.

### Data Analysis Key Findings

*   **Data Preparation:** The `household_power_consumption.txt` dataset was processed, including handling missing values via forward/backward fill, scaling numerical features with `MinMaxScaler`, and creating time-series sequences with a `sequence_length` of 24 and a `forecast_horizon` of 1.
*   **Model Architecture:** A `TimeSeriesBERTModel` was custom-built, integrating `bert-base-uncased` with an `embedding_layer` to project 7 input features to BERT's 768 `hidden_size` and a two-layer `reg_head` for regression output.
*   **Training Interruption:** The model's training was interrupted by a `KeyboardInterrupt` early in the first of 10 epochs, so the model did not undergo substantial learning.
*   **Performance Impact:** Due to the interruption, the model's performance would be "extremely poor and not representative of a properly trained model," reflecting its initial weights rather than patterns learned from the data. The evaluation on the validation set was also interrupted by a `KeyboardInterrupt`.
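
The architecture described in the findings above (a linear projection of the numerical features into BERT's hidden space, followed by a small regression head) could look roughly like the sketch below. It is an approximation under stated assumptions, not the notebook's `TimeSeriesBERTModel`: the class name `TimeSeriesBERTSketch`, the head width `reg_hidden`, the ReLU activation, and the use of the last time step's hidden state are all guesses. To run without downloading weights, the demo builds a small randomly initialized `BertModel` from a `BertConfig` instead of loading `bert-base-uncased`.

```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel


class TimeSeriesBERTSketch(nn.Module):
    """Feed numerical time-series windows to BERT via inputs_embeds."""

    def __init__(self, bert, num_features=7, reg_hidden=128):
        super().__init__()
        self.bert = bert
        hidden = bert.config.hidden_size  # 768 for bert-base-uncased
        # Project each time step's features to BERT's hidden size.
        self.embedding_layer = nn.Linear(num_features, hidden)
        # Two-layer regression head (width and activation are assumptions).
        self.reg_head = nn.Sequential(
            nn.Linear(hidden, reg_hidden),
            nn.ReLU(),
            nn.Linear(reg_hidden, 1),
        )

    def forward(self, x):
        # x: (batch, sequence_length, num_features)
        embeds = self.embedding_layer(x)
        out = self.bert(inputs_embeds=embeds)
        # Assumption: summarize the window by its last time step.
        last_step = out.last_hidden_state[:, -1, :]
        return self.reg_head(last_step).squeeze(-1)


# Small random BERT so the example runs offline; the notebook instead used
# BertModel.from_pretrained("bert-base-uncased").
config = BertConfig(
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=128,
)
model = TimeSeriesBERTSketch(BertModel(config))
model.eval()

with torch.no_grad():
    preds = model(torch.randn(8, 24, 7))  # batch of 8 windows, 24 steps, 7 features
print(preds.shape)
```

Passing `inputs_embeds` bypasses BERT's token-embedding lookup, so no tokenizer is needed for the numerical inputs; BERT's position embeddings are still added inside the model.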

### Insights or Next Steps

*   A complete and uninterrupted training run is essential to properly evaluate the BERT model's effectiveness for time-series forecasting and to realize its potential for capturing complex temporal dependencies and feature interactions.
*   Upon successful training, evaluate the model's performance using standard time-series metrics (e.g., MAE, RMSE, R-squared) and explore its potential applications in anomaly detection or transfer learning to other time-series datasets.
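
For the evaluation step mentioned above, the three metrics can be computed directly from arrays of targets and predictions. The following is a small NumPy sketch with made-up numbers, not results from the notebook; `regression_metrics` is a hypothetical helper name.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R-squared for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    ss_res = np.sum(err ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Illustrative values only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])
metrics = regression_metrics(y_true, y_pred)

# Reference point: always predicting the mean of the targets gives R-squared of 0.
baseline = regression_metrics(y_true, np.full_like(y_true, y_true.mean()))
print(metrics, baseline)
```

Comparing against a trivial baseline such as the mean predictor helps show whether a trained model has learned anything beyond the average level of the series.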
