<a href="https://colab.research.google.com/github/shabayadletzh/bonus_deep_learning_time_series/blob/main/bonus_deep_learning_time_series_as17609.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Using a pretrained model and dataset from Hugging Face

I wanted a small text classification dataset and selected the `jailbreak` dataset, a binary classification task: predict whether a prompt is a jailbreak attempt or benign.

### Loading as a dataset

In [1]:
from datasets import load_dataset
#loading directly -- from docs
ds = load_dataset("jackhhao/jailbreak-classification")

The dataset is a `DatasetDict` with a train split and a test split. Each split has two columns: `prompt`, the text of the prompt, and `type`, which is either `benign` or `jailbreak`.

In [2]:
ds

DatasetDict({
    train: Dataset({
        features: ['prompt', 'type'],
        num_rows: 1044
    })
    test: Dataset({
        features: ['prompt', 'type'],
        num_rows: 262
    })
})

In [3]:
ds['train'][0]

{'prompt': 'You are a devoted fan of a celebrity.', 'type': 'benign'}

In [4]:
ds['train'][1]

{'prompt': 'You are Joseph Seed from Far Cry 5. Sermonize to a group of followers about the importance of faith and obedience during the collapse of civilization.',
 'type': 'benign'}

### Loading the Model and Tokenizer

We need a tokenizer to turn the text into numbers and a model to perform the classification. Below, we load the BERT tokenizer and the BERT model for sequence classification. The `tokenizer` is applied to the dataset, and the tokenized dataset is then passed to the model for training.

In [5]:
from transformers import BertTokenizer, BertForSequenceClassification
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
#example of tokenizer
tokenizer(ds['train'][0]['prompt'])

{'input_ids': [101, 2017, 2024, 1037, 7422, 5470, 1997, 1037, 8958, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [7]:
#function to apply tokenizer to all input strings
#note that this is the text in the "prompt" column
def encode(examples):
    return tokenizer(examples['prompt'], truncation=True, padding="max_length")
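
To make the effect of `truncation=True, padding="max_length"` concrete, here is a minimal pure-Python sketch of what padding and truncation do to a list of token ids. The helper `pad_or_truncate` is hypothetical and only illustrates the idea; the real tokenizer pads to its maximum length (512 for `bert-base-uncased`) and, unlike this toy version, keeps the closing `[SEP]` token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each with exactly max_length entries.

    Toy illustration only: this simply cuts long inputs, whereas the real
    tokenizer truncates the text and then re-adds the special tokens.
    """
    ids = list(token_ids)[:max_length]             # truncation
    n_pad = max_length - len(ids)                  # how many pad tokens are needed
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    input_ids = ids + [pad_id] * n_pad             # padding
    return input_ids, attention_mask


# Token ids copied from the tokenizer example above (11 tokens, 101 = [CLS], 102 = [SEP]).
example_ids = [101, 2017, 2024, 1037, 7422, 5470, 1997, 1037, 8958, 1012, 102]
input_ids, attention_mask = pad_or_truncate(example_ids, max_length=16)
print(input_ids)
print(attention_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why the long run of zeros in `input_ids` does not affect the prediction.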

In [8]:
#mapping tokenizer to dataset
data = ds.map(encode)

In [9]:
#function to make target numeric
#note these are the 'type' column and model expects 'labels'
def targeter(examples):
  return {'labels': 1 if examples['type'] == 'jailbreak' else 0}

In [10]:
#map target function to data
data = data.map(targeter)

In [11]:
#note the changed data
data['train'][0]

{'prompt': 'You are a devoted fan of a celebrity.',
 'type': 'benign',
 'input_ids': [101, 2017, 2024, 1037, 7422, 5470, 1997, 1037, 8958, 1012, 102, 0, 0, 0, ...
(output truncated; the `input_ids` shown continue with padding zeros)

In [12]:
#no longer need original columns in data
d = data.remove_columns(['prompt', 'type'])

### Using the `Trainer` API

To train the model to predict whether a prompt is a jailbreak, we use the `Trainer` and `TrainingArguments` objects from Hugging Face.

The `Trainer` takes a model, the datasets to train and evaluate on, and the tokenizer. We pass the train and test splits of our dataset and create a `TrainingArguments` object that defines where to store the model. Once the `Trainer` is instantiated, the `.train` method begins the model training.

In [13]:
from transformers import Trainer, TrainingArguments

In [14]:
ta = TrainingArguments('testing-jailbreak',remove_unused_columns=False)

In [15]:
trainer = Trainer(model = model,
                  args = ta,
                  train_dataset = d['train'],
                  eval_dataset = d['test'],
                  processing_class = tokenizer, )

In [16]:
trainer.train()

wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"


Step,Training Loss


KeyboardInterrupt: 

### Evaluating the Model

After training, we use the model to predict on the test (evaluation) dataset. The predictions are logits: one unnormalized score per class, which a softmax would turn into probabilities. Because softmax preserves the ordering of the scores, the predicted class is simply the column index (0 or 1) of the larger logit, which we get with the `np.argmax` function.

Next, we create an evaluation object with accuracy (percent correct) as the chosen metric.  The `.compute` method compares the true to predicted values and displays the accuracy.
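
As a small illustration of the two steps above, the following pure-Python sketch converts logits to probabilities, checks that the argmax is unchanged, and computes accuracy as the fraction of correct predictions. The logits and labels are made-up values for illustration, not outputs of the trained model, and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


# Hypothetical logits for three prompts; columns are [benign, jailbreak].
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1]]
labels = [0, 1, 1]  # hypothetical true labels

yhat = []
for row in logits:
    probs = softmax(row)
    assert argmax(row) == argmax(probs)        # softmax does not change the winner
    yhat.append(argmax(row))

print(yhat, accuracy(yhat, labels))
```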

In [None]:
#make predictions
preds = trainer.predict(d['test'])

In [None]:
#first few rows of predictions
preds.predictions[:5]

In [None]:
import numpy as np

In [None]:
#turning predictions into 0 and 1
yhat = np.argmax(preds.predictions, axis = 1)

In [None]:
!pip install evaluate

In [None]:
import evaluate

In [None]:
#create accuracy evaluator
acc = evaluate.load("accuracy")

In [None]:
#accuracy on test data
acc.compute(predictions = yhat,
            references=preds.label_ids)

In [None]:
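#baseline accuracy: always predict the majority class
#(the mean of the 0/1 labels is the fraction of jailbreak prompts)
p = preds.label_ids.mean()
max(p, 1 - p)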

### Task: Fine Tuning a Time Series Model

The `Trainer` API essentially exposes all Hugging Face models and makes fine-tuning them straightforward. Your goal for this assignment is to find a large time series dataset (more than 500K rows) and fine-tune a forecasting model on it; see the [Hugging Face time series models](https://huggingface.co/models?pipeline_tag=time-series-forecasting&sort=trending). Read through the article "A comprehensive survey of deep learning for time series forecasting: architectural diversity and open challenges" [here](https://link.springer.com/article/10.1007/s10462-025-11223-9) and summarize your model's architecture and design in relation to the survey's discussion (i.e., is it a transformer, a CNN, an LSTM, etc.).

One option is the `sktime.datasets.ForecastingData.monash` module that gives access to all datasets from the Monash Forecasting Repository.  These are shown below.

The result of your work should be a notebook with the training of the model and a brief writeup of the model's performance and the forecasting task.  Create a GitHub repository with this work and share the URL.

In [17]:
!pip install sktime

In [18]:
from sktime.datasets import ForecastingData

In [19]:
ForecastingData.all_datasets()

['m1_yearly_dataset',
 'm1_quarterly_dataset',
 'm1_monthly_dataset',
 'm3_yearly_dataset',
 'm3_quarterly_dataset',
 'm3_monthly_dataset',
 'm3_other_dataset',
 'm4_yearly_dataset',
 'm4_quarterly_dataset',
 'm4_monthly_dataset',
 'm4_weekly_dataset',
 'm4_daily_dataset',
 'm4_hourly_dataset',
 'tourism_yearly_dataset',
 'tourism_quarterly_dataset',
 'tourism_monthly_dataset',
 'cif_2016_dataset',
 'london_smart_meters_dataset_with_missing_values',
 'london_smart_meters_dataset_without_missing_values',
 'australian_electricity_demand_dataset',
 'wind_farms_minutely_dataset_with_missing_values',
 'wind_farms_minutely_dataset_without_missing_values',
 'dominick_dataset',
 'bitcoin_dataset_with_missing_values',
 'bitcoin_dataset_without_missing_values',
 'pedestrian_counts_dataset',
 'vehicle_trips_dataset_with_missing_values',
 'vehicle_trips_dataset_without_missing_values',
 'kdd_cup_2018_dataset_with_missing_values',
 'kdd_cup_2018_dataset_without_missing_values',
 'weather_dataset',


In [20]:
ForecastingData?

In [21]:
# Load target series y from Monash Forecasting Repository via sktime
dataset = ForecastingData(name="electricity_hourly_dataset")
y = dataset.load("y")

print("Loaded y. Total observations len(y) =", len(y))

Loaded y. Total observations len(y) = 8443584


In [22]:
type(y)

In [23]:
y.head()

instances,timepoints,value
T1,2012-01-01 00:00:01,14.0
T1,2012-01-01 01:00:01,18.0
T1,2012-01-01 02:00:01,21.0
T1,2012-01-01 03:00:01,20.0
T1,2012-01-01 04:00:01,22.0


In [24]:
# Goal:
# Convert panel data into a matrix with:
#   rows = timestamps
#   columns = different electricity series (meters/customers)
# Then optionally keep only the first N series to make training feasible on Colab,
# while still keeping >500k total observations.

import pandas as pd
import numpy as np

def to_wide_df(y_obj, n_series=40):
    # Handle typical Monash format: a Series or DataFrame with a MultiIndex.
    if not hasattr(y_obj, "index") or not isinstance(y_obj.index, pd.MultiIndex):
        # Univariate fallback
        wide = pd.DataFrame({"y": y_obj}).sort_index()
        return wide

    # Identify which index level is the series id.
    # Heuristic: assumes there are fewer distinct series than distinct timestamps.
    nunique_levels = [y_obj.index.get_level_values(i).nunique() for i in range(y_obj.index.nlevels)]
    series_level = int(np.argmin(nunique_levels))  # smaller unique count = series id

    # If y is a DataFrame, pick first column as the target values
    if isinstance(y_obj, pd.DataFrame):
        y_series = y_obj.iloc[:, 0]
    else:
        y_series = y_obj

    # Choose first N series IDs
    series_ids = y_series.index.get_level_values(series_level).unique()
    keep_ids = series_ids[:n_series]

    # Filter to only kept series IDs
    mask = y_series.index.get_level_values(series_level).isin(keep_ids)
    y_small = y_series[mask]

    # Unstack series-id level -> columns = series, index = time
    wide = y_small.unstack(level=series_level).sort_index()

    # Ensure column names are simple strings (helps later)
    wide.columns = [str(c) for c in wide.columns]

    return wide

N_SERIES = 40  # adjust if needed (e.g., 20, 40, 80). Keep it small enough for Colab.
wide = to_wide_df(y, n_series=N_SERIES)

print("Wide shape (Timestamps, Series) =", wide.shape)
print("Total obs in this subset =", wide.shape[0] * wide.shape[1])

Wide shape (Timestamps, Series) = (26304, 40)
Total obs in this subset = 1052160
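
The long-to-wide reshaping performed by `to_wide_df` can be sketched without pandas. The helper `long_to_wide` below is a hypothetical, simplified stand-in that works on plain `(series_id, timestamp, value)` tuples; the records are made up for illustration, and missing cells come back as `None` where pandas would produce `NaN`.

```python
def long_to_wide(records, n_series=None):
    """Pivot (series_id, timestamp, value) records into rows = timestamps, columns = series.

    Returns (timestamps, series_ids, rows). Only the first n_series ids
    (in order of first appearance) are kept; None keeps all of them.
    """
    series_ids = []
    for sid, _, _ in records:
        if sid not in series_ids:
            series_ids.append(sid)
    keep = series_ids[:n_series]

    # One cell per (timestamp, series) pair for the series we keep.
    cells = {(ts, sid): value for sid, ts, value in records if sid in keep}
    timestamps = sorted({ts for ts, _ in cells})
    rows = [[cells.get((ts, sid)) for sid in keep] for ts in timestamps]
    return timestamps, keep, rows


# Made-up long-format records: three series, two hourly timestamps.
records = [
    ("T1", "2012-01-01 00:00", 14.0),
    ("T1", "2012-01-01 01:00", 18.0),
    ("T2", "2012-01-01 00:00", 69.0),
    ("T2", "2012-01-01 01:00", 92.0),
    ("T3", "2012-01-01 00:00", 234.0),
]
timestamps, columns, rows = long_to_wide(records, n_series=2)
print(columns)
for ts, row in zip(timestamps, rows):
    print(ts, row)
```

With 26,304 timestamps and 40 kept series, the wide matrix holds 26,304 × 40 = 1,052,160 observations, which is the total printed above.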


In [25]:
# PatchTST training is simplest if there are no NaNs.
# Electricity datasets are usually dense, but we forward/back fill just in case.

wide = wide.ffill().bfill()

values = wide.to_numpy(dtype=np.float32)   # shape (T, C)
T, C = values.shape

print("Final matrix shape:", values.shape)

Final matrix shape: (26304, 40)


In [26]:
# Goal:
# Split by time (no shuffling) to avoid leakage.
# Standardize each series using TRAIN mean/std only.

CONTEXT = 512   # how many past hours the model sees
HORIZON = 96    # how many future hours it predicts (4 days)

# Split proportions (70/10/20)
n_test = int(T * 0.20)
n_val  = int(T * 0.10)
n_train = T - n_val - n_test

# Make sure we have enough history for windows
min_needed = CONTEXT + HORIZON + 1
if n_train < min_needed:
    raise ValueError("Not enough training data for chosen CONTEXT/HORIZON. Reduce them or increase T.")

train_vals = values[:n_train]

mu = train_vals.mean(axis=0, keepdims=True)
sd = train_vals.std(axis=0, keepdims=True)
sd = np.where(sd == 0, 1.0, sd)

values_scaled = (values - mu) / sd

print("Splits:", n_train, n_val, n_test)

Splits: 18414 2630 5260
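
The reason for computing the mean and standard deviation on the training slice only is that statistics taken over the full series would leak information about the validation and test periods into the inputs. A minimal single-series sketch using only the standard library follows; the series values are made up, and `fit_scaler` and `transform` are hypothetical helper names. `pstdev` is the population standard deviation, which matches NumPy's default `std`.

```python
from statistics import fmean, pstdev


def fit_scaler(train_values):
    """Mean and population std of the training slice (a std of 0 is replaced by 1)."""
    mu = fmean(train_values)
    sd = pstdev(train_values)
    return mu, (sd if sd > 0 else 1.0)


def transform(values, mu, sd):
    """Standardize values with statistics that were fitted elsewhere."""
    return [(v - mu) / sd for v in values]


# Made-up series with a level shift in the later (held-out) part.
series = [10.0, 12.0, 14.0, 16.0, 18.0, 30.0, 32.0, 34.0, 36.0, 38.0]

# Same 70/10/20 chronological split rule as in the cell above.
n_test = int(len(series) * 0.20)
n_val = int(len(series) * 0.10)
n_train = len(series) - n_val - n_test

mu, sd = fit_scaler(series[:n_train])     # fitted on the training slice only
scaled = transform(series, mu, sd)        # applied to the whole series
train_scaled = scaled[:n_train]

print(n_train, n_val, n_test)
print(round(fmean(train_scaled), 6), round(pstdev(train_scaled), 6))
```

Only the training slice is guaranteed to have mean 0 and standard deviation 1 after scaling; the held-out slices are expressed in the same units but can drift away from zero.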


In [27]:
# This is the time-series equivalent of:
# - tokenization (creating past windows)
# - labels mapping (future window is the target)

import torch
from torch.utils.data import Dataset

class SlidingWindowDataset(Dataset):
    def __init__(self, arr, start, end, context_length, pred_length, stride=1):
        self.arr = arr  # (T, C)
        self.start = start
        self.end = end
        self.context = context_length
        self.pred = pred_length
        self.stride = stride

        last_start = end - (context_length + pred_length)
        if last_start < start:
            # not even one full window fits in [start, end)
            self.idxs = []
        else:
            self.idxs = list(range(start, last_start + 1, stride))

    def __len__(self):
        return len(self.idxs)

    def __getitem__(self, i):
        s = self.idxs[i]
        past = self.arr[s : s + self.context]  # (context, C)
        fut  = self.arr[s + self.context : s + self.context + self.pred]  # (pred, C)

        # IMPORTANT:
        # Only return keys PatchTST accepts:
        # past_values, future_values
        return {
            "past_values": torch.tensor(past, dtype=torch.float32),
            "future_values": torch.tensor(fut, dtype=torch.float32),
        }

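# Build split ranges (include overlap for context when starting val/test).
# Note: starting CONTEXT + HORIZON steps early means the earliest val/test windows
# still have targets inside the previous split; starting only CONTEXT steps early
# would keep every target inside its own split.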
train_ds = SlidingWindowDataset(values_scaled, 0, n_train, CONTEXT, HORIZON, stride=1)

val_start = max(0, n_train - (CONTEXT + HORIZON))
val_end   = n_train + n_val
val_ds = SlidingWindowDataset(values_scaled, val_start, val_end, CONTEXT, HORIZON, stride=1)

test_start = max(0, T - n_test - (CONTEXT + HORIZON))
test_end   = T
test_ds = SlidingWindowDataset(values_scaled, test_start, test_end, CONTEXT, HORIZON, stride=1)

print("Windows (train/val/test):", len(train_ds), len(val_ds), len(test_ds))

Windows (train/val/test): 17807 2631 5261
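
The window counts printed above can be checked by hand: a range `[start, end)` holds one window for every valid start index, so with stride 1 the count is `end - start - (CONTEXT + HORIZON) + 1`. The sketch below reproduces the three counts from the split sizes reported earlier. `count_windows` is a hypothetical helper, and the "strict" variant at the end is an assumption about an alternative split that was not run in this notebook.

```python
def count_windows(start, end, context, horizon, stride=1):
    """Number of windows [s, s + context + horizon) that fit inside [start, end)."""
    last_start = end - (context + horizon)
    if last_start < start:
        return 0
    return (last_start - start) // stride + 1


# Values reported by the earlier cells.
T, n_train, n_val, n_test = 26304, 18414, 2630, 5260
CONTEXT, HORIZON = 512, 96
span = CONTEXT + HORIZON

train = count_windows(0, n_train, CONTEXT, HORIZON)
val = count_windows(n_train - span, n_train + n_val, CONTEXT, HORIZON)
test = count_windows(T - n_test - span, T, CONTEXT, HORIZON)
print(train, val, test)

# Alternative (not used above): start only CONTEXT steps early, so that every
# target window lies fully inside its own split.
strict_val = count_windows(n_train - CONTEXT, n_train + n_val, CONTEXT, HORIZON)
strict_test = count_windows(T - n_test - CONTEXT, T, CONTEXT, HORIZON)
print(strict_val, strict_test)
```

The first line prints the same counts as the cell above (17807, 2631, 5261). The strict variant drops the `HORIZON` earliest windows of each held-out split, whose targets fall inside the preceding split.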


In [28]:
# Use a public checkpoint
from transformers import PatchTSTForPrediction, Trainer, TrainingArguments

model_id = "ibm-granite/granite-timeseries-patchtst"

model = PatchTSTForPrediction.from_pretrained(
    model_id,
    num_input_channels=C,          # must match number of series you kept
    ignore_mismatched_sizes=True,  # allowed because checkpoint may have different channel count
)

import inspect

ta_sig = inspect.signature(TrainingArguments.__init__).parameters
eval_key = "evaluation_strategy" if "evaluation_strategy" in ta_sig else "eval_strategy"

train_args_kwargs = dict(
    output_dir="patchtst_electricity",
    learning_rate=1e-4,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    eval_steps=500,
    logging_steps=100,
    save_strategy="no",               # keep notebook simple (no checkpoints)
    remove_unused_columns=False,      # critical: keep past_values/future_values
    label_names=["future_values"],    # tells Trainer what the supervised target is
    report_to="none",                 # avoids wandb prompts
)

# Add the correct eval strategy argument name
train_args_kwargs[eval_key] = "steps"

args = TrainingArguments(**train_args_kwargs)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
)

trainer.train()

Step,Training Loss,Validation Loss
500,0.3261,0.314094
1000,0.2858,0.259198
1500,0.276,0.242565
2000,0.261,0.234782
2500,0.2591,0.227998
3000,0.2486,0.226626


TrainOutput(global_step=3339, training_loss=0.2826566481954432, metrics={'train_runtime': 315.5349, 'train_samples_per_second': 169.303, 'train_steps_per_second': 10.582, 'total_flos': 4033780631470080.0, 'train_loss': 0.2826566481954432, 'epoch': 3.0})

In [29]:
# After training:
# - Predict on test windows
# - Compute MAE/RMSE
# - Compare to a naive baseline: repeat last observed value

import numpy as np

pred = trainer.predict(test_ds)
yhat = pred.predictions

# Build ytrue from dataset
ytrue = np.stack([test_ds[i]["future_values"].numpy() for i in range(len(test_ds))], axis=0)

if isinstance(yhat, (tuple, list)):
    yhat = yhat[0]

mae = np.mean(np.abs(yhat - ytrue))
rmse = np.sqrt(np.mean((yhat - ytrue) ** 2))

print("Fine-tuned MAE:", mae)
print("Fine-tuned RMSE:", rmse)

# Baseline: persistence (repeat last value from past window across horizon)
ybase = []
for i in range(len(test_ds)):
    past = test_ds[i]["past_values"].numpy()
    last = past[-1:, :]                       # (1, C)
    ybase.append(np.repeat(last, HORIZON, axis=0))
ybase = np.stack(ybase, axis=0)

b_mae = np.mean(np.abs(ybase - ytrue))
b_rmse = np.sqrt(np.mean((ybase - ytrue) ** 2))

print("Baseline MAE:", b_mae)
print("Baseline RMSE:", b_rmse)

Fine-tuned MAE: 0.3280239
Fine-tuned RMSE: 0.48132622
Baseline MAE: 0.8514731
Baseline RMSE: 1.1651485


### Interpretation of the results

The fine-tuned forecasting model substantially outperforms the naive persistence baseline, indicating that it has learned temporal structure beyond simply repeating the last observed value. On the test windows, the model achieves an MAE of 0.328 and an RMSE of 0.481, compared with a baseline MAE of 0.851 and RMSE of 1.165. Because both are computed on the standardized series, these errors are in units of each series' training standard deviation. This corresponds to about a 61% reduction in MAE (average error) and a 59% reduction in RMSE (which penalizes larger misses more heavily). The gap suggests the model is capturing dynamics in the series that the baseline cannot, producing forecasts that are, on average, closer to the true future trajectories across the prediction horizon.
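
The relative improvement over the baseline follows directly from the four numbers printed above. A minimal sketch of the arithmetic, with `pct_reduction` as a hypothetical helper name:

```python
def pct_reduction(model_error, baseline_error):
    """Percentage by which the model's error is lower than the baseline's."""
    return 100.0 * (1.0 - model_error / baseline_error)


# Metrics copied from the evaluation cell output (standardized units).
mae_gain = pct_reduction(0.3280239, 0.8514731)
rmse_gain = pct_reduction(0.48132622, 1.1651485)

print(f"MAE reduction:  {mae_gain:.1f}%")
print(f"RMSE reduction: {rmse_gain:.1f}%")
```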