<a href="https://colab.research.google.com/github/suprabhathk/FoundationalModels_TimeSeries_Epidemics/blob/main/LagLlama_FoundationModels_Workflow_ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Reference: [LagLlama Paper](https://colab.research.google.com/drive/1Fs8MiR5shVx751Nv2a4n-NRkb8VTVL0d)**

### **Installing and importing LagLlama dependencies via Huggingface**

In [None]:
!git clone -b update-gluonts https://github.com/time-series-foundation-models/lag-llama/

Cloning into 'lag-llama'...
remote: Enumerating objects: 486, done.
remote: Counting objects: 100% (172/172), done.
remote: Compressing objects: 100% (64/64), done.
remote: Total 486 (delta 147), reused 108 (delta 108), pack-reused 314 (from 3)
Receiving objects: 100% (486/486), 276.83 KiB | 7.69 MiB/s, done.
Resolving deltas: 100% (246/246), done.


In [None]:
cd /content/lag-llama

/content/lag-llama


In [None]:
pip install -r requirements.txt --quiet

In [None]:
!huggingface-cli download time-series-foundation-models/Lag-Llama lag-llama.ckpt --local-dir /content/lag-llama

/content/lag-llama/lag-llama.ckpt


### **Importing the relevant packages to run the time series forecasting model**

In [None]:
pip install gluonts==0.14.4



In [None]:
from itertools import islice

from matplotlib import pyplot as plt
import matplotlib.dates as mdates

import torch
from gluonts.evaluation import make_evaluation_predictions, Evaluator
from gluonts.dataset.repository.datasets import get_dataset

from gluonts.dataset.pandas import PandasDataset
import pandas as pd

from lag_llama.gluon.estimator import LagLlamaEstimator

In [None]:
import torch
from gluonts.torch.distributions.studentT import StudentTOutput
from gluonts.torch.modules.loss import NegativeLogLikelihood

# Add necessary objects to safe globals for weights-only loading
torch.serialization.add_safe_globals([StudentTOutput, NegativeLogLikelihood])

### **Some helper functions for prediction purposes**

In [None]:
import sys
from types import ModuleType

# Create dummy module hierarchy
def create_dummy_module(module_path):
    """
    Create a dummy module hierarchy for the given path.
    Returns the leaf module.
    """
    parts = module_path.split('.')
    current = ''
    parent = None

    for part in parts:
        current = current + '.' + part if current else part
        if current not in sys.modules:
            module = ModuleType(current)
            sys.modules[current] = module
            if parent:
                setattr(sys.modules[parent], part, module)
        parent = current

    return sys.modules[module_path]

# Create the dummy gluonts module hierarchy
gluonts_module = create_dummy_module('gluonts.torch.modules.loss')

# Create dummy classes for the specific loss functions
class DistributionLoss:
    def __init__(self, *args, **kwargs):
        pass

    def __call__(self, *args, **kwargs):
        return 0.0

    def __getattr__(self, name):
        return lambda *args, **kwargs: None

class NegativeLogLikelihood:
    def __init__(self, *args, **kwargs):
        pass

    def __call__(self, *args, **kwargs):
        return 0.0

    def __getattr__(self, name):
        return lambda *args, **kwargs: None

# Add the specific classes to the module
gluonts_module.DistributionLoss = DistributionLoss
gluonts_module.NegativeLogLikelihood = NegativeLogLikelihood

**Lag Llama prediction function**

We define a reusable Lag-Llama inference function that works for every dataset used below. It returns the forecasts for the given prediction horizon. Each forecast holds samples of shape `(num_samples, prediction_length)`, where `num_samples` is the number of samples drawn from the predicted probability distribution at each timestep.
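To make the `(num_samples, prediction_length)` shape concrete, here is a minimal sketch that reduces such an array to a median path and an 80% interval. The random array is only a stand-in for real forecast samples, and `summarize_samples` is a hypothetical helper, not part of Lag-Llama or GluonTS.

```python
import numpy as np

def summarize_samples(samples, lower_q=0.1, upper_q=0.9):
    """Collapse (num_samples, prediction_length) samples into per-timestep summaries."""
    samples = np.asarray(samples)
    return {
        "median": np.median(samples, axis=0),            # one value per timestep
        "lower": np.quantile(samples, lower_q, axis=0),  # lower band per timestep
        "upper": np.quantile(samples, upper_q, axis=0),  # upper band per timestep
    }

# Stand-in data: 100 sampled paths over a 24-step horizon (not real forecasts).
rng = np.random.default_rng(0)
fake_samples = rng.normal(size=(100, 24))
summary = summarize_samples(fake_samples)
print(summary["median"].shape)  # (24,)
```

With a real GluonTS sample forecast, the same helper could be applied to `forecast.samples`.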

In [None]:
def get_lag_llama_predictions(dataset, prediction_length, device, context_length=32, use_rope_scaling=False, num_samples=100):
    ckpt = torch.load("lag-llama.ckpt", map_location=device)  # Load the checkpoint onto the device passed by the caller
    estimator_args = ckpt["hyper_parameters"]["model_kwargs"]

    rope_scaling_arguments = {
        "type": "linear",
        "factor": max(1.0, (context_length + prediction_length) / estimator_args["context_length"]),
    }

    estimator = LagLlamaEstimator(
        ckpt_path="lag-llama.ckpt",
        prediction_length=prediction_length,
        context_length=context_length, # Lag-Llama was trained with a context length of 32, but can work with any context length

        # estimator args
        input_size=estimator_args["input_size"],
        n_layer=estimator_args["n_layer"],
        n_embd_per_head=estimator_args["n_embd_per_head"],
        n_head=estimator_args["n_head"],
        scaling=estimator_args["scaling"],
        time_feat=estimator_args["time_feat"],
        rope_scaling=rope_scaling_arguments if use_rope_scaling else None,

        batch_size=1,
        num_parallel_samples=100,
        device=device,
    )

    lightning_module = estimator.create_lightning_module()
    transformation = estimator.create_transformation()
    predictor = estimator.create_predictor(transformation, lightning_module)

    forecast_it, ts_it = make_evaluation_predictions(
        dataset=dataset,
        predictor=predictor,
        num_samples=num_samples
    )
    forecasts = list(forecast_it)
    tss = list(ts_it)

    return forecasts, tss
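
The linear RoPE scaling factor computed in the function above is plain arithmetic, so a small worked example may help. This sketch assumes a trained context length of 32, as the code comment states; in the notebook the actual value is read from the checkpoint. `rope_scaling_factor` is a hypothetical name used only for illustration.

```python
def rope_scaling_factor(context_length, prediction_length, trained_context_length):
    """Linear RoPE scaling factor, mirroring the expression in the function above."""
    return max(1.0, (context_length + prediction_length) / trained_context_length)

# Context 32 plus horizon 24 spans 56 positions, 1.75x the assumed trained length.
print(rope_scaling_factor(32, 24, 32))  # 1.75
# Shorter spans are never scaled down: the factor is floored at 1.0.
print(rope_scaling_factor(8, 8, 32))    # 1.0
```

The factor only takes effect when `use_rope_scaling=True`; otherwise the estimator receives `rope_scaling=None`.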

# Loading datasets of different types

This section shows how to load data stored in different formats. It follows the tutorial by the authors of GluonTS at https://ts.gluon.ai/stable/tutorials/data_manipulation/pandasdataframes.html. We thank the authors of GluonTS for putting together such a detailed tutorial.

## Important Points to Note

1. The prediction function provided in this notebook performs a prediction autoregressively for the last `prediction_length` steps in the dataset passed.

For now, to forecast beyond the observed data, include the future timestamps in the CSV/dataframe (with a dummy value) and set the prediction length to the required horizon.

2. Lag-Llama needs a minimum context of `32` timestamps before the first prediction timestamp. Beyond those `32` timestamps, it can use up to `1092` more timestamps of history for the lags. This extra history is optional, but performance tends to improve as you provide more context, up to `32 + 1092` timestamps.

Even so, keep the context length passed below at `32`. Lag-Llama automatically uses context beyond `32` for the lags when it is available.
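
The two points above can be sketched as a small preparation step: append the future timestamps with a dummy value, then confirm that enough history precedes the forecast window. This is a minimal sketch under stated assumptions: the dataframe is indexed by timestamps, has `target` and `item_id` columns, and uses a weekly `W-MON` frequency. `pad_future_steps` and `has_minimum_context` are hypothetical helpers, not part of Lag-Llama or GluonTS.

```python
import pandas as pd

MIN_CONTEXT = 32  # minimum context length stated in the notes above

def pad_future_steps(df, prediction_length, freq, dummy_value=0.0):
    """Append `prediction_length` future rows carrying a dummy target value."""
    # Assumes the last observed timestamp lies on the `freq` grid, so
    # date_range returns it first and we drop it with [1:].
    future_index = pd.date_range(df.index[-1], periods=prediction_length + 1, freq=freq)[1:]
    future = pd.DataFrame(
        {"target": dummy_value, "item_id": df["item_id"].iloc[-1]},
        index=future_index,
    )
    return pd.concat([df, future])

def has_minimum_context(df, prediction_length, min_context=MIN_CONTEXT):
    """True if at least `min_context` rows precede the forecast window."""
    return len(df) - prediction_length >= min_context

# Toy history: 40 weekly observations starting on a Monday.
history = pd.DataFrame(
    {"target": [float(i) for i in range(40)], "item_id": "ILI"},
    index=pd.date_range("2024-01-01", periods=40, freq="W-MON"),
)
padded = pad_future_steps(history, prediction_length=24, freq="W-MON")
print(len(padded), has_minimum_context(padded, prediction_length=24))
```

The dummy values are never used as inputs for the forecast window itself; they only reserve the timestamps that the prediction function treats as the horizon.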


## **Loading the data: in this case, we use the Influcast repository**

In [None]:
# Importing dependencies for data import
import pandas as pd
import requests
from io import StringIO

In [None]:
def get_last_available_week_data(year):
    """
    Try to get the last available weekly data for a season by checking weeks in reverse order
    """
    base_url = "https://raw.githubusercontent.com/Predizioni-Epidemiologiche-Italia/Influcast/main/sorveglianza/ILI"
    season = f"{year}-{year+1}"

    # Try weeks in reverse order for the second year (most likely to have the final data)
    for week in range(20, 0, -1):  # Try weeks 20 down to 1
        file_name = f"italia-{year+1}_{week:02d}-ILI.csv"
        url = f"{base_url}/{season}/{file_name}"

        try:
            df = pd.read_csv(url)
            print(f"Found data for {season} at week {week} of {year+1}")
            return df
        except Exception:  # File missing or unreadable for this week: try the next one
            continue

    # If not found, try end weeks of first year
    for week in range(53, 39, -1):  # Try weeks 53 down to 40
        file_name = f"italia-{year}_{week:02d}-ILI.csv"
        url = f"{base_url}/{season}/{file_name}"

        try:
            df = pd.read_csv(url)
            print(f"Found data for {season} at week {week} of {year}")
            return df
        except Exception:  # File missing or unreadable for this week: try the next one
            continue

    print(f"No data found for season {season}")
    return None
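
To make the search order of the function above easier to inspect without any network access, the following sketch only generates the candidate URLs in the same order. `candidate_urls` is a hypothetical helper introduced for illustration; the URL pattern is copied from the function above.

```python
BASE_URL = "https://raw.githubusercontent.com/Predizioni-Epidemiologiche-Italia/Influcast/main/sorveglianza/ILI"

def candidate_urls(year):
    """Yield candidate file URLs in the order the function above tries them."""
    season = f"{year}-{year + 1}"
    # Weeks 20..1 of the second calendar year, then weeks 53..40 of the first.
    attempts = [(year + 1, week) for week in range(20, 0, -1)]
    attempts += [(year, week) for week in range(53, 39, -1)]
    for file_year, week in attempts:
        yield f"{BASE_URL}/{season}/italia-{file_year}_{week:02d}-ILI.csv"

urls = list(candidate_urls(2003))
print(len(urls))  # number of candidates per season
print(urls[0])    # first URL that would be requested
```

Listing the candidates this way shows that each season costs at most 34 requests before the function gives up.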

In [None]:
def get_latest_data(year):
    """Helper function to get latest format data (for 2023 onwards)"""
    base_url = "https://raw.githubusercontent.com/Predizioni-Epidemiologiche-Italia/Influcast/main/sorveglianza/ILI"
    season = f"{year}-{year+1}"
    url = f"{base_url}/{season}/latest/italia-latest-ILI.csv"

    try:
        df = pd.read_csv(url)
        return df
    except Exception as e:
        print(f"Error importing {season}: {str(e)}")
        return None


In [None]:
def import_flu_data(start_year=2003, end_year=2024):
    """
    Import flu data from GitHub repository:
    - For 2003-2022: Gets last available weekly data
    - For 2023 onwards: Gets latest format data
    """
    all_dfs = []

    # Handle older years (2003-2022)
    for year in range(start_year, 2023):
        print(f"\nProcessing {year}-{year+1} season...")
        df = get_last_available_week_data(year)
        if df is not None:
            print(f"Entries found: {len(df)}")
            all_dfs.append(df)

    # Handle newer seasons with latest format (2023 onwards)
    for year in range(2023, end_year + 1):
        print(f"\nProcessing {year}-{year+1} season (latest format)...")
        df = get_latest_data(year)
        if df is not None:
            print(f"Entries found: {len(df)}")
            all_dfs.append(df)

    # Combine all data
    if all_dfs:
        combined_df = pd.concat(all_dfs, ignore_index=True)
        print(f"\nFinal dataset shape: {combined_df.shape}")
        return combined_df
    else:
        raise ValueError("No data was successfully imported")

In [None]:
# Import all data
try:
    df_influcast = import_flu_data()
except Exception as e:
    print(f"Error: {str(e)}")


Processing 2003-2004 season...
Found data for 2003-2004 at week 17 of 2004
Entries found: 28

Processing 2004-2005 season...
Found data for 2004-2005 at week 16 of 2005
Entries found: 28

Processing 2005-2006 season...
Found data for 2005-2006 at week 17 of 2006
Entries found: 28

Processing 2006-2007 season...
Found data for 2006-2007 at week 17 of 2007
Entries found: 28

Processing 2007-2008 season...
Found data for 2007-2008 at week 17 of 2008
Entries found: 28

Processing 2008-2009 season...
Found data for 2008-2009 at week 17 of 2009
Entries found: 28

Processing 2009-2010 season...
Found data for 2009-2010 at week 15 of 2010
Entries found: 27

Processing 2010-2011 season...
Found data for 2010-2011 at week 17 of 2011
Entries found: 28

Processing 2011-2012 season...
Found data for 2011-2012 at week 17 of 2012
Entries found: 28

Processing 2012-2013 season...
Found data for 2012-2013 at week 17 of 2013
Entries found: 28

Processing 2013-2014 season...
Found data for 2013-2014 at 

In [None]:
df_influcast

Unnamed: 0,anno,settimana,incidenza,numero_casi,numero_assistiti,target
0,2003,42,0.360000,357,1000656,ILI
1,2003,43,0.470000,500,1066723,ILI
2,2003,44,0.520000,597,1150866,ILI
3,2003,45,0.600000,723,1204797,ILI
4,2003,46,0.590000,742,1251026,ILI
...,...,...,...,...,...,...
597,2024,52,10.507239,23238,2211618,ILI
598,2025,1,12.355276,27339,2212739,ILI
599,2025,2,14.768421,32995,2234159,ILI
600,2025,3,15.888903,33734,2123117,ILI


In [None]:
duplicates = df_influcast[df_influcast.duplicated(subset=['anno', 'settimana'], keep=False)]
print(duplicates.sort_values(['anno', 'settimana']))

Empty DataFrame
Columns: [anno, settimana, incidenza, numero_casi, numero_assistiti, target]
Index: []


In [None]:
df = df_influcast.copy()
df['timestamp'] = pd.to_datetime(df['anno'].astype(str) + '-' + df['settimana'].astype(str) + '-1', format='%Y-%W-%w').dt.strftime('%Y-%m-%d 00:00:00')
df = df[['timestamp', 'incidenza']].rename(columns={'incidenza': 'target'})
df['item_id'] = 'ILI'
df = df.reset_index(drop=True)
df = df.rename(columns={'timestamp': ''})
df

Unnamed: 0,Unnamed: 1,target,item_id
0,2003-10-20 00:00:00,0.360000,ILI
1,2003-10-27 00:00:00,0.470000,ILI
2,2003-11-03 00:00:00,0.520000,ILI
3,2003-11-10 00:00:00,0.600000,ILI
4,2003-11-17 00:00:00,0.590000,ILI
...,...,...,...
597,2024-12-23 00:00:00,10.507239,ILI
598,2025-01-06 00:00:00,12.355276,ILI
599,2025-01-13 00:00:00,14.768421,ILI
600,2025-01-20 00:00:00,15.888903,ILI
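
Note that the output above jumps from 2024-12-23 (2024, week 52) to 2025-01-06 (2025, week 1), a two-week step. One possible explanation is that `%W` counts weeks from the first Monday of the calendar year, which differs from ISO 8601 week numbering. The sketch below assumes that `settimana` holds ISO week numbers (an assumption, not something the data confirms here) and uses only the standard library. `iso_week_monday` is a hypothetical helper.

```python
from datetime import date

def iso_week_monday(year, week):
    """Return the Monday of the given ISO year/week (requires Python 3.8+)."""
    return date.fromisocalendar(year, week, 1)

# Under ISO numbering, consecutive weeks across the year boundary stay 7 days apart.
for year, week in [(2024, 52), (2025, 1), (2025, 2)]:
    print(year, week, iso_week_monday(year, week).isoformat())
```

Under that assumption, week 1 of 2025 maps to 2024-12-30, so the series keeps a regular 7-day spacing across the year boundary.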


### **Converting the data into a timeseries friendly format for analysis**

In [None]:
import pandas as pd
from gluonts.dataset.pandas import PandasDataset
from gluonts.time_feature.lag import get_lags_for_frequency
from pandas.tseries.frequencies import to_offset

In [None]:
# Set numerical columns as float32
for col in df.columns:
    # Check if column is not of string type and not a timestamp
    if df[col].dtype != 'object' and not pd.api.types.is_string_dtype(df[col]) and df[col].dtype != 'datetime64[ns]':
        df[col] = df[col].astype('float32')
df

Unnamed: 0,Unnamed: 1,target,item_id
0,2003-10-20 00:00:00,0.360000,ILI
1,2003-10-27 00:00:00,0.470000,ILI
2,2003-11-03 00:00:00,0.520000,ILI
3,2003-11-10 00:00:00,0.600000,ILI
4,2003-11-17 00:00:00,0.590000,ILI
...,...,...,...
597,2024-12-23 00:00:00,10.507239,ILI
598,2025-01-06 00:00:00,12.355275,ILI
599,2025-01-13 00:00:00,14.768420,ILI
600,2025-01-20 00:00:00,15.888903,ILI


In [None]:
# Create the PandasDataset from the long-format dataframe
dataset = PandasDataset.from_long_dataframe(df, target="target", item_id="item_id", unchecked=True)
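
Because `unchecked=True` skips consistency checks on the index, it can be worth inspecting the spacing of the timestamps before building the dataset. Each season above contributes roughly 28 weekly entries, so gaps between seasons are expected. The sketch below is a pandas-only check under the assumption that the series should be weekly; `summarize_gaps` is a hypothetical helper.

```python
import pandas as pd

def summarize_gaps(timestamps, expected=pd.Timedelta(weeks=1)):
    """Count steps between consecutive timestamps that differ from `expected`."""
    ts = pd.to_datetime(pd.Series(timestamps)).sort_values().reset_index(drop=True)
    steps = ts.diff().dropna()               # time difference between neighbours
    irregular = steps[steps != expected]     # steps that are not exactly one week
    return {
        "n_points": len(ts),
        "n_irregular_steps": len(irregular),
        "largest_step": steps.max(),
    }

# Three timestamps taken from the output above: one 14-day step, one 7-day step.
result = summarize_gaps(["2024-12-23", "2025-01-06", "2025-01-13"])
print(result)
```

How to handle irregular steps (reindexing to a full weekly grid, filling off-season weeks, or forecasting within a season only) is a modelling choice that this sketch does not make.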

In [None]:
backtest_dataset = dataset
prediction_length = 24  # Define your prediction length
num_samples = 100 # number of samples sampled from the probability distribution for each timestep
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu") # You can switch this to CPU or other GPUs if you'd like, depending on your environment

### **Prediction - Zero Shot**

In [None]:
forecasts, tss = get_lag_llama_predictions(backtest_dataset, prediction_length, device, num_samples=num_samples)

  ckpt = torch.load("lag-llama.ckpt", map_location=device) # Uses GPU since in this Colab we use a GPU.


ValueError: Invalid frequency: QE
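
`QE` is the quarter-end frequency alias that pandas introduced in version 2.2. One possible cause of the error above (an assumption, not a confirmed diagnosis) is that the installed pandas predates that alias while some code in the stack requests it. The sketch below only checks which aliases the installed pandas recognizes; `alias_supported` is a hypothetical helper.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

def alias_supported(alias):
    """Return True if the installed pandas accepts `alias` as a frequency string."""
    try:
        to_offset(alias)
        return True
    except ValueError:
        return False

print("pandas", pd.__version__)
for alias in ["W", "D", "QE"]:
    print(alias, alias_supported(alias))
```

If `QE` is reported as unsupported, comparing the pandas version against the versions pinned in `requirements.txt` and by `gluonts==0.14.4` would be a reasonable next step.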