<a href="https://colab.research.google.com/github/samipn/Pycaret/blob/main/08_timeseries_univariate_with_exog_pm25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Time Series — Univariate with Exogenous Variables (Beijing PM2.5) — PyCaret

# ✅ Enable GPU in Colab
*Runtime → Change runtime type → **T4 / L4 GPU** → Save.*  
This notebook sets `use_gpu=True` in `setup()`. Models that support GPU (e.g., XGBoost, CatBoost) use it automatically when one is available.

Target: `pm2.5`, with exogenous features (dew point, temperature, pressure, wind speed, snow, rain).

In [1]:
!pip install pycaret[full]

Collecting pycaret[full]
  Downloading pycaret-3.3.2-py3-none-any.whl.metadata (17 kB)
Collecting numpy<1.27,>=1.21 (from pycaret[full])
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting pandas<2.2.0 (from pycaret[full])
  Downloading pandas-2.1.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting scipy<=1.11.4,>=1.6.1 (from pycaret[full])
  Downloading scipy-1.11.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting joblib<1.4,>=1.2.0 (from pycaret[full])
  Downloading joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Collecting pyod>=1.1.3 (from pycaret[full])
  Downloading pyod-2.0.5-py3-none-any.whl.meta

In [30]:
import pandas as pd

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pollution.csv"
raw = pd.read_csv(url)

# Build datetime from the original parts; this avoids any ambiguity
raw['ds'] = pd.to_datetime(raw[['year','month','day','hour']], errors='coerce')

# Friendly names via dict (not by position)
rename_map = {
    'pm2.5': 'y',   # target
    'DEWP': 'dew',
    'TEMP': 'temp',
    'PRES': 'press',
    'cbwd': 'wnd_dir',
    'Iws':  'wnd_spd',
    'Is':   'snow',
    'Ir':   'rain',
}
df = raw.rename(columns=rename_map)

# Keep what we need and ensure numeric types (coerce handles 'NA' strings)
keep = ['ds','y','dew','temp','press','wnd_spd','snow','rain']
df = df[keep].copy()  # explicit copy avoids pandas' SettingWithCopyWarning in the loop below
for c in keep[1:]:
    df[c] = pd.to_numeric(df[c], errors='coerce')

# Drop rows with missing datetime/target before resampling
df = df.dropna(subset=['ds','y'])

# Daily aggregation
daily = df.set_index('ds').resample('D').mean().reset_index().dropna()

# Reindex to the complete daily range, then drop the filler rows again.
# Net effect: the same days as before (missing days stay missing), but the
# integer index now counts calendar days from the first date, so it has gaps.
start_date = daily['ds'].min()
end_date = daily['ds'].max()
all_dates = pd.date_range(start=start_date, end=end_date, freq='D')
daily = daily.set_index('ds').reindex(all_dates).reset_index()
daily = daily.rename(columns={'index':'ds'})  # reset_index names the date column 'index'
daily = daily.dropna()  # removes the NaN rows that reindex just inserted


# Exogenous matrix aligned with the daily frame
exog = daily[['dew','temp','press','wnd_spd','snow','rain']]

print(daily.dtypes)
print(daily.head())

ds         datetime64[ns]
y                 float64
dew               float64
temp              float64
press             float64
wnd_spd           float64
snow              float64
rain              float64
dtype: object
          ds           y        dew       temp        press     wnd_spd  \
0 2010-01-02  145.958333  -8.500000  -5.125000  1024.750000   24.860000   
1 2010-01-03   78.833333 -10.125000  -8.541667  1022.791667   70.937917   
2 2010-01-04   31.333333 -20.875000 -11.500000  1029.291667  111.160833   
3 2010-01-05   42.458333 -24.583333 -14.458333  1033.625000   56.920000   
4 2010-01-06   56.416667 -23.708333 -12.541667  1033.750000   18.511667   

        snow  rain  
0   0.708333   0.0  
1  14.166667   0.0  
2   0.000000   0.0  
3   0.000000   0.0  
4   0.000000   0.0  
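
One thing worth noting about the cell above: reindexing to a complete daily range and then calling `dropna()` removes the filler rows again, so days without data stay missing. The sketch below illustrates this on a small made-up series and shows one possible alternative, time-based interpolation, that keeps a regular daily frequency. The toy data and the choice of interpolation are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: 10 days, with two days removed to simulate gaps.
idx = pd.date_range("2010-01-01", periods=10, freq="D")
toy = pd.DataFrame({"y": np.arange(10, dtype=float)}, index=idx)
toy = toy.drop(idx[[3, 4]])

# Reindex to the full daily range, then dropna: the gaps are still missing.
full = pd.date_range(toy.index.min(), toy.index.max(), freq="D")
round_trip = toy.reindex(full).dropna()
print(len(toy), len(round_trip))  # 8 8

# Alternative (an assumption, not the notebook's choice): keep the full range
# and fill the gaps by interpolating along the time axis.
regular = toy.reindex(full).interpolate(method="time")
print(pd.infer_freq(regular.index), int(regular["y"].isna().sum()))  # D 0
```

Whether interpolation is appropriate depends on how long the gaps are; for long outages, dropping or flagging the period may be the safer option.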


In [33]:
from pycaret.time_series import (
    setup, compare_models, finalize_model, predict_model, pull, save_model
)

s = setup(
    data=daily[['ds','y']],  # only `ds` and `y` are passed; the weather columns in `exog` are not
    fh=14,
    fold=3,
    session_id=2025,
    use_gpu=True,
    seasonal_period=7,
    target='y'
    # `index='ds'` is not set, so `ds` stays a regular column; the summary below
    # reports "Exogenous Variables: Present" with a data shape of (1789, 2)
)
best = compare_models()
final = finalize_model(best)
# Reuses the last 14 observed exog rows as a stand-in for future values
future = predict_model(final, fh=14, X=exog.tail(14))
future.head()

| Description | Value |
| --- | --- |
| session_id | 2025 |
| Target | y |
| Approach | Univariate |
| Exogenous Variables | Present |
| Original data shape | (1789, 2) |
| Transformed data shape | (1789, 2) |
| Transformed train set shape | (1775, 2) |
| Transformed test set shape | (14, 2) |
| Rows with missing values | 0.0% |
| Fold Generator | ExpandingWindowSplitter |

Streaming output truncated to the last 5000 lines.
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 2, number of used features: 0
[LightGBM] [Info] Using GPU Device: Tesla T4, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 16 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Start training from score 0.500000

| | Model | MASE | RMSSE | MAE | RMSE | MAPE | SMAPE | R2 | TT (Sec) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| croston | Croston | 1.3928 | 1.2928 | 72.3113 | 93.4264 | 2.1375 | 0.7478 | -0.1204 | 0.0233 |
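
Croston is the only model on the leaderboard, and the forecast that follows is the same value at every step. That is expected behaviour for Croston-type methods: they smooth the non-zero demand sizes and the intervals between them, then forecast the ratio as a constant. The sketch below is a textbook-style version written for illustration; the function name `croston_forecast` and its defaults are hypothetical, and it is not the implementation PyCaret wraps, whose initialisation details may differ.

```python
def croston_forecast(y, alpha=0.1, horizon=14):
    """Textbook-style Croston: smooth non-zero sizes and the gaps between them."""
    size = None        # smoothed size of non-zero observations
    interval = None    # smoothed number of periods between non-zero observations
    periods_since = 1  # periods elapsed since the last non-zero observation
    for value in y:
        if value != 0:
            if size is None:
                # Initialise from the first non-zero observation.
                size, interval = float(value), float(periods_since)
            else:
                size += alpha * (value - size)
                interval += alpha * (periods_since - interval)
            periods_since = 1
        else:
            periods_since += 1
    if size is None:
        return [0.0] * horizon
    # The forecast is one constant level repeated over the whole horizon.
    return [size / interval] * horizon


print(croston_forecast([10, 20, 30], alpha=0.5, horizon=3))  # [22.5, 22.5, 22.5]
```

On a series with no zeros, such as daily mean PM2.5, the interval stays at 1 and the method reduces to simple exponential smoothing of the level, which is one reason it is a weak fit for this data.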


[2025-10-26 23:21:53.622] [CUML] [info] Unused keyword parameter: n_jobs during cuML estimator initialization



| | y_pred |
| --- | --- |
| 1825 | 95.2191 |
| 1826 | 95.2191 |
| 1827 | 95.2191 |
| 1828 | 95.2191 |
| 1829 | 95.2191 |
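
The setup summary shows that only `ds` and `y` reached `setup()`, and the forecast call reuses the last 14 observed weather rows as if they were future values. A sketch of one way to lay the data out instead is shown below: the target and weather columns share a `DatetimeIndex`, and the final `FH` days are held out so the "future" exogenous rows are real later observations. The toy frame, column values, and the names `toy`, `train`, and `future_X` are hypothetical; the PyCaret calls appear only as comments and were not run here.

```python
import numpy as np
import pandas as pd

FH = 14  # forecast horizon, matching fh=14 in the notebook

# Hypothetical stand-in for the daily frame: target plus two weather columns.
rng = np.random.default_rng(0)
dates = pd.date_range("2014-01-01", periods=60, freq="D")
toy = pd.DataFrame(
    {
        "y": rng.gamma(2.0, 50.0, 60),
        "temp": rng.normal(10, 5, 60),
        "press": rng.normal(1015, 8, 60),
    },
    index=dates,
)

# Hold out the last FH days: their weather columns act as known future exog.
train = toy.iloc[:-FH]
future_X = toy.iloc[-FH:].drop(columns="y")

# The future exog index continues directly after the training index.
assert future_X.index[0] == train.index[-1] + pd.Timedelta(days=1)
print(train.shape, future_X.shape)  # (46, 3) (14, 2)

# With PyCaret installed, the frames could then be used roughly like this
# (an untested outline, not the notebook's code):
#   s = setup(data=train, target="y", fh=FH, fold=3, session_id=2025)
#   final = finalize_model(compare_models())
#   preds = predict_model(final, X=future_X)
```

In a real forecast the future weather is not known either, so `future_X` would have to come from a weather forecast or a separate model for each exogenous column.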


In [34]:
save_model(final, 'ts_exog_model')
future.to_csv('ts_exog_forecast.csv', index=False)
print("Saved ts_exog_forecast.csv")

Transformation Pipeline and Model Successfully Saved
Saved ts_exog_forecast.csv
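
Since the forecast frame carries its time step in the index, writing it with `index=False` leaves only the `y_pred` column in the file. A small sketch with a made-up frame (the name `toy_future` and the index label `step` are hypothetical) shows the difference and how the index can be restored on read.

```python
import io
import pandas as pd

# Hypothetical frame shaped like the forecast above: step in the index, one y_pred column.
toy_future = pd.DataFrame(
    {"y_pred": [95.2191] * 3},
    index=pd.Index([1825, 1826, 1827], name="step"),
)

without_index = toy_future.to_csv(index=False)  # the step labels are dropped
with_index = toy_future.to_csv(index=True)      # the step labels are kept as a column
print(without_index)
print(with_index)

# Reading the indexed version back restores the original layout.
restored = pd.read_csv(io.StringIO(with_index), index_col="step")
```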
