TTCA: Text-Time Cross-Modal Attention

Aligning Textual Information with Time Series via Cross-Modal Attention for Time Series Forecasting

Overview

Text-Time Cross-Modal Attention (TTCA) is a multimodal framework that enhances time-series forecasting by fusing numerical data with exogenous textual information (e.g., news headlines, reports) via a cross-attention mechanism. The attention is directed: time-series features serve as queries and textual features as keys and values, so that semantic context enhances, rather than overshadows, the underlying temporal dynamics.

Key Contributions

Cross-Modal Attention Mechanism: Effectively fuses time series with auxiliary textual information by allowing temporal features to dynamically attend to relevant textual context.
Grouped Text Encoding: Processes text at the temporal level with a grouped encoding mechanism that accounts for temporal alignment, avoiding information redundancy from long-text prompting.
State-of-the-Art Performance: Evaluated on the Time-MMD dataset across 9 real-world domains, achieving average improvements of 3.29% in MSE and 9.66% in MAE over unimodal baselines, and competitive results against multimodal baselines (TaTS, MM-TSFlib).
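The grouped text encoding idea can be illustrated with a small sketch. It assumes mean pooling and uses a toy `embed_snippet` function in place of the frozen PLM; both are stand-ins chosen for illustration and are not taken from the repository.

```python
from collections import defaultdict


def embed_snippet(text, dim=4):
    """Toy stand-in for a frozen PLM: maps a snippet to a fixed-size vector."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def grouped_text_embeddings(snippets, dim=4):
    """Group (time_step, text) pairs by time step, encode each snippet on its
    own, then mean-pool the snippet vectors of each step into one vector."""
    groups = defaultdict(list)
    for step, text in snippets:
        groups[step].append(embed_snippet(text, dim))
    pooled = {}
    for step, vectors in groups.items():
        # Column-wise mean over all snippets that belong to this time step.
        pooled[step] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return pooled


# Hypothetical snippets: two aligned with step 0, one with step 1.
snippets = [
    (0, "Fuel prices rise"),
    (0, "Refinery outage reported"),
    (1, "Demand stabilizes"),
]
embeddings = grouped_text_embeddings(snippets)
print(sorted(embeddings))   # time steps that received text
print(len(embeddings[0]))   # one vector of size `dim` per time step
```

Each time step ends up with a single fixed-size vector regardless of how many snippets it received, which is what lets the text sequence line up with the time-series sequence without concatenating everything into one long prompt.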

Architecture

TTCA consists of three main components:

Time Series Encoder — Projects numerical input into a latent space via value, positional, and temporal embeddings, then applies a Transformer encoder to capture temporal dependencies. A Translator layer maps the output to a shared latent space for fusion.
Text Encoder — Each text snippet is grouped by temporal alignment, individually tokenized, and passed through a frozen Pre-trained Language Model (e.g., GPT-2). A pooling layer produces a unified sentence embedding per time step. A Translator layer projects text features into the same shared latent space.
Cross-Modal Fusion (CMF) — A Multi-Head Cross-Attention mechanism where time series features form the Query (Q) and text features form the Key (K) and Value (V). A residual connection preserves the primary temporal signal while augmenting it with textual context. The fused representation is then projected to produce the final forecasting output.
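The fusion step described above can be sketched as follows. This is a minimal NumPy illustration for a single unbatched sample, not the repository's implementation: the names `cross_modal_fusion`, `params`, and `text_weight` are hypothetical, the projection matrices are random, and scaling the attended text by `text_weight` before the residual addition is an assumption suggested by the "Fusion weight" hyperparameter. It also assumes one text embedding per time step.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_fusion(ts, text, params, num_heads=4, text_weight=0.5):
    """ts: (T, d) time-series features used as queries.
    text: (S, d) text features used as keys and values.
    Returns the fused (T, d) features and the (heads, T, S) attention maps."""
    T, d = ts.shape
    head_dim = d // num_heads

    def split_heads(m):
        # (length, d) -> (num_heads, length, head_dim)
        return m.reshape(m.shape[0], num_heads, head_dim).transpose(1, 0, 2)

    q = split_heads(ts @ params["Wq"])
    k = split_heads(text @ params["Wk"])
    v = split_heads(text @ params["Wv"])

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, T, S)
    attn = softmax(scores, axis=-1)
    context = (attn @ v).transpose(1, 0, 2).reshape(T, d)  # merge heads
    context = context @ params["Wo"]

    # Residual connection: the temporal signal passes through unchanged and
    # the attended text is added on top (the weighting is an assumption).
    return ts + text_weight * context, attn


rng = np.random.default_rng(0)
d = 256  # fusion dimension from the hyperparameter table
params = {
    name: rng.normal(scale=d ** -0.5, size=(d, d))
    for name in ("Wq", "Wk", "Wv", "Wo")
}
ts = rng.normal(size=(36, d))    # 36 time steps
text = rng.normal(size=(36, d))  # one text embedding per time step
fused, attn = cross_modal_fusion(ts, text, params)
print(fused.shape, attn.shape)
```

Because the queries come from the time series, the output keeps the time-series sequence length, and setting `text_weight` to zero recovers the purely temporal features.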

Datasets

Experiments are conducted on the Time-MMD dataset covering 9 domains:

| Domain | Target Variable | Dim | Frequency | Samples | Time Span |
| --- | --- | --- | --- | --- | --- |
| Agriculture | Retail Broiler Composite | 1 | Monthly | 496 | 1983–2025 |
| Climate | Drought Level | 5 | Monthly | 496 | 1983–2025 |
| Economy | International Trade Balance | 3 | Monthly | 423 | 1989–2025 |
| Energy | Gasoline Prices | 9 | Weekly | 1479 | 1996–2025 |
| Environment | Air Quality Index | 4 | Daily | 11102 | 1982–2023 |
| Health | Influenza Patients Proportion | 11 | Weekly | 1389 | 1997–2025 |
| Security | Disaster and Emergency Grants | 1 | Monthly | 297 | 1999–2025 |
| Social Good | Unemployment Rate | 1 | Monthly | 900 | 1950–2025 |
| Traffic | Travel Volume | 1 | Monthly | 531 | 1980–2025 |

[Figure: Time-MMD overview]

Results

TTCA vs. Unimodal Baselines

TTCA is benchmarked against 5 unimodal models: iTransformer, PatchTST, Crossformer, Autoformer, and Informer.

| Domain | TTCA MSE | TTCA MAE | Best Baseline MSE | Best Baseline MAE | MSE Rank | MAE Rank |
| --- | --- | --- | --- | --- | --- | --- |
| Agriculture | 0.825 | 0.617 | 0.279 | 0.351 | 4/6 | 4/6 |
| Climate | 0.901 | 0.760 | 1.021 | 0.802 | 1/6 | 1/6 |
| Economy | 0.063 | 0.208 | 0.013 | 0.090 | 3/6 | 3/6 |
| Energy | 0.342 | 0.465 | 0.244 | 0.357 | 6/6 | 6/6 |
| Environment | 0.274 | 0.399 | 0.279 | 0.389 | 1/6 | 2/6 |
| Health | 1.251 | 0.768 | 1.347 | 0.761 | 1/6 | 2/6 |
| Security | 81.886 | 4.902 | 73.079 | 4.134 | 4/6 | 4/6 |
| Social Good | 0.729 | 0.390 | 0.777 | 0.430 | 1/6 | 1/6 |
| Traffic | 0.155 | 0.252 | 0.209 | 0.268 | 1/6 | 1/6 |
| Average | 9.618 | 0.973 | 9.945 | 1.077 | 2.1/6 | 2.4/6 |

TTCA achieves average improvements of 3.29% in MSE and 9.66% in MAE, securing an average ranking of 2.1/6 (MSE) and 2.4/6 (MAE).
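The reported percentages are consistent with a relative error reduction computed from the Average row of the table. The short check below reproduces them; reading the improvement as `(baseline - model) / baseline` is an inference from the numbers, not a formula stated in the source.

```python
def relative_improvement(baseline, model):
    """Percentage reduction in error relative to the baseline."""
    return (baseline - model) / baseline * 100


# Values taken from the Average row of the unimodal comparison table.
mse_gain = relative_improvement(baseline=9.945, model=9.618)
mae_gain = relative_improvement(baseline=1.077, model=0.973)
print(f"MSE: {mse_gain:.2f}%  MAE: {mae_gain:.2f}%")
```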

TTCA vs. Multimodal Baselines

| Domain | TTCA MSE | TTCA MAE | MM-TSFlib MSE | MM-TSFlib MAE | TaTS MSE | TaTS MAE |
| --- | --- | --- | --- | --- | --- | --- |
| Agriculture | 0.825 | 0.617 | 0.906 | 0.736 | 0.555 | 0.495 |
| Climate | 0.901 | 0.760 | 1.206 | 0.876 | 0.927 | 0.768 |
| Economy | 0.063 | 0.208 | 0.382 | 0.497 | 0.103 | 0.244 |
| Energy | 0.342 | 0.465 | 0.732 | 0.679 | 0.492 | 0.571 |
| Environment | 0.274 | 0.399 | 0.716 | 0.665 | 0.287 | 0.399 |
| Health | 1.251 | 0.768 | 1.536 | 0.904 | 1.400 | 0.791 |
| Security | 81.886 | 4.902 | 82.080 | 4.819 | 81.386 | 4.848 |
| Social Good | 0.729 | 0.390 | 1.302 | 0.756 | 0.920 | 0.481 |
| Traffic | 0.155 | 0.252 | 0.393 | 0.490 | 0.169 | 0.231 |
| Average | 9.490 | 0.861 | 9.917 | 1.080 | 9.582 | 0.915 |

TTCA outperforms MM-TSFlib with improvements of 4.31% MSE, 20.28% MAE, 15.86% RMSE, 43.84% MAPE, and 57.01% MSPE.
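For readers unfamiliar with the five metrics named above, the sketch below gives their common textbook definitions in plain Python. The exact conventions used by this repository's evaluation code (for example, whether MAPE and MSPE are scaled by 100) are not shown in this document, so treat these as assumptions; here the percentage errors are returned as fractions.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))


def mape(y_true, y_pred):
    """Mean absolute percentage error (fraction; undefined if a target is 0)."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mspe(y_true, y_pred):
    """Mean squared percentage error (fraction; undefined if a target is 0)."""
    return sum(((t - p) / t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values, used only to exercise the functions.
y_true = [2.0, 4.0, 5.0]
y_pred = [1.0, 4.0, 7.0]
for metric in (mse, mae, rmse, mape, mspe):
    print(metric.__name__, round(metric(y_true, y_pred), 4))
```

MAPE and MSPE divide by the target value, so they weight errors on small targets more heavily than MSE and MAE do, which is one reason gains can differ widely across metrics.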

Getting Started

Requirements

Install the dependencies:

pip install -r requirements.txt

Step 1: Generate Text Embeddings (Pre-computed LLM Embeddings)

Before training TTCA, you must pre-compute text embeddings using a frozen Pre-trained Language Model (GPT-2):

bash scripts/generate_emb/text_embedding.sh

Why pre-compute? TTCA decouples the LLM inference from the training loop by generating and caching text embeddings offline. Since the PLM is frozen (no weight updates), the embeddings remain constant across all training epochs — there is no need to recompute them on every forward pass. This design choice provides two key advantages:

Faster training — The multimodal fusion module trains significantly faster because it loads lightweight .h5 embedding files instead of running a large language model at each iteration.

Fixed, reproducible inputs — Pre-computed embeddings guarantee that the textual representation is deterministic and consistent across runs.

This differs from approaches such as MM-TSFlib and TaTS, which invoke the LLM to generate text embeddings during the training loop. Although their LLM weights are also frozen, running the full forward pass of a large language model at every training step introduces substantial computational overhead and slows down training considerably.
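The cost difference between the two strategies can be illustrated with a toy example. The `FrozenEncoder` class below is a hypothetical stand-in for the frozen PLM that simply counts forward passes, and a dictionary stands in for the .h5 files; none of this is the repository's code.

```python
class FrozenEncoder:
    """Toy stand-in for a frozen PLM; counts how often it is invoked."""

    def __init__(self):
        self.calls = 0

    def encode(self, text):
        self.calls += 1
        # Deterministic output: a frozen model maps the same text to the same vector.
        return [float(len(text)), float(sum(map(ord, text)) % 97)]


texts = ["flu cases rise", "vaccination campaign starts", "hospital visits fall"]
epochs = 5

# On-the-fly: the encoder runs for every sample in every epoch.
online = FrozenEncoder()
for _ in range(epochs):
    for text in texts:
        online.encode(text)

# Pre-computed: the encoder runs once per sample, then training reads the cache.
offline = FrozenEncoder()
cache = {text: offline.encode(text) for text in texts}
for _ in range(epochs):
    for text in texts:
        _ = cache[text]

print(online.calls, offline.calls)
```

The number of encoder calls grows with the number of epochs in the first loop and stays fixed at the number of samples in the second, while both strategies see identical embeddings.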

Step 2: Train TTCA

Run the main training script:

bash scripts/fusion_cross_attention.sh

You can customize the training by passing a start and end dataset index:

bash scripts/fusion_cross_attention.sh 0 8

Scripts

All runnable scripts are located in the scripts/ directory. Below is a description of each script and its purpose.

Training Scripts

| Script | Description |
| --- | --- |
| `scripts/fusion_cross_attention.sh` | Main training script for TTCA using iTransformer as the time series encoder backbone. Trains across all 9 Time-MMD domains with domain-specific look-back windows and prediction horizons. Supports optional start/end dataset index arguments. |
| `scripts/fusion_cross_attention_patchTST.sh` | TTCA training variant using PatchTST as the time series encoder backbone. Automatically computes `patch_len` based on input length (e.g., `input_len=84 → patch_len=16`, `input_len=36 → patch_len=12`). |
| `scripts/fusion_cross_attention_adding_prior.sh` | Ablation script that sweeps over multiple fusion weights (`prompt_weight ∈ {0.1, 0.2, 0.5, 0.8}`) to study the effect of the residual connection balance between time series and textual features. |

Baseline Scripts

| Script | Description |
| --- | --- |
| `scripts/unimodal_forecasting.sh` | Trains unimodal baselines (Autoformer, Transformer, Informer, Crossformer, iTransformer, PatchTST) on all 9 domains using `run_unimodal.py`. No text data is used. Results are saved in `./benchmarks/logs/`. |
| `scripts/forecasting/timecma.sh` | Runs the TimeCMA benchmark across all 9 domains via `train.py` in the TimeCMA codebase. Results are saved in `./Results/`. |

Embedding Generation Scripts

These scripts pre-compute text embeddings using a frozen PLM (GPT-2) before training. Embeddings are saved as .h5 files.

| Script | Description |
| --- | --- |
| `scripts/generate_emb/text_embedding.sh` | Generates text embeddings for all 9 domains with multiple input lengths per domain. Outputs are stored in `./Embeddings/text/GPT2/<domain>/<input_len>/`. |
| `scripts/generate_emb/health.sh` | Generates text embeddings for the Health domain only (single input length). A lightweight script for quick single-domain embedding generation. |
| `scripts/generate_emb/timecma_embedding.sh` | Generates text embeddings in TimeCMA-compatible format for all 9 domains. Outputs are stored in `./Embeddings_TimeCMA/` for use with the TimeCMA benchmark. |

Hyperparameters

| Hyperparameter | Value / Choices |
| --- | --- |
| Batch size | 32 |
| TS encoder hidden dim | {256, 512} |
| Fusion dimension | 256 |
| Fusion heads | 4 |
| Fusion weight | 0.5 |
| Dropout rate | {0.1, 0.5, 0.7} |
| Learning rate | {1e-4, 5e-4, 1e-3} |
| Optimizer | AdamW |
| Scheduler | Cosine annealing |
| PLM for text | GPT-2 (1.5B) |
| Training epochs | 100 |
| Early stopping patience | 20 |
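To make the scheduler and early-stopping settings concrete, here is a small framework-free sketch. It assumes the cosine schedule is stepped once per epoch over the full 100 epochs with a minimum learning rate of 0, and that early stopping monitors validation loss; the source does not confirm these details, and the validation losses below are synthetic.

```python
import math


def cosine_annealing_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine annealing: decay from base_lr at step 0 to min_lr at total_steps."""
    cosine = math.cos(math.pi * step / total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + cosine)


class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = math.inf
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=20)
stopped_at = None
for epoch in range(100):
    lr = cosine_annealing_lr(epoch, total_steps=100, base_lr=1e-3)
    # Synthetic validation loss: improves for 30 epochs, then plateaus higher.
    val_loss = 1.0 / (epoch + 1) if epoch < 30 else 0.05
    if stopper.step(val_loss):
        stopped_at = epoch
        break

print(stopped_at, lr)
```

With a patience of 20, a run whose last improvement occurs at epoch 29 ends at epoch 49, well before the 100-epoch budget is used up.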

Forecasting Configurations

| Frequency | Example Domains | Look-back Window | Prediction Horizons |
| --- | --- | --- | --- |
| Daily | Environment | 96 | {48, 96, 192, 336} |
| Weekly | Energy, Health | 36 | {12, 24, 36, 48} |
| Monthly | Agriculture, Economy | 8 | {6, 8, 10, 12} |
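The look-back window and prediction horizon determine how many training pairs a series yields. The sketch below slices a series into (input, target) windows, assuming a stride of 1 and ignoring the train/validation/test split; both are simplifications, and `sliding_windows` is a hypothetical helper rather than a function from the repository.

```python
def sliding_windows(series, look_back, horizon):
    """Yield (input, target) pairs: `look_back` past values followed by the
    next `horizon` values, moving one step at a time."""
    last_start = len(series) - look_back - horizon
    for start in range(last_start + 1):
        inputs = series[start:start + look_back]
        targets = series[start + look_back:start + look_back + horizon]
        yield inputs, targets


# A placeholder series with the length of the Agriculture data (496 samples),
# using the monthly look-back window of 8 and the longest horizon of 12.
series = list(range(496))
windows = list(sliding_windows(series, look_back=8, horizon=12))
print(len(windows))  # 496 - 8 - 12 + 1
```

For short monthly series, longer horizons leave noticeably fewer windows, which is one practical reason the look-back window and horizons are scaled to each sampling frequency.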

Citation

If you find this work useful, please cite:

@article{le2025ttca,
  title={Aligning Textual Information with Time Series via Cross-Modal Attention for Time Series Forecasting},
  author={Le, Hoang Anh and Dang, Thanh Vu and Yu, Gwang Hyun and Oh, Seungmin and Jo, Jung An and Kim, Jin Young},
  year={2025},
  keywords={Time Series Forecasting, Multi-Modal Fusion, Text-Time Alignment}
}

License

This project is licensed under the terms of the MIT License.

