Sustain-LC is part of the High Performance Supercomputing Digital Twin effort led by Oak Ridge National Laboratory (ORNL) with HPE and other industry and academic partners.
For more information, please visit the documentation webpage.
This Python-based benchmark environment enables the development and evaluation of control strategies for high-performance computing (HPC) data centers, using a digital twin of ORNL’s Frontier supercomputer. It supports energy-efficient cooling optimization across fine-grained server Blade-Groups, Cooling Towers (CT), and Heat Reuse Units (HRU).
- End-to-end customizable, scalable, and multi-agent control setups
- Real-time control interfaces via Gymnasium and ASHRAE G36-based traditional controllers
- Granular CDU agent control over inlet setpoints, pump flow, and valve positions
- Hybrid and interpretable policy support using decision tree distillation techniques ([Bastani et al., 2018])
- Built-in tools for ablation studies, scalability testing, and LLM-based explainability
Ideal for researchers, engineers, and practitioners in RL, control systems, and sustainable computing infrastructure.
- Python 3.10+
- Conda (for creating virtual environments)
- Dependencies listed in `requirements.txt`
- Clone the repository:
  `git clone https://github.com/HewlettPackard/sustain-lc.git`
- Navigate to the repository directory:
  `cd sustain-lc`
- Create a new Conda environment from the provided .yml file:
  `conda env create -f environment.yml`
- Activate the newly created environment:
  `conda activate sustain-lc`
- Train Example:
  To train an agent with centralized action policies, run
  `python train_multiagent_ca_ppo.py`
  To train an agent with centralized action multi-head policies, run
  `python train_mh_ma_ca_ppo.py`
- Monitor training on TensorBoard:
  `tensorboard --logdir /runs/`
- Evaluation Example:
  We provide a more user-friendly way to evaluate the agents via Jupyter notebooks. Interested users may also simply export a notebook to a Python script and run the resulting file. To evaluate the centralized action policies, run `evaluate_ma_ca_ppo.ipynb`; for multi-head policies, run `evaluate_mh_ma_ca_ppo copy.ipynb`.
- Policy Distillation using Decision Trees (VIPER):
  To distill policies from the pretrained agents, run the `policy_distillation.ipynb` notebook; see the sketch below for the underlying idea.
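The notebook implements the full procedure. The following is only a simplified, behavioral-cloning style sketch of the VIPER approach: roll out a pretrained teacher policy, collect state-action pairs, and fit a shallow decision tree. The `env` and `teacher_policy` arguments are generic placeholders, not Sustain-LC APIs.

```python
# Simplified VIPER-style distillation sketch (Bastani et al., 2018):
# roll out a pretrained "teacher" policy, collect (observation, action) pairs,
# and fit an interpretable decision tree on the aggregated dataset.
# `env` is any Gymnasium environment and `teacher_policy` any callable mapping
# observations to actions -- both are placeholders, not Sustain-LC objects.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def distill_policy(env, teacher_policy, n_rollouts=10, max_depth=6):
    observations, actions = [], []
    for _ in range(n_rollouts):
        obs, _ = env.reset()
        done = False
        while not done:
            act = teacher_policy(obs)              # query the teacher policy
            observations.append(np.asarray(obs))
            actions.append(np.asarray(act))
            obs, _, terminated, truncated, _ = env.step(act)
            done = terminated or truncated
    # Fit a shallow tree so the resulting policy stays interpretable.
    tree = DecisionTreeRegressor(max_depth=max_depth)
    tree.fit(np.stack(observations), np.stack(actions))
    return tree
```

For discrete action spaces, a `DecisionTreeClassifier` would be the natural drop-in replacement for the regressor used here.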
The environment interfaces with a high-fidelity Modelica model compiled as a Functional Mock-up Unit (FMU) version 2.0 for Co-Simulation, leveraging the PyFMI library for interaction.
The core of the simulation is a Modelica model representing the data center's liquid cooling thermodynamics. This model is compiled into an FMU, for example `LC_Frontier_5Cabinet_4_17_25.fmu`. The environment utilizes PyFMI to perform the following steps (a minimal usage sketch follows the list below):
- Load the FMU and parse its model description.
- Instantiate the FMU for simulation.
- Set up the experiment parameters, including start time (0.0) and a tolerance (if specified, default is FMU's choice).
- Initialize the FMU into its starting state.
- During an episode step:
  - Set input values (actions from the RL agent) to specified FMU variables.
  - Advance the simulation time by `sim_time_step` using the `fmu.do_step()` method. This is repeated until the agent's `step_size` is covered.
  - Get output values (observations for the RL agent and values for reward calculation) from specified FMU variables.
- Terminate and free the FMU instance upon closing the environment or resetting for a new episode.
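As a rough illustration of this interaction pattern, the sketch below loads the FMU with PyFMI, initializes it, and advances it through one agent step. The FMU filename and variable names are the examples quoted in this section; the time-step values are placeholders, and the snippet is not the environment's actual implementation.

```python
# Minimal PyFMI co-simulation sketch (not the frontier_env.py implementation).
# Variable names (pump1.speed_in, serverRack1.T_out) are the examples quoted
# above; the time-step values below are placeholders.
from pyfmi import load_fmu

fmu = load_fmu("LC_Frontier_5Cabinet_4_17_25.fmu")   # FMU 2.0 for Co-Simulation
fmu.setup_experiment(start_time=0.0)
fmu.initialize()

sim_time_step = 1.0     # internal FMU step (placeholder value)
step_size = 10.0        # agent control interval (placeholder value)
t = 0.0

# One agent step: write the action, then advance the FMU until step_size is covered.
fmu.set("pump1.speed_in", 0.5)
while t < step_size:
    fmu.do_step(current_t=t, step_size=sim_time_step, new_step=True)
    t += sim_time_step

server_outlet_temp = fmu.get("serverRack1.T_out")    # observation for the agent

fmu.terminate()
```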
The FMU variable names used for interfacing are explicitly defined within the environment:
- Action Variables: `self.fmu_action_vars` (e.g., `pump1.speed_in`, `valve1.position_in`)
- Observation Variables: `self.fmu_observation_vars` (e.g., `serverRack1.T_out`, `ambient.T`)
- Power Consumption Variables (for reward): `self.fmu_power_vars` (e.g., `[pump1.P, fan1.P]`)
- Target Temperature Variable (for reward): `self.fmu_target_temp_var` (e.g., `controller.T_setpoint`)
The observation space is defined as a continuous `gymnasium.spaces.Box` with specific lower and upper bounds. It comprises the following variables retrieved from the FMU:
- `FrontierNode.AvgBladeGroupTemp`: Average temperature of a Blade Group in a cabinet (K).
- `FrontierNode.AvgBladeGroupPower`: Average power input to each Blade Group in a cabinet (W).
The bounds for these observations are set to, e.g., 273.15 K and 373.15 K for the temperature measurements, and between, e.g., 0.0 kW and 400 kW for the power input measurements.
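A minimal sketch of how such a Box space could be declared with the bounds quoted above is given below; the number of blade groups per cabinet (five) is an assumption for illustration, not the repository's value.

```python
# Hypothetical observation-space sketch using the bounds quoted above.
# The number of blade groups (here 5) is an assumption, not the repository's value.
import numpy as np
import gymnasium as gym

n_blade_groups = 5
obs_space = gym.spaces.Box(
    low=np.array([273.15] * n_blade_groups + [0.0] * n_blade_groups,
                 dtype=np.float32),   # temperature lower bounds (K), power lower bounds (kW)
    high=np.array([373.15] * n_blade_groups + [400.0] * n_blade_groups,
                  dtype=np.float32),  # temperature upper bounds (K), power upper bounds (kW)
)
```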
For the Cooling Tower Markov Decision Process, we have a similar observation space:
- `FrontierNode.CoolingTower.CellPower`: Average power consumption of each cell of the cooling tower (W).
- `FrontierNode.CoolingTower.WaterLeavingTemp`: Average temperature of the water leaving each cooling tower (K).
- `T_owb`: Outside-air wet-bulb temperature.
The action space is a hybrid of continuous `gymnasium.spaces.Box` and discrete `gymnasium.spaces.Discrete`, allowing the agent to control:
- `FrontierNode.CDU.Pump.normalized_speed`: Scaled speed of the CDU pump (-1 to 1).
- `FrontierNode.CDU.TempSetpoint`: Scaled coolant supply temperature setpoint (-1 to 1).
- `FrontierNode.CDU.AvgBladeGroupValve`: Scaled valve opening that allows coolant to collect heat from the corresponding blade group (-1 to 1).
- `FrontierNode.CoolingTower.WaterLvTSPT`: Discrete setting of the cooling tower water-leaving temperature setpoint delta.
These scaled values allow the neural network policies used by the RL agents to learn effectively without saturating their activation layers.
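One possible way to compose such a hybrid space is with a `gymnasium.spaces.Dict`, as sketched below; the key names, the number of valves, and the number of discrete setpoint-delta settings are illustrative assumptions, not the environment's actual definition.

```python
# Hypothetical sketch of a hybrid action space combining Box and Discrete
# sub-spaces in a gymnasium.spaces.Dict. The number of blade-group valves (5)
# and the number of discrete setpoint-delta settings (5) are assumptions.
import gymnasium as gym

action_space = gym.spaces.Dict({
    "cdu_pump_speed":     gym.spaces.Box(low=-1.0, high=1.0, shape=(1,)),  # scaled pump speed
    "cdu_temp_setpoint":  gym.spaces.Box(low=-1.0, high=1.0, shape=(1,)),  # scaled supply temperature setpoint
    "blade_group_valves": gym.spaces.Box(low=-1.0, high=1.0, shape=(5,)),  # scaled valve openings
    "ct_water_lv_tspt":   gym.spaces.Discrete(5),                          # setpoint-delta index
})
```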
The reward function guides the RL agent towards desired operational states. It is calculated at each step from two components:

$$ r_{\text{blade}} = -\sum_{i=1}^{C}\sum_{j=1}^{B} T_{ij} $$

which is the negative of the aggregate temperature of the blade groups, and

$$ r_{\text{CT}} = -\sum_{i=1}^{N}\sum_{j=1}^{m} P_{ij} $$

which is the negative of the total cooling tower power consumption at each time step, where:
- T_ij is the temperature of the jth blade group of the ith cabinet, ∀ j in 1...B and ∀ i in 1...C
- P_ij is the power consumption of the jth cell of the ith cooling tower, ∀ j in 1...m and ∀ i in 1...N
The goal is to keep server temperatures below the target while minimizing energy consumption.
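For concreteness, the two reward components described above can be computed as in the following sketch; the function name and array shapes are illustrative, not the environment's API.

```python
# Illustrative reward computation following the description above.
# blade_temps has shape (C, B): blade-group temperatures per cabinet (K).
# ct_cell_powers has shape (N, m): cell power draw per cooling tower (W).
import numpy as np

def compute_reward(blade_temps: np.ndarray, ct_cell_powers: np.ndarray):
    r_blade = -np.sum(blade_temps)      # negative aggregate blade-group temperature
    r_ct = -np.sum(ct_cell_powers)      # negative total cooling-tower power
    return r_blade, r_ct
```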
An episode runs for a maximum of `max_episode_duration`. The agent interacts with the environment at discrete time intervals defined by `step_size`. For each agent step, the FMU's `do_step()` method is called `step_size / sim_time_step` times. The `reset()` method terminates the current FMU instance, then re-instantiates and re-initializes it, ensuring a consistent starting state for each new episode. Initial observations are drawn from the FMU after initialization.
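Putting the pieces together, a typical Gymnasium interaction loop with the wrapped environment looks like the sketch below; the `FrontierEnv` constructor name is an assumption about what `frontier_env.py` exposes, not a confirmed interface.

```python
# Generic Gymnasium interaction loop; the FrontierEnv constructor and its
# arguments are placeholders for whatever frontier_env.py actually exposes.
from frontier_env import FrontierEnv  # hypothetical import path

env = FrontierEnv()                    # wraps the FMU described above
obs, info = env.reset()                # re-instantiates and initializes the FMU
done = False
while not done:
    action = env.action_space.sample() # replace with a trained policy
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()                            # terminates and frees the FMU instance
```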
- Script: `train_mh_ma_ca_ppo.py`
This script is designed to train a Proximal Policy Optimization (PPO) agent in an environment that involves multiple agents and components. It can be configured to use Multi-Head (MH), Centralized Action (CA), and Multi-Agent (MA) features, allowing for flexible and expressive policy representations. CA implies there is a shared policy for multiple homogeneous agents within the specified environment.
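As a rough illustration of the multi-head idea only (a shared trunk with one output head per action group), the sketch below shows a generic multi-head actor in PyTorch; the layer sizes and number of heads are arbitrary, and this is not the architecture implemented in the repository.

```python
# Generic multi-head actor sketch (PyTorch). A shared trunk encodes the
# observation; each head produces one group of actions. Layer sizes and the
# number of heads are arbitrary, not the repository's architecture.
import torch
import torch.nn as nn

class MultiHeadActor(nn.Module):
    def __init__(self, obs_dim: int, head_dims: list[int], hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        # One output head per action group (e.g., pump speed, setpoint, valves).
        self.heads = nn.ModuleList(nn.Linear(hidden, d) for d in head_dims)

    def forward(self, obs: torch.Tensor) -> list[torch.Tensor]:
        z = self.trunk(obs)
        return [head(z) for head in self.heads]
```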
Basic Run Command:
python train_mh_ma_ca_ppo.py
Key Configurable Parameters
The script has the following relevant parameters you can modify:
- `-exp-name` (str, default: `ppo_ma_ca`): Name for the experiment, used for logging.
- `-seed` (int, default: `123`): Random seed for reproducibility.
- `-cuda` (flag, default: `True`): Enables CUDA for GPU acceleration if available. Set `-cuda False` to force CPU.
- `-env_name` (str, default: `MH_SmallFrontierModel`): Name of the environment.
- `-agent_type` (str, default: `MultiHead_CA_PPO`): Type of the RL agent.
- `-max_training_timesteps` (int, default: `5e6`): Total timestep budget for training.
- `-max_ep_len` (int, default: `200`): Maximum episode length.
- `-lr_actor` (float, default: `3e-4`): Learning rate for the actor optimizer.
- `-lr_critic` (float, default: `1e-3`): Learning rate for the critic optimizer.
- `-K_epochs` (float, default: `50`): Epochs of training to run for each update.
- `-eps_clip` (float, default: `0.2`): Clip parameter for PPO.
- `-num_centralized_actions` (int, default: `4`): Number of centralized actions for each environment.
- `-gamma` (float, default: `0.80`): Discount factor for future rewards.
- `-gae_lambda` (float, default: `0.95`): Lambda for Generalized Advantage Estimation (GAE).
- `-minibatch_size` (int, default: `32`): Mini-batch size for each epoch.
- `-ent-coef` (float, default: `0.01`): Entropy coefficient for exploration.
- `-vf-coef` (float, default: `0.5`): Value function loss coefficient.
- `-num-agents` (int, default: `2`): Specifies the number of agents in the custom environment.
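For example, to override a few of these defaults (the flag names are as listed above; the values chosen here are arbitrary):
`python train_mh_ma_ca_ppo.py -seed 42 -gamma 0.9 -num-agents 2`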
- Script: `train_multiagent_ca_ppo.py`
This script trains multiple PPO agents for a multi-agent reinforcement learning (MARL) task. Each agent has its own policy and value function: one agent handles the blade group control and another the cooling tower control. It specifically employs a Centralized Action (CA) mechanism. The script is designed to work with MARL environments.
Basic Run Command:
python train_multiagent_ca_ppo.py
The key configurable parameters for this script are identical to those of `train_mh_ma_ca_ppo.py`.
This section covers advanced topics like building custom models for sustain-lc. This requires the user to have the following repositories and software installations.
Dymola and OpenModelica both provide a GUI and a command-line interface (CLI) for creating, compiling, and running Modelica model simulations, as well as exporting them to binaries called Functional Mock-up Units (FMUs).
Users need to clone the following repositories to their working folder that can be accessed by either Dymola or the OpenModelica IDEs:
- Modelica Buildings library:
  `git clone https://github.com/lbl-srg/modelica-buildings.git`
The Modelica Buildings Library is a free, open-source library for modeling building energy and control systems, developed by Lawrence Berkeley National Laboratory. It provides comprehensive component models for HVAC systems, including heat exchangers, pumps, and valves essential for liquid cooling applications. The library enables dynamic simulation of thermal systems with fluid flow, heat transfer, and controls integration for performance analysis and optimization. Its modular architecture allows users to construct complex cooling systems by connecting components through standardized interfaces that preserve energy and mass balance. The library's extensive validation against measured data makes it suitable for accurately simulating liquid cooling systems in buildings and data centers.
- TRANSFORM:
  `git clone https://github.com/ORNL-Modelica/TRANSFORM-Library.git`
The TRANSFORM (TRANsient Simulation Framework Of Reconfigurable Models) Library is an open-source Modelica toolkit developed by Oak Ridge National Laboratory for modeling complex thermal-hydraulic systems. It specializes in advanced energy systems with particular strength in liquid-cooled applications, including advanced reactor designs and heat transfer loops. The library provides detailed component models for heat exchangers, pumps, compressors, and specialized fluid systems with comprehensive thermophysical property implementations. TRANSFORM excels at simulating transient behaviors in cooling systems, making it valuable for studying system responses during operational changes or upset conditions. The modular architecture enables scaling from component-level to system-level simulations with various working fluids, including specialized coolants used in high-performance liquid cooling applications.
- datacenterCoolingModel:
  `git clone https://code.ornl.gov/exadigit/datacenterCoolingModel.git`
The Data Center Cooling Model is an ORNL-developed specialized simulation framework targeting liquid cooling systems specifically for high-performance computing facilities. The repository provides detailed modeling capabilities for direct-to-chip, immersion, and rear-door heat exchanger liquid cooling technologies increasingly adopted in modern data centers. Its component models account for the complex interactions between IT equipment heat generation, coolant flow distribution, and thermal management systems at rack, row, and facility scales. The framework enables performance assessment, optimization, and efficiency analysis of cooling systems under various operating conditions and workloads. The models support integration with power consumption data to enable comprehensive energy efficiency calculations and cooling infrastructure planning for data centers.
- AutoCSM:
  `git clone https://code.ornl.gov/exadigit/AutoCSM.git`
ExaDigit AutoCSM is a template system-of-systems modeling approach for automating the development, deployment, and integration of Cooling System Models (CSMs) for supercomputing facilities within the ExaDigiT framework.
ExaDigiT is a digital twin of supercomputers and their thermal infrastructures. It offers insights into operational strategies and "what-if" scenarios, and elucidates complex, cross-disciplinary transient behaviors. It also serves as a design tool for future system prototyping. It combines telemetry and simulations, providing a virtual representation of physical systems, and supports planning, construction, and operations, offering value in decision-making, predictive maintenance, and system efficiency. In design stages, it can evaluate energy efficiency, virtually prototype cooling systems, and model network performance. During operations, ExaDigiT aids in predictive maintenance and operational optimization. ExaDigiT is built on an open software stack (Modelica, SST Macro, Unreal Engine); to foster community-driven development, a partnership has been formed with national supercomputing centers around the world (Oak Ridge National Laboratory, Lawrence Livermore National Laboratory, Los Alamos National Laboratory (USA), PAWSEY (Australia), LUMI (Finland), CINES (France), CINECA (Italy), etc.) to develop an open framework for modeling supercomputers.

AutoCSM is a Python-based framework to assist CSM developers in accelerating the creation and deployment of system-level thermal-hydraulic CSMs. The intention is for this tool to help standardize digital twin workflows for ExaDigiT; however, it can also be used independently of ExaDigiT (and even for systems other than CSMs).
The primary model building process based on the specified structure is executed by the AutoCSM API library. It reads the JSON file and then populates a Modelica file using elements from the datacenterCoolingModel library.
To execute this process, run `python run_auto_csm.py` from the CLI in the AutoCSM library directory where the JSON file and the Python files are located. The user needs to specify the path to the desired JSON file inside the `run_auto_csm.py` file, as well as compilation parameters such as solver information, the number of steps to solve, etc.
The above process generates the FMU, which is then wrapped inside a Gymnasium environment for Sustain-LC. Most common application requirements are already covered by the default Sustain-LC environment file `frontier_env.py`. If users wish to log highly custom variables, they have to add those variables to the environment's info dictionary.
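A minimal sketch of this pattern is shown below; the wrapper class, the `self.env.fmu` attribute, and the logged FMU variable name are assumptions for illustration only, not the actual Sustain-LC interface.

```python
# Hypothetical sketch of exposing an extra FMU variable through the info
# dictionary via a Gymnasium wrapper. The base environment's `fmu` attribute
# and the variable name below are placeholders, not confirmed APIs.
import gymnasium as gym

class CustomLoggingEnv(gym.Wrapper):
    """Wraps a Sustain-LC environment and adds custom FMU variables to `info`."""

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Assumption: the wrapped env exposes its PyFMI model as `self.env.fmu`.
        info["ambient_temperature"] = float(self.env.fmu.get("ambient.T")[0])
        return obs, reward, terminated, truncated, info
```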
Of these libraries, the user needs to access the datacenterCoolingModel to study the atomic structures of the thermodynamic components that can be used to build custom data center configurations. An example configuration is provided in Example JSON. This JSON describes an example hierarchical structure for the models. Further example hierarchical structures used for the results in the main paper are also included in the sustain-lc repository.
High Fidelity of the Underlying Models for Sustain-LC CDU Loops
The Modelica models developed from the ORNL libraries show high fidelity when compared against ground-truth data from the Cooling Distribution Units (CDUs).