Description
In its current form, Reframe cannot be used on cloud clusters without additional edits to the core API files, due to fundamental differences between VM and baremetal behavior. This causes issues with both creating and maintaining cloud-based tests using Reframe.
The issue from a SW dev perspective is that VMs need to 'ask' the cloud provider for their specs. For example, because not all cores and HW on a physical machine are allocated to a cloud node, certain commands such as lscpu, or checks of locality/affinity, can show info that is misleading or incorrect. Getting this info in Azure VMs requires a server ping that returns the HW/VM info.
The issue from a SW management perspective is that cloud users need to maintain a special branch of Reframe that has cloud-specific logic - as @vkarak pointed out in #2444. This means not being able to access the latest Reframe code without doing regular merges, which is not good for automated or regression testing.
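To make the 'ask the provider' step concrete, here is a minimal sketch, assuming the Azure Instance Metadata Service endpoint at 169.254.169.254 and its `compute.vmSize` field. The function names and the injectable `fetch` parameter are hypothetical and not part of Reframe; the real request only works from inside an Azure VM.

```python
import json
import urllib.request

# Azure Instance Metadata Service endpoint (only reachable from inside an Azure VM).
IMDS_URL = 'http://169.254.169.254/metadata/instance?api-version=2021-02-01'


def fetch_imds(url=IMDS_URL, timeout=2):
    """Return the raw JSON text from the metadata service."""
    # The service rejects requests that lack the 'Metadata: true' header.
    req = urllib.request.Request(url, headers={'Metadata': 'true'})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode('utf-8')


def vm_size(fetch=fetch_imds):
    """Return the VM size string, or None if the service is unreachable
    or the response is not valid JSON (hypothetical helper)."""
    try:
        data = json.loads(fetch())
    except (OSError, ValueError):
        return None
    return data.get('compute', {}).get('vmSize')
```

Passing `fetch` as a parameter keeps the lookup testable off-cloud, and returning `None` lets the caller fall back to the usual baremetal detection.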
There are a few ways I can think of to handle both issues. None are ideal.
- A slight variation of #2440. Introduce new cloud-specific functions that are equivalent to regular reframe functions (e.g. self.current_system_VM.node_data) with the same behavior as the baremetal counterparts. But different clouds likely have different behaviors - and clouds themselves are not necessarily static, unlike a baremetal OS - making this option a stability, validation, and compatibility nightmare.
- Add conditionals inside existing reframe functions, similar to how reframe has different logic for various schedulers. But, that may require adding support for only one VM/cloud at a time, with each provider needing to wait their turn.
- Put VM function calls in various parts of the configuration, which cloud users can fill in from a separate, stand-alone python file. For example, with the above referenced _detect_system() in config.py, you can include a sub-function _detect_vm() that runs before any _detect_system() code and is empty by default. Cloud users fill in _detect_vm() in a separate file - cloud_config.py - with the desired behavior. This would remove the need for Reframe to account for different VM/cloud usage scenarios. And automated cloud tests can pull the main reframe code, then pull their additional file(s) when building/initializing the environment. The downside is outsourcing reframe development to cloud devs (basically, having mini-versions of reframe for every cloud provider), causing reframe devs to [rightfully] decline customer support from any cloud user.
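A rough sketch of the third option, to show the shape I have in mind. This is not Reframe's actual code: `load_vm_hook`, `detect_system`, and the `detect_baremetal` argument are hypothetical stand-ins, and the real `_detect_system()` in config.py has its own signature and logic.

```python
import importlib


def _detect_vm():
    """Default hook: empty, so baremetal behavior is unchanged."""
    return None


def load_vm_hook(module_name='cloud_config'):
    """Return `_detect_vm` from the user's module if present, else the default."""
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        # No cloud_config.py on the path: use the empty default hook.
        return _detect_vm
    return getattr(mod, '_detect_vm', _detect_vm)


def detect_system(detect_baremetal, module_name='cloud_config'):
    """Run the VM hook first; fall back to the regular detection
    (`detect_baremetal` stands in for the existing logic)."""
    name = load_vm_hook(module_name)()
    return name if name is not None else detect_baremetal()
```

A cloud user would then ship a cloud_config.py that defines its own `_detect_vm()` returning a system name, and pull it alongside the unmodified main reframe code.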
Thoughts?