---
title: Noc Agent Environment Server
emoji: π
colorFrom: yellow
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
---
An OpenEnv environment that simulates a Linux server under an active incident. An LLM-based agent acts as an autonomous NOC (Network Operations Centre) engineer, observing live system telemetry and choosing remediation actions to resolve the incident as quickly as possible.
The environment simulates three incident types, each with its own drift rate and action-response characteristics:
| Incident | Difficulty | Primary symptom |
|---|---|---|
| `network_congestion` | Easy | High latency and packet loss |
| `memory_leak` | Medium | RAM climbing toward OOM |
| `cpu_overload` | Hard | Runaway process saturating CPU |
At each step the agent receives a `NOCObservation` with six normalised system metrics and must choose one of six discrete actions:
| Action | Best used for |
|---|---|
| `do_nothing` | When metrics are near-healthy |
| `throttle_cpu` | High CPU usage |
| `scale_up` | Capacity pressure (CPU or network) |
| `clear_cache` | High memory / cache pressure |
| `restart_service` | Service down or memory leak |
| `reroute_traffic` | High latency / packet loss |
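As a reference point for the action table above, here is a naive rule-based baseline that maps the six observed metrics to an action. The metric names follow the `SystemMetrics` fields documented further down; the thresholds are illustrative, not the environment's internal ones, and this is not the shipped agent:

```python
def choose_action(m: dict) -> str:
    """Pick an action from the six available, given a dict of normalised metrics."""
    if m["service_healthy"] < 1.0:
        return "restart_service"       # service down
    if m["latency"] > 0.20 or m["packet_loss"] > 0.05:
        return "reroute_traffic"       # network symptoms
    if m["memory_usage"] > 0.65:
        return "clear_cache"           # memory pressure
    if m["cpu_usage"] > 0.65:
        return "throttle_cpu"          # CPU pressure
    if m["error_rate"] > 0.10:
        return "scale_up"              # general capacity pressure
    return "do_nothing"                # near-healthy
```

An LLM agent replaces this hand-written ladder with a policy conditioned on the full observation, but the mapping it must learn is essentially the one sketched here.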
An episode ends when the incident is resolved (all metrics below healthy thresholds for 3 consecutive steps), the system crashes (any metric exceeds the critical threshold), or the episode is truncated at `max_steps`.
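The termination rule above can be sketched as follows. The healthy thresholds are the ones from the metrics table later in this README; the single per-metric critical threshold is an assumption (the real simulator defines its own):

```python
HEALTHY = {"cpu_usage": 0.65, "memory_usage": 0.65, "latency": 0.20,
           "packet_loss": 0.05, "error_rate": 0.10}
CRITICAL = 0.95  # assumed critical threshold; illustrative only

def episode_status(metrics: dict, healthy_streak: int, step: int, max_steps: int):
    """Return (status, new_streak) where status is one of
    'crashed', 'resolved', 'truncated', 'running'."""
    if any(metrics[k] > CRITICAL for k in HEALTHY):
        return "crashed", 0
    all_healthy = (all(metrics[k] < t for k, t in HEALTHY.items())
                   and metrics["service_healthy"] >= 1.0)
    streak = healthy_streak + 1 if all_healthy else 0
    if streak >= 3:                    # 3 consecutive healthy steps
        return "resolved", streak
    if step >= max_steps:
        return "truncated", streak
    return "running", streak
```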
```bash
uv sync
```

This installs all dependencies from the lockfile into an isolated virtual environment. No manual venv or `pip install` needed.
The simplest way to evaluate an LLM against all three tasks:
```bash
# Set environment variables
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export API_KEY="<your-api-key>"
export IMAGE_NAME="openenv-noc_agent"

# Build the Docker image first (see below)
docker build -t openenv-noc_agent -f server/Dockerfile .

# Run all three tasks sequentially
uv run python -m noc_agent.inference
```

To run a single task:

```bash
NOC_AGENT_TASK=cpu_overload uv run python -m noc_agent.inference
```

The script prints structured logs per step and a final summary table:
```text
[START] task=cpu_overload env=noc_agent model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=throttle_cpu reward=1.23 done=false error=null
...
[END] success=true steps=12 score=0.742 rewards=1.23,0.95,...

[SUMMARY]
Task                      Success    Steps    Score
------------------------- ---------- -------- ------
network_congestion        true       8        0.810
memory_leak               true       14       0.623
cpu_overload              false      30       0.312
[SUMMARY] overall_score=0.582
```
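The summary numbers above are consistent with `overall_score` being the unweighted mean of the per-task scores; treat that as an assumption about the scoring, verified only against this sample output:

```python
# Per-task scores from the sample summary table above
scores = {"network_congestion": 0.810, "memory_leak": 0.623, "cpu_overload": 0.312}

# Assumed aggregation: plain arithmetic mean across the three tasks
overall = sum(scores.values()) / len(scores)
print(round(overall, 3))  # 0.582, matching the [SUMMARY] line
```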
```python
import asyncio

from noc_agent.client import NocAgentEnvClient
from noc_agent.models import ActionType, NOCAction

async def main():
    async with await NocAgentEnvClient.from_docker_image("openenv-noc_agent") as env:
        result = await env.reset(incident_type="cpu_overload")
        print(f"Incident: {result.observation.incident_type}")
        print(f"CPU: {result.observation.metrics.cpu_usage:.1%}")

        result = await env.step(NOCAction(action_type=ActionType.THROTTLE_CPU))
        print(f"Reward: {result.reward:.3f}")
        print(f"Explanation: {result.observation.explanation}")
        print(f"Done: {result.done}")

asyncio.run(main())
```

The environment exposes a standard `gymnasium.Env` interface for training RL policies without the HTTP server:
```python
from noc_agent.gym_env import NOCSystemEnv
from stable_baselines3 import PPO

env = NOCSystemEnv()  # random incident each episode
# env = NOCSystemEnv(incident_type=IncidentType.CPU_OVERLOAD)  # fixed incident

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)
```

- Observation space: `Box(6,)` – normalised metrics `[cpu, memory, latency, packet_loss, service_healthy, error_rate]`
- Action space: `Discrete(6)` – indexed by `ACTION_INDEX` order
```bash
docker build -t openenv-noc_agent -f server/Dockerfile .
```

```bash
# From the environment directory (where openenv.yaml is located)
openenv push

# Or with options
openenv push --namespace my-org --private
```

The `openenv push` command validates the environment, prepares a Hugging Face Docker space build, and uploads it. After deployment, your space exposes:
- Web Interface at `/web` – interactive UI for exploring the environment
- API Documentation at `/docs` – full OpenAPI/Swagger interface
- Health Check at `/health` – container health monitoring
- WebSocket at `/ws` – persistent session endpoint
- `--directory`, `-d`: Directory containing the OpenEnv environment (defaults to current directory)
- `--repo-id`, `-r`: Repository ID in the format `username/repo-name`
- `--base-image`, `-b`: Base Docker image to use (overrides the Dockerfile `FROM`)
- `--private`: Deploy the space as private (default: public)
| Field | Type | Description |
|---|---|---|
| `metrics` | `SystemMetrics` | Current normalised system metrics |
| `incident_type` | `IncidentType` | Active incident in this episode |
| `step` | `int` | Current step within the episode |
| `explanation` | `str` | Post-hoc explanation of the last action's effect |
| `done` | `bool` | Whether the episode has ended |
| `reward` | `float` | Reward received for the last action |
All values are normalised to `[0.0, 1.0]`. Higher values indicate more stress (except `service_healthy`).
| Field | Description | Healthy threshold |
|---|---|---|
| `cpu_usage` | CPU utilisation | < 0.65 |
| `memory_usage` | RAM utilisation | < 0.65 |
| `latency` | Network latency (normalised over 500 ms) | < 0.20 |
| `packet_loss` | Fraction of packets dropped | < 0.05 |
| `service_healthy` | 1.0 = healthy, 0.0 = down | ≥ 1.0 |
| `error_rate` | Fraction of requests returning errors | < 0.10 |
The aggregate `health_score` is a weighted sum:

```text
health_score = 1.0 − (cpu × 0.25 + memory × 0.25 + latency × 0.20 + packet_loss × 0.15 + (1 − service_healthy) × 0.10 + error_rate × 0.05)
```
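Written as a function, with the weights taken directly from the formula above (metric names follow the `SystemMetrics` table):

```python
def health_score(cpu, memory, latency, packet_loss, service_healthy, error_rate):
    """Aggregate health in [0, 1]: 1.0 is fully healthy, 0.0 is fully stressed."""
    stress = (cpu * 0.25 + memory * 0.25 + latency * 0.20
              + packet_loss * 0.15 + (1 - service_healthy) * 0.10
              + error_rate * 0.05)
    return 1.0 - stress

# A fully healthy system scores 1.0; a stressed one drops toward 0.0:
print(health_score(0.0, 0.0, 0.0, 0.0, 1.0, 0.0))    # 1.0
print(health_score(0.8, 0.5, 0.1, 0.0, 1.0, 0.05))   # 0.6525
```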
The reward at each step combines:

- Health delta – reward for improving overall health (`Δhealth × 5.0`)
- Health bonus – absolute nudge for staying healthy (`health × 0.5`)
- Step penalty – encourages faster resolution (`−0.10` per step)
- Resolution bonus – `+15.0` on successful resolution
- Crash penalty – `−10.0` if metrics exceed critical thresholds
- Ineffective action penalty – `−0.30` for actions with no measurable positive effect
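Combining the components listed above as a simple sum gives a sketch like the following; the weights come from the list, but the exact combination and the event flags are assumptions, not the simulator's implementation:

```python
def step_reward(prev_health, new_health, resolved=False, crashed=False,
                action_was_effective=True):
    """Per-step reward assembled from the components listed above."""
    r = (new_health - prev_health) * 5.0   # health delta
    r += new_health * 0.5                  # health bonus
    r -= 0.10                              # step penalty
    if resolved:
        r += 15.0                          # resolution bonus
    if crashed:
        r -= 10.0                          # crash penalty
    if not action_was_effective:
        r -= 0.30                          # ineffective action penalty
    return r
```

For example, improving health from 0.5 to 0.6 in one step yields `0.1 × 5.0 + 0.6 × 0.5 − 0.10 = 0.7` under this sketch.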
```text
noc_agent/
├── __init__.py                   # Module exports
├── README.md                     # This file
├── openenv.yaml                  # OpenEnv manifest
├── pyproject.toml                # Project metadata and dependencies
├── models.py                     # Pydantic models: NOCAction, NOCObservation, SystemMetrics
├── simulator.py                  # Core state machine: incident drift + action effects
├── incidents.py                  # Incident profiles (initial state, drift, action mappings)
├── gym_env.py                    # Gymnasium-compatible env for RL training
├── explainability.py             # Natural-language explanations for agent actions
├── inference.py                  # LLM agent runner (OpenAI-compatible API)
├── client.py                     # OpenEnv WebSocket client
└── server/
    ├── noc_agent_environment.py  # OpenEnv server environment (wraps SystemSimulator)
    ├── app.py                    # FastAPI application (HTTP + WebSocket endpoints)
    └── Dockerfile                # Container image definition
```
```bash
uv run python -m noc_agent.simulator
```

```bash
uv run server
```

Or with auto-reload during development:

```bash
uv run uvicorn server.app:app --reload
```

```bash
uv run train  # Run RL training
uv run demo   # Launch the Gradio demo UI
```

```bash
uv add <package>     # Add a runtime dependency
uv add --dev <pkg>   # Add a dev-only dependency
uv remove <package>  # Remove a dependency
```