- Overview
- Main Goals
- Architecture
- Configuration
- Metric Details
- Operational Details
- Detailed Collection Process
- Edge Node Collector Output Structure
- ESXi Collector Output Structure
- Statistics Processing in Main Script
- Logging Examples
- Thread Processing
- vROps Metric Structure
- Error Notification System
- Performance Considerations
- Setup Steps
- vROps Metric Visualization
This project implements an automated solution for collecting, processing, and monitoring network performance metrics from both NSX Edge nodes and ESXi hosts in a virtualized environment. The collected metrics are then published to VMware Aria Operations (vROps) for monitoring and analysis.
The code has been developed and tested on the following product versions:
- NSX 4.2.1
- ESXi 8.0u3
- Aria Operations 8.18.2
- Collect detailed performance metrics from NSX Edge nodes and ESXi hosts
- Monitor CPU utilization, interface statistics, and flow cache performance
- Track network-related thread activities and resource usage
- Aggregate performance data from multiple sources
- Normalize and process metrics for consistent reporting
- Push consolidated metrics to vROps for centralized monitoring
- Track CPU usage against defined thresholds
- Monitor interface errors and network flow statistics
- Identify potential performance bottlenecks
- Central coordination script that manages the overall metrics collection and publishing process
- Implements the `StatsCollector` class, which orchestrates:
  - Sequential collection of metrics from Edge nodes and ESXi hosts
  - Processing and normalization of collected data
  - Publishing metrics to vROps
  - Error handling and notification management
- Implements the `NSXEdgeStatsCollector` class for collecting metrics from NSX Edge nodes
- Gathers:
  - CPU performance statistics
  - Interface metrics
  - Flow cache statistics
- Handles SSH connections and command execution on Edge nodes
- Implements the `ESXiStatsCollector` class for collecting metrics from ESXi hosts
- Collects:
  - Network interface statistics (vmnic stats)
  - Thread performance metrics
  - EnsNetWorld statistics
- Manages SSH connections to ESXi hosts
- `requestvROpsAccessToken.py`: vROps authentication
- `edge_node_config_reader.py`: Configuration file parsing
- `sendNotificationOnError.py`: Error notification system
- Connects to each Edge node via SSH
- Executes `get interfaces | json` to collect interface statistics
- Executes `get dataplane perfstats {interval}` to collect performance metrics
- Processes and aggregates the data
- Calculates maximum values across all nodes
- Connects to each ESXi host in the cluster via SSH
- Executes `net-stats -i 1 -tW -A` to collect network statistics
- Processes statistics for:
  - VMNIC poll world threads
  - EnsNetWorld TX/RX threads
- Calculates maximum values across all hosts
The solution uses two YAML files for configuration: one for endpoints/infrastructure details and another for credentials.
nsx_manager:
  ip: "10.191.21.92"
edge_nodes:
  18b3dd22-2ba6-482a-80a3-eb90068dfb2d: 10.191.21.93 # Edge Node UUID: IP address
  e5b7ce1b-4210-46dd-a3f9-38bd1c71e332: 10.191.21.97 # Edge Node UUID: IP address
edge_clusters:
  2ebffbe1-8aca-46de-919a-cf606c15ed82: # Edge Cluster UUID
    nodes:
      - 18b3dd22-2ba6-482a-80a3-eb90068dfb2d
      - e5b7ce1b-4210-46dd-a3f9-38bd1c71e332
esxi_hosts:
  10.163.183.151: 39095510-a095-4344-8d81-4de80da95871 # ESXi IP address: vROps entity ID
  10.163.183.152: bf7e490a-3f0f-4f00-b2f8-804e36d52df0 # ESXi IP address: vROps entity ID
vrops_instance:
  ip: "10.191.21.95"
  adapter_instance_id: "fe470524-a2c7-4890-ba43-4f317a88bb76" # NSX adapter instance ID in vROps

Note: For instructions on finding your vROps NSX adapter instance ID, refer to the vROps Metric Visualization section of this documentation.
nsx_manager:
  username: "admin"
  password: "****"
edge_nodes:
  default: # Default credentials for all edge nodes
    username: "admin"
    password: "****"
  nodes: # Optional: Node-specific credentials
    "e5b7ce1b-4210-46dd-a3f9-38bd1c71e332":
      username: "admin"
      password: "****"
esxi_hosts:
  default: # Default credentials for all ESXi hosts
    username: "root"
    password: "****"
  hosts: # Optional: Host-specific credentials
    "esxi-01":
      username: "root"
      password: "****"
vrops_instance:
  username: "admin"
  password: "****"

- Both files should be in the same directory as the Python scripts
- Edge Node UUIDs must be obtained from NSX Manager
- Edge Cluster UUID must match the NSX Edge Cluster UUID
- ESXi hosts should include all hosts where Edge nodes could run
- ESXi hosts are defined using their IP addresses as keys with their vROps entity IDs as values
- All IP addresses must be reachable from the monitoring server
- Default credentials can be specified for edge nodes and ESXi hosts
- Individual credentials can be specified for specific nodes/hosts
- If a node/host has specific credentials defined, those will be used instead of defaults
- If no specific credentials are found, the default credentials will be used
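The fallback rule above (specific credentials first, defaults otherwise) can be sketched as a small lookup. This is an illustration only, not the project's code: the function name `resolve_credentials` is hypothetical, and the dictionary is assumed to be the parsed form of one block of the credentials file.

```python
from typing import Any, Dict


def resolve_credentials(section: Dict[str, Any], overrides_key: str, target: str) -> Dict[str, str]:
    """Return target-specific credentials if defined, otherwise the defaults.

    `section` is one top-level block of the credentials file (for example
    `edge_nodes`); `overrides_key` is "nodes" or "hosts".
    """
    overrides = section.get(overrides_key) or {}
    if target in overrides:
        return overrides[target]
    if "default" not in section:
        raise KeyError(f"no credentials for {target!r} and no default defined")
    return section["default"]


# Assumed parsed form of the `edge_nodes` block; passwords are placeholders.
edge_creds = {
    "default": {"username": "admin", "password": "default-secret"},
    "nodes": {
        "e5b7ce1b-4210-46dd-a3f9-38bd1c71e332": {"username": "admin", "password": "node-secret"},
    },
}

specific = resolve_credentials(edge_creds, "nodes", "e5b7ce1b-4210-46dd-a3f9-38bd1c71e332")
fallback = resolve_credentials(edge_creds, "nodes", "18b3dd22-2ba6-482a-80a3-eb90068dfb2d")
```

Raising when neither an override nor a default exists is a design choice of this sketch; it makes a misconfigured credentials file fail early instead of at SSH connection time.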
For each interface (fp-eth0 through fp-eth3):
- `rx_errors`: Number of receive errors
- `rx_misses`: Number of receive packet misses
- `tx_errors`: Number of transmission errors
- `tx_drops`: Number of dropped transmission packets
For each CPU core:
- `usage`: CPU utilization percentage
- `rx`: Receive packet rate (packets per second)
- `tx`: Transmit packet rate (packets per second)
- `crypto`: Cryptographic operations rate
- `slowpath`: Slow path packet processing rate
- `intercore`: Inter-core communication rate
For each core:
- `micro_hit_rate`: Micro flow cache hit rate percentage
- `mega_hit_rate`: Mega flow cache hit rate percentage
For each vmnic:
- `max_used`: Maximum CPU usage percentage across all threads
- `max_ready`: Maximum ready time percentage
- Thread-specific statistics:
  - `used`: CPU usage percentage per thread
  - `ready`: Ready time percentage per thread
Tracks network processing threads divided into:
- TX (Transmit) threads
- RX (Receive) threads
Each thread includes:
- `used`: CPU usage percentage
- `ready`: Ready time percentage
The ESXi collector implements thread filtering to focus on significant resource usage:
- Only threads with CPU usage above a configurable threshold are collected
- Default threshold is 2% CPU usage
- Filtering applies to both VMNIC and EnsNetWorld threads
- Reduces noise in metrics by excluding low-activity threads
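A minimal sketch of this filtering step follows. The helper name `filter_active_threads` and the second pollWorld thread in the sample data are made up for illustration; the strict greater-than comparison and the 2% default mirror the behaviour described above.

```python
from typing import Dict

USAGE_THRESHOLD = 2.0  # percent CPU; the documented default


def filter_active_threads(
    threads: Dict[str, Dict[str, float]],
    threshold: float = USAGE_THRESHOLD,
) -> Dict[str, Dict[str, float]]:
    """Keep only threads whose 'used' value is strictly above the threshold."""
    return {
        name: stats
        for name, stats in threads.items()
        if stats.get("used", 0.0) > threshold
    }


# Illustrative input: one busy poll thread, one idle poll thread, one ENS thread.
sample_threads = {
    "vmnic2-pollWorld-0-0x4301158a2040": {"used": 26.17, "ready": 0.21},
    "vmnic2-pollWorld-1-0x4301158a2a80": {"used": 0.40, "ready": 0.02},
    "EnsNetWorld-0-1": {"used": 5.96, "ready": 0.52},
}
active = filter_active_threads(sample_threads)
```

Because the same function handles both thread families, the threshold only needs to be configured in one place.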
Both collectors implement comprehensive logging with:
- Standard output (console) logging
- File logging to 'edge_monitoring.log'
- Configurable verbose mode
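One way to obtain this combination of console output, file output, and a verbose switch with the standard `logging` module is sketched below. The helper name `build_logger` is hypothetical and the collectors may configure logging differently; the format string is chosen to resemble the log lines shown in this document.

```python
import logging
import sys


def build_logger(name: str, log_file: str = "edge_monitoring.log", verbose: bool = False) -> logging.Logger:
    """Return a logger that writes to stdout and to `log_file`."""
    logger = logging.getLogger(name)
    # Verbose mode lowers the level so DEBUG messages are emitted as well.
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        for handler in (logging.StreamHandler(sys.stdout), logging.FileHandler(log_file)):
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger
```

A collector could then call, for example, `build_logger("edge_collector", verbose=True)` once at start-up and reuse the returned logger.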
Enable verbose logging by initializing collectors with:
collector = NSXEdgeStatsCollector(verbose=True)
collector = ESXiStatsCollector(verbose=True)

- Adapter Instance ID
  - Found in the vROps UI under Inventory -> Adapters
  - Required for metric publishing and notifications
  - Must be from an active NSX-T adapter instance
- Authentication
  - Uses token-based authentication
  - Tokens are managed by the `requestvROpsAccessToken.py` module
- Error Handling
  - Failed collections trigger notifications in vROps
  - Notifications include detailed error messages
  - Can be viewed in the vROps alerts and notifications panel
- Credential Management
  - Credentials are stored in `requirements.py`
  - File should have restricted permissions
  - Consider using environment variables or a secure vault in production
- Network Access
  - Monitoring server needs SSH access to Edge nodes and ESXi hosts
  - HTTPS access required for the vROps API
  - Firewall rules should be configured accordingly
- Interface Statistics Command: `get interfaces | json`
  - Collects raw interface metrics
  - JSON format for easy parsing
  - Includes all physical ports
- Performance Statistics Command: `get dataplane perfstats 1`
  - 1-second sampling interval (configurable)
  - Includes CPU and flow cache metrics
  - Real-time performance data

net-stats -i 1 -tW -A

Parameters:
- `-i 1`: 1-second sampling interval
- `-t`: Include thread statistics
- `-W`: Wide output format
- `-A`: All statistics
The Edge Node collector produces a nested dictionary structure containing metrics from all monitored Edge nodes:
{
  "nodes": {
    "18b3dd22-2ba6-482a-80a3-eb90068dfb2d": { # Edge Node UUID
      "interfaces": {
        "fp-eth0": { # Interface name
          "rx_errors": 0.0,
          "rx_misses": 0.0,
          "tx_errors": 0.0,
          "tx_drops": 0.0
        },
        "fp-eth1": {
          "rx_errors": 76421.0, # Cumulative error count
          "rx_misses": 6448.0,
          "tx_errors": 56.0,
          "tx_drops": 42.0
        }
      },
      "performance": {
        "cpu_stats": {
          "0": { # Core ID
            "usage": 5.0, # Percentage
            "rx": 17130.0, # Packets/sec
            "tx": 17130.0,
            "crypto": 0.0,
            "slowpath": 0.0,
            "intercore": 0.0
          }
        },
        "flow_cache_stats": {
          "micro_hit_rate": { # Per core hit rates
            "0": 100.0,
            "1": 100.0,
            "2": null
          },
          "mega_hit_rate": {
            "0": 6.0,
            "1": 10.0,
            "2": null
          }
        }
      }
    }
  },
  "max_values": { # Maximum values across all nodes
    "interfaces": {
      "rx_errors": 76421.0,
      "rx_misses": 6448.0,
      "tx_errors": 36.0,
      "tx_drops": 42.0
    },
    "cpu": {
      "usage": 58.0,
      "crypto": 0.0,
      "slowpath": 20.0,
      "intercore": 0.0
    },
    "flow_cache": {
      "micro_hit_rate": 100.0,
      "mega_hit_rate": 2.0
    }
  }
}

- Interface Statistics:
  - Collected via the `get interfaces | json` command
  - Parsed from JSON response
  - Accumulated error and miss counts stored
- Performance Statistics:
  - Collected via the `get dataplane perfstats {interval}` command
  - CPU metrics sampled over specified interval
  - Flow cache statistics gathered per core
- Maximum Values:
  - Calculated across all nodes during processing
  - Updated for each metric category
  - Stored in top-level `max_values` section
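The maximum-value step for interface counters might look like the sketch below. It assumes the per-node `interfaces` layout shown in the output structure above; the function name `max_interface_values`, the node keys, and the sample counters are illustrative and not taken from the collector.

```python
from typing import Any, Dict

INTERFACE_METRICS = ("rx_errors", "rx_misses", "tx_errors", "tx_drops")


def max_interface_values(nodes: Dict[str, Any]) -> Dict[str, float]:
    """Return, per metric, the largest value seen on any interface of any node."""
    maxima = {metric: 0.0 for metric in INTERFACE_METRICS}
    for node in nodes.values():
        for counters in node.get("interfaces", {}).values():
            for metric in INTERFACE_METRICS:
                value = counters.get(metric)
                if value is not None:  # skip counters a node did not report
                    maxima[metric] = max(maxima[metric], value)
    return maxima


# Made-up sample with two nodes; keys stand in for Edge Node UUIDs.
nodes = {
    "node-a": {"interfaces": {
        "fp-eth0": {"rx_errors": 0.0, "rx_misses": 0.0, "tx_errors": 0.0, "tx_drops": 0.0},
        "fp-eth1": {"rx_errors": 12.0, "rx_misses": 3.0, "tx_errors": 1.0, "tx_drops": 4.0},
    }},
    "node-b": {"interfaces": {
        "fp-eth0": {"rx_errors": 7.0, "rx_misses": 9.0, "tx_errors": 0.0, "tx_drops": 2.0},
    }},
}
interface_max = max_interface_values(nodes)
```

The same pattern extends to the CPU and flow cache categories by swapping the metric names and the nested path that is walked.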
The ESXi collector produces a structure focusing on thread-level statistics:
{
  "hosts": {
    "esxi-01": { # ESXi hostname
      "vmnic_stats": {
        "vmnic2": { # Network interface
          "max_used": 99.52, # Maximum CPU usage
          "max_ready": 0.29, # Maximum ready time
          "threads": {
            "vmnic2-pollWorld-0-0x4301158a2040": {
              "used": 26.17, # Current CPU usage
              "ready": 0.21 # Current ready time
            }
          }
        }
      }
    },
    "esxi-02": {
      "vmnic_stats": {
        "ens": { # EnsNetWorld metrics
          "max_used": 92.33,
          "max_ready": 2.88,
          "tx": { # Transmit threads
            "threads": {
              "EnsNetWorld-0-1": {
                "used": 5.96,
                "ready": 0.52
              }
            }
          },
          "rx": { # Receive threads
            "threads": {
              "EnsNetWorld-0-0": {
                "used": 71.82,
                "ready": 2.88
              }
            }
          }
        }
      }
    }
  },
  "max_values": { # Cluster-wide maximums
    "used": 99.52,
    "ready": 2.88
  }
}

- Thread Statistics:
  - Collected via the `net-stats -i 1 -tW -A` command
  - Filtered for threads above usage threshold (default 2%)
  - Separated into VMNIC and EnsNetWorld categories
- VMNIC Processing:
  - Identifies pollWorld threads per interface
  - Tracks CPU usage and ready time
  - Calculates maximum values per interface
- EnsNetWorld Processing:
  - Separates TX and RX threads
  - Thread ID pattern determines type:
    - Odd numbers: TX threads
    - Even numbers: RX threads
  - Maintains maximum values per direction
- Maximum Values:
  - Calculated across all hosts
  - Updated for both used and ready metrics
  - Represents cluster-wide peak values
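The odd/even rule for EnsNetWorld threads can be expressed as a short classifier. This is a sketch under the assumption that thread names follow the `EnsNetWorld-<GROUP>-<ID>` form used in this document; the function name `ens_direction` is hypothetical.

```python
import re
from typing import Optional

# EnsNetWorld-<GROUP>-<ID>, both numeric (naming assumed from this document).
ENS_THREAD = re.compile(r"^EnsNetWorld-(\d+)-(\d+)$")


def ens_direction(thread_name: str) -> Optional[str]:
    """Return 'tx' for odd thread IDs, 'rx' for even ones, None for other threads."""
    match = ENS_THREAD.match(thread_name)
    if match is None:
        return None
    thread_id = int(match.group(2))
    return "tx" if thread_id % 2 else "rx"


tx_example = ens_direction("EnsNetWorld-0-1")
rx_example = ens_direction("EnsNetWorld-0-0")
other = ens_direction("vmnic2-pollWorld-0-0x4301158a2040")
```

Returning `None` for non-matching names lets the caller route pollWorld threads to the VMNIC branch without a separate check.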
- Only threads exceeding the usage threshold are included
- Default threshold is 2% CPU usage
- Reduces noise in collected metrics
- Focuses on significant resource consumers
This method transforms raw Edge Node statistics into vROps metric format.
def _process_edge_stats(self, stats: Dict[str, Any], timestamp: int) -> list:

- CPU Statistics

# For each core in cpu_stats
metrics.append({
    'statKey': f'EdgePerformanceMetrics|CPU_Stats|Cores:{core}|{stat_name.upper()}',
    'timestamps': [timestamp],
    'data': [value]
})

- Processes each core's metrics separately
- Transforms raw values into time-series data points
- Includes: usage, rx, tx, crypto, slowpath, intercore
- Flow Cache Statistics
# For each cache type (micro/mega) and core
if hit_rate is not None and hit_rate > 0:
    metrics.append({
        'statKey': f'EdgePerformanceMetrics|Flow_Cache_Stats|{cache_type}|Core:{core}',
        'timestamps': [timestamp],
        'data': [hit_rate]
    })

- Filters out null and zero values
- Processes both micro and mega cache hit rates
- Maintains per-core statistics
- Interface Statistics
# For each interface and metric
if value > 0:
    metrics.append({
        'statKey': f'EdgePerformanceMetrics|PhysicalPorts:{interface}|{stat_name.upper()}',
        'timestamps': [timestamp],
        'data': [value]
    })

- Only includes non-zero values
- Tracks rx_errors, rx_misses, tx_errors
- Maintains historical data per interface
Processes ESXi host statistics with focus on thread performance.
def _process_esxi_stats(self, stats: Dict[str, Any], timestamp: int) -> list:

- Thread Counting and Filtering

# Count threads over threshold for each host
if thread_stats.get('used', 0) > self.usage_threshold:
    host_threads_over_threshold += 1
# Once all threads of the host are counted, emit the per-host total
metrics.append({
    'statKey': f'EdgePerformanceMetrics|ESXi|{host_id}|threads_over_usage_threshold',
    'timestamps': [timestamp],
    'data': [host_threads_over_threshold]
})

- Counts threads exceeding usage threshold per host
- Default threshold: 2% CPU usage
- Generates thread count metrics
- VMNIC Thread Processing
# Process VMNIC threads
if vmnic_data.get('max_used', 0) > 0:
    metrics.append({
        'statKey': f'EdgePerformanceMetrics|ESXi|{host_id}|{vmnic}|max_values|used',
        'timestamps': [timestamp],
        'data': [vmnic_data['max_used']]
    })

- Processes each VMNIC's maximum usage
- Tracks ready time statistics
- Filters based on activity threshold
- EnsNetWorld Thread Processing
# Process TX/RX threads separately
for thread_name, thread_stats in ens_data.get('tx', {}).get('threads', {}).items():
    if thread_stats.get('used', 0) > 0:
        metrics.append({
            'statKey': f'EdgePerformanceMetrics|ESXi|{host_id}|EnsNetWorld|TX|{thread_name}|used',
            'timestamps': [timestamp],
            'data': [thread_stats['used']]  # value read from the thread's own stats
        })

- Separates TX and RX threads
- Maintains directional statistics
- Tracks individual thread performance
def collect_cluster_metrics(self, edge_stats, esxi_stats, timestamp):

The main script combines metrics from both collectors:
- Edge Maximum Values
edge_max_values = {
    'cpu_usage': 0.0,
    'crypto': 0.0,
    'slowpath': 0.0,
    'intercore': 0.0
}

- Tracks peak values across all Edge nodes
- Updates maximum values for each metric type
- Maintains historical trends
- Flow Cache Aggregation
flow_cache_totals = {
    'micro_hit_rate': {'sum': 0.0, 'count': 0},
    'mega_hit_rate': {'sum': 0.0, 'count': 0}
}

- Calculates average hit rates
- Excludes inactive cores
- Provides cluster-wide cache performance metrics
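The sum/count bookkeeping above amounts to an average over active cores. The sketch below assumes that an inactive core is one reporting a null or zero hit rate, in line with the filtering applied to per-core flow cache metrics earlier; the function name `average_hit_rate` is illustrative.

```python
from typing import Dict, Optional


def average_hit_rate(per_core: Dict[str, Optional[float]]) -> Optional[float]:
    """Average the hit rate over cores that report a non-null, non-zero value."""
    active = [rate for rate in per_core.values() if rate is not None and rate > 0]
    if not active:
        return None  # no active cores, so no meaningful average
    return sum(active) / len(active)


# Per-core values taken from the Edge Node output example above.
micro = average_hit_rate({"0": 100.0, "1": 100.0, "2": None})
mega = average_hit_rate({"0": 6.0, "1": 10.0, "2": None})
```

Returning `None` rather than 0.0 when no core is active keeps an idle cluster from being reported as a cluster with a 0% hit rate.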
- Interface Totals
interface_totals = {
    'rx_misses': 0.0,
    'tx_errors': 0.0
}

- Sums error counts across interfaces
- Tracks total packet misses
- Monitors overall interface health
The processed statistics follow these path patterns:
- Edge Node Metrics:
EdgePerformanceMetrics|EdgeNodes|max_values|<metric>
EdgePerformanceMetrics|PhysicalPorts|<interface>|<metric>
EdgePerformanceMetrics|Flow_Cache_Stats|<type>|Core:<id>
- ESXi Metrics:
EdgePerformanceMetrics|ESXi|<host>|<vmnic>|max_values|<metric>
EdgePerformanceMetrics|ESXi|<host>|EnsNetWorld|<direction>|<thread>|<metric>
EdgePerformanceMetrics|ESXi|max_values|<metric>
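Keys following these patterns can be assembled with a small helper so that the `|` separator is applied consistently. The helper `stat_key` below is a hypothetical convenience, not part of the scripts described here.

```python
ROOT = "EdgePerformanceMetrics"
SEPARATOR = "|"


def stat_key(*parts: str) -> str:
    """Join path components under the common root, rejecting empty or ambiguous parts."""
    for part in parts:
        if not part or SEPARATOR in part:
            raise ValueError(f"invalid stat key component: {part!r}")
    return SEPARATOR.join((ROOT, *parts))


vmnic_key = stat_key("ESXi", "esxi-01", "vmnic2", "max_values", "used")
edge_key = stat_key("EdgeNodes", "max_values", "cpu_usage")
```

Rejecting components that already contain the separator guards against a host or thread name silently adding an extra level to the metric tree.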
The processed metrics are formatted for vROps ingestion:
{
    'id': resource_id,
    'stat-contents': [
        {
            'statKey': metric_path,
            'timestamps': [timestamp],
            'data': [value]
        }
        # ... additional metrics
    ]
}

This processed data structure enables:
- Historical tracking of performance metrics
- Aggregated views of cluster performance
- Detailed thread-level analysis
- Threshold-based monitoring
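Building the payload shown above from a list of (stat key, value) pairs could look like this sketch. The function name `build_stats_payload` is hypothetical, and the use of epoch milliseconds for the timestamp is an assumption based on the millisecond timestamps used elsewhere in this document.

```python
import json
import time
from typing import Any, Dict, Iterable, Optional, Tuple


def build_stats_payload(
    resource_id: str,
    samples: Iterable[Tuple[str, float]],
    timestamp_ms: Optional[int] = None,
) -> Dict[str, Any]:
    """Wrap (statKey, value) pairs into one payload with a shared timestamp."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # assumed unit: epoch milliseconds
    return {
        "id": resource_id,
        "stat-contents": [
            {"statKey": key, "timestamps": [timestamp_ms], "data": [value]}
            for key, value in samples
        ],
    }


payload = build_stats_payload(
    "2ebffbe1-8aca-46de-919a-cf606c15ed82",
    [
        ("EdgePerformanceMetrics|EdgeNodes|max_values|cpu_usage", 58.0),
        ("EdgePerformanceMetrics|ESXi|max_values|used", 99.52),
    ],
    timestamp_ms=1733944645952,
)
body = json.dumps(payload)  # JSON string that an HTTP client would send
```

Using one timestamp for every metric in a collection cycle keeps the resulting time series aligned when they are charted together.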
2024-12-11 19:17:25,952 - INFO - NSX Edge Stats Collector initialized
2024-12-11 19:17:25,953 - INFO - Found 2 edge nodes in configuration
2024-12-11 19:17:25,953 - INFO - Starting collection of all edge node statistics
2024-12-11 19:17:25,953 - INFO - Processing node: 18b3dd22-2ba6-482a-80a3-eb90068dfb2d
2024-12-11 19:17:36,326 - INFO - ESXi Stats Collector initialized
2024-12-11 19:17:36,326 - INFO - Found 1 edge clusters in configuration
2024-12-11 19:17:36,326 - INFO - Found 2 ESXi hosts in cluster
USAGE_THRESHOLD = 2.0 # Default 2% threshold
if thread_stats.get('used', 0) > USAGE_THRESHOLD:
    metrics.append({
        'statKey': f'EdgePerformanceMetrics|ESXi|{host_id}|{thread_name}|used',
        'timestamps': [timestamp],
        'data': [thread_stats['used']]  # value read from the thread's own stats
    })

- VMNIC Poll World Threads
  - Format: `vmnic<X>-pollWorld-<ID>-<ADDR>`
  - Example: `vmnic2-pollWorld-65-0x4301158d7d80`
  - Indicates NIC polling threads
- EnsNetWorld Threads
  - Format: `EnsNetWorld-<GROUP>-<ID>`
  - Example: `EnsNetWorld-0-1`
  - Even IDs: RX threads
  - Odd IDs: TX threads
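The pollWorld naming format lends itself to a regular expression that recovers the vmnic name, thread ID, and address. The pattern and the function name `parse_poll_world` below are assumptions derived only from the format and example given above, not from the collector source.

```python
import re
from typing import Dict, Optional, Union

# vmnic<X>-pollWorld-<ID>-<ADDR>, with <ADDR> written as a hex literal.
POLL_WORLD = re.compile(r"^(vmnic\d+)-pollWorld-(\d+)-(0x[0-9a-fA-F]+)$")


def parse_poll_world(name: str) -> Optional[Dict[str, Union[str, int]]]:
    """Split a pollWorld thread name into its parts, or return None if it does not match."""
    match = POLL_WORLD.match(name)
    if match is None:
        return None
    return {"vmnic": match.group(1), "id": int(match.group(2)), "addr": match.group(3)}


parsed = parse_poll_world("vmnic2-pollWorld-65-0x4301158d7d80")
not_poll = parse_poll_world("EnsNetWorld-0-1")
```

Grouping threads by the extracted `vmnic` value is what makes per-interface maxima such as `max_used` straightforward to compute.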
EdgePerformanceMetrics|<Component>|<Host/Node>|<Metric>
Example metric paths:
EdgePerformanceMetrics|ESXi|esxi-01|vmnic2|threads_over_usage_threshold
EdgePerformanceMetrics|EdgeNodes|max_values|cpu_usage
suite_api_json = {
    "eventType": "NOTIFICATION",
    "cancelTimeUTC": 0,
    "severity": "WARNING",
    "keyIndicator": False,
    "managedExternally": False,
    "resourceId": adapter_instance_id,
    "message": error_message,
    "startTimeUTC": current_time_in_millis
}

- SSH Connection Failures: `Failed to connect to node {node_id} ({ip}): Connection timed out`
- Authentication Failures: `Authentication failed for host {host_id}`
- Command Execution Errors: `Error executing command '{command}' on node {node_id}: {error}`
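To tie the error messages to the notification payload, the sketch below fills the same fields from an adapter instance ID and a message. The function name `build_error_event` is hypothetical; only the dictionary keys and fixed values come from the payload shown above.

```python
import time
from typing import Any, Dict, Optional


def build_error_event(
    adapter_instance_id: str,
    error_message: str,
    now_ms: Optional[int] = None,
) -> Dict[str, Any]:
    """Build the notification event body for a failed collection."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # current time in epoch milliseconds
    return {
        "eventType": "NOTIFICATION",
        "cancelTimeUTC": 0,
        "severity": "WARNING",
        "keyIndicator": False,
        "managedExternally": False,
        "resourceId": adapter_instance_id,
        "message": error_message,
        "startTimeUTC": now_ms,
    }


node_id, ip = "18b3dd22-2ba6-482a-80a3-eb90068dfb2d", "10.191.21.93"
event = build_error_event(
    "fe470524-a2c7-4890-ba43-4f317a88bb76",
    f"Failed to connect to node {node_id} ({ip}): Connection timed out",
    now_ms=1733944645952,
)
```

Keeping the timestamp injectable makes the function easy to unit test without patching the clock.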
- Edge Node collection: ~2-3 seconds per node
- ESXi collection: ~1-2 seconds per host
- vROps publication: ~1 second per batch
# Create log file
sudo touch /var/log/edge_monitoring.log
sudo chmod 666 /var/log/edge_monitoring.log

# Open crontab editor
crontab -e
# Add this line to run every minute (replace /path/to with your actual path)
* * * * * cd /path/to/ScriptsFolder && python3 getEdgeNodeStatsMainScript.py >> /var/log/edge_monitoring.log 2>&1

# Check if script is running (shows recent log entries)
tail -f /var/log/edge_monitoring.log
# Check cron logs if there are issues
grep CRON /var/log/syslog
The NSX adapter instance ID (highlighted in red) can be found in vROps as follows:
- Navigate to: Inventory Management -> Adapter Instances
- Look for the NSX adapter entry
- The ID is a UUID, for example: "8e434d15-a29b-4776-aea3-bb21bc5c2c2f"
The collected metrics are organized hierarchically in vROps under the Edge Cluster object:
- Located under EdgePerformanceMetrics/EdgeNodes
- Shows cluster-wide metrics including:
- Maximum CPU usage across nodes
- Peak values over time
- CPU utilization trends
- Found under EdgePerformanceMetrics/PhysicalPorts
- Per-interface statistics including:
- RX_MISSES counter
- RX_ERRORS counter
- Shows historical trends of packet drops and errors
- Located under EdgePerformanceMetrics/ESXi
- Maximum value across all network threads on all ESXi hosts supporting the Edge cluster
- Detailed view of individual VMNIC threads
- Shows:
- Per-thread CPU usage
- Maximum values over time
- Thread-specific performance data
- Historical trending of thread utilization



