Merged
44 commits
b5a8171
add cpu monitoring
AlexandreKempf Feb 6, 2024
e3b654c
add unit tests and more cpu metrics
AlexandreKempf Feb 6, 2024
e6cff32
change default value for callback
AlexandreKempf Feb 6, 2024
8663011
uses a percentage value for cpu parallelism
AlexandreKempf Feb 6, 2024
1346750
add ram total
AlexandreKempf Feb 6, 2024
5f90bea
remove total ram measure from plots
AlexandreKempf Feb 6, 2024
2ac50f2
update pyproject.toml
AlexandreKempf Feb 7, 2024
5c0a288
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 7, 2024
30315bb
add tmpdir to metrics tests
AlexandreKempf Feb 7, 2024
0f55943
Merge branch 'main' into monitor-cpu-ressources
AlexandreKempf Feb 7, 2024
40686c7
default to no monitoring callbacks
AlexandreKempf Feb 7, 2024
4b98144
fix tmp_dir for test on windows and macos
AlexandreKempf Feb 9, 2024
36fd94d
fix tmp_dir for test on windows and macos
AlexandreKempf Feb 9, 2024
59db522
fix update data to studio live experiment
AlexandreKempf Feb 9, 2024
2393fb0
fix studio update data problem
AlexandreKempf Feb 9, 2024
3b327c5
debug studio updates
AlexandreKempf Feb 9, 2024
b0ac980
improve code readability:
AlexandreKempf Feb 12, 2024
5162ac2
remove hack lightning
AlexandreKempf Feb 12, 2024
9353645
fix lightning problem with steps in studio
AlexandreKempf Feb 12, 2024
871bebc
simplify the metric names
AlexandreKempf Feb 13, 2024
072252d
Merge branch 'main' into monitor-cpu-ressources
AlexandreKempf Feb 14, 2024
e169abc
don't increment `num_point_sent_to_studio` if studio didn't received …
AlexandreKempf Feb 14, 2024
aa5b511
add directory metrics to the list of metrics tracked + refacto
AlexandreKempf Feb 15, 2024
9d0e70d
clean code and split features into several PRs
AlexandreKempf Feb 15, 2024
ba72e01
cleaner user interface
AlexandreKempf Feb 15, 2024
6e29ff6
add docstrings
AlexandreKempf Feb 15, 2024
719087b
mypy conflicts
AlexandreKempf Feb 15, 2024
32de58c
monitor GPU metrics
AlexandreKempf Feb 15, 2024
b9e9bac
detect GPUs with nmvl import
AlexandreKempf Feb 15, 2024
b44881b
replace py3nvml to pynvml and add setter to monitor_gpu
AlexandreKempf Feb 16, 2024
8654aee
change error types and `monitor_cpu` to `cpu_monitor`
AlexandreKempf Feb 16, 2024
f0d6234
add unit tests about _num_points_sent_to_studio behavior
AlexandreKempf Feb 16, 2024
bc48074
Merge branch 'main' into monitor-cpu-ressources
Feb 16, 2024
d2c1c84
use constant values for metrics names
AlexandreKempf Feb 16, 2024
7ab8915
improve test and user inputs
AlexandreKempf Feb 19, 2024
a8d1bfe
improve tests and error catching
AlexandreKempf Feb 20, 2024
3359001
add docstring
AlexandreKempf Feb 20, 2024
30446bc
Merge branch 'monitor-cpu-ressources' into monitor-gpu-ressources
AlexandreKempf Feb 20, 2024
3f94e24
merge cpu and gpu monitoring into a system monitor object
AlexandreKempf Feb 20, 2024
506dd52
change pyproject dependencies and fix typo
AlexandreKempf Feb 21, 2024
5df5d48
install psutil and pynvml by default
AlexandreKempf Feb 22, 2024
29477ba
remove plot argument in SystemMonitor
AlexandreKempf Feb 22, 2024
53d83d5
change call to the SystemMonitor object to be a Live method
AlexandreKempf Feb 22, 2024
38d5c72
Merge branch 'main' into monitor-gpu-ressources
AlexandreKempf Feb 22, 2024
8 changes: 6 additions & 2 deletions pyproject.toml
Expand Up @@ -37,7 +37,9 @@ dependencies = [
"funcy",
"gto",
"ruamel.yaml",
"scmrepo>=3,<4"
"scmrepo>=3,<4",
"psutil",
"pynvml"
Comment on lines +41 to +42


Should we include by default or make them optional?

Contributor Author

We will include them if we decide to monitor the system's metrics by default :)
I created a new option called system.


I meant it as a genuine question, not a suggestion to make them optional. How lightweight are they? I think there's also downside to adding lots of options. We have discussed in the past making a lightweight version of dvclive for those who need it very lean. Totally up to you here.

Contributor

Yep, my 2 cents: I would prefer "system" to be enabled by default. Agreed that we should check how big and complicated those deps are.

Contributor Author

Both, when installed together, are quite light: ~500KB.

@shcheklein Then should I change the monitor_system default to True in the Live object?
Any idea on that topic @dberenbaum?

My $0.02: it would be nice to have it on by default, especially for new users. But for existing users, it can be annoying to get new plots all over a Studio dashboard they spent days customizing. What should we prioritize?

Contributor

off and consider making it a default in 4.0

Sounds good. Let's make a ticket for this :) and do it soon, I hope. (I just feel it's a really good and appealing feature: people will stick better with the extension, Studio, etc.)

Contributor Author

Let's wait for 4.0. Do we have a roadmap to 4.0 by any chance, so I can add that ticket?
Also this ticket should mention that psutil and pynvml should not be optional anymore.


Here it is: #746

@dberenbaum, Feb 21, 2024

If we make the deps optional, do we need to change the PR a bit to make sure we fail gracefully if they are not installed? It looks like loading dvclive.live now will try to import those deps. (I'm also fine to just start installing them by default now since I think it's less annoying and non-breaking)

Contributor Author

OK, I made them default dependencies, as they are quite light. Having them installed by default should also simplify the documentation.
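
The graceful-failure option raised in this thread could look roughly like the sketch below. It is not what the PR does (the PR ends up installing both packages by default). `optional_import` is a hypothetical helper name used only for illustration, and the sketch relies on standard-library calls only.

```python
import importlib
import importlib.util
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the imported top-level module, or None if it is not installed.

    Hypothetical helper for illustration; it is not part of dvclive.
    """
    if importlib.util.find_spec(name) is None:
        return None
    return importlib.import_module(name)


# Stand-ins: "json" always exists, the second name is assumed not to exist.
json_module = optional_import("json")
missing_module = optional_import("module_that_is_not_installed_xyz")

print(json_module is not None)  # True
print(missing_module is None)   # True
```

A caller could then skip GPU or CPU metrics when the corresponding module is `None` instead of failing at import time.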

]

[project.optional-dependencies]
Expand All @@ -51,7 +53,9 @@ tests = [
"pytest-cov>=3.0.0,<4.0",
"pytest-mock>=3.8.2,<4.0",
"dvclive[image,plots,markdown]",
"ipython"
"ipython",
"pytest_voluptuous",
"dpath"
]
dev = [
"dvclive[all,tests]",
Expand Down
52 changes: 52 additions & 0 deletions src/dvclive/live.py
Expand Up @@ -7,9 +7,11 @@
import os
import shutil
import tempfile

from pathlib import Path, PurePath
from typing import Any, Dict, List, Optional, Set, Tuple, Union, TYPE_CHECKING, Literal


if TYPE_CHECKING:
    import numpy as np
    import pandas as pd
Expand Down Expand Up @@ -41,6 +43,7 @@
from .report import BLANK_NOTEBOOK_REPORT, make_report
from .serialize import dump_json, dump_yaml, load_yaml
from .studio import get_dvc_studio_config, post_to_studio
from .monitor_system import _SystemMonitor
from .utils import (
    StrPath,
    catch_and_warn,
Expand Down Expand Up @@ -81,6 +84,7 @@ def __init__(
        cache_images: bool = False,
        exp_name: Optional[str] = None,
        exp_message: Optional[str] = None,
        monitor_system: bool = False,
    ):
        """
        Initializes a DVCLive logger. A `Live()` instance is required in order to log
Expand Down Expand Up @@ -119,6 +123,8 @@ def __init__(
                provided string will be passed to `dvc exp save --message`.
                If DVCLive is used inside `dvc exp run`, the option will be ignored, use
                `dvc exp run --message` instead.
            monitor_system (bool): if `True`, DVCLive will monitor GPU, CPU, ram, and
                disk usage. Defaults to `False`.
        """
        self.summary: Dict[str, Any] = {}
self.summary: Dict[str, Any] = {}

Expand Down Expand Up @@ -165,6 +171,10 @@ def __init__(
        self._dvc_studio_config: Dict[str, Any] = {}
        self._init_studio()

        self._system_monitor: Optional[_SystemMonitor] = None  # Monitoring thread
        if monitor_system:
            self.monitor_system()

    def _init_resume(self):
        self._read_params()
        self.summary = self.read_latest()
Expand Down Expand Up @@ -370,6 +380,43 @@ def step(self, value: int) -> None:
        self._step = value
        logger.debug(f"Step: {self.step}")

    def monitor_system(
        self,
        interval: float = 0.05,  # seconds
        num_samples: int = 20,
        directories_to_monitor: Optional[Dict[str, str]] = None,
    ) -> None:
        """Monitor GPU, CPU, ram, and disk resources and log them to DVC Live.

        Args:
            interval (float): the time interval between samples in seconds. To keep the
                sampling interval small, the maximum value allowed is 0.1 seconds.


Not a blocker, but is the max value of 0.1 seconds needed?

Contributor Author @AlexandreKempf, Feb 22, 2024

I like that we guide users toward realistic values if they have no idea what a good sampling interval is. The values come from W&B's code.

                Default to 0.05.
            num_samples (int): the number of samples to collect before the aggregation.
                The value should be between 1 and 30 samples. Default to 20.


Why between 1 and 30?

Contributor Author @AlexandreKempf, Feb 22, 2024

Same reason as above

            directories_to_monitor (Optional[Dict[str, str]]): a dictionary with the
                information about which directories to monitor. The `key` would be the
                name of the metric and the `value` is the path to the directory.
                The metric tracked concerns the partition that contains the directory.
                Default to `{"main": "/"}`.

        Raises:
            ValueError: if the keys in `directories_to_monitor` contains invalid
                characters as defined by `os.path.normpath`.
        """
        if directories_to_monitor is None:
            directories_to_monitor = {"main": "/"}

        if self._system_monitor is not None:
            self._system_monitor.end()

        self._system_monitor = _SystemMonitor(
            live=self,
            interval=interval,
            num_samples=num_samples,
            directories_to_monitor=directories_to_monitor,
        )

    def sync(self):
        self.make_summary()

Expand Down Expand Up @@ -857,6 +904,11 @@ def end(self):
        # If next_step called before end, don't want to update step number
        if "step" in self.summary:
            self.step = self.summary["step"]

        # Kill threads that monitor the system metrics
        if self._system_monitor is not None:
            self._system_monitor.end()

        self.sync()

        if self._inside_dvc_exp and self._dvc_repo:
Expand Down
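
To make the `interval` and `num_samples` bounds discussed in the review concrete: with the docstring defaults (0.05 s and 20 samples), one aggregated point is produced roughly every 0.05 × 20 = 1 second, ignoring the time spent collecting each sample. The sketch below mirrors the clamping described in the docstring. `clamp_settings` and `aggregation_window` are hypothetical names, and treating the bounds as inclusive is an assumption based on the wording "between 1 and 30".

```python
from typing import Tuple

MAX_INTERVAL = 0.1      # seconds, upper bound stated in the docstring
MIN_NUM_SAMPLES = 1     # lower bound stated in the docstring
MAX_NUM_SAMPLES = 30    # upper bound stated in the docstring


def clamp_settings(interval: float, num_samples: int) -> Tuple[float, int]:
    """Clamp both settings into their documented ranges (bounds inclusive)."""
    interval = min(interval, MAX_INTERVAL)
    num_samples = max(MIN_NUM_SAMPLES, min(num_samples, MAX_NUM_SAMPLES))
    return interval, num_samples


def aggregation_window(interval: float, num_samples: int) -> float:
    """Approximate number of seconds between two logged points."""
    interval, num_samples = clamp_settings(interval, num_samples)
    return interval * num_samples


print(clamp_settings(0.5, 100))  # (0.1, 30)
print(clamp_settings(0.05, 20))  # (0.05, 20)
```

Under these bounds the longest possible window is about 0.1 × 30 = 3 seconds.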
240 changes: 240 additions & 0 deletions src/dvclive/monitor_system.py
@@ -0,0 +1,240 @@
import logging
import os
from typing import Dict, Union, Tuple

import psutil
from statistics import mean
from threading import Event, Thread
from funcy import merge_with

try:
    from pynvml import (
        nvmlInit,
        nvmlDeviceGetCount,
        nvmlDeviceGetHandleByIndex,
        nvmlDeviceGetMemoryInfo,
        nvmlDeviceGetUtilizationRates,
        nvmlShutdown,
        NVMLError,
    )

    GPU_AVAILABLE = True
except ImportError:
    GPU_AVAILABLE = False

logger = logging.getLogger("dvclive")
GIGABYTES_DIVIDER = 1024.0**3

MINIMUM_CPU_USAGE_TO_BE_ACTIVE = 20

METRIC_CPU_COUNT = "system/cpu/count"
METRIC_CPU_USAGE_PERCENT = "system/cpu/usage (%)"
METRIC_CPU_PARALLELIZATION_PERCENT = "system/cpu/parallelization (%)"

METRIC_RAM_USAGE_PERCENT = "system/ram/usage (%)"
METRIC_RAM_USAGE_GB = "system/ram/usage (GB)"
METRIC_RAM_TOTAL_GB = "system/ram/total (GB)"

METRIC_DISK_USAGE_PERCENT = "system/disk/usage (%)"
METRIC_DISK_USAGE_GB = "system/disk/usage (GB)"
METRIC_DISK_TOTAL_GB = "system/disk/total (GB)"

METRIC_GPU_COUNT = "system/gpu/count"
METRIC_GPU_USAGE_PERCENT = "system/gpu/usage (%)"
METRIC_VRAM_USAGE_PERCENT = "system/vram/usage (%)"
METRIC_VRAM_USAGE_GB = "system/vram/usage (GB)"
METRIC_VRAM_TOTAL_GB = "system/vram/total (GB)"


class _SystemMonitor:
    _plot_blacklist_prefix: Tuple = (
        METRIC_CPU_COUNT,
        METRIC_RAM_TOTAL_GB,
        METRIC_DISK_TOTAL_GB,
        METRIC_GPU_COUNT,
        METRIC_VRAM_TOTAL_GB,
    )

    def __init__(
        self,
        live,
        interval: float,  # seconds
        num_samples: int,
        directories_to_monitor: Dict[str, str],
    ):
        self._live = live
        self._interval = self._check_interval(interval, max_interval=0.1)
        self._num_samples = self._check_num_samples(
            num_samples, min_num_samples=1, max_num_samples=30
        )
        self._disks_to_monitor = self._check_directories_to_monitor(
            directories_to_monitor
        )
        self._warn_cpu_problem = True
        self._warn_gpu_problem = True
        self._warn_disk_doesnt_exist: Dict[str, bool] = {}

        self._shutdown_event = Event()
        Thread(
            target=self._monitoring_loop,
        ).start()

    def _check_interval(self, interval: float, max_interval: float) -> float:
        if interval > max_interval:
            logger.warning(
                f"System monitoring `interval` should be less than {max_interval} "
                f"seconds. Setting `interval` to {max_interval} seconds."
            )
            return max_interval
        return interval

    def _check_num_samples(
        self, num_samples: int, min_num_samples: int, max_num_samples: int
    ) -> int:
        # Use the bounds passed by the caller (they were re-assigned to 1 and 30
        # here, which made the parameters useless) and treat both bounds as
        # inclusive, so the documented limits 1 and 30 do not trigger a warning.
        if not min_num_samples <= num_samples <= max_num_samples:
            num_samples = max(min(num_samples, max_num_samples), min_num_samples)
            logger.warning(
                f"System monitoring `num_samples` should be between {min_num_samples} "
                f"and {max_num_samples}. Setting `num_samples` to {num_samples}."
            )
        return num_samples

    def _check_directories_to_monitor(
        self, directories_to_monitor: Dict[str, str]
    ) -> Dict[str, str]:
        disks_to_monitor = {}
        for disk_name, disk_path in directories_to_monitor.items():
            if disk_name != os.path.normpath(disk_name):
                raise ValueError(  # noqa: TRY003
                    "Keys for `directories_to_monitor` should be a valid name"
                    f", but got '{disk_name}'."
                )
            disks_to_monitor[disk_name] = disk_path
        return disks_to_monitor

    def _monitoring_loop(self):
        while not self._shutdown_event.is_set():
            self._metrics = {}
            for _ in range(self._num_samples):
                # Start from an empty sample so that a failed read does not leave
                # `last_metrics` undefined (first pass) or stale (later passes).
                last_metrics = {}
                try:
                    last_metrics = self._get_metrics()
                except psutil.Error:
                    if self._warn_cpu_problem:
                        logger.exception("Failed to monitor CPU metrics")
                        self._warn_cpu_problem = False
                except NVMLError:
                    if self._warn_gpu_problem:
                        logger.exception("Failed to monitor GPU metrics")
                        self._warn_gpu_problem = False

                self._metrics = merge_with(sum, self._metrics, last_metrics)
                self._shutdown_event.wait(self._interval)
                if self._shutdown_event.is_set():
                    break
            for name, values in self._metrics.items():
                blacklisted = any(
                    name.startswith(prefix) for prefix in self._plot_blacklist_prefix
                )
                self._live.log_metric(
                    name,
                    values / self._num_samples,
                    timestamp=True,
                    plot=None if blacklisted else True,
                )

    def _get_metrics(self) -> Dict[str, Union[float, int]]:
        return {
            **self._get_gpu_info(),
            **self._get_cpu_info(),
            **self._get_ram_info(),
            **self._get_disk_info(),
        }

    def _get_ram_info(self) -> Dict[str, Union[float, int]]:
        ram_info = psutil.virtual_memory()
        return {
            METRIC_RAM_USAGE_PERCENT: ram_info.percent,
            METRIC_RAM_USAGE_GB: ram_info.used / GIGABYTES_DIVIDER,
            METRIC_RAM_TOTAL_GB: ram_info.total / GIGABYTES_DIVIDER,
        }

    def _get_cpu_info(self) -> Dict[str, Union[float, int]]:
        num_cpus = psutil.cpu_count()
        cpus_percent = psutil.cpu_percent(percpu=True)
        return {
            METRIC_CPU_COUNT: num_cpus,
            METRIC_CPU_USAGE_PERCENT: mean(cpus_percent),
            METRIC_CPU_PARALLELIZATION_PERCENT: len(
                [
                    percent
                    for percent in cpus_percent
                    if percent >= MINIMUM_CPU_USAGE_TO_BE_ACTIVE
                ]
            )
            * 100
            / num_cpus,
        }

    def _get_disk_info(self) -> Dict[str, Union[float, int]]:
        result = {}
        for disk_name, disk_path in self._disks_to_monitor.items():
            try:
                disk_info = psutil.disk_usage(disk_path)
            except OSError:
                if self._warn_disk_doesnt_exist.get(disk_name, True):
                    logger.warning(
                        f"Couldn't find directory '{disk_path}', ignoring it."
                    )
                    self._warn_disk_doesnt_exist[disk_name] = False
                continue
            disk_metrics = {
                f"{METRIC_DISK_USAGE_PERCENT}/{disk_name}": disk_info.percent,
                f"{METRIC_DISK_USAGE_GB}/{disk_name}": disk_info.used
                / GIGABYTES_DIVIDER,
                f"{METRIC_DISK_TOTAL_GB}/{disk_name}": disk_info.total
                / GIGABYTES_DIVIDER,
            }
            disk_metrics = {k.rstrip("/"): v for k, v in disk_metrics.items()}
            result.update(disk_metrics)
        return result

    def _get_gpu_info(self) -> Dict[str, Union[float, int]]:
        if not GPU_AVAILABLE:
            return {}

        nvmlInit()
        num_gpus = nvmlDeviceGetCount()
        gpu_metrics = {
            METRIC_GPU_COUNT: num_gpus,
        }

        for gpu_idx in range(num_gpus):
            gpu_handle = nvmlDeviceGetHandleByIndex(gpu_idx)
            memory_info = nvmlDeviceGetMemoryInfo(gpu_handle)
            usage_info = nvmlDeviceGetUtilizationRates(gpu_handle)

            gpu_metrics.update(
                {
                    f"{METRIC_GPU_USAGE_PERCENT}/{gpu_idx}": (
                        100 * usage_info.memory / usage_info.gpu
                        if usage_info.gpu
                        else 0
                    ),
                    f"{METRIC_VRAM_USAGE_PERCENT}/{gpu_idx}": (
                        100 * memory_info.used / memory_info.total
                    ),
                    f"{METRIC_VRAM_USAGE_GB}/{gpu_idx}": (
                        memory_info.used / GIGABYTES_DIVIDER
                    ),
                    f"{METRIC_VRAM_TOTAL_GB}/{gpu_idx}": (
                        memory_info.total / GIGABYTES_DIVIDER
                    ),
                }
            )
        nvmlShutdown()
        return gpu_metrics

    def end(self):
        self._shutdown_event.set()
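
The sample-then-average pattern used by `_monitoring_loop` can be sketched with the standard library alone. This is a simplified illustration, not the PR's implementation: `sampling_loop`, `fake_reader`, and `record` are hypothetical names, `fake_reader` stands in for the psutil/pynvml readers, and failed samples are skipped so that the mean is taken over the samples actually collected.

```python
from threading import Event, Thread
from typing import Callable, Dict, List, Optional


def sampling_loop(
    read_metrics: Callable[[], Dict[str, float]],
    log: Callable[[Dict[str, float]], None],
    shutdown: Event,
    interval: float,
    num_samples: int,
) -> None:
    """Collect `num_samples` readings, log their mean, repeat until shutdown."""
    while not shutdown.is_set():
        totals: Dict[str, float] = {}
        collected = 0
        for _ in range(num_samples):
            sample: Optional[Dict[str, float]]
            try:
                sample = read_metrics()
            except OSError:
                sample = None  # skip a failed reading instead of reusing stale data
            if sample is not None:
                for name, value in sample.items():
                    totals[name] = totals.get(name, 0.0) + value
                collected += 1
            # Event.wait returns True as soon as the event is set, so the
            # thread stops promptly instead of sleeping the full interval.
            if shutdown.wait(interval):
                return
        if collected:
            log({name: total / collected for name, total in totals.items()})


logged: List[Dict[str, float]] = []
stop = Event()


def fake_reader() -> Dict[str, float]:
    # Constant reading, so the expected mean is easy to verify.
    return {"system/cpu/usage (%)": 50.0}


def record(point: Dict[str, float]) -> None:
    logged.append(point)
    stop.set()  # stop after the first aggregated point to keep the demo short


worker = Thread(target=sampling_loop, args=(fake_reader, record, stop, 0.001, 3))
worker.start()
worker.join(timeout=5)
print(logged)  # [{'system/cpu/usage (%)': 50.0}]
```

Waiting on the `Event` rather than calling `time.sleep` is the same choice the PR makes: `end()` only has to set the event for the thread to exit.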