Monitors CPU, RAM, and disk usage #773
Closed
35 commits
b5a8171 add cpu monitoring (AlexandreKempf)
e3b654c add unit tests and more cpu metrics (AlexandreKempf)
e6cff32 change default value for callback (AlexandreKempf)
8663011 uses a percentage value for cpu parallelism (AlexandreKempf)
1346750 add ram total (AlexandreKempf)
5f90bea remove total ram measure from plots (AlexandreKempf)
2ac50f2 update pyproject.toml (AlexandreKempf)
5c0a288 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
30315bb add tmpdir to metrics tests (AlexandreKempf)
0f55943 Merge branch 'main' into monitor-cpu-ressources (AlexandreKempf)
40686c7 default to no monitoring callbacks (AlexandreKempf)
4b98144 fix tmp_dir for test on windows and macos (AlexandreKempf)
36fd94d fix tmp_dir for test on windows and macos (AlexandreKempf)
59db522 fix update data to studio live experiment (AlexandreKempf)
2393fb0 fix studio update data problem (AlexandreKempf)
3b327c5 debug studio updates (AlexandreKempf)
b0ac980 improve code readability: (AlexandreKempf)
5162ac2 remove hack lightning (AlexandreKempf)
9353645 fix lightning problem with steps in studio (AlexandreKempf)
871bebc simplify the metric names (AlexandreKempf)
072252d Merge branch 'main' into monitor-cpu-ressources (AlexandreKempf)
e169abc don't increment `num_point_sent_to_studio` if studio didn't received … (AlexandreKempf)
aa5b511 add directory metrics to the list of metrics tracked + refacto (AlexandreKempf)
9d0e70d clean code and split features into several PRs (AlexandreKempf)
ba72e01 cleaner user interface (AlexandreKempf)
6e29ff6 add docstrings (AlexandreKempf)
719087b mypy conflicts (AlexandreKempf)
8654aee change error types and `monitor_cpu` to `cpu_monitor` (AlexandreKempf)
f0d6234 add unit tests about _num_points_sent_to_studio behavior (AlexandreKempf)
bc48074 Merge branch 'main' into monitor-cpu-ressources
d2c1c84 use constant values for metrics names (AlexandreKempf)
7ab8915 improve test and user inputs (AlexandreKempf)
a8d1bfe improve tests and error catching (AlexandreKempf)
4d42f0e add docstring and fix typo (AlexandreKempf)
c4485d7 bugfix (AlexandreKempf)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,195 @@ | ||
| import abc | ||
| import logging | ||
| import os | ||
| from typing import Dict, Union, Optional, Tuple | ||
|
|
||
| import psutil | ||
| from statistics import mean | ||
| from threading import Event, Thread | ||
| from funcy import merge_with | ||
|
|
||
|
|
||
| logger = logging.getLogger("dvclive") | ||
| GIGABYTES_DIVIDER = 1024.0**3 | ||
|
|
||
| MINIMUM_CPU_USAGE_TO_BE_ACTIVE = 20 | ||
|
|
||
| METRIC_CPU_COUNT = "system/cpu/count" | ||
| METRIC_CPU_USAGE_PERCENT = "system/cpu/usage (%)" | ||
| METRIC_CPU_PARALLELIZATION_PERCENT = "system/cpu/parallelization (%)" | ||
|
|
||
| METRIC_RAM_USAGE_PERCENT = "system/ram/usage (%)" | ||
| METRIC_RAM_USAGE_GB = "system/ram/usage (GB)" | ||
| METRIC_RAM_TOTAL_GB = "system/ram/total (GB)" | ||
|
|
||
| METRIC_DISK_USAGE_PERCENT = "system/disk/usage (%)" | ||
| METRIC_DISK_USAGE_GB = "system/disk/usage (GB)" | ||
| METRIC_DISK_TOTAL_GB = "system/disk/total (GB)" | ||
|
|
||
|
|
||
| class _SystemMonitor(abc.ABC): | ||
| """ | ||
| Monitor system resources and log them to DVC Live. | ||
| Use a separate thread to call a `_get_metrics` function at a fixed interval and | ||
| aggregate the results of this sampling using the average. | ||
| """ | ||
|
|
||
| _plot_blacklist_prefix: Tuple = () | ||
|
|
||
| def __init__( | ||
| self, | ||
| interval: float, | ||
| num_samples: int, | ||
| plot: bool = True, | ||
| ): | ||
| max_interval = 0.1 | ||
| if interval > max_interval: | ||
| interval = max_interval | ||
| logger.warning( | ||
| f"System monitoring `interval` should be at most {max_interval} " | ||
| f"seconds. Setting `interval` to {interval} seconds." | ||
| ) | ||
|
|
||
| min_num_samples = 1 | ||
| max_num_samples = 30 | ||
| if not min_num_samples <= num_samples <= max_num_samples: | ||
| num_samples = max(min(num_samples, max_num_samples), min_num_samples) | ||
| logger.warning( | ||
| f"System monitoring `num_samples` should be between {min_num_samples} " | ||
| f"and {max_num_samples}. Setting `num_samples` to {num_samples}." | ||
| ) | ||
|
|
||
| self._interval = interval # seconds | ||
| self._nb_samples = num_samples | ||
| self._plot = plot | ||
| self._warn_user = True | ||
|
|
||
| def __call__(self, live): | ||
| self._live = live | ||
| self._shutdown_event = Event() | ||
| Thread( | ||
| target=self._monitoring_loop, | ||
| ).start() | ||
|
|
||
| def _monitoring_loop(self): | ||
| while not self._shutdown_event.is_set(): | ||
| self._metrics = {} | ||
| for _ in range(self._nb_samples): | ||
| last_metrics = {} | ||
| try: | ||
| last_metrics = self._get_metrics() | ||
| except psutil.Error: | ||
| if self._warn_user: | ||
| logger.exception("Failed to monitor CPU metrics") | ||
| self._warn_user = False | ||
|
|
||
| self._metrics = merge_with(sum, self._metrics, last_metrics) | ||
| self._shutdown_event.wait(self._interval) | ||
| if self._shutdown_event.is_set(): | ||
| break | ||
| for name, values in self._metrics.items(): | ||
| blacklisted = any( | ||
| name.startswith(prefix) for prefix in self._plot_blacklist_prefix | ||
| ) | ||
| self._live.log_metric( | ||
| name, | ||
| values / self._nb_samples, | ||
| timestamp=True, | ||
| plot=None if blacklisted else self._plot, | ||
| ) | ||
|
|
||
| @abc.abstractmethod | ||
| def _get_metrics(self) -> Dict[str, Union[float, int]]: | ||
| pass | ||
|
|
||
| def end(self): | ||
| self._shutdown_event.set() | ||
|
|
||
|
|
||
| class CPUMonitor(_SystemMonitor): | ||
| _plot_blacklist_prefix: Tuple = ( | ||
| METRIC_CPU_COUNT, | ||
| METRIC_RAM_TOTAL_GB, | ||
| METRIC_DISK_TOTAL_GB, | ||
| ) | ||
|
|
||
| def __init__( | ||
| self, | ||
| interval: float = 0.1, | ||
| num_samples: int = 20, | ||
| directories_to_monitor: Optional[Dict[str, str]] = None, | ||
| plot: bool = True, | ||
| ): | ||
| """Monitor CPU resources and log them to DVC Live. | ||
|
|
||
| Args: | ||
| interval (float): interval in seconds between two measurements. | ||
| Defaults to 0.1. | ||
| num_samples (int): number of samples to average. Defaults to 20. | ||
| directories_to_monitor (Optional[Dict[str, str]]): monitor disk usage | ||
| statistics about the partition which contains the given paths. The | ||
| statistics include total and used space in gigabytes and percent. | ||
| This argument expects a dict where the key is the name that will be used | ||
| in the metric's name and the value is the path to the directory to | ||
| monitor. Defaults to {"main": "/"}. | ||
| plot (bool): should the system metrics be saved as plots. Defaults to True. | ||
|
|
||
| Raises: | ||
| ValueError: if a key of `directories_to_monitor` is not a valid name. | ||
| """ | ||
| super().__init__(interval=interval, num_samples=num_samples, plot=plot) | ||
| directories_to_monitor = ( | ||
| {"main": "/"} if directories_to_monitor is None else directories_to_monitor | ||
| ) | ||
| self._disks_to_monitor = {} | ||
| for disk_name, disk_path in directories_to_monitor.items(): | ||
| if disk_name != os.path.normpath(disk_name): | ||
| raise ValueError( # noqa: TRY003 | ||
| "Keys for `directories_to_monitor` should be a valid name" | ||
| f", but got '{disk_name}'." | ||
| ) | ||
| self._disks_to_monitor[disk_name] = disk_path | ||
|
|
||
| self._warn_disk_doesnt_exist: Dict[str, bool] = {} | ||
|
|
||
| def _get_metrics(self) -> Dict[str, Union[float, int]]: | ||
| ram_info = psutil.virtual_memory() | ||
| nb_cpus = psutil.cpu_count() | ||
| cpus_percent = psutil.cpu_percent(percpu=True) | ||
| result = { | ||
| METRIC_CPU_COUNT: nb_cpus, | ||
| METRIC_CPU_USAGE_PERCENT: mean(cpus_percent), | ||
| METRIC_CPU_PARALLELIZATION_PERCENT: len( | ||
| [ | ||
| percent | ||
| for percent in cpus_percent | ||
| if percent >= MINIMUM_CPU_USAGE_TO_BE_ACTIVE | ||
| ] | ||
| ) | ||
| * 100 | ||
| / nb_cpus, | ||
| METRIC_RAM_USAGE_PERCENT: ram_info.percent, | ||
| METRIC_RAM_USAGE_GB: ram_info.used / GIGABYTES_DIVIDER, | ||
| METRIC_RAM_TOTAL_GB: ram_info.total / GIGABYTES_DIVIDER, | ||
| } | ||
| for disk_name, disk_path in self._disks_to_monitor.items(): | ||
| try: | ||
| disk_info = psutil.disk_usage(disk_path) | ||
| except OSError: | ||
| if self._warn_disk_doesnt_exist.get(disk_name, True): | ||
| logger.warning( | ||
| f"Couldn't find directory '{disk_path}', ignoring it." | ||
| ) | ||
| self._warn_disk_doesnt_exist[disk_name] = False | ||
| continue | ||
| disk_metrics = { | ||
| f"{METRIC_DISK_USAGE_PERCENT}/{disk_name}": disk_info.percent, | ||
| f"{METRIC_DISK_USAGE_GB}/{disk_name}": disk_info.used | ||
| / GIGABYTES_DIVIDER, | ||
| f"{METRIC_DISK_TOTAL_GB}/{disk_name}": disk_info.total | ||
| / GIGABYTES_DIVIDER, | ||
| } | ||
| disk_metrics = {k.rstrip("/"): v for k, v in disk_metrics.items()} | ||
| result.update(disk_metrics) | ||
| return result | ||
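For readers following the diff, the aggregation done in `_monitoring_loop` and the parallelization metric computed in `_get_metrics` can be sketched with the standard library only. This is a minimal sketch, not the PR's code: `average_samples` and `parallelization_percent` are hypothetical helper names, and the default threshold of 20 simply mirrors `MINIMUM_CPU_USAGE_TO_BE_ACTIVE`.

```python
from typing import Dict, List


def average_samples(samples: List[Dict[str, float]]) -> Dict[str, float]:
    """Sum each metric over all samples, then divide by the sample count.

    A failed sample (an empty dict) adds nothing to the sum but still
    counts in the divisor, as in the diff, where the sum is always
    divided by the configured number of samples.
    """
    if not samples:
        return {}
    totals: Dict[str, float] = {}
    for sample in samples:
        for name, value in sample.items():
            totals[name] = totals.get(name, 0.0) + value
    return {name: total / len(samples) for name, total in totals.items()}


def parallelization_percent(
    cpus_percent: List[float], threshold: float = 20
) -> float:
    """Percentage of cores whose usage is at or above the threshold."""
    if not cpus_percent:
        return 0.0
    active = sum(1 for percent in cpus_percent if percent >= threshold)
    return active * 100 / len(cpus_percent)
```

One consequence worth noting: if a sample fails, the reported average is pulled toward zero rather than being computed over the successful samples only.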
Does it dump the whole stack trace? Should we include just the message when we are not in DEBUG mode? (I think we have a flag for this in DVCLive.)
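On the first question: `logger.exception` logs at ERROR level and always appends the full traceback. One possible way to address the second point is sketched below. It is only a sketch under stated assumptions: `log_monitoring_failure` is a hypothetical helper, and it uses the logger's effective level as the DEBUG switch rather than any DVCLive-specific flag, which is not shown in this diff.

```python
import logging

logger = logging.getLogger("dvclive")


def log_monitoring_failure(exc: Exception) -> None:
    """Log a short warning; attach the traceback only at DEBUG level.

    Hypothetical helper: the DEBUG check relies on the standard logging
    level, not on a DVCLive-specific flag.
    """
    show_traceback = logger.isEnabledFor(logging.DEBUG)
    logger.warning(
        "Failed to monitor CPU metrics: %s",
        exc,
        # Passing the exception instance makes logging format its traceback.
        exc_info=exc if show_traceback else None,
    )
```

In `_monitoring_loop`, the `except psutil.Error` branch could bind the exception (`except psutil.Error as exc:`) and pass it to such a helper instead of calling `logger.exception`.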