logger: track system metrics automatically #81

@aguschin

Right now dvclive doesn't track system metrics such as CPU, GPU, and RAM utilisation. These are useful when comparing experiment results (e.g., I can get +1% accuracy, but at what cost in time and resources?) and when analysing how long training takes on different GPUs (e.g., would renting another GPU model give me a 2x speedup?).

The usual practice is to log these manually with https://github.com/giampaolo/psutil and analyse the results later, but since heavy ML experiments need this quite often, IMO it makes sense to have this functionality out of the box.
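
For reference, the manual approach might look roughly like the sketch below. It is only an illustration: `sample_system_metrics` and `collect` are hypothetical helper names, not part of dvclive or psutil, and psutil itself covers CPU and RAM but not GPU utilisation, which would need a separate tool.

```python
import time


def sample_system_metrics():
    """Return one snapshot of CPU and RAM utilisation (hypothetical helper)."""
    import psutil  # third-party dependency: pip install psutil

    mem = psutil.virtual_memory()
    return {
        # With interval=None the first call returns a meaningless 0.0;
        # later calls report utilisation since the previous call.
        "cpu_percent": psutil.cpu_percent(interval=None),
        "ram_used_gb": mem.used / 1e9,
        "ram_percent": mem.percent,
    }


def collect(n_samples, interval_s=1.0, read=sample_system_metrics, sleep=time.sleep):
    """Take n_samples snapshots, waiting interval_s seconds between them.

    `read` and `sleep` are injectable so the loop can be tried without psutil.
    """
    samples = []
    for i in range(n_samples):
        samples.append(read())
        if i < n_samples - 1:
            sleep(interval_s)
    return samples
```

In a real training script this would probably run in a background thread next to the training loop, e.g. `samples = collect(60, interval_s=1.0)`, which is exactly the kind of plumbing that would be nice not to write by hand every time.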

It would also be great to have a summary of these metrics in the .json file produced by the experiment, so users can make quick decisions without diving too deep (e.g., average CPU utilisation was 4 cores, so my script doesn't use all 32 cores I have; peak RAM utilisation was 8GB, so I can rent a smaller server on AWS to run this training).
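
To make the summary idea concrete, here is a minimal sketch of how per-step snapshots could be reduced to an average and a peak per metric. The function names, the metric keys, and the output layout are all assumptions for illustration, not an existing dvclive format.

```python
import json
from statistics import mean


def summarise(samples):
    """Reduce a list of metric snapshots (dicts) to avg and max per metric."""
    summary = {}
    keys = sorted({key for sample in samples for key in sample})
    for key in keys:
        # A metric may be missing from some snapshots; use the ones that have it.
        values = [sample[key] for sample in samples if key in sample]
        summary[key] = {"avg": mean(values), "max": max(values)}
    return summary


def write_summary(samples, path):
    """Write the summary as JSON to `path` (a hypothetical file name is fine)."""
    with open(path, "w") as f:
        json.dump(summarise(samples), f, indent=2)
```

For example, two snapshots with `ram_used_gb` of 2.0 and 8.0 would summarise to an average of 5.0 and a peak of 8.0, and the peak is the number I would check when deciding whether a smaller server is enough.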

As one example among ML tools, W&B already tracks this; see the bottom of this dashboard: https://wandb.ai/stacey/estuary?workspace=user-lavanyashukla

This page lists the main system metrics logged by W&B: https://docs.wandb.ai/ref/app/features/system-metrics

If this would be useful, I could gather a list of metrics with notes on the cases where each one helps a user in ML tasks.
