Description
Right now dvclive doesn't track system metrics such as CPU, GPU, and RAM utilisation. These are useful when comparing experiment results (e.g., I can get +1% accuracy, but what is the price in terms of time and resources?) and when analysing how long training takes on different GPUs (e.g., would renting another GPU model give a 2x speedup?).
The usual practice is to log these metrics manually with https://github.com/giampaolo/psutil and analyse the results later, but because heavy ML experiments need this quite often, IMO it makes sense to have this functionality out of the box.
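For context, the manual approach usually looks something like the sketch below. This is only an illustration, not a proposed dvclive API: `read_with_psutil` and `sample_metrics` are hypothetical names, the metric keys are made up, and it assumes psutil is installed. GPU metrics are left out because psutil does not report them.

```python
import time
from typing import Callable, Dict, List


def read_with_psutil() -> Dict[str, float]:
    """Take one snapshot of system-wide CPU and RAM usage (hypothetical helper)."""
    import psutil  # third-party; imported lazily so the sampler has no hard dependency

    mem = psutil.virtual_memory()
    return {
        # With interval=None the value covers the time since the previous call,
        # so the very first reading is not meaningful and is best ignored.
        "cpu_percent": psutil.cpu_percent(interval=None),
        "ram_used_gb": mem.used / 1024**3,
        "ram_percent": mem.percent,
    }


def sample_metrics(
    read: Callable[[], Dict[str, float]],
    n_samples: int,
    interval_s: float = 1.0,
) -> List[Dict[str, float]]:
    """Call `read` n_samples times, pausing interval_s seconds between calls."""
    samples: List[Dict[str, float]] = []
    for i in range(n_samples):
        snapshot = dict(read())  # copy so the reader's dict is not mutated
        snapshot["timestamp"] = time.time()
        samples.append(snapshot)
        if i < n_samples - 1:
            time.sleep(interval_s)
    return samples
```

Calling `sample_metrics(read_with_psutil, n_samples=3, interval_s=0.5)` would return three snapshots. In a real training script this would have to run in a background thread alongside the training loop, which is exactly the boilerplate it would be nice not to write every time.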
It would also be great to have a summary of these metrics in the .json file produced by the experiment, to make quick decisions without diving too deep (e.g., the average CPU utilisation was 4 cores, so my script doesn't utilise all 32 cores I have; the peak RAM utilisation was 8GB, which means I can rent a smaller server on AWS to run this training; etc.).
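To make the summary idea concrete, here is a rough sketch of what I have in mind. The function names, the `avg`/`max` layout, and the metric keys are all assumptions for illustration only; this is not how dvclive writes its .json file today.

```python
import json
from statistics import mean
from typing import Dict, List


def summarize(samples: List[Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Reduce per-step samples to an average and a peak value per metric."""
    summary: Dict[str, Dict[str, float]] = {}
    if not samples:
        return summary
    for key in samples[0]:
        # Collect this metric from every sample that reported it.
        values = [s[key] for s in samples if key in s]
        summary[key] = {"avg": mean(values), "max": max(values)}
    return summary


def write_summary(samples: List[Dict[str, float]], path: str) -> Dict[str, Dict[str, float]]:
    """Write the summary as JSON to `path` (hypothetical file) and return it."""
    summary = summarize(samples)
    with open(path, "w") as f:
        json.dump(summary, f, indent=2)
    return summary
```

With two samples of 2 GB and 6 GB RAM usage, `summarize` reports an average of 4 GB and a peak of 6 GB, which is the kind of number I'd want to see at a glance when deciding on a smaller server.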
To name one example among ML tools, W&B already tracks this; see the bottom of the dashboard: https://wandb.ai/stacey/estuary?workspace=user-lavanyashukla
This page lists the main metrics logged in W&B: https://docs.wandb.ai/ref/app/features/system-metrics
If this would be useful, I could gather a list of metrics with notes describing the cases where each metric helps a user in ML tasks.