Add metric logging #162
Conversation
recipes/finetune_llm.py
Outdated
pbar.set_description(f"{epoch+1}|{idx+1}|Loss: {loss.item()}")

# Log metrics
metric_logger.log_dict(
Currently, this is not configurable. This precedent is already set in HF, where if you have wandb logged in, it just automatically logs a bunch of stuff to your project. https://docs.wandb.ai/guides/integrations/huggingface
648b37a to e25fc70 (compare)
"""
pass

def log_dict(self, payload: Mapping[str, Scalar], step: int) -> None:
I'm new to Mapping. Why use this instead of a Dict? When I look this up, most resources talk about using Mapping when you don't have a literal Dict, but it seems like here we do?
Mapping has the following (small) advantages:
- It allows any type that supports `__getitem__`, `__len__`, `__iter__`, and `items`.
- It is immutable, so technically we can reason "stricter" about the values that come in.
For this case, it's not a huge difference. This interface is largely copied from TNT.
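To make the distinction concrete, here is a small sketch (the free-standing function and its output format are illustrative only, not this PR's API): a parameter typed as `Mapping` accepts a plain `dict` but also read-only mapping views, while promising not to mutate the payload.

```python
from types import MappingProxyType
from typing import Mapping, Union

Scalar = Union[int, float]

def log_dict(payload: Mapping[str, Scalar], step: int) -> None:
    # Mapping only requires read access (__getitem__, __len__, __iter__),
    # so any dict-like type is accepted, including read-only views.
    for name, value in payload.items():
        print(f"step {step}: {name}={value}")

log_dict({"loss": 0.5}, step=7)                    # a plain dict works
log_dict(MappingProxyType({"lr": 1e-4}), step=7)   # so does a read-only view
```

Had the parameter been typed `Dict`, a type checker would reject the `MappingProxyType` call, and the signature would also (wrongly) suggest the function may mutate the payload.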
torchtune/utils/__init__.py
Outdated
log_dir: Optional[str] = None,
project: Optional[str] = None,
**kwargs: Any
) -> Optional[MetricLogger]:
Why does a function named `get_...` return an Optional object? Shouldn't the caller make sure the function is called with reasonable values so we aren't just returning None? More concretely, why is `metric_logger` Optional? If it is None, just don't call this function?
recipes/finetune_llm.py
Outdated
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.DEBUG)
return logger.info
LOGGER = logging.Logger(__name__)
Sorry for being a noob, but I don't understand what this change is doing.
+1
This was handled by #180
Thanks for adding these capabilities! Some questions:
I think this PR needs more testing and thought on how to prevent third-party breakages from breaking TorchTune. Let me know if I misunderstand.
I know this is just a draft, but I have significant concerns around the direction of this PR:
- Usage of wandB by default (from what it seems like); we need failsafes in place to ensure service unavailability doesn't affect users' ability to run their finetunes.
- I think overall we should refocus the initial metric logging: let's log important statistics (training speed, the config used, memory utilization, etc.) to stderr or a predefined file. In particular, let's propose a set of initial metrics that are most important for our user cohorts, and simply support robust logging for those to a file. We can then follow up with wandB / TB, but I'm more concerned about getting the proper metrics logged first.
@@ -19,3 +19,7 @@ output_dir: /tmp/alpaca-llama2-finetune
device: cuda
fsdp: False
activation_checkpointing: False

# Metrics arguments
metric_logger: wandb
Largely due to what @kartikayk mentioned, I'm a bit hesitant to have a third-party logger as a default; i.e., if wandb runs into errors, all our finetuning will just break, no? If this is the default, I think we should at least have some error checking + a fallback in case wandB has issues.
Do you think the default should be a dead simple logger that logs metrics to an output file at each step?
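A dead-simple file logger could look roughly like this (a sketch only; the `DiskLogger` name and its `step name value` line format are hypothetical, not part of this PR):

```python
import os
from typing import Mapping, Union

Scalar = Union[int, float]

class DiskLogger:
    """Minimal metric logger that appends `step name value` lines to a file."""

    def __init__(self, log_dir: str, filename: str = "metrics.txt"):
        os.makedirs(log_dir, exist_ok=True)
        self._file = open(os.path.join(log_dir, filename), "a")

    def log(self, name: str, data: Scalar, step: int) -> None:
        self._file.write(f"{step} {name} {data}\n")
        self._file.flush()  # keep lines on disk even if training crashes

    def log_dict(self, payload: Mapping[str, Scalar], step: int) -> None:
        for name, data in payload.items():
            self.log(name, data, step)

    def close(self) -> None:
        self._file.close()
```

No third-party service involved, so it cannot break a finetune, and the output is trivially grep-able.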
recipes/finetune_llm.py
Outdated
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.DEBUG)
return logger.info
LOGGER = logging.Logger(__name__)
+1
recipes/finetune_llm.py
Outdated
@@ -69,7 +61,7 @@ def recipe(kwargs):
    else kwargs["device"].split(":")[0]
)
tokenizer = get_tokenizer(kwargs["tokenizer"], path=kwargs["tokenizer_checkpoint"])
logger(msg=f"Loaded tokenizer from {kwargs['tokenizer_checkpoint']}")
LOGGER.info(msg=f"Loaded tokenizer from {kwargs['tokenizer_checkpoint']}")
This stuff probably needs to be rebased.
recipes/finetune_llm.py
Outdated
LOGGER.info(msg=f"Loaded model from {kwargs['model_checkpoint']}")

# ---- Load tokenizer ---- #
tokenizer = get_tokenizer(kwargs["tokenizer"], path=kwargs["tokenizer_checkpoint"])
Why are we getting the tokenizer again?
This was moved. Changed in #180. Needs rebase.
recipes/finetune_llm.py
Outdated
# Log metrics at each step
# If no metric logger is specified, this is a no-op
metric_logger.log_dict(
Does the user need to do any sort of special configuration / registration with wandB then for our FT recipes to work OOTB? If necessary, having this on by default would be a no-go IMO.
WandB does need an init, but if you've already run a `wandb.init()` on your machine (which the majority of users probably have), then it'll work OOTB.
torchtune/utils/__init__.py
Outdated
Args:
    metric_logger (Optional[str]): The metric logger to use.
    log_dir (Optional[str]): The directory to log to.
    project (Optional[str]): The project to log to.
Add the defaults in the docstr and what they mean.
torchtune/utils/metric_logging.py
Outdated
"""Log scalar data.

Args:
    name (string): tag name used to group scalars
Is "tag name used to group scalars" detailed enough for this arg? Could we add details such as: scalars logged with the same `name` will be aggregated together? IMO that might be a bit clearer.
Sure, this was copied from TNT. I can add more descriptions.
torchtune/utils/metric_logging.py
Outdated
class WandBLogger(MetricLogger):
    """WandB logger."""
Add details to this documentation and ensure all docs render properly.
import wandb

self._wandb = wandb
self._wandb.init(
Some things that are not immediately clear to me:
- Does this require the wandB service to be available and accessible through a defined endpoint? If so, have we considered scenarios where users are using torchtune completely offline, or through a proxy/firewall that could affect this?
- Is there any way to check whether the wandB service is available, and fail fast + inform the user to use a different logger? Overall, we want failures in the logging component due to third-party issues to stay out of the critical path of our finetunes.
self._wandb.finish()


class TensorBoardLogger(MetricLogger):
Could we have examples of usage, paste of expected outputs, etc for both Tensorboard and wandB logging?
These are listed in the PR. For "outputs", I'm not sure what you mean, b/c the outputs are images.
e25fc70 to 33f3ed4 (compare)
Looking good overall, just needs a getter and a default noop or terminal logger.
"""
pass

def close(self) -> None:
Can this be called from a `__del__` method so it automatically gets called by garbage collection?
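For reference, the suggested pattern would look something like this (a sketch only; note that Python does not guarantee when, or at interpreter exit even whether, `__del__` runs, so an explicit `close()` remains preferable):

```python
import tempfile, os

class FileLogger:
    """Toy logger whose close() is idempotent and also wired to __del__."""

    def __init__(self, path: str):
        self._file = open(path, "a")
        self._closed = False

    def close(self) -> None:
        # Idempotent, so calling it both explicitly and from __del__ is safe.
        if not self._closed:
            self._file.close()
            self._closed = True

    def __del__(self) -> None:
        # Best-effort cleanup when the object is garbage collected.
        self.close()
```

Relying on `__del__` alone risks losing buffered metrics on abnormal exit, which is why `close()` (or a context manager) is still the primary API.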
recipes/finetune_llm.py
Outdated
"lr": opt.param_groups[0]["lr"],
"gpu_resources": torch.cuda.memory_allocated(),
},
step=idx,
`step` needs to be the global step, so `step = epoch * len(dataloader) + idx`.
An additional point I forgot: typically you only want metric logging on rank 0. We could either hard-code that in or make it optional from init.
@contextmanager
def captured_output() -> Generator[Tuple[TextIO, TextIO], None, None]:
From TorchTNT
19f0ac7 to 8330aaa (compare)
I think we'll still need a no-op logger to be used by default and for higher ranks.
recipes/finetune_llm.py
Outdated
@@ -37,11 +37,16 @@ def recipe(
    output_dir,
    run_generation,
    max_steps_per_epoch,
    metric_logger,
    project,
    log_dir,
Can we reuse output-dir for this?
# Log metrics at each step
# If no metric logger is specified, this is a no-op
if rank == 0:
I think metric_loggers should handle the rank functionality. They also need to only init on rank 0. One option is to create a class decorator that returns the given class on rank 0 and returns either a no-op class or terminal logging class on other ranks.
Re "They also need to only init on rank 0": should this then happen in the distributed utils instead of passing rank information etc. to the logger itself?
We could add a decorator in distributed utils called "run_on_rank_0(null_class)" that you pass to a class with a null_class that gets called on the other ranks. Or we could just build the logic into the logger and import get_world_size_and_rank so the rank wouldn't need to be passed in.
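Sketching the decorator idea (all names here are hypothetical; `get_rank()` stands in for the real distributed utility, and note the rank must already be known at class-definition time for this to work):

```python
def get_rank() -> int:
    # Placeholder for illustration; real code would query
    # utils.get_world_size_and_rank() / torch.distributed.get_rank().
    return 0

class NoOpLogger:
    """Substitute used on non-zero ranks; every call is a no-op."""
    def log_dict(self, payload, step):
        pass

def run_on_rank_zero(null_cls):
    """Class decorator: keep the decorated class on rank 0,
    swap in null_cls on every other rank."""
    def wrap(cls):
        return cls if get_rank() == 0 else null_cls
    return wrap

@run_on_rank_zero(NoOpLogger)
class StdoutLogger:
    def log_dict(self, payload, step):
        for name, data in payload.items():
            print(f"Step {step} | {name}:{data}")
```

Because the decorator resolves at import time, callers can instantiate `StdoutLogger()` unconditionally and logging simply becomes a no-op off rank 0.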
If you add a rank check on "get_metric_logger" I won't block on this, and better distributed support can be added in a followup PR.
torchtune/utils/metric_logging.py
Outdated
    return WandBLogger(project=project)
elif metric_logger == "tensorboard":
    return TensorBoardLogger(log_dir=log_dir)
return StdoutLogger()
This needs an elif "stdout" and a final else to raise that the input isn't recognized. Since the options are just strings here, we probably don't need to accept None as an option.
recipes/finetune_llm.py
Outdated
parser.add_argument(
    "--metric-logger",
    type=str,
    default="stdout",
This default is going to flood a users terminal with logs and make tqdm useless. Maybe we should have a default no-op still?
Then we need a bunch of if statements, no?
@kartikayk Could you share your thoughts here?
Would it help to default to stderr?
@@ -1,3 +1,4 @@
pytest
pytest-cov
pre-commit
tensorboard
Can you move these to requirements.txt, as the user needs access to them? In general we need to add support for optional dependencies as per this RFC. For now we could have a commented section of requirements.txt for optional requirements that will get split off as actual optional requirements in the future.
Can you clarify your first sentence?
This would mean that until we specify these as optional, people will think we're taking a hard dependency on tensorboard and wandb. I really would like to avoid this as they are really only needed for tests. Anything else should be the users responsibility.
I agree that these probably should be optional dependencies. However, as stated in the RFC these fall under training utilities which means they would be in requirements.txt. (I am not saying we should put these in requirements.txt, in fact I think we should reconsider our framing there given that now the set of deps in our recipe but not in our training utilities is pretty much the empty set)
LG overall. Main concern is around dependencies - if I can use wandB as a feature in torchtune, then it makes sense that this should be installable via requirements.txt. If we decide not to do this, then we should update the README documenting the installation requirements for our different logging utilities. Also have some more comments in line, will stamp after that
@@ -1,3 +1,5 @@
pytest
pytest-cov
pre-commit
tensorboard
wandb
Not sure why this and TB are dev requirements even after seeing the above discussion. If it's needed for a feature, why can't I just get it by installing from requirements.txt? cc @kartikayk
They aren't strictly needed for a feature. The default feature is stdout or a no-op, which are plain Python. Then, if a user wants to use wandb or tensorboard, they can, and our docs show how they can install and set it up. However, I don't want to take a hard dependency on these right now b/c they're really only needed for the tests.
Once we know how to add optional dependencies properly in requirements.txt, we can add them back. I'm trying to be very protective of our "hard" dependencies, even if they are transitional.
@@ -20,3 +20,6 @@ device: cuda
dtype: fp32
fsdp: False
activation_checkpointing: False

# Metrics arguments
metric_logger: stdout
If this is a path instead of a logging abstraction, let's call it `metric_logger_path`?
It's not a path.
recipes/finetune_llm.py
Outdated
):
    # ---- Initialize components ---- #
    utils.init_distributed(fsdp)

    logger = utils.get_logger("DEBUG")
    metric_logger = utils.get_metric_logger(metric_logger, project, output_dir)
Maybe ideally we'd only initialize this on rank 0, but not super important to me at the moment - can punt on it.
See #233
recipes/finetune_llm.py
Outdated
):
    # ---- Initialize components ---- #
    utils.init_distributed(fsdp)

    logger = utils.get_logger("DEBUG")
    metric_logger = utils.get_metric_logger(metric_logger, project, output_dir)
Also, this is confusing: we're creating a `metric_logger` variable from an arg that is also named `metric_logger` but is of a different type, I think.
Yup, this should be changed.
recipes/finetune_llm.py
Outdated
"--metric-logger",
type=str,
default="stdout",
choices=["wandb", "tensorboard", "stdout"],
should stderr also be a choice?
Why?
recipes/finetune_llm.py
Outdated
parser.add_argument(
    "--metric-logger",
    type=str,
    default="stdout",
Would it help to default to stderr?
"--project",
type=str,
default=None,
help="Project name for WandB metric logger.",
Let's add error checking / a warning that this is ignored if not using wandB (can be done as a follow-up, or more generally as part of config validation). cc @RdoubleA
I think this is part of a larger question of validation of configs
recipes/finetune_llm.py
Outdated
):
    # ---- Initialize components ---- #
    utils.init_distributed(fsdp)

    logger = utils.get_logger("DEBUG")
    metric_logger = utils.get_metric_logger(metric_logger, project, output_dir)
The `output_dir` docstring should also be updated to indicate that it's the TB log dir, or we should provide a separate arg that acts as this dir explicitly.
)


class MetricLogger(Protocol):
Did you consider the use of `abc.AbstractClass` and how it could differ from `Protocol`? I assume the Protocol choice is from our design principles, but any details on how we're thinking about its usage here?
Bad answer but protocol is the only option "allowed" by our design principles.
That's still under discussion (https://github.com/pytorch-labs/torchtune/pull/209/files#r1455763545) and I personally do not understand what a `Protocol` is. It seems to be related to type checking, so... something largely irrelevant here. I think that if we're going for `Protocol` instead of a standard ABC, it should be for clear, good reasons that we all understand. Like @rohan-varma, I'd love to learn more about this here. CC @kartikayk since I believe the original design principle comes from you.
I'm fine with enforcing interfaces through abstract classes. But I don't fully understand why an abstract class is a better option here than a Protocol. Can someone help me understand this? Maybe I'm just being stupid.
My question is: What is a Protocol??
There are several online resources on this. But maybe look at this:
https://jellis18.github.io/post/2022-01-11-abc-vs-protocol/
Thanks for the resource. I'm still not sure why a Protocol is needed over an ABC here. That article confirmed that `Protocol` is clearly focused on type checking, and none of that seems relevant here since both `WandBLogger` and `TensorBoardLogger` inherit from the base `MetricLogger`. An ABC would do just fine; the `Protocol` doesn't seem to bring anything.

Re "But I don't fully understand why an abstract class is a better option here than a Protocol": I am personally much more familiar with ABCs than with Protocols. The first time I ever heard about Protocol was on Torchtune: it's a 3.8+ feature, and pytorch (and most other packages) only recently dropped support for 3.7. So the main benefit of ABC here is familiarity, IMHO. Out of fairness, let me return the question: why is a Protocol better than an ABC, for the purpose of Torchtune?
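For anyone following along, the practical difference fits in a few lines (a minimal sketch, not torchtune code): a `Protocol` is satisfied structurally, so implementations never need to inherit from it, whereas an ABC must be subclassed explicitly before `isinstance` checks pass.

```python
from abc import ABC, abstractmethod
from typing import Protocol

class MetricLoggerProtocol(Protocol):
    def log(self, name: str, data: float, step: int) -> None: ...

class MetricLoggerABC(ABC):
    @abstractmethod
    def log(self, name: str, data: float, step: int) -> None: ...

class StdoutLogger:
    # No inheritance, yet this satisfies MetricLoggerProtocol structurally.
    def log(self, name: str, data: float, step: int) -> None:
        print(f"{step} {name}={data}")

def run(logger: MetricLoggerProtocol) -> None:
    # A type checker accepts StdoutLogger here via structural typing.
    logger.log("loss", 0.5, step=1)

run(StdoutLogger())
# With the ABC, StdoutLogger would have to subclass MetricLoggerABC
# explicitly before isinstance()/registration checks would accept it.
```

Since this PR's loggers all inherit from the base class anyway, the structural-typing benefit is admittedly unused here, which is the crux of the disagreement above.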
print(f"{name}:{data} ", end="")
print("\n", end="")

def close(self) -> None:
Is it important to `close` resources in the case of errors? i.e. in a `__del__` method, as @pbontrager mentioned?
cc @daniellepintz here b/c truthfully I'm not sure. They didn't include them in TNT.
from tests.test_utils import captured_output


class StdoutLoggerTest(unittest.TestCase):
Use pytest
"""

def __init__(self, log_dir: str, **kwargs):
    from torch.utils.tensorboard import SummaryWriter
This is just in torch core, right? Why do we need to gate the import behind initialization of the TensorBoardLogger class?
no reason, just deferred import b/c someone might not want to use tensorboard.
@@ -1,3 +1,4 @@
pytest
pytest-cov
pre-commit
tensorboard
I agree that these probably should be optional dependencies. However, as stated in the RFC these fall under training utilities which means they would be in requirements.txt. (I am not saying we should put these in requirements.txt, in fact I think we should reconsider our framing there given that now the set of deps in our recipe but not in our training utilities is pretty much the empty set)
LGTM overall. I think the discussion on the stdout/stderr/default logger can continue to evolve after this PR, and it's a relatively easy change; we just need to make a decision. Please close out open comments, and let's have fast follow-ons to add important metrics around performance, memory utilization, etc.
torchtune/utils/metric_logging.py
Outdated
group: Optional[str] = None,
**kwargs,
):
    import wandb
This should probably be within a `try: except:` block with a nicer error message indicating to users that they should install `wandb`, or optionally set the appropriate parameter to other options (like "stdout"). Same for tensorboard below. This could be done here, or more generally above in `get_metric_logger`.
I think in here is good, will add now before submitting.
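The guarded import could be factored as something like this (a sketch; the helper name and the error wording are made up here, not the PR's final code):

```python
import importlib

def import_or_raise(module: str, hint: str):
    """Import a module, converting a bare ImportError into an actionable one."""
    try:
        return importlib.import_module(module)
    except ImportError as e:
        raise ImportError(f"{module} is not installed. {hint}") from e

class WandBLogger:
    def __init__(self, project=None, **kwargs):
        # Fail fast with a message telling the user what to do, instead of
        # a raw traceback deep inside the recipe.
        self._wandb = import_or_raise(
            "wandb",
            "Install it with `pip install wandb`, or pass "
            "`--metric-logger stdout` to use the default logger.",
        )
```

The same helper would cover the deferred `SummaryWriter` import in `TensorBoardLogger`.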
I also think it's good to land and then we can iterate from there. I only request that you pull out the list of metric loggers as mentioned in my last comment.
torchtune/utils/metric_logging.py
Outdated
return StdoutLogger()
else:
    raise ValueError(
        f"Metric logger not recognized. Expected 'wandb', 'tensorboard', or 'stdout', received {metric_logger}."
All of the other unit tests handle their getters by having a global hidden list [wandb, tensorboard, stdout] and a list_metric_loggers function so you don't have to maintain this list in multiple places (e.g. Error messages, getter checks, choices).
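The suggested single-source-of-truth registry, sketched with stub classes standing in for the real loggers (`_METRIC_LOGGERS` and `list_metric_loggers` are hypothetical names):

```python
from typing import Dict, List, Type

class StdoutLogger:
    pass

class TensorBoardLogger:
    pass

class WandBLogger:
    pass

# One dict feeds argparse `choices`, error messages, and the getter alike,
# so the list of supported loggers is maintained in a single place.
_METRIC_LOGGERS: Dict[str, Type] = {
    "wandb": WandBLogger,
    "tensorboard": TensorBoardLogger,
    "stdout": StdoutLogger,
}

def list_metric_loggers() -> List[str]:
    return sorted(_METRIC_LOGGERS)

def get_metric_logger(name: str, **kwargs):
    if name not in _METRIC_LOGGERS:
        raise ValueError(
            f"Metric logger not recognized. Expected one of "
            f"{list_metric_loggers()}, received '{name}'."
        )
    return _METRIC_LOGGERS[name](**kwargs)
```

The argparse setup would then use `choices=list_metric_loggers()` instead of a second hard-coded list.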
e1bb6b9 to 6c2640c (compare)