Skip to content

Conversation

artek0chumak
Copy link
Contributor

Issue #ENG-18828 , #ENG-18829

Describe your changes

Improve fine-tuning checkpoint support with two features:

  • Add new command to show all checkpoints for the jobs
  • Add new argument to start a job from the checkpoint

@artek0chumak artek0chumak marked this pull request as ready for review March 10, 2025 09:47
@mryab mryab requested review from mryab and removed request for VProv March 10, 2025 09:53
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
click.echo("\nTo download a checkpoint, use cmd: together fine-tuning download")
click.echo("\nTo download a checkpoint, use `together fine-tuning download`")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to format the timestamps to strings only in this function. As you can see in your own example, truncating the seconds etc. can lead to a result where the order of checkpoints becomes incorrect: therefore, it's better to sort by the integer timestamp value. Furthermore, users of the Python CLI might want to apply other datetime formats to display the values in their code, so our current behavior is too destructive

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should be the behavior if the user passes a FT job ID without a colon? I believe this expression won't match, perhaps it's best to rename it to clarify that it only handles job IDs with step numbers?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

step=0 will put the code in a surprise :)

Suggested change
checkpoint_name = f"{id}:{step}" if step else id
checkpoint_name = f"{id}:{step}" if step is not None else id

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For correct sorting, timestamps should not be strings at this point

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about listing the step here as well, like "Intermediate (step 20)"? It's not completely obvious how to interpret the table now, but if we add a bit of extra information, it will be clearer to the user what the names stand for

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from_checkpoint (str, optional): The checkpoint to be used in the fine-tuning.
from_checkpoint (str, optional): The checkpoint identifier to continue training from a previous fine-tuning job.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from_checkpoint (str, optional): The checkpoint to be used in the fine-tuning.
from_checkpoint (str, optional): The checkpoint identifier to continue training from a previous fine-tuning job.

@artek0chumak artek0chumak force-pushed the artem/add-checkpoints branch from 6c41e61 to 05d85d3 Compare March 11, 2025 14:35
@artek0chumak artek0chumak merged commit 271af98 into main Mar 11, 2025
10 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants