Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[raysgd] Cleanup User API #7384

Merged
merged 111 commits into from
Mar 10, 2020
Merged

Conversation

richardliaw
Copy link
Contributor

@richardliaw richardliaw commented Feb 29, 2020

Why are these changes needed?

improves usability of the trainer with the following:

  • data_creator returns iterables/loaders rather than datasets.
  • automatically sets "sampler" when returning a DataLoader.
  • remove data_loader_config
  • remove batch_size (becomes hyperparameter)
  • rename "replica" to "worker"
  • adds timing profiles to the stats
  • automatically averages custom stats
  • Make loss function optional (Only if passing in a custom operator)
  • Forces all Trainer args to be named

TODO:

  • Add UnifiedLogger to Trainer by default

Related issue number

Checks

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22782/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22783/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22789/
Test FAILed.

@richardliaw richardliaw marked this pull request as ready for review March 6, 2020 04:47
@richardliaw richardliaw changed the title [raysgd] Cleanups [raysgd] Cleanup User API Mar 6, 2020
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22843/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22852/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22853/
Test FAILed.

Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but mostly trusting @maximsmol because I'm not too familiar

model_creator=model_creator,
data_creator=single_loader,
optimizer_creator=optimizer_creator,
config=dict(batch_size=100000),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
config=dict(batch_size=100000),
config={batch_size: 100000},

To match others

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same with below

return loaders
else:
raise ValueError(
"loaders must be <= 2. Got {}".format(loaders))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"loaders must be <= 2. Got {}".format(loaders))
"number of loaders must be <= 2. Got {}".format(loaders))

logger.debug("Instantiating dataloaders.")
# When creating loaders, a filelock will be used to ensure no
# race conditions in data downloading among different workers.
with FileLock(os.path.expanduser("~/.ray_data.lock")):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be in a tmpfile to avoid issues with persistence (e.g., if the machine crashes)? Not sure, not familiar with FileLock

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22922/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22927/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22931/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22936/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22941/
Test FAILed.

@richardliaw richardliaw merged commit d192ef0 into ray-project:master Mar 10, 2020
@richardliaw richardliaw deleted the benchmark branch March 10, 2020 15:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants