alpaca.py -> _alpaca.py. slimorca.py -> _slimorca.py #407

Merged: NicolasHug merged 2 commits into pytorch:main from underscore_fun on Feb 26, 2024

NicolasHug
Member

This PR implements #25 (comment) for the torchtune.datasets namespace. This is just an example of what should be done, I'm not signing up to address the rest of the code-base :) (but I can still help if needed)

changes

  • rename alpaca.py into _alpaca.py and update usages accordingly.
  • same for slimorca.py

test plan

CI + git grep:

(tune) ➜  tune git:(underscore_fun) git grep -e datasets.alpa -e datasets.slimo
(tune) ➜  tune git:(underscore_fun) 
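
The same check could be scripted, for example to run in CI. This is only a minimal sketch: `find_stale_references` is a hypothetical helper, not something that exists in torchtune, and the file suffixes it scans are an assumption. The patterns are the two passed to `git grep` above.

```python
import re
from pathlib import Path

# Same patterns as the git grep call above. The unescaped "." matches any
# character, which mirrors how git grep treats these patterns by default.
STALE = re.compile(r"datasets.alpa|datasets.slimo")


def find_stale_references(root, suffixes=(".py", ".rst")):
    """Return (path, line number, line) for each line that still uses an old path."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if STALE.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

An empty list corresponds to the empty `git grep` output shown above.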

Did it do any good???

CC @kartikayk @RdoubleA @rohan-varma @joecummings @ebsmothers hopefully this can clarify a few things:

Yes. Take a look at the current alpaca.py file:

https://github.com/pytorch-labs/torchtune/blob/44d829e3a28a7e663a3089524afae3d8d1946c63/torchtune/datasets/alpaca.py#L16

Do you need CROSS_ENTROPY_IGNORE_IDX to be public? Probably not. Well, in main, users can access it without any underscore via torchtune.datasets.alpaca.CROSS_ENTROPY_IGNORE_IDX, and removing it or even changing its value would effectively be BC-breaking. That's really annoying, isn't it?

Now that alpaca.py is _alpaca.py, you don't need to care about it. That CROSS_ENTROPY_IGNORE_IDX variable is private because its path is now torchtune.datasets._alpaca.CROSS_ENTROPY_IGNORE_IDX i.e. it can't be accessed without an underscore. So you can remove / change it as you please, that's not a BC-break.

To illustrate @RdoubleA's question in yesterday's meeting: is renaming alpaca into _alpaca equivalent to just putting underscores in front of the private stuff in alpaca, i.e. could we just rename the constant to _CROSS_ENTROPY_IGNORE_IDX and keep alpaca.py? Yes, the end result is largely equivalent, but putting the underscore on the file makes it a lot easier for reviewers to not miss anything. Case in point, _CROSS_ENTROPY_IGNORE_IDX was made explicitly private in slimorca but it was missed in reviews for alpaca. If alpaca had been _alpaca, the fact that CROSS_ENTROPY_IGNORE_IDX was "missed" would have been a non-issue (there's no "missing" it when it's already private anyway).
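
To make the rule concrete, here is a small sketch of the "accessible without an underscore" test applied to the paths discussed above. `is_public` is a hypothetical helper written for illustration only. As a simplification, it treats every underscore-prefixed component as private, dunder names included.

```python
def is_public(dotted_path):
    """Return True if no component of the dotted path starts with an underscore."""
    return not any(part.startswith("_") for part in dotted_path.split("."))


# Before this PR: reachable without any underscore, so it is public.
print(is_public("torchtune.datasets.alpaca.CROSS_ENTROPY_IGNORE_IDX"))   # True
# After this PR: the module name carries the underscore.
print(is_public("torchtune.datasets._alpaca.CROSS_ENTROPY_IGNORE_IDX"))  # False
# The alternative: keep alpaca.py and put the underscore on the name.
print(is_public("torchtune.datasets.alpaca._CROSS_ENTROPY_IGNORE_IDX"))  # False
# The dataset class exposed from the package stays public.
print(is_public("torchtune.datasets.AlpacaDataset"))                     # True
```

Both the second and third paths end up private. The difference is that the file-level underscore covers every name in the module at once.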

@facebook-github-bot added the CLA Signed label Feb 23, 2024

netlify bot commented Feb 23, 2024

Deploy Preview for torchtune-preview ready!

Name Link
🔨 Latest commit fbb56fc
🔍 Latest deploy log https://app.netlify.com/sites/torchtune-preview/deploys/65d879e79a51c00008c0f2fd
😎 Deploy Preview https://deploy-preview-407--torchtune-preview.netlify.app

@@ -11,7 +11,7 @@
from tests.test_utils import get_assets_path

from torchtune import datasets
-from torchtune.datasets.alpaca import CROSS_ENTROPY_IGNORE_IDX
+from torchtune.datasets._alpaca import CROSS_ENTROPY_IGNORE_IDX
Member Author

Note the underscore: we're accessing private APIs in _alpaca now. And that's completely OK to do so in tests files.

Contributor

Awesome! Is it ok to access private fields in tests as well? I've been doing this but leaving apologetic comments.

Member Author

You mean private fields like private attributes of torchtune classes and things like that? Yes of course, you should be testing those as well.

If you mean private fields as in "private stuff from other libraries" then HELL NO.

Contributor

Sorry, I meant the former: private fields from TorchTune classes. OK, that's awesome!
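
As an illustration of the point agreed on in this thread, here is a minimal sketch of a test that reaches into a private attribute of a class from its own project. `TokenCache` and its `_store` attribute are hypothetical stand-ins and are not torchtune APIs.

```python
import unittest


class TokenCache:
    """Hypothetical class standing in for any project class with private state."""

    def __init__(self):
        self._store = {}  # private: not part of the public API

    def get(self, key, compute):
        # Compute the value once, then serve it from the private store.
        if key not in self._store:
            self._store[key] = compute(key)
        return self._store[key]


class TestTokenCache(unittest.TestCase):
    def test_value_is_cached(self):
        cache = TokenCache()
        calls = []

        def compute(key):
            calls.append(key)
            return key.upper()

        self.assertEqual(cache.get("a", compute), "A")
        self.assertEqual(cache.get("a", compute), "A")
        # Inspecting the private attribute is fine in the project's own tests.
        self.assertEqual(cache._store, {"a": "A"})
        # compute ran only once, so the second call came from the cache.
        self.assertEqual(calls, ["a"])
```

The test can be run with `python -m unittest`. The same access pattern against another library's private attributes is what the reply above rules out.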

@@ -61,7 +61,7 @@ To run the recipe without any changes on 4 GPUs, launch a training run using Tun
Dataset
-------

-In this example, we use the `Alpaca Dataset <https://github.com/pytorch-labs/torchtune/blob/main/torchtune/datasets/alpaca.py>`_
+In this example, we use :class:`~torchtune.datasets.AlpacaDataset`
Member Author

doc rendering:

image

Contributor

@joecummings joecummings left a comment

Thanks!!

@kartikayk
Contributor

kartikayk commented Feb 23, 2024

@NicolasHug This PR is awesome and the explanation is very clear. Thank you!

Just making sure I understand correctly:

  • Files should have _ by default. Within the file just use "normal convention" i.e. _ for private APIs and not for public APIs. This also means that if we ever want to flip a file, then no other change should be needed
  • Public classes and APIs should be exposed through the init file
  • If all files in the folder are prefixed with a _ then the folder should be prefixed by a _

If this makes sense, then when do we NOT have an _?

@NicolasHug
Member Author

NicolasHug commented Feb 23, 2024

Files should have _ by default.

In general, yes. There might be case-by-case rare exceptions.

Within the file just use "normal convention" i.e. _ for private APIs and not for public APIs.

Yes although since the file has a _, it's OK if you forget a _ in front of a private API. It's better to do it though, at least to show intent through the code.

Public classes and APIs should be exposed through the init file

Yes

If all files in the folder are prefixed with a _ then the folder should be prefixed by a _

No, the whole datasets/ folder only contains _* files (or __init__.py) and yet we don't want it to become _datasets. If we did, torchtune._datasets.AlpacaDataset would be private!

There are cases where it makes sense to prefix an entire folder with _ but... we don't need to worry about that right now. Here's an example though if you're interested (those APIs defined in _builtin are publicly exposed higher-up in the module hierarchy)
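
The layout described in these answers can be sketched end to end with a throwaway package. Everything here is illustrative: `demo_pkg` is a hypothetical package name, the class body is a stand-in, and the constant's value is an assumption made only so the module has something private in it.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Build a throwaway package on disk: demo_pkg/datasets/{__init__.py,_alpaca.py}
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg" / "datasets"
pkg.mkdir(parents=True)
(root / "demo_pkg" / "__init__.py").write_text("")

# Private module: nothing in here is reachable without an underscore.
(pkg / "_alpaca.py").write_text(textwrap.dedent('''
    CROSS_ENTROPY_IGNORE_IDX = -100  # illustrative value, private via the file name

    class AlpacaDataset:
        """Stand-in for the real dataset class."""
'''))

# Public surface: the init file re-exports only what should be public.
(pkg / "__init__.py").write_text(textwrap.dedent('''
    from ._alpaca import AlpacaDataset

    __all__ = ["AlpacaDataset"]
'''))

sys.path.insert(0, str(root))
datasets = importlib.import_module("demo_pkg.datasets")

print(datasets.AlpacaDataset.__name__)                # AlpacaDataset
print(hasattr(datasets, "CROSS_ENTROPY_IGNORE_IDX"))  # False
```

The class is reachable as `demo_pkg.datasets.AlpacaDataset` with no underscore in the path, while the constant is only reachable through `demo_pkg.datasets._alpaca`.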

@kartikayk
Contributor

@NicolasHug very clear. Thank you!

@RdoubleA
Contributor

Thanks for putting this together @NicolasHug , I'm largely onboard with this but had one question.

That CROSS_ENTROPY_IGNORE_IDX variable is private because its path is now torchtune.datasets._alpaca.CROSS_ENTROPY_IGNORE_IDX i.e. it can't be accessed without an underscore. So you can remove / change it as you please, that's not a BC-break.

Why is this not BC-breaking but alpaca.CROSS_ENTROPY_IGNORE_IDX is? Can't users remove / modify it regardless of whether there's an underscore in alpaca or not? Or are we just relying on the underscore convention to indicate that users / reviewers should pay attention if they're modifying this?

@NicolasHug
Member Author

There's no BC-break for private APIs. BC only applies to public APIs. torchtune.datasets._alpaca.CROSS_ENTROPY_IGNORE_IDX is private while torchtune.datasets.alpaca.CROSS_ENTROPY_IGNORE_IDX is public. It is a largely accepted Python convention (basically, pep8) that an API is public if and only if it can be accessed without underscores.

Can't users remove / modify it regardless of whether there's an underscore in alpaca or not?

Just to clarify, we're not concerned about what users do with these APIs: they can't remove anything from torchtune anyway, and even if they were to modify CROSS_ENTROPY_IGNORE_IDX well... That's entirely out of scope for BC/private/public considerations. We're only concerned about their ability to access APIs and whether that access comes with BC guarantees or not.

@NicolasHug NicolasHug merged commit c02aff6 into pytorch:main Feb 26, 2024
17 checks passed
@NicolasHug NicolasHug deleted the underscore_fun branch February 26, 2024 11:24