ENH: add basename for loaders #4363

jisuoqing · 2023-03-03T16:19:32Z

The handler.name are hard-coded with a string in dataset loaders, and assigning values to ds.basename will pop out an AttributeError: can't set attribute 'basename' (which is allowed by previous yt versions, though).

Here an additional parameter basename for dataset loaders is proposed, so users can load multiple datasets with different base names and analyze/visualize them in a workflow (e.g., saving plots with different filenames without worrying about overwriting each other.)

neutrinoceros

I think we can use actual default values instead of None. I've made explicit suggestions in one case but it should be generalised

yt/loaders.py

Co-authored-by: Clément Robert <cr52@protonmail.com>

jisuoqing · 2023-03-03T17:08:32Z

Thanks! Learnt a nicer way.

chrishavlin

I like this! Left a comment on fixing the type check failures.

yt/loaders.py

neutrinoceros · 2023-03-12T11:47:36Z

Had a couple minutes to think about this patch, and while I don't want to oppose it, I want to raise a couple important issues with the design.

assigning values to ds.basename will pop out an AttributeError: can't set attribute 'basename' (which is allowed by previous yt versions, though).

Setting ds.basename is disallowed in yt 4.1 because it breaks the consistency that ds.basename, ds.directory, ds.fullpath and ds.backup_filename are supposed to have (they all relate to ds.filename). This is why they are now implemented as properties.
The fundamental (private) attribute from which all these properties are derived is ds._input_filename, so shouldn't this is the attribute that loaders should be setting in order to control the output image prefixes ? (this question isn't rhetorical)

Furthermore, wouldn't image_prefix make more sense than basename as a parameter ? I think it conveys the intention much better. If not, it means I didn't get what it was meant for.

The image filename generating code is already difficult to maintain, and I'll fully admit that I didn't grasp the full picture in 20min of exploration. I think it's important that we try our best not to make the code even harder to reason about.
In short, we need to have a very clear understanding of how setting handler.name results in setting an image prefix, and simplify the code if we can rather than making it more complex.

jisuoqing · 2023-03-16T16:41:11Z

Thanks for your comments! Sorry for my late response since I'm traveling these days. I agree to make the logic simpler rather than more complex, while it might be better to make the name customizable than hard-coded? I raise this question when I have to read a series of dataset (say, with filename snapshot_000, snapshot_001, etc.), load the content with some loader (say, load_uniform_grid) and perform analysis with parallelism, but it sometime causes trouble when all loaded data comes with the same base name (say, UniformGrid), including the image prefix issue mentioned before. Therefore, it might be more natural if the base name from the loaders can be customized into snapshot_000, snapshot_001 etc. Not sure whether there's a way to achieve this for now.
I think this PR might be helpful, but would not insist. I total understand that ds.basename should not be changed.

neutrinoceros · 2023-03-17T13:38:23Z

Not sure whether there's a way to achieve this for now.

It is certainly achievable if you construct the filename outside yt and pass it to the save method instead of relying on the default (which I agree are not ideal).

jisuoqing · 2023-03-17T14:36:05Z

It is certainly achievable if you construct the filename outside yt and pass it to the save method instead of relying on the default (which I agree are not ideal).

Let me double-confirm: I used to wrap up a data loader into a custom function ds = load_dataset(fn) in which the function load_uniform_grid is called. Is it possible to let the returned ds have the basename of fn (so any yt functionality can deal with this ds without any issue)?

neutrinoceros · 2023-03-17T15:02:00Z

not yet. I agree that something in the spirit of your patch is needed to support that. I would feel much more confortable with discussing this if anyone could explain how this line

yt/yt/loaders.py

Line 388 in 7ef07cf

handler.name = "UniformGridData" # type: ignore [attr-defined]

ends up affecting image file prefixes.

chrishavlin · 2023-03-22T14:55:44Z

I would feel much more confortable with discussing this if anyone could explain how this line <....>
ends up affecting image file prefixes.

For all the stream dataset types, stream_handler gets passed to the StreamDataset.__init__() where it gets stored as an attribute (StreamDataset.stream_handler).

The filename property of a StreamDataset is:

    @property
    def filename(self):
        return self.stream_handler.name

and I don't see anywhere else the stream_handler.name gets used.

So in terms of how that relates to the image file prefixes, it should be similar to any other dataset type. When a name is not supplied to PlotContainer.save(), it sets:

        if name is None:
            name = str(self.ds)

which returns the dataset basename. i.e., in the base Dataset we have:

    def __str__(self):
        return self.basename

and

    @property
    def basename(self):
        return os.path.basename(self.filename)

So in effect, stream_handler.name is the dataset "filename" and was likely just called name because it's an in-memory dataset and IMO, making the stream_handler.name configurable like this PR does is reasonable. It brings this name attribute more in line with how the filename is used as an identifier for the dataset. I definitely would make use of this functionality!

Slight confusing aside... but it's worth noting that during the StreamDataset initialization, it passes down a different "name" down to the base Dataset.__init__ that is a unique hash to store as the _input_filename of the dataset. But since the StreamDataset overrides the filename property to point to the stream_handler.name, I don't think that the hashed _input_filename ever actually gets used.

neutrinoceros · 2023-03-22T16:50:36Z

Thanks @chrishavlin for the detailed walkthrough. I missed that StreamHandler overrides the filename property, it's much clearer now.
So it seems to me there is plenty room to simplify this internal mess of implementation details, but I'm confident now that this PR can be implemented without worrying about it. The one thing that should be properly discussed for the present patch, since it exposes a new keyword argument, is what we should name it (because we won't be able to change it afterward).

As I already pointed, I'm personally -1 on basename and I would rather go for something like image_prefix. What do you guys think ?

chrishavlin · 2023-03-22T19:45:33Z

Ya, I agree -1 for basename, for all the reasons you point out. I'm also not so sure about image_prefix since it's a little more general than that (e.g., print(ds) is not obviously related to an image prefix). How about just dataset_name with an explanation in the docstring that this is used in place of a filename for in-memory stream datasets?

matthewturk · 2023-03-22T19:47:15Z

Is it just images we prefix with this? Do we use it for sidecar files, ever?

…

On Wed, Mar 22, 2023 at 2:45 PM Chris Havlin ***@***.***> wrote: Ya, I agree -1 for basename, for all the reasons you point out. I'm also not so sure about image_prefix since it's a little more general than that (e.g., print(ds) is not obviously related to an image prefix). How about just dataset_name with an explanation in the docstring that this is used in place of a filename for in-memory stream datasets? — Reply to this email directly, view it on GitHub <#4363 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAAVXO5W63I6DFU5DKEWLCDW5NJGRANCNFSM6AAAAAAVOZTBNM> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

neutrinoceros · 2023-03-22T20:09:52Z

How about just dataset_name with an explanation in the docstring that this is used in place of a filename for in-memory stream datasets?

Sounds good to me

chrishavlin · 2023-03-23T18:19:51Z

Is it just images we prefix with this? Do we use it for sidecar files, ever?

So from what I can tell, in general, the ds.filename attribute ends up being used in different ways (sometimes differing by frontend) and in some cases, yes, it is ultimately used to write to or read from sidecar files:

ds.parameter_file is a direct alias to ds.filename and that gets used all over the frontends and is often passed down to index objects where it's stored as an index_filename. And that's used, e.g., by the ParticleIndex._initialize_index to check for existing .ewah files:

        # Load Morton index from file if provided
        fname = getattr(ds, "index_filename", None) or f"{ds.parameter_filename}.ewah"

or to store sidecar kdtree files for SPHDataset.

storage_filename will be derived from ds.basename if it's not set and needed for Index._initialize_data_storage().

I'm not sure how much of that is relevant to the Stream frontend in particular, but I think that yes, the stream_handler.name would be used for any sidecar files (if they're used at all).

All that to say, yes, the stream_handler.name will be used for more than just image prefixing so that's a good reason to use a more general variable name like dataset_name as the new keyword argument here.

chrishavlin · 2023-03-28T16:58:41Z

So I think we're in agreement here, ya?

use dataset_name as the keyword argument instead of basename and adjust the docstring entry to something like:

    dataset_name: string, optional
        Optional string used to assign a name to the dataset. Stream datasets will use
        this value in place of a filename (in image prefixing, etc.).

Sound good, @neutrinoceros and @matthewturk ?

@jisuoqing do you mind making that change? (Maybe wait a bit to give @neutrinoceros and/or @matthewturk a chance to reply to this comment)

jisuoqing · 2023-03-28T17:02:27Z

Thanks! Happy to make the change after hearing back from others.

neutrinoceros · 2023-03-28T18:45:16Z

Sounds good to me

matthewturk · 2023-03-29T15:27:34Z

Let's do it!

neutrinoceros

thank you ! LGTM

chrishavlin

Great, thanks @jisuoqing !

add basename for loaders

fec062d

jisuoqing added the proposal Proposals for enhancements, changes, etc label Mar 3, 2023

add ignore

c2162e2

jisuoqing added enhancement Making something better and removed proposal Proposals for enhancements, changes, etc labels Mar 3, 2023

neutrinoceros requested changes Mar 3, 2023

View reviewed changes

yt/loaders.py Outdated Show resolved Hide resolved

yt/loaders.py Outdated Show resolved Hide resolved

jisuoqing and others added 3 commits March 4, 2023 00:58

Update yt/loaders.py

1a82dea

Co-authored-by: Clément Robert <cr52@protonmail.com>

Update yt/loaders.py

c9166bd

Co-authored-by: Clément Robert <cr52@protonmail.com>

adjust syntax

20b437e

fix typo

52e0388

chrishavlin reviewed Mar 10, 2023

View reviewed changes

yt/loaders.py Outdated Show resolved Hide resolved

fix attribute error

1a22d75

matthewturk enabled auto-merge March 29, 2023 15:27

chrishavlin disabled auto-merge March 29, 2023 21:54

rename argument to dataset_name and add docs

bae0d99

jisuoqing added the docs label Mar 30, 2023

neutrinoceros approved these changes Mar 30, 2023

View reviewed changes

chrishavlin approved these changes Mar 30, 2023

View reviewed changes

chrishavlin merged commit 55b9c50 into yt-project:main Mar 30, 2023

jisuoqing deleted the loader_basename branch March 31, 2023 02:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: add basename for loaders #4363

ENH: add basename for loaders #4363

jisuoqing commented Mar 3, 2023

neutrinoceros left a comment

jisuoqing commented Mar 3, 2023

chrishavlin left a comment

neutrinoceros commented Mar 12, 2023

jisuoqing commented Mar 16, 2023

neutrinoceros commented Mar 17, 2023

jisuoqing commented Mar 17, 2023

neutrinoceros commented Mar 17, 2023

chrishavlin commented Mar 22, 2023 •

edited

neutrinoceros commented Mar 22, 2023

chrishavlin commented Mar 22, 2023

matthewturk commented Mar 22, 2023 via email

neutrinoceros commented Mar 22, 2023

chrishavlin commented Mar 23, 2023

chrishavlin commented Mar 28, 2023

jisuoqing commented Mar 28, 2023

neutrinoceros commented Mar 28, 2023

matthewturk commented Mar 29, 2023

neutrinoceros left a comment

chrishavlin left a comment

ENH: add basename for loaders #4363

ENH: add basename for loaders #4363

Conversation

jisuoqing commented Mar 3, 2023

neutrinoceros left a comment

Choose a reason for hiding this comment

jisuoqing commented Mar 3, 2023

chrishavlin left a comment

Choose a reason for hiding this comment

neutrinoceros commented Mar 12, 2023

jisuoqing commented Mar 16, 2023

neutrinoceros commented Mar 17, 2023

jisuoqing commented Mar 17, 2023

neutrinoceros commented Mar 17, 2023

chrishavlin commented Mar 22, 2023 • edited

neutrinoceros commented Mar 22, 2023

chrishavlin commented Mar 22, 2023

matthewturk commented Mar 22, 2023 via email

neutrinoceros commented Mar 22, 2023

chrishavlin commented Mar 23, 2023

chrishavlin commented Mar 28, 2023

jisuoqing commented Mar 28, 2023

neutrinoceros commented Mar 28, 2023

matthewturk commented Mar 29, 2023

neutrinoceros left a comment

Choose a reason for hiding this comment

chrishavlin left a comment

Choose a reason for hiding this comment

chrishavlin commented Mar 22, 2023 •

edited