Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Negative start/stop/step in sile_read_multiple #578

Closed
wants to merge 8 commits into from

Conversation

tfrederiksen
Copy link
Contributor

@tfrederiksen tfrederiksen commented Jun 11, 2023

Here an attempt to handle negative start/stop/step in sile_read_multiple as discussed in #577 . If one specifies only start=-1 one now gets the last entry.

If a negative start/stop/step is provided, the code reads all entries and returns slice(start, stop, step).

NB I am unsure if all corner cases are appropriately handled.

@tfrederiksen
Copy link
Contributor Author

tfrederiksen commented Jun 12, 2023

A major step with the last commit is the proposal to completely remove the all-keyword. Instead reading would be solely controlled by (the equivalent of) slice(start, stop, step).

For example:

  • start=-1 returns only the last entry (after reading the whole file),
  • stop=1 reads and returns only the first entry,
  • start=10, stop=11 skips the first 10 entries (using skip_call), reads the 11th entry (index 10), and stops reading further,
  • start=-2, step=-1 returns two entries (step-1, step-2) i.e., in reversed order,
  • etc.

src/sisl/io/sile.py Fixed Show resolved Hide resolved
@@ -100,10 +100,10 @@ def _r_geometry_skip(self, *args, **kwargs):
return na

@sile_fh_open()
@sile_read_multiple(skip_call=_r_geometry_skip, postprocess=GeometryCollection)
@sile_read_multiple(stop=1, skip_call=_r_geometry_skip, postprocess=GeometryCollection)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would somehow be cleaner if the default was always to return everything, unless the user provides some selection with start/stop/step.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, probably that would be wisest... But this depends on the discussion above.

@codecov
Copy link

codecov bot commented Jun 13, 2023

Codecov Report

Merging #578 (c3c92ab) into main (1dae3e0) will increase coverage by 0.01%.
The diff coverage is 96.59%.

@@            Coverage Diff             @@
##             main     #578      +/-   ##
==========================================
+ Coverage   87.30%   87.32%   +0.01%     
==========================================
  Files         374      374              
  Lines       47506    47496      -10     
==========================================
  Hits        41474    41474              
+ Misses       6032     6022      -10     
Impacted Files Coverage Δ
src/sisl/io/sile.py 85.52% <89.65%> (+0.01%) ⬆️
src/sisl/geometry.py 85.87% <100.00%> (-0.01%) ⬇️
src/sisl/io/openmx/md.py 100.00% <100.00%> (ø)
src/sisl/io/siesta/ani.py 100.00% <100.00%> (ø)
src/sisl/io/siesta/tests/test_ani.py 100.00% <100.00%> (ø)
src/sisl/io/siesta/tests/test_stdout.py 100.00% <100.00%> (ø)
src/sisl/io/siesta/tests/test_stdout_charges.py 97.72% <100.00%> (ø)
src/sisl/io/tests/test_xsf.py 100.00% <100.00%> (ø)
src/sisl/io/tests/test_xyz.py 100.00% <100.00%> (ø)
src/sisl/io/xsf.py 66.89% <100.00%> (ø)
... and 2 more

... and 8 files with indirect coverage changes

Copy link
Owner

@zerothi zerothi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I largely agree on everything, just some minor comments.

It would be great if @pfebrer could have a word here, he was interested in some MD functionality. It is a bit hard this one, since some file-formats are generically used for single geometries (xyz) while others are primarily MD files (ANI). It would feel weird for end-users that xyz-files returns a GeometryCollection. But I think the current approach of returning the raw output could be annoying for those doing stuff on the fly.

We could accept it, but then it needs clear documentation.

src/sisl/io/sile.py Outdated Show resolved Hide resolved
src/sisl/io/sile.py Outdated Show resolved Hide resolved
src/sisl/io/tests/test_xsf.py Outdated Show resolved Hide resolved
@@ -100,10 +100,10 @@ def _r_geometry_skip(self, *args, **kwargs):
return na

@sile_fh_open()
@sile_read_multiple(skip_call=_r_geometry_skip, postprocess=GeometryCollection)
@sile_read_multiple(stop=1, skip_call=_r_geometry_skip, postprocess=GeometryCollection)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, probably that would be wisest... But this depends on the discussion above.

Signed-off-by: Nick Papior <nickpapior@gmail.com>
@pfebrer
Copy link
Contributor

pfebrer commented Jun 13, 2023

It would be great if @pfebrer could have a word here, he was interested in some MD functionality.

As I said when we discussed these things, the generic GeometryCollection is not efficient enough to read large MD files because it repeats unnecessarily a lot of computation and stores redundant data. To read those files for now I am using the ASE reader and then parsing to sisl geometries as I need it.

But I agree with all the points of this PR, I think the best way to go is to mimic python slice functionality since it will feel more natural for users and also you can use the python slicing syntax as syntactic sugar if it makes sense at some point.

@pfebrer
Copy link
Contributor

pfebrer commented Jun 13, 2023

So I would try to make arguments mean exactly the same and have exactly the same defaults as slice.

As you discuss, this might be an issue because for some files it feels natural to read only one geometry, so the default should be stop=1. I think it's a bad idea to have different defaults for different siles because it's confusing. Maybe it is not a good idea that the function named read_geometry can read more than one geometry, and the user could be required to ask explicitly for the iterator. Would it be that bad? Something like:

sile = sisl.get_sile("file.xyz")

# Read the first geometry (or wherever the file handle is at)
sile.read_geometry() 

# Reads a slice of geometries by first creating the iterator and then calling it.
sile.read_geometry[20:2:-1]()
# or
sile.read_geometry.multi(20, 2, -1)()

# In fact, you could use it as an iterator, which feels more natural than passing some `wrap` argument
reader = sile.read_geometry[20:2:-1]

for i, geom in reader:
    ...

To me it feels natural and it would avoid these problems of "arbitrary" defaults for each sile.

@tfrederiksen
Copy link
Contributor Author

@pfebrer , thanks for your inputs. For read_geometry I agree that it would be intuitive that the default would be to return one entry (the next), but what about read_energy? Here one would typically be most interested in the last entry.

@pfebrer
Copy link
Contributor

pfebrer commented Jun 13, 2023

Hmm in fact I found it confusing at the beggining that there was a read_energy method in the out sile with preference for either the last/first, because it's not obvious to me that one is better than the other. In a relaxation, it might make sense to have last as the default, but if it's an MD, is there really an energy that makes more sense reading than all the others in the file?

Since it's not very well defined, I don't see it as a big problem to require the user to ask explicitly for something. I.e. read_energy[-1]() or something similar.

By the way, last, all and single arguments are probably not necessary if we fully support slicing. For example read_energy[-1]() should give you just one value while read_energy[-1:]() should give you a list of length one. I think it would be easier to make all the reading functions in sisl have a consistent interface if we remove the extra keywords.

@zerothi
Copy link
Owner

zerothi commented Jun 14, 2023

I think @pfebrer is on to something here.

I like both the slicing approach and the .multi approach. Likely the .multi function is easier in the long run since it can allow additional expressions. But both could be supported for the temporary class.

  • allow slicing
  • allow .multi or perhaps just .m/M/s?
  • what should we do to improve performance, perhaps have a same argument that will allow the first call to pass the returned value to all subsequent methods, that should remove the Geometry creation heavy part?
  • should we allow as_xyz to circumvent the Geometry class, I kind of don't like this since the return type will then heavily vary depending on the arguments. Perhaps this is ok, but the documentation of each method will be heavy... However, when returning multiple things, forces, coordinates etc. it might be fine to return in a PropertyDict (or some variant of the collection to perform manipulations of the internal things.

@zerothi
Copy link
Owner

zerothi commented Jun 14, 2023

I think the convention for returning single elements should be the last, for all methods. Then there will be no confusion.

@pfebrer
Copy link
Contributor

pfebrer commented Jun 14, 2023

what should we do to improve performance, perhaps have a same argument that will allow the first call to pass the returned value to all subsequent methods, that should remove the Geometry creation heavy part?

I think it would be enough if you can use the atoms argument of read_geometry. You can read the basis first and then read all the coordinates using the same basis for all geometries. In fact this is how I do it when using ASE and it's efficient enough. But I tried to do it in the current implementation of read_multiple and I think it didn't work. Perhaps I was doing it wrong.

I think the convention for returning single elements should be the last, for all methods. Then there will be no confusion.

But then you can't do what you usually suggest:

with sile as open_sile:
    for i in range(3):
        open_sile.read_geometry()

By the way I was using the word "slicing" but using read_energy[...] is in general the get item operation so ... could be a single integer, an array of indices or whatever other more complex selection. Just as it is done with atoms selection.

@zerothi
Copy link
Owner

zerothi commented Jun 14, 2023

what should we do to improve performance, perhaps have a same argument that will allow the first call to pass the returned value to all subsequent methods, that should remove the Geometry creation heavy part?

I think it would be enough if you can use the atoms argument of read_geometry. You can read the basis first and then read all the coordinates using the same basis for all geometries. In fact this is how I do it when using ASE and it's efficient enough. But I tried to do it in the current implementation of read_multiple and I think it didn't work. Perhaps I was doing it wrong.

I think the convention for returning single elements should be the last, for all methods. Then there will be no confusion.

But then you can't do what you usually suggest:

with sile as open_sile:
    for i in range(3):
        open_sile.read_geometry()

Ah, true, so actually it should default to only the first one!

By the way I was using the word "slicing" but using read_energy[...] is in general the get item operation so ... could be a single integer, an array of indices or whatever other more complex selection. Just as it is done with atoms selection.
Agreed.
I

@tfrederiksen
Copy link
Contributor Author

Ah, true, so actually it should default to only the first one!

I think this highlights a need for different defaults for different read_*.

@zerothi
Copy link
Owner

zerothi commented Jun 14, 2023

Ah, true, so actually it should default to only the first one!

I think this highlights a need for different defaults for different read_*.

True... But this is complicated for end-users... Hmm... We need to think carefully about the impacts in this regard.

@tfrederiksen
Copy link
Contributor Author

Ah, true, so actually it should default to only the first one!

I think this highlights a need for different defaults for different read_*.

True... But this is complicated for end-users... Hmm... We need to think carefully about the impacts in this regard.

The simplest seem to be defaulting to read all entries (ie start=stop=step=None) and let the user select from a list/collection or actively specify start/stop/step.

In the example by @pfebrer , one could do

with sile as open_sile:
    for i in range(3):
        open_sile.read_geometry(step=0)

@zerothi
Copy link
Owner

zerothi commented Jun 14, 2023

Otherwise the defaults would depend on whether the file is open or not?
I.e. if the file is opened in a with ... as sile: then we only read the next entry, otherwise everything/last/first...

@tfrederiksen
Copy link
Contributor Author

Now I've implemented step=0 as the default for xyzSile/xsfSile/mdSileOpenMX(single entry) while we keep step=1 for aniSileSiesta (collection).

@pfebrer
Copy link
Contributor

pfebrer commented Jun 14, 2023

In my opinion, read_* routines -where * is a word in singular- should not worry about skipping and they should just focus on reading the next thing available. Then we could get rid of all these extra arguments, which will be very hard to document in a way that doesn't confuse users, specially the different defaults for different methods/siles.

It seems clean and clear enough for me to specify the iterator externally (slicing, multi, slice,apply...), just as it is done with the BZ methods for example. If some method is very common to use with all=True, there could be an extra method with the plural word, i.e. read_energy -> read_energies as an alternative for read_energy[:].

@zerothi
Copy link
Owner

zerothi commented Jun 14, 2023

I think @pfebrer is on to something simple and effective. I'll have a look at this, as it is vital to get this in ASAP for API stability

@zerothi
Copy link
Owner

zerothi commented Jun 14, 2023

I have a somewhat working solution, see in the sile-slicer branch, a test is here:

open("/tmp/test.file", 'w').write("""1

C   0.00000000  0.00000000  0.00000000
2

C   0.00000000  0.00000000  0.00000000
C   1.000000  0.00000000  0.00000000
3

C   0.00000000  0.00000000  0.00000000
C   1.000000  0.00000000  0.00000000
C   2.00000  0.00000000  0.00000000
""")

from sisl.io import xyzSile

sile = xyzSile("/tmp/test.file")
print(sile.read_geometry(), help(sile.read_geometry))
print(sile.read_geometry[0](), help(sile.read_geometry[0]))
print(sile.read_geometry[1]())
print(sile.read_geometry[1:]())

It isn't complete, but just to get an idea?
I quickly ran into problems when decoration the function, since that happens at class-creation, the function won't bind self and thus they won't be passed, hence the __get__ work-around... A bit ugly IMHO.
Other suggestions on the implementation side?

@pfebrer
Copy link
Contributor

pfebrer commented Jun 14, 2023

I think the __get__ implementation is fine, but probably it's better to create a new (bound) instance of SileContainer instead of assigning the object instance (self) to the shared SileContainer.

After all, this is how functions turn into methods so there's nothing very strange going on. Unless you can modify the __getitem__ method of the function directly to implement a subscriptable method, I'd say this would be the best way to do it.

@pfebrer
Copy link
Contributor

pfebrer commented Jun 14, 2023

By the way, what is the reason behind the name SileContainer ? 😅

@zerothi zerothi mentioned this pull request Jun 15, 2023
@zerothi
Copy link
Owner

zerothi commented Jun 15, 2023

Lets continue the discussion in #584.

Perhaps this MR should be closed, and then focus should be on the other branch?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants