
[ENH] support for sinks in csv format for RAMSES frontend #4424

Merged
merged 15 commits into yt-project:main from sink_csv on Jul 31, 2023

Conversation

Lenoble-lab
Contributor

Hi everyone,
This pull request implements a reader for the output of the RAMSES sink particle algorithm.
For now, the sink output from standard RAMSES (the version from the Bitbucket repository) cannot be read; it is simply another format.

In each output_***** folder, the sink algorithm writes a single sink_*****.csv file with the properties of the sinks. I have added an example as an attached file (a hypothetical sample is also sketched after the list below). So we need to read a single file per output folder, in CSV format.

I think the ideal would be to have a single ptype corresponding to the sinks (i.e. not create a separate 'sink_csv' type); read_header would have to be modified and specialized as appropriate. For now, I have implemented a new name, 'sink_csv', to differentiate it from the first class.

The main issue I had is that IOHandlerRAMSES._read_particle_fields reads every chunk concerned, whereas here there is only one single file to read. I hacked around this with the if ptype == 'sink_csv' at line 217 in io.py, but this is really not ideal.

Thinking about it, I see several solutions:

  • add a flag in the sink class to specify that it should only be read once -> add an if in IOHandlerRAMSES._read_particle_fields so that, in this case, a single read is enough and we avoid entering the for chunk in chunks loop.

  • create a new IOHandlerRAMSES._read_sink_fields that reads the CSV file only once, directly in IOHandlerRAMSES. It is possible to add a field type that would not be split per CPU (I don't know whether other RAMSES features also write a single file per output). But then you lose the inheritance from the particle class (or do we re-create another one?).

Attached: sink_00010.csv
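
For illustration only, a hypothetical sink_*****.csv could look like the sketch below. The layout is an assumption (two header lines skipped by the reader, consistent with the skiprows=2 used later in this thread), and the columns follow the msink, x, y, z, vx, ..., lx, ... headers mentioned in the review; the real attached file is authoritative:

    msink,x,y,z,vx,vy,vz,lx,ly,lz
    (second header line, skipped by the reader)
    1.2e-06,0.49,0.51,0.50,0.013,-0.002,0.004,1.1e-04,2.3e-04,-9.0e-05
    3.4e-07,0.12,0.88,0.25,-0.008,0.011,0.000,5.6e-05,-1.2e-05,7.8e-05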

Thanks in advance!
Romain Lenoble

@welcome

welcome bot commented Apr 28, 2023

Hi! Welcome, and thanks for opening this pull request. We have some guidelines for new pull requests, and soon you'll hear back about the results of our tests and continuous integration checks. Thank you for your contribution!

@cphyc cphyc changed the title from "Sink csv" to "[ENH] support for sinks in csv format for RAMSES frontend" on Apr 28, 2023
@cphyc cphyc added labels: enhancement (Making something better), code frontends (Things related to specific frontends)
yt/frontends/ramses/particle_handlers.py — outdated review threads, resolved.
field_offsets = {}
_pfields = {}

import pandas as pd
Member

In yt we have a way of supporting imports of external packages:

from yt.utilities.on_demand_imports import _pandas as pd

This fails gracefully, with a useful message, if pandas is not installed.
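
A minimal sketch of the pattern (read_sink_csv is a hypothetical helper; the assumption is that the shim defers the actual import until first attribute access, so the error only surfaces when pandas is really needed):

    from yt.utilities.on_demand_imports import _pandas as pd

    def read_sink_csv(fname):
        # the actual pandas import happens here, on first attribute access;
        # if pandas is missing, an informative ImportError is raised at this point
        return pd.read_csv(fname, delimiter=",", skiprows=2, header=None)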

@@ -131,6 +131,22 @@ class RAMSESFieldInfo(FieldInfoContainer):
("particle_metallicity", ("", [], None)),
("particle_family", ("", [], None)),
("particle_tag", ("", [], None)),
# sink field parameters
("particle_msink", ("code_mass", [], None)),
Member

Is there any reason not to use the canonical particle_mass here and below?
Also, the convention in yt is to use explicit names rather than common shorthand notations: e.g. lx, ly, lz should be angular_momentum_{xyz}, rho should be density, cs** should be sound_speed_square(d), tform should be formation_time, ...

Contributor Author

OK, I see. Here I use the format "particle_" + name of the header column in the CSV file.

The column names are: msink, x, y, z, vx, ..., lx, ...
Maybe using the default yt convention is better.
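
To make the suggested renaming concrete, here is a hypothetical translation table from the CSV headers to yt-convention field names (only the correspondences named above are grounded in this thread; the exact column set is an assumption):

    # hypothetical mapping: CSV header -> yt-convention field name
    _csv_to_yt = {
        "msink": "particle_mass",
        "lx": "particle_angular_momentum_x",
        "ly": "particle_angular_momentum_y",
        "lz": "particle_angular_momentum_z",
        "rho": "particle_density",
        "tform": "particle_formation_time",
    }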

@@ -92,6 +92,27 @@ def _ramses_particle_file_handler(fname, foffsets, data_types, subset, fields, c
    tr = {}
    ds = subset.domain.ds
    current_time = ds.current_time.in_units("code_time").v
    # test whether this is a sink CSV file -> not the same reader as the classic Fortran files
Member

I would suggest moving that piece of code to its own function (_ramses_particle_csv_file_handler?) and renaming the current function from _ramses_particle_file_handler to _ramses_particle_binary_file_handler.

Contributor Author

OK, that is probably clearer.
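
A skeleton of the proposed split, keeping the existing signature (bodies elided, names as suggested above; this is a sketch, not the final implementation):

    def _ramses_particle_binary_file_handler(fname, foffsets, data_types, subset, fields, count):
        # current logic, unchanged: read Fortran records from the per-CPU files
        ...

    def _ramses_particle_csv_file_handler(fname, foffsets, data_types, subset, fields, count):
        # new logic: parse the single sink_*****.csv with pandas
        ...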

@@ -192,6 +213,9 @@ def _read_particle_fields(self, chunks, ptf, selector):
            for ax in "xyz":
                if pn % ax not in field_list:
                    fields.append((ptype, pn % ax))
        # attention: read only one chunk, as it's a global file; I don't know whether this is the right way to do it
Member

I would suggest removing this piece of code and instead modifying the following chunk (currently sitting at line 260):

            tr.update(
                _ramses_particle_file_handler(
                    fname, foffsets, data_types, subset, subs_fields, count=count
                )
            )

with something like

            tr.update(
                ph.reader(
                    fname, foffsets, data_types, subset, subs_fields, count=count
                )
            )

The reader attribute could then be a method of the particle readers rather than a stand-alone function.
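
Concretely, each particle file handler could then carry its reader as a class attribute; a sketch with hypothetical class names (the handler names and attachment are illustrative, not existing yt API):

    class DefaultParticleFileHandler(ParticleFileHandler):
        # existing per-CPU Fortran binary files
        reader = staticmethod(_ramses_particle_binary_file_handler)

    class SinkParticleCSVFileHandler(ParticleFileHandler):
        # single global CSV file per output
        ptype = "sink_csv"
        reader = staticmethod(_ramses_particle_csv_file_handler)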

Contributor Author

Ah yes, that would definitely solve our problem, and it would make it much easier to later implement readers for any sort of file.

Contributor Author

I can try to implement this if you want

Contributor Author

But actually, if you keep something like this:

            tr.update(
                ph.reader(
                    fname, foffsets, data_types, subset, subs_fields, count=count
                )
            )

you are still updating the value of tr by appending the variables at the end, iterating over the chunks concerned.

To keep using this method, you need to check that the particles you are adding are indeed in the subset when you read them from the CSV file, hence checking the position of each particle with a few ifs. I don't know whether that is fast (but here the files contain fewer than 1e3 sinks, so it should not be too slow).

A second method would be to read all the data of the CSV file once and for all; in this case, you need to avoid entering the iteration at line 240 in _read_particle_fields:

            for chunk in chunks:
                for subset in chunk.objs:

In this case, the class would store all the sinks produced, independently of their position.
It would require writing some condition in _read_particle_fields before entering the iteration.
Maybe we could implement this with the _read_particle_data_file method?

Member

It might be difficult to check whether a particle is in the current subset though.

I think we have two options that avoid having sink-specific code here:

  • we update _read_particle_fields so that it returns a boolean that informs whether all the data has been read (sink case) or not (all other cases);
  • we update the particle handlers so that they carry a flag that says "all data is in one file, please only read one chunk".

At the end of the day, the reason for this (slight) mess is that the index assumes the data is spread over Ncpu different domains, each of which forms one chunk. The best solution would require having a second index, I guess, but that would be a massive change, especially for a simple case like this one.
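
A sketch of the second option (the flag name, handler class name, and helper are illustrative, not existing yt API):

    class SinkParticleCSVFileHandler(ParticleFileHandler):
        ptype = "sink_csv"
        # all sinks live in a single file per output,
        # not in one file per CPU domain
        single_file = True

    # hypothetically, in IOHandlerRAMSES._read_particle_fields:
    #   if getattr(ph, "single_file", False):
    #       yield from self._read_csv_once(ph, ptf, selector)
    #       break  # do not iterate over the remaining chunks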

Contributor Author

  • we update _read_particle_fields so that it returns a boolean that informs whether all the data has been read (sink case) or not (all other cases).

    • in that case, we would need to modify the BaseIOHandler class deeply, since we would be changing the output of _read_particle_fields, no?

  • we update the particle handlers so that they carry a flag that says "all data is in one file, please only read one chunk".

    • yes, that could be the simplest idea, no?

What I wanted to say is that BaseIOHandler has a _read_particle_data_file method. I don't get it: it raises NotImplementedError there, and it is not implemented in the IOHandlerRAMSES class either. Maybe we could implement such a per-data-file method, so that when reading the CSV data this is the function that gets called rather than _read_particle_fields. The ideal would be a function set as an attribute of the particle class, but that might require modifying the BaseIOHandler class, I think.

Member

You may be right. I don't remember the inheritance of BaseIOHandler. In that case, updating the particle handlers (they're specific to RAMSES and already encode all the specifics of each particle output file anyway) seems the best route.

Contributor Author

Hum, yes, that seems like the right idea. But for now, I can't manage the yield of _read_particle_fields.
Do you know if it can be called from another function like this (this code does not work; I don't know a lot about coroutines in Python)?

    def _read_particle_fields(self, chunks, ptf, selector):
        pn = "particle_position_%s"
        chunks = list(chunks)
        fields = [
            (ptype, fname) for ptype, field_list in ptf.items() for fname in field_list
        ]
        for ptype, field_list in sorted(ptf.items()):
            for ax in "xyz":
                if pn % ax not in field_list:
                    fields.append((ptype, pn % ax))

        self._read_particle_from_binary_file(chunks, ptf, selector, fields)

    def _read_particle_from_binary_file(self, chunks, ptf, selector, fields):
        pn = "particle_position_%s"

        for chunk in chunks:
            for subset in chunk.objs:
                rv = self._read_particle_subset(subset, fields)
                for ptype, field_list in sorted(ptf.items()):
                    x, y, z = (np.asarray(rv[ptype, pn % ax], "=f8") for ax in "xyz")
                    mask = selector.select_points(x, y, z, 0.0)
                    if mask is None:
                        mask = []
                    for field in field_list:
                        data = np.asarray(rv.pop((ptype, field))[mask], "=f8")
                        print(data)
                        yield (ptype, field), data

In this case, the function _read_particle_from_binary_file would be an attribute of the ParticleHandler class.
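
Note: as written, the snippet above cannot work because _read_particle_from_binary_file is a generator; calling it merely creates a generator object that is never iterated, and the outer function no longer contains any yield. Delegating with yield from would fix exactly this; a minimal sketch:

    def _read_particle_fields(self, chunks, ptf, selector):
        pn = "particle_position_%s"
        chunks = list(chunks)
        fields = [
            (ptype, fname) for ptype, field_list in ptf.items() for fname in field_list
        ]
        for ptype, field_list in sorted(ptf.items()):
            for ax in "xyz":
                if pn % ax not in field_list:
                    fields.append((ptype, pn % ax))
        # re-yield every ((ptype, field), data) pair produced by the inner generator
        yield from self._read_particle_from_binary_file(chunks, ptf, selector, fields)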

@neutrinoceros
Member

@Lenoble-lab when you are done pushing new changes, please take a minute to address the linting errors reported by pre-commit.ci. All of them seem to be auto-fixable (see the note at the bottom of this page: https://results.pre-commit.ci/run/github/88776493/1683098596.TWKhnqatSLSZQeSp2akaZw)

@Lenoble-lab
Contributor Author

Hello guys,
Did you have time to look into this project? I'm using my branch for the sinks in my simulations (but I may lose it with the next update of yt). We can discuss it if you want (I'm far from knowing everything, but I can help a bit with the changes I have made).
Cheers,
Romain

@neutrinoceros
Member

Let's try to get this across the finish line. The priority would be to fix failing checks.

@cphyc
Member

cphyc commented Jul 17, 2023 via email

@Lenoble-lab
Contributor Author

Hi, I don't understand the errors from https://results.pre-commit.ci/run/github/88776493/1690275703.l2NvkRYcQf62QbImBS2O4g

Do you know what the problems are?

@neutrinoceros
Member

Hi, I don't understand the errors from results.pre-commit.ci/run/github/88776493/1690275703.l2NvkRYcQf62QbImBS2O4g

Do you know what the problems are?

All the linting errors on your branch appear to be auto-fixable (that's why the error messages are minimal).
If you look carefully, you'll note that at the bottom of the page you linked are instructions for how to solve these problems by commenting on this PR. I would also advise you to have a look at our contribution guidelines for how to work with linters locally, which would help keep the PR tidy enough for merging.
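
For reference, working with the linters locally boils down to pre-commit's standard commands (the same hooks the CI runs):

    pip install pre-commit
    pre-commit install            # run the hooks automatically at each commit
    pre-commit run --all-files    # lint (and auto-fix) the whole tree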

@neutrinoceros
Member

There's a merge conflict to resolve now. Would you allow me to rebase your branch, or would you prefer to resolve it yourself?

@Lenoble-lab
Contributor Author

Hum, as you want. I was eating so I didn't see it. I don't have time now, but later in the afternoon I can have a deeper look. I ran `pre-commit run --all-files` but it deeply changes some files.

@neutrinoceros neutrinoceros force-pushed the sink_csv branch 3 times, most recently from 65ed224 to 53a8288 on July 25, 2023 12:42
Co-authored-by: Corentin Cadiou <contact@cphyc.me>
Co-authored-by: Clément Robert <cr52@protonmail.com>
@neutrinoceros
Member

neutrinoceros commented Jul 25, 2023

Fixed. I condensed the branch's history to eliminate back-and-forth edits, which make stuff like git bisect and git blame less useful for future maintenance.

Here are the commands you need to run to update your local clone of the repo:

git checkout main
git branch -D sink_csv
git fetch
git checkout sink_csv

Now we should be able to focus on getting tests to pass.

@neutrinoceros (Member) left a comment

Here are a couple of lightweight questions and suggestions to improve clarity and readability.

yt/frontends/ramses/particle_handlers.py, yt/frontends/ramses/io.py — outdated review threads, resolved.
        for chunk in chunks:
            for subset in chunk.objs:

                if ptype != "sink_csv":
Member

I'm confused. Shouldn't the guard clause be if ptype == "sink_csv": here?

Contributor Author

There is an if and then an else, so it's more or less equivalent, isn't it?

I was forced to add this condition because the reader is really not the same for the Fortran files and the CSV file. I didn't manage to find a way around it.

Member

Indeed there's an else in there, and having this if/else branching isn't an issue, but I would expect the old code to correspond to the previously supported cases (i.e. if ptype != "sink_csv"). Here, it seems to me that the old code runs specifically in the else clause, which matches the case you are adding. Am I missing anything?

Contributor Author

OK, I see. I will change it along these lines.
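
That is, restructured so that the new case is the one singled out; a sketch (bodies elided):

    for chunk in chunks:
        for subset in chunk.objs:
            if ptype == "sink_csv":
                # new path: the single global CSV file
                ...
            else:
                # previously supported path: per-CPU Fortran binary files
                ...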

yt/frontends/ramses/io.py — outdated review threads, resolved.
Lenoble-lab and others added 6 commits July 25, 2023 18:28
Co-authored-by: Clément Robert <cr52@protonmail.com>
Co-authored-by: Clément Robert <cr52@protonmail.com>
Co-authored-by: Clément Robert <cr52@protonmail.com>
Co-authored-by: Clément Robert <cr52@protonmail.com>
Co-authored-by: Clément Robert <cr52@protonmail.com>
Comment on lines 144 to 146
                dat = pd.read_csv(
                    fname, delimiter=",", usecols=[ind], skiprows=2, header=None
                )  # read only selected fields
Member

I'm wondering if we could call pd.read_csv just once ahead of the loop instead of doing multiple (costly) reading operations? If there were a chance of memory errors, I would expect the current implementation to hit those anyway.

Contributor Author

OK, I'll change it now.
But it's far from optimized: I need to read the entire file in _read_part_csv_file_descriptor to get the number of lines (to then initialize self.local_particle_count), and then I read it again (but only the columns I'm interested in).
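
A possible way around the double read: load the file once and derive both the particle count and the selected columns from the same DataFrame. A sketch; the attribute names follow the ones quoted above, everything else is illustrative:

    # read the whole file once: two header lines skipped, no header row used
    dat = pd.read_csv(fname, delimiter=",", skiprows=2, header=None)
    self.local_particle_count = len(dat)  # number of sinks = number of data rows
    selected = dat[ind]  # then slice out only the requested column(s)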

Contributor Author

I think reading a CSV file is too different from reading the Fortran binary file, so trying to use the same structure is not ideal.

@neutrinoceros (Member) left a comment

Almost there! Just a couple more small nits.

yt/frontends/ramses/io.py — outdated review threads, resolved.
Lenoble-lab and others added 3 commits July 31, 2023 10:51
Co-authored-by: Clément Robert <cr52@protonmail.com>
Co-authored-by: Clément Robert <cr52@protonmail.com>
Co-authored-by: Clément Robert <cr52@protonmail.com>
@Lenoble-lab
Contributor Author

Sorry for the delay, I was away and I completely forgot...

@neutrinoceros neutrinoceros enabled auto-merge (squash) July 31, 2023 13:21
@neutrinoceros
Copy link
Member

No need to apologise, and thank you for your contribution!

@neutrinoceros neutrinoceros added this to the 4.3.0 milestone Jul 31, 2023
@neutrinoceros neutrinoceros merged commit 6c715aa into yt-project:main Jul 31, 2023
12 checks passed
@welcome

welcome bot commented Jul 31, 2023

Hooray! Congratulations on your first merged pull request! We hope we keep seeing you around! 🎆
