Replace a bunch of independent collections with a dataclass #716

Merged · 2 commits · Aug 13, 2022

Conversation

liamhuber
Copy link
Member

Advantages:

  • Less verbose
  • Easier to extend with more output fields
  • Makes eventual transition to jobs using DataContainer to hold input and output a bit easier

Disadvantages:

  • A bit more complicated

@Leimeroth
Copy link
Member

In principle I like this more than lists, but do you know how this manages memory internally? In the original version there was simply one large dict instead of separate lists plus a dict for computes. I changed it to separate lists because I am unsure how the memory layout of a dictionary and similar structures works; I feared that a dictionary/class would require contiguous memory and therefore frequently copy some very large data around in RAM.
Also, I am not sure how the overall memory footprint changes and how it interacts with Python's garbage collection. Since the whole point of the PR is to have no crashes due to memory when parsing large dumps, these questions could become relevant.

@Leimeroth
Copy link
Member

If we can clarify the memory concerns, we could directly use a DataContainer instead of a dictionary to hold all the lists, if that is the plan for all input and output data anyway.

@pmrv
Copy link
Contributor

pmrv commented Aug 10, 2022

I'm not sure if I understand correctly, but values in dictionaries are not contiguous in memory. The dict just stores pointers to the original values you assign to it. No copying takes place, and the only overhead of the dictionary is a small piece of contiguous memory for its buckets, where the pointers to the values are kept.

You can check this by:

import numpy as np

a = np.linspace(0, 1, 10_000)
d = {'a': a, 'b': a}
assert a is d['a'] is d['b']
assert id(a) == id(d['a']) == id(d['b'])

@Leimeroth
Copy link
Member

Then nothing speaks against exchanging the independent lists for a dict / DataContainer, IMO. Is there an advantage of using the class proposed by @liamhuber compared to a DataContainer/dictionary?

What I could imagine being helpful is to store everything in dicts/DataContainers/classes grouped by the necessary treatment. Right now I manually call the necessary np.matmul etc. for each part individually. I could imagine something like:

scale_transform_dict = {
    direct_unwrapped_positions,
}
transform_dict = {
    forces,
    velocities,
    unwrapped_positions,
}
no_ops_dict = {
    computes,
    some special stuff,
}

that only takes raw data, and then do all post-processing in loops.
This would have the advantage that it would be rather easy to add other possible dump keywords to the dicts if this should be extended. Eventually it would also help make the parsing more general, rather than working only with the predefined settings.
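The grouping idea above could be sketched roughly as follows. This is only an illustration, not the actual pyiron implementation: the field names, the rotation matrix, and the cell are placeholders.

```python
import numpy as np

rotation = np.eye(3)     # placeholder for the cell rotation matrix
cell = np.eye(3) * 10.0  # placeholder for the simulation cell

# pretend these arrays came straight out of the dump parser
parsed = {name: np.random.rand(5, 3) for name in
          ("direct_unwrapped_positions", "forces", "velocities")}

# group field names by the post-processing they need
scale_and_transform = ["direct_unwrapped_positions"]  # direct -> cartesian, then rotate
transform_only = ["forces", "velocities"]             # rotate only
no_ops = []                                           # e.g. per-atom computes, kept raw

# apply each treatment in a single loop instead of per-field calls
for name in scale_and_transform:
    parsed[name] = np.matmul(parsed[name] @ cell, rotation)
for name in transform_only:
    parsed[name] = np.matmul(parsed[name], rotation)
```

Adding a new dump keyword would then just mean appending its name to one of the lists, rather than writing new per-field post-processing code.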

As a first workaround for my memory problem I wanted to reduce the number of steps in the dump using OVITO, and then noticed that pyiron uses and parses only direct coordinates, while OVITO uses the cartesian values. Probably not the most common use case, but right now parsing will fail if someone defines the dump command without direct unwrapped coordinates, for example.

@niklassiemer
Copy link
Member

Yes, in general Python only stores pointers, and lists are also not contiguous in memory. Therefore, .append() may write the new value anywhere in memory and just stores a pointer to it. NumPy arrays, however, are stored in C data structures and are contiguous in memory. This also explains the big performance advantage when computing with them.
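A small demonstration of these points: Python containers hold only references, while a NumPy array owns one contiguous buffer.

```python
import sys

import numpy as np

a = np.linspace(0, 1, 10_000)

container = []
container.append(a)  # no copy: the list stores a pointer to `a`
assert container[0] is a

d = {"a": a}
assert d["a"] is a   # dicts likewise store only a reference

# the array's data itself lives in one contiguous C buffer
assert a.flags["C_CONTIGUOUS"]

# the list object is tiny (a few pointers), far smaller than the array data
assert sys.getsizeof(container) < a.nbytes
```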

@liamhuber
Copy link
Member Author

Then nothing speaks against exchanging the independent lists for a dict / DataContainer, IMO.

Agreed, I don't see any hard barrier between the different implementations.

Is there an advantage of using the class proposed by @liamhuber compared to a DataContainer/dictionary?

IMO:

  • dataclass > dict: supports dot access notation and autocompletion
  • dataclass > dict: much less verbose when writing the HDF files
    • (You could lose the verbosity by putting all the current fields in one big dict, but then you sacrifice the first point)
  • dataclass > DataContainer: DataContainer doesn't play well with writing directly to ['output/generic'], does it @pmrv? I thought it mangles all the names prior to HDFing. If we can't exploit DataContainer's automated storage routines, then I think it's overkill for this purpose; a small dataclass is sufficient. I would still imagine that DataContainer is the long-term goal once we rationalize other parts of the code, e.g. where input and output get stored.
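A minimal sketch of the first two points, with illustrative field names (not the actual pyiron fields): dot access works with autocompletion, and writing all fields becomes one short loop via dataclasses.fields.

```python
from dataclasses import dataclass, field, fields

import numpy as np

@dataclass
class DumpData:
    steps: list = field(default_factory=list)
    forces: list = field(default_factory=list)
    computes: dict = field(default_factory=dict)

dump = DumpData()
dump.forces.append(np.zeros((4, 3)))  # dot access, autocompletes in an IDE

# writing to HDF becomes a loop over fields instead of one line per list
for fld in fields(dump):
    value = getattr(dump, fld.name)
    # hdf[f"output/generic/{fld.name}"] = value  # hypothetical HDF write
```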

What I could imagine to be helpful is to store everything in dicts/DataContainers/classes based on the necessary treatment. What I do right now is to manually call the necessary np.matmul etc. for each part individually. What I could imagine is something like: ...

I'm not sure I totally understand. For now it looks like both this PR and the original PR are compatible with that, e.g.

to_transform = [
    dump.unwrapped_positions,
    dump.forces,
    ...
]
no_ops = list(dump.computes.values())

I could also imagine adding some sort of transformation method right on the dataclass, either by making separate dataclasses for each case and modifying __getattr__, or by having an extra method like LammpsData.transform(matrix: np.ndarray, data_to_transform: List). But I would save both of these ideas for the future. Since the dataclass doesn't introduce any memory issues, I would suggest just merging 716 -> 705 -> master, with the caveat that Jan's request of @samwaseda taking a look before we merge into master hasn't(?) been fulfilled.
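The transform-method idea could look something like this. The class name matches the one mentioned above, but the fields and method body are a hypothetical sketch, not the merged implementation:

```python
from dataclasses import dataclass, field
from typing import List

import numpy as np

@dataclass
class LammpsData:
    forces: List[np.ndarray] = field(default_factory=list)
    velocities: List[np.ndarray] = field(default_factory=list)

    def transform(self, matrix: np.ndarray, data_to_transform: List[str]) -> None:
        """Apply one matrix to the named per-step arrays, in place."""
        for name in data_to_transform:
            setattr(self, name,
                    [np.matmul(step, matrix) for step in getattr(self, name)])

data = LammpsData(forces=[np.ones((2, 3))])
data.transform(np.eye(3), ["forces"])  # identity: values are unchanged
```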

@Leimeroth Leimeroth merged commit b754a0e into lammps-parse-dump-pandas Aug 13, 2022
@delete-merged-branch delete-merged-branch bot deleted the with_dataclass branch August 13, 2022 15:29