Replace a bunch of independent collections with a dataclass #716

Merged · 2 commits · Aug 13, 2022

Conversation

liamhuber
Copy link
Member

Advantages:

  • Less verbose
  • Easier to extend with more output fields
  • Makes eventual transition to jobs using DataContainer to hold input and output a bit easier

Disadvantages:

  • A bit more complicated

@Leimeroth
Copy link
Member

In principle I like this more than lists, but do you know how this manages memory internally? In the original version there was simply one large dict instead of separate lists plus a dict for computes. I changed it to separate lists because I am unsure how the memory layout of a dictionary and similar structures works; I feared that a dictionary/class would require contiguous memory and therefore frequently copy some very large data around in RAM.
Also, I am not sure how the overall memory footprint changes and how it interacts with Python's garbage collection. Since the whole point of the PR is to have no crashes due to memory when parsing large dumps, these questions could become relevant.

@Leimeroth
Copy link
Member

If we can clarify the memory concerns, we could directly use a DataContainer instead of a dictionary to hold all the lists, if that is the plan for all input and output data anyway.

@pmrv
Copy link
Contributor

pmrv commented Aug 10, 2022

I'm not sure if I understand correctly, but values in dictionaries are not contiguous in memory. The dict just stores pointers to the original values you assign to it. No copying takes place, and the only overhead of the dictionary is a small piece of contiguous memory for its buckets, where the pointers to the values are kept.

You can check this by:

import numpy as np

a = np.linspace(0, 1, 10_000)
d = {'a': a, 'b': a}
assert a is d['a'] is d['b']
assert id(a) == id(d['a']) == id(d['b'])

@Leimeroth
Copy link
Member

Then nothing speaks against exchanging the independent lists for a dict / DataContainer, IMO. Is there an advantage of using the class proposed by @liamhuber compared to a DataContainer/dictionary?

What I could imagine being helpful is to store everything in dicts/DataContainers/classes grouped by the necessary treatment. Right now I manually call the necessary np.matmul etc. for each part individually. I could imagine something like:

scale_transform_dict = {
    direct_unwrapped_positions,
}
transform_dict = {
    forces,
    velocities,
    unwrapped_positions,
}
no_ops_dict = {
    computes,
    some special stuff,
}

that only takes raw data, and then do all post-processing in loops.
This would have the advantage that it would be rather easy to add other possible dump keywords to the dicts if this should be extended. Eventually it would also help make the parsing more general, rather than working only with the predefined settings.
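The grouping idea above could be sketched roughly as follows. This is only an illustration, not the actual pyiron implementation: the field names, the rotation matrix, and the cell are placeholders.

```python
import numpy as np

rotation = np.eye(3)     # placeholder for the cell rotation matrix
cell = np.eye(3) * 10.0  # placeholder for the simulation cell

# pretend these arrays came straight out of the dump parser
parsed = {name: np.random.rand(5, 3) for name in
          ("direct_unwrapped_positions", "forces", "velocities")}

# group field names by the post-processing they need
scale_and_transform = ["direct_unwrapped_positions"]  # direct -> cartesian, then rotate
transform_only = ["forces", "velocities"]             # rotate only
no_ops = []                                           # e.g. per-atom computes, kept raw

# apply each treatment in a single loop instead of per-field calls
for name in scale_and_transform:
    parsed[name] = np.matmul(parsed[name] @ cell, rotation)
for name in transform_only:
    parsed[name] = np.matmul(parsed[name], rotation)
```

Adding a new dump keyword would then just mean appending its name to one of the lists, rather than writing new per-field post-processing code.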

As a first workaround for my memory problem I wanted to reduce the number of steps in the dump using OVITO, and then noticed that pyiron uses and parses only direct coordinates, while OVITO uses the cartesian values. Probably not the most common use case, but right now parsing will fail if someone defines the dump command without direct unwrapped coordinates, for example.

@niklassiemer
Copy link
Member

Yes, in general Python only stores pointers, and lists are also not contiguous in memory. Therefore, .append() may write the new value anywhere in memory and just stores a pointer to it. NumPy arrays, however, are stored in C data structures and are contiguous in memory. This also explains the big performance advantage when computing with them.
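A small demonstration of these points: Python containers hold only references, while a NumPy array owns one contiguous buffer.

```python
import sys

import numpy as np

a = np.linspace(0, 1, 10_000)

container = []
container.append(a)  # no copy: the list stores a pointer to `a`
assert container[0] is a

d = {"a": a}
assert d["a"] is a   # dicts likewise store only a reference

# the array's data itself lives in one contiguous C buffer
assert a.flags["C_CONTIGUOUS"]

# the list object is tiny (a few pointers), far smaller than the array data
assert sys.getsizeof(container) < a.nbytes
```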

@liamhuber
Copy link
Member Author

Then nothing speaks against exchanging the independent lists for a dict / DataContainer, IMO.

Agreed, I don't see any hard barrier between the different implementations.

Is there an advantage of using the class proposed by @liamhuber compared to a DataContainer/dictionary?

IMO:

  • dataclass > dict: supports dot access notation and autocompletion
  • dataclass > dict: much less verbose when writing the HDF files
    • (You could lose the verbosity by putting all the current fields in one big dict, but then you sacrifice the first point)
  • dataclass > DataContainer: DataContainer doesn't play well with writing directly to ['output/generic'], does it @pmrv? I thought it mangles all the names prior to HDFing. If we can't exploit DataContainer's automated storage routines, then I think it's overkill for this purpose; a small dataclass is sufficient. I would still imagine that DataContainer is the long-term goal once we rationalize other parts of the code, e.g. where input and output get stored.
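A minimal sketch of the first two points, with illustrative field names (not the actual pyiron fields): dot access works with autocompletion, and writing all fields becomes one short loop via dataclasses.fields.

```python
from dataclasses import dataclass, field, fields

import numpy as np

@dataclass
class DumpData:
    steps: list = field(default_factory=list)
    forces: list = field(default_factory=list)
    computes: dict = field(default_factory=dict)

dump = DumpData()
dump.forces.append(np.zeros((4, 3)))  # dot access, autocompletes in an IDE

# writing to HDF becomes a loop over fields instead of one line per list
for fld in fields(dump):
    value = getattr(dump, fld.name)
    # hdf[f"output/generic/{fld.name}"] = value  # hypothetical HDF write
```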

What I could imagine to be helpful is to store everything in dicts/DataContainers/classes based on the necessary treatment. What I do right now is to manually call the necessary np.matmul etc. for each part individually. What I could imagine is something like: ...

I'm not sure I totally understand. For now it looks like both this PR and the original PR are compatible with that, e.g.

to_transform = [
    dump.unwrapped_positions,
    dump.forces,
    ...
]
no_ops = list(dump.computes.values())

I could also imagine adding some sort of transformation method right on the dataclass, either by making separate dataclasses for each case and modifying __getattr__, or by having an extra method like LammpsData.transform(matrix: np.ndarray, data_to_transform: List). But I would save both of these ideas for the future. Since the dataclass doesn't introduce any memory issues, I would suggest just merging 716 -> 705 -> master, with the caveat that Jan's request of @samwaseda taking a look before we merge into master hasn't(?) been fulfilled.
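The transform-method idea could look something like this. The class name matches the one mentioned above, but the fields and method body are a hypothetical sketch, not the merged implementation:

```python
from dataclasses import dataclass, field
from typing import List

import numpy as np

@dataclass
class LammpsData:
    forces: List[np.ndarray] = field(default_factory=list)
    velocities: List[np.ndarray] = field(default_factory=list)

    def transform(self, matrix: np.ndarray, data_to_transform: List[str]) -> None:
        """Apply one matrix to the named per-step arrays, in place."""
        for name in data_to_transform:
            setattr(self, name,
                    [np.matmul(step, matrix) for step in getattr(self, name)])

data = LammpsData(forces=[np.ones((2, 3))])
data.transform(np.eye(3), ["forces"])  # identity: values are unchanged
```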

@Leimeroth Leimeroth merged commit b754a0e into lammps-parse-dump-pandas Aug 13, 2022
@delete-merged-branch delete-merged-branch bot deleted the with_dataclass branch August 13, 2022 15:29