
[Feature Request] Optional loading of raw_data attribute #239

Closed
patrickherring-TRI opened this issue Apr 16, 2021 · 1 comment · Fixed by #250
Labels: enhancement (New feature or request), structuring

Comments


patrickherring-TRI commented Apr 16, 2021

Problem statement
Loading the structured files used to be relatively fast, since they contained only the interpolated dataframes and were much smaller. Having the raw data available is very convenient, but it slows down loading and dramatically increases the amount of memory required.

Solution suggestion
Making loading of the raw data optional should solve the problem: unless a flag is set, the attribute would remain empty (using the auto_load_processed method). However, it's also desirable to keep the to/from file methods simple and transparent, so if a lot of additional complexity is required this might not be the best solution.

Alternative solutions
Another option is to remove the raw data attribute entirely, or to turn it into a loading method that looks up the original raw file. This is less desirable since it reintroduces the problem of locating the raw data that was used to create the structured file.
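A minimal sketch of the flag-based approach described above (the class and method names here are hypothetical stand-ins, not the real BEEPDatapath API):

```python
import json


class Datapath:
    """Toy container for structured cycler data.

    Hypothetical illustration only; the actual BEEPDatapath
    class has a different interface.
    """

    def __init__(self, structured, raw_data=None):
        self.structured = structured
        # Stays None unless the caller explicitly opts in.
        self.raw_data = raw_data

    @classmethod
    def from_dict(cls, d, load_raw=False):
        # Skip attaching the expensive attribute unless requested.
        raw = d.get("raw_data") if load_raw else None
        return cls(d["structured"], raw_data=raw)

    @classmethod
    def from_json_file(cls, path, load_raw=False):
        with open(path) as f:
            d = json.load(f)
        return cls.from_dict(d, load_raw=load_raw)
```

Note that this only avoids holding the raw data in the object; the JSON file is still read in full, which is the "reading the data file into a dictionary" cost discussed below.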

@patrickherring-TRI added the enhancement label Apr 16, 2021

ardunn commented Apr 27, 2021

So after examining this a bit, it seems the breakdown is this:

  • Reading the data file into a dictionary: 58% of the time
  • Creating the Datapath object (from_dict): 42% of the time

By simply not adding the raw data to the object, we can cut the "Creating the Datapath object" step by about 78%, i.e. reduce the total load time by about 31%. Source: repeatedly loading the large structured test file (250 MB) on a laptop.

This simple fix might also take care of the memory issues.

I'm not sure of the best way to avoid loading the raw data from disk into the dictionary in the first place. One workaround would be to change the defaults so that raw data is not saved unless explicitly requested; the current BEEPDatapath code would then automatically ignore the missing data and the run would be marked as "legacy". I'll update my PR with some timing results.
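The save-side default described here could look something like the following (a sketch under assumed names; `save_datapath`, `structured`, and `raw_data` are hypothetical, not the actual PR code):

```python
import json


def save_datapath(datapath, path, save_raw=False):
    """Serialize a datapath to JSON, dropping raw data by default.

    `datapath` is assumed to expose `structured` and `raw_data`
    attributes; these names are illustrative only.
    """
    d = {"structured": datapath.structured}
    # Raw data is written only when the caller explicitly asks for it,
    # so files stay small and later loads stay fast.
    if save_raw and datapath.raw_data is not None:
        d["raw_data"] = datapath.raw_data
    with open(path, "w") as f:
        json.dump(d, f)
```

Because the raw data never reaches the file, loading code needs no changes: the key is simply absent from the deserialized dictionary.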

Edit:

Decided to simply not save raw data by default when using the CLI. It's the easiest and cleanest solution, requiring no changes to as_dict/from_dict. See #250
