
[Feature Request] Optional loading of raw_data attribute #239

Closed
patrickherring-TRI opened this issue Apr 16, 2021 · 1 comment · Fixed by #250
Labels: enhancement (New feature or request), structuring

Comments


patrickherring-TRI commented Apr 16, 2021

Problem statement
Loading the structured files used to be relatively fast, since they contained only the interpolated dataframes and were much smaller. Having the raw data available is very convenient, but it slows down loading and dramatically increases the amount of memory required.

Solution suggestion
Making loading of the raw data optional should solve the problem: unless a flag is set, the attribute would remain empty (using the auto_load_processed method). However, it's also desirable to keep the to/from file methods simple and transparent, so if a lot of additional complexity is required this might not be the best solution.

Alternative solutions
Another option is to remove the raw data attribute entirely, or to turn it into a loading method that looks up the original raw file. This is less desirable since it reintroduces the problem of locating the raw data that was used to create the structured file.
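A minimal sketch of the flag-based approach described above (the class and method names here are hypothetical stand-ins, not the real BEEPDatapath API):

```python
import json


class Datapath:
    """Toy container for structured cycler data.

    Hypothetical illustration only; the actual BEEPDatapath
    class has a different interface.
    """

    def __init__(self, structured, raw_data=None):
        self.structured = structured
        # Stays None unless the caller explicitly opts in.
        self.raw_data = raw_data

    @classmethod
    def from_dict(cls, d, load_raw=False):
        # Skip attaching the expensive attribute unless requested.
        raw = d.get("raw_data") if load_raw else None
        return cls(d["structured"], raw_data=raw)

    @classmethod
    def from_json_file(cls, path, load_raw=False):
        with open(path) as f:
            d = json.load(f)
        return cls.from_dict(d, load_raw=load_raw)
```

Note that this only avoids holding the raw data in the object; the JSON file is still read in full, which is the "reading the data file into a dictionary" cost discussed below.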

@patrickherring-TRI added the enhancement label Apr 16, 2021

ardunn commented Apr 27, 2021

So after examining this a bit, it seems the breakdown is this:

  • Reading the data file into a dictionary: 58% of the time
  • Creating the Datapath object (from_dict): 42% of the time

By simply not adding the raw data to the object, we can cut the "Creating the Datapath object" step by about 78%, i.e. reduce the total load time by about 31%. Source: repeatedly loading the large structured test file (250 MB) on a laptop.

This simple fix might also take care of the memory issues.

I'm not sure of the best way to avoid loading the raw data from disk into the dictionary in the first place. One workaround would be to change the defaults so that raw data is not saved unless explicitly requested; the current BEEPDatapath code would then automatically ignore the missing data and the run would be marked as "legacy". I'll update my PR with some timing results.
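The save-side default described here could look something like the following (a sketch under assumed names; `save_datapath`, `structured`, and `raw_data` are hypothetical, not the actual PR code):

```python
import json


def save_datapath(datapath, path, save_raw=False):
    """Serialize a datapath to JSON, dropping raw data by default.

    `datapath` is assumed to expose `structured` and `raw_data`
    attributes; these names are illustrative only.
    """
    d = {"structured": datapath.structured}
    # Raw data is written only when the caller explicitly asks for it,
    # so files stay small and later loads stay fast.
    if save_raw and datapath.raw_data is not None:
        d["raw_data"] = datapath.raw_data
    with open(path, "w") as f:
        json.dump(d, f)
```

Because the raw data never reaches the file, loading code needs no changes: the key is simply absent from the deserialized dictionary.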

Edit:

Decided to simply not save raw data by default when using the CLI. It's the easiest and cleanest solution, requiring no changes to as_dict/from_dict. See #250
