[Question] Loading of huge df to extract few columns #284
Comments
Great question! Hamilton does not currently handle this specially, because the loading is done by the function itself. So, unfortunately, the best way to do this for now is to make the function load only the specific columns it needs (how depends on where you're loading from and what the data source is). That's not perfect, but we have some ideas that might help and that we'd like to work on. We've been mulling over custom data loading adapters, e.g. ones that say "load these columns from these sources" and can then optimize. An issue that gets at this is #197. DuckDB might also be a good workaround for now; see, e.g., https://github.com/stitchfix/hamilton/pull/195/files. |
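A minimal sketch of a loader that reads only specific columns, assuming the data sits in a CSV file and pandas does the reading. The function name `log_df` comes from the question; the `log_path` and `log_columns` inputs and the sample rows are hypothetical. `usecols` tells pandas to keep only the named columns, so the rest never end up in the DataFrame.

```python
import os
import tempfile
from typing import List

import pandas as pd


def log_df(log_path: str, log_columns: List[str]) -> pd.DataFrame:
    """Load only the requested columns of the log file.

    Written as a plain function with typed inputs, the way Hamilton
    functions are declared. `log_path` and `log_columns` are hypothetical
    inputs that would be supplied when the dataflow is executed.
    """
    # usecols restricts the result to the listed columns, so the other
    # columns are never materialised in the DataFrame.
    return pd.read_csv(log_path, usecols=log_columns)


# Tiny stand-in for the huge file, only to make the sketch runnable.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "logs.csv")
    with open(sample_path, "w") as f:
        f.write("user_id,item_id,event_date,payload\n")
        f.write("1,10,2023-01-01,aaa\n")
        f.write("2,20,2023-01-02,bbb\n")
    small_df = log_df(sample_path, ["user_id", "event_date"])

print(small_df)
```

If even the selected columns are too large, `pd.read_csv` also accepts a `chunksize` argument, which returns an iterator of smaller DataFrames instead of one large one.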
@IgorHoholko what format is the data stored in? We can help come up with some code snippets for you to try. |
The data is originally stored in MongoDB and then replicated to Postgres. From Postgres I export CSV tables for further processing to train my DL algorithms (recommendations). For training, I want to be able to drive data preparation from a config:
So, given the raw data, I want to extract the particular features mentioned in the config for specific date intervals. The date interval is usually such that the data fits into memory after slicing. I am thinking about skipping the "dump to CSV" step and writing a data loader that pulls data directly from Postgres using SQL queries with a date slice in them. I would really like to hear your thoughts on this. There are probably other, even better approaches. |
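The idea of a loader that queries the database with a date slice could look roughly like the sketch below. It is only an illustration: the config from the thread is not shown, so the `config` keys, the `events` table, and the `event_date` column are all hypothetical, and an in-memory SQLite database stands in for Postgres so the example runs without a server.

```python
import sqlite3
from typing import List

import pandas as pd


def load_features(conn, table: str, features: List[str],
                  date_from: str, date_to: str) -> pd.DataFrame:
    """Fetch only `features` for rows with date_from <= event_date < date_to."""
    # Identifiers cannot be bound as query parameters, so validate them
    # before interpolating them into the SQL text.
    for name in [table, *features]:
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    query = (
        f"SELECT {', '.join(features)} FROM {table} "
        "WHERE event_date >= ? AND event_date < ?"
    )
    # sqlite3 uses '?' placeholders; a Postgres driver such as psycopg2
    # uses '%s' instead.
    return pd.read_sql_query(query, conn, params=(date_from, date_to))


# Hypothetical config (the real one is not part of this thread).
config = {
    "table": "events",
    "features": ["user_id", "item_id"],
    "date_from": "2023-01-01",
    "date_to": "2023-02-01",
}

# In-memory SQLite stands in for Postgres so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events "
    "(user_id INTEGER, item_id INTEGER, event_date TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        (1, 10, "2023-01-15", "a"),
        (2, 20, "2023-02-15", "b"),
        (3, 30, "2022-12-31", "c"),
    ],
)

features_df = load_features(conn, **config)
print(features_df)
conn.close()
```

Pushing both the column selection and the date filter into the query means only the sliced data crosses the wire, which is the part that has to fit into memory.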
@IgorHoholko do you want to jump on a quick call? It might be quicker to clarify a few details. You can join our Slack here, and I can then send you a Google Meet link. Or would something else work better? |
To recap, we talked about two options:
If performance is an issue with (1), then DuckDB could also be used to connect to the database and return a pandas DataFrame. Let us know how you get on! |
@IgorHoholko also I found https://blog.datasyndrome.com/python-and-parquet-performance-e71da65269ce particularly helpful if you want to go down the parquet route. |
@IgorHoholko I'm going to close this issue if there isn't any more follow-up. Re-open if needed. |
Hello,
If I want to use a Hamilton data loader:
But log_df is huge and I want to extract only a few columns. Loading everything into memory to extract them is not possible. How should I approach this?