This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

[Question] Loading of huge df to extract few columns #284

Closed
IgorHoholko opened this issue Jan 24, 2023 · 7 comments
Labels
Answered Question has been answered UserQuestion An open source user question

Comments

@IgorHoholko

Hello,

If I want to use a hamilton data loader:

import pandas as pd
from hamilton.function_modifiers import extract_columns

@extract_columns("spend", "signups")
def log_df(log_df_location: str) -> pd.DataFrame:
    """Dummy function showing how to wire through loading data.
    :param log_df_location: path to the log data.
    :return: the loaded dataframe.
    """
    return pd.read_csv(log_df_location)

But log_df is huge and I want to extract only a few columns. Loading it all into memory just to extract them is not possible. What should I do here?

@elijahbenizzy
Collaborator

Great question! Currently Hamilton does not handle this specially, since the loading is done by the function itself. So, unfortunately, the best way to do this now is:

(1) make the function only load specific columns (this depends where you're loading from/what the data source is...)
(2) use extract_columns to load those

So not perfect, but we have some ideas here that might help and that we'd like to work on. We've been mulling over a few things -- custom data loading adapters, e.g. ones that say "load these columns from these sources", and can then optimize. An issue that gets at this is #197.

Duckdb might be a good workaround for now as well. See, e.g. this: https://github.com/stitchfix/hamilton/pull/195/files.
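For a CSV source, point (1) could look like the minimal sketch below (names are illustrative): `usecols` tells pandas to parse only the listed columns, so the rest of the file is never materialized in memory. The same function can then be decorated with `@extract_columns` as in your snippet.

```python
import io

import pandas as pd

# Stand-in for the big CSV on disk; in practice this would be a file path.
csv_source = io.StringIO("spend,signups,clicks\n10,1,100\n20,3,250\n")

# usecols prunes columns at parse time, so only "spend" and "signups"
# are ever loaded into memory.
df = pd.read_csv(csv_source, usecols=["spend", "signups"])
```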

@skrawcz
Collaborator

skrawcz commented Jan 24, 2023

@IgorHoholko what format is the data stored in? We can help come up with some code snippets for you to try.

@IgorHoholko
Author

The data is originally stored in MongoDB, and a replica is made to Postgres. From Postgres I save CSV tables for further processing to train my DL algorithms (recommendations).

For training, I want to be able to make data preparation with a config:

# example.yaml

user_features: [user_id, age, country, ...]
video_features: [video_id, author, channel_id, ...]

# dates to make a slice of log data.
date_start: # 
date_end: #

So, having the raw data, I want to extract the particular features mentioned in the config, from a specific date interval. The date interval is usually such that, after slicing, the data will fit into memory.
To build this configuration-driven pipeline I think Hamilton is a good choice.
But because these CSV files are too big to load into memory at once for further processing, I am not sure I can use Hamilton as it is.

I am thinking about skipping the "dump to CSV" step and writing a data loader that takes data directly from Postgres, with SQL queries that include the date slice.

I would really like to hear your thoughts on this. Probably there are other, even better approaches.
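Roughly, I imagine something like the sketch below (sqlite is used as a stand-in for Postgres, and the table/column names are made up; with Postgres you'd pass a SQLAlchemy/psycopg2 connection to pd.read_sql instead):

```python
import sqlite3

import pandas as pd

# sqlite stands in for Postgres here so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (user_id INTEGER, spend REAL, event_date TEXT)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?, ?)",
    [(1, 10.0, "2023-01-05"), (2, 20.0, "2023-02-01")],
)

# The column list and date window would come from example.yaml.
features = ["user_id", "spend"]
query = (
    f"SELECT {', '.join(features)} FROM log "
    "WHERE event_date >= ? AND event_date < ?"
)

# Only the selected columns inside the date slice leave the database,
# so the slice -- not the whole table -- has to fit in memory.
df = pd.read_sql(query, conn, params=("2023-01-01", "2023-01-15"))
```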

@skrawcz
Collaborator

skrawcz commented Jan 24, 2023

@IgorHoholko do you want to jump on a quick call? It might be quicker to clarify a few details. You can join our slack here and I can then send you a google meet link? or?

@skrawcz
Collaborator

skrawcz commented Jan 24, 2023

To recap we talked about two options:

  1. use pd.read_sql and connect directly to the Postgres database. Postgres will do the filtering and only return what's required, so it fits in memory.
  2. use pd.read_parquet with the columns= argument, which will only load the requested columns, but you need to rewrite the CSVs into parquet format.

If performance is an issue with (1), then duckdb could also be used to connect to the database and return a pandas dataframe.

Let us know how you get on!

@skrawcz skrawcz added UserQuestion An open source user question Answered Question has been answered labels Jan 25, 2023
@skrawcz
Collaborator

skrawcz commented Jan 25, 2023

@IgorHoholko also I found https://blog.datasyndrome.com/python-and-parquet-performance-e71da65269ce particularly helpful if you want to go down the parquet route.

@skrawcz
Collaborator

skrawcz commented Feb 2, 2023

@IgorHoholko I'm going to close this issue if there isn't any more follow-up. Re-open if needed.

@skrawcz skrawcz closed this as completed Feb 2, 2023
3 participants