This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

[Question] Loading of huge df to extract few columns #284

Closed
IgorHoholko opened this issue Jan 24, 2023 · 7 comments
Labels
Answered Question has been answered UserQuestion An open source user question

Comments

@IgorHoholko

Hello,

If I want to use a hamilton data loader:

import pandas as pd
from hamilton.function_modifiers import extract_columns

@extract_columns("spend", "signups")
def log_df(log_df_location: str) -> pd.DataFrame:
    """Dummy function showing how to wire through loading data.
    :param log_df_location: path to the log data.
    :return: the loaded dataframe.
    """
    return pd.read_csv(log_df_location)

But log_df is huge and I want to extract only a few columns. Loading it all into memory just to extract them is not possible. What should I do here?

@elijahbenizzy
Collaborator

Great question! Currently Hamilton does not handle this specially, since the loading is done by the function itself. So, unfortunately, the best way to do this now is:

(1) make the function only load specific columns (this depends where you're loading from/what the data source is...)
(2) use extract_columns to load those

So not perfect, but we have some ideas here that might help and that we'd like to work on. We've been mulling over a few things -- custom data loading adapters, e.g. ones that say "load these columns from these sources", and can then optimize. An issue that gets at this is #197.

Duckdb might be a good workaround for now as well. See, e.g. this: https://github.com/stitchfix/hamilton/pull/195/files.
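For a CSV source, point (1) could look like the minimal sketch below (names are illustrative): `usecols` tells pandas to parse only the listed columns, so the rest of the file is never materialized in memory. The same function can then be decorated with `@extract_columns` as in your snippet.

```python
import io

import pandas as pd

# Stand-in for the big CSV on disk; in practice this would be a file path.
csv_source = io.StringIO("spend,signups,clicks\n10,1,100\n20,3,250\n")

# usecols prunes columns at parse time, so only "spend" and "signups"
# are ever loaded into memory.
df = pd.read_csv(csv_source, usecols=["spend", "signups"])
```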

@skrawcz
Collaborator

skrawcz commented Jan 24, 2023

@IgorHoholko what format is the data stored in? We can help come up with some code snippets for you to try.

@IgorHoholko
Author

The data is originally stored in MongoDB, and a replica is made to Postgres. From Postgres I save CSV tables for further processing to train my DL algorithms (recommendations).

For training, I want to be able to make data preparation with a config:

# example.yaml

user_features: [user_id, age, country, ...]
video_features: [video_id, author, channel_id, ...]

# dates to make a slice of log data.
date_start: # 
date_end: #

So, having the raw data, I want to extract the particular features mentioned in the config, from a specific date interval. The date interval is usually such that, after slicing, the data will fit into memory.
To build this configuration-driven pipeline I think Hamilton is a good choice.
But because these CSV files are too big to load into memory at once for further processing, I am not sure I can use Hamilton as it is.

I am thinking about skipping the "dump to CSV" step and writing a data loader that takes data directly from Postgres, with SQL queries that include the date slice.

I would really like to hear your thoughts on this. Probably there are other, even better approaches.
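Roughly, I imagine something like the sketch below (sqlite is used as a stand-in for Postgres, and the table/column names are made up; with Postgres you'd pass a SQLAlchemy/psycopg2 connection to pd.read_sql instead):

```python
import sqlite3

import pandas as pd

# sqlite stands in for Postgres here so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (user_id INTEGER, spend REAL, event_date TEXT)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?, ?)",
    [(1, 10.0, "2023-01-05"), (2, 20.0, "2023-02-01")],
)

# The column list and date window would come from example.yaml.
features = ["user_id", "spend"]
query = (
    f"SELECT {', '.join(features)} FROM log "
    "WHERE event_date >= ? AND event_date < ?"
)

# Only the selected columns inside the date slice leave the database,
# so the slice -- not the whole table -- has to fit in memory.
df = pd.read_sql(query, conn, params=("2023-01-01", "2023-01-15"))
```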

@skrawcz
Collaborator

skrawcz commented Jan 24, 2023

@IgorHoholko do you want to jump on a quick call? It might be quicker to clarify a few details. You can join our slack here and I can then send you a google meet link? or?

@skrawcz
Collaborator

skrawcz commented Jan 24, 2023

To recap we talked about two options:

  1. use pd.read_sql and connect directly to the Postgres database. Postgres will do the filtering and only return what's required, so it fits in memory.
  2. use pd.read_parquet with the columns= argument, which will only load the requested columns, but you need to rewrite the CSVs into parquet format.

If performance is an issue with (1), then duckdb could also be used to connect to the database and return a pandas dataframe.

Let us know how you get on!

@skrawcz skrawcz added UserQuestion An open source user question Answered Question has been answered labels Jan 25, 2023
@skrawcz
Collaborator

skrawcz commented Jan 25, 2023

@IgorHoholko also I found https://blog.datasyndrome.com/python-and-parquet-performance-e71da65269ce particularly helpful if you want to go down the parquet route.

@skrawcz
Collaborator

skrawcz commented Feb 2, 2023

@IgorHoholko I'm going to close this issue if there isn't any more follow-up. Re-open if needed.

@skrawcz skrawcz closed this as completed Feb 2, 2023
3 participants