
[Suggestion] Update the logic of preprocessing for efficiency #60

Closed
Kimyungi opened this issue Jul 4, 2023 · 1 comment
Kimyungi commented Jul 4, 2023

I suggest updating the preprocessing logic for efficiency.

In many cases, a user's behavior sequence is identical across all of that user's training samples. Likewise, the features of a user or an item are often the same in every training sample that involves them.

However, the current version of FuxiCTR receives the training dataset as a single DataFrame, so these features (e.g., a user's behavior sequence, or the features of a user or an item) must be stored redundantly in that DataFrame, which consumes excessive memory, especially on large-scale datasets. Moreover, the fit/transform of the feature_preprocessor is performed on these redundant behavior sequences and features, which takes far too long on large-scale datasets.

So, to operate more efficiently, I hope these redundancies can be removed. To this end, I suggest changing the preprocessing logic to additionally receive a user_df and an item_df for each dataset, and to fit/transform only the unique features (i.e., those in user_df and item_df), as sketched below.
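
For illustration, here is a minimal sketch of the proposed flow, assuming hypothetical user_df/item_df inputs and a scikit-learn-style encoder standing in for FuxiCTR's feature_preprocessor (the actual API may differ):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical inputs: the interaction log references users/items by id only,
# while user- and item-level features live in deduplicated side tables.
interactions = pd.DataFrame({"user_id": [1, 1, 2], "item_id": [10, 11, 10], "label": [1, 0, 1]})
user_df = pd.DataFrame({"user_id": [1, 2], "gender": ["m", "f"]})
item_df = pd.DataFrame({"item_id": [10, 11], "category": ["book", "movie"]})

# Fit/transform each side feature exactly once, on unique rows only,
# instead of on every (redundant) occurrence in the training DataFrame.
user_df[["gender"]] = OrdinalEncoder().fit_transform(user_df[["gender"]])
item_df[["category"]] = OrdinalEncoder().fit_transform(item_df[["category"]])

# Join the encoded side features back onto the interaction log at the end.
train = interactions.merge(user_df, on="user_id").merge(item_df, on="item_id")
```

With N training samples but only U unique users and I unique items, fit/transform would then touch U + I rows rather than N.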


zhujiem commented Jul 4, 2023

Thanks for the suggestion. If we decoupled the dataset into user_df and item_df, it could no longer handle cross features or real-time sequences. In some datasets we have tested, a user has different behavior sequences, each computed up to the sample's timestamp, which follows common practice in industry. That said, preprocessing could indeed be accelerated by partitioning the dataset into chunks and removing some redundancy.
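
A minimal sketch of that chunked approach, assuming a CSV source and a vocabulary fitted once beforehand (hypothetical names; FuxiCTR's actual chunking would live inside its feature_preprocessor):

```python
import pandas as pd

def preprocess_in_chunks(csv_path, vocab, chunksize=100_000):
    """Stream the dataset in fixed-size chunks so only one chunk is held in
    memory at a time; `vocab` maps raw user ids to indices and is fitted once
    beforehand, e.g. from a deduplicated user table."""
    parts = []
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        # Apply the pre-fitted mapping per chunk; unseen ids fall back to 0 (OOV).
        chunk["user_id"] = chunk["user_id"].map(vocab).fillna(0).astype(int)
        parts.append(chunk)
    return pd.concat(parts, ignore_index=True)
```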

zhujiem closed this as completed Jul 4, 2023