
[Suggestion] Update the logic of preprocessing for efficiency #60

Closed
Kimyungi opened this issue Jul 4, 2023 · 1 comment
Kimyungi commented Jul 4, 2023

I suggest updating the preprocessing logic for efficiency.

In many cases, a user's behavior sequence is identical across all of that user's training samples. Likewise, the features of a user or an item are often the same in every training sample that involves them.

However, the current version of FuxiCTR receives the training dataset as a single DataFrame, so these features (e.g., a user's behavior sequence, or the features of a user or an item) must be stored redundantly in that DataFrame, which consumes excessive memory, especially on large-scale datasets. Moreover, the fit/transform of the feature_preprocessor is performed on these redundant behavior sequences and features, which takes far too long on large-scale datasets.

So, to operate more efficiently, I hope these redundancies can be removed. To this end, I suggest changing the preprocessing logic to additionally receive a user_df and an item_df for each dataset, and to fit/transform only the unique features (i.e., those in user_df and item_df), as sketched below.
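
For illustration, here is a minimal sketch of the proposed flow, assuming hypothetical user_df/item_df inputs and a scikit-learn-style encoder standing in for FuxiCTR's feature_preprocessor (the actual API may differ):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical inputs: the interaction log references users/items by id only,
# while user- and item-level features live in deduplicated side tables.
interactions = pd.DataFrame({"user_id": [1, 1, 2], "item_id": [10, 11, 10], "label": [1, 0, 1]})
user_df = pd.DataFrame({"user_id": [1, 2], "gender": ["m", "f"]})
item_df = pd.DataFrame({"item_id": [10, 11], "category": ["book", "movie"]})

# Fit/transform each side feature exactly once, on unique rows only,
# instead of on every (redundant) occurrence in the training DataFrame.
user_df[["gender"]] = OrdinalEncoder().fit_transform(user_df[["gender"]])
item_df[["category"]] = OrdinalEncoder().fit_transform(item_df[["category"]])

# Join the encoded side features back onto the interaction log at the end.
train = interactions.merge(user_df, on="user_id").merge(item_df, on="item_id")
```

With N training samples but only U unique users and I unique items, fit/transform would then touch U + I rows rather than N.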


zhujiem commented Jul 4, 2023

Thanks for the suggestion. If we decoupled the dataset into user_df and item_df, it could no longer handle cross features or real-time sequences. In some datasets we have tested, a user has different behavior sequences, each computed up to the sample's timestamp, which follows common practice in industry. That said, preprocessing could indeed be accelerated by partitioning the dataset into chunks and removing some redundancy.
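
A minimal sketch of that chunked approach, assuming a CSV source and a vocabulary fitted once beforehand (hypothetical names; FuxiCTR's actual chunking would live inside its feature_preprocessor):

```python
import pandas as pd

def preprocess_in_chunks(csv_path, vocab, chunksize=100_000):
    """Stream the dataset in fixed-size chunks so only one chunk is held in
    memory at a time; `vocab` maps raw user ids to indices and is fitted once
    beforehand, e.g. from a deduplicated user table."""
    parts = []
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        # Apply the pre-fitted mapping per chunk; unseen ids fall back to 0 (OOV).
        chunk["user_id"] = chunk["user_id"].map(vocab).fillna(0).astype(int)
        parts.append(chunk)
    return pd.concat(parts, ignore_index=True)
```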

zhujiem closed this as completed Jul 4, 2023