Describe the workflow you want to enable
In pattern mining it is well established that items are represented as non-negative integers (ids). FIMI datasets contain only integer entries.
Therefore we should implement an encoder that maps any raw data to an integer-based representation of items, bringing our inputs closer to the kind one finds in the literature.
Describe your proposed solution
Create a preprocessing/ module with a simple class, ItemsetEncoder, that provides a two-way mapping (something like a bidict) to encode any hashable items into integers and to decode them back.
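A minimal sketch of what such an encoder could look like. Only the class name ItemsetEncoder comes from this proposal; the method names (fit, transform, inverse_transform) and the choice to assign ids in order of first appearance are assumptions made for illustration, not a settled API.

```python
from typing import Dict, Hashable, Iterable, List


class ItemsetEncoder:
    """Two-way mapping between hashable items and non-negative integer ids.

    Hypothetical sketch: method names and id assignment order are assumptions.
    """

    def __init__(self) -> None:
        self._item_to_id: Dict[Hashable, int] = {}  # item -> id
        self._id_to_item: List[Hashable] = []       # id -> item (position is the id)

    def fit(self, transactions: Iterable[Iterable[Hashable]]) -> "ItemsetEncoder":
        # Assign ids 0, 1, 2, ... in order of first appearance.
        for transaction in transactions:
            for item in transaction:
                if item not in self._item_to_id:
                    self._item_to_id[item] = len(self._id_to_item)
                    self._id_to_item.append(item)
        return self

    def transform(self, transactions: Iterable[Iterable[Hashable]]) -> List[List[int]]:
        # Raises KeyError on items that were not seen during fit.
        return [[self._item_to_id[item] for item in t] for t in transactions]

    def inverse_transform(self, encoded: Iterable[Iterable[int]]) -> List[List[Hashable]]:
        return [[self._id_to_item[i] for i in t] for t in encoded]


transactions = [["bread", "milk"], ["milk", "eggs"], ["bread", "eggs", "milk"]]
encoder = ItemsetEncoder().fit(transactions)
encoded = encoder.transform(transactions)        # [[0, 1], [1, 2], [0, 2, 1]]
decoded = encoder.inverse_transform(encoded)     # same as the original transactions
```

Keeping the reverse mapping in a list makes decoding a plain index lookup, which is enough as long as ids are dense and start at 0.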
Describe alternatives you've considered, if relevant
It is not obvious, though, that this improves the performance of the mining method applied downstream. In my first benchmark it did when the mining method makes a lot of inter-item comparisons (e.g. our LCM implementation), but this may not hold for every method.
Additional context
In terms of data engineering: should we make this integer-based representation mandatory and forbid any other input to our methods? This would come at the cost of harder data ingestion, and may not guarantee better algorithms ...
pandas already provides some utilities to convert labeled indexes to their positions, so better not to reinvent the wheel
defining this "convert to integer" step as a preprocessing step makes an external post-processing step mandatory in most cases (converting integers back to labels)
From my own experience, it's not very costly to do this as an internal step, e.g. in SLIM
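To illustrate the pandas route mentioned above, here is a rough sketch using `pandas.factorize`, which returns integer codes together with the unique labels needed to map back. The toy transactions and variable names are made up for the example, and this is only one of several pandas utilities that could serve here.

```python
import pandas as pd

transactions = [["bread", "milk"], ["milk", "eggs"], ["bread", "eggs", "milk"]]

# One row per (transaction, item) pair; the index keeps the transaction number.
flat = pd.Series(transactions).explode()

# codes: integer id per row, uniques: labels indexed by id (order of first appearance).
codes, uniques = pd.factorize(flat)

# Regroup the integer codes by transaction.
encoded = pd.Series(codes, index=flat.index).groupby(level=0).agg(list).tolist()

# The "post-processing" step: convert integers back to labels.
decoded = [[uniques[i] for i in t] for t in encoded]
```

The decoding line is the external post-processing step discussed above; doing it inside the mining method would hide it from the user.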