init preprocessing/ module with an ItemsetEncoder #6

Closed
remiadon opened this issue Apr 21, 2020 · 1 comment
Labels
enhancement New feature or request

Comments

@remiadon
Collaborator

Describe the workflow you want to enable

In pattern mining it is well established that items are represented as non-negative integers (ids). The FIMI datasets, for instance, contain only integer entries.
We should therefore implement an encoder that maps any raw data to an integer-based representation of items, bringing our inputs closer to the kind found in the literature.

Describe your proposed solution

Create a preprocessing/ module with a single class, ItemsetEncoder, that provides a simple two-way mapping (something like a bi-dict) to encode any hashable items as integers and to decode them back.
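To make the proposal concrete, here is a minimal sketch of what such an encoder could look like. Only the class name ItemsetEncoder comes from this issue; the method names (`fit`, `transform`, `inverse_transform`) and the two internal containers are assumptions chosen for illustration, not an existing API of this project.

```python
from typing import Dict, Hashable, Iterable, List


class ItemsetEncoder:
    """Hypothetical two-way mapping between hashable items and integer ids."""

    def __init__(self) -> None:
        # item -> id (assumed layout; ids are assigned in order of first appearance)
        self.item_to_id: Dict[Hashable, int] = {}
        # id -> item, stored as a list so that decoding is a plain index lookup
        self.id_to_item: List[Hashable] = []

    def fit(self, transactions: Iterable[Iterable[Hashable]]) -> "ItemsetEncoder":
        for transaction in transactions:
            for item in transaction:
                if item not in self.item_to_id:
                    self.item_to_id[item] = len(self.id_to_item)
                    self.id_to_item.append(item)
        return self

    def transform(self, transactions: Iterable[Iterable[Hashable]]) -> List[List[int]]:
        # Raises KeyError for items that were not seen during fit
        return [[self.item_to_id[item] for item in t] for t in transactions]

    def inverse_transform(self, encoded: Iterable[Iterable[int]]) -> List[List[Hashable]]:
        return [[self.id_to_item[i] for i in t] for t in encoded]


transactions = [["bread", "milk"], ["milk", "eggs"], ["bread", "eggs", "milk"]]
encoder = ItemsetEncoder().fit(transactions)
encoded = encoder.transform(transactions)
decoded = encoder.inverse_transform(encoded)
```

Ids are non-negative and contiguous, which matches the integer-only convention described above, and `inverse_transform` recovers the original labels.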

Describe alternatives you've considered, if relevant

It is not obvious, though, that this improves the performance of the mining method applied downstream. In my first benchmark it does when the mining method makes many inter-item comparisons (e.g. our LCM implementation), but this may not hold for every method.

Additional context

In terms of data engineering: should we make this integer-based representation mandatory and reject any other input to our methods? This would make data ingestion harder, and may not guarantee better algorithms.

@remiadon remiadon added the enhancement New feature or request label Apr 21, 2020
@remiadon
Collaborator Author

Two things here:

  1. pandas already provides utilities to convert labeled indexes to their positions, so better not to reinvent the wheel
  2. defining this "convert to integer" step as a preprocessing step makes an external post-processing step mandatory in most cases (converting integers back to labels)
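As an illustration of both points, the sketch below uses two pandas functions that could play this role, `pandas.factorize` and `Index.get_indexer`. Which utilities were actually meant is an assumption on my part, and the sample data is made up.

```python
import pandas as pd

# Made-up item labels, one entry per occurrence
items = pd.Series(["bread", "milk", "milk", "eggs", "bread"])

# factorize returns integer codes plus the unique labels, in order of first appearance
codes, uniques = pd.factorize(items)

# Going back to labels is the extra post-processing step mentioned in point 2
decoded = uniques.take(codes)

# An existing Index can also map labels to their positions directly
positions = uniques.get_indexer(["milk", "eggs"])
```

Here `codes` holds the integer representation and `uniques` is the lookup needed to restore the labels afterwards.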

In my experience it is not very costly to do this as an internal step, e.g. in SLIM.

So I am closing this.
