Binary matrix as input #8

Closed
remiadon opened this issue Apr 22, 2020 · 2 comments
Labels
documentation Improvements or additions to documentation enhancement New feature or request help wanted Extra attention is needed

@remiadon
Collaborator

Workflow

For many people in the machine learning community, representing transactional datasets the way we do in skmine is unusual.
Moreover, a common workflow we will encounter at some point consists of using the output of our Transformers as input to sklearn.

In pattern mining, researchers are already familiar with matrix representations of transactional databases. Quoting Vreeken et al., SLIM, Section 2.1:

Note that any binary or categorical dataset can be trivially converted into a transaction database
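
As a rough illustration of the conversion this quote refers to, here is a minimal pure-Python sketch. The helper name `matrix_to_transactions` and the sample data are assumptions made for this example; they are not part of skmine.

```python
def matrix_to_transactions(matrix, labels):
    """Turn each row of a 0/1 matrix into the list of labels whose cell is 1."""
    return [
        [label for label, cell in zip(labels, row) if cell]
        for row in matrix
    ]


# Illustrative data: one column per item, one row per transaction.
labels = ["bananas", "butter", "cookies", "milk", "tea"]
matrix = [
    [1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
]

transactions = matrix_to_transactions(matrix, labels)
print(transactions)
# [['bananas', 'milk'], ['bananas', 'cookies', 'milk'], ['tea']]
```

The binarization discussed in this issue is the reverse direction: from lists of items to such a matrix.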

Proposed solution

To bridge this gap, we should show an example that works with a transactional dataset.
Expectations:

  • we should be able to transform a standard dataset (e.g. chess) into a binary matrix
  • inverse transforming is unlikely to be very useful, considering the use cases

The most straightforward solution seems to be sklearn.preprocessing.MultiLabelBinarizer, used this way:

>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> from skmine.datasets.fimi import fetch_chess
>>> D = fetch_chess()
>>> mlb = MultiLabelBinarizer()
>>> X = mlb.fit_transform(D)
>>> X
array([[1, 0, 1, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 1, ..., 0, 0, 1],
       [0, 0, 1, ..., 0, 1, 1],
       [0, 0, 1, ..., 0, 1, 1]])

Describe alternatives you've considered, if relevant

If scikit-learn does not fit our purpose, we can still implement our own transformer.
The preferred solution, however, would be to make a PR to scikit-learn.

Additional context

Note that MultiLabelBinarizer is only suitable for plain itemsets; it does not work out of the box on e.g. sequential itemsets.
It can, however, be tweaked for this purpose, e.g.:

>>> import pandas as pd
>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> s = pd.Series([[2, 3, 4, 2], [0, 2]])  # 2 is present at two different positions in the first transaction
>>> mlb = MultiLabelBinarizer()
>>> mlb.fit_transform(s.map(enumerate))  # each item becomes a (position, item) pair; inverse_transform will fail to reconstruct the original input
array([[0, 1, 0, 1, 1, 1],
       [1, 0, 1, 0, 0, 0]])
>>> mlb.classes_
array([(0, 0), (0, 2), (1, 2), (1, 3), (2, 4), (3, 2)], dtype=object)
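
Since the (position, item) classes keep the ordering information, the sequences can still be rebuilt by hand. The sketch below is pure Python and assumes the `classes_` and matrix values shown in the output above; `decode_sequences` is a hypothetical helper, not an sklearn or skmine function.

```python
def decode_sequences(rows, classes):
    """Rebuild ordered sequences from a binary matrix over (position, item) classes."""
    sequences = []
    for row in rows:
        # Keep the (position, item) pairs flagged in this row, ordered by position.
        pairs = sorted(cls for cls, cell in zip(classes, row) if cell)
        sequences.append([item for _, item in pairs])
    return sequences


# Values copied from the binarizer output above.
classes = [(0, 0), (0, 2), (1, 2), (1, 3), (2, 4), (3, 2)]
rows = [
    [0, 1, 0, 1, 1, 1],
    [1, 0, 1, 0, 0, 0],
]

decoded = decode_sequences(rows, classes)
print(decoded)  # [[2, 3, 4, 2], [0, 2]]
```

This relies on every position appearing at most once per row, which holds when the pairs come from `enumerate`.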
@remiadon remiadon added documentation Improvements or additions to documentation enhancement New feature or request help wanted Extra attention is needed labels Apr 22, 2020
@remiadon remiadon changed the title Example with Binary matrix Binary matrix as input Apr 30, 2020
@remiadon
Collaborator Author

remiadon commented May 14, 2020

UPDATE

Considering only basic transactional datasets, MultiLabelBinarizer gives access to a .classes_ attribute. The problem is that it is an attribute of the object; it does not travel with the data. sklearn's MultiLabelBinarizer produces an np.array or a scipy.sparse matrix, neither of which carries labels.

In the context of pattern mining we absolutely need the labels: these are our beloved symbols.

>>> import pandas as pd
>>> from skmine.preprocessing import MultiLabelBinarizer
>>> mb = MultiLabelBinarizer(sparse_output=True)
>>> D = pd.Series([  # SLIM takes a pd.Series as input
...     ['bananas', 'milk'],
...     ['milk', 'bananas', 'cookies'],
...     ['cookies', 'butter', 'tea'],
...     ['tea'],
...     ['milk', 'bananas', 'tea'],
... ])
>>> tab_D = mb.fit_transform(D)  # scipy sparse matrix
>>> tab_D[:2].todense()
matrix([[1, 0, 0, 1, 0],
        [1, 0, 1, 1, 0]])
>>> mb.classes_
array(['bananas', 'butter', 'cookies', 'milk', 'tea'], dtype=object)

Note that pandas can build a DataFrame from sparse matrices, as introduced in version 0.25

>>> tab_D = pd.DataFrame.sparse.from_spmatrix(tab_D, columns=mb.classes_)
>>> tab_D
   bananas  butter  cookies  milk  tea
0        1       0        0     1    0
1        1       0        1     1    0
2        0       1        1     0    1
3        0       0        0     0    1
4        1       0        0     1    1

>>> tab_D.dtypes
bananas    Sparse[int64, 0]
butter     Sparse[int64, 0]
cookies    Sparse[int64, 0]
milk       Sparse[int64, 0]
tea        Sparse[int64, 0]
dtype: object

My solution:

  • reimplement sklearn's MultiLabelBinarizer so that its transform method returns a pandas.DataFrame with explicit columns (our symbols)
  • propose this to sklearn as a feature request. If accepted, we will use the sklearn implementation to avoid duplicating work
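
To make the first point concrete, here is a minimal sketch of a binarizer whose output keeps the symbols as column names. It is not the implementation that ended up in skmine: the class name `FrameBinarizer` is hypothetical, it depends only on pandas, and it returns a dense DataFrame rather than the sparse one shown earlier.

```python
import pandas as pd


class FrameBinarizer:
    """Hypothetical minimal binarizer whose output keeps item labels as columns."""

    def fit(self, D):
        # Sorted unique items, mirroring the ordering of `classes_` in sklearn.
        self.classes_ = sorted({item for transaction in D for item in transaction})
        return self

    def transform(self, D):
        rows = []
        for transaction in D:
            present = set(transaction)
            # Items not seen during fit are silently ignored in this sketch.
            rows.append([int(item in present) for item in self.classes_])
        # Reuse the index of the input when it is a pandas Series.
        index = D.index if isinstance(D, pd.Series) else None
        return pd.DataFrame(rows, index=index, columns=self.classes_)

    def fit_transform(self, D):
        return self.fit(D).transform(D)


D = pd.Series([
    ["bananas", "milk"],
    ["milk", "bananas", "cookies"],
    ["cookies", "butter", "tea"],
    ["tea"],
    ["milk", "bananas", "tea"],
])
tab_D = FrameBinarizer().fit_transform(D)
print(tab_D.columns.tolist())  # ['bananas', 'butter', 'cookies', 'milk', 'tea']
print(tab_D.loc[1].tolist())   # [1, 0, 1, 1, 0]
```

Because the labels are the column names, they follow the data into any downstream step that accepts a DataFrame.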

@remiadon remiadon added this to To do in sprint 2 May 14, 2020
@remiadon remiadon moved this from To do to Done in sprint 2 May 19, 2020
@remiadon
Collaborator Author

see
e4b467b
