Binary matrix as input #8

remiadon · 2020-04-22T13:23:21Z

Workflow

For many people in the machine learning community, representing transactional datasets the way we do in skmine is unusual.
Moreover, a common workflow we will encounter at some point consists of using the output of our Transformers as input to sklearn.

In pattern mining, researchers are already familiar with matrix representations of transactional databases. Quoting Vreeken et al. from SLIM, Section 2.1:

Note that any binary or categorical dataset can be trivially converted into a transaction database
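The conversion mentioned in the quote can be sketched as follows. This is a minimal illustration, not part of skmine: the helper name `binary_matrix_to_transactions` and its `items` parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def binary_matrix_to_transactions(X, items=None):
    """Turn a binary matrix into a pd.Series of transactions (lists of items).

    `items` gives one symbol per column; column indices are used by default.
    """
    X = np.asarray(X)
    if items is None:
        items = np.arange(X.shape[1])
    items = np.asarray(items)
    # For each row, keep the symbols whose column holds a non-zero value.
    return pd.Series([items[row.astype(bool)].tolist() for row in X])


X = [[1, 0, 1],
     [0, 1, 1]]
D = binary_matrix_to_transactions(X, items=["a", "b", "c"])
```

Each row becomes the list of symbols it contains, which is the format skmine works with.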

Proposed solution

To bridge this gap, we should show an example that works with a transactional dataset.
We should be able to

transform a standard dataset (e.g. chess) into a binary matrix
the inverse transform should not be very useful, considering the use cases

The most straightforward solution seems to be sklearn.preprocessing.MultiLabelBinarizer, used like this:

>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> from skmine.datasets.fimi import fetch_chess
>>> D = fetch_chess()
>>> mlb = MultiLabelBinarizer()
>>> X = mlb.fit_transform(D)
>>> X
array([[1, 0, 1, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 1, ..., 0, 0, 1],
       [0, 0, 1, ..., 0, 1, 1],
       [0, 0, 1, ..., 0, 1, 1]])
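Once binarized, the matrix can be handed to a scikit-learn estimator, which is the workflow described above. The following is a minimal sketch on toy transactions rather than chess; the labels `y` and the choice of `BernoulliNB` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import MultiLabelBinarizer

# Toy transactions standing in for a real dataset such as chess.
transactions = [
    ["bananas", "milk"],
    ["milk", "bananas", "cookies"],
    ["cookies", "butter", "tea"],
    ["tea"],
]
# Hypothetical labels, invented for this illustration only.
y = np.array([0, 0, 1, 1])

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(transactions)  # dense binary matrix, one column per item

clf = BernoulliNB().fit(X, y)

# New transactions must go through the already fitted binarizer.
X_new = mlb.transform([["milk", "cookies"]])
pred = clf.predict(X_new)
```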

Describe alternatives you've considered, if relevant

If scikit-learn does not fit our purpose, we can still implement our own transformer.
But the preferred solution would be to open a PR against scikit-learn.

Additional context

Note that the MultiLabelBinarizer is only suitable for plain itemsets; it does not work out of the box on e.g. sequential itemsets.
It can, however, be tweaked for this purpose, e.g.:

>>> import pandas as pd
>>> s = pd.Series([[2, 3, 4, 2], [0, 2]])  # 2 is present at two different positions in the first transaction
>>> mlb = MultiLabelBinarizer()
>>> # each item becomes a (position, item) pair; inverse_transform will fail to reconstruct the original input
>>> mlb.fit_transform(s.map(enumerate))
array([[0, 1, 0, 1, 1, 1],
       [1, 0, 1, 0, 0, 0]])
>>> mlb.classes_
array([(0, 0), (0, 2), (1, 2), (1, 3), (2, 4), (3, 2)], dtype=object)

remiadon · 2020-05-14T09:52:35Z

UPDATE

Considering only basic transactional datasets, MultiLabelBinarizer gives access to a .classes_ attribute. The problem is that it is an attribute of the object and does not travel with the data: sklearn's MultiLabelBinarizer produces np.array or scipy.sparse matrices, which carry no labels.

In the context of pattern mining we absolutely need the labels: these are our beloved symbols.

>>> import pandas as pd
>>> from skmine.preprocessing import MultiLabelBinarizer
>>> mb = MultiLabelBinarizer(sparse_output=True)
>>> D = pd.Series([  # SLIM takes a pd.Series as input
...     ['bananas', 'milk'],
...     ['milk', 'bananas', 'cookies'],
...     ['cookies', 'butter', 'tea'],
...     ['tea'],
...     ['milk', 'bananas', 'tea'],
... ])
>>> tab_D = mb.fit_transform(D)  # scipy sparse matrix
>>> tab_D[:2].todense()
matrix([[1, 0, 0, 1, 0],
        [1, 0, 1, 1, 0]])
>>> mb.classes_
array(['bananas', 'butter', 'cookies', 'milk', 'tea'], dtype=object)

Note that pandas can build a DataFrame from a scipy sparse matrix, a feature introduced in version 0.25:

>>> tab_D = pd.DataFrame.sparse.from_spmatrix(tab_D, columns=mb.classes_)
>>> tab_D
   bananas  butter  cookies  milk  tea
0        1       0        0     1    0
1        1       0        1     1    0
2        0       1        1     0    1
3        0       0        0     0    1
4        1       0        0     1    1

>>> tab_D.dtypes
bananas    Sparse[int64, 0]
butter     Sparse[int64, 0]
cookies    Sparse[int64, 0]
milk       Sparse[int64, 0]
tea        Sparse[int64, 0]
dtype: object

My solution:

reimplement sklearn's MultiLabelBinarizer so that its transform function returns a pandas.DataFrame with explicit columns (our symbols)
propose this to sklearn as a feature request. If accepted, we will use the sklearn implementation so that we don't implement things twice
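A minimal sketch of the first point, assuming a thin wrapper around sklearn's MultiLabelBinarizer rather than a full reimplementation. The function name `transactions_to_frame` is hypothetical and is not the code from the referenced commit.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer


def transactions_to_frame(D, sparse_output=False):
    """Binarize transactions and keep the symbols as DataFrame columns."""
    mlb = MultiLabelBinarizer(sparse_output=sparse_output)
    X = mlb.fit_transform(D)
    # Preserve the index when the input is a pd.Series.
    index = getattr(D, "index", None)
    if sparse_output:
        # X is a scipy sparse matrix; pandas >= 0.25 can wrap it directly.
        return pd.DataFrame.sparse.from_spmatrix(X, index=index, columns=mlb.classes_)
    return pd.DataFrame(X, index=index, columns=mlb.classes_)


D = pd.Series([
    ['bananas', 'milk'],
    ['milk', 'bananas', 'cookies'],
    ['tea'],
])
tab_D = transactions_to_frame(D)
```

With this shape, the symbols travel with the data as column labels instead of living only on the binarizer object.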

remiadon · 2020-05-19T06:57:27Z

see
e4b467b

remiadon added the documentation, enhancement, and help wanted labels Apr 22, 2020

remiadon changed the title ~~Example with Binary matrix~~ Binary matrix as input Apr 30, 2020

remiadon mentioned this issue May 14, 2020

first performance improvements for SLIM #32

Closed

remiadon added this to To do in sprint 2 May 14, 2020

remiadon moved this from To do to Done in sprint 2 May 19, 2020

remiadon closed this as completed May 19, 2020
