## Building a symbolic classifier

MDL-based algorithms encode data according to a given codetable.

When calling ``.fit``, we iteratively look for the codetable that compresses
the training data best.

**Once training is done, we can use the refined codetable to make predictions.**

In [1]:
import pandas as pd
from skmine.itemsets import SLIM
from sklearn.preprocessing import MultiLabelBinarizer

In [2]:
class TransactionEncoder(MultiLabelBinarizer):  # pandas DataFrames are easier to read ;)
    def transform(self, X):
        _X = super().transform(X)
        return pd.DataFrame(data=_X, columns=self.classes_)

In [3]:
transactions = [ 
     ['bananas', 'milk'], 
     ['milk', 'bananas', 'cookies'], 
     ['cookies', 'butter', 'tea'], 
     ['tea'],  
     ['milk', 'bananas', 'tea'], 
]
te = TransactionEncoder()
D = te.fit(transactions).transform(transactions)
D 

,bananas,butter,cookies,milk,tea
0,1,0,0,1,0
1,1,0,1,1,0
2,0,1,1,0,1
3,0,0,0,0,1
4,1,0,0,1,1


In [4]:
slim = SLIM()
slim.fit(D).discover()

,itemset,usage
0,"[bananas, milk]",3
1,[tea],3
2,[cookies],2
3,[butter],1


We keep this **codetable** in mind, as we will use it later **to interpret our predictions**.

----------
### First "predictions" 

We define a new transactional dataset and call ``decision_function`` on it. This yields ``distances`` w.r.t. the encoding scheme provided by our codetable. The values are negative: the closer a value is to zero, the better the codetable compresses the transaction.

In [5]:
new_transactions = [ 
   ['bananas', 'milk'], 
   ['milk', 'sirup', 'cookies'], 
   ['butter', 'tea'], 
   ['tea'],  
   ['milk', 'bananas', 'tea'], 
]
new_D = te.transform(new_transactions)
new_D



,bananas,butter,cookies,milk,tea
0,1,0,0,1,0
1,0,0,1,1,0
2,0,1,0,0,1
3,0,0,0,0,1
4,1,0,0,1,1


In [6]:
codes = slim.decision_function(new_D)

In [7]:
pd.DataFrame([pd.Series(new_transactions), codes], index=['transaction', 'distance']).T

,transaction,distance
0,"[bananas, milk]",-1.906891
1,"[milk, sirup, cookies]",-6.228819
2,"[butter, tea]",-4.813781
3,[tea],-1.906891
4,"[milk, bananas, tea]",-3.813781


---------------
### Built-in interpretations
Now we can interpret the codes for the new data directly, by **looking at the codetable inferred from the training data**.

First observations:

* Entry 1 has the largest distance (in absolute value) w.r.t. the encoding scheme.
  It contains `milk`, `sirup` and `cookies`. From the codetable we see that `milk` and `cookies` are not grouped together, while `sirup` was never seen during training.
  

* Entries 0 and 3 have the smallest distance. Entry 0 contains `bananas` and `milk`, which are grouped together in the codetable and occur frequently in the training data. Entry 3 contains only `tea`, which is used just as often.
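
These observations can be checked by hand. The sketch below is not SLIM's implementation: it is a minimal pure-Python reconstruction resting on two assumptions, namely that the unused singletons `bananas` and `milk` stay in the codetable with usage 0, and that code lengths use add-one smoothing. The names `CODETABLE`, `code_length` and `score` are hypothetical helpers introduced for illustration.

```python
from math import log2

# Codetable learned above: itemset -> usage, in cover order.
# Assumption: unused singletons (usage 0) stay in the table, hidden from the display.
CODETABLE = {
    frozenset({"bananas", "milk"}): 3,
    frozenset({"tea"}): 3,
    frozenset({"cookies"}): 2,
    frozenset({"butter"}): 1,
    frozenset({"bananas"}): 0,
    frozenset({"milk"}): 0,
}

# Assumption: add-one smoothing, so unused itemsets still get a finite code.
TOTAL = sum(CODETABLE.values()) + len(CODETABLE)
KNOWN_ITEMS = set().union(*CODETABLE)


def code_length(itemset):
    """Code length, in bits, of one codetable entry."""
    return -log2((CODETABLE[itemset] + 1) / TOTAL)


def score(transaction):
    """Negated number of bits needed to cover a transaction."""
    # Items never seen in training (e.g. `sirup`) are dropped, as the encoder does.
    remaining = set(transaction) & KNOWN_ITEMS
    bits = 0.0
    for itemset in CODETABLE:  # greedy cover, in table order
        if itemset <= remaining:
            bits += code_length(itemset)
            remaining -= itemset
    return -bits


new_transactions = [
    ["bananas", "milk"],
    ["milk", "sirup", "cookies"],
    ["butter", "tea"],
    ["tea"],
    ["milk", "bananas", "tea"],
]
for transaction in new_transactions:
    print(transaction, round(score(transaction), 6))
```

Under these assumptions the sketch gives the same values as the `distance` column above. Entry 1 is expensive because `milk` has to be encoded with its rarely used singleton code, and `[bananas, milk]` is cheap because a single frequently used code covers both items.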

--------------
### Shortest code wins !!
Next, we use an ensemble of SLIM encoding schemes, combined in a ``OneVsRest`` scheme, to perform **multi-class classification**.
The methodology is simple:

1. We clone our base estimator as many times as needed (one per class)
2. We fit every estimator on the entries corresponding to its class in the input data
3. When calling ``.predict``, we actually call ``.decision_function`` and get distances from our decision boundaries for every class
4. The shortest code wins: for a given transaction, we choose the class with the smallest distance in absolute value, i.e. the score closest to zero
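
The four steps above can be sketched without any library. The block below deliberately swaps SLIM for a much simpler toy model, the hypothetical `SingletonCoder`, which codes every item on its own with add-one smoothing and mines no itemsets. It only illustrates the "one compressor per class, shortest code wins" logic, not what `skmine` does internally.

```python
from collections import Counter
from math import log2


class SingletonCoder:
    """Toy stand-in for SLIM (hypothetical): every item is coded on its own."""

    def fit(self, transactions, vocabulary):
        counts = Counter(item for t in transactions for item in t)
        total = sum(counts.values()) + len(vocabulary)  # add-one smoothing
        self.lengths_ = {i: -log2((counts[i] + 1) / total) for i in vocabulary}
        return self

    def decision_function(self, transactions):
        # Negated code length: the closer to zero, the better compressed.
        return [
            -sum(self.lengths_[i] for i in t if i in self.lengths_)
            for t in transactions
        ]


def fit_one_per_class(transactions, y):
    """Steps 1 and 2: one model per class, fitted on that class's entries only."""
    vocabulary = sorted({item for t in transactions for item in t})
    models = {}
    for label in sorted(set(y)):
        subset = [t for t, target in zip(transactions, y) if target == label]
        models[label] = SingletonCoder().fit(subset, vocabulary)
    return models


def predict(models, transactions):
    """Steps 3 and 4: score with every model, keep the class with the shortest code."""
    scores = {label: m.decision_function(transactions) for label, m in models.items()}
    return [
        max(scores, key=lambda label: scores[label][k])
        for k in range(len(transactions))
    ]


toy_transactions = [
    ["milk", "bananas"],
    ["milk", "tea"],
    ["croissant", "tea"],
    ["croissant", "chocolatine", "milk"],
]
toy_y = ["foodstore", "foodstore", "bakery", "bakery"]

models = fit_one_per_class(toy_transactions, toy_y)
print(predict(models, toy_transactions))
```

Because the scores are negated code lengths, taking the maximum score is the same as taking the shortest code.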

In [8]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

In [9]:
pipe = Pipeline([
    ('transaction_encoder', TransactionEncoder(sparse_output=False)),
    ('slim', SLIM()),
])

In [10]:
transactions = [
    ['milk', 'bananas'],
    ['tea', 'New York Times', 'El Pais'],
    ['New York Times'],
    ['El Pais', 'The Economist'],
    ['milk', 'tea'],
    ['croissant', 'tea'],
    ['croissant', 'chocolatine', 'milk'],
]
y = [
    'foodstore', 
    'newspaper', 
    'newspaper', 
    'newspaper', 
    'foodstore',
    'bakery',
    'bakery',
]

In [11]:
te = TransactionEncoder()
D = te.fit(transactions).transform(transactions)
D

,El Pais,New York Times,The Economist,bananas,chocolatine,croissant,milk,tea
0,0,0,0,1,0,0,1,0
1,1,1,0,0,0,0,0,1
2,0,1,0,0,0,0,0,0
3,1,0,1,0,0,0,0,0
4,0,0,0,0,0,0,1,1
5,0,0,0,0,0,1,0,1
6,0,0,0,0,1,1,1,0


In [12]:
ovr = OneVsRestClassifier(SLIM())

In [13]:
ovr.fit(D, y=y)
ovr.estimators_

[SLIM(), SLIM(), SLIM()]

In [14]:
pd.DataFrame(ovr.decision_function(D), columns=ovr.classes_)

,bakery,foodstore,newspaper
0,-7.169925,-4.584963,-7.61471
1,-10.754888,-9.754888,-7.25214
2,-3.584963,-3.584963,-2.222392
3,-7.169925,-7.169925,-5.029747
4,-7.169925,-4.584963,-6.61471
5,-2.584963,-6.169925,-6.61471
6,-2.584963,-9.169926,-11.422065


In [15]:
ovr.predict(D)

array(['foodstore', 'newspaper', 'newspaper', 'newspaper', 'foodstore',
       'bakery', 'bakery'], dtype='<U9')
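
As a sanity check, the predicted labels can be recovered from the score table alone. The snippet below copies the values printed by `decision_function` and, for each row, picks the class whose score is closest to zero. It is only a sketch of the decision rule, written in plain Python so that it runs without the fitted model.

```python
classes = ["bakery", "foodstore", "newspaper"]

# Scores copied from the decision_function output above (one row per transaction).
scores = [
    [-7.169925, -4.584963, -7.61471],
    [-10.754888, -9.754888, -7.25214],
    [-3.584963, -3.584963, -2.222392],
    [-7.169925, -7.169925, -5.029747],
    [-7.169925, -4.584963, -6.61471],
    [-2.584963, -6.169925, -6.61471],
    [-2.584963, -9.169926, -11.422065],
]

# All scores are negative, so the largest one is the shortest code.
predicted = [classes[row.index(max(row))] for row in scores]
print(predicted)
```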