## Building a symbolic classifier

MDL-based algorithms encode data according to a given codetable.

When calling ``.fit``, we iteratively look for the codetable that compresses
the training data best.

**Once the model is trained, we can use the refined codetable
to make predictions.**

In [1]:
import pandas as pd
from skmine.itemsets import SLIM
from skmine.preprocessing import TransactionEncoder

In [2]:
transactions = [ 
     ['bananas', 'milk'], 
     ['milk', 'bananas', 'cookies'], 
     ['cookies', 'butter', 'tea'], 
     ['tea'],  
     ['milk', 'bananas', 'tea'], 
]
te = TransactionEncoder()
D = te.fit_transform(transactions)
D 

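|   | bananas | butter | cookies | milk | tea |
|---|---------|--------|---------|------|-----|
| 0 | 1       | 0      | 0       | 1    | 0   |
| 1 | 1       | 0      | 1       | 1    | 0   |
| 2 | 0       | 1      | 1       | 0    | 1   |
| 3 | 0       | 0      | 0       | 0    | 1   |
| 4 | 1       | 0      | 0       | 1    | 1   |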


In [3]:
slim = SLIM()
slim.fit(D)

(bananas, milk)    [0, 1, 4]
(tea)              [2, 3, 4]
(cookies)             [1, 2]
(butter)                 [2]
dtype: object

We keep this **codetable** in mind, as we will later use it **to interpret our predictions**.
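To make the codetable more concrete, here is a minimal sketch of how MDL-style code lengths can be derived from it. It assumes the usual Shannon-optimal coding, where an itemset's code length is `-log2(usage / total usage)`, and it copies the itemsets and transaction ids from the output above. The variable names are illustrative and this is not necessarily how `SLIM` stores or computes them internally.

```python
from math import log2

# Codetable as printed by slim.fit(D): itemset -> ids of the transactions it covers
codetable = {
    ("bananas", "milk"): [0, 1, 4],
    ("tea",): [2, 3, 4],
    ("cookies",): [1, 2],
    ("butter",): [2],
}

# Usage of an itemset = number of transactions it helps to encode
usage = {itemset: len(tids) for itemset, tids in codetable.items()}
total_usage = sum(usage.values())

# Assumed Shannon-optimal code length, in bits:
# the more often an itemset is used, the shorter its code
code_length = {
    itemset: -log2(count / total_usage) for itemset, count in usage.items()
}

for itemset, bits in code_length.items():
    print(itemset, f"usage={usage[itemset]}", f"bits={bits:.3f}")
```

Under this assumption, `(bananas, milk)` and `(tea)` get the shortest codes, and `(butter)`, used only once, gets the longest.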

----------
We define a new transactional dataset, with some unseen items inside,
and call the ``predict_proba`` function. For each transaction, this computes the probability
that it belongs to the data described by the current codetable (Shannon entropy).

In [4]:
new_transactions = [ 
   ['bananas', 'milk'], 
   ['milk', 'sirup', 'cookies'], 
   ['butter', 'tea'], 
   ['tea'],  
   ['milk', 'bananas', 'tea'], 
]
new_D = te.transform(new_transactions)
new_D



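|   | bananas | butter | cookies | milk | tea |
|---|---------|--------|---------|------|-----|
| 0 | 1       | 0      | 0       | 1    | 0   |
| 1 | 0       | 0      | 1       | 1    | 0   |
| 2 | 0       | 1      | 0       | 0    | 1   |
| 3 | 0       | 0      | 0       | 0    | 1   |
| 4 | 1       | 0      | 0       | 1    | 1   |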


In [5]:
slim.predict_proba(new_D)

0    0.333333
1    0.222222
2    0.444444
3    0.333333
4    0.666667
dtype: float32
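The scores above can be reproduced by hand from the codetable. The sketch below is one reading that happens to match the printed values: each transaction is covered greedily with codetable itemsets, and the relative usages of the itemsets used are summed. The `cover` and `score` helpers are hypothetical names written for this illustration, and the actual `predict_proba` implementation may differ.

```python
# Codetable from the fit above, in the order it was printed
codetable = {
    ("bananas", "milk"): [0, 1, 4],
    ("tea",): [2, 3, 4],
    ("cookies",): [1, 2],
    ("butter",): [2],
}
usage = {itemset: len(tids) for itemset, tids in codetable.items()}
total_usage = sum(usage.values())


def cover(transaction):
    """Greedily pick codetable itemsets fully contained in the remaining items."""
    remaining = set(transaction)
    used = []
    for itemset in codetable:
        if set(itemset) <= remaining:
            used.append(itemset)
            remaining -= set(itemset)
    return used


def score(transaction):
    """Sum of relative usages of the covering itemsets (assumed scoring rule)."""
    return sum(usage[itemset] for itemset in cover(transaction)) / total_usage


# Same rows as new_D: 'sirup' is unknown to the encoder, so it is absent here
new_transactions = [
    ["bananas", "milk"],
    ["milk", "cookies"],
    ["butter", "tea"],
    ["tea"],
    ["milk", "bananas", "tea"],
]

scores = [round(score(t), 6) for t in new_transactions]
print(scores)  # [0.333333, 0.222222, 0.444444, 0.333333, 0.666667]
```

Note that `milk` alone contributes nothing in entry 1, because the codetable only contains it together with `bananas`.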

---------------
### Built-in interpretations

* Entry 1 has the lowest probability of belonging to the training data.
  It contains `milk`, `sirup` and `cookies`. From the codetable we see that `milk` and `cookies` are not grouped together, while `sirup` has never been seen (we even get a warning from the preprocessing module).

* Entry 4 has the highest probability. It contains `bananas` and `milk`, which are grouped together in the codetable and occur frequently in the data.
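The same reasoning can be spelled out in code. The following sketch uses a hypothetical `explain` helper (not part of scikit-mine) that reports, for a raw transaction, which codetable itemsets cover it, which known items are left uncovered, and which items were never seen during training. It assumes a simple greedy cover in the order the codetable was printed.

```python
# Itemsets of the fitted codetable, in printed order
codetable = [("bananas", "milk"), ("tea",), ("cookies",), ("butter",)]

# Items seen by the encoder during training (the columns of D)
known_items = {"bananas", "butter", "cookies", "milk", "tea"}


def explain(transaction):
    """Split a raw transaction into covered, uncovered and unseen parts."""
    items = set(transaction)
    unseen = items - known_items      # items the encoder has never seen
    remaining = items & known_items   # items that can potentially be covered
    covered_by = []
    for itemset in codetable:
        if set(itemset) <= remaining:
            covered_by.append(itemset)
            remaining -= set(itemset)
    return {
        "covered_by": covered_by,
        "uncovered": sorted(remaining),
        "unseen": sorted(unseen),
    }


entry_1 = explain(["milk", "sirup", "cookies"])
entry_4 = explain(["milk", "bananas", "tea"])
print(entry_1)
print(entry_4)
```

For entry 1, only `(cookies)` applies, `milk` stays uncovered and `sirup` is reported as unseen. For entry 4, the two most used itemsets cover every item.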