## Building a symbolic classifier

In [1]:
import skmine

print("This tutorial was tested with the following version of skmine :", skmine.__version__)

This tutorial was tested with the following version of skmine : 0.0.9


MDL-based algorithms encode data according to a given codetable.

When calling ``.fit``, we iteratively look for the codetable that compresses
the training data best.

**Once the model is trained, we can use the refined codetable
to make predictions.**

### SLIM Classifier for k>=2 classes

scikit-mine provides an **integrated classifier** that can **solve binary and multiclass problems**. It uses the SLIM compression algorithm.

To use it, we need a **discretized dataset**. Let's take the **discretized iris dataset** as an example.

In [2]:
from skmine.datasets.fimi import fetch_iris
X, y = fetch_iris(return_y=True)  # without return_y=True, the method would have returned the whole dataset in one variable
print("X shape:", X.shape)
print("y shape:", y.shape)
X.head()

X shape: (150,)
y shape: (150,)


0     [2, 9, 12, 15]
1    [1, 10, 11, 14]
2    [5, 10, 13, 16]
3     [2, 6, 12, 15]
4     [1, 8, 11, 14]
Name: iris.D19.N150.C3.num, dtype: object

Note that in the discretized iris dataset, each feature is discretized **with its own set of labels**:

In [3]:
import numpy as np
print("Labels in column 0 : ", np.unique([X[k][0] for k in range(len(X))]))
print("Labels in column 1 : ", np.unique([X[k][1] for k in range(len(X))]))
print("Labels in column 2 : ", np.unique([X[k][2] for k in range(len(X))]))
print("Labels in column 3 : ", np.unique([X[k][3] for k in range(len(X))]))


Labels in column 0 :  [1 2 3 4 5]
Labels in column 1 :  [ 6  7  8  9 10]
Labels in column 2 :  [11 12 13]
Labels in column 3 :  [14 15 16]


There are **3 classes** in the Iris dataset:

In [4]:
np.unique(y)

array([17, 18, 19])

The goal is to **predict the class label (the last column of the original database) from the 4 other features**. The possible targets are 17, 18 and 19. We can now prepare our train and test sets.

In [5]:
from sklearn.model_selection import train_test_split

(X_train, X_test, y_train, y_test) = train_test_split(X, y, random_state=1, test_size=0.2, shuffle=True)
print("X_train shape:", X_train.shape, "y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape, "y_test shape:", y_test.shape)

X_train shape: (120,) y_train shape: (120,)
X_test shape: (30,) y_test shape: (30,)


Now we can use our **SlimClassifier**.

In [6]:
from skmine.itemsets.slim_classifier import SlimClassifier

# You can pass the full set of items to the classifier.
# This improves its performance, especially on small datasets like iris.
items = set(item for transaction in X for item in transaction)
print("items", items)
clf = SlimClassifier(items=items)  # The `pruning` parameter enables or disables pruning in the SLIM compressors
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

items {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}


0.8333333333333334

Many scikit-learn utilities that work with classifiers can be used here, for example a **confusion matrix**, **GridSearchCV** or **cross validation**.

- **Confusion matrix**

In [7]:
from sklearn.metrics import confusion_matrix

y_pred = clf.predict(X_test)
confusion_matrix(y_test, y_pred)

array([[13,  1,  0],
       [ 0,  8,  1],
       [ 0,  3,  4]])
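
The confusion matrix and the score computed earlier tell the same story. As a small sanity check, the sketch below recomputes the accuracy and the per-class recall by hand from the matrix printed above. It assumes scikit-learn's convention (rows are true classes, columns are predicted classes, labels sorted in increasing order); the variable names are only illustrative.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = [
    [13, 1, 0],
    [0, 8, 1],
    [0, 3, 4],
]
classes = [17, 18, 19]  # sorted class labels of the discretized iris dataset

# Accuracy = correctly classified samples (the diagonal) / all samples.
correct = sum(cm[i][i] for i in range(len(cm)))
total = sum(sum(row) for row in cm)
accuracy = correct / total

# Recall of a class = diagonal entry / number of true samples of that class.
recall = {c: cm[i][i] / sum(cm[i]) for i, c in enumerate(classes)}

print("accuracy:", accuracy)
print("recall per class:", recall)
```

The diagonal sums to 25 out of 30 test samples, which is the 0.83 returned by `clf.score`. The recall shows that most mistakes come from class 19, of which only 4 out of 7 samples are recognised.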

- **GridSearchCV** (this method allows us to test many parameters for a classifier and to retain the best combination)

In [8]:
from sklearn.model_selection import GridSearchCV

parameters = {'pruning': [False, True], 'items': [None, items]}
grid = GridSearchCV(clf, parameters)
print(grid.fit(X_train,y_train))
print(grid.best_params_)
print(grid.score(X_train, y_train))

GridSearchCV(estimator=SlimClassifier(items={1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
                                             12, 13, 14, 15, 16}),
             param_grid={'items': [None,
                                   {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
                                    14, 15, 16}],
                         'pruning': [False, True]})
{'items': {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}, 'pruning': False}
0.9833333333333333


With the best parameters found by GridSearchCV, we get an **accuracy of more than 98%**. Note that this score is computed on the training set (`grid.score(X_train, y_train)`), so it is not directly comparable to the test score of 0.83 obtained earlier. With this combination, the item list is passed as a parameter and pruning is disabled. Since pruning does not improve the compression of codetables in the SLIM algorithm on iris, it does not matter whether it is enabled or not.

To get an estimate that depends less on a single train/test split, we can use the cross validation of sklearn.
- **Cross validation**

In [9]:
from sklearn.model_selection import cross_val_score

cross_validation = cross_val_score(clf, X, y, cv=10)
print(cross_validation)
cross_validation.mean()

[0.93333333 0.93333333 0.86666667 0.93333333 0.93333333 0.93333333
 1.         1.         1.         0.93333333]


0.9466666666666667

After cross validation, we see that the **accuracy is almost 95% on average**: on the held-out folds, the right type of flower is predicted in about 95% of the cases.
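
The average hides how much the folds differ from each other. The sketch below recomputes the mean from the fold scores printed above and adds their spread. It relies on one observation: with 150 samples and `cv=10`, each test fold holds 15 samples, so every score is a multiple of 1/15. The variable names are only illustrative.

```python
from statistics import mean, pstdev

FOLD_SIZE = 15  # 150 samples split into 10 folds

# Fold accuracies printed above, written as exact fractions of 15 samples.
fold_scores = [14/15, 14/15, 13/15, 14/15, 14/15, 14/15, 1.0, 1.0, 1.0, 14/15]

avg = mean(fold_scores)
spread = pstdev(fold_scores)  # population standard deviation across folds

# Number of misclassified samples in each fold.
errors_per_fold = [round((1 - s) * FOLD_SIZE) for s in fold_scores]

print("mean accuracy:", avg)
print("standard deviation:", spread)
print("errors per fold:", errors_per_fold)
```

Over the 10 folds, 8 of the 150 samples are misclassified, and no fold misses more than 2 samples.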

### SLIM classifier from numerical dataset

#### Preprocessing

Load the standard iris dataset:

In [10]:
from sklearn.datasets import load_iris
data = load_iris()
X, y = data.data, data.target
X.shape, X[0:10]

((150, 4),
 array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1]]))

Classic standardisation 

In [11]:
from sklearn.preprocessing import StandardScaler
Xst = StandardScaler().fit_transform(X)
Xst.shape,Xst[0:10]

((150, 4),
 array([[-0.90068117,  1.01900435, -1.34022653, -1.3154443 ],
        [-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],
        [-1.38535265,  0.32841405, -1.39706395, -1.3154443 ],
        [-1.50652052,  0.09821729, -1.2833891 , -1.3154443 ],
        [-1.02184904,  1.24920112, -1.34022653, -1.3154443 ],
        [-0.53717756,  1.93979142, -1.16971425, -1.05217993],
        [-1.50652052,  0.78880759, -1.34022653, -1.18381211],
        [-1.02184904,  0.78880759, -1.2833891 , -1.3154443 ],
        [-1.74885626, -0.36217625, -1.34022653, -1.3154443 ],
        [-1.14301691,  0.09821729, -1.2833891 , -1.44707648]]))

KBins discretisation

In [12]:
from sklearn.preprocessing import KBinsDiscretizer
Xt =  KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform').fit_transform(Xst).astype(int)
Xt.shape,  Xt[:10]

((150, 4),
 array([[0, 1, 0, 0],
        [0, 1, 0, 0],
        [0, 1, 0, 0],
        [0, 1, 0, 0],
        [0, 1, 0, 0],
        [0, 2, 0, 0],
        [0, 1, 0, 0],
        [0, 1, 0, 0],
        [0, 1, 0, 0],
        [0, 1, 0, 0]]))

Note that with this discretization of the iris dataset, **each feature** is discretized with the **same labels**, which **is not what we want**:

In [13]:
import numpy as np
print("Labels in column 0 : ", np.unique(Xt[:,0]))
print("Labels in column 1 : ", np.unique(Xt[:,1]))
print("Labels in column 2 : ", np.unique(Xt[:,2]))
print("Labels in column 3 : ", np.unique(Xt[:,3]))

Labels in column 0 :  [0 1 2]
Labels in column 1 :  [0 1 2]
Labels in column 2 :  [0 1 2]
Labels in column 3 :  [0 1 2]


We must **shift** values in columns in order to **avoid identical labels between columns**.    

In [14]:

# Cumulative offsets: column k is shifted by the number of labels used in columns 0..k-1
shift_col = np.max(Xt, axis=0)
for k in range(1, len(shift_col)):
    shift_col[k] += shift_col[k-1] + 1
shift_col -= shift_col[0]

# Apply the offset of each column
for k in range(len(shift_col)):
    Xt[:, k] += shift_col[k]

print("Labels in column 0 : ", np.unique(Xt[:,0]))
print("Labels in column 1 : ", np.unique(Xt[:,1]))
print("Labels in column 2 : ", np.unique(Xt[:,2]))
print("Labels in column 3 : ", np.unique(Xt[:,3]))

import pandas as pd
Xt = pd.Series(Xt.tolist())   # we must transform the array into a Series of lists
Xt.shape, Xt[50:60]

Labels in column 0 :  [0 1 2]
Labels in column 1 :  [3 4 5]
Labels in column 2 :  [6 7 8]
Labels in column 3 :  [ 9 10 11]


((150,),
 50    [2, 4, 7, 10]
 51    [1, 4, 7, 10]
 52    [2, 4, 7, 10]
 53    [1, 3, 7, 10]
 54    [1, 3, 7, 10]
 55    [1, 3, 7, 10]
 56    [1, 4, 7, 10]
 57    [0, 3, 7, 10]
 58    [1, 4, 7, 10]
 59    [0, 3, 7, 10]
 dtype: object)
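
The shift can also be written without NumPy, which makes the idea easier to see: column k is offset by the total number of bins of the columns before it. The sketch below is a plain-Python illustration, not the code used in this tutorial. The helper name `shift_labels` and its `n_bins` argument are hypothetical. Taking the offsets from the number of bins rather than from the largest label observed is an assumption made here so that the labels stay the same even when a bin happens to be empty in the data.

```python
from itertools import accumulate


def shift_labels(rows, n_bins):
    """Make bin labels unique across columns.

    rows   : list of rows, each a list of 0-based bin indices (one per column)
    n_bins : number of bins of each column
    """
    # Column k is offset by the total number of bins of columns 0..k-1.
    offsets = [0] + list(accumulate(n_bins))[:-1]
    return [[value + offset for value, offset in zip(row, offsets)] for row in rows]


# Two rows of bin indices before shifting, with 3 bins per column.
rows = [[2, 1, 1, 1], [0, 0, 1, 1]]
shifted = shift_labels(rows, n_bins=[3, 3, 3, 3])
print(shifted)
```

With 3 bins per column the offsets are 0, 3, 6 and 9, so the two rows become `[2, 4, 7, 10]` and `[0, 3, 7, 10]`, the same kind of transactions as in the output above.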

In [15]:
np.unique(y)

array([0, 1, 2])

#### Preprocessing with pipelines

In [47]:
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np

class MultiLabelsKbins(KBinsDiscretizer):  # discretizer whose labels are distinct across columns
    def transform(self, X):
        Xt = super().transform(X).astype(int)

        # Offset column k by the number of bins of the columns before it.
        # The offsets come from the fitted `n_bins_`, not from the data being
        # transformed, so train and test data always get the same labels.
        shift_col = np.concatenate(([0], np.cumsum(self.n_bins_[:-1])))
        Xt += shift_col.astype(int)

        return pd.Series(Xt.tolist())  # one transaction (list of items) per row

preproc = Pipeline([
    ('StandardScaler', StandardScaler()),
    ('MultiLabelsKbins', MultiLabelsKbins(n_bins=3, encode='ordinal', strategy='uniform')),
])

Xt = preproc.fit(X).transform(X)
Xt.shape, Xt[50:60]

((150,),
 50    [2, 4, 7, 10]
 51    [1, 4, 7, 10]
 52    [2, 4, 7, 10]
 53    [1, 3, 7, 10]
 54    [1, 3, 7, 10]
 55    [1, 3, 7, 10]
 56    [1, 4, 7, 10]
 57    [0, 3, 7, 10]
 58    [1, 4, 7, 10]
 59    [0, 3, 7, 10]
 dtype: object)

#### Train-test dataset and SlimClassifier.

Now we can prepare our **train and test data set**.

In [48]:
from sklearn.model_selection import train_test_split

(X_train, X_test, y_train, y_test) = train_test_split(Xt, y, random_state=1, test_size=0.2, shuffle=True)
print("X_train shape:", X_train.shape, "y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape, "y_test shape:", y_test.shape)

X_train shape: (120,) y_train shape: (120,)
X_test shape: (30,) y_test shape: (30,)


And use our **SlimClassifier**.

In [49]:
from skmine.itemsets.slim_classifier import SlimClassifier

# You can pass the full set of items to the classifier.
# This improves its performance, especially on small datasets like iris.
items = set(item for transaction in Xt for item in transaction)
print("items", items)
clf = SlimClassifier(items=items)  # The `pruning` parameter enables or disables pruning in the SLIM compressors
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

items {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}


0.9333333333333333

----------

### OneVsRest classifier for k>2 classes

The **SLIM algorithm** is also compatible with **scikit-learn** meta-classifiers such as **One-vs-the-rest (OvR)** (https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html). The limitation of this approach is that it **only works for multiclass classification** problems, whereas the integrated classifier works for both binary and multiclass problems.

In [19]:
import pandas as pd
from skmine.itemsets import SLIM
from sklearn.preprocessing import MultiLabelBinarizer

In [20]:
class TransactionEncoder(MultiLabelBinarizer):  # pandas DataFrames are easier to read ;)
    def transform(self, X):
        _X = super().transform(X)
        return pd.DataFrame(data=_X, columns=self.classes_)

In [21]:
transactions = [ 
     ['bananas', 'milk'], 
     ['milk', 'bananas', 'cookies'], 
     ['cookies', 'butter', 'tea'], 
     ['tea'],  
     ['milk', 'bananas', 'tea'], 
]
te = TransactionEncoder()
D = te.fit(transactions).transform(transactions)
D 

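|   | bananas | butter | cookies | milk | tea |
|---|---------|--------|---------|------|-----|
| 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 | 1 | 0 |
| 2 | 0 | 1 | 1 | 0 | 1 |
| 3 | 0 | 0 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 | 1 | 1 |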


In [22]:
slim = SLIM()
slim.fit(D).transform(D)

|   | itemset | usage |
|---|---------|-------|
| 0 | [bananas, milk] | 3 |
| 1 | [tea] | 3 |
| 2 | [cookies] | 2 |
| 3 | [butter] | 1 |


We keep this **codetable** in mind, as we will later use it **to interpret our predictions**.
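
To make the role of the codetable more concrete, here is a minimal sketch of how a transaction could be encoded with it. It rests on two assumptions that do not come from this tutorial: the codetable is scanned in the order shown above and each itemset is used greedily when it fits, and the code length of an itemset is the Shannon-optimal length derived from its usage, without any smoothing. scikit-mine may differ in both respects, so the numbers below need not match the scores it returns. The names `cover` and `code_length` are hypothetical.

```python
from math import log2

# Codetable shown above: (itemset, usage) pairs, in the displayed order.
codetable = [
    (frozenset({"bananas", "milk"}), 3),
    (frozenset({"tea"}), 3),
    (frozenset({"cookies"}), 2),
    (frozenset({"butter"}), 1),
]

total_usage = sum(usage for _, usage in codetable)


def code_length(usage):
    """Code length in bits of an itemset (assumption: no smoothing)."""
    return -log2(usage / total_usage)


def cover(transaction):
    """Greedily cover a transaction with non-overlapping codetable itemsets."""
    remaining = set(transaction)
    used = []
    for itemset, usage in codetable:
        if itemset <= remaining:  # the itemset fits in what is left to cover
            used.append((itemset, usage))
            remaining -= itemset
    return used, remaining  # `remaining` holds items the codetable cannot encode


used, uncovered = cover(["milk", "bananas", "tea"])
encoded_bits = sum(code_length(usage) for _, usage in used)

print("itemsets used:", [sorted(itemset) for itemset, _ in used])
print("uncovered items:", uncovered)
print("encoded length in bits:", encoded_bits)
```

Under these assumptions, the transaction is covered by `[bananas, milk]` and `[tea]`. Frequently used itemsets get short codes, so a transaction made of patterns seen often in training compresses well.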

----------
#### First "predictions" 

We define a new transactional dataset and call our ``decision_function`` on it. This yields a decreasing exponential function ***Exp_neg*** of the ``distances`` ($d$) w.r.t. the encoding scheme provided by our codetable:
$$
\mathrm{Exp\_neg}(d) = \exp(-0.2 \, d)
$$

In [23]:
new_transactions = [ 
   ['bananas', 'milk'], 
   ['milk', 'sirup', 'cookies'], 
   ['butter', 'tea'], 
   ['tea'],  
   ['milk', 'bananas', 'tea'], 
]
new_D = te.transform(new_transactions)
new_D



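|   | bananas | butter | cookies | milk | tea |
|---|---------|--------|---------|------|-----|
| 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 1 | 0 |
| 2 | 0 | 1 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 | 1 | 1 |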


In [24]:
codes = slim.decision_function(new_D)
codes

0    0.682920
1    0.287721
2    0.381839
3    0.682920
4    0.466379
dtype: float32

In [25]:
pd.DataFrame([pd.Series(new_transactions), codes], index=['transaction', 'score']).T  # `codes` holds Exp_neg scores, not raw distances

|   | transaction | score |
|---|-------------|-------|
| 0 | [bananas, milk] | 0.68292 |
| 1 | [milk, sirup, cookies] | 0.287721 |
| 2 | [butter, tea] | 0.381839 |
| 3 | [tea] | 0.68292 |
| 4 | [milk, bananas, tea] | 0.466379 |


#### Built-in interpretations
Now we can interpret codes for the new data, directly by **looking at the codetable inferred from training data**

First observations

* Entry 1 has the highest distance w.r.t. the encoding scheme, hence the smallest decision-function score (*Exp_neg*).
  It contains `milk`, `sirup` and `cookies`. From the codetable we see that `milk` and `cookies` are not grouped together, while `sirup` has never been seen.


* Entries 0 and 3 have the lowest distance, hence the highest decision-function score (*Exp_neg*). Entry 0 contains `bananas` and `milk`, which are grouped together in the codetable and occur often in the training data. Entry 3 contains only `tea`, which has the same usage in the codetable.
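
Since the decision function is $\exp(-0.2\,d)$, the distances can be recovered from the scores with $d = -\ln(\text{score}) / 0.2$. The sketch below applies this inversion to the scores printed above, which makes the ranking of the entries explicit. The variable names are only illustrative.

```python
from math import log

# Scores returned by `decision_function` above, one per new transaction.
scores = [0.682920, 0.287721, 0.381839, 0.682920, 0.466379]

# Invert Exp_neg(d) = exp(-0.2 * d)  =>  d = -ln(score) / 0.2
distances = [-log(s) / 0.2 for s in scores]

for i, (s, d) in enumerate(zip(scores, distances)):
    print(f"entry {i}: score={s:.3f}  distance={d:.2f}")

# Entry with the largest distance, and entries tied for the smallest one.
farthest = max(range(len(distances)), key=distances.__getitem__)
smallest = min(distances)
closest = [i for i, d in enumerate(distances) if d == smallest]
```

The distances come out at roughly 1.9, 6.2, 4.8, 1.9 and 3.8. Entry 1 is the farthest and entries 0 and 3 are tied for the closest. The distance of entry 4 is about twice that of entry 0, which is consistent with it being covered by two codetable itemsets of equal usage, `[bananas, milk]` and `[tea]`.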

#### Shortest code wins !!
Next, we are going to use an ensemble of SLIM encoding schemes through a ``OneVsRest`` methodology, to perform **multi-class classification**.
The methodology is very simple:

1. We clone our base estimator as many times as we need (one per class)
2. We fit every estimator on entries corresponding to its class in the input data
3. When calling ``.predict``, we actually call ``.decision_function`` and get the negative exponential of distances from our decision boundaries for every class
4. The shortest code wins: we choose the class with the lowest distance (so the highest negative exponential of distances) for a given transaction

In [26]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

In [27]:
pipe = Pipeline([
    ('transaction_encoder', TransactionEncoder(sparse_output=False)),
    ('slim', SLIM()),
])

In [28]:
transactions = [
    ['milk', 'bananas'],
    ['tea', 'New York Times', 'El Pais'],
    ['New York Times'],
    ['El Pais', 'The Economist'],
    ['milk', 'tea'],
    ['croissant', 'tea'],
    ['croissant', 'chocolatine', 'milk'],
]
target = [
    'foodstore', 
    'newspaper', 
    'newspaper', 
    'newspaper', 
    'foodstore',
    'bakery',
    'bakery',
]

In [29]:
te = TransactionEncoder()
D = te.fit(transactions).transform(transactions)
D

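|   | El Pais | New York Times | The Economist | bananas | chocolatine | croissant | milk | tea |
|---|---------|----------------|---------------|---------|-------------|-----------|------|-----|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 6 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |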


In [30]:
ovr = OneVsRestClassifier(SLIM())

In [31]:
ovr.fit(D, y=target)
ovr.estimators_

[SLIM(), SLIM(), SLIM()]

In [32]:
pd.DataFrame(ovr.decision_function(D), columns=ovr.classes_)

|   | bakery | foodstore | newspaper |
|---|--------|-----------|-----------|
| 0 | 0.238357 | 0.399719 | 0.218069 |
| 1 | 0.11637 | 0.142135 | 0.23447 |
| 2 | 0.488218 | 0.488218 | 0.641159 |
| 3 | 0.238357 | 0.238357 | 0.365697 |
| 4 | 0.238357 | 0.399719 | 0.266351 |
| 5 | 0.596311 | 0.29113 | 0.266351 |
| 6 | 0.596311 | 0.159776 | 0.101834 |


In [33]:
ovr.predict(D)

array(['foodstore', 'newspaper', 'newspaper', 'newspaper', 'foodstore',
       'bakery', 'bakery'], dtype='<U9')

#### Questions on binary OneVsRest classifier

In [34]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('transaction_encoder', TransactionEncoder(sparse_output=False)),
    ('slim', SLIM()),
])

In [35]:
transactions = [
    ['milk', 'bananas'],
    ['tea', 'New York Times', 'El Pais'],
    ['New York Times'],
    ['El Pais', 'The Economist'],
    ['milk', 'tea'],
]
target = [
    'foodstore', 
    'newspaper', 
    'newspaper', 
    'newspaper', 
    'foodstore',
]

In [36]:
te = TransactionEncoder()
D = te.fit(transactions).transform(transactions)
D

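|   | El Pais | New York Times | The Economist | bananas | milk | tea |
|---|---------|----------------|---------------|---------|------|-----|
| 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 1 |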


In [37]:
ovr = OneVsRestClassifier(SLIM())

In [38]:
ovr.fit(D, y=target)
ovr.estimators_

[SLIM()]

In [39]:
ovr.decision_function(D)

0    0.238357
1    0.267940
2    0.670320
3    0.399719
4    0.291130
dtype: float32

**=> Which threshold can we take to decide whether the prediction is 'foodstore' or 'newspaper'?**

#### OneVsOne

In [40]:
from sklearn.multiclass import OneVsOneClassifier
from sklearn.pipeline import Pipeline

In [41]:
pipe = Pipeline([
    ('transaction_encoder', TransactionEncoder(sparse_output=False)),
    ('slim', SLIM()),
])

In [42]:
transactions = [
    ['milk', 'bananas'],
    ['tea', 'New York Times', 'El Pais'],
    ['New York Times'],
    ['El Pais', 'The Economist'],
    ['milk', 'tea'],
    ['croissant', 'tea'],
    ['croissant', 'chocolatine', 'milk'],
    ['Monde diplo', 'Game of Thrones', 'Harry Potter'],
    ['New York Times', 'Harry Potter'],
]
target = [
    'foodstore', 
    'newspaper', 
    'newspaper', 
    'newspaper', 
    'foodstore',
    'bakery',
    'bakery',
    'library',
    'library',
]

In [43]:
te = TransactionEncoder()
D = te.fit(transactions).transform(transactions)
D

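|   | El Pais | Game of Thrones | Harry Potter | Monde diplo | New York Times | The Economist | bananas | chocolatine | croissant | milk | tea |
|---|---------|-----------------|--------------|-------------|----------------|---------------|---------|-------------|-----------|------|-----|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 7 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |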


In [44]:
ovr = OneVsOneClassifier(SLIM())

In [45]:
ovr.fit(D, y=target)
ovr.estimators_

(SLIM(), SLIM(), SLIM(), SLIM(), SLIM(), SLIM())

In [46]:
ovr.decision_function(D)

AttributeError: 'SLIM' object has no attribute 'predict'

As the error message shows, `OneVsOneClassifier` relies on each binary estimator having a `predict` method, which `SLIM` does not provide, so this strategy cannot be used with `SLIM` as is.