## python-rdm

The [python-rdm](https://pypi.org/project/python-rdm/) package is a collection of wrappers for relational data mining algorithms that aims to make them easily accessible. The input to the algorithms can be read from a relational database (MySQL, PostgreSQL, and SQLite are supported) or from plain CSV text files (with two additional header rows). The following packages are required to run this notebook:

- python-rdm==0.3.5

Additionally, YAP Prolog is required for this notebook. Specifically, [YAP version 6.3.3](https://github.com/vscosta/yap-6.3/archive/yap-6.3.3.zip) is known to work; other versions may or may not work due to internal changes in YAP.
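
Before running the learner, it can help to confirm that YAP is actually installed. The following is a minimal sketch, assuming the wrappers launch YAP through the command name `yap` (the name that appears in the traceback later in this notebook); the helper `executable_on_path` is a hypothetical name introduced here and is not part of `python-rdm`.

```python
import shutil


def executable_on_path(name):
    """Return True if an executable called `name` can be found on PATH."""
    return shutil.which(name) is not None


# Assumption: the wrappers start YAP via the plain command name 'yap'.
if executable_on_path('yap'):
    print('YAP found at:', shutil.which('yap'))
else:
    print('YAP not found on PATH; install it before running RSD.')
```

If the check fails, the RSD cell below is expected to stop with a `FileNotFoundError` for `yap`.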

We demonstrate `python-rdm` by connecting to a remote relational database, reading the data, running a selected relational data mining algorithm, and presenting the results. We start by importing the relevant parts of `python-rdm` and establishing a connection to a remote MySQL database that stores a copy of Michalski's East-West trains challenge dataset. For this remote connection to work, port 3306 must be open, as MySQL uses it for communication.
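
Whether port 3306 is reachable from the current machine can be checked with the standard library alone. This is a rough sketch rather than anything provided by `python-rdm`; the function name `port_is_open` and the default timeout are assumptions made for illustration.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host name and opens a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For the database used in this notebook, the call would be `port_is_open('workflow.ijs.si', 3306)`. A `False` result suggests a firewall or network issue rather than a problem with `python-rdm` itself.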

In [5]:
from rdm.db import DBContext, DBVendor, DBConnection
connection = DBConnection(
    'ilp',                # User
    'ilp123',             # Password
    'workflow.ijs.si',    # Host
    'trains',             # Database
    vendor=DBVendor.MySQL # Database type
)

The next step is to define the learning context. The target table is _trains_ and the target attribute is _direction_.

In [6]:
context = DBContext(connection, 
                    target_table='trains',
                    target_att='direction')

`DBContext` reads the data, parses column types, foreign keys, etc., and loads each table into an `Orange.data.Table` object stored in memory. Attributes such as `id` are stored as meta attributes. We print all tables, their domains, and the first 5 instances of each.

In [7]:
for table in context.tables:
    print('Table: "{}"'.format(table))
    print('Domain: {}'.format(context.orng_tables[table].domain))
    print(context.orng_tables[table][:5])
    print('')

Table: "cars"
Domain: [position, shape, len, sides, roof, wheels, load_shape, load_num] {id, tid}
[[1, rectangle, short, not_double, none, 2, circle, 1] {1, 1},
 [2, rectangle, long, not_double, none, 3, hexagon, 1] {2, 1},
 [3, rectangle, short, not_double, peaked, 2, triangle, 1] {3, 1},
 [4, rectangle, long, not_double, none, 2, rectangle, 3] {4, 1},
 [1, rectangle, short, not_double, flat, 2, circle, 2] {5, 2}]

Table: "trains"
Domain: [ | direction] {id}
[[ | east] {1},
 [ | east] {2},
 [ | east] {3},
 [ | east] {4},
 [ | east] {5}]



We run the RSD learner. First, we create an `RSDConverter` instance, which transforms the data into a form appropriate for RSD (Prolog clauses). RSD supports the following parameters, which control the feature construction process:

- clauselength
- depth
- negation
- min_coverage
- filtering

We set the maximum length of a feature body to 6, run RSD, and display the first 10 induced features.

In [8]:
from rdm.db.converters import RSDConverter
from rdm.wrappers import RSD
from pprint import pprint

conv = RSDConverter(context)
rsd = RSD()
rsd.set('clauselength', 6)

features, arff, _ = rsd.induce(conv.background_knowledge(),
                               examples=conv.all_examples(),
                               cn2sd=False)
pprint(features.split('\n')[:10])

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/opt/tljh/user/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/opt/tljh/user/lib/python3.7/site-packages/rdm/wrappers/security.py", line 58, in run
    self.p = Popen(self.args, **self.kwargs)
  File "/opt/tljh/user/lib/python3.7/subprocess.py", line 775, in __init__
    restore_signals, start_new_session)
  File "/opt/tljh/user/lib/python3.7/subprocess.py", line 1522, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'yap': 'yap'



AttributeError: 'SafePopen' object has no attribute 'p'

The generated ARFF file can be loaded into the `Table` structure of the Orange data mining suite. `Table` is a wrapper around NumPy arrays and can be used with the `scikit-learn` data mining library.

In [9]:
from rdm.helpers import arff_to_orange_table
table = arff_to_orange_table(arff)
print(table.X.shape)
print(table.Y.shape)
print(table.domain)

NameError: name 'arff' is not defined

Finally, we select the 20 best features using recursive feature elimination (RFE) with a linear SVM and build a simple decision tree on them.

In [10]:
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

estimator = SVC(kernel='linear')
selector = RFE(estimator, n_features_to_select=20, step=1)
selector = selector.fit(table.X, table.Y)
X1 = table.X[:,selector.support_]

AttributeError: 'str' object has no attribute 'X'

In [11]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
tree = DecisionTreeClassifier(min_samples_leaf=2)
tree.fit(X1, table.Y)
y_names = np.array([x.name for x in table.domain.attributes])[selector.support_]
_=plot_tree(tree, 
            feature_names=y_names, 
            filled=True, 
            fontsize=7, 
            class_names=[table.domain.class_var.str_val(i) for i in [0,1]],
            label='all')

NameError: name 'X1' is not defined

Because the feature names are very long, we print the tree in textual form to improve its readability.

In [12]:
from sklearn.tree import export_text
names = list(np.array([x.split(':-')[1] for x in features.split('\n') if x])[selector.support_])
tree_rules = export_text(tree, feature_names=names)
print(tree_rules)
for i in [0,1]:
    print('class {}: {}'.format(i, table.domain.class_var.str_val(i)))

NameError: name 'features' is not defined
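
The name extraction in the cell above can be illustrated without NumPy or a fitted selector. The sketch below is only an assumption-laden illustration: the three clauses are hypothetical examples of the `head:-body` line format that the cell relies on, not actual RSD output, and the `support` mask stands in for `selector.support_`.

```python
from itertools import compress

# Hypothetical clauses in "head:-body" form, one per line (not real RSD output).
features = (
    "f(1,A):-hasCar(A,B),carShape(B,rectangle).\n"
    "f(2,A):-hasCar(A,B),carRoof(B,flat).\n"
    "f(3,A):-hasCar(A,B),carLen(B,short).\n"
)
# Hypothetical boolean mask playing the role of selector.support_.
support = [True, False, True]


def feature_bodies(text):
    """Return the part after the first ':-' of every non-empty line."""
    return [line.split(':-', 1)[1] for line in text.split('\n') if line]


def select(items, mask):
    """Keep only the items whose mask entry is true."""
    return list(compress(items, mask))


names = select(feature_bodies(features), support)
print(names)
```

Only the clause bodies are kept because they describe the feature itself, which is what makes the textual tree readable.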