

# Installation


In [1]:
pip install mlconcepts pandas scipy h5py scikit-learn



# Loading data


The library implements a simple and easily extensible data loader that imports data from different sources, including data frames from other libraries. All these sources are abstracted by the class `mlconcepts.data.Dataset`, which represents a generic dataset with numerical and categorical data and, optionally, labels on its items (e.g., indicating which elements are outliers, or the class of each element). In the remainder of this document we will use the following utility function, which prints the content of a dataset as `numpy.array`s:

In [2]:
import os
def print_dataset(dataset):
    if dataset.X is not None:
        print("X:", os.linesep, dataset.X)
    if dataset.Xc is not None:
        print("Xc:", os.linesep, dataset.Xc)
    if dataset.y is not None:
        print("y:", os.linesep, dataset.y)

In general, datasets are loaded using the function `mlconcepts.load`, which accepts any type of data that it can transform into a dataset, such as `numpy` arrays, `pandas` dataframes, or paths to files with a known format (csv, xlsx, mat, json, or sql). For instance, let us create a dataset `"data.csv"`, which we will load right after.

In [3]:
# Writes the dataset |  a  |  b  | color | outlier |
#                    | 1.5 |  6  |  red  |    no   |
#                    | 2.7 |  1  | green |   yes   |
with open("data.csv", "w") as f:
    f.write("a,b,color,outlier\n")
    f.write("1.5,6,red,no\n")
    f.write("2.7,1,green,yes")

In [4]:
import mlconcepts
dataset = mlconcepts.load("data.csv")
print_dataset(dataset)

X: 
 [[1.5 6. ]
 [2.7 1. ]]
Xc: 
 [[0 0]
 [1 1]]


The example above loads a dataset suitable for unsupervised machine learning tasks: no label column is specified, so the `"outlier"` column is stored as an ordinary categorical feature. To load a dataset with labels in the `"outlier"` column, we can run

In [5]:
dataset = mlconcepts.load("data.csv", labels="outlier")
print_dataset(dataset)

X: 
 [[1.5 6. ]
 [2.7 1. ]]
Xc: 
 [[0]
 [1]]
y: 
 [0 1]
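
To make the first (unlabeled) output above more concrete, the following standard-library sketch reproduces the same `X` and `Xc` matrices from the CSV text. It is only an illustration under assumptions: the helpers `is_float` and `split_columns` are hypothetical, and both the float-parsing test and the first-appearance ordering of category indices are guesses that happen to match the printed output. They are not a description of how mlconcepts is implemented.

```python
import csv
import io

CSV_TEXT = "a,b,color,outlier\n1.5,6,red,no\n2.7,1,green,yes"


def is_float(text):
    """Return True if the string parses as a floating point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def split_columns(csv_text):
    """Hypothetical helper: separate numerical and categorical columns."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    names = list(rows[0].keys())
    # Assumption: a column is numerical if every value parses as a float.
    numerical = [n for n in names if all(is_float(r[n]) for r in rows)]
    categorical = [n for n in names if n not in numerical]
    X = [[float(r[n]) for n in numerical] for r in rows]
    # Assumption: categories are indexed in order of first appearance.
    codes = {n: {} for n in categorical}
    Xc = [[codes[n].setdefault(r[n], len(codes[n])) for n in categorical]
          for r in rows]
    return X, Xc


X, Xc = split_columns(CSV_TEXT)
print(X)
print(Xc)
```

Passing `labels="outlier"` to `mlconcepts.load` then corresponds to moving the last categorical column out of `Xc` and into `y`, as the second output above shows.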


### How to load from a numpy array?


Loading a numpy array is as easy as passing it to the load function:

In [6]:
import numpy
import mlconcepts
mat = numpy.array([[4.3, 2.4, 1.2], [3.3, 2.1, 0.6]])
dataset = mlconcepts.load(mat)
print_dataset(dataset)

X: 
 [[4.3 2.4 1.2]
 [3.3 2.1 0.6]]


The dtype of the passed matrix has to be convertible to a floating point number. Categorical data can be added in a second matrix, whose elements must be (convertible to) integers representing an index to the category of each element for every column. For instance, a single categorical feature can be added as follows:

In [7]:
num = numpy.array([[4.3, 2.4, 1.2], [3.3, 2.1, 0.6]])
cat = numpy.array([[2], [1]])
dataset = mlconcepts.load(num, Xc=cat)
print_dataset(dataset)

X: 
 [[4.3 2.4 1.2]
 [3.3 2.1 0.6]]
Xc: 
 [[2]
 [1]]


Similarly, labels can be added by setting the parameter `y` to a (numpy) vector whose dtype is convertible to integers. For instance, to say that the first element is an outlier, and the second is not:

In [8]:
num = numpy.array([[4.3, 2.4, 1.2], [3.3, 2.1, 0.6]])
cat = numpy.array([[2], [1]])
outliers = numpy.array([1, 0])
dataset = mlconcepts.load(num, Xc=cat, y=outliers)
print_dataset(dataset)

X: 
 [[4.3 2.4 1.2]
 [3.3 2.1 0.6]]
Xc: 
 [[2]
 [1]]
y: 
 [1 0]


### How to load from a pandas dataframe?

Pandas dataframes can be loaded by just passing them to the load function, e.g.,

In [9]:
import pandas
df = pandas.DataFrame({"a" : [4.2, 1.2], "b" : ["no", "yes"]})
data = mlconcepts.load(df)
print_dataset(data)

X: 
 [[4.2]
 [1.2]]
Xc: 
 [[0]
 [1]]


The loading function automatically detects that "b" is a categorical feature and stores it as such. Labels can be set by indicating a suitable column (categorical or integer) as follows:

In [10]:
df = pandas.DataFrame({"a" : [4.2, 1.2], "b" : ["no", "yes"]})
data = mlconcepts.load(df, labels="b")
print_dataset(data)

X: 
 [[4.2]
 [1.2]]
y: 
 [0 1]


Alternatively, a label column can be specified as a numpy array as follows:

In [11]:
df = pandas.DataFrame({"a" : [4.2, 1.2], "b" : ["no", "yes"]})
outliers = numpy.array([1, 0])
data = mlconcepts.load(df, y=outliers)
print_dataset(data)

X: 
 [[4.2]
 [1.2]]
Xc: 
 [[0]
 [1]]
y: 
 [1 0]


### How to load a matlab file?

To load from a matlab file, we need to tell mlconcepts the names of the matlab variables containing the data. To do so, we pass a dictionary `settings` containing the keys "Xname", "Xcname", and "yname", which default to "X", "Xc", and "y", respectively. For instance, to load from a matlab file `mammography.mat` (you can download it [here](https://odds.cs.stonybrook.edu/mammography-dataset/)) where the numerical data is in a matrix `X` and the labels are in a vector `y`, run

In [12]:
data = mlconcepts.load("mammography.mat",
                       settings={ "Xname" : "X", "yname" : "y" })
print_dataset(data)

X: 
 [[ 0.23001961  5.0725783  -0.27606055  0.83244412 -0.37786573  0.4803223 ]
 [ 0.15549112 -0.16939038  0.67065219 -0.85955255 -0.37786573 -0.94572324]
 [-0.78441482 -0.44365372  5.6747053  -0.85955255 -0.37786573 -0.94572324]
 ...
 [ 1.2049878   1.7637238  -0.50146835  1.5624078   6.4890725   0.93129397]
 [ 0.73664398 -0.22247361 -0.05065276  1.5096647   0.53926914  1.3152293 ]
 [ 0.17700275 -0.19150839 -0.50146835  1.5788636   7.750705    1.5559507 ]]
y: 
 [[0]
 [0]
 [0]
 ...
 [1]
 [1]
 [1]]


### How to load categorical data?

Most dataloaders handle categorical data automatically, as it is usually not hard to distinguish it from numerical data. The only exception is when you want to consider a column of integers to be categorical. In this case the parameter "categorical" can be used to specify a list of features which are in principle numerical, but should be considered as categorical, e.g.,

In [13]:
df = pandas.DataFrame({ 'age' : [32, 86], 'has_internet_plan' : [1, 0] })
data = mlconcepts.load(df, categorical=["has_internet_plan"])
print_dataset(data)

X: 
 [[32.]
 [86.]]
Xc: 
 [[0]
 [1]]


### How to generate splits of a dataset?

mlconcepts' `Dataset` objects support splitting using any split generator, i.e., an object with a method `split` which, given the dataset, returns a generator yielding pairs made of a set of indices used to sample the training set and a set of indices used to sample the test set.

For example, all the split generators of sklearn are supported, as shown in the following snippet:

In [14]:
import sklearn.model_selection
df = pandas.DataFrame({
    "a" : [1.4, 2.2, 7.3, 2.5, 4.6, 3.5, 9.8, 4.5, 1.3, 10.5],
    "c" : ["y", "n", "y", "y", "n", "n", "n", "y", "y", "y"],
    "o" : [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]
})
data = mlconcepts.load(df, labels = "o")

skf = sklearn.model_selection.StratifiedKFold(n_splits = 2, shuffle = True)
for train, test in data.split(skf):
    print_dataset(train)
    print_dataset(test)

X: 
 [[ 1.4]
 [ 4.6]
 [ 9.8]
 [ 1.3]
 [10.5]]
Xc: 
 [[0]
 [1]
 [1]
 [0]
 [0]]
y: 
 [1 0 1 0 0]
X: 
 [[2.2]
 [7.3]
 [2.5]
 [3.5]
 [4.5]]
Xc: 
 [[1]
 [0]
 [0]
 [1]
 [0]]
y: 
 [1 1 0 1 0]
X: 
 [[2.2]
 [7.3]
 [2.5]
 [3.5]
 [4.5]]
Xc: 
 [[1]
 [0]
 [0]
 [1]
 [0]]
y: 
 [1 1 0 1 0]
X: 
 [[ 1.4]
 [ 4.6]
 [ 9.8]
 [ 1.3]
 [10.5]]
Xc: 
 [[0]
 [1]
 [1]
 [0]
 [0]]
y: 
 [1 0 1 0 0]
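
Since any object with a suitable `split` method qualifies as a split generator, a custom one can be sketched in a few lines. The class `HoldoutSplit` below is hypothetical, and its `split(X, y=None, groups=None)` signature is copied from scikit-learn's splitters. Whether `Dataset.split` calls the method with exactly these arguments is an assumption that should be checked against the library before relying on it.

```python
import random


class HoldoutSplit:
    """Hypothetical split generator yielding one shuffled train/test split."""

    def __init__(self, test_fraction=0.3, seed=0):
        self.test_fraction = test_fraction
        self.seed = seed

    def split(self, X, y=None, groups=None):
        # Signature copied from scikit-learn's splitters; y and groups are unused.
        indices = list(range(len(X)))
        random.Random(self.seed).shuffle(indices)
        n_test = max(1, int(len(indices) * self.test_fraction))
        # Yield (train indices, test indices), as scikit-learn splitters do.
        yield sorted(indices[n_test:]), sorted(indices[:n_test])


items = [1.4, 2.2, 7.3, 2.5, 4.6, 3.5, 9.8, 4.5, 1.3, 10.5]
for train_idx, test_idx in HoldoutSplit().split(items):
    print("train:", train_idx)
    print("test:", test_idx)
```

The sketch runs on a plain list so that it does not depend on mlconcepts. Under the stated assumption, an instance would be passed as `data.split(HoldoutSplit())`.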


# Models

The library currently exposes two classes for outlier detection: `UODModel` for unsupervised OD, and `SODModel` for supervised OD. Models can simply be created as follows:

In [22]:
unsup_model = mlconcepts.UODModel()
sup_model = mlconcepts.SODModel()

After a model is created, the methods `fit`, `predict`, and `predict_explain` train the model, compute predictions for some set, and compute predictions with accompanying explanations for a set, respectively. These three methods can either take a `mlconcepts.data.Dataset` object, or can be called with the same signature as the method `mlconcepts.load` [described above](#Datasets). In the latter case, a dataset is loaded by forwarding the arguments to the function `mlconcepts.load`, and then the task is executed on that dataset.
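
The two calling conventions can be pictured with a small dispatch pattern. The sketch below uses toy stand-ins (`Dataset`, `load`, and `ToyModel` are all hypothetical and unrelated to the real mlconcepts classes) and only illustrates the idea of forwarding arguments to a loader when no dataset object is given.

```python
class Dataset:
    """Toy stand-in for mlconcepts.data.Dataset (not the real class)."""

    def __init__(self, X, y=None):
        self.X = X
        self.y = y


def load(X, y=None):
    """Toy stand-in for mlconcepts.load (not the real function)."""
    return Dataset(X, y)


class ToyModel:
    """Hypothetical model illustrating the two calling conventions."""

    def fit(self, *args, **kwargs):
        # Use a Dataset directly, otherwise forward everything to load().
        if len(args) == 1 and not kwargs and isinstance(args[0], Dataset):
            dataset = args[0]
        else:
            dataset = load(*args, **kwargs)
        self.n_items = len(dataset.X)
        return self


rows = [[5.6], [6.6], [1.2]]
print(ToyModel().fit(load(rows, y=[0, 0, 1])).n_items)  # Dataset object
print(ToyModel().fit(rows, y=[0, 0, 1]).n_items)        # forwarded to load()
```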

For instance, the following basic example trains a supervised model on a dataset `trainset` and computes predictions on the dataset `data`.

In [23]:
import pandas
trainset = mlconcepts.load(pandas.DataFrame({
    "f": [5.6, 6.6, 1.2, 9.6, 3.5, 8.9],
    "g" : [1, 1, 1, 1, 1, 1],
    "h" : [2, 2, 2, 7, 2, 7],
    "outlier" : [0, 0, 0, 1, 0, 1]
}), labels = "outlier")
data = mlconcepts.load(pandas.DataFrame({
    "f": [1.1, 9.0],
    "g:": [1, 1],
    "h": [2, 7],
    "outlier" : [0, 1]
}), labels = "outlier")
model = mlconcepts.SODModel(n=64, epochs=100)
model.fit(trainset)
predictions = model.predict(data)
print(predictions)


[0.33570436 0.92473796]


Similarly, explanation data can be extracted by running `predict_explain` as follows:

In [24]:
explanations = model.predict_explain(data)
print(explanations[0]) # prints the explanations for the first element
print(explanations[1]) # prints the explanations for the second element

Prediction: 0.33570436442575713. Explainers: { { f } : 0.4223886096968753, { f, h } : 0.39364539123871706, { g: } : 8.826424838925244e-08 }
Prediction: 0.9247379562839234. Explainers: { { f } : 1.1481712822870962, { f, h } : 1.070039113860856, { h } : 0.6610560865403848 }


Of course, the explanation data in this dummy example is not very meaningful, but it shows that the model relies on the feature `f` for making a prediction, while it essentially disregards `g`.

### Parameters for unsupervised models

Unsupervised models support parameters that guide the way in which the underlying FCA algorithm works:

In [25]:
model = mlconcepts.UODModel(
    n=32, # number of bins for numerical feature quantization
    quantizer="uniform", # numerical features are uniformly split in bins
    explorer="none", # strategy to explore different feature sets
    singletons=True, # whether to explore all single element sets
    doubletons=True, # whether to explore all two-element sets
    full=True # whether to explore the set of all features
)

The first two parameters concern the treatment of numerical features, which are quantized into categorical ones in a first step. The only supported quantizer at the time of writing is `uniform`.
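
As an illustration of what uniform quantization means, the sketch below maps each value of a feature to one of `n_bins` equal-width bins. The function `uniform_quantize` is hypothetical, and computing the bin edges from the minimum and maximum of the given values is an assumption. The library may choose its bin boundaries differently.

```python
def uniform_quantize(values, n_bins):
    """Map each value to the index of its equal-width bin in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature falls entirely into the first bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


f = [5.6, 6.6, 1.2, 9.6, 3.5, 8.9]
print(uniform_quantize(f, n_bins=4))
```

After this step every numerical feature is a column of small integers, i.e., it has the same form as the categorical matrix `Xc`.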

To make its predictions, the algorithm essentially creates different agents, each of which makes a prediction from a different perspective, i.e., from the set of features it observes. At the end, all the agents combine their scores into a final prediction. The `explorer` determines which agents to generate. Currently, in the stable version of the program, only `none` is supported. The last three parameters set the agents that are generated at the start. The indicated explorer is then used to generate more agents (to consider more feature sets), depending on which ones are more promising.
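
The effect of the last three parameters can be sketched by enumerating the feature sets they enable. The helper `initial_feature_sets` is hypothetical and only lists the sets. How each agent scores an item, and how the scores are combined, is not covered here.

```python
from itertools import combinations


def initial_feature_sets(features, singletons=True, doubletons=True, full=True):
    """Hypothetical helper listing the feature sets enabled by the three flags."""
    sets = []
    if singletons:
        sets += [frozenset(c) for c in combinations(features, 1)]
    if doubletons:
        sets += [frozenset(c) for c in combinations(features, 2)]
    if full and frozenset(features) not in sets:
        sets.append(frozenset(features))
    return sets


agents = initial_feature_sets(["f", "g", "h"])
for feature_set in agents:
    print(sorted(feature_set))
```

With three features and all flags enabled this gives three singletons, three doubletons, and the full set, i.e., seven feature sets.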

### Parameters for supervised models

Supervised models support all parameters of unsupervised models, plus parameters for gradient descent:

In [26]:
model = mlconcepts.SODModel(
    epochs=1000, # maximum number of training iterations
    show_training=True, # whether to show training information in the terminal
    learning_rate=0.01, # learning rate for gradient descent
    momentum=0.01, # momentum for gradient descent
    stop_threshold=0.001 # loss change under which training is halted
)
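
To show how these parameters typically interact, here is a minimal sketch of gradient descent with classical momentum and a loss-change stopping rule on a toy one-dimensional objective. The function `minimize` is hypothetical. The exact update rule used by mlconcepts and the way it compares the loss change with `stop_threshold` are assumptions.

```python
def minimize(grad, loss, w, learning_rate=0.01, momentum=0.01,
             stop_threshold=0.001, epochs=1000):
    """Gradient descent with classical momentum on a single parameter w."""
    velocity = 0.0
    previous = loss(w)
    for epoch in range(1, epochs + 1):
        # Blend the previous update with the current gradient step.
        velocity = momentum * velocity - learning_rate * grad(w)
        w += velocity
        current = loss(w)
        # Stop once the loss changes by less than the threshold.
        if abs(previous - current) < stop_threshold:
            return w, epoch
        previous = current
    return w, epochs


# Toy objective (w - 3)^2 with gradient 2 * (w - 3).
w, used = minimize(lambda w: 2 * (w - 3), lambda w: (w - 3) ** 2, w=0.0)
print(w, used)
```

In this sketch `epochs` is only an upper bound: training ends earlier as soon as the loss stops improving by at least `stop_threshold`.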

### What is the output of predict_explain?

The method `predict_explain` returns an `ExplanationData` object. `ExplanationData` overrides the subscript operator so as to return an `ExplanationEntry` for any index of an element in the dataset the prediction was computed on. An `ExplanationEntry` is an iterable which, when iterated over, returns pairs (set of features, importance). The pairs are yielded in decreasing order of importance.

Roughly speaking, each set of features is mapped to its relevance in making a prediction in the model.

The following snippet shows the relevance of every (computed) feature set for the first item in the dataset:

In [27]:
for feature_set, relevance in explanations[0]:
    print(feature_set, relevance)

{ f } 0.4223886096968753
{ f, h } 0.39364539123871706
{ g: } 8.826424838925244e-08
{ h } 7.439206219805188e-08
{ g:, h } 3.215917183791385e-08
full -0.04781196671852176
{ f, g: } -0.1459617584479418
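
Because an entry is just an iterable of pairs, it can be post-processed with ordinary Python. The sketch below keeps only the feature sets above a relevance threshold. It runs on a plain list whose values are copied from the output above, with feature sets represented by their printed strings. The helper `relevant` and the threshold `0.01` are illustrative choices, not part of the library.

```python
# Pairs copied from the output above, already in decreasing order of relevance.
entry = [
    ("{ f }", 0.4223886096968753),
    ("{ f, h }", 0.39364539123871706),
    ("{ g: }", 8.826424838925244e-08),
    ("{ h }", 7.439206219805188e-08),
    ("{ g:, h }", 3.215917183791385e-08),
    ("full", -0.04781196671852176),
    ("{ f, g: }", -0.1459617584479418),
]


def relevant(pairs, min_relevance=0.01):
    """Keep the feature sets whose relevance is at least min_relevance."""
    return [(name, value) for name, value in pairs if value >= min_relevance]


for name, value in relevant(entry):
    print(f"{name}: {value:.3f}")
```

Only `{ f }` and `{ f, h }` pass the threshold here, which matches the observation that the model relies mostly on `f`.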


### A slightly more complicated example

The following example shows how to load a dataset, generate splits, train a model on each split, and compute its performance (AUC score). It requires you to upload the `mammography.mat` dataset (which you can download [here](https://odds.cs.stonybrook.edu/mammography-dataset/)).

In [28]:
import sklearn.model_selection
import sklearn.metrics
import mlconcepts

data = mlconcepts.load("mammography.mat")

skf = sklearn.model_selection.StratifiedKFold(n_splits = 4, shuffle = True)
for train, test in data.split(skf):
    model = mlconcepts.SODModel(n=64, epochs=1000)
    model.fit(train)
    predictions = model.predict(test)
    print("AUC: ", sklearn.metrics.roc_auc_score(test.y, predictions))

AUC:  0.8958369715235331
AUC:  0.878756724783821
AUC:  0.905498126918852
AUC:  0.8527078050154973
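
Per-fold scores are usually summarized by their mean and spread. The following sketch does this with the standard library, using the four AUC values copied from the output above. The variable names are illustrative.

```python
import statistics

# AUC values copied from the output above (one per fold).
fold_aucs = [
    0.8958369715235331,
    0.878756724783821,
    0.905498126918852,
    0.8527078050154973,
]

mean_auc = statistics.mean(fold_aucs)
std_auc = statistics.stdev(fold_aucs)  # sample standard deviation
print(f"AUC: {mean_auc:.3f} +/- {std_auc:.3f}")
```

Because the folds are shuffled without a fixed `random_state`, the individual values change from run to run, so the summary is more informative than any single fold.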
