A versatile relational clustering toolbox. It contains several clustering algorithms; see the original paper for details (cited below). Still under development.
- get SBT
- clone this repository
- change to the root folder and
- build a jar with dependencies
sbt assembly
- build a jar without dependencies
sbt package
Usage: RelationalClustering.jar [OPTIONS]
OPTIONS
--db filepath database(s) with data to cluster (May be specified multiple times.)
--declarations filepath predicate declarations
--query comma-separated list domains to query
--similarity [RCNT|HS|HSAG|CCFonseca|RKOH] similarity measure
--domain filepath predicate definitions
--aggregates comma-separated list [mean/min/max] a list of aggregator functions to use for the numerical attributes
--algorithm [Spectral|Hierarchical|DBscan|Affinity] clustering algorithm
--bagCombination [union|intersection] multiset combination method
--bagSimilarity [chiSquared|maximum|minimum|union] multiset similarity measure
--clauseLength n (CC and RKOH) maximal length of clause/walk
--damping d damping parameter for Affinity Propagation
--definitionsDeviance Double maximum standard deviation for a numeric attribute to be preserved (in % of the mean value)
--definitionsK Int top K most occurring tuples to select
--depth n depth of the neighbourhood tree
--eps d eps value for DBscan
--labels filepath labels for the query objects (May be specified multiple times.)
--linkage [average|complete|ward] (Hierarchical) linkage
--preference d (Affinity Propagation) preference parameter
--root filepath temporary folder to use
--selection [model|saturation] method to choose a single clustering
--selectionValidation [intraCluster|silhouette] evaluation criteria for clustering selection
--validationMethod [ARI|RI|intraCluster|majorityClass] cluster validation method
--vertexCombination [avg|min|max] how to combine the similarities of individual vertices in a hyperedge
--weights Array[Double] comma-separated list of weights [attributes,attribute distribution,connections,vertex neighbourhood,edge distribution]
--exportNTrees flag export neighbourhood trees as gspan
--findDefinitions flag extract definitions of clusters
-k n number of clusters to create
--localRepo flag use local NodeRepository for all neighbourhood trees
--selectSingle flag select single clustering
--validate flag perform clustering validation
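A minimal sketch of how an invocation could be assembled from the options above. The file names (`imdb.db`, `imdb.def`, `imdb.dcl`), the jar location, the use of `java -jar`, and the helper `build_command` are assumptions for illustration; only the option names and their allowed values come from the list above, and which options are mandatory is not stated there.

```python
import shlex

def build_command(jar, options):
    """Assemble an argument list for the clustering jar.

    `options` is a list of (name, value) pairs; a value of None marks a flag.
    """
    argv = ["java", "-jar", jar]  # assumption: the jar is started with java -jar
    for name, value in options:
        if value is None:
            argv.append(name)  # flag without a value, e.g. --exportNTrees
            continue
        if isinstance(value, (list, tuple)):
            # list-valued options such as --weights are comma-separated
            value = ",".join(str(v) for v in value)
        argv += [name, str(value)]
    return argv

cmd = build_command(
    "RelationalClustering.jar",          # hypothetical path to the built jar
    [
        ("--db", "imdb.db"),             # hypothetical file names
        ("--domain", "imdb.def"),
        ("--declarations", "imdb.dcl"),
        ("--query", "person"),
        ("--similarity", "RCNT"),
        ("--algorithm", "Spectral"),
        ("--depth", 2),
        ("-k", 3),
        ("--weights", [0.2, 0.2, 0.2, 0.2, 0.2]),
    ],
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the jar path and file names point at real files.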
Knowledge base/graph containing the facts in a domain (*.db)
Movie(Aoceanstwelve,Anelsonpeltz)
Movie(Aplayerthe,Awhoopigoldberg)
Movie(Apelicanbriefthe,Ajuliaroberts)
Movie(Aoceanstwelve,Ajuliaroberts)
...
Gender_male(Adavidsontag)
Gender_male(Arobertculp)
Gender_female(Acynthiastevenson)
Gender_male(Afredward)
Gender_female(Adinamerrill)
...
Genre(Asoderberghsteven,Acrime)
Genre(Apakulaalanj,Adrama)
Genre(Apakulaalanj,Amystery)
Genre(Aaltmanroberti,Adrama)
Workedunder(Aminianden,Asoderberghsteven)
Workedunder(Acaseyaffleck,Asoderberghsteven)
Workedunder(Aelliottgould,Asoderberghsteven)
Workedunder(Adenzelwashington,Apakulaalanj)
...
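To make the fact format concrete, here is a small sketch that reads such lines into a dictionary keyed by predicate. It is not the toolbox's own parser; the grammar (`Predicate(arg1,arg2,...)`, one fact per line) is inferred from the example above, and `parse_facts` is a hypothetical helper.

```python
import re
from collections import defaultdict

# One fact per line: a predicate name followed by comma-separated constants.
FACT = re.compile(r"^\s*(\w+)\(([^()]*)\)\s*$")

def parse_facts(lines):
    """Group argument tuples by predicate name, skipping non-fact lines."""
    facts = defaultdict(list)
    for line in lines:
        match = FACT.match(line)
        if match is None:
            continue  # blank lines, elisions such as "...", comments
        predicate, raw_args = match.groups()
        facts[predicate].append(tuple(a.strip() for a in raw_args.split(",")))
    return dict(facts)

sample = [
    "Movie(Aoceanstwelve,Anelsonpeltz)",
    "Movie(Aoceanstwelve,Ajuliaroberts)",
    "...",
    "Gender_male(Adavidsontag)",
    "Genre(Asoderberghsteven,Acrime)",
]
facts = parse_facts(sample)
for predicate, tuples in facts.items():
    print(predicate, tuples)
```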
Definitions file specifying the domains of objects (*.def)
Gender_male(person)
Gender_female(person)
Genre(person,genre)
Movie(movie,person)
Workedunder(person,person)
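The definitions assign a domain to every argument position, so the objects of each domain can be collected by walking the facts. The sketch below illustrates that reading of the two files; the helper names are hypothetical and the toolbox may build its domains differently.

```python
import re

ATOM = re.compile(r"^\s*(\w+)\(([^()]*)\)\s*$")

def parse_atoms(lines):
    """Return (predicate, arguments) pairs for every line shaped like Pred(a,b)."""
    atoms = []
    for line in lines:
        match = ATOM.match(line)
        if match:
            args = tuple(a.strip() for a in match.group(2).split(","))
            atoms.append((match.group(1), args))
    return atoms

def domain_members(definition_lines, fact_lines):
    """Map each domain name to the set of constants that occur in it."""
    signatures = dict(parse_atoms(definition_lines))
    members = {}
    for predicate, args in parse_atoms(fact_lines):
        domains = signatures[predicate]  # KeyError: predicate has no definition
        if len(domains) != len(args):
            raise ValueError(f"{predicate}: expected {len(domains)} arguments, got {len(args)}")
        for domain, constant in zip(domains, args):
            members.setdefault(domain, set()).add(constant)
    return members

definitions = [
    "Gender_male(person)",
    "Genre(person,genre)",
    "Movie(movie,person)",
    "Workedunder(person,person)",
]
facts = [
    "Movie(Aoceanstwelve,Ajuliaroberts)",
    "Genre(Asoderberghsteven,Acrime)",
    "Workedunder(Aminianden,Asoderberghsteven)",
    "Gender_male(Adavidsontag)",
]
members = domain_members(definitions, facts)
print(sorted(members["person"]))
```

With this view, a value passed to `--query` such as `person` names one of the keys of `members`.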
Declarations file specifying the meaning of the arguments of predicates (*.dcl)
Gender_male(name)
Gender_female(name)
Genre(name,attr)
Movie(name,name)
Workedunder(name,name)
The arguments can have the following roles:
- name: identifier of an object/instance/example; this is essentially treated as the name of an instance.
- attr: identifies a discrete attribute value. The attribute name is given by the name of the predicate. The predicate needs to have exactly one name argument.
- number: identifies a continuous attribute value. The attribute name is given by the name of the predicate. The predicate needs to have exactly one name argument.
These roles influence the way a neighbourhood tree is constructed.
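The constraint on roles can be checked before running the toolbox. The following sketch assumes that the "exactly one name argument" requirement applies to any predicate that declares an `attr` or `number` argument; `check_declarations` and the malformed `Rating` line are hypothetical and only serve the illustration.

```python
import re

DECLARATION = re.compile(r"^\s*(\w+)\(([^()]*)\)\s*$")
ROLES = {"name", "attr", "number"}

def check_declarations(lines):
    """Return a list of human-readable problems found in a declarations file."""
    problems = []
    for line in lines:
        match = DECLARATION.match(line)
        if match is None:
            continue  # ignore blank or unrelated lines
        predicate = match.group(1)
        roles = [r.strip() for r in match.group(2).split(",")]
        unknown = [r for r in roles if r not in ROLES]
        if unknown:
            problems.append(f"{predicate}: unknown role(s) {unknown}")
        carries_attribute = any(r in ("attr", "number") for r in roles)
        if carries_attribute and roles.count("name") != 1:
            problems.append(f"{predicate}: attribute predicates need exactly one name argument")
    return problems

declarations = [
    "Gender_male(name)",
    "Genre(name,attr)",
    "Movie(name,name)",
    "Rating(attr,number)",  # hypothetical malformed line: no name argument
]
for problem in check_declarations(declarations):
    print(problem)
```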
The following similarity measures are currently supported:
- Relational clustering over neighbourhood trees (see Citing section)
- Hybrid similarity measure introduced in
Neville, Adler and Jensen: Clustering Relational Data Using Attribute and Link Information. Text Mining and Link Analysis Workshop, ICAI 2003
- Hybrid similarity measure for annotated graphs introduced in
Witsenburg and Blockeel: Improving the accuracy of similarity measures by using link information. Foundations of Intelligent Systems 2001
- Conceptual clustering introduced in
Fonseca, Santos Costa, Camacho: Conceptual clustering of multi-relational data. ILP 2011
- [Not ready yet] Relational instance based learning
- Graph kernels:
- Rooted Kernel for ordered hypergraphs from
Wachman, Khardon: Learning from Interpretations: A Rooted Kernel for Ordered Hypergraphs. ICML 2007
- Fork it!
- Create your feature branch:
git checkout -b my-new-feature
- Commit your changes:
git commit -am 'Add some feature'
- Push to the branch:
git push origin my-new-feature
- Submit a pull request
If you have any questions, feel free to send them to sebastijan.dumancic@cs.kuleuven.be
Please cite the following paper if you use this code:
@article{,
author = {Dumancic, Sebastijan and Blockeel, Hendrik},
title = {An expressive dissimilarity measure for relational clustering over neighbourhood trees},
journal = {Machine Learning journal},
year = {2017},
url = {https://lirias.kuleuven.be/handle/123456789/582293}
}
Released under the Apache License, version 2.0.