# Local Search Optimization for graph clustering

## Introduction

This program uses local search optimization to find an approximately optimal clustering of an undirected graph according to some objective function. Many different objectives are supported. There are bindings for Octave and MATLAB.

Formally, the problem that is solved is to find a clustering C of the nodes in a graph that minimizes the loss

``````loss(C) = f(sum_{c ∈ C} g(c))
``````

The optimization method is the one introduced by Blondel et al. [2], and also used in [1] and [3].

## Installation

To build the stand-alone version:

``````make
``````

To build the Octave interface:

``````octave
make_octave
``````

To build the MATLAB interface:

``````matlab
make_matlab
``````

## Usage

For the Octave and MATLAB bindings, the usage is:

``````[c,l,k] = lso_cluster(A, <optional parameters>);
``````

Where:

• `A` is a symmetric weighted adjacency matrix. It can be sparse.
• The optional parameters are a list of `'name',value` pairs, see below.
• `c` is the labeling that represents the clustering. The first cluster will have label `0`, the next label `1`, etc.
• `l` is the objective (loss) value of the clustering.
• `k` is the number of clusters, `k = numel(unique(c))`.
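
The labeling convention for `c` (clusters numbered `0, 1, ...` in order of first appearance) and the meaning of `k` can be illustrated with a short Python sketch; `normalize_labels` is a hypothetical helper for illustration, not part of this package:

```python
def normalize_labels(c):
    """Relabel clusters as 0, 1, ... in order of first appearance,
    mirroring the output convention described above.
    (Illustrative helper, not part of this package.)"""
    mapping = {}
    out = []
    for lab in c:
        if lab not in mapping:
            mapping[lab] = len(mapping)  # next unused label
        out.append(mapping[lab])
    k = len(mapping)  # number of clusters, cf. numel(unique(c))
    return out, k

labels, k = normalize_labels([0, 0, 0, 0, 5, 5, 5, 1, 1])
# labels == [0, 0, 0, 0, 1, 1, 1, 2, 2], k == 3
```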

Optional parameters regarding the clustering:

• `'eval', clustering` Evaluate the objective function on `clustering`, do not optimize.

• `'init', clustering` Use the given `clustering` as initial value for the optimization. Default: each node is initially in a separate cluster, i.e. `clustering=1:length(A)`.

Optional parameters regarding the objective:

• `'loss', loss` Use the given loss/objective function. The loss is given as a string name; see below for a list of supported loss functions. Default: `loss = 'modularity'`

• `'total_volume', m` Replace the total volume (sum of edges) by `m`. Many objectives use the total volume for normalization, and changing it will change the scale at which clusters are found. Usually increasing the total volume will result in larger clusters. Default: `m = sum(sum(A))`

• `'extra_loss', alpha` Add a term to the loss function that penalizes the volume of clusters, with weight `alpha`.

• `'num_clusters', n` Force the solution to have the given number of clusters. The algorithm uses a binary search to alter the objective until it finds a solution with the given number of clusters. The alteration is the same as the one used by `extra_loss`.

• `'min_num_cluster', n` Force the solution to have at least the given number of clusters.

• `'max_num_cluster', n` Force the solution to have at most the given number of clusters.

• `'max_cluster_size', n` Allow clusters to have at most n nodes.
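
The `'num_clusters'` binary search can be sketched as follows. This is an illustration of the idea, not the package's code: `cluster_fn` is a stand-in for a full optimizer run, and it assumes the cluster count grows monotonically with the penalty weight alpha (penalizing cluster volume, as `extra_loss` does, favors smaller and hence more clusters):

```python
def force_num_clusters(cluster_fn, target, lo=0.0, hi=10.0, iters=32):
    """Binary-search a penalty weight alpha until the clustering has the
    target number of clusters. cluster_fn(alpha) -> number of clusters."""
    for _ in range(iters):  # cf. the 'num_loss_tweaks' parameter
        mid = (lo + hi) / 2.0
        k = cluster_fn(mid)
        if k == target:
            return mid
        if k > target:
            hi = mid  # too many clusters: weaken the penalty
        else:
            lo = mid  # too few clusters: strengthen the penalty
    return None  # no suitable alpha found within the iteration budget

# Toy stand-in where the cluster count is simply round(alpha), clipped:
toy = lambda a: max(1, min(10, round(a)))
alpha = force_num_clusters(toy, 5)
```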

Optional parameters controlling internal algorithm details; you only need these if you know what you are doing:

• `'seed', random_seed` Use a given random seed. By default a fixed seed is used, so repeated runs with the same input give the same output.

• `'num_repeats', n` Repeat the search n times from scratch with different random seeds and return the best result. Default: 1

• `'num_partitions', n` Number of times to try to break apart the clusters and re-cluster them. Default: 0

• `'optimize_exhaustive', bool` Use an exhaustive search instead of local search. This is of course very slow. Can be combined with `max_num_cluster`. Default: false.

• `'optimize_higher_level', bool` Use a hierarchical optimizer, where small clusters are considered as nodes of a higher level graph. Default: true.

• `'always_consider_empty', bool` Always consider the move of a node into a new singleton cluster. Default: true.

• `'num_loss_tweaks', n` Maximum number of iterations in the binary search to force the specified number of clusters. Default: 32

• `'check_invariants', bool` Check invariants of the algorithm (for debugging). Default: false.

• `'trace_file', filename` Write a trace of the steps performed by the optimization algorithm to a file in JSON format.

• `'verbose', level` Level of debug output. Levels go up to 7 and are increasingly chatty. Default: `level = 0`, i.e. no output.

## Loss functions

The optional parameter `'loss'` specifies the loss function to use. The default is modularity. In this section the notation is:

``````v_c: The volume of cluster c, i.e. the sum of degrees/strengths of nodes in c.
In MATLAB notation:
v_c = sum(sum(A(C==c,:)))
w_c: The weight of edges within cluster c
In MATLAB notation:
w_c = sum(sum(A(C==c,C==c)))
n_c: The number of nodes in cluster c
m:   The total volume of the graph,
m = v_V, where V is the set of all nodes.
This is altered by the 'total_volume' option.
``````
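
The notation above can be made concrete with a small NumPy sketch (illustrative only, not part of this package):

```python
import numpy as np

def cluster_stats(A, C, c):
    """Compute v_c, w_c and n_c for cluster c, per the notation above.
    A is a symmetric weighted adjacency matrix, C a label vector."""
    mask = (C == c)
    v_c = A[mask, :].sum()             # volume: total strength of nodes in c
    w_c = A[np.ix_(mask, mask)].sum()  # weight of edges within c
    n_c = int(mask.sum())              # number of nodes in c
    return v_c, w_c, n_c

# A 4-node graph: a triangle {0,1,2} plus a pendant node 3
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
C = np.array([0, 0, 0, 1])
m = A.sum()                            # total volume, m = 8 here
v0, w0, n0 = cluster_stats(A, C, 0)    # v0 = 7, w0 = 6, n0 = 3
```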

Since the loss is minimized, some objectives are negated compared to their usual definition.

Some of the supported loss functions are:

• `'modularity'`: `loss = -sum_c (w_c/m - v_c^2/m^2)`. This is the negation of the usual definition

• `'infomap'`: The infomap objective by Rosvall and Bergstrom [3].

• `'ncut'`: Normalized cut, `loss = sum_c (v_c - w_c) / v_c`

• `'rcut'`: Ratio cut, `loss = sum_c (v_c - w_c) / n_c`

• `{'pmod',p}`: Modularity with a different power, `loss = -sum_c (w_c/m - (v_c/m)^p / (p-1))`

• `{'mom',m}`: Monotonic variant of modularity, `loss = -sum_c (w_c/(m + 2v_c) - (v_c/(m + 2v_c))^2)`

• `'w-log-v'`, `loss = sum_c (w_c/m * log(v_c) )`

• `'num'`: `loss = |C|`. Minimize the number of clusters; this leads to finding the connected components.

See `loss_functions.hpp` for the full list. Some loss functions have parameters, these are passed as a cell array where the first element is the name.
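
As a concrete check of the `'modularity'` formula above, here is an illustrative NumPy implementation (a sketch, not this package's code):

```python
import numpy as np

def modularity_loss(A, C, m=None):
    """loss = -sum_c (w_c/m - (v_c/m)^2), the negated modularity above.
    m defaults to the total volume, like the 'total_volume' option."""
    if m is None:
        m = A.sum()
    loss = 0.0
    for c in np.unique(C):
        mask = (C == c)
        v_c = A[mask, :].sum()
        w_c = A[np.ix_(mask, mask)].sum()
        loss -= w_c / m - (v_c / m) ** 2
    return loss

# Same toy graph as before: a triangle {0,1,2} plus a pendant node 3
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
C = np.array([0, 0, 0, 1])
loss = modularity_loss(A, C)  # 0.03125 for this graph
```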

## Examples

Find the clustering of a graph by optimizing modularity:

``````% Build a graph that has a cluster structure
A = (blkdiag(rand(4), rand(5), rand(6)) > 1-0.5) | (rand(15) > 1-0.05);
A = A | A';
% Find clusterings
c = lso_cluster(A);
% c = [0 0 0 0 1 1 1 1 1 2 2 2 2 2 2]'
``````

Evaluate the modularity of a given clustering:

``````c = [0 0 0 0 5 5 5 1 1 2 2 2 2 2 2]
[ignored, l] = lso_cluster(A, 'eval',c);
l
% l = -0.40633
``````

Optimize infomap:

c = lso_cluster(A, 'loss','infomap');
``````

Find a solution with exactly 3 clusters:

c = lso_cluster(A, 'num_clusters',3);
``````

Here is a larger example:

``````% An LFR graph with mixing 0.6 and 1000 nodes (see [4])
% The graph is in A, the ground truth is in c

> [d,l,k] = lso_cluster(A,'loss','modularity');
> normalized_mutual_information(c,d)
ans = 0.93011

% This solution has too few clusters
> k
k = 25
> numel(unique(c))
ans = 41

% So, force the number of clusters to be 41
> [d,l,k] = lso_cluster(A,'loss','modularity','num_clusters',41);
> normalized_mutual_information(c,d)
ans = 1

% Or use a different loss function
> [d,l,k] = lso_cluster(A,'loss','w-log-v');
> normalized_mutual_information(c,d)
ans = 1
``````

## Usage, stand-alone version

Graphs should be given in a text file whose name is passed as the first argument (or `-` for stdin). Each line should look like:

`````` <from> <to> <weight>
``````

Where `<from>` and `<to>` are node identifiers, either strings or integers from `0...n`. Only nodes that have edges will be included in the graph. If needed, you can add an edge `<n> <n> 0` to force the graph to contain a node `n`. The graph is made symmetric, i.e. the reversed edges `<to> <from> <weight>` are also added.
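
The input handling described above can be illustrated with a small Python parser sketch (an approximation of the described behavior, not the program's actual reader):

```python
from collections import defaultdict

def read_graph(lines):
    """Parse '<from> <to> <weight>' lines into a symmetric weighted graph.
    A zero-weight edge '<n> <n> 0' registers node n without adding weight."""
    adj = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        u, v, w = parts[0], parts[1], float(parts[2])
        adj[u]; adj[v]                 # ensure both endpoints exist
        if w != 0.0:
            adj[u][v] = adj[u].get(v, 0.0) + w
            if u != v:                 # the graph is made symmetric
                adj[v][u] = adj[v].get(u, 0.0) + w
    return adj

g = read_graph(["a b 1", "b c 2", "d d 0"])
# g["a"] == {"b": 1.0}; g["c"] == {"b": 2.0}; "d" is an isolated node
```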

The output is written as text to stdout; with `-o <filename>` the found clustering is written to a file instead. Each line of the output has the format

``````<node> <clusterid>
``````

Nodes will be listed in alphabetical order, or in numerical order if the `--numeric` flag is given.

Other parameters are specified as `--parameter value`, with the same names as the MATLAB version's parameters. For example, `--loss infomap` uses the infomap loss function.

``````$ lso-cluster example.in
``````

## References

[1] Graph clustering: does the optimization procedure matter more than the objective function?; Twan van Laarhoven and Elena Marchiori; Physical Review E 87, 012812 (2013)

[2] Fast unfolding of communities in large networks; Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte and Etienne Lefebvre; J. Stat. Mech. Theory Exp. 2008, P10008 (2008)

[3] Maps of random walks on complex networks reveal community structure; M. Rosvall and C. T. Bergstrom; Proc. Natl. Acad. Sci. USA 105, 1118 (2008)

[4] Benchmark graphs for testing community detection algorithms; A. Lancichinetti, S. Fortunato and F. Radicchi; Physical Review E 78, 046110 (2008)
