This is a Python package for data clustering using the tangles algorithmic framework.
It implements the core algorithm for discovering tangles in a set of cuts on a given dataset.
It also includes pre- and postprocessing implementations: options to compute initial cuts on different types of data, and options to postprocess tangles into hierarchical soft and hard clusterings.
The framework works with arbitrary sets of cuts and cost functions, which can be adapted or implemented manually. The heuristics for finding tangles, as well as the derivation of the cluster assignments, are task-dependent and should be chosen with domain knowledge.
For demonstrations on three types of data (graphs, binary questionnaires, and metric data), see the demo folder.
Download the repository, navigate to the tangles directory, and install the package using

```shell
pip install .
```
Imports

```python
from functools import partial

import numpy as np
import sklearn.metrics

from tangles import cost_functions, data_types, plotting, loading, cut_finding, utils, tree_tangles
```
Load your data

```python
xs, ys = loading.load_GMM(blob_sizes=[200, 200], blob_centers=[[-2, -2], [2, 2]], blob_variances=[[1, 1], [1, 1]], seed=42)
data = data_types.Data(xs=xs, ys=ys)
```

Compute cuts

```python
values, names = cut_finding.a_slice(xs=data.xs, a=20)
bipartitions = data_types.Cuts(values=values, names=names)
```
Define a cost function and order the cuts by cost

```python
cost_fun = cost_functions.gauss_kernel_distance
bipartitions = utils.compute_cost_and_order_cuts(bipartitions, partial(cost_fun, data.xs, None))
```

```
Computing costs of cuts...
100%|█████████████████████████████████████████| 42/42 [00:00<00:00, 1382.53it/s]
```
Invoke the core algorithm to compute the tree of all tangles

```python
tangles_tree = tree_tangles.tangle_computation(cuts=bipartitions,
                                               agreement=20,
                                               verbose=0)
```

Contract the tree to the necessary information

```python
contracted_tree = tree_tangles.ContractedTangleTree(tangles_tree)
contracted_tree.prune(5)
```
Compute soft cluster assignments

```python
contracted_tree.calculate_setP()

# weight cuts by their cost
weight = np.exp(-utils.normalize(bipartitions.costs))

# propagate the soft predictions down the tree
tree_tangles.compute_soft_predictions_children(node=contracted_tree.root,
                                               cuts=bipartitions,
                                               weight=weight,
                                               verbose=3)
contracted_tree.processed_soft_prediction = True
```
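The weighting step maps cut costs to weights in (0, 1], so that cheap (informative) cuts contribute more to the soft predictions than expensive ones. Assuming `utils.normalize` performs min-max scaling (an assumption made here only for illustration), the effect can be reproduced with plain NumPy:

```python
import numpy as np

costs = np.array([0.2, 1.5, 3.0, 7.5])  # example cut costs, already sorted

# min-max scale to [0, 1] (assumed behaviour of utils.normalize)
normalized = (costs - costs.min()) / (costs.max() - costs.min())

# exponential weighting: cheapest cut -> weight 1, most expensive -> exp(-1)
weight = np.exp(-normalized)
```

The weights decrease monotonically with cost, so the cheapest cut always receives weight 1 and the most expensive receives exp(-1).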
Derive a hard clustering from the soft assignments

```python
ys_predicted, _ = utils.compute_hard_predictions(contracted_tree, cuts=bipartitions)
```

Calculate the performance given the ground-truth clusters

```python
ARS = sklearn.metrics.adjusted_rand_score(data.ys, ys_predicted)
NMI = sklearn.metrics.normalized_mutual_info_score(data.ys, ys_predicted)
print('Adjusted Rand Score: {}'.format(np.round(ARS, 4)), flush=True)
print('Normalized Mutual Information: {}'.format(np.round(NMI, 4)), flush=True)
```

```
Adjusted Rand Score: 0.99
Normalized Mutual Information: 0.9773
```
If we have a low-dimensional embedding, we can visualise the hard and soft clusterings:

```python
plotting.plot_hard_predictions(data=data, ys_predicted=ys_predicted, path=None)
plotting.plot_soft_predictions(data=data,
                               contracted_tree=contracted_tree,
                               eq_cuts=bipartitions.equations,
                               path=None)
```
The framework needs a set of cuts as input. Based on these cuts and the datapoints, the algorithm discovers the tangles, i.e. cohesive structures in the data. Some basic implementations are given in tangles/cut_finding. If the data allows for a natural interpretation of cuts (as in a binary questionnaire), this preprocessing step is not needed. A cut-finding method needs to return an array of boolean arrays representing the bipartitions; optionally, it can also return names for the bipartitions to make the output interpretable, otherwise it should return None.
Parameters: xs : ndarray
                data matrix on which the cuts shall be computed
            **kwargs
Returns:    cuts : ndarray, type: bool
                array of boolean arrays of cuts
            names : ndarray, type: string, optional
                optional array of corresponding names for explaining the output, else None
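As an illustration, a minimal cut-finding function matching this interface might look as follows. It bipartitions the data by thresholding random one-dimensional projections at their median; the function name and its internals are a hypothetical sketch, not part of the package:

```python
import numpy as np

def random_projection_cuts(xs, n_cuts=10, seed=0):
    """Hypothetical example of a cut-finding method: bipartition the
    data by thresholding random 1-d projections at their median."""
    rng = np.random.default_rng(seed)
    n_points, n_features = xs.shape
    cuts = np.empty((n_cuts, n_points), dtype=bool)
    names = []
    for i in range(n_cuts):
        direction = rng.normal(size=n_features)   # random projection axis
        projection = xs @ direction
        # points strictly above the median form one side of the bipartition
        cuts[i] = projection > np.median(projection)
        names.append('projection cut {}'.format(i))
    return cuts, np.array(names)
```

Each row of `cuts` is one bipartition of the datapoints; if no interpretable labels exist, the second return value can simply be None.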
The cost function implicitly defines the similarity measure between two objects in the dataset and thus introduces an important bias: it determines which objects are considered close and thereby the cluster structure of the output. Given the dataset and a cut, the cost function returns a real value scoring the usefulness of the cut. High values indicate low informational content and little use; low values indicate useful cuts. Some predefined cost functions are given in tangles/cost_functions.py; a task- and data-dependent cost function can help improve performance.
Parameters: xs : ndarray
                data matrix
            cut : ndarray, type: bool
                boolean array with one entry per datapoint, indicating the bipartition
            **kwargs
Returns:    cost : float
                cost of the cut
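A minimal custom cost function matching this interface could, for instance, sum the similarities of all pairs of points that the cut separates. The name and the Gaussian-kernel choice below are illustrative assumptions, not the package's implementation:

```python
import numpy as np

def gaussian_cut_cost(xs, cut, sigma=1.0):
    """Hypothetical example of a cost function: sum of Gaussian
    similarities across the bipartition.  A cut separating many
    similar points is expensive; a cut through a sparse region is cheap."""
    side_a = xs[cut]
    side_b = xs[~cut]
    # pairwise squared distances between the two sides of the cut
    sq_dists = ((side_a[:, None, :] - side_b[None, :, :]) ** 2).sum(axis=-1)
    return float(np.exp(-sq_dists / (2 * sigma ** 2)).sum())
```

Such a function scores a single cut: a low value means the cut runs between dissimilar regions, so the cut is useful for clustering.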
The output of the core algorithm is the tangle search tree, which holds all tangle information about the data. There are several options to aggregate this information into a clustering and to interpret the output. Given domain knowledge, the provided heuristics can likely be improved.
If you use this software in your research, we encourage you to cite our paper.
```bibtex
@article{DBLP:journals/corr/abs-2006-14444,
  title     = {Tangles: From Weak to Strong Clustering},
  author    = {Solveig Klepper and
               Christian Elbracht and
               Diego Fioravanti and
               Jakob Kneip and
               Luca Rendsburg and
               Maximilian Teegen and
               Ulrike von Luxburg},
  url       = {https://arxiv.org/abs/2006.14444},
  publisher = {arXiv},
  year      = {2020},
}
```