Skip to content

sean-chester/generalised-brown

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 

generalised-brown

version 1.0 © 2015 Sean Chester and Leon Derczynski


Table of Contents


Introduction

The generalised-brown software suite clusters word types by distributional similarity in two phases. It first generates a list of merges based on the well-known Brown clustering algorithm and then recalls historical states to vary the granularity of the clusters. For example, given the following corpus:

Alice likes dogs and Bob likes cats while Alice hates snakes and Bob hates spiders

Greedily clustering word types based on average mutual information (i.e., running the C++ merge generator) produces the following merge list (assuming a = |V| = 10):

snakes spiders 8

dogs cats 7

Alice Bob 6

and while 5

likes hates 4

dogs snakes 3

dogs and 2

dogs Alice 1

dogs likes 0

One can then recall any historical state of the computation in order to produce a set of clusters (i.e., run the python cluster generator). For example, with c = 5, we recall the state c - 1 = 4 to produce the following clusters:

{snakes, spiders} {dogs, cats} {Alice, Bob} {likes, hates} {and, while}

This approach (setting separate values of a and c) we refer to as Roll-up feature generation. By contrast, traditional Brown clustering would produce the following five clusters (equivalent to running the C++ merge generator with a = 5 and the python cluster generator with c = 5):

{likes, hates} {snakes, spiders, cats, dogs} {and, while} {Alice} {Bob}

For details about the concepts implemented in this software, please read our recent AAAI paper:

L. Derczynski and S. Chester. 2016. "Generalised Brown Clustering and Roll-up Feature Generation." In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16). 7 pages. To appear.

For details about traditional Brown clustering, consult the article in which it was introduced:

PF Brown et al. 1992. "Class-based n-gram models of natural language." Computational Linguistics 18(4): 467--479.

or the implementation that our C++ merge generator forked:

wcluster.


Requirements

generalised-brown relies on the following applications:

  • For compiling the C++ merge generator: A C++ compiler that is compatible with C++ 11 and OpenMP (e.g., the newest GNU compiler) and the make program

  • For running the python cluster generator: A python interpreter


Installation

The python cluster generator does not need to be compiled. To compile the C++ merge generator, navigate to the merge_generator/ subdirectory of the project and type:

make


Usage

To produce a set of features for a corpus, you will first want to use Generalised Brown (i.e., the C++ merge generator) to create a merge list. Then, you can create c clusters by running the python cluster generator on the merge list. This second step can be done for as many values of c as you like, but we recommend that each value of c is not larger than the value of a used to generate the merge list.

To run the C++ merge generator, type:

./merge_generator/wcluster --text [input_file] --a [active_set_size]

The resultant merges will be recorded in:

./[input_file]-c[active_set_size]-p1.out/merges

To run the python cluster generator, type:

python ./cluster_generator/cluster.py -in ./[input_file]-c[active_set_size]-p1.out/merges -c 3

Each word type will be printed to stdout with its cluster id.

The C++ merge generator runs in O(|V| a^2) time, where |V| is the number of distinct word types in the corpus (i.e., the size of the vocabulary) and a is a bound on the algorithm's search space. The python cluster generator runs in O(|V|) time.


License

This software consists of two sub-modules, each released under a different license:

  • The python cluster generator is subject to the terms of The MIT License

  • The C++ merge generator follows the original licensing terms
    of wcluster.

See the relevant sub-directories of this repository for the specific details of each license.


Datasets

You can find aclusters induced from the RCV1 data, including a full merge file, with a=2560, here: http://derczynski.com/sheffield/resources/rcv.a2560.tar.bz2


Contact

This software suite will undergo a major revision; so, you are encouraged to ensure that this is still the latest version. Please do not hesitate to contact the authors if you have comments, questions, or bugs to report.

generalised-brown on GitHub