WingBaldridge_EMNLP2014

Hierarchical Discriminative Classification for Text-Based Geolocation

This page explains the process of replicating the results of:

Benjamin Wing and Jason Baldridge. Hierarchical Discriminative Classification for Text-Based Geolocation. EMNLP 2014. Doha, Qatar.

Getting the code

The first step is to get the code. Check out or download the code from

 https://github.com/utcompling/textgrounder/commits/emnlp-2014-release-candidate-same-results

Setting things up

You'll need to set the $TEXTGROUNDER_DIR environment variable to the root of the TextGrounder source tree and add $TEXTGROUNDER_DIR/bin to your $PATH. You will also need to make sure that $JAVA_HOME points to the location of your Java installation (e.g. /Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home on Mac OS X Mavericks).
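
For example, in bash (the paths below are illustrative; substitute your own locations):

 export TEXTGROUNDER_DIR=$HOME/textgrounder
 export PATH=$TEXTGROUNDER_DIR/bin:$PATH
 export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home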

Installing Vowpal Wabbit

The experiments in the paper used Vowpal Wabbit to do logistic regression. You will need to download and compile the program from the following location:

 https://github.com/JohnLangford/vowpal_wabbit

Building Vowpal Wabbit also requires the C++ Boost library. It is easiest to install Boost using a package manager. For example, on Mac OS X with MacPorts, run

 sudo port install boost

This will install Boost in /opt/local/lib and /opt/local/include. (The Vowpal Wabbit Makefile knows about these locations.)

Finally, make sure that the vw executable is in your PATH.
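
A minimal build sketch follows (the exact steps, and the location of the built vw binary, may vary with the Vowpal Wabbit version, so treat this as a guide rather than a recipe):

 git clone https://github.com/JohnLangford/vowpal_wabbit.git
 cd vowpal_wabbit
 make
 export PATH=$PWD/vowpalwabbit:$PATH  # the built vw binary typically ends up here
 which vw                             # verify that vw is now found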

Getting the data

Next you'll need the data. The Wikipedia and Twitter data sets are available from

http://web.corral.tacc.utexas.edu/utcompling/wing-baldridge-2014/
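
For example, you can mirror the whole directory with wget (an illustrative invocation; adjust the flags to taste):

 wget -r -np -nH --cut-dirs=1 http://web.corral.tacc.utexas.edu/utcompling/wing-baldridge-2014/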

For instructions on obtaining and processing the Cophir data set, please contact the first author (ben@benwing.com).

Compiling

Run

 textgrounder build compile

This should download a number of packages and then compile the code.

Running

Running the hierarchical classifier using a uniform grid

To run the program with Vowpal Wabbit as the classifier, doing hierarchical classification over a uniform grid, do the following:

$ tg-geolocate --memory MEMORY $PATH_TO_CORPUS/$CORPUS_NAME \
    --ranker hier-classifier --classifier vowpal-wabbit --classify-features gram-binary \
    --num-levels 3 --dpc $DEGREES_PER_CELL \
    --subdivide-factor $SUBDIVIDE_FACTOR --beam-size $BEAM_SIZE \
    --vw-args "--bfgs -b 26 --passes 40 --holdout_off" \
    --nested-vw-args "--bfgs -b 24 --passes 12 --holdout_off" \
    --debug parallel-hier-classifier,parallel-evaluation \
    --eval-set (dev|test)

where (dev|test) indicates that you should write either dev or test, for the development or test set, respectively.

Note that this uses the Vowpal Wabbit parameters we used by default: 26-bit features and 40 passes for the top level, and 24-bit features and 12 passes for the lower levels. Some experiments used different parameters, as described in the paper. DEGREES_PER_CELL, SUBDIVIDE_FACTOR and BEAM_SIZE are as described in the paper. The amount of memory required varies somewhat from experiment to experiment; it is specified as, e.g., 32g for 32 gigabytes, which is what we used for most experiments.
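
As a concrete illustration, here is the same command with placeholder values filled in (enwiki is a hypothetical corpus name, and these particular settings are not necessarily the ones used for any given experiment in the paper):

$ tg-geolocate --memory 32g $PATH_TO_CORPUS/enwiki \
    --ranker hier-classifier --classifier vowpal-wabbit --classify-features gram-binary \
    --num-levels 3 --dpc 1.5 --subdivide-factor 2 --beam-size 10 \
    --vw-args "--bfgs -b 26 --passes 40 --holdout_off" \
    --nested-vw-args "--bfgs -b 24 --passes 12 --holdout_off" \
    --debug parallel-hier-classifier,parallel-evaluation \
    --eval-set dev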

Note that the --debug flags in this case aren't actually related to debugging, but turn on parallel training (parallel-hier-classifier) and evaluation (parallel-evaluation). This is useful on multi-core machines but will increase the total memory usage, as multiple instances of Vowpal Wabbit will run at once. (The memory used here is separate from the memory specified using --memory, which only controls the Java virtual machine.)

Loading and saving models

The above experiment may take a long time, and will save all the learned models in /tmp or equivalent. You can use --load-vw-model, --save-vw-model, --load-vw-submodels, --save-vw-submodels, --load-vw-submodels-levels and --save-vw-submodels-levels to load and/or save the top-level or lower-level models.

For example, to load a previously-saved top-level model and save out lower-level models, run as follows:

$ tg-geolocate ... --load-vw-model $CORPUS.bfgs.b26.passes40.dpc$DEGREES_PER_CELL.model \
    --save-vw-submodels $CORPUS.hier.dpc$DEGREES_PER_CELL.subdiv$SUBDIVIDE_FACTOR.bfgs.b24.passes12.l%l.i%i.submodel

To load the level 1 and level 2 models, while saving level 3 models, run as follows:

$ tg-geolocate ... --load-vw-model $CORPUS.bfgs.b26.passes40.dpc$DEGREES_PER_CELL.model \
    --load-vw-submodels $CORPUS.hier.dpc$DEGREES_PER_CELL.subdiv$SUBDIVIDE_FACTOR.bfgs.b24.passes12.l%l.i%i.submodel \
    --load-vw-submodels-levels 2 \
    --save-vw-submodels $CORPUS.hier.dpc$DEGREES_PER_CELL.subdiv$SUBDIVIDE_FACTOR.bfgs.b24.passes12.l%l.i%i.submodel \
    --save-vw-submodels-levels 3
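
The %l and %i in the submodel filenames are substituted per submodel; presumably %l is the hierarchy level and %i an index identifying the parent cell, so saved level-2 submodels might have names like the following (hypothetical):

 enwiki.hier.dpc1.5.subdiv2.bfgs.b24.passes12.l2.i0.submodel
 enwiki.hier.dpc1.5.subdiv2.bfgs.b24.passes12.l2.i1.submodel
 ...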

Running the hierarchical classifier using a K-d grid

To run the program with Vowpal Wabbit as the classifier, doing hierarchical classification over a K-d grid, do the following:

$ tg-geolocate --memory MEMORY $PATH_TO_CORPUS/$CORPUS_NAME \
    --ranker hier-classifier --classifier vowpal-wabbit --classify-features gram-binary \
    --num-levels 3 --kd-tree --kd-bucket-size $BUCKET_SIZE \
    --subdivide-factor $SUBDIVIDE_FACTOR --beam-size $BEAM_SIZE \
    --vw-args "--bfgs -b 26 --passes 40 --holdout_off" \
    --nested-vw-args "--bfgs -b 24 --passes 12 --holdout_off" \
    --debug parallel-hier-classifier,parallel-evaluation \
    --eval-set (dev|test)

Compared to the uniform-grid command, the only change is that --dpc is replaced by --kd-tree and --kd-bucket-size.

Running flat logistic regression

To run the program with Vowpal Wabbit as the classifier, but doing flat logistic regression over a uniform grid instead of hierarchical classification, and saving out the model (so that, e.g., it can later be loaded as level 1 of a hierarchical classifier), do the following:

$ tg-geolocate --memory MEMORY $PATH_TO_CORPUS/$CORPUS_NAME \
    --ranker classifier --classifier vowpal-wabbit --classify-features gram-binary \
    --dpc $DEGREES_PER_CELL \
    --vw-args "--bfgs -b 26 --passes 40 --holdout_off" \
    --debug parallel-evaluation --eval-set (dev|test) \
    --save-vw-model $CORPUS.bfgs.b26.passes40.dpc$DEGREES_PER_CELL.model
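
Note that the filename pattern here matches the one passed to --load-vw-model in the hierarchical examples above, so a typical two-step workflow looks like this (a sketch; the elided flags are as in the full commands above):

$ tg-geolocate ... --ranker classifier --dpc $DEGREES_PER_CELL \
    --save-vw-model $CORPUS.bfgs.b26.passes40.dpc$DEGREES_PER_CELL.model
$ tg-geolocate ... --ranker hier-classifier --dpc $DEGREES_PER_CELL \
    --load-vw-model $CORPUS.bfgs.b26.passes40.dpc$DEGREES_PER_CELL.model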

Running Naive Bayes

To run the program using Naive Bayes as the classifier, using a uniform grid, do the following:

$ tg-geolocate --memory MEMORY $PATH_TO_CORPUS/$CORPUS_NAME --ranker naive-bayes --dpc $DEGREES_PER_CELL --eval-set (dev|test)

The main thing to note in this case is that --debug parallel-evaluation should probably not be specified: it seems to greatly increase the memory requirements, and the Naive Bayes ranker already has parallelism built in.

Support

Please contact Ben Wing (ben@benwing.com) with any questions about replicating the results. This program can take some effort to get up and running, so please feel free to ask for help.