Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP
KDDCup 2005 query classification with word representations
Branch: master

Fetching latest commit…

Cannot retrieve the latest commit at this time

Failed to load latest commit information.
data
scripts
.gitignore
.hgignore
README
README.batch
TODO

README

Query classification with word representations
--------------------------------------

    scripts and steps by Joseph Turian

Based upon the KDDcup challenge.

    http://www.sigkdd.org/kdd2005/kddcup.html

Using l2-regularized logistic regression, balancing positive examples
vs negative examples, as described in Dekang Lin's "Phrase Clustering
for Discriminative Learning".

We use Megam for l2-regularized logistic regression, because it makes
it easy to weight the examples. It might also be interesting to try
svmsgd, Vowpal Wabbit, or a polynomial SVM.

We have instructions and scripts for how we add word representations
(Brown clusters and/or word embeddings) to the training.

INSTALLATION:
-------------

Download and install megam:
    http://www.cs.utah.edu/~hal/megam/
We use the pre-compiled optimized binary version.

You will need my common Python library:
    http://github.com/turian/common

Download data:
    mkdir data/
    cd data/
    mkdir train/
    cd train/
    wget http://www.sigkdd.org/kdd2005/kddcup/KDDCUPData.zip
    unzip KDDCUPData.zip
    cd ..
    mkdir test/
    cd test/
    wget http://www.sigkdd.org/kdd2005/Labeled800Queries.zip
    unzip Labeled800Queries.zip
    cp labeler2.txt.new labeler2.txt
# Last line is necessary because of inconsistent formatting in one line.

    
Download word representations:
    # Go back to base directory
    mkdir representations/
    cd representations/

    wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c100-freq1.txt
    wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c320-freq1.txt
    wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c1000-freq1.txt
    wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c3200-freq1.txt

    wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-1750000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.EMBEDDING_SIZE%3d200.txt.gz
    wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2030000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.EMBEDDING_SIZE%3d100.txt.gz
    wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2270000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.txt.gz
    wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2280000000.LEARNING_RATE%3d1e-08.EMBEDDING_LEARNING_RATE%3d1e-07.EMBEDDING_SIZE%3d25.txt.gz
    ln -s model-1750000000.LEARNING_RATE=1e-09.EMBEDDING_LEARNING_RATE=1e-06.EMBEDDING_SIZE=200.txt.gz cw-embeddings-200dim.txt.gz
    ln -s model-2030000000.LEARNING_RATE\=1e-09.EMBEDDING_LEARNING_RATE\=1e-06.EMBEDDING_SIZE\=100.txt.gz cw-embeddings-100dim.txt.gz 
    ln -s model-2270000000.LEARNING_RATE\=1e-09.EMBEDDING_LEARNING_RATE\=1e-06.txt.gz cw-embeddings-50dim.txt.gz
    ln -s model-2280000000.LEARNING_RATE\=1e-08.EMBEDDING_LEARNING_RATE\=1e-07.EMBEDDING_SIZE\=25.txt.gz cw-embeddings-25dim.txt.gz

    wget http://pylearn.org/turian/hlbl_reps_clean_1.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz
    wget http://pylearn.org/turian/hlbl_reps_clean_2.50d.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz
    ln -s hlbl_reps_clean_1.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz hlbl-embeddings-100dim.txt.gz
    ln -s hlbl_reps_clean_2.50d.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz hlbl-embeddings-50dim.txt.gz 

RUNNING
----

See README.batch

NOTE
----

I generate one feature file per l2 parameter. Otherwise, different runs might clobber each other.
Something went wrong with that request. Please try again.