The code in this repository supplements the SIGMOD 2018 paper Sketching Linear Classifiers over Data Streams by Kai Sheng Tai, Vatsal Sharan, Peter Bailis, and Gregory Valiant.
The Weight-Median Sketch (WM-Sketch) and Active-Set Weight-Median Sketch (AWM-Sketch) are two algorithms for learning compressed, memory-budgeted linear classifiers on streaming data. In addition to supporting fast binary classification, these sketches also support the approximate retrieval of high-magnitude model weights in the original, uncompressed feature space -- this is analogous to the estimation of frequent items using the Count-Sketch or the Count-Min Sketch.
The WM-Sketch and AWM-Sketch can be seen as improved variants of feature hashing, a commonly used technique for handling high-dimensional feature spaces (e.g., in scikit-learn and Vowpal Wabbit).
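To convey the flavor of this estimation, the following C++ snippet is an illustrative Count-Sketch-style sketch, not the library's actual API, and all names in it are hypothetical: it hashes each feature into a small table of sign-flipped weight buckets and recovers a feature's weight as the median of its signed bucket values across independent rows.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

// Hypothetical illustration: a depth x width table of weights, with one bucket
// hash and one sign hash per row. A feature's original-space weight is
// recovered as the median of its signed table entries.
struct WeightSketch {
    size_t depth, width;
    std::vector<double> table;    // depth * width weight buckets
    std::vector<uint64_t> seeds;  // one hash seed per row

    WeightSketch(size_t d, size_t w) : depth(d), width(w), table(d * w, 0.0) {
        for (size_t i = 0; i < d; ++i) seeds.push_back(0x9e3779b97f4a7c15ull * (i + 1));
    }

    size_t bucket(size_t row, uint64_t feature) const {
        return std::hash<uint64_t>{}(feature ^ seeds[row]) % width;
    }
    double sign(size_t row, uint64_t feature) const {
        return (std::hash<uint64_t>{}((feature << 1) ^ seeds[row]) & 1) ? 1.0 : -1.0;
    }

    // Push a (sign-corrected) update onto one feature's buckets.
    void update(uint64_t feature, double delta) {
        for (size_t row = 0; row < depth; ++row)
            table[row * width + bucket(row, feature)] += sign(row, feature) * delta;
    }

    // Estimate the feature's weight in the original, uncompressed space.
    double estimate(uint64_t feature) const {
        std::vector<double> vals;
        for (size_t row = 0; row < depth; ++row)
            vals.push_back(sign(row, feature) * table[row * width + bucket(row, feature)]);
        std::nth_element(vals.begin(), vals.begin() + vals.size() / 2, vals.end());
        return vals[vals.size() / 2];
    }
};

int main() {
    WeightSketch sketch(5, 1 << 10);  // 5 rows of 1024 buckets each
    sketch.update(42, 0.5);           // repeatedly push weight onto feature 42
    sketch.update(42, 0.5);
    std::cout << "estimated weight of feature 42: " << sketch.estimate(42) << "\n";
    return 0;
}
In the actual WM-Sketch, the updates come from online gradients of the logistic loss on the compressed weights; see the paper for details.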
The code can be built using CMake. From the root project directory, run:
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
This builds the library libwmsketch and the binaries wmsketch_classification and wmsketch_pmi.
wmsketch_classification trains binary linear classifiers using the WM-Sketch and other baseline methods described in the paper. wmsketch_pmi is an application of the WM-Sketch to streaming pointwise mutual information estimation -- this is described in more detail below.
The wmsketch_classification binary trains logistic regression classifiers using the WM-Sketch, the AWM-Sketch, and a number of other baseline methods described in the paper.
Training and testing data should be provided in LIBSVM format -- many standard classification datasets are available in LIBSVM format at this link.
For example, the RCV1 binary classification dataset can be downloaded here (we use the split marked "test" for training since it is larger than the "train" split).
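In LIBSVM format, each line holds a label followed by sparse index:value feature pairs, for example (made-up values):
+1 3:0.25 47:1.0 1024:0.5
-1 7:0.75 47:0.3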
Run:
wmsketch_classification --train <dataset_dir>/rcv1_test.binary
This outputs the following in JSON format:
{
  "params": {
    "consv_update": false,
    "count_smooth": 1.0,
    "depth": 1,
    "feature_dim": 47237,
    ...
  },
  "results": {
    ...
    "train_count": 677399,
    "train_err_count": 34049,
    "train_err_rate": 0.050264319846944,
    "train_ms": 13078
  }
}
The train_err_rate field indicates the online error rate achieved by the classifier: the fraction of examples that were misclassified by the classifier before applying the online update. This value can differ from run to run due to the randomness in defining the sketch -- for deterministic results, the random seed can be set using the --seed option.
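In other words, each incoming example is scored with the current model and only then used for an update (progressive validation). The following toy C++ snippet illustrates this accounting with a minimal dense logistic model; it is not the library's implementation, which operates on sketched weights.
#include <cmath>
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

// Toy dense logistic model, used only to illustrate online error accounting.
struct ToyModel {
    std::vector<double> w;
    explicit ToyModel(size_t dim) : w(dim, 0.0) {}

    double score(const std::vector<double>& x) const {
        double s = 0.0;
        for (size_t i = 0; i < x.size(); ++i) s += w[i] * x[i];
        return s;
    }
    // One step of online gradient descent on the logistic loss.
    void update(const std::vector<double>& x, int label, double lr = 0.1) {
        double g = -label / (1.0 + std::exp(label * score(x)));
        for (size_t i = 0; i < x.size(); ++i) w[i] -= lr * g * x[i];
    }
};

int main() {
    // Tiny synthetic stream: the label is the sign of the first coordinate.
    std::vector<std::pair<std::vector<double>, int>> stream = {
        {{1.0, 0.2}, +1}, {{-0.8, 0.5}, -1}, {{0.9, -0.1}, +1}, {{-1.2, 0.3}, -1}};
    ToyModel model(2);
    size_t errors = 0;
    for (const auto& [x, label] : stream) {
        int pred = model.score(x) >= 0.0 ? +1 : -1;  // classify BEFORE updating
        if (pred != label) ++errors;                 // count an online error
        model.update(x, label);                      // then apply the online update
    }
    std::cout << "online error rate: "
              << static_cast<double>(errors) / stream.size() << "\n";
    return 0;
}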
The top_indices and top_weights fields indicate the estimates for the highest-magnitude model weights. These can be compared to the indices and weights returned when training a full, uncompressed classifier (using the option --method logistic) to evaluate the accuracy of top-k weight estimation achieved by a memory-budgeted method.
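One simple way to quantify top-k recovery is the overlap (recall) between the two top-k index sets. Below is a small C++ sketch of this comparison, assuming the index lists have already been parsed out of the two JSON outputs (the index values shown are made up).
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <unordered_set>
#include <vector>

// Fraction of the uncompressed model's top-k feature indices that also appear
// in the sketch's top-k estimate (recall@k).
double topk_recall(const std::vector<uint64_t>& exact_topk,
                   const std::vector<uint64_t>& sketch_topk) {
    std::unordered_set<uint64_t> sketch_set(sketch_topk.begin(), sketch_topk.end());
    size_t hits = 0;
    for (uint64_t idx : exact_topk)
        if (sketch_set.count(idx) > 0) ++hits;
    return exact_topk.empty() ? 0.0 : static_cast<double>(hits) / exact_topk.size();
}

int main() {
    // Hypothetical top-5 indices taken from the two runs' top_indices fields.
    std::vector<uint64_t> exact  = {101, 7, 2048, 33, 512};
    std::vector<uint64_t> sketch = {7, 101, 33, 999, 512};
    std::cout << "recall@5: " << topk_recall(exact, sketch) << "\n";  // prints 0.8
    return 0;
}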
By default, the executable uses the AWM-Sketch to learn the classifier. The full list of methods is:
logistic: Uncompressed logistic regression
logistic_sketch: WM-Sketch
activeset_logistic: AWM-Sketch
truncated_logistic: Simple truncation to k highest-magnitude features
probtruncated_logistic: Probabilistic truncation to k highest-magnitude features (similar to weighted reservoir sampling)
countmin_logistic: Track most frequent features as estimated by a Count-Min sketch
spacesaving_logistic: Track most frequent features as estimated by the Space Saving algorithm
For the feature hashing baseline, run with the options --method logistic_sketch --depth 1.
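For example, to compare the default AWM-Sketch against the uncompressed model and the feature hashing baseline on the same dataset, one could run (using only the flags shown above):
wmsketch_classification --train <dataset_dir>/rcv1_test.binary
wmsketch_classification --train <dataset_dir>/rcv1_test.binary --method logistic
wmsketch_classification --train <dataset_dir>/rcv1_test.binary --method logistic_sketch --depth 1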
See the paper for full details on the baseline methods.
For a full list of options, run:
wmsketch_classification --help
Pointwise mutual information (PMI) is an information theoretic measure of association between random events. In natural language processing, PMI is often used as a measure of the degree to which pairs of words tend to co-occur in text -- bigrams with high PMI are more "phrase-like" since the occurrence of one word is indicative of the co-occurrence of the other. In fact, the popular word2vec skip-gram algorithm for learning word embeddings can be interpreted as an implicit factorization of a matrix of PMI values.
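Concretely, for a pair of words x and y, PMI(x, y) = log( p(x, y) / (p(x) p(y)) ), where p(x, y) is the probability that the two words co-occur and p(x) and p(y) are their individual occurrence probabilities; the PMI is positive when the pair co-occurs more often than it would if the two words appeared independently.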
As an application of the WM-Sketch, we show how word pairs with high PMI can be identified online from a stream of text data while using a small memory budget of less than 2MB.
In the paper, we ran our experiments on the first 20 files of the Google Billion-Word Language Modeling Benchmark, which was derived from a newswire corpus. This dataset can be downloaded here.
Following the --data
flag, the executable accepts a list of whitespace-delimited paths to data files. We assume that
each line in each file is a separate sentence and that each sentence has already been tokenized.
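A (hypothetical) input file holding two pre-tokenized sentences would therefore look like:
the index rose by two per cent .
don 't forget to buy a ticket .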
For example, to run on the first two files of the dataset:
wmsketch_pmi --data <dataset_dir>/news.en-00001-of-00100 <dataset_dir>/news.en-00002-of-00100
This outputs a list of bigrams along with their estimated PMI values in JSON format:
{
  ...
  "results": {
    "num_tokens": 7769690,
    "tokens": [
      ["!", "!"],
      ["don", "'t"],
      ["per", "cent"],
      ["united", "states"],
      ["$", "million"],
      ...
    ],
    ...
  }
}
For a full list of options, run:
wmsketch_pmi --help