Nystrom Bestiary v2.0
written by Alex Gittens
licensed under the Creative Commons ShareAlike 4.0 International License
Nystrom Bestiary is a collection of code for experimenting with various symmetric positive semidefinite (SPSD) sketches: Nystrom extensions based on column sampling, Nystrom extensions based on random mixtures of columns, and 'pinched' and 'prolonged' eigensketches. For the precise definitions of the latter two sketches, see equations (5.11) and (5.12) of the review paper "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions".
It was used to produce the figures in the paper "Revisiting the Nystrom Method for Improved Large-scale Machine Learning" (arXiv preprint link) by Alex Gittens and Michael Mahoney. In particular, the experimental setup to generate exactly those figures is included.
Send comments, suggestions, and complaints to gittens AT icsi FULLSTOP berkeley FULLSTOP edu
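
For readers new to these sketches, here is a minimal illustration of the two Nystrom-type extensions mentioned above, both of the form `C * pinv(W) * C'`. It is a sketch under stated assumptions, not the code in `extensions/`: the variable names, the test matrix, the uniform sampling scheme, and the pseudoinverse tolerance are all chosen here for illustration.

```matlab
% Illustrative only: none of these names come from the toolbox itself.
rng(0);                            % fix the seed so the run is repeatable
n = 200; r = 10; ell = 20;         % matrix size, true rank, sketch size

X = randn(n, r);
A = X * X';                        % a rank-r SPSD test matrix

% Form C * pinv(W) * C', dropping singular values of W below a
% relative tolerance (the 1e-10 threshold is an assumption).
sketch = @(C, W) C * pinv(W, 1e-10 * norm(W)) * C';

% (1) Column sampling: pick ell columns uniformly without replacement.
idx = randperm(n, ell);
C = A(:, idx);                     % the sampled columns of A
W = A(idx, idx);                   % the matching ell-by-ell block of A
A_cols = sketch(C, W);

% (2) Random mixtures of columns: S is a dense Gaussian matrix.
S = randn(n, ell);
C = A * S;                         % ell random mixtures of the columns
W = S' * C;                        % equals S' * A * S
W = (W + W') / 2;                  % symmetrize against round-off
A_mix = sketch(C, W);

relErrCols = norm(A - A_cols, 'fro') / norm(A, 'fro');
relErrMix  = norm(A - A_mix,  'fro') / norm(A, 'fro');
fprintf('column sampling: %.2e, Gaussian mixture: %.2e\n', relErrCols, relErrMix);
```

Because `ell` exceeds the rank of this test matrix, both sketches should reproduce `A` up to round-off; on matrices that are not exactly low rank, the two schemes generally give different errors.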
- `extensions/` directory contains the implementations of various Nystrom extensions
- `io/` directory contains the code used to create, load, and process the datasets
- `datasets/` directory contains the datasets used in the experiments
- `experiments/` directory contains a set of m-files that actually runs the Nystrom extensions on various datasets and stores statistics on the errors and timing
- `outputs/` directory is used to store the output of the experiments
- `plots/` directory stores the plots of the timings and errors
- `auxiliary/` directory contains code needed in computing the extensions
- `visualization/` directory contains the code used to produce the plots of the various timings and errors
- `misc/` directory contains miscellany (so far, the code to generate the data for Table 2 in the paper)
ALL m-files should be run from the base folder; otherwise you will run into path issues.
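
One way to catch a wrong working folder early is to check for a few of the subdirectories listed above before adding anything to the path. The helper below is hypothetical (it is not part of the toolbox), and the choice of which directories to check is an assumption.

```matlab
function missing = missingBestiaryDirs(baseDir)
% MISSINGBESTIARYDIRS  Hypothetical helper, not shipped with the toolbox.
%   Returns the expected subdirectories that are absent from baseDir.
%   Possible use from the Matlab prompt:
%       m = missingBestiaryDirs(pwd);
%       if ~isempty(m), error('Missing: %s', strjoin(m, ', ')); end
%       addpath(genpath('.'))
    expected = {'extensions', 'io', 'experiments', 'auxiliary', 'visualization'};
    % exist(..., 'dir') returns 7 when the name is an existing folder
    present = cellfun(@(d) exist(fullfile(baseDir, d), 'dir') == 7, expected);
    missing = expected(~present);
end
```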
To produce the figures in the paper:
#####Short story
Ensure that you are in the base directory, NystromBestiary, and run the following commands from the Matlab prompt:

    addpath(genpath('.'))
    create_bestiary_datasets
    maxNumCompThreads(1); % if you want accurate timing info
    runall
    visualizeall

#####Long story
- add all the subdirectories in this folder to your path
- run `create_bestiary_datasets` to generate some required distance matrices; this step generates about 1.5 GB of data
- if you want email notifications at the start and end of each experiment, edit `runall.m` to set the `sendEmails` flag to true and set the email-related variables appropriately, then run `setpref('Internet', 'SMTP_Password', 'youremailpassword')` at the Matlab command line
- run `runall` (or pick individual experiments) in the experiments directory; this step generates about 2.7 GB of data
- wait several days for the experiments to stop running!
- run `visualizeall` to produce the plots of the timings and errors

The PDFs will be located in the output directory.
See the individual m-files for more details. Make appropriate modifications to substitute your own datasets.
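
The short story suggests limiting Matlab to a single computational thread for accurate timing. A minimal sketch of that pattern follows, assuming the function-call form `maxNumCompThreads(N)`, which returns the previous setting so it can be restored afterwards. The workload is a stand-in, not one of the toolbox experiments.

```matlab
% Sketch: time a computation on one thread, then restore the old setting.
previousThreads = maxNumCompThreads(1);   % set to 1, keep the previous maximum

B = randn(500);
B = B * B';                               % stand-in SPSD workload (assumption)

tic;
e = eig(B);                               % the computation being timed
elapsed = toc;

maxNumCompThreads(previousThreads);       % restore the original thread count
fprintf('single-threaded eig took %.3f s\n', elapsed);
```

Note that a plain assignment such as `maxNumCompThreads = 1;` only creates a variable with that name and does not change the thread count.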
- jdqrpcg.m is due to Yvan Notay (see the m file for full attribution)
- notifier.m is due to Benjamin Krause (see the m file for full attribution)
For dataset provenances, see Table 3 in the above-mentioned paper (datasets: Abalone, Wine, Spam, Kin8nm, Dexter, Gisette, Enron, Protein, SNPs, HEP, GR, Gnutella).
Two additional datasets, Cranfield and Medline, are from the website of the Text to Matrix Generator Matlab Toolbox.