Under the folder toy_example, we provide a Jupyter notebook jan22_toy_example.ipynb that works through the training and evaluation of our autoencoder (as well as other baseline algorithms) on the synthetic1 dataset. We highly recommend that interested readers take a look before diving into our code.
The source code contains four parts:

- Core
  - model.py
  - utils.py
  - datasets.py
  - baselines.py
- Code for each dataset
  - synthetic_main.py
  - synthetic_powerlaw_main.py
  - amazon_main.py
  - amazon_parallel_l1.py
  - rcv1_main.py
  - rcv1_parallel_l1.py
- Scripts for reproducing our results
  - scripts/synthetic1.sh
  - scripts/synthetic2.sh
  - scripts/amazon.sh
  - scripts/rcv1.sh
  - scripts/synthetic_powerlaw.sh
- Code and scripts for one of the baselines, Simple AE + l1-min
  - synthetic_simpleAE.py
  - amazon_simpleAE.py
  - rcv1_simpleAE.py
  - scripts are under simpleAE_scripts/
To reproduce our experimental results, first run `chmod +x scripts/*.sh` to make the scripts executable. After that, run the given scripts:

```
$ ./scripts/synthetic1.sh
$ ./scripts/synthetic2.sh
$ ./scripts/amazon.sh
$ ./scripts/rcv1.sh
$ ./scripts/synthetic_powerlaw.sh
```
Note:

- The results are stored in a Python dictionary, which is then saved under the folder `ckpts/`. They can be used to reproduce the figures shown in our paper.
- Before running `amazon.sh`, download `train.csv` from this Kaggle competition and specify its location via `--data_dir`.
- The RCV1 dataset will be fetched automatically using the `sklearn.datasets.fetch_rcv1` function.
- To reproduce the results of one of the baselines, Simple AE + l1-min, run the scripts under the folder `simpleAE_scripts/`.
- For high-dimensional vectors, solving l1-min using Gurobi takes a long time on a single CPU. To speed this up, we solve l1-min in parallel on a multi-core machine. In `amazon_main.py` and `rcv1_main.py`, performance is evaluated on a small subset of the test samples (while training is still done using the complete training set). After training the autoencoder, we use a multi-core machine to solve l1-min in parallel on the complete test set via `amazon_parallel_l1.py` and `rcv1_parallel_l1.py`. Depending on your machine, solving l1-min in parallel on the complete test set may still take a long time, so we recommend first running `amazon_parallel_l1.py` and `rcv1_parallel_l1.py` on a small subset (by setting small values for the parameters `num_core` and `batch` in the Python files).
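To illustrate the idea behind the parallel l1-min step, here is a minimal, self-contained sketch. It substitutes `scipy.optimize.linprog` for Gurobi and uses `joblib.Parallel` to spread test samples across cores; all sizes, variable names, and the tiny synthetic "test set" below are illustrative, not the repo's actual setup.

```python
import numpy as np
from joblib import Parallel, delayed
from scipy.optimize import linprog

def l1_min(A, y):
    """Solve min ||x||_1 subject to Ax = y via the standard LP split
    x = u - v with u, v >= 0, minimizing sum(u) + sum(v)."""
    m, n = A.shape
    c = np.ones(2 * n)                      # objective: sum(u) + sum(v)
    A_eq = np.hstack([A, -A])               # equality constraint A(u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * n))
    u, v = res.x[:n], res.x[n:]
    return u - v

rng = np.random.RandomState(0)
n, m = 20, 8
x_true = np.zeros(n)
x_true[[3, 11]] = [1.5, -2.0]               # a 2-sparse ground-truth signal
A = rng.randn(m, n)                         # random Gaussian measurements

# a stand-in "test set" of measurement vectors, solved in parallel
ys = [A @ x_true for _ in range(3)]
xs = Parallel(n_jobs=2)(delayed(l1_min)(A, y) for y in ys)

# each recovered x should satisfy the measurement constraints
residual = max(np.abs(A @ x - y).max() for x, y in zip(xs, ys))
print("max constraint residual:", residual)
```

In the actual pipeline, the per-sample solve uses Gurobi and the batch size and core count are controlled by the `batch` and `num_core` parameters mentioned above.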
Here is our software environment:
- Python 2.7.12
- numpy 1.13.3
- sklearn 0.19.1
- scipy 1.0.0
- joblib 0.10.0
- Tensorflow r1.4
- Gurobi 7.5.1