GitHub - vioshyvo/genome_test: Testing the MRPT algorithm with the ultra-high-dimensional Ecoli genome data set.

Pipeline

Installing C++ tools

./install_fsm.sh

Installs fsm-lite, if not already installed.
Installs Eigen 3.3.4, if not already installed.
Installs googletest, if not already installed.

Local scripts

Scripts that can be used to run C++ tools locally

./prepare_data.sh <zip-file> <data-name>

Unzips the data from zip-file (give the full path of the zip file), preprocesses it for fsm-lite, and saves the files to data/data-name.
Removes the spades.fa files and renames the remaining fasta files to f0001, f0002, ...
Writes the names of the original files to data_name_file_list in the data directory.
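The renaming step can be illustrated with a minimal Python sketch. This is not the code of prepare_data.sh. The function name plan_renames, the alphabetical ordering, and the rule that a file is skipped when its name ends with spades.fa are assumptions made for illustration.

```python
def plan_renames(file_names, skip_suffix="spades.fa"):
    """Return (original_name, new_name) pairs for the files that are kept.

    Assumption: files whose names end with skip_suffix are dropped and the
    remaining files are numbered in alphabetical order, starting from f0001.
    """
    kept = sorted(name for name in file_names if not name.endswith(skip_suffix))
    return [(name, f"f{i:04d}") for i, name in enumerate(kept, start=1)]


if __name__ == "__main__":
    # Hypothetical file names, used only to show the mapping.
    for original, new in plan_renames(["s2.fa", "s1.fa", "s1.spades.fa"]):
        print(original, "->", new)
```

The list of original names returned here corresponds to the information that the script stores in the file list, so that f0001, f0002, ... can be mapped back to the original samples.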

./read_kmers.sh <data-name> <n-points>

Wrapper for fsm-lite. Reads k-mer counts from the fasta files in the directory data/data-name into the sparse matrix data/<data-name><n_points>/<data-name><n_points>.mat.
Argument data-name is the name of the data set, for example Ecol.
Argument n_points controls how many of the first points are read, for example 250.
With these example values the script creates the file data/Ecol250/Ecol250.mat.

./writer.sh <data-name> <n_train> <n_test> <counts>

Divides the data set into a training set with n_train points and a test set with n_test points, and writes them to the directory data/data-name/ as the files train.bin and test.bin. The dimensions of the data set are written to dimensions.sh. Wrapper for binary_writer/binary_writer.
Assumes that the data set with n_points points has already been written by read_kmers.
Argument counts controls whether the k-mer counts in the samples are written (counts=1), or only a binary yes/no indicator of whether a k-mer is in the sample (counts=0).
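The difference between the two values of counts, and the train/test division, can be sketched in Python as follows. This is only an illustration, not the logic of binary_writer. In particular, taking the first n_train points as the training set and the next n_test points as the test set is an assumption, and the function names are hypothetical.

```python
def binarize(counts_row):
    """Map k-mer counts to presence/absence indicators (the counts=0 mode)."""
    return [1 if count > 0 else 0 for count in counts_row]


def split_train_test(points, n_train, n_test):
    """Split points into a training set and a test set.

    Assumption: the first n_train points form the training set and the
    following n_test points form the test set.
    """
    if n_train + n_test > len(points):
        raise ValueError("n_train + n_test exceeds the number of points")
    return points[:n_train], points[n_train:n_train + n_test]


if __name__ == "__main__":
    samples = [[0, 3, 1], [2, 0, 0], [0, 0, 5], [1, 1, 1]]
    train, test = split_train_test([binarize(s) for s in samples], 3, 1)
    print(train, test)
```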

./comparison.sh <data-name> <n> <postfix>

Runs exact k-NN search and approximate k-NN search with the MRPT algorithm.
Wrapper for exact/tester and mrpt/mrpt_comparison.
Assumes that the parameters of the test run are saved in the file parameters/<data-name><n><postfix>.sh, or in parameters/<data-name><n>.sh if the optional argument <postfix> is omitted.
If the name of the data set (such as mnist) has no sample size, give an empty string ("") as the second argument n.
Saves the results into the directory results/<data-name><n><postfix> (or into results/<data-name><n>, respectively).
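As a point of reference for what the exact search computes, a brute-force k-NN query can be sketched in a few lines of Python. This is a minimal sketch and not the implementation in exact/tester. The Euclidean distance and the function name exact_knn are assumptions.

```python
import heapq
import math


def exact_knn(train, query, k):
    """Return the indices of the k training points closest to query.

    Assumption: closeness is measured with the Euclidean distance.
    Ties are broken by the smaller index.
    """
    distances = ((math.dist(point, query), i) for i, point in enumerate(train))
    return [i for _, i in heapq.nsmallest(k, distances)]


if __name__ == "__main__":
    train = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.5, 0.0)]
    print(exact_knn(train, (0.0, 0.0), 2))
```

The cost of this search grows linearly with both the number of training points and the dimension, which is why an approximate method is of interest for ultra-high-dimensional data.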

./file_finder.sh <data_name> <k>

Writes the nearest neighbors of the test set points to files.
Argument k is the number of nearest neighbors written.
The exact results should exist in results/data-name-exact/truth_k-file.
The results (one file for each point of the test set) are written into the directory results/data-name-exact/file_names.

SLURM wrappers

For SLURM scripts, remember to set

an upper limit for memory, for example 5 gigabytes: #SBATCH --mem=5G
an upper limit for computing time, for example one hour: #SBATCH --time=01:00:00

The scripts that can be used to run the C++ tools in a SLURM system are in the directory wrapper-SLURM:

prepare_data_slurm.sh <zip-file> <data-name>

SLURM wrapper for prepare_data.sh.

read_kmers_slurm.sh <data-name> <n-points>

Wrapper for fsm-lite; takes the same arguments as read_kmers.sh.
Set variable BASE_DIR to your local clone of this repo, for example BASE_DIR=/home/mydir/genome_test

writer_slurm.sh <data-set-name> <n_train> <n_test> <counts>

Wrapper for binary_writer/binary_writer, same functionality as writer.sh.
Set variable BASE_DIR to your local clone of this repo, for example BASE_DIR=/home/mydir/genome_test
For the Ecol data set with 1500 points, #SBATCH --mem=150G and #SBATCH --time=02:00:00 are good values.

comparison_slurm.sh <data-name> <n> <postfix>

Wrapper for exact/tester and mrpt/mrpt_comparison, same functionality as comparison.sh.
Set variable BASE_DIR to your local clone of this repo, for example BASE_DIR=/home/mydir/genome_test

Plot results

python plot.py <k> results/<result-name1>/mrpt.txt results/<result-name2>/mrpt.txt

Plots running time vs. accuracy for k-NN queries.
Draws one line for each results file.
Uses sparsity values (the expected proportion of non-zero components in the random vectors) in the legend.
Configuration is done directly in the script:
- n_test : test set size.
- legend : whether to draw the legend.
- save : whether the plot is saved into a file called file_name or shown on screen.
- log : whether the scale of the y-axis is logarithmic or linear.
- set_ylim : whether the limit of the y-axis is set to ylim or all data points are shown.
- legend_label : which attribute is used for the legend; current choices are sparsity, depth, and filename.
- show_title : whether to add the title given by the argument title to the plots.
- exact_time : time of exact search for one query point.
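The accuracy on the plot is not defined in this README. A common choice for k-NN queries is recall, the fraction of the true k nearest neighbors that the approximate search returns, averaged over the test points. The sketch below assumes that definition; the function names are hypothetical and the code is not taken from plot.py.

```python
def recall(approx, exact):
    """Fraction of the exact neighbors that also appear in the approximate result."""
    exact_set = set(exact)
    if not exact_set:
        raise ValueError("exact neighbor list must not be empty")
    return len(exact_set & set(approx)) / len(exact_set)


def mean_recall(approx_lists, exact_lists):
    """Average recall over all test points (one neighbor list per test point)."""
    pairs = list(zip(approx_lists, exact_lists))
    if not pairs:
        raise ValueError("at least one test point is required")
    return sum(recall(a, e) for a, e in pairs) / len(pairs)


if __name__ == "__main__":
    approx = [[1, 2], [3, 4]]
    exact = [[1, 2], [4, 5]]
    print(mean_recall(approx, exact))
```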

Misc scripts

get_mnist.sh

Loads the mnist data set into data/mnist/ for testing.
Converts it into binary form (a float array saved in column-major form; the dimension of the data is d = 784).
Loads the whole data set (data.bin) and divides it into a training set (train.bin) and a test set (test.bin); with TEST_N = 100, the test set has 100 points and the training set has 59900 points.
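The binary layout can be sketched in Python under two assumptions that this README does not state: the values are 32-bit floats, and the data is a d x n matrix with one point per column, so that column-major order stores the d values of each point contiguously. The function names are hypothetical.

```python
from array import array


def pack_points(points):
    """Serialize points as 32-bit floats, one point after another.

    Assumption: for a d x n matrix with one point per column, this is
    exactly the column-major byte order.
    """
    return array("f", (value for point in points for value in point)).tobytes()


def unpack_points(data, d):
    """Rebuild the list of d-dimensional points from the packed bytes."""
    flat = array("f")
    flat.frombytes(data)
    if len(flat) % d != 0:
        raise ValueError("data length is not a multiple of the dimension d")
    return [list(flat[i:i + d]) for i in range(0, len(flat), d)]


if __name__ == "__main__":
    raw = pack_points([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    print(unpack_points(raw, 3))
```

With this layout, a reader needs only the dimension d (784 for mnist) to recover the number of points from the file size.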

Test github pages

Link to the automatically generated documentation
