There are two relevant packages in this project:
    test
    eigenfaces
a) In package test there is just one class, TestSVD.
It starts by generating an n x m = 100 x 2000 matrix
with a certain percentage of non-zero values.
This percentage is 5% of the entries by default
but it can be changed through a command-line argument.
It then computes the SVD of this matrix using:
1) the Mahout Stochastic SVD solver (Hadoop based),
2) the Mahout sequential Stochastic SVD solver, and
3) the Spark MLlib implementation of Stochastic SVD.
The singular values from 1) and 3) are compared against
the singular values computed by 2), and if they are not
all within 10E-12 precision a message is printed out.
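As a rough illustration (this is not the project's code; class and method
names are hypothetical), the two steps described above look roughly like
this: fill a 100 x 2000 matrix so that about the requested percentage of
entries are non-zero, then check two lists of singular values against the
stated tolerance.

    import java.util.Random;

    public class TestSVDSketch {
        // The "10E-12 precision" mentioned above.
        static final double TOL = 10e-12;

        // Fill an n x m matrix so that roughly `percent` % of entries are non-zero.
        static double[][] randomSparseMatrix(int n, int m, double percent, long seed) {
            Random rnd = new Random(seed);
            double[][] a = new double[n][m];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < m; j++)
                    if (rnd.nextDouble() < percent / 100.0)
                        a[i][j] = rnd.nextDouble();
            return a;
        }

        // Report any singular value that differs from the reference by more than TOL.
        static boolean agrees(double[] reference, double[] candidate) {
            boolean ok = true;
            int k = Math.min(reference.length, candidate.length);
            for (int i = 0; i < k; i++) {
                if (Math.abs(reference[i] - candidate[i]) > TOL) {
                    System.out.printf("sigma[%d] differs: %.15e vs %.15e%n",
                            i, reference[i], candidate[i]);
                    ok = false;
                }
            }
            return ok;
        }
    }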
To run TestSVD from Eclipse, create a run/debug
configuration and enter the command-line arguments,
or use the jar. (To create a jar file from this project
from within Eclipse: right-click the project -> Export ->
Java JAR file -> enter the location and name of the jar.)
To run it from the jar on a local machine (rather
than using HDFS; this is fine because the matrix is fairly small):
java -cp RandomizedSVD.jar:lib/*:. test.TestSVD 15 some_local_folder
The first argument is the percentage of non-zero entries in the
randomly generated matrix and the second argument is a local folder
where the matrix and the output of the Mahout SVD are written.
RandomizedSVD.jar is *this* project's jar, and it is assumed that
the same folder contains a subfolder lib with all the jars from the lib
folder of the git commit, plus spark-assembly-1.1.1-hadoop2.4.0.jar,
which is too big for git (it is available at
/mnt/scratch/u0082100/RandomizedSVD/lib on apt023).
The code will output the top 100 singular values produced
by each of the 3 methods.
To run using Hadoop, an extra argument is needed specifying
the location of a remote (HDFS) folder. [The initial matrix
is written to the local path and then copied to the remote path
for the SVD computation.]
Here is an example command:
/usr/local/hadoop-2.5.0/bin/hadoop jar RandomizedSVD.jar test.TestSVD 5 /mnt/scratch/u0082100/RandomizedSVD/test /user/u0082100/test
b) In package eigenfaces
there are several main programs that all implement similar
functionality but use different algorithms/implementations
for computing the SVD.
b.1) EigenfacesMain.java
This uses Mahout's distributed version of the Lanczos algorithm
to compute the eigenvectors of the covariance matrix.
Here is an example run command:
/usr/local/hadoop-2.5.0/bin/hadoop jar RandomizedSVD.jar eigenfaces.EigenFacesMain 10 /mnt/scratch/u0082100/RandomizedSVD /user/u0082100
The first argument (10 above) is the rank, the second is
a local directory, and the third is a folder in HDFS.
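For background (this is not code from the project; the helper name is
hypothetical): once the top-k eigenvectors of the covariance matrix, i.e.
the eigenfaces, are available, a mean-centred face vector is represented
by its projection weights onto them, roughly as follows.

    public class EigenfaceProjectionSketch {
        // Hypothetical helper, for illustration only: project a face image
        // (flattened to a double[] vector) onto the top-k eigenfaces after
        // subtracting the mean face.
        static double[] projectOntoEigenfaces(double[] face, double[] meanFace,
                                              double[][] eigenfaces) {
            double[] weights = new double[eigenfaces.length];
            for (int k = 0; k < eigenfaces.length; k++) {
                double dot = 0.0;
                for (int i = 0; i < face.length; i++) {
                    dot += (face[i] - meanFace[i]) * eigenfaces[k][i];
                }
                weights[k] = dot;
            }
            return weights;
        }
    }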
b.2) EigenFacesSSVDMain.java
This uses Mahout's Stochastic SVD to compute the
SVD decomposition of the covariance matrix and the right
eigenvectors, which allow computation of the eigenfaces.
Here is an example command:
/usr/local/hadoop-2.5.0/bin/hadoop jar RandomizedSVD.jar eigenfaces.EigenFacesSSVDMain 10 /mnt/scratch/u0082100/RandomizedSVD /user/u0082100
The first argument (10 above) is the rank, the second is
a local directory, and the third is a folder in HDFS.
b.3) EigenFacesSparkSVD.java
This uses the Spark MLlib Stochastic SVD implementation to compute the
SVD decomposition of the covariance matrix and the right
eigenvectors, which allow computation of the eigenfaces.
Here is an example command:
/usr/local/hadoop-2.5.0/bin/hadoop jar RandomizedSVD.jar eigenfaces.EigenFacesSparkMain 10 /mnt/scratch/u0082100/RandomizedSVD /user/u0082100
The first argument (10 above) is the rank, the second is
a local directory, and the third is a folder in HDFS.
This needs spark-assembly-1.1.1-hadoop2.4.0.jar in order
for Spark to be able to run from within Hadoop.
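For reference, computing a truncated SVD with Spark MLlib (the 1.1-era
RowMatrix API used here) looks roughly like the sketch below; the variable
names rows and rank are placeholders, not the project's code.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.linalg.Matrix;
    import org.apache.spark.mllib.linalg.SingularValueDecomposition;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.distributed.RowMatrix;

    public class SparkSVDSketch {
        // `rows` holds one Vector per matrix row; `rank` is the number of
        // singular values/vectors to keep (both names are placeholders).
        static Matrix topRightSingularVectors(JavaRDD<Vector> rows, int rank) {
            RowMatrix mat = new RowMatrix(rows.rdd());
            // Last argument: reciprocal condition number below which small
            // singular values are treated as zero.
            SingularValueDecomposition<RowMatrix, Matrix> svd =
                    mat.computeSVD(rank, true, 1.0E-9d);
            System.out.println("singular values: " + svd.s());
            return svd.V();  // right singular vectors as matrix columns
        }
    }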
The code assumes that there are two directories called
training-set and testing-set under the local path. So in the examples
above they would be at
/mnt/scratch/u0082100/RandomizedSVD/training-set
/mnt/scratch/u0082100/RandomizedSVD/testing-set