There are two relevant packages in this project:
    test
    eigenfaces
a) In package test there is just one class, TestSVD.
It starts by generating an n x m = 100 x 2000 matrix
with a certain percentage of non-zero values.
This percentage is 5% of the entries by default
but it can be changed through a command-line argument.
It then computes the SVD of this matrix using:
1) the Mahout Stochastic SVD solver (Hadoop based),
2) the Mahout sequential Stochastic SVD solver, and
3) the Spark MLlib implementation of Stochastic SVD.
The singular values from 1) and 3) are compared against
the singular values computed by 2), and if they are not
all within 10E-12 precision a message is printed out.
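As a rough illustration (this is not the project's code; class and method
names are hypothetical), the two steps described above look roughly like
this: fill a 100 x 2000 matrix so that about the requested percentage of
entries are non-zero, then check two lists of singular values against the
stated tolerance.

    import java.util.Random;

    public class TestSVDSketch {
        // The "10E-12 precision" mentioned above.
        static final double TOL = 10e-12;

        // Fill an n x m matrix so that roughly `percent` % of entries are non-zero.
        static double[][] randomSparseMatrix(int n, int m, double percent, long seed) {
            Random rnd = new Random(seed);
            double[][] a = new double[n][m];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < m; j++)
                    if (rnd.nextDouble() < percent / 100.0)
                        a[i][j] = rnd.nextDouble();
            return a;
        }

        // Report any singular value that differs from the reference by more than TOL.
        static boolean agrees(double[] reference, double[] candidate) {
            boolean ok = true;
            int k = Math.min(reference.length, candidate.length);
            for (int i = 0; i < k; i++) {
                if (Math.abs(reference[i] - candidate[i]) > TOL) {
                    System.out.printf("sigma[%d] differs: %.15e vs %.15e%n",
                            i, reference[i], candidate[i]);
                    ok = false;
                }
            }
            return ok;
        }
    }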
To run TestSVD from Eclipse, create a run/debug
configuration and enter the command-line arguments,
or use the jar. (To create a jar file from this project
from within Eclipse: right-click the project -> Export ->
Java JAR file -> enter the location and name of the jar.)
To run it from the jar on a local machine (rather
than using HDFS; this is fine because the matrix is fairly small):
java -cp RandomizedSVD.jar:lib/*:. test.TestSVD 15 some_local_folder
The first argument is the percentage of non-zero entries in the
randomly generated matrix and the second argument is a local folder
where the matrix and the output of the Mahout SVD are written.
RandomizedSVD.jar is *this* project's jar, and it is assumed that
the same folder contains a subfolder lib with all the jars from the lib
folder of the git commit, plus spark-assembly-1.1.1-hadoop2.4.0.jar,
which is too big for git (it is available at
/mnt/scratch/u0082100/RandomizedSVD/lib on apt023).
The code will output the top 100 singular values produced
by each of the 3 methods.
To run using Hadoop, an extra argument is needed specifying
the location of a remote (HDFS) folder. [The initial matrix
is written to the local path and then copied to the remote path
for the SVD computation.]
Here is an example command:
/usr/local/hadoop-2.5.0/bin/hadoop jar RandomizedSVD.jar test.TestSVD 5 /mnt/scratch/u0082100/RandomizedSVD/test /user/u0082100/test
b) In package eigenfaces
there are several main programs that all implement similar
functionality but use different algorithms/implementations
for computing the SVD.
b.1) EigenfacesMain.java
This uses Mahout's distributed version of the Lanczos algorithm
to compute the eigenvectors of the covariance matrix.
Here is an example run command:
/usr/local/hadoop-2.5.0/bin/hadoop jar RandomizedSVD.jar eigenfaces.EigenFacesMain 10 /mnt/scratch/u0082100/RandomizedSVD /user/u0082100
The first argument (10 above) is the rank, the second is
a local directory, and the third is a folder in HDFS.
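For background (this is not code from the project; the helper name is
hypothetical): once the top-k eigenvectors of the covariance matrix, i.e.
the eigenfaces, are available, a mean-centred face vector is represented
by its projection weights onto them, roughly as follows.

    public class EigenfaceProjectionSketch {
        // Hypothetical helper, for illustration only: project a face image
        // (flattened to a double[] vector) onto the top-k eigenfaces after
        // subtracting the mean face.
        static double[] projectOntoEigenfaces(double[] face, double[] meanFace,
                                              double[][] eigenfaces) {
            double[] weights = new double[eigenfaces.length];
            for (int k = 0; k < eigenfaces.length; k++) {
                double dot = 0.0;
                for (int i = 0; i < face.length; i++) {
                    dot += (face[i] - meanFace[i]) * eigenfaces[k][i];
                }
                weights[k] = dot;
            }
            return weights;
        }
    }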
b.2) EigenFacesSSVDMain.java
This uses Mahout's Stochastic SVD to compute the
SVD decomposition of the covariance matrix and the right
eigenvectors, which allow computation of the eigenfaces.
Here is an example command:
/usr/local/hadoop-2.5.0/bin/hadoop jar RandomizedSVD.jar eigenfaces.EigenFacesSSVDMain 10 /mnt/scratch/u0082100/RandomizedSVD /user/u0082100
The first argument (10 above) is the rank, the second is
a local directory, and the third is a folder in HDFS.
b.3) EigenFacesSparkSVD.java
This uses the Spark MLlib Stochastic SVD implementation to compute the
SVD decomposition of the covariance matrix and the right
eigenvectors, which allow computation of the eigenfaces.
Here is an example command:
/usr/local/hadoop-2.5.0/bin/hadoop jar RandomizedSVD.jar eigenfaces.EigenFacesSparkMain 10 /mnt/scratch/u0082100/RandomizedSVD /user/u0082100
The first argument (10 above) is the rank, the second is
a local directory, and the third is a folder in HDFS.
This needs spark-assembly-1.1.1-hadoop2.4.0.jar in order
for Spark to be able to run from within Hadoop.
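For reference, computing a truncated SVD with Spark MLlib (the 1.1-era
RowMatrix API used here) looks roughly like the sketch below; the variable
names rows and rank are placeholders, not the project's code.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.linalg.Matrix;
    import org.apache.spark.mllib.linalg.SingularValueDecomposition;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.distributed.RowMatrix;

    public class SparkSVDSketch {
        // `rows` holds one Vector per matrix row; `rank` is the number of
        // singular values/vectors to keep (both names are placeholders).
        static Matrix topRightSingularVectors(JavaRDD<Vector> rows, int rank) {
            RowMatrix mat = new RowMatrix(rows.rdd());
            // Last argument: reciprocal condition number below which small
            // singular values are treated as zero.
            SingularValueDecomposition<RowMatrix, Matrix> svd =
                    mat.computeSVD(rank, true, 1.0E-9d);
            System.out.println("singular values: " + svd.s());
            return svd.V();  // right singular vectors as matrix columns
        }
    }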
The code assumes that there are two directories called
training-set and testing-set under the local path. So in the examples
above they would be at
/mnt/scratch/u0082100/RandomizedSVD/training-set
/mnt/scratch/u0082100/RandomizedSVD/testing-set