
Spot add ldaonlineoptimizer support #3

Open · wants to merge 6 commits into master

Conversation

rabarona (Owner)

No description provided.

@@ -21,6 +21,7 @@

DSOURCE=$1
RAWDATA_PATH=$2
LDAOPTIMIZER="online"

NathanSegerlind:
Why is online the default?

rabarona (Owner, Author):
It shouldn't be, and it also references an out-of-date variable name. It will read from spot.conf.

--ldamaxiterations 11 \

NathanSegerlind:
This is the ml_test.sh script, but why are we using a different value than what is in spot.conf?

rabarona (Owner, Author):
Not sure; I think it's a leftover from some tests. I will match it with ml_ops.sh.

val optimizer = new EMLDAOptimizer
val ldaOptimizer = ldaOptimizerOption match {
case "em" => new EMLDAOptimizer
case "online" => new OnlineLDAOptimizer().setOptimizeDocConcentration(true).setMiniBatchFraction({

NathanSegerlind:
Where did these values come from? Are they taken from the Spark documentation, a paper, or some experiments that we ran?

rabarona (Owner, Author):
(0.05 + 1) / corpus size: I'm sure it's from https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala. "if corpus size < 2 then 0.75": I can't recall right now, but I'm pretty sure that came up in a conversation with @brandon.
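
For reference, a minimal sketch of the selection logic under discussion. The mini-batch expression below follows the linked Spark LDAExample, which uses 0.05 + 1.0 / actualCorpusSize (slightly different from the "(0.05 + 1) / corpus size" recalled above); the corpusSize < 2 fallback is only as recalled in this thread. Treat both as assumptions about the PR's actual code, and chooseOptimizer as a hypothetical helper:

import org.apache.spark.mllib.clustering.{EMLDAOptimizer, LDAOptimizer, OnlineLDAOptimizer}

def chooseOptimizer(ldaOptimizerOption: String, corpusSize: Long): LDAOptimizer =
  ldaOptimizerOption match {
    case "em" => new EMLDAOptimizer
    case "online" =>
      // Tiny corpora get a fixed fraction; otherwise sample slightly more
      // than one document's worth of the corpus per mini-batch (assumed heuristic).
      val miniBatchFraction = if (corpusSize < 2) 0.75 else 0.05 + 1.0 / corpusSize
      new OnlineLDAOptimizer()
        .setOptimizeDocConcentration(true)
        .setMiniBatchFraction(miniBatchFraction)
    case invalid => throw new IllegalArgumentException(s"Invalid LDA optimizer: $invalid")
  }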


// If the caller does not provide a seed to LDA (i.e. ldaSeed is empty), LDA is automatically seeded with the hash value of the class name

if (ldaSeed.nonEmpty) {
  lda.setSeed(ldaSeed.get)
}

val (wordTopicMat, docTopicDist) = ldaOptimizer match {
case _: EMLDAOptimizer => {
val ldaModel = lda.run(ldaCorpus).asInstanceOf[DistributedLDAModel]//.toLocal

NathanSegerlind:
The //.toLocal is code in a comment; it should be removed.
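
For context, a sketch of how the EM branch could read once that residue is dropped; the function name and signature here are illustrative, not the PR's exact code:

import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.linalg.{Matrix, Vector}
import org.apache.spark.rdd.RDD

def runEmLda(lda: LDA, ldaCorpus: RDD[(Long, Vector)]): (Matrix, RDD[(Long, Vector)]) = {
  // EM training yields a DistributedLDAModel; the per-document topic
  // distributions stay on the cluster as an RDD.
  val ldaModel = lda.run(ldaCorpus).asInstanceOf[DistributedLDAModel]
  (ldaModel.topicsMatrix, ldaModel.topicDistributions)
}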


//Create LDA model
val ldaModel = lda.run(ldaCorpus)
//Topic distribution: for each document, return distribution (vector) over topics for that docs

NathanSegerlind:
Maybe add "entry i is the fraction of the document which belongs to topic i".

rabarona (Owner, Author):
Something like this?
// Topic distribution: for each document, return the distribution (vector) over topics for that document, where entry i is the fraction of the document which belongs to topic i
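
If useful, a self-contained sketch of that call for the online (LocalLDAModel) case, with the clarified comment; docTopicMixes is a hypothetical helper:

import org.apache.spark.mllib.clustering.LocalLDAModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def docTopicMixes(ldaModel: LocalLDAModel, ldaCorpus: RDD[(Long, Vector)]): RDD[(Long, Vector)] =
  // For each document, a vector over topics where entry i is the
  // fraction of the document which belongs to topic i (entries sum to 1).
  ldaModel.topicDistributions(ldaCorpus)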

case _: OnlineLDAOptimizer => {
val ldaModel = lda.run(ldaCorpus).asInstanceOf[LocalLDAModel]

//Get word topic mix: columns = topic (in no guaranteed order), rows = words (# rows = vocab size)

NathanSegerlind:
This comment confuses me... shouldn't column i correspond to topic i? How else can we interpret the results?

rabarona (Owner, Author):
It confuses me too. If I remember correctly (and based on the subsequent code), topicsMatrix contains 20 rows, one row for each topic, and N columns, where N is the number of words (the vocab size).

The code:

On line 155 we call val wordResults = formatSparkLDAWordOutput(wordTopicMat, revWordMap). In that function we find, at line 255, val wordProbs: Seq[Array[Double]] = wordTopMat.transpose.toArray.grouped(n).toSeq. Until then we have a matrix where each row is a word and each column is a topic.
rabarona (Owner, Author):
Never mind, I just checked and it's as you said: columns = number of topics.
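
To pin down the final layout, a small sketch (describeTopicsMatrix is a hypothetical helper) of the orientation of topicsMatrix, vocabSize rows by k topic columns, which is why formatSparkLDAWordOutput transposes before grouping per-word probabilities:

import org.apache.spark.mllib.clustering.LDAModel
import org.apache.spark.mllib.linalg.Matrix

def describeTopicsMatrix(ldaModel: LDAModel): Unit = {
  val wordTopicMat: Matrix = ldaModel.topicsMatrix
  // Rows index words (numRows == vocab size), columns index topics
  // (numCols == k); wordTopicMat(w, t) is the weight of word w in topic t.
  println(s"words (rows): ${wordTopicMat.numRows}, topics (cols): ${wordTopicMat.numCols}")
}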

//Get word topic mix: columns = topic (in no guaranteed order), rows = words (# rows = vocab size)
val wordTopicMat: Matrix = ldaModel.topicsMatrix

//Topic distribution: for each document, return distribution (vector) over topics for that docs

NathanSegerlind:
Minor formatting issue throughout (an easy query-replace): add whitespace after //, e.g.
//Topic
should be
// Topic

rabarona (Owner, Author):
I actually tried to have IntelliJ do this for me with auto-formatting, but it seems IntelliJ doesn't care whether you typed a space between the // and your comment.

@NathanSegerlind left a comment:

Approved with minor comments on defaults and comments.


Fixed inline comments format; added one space after //.
Labels: none yet
Projects: none yet
2 participants