Spot add ldaonlineoptimizer support #3
base: master
Conversation
… Proxy implementations. Refactored code in SpotLDAWrapper to implement one or the other LDA optimizers.
…, alpha and beta. Updated spot.conf to include the same parameters.
spot-ml/ml_test.sh (Outdated)
@@ -21,6 +21,7 @@
DSOURCE=$1
RAWDATA_PATH=$2
LDAOPTIMIZER="online"
Why is online the default?
It shouldn't be, and it also references an out-of-date variable name. It will read from spot.conf.
spot-ml/ml_test.sh (Outdated)
--ldamaxiterations 11 \
This is the ml_test.sh script, but why are we using a different value than what is in spot.conf?
Not sure; I think it's a leftover from some tests. I'll match it with ml_ops.sh.
val optimizer = new EMLDAOptimizer
val ldaOptimizer = ldaOptimizerOption match {
  case "em" => new EMLDAOptimizer
  case "online" => new OnlineLDAOptimizer().setOptimizeDocConcentration(true).setMiniBatchFraction({
Where did these values come from? Are they taken from the Spark documentation, a paper, or some experiments that we ran?
(0.05 + 1) / corpus size: I'm sure it's from https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala. If corpus size < 2, then 0.75. I can't recall right now, but I'm pretty sure that came up in a conversation with @brandon.
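For reference, a plain-Scala sketch of the heuristic as described in the comment above. The exact form (0.05 + 1/corpusSize, mirroring Spark's LDAExample, with a 0.75 fallback for corpora of fewer than 2 documents) is an assumption reconstructed from this thread, not necessarily the PR's final code:

```scala
// Sketch of the miniBatchFraction heuristic discussed above. Assumed values:
// 0.05 + 1/corpusSize as in Spark's LDAExample, capped at 1.0, with a 0.75
// fallback for tiny corpora. Illustrative only.
def miniBatchFraction(corpusSize: Long): Double =
  if (corpusSize < 2) 0.75
  else math.min(1.0, 0.05 + 1.0 / corpusSize)
```

For a 1000-document corpus this samples roughly 5% of the documents per online iteration; the cap at 1.0 keeps small corpora from requesting more than the whole corpus.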
// If caller does not provide a seed to lda, i.e. ldaSeed is empty, the lda seed is automatically set to the hash value of the class name

if (ldaSeed.nonEmpty) {
  lda.setSeed(ldaSeed.get)
}

val (wordTopicMat, docTopicDist) = ldaOptimizer match {
  case _: EMLDAOptimizer => {
    val ldaModel = lda.run(ldaCorpus).asInstanceOf[DistributedLDAModel]//.toLocal
The //.toLocal is code in a comment; it should be removed.
//Create LDA model
val ldaModel = lda.run(ldaCorpus)
//Topic distribution: for each document, return distribution (vector) over topics for that docs
Maybe add "entry i is the fraction of the document which belongs to topic i".
Something like this?
// Topic distribution: for each document, return distribution (vector) over topics for that docs where entry i is the fraction of the document which belongs to topic i
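A toy illustration of what the proposed comment describes (made-up numbers, not from the PR): each document's topic-distribution vector has one entry per topic, entry i being the fraction of the document attributed to topic i.

```scala
// Toy doc-topic distribution for one document over 3 topics (invented values).
// Entry i is the fraction of the document attributed to topic i, so entries
// are non-negative and sum to 1.
val docTopicDist: Array[Double] = Array(0.6, 0.3, 0.1)
val dominantTopic = docTopicDist.indices.maxBy(i => docTopicDist(i))
// dominantTopic == 0: 60% of this document belongs to topic 0
```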
case _: OnlineLDAOptimizer => {
  val ldaModel = lda.run(ldaCorpus).asInstanceOf[LocalLDAModel]

  //Get word topic mix: columns = topic (in no guaranteed order), rows = words (# rows = vocab size)
This comment confuses me... shouldn't column i correspond to topic i? How else can we interpret the results?
It confuses me too. If I remember correctly (and based on the subsequent code), topicsMatrix contains 20 rows, one row for each topic, and N columns where N is the number of words (the vocab size).
The code: in line 155 we call val wordResults = formatSparkLDAWordOutput(wordTopicMat, revWordMap). In that function, at line 255, we find val wordProbs: Seq[Array[Double]] = wordTopMat.transpose.toArray.grouped(n).toSeq.
Until then we have a matrix where each row is a word and each column is a topic.
Never mind, I just checked and it's like you said: columns = number of topics.
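To make the resolved orientation concrete, here is a plain-Scala sketch (no Spark dependency, toy sizes and invented values) of the index arithmetic behind wordTopMat.transpose.toArray.grouped(n), assuming n there is the number of topics: Spark matrices flatten column-major, so transposing the vocabSize x k matrix and flattening puts each word's k topic weights in consecutive positions.

```scala
val k = 2          // number of topics (columns of topicsMatrix)
val vocabSize = 4  // number of words (rows of topicsMatrix)

// Column-major values of the vocabSize x k matrix, as Spark stores them:
// the first vocabSize entries are topic 0's weight for each word.
val values = Array(0.1, 0.2, 0.3, 0.4,   // topic 0 weight per word
                   0.9, 0.8, 0.7, 0.6)   // topic 1 weight per word

// Column-major flatten of the transpose == row-major flatten of the original.
val transposedFlat: Array[Double] =
  Array.tabulate(vocabSize * k)(i => values((i % k) * vocabSize + i / k))

// grouped(k) then yields one Array per word: its weight in every topic.
val wordProbs: Seq[Array[Double]] = transposedFlat.grouped(k).toSeq
// wordProbs(0) is word 0's topic weights: Array(0.1, 0.9)
```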
//Get word topic mix: columns = topic (in no guaranteed order), rows = words (# rows = vocab size)
val wordTopicMat: Matrix = ldaModel.topicsMatrix

//Topic distribution: for each document, return distribution (vector) over topics for that docs
Minor formatting issue throughout (easy query-replace): add white space after //, e.g.
//Topic
should be
// Topic
I actually tried to have IntelliJ do this for me with auto-formatting, but it seems IntelliJ doesn't care if you didn't type a space between the // and your comment.
Approved, with minor comments on defaults and comments.
…NathanSegerlind Fixed inline comments format, added one space after //.
No description provided.