Implementation bug in ColumnIterator? #49
Comments
I think you're right, that looks like a bug; it wouldn't make much sense to use. Seems like this was missed by the unit tests. We'll also have to write some unit tests covering this case, so it doesn't happen again in the future.
I've created pull request #50, targeting only a simple fix for this bug, without adding tests. This method is
Oh, I think there is still a potential bug,
Thanks, I'll look into that! 👍 I think it would be best to add a check to see if. For the tests, I'll see if I can write some up tomorrow.
The check for
And I managed to add tests for
Thanks! It looks great 👍
Eh, after reading the code further, I found the real problem in the init implementation: in this block of code it shortcuts the cols, since the test matrix has fewer rows. So the appropriate test should be something like the one in #53, and the tests in #52 are meaningless, as this should be tested via the public interface. By the way, in the near future I may have to implement logistic regression with a parameter server. Do you know whether somebody is already working on that based on Glint?
Yeah, now that you say it, it's probably better to test the public interface. I think @MLnick has been working on a basic logistic regression implementation. I plan to implement logistic regression, SVMs, etc. myself but haven't gotten around to it yet due to other obligations. One of the major difficulties I ran into is regularization. L2 regularization would require scalar multiplication on the entire distributed vector, which is not yet implemented and will probably require some code refactoring at certain parts. Unregularized linear methods should be relatively easy to implement though.
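To make the regularization point concrete: an L2-regularized gradient step can be rewritten as a weight-decay scaling of the whole weight vector followed by the plain gradient step, which is exactly why a scalar multiplication over the entire distributed vector would be needed. Here is a minimal local sketch of that identity, with plain arrays standing in for a distributed Glint vector (the function name and shapes are hypothetical, not Glint API):

```scala
// L2-regularized SGD step:
//   w := w - lr * (grad + lambda * w)
//      = (1 - lr * lambda) * w - lr * grad
// The (1 - lr * lambda) factor is a scalar multiply over the whole vector,
// which is the operation a distributed implementation would need to support.
def l2Step(w: Array[Double], grad: Array[Double],
           lr: Double, lambda: Double): Array[Double] = {
  val decay = 1.0 - lr * lambda
  w.zip(grad).map { case (wi, gi) => decay * wi - lr * gi }
}
```

On a parameter server, the decay factor would have to be applied server-side (or folded into pushed deltas), since clients only see slices of the vector.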
I thought about it a bit today, and I find it hard to enable arbitrary user-defined functions on the server side. If I have enough time, I will take a look at the code of parameterserver.org to see how they handle the problem.
Yes, the idea is correct, but I should note that the tasks are not issued by the parameter servers and instead job scheduling is intended to be handled by Spark. Spark issues workers to operate on subsets of some data set. These workers then asynchronously pull parts of the model and push updates to the parameter server. So the point of view is definitely different from other parameter server implementations, because Glint functions more akin to a key-value store. It has no concept of tasks or workers and merely provides distributed matrices and vectors that can be read and updated efficiently. So it would look something like this:
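The code snippet that followed this comment did not survive; as a rough, hypothetical sketch of the pattern described (a concurrent in-memory map standing in for a real Glint distributed vector, since the actual client code is not shown here):

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Stand-in for the parameter server: a key-value store of model weights
// that workers asynchronously pull from and push additive updates to.
object ParamStore {
  private val weights = TrieMap.empty[Long, Double]

  def pull(keys: Array[Long]): Future[Array[Double]] =
    Future(keys.map(k => weights.getOrElse(k, 0.0)))

  // Additive push, like a parameter-server update. (A real server would
  // apply the read-modify-write atomically per key.)
  def push(keys: Array[Long], deltas: Array[Double]): Future[Boolean] =
    Future {
      for ((k, d) <- keys.zip(deltas))
        weights.put(k, weights.getOrElse(k, 0.0) + d)
      true
    }
}

// A Spark worker processing its partition would: pull the relevant model
// coordinates, compute a local gradient, and push back a scaled update.
def workerStep(keys: Array[Long]): Future[Boolean] =
  ParamStore.pull(keys).flatMap { w =>
    val grad = w.map(_ => 1.0) // placeholder gradient for the sketch
    ParamStore.push(keys, grad.map(-0.1 * _))
  }
```

This mirrors the key-value-store view described above: the store has no notion of tasks or workers, only pull and push on slices of the model, while scheduling stays with Spark.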
Perhaps you will find this interesting as well: https://github.com/rjagerman/glintlda. It is an implementation of LightLDA using Glint. Although it is a completely different algorithm, it provides a good example of using Glint for a complex ML task. The code may be a bit hard to read though, due to the many optimizations that it uses.
I read the code of glintlda yesterday, and I'm currently reading parameterserver.org's code for optimization with L1/L2 regularization. The parameterserver.org code is C++ (I haven't programmed in C++ for years) and the project is quite large, so it may take several days to understand the implementation. With your proposal for logistic regression with L2, it is strongly consistent (different from glintlda, which I think is bounded-staleness, as you use futures with locks to allow a little inconsistency; am I right? Or what is the usage of the semaphore?). If I'm right, it may also be possible to make it bounded-staleness with something like what is done in glintlda, and multiplying by a scalar is something like a mapper function. But with L1, I think it needs to be strongly consistent, or we need to keep the old gradients of delayed workers to enable bounded staleness, which may increase the complexity of the system. Also with L1, it will require something like a mapper function to update the weights. And I also find that in Yahoo's implementation, the parameter server provides map, reduce, and aggregate interfaces on the server side. That may make the system more powerful.
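For reference on why L1 is harder to express through a pull/push interface: L1 is typically handled with a proximal (soft-thresholding) step, which is not an additive delta and therefore would need some server-side function. A minimal sketch of the operator, on plain local arrays rather than actual Glint code:

```scala
// Proximal step for L1 regularization: shrink each weight toward zero by
// threshold t, zeroing it when |w| <= t. Because the result depends on the
// current server-side value (not just a delta computed by a worker), a pure
// additive key-value push cannot express it.
def softThreshold(w: Array[Double], t: Double): Array[Double] =
  w.map(wi => math.signum(wi) * math.max(math.abs(wi) - t, 0.0))
```

This is the kind of operation a server-side map/UDF interface, as in Yahoo's implementation mentioned above, would make possible.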
Hi there, I have been working on some basic linear models with Glint, mostly logistic regression. I've used an implementation based on the older Spark MLlib gradient descent primitives. It's fairly easy to do LR with L2 regularization and I get the same results as MLlib. However, I haven't really run larger-scale experiments yet, where the async impact on consistency could be greater. L1 is more challenging though, and I think it will definitely require some form of computation on the param servers themselves. How to handle that in Glint is an interesting issue. I think there must be some form of UDFs supported for these types of operations, whether that means shipping some function to the servers or plugging in a custom Actor of some sort.
@rjagerman @MLnick |
In the line of the file, it looks like it should be
val cols = Array.fill[Int](matrix.rows)(index)
instead of
val cols = Array.fill[Int](matrix.cols)(index)
?
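To illustrate why the suggested fix matters: a single column of a matrix has one entry per row, so the per-column index array must have length `matrix.rows`, not `matrix.cols`. A minimal sketch with hypothetical dimensions (not the actual Glint code):

```scala
// A hypothetical non-square matrix: 3 rows, 5 columns.
val rows = 3
val cols = 5
val index = 2 // the column being iterated over

// Correct: one (row, column) coordinate pair per row of the matrix.
val colIndices = Array.fill[Int](rows)(index)
val rowIndices = Array.range(0, rows)

// With Array.fill[Int](cols)(index) the array would have length 5, wrong
// whenever rows != cols, which is why square test matrices mask the bug.
```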