Completing the implementation with the distributed memory method #1
Hi @oersted. This is an old project from last year for some of my school work, so sorry for the delayed response; I haven't exactly been keeping up with the issues here.

On DeepDist: when I started this project it was indeed based on DeepDist. However, as I remember, DeepDist made it very difficult to broadcast model parameters to all worker nodes (I did try an rsync-based solution to make the backing array available to all worker nodes, but performance was abysmal). Given the decision to distribute the model parameters, it made sense to rely on Spark's native synchronization mechanism, namely broadcast, to handle the process, so I dropped DeepDist entirely, though it certainly was a major source of inspiration. In addition, Spark's broadcast has torrent-like behavior, as opposed to DeepDist's master/slave architecture, which makes network traffic less of a bottleneck for big models. I also referred to this paper when doing the project, though admittedly I only understood it at a surface level: https://blog.acolyer.org/2015/11/26/asip/

Currently I don't have plans to update this project in the near term, so I won't be prototyping PV-DM. But I'd imagine it would take a similar approach: offload as much work as possible to Spark's native machinery.

In the bigger picture, I believe distributed TensorFlow might be a better bet for machine learning practitioners (less hassle) than Spark, although I have little experience with it. The BSP (bulk synchronous parallel) model of Spark/Hadoop does not accommodate asynchronous SGD-style model training well, and the immutability of RDDs adds too much overhead as well.
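To make the broadcast-based synchronization above concrete, here is a minimal, dependency-free sketch of one BSP-style training round: the driver "broadcasts" a read-only snapshot of the parameters, each worker computes a delta on its partition, and the driver averages the deltas back in. Everything here (`train_round`, `worker_delta`, the toy update rule) is a hypothetical stand-in for what Spark's broadcast plus an RDD aggregation would actually do, not code from this project:

```python
def worker_delta(params, partition):
    # Stand-in for one gradient step over a worker's data partition:
    # nudge each parameter toward the partition's mean value.
    mean = sum(partition) / len(partition)
    return [0.1 * (mean - p) for p in params]

def train_round(params, partitions):
    # "Broadcast": every worker sees the same read-only snapshot.
    snapshot = list(params)
    deltas = [worker_delta(snapshot, part) for part in partitions]
    # Driver-side synchronization barrier: average the workers' deltas.
    n = len(deltas)
    return [p + sum(d[i] for d in deltas) / n
            for i, p in enumerate(snapshot)]

params = [0.0, 0.0]
partitions = [[1.0, 2.0], [3.0, 4.0]]  # toy corpus split across two workers
params = train_round(params, partitions)
print(params)  # both parameters move toward the global mean
```

In real PySpark the snapshot would be `sc.broadcast(params)` (which uses the torrent-like mechanism mentioned above) and the delta-averaging would be a `mapPartitions` followed by a reduce on the driver.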
Thanks a lot for the input, it's very helpful. Right now I'm in the experimentation phase, so I don't even know if I'll need this; I may have further questions in the future if I decide to go forward with it. My main motivation is that I may need to use doc2vec at large scale in production. Most other aspects of the system fit very well into Spark, and it would be interesting to have it all integrated.
It looks like you don't have plans to keep developing this prototype, but I may be interested in integrating the PV-DM distributed memory method into the implementation (mainly because it has shown better overall performance, as stated in the original paper) and in completing the model so that it supports all of the original gensim functionality.
Any tips? How would you approach it?
Also, I see that you based this on DeepDist, but I don't see that you use it. I'd like to understand why rewriting the whole thing is a better approach than directly using DeepDist's API, as proposed here: http://stackoverflow.com/questions/35616088/doc2vec-and-pyspark-gensim-doc2vec-over-deepdist
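For context, the DeepDist approach referenced in that question is a pair of callbacks around a shared model: `gradient(model, data)` computes parameter deltas on a worker, and `descent(model, update)` applies them on the master. The following is a plain-Python imitation of that callback shape only, so the contrast with the broadcast approach is visible; `ToyModel` and the update rule are invented for illustration, and none of this is the real DeepDist API:

```python
class ToyModel:
    # Hypothetical stand-in for a gensim model's parameter arrays.
    def __init__(self):
        self.weights = [0.0, 0.0]

def gradient(model, data):
    # Worker side: compute an update without mutating the shared model.
    mean = sum(data) / len(data)
    return {'weights': [0.1 * (mean - w) for w in model.weights]}

def descent(model, update):
    # Master side: fold a worker's update into the shared model.
    model.weights = [w + d for w, d in zip(model.weights, update['weights'])]

model = ToyModel()
for partition in ([1.0, 2.0], [3.0, 4.0]):  # sequential stand-in for an RDD
    descent(model, gradient(model, partition))
```

The master/slave shape is the point: each worker's update flows back through the single master, which is the traffic pattern the maintainer contrasts with Spark's torrent-like broadcast above.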
Thank you for your time.