This repository has been archived by the owner on Jul 23, 2022. It is now read-only.

Completing the implementation with the distributed memory method #1

Open
oersted opened this issue Aug 24, 2016 · 2 comments

Comments

@oersted

oersted commented Aug 24, 2016

It looks like you don't have plans to keep developing this prototype, but I may be interested in integrating the PV-DM (distributed memory) method into the implementation (mainly because it has shown better overall performance, as stated in the original paper) and in completing the model so that it has all of the original gensim functionality.

Any tips? How would you approach it?

Also, I see that you based this on DeepDist, but I don't see that you use it. I'd like to understand why rewriting the whole thing is a better approach than directly using DeepDist's API, as proposed here: http://stackoverflow.com/questions/35616088/doc2vec-and-pyspark-gensim-doc2vec-over-deepdist.

Thank you for your time.

@yiransheng
Owner

yiransheng commented Sep 15, 2016

Hi @oersted. This is an old project from last year for some of my school work. Sorry for the delayed response; I haven't exactly been keeping up with the issues here.

On DeepDist: when I started this project, it was based on DeepDist. However, as I remember, gensim's Doc2Vec uses numpy memory-mapped arrays to store model parameters (docvecs.doctag_syn0). That makes sense, as the corpus would typically be too big to fit in memory, and batch SGD updates usually only mutate a small portion of the entire model params.
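For reference, here's a minimal sketch of what that looks like with the pre-4.0 gensim API this project was written against (the `docvecs_mapfile` argument tells gensim to back the doc-vector matrix with a numpy memmap on disk; the parameter values here are just illustrative):

```python
# Minimal sketch, pre-4.0 gensim API: back the doc vectors with a memmap.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["some", "example", "words"], tags=[i]) for i in range(1000)]

model = Doc2Vec(
    docs,
    size=100,                        # called `vector_size` in later releases
    dm=0,                            # PV-DBOW (this issue asks about adding PV-DM)
    docvecs_mapfile="doctags.mmap",  # doc vectors live in a memory-mapped file
)

# The doc-vector params are a (num_doctags x size) numpy.memmap rather than a
# plain ndarray, so they can't simply be pickled and shipped to Spark workers.
print(type(model.docvecs.doctag_syn0), model.docvecs.doctag_syn0.shape)
```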

In the context of DeepDist, this makes it very difficult to broadcast model parameters to all worker nodes (I did try an rsync-based solution to make the backing array available to all worker nodes, but performance was abysmal).

Eventually, I reached the conclusion that for Doc2Vec models, model parallelization is as important as training parallelization. Specifically, the doc-vector portion of the model params can easily be distributed as an RDD, and each partition/worker's batch training only updates its own share of doc vectors.
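Roughly, the idea looks like this (a purely illustrative PySpark sketch, not this repo's code; `train_partition` is a made-up placeholder for the real gensim-backed training step):

```python
# Illustrative sketch: doc vectors distributed as an RDD of (doc_id, vector)
# pairs, with each partition updating only its own slice during training.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="doc2vec-rdd-sketch")

num_docs, dim = 10000, 100
rng = np.random.RandomState(42)

doc_vecs = sc.parallelize(
    [(i, (rng.rand(dim).astype(np.float32) - 0.5) / dim) for i in range(num_docs)],
    numSlices=8,
)

def train_partition(pairs):
    # Placeholder for a real per-partition SGD pass: a real implementation
    # would fetch this partition's documents and call into gensim's training
    # routines, but each partition still only yields its own doc vectors.
    for doc_id, vec in pairs:
        yield doc_id, vec * (1.0 - 0.025)  # stand-in for a gradient update

# One "epoch": every partition trains on its own share independently.
doc_vecs = doc_vecs.mapPartitions(train_partition)
print(doc_vecs.count())
```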

Given that decision to distribute the model params, it made sense to rely on Spark's native synchronization mechanism, namely broadcast, to handle the process, so I dropped DeepDist entirely; it certainly was a major source of inspiration, though.

In addition, Spark's broadcast has torrent-like behavior, as opposed to DeepDist's master/slave architecture, which makes network traffic less of a bottleneck for big models.
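Continuing the sketch above (same `sc` and `doc_vecs`; again illustrative names, not this repo's code), the shared word/hidden parameters would be shipped with a broadcast before each pass and refreshed between epochs:

```python
# Illustrative: broadcast the shared (word / hidden-layer) params once per
# epoch; Spark distributes the broadcast torrent-style across executors.
import numpy as np

vocab_size, dim = 50000, 100
word_vectors = np.random.rand(vocab_size, dim).astype(np.float32)

bc_words = sc.broadcast(word_vectors)

def train_with_shared_params(pairs):
    shared = bc_words.value  # read-only copy available on every executor
    for doc_id, vec in pairs:
        # a real step would read word rows from `shared` for the gradient
        yield doc_id, vec

doc_vecs = doc_vecs.mapPartitions(train_with_shared_params)
doc_vecs.count()  # force the pass

# Between epochs: apply accumulated word-vector deltas on the driver,
# release the old broadcast, and re-broadcast the updated params.
bc_words.unpersist()
```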

I also referred to this paper when doing this project: https://blog.acolyer.org/2015/11/26/asip/ (granted, I only understood it at a surface level).

Currently, I don't have plans to update this project in the near term, so I won't be prototyping PV-DM. But I'd imagine it would take a similar approach: offload as much work as possible to gensim's _inner modules, and figure out a synchronization strategy on the Spark side.
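To make that concrete, here is a rough sketch of the kind of call I have in mind, using the pre-4.0 gensim API (in those versions `gensim.models.doc2vec` exposes `train_document_dm`, backed by the Cython `doc2vec_inner` routine when available); the function and its argument names are gensim's, everything else below is hypothetical:

```python
# Rough sketch: per-partition PV-DM training that offloads the heavy lifting
# to gensim's train_document_dm, updating only this partition's doc vectors.
import numpy as np
from gensim.models.doc2vec import train_document_dm  # pre-4.0 gensim

def train_partition_dm(model, docs, doctag_vectors, alpha=0.025):
    """`model`: a (broadcast) Doc2Vec holding the shared word/hidden weights.
    `docs`: iterable of (local_index, list_of_token_strings) for this partition.
    `doctag_vectors`: this partition's (n_local_docs x dim) slice."""
    doctag_locks = np.ones(doctag_vectors.shape[0], dtype=np.float32)
    for local_idx, words in docs:
        train_document_dm(
            model, words, [local_idx], alpha,
            learn_words=False,   # shared params are synced on the Spark side
            learn_hidden=False,
            doctag_vectors=doctag_vectors,
            doctag_locks=doctag_locks,
        )
    return doctag_vectors
```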

In a bigger context, I believe distributed TensorFlow might be a better bet (less hassle) for machine learning practitioners than Spark, although I have little experience with it. The BSP (bulk synchronous parallel) model of Spark/Hadoop does not accommodate asynchronous SGD-style model training well, and the immutability of RDDs adds too much overhead as well.

@oersted
Author

oersted commented Sep 16, 2016

Thanks a lot for the input; it's very helpful. Right now I'm in the experimentation phase, so I don't even know if I'll need this. I may have some further questions in the future if I decide to go forward with it.

My main motivation for this is that I may need to be able to use doc2vec at large scale in production. Most other aspects of the system fit very well into Spark and it would be interesting to have it all integrated.
