
parameter server strategy #8

Open · huangrunhui wants to merge 14 commits into master

Conversation

@huangrunhui (Collaborator) commented on May 10, 2021:

  1. Add the parameter server strategy (ps_strategy); a rough sketch of the flow is included below.
  2. Add ps_strategy to the JAX example.
  3. Add typing in ps_strategy.py, allreduce_strategy.py, and base_strategy.py.
  4. Add save/load of states for the strategy and the JAX operator.
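A minimal, runnable sketch of the push/pull flow a parameter-server strategy implements, written against plain Ray actors. The ParameterServer and Worker classes, their methods, and the shard/step logic below are illustrative stand-ins, not this PR's actual ps_strategy classes:

    import numpy as np
    import ray

    ray.init()

    @ray.remote
    class ParameterServer:
        """Holds one shard of the parameters and applies gradient updates to it."""

        def __init__(self, shard, lr=0.1):
            self.shard = shard
            self.lr = lr

        def apply_gradients(self, grad_shard):
            self.shard = self.shard - self.lr * grad_shard

        def get_params(self):
            return self.shard

    @ray.remote
    class Worker:
        """Computes one gradient per parameter shard it is handed."""

        def compute_gradients(self, shards):
            # Placeholder: a real training operator would run a forward/backward
            # pass on a data batch here.
            return [np.ones_like(s) for s in shards]

    num_ps, num_workers = 2, 2
    servers = [ParameterServer.remote(s)
               for s in np.array_split(np.zeros(10), num_ps)]
    workers = [Worker.remote() for _ in range(num_workers)]

    for _ in range(3):  # a few synchronous steps
        shards = ray.get([s.get_params.remote() for s in servers])          # pull
        grads = ray.get([w.compute_gradients.remote(shards) for w in workers])
        for i, server in enumerate(servers):                                # push
            avg = sum(g[i] for g in grads) / len(grads)
            server.apply_gradients.remote(avg)

    print(ray.get([s.get_params.remote() for s in servers]))

Each step, every worker pulls the current shards and computes gradients, and the averaged gradient for each shard is pushed to the server that owns it, which appears to match the flow wired through the worker and server groups in this PR.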



class ParameterServerStrategy(BaseStrategy):
    """Strategy that trains a model via collective AllReduce.

Collaborator: Change this docstring summary? It still says the strategy trains via collective AllReduce.
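One possible replacement summary (a suggestion only, not wording from the PR):

    class ParameterServerStrategy(BaseStrategy):
        """Strategy that trains a model with a parameter-server architecture.

        Workers compute gradients and push them to parameter servers, which
        apply the updates and serve the refreshed parameter shards back.
        """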

training_operator_cls,
operator_config=None,
initialization_hook=None,
num_workers=1,

Collaborator: num_worker (avoid the plural form here).


assert num_ps
self.num_ps = num_ps
self.num_workers = num_workers

Collaborator: Same here, don't use the plural form.

assert num_ps
self.num_ps = num_ps
self.num_workers = num_workers
self.num_cpus_per_server = num_cpus_per_server

Collaborator: And here, and in the following lines.

ray.get([server.set_params.remote(this_shard_ref)])

def _start_workers(self):
    """Create worker(actor), maybe need worker group to manager these workers.

Collaborator: Rewrite this docstring.
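A possible rewording (a suggestion only; the exact phrasing is not from the PR):

    def _start_workers(self):
        """Create the worker and server actors and group them.

        Workers are managed by a group object so they can be started,
        addressed, and shut down collectively.
        """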

"""
# TODO (Hao): infer the per-replica batch size here...

# so here we get two set of params that will be passed around:

Collaborator: You can remove this comment; it is redundant with the ones I left in AllReduceStrategy.

}

# Should we make two groups for worker and server?
self.worker_group = DataParallelGroup(**workergroup_init_args)

Collaborator: This is strange. Is this the same DataParallelGroup as the one in AllReduceStrategy? If yes, then this is fine. If not, is there any way we can share the same class? If that is hard, we should at least use a different class name.
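If the two strategies do not already share one implementation, a single generic group class could serve both; a sketch with made-up names (ActorGroup, actor_cls, actor_args are not identifiers from this PR):

    import ray

    class ActorGroup:
        """Starts and addresses a homogeneous set of Ray actors.

        Both the worker group and the server group could be instances of this
        one class, differing only in the actor class and constructor arguments
        they are given.
        """

        def __init__(self, actor_cls, actor_args, num_cpus_per_actor=1):
            self._remote_cls = ray.remote(num_cpus=num_cpus_per_actor)(actor_cls)
            self._actor_args = actor_args
            self.actors = []

        def start_actors(self, num_actors):
            self.actors = [
                self._remote_cls.remote(**self._actor_args)
                for _ in range(num_actors)
            ]

        def execute(self, method_name, *args):
            """Invoke a method on every actor and block on the results."""
            return ray.get(
                [getattr(actor, method_name).remote(*args) for actor in self.actors])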

self.server_group.start_actors(
    self.num_ps)  # server at the last num_ps processes.

worker_rets = self.worker_group.test_connection()

Collaborator: Is testing the connection necessary? If not, it should probably be moved to DEBUG mode.
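If the check is kept, one way to take it off the normal startup path is to gate it on the log level; a sketch assuming the group objects keep a test_connection() method:

    import logging

    logger = logging.getLogger(__name__)

    def maybe_test_connections(worker_group, server_group):
        """Run the connection checks only when DEBUG logging is enabled."""
        if not logger.isEnabledFor(logging.DEBUG):
            return
        logger.debug("worker connections: %s", worker_group.test_connection())
        logger.debug("server connections: %s", server_group.test_connection())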


def setup_operator(self):
    # figure out the signature of training_operator_cls later.
    self.training_operator = self.training_operator_cls(

Collaborator: I am not sure whether we should set up the whole operator on the server side. One drawback is that this will take a lot of GPU memory.
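One way to sidestep the memory concern is to keep the full training operator on the workers only and give each server nothing but a CPU copy of its parameter shard plus an update rule; a sketch with illustrative names (ShardServer and apply_gradients are not the PR's API):

    import numpy as np

    class ShardServer:
        """Server-side state without the full training operator.

        Holds only a numpy copy of one parameter shard, so no model, optimizer,
        data loader, or GPU memory is needed on the server.
        """

        def __init__(self, shard, lr=0.1):
            self.shard = np.asarray(shard, dtype=np.float32)
            self.lr = lr

        def set_params(self, shard):
            self.shard = np.asarray(shard, dtype=np.float32)

        def apply_gradients(self, grad_shard):
            # Plain SGD on the shard; any optimizer state would also live here.
            self.shard -= self.lr * np.asarray(grad_shard, dtype=np.float32)

        def get_params(self):
            return self.shard

Workers would then build the training operator, compute gradients, and push per-shard gradients to these servers, mirroring the set_params call shown in the diff context above.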
