Branch: develop
Commits on Jan 23, 2020
  1. Use pod IP to start Go-based GRPC server. (#1682)

    skydoorkai committed Jan 23, 2020
    * get ip for go ps grpc server
    
    * fix method
    
    * fix arg
    
    * fix unittest
    
    * rewrite
Commits on Jan 21, 2020
  1. add ps command/args for Go PS (#1674)

    skydoorkai authored and terrytangyuan committed Jan 21, 2020
    * add ps command/args for Go PS
    
    * comment
    
    * required args
Commits on Jan 20, 2020
  1. add arg parsing in Go PS (#1666)

    skydoorkai authored and QiJune committed Jan 20, 2020
    * add arg parsing in Go PS
    
    * rebased
Commits on Jan 16, 2020
  1. Create golang PS server (#1636)

    skydoorkai committed Jan 16, 2020
    * go server
    
    * fix
    
    * fix golint
    
    * add test
    
    * fix gofmt
    
    * change path
    
    * rename server to ps
Commits on Jan 10, 2020
  1. generate Go files from proto file (#1629)

    skydoorkai authored and terrytangyuan committed Jan 10, 2020
  2. add golang support in Dockerfile.dev (#1625)

    skydoorkai committed Jan 10, 2020
Commits on Jan 4, 2020
  1. fix a worker relaunch bug when worker is relaunched after all training tasks are done (#1612)

    skydoorkai committed Jan 4, 2020
    * fix relaunch bug
    
    * fix black
Commits on Jan 3, 2020
  1. variables from different ps have their own model_version (#1604)

    skydoorkai committed Jan 3, 2020
    * variables from different ps have their own model_version
    
    * fix test failure
    
    * fix unittest
    
    * no need to check model version
Commits on Jan 1, 2020
  1. avoid worker hang in multi-process (#1606)

    skydoorkai committed Jan 1, 2020
    * avoid worker hang in multi-process
    
    * fix unittest
    
    * fix
    
    * trigger
    
    * trigger travis
Commits on Dec 24, 2019
  1. add timing debug info (#1587)

    skydoorkai committed Dec 24, 2019
    * add timing debug info
    
    * fix unittest
    
    * restruct timing
    
    * fix
Commits on Dec 20, 2019
  1. Support learning rate scheduler (#1581)

    skydoorkai committed Dec 20, 2019
    * lr scheduler support
    
    * use tls
    
    * add test
    
    * fix black
    
    * fix tests
Commits on Dec 17, 2019
  1. Only relaunch failed pod with reason as OOMKilled (#1574)

    skydoorkai authored and terrytangyuan committed Dec 17, 2019
Commits on Dec 14, 2019
  1. relaunch failed pod (#1563)

    skydoorkai committed Dec 14, 2019
Commits on Dec 5, 2019
  1. fix pod env (#1550)

    skydoorkai authored and terrytangyuan committed Dec 5, 2019
Commits on Dec 4, 2019
  1. support partition format (#1548)

    skydoorkai authored and QiJune committed Dec 4, 2019
    * support partition format
    
    * fix black
    
    * add test and rewrite wkargs get
    
    * change default value to None
Commits on Nov 25, 2019
  1. grads are accepted if they are accepted by one ps (#1514)

    skydoorkai authored and terrytangyuan committed Nov 25, 2019
Commits on Nov 22, 2019
Commits on Nov 19, 2019
  1. Use async GRPC call to parallel data communication between worker and ps (#1486)

    skydoorkai committed Nov 19, 2019
    * use async GRPC call to parallel data communication between worker and ps
    
    * rename to pairs
Commits on Nov 18, 2019
  1. init wrap optimizer when embedding info is set (#1475)

    skydoorkai authored and terrytangyuan committed Nov 18, 2019
Commits on Nov 15, 2019
  1. wait until ps pod is ready (#1465)

    skydoorkai authored and QiJune committed Nov 15, 2019
Commits on Nov 14, 2019
  1. add workaround for ps grpc connection (#1461)

    skydoorkai authored and LiMinghao1994 committed Nov 14, 2019
Commits on Nov 12, 2019
  1. report_variable for ps init when needed (#1449)

    skydoorkai committed Nov 12, 2019
    * report_variable for ps init when needed
    
    * use tearDown in restart
    
    * add model_init_status check
    
    * reduce test time
Commits on Nov 1, 2019
  1. pull_embedding_vector RPC implementation (#1401)

    skydoorkai committed Nov 1, 2019
    * pull_embedding_vector RPC implementation
    
    * revision according to comments
Commits on Oct 31, 2019
  1. pull_variable RPC implementation (#1393)

    skydoorkai committed Oct 31, 2019
    * pull_variable RPC implementation
    
    * move set version inside lock
    
    * rewrite dict access
Commits on Oct 30, 2019
  1. push_model rpc implementation (#1385)

    skydoorkai committed Oct 30, 2019
    * push_model rpc implementation
    
    * fix according to comment
Commits on Oct 29, 2019
  1. create legacy ps related services only when num_ps_pods==0 (#1383)

    skydoorkai committed Oct 29, 2019
    * create legacy ps service only num_ps_pods==0
    
    * fix
  2. Init PS RPC servicer (#1369)

    skydoorkai authored and LiMinghao1994 committed Oct 29, 2019
    * init ps rpc servicer
    
    * fix flake8
    
    * fix proto
    
    * move channel/port to class attributes for reuse
    
    * reorg ps.main
Commits on Oct 24, 2019
  1. create `ps.main` for PS main process and parse arguments (#1349)

    skydoorkai committed Oct 24, 2019
    * Create ps.main
    
    * fix flake8
Commits on Oct 23, 2019
  1. PS design doc (combined version) (#1317)

    skydoorkai committed Oct 23, 2019
    * ps design revision
    
    * refine doc
    
    * refine fixed domain
    
    * rewrite
    
    * revision
    
    * add two diagrams
    
    * Add code snippets in appendix section
    
    * polish
    
    * revision
    
    * change the order inside appendix section
    
    * remove dup
    
    * update diagrams
    
    * add more description
    
    * add more description
    
    * revise
    
    * polish
    
    Modify a long line into multiple lines.
    
    * polish doc
    
    * revision again
    
    * modify code snippets
    
    * polish
    
    * polish
    
    * fix format
    
    * polish
    
    * polish
    
    * move ps_design.md out of archived folder
    
    * Update ps_design.md
Commits on Oct 9, 2019
  1. Only training needs shuffle (#1277)

    skydoorkai committed Oct 9, 2019
    * only training needs shuffle
    
    * more fix
Commits on Sep 24, 2019
  1. Add async test cases in example_test.py (#1228)

    skydoorkai committed Sep 24, 2019
    * add async test case in example_test.py
    
    * add version check
Commits on Sep 20, 2019
  1. Apply gradients directly for async (#1217)

    skydoorkai committed Sep 20, 2019
    * apply grads directly for async
    
    * fix test
    
    * revise according to comments
    
    * fix redundant check
    
    * fix
Commits on Sep 18, 2019
  1. do not use lock for async SGD in ReportGradient (#1200)

    skydoorkai committed Sep 18, 2019
    * not use lock for async in ReportGradient
    
    * fix for comments
Commits on Sep 17, 2019
  1. fix command in embedding service (#1193)

    skydoorkai authored and LiMinghao1994 committed Sep 17, 2019
  2. add staleness-aware learning rate modulation (#1172)

    skydoorkai committed Sep 17, 2019
    * add staleness-aware lr modulation
    
    * fix isort
    
    * change test according to comments