Branch: develop
Commits on Dec 9, 2019
  1. Update .pre-commit-config.yaml (#1558)

    terrytangyuan authored and skydoorkai committed Dec 9, 2019
Commits on Dec 7, 2019
  1. Reuse the same ps id and address when relaunching ps (#1555)

    terrytangyuan authored and skydoorkai committed Dec 7, 2019
    * Reuse the same ps id and address when relaunching ps
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Revert to previous behavior
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Address comments
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
Commits on Dec 6, 2019
  1. Add logic to broadcast model params to all workers (#1551)

    terrytangyuan committed Dec 6, 2019
    * wip
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Add ip check
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Address comments
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
Commits on Dec 4, 2019
  1. Refactor ODPS env var check to a reusable function (#1547)

    terrytangyuan committed Dec 4, 2019
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
  2. Dynamically update service addresses after worker/ps relaunch (#1543)

    terrytangyuan committed Dec 4, 2019
    * Dynamically update service addresses after worker/ps relaunch
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Fix test
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Fix test
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Fix typo
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
Commits on Dec 3, 2019
  1. Improve code style in TaskDataService (#1544)

    terrytangyuan committed Dec 3, 2019
    * Improve code style in TaskDataService
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Remove comment
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
Commits on Dec 2, 2019
  1. Run pre-commit on Python CI scripts (#1542)

    terrytangyuan committed Dec 2, 2019
Commits on Nov 27, 2019
  1. Remove redis related dependencies (#1534)

    terrytangyuan committed Nov 27, 2019
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
  2. Print out master pod label status (#1533)

    terrytangyuan authored and ywskycn committed Nov 27, 2019
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
  3. Provide default implementation of dataset_fn for ODPS data source (#1531)

    terrytangyuan committed Nov 27, 2019
    * Provide default implementation of dataset_fn for ODPS data source
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Fix if-else logic
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Fix test
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Address comments
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * , -> ;
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
  4. Add missing docstrings in test_utils.distributed_train_and_evaluate (#1532)

    terrytangyuan committed Nov 27, 2019
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
Commits on Nov 26, 2019
  1. Create k8s services for worker-worker communication (#1523)

    terrytangyuan committed Nov 26, 2019
    * Create k8s services for worker-worker communication
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Get service name instead
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
  2. Reuse grpc_utils.build_channel() in Worker.main (#1524)

    terrytangyuan committed Nov 26, 2019
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
  3. Report gradients to local model for allreduce strategy (#1516)

    terrytangyuan committed Nov 26, 2019
    * Report gradients to local model when using allreduce
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Fix precommit
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Update worker.py
Commits on Nov 25, 2019
  1. Add logic for allreduce failure handling in Worker (#1501)

    terrytangyuan authored and QiJune committed Nov 25, 2019
    * Add logic for allreduce failure handling in Worker
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Reformat
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Fix isort
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Fix unit test
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Remove unnecessary loss
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Add TODO
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
Commits on Nov 21, 2019
  1. Allow user to specify tf.keras style loss function (#1490)

    terrytangyuan committed Nov 21, 2019
    * Allow user to specify tf.keras style loss function
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Fix ps interaction test
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
Commits on Nov 20, 2019
  1. Setup allreduce strategy toggle in Worker (#1480)

    terrytangyuan committed Nov 20, 2019
    * Setup allreduce strategy toggle in Worker
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Fix isort
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Address feedback
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Add tests
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Fix tuple unpack
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Fix tuple unpack in worker
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
Commits on Nov 18, 2019
  1. Add allreduce strategy option to CLI arg (#1470)

    terrytangyuan authored and QiJune committed Nov 18, 2019
    * Add allreduce strategy option to CLI arg
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Fix test
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
  2. Improve code style (#1471)

    terrytangyuan authored and QiJune committed Nov 18, 2019
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
Commits on Nov 14, 2019
  1. Add design doc for allreduce-based distributed training (#1420)

    terrytangyuan committed Nov 14, 2019
    * Add section on fault-tolerant Allreduce implementation
    
    * Add initial interface design
    
    * Add task continuation and training with evaluation
    
    * Add section on failure handling and relevant CLI args
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Details for failure handling and section on embedding layer
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Add motivation section
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Edits on section flow
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Address comments
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Address new comments
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Add section on Data Distribution among Workers
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Add barrier interface and mention conversion/copy in the process
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Add potential optimization on model evaluation
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
  2. Add check for empty string for ODPS endpoint (#1464)

    terrytangyuan committed Nov 14, 2019
    * Add check for empty string for ODPS endpoint
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Fix precommit
  3. Include layers module in elasticdl.python.elasticdl (#1463)

    terrytangyuan committed Nov 14, 2019
    * Include layers module in elasticdl.python.elasticdl
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Ignore flake8 check locally
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
Commits on Nov 13, 2019
  1. Add choices for --image_pull_policy and --restart_policy (#1451)

    terrytangyuan authored and LiMinghao1994 committed Nov 13, 2019
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
Commits on Nov 7, 2019
  1. Check the true status for master pod when TensorBoard is enabled (#1429)

    terrytangyuan committed Nov 7, 2019
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
  2. Properly set pipeline exit option when validating job status (#1430)

    terrytangyuan committed Nov 7, 2019
    * Properly set pipeline exit option when validating job status
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Address comments
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
Commits on Nov 6, 2019
  1. Add services to rbac manifest (#1422)

    terrytangyuan authored and ywskycn committed Nov 6, 2019
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
Commits on Nov 5, 2019
  1. Update to use stable release of black formatter (#1415)

    terrytangyuan authored and ywskycn committed Nov 5, 2019
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
  2. Add TensorBoard service in integration tests (#1416)

    terrytangyuan authored and ywskycn committed Nov 5, 2019
Commits on Nov 4, 2019
  1. Add --prediction_outputs_processor to ElasticDL CLI (#1410)

    terrytangyuan authored and ywskycn committed Nov 4, 2019
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
  2. Remove the redundant _design suffix from design docs (#1411)

    terrytangyuan authored and ywskycn committed Nov 4, 2019
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
Commits on Nov 1, 2019
  1. SQLFlow integration design doc (#1402)

    terrytangyuan committed Nov 1, 2019
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
  2. Improve docstring for TaskDataService and remove mentions of RecordIO (#1403)

    terrytangyuan authored and LiMinghao1994 committed Nov 1, 2019
    * Improve docstring for TaskDataService and remove mentions of RecordIO
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Update task_data_service.py
Commits on Oct 25, 2019
  1. Add Allreduce doc on relevant technologies (#1353)

    terrytangyuan authored and ywskycn committed Oct 25, 2019
    * Add Allreduce doc
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
    
    * Address comments
    
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
  2. Improve help doc on get_model_steps (#1381)

    terrytangyuan authored and ywskycn committed Oct 25, 2019
    Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>