Fix linking _pywrap_tensorflow_internal.so and re-enable XLA on macOS (horovod#3173)

Spark/Lightning: fix the usage of checkpoint callback (horovod#3186)

Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>

Fix Cometlogger experiment key lost issue (horovod#3184)

* test

Signed-off-by: Peng Zhang <pengz@uber.com>

* test

Signed-off-by: Peng Zhang <pengz@uber.com>

* fix_logger

Signed-off-by: Peng Zhang <pengz@uber.com>

* fix_logger

Signed-off-by: Peng Zhang <pengz@uber.com>

* recreate_loger

Signed-off-by: Peng Zhang <pengz@uber.com>

* fix_var

Signed-off-by: Peng Zhang <pengz@uber.com>

* test

Signed-off-by: Peng Zhang <pengz@uber.com>

* test

Signed-off-by: Peng Zhang <pengz@uber.com>

Updated torch c++ to use new aten api (horovod#3175)

Spark/Keras: remove bare Keras support (horovod#3191)

Make fork PRs publish test change stats (horovod#3185)

Signed-off-by: Enrico Minack <github@enrico.minack.dev>

Support for nccl on cuda 11.4 (horovod#3182)

Signed-off-by: Evan Brossard <evanb@maka-ars.com>

Fix MPICH support (horovod#3148)

* fix MPICH implementation
* enable tests for MPICH and Intel MPI

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>

Increase build timeout to 40m on Buildkite (horovod#3192)

Signed-off-by: Enrico Minack <github@enrico.minack.dev>

Change CMake syntax to be compatible with old versions of CMake (horovod#3196)

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

Reinit every torch test (horovod#3194)

Add barrier call to torch module to support easy synchronization for process sets (horovod#3139)

* Added barrier call to torch module

Signed-off-by: TJ <tix@uber.com>
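The new `hvd.barrier()` call blocks each worker until every member of the process set has reached the barrier. Its semantics mirror Python's `threading.Barrier`; the runnable sketch below uses threads as stand-ins for Horovod workers (the worker function and counts are illustrative, not Horovod code):

```python
# Illustrates the blocking semantics that hvd.barrier() (horovod#3139)
# provides across workers, using threading.Barrier as a stand-in.
import threading

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)
arrived_before_release = []
lock = threading.Lock()

def worker(rank):
    with lock:
        arrived_before_release.append(rank)   # per-worker work happens here
    barrier.wait()                            # nobody proceeds until all arrive
    # after the barrier, every worker has finished the section above
    with lock:
        assert len(arrived_before_release) == NUM_WORKERS

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```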

Bump version to 0.23.0 (horovod#3200)

Signed-off-by: Travis Addair <tgaddair@gmail.com>

Co-authored-by: Max H. Gerlach <git@maxgerlach.de>

Increase Parallel PyTest timeout to 10m (horovod#3198)

* Increase MPI and Gloo Parallel PyTest timeout to 10m

Signed-off-by: Enrico Minack <github@enrico.minack.dev>

Spark/Lightning: don't overwrite model with checkpoint by default (horovod#3201)

The Lightning estimator saves the model by default if no checkpoint
callback is specified. However, the model is not overwritten with the
checkpoint file in that case.

Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>

Spark/Lightning: fix checkpoint callback dirpath typo (horovod#3204)

Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>

Rework events in CI workflows (horovod#3202)

Signed-off-by: Enrico Minack <github@enrico.minack.dev>

Allow for concurrent schedule and master build, document concurrency (horovod#3206)

Signed-off-by: Enrico Minack <github@enrico.minack.dev>

Ray: fix RayExecutor to fail when num_workers=0 and num_hosts=None (horovod#3210)

Signed-off-by: Travis Addair <tgaddair@gmail.com>

add_history_in_lightning_estimator (horovod#3214)

Signed-off-by: Peng Zhang <pengz@uber.com>

Allow buildkite building merge commits on forks (horovod#3215)

Signed-off-by: Enrico Minack <github@enrico.minack.dev>

Fix json output in ci-results.yaml (horovod#3217)

Spark/Lightning: fix history metrics for estimator serialization (horovod#3216)

Save metrics inside the checkpoint dict, which will be loaded with map_location=torch.device('cpu').

Signed-off-by: Peng Zhang <pengz@uber.com>

patch python source files on macCI (horovod#3220)

* patch python source files on macCI

* Trigger build and test CI

Signed-off-by: TJ <tix@uber.com>

Co-authored-by: Enrico Minack <github@enrico.minack.dev>

Updated examples of torch and tf to include mixed precision training (horovod#3222)

* Added mixed precision example for pytorch

* added mixed precision for keras

Signed-off-by: TJ <tix@uber.com>
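A minimal sketch of the `torch.cuda.amp` pattern such a mixed-precision example typically follows (autocast forward pass plus `GradScaler`); the tiny model and data here are placeholders, and in the actual Horovod example the optimizer would additionally be wrapped with `hvd.DistributedOptimizer`. The sketch is CPU-safe: AMP is simply disabled when no GPU is present.

```python
# Mixed-precision training skeleton: autocast the forward pass, scale the
# loss before backward to avoid fp16 gradient underflow, then step/update.
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
use_amp = torch.cuda.is_available()           # AMP only makes sense on GPU
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x, y = torch.randn(8, 10), torch.randn(8, 1)
optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=use_amp):
    loss = torch.nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()   # scaled backward pass avoids fp16 underflow
scaler.step(optimizer)          # unscales grads; skips the step on inf/nan
scaler.update()
```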

Job buildkite-heads accesses ci-workflow outputs, add it to the needs (horovod#3225)

Signed-off-by: Enrico Minack <github@enrico.minack.dev>

Fixes race condition for ray scale up down tests (horovod#3205)

Ensure that at least one host from the previous set of hosts has
been registered.
Without this, the discovery script will "discover" the new
set of hosts before the current set can register,
resulting in a race condition.
Consider a discovery schedule:
```
discovery_schedule = [
    (10, ['host-1:2']),
    (30, ['host-1:2', 'host-2:1', 'host-3:1']),
    (None, ['host-2:1']),
]
```
The initial set is ['host-1:2']. Before it is registered in the driver, the discovery script
discovers the set ['host-1:2', 'host-2:1', 'host-3:1'] and adds ['host-2:1', 'host-3:1'].
However, since ['host-1:2'] has not registered, there is no coordinator to notify the workers.
When host-1 and host-3 are removed, driver.resume calls _activate_workers, which updates the host assignments.
It checks the intersection between the previous and current sets of hosts, and finds that the previous
set is ['host-1:2'] and the current set is ['host-2:1'], because there was no notification for the added and removed
hosts.
This fix ensures that the previous set of hosts can register before the new set is discovered.
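The empty-intersection failure mode described above can be sketched in a few lines of pure Python (host strings follow the discovery schedule; the check shown is only an illustration of the `_activate_workers` logic, not the actual Horovod code):

```python
# Sketch of the race: the driver's registered set and the newly discovered
# set share no host, so no surviving coordinator exists to notify workers.
previous_hosts = {'host-1:2'}            # only set that ever registered
current_hosts = {'host-2:1'}             # set seen after hosts were added and
                                         # removed without registration between

overlap = previous_hosts & current_hosts # _activate_workers-style intersection
assert overlap == set()                  # empty overlap == the race condition
```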

Signed-off-by: Abin Shahab <ashahab@linkedin.com>

Removed a case of the default mutable argument pitfall (horovod#3227)

Signed-off-by: Naelson Douglas <naelson17@gmail.com>
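For context, the pitfall this commit removes: a mutable default argument is evaluated once at function definition time and then shared across calls. The functions below are illustrative only, not the actual Horovod code touched by horovod#3227:

```python
# Mutable default argument pitfall and its idiomatic fix.
def buggy(item, acc=[]):        # one shared list for every call
    acc.append(item)
    return acc

assert buggy(1) == [1]
assert buggy(2) == [1, 2]       # surprise: state leaks between calls

def fixed(item, acc=None):      # fix: None sentinel, fresh list per call
    if acc is None:
        acc = []
    acc.append(item)
    return acc

assert fixed(1) == [1]
assert fixed(2) == [2]          # independent calls, as expected
```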

Updates to TSC members (horovod#3234)

Signed-off-by: Travis Addair <tgaddair@gmail.com>

Add in-place broadcast for TensorFlow (horovod#3128)

* Update comment in FindTensorflow.cmake

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

* Add in-place broadcast_() and broadcast_variables() for TF

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

* Include source files from TF in build to avoid missing symbol errors

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

* Limit build and test to TF 2.6+

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

* Remove source files copied from TensorFlow

The missing symbols are resolved by linking against _pywrap_tensorflow_internal.so,
which was introduced to Horovod with PR horovod#3053.

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

* Fix possible type attribute values for HorovodBroadcastInplace

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

* Add reference variables to test

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

* Update comments, doc strings, changelog

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

[Elastic Horovod] Fix the bug for ElasticSampler and hvd.elastic.state (horovod#3144)

Co-authored-by: gethinhu <gethinhu@tencent.com>

a better way to handle nccl error under elastic scenario (horovod#3112)

Signed-off-by: guoze.lin <guozelin@tencent.com>

check torch version for mixed precision example (horovod#3238)

Lightning: set limit_train_batches and limit_val_batches (horovod#3237)

Tell the Lightning trainer how many batches a single epoch needs.

Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>
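One plausible way the estimator derives the per-epoch batch limits it hands to the Lightning Trainer — the formula and variable names below are assumptions for illustration, not taken from the patch:

```python
# Per-worker batches per epoch: total rows divided by the global batch size
# (per-worker batch size times the number of workers).
train_rows, val_rows = 100_000, 10_000
batch_size, num_workers = 32, 4

limit_train_batches = train_rows // (batch_size * num_workers)
limit_val_batches = val_rows // (batch_size * num_workers)

# These would then be forwarded to the trainer, e.g.:
# trainer = pl.Trainer(limit_train_batches=limit_train_batches,
#                      limit_val_batches=limit_val_batches)
```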

Spark/Lightning: reduce memory footprint of async dataloader (horovod#3239)

Limit async data loader queue size.

Signed-off-by: Peng Zhang <pengz@uber.com>
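Bounding the prefetch queue means the producer blocks once the queue is full instead of buffering the whole epoch in memory. A minimal runnable sketch with a stand-in producer thread (the queue size and batch source are assumed, not taken from the patch):

```python
# Bounded producer/consumer: queue.Queue(maxsize=N) applies backpressure,
# capping the async loader's memory footprint at N in-flight batches.
import queue
import threading

BATCHES = list(range(100))
q = queue.Queue(maxsize=4)      # producer blocks when 4 batches are waiting
SENTINEL = object()

def producer():
    for batch in BATCHES:
        q.put(batch)            # blocks here whenever the consumer falls behind
    q.put(SENTINEL)

threading.Thread(target=producer, daemon=True).start()

consumed = []
while True:
    item = q.get()
    if item is SENTINEL:
        break
    consumed.append(item)
```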

Change default fusion threshold from 64MB to 128MB in docs (horovod#3241)

fix the example of pytorch_lightning_mnist.py (horovod#3245)

- remove unused arg parameters
- fix model test issue on GPU

Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>

CI: use latest pytorch_lightning with torchhead (horovod#3243)

test_gradient_aggregation with real gradient instead of a constant (horovod#3176)

This fixes issue horovod#2664 by performing gradient aggregation with a real gradient instead of a constant.
PR horovod#2647 shifts the gradient allreduce to the point where the gradient is computed (both through the DistributedOptimizer and through the DistributedGradientTape). This means that this unit test, by design in TF 2.4, doesn't call allreduce in _aggregate_gradients().

Since this unit test provides the gradient as a constant (without actually computing it), the gradient will never be allreduced.
This change ensures that a real gradient is computed from a loss function instead of using a constant.

Note: the current loss function intentionally evaluates to zero. A future PR should convert it to a real loss function (e.g. MeanSquaredError) and compute gradients from that to test gradient aggregation.
Signed-off-by: Abin Shahab <ashahab@linkedin.com>
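The shape of the fix, sketched (variable values are illustrative, not the test's actual model): derive the gradient from a loss via `tf.GradientTape` so the optimizer path sees a real, allreduce-able gradient tensor rather than a constant. Matching the test's current design, the loss here intentionally evaluates to zero.

```python
# Compute a real (zero-valued) gradient from a loss instead of supplying
# a constant, so _aggregate_gradients() has something to allreduce.
import tensorflow as tf

var = tf.Variable([1.0, 2.0])
with tf.GradientTape() as tape:
    loss = tf.reduce_sum(var * 0.0)    # intentionally zero-valued loss
grad = tape.gradient(loss, var)        # a real gradient tensor, not a constant
```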
maxhgerlach authored and weihanmines committed Nov 8, 2021
1 parent f0c44d2 commit 00f2586
Showing 72 changed files with 2,554 additions and 1,505 deletions.
8 changes: 4 additions & 4 deletions .buildkite/gen-pipeline.sh
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,7 @@ tests=$(if [[ -n "${PIPELINE_MODE:-}" ]] && ( [[ "${BUILDKITE_BRANCH:-}" == "${B
printf "test-cpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 "
# then we vary the frameworks for gpu
printf "test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2 "
printf "test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2 "
# this is required as we cannot test mxnet-1.6.0.post0 with cpu
printf "test-gpu-gloo-py3_8-tf2_4_3-keras2_3_1-torch1_7_1-mxnet1_6_0_p0-pyspark3_1_2 "
# we additionally test the previous framework combination (CUDA 10.x) with mxnet 1.7.x
Expand Down Expand Up @@ -78,7 +78,7 @@ build_test() {
echo " push-retries: 5"
echo " - ecr#v1.2.0:"
echo " login: true"
-        echo "      timeout_in_minutes: 30"
+        echo "      timeout_in_minutes: 40"
echo " retry:"
echo " automatic: true"
echo " agents:"
Expand Down Expand Up @@ -125,7 +125,7 @@ run_mpi_pytest() {
run_test "${test}" "${queue}" \
":pytest: MPI Parallel PyTests (${test})" \
"bash -c \"${oneccl_env} ${test_env} cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \\\$(cat /mpirun_command) /bin/bash /pytest.sh mpi)\"" \
-            5
+            10
run_test "${test}" "${queue}" \
":pytest: MPI Single PyTests (${test})" \
"bash -c \"${oneccl_env} ${test_env} cd /horovod/test/single && (ls -1 test_*.py | xargs -n 1 /bin/bash /pytest_standalone.sh mpi)\"" \
Expand Down Expand Up @@ -231,7 +231,7 @@ run_gloo_pytest() {
run_test "${test}" "${queue}" \
":pytest: Gloo Parallel PyTests (${test})" \
"bash -c \"${test_env} cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)\"" \
-            5
+            10
run_test "${test}" "${queue}" \
":pytest: Gloo Single PyTests (${test})" \
"bash -c \"${test_env} cd /horovod/test/single && (ls -1 test_*.py | xargs -n 1 /bin/bash /pytest_standalone.sh gloo)\"" \
Expand Down
207 changes: 114 additions & 93 deletions .github/gen-workflow-ci.py

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions .github/get-changed-code-files.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,8 +7,8 @@
import requests

# this script outputs all code files that have changed between commit and master
-# environment variable GITHUB_HEAD provides the commit SHA
-# environment variable GITHUB_BASE provides the master SHA
+# environment variable GITHUB_HEAD_SHA provides the commit SHA
+# environment variable GITHUB_BASE_SHA provides the master SHA

# files that match any of these regexps are considered non-code files
# even though those files have changed, they will not be in the output of this script
Expand Down Expand Up @@ -49,8 +49,8 @@ def is_non_code_file(file):
if __name__ == "__main__":
logging.getLogger().level = logging.DEBUG

-    base = os.environ.get('GITHUB_BASE')
-    head = os.environ.get('GITHUB_HEAD')
+    base = os.environ.get('GITHUB_BASE_SHA')
+    head = os.environ.get('GITHUB_HEAD_SHA')
if head is None or base is None:
logging.warning('no base commit ({}) or head commit ({}) given'.format(base, head))
sys.exit(1)
Expand Down
106 changes: 88 additions & 18 deletions .github/workflows/ci-fork.yaml → .github/workflows/ci-results.yaml
Original file line number Diff line number Diff line change
@@ -1,4 +1,7 @@
-name: CI (Fork)
+# publishes test results from the CI workflow (not when run on schedule)
+# this publishes test results of PRs from horovod repository and fork repositories
+# buildkite tests are only run here for fork repositories
+name: CI (Results)

on:
workflow_run:
Expand All @@ -16,39 +19,67 @@ jobs:
ci-workflow:
name: "Check CI workflow outcome"
runs-on: ubuntu-latest
# only run if CI workflow ran on a fork
# only run if CI workflow has not been skipped or cancelled
# only run if CI workflow did not run on schedule
if: >
github.event.workflow_run.conclusion != 'skipped' &&
github.event.workflow_run.conclusion != 'cancelled' &&
github.event.workflow_run.head_repository.fork
github.event.workflow_run.event != 'schedule'
outputs:
build-and-test: ${{ steps.workflow-conclusion.outputs.build-and-test }}
pr-json: ${{ steps.pr.outputs.json }}

steps:
- name: Fetch workflow conclusion
id: workflow-conclusion
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
curl -s "${{ github.event.workflow_run.jobs_url }}" > workflow_run_jobs.json
conclusion=$(jq -r '.jobs[] | select(.name | startswith("Build and Test (")) | .conclusion' workflow_run_jobs.json | sort | uniq | paste -sd "," -)
conclusion=$(gh api "${{ github.event.workflow_run.jobs_url }}" -q '.jobs[] | select(.name | startswith("Build and Test (")) | .conclusion' | sort | uniq | paste -sd "," -)
echo "build-and-test conclusion: ${conclusion}"
echo "::set-output name=build-and-test::${conclusion}"
shell: bash

- name: Fetch PR meta
id: pr
if: github.event.workflow_run.event == 'pull_request' && github.event.workflow_run.head_repository.fork
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
artifacts_url=${{ github.event.workflow_run.artifacts_url }}
gh api "$artifacts_url" -q '.artifacts[] | select(.name == "PR Meta") .archive_download_url' | while read url
do
gh api "$url" > "pr.zip"
unzip -o "pr.zip"
echo "::set-output name=json::$(cat pr.json)"
cat pr.json
echo
done
if [[ ! -e "pr.json" ]]
then
echo "::error title=Artifact 'PR Meta' missing::Expected artifact 'PR Meta' does not exist for pull_request event."
exit 1
fi
buildkite:
name: "Build and Test GPU (on Builtkite)"
needs: [ci-workflow]
runs-on: ubuntu-latest
# only run if CI workflow's build-and-test job succeeded and CI workflow ran on a fork
if: needs.ci-workflow.outputs.build-and-test == 'success'
if: >
needs.ci-workflow.outputs.build-and-test == 'success' &&
github.event.workflow_run.head_repository.fork
steps:
- name: Trigger Buildkite Pipeline
id: buildkite
uses: EnricoMi/trigger-pipeline-action@master
env:
PIPELINE: "horovod/horovod"
COMMIT: "${{ github.event.workflow_run.head_sha }}"
COMMIT: "${{ fromJSON( needs.ci-workflow.outputs.pr-json ).merge_sha }}"
BRANCH: "${{ github.event.workflow_run.head_repository.owner.login }}:${{ github.event.workflow_run.head_branch }}"
MESSAGE: "${{ github.event.workflow_run.message }}"
MESSAGE: "${{ github.event.workflow_run.head_commit.message }} (release versions)"
BUILDKITE_API_ACCESS_TOKEN: ${{ secrets.BUILDKITE_TOKEN }}
BUILD_ENV_VARS: "{\"PIPELINE_MODE\": \"GPU NON HEADS\"}"

Expand Down Expand Up @@ -81,20 +112,22 @@ jobs:
buildkite-heads:
name: "Build and Test GPU heads (on Builtkite)"
needs: [buildkite]
needs: [ci-workflow, buildkite]
runs-on: ubuntu-latest
# only run if CI workflow's build-and-test job succeeded and CI workflow ran on a fork
if: needs.ci-workflow.outputs.build-and-test == 'success'
if: >
needs.ci-workflow.outputs.build-and-test == 'success' &&
github.event.workflow_run.head_repository.fork
steps:
- name: Trigger Buildkite Pipeline
id: buildkite
uses: EnricoMi/trigger-pipeline-action@master
env:
PIPELINE: "horovod/horovod"
COMMIT: "${{ github.event.workflow_run.head_sha }}"
COMMIT: "${{ fromJSON( needs.ci-workflow.outputs.pr-json ).merge_sha }}"
BRANCH: "${{ github.event.workflow_run.head_repository.owner.login }}:${{ github.event.workflow_run.head_branch }}"
MESSAGE: "${{ github.event.workflow_run.message }}"
MESSAGE: "${{ github.event.workflow_run.head_commit.message }} (head versions)"
BUILDKITE_API_ACCESS_TOKEN: ${{ secrets.BUILDKITE_TOKEN }}
BUILD_ENV_VARS: "{\"PIPELINE_MODE\": \"GPU HEADS\"}"

Expand Down Expand Up @@ -127,14 +160,16 @@ jobs:
publish-test-results:
name: "Publish Unit Tests Results"
needs: [buildkite, buildkite-heads]
needs: [ci-workflow, buildkite, buildkite-heads]
runs-on: ubuntu-latest
# only run if CI workflow ran on a fork
# only publish results when ci-workflow job has not been skipped, meaning:
# - CI workflow has not been skipped or cancelled
# - CI workflow did not run on schedule
# and CI workflow's build-and-test jobs have not all been skipped
if: >
always() &&
github.event.workflow_run.conclusion != 'skipped' &&
github.event.workflow_run.conclusion != 'cancelled' &&
github.event.workflow_run.head_repository.fork
needs.ci-workflow.result != 'skipped' &&
needs.ci-workflow.outputs.build-and-test != 'skipped'
steps:
- name: Debug Action
Expand All @@ -160,8 +195,43 @@ jobs:
with:
path: artifacts

- name: Identify last run of each test
continue-on-error: true
run: |
declare -A last_runs
ls -d artifacts/Unit\ Test\ Results\ */* | sort > runs.txt
while read run
do
test=${run/%[_-]run[_-][0123456789]/}
last_runs[$test]=$run
done < runs.txt
echo "LAST_RUNS<<EOF" >> $GITHUB_ENV
for test in "${!last_runs[@]}"
do
echo "${last_runs[$test]}" >&2
echo "${last_runs[$test]}/**/*.xml" >> $GITHUB_ENV
done
echo "EOF" >> $GITHUB_ENV
shell: bash

- name: Publish Unit Test Results
uses: EnricoMi/publish-unit-test-result-action@v1
if: always()
with:
check_name: Unit Test Results
event_file: artifacts/Event File/event.json
event_name: ${{ github.event.workflow_run.event }}
commit: ${{ github.event.workflow_run.head_sha }}
files: "${{ env.LAST_RUNS }}"

- name: Publish Unit Test Results (with flaky tests)
uses: EnricoMi/publish-unit-test-result-action@v1
if: always()
with:
check_name: Unit Test Results (with flaky tests)
event_file: artifacts/Event File/event.json
event_name: ${{ github.event.workflow_run.event }}
commit: ${{ github.event.workflow_run.head_sha }}
files: "artifacts/*/**/*.xml"
files: "artifacts/Unit Test Results */**/*.xml"
fail_on: errors
