Horovod in PySpark #606
This PR adds a convenient helper function
Under the hood, we set up a small TCP service on the Driver and on each Task that performs a few functions:
This flow is depicted in the following sequence diagram (also available in editor):
All external programs, such as
One detail not depicted in the diagram above: if there are multiple tasks on the same host,
The data is expected to be saved in Parquet format and ingested using Petastorm. This has been found to be more reliable and scalable than directly ingesting Spark RDDs.
This PR is a work in progress.
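To make the Driver/Task TCP service described earlier a bit more concrete, here is a minimal, standard-library-only sketch of one thing such a service could do: each Task connects to the Driver and registers its index and host name, and the Driver groups Task indices by host (relevant when several Tasks land on the same host). This is not the code in this PR. The names `DriverService`, `RegistrationHandler`, `register_task`, and `tasks_by_host`, as well as the newline-delimited JSON message format, are assumptions made purely for illustration.

```python
import json
import socket
import socketserver
import threading
from collections import defaultdict


class DriverService(socketserver.ThreadingTCPServer):
    """Hypothetical driver-side service that collects task registrations."""

    allow_reuse_address = True
    daemon_threads = True

    def __init__(self, num_tasks):
        # Port 0 asks the OS for any free port; read it from server_address.
        super().__init__(("127.0.0.1", 0), RegistrationHandler)
        self.num_tasks = num_tasks
        self.registrations = {}  # task index -> host name
        self.lock = threading.Lock()
        self.all_registered = threading.Event()

    def tasks_by_host(self):
        """Group registered task indices by the host they reported."""
        grouped = defaultdict(list)
        with self.lock:
            for index, host in sorted(self.registrations.items()):
                grouped[host].append(index)
        return dict(grouped)


class RegistrationHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Assumed protocol: one JSON object per connection, newline-terminated.
        message = json.loads(self.rfile.readline().decode("utf-8"))
        server = self.server
        with server.lock:
            server.registrations[message["index"]] = message["host"]
            if len(server.registrations) == server.num_tasks:
                server.all_registered.set()
        # Acknowledge only after the registration has been recorded.
        self.wfile.write(b"ok\n")


def register_task(driver_address, index, host):
    """Task-side helper: report this task's index and host to the driver."""
    with socket.create_connection(driver_address, timeout=5) as conn:
        payload = json.dumps({"index": index, "host": host}) + "\n"
        conn.sendall(payload.encode("utf-8"))
        with conn.makefile("r", encoding="utf-8") as reply:
            return reply.readline().strip()


if __name__ == "__main__":
    driver = DriverService(num_tasks=3)
    threading.Thread(target=driver.serve_forever, daemon=True).start()
    try:
        # In Spark, each call would run inside a separate task.
        for index, host in enumerate(["host-a", "host-b", "host-a"]):
            register_task(driver.server_address, index, host)
        driver.all_registered.wait(timeout=5)
        print(driver.tasks_by_host())
    finally:
        driver.shutdown()
        driver.server_close()
```

In this sketch the Driver acknowledges a registration only after storing it, so a Task that receives `ok` knows the Driver has seen it. The `all_registered` event gives the Driver a simple way to block until every expected Task has checked in.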