Potential issues observed in straggler handling policy #981

ParthMandaliya · 2024-05-31T09:07:12Z

Describe the bug
Unexpected behavior was observed while simulating different straggler handling scenarios.

Scenario:

One collaborator is manually terminated during the execution of the experiment.
Parameters for the CutoffTimeBasedStragglerHandling policy:
Cutoff time: 10 seconds
Minimum reporting: 1
Total collaborators: 2

Expected behavior:

After the cutoff time (10 seconds) has elapsed, the aggregator should move to the 2nd round, because it has received results from the minimum required number of collaborators (1 collaborator).
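The expected behavior above amounts to a simple end-of-round rule. The following is a minimal sketch of that rule, not OpenFL's actual implementation; the function name `should_end_round` and its parameters are hypothetical, chosen to mirror the cutoff time, minimum reporting, and collaborator count listed above.

```python
def should_end_round(elapsed_seconds: float,
                     num_reported: int,
                     total_collaborators: int,
                     cutoff_seconds: float = 10.0,
                     minimum_reporting: int = 1) -> bool:
    """Hypothetical decision rule matching the expected behavior in this report."""
    # Every collaborator has reported: no reason to wait for the cutoff.
    if num_reported >= total_collaborators:
        return True
    # Cutoff elapsed and enough collaborators reported: treat the rest as stragglers.
    return elapsed_seconds >= cutoff_seconds and num_reported >= minimum_reporting


# The scenario from this report: 2 collaborators, 1 terminated.
print(should_end_round(5.0, 1, 2))   # False: cutoff not reached yet
print(should_end_round(10.5, 1, 2))  # True: cutoff passed and minimum met
```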

Observed behavior:

The aggregator keeps waiting for results from the terminated collaborator, even though the cutoff time has expired and the minimum number of collaborators has already reported results.
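One possible explanation, which this report does not confirm, is that the cutoff condition is evaluated only when a collaborator sends results; if the only remaining collaborator has been terminated, no further event triggers the check. The sketch below uses only the Python standard library to show how scheduling a timer at the cutoff guarantees one final check. The class `CutoffRoundTimer` and its methods are hypothetical and do not correspond to OpenFL's API.

```python
import threading


class CutoffRoundTimer:
    """Hypothetical sketch: re-check the straggler condition when the cutoff fires.

    If the check ran only when results arrive, a round whose remaining
    collaborator has died would never be re-evaluated. The timer below
    guarantees one more evaluation once the cutoff time has elapsed.
    """

    def __init__(self, cutoff_seconds, minimum_reporting, total_collaborators, on_round_end):
        self.minimum_reporting = minimum_reporting
        self.total_collaborators = total_collaborators
        self.on_round_end = on_round_end
        self.reported = set()
        self.cutoff_reached = False
        self.round_ended = False
        self._lock = threading.Lock()
        self._timer = threading.Timer(cutoff_seconds, self._on_cutoff)
        self._timer.daemon = True

    def start(self):
        self._timer.start()

    def report(self, collaborator):
        # Called when a collaborator sends its results for this round.
        with self._lock:
            self.reported.add(collaborator)
            self._maybe_end_round()

    def _on_cutoff(self):
        # Runs in the timer thread even if no further results ever arrive.
        with self._lock:
            self.cutoff_reached = True
            self._maybe_end_round()

    def _maybe_end_round(self):
        # Caller must hold self._lock; ends the round at most once.
        if self.round_ended:
            return
        done = len(self.reported)
        if done >= self.total_collaborators or (
            self.cutoff_reached and done >= self.minimum_reporting
        ):
            self.round_ended = True
            self._timer.cancel()
            self.on_round_end(sorted(self.reported))


ended = threading.Event()
results = []


def on_round_end(collaborators):
    results.append(collaborators)
    ended.set()


timer = CutoffRoundTimer(cutoff_seconds=0.2, minimum_reporting=1,
                         total_collaborators=2, on_round_end=on_round_end)
timer.start()
timer.report("collaborator1")  # collaborator2 was terminated and never reports
ended.wait(timeout=2.0)
print(results)  # [['collaborator1']]
```

Here the cutoff is shortened to 0.2 seconds so the example finishes quickly; with the settings in this report it would be 10 seconds.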

To Reproduce
Steps to reproduce the behavior:

Go to: openfl/component/straggler_handling_functions/cutoff_time_based_straggler_handling.py
Change: the default value of the straggler_cutoff_time parameter from np.inf to 10 in the CutoffTimeBasedStragglerHandling class constructor.
Tutorial: Modify the openfl-tutorials/interactive_api/PyTorch_LinearRegression tutorial to run with 2 collaborators, then run the experiment.
While the experiment is in progress, terminate any one collaborator before it sends its results for the current round.
See error: The aggregator does not start the next round and keeps waiting for the terminated collaborator to send results.


ParthMandaliya linked pull request #996, "Updates to straggler handling functionality" (Open), on Jul 4, 2024 and again on Jul 9, 2024; it will close this issue.
