[data] Stability & accuracy improvements for Data+Train benchmark (#42027)

Shuffles input images for the read_images_train_4_gpu release test, which fixes the issue with accuracy going to 0.

Add AWS Error NETWORK_CONNECTION and AWS Error ACCESS_DENIED as exception types to retry during reads, since these can be transient errors that succeed upon retry.

Other small fixes for optional parameters in the benchmark file, used for debugging purposes.

Results of sample release test run:

read_images_train_4_gpu: Result of case cache-none: {'time': 11964.644112934, 'tput': 429.22158930338344, 'accuracy': 0.4667895757295709, 'extra_metrics': {}}

read_images_train_16_gpu: Result of case cache-none: {'time': 5400.357632072, 'tput': 1593.6668981608586, 'accuracy': 0.5293150227295434, 'extra_metrics': {}}

read_images_train_16_gpu_preserve_order: Result of case cache-none: {'time': 5566.524269388, 'tput': 1571.1312653719967, 'accuracy': 0.5295374787691078, 'extra_metrics': {}}

(The difference in accuracy is because the 4 worker test only runs for 3 epochs, while the 16 worker test runs for 5 epochs, using the entire dataset per epoch.)

---------

Signed-off-by: Andrew Xue <andewzxue@gmail.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Co-authored-by: Scott Lee <scottjlee@users.noreply.github.com>
Co-authored-by: Scott Lee <sjl@anyscale.com>
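
For illustration only, the retry behavior described in this message can be sketched roughly as follows. This is not the code from the commit: `RETRYABLE_ERROR_SUBSTRINGS`, `is_retryable`, and `read_with_retry` are hypothetical names, and matching on the error message text is an assumption about how such errors might be detected.

```python
import time

# Hypothetical list of error-message substrings treated as transient.
# The two entries mirror the error names mentioned in the commit message.
RETRYABLE_ERROR_SUBSTRINGS = (
    "AWS Error NETWORK_CONNECTION",
    "AWS Error ACCESS_DENIED",
)


def is_retryable(exc):
    """Return True if the exception message contains a known transient error."""
    message = str(exc)
    return any(substring in message for substring in RETRYABLE_ERROR_SUBSTRINGS)


def read_with_retry(read_fn, max_attempts=3, base_delay_s=1.0, sleep=time.sleep):
    """Call read_fn, retrying with exponential backoff on transient errors.

    Non-retryable errors, and the error from the final attempt, are re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return read_fn()
        except Exception as exc:
            if attempt == max_attempts or not is_retryable(exc):
                raise
            # Back off before the next attempt: base, 2x base, 4x base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
```

A caller would wrap a single read, for example `read_with_retry(lambda: open_file(path))`, where `open_file` stands in for whatever read function is in use.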