
Commit

add cosmoflow
laszewsk committed Jan 26, 2024
1 parent 26adb1f commit 9b2bcf6
Showing 4 changed files with 130 additions and 0 deletions.
106 changes: 106 additions & 0 deletions content/en/docs/Surogates/cosmoflow/#index.md#
@@ -0,0 +1,106 @@
---
title: "OSMI"
linkTitle: "OSMI"
weight: 100
description: >
We explore the relationship between network configuration and the performance of distributed machine
learning systems. We build upon the Open Surrogate Model Inference (OSMI) Benchmark, a distributed inference
benchmark for analyzing the performance of machine-learned surrogate models.
---


We explore the relationship between network configuration and the
performance of distributed machine learning systems. We build upon
the Open Surrogate Model Inference (OSMI) Benchmark, a distributed
inference benchmark for analyzing the performance of machine-learned
surrogate models developed by Wes Brewer et al. We focus on analyzing
distributed machine learning systems, via machine-learned surrogate
models, across varied hardware environments. By deploying the OSMI
Benchmark on platforms such as the Rivanna HPC system, WSL, and
Ubuntu, we offer a comprehensive study of system performance under
different configurations. The paper presents insights into optimizing
distributed machine learning systems, enhancing their scalability and
efficiency. We also develop a framework for automating the OSMI
benchmark.


## Introduction


With the proliferation of machine learning as a tool for science, the
need for efficient and scalable systems is paramount. This paper
explores the Open Surrogate Model Inference (OSMI) Benchmark, a tool
for testing the performance of machine-learning systems via
machine-learned surrogate models. The OSMI Benchmark, originally
created by Wes Brewer and colleagues, serves to evaluate various
configurations and their impact on system performance.

Our research centers on the deployment and analysis of the OSMI
Benchmark across various hardware platforms, including the Rivanna
high-performance computing (HPC) system, the Windows Subsystem for
Linux (WSL), and Ubuntu environments.

In each experiment, a variable number of TensorFlow model server
instances is overseen by an HAProxy load balancer that distributes
inference requests among the servers. Each server instance operates on
a dedicated GPU, either a V100 or an A100 available on Rivanna. This
setup mirrors real-world scenarios where load balancing is crucial for
system efficiency.
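
As an illustration of this topology only (not the actual benchmark
code), the following Python sketch launches one `tensorflow_model_server`
process per GPU; the model name, model path, and port numbers are
assumptions, and HAProxy would be configured separately to balance
across the chosen ports.

```python
import os
import subprocess

# Hypothetical settings; the real experiments configure these per run.
MODEL_NAME = "osmi_model"          # assumed model name
MODEL_PATH = "/models/osmi_model"  # assumed SavedModel base path
NUM_GPUS = 4                       # e.g., four V100s or A100s on one node
BASE_PORT = 8500                   # first gRPC port; HAProxy balances across these

servers = []
for gpu in range(NUM_GPUS):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))  # pin one GPU per server
    cmd = [
        "tensorflow_model_server",
        f"--port={BASE_PORT + gpu}",
        f"--model_name={MODEL_NAME}",
        f"--model_base_path={MODEL_PATH}",
    ]
    servers.append(subprocess.Popen(cmd, env=env))

for p in servers:
    p.wait()
```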

On the client side, we initiate a variable number of concurrent
clients executing the OSMI benchmark to simulate different levels of
system load and analyze the corresponding inference throughput.
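
A minimal client-side sketch of this load generation, assuming the
load balancer exposes the TensorFlow Serving REST API at a
hypothetical `haproxy-host:8443` and the model accepts flat float
vectors, might look like:

```python
import json
import time
from multiprocessing import Pool

import requests

URL = "http://haproxy-host:8443/v1/models/osmi_model:predict"  # assumed endpoint
PAYLOAD = json.dumps({"instances": [[0.0] * 32]})              # assumed input shape
N_CLIENTS = 8        # concurrent clients
N_REQUESTS = 1000    # requests issued by each client

def run_client(_):
    start = time.time()
    for _ in range(N_REQUESTS):
        requests.post(URL, data=PAYLOAD).raise_for_status()
    return N_REQUESTS / (time.time() - start)  # requests per second

if __name__ == "__main__":
    with Pool(N_CLIENTS) as pool:
        rates = pool.map(run_client, range(N_CLIENTS))
    print(f"aggregate throughput: {sum(rates):.1f} inferences/s")
```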

On top of the original OSMI-Bench, we implemented an object-oriented
interface in Python for running experiments with ease, streamlining
the process of benchmarking and analysis. The experiments rely on
custom-built container images based on NVIDIA's TensorFlow image. The
code runs on several hardware platforms, provided the appropriate
images are built.
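
The interface itself is not reproduced here; the following
hypothetical sketch only illustrates the kind of object-oriented
wrapper we mean, in which one object encapsulates the parameters of a
run and drives server startup, client execution, and result
collection:

```python
from dataclasses import dataclass

@dataclass
class OSMIExperiment:
    """One benchmark run with a fixed configuration (illustrative only)."""
    model: str          # e.g., "small", "medium", or "large"
    num_servers: int    # TensorFlow Serving instances behind HAProxy
    num_clients: int    # concurrent benchmark clients
    gpu: str            # e.g., "v100" or "a100"

    def run(self) -> dict:
        self._start_servers()
        throughput = self._run_clients()
        self._stop_servers()
        return {"config": vars(self), "throughput": throughput}

    # The real implementation shells out to the batch system, container
    # runtime, and OSMI client scripts; only stubs are shown here.
    def _start_servers(self) -> None: ...
    def _run_clients(self) -> float: ...
    def _stop_servers(self) -> None: ...
```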

Additionally, we developed a script that uses the Cloudmesh
Experiment Executor to launch simultaneous experiments over
permutations of predefined parameters. The Experiment Executor is a
tool that automates the generation and execution of experiment
variations with different parameters. This automation is crucial for
conducting tests across a spectrum of scenarios.
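
Conceptually, the sweep is a Cartesian product over the parameter
lists; a plain-Python equivalent of the variations the Experiment
Executor generates for us (with made-up parameter values) is:

```python
from itertools import product

# Hypothetical parameter grid; the real values live in the experiment configuration.
grid = {
    "model":       ["small", "medium", "large"],
    "num_servers": [1, 2, 4],
    "num_clients": [1, 2, 4, 8],
    "gpu":         ["v100", "a100"],
}

experiments = [dict(zip(grid, values)) for values in product(*grid.values())]
print(f"{len(experiments)} experiment variations")  # 3 * 3 * 4 * 2 = 72
for exp in experiments[:3]:
    print(exp)
```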

Finally, we analyze the inference throughput and total time for each
experiment. By graphing and examining these results, we draw critical
insights into the performance dynamics of distributed machine
learning systems.
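
As an example of the kind of plot we produce, throughput can be
graphed against the number of concurrent clients for each GPU type;
the numbers below are placeholders only, not measured results.

```python
import matplotlib.pyplot as plt

# Placeholder values; the real numbers come from the experiment logs.
clients = [1, 2, 4, 8, 16]
throughput = {
    "V100": [210, 400, 760, 1400, 1900],
    "A100": [350, 680, 1300, 2400, 3300],
}

for gpu, rates in throughput.items():
    plt.plot(clients, rates, marker="o", label=gpu)
plt.xlabel("concurrent clients")
plt.ylabel("inferences per second")
plt.title("OSMI benchmark throughput (illustrative values)")
plt.legend()
plt.savefig("osmi_throughput.png")
```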

In summary, we provide a comprehensive examination of the OSMI
Benchmark in diverse distributed ML systems. We aim to contribute to
the optimization of these systems by providing a framework for finding
the most performant system configuration for a given use case. Our
findings pave the way for more efficient and scalable distributed
computing environments.

[^1][^2]

## References

[^1]: Brewer, Wesley, Daniel Martinez, Mathew Boyer, Dylan Jude, Andy
Wissink, Ben Parsons, Junqi Yin, and Valentine Anantharaj. "Production
Deployment of Machine-Learned Rotorcraft Surrogate Models on HPC." In
2021 IEEE/ACM Workshop on Machine Learning in High Performance
Computing Environments (MLHPC), pp. 21-32. IEEE,
2021. <https://ieeexplore.ieee.org/abstract/document/9652868>. Note
that OSMI-Bench differs from SMI-Bench described in the paper only in
that the models that are used in OSMI are trained on synthetic data,
whereas the models in SMI were trained using data from proprietary CFD
simulations. Also, the OSMI medium and large models are very similar
architectures as the SMI medium and large models, but not identical.


[^2]: Gregor von Laszewski, J. P. Fleischer, and Geoffrey
C. Fox. 2022. Hybrid Reusable Computational Analytics Workflow
Management with Cloudmesh. <https://doi.org/10.48550/ARXIV.2210.16941>

24 changes: 24 additions & 0 deletions content/en/docs/Surogates/cosmoflow/index.md
@@ -0,0 +1,24 @@
---
title: "Cosmoflow"
linkTitle: "Cosmoflow"
weight: 15
description: >
The CosmoFlow training application benchmark from the MLPerf HPC v0.5 benchmark suite. It involves training a 3D convolutional neural network on N-body cosmology simulation data to predict physical parameters of the universe.
resources:
- src: "**.{png,jpg}"
title: "Image #:counter"
---

## Overview

This application is based on the original CosmoFlow paper presented at SC18, continued by the ExaLearn project, and adopted as a benchmark in the MLPerf HPC suite. It involves training a 3D convolutional neural network on N-body cosmology simulation data to predict physical parameters of the universe. The reference implementation for MLPerf HPC v0.5 CosmoFlow uses TensorFlow with the Keras API and Horovod for data-parallel distributed training.

The dataset comes from simulations run by ExaLearn, with universe volumes split into cubes of size 128x128x128 with 4 redshift bins. The total dataset volume preprocessed for MLPerf HPC v0.5 in TFRecord format is 5.1 TB. The target objective in MLPerf HPC v0.5 is to train the model to a validation mean-average-error < 0.124; however, the problem size can be scaled down and the training throughput can be used as the primary objective for a small-scale or shorter-timescale benchmark.[^1][^2][^3]
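
To make the task concrete, here is a schematic Keras sketch of a 3D
convolutional regressor on 128x128x128 cubes with 4 redshift channels;
it is not the exact reference architecture, and the number of target
parameters is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cosmoflow_like_model(n_targets: int = 4) -> tf.keras.Model:
    """Schematic 3D CNN regressor; the MLPerf reference model differs in detail."""
    inputs = tf.keras.Input(shape=(128, 128, 128, 4))  # cube with 4 redshift bins
    x = inputs
    for filters in (16, 32, 64, 128, 256):             # successively downsample the volume
        x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPool3D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    outputs = layers.Dense(n_targets)(x)               # predicted cosmology parameters
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mae", metrics=["mae"])  # MAE matches the benchmark target metric
    return model

model = build_cosmoflow_like_model()
model.summary()
```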


## References


[^1]: <https://proxyapps.exascaleproject.org/app/mlperf-cosmoflow/>

[^2]: <https://github.com/sparticlesteve/cosmoflow-benchmark>

[^3]: <https://github.com/sparticlesteve/cosmoflow-benchmark/blob/master/README.md>
Binary file added content/en/docs/Surogates/cosmoflow/osmi1.png
Binary file added content/en/docs/Surogates/cosmoflow/osmi2.png
