
Commit

add cosmoflow
laszewsk committed Jan 26, 2024
1 parent 26adb1f commit 9b2bcf6
Showing 4 changed files with 130 additions and 0 deletions.
106 changes: 106 additions & 0 deletions content/en/docs/Surogates/cosmoflow/#index.md#
@@ -0,0 +1,106 @@
---
title: "OSMI"
linkTitle: "OSMI"
weight: 100
description: >
We explore the relationship between network configuration and the performance of distributed machine
learning systems. We build upon the Open Surrogate Model Inference (OSMI) Benchmark, a distributed inference
benchmark for analyzing the performance of machine-learned surrogate models.
---


We explore the relationship between network configuration and the
performance of distributed machine learning systems. We build upon
the Open Surrogate Model Inference (OSMI) Benchmark, a distributed
inference benchmark for analyzing the performance of machine-learned
surrogate models developed by Wes Brewer et al. We focus on analyzing
distributed machine learning systems, via machine-learned surrogate
models, across varied hardware environments. By deploying the OSMI
Benchmark on platforms such as the Rivanna HPC system, WSL, and
Ubuntu, we offer a comprehensive study of system performance under
different configurations. The paper presents insights into optimizing
distributed machine learning systems, enhancing their scalability and
efficiency. We also develop a framework for automating the OSMI
benchmark.


## Introduction


With the proliferation of machine learning as a tool for science, the
need for efficient and scalable systems is paramount. This paper
explores the Open Surrogate Model Inference (OSMI) Benchmark, a tool
for testing the performance of machine-learning systems via
machine-learned surrogate models. The OSMI Benchmark, originally
created by Wes Brewer and colleagues, serves to evaluate various
configurations and their impact on system performance.

Our research centers on the deployment and analysis of the OSMI
Benchmark across various hardware platforms, including the Rivanna
high-performance computing (HPC) system, the Windows Subsystem for
Linux (WSL), and Ubuntu environments.

In each experiment, a variable number of TensorFlow model server
instances is overseen by an HAProxy load balancer that distributes
inference requests among the servers. Each server instance operates on
a dedicated GPU, either a V100 or an A100 available on Rivanna. This
setup mirrors real-world scenarios where load balancing is crucial for
system efficiency.
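
As an illustration of this topology only (not the actual benchmark
code), the following Python sketch launches one `tensorflow_model_server`
process per GPU; the model name, model path, and port numbers are
assumptions, and HAProxy would be configured separately to balance
across the chosen ports.

```python
import os
import subprocess

# Hypothetical settings; the real experiments configure these per run.
MODEL_NAME = "osmi_model"          # assumed model name
MODEL_PATH = "/models/osmi_model"  # assumed SavedModel base path
NUM_GPUS = 4                       # e.g., four V100s or A100s on one node
BASE_PORT = 8500                   # first gRPC port; HAProxy balances across these

servers = []
for gpu in range(NUM_GPUS):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))  # pin one GPU per server
    cmd = [
        "tensorflow_model_server",
        f"--port={BASE_PORT + gpu}",
        f"--model_name={MODEL_NAME}",
        f"--model_base_path={MODEL_PATH}",
    ]
    servers.append(subprocess.Popen(cmd, env=env))

for p in servers:
    p.wait()
```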

On the client side, we initiate a variable number of concurrent
clients executing the OSMI benchmark to simulate different levels of
system load and analyze the corresponding inference throughput.
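
A minimal client-side sketch of this load generation, assuming the
load balancer exposes the TensorFlow Serving REST API at a
hypothetical `haproxy-host:8443` and the model accepts flat float
vectors, might look like:

```python
import json
import time
from multiprocessing import Pool

import requests

URL = "http://haproxy-host:8443/v1/models/osmi_model:predict"  # assumed endpoint
PAYLOAD = json.dumps({"instances": [[0.0] * 32]})              # assumed input shape
N_CLIENTS = 8        # concurrent clients
N_REQUESTS = 1000    # requests issued by each client

def run_client(_):
    start = time.time()
    for _ in range(N_REQUESTS):
        requests.post(URL, data=PAYLOAD).raise_for_status()
    return N_REQUESTS / (time.time() - start)  # requests per second

if __name__ == "__main__":
    with Pool(N_CLIENTS) as pool:
        rates = pool.map(run_client, range(N_CLIENTS))
    print(f"aggregate throughput: {sum(rates):.1f} inferences/s")
```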

On top of the original OSMI-Bench, we implemented an object-oriented
interface in Python for running experiments with ease, streamlining
the process of benchmarking and analysis. The experiments rely on
custom-built container images based on NVIDIA's TensorFlow image. The
code runs on several hardware platforms, provided the appropriate
images are built.
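
The interface itself is not reproduced here; the following
hypothetical sketch only illustrates the kind of object-oriented
wrapper we mean, in which one object encapsulates the parameters of a
run and drives server startup, client execution, and result
collection:

```python
from dataclasses import dataclass

@dataclass
class OSMIExperiment:
    """One benchmark run with a fixed configuration (illustrative only)."""
    model: str          # e.g., "small", "medium", or "large"
    num_servers: int    # TensorFlow Serving instances behind HAProxy
    num_clients: int    # concurrent benchmark clients
    gpu: str            # e.g., "v100" or "a100"

    def run(self) -> dict:
        self._start_servers()
        throughput = self._run_clients()
        self._stop_servers()
        return {"config": vars(self), "throughput": throughput}

    # The real implementation shells out to the batch system, container
    # runtime, and OSMI client scripts; only stubs are shown here.
    def _start_servers(self) -> None: ...
    def _run_clients(self) -> float: ...
    def _stop_servers(self) -> None: ...
```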

Additionally, we developed a script that uses the Cloudmesh
Experiment Executor to launch simultaneous experiments over
permutations of predefined parameters. The Experiment Executor is a
tool that automates the generation and execution of experiment
variations with different parameters. This automation is crucial for
conducting tests across a spectrum of scenarios.
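
Conceptually, the sweep is a Cartesian product over the parameter
lists; a plain-Python equivalent of the variations the Experiment
Executor generates for us (with made-up parameter values) is:

```python
from itertools import product

# Hypothetical parameter grid; the real values live in the experiment configuration.
grid = {
    "model":       ["small", "medium", "large"],
    "num_servers": [1, 2, 4],
    "num_clients": [1, 2, 4, 8],
    "gpu":         ["v100", "a100"],
}

experiments = [dict(zip(grid, values)) for values in product(*grid.values())]
print(f"{len(experiments)} experiment variations")  # 3 * 3 * 4 * 2 = 72
for exp in experiments[:3]:
    print(exp)
```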

Finally, we analyze the inference throughput and total time for each
experiment. By graphing and examining these results, we draw critical
insights into the performance dynamics of distributed machine
learning systems.
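
As an example of the kind of plot we produce, throughput can be
graphed against the number of concurrent clients for each GPU type;
the numbers below are placeholders only, not measured results.

```python
import matplotlib.pyplot as plt

# Placeholder values; the real numbers come from the experiment logs.
clients = [1, 2, 4, 8, 16]
throughput = {
    "V100": [210, 400, 760, 1400, 1900],
    "A100": [350, 680, 1300, 2400, 3300],
}

for gpu, rates in throughput.items():
    plt.plot(clients, rates, marker="o", label=gpu)
plt.xlabel("concurrent clients")
plt.ylabel("inferences per second")
plt.title("OSMI benchmark throughput (illustrative values)")
plt.legend()
plt.savefig("osmi_throughput.png")
```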

In summary, we provide a comprehensive examination of the OSMI
Benchmark in diverse distributed ML systems. We aim to contribute to
the optimization of these systems by providing a framework for finding
the most performant system configuration for a given use case. Our
findings pave the way for more efficient and scalable distributed
computing environments.

[^1][^2]

## References

[^1]: Brewer, Wesley, Daniel Martinez, Mathew Boyer, Dylan Jude, Andy
Wissink, Ben Parsons, Junqi Yin, and Valentine Anantharaj. "Production
Deployment of Machine-Learned Rotorcraft Surrogate Models on HPC." In
2021 IEEE/ACM Workshop on Machine Learning in High Performance
Computing Environments (MLHPC), pp. 21-32. IEEE,
2021. <https://ieeexplore.ieee.org/abstract/document/9652868>. Note
that OSMI-Bench differs from SMI-Bench described in the paper only in
that the models that are used in OSMI are trained on synthetic data,
whereas the models in SMI were trained using data from proprietary CFD
simulations. Also, the OSMI medium and large models are very similar
architectures as the SMI medium and large models, but not identical.


[^2]: Gregor von Laszewski, J. P. Fleischer, and Geoffrey
C. Fox. 2022. Hybrid Reusable Computational Analytics Workflow
Management with Cloudmesh. <https://doi.org/10.48550/ARXIV.2210.16941>

24 changes: 24 additions & 0 deletions content/en/docs/Surogates/cosmoflow/index.md
@@ -0,0 +1,24 @@
---
title: "Cosmoflow"
linkTitle: "Cosmoflow"
weight: 15
description: >
The CosmoFlow training application benchmark from the MLPerf HPC v0.5 benchmark suite. It involves training a 3D convolutional neural network on N-body cosmology simulation data to predict physical parameters of the universe.
resources:
- src: "**.{png,jpg}"
title: "Image #:counter"
---

## Overview

This application is based on the original CosmoFlow paper presented at SC18, continued by the ExaLearn project, and adopted as a benchmark in the MLPerf HPC suite. It involves training a 3D convolutional neural network on N-body cosmology simulation data to predict physical parameters of the universe. The reference implementation for MLPerf HPC v0.5 CosmoFlow uses TensorFlow with the Keras API and Horovod for data-parallel distributed training.

The dataset comes from simulations run by ExaLearn, with universe volumes split into cubes of size 128x128x128 with 4 redshift bins. The total dataset volume preprocessed for MLPerf HPC v0.5 in TFRecord format is 5.1 TB. The target objective in MLPerf HPC v0.5 is to train the model to a validation mean-average-error < 0.124; however, the problem size can be scaled down and the training throughput can be used as the primary objective for a small-scale or shorter-timescale benchmark.[^1][^2][^3]
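
To make the task concrete, here is a schematic Keras sketch of a 3D
convolutional regressor on 128x128x128 cubes with 4 redshift channels;
it is not the exact reference architecture, and the number of target
parameters is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cosmoflow_like_model(n_targets: int = 4) -> tf.keras.Model:
    """Schematic 3D CNN regressor; the MLPerf reference model differs in detail."""
    inputs = tf.keras.Input(shape=(128, 128, 128, 4))  # cube with 4 redshift bins
    x = inputs
    for filters in (16, 32, 64, 128, 256):             # successively downsample the volume
        x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPool3D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    outputs = layers.Dense(n_targets)(x)               # predicted cosmology parameters
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mae", metrics=["mae"])  # MAE matches the benchmark target metric
    return model

model = build_cosmoflow_like_model()
model.summary()
```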


## References


[^1]: <https://proxyapps.exascaleproject.org/app/mlperf-cosmoflow/>

[^2]: <https://github.com/sparticlesteve/cosmoflow-benchmark>

[^3]: <https://github.com/sparticlesteve/cosmoflow-benchmark/blob/master/README.md>
Binary file added content/en/docs/Surogates/cosmoflow/osmi1.png
Binary file added content/en/docs/Surogates/cosmoflow/osmi2.png
