Initially, the files were uploaded to GitHub using Git Large File Storage (LFS), but they have been removed because of potential recurring billing issues.
You can now find the files archived on Zenodo. This ensures that the data remains accessible without incurring additional costs.
We tested the code on an NVIDIA GeForce RTX 2080 Ti GPU in an Ubuntu 20.04 amd64 system. The GPU should have the Turing architecture.
Note: GPU architectures newer than Turing may hit a bug during tracing (an NVBit bug).
We tested the code on an Ubuntu 20.04 amd64 system with CUDA 11.3.1 and cuDNN 8.
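To check the architecture requirement before installing anything, a small guard like the one below may help. It is only a sketch: `is_turing` is a hypothetical helper, it assumes Turing GPUs report compute capability 7.5, and it assumes your driver's `nvidia-smi` supports the `compute_cap` query field (older drivers may not).

```shell
# Succeed when a compute capability string denotes Turing.
# Assumption: Turing GPUs such as the RTX 2080 Ti report 7.5.
is_turing() {
    [ "$1" = "7.5" ]
}

# Query the first GPU only when nvidia-smi is available.
# Assumption: the driver is recent enough to support the compute_cap field.
if command -v nvidia-smi >/dev/null 2>&1; then
    cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)
    if is_turing "$cap"; then
        echo "Turing GPU detected (compute capability $cap)"
    else
        echo "Warning: compute capability '$cap' is not Turing; tracing may fail" >&2
    fi
fi
```

If the query field is not available on your driver, checking the GPU model name by hand is just as good.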
Software prerequisites for installing from source should be satisfied for the following repositories:
You can install all dependencies by following this document.
First, make sure CUDA is installed on your system:
export CUDA_INSTALL_PATH=/usr/local/cuda # set it to your CUDA installation path
nvcc --version
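If you want to turn the `nvcc --version` check into a scripted one, the sketch below extracts the release number from its output. `cuda_release` is a hypothetical helper, and it assumes the output contains a line of the form `Cuda compilation tools, release 11.3, V11.3.109`.

```shell
# Extract the "release X.Y" number from `nvcc --version` output read on stdin.
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Compare against the tested release only when nvcc is on PATH.
if command -v nvcc >/dev/null 2>&1; then
    found=$(nvcc --version | cuda_release)
    if [ "$found" != "11.3" ]; then
        echo "Warning: found CUDA release '$found'; this guide was tested with 11.3" >&2
    fi
fi
```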
On an Ubuntu 20.04 amd64 system, the following commands install the package dependencies:
sudo apt-get update
sudo apt-get install -y --no-install-recommends python3-dev ca-certificates g++ python3-numpy gcc make git python3-setuptools python3-wheel python3-pip aria2 wget build-essential xutils-dev bison zlib1g-dev flex libglu1-mesa-dev git libssl-dev libxml2-dev libboost-all-dev vim python-setuptools python-dev ninja-build bc git-lfs libtinfo-dev htop libedit-dev
Next, install Python (>= 3.8) dependencies.
Note: you need to use a specific PyTorch version (1.11.0). Later versions may generate different node names that the current version cannot process.
python3 -m pip install -U --force-reinstall pip
pip install torch==1.11.0+cu113 \
torchvision==0.12.0+cu113 \
torchaudio==0.11.0 --extra-index-url
pip3 install pyyaml==5.1 onnx plotly psutil pandas decorator attrs scipy
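Because the PyTorch version matters here, it may be worth confirming what actually got installed. The following is a minimal sketch; `base_version_is` is a hypothetical helper that ignores any local build suffix such as `+cu113`.

```shell
# Succeed when a full version string such as "1.11.0+cu113"
# has the wanted base version.
base_version_is() {
    # Strip any local build suffix after "+" before comparing.
    [ "${1%%+*}" = "$2" ]
}

# Check the installed PyTorch only when it can be imported.
if python3 -c 'import torch' >/dev/null 2>&1; then
    installed=$(python3 -c 'import torch; print(torch.__version__)')
    if ! base_version_is "$installed" "1.11.0"; then
        echo "Warning: PyTorch $installed found; 1.11.0 is required" >&2
    fi
fi
```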
Install CMake (>= 3.21):
sudo aria2c -q -d /tmp -o cmake-3.21.0-linux-x86_64.tar.gz \
sudo tar -zxf /tmp/cmake-3.21.0-linux-x86_64.tar.gz --strip=1 -C /usr
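To verify the minimum-version requirement afterwards, a comparison along these lines can be used. This is a sketch: `version_ge` is a hypothetical helper, and it relies on `sort -V` from GNU coreutils, which is available on Ubuntu.

```shell
# Succeed when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Check the cmake on PATH, if any. The first line of `cmake --version`
# has the form "cmake version X.Y.Z".
if command -v cmake >/dev/null 2>&1; then
    have=$(cmake --version | head -n 1 | awk '{ print $3 }')
    if ! version_ge "$have" "3.21.0"; then
        echo "Warning: CMake $have is older than 3.21.0" >&2
    fi
fi
```

The same helper can be reused for any other tool with a minimum version.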
Install Clang and LLVM (>= 12):
wget -c
tar -xvf clang+llvm-13.0.0-x86_64-linux-gnu-ubuntu-20.04.tar.xz
sudo cp -rl clang+llvm-13.0.0-x86_64-linux-gnu-ubuntu-20.04/* /usr/local
rm -rf clang+llvm-13.0.0-x86_64-linux-gnu-ubuntu-20.04 \
Install and build the PIMFlow repositories from source. We prepared an installation script (docker/
cd "$HOME"
git clone -b "$GIT_BRANCH"
cd "$TVM_DIR"
git submodule init && git submodule update
mkdir -p "$BUILD_DIR" && cd "$BUILD_DIR"
cp "$TVM_DIR/cmake/config.cmake" "$BUILD_DIR"
cmake .. -G Ninja -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_C_COMPILER=$(which gcc)
cd "$HOME"
git clone -b "$GIT_BRANCH"
cd "$GPU_DIR"
# Generate binary file: $GPU_DIR/bin/release/accel-sim.out
make -j
# Install nvbit
cd "$NVBIT_DIR" && ./ && make -j
cd "$HOME"
git clone -b "$GIT_BRANCH"
cd "$RAM_DIR"
# Generate binary file: $RAM_DIR/ramulator
make -j
cd "$HOME"
git clone -b "$GIT_BRANCH"
pip install -e .
cd "$PIMFLOW_DIR/pim"
# Generate binary file: $PIMFLOW_DIR/pim/pim_codegen
make -j
# Extract network traces
tar -xzf ./data/mobilenet-v2.tar.gz -C .
tar -xzf ./data/traces-mobilenet-v2-16-org.tar.gz -C .
tar -xzf ./data/traces-mobilenet-v2-16-Newton+.tar.gz -C .
tar -xzf ./data/traces-mobilenet-v2-16-Newton++.tar.gz -C .
tar -xzf ./data/traces-mobilenet-v2-16-Pipeline.tar.gz -C .
tar -xzf ./data/traces-mobilenet-v2-16-MDDP.tar.gz -C .
tar -xzf ./data/traces-mobilenet-v2-16-PIMFlow.tar.gz -C .
tar -xzf ./data/mobilenet-v2-csv.tar.gz -C ../
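The six trace archives above follow a single naming pattern, so the extraction can also be written as a loop. The sketch below assumes that pattern holds; `trace_archives` and `extract_traces` are hypothetical helpers, and the second argument is simply the number that appears between the network name and the policy in the archive names.

```shell
# Print the trace archive names for one network, one per line.
# $1 is the network name, $2 the number used in the archive names.
trace_archives() {
    for policy in org Newton+ Newton++ Pipeline MDDP PIMFlow; do
        printf './data/traces-%s-%s-%s.tar.gz\n' "$1" "$2" "$policy"
    done
}

# Extract every archive that is present, reporting missing ones.
extract_traces() {
    trace_archives "$1" "$2" | while IFS= read -r archive; do
        if [ -f "$archive" ]; then
            tar -xzf "$archive" -C .
        else
            echo "Skipping missing archive: $archive" >&2
        fi
    done
}
```

Running `extract_traces mobilenet-v2 16` from the same directory as the commands above would then extract the trace archives that exist and report the rest.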
Now, the directory should look like this:
. ($HOME)
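One way to confirm that the builds produced the binaries named in the comments above is a small existence check. This is a sketch; `check_artifacts` is a hypothetical helper, and the commented usage assumes `GPU_DIR`, `RAM_DIR`, and `PIMFLOW_DIR` are still set in your shell.

```shell
# Report each expected build artifact that is missing;
# succeed only if all of them exist.
check_artifacts() {
    missing=0
    for artifact in "$@"; do
        if [ ! -e "$artifact" ]; then
            echo "Missing: $artifact" >&2
            missing=1
        fi
    done
    [ "$missing" -eq 0 ]
}

# Example usage (paths taken from the build comments above):
# check_artifacts "$GPU_DIR/bin/release/accel-sim.out" \
#                 "$RAM_DIR/ramulator" \
#                 "$PIMFLOW_DIR/pim/pim_codegen"
```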
Finally, set the following environment variables and add them to .bashrc so that later sessions pick them up.
export TVM_HOME=/root/PIMFlow_tvm
export PYTHONPATH=/root/PIMFlow_tvm/python
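If you rerun the setup, appending these lines to .bashrc each time leaves duplicates. The sketch below adds a line only when it is not already present; `append_once` is a hypothetical helper, not part of PIMFlow.

```shell
# Append a line to a file unless an identical line is already present.
append_once() {
    rc_file=$1
    rc_line=$2
    touch "$rc_file"
    # -x matches whole lines, -F treats the line as a fixed string.
    grep -qxF -- "$rc_line" "$rc_file" || printf '%s\n' "$rc_line" >> "$rc_file"
}

# Example usage:
# append_once "$HOME/.bashrc" 'export TVM_HOME=/root/PIMFlow_tvm'
# append_once "$HOME/.bashrc" 'export PYTHONPATH=/root/PIMFlow_tvm/python'
```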
You can manually perform profiling to find the optimal execution mode and task size.
Note: this takes about 8 hours on a server with 8x NVIDIA GeForce RTX 2080 Ti GPUs and 2x Intel Xeon Gold 6248R CPUs (24-core).
cd PIMFlow
./pimflow -m=profile -t=split -n=mobilenet-v2
./pimflow -m=profile -t=pipeline -n=mobilenet-v2
Or, you can just use the profiled data we've prepared in PIMFlow/mobilenet-v2/ for MobileNet-V2.
Now you can compute the optimal solution from the profiled data and get the speedup:
./pimflow -m=stat --conv_only -n=mobilenet-v2
The output should look like this:
=== N_CHANNEL: 16, N_GWRITE: 4, ramulator_disable_gwrite_latency_hiding: False ===
newton++ (vs baseline): 1.365 (-388549.76000000024)
pipeline (vs baseline): 1.413 (-425128.2000000004)
split (vs baseline): 1.436 (-441899.4400000004)
all (vs baseline): 1.481 (-472070.72000000044)
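If you want to post-process this output, the result lines are easy to parse. The sketch below assumes each result line has the form `<mode> (vs baseline): <speedup> (<delta>)`, as in the sample above; `parse_speedups` and `best_mode` are hypothetical helpers.

```shell
# Read stat output on stdin and print "<mode> <speedup>" per result line.
parse_speedups() {
    awk '/\(vs baseline\):/ { print $1, $4 }'
}

# Read stat output on stdin and print the mode with the highest speedup.
best_mode() {
    parse_speedups | awk '$2 + 0 > best + 0 { best = $2 + 0; mode = $1 } END { print mode }'
}

# Example usage:
# ./pimflow -m=stat --conv_only -n=mobilenet-v2 | parse_speedups
```

For the sample output above, `best_mode` would report `all`, the line with speedup 1.481.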
Next, you can get the end-to-end speedup with the following commands.
Note: this takes about 8 hours on our system. The --policy option is one of Newton+, Newton++, Pipeline, MDDP, or PIMFlow.
./pimflow -m=solve -n=mobilenet-v2
./pimflow -m=run --gpu_only -n=mobilenet-v2 # get gpu-only execution time
./pimflow -m=run -n=mobilenet-v2 # get pimflow execution time
./pimflow -m=stat -n=mobilenet-v2 --policy=PIMFlow # show end-to-end speedup
GPU CYCLE: 1445620
PIMFlow CYCLE: 1047831.4000000001
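The two cycle counts can be turned into a single speedup figure. The sketch below assumes the end-to-end speedup is the GPU-only cycle count divided by the PIMFlow cycle count, which this guide does not state explicitly; `speedup` is a hypothetical helper.

```shell
# Print the ratio of two cycle counts with three decimals.
# Assumption: speedup = GPU-only cycles / PIMFlow cycles.
speedup() {
    awk -v gpu="$1" -v pim="$2" 'BEGIN { printf "%.3f\n", gpu / pim }'
}

# Cycle counts taken from the sample output above.
speedup 1445620 1047831.4000000001
```

Under that assumption, the sample numbers give a speedup of about 1.38.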
You can replace "mobilenet-v2" with "efficientnet-v1-b0", "mnasnet-1.0", "resnet-50", or "vgg-16" to test other networks. We also prepared a very simple network, "toy", for a quick test.
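To run the same sequence for several networks, the commands can be generated in a loop. The sketch below only prints the commands (a dry run) so you can review them first; `print_runs` is a hypothetical helper, and it assumes profiling data already exists for each network you list.

```shell
# Print the end-to-end command sequence for each network without running it.
print_runs() {
    for net in "$@"; do
        echo "./pimflow -m=solve -n=$net"
        echo "./pimflow -m=run --gpu_only -n=$net"
        echo "./pimflow -m=run -n=$net"
        echo "./pimflow -m=stat -n=$net --policy=PIMFlow"
    done
}

# Start with the small test network, then a full one.
print_runs toy mobilenet-v2
```

Piping the output to `sh` from inside the PIMFlow directory would execute the commands in order.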