# Machine Learning Tools in Action

<hr>

Talk given by Gennady Pekhimenko, University of Toronto.

<hr>

## 1 - Preliminary

There is a dependency cycle in machine learning that drives analysis and optimization:
- Performance bottlenecks in DNN training
- Diverse benchmark suite with state-of-the-art models
- Key performance metrics
- Tools

The talk focuses on the last of these: **tooling**.

### Feature maps

Feature maps still matter more than weights for memory consumption. New tools have been created to assess the memory consumption of a model.

### Skyline

Skyline is an interactive, in-editor performance visualization and debugging tool for DNN training. It is motivated by the need to explore why a model runs slowly.

**Key features**:
- Key performance metrics (throughput, memory usage)
- Iteration run time and memory footprint breakdowns
- Interactive visualizations linked to batch size predictions
- Live and proactive performance debugging during development
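
The batch size predictions mentioned above can be pictured with a simple linear model. The sketch below assumes that peak memory grows linearly with batch size and fits a line through two measurements. The function names and all numbers are hypothetical and are not Skyline's API.

```python
def fit_line(x1, y1, x2, y2):
    """Return (slope, intercept) of the line through two measurements."""
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


def max_batch_size(slope, intercept, memory_budget_mb):
    """Largest batch size whose predicted memory stays within the budget."""
    return int((memory_budget_mb - intercept) // slope)


# Hypothetical measurements: (batch size, peak memory in MB).
slope, intercept = fit_line(16, 3000.0, 32, 5000.0)
largest = max_batch_size(slope, intercept, memory_budget_mb=11000.0)
print(f"predicted largest batch size: {largest}")
```

A linear fit is only a first approximation; a real tool would validate it against more than two measurements.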

<hr>

## 2 - The Problem

Many GPUs are available for training deep neural networks, each with a different cost and performance. As such, a user has to weigh which one to choose for training.

**Key observations**:

- DNN training computation is highly repetitive
- A GPU's training performance can be predicted by predicting the execution time of a single iteration


**The work presented**:

- Use an existing GPU to predict execution times on a different GPU using wave scaling and pre-trained MLPs
- Implement these ideas in a new tool called Habitat
- Show two case studies where Habitat leads users to the correct GPU choice

### Key observations

- Deep learning users may already have an existing GPU
- DNN training is a repetitive process
- Use existing GPU to make iteration execution time predictions for other GPUs
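
Because training repeats the same iteration many times, a single iteration time is enough to derive throughput and an estimate of total training time. The following is a minimal sketch of that arithmetic; the function names and all numbers are hypothetical.

```python
import math


def training_throughput(batch_size, iteration_time_s):
    """Samples processed per second, given one iteration's execution time."""
    return batch_size / iteration_time_s


def estimated_training_time_s(num_samples, epochs, batch_size, iteration_time_s):
    """Total training time, assuming every iteration takes the same time."""
    iterations_per_epoch = math.ceil(num_samples / batch_size)
    return epochs * iterations_per_epoch * iteration_time_s


# Hypothetical workload: 50,000 samples, 10 epochs, batch size 64, 0.25 s per iteration.
throughput = training_throughput(64, 0.25)
total_s = estimated_training_time_s(50_000, 10, 64, 0.25)
print(f"{throughput} samples/s, {total_s} s in total")
```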

<hr>

## 3 - Habitat: A runtime-based performance predictor

### Process

1. Profile all operations in a training iteration on an existing GPU
2. Predict each operation's execution time on the target GPU using wave scaling or an MLP

![hw](images/habitatworkflow.png)

### Under the hood

A GPU kernel is a unit of work that the programmer divides into thread blocks (same code, different data). Each streaming multiprocessor (SM) runs a finite number of blocks concurrently, and blocks are scheduled onto the SMs in round-robin fashion. As a result, GPU kernels execute in "**waves**" of thread blocks.
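
The number of waves follows directly from this description. The sketch below assumes the wave size is the number of SMs times the number of blocks each SM can run concurrently; the function names and the example GPU figures are hypothetical.

```python
import math


def wave_size(num_sms, blocks_per_sm):
    """Thread blocks that can run concurrently across the whole GPU."""
    return num_sms * blocks_per_sm


def num_waves(num_thread_blocks, num_sms, blocks_per_sm):
    """Waves needed to run every thread block of a kernel."""
    return math.ceil(num_thread_blocks / wave_size(num_sms, blocks_per_sm))


# Hypothetical GPU: 80 SMs, 4 concurrent blocks per SM; kernel with 1000 blocks.
size = wave_size(80, 4)
waves = num_waves(1000, 80, 4)
print(f"wave size: {size}, waves: {waves}")
```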

### Wave Scaling

Wave scaling takes a kernel's execution time measured on one GPU and predicts its execution time on another GPU.

It scales the measured time by the ratios between the two GPUs of several factors:
- Memory bandwidth
- Wave size
- Clock frequency
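
One way to combine these three factors is sketched below. The notes do not give Habitat's exact formula, so the functional form, the `gamma` weighting between memory-bound and compute-bound behavior, the `GpuSpec` type, and all numbers are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class GpuSpec:
    """Hypothetical container for the three scaling factors."""
    mem_bandwidth_gbps: float
    wave_size: int
    clock_mhz: float


def wave_scaling_estimate(t_origin_ms, num_blocks, origin, dest, gamma):
    """Scale a kernel time measured on `origin` to an estimate for `dest`.

    gamma in [0, 1] is an assumed knob: 1 means fully memory-bandwidth bound,
    0 means fully compute (clock) bound.
    """
    waves_origin = math.ceil(num_blocks / origin.wave_size)
    waves_dest = math.ceil(num_blocks / dest.wave_size)
    # Per-wave memory term: bandwidth ratio, adjusted for the wave size ratio.
    memory_ratio = (origin.mem_bandwidth_gbps * dest.wave_size) / (
        dest.mem_bandwidth_gbps * origin.wave_size
    )
    clock_ratio = origin.clock_mhz / dest.clock_mhz
    return (
        t_origin_ms
        * (waves_dest / waves_origin)
        * memory_ratio ** gamma
        * clock_ratio ** (1.0 - gamma)
    )


# Hypothetical GPUs: the target has twice the memory bandwidth of the origin.
origin_gpu = GpuSpec(mem_bandwidth_gbps=900.0, wave_size=320, clock_mhz=1500.0)
target_gpu = GpuSpec(mem_bandwidth_gbps=1800.0, wave_size=320, clock_mhz=1500.0)
estimate_ms = wave_scaling_estimate(10.0, 1000, origin_gpu, target_gpu, gamma=1.0)
print(f"estimated kernel time on target: {estimate_ms} ms")
```

With `gamma=1.0` the kernel is treated as fully memory-bound, so doubling the bandwidth halves the estimate.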

### One wrinkle

- Wave scaling assumes the same kernel is used across GPUs
- A few DNN operations use architecture-specific kernels (kernel variation): convolutions, linear (dense) layers, LSTMs
- Habitat uses pre-trained MLPs for these operations
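
The resulting split between the two predictors can be sketched as a simple dispatch over the profiled operations. This is an illustrative sketch, not Habitat's code: the operation names, the record layout, and the two stand-in predictor functions are hypothetical.

```python
# Operations assumed to use architecture-specific kernels (from the list above).
KERNEL_VARYING_OPS = {"conv2d", "linear", "lstm"}


def predict_iteration_time_ms(profiled_ops, wave_scale, mlp_predict):
    """Sum per-operation predictions, picking a predictor for each operation."""
    total_ms = 0.0
    for op in profiled_ops:
        if op["name"] in KERNEL_VARYING_OPS:
            total_ms += mlp_predict(op)  # pre-trained MLP path
        else:
            total_ms += wave_scale(op)   # wave scaling path
    return total_ms


# Stand-in predictors, for illustration only.
def fake_wave_scale(op):
    return op["measured_ms"] * 0.5


def fake_mlp_predict(op):
    return 2.0


ops = [
    {"name": "relu", "measured_ms": 1.0},
    {"name": "conv2d", "measured_ms": 4.0},
    {"name": "linear", "measured_ms": 3.0},
]
predicted_ms = predict_iteration_time_ms(ops, fake_wave_scale, fake_mlp_predict)
print(f"predicted iteration time: {predicted_ms} ms")
```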

### Evaluation

How accurate are Habitat's predictions?

<hr>

## 4 - Habitat's performance

### How accurate is Habitat

Accuracy is assessed by predicting the iteration execution time on a target GPU and comparing it with the measured time. Habitat makes accurate predictions, with an average error of **11.8%** across all configurations (30 GPU pairs x 5 models x 3 batch sizes).
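
As a worked illustration, the sketch below counts the configurations and computes an average error. It assumes the error is a mean absolute percentage error, which the notes do not state; the predicted and measured times are made-up values, not results from the talk.

```python
def mean_abs_percent_error(predicted, measured):
    """Average of |predicted - measured| / measured, expressed in percent."""
    errors = [abs(p - m) / m * 100.0 for p, m in zip(predicted, measured)]
    return sum(errors) / len(errors)


# GPU pairs x models x batch sizes, as listed above.
num_configurations = 30 * 5 * 3

# Made-up iteration times in milliseconds, for illustration only.
predicted_ms = [110.0, 90.0, 200.0]
measured_ms = [100.0, 100.0, 200.0]
error_percent = mean_abs_percent_error(predicted_ms, measured_ms)
print(f"{num_configurations} configurations, example error: {error_percent:.1f}%")
```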

### Scenarios answered

- **Case 1:** A person wants to train GNMT and has access to a P4000. Which cloud GPU should they use, if any?
    - Habitat correctly predicts that the V100 is the best choice for performance
    - Habitat correctly predicts that the T4 is the best choice for cost
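
The two answers differ because "best" depends on the metric. A minimal sketch, assuming performance means predicted throughput and cost means throughput per dollar; the GPU names, throughputs, and prices below are placeholders, not figures from the talk.

```python
def best_for_performance(predictions):
    """GPU with the highest predicted throughput."""
    return max(predictions, key=lambda gpu: predictions[gpu]["throughput"])


def best_for_cost(predictions):
    """GPU with the highest predicted throughput per dollar."""
    return max(
        predictions,
        key=lambda gpu: predictions[gpu]["throughput"] / predictions[gpu]["usd_per_hour"],
    )


# Placeholder predictions: throughput in samples/s, price in USD per hour.
predictions = {
    "gpu_a": {"throughput": 1000.0, "usd_per_hour": 3.0},
    "gpu_b": {"throughput": 400.0, "usd_per_hour": 0.5},
}
fastest = best_for_performance(predictions)
cheapest = best_for_cost(predictions)
print(f"performance: {fastest}, cost: {cheapest}")
```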

### Why we should not always use the best GPUs

- **Case 2:** A person wants to train a DCGAN and has access to a 2080Ti. Habitat correctly predicts that the V100 offers only a marginal improvement over the 2080Ti.

### Key takeaways

- DNN computation is special (repetitive), enabling new analysis opportunities
- Use runtime-based information to make iteration execution time predictions
- Habitat leads to the correct decision in the case studies
- The hardware landscape is growing, so users need help choosing effectively

<hr>

## 5 - Daydream: Accurately Estimating the Efficacy of Optimizations for DNN Training

The benefits of proposed DNN optimizations are not fully exploited because:
- Efficacy varies across HW/SW configurations
- It is onerous to implement optimizations

Daydream efficiently explores the efficacy of various DNN optimizations using **dependency graph analysis**:
- Tracking dependencies at the abstraction of GPU kernels
- Kernel-to-layer mapping
- Transformation rules to model a diverse set of optimizations
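
The idea of a what-if analysis on a dependency graph can be sketched in a few lines. This is a toy model, not Daydream's implementation: the task names, the `gpu:` prefix convention, and the "halve every GPU kernel" rule are hypothetical, and resource contention beyond the listed dependencies is ignored.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def makespan(durations, deps):
    """Finish time of the last task when each task starts once its dependencies end."""
    finish = {}
    for task in TopologicalSorter(deps).static_order():
        start = max((finish[d] for d in deps.get(task, ())), default=0.0)
        finish[task] = start + durations[task]
    return max(finish.values())


def scale_gpu_kernels(durations, factor):
    """Transformation rule: scale the duration of every GPU kernel task."""
    return {
        task: d * factor if task.startswith("gpu:") else d
        for task, d in durations.items()
    }


# Toy trace: two CPU launch calls and the two GPU kernels they launch (times in ms).
durations = {"cpu:launch_a": 1.0, "gpu:kernel_a": 4.0,
             "cpu:launch_b": 1.0, "gpu:kernel_b": 6.0}
# Each task maps to the set of tasks it depends on.
deps = {
    "gpu:kernel_a": {"cpu:launch_a"},
    "cpu:launch_b": {"cpu:launch_a"},
    "gpu:kernel_b": {"cpu:launch_b", "gpu:kernel_a"},
}
baseline_ms = makespan(durations, deps)
optimized_ms = makespan(scale_gpu_kernels(durations, 0.5), deps)
print(f"baseline: {baseline_ms} ms, with faster kernels: {optimized_ms} ms")
```

The estimate comes from replaying the transformed graph, so the optimization itself never has to be implemented.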

The evaluation shows low estimation error on 5 optimizations and 5 DNN models across 3 applications.

### Advances in ML full stack research

- DNN compute requirements are growing exponentially
- Rapid advances in algorithms, systems optimizations & hardware architectures

> It is hard for an ML programmer to identify the efficacy of new algorithms, optimizations, and hardware improvements in their deployments

### Why dependency analysis

A DNN computational graph is usually an efficient way to represent a model, but dependency analysis on it raises some challenges.

![daydream](images/daydream.png)

#### Challenges for Dependency Graph Analysis in the ML context

1. There are thousands of tasks, and dependencies need to be tracked across CPU threads, GPU streams and interconnects

2. Some optimizations operate at kernel-level granularity, while others operate at layer-level granularity. How should one correlate low-level traces with DNN topology?

3. Ability to easily model diverse DNN optimizations
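
The second challenge, correlating kernels with layers, can be illustrated with timestamps. The sketch below assumes that each layer has a known CPU-side time span and that each kernel launch is timestamped on the CPU; the notes do not detail how Daydream implements its mapping, so the names and data layout here are hypothetical.

```python
def map_kernels_to_layers(layer_spans, kernel_launches):
    """Assign each kernel to the layer whose CPU time span contains its launch."""
    mapping = {}
    for kernel, launch_ts in kernel_launches:
        for layer, (start, end) in layer_spans.items():
            if start <= launch_ts < end:
                mapping[kernel] = layer
                break
    return mapping


# Hypothetical trace: layer spans and kernel launch timestamps in milliseconds.
layer_spans = {"conv1": (0.0, 5.0), "fc1": (5.0, 8.0)}
kernel_launches = [("k0", 1.0), ("k1", 4.5), ("k2", 6.0)]
kernel_to_layer = map_kernels_to_layers(layer_spans, kernel_launches)
print(kernel_to_layer)
```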

### Daydream Methodology

Daydream is evaluated on image classification (VGG16, DenseNet-121, ResNet50), machine translation (GNMT [seq2seq]) and language modeling (BERT). Each model is evaluated for each optimization X on a benchmark Y.

### Conclusion

Daydream is apparently the first system that aims to estimate the efficacy of optimizations for DNN training.

Daydream uses:
- Dependency graph analysis based on the kernel-level granularity
- Sync-Free kernel-to-layer mapping
- Graph transformation rules

Daydream accurately estimates the efficacy of a wide range of DNN optimizations.

<hr>

## 6 - RL-Scope: cross-stack profiling for reinforcement learning

### Deep reinforcement learning progress

#### Existing problems, limitations

- Training time limits progress (the agent learns from interaction rather than from labeled data)
- RL workloads are different from supervised learning workloads
- Profiling tools are designed for GPU-bound workloads (reinforcement learning uses a lot of bandwidth between CPU and GPU)

#### Differences between supervised and reinforcement learning

![diffslrl](images/diffslrl.png)

### RL-Scope: Cross-stack RL profiling

![rlalgo](images/rlalgo.png)

![rlswstack](images/rlswstack.png)

#### RL-Scope profiler features

- Cross-stack scoping: cross-stack view of where CPU and GPU time is spent
- Cross-framework: Works with TF, Torch, etc.
- Corrects for profiling overhead: Correct CPU overhead for accurate insights
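
Overhead correction can be pictured as subtracting the profiler's own bookkeeping cost from the measured time. The sketch below assumes a fixed, pre-calibrated average overhead per profiler event; this is a simplification for illustration, not RL-Scope's actual procedure, and all names and numbers are hypothetical.

```python
def corrected_cpu_time_ms(measured_ms, num_profiler_events, overhead_per_event_ms):
    """Subtract the estimated profiling overhead from a measured CPU time."""
    overhead_ms = num_profiler_events * overhead_per_event_ms
    return max(0.0, measured_ms - overhead_ms)


# Hypothetical numbers: 120 ms measured, 400 profiler events at 0.05 ms each.
corrected_ms = corrected_cpu_time_ms(120.0, 400, 0.05)
print(f"corrected CPU time: {corrected_ms:.1f} ms")
```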

#### Contributions

- RL-Scope profiler
- RL workload survey

#### From the user perspective

Users apply annotations to their training loop:

```python
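# ML training loop, annotated with high-level operations.
# `rls` is the profiler handle used in the talk; its import and setup are not shown.
for t in range(num_timesteps):
    with rls.operation("simulation"):
        ...  # simulation code
    with rls.operation("inference"):
        ...  # inference code
    with rls.operation("backprop"):
        ...  # backprop code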
```

Developer annotations help users tie profiler output to high-level code.
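
To show how such annotations can tie measurements to high-level code, here is a minimal, self-contained timer built around a context manager. It is a hypothetical stand-in written for this illustration, not RL-Scope's implementation: the `OperationTimer` class is invented, and it only measures wall-clock time per label.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class OperationTimer:
    """Accumulates wall-clock time per annotated operation."""

    def __init__(self):
        self.totals_s = defaultdict(float)

    @contextmanager
    def operation(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the annotated code raises.
            self.totals_s[name] += time.perf_counter() - start


timer = OperationTimer()
for _ in range(3):
    with timer.operation("simulation"):
        sum(range(1000))  # placeholder for simulation code
    with timer.operation("inference"):
        sum(range(1000))  # placeholder for inference code

for name, seconds in timer.totals_s.items():
    print(f"{name}: {seconds:.6f} s")
```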

![rltakeaway](images/rltakeaway.png)