# Startup

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import os
os.chdir('/content/drive/My Drive/Courses/Fall 2021/dlsys/DeepLearningSystems-Fall2021/HW1')

# Description of assignments



The assignment is at https://colab.research.google.com/drive/17bYkf-qlw0W3PbQLwfsgHErwM_nevG8Z?usp=sharing. Save a copy of the example notebook to your own Google Drive. Study that material, then:
- evaluate the **computational speed** and **accuracy** of
- the four combinations **(PyTorch, TensorFlow 2.0) x (GPU, TPU)**
- providing a short write-up with concise data charts and figures.

Please organize the write-ups neatly under provided subsections.

Some of the main sources I consulted:
- [HW notebook for GPUs](https://colab.research.google.com/drive/17bYkf-qlw0W3PbQLwfsgHErwM_nevG8Z?usp=sharing)
- [timer](https://www.blog.pythonlibrary.org/2016/05/24/python-101-an-intro-to-benchmarking-your-code/)
- [using TPU on TF](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/guide/tpu.ipynb)
- [using TPU on Torch](https://colab.research.google.com/github/pytorch/xla/blob/master/contrib/colab/mnist-training.ipynb)

# Run benchmarks



Each of the four benchmarks **(TF, Torch) x (GPU, TPU)** was run with:
- 5 epochs per run
- 10 runs (model instances) per configuration
- 3 different batch sizes: 32, 64, 128
- learning rate fixed at 0.001
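The per-epoch train and test times reported below are wall-clock measurements. The following is only a minimal sketch of how such timings could be collected, not the code in the benchmark scripts: `timed`, `train_one_epoch` and `evaluate` are hypothetical names, and the two functions are cheap stand-ins for a real training and evaluation pass.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent in the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


# Hypothetical stand-ins for one training pass and one evaluation pass.
def train_one_epoch():
    return sum(i * i for i in range(10_000))


def evaluate():
    return sum(range(1_000))


timings = {}
with timed("train_time", timings):
    train_one_epoch()
with timed("test_time", timings):
    evaluate()

print(timings)
```

On an accelerator, work is often dispatched asynchronously, so a real measurement would usually need a synchronization point before the clock is stopped.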

In [None]:
from tqdm.notebook import tqdm
import time

In [None]:
NUM_EPOCH = 5
NUM_RUN = 10
BATCH_SIZES = [32, 64, 128]

## GPU


In [None]:
#@title GPU - PyTorch
for BATCH_SIZE in tqdm(BATCH_SIZES):
  !python src/torch_benchmark.py \
    --device GPU \
    --save-path output \
    --num-epoch $NUM_EPOCH \
    --num-run $NUM_RUN \
    --batch-size $BATCH_SIZE \
    --print-perf
  print('-----------------------------------------------------------------------')

  0%|          | 0/3 [00:00<?, ?it/s]

cuda:0
  return torch.from_numpy(parsed.astype(m[2], copy=False)).view(*s)
Main:   0%|                | 0/10 [00:00<?, ?it/s]
ID=00:   0%|                | 0/5 [00:00<?, ?it/s]
GPU-PyTorch @ epoch 1/5 took 23.2 secs (total 0.39 mins elapsed)
	 + TRAIN: ACC = 79.40%	| LOSS = 1.6679 | TIME = 21.0 s 
	 + TEST:  ACC = 95.36%	| LOSS = 1.5090 | TIME = 2.2 s

ID=00:  20%|█▌      | 1/5 [00:23<01:32, 23.18s/it]
GPU-PyTorch @ epoch 2/5 took 22.3 secs (total 0.76 mins elapsed)
	 + TRAIN: ACC = 96.90%	| LOSS = 1.4945 | TIME = 20.2 s 
	 + TEST:  ACC = 97.46%	| LOSS = 1.4881 | TIME = 2.2 s

ID=00:  40%|███▏    | 2/5 [00:45<01:08, 22.69s/it]
GPU-PyTorch @ epoch 3/5 took 22.6 secs (total 1.14 mins elapsed)
	 + TRAIN: ACC = 97.92%	| LOSS = 1.4831 | TIME = 20.3 s 
	 + TEST:  ACC = 97.95%	| LOSS = 1.4818 | TIME = 2.3 s

ID=00:  60%|████▊   | 3/5 [01:08<00:45, 22.64s/it]
GPU-PyTorch @ epoch 4/5 took 23.5 secs (total 1.53 mins elapsed)
	 + TRAIN: ACC = 98.49%	| LOSS = 1.4772 | TIME

In [None]:
#@title GPU - TensorFlow
for BATCH_SIZE in tqdm(BATCH_SIZES):
  !python src/tf_benchmark_gpu.py \
    --save-path output \
    --num-epoch $NUM_EPOCH \
    --num-run $NUM_RUN \
    --batch-size $BATCH_SIZE \
    --print-perf
  print('-----------------------------------------------------------------------')

  0%|          | 0/3 [00:00<?, ?it/s]

2021-10-06 23:53:49.512893: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-06 23:53:49.522977: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-06 23:53:49.523765: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-06 23:53:50.153705: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-06 23:53:50.154612: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from S

## TPU (multicore)

In [None]:
#@title TPU requirements for PyTorch
!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl

In [None]:
#@title TPU - PyTorch

for BATCH_SIZE in tqdm(BATCH_SIZES):
  !python -W ignore src/torch_benchmark_tpu.py \
    --save-path output \
    --num-epoch $NUM_EPOCH \
    --num-run $NUM_RUN \
    --batch-size $BATCH_SIZE \
    --print-perf

  0%|          | 0/3 [00:00<?, ?it/s]

Main:   0%|                | 0/10 [00:00<?, ?it/s]
File ./tmp/tmptpu.csv already exists. Will OVERWRITE file
Main:  10%|▋      | 1/10 [03:08<28:19, 188.82s/it]
File ./tmp/tmptpu.csv already exists. Will OVERWRITE file
Main:  20%|█▍     | 2/10 [06:03<24:04, 180.58s/it]
File ./tmp/tmptpu.csv already exists. Will OVERWRITE file
Main:  30%|██     | 3/10 [08:53<20:30, 175.78s/it]
File ./tmp/tmptpu.csv already exists. Will OVERWRITE file
Main:  40%|██▊    | 4/10 [11:52<17:41, 176.85s/it]
File ./tmp/tmptpu.csv already exists. Will OVERWRITE file
Main:  50%|███▌   | 5/10 [15:04<15:12, 182.55s/it]
File ./tmp/tmptpu.csv already exists. Will OVERWRITE file
Main:  60%|████▏  | 6/10 [18:04<12:05, 181.46s/it]
File ./tmp/tmptpu.csv already exists. Will OVERWRITE file
Main:  70%|████▉  | 7/10 [20:52<08:51, 177.30s/it]
File ./tmp/tmptpu.csv already exists. Will OVERWRITE file
Main:  80%|█████▌ | 8/1

In [None]:
#@title TPU - TensorFlow

for BATCH_SIZE in tqdm(BATCH_SIZES):
  !python src/tf_benchmark_tpu.py \
    --save-path output \
    --num-epoch $NUM_EPOCH \
    --num-run $NUM_RUN \
    --batch-size $BATCH_SIZE \
    --print-perf
  print('-----------------------------------------------------------------------')

  0%|          | 0/3 [00:00<?, ?it/s]

2021-10-07 00:09:05.360649: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-10-07 00:09:05.360712: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (504e1793e616): /proc/driver/nvidia/version does not exist
2021-10-07 00:09:05.439364: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.65.96.66:8470}
2021-10-07 00:09:05.439449: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:33078}
2021-10-07 00:09:05.460054: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.65.96.66:8470}
2021-10-07 00:09:05.460143: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localhost -> {0 -> loca

# Visualizations

## Load data

In [3]:
import time, os, glob
import pandas as pd
import numpy as np

In [13]:
!ls output

PyTorch-GPU.csv    PyTorch-TPU.csv     TensorFlow-TPU.csv
PyTorch-TPU-1.csv  TensorFlow-GPU.csv


In [19]:
data_path = 'output'
data_files = ['PyTorch-GPU.csv', 'PyTorch-TPU.csv', 'TensorFlow-GPU.csv', 'TensorFlow-TPU.csv']
cols2drop = ['overwrite', 'filename', 'num_epoch', 'num_run', 'learning_rate']
df = [pd.read_csv(os.path.join(data_path, x)) for x in data_files]
df = pd.concat(df, ignore_index=True)
df.drop(cols2drop, axis=1, inplace=True)
df['platform'] = df.library + '-' + df.device

batch_sizes = df.batch_size.unique()
final_epochid = df.epoch.max()
platform = df.platform.unique()
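Before plotting, a numeric summary per platform and batch size can be a useful sanity check. The sketch below uses a small made-up frame (`toy`, with invented timing values) that only mimics the `platform`, `batch_size` and `train_time` columns of `df`; the same `groupby` call should apply to the real data.

```python
import pandas as pd

# Made-up values for illustration only; not taken from the benchmark output.
toy = pd.DataFrame({
    "platform": ["PyTorch-GPU"] * 4 + ["TensorFlow-TPU"] * 4,
    "batch_size": [32, 32, 128, 128] * 2,
    "train_time": [20.0, 22.0, 10.0, 12.0, 6.0, 4.0, 3.0, 5.0],
})

# Mean, spread and sample count of the per-epoch training time for each
# (platform, batch size) combination.
summary = (
    toy.groupby(["platform", "batch_size"])["train_time"]
       .agg(["mean", "std", "count"])
       .reset_index()
)
print(summary)
```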

In [20]:
df

Unnamed: 0,exp_begin,model_id,epoch,train_loss,train_acc,train_time,test_loss,test_acc,test_time,library,batch_size,device,platform
0,23:08:43,0,0,1.667938,79.400000,20.974790,1.508983,95.357428,2.207126,PyTorch,32,GPU,PyTorch-GPU
1,23:08:43,0,1,1.494546,96.900000,20.185190,1.488145,97.464058,2.154147,PyTorch,32,GPU,PyTorch-GPU
2,23:08:43,0,2,1.483098,97.915000,20.293474,1.481829,97.953275,2.289190,PyTorch,32,GPU,PyTorch-GPU
3,23:08:43,0,3,1.477206,98.488333,20.506522,1.479799,98.142971,2.944606,PyTorch,32,GPU,PyTorch-GPU
4,23:08:43,0,4,1.473615,98.813333,20.169943,1.479470,98.232827,2.268382,PyTorch,32,GPU,PyTorch-GPU
...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,00:29:48,29,0,0.182240,94.701523,5.910524,0.077809,97.626203,2.748838,TensorFlow,128,TPU,TensorFlow-TPU
596,00:29:48,29,1,0.052885,98.415798,3.234434,0.054334,98.177081,1.763047,TensorFlow,128,TPU,TensorFlow-TPU
597,00:29:48,29,2,0.032052,99.033451,3.471156,0.053262,98.377407,1.773748,TensorFlow,128,TPU,TensorFlow-TPU
598,00:29:48,29,3,0.019954,99.412394,3.273907,0.058078,98.247194,1.844264,TensorFlow,128,TPU,TensorFlow-TPU


## Visualize

### Essentials

In [21]:
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

In [33]:
#@title Plot options
axis_config = dict(
    showline=True,
    showgrid=False,
    showticklabels=True,
    linecolor='rgb(0, 0, 0)',
    linewidth=2,    
    ticks='outside',
    tickwidth=2
    )

font_config = dict(
    family="Fira Sans",
    size=18,
    color='black'
    )

title_config = dict(
    title_x = 0.5,
    title_y = 0.9,
    title_xanchor = 'center',
    title_yanchor = 'top',
    title_font_size=23
)

general_layout = go.Layout(
    xaxis=axis_config,
    yaxis=axis_config,
    xaxis2=axis_config,
    yaxis2=axis_config,
    xaxis3=axis_config,
    yaxis3=axis_config,
    xaxis4=axis_config,
    yaxis4=axis_config,
    font=font_config,
    margin=dict(
        autoexpand=True,
        l=100,
        r=50,
        t=100,
        b=120
    ),
    showlegend=True,
    plot_bgcolor='white',
    autosize=True,
    **title_config
)

colors = {
    32: 'rgba(200, 200, 250, 0.9)',
    64: 'rgba(120, 120, 220, 0.9)',
    128: 'rgba(50, 50, 180, 0.9)'
}

In [53]:
#@title Function for plotting
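def plot_benchmark_pairs(fields, title, addtional_query='',
                         shared_xaxes=True, shared_yaxes=True):
    """Draw one violin panel per entry in `fields` (the grid has 2 columns).

    Relies on the notebook globals `df`, `batch_sizes`, `colors` and
    `general_layout`. `addtional_query` is prepended to the per-batch-size
    filter, so a non-empty value must end with '&'.
    """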
    fig = make_subplots(
        rows=1, 
        cols=2, 
        shared_xaxes=shared_xaxes,
        shared_yaxes=shared_yaxes,
        horizontal_spacing=0.05,
        vertical_spacing=0.05,
        subplot_titles=fields
    )

    fig.update_layout(general_layout)

    common_violin_props = dict(meanline_visible=True, points='all',
                            jitter=0.1, marker_size=3, opacity=0.8)

    for i, field in enumerate(fields):
        for j, bsz in enumerate(batch_sizes):  
            bz_grp = 'batch size = %d' %(bsz)
            df_sel = df.query(addtional_query + 'batch_size == @bsz')
            fig.add_trace(
                go.Violin(
                    x = df_sel['platform'],
                    y = df_sel[field], 
                    line_color = colors[bsz],
                    legendgroup = bz_grp, 
                    scalegroup = bz_grp,
                    name = bz_grp,
                    **common_violin_props,
                    showlegend=i==(len(fields)-1)
                    ), 
                row = 1,
                col = i+1
            )

    for i in fig['layout']['annotations']:
        i['font']['size']=20

    fig.update_layout(
        violinmode='group',
        violingap=0, 
        width=1200,
        height=500,
        title_text=title
    )
    fig.show()

### Results

Below are comparisons of the different platforms (i.e. library + device) for the 3 batch sizes:
- library: PyTorch or TensorFlow
- device: GPU or TPU (8 cores)
- comparisons:
    - train/test accuracies: only the final accuracies (i.e. final epoch) are plotted
    - train/test speed: the times from all epochs are plotted (regardless of the epoch)
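The only difference between the two kinds of plots is the row filter. Here is a small sketch of how that filter works with `DataFrame.query`, using a made-up frame (`toy`) in place of `df`; the `@name` syntax refers to Python variables in the calling scope.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "epoch": [0, 1, 2, 0, 1, 2],
    "batch_size": [32, 32, 32, 64, 64, 64],
    "test_acc": [95.0, 97.0, 98.0, 94.0, 96.5, 97.5],
})

final_epochid = toy.epoch.max()
bsz = 64

# Accuracy plots: the prefix restricts the rows to the final epoch.
prefix = "epoch == @final_epochid &"
final_only = toy.query(prefix + "batch_size == @bsz")

# Timing plots: an empty prefix keeps every epoch for the batch size.
all_epochs = toy.query("" + "batch_size == @bsz")

print(final_only)
print(all_epochs)
```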

**Final accuracies**

The PyTorch implementations lead to more variable accuracies, for both the final training and the final test evaluation. This is especially true on TPU, possibly because the data was not split properly.
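To make the data-splitting concern concrete: on a multi-core device, each core would normally see its own disjoint slice of the dataset. The sketch below only illustrates that idea in plain Python, under the assumption of 8 replicas; `shard_indices` is a hypothetical helper and not the sampler used in the benchmark script.

```python
def shard_indices(num_samples, num_replicas, rank):
    """Return the sample indices assigned to replica `rank` (strided split)."""
    return list(range(rank, num_samples, num_replicas))


num_samples, num_replicas = 20, 8  # tiny example; 8 replicas is an assumption
shards = [shard_indices(num_samples, num_replicas, r) for r in range(num_replicas)]

# A proper split covers every sample exactly once across all replicas.
covered = sorted(i for shard in shards for i in shard)
print(covered == list(range(num_samples)))
```

If the replicas instead drew overlapping or identical slices, each epoch would effectively see less distinct data, which is one way accuracy could become more variable.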

Zooming in, we can also see that the TensorFlow runs are slightly better than the PyTorch ones in terms of train accuracy, though PyTorch/GPU reaches comparable test accuracy.

Between GPU and TPU, PyTorch was much more variable on the latter, and the GPU clearly outperformed the TPU on both accuracies. This might be due to a need to rescale the batch size or the learning rate on TPU.
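One common heuristic for this kind of rescaling is to grow the learning rate in proportion to the global batch size. The sketch below assumes the batch size is applied per core, so that 8 cores give an 8x larger global batch; I did not verify that the benchmark script behaves this way. `scaled_learning_rate` is a hypothetical helper.

```python
def scaled_learning_rate(base_lr, per_core_batch, num_cores, reference_batch):
    """Linear scaling heuristic: lr grows with global batch / reference batch."""
    global_batch = per_core_batch * num_cores
    return base_lr * global_batch / reference_batch, global_batch


# Base values from this notebook (LR 0.001, batch size 32); 8 cores assumed.
lr, global_batch = scaled_learning_rate(
    base_lr=0.001, per_core_batch=32, num_cores=8, reference_batch=32
)
print(global_batch, lr)
```

Whether such scaling would actually close the gap here is untested; it is just one adjustment worth trying.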

As for TensorFlow, the plots show no clear accuracy gain from one device over the other.

Lastly, the batch size does not seem to matter much, except for PyTorch/TPU, where increasing the batch size might have made things worse.

In [55]:
plot_benchmark_pairs(['train_acc', 'test_acc'], "Final accuracies", addtional_query='epoch == @final_epochid &')

**Running times (all epochs) in seconds**

Generally, TensorFlow ran much faster on almost all fronts, except at test time, where PyTorch/GPU might be faster than or comparable to TF/TPU.

The TPU brings no speed benefit in either case; it actually makes things worse.
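One caveat when reading these timings: all epochs are pooled. In the TensorFlow-TPU rows displayed in the data table, the first epoch is noticeably slower than the later ones, which would be consistent with one-time start-up costs such as graph compilation (an assumption; the logs here do not confirm it). The sketch below separates the first epoch from the rest, using the four `train_time` values of that displayed run, rounded to two decimals.

```python
import pandas as pd

# train_time of the TensorFlow-TPU run shown in the data table (rounded).
toy = pd.DataFrame({
    "epoch": [0, 1, 2, 3],
    "train_time": [5.91, 3.23, 3.47, 3.27],
})

first_epoch = toy.loc[toy.epoch == 0, "train_time"].iloc[0]
steady_state = toy.loc[toy.epoch > 0, "train_time"].mean()

# Extra seconds spent in the first epoch relative to the later ones.
overhead = first_epoch - steady_state
print(round(steady_state, 2), round(overhead, 2))
```

Applying the same split to the full `df` would show whether the TPU slowdown is mostly a first-epoch effect or persists in later epochs.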

In [48]:
plot_benchmark_pairs(['train_time', 'test_time'], "Model running time (seconds)") 