In this notebook, let us measure the performance of the individual ops involved in the inference pipeline and try to optimize them to get the best performance out of our Style Transfer network.

In [1]:
# Import necessary modules
import torch
import cv2
import numpy as np
from models import *
import time
from torchvision import datasets, transforms

torch.cuda.empty_cache()
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.enabled = True

In [2]:
# Initialize the model
model = TransformerNet().cuda()

# Raw Inference Pipeline
The following block shows the basic inference pipeline, which performs style transfer on frames captured from the webcam.

Broadly, the inference pipeline can be broken down into 4 modules:
- Fetch (frames from Webcam)
- Preprocess (the frames)
- Prediction (The model inference phase)
- Post-process + Render (Rendering the final results)
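
To see where the time goes, each of the four modules can be timed separately. The sketch below is a minimal, standard-library-only illustration: `StageTimer`, `run_fake_pipeline`, and the stand-in stage bodies are hypothetical and not part of this notebook. For stages that run on the GPU, `torch.cuda.synchronize()` would also be needed before reading the clock, because CUDA work is launched asynchronously.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def mean_ms(self):
        """Mean time per call for each stage, in milliseconds."""
        return {k: 1000.0 * self.totals[k] / self.counts[k] for k in self.totals}


def run_fake_pipeline(timer, n_frames=5):
    # Stand-in work for each stage; replace the bodies with the real calls.
    for _ in range(n_frames):
        with timer.stage("fetch"):
            frame = list(range(1000))
        with timer.stage("preprocess"):
            data = [float(v) for v in frame]
        with timer.stage("predict"):
            out = [v * 2.0 for v in data]
        with timer.stage("postprocess"):
            _ = [min(255, int(v)) for v in out]
    return timer.mean_ms()


stage_means = run_fake_pipeline(StageTimer())
for name, ms in stage_means.items():
    print(f"{name}: {ms:.4f} ms")
```

A per-stage breakdown like this makes it easier to decide which module is worth optimizing first.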

In [3]:
def test_v1():
    model.eval()
    
    # Setup content transform
    content_transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Lambda(lambda x: x.mul(255))
    ])
    
    camera = cv2.VideoCapture(0)
    
    with torch.no_grad():
        counter = 1000
        start = time.time()
        while counter > 0:
            _, frame = camera.read()
            content_image = content_transform(frame)
            content_image = content_image.unsqueeze(0).cuda()
            output = model(content_image).cpu().detach()[0].clamp(0, 255).numpy().transpose(1,2,0).astype("uint8")
            counter -= 1
        torch.cuda.synchronize()
        end = time.time()
        print('Time taken to infer =', (end-start)/1000.0)
    camera.release()
    return (end-start)/1000.0

# Run 10 experiments
time_v1 = []
for i in range(10):
    time_v1.append(test_v1())
time_v1 = np.asarray(time_v1)
print('Mean = {}, StdDev = {}'.format(np.mean(time_v1, axis=0), np.std(time_v1, axis=0)))

Time taken to infer = 0.06374223518371581
Time taken to infer = 0.06224231505393982
Time taken to infer = 0.06326079225540161
Time taken to infer = 0.06299156737327576
Time taken to infer = 0.06327822160720825
Time taken to infer = 0.06356794118881226
Time taken to infer = 0.06337778925895692
Time taken to infer = 0.06440203619003296
Time taken to infer = 0.06361738705635071
Time taken to infer = 0.06413892650604248
Mean = 0.06346192116737366, StdDev = 0.0005697443838206991


# Optimization

We have seen above that the raw pipeline (camera read, preprocessing, inference and post-processing together) takes around 63.5 milliseconds per frame. That is roughly 15.7 FPS.

Now let us focus on each individual module and try to optimize them.
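
The tests in the following sections all repeat the same pattern: 10 runs of 1000 iterations each, followed by the mean and standard deviation of the per-iteration time. As a sketch, that pattern could be factored into one helper. The name `benchmark` and its parameters are hypothetical; `sync` is an optional callable such as `torch.cuda.synchronize`, which matters for GPU timings because CUDA calls return before the work has finished. `statistics.pstdev` is used because it matches the default behavior of `np.std` (population standard deviation).

```python
import statistics
import time


def benchmark(fn, repeats=10, iters=1000, sync=None):
    """Return (mean, population std) of seconds per call over `repeats` runs.

    `sync` is an optional zero-argument callable (for example
    torch.cuda.synchronize) that is called before each clock read so
    that pending asynchronous GPU work is included in the measurement.
    """
    per_call = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        if sync is not None:
            sync()
        per_call.append((time.perf_counter() - start) / iters)
    return statistics.mean(per_call), statistics.pstdev(per_call)


# CPU-only usage example with a trivial workload.
mean_s, std_s = benchmark(lambda: sum(range(100)), repeats=3, iters=100)
print("Mean = {}, StdDev = {}".format(mean_s, std_s))
```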

## Preprocessing (HWC -> CHW -> NCHW)

Here is a summary of the three tests we are running:
- V1: Use PyTorch's `transforms` to convert the image (UInt8) into tensor (Float32) format.
- V2: Use OpenCV + NumPy to convert the image (UInt8) into tensor (Float32) format.
- V3: Use OpenCV + NumPy to convert the image (UInt8) into tensor (UInt8/Byte) format.

In [3]:
# Optimize Pre-processing (HWC -> CHW -> NCHW)
content_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.mul(255))
])
camera = cv2.VideoCapture(0)
_, frame = camera.read()
camera.release()

# V1: Use TorchVision transforms (ToTensor scales to [0, 1], the Lambda scales back to [0, 255])
time_preprocess_v1 = []
for i in range(10):
    counter = 1000
    start = time.time()
    while counter > 0:
        content_image = content_transform(frame)
        content_image = content_image.unsqueeze(0)
        counter -= 1
    end = time.time()
    print('Time taken to pre-process =', (end-start)/1000.0)
    time_preprocess_v1.append((end-start)/1000.0)
time_preprocess_v1 = np.asarray(time_preprocess_v1)
print('V1: Mean = {}, StdDev = {}\n'.format(np.mean(time_preprocess_v1, axis=0), np.std(time_preprocess_v1, axis=0)))

# V2: Use OpenCV and NumPy to transform the image (convert to Float32 on CPU)
time_preprocess_v2 = []
for i in range(10):
    counter = 1000
    start = time.time()
    while counter > 0:
        frame_temp = frame.swapaxes(1, 2).swapaxes(0, 1)
        frame_temp = frame_temp[np.newaxis, :, :, :]
        content_image = torch.from_numpy(frame_temp).type(torch.FloatTensor)
        counter -= 1
    end = time.time()
    print('Time taken to pre-process =', (end-start)/1000.0)
    time_preprocess_v2.append((end-start)/1000.0)
time_preprocess_v2 = np.asarray(time_preprocess_v2)
print('V2: Mean = {}, StdDev = {}\n'.format(np.mean(time_preprocess_v2, axis=0), np.std(time_preprocess_v2, axis=0)))

# V3: Modify V2 by not converting to Float32 on CPU (keep UInt8)
time_preprocess_v3 = []
for i in range(10):
    counter = 1000
    start = time.time()
    while counter > 0:
        frame_temp = frame.swapaxes(1, 2).swapaxes(0, 1)
        frame_temp = frame_temp[np.newaxis, :, :, :]
        content_image = torch.from_numpy(frame_temp)
        counter -= 1
    end = time.time()
    print('Time taken to pre-process =', (end-start)/1000.0)
    time_preprocess_v3.append((end-start)/1000.0)
time_preprocess_v3 = np.asarray(time_preprocess_v3)
print('V3: Mean = {}, StdDev = {}\n'.format(np.mean(time_preprocess_v3, axis=0), np.std(time_preprocess_v3, axis=0)))

Time taken to pre-process = 0.007729572057723999
Time taken to pre-process = 0.007927640914916993
Time taken to pre-process = 0.00809889245033264
Time taken to pre-process = 0.008704333305358887
Time taken to pre-process = 0.009419626951217652
Time taken to pre-process = 0.009154453754425048
Time taken to pre-process = 0.008751245260238648
Time taken to pre-process = 0.008776693105697632
Time taken to pre-process = 0.008691324234008788
Time taken to pre-process = 0.009496425151824951
V1: Mean = 0.008675020718574521, StdDev = 0.0005710206415376447

Time taken to pre-process = 0.00466198468208313
Time taken to pre-process = 0.004477731704711914
Time taken to pre-process = 0.00536241602897644
Time taken to pre-process = 0.004512583255767822
Time taken to pre-process = 0.0046747827529907224
Time taken to pre-process = 0.0046655011177062986
Time taken to pre-process = 0.004764933109283448
Time taken to pre-process = 0.0051664204597473146
Time taken to pre-process = 0.0047572414875030515
Tim

As we can see above, `V2` is faster than `V1`, which suggests that plain OpenCV + NumPy operations are faster than torchvision's `transforms` for this conversion. Note that `V1` also does extra work: `ToTensor()` scales the values to [0, 1] and the `Lambda` scales them back to [0, 255]. In `V3`, we do not convert the tensor to Float32 and leave it as UInt8. It looks like the costly operation is the UInt8 -> Float32 conversion on the CPU. The question is whether we can push this op to the GPU instead of doing it on the CPU.
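
The difference between keeping UInt8 and converting to Float32 can be illustrated with NumPy alone. In the sketch below, a synthetic 480x640 frame stands in for the webcam image (the resolution is an assumption). The axis reordering and the added batch dimension are views that share memory with the original frame, while the Float32 conversion has to allocate and fill a new buffer four times as large.

```python
import numpy as np

# Synthetic stand-in for a BGR webcam frame (H, W, C), dtype uint8.
frame = (np.arange(480 * 640 * 3) % 256).astype(np.uint8).reshape(480, 640, 3)

# HWC -> CHW: same result as frame.swapaxes(1, 2).swapaxes(0, 1).
chw = frame.transpose(2, 0, 1)
# CHW -> NCHW: add a leading batch dimension.
nchw = chw[np.newaxis, :, :, :]

# Converting to float32 copies every pixel into a new, larger buffer.
nchw_float = nchw.astype(np.float32)

print(nchw.shape, nchw.dtype, nchw.nbytes)
print(nchw_float.shape, nchw_float.dtype, nchw_float.nbytes)
print("NCHW is a view of frame:", np.shares_memory(frame, nchw))
print("float copy shares memory:", np.shares_memory(frame, nchw_float))
```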

## Data Transfer (CPU -> GPU)

Here is a summary of the two tests we are running:
- V1: Transfer Float32 Tensor to GPU
- V2: Transfer UInt8 Tensor to GPU

In [4]:
camera = cv2.VideoCapture(0)
_, frame = camera.read()
camera.release()
frame = frame.swapaxes(1, 2).swapaxes(0, 1)
frame = frame[np.newaxis, :, :, :]
content_image = torch.from_numpy(frame)

# Data Transfer tests
# 1: Transfer Float to GPU
content_image_float = content_image.type(torch.FloatTensor)
time_f_trans = []
for i in range(10):
    counter = 1000
    start = time.time()
    while counter > 0:
        temp = content_image_float.cuda()
        counter -= 1
    torch.cuda.synchronize()
    end = time.time()
    print('Time taken to transfer Float=', (end-start)/1000.0)
    time_f_trans.append((end-start)/1000.0)
print('V1: Mean = {}, StdDev = {}\n'.format(np.mean(time_f_trans, axis=0), np.std(time_f_trans, axis=0)))

# 2: Transfer UInt8 to GPU
time_i8_trans = []
for i in range(10):
    counter = 1000
    start = time.time()
    while counter > 0:
        temp = content_image.cuda()
        counter -= 1
    torch.cuda.synchronize()
    end = time.time()
    print('Time taken to transfer UInt8=', (end-start)/1000.0)
    time_i8_trans.append((end-start)/1000.0)
print('V2: Mean = {}, StdDev = {}\n'.format(np.mean(time_i8_trans, axis=0), np.std(time_i8_trans, axis=0)))

Time taken to transfer Float= 0.0006362755298614502
Time taken to transfer Float= 0.000767960786819458
Time taken to transfer Float= 0.0007988643646240234
Time taken to transfer Float= 0.0005784521102905274
Time taken to transfer Float= 0.0007185821533203126
Time taken to transfer Float= 0.0005704751014709473
Time taken to transfer Float= 0.000593414545059204
Time taken to transfer Float= 0.0005744626522064209
Time taken to transfer Float= 0.0005889854431152344
Time taken to transfer Float= 0.0005680437088012695
V1: Mean = 0.0006395516395568848, StdDev = 8.406322692265474e-05

Time taken to transfer UInt8= 0.0011579360961914062
Time taken to transfer UInt8= 0.0011855967044830322
Time taken to transfer UInt8= 0.0011219978332519531
Time taken to transfer UInt8= 0.0011370019912719726
Time taken to transfer UInt8= 0.0012177438735961913
Time taken to transfer UInt8= 0.0011928267478942872
Time taken to transfer UInt8= 0.0011908178329467774
Time taken to transfer UInt8= 0.0010940773487091065


From above, we can observe that in this test passing a Float32 tensor is about 2x faster than passing a UInt8 tensor. Now, let us run the following fused tests:

- Convert UInt8 -> Float32 on CPU, and pass Float32 to GPU
- Pass UInt8 to GPU, and convert it to Float32 on GPU.

In [5]:
# Fused Data Transfer tests
# 1: Convert to Float & Transfer to GPU
time_f_trans = []
for i in range(10):
    counter = 1000
    start = time.time()
    while counter > 0:
        content_image_float = content_image.type(torch.FloatTensor)
        temp = content_image_float.cuda()
        counter -= 1
    torch.cuda.synchronize()
    end = time.time()
    print('Time taken to transfer Float =', (end-start)/1000.0)
    time_f_trans.append((end-start)/1000.0)
print('V1: Mean = {}, StdDev = {}\n'.format(np.mean(time_f_trans, axis=0), np.std(time_f_trans, axis=0)))

# 2: Transfer UInt8 to GPU & Convert to Float
time_i8_trans = []
for i in range(10):
    counter = 1000
    start = time.time()
    while counter > 0:
        temp = content_image.cuda()
        temp = temp.type(torch.cuda.FloatTensor)
        counter -= 1
    torch.cuda.synchronize()
    end = time.time()
    print('Time taken to transfer UInt8 =', (end-start)/1000.0)
    time_i8_trans.append((end-start)/1000.0)
print('V2: Mean = {}, StdDev = {}\n'.format(np.mean(time_i8_trans, axis=0), np.std(time_i8_trans, axis=0)))

Time taken to transfer Float = 0.0051630535125732424
Time taken to transfer Float = 0.005124115467071534
Time taken to transfer Float = 0.0052873239517211915
Time taken to transfer Float = 0.005293906688690186
Time taken to transfer Float = 0.0050066132545471195
Time taken to transfer Float = 0.00499297571182251
Time taken to transfer Float = 0.004989104509353638
Time taken to transfer Float = 0.005008767604827881
Time taken to transfer Float = 0.0050498549938201905
Time taken to transfer Float = 0.004984730005264283
V1: Mean = 0.005090044569969178, StdDev = 0.0001152624820694124

Time taken to transfer UInt8 = 0.0012316856384277344
Time taken to transfer UInt8 = 0.001218740463256836
Time taken to transfer UInt8 = 0.0013094983100891114
Time taken to transfer UInt8 = 0.0011509225368499757
Time taken to transfer UInt8 = 0.0011757762432098389
Time taken to transfer UInt8 = 0.0012422752380371093
Time taken to transfer UInt8 = 0.0014308736324310303
Time taken to transfer UInt8 = 0.001411702

From above, we can see that Preprocessing V3 + Data Transfer (Fused) V2 gives us the best performance before inference.
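
One detail worth checking in the unfused transfer test is memory layout, since the UInt8 tensor holds only a quarter of the bytes of the Float32 one and yet took longer to transfer. A possible contributing factor, which is an assumption and was not verified on the setup used here, is that the UInt8 tensor is built from a `swapaxes` view and is therefore not contiguous in memory. The NumPy-only sketch below shows how to inspect the layout and how to force a contiguous copy before handing the array to `torch.from_numpy`. Whether this changes the transfer time would have to be measured.

```python
import numpy as np

# Synthetic HWC frame; the 480x640 resolution is an assumption.
frame = (np.arange(480 * 640 * 3) % 256).astype(np.uint8).reshape(480, 640, 3)

# The CHW view produced by swapaxes reuses the original HWC buffer,
# so its strides are permuted rather than row-major.
chw_view = frame.swapaxes(1, 2).swapaxes(0, 1)
print(chw_view.flags["C_CONTIGUOUS"], chw_view.strides)

# ascontiguousarray copies the data into row-major CHW order.
chw_packed = np.ascontiguousarray(chw_view)
print(chw_packed.flags["C_CONTIGUOUS"], chw_packed.strides)
```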

## Postprocessing (GPU->CPU->Image)

Now let us optimize the post-processing. Here is a summary of the tests:
- V1: Do complete post-processing on CPU.
- V2: Do `clamp()` on GPU and rest on CPU.
- V3: Do `clamp()` and `type(torch.cuda.ByteTensor)` on GPU and rest on CPU.

In [6]:
camera = cv2.VideoCapture(0)
_, frame = camera.read()
camera.release()
frame = frame.swapaxes(1, 2).swapaxes(0, 1)
frame = frame[np.newaxis, :, :, :]
content_image = torch.from_numpy(frame)
content_image = content_image.type(torch.cuda.FloatTensor)
content_image = model(content_image)

# 1: Post-process v1
time_f_trans = []
for i in range(10):
    counter = 1000
    start = time.time()
    while counter > 0:
        temp = content_image.cpu().detach()[0].clamp(0, 255).numpy().transpose(1,2,0).astype("uint8")
        counter -= 1
    torch.cuda.synchronize()
    end = time.time()
    print('Time taken to transfer to CPU (v1) =', (end-start)/1000.0)
    time_f_trans.append((end-start)/1000.0)
print('V1: Mean = {}, StdDev = {}\n'.format(np.mean(time_f_trans, axis=0), np.std(time_f_trans, axis=0)))


# 2: Post-process v2
time_f_trans = []
for i in range(10):
    counter = 1000
    start = time.time()
    while counter > 0:
        temp = content_image.clamp(0, 255).cpu().detach()[0].numpy().transpose(1,2,0).astype("uint8")
        counter -= 1
    torch.cuda.synchronize()
    end = time.time()
    print('Time taken to transfer to CPU (v2) =', (end-start)/1000.0)
    time_f_trans.append((end-start)/1000.0)
print('V2: Mean = {}, StdDev = {}\n'.format(np.mean(time_f_trans, axis=0), np.std(time_f_trans, axis=0)))


# 3: Post-process v3
time_f_trans = []
for i in range(10):
    counter = 1000
    start = time.time()
    while counter > 0:
        temp = content_image.clamp(0, 255).type(torch.cuda.ByteTensor).cpu().detach()[0].numpy().transpose(1,2,0)
        counter -= 1
    torch.cuda.synchronize()
    end = time.time()
    print('Time taken to transfer to CPU (v3) =', (end-start)/1000.0)
    time_f_trans.append((end-start)/1000.0)
print('V3: Mean = {}, StdDev = {}\n'.format(np.mean(time_f_trans, axis=0), np.std(time_f_trans, axis=0)))

Time taken to transfer to CPU (v1) = 0.004254332065582275
Time taken to transfer to CPU (v1) = 0.004119851589202881
Time taken to transfer to CPU (v1) = 0.004068934440612793
Time taken to transfer to CPU (v1) = 0.004113720893859863
Time taken to transfer to CPU (v1) = 0.004333008766174316
Time taken to transfer to CPU (v1) = 0.003925081729888916
Time taken to transfer to CPU (v1) = 0.003935510158538818
Time taken to transfer to CPU (v1) = 0.004238506555557251
Time taken to transfer to CPU (v1) = 0.004195333957672119
Time taken to transfer to CPU (v1) = 0.004248223066329956
V1: Mean = 0.004143250322341919, StdDev = 0.00012994191057122414

Time taken to transfer to CPU (v2) = 0.0025672900676727297
Time taken to transfer to CPU (v2) = 0.002452530860900879
Time taken to transfer to CPU (v2) = 0.0024302475452423096
Time taken to transfer to CPU (v2) = 0.0027535762786865233
Time taken to transfer to CPU (v2) = 0.0024895095825195313
Time taken to transfer to CPU (v2) = 0.0024485158920288085
T

From above, we can see that Postprocessing V3 ideally takes less than 1 millisecond.
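
The post-processing steps themselves can be sketched on the CPU with NumPy. The notebook performs the clamp and the cast with PyTorch on the GPU; the helper name `to_image` and the array values below are made up for illustration. Clamping before the cast matters because casting out-of-range floats to `uint8` does not saturate, and casting before the GPU -> CPU copy means only a quarter of the bytes have to be moved.

```python
import numpy as np


def to_image(chw_float):
    """Clamp to [0, 255], cast to uint8, and reorder CHW -> HWC."""
    clamped = np.clip(chw_float, 0.0, 255.0)
    return clamped.astype(np.uint8).transpose(1, 2, 0)


# Made-up 3-channel, 1x2 "network output" with out-of-range values.
out = np.array([[[-5.0, 300.0]],
                [[0.0, 127.9]],
                [[255.0, 64.2]]], dtype=np.float32)
img = to_image(out)
print(img.shape, img.dtype)
print(img[0, 0], img[0, 1])
print("bytes as float32:", out.nbytes, "bytes as uint8:", img.nbytes)
```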

Finally, let us try replacing the serialized camera buffer with an asynchronous one. The idea for this approach is inspired by http://blog.blitzblit.com/2017/12/24/asynchronous-video-capture-in-python-with-opencv/

Here are the tests comparing the serialized and asynchronous frame buffers. The underlying idea is to avoid waiting for the next frame from the camera while the current frame is being processed by the inference pipeline.

# Async Video Capture

In [7]:
from videocapture_async import VideoCaptureAsync

camera = cv2.VideoCapture(0)
time_sync_camera = []
for i in range(10):
    counter = 1000
    start = time.time()
    while counter > 0:
        ret, frame = camera.read()
        counter -= 1
    end = time.time()
    print('Time taken for camera frame buffer =', (end-start)/1000.0)
    time_sync_camera.append((end-start)/1000.0)
camera.release()
print('V1: Mean = {}, StdDev = {}\n'.format(np.mean(time_sync_camera, axis=0), np.std(time_sync_camera, axis=0)))

camera = VideoCaptureAsync(0)
camera.start()
time_async_camera = []
for i in range(10):
    counter = 1000
    start = time.time()
    while counter > 0:
        ret, frame = camera.read()
        counter -= 1
    end = time.time()
    print('Time taken for camera frame buffer =', (end-start)/1000.0)
    time_async_camera.append((end-start)/1000.0)
camera.stop()
print('V2: Mean = {}, StdDev = {}\n'.format(np.mean(time_async_camera, axis=0), np.std(time_async_camera, axis=0)))

Time taken for camera frame buffer = 0.03359849333763123
Time taken for camera frame buffer = 0.03401651501655579
Time taken for camera frame buffer = 0.03358498334884644
Time taken for camera frame buffer = 0.033881309747695924
Time taken for camera frame buffer = 0.0338202064037323
Time taken for camera frame buffer = 0.033706856966018675
Time taken for camera frame buffer = 0.033889217138290406
Time taken for camera frame buffer = 0.0338252854347229
Time taken for camera frame buffer = 0.033578208684921264
Time taken for camera frame buffer = 0.033848626375198365
V1: Mean = 0.03377497024536133, StdDev = 0.0001425837027357211

Time taken for camera frame buffer = 3.291463851928711e-05
Time taken for camera frame buffer = 3.2912015914916994e-05
Time taken for camera frame buffer = 3.490662574768066e-05
Time taken for camera frame buffer = 3.291177749633789e-05
Time taken for camera frame buffer = 3.390955924987793e-05
Time taken for camera frame buffer = 3.291153907775879e-05
Time tak

Before getting excited about the above numbers, please note that we can't increase the frame rate of the camera itself. The synchronous read takes about 33.8 milliseconds because it blocks until the camera delivers the next frame, which is consistent with a camera running at roughly 30 FPS. The whole idea of `VideoCaptureAsync` is to run the camera frame extraction on a separate thread and use those frames whenever the inference pipeline is ready for the next one.

Now let us put all of the above together and run the final inference pipeline.
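
The `videocapture_async` module itself is not shown in this notebook. The following is a minimal sketch of the general technique, a background thread that keeps only the latest frame, using a fake frame source so that it runs without a camera. `LatestFrameReader` and `make_fake_camera` are hypothetical names and not the module's actual implementation; with OpenCV, `read_fn` could be the `read` method of a `cv2.VideoCapture` object.

```python
import threading
import time


class LatestFrameReader:
    """Reads frames on a background thread and keeps only the newest one."""

    def __init__(self, read_fn):
        self._read_fn = read_fn
        self._lock = threading.Lock()
        # Block once so that read() always has a frame to return.
        self._ok, self._frame = read_fn()
        self._running = False
        self._thread = None

    def start(self):
        if not self._running:
            self._running = True
            self._thread = threading.Thread(target=self._update, daemon=True)
            self._thread.start()
        return self

    def _update(self):
        while self._running:
            ok, frame = self._read_fn()  # slow, blocking call
            with self._lock:
                self._ok, self._frame = ok, frame

    def read(self):
        # Returns immediately with whatever frame arrived last.
        with self._lock:
            return self._ok, self._frame

    def stop(self):
        self._running = False
        if self._thread is not None:
            self._thread.join()
            self._thread = None


def make_fake_camera(period_s=0.01):
    """Fake source: each read blocks for period_s and returns a frame counter."""
    counter = 0

    def read_fn():
        nonlocal counter
        time.sleep(period_s)
        counter += 1
        return True, counter

    return read_fn


reader = LatestFrameReader(make_fake_camera(0.01)).start()
t0 = time.perf_counter()
for _ in range(1000):
    ok, frame = reader.read()
elapsed = time.perf_counter() - t0
reader.stop()
print("1000 reads took {:.4f} s, last frame id = {}".format(elapsed, frame))
```

Because `read()` never waits for the source, a consumer that is faster than the camera will see the same frame more than once. This is why the per-read time measured for the asynchronous buffer is not a frame rate.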

## Optimized Inference

In [1]:
# Optimized Pre-processing + Post-Processing + Camera
import torch
import cv2
import numpy as np
from models import *
import time
from videocapture_async import VideoCaptureAsync

torch.cuda.empty_cache()
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.enabled = True

model = TransformerNet().cuda()

def test_final():
    model.eval()
    
    camera = VideoCaptureAsync(0)
    camera.start()
    with torch.no_grad():
        counter = 1000
        start = time.time()
        while counter > 0:
            _, frame = camera.read()
            # Preprocess the frame
            frame = frame.swapaxes(1, 2).swapaxes(0, 1)
            frame = frame[np.newaxis, :, :, :]
            content_image = torch.from_numpy(frame)
            content_image = content_image.cuda()
            content_image = content_image.type(torch.cuda.FloatTensor)
            # Inference
            output = model(content_image).clamp(0, 255).type(torch.cuda.ByteTensor).cpu().detach()[0].numpy().transpose(1,2,0)
            counter -= 1
        torch.cuda.synchronize()
        end = time.time()
        print('Time taken to infer =', (end-start)/1000.0)
    camera.stop()
    return (end-start)/1000.0

# Run 10 experiments
time_final = []
for i in range(10):
    time_final.append(test_final())
time_final = np.asarray(time_final)
print('Mean = {}, StdDev = {}'.format(np.mean(time_final, axis=0), np.std(time_final, axis=0)))

Time taken to infer = 0.046797704935073854
Time taken to infer = 0.046382086038589475
Time taken to infer = 0.04615999031066895
Time taken to infer = 0.046471667289733884
Time taken to infer = 0.04696861958503723
Time taken to infer = 0.04725742411613464
Time taken to infer = 0.047647882223129275
Time taken to infer = 0.05035635542869568
Time taken to infer = 0.0499716694355011
Time taken to infer = 0.049298920392990116
Mean = 0.04773123197555541, StdDev = 0.0014808264557047176


From above, we can see that by optimizing the camera frame buffer (async), the preprocessing and the post-processing, we get a clear speed-up in the overall pipeline (15.7 FPS -> about 21 FPS, based on the mean of 47.7 milliseconds per frame). Here, we have reached the bottleneck of our inference pipeline, i.e., the time taken for per-frame inference. Further speed-ups can be achieved by using [CUDA Streams](https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/) effectively or by modifying the model architecture (playing with different layer combinations, layer fusions, etc.).

Here is a sneak peek of the performance you can achieve by playing with model architectures:

## Changes in Model Architecture

In [1]:
# Optimized Pre-processing + Post-Processing + Camera
import torch
import cv2
import numpy as np
from models import *
from models.private import *
import time
from videocapture_async import VideoCaptureAsync

torch.cuda.empty_cache()
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.enabled = True

model = TransformerNet_v3().cuda()

def test_final():
    model.eval()
    
    camera = VideoCaptureAsync(0)
    camera.start()
    with torch.no_grad():
        counter = 1000
        start = time.time()
        while counter > 0:
            _, frame = camera.read()
            # Preprocess the frame
            frame = frame.swapaxes(1, 2).swapaxes(0, 1)
            frame = frame[np.newaxis, :, :, :]
            content_image = torch.from_numpy(frame)
            content_image = content_image.cuda()
            content_image = content_image.type(torch.cuda.FloatTensor)
            # Inference
            output = model(content_image).clamp(0, 255).type(torch.cuda.ByteTensor).cpu().detach()[0].numpy().transpose(1,2,0)
            counter -= 1
        torch.cuda.synchronize()
        end = time.time()
        print('Time taken to infer =', (end-start)/1000.0)
    camera.stop()
    return (end-start)/1000.0

# Run 10 experiments
time_final = []
for i in range(10):
    time_final.append(test_final())
time_final = np.asarray(time_final)
print('Mean = {}, StdDev = {}'.format(np.mean(time_final, axis=0), np.std(time_final, axis=0)))

Time taken to infer = 0.021966790199279784
Time taken to infer = 0.020345489740371703
Time taken to infer = 0.020589972019195556
Time taken to infer = 0.020874754428863527
Time taken to infer = 0.020950890302658082
Time taken to infer = 0.021357654809951783
Time taken to infer = 0.021392205238342284
Time taken to infer = 0.021567578077316286
Time taken to infer = 0.021640504360198976
Time taken to infer = 0.02186718463897705
Mean = 0.021255302381515503, StdDev = 0.0005161043391833319


Here, we can see that by making certain architectural changes (taking cuDNN optimizations into consideration), we could achieve a performance of 47 FPS!

Happy coding! You can find the summary of my findings at my blog:
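
As a quick cross-check of the frame rates quoted in this notebook, the short sketch below converts the three printed means (copied from the experiment outputs above) into FPS and into a speed-up relative to the raw pipeline. The dictionary labels are just descriptive names chosen here.

```python
# Mean seconds per frame, copied from the three experiment outputs above.
means = {
    "baseline": 0.06346192116737366,
    "optimized pipeline": 0.04773123197555541,
    "optimized pipeline + new architecture": 0.021255302381515503,
}

baseline = means["baseline"]
summary = {}
for name, seconds in means.items():
    fps = 1.0 / seconds           # frames per second
    speedup = baseline / seconds  # relative to the raw pipeline
    summary[name] = (fps, speedup)
    print(f"{name}: {1000 * seconds:.1f} ms/frame, {fps:.1f} FPS, {speedup:.2f}x")
```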
# [Style Transfer -PyTorch]()