# Hardware Accelerated Video Preprocessing for Deep Learning

### Video Preprocessing

Video is a rich source of information for artificial intelligence (AI) applications, and is used in a wide range of industries and domains. Some of the most common use cases for video in AI include:

 - Computer Vision: AI algorithms can analyze video streams to detect and recognize objects, faces, and actions, which has applications in security and surveillance, autonomous vehicles, and video content analysis.

 - Video Analytics: AI algorithms can process video data to extract valuable insights, such as customer behavior and preferences, which can be used in marketing and customer experience management.

 - Video Content Creation: AI can be used to generate or enhance video content, such as creating 3D models, adding special effects, or synthesizing new images and videos.

 - Video Streaming: AI can be used to optimize video delivery, such as by reducing bandwidth consumption, improving quality of experience, or enabling new use cases like virtual reality and augmented reality.

These are just a few examples of the many ways that AI is being applied to video to create new and innovative solutions for a variety of industries and domains.

#### Decoding

Video decoding is the process of converting compressed video data into a displayable format. The compressed bitstream is parsed and reconstructed into individual image frames that can be displayed on a screen. This process involves several steps, including entropy decoding, inverse quantization, and an inverse transform such as the inverse discrete cosine transform (IDCT).
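To make the last two steps more concrete, here is a minimal pure-Python sketch of inverse quantization followed by a 2D inverse DCT on a single block. The 4x4 block size, the single quantization step of 16, and the helper names are illustrative assumptions; real codecs use integer transforms, per-frequency quantization tables, and prediction on top of this.

```python
import math


def dequantize(levels, qstep):
    """Inverse quantization: scale integer levels back to coefficients."""
    return [[level * qstep for level in row] for row in levels]


def idct_2d(coeffs):
    """Orthonormal 2D inverse DCT of a square block of coefficients."""
    n = len(coeffs)

    def basis(k, x):
        # Basis function k evaluated at sample position x
        alpha = math.sqrt((1.0 if k == 0 else 2.0) / n)
        return alpha * math.cos(math.pi * (2 * x + 1) * k / (2 * n))

    return [
        [
            sum(
                coeffs[u][v] * basis(u, y) * basis(v, x)
                for u in range(n)
                for v in range(n)
            )
            for x in range(n)
        ]
        for y in range(n)
    ]


# Hypothetical 4x4 block of quantized levels with only the DC term set
levels = [[8, 0, 0, 0]] + [[0, 0, 0, 0] for _ in range(3)]
block = idct_2d(dequantize(levels, qstep=16))
print([round(p) for p in block[0]])
```

A block that contains only a DC coefficient decodes to a flat patch: here every pixel comes out as 32 (up to floating-point error), because the dequantized coefficient 128 is divided by the block size 4.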

Fortunately, there are ready-to-use solutions for video decoding. The code below first displays the test video and then shows how to decode it in Python with OpenCV.

In [None]:
from IPython.display import Video

video_path = 'data/videos/test.mp4'

Video(video_path)

##### Example of using FFmpeg via OpenCV `VideoCapture`

In [None]:
import cv2
from image_utils import show_images

# Open the video file with OpenCV
video = cv2.VideoCapture(video_path)
if not video.isOpened():
    raise IOError(f'Cannot open video: {video_path}')

print('Backend used to decode the video: ', video.getBackendName())

# Read the first 8 frames
frames = []
for _ in range(8):
    ret, frame = video.read()
    if not ret:
        # End of stream or decoding error
        break
    # OpenCV decodes to BGR; convert to RGB for display
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    frames.append(frame)

show_images(frames)

# Cleanup
video.release()

Connecting the previous example with PyTorch:

In [None]:
import torch

video = cv2.VideoCapture(video_path)
device = torch.device('cuda')

frames = []
for _ in range(8):
    ret, frame = video.read()
    if not ret:
        # End of stream or decoding error
        break
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    # Wrap the NumPy array, copy it to the GPU and convert to float32
    tensor = torch.from_numpy(frame).to(device).float()

    # Normalize each frame to zero mean and unit standard deviation
    mean = torch.mean(tensor)
    std = torch.std(tensor)
    tensor = (tensor - mean) / std

    frames.append(tensor)

video.release()

batch = torch.stack(frames)
print(batch.size())

# Pass batch to model...
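The per-frame normalization in the loop above subtracts the mean and divides by the standard deviation. As a small illustration of that arithmetic, the sketch below does the same on a handful of values using only the standard library. The pixel values and the `standardize` helper are made up for the example; `statistics.stdev` is the sample standard deviation, which matches the default behaviour of `torch.std`.

```python
import statistics


def standardize(values):
    """Shift values to zero mean and scale them to unit sample std."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample std (Bessel's correction)
    return [(v - mean) / std for v in values]


# Hypothetical pixel intensities from one tiny frame
pixels = [0, 64, 128, 192, 255]
normalized = standardize(pixels)
print(normalized)
```

Note that a perfectly uniform frame has zero standard deviation, so a production pipeline would typically guard the division or use fixed dataset-level statistics instead.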



### Hardware accelerated

The NVIDIA Video Codec SDK is a set of tools and APIs that enables hardware-accelerated video decoding on NVIDIA GPUs. It supports a wide range of video codecs, including H.264, HEVC, VP9, and AV1, and allows for efficient decoding of high-resolution and high-bitrate video streams. The hardware decoding capabilities of the NVIDIA Video Codec SDK can significantly reduce CPU usage and power consumption, making it ideal for use in high-performance computing.

#### Decoding

The NVIDIA Video Processing Framework (VPF) is a software library for video processing and video analysis. It provides a set of tools and APIs to perform various video processing tasks, such as video decoding, encoding, transcoding, and video analysis. VPF leverages the power of NVIDIA GPUs to accelerate video processing and analysis tasks, allowing for real-time performance on demanding video workloads. The framework supports a wide range of video codecs and formats, including H.264, HEVC, VP9, and AV1. VPF is designed for use in a variety of applications, including video streaming, media and entertainment, security and surveillance, and autonomous vehicles.

In [None]:
import PyNvCodec as nvc
import numpy as np

# Create a hardware decoder on GPU 0
decoder = nvc.PyNvDecoder(video_path, 0)

# Decode one frame; the resulting surface stays in GPU memory
raw_surface = decoder.DecodeSingleSurface()
print(raw_surface)


Next, we convert the raw surface to RGB and move it to the CPU. First we need to prepare some objects:

In [None]:
width = decoder.Width()
height = decoder.Height()

# To convert pixel format and color space (NV12 -> RGB) on GPU 0
to_rgb = nvc.PySurfaceConverter(
    width,
    height,
    nvc.PixelFormat.NV12,
    nvc.PixelFormat.RGB,
    0)

to_rgb_context = nvc.ColorspaceConversionContext(
    nvc.ColorSpace.BT_709, nvc.ColorRange.JPEG)

# To download to the CPU
to_cpu = nvc.PySurfaceDownloader(width, height, nvc.PixelFormat.RGB, 0)


And use them to process frames:

In [None]:
frames = []
for i in range(8):
    raw_surface = decoder.DecodeSingleSurface()
    if raw_surface.Empty():
        # No more frames could be decoded
        break
    rgb_surface = to_rgb.Execute(raw_surface, to_rgb_context)

    # Allocate a host buffer and download the RGB surface into it
    frame = np.ndarray(shape=(height, width, 3), dtype=np.uint8)
    to_cpu.DownloadSingleSurface(rgb_surface, frame)
    frames.append(frame)

show_images(frames)
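The converter above turns NV12 surfaces into RGB using the BT.709 color space with full (JPEG) range. The following pure-Python sketch shows what that conversion amounts to for a tightly packed buffer with even dimensions. It is only an illustration of the pixel math and the NV12 layout, not VPF's implementation; the function names and the tiny test frame are hypothetical, and the coefficients are the standard BT.709 ones rounded to four decimals.

```python
def yuv_to_rgb_bt709_full(y, u, v):
    """Convert one full-range BT.709 YUV sample to 8-bit RGB."""
    d, e = u - 128, v - 128
    r = y + 1.5748 * e
    g = y - 0.1873 * d - 0.4681 * e
    b = y + 1.8556 * d

    def clamp(c):
        return max(0, min(255, round(c)))

    return clamp(r), clamp(g), clamp(b)


def nv12_to_rgb(data, width, height):
    """Convert a tightly packed NV12 buffer to rows of (R, G, B) tuples.

    NV12 stores a full-resolution Y plane followed by one interleaved
    U/V plane at half resolution in both directions.
    """
    uv_offset = width * height
    rows = []
    for row in range(height):
        out = []
        for col in range(width):
            y = data[row * width + col]
            # Each 2x2 group of pixels shares one (U, V) pair
            uv = uv_offset + (row // 2) * width + (col // 2) * 2
            out.append(yuv_to_rgb_bt709_full(y, data[uv], data[uv + 1]))
        rows.append(out)
    return rows


# Hypothetical 2x2 frame: four luma samples sharing one neutral U/V pair
nv12_frame = bytes([0, 85, 170, 255, 128, 128])
rgb = nv12_to_rgb(nv12_frame, width=2, height=2)
print(rgb)
```

With neutral chroma (U = V = 128) every pixel is a shade of gray equal to its luma value, which is a quick sanity check for the coefficients.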

VPF also integrates with PyTorch, so a decoded surface can be exposed as a CUDA tensor without leaving the GPU:

In [None]:
import PytorchNvCodec as pnvc


def surface_to_tensor(surface: nvc.Surface) -> torch.Tensor:
    surf_plane = surface.PlanePtr()
    img_tensor = pnvc.DptrToTensor(
        surf_plane.GpuMem(),
        surf_plane.Width(),
        surf_plane.Height(),
        surf_plane.Pitch(),
        surf_plane.ElemSize(),
    )
    if img_tensor is None:
        raise RuntimeError("Can not export to tensor.")

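    # Packed RGB rows hold width * 3 bytes, so the data is laid out as HWC
    img_tensor.resize_(surf_plane.Height(), surf_plane.Width() // 3, 3)
    # Reorder to CHW and convert to float32 (the tensor already lives on the GPU)
    img_tensor = img_tensor.permute(2, 0, 1).float()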

    return img_tensor

In [None]:
for i in range(8):
    raw_surface = decoder.DecodeSingleSurface()
    rgb_surface = to_rgb.Execute(raw_surface, to_rgb_context)

    tensor = surface_to_tensor(rgb_surface)
    
    print(tensor.type(), tensor.shape)
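`surface_to_tensor` passes both the plane width and its pitch to `DptrToTensor`. GPU surfaces are commonly padded so that every row starts at an aligned address, and the pitch is the distance in bytes between the starts of consecutive rows. The sketch below illustrates, in plain Python, how a pitched buffer of interleaved RGB pixels relates to the packed HWC layout and to the planar CHW layout that many models expect. The buffer contents and helper names are hypothetical, and this is not how VPF implements the copy.

```python
def unpitch(buffer, row_bytes, height, pitch):
    """Drop the padding at the end of each row of a pitched buffer."""
    return b"".join(
        buffer[r * pitch : r * pitch + row_bytes] for r in range(height)
    )


def hwc_to_chw(packed, width, height, channels=3):
    """Split interleaved pixels (HWC) into one plane per channel (CHW)."""
    return [
        [
            [packed[(r * width + c) * channels + ch] for c in range(width)]
            for r in range(height)
        ]
        for ch in range(channels)
    ]


# Hypothetical 2x1 RGB image whose single row is padded to a pitch of 8 bytes
pitched = bytes([10, 20, 30, 40, 50, 60, 0, 0])
packed = unpitch(pitched, row_bytes=6, height=1, pitch=8)
planes = hwc_to_chw(packed, width=2, height=1)
print(planes)
```

The two pixels (10, 20, 30) and (40, 50, 60) end up as three planes, one per channel, each holding one value per pixel.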

#### Benchmark

In [None]:
from utils.torch_utils import time_synchronized

video = cv2.VideoCapture(video_path)
num_frames = 50  # the video must contain at least this many frames

device = torch.device('cuda')

t0 = time_synchronized()
for _ in range(num_frames):
    ret, frame = video.read()
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    tensor = torch.from_numpy(frame)
    tensor = tensor.to(device)
    tensor = tensor.float()

duration = time_synchronized() - t0
print(f'cv2: {num_frames/duration:.3f} fps')

video.release()

In [None]:
decoder = nvc.PyNvDecoder(video_path, 0)

t0 = time_synchronized()
for _ in range(num_frames):
    raw_surface = decoder.DecodeSingleSurface()
    rgb_surface = to_rgb.Execute(raw_surface, to_rgb_context)

    tensor = surface_to_tensor(rgb_surface)

duration = time_synchronized() - t0
print(f'vpf: {num_frames/duration:.3f} fps')
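Both benchmarks rely on `time_synchronized`, which is imported from `utils.torch_utils` and not shown here. CUDA work is launched asynchronously, so a fair measurement has to wait for the GPU to finish before reading the clock. The sketch below shows the same idea as a small reusable helper; `measure_fps` and its `synchronize` parameter are hypothetical names, and in the benchmarks above one would plausibly pass `torch.cuda.synchronize`.

```python
import time


def measure_fps(process_frame, num_frames, synchronize=None):
    """Return frames per second for `num_frames` calls of `process_frame`.

    `synchronize` is an optional callable that blocks until pending
    asynchronous work has finished (for example `torch.cuda.synchronize`).
    """
    if synchronize is not None:
        synchronize()
    start = time.perf_counter()
    for _ in range(num_frames):
        process_frame()
    if synchronize is not None:
        synchronize()
    duration = time.perf_counter() - start
    return num_frames / duration


# Usage with a stand-in workload; a real benchmark would decode a frame here
calls = []
fps = measure_fps(lambda: calls.append(sum(range(1000))), num_frames=50)
print(f'{fps:.1f} fps over {len(calls)} frames')
```

Measuring a few warm-up iterations separately is also common, since the first calls often include one-time initialization costs.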

#### Processing

NVIDIA DALI (Data Loading Library) is a high-performance library for data loading and pre-processing. It provides a simple and efficient interface for executing complex data processing and augmentation pipelines on GPUs, which helps keep the GPU fed with data when training deep learning models. DALI offers a wide range of operations, including image and video decoding, cropping, resizing, flipping, rotation, and color manipulation, as well as support for custom operators and pipeline parallelism. Additionally, DALI integrates with popular deep learning frameworks such as TensorFlow and PyTorch, making it easy to incorporate into existing workflows.

In [None]:
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn

device_id = 0

# Define DALI pipeline
@pipeline_def(batch_size=1, num_threads=2, device_id=device_id)
def video_pipeline():
    video = fn.readers.video(
        device="gpu",
        filenames=[video_path],
        # step=2,
        sequence_length=8)
    
    # More operations and augmentations can be placed here

    return video

In [None]:
# Create pipeline object
pipeline =  video_pipeline()
print(pipeline)

# Build the pipeline. Operators are instantiated at this stage
pipeline.build()


In [None]:
# Run pipeline to get outputs
frames =  pipeline.run()

print(type(frames[0]))

# Move frames to the CPU and display them
host_frames =  frames[0].as_cpu().as_array()

show_images(host_frames[0])
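The video reader returns sequences of frames rather than single frames, so each sample has the layout (frames, height, width, channels). The sketch below illustrates how a video could be cut into sequences, assuming the semantics described in the DALI documentation: `step` is the distance between the first frames of consecutive sequences and defaults to `sequence_length`, and `stride` is the distance between frames within one sequence. The helper itself is hypothetical and only mimics the indexing, not the reader.

```python
def sequence_frame_indices(num_frames, sequence_length, step=None, stride=1):
    """List the frame indices that make up each sequence of a video."""
    if step is None:
        # Assumed default: consecutive sequences do not overlap
        step = sequence_length
    span = (sequence_length - 1) * stride + 1  # frames covered by one sequence
    return [
        [start + k * stride for k in range(sequence_length)]
        for start in range(0, num_frames - span + 1, step)
    ]


# Hypothetical 20-frame video split into sequences of 8 frames
print(sequence_frame_indices(20, 8))
print(sequence_frame_indices(20, 8, step=2)[:2])
```

Under these assumptions, uncommenting `step=2` in the pipeline above would produce overlapping sequences that start every second frame, while the default yields back-to-back sequences and drops the incomplete tail.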

### In Deep Learning

#### JAX

JAX is a library from Google for high-performance numerical computing and machine learning research. It exposes a NumPy-compatible API and combines it with hardware-accelerated GPU and TPU computing, allowing for fast and efficient computation on large arrays and matrices. JAX also includes support for automatic differentiation, which enables gradient-based optimization of machine learning models. Additionally, JAX can exchange arrays with other libraries, such as PyTorch and CuPy, through the DLPack protocol, which is how the example below moves DALI output into JAX. JAX is designed to be flexible, allowing users to write code that runs on CPUs, GPUs, and TPUs with no code changes.

In [None]:
import jax.dlpack
import cupy

frames =  pipeline.run()

def dali_tensor_to_jax(dali_tensor):
    return jax.dlpack.from_dlpack(cupy.asarray(dali_tensor).toDlpack())

# Move DALI output to JAX array object
jax_frames =  dali_tensor_to_jax(frames[0][0])

print(jax_frames.shape)
print(jax_frames.sharding)

#### MNIST

The MNIST dataset is a popular benchmark dataset in machine learning, consisting of 70,000 grayscale images of handwritten digits from 0 to 9, each with a resolution of 28x28 pixels. It is widely used for image classification tasks and has been used to train and evaluate various machine learning algorithms, including deep neural networks. The dataset is often used as a starting point for beginners to practice and explore different machine learning techniques, due to its simplicity and availability. The MNIST dataset has played a significant role in advancing the field of computer vision.

The test video contains upscaled MNIST digits.

In [None]:
Video('data/videos/session_number.mp4')

In this example we will train a simple model in JAX to recognize MNIST digits. We will use a DALI pipeline to read and preprocess the MNIST dataset and feed it to the model during training. Later we will write another pipeline to decode the test video and run inference with the trained model to recognize the digits in the frames.

First, the code below defines a DALI pipeline for reading and preprocessing the MNIST dataset. The pipeline reads the images and labels from the MNIST training directory, decodes the images as grayscale, and one-hot encodes the labels. The pipeline uses a batch size of 128, runs on 2 threads, and uses the GPU with ID 0. The output of the pipeline is a tuple of preprocessed images and their corresponding one-hot encoded labels, which can be used for training a machine learning model.

In [None]:
import nvidia.dali.types as types

batch_size = 128
image_size = 28

@pipeline_def(batch_size=batch_size, num_threads=2, device_id=0)
def mnist_pipeline():
    jpegs, labels = fn.readers.caffe2(path='data/MNIST/training/', random_shuffle=True)
    images = fn.decoders.image(
        jpegs, device='mixed', output_type=types.GRAY)
    labels = labels.gpu()
    labels = fn.one_hot(labels, num_classes=10)

    return images, labels

Next, we instantiate and build the pipeline.

In [None]:
pipeline = mnist_pipeline()
pipeline.build()

We can run the pipeline and visualize the outputs to check that they are correct.

In [None]:
from image_utils import show_images_greyscale

images, labels =  pipeline.run()

host_frames =  images.as_cpu().as_array()
show_images_greyscale(host_frames[0:5])

The code below uses the DALI pipeline created above to train a simple JAX model to recognize MNIST digits. We display the current accuracy on the training data every 500 iterations to visualize training progress.

In [None]:
from jax import numpy as jnp
from mnist import init_params, update, accuracy

# Init model
params = init_params(scale=0.1, layer_sizes=[784, 1024, 1024, 10])

# Train for 3000 iterations
for i in range(3000):
    images, labels = pipeline.run()

    images = dali_tensor_to_jax(images.as_tensor()).reshape((batch_size, image_size * image_size))
    labels = dali_tensor_to_jax(labels.as_tensor())

    params = update(params, images, labels)

    if (i % 500) == 0:
        acc = accuracy(params, images, labels)
        print(f'Accuracy at iteration {i}: {acc}')
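The `mnist` helper module used in the training loop is not shown here. As an illustration of how accuracy can be computed when labels are one-hot encoded, as they are in the DALI pipeline, here is a minimal pure-Python sketch. The function names and the example scores are hypothetical and are not taken from that module.

```python
def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot list."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


def one_hot_accuracy(scores, one_hot_labels):
    """Fraction of samples whose top-scoring class matches the label."""
    hits = sum(
        argmax(s) == argmax(t) for s, t in zip(scores, one_hot_labels)
    )
    return hits / len(scores)


# Hypothetical scores for three samples over four classes
scores = [[0.1, 0.7, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1], [0.2, 0.2, 0.5, 0.1]]
targets = [one_hot(1, 4), one_hot(3, 4), one_hot(2, 4)]
print(one_hot_accuracy(scores, targets))
```

Two of the three samples are classified correctly, so the result is 2/3. In JAX the same comparison is usually written with `jnp.argmax` over the class axis on both the predictions and the labels.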

The last code sample in this example defines another DALI pipeline that is used for inference. This pipeline uses hardware-accelerated video decoding to get the frames and preprocesses them to match the expected model input. The frames are then passed to the model and we print the predicted classes.

In [None]:
from mnist import predict

@pipeline_def(batch_size= 1, num_threads= 2, device_id= 0)
def video_pipeline():
    video =  fn.readers.video(
        device= "gpu",
        filenames= ['data/videos/session_number.mp4'],
        sequence_length= 5)
    video = fn.color_space_conversion(video, image_type=types.RGB, output_type=types.GRAY)
    video = fn.resize(video, resize_x = 28, resize_y = 28)

    return video

inference_pipeline = video_pipeline()
inference_pipeline.build()

for i in range(5):
    frames = inference_pipeline.run()

    frames = dali_tensor_to_jax(frames[0].as_tensor()).reshape((5, 28*28))
    predicted_class = jnp.argmax(predict(params, frames), axis=1)
    print(predicted_class)
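The inference pipeline converts RGB frames to grayscale and the loop then flattens each 28x28 frame into a vector of 784 values. The sketch below illustrates both steps on a tiny image. It assumes the widely used BT.601 luma weights (0.299, 0.587, 0.114); the exact coefficients used by DALI's `fn.color_space_conversion` should be checked in its documentation, and the helper names are hypothetical.

```python
def rgb_to_gray(r, g, b):
    """Weighted sum of the channels using BT.601 luma weights (assumed)."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def flatten(image):
    """Flatten a 2D image (list of rows) into a single feature vector."""
    return [pixel for row in image for pixel in row]


# Hypothetical 2x2 RGB frame: white, black, red, blue
rgb_frame = [[(255, 255, 255), (0, 0, 0)], [(255, 0, 0), (0, 0, 255)]]
gray = [[rgb_to_gray(*px) for px in row] for row in rgb_frame]
print(flatten(gray))
```

Green contributes most to perceived brightness and blue least, which is why pure blue maps to a much darker gray than pure red.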

#### YOLO v7

YOLO (You Only Look Once) is a popular object detection model for real-time object detection and classification. YOLO divides an input image into a grid of cells, and uses a neural network to predict the presence and location of objects within each cell. The YOLO model is fast and efficient, making it well-suited for real-time object detection in video streams and other applications.

YOLO v7 is a version of the YOLO model that introduces several new features and improvements over previous versions. Some of the key features of YOLO v7 include improved accuracy, multi-scale training, and an improved architecture for detecting smaller objects. YOLO v7 also includes support for the latest advances in deep learning and computer vision, making it a powerful tool for object detection and classification tasks.

In [None]:
Video('data/videos/test2.mp4')

In [None]:
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types

device_id = 0

@pipeline_def(batch_size=1, num_threads=2, device_id=device_id)
def video_pipeline():
    raw =  fn.readers.video(
        name="reader",
        device="gpu",
        filenames=['data/videos/test2.mp4'],
        sequence_length=8)
    video = fn.resize(raw, resize_x=640, resize_y=360)
    video = fn.crop(
        video,
        crop_h=360 + 24,
        crop_w=640,
        out_of_bounds_policy='pad',
        fill_values=114.)
    raw = fn.color_space_conversion(
        raw, image_type=types.RGB, output_type=types.BGR)

    return video, raw


pipeline =  video_pipeline()
pipeline.build()

frames =  pipeline.run()
host_frames =  frames[0].as_cpu().as_array()

show_images(host_frames[0])
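The pipeline above resizes each frame to 640x360 and then crops it to 640x384 with out-of-bounds padding, which is a form of letterboxing: the image keeps its aspect ratio and the missing rows are filled with the gray value 114. The sketch below shows how such numbers can be derived, assuming a 1280x720 source video and a network stride of 32; neither value is stated in this notebook, and the helper is hypothetical.

```python
import math


def letterbox_geometry(src_w, src_h, dst_w, stride=32):
    """Resize to `dst_w` keeping aspect ratio, then pad the height up to
    the next multiple of `stride`."""
    scale = dst_w / src_w
    new_h = round(src_h * scale)
    padded_h = math.ceil(new_h / stride) * stride
    pad_total = padded_h - new_h
    pad_top = pad_total // 2
    return {
        'resized': (dst_w, new_h),
        'padded': (dst_w, padded_h),
        'pad_top': pad_top,
        'pad_bottom': pad_total - pad_top,
    }


# Assumed 1280x720 source video (the actual resolution is not stated)
geometry = letterbox_geometry(1280, 720, dst_w=640)
print(geometry)
```

Under these assumptions the frame gets 12 rows of padding at the top and 12 at the bottom, which corresponds to `crop_h=360 + 24` with a centered crop window.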

In [None]:
from nvidia.dali.plugin.pytorch import DALIGenericIterator

pipeline = video_pipeline()
pipeline.build()

dali_pytorch_iterator = DALIGenericIterator([pipeline], ['frames', 'raw'], reader_name="reader")

for i, data in enumerate(dali_pytorch_iterator):
    # The iterator yields one dict per pipeline; index 0 is our only pipeline
    print(data[0]['frames'].shape)

##### YOLOv7 based object detection:

In [None]:
from detect import detect

dali_pytorch_iterator.reset()

detect(dali_pytorch_iterator, 'output.mp4')

In [None]:
Video('output.mp4')