First, let's load the JSON file that describes the human pose task.  It is in COCO format: it is the category descriptor pulled from the annotations file, modified slightly to add a neck keypoint.  We will use this task description to create a topology tensor, an intermediate data structure that describes the part linkages and which channels in the part affinity field each linkage corresponds to.

In [1]:
import json, sys

sys.path.append("../trt_pose/trt_pose")
import trt_pose.coco

sys.path.append("../jetcam/jetcam")
from jetcam.usb_camera import USBCamera
# from jetcam.csi_camera import CSICamera

import torchvision.transforms as transforms
import torch
import torch2trt
import trt_pose.models

from trt_pose.draw_objects import DrawObjects
from trt_pose.parse_objects import ParseObjects

import PIL.Image
from torch2trt import TRTModule
import cv2, time

import ipywidgets
from IPython.display import display, clear_output

In [2]:
convert_model_to_tensorrt = False  # set to True to (re)build and save the TensorRT model
display_in_jupyter = True
test_fps = False

# Display widget; created below only when display_in_jupyter is True.
image_w = None

WIDTH = 224
HEIGHT = 224

In [3]:
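def bgr8_to_jpeg_bytes(value, quality=75):
    # Encode a BGR8 image as JPEG bytes, honoring the requested quality (0-100).
    return bytes(cv2.imencode('.jpg', value, [cv2.IMWRITE_JPEG_QUALITY, quality])[1])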

In [4]:
camera = cv2.VideoCapture(0, cv2.CAP_V4L2)
camera_width = camera.get(cv2.CAP_PROP_FRAME_WIDTH)
camera_height = camera.get(cv2.CAP_PROP_FRAME_HEIGHT)
print("camera default width/height", camera_width, camera_height)

camera_width, camera_height = 640, 480
camera.set(cv2.CAP_PROP_FRAME_WIDTH, camera_width)
camera.set(cv2.CAP_PROP_FRAME_HEIGHT, camera_height)

if display_in_jupyter:
    image_w = ipywidgets.Image(format='jpeg')
    display(image_w)

while not camera.isOpened():
    time.sleep(0.1)  # wait briefly instead of busy-looping with a zero-second sleep
start_time = time.time()
frames = 0
for i in range(50):
    ret, image = camera.read()
    if not ret:  # skip frames that failed to capture
        continue
    frames += 1
    if display_in_jupyter:
        image_w.value = bgr8_to_jpeg_bytes(image)
    else:
        cv2.imshow("Video Frame", image)
        key = cv2.waitKey(1)
        if key == 27:  # ESC key
            cv2.destroyAllWindows()
            break
# Divide by the frames actually read, so an early exit does not inflate the result.
print("fps without processing", frames / (time.time() - start_time))

camera.release()
del camera
clear_output()

In [5]:
with open('human_pose.json', 'r') as f:
    human_pose = json.load(f)

topology = trt_pose.coco.coco_category_to_topology(human_pose)
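
To make the topology structure more concrete, here is a minimal pure-Python sketch of the kind of table such a conversion could produce.  It assumes that each row has the layout `[paf_x_channel, paf_y_channel, source_part, sink_part]` and that the `skeleton` entries are 1-based keypoint indices.  The function name `category_to_topology_rows` and the toy category are hypothetical; the library function used above returns a tensor, not a list.

```python
def category_to_topology_rows(category):
    """Build one row per skeleton link (illustrative layout, see the assumptions above)."""
    rows = []
    for k, (source, sink) in enumerate(category["skeleton"]):
        # Assumption: link k uses part affinity field channels 2k (x) and 2k + 1 (y).
        # Assumption: skeleton indices are 1-based, so shift them to 0-based.
        rows.append([2 * k, 2 * k + 1, source - 1, sink - 1])
    return rows


# Toy category with three keypoints and two links (not the real human_pose.json).
toy_category = {
    "keypoints": ["nose", "neck", "left_shoulder"],
    "skeleton": [[1, 2], [2, 3]],
}

toy_topology = category_to_topology_rows(toy_category)
print(toy_topology)
```

Each row ties one link to its two field channels and to the two parts it connects, which is the information the parsing step needs later on.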

Next, we'll load our model.  Each model takes at least two parameters, *cmap_channels* and *paf_channels*, corresponding to the number of heatmap channels
and part affinity field channels.  The number of part affinity field channels is twice the number of links, because each link has one channel for the
x component and one for the y component of its vector field.

In [6]:
num_parts = len(human_pose['keypoints'])
num_links = len(human_pose['skeleton'])

if convert_model_to_tensorrt:
    model = trt_pose.models.resnet18_baseline_att(num_parts, 2 * num_links).cuda().eval()
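
As a quick sanity check of the channel arithmetic described above, the following sketch computes both counts for a toy category.  The helper name `channel_counts` and the toy data are hypothetical and only illustrate the relationship between parts, links, and channels.

```python
def channel_counts(category):
    """Return (cmap_channels, paf_channels) for a COCO-style category dict."""
    num_parts = len(category["keypoints"])   # one heatmap channel per keypoint
    num_links = len(category["skeleton"])    # one link per skeleton entry
    return num_parts, 2 * num_links          # x and y channel for every link


# Toy category: four keypoints joined by three links (not the real human_pose.json).
toy_pose = {
    "keypoints": ["nose", "neck", "left_shoulder", "right_shoulder"],
    "skeleton": [[1, 2], [2, 3], [2, 4]],
}

cmap_channels, paf_channels = channel_counts(toy_pose)
print(cmap_channels, paf_channels)
```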

Next, let's load the model weights.  You will need to download these according to the table in the README.

In [7]:
if convert_model_to_tensorrt:
    MODEL_WEIGHTS = './models/resnet18_baseline_att_224x224_A_epoch_249.pth'
    model.load_state_dict(torch.load(MODEL_WEIGHTS))

To optimize with TensorRT using the Python library *torch2trt*, we'll also need to create some example data.  The dimensions
of this data should match the dimensions that the network was trained with.  Since we're using the resnet18 variant that was trained on
an input resolution of 224x224, we set the width and height to these dimensions.

In [8]:
data = torch.zeros((1, 3, HEIGHT, WIDTH)).cuda()

Next, we'll use [torch2trt](https://github.com/NVIDIA-AI-IOT/torch2trt) to optimize the model.  We'll enable fp16_mode to allow optimizations to use reduced (half) precision.

The optimized model may be saved so that we do not need to perform the optimization again; we can just load the saved model.  Please note that TensorRT applies device-specific optimizations, so an optimized model can only be used on similar platforms.

In [9]:
OPTIMIZED_MODEL = './models/resnet18_baseline_att_224x224_A_epoch_249_trt.pth'

if convert_model_to_tensorrt:
    model_trt = torch2trt.torch2trt(model, [data], fp16_mode=True, max_workspace_size=1<<25)
    torch.save(model_trt.state_dict(), OPTIMIZED_MODEL)
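
The `max_workspace_size=1<<25` argument is written as a bit shift, which can be hard to read at a glance.  Assuming the value is interpreted as a number of bytes, the short calculation below shows what it amounts to; the variable names are only illustrative.

```python
# 1 << 25 shifts a single bit 25 places to the left, i.e. 2 ** 25.
max_workspace_size = 1 << 25

# Convert to mebibytes (1 MiB = 1024 * 1024 bytes), assuming the unit is bytes.
workspace_mib = max_workspace_size / (1024 * 1024)

print(f"{max_workspace_size} bytes = {workspace_mib:.0f} MiB")
```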

We can then load the saved model using *torch2trt* as follows.

In [10]:
model_trt = TRTModule()
model_trt.load_state_dict(torch.load(OPTIMIZED_MODEL))

<All keys matched successfully>
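
This notebook switches between converting and loading with the manual flag `convert_model_to_tensorrt`.  One possible alternative, sketched below, is to decide based on whether the optimized file already exists.  The helper `should_convert` is hypothetical and not part of torch2trt.  Because optimized models are device specific, a file copied from a different platform should still be rebuilt, which is what the `force` argument is for.

```python
import os


def should_convert(optimized_path, force=False):
    """Return True when the optimized model has to be (re)built.

    force=True covers cases where a file exists but was produced on a
    different device and therefore should not be reused.
    """
    return force or not os.path.exists(optimized_path)


# A path that does not exist always requires a conversion.
print(should_convert("./models/this_file_does_not_exist_trt.pth"))
```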

We can benchmark the model's throughput in frames per second (FPS) with the following code.

In [11]:
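if test_fps:
    # Wait for pending GPU work first so that it is not counted in the timing.
    torch.cuda.current_stream().synchronize()
    t0 = time.time()
    for i in range(50):
        y = model_trt(data)
    # Wait for all 50 inferences to finish before stopping the clock.
    torch.cuda.current_stream().synchronize()
    t1 = time.time()

    print(50.0 / (t1 - t0))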
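
The same timing pattern can be wrapped in a small reusable helper.  The sketch below is a generic, pure-Python version: `measure_fps`, its `warmup` parameter, and its optional `sync` callback are hypothetical names introduced for illustration, and the warm-up runs are an addition that the cell above does not perform.

```python
import time


def measure_fps(step, iterations=50, warmup=5, sync=None):
    """Call step() repeatedly and return the measured iterations per second."""
    for _ in range(warmup):
        step()                      # untimed runs to absorb one-off startup costs
    if sync is not None:
        sync()                      # e.g. wait for queued asynchronous work
    t0 = time.perf_counter()
    for _ in range(iterations):
        step()
    if sync is not None:
        sync()                      # make sure all timed work has finished
    elapsed = time.perf_counter() - t0
    return iterations / max(elapsed, 1e-9)  # guard against a zero interval


# Stand-in workload so the example runs without a GPU or a model.
fps = measure_fps(lambda: sum(range(1000)))
print(f"{fps:.1f} iterations per second")
```

In this notebook, `step` could be a function that runs `model_trt(data)`, and `sync` could be the stream synchronization call used in the cell above.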

Next, let's define a function that will preprocess the image, which is originally in BGR8 / HWC format.

In [12]:
mean = torch.Tensor([0.485, 0.456, 0.406]).cuda()
std = torch.Tensor([0.229, 0.224, 0.225]).cuda()
device = torch.device('cuda')

def preprocess(image):
    # Convert a BGR8 / HWC image into a normalized RGB tensor of shape (1, 3, H, W) on the GPU.
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image = PIL.Image.fromarray(image)
    image = transforms.functional.to_tensor(image).to(device)
    image.sub_(mean[:, None, None]).div_(std[:, None, None])
    return image[None, ...]
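
To illustrate the arithmetic that the preprocessing applies, here is a pure-Python sketch for a single pixel.  It assumes 8-bit input values that are first scaled to the range 0 to 1 and then normalized per channel with the same mean and standard deviation values as above.  The helper `normalize_bgr_pixel` is hypothetical and exists only for this illustration.

```python
# Per-channel statistics in RGB order (same values as in the cell above).
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_bgr_pixel(bgr):
    """Reorder one 8-bit BGR pixel to RGB, scale to [0, 1], then normalize."""
    rgb = bgr[::-1]  # BGR -> RGB
    return [((c / 255.0) - m) / s for c, m, s in zip(rgb, MEAN, STD)]


# A pure blue pixel in BGR order: blue = 255, green = 0, red = 0.
normalized = normalize_bgr_pixel((255, 0, 0))
print(normalized)
```

After normalization the red and green values are negative (below the channel mean) and the blue value is positive, which is the value range the network expects rather than raw 0 to 255 intensities.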

Next, we'll define two callable classes that will be used to parse the objects from the neural network, as well as draw the parsed objects on an image.

In [13]:
parse_objects = ParseObjects(topology)
draw_objects = DrawObjects(topology)

Assuming you're using NVIDIA Jetson, you can use the [jetcam](https://github.com/NVIDIA-AI-IOT/jetcam) package to create an easy-to-use camera that will produce images in BGR8/HWC format.

If you're not on Jetson, you may need to adapt the code below.

Next, we'll create the camera.  The widget used to display the camera feed with visualizations was already created in an earlier cell (when *display_in_jupyter* is enabled).

In [14]:
camera = USBCamera(width=WIDTH, height=HEIGHT, capture_fps=30)
# camera = CSICamera(width=WIDTH, height=HEIGHT, capture_fps=30)
camera.running = True

Finally, we'll define the main execution loop.  This will perform the following steps:

1.  Preprocess the camera image
2.  Execute the neural network
3.  Parse the objects from the neural network output
4.  Draw the objects onto the camera image
5.  Convert the image to JPEG format and stream to the display widget

In [15]:
def execute(change):
    global image_w
    image = change['new']
    data = preprocess(image)
    cmap, paf = model_trt(data)
    cmap, paf = cmap.detach().cpu(), paf.detach().cpu()
    counts, objects, peaks = parse_objects(cmap, paf)#, cmap_threshold=0.15, link_threshold=0.15)
    draw_objects(image, counts, objects, peaks)
    if display_in_jupyter:
        image_w.value = bgr8_to_jpeg_bytes(image[:, ::-1, :])
    else:
#         image = cv2.imencode('.jpg', image)[1]
#         resized_image = cv2.resize(image, (camera_width, camera_height)) 
        cv2.imshow("Video Frame", image)
        key=cv2.waitKey(1)
        if key == 27: # Check for ESC key
            cv2.destroyAllWindows()
            return

If we run the cell below, it will execute the function once on the current camera frame.

In [16]:
execute({'new': camera.value})

Run the cell below to attach the execution function to the camera's *value* attribute.  This will cause the execute function to be called whenever a new camera frame is received.

In [17]:
camera.observe(execute, names='value')
print("start")

start
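
The callback receives a dictionary whose `'new'` entry holds the latest frame, which is why the single-shot call earlier passed `{'new': camera.value}`.  The sketch below imitates this observer mechanism with a hypothetical stand-in class, `FakeCamera`.  It is not the jetcam implementation; it only shows how attaching and detaching callbacks behaves.

```python
class FakeCamera:
    """Hypothetical stand-in for a camera object that notifies observers."""

    def __init__(self):
        self._value = None
        self._callbacks = []

    def observe(self, callback, names="value"):
        # 'names' is accepted for symmetry with the notebook call but ignored here.
        self._callbacks.append(callback)

    def unobserve_all(self):
        self._callbacks.clear()

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, frame):
        old, self._value = self._value, frame
        # Notify every attached callback with a change dictionary.
        for callback in list(self._callbacks):
            callback({"name": "value", "old": old, "new": frame})


received = []
fake_camera = FakeCamera()
fake_camera.observe(lambda change: received.append(change["new"]), names="value")

fake_camera.value = "frame-1"   # triggers the callback
fake_camera.value = "frame-2"   # triggers the callback again
fake_camera.unobserve_all()
fake_camera.value = "frame-3"   # no callback attached any more

print(received)
```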


In [18]:
if display_in_jupyter:
    display(image_w)

Image(value=b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xdb\x00C\x00\x02\x01\x0…

Run the cell below to detach the camera frame callbacks and close the camera.  It waits for keyboard input before shutting down.

In [19]:
read_key = input()
# print("read_key", read_key)
# if read_key == "q":
camera.unobserve_all()
camera.close()
exit(0)

 
