# 3D Object Detection on a Non-Waymo Dataset Using KerasCV

**Author:** [Usha Rengaraju](https://www.linkedin.com/in/usha-rengaraju-b570b7a2/)<br>
**Date created:** 2023/07/10<br>
**Last modified:** 2023/07/10<br>
**Description:** 3D object detection on a non-Waymo dataset using KerasCV.

## Overview

3D object detection is the process of identifying, classifying, and localizing objects in 3D space. It is a relatively new field and is more challenging than traditional 2D object detection. 3D object detection is used in several domains, including security and autonomous driving.

![](https://mscvprojects.ri.cmu.edu/2020teamf/wp-content/uploads/sites/36/2020/05/cropped-perception.png)

KerasCV makes it easy to use the CenterPillar model for 3D object detection on the dataset of your choice. This guide shows how to prepare a 3D dataset, build the CenterPillar model, and train it on your data.

## Imports & setup

This tutorial requires you to have KerasCV installed:

```shell
pip install keras-cv
```

We begin by importing all required packages:

In [None]:
import glob

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tqdm import tqdm

import keras_cv
from keras_cv.callbacks import WaymoEvaluationCallback
from keras_cv.datasets.waymo import load
from keras_cv.datasets.waymo import transformer
from keras_cv.layers import CenterNetLabelEncoder
from keras_cv.layers import DynamicVoxelization
from keras_cv.layers.object_detection_3d import voxel_utils
from keras_cv.models.__internal__.unet import UNet
from keras_cv.models.object_detection_3d.center_pillar import MultiClassDetectionHead
from keras_cv.models.object_detection_3d.center_pillar import MultiClassHeatmapDecoder
from keras_cv.models.object_detection_3d.center_pillar import MultiHeadCenterPillar

## Data loading

This guide uses the
[KITTI 3D object detection dataset](https://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d)
for demonstration purposes.

To get started, we first download and unzip the dataset:

In [None]:
!wget https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_velodyne.zip

--2023-06-08 15:54:07--  https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_velodyne.zip
Resolving s3.eu-central-1.amazonaws.com (s3.eu-central-1.amazonaws.com)... 52.219.75.63
Connecting to s3.eu-central-1.amazonaws.com (s3.eu-central-1.amazonaws.com)|52.219.75.63|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28750710812 (27G) [application/zip]
Saving to: ‘data_object_velodyne.zip’


2023-06-08 16:10:38 (27.7 MB/s) - ‘data_object_velodyne.zip’ saved [28750710812/28750710812]



In [None]:
!wget https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_label_2.zip

--2023-06-08 16:10:39--  https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_label_2.zip
Resolving s3.eu-central-1.amazonaws.com (s3.eu-central-1.amazonaws.com)... 52.219.169.145
Connecting to s3.eu-central-1.amazonaws.com (s3.eu-central-1.amazonaws.com)|52.219.169.145|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5601213 (5.3M) [application/zip]
Saving to: ‘data_object_label_2.zip’


2023-06-08 16:10:40 (5.91 MB/s) - ‘data_object_label_2.zip’ saved [5601213/5601213]



In [None]:
!unzip -q /content/data_object_label_2.zip -d data/

In [None]:
!unzip -q /content/data_object_velodyne.zip -d data/

In [None]:
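# Class IDs start at 1: the predictions at the end of this guide use classes
# 1 through 8, and 0 is left for background and zero-padded boxes.
class2id = {
    "Car": 1,
    "Van": 2,
    "Truck": 3,
    "Pedestrian": 4,
    "Person_sitting": 5,
    "Cyclist": 6,
    "Tram": 7,
    "Misc": 8,
}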

After unzipping the data, we preprocess it so that it can be fed into the KerasCV CenterPillar model. The KITTI dataset provides many fields for each object, and we keep only the ones we need.

As input, the CenterPillar model needs the following features: `point_xyz`, `point_features`, and `point_mask`.

As targets, it needs the following fields: `label_box` (the 3D bounding box), `label_class` (the object class), `label_box_density`, `label_box_diff` (the difficulty), and `label_box_mask`.
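
To make the KITTI fields concrete, here is a minimal sketch that parses one line of a `label_2` file into named fields. The field order follows the KITTI object development kit. The sample line and the `parse_kitti_label` helper are illustrative only and are not part of KerasCV or of this guide's pipeline.

```python
def parse_kitti_label(line):
    """Parse one line of a KITTI label file into a dict (hypothetical helper)."""
    f = line.split()
    return {
        "type": f[0],
        "truncated": float(f[1]),
        "occluded": int(f[2]),
        "alpha": float(f[3]),
        "bbox_2d": [float(v) for v in f[4:8]],  # left, top, right, bottom
        "height": float(f[8]),
        "width": float(f[9]),
        "length": float(f[10]),
        "location": [float(v) for v in f[11:14]],  # x, y, z in the camera frame
        "rotation_y": float(f[14]),
    }


# Illustrative label line in the KITTI format (values chosen for the example).
sample = "Car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59"
label = parse_kitti_label(sample)
print(label["type"], label["location"], label["rotation_y"])
```

The preprocessing code in this guide reads the box size and center from fields 8 to 13 of each line, and the class name from field 0.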

The `point_xyz` field represents the XYZ coordinates of each point in the
point cloud.

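The `point_features` field represents the LIDAR features of each point in the
point cloud. Typical features include range, intensity, and elongation. KITTI
scans store only an intensity value per point, so the code below repeats it to
fill the feature slots.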

In KerasCV, 3D box targets for object detection are represented as vertical
pillars rotated with respect to the Z axis. We encode each box as a list (or
Tensor) of 7 floats: the X, Y, and Z coordinates of the box's center, the width,
length, and height of the box, and the rotation of the box with respect to the
Z axis. (This rotation is referred to as `phi` and is always in radians.)
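
As a small illustration of this encoding, the sketch below computes the four bird's-eye-view corners of one box from its 7 floats. It assumes the first size component lies along the box's local X axis before rotation; the `box_to_bev_corners` helper and the sample box are hypothetical and only meant to show how `phi` rotates the footprint.

```python
import math


def box_to_bev_corners(box):
    """Return the 4 top-down (x, y) corners of a [x, y, z, dx, dy, dz, phi] box."""
    x, y, _, dx, dy, _, phi = box
    cos_p, sin_p = math.cos(phi), math.sin(phi)
    corners = []
    for sx, sy in [(1, 1), (1, -1), (-1, -1), (-1, 1)]:
        # Corner offset in the box frame, rotated about Z and moved to the center.
        ox, oy = sx * dx / 2, sy * dy / 2
        corners.append((x + ox * cos_p - oy * sin_p, y + ox * sin_p + oy * cos_p))
    return corners


# A 4 x 2 footprint centered at (10, 5), rotated by 90 degrees.
box = [10.0, 5.0, 0.0, 4.0, 2.0, 1.5, math.pi / 2]
corners = box_to_bev_corners(box)
print(corners)
```

After the 90 degree rotation, the long side of the footprint runs along the Y axis instead of the X axis.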

In [None]:
path = "/content/data/training/velodyne/*"
path2 = "/content/data/training/label_2/"
point_cloud = []
bounding_boxes = []
for f in tqdm(glob.glob(path)):
    # The label file has the same name as the scan: 000000.bin -> 000000.txt
    pp = path2 + f.split("/")[-1][:-3] + "txt"

    # Each KITTI point is stored as four float32 values: x, y, z, intensity.
    pc = np.fromfile(f, dtype=np.float32).reshape(-1, 4)
    pc = pc[:10000, :]  # keep a fixed number of points per scan
    point_xyz = pc[:, :3]
    # Intensity is the only LIDAR feature, so it is repeated to fill four slots.
    point_feat = np.vstack((pc[:, 3], pc[:, 3], pc[:, 3], pc[:, 3])).T
    point_mask = np.ones([pc.shape[0]])
    point_cloud.append(
        tf.concat(
            [
                point_xyz[tf.newaxis, ...],
                point_feat[tf.newaxis, ...],
                tf.cast(point_mask, tf.float32)[tf.newaxis, :, tf.newaxis],
            ],
            axis=-1,
        )
    )

    label_box = []
    label_class = []
    label_box_density = []
    label_box_diff = []
    label_box_mask = []
    with open(pp, "r") as label_file:
        for line in label_file:
            # KITTI label fields: 0 type, 1 truncated, 2 occluded, 3 alpha,
            # 4-7 2D bbox, 8-10 dimensions (h, w, l), 11-13 location (x, y, z),
            # 14 rotation_y.
            info = line.split()
            if info[0] != "DontCare":
                # Center x, y, z, then sizes reordered from h, w, l to w, l, h,
                # then the angle.
                lst = list(
                    map(
                        float,
                        (info[11], info[12], info[13], info[9], info[10], info[8], info[3]),
                    )
                )
                label_box.append(lst)
                label_class.append(class2id[info[0]])
                label_box_density.append(float(info[2]))
                label_box_diff.append(float(info[1]))
                label_box_mask.append(1)
    # reshape(-1, 7) keeps a valid (0, 7) shape for frames without labeled objects.
    label_box = np.array(label_box, dtype=np.float32).reshape(-1, 7)
    label_class = np.array(label_class)
    label_box_density = np.array(label_box_density)
    label_box_diff = np.array(label_box_diff)
    label_box_mask = np.array(label_box_mask)
    # Columns: 0-6 box, 7 class, 8 density, 9 difficulty, 10 mask.
    boxes = tf.concat(
        [
            label_box[tf.newaxis, :],
            tf.cast(label_class, tf.float32)[tf.newaxis, :, tf.newaxis],
            tf.cast(label_box_density, tf.float32)[tf.newaxis, :, tf.newaxis],
            tf.cast(label_box_diff, tf.float32)[tf.newaxis, :, tf.newaxis],
            tf.cast(label_box_mask, tf.float32)[tf.newaxis, :, tf.newaxis],
        ],
        axis=-1,
    )

    # Pad with zeros or truncate so that every frame has exactly 10 boxes.
    shape = [1, 10, 11]
    pad = shape - tf.minimum(tf.shape(boxes), shape)
    zeros = tf.zeros_like(pad)
    paddings = tf.stack([zeros, pad], axis=1)
    boxes = tf.pad(boxes, paddings, constant_values=0)
    boxes = tf.slice(boxes, zeros, shape)
    bounding_boxes.append(boxes)


  3%|▎         | 200/7481 [00:05<03:30, 34.64it/s]
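
The loop above pads or truncates every frame to a fixed `[1, 10, 11]` box tensor with `tf.pad` and `tf.slice`. The same idea can be sketched in plain NumPy, which may be easier to follow. The `pad_or_truncate` helper is hypothetical and is not used elsewhere in this guide.

```python
import numpy as np


def pad_or_truncate(boxes, max_boxes=10):
    """Return an array with exactly `max_boxes` rows; missing rows are zeros."""
    boxes = np.asarray(boxes, dtype=np.float32)
    out = np.zeros((max_boxes, boxes.shape[1]), dtype=np.float32)
    n = min(len(boxes), max_boxes)
    out[:n] = boxes[:n]  # copy at most `max_boxes` rows, drop the rest
    return out


few = np.ones((3, 11), dtype=np.float32)    # frame with 3 labeled objects
many = np.ones((14, 11), dtype=np.float32)  # frame with 14 labeled objects
print(pad_or_truncate(few).shape, pad_or_truncate(many).shape)
```

Because the padded rows are all zeros, their mask column is zero as well, which is how padded boxes can be told apart from real ones.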


In [None]:
point_cloud = tf.data.Dataset.from_tensor_slices(point_cloud)
bounding_boxes = tf.data.Dataset.from_tensor_slices(bounding_boxes)
ds = tf.data.Dataset.zip((point_cloud, bounding_boxes))
ds = ds.map(lambda x, y: {"point_clouds": x, "bounding_boxes": y})
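# Every element already has a leading batch dimension of 1, so `ds` is used as is.
# This call only displays the batched spec; its result is not assigned.
ds.batch(1)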

<_BatchDataset element_spec={'point_clouds': TensorSpec(shape=(None, 1, 10000, 8), dtype=tf.float32, name=None), 'bounding_boxes': TensorSpec(shape=(None, 1, 10, 11), dtype=tf.float32, name=None)}>

In [None]:
ds.take(1)

<_TakeDataset element_spec={'point_clouds': TensorSpec(shape=(1, 10000, 8), dtype=tf.float32, name=None), 'bounding_boxes': TensorSpec(shape=(1, 10, 11), dtype=tf.float32, name=None)}>

Next, we convert the data into the format the model expects, so that it is ready to be fed into the model.

In [None]:
def convert_to_model_format(y):
    # Point cloud columns: 0-2 xyz, 3-6 features, 7 mask.
    point_clouds = {
        "point_xyz": y["point_clouds"][..., :3],
        "point_feature": y["point_clouds"][..., 3:-1],
        "point_mask": tf.cast(y["point_clouds"][..., -1], tf.bool),
    }
    # Box columns: 0-6 box, 7 class, 8 density, 9 difficulty, 10 mask.
    boxes = {
        "boxes": y["bounding_boxes"][..., :7],
        "classes": y["bounding_boxes"][..., 7],
        "difficulty": y["bounding_boxes"][..., 9],
        "mask": tf.cast(y["bounding_boxes"][..., 10], tf.bool),
    }
    return {
        "point_clouds": point_clouds,
        "3d_boxes": boxes,
    }


dataset = ds.map(convert_to_model_format, num_parallel_calls=tf.data.AUTOTUNE)

# The label encoder needs the voxelization settings, so they are defined here.
# The model-building cell uses the same values.
voxel_size = [0.32, 0.32, 1000]
spatial_size = [-81.92, 81.92, -81.92, 81.92, -20, 20]

label_encoder = CenterNetLabelEncoder(
    voxel_size=voxel_size,
    min_radius=[0.8, 0.8, 0],
    max_radius=[8.0, 8.0, 0],
    spatial_size=spatial_size,
    num_classes=8,
    top_k_heatmap=[1024, 512, 512, 512, 512, 512, 512, 512],
)

dataset = dataset.map(label_encoder, num_parallel_calls=tf.data.AUTOTUNE)


def separate_points_and_boxes(y):
    # Split each element into (inputs, targets) as expected by `model.fit`.
    x = y["point_clouds"]
    del y["point_clouds"]

    return x, y


dataset = dataset.map(separate_points_and_boxes, num_parallel_calls=tf.data.AUTOTUNE)

## Model Building

Next, we build the CenterPillar model, which consists of a voxelization layer (`DynamicVoxelization`), a `UNet` backbone, a `MultiClassDetectionHead`, and a `MultiClassHeatmapDecoder`.

The `UNet` serves as the backbone of the model. The `DynamicVoxelization` layer assigns points to voxels, concatenates the voxel information with the point features, feeds the result into a small neural network, and max-pools the features of all points inside each voxel. `MultiHeadCenterPillar` then combines the voxelization layer and the backbone. It voxelizes the point cloud features, extracts features from the voxelized representation, and applies separate classification and regression heads for each class on the resulting feature map.
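
The pooling step can be pictured with a simplified sketch in plain Python. It groups points into XY pillars and keeps the maximum of a single scalar feature per pillar. This is only an illustration under simplifying assumptions: the actual layer first passes the features through a learned network and pools feature vectors, and the `pillar_max_pool` helper and sample points are hypothetical.

```python
import math


def pillar_max_pool(points, features, voxel_size=0.32, x_min=-81.92, y_min=-81.92):
    """Group points into XY pillars and max-pool one scalar feature per pillar."""
    pooled = {}
    for (x, y, _z), feat in zip(points, features):
        # Integer pillar index of the point along X and Y.
        key = (
            math.floor((x - x_min) / voxel_size),
            math.floor((y - y_min) / voxel_size),
        )
        pooled[key] = max(pooled.get(key, -math.inf), feat)
    return pooled


points = [(0.05, 0.05, 0.1), (0.10, 0.20, 1.2), (5.0, -3.0, 0.4)]
features = [0.2, 0.9, 0.5]
pooled = pillar_max_pool(points, features)
print(pooled)
```

The first two points fall into the same pillar, so only the larger of their two feature values survives the pooling.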

The model also has some hyperparameters that must be specified, such as the anchor box size for each class, the voxel size, and the spatial size covered by the model.
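
To see how the voxel size and the spatial size interact, the sketch below computes the number of voxels along each axis, assuming the grid size is the spatial extent divided by the voxel size, rounded up. The `voxel_grid_shape` helper is hypothetical and is not the KerasCV implementation; the values are the ones used in this guide.

```python
import math


def voxel_grid_shape(spatial_size, voxel_size, eps=1e-6):
    """Voxel counts along X, Y, Z for [x_min, x_max, y_min, y_max, z_min, z_max]."""
    shape = []
    for i, size in enumerate(voxel_size):
        extent = spatial_size[2 * i + 1] - spatial_size[2 * i]
        # Subtract a small epsilon so floating-point noise does not add a voxel.
        shape.append(math.ceil(extent / size - eps))
    return shape


grid = voxel_grid_shape([-81.92, 81.92, -81.92, 81.92, -20, 20], [0.32, 0.32, 1000])
print(grid)
```

With these values the grid has 512 x 512 cells in the XY plane and a single voxel along Z, so each voxel is a vertical pillar that spans the full height range.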

In [None]:

voxel_size = [0.32, 0.32, 1000]
spatial_size = [-81.92, 81.92, -81.92, 81.92, -20, 20]
voxelization_feature_size = 128
car_anchor_size = [4.5, 2.0, 1.6]
truck_anchor_size = [10.0, 5.0, 3.2]
pedestrian_anchor_size = [0.6, 0.8, 1.8]
cyclist_anchor_size = [2.0, 0.8, 1.6]

def build_centerpillar_model():

  voxelization_point_net = tf.keras.Sequential(
      [
          tf.keras.layers.Dense(voxelization_feature_size),
          tf.keras.layers.BatchNormalization(fused=False),
          tf.keras.layers.ReLU(),
      ]
  )
  voxelization_layer = DynamicVoxelization(
      point_net=voxelization_point_net,
      voxel_size=voxel_size,
      spatial_size=spatial_size,
  )

  unet_layer = UNet(
      input_shape=(
          voxelization_layer._voxel_spatial_size[:2]
          + [voxelization_feature_size]
      ),
      down_block_configs=[(128, 6), (256, 2), (512, 2)],
      up_block_configs=[512, 256, 256],
      sync_bn=False,
  )


  # All per-class lists follow the order of `class2id`:
  # Car, Van, Truck, Pedestrian, Person_sitting, Cyclist, Tram, Misc.
  num_heading_bins = [12, 12, 12, 4, 4, 4, 12, 4]

  decoder = MultiClassHeatmapDecoder(
      num_classes=8,
      num_head_bin=num_heading_bins,
      anchor_size=[
          car_anchor_size,
          truck_anchor_size,
          truck_anchor_size,
          pedestrian_anchor_size,
          pedestrian_anchor_size,
          cyclist_anchor_size,
          truck_anchor_size,
          truck_anchor_size,
      ],
      max_pool_size=[7, 8, 9, 3, 3, 3, 9, 9],
      max_num_box=[800, 800, 800, 400, 400, 400, 800, 400],
      heatmap_threshold=[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
      voxel_size=voxel_size,
      spatial_size=spatial_size,
  )

  multiclass_head = MultiClassDetectionHead(
      num_classes=8,
      num_head_bin=num_heading_bins,
  )

  model = MultiHeadCenterPillar(
      backbone=unet_layer,
      voxel_net=voxelization_layer,
      multiclass_head=multiclass_head,
      prediction_decoder=decoder,
  )

  return model




# One box loss per class, in the same order as `class2id`.
car_box_loss = keras_cv.losses.CenterNetBoxLoss(num_heading_bins=12, anchor_size=car_anchor_size, reduction="sum")
van_box_loss = keras_cv.losses.CenterNetBoxLoss(num_heading_bins=12, anchor_size=truck_anchor_size, reduction="sum")
truck_box_loss = keras_cv.losses.CenterNetBoxLoss(num_heading_bins=12, anchor_size=truck_anchor_size, reduction="sum")
pedestrian_box_loss = keras_cv.losses.CenterNetBoxLoss(num_heading_bins=4, anchor_size=pedestrian_anchor_size, reduction="sum")
person_sitting_box_loss = keras_cv.losses.CenterNetBoxLoss(num_heading_bins=4, anchor_size=pedestrian_anchor_size, reduction="sum")
cyclist_box_loss = keras_cv.losses.CenterNetBoxLoss(num_heading_bins=4, anchor_size=cyclist_anchor_size, reduction="sum")
tram_box_loss = keras_cv.losses.CenterNetBoxLoss(num_heading_bins=12, anchor_size=truck_anchor_size, reduction="sum")
misc_box_loss = keras_cv.losses.CenterNetBoxLoss(num_heading_bins=4, anchor_size=truck_anchor_size, reduction="sum")

model = build_centerpillar_model()

model.compile(
    optimizer="adam",
    heatmap_loss=keras_cv.losses.BinaryPenaltyReducedFocalCrossEntropy(reduction="sum"),
    box_loss=[
        car_box_loss,
        van_box_loss,
        truck_box_loss,
        pedestrian_box_loss,
        person_sitting_box_loss,
        cyclist_box_loss,
        tram_box_loss,
        misc_box_loss,
    ],
)

model.fit(dataset, epochs=5)


In [None]:
# Load one test scan and split it into coordinates, features, and mask.
pc = np.fromfile("data/testing/velodyne/002576.bin", dtype=np.float32).reshape(-1, 4)
point_xyz = pc[:, :3]
point_feat = np.vstack((pc[:, 3], pc[:, 3], pc[:, 3], pc[:, 3])).T
point_mask = np.ones([pc.shape[0]])

In [None]:
inp = {
    "point_xyz": point_xyz[tf.newaxis, ...],
    "point_feature": point_feat[tf.newaxis, ...],
    "point_mask": tf.cast(point_mask, tf.bool)[tf.newaxis, :, tf.newaxis],
}

In [None]:
inp

{'point_xyz': array([[[ 71.89 ,  11.545,   2.671],
         [ 49.278,   8.064,   1.898],
         [ 49.247,   8.218,   1.897],
         ...,
         [ -8.962,  -7.068,   0.595],
         [ -9.082,  -7.21 ,   0.601],
         [-10.779,  -8.647,   0.676]]], dtype=float32),
 'point_feature': array([[[0.  , 0.  , 0.  , 0.  ],
         [0.  , 0.  , 0.  , 0.  ],
         [0.  , 0.  , 0.  , 0.  ],
         ...,
         [0.36, 0.36, 0.36, 0.36],
         [0.58, 0.58, 0.58, 0.58],
         [0.2 , 0.2 , 0.2 , 0.2 ]]], dtype=float32),
 'point_mask': <tf.Tensor: shape=(1, 1000, 1), dtype=bool, numpy=
 array([[[ True],
         [ True],
         [ True],
         ...

## Sample Prediction with model output

In [None]:
model.predict(inp)



{'3d_boxes': {'boxes': array([[[-5.3763557e+01,  5.3969669e+00, -4.8189335e-02, ...,
            1.8558412e+00,  1.0874635e+00, -4.1085625e-01],
          [ 7.2736049e+00,  5.5295017e+01, -3.7862352e-01, ...,
            1.1607754e+00,  2.2833579e+00, -3.4592438e-01],
          [-5.1152077e+01,  6.6764469e+00,  1.1264656e-02, ...,
            2.0087991e+00,  1.3036797e+00,  3.1303015e+00],
          ...,
          [ 5.1199997e+01, -1.2799999e+01,  2.7481551e-07, ...,
            4.9999981e+00,  3.2000000e+00,  1.7331639e-07],
          [ 5.1520000e+01, -1.2480000e+01,  1.6552333e-07, ...,
            4.9999995e+00,  3.1999998e+00,  1.5707965e+00],
          [ 5.2160000e+01,  5.5680000e+01,  4.8604551e-08, ...,
            4.9999995e+00,  3.1999998e+00,  1.5707965e+00]]], dtype=float32),
  'classes': array([[1, 1, 1, ..., 8, 8, 8]], dtype=int32),
  'confidence': array([[0.7432597 , 0.73020416, 0.71901345, ..., 0.50000006, 0.50000006,
          0.50000006]], dtype=float32)}}
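
The tail of the output above consists of low-confidence entries around 0.5, so a natural next step is to keep only the confident detections. The sketch below shows one way to do that with NumPy. The `filter_predictions` helper, the 0.6 threshold, and the `fake_pred` dictionary are assumptions made for illustration; `fake_pred` only mimics the layout of the model output and is not real model output.

```python
import numpy as np


def filter_predictions(pred, threshold=0.6):
    """Keep only boxes whose confidence exceeds `threshold` (hypothetical helper)."""
    boxes = np.asarray(pred["3d_boxes"]["boxes"])[0]
    classes = np.asarray(pred["3d_boxes"]["classes"])[0]
    confidence = np.asarray(pred["3d_boxes"]["confidence"])[0]
    keep = confidence > threshold  # boolean mask over the box dimension
    return boxes[keep], classes[keep], confidence[keep]


# Synthetic stand-in with the same dictionary layout as the prediction output.
fake_pred = {
    "3d_boxes": {
        "boxes": np.zeros((1, 4, 7), dtype=np.float32),
        "classes": np.array([[1, 1, 8, 8]], dtype=np.int32),
        "confidence": np.array([[0.74, 0.73, 0.5, 0.5]], dtype=np.float32),
    }
}
boxes, classes, confidence = filter_predictions(fake_pred)
print(classes, confidence)
```

With a trained model, the same helper could be applied to the result of `model.predict(inp)`, and the threshold would need to be tuned for the dataset.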