ONNX, TFLite, OpenVINO, TensorFlow.js, CoreML.
This model requires two different files to be run in sequence. The order of execution is egonet_heatmap_Nx3x256x256 first, then egonet_fc_Nx66.
- N in the input tensor of egonet_heatmap_Nx3x256x256 is the batch size, that is, the number of cars detected by the object detection model of your choice. Before running the EgoNet model, each car must first be detected with an object detection model such as YOLO or SSD, and the car region must be cropped out using its bounding box. The batch size N of the EgoNet input tensor is therefore the number of vehicles obtained by object detection.
- crop_instances - Loop processing for all detected car bounding boxes
INPUT bbox:
[[654 , 184, 701, 232]] <- [[X1, Y1, X2, Y2],[X1, Y1, X2, Y2], ..., [X1, Y1, X2, Y2]]
INPUT labels:
[-1] <- [-1, -1, ..., -1]
(Cars only, Car = -1)
INPUT scores:
[0.99902] <- [conf1, conf2, ..., confN]
Resize bbox (enlarge the car crop area to 1.1 times its original size):
new_width = (701 - 654) * 1.1
new_height = (232 - 184) * 1.1
Calculation of the center point:
center_x = (654 + 701) / 2
center_y = (184 + 232) / 2
Calculation of the new X1, Y1, X2, Y2 point:
new_left = center_x - (0.5 * new_width)
new_right = center_x + (0.5 * new_width)
new_top = center_y - (0.5 * new_height)
new_bottom = center_y + (0.5 * new_height)
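The enlargement steps above can be collected into one small helper. This is a minimal sketch rather than the upstream implementation: the name enlarge_bbox and its ratio argument are hypothetical, and the integer box from the example is used as input.

```python
def enlarge_bbox(x1, y1, x2, y2, ratio=1.1):
    """Scale a box about its center (hypothetical helper, not part of EgoNet)."""
    new_width = (x2 - x1) * ratio
    new_height = (y2 - y1) * ratio
    center_x = (x1 + x2) / 2
    center_y = (y1 + y2) / 2
    new_left = center_x - 0.5 * new_width
    new_right = center_x + 0.5 * new_width
    new_top = center_y - 0.5 * new_height
    new_bottom = center_y + 0.5 * new_height
    # Same ordering as the input: [X1, Y1, X2, Y2]
    return new_left, new_top, new_right, new_bottom


enlarged = enlarge_bbox(654, 184, 701, 232)
print(enlarged)
```

For the example box this gives approximately (651.65, 181.6, 703.35, 234.4): the center stays at (677.5, 208.0) while each side grows by 10%.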
Resize according to aspect ratio
target_ar is the aspect ratio calculated by height / width.
https://github.com/Nicholasli1995/EgoNet/blob/a3ea8285d0497723dc2a3a60009b2da95937f542/libs/common/img_proc.py#L453-L459

    ret = resize_bbox(new_left, new_top, new_right, new_bottom, target_ar=target_ar)

    def resize_bbox(left, top, right, bottom, target_ar=1.):
        """
        Resize a bounding box to pre-defined aspect ratio.
        """
        width = right - left
        height = bottom - top
        aspect_ratio = height/width
        center_x = (left + right)/2
        center_y = (top + bottom)/2
        if aspect_ratio > target_ar:
            new_width = height*(1/target_ar)
            new_left = center_x - 0.5*new_width
            new_right = center_x + 0.5*new_width
            new_top = top
            new_bottom = bottom
        else:
            new_height = width*target_ar
            new_left = left
            new_right = right
            new_top = center_y - 0.5*new_height
            new_bottom = center_y + 0.5*new_height
        return {
            'bbox': [new_left, new_top, new_right, new_bottom],
            'c': np.array([center_x, center_y]),
            's': np.array([(new_right - new_left)/SIZE, (new_bottom - new_top)/SIZE])  # SIZE=200.0
        }
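To see what resize_bbox does to the enlarged example box, the sketch below repeats the same branch logic in plain Python. It is an illustration under assumptions: target_ar = 1.0 is assumed because the crop is later warped to 256x256, tuples stand in for the NumPy arrays, and fit_aspect_ratio is a hypothetical name.

```python
SIZE = 200.0  # same constant as noted in the upstream comment


def fit_aspect_ratio(left, top, right, bottom, target_ar=1.0):
    """Pad the shorter side so that height / width equals target_ar."""
    width = right - left
    height = bottom - top
    center_x = (left + right) / 2
    center_y = (top + bottom) / 2
    if height / width > target_ar:
        # Box is too tall: widen it around the center.
        new_width = height / target_ar
        left = center_x - 0.5 * new_width
        right = center_x + 0.5 * new_width
    else:
        # Box is too wide (or already matches): extend it vertically.
        new_height = width * target_ar
        top = center_y - 0.5 * new_height
        bottom = center_y + 0.5 * new_height
    scale = ((right - left) / SIZE, (bottom - top) / SIZE)
    return (left, top, right, bottom), (center_x, center_y), scale


# Enlarged box from the integer example: 51.7 wide, 52.8 high.
bbox, center, scale = fit_aspect_ratio(651.65, 181.6, 703.35, 234.4)
print(bbox, center, scale)
```

The box is taller than it is wide, so it is widened to a 52.8 x 52.8 square and the scale comes out at about 0.264 in both directions. That is close to the 0.2613 in the records example further down, which was computed from the unrounded detection.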
Affine transformation of vertices
img is not the cropped image of the car, but the whole image that was processed for object detection. resolution is [256, 256].
https://github.com/Nicholasli1995/EgoNet/blob/a3ea8285d0497723dc2a3a60009b2da95937f542/libs/model/egonet.py#L81-L89

    c, s, r = ret['c'], ret['s'], 0.0
    trans = get_affine_transform(
        c, s, r, (height, width)
    )
    instance = cv2.warpAffine(
        img, trans, (int(resolution[0]), int(resolution[1])),
        flags=cv2.INTER_LINEAR
    )
    instance = instance if pth_trans is None else pth_trans(instance)
pth_trans: Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
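As a quick check of what pth_trans does to a single pixel, the sketch below applies the same mean and std by hand. It assumes the image has already been scaled from 0-255 to the 0-1 range before Normalize is applied (an assumption, although it is consistent with the instances values printed further down); normalize_pixel is a hypothetical helper.

```python
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(value, channel):
    """Map an 8-bit channel value to its normalized value (illustrative only)."""
    scaled = value / 255.0  # assumed 0-1 scaling before Normalize
    return (scaled - MEAN[channel]) / STD[channel]


dark = normalize_pixel(32, 0)     # first channel, value 32
bright = normalize_pixel(255, 0)  # first channel, saturated
print(round(dark, 4), round(bright, 4))
```

The two results are about -1.5699 and 2.2489, values that also appear in the first channel of the instances tensor shown later.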
r (the rot argument of get_affine_transform) is the rotation angle in degrees.

    def get_affine_transform(
        center,
        scale,
        rot,
        output_size,
        shift=np.array([0, 0], dtype=np.float32),
        inv=0
    ):
        """
        Estimate an affine transformation given crop parameters (center, scale and
        rotation) and output resolution.
        """
        if isinstance(scale, list):
            scale = np.array(scale)
        if isinstance(center, list):
            center = np.array(center)
        scale_tmp = scale * SIZE  # SIZE=200.0
        src_w = scale_tmp[0]
        dst_h, dst_w = output_size

        rot_rad = np.pi * rot / 180
        src_dir = get_dir([0, src_w * -0.5], rot_rad)
        dst_dir = np.array([0, dst_w * -0.5], np.float32)

        src = np.zeros((3, 2), dtype=np.float32)
        dst = np.zeros((3, 2), dtype=np.float32)
        src[0, :] = center + scale_tmp * shift
        src[1, :] = center + src_dir + scale_tmp * shift
        dst[0, :] = [dst_w * 0.5, dst_h * 0.5]
        dst[1, :] = np.array([dst_w * 0.5, dst_h * 0.5]) + dst_dir

        src[2:, :] = get_3rd_point(src[0, :], src[1, :])
        dst[2:, :] = get_3rd_point(dst[0, :], dst[1, :])

        if inv:
            trans = cv2.getAffineTransform(np.float32(dst), np.float32(src))
        else:
            trans = cv2.getAffineTransform(np.float32(src), np.float32(dst))

        return trans

    def get_dir(src_point, rot_rad):
        sn, cs = np.sin(rot_rad), np.cos(rot_rad)
        src_result = [0, 0]
        src_result[0] = src_point[0] * cs - src_point[1] * sn
        src_result[1] = src_point[0] * sn + src_point[1] * cs
        return src_result

    def get_3rd_point(a, b):
        direct = a - b
        return b + np.array([-direct[1], direct[0]], dtype=np.float32)
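With rot = 0, the three point pairs built in get_affine_transform differ only by a uniform zoom and a translation, so the resulting 2x3 matrix can be written in closed form. The sketch below derives it without OpenCV as a way to reason about the crop; it is not the upstream code, and crop_matrix and apply are hypothetical names.

```python
SIZE = 200.0


def crop_matrix(center, scale, output_size=(256, 256)):
    """2x3 matrix mapping image pixels to crop pixels, assuming rot = 0."""
    dst_h, dst_w = output_size
    src_w = scale[0] * SIZE  # only the first scale entry is used upstream
    k = dst_w / src_w        # uniform zoom factor
    cx, cy = center
    return [
        [k, 0.0, dst_w * 0.5 - k * cx],
        [0.0, k, dst_h * 0.5 - k * cy],
    ]


def apply(matrix, point):
    """Apply a 2x3 affine matrix to an (x, y) point."""
    x, y = point
    return (
        matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
        matrix[1][0] * x + matrix[1][1] * y + matrix[1][2],
    )


# Center and scale of the square box from the integer example.
center = (677.5, 208.0)
scale = (0.264, 0.264)
m = crop_matrix(center, scale)
print(apply(m, center))          # center of the square box
print(apply(m, (651.1, 181.6)))  # top-left corner of the square box
```

The box center lands on the middle of the crop, (128, 128), and the top-left corner of the square box lands on the crop origin up to floating-point error. This is the mapping that cv2.warpAffine then applies to the whole image.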
Intermediate processing is required after egonet_heatmap_Nx3x256x256, before passing the local_coord values to egonet_fc_Nx66: the output of egonet_heatmap_Nx3x256x256 is affine transformed, and the transformed values are saved as kpts_2d_pred and passed on as the input of the next model.
instances holds the cropped images of the N cars after affine transformation and normalization.
https://github.com/Nicholasli1995/EgoNet/blob/a3ea8285d0497723dc2a3a60009b2da95937f542/libs/model/egonet.py#L424-L467

    instances = tensor([[[-1.5699, -1.5870, -1.5870, ..., -0.3541, -0.3541, -0.3541],
                         [-1.5699, -1.5699, -1.5870, ..., -0.3369, -0.3369, -0.3541],
                         [-1.5699, -1.5699, -1.5699, ..., -0.3369, -0.3369, -0.3369],
                         ...,
                         [ 2.2489, 2.2489, 2.2489, ..., 1.3413, 1.3242, 1.3242],
                         [ 2.2489, 2.2489, 2.2489, ..., 1.3927, 1.3755, 1.3927],
                         [ 2.2489, 2.2489, 2.2489, ..., 1.4269, 1.4269, 1.4440]],
                        [[-1.4930, -1.5105, -1.5105, ..., -0.1625, -0.1625, -0.1800],
                         [-1.4755, -1.4930, -1.4930, ..., -0.1275, -0.1275, -0.1275],
                         [-1.4580, -1.4755, -1.4930, ..., -0.0924, -0.0924, -0.0924],
                         ...,
                         [ 2.3761, 2.3761, 2.3936, ..., 1.0980, 1.0280, 1.0280],
                         [ 2.3235, 2.3410, 2.3585, ..., 1.1331, 1.0805, 1.0805],
                         [ 2.2710, 2.3060, 2.3235, ..., 1.1856, 1.1155, 1.1331]],
                        [[-1.1596, -1.1596, -1.1770, ..., 0.1476, 0.1476, 0.1476],
                         [-1.1421, -1.1596, -1.1596, ..., 0.2348, 0.2348, 0.2522],
                         [-1.1247, -1.1421, -1.1596, ..., 0.3219, 0.3219, 0.3219],
                         ...,
                         [ 2.3437, 2.3263, 2.3088, ..., 1.2282, 1.1585, 1.1585],
                         [ 2.2914, 2.2740, 2.2740, ..., 1.2457, 1.1585, 1.1411],
                         [ 2.2391, 2.2391, 2.2217, ..., 1.2631, 1.1759, 1.1411]]])

    records = [
        {
            'bbox': array([654.4174967, 184.97550979, 701.93145589, 232.48946898]),
            'bbox_resize': [
                652.0417987381174,
                182.59981183069667,
                704.3071538461882,
                234.86516693876743
            ],
            'center': array([678.17447629, 208.73248938]),
            'label': -1,
            'path': '/xxx/kitti/training/image_2/000002.png',
            'rotation': 0.0,
            'scale': array([0.26132678, 0.26132678]),
            'score': 0.99902
        }
    ]
Inverse affine transformation of the keypoints inferred and output by output = self.HC(instances). The converted (preprocessed) keypoints are stored in kpts_2d_pred.
self.resolution is [256, 256].
np.array(self.resolution).reshape(1, 1, 2) is [[[256, 256]]].
local_coord has shape [N, 33, 2].

    def get_keypoints(
        self,
        instances,
        records,
        is_cuda=True
    ):
        """
        Forward pass to obtain the screen coordinates.
        """
        if is_cuda:
            instances = instances.cuda()
        output = self.HC(instances)  # <--- Heat map generation (inference), egonet_heatmap_Nx3x256x256
        # local part coordinates
        width, height = self.resolution
        local_coord = output[1].data.cpu().numpy()
        local_coord *= np.array(self.resolution).reshape(1, 1, 2)
        # transform local part coordinates to screen coordinates
        centers = [records[i]['center'] for i in range(len(records))]
        scales = [records[i]['scale'] for i in range(len(records))]
        rots = [records[i]['rotation'] for i in range(len(records))]
        for instance_idx in range(len(local_coord)):
            trans_inv = get_affine_transform(
                centers[instance_idx],
                scales[instance_idx],
                rots[instance_idx],
                (height, width),
                inv=1
            )
            screen_coord = affine_transform_modified(
                local_coord[instance_idx],
                trans_inv
            )
            records[instance_idx]['kpts'] = screen_coord
        # assemble a dictionary where each key corresponds to one image
        ret = {}
        for record in records:
            path = record['path']
            if path not in ret:
                ret[path] = self.new_img_dict()
            ret[path]['kpts_2d_pred'].append(record['kpts'].reshape(1, -1))
            ret[path]['center'].append(record['center'])
            ret[path]['scale'].append(record['scale'])
            ret[path]['bbox_resize'].append(record['bbox_resize'])
            ret[path]['label'].append(record['label'])
            ret[path]['score'].append(record['score'])
            ret[path]['rotation'].append(record['rotation'])
        return ret

    def affine_transform_modified(pts, t):
        """
        Apply affine transformation with homogeneous coordinates.
        """
        # pts of shape [n, 2]
        new_pts = np.hstack([pts, np.ones((len(pts), 1))]).T
        new_pts = t @ new_pts
        return new_pts[:2, :].T
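The post-processing above, which scales local_coord to crop pixels and then applies the inverse affine transform, can be traced for a single keypoint as follows. The sketch assumes rotation 0, in which case the inverse reduces to undoing a uniform zoom about the crop center, and it assumes the heatmap model emits coordinates normalized to the 0-1 range, as the multiplication by the resolution suggests. local_to_screen is a hypothetical helper, not part of EgoNet.

```python
SIZE = 200.0
RESOLUTION = (256, 256)  # (width, height) of the crop


def local_to_screen(local_xy, center, scale):
    """Map one normalized keypoint back to full-image pixels (rot = 0 only)."""
    width, height = RESOLUTION
    # Step 1: normalized crop coordinates -> crop pixel coordinates.
    u = local_xy[0] * width
    v = local_xy[1] * height
    # Step 2: invert the crop transform (uniform zoom about the center).
    k = width / (scale[0] * SIZE)
    x = (u - width * 0.5) / k + center[0]
    y = (v - height * 0.5) / k + center[1]
    return x, y


# Values taken from the records example.
center = (678.17447629, 208.73248938)
scale = (0.26132678, 0.26132678)
mid = local_to_screen((0.5, 0.5), center, scale)     # middle of the crop
corner = local_to_screen((0.0, 0.0), center, scale)  # top-left of the crop
print(mid, corner)
```

The middle of the crop maps back to the recorded center, and the top-left corner maps back to roughly (652.04, 182.60), which agrees with bbox_resize in the records example.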
The normalization process is added at the beginning of the egonet_fc_Nx66 model and the inverse normalization process is added at the end of the egonet_fc_Nx66 model. Therefore, the [N, 66] tensor used as input for the egonet_fc_Nx66 model should not be normalized programmatically.
len(prediction) is the batch size N, which equals the number of cars detected in one image.
https://github.com/Nicholasli1995/EgoNet/blob/a3ea8285d0497723dc2a3a60009b2da95937f542/libs/model/egonet.py#L469-L486

    def lift_2d_to_3d(self, records, cuda=True):
        """
        Forward pass of the lifter sub-model.
        """
        for path in records.keys():
            data = np.concatenate(records[path]['kpts_2d_pred'], axis=0)
            data = nop.normalize_1d(data, self.LS['mean_in'], self.LS['std_in'])
            data = data.astype(np.float32)
            data = torch.from_numpy(data)
            if cuda:
                data = data.cuda()
            prediction = self.L(data)  # <--- egonet_fc_Nx66
            prediction = nop.unnormalize_1d(
                prediction.data.cpu().numpy(),
                self.LS['mean_out'],
                self.LS['std_out']
            )
            records[path]['kpts_3d_pred'] = prediction.reshape(len(prediction), -1, 3)
        return records
To be more specific, the two calls below, nop.normalize_1d and nop.unnormalize_1d, should not be executed programmatically. The quoted process reads self.LS (.npy) and uses it for normalization, but the model committed to my repository already has the .npy values fused into it.
https://github.com/Nicholasli1995/EgoNet/blob/a3ea8285d0497723dc2a3a60009b2da95937f542/libs/model/egonet.py#L475

    data = nop.normalize_1d(data, self.LS['mean_in'], self.LS['std_in'])

    prediction = nop.unnormalize_1d(
        prediction.data.cpu().numpy(),
        self.LS['mean_out'],
        self.LS['std_out']
    )
    records[path]['kpts_3d_pred'] = prediction.reshape(len(prediction), 32, 3)
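In practice this means the caller passes the raw kpts_2d_pred values and reads the 3D coordinates directly from the output. The toy sketch below illustrates why normalizing again on the caller side would give wrong results. FusedLifter, its identity core, and the mean and std values are all made up for illustration; they do not reflect the real network or the .npy statistics.

```python
class FusedLifter:
    """Toy stand-in for a model with normalization baked in (hypothetical)."""

    def __init__(self, mean_in, std_in, mean_out, std_out):
        self.mean_in, self.std_in = mean_in, std_in
        self.mean_out, self.std_out = mean_out, std_out

    def __call__(self, raw):
        # Normalization that the exported model performs internally.
        x = [(v - m) / s for v, m, s in zip(raw, self.mean_in, self.std_in)]
        y = x  # identity core, standing in for the real network
        # Inverse normalization, also performed internally.
        return [v * s + m for v, m, s in zip(y, self.mean_out, self.std_out)]


model = FusedLifter(mean_in=[100.0, 50.0], std_in=[10.0, 5.0],
                    mean_out=[0.0, 0.0], std_out=[1.0, 1.0])

raw = [120.0, 45.0]
correct = model(raw)  # raw values in, as intended
pre_normalized = [(120.0 - 100.0) / 10.0, (45.0 - 50.0) / 5.0]
wrong = model(pre_normalized)  # input normalized twice
print(correct, wrong)
```

Feeding raw values yields [2.0, -1.0], whereas feeding already-normalized values yields about [-9.8, -10.2], because the statistics are applied a second time inside the model.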