Demo on a single image #1

Closed
kevinhartyanyi opened this issue Jan 25, 2022 · 5 comments
@kevinhartyanyi

Hi,

I read the paper and it looks interesting.
I was wondering how you would run this model on a single image. The provided demo code seems to only run on the validation set.

So if I have an image for which I have the bounding box/segmentation data of the hands, then how should I preprocess this for the model?

Given that this code is based on InterHand2.6M, I tried to do something similar to what they have in their demo code:

def demo_model(args):
    pl.seed_everything(args.seed)
    model = fetch_pl_model(args, args.experiment)

    model.cuda()
    model.freeze()
    model.eval()

    # prepare input image
    transform = transforms.ToTensor()
    img_path = '1.jpg'
    original_img = cv2.imread(img_path)
    original_img_height, original_img_width = original_img.shape[:2]

    # prepare bbox
    bbox = [723, 354, 127, 74] # xmin, ymin, width, height
    bbox = process_bbox(bbox, (original_img_height, original_img_width, original_img_height))
    img, trans, inv_trans = generate_patch_image(original_img, bbox, False, 1.0, 0.0, cfg.input_img_shape)
    img = transform(img.astype(np.float32))/255
    img = img.cuda()[None,:,:,:]

    # forward
    inputs = {'img': img}
    targets = {}
    meta_info = {}

    with torch.no_grad():
        vis_dict = model(inputs, targets, meta_info, 'vis')
        print("vis_dict", vis_dict)`

Here process_bbox and generate_patch_image are taken from InterHand2.6M; however, it didn't work out for me, and the model always returned None. My guess is that the input needs to be processed differently. Looking at the code, I think it has something to do with the "segm_256" key in the "targets" dict, but I'm not sure what the easiest way to handle it would be.

I would appreciate some help or directions on what would be the simplest way to write a script that runs the model on a single image.

Best regards,
Kevin

@zc-alexfan
Owner

zc-alexfan commented Jan 26, 2022

Hi Kevin, thanks for your interest in our work. I think there is an easier workaround if you just want to evaluate on an unseen input image: take the model definition code, remove everything related to the targets dict, and then manually load the weights into the modified class.

The DIGIT inference code only relies on the input image, if I recall correctly. You just need to prepare the input image for the network.
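
Roughly something along these lines, just as a sketch (build_pose_model, args, and the checkpoint path are placeholders for whatever you end up with after stripping the target-related code, not the exact DIGIT classes):

import torch

class InferenceWrapper(torch.nn.Module):
    def __init__(self, args):
        super().__init__()
        # hypothetical builder for the pose branch of the stripped-down model
        self.pose_reg = build_pose_model(args)

    def forward(self, img):
        # inference only needs the image; no targets / meta_info
        return self.pose_reg(img, None, None)

model = InferenceWrapper(args)
ckpt = torch.load("path/to/checkpoint.pt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)        # handle Lightning-style checkpoints too
model.load_state_dict(state_dict, strict=False)  # strict=False since the target-only heads are removed
model.cuda().eval()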

@kevinhartyanyi
Author

Thank you for your response. I did some experimenting and managed to run it.

What I did was:

I removed the unnecessary parts from the linked model, so its forward now looks like this:

def forward(self, inputs, targets, meta_info, mode):
    input_img = inputs["img"]

    pose_dict = self.pose_reg(input_img, None, None)

    hm_2d = pose_dict["hm_2d"]
    zval = pose_dict["zval"]
    rel_root_depth_out = pose_dict["rel_root_depth_out"]
    hand_type_sigmoid = pose_dict["hand_type_sigmoid"]

    # decode the 2D heatmaps into (x, y) joint coordinates
    joint_xy = hm_utils.hm2xy(hm_2d, self.args.output_hm_shape, self.args.beta)

    # 2.5D: append the per-joint depth to the 2D coordinates
    joint_xyz = torch.cat((joint_xy, zval[:, :, None]), dim=2)

    out_dict = {
        "joint_coord": joint_xyz,
        "rel_root_depth": rel_root_depth_out,
        "hand_type": hand_type_sigmoid,
    }
    return out_dict
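
For intuition, hm_utils.hm2xy presumably performs something like a spatial soft-argmax over the 2D heatmaps, with beta acting as a softmax temperature. A generic version of that operation, purely for illustration and not the actual hm_utils implementation, would be:

import torch

def soft_argmax_2d(hm, beta=100.0):
    """Generic 2D soft-argmax over (B, J, H, W) heatmaps.

    Returns (B, J, 2) coordinates in (x, y) order, in heatmap pixel units.
    """
    b, j, h, w = hm.shape
    probs = torch.softmax(beta * hm.reshape(b, j, -1), dim=-1).reshape(b, j, h, w)
    xs = torch.arange(w, dtype=hm.dtype, device=hm.device)
    ys = torch.arange(h, dtype=hm.dtype, device=hm.device)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)  # expectation along the x axis
    y = (probs.sum(dim=3) * ys).sum(dim=-1)  # expectation along the y axis
    return torch.stack((x, y), dim=-1)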

self.pose_reg calls the DIGIT model here, which, besides the image, also expects segm_target_256 and segm_valid.

To be honest, I'm not sure what the purpose of providing segm_target_256 and segm_valid is: the forward function of DIGIT only uses them here to call the forward of SegmNet, and that function doesn't use either of them. Therefore, I pass None in place of these two; alternatively, one could change the forward of SegmNet so that it no longer requires the unused arguments, as sketched below.
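
A sketch of that alternative (only the signature change matters; the rest of the class and the forward body stay as they are in the repo):

import torch.nn as nn

class SegmNet(nn.Module):
    # __init__ and everything else stay exactly as in the repo; only the forward
    # signature changes so that the target arguments default to None.
    def forward(self, img, segm_target_256=None, segm_valid=None):
        # segm_target_256 / segm_valid are only used for the training loss,
        # so inference callers can pass just the image.
        ...  # original forward body, unchanged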

For the preprocessing and calling the model, I do the following:

def demo_model(args):
    model = Model(args)
    model.load_pretrained("saved_models/db7cba8c1.pt")
    model.cuda()
    model.eval()

    transform = transforms.ToTensor()
    img_path = 'input.jpg'
    original_img = cv2.imread(img_path)
    img = original_img.copy()

    bbox = [69, 137, 165, 153] # xmin, ymin, width, height

    # no augmentation at inference time: identity translation/scale/rotation, no flip, no color scaling
    trans, scale, rot, do_flip, color_scale = [0,0], 1.0, 0.0, False, np.array([1,1,1])
    bbox[0] = bbox[0] + bbox[2] * trans[0]
    bbox[1] = bbox[1] + bbox[3] * trans[1]
    img, trans, inv_trans = generate_patch_image(img, bbox, do_flip, scale, rot, cfg.input_img_shape, cv2.INTER_LINEAR)
    img = np.clip(img * color_scale[None,None,:], 0, 255)

    img = transform(img.astype(np.float32))/255.0
    img = img[None,:,:,:]

    inputs = {"img": img.to('cuda:0')}
    targets = {}
    meta_info = {}

    with torch.no_grad():
        out = model(inputs, targets, meta_info, None)

The preprocessing steps are based on the augmentation function.
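
For reference, the crop step essentially warps the bbox region to cfg.input_img_shape and keeps the inverse transform so that predicted 2D joints can later be mapped back to the original image space. A simplified stand-in (not the exact InterHand2.6M generate_patch_image; it skips the bbox adjustment that process_bbox applies as well as the scale/rotation/flip options) would be:

import cv2
import numpy as np

def crop_and_resize(img, bbox, out_shape):
    """Crop the (xmin, ymin, w, h) box and resize it to out_shape = (H, W).

    Returns the patch plus the 2x3 affine transform and its inverse.
    """
    xmin, ymin, w, h = bbox
    src = np.float32([[xmin, ymin], [xmin + w, ymin], [xmin, ymin + h]])
    dst = np.float32([[0, 0], [out_shape[1], 0], [0, out_shape[0]]])
    trans = cv2.getAffineTransform(src, dst)
    inv_trans = cv2.getAffineTransform(dst, src)
    patch = cv2.warpAffine(img, trans, (out_shape[1], out_shape[0]), flags=cv2.INTER_LINEAR)
    return patch, trans, inv_trans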

Finally, I visualize the results using the code from InterHand2.6M:

    # joint set information is in annotations/skeleton.txt
    joint_num = 21 # single hand
    joint_type = {'right': np.arange(0,joint_num), 'left': np.arange(joint_num,joint_num*2)}
    skeleton = load_skeleton(osp.join('data/InterHand/annotations/skeleton.txt'), joint_num*2)

    img = img[0].cpu().numpy().transpose(1,2,0) # cfg.input_img_shape[1], cfg.input_img_shape[0], 3
    joint_coord = out['joint_coord'][0].cpu().numpy() # x,y pixel, z root-relative discretized depth
    rel_root_depth = out['rel_root_depth'][0].cpu().numpy() # discretized depth
    hand_type = out['hand_type'][0].cpu().numpy() # handedness probability

    # restore joint coord to original image space and continuous depth space
    joint_coord[:,0] = joint_coord[:,0] / cfg.output_hm_shape[2] * cfg.input_img_shape[1]
    joint_coord[:,1] = joint_coord[:,1] / cfg.output_hm_shape[1] * cfg.input_img_shape[0]
    joint_coord[:,:2] = np.dot(inv_trans, np.concatenate((joint_coord[:,:2], np.ones_like(joint_coord[:,:1])),1).transpose(1,0)).transpose(1,0)
    joint_coord[:,2] = (joint_coord[:,2]/cfg.output_hm_shape[0] * 2 - 1) * (cfg.bbox_3d_size/2)

    # restore right hand-relative left hand depth to continuous depth space
    rel_root_depth = (rel_root_depth/cfg.output_root_hm_shape * 2 - 1) * (cfg.bbox_3d_size_root/2)

    # right hand root depth == 0, left hand root depth == rel_root_depth
    joint_coord[joint_type['left'],2] += rel_root_depth

    # handedness
    joint_valid = np.zeros((joint_num*2), dtype=np.float32)
    right_exist = False
    if hand_type[0] > 0.5: 
        right_exist = True
        joint_valid[joint_type['right']] = 1
    left_exist = False
    if hand_type[1] > 0.5:
        left_exist = True
        joint_valid[joint_type['left']] = 1

    print('Right hand exist: ' + str(right_exist) + ' Left hand exist: ' + str(left_exist))

    # visualize joint coord in 2D space
    filename = f"result_{img_path.split('.')[0]}.jpg"
    vis_img = original_img.copy()[:,:,::-1].transpose(2,0,1)
    vis_img = vis_keypoints(vis_img, joint_coord, joint_valid, skeleton, filename, vis_dir='.')

Running on the test image from InterHand2.6M, I get the result shown in the attached image.

I'm closing this issue now.

@anilesec

@kevinhartyanyi
Thanks for the information. I am also trying to do this. Would you be able to share the modified demo script?
It would be helpful and save me time. Thanks in advance!

@kevinhartyanyi
Author

Hi @anilesec
I forked the repo and uploaded the changes. You can find it here.

@anilesec

Thank you @kevinhartyanyi
