
About the training loop #28

Closed
murdockhou opened this issue Oct 22, 2019 · 8 comments
Labels
question Further information is requested

Comments

@murdockhou

Hi, sorry to bother you again!

I'm a little confused about the training data code here. I see that you return the ori_img together with the corresponding keypoint annotations (and one image may contain annotations for several people). Since HRNet runs as a single-person pose network, how can you train the network on the original image instead of an image cropped from the human bounding box annotation? Is that crop done somewhere?

To be more specific, I think that to train HRNet (a single-person pose network) on the MS COCO dataset we need to crop one or more single-person images from each original image, because an image may contain multiple person annotations, just like you do in live-demo.py.

So, could you tell me whether my understanding is right? To be honest, I'm also confused about how to train a single-person pose network using the MS COCO dataset.

Thanks a lot.

@stefanopini
Owner

Hi!

Maybe I'm missing the point of your question.
HRNet is a single-person HPE method, therefore the input images should contain only one person (which is what happens both during training and testing).
MS COCO provides both the keypoint annotations and the person bounding boxes, so it is possible to create a different sample (with a different image crop) for each person using the keypoint and bounding box annotations.
Each annotation is added to a list in these lines:

self.data = []
# load annotations for each image of COCO
for imgId in tqdm(self.imgIds):
    ann_ids = self.coco.getAnnIds(imgIds=imgId, iscrowd=False)
    img = self.coco.loadImgs(imgId)[0]

    if self.use_gt_bboxes:
        objs = self.coco.loadAnns(ann_ids)

        # sanitize bboxes
        valid_objs = []
        for obj in objs:
            # Skip non-person objects (it should never happen)
            if obj['category_id'] != 1:
                continue

            # ignore objs without keypoints annotation
            if max(obj['keypoints']) == 0:
                continue

            x, y, w, h = obj['bbox']
            x1 = np.max((0, x))
            y1 = np.max((0, y))
            x2 = np.min((img['width'] - 1, x1 + np.max((0, w - 1))))
            y2 = np.min((img['height'] - 1, y1 + np.max((0, h - 1))))

            # Use only valid bounding boxes
            if obj['area'] > 0 and x2 >= x1 and y2 >= y1:
                obj['clean_bbox'] = [x1, y1, x2 - x1, y2 - y1]
                valid_objs.append(obj)

        objs = valid_objs

    else:
        objs = bboxes[imgId]

    # for each annotation of this image, add the formatted annotation to self.data
    for obj in objs:
        joints = np.zeros((self.nof_joints, 2), dtype=np.float)
        joints_visibility = np.ones((self.nof_joints, 2), dtype=np.float)

        if self.use_gt_bboxes:
            # COCO pre-processing
            # # Moved above
            # # Skip non-person objects (it should never happen)
            # if obj['category_id'] != 1:
            #     continue
            #
            # # ignore objs without keypoints annotation
            # if max(obj['keypoints']) == 0:
            #     continue

            for pt in range(self.nof_joints):
                joints[pt, 0] = obj['keypoints'][pt * 3 + 0]
                joints[pt, 1] = obj['keypoints'][pt * 3 + 1]
                t_vis = int(np.clip(obj['keypoints'][pt * 3 + 2], 0, 1))  # ToDo check correctness
                # COCO:
                #   if visibility == 0 -> keypoint is not in the image.
                #   if visibility == 1 -> keypoint is in the image BUT not visible (e.g. behind an object).
                #   if visibility == 2 -> keypoint looks clearly (i.e. it is not hidden).
                joints_visibility[pt, 0] = t_vis
                joints_visibility[pt, 1] = t_vis

        center, scale = self._box2cs(obj['clean_bbox'][:4])

        self.data.append({
            'imgId': imgId,
            'annId': obj['id'],
            'imgPath': os.path.join(self.root_path, self.data_version, '%012d.jpg' % imgId),
            'center': center,
            'scale': scale,
            'joints': joints,
            'joints_visibility': joints_visibility,
        })
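
Note that this loop adds one entry to self.data for each person annotation, so an image containing several annotated people yields several single-person samples.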

Then each image is cropped (to extract a specific person) and rescaled with an affine warping in:
trans = get_affine_transform(c, s, self.pixel_std, r, self.image_size)
image = cv2.warpAffine(
    image,
    trans,
    (int(self.image_size[0]), int(self.image_size[1])),
    flags=cv2.INTER_LINEAR
)
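
For reference, here is a minimal self-contained sketch of this crop-and-rescale idea (not the repository code; the function name and the 288x384 output size are just illustrative assumptions):

import cv2
import numpy as np

def crop_person_with_affine(image, bbox, out_size=(288, 384)):
    # bbox is (x, y, w, h); out_size is (width, height) of the network input.
    x, y, w, h = bbox
    out_w, out_h = out_size

    # Map three corners of the box onto three corners of the output image;
    # cv2.getAffineTransform needs exactly three point correspondences.
    src = np.float32([[x, y], [x + w, y], [x, y + h]])
    dst = np.float32([[0, 0], [out_w - 1, 0], [0, out_h - 1]])
    trans = cv2.getAffineTransform(src, dst)

    # A single warp both crops the person area and rescales it to the input size.
    return cv2.warpAffine(image, trans, (out_w, out_h), flags=cv2.INTER_LINEAR)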

Does this answer your question?

@stefanopini added the question label Oct 23, 2019
@murdockhou
Author

murdockhou commented Oct 24, 2019 via email

@stefanopini
Owner

Yes, but the images are too many to be stored in RAM (on ordinary machines), so you have to load them from disk and, since the samples are shuffled during training, you have to re-load the same image at different steps of each epoch.
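
A minimal sketch of what this means in practice (the class and field names are assumptions for illustration, not the repository code): the dataset keeps only the lightweight annotation dicts in memory and reads each image from disk every time one of its samples is drawn.

import cv2
from torch.utils.data import Dataset, DataLoader

class LazyCocoPersonDataset(Dataset):
    def __init__(self, samples):
        # samples: a list with one dict per person annotation
        # ('imgPath', 'center', 'scale', 'joints', 'joints_visibility', ...).
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        # The image is read from disk here, every time the sample is drawn;
        # keeping the whole decoded dataset in RAM is usually not feasible.
        image = cv2.imread(sample['imgPath'], cv2.IMREAD_COLOR)
        return image, sample['joints']

# With shuffle=True the same image may be loaded again at different steps of
# the same epoch, once for each person annotation it contains (cropping and
# resizing, omitted here, would be needed before batching).
# loader = DataLoader(LazyCocoPersonDataset(samples), batch_size=16, shuffle=True)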

@murdockhou
Author

@stefanopini Sorry to bother you again. I see that, when creating the dataset, you use the get_affine_transform and cv2.warpAffine functions to extract the single-person area (dataset/COCO.py, line 293). I'm a little confused: why don't you use a crop function to crop the person area directly from ori_img? Is there much difference between these two approaches?

@stefanopini
Owner

Hi @murdockhou !

The difference is that with warpAffine you can apply affine transformations instead of just cropping the person area.
This is not useful during evaluation/testing, but it is used during training for data augmentation.
If you look at the previous lines of the file (L258-L296), you can see that when self.is_train is False the parameters passed to get_affine_transform simply crop the image, while when self.is_train is True they are modified to change the scale and to rotate and flip the person area for data augmentation (see the sketch below).
I hope it is clearer now.

Btw, I've adapted this code from the original implementation and some details are still unclear to me.
In particular, I don't know the meaning of the parameter pixel_std (see line 109).
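
To illustrate the crop-vs-augmentation point above, here is a rough sketch (assumed parameter ranges and function names, not the repository code) of how the same three-point affine warp can either just crop the box (evaluation) or also apply scale jitter, rotation and a horizontal flip (training-time augmentation):

import random
import cv2
import numpy as np

def person_affine(image, bbox, out_size=(288, 384), is_train=False):
    x, y, w, h = bbox
    out_w, out_h = out_size
    cx, cy = x + w / 2.0, y + h / 2.0   # box center

    scale, rot_deg, flip = 1.0, 0.0, False
    if is_train:
        scale = 1.0 + random.uniform(-0.35, 0.35)                         # scale jitter
        rot_deg = random.uniform(-45, 45) if random.random() < 0.6 else 0.0
        flip = random.random() < 0.5

    # Three corners of the (possibly enlarged) box, rotated around its center.
    half_w, half_h = w * scale / 2.0, h * scale / 2.0
    corners = np.float32([[-half_w, -half_h], [half_w, -half_h], [-half_w, half_h]])
    rot = np.deg2rad(rot_deg)
    rot_mat = np.float32([[np.cos(rot), -np.sin(rot)], [np.sin(rot), np.cos(rot)]])
    src = corners @ rot_mat.T + np.float32([cx, cy])

    dst = np.float32([[0, 0], [out_w - 1, 0], [0, out_h - 1]])
    if flip:
        # Mirror horizontally by swapping the left/right destination corners.
        # (In real keypoint training the left/right joint labels must be swapped too.)
        dst = np.float32([[out_w - 1, 0], [0, 0], [out_w - 1, out_h - 1]])

    trans = cv2.getAffineTransform(src, dst)
    return cv2.warpAffine(image, trans, (out_w, out_h), flags=cv2.INTER_LINEAR)

# With is_train=False this reduces to a plain crop + resize of the box, which is
# why a simple crop would be enough for inference but not for training.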

@murdockhou
Author

murdockhou commented Apr 13, 2020 via email

@valentin-fngr

Thank you both for clarifying my understanding.
One question: @stefanopini mentioned that we can only detect a single person per image.
When you say 'image', do you mean the cropped and rescaled bounding box area of the image?

@stefanopini
Owner

Hi @valentin-fngr , that's correct.
The HRNet model is designed as a top-down approach: person detection first (with almost any detector), then human pose estimation on the single person bounding box area with HRNet.
In contrast, HigherHRNet is a bottom-up multi-person approach.
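
As a tiny hypothetical sketch of the top-down flow (the person_detector, crop_fn and hrnet callables are assumptions, not this repository's API):

def top_down_pose_estimation(image, person_detector, crop_fn, hrnet):
    poses = []
    # 1. Person detection on the full image (almost any detector works).
    for bbox in person_detector(image):
        # 2. Crop and rescale the single-person area to the HRNet input size.
        person_crop = crop_fn(image, bbox)
        # 3. Single-person pose estimation on that crop only.
        poses.append(hrnet(person_crop))
    return poses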
