Question about Training #11

lutianhao · 2022-05-10T07:46:02Z

Hi, I am trying to train on my COCO-style dataset with your scripts, but I don't know which bash script should be used for training. (Could you please briefly explain what each script does?) I used "scripts/train/lambda/coco/train.sh" for training, but an error occurred.

cd /data_2/lutianhao/code/MIPNet/
CUDA_VISIBLE_DEVICES=4,5,6,7, python tools/lambda/train_lambda_real.py \
--cfg experiments/coco/hrnet/w48_384x288_adam_lr1e-3.yaml \
GPUS '(0,1,2,3,)' \
OUTPUT_DIR 'Outputs/outputs/lambda/lambda_coco_real_waffle' \
LOG_DIR 'Outputs/logs/lambda/lambda_coco_real_waffle' \
TEST.MODEL_FILE 'models/pytorch/pose_coco/pose_hrnet_w48_384x288.pth' \
DATASET.TRAIN_DATASET 'coco_lambda' \
DATASET.TRAIN_SET 'train2017' \
DATASET.TRAIN_IMAGE_DIR '/data_2/lutianhao/datasets/pose/coco2017/train2017' \
DATASET.TRAIN_ANNOTATION_FILE '/data_2/lutianhao/datasets/pose/coco2017/annotations/person_keypoints_train2017.json' \
DATASET.TRAIN_DATASET_TYPE 'coco_lambda' \
DATASET.TEST_DATASET 'coco' \
DATASET.TEST_SET 'val2017' \
DATASET.TEST_IMAGE_DIR '/data_2/lutianhao/datasets/pose/coco2017/val2017' \
DATASET.TEST_ANNOTATION_FILE '/data_2/lutianhao/datasets/pose/coco2017/annotations/person_keypoints_val2017.json' \
DATASET.TEST_DATASET_TYPE 'coco' \
TRAIN.LR 0.001 \
TRAIN.BEGIN_EPOCH 0 \
TRAIN.END_EPOCH 110 \
TRAIN.LR_STEP '(70, 100)' \
TRAIN.BATCH_SIZE_PER_GPU 2 \
TEST.BATCH_SIZE_PER_GPU 1 \
TEST.USE_GT_BBOX True \
EPOCH_EVAL_FREQ 1 \
PRINT_FREQ 100 \
MODEL.NAME 'pose_hrnet_se_lambda' \
MODEL.SE_MODULES '[False, False, True, True]'

And the error is:

GAMMA1: 0.99
GAMMA2: 0.0
LR: 0.001
LR_FACTOR: 0.1
LR_STEP: [70, 100]
MOMENTUM: 0.9
NESTEROV: False
OPTIMIZER: adam
RESUME: False
SHUFFLE: True
WD: 0.0001
WORKERS: 24
=> init weights from normal distribution
=> loading pretrained model models/pytorch/imagenet/hrnet_w48-8ef0771d.pth

Total Parameters: 63,746,081

Total Multiply Adds (For Convolution and Linear Layers only): 46.562052726745605 GFLOPs

Number of Layers
Conv2d : 293 layers BatchNorm2d : 292 layers ReLU : 271 layers Bottleneck : 4 layers BasicBlock : 104 layers Upsample : 28 layers HighResolutionModule : 8 layers
AdaptiveAvgPool2d : 5 layers Linear : 20 layers Sigmoid : 10 layers BatchNorm1d : 5 layers SELambdaLayer : 5 layers SELambdaModule : 2 layers
=> loading model from models/pytorch/pose_coco/pose_hrnet_w48_384x288.pth
=> loading from latest_state_dict at models/pytorch/pose_coco/pose_hrnet_w48_384x288.pth
loading annotations into memory...
Done (t=31.87s)
creating index...
index created!
=> classes: ['background', 'person']
=> num_images: 118287
loading from cache from cache/coco_lambda/train2017/gt_db.pkl
done!
=> load 149813 samples
loading annotations into memory...
Done (t=4.04s)
creating index...
index created!
=> classes: ['background', 'person']
=> num_images: 5000
=> load 6352 samples
=> resuming optimizer from models/pytorch/pose_coco/pose_hrnet_w48_384x288.pth
=> updated lr schedule is [70, 100]

training on lambda
Epoch: [0][0/18727] Time 64.338s (64.338s) Speed 0.2 samples/s Data 10.114s (10.114s) Loss 0.00020 (0.00020) Accuracy 0.513 (0.513) model_grad 0.000568 (0.000568) DivLoss -0.00074 (-0.00074) PoseLoss 0.00020 (0.00020)
Traceback (most recent call last):
File "tools/lambda/train_lambda_real.py", line 280, in <module>
main()
File "tools/lambda/train_lambda_real.py", line 242, in main
final_output_dir, tb_log_dir, writer_dict, print_prefix='lambda')
File "/data_2/lutianhao/code/MIPNet/tools/lambda/../../lib/core/train.py", line 464, in train_lambda
suffix += '_[{}:{}]'.format(count, round(lambda_a[count + B].item(), 2))
IndexError: index 16 is out of bounds for dimension 0 with size 16

lutianhao · 2022-05-10T08:54:02Z

I've solved this error, but I'd like to know the meaning of "size=16". Could you please explain a little? Thanks!

rawalkhirodkar · 2022-05-10T16:18:20Z

Glad the error is resolved.
The size of 16 is the number of samples visualized during training; see https://github.com/rawalkhirodkar/MIPNet/blob/505c92ec59ac79686a217dac45eb188fc38b8499/lib/core/train.py#L464

It looks like the error was due to the batch size being less than 16. In that case, you can change this constant to something smaller.
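
As an alternative to editing the constant by hand, the visualization count can be clamped to the batch size. The following is a minimal sketch, not the MIPNet code: only `lambda_a`, `count`, `B`, and the suffix format come from the traceback above. The function name, `MAX_VIS_SAMPLES`, and the assumption that `lambda_a` holds `2 * B` entries (consistent with a size of 16 when 4 GPUs each take 2 samples) are hypothetical. Plain Python lists stand in for tensors, so `float(...)` replaces `.item()`.

```python
# Hypothetical stand-in for the hard-coded visualization count discussed above.
MAX_VIS_SAMPLES = 16


def build_suffix(lambda_a, B, max_vis=MAX_VIS_SAMPLES):
    """Build the '_[count:value]' suffix without indexing past lambda_a."""
    # Assumption: entries B .. len(lambda_a) - 1 are the ones being visualized,
    # so at most len(lambda_a) - B of them can be read safely.
    num_vis = min(max_vis, len(lambda_a) - B)
    suffix = ''
    for count in range(num_vis):
        suffix += '_[{}:{}]'.format(count, round(float(lambda_a[count + B]), 2))
    return suffix


# Example: 4 GPUs x 2 samples per GPU gives B = 8 and, under the assumption
# above, 16 entries in lambda_a.
B = 8
lambda_a = [0.0] * B + [1.0] * B
suffix = build_suffix(lambda_a, B)
```

With these values the loop stops after 8 entries, whereas iterating a fixed 16 times would reach `lambda_a[16]` at `count = 8` and raise the `IndexError` shown in the log.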

rawalkhirodkar closed this as completed May 10, 2022
