Question about Training #11

Closed
lutianhao opened this issue May 10, 2022 · 2 comments


@lutianhao

Hi, I am trying to train on my COCO-style dataset with your scripts, but I don't know which bash script to use for training. (Could you please briefly explain what each script does?) I used "scripts/train/lambda/coco/train.sh" for training, but an error occurred.

cd /data_2/lutianhao/code/MIPNet/
CUDA_VISIBLE_DEVICES=4,5,6,7, python tools/lambda/train_lambda_real.py
--cfg experiments/coco/hrnet/w48_384x288_adam_lr1e-3.yaml
GPUS '(0,1,2,3,)'
OUTPUT_DIR 'Outputs/outputs/lambda/lambda_coco_real_waffle'
LOG_DIR 'Outputs/logs/lambda/lambda_coco_real_waffle'
TEST.MODEL_FILE 'models/pytorch/pose_coco/pose_hrnet_w48_384x288.pth'
DATASET.TRAIN_DATASET 'coco_lambda'
DATASET.TRAIN_SET 'train2017'
DATASET.TRAIN_IMAGE_DIR '/data_2/lutianhao/datasets/pose/coco2017/train2017'
DATASET.TRAIN_ANNOTATION_FILE '/data_2/lutianhao/datasets/pose/coco2017/annotations/person_keypoints_train2017.json'
DATASET.TRAIN_DATASET_TYPE 'coco_lambda'
DATASET.TEST_DATASET 'coco'
DATASET.TEST_SET 'val2017'
DATASET.TEST_IMAGE_DIR '/data_2/lutianhao/datasets/pose/coco2017/val2017'
DATASET.TEST_ANNOTATION_FILE '/data_2/lutianhao/datasets/pose/coco2017/annotations/person_keypoints_val2017.json'
DATASET.TEST_DATASET_TYPE 'coco'
TRAIN.LR 0.001
TRAIN.BEGIN_EPOCH 0
TRAIN.END_EPOCH 110
TRAIN.LR_STEP '(70, 100)'
TRAIN.BATCH_SIZE_PER_GPU 2
TEST.BATCH_SIZE_PER_GPU 1
TEST.USE_GT_BBOX True
EPOCH_EVAL_FREQ 1
PRINT_FREQ 100
MODEL.NAME 'pose_hrnet_se_lambda'
MODEL.SE_MODULES '[False, False, True, True]'

The error is:

GAMMA1: 0.99
GAMMA2: 0.0
LR: 0.001
LR_FACTOR: 0.1
LR_STEP: [70, 100]
MOMENTUM: 0.9
NESTEROV: False
OPTIMIZER: adam
RESUME: False
SHUFFLE: True
WD: 0.0001
WORKERS: 24
=> init weights from normal distribution
=> loading pretrained model models/pytorch/imagenet/hrnet_w48-8ef0771d.pth

Total Parameters: 63,746,081

Total Multiply Adds (For Convolution and Linear Layers only): 46.562052726745605 GFLOPs

Number of Layers
Conv2d : 293 layers BatchNorm2d : 292 layers ReLU : 271 layers Bottleneck : 4 layers BasicBlock : 104 layers Upsample : 28 layers HighResolutionModule : 8 layers AdaptiveAvgPool2d : 5 layers Linear : 20 layers Sigmoid : 10 layers BatchNorm1d : 5 layers SELambdaLayer : 5 layers SELambdaModule : 2 layers
=> loading model from models/pytorch/pose_coco/pose_hrnet_w48_384x288.pth
=> loading from latest_state_dict at models/pytorch/pose_coco/pose_hrnet_w48_384x288.pth
loading annotations into memory...
Done (t=31.87s)
creating index...
index created!
=> classes: ['background', 'person']
=> num_images: 118287
loading from cache from cache/coco_lambda/train2017/gt_db.pkl
done!
=> load 149813 samples
loading annotations into memory...
Done (t=4.04s)
creating index...
index created!
=> classes: ['background', 'person']
=> num_images: 5000
=> load 6352 samples
=> resuming optimizer from models/pytorch/pose_coco/pose_hrnet_w48_384x288.pth
=> updated lr schedule is [70, 100]

training on lambda
Epoch: [0][0/18727] Time 64.338s (64.338s) Speed 0.2 samples/s Data 10.114s (10.114s) Loss 0.00020 (0.00020) Accuracy 0.513 (0.513) model_grad 0.000568 (0.000568) DivLoss -0.00074 (-0.00074) PoseLoss 0.00020 (0.00020)
Traceback (most recent call last):
File "tools/lambda/train_lambda_real.py", line 280, in <module>
main()
File "tools/lambda/train_lambda_real.py", line 242, in main
final_output_dir, tb_log_dir, writer_dict, print_prefix='lambda')
File "/data_2/lutianhao/code/MIPNet/tools/lambda/../../lib/core/train.py", line 464, in train_lambda
suffix += '_[{}:{}]'.format(count, round(lambda_a[count + B].item(), 2))
IndexError: index 16 is out of bounds for dimension 0 with size 16

@lutianhao
Author

I've solved this error, but I'd like to know what "size=16" means. Could you please explain briefly? Thanks!

@rawalkhirodkar
Owner

Glad the error is resolved.
The size of 16 is used for visualizing samples during training; see https://github.com/rawalkhirodkar/MIPNet/blob/505c92ec59ac79686a217dac45eb188fc38b8499/lib/core/train.py#L464

It looks like the error was caused by a batch size smaller than 16. In that case, you can change this constant to something smaller.
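
As an alternative to editing the constant by hand, the visualization loop could be capped at the batch size. The snippet below is only a rough sketch of that idea, not the code in the repository: `build_suffix` and `MAX_VIS_SAMPLES` are hypothetical names, and it assumes `lambda_a` holds 2 * B values with the second half (indices B to 2B-1) being the ones shown, which is inferred from the traceback rather than confirmed.

```python
# Hypothetical sketch: only lambda_a, B, count, and the format string come
# from the traceback. Every other name here is illustrative.
MAX_VIS_SAMPLES = 16  # assumed name for the hard-coded visualization constant


def build_suffix(lambda_a, B, max_vis=MAX_VIS_SAMPLES):
    """Build the debug-image suffix without indexing past the batch.

    Assumes lambda_a has 2 * B entries and that entries B .. 2B-1
    are the ones to display (an assumption, not verified).
    """
    # Never visualize more samples than the batch actually contains.
    num_vis = min(max_vis, B)
    suffix = ''
    for count in range(num_vis):
        # float() accepts both Python numbers and 0-dim torch tensors.
        value = round(float(lambda_a[count + B]), 2)
        suffix += '_[{}:{}]'.format(count, value)
    return suffix
```

With a total batch of 8, this reads only indices 8 to 15 of a 16-element `lambda_a`, so the out-of-bounds access at index 16 would not occur under these assumptions.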
