Loss becomes NaN after 20 or 40 iterations. #322

Open
hheavenknowss opened this issue Dec 29, 2020 · 4 comments

Comments

@hheavenknowss

Hi there, thanks for your work! I have a problem with training; here is my log.

nohup: ignoring input
2020-12-28 12:35:18,951 fcos_core INFO: Using 2 GPUs
2020-12-28 12:35:18,951 fcos_core INFO: Namespace(config_file='configs/fcos/fcos_imprv_dcnv2_X_101_64x4d_FPN_2x.yaml', distributed=True, local_rank=0, opts=['DATALOADER.NUM_WORKERS', '2', 'OUTPUT_DIR', 'training_dir/fcos_Decathlon'], skip_test=False)
2020-12-28 12:35:18,951 fcos_core INFO: Collecting env info (might take some time)
2020-12-28 12:35:21,205 fcos_core INFO:
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.3 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti

Nvidia driver version: 384.111
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.0.5

Versions of relevant libraries:
[pip3] msgpack-numpy==0.4.3.2
[pip3] numpy==1.15.0
[pip3] numpydoc==0.7.0
[pip3] pytorch-pretrained-bert==0.6.2
[pip3] torch==1.1.0
[pip3] torchfile==0.1.0
[pip3] torchtext==0.3.1
[pip3] torchvision==0.3.0
[conda] Could not collect
Pillow (5.0.0)
2020-12-28 12:35:21,206 fcos_core INFO: Loaded configuration file configs/fcos/fcos_imprv_dcnv2_X_101_64x4d_FPN_2x.yaml
2020-12-28 12:35:21,206 fcos_core INFO:
MODEL:
  META_ARCHITECTURE: "GeneralizedRCNN"
  WEIGHT: "https://cloudstor.aarnet.edu.au/plus/s/k3ys35075jmU1RP/download#X-101-64x4d.pkl"
  RPN_ONLY: True
  FCOS_ON: True
  BACKBONE:
    CONV_BODY: "R-101-FPN-RETINANET"
  RESNETS:
    STRIDE_IN_1X1: False
    BACKBONE_OUT_CHANNELS: 256
    NUM_GROUPS: 64
    WIDTH_PER_GROUP: 4
    STAGE_WITH_DCN: (False, False, True, True)
    WITH_MODULATED_DCN: True
    DEFORMABLE_GROUPS: 1
  RETINANET:
    USE_C5: False # FCOS uses P5 instead of C5
  FCOS:
    # normalizing the regression targets with FPN strides
    NORM_REG_TARGETS: True
    # positioning centerness on the regress branch.
    # Please refer to #89 (comment)
    CENTERNESS_ON_REG: True
    # using center sampling and GIoU.
    # Please refer to https://github.com/yqyao/FCOS_PLUS
    CENTER_SAMPLING_RADIUS: 1.5
    IOU_LOSS_TYPE: "giou"
    # we only use dcn in the last layer of towers
    USE_DCN_IN_TOWER: True
DATASETS:
  TRAIN: ("coco_Decathlon_train",)
  TEST: ("coco_Decathlon_val",)
INPUT:
  MIN_SIZE_RANGE_TRAIN: (640, 800)
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 800
  MAX_SIZE_TEST: 1333
DATALOADER:
  SIZE_DIVISIBILITY: 32
SOLVER:
  BASE_LR: 0.01
  WEIGHT_DECAY: 0.0001
  STEPS: (120000, 160000)
  MAX_ITER: 180000
  IMS_PER_BATCH: 10
  WARMUP_METHOD: "constant"
TEST:
  BBOX_AUG:
    ENABLED: False
    H_FLIP: True
    SCALES: (400, 500, 600, 700, 900, 1000, 1100, 1200)
    MAX_SIZE: 2000
    SCALE_H_FLIP: True

2020-12-28 12:35:21,207 fcos_core INFO: Running with config:
DATALOADER:
  ASPECT_RATIO_GROUPING: True
  NUM_WORKERS: 2
  SIZE_DIVISIBILITY: 32
DATASETS:
  TEST: ('coco_Decathlon_val',)
  TRAIN: ('coco_Decathlon_train',)
INPUT:
  MAX_SIZE_TEST: 1333
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_RANGE_TRAIN: (640, 800)
  MIN_SIZE_TEST: 800
  MIN_SIZE_TRAIN: (800,)
  PIXEL_MEAN: [102.9801, 115.9465, 122.7717]
  PIXEL_STD: [1.0, 1.0, 1.0]
  TO_BGR255: True
MODEL:
  BACKBONE:
    CONV_BODY: R-101-FPN-RETINANET
    FREEZE_CONV_BODY_AT: 2
    USE_GN: False
  CLS_AGNOSTIC_BBOX_REG: False
  DEVICE: cuda
  FBNET:
    ARCH: default
    ARCH_DEF:
    BN_TYPE: bn
    DET_HEAD_BLOCKS: []
    DET_HEAD_LAST_SCALE: 1.0
    DET_HEAD_STRIDE: 0
    DW_CONV_SKIP_BN: True
    DW_CONV_SKIP_RELU: True
    KPTS_HEAD_BLOCKS: []
    KPTS_HEAD_LAST_SCALE: 0.0
    KPTS_HEAD_STRIDE: 0
    MASK_HEAD_BLOCKS: []
    MASK_HEAD_LAST_SCALE: 0.0
    MASK_HEAD_STRIDE: 0
    RPN_BN_TYPE:
    RPN_HEAD_BLOCKS: 0
    SCALE_FACTOR: 1.0
    WIDTH_DIVISOR: 1
  FCOS:
    CENTERNESS_ON_REG: True
    CENTER_SAMPLING_RADIUS: 1.5
    FPN_STRIDES: [8, 16, 32, 64, 128]
    INFERENCE_TH: 0.05
    IOU_LOSS_TYPE: giou
    LOSS_ALPHA: 0.25
    LOSS_GAMMA: 2.0
    NMS_TH: 0.6
    NORM_REG_TARGETS: True
    NUM_CLASSES: 2
    NUM_CONVS: 4
    PRE_NMS_TOP_N: 1000
    PRIOR_PROB: 0.01
    USE_DCN_IN_TOWER: True
  FCOS_ON: True
  FPN:
    USE_GN: False
    USE_RELU: False
  GROUP_NORM:
    DIM_PER_GP: -1
    EPSILON: 1e-05
    NUM_GROUPS: 32
  KEYPOINT_ON: False
  MASK_ON: False
  META_ARCHITECTURE: GeneralizedRCNN
  RESNETS:
    BACKBONE_OUT_CHANNELS: 256
    DEFORMABLE_GROUPS: 1
    NUM_GROUPS: 64
    RES2_OUT_CHANNELS: 256
    RES5_DILATION: 1
    STAGE_WITH_DCN: (False, False, True, True)
    STEM_FUNC: StemWithFixedBatchNorm
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: False
    TRANS_FUNC: BottleneckWithFixedBatchNorm
    WIDTH_PER_GROUP: 4
    WITH_MODULATED_DCN: True
  RETINANET:
    ANCHOR_SIZES: (32, 64, 128, 256, 512)
    ANCHOR_STRIDES: (8, 16, 32, 64, 128)
    ASPECT_RATIOS: (0.5, 1.0, 2.0)
    BBOX_REG_BETA: 0.11
    BBOX_REG_WEIGHT: 4.0
    BG_IOU_THRESHOLD: 0.4
    FG_IOU_THRESHOLD: 0.5
    INFERENCE_TH: 0.05
    LOSS_ALPHA: 0.25
    LOSS_GAMMA: 2.0
    NMS_TH: 0.4
    NUM_CLASSES: 81
    NUM_CONVS: 4
    OCTAVE: 2.0
    PRE_NMS_TOP_N: 1000
    PRIOR_PROB: 0.01
    SCALES_PER_OCTAVE: 3
    STRADDLE_THRESH: 0
    USE_C5: False
  RETINANET_ON: False
  ROI_BOX_HEAD:
    CONV_HEAD_DIM: 256
    DILATION: 1
    FEATURE_EXTRACTOR: ResNet50Conv5ROIFeatureExtractor
    MLP_HEAD_DIM: 1024
    NUM_CLASSES: 81
    NUM_STACKED_CONVS: 4
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_SCALES: (0.0625,)
    PREDICTOR: FastRCNNPredictor
    USE_GN: False
  ROI_HEADS:
    BATCH_SIZE_PER_IMAGE: 512
    BBOX_REG_WEIGHTS: (10.0, 10.0, 5.0, 5.0)
    BG_IOU_THRESHOLD: 0.5
    DETECTIONS_PER_IMG: 100
    FG_IOU_THRESHOLD: 0.5
    NMS: 0.5
    POSITIVE_FRACTION: 0.25
    SCORE_THRESH: 0.05
    USE_FPN: False
  ROI_KEYPOINT_HEAD:
    CONV_LAYERS: (512, 512, 512, 512, 512, 512, 512, 512)
    FEATURE_EXTRACTOR: KeypointRCNNFeatureExtractor
    MLP_HEAD_DIM: 1024
    NUM_CLASSES: 17
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_SCALES: (0.0625,)
    PREDICTOR: KeypointRCNNPredictor
    RESOLUTION: 14
    SHARE_BOX_FEATURE_EXTRACTOR: True
  ROI_MASK_HEAD:
    CONV_LAYERS: (256, 256, 256, 256)
    DILATION: 1
    FEATURE_EXTRACTOR: ResNet50Conv5ROIFeatureExtractor
    MLP_HEAD_DIM: 1024
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_SCALES: (0.0625,)
    POSTPROCESS_MASKS: False
    POSTPROCESS_MASKS_THRESHOLD: 0.5
    PREDICTOR: MaskRCNNC4Predictor
    RESOLUTION: 14
    SHARE_BOX_FEATURE_EXTRACTOR: True
    USE_GN: False
  RPN:
    ANCHOR_SIZES: (32, 64, 128, 256, 512)
    ANCHOR_STRIDE: (16,)
    ASPECT_RATIOS: (0.5, 1.0, 2.0)
    BATCH_SIZE_PER_IMAGE: 256
    BG_IOU_THRESHOLD: 0.3
    FG_IOU_THRESHOLD: 0.7
    FPN_POST_NMS_TOP_N_TEST: 2000
    FPN_POST_NMS_TOP_N_TRAIN: 2000
    MIN_SIZE: 0
    NMS_THRESH: 0.7
    POSITIVE_FRACTION: 0.5
    POST_NMS_TOP_N_TEST: 1000
    POST_NMS_TOP_N_TRAIN: 2000
    PRE_NMS_TOP_N_TEST: 6000
    PRE_NMS_TOP_N_TRAIN: 12000
    RPN_HEAD: SingleConvRPNHead
    STRADDLE_THRESH: 0
    USE_FPN: False
  RPN_ONLY: True
  USE_SYNCBN: False
  WEIGHT: https://cloudstor.aarnet.edu.au/plus/s/k3ys35075jmU1RP/download#X-101-64x4d.pkl
OUTPUT_DIR: training_dir/fcos_Decathlon
PATHS_CATALOG: /data/pxf/FCOS-master/fcos_core/config/paths_catalog.py
SOLVER:
  BASE_LR: 0.01
  BIAS_LR_FACTOR: 2
  CHECKPOINT_PERIOD: 2500
  DCONV_OFFSETS_LR_FACTOR: 1.0
  GAMMA: 0.1
  IMS_PER_BATCH: 10
  MAX_ITER: 180000
  MOMENTUM: 0.9
  STEPS: (120000, 160000)
  WARMUP_FACTOR: 0.3333333333333333
  WARMUP_ITERS: 500
  WARMUP_METHOD: constant
  WEIGHT_DECAY: 0.0001
  WEIGHT_DECAY_BIAS: 0
TEST:
  BBOX_AUG:
    ENABLED: False
    H_FLIP: True
    MAX_SIZE: 2000
    SCALES: (400, 500, 600, 700, 900, 1000, 1100, 1200)
    SCALE_H_FLIP: True
  DETECTIONS_PER_IMG: 100
  EXPECTED_RESULTS: []
  EXPECTED_RESULTS_SIGMA_TOL: 4
  IMS_PER_BATCH: 8


loading annotations into memory...
loading annotations into memory...
Done (t=0.06s)
creating index...
index created!
Done (t=0.06s)
creating index...
index created!
2020-12-28 12:35:23,601 fcos_core.trainer INFO: Start training
/root/anaconda3/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:100: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn("torch.distributed.reduce_op is deprecated, please use "
/root/anaconda3/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:100: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn("torch.distributed.reduce_op is deprecated, please use "
2020-12-28 12:36:34,561 fcos_core.trainer INFO: eta: 7 days, 9:22:30 iter: 20 loss: 13.8627 (14.0002) loss_centerness: 0.6938 (0.7076) loss_cls: 12.2059 (12.4410) loss_reg: 0.8402 (0.8516) time: 3.4431 (3.5479) data: 0.0109 (0.0280) lr: 0.003333 max mem: 9314
2020-12-28 12:37:42,998 fcos_core.trainer INFO: eta: 7 days, 6:12:21 iter: 40 loss: 23.3687 (nan) loss_centerness: 0.7030 (0.7552) loss_cls: 21.8341 (nan) loss_reg: 0.8110 (0.8365) time: 3.3760 (3.4849) data: 0.0098 (0.0190) lr: 0.003333 max mem: 9314
2020-12-28 12:38:54,639 fcos_core.trainer INFO: eta: 7 days, 7:48:16 iter: 60 loss: nan (nan) loss_centerness: nan (nan) loss_cls: nan (nan) loss_reg: nan (nan) time: 3.5713 (3.5173) data: 0.0095 (0.0161) lr: 0.003333 max mem: 9314
2020-12-28 12:40:04,952 fcos_core.trainer INFO: eta: 7 days, 7:45:54 iter: 80 loss: nan (nan) loss_centerness: nan (nan) loss_cls: nan (nan) loss_reg: nan (nan) time: 3.4935 (3.5169) data: 0.0100 (0.0147) lr: 0.003333 max mem: 9314
2020-12-28 12:41:15,379 fcos_core.trainer INFO: eta: 7 days, 7:47:24 iter: 100 loss: nan (nan) loss_centerness: nan (nan) loss_cls: nan (nan) loss_reg: nan (nan) time: 3.5659 (3.5178) data: 0.0098 (0.0137) lr: 0.003333 max mem: 9314
2020-12-28 12:42:26,265 fcos_core.trainer INFO: eta: 7 days, 7:59:31 iter: 120 loss: nan (nan) loss_centerness: nan (nan) loss_cls: nan (nan) loss_reg: nan (nan) time: 3.5816 (3.5222) data: 0.0093 (0.0131) lr: 0.003333 max mem: 9314
2020-12-28 12:43:37,949 fcos_core.trainer INFO: eta: 7 days, 8:24:53 iter: 140 loss: nan (nan) loss_centerness: nan (nan) loss_cls: nan (nan) loss_reg: nan (nan) time: 3.5643 (3.5310) data: 0.0094 (0.0127) lr: 0.003333 max mem: 9314
2020-12-28 12:44:49,125 fcos_core.trainer INFO: eta: 7 days, 8:34:07 iter: 160 loss: nan (nan) loss_centerness: nan (nan) loss_cls: nan (nan) loss_reg: nan (nan) time: 3.5420 (3.5345) data: 0.0104 (0.0125) lr: 0.003333 max mem: 9314
2020-12-28 12:45:59,373 fcos_core.trainer INFO: eta: 7 days, 8:25:33 iter: 180 loss: nan (nan) loss_centerness: nan (nan) loss_cls: nan (nan) loss_reg: nan (nan) time: 3.5570 (3.5321) data: 0.0100 (0.0123) lr: 0.003333 max mem: 9314

I've tried many times. Sometimes the loss is normal, but most of the time it looks like this.
Why does the loss keep becoming NaN after a few iterations? Is something wrong with my setup?
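
A note on reading the trainer lines in the log: assuming the number in parentheses after each metric is a running average (this is how the maskrcnn-benchmark logger that FCOS builds on prints metrics, but it is an assumption and not confirmed in this thread), the iter 40 line shows `loss_cls` with a NaN average while `loss_centerness` and `loss_reg` are still finite. That suggests the classification loss was the first term to go non-finite. The sketch below pulls this out of the log mechanically; the helper name `first_nan_iteration` is made up for illustration.

```python
import math
import re

# Three trainer lines abbreviated from the log (eta, time, data, lr, mem fields dropped).
LOG_LINES = [
    "iter: 20 loss: 13.8627 (14.0002) loss_centerness: 0.6938 (0.7076) "
    "loss_cls: 12.2059 (12.4410) loss_reg: 0.8402 (0.8516)",
    "iter: 40 loss: 23.3687 (nan) loss_centerness: 0.7030 (0.7552) "
    "loss_cls: 21.8341 (nan) loss_reg: 0.8110 (0.8365)",
    "iter: 60 loss: nan (nan) loss_centerness: nan (nan) "
    "loss_cls: nan (nan) loss_reg: nan (nan)",
]

# Matches "name: value (running)" pairs such as "loss_cls: 21.8341 (nan)".
METRIC = re.compile(r"(loss\w*): (\S+) \((\S+)\)")
ITER = re.compile(r"iter: (\d+)")


def first_nan_iteration(lines):
    """Map each metric to the first logged iteration whose parenthesised value is NaN."""
    first = {}
    for line in lines:
        iteration = int(ITER.search(line).group(1))
        for name, _value, running in METRIC.findall(line):
            if math.isnan(float(running)) and name not in first:
                first[name] = iteration
    return first


result = first_nan_iteration(LOG_LINES)
print(result)
```

On these lines, `loss` and `loss_cls` are first flagged at iter 40 and the other two terms at iter 60. This only narrows down where to look; it does not by itself explain the cause.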

@tianzhi0549
Owner

@hheavenknowss Please try to clip the gradients.
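
For reference, a minimal sketch of what clipping by global norm does, written in plain Python so it runs anywhere. The function name and the `max_norm` value are illustrative only; no threshold is given in this thread. In PyTorch the same operation is available as `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)`, normally called between `loss.backward()` and `optimizer.step()`.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient vectors so their combined L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Small epsilon avoids division by zero when all gradients are zero.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm


# Toy gradients for two parameter tensors; combined norm is sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)
```

Gradients whose norm is already below `max_norm` pass through unchanged, so clipping only limits the occasional oversized update step.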

@hheavenknowss
Author

> @hheavenknowss Please try to clip the gradients.

I reduced the learning rate and it works now. Thanks for the reply.
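
For context, the `lr: 0.003333` in the log equals BASE_LR 0.01 times WARMUP_FACTOR 1/3, so the NaN appeared while training was still inside the 500-iteration constant warmup. Below is a rough sketch of a warmup plus step-decay schedule using the SOLVER values printed in the log, assuming fcos_core follows the usual warmup multi-step rule (its scheduler code is not shown in this thread). The `0.005` is a hypothetical lower BASE_LR; the value that actually worked was not reported.

```python
from bisect import bisect_right


def learning_rate(iteration, base_lr, steps=(120000, 160000), gamma=0.1,
                  warmup_factor=1.0 / 3, warmup_iters=500,
                  warmup_method="constant"):
    """Warmup followed by step decay; defaults mirror the SOLVER section of the log."""
    factor = 1.0
    if iteration < warmup_iters:
        if warmup_method == "constant":
            factor = warmup_factor
        elif warmup_method == "linear":
            alpha = iteration / warmup_iters
            factor = warmup_factor * (1 - alpha) + alpha
        else:
            raise ValueError(f"unknown warmup method: {warmup_method}")
    # Multiply by gamma once for every step boundary already passed.
    return base_lr * factor * gamma ** bisect_right(steps, iteration)


print(f"{learning_rate(40, 0.01):.6f}")   # logged config during warmup: 0.003333
print(f"{learning_rate(40, 0.005):.6f}")  # hypothetical lower BASE_LR: 0.001667
```

Since the `opts` in the logged Namespace are KEY VALUE pairs (for example `DATALOADER.NUM_WORKERS 2`), a lower value could presumably be passed the same way by appending something like `SOLVER.BASE_LR 0.005` to the training command.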

@maojiaoli

@hheavenknowss Can you tell me your revised learning rate? I have the same problem. I'm looking forward to your reply.

@hheavenknowss
Author

> @hheavenknowss Can you tell me your revised learning rate? I have the same problem. I'm looking forward to your reply.

Sorry, it has been a while and I no longer remember the exact value. I only remember that I lowered the learning rate. Hope that helps.
