
Transfer Learning with differing number of classes #152

Closed
hxy1051653358 opened this issue Mar 23, 2019 · 31 comments

@hxy1051653358

hxy1051653358 commented Mar 23, 2019

I trained on the VOC dataset myself and wanted to train a new dataset with my own weights. The categories are different, and the following error occurs when resuming training:

Traceback (most recent call last):
  File "/home/hxy/yolov3-pytorch-annotation/train1.py", line 204, in <module>
    var=opt.var,
  File "/home/hxy/yolov3-pytorch-annotation/train1.py", line 49, in train
    model.load_state_dict(checkpoint['model'])
  File "/home/hxy/anaconda3/envs/py1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Darknet:
	size mismatch for module_list.104.conv_104.weight: copying a param with shape torch.Size([75, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([30, 1024, 1, 1]).
	size mismatch for module_list.104.conv_104.bias: copying a param with shape torch.Size([75]) from checkpoint, the shape in current model is torch.Size([30]).
	size mismatch for module_list.116.conv_116.weight: copying a param with shape torch.Size([75, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([10, 512, 1, 1]).
	size mismatch for module_list.116.conv_116.bias: copying a param with shape torch.Size([75]) from checkpoint, the shape in current model is torch.Size([10]).
	size mismatch for module_list.128.conv_128.weight: copying a param with shape torch.Size([75, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([30, 256, 1, 1]).
	size mismatch for module_list.128.conv_128.bias: copying a param with shape torch.Size([75]) from checkpoint, the shape in current model is torch.Size([30]).

How can I solve it?

@gabrielloye

Hi @hxy1051653358,
I'm not sure if this is best practice, but what I did was remove the weights for the YOLO layers from your own weights. This is because the sizes of the YOLO layers in your new model and your old weights do not match (as described in the error message).
Adding the lines below after loading your own weights in the training script will remove the mismatched weights:

if resume:
    checkpoint = torch.load(latest, map_location='cpu')
    mod_weights = removekey(checkpoint['model'], ['module_list.104.conv_104.weight', 'module_list.104.conv_104.bias', 'module_list.116.conv_116.weight', 'module_list.116.conv_116.bias', 'module_list.128.conv_128.weight', 'module_list.128.conv_128.bias'])
    model.load_state_dict(mod_weights, strict=False)

@hxy1051653358
Author

@gabrielloye Thanks for your guidance, I will try your approach

@hxy1051653358
Author

@gabrielloye How can I define removekey?

@gabrielloye

@hxy1051653358 Oh I forgot to include that part as well, my bad. Here:

def removekey(d, listofkeys):
    r = dict(d)
    for key in listofkeys:
        print('key: {} is removed'.format(key))
        r.pop(key)
    return r

@perry0418
Contributor

Make sure your cfg file ends with 'yolov3.cfg', and it will load the darknet53 weights for transfer learning.
See train.py line 58: if cfg.endswith('yolov3.cfg'): cutoff = load_darknet_weights(model, weights + 'darknet53.conv.74')

@100330706

@gabrielloye So with your method, the old weights (randomly initialized or whatever) are used for layers 104, 116 and 128, whereas the rest of the network uses the new transferred weights, is that right? Or does this remove any layers?

@gabrielloye

gabrielloye commented Mar 28, 2019

@100330706 Nope, it doesn't remove any layers. What I did was load the transferred weights and remove the YOLO layers (104, 116, 128 in this case) first. I then load the old weights that fit the model (depending on the cfg you're using) and call the .update() method on them with the transferred weights. This copies all layers from the transferred weights into the old weights except the ones we removed earlier. Finally, you can load the model with this "new" set of weights.
The code snippet below should work when you add it to train.py:

checkpoint = torch.load(latest, map_location=device)  # load checkpoint
mod_weights = removekey(checkpoint['model'],[--list of layers to remove ( i.e. 104, 116, 128)--])
load_darknet_weights(model, weights + 'darknet53.conv.74')
model_dict = model.state_dict()
model_dict.update(mod_weights)
model.load_state_dict(model_dict)

Remember to change the number in the line below according to your config as well, since you're doing transfer learning:

    #Transfer learning (train only YOLO layers)
    for i, (name, p) in enumerate(model.named_parameters()):
        p.requires_grad = True if (p.shape[0] == 30) else False

Note: I'm using the latest version, and I had to comment out the scheduler to make this work.
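
If your YOLO layer indices differ, a more general variant of the same idea (a rough sketch, assuming model and latest are defined as in train.py above) is to drop every checkpoint entry whose shape disagrees with the current model, rather than listing the keys by hand:

checkpoint = torch.load(latest, map_location='cpu')  # load your own checkpoint
model_dict = model.state_dict()  # freshly initialized weights for the new class count
# keep only checkpoint params that exist in the model and match its shapes
matched_weights = {k: v for k, v in checkpoint['model'].items()
                   if k in model_dict and v.shape == model_dict[k].shape}
model_dict.update(matched_weights)  # overwrite everything except the mismatched YOLO layers
model.load_state_dict(model_dict)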

@Ownmarc
Contributor

Ownmarc commented Apr 1, 2019

@gabrielloye I'm getting this error after following the changes you suggested:

File "C:\Users\marcp\Anaconda3\envs\yolov3\lib\site-packages\torch\optim\sgd.py", line 101, in step
    buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: The size of tensor a (255) must match the size of tensor b (78) at non-singleton dimension 0

Did this happen to you ?

Edit: I commented out the following and now it's working:

if checkpoint['optimizer'] is not None:
    # optimizer.load_state_dict(checkpoint['optimizer'])
    # best_loss = checkpoint['best_loss']

@glenn-jocher changed the title from "How to use pre-training weight transfer learning for different categories?" to "Transfer Learning with differing number of classes" on Apr 2, 2019
@glenn-jocher
Member

glenn-jocher commented Apr 2, 2019

@hxy1051653358 @gabrielloye @perry0418 @100330706 @Ownmarc the latest commit should handle transfer learning for various class sizes automatically using a new --transfer flag in train.py.

Transfer learning is performed only on YOLO layers of yolov3.pt, and these YOLO layers may now be any size specified in your *.cfg file. Note that you need to download yolov3.pt first from our Google Drive folder (https://github.com/ultralytics/yolov3#pretrained-weights) to your yolov3/weights/ directory.

Here is a transfer learning example with a single class (18-length YOLO layers) using yolov3-1cls.cfg and coco_1cls.data, which are now added to the repo. coco_1cls.data points to coco_1cls.txt for training and testing, which is available in the Google Drive folder and can be placed in your coco folder to follow our 1-class tutorial: https://github.com/ultralytics/yolov3/wiki/Example:-Train-Single-Class.
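
For reference, the run below was produced with a command along these lines (remaining flags left at their defaults):

python train.py --cfg cfg/yolov3-1cls.cfg --data-cfg data/coco_1cls.data --transfer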

Namespace(accumulate=1, backend='nccl', batch_size=16, cfg='cfg/yolov3-1cls.cfg', data_cfg='data/coco_1cls.data', dist_url='tcp://127.0.0.1:9999', epochs=270, img_size=416, multi_scale=False, num_workers=4, rank=0, resume=False, transfer=True, world_size=1)

Using cpu 

layer                                     name  gradient   parameters                shape         mu      sigma
   0                          0.conv_0.weight     False          864        [32, 3, 3, 3]  -8.67e-05      0.112
   1                    0.batch_norm_0.weight     False           32                 [32]      0.538        0.3
   2                      0.batch_norm_0.bias     False           32                 [32]          0          0
   3                          1.conv_1.weight     False        18432       [64, 32, 3, 3]   0.000231      0.034
...
 218                104.batch_norm_104.weight     False          256                [256]      0.519      0.286
 219                  104.batch_norm_104.bias     False          256                [256]          0          0
 220                      105.conv_105.weight      True         4608      [18, 256, 1, 1]   0.000118     0.0359
 221                        105.conv_105.bias      True           18                 [18]    0.00203     0.0387
Model Summary: 222 layers, 6.15237e+07 parameters, 32310 gradients

  Epoch       Batch        xy        wh      conf       cls     total  nTargets      time
  0/269         0/0      2.11      6.65       140         0       148        17      6.06
     Image      Total          P          R        mAP
Calculating mAP: 100%|██████████| 1/1 [00:05<00:00,  5.50s/it]
         5          5          0          0          0

@vivian-wong

When I tried transfer learning with yolov3-1cls.cfg and coco_1cls.data, I got the following error.

python train.py --cfg cfg/yolov3-1cls.cfg --data-cfg data/coco_1cls.data --transfer
Namespace(accumulate=1, backend='nccl', batch_size=16, cfg='cfg/yolov3-1cls.cfg', data_cfg='data/coco_1cls.data', dist_url='tcp://127.0.0.1:9999', epochs=273, img_size=416, multi_scale=False, nosave=False, num_workers=4, rank=0, resume=False, transfer=True, world_size=1)

Using CUDA device0 _CudaDeviceProperties(name='GeForce GTX 1080 Ti', total_memory=11171MB)
Traceback (most recent call last):
  File "train.py", line 250, in <module>
    num_workers=opt.num_workers
  File "train.py", line 58, in train
    strict=False)
  File "/home/vivian/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Darknet:
	size mismatch for module_list.84.conv_84.weight: copying a param with shape torch.Size([512, 2048, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 512, 1, 1]).
	size mismatch for module_list.84.batch_norm_84.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for module_list.84.batch_norm_84.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for module_list.84.batch_norm_84.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for module_list.84.batch_norm_84.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for module_list.87.conv_87.weight: copying a param with shape torch.Size([1024, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 768, 1, 1]).
	size mismatch for module_list.87.batch_norm_87.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for module_list.87.batch_norm_87.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for module_list.87.batch_norm_87.running_mean: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for module_list.87.batch_norm_87.running_var: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for module_list.96.conv_96.weight: copying a param with shape torch.Size([256, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 256, 1, 1]).
	size mismatch for module_list.96.batch_norm_96.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
	size mismatch for module_list.96.batch_norm_96.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
	size mismatch for module_list.96.batch_norm_96.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
	size mismatch for module_list.96.batch_norm_96.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
	size mismatch for module_list.99.conv_99.weight: copying a param with shape torch.Size([512, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 384, 1, 1]).
	size mismatch for module_list.99.batch_norm_99.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
	size mismatch for module_list.99.batch_norm_99.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
	size mismatch for module_list.99.batch_norm_99.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
	size mismatch for module_list.99.batch_norm_99.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).

How should I go about fixing this? Thank you!

@glenn-jocher
Member

glenn-jocher commented Apr 8, 2019

@vivian-wong yolov3-1cls.cfg is a 1cls derivative of yolov3.cfg, but yolov3-spp.pt was loaded as the default, which is a better performing, newer variant of yolov3. I've changed the default back to yolov3.pt, so if you git pull and retry it will work.

Personally, I would not use transfer learning though; it doesn't save you much time, and you will get better results training normally from darknet53.
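
For example, a normal (non-transfer) run on the same data would simply drop the --transfer flag, something like the following (a sketch; the exact default weights depend on the repo version):

python train.py --cfg cfg/yolov3-1cls.cfg --data-cfg data/coco_1cls.data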

@vivian-wong

Thank you for modifying the default. I am using transfer learning because I would like to train on a smaller dataset which has just one class. I have configured my *.data file as indicated in the tutorial. Now I get:

python train.py --cfg cfg/yolov3-1cls.cfg --data-cfg data/mydata.data --transfer
Namespace(accumulate=1, backend='nccl', batch_size=16, cfg='cfg/yolov3-1cls.cfg', data_cfg='data/mydata.data', dist_url='tcp://127.0.0.1:9999', epochs=273, img_size=416, multi_scale=False, nosave=False, num_workers=4, rank=0, resume=False, transfer=True, world_size=1)

Using CUDA device0 _CudaDeviceProperties(name='GeForce GTX 1080 Ti', total_memory=11171MB)
Traceback (most recent call last):
 File "train.py", line 250, in <module>
   num_workers=opt.num_workers
 File "train.py", line 56, in train
   chkpt = torch.load(weights + 'yolov3.pt', map_location=device)
 File "/home/vivian/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 368, in load
   return _load(f, map_location, pickle_module)
 File "/home/vivian/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 549, in _load
   deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: unexpected EOF, expected 4417085 more bytes. The file might be corrupted.
*** Error in `python': corrupted double-linked list: 0x000055bd3f4cc7e0 ***

I also read your comment that

@hac135 most people don't realize this, and it's not the recommended method to go about things, but you can technically use the existing YOLOv3 architecture (and hence the pretrained yolov3.pt) to train any model with n<=80 classes with no changes. The unused conf outputs will learn to simply default to zero, and the rest of the unused outputs (the box and class conf associated with those unused classes) will no longer matter.

For example, our single class tutorial operates just as well with no modifications to the cfg file:
https://github.com/ultralytics/yolov3/wiki/Example:-Train-Single-Class

It's not clean and it's not optimal, but it works.

So I tried running
python train.py --data-cfg data/mydata.data --transfer
which uses the default yolov3-spp.cfg. It worked (though with pretty bad results...). But this should do the job, right?

@MuhammadAsadJaved

Hi,
I have the same error while using MobileNet-YOLO-V3 with Caffe. I am using this repo:
https://github.com/eric612/MobileNet-YOLO

Here are my error details.
Model: yolov3 (darknet_yolov3)

Question: I have a model pre-trained on 80 classes, and now I am using it to retrain on 2 classes. I have made the necessary changes (classes and outputs) in yolov3_train.prototxt, yolov3_test.prototxt, and solver.prototxt.
But when I run the train_yolov3.sh file, it throws the following error.
Maybe the error is because the previous weights are for 80 classes; can I use these weights to retrain the model on 2 classes?
Here is the error output:
F0916 19:53:43.311493 4202 net.cpp:760] Cannot copy param 0 weights from layer 'layer82-conv'; shape mismatch. Source param shape is 255 1024 1 1 (261120); target param shape is 21 1024 1 1 (21504). To learn this layer's parameters from scratch rather than copying from a saved net, rename the layer.
*** Check failure stack trace: ***
@ 0x7fdf3929b0cd google::LogMessage::Fail()
@ 0x7fdf3929cf33 google::LogMessage::SendToLog()
@ 0x7fdf3929ac28 google::LogMessage::Flush()
@ 0x7fdf3929d999 google::LogMessageFatal::~LogMessageFatal()
@ 0x7fdf39a64594 caffe::Net<>::CopyTrainedLayersFrom()
@ 0x7fdf39a67645 caffe::Net<>::CopyTrainedLayersFromBinaryProto()
@ 0x7fdf39a772be caffe::LoadNetWeights<>()
@ 0x7fdf39a798b0 caffe::Solver<>::InitTrainNet()
@ 0x7fdf39a79e34 caffe::Solver<>::Init()
@ 0x7fdf39a7a11f caffe::Solver<>::Solver()
@ 0x7fdf39a9cd31 caffe::Creator_SGDSolver<>()
@ 0x564ec97ce4d2 train()
@ 0x564ec97cacc5 main
@ 0x7fdf37fdfb97 __libc_start_main
@ 0x564ec97cb63a _start
Aborted (core dumped)

Any suggestions to resolve this issue?

@glenn-jocher
Member

@MuhammadAsadJaved your issue should be posted on the relevant repo, not this one.

@MuhammadAsadJaved

@glenn-jocher Thank you for your advice. I also posted there, but there was no response. So I posted here as well to find some help, because the issue is similar.

@glenn-jocher
Member

@MuhammadAsadJaved all transfer learning works correctly in this repo. See https://github.com/ultralytics/yolov3/wiki/Example:-Transfer-Learning

@jayant3297

I want to do face detection using YOLOv3 as the model with the Darknet object detection pre-trained weights, taken from
https://pjreddie.com/darknet/yolo/, but I am unable to train it for the single class "face". Can anyone help me?

@MuhammadAsadJaved

MuhammadAsadJaved commented Oct 7, 2020 via email

@MuhammadAsadJaved

MuhammadAsadJaved commented Oct 7, 2020 via email

@jayant3297

The bounding boxes I am getting are around the whole image and not around the face.

@pankaja0285

pankaja0285 commented Jan 3, 2022

Is there any working example/code for adding additional classes (in my case, 4 classes) to a pretrained YOLOv4 model (which I had trained for 20 classes) with the darknet framework? The weights are saved every 10000 steps, so in all I have 4 sets of weights saved. I see bits and pieces of code about passing the number of layers to freeze (in my case 20). But what are the next steps after that: add the new classes and train for maybe 100 iterations, then stop and save the weights? Once that is done, I guess I have to unfreeze all 20 layers and retrain on all 24 classes. If there is a working example, or if someone can help me with mostly code and some pseudocode, that would be helpful.

TIA

@glenn-jocher
Member

@pankaja0285 for darknet training you probably want to head over to https://github.com/AlexeyAB/darknet

@pankaja0285

pankaja0285 commented Jan 3, 2022

I already checked there, @glenn-jocher, and it wasn't much help. Hence asking here if someone can shed some light.

It's not that I need help with darknet. I need help with the transfer learning for additional classes. I can probably convert the darknet-trained YOLOv4 weights to, say, .pt format (PyTorch) and then proceed. It's the part after that point that I need help with.

NOTE: FYI, I trained the 20 classes of a VOC dataset.

@glenn-jocher
Member

@pankaja0285 VOC training is very simple with YOLOv5. All models and datasets download automatically:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

python train.py --data VOC.yaml --weights yolov5s.pt

@pankaja0285

pankaja0285 commented Jan 3, 2022

@glenn-jocher I think there's a miscommunication. I already trained for 20 classes and have the YOLOv4 weights for it; now I want to know how to add additional classes (in my case, 4 additional classes). From what I have read so far, I need to provide the layers to freeze via the --freeze parameter, and then, this is what I am asking: is everything done behind the scenes?

Or, for example, in your ultralytics repo, how does train.py go about doing the transfer learning?
What do I need to do?

Again this is for my own additional dataset that contains the 4 classes.

@glenn-jocher
Member

glenn-jocher commented Jan 3, 2022

@pankaja0285 YOLOv5 automatically handles class differences. Starting a training run from any other pretrained weights is the default workflow; no action is required on your part. For example, the command below trains a 20-class model starting from 80-class COCO weights:

python train.py --data VOC.yaml --weights yolov5s.pt

@pankaja0285

pankaja0285 commented Jan 3, 2022

Ok,

  • Also, regarding the weights from training, quoting from your response above:
    python train.py --data VOC.yaml --weights yolov5s.pt
    - do they get saved in .pt format?
    - also, how do I run on GPU?

  • Then, if I want to add 4 new classes to the above trained YOLOv5 weights, do I just give the new yaml file and the YOLO weights?
    Something like this:
    python train.py --data VOC_addnl_4.yaml --weights new_weights

@glenn-jocher
Member

glenn-jocher commented Jan 3, 2022

@pankaja0285 see https://docs.ultralytics.com/yolov5/tutorials/train_custom_data to get started.

@pankaja0285

Also, I have an NVIDIA GPU, with CUDA and cuDNN set up and installed. How do I run the training on GPU? I guess you have a specific flag setting for it that I have to pass in the
python train.py....

@pankaja0285

pankaja0285 commented Jan 3, 2022

Quoting your reply: "@pankaja0285 see https://docs.ultralytics.com/yolov5/tutorials/train_custom_data to get started" - Agreed,

But you are still not answering my question about the additional classes: how to add them, and how to further train with the existing weights. Having done the first training, I do see in your repo that the best model gets saved in .pt format. But my question is how to extend the model for additional classes.

Also, an FYI: even though CUDA is available on my laptop, the device is not getting recognized while training. Do I have to modify any configuration settings, or where do I need to make changes? Please let me know. I just started training and I am getting a message that says "... CUDA is not available". I am running from the PyCharm terminal.

@glenn-jocher
Member

@pankaja0285 apologies for the confusion earlier. To clarify:

  1. Training on GPU: If you have CUDA installed, PyTorch should automatically use your GPU for training. Make sure your PyTorch installation is compatible with your CUDA version. You don't need to set any specific flags; the code will default to GPU if it's available and configured correctly.

  2. Adding Additional Classes: To add more classes to an existing model, you need to modify your dataset to include the new classes and update your .yaml file accordingly. Then you can start training with the new dataset and the pre-trained weights. The model will adjust its final layer to accommodate the new number of classes.

Here's a simplified example command for continuing training with additional classes:

python train.py --data VOC_addnl_4.yaml --weights path/to/your/previous/best_model.pt

The weights will be saved in .pt format by default.

  3. CUDA Not Available: If CUDA is not being recognized, it could be due to several reasons:

    • Your PyTorch installation might not be compatible with your CUDA version.
    • Your environment variables for CUDA might not be set correctly.
    • PyCharm's terminal might not be recognizing your system's environment variables.

    To troubleshoot, try running the training script from a regular terminal or command prompt outside of PyCharm. If it works there, the issue might be with PyCharm's configuration.

If you continue to have issues, please provide more details about your setup, including the versions of CUDA, cuDNN, and PyTorch you're using, and I'll do my best to assist you further.
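
As a quick way to gather those details, a minimal check (plain PyTorch, nothing specific to this repo) prints what your environment reports:

import torch

print(torch.__version__)          # installed PyTorch build
print(torch.version.cuda)         # CUDA version PyTorch was built against (None for CPU-only builds)
print(torch.cuda.is_available())  # True only if a usable GPU and driver are detected
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first visible GPU

If this prints False from the PyCharm terminal but True from a regular terminal, the problem is the PyCharm environment; if it prints False everywhere, the installed PyTorch build is likely CPU-only.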
