
Target classes exceed model classes #678

Closed
daddydrac opened this issue Dec 3, 2019 · 65 comments
Labels
bug Something isn't working

Comments

@daddydrac

Created custom files for training:
(4 + 1 + 13) * 3 = 54
13 classes, 54 filters

*.names has 13 names in it
*.cfg was converted properly w 13 for classes, 54 for filters in all 3 yolo blocks

yolov3/utils/utils.py", line 451, in build_targets
assert c.max() <= model.nc, 'Target classes exceed model classes'
AssertionError: Target classes exceed model classes

@daddydrac daddydrac added the bug Something isn't working label Dec 3, 2019
@glenn-jocher
Member

@joehoeller hey bud. This means that you've supplied class numbers in your labels that exceed the class count you specified in your *.data and *.cfg files. Classes are zero indexed, so for example if you specify classes=3 in *.data and *.cfg, your labels may only have classes 0, 1 and 2 present.
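That zero-indexing rule can be sketched as a quick label check (a sketch; `nc` and the label lines are illustrative, not from the poster's dataset):

```python
# Sanity check: with classes=3 in *.data and *.cfg, valid label class
# ids are 0, 1, 2 only. nc and label_lines are illustrative placeholders.
nc = 3

label_lines = [
    "0 0.5 0.5 0.2 0.3",    # ok
    "2 0.1 0.9 0.05 0.05",  # ok: highest valid id is nc - 1
]

for line in label_lines:
    cls = int(line.split()[0])
    assert 0 <= cls <= nc - 1, f"class {cls} out of range for {nc} classes"
print("labels ok")
```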

@daddydrac
Author

daddydrac commented Dec 3, 2019 via email

@glenn-jocher
Member

glenn-jocher commented Dec 3, 2019

@joehoeller your *.cfg and *.data simply need to match up in classes. It's ok to use the default yolov3-spp.cfg for example with 80 classes and supply labels that only go up to 15 classes (the remaining 65 spots on the output vectors will go unused, but it won't really hurt your training).

If you modify your classes to 3, then your labels must either be 0, 1, or 2 (zero indexed).

@FranciscoReveriano
Contributor

Were you able to solve this? I am starting to think that you might have referenced it wrongly in the .data part, assuming your .cfg file is properly formatted and being called properly.

@daddydrac
Author

daddydrac commented Dec 3, 2019

I updated Dark Chocolate, the COCO JSON -> Darknet converter, so now classes are 0-indexed: https://github.com/joehoeller/Dark-Chocolate, and it outputs:
0 0.2125 0.369140625 0.0578125 0.232421875

My *.names file has 13 classes (the numbers are shown as an example; they are not actually in the file):

0. person
1. bicycle
2. car
3. motorcycle
4. airplane
5. bus
6. train
7. truck
8. boat
9. traffic light
10. fire hydrant
11. stop sign
12. parking meter

The *.data file has this:

classes=13
train=training_img_paths.txt
valid=training_img_paths.txt
names=data/training.names
backup=backup/
eval=coco

I tried the yolov3-spp.cfg as-is and got the same error as with the copy I modified in all 3 [yolo] blocks and the conv layers above them:
(4 + 1 + 13) * 3 = 54

[convolutional]
size=1
stride=1
pad=1
filters=54
activation=linear

[yolo]
mask = 0,1,2
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes=13
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
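The filter count in that [convolutional] layer follows directly from the formula at the top of this comment; a minimal sketch (the helper name is mine, not from the repo):

```python
# Each [yolo] head predicts, per anchor: 4 box coordinates, 1 objectness
# score, and one score per class. yolov3-spp uses 3 anchors per head.
def yolo_filters(num_classes, anchors_per_head=3):
    return (4 + 1 + num_classes) * anchors_per_head

print(yolo_filters(13))  # 54, matching filters=54 above
```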

The error still persists:

Traceback (most recent call last):
  File "train.py", line 450, in <module>
    train()  # train normally
  File "train.py", line 276, in train
    loss, loss_items = compute_loss(pred, targets, model)
  File "/yolov3/utils/utils.py", line 333, in compute_loss
    tcls, tbox, indices, anchor_vec = build_targets(model, targets)
  File "/yolov3/utils/utils.py", line 451, in build_targets
    assert c.max() <= model.nc, 'Target classes exceed model classes'
AssertionError: Target classes exceed model classes
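One way to locate the offending files before training is to scan the darknet-style *.txt labels for out-of-range class ids (a sketch; the directory path and nc value are placeholders for the poster's actual setup):

```python
from pathlib import Path

def find_bad_labels(label_dir, nc):
    """Return (filename, line number, class id) for ids outside [0, nc-1]."""
    bad = []
    for txt in sorted(Path(label_dir).glob("*.txt")):
        for i, line in enumerate(txt.read_text().splitlines(), start=1):
            if line.strip():
                cls = int(line.split()[0])
                if not 0 <= cls < nc:
                    bad.append((txt.name, i, cls))
    return bad

# e.g. find_bad_labels("data/labels", 13)
```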

@glenn-jocher
Member

@joehoeller that all looks right. Maybe I should update the error message to output more detail, hold on. Ok I've updated the assert statement in 0dd0fa7

Now it should give you more specific information. Can you git pull and try again?

@daddydrac
Author

daddydrac commented Dec 3, 2019

> @joehoeller that all looks right. Maybe I should update the error message to output more detail, hold on. Ok I've updated the assert statement in 0dd0fa7
>
> Now it should give you more specific information. Can you git pull and try again?

Ok, nice!!! Now I know what the problem is:

AssertionError: Model accepts 13 classes labeled from 0-12, however you labelled a class 17. See https://docs.ultralytics.com/yolov5/tutorials/train_custom_data

Let me go fix and I'll report back ;)

@daddydrac
Author

I guess it is training, finally:

[screenshot of training output]

How long do you think it'll take for 17 classes on a RTX 2080 Ti, using Yolov3-SPP?

@glenn-jocher
Member

@joehoeller if your terminal window is too short, output from the tqdm progress bar will wrap.

From the images, it looks like your labels need the box centers offset by box width / 2 and box height / 2.
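That offset is the usual COCO-to-darknet fix: COCO boxes are (x_min, y_min, w, h) in pixels with a corner origin, while darknet wants normalized (cx, cy, w, h) with a center origin. A sketch (the helper name and the numbers are mine, purely illustrative):

```python
def coco_to_darknet(cls, x_min, y_min, w, h, img_w, img_h):
    cx = (x_min + w / 2) / img_w  # shift origin from corner to box center,
    cy = (y_min + h / 2) / img_h  # then normalize to 0..1
    return f"{cls} {cx} {cy} {w / img_w} {h / img_h}"

print(coco_to_darknet(2, 100, 50, 40, 30, 640, 512))
# → 2 0.1875 0.126953125 0.0625 0.05859375
```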

@daddydrac
Author

daddydrac commented Dec 3, 2019

No prob, I am on it -> "...box width / 2 and box height / 2", thanks for the tip!
Lastly, the image size says 416 as it's training, but in my *.cfg my image sizes are:

[net]
# Testing
# batch=1
# subdivisions=1
# Training
batch=16
subdivisions=16

width=640
height=512

channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

@daddydrac
Author

daddydrac commented Dec 4, 2019

I adjusted the math; does this look right to you? Darknet/YOLO values are all 0.0 to 1.0.
Updated vals:

2 0.6578125 0.685546875 0.20625 0.23828125
2 0.81640625 0.6884765625 0.1375 0.150390625
2 0.5625 0.71484375 0.075 0.0546875
2 0.60078125 0.7138671875 0.04375 0.06640625

Previous vals:

2 0.315625 0.37109375 0.20625 0.23828125
2 0.6328125 0.376953125 0.1375 0.150390625
2 0.125 0.4296875 0.075 0.0546875
2 0.2015625 0.427734375 0.04375 0.06640625

@glenn-jocher
Member

glenn-jocher commented Dec 4, 2019

@joehoeller the image size information in the cfg is not used. All of the files (train.py, detect.py, test.py) use the --img-size argument, which defaults to 416 and must be a multiple of 32.

It's hard to tell by eye if the new vals are correct; you just need to look at your test_batch0.jpg etc. to see that they overlay your objects correctly.
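The 32-multiple constraint comes from the network's stride; a small sketch for picking a valid --img-size (the helper name is mine):

```python
def nearest_valid_img_size(size, stride=32):
    # Round to the nearest multiple of the stride, never below one stride.
    return max(stride, round(size / stride) * stride)

print(nearest_valid_img_size(640))  # 640 (already a 32-multiple)
print(nearest_valid_img_size(500))  # 512
```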

@daddydrac
Author

daddydrac commented Dec 4, 2019 via email

@glenn-jocher
Member

glenn-jocher commented Dec 4, 2019

@joehoeller yeah that looks a lot better. This happens because COCO box origins are in a corner (top left I think), while darknet format has origins in the center.

Note that the overlays you see are the labels, not the predicted boxes. Now you want to let it train for a few hours or a day or so and check your results.png.

So the best way to do it is to note the epoch where your validation losses (on the bottom row) start increasing, and then restart your training from zero with --epochs set to that number, to lock in the LR drops that are programmed at 80% and 90% of --epochs.
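A small sketch of that schedule arithmetic, assuming the 80%/90% fractions mentioned above (the helper is illustrative, not repo code):

```python
def lr_drop_epochs(total_epochs):
    # LR drops are programmed at 80% and 90% of --epochs.
    return int(total_epochs * 0.8), int(total_epochs * 0.9)

print(lr_drop_epochs(100))  # (80, 90)
```

So if validation loss bottoms out around epoch 100, restarting with --epochs 100 places the drops at epochs 80 and 90, just before overfitting sets in.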

@daddydrac
Author

daddydrac commented Dec 4, 2019 via email

@daddydrac
Author

daddydrac commented Dec 4, 2019

@glenn-jocher 4 Ques:

Q1. Is there a way to stop and then resume training?

Q2. You said, "note the epoch your validation losses (on the bottom row) start increasing", but this is all I get back:
[screenshot of training output]

Q3. When it is done, how do I access metrics like false positives, false negatives, mAP, etc.? (Run the cmd for the utils, or...?)

Q4. In the image above, is it really producing a mAP score of 0.841 by epoch 109, or is it overfitting?

@daddydrac
Author

UPDATE: I fixed the PR and updated the math; the COCO JSON -> Darknet conversion tool (Dark Chocolate) works now: daddydrac/Dark-Chocolate#2

@daddydrac daddydrac reopened this Dec 4, 2019
@daddydrac
Author

@glenn-jocher please see the 4 questions above

@FranciscoReveriano
Contributor

> UPDATE: I fixed the PR and updated the math, the COCO JSON -> Darknet conversion tool (Dark Chocolate) works now: joehoeller/Dark-Chocolate#2

I've been super busy with finals and final projects at Duke, so I haven't been able to work on this much. But let me know later if you need a code review on Dark Chocolate.

@daddydrac
Author

daddydrac commented Dec 5, 2019 via email

@daddydrac
Author

daddydrac commented Dec 5, 2019

@glenn-jocher Please provide feedback ->

Done training and got great results: mAP 0.96, which (I think) is competition grade.
My test results after running python3 detect.py --weights weights/last.pt --data data/custom.data --cfg cfg/yolov3-spp-r.cfg --img-size 640 (note: -r is my version of spp.cfg) produced very accurate detections on the image set.

So now I am wondering about the following:

  1. Where is the final training loss, and a graph showing training and validation loss over the epochs?

  2. How can I get qualitative detection output images on a validation set showing true positive detections, false positives, and false negatives (and how can I set up a validation set)?

  3. How do I get a class-wise analysis of precision and recall of detections?

  4. What are the column headers for the results.txt file? I am unsure which numbers represent what.
    For example, it just outputs this for the last set of values:
    272/272 8.13G 1.25 0.349 0.105 1.7 138 416 0.283 0.961 0.922 0.435 1.28 0.31 0.0511

@daddydrac daddydrac reopened this Dec 5, 2019
@FranciscoReveriano
Contributor

Yeah, if you request a code review from me on GitHub, I will be happy to do it.

@daddydrac
Author

daddydrac commented Dec 5, 2019

> Yeah. if you can request me to do a code review on github. I will be happy to do that.

I tried, but it won't let me. I'm not sure why; I'll keep trying.

@glenn-jocher
Member

@joehoeller @FranciscoReveriano sorry I've been pretty busy lately. To answer your questions:

  1. The final results should be in results.txt and results.png.
  2. For the class by class results run python3 test.py --cfg ... --weights ... --data ... etc.
  3. See 2
  4. The column headers are in utils.utils.plot_results()

@daddydrac
Author

daddydrac commented Dec 6, 2019 via email

@joel5638

okay sure

@joel5638

joel5638 commented Feb 25, 2020

@joehoeller Still the same error.

Is it because I'm training only on the 8,800 16-bit thermal .tiff images? I'm not using any COCO dataset.

Do I also have to add COCO to the training set with the thermal images?

If possible, can you send me the folder path where you have placed those COCO images?
Screenshot from 2020-02-25 15-18-21

In the train folder. I have .tiff images.

@daddydrac
Author

daddydrac commented Feb 25, 2020 via email

@joel5638

@joehoeller yes, I've updated the filters and classes in all three blocks.

Do I have to include COCO images in the training set, or will just the .tiff images do?

@daddydrac
Author

daddydrac commented Feb 25, 2020 via email

@joel5638

@joehoeller I've added the .tiff extension in utils/dataset.py.

I'm thinking I should try training with .jpeg first tomorrow and see if the error appears again.

One question for you: did you train on both the 8-bit JPEG and the COCO data for the thermal object detection?

@daddydrac
Author

daddydrac commented Feb 25, 2020 via email

@daddydrac
Author

daddydrac commented Feb 25, 2020 via email

@joel5638

@joehoeller okay sure. Will try to fix it tomorrow.

@glenn-jocher
Member

@joel5638 18 classes means n=18

@joel5638

@glenn-jocher yeah, I changed the filters and updated n=18, which means my filters would be (4 + 1 + 18) × 3 = 69.

I've updated the same in the cfg and tried; it doesn't work. Some minor change is missing.

When you asked me to run python3 train.py --img-size 640, I got an error stating 'cannot load ../coco/train2014/...'

But I'm not using any COCO dataset. I'm only training the model on 8,000 thermal images, no RGB.

@glenn-jocher
Member

@joel5638 the example command I gave you is an example of how to use the --img-size argument. Obviously you apply the argument to your own command.

The image format is irrelevant as long as opencv can open them.

@daddydrac
Author

daddydrac commented Feb 25, 2020 via email

@glenn-jocher
Member

glenn-jocher commented Feb 25, 2020

@joehoeller cv2 loads images and videos:
https://docs.opencv.org/4.2.0/d4/da8/group__imgcodecs.html#ga288b8b3da0892bd651fce07b3bbd3a56

Currently, the following file formats are supported:

Windows bitmaps - *.bmp, *.dib (always supported)
JPEG files - *.jpeg, *.jpg, *.jpe (see the Note section)
JPEG 2000 files - *.jp2 (see the Note section)
Portable Network Graphics - *.png (see the Note section)
WebP - *.webp (see the Note section)
Portable image format - *.pbm, *.pgm, *.ppm *.pxm, *.pnm (always supported)
PFM files - *.pfm (see the Note section)
Sun rasters - *.sr, *.ras (always supported)
TIFF files - *.tiff, *.tif (see the Note section)
OpenEXR Image files - *.exr (see the Note section)
Radiance HDR - *.hdr, *.pic (always supported)
Raster and Vector geospatial data supported by GDAL (see the Note section)
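A quick way to pre-filter a training list against that table (a sketch; the extension set below is copied from the list above, and the helper name is mine):

```python
# Extensions cv2.imread can decode, per the format list above.
CV2_EXTS = {".bmp", ".dib", ".jpeg", ".jpg", ".jpe", ".jp2", ".png",
            ".webp", ".pbm", ".pgm", ".ppm", ".pxm", ".pnm", ".pfm",
            ".sr", ".ras", ".tiff", ".tif", ".exr", ".hdr", ".pic"}

def cv2_readable(paths):
    # Keep only paths whose extension appears in the supported set.
    return [p for p in paths if "." + p.rsplit(".", 1)[-1].lower() in CV2_EXTS]

print(cv2_readable(["FLIR_00211.tiff", "notes.txt", "img.png"]))
# → ['FLIR_00211.tiff', 'img.png']
```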

@daddydrac
Author

daddydrac commented Feb 25, 2020 via email

@joel5638

@joehoeller @glenn-jocher I'm using @joehoeller's thermal object detection repository. He mentioned 18 classes, and I've set n=18, so filters would be (18 + 1 + 4) × 3 = 69, which is constant, but it doesn't seem to work.

I've also tried 66 as filters. I'm lost and confused about what is going wrong :(

@daddydrac
Author

daddydrac commented Feb 26, 2020 via email

@joel5638

It is the same error:

RuntimeError: shape '[16, 3, 23, 13, 13]' is invalid for input of size 178464

  1. I have resized the .tiff images to 160x120 and am using them.
  2. I've updated the classes to 18 and filters to 69 in yolov3-spp-r.cfg.
  3. I've also tried classes of 17 and filters of 66 in yolov3-spp-r.cfg.

Nothing seems to work.

@daddydrac
Author

daddydrac commented Feb 26, 2020 via email

@joel5638

joel5638 commented Feb 26, 2020

my custom.data looks like this :
classes=18
train=./data/training_img_paths.txt
valid=./data/training_img_paths.txt
names=data/custom.names
backup=backup/
eval=coco

my custom.names looks like this:

person
bicycle
car
motorcycle
airplane
bus
train
truck
boat
traffic light
fire hydrant
stop sign
parking meter
bench
bird
cat
dog
horse

my training_img_paths.txt looks like this:

./coco/images/train/FLIR_00211.tiff
./coco/images/train/FLIR_06774.tiff
./coco/images/train/FLIR_00075.tiff
./coco/images/train/FLIR_00070.tiff
./coco/images/train/FLIR_03503.tiff
......
......
......

@joel5638
Copy link

joel5638 commented Feb 26, 2020

@joehoeller I just cloned your repository again, added paths to the FLIR dataset with the JPEG thermal images, and used the cfg you used for training.

I ran this command: python3 train.py --data data/custom.data --cfg cfg/yolov3-spp-r.cfg --weights weights/yolov3-spp.weights

When I change the filters to 69 and classes to 18 in the cfg file, this is the error:

Error: AssertionError: Model accepts 18 classes labeled from 0-17, however you labelled a class 18

But when I change the filters to 66 and classes to 18 in the cfg file, this is the error:

Error: RuntimeError: shape '[16, 3, 23, 13, 13]' is invalid for input of size 178464

Looks like nothing works out with the cfg file.

@daddydrac
Author

Just for fun, try the stock spp cfg file and see if it works:
https://github.com/joehoeller/Object-Detection-on-Thermal-Images/blob/master/cfg/yolov3-spp.cfg

@glenn-jocher
Member

@joehoeller @joel5638 yes you can always use the default yolov3-spp.cfg (with no changes) to train custom datasets with up to 80 classes, like an 18 class dataset. It's not an optimal solution, but it works.

@joel5638

joel5638 commented Feb 27, 2020

@joehoeller @glenn-jocher I just tried with the stock cfg and I see the same result:

AssertionError: Model accepts 18 classes labeled from 0-17, however you labelled a class 18. See https://docs.ultralytics.com/yolov5/tutorials/train_custom_data

@glenn-jocher
Member

@joel5638 if your data is only labeled for classes between 0 and 17, it's not possible to see that message. I thought you'd already fixed your labels?

@joel5638

@joehoeller @glenn-jocher Finally, the training has started. I deleted all 8,000 images and their labels, took only 10 images with their labels, tried again, and it worked. So I'm assuming the images in the train folder need their respective labels to train.
Screenshot from 2020-02-27 10-59-44

@glenn-jocher
Member

glenn-jocher commented Feb 27, 2020

Oh good!

Not all of the images used for training need label files. We routinely take custom datasets and add COCO images sans labels as backgrounds. All that's required is that the background images are listed in the *.txt file along with the rest of the (labeled) images.
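That setup can be sketched as building the train list from labeled images plus label-free backgrounds (all paths below are placeholders; only FLIR_00211.tiff and FLIR_06774.tiff appear earlier in the thread, and the list is written to a temp directory purely for illustration):

```python
import os
import tempfile

labeled = ["./coco/images/train/FLIR_00211.tiff",
           "./coco/images/train/FLIR_06774.tiff"]
backgrounds = ["./coco/images/train/bg_0001.jpg"]  # no matching label files

# Background images simply get listed alongside the labeled ones.
list_path = os.path.join(tempfile.gettempdir(), "training_img_paths.txt")
with open(list_path, "w") as f:
    f.write("\n".join(labeled + backgrounds) + "\n")
```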

@joel5638

Oh wow. Now I get the picture.
Thank you so much @glenn-jocher and @joehoeller.
You've been a great help in educating me. Love your passion. I'm very grateful. Thank you both.

@glenn-jocher
Member

@joel5638 you're very welcome! It's a pleasure to assist. Our community is very helpful and supportive, and I'm sure they'll also be pleased to help you out. Keep up the good work!
