
Guideline to train against other datasets with different classes #31

Closed
y22ma opened this issue Jan 13, 2017 · 46 comments

Comments

@y22ma commented Jan 13, 2017

A guideline to train against other datasets, such as the Udacity self-driving dataset, would be much appreciated.

Do I create a labels.txt in the root folder, and specify a model name outside of coco_models and voc_models listed in darkflow/misc.py?

@thtrieu (Owner) commented Jan 13, 2017

Good suggestion on self-driving dataset.

Indeed labels.txt is in root (./labels.txt), and the model's name should be different from the default ones. This seems like a bad design, so I am open to your suggestions.

@y22ma (Author) commented Jan 13, 2017

Well, I wonder if it's possible to dynamically construct the FC layer according to the number of classes you have in labels.txt.

For example, if I want to use yolo_tiny, but for a 5-class dataset rather than a 20-class dataset, we could reformat the FC layers to generate the appropriate number of outputs.

In darkflow's current form, I would have to modify the yolo_tiny.cfg file, and tell the training script to ignore the FC weights and reinitialize new ones?

@thtrieu (Owner) commented Jan 13, 2017

That's a good suggestion too. The current design of darkflow does not allow this, but one can modify the source at ./cfg/process.py so that, while parsing for the number of outputs in a .cfg, it counts the number of lines in ./labels.txt instead. Another number that also affects the last FC layer's output size is the number of boxes; for this you have to look further down the .cfg file, at the [detection] layer. I personally don't think this complicated behavior is necessary, but you can always customize the source as you like (just note that process.py is a bit messy).
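A minimal sketch of that idea (assuming one class name per non-empty line in labels.txt; the function name is illustrative, not darkflow's):

```python
def count_labels(path="labels.txt"):
    # One class name per non-empty line, as darkflow's labels.txt uses.
    with open(path) as f:
        return sum(1 for line in f if line.strip())
```

process.py could then substitute this count for the class number it currently reads from the .cfg.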

To initialize the new net completely, just leave out the --load option. To load the first identical layers of your new net from, say, yolo-tiny.weights, point --load to that file. A table will be printed indicating which layers are loaded and which are initialized.

@y22ma (Author) commented Jan 13, 2017

Just so I understand you fully, when you say "identical layer", I just need to not modify the layers I don't want to change, and darkflow would detect changes in a new cfg file and initialize those variables properly?

@thtrieu (Owner) commented Jan 13, 2017

Correct, but notice the word first: the first matching layers are reused; the first mismatch causes the rest of the net to be initialized.

@y22ma (Author) commented Jan 14, 2017

@thtrieu just realized layers don't have IDs, and to introduce a change in the cfg file you actually have to change the layer structure. Just wondering if I could avoid that?

The reason being, I want to swap out the FC layers, train them, and fine-tune the entire network with a lower learning rate. I want to load the pretrained weights and still train them.

Is it possible to add an extra parameter that specifies train=true and reinitialize=random, or something of that sort, for each layer?

@thtrieu (Owner) commented Jan 14, 2017

Surely you can do that, but it will require source code modification.

@y22ma (Author) commented Jan 14, 2017

Ok, working on that right now. Could you please point me to where the process script decides to reinitialize a layer when changes are detected?

@thtrieu (Owner) commented Jan 14, 2017

A bit complicated:

  1. A "weight walker" in ./utils/loader.py is used to load the source weight file.
  2. Then a "weight loader" in ./utils/loader.py cycles through each pair of layers between the source config and the destination config (they can be the same config) and yields the weights as long as the pair is identical (compared via layer.signature). If the pair is not identical, None is yielded.
  3. Then comes the part TensorFlow is in charge of, in ./net/ops/baseop.py: the layer is wrapped into TensorFlow variables and placeholders. If the value collected for that layer in the previous step is None, it is initialized; otherwise it is used as the initial value.
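An illustrative sketch of step 2 (simplified names, not the actual darkflow source): weights are yielded only while the layer pairs match, and every layer from the first mismatch on gets None:

```python
def partial_load(src_layers, dst_layers, weights):
    # weights[i] holds the pretrained values for src_layers[i].
    matched = True
    for i, dst in enumerate(dst_layers):
        # Once a pair mismatches, everything after it is also treated as
        # a mismatch ("the first mismatch initializes the rest").
        matched = matched and i < len(src_layers) and src_layers[i] == dst
        # None tells the baseop step to random-initialize this layer.
        yield weights[i] if matched else None
```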

Hope this helps.

@y22ma (Author) commented Jan 14, 2017

Reading through weight_loader in loader.py, I'm having a hard time locating the exact line where the signature is compared and rejected. Could you kindly clarify?

In the meantime, I'm planning not to touch the convolution layers at all and to swap out the FC and detection layers with the following.

[connected]
output= 735
activation=linear

[detection]
classes=5
coords=4
rescore=1
side=7
num=2
softmax=0
sqrt=1
jitter=.2

object_scale=1
noobject_scale=.5
class_scale=1
coord_scale=5

Will keep you updated on how it works
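For reference, the output=735 above follows from the YOLOv1 head size side × side × (num × (coords + 1) + classes), using the values in the [detection] section:

```python
# YOLOv1 FC output size derived from the [detection] parameters above:
# each of side*side cells predicts num boxes (coords + 1 confidence)
# plus `classes` class probabilities.
side, num, coords, classes = 7, 2, 4, 5
output = side * side * (num * (coords + 1) + classes)
print(output)  # 735
```

The stock 20-class VOC config gives 7 * 7 * (2 * 5 + 20) = 1470 by the same formula.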

@thtrieu (Owner) commented Jan 15, 2017

  1. The comparison is done at line 30 by the == operator, which is overloaded in the definition of class Layer.

  2. If I understand you correctly, all you have to do is change the definition of the last FC layer in the .cfg as above and then call for a partial load; no source code needs to be modified.
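A hypothetical illustration of that overloaded comparison (the attribute and constructor shape are my assumption, not darkflow's exact code):

```python
class Layer:
    def __init__(self, *signature):
        # e.g. (type, kernel, stride, out_channels) for a conv layer.
        self.signature = signature

    def __eq__(self, other):
        # Two layers match iff their signatures match.
        return self.signature == other.signature

# A matching conv pair gets its weights reused; an FC layer whose
# output size changed (1470 -> 735) is re-initialized instead.
print(Layer("conv", 3, 1, 512) == Layer("conv", 3, 1, 512))  # True
print(Layer("full", 1470) == Layer("full", 735))             # False
```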

@y22ma (Author) commented Jan 15, 2017

and the number of classes in the detection layer, as you mentioned before, right?

@thtrieu (Owner) commented Jan 15, 2017

Yes.
Just to clarify:
Doing the following:

  1. Copy the config of X.cfg to Y.cfg
  2. Change the output of the last FC layer
  3. Change the number of classes in the detection layer
  4. Call ./flow --model Y.cfg --load X.weights

will result in: the first (N-1) layers of Y are loaded from X.weights, and the last layer of Y is initialized. You can check whether each layer is loaded or initialized in the table of layers the program prints.

@y22ma (Author) commented Jan 15, 2017

Training now! To detail what I'm doing: I loaded the CSV annotation file from the Udacity dataset to produce dumps in the same format you expect in data.py; the Udacity dataset has 5 different classes.

Will keep you posted, and any tips would be much appreciated!

@thtrieu (Owner) commented Jan 15, 2017

The bullet points at the end of this post might be helpful https://thtrieu.github.io/notes/Fine-tuning-YOLO-4-classes#hand-picking-good-feature

Besides, I would love to reference your training results/demo on this repo's README. If that's okay, do notify me when you're ready.

@y22ma (Author) commented Jan 15, 2017

Really good tips. I have the following sample sizes at the moment:

car: 60788
biker: 1676
truck: 3503
trafficLight: 17253
pedestrian: 9866

Loss has converged to 3.0 now. Will run a regular test to see if it's reasonable.

@y22ma (Author) commented Jan 16, 2017

@thtrieu the loss shows up as 2.4, but when I perform testing on my test set, the probabilities come out as NaN. Just wondering if you have any clue how that could happen? I'm guessing the NaN would've been produced during training as well?

@thtrieu (Owner) commented Jan 16, 2017

Can you describe in detail which commands you ran to obtain these results? They all seem new to me.

@y22ma (Author) commented Jan 16, 2017

To train I did:
./flow --train --model cfg/v1.1/tiny-yolov1-5c.cfg --load bin/tiny_yolo.weights --annotation <path to my annotations> --dataset <path to my images>

To run I did:
./flow --test <path to my test images> --model cfg/v1.1/tiny-yolov1-5c.cfg --load -1

Interestingly, when I pass -1 to --load to load the latest checkpoint for both the --train and --test options, I get the following output:

Source | Train? | Layer description                | Output size
-------+--------+----------------------------------+---------------
       |        | input                            | (?, 448, 448, 3)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 448, 448, 16)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 224, 224, 16)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 224, 224, 32)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 112, 112, 32)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 112, 112, 64)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 56, 56, 64)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 56, 56, 128)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 28, 28, 128)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 28, 28, 256)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 14, 14, 256)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 14, 14, 512)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 7, 7, 512)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 7, 7, 1024)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 7, 7, 256)
 Load  |  Yep!  | flat                             | (?, 12544)
 Init  |  Yep!  | full 12544 x 735  linear         | (?, 735)
-------+--------+----------------------------------+---------------

It seems to fail to load any convolution layers; no wonder it spits out NaN :( It does not do this when I pass a weights file as my --load argument, which suggests there might be versioning issues with the ckpt format. I'm currently using TensorFlow 0.12.1, if it helps.

tiny-yolov1-5c.cfg was modified from tiny-yolov1.cfg, with changes to [connected] and [detection] posted above.

@thtrieu (Owner) commented Jan 16, 2017

It is totally okay with the Inits. The table tells you which layers are loaded from a .weights file, not from a ckpt. As long as the table is followed by the messages Loading from ./ckpt/tiny-yolov1-5c-<number> and Finished in <>second, then you're doing fine.

The strange thing to me is: how can you get any loss value when running a --test command? Normally a --test command simply prints that it is forwarding and preprocessing some input images before terminating.

@y22ma (Author) commented Jan 16, 2017

Ah, good to know that it's loading the weights. I don't actually have a NaN loss value; what I'm referring to is the NaN matrix produced when I run a forward pass during the test procedure.

I printed out the result of line 94 in net/flow.py:
out = self.sess.run(self.out, feed_dict)

and out showed up as a NaN matrix, which makes it hard to believe it would've produced a valid loss during training?

@thtrieu (Owner) commented Jan 16, 2017

The NaN is not necessarily the probabilities in YOLO's formulation. It can be the coordinate offsets, confidences, class scores, etc. You can always check what the output matrix is during training by putting self.out into fetches at line 49 of the same file. I suspect these are also NaN matrices and the loss value of 2.4 or 3.0 is a result of overflow/underflow.

If the matrices are indeed NaN during training, then there is a scaling problem due to overusing the old weights (N-1 layers are reused with totally different object classes, and v1.1 uses batch norm with arbitrarily large scaling/offset parameters). To check this, try running the model without loading from any .weights file (full initialization) and see if the NaN problem persists.

@y22ma (Author) commented Jan 17, 2017

Thanks for the tips.

I'm not sure what you mean by "putting self.out into fetches", but I did try running the model without loading from any weights via:
./flow --test <path to my test images> --model cfg/v1.1/tiny-yolov1-5c.cfg

And I'm seeing the same NaN matrix coming out of out = self.sess.run(self.out, feed_dict)

@thtrieu (Owner) commented Jan 17, 2017

By fetches, I mean the fetches in the Python code fetched = self.sess.run(fetches, feed_dict) at line 50 of ./net/flow.py. You can use fetches to look at intermediate layers' values.
For example,

fetches = [self.train_op, loss_op, self.top.out, self.top.inp.out, self.top.inp.inp.out, self.top.inp.inp.inp.out]

will allow you to fetch the train op (to train the net), the loss op (to see the loss), and the last four layers' output matrices. You can certainly use a loop to create this list; the way I did it above is just illustrative.
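One way to build that list with a loop (a sketch; it assumes each layer object exposes .out and .inp exactly as in the chained example above):

```python
def last_layer_outputs(top, n=4):
    # Walk the .inp chain from the top layer and collect each .out tensor.
    fetches, layer = [], top
    for _ in range(n):
        if layer is None:
            break
        fetches.append(layer.out)
        layer = getattr(layer, "inp", None)
    return fetches

# e.g. fetches = [self.train_op, loss_op] + last_layer_outputs(self.top)
```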

If you can print the output of all intermediate layers, it will be easier to debug your program (to see at which layer the NaN problem starts). I believe this is a problem-specific issue, because the YOLO models on the PASCAL VOC dataset all run fine.

@y22ma (Author) commented Jan 17, 2017

Used your command to fetch the intermediate layer outputs. I actually don't see NaN output in the last few layers during training, but I do see NaN output during testing, starting at self.top.inp.inp.inp.out (Tensor("BiasAdd_7:0", shape=(?, 7, 7, 256), dtype=float32)).

I would expect that if the network were producing NaN results, it would've done so during training as well?

@y22ma (Author) commented Jan 18, 2017

Found out something really peculiar. I downloaded tiny-yolo.weights from the link on the YOLOv1 site, and found that the link actually points to tiny-YOLOv2 weights. This is confirmed by the final convolution layer loading successfully when I use the v2 tiny-yolo.cfg. The NaN starts right at that layer as well, so I'm going to try to track down the correct tiny-YOLOv1 weights and train against those.

@thtrieu (Owner) commented Jan 18, 2017

Yes, the official site of YOLO now provides YOLO9000 only. If you want older versions, tell me and I'll upload them.

@y22ma (Author) commented Jan 18, 2017

If you could upload tiny-yolo-v1, that would be much appreciated.

Just so you know, when I try to load yolov1.weights, the walker asserts "Over-read". Not sure if you wish to maintain yolov1 loading anymore, but I thought I would bring it to your attention.

@thtrieu (Owner) commented Jan 18, 2017

To be clear, there is v1.0 (without batch norm), v1.1 (with batch norm), and v2 (YOLO9000). Which one are you referring to?

It might be this

@y22ma (Author) commented Jan 20, 2017

Just to update you on this: I'm training the weights you provided using v1.1/tiny-yolov1.cfg, with the 5-class modifications I made above. The loss is around 2.2, and the outputs are not really valid. Will try to keep it going for one more day before I give up :)

I had to disable the following assert at line 74 of loader.py to load tiny-yolov1.weights at all.

        if walker.path is not None:
            #assert walker.offset == walker.size, \
            #'expect {} bytes, found {}'.format(
            #    walker.offset, walker.size)
            print('Successfully identified {} bytes'.format(
                walker.offset))

@thtrieu (Owner) commented Jan 20, 2017

Training YOLO can be a daunting task, especially with limited computational resources. I encourage you to go a little further.

2.2 is a very familiar loss to me; it can indicate underfitting or too large a learning rate. I suggest trying a smaller learning rate to see if there is any progress. If not, then go for a deeper but much thinner net; see this post if you have not.

@y22ma (Author) commented Jan 20, 2017

It's odd that the training loss for tiny-yolov1.weights sits in the same 1.8-2.0 region, yet it actually makes sensible detections.

I do have a GTX 1070, so I'm doing a bit better than running purely on CPU. Will keep you posted tomorrow.

@y22ma (Author) commented Jan 23, 2017

Getting some results that make sense now! YOLO is picking up cars in the dataset, although the bounding box is often drawn with an offset and with the wrong width/height.

y22ma closed this as completed Jan 23, 2017
y22ma reopened this Jan 23, 2017
@thtrieu (Owner) commented Jan 23, 2017

Make sure you are using Python 3, or convert your code appropriately, because there is a difference in integer/float division between Python 2 and Python 3 that can cause a consistent mislocation of bounding boxes.
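The difference in question, sketched: under Python 2 the / operator floors when both operands are ints, so any grid-cell offset computed with it is silently truncated.

```python
# Python 3 semantics: / is true division, // is floor division.
print(7 / 2)   # 3.5  (Python 2 would print 3 here)
print(7 // 2)  # 3    (explicit floor division, same in both versions)
```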

@y22ma (Author) commented Jan 26, 2017

Yeesh, I'm fairly certain I'm not using Python 3 at the moment. Will try that. In general the bounding boxes seem to be very small, which could be caused by the small bounding-box annotations in the Udacity dataset (sometimes below 5 pixels in width or height).

If that doesn't improve things, I'll move to Python 3.

@y22ma (Author) commented Jan 31, 2017

It's not converging to the right solution :( the boxes show up in roughly the right place, but the sizes are wrong.

I'll put the code up on my fork for anyone to investigate!

@thtrieu (Owner) commented Jan 31, 2017

  1. Please update to the latest commit
  2. Make sure you are using Python 3 and TensorFlow 0.12
  3. Please make sure you can successfully overfit a small dataset (3~5 images) before going any further (for configs with batch norm, use a larger epoch number so that the moving averages converge)

That will rule out many possibilities. Debugging a deep learning application is not simple.

@y22ma (Author) commented Feb 5, 2017

Overfitting did the trick!! Will post my results shortly. Thanks a lot for your help.

@y22ma (Author) commented Feb 7, 2017

@thtrieu, here's my fork for training against the Udacity SDC dataset: https://github.com/y22ma/darkflow/tree/udacity

Udacity employs a different annotation format than PASCAL VOC, so I hacked the dataset.py script to load the Udacity annotations with my own function. How would you like this to be handled?
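For anyone following along, a heavily hedged sketch of the idea (the column order frame, xmin, ymin, xmax, ymax, label and the image size are assumptions about the Udacity CSV, and the dump shape just mirrors what darkflow's PASCAL VOC parser produces; see the fork above for the real code):

```python
import csv
from collections import defaultdict

def udacity_csv_to_dumps(csv_path, img_w=1920, img_h=1200):
    # Group boxes by frame: one dump entry per image, shaped like
    # [filename, [w, h, [[label, xmin, ymin, xmax, ymax], ...]]].
    boxes = defaultdict(list)
    with open(csv_path) as f:
        for row in csv.reader(f):
            frame, xmin, ymin, xmax, ymax, label = row[:6]
            boxes[frame].append(
                [label, int(xmin), int(ymin), int(xmax), int(ymax)])
    return [[frame, [img_w, img_h, objs]] for frame, objs in boxes.items()]
```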

@humbledprogrammer commented Feb 27, 2017

Could you please say more about the theory behind step 3?
What does overfitting on a small set (3~5 images) improve? Should that small training run start with the same parameters as the full training over the entire training set?

@eugtanchik

Hello there, I am really interested in using this library for training on my own datasets. I have some problems when trying to test a few images after training. Could you help me understand better how it works?

@eugtanchik commented Mar 6, 2017

While testing I have the following output:

Parsing cfg/yolo-voc-1c.cfg
Loading None ...
Finished in 0.00013875961303710938s

Building net ...
Source | Train? | Layer description                | Output size
-------+--------+----------------------------------+---------------
       |        | input                            | (?, 416, 416, 3)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 416, 416, 32)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 208, 208, 32)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 208, 208, 64)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 104, 104, 64)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 104, 104, 128)
 Init  |  Yep!  | conv 1x1p0_1  +bnorm  leaky      | (?, 104, 104, 64)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 104, 104, 128)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 52, 52, 128)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 52, 52, 256)
 Init  |  Yep!  | conv 1x1p0_1  +bnorm  leaky      | (?, 52, 52, 128)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 52, 52, 256)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 26, 26, 256)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 26, 26, 512)
 Init  |  Yep!  | conv 1x1p0_1  +bnorm  leaky      | (?, 26, 26, 256)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 26, 26, 512)
 Init  |  Yep!  | conv 1x1p0_1  +bnorm  leaky      | (?, 26, 26, 256)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 26, 26, 512)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 13, 13, 512)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 13, 13, 1024)
 Init  |  Yep!  | conv 1x1p0_1  +bnorm  leaky      | (?, 13, 13, 512)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 13, 13, 1024)
 Init  |  Yep!  | conv 1x1p0_1  +bnorm  leaky      | (?, 13, 13, 512)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 13, 13, 1024)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 13, 13, 1024)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 13, 13, 1024)
 Load  |  Yep!  | concat [16]                      | (?, 26, 26, 512)
 Load  |  Yep!  | local flatten 2x2                | (?, 13, 13, 2048)
 Load  |  Yep!  | concat [26, 24]                  | (?, 13, 13, 3072)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 13, 13, 1024)
 Init  |  Yep!  | conv 1x1p0_1  linear             | (?, 13, 13, 30)
-------+--------+----------------------------------+---------------
Running entirely on CPU
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3/SSE4.1/SSE4.2/AVX instructions, but these are available on your machine and could speed up CPU computations.
Loading from ./ckpt/yolo-voc-1c-4000
Finished in 6.8827056884765625s

Forwarding 3 inputs ...
Total time = 3.51149582862854s / 3 inps = 0.8543367688326967 ips
Post processing 3 inputs ...
Total time = 0.17760968208312988s / 3 inps = 16.890971059763825 ips

but on the test images it detects nothing. Do you have any idea what's wrong?

@eugtanchik

In what format should the annotations be? Is XML required, or are other formats acceptable?

@longchuanshu

@eugtanchik go into $DARKFLOW_ROOT/net/yolov2/test.py and print boxes.probs; make sure your confidence is above the threshold.

thtrieu closed this as completed Mar 13, 2017
@maryam1369

Hi,
I have a CSV annotation file and I am using https://github.com/y22ma/darkflow/tree/udacity, but I get the error: Annotation directory not found ...
Please help me.

@JoffreyN

E:\Users\ZP\Desktop\Getdata>flow.py --model cfg/yolov2-tiny-voc.cfg --load bin/yolov2-tiny-voc.weights --savepb

Parsing ./cfg/yolov2-tiny-voc.cfg
Parsing cfg/yolov2-tiny-voc.cfg
Loading bin/yolov2-tiny-voc.weights ...
Successfully identified 63102560 bytes
Finished in 0.04497408866882324s
Traceback (most recent call last):
  File "E:\Users\ZP\Desktop\Getdata\flow.py", line 6, in <module>
    cliHandler(sys.argv)
  File "D:\Program Files\Python36\lib\site-packages\darkflow\cli.py", line 26, in cliHandler
    tfnet = TFNet(FLAGS)
  File "D:\Program Files\Python36\lib\site-packages\darkflow\net\build.py", line 64, in __init__
    self.framework = create_framework(*args)
  File "D:\Program Files\Python36\lib\site-packages\darkflow\net\framework.py", line 59, in create_framework
    return this(meta, FLAGS)
  File "D:\Program Files\Python36\lib\site-packages\darkflow\net\framework.py", line 15, in __init__
    self.constructor(meta, FLAGS)
  File "D:\Program Files\Python36\lib\site-packages\darkflow\net\yolo\__init__.py", line 20, in constructor
    misc.labels(meta, FLAGS)  # We're not loading from a .pb so we do need to load the labels
  File "D:\Program Files\Python36\lib\site-packages\darkflow\net\yolo\misc.py", line 36, in labels
    with open(file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'labels.txt'
