Issue running GenderTraining program on Ubuntu #205

Closed
MisterMcDuck opened this issue Jun 25, 2022 · 6 comments
@MisterMcDuck
Hello,

I attempted to follow the instructions provided at https://github.com/takuya-takeuchi/DlibDotNet/wiki/Tutorial-for-Linux
and https://github.com/takuya-takeuchi/FaceRecognitionDotNet/tree/master/tools/GenderTraining

to train a gender model as specified. I compiled everything with CUDA support, and can confirm that works as I've previously trained dlib networks on this machine.

I always specified 64/desktop cuda 112 when building the libraries. However, when I try to run the training program, I receive this error:

ubuntu@ip-172-30-0-90:~/DNN/DotNet/FaceRecognitionDotNet/tools/GenderTraining/bin/x64/Release/netcoreapp2.0$ ls
DlibDotNet.dll  DlibDotNet.xml            GenderTraining.dll  GenderTraining.runtimeconfig.dev.json  libDlibDotNetNativeDnn.so                   libDlibDotNetNativeDnnGenderClassification.so
DlibDotNet.pdb  GenderTraining.deps.json  GenderTraining.pdb  GenderTraining.runtimeconfig.json      libDlibDotNetNativeDnnAgeClassification.so

ubuntu@ip-172-30-0-90:~/DNN/DotNet/FaceRecognitionDotNet/tools/GenderTraining/bin/x64/Release/netcoreapp2.0$ dotnet GenderTraining.dll train -d=/home/ubuntu/DNN/DotNet/FaceRecognitionDotNet/tools/GenderTraining/UTKDataset -b=400 -e=600 -v=20
           Dataset: /home/ubuntu/DNN/DotNet/FaceRecognitionDotNet/tools/GenderTraining/UTKDataset
             Epoch: 600
     Learning Rate: 0.001
 Min Learning Rate: 1E-05
    Min Batch Size: 400
Validation Interval: 20

Start load train images
Load train images: 7824
Start load test images
Load test images: 1954

**************************** FATAL ERROR DETECTED ****************************

Error detected at line 202.
Error detected in file /opt/data/FaceRecognitionDotNet/src/DlibDotNet/src/dlib/dlib/../dlib/dnn/trainer.h.
Error detected in function void dlib::dnn_trainer<net_type, solver_type>::train_one_step(const std::vector<typename net_type::input_type>&, const std::vector<typename net_type::training_label_type>&) [with net_type = dlib::add_loss_layer<dlib::loss_multiclass_log_, dlib::add_layer<dlib::fc_<2ul, (dlib::fc_bias_mode)0u>, dlib::add_layer<dlib::dropout_, dlib::add_layer<dlib::relu_, dlib::add_layer<dlib::fc_<512ul, (dlib::fc_bias_mode)0u>, dlib::add_layer<dlib::dropout_, dlib::add_layer<dlib::relu_, dlib::add_layer<dlib::fc_<512ul, (dlib::fc_bias_mode)0u>, dlib::add_layer<dlib::max_pool_<3l, 3l, 2, 2>, dlib::add_layer<dlib::relu_, dlib::add_layer<dlib::con_<384l, 3l, 3l, 1, 1, 1, 1>, dlib::add_layer<dlib::bn_<(dlib::layer_mode)0u>, dlib::add_layer<dlib::max_pool_<3l, 3l, 2, 2>, dlib::add_layer<dlib::relu_, dlib::add_layer<dlib::con_<256l, 5l, 5l, 1, 1, 2, 2>, dlib::add_layer<dlib::bn_<(dlib::layer_mode)0u>, dlib::add_layer<dlib::max_pool_<3l, 3l, 2, 2>, dlib::add_layer<dlib::relu_, dlib::add_layer<dlib::con_<96l, 7l, 7l, 4, 4>, dlib::input_rgb_image_sized<227ul>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void> >; solver_type = dlib::sgd; typename net_type::input_type = dlib::matrix<dlib::rgb_pixel>; typename net_type::training_label_type = long unsigned int].

Failing expression was data.size() == labels.size().


******************************************************************************

Aborted (core dumped)

I'm not sure how the two std::vector arguments could have different sizes. If you think it would help, I could try this on Windows, as this is just an Amazon EC2 instance.

Thanks for any advice you can give!

@takuya-takeuchi
Owner

@MisterMcDuck
This issue may be related to takuya-takeuchi/DlibDotNet#272
So I think we have to modify the DlibDotNet code.

takuya-takeuchi self-assigned this Jun 27, 2022
takuya-takeuchi added the bug (Something isn't working) label Jun 27, 2022
@MisterMcDuck
Author

@takuya-takeuchi

Thanks for the link. I can confirm that with a simple modification as done in the linked PR, it's now training.

The modification I made, just for testing:

---- src/GenderClassification/dlib/dnn/loss/multiclass_log/gender/Gender.h ----
index a9daf2d..e779c16 100644
@@ -9,8 +9,8 @@
 #include "defines.h"
 #include "DlibDotNet.Native.Dnn/dlib/dnn/loss/multiclass_log/template.h"
 
-typedef unsigned long gender_out_type;
-typedef unsigned long gender_train_label_type;
+typedef uint32_t gender_out_type;
+typedef uint32_t gender_train_label_type;
 
 MAKE_LOSSMULTICLASSLOG_FUNC(gender_train_type,  matrix_element_type::RgbPixel, dlib::rgb_pixel, matrix_element_type::UInt32, gender_train_label_type, 100)

and

 src/DlibDotNet.Native.Dnn/dlib/dnn/loss/multiclass_log/LossMulticlassLogBase.h 
index 38d30f3..6f73d7d 100644
@@ -8,8 +8,8 @@
 
 #include "../LossBase.h"
 
-typedef unsigned long loss_multiclass_log_out_type;
-typedef unsigned long loss_multiclass_log_train_label_type;
+typedef uint32_t loss_multiclass_log_out_type;
+typedef uint32_t loss_multiclass_log_train_label_type;
 
 using namespace dlib;
 using namespace std;

and the result:

dotnet GenderTraining.dll train -d /media/chris/DATA/Datasets/UTKDataset/output
            Dataset: /media/chris/DATA/Datasets/UTKDataset/output
              Epoch: 300
      Learning Rate: 0.001
  Min Learning Rate: 1E-05
     Min Batch Size: 256
Validation Interval: 30

Start load train images
Load train images: 7824
Start load test images
Load test images: 1954
step#: 0     learning rate: 0.001  average loss: 0            steps without apparent progress: 0
step#: 5     learning rate: 0.001  average loss: 0.769476     steps without apparent progress: 0
step#: 9     learning rate: 0.001  average loss: 0.769381     steps without apparent progress: 0
step#: 14    learning rate: 0.001  average loss: 0.725918     steps without apparent progress: 7

If I get some time I'll try to bring together a PR, but it'd need to cover all the cases rather than just this one.
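
One plausible reading of why the typedef change helps, sketched under assumptions: the error message shows dlib's training_label_type as long unsigned int, which is 8 bytes on 64-bit Linux, while the label buffer is described as UInt32 (4 bytes per element). If a buffer filled with 4-byte labels is viewed as 8-byte elements, it appears to hold half as many entries, and a check like data.size() == labels.size() fails. The helper name apparent_count below is hypothetical and is not part of DlibDotNet or dlib; whether this is the exact mechanism in the library is not confirmed in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a DlibDotNet or dlib function).
// Returns how many elements a buffer appears to contain when `count`
// values of `stored_bytes` each are read back as values of
// `viewed_bytes` each.
constexpr std::size_t apparent_count(std::size_t count,
                                     std::size_t stored_bytes,
                                     std::size_t viewed_bytes) {
    return count * stored_bytes / viewed_bytes;
}

// Matching widths: every label is seen, so the sizes agree.
static_assert(apparent_count(400, sizeof(std::uint32_t),
                             sizeof(std::uint32_t)) == 400,
              "same-width view keeps the element count");

// 4-byte labels viewed as 8-byte values: only half as many appear.
static_assert(apparent_count(400, sizeof(std::uint32_t),
                             sizeof(std::uint64_t)) == 200,
              "8-byte view of 4-byte labels halves the element count");
```

With the batch size of 400 from the failing run, a 64-bit view of 32-bit labels would report 200 labels against 400 images, which is consistent with the assertion that failed. Switching the native typedef to uint32_t makes both sides agree on the element width.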

@takuya-takeuchi
Owner

I think we should use uint64_t,
because dlib's label type is 64 bits wide when it is built on Linux.
Otherwise, using uint32_t causes a type conversion.

But you can continue to train with your code.
This is not a serious problem; at worst it only produces a compiler warning.
Thanks :)
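
As a rough illustration of the conversion being discussed, the sketch below copies 32-bit labels element by element into a 64-bit vector instead of reusing the same buffer under a different element type. The function name widen_labels is hypothetical and does not exist in DlibDotNet; it only shows that a widening copy preserves both the element count and the values.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of DlibDotNet): builds a new vector of
// 64-bit labels from 32-bit labels. Each element is converted
// individually, so the result has the same length as the input.
inline std::vector<std::uint64_t> widen_labels(
    const std::vector<std::uint32_t>& labels) {
    return std::vector<std::uint64_t>(labels.begin(), labels.end());
}
```

For example, widen_labels({0u, 1u, 1u}) yields a vector of three 64-bit values 0, 1, 1. The cost is one extra copy per batch, which is small next to the training step itself.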

@MisterMcDuck
Author

MisterMcDuck commented Jun 29, 2022

I did see issues when keeping UInt32, e.g. half-sized arrays during the validation phase, but I worked around them. Out of curiosity I implemented UInt64 support for loss multiclass log, but I think these would be breaking changes for the library, which I assume you would want to avoid. I separated the commits into basic support in std vector and the breaking changes in loss multiclass log. If you're interested, you can see them here:

takuya-takeuchi/DlibDotNet@master...MisterMcDuck:DlibDotNet:feature/UInt64

I guess overloads could be used instead, but I don't know how much interest there is in these changes.

@takuya-takeuchi
Owner

@MisterMcDuck
Thanks for your contribution and sorry for the late reply.

I created a new PR from your branch.
takuya-takeuchi/DlibDotNet#281

Your change looks good to me :)
TBH, I am not worried about breaking changes.

I will try to build and test it on Windows, Linux and macOS.

Thanks a lot.

@takuya-takeuchi
Owner

It should be resolved in 1.3.0.7.
