Model not Learning #3
Hi @bieltura!
Hi @roman-vygon, Thank you a lot for the Drive link, I missed it in the README. I completely understand the point about hard/semi-hard examples, which also makes sense; thanks for the clarification :) The choice of per_class and per_batch is due to memory concerns. I have several GPUs available, but does the NeMo framework support multiple GPUs in your code? There's a gpu argument which expects an int value that is then used as a str:
I guess it defines the number of GPUs to use in training, but increasing it from 0 still gives out-of-memory errors in my case. I have trained with the new dists.npy and class_probs35.npy but it is still not learning (I tried batch_classes of 35 with per_class 2 as well). In case it helps, my current env is: PyTorch 1.7.0 (because of CUDA driver 11.0), nemo-toolkit 0.10.1 and nemo-asr 0.9.1. Thanks,
Currently working on porting this code to a newer NeMo version that would support multi-GPU balanced batches, but for now my only advice would be to decrease batch_classes while increasing the per_class argument.
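The tradeoff in that advice can be shown numerically: with a class-balanced sampler, the effective batch size is batch_classes × per_class, so halving one while doubling the other keeps the memory footprint roughly constant. A minimal sketch of such a sampler (this is a hypothetical illustration, not the repo's actual implementation):

```python
import random
from collections import defaultdict

def balanced_batch(labels, batch_classes, per_class, rng=random.Random(0)):
    """Sample one balanced batch: `batch_classes` classes with
    `per_class` examples each, so len(batch) = batch_classes * per_class."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    chosen = rng.sample(sorted(by_class), batch_classes)
    batch = []
    for lab in chosen:
        batch.extend(rng.sample(by_class[lab], per_class))
    return batch

# Toy dataset: 35 classes with 10 examples each.
labels = [c for c in range(35) for _ in range(10)]
a = balanced_batch(labels, batch_classes=10, per_class=5)
b = balanced_batch(labels, batch_classes=5, per_class=10)
print(len(a), len(b))  # 50 50 -> same memory footprint either way
```

Fewer classes per batch with more examples per class gives the triplet miner more positives to work with at the same memory cost.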
Hi,
Hi, Best.
Hi!
First of all, thank you for open sourcing the code. I have tried to replicate the results and I have found a few issues during the training process.
After all of that, I can load your pretrained model Res15_35 (as no manifest files for 12 are provided yet) and I can reproduce the accuracy on the Triplet evaluation. On the other hand, there's no learning when training my model from scratch. The command used is:
python TripletEncoder.py --name=test_encoder --manifest=35 --mode=Res15 --per_class=5 --per_batch=10 --hidden_size=45
Several per_batch and per_class parameters have been tested with the same behaviour: the Triplet loss always oscillates between 0.7 and 1.1, with no evident decrease during training.
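For reference, the triplet margin loss being minimized has the form max(d(a,p) − d(a,n) + margin, 0), so a loss that hovers near the margin value means anchor-positive and anchor-negative distances are not separating at all. A minimal pure-Python sketch (the margin value of 1.0 is illustrative, not necessarily what the repo uses):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(d(a,p) - d(a,n) + margin, 0): zero once the negative is at
    least `margin` farther from the anchor than the positive is."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

# Embeddings that have not separated: the loss hovers near the margin.
print(triplet_loss([0.0, 0.0], [0.1, 0.0], [0.1, 0.1]))  # ~0.96, close to margin 1.0
# Well-separated embeddings: the loss hits zero.
print(triplet_loss([0.0, 0.0], [0.1, 0.0], [3.0, 0.0]))  # 0.0
```

A loss stuck around 0.7-1.1 with margin near 1.0 is consistent with embeddings that never learn to push negatives away.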
Then, running the infer_train script through:
python infer_train.py --name=res15_encoder --manifest=35 --model=Res15 --enc_step=25440 --hidden_size=45
The resulting Avg Accuracy is around 20-35. This does not happen when loading the pretrained model; do you know what could be happening?
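On 35 classes, chance accuracy is only about 3%, so 20-35% means the encoder learned something but far less than the pretrained model. A common way to score an embedding model like this is nearest-neighbor classification over the embeddings; here is a hedged sketch of that kind of check (the function name and toy data are hypothetical, not from the repo):

```python
import math

def one_nn_accuracy(train_emb, train_lab, test_emb, test_lab):
    """Classify each test embedding by its nearest (Euclidean) training
    embedding and return the fraction labeled correctly."""
    correct = 0
    for emb, lab in zip(test_emb, test_lab):
        dists = [math.dist(emb, t) for t in train_emb]
        pred = train_lab[dists.index(min(dists))]
        correct += pred == lab
    return correct / len(test_lab)

# Toy 2-class example: well-clustered embeddings give perfect accuracy.
train = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
labels = ["yes", "yes", "no", "no"]
test = [[0.05, 0.0], [5.05, 5.0]]
print(one_nn_accuracy(train, labels, test, ["yes", "no"]))  # 1.0
```

If the pretrained encoder scores well under the same evaluation while the from-scratch one does not, the gap is in training, not in the evaluation code.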
Thanks in advance,
Biel.