
Model not Learning #3

Open
bieltura opened this issue Mar 22, 2021 · 5 comments
@bieltura
Hi!

First of all, thank you for open sourcing the code. I have tried to replicate the results and found a few issues during the training process.

  1. I wrote a script, following the presented notebook, to produce dists.npy, which is not present in the source code. The resulting file is 799.9 MB and stores an array of shape (9998, 9998).
  2. The class probabilities files are missing, so I am setting them to None.
  3. I had to comment out a line in l2.py to avoid a grad_fn error while training.
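For what it's worth, here is a minimal sketch of how such a pairwise-distance matrix could be generated from embedding vectors (the function name and embedding source are assumptions, not the repository's actual notebook code):

```python
import numpy as np

def pairwise_dists(embeddings: np.ndarray) -> np.ndarray:
    """Symmetric Euclidean distance matrix of shape (n, n) for n embedding rows."""
    sq = np.sum(embeddings ** 2, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once
    d2 = sq[:, None] + sq[None, :] - 2.0 * embeddings @ embeddings.T
    return np.sqrt(np.maximum(d2, 0.0))  # clamp tiny negatives from rounding

# With the real 9998 keyword embeddings this yields the (9998, 9998) array:
# np.save("dists.npy", pairwise_dists(embeddings))
```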

After all of that, I can load your pretrained model Res15_35 (as no manifest files for the 12-class task are provided yet) and reproduce the accuracy on triplet evaluation. However, there is no learning when training my model from scratch. The command used is:

python TripletEncoder.py --name=test_encoder --manifest=35 --mode=Res15 --per_class=5 --per_batch=10 --hidden_size=45

Several per_batch and per_class values have been tested with the same behaviour: the triplet loss oscillates between 0.7 and 1.1 with no evident decrease during training.

(screenshot: triplet loss training curve)

Then running the infer train script through:
python infer_train.py --name=res15_encoder --manifest=35 --model=Res15 --enc_step=25440 --hidden_size=45

The resulting average accuracy is around 20-35. This does not happen when loading the pretrained model; do you know what could be going wrong?

(screenshot: inference accuracy output)

Thanks in advance,

Biel.

@roman-vygon
Owner

Hi @bieltura!
Thanks for the report, and sorry that the code didn't work as intended.
I've added dists and probs files corresponding to the 12 and 35 classes task, manifests for training 12 classes and the "silence" class folder to the project's google drive.
As to the trouble with learning the model from scratch:

  1. The triplet loss shown during training isn't a good indicator of whether the model is learning, because it is calculated only from the hard/semi-hard triplets; still, it should decrease to roughly the level of the margin, which is 0.5 by default.
  2. At first glance, the model may not be training because of the small number of classes per batch: the Res15_35 model was trained with all 35 classes in each batch and 4 examples per class. If memory allows, increase the classes-per-batch parameter.

I'll try to reproduce your error and come back to you with results.
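The point about hard/semi-hard mining can be illustrated with a minimal batch-hard triplet loss sketch (not the repository's implementation; the names and this NumPy formulation are assumptions). Because only the hardest triplets in each batch contribute, the reported loss tends to settle near the margin rather than dropping to zero:

```python
import numpy as np

def batch_hard_triplet_loss(dists, labels, margin=0.5):
    """dists: (n, n) pairwise distances; labels: (n,) class ids."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)          # an anchor is not its own positive
    diff = labels[:, None] != labels[None, :]
    # hardest positive: the farthest same-class sample for each anchor
    hardest_pos = np.where(same, dists, -np.inf).max(axis=1)
    # hardest negative: the closest other-class sample for each anchor
    hardest_neg = np.where(diff, dists, np.inf).min(axis=1)
    return np.maximum(hardest_pos - hardest_neg + margin, 0.0).mean()
```

Once every hardest positive is closer than every hardest negative by at least the margin, the loss hits zero only for those easy batches; mining keeps surfacing hard cases, so the running value hovers around the margin.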

@bieltura
Author

bieltura commented Mar 23, 2021

Hi @roman-vygon,

Thank you a lot for the Drive link; I missed it in the README.

I completely understand the point about hard/semi-hard examples, which also makes sense, but thanks for the clarification :)

The choice of per_class and per_batch is due to memory concerns. I have several GPUs available, but does the NeMo framework support multiple GPUs in your code? There's a gpu argument that expects an int value, which is then used as a str:

os.environ['CUDA_VISIBLE_DEVICES']=str(args.gpu)

I guessed it defines the number of GPUs to use in training, but increasing it from 0 still gives out-of-memory errors in my case.
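For reference, CUDA_VISIBLE_DEVICES selects *which* GPU indices are visible to the process, not how many; raising the value just points the job at a different single card. A sketch of the semantics (the framework must separately implement multi-GPU training to use more than one card):

```python
import os

# Expose only GPU 0 to the process (what --gpu=0 effectively does):
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Exposing several cards is a comma-separated list of device indices,
# e.g. GPUs 0, 1 and 2; CUDA then renumbers them as devices 0..2:
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
```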

I have trained with the new dists.npy and class_probs35.npy, but the model is still not learning (I also tried batch_classes=35 with per_class=2).

In case it helps, my current env is: PyTorch 1.7.0 (because of CUDA driver 11.0), nemo-toolkit 0.10.1 and nemo-asr 0.9.1.

Thanks,
Biel

@roman-vygon
Owner

I'm currently working on porting this code to a newer NeMo version that would support multi-GPU balanced batches, but for now my only advice would be to decrease batch_classes while increasing the per_class argument.

@elenazy

elenazy commented Jul 26, 2021


Hi,
I have run into the same issue. Following the author's recommendation, I decreased batch_classes while increasing the per_class argument, but it didn't work. Have you made any new progress? Please help me.
Thanks!

@bieltura
Author

bieltura commented Oct 2, 2021

Hi,
Sorry for the late reply. In the end we did not implement this part. I hope the author can upload a new version as soon as possible.

Best.
