
Model not Learning #3

Open
bieltura opened this issue Mar 22, 2021 · 5 comments
@bieltura
Hi!

First of all, thank you for open sourcing the code. I have tried to replicate the results and found a few issues during the training process.

  1. I wrote a script, following the presented notebook, to produce dists.npy, which is not present in the source code. The resulting file is 799.9 MB and stores an array of shape (9998, 9998).
  2. The class probabilities files are missing, so I am setting them to None.
  3. I had to comment out a line in l2.py to avoid a grad_fn error while training.
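For what it's worth, here is a minimal sketch of how such a pairwise-distance matrix could be generated from embedding vectors (the function name and embedding source are assumptions, not the repository's actual notebook code):

```python
import numpy as np

def pairwise_dists(embeddings: np.ndarray) -> np.ndarray:
    """Symmetric Euclidean distance matrix of shape (n, n) for n embedding rows."""
    sq = np.sum(embeddings ** 2, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once
    d2 = sq[:, None] + sq[None, :] - 2.0 * embeddings @ embeddings.T
    return np.sqrt(np.maximum(d2, 0.0))  # clamp tiny negatives from rounding

# With the real 9998 keyword embeddings this yields the (9998, 9998) array:
# np.save("dists.npy", pairwise_dists(embeddings))
```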

After all of that, I can load your pretrained model Res15_35 (as no manifest files for the 12-class task are provided yet) and reproduce the accuracy on triplet evaluation. However, there is no learning when training my model from scratch. The command used is:

python TripletEncoder.py --name=test_encoder --manifest=35 --mode=Res15 --per_class=5 --per_batch=10 --hidden_size=45

Several per_batch and per_class values have been tested with the same behaviour: the triplet loss oscillates between 0.7 and 1.1 with no evident decrease during training.

(screenshot: triplet loss training curve)

Then running the infer train script through:
python infer_train.py --name=res15_encoder --manifest=35 --model=Res15 --enc_step=25440 --hidden_size=45

The resulting average accuracy is around 20-35. This does not happen when loading the pretrained model; do you know what could be going wrong?

(screenshot: inference accuracy output)

Thanks in advance,

Biel.

@roman-vygon
Owner

Hi @bieltura!
Thanks for the report, and sorry that the code didn't work as intended.
I've added dists and probs files corresponding to the 12 and 35 classes task, manifests for training 12 classes and the "silence" class folder to the project's google drive.
As to the trouble with learning the model from scratch:

  1. The triplet loss shown during training isn't a good indicator of whether the model is learning, because it is calculated only from the hard/semi-hard triplets; still, it should decrease to roughly the level of the margin, which is 0.5 by default.
  2. At first glance, the model may not be training because of the small number of classes per batch: the Res15_35 model was trained with all 35 classes in each batch and 4 examples per class. If memory allows, increase the classes-per-batch parameter.

I'll try to reproduce your error and come back to you with results.
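The point about hard/semi-hard mining can be illustrated with a minimal batch-hard triplet loss sketch (not the repository's implementation; the names and this NumPy formulation are assumptions). Because only the hardest triplets in each batch contribute, the reported loss tends to settle near the margin rather than dropping to zero:

```python
import numpy as np

def batch_hard_triplet_loss(dists, labels, margin=0.5):
    """dists: (n, n) pairwise distances; labels: (n,) class ids."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)          # an anchor is not its own positive
    diff = labels[:, None] != labels[None, :]
    # hardest positive: the farthest same-class sample for each anchor
    hardest_pos = np.where(same, dists, -np.inf).max(axis=1)
    # hardest negative: the closest other-class sample for each anchor
    hardest_neg = np.where(diff, dists, np.inf).min(axis=1)
    return np.maximum(hardest_pos - hardest_neg + margin, 0.0).mean()
```

Once every hardest positive is closer than every hardest negative by at least the margin, the loss hits zero only for those easy batches; mining keeps surfacing hard cases, so the running value hovers around the margin.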

@bieltura
Author

bieltura commented Mar 23, 2021

Hi @roman-vygon,

Thank you a lot for the Drive link; I missed it in the README.

I completely understand the point about hard/semi-hard examples, which also makes sense, but thanks for the clarification :)

The choice of per_class and per_batch is due to memory concerns. I have several GPUs available, but does the NeMo framework support multiple GPUs in your code? There's a gpu argument that expects an int value, which is then used as a str:

os.environ['CUDA_VISIBLE_DEVICES']=str(args.gpu)

I guessed it defines the number of GPUs to use in training, but increasing it from 0 still gives out-of-memory errors in my case.
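For reference, CUDA_VISIBLE_DEVICES selects *which* GPU indices are visible to the process, not how many; raising the value just points the job at a different single card. A sketch of the semantics (the framework must separately implement multi-GPU training to use more than one card):

```python
import os

# Expose only GPU 0 to the process (what --gpu=0 effectively does):
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Exposing several cards is a comma-separated list of device indices,
# e.g. GPUs 0, 1 and 2; CUDA then renumbers them as devices 0..2:
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
```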

I have trained with the new dists.npy and class_probs35.npy, but the model is still not learning (I also tried batch_classes=35 with per_class=2).

In case it helps, my current env is: PyTorch 1.7.0 (because of CUDA driver 11.0), nemo-toolkit 0.10.1 and nemo-asr 0.9.1.

Thanks,
Biel

@roman-vygon
Owner

I'm currently working on porting this code to a newer NeMo version that would support multi-GPU balanced batches, but for now my only advice would be to decrease batch_classes while increasing the per_class argument.

@elenazy

elenazy commented Jul 26, 2021


Hi,
I have run into the same issue. Following the author's recommendation, I decreased batch_classes while increasing the per_class argument, but it didn't work. Have you made any new progress? Please help me.
Thanks!

@bieltura
Author

bieltura commented Oct 2, 2021

Hi,
Sorry for the late reply. In the end we did not implement this part. I hope the author can upload a new version as soon as possible.

Best.
