Laziness for a small optimization #1117
Merged
This optimization is based on non-blocking memory transfer for the flair LM embeddings, plus a small change in operation order so that the memory transfer happens while the GPU is busy computing.
It only really helps when there are several chunks.
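To illustrate the general idea (a minimal sketch, not the actual patch: `embed_chunks`, `lm` and the chunk tensors are placeholder names, and a side CUDA stream is used here to get real copy/compute overlap on the device):

```python
import torch

device = torch.device("cuda")
copy_stream = torch.cuda.Stream()

def embed_chunks(chunks, lm):
    """Overlap the host-to-device copy of the next chunk with the LM
    forward pass on the current chunk. Illustrative sketch only."""
    outputs = []
    # First copy: issue it on the side stream, from pinned host memory so
    # the transfer can be truly asynchronous.
    with torch.cuda.stream(copy_stream):
        pending = chunks[0].pin_memory().to(device, non_blocking=True)
    for i in range(len(chunks)):
        # Make the compute stream wait until the pending copy has finished.
        torch.cuda.current_stream().wait_stream(copy_stream)
        current = pending
        current.record_stream(torch.cuda.current_stream())
        if i + 1 < len(chunks):
            # Start copying the next chunk while the GPU runs lm(current).
            with torch.cuda.stream(copy_stream):
                pending = chunks[i + 1].pin_memory().to(device, non_blocking=True)
        outputs.append(lm(current))
    return outputs
```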
I can't measure any change on CoNLL-03 times (there are at most 2 chunks in the first batches with the chunk size set to 512), but on my French dataset with long sentences I get up to 8 chunks (plus many 3-chunk batches) and I measure a consistent 1-second decrease, which is significant as the full process takes 13 seconds (14 before), so a 7% decrease.
Of course, the impact of this optimization would increase as the chunk size decreases (because it would create more chunks), but that would affect model accuracy, so... I didn't measure it.
More importantly, the main optimization opportunity is now memory transfer: on CoNLL 2003, the transfer of letter IDs from CPU to GPU (before applying the LM itself) and the transfer of word embeddings to GPU together represent 1/3 of the time spent in predict on my 2080 Ti.
So clearly this is where to work next on the speed side.
It's kind of good news that it's concentrated in a single bottleneck, but it's a hard one to beat.
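One way to measure how long such a host-to-device copy takes in isolation is with CUDA events (a sketch assuming a CUDA device; the tensor below is a placeholder, not the actual flair input):

```python
import torch

device = torch.device("cuda")
x = torch.randint(0, 256, (64, 512))  # placeholder batch of character/letter IDs

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
x_gpu = x.to(device)      # the CPU -> GPU transfer being measured
end.record()
torch.cuda.synchronize()  # wait for the copy so the timing is valid

print(f"transfer took {start.elapsed_time(end):.2f} ms")
```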
I have tried pinned memory without any success; maybe it's not significant enough here.
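For reference, "pinned memory" here means page-locking the host tensor before the copy, roughly like this (a sketch with a placeholder tensor, not the actual flair code):

```python
import torch

device = torch.device("cuda")
batch = torch.randint(0, 256, (32, 512))        # placeholder batch of IDs

pinned = batch.pin_memory()                     # page-locked host copy
on_gpu = pinned.to(device, non_blocking=True)   # copy can now be asynchronous
torch.cuda.synchronize()                        # block until the async copy completes
```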
I have spotted this lib https://github.com/Santosh-Gupta/SpeedTorch but it's not obvious whether it can help.
If you have some ideas, don't hesitate to share them here :-)
NB: an interesting detail, as mentioned in a previous PR: large batches of Sentence objects now make the algorithm significantly faster; I don't remember that being the case with the released version.