
laziness for a small optimization #1117

Merged
merged 2 commits into flairNLP:master from non_blocking on Sep 18, 2019

Conversation

@pommedeterresautee (Contributor) commented on Sep 17, 2019

This optimization is based on non-blocking memory transfer of the flair LM embeddings, plus a small change in operation order so that the memory transfer happens while the GPU is computing.
It really pays off only when there are several chunks.
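
For context, here is a minimal sketch of the pattern this relies on (the tensor names and sizes are made up for illustration; this is not the actual flair code): non_blocking=True only behaves asynchronously when the source tensor sits in pinned host memory, and the copy is then free to proceed while the host keeps launching other work.

```python
import torch

# Minimal sketch of the non-blocking transfer pattern (illustrative only).
device = torch.device("cuda")

# Hypothetical embedding chunk; pinning the host memory is what allows the
# copy below to return immediately instead of blocking the Python thread.
cpu_chunk = torch.randn(512, 2048).pin_memory()

# Issue the host-to-device copy as early as possible ...
gpu_chunk = cpu_chunk.to(device, non_blocking=True)

# ... and do other work while the DMA engine moves the data, e.g. prepare
# the next chunk on the CPU or run GPU ops that do not read gpu_chunk yet.
next_chunk = torch.randn(512, 2048).pin_memory()

# Operations that consume gpu_chunk are ordered after the copy on the CUDA
# stream, so no explicit synchronization is needed here.
out = gpu_chunk.mul(2.0).sum()
```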

I can't measure any change on CoNLL-03 timings (there are at most 2 chunks on the first batches with the chunk size set to 512), but on my French dataset with long sentences I get up to 8 chunks (plus many 3-chunk batches) and I measure a constant 1-second decrease, which is significant as the full process takes 13 seconds (14 before), so a 7% decrease.

Of course, the impact of this optimization would grow as the chunk size decreases (because that creates more chunks), but a smaller chunk size would also affect model accuracy, so... I didn't measure it.

More importantly, the main optimization opportunity is now in memory transfer: on CoNLL-2003, the transfer of letter IDs from CPU to GPU (before applying the LM itself) and the transfer of word embeddings to the GPU together represent 1/3 of the time spent in predict on my 2080 Ti.

So clearly this is where to focus the work on speed.
It is kind of good news that it is concentrated in a single bottleneck, but it is a hard one to beat.
I have tried pinned memory without any success; maybe it is not significant enough here.

I have spotted this lib https://github.com/Santosh-Gupta/SpeedTorch but it is not obvious whether it can help.

If you have some ideas, don't hesitate to share them here :-)

NB: an interesting detail, as written in a previous PR: large batches of Sentence now make the algorithm significantly faster; I don't remember that being the case with the released version.

@Santosh-Gupta

@pommedeterresautee, it looks like the embeddings are in self.encoder = nn.Embedding(len(dictionary), embedding_size)?

For this, you can use the library to hold these embeddings on the CPU and pass them to the GPU. You can check the word2vec example from the library to see how to do this, and post a question on the Gitter as soon as one comes up.

I'm not sure I completely understand the process though. It seems that you have a letter ID? Does this correspond to a letter embedding? Is it different from the word embeddings?
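
To make the suggestion above concrete, here is a plain-PyTorch sketch of the general pattern of holding the table on the CPU and shipping only the rows a batch needs to the GPU; it does not use SpeedTorch's own API, and all names and sizes are hypothetical.

```python
import torch

# Plain-PyTorch sketch (not SpeedTorch's API): keep the full embedding table
# on the CPU and move only the rows needed by the current batch to the GPU,
# staging them through a pinned buffer so the copy can be asynchronous.
device = torch.device("cuda")

vocab_size, embedding_size, batch_size = 100_000, 2048, 512   # hypothetical sizes
table = torch.nn.Embedding(vocab_size, embedding_size)        # stays on the CPU

staging = torch.empty(batch_size, embedding_size).pin_memory()  # reusable pinned buffer
token_ids = torch.randint(0, vocab_size, (batch_size,))         # hypothetical batch of IDs

# Gather the needed rows on the CPU into the pinned buffer, then transfer
# only that small slice instead of the whole table.
torch.index_select(table.weight.detach(), 0, token_ids, out=staging)
batch_rows = staging.to(device, non_blocking=True)
```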

@alanakbik (Collaborator)

Wow SpeedTorch looks interesting! Thanks @pommedeterresautee for the PR!

@alanakbik (Collaborator)

👍

@yosipk (Collaborator) commented on Sep 18, 2019

👍

@yosipk merged commit 6a83b5c into flairNLP:master on Sep 18, 2019
@pommedeterresautee (Contributor, Author)

@Santosh-Gupta Thanks a lot for your pointers.
I have not yet had the opportunity to run real tests.
Can you please confirm whether my understanding is right?

nn.Embedding(len(dictionary), embedding_size) contains the embeddings I am interested in.
I want to extract series of embeddings from it.
For that, I need to convert the nn.Embedding to a numpy array, serialize it, load it with DataGadget and then call getData?

Then I create an empty PyTorch tensor and copy into weight.data the weights from Cupy.

Am I right?
Would it be possible to load directly from nn.Embedding, or from numpy arrays, without a serialization-deserialization step?
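
(For context on the last question: on the PyTorch side the embedding matrix is already reachable as an in-memory numpy array, without going through a file; a small illustrative sketch with made-up sizes follows.)

```python
import torch

# Illustrative only: pull the embedding matrix out of nn.Embedding as a numpy
# array held in memory, without serializing it to disk first.
embedding = torch.nn.Embedding(1000, 256)   # hypothetical sizes

# .copy() makes the array independent of the layer's own storage.
weights_np = embedding.weight.detach().cpu().numpy().copy()   # shape (1000, 256)

# And the reverse direction: copy a numpy array back into an embedding layer.
with torch.no_grad():
    embedding.weight.copy_(torch.from_numpy(weights_np))
```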

@Santosh-Gupta

For that, I need to convert the nn.Embedding to a numpy array, serialize it, load it with DataGadget and then call getData?

Yes, and the data will be in a form that PyTorch variables accept.

Then I create an empty PyTorch tensor and copy into weight.data the weights from Cupy.

You would want to create the PyTorch variable beforehand so that getData has a place to put its data.

What is your purpose for this? If you just want to store the data on the CPU, the DataGadget can do that; you have to set CPUPinn=True.

For embeddings, the ModelFactory might be more appropriate. What are you trying to do overall?

Would it be possible to have a loading directly from nn.Embeddings? or from numpy arrays without having a serialization-deserialization step?

Yes, you can use a ModelFactory instance and the afterOptimizationStep method with an index slice covering the entire nn.Embedding. First initialize the instance with zerosInit().

Let's continue the discussion on the project Gitter so that others can benefit from it:

https://gitter.im/SpeedTorch/community

@pommedeterresautee deleted the non_blocking branch on September 19, 2019 at 14:47.