Laziness for a small optimization #1117
Merged
This optimization is based on non-blocking memory transfer for the flair LM embeddings, plus a small change in operation order so that the memory transfer happens while the GPU is busy computing.
It only really helps when there are several chunks.
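To illustrate the general idea (a minimal sketch, not the actual patch: `embed_chunks`, `lm` and the chunk tensors are placeholder names, and a side CUDA stream is used here to get real copy/compute overlap on the device):

```python
import torch

device = torch.device("cuda")
copy_stream = torch.cuda.Stream()

def embed_chunks(chunks, lm):
    """Overlap the host-to-device copy of the next chunk with the LM
    forward pass on the current chunk. Illustrative sketch only."""
    outputs = []
    # First copy: issue it on the side stream, from pinned host memory so
    # the transfer can be truly asynchronous.
    with torch.cuda.stream(copy_stream):
        pending = chunks[0].pin_memory().to(device, non_blocking=True)
    for i in range(len(chunks)):
        # Make the compute stream wait until the pending copy has finished.
        torch.cuda.current_stream().wait_stream(copy_stream)
        current = pending
        current.record_stream(torch.cuda.current_stream())
        if i + 1 < len(chunks):
            # Start copying the next chunk while the GPU runs lm(current).
            with torch.cuda.stream(copy_stream):
                pending = chunks[i + 1].pin_memory().to(device, non_blocking=True)
        outputs.append(lm(current))
    return outputs
```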
I can't measure any change on CoNLL-03 times (there are at most 2 chunks in the first batches with the chunk size set to 512), but on my French dataset with long sentences I get up to 8 chunks (plus many 3-chunk batches) and I measure a consistent 1-second decrease, which is significant as the full process takes 13 seconds (14 before), so a 7% decrease.
Of course, the impact of this optimization would increase as the chunk size decreases (because it would create more chunks), but that would affect model accuracy, so... I didn't measure it.
More importantly, the main optimization opportunity is now memory transfer: on CoNLL 2003, the transfer of letter IDs from CPU to GPU (before applying the LM itself) and the transfer of word embeddings to GPU together represent 1/3 of the time spent in predict on my 2080 Ti.
So clearly this is where to work next on the speed side.
It's kind of good news that it's concentrated in a single bottleneck, but it's a hard one to beat.
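One way to measure how long such a host-to-device copy takes in isolation is with CUDA events (a sketch assuming a CUDA device; the tensor below is a placeholder, not the actual flair input):

```python
import torch

device = torch.device("cuda")
x = torch.randint(0, 256, (64, 512))  # placeholder batch of character/letter IDs

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
x_gpu = x.to(device)      # the CPU -> GPU transfer being measured
end.record()
torch.cuda.synchronize()  # wait for the copy so the timing is valid

print(f"transfer took {start.elapsed_time(end):.2f} ms")
```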
I have tried pinned memory without any success; maybe it's not significant enough here.
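For reference, "pinned memory" here means page-locking the host tensor before the copy, roughly like this (a sketch with a placeholder tensor, not the actual flair code):

```python
import torch

device = torch.device("cuda")
batch = torch.randint(0, 256, (32, 512))        # placeholder batch of IDs

pinned = batch.pin_memory()                     # page-locked host copy
on_gpu = pinned.to(device, non_blocking=True)   # copy can now be asynchronous
torch.cuda.synchronize()                        # block until the async copy completes
```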
I have spotted this lib https://github.com/Santosh-Gupta/SpeedTorch but it's not obvious whether it can help.
If you have some ideas, don't hesitate to share them here :-)
NB: an interesting detail, as mentioned in a previous PR: large batches of Sentence objects now make the algorithm significantly faster; I don't remember that being the case with the released version.