Simplification of code and a bit faster #12

shabie · 2021-11-06T02:37:50Z

Tesseract remains the bottleneck, but I still tried to make the rest of the code faster. If you don't find any bugs, please merge.

Key changes:

Use of lru_cache
Use of word_ids to duplicate each word's bounding box across its tokens. word_ids is a feature of BertTokenizerFast.
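For readers who have not seen these two techniques, here is a minimal sketch of both. It is not the code from this PR. `cached_ocr`, `expand_boxes`, and `SPECIAL_TOKEN_BOX` are hypothetical names, the OCR function returns fixed data instead of calling Tesseract, and the `word_ids` list is written by hand in the shape that `BatchEncoding.word_ids()` returns for a fast tokenizer (one entry per token, `None` for special tokens such as [CLS] and [SEP]).

```python
from functools import lru_cache
from typing import List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)

# Hypothetical placeholder box for special tokens ([CLS], [SEP], padding).
SPECIAL_TOKEN_BOX: Box = (0, 0, 0, 0)


@lru_cache(maxsize=None)
def cached_ocr(image_path: str) -> Tuple[Tuple[str, Box], ...]:
    """Stand-in for a slow OCR call; repeated calls with the same path hit the cache.

    A real version would run Tesseract here. The result is a tuple so the
    cached value cannot be mutated by callers.
    """
    return (
        ("hello", (10, 10, 50, 20)),
        ("tokenization", (60, 10, 140, 20)),
    )


def expand_boxes(
    word_ids: Sequence[Optional[int]],
    word_boxes: Sequence[Box],
    special_box: Box = SPECIAL_TOKEN_BOX,
) -> List[Box]:
    """Repeat each word's box once per sub-word token of that word."""
    return [
        special_box if wid is None else word_boxes[wid]
        for wid in word_ids
    ]


words_and_boxes = cached_ocr("page.png")
word_boxes = [box for _, box in words_and_boxes]

# Assumed tokenization: [CLS] hello token ##ization [SEP]
word_ids = [None, 0, 1, 1, None]
token_boxes = expand_boxes(word_ids, word_boxes)
```

With a real tokenizer, `word_ids` would come from something like `encoding.word_ids(batch_index=0)` after tokenizing the OCR words with `is_split_into_words=True`; the expansion step stays the same.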

shabie · 2021-11-06T22:26:52Z

Sorry, merging this myself since I wanna move forward with it.

uakarsh · 2021-11-07T05:37:16Z

Not a problem, I think that it is fine

shabie added 2 commits November 6, 2021 03:25

tweaked for perform. bottleneck still tesseract

7e563eb

fix error of double cls token

0d56356

shabie requested a review from uakarsh November 6, 2021 02:37

add var for CLS token box

b60d018

shabie merged commit 6378b14 into master Nov 6, 2021

shabie deleted the optim-dataset branch November 6, 2021 22:28