
OOM Issue #109

Open
gbadiali opened this issue Jan 12, 2017 · 1 comment

@gbadiali commented:

The following strings take more than 500ms to be tokenized:

  1. "@Bam_cos0118 실물을 봤으닊 이러죠!!!!!!!!!꺄윽ㅇㅇ썈꺅!!!!!!!!!!!!!밤님 우주최강지구최강중국최강한국최강그리스최강호주최강미국최강북한최강ㅇ일본최강홍콩최강대만최강마카오최강아프리카최강우즈베키스탄최강!!!!존예!!!존귀!!!시라구!!!"
  2. "한국일보 6월3일자 만평 https://t.co/nnZCJovw0w"
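A micro-benchmark along these lines can confirm the per-string latency. The `tokenize` method below is a stand-in stub (whitespace split) so the snippet is self-contained; in a real reproduction it would be replaced by the library's tokenizer call (presumably something like `TwitterKoreanProcessor.tokenize` — an assumption, not verified against this repo). The timing harness itself is the point.

```java
import java.util.Arrays;
import java.util.List;

public class TokenizeBench {
    // Stand-in for the real tokenizer (hypothetical): splits on whitespace.
    // Swap in the actual library call when reproducing the issue.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.split("\\s+"));
    }

    // Average wall-clock milliseconds per tokenize() call over `iterations` runs.
    static long avgMillis(String text, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            tokenize(text);
        }
        return (System.nanoTime() - start) / 1_000_000 / iterations;
    }

    public static void main(String[] args) {
        String slow = "한국일보 6월3일자 만평 https://t.co/nnZCJovw0w";
        System.out.println("avg ms per call: " + avgMillis(slow, 100));
    }
}
```

With the real tokenizer substituted in, a result well above 500 ms per call on the strings above would reproduce the report.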

We also run into OOM errors when tokenizing many of these in a row.
java.lang.OutOfMemoryError: GC overhead limit exceeded
VM error: GC overhead limit exceeded
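For context, the JVM raises "GC overhead limit exceeded" when more than roughly 98% of time goes to garbage collection while recovering under 2% of the heap. A workaround while debugging (illustrative flags only; the class path and main class are placeholders, not from this repo) is to raise the heap ceiling and capture a heap dump for analysis:

```shell
# Raise the heap limit and dump the heap on OOM so the retained
# objects can be inspected (e.g. with Eclipse MAT or jhat).
java -Xmx4g \
     -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/tmp/tokenizer.hprof \
     -cp app.jar com.example.TokenizeMany
```

The heap dump should show which tokenizer-internal structures accumulate across repeated calls.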

@hohyon-ryu (Contributor) commented:

Hi @gbadiali, thanks for reporting the issue. I have lost access to this repo, so I cannot merge PRs or publish new changes. It is also quite complicated to apply updates from this repo to Penguin, since Penguin needs to support two separate versions. I can offer some help if you want to fix this on your side, or if anyone at Twitter wants to own the fix.
