Allow text-dict style vocab on LM Dataset #1235
We should never rely on such generic exception logic for this check. Instead, use a heuristic, e.g. check that the first few non-space characters look like {"...": ..., ...}.
Ok, I used a regex now.
The regex should look at the non-whitespace characters from the beginning (^): check that the first non-whitespace char is {, that the next one is ", and that what follows matches a pattern like the one you wrote. Do not read the whole file at that point, just the first 1024 bytes or so. I think it is easier and clearer to filter the whitespace chars out of that header and then run the regex on the filtered header. Also, re.compile is not needed here; just use re.match. And use our efficient literal_eval (I think in utils, see other related code) instead of eval.
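A minimal sketch of the header check described above, not the code from this PR. The function name looks_like_text_dict and the constant HEADER_BYTES are hypothetical, and accepting single quotes in addition to " is an assumption, since Python usually prints dict keys with single quotes.

```python
import re

HEADER_BYTES = 1024  # hypothetical constant: only the start of the file is inspected


def looks_like_text_dict(path):
    """Heuristically check whether the file starts like {"symbol": 0, ...}."""
    with open(path, "rb") as f:
        # Read a small header only; ignore a multi-byte char cut off at the end.
        header = f.read(HEADER_BYTES).decode("utf8", errors="ignore")
    # Filter out all whitespace so the pattern does not have to handle it.
    header = re.sub(r"\s+", "", header)
    # "{", a quote, a non-empty key, a closing quote, ":", then an integer index.
    return re.match(r"^\{[\"'].+[\"']:[0-9]+", header) is not None
```

The check never evaluates the file content; it only decides which loader to use, so a false positive would surface later as a parse error in the actual dict loader.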
Ok, makes sense!
Use .+ and [0-9]+ instead of the *.
Otherwise ok.
The default output vocabulary of the https://github.com/rwth-i6/subword-nmt repo is a Python-formatted dictionary in text form. The OggZipDataset can already use it, but the LMDataset could not yet, so I added that capability.
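For illustration, a text-dict vocabulary can be parsed roughly as follows. This is a sketch under assumptions: the sample content is made up, parse_text_dict_vocab is a hypothetical helper, and the standard library's ast.literal_eval stands in for the project's own literal_eval mentioned in the review.

```python
import ast


def parse_text_dict_vocab(text):
    """Parse a Python-dict-style vocab text into (symbol -> index, ordered labels)."""
    # literal_eval only accepts Python literals, so no arbitrary code is executed.
    vocab = ast.literal_eval(text)
    if not isinstance(vocab, dict):
        raise ValueError("expected a dict of symbol -> index")
    # Order the symbols by their index to get the label list.
    labels = sorted(vocab, key=vocab.get)
    return vocab, labels


# Made-up sample content, not taken from a real vocab file.
sample = "{'<s>': 0, '</s>': 1, 'a@@': 2, 'b': 3}"
vocab, labels = parse_text_dict_vocab(sample)
```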