Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow text-dict style vocab on LM Dataset #1235

Merged
merged 5 commits into from
Nov 30, 2022

Conversation

JackTemaki
Copy link
Collaborator

@JackTemaki JackTemaki commented Nov 29, 2022

The default output vocabulary of the https://github.com/rwth-i6/subword-nmt repo is a python formatted dictionary in text form. The OggZipDataset can use it, but the LMDataset did not yet, so I added that capability.

@JackTemaki JackTemaki requested review from a team and albertz as code owners November 29, 2022 14:21
@albertz
Copy link
Member

albertz commented Nov 29, 2022

We should never check just using such generic exception logic. You should instead use some heuristic, eg check the first few non-space characters, that it starts with {"...": ..., ...}.

@JackTemaki
Copy link
Collaborator Author

Ok I used a regex now.

@albertz
Copy link
Member

albertz commented Nov 30, 2022

The regex should check the non whitespace characters, from the beginning on (^), and then check that the first non-white space char is {, and then the next is ", and then it matches a pattern like you did.

You should not read the whole file at that point but just the first 1024 bytes or so.

I think it's then easier and clearer when you filter out the whitespace chars in that header, and then do the regex on the filtered header.

You also should not use re.compile as this is not needed here but just re.match.

You also should use our efficient literal_eval (I think in utils, see other related code) instead of eval.

@JackTemaki
Copy link
Collaborator Author

Ok! makes sense.

@albertz
Copy link
Member

albertz commented Nov 30, 2022

.+ and [0-9]+ instead of the *.

@albertz
Copy link
Member

albertz commented Nov 30, 2022

Otherwise ok.

@albertz albertz merged commit 2c0bf36 into master Nov 30, 2022
@albertz albertz deleted the nick_lm_allow_text_dict_vocab branch November 30, 2022 15:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants