Allow text-dict style vocab on LM Dataset #1235

JackTemaki · 2022-11-29T14:21:22Z

The default output vocabulary of the https://github.com/rwth-i6/subword-nmt repo is a Python-formatted dictionary in text form. The OggZipDataset can already load it, but the LMDataset could not yet, so I added that capability.

albertz · 2022-11-29T14:43:48Z

We should never check this just by using such generic exception logic. You should instead use some heuristic, e.g. check that the first few non-space characters look like {"...": ..., ...}.

JackTemaki · 2022-11-29T17:40:13Z

Ok I used a regex now.

albertz · 2022-11-30T10:05:26Z

The regex should check the non-whitespace characters from the beginning on (^): the first non-whitespace char must be {, the next must be ", and then the rest should match a pattern like the one you wrote.

You should not read the whole file at that point but just the first 1024 bytes or so.

I think it's then easier and clearer when you filter out the whitespace chars in that header, and then do the regex on the filtered header.

You also should not use re.compile, as it is not needed here; just use re.match.

You also should use our efficient literal_eval (I think in utils, see other related code) instead of eval.
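The points in this review comment can be sketched roughly as follows. This is only an illustration under assumptions, not the code that was merged: the function names `looks_like_text_dict` and `load_text_dict_vocab` are hypothetical, the pattern accepts both quote styles as a guess at what a Python-formatted dict may contain, and the standard-library `ast.literal_eval` stands in for the project's own faster helper mentioned above.

```python
import ast
import re


def looks_like_text_dict(path, header_size=1024):
    """Heuristic check (hypothetical helper): does the file start like {"...": 123 ...}?"""
    # Only read a small header instead of the whole file.
    with open(path, "rb") as f:
        header = f.read(header_size).decode("utf8", errors="ignore")
    # Filter out all whitespace first, so the pattern itself stays simple.
    header = re.sub(r"\s+", "", header)
    # Plain re.match is enough for a one-off check; no re.compile needed.
    return re.match(r'^\{["\'].+["\']:[0-9]+', header) is not None


def load_text_dict_vocab(path):
    """Return the vocab dict if the file passes the heuristic, else None."""
    if not looks_like_text_dict(path):
        return None
    with open(path, "r", encoding="utf8") as f:
        # ast.literal_eval only evaluates literals, unlike eval.
        # Assumption: used here as a stand-in for the project's own literal_eval.
        return ast.literal_eval(f.read())
```

A caller would fall back to its existing line-based vocabulary parsing whenever `load_text_dict_vocab` returns `None`, so no exception handling is needed to distinguish the two formats.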

JackTemaki · 2022-11-30T14:33:52Z

Ok! makes sense.

albertz · 2022-11-30T14:44:01Z

Use .+ and [0-9]+ instead of the * variants.
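To illustrate why the quantifier matters, here is a small comparison. Both patterns are hypothetical reconstructions for this example (the exact regex from the PR is not quoted in this thread), and the headers are assumed to have had their whitespace removed already.

```python
import re

# Hypothetical patterns: identical except for the quantifiers.
LOOSE = r'^\{["\'].*["\']:[0-9]*'   # * also accepts an empty key and a missing index
STRICT = r'^\{["\'].+["\']:[0-9]+'  # + requires at least one key char and one digit


def matches(pattern, header):
    """True if the whitespace-free header matches the pattern from the start."""
    return re.match(pattern, header) is not None


for header in ['{"":', '{"a":', '{"a":1']:
    print(repr(header), "loose:", matches(LOOSE, header), "strict:", matches(STRICT, header))
```

With `*`, a header such as `{"":` with no key and no index still passes; with `+`, only a header containing a non-empty key followed by at least one digit is accepted.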

albertz · 2022-11-30T14:44:50Z

Otherwise ok.

JackTemaki added 2 commits November 18, 2022 16:28

allow text-dict in LMDataset

9810d62

fix lm vocab loading, allow reuse of index

186775e

JackTemaki requested review from a team and albertz as code owners November 29, 2022 14:21

replace SyntaxError by regex-test

7b45db9

updated regex, use literal_eval

60e5ea3

replace * by + in regex

a43e608

albertz approved these changes Nov 30, 2022


albertz merged commit 2c0bf36 into master Nov 30, 2022

albertz deleted the nick_lm_allow_text_dict_vocab branch November 30, 2022 15:29
