
Fix MST-parser tokenization #188

Closed
akolonin opened this issue Mar 20, 2019 · 10 comments
Labels: bug (Something isn't working), doing (In progress), enhancement (New feature or request)

akolonin (Collaborator) commented Mar 20, 2019:

  1. Use the standard affix file for LG-ANY mode in MI-Observer and the MST-parser (@alexei-gl to provide to @glicerico )
  2. Use the same tokenization for MI-Observer and the MST-parser; possibly make it configurable for backward compatibility with legacy code (@glicerico )
  3. Have the MST-parser return the tokenized sentence in its export (@glicerico )
  4. Re-generate MST-parses with the new corpus http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower_LGEng_token/
    for the following parsing settings (rows 54-59):
    https://docs.google.com/spreadsheets/d/1TPbtGrqZ7saUHhOIi5yYmQ9c-cvVlAGqY14ATMPVCq4/edit#gid=963717716
    LG "English"
    Baseline "random"
    Baseline "sequential"
    R=6, Weight = 1, mst-weight: none
    R=6, Weight = 6/r, mst-weight = +1/r
    LG "ANY", all parses, no mst-weight
  5. Make sure GL and GT work with the new parses; perform grammar learning/testing with algorithm settings to be defined (@alexei-gl )

This is a temporary solution for #93.
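The distance-based settings in item 4 (Weight = 6/r within a window of R=6, and mst-weight = +1/r) can be sketched as below. This is an illustrative reading of those settings, not code from the pipeline; the function names `pair_weight` and `mst_link_weight` are hypothetical.

```python
R = 6  # maximum word-pair separation (window size) from the settings above

def pair_weight(r):
    """Weight for counting a word pair whose words are r positions apart.
    Corresponds to the 'Weight = 6/r' setting; pairs beyond R are ignored."""
    if r < 1 or r > R:
        return 0.0
    return 6.0 / r

def mst_link_weight(mi, r):
    """Score for a candidate MST link of length r, assuming 'mst-weight = +1/r'
    means the link's mutual information is boosted by 1/r, favoring short links."""
    return mi + 1.0 / r

# Adjacent words (r=1) get full weight; distant pairs decay; beyond R they drop out.
print(pair_weight(1), pair_weight(6), pair_weight(7))  # 6.0 1.0 0.0
print(mst_link_weight(2.5, 2))  # 3.0
```

Under this reading, the "Weight = 1, mst-weight - none" row in the same table would replace both functions with constants (1 and the raw MI, respectively).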

akolonin added the bug label on Mar 20, 2019
glicerico (Member) commented:

@alexei-gl @akolonin I was thinking about this, based on the following considerations:

  • The tokenization problem is an independent and possibly complex problem.
  • We want to introduce as little "supervision" as possible.
  • We want to make it easy to change tokenization, including in a future where tokenization is done in an unsupervised way.
  • The file-based parser (used for recent NN experiments) avoids tokenization completely by taking an already-tokenized file as input.

Why don't we dissociate tokenization from the current parsing problem completely and remove tokenization from pair-counting and MST-parsing? What I mean is that I can remove all tokenization from these processes and just use the tokenization of whatever file we input (so it would split sentences on spaces only, just like the file-based parser does now).
Tokenization would then become a pre-cleaner problem; we would ensure that the same tokenization is used in both parts of the process, and we could at some point tackle unsupervised tokenization separately and just use its output as input to the parsing pipeline.
How does this sound to you?
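The space-only splitting proposed here amounts to a one-line tokenizer; this minimal sketch is illustrative, not the actual pipeline code:

```python
def tokenize(sentence):
    """Space-only 'tokenization': trust the pre-cleaner's tokens verbatim.
    No affix stripping and no punctuation handling -- the input file is
    assumed to be already tokenized, one sentence per line."""
    return sentence.split()

# Both pair counting (MI-Observer) and MST parsing would consume the same
# token stream, so tokenization stays consistent across the pipeline.
print(tokenize("the cat sat on the mat"))  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
```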

alexei-gl (Collaborator) commented:

@glicerico, @akolonin That sounds reasonable. In that case we still have to agree on the 4.0.affix (and probably 4.0.regex) contents for MI-counting/MST-parsing (LG "any" mode) and for the induced grammar dictionaries used by the grammar tester, so that pre/post-processing can be disabled in link-parser.

glicerico (Member) commented:

@alexei-gl I'm proposing to use an empty affix file. As far as I understand, that's the only file Link Grammar uses for tokenization, but I may be missing something. However, I don't know how you would use that file in your post-parsing processes, so I'm not sure an empty affix file makes sense there.

akolonin (Collaborator, Author) commented Mar 20, 2019:

@glicerico I would love for both MI-Observer and the MST-Parser to have a "spaces-only" tokenization option, as you have already suggested to Sergey Shalyapin. We would use this option to avoid all these confusions and keep things under the full control of the Pre-Cleaner (improving the pre-cleaner for directed speech would be a separate mid-term task).

akolonin added the enhancement label on Mar 26, 2019
glicerico (Member) commented Mar 27, 2019:

@akolonin @alexei-gl
About the list above:

  1. Let's use an empty /any/4.0.affix file
  2. It's done in https://github.com/glicerico/learn/tree/same_tokenizers_ULL, I'll merge after testing that everything works fine
  3. Same as 2)
  4. Planning to submit these runs soon; I have been fixing the pipeline in singnet (broken after splitting the "learn" repo from "opencog")
  5. No comment

akolonin (Collaborator, Author) commented:

@glicerico - please make sure the new MST-parses are in the new format, so that we have MI values attached to the links.

akolonin added the doing label on Apr 2, 2019
glicerico (Member) commented:

I merged the branch in PR singnet/learn#5

glicerico (Member) commented:

I believe this issue has been handled... @akolonin should we close this issue?

akolonin (Collaborator, Author) commented:

Yes, @alexei-gl has just completed verifying the MST-parses.


3 participants