
Fix MST-parser tokenization #188

Closed
akolonin opened this issue Mar 20, 2019 · 10 comments
Labels: bug (Something isn't working), doing (In progress), enhancement (New feature or request)

akolonin (Collaborator) commented Mar 20, 2019:

  1. Use the standard affix file for LG-ANY mode in MI-Observer and the MST-parser (@alexei-gl to provide to @glicerico )
  2. Use the same tokenization for MI-Observer and the MST-parser; possibly make it configurable for backward compatibility with legacy code (@glicerico )
  3. Have the MST-parser return the tokenized sentence in its export (@glicerico )
  4. Re-generate MST-parses with the new corpus http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower_LGEng_token/
    for the following parsing settings (rows 54-59):
    https://docs.google.com/spreadsheets/d/1TPbtGrqZ7saUHhOIi5yYmQ9c-cvVlAGqY14ATMPVCq4/edit#gid=963717716
    LG "English"
    Baseline "random"
    Baseline "sequential"
    R=6, Weight = 1, mst-weight: none
    R=6, Weight = 6/r, mst-weight = +1/r
    LG "ANY", all parses, no mst-weight
  5. Make sure GL and GT work with the new parses; perform grammar learning/testing with algorithm settings to be defined (@alexei-gl )

This is a temporary solution for #93.
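The distance-based settings in item 4 (Weight = 6/r within a window of R=6, and mst-weight = +1/r) can be sketched as below. This is an illustrative reading of those settings, not code from the pipeline; the function names `pair_weight` and `mst_link_weight` are hypothetical.

```python
R = 6  # maximum word-pair separation (window size) from the settings above

def pair_weight(r):
    """Weight for counting a word pair whose words are r positions apart.
    Corresponds to the 'Weight = 6/r' setting; pairs beyond R are ignored."""
    if r < 1 or r > R:
        return 0.0
    return 6.0 / r

def mst_link_weight(mi, r):
    """Score for a candidate MST link of length r, assuming 'mst-weight = +1/r'
    means the link's mutual information is boosted by 1/r, favoring short links."""
    return mi + 1.0 / r

# Adjacent words (r=1) get full weight; distant pairs decay; beyond R they drop out.
print(pair_weight(1), pair_weight(6), pair_weight(7))  # 6.0 1.0 0.0
print(mst_link_weight(2.5, 2))  # 3.0
```

Under this reading, the "Weight = 1, mst-weight - none" row in the same table would replace both functions with constants (1 and the raw MI, respectively).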

akolonin added the bug label on Mar 20, 2019
glicerico (Member) commented:

@alexei-gl @akolonin I was thinking about this, based on the following considerations:

  • The tokenization problem is an independent and possibly complex problem.
  • We want to introduce as little "supervision" as possible.
  • We want to make it easy to change tokenization, including in a future where tokenization is done in an unsupervised way.
  • The file-based parser (used for recent NN experiments) avoids tokenization completely by taking an already-tokenized file as input.

Why don't we dissociate tokenization from the current parsing problem completely and remove tokenization from pair-counting and MST-parsing? What I mean is that I can remove all tokenization from these processes and just use the tokenization of whatever file we input (so it would split sentences on spaces only, just like the file-based parser does now).
Tokenization would then become a pre-cleaner problem; we would ensure that the same tokenization is used in both parts of the process, and we could at some point tackle unsupervised tokenization separately and just use its output as input to the parsing pipeline.
How does this sound to you?
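The space-only splitting proposed here amounts to a one-line tokenizer; this minimal sketch is illustrative, not the actual pipeline code:

```python
def tokenize(sentence):
    """Space-only 'tokenization': trust the pre-cleaner's tokens verbatim.
    No affix stripping and no punctuation handling -- the input file is
    assumed to be already tokenized, one sentence per line."""
    return sentence.split()

# Both pair counting (MI-Observer) and MST parsing would consume the same
# token stream, so tokenization stays consistent across the pipeline.
print(tokenize("the cat sat on the mat"))  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
```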

alexei-gl (Collaborator) commented:

@glicerico, @akolonin That sounds reasonable. In that case we still have to agree on the 4.0.affix (and probably 4.0.regex) contents for MI-counting/MST-parsing (LG "any" mode) and for the induced grammar dictionaries used by the grammar tester, so that pre/post-processing can be disabled in link-parser.

glicerico (Member) commented:

@alexei-gl I'm proposing to use an empty affix file. As far as I understand, that's the only file Link Grammar uses for tokenization, but I may be missing something. However, I don't know how you would use that file in your post-parsing processes, so I'm not sure an empty affix file makes sense there.

akolonin (Collaborator, Author) commented Mar 20, 2019:

@glicerico I would love for both MI-Observer and the MST-Parser to have a "spaces-only" tokenization option, as you have already suggested to Sergey Shalyapin. We would use this option to avoid all these confusions and keep things under the full control of the Pre-Cleaner (improving the pre-cleaner for directed speech would be a separate mid-term task).

akolonin added the enhancement label on Mar 26, 2019
glicerico (Member) commented Mar 27, 2019:

@akolonin @alexei-gl
About the list above:

  1. Let's use an empty /any/4.0.affix file
  2. It's done in https://github.com/glicerico/learn/tree/same_tokenizers_ULL, I'll merge after testing that everything works fine
  3. Same as 2)
  4. Planning to submit these runs soon; I have been fixing the pipeline in singnet (broken after splitting the "learn" repo from "opencog")
  5. No comment

akolonin (Collaborator, Author) commented:

@glicerico - please make sure the new MST-parses are in the new format, so that we have MI values attached to the links.

akolonin added the doing label on Apr 2, 2019
glicerico (Member) commented:

I merged the branch in PR singnet/learn#5

glicerico (Member) commented:

I believe this issue has been handled... @akolonin should we close this issue?

akolonin (Collaborator, Author) commented:

Yes, @alexei-gl has just completed verifying the MST-parses.


3 participants