
Clustering/categorising links/connectors in GL #249

Open
akolonin opened this issue Aug 2, 2019 · 3 comments

akolonin commented Aug 2, 2019

As discussed with @linas in an email thread, and with @OlegBaskov and @glicerico in Slack:

Currently, Grammar Learner (GL) clusters words into word categories and Link Grammar (LG) rules, but it does not cluster links, so every link remains an individual reciprocal pair of connectors between an individual pair of clusters. We may need to cluster these links into categories as well. This would be useful for analysis of the grammar, and may (or may not) improve parsing quality.

That is, in the current design of GL grammar induction and export to an LG dictionary:
Assume we have categories/rules A, B, C, D.
AB- and AB+ are connectors from A to B and from B to A, respectively.
CD- and CD+ are connectors between C and D.
I guess (@linas ?) there is a possibility that A and C are both nouns while B and D are both verbs.
In that case, there should be just one label covering both AB and CD.
I think that is what @linas means by "clustering links".

For example, suppose A is one subcategory of nouns, C is another subcategory of nouns, B and D are subcategories of verbs, and the link/connector labels AB and CD are learned from the parses. Then we can generalise: A and C are child categories of X, B and D are child categories of Z, and AB and CD are child categories of XZ.
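This generalisation step can be sketched in a few lines of Python. All names and data shapes here are hypothetical, not the actual GL code:

```python
# Hypothetical sketch: once child categories are merged into parents
# (A, C -> X and B, D -> Z), the link labels built from them collapse too.
parent = {"A": "X", "C": "X", "B": "Z", "D": "Z"}  # category generalisation

def generalise_link(label):
    """Map a link label like 'AB' (a category pair) to its parent pair."""
    left, right = label[0], label[1]
    return parent.get(left, left) + parent.get(right, right)

links = ["AB", "CD"]
print(sorted({generalise_link(l) for l in links}))  # both collapse to 'XZ'
```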

Another example, from a real LG dictionary created by GL:
http://langlearn.singularitynet.io/data/clustering_2018/POC-English-2018-12-31/POC-English-Amb_LG-English_dILEd_no-gen/dict_37C_2018-12-31_0006.4.0.dict
...
% AB
"a":
(ABAK+) or (ABAL+) or (ABAM+) or (ABAO+) or (ABAP+) or (ABBB+) or (ABBD+) or (ABBE+) or (ABBH+);
% AC
"are":
(AFAC- & ACBH+);
% AD
"be":
(BGAD- & BJAD- & ADBA+);
...
AB, AC and AD are categories/rules.
ABAK+ is a connector from AB to AK.
ABAK- is the reverse connector, from AK to AB.
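Under the assumption that category names are fixed-width (two characters, as in the dictionary above), a connector label can be split back into its category pair. A hedged sketch:

```python
def parse_connector(conn):
    """Split a connector like 'ABAK+' into (left_cat, right_cat, direction).
    Assumes two-character category names, as in the dictionary above."""
    label, direction = conn[:-1], conn[-1]
    assert direction in "+-" and len(label) == 4
    return label[:2], label[2:], direction

print(parse_connector("ABAK+"))  # ('AB', 'AK', '+')
print(parse_connector("ABAK-"))  # ('AB', 'AK', '-')
```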

In the email thread, I believe @linas suggests clustering these connector labels in the space of the categories connected by the corresponding links, and then replacing the members of each clustered group with the cluster's label when the grammar is exported.

I can also imagine the processes of clustering (generalising) rules and clustering (generalising) links running iteratively in a loop, one after another, until some stable equilibrium is reached.

This improvement may take place in the existing "Rule Generalisation" component of GL, or there could be a "Grammar Generalisation" component with two sub-components, "Rule Generalisation" and "Link Generalisation", running in a finite loop terminated by some threshold.
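The loop idea could be sketched like this (the two generalisation passes are placeholders; the real GL components have different interfaces):

```python
def generalise_grammar(rules, links, gen_rules, gen_links, max_iters=100):
    """Alternate rule generalisation and link generalisation until neither
    pass changes anything (a fixed point) or the iteration budget runs out.
    gen_rules / gen_links stand in for the actual GL passes."""
    for _ in range(max_iters):
        new_rules = gen_rules(rules, links)
        new_links = gen_links(new_rules, links)
        if new_rules == rules and new_links == links:
            break  # stable equilibrium reached
        rules, links = new_rules, new_links
    return rules, links

# Trivial demo: identity passes converge immediately.
print(generalise_grammar({"A", "C"}, {"AB", "CD"},
                         lambda r, l: r, lambda r, l: l))
```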

Also, in the re-factored GL, "Grammar Generalisation" could run without prior clustering in a sparse vector space, seeded with "initial categories/rules" in which each rule corresponds to a single word taken from the input parses.


linas commented Aug 3, 2019

Consider the following synonymous phrases: "X is next to Y", "X abuts Y", "X is a neighbor of Y", "X and Y share a common boundary", "X and Y are close to each other".

In this example, X and Y are slots into which you can plug any physically compatible objects (nouns naming physical things of similar size). You can turn these slots into connectors in some "obvious" way. Each of these "graphs" has two connectors, one each for the X and Y slots. These graphs are to be clustered together into a collection of synonymous phrases. One is not clustering words, one is not clustering links; one is clustering complex blobs with two attachment points.

So imagine some complex graphs with two attachment points. Now imagine simpler graphs. What's simpler? Well, two vertices with an edge connecting them. Simpler than that? A single dot. A single dot is the simplest possible graph. When you are clustering words, you are clustering single dots. The point of this example is that you can also cluster more complicated things. The next step up in complexity after single dots is single edges, aka links. A link is just an edge, with a word (a dot) at either end of it. To cluster edges, you have to decide "when are two edges similar, or not?" This is similar to, but not the same as building an edge from two word-clusters.
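One plausible way to decide "when are two edges similar" is to compare the sets of word pairs each link type has been observed to connect, e.g. with a Jaccard score. A toy sketch with made-up data:

```python
def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Each link type is characterised by the set of word pairs it connects.
# Purely illustrative counts, not real corpus data.
edge_contexts = {
    "AB": {("dog", "ran"), ("cat", "ran"), ("dog", "slept")},
    "CD": {("dog", "ran"), ("cat", "slept"), ("cat", "ran")},
    "EF": {("red", "ball"), ("big", "ball")},
}

def edge_similarity(e1, e2):
    return jaccard(edge_contexts[e1], edge_contexts[e2])

print(edge_similarity("AB", "CD"))  # 0.5: candidates for one link cluster
print(edge_similarity("AB", "EF"))  # 0.0: keep separate
```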

OK, now for some subtle points. First, how might one write the above synonymous-phrase cluster? Using the ascii-file notation, it would have to be this:

house car tree dog: PA+ or PB-;
is_next_to is_a_neighbor_of adjoins abuts meets touches: PA- & PB+;

Notice that house, car, tree, dog are four single words that can have TWO different connectors. This is because the left-right constraints must be followed. You cannot get this constraint with single word-clusters. You can't just say "oh look, car tree dog house is cluster A and adjoins abuts touches is cluster B and the link between them is AB"; that doesn't work. Just take a look:

tree dog house car: AB+ or AB-;
adjoins abuts touches:  AB+ & AB-;

The above is no good, because it allows "house car" and "car dog" as valid sentences. Word clusters are not enough to get correct link types.
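This failure can be checked mechanically. Below is a deliberately minimal two-word linkage test (single connectors per check, no full LG parsing) showing the one-label grammar licensing "house car" while the PA/PB grammar does not; everything here is illustrative:

```python
# A pair (w1, w2) links if w1 offers some connector C+ and w2 offers C-.
bad = {   # the flawed grammar: one label AB in both directions
    "house": {"AB+", "AB-"}, "car": {"AB+", "AB-"},
    "dog": {"AB+", "AB-"}, "adjoins": {"AB+", "AB-"},
}
good = {  # distinct PA/PB labels enforce the left-right order
    "house": {"PA+", "PB-"}, "car": {"PA+", "PB-"},
    "dog": {"PA+", "PB-"}, "adjoins": {"PA-", "PB+"},
}

def can_link(dictionary, w1, w2):
    """True if some connector label is offered rightward by w1
    and leftward by w2."""
    rightward = {c[:-1] for c in dictionary[w1] if c.endswith("+")}
    leftward = {c[:-1] for c in dictionary[w2] if c.endswith("-")}
    return bool(rightward & leftward)

print(can_link(bad, "house", "car"))       # True: nonsense is licensed
print(can_link(good, "house", "car"))      # False: nouns cannot link directly
print(can_link(good, "house", "adjoins"))  # True: noun PA+ meets verb PA-
```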

One more example, next post.


linas commented Aug 3, 2019

Next example: You cannot say "the dog is next to the mountain" because dogs and mountains are incomparable in size. So the way the links work would have to be:

tree house car dog: PA+ or PB-;
mountain valley lake river:  QA+ or QB-;
protein atom amino ribose: RA+ or RB-;
abuts adjoins touches is_next_to:  (PA- & PB+) or (QA- & QB+) or (RA- & RB+);

The point of the structure above is that it prevents nonsense sentences like "the atom is next to the river".

The next problem is that word-clusters alone are not "orthogonal". So, for example, you can say, "the dog bit the man", but not "the house bit the man". As a result:

bit saw jumped:  S- & O+;
man dog: S+ or O-;
man dog house tree: PA+ or PB-;
adjoins abuts touches: PA- & PB+;

The point here is that "dog" can belong to several different clusters at the same time....
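In connector terms, multiple membership just means a word's lexicon entry carries the connectors of several classes at once. A toy illustration (data invented for this sketch):

```python
# A word may carry the disjuncts of several clusters simultaneously:
# the union of connector sets encodes membership in multiple word-classes.
lexicon = {
    "dog":     {"S+", "O-", "PA+", "PB-"},  # both the S/O and PA/PB classes
    "house":   {"PA+", "PB-"},              # only the PA/PB class
    "bit":     {"S-", "O+"},
    "adjoins": {"PA-", "PB+"},
}

def offers(word, conn):
    """Does this word's entry offer the given connector?"""
    return conn in lexicon[word]

print(offers("dog", "S+"), offers("house", "S+"))
# True False: "the dog bit ..." is licensed, "the house bit ..." is not
```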

All these considerations suggest that word-clusters are not enough, that ideas like link-clustering are needed.


linas commented Aug 3, 2019

For the record, a cut-n-paste from the email. I'm exploring the approach below. The problem with it is that the dimension of the space is huge, and I cannot yet propose a good algorithm that is fast and efficient at finding suitable clusters.

Let wa, wb be two words. Let {da} and {db} be the sets of all disjuncts/n-grams (or whatever) attached to these two words. Definition: a plink ("pair-link") is the tuple (wa, wb, {da}, {db}) and grammatical extraction is the task of deciding when (wa, wb, {da}, {db}) is similar to (wc, we, {dc}, {de}). If one has a collection of similar plinks, a "plink-class", then that plink-class defines a "link-type" in the sense of an LG link type. (To be clear: wa and wb are words and not word-classes. The description here works with words, not word-classes, because clustering into word-classes first risks erasing/blurring distinctions.)

If you have a plink-class, then, yes, you can say that wa and wc are in "the same grammatical class". But this is for this plink-class only. There may be some other plink-class that contains wa but not wc, in which case, wa and wc are NOT "in the same word-category" (for that plink-class, only).

So, for example: let S be the plink-class connecting subjects to verbs. Then wa==lamp and wc==lamps both belong to the same word-class for the S-plink, because both "lamp" and "lamps" can be subjects of a sentence.

Let D be the plink-class connecting nouns to the determiners "these, those, some, many". In this case, "lamp" cannot be in the same word-class as "lamps" because you cannot say "those lamp".

So membership in a grammatical word-class depends on the linkage. Sometimes, "lamp" and "lamps" are in the same word-class, and sometimes not. This is what I mean by "erasure": if one is constructing word-classes strictly from word-vectors, and one put "lamp" and "lamps" into the same word-category, then later on, it becomes impossible to get determiner usage correct.
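A per-link-type comparison avoids that erasure. The sketch below (toy data, arbitrary similarity threshold, all names hypothetical) compares "lamp" and "lamps" by the partner words observed on each link type:

```python
# Contexts: for each (word, link-type), the partner words seen on that link.
contexts = {
    ("lamp", "S"):  {"glows", "fell", "broke"},
    ("lamps", "S"): {"glow", "fell", "broke"},
    ("lamp", "D"):  {"this", "that", "a"},
    ("lamps", "D"): {"these", "those", "many"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def same_class(w1, w2, link, threshold=0.3):
    """Are two words in the same class, relative to one link type?"""
    return jaccard(contexts[(w1, link)], contexts[(w2, link)]) > threshold

print(same_class("lamp", "lamps", "S"))  # True: both can be subjects
print(same_class("lamp", "lamps", "D"))  # False: determiner use differs
```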

If you look at how LG is actually organized, then both "lamp" and "lamps" have the S connector, but only one has the D connector. This is what I mean when I say "word classes are not enough to determine grammar".

There are multiple algorithmic challenges here. One is a combinatorial explosion: the space of (wa, wb, {da}, {db}) is absolutely immense, and extremely sparse. Finding a fast, accurate technique to discover plink-classes is hard. Noise, accuracy, and getting enough samples are hard. If {da} and {db} come from MST, and there is not enough data to overcome the inaccuracy of MST, then those errors are propagated upstream. I do not have any magic answers for these issues, other than saying "more hard work".
