meet operator behaving strangely #164

Closed
Ansa211 opened this issue Jan 12, 2018 · 5 comments

Ansa211 commented Jan 12, 2018

I find this behaviour very confusing; any explanation would be welcome.

(meet 1:[mwe_id="(.*;.*)"] 2:[] -5 -1) & 1.mwe_id=2.mwe_id within <s/> has 3 results. As expected, all three are to the left of the KWIC word, because meet has the parameters -5 -1.

(meet 1:[mwe_id="(.*;.*)"] 2:[] 1 5) & 1.mwe_id=2.mwe_id within <s/> has 4 results. As expected, all four are to the right of the KWIC word, because meet has the parameters 1 5.

Question 1:
The only condition linking nodes 1 and 2 is equality of their mwe_id. The second query should therefore match the same sentences as the first, with the two words swapping roles: in the second query, the left word should be the KWIC and the right word should be in the context. Why is this not the case?

(meet 1:[mwe_id="(.*;.*)"] 2:[] -5 5) & 1.mwe_id=2.mwe_id within <s/>
should match both to the left and to the right of the KWIC word because of the parameters -5 5. However, it gives the same result as the parameters -5 -1. Why?


Ansa211 commented Jan 12, 2018

The same issue with simplified queries, although the output is harder to survey:

(meet 1:[mwe_id!="_"] 2:[mwe_id!="_"] 1 5) & 1.mwe_id=2.mwe_id & 1.word!=2.word within <s/>
---> apply negative filter (meet 1:[mwe_id!="_"] 2:[mwe_id!="_"] -5 -1) & 1.mwe_id=2.mwe_id & 1.word!=2.word within <s/>

Again, I expected the second query to match exactly the same nodes as the first one, with the roles swapped, so nothing should be left after the negative filter is applied. Why, then, are there 22 lines in the output?


Ansa211 commented Apr 3, 2018

I tested this on an instance of NoSketchEngine; the results are the same, so the problem must be in Manatee. I will report it to SketchEngine people.

Ansa211 closed this as completed Apr 3, 2018

Ansa211 commented Apr 11, 2018

Answer from SketchEngine:

Dear Anna,

unfortunately I don't have any good news for you -- after much deliberation, the gurus told me that when label positions are ambiguous, the result is unspecified. Currently, only one of the possibilities is propagated through the evaluation tree. Only the position of the KWIC is what differentiates between different result rows.

Therefore, queries like this are not well-formed and should be avoided. The query can possibly be formulated in a different way or perhaps emulated using the filtering functionality on concordances.

Best Regards,
Ondrej Herman

Sketch Engine Team


Previous communication

URL: https://the.sketchengine.co.uk/corpus/first?corpname=preloaded%2Fsusanne&reload=&iquery=&queryselector=cqlrow&lemma=&lpos=&phrase=&word=&wpos=&char=&cql=%28meet+1%3A%22his%22+2%3A%5B%5D+1+5%29+%261.word%3D2.word&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all&fc_pos_window_type=both&fc_pos_wsize=5&fc_pos_type=all

I do not understand why this query has empty output, while http://ske.li/e6x has 18 results. My expectation was that this query matches exactly the same sentences, but with the first of the two words being the KWIC (instead of the second which is the KWIC in http://ske.li/e6).

I have described another example of a similar problem with the meet operator and global conditions at #164 . The same queries as mentioned there were tested in a NoSke instance on http://corpora.phil.hhu.de/bonito/parseme.cgi/first?corpname=parseme_de_a&reload=1&iquery=&queryselector=cqlrow&lemma=&phrase=&word=&char=&cql=&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all, so I believe the unexpected behaviour is due to Manatee and not due to the front-end.


Ansa211 commented Apr 11, 2018

At least the following two queries, in which the conditions on node 2 have been more fully specified, have the same number of results (30):
(meet 1:[mwe_id="(.*;.*)"] 2:[mwe_id="(.*;.*)"] -5 -1) & 1.mwe_id=2.mwe_id within <s/>
(meet 1:[mwe_id="(.*;.*)"] 2:[mwe_id="(.*;.*)"] 1 5) & 1.mwe_id=2.mwe_id within <s/>

But this version has 122 results, and in some of them only one node is highlighted:
(meet 1:[mwe_id="(.*;.*)"] 2:[mwe_id="(.*;.*)"] -5 5) & 1.mwe_id=2.mwe_id within <s/>


Ansa211 commented Apr 23, 2018

Further correspondence with Ondrej Herman has clarified the issue even further.

From my message:

Could you please be more specific about what you mean by ambiguous label positions? Is it the case that any query of the form
(meet 1:[conditions1] 2:[conditions2] -num1 num2) & 1.attribute1 = 2.attribute2
is malformed? Or even any
(meet 1:[conditions1] 2:[conditions2] -num1 num2)
? (From a tiny bit of experimentation, I suspect the latter.)
Also, does the same issue concern any other query types that you can think of?

I tried to emulate such queries (with a condition that some attributes are equal between the two words, so that I really need the labels) by using filters, but I found no way to do it: the position labels (such as 1 and 2) are not carried over from the original query to the filter.

Of course, one could go back to
(1:[conditions1] []{0,num2} 2:[conditions2] | 2:[conditions2] []{0,num1} 1:[conditions1]) & 1.attribute1 = 2.attribute2
which should work (is that correct?), but that means losing the functionality of meet (the fact that only the two relevant words are highlighted and only one of them is the KWIC).

I would be grateful if you have any further ideas for reformulating/emulating this type of query.
But more importantly, I would like to understand better which queries I should avoid.


From Ondrej Herman's reply:

The operation (meet A B x y) tries to find every A that has some occurrence of B within the window given by the parameters x and y. The global condition then filters out the rows of this result that do not satisfy the required condition. This means that the result never contains more than one occurrence of A at the same position. In general, meet is not even commutative.
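The evaluation order described above can be illustrated with a small toy model in Python. This is only a sketch and not Manatee's implementation: the `tokens` list, the helper names `meet` and `apply_global`, and the rule that the leftmost B in the window is the one kept are all assumptions made for the example (the real engine leaves that choice unspecified), and the regular expression on mwe_id is simplified to "not `_`".

```python
# Toy model of (meet A B lo hi) followed by a global condition.
# Assumption: when several B tokens fall in the window, only the leftmost
# one is kept. The real engine does not specify which one survives.

def meet(tokens, is_a, is_b, lo, hi):
    """Return one (a_pos, b_pos) row per A that has some B in [a+lo, a+hi]."""
    rows = []
    for a_pos, tok in enumerate(tokens):
        if not is_a(tok):
            continue
        start = max(0, a_pos + lo)
        stop = min(len(tokens) - 1, a_pos + hi)
        for b_pos in range(start, stop + 1):
            if is_b(tokens[b_pos]):
                rows.append((a_pos, b_pos))
                break  # only one labelled B is propagated per A
    return rows


def apply_global(tokens, rows, condition):
    """Drop rows whose labelled tokens fail the global condition."""
    return [(a, b) for a, b in rows if condition(tokens[a], tokens[b])]


# Hypothetical three-token sentence with one multiword expression.
tokens = [
    {"word": "took", "mwe_id": "1"},
    {"word": "a", "mwe_id": "_"},
    {"word": "walk", "mwe_id": "1"},
]
is_mwe = lambda t: t["mwe_id"] != "_"
any_token = lambda t: True
same_id = lambda a, b: a["mwe_id"] == b["mwe_id"]

left = apply_global(tokens, meet(tokens, is_mwe, any_token, -5, -1), same_id)
right = apply_global(tokens, meet(tokens, is_mwe, any_token, 1, 5), same_id)
print(len(left), len(right))
```

In this model the left-window query keeps the row for "walk" (its first candidate, "took", shares the mwe_id), while the right-window query loses the row for "took" because the only candidate propagated is "a". The two queries are therefore not mirror images of each other, even on a single sentence.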

Your first example can give valid results, but only if the corpus contains exactly one B for each A. The second query is fine, but the result is again all A, and the label for B is merely informative for the individual occurrences.

The query
(1:[conditions1] []{0,num2} 2:[conditions2] | 2:[conditions2] []{0,num1} 1:[conditions1]) & 1.attribute1 = 2.attribute2
has a similar problem. The partial results of the left and right parts around the vertical bar can be identical and differ only in their labels. Only one of them then passes through the vertical-bar operator to the evaluation of the global condition.
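The effect of the vertical bar on labels can be sketched in the same toy style. Again, this is an illustration under assumptions and not the engine's code: results are modelled as `(span, labels)` pairs, `union` keeps the first labelling seen for each span, and the two-token corpus and the condition `1.lemma = 2.word` are invented for the example.

```python
def union(*alternatives):
    """Merge result lists, keeping a single labelling per (begin, end) span."""
    merged = {}
    for rows in alternatives:
        for span, labels in rows:
            # Assumption: the first labelling seen for a span wins.
            merged.setdefault(span, labels)
    return merged


# Hypothetical two-token corpus.
tokens = [
    {"word": "y", "lemma": "x"},
    {"word": "x", "lemma": "z"},
]

# Both alternatives cover the same span but assign labels 1 and 2 differently.
left_alt = [((0, 1), {1: 0, 2: 1})]   # 1:[...] []{0,n} 2:[...]
right_alt = [((0, 1), {1: 1, 2: 0})]  # 2:[...] []{0,n} 1:[...]


def condition(labels):
    """Global condition 1.lemma = 2.word on the labelled positions."""
    return tokens[labels[1]]["lemma"] == tokens[labels[2]]["word"]


def evaluate(*alternatives):
    merged = union(*alternatives)
    return [span for span, labels in merged.items() if condition(labels)]


print(evaluate(left_alt, right_alt))  # left labelling survives the union
print(evaluate(right_alt, left_alt))  # right labelling survives the union
```

Whether the span is reported depends only on which labelling happens to survive the union, which is why the outcome of such a query cannot be relied on.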

The query with "meet" differs from this query in one more respect: with meet, A and B can be located at the same position.

I am afraid that we cannot handle these limitations very well in CQL; you are approaching the limits of the language. I cannot think of another solution that could be put together in the interface either, but I will ask around.

Personally, I would proceed by modifying the corpquery script distributed with Manatee: I would feed it your query without the global conditions and evaluate those conditions outside CQL.
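The suggestion to evaluate the global conditions outside CQL could look roughly like the following Python sketch. It does not use Manatee's API or the corpquery script: the in-memory `tokens` list stands in for whatever positions and attribute values such a script would output, and `pairs_in_window` is a hypothetical helper. The point is only that every (A, B) pair in the window is checked, instead of a single propagated one.

```python
def pairs_in_window(tokens, is_a, is_b, lo, hi, condition):
    """Yield every (a_pos, b_pos) with B in [a+lo, a+hi] that meets condition."""
    for a_pos, a_tok in enumerate(tokens):
        if not is_a(a_tok):
            continue
        start = max(0, a_pos + lo)
        stop = min(len(tokens) - 1, a_pos + hi)
        for b_pos in range(start, stop + 1):
            b_tok = tokens[b_pos]
            if is_b(b_tok) and condition(a_tok, b_tok):
                yield a_pos, b_pos


# Hypothetical sentence; in practice this would come from the corpus.
tokens = [
    {"word": "took", "mwe_id": "1"},
    {"word": "a", "mwe_id": "_"},
    {"word": "walk", "mwe_id": "1"},
]
is_mwe = lambda t: t["mwe_id"] != "_"


def same_mwe(a, b):
    """Stand-in for 1.mwe_id=2.mwe_id & 1.word!=2.word."""
    return a["mwe_id"] == b["mwe_id"] and a["word"] != b["word"]


left = list(pairs_in_window(tokens, is_mwe, is_mwe, -5, -1, same_mwe))
right = list(pairs_in_window(tokens, is_mwe, is_mwe, 1, 5, same_mwe))
both = list(pairs_in_window(tokens, is_mwe, is_mwe, -5, 5, same_mwe))
print(left, right, both)
```

With exhaustive pairing the left and right windows return the same pair with the roles swapped, and the -5 5 window returns both, which is the behaviour expected in the original question. The cost is that the work moves out of the query engine and into post-processing.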
