Trend diagram example KWIC does not work correctly with search expressions containing repetition #288

janiemi · 2022-12-21T05:32:00Z

If an extended or advanced search expression contains repetition for a token expression, the example KWIC from the trend diagram does not always work correctly:

If the expression contains repeating tokens (e.g. […]{2,3}) and matches more tokens than it has token expressions, the trend diagram examples shows hits multiple times. For example, for a query with three token expressions matching the five tokens a b c d e, the example KWIC shows the three-token matches a b c, b c d and c d e.
If the expression contains optional tokens (e.g. […]{0,1}) and matches fewer tokens than it has token expressions, trend diagram examples do not show hits without the optional tokens (hits with fewer tokens than token expressions).

(This may be a known problem but at least I didn’t find any GitHub issue for it. The problem was noticed by @jpiitula, and I investigated it further.)

Examples of the two cases:

Search “säker” followed by any two words followed by “att”, choose the Statistics tab, then “Show Trend Diagram” and from the trend diagram, the leftmost occurrence (1997) with 1 absolute hit. The opened example KWIC shows two lines: the same sentence twice, the first with “säkert bidrog till” as the hit and the second with “bidrog till att”. Apparently, the original hit was “säkert bidrog till att”.
Search “säker” followed by an optional any word followed by “att”, choose the Statistics tab, select a row with two tokens (such as “säkert att”), then “Show Trend Diagram” and from there, try to choose any point. No example KWIC appears, and the JavaScript console shows the error Uncaught Object { message: "Expected \"<\" or \"[\" but \"(\" found.", expected: (3) […], found: "(", location: {…}, name: "SyntaxError", stack: "" } bundle.js:280:18186. On a local test instance running Språkbanken’s code in the dev branch, I got a different error: Uncaught TypeError: this.s is undefined trend_results.js:49:24. (In fact, this also seemed to happen for the first search in the dev version and in the current Korplabb, so maybe something has been changed that affects the trend diagram examples.) Before this, another error is shown on the console, but I don’t know if it’s related: ReferenceError: token is not defined repeatChange token.js:100

This appears to be due to the backend query having a primary cqp for the base search expression and a secondary cqp2 for the time expression corresponding to the selected point on the trend diagram, padded with []’s to have the same number of token expressions as the primary CQP expression. For example, search 1 above results in this backend query, where I think the relevant parameters for this issue are the following:

cqp: [lemma contains "säker"] []{2,2} [word = "att"]
cqp2: [(int(_.text_datefrom) >= 19970501 & int(_.text_dateto) <= 19970531) | (int(_.text_datefrom) <= 19970501 & int(_.text_dateto) >= 19970531)] [] []
expand_prequeries: false

In this case, the secondary CQP matches all three-token sequences in the matches of the first CQP. If a token expression is optional, the secondary CQP is longer than matches without the optional token, so it doesn’t match at all.

(By the way, is it intentional that the value of cqp2 contains newlines and a large number extra spaces, or do they just result from a multi-line string in the JavaScript code? Its JSON representation is "[(int(_.text_datefrom) >= 19970501 & int(_.text_dateto) <= 19970531) |\n (int(_.text_datefrom) <= 19970501 & int(_.text_dateto) >= 19970531)\n ] [] []".)

I can think of at least the following solutions:

Instead of using a secondary CQP expression, add the time expression with & to the last token expression of the primary CQP (or to the first one, if that is faster): [lemma contains "säker"] []{2,2} [word = "att" & ((int(_.text_datefrom) >= 19970501 & int(_.text_dateto) <= 19970531) | (int(_.text_datefrom) <= 19970501 & int(_.text_dateto) >= 19970531))]
Instead of using a secondary CQP expression, add the time expression to the primary CQP as a global constraint referring to the match label: [lemma contains "säker"] []{2,2} [word = "att"] :: (int(match.text_datefrom) >= 19970501 & int(match.text_dateto) <= 19970531) | (int(match.text_datefrom) <= 19970501 & int(match.text_dateto) >= 19970531)
Use the subset operation in the secondary CQP expression: subset Last where match: [(int(_.text_datefrom) >= 19970501 & int(_.text_dateto) <= 19970531) | (int(_.text_datefrom) <= 19970501 & int(_.text_dateto) >= 19970531)]

I don’t know which of the queries would be the fastest. I think option 3 might be the easiest from the point of view of the frontend, as it wouldn’t need to modify existing CQP expressions. However, the backend would need to be modified not to add a within clause to such CQP expressions. And I don’t know if supporting CQP queries of this kind would have some security implications. In options 1 and 2, I think it might be possible to modify the CQP expression with a regular expression replacement operation, without having to parse the CQP, but you’d need to take into account the possible existing global constraint (in the advanced search).

The text was updated successfully, but these errors were encountered:

majsan · 2023-01-09T09:40:26Z

It would be best if Martin also looked at this issue before we fix it in the frontend.

I think that both 1 and 2 are easier edits to the CQP-query than we already do, but as Jyrki says, 3 is the easiest one. As for performance, I don't know.

@MartinHammarstedt which of the alternatives do you think is the best?

And @janiemi about "the value of cqp2 contains newlines and a large number extra spaces", you're probably right. Should be fixed before closing this (but maybe the code will be replaced anyway).

MartinHammarstedt · 2023-02-14T14:12:06Z

Regarding which is the fastest one, I don't know either. Not using a secondary query is maybe faster.

Like Jyrki said, number 3 would require changes in the backend. Number 2 is probably the cleanest one but wouldn't work if someone is already using global constraints (unless you detect that and add the new constraints to the existing ones). Number 1 feels a little dirty, and if people are using the advanced search it might be hard to figure out exactly which token to add the constraint to (for example in the query [word="hund"] | [word="katt"]).

janiemi · 2023-02-15T09:05:47Z

@MartinHammarstedt, you’re right. I didn’t take token-level regular expressions into account, so option 1 is probably ruled out. And yes, in option 2, a possible existing global constraint should be detected and the CQP expression should be modified differently based on whether it already contains a global constraint or not. I suppose it wouldn’t be too difficult, but of course it would be an additional complication.

Even if option 2 were implemented in the frontend, I think it might sometimes be useful in its own right to have the backend support (secondary) CQP queries that cannot take a within clause, such as the subset operation. My concern with that is if it has any security implications, e.g. by making some operations work that shouldn’t be allowed. I don’t expect that but I think we should try to be fairly certain of it.

janiemi mentioned this issue Dec 21, 2022

Statistics example KWIC: Multi-token match shown as multiple single-token matches if statistics compiled on text attributes only #289

Closed

majsan added the bug label Jan 9, 2023

arildm added a commit that referenced this issue Feb 20, 2024

Constrain trend subquery to match, fix #288

5373cdc

arildm mentioned this issue Feb 20, 2024

Fix trend diagram subquery with repetition #342

Merged

arildm closed this as completed in #342 Feb 20, 2024

Phaqui pushed a commit to giellatekno/korp-frontend that referenced this issue Aug 15, 2024

Constrain trend subquery to match, fix spraakbanken#288

d72e562

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Trend diagram example KWIC does not work correctly with search expressions containing repetition #288

Trend diagram example KWIC does not work correctly with search expressions containing repetition #288

janiemi commented Dec 21, 2022 •

edited

Loading

majsan commented Jan 9, 2023

MartinHammarstedt commented Feb 14, 2023

janiemi commented Feb 15, 2023

Trend diagram example KWIC does not work correctly with search expressions containing repetition #288

Trend diagram example KWIC does not work correctly with search expressions containing repetition #288

Comments

janiemi commented Dec 21, 2022 • edited Loading

majsan commented Jan 9, 2023

MartinHammarstedt commented Feb 14, 2023

janiemi commented Feb 15, 2023

janiemi commented Dec 21, 2022 •

edited

Loading