Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trend diagram example KWIC does not work correctly with search expressions containing repetition #288

Closed
janiemi opened this issue Dec 21, 2022 · 3 comments · Fixed by #342
Labels

Comments

@janiemi
Copy link
Contributor

janiemi commented Dec 21, 2022

If an extended or advanced search expression contains repetition for a token expression, the example KWIC from the trend diagram does not always work correctly:

  1. If the expression contains repeating tokens (e.g. […]{2,3}) and matches more tokens than it has token expressions, the trend diagram examples shows hits multiple times. For example, for a query with three token expressions matching the five tokens a b c d e, the example KWIC shows the three-token matches a b c, b c d and c d e.
  2. If the expression contains optional tokens (e.g. […]{0,1}) and matches fewer tokens than it has token expressions, trend diagram examples do not show hits without the optional tokens (hits with fewer tokens than token expressions).

(This may be a known problem but at least I didn’t find any GitHub issue for it. The problem was noticed by @jpiitula, and I investigated it further.)

Examples of the two cases:

  1. Search “säker” followed by any two words followed by “att”, choose the Statistics tab, then “Show Trend Diagram” and from the trend diagram, the leftmost occurrence (1997) with 1 absolute hit. The opened example KWIC shows two lines: the same sentence twice, the first with “säkert bidrog till” as the hit and the second with “bidrog till att”. Apparently, the original hit was “säkert bidrog till att”.
  2. Search “säker” followed by an optional any word followed by “att”, choose the Statistics tab, select a row with two tokens (such as “säkert att”), then “Show Trend Diagram” and from there, try to choose any point. No example KWIC appears, and the JavaScript console shows the error Uncaught Object { message: "Expected \"<\" or \"[\" but \"(\" found.", expected: (3) […], found: "(", location: {…}, name: "SyntaxError", stack: "" } bundle.js:280:18186. On a local test instance running Språkbanken’s code in the dev branch, I got a different error: Uncaught TypeError: this.s is undefined trend_results.js:49:24. (In fact, this also seemed to happen for the first search in the dev version and in the current Korplabb, so maybe something has been changed that affects the trend diagram examples.) Before this, another error is shown on the console, but I don’t know if it’s related: ReferenceError: token is not defined repeatChange token.js:100

This appears to be due to the backend query having a primary cqp for the base search expression and a secondary cqp2 for the time expression corresponding to the selected point on the trend diagram, padded with []’s to have the same number of token expressions as the primary CQP expression. For example, search 1 above results in this backend query, where I think the relevant parameters for this issue are the following:

  • cqp: [lemma contains "säker"] []{2,2} [word = "att"]
  • cqp2: [(int(_.text_datefrom) >= 19970501 & int(_.text_dateto) <= 19970531) | (int(_.text_datefrom) <= 19970501 & int(_.text_dateto) >= 19970531)] [] []
  • expand_prequeries: false

In this case, the secondary CQP matches all three-token sequences in the matches of the first CQP. If a token expression is optional, the secondary CQP is longer than matches without the optional token, so it doesn’t match at all.

(By the way, is it intentional that the value of cqp2 contains newlines and a large number extra spaces, or do they just result from a multi-line string in the JavaScript code? Its JSON representation is "[(int(_.text_datefrom) >= 19970501 & int(_.text_dateto) <= 19970531) |\n (int(_.text_datefrom) <= 19970501 & int(_.text_dateto) >= 19970531)\n ] [] []".)

I can think of at least the following solutions:

  1. Instead of using a secondary CQP expression, add the time expression with & to the last token expression of the primary CQP (or to the first one, if that is faster): [lemma contains "säker"] []{2,2} [word = "att" & ((int(_.text_datefrom) >= 19970501 & int(_.text_dateto) <= 19970531) | (int(_.text_datefrom) <= 19970501 & int(_.text_dateto) >= 19970531))]
  2. Instead of using a secondary CQP expression, add the time expression to the primary CQP as a global constraint referring to the match label: [lemma contains "säker"] []{2,2} [word = "att"] :: (int(match.text_datefrom) >= 19970501 & int(match.text_dateto) <= 19970531) | (int(match.text_datefrom) <= 19970501 & int(match.text_dateto) >= 19970531)
  3. Use the subset operation in the secondary CQP expression: subset Last where match: [(int(_.text_datefrom) >= 19970501 & int(_.text_dateto) <= 19970531) | (int(_.text_datefrom) <= 19970501 & int(_.text_dateto) >= 19970531)]

I don’t know which of the queries would be the fastest. I think option 3 might be the easiest from the point of view of the frontend, as it wouldn’t need to modify existing CQP expressions. However, the backend would need to be modified not to add a within clause to such CQP expressions. And I don’t know if supporting CQP queries of this kind would have some security implications. In options 1 and 2, I think it might be possible to modify the CQP expression with a regular expression replacement operation, without having to parse the CQP, but you’d need to take into account the possible existing global constraint (in the advanced search).

@majsan
Copy link
Member

majsan commented Jan 9, 2023

It would be best if Martin also looked at this issue before we fix it in the frontend.

I think that both 1 and 2 are easier edits to the CQP-query than we already do, but as Jyrki says, 3 is the easiest one. As for performance, I don't know.

@MartinHammarstedt which of the alternatives do you think is the best?

And @janiemi about "the value of cqp2 contains newlines and a large number extra spaces", you're probably right. Should be fixed before closing this (but maybe the code will be replaced anyway).

@majsan majsan added the bug label Jan 9, 2023
@MartinHammarstedt
Copy link
Member

Regarding which is the fastest one, I don't know either. Not using a secondary query is maybe faster.

Like Jyrki said, number 3 would require changes in the backend. Number 2 is probably the cleanest one but wouldn't work if someone is already using global constraints (unless you detect that and add the new constraints to the existing ones). Number 1 feels a little dirty, and if people are using the advanced search it might be hard to figure out exactly which token to add the constraint to (for example in the query [word="hund"] | [word="katt"]).

@janiemi
Copy link
Contributor Author

janiemi commented Feb 15, 2023

@MartinHammarstedt, you’re right. I didn’t take token-level regular expressions into account, so option 1 is probably ruled out. And yes, in option 2, a possible existing global constraint should be detected and the CQP expression should be modified differently based on whether it already contains a global constraint or not. I suppose it wouldn’t be too difficult, but of course it would be an additional complication.

Even if option 2 were implemented in the frontend, I think it might sometimes be useful in its own right to have the backend support (secondary) CQP queries that cannot take a within clause, such as the subset operation. My concern with that is if it has any security implications, e.g. by making some operations work that shouldn’t be allowed. I don’t expect that but I think we should try to be fairly certain of it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants