Statistics example KWIC: Multi-token match shown as multiple single-token matches if statistics compiled on text attributes only #289

Closed
janiemi opened this issue Dec 21, 2022 · 2 comments

janiemi commented Dec 21, 2022

When a query matches multiple tokens and statistics are compiled on one or more text (structural) attributes only (not word (positional) ones), a statistics example KWIC shows a separate match for each token of the original query.

For example: search for “vara säker att”, compile statistics on subject, open the Statistics tab and click on a value of the subject, for example “Sociologi” with 4 matches. The example KWIC then shows 12 matches, one for each of the three tokens in each of the 4 original matches. The example KWIC is returned by this backend query, where I think the relevant parameters are the following:

  • cqp: [lemma contains "vara"] [lemma contains "säker"] [word = "att"]
  • cqp2: ([_.text_subject="Sociologi"])
  • expand_prequeries: false

The secondary CQP expression cqp2 is a single-token expression, so it separately matches each token matched by the primary CQP expression.
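To illustrate the counting, here is a toy model in Python. It is only a sketch and not Korp's or CWB's actual code: the function names, the corpus positions and the `subject_of` lookup are all made up for the example. It assumes the primary query found 4 three-token matches, all inside texts with subject “Sociologi”.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (first token position, last token position), inclusive


def hits_per_token(matches: List[Span],
                   subject_of: Callable[[int], str],
                   value: str) -> List[Span]:
    """Model of a single-token secondary query run inside the primary matches:
    every token of a primary match with the right subject is its own hit."""
    return [(pos, pos)
            for start, end in matches
            for pos in range(start, end + 1)
            if subject_of(pos) == value]


def hits_per_match(matches: List[Span],
                   subject_of: Callable[[int], str],
                   value: str) -> List[Span]:
    """Model of filtering whole primary matches by the subject of their
    first token, which is what the example KWIC should show."""
    return [(start, end)
            for start, end in matches
            if subject_of(start) == value]


# Hypothetical data: 4 primary matches of 3 tokens each.
primary_matches = [(10, 12), (40, 42), (77, 79), (130, 132)]


def subject_of(pos: int) -> str:
    # Hypothetical lookup: every position belongs to a "Sociologi" text.
    return "Sociologi"


per_token = hits_per_token(primary_matches, subject_of, "Sociologi")
per_match = hits_per_match(primary_matches, subject_of, "Sociologi")
print(len(per_token), len(per_match))
```

In this model the per-token variant yields 12 hits and the per-match variant 4, which are the numbers seen in the example above.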

I think this issue is somewhat similar to #288, even though the secondary CQP expression is not padded with []’s. I’d thus think that similar solutions would work:

  1. Instead of using a secondary CQP expression, add the expression selecting the statistics value(s) with & to the last token of the primary CQP (or to the first token if that is faster): [lemma contains "vara"] [lemma contains "säker"] [word = "att" & _.text_subject="Sociologi"].
    In fact, a similar approach seems to be already used if statistics are compiled by both a word and a text attribute, when the secondary CQP can be for example ([word="är" & _.text_subject="Sociologi"] [word="säkert"] [word="att"]). (A difference is probably that this secondary CQP is not a modification of the primary one.)
  2. Instead of using a secondary CQP expression, add the expression selecting the statistics value(s) to the primary CQP as a global constraint referring to the match label: [lemma contains "vara"] [lemma contains "säker"] [word = "att"] :: match.text_subject="Sociologi"
  3. Use the subset operation in the secondary CQP expression: subset Last where match: [_.text_subject="Sociologi"]

My disclaimers and notes in #288 also apply here: I don’t know which of the queries would be the fastest. I think option 3 might be the easiest from the point of view of the frontend, as it wouldn’t need to modify existing CQP expressions. However, the backend would need to be modified not to add a within clause to such CQP expressions. And I don’t know if supporting CQP queries of this kind would have some security implications. In options 1 and 2, I think it might be possible to modify the CQP expression with a regular expression replacement operation, without having to parse the CQP, but you’d need to take into account the possible existing global constraint (in the advanced search).
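As a rough illustration of the regular-expression approach for options 1 and 2, here is a Python sketch. It is not the Korp frontend's code, and the function names are made up. It assumes that the query ends with a simple token expression with no `[` or `]` inside quoted strings, that `::` does not occur inside a quoted string, that there is no trailing `within` or similar clause, and that an existing global constraint can be combined with `&`. The `text_year` constraint is a hypothetical example of an existing global constraint.

```python
import re


def add_to_last_token(cqp: str, condition: str) -> str:
    """Option 1: AND the condition into the last token of the query."""
    m = re.search(r"\[([^\[\]]*)\]\s*$", cqp)
    if m is None:
        raise ValueError("query does not end with a simple token expression")
    body = m.group(1).strip()
    if not body:
        new_body = condition                      # last token was []
    elif "|" in body:
        new_body = f"({body}) & {condition}"      # keep the disjunction together
    else:
        new_body = f"{body} & {condition}"
    return f"{cqp[:m.start()]}[{new_body}]"


def add_global_constraint(cqp: str, condition: str) -> str:
    """Option 2: add the condition as a global constraint, or combine it
    with a global constraint that is already there."""
    if "::" in cqp:
        head, existing = cqp.split("::", 1)
        return f"{head.rstrip()} :: ({existing.strip()}) & {condition}"
    return f"{cqp.rstrip()} :: {condition}"


primary = '[lemma contains "vara"] [lemma contains "säker"] [word = "att"]'

option1 = add_to_last_token(primary, '_.text_subject="Sociologi"')
option2 = add_global_constraint(primary, 'match.text_subject="Sociologi"')
option2_existing = add_global_constraint(
    primary + ' :: match.text_year="2020"',       # hypothetical existing constraint
    'match.text_subject="Sociologi"')

print(option1)
print(option2)
print(option2_existing)
```

A real implementation would also have to escape the statistics value for use in a CQP string and cope with the cases excluded by the assumptions above, which is probably where a proper CQP parser becomes the safer choice.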

@majsan majsan added the bug label Jan 9, 2023

majsan commented Jan 9, 2023

Thanks for the bug report. Again, I'll let Martin have a say before trying to fix it.

@MartinHammarstedt Please look at this one, similar to #288.

arildm commented Jan 31, 2024

Fixed in c531656
