Statistics example KWIC: Multi-token match shown as multiple single-token matches if statistics compiled on text attributes only #289

janiemi · 2022-12-21T05:43:08Z

When a query matches multiple tokens and statistics are compiled on one or more text (structural) attributes only (not word (positional) ones), a statistics example KWIC shows a separate match for each token of the original query.

For example: search for “vara säker att” and compile statistics on subject. Then open the Statistics tab and click the value of a subject, for example “Sociologi” with 4 matches. The example KWIC then shows 12 matches, one for each token in the original search result. The example KWIC is returned by a backend query in which I think the relevant parameters are the following:

cqp: [lemma contains "vara"] [lemma contains "säker"] [word = "att"]
cqp2: ([_.text_subject="Sociologi"])
expand_prequeries: false

The secondary CQP expression cqp2 matches each token matched by the primary CQP expression separately.
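The arithmetic behind 4 matches turning into 12 can be illustrated with a toy model. This is only a sketch of the behaviour described above, not of how the backend or CQP is implemented: the corpus positions are made up, and `single_token_subquery` and `whole_match_filter` are hypothetical helper names.

```python
# Toy model: a match is a (start, end) pair of inclusive corpus positions.
# All positions below are invented for illustration.

def single_token_subquery(matches, token_ok):
    """Model of a one-token secondary query run inside each primary match:
    every token that satisfies token_ok becomes a match of its own."""
    return [
        (pos, pos)
        for start, end in matches
        for pos in range(start, end + 1)
        if token_ok(pos)
    ]

def whole_match_filter(matches, token_ok):
    """Model of the intended behaviour: keep each primary match whole
    if its first token satisfies token_ok."""
    return [(start, end) for start, end in matches if token_ok(start)]

# Four hypothetical three-token matches of "vara säker att".
primary = [(10, 12), (40, 42), (77, 79), (95, 97)]

# Assume every token lies in a text with subject "Sociologi".
in_sociologi = lambda pos: True

print(len(single_token_subquery(primary, in_sociologi)))  # 12
print(len(whole_match_filter(primary, in_sociologi)))     # 4
```
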

I think this issue is somewhat similar to #288, even though the secondary CQP expression is not padded with []’s. I’d thus think that similar solutions would work:

1. Instead of using a secondary CQP expression, add the expression selecting the statistics value(s) with & to the last token of the primary CQP (or to the first token if that is faster): [lemma contains "vara"] [lemma contains "säker"] [word = "att" & _.text_subject="Sociologi"]. In fact, a similar approach already seems to be used when statistics are compiled on both a word and a text attribute, in which case the secondary CQP can be, for example, ([word="är" & _.text_subject="Sociologi"] [word="säkert"] [word="att"]). (A difference is probably that this secondary CQP is not a modification of the primary one.)
2. Instead of using a secondary CQP expression, add the expression selecting the statistics value(s) to the primary CQP as a global constraint referring to the match label: [lemma contains "vara"] [lemma contains "säker"] [word = "att"] :: match.text_subject="Sociologi"
3. Use the subset operation in the secondary CQP expression: subset Last where match: [_.text_subject="Sociologi"]
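As a rough illustration, the three CQP variants above could be produced with plain string operations. This is a minimal sketch, not the frontend's actual code: the function names are hypothetical, the value is not escaped, and the first function assumes that the primary CQP ends in a non-empty token with no existing global constraint after it.

```python
def with_token_condition(cqp: str, attr: str, value: str) -> str:
    """Add the condition with & inside the brackets of the last token.
    Assumes the CQP ends in a non-empty token such as [word = "att"]."""
    cqp = cqp.rstrip()
    idx = cqp.rfind("]")
    if idx == -1:
        raise ValueError("no token brackets found in CQP expression")
    return f'{cqp[:idx]} & _.{attr}="{value}"{cqp[idx:]}'

def with_global_constraint(cqp: str, attr: str, value: str) -> str:
    """Append a global constraint that refers to the match label.
    Assumes the CQP has no global constraint yet."""
    return f'{cqp.rstrip()} :: match.{attr}="{value}"'

def subset_expression(attr: str, value: str) -> str:
    """Build a secondary expression that uses the subset operation."""
    return f'subset Last where match: [_.{attr}="{value}"]'

primary = '[lemma contains "vara"] [lemma contains "säker"] [word = "att"]'

print(with_token_condition(primary, "text_subject", "Sociologi"))
print(with_global_constraint(primary, "text_subject", "Sociologi"))
print(subset_expression("text_subject", "Sociologi"))
```

For the example query, the three calls reproduce the expressions written out in the list.
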

My disclaimers and notes in #288 also apply here: I don’t know which of the queries would be the fastest. I think option 3 might be the easiest from the point of view of the frontend, as it wouldn’t need to modify existing CQP expressions. However, the backend would need to be modified not to add a within clause to such CQP expressions. And I don’t know if supporting CQP queries of this kind would have some security implications. In options 1 and 2, I think it might be possible to modify the CQP expression with a regular expression replacement operation, without having to parse the CQP, but you’d need to take into account the possible existing global constraint (in the advanced search).
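The caveat about an existing global constraint could be handled roughly as follows. This sketch uses a small quote-aware scanner instead of a regular expression, handles only double-quoted strings, and assumes that CQP accepts a parenthesised global constraint joined to another one with &; I have not verified that assumption. Both function names are hypothetical.

```python
def split_global_constraint(cqp: str):
    """Split a CQP expression at the first '::' outside double-quoted strings.
    Returns (token_pattern, constraint), where constraint is None if absent."""
    in_string = False
    i = 0
    while i < len(cqp):
        ch = cqp[i]
        if in_string and ch == "\\":
            i += 2  # skip the escaped character inside a string
            continue
        if ch == '"':
            in_string = not in_string
        elif not in_string and cqp.startswith("::", i):
            return cqp[:i].rstrip(), cqp[i + 2:].strip()
        i += 1
    return cqp.strip(), None

def add_match_constraint(cqp: str, attr: str, value: str) -> str:
    """Add match.<attr>="<value>" as a global constraint, combining it
    with an existing constraint if there is one (assumed syntax)."""
    pattern, existing = split_global_constraint(cqp)
    new = f'match.{attr}="{value}"'
    if existing:
        return f"{pattern} :: ({existing}) & {new}"
    return f"{pattern} :: {new}"

print(add_match_constraint('[word = "att"]', "text_subject", "Sociologi"))
print(add_match_constraint(
    '[word = "a::b"] [] :: match.text_year="2020"', "text_subject", "Sociologi"))
```
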

majsan · 2023-01-09T09:53:29Z

Thanks for the bug report. Again, I'll let Martin have a say before trying to fix it.

@MartinHammarstedt Please look at this one; it is similar to #288.

arildm · 2024-01-31T13:43:06Z

Fixed in c531656

majsan added the bug label Jan 9, 2023

arildm mentioned this issue Jan 25, 2024: Constrain CQP subqueries to span match #334 (Merged)

arildm closed this as completed Jan 31, 2024

arildm mentioned this issue Mar 28, 2024: Statistics subquery incorrect when using repetition and boundaries #354 (Closed)

arildm mentioned this issue Apr 15, 2024: Error when clicking trend diagram with multiple series #358 (Closed)
