/count: List index out of range with at least certain Flashback and Familjeliv corpora #6

Open
janiemi opened this issue Aug 20, 2021 · 2 comments

janiemi commented Aug 20, 2021

The /count endpoint fails with "IndexError: list index out of range" when querying certain Flashback or Familjeliv subcorpora with certain combinations of the group_by and group_by_struct parameters. For example:
https://ws.spraakbanken.gu.se/ws/korp/v8/count?group_by=deprel&group_by_struct=thread_title&cqp=%3Cthread%3E+%5Bpos%20%3D%20%22DT%22%5D&corpus=FLASHBACK-DATOR&default_within=sentence&debug=true
results in the following:

{
  "ERROR": {
    "type": "IndexError",
    "value": "list index out of range",
    "traceback": [
      "Traceback (most recent call last):",
      "  File \"/home/fkkorp/korp-backend/v8/korp.py\", line 223, in error_catcher",
      "    g(*pargs, **kwargs)",
      "  File \"/home/fkkorp/korp-backend/v8/korp.py\", line 213, in f",
      "    for response in generator(args, *pargs, **kwargs):",
      "  File \"/home/fkkorp/korp-backend/v8/korp.py\", line 1569, in count",
      "    if group_by[i][0] in split:",
      "IndexError: list index out of range"
    ]
  },
  "time": 26.713754177093506
}

Does the corpus data perhaps contain something that /count does not expect? In any case, I think it would be better if the code handled such data without exposing an internal-looking error.
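For readers who want to see exactly which parameters the failing request carries, the percent-encoded URL quoted above can be decoded with the Python standard library. This is only a reading aid; the URL is copied from the report, and nothing here depends on the Korp backend itself.

```python
from urllib.parse import parse_qs, urlsplit

# The failing request, copied verbatim from the report above.
url = (
    "https://ws.spraakbanken.gu.se/ws/korp/v8/count"
    "?group_by=deprel&group_by_struct=thread_title"
    "&cqp=%3Cthread%3E+%5Bpos%20%3D%20%22DT%22%5D"
    "&corpus=FLASHBACK-DATOR&default_within=sentence&debug=true"
)

# parse_qs returns a list per key; each key occurs once here, so take the first value.
params = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}

for key, value in params.items():
    print(f"{key} = {value}")
```

The decoded cqp value is `<thread> [pos = "DT"]`, i.e. a determiner token anchored to the start of a thread.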

I got the error with a number of different parameters, though I haven’t tried all combinations:

  • group_by: pos, deprel, msd, word
  • group_by_struct: thread_title, text_username; but not forum_title
  • cqp: [], [pos="VB"], [pos="DT"], [msd=".*+.*"], but not [pos="RO"]; with or without anchoring to <text> or <thread>, but not when anchoring to <forum>
  • corpus: FLASHBACK-DATOR, FLASHBACK-HEM, FLASHBACK-POLITIK, FLASHBACK-SAMHALLE, FAMILJELIV-FORALDER, FAMILJELIV-KANSLIGA; but not FLASHBACK-LIVSSTIL, FLASHBACK-EKONOMI, FLASHBACK-FORDON, FLASHBACK-DROGER, FLASHBACK-KULTUR, FAMILJELIV-ALLMANNA-KROPP, FAMILJELIV-GRAVID, TWITTER, TWITTER-2015 (with group_by_struct=user_username), WIKIPEDIA-SV (with group_by_struct=text_title)

It would seem that larger corpora are more likely to cause the error, but that is not completely consistent, at least if you only take token count into account. I could not reproduce the error with any corpora other than Flashback and Familjeliv subcorpora.

(I came across this issue by accident when testing different combinations of statistics attributes in the frontend.)

@MartinHammarstedt
Member

The issue seems to be structural attribute values containing tabs. The statistics query is using CWB's tabulate command, and when grouping by more than one attribute the values are separated by tabs. If the values also contain tabs, the result can't be parsed. I'm not sure if this can be solved while still using tabulate, so maybe a note about tabs in the readme will have to do for now, and some better error handling in the code of course.
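To illustrate the failure mode and the kind of error handling mentioned above, here is a minimal sketch. It is not the Korp code: the function names, the exception class and the sample rows (including the thread titles) are all made up for illustration. It assumes only what is described above, namely that each output line holds one tab-separated value per grouping attribute, so a value that itself contains a tab yields too many fields.

```python
from collections import Counter


class MalformedRowError(ValueError):
    """Raised when a line does not have the expected number of columns."""


def parse_tabulate_line(line, n_columns):
    """Split one tab-separated line into exactly n_columns values."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != n_columns:
        raise MalformedRowError(
            f"expected {n_columns} tab-separated values, "
            f"got {len(fields)}: {line!r}"
        )
    return fields


def count_rows(lines, n_columns):
    """Count identical rows; also return the lines that could not be parsed."""
    counts = Counter()
    skipped = []
    for line in lines:
        try:
            counts[tuple(parse_tabulate_line(line, n_columns))] += 1
        except MalformedRowError:
            skipped.append(line)
    return counts, skipped


# Hypothetical output for group_by=pos and group_by_struct=thread_title.
# The third title contains a tab, so the line splits into three fields.
sample = [
    "DT\tInstallera Linux",
    "DT\tInstallera Linux",
    "DT\tTitel med\ttabb",
]
counts, skipped = count_rows(sample, n_columns=2)
print(counts)
print(skipped)
```

Checking the column count up front turns the problem into a descriptive error (or a list of skipped lines) rather than an IndexError further down. It does not recover the affected rows, since the split is ambiguous once a value contains the separator.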

@janiemi
Author

janiemi commented Jan 10, 2023

Ok, thanks for the explanation. Apparently, we have avoided the issue by disallowing tabs in the values of structural attributes as well as positional ones. I didn’t notice any option in cqp to change the value separator of tabulate, so I suppose you can’t do more than what you suggested.
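As a rough sketch of what disallowing tabs can look like in a corpus preparation step, the snippet below replaces tab characters in attribute values before the data is encoded. The function name and the choice of a single space as replacement are assumptions for illustration, not a description of any actual pipeline; rejecting such values outright would be an equally valid policy.

```python
import re

# One or more consecutive tab characters.
_TAB_RUN = re.compile(r"\t+")


def sanitize_attribute_value(value, replacement=" "):
    """Return value with every run of tabs replaced by the replacement string."""
    return _TAB_RUN.sub(replacement, value)


def contains_tab(value):
    """Return True if value would break tab-separated output."""
    return "\t" in value


# Hypothetical thread title containing stray tabs.
title = "Titel med\t\ttabb"
print(contains_tab(title))
print(sanitize_attribute_value(title))
```

Sanitizing at encoding time keeps the query side unchanged, at the cost of the stored values no longer matching the source text exactly.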
