Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto generate excessive long parametrized ids #7254

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

symonk
Copy link
Member

@symonk symonk commented May 24, 2020

Fixes #6881

Evening maintainers, I've opened this PR to get some feedback on this approach. I'm not entirely sure on the best area to solve this problem, using parameterised with very large datasets gets automatically appended onto the generated/resolved ids. This poses a problem later when trying to update the PYTEST_CURRENT_TEST environment variable on a windows based system, doing so will raise a ValueError from setitem. This attempt at patching it checks for a windows based operating system and enforces a 100 char limit, exceeding the limit for an individual id will rewrite using uuid4 based auto-generated template name.

Looking some maintainer feedback on the following:

  • Is this the right area to solve the problem?
  • I wanted to make an opt for overriding the limit here but struggled with guaranteeing MetaFunc has access to the pytest config consistently
  • is this even an appropriate kind of solution / is there a preferred way of auto generating instead of the one outlined here
  • This area of the system seems massively used, I'm sure there are tons of edge cases I haven't considered here

The code base is very big, so I'm relying solely on local test failure(s) to detect anything chaotic, but I would appreciate any tips / feedback for a new comer here.

@@ -1230,6 +1231,14 @@ def idmaker(
_idvalset(valindex, parameterset, argnames, idfn, ids, config=config, item=item)
for valindex, parameterset in enumerate(parametersets)
]
# rewrite parametrized ids greater than a 100 limit
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have to do this for all ids and we should use the auto-id mechanism for that + warn

it has to happen on all platforms, else users get different test ids on different platforms thus destroying reasonably simple aggregation

Copy link
Member Author

@symonk symonk May 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @RonnyPfannschmidt for your time, i'm struggling to warn here without causing an error and failing to collect tests

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at first glance i would propose to use the node.warn apis if applicable, and to move the limit into the _idval helper function in order to get a normal numbered id like other tests with non-repressentable ids

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RonnyPfannschmidt any advice here? i am struggling to warn here with warnings.warn() / item/warn() ? node.warn I am not too familiar with. Attempting to warning is just erroring, I assume due to collection/interpretation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ill at best be able to look into this mid week as im currently preoccupied

@symonk symonk changed the title Auto generating parametrized ids for excessive sizes on windows Auto generate excessive long parametrized ids May 24, 2020
@symonk
Copy link
Member Author

symonk commented May 25, 2020

struggling to warn here using warnings.warn / item.warn inside _idva, think its because this is happening during collection. Also unsure what to do with ids=[] value(s) regarding this

"δοκ.ιμή@παράδειγμα.δοκιμή",
"\\u03b4\\u03bf\\u03ba.\\u03b9\\u03bc\\u03ae@\\u03c0\\u03b1\\u03c1\\u03ac\\u03b4\\u03b5\\u03b9\\u03b3"
"\\u03bc\\u03b1.\\u03b4\\u03bf\\u03ba\\u03b9\\u03bc\\u03ae",
"δοκ.ιμή@δοκιμή",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

converted bytes now exceed the 100 limit, unsure of the data set so shortened it, not sure if acceptable

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one makes me wonder if a "character" limit of 100 might be better than a bytes limit of 100

as far as i can guess the source data is a most interesting mail address and the shortened one isnt

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why we even escape ID names to ASCII? Is there a reason not to allow Unicode IDs?

OK found the answer: disable_test_id_escaping_and_forfeit_all_rights_to_community_support hehe.

Copy link
Member

@bluetech bluetech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a good idea to avoid excessively long IDs.

But, I think that if the user explicitly provides a long ID, we should not forfeit it, but the code as written does.

@@ -1181,19 +1181,25 @@ def _idval(
if hook_id:
return hook_id

modified_val = None # empty strings are valid as val
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better name for this variable would be id. Because I see val as the input value, and the modified_val here is actually the intended ID string for that value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @bluetech will revisit this during this week. - Any other thoughts on a more appropriate name? I wouldnt want to shadow the id() builtin here

Comment on lines +1197 to +1199
# val does not always enter the isinstance checks due to its type
# when it does we will check the string length returned and use auto generation of ids when over 100 chars
# on types which do not match isinstance checks pytest uses auto generate of ids automatically anyway
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to reword this a bit more succinctly:

Suggested change
# val does not always enter the isinstance checks due to its type
# when it does we will check the string length returned and use auto generation of ids when over 100 chars
# on types which do not match isinstance checks pytest uses auto generate of ids automatically anyway
# If val is not of one of the types we support above, or if it ends up too long to
# be useful, use a generated name based on the argument name and index instead.

"δοκ.ιμή@παράδειγμα.δοκιμή",
"\\u03b4\\u03bf\\u03ba.\\u03b9\\u03bc\\u03ae@\\u03c0\\u03b1\\u03c1\\u03ac\\u03b4\\u03b5\\u03b9\\u03b3"
"\\u03bc\\u03b1.\\u03b4\\u03bf\\u03ba\\u03b9\\u03bc\\u03ae",
"δοκ.ιμή@δοκιμή",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why we even escape ID names to ASCII? Is there a reason not to allow Unicode IDs?

OK found the answer: disable_test_id_escaping_and_forfeit_all_rights_to_community_support hehe.

@bluetech
Copy link
Member

@symonk I updated the PR description to add Fixes #6881.

@symonk
Copy link
Member Author

symonk commented Jun 1, 2020

I think it's a good idea to avoid excessively long IDs.

But, I think that if the user explicitly provides a long ID, we should not forfeit it, but the code as written does.

we are already doing it with certain types of data. I am not sure how to overcome this - should it perhaps be a configurable option that users can choose to enable? Given that PYTEST_CURRENT_TEST is rewritten, maybe even a larger default before rewriting would be sensible, to perhaps rewrite when the data is getting extremely big? The user can hack around the windows limits (manually), but should pytest expect them to do so?

Base automatically changed from master to main March 9, 2021 20:40
@nicoddemus
Copy link
Member

Gentle ping. 👍

@kurtmckee
Copy link

Another gentle ping.

I'm encountering this while migrating the feedparser test suite to pytest. It's possible to overcome this by using an explicit ids= parameter but it is unexpected to encounter a ValueError from pytest.

I'm very willing to help drive this to closure. If help or updates are needed, please reply to this PR or to #6881. I'm subscribing to both.

@RonnyPfannschmidt
Copy link
Member

I'm wondering if we should just fall back to autogenerated ids or for long values +warn

@kurtmckee
Copy link

@RonnyPfannschmidt, that would solve my use case. Do you want a new PR that accomplishes that?

@RonnyPfannschmidt
Copy link
Member

That could be a starting point, i wonder if it would make sense to use content hashes for long content

@kurtmckee
Copy link

i wonder if it would make sense to use content hashes

That may be helpful, if uniqueness is a requirement (or even just desirable). My use case involves multiplying a string by a large number to test a performance optimization that only tries decoding n bytes of input to determine the character set. I imagine that others may encounter similar situations.

I won't be able to pick up this work immediately due to work and life, but I'll take a stab at it and post a PR as I'm able.

@RonnyPfannschmidt
Copy link
Member

@kurtmckee note that for specific tests that just multiply a string to go for long, choosing a actual id is the best way to go

@obestwalter
Copy link
Member

It looks like a different approach is preferred, so we should close this one in favour of an upcoming PR by @kurtmckee (if you are up for that :)).

@obestwalter obestwalter added the status: needs information reporter needs to provide more information; can be closed after 2 or more weeks of inactivity label Jun 20, 2024
@kurtmckee
Copy link

@obestwalter I can give it a shot! To confirm, the strategy here is:

  1. Detect long IDs (cross-platform for consistency)
  2. Hash the IDs to avoid platform-specific length restrictions
  3. Issue a warning suggesting ways to choose IDs independently

Does that summary of my understanding align with what's been discussed here?

@obestwalter
Copy link
Member

Proposing to close this and move the discussion back to #6681 - thanks @symonk to get the ball rolling on this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: needs information reporter needs to provide more information; can be closed after 2 or more weeks of inactivity
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Long parametrized test_input on Windows: ValueError: the environment variable is longer than 32767 characters
6 participants