Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perform a more thorough check of package_data structure #1769

Merged
merged 2 commits into from Jul 23, 2019

Conversation

@dhimmel
Copy link
Contributor

commented May 22, 2019

Summary of changes

package_data and exclude_package_data expect to receive a dictionary whose values are lists rather than strings. Previously, passing a str value did not cause any warning or error, which is dangerous because users accidentally input a glob/path that then gets iterated into individual characters.

Refs #1459

Pull Request Checklist

  • Changes have tests
  • News fragment added in changelog.d. See documentation for details
@dhimmel

This comment has been minimized.

Copy link
Contributor Author

commented May 22, 2019

Any advice on what test I should write and where I should put it would be appreciated.

@dhimmel

This comment has been minimized.

Copy link
Contributor Author

commented May 22, 2019

As of a86588f, the two messages are:

DistutilsSetupError: 'package_data' must be a dictionary mapping package names to lists of wildcard patterns
WARNING: 'package_data' must be a dictionary mapping package names to lists of wildcard patterns, but the value for 'manubot' is a str not list
@@ -314,12 +318,14 @@ def check_package_data(dist, attr, value):
iter(v)
except TypeError:
break
if isinstance(v, str):
distutils.log.warn(

This comment has been minimized.

Copy link
@pganssle

pganssle May 22, 2019

Member

I think we've generally been doing warnings.warn instead of distutils.log.warn these days.

@pganssle

This comment has been minimized.

Copy link
Member

commented May 22, 2019

@dhimmel Thank you for for looking into this. I think you'll want to add tests into test_dist. If you switch over to use a warning instead of distutils.log.warn, you can use the pytest.raises and pytest.warns context managers to test both of these scenarios.

@dhimmel

This comment has been minimized.

Copy link
Contributor Author

commented May 22, 2019

Thanks @pganssle for the suggestions.

As far as the check_package_data implementation goes, I considered a few options:

  1. warn if dictionary value is string
  2. warn if dictionary value is not list
  3. error if dictionary value is string
  4. error if dictionary value is not list

I've implemented 1 because 3 & 4 could cause setup.py files that currently work to begin failing. Does anyone have a preference between 1 or 2. Note that 1 and 2 is also an option, where the warning message would note what type of value was received and that it wasn't a list.

@dhimmel dhimmel force-pushed the dhimmel:issue-1459 branch 2 times, most recently from f7d9412 to b8904e2 May 23, 2019

@dhimmel

This comment has been minimized.

Copy link
Contributor Author

commented May 23, 2019

As of b8904e2, Python 2 tests fail when a unicode str is passed as a dictionary key. This became apparent because test_dist.py invokes from __future__ import unicode_literals. I am guessing we do not want check_package_data data to fail if a dictionary like the following is provided:

    package_data = {
        u'hello': ['*.msg'],
    }

Therefore, I will update check_package_data to be more lenient. But please let me know if Python 2 should only accept str and not unicode keys?

@pganssle

This comment has been minimized.

Copy link
Member

commented May 23, 2019

@dhimmel Python 2 should definitely accept unicode strings for keys.

@benoit-pierre

This comment has been minimized.

Copy link
Member

commented May 23, 2019

Would re-using assert_string_list make more sense? Have you checked the implementation for check_extras?

@dhimmel

This comment has been minimized.

Copy link
Contributor Author

commented May 23, 2019

Would re-using assert_string_list make more sense?

Thanks for the the suggestion. In 93af6a5, I update the docstring for assert_string_list because it seems to raise an error when value=None. One issue I have with assert_string_list is that the implementation doesn't actually test for a list of strings. For example, it would pass with a tuple or generator of strings. Is that intended or not... IDK?

The code for check_package_data could be simplified by outsourcing to a function like assert_string_list, However my worry is about backwards compatibility. Currently, setup.py files that use str (rather than list of str) values for package_data will not fail installation and the package may function properly, despite the unintended interpretation of the the package_data configuration. Because of this, I assumed we'd want to take the following approach:

  • keep all existing checks that raise errors, i.e. don't increase the ability to accept bad input
  • issue warnings for any input that doesn't conform to the documentation, but that was previously accepted

If we are to stick by those guidelines, we can add assert_string_list to check that the dict values are lists of strings, but we'd have to except the DistutilsSetupError from assert_string_list and convert it to a warning. So yes, we could add an extra check, but this wouldn't shorten/simplify check_package_data. Or @benoit-pierre do you think backwards incompatibility is okay, where specs that used to pass now error?

@dhimmel dhimmel force-pushed the dhimmel:issue-1459 branch from 482f6c0 to c065031 May 23, 2019

@benoit-pierre

This comment has been minimized.

Copy link
Member

commented May 23, 2019

Would re-using assert_string_list make more sense?

Thanks for the the suggestion. In 93af6a5, I update the docstring for assert_string_list because it seems to raise an error when value=None. One issue I have with assert_string_list is that the implementation doesn't actually test for a list of strings. For example, it would pass with a tuple or generator of strings. Is that intended or not... IDK?

That's OK, yes.

The code for check_package_data could be simplified by outsourcing to a function like assert_string_list, However my worry is about backwards compatibility. Currently, setup.py files that use str (rather than list of str) values for package_data will not fail installation and the package may function properly, despite the unintended interpretation of the the package_data configuration. Because of this, I assumed we'd want to take the following approach:

* keep all existing checks that raise errors, i.e. don't increase the ability to accept bad input

* issue warnings for any input that doesn't conform to the documentation, but that was previously accepted

If we are to stick by those guidelines, we can add assert_string_list to check that the dict values are lists of strings, but we'd have to except the DistutilsSetupError from assert_string_list and convert it to a warning. So yes, we could add an extra check, but this wouldn't shorten/simplify check_package_data. Or @benoit-pierre do you think backwards incompatibility is okay, where specs that used to pass now error?

IMHO, printing a warning instead of erroring out does not make sense if a keyword value is not valid. Passing a string instead of a list of string for package_data as always been invalid, resulting in missing package data, so I don't see it as a backward incompatible change to enforce it.

@dhimmel

This comment has been minimized.

Copy link
Contributor Author

commented May 24, 2019

IMHO, printing a warning instead of erroring out does not make sense if a keyword value is not valid

Okay in 0435d7a I switch to raising an error for any invalid values.

Passing a string instead of a list of string for package_data as always been invalid, resulting in missing package data, so I don't see it as a backward incompatible change to enforce it.

Note that the old behavior may not always result in missing package data, because on Linux the path / could include everything? Furthermore, missing package data usually makes it so some specific functionality breaks. However, this PR will now make it so the entire installation breaks. Not saying I'm against raising an error, just that I want the risks to be clear.

That's OK, yes.

I'm not sure that all iterables should be considered lists for the purposes of assert_string_list. For example, if the value is a generator that gets consumed during these initial checks, then that is a problem?

@dhimmel dhimmel changed the title check_package_data: warn if value is str Perform a more thorough check of package_data structure May 29, 2019

@dhimmel

This comment has been minimized.

Copy link
Contributor Author

commented May 29, 2019

@pganssle @benoit-pierre ready for re-review. Also advice appreciated on whether this PR should address the potential issue with consumable iterables described above.

@dhimmel

This comment has been minimized.

Copy link
Contributor Author

commented Jun 24, 2019

@pganssle @benoit-pierre ready for re-review.

@benoit-pierre

This comment has been minimized.

Copy link
Member

commented Jun 25, 2019

Sorry for the delay.

I think it would be more useful if the exception message in case of invalid key showed the key value.

Was it you intent to check the exception message in the tests, or to provide a custom message when the check fails (see https://docs.pytest.org/en/latest/deprecations.html#raises-message-deprecated)? IMHO the former is better, and the tests can be parameterized (with added check for the "not a dictionary" case):

 setuptools/dist.py            | 12 ++++-----
 setuptools/tests/test_dist.py | 58 ++++++++++++++++++++++++-------------------
 2 files changed, 38 insertions(+), 32 deletions(-)

diff --git i/setuptools/dist.py w/setuptools/dist.py
index 4e9190b3..8e3e1f92 100644
--- i/setuptools/dist.py
+++ w/setuptools/dist.py
@@ -306,15 +306,15 @@ def check_test_suite(dist, attr, value):
 
 def check_package_data(dist, attr, value):
     """Verify that value is a dictionary of package names to glob lists"""
-    msg = (
-        "{!r} must be a dictionary mapping package names to lists of "
-        "string wildcard patterns".format(attr)
-    )
     if not isinstance(value, dict):
-        raise DistutilsSetupError(msg)
+        raise DistutilsSetupError(
+            "{!r} must be a dictionary mapping package names to lists of "
+            "string wildcard patterns".format(attr))
     for k, v in value.items():
         if not isinstance(k, six.string_types):
-            raise DistutilsSetupError(msg)
+            raise DistutilsSetupError(
+                "{!r} must be strings (got {!r})".format(
+                    'keys of {!r} dict'.format(attr), k))
         assert_string_list(dist, 'values of {!r} dict'.format(attr), v)
 
 
diff --git i/setuptools/tests/test_dist.py w/setuptools/tests/test_dist.py
index 091b3687..38cc1039 100644
--- i/setuptools/tests/test_dist.py
+++ w/setuptools/tests/test_dist.py
@@ -3,6 +3,7 @@
 from __future__ import unicode_literals
 
 import io
+import re
 from distutils.errors import DistutilsSetupError
 from setuptools.dist import (
     _get_unpatched,
@@ -270,35 +271,40 @@ def test_maintainer_author(name, attrs, tmpdir):
             assert line in pkg_lines_set
 
 
-@pytest.mark.filterwarnings('error')
-def test_check_package_data_okay():
-    package_data = {
+CHECK_PACKAGE_DATA_TESTS = (
+    # Valid.
+    ({
         '': ['*.txt', '*.rst'],
         'hello': ['*.msg'],
-    }
-    assert check_package_data(None, 'package_data', package_data) is None
-
-
-def test_check_package_data_invalid_key():
-    package_data = {
+    }, None),
+    # Not a dictionary.
+    ((
+        ('', ['*.txt', '*.rst']),
+        ('hello', ['*.msg']),
+    ), (
+        "'package_data' must be a dictionary mapping package"
+        " names to lists of string wildcard patterns"
+    )),
+    # Invalid key type.
+    ({
         400: ['*.txt', '*.rst'],
-    }
-    expected_message = (
-        "'package_data' must be a dictionary mapping package names to lists "
-        "of string wildcard patterns"
-    )
-    with pytest.raises(DistutilsSetupError, message=expected_message):
-        check_package_data(None, 'package_data', package_data)
-
-
-def test_check_package_data_invalid_value():
-    """error if str values: https://github.com/pypa/setuptools/issues/1459"""
-    package_data = {
+    }, (
+        "\"keys of 'package_data' dict\" "
+        "must be strings (got 400)"
+    )),
+    # Invalid value type.
+    ({
         'hello': '*.msg',
-    }
-    expected_message = (
+    }, (
         "\"values of 'package_data' dict\" "
         "must be a list of strings (got '*.msg')"
-    )
-    with pytest.raises(DistutilsSetupError, message=expected_message):
-        check_package_data(None, 'package_data', package_data)
+    )),
+)
+
+@pytest.mark.parametrize('package_data, expected_message', CHECK_PACKAGE_DATA_TESTS)
+def test_check_package_data(package_data, expected_message):
+    if expected_message is None:
+        assert check_package_data(None, 'package_data', package_data) is None
+    else:
+        with pytest.raises(DistutilsSetupError, match=re.escape(expected_message)):
+            check_package_data(None, 'package_data', package_data)
@benoit-pierre

This comment has been minimized.

Copy link
Member

commented Jun 25, 2019

As for the problem with generators, there's only one option: raise an error if a generator is passed, as we can't change the API of the check_xxx functions (distutils territory) to validate and return the validated value (returning the result of consuming it if a generator is passed).

dhimmel added a commit to dhimmel/setuptools that referenced this pull request Jun 26, 2019

@dhimmel

This comment has been minimized.

Copy link
Contributor Author

commented Jun 26, 2019

Thanks @benoit-pierre for the suggested patch. I applied it in d6fb012.

You were correct that I misunderstood the "message" parameter of pytest.raises as described in the deprecation notice:

It is a common mistake to think this parameter will match the exception message, while in fact it only serves to provide a custom message in case the pytest.raises check fails.

I will now go through everything again and think about whether it makes sense to address the generator issue in this PR.

"{!r} must be strings (got {!r})".format(
'keys of {!r} dict'.format(attr), k))
"keys of the {!r} dict must be strings (got {!r})"
.format(attr, k)

This comment has been minimized.

Copy link
@dhimmel

dhimmel Jun 26, 2019

Author Contributor

I'm thinking we might as well avoid the awkward nesting quoting when possible, such that this returns keys of 'package_data' dict must be strings (got 400) rather than "keys of 'package_data' dict" must be strings (got 400).

@dhimmel dhimmel force-pushed the dhimmel:issue-1459 branch 3 times, most recently from 903cfd2 to 993bbd6 Jun 26, 2019

try:
# verify that value is not a single-use iterable
assert list(value) == list(value)

This comment has been minimized.

Copy link
@dhimmel

dhimmel Jun 26, 2019

Author Contributor

@benoit-pierre does this seem like the right implementation to check whether value is consumable?

For my records, this was a helpful reference here.

This comment has been minimized.

Copy link
@benoit-pierre

benoit-pierre Jun 27, 2019

Member

I wonder if we should just simplify the whole thing and just explicitly check for a list/tuple with isinstance(value, (tuple, list)). @jaraco, @pganssle?

This comment has been minimized.

Copy link
@pganssle

pganssle Jul 2, 2019

Member

I think using isinstance in this case is better than consuming the list and checking if it's consumed for sure.

I haven't had time to look at the rest of this PR and see where this assert_string_list is actually used - is there any chance that people would have legitimate reasons for passing a set to something that calls this?

This comment has been minimized.

Copy link
@dhimmel

dhimmel Jul 3, 2019

Author Contributor

Currently, assert_string_list is called by dist.check_nsp, dist.check_package_data (added by this PR), and directly for some of the following:

setuptools/setup.py

Lines 92 to 114 in 3adda96

"distutils.setup_keywords": [
"eager_resources = setuptools.dist:assert_string_list",
"namespace_packages = setuptools.dist:check_nsp",
"extras_require = setuptools.dist:check_extras",
"install_requires = setuptools.dist:check_requirements",
"tests_require = setuptools.dist:check_requirements",
"setup_requires = setuptools.dist:check_requirements",
"python_requires = setuptools.dist:check_specifier",
"entry_points = setuptools.dist:check_entry_points",
"test_suite = setuptools.dist:check_test_suite",
"zip_safe = setuptools.dist:assert_bool",
"package_data = setuptools.dist:check_package_data",
"exclude_package_data = setuptools.dist:check_package_data",
"include_package_data = setuptools.dist:assert_bool",
"packages = setuptools.dist:check_packages",
"dependency_links = setuptools.dist:assert_string_list",
"test_loader = setuptools.dist:check_importable",
"test_runner = setuptools.dist:check_importable",
"use_2to3 = setuptools.dist:assert_bool",
"convert_2to3_doctests = setuptools.dist:assert_string_list",
"use_2to3_fixers = setuptools.dist:assert_string_list",
"use_2to3_exclude_fixers = setuptools.dist:assert_string_list",
],

def check_nsp(dist, attr, value):
"""Verify that namespace packages are valid"""
ns_packages = value
assert_string_list(dist, attr, ns_packages)

So I think the total list of setup keywords that end up hitting assert_string_list is:

  • eager_resources
  • namespace_packages
  • package_data
  • exclude_package_data
  • dependency_links
  • convert_2to3_doctests
  • use_2to3_fixers
  • use_2to3_exclude_fixers

Is the behavior of any of these setup keywords order dependent? If so, we probably want to not allow sets of strings?

This comment has been minimized.

Copy link
@dhimmel

dhimmel Jul 3, 2019

Author Contributor

I wonder if we should just simplify the whole thing and just explicitly check for a list/tuple with isinstance(value, (tuple, list))

Implemented in d9035ef

This comment has been minimized.

Copy link
@dhimmel

dhimmel Jul 10, 2019

Author Contributor

@dhimmel dhimmel force-pushed the dhimmel:issue-1459 branch from 993bbd6 to ef002e7 Jun 26, 2019

dhimmel added a commit to dhimmel/setuptools that referenced this pull request Jul 3, 2019

fix assert_string_list docstring
value=None raises TypeError

DistutilsSetupError: 2 must be a list of strings (got None)

@benoit-pierre benoit-pierre force-pushed the dhimmel:issue-1459 branch from d9035ef to 8f848bd Jul 16, 2019

@benoit-pierre

This comment has been minimized.

Copy link
Member

commented Jul 16, 2019

I have squashed the history and simplified the news fragment. I think this is good to go, but I'll wait a few days before merging in case @pganssle / @jaraco disagree.

improve `package_data` check
Ensure the dictionary values are lists/tuples of strings.

Fix #1459.

@benoit-pierre benoit-pierre merged commit 38b9010 into pypa:master Jul 23, 2019

6 checks passed

Summary 1 potential rule
Details
codecov/patch 100% of diff hit (target 84.66%)
Details
codecov/project 84.71% (+0.05%) compared to 67344c9
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
deploy/netlify Deploy preview ready!
Details
@dquist

This comment has been minimized.

Copy link

commented Aug 14, 2019

While I welcome better checks for package_data, this was a breaking change because packages that were relying on the previous behavior and were installing fine with 3.7.3 are now unable to be installed with python 3.7.4 and I had to run around trying to figure out why all of our CI pipelines were suddenly failing. Fortunately I was able to quickly patch the offending upstream project, but others may not be so fortunate.

This change should not have been introduced into a patch release.

@pganssle

This comment has been minimized.

Copy link
Member

commented Aug 17, 2019

While I welcome better checks for package_data, this was a breaking change because packages that were relying on the previous behavior and were installing fine with 3.7.3 are now unable to be installed with python 3.7.4 and I had to run around trying to figure out why all of our CI pipelines were suddenly failing.

I don't think this is anything to do with 3.7.3 vs 3.7.4, setuptools is an independent project from CPython. This was released as part of setuptools version 41.1.0, which was a bugfix release, not a patch release.

My understanding is that this is backwards compatible because this only raises an error on already broken workflows - i.e. it may have been silently failing and now it's loudly failing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.