Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make os.walk and os.fwalk yield namedtuple instead of tuple #71047

Closed
palaviv mannequin opened this issue Apr 26, 2016 · 13 comments
Closed

Make os.walk and os.fwalk yield namedtuple instead of tuple #71047

palaviv mannequin opened this issue Apr 26, 2016 · 13 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@palaviv
Copy link
Mannequin

palaviv mannequin commented Apr 26, 2016

BPO 26860
Nosy @loewis, @rhettinger, @giampaolo, @ethanfurman, @serhiy-storchaka, @palaviv
Files
  • os-walk-result-namedtuple.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2017-04-11.02:02:11.234>
    created_at = <Date 2016-04-26.13:45:06.781>
    labels = ['type-feature', 'library']
    title = 'Make os.walk and os.fwalk yield namedtuple instead of tuple'
    updated_at = <Date 2017-04-11.02:02:11.204>
    user = 'https://github.com/palaviv'

    bugs.python.org fields:

    activity = <Date 2017-04-11.02:02:11.204>
    actor = 'rhettinger'
    assignee = 'rhettinger'
    closed = True
    closed_date = <Date 2017-04-11.02:02:11.234>
    closer = 'rhettinger'
    components = ['Library (Lib)']
    creation = <Date 2016-04-26.13:45:06.781>
    creator = 'palaviv'
    dependencies = []
    files = ['42612']
    hgrepos = []
    issue_num = 26860
    keywords = ['patch']
    message_count = 13.0
    messages = ['264285', '264288', '264418', '264421', '264478', '264503', '264506', '264507', '291346', '291352', '291354', '291355', '291402']
    nosy_count = 6.0
    nosy_names = ['loewis', 'rhettinger', 'giampaolo.rodola', 'ethan.furman', 'serhiy.storchaka', 'palaviv']
    pr_nums = []
    priority = 'normal'
    resolution = 'rejected'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue26860'
    versions = ['Python 3.6']

    @palaviv
    Copy link
    Mannequin Author

    palaviv mannequin commented Apr 26, 2016

    I am suggesting that os.walk and os.fwalk will yield a namedtuple instead of the regular tuple they currently yield.
    The use case for this change can be seen in the next example:

    def walk_wrapper(walk_it):
        for dir_entry in walk_it:
            if dir_entry[0] == "aaa":
               yield dir_entry

    Because walk_it can be either os.walk or os.fwalk I need to access dir_entry via index.

    My change will allow me to change this function to:

    def walk_wrapper(walk_it):
        for dir_entry in walk_it:
            if dir_entry.dirpath == "aaa":
               yield dir_entry

    Witch is more clear and readable.

    @palaviv palaviv mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Apr 26, 2016
    @ethanfurman
    Copy link
    Member

    Quick review of patch looks good. I'll try to look it over more closely later.

    @rhettinger
    Copy link
    Contributor

    Classes are normally named with CamelCase. Also, "walk_result" or "WalkResult" seems like an odd name that doesn't really fit. DirEntry or DirInfo is a better match (see the OP's example, "for dir_entry in walk_it: ...")

    The "versionchanged" should be a "versionadded".

    The docs should use "named tuple" instead of "namedtuple". The former is the generic term used in the glossary to describe the instances. The latter is the factory function that creates a new tuple subclass.

    The attribute descriptions for the docs are pretty good. They should also be applied as actual docstrings in the code as well.

    The docs and code for fwalk() needs to be harmonized with walk() so the the tuple fields use the same names: change (root, dirs, files) to (dirpath, dirnames, filenames).

    @serhiy-storchaka
    Copy link
    Member

    Sorry, but I disagree with Raymond in many points.

    Classes are normally named with CamelCase. Also, "walk_result" or "WalkResult" seems like an odd name that doesn't really fit. DirEntry or DirInfo is a better match (see the OP's example, "for dir_entry in walk_it: ...")

    See "stat_result", "statvfs_result", "waitid_result", "uname_result", and "times_result". DirEntry is already used in the os module. And if accept this feature, needed separate types for walk() and fwalk() results.

    The "versionchanged" should be a "versionadded".

    os.walk() is not new. Just it's result is changed. Class "walk_result" can be tagged with "versionadded", but I'm not sure there is a need to document it separately. The documentation of the os module already too large. "uname_result" and "times_result" are not documented.

    The docs and code for fwalk() needs to be harmonized with walk() so the the tuple fields use the same names: change (root, dirs, files) to (dirpath, dirnames, filenames).

    (root, dirs, files) is shorter than (dirpath, dirnames, filenames) and these names were used with os.walk() and os.fwalk() for years.

    I general, I have doubts about this feature.

    1. There is little backward incompatibility. At least pickle is not backward compatible, and I guess other serialization methods.

    2. os.walk() and os.fwalk() are purposed to be used in for loop with immediate unpacking result tuple:

        for root, dirs, files in os.walk(...):
            ...

    Adding named tuple doesn't add any benefit for common case.

    In OP case, you can either use fwalk-based implementation of walk (bpo-15200):

        def fwalk_as_walk(*args, **kwargs):
            for x in os.fwalk(*args, **kwargs):
                yield x[:-1]

    or just ignore the rest of tuple items:

        for root, *_ in walk_it:
            ...
    1. Using namedtuple is slower and consumes more memory than using tuple. Even for FS-related operation like os.walk() this can matter. A lot of code is optimized for exact tuples, with namedtuple this optimization is lost.

    2. New names (dirpath, dirnames, filenames) are questionable. Why not use undersores (dir_names)? "dir" in dirpath refers to the current proceeded directory, but "dir" in dirnames refers to it's subdirectories. Currently you are free to use short names (root, dirs, files) from examples or what you prefer, but with namedtuple you are sticked with standard names forever. There are no names that satisfy everybody.

    3. Third-party walk-like iterators generate tuples, so you can't use attribute access in too general code.

    @palaviv
    Copy link
    Mannequin Author

    palaviv mannequin commented Apr 29, 2016

    In regard to Raymonds points I agree with Serhiys comments.

    As for Serhiy`s doubts:

    1. Using namedtuple is slower and consumes more memory than using tuple. Even for FS-related operation like os.walk() this can matter. A lot of code is optimized for exact tuples, with namedtuple this optimization is lost.

    I did some testing on my own PC:
    ./python -m timeit -s "from os import walk" "for x in walk('Lib'): pass"

    Regular tuple: 7.53 msec
    Named tuple: 7.66 msec

    1. New names (dirpath, dirnames, filenames) are questionable. Why not use undersores (dir_names)? "dir" in dirpath refers to the current proceeded directory, but "dir" in dirnames refers to it's subdirectories. Currently you are free to use short names (root, dirs, files) from examples or what you prefer, but with namedtuple you are sticked with standard names forever. There are no names that satisfy everybody.

    I agree that there will be no names that will satisfy everybody but I think the names that are currently in the documentation are the most trivial choice.

    As for points 1,2,5 this feature doesn`t break any of the old walk API.

    One more point I would like input on is the testing. I can remove the walk method from the WalkTests, FwalkTests classes and use the new named tuple attributes in the tests. Do you think its better or should we keep the tests with the old API (access using indexes)?

    @ethanfurman
    Copy link
    Member

    I'm not clear on what you asking, but regardless we should have both the old (by-index) tests and new by-attribute tests.

    @rhettinger
    Copy link
    Contributor

    https://www.python.org/dev/peps/pep-0008/#class-names -- "Class names should normally use the CapWords convention."

    Examples:
    ---------
    crypt.py
    6:from collections import namedtuple as _namedtuple
    13:class _Method(_namedtuple('_Method', 'name ident salt_chars total_size')):

    difflib.py
    34:from collections import namedtuple as _namedtuple
    36:Match = _namedtuple('Match', 'a b size')

    dis.py
    163:_Instruction = collections.namedtuple("_Instruction",
    280: Generates a sequence of Instruction namedtuples giving the details of each

    doctest.py
    107:from collections import namedtuple
    109:TestResults = namedtuple('TestResults', 'failed attempted')

    functools.py
    21:from collections import namedtuple
    345:_CacheInfo = namedtuple("CacheInfo", ["hits", "misses", "maxsize", "currsize"])

    inspect.py
    51:from collections import namedtuple, OrderedDict
    323:Attribute = namedtuple('Attribute', 'name kind defining_class object')
    968:Arguments = namedtuple('Arguments', 'args, varargs, varkw')
    1008:ArgSpec = namedtuple('ArgSpec', 'args varargs keywords defaults')
    1032:FullArgSpec = namedtuple('FullArgSpec',
    1124:ArgInfo = namedtuple('ArgInfo', 'args varargs keywords locals')
    1317:ClosureVars = namedtuple('ClosureVars', 'nonlocals globals builtins unbound')
    1372:Traceback = namedtuple('Traceback', 'filename lineno function code_context index')
    1412:FrameInfo = namedtuple('FrameInfo', ('frame',) + Traceback._fields)

    nntplib.py
    159:GroupInfo = collections.namedtuple('GroupInfo',
    162:ArticleInfo = collections.namedtuple('ArticleInfo',

    No doubt, there are exceptions to the rule in the standard library which is less consistent than we might like: "stat_result". That said, stat_result is a structseq and many C type names are old or violate the rules (list vs List, etc). New named tuples should follow PEP-8 can use CapWords convention unless there is a strong reason not to in a particular case.

    @palaviv
    Copy link
    Mannequin Author

    palaviv mannequin commented Apr 29, 2016

    Thanks for the response Ethan I think that I will leave the tests as they are in the current patch.

    No doubt, there are exceptions to the rule in the standard library which is less consistent than we might like: "stat_result". That said, stat_result is a structseq and many C type names are old or violate the rules (list vs List, etc). New named tuples should follow PEP-8 can use CapWords convention unless there is a strong reason not to in a particular case.

    I actually thought we should keep on consistency with other "result" like objects. I can see your point about new named tuples that should follow PEP-8 and DirEntry is an example of new "result" class that follow PEP-8.
    What names do you suggest? Maybe DirInfo and FDirInfo?

    @pppery pppery mannequin changed the title os.walk and os.fwalk yield namedtuple instead of tuple Make os.walk and os.fwalk yield namedtuple instead of tuple May 4, 2016
    @giampaolo
    Copy link
    Contributor

    Should we have concerns about performances? Accessing a namedtuple value is almost 4x times slower compared to a plain tuple [1] and os.walk() may iterate hundreds of times.

    http://stackoverflow.com/questions/2646157/what-is-the-fastest-to-access-struct-like-object-in-python

    @rhettinger
    Copy link
    Contributor

    I would expect that the field access time is inconsequential compared to just about every other aspect of os.walk().

    @serhiy-storchaka
    Copy link
    Member

    namedtuple's attribute access was optimized in recent years. In 3.7 it is 30% faster than in 3.4. So now it is only 3x times slower compared to a plain tuple. On other hand, os.walk() and os.fwalk() was optimized too. In 3.7 they are up to 3.5x times faster than in 3.4 (with hot caches). I didn't make measurements, but I expect that using namedtuples with os.walk() can decrease its performance at least by few percents.

    My main concern is that this feature will increase the complexity of the documentation of the os module (very little) and may encourage writing less clear code (but this is just my own preference, others can found new style more clear).

    @serhiy-storchaka
    Copy link
    Member

    s/at least/at most/

    @rhettinger rhettinger self-assigned this Apr 9, 2017
    @rhettinger
    Copy link
    Contributor

    There doesn't seem to be a consensus that the proposal is a net win. Serhiy made a persuasive argument that the added complexity isn't worth it.

    I'll leave this open for a day or two so that anyone else can make their case. Otherwise, I'll mark this as closed/rejected.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants