glob.glob() can return bizarre ordering #26

Closed
erykoff opened this issue Jun 22, 2016 · 8 comments

@erykoff

erykoff commented Jun 22, 2016

Hey, Stack Overflow has the (unsatisfying) answer:
http://stackoverflow.com/questions/6773584/how-is-pythons-glob-glob-ordered

Apparently, glob.glob() returns files in whatever order the filesystem lists them, which may or may not be the order they were written to disk. So two globs over two sets of files are by no means guaranteed to come back in matching order.

(I think sorting might fix this but would require some testing)

@esheldon
Collaborator

esheldon commented Jun 22, 2016

I always do the following

flist=glob(pattern)
flist.sort()

Note that sort() works in place and returns None, so this will not work:

flist=glob(pattern).sort()
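
A quick way to see the difference, as a sketch only. The helper name `sorted_matches` and the file names are made up for illustration:

```python
import glob
import os
import tempfile

def sorted_matches(pattern):
    """Return the glob matches in lexicographic order."""
    flist = glob.glob(pattern)
    flist.sort()  # sorts in place; the return value is None
    return flist

# Create a few empty files in a scratch directory and glob them.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("c.fits", "a.fits", "b.fits"):
        open(os.path.join(tmp, name), "w").close()

    pattern = os.path.join(tmp, "*.fits")
    names = [os.path.basename(f) for f in sorted_matches(pattern)]

    # Chaining .sort() onto glob() throws the list away.
    chained = glob.glob(pattern).sort()
```

Here `names` comes back in alphabetical order regardless of directory order, while `chained` is `None`.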

@rmjarvis
Owner

Or

flist = sorted(glob(pattern))

But the point remains that it would be helpful to have other ways to specify the list of image and catalog files to make sure they match up 1-1.

@esheldon
Collaborator

FYI, sorted() returns an iterator, not a list.

@rmjarvis
Owner

rmjarvis commented Jun 22, 2016

I don't think so...

>>> type(sorted(glob.glob('*.fits')))
<type 'list'>

or in Python 3.4 (where I thought maybe they changed the nature of this function)

>>> type(sorted(glob.glob('*.fits')))
<class 'list'>

@esheldon
Collaborator

I was clearly confused

@erykoff
Author

erykoff commented Jun 22, 2016

I did not know about the sorted() thingy. I always did it the way Erin did.

In any event, as Mike has said, the original point still stands that we
need to guarantee 1-1 matching. It may be that simply using sorted() does
the trick if you have filenames that are all the same except for some
obviously sortable index, which is the "expected" behavior.
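
One case where plain sorted() would not be enough is when the index is zero-padded in one set of names but not in the other. A rough sketch of a stricter pairing follows; `chip_index` and `pair_up` are hypothetical helpers, and it assumes the last run of digits in each file name is the chip index:

```python
import re

def chip_index(name):
    """Return the last run of digits in a file name as an int."""
    return int(re.findall(r"\d+", name)[-1])

def pair_up(images, cats):
    """Pair images with catalogs by numeric index, failing on any mismatch."""
    images = sorted(images, key=chip_index)
    cats = sorted(cats, key=chip_index)
    if [chip_index(f) for f in images] != [chip_index(f) for f in cats]:
        raise ValueError("image and catalog lists do not match 1-1")
    return list(zip(images, cats))

images = ["img_10.fits", "img_2.fits"]
cats = ["cat_02.fits", "cat_10.fits"]

# Lexicographic sorting puts "img_10" before "img_2", so the pairs are wrong.
naive = list(zip(sorted(images), sorted(cats)))

# Sorting on the numeric index pairs chip 2 with chip 2 and 10 with 10.
checked = pair_up(images, cats)
```

With these names, `naive` pairs `img_10.fits` with `cat_02.fits`, whereas `checked` lines the two lists up by chip and raises if one list has an index the other lacks.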


@esheldon
Collaborator

In my experience it is best to determine such things algorithmically. For example, if I'm processing an exposure, I can generate the file names for the CCDs in the exposure using a simple algorithm based on the DESDM run identifier. The base name is known and the suffix is based on the CCD number.

I recommend doing that by default, e.g. specify the exposure you want to process by name or id, and the code looks for it in the usual place ($DESDATA/OPS/red/...) and does its thing.

This has the advantage that if a file is missing you will notice it: you generated the file list, so you know what should be there, and if a file is missing the code will crash. Also, there are usually auxiliary files you will need that can be generated using the same algorithms and don't need to be determined from the input file list.

Then if you want to be able to do something different, perhaps for testing, that can be an alternative method, e.g. specifying an exact list of files, or a pattern as in the above config.
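
A minimal sketch of that idea, under made-up assumptions: the `<base_name>_<ccd>.fits` naming pattern and the helper `expected_ccd_files` are purely illustrative and are not the DESDM convention.

```python
import os
import tempfile

def expected_ccd_files(directory, base_name, ccd_numbers):
    """Build the expected per-CCD paths; raise if any of them is missing.

    The "<base_name>_<ccd>.fits" pattern is a hypothetical convention.
    """
    paths = [os.path.join(directory, "%s_%02d.fits" % (base_name, ccd))
             for ccd in ccd_numbers]
    missing = [p for p in paths if not os.path.exists(p)]
    if missing:
        raise IOError("missing expected files: " + ", ".join(missing))
    return paths

# Small demonstration in a scratch directory holding CCDs 1 and 2 only.
with tempfile.TemporaryDirectory() as tmp:
    for ccd in (1, 2):
        open(os.path.join(tmp, "exp123_%02d.fits" % ccd), "w").close()

    found = [os.path.basename(p)
             for p in expected_ccd_files(tmp, "exp123", [1, 2])]

    # Asking for CCD 3 as well fails loudly instead of silently skipping it.
    try:
        expected_ccd_files(tmp, "exp123", [1, 2, 3])
        missing_detected = False
    except IOError:
        missing_detected = True
```

Because the list is generated rather than discovered, the order is fixed by `ccd_numbers` and a missing file is reported up front.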

@rmjarvis
Owner

This is now fixed on branch #20. There are a few ways allowed to specify the file lists. From the doc string:

        There are a number of ways to specify the input files (parameters `images` and `cats`):

        1. If you only have a single image/catalog, you may just give the file name directly
           as a single string.
        2. For multiple images, you may specify a list of strings listing all the file names.
        3. You may specify a string with ``{chipnum}`` which will be filled in by the chipnum
           values given in the `chipnums` parameter using ``s.format(chipnum=chipnum)``.
        4. You may specify a string with ``%s`` (or perhaps ``%02d``, etc.) which will be filled
           in by the chipnum values given in the `chipnums` parameter using ``s % chipnum``.
        5. You may specify a string that ``glob.glob(s)`` will understand and convert into a
           list of file names.  Caveat: ``glob`` returns the files in native directory order
           (cf. ``ls -f``).  This can thus be different for the images and catalogs if they
           were written to disk out of order.  Therefore, we sort the list returned by
           ``glob.glob(s)``.  Typically, this will result in the image file names and catalog
           file names matching up correctly, but it is the user's responsibility to ensure
           that this is the case.

        The `chipnums` parameter specifies chip "numbers" which are really just any identifying
        number or string that is different for each chip in the exposure.  Typically, these are
        numbers, but they don't have to be if you have some other way of identifying the chips.

        There are a number of ways that the chipnums may be specified:

        1. A single number or string.
        2. A list of numbers or strings.
        3. A string that can be ``eval``ed to yield the appropriate list.  e.g.
           `[ c for c in range(1,63) if c != 61 ]`
        4. None, in which case range(len(images)) will be used.  In this case options 3,4 above
           for the images and cats parameters are not allowed.
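
To make the options above concrete, here is a rough sketch of how such a specification could be expanded. This is not the code on the branch: `expand_file_spec` is a hypothetical helper, it takes `chipnums` only as a list, and it leaves out the `eval` option.

```python
import glob

def expand_file_spec(spec, chipnums=None):
    """Turn a file specification into a list of file names (illustrative only)."""
    if not isinstance(spec, str):
        return list(spec)                       # option 2: a list of names

    needs_chipnums = "{chipnum}" in spec or "%" in spec
    if needs_chipnums and chipnums is None:
        raise ValueError("a template string requires chipnums")

    if "{chipnum}" in spec:                     # option 3: str.format template
        return [spec.format(chipnum=c) for c in chipnums]
    if "%" in spec:                             # option 4: %-style template
        return [spec % c for c in chipnums]
    if any(ch in spec for ch in "*?["):         # option 5: glob pattern
        return sorted(glob.glob(spec))          # sorted, since glob order is arbitrary
    return [spec]                               # option 1: a single file name

by_format = expand_file_spec("img_{chipnum}.fits", [1, 2])
by_percent = expand_file_spec("img_%02d.fits", [1, 2])
single = expand_file_spec("img.fits")
```

With the template forms, images and catalogs expanded from the same `chipnums` list line up by construction, which the glob form cannot promise.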
