Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build.yml globbing #287

Merged
merged 22 commits into from Feb 10, 2018
Merged

build.yml globbing #287

merged 22 commits into from Feb 10, 2018

Conversation

eode
Copy link
Contributor

@eode eode commented Jan 14, 2018

Uses syntax:

contents:
  foo:
    "*.baz"   # Case insensitive.  All files are sub-nodes of 'foo'

A few notes:

  • case insensitivity was actually kindof a pain -- but it works now
  • there are a couple tools in tensorflow that I didn't bring over -- like the backported pathlib in tools.compat, and file->node duplicate naming conflict resolver. These have been brought over. There may be conflicts for the TensorFlow branch once this is merged into master..
  • subdirs are made into nodenames for now instead of making subnodes -- so subdir_foo_csv
    • This now uses only the filename of the found file, and appends a number if there's a conflict.
  • still rough

@asah
Copy link
Contributor

asah commented Jan 14, 2018

Broken on Python 2.7 see Travis :-(

* Provides Python2.7 compatibility
* Provides helper functions from tools/util.py
@eode
Copy link
Contributor Author

eode commented Jan 16, 2018

Between this and the TensorFlow branch, whichever is merged first will probably have some conflicts with the other, as I've pulled in some files into this branch from TF.

This should be ready for a review, now.

@eode
Copy link
Contributor Author

eode commented Jan 16, 2018

My original method of globbing just walked the whole package dir, and filtered out paths that didn't match. Simpler, but also slower. This version is likely a little slower for small build dirs, and faster for larger ones (though still a little slower than the stdlib, platform-dependent one). Glob strings work as you'd expect, with full multilevel specification, like: foo/*/datastore/**/babynames_??.csv catching foo/bar/datastore/anything/goes/here/babynames_01.csv

Some example syntax:

# glob strings must be quoted.
contents:
  csv:
    "[!bt]*.csv":      # a glob term without content is accepted, but needs a colon.
    "csv.*":           # transforms work
      transform: csv
    subnode:
      "subdir/**":     # multilevel recursion
        transform: csv
  excel:
    "*.xlsx":
      kwargs:          # kwargs verified to work
        skiprows: [0,10,100, 300, 600]
  collision:           # naming collisions from subdirs are renamed with a number
    "**/csv.txt":      # example: csv_txt, csv_txt_2 ...
      transform: csv

@dimaryaz
Copy link
Contributor

Two questions:

  • Why are you reimplementing globbing yourself, instead of using the built-in glob?
  • Do we actually want case-insensitive globbing?

@dimaryaz
Copy link
Contributor

pathlib seems to support both case-sensitive and case-insensitive globbing:

pathlib.PureWindowsPath('foo.py').match('*.PY')
pathlib.PurePosixPath('foo.py').match('*.PY')

@akarve
Copy link
Member

akarve commented Jan 17, 2018

Strongly suggest we use glob.glob and willing to even sacrifice some functionality for it (easier to maintain)

@eode
Copy link
Contributor Author

eode commented Jan 17, 2018

@dimaryaz, @akarve
Not attached to using the glob I made.. ..initial reasoning was via a conversation with Aneesh, where we came to the conclusion that case insensitivity was the route (roughly) to go. However, I think he figured we'd be able to use glob.glob, which I did, initially, as well.

So, this is what I ran across:

  • Pathlib.PureWindowsPath.match() doesn't work the same as glob.glob
    • it's individual, per-path basis
    • it doesn't match the same as glob.glob on **/*
    • it requires walking all files of the subtree, regardless of whether they are matched or not
      • this might not be a problem, depending on our package sizes
  • glob.glob doesn't support case insensitivity -- or more to the point, case consistency. It matches the convention of the current OS

..but, it came with some experimentation that I found all that out. :-/ At that point, I was sucked in, and made what I suppose was a mistake -- writing my own.

The end result:
Mine matches bash's behavior, (not python's glob.glob() if it were case-insensitive).

..in any case, it still might be preferable to just use an existing lib -- either pathlib.PureWindowsPath.match() and forego ** matching, or glob.glob() and forego identical builds for identical build dirs on different OSes.

@quiltdata
Copy link
Collaborator

quiltdata commented Jan 17, 2018 via email

@eode
Copy link
Contributor Author

eode commented Jan 18, 2018

Any particular place I should put tests for testing package['foo/bar'] and 'foo/bar' in package (Package.__contains__ and Package.__getitem__)?

I did the typical Python thing of implementing __getitem__ for the __contains__ check. Used when preventing name conflicts when adding nodes in build._build.

@eode
Copy link
Contributor Author

eode commented Jan 18, 2018

dangit. glob.glob() doesn't handle recursion on 2.7. I've got to switch to pathlib.Path.glob(). I.e., ** isn't supported.

@quiltdata
Copy link
Collaborator

quiltdata commented Jan 18, 2018 via email

@eode
Copy link
Contributor Author

eode commented Jan 18, 2018

OK, this should be fixed now -- end result:
Using pathlib, which follows bash's description of ** matching, but not bash's actual behavior.

Bash:
If set, the pattern ** used in a pathname expansion context will
    match all files and zero or more directories and subdirectories.
    If the pattern is followed by a /, only directories and
    subdirectories match.

But bash's actual behavior is that ** matches all files and subdirectories, and their files, recursively. Pathlib actually matches all files and zero or more directories and subdirectories. All this means is that to get everything, a user needs to specify **/*, not just **.

@eode
Copy link
Contributor Author

eode commented Jan 18, 2018

@akarve @dimaryaz This should be ready for review again.

By the way, changes to setup.py, utils, and compat.py were mostly pulled in from the Tensorflow branch. Of the items in setup.py, only pathlib is used.

@eode
Copy link
Contributor Author

eode commented Jan 18, 2018

FYI, glob.glob recursivity in python was added at 3.5, same as the pathlib backport point.

@eode eode requested review from akarve, dimaryaz and kevinemoore and removed request for akarve January 18, 2018 15:58
@akarve
Copy link
Member

akarve commented Jan 24, 2018

OK I've got this in my build.yml:

  4   babynames:
  5     NationalReadMe:
  6       file: babynames/NationalReadMe.pdf
  7     # apply transform: csv to all succeeding nodes in this group
  8     '*.txt':
  9       transform: csv
 10       kwargs:
 11         header: #there's no header row in these files
 12     yob1880:
 13       file: babynames/yob1880.txt
 14     yob1881:
 15       file: babynames/yob1881.txt
 16     yob1882:
 17       file: babynames/yob1882.txt
# ... lots of .txt files

But it is not working. All of the .txt use transform: id and the CLI tells me as much during build. Am I doing something wrong?

@quiltdata
Copy link
Collaborator

quiltdata commented Jan 24, 2018 via email

@akarve
Copy link
Member

akarve commented Jan 24, 2018

Yes I'm on the right branch. Build succeeds but doesn't treat the txts properly.

@eode
Copy link
Contributor Author

eode commented Jan 25, 2018

@akarve -- can you sign off on this for your purposes?
@dimaryaz -- anything else re: code review?

@eode
Copy link
Contributor Author

eode commented Jan 26, 2018

This has a soft dependency on #312, in that there are newer versions of some files there, and the diff size will be reduced once that's in.

@asah asah requested review from akarve and removed request for dimaryaz February 1, 2018 18:24
@eode
Copy link
Contributor Author

eode commented Feb 2, 2018

@akarve Docs updated.

@akarve
Copy link
Member

akarve commented Feb 6, 2018

I got it to fail twice with Failed to build the package: Naming conflict: 'babynames/yob1880' has been added to the package more than once (full output message below). The second time (because of build cache?) it produced output that wasn't verbose enough.

  • What is the cause of this error? I have verified only one yob1880 node in build.yml and only one babynames/yob1880.txt
  • For the second error case (due to build cache?) can you make the output more verbose so that console output looks more like an actual build?
(dev) apk-mbp-3:datasets karve$ quilt build akarve/pdb build-good.yml 
Inferring 'transform: id' for README.md
Copying README.md...
Inferring 'transform: id' for babynames/NationalReadMe.pdf
...
Serializing babynames/yob2009.txt...
100%|████████████████████████████████████████| 447k/447k [00:00<00:00, 17.1MB/s]
Saving as binary dataframe...
Matched with 'babynames/*.txt': 'babynames/yob2010.txt'
Serializing babynames/yob2010.txt...
100%|████████████████████████████████████████| 437k/437k [00:00<00:00, 20.8MB/s]
Saving as binary dataframe...
Failed to build the package: Naming conflict: 'babynames/yob1880' has been added to the package more than once

Second run:

(dev) apk-mbp-3:datasets karve$ quilt build akarve/pdb build.yml 
Inferring 'transform: id' for README.md
Copying README.md...
Inferring 'transform: id' for Untitled.ipynb
Copying Untitled.ipynb...
...
Matched with 'babynames/*.txt': 'babynames/yob1998.txt'
Matched with 'babynames/*.txt': 'babynames/yob1999.txt'
Matched with 'babynames/*.txt': 'babynames/yob2000.txt'
Matched with 'babynames/*.txt': 'babynames/yob2001.txt'
Matched with 'babynames/*.txt': 'babynames/yob2002.txt'
Matched with 'babynames/*.txt': 'babynames/yob2003.txt'
Matched with 'babynames/*.txt': 'babynames/yob2004.txt'
Matched with 'babynames/*.txt': 'babynames/yob2005.txt'
Matched with 'babynames/*.txt': 'babynames/yob2006.txt'
Matched with 'babynames/*.txt': 'babynames/yob2007.txt'
Matched with 'babynames/*.txt': 'babynames/yob2008.txt'
Matched with 'babynames/*.txt': 'babynames/yob2009.txt'
Matched with 'babynames/*.txt': 'babynames/yob2010.txt'
Inferring 'transform: id' for babynames/NationalReadMe.pdf
Copying babynames/NationalReadMe.pdf...
Failed to build the package: Naming conflict: 'babynames/yob1880' has been added to the package more than once

@eode
Copy link
Contributor Author

eode commented Feb 6, 2018

Discussed on slack: When working with a generated build.yml, replace nodes with glob rather than adding glob in addition.

Cached build reporting not discussed yet.

Copy link
Contributor

@kevinemoore kevinemoore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good aside from conflicts and minor changes suggested inline.


# patch = mock.patch
# Path = pathlib.Path
# TemporaryDirectory = tempfile.TemporaryDirectory
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to be used anywhere. Is it part of TensorFlow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was, but has been dropped. Now that the TF Prep PR has been merged, I can merge master into this and clean up, then squash and merge into master.

def __getitem__(self, item):
node = self.get_contents()

if '/' in item and '.' in item:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's pick just one syntax for accessing package components by path. My vote is to use '.' as the separator.

Copy link
Contributor Author

@eode eode Feb 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm rather for '/', myself. Originally, I was for the dot notation, but these considerations changed my thoughts on it:

  • Our user/pkg/sub/node/notation uses slashes, and this is consistent with that.
  • Using slashes leaves open the option of relaxing node names to be more like filenames in the future
  • It would allow us to use a pathlib.PurePosixPath object when doing common pathlike operations.
  • as we move toward using item['notation'] for nodes, looking like a Python identifier becomes less important.

Although less of a general consideration, it's worth noting that '/' is used in build._build_node, which this was made for.

parts = [item]

try:
for part in parts[:-1]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems overly complicated. Why not: for part in parts...return node?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right.

@akarve
Copy link
Member

akarve commented Feb 7, 2018

Works. I'll file a separate issue for the build output when build cache is hit.

@eode
Copy link
Contributor Author

eode commented Feb 8, 2018

@kevinemoore Ready for re-review.

@eode eode merged commit 3550404 into master Feb 10, 2018
nl0 added a commit to nl0/quilt that referenced this pull request Feb 13, 2018
* master: (38 commits)
  Implement the "always requires auth" catalog config (quiltdata#365)
  Replace a confusing SQL query with a slightly less confusing one (quiltdata#363)
  Eliminate stray back-tick [ci skip]
  Eliminate redundant `.team` [ci skip]
  Whack backticks [ci skip]
  fix syntax example; refactor headings [ci skip]
  Add alternative terms [ci skip]
  Link to object store docs [ci skip]
  Consolidate and move sections [ci skip]
  Rename section [ci skip]
  Simplify and update links [ci skip]
  Rename section [ci skip]
  Add notes on immutability, tophash, etc. [ci skip]
  Use a small dataset in most of the tests (quiltdata#360)
  Add (No results) to empty search results (quiltdata#359)
  Admin UI endpoint to list users and associated data (quiltdata#354)
  Upgrade the stack (quiltdata#326)
  build.yml globbing (quiltdata#287)
  Pass the DISABLE_SIGNUP env variable to django (quiltdata#356)
  use npm lock files, delete yarn (quiltdata#355)
  ...
@akarve akarve deleted the build-yml-globbing branch May 14, 2018 20:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants