build.yml globbing #287

eode · 2018-01-14T12:54:48Z

Uses syntax:

contents:
  foo:
    "*.baz"   # Case insensitive.  All files are sub-nodes of 'foo'

A few notes:

- Case insensitivity was actually kind of a pain, but it works now.
- ~~There are a couple of tools in tensorflow that I didn't bring over, like the backported pathlib in tools.compat and the file->node duplicate naming conflict resolver.~~ These have been brought over. There may be conflicts for the TensorFlow branch once this is merged into master.
- Subdirs are made into node names for now instead of making subnodes, so subdir_foo_csv.
  - This now uses only the filename of the found file, and appends a number if there's a conflict.
- ~~still rough~~


asah · 2018-01-14T17:47:07Z

Broken on Python 2.7; see Travis :-(


eode · 2018-01-16T08:52:38Z

Between this and the TensorFlow branch, whichever is merged first will probably have some conflicts with the other, as I've pulled in some files into this branch from TF.

This should be ready for a review, now.

eode · 2018-01-16T09:17:23Z

My original method of globbing just walked the whole package dir and filtered out paths that didn't match. Simpler, but also slower. This version is likely a little slower for small build dirs and faster for larger ones (though still a little slower than the stdlib, platform-dependent one). Glob strings work as you'd expect, with full multilevel specification. For example, foo/*/datastore/**/babynames_??.csv catches foo/bar/datastore/anything/goes/here/babynames_01.csv.
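A minimal sketch of the multilevel pattern above, using the standard library's pathlib.Path.glob as a stand-in for the globbing code in this branch. The temporary directory layout and the non-matching file are made up for illustration.

```python
import pathlib
import tempfile

pattern = "foo/*/datastore/**/babynames_??.csv"

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    # One file that should match, and one that should not
    # ("??" stands for exactly two characters).
    hit = root / "foo/bar/datastore/anything/goes/here/babynames_01.csv"
    miss = root / "foo/bar/datastore/babynames_001.csv"
    for path in (hit, miss):
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("name,count\n")

    # Collect matches relative to the build dir, with forward slashes.
    matches = sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))

print(matches)
```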

Some example syntax:

# glob strings must be quoted.
contents:
  csv:
    "[!bt]*.csv":      # a glob term without content is accepted, but needs a colon.
    "csv.*":           # transforms work
      transform: csv
    subnode:
      "subdir/**":     # multilevel recursion
        transform: csv
  excel:
    "*.xlsx":
      kwargs:          # kwargs verified to work
        skiprows: [0,10,100, 300, 600]
  collision:           # naming collisions from subdirs are renamed with a number
    "**/csv.txt":      # example: csv_txt, csv_txt_2 ...
      transform: csv
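The collision renaming in the last group (csv_txt, csv_txt_2 ...) could be sketched roughly as below. The helper names to_nodename and unique_nodename, and the exact character-replacement rule, are assumptions for illustration, not the functions in this PR.

```python
import re


def to_nodename(filename):
    """Turn a filename into an identifier-like node name (assumed rule)."""
    return re.sub(r"\W", "_", filename)


def unique_nodename(filename, taken):
    """Return a node name not yet in `taken`, appending _2, _3, ... on conflict."""
    base = to_nodename(filename)
    candidate, counter = base, 2
    while candidate in taken:
        candidate = "{}_{}".format(base, counter)
        counter += 1
    taken.add(candidate)
    return candidate


# Three files named csv.txt found in different subdirectories.
taken = set()
names = [unique_nodename("csv.txt", taken) for _ in range(3)]
print(names)
```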

dimaryaz · 2018-01-16T23:27:34Z

Two questions:

Why are you reimplementing globbing yourself, instead of using the built-in glob?
Do we actually want case-insensitive globbing?

dimaryaz · 2018-01-17T00:05:44Z

pathlib seems to support both case-sensitive and case-insensitive globbing:

pathlib.PureWindowsPath('foo.py').match('*.PY')
pathlib.PurePosixPath('foo.py').match('*.PY')
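For reference, a small runnable version of the two calls above. Pure paths never touch the filesystem, so this behaves the same on any OS: the Windows flavour compares case-insensitively and the POSIX flavour does not.

```python
import pathlib

# Same filename and pattern, different path flavours.
windows_match = pathlib.PureWindowsPath("foo.py").match("*.PY")
posix_match = pathlib.PurePosixPath("foo.py").match("*.PY")

print(windows_match, posix_match)
```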

akarve · 2018-01-17T01:13:09Z

Strongly suggest we use glob.glob and willing to even sacrifice some functionality for it (easier to maintain)

eode · 2018-01-17T18:33:14Z

@dimaryaz, @akarve
Not attached to using the glob I made. The initial reasoning came from a conversation with Aneesh, where we came to the conclusion that case insensitivity was (roughly) the route to go. However, I think he figured we'd be able to use glob.glob, which I did initially as well.

So, this is what I ran across:

- pathlib.PureWindowsPath.match() doesn't work the same as glob.glob
  - it works on an individual, per-path basis
  - it doesn't match the same as glob.glob on **/*
  - it requires walking all files of the subtree, regardless of whether they are matched or not
    - this might not be a problem, depending on our package sizes
- glob.glob doesn't support case insensitivity, or more to the point, case consistency. It matches the convention of the current OS.
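To make the per-path point concrete, here is a rough sketch of what a case-insensitive glob built on PureWindowsPath.match() would look like. The function name glob_case_insensitive is hypothetical. Note that it has to visit every entry under the root, which is the cost mentioned in the list above.

```python
import pathlib
import tempfile


def glob_case_insensitive(root, pattern):
    """Walk the whole subtree and keep files whose relative path matches."""
    root = pathlib.Path(root)
    found = []
    for path in root.rglob("*"):  # visits every entry, matched or not
        if not path.is_file():
            continue
        relative = path.relative_to(root)
        # Re-wrap as a Windows-flavoured pure path to get case-insensitive matching.
        if pathlib.PureWindowsPath(*relative.parts).match(pattern):
            found.append(relative.as_posix())
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    base = pathlib.Path(tmp)
    (base / "sub").mkdir()
    for name in ("A.CSV", "sub/b.csv", "notes.txt"):
        (base / name).write_text("x\n")
    result = glob_case_insensitive(base, "*.csv")

print(result)
```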

But it took some experimentation to find all that out. :-/ At that point, I was sucked in, and made what I suppose was a mistake: writing my own.

The end result:
Mine matches bash's behavior (not what Python's glob.glob() would do if it were case-insensitive).

In any case, it still might be preferable to just use an existing lib: either pathlib.PureWindowsPath.match() and forgo ** matching, or glob.glob() and forgo identical builds for identical build dirs on different OSes.

quiltdata · 2018-01-17T19:23:22Z

This is a first version: start with something simple (glob.glob) and document the behavior in the code & docs, then we'll "fix" it later if it becomes a problem. aka "let it become a problem".

Brian - that said, I love love love that you went ahead and wrote your own. Obviously, not good if it had been week(s), but for a day or two it's an excellent way to "get close" to the problem.


eode · 2018-01-18T03:33:25Z

Any particular place I should put tests for testing package['foo/bar'] and 'foo/bar' in package (Package.__contains__ and Package.__getitem__)?

I did the typical Python thing of implementing __getitem__ for the __contains__ check. Used when preventing name conflicts when adding nodes in build._build.
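A minimal sketch of that pattern, using a toy stand-in rather than the real Package class. ToyPackage and its nested-dict contents are assumptions for illustration only.

```python
class ToyPackage(object):
    """Toy stand-in: nodes are nested dicts, leaves are plain values."""

    def __init__(self, contents):
        self._contents = contents

    def __getitem__(self, item):
        node = self._contents
        for part in item.split("/"):
            node = node[part]  # raises KeyError if the part is missing
        return node

    def __contains__(self, item):
        # Membership is defined in terms of lookup.
        try:
            self[item]
        except (KeyError, TypeError):  # TypeError: tried to descend into a leaf
            return False
        return True


package = ToyPackage({"foo": {"bar": "leaf"}})
print(package["foo/bar"], "foo/bar" in package, "foo/baz" in package)
```

A test for the real class could follow the same shape: one lookup that succeeds, one membership check that is true, and one that is false.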

eode · 2018-01-18T03:36:32Z

Dang it. glob.glob() doesn't handle recursion on 2.7 (i.e., ** isn't supported), so I've got to switch to pathlib.Path.glob().

quiltdata · 2018-01-18T03:37:38Z

Ugh!! Pls slack dima if you don't get quick answers to questions


eode · 2018-01-18T04:20:40Z

OK, this should be fixed now -- end result:
Using pathlib, which follows bash's description of ** matching, but not bash's actual behavior.

Bash:
If set, the pattern ** used in a pathname expansion context will
    match all files and zero or more directories and subdirectories.
    If the pattern is followed by a /, only directories and
    subdirectories match.

But bash's actual behavior is that ** matches all files and subdirectories, and their files, recursively. Pathlib actually matches all files and zero or more directories and subdirectories. All this means is that to get everything, a user needs to specify **/*, not just **.
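A short sketch of the "use **/* to get everything" advice with pathlib. It only exercises **/*, because what a bare ** returns has differed between Python versions. The directory layout is made up for illustration.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "a" / "b").mkdir(parents=True)
    for name in ("top.csv", "a/mid.csv", "a/b/deep.csv"):
        (root / name).write_text("x\n")

    # "**/*" yields files and directories at every depth below root.
    everything = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*"))
    # Keep only regular files, which is what a build would turn into nodes.
    files_only = [p for p in everything if (root / p).is_file()]

print(everything)
print(files_only)
```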

eode · 2018-01-18T04:52:03Z

@akarve @dimaryaz This should be ready for review again.

By the way, changes to setup.py, utils, and compat.py were mostly pulled in from the Tensorflow branch. Of the items in setup.py, only pathlib is used.

eode · 2018-01-18T04:58:34Z

FYI, recursive matching in glob.glob was added in Python 3.5, same as the pathlib backport point.
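For comparison, a sketch of the Python 3.5+ form with glob.glob: ** only recurses when recursive=True is passed, and otherwise behaves like a single *. The file layout is made up for illustration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "a", "b"))
    for rel in ("top.csv", os.path.join("a", "b", "deep.csv")):
        with open(os.path.join(tmp, rel), "w") as handle:
            handle.write("x\n")

    pattern = os.path.join(tmp, "**", "*.csv")
    # With recursive=True, "**" spans zero or more directory levels.
    recursive = sorted(os.path.relpath(p, tmp)
                       for p in glob.glob(pattern, recursive=True))
    # Without it, "**" is treated like "*": exactly one directory level.
    flat = sorted(os.path.relpath(p, tmp) for p in glob.glob(pattern))

print(recursive)
print(flat)
```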

akarve · 2018-01-24T03:22:14Z

OK I've got this in my build.yml:

  babynames:
    NationalReadMe:
      file: babynames/NationalReadMe.pdf
    # apply transform: csv to all succeeding nodes in this group
    '*.txt':
      transform: csv
      kwargs:
        header: # there's no header row in these files
    yob1880:
      file: babynames/yob1880.txt
    yob1881:
      file: babynames/yob1881.txt
    yob1882:
      file: babynames/yob1882.txt
# ... lots of .txt files

But it is not working. All of the .txt use transform: id and the CLI tells me as much during build. Am I doing something wrong?

quiltdata · 2018-01-24T12:00:51Z

Are you working from the build-yml-globbing branch? master of course fails.

$ quilt build asah/babynames ./build.yml
Inferring 'transform: id' for babynames/NationalReadMe.pdf
Copying ./babynames/NationalReadMe.pdf...
Traceback (most recent call last):
...
  File "/Users/asah/quilt/compiler/quilt/tools/build.py", line 110, in _build_node
    checks_contents=checks_contents, dry_run=dry_run, env=env, ancestor_args=group_args)
  File "/Users/asah/quilt/compiler/quilt/tools/build.py", line 108, in _build_node
    raise StoreException("Invalid node name: %r" % child_name)
quilt.tools.store.StoreException: Invalid node name: '*.txt'

vs.

$ cd quilt/compiler
$ git checkout build-yml-globbing
Switched to branch 'build-yml-globbing'
Your branch is up-to-date with 'origin/build-yml-globbing'.
$ pip install -e .
Obtaining file:///Users/asah/quilt/compiler
Requirement already satisfied: appdirs>=1.4.0 in /Users/asah/miniconda3/lib/python3.6/site-packages (from quilt==2.8.2.dev0)
...
$ cd ~/babynames
$ quilt build asah/babynames ./build.yml
Inferring 'transform: id' for babynames/NationalReadMe.pdf
Copying ./babynames/NationalReadMe.pdf...
{'pattern': '*.txt', 'dir': PosixPath('.')}
baby1.txt
Serializing ./baby1.txt...
100%|████████████████████████████████████████| 4.00/4.00 [00:00<00:00, 559B/s]
Saving as binary dataframe...
baby2.txt
Serializing ./baby2.txt...
100%|████████████████████████████████████████| 4.00/4.00 [00:00<00:00, 2.71kB/s]
Saving as binary dataframe...
Inferring 'transform: id' for babynames/yob1880.txt
Copying ./babynames/yob1880.txt...
Inferring 'transform: id' for babynames/yob1881.txt
Copying ./babynames/yob1881.txt...
Inferring 'transform: id' for babynames/yob1882.txt
Copying ./babynames/yob1882.txt...
Built asah/babynames successfully.


akarve · 2018-01-24T15:36:32Z

Yes I'm on the right branch. Build succeeds but doesn't treat the txts properly.

eode · 2018-01-25T02:11:45Z

@akarve -- can you sign off on this for your purposes?
@dimaryaz -- anything else re: code review?

eode · 2018-01-26T18:52:37Z

This has a soft dependency on #312, in that there are newer versions of some files there, and the diff size will be reduced once that's in.


eode · 2018-02-02T21:40:43Z

@akarve Docs updated.

akarve · 2018-02-06T02:32:35Z

I got it to fail twice with "Failed to build the package: Naming conflict: 'babynames/yob1880' has been added to the package more than once" (full output below). The second time (because of build cache?) the output wasn't verbose enough.

- What is the cause of this error? I have verified that there is only one yob1880 node in build.yml and only one babynames/yob1880.txt.
- For the second error case (due to build cache?), can you make the output more verbose so that console output looks more like an actual build?

(dev) apk-mbp-3:datasets karve$ quilt build akarve/pdb build-good.yml 
Inferring 'transform: id' for README.md
Copying README.md...
Inferring 'transform: id' for babynames/NationalReadMe.pdf
...
Serializing babynames/yob2009.txt...
100%|████████████████████████████████████████| 447k/447k [00:00<00:00, 17.1MB/s]
Saving as binary dataframe...
Matched with 'babynames/*.txt': 'babynames/yob2010.txt'
Serializing babynames/yob2010.txt...
100%|████████████████████████████████████████| 437k/437k [00:00<00:00, 20.8MB/s]
Saving as binary dataframe...
Failed to build the package: Naming conflict: 'babynames/yob1880' has been added to the package more than once

Second run:

(dev) apk-mbp-3:datasets karve$ quilt build akarve/pdb build.yml 
Inferring 'transform: id' for README.md
Copying README.md...
Inferring 'transform: id' for Untitled.ipynb
Copying Untitled.ipynb...
...
Matched with 'babynames/*.txt': 'babynames/yob1998.txt'
Matched with 'babynames/*.txt': 'babynames/yob1999.txt'
Matched with 'babynames/*.txt': 'babynames/yob2000.txt'
Matched with 'babynames/*.txt': 'babynames/yob2001.txt'
Matched with 'babynames/*.txt': 'babynames/yob2002.txt'
Matched with 'babynames/*.txt': 'babynames/yob2003.txt'
Matched with 'babynames/*.txt': 'babynames/yob2004.txt'
Matched with 'babynames/*.txt': 'babynames/yob2005.txt'
Matched with 'babynames/*.txt': 'babynames/yob2006.txt'
Matched with 'babynames/*.txt': 'babynames/yob2007.txt'
Matched with 'babynames/*.txt': 'babynames/yob2008.txt'
Matched with 'babynames/*.txt': 'babynames/yob2009.txt'
Matched with 'babynames/*.txt': 'babynames/yob2010.txt'
Inferring 'transform: id' for babynames/NationalReadMe.pdf
Copying babynames/NationalReadMe.pdf...
Failed to build the package: Naming conflict: 'babynames/yob1880' has been added to the package more than once

eode · 2018-02-06T04:25:21Z

Discussed on slack: When working with a generated build.yml, replace nodes with glob rather than adding glob in addition.
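A rough sketch of why keeping the generated nodes alongside the glob trips the conflict check. The names add_node and NamingConflict, and the rule of deriving the node name from the file stem, are assumptions for illustration and not the code in build.py. Only the error text is taken from the output above.

```python
import pathlib


class NamingConflict(Exception):
    """Hypothetical error type standing in for the build failure."""


def add_node(package, group, filename):
    """Register group/<file stem> as a node, refusing duplicates."""
    node = "{}/{}".format(group, pathlib.PurePosixPath(filename).stem)
    if node in package:
        raise NamingConflict(
            "Naming conflict: {!r} has been added to the package more than once"
            .format(node))
    package.add(node)
    return node


package = set()
# First the glob 'babynames/*.txt' matches the file ...
add_node(package, "babynames", "babynames/yob1880.txt")
try:
    # ... then the explicit yob1880 node from the generated build.yml adds it again.
    add_node(package, "babynames", "babynames/yob1880.txt")
    conflict = None
except NamingConflict as error:
    conflict = str(error)

print(conflict)
```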

Cached build reporting not discussed yet.

kevinemoore

Looks good aside from conflicts and minor changes suggested inline.

kevinemoore · 2018-02-06T22:27:23Z

compiler/quilt/tools/compat.py

+
+# patch = mock.patch
+# Path = pathlib.Path
+# TemporaryDirectory = tempfile.TemporaryDirectory


This doesn't seem to be used anywhere. Is it part of TensorFlow?

eode (Feb 6, 2018): It was, but has been dropped. Now that the TF Prep PR has been merged, I can merge master into this and clean up, then squash and merge into master.

kevinemoore · 2018-02-06T22:37:07Z

compiler/quilt/tools/package.py

+    def __getitem__(self, item):
+        node = self.get_contents()
+
+        if '/' in item and '.' in item:


Let's pick just one syntax for accessing package components by path. My vote is to use '.' as the separator.

eode (Feb 7, 2018): I'm rather for '/', myself. Originally, I was for the dot notation, but these considerations changed my thoughts on it:

- Our user/pkg/sub/node/notation uses slashes, and this is consistent with that.
- Using slashes leaves open the option of relaxing node names to be more like filenames in the future.
- It would allow us to use a pathlib.PurePosixPath object when doing common pathlike operations.
- As we move toward using item['notation'] for nodes, looking like a Python identifier becomes less important.
- Although less of a general consideration, it's worth noting that '/' is used in build._build_node, which this was made for.
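As an illustration of the PurePosixPath point, a slash-separated node path can be handled with the standard pure-path operations. The node path used here is a made-up example.

```python
import pathlib

node_path = pathlib.PurePosixPath("user/pkg/sub/node")

parts = node_path.parts                   # individual components
parent = node_path.parent                 # containing group
leaf = node_path.name                     # final node name
sibling = node_path.with_name("other")    # same group, different leaf

print(parts, parent, leaf, sibling)
```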

kevinemoore · 2018-02-06T22:38:11Z

compiler/quilt/tools/package.py

+            parts = [item]
+
+        try:
+            for part in parts[:-1]:


This seems overly complicated. Why not: for part in parts...return node?

eode (Feb 7, 2018): You're right.

akarve · 2018-02-07T04:07:54Z

Works. I'll file a separate issue for the build output when build cache is hit.


eode · 2018-02-08T00:34:39Z

@kevinemoore Ready for re-review.

* master: (38 commits)
  Implement the "always requires auth" catalog config (quiltdata#365)
  Replace a confusing SQL query with a slightly less confusing one (quiltdata#363)
  Eliminate stray back-tick [ci skip]
  Eliminate redundant `.team` [ci skip]
  Whack backticks [ci skip]
  fix syntax example; refactor headings [ci skip]
  Add alternative terms [ci skip]
  Link to object store docs [ci skip]
  Consolidate and move sections [ci skip]
  Rename section [ci skip]
  Simplify and update links [ci skip]
  Rename section [ci skip]
  Add notes on immutability, tophash, etc. [ci skip]
  Use a small dataset in most of the tests (quiltdata#360)
  Add (No results) to empty search results (quiltdata#359)
  Admin UI endpoint to list users and associated data (quiltdata#354)
  Upgrade the stack (quiltdata#326)
  build.yml globbing (quiltdata#287)
  Pass the DISABLE_SIGNUP env variable to django (quiltdata#356)
  use npm lock files, delete yarn (quiltdata#355)
  ...

eode added 3 commits January 2, 2018 15:46

NotImplementedError for find_node_by_name(), diff_node_dataframe()

423209b

Also, remove unused import

Merge branch 'master' of github.com:quiltdata/quilt-compiler into eod…

39a7030

…e-build-yml-globbing

Add basic globbing for build.yml files

d620ac7

eode added 2 commits January 15, 2018 12:33

Copied files from tensorflow_virtual_files branch

f1de29e

* Provides Python2.7 compatibility * Provides helper functions from tools/util.py

improve glob speed, test glob, test building w/glob, fixes

3602ed1

eode added 2 commits January 17, 2018 20:47

Remove homegrown glob. Add conflict checking.

141fc24

Cleanup, peel off most noise to a separate function

b18c0ec

Use pathlib rather than glob.glob. Adapt build.py.

a7138b3

minor cleanup

b739bd2

eode requested review from akarve, dimaryaz and kevinemoore and removed request for akarve January 18, 2018 15:58

eode and others added 4 commits January 24, 2018 16:17

Remove debug print calls, print match message

39c3c60

Merge branch 'master' into build-yml-globbing

3bac736

revert changes to test_util.py (unrelated)

19dc273

Revert rename of find_node_by_name param

3530f6b

Merge branch 'master' into build-yml-globbing

33b3e99

asah requested review from akarve and removed request for dimaryaz February 1, 2018 18:24

eode added 2 commits February 2, 2018 16:35

Update docs for globbing

4193965

Merge branch 'build-yml-globbing' of github.com:quiltdata/quilt into …

1edb892

…build-yml-globbing

kevinemoore requested changes Feb 6, 2018


akarve approved these changes Feb 7, 2018


eode added 5 commits February 7, 2018 12:06

Merge branch 'master' of github.com:quiltdata/quilt into build-yml-gl…

de06931

…obbing

Docstrings, PR items, tests, clearer errors

36411ff

_build: Don't accept or submit '/' in individual leaf node names

b9c4564

include missed build file for tests

16e9f73

Remove older pr-related comments

17f4e79

eode added the waiting on review label Feb 8, 2018

kevinemoore approved these changes Feb 9, 2018


kevinemoore removed the waiting on review label Feb 9, 2018

Fix docs typo, file->files

4a2e25d

eode merged commit 3550404 into master Feb 10, 2018

akarve deleted the build-yml-globbing branch May 14, 2018 20:14
