build.yml globbing #287
Also, remove unused import
…e-build-yml-globbing
Broken on Python 2.7 see Travis :-( |
* Provides Python2.7 compatibility * Provides helper functions from tools/util.py
Between this and the TensorFlow branch, whichever is merged first will probably have some conflicts with the other, as I've pulled in some files into this branch from TF. This should be ready for a review, now. |
My original method of globbing just walked the whole package dir, and filtered out paths that didn't match. Simpler, but also slower. This version is likely a little slower for small build dirs, and faster for larger ones (though still a little slower than the stdlib, platform-dependent one).
Glob strings work as you'd expect, with full multilevel specification, like:
Some example syntax:
# glob strings must be quoted.
contents:
  csv:
    "[!bt]*.csv":  # a glob term without content is accepted, but needs a colon.
    "csv.*":  # transforms work
      transform: csv
  subnode:
    "subdir/**":  # multilevel recursion
      transform: csv
  excel:
    "*.xlsx":
      kwargs:  # kwargs verified to work
        skiprows: [0,10,100, 300, 600]
  collision:  # naming collisions from subdirs are renamed with a number
    "**/csv.txt":  # example: csv_txt, csv_txt_2 ...
      transform: csv |
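The renaming described in the collision comment above could look roughly like the following. This is only a sketch: `to_node_name` and `unique_node_names` are hypothetical helpers, not the resolver in this branch, and the rule for turning a filename into a node name (non-word characters become `_`) is an assumption.

```python
import re
from pathlib import PurePosixPath


def to_node_name(filename):
    # Assumption: every non-word character is replaced by an underscore.
    return re.sub(r"\W", "_", filename)


def unique_node_names(paths):
    """Map each relative path to a node name, numbering repeats (_2, _3, ...)."""
    used = set()
    names = {}
    for path in paths:
        base = to_node_name(PurePosixPath(path).name)
        candidate, n = base, 1
        # Keep bumping the suffix until the name is free.
        while candidate in used:
            n += 1
            candidate = "{}_{}".format(base, n)
        used.add(candidate)
        names[path] = candidate
    return names


print(unique_node_names(["csv.txt", "a/csv.txt", "b/csv.txt"]))
```

The first match keeps the plain name and later matches get a numeric suffix, which is the `csv_txt, csv_txt_2 ...` pattern from the example.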
Two questions:
|
|
Strongly suggest we use |
@dimaryaz, @akarve So, this is what I ran across:
..but, it came with some experimentation that I found all that out. :-/ At that point, I was sucked in, and made what I suppose was a mistake -- writing my own. The end result: ..in any case, it still might be preferable to just use an existing lib -- either pathlib.PureWindowsPath.match() and forego ** matching, or glob.glob() and forego identical builds for identical build dirs on different OSes. |
this is a first version - start with something simple (glob.glob) and
document the behavior in the code & docs, then we'll "fix" it later if it
becomes a problem.
aka "let it become a problem"
Brian - that said, I love love love that you went ahead and wrote your
own. Obviously, not good if it had been week(s) but for a day or two, it's
an excellent way to "get close" to the problem.
…On Wed, Jan 17, 2018 at 10:33 AM, eode ***@***.***> wrote:
@dimaryaz <https://github.com/dimaryaz>, @akarve
<https://github.com/akarve>
Not attached to using the glob I made.. ..initial reasoning was via a
conversation with Aneesh, where we came to the conclusion that case
insensitivity was the route (roughly) to go. However, I think he figured
we'd be able to use glob.glob, which I did, initially, as well.
So, this is what I ran across:
- pathlib.PureWindowsPath.match() doesn't work the same as glob.glob
  - it works on an individual, per-path basis
  - it doesn't match the same as glob.glob on **/*
  - it requires walking all files of the subtree, regardless of whether they are matched or not
    - this might not be a problem, depending on our package sizes
- glob.glob doesn't support case insensitivity -- or more to the point, case consistency. It matches the convention of the current OS
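To make the case-consistency point concrete, here is a small sketch using the pure path classes from the standard library, which behave the same on any OS. `match_flavours` is a hypothetical helper for illustration only.

```python
from pathlib import PurePosixPath, PureWindowsPath


def match_flavours(path, pattern):
    """Return (windows_result, posix_result) for one path and one pattern."""
    return (
        PureWindowsPath(path).match(pattern),  # Windows flavour ignores case
        PurePosixPath(path).match(pattern),    # POSIX flavour is case-sensitive
    )


# match() tests one path at a time; it never touches the filesystem.
print(match_flavours("data/Report.CSV", "*.csv"))
```

Because `match()` only answers for a single path, the caller still has to enumerate every file in the subtree, which is the walking cost mentioned in the list.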
..but, it came with some experimentation that I found all that out. :-/ At
that point, I was sucked in, and made what I suppose was a mistake --
writing my own.
The end result:
Mine matches bash's behavior, (not python's glob.glob() if it were
case-insensitive).
..in any case, it still might be preferable to just use an existing lib --
either pathlib.PureWindowsPath.match() and forego ** matching, or
glob.glob() and forego identical builds for identical build dirs on
different OSes.
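One way to get the same result on every OS is to walk the tree and filter relative paths with a case-insensitive regex. The sketch below assumes paths are already relative and POSIX-style; `consistent_match` is a hypothetical name, and this does not reproduce bash's semantics or the matcher written for this PR.

```python
import fnmatch
import re


def consistent_match(rel_paths, pattern):
    """Filter POSIX-style relative paths, ignoring case on every platform."""
    # fnmatch.translate turns a shell pattern into a regex string.
    # Caveat: fnmatch's "*" also matches "/", unlike a shell glob.
    regex = re.compile(fnmatch.translate(pattern), re.IGNORECASE)
    return [p for p in rel_paths if regex.match(p)]


paths = ["Data/A.CSV", "data/b.csv", "notes.txt"]
print(consistent_match(paths, "data/*.csv"))
```

This is the simple walk-and-filter approach: every path is tested, so it trades speed for identical builds across platforms.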
|
Any particular place I should put tests for testing I did the typical Python thing of implementing |
dangit. glob.glob() doesn't handle recursion on 2.7. I've got to switch to pathlib.Path.glob(). I.e., |
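For reference, a minimal sketch of recursive matching with `pathlib.Path.glob()`, written for Python 3. On 2.7 this would need a pathlib backport, which is an assumption here; `recursive_matches` is a hypothetical helper and not the code in this branch.

```python
import tempfile
from pathlib import Path


def recursive_matches(root, pattern):
    """Return sorted POSIX-style paths, relative to root, of files matching pattern."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix() for p in root.glob(pattern) if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "subdir" / "deep").mkdir(parents=True)
    for rel in ("a.csv", "subdir/b.csv", "subdir/deep/c.csv", "subdir/notes.txt"):
        (base / rel).write_text("x\n")
    # "**" matches zero or more directory levels, so a.csv is included too.
    print(recursive_matches(base, "**/*.csv"))
```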
Ugh!! Pls slack dima if you don't get quick answers to questions
…On Jan 17, 2018 7:36 PM, "eode" ***@***.***> wrote:
dangit. glob.glob() doesn't handle recursion on 2.7. I've got to switch to
pathlib.Path.glob().
|
OK, this should be fixed now -- end result:
But bash's actual behavior is that |
FYI, |
OK I've got this in my build.yml:
But it is not working. All of the .txt use |
Are you working from the build-yml-globbing branch? master of course fails.
$ quilt build asah/babynames ./build.yml
Inferring 'transform: id' for babynames/NationalReadMe.pdf
Copying ./babynames/NationalReadMe.pdf...
Traceback (most recent call last):
...
  File "/Users/asah/quilt/compiler/quilt/tools/build.py", line 110, in _build_node
    checks_contents=checks_contents, dry_run=dry_run, env=env, ancestor_args=group_args)
  File "/Users/asah/quilt/compiler/quilt/tools/build.py", line 108, in _build_node
    raise StoreException("Invalid node name: %r" % child_name)
quilt.tools.store.StoreException: Invalid node name: '*.txt'
vs.
$ cd quilt/compiler
$ git checkout build-yml-globbing
Switched to branch 'build-yml-globbing'
Your branch is up-to-date with 'origin/build-yml-globbing'.
$ pip install -e .
Obtaining file:///Users/asah/quilt/compiler
Requirement already satisfied: appdirs>=1.4.0 in /Users/asah/miniconda3/lib/python3.6/site-packages (from quilt==2.8.2.dev0)
...
$ cd ~/babynames
$ quilt build asah/babynames ./build.yml
Inferring 'transform: id' for babynames/NationalReadMe.pdf
Copying ./babynames/NationalReadMe.pdf...
{'pattern': '*.txt', 'dir': PosixPath('.')}
baby1.txt
Serializing ./baby1.txt...
100%|██████████| 4.00/4.00 [00:00<00:00, 559B/s]
Saving as binary dataframe...
baby2.txt
Serializing ./baby2.txt...
100%|██████████| 4.00/4.00 [00:00<00:00, 2.71kB/s]
Saving as binary dataframe...
Inferring 'transform: id' for babynames/yob1880.txt
Copying ./babynames/yob1880.txt...
Inferring 'transform: id' for babynames/yob1881.txt
Copying ./babynames/yob1881.txt...
Inferring 'transform: id' for babynames/yob1882.txt
Copying ./babynames/yob1882.txt...
Built asah/babynames successfully.
…On Tue, Jan 23, 2018 at 7:22 PM, Aneesh Karve ***@***.***> wrote:
OK I've got this in my build.yml:
babynames:
  NationalReadMe:
    file: babynames/NationalReadMe.pdf
  # apply transform: csv to all succeeding nodes in this group
  '*.txt':
    transform: csv
    kwargs:
      header: #there's no header row in these files
  yob1880:
    file: babynames/yob1880.txt
  yob1881:
    file: babynames/yob1881.txt
  yob1882:
    file: babynames/yob1882.txt
  # ... lots of .txt files
But it is not working. All of the .txt use transform: id and the CLI
tells me as much during build. Am I doing something wrong?
|
Yes I'm on the right branch. Build succeeds but doesn't treat the txts properly. |
This has a soft dependency on #312, in that there are newer versions of some files there, and the diff size will be reduced once that's in. |
@akarve Docs updated. |
I got it to fail twice with
Second run:
|
Discussed on slack: When working with a generated Cached build reporting not discussed yet. |
Looks good aside from conflicts and minor changes suggested inline.
compiler/quilt/tools/compat.py
Outdated
# patch = mock.patch
# Path = pathlib.Path
# TemporaryDirectory = tempfile.TemporaryDirectory
This doesn't seem to be used anywhere. Is it part of TensorFlow?
It was, but has been dropped. Now that the TF Prep PR has been merged, I can merge master into this and clean up, then squash and merge into master.
compiler/quilt/tools/package.py
Outdated
def __getitem__(self, item):
    node = self.get_contents()

    if '/' in item and '.' in item:
Let's pick just one syntax for accessing package components by path. My vote is to use '.' as the separator.
I'm rather for '/', myself. Originally, I was for the dot notation, but these considerations changed my thoughts on it:
- Our user/pkg/sub/node/notation uses slashes, and this is consistent with that.
- Using slashes leaves open the option of relaxing node names to be more like filenames in the future
- It would allow us to use a pathlib.PurePosixPath object when doing common pathlike operations.
- as we move toward using item['notation'] for nodes, looking like a Python identifier becomes less important.
Although less of a general consideration, it's worth noting that '/' is used in build._build_node, which this was made for.
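As an illustration of the PurePosixPath point, the usual pathlike operations come for free once node paths use '/'. `split_node_path` is a hypothetical helper, not something in quilt.

```python
from pathlib import PurePosixPath


def split_node_path(node_path):
    """Return (parts, parent, leaf) for a '/'-separated node path."""
    path = PurePosixPath(node_path)
    return path.parts, path.parent.as_posix(), path.name


parts, parent, leaf = split_node_path("sub/node/leaf")
print(parts)   # ('sub', 'node', 'leaf')
print(parent)  # sub/node
print(leaf)    # leaf
```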
compiler/quilt/tools/package.py
Outdated
parts = [item]

try:
    for part in parts[:-1]:
This seems overly complicated. Why not: for part in parts...return node?
You're right.
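A rough sketch of the simplified traversal, assuming '/' as the separator. `Node` is a hypothetical stand-in with a plain dict of children, not the real package class.

```python
class Node:
    """Hypothetical stand-in for a package node with named children."""

    def __init__(self, children=None):
        self.children = children or {}

    def __getitem__(self, item):
        node = self
        # Walk every part of the path; a missing part raises KeyError.
        for part in item.split("/"):
            node = node.children[part]
        return node


leaf = Node()
root = Node({"sub": Node({"node": leaf})})
print(root["sub/node"] is leaf)
```

Looping over all parts and returning the final node removes the special case for the last element.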
Works. I'll file a separate issue for the build output when build cache is hit. |
@kevinemoore Ready for re-review. |
* master: (38 commits)
  Implement the "always requires auth" catalog config (quiltdata#365)
  Replace a confusing SQL query with a slightly less confusing one (quiltdata#363)
  Eliminate stray back-tick [ci skip]
  Eliminate redundant `.team` [ci skip]
  Whack backticks [ci skip]
  fix syntax example; refactor headings [ci skip]
  Add alternative terms [ci skip]
  Link to object store docs [ci skip]
  Consolidate and move sections [ci skip]
  Rename section [ci skip]
  Simplify and update links [ci skip]
  Rename section [ci skip]
  Add notes on immutability, tophash, etc. [ci skip]
  Use a small dataset in most of the tests (quiltdata#360)
  Add (No results) to empty search results (quiltdata#359)
  Admin UI endpoint to list users and associated data (quiltdata#354)
  Upgrade the stack (quiltdata#326)
  build.yml globbing (quiltdata#287)
  Pass the DISABLE_SIGNUP env variable to django (quiltdata#356)
  use npm lock files, delete yarn (quiltdata#355)
  ...
Uses syntax:
A few notes:
there were a couple of tools in tensorflow that I hadn't brought over -- like the backported pathlib in tools.compat, and the file->node duplicate naming conflict resolver. These have now been brought over. There may be conflicts for the TensorFlow branch once this is merged into master..subdir_foo_csv
still rough