Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Loading of Structured Packages to Py-Config #519

Closed
3 tasks done
JeffersGlass opened this issue Jun 15, 2022 · 30 comments · Fixed by #914
Closed
3 tasks done

Add Loading of Structured Packages to Py-Config #519

JeffersGlass opened this issue Jun 15, 2022 · 30 comments · Fixed by #914
Labels
needs-triage Issue needs triage type: feature New feature or request

Comments

@JeffersGlass
Copy link
Member

JeffersGlass commented Jun 15, 2022

Checklist

  • I added a descriptive title
  • I searched for other feature requests and couldn't find a duplicate (including also the type-feature tag)
  • I confirmed that it's not related to another project are area (see the above section)

What is the idea?

Currently, <py-env> makes use of loadFromFile() to copy a single file at each path in Paths to the same directory the Pyscript code runs in, allowing for simple imports of individual .py files.

However, by breaking up the file structure of the included files, this methods breaks recognition of local packages. It also does not allow for folder of content to be included en masse, which would be useful for processing data stored across multiple files.

Consider a very silly utility file:

### emoji.py ###
def print_saxophone():
  print(🎷)

This works:

### index.html ###
<py-env>
  - paths:
    - ./emoji.py
</py-env>
<py-script>
  from emoji import print_saxophone
  print_saxophone()
</py-script>

But this breaks:

utils/
├─ __init__.py
├─ emoji.py
index.html
### index.html ###
<py-env>
  - paths:
    - ./utils/__init__.py
    - ./utils/emoji.py
    <!-- Or any other combination of paths -->
</py-env>
<py-script>
  from utils.emoji import print_saxophone
  print_saxophone()
</py-script>

Why is this needed

Having the ability to structure local Python code as packages is useful for organizing larger codebases into relevant chunks, and allows for better code organization.

What should happen?

I'm not sure what the syntax for usage would be. It could extending the syntax to include a local-packages label or similar. (There's probably a better name for the tag.)

<py-env>
  - matplotlib
  - paths:
    - ./file1.py
  - local-packages
    - ./utils

Or, perhaps there's a way to use paths itself more cleverly, where Pyscript can identify at load time which paths represent files and which represent packages/folders.

Additional Context

From an implementation perspective, I recognize this may be difficult to handle cleanly, since we need to be able to individually fetch all the files that make up a given package from wherever they are stored. But even if the user had to enumerate all the files in a given package in their , it would still be useful to allow them to be referenced as a package.

@JeffersGlass JeffersGlass added needs-triage Issue needs triage type: feature New feature or request labels Jun 15, 2022
@JeffersGlass JeffersGlass changed the title Add Local Package Loading to Py-Env Add Local Package/Folder Loading to Py-Env Jun 15, 2022
@JeffersGlass JeffersGlass changed the title Add Local Package/Folder Loading to Py-Env Add Local Package Loading to Py-Env Jun 15, 2022
@JeffersGlass
Copy link
Member Author

For those encountering this challenge, there's a topic on the Pyscript forum for a workaround, as well as an example repo from mudream4869 showing how to compile a local folder to a wheel and reference that in <py-env>.

@karolyjozsa
Copy link

karolyjozsa commented Aug 9, 2022

I needed the same for demoing my package on the web without forcing users to install my package. Well, I also need it for development, i.e. need the changing Python modules in the local package on-the-fly (compiling wheel every time is not a solution for me). So I created a patched version of PyScript, see attached, by modifying 2 of the .ts files (and the compiled .js). I also created a README for explaining what it does, why and how.
You may consider adding this new feature to the next version of PyScript. Only the 2x modified .ts files need to be pulled.

EDIT: Corrected, so that the missing packagebase is not an empty string (which causes backward compatibility issues), but undefined rather.
pyscript_src_patched.zip

@pauleveritt
Copy link

This is an opportunity: But even if the user had to enumerate all the files in a given package

__all__ is a way to publicly exports symbols from a package. It's laborious and possibly error-prone, but you could fetch the package's __init__.py, get the __all__, find the paths that import those symbols, and fetch them.

Of course, this would be so much easier in Python instead of TS. 😀 Wish you could have a disposable interpreter for bootstrapping the actual one.

@JeffersGlass
Copy link
Member Author

find the paths that import those symbols

Would you do this by looking at the files they were imported from in the __init__.py? It's an interesting thought. Or is there another way of discovering which file a symbol was defined it?

@pauleveritt
Copy link

@JeffersGlass The pithy answer: Python and inspect. But you're right, if it was TS, you'd like back up at the from foo import bar lines. Tricky, but likely not impossible.

@marimeireles
Copy link
Member

@madhur-tandon I think this is not relevant anymore due to your pr?

@JeffersGlass
Copy link
Member Author

JeffersGlass commented Oct 4, 2022

I think it would be the same thing now, but for the paths specification in <py-config>? I can change the issue name.

The gist is - right now anything specified in paths is copied directly to the same location in the Emscripten file system as the running Python code, which flattens the directory structure of Python Packages and makes them... well, not packages anymore.

It'd be a nice-to-have feature to be able to somehow load a complete package from an arbitrary URL in a way that preserves directory structure, without having to first compile it to a wheel or zip it up, but implementation-wise it might be unreasonable.

@JeffersGlass JeffersGlass changed the title Add Local Package Loading to Py-Env Add Loading of Structured Packages to Py-Config Oct 4, 2022
@madhur-tandon
Copy link
Member

madhur-tandon commented Oct 11, 2022

@JeffersGlass To begin with, I started with the example you showed i.e.

### emoji.py ###
def print_saxophone():
  print(🎷)

with the index.html having the following:

<py-config>
paths = ["./emoji.py"]
</py-config>

<py-script>
    from emoji import print_saxophone
    print_saxophone()
</py-script>

However, the above doesn't work, so I tried the other way instead i.e.

<py-config>
paths = ["./emoji.py"]
</py-config>

<py-script>
    import emoji
    emoji.print_saxophone()
</py-script>

and it works. Could you confirm this if it occurs for you as well? i.e. if one way of importing works and the other doesn't?

We can then look at the package and retaining the directory structure next.

Update: This seems to be non-deterministic i.e. sometimes it works, while sometimes it fails with
ModuleNotFoundError: No module named 'emoji' and it seems like there is no difference in how you choose to use it aka both ways work, but non-deterministic failure is still there.

@JeffersGlass
Copy link
Member Author

Hi @madhur-tandon - both examples above work for me, with the minor correction in my original code that print_saxophone requires quotes around the emoji:

### emoji.py ###
def print_saxophone():
  print("🎷") #These quotes are critical! Oops!

Would you confirm whether either method works for you with that adjustment?

@madhur-tandon
Copy link
Member

Would you confirm whether either method works for you with that adjustment?

Yes, I already added the quotes and both work for me too. But sometimes, the issue can occur in both of them so it is non-deterministic really and is independent of the way of importing. You can check the Update I wrote at the bottom in the original comment.

Also, you can check PR 845 which partially solves the package problem -- by retaining the directory structure. I requested a review from you too.

@pauleveritt
Copy link

FWIW, regarding importing packages, @ryanking13 had a Pyodide package he worked on: https://github.com/ryanking13/pyodide-importer

Here was the discussion: pyodide/pyodide#1917

He's no longer updating it, but he discovered lots of edge cases. Useful info.

@JeffersGlass
Copy link
Member Author

But sometimes, the issue can occur in both of them so it is non-deterministic really and is independent of the way of importing.

I have yet to see this fail for me with either version... interesting. It's essentially the same process that's been in the todo.html example for ages though, yeah? Nothing fancy going in here that would specifically break it that I can see.

@ryanking13
Copy link

FWIW, regarding importing packages, @ryanking13 had a Pyodide package he worked on: https://github.com/ryanking13/pyodide-importer

Here was the discussion: pyodide/pyodide#1917

He's no longer updating it, but he discovered lots of edge cases. Useful info.

Yes, Pyodide core team has decided not to support that feature internally and our current recommendation is to make a wheel or a zip file instead of downloading each Python file in packages.

FWIW, Pyodide has recently added a native file system mount support (pyodide/pyodide#2987). (For now it is supported by chromium based browsers only)

@antocuni
Copy link
Contributor

Sorry for chiming in so late, I didn't see this discussion before.
So, my two cents on it.

  1. Yes, we should provide a way to download files while preserving directory structure
  2. I think that for the time being the easiest solution is to enumerate all files explicitly, but we should have something which is easily extensible to other smarter ways
  3. whatever solution we come up with, it should support generic files as well. .py files are just a special case of them
  4. there are lot of tricky and corner cases to be aware of.

Speaking of tricky cases, re @pauleveritt

__all__ is a way to publicly exports symbols from a package.

__all__ is supposed to list publicly exported symbols but you cannot reliably rely on that (for example, I never use it for my own packages: it is useful only if you do from ... import * and you should never do that anyway -- but that's a different discussion :)).
Moreover it lists individual names and not files, so it's not very useful in this respect. And there might also be other files which are imported indirectly. And there might be packages which dynamically determine which files to import, etc. etc.
In general, it is not possible to statically analyze a python app to determine the list of .py files which are needed at runtime: AFAIK it's a known problem/limitation of all tools which produces a single executable out of a python app.
I'd prefer to avoid going down that rabbit hole :).


Relative vs absolute URLs

This is something which was not obvious to me at first so I think it's better to say it explicitly: currently, paths=[...] is not a list of paths, but a list of URLs. Because of relative URLs work, if your page is served at http://example.com/foo.html and you use paths=['./a.py'], it downloads http://example.com/a.py.
So they look like "paths" in your local filesystem but in reality they are just URLs, which can be relative or absolute.

If we want to preserve the directory structure, we need to know what is the "base URL". For example, imagine that we want to download these two URLs:

https://github.com/pyscript/pyscript/blob/main/pyscriptjs/tests/integration/__init__.py
https://github.com/pyscript/pyscript/blob/main/pyscriptjs/tests/integration/test_00_support.py

If the "base URL" is https://github.com/pyscript/pyscript/blob/main/, then I expect them to be saved in the /pyscriptjs/tests/integration directory of the virtual FS. But if the "base URL" is e.g. https://github.com/pyscript/pyscript/blob/main/pyscripyjs/tests, then I expect them to be saved in the /integration directory.

I don't think there is any reasonable way to automatically determine the "base URL", it must be specified by the user. So we need a nice way so that they can specify:

  1. the base URL
  2. the local folder where they want to save the files
  3. the list of files to download

A possible proposal

This is just a first proposal, but I am thinking of something like this:

[[fetch]]
url = "https://github.com/pyscript/pyscript/blob/main/"
folder = "pyscriptjs/tests/integration/"
files = ["__init__.py", "test_00_support.py"]

[[fetch]]
url = "https://example.com"
folder = "foo"
files = ["a.txt", "b.txt"]

This would result in the following virtual FS structure:

pyscriptjs/tests/integration/__init__.py
pyscriptjs/tests/integration/test_00_support.py
foo/a.txt
foo/b.txt

Moreover, we can still support the common case of ".py files which are siblings of the main .html file". The "normal" version would be this:

[[fetch]]
url = "" 
folder = ""
files = ["a.txt", "b.txt"]

But I think that it's very reasonable to say that url="" and "folder="" are the default, so you don't need to specify them in the common case. So, a full example could be this:

<py-config>
[[fetch]]
files = ["utils/__init__.py", "utils/foo.py"]

[[fetch]]
url = "https://github.com/pyscript/pyscript/blob/main/"
folder = "pyscriptjs/tests/integration/"
files = ["__init__.py", "test_00_support.py"]

[[fetch]]
url = "https://example.com"
folder = "foo"
files = ["a.txt", "b.txt"]
</py-config>

Naming is hard

As usual, finding good names is hard. The ones which I used above are just the first which came to my mind, but here are some alternatives:

  • [[download]] instead of [[fetch]]
  • base_url, baseURL, URL instead of url
  • target_folder, dir, directory, path instead of folder

Extensibility

The proposal above is easily extensible. For example, I can imagine having a github plugin which allows to write the following:

<py-config>
[[fetch]]
provider = "github"
repo = "pyscript/pyscriptjs`
version = "2022.09.1"  # or a branch name, a commit ID, etc.
</py-config>

Other possible "fetch provider" could be (just thinking alound):

  • unzip which automatically unzip/untar an archive
  • filelist, which downloads filelist.txt which contains the full list of files to download (so that you don't have to list all of them explicitly)
  • ftp
  • endless possibilities :)

@madhur-tandon
Copy link
Member

@antocuni I think it is a very good proposal and will proceed to implement it. But I want to ask for opinions from @JeffersGlass, @fpliger and @tedpatrick -- if they have any objections.

For those who need quick context: I recently merged #845 since we changed to using Emscripten's FS APIs
and a side effect of it is weird directory structure when saving files from URLs as illustrated by: #845 (comment) since we didn't handle them.

Antonio's proposal feels like a natural extension and we should get to action!

@madhur-tandon
Copy link
Member

@antocuni On a second thought and perhaps taking some ideas from what @ntoll suggests --

We could simply have a list of dictionary / objects with each dict having key has what the path looks like in the file system and the value as the URI.

So,

# for a local resource
[[fetch]]
path = "utils/random/generator.py"
uri = "./random_gen.py"

[[fetch]]
# for a resource on the internet
path = "/hi.txt"
uri = "https://example.com/foo/hello.txt"

and so on...

IMO, this is much simpler + the responsibility for calculating the URLs or perhaps calculating the path for the filesystem -- both are delegated to the user.

WDYT?

@JeffersGlass
Copy link
Member Author

We could simply have a list of dictionary / objects with each dict having key has what the path looks like in the file system and the value as the URI.

I quite like this! Definitely simpler, and probably less prone to user error. I wonder if the path then can be something that's even more explicitly a reference to the local file system? In contrast to uri it makes sense,

My immediate thought is, "How do we accommodate wanting to fetch many files?" In which case, I think the answer is "Here's a recipe in the docs for how you do that at the start of your script", maybe using pyfetch directly. Or maybe we wrap that in a fetchFile function at some point or something.

@antocuni
Copy link
Contributor

antocuni commented Nov 1, 2022

re @madhur-tandon

# for a local resource
[[fetch]]
path = "utils/random/generator.py"
uri = "./random_gen.py"

[[fetch]]
# for a resource on the internet
path = "/hi.txt"
uri = "https://example.com/foo/hello.txt"

I'm confused; consider my original examples:

[[fetch]]
files = ["utils/__init__.py", "utils/foo.py"]

If I understand correctly, it would become the following?

[[fetch]]
path = "utils/__init__.py"
uri = "utils/__init__.py"

[[fetch]]
path = "utils/foo.py"
uri = "utils/foo.py"

And my second example:

[[fetch]]
url = "https://github.com/pyscript/pyscript/blob/main/"
folder = "pyscriptjs/tests/integration/"
files = ["__init__.py", "test_00_support.py"]

would become:

[[fetch]]
path = "pyscriptjs/tests/integration/__init__.py"
uri = "https://github.com/pyscript/pyscript/blob/main/pyscriptjs/tests/integration/__init__.py"

[[fetch]]
path = "pyscriptjs/tests/integration/test_00_support.py"
uri = "https://github.com/pyscript/pyscript/blob/main/pyscriptjs/tests/integration/test_00_support.py"

Did I understand it correctly?
If so, in which way this is simpler than my original proposal? 🤔

Also, why uri instead of url?


re @JeffersGlass

I quite like this! Definitely simpler, and probably less prone to user error.

I think it's definitely complex and very likely to lead to much more user errors, because of the tons of copy&paste which is needed.

My immediate thought is, "How do we accommodate wanting to fetch many files?" In which case, I think the answer is "Here's a recipe in the docs for how you do that at the start of your script", maybe using pyfetch directly. Or maybe we wrap that in a fetchFile function at some point or something.

if we want to go in this direction, then you can just remove [[fetch]] altogether. If you have a programmatic API to download multiple files at once, why do you want also declarative API to download single files?

Personally, -1 on this. There is a lot of value in being able to just say [[fetch]] files = ['a', 'b', 'c'] and having them magically on your virtual FS.

Also, from the implementation point of view, having a declarative syntax allows us to fetch them while pyodide is loading, thus considerably cutting down startup time

@JeffersGlass
Copy link
Member Author

If so, in which way this is simpler than my original proposal?

Simpler in the very literal sense of: there are only two parameters to understand instead of three. Take the resource at uri and download it to the filesystem at filepath. Simpler. Not necessarily better.

Additionally, requiring each URL/URI consist of a "folder" and "file" is a restriction in the proposal - what if I want to load the file present at http://www.jeff.glass/my_api_endpoint?python_version=3.9 and use that as a data or source file? That's perfectly valid URL to serve a file from, but doesn't have a "file" part of the URL.

There may be other ways to solve the latter problem.

@antocuni
Copy link
Contributor

antocuni commented Nov 1, 2022

Simpler in the very literal sense of: there are only two parameters to understand instead of three. Take the resource at uri and download it to the filesystem at filepath. Simpler. Not necessarily better.

ok, in this sense I agree. It is surely simpler to implement, it is probably simpler to explain, I'm very doubtful that it's simpler to use.

And personally, I think that simplicity to use -- especially for common cases -- should be our first priority. (And a very common case is: I have 3 .py files and I want to make them available as modules).

Also, note that folder is basically just a way to save typing. We can easily remove it if you think it adds unnecessarily complexity.

Additionally, requiring each URL/URI consist of a "folder" and "file" is a restriction in the proposal - what if I want to load the file present at http://www.jeff.glass/my_api_endpoint?python_version=3.9 and use that as a data or source file? That's perfectly valid URL to serve a file from, but doesn't have a "file" part of the URL.

You have a point here. I start to wonder whether we should support both use cases, because they seem to serve very different purposes:

  1. I want to download a set of files from a base URL and preserve the directory structure
  2. I want to download a single file from an arbitrary URL and save it with a custom filename

I'm trying to solve point (1) (which incidentally it's what this issue is about ;)).
@JeffersGlass and @madhur-tandon are trying to solve point (2).

We can probably come up with a solution which does both. E.g.:

[[fetch]]
# point (1). If you specify "files", it downloads multiple files
url = "http://example.com/"
files = ["a.txt", "b.txt"]

[[fetch]]
# point (2). If you specify "save_as", it downloads a single file
url = "http://example.com/api_endpoint?v=12"
save_as = "c.txt"

[[fetch]]
# if you specify both, it's an error
url = "..."
files = [...]
save_as = ...

I'm not particularly convinced of the name save_as though. Misc alternatives:

  • local_name
  • target
  • target_filename
  • to
  • other suggestions?

Also, let me underline again a very strong point of my proposal which I think was overlooked: since url="" by default, we can specify just files=[...] and they will be downloaded from the same webserver as the HTML page. In other words:

# this is what we have now
paths = ['foo/__init__.py', 'foo/bar.py']

# this would be *completely equivalent* to the above
[[fetch]]
files = ['foo/__init__.py', 'foo/bar.py']

@madhur-tandon
Copy link
Member

@antocuni your understanding regarding both the cases below is correct:

[[fetch]]
path = "utils/init.py"
uri = "utils/init.py"

[[fetch]]
path = "utils/foo.py"
uri = "utils/foo.py"

[[fetch]]
path = "pyscriptjs/tests/integration/init.py"
uri = "https://github.com/pyscript/pyscript/blob/main/pyscriptjs/tests/integration/__init__.py"

[[fetch]]
path = "pyscriptjs/tests/integration/test_00_support.py"
uri = "https://github.com/pyscript/pyscript/blob/main/pyscriptjs/tests/integration/test_00_support.py"

This is simpler in the sense that right now -- in your original proposal -- the user has to split the complete URL into parts for url, folder and files -- which opens the possibility of mistakes on the user side. Then, on our side, we have to construct the complete URL using this -- which I already do in the PR currently since it is based on the original proposal right now -- see here:

https://github.com/pyscript/pyscript/pull/914/files#diff-a46bdf2eb41054c052aba50dd29116d21635c57e69de9b2596015f9670ab7a8dR174-R196

This also opens the area for mistakes on our side -- eg: handling trailing /.

Overall, splitting the URL by the User --> reconstructing it by us --> larger surface area for mistakes.

Further, what if I want the file in the FS to be saved in another custom path or perhaps with a different name? i.e. saving https://github.com/pyscript/pyscript/blob/main/pyscriptjs/tests/integration/test_00_support.py to utils/hi.py for example.
The original proposal doesn't accommodate this.

And regarding issue of copy paste for cases like:

[[fetch]]
path = "utils/__init__.py"
uri = "utils/__init__.py"

We can simply say that if you want the same dir structure and the same filename, perhaps only the following is sufficient

[[fetch]]
uri = "utils/__init__.py"

and internally, we just copy the value of uri into path.

Also, uri is just name, we can rename it to anything, maybe url is fine I guess.

@antocuni
Copy link
Contributor

antocuni commented Nov 2, 2022

This is simpler in the sense that right now -- in your original proposal -- the user has to split the complete URL into parts for url, folder and files

this is not completely true, since url and folder are optional.
But also, look at the title of the issue: we want to "preserve the structure of python packages", and for this you have to specify which part of the URL is the "base" and which part contains that "directory structure" that you want to preserver.
Consider this:

url = "https://github.com/pyscript/pyscript/blob/main/pyscriptjs/tests/integration/test_00_support.py"

In which path of the virtual FS you want to save the file?

  • pyscript/pyscript/blob/main/pyscriptjs/tests/integration/test_00_support.py ?
  • pyscript/blob/main/pyscriptjs/tests/integration/test_00_support.py ?
  • blob/main/pyscriptjs/tests/integration/test_00_support.py ?
  • main/pyscriptjs/tests/integration/test_00_support.py ?
  • pyscriptjs/tests/integration/test_00_support.py ?
  • etc.

The point of my original proposal is to solve the issue; url specifies the base, and files specify the folders as you want them in the virtual FS.

This assumes that the layout on the web server is the same as the one which you want on disk, of course. I claim that for the common case of "I want to download a python package and make it available on my FS", this is ~always true.

which opens the possibility of mistakes on the user side.

forcing the user to do an endless copy&paste also opens the possibility of a lot of mistakes on their side.

Then, on our side, we have to construct the complete URL using this -- which I already do in the PR currently since it is based on the original proposal right now -- see here:
[cut]
This also opens the area for mistakes on our side -- eg: handling trailing /.

Which means that our code needs to be robust, well tested, aware of corner cases and hopefully bug free. Yes, I agree ;).

Overall, splitting the URL by the User --> reconstructing it by us --> larger surface area for mistakes.

disagree.

Further, what if I want the file in the FS to be saved in another custom path or perhaps with a different name? i.e. saving

Look at my answer above, I think I already answered to this, didn't I?

And regarding issue of copy paste for cases like:

[[fetch]]
path = "utils/__init__.py"
uri = "utils/__init__.py"

We can simply say that if you want the same dir structure and the same filename, perhaps only the following is sufficient

[[fetch]]
uri = "utils/__init__.py"

and internally, we just copy the value of uri into path.

And what would this do?

[[fetch]]
url = "http://example.com/a/b/c/d/e.py"

Also, uri is just name, we can rename it to anything, maybe url is fine I guess.

well, we are talking about URLs, so I think we should use url, yes 😅.

@fpliger
Copy link
Contributor

fpliger commented Nov 2, 2022

As shocked as I am... I'll say I mostly agree with @antocuni here. :D

I don't love the verbosity of it but think [most of] it is necessary to avoid weird corner cases. My main desire with this though is:

  1. We [try to] keep it simple
  2. The default (and simplest option) works fine "out of the box" for most cases
  3. We provide options so users can customize it to match a different use case (even if we need to complicate it a bit more)
  4. We make sure we keep the door for plugins to add functionality here
  5. We make sure we are really accounting for the different use cases this feature can be used for and we don't try to force "one size fits all" when it doesn't make sense

Elaborating on 5 really quick, this feature is meant to copy/fetch [arbitrary] files to the browser FS, while path was originally meant to upload python files that can be imported and use by users application. These are quite different use cases as a whole even if they are similar in some aspects.

[[fetch]]
# point (1). If you specify "files", it downloads multiple files
url = "http://example.com/"
files = ["a.txt", "b.txt"]

[[fetch]]
# point (2). If you specify "save_as", it downloads a single file
url = "http://example.com/api_endpoint?v=12"
save_as = "c.txt"

[[fetch]]
# if you specify both, it's an error
url = "..."
files = [...]
save_as = ...

I like the direction of this, especially if we can use default values to not force users to type excessive characters "for nothing" while maintaining clarity. A good example of this is what @antocuni mentioned with

Also, let me underline again a very strong point of my proposal which I think was overlooked: since url="" by default, we can specify just files=[...] and they will be downloaded from the same webserver as the HTML page. In other words:

# this is what we have now
paths = ['foo/__init__.py', 'foo/bar.py']

# this would be *completely equivalent* to the above
[[fetch]]
files = ['foo/__init__.py', 'foo/bar.py']

There are a couple of cases/questions that I want to clarify though:

  1. Given the example above or the one where we provide an explicit base_url (yes, using it instead of uri because I think tis' more semantically helpful) and multiple files without specifying save_as will mean the saved files maintain the same file structure, right? So, in the example above, I'll end up with files inside a foo folder.

  2. If I, as a user, want to download multiple files to a new/specific folder in the file system that is not part of the original path, how would I do it? For instance, I'm grabbing multiple data files from different sources (so I have multiple fetch sections) and I want all of them to be written into a data root folder. How would this case be supported?

  3. Expanding on 2, there are some cases where users want to flatten downloaded files (like the one above) and other where they want to keep it or even others when they want a mix of both... (i.e., let's say said app in 2 downloads financial data as well as demographics and I want the resulting structure of the folders of my app to be /data/finance and /data/demographics, how do we support this?

Unfortunately, when talking about files and transformations, it's not trivial to think and provide a solution that will works for everyone BUT I think it's key for us to provide ways that we/users can extend how fetch works to support these scenarios and others that we may not have thought of...

@antocuni
Copy link
Contributor

antocuni commented Nov 2, 2022

I might have an idea which could solve in an easy and elegant way the majority of use cases described above (the last famous words...).

So, first of all, the naming: I started by using url, files and target, but then I realized that it might be nicer to just use from and to. I'm happy to discuss naming independently, but I kind of like those.

Some examples:

# fetch a single file
[[fetch]]
from = "http://a.com/data.csv"
# http://a.com/data.csv ==> ./data.csv

# fetch a single file, specify the target filename
[[fetch]]
from = "http://a.com/data.csv?version=1"
to = "/tmp/data.csv"
# http://a.com/data.csv?version=1 ==> /tmp/data.csv

# fetch multiple files from the "default" webserver, save to the default folder
[[fetch]]
files = ["foo/__init__.py", "foo/mod.py"]

# fetch multiple files and put them in a different folder
[[fetch]]
files = ["foo/__init__.py", "foo/mod.py"]
to = "/my/lib/"
# ==> /my/lib/foo/__init__.py
# ==> /my/lib/foo/mod.py

# fetch multiple files from a different base url
[[fetch]]
from = "http://a.com/download/"
files = ["foo/__init__.py", "foo/mod.py"]
to = "/my/lib/"
# http://a.com/download/foo/__init__.py ==> /my/lib/foo/__init__.py
# http://a.com/download/foo/mod.py ==> /my/lib/foo/mod.py

Basically, the semantics would be this, in python pseudocode:

def fetch_one(entry):
    base_url = entry.from or ""
    target = entry.to or "./"
    if entry.files:
        for f in entry.files:
            src = os.path.join(base_url, f)
            dst = os.path.join(target, f)
            wget(src, dst)
    else:
        src = base_url
        dst = target
        wget(src, dst)

I think that the scheme above plays very well with plugins. Imagine to have a plugin which registers a "github fetch provider":

[[fetch]]
from = "github:antocuni/dotfiles"
to = "/tmp/"

@fpliger
Copy link
Contributor

fpliger commented Nov 2, 2022

Really like that approach.

Wow @antocuni , I really think you are starting to see things as I do 🤣 🤣 🤣 🤣

@antocuni
Copy link
Contributor

antocuni commented Nov 3, 2022

I think there is one small tweak that we can discuss: according to what I proposed above, in the "single file" case, does to contain the name of the destination file, not the destination folder. E.g:

[[fetch]]
from = "http://example.com/data.csv"
to = "/tmp" 

This would save the file as /tmp, not /tmp/data.csv.
One thing that we can do is the following:

  • if to ends with a /, it is considered a directory name (and we derive the filename from the URL and/or from the Content-Disposition HTTP header
  • if to ends with something else, it's considered the full path of the file.

E.g.:

[[fetch]]
from = "http://example.com/data.csv"
to = "/tmp/"   # ==> /tmp/data.csv

[[fetch]]
from = "http://example.com/data.csv"
to = "/tmp/myfile"   # ==> /tmp/myfile

To be honest, I'm undecided whether this is an useful feature or just obscure and error prone

@JeffersGlass
Copy link
Member Author

Very nice @antocuni ! I think this is really excellent.

Per your last example in the single file case - my two cents is that it's maybe more trouble than it's worth. I think there's more value in saying "The to field is the path (folder) where the fetched files are placed" and not having to clarify "Unless you're only fetching a single file...".

And re: naming, if I understand right (thank you for the Pseudocode!): from is URL, to is a filesystem path (folder), and files is a list of suffixes appended to each as applicable. You know how I love when the key names describe the data type the represent - maybe from_url and to_path or to_folder or something? Just to continue hammering home the distinction between resources on the network and in the local filesystem.

@fpliger
Copy link
Contributor

fpliger commented Nov 3, 2022

I'm aligned with @JeffersGlass on consistency here...

And re: naming, if I understand right (thank you for the Pseudocode!): from is URL, to is a filesystem path (folder), and files is a list of suffixes appended to each as applicable. You know how I love when the key names describe the data type the represent - maybe from_url and to_path or to_folder or something? Just to continue hammering home the distinction between resources on the network and in the local filesystem.

+1 on explicit naming when it's helpful! I'm debating on the from because I think this can be generic "enough" (especially if we start supporting some weird things via plugins but on the to, I think there's a lot of value in using a more descriptive name.

@antocuni
Copy link
Contributor

antocuni commented Nov 3, 2022

Per your last example in the single file case - my two cents is that it's maybe more trouble than it's worth. I think there's more value in saying "The to field is the path (folder) where the fetched files are placed" and not having to clarify "Unless you're only fetching a single file...".

so you are proposing that to is always a folder? Works for me, but then how do you handle the case in which the URL doesn't have a "reasonable" filename? E.g., what would be the filename in this example?

[[fetch]]
from = "http://google.com/"
to = "/tmp"  # ===> /tmp/???

(I'm fine to say that this is not supported).

+1 on explicit naming when it's helpful! I'm debating on the from because I think this can be generic "enough" (especially if we start supporting some weird things via plugins but on the to, I think there's a lot of value in using a more descriptive name.

yes, my idea is precisely from is generic, so the concrete meaning depends on the fetch provider: e.g., github:antocuni/dotfiles is NOT an URL.

I'm fine to use a better name for to. Maybe we could use to_folder and to_file, which would solve also the problem above. E.g.:

[[fetch]]
from = "http://a.com/data.csv"
to_folder = "/tmp"  # ==> /tmp/data.csv

[[fetch]]
from = "http://a.come/data.csv"
to_file = "/tmp/foo.csv"  # ==> /tmp/foo.csv

@fpliger
Copy link
Contributor

fpliger commented Nov 3, 2022

so you are proposing that to is always a folder? Works for me, but then how do you handle the case in which the URL doesn't have a "reasonable" filename? E.g., what would be the filename in this example?

[[fetch]]
from = "http://google.com/"
to = "/tmp"  # ===> /tmp/???
(I'm fine to say that this is not supported).

In this case users should use to_file, files or something like that.

yes, my idea is precisely from is generic, so the concrete meaning depends on the fetch provider: e.g., github:antocuni/dotfiles is NOT an URL.

I'm fine to use a better name for to. Maybe we could use to_folder and to_file, which would solve also the problem above.

Yeah, very aligned here as well. What's happening?! :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-triage Issue needs triage type: feature New feature or request
Projects
None yet
8 participants