Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vendor standard libraries in a zip file #3166

Closed
wants to merge 25 commits into from

Conversation

ryanking13
Copy link
Member

@ryanking13 ryanking13 commented Oct 4, 2022

Description

Vendors standard libraries in a zip file (/lib/python310.zip) in a compiles(pyc) format. The wasm_assets.py code which creates a zip file is adapted from the CPython repository.

Zipping standard libraries reduces the size of pyodide.asm.data around ~2MB:

$ stat -c "%s %n" pyodide.asm.data
5014687 pyodide.asm.data  # (in main)
2878194 pyodide.asm.data  # (in this PR)

Note that I haven't checked how this PR affects the import time. I wasn't able to find a good benchmark or experiment of how using zipimport affects the import time.

cc: @tiran

Partially related: #3050

Checklists

@rth
Copy link
Member

rth commented Oct 4, 2022

Thanks for doing this!

Zipping standard libraries reduces the size of pyodide.asm.data around ~2MB:

I think we should compare after gzip compression (done by the CDN) in which case I suspect the difference will be less.

So from what I understand you are shipping this zip inside the pyodide.asm.data? Why not just provide already unziped files (since compression is already handled by CDN & browser)? On the other hand, if we are able to ship it separately from pyodide.asm.data I would certainly be interested in that pyodide/pyodide-pack#1

@ryanking13
Copy link
Member Author

ryanking13 commented Oct 4, 2022

I think we should compare after gzip compression (done by the CDN) in which case I suspect the difference will be less.

Right, that's a good point. I should check that.

On the other hand, if we are able to ship it separately from pyodide.asm.data I would certainly be interested in that pyodide/pyodide-pack#1

Yes, that sounds very interesting. If we ship a zip file instead of a tar file, we might don't need to extract it to the file system, as Python can import it directly. So maybe we won't need to care about implementing unpackers.

@ryanking13 ryanking13 marked this pull request as draft October 4, 2022 06:40
@rth
Copy link
Member

rth commented Nov 12, 2022

I did some benchmarks of this on my laptop with files served with gzip compression,

cd dist
gzip -k *
npx http-server -g

(we should add instructions to the docs https://www.npmjs.com/package/http-server is much better for benchmarking tasks than the built-in python http.server, though I haven't managed to make brotli compression work),
then using logging from #3247

main

The download size for loading console.html: 6.14 MB (gzip compressed), 14.43 MB (uncompressed)

Load time:
- Firefox 106.0.5 in Linux: 1335 ± 50 ms
- Chrome 106: 1427 ± 32 ms (console opened), 877 ± 70 (console closed, probably with more caching)

Measured with caching in the console disabled each time in a new tab, but I'm not sure to what extent that really disables all WASM related caching. Creating a whole new browser profile for each run is also not very practical.

An example of an exception in the stdlib,

>>> from pathlib import Path
>>> Path('/a/b').write_text('a')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/lib/python3.10/pathlib.py", line 1152, in write_text
    with self.open(mode='w', encoding=encoding, errors=errors, newline=newline) as f:
  File "/lib/python3.10/pathlib.py", line 1117, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 44] No such file or directory: '/a/b'

very reasonable.

This PR

The download size for loading console.html: 5.71 MB (gzip compressed), 12.21 MB (uncompressed)

Load time:
- Firefox 106.0.5 in Linux: 852 ± 90 ms
- Chrome 106: 618 ± 10 ms (console opened), 461 ± 5 (console closed, probably with more caching)

An example of an exception in the stdlib:

>>> from pathlib import Path
>>> Path('/a/b').write_text('a')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/src/cpython/installs/python-3.10.2/lib/python3.10/pathlib.py", line 1152, in write_text
  File "/src/cpython/installs/python-3.10.2/lib/python3.10/pathlib.py", line 1117, in open
FileNotFoundError: [Errno 44] No such file or directory: '/a/b'

still OK (I wasn't sure how it would display with the zip file). Maybe is there a way to make this path a bit shorter to avoid python-3.10 twice in the path?

So overall the size improvement after compression is 8%, so not that much, however, the load time improvement for loading just core Pyodide + stdlib is really large,

  • 1.5x on Firefox
  • up to 1.9x for Chrome, and more with the console open but that's probably not very relevant

So overall it looks great @ryanking13 , and it also addresses the issue of a stdlib in a separate zip file without dealing with custom decompressors.

Would you mind adding a changelog particularly mentioning the load time improvement and crediting python/cpython#29984 ?

I think we should include this in the next release.

@rth rth marked this pull request as ready for review November 12, 2022 17:03
Copy link
Member

@hoodmane hoodmane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the Python developers already solved the bootstrap problem for us by ensuring that they can unzip before they need to import any Python files. That makes sense. It's cool that you didn't need to add hardly any runtime code for this.

Comment on lines 47 to 48
# Note that all those tests are moved to the subdirectory of test in Python 3.11,
# so we don't need this after Python 3.11 update.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just leave it then? It won't be that long before we update to 3.11.

Copy link
Member Author

@ryanking13 ryanking13 Nov 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we zip these directories, loadPackage('test') fails because it will try to unzip files into ctypes/test, unittest/test.

Copy link
Member Author

@ryanking13 ryanking13 Nov 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't be that long before we update to 3.11.

BTW, are you working on Python 3.11 update recently? If not, I have some time to try.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I just noticed that the upstream commits that move test directories are not applied to Python 3.11, this part is going to be needed until we have Python 3.12...

@@ -346,6 +345,8 @@ If you updated the Pyodide version, make sure you also updated the 'indexURL' pa
throw new Error("Didn't expect to load any more file_packager files!");
};

setHomeDirectory(Module, config.homedir);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you move setHomeDirectory after initializing the module? It's better if possible to keep all of this stuff before initializing Python because there are code paths related to virtual environments that care about these paths on initialization.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a bug in main branch: /home/pyodide/lib/python310.zip is in the sys.path not /lib/python310.zip.

>>> import sys
>>> sys.path
['/home/pyodide', '/home/pyodide/lib/python310.zip', '/lib/python3.10', '/lib/python3.10/lib-dynload', '/lib/python3.10/site-packages']

moving setHomeDirectory after the initialization fixed this, but maybe there can be a better fix.

@ryanking13
Copy link
Member Author

ryanking13 commented Nov 13, 2022

Wow, thanks for the benchmark and for reminding me of this @rth. I'll re-work a bit on this PR next week.

Some of the pain points in this PR were resolved in Python 3.11, so I think we can get more benefits from this PR when we update to Python 3.11.

File "/src/cpython/installs/python-3.10.2/lib/python3.10/pathlib.py", line 1152, in write_text

The traceback showing /src/cpython/installs is not very satisfactory.

@rth
Copy link
Member

rth commented Nov 13, 2022

Some of the pain points in this PR were resolved in Python 3.11, so I think we can get more benefits from this PR when we update to Python 3.11.

As you prefer, but I think it would be good to do a stable release soon. If we go for an Python 3.11 update, we would need to do another alpha release before that.

But certainly updating to 3.11 would be great also.

@hoodmane
Copy link
Member

Packages that currently aren't building in Python 3.11:

And also scipy. Many of these just need to be recythonized though.

@hoodmane
Copy link
Member

I guess I should open a python 3.11 PR soon so people know what the status is.

@ryanking13
Copy link
Member Author

As you prefer, but I think it would be good to do a stable release soon. If we go for an Python 3.11 update, we would need to do another alpha release before that.

Yeah, we can certainly merge this PR before updating to Python 3.11. I'll try to fix the traceback before merging though.

@ryanking13
Copy link
Member Author

ryanking13 commented Nov 14, 2022

Okay, I changed it to show only a relative path in traceback:

>>> import pathlib
>>> pathlib.Path("a/b/c").write_text("1")
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "pathlib.py", line 1152, in write_text
  File "pathlib.py", line 1117, in open
FileNotFoundError: [Errno 44] No such file or directory: 'a/b/c
>>> json.decoder.JSONDecoder().decode(1)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "json/decoder.py", line 337, in decode
TypeError: expected string or bytes-like object

I found that Python embeddable package for Windows which also ships stdlibs in a zip file shows tracebacks like this. So I think it makes sense to follow it?

@rth
Copy link
Member

rth commented Nov 14, 2022

Maybe we should make an announcement (on the public mailing list & twitter) that we are considering shipping .pyc compiled sources in the next release (both for stdlib and packages in #3253) and see if there are any concerns, particularly for teaching. I can't think of anything right now, but it doesn't hurt asking since it's a pretty large change (but IMO worth it).

Because the other solution would be to deploy two versions one source compiled, the other not. It's significantly more work for us. Though maybe we should still do this at least for packages, as otherwise it would be pretty difficult to debug if anything goes wrong and people want to check the source.

@ryanking13
Copy link
Member Author

Maybe we should make an announcement (on the public mailing list & twitter) that we are considering shipping .pyc compiled sources in the next release (both for stdlib and packages in #3253) and see if there are any concerns, particularly for teaching. I can't think of anything right now, but it doesn't hurt asking since it's a pretty large change (but IMO worth it).

Yes, that sounds very nice to me!

Because the other solution would be to deploy two versions one source compiled, the other not. Though it's significantly more work for us.

Yeah, that would be a bit complicated.. especially since it will add extra maintenance burden.

@hoodmane
Copy link
Member

I find the traceback json/decoder.py a bit confusing: it doesn't immediately look like it's in the stdlib. It would be nicer if it would still start with /lib/python3.10/ but I guess if that won't work for something it's not the end of the world...

@ryanking13
Copy link
Member Author

It would be nicer if it would still start with /lib/python3.10/ but I guess if that won't work for something it's not the end of the world...

Yes, we can make the path start with /lib/python3.10 like /lib/python3.10/pathlib.py and /lib/python3.10/json/decoder.py. The problem is that the file is not actually there so it is also a bit confusing. I have no strong opinion for this, so if someone has a strong opinion, I'll follow.

@hoodmane
Copy link
Member

Is the file in json/decoder.py? If not, then it doesn't make much difference which wrong path you give, but one of them makes me feel better...

@ryanking13
Copy link
Member Author

ryanking13 commented Nov 15, 2022

The traceback normally shows an absolute path, so if it shows a relative path, it is distinguishable. Though I agree that it is only understandable for those who know what is happening behind...

Or maybe we can try something different which explicitly tells that the file is inside the zipfile, something like:

  • /lib/python310.zip/json/decoder.py
  • <A_MESSAGE_THAT_TELLS_IT_IS_IN_THE_ZIP>/json/decoder.py

It somewhat matches with the __file__ attribute of the module (though it is not available for packages (#3253).

>>> pathlib.__file__
/lib/python310.zip/pathlib.pyc

Not sure this is a good idea 🤔

@rth
Copy link
Member

rth commented Nov 15, 2022

/lib/python310.zip/json/decoder.py

+1 for this. I also agree that displaying an absolute path that doesn't exist can be very confusing. Particularly given that Pyodide is used a lot for teaching Python and so we should try to be as close to the normal experience as possible.
Actually, if we could extract the files from the zip file it would likely still be easier to understand, but possibly slower (and we have no decoder).

@ryanking13
Copy link
Member Author

/lib/python310.zip/json/decoder.py

+1 for this. I also agree that displaying an absolute path that doesn't exist can be very confusing. Particularly given that Pyodide is used a lot for teaching Python and so we should try to be as close to the normal experience as possible. Actually, if we could extract the files from the zip file it would likely still be easier to understand, but possibly slower (and we have no decoder).

Opened a discussion: https://discuss.python.org/t/proper-file-paths-to-show-in-exception-for-py-compiled-files/21108.

@rth
Copy link
Member

rth commented Dec 2, 2022

In view of #3269 @ryanking13 could we have a flag here to source compile or not? I think shipping a zipfile always is maybe less problematic.

I would say we add flag to enable/disable this at build time, merge it as disabled for now, and then see what we deploy and whether we could deploy both.

return (file_pyc, archivename)


def create_stdlib_zip(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could put most of these functions in pyodide_build.packer (or something similar) so we could unit test them better? E.g. running this on some .py files in a temp dir, and checking the output zip (also to check that it doesn't contain .py).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sound good to me. That would require some modification in our build system since pyodide-build is not used for CPython build steps now, but I think it wouldn't be that hard.

@ryanking13
Copy link
Member Author

In view of #3269 @ryanking13 could we have a flag here to source compile or not? I think shipping a zipfile always is maybe less problematic.

I would say we add flag to enable/disable this at build time, merge it as disabled for now, and then see what we deploy and whether we could deploy both.

Yes, that sounds good to me.

@ryanking13
Copy link
Member Author

I moved the work in this PR into a build flag PYODIDE_ZIP_STDLIB.

PYODIDE_ZIP_STDLIB=1 make

So we can merge this PR as is without affecting default behavior. At the same time, people in the Jupyter workshop can experiment with this. I will open a separate PR to disable .pyc compilation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants