
openprinting-ppds: MemoryError #2

Closed
532910 opened this issue Feb 14, 2020 · 17 comments

@532910

532910 commented Feb 14, 2020

openprinting-ppds requires too much memory and doesn't handle this case
correctly:

% /usr/lib/cups/driver/openprinting-ppds cat openprinting-ppds:0/ppd/openprinting/Samsung/PXL/Samsung_M262x_282x_Series_PXL.ppd
Traceback (most recent call last):
  File "/usr/lib/cups/driver/openprinting-ppds", line 119, in <module>
    main()
  File "/usr/lib/cups/driver/openprinting-ppds", line 95, in main
    ppd = cat(args[1])
  File "/usr/lib/cups/driver/openprinting-ppds", line 67, in cat
    ppds['ARCHIVE'] = BytesIO(decompress(base64.b64decode(ppds['ARCHIVE'].encode('ASCII'))))
  File "/usr/lib/cups/driver/openprinting-ppds", line 17, in decompress
    return process.communicate(value)[0]
  File "/usr/lib/python3.7/subprocess.py", line 939, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib/python3.7/subprocess.py", line 1711, in _communicate
    stdout = b''.join(stdout)
MemoryError
zsh: exit 1     /usr/lib/cups/driver/openprinting-ppds cat
% free
              total        used        free      shared  buff/cache   available
Mem:          986Mi        55Mi       698Mi       0.0Ki       233Mi       762Mi
Swap:            0B          0B          0B
% apt policy openprinting-ppds 
openprinting-ppds:
  Installed: 20181217-2
  Candidate: 20181217-2
  Version table:
 *** 20181217-2 500
        500 https://deb.debian.org/debian buster/main amd64 Packages
        100 /var/lib/dpkg/status
@vitorbaptista
Collaborator

Hi @532910, thanks for the report.

This problem happens because pyppd saves all the PPDs into a single compressed string, which needs to be decompressed in its entirety to load any one PPD. This requires enough RAM to hold all decompressed PPDs in memory. On my system (Ubuntu 16.04), it takes 700 MB to extract a single PPD, and ~30 MB to list the available PPDs. It's a lot.

To solve this, we'd need to change the compression method so that we can extract just the required PPD. This will probably reduce the compression ratio, as the PPDs would be compressed individually (or in batches), in exchange for reduced runtime memory usage.

I won't have time to work on this in the coming weeks, but if you or anyone else reading this is able to work on a pull request, I can help point it in the right direction.

@532910
Author

532910 commented Feb 14, 2020

On my system 762 MiB is not enough. It's a virtual host, so I've increased the RAM and it works now, but I don't know how to check how much it requires.

This issue is not only about the amount of memory required, but also about this case being handled incorrectly. CUPS just says: Unable to copy PPD file / empty PPD file

@vitorbaptista
Collaborator

Adding to my comment above, it might be possible to use https://docs.python.org/3/library/lzma.html#lzma.LZMADecompressor to decompress the archive incrementally, reducing the memory usage.
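For illustration, here is a minimal sketch (not pyppd's actual code) of how lzma.LZMADecompressor could decompress the archive in bounded chunks; the file path and chunk size are assumptions:

```python
import lzma

def iter_decompressed_chunks(path, chunk_size=64 * 1024):
    """Yield decompressed chunks without holding the whole archive in RAM."""
    decompressor = lzma.LZMADecompressor()
    with open(path, "rb") as f:
        while not decompressor.eof:
            compressed = f.read(chunk_size)
            if not compressed:
                break
            # Only chunk_size bytes of compressed input are fed at a time,
            # so peak memory stays roughly bounded by the chunk size.
            yield decompressor.decompress(compressed)
```

A consumer could then scan the yielded chunks for the desired PPD and stop early, instead of materializing the whole decompressed archive.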

@532910 You can check the memory usage by running:

$ /usr/bin/time --verbose /usr/lib/cups/driver/openprinting-ppds cat openprinting-ppds:0/ppd/openprinting/Samsung/PXL/Samsung_M262x_282x_Series_PXL.ppd
...
	Command being timed: "/usr/lib/cups/driver/openprinting-ppds cat openprinting-ppds:0/ppd/openprinting/Samsung/PXL/Samsung_M262x_282x_Series_PXL.ppd"
	User time (seconds): 1.11
	System time (seconds): 0.51
	Percent of CPU this job got: 142%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.13
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 700256
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 189094
	Voluntary context switches: 28437
	Involuntary context switches: 17
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

The required memory is the "Maximum resident set size". In this example it is 700256 kbytes, roughly 700 MB. The more PPDs in your archive, the higher this memory requirement.

Regarding the wrong error message, I agree it would be good to improve it, and I'm happy to change it in PyPPD. However, I think this is a case of CUPS handling the exception, not PyPPD. I'm following the issue you created in their repo.

@532910
Author

532910 commented Feb 14, 2020

Maximum resident set size (kbytes): 1159316, i.e. ~1.1 GB

@tillkamppeter
Member

We don't need to worry about the "list" option, which retrieves the index of the PPD package: each PPD is represented by a one-line entry there, so it should be around 1% or less of the whole archive size.
With the "cat" option we always request one single PPD file, and a single PPD is rather small compared to what the whole package can be. What we should do is not decompress the whole package, but do a streaming decompression until we arrive at the desired PPD file, discarding the unneeded/skipped data immediately before loading new data. Once the desired PPD is fully loaded, we stop the process immediately. For example, we could load 2 data blocks of ~10 kB each; if they do not contain the beginning of the PPD, discard the first and load the next, and so on until we find the beginning of the PPD. Then we stop discarding and load more blocks until we find the end. After that we output the PPD and terminate the program. WDYT?
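The block-scanning idea above could look roughly like this sketch. The start/end markers here are hypothetical (pyppd's real archive layout may use different delimiters), and the chunks would come from a streaming decompressor:

```python
def extract_ppd(chunks, start_marker, end_marker):
    """Scan decompressed chunks, keeping only a small tail in memory
    until the start marker is found, then accumulate bytes up to the
    end marker and return the PPD (or None if not found)."""
    buf = b""
    found = False
    for chunk in chunks:
        buf += chunk
        if not found:
            idx = buf.find(start_marker)
            if idx == -1:
                # keep a tail in case the marker spans two chunks
                buf = buf[-(len(start_marker) - 1):]
                continue
            buf = buf[idx:]
            found = True
        end = buf.find(end_marker, len(start_marker))
        if end != -1:
            return buf[:end + len(end_marker)]
    return None
```

Because discarded data is dropped before the next chunk is loaded, peak memory is roughly one PPD plus one chunk, rather than the whole decompressed archive.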

@vitorbaptista
Collaborator

Hey @tillkamppeter, nice to see you around! Yes, that's what I was thinking. It seems Python's LZMA package can do this streaming decompression, but we're not using it because it's Python 3-only; we're using the xz executable instead. We'd need to find a way to send the data to xz and read the result back in batches, which I'm not sure is possible.

I'm unable to work on this for the next weeks, so if anyone reading this wants to give it a try, please go ahead. The changes will probably be limited to https://github.com/vitorbaptista/pyppd/blob/master/pyppd/pyppd-ppdfile.in#L42-L55 and https://github.com/vitorbaptista/pyppd/blob/master/pyppd/compressor.py.

@532910
Author

532910 commented Feb 16, 2020

Really, I see nothing wrong with a Python 3 dependency; Python 2 is widely deprecated nowadays.

@vitorbaptista
Collaborator

Really, I see nothing wrong with a Python 3 dependency; Python 2 is widely deprecated nowadays.

@tillkamppeter What do you think? We could use the native LZMA support on Python 3 only, which would reduce the memory requirements, and fall back to the current usage of xz otherwise. However, if Python 2 isn't important anymore, we could just stop supporting it and bump the version to 2.x. If anyone needs Python 2 support, they can use the 1.x versions.
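The fallback variant could be sketched like this (function name and xz invocation are illustrative, not pyppd's actual code): prefer the stdlib lzma module where available, otherwise shell out to the xz executable as today.

```python
try:
    import lzma

    def decompress(data):
        # Python 3: native LZMA support, also enables incremental decompression
        return lzma.decompress(data)
except ImportError:
    import subprocess

    def decompress(data):
        # Python 2 fallback: pipe the data through the xz executable
        process = subprocess.Popen(["xz", "--decompress", "--stdout"],
                                   stdin=subprocess.PIPE,
                                   stdout=subprocess.PIPE)
        return process.communicate(data)[0]
```

Dropping the except branch entirely is what a Python 3-only 2.x release would amount to.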

@tillkamppeter
Member

I think so, too. Do a 2.x release which is Python 3-only and uses the native LZMA support. The rare cases where Python 2 is needed are covered by the old 1.x versions using xz.
By the way, in Ubuntu 20.04 Python 2 will be completely removed from Main, and most common Python projects (HPLIP, system-config-printer, ...) already switched to Python 3 years ago.

@tillkamppeter
Member

Could you finish this (including the 2.0 release) at least a few days before Feb 27? That is the Feature Freeze for Ubuntu 20.04 LTS (Focal). Thanks.

@532910
Author

532910 commented Feb 18, 2020

Wouldn't it be better to use Zstd?

% git clone https://github.com/OpenPrinting/foomatic-db
% cd foomatic-db/db/source/
% tar caf PPD.tar PPD
% =time --verbose zstd --ultra -22 PPD.tar -o PPD.tar.ultra.zst
	User time (seconds): 72.66
	Maximum resident set size (kbytes): 1261088
% =time --verbose zstd PPD.tar -o PPD.tar.default.zst        
	User time (seconds): 0.56
	Maximum resident set size (kbytes): 36536
% =time --verbose xz --keep PPD.tar --stdout > PPD.tar.default.xz
	User time (seconds): 93.65
	Maximum resident set size (kbytes): 98024
% =time --verbose xz --keep --best PPD.tar --stdout > PPD.tar.best.xz   
	User time (seconds): 84.93
	Maximum resident set size (kbytes): 690072

% ls
587M  PPD.tar
3.2M  PPD.tar.best.xz
4.7M  PPD.tar.default.xz
14M   PPD.tar.default.zst
1.9M  PPD.tar.ultra.zst

% =time --verbose xzcat PPD.tar.default.xz > /dev/null 
	User time (seconds): 0.96
	Maximum resident set size (kbytes): 10808
% =time --verbose xzcat PPD.tar.best.xz > /dev/null 
	User time (seconds): 0.85
	Maximum resident set size (kbytes): 68220
% =time --verbose zstdcat PPD.tar.default.zst > /dev/null
	User time (seconds): 0.22
	Maximum resident set size (kbytes): 4392
% =time --verbose zstdcat PPD.tar.ultra.zst  > /dev/null
	User time (seconds): 0.18
	Maximum resident set size (kbytes): 133560

@tillkamppeter
Member

Note that the zstd Python bindings are not in Ubuntu, and the zstd command-line utility is not in Ubuntu Main, so using zstd for pyppd now would make it impossible to get the fixed pyppd into Ubuntu 20.04 LTS. We should use the Python 3 onboard solution for now, and can think about anything even better only after 20.04.

@532910
Author

532910 commented Feb 19, 2020

agree

@tillkamppeter
Member

I have merged Pull Request #3, cloned the repo with it included, and tried it. If I select a PPD close to the start of the list, it extracts the PPD quickly and works correctly with a few MB of memory consumption. If I take a PPD near the end of the list, it takes some seconds and then gets killed with signal 9 before reaching the desired PPD.

I have downloaded the source of foomatic-db and compressed the db/source/PPD directory. I am on Ubuntu Focal; this distro is Python 3-only and there is no /usr/bin/python. I tried both symlinking /usr/bin/python3 to /usr/bin/python and editing the shebang of the self-extracting archive.

@tillkamppeter
Member

I have looked into it again: with a PPD near the beginning of the archive we really do save memory, but if we try to extract one near the end of the archive, the process seems to hog much more memory than before, so the kernel kills it when it runs out of memory. There seems to be a severe memory leak which I was not able to deduce from reviewing the Python code.

@tillkamppeter
Member

Can it be that we had double compression and decompression in the code? If you look at the load() function, it calls decompress(); then the old code called decompress() again in the cat() function, which is now replaced by streaming decompression.

@tillkamppeter
Member

The last patch (Pull Request #4) makes it work great now. Thank you very much @dsam82.
