Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"build zipmanifest" should be called only once for each ZipProvider #240

Closed
ghost opened this issue Aug 11, 2014 · 11 comments
Closed

"build zipmanifest" should be called only once for each ZipProvider #240

ghost opened this issue Aug 11, 2014 · 11 comments
Labels

Comments

@ghost
Copy link

@ghost ghost commented Aug 11, 2014

Originally reported by: jun66j5 (Bitbucket: jun66j5, GitHub: jun66j5)


When Trac 1.0-stable (r13065) with message catalogs on setuptools 5.4 is installed as a egg file, "build zipmanifest" is called more than 4 times on setuptools 5.3. Especally, that's too slow on Windows because zipfile.py on Windows is slow more than 10 times on Unix.

After 5.4, ZipProvider.zipinfo property reads egg file each time. Before 5.3, ZipProvider.__init__ method reads egg file and the result wil be stored in its instance variable.

I think the changes lead the issue. The following patch would be the same behavior of 5.3. Thoughts?

(Originally reported at https://groups.google.com/d/topic/trac-users/gX5kYTUFXM4)

#!diff
--- pkg_resources.py.orig     2014-08-09 22:06:34.877375000 +0900
+++ pkg_resources.py  2014-08-09 22:06:37.533625000 +0900
@@ -1636,7 +1636,11 @@

     @property
     def zipinfo(self):
-        return self._zip_manifests.load(self.loader.archive)
+        try:
+            return self._zipinfo
+        except AttributeError:
+            self._zipinfo = self._zip_manifests.load(self.loader.archive)
+            return self._zipinfo

     def get_resource_filename(self, manager, resource_name):
         if not self.egg_name:

How to reproduce:

#!shell
$ easy_install Babel==0.9.6
$ easy_install Genshi==0.6
$ easy_install http://svn.edgewall.org/repos/trac/branches/1.0-stable
$ python -m cProfile -s time $VENV/bin/trac-admin --version 2>/dev/null | head -15

setuptools 5.4 on Windows:

trac-admin-script.py 1.0.2dev-r13065
         2475815 function calls (2471154 primitive calls) in 13.135 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      187    3.756    0.020    7.826    0.042 zipfile.py:799(_RealGetContents)
   119597    1.124    0.000    1.640    0.000 zipfile.py:287(__init__)
   478400    1.027    0.000    1.027    0.000 {method 'read' of 'cStringIO.StringI' objects}
   119784    0.805    0.000    1.826    0.000 pkg_resources.py:1553(<genexpr>)
   266592    0.531    0.000    0.531    0.000 {method 'append' of 'list' objects}
   119597    0.511    0.000    0.749    0.000 zipfile.py:900(getinfo)
   122860    0.321    0.000    0.321    0.000 {_struct.unpack}
   119597    0.301    0.000    0.301    0.000 zipfile.py:383(_decodeExtra)
149422/148669    0.299    0.000    0.303    0.000 {len}

setuptools 5.3 on Windows:

trac-admin-script.py 1.0.2dev-r13065
         864149 function calls (859551 primitive calls) in 4.985 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       42    0.828    0.020    1.723    0.041 zipfile.py:799(_RealGetContents)
      944    0.275    0.000    0.536    0.001 api.py:316(<genexpr>)
     1218    0.273    0.000    0.534    0.000 api.py:317(<genexpr>)
   133269    0.270    0.000    0.270    0.000 {unichr}
    26466    0.247    0.000    0.361    0.000 zipfile.py:287(__init__)
   105875    0.226    0.000    0.226    0.000 {method 'read' of 'cStringIO.StringI' objects}
      132    0.171    0.001    0.171    0.001 {nt.rename}
    79708    0.157    0.000    0.157    0.000 {method 'append' of 'list' objects}
       42    0.138    0.003    2.206    0.053 pkg_resources.py:1533(build_zipmanifest)

setuptools 5.3 on Ubuntu 12.04:

trac-admin 1.0.2dev-r13065
         845470 function calls (840887 primitive calls) in 0.474 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       44    0.108    0.002    0.184    0.004 zipfile.py:755(_RealGetContents)
    27178    0.033    0.000    0.041    0.000 zipfile.py:277(__init__)
       69    0.028    0.000    0.028    0.000 {method 'get_data' of 'zipimport.zipimporter' objects}
       44    0.015    0.000    0.218    0.005 pkg_resources.py:1533(build_zipmanifest)
   108712    0.014    0.000    0.014    0.000 {method 'read' of 'cStringIO.StringI' objects}
       16    0.013    0.001    0.020    0.001 sre_compile.py:301(_optimize_unicode)
      944    0.012    0.000    0.020    0.000 api.py:316(<genexpr>)
     1218    0.012    0.000    0.019    0.000 api.py:317(<genexpr>)
  706/166    0.010    0.000    0.025    0.000 sre_parse.py:379(_parse)

setuptools 5.3 on Ubuntu 12.04:

trac-admin 1.0.2dev-r13065
         2367413 function calls (2362830 primitive calls) in 1.292 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      189    0.503    0.003    0.859    0.005 zipfile.py:755(_RealGetContents)
   120309    0.151    0.000    0.190    0.000 zipfile.py:277(__init__)
   481236    0.065    0.000    0.065    0.000 {method 'read' of 'cStringIO.StringI' objects}
   120498    0.058    0.000    0.117    0.000 pkg_resources.py:1550(<genexpr>)
   123576    0.035    0.000    0.035    0.000 {_struct.unpack}
   120309    0.031    0.000    0.031    0.000 zipfile.py:368(_decodeExtra)
   120309    0.028    0.000    0.037    0.000 zipfile.py:854(getinfo)
       69    0.028    0.000    0.028    0.000 {method 'get_data' of 'zipimport.zipimporter' objects}
   120381    0.025    0.000    0.025    0.000 {method 'find' of 'str' objects}

@ghost
Copy link
Author

@ghost ghost commented Aug 15, 2014

Original comment by philip_thiem (Bitbucket: philip_thiem, GitHub: Unknown):


Thanks! There was some work on implementing caching of the zipfile process wide, so the try/catch would not had been needed. This was for cases where several packages are in the same zip. However, It was decided to make that behavior non-default. My guess is that this area slipped by in that decision. @jaraco I can take a look to make sure we don't regress the previous memory issues this weekend and merge if that is ok.

@ghost
Copy link
Author

@ghost ghost commented Aug 15, 2014

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


Duplicate of #154.

@ghost
Copy link
Author

@ghost ghost commented Aug 15, 2014

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


This issue appears to me to be an exact duplicate of 154. If there is something that distinguishes it, I'm missing it.

Does enabling the environment variable to enable caching fix the issue?

I agree it would be nice to speed the loading of these structures. Perhaps we should revisit that issue to see if there is a better implementation that could cache the values during startup, but expire them over time to allow the memory to be reclaimed.

@ghost
Copy link
Author

@ghost ghost commented Aug 15, 2014

Original comment by jun66j5 (Bitbucket: jun66j5, GitHub: jun66j5):


Does enabling the environment variable to enable caching fix the issue?

If PKG_RESOURCES_CACHE_ZIP_MANIFESTS environment is enabled, the issue would be fixed and faster than 5.3.

However, I think setuptools 5.4 without the environment should have the same speed of 5.3.

@ghost
Copy link
Author

@ghost ghost commented Aug 15, 2014

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


On further consideration, perhaps you're right. I believe my concerns in #154 about memory usage may have been overly conservative. Looking at the 5.3 code, it seems that the 'zipinfo' was always loaded once for each ZipProvider. It should be reasonable to retain that behavior.

@ghost
Copy link
Author

@ghost ghost commented Aug 15, 2014

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


In my original consideration, I believed that the zipinfo was primarily only used on startup and had little performance impact on a running application. The issues reported in the referenced Trac report strongly suggests otherwise. Therefore, I believe the caching mechanism should be enabled by default.

@ghost
Copy link
Author

@ghost ghost commented Aug 15, 2014

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


Use MemoizedZipManifests for all operations. Fixes #240.

@ghost
Copy link
Author

@ghost ghost commented Aug 15, 2014

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


Released as 5.7. Please use the PKG_RESOURCES_CACHE_ZIP_MANIFESTS for Setuptools 5.4-5.6.

@ghost
Copy link
Author

@ghost ghost commented Aug 15, 2014

Original comment by philip_thiem (Bitbucket: philip_thiem, GitHub: Unknown):


Actually it is not the same. My fix to #154 was a global cache. This put an instance cache back into the ZipProvider. And indeed, prior to the #154 fix it was an instance cache built in the ZipProvider.init.

#!python


    def __init__(self, module):
        EggProvider.__init__(self, module)
        self.zipinfo = build_zipmanifest(self.loader.archive)
        self.zip_pre = self.loader.archive+os.sep

(See https://bitbucket.org/pypa/setuptools/src/bfbccab83c1d/pkg_resources.py - bfbccab83c1d - 2014-05-27)

In the case of #154, the reporter's deployment mechanism was accessing multiple packages from the same zip, but different ZipProviders (different packages). Thus, each Package's init function would had rebuilt the zip manifest from the same zip multiple times. In retrospect, #154 might had been fixable using weak references just so it wasn't guaranteed that the cached manifest would be around for the duration of the process, but I digress.

In any case with a global cache, an instance cache was no longer needed, so it was removed (10cc90d9b828) as part of the original fix. The zipinfo property then called the "build" mechanism directly each time. What is happening here, is that during the lifetime of the object it would seem that zipinfo gets called four or more times on a given instance. So without the global cache the zip manifest is being rebuilt, not just for the same zip but also the same package multiple times.

@ghost
Copy link
Author

@ghost ghost commented Aug 15, 2014

Original comment by philip_thiem (Bitbucket: philip_thiem, GitHub: Unknown):


So I guess my question would be are ZipProviders around for the direction of the process. If so, adding an instance cache wouldn't be any better than turning on the global cache and may use more memory. If not, then some memory may be saved by reimplementing an instance cache, and leaving the optional global in place. [Edits in italics]

@ghost
Copy link
Author

@ghost ghost commented Aug 15, 2014

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


@philip_thiem Good analysis. My suspicion is that ZipProviders are around for the life of the process, but I haven't verified that fact. Do feel free to explore the issue further and provide additional analysis or changes if you believe the performance can be better optimized.

@ghost ghost added major bug 5.4 labels Mar 29, 2016
@ghost ghost closed this Mar 29, 2016
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
0 participants