Strange memory leak(?) consuming behaviour #65

Closed
dgtlmoon opened this issue Aug 29, 2022 · 5 comments

Comments

@dgtlmoon

dgtlmoon commented Aug 29, 2022

2.2.0

Update: this looks like a Python lxml memory leak issue, see https://medium.com/devopss-hole/python-lxml-memory-leak-b8d0b1000dc7

For some background: I'm using your wonderful library in my Flask application, which means the process never gets restarted. I've tried moving the inscriptis step to its own thread, but it still seems to make the whole app bleed memory.

#!/usr/bin/python3

import time

def leak_memory():
  from inscriptis import get_text
  with open('leaky.html', 'r') as f:
    s = f.read()
  text_content = get_text(s)

leak_memory()
leak_memory()
leak_memory()
leak_memory()

print ("Done, now look at memory usage")
time.sleep(20)

See the script and the test HTML here

leaky.html.zip

What I'm seeing is that on some more complex HTML, it consumes something like 150 MB on the first get_text(..) call and then never lets the process release that memory. That's the problem for me.

  • I've tried calling gc.collect() after get_text(), but it never releases the memory
  • I've tried del text_content and so on, but that didn't help

Any ideas? Is this a bug?

Happy to throw a few dollars across for supporting your wonderful project!

@dgtlmoon
Author

Here is an example that adds all the usual del and gc.collect() cleanups.
It seems to be related to the parser and certain HTML. I usually notice it with pages from YouTube or Google Sheets, but often with others too.

#!/usr/bin/python3

import time
import gc
def leak_memory(path):
  from inscriptis import get_text
  # Use a separate name for the file handle so it does not shadow the path
  with open(path, 'r') as f:
    s = f.read()
  text_content = get_text(s)

  # Try all the tricks
  del s
  del text_content
  del get_text
  gc.collect()

leak_memory('leaky.html')

print ("Done, now look at memory usage, should be about 150mb still")
time.sleep(20)

Result: 120 MB still in use, and it cannot be released.
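
One way to check these numbers from inside the script, rather than from an external process monitor, is to read the resident set size before and after the call. The sketch below is Linux-only and assumes /proc is mounted; current_rss_mb is a hypothetical helper name and not part of inscriptis.

```python
import os


def current_rss_mb():
    """Return this process's resident set size in MiB, or None if unavailable.

    Reads /proc/self/statm, which exists on Linux only. The second field
    is the number of resident pages.
    """
    try:
        with open("/proc/self/statm") as fh:
            resident_pages = int(fh.read().split()[1])
    except (OSError, ValueError, IndexError):
        # No /proc (e.g. macOS, Windows) or an unexpected file format
        return None
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / (1024 * 1024)
```

Printing current_rss_mb() before get_text(s), right after it, and again after the del and gc.collect() steps would show whether the resident size actually drops at any point.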

@dgtlmoon
Author

dgtlmoon commented Aug 30, 2022

Using gc.DEBUG_LEAK from https://docs.python.org/3/library/gc.html#gc.DEBUG_LEAK (via https://www.oreilly.com/library/view/python-cookbook/0596001673/ch14s10.html ), I think it's showing me that a lot of inscriptis objects can't be reached by the garbage collector (too many circular references? even after del?)

#!/usr/bin/python3

import time
import gc

def leak_memory(path):
  from inscriptis import get_text
  # Use a separate name for the file handle so it does not shadow the path
  with open(path, 'r') as f:
    s = f.read()
  text_content = get_text(s)

  # Try all the tricks
  del s
  del text_content
  del get_text

if __name__=="__main__":

  gc.enable()
  gc.set_debug(gc.DEBUG_LEAK)
  leak_memory('leaky.html')

  gc.collect()
  for x in gc.garbage:
    s = str(x)
    if len(s) > 80: s = s[:77] + '...'
    print (type(x), "\n  ", s)

$ python3 ./leaky.py
gc: collectable <Inscriptis 0x7f4f4494df10>
gc: collectable <dict 0x7f4f44942b80>
gc: collectable <dict 0x7f4f43d16940>
gc: collectable <dict 0x7f4f43d16900>
gc: collectable <method 0x7f4f43d16a40>
gc: collectable <list 0x7f4f43d0a6c0>
gc: collectable <list 0x7f4f43da2200>
gc: collectable <list 0x7f4f43d32740>
gc: collectable <ParserConfig 0x7f4f4494dfd0>
gc: collectable <method 0x7f4f44a629c0>
gc: collectable <method 0x7f4f449f3380>
gc: collectable <method 0x7f4f43dd0700>
gc: collectable <method 0x7f4f43dd7440>
gc: collectable <method 0x7f4f43dd0640>
gc: collectable <method 0x7f4f43dd0500>
gc: collectable <method 0x7f4f43dd0740>
gc: collectable <method 0x7f4f43dd0780>
gc: collectable <method 0x7f4f43dd7400>
gc: collectable <method 0x7f4f43dd73c0>
gc: collectable <method 0x7f4f43d0a540>
gc: collectable <method 0x7f4f43d32180>
gc: collectable <method 0x7f4f43d32140>
gc: collectable <Attribute 0x7f4f4489fd00>
gc: collectable <dict 0x7f4f448b7f00>
gc: collectable <dict 0x7f4f43d168c0>
<class 'inscriptis.html_engine.Inscriptis'> 
   <inscriptis.html_engine.Inscriptis object at 0x7f4f4494df10>
<class 'dict'> 
   {'config': <inscriptis.model.config.ParserConfig object at 0x7f4f4494dfd0>, '...
<class 'dict'> 
   {'table': <bound method Inscriptis._start_table of <inscriptis.html_engine.In...
<class 'dict'> 
   {'table': <bound method Inscriptis._end_table of <inscriptis.html_engine.Insc...
<class 'method'> 
   <bound method Attribute.apply_attributes of <inscriptis.model.attribute.Attri...
<class 'list'> 
   [<default prefix=, suffix=, display=Display.inline, margin_before=0, margin_a...
<class 'list'> 
   []
<class 'list'> 
   []
<class 'inscriptis.model.config.ParserConfig'> 
   <inscriptis.model.config.ParserConfig object at 0x7f4f4494dfd0>
<class 'method'> 
   <bound method Inscriptis._start_table of <inscriptis.html_engine.Inscriptis o...
<class 'method'> 
   <bound method Inscriptis._start_tr of <inscriptis.html_engine.Inscriptis obje...
<class 'method'> 
   <bound method Inscriptis._start_td of <inscriptis.html_engine.Inscriptis obje...
<class 'method'> 
   <bound method Inscriptis._start_td of <inscriptis.html_engine.Inscriptis obje...
<class 'method'> 
   <bound method Inscriptis._start_ul of <inscriptis.html_engine.Inscriptis obje...
<class 'method'> 
   <bound method Inscriptis._start_ol of <inscriptis.html_engine.Inscriptis obje...
<class 'method'> 
   <bound method Inscriptis._start_li of <inscriptis.html_engine.Inscriptis obje...
<class 'method'> 
   <bound method Inscriptis._newline of <inscriptis.html_engine.Inscriptis objec...
<class 'method'> 
   <bound method Inscriptis._end_table of <inscriptis.html_engine.Inscriptis obj...
<class 'method'> 
   <bound method Inscriptis._end_ul of <inscriptis.html_engine.Inscriptis object...
<class 'method'> 
   <bound method Inscriptis._end_ol of <inscriptis.html_engine.Inscriptis object...
<class 'method'> 
   <bound method Inscriptis._end_td of <inscriptis.html_engine.Inscriptis object...
<class 'method'> 
   <bound method Inscriptis._end_td of <inscriptis.html_engine.Inscriptis object...
<class 'inscriptis.model.attribute.Attribute'> 
   <inscriptis.model.attribute.Attribute object at 0x7f4f4489fd00>
<class 'dict'> 
   {'display_images': False, 'deduplicate_captions': False, 'display_links': Fal...
<class 'dict'> 
   {'attribute_mapping': {'style': <function CssParse.attr_style at 0x7f4f43d289...
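
A note on reading this output: gc.DEBUG_LEAK is the combination of gc.DEBUG_COLLECTABLE, gc.DEBUG_UNCOLLECTABLE and gc.DEBUG_SAVEALL. With DEBUG_SAVEALL set, every unreachable object the collector finds is appended to gc.garbage instead of being freed, so entries reported as "collectable" are objects the collector is able to free in a normal run. The small sketch below illustrates this with a deliberate reference cycle; Node and count_cycle_garbage are hypothetical names used only for this example.

```python
import gc


class Node:
    """Object that references itself, so only the cycle collector can free it."""

    def __init__(self):
        self.me = self  # deliberate reference cycle


def count_cycle_garbage():
    """Count Node instances that land in gc.garbage under DEBUG_SAVEALL."""
    gc.collect()          # free any real garbage first
    gc.garbage.clear()
    gc.set_debug(gc.DEBUG_SAVEALL)
    try:
        Node()            # becomes unreachable immediately
        gc.collect()      # saved into gc.garbage instead of being freed
        found = sum(isinstance(obj, Node) for obj in gc.garbage)
    finally:
        gc.set_debug(0)   # restore normal collection
        gc.garbage.clear()
    return found
```

The perfectly collectable Node still shows up in gc.garbage here, which suggests the listing above does not by itself prove that the inscriptis objects are leaking.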

@dgtlmoon
Author

Update: this looks like a Python lxml memory leak issue, see https://medium.com/devopss-hole/python-lxml-memory-leak-b8d0b1000dc7

@AlbertWeichselbraun
Contributor

AlbertWeichselbraun commented Aug 30, 2022

Thank you for reporting this. Although the issue is caused by lxml, it still looks like an interesting problem.

The following code releases the allocated memory (see https://www.mail-archive.com/lxml@python.org/msg00029.html), but it probably needs more investigation and testing to determine whether it works reliably and across systems.

import ctypes
def trim_memory() -> int:
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)
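
Since malloc_trim is a glibc extension, the snippet above fails on systems where libc.so.6 or that symbol is missing. A more defensive variant might look like the sketch below. It is only a sketch under that assumption: call_and_trim is a hypothetical helper, and the -1 return value for "not available" is a convention chosen here, not something glibc or inscriptis defines.

```python
import ctypes
import ctypes.util
import gc


def trim_memory() -> int:
    """Ask glibc to hand free heap pages back to the operating system.

    Returns malloc_trim's result (1 if memory was released, 0 if not),
    or -1 when malloc_trim is not available on this platform.
    """
    gc.collect()  # drop unreachable Python objects before trimming
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return -1
    try:
        malloc_trim = ctypes.CDLL(libc_name).malloc_trim
    except (OSError, AttributeError):
        # libc could not be loaded, or it is not glibc
        return -1
    malloc_trim.argtypes = [ctypes.c_size_t]
    malloc_trim.restype = ctypes.c_int
    return malloc_trim(0)


def call_and_trim(func, *args, **kwargs):
    """Run func and trim the heap afterwards, even if func raises."""
    try:
        return func(*args, **kwargs)
    finally:
        trim_memory()
```

In the Flask case described above, this could be used as text = call_and_trim(get_text, html), assuming get_text is imported from inscriptis. Whether the extra call is cheap enough to run on every request would still need measuring.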

@dgtlmoon
Author

@AlbertWeichselbraun OK, I was thinking of closing this issue and opening a PR with a small note for the README, but let's keep it open and see what we can find. It definitely looks more like a bug in lxml, like you say. Thanks!

AlbertWeichselbraun added a commit that referenced this issue Sep 20, 2022
AlbertWeichselbraun added a commit that referenced this issue Sep 20, 2022