Streaming Data Memory Growth #285
I really need to fix this issue. Thanks.
I think this patch from a few weeks ago might have fixed this. Unfortunately the patch wasn't pushed to master --- I've only just merged it now. The fix is 141639e. Could you test this by
Sorry this has taken a while. I'll test again today or tomorrow and get back to you.
I performed the same tests again, both installing from the zip you posted above and installing directly from master (commit 9cd21ad), and got nearly identical results to my previous trials. If you believe that this is the cause of some kind of memory leak, I think we should really take a look at my testing script and update it, as it's very rudimentary and I'm far from an expert profiler.

However, I don't think that this is a leak. As I said in my original post, I think that this is just part of how spacy works. When parsing things like social media, where there are many tokens that occur only once (e.g. links), storing them in the StringStore causes memory bloat.

In your comments on #172 you proposed a batch-processing generator that uses, and then throws away, a tokenizer object for each batch in order to help find OOV tokens. I think that's a fine approach, and it could even be made a bit faster by asynchronously loading the new English() instance and replacing the old one when the new one is ready, but that still leads to quite a slowdown.

If your feeling is that spacy is really meant for batch processing and that I should use mini-batches if I want to approximate streaming, I can do that. Spacy is still far superior to anything else out there in my opinion, but it would be nice if I could use it with the expectation of roughly constant space complexity.

-- Eric
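The throwaway-pipeline generator mentioned above could be sketched roughly like this. This is illustrative only, not spaCy code: `make_nlp` is a hypothetical factory argument (e.g. `lambda: spacy.load('en')`), and the point is just that each batch gets a fresh pipeline whose interned strings are released when it goes out of scope.

```python
from itertools import islice


def pipe_in_batches(texts, make_nlp, batch_size=10000):
    """Process a stream in fixed-size batches, rebuilding the pipeline for
    each batch so strings interned from rare tokens can be freed.

    make_nlp: any zero-argument callable returning a fresh pipeline
    (hypothetical; e.g. lambda: spacy.load('en')).
    """
    texts = iter(texts)
    while True:
        batch = list(islice(texts, batch_size))
        if not batch:
            break
        nlp = make_nlp()      # fresh pipeline -> fresh string store
        for text in batch:
            yield doc_from(nlp, text)


def doc_from(nlp, text):
    # Trivial wrapper so the sketch is easy to test with a dummy "nlp".
    return nlp(text)
```

Note that in real spaCy, the returned docs hold a reference to their vocab, so memory is only actually reclaimed once the caller releases all docs from a batch.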
To clarify a little bit, the current release version has three known places that could be growing in memory use.
The patch I asked you to try out addresses 3. We can also easily address 2. Addressing 1 is hard, because we currently intern all the strings, which is a much easier policy to implement than something more subtle. Can you report the lengths of the
Hi @honnibal, why do you think addressing 1 is so hard? What about a FIFO queue or something similar, like @ELind77 suggested:
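A bounded FIFO intern table of the kind being suggested could look roughly like this. This is purely an illustration, not spaCy's API; the class and method names are invented.

```python
from collections import OrderedDict


class FIFOStringStore:
    """Toy string-to-int table that evicts the oldest entries once
    max_size is reached (first in, first out)."""

    def __init__(self, max_size=100000):
        self.max_size = max_size
        self._by_string = OrderedDict()  # string -> id, insertion-ordered
        self._by_id = {}                 # id -> string (inverse mapping)
        self._next_id = 0

    def intern(self, string):
        if string in self._by_string:
            return self._by_string[string]
        if len(self._by_string) >= self.max_size:
            # Evict the oldest entry to keep memory bounded.
            old_string, old_id = self._by_string.popitem(last=False)
            del self._by_id[old_id]
        i = self._next_id
        self._next_id += 1
        self._by_string[string] = i
        self._by_id[i] = string
        return i

    def __getitem__(self, i):
        return self._by_id[i]
```

The obvious downside, discussed below, is that evicted ids can no longer be decoded, which is exactly the consistency problem the maintainers raise.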
Do you have any more information on this issue since it cropped up a few months ago? I notice the same type of memory issue on my systems that analyze streaming Twitter data. Note that I haven't narrowed it down to spacy yet, but my first cursory look found this ticket to be the most relevant possibility.
Also curious whether this issue has been solved already; I will test updating my version (currently 0.100.6) to see if that helps at all.
Hi @honnibal, I have had similar issues in my streaming application. Basically memory grows at a logarithmic-ish pace. We have to deal with it as though it were a memory leak and periodically re-initialize the code.

I ran the benchmark you requested above, collecting metrics on the length of the

Here is the code I used to create the metrics. It basically ran until I ran out of memory on a 4G box: https://github.com/natb1/spaCy/blob/memory-benchmark/spacy/tests/benchmark/test_memory.py

I'd be glad to help implement some strategies to address this problem if you could help me isolate the issue and/or suggest some approaches.
Same problem here. Would also be glad to help.
pinging @henningpeters given recent announcement on spaCy homepage
To clarify the current behaviour a little, I'll start from the beginning: why intern the strings? Two main reasons:
We can't do without 1 entirely — it's too fundamental to how spaCy works, and we definitely don't want to be making lots of string comparisons. Comparing by integer value is pretty important. Consideration 2 is very useful, but it's only really a saving if strings occur multiple times. Certainly, for strings that occur once, there's no advantage. And it's also bad to have unbounded memory use on the streaming process.

So the solution we want to get to is one where a limited number of somewhat common strings are interned in the common vocab. However, we still need to map all strings, even rare ones, to integers. We also want the string-to-int table to be consistent, even for rare strings.

Here's the bit of code where the memory growth is occurring: https://github.com/spacy-io/spaCy/blob/master/spacy/strings.pyx#L147

The purpose here is to resolve a string to an integer. We first hash the string. Now, this is an integer representation --- so why not just use the hash? The problem is we also want the inverse mapping. We therefore store the string, causing the memory growth.

If we insist that all integers can always be mapped back to strings, there's no solution. We have to accept the memory growth. But if we can accept that these strings pass out of date, so that they're around for a while and then they're not, the situation should be manageable.

A simple way to achieve this is to extend the

A slightly more tricky solution is to do some reference counting. The idea here would be for the

I think for both solutions, we should use the hash of the string as the integer representation for OOV strings. This means that at least the string-to-int mapping will stay consistent, even if strings are passing out of memory. The only way to have a problem here is if you hold onto the integer representation, release all of the documents, and later want to recover the string. In this situation, you'll be out of luck --- but we'll at least know to use an OOV symbol when you try to look up your string.
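As a toy illustration of the two-tier scheme described above: interned vocab strings get small stable ids, while OOV strings are encoded by their hash with the inverse mapping held in a flushable side table. All names here are hypothetical, and a real implementation would use a deterministic hash such as MurmurHash rather than Python's per-run `hash`.

```python
class TwoTierStrings:
    """Sketch only: interned vocab + flushable hash-keyed OOV table."""

    def __init__(self, vocab=()):
        self._interned = {s: i for i, s in enumerate(vocab)}
        self._reverse = {i: s for s, i in self._interned.items()}
        self._oov = {}  # hash -> string; can be flushed to free memory

    def as_int(self, string):
        if string in self._interned:
            return self._interned[string]
        key = hash(string)        # stable within a run; real code would use
        self._oov[key] = string   # a deterministic hash like MurmurHash
        return key

    def as_string(self, i):
        if i in self._reverse:
            return self._reverse[i]
        # Flushed ids decode to an OOV symbol rather than crashing.
        return self._oov.get(i, "<OOV>")

    def flush_oov(self):
        self._oov.clear()
```

Note the property the comment above insists on: the string-to-int direction stays consistent across flushes (the hash doesn't change), but the int-to-string direction degrades to an OOV symbol once the side table is cleared.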
Simply hashing oov tokens sounds good enough to me. As long as we know the
The input text is currently not saved or represented on the document. Instead, we guarantee that the
Yeah I understand. The point I was making was that, since the caller of the
I think we're in agreement here. However, I think it's important that we either assume the string is unavailable, or save it ourselves. Saving the string on the document isn't a huge waste of memory, and it only impacts the API in a few places.

Here's a design that achieves something like the reference counting:
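The design list itself didn't survive above, so purely as an illustration of what reference counting over interned strings could look like (all names invented, not spaCy code): each Doc increments the count for every string it uses and decrements on deallocation, and a string is freed when its count hits zero.

```python
class RefCountedStrings:
    """Toy reference-counted string table (illustrative only)."""

    def __init__(self):
        self._strings = {}  # string -> id
        self._reverse = {}  # id -> string
        self._counts = {}   # id -> number of live Docs using it
        self._next_id = 0

    def acquire(self, string):
        """Called when a Doc starts using `string`."""
        if string not in self._strings:
            self._strings[string] = self._next_id
            self._reverse[self._next_id] = string
            self._next_id += 1
        i = self._strings[string]
        self._counts[i] = self._counts.get(i, 0) + 1
        return i

    def release(self, i):
        """Called when a Doc holding id `i` is deallocated."""
        self._counts[i] -= 1
        if self._counts[i] == 0:
            # Last Doc using this string is gone: free both mappings.
            del self._counts[i]
            string = self._reverse.pop(i)
            del self._strings[string]
```

The awkward part, as the thread goes on to discuss, is that anything holding raw integer ids outside a Doc (e.g. a numpy array) is invisible to the counter.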
This sounds great! Although, probably due to lack of context and familiarity with the code base, I personally would still prefer some simpler approach that can keep the
Well, I think you could say "the trap is set": the existing design is such that the strings have to be globally available. Recall that we're allowing transport to/from numpy arrays. This means we're expecting to be able to unpack an array of ints and understand some of them as strings, without ties to a particular

We could hack through this by writing down the OOV strings in the global store only when we pack into an array. But I hope we can all agree that this is just digging ourselves a deeper hole. I would be very unhappy if I tried to pack an array myself in the obvious way, and found that the library's version of this was quietly writing to global state, and that without this write my method failed, but only on OOV words, so not on my test data!
Yeah this sounds terrifying.
I might be totally wrong, but I expect the feature of converting to/from numpy to only be used internally? The array doesn't seem to work across different

I guess what I was proposing entails always including the original text as part of (de)serialization. This might be too much refactoring work, in which case what you mentioned also sounds great :)
Implemented 🎉 Need to update other modules to reflect the change, and do testing.
…ddress streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good.
…ls, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good." This reverts commit 8423e86.
Hmm. I don't want to rush this, because it touches a lot of files, but I also don't want to block the v1.0.0 release, which is otherwise ready. So unfortunately I have to move this out of the milestone. I'll probably get back to it week after next.
… case strings will be pushed into an OOV map. We can then flush this OOV map, freeing all of the OOV strings.
New plan — let's at least get a good workaround in place, where the user will do some manual management of when the strings will be freed. This should be enough to keep you all productive, while we try to plan out a prettier, 'automagical' solution/wrapper around this. The freeze/flush behaviour is off by default, so it shouldn't disrupt anyone. @tomtung — I think this is the sort of solution you were looking for, since this makes it a bit easier to control things manually. Summary:
Example (untested):

```python
nlp = spacy.load('en')
nlp.vocab.strings.set_frozen(True)
for doc in nlp.pipe(texts, batch_size=5000, n_threads=2):
    do_my_stuff(doc)
nlp.vocab.strings.flush_oov()
```
The OOV strings are encoded using the hash of the byte string. This means that you'll get consistent integer encodings between flushings. However, if you're holding an integer ID for an OOV string, and you flush the OOVs and try to decode the integer, you'll get an

I've pushed the solution to the branch
If one serializes a Doc with an OOV word, the above is bound to happen. Since serialization is the only way to reuse parsing results in a data pipeline, and most real-world docs have OOV words, this problem is pretty critical. When serializing a Doc (with
I'd be happy to try to take a crack at it, but things like the use of

Perhaps offer a way for self-contained serialization that doesn't depend on any vocab at all? (Or only depends on a small set of symbols that are future-proof.)
The serialiser backs off to a character codec for OOV words.
Incidentally I have regrets about the serialiser. I think I got carried away... I don't even remember how much bigger a
I'm so glad that this has received so much thought and attention! @honnibal could we get an update on the current status of this and your thoughts on how best to proceed? Your suggestion of splitting the string store seems most in line with my thoughts on this. If that is still the way you are thinking of going, and you're thinking of using multiple string stores for OOV words, I'd also just like to put out there that it might be a good idea to use some kind of data structure other than an array for storing them, especially if there are a lot of them. If the integer ids of the multiple StringStores are guaranteed to never overlap, a BST might be a good candidate; if they can overlap, you might need something a bit different, like a UnionFind. -- Eric
The issue in #589 still exists, so the workaround doesn't really work. This is one of the blocking issues for us now. Will a more stable fix be available in the next 1.x releases? Thanks a lot for the work!
Hey, I just took a look at the StringStore class in master and saw that some work has been done on this. I still need to play with it a bit to see how it works, but this looks really great. https://github.com/explosion/spaCy/commits/master/spacy/strings.pyx Thank you so much @honnibal ! -- Eric
Hi @honnibal, first of all, thank you for this great tool; we use it as part of the NLP in our product. However, ours is a very high-load system with streaming data (hundreds of thousands of emails per day), and we are experiencing the same problem discussed here: growth of the StringStore causes tremendous memory growth over time, so it really blocks usage of spaCy without fear of crashing the whole system because of OOM. The only workaround we came up with is to reload the nlp object every N processed content items and force the garbage collector to free the deleted object's memory. However, it doesn't always seem to work; sometimes it frees all the memory, and sometimes not. So my questions are as follows:

Thanks in advance.
@azar923 Did you try the

The situation around this is much improved in spaCy 2, because the string-to-integer mapping no longer depends on the
@honnibal thank you for the quick answer,
The restoration idea would look like this:

```python
backup_strings_data = nlp.vocab.strings.to_bytes()
backup_strings = StringStore().from_bytes(backup_strings_data)
for i, doc in enumerate(nlp.pipe(texts)):
    yield doc
    for word in doc:
        backup_strings.add(word.text)
    if i % 1000 == 999:
        # Swap in the backup store, then start a fresh one from the
        # original serialized vocab strings.
        nlp.vocab.strings = backup_strings
        backup_strings = StringStore().from_bytes(backup_strings_data)
```

This would ensure that strings stay available from only the last 1000 documents. It works by keeping two copies of the
@honnibal Great, thanks very much for these improvements. Will look forward to the 2.0 release to try. For now the workaround of reloading / collecting the nlp object works quite ok in production.
@honnibal I'm also facing the same issue (spacy 1.5.0). I know this is hackish, but could resetting the _map and setting the size to 0, or resetting the StringStore itself after a certain critical size is reached, cause any problems? Currently using spacy to get the POS tags (from sentence subtrees etc.).
I am experimenting with this workaround with the 1.x version. So far it is working well.

```python
from concurrent.futures import ThreadPoolExecutor

from spacy.en import English  # spaCy 1.x import path


class RestartingEnglish:
    """Wraps English() and transparently rebuilds it every RESTART_CALLS
    calls, loading the replacement in a background thread so processing
    isn't blocked while the new pipeline loads."""

    RESTART_CALLS = 1000000

    def __init__(self, *args, **kwargs):
        self.nr_calls = 0
        self._init_args = args
        self._init_kwargs = kwargs
        self.nlp = self._init_nlp()
        self.fut_nlp = None
        self.exec = ThreadPoolExecutor(max_workers=1)

    def _init_nlp(self):
        return English(*self._init_args, **self._init_kwargs)

    def _restart_nlp(self):
        self.nr_calls += 1
        if self.nr_calls >= self.RESTART_CALLS:
            if self.fut_nlp is None:
                print("Getting new NLP", self.nr_calls)
                self.fut_nlp = self.exec.submit(self._init_nlp)
            elif self.fut_nlp.done():
                print("Got new NLP", self.nr_calls)
                self.nlp = self.fut_nlp.result()
                self.fut_nlp = None
                self.nr_calls = 0

    def __call__(self, *args, **kwargs):
        doc = self.nlp(*args, **kwargs)
        self._restart_nlp()
        return doc
```
Fixed! (!!!!) Please see #1424
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
I have been using spacy for streaming data (twitter and news stories mostly) and I believe that the fundamental design of the vocab/StringStore in spacy is problematic for stream processing. When used for batch jobs, the additional memory overhead of storing a new lexeme struct for each new word form encountered in parsing is negligible compared to the speed gains, and because most text conforms to the assumption that vocabulary size grows logarithmically as the total number of tokens grows linearly, this is usually a safe bet. But for streaming text, especially for social media where new terms are invented by the minute (hashtags and URLs in particular), this assumption no longer holds, and the spacy vocabulary storage represents a dynamic element in what should be a completely static production deployment.
In order to test this assumption, I took one million tweets and performed a rudimentary analysis, using the resource module in python to get the maximum memory used by the program at regular intervals during processing. I first did some minor preprocessing to remove newlines from the data so that it could be read line by line rather than all kept in memory, then I ran spacy with all models set to false, with only the tokenizer loaded. I then did the same thing again after removing all URLs, hashtags, and twitter mentions from the data, and then filtering out all empty strings (this resulted in a 1.4% data loss in terms of total tweets processed, but that's fairly minor).
The final result was that spacy used an additional 278.6 MB after tokenizing the raw tweets, and an additional 60.99 MB when tokenizing the pre-processed tweets. This result confirms my hypothesis, but also shows that the memory increase really isn't all that significant (especially at the relatively low volume that I am currently processing). But it still points to a potential flaw in the design of the library.
My suggestion/request in the near term would be to have an option to make the vocabulary read-only, so that users who want to leave spacy alone to do streaming data processing don't need to worry about changing memory requirements. In the long term, I think an optimal solution would be to add some functionality for a timeout on vocabulary entries that aren't loaded at initialization: e.g. if a lexeme hasn't been accessed in the last n seconds, delete it from the StringStore, with n user-configurable.
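The timeout idea could be prototyped with a last-access table and a periodic sweep. This is an illustrative sketch only: `ttl`, `touch`, and `sweep` are invented names, and a clock is injected so the eviction logic can be tested with fake time.

```python
import time


class ExpiringStrings:
    """Toy last-access tracker: entries not touched within `ttl` seconds
    are dropped on the next sweep (not spaCy's API)."""

    def __init__(self, ttl=300.0, clock=time.monotonic):
        self.ttl = ttl
        self._clock = clock
        self._last_used = {}  # string -> last access time

    def touch(self, string):
        """Record an access, creating the entry if needed."""
        self._last_used[string] = self._clock()

    def sweep(self):
        """Delete entries idle longer than ttl; return what was evicted."""
        now = self._clock()
        stale = [s for s, t in self._last_used.items() if now - t > self.ttl]
        for s in stale:
            del self._last_used[s]
        return stale
```

A real implementation would also have to pin the vocabulary loaded at initialization so only streamed-in entries (hashtags, URLs) are ever eligible for eviction.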
My code and results are available here: https://github.com/ELind77/spacy_memory_growth
Thanks again for continuing to develop such a great library!
-- Eric