Storing files across many collections causes high memory usage #3113

Closed
jbrady-4sight opened this issue May 30, 2022 · 7 comments

@jbrady-4sight

Describe the bug
Storing files across many collections uses many times more memory (in my testing, ~400x more) than storing files in one volume.

System Setup
I used these commands to insert 1MB files into new collections using the attached scripts:
weed master -volume.max=50000 -dir=./data -ip=127.0.0.1
curl http://127.0.0.1:9333/dir/assign?collection=<collection name>
curl -F file=@test.blob http://127.0.0.1:8080/<fid from assign>

weedtest.py.txt
weedtest_collections.py.txt
graph.py.txt
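
For reference, the collections test is essentially the following loop (a minimal Go sketch of what the attached Python script does; the fid and url fields come from the master's assign response, the ports are the defaults used above, and the collection naming is arbitrary):

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"mime/multipart"
	"net/http"
	"time"
)

// assignResult holds the fields of the master's /dir/assign JSON response that we use.
type assignResult struct {
	Fid string `json:"fid"`
	Url string `json:"url"`
}

func main() {
	payload := bytes.Repeat([]byte("x"), 1<<20) // 1 MB dummy file, like test.blob

	for i := 0; ; i++ {
		// Ask the master for a file id in a brand-new collection.
		resp, err := http.Get(fmt.Sprintf("http://127.0.0.1:9333/dir/assign?collection=col%d", i))
		if err != nil {
			panic(err)
		}
		var a assignResult
		if err := json.NewDecoder(resp.Body).Decode(&a); err != nil {
			panic(err)
		}
		resp.Body.Close()

		// Upload the file to the assigned volume server as multipart form data.
		var body bytes.Buffer
		w := multipart.NewWriter(&body)
		part, _ := w.CreateFormFile("file", "test.blob")
		part.Write(payload)
		w.Close()
		up, err := http.Post("http://"+a.Url+"/"+a.Fid, w.FormDataContentType(), &body)
		if err != nil {
			panic(err)
		}
		up.Body.Close()

		time.Sleep(time.Second) // 1 file per second, as in the graphs
	}
}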

The memory usage is graphed below (using the attached graph.py). In both cases, 1MB files are inserted into Seaweed at a rate of 1/s, so the time in seconds is roughly equal to the size of data stored on disk, in MB.
[Graphs: weed.exe memory usage over time for the "collections" test and the "single" (one-volume) test]

OS version:
win10 v10.0.19041.1566

Weed version:
version 30GB 3.06 2f846777bbceea307771e79d4452e071b0bd5a51 windows amd64

Expected behavior
The wiki page says that the in-memory file index uses ~20 bytes per file, but storing each file in its own collection uses ~20MB per file. I would expect slightly more memory usage from having to index the collections and associated volumes, but not 10^6 times more.

Additional context
Our use case for Seaweed is to store items in timestamped collections, so when the timestamps are sufficiently fine-grained (e.g. one collection per minute or per second) we end up with high memory usage while running our software.

The Python test scripts just create a minimal working example of the problem we are seeing: high memory usage of weed.exe that grows over time and persists when the .exe is restarted.

We have seen memory usage as high as 5.8GB in the wild, but from this testing it seems the memory usage of weed.exe becomes arbitrarily large as more collections are added.

@ddorian

ddorian commented May 30, 2022

How many collections are in the slow version compared to the number of files?

@chrislusf
Collaborator

Each collection maps to a set of volumes. Each volume will cost memory.
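
One way to see this with the repro above is to check the master's status after a few collections have been created; the reported volume count grows with every new collection (endpoint as in the master HTTP API):

curl http://127.0.0.1:9333/dir/status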

@jbrady-4sight
Author

@ddorian In the test with high memory usage I'm inserting one file per collection, so in my graph the time in seconds is roughly equal to the number of collections, which is roughly equal to the number of MB stored on disk.

Admittedly, using one file per collection is an extreme example, but it does show how the memory usage gets very large for what seems like not that many collections (>10GB for <5000 collections).

@chrislusf By default each collection maps to 7 volumes, correct? How much memory would you expect each additional volume to use? Because in my testing above it seems each additional volume uses ~3 MB of memory (roughly 20 MB per collection from the graph above, divided across 7 volumes ≈ 3 MB per volume).

This seems like a lot, given that the Optimisation section of the wiki says the in-memory file index uses 20 bytes of memory per file and advises using the leveldb index to cut memory consumption. However, indexing one volume uses the same memory as indexing ~150,000 files, and there is seemingly no way to work around this. Is there a setting similar to "-index=leveldb" that will index volumes and collections on disk, as well as files?
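
(For reference, the leveldb index mentioned in the wiki is a per-volume-server setting, enabled with something along these lines; flag names as documented in the wiki:

weed volume -dir=./data -mserver=127.0.0.1:9333 -index=leveldb
)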

@ddorian

ddorian commented May 31, 2022

@jbrady-4sight the fix is to not use that many volumes; it's an anti-pattern. Is there a reason why you need so many?

@chrislusf
Collaborator

The memory is consumed in batches of 100000 entries per volume:

https://github.com/chrislusf/seaweedfs/blob/24e11d1e90c2bf6ef512fbf787490caeb59348de/weed/storage/needle_map/compact_map.go#L11

To "save" memory, you can reduce it to 1000.

@jbrady-4sight
Author

@chrislusf What exactly does the "batch" variable represent? I'm guessing that it is the number of entries reserved in the file index for a new volume? I've done some more testing after recompiling Seaweed with batch=1000, and this change reduces memory usage by ~90%.

However, your reply suggests that this isn't a "proper" fix. Are there side effects to this change that I should be aware of? e.g. if you recompile with batch=1000, and then insert much more than 1000 items into that volume, is insert performance impacted due to having to reallocate memory for the larger file index?

If this is the case, then it would be helpful to have the batch size configurable in a properties file. This way you could make a tradeoff between reduced memory usage (if you are storing few files per volume) and insert performance (if you are storing many files per volume).

@acejam

acejam commented Jun 14, 2022


I have observed the same - nearly 90% reduction in volume server memory when setting

const (
	batch = 10000
)
