Storing files across many collections causes high memory usage #3113

Closed
jbrady-4sight opened this issue May 30, 2022 · 7 comments

@jbrady-4sight

Describe the bug
Storing files across many collections uses many times more memory (in my testing, ~400x more) than storing files in one volume.

System Setup
I used these commands to insert 1MB files into new collections using the attached scripts:
weed master -volume.max=50000 -dir=./data -ip=127.0.0.1
curl http://127.0.0.1:9333/dir/assign?collection=<collection name>
curl -F file=@test.blob http://127.0.0.1:8080/<fid from assign>

weedtest.py.txt
weedtest_collections.py.txt
graph.py.txt
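
For reference, the collections test is essentially the following loop (a minimal Go sketch of what the attached Python script does; the fid and url fields come from the master's assign response, the ports are the defaults used above, and the collection naming is arbitrary):

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"mime/multipart"
	"net/http"
	"time"
)

// assignResult holds the fields of the master's /dir/assign JSON response that we use.
type assignResult struct {
	Fid string `json:"fid"`
	Url string `json:"url"`
}

func main() {
	payload := bytes.Repeat([]byte("x"), 1<<20) // 1 MB dummy file, like test.blob

	for i := 0; ; i++ {
		// Ask the master for a file id in a brand-new collection.
		resp, err := http.Get(fmt.Sprintf("http://127.0.0.1:9333/dir/assign?collection=col%d", i))
		if err != nil {
			panic(err)
		}
		var a assignResult
		if err := json.NewDecoder(resp.Body).Decode(&a); err != nil {
			panic(err)
		}
		resp.Body.Close()

		// Upload the file to the assigned volume server as multipart form data.
		var body bytes.Buffer
		w := multipart.NewWriter(&body)
		part, _ := w.CreateFormFile("file", "test.blob")
		part.Write(payload)
		w.Close()
		up, err := http.Post("http://"+a.Url+"/"+a.Fid, w.FormDataContentType(), &body)
		if err != nil {
			panic(err)
		}
		up.Body.Close()

		time.Sleep(time.Second) // 1 file per second, as in the graphs
	}
}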

The memory usage is graphed below (using the attached graph.py). In both cases, 1MB files are inserted into Seaweed at a rate of 1/s, so the time in seconds is roughly equal to the size of data stored on disk, in MB.
[Graphs: weed.exe memory usage over time for the "collections" test and the "single" (one-volume) test]

OS version:
win10 v10.0.19041.1566

Weed version:
version 30GB 3.06 2f846777bbceea307771e79d4452e071b0bd5a51 windows amd64

Expected behavior
The wiki page says that the in-memory file index uses ~20 bytes per file, but storing each file in its own collection uses ~20MB per file. I would expect slightly more memory usage from having to index the collections and associated volumes, but not 10^6 times more.

Additional context
Our use case for Seaweed is to store items in timestamped collections, so when the timestamps are sufficiently fine-grained (e.g. one collection per minute or per second) we end up with high memory usage while running our software.

The Python test scripts just create a minimal working example of the problem we are seeing: high memory usage of weed.exe that grows over time and persists when the .exe is restarted.

We have seen memory usage as high as 5.8GB in the wild, but from this testing it seems the memory usage of weed.exe becomes arbitrarily large as more collections are added.

@ddorian

ddorian commented May 30, 2022

How many collections are in the slow version compared to the number of files?

@chrislusf
Collaborator

Each collection maps to a set of volumes. Each volume will cost memory.
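
One way to see this with the repro above is to check the master's status after a few collections have been created; the reported volume count grows with every new collection (endpoint as in the master HTTP API):

curl http://127.0.0.1:9333/dir/status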

@jbrady-4sight
Author

@ddorian In the test with high memory usage I'm inserting one file per collection, so in my graph the time in seconds is roughly equal to the number of collections, which is roughly equal to the number of MB stored on disk.

Admittedly, using one file per collection is an extreme example, but it does show how the memory usage gets very large for what seems like not that many collections (>10GB for <5000 collections).

@chrislusf By default each collection maps to 7 volumes, correct? How much memory would you expect each additional volume to use? Because in my testing above it seems each additional volume uses ~3 MB of memory (roughly 20 MB per collection from the graph above, divided across 7 volumes ≈ 3 MB per volume).

This seems like a lot, given that the Optimisation section of the wiki says the in-memory file index uses 20 bytes of memory per file and advises using the leveldb index to cut memory consumption. However, indexing one volume uses the same memory as indexing ~150,000 files, and there is seemingly no way to work around this. Is there a setting similar to "-index=leveldb" that will index volumes and collections on disk, as well as files?
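
(For reference, the leveldb index mentioned in the wiki is a per-volume-server setting, enabled with something along these lines; flag names as documented in the wiki:

weed volume -dir=./data -mserver=127.0.0.1:9333 -index=leveldb
)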

@ddorian

ddorian commented May 31, 2022

@jbrady-4sight the fix is to not use that many volumes; it's an anti-pattern. Is there a reason why you need so many?

@chrislusf
Collaborator

The memory is consumed in batches of 100000 entries per volume:

https://github.com/chrislusf/seaweedfs/blob/24e11d1e90c2bf6ef512fbf787490caeb59348de/weed/storage/needle_map/compact_map.go#L11

To "save" memory, you can reduce it to 1000.

@jbrady-4sight
Author

@chrislusf What exactly does the "batch" variable represent? I'm guessing that it is the number of entries reserved in the file index for a new volume? I've done some more testing after recompiling Seaweed with batch=1000, and this change reduces memory usage by ~90%.

However, your reply suggests that this isn't a "proper" fix. Are there side effects to this change that I should be aware of? e.g. if you recompile with batch=1000, and then insert much more than 1000 items into that volume, is insert performance impacted due to having to reallocate memory for the larger file index?

If this is the case, then it would be helpful to have the batch size configurable in a properties file. This way you could make a tradeoff between reduced memory usage (if you are storing few files per volume) and insert performance (if you are storing many files per volume).

@acejam

acejam commented Jun 14, 2022


I have observed the same - nearly 90% reduction in volume server memory when setting

const (
	batch = 10000
)
