Update values (append) #14
Hi @pakipaki2000, and thanks for yet another interesting question. I have also thought about the ability to enlarge values, e.g. by appending bytes. I haven't tackled it yet because there are nitty-gritty details to consider and I haven't needed this functionality yet. Things to consider:
So as you can see, it's not trivial to implement an append operation that doesn't have potentially serious performance ramifications. Can you tell me a bit more about the lifetime of the larger values you anticipate? Will they start off smaller? How often will they grow in size? How long will they stick around? How is their size limited? Will there be lots of different value sizes?

What do you think about the following idea / architecture (just thinking out loud)? You would use a 1st SHF instance to store the key,value pairs, and a 2nd SHF instance to create large sections of shared memory which effectively become arrays of n bytes. Perhaps you can even use the SHF queue API [1] [2]? The arrays would have e.g. 1,024 bytes per array element, so a 100KB value would be e.g. 100 array elements (of 1KB each) in a singly linked list.

This means that to read a 100KB value you would first look up the key in the 1st SHF instance, and get the value, which is the index of the first array element of the giant value in the 2nd SHF instance, used for the first 1,024 bytes of the ~100KB value. To read the entire 100KB you would work along the singly linked list, reading the 100 array elements in order.

How would this work? The 1st SHF instance would have a key,value pair where the value is an index into the array of elements in shared memory in the 2nd SHF instance. When you want to append to the value, only the array elements in the 2nd SHF instance would get updated. Once the value is bigger than e.g. 1,024 bytes -- the size of an array element -- then the last 4 bytes would contain the index of a subsequent array element. Thus the value can grow / enlarge.

The advantages of this scheme are as follows: (a) The 1st SHF instance does not have the performance disadvantages listed above, because the value is effectively stored outside of the 1st SHF instance. (b) The 2nd SHF instance can be a fixed size in memory.
Once the array of elements is created as a key and giant value, no more key,values have to be created. This means each process sharing the 2nd SHF instance can use its own fixed address for the base of the array of elements; if the 2nd SHF instance gets no new key,values then you don't have to worry about the addresses of keys and values changing.

The disadvantages of this scheme are as follows: (a) You 'waste' memory, namely the difference between the value size and the array element size. But presumably performance trumps a bit of wasted memory? (b) To append to a value you must traverse the n array elements, which is an O(n) operation. However, if you also store the index of the last array element somewhere, then you can skip straight to the end without traversing the n array elements.

What do you think? Do you follow my train of thought?

[1] Line 401 in 56d66a4
[2] Line 130 in 56d66a4
I apologize for the delay in answering. Additionally, if I required extreme performance, I guess there could be an additional "cache" system. I saw in your library that you seem to be using MurmurHash3 for hashing; I was wondering if I could DIRECTLY use my hashes (which are ALWAYS unique) as the KEY source. Your scheme is the best and I'm implementing it as we speak.

PS: I have about 5 million hashes, and I need to retrieve them (all) at least every second.
Sounds good, @pakipaki2000. Please let me know how you get on.
Yes, you can do this. Because the SHA256 is unique, you can use it as the key. Instead of calling shf_make_hash() to make the hash for the key using MurmurHash3, you'd call something like pp2k_make_hash(), which would set the thread-local variables shf_hash, shf_hash_key, and shf_hash_key_len. shf_hash would contain e.g. the first 16 bytes of the SHA256 hash that you already have. I added examples to the tests for using SHA256() (as an example hash function) instead of shf_make_hash() on regular keys [1], and for using SHA256() instead of shf_make_hash() for both the (binary) key and the hash [2].
If you have a look at functions like shf_tab_validate() or shf_tab_part(), they iterate over all the keys in a particular 'tab'. It wouldn't be much more code to iterate over all the keys in all tabs. If your use case is something like:

1. Add keys for 1 second.
2. Iterate over the keys added in the last second, while adding new keys for this second.

...then you can do this using two SHF instances. Once the one second is over, have all your processes start putting the new keys in the next SHF instance. Meanwhile the SHF instance for the last second can be iterated over, processed, and decommissioned. Assuming that the decommissioning happens faster than the new keys are added, you'll get away with not needing 'double the memory': once for the old SHF and once for the new SHF. Obviously, during the iteration phase that SHF instance would no longer be readable or writable via the usual API. Also, the read via iteration would be faster than the API get interface because (a) no spin locks would be used, and (b) the iteration would read the keys sequentially in memory instead of using random access, so the CPU read-ahead cache would be an advantage.

Also, you mentioned earlier that you have a 64 core box at your disposal. Whether using shf_make_hash() or your SHA256 keys, I would suggest considering an architecture in which you can shard SHF instances if necessary, and deploy with 1 SHF instance or with n SHF instances. Internally an SHF instance has 256 shards already. This means that if you have e.g. a 64 core box then the ratio of shards to cores is 4:1. If you used one of the shf_hash nibbles to shard the SHA256 keys between 16 SHF instances, then you'd have a 16*4:1 or 64:1 shards-to-cores ratio... which is potentially better for performance because there's likely less spin lock contention.

If you design your code with SHF instance sharding in mind, then it'll be easier to find out later where the performance sweet spot is... :-)

[1] sharedhashfile/src/test.9.shf.c Line 164 in 3c1503d
[2] sharedhashfile/src/test.9.shf.c Line 212 in 3c1503d
Amazing answer. So helpful! I'll also take your advice regarding the optimization and try to find the sweet spot.
One question: what happens when you have two SHF instances bound (with only one shf_init()) and you retrieve a value via shf_val?
First of all, shf_init() is designed to be called once per process, regardless of how many SHF instances are bound, and there is a fail-safe mechanism [1] to ensure that it is never called multiple times in the same process.
shf_val [2] is a pointer to memory allocated using mmap() (during shf_init()), and the reason for that is so that it can be expanded at any time to accommodate any size of value being copied into it. shf_val is also declared using thread-local storage [3], which goes back to one of your previous questions about using SharedHashFile from multiple threads in a single process [4].

Currently SHF is designed to be used from only a single thread per process, but multiple processes can be used. If a particular process is attached to two or more SHF databases, then currently only sequential access to those databases is possible from that process, because there is only a single thread. For example, you use shf_get_key_val_copy() on the 1st SHF database instance, and this causes the value to be atomically copied into shf_val. Next you use shf_get_key_val_copy() on the 2nd SHF database instance, and this causes the value to be atomically copied into shf_val, overwriting the previous value. So currently, if you want to use both the 1st value and the 2nd value, you need to make a copy of the first value before calling shf_get_key_val_copy() again.

I might consider changing this functionality in the future. Perhaps shf_val could be made SHF-instance specific?

[1] Line 193 in 3a8c4ac
[2] Line 62 in 3a8c4ac
[3] https://gcc.gnu.org/onlinedocs/gcc-3.3/gcc/Thread-Local.html
[4] #12 (comment)
Hi,
Me again :)
I have a key:value pair, and I'd like to update the value by appending a string at the end.
Example, key: HELLO
Value: TEST-
I'd like to update my key HELLO by appending TEST2, so HELLO's value becomes TEST-TEST2.
How can I do it without having to retrieve the entire string? I'd like to concatenate at the end. I don't mind it always reallocating.
PS: My values can be quite long, 100KB+.
Would be very useful for me.
Thanks! :)