Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WriteTo() variant that exports only the bitset #46

Closed
maciej opened this issue Jun 4, 2018 · 9 comments
Closed

WriteTo() variant that exports only the bitset #46

maciej opened this issue Jun 4, 2018 · 9 comments

Comments

@maciej
Copy link

maciej commented Jun 4, 2018

Currently there is no elegant solution to export only the bloom filter bitset. That makes it difficult to engineer custom serialization formats.

I'm eager to submit a PR to change that if we could agree on the design.

Of the top of my head we could do one of those things:

  • expose BloomFilter.b – the simplest solution. However will allow developers to shoot themselves in the foot.
  • add a function that would return a copy of BloomFilter.b. *BloomFilter.BisetCopy?
  • add a *BloomFilter.WriteBitSetTo() function?
  • and my favourite: add a *BloomFilter.BitSetWriter() function that would return a
type BitSetWriter interface {
  WriterTo
}
@db7
Copy link
Contributor

db7 commented Jun 4, 2018

Is From() not sufficient? https://godoc.org/github.com/willf/bloom#From. We have been using that to serialize the BF in protobuf messages.

@maciej
Copy link
Author

maciej commented Jun 4, 2018

@db7 as I understand the GoDoc func From(data []uint64, k uint) *BloomFilter creates a BloomFilter from a blob of data ([]uint64 here). I'd need the opposite – a ToBitSet(*bloom.BloomFilter) []byte (or []uint64 as the return type) function.

@db7
Copy link
Contributor

db7 commented Jun 4, 2018

Yes, you're right. But you can simply make a slice and pass it to the From function. A uint64 slice will be anyway initialized to 0's, so it's safe to use that in the new BF. You'll just need to calculate the slice length by dividing m (number of bits) by 64.

So this...

data := make([]uint64, m / 64) // perhaps you'd like to ceil the division.
bf := bloom.From(data, k)

should be equivalent to this...

bf := bloom.New(m, k)

@maciej
Copy link
Author

maciej commented Jun 4, 2018

@db7 you're again referring to deserialisation or am I missing something?

@db7
Copy link
Contributor

db7 commented Jun 4, 2018

@maciej From does not copy the data, it simply uses the given slice. So if you modify the BF (eg, adding items), the given slice is modified.

This is how we work/serialize BFs.

data := make([]uint64, m / 64)
// store reference to data slice somewhere

// whenever we need to update or check the BF, we create a BF object with the slice.
bf := bloom.From(data, k)
bf.AddString("whatever")
bf.TestString("whatever")
...

// whevener we need to serialize the BF, use use the data slice.
buf, err := serialize(data)

To make it a bit cleaner, one could create a wrapper for the BF+data slice. It won't take more space in memory since the data slice is not copied.

type SerializableBF struct {
   *bloom.BloomFilter
   data []uint64
}

func NewSerializableBF(m int, k hashes) *SerializableBF {
  data := make([]uint64, m/64)
  return &SerializableBF{bloom.From(data, k), data}
}

func (s *SerializableBF) Serialize() ([]byte, error) {
  // serialize s.data in the format you want as buf byte slice
  return buf, nil
}

I hope I am not completely missing your issue... and perhaps this is not the most elegant solution too.

@maciej
Copy link
Author

maciej commented Jun 4, 2018

@db7 aaah! I've missed the key point in your last comment: that buffer passed to From is reused! Thank you! And sorry for having to explain this to me 3 times :-)

OK, so I'm leaving this ticket up for discussion. Do we need a more elegant solution or is the one @db7 described good enough.

@lemire
Copy link
Member

lemire commented May 13, 2021

@maciej The current format saves m and k followed by the bitset.

It is a rather thin format.

Admittedly using 64 bits per parameter is a bit wasteful but it does not seem like a big deal. Furthermore, if your false-positive and capacities are fixed, you can omit these parameters...

We are talking about 16 bytes... which we could reduce to 8 bytes easily... If that is a large fraction of your storage... I'd be curious about your use case?

Can you elaborate?

I am totally open to proposing something finer.

@paralin
Copy link
Contributor

paralin commented Jun 22, 2021

#65

	m := uint(b.GetM())
	k := uint(b.GetK())
	return bloom.FromWithM(b.GetBitSet().GetSet(), m, k)

@lemire
Copy link
Member

lemire commented Sep 30, 2021

Since we are exposing the bitset (@paralin), I think that this issue is resolved. Please open a new issue if the underlying problem remains.

@lemire lemire closed this as completed Sep 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants