# Deduplication with Bloom Filters

In this notebook, we'll explore how to use Redis Bloom Filters to deduplicate events in a stream. We'll also use a machine learning model to filter posts based on their content.

In [10]:
%use coroutines

## Deduplication with Bloom Filter
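A Redis Bloom filter is a probabilistic data structure for testing whether an element is in a set. A lookup can return a false positive (reporting an element as present when it was never added) but never a false negative, so a "not present" answer is always correct. In exchange, the filter is very memory efficient, and for a filter that has not been expanded, the cost of an insertion or lookup depends on the number of hash functions rather than on how many elements are stored.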
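
To make the idea concrete, here is a minimal in-memory sketch of a Bloom filter. It is only an illustration of the technique and does not reflect how Redis implements it internally; the class name `SimpleBloomFilter` and the double-hashing scheme built on `hashCode()` are assumptions chosen for brevity.

```kotlin
import java.util.BitSet

// Illustrative in-memory Bloom filter; NOT the Redis implementation.
class SimpleBloomFilter(private val numBits: Int, private val numHashes: Int) {
    private val bits = BitSet(numBits)

    // Derive the i-th bit position from two base hashes (double hashing).
    private fun position(item: String, i: Int): Int {
        val h1 = item.hashCode()
        val h2 = item.reversed().hashCode() or 1 // force odd so the step is never zero
        return Math.floorMod(h1 + i * h2, numBits)
    }

    // Set every bit position derived from the item.
    fun add(item: String) {
        for (i in 0 until numHashes) bits.set(position(item, i))
    }

    // false -> the item was definitely never added
    // true  -> the item was probably added (false positives are possible)
    fun mightContain(item: String): Boolean =
        (0 until numHashes).all { bits.get(position(item, it)) }
}
```

After `add("at://example/1")`, `mightContain("at://example/1")` always returns `true`, because the same bits are checked that were set. A different URI can also return `true` if its bits happen to collide with ones already set, which is the false-positive case.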


### Creating a Bloom Filter
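This function creates a Bloom filter with the given name. The filter is configured with an error rate of 0.01 (a 1% target false-positive rate), an initial capacity of 1,000,000 elements, and an expansion factor of 2: when the filter reaches capacity, Redis adds a new sub-filter twice the size of the previous one. If a filter with that name already exists, the function prints a message and carries on.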
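
As a rough guide to what these settings cost, the textbook sizing formulas for a Bloom filter can be sketched as below. The helper names `optimalBits` and `optimalHashes` are hypothetical, and the numbers are theoretical optima; the memory Redis actually allocates may differ, especially once sub-filters are added.

```kotlin
import kotlin.math.ceil
import kotlin.math.ln

// Textbook optimum number of bits for n items at false-positive rate p:
//   m = -n * ln(p) / (ln 2)^2
fun optimalBits(capacity: Long, errorRate: Double): Long =
    ceil(-capacity * ln(errorRate) / (ln(2.0) * ln(2.0))).toLong()

// Textbook optimum number of hash functions:
//   k = (m / n) * ln 2
fun optimalHashes(bits: Long, capacity: Long): Int =
    ceil(bits.toDouble() / capacity * ln(2.0)).toInt()
```

For a capacity of 1,000,000 and an error rate of 0.01, `optimalBits` gives roughly 9.6 million bits (about 1.2 MB) and `optimalHashes` gives 7, far less than storing a million URIs in an exact set.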

In [2]:
import dev.raphaeldelio.*
import redis.clients.jedis.bloom.BFReserveParams
import redis.clients.jedis.exceptions.JedisDataException
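
fun createBloomFilter(name: String) {
    try {
        val errorRate = 0.01       // target false-positive rate (1%)
        val capacity = 1_000_000L  // expected number of distinct items
        // When capacity is reached, each new sub-filter is 2x the size of the previous one
        val reserveParams = BFReserveParams().expansion(2)
        jedisPooled.bfReserve(name, errorRate, capacity, reserveParams)
    } catch (_: JedisDataException) {
        // BF.RESERVE fails if the key already exists; treat that as a no-op
        println("Bloom filter already exists")
    }
}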

### Deduplication Handler
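This function creates a handler that checks whether an event has already been processed by looking up its URI in the Bloom filter. If the filter reports the URI as present, the handler returns `false` together with a message, which stops further processing of the event; otherwise it returns `true` and the event moves on to the next handler. Because a Bloom filter can produce false positives, a small fraction of new events, bounded by the configured error rate, may be skipped as duplicates.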


In [3]:
fun deduplicate(bloomFilter: String): (Event) -> Pair<Boolean, String> {
    return { event ->
        if (jedisPooled.bfExists(bloomFilter, event.uri)) {
            Pair(false, "${event.uri} already processed")
        } else {
            Pair(true, "OK")
        }
    }
}
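
The notebook's `consumeStream` helper is not shown here, so the following is only a sketch of how a chain of such handlers could short-circuit on the first `false`. Everything in it is hypothetical: `SketchEvent` stands in for the real `Event` type, `runHandlers` is an assumed runner, and `deduplicateInMemory` replaces the Bloom filter lookup with an exact in-memory set so the example runs without Redis.

```kotlin
// Hypothetical stand-in for the notebook's Event type; only the uri field is modeled.
data class SketchEvent(val uri: String)

typealias Handler = (SketchEvent) -> Pair<Boolean, String>

// Runs handlers in order and stops at the first one that returns false.
fun runHandlers(event: SketchEvent, handlers: List<Handler>): Pair<Boolean, String> {
    for (handler in handlers) {
        val result = handler(event)
        if (!result.first) return result // short-circuit: later handlers never run
    }
    return Pair(true, "OK")
}

// In-memory stand-in for the Bloom filter check (an exact set, so no false positives).
fun deduplicateInMemory(seen: Set<String>): Handler = { event ->
    if (event.uri in seen) Pair(false, "${event.uri} already processed")
    else Pair(true, "OK")
}
```

Placing the deduplication handler first in the list means any later handler is only invoked for events whose URI has not been seen.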

### Atomic Acknowledgment and Bloom Filter Update
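This function creates a handler that acknowledges the message and adds the URI to the Bloom filter inside a single Redis transaction (`MULTI`/`EXEC`). Both commands are queued and then executed together, with no other client's commands interleaved between them, so the consumer cannot crash after acknowledging the message but before updating the filter. Note that Redis transactions do not roll back: if one of the commands fails at execution time, the other still takes effect.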


In [5]:
import redis.clients.jedis.Connection
import redis.clients.jedis.JedisPool
import redis.clients.jedis.Transaction
import redis.clients.jedis.resps.StreamEntry

val jedisPool = JedisPool()

fun ackAndBfFn(bloomFilter: String): (String, String, StreamEntry) -> Unit {
    return { streamName, consumerGroup, entry ->
        jedisPool.resource.use { jedis ->
            // Create a transaction
            val multi = jedis.multi()

            // Acknowledge the message
            multi.xack(
                streamName,
                consumerGroup,
                entry.id
            )

            // Add the URI to the bloom filter
            multi.bfAdd(bloomFilter, Event.fromMap(entry).uri)

            // Execute the transaction
            multi.exec()
        }
    }
}

In [6]:
createConsumerGroup("jetstream", "deduplicate-example")

Group already exists


In [7]:
val bloomFilterName = "processed-uris"
createBloomFilter(bloomFilterName)

Bloom filter already exists


In [12]:
runBlocking {
    consumeStream(
        streamName = "jetstream",
        consumerGroup = "deduplicate-example",
        consumer = "deduplicate-1",
        handlers = listOf(deduplicate(bloomFilterName), printUri),
        ackFunction = ackAndBfFn(bloomFilterName),
        count = 100,
        limit = 200
    )
}

Got event from at://did:plc:smdkfx7db3pbvkdfwidns4rv/app.bsky.feed.post/3lpprhxknnn2u
Got event from at://did:plc:ihlnz2jflr7qdlltjitnqkk4/app.bsky.feed.post/3lpprhxlkjk2e
Got event from at://did:plc:p7kwdof4qtn7rprbaful22y4/app.bsky.feed.post/3lpprhxrps22q
Got event from at://did:plc:44zocvhgzpss25bumybmz4hl/app.bsky.feed.post/3lpprhsopxs2q
Got event from at://did:plc:eg5pb4jan53agtahzryq6rnt/app.bsky.feed.post/3lpprhwlpxs2j
Got event from at://did:plc:gngr6bvlifll55dyqaxy5vsk/app.bsky.feed.post/3lpprhtrzu22o
Got event from at://did:plc:425iddzal5kpwz2f7wxbzlxu/app.bsky.feed.post/3lpprhy4sku2n
Got event from at://did:plc:htgcwoihuk25whhip57vlrln/app.bsky.feed.post/3lpprhxkzwc2s
Got event from at://did:plc:tcvpjkdcsdotgvcgdf6epyoe/app.bsky.feed.post/3lpprhxp7pk2l
Got event from at://did:plc:ypd623sroyr4crw2rvn55h5q/app.bsky.feed.post/3lpprhxekw42r
Got event from at://did:plc:nzt52plpdc3q4m5a2aqdifhs/app.bsky.feed.post/3lpprhxukmc2r
Got event from at://did:plc:tospzlswwp3egvrgj7p7iwq5/a