# Deduplication with Bloom Filters

In this notebook, we'll explore how to use Redis Bloom Filters to deduplicate events in a stream. We'll also use a machine learning model to filter posts based on their content.

In [1]:
%use coroutines

## Deduplication with Bloom Filter
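A Redis Bloom filter is a probabilistic data structure for testing whether an element is a member of a set. It can return false positives (reporting an element as present when it was never added) but never false negatives. In exchange, it is very memory efficient, and both insertion and lookup run in constant time.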
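To make the probabilistic behavior concrete, here is a minimal in-memory sketch of a Bloom filter. It is an illustration only and not how Redis implements the structure: the class name `SimpleBloomFilter`, its parameters, and the double-hashing scheme built on `String.hashCode()` are all assumptions chosen for brevity.

```kotlin
import java.util.BitSet

// Illustrative in-memory Bloom filter (hypothetical; not the Redis implementation).
class SimpleBloomFilter(private val numBits: Int, private val numHashes: Int) {
    private val bits = BitSet(numBits)

    // Derive the i-th bit position from two base hashes (double hashing).
    private fun position(item: String, i: Int): Int {
        val h1 = item.hashCode()
        val h2 = item.reversed().hashCode() or 1 // force odd so the step is never zero
        return Math.floorMod(h1 + i * h2, numBits)
    }

    // Set every bit position derived from the item.
    fun add(item: String) {
        for (i in 0 until numHashes) bits.set(position(item, i))
    }

    // false means "definitely not added"; true means "probably added".
    fun mightContain(item: String): Boolean =
        (0 until numHashes).all { bits.get(position(item, it)) }
}

fun main() {
    val filter = SimpleBloomFilter(numBits = 1024, numHashes = 3)
    filter.add("at://example/post/1")
    println(filter.mightContain("at://example/post/1")) // true: added items are always reported
    println(filter.mightContain("at://example/post/2")) // usually false, but a false positive is possible
}
```

An item that was added always sets all of its bits, so a lookup for it can never fail. A different item can collide on every bit, which is where false positives come from.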


### Creating a Bloom Filter
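This function creates a Bloom filter with the given name, configured with a target error rate of 0.01 (1%), an initial capacity of 1,000,000 elements, and an expansion factor of 2. If a filter with that name already exists, the resulting `JedisDataException` is caught and a message is printed instead.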

In [2]:
import dev.raphaeldelio.*
import redis.clients.jedis.bloom.BFReserveParams
import redis.clients.jedis.exceptions.JedisDataException
fun createBloomFilter(name: String) {
    try {
        val errorRate = 0.01
        val capacity = 1_000_000L
        val reserveParams = BFReserveParams().expansion(2)
        jedisPooled.bfReserve(name, errorRate, capacity, reserveParams)
    } catch (_: JedisDataException) {
        println("Bloom filter already exists")
    }
}
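To get a feel for what an error rate of 0.01 and a capacity of 1,000,000 imply, the sketch below applies the textbook sizing formulas for a classic Bloom filter: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The function names are hypothetical, and the memory Redis actually allocates may differ from this estimate.

```kotlin
import kotlin.math.ceil
import kotlin.math.ln
import kotlin.math.roundToInt

// Textbook sizing for a classic Bloom filter; Redis may allocate differently.
fun optimalBits(capacity: Long, errorRate: Double): Long =
    ceil(-capacity * ln(errorRate) / (ln(2.0) * ln(2.0))).toLong()

// Number of hash functions that minimizes the false positive rate for that size.
fun optimalHashCount(bits: Long, capacity: Long): Int =
    (bits.toDouble() / capacity * ln(2.0)).roundToInt()

fun main() {
    val capacity = 1_000_000L // same values as createBloomFilter above
    val errorRate = 0.01
    val bits = optimalBits(capacity, errorRate)
    val hashes = optimalHashCount(bits, capacity)
    println("bits=$bits (~${bits / 8 / 1024} KiB), hashes=$hashes")
}
```

For these values the formulas give roughly 9.6 bits per element and 7 hash functions, so the filter needs a little over 1 MiB to track a million URIs.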

### Deduplication Handler
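This function creates a handler that checks whether an event has already been processed by looking up its URI in the Bloom filter. If the filter reports the URI as present, the handler returns `false` together with an explanatory message, which stops further processing of the event. Because a Bloom filter can return false positives, a small fraction of new events (roughly the configured error rate) may be skipped as well.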


In [3]:
fun deduplicate(bloomFilter: String): (Event) -> Pair<Boolean, String> {
    return { event ->
        if (jedisPooled.bfExists(bloomFilter, event.uri)) {
            Pair(false, "${event.uri} already processed")
        } else {
            Pair(true, "OK")
        }
    }
}
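The `Pair<Boolean, String>` return type suggests that handlers run in order and that the first `false` halts the chain. The sketch below shows one way such a chain could work. `DemoEvent`, `runHandlers`, and `deduplicateWith` are hypothetical stand-ins, not the notebook's actual `Event` type or `consumeStream` implementation, and an in-memory set replaces the Bloom filter lookup.

```kotlin
// Hypothetical stand-in for the notebook's Event type.
data class DemoEvent(val uri: String)

// Runs handlers in order and stops at the first one that returns false.
fun runHandlers(
    event: DemoEvent,
    handlers: List<(DemoEvent) -> Pair<Boolean, String>>
): Pair<Boolean, String> {
    for (handler in handlers) {
        val result = handler(event)
        if (!result.first) return result
    }
    return Pair(true, "OK")
}

// An in-memory set stands in for the Bloom filter lookup.
fun deduplicateWith(seen: Set<String>): (DemoEvent) -> Pair<Boolean, String> = { event ->
    if (event.uri in seen) Pair(false, "${event.uri} already processed") else Pair(true, "OK")
}

fun main() {
    val seen = setOf("at://example/post/1")
    val handlers = listOf(deduplicateWith(seen))
    println(runHandlers(DemoEvent("at://example/post/1"), handlers)) // (false, at://example/post/1 already processed)
    println(runHandlers(DemoEvent("at://example/post/2"), handlers)) // (true, OK)
}
```

Placing the deduplication handler first in the list means later handlers never see an event that was already processed.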

### Atomic Acknowledgment and Bloom Filter Update
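This function creates a handler that acknowledges the message and adds the URI to the Bloom filter inside a single Redis transaction (`MULTI`/`EXEC`). Both commands are queued and then executed together, with no commands from other clients running in between, so the acknowledgment and the filter update are applied as one unit. Note that Redis does not roll back a transaction if one of its commands fails at runtime.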


In [4]:
import redis.clients.jedis.Connection
import redis.clients.jedis.JedisPool
import redis.clients.jedis.Transaction
import redis.clients.jedis.resps.StreamEntry

val jedisPool = JedisPool()

fun ackAndBfFn(bloomFilter: String):  (String, String, StreamEntry) -> Unit {
    return { streamName, consumerGroup, entry ->
        jedisPool.resource.use { jedis ->
            // Create a transaction
            val multi = jedis.multi()

            // Acknowledge the message
            multi.xack(
                streamName,
                consumerGroup,
                entry.id
            )

            // Add the URI to the bloom filter
            multi.bfAdd(bloomFilter, Event.fromMap(entry).uri)

            // Execute the transaction
            multi.exec()
        }
    }
}

In [5]:
createConsumerGroup("jetstream", "deduplicate-example")

In [6]:
val bloomFilterName = "processed-uris"
createBloomFilter(bloomFilterName)

In [7]:
runBlocking {
    consumeStream(
        streamName = "jetstream",
        consumerGroup = "deduplicate-example",
        consumer = "deduplicate-1",
        handlers = listOf(deduplicate(bloomFilterName), printUri),
        ackFunction = ackAndBfFn(bloomFilterName),
        count = 100,
        limit = 400
    )
}

Got event from at://did:plc:tz5ykftv7i4uowym33cmey4m/app.bsky.feed.post/3lpy74bea3s2l
Got event from at://did:plc:gipeuijaannaburxm6voyuf7/app.bsky.feed.post/3lpy75f6rkc2s
Got event from at://did:plc:tpc42zt2y25vzh4a2djx5tvo/app.bsky.feed.post/3lpy75f7s2526
Got event from at://did:plc:xfkk7zpf5czqz52qbizqr5tj/app.bsky.feed.post/3lpy65uq2hk2e
Got event from at://did:plc:ftedet2hx7eru6du5rae63js/app.bsky.feed.post/3lpy75f73pc23
Got event from at://did:plc:annmh2aamt3ctc2gubsvi6dj/app.bsky.feed.post/3lpy75fms6f2d
Got event from at://did:plc:vwbjoq5c54l325p5ync72is6/app.bsky.feed.post/3lpy75fdsok27
Got event from at://did:plc:vwbjoq5c54l325p5ync72is6/app.bsky.feed.post/3lpy75fimy227
Got event from at://did:plc:vwbjoq5c54l325p5ync72is6/app.bsky.feed.post/3lpy75fipvs27
Got event from at://did:plc:dwfbgsqwdsps4aobupgrb553/app.bsky.feed.post/3lpy75fiabs2l
Got event from at://did:plc:dzetpmp4nf3g2iretw2mb72m/app.bsky.feed.post/3lpy75fhnq22r
Got event from at://did:plc:ou4lljlduhupgbk5izv4dtpz/a

In [12]:
createConsumerGroup("jetstream", "deduplicate-example2")

In [13]:
runBlocking {
    consumeStream(
        streamName = "jetstream",
        consumerGroup = "deduplicate-example2",
        consumer = "deduplicate-1",
        handlers = listOf(deduplicate(bloomFilterName), printUri),
        ackFunction = ackAndBfFn(bloomFilterName),
        count = 100,
        limit = 400
    )
}

deduplicate-1: Handler stopped processing: at://did:plc:tz5ykftv7i4uowym33cmey4m/app.bsky.feed.post/3lpy74bea3s2l already processed
deduplicate-1: Handler stopped processing: at://did:plc:gipeuijaannaburxm6voyuf7/app.bsky.feed.post/3lpy75f6rkc2s already processed
deduplicate-1: Handler stopped processing: at://did:plc:tpc42zt2y25vzh4a2djx5tvo/app.bsky.feed.post/3lpy75f7s2526 already processed
deduplicate-1: Handler stopped processing: at://did:plc:xfkk7zpf5czqz52qbizqr5tj/app.bsky.feed.post/3lpy65uq2hk2e already processed
deduplicate-1: Handler stopped processing: at://did:plc:ftedet2hx7eru6du5rae63js/app.bsky.feed.post/3lpy75f73pc23 already processed
deduplicate-1: Handler stopped processing: at://did:plc:annmh2aamt3ctc2gubsvi6dj/app.bsky.feed.post/3lpy75fms6f2d already processed
deduplicate-1: Handler stopped processing: at://did:plc:vwbjoq5c54l325p5ync72is6/app.bsky.feed.post/3lpy75fdsok27 already processed
deduplicate-1: Handler stopped processing: at://did:plc:vwbjoq5c54l325p5ync7