# Filtering JetStream Events

In this notebook, we'll filter events from the Redis Stream that we created in the previous notebook. We'll use a combination of techniques to filter the events:

1. Deduplication using Redis Bloom Filter to avoid processing the same event multiple times
2. Content-based filtering using vector similarity search to identify posts related to artificial intelligence
3. Storing filtered events in Redis for further processing

Redis Bloom Filter is a probabilistic data structure that lets us check whether an element may be in a set. It is very memory efficient, and both insertion and lookup take constant time. The trade-off is that it can return false positives (it never returns false negatives), and the false-positive probability is controlled by the size of the filter and the number of hash functions.
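
To make the trade-off concrete, here is a minimal in-memory sketch of the idea. It is an illustration only: the class name `SimpleBloomFilter`, the hashing scheme, and the sizes are assumptions chosen for readability, and it does not reflect how Redis implements its Bloom filter internally.

```kotlin
import java.util.BitSet

// Illustrative in-memory Bloom filter (hypothetical helper, not the Redis implementation).
class SimpleBloomFilter(private val bitCount: Int, private val hashCount: Int) {
    private val bits = BitSet(bitCount)

    // Derive the i-th bit position from two base hashes (double hashing).
    private fun position(item: String, i: Int): Int {
        val h1 = item.hashCode()
        val h2 = item.reversed().hashCode() or 1 // keep the step odd so it is never 0
        return Math.floorMod(h1 + i * h2, bitCount)
    }

    // Set every bit position derived from the item.
    fun add(item: String) {
        for (i in 0 until hashCount) bits.set(position(item, i))
    }

    // false means "definitely never added"; true means "probably added".
    fun mightContain(item: String): Boolean =
        (0 until hashCount).all { bits.get(position(item, it)) }
}
```

An item that was added always reports `true`. An item that was never added usually reports `false`, but may report `true` if other items happened to set the same bits, which is the false-positive case described above.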

Text embeddings let us compare posts by meaning rather than by exact words. In this notebook, we'll use a pre-trained sentence-embedding model and a vector similarity search to decide whether a post is related to artificial intelligence.

## Consuming from Redis Streams

### Model Redis Streams Event
In this section, we'll define a data class to represent the events stored in the Redis Stream. This model will be used to deserialize the events from the stream.
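
The cells below use an `Event` type with `fromMap` and `toMap`, along with the properties `uri`, `text`, `langs`, and `operation`. As a rough sketch of what such a model could look like, the class below uses the hypothetical name `SketchEvent` so that it does not clash with the real one. The stream field names and the comma-separated encoding of `langs` are assumptions, not the schema written by the previous notebook.

```kotlin
// Hypothetical event model; field names and the "langs" encoding are assumptions.
data class SketchEvent(
    val uri: String,
    val text: String,
    val langs: List<String>,
    val operation: String
) {
    // Flatten to string fields, since stream entries and hashes hold string pairs.
    fun toMap(): Map<String, String> = mapOf(
        "uri" to uri,
        "text" to text,
        "langs" to langs.joinToString(","),
        "operation" to operation
    )

    companion object {
        // Rebuild an event from a field map, tolerating missing fields.
        fun fromMap(fields: Map<String, String>): SketchEvent = SketchEvent(
            uri = fields["uri"].orEmpty(),
            text = fields["text"].orEmpty(),
            langs = fields["langs"].orEmpty().split(",").filter { it.isNotBlank() },
            operation = fields["operation"].orEmpty()
        )
    }
}
```

Keeping both directions on the model means the same class can deserialize entries read from the stream and serialize events that are written back to Redis.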

### Creating a Redis Client
Create a Jedis client to connect to Redis. This is a reusable client that can be used to interact with Redis Streams.

### Creating a Consumer Group
Create a consumer group to read from the Redis Stream. A consumer group allows multiple consumers to read from the same stream without duplicating the work. Each consumer in the group will receive a different subset of the messages.

A consumer group can be created in Redis with the XGROUP CREATE command:

`XGROUP CREATE streamName groupName id [MKSTREAM]`

To create a consumer group in this notebook, we will encapsulate the command in a function. The function will take the stream name and the group name as parameters.

In [1]:
%use coroutines

In [2]:
import redis.clients.jedis.StreamEntryID
import dev.raphaeldelio.*

fun createConsumerGroup(streamName: String, consumerGroupName: String) {
    try {
        jedisPooled.xgroupCreate(streamName, consumerGroupName, StreamEntryID("0-0"), true)
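    } catch (e: redis.clients.jedis.exceptions.JedisDataException) {
        // Redis replies with a BUSYGROUP error when the group already exists;
        // any other error is rethrown instead of being silently swallowed.
        if (e.message?.contains("BUSYGROUP") != true) throw e
        println("Group already exists")
    }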
}

In [3]:
createConsumerGroup("jetstream", "printer-example")

### Reading from the Stream
Create a reusable function to read from the stream. This function will read from the stream and return a list of entries. It uses the XREADGROUP command to read from the stream as part of a consumer group:

`XREADGROUP GROUP groupName consumerName COUNT count BLOCK blockTime streamName id`

The command will be encapsulated in a function that takes the stream name, consumer group name, consumer name, and count as parameters. The function will return a list of entries.

In [4]:
import redis.clients.jedis.params.XReadGroupParams
import redis.clients.jedis.resps.StreamEntry

fun readFromStream(
    streamName: String,
    consumerGroup: String,
    consumer: String,
    count: Int
): List<Map.Entry<String, List<StreamEntry>>> {
    return jedisPooled.xreadGroup(
        consumerGroup,
        consumer,
        XReadGroupParams().count(count),
        mapOf(
            streamName to StreamEntryID.XREADGROUP_UNDELIVERED_ENTRY
        )
    ) ?: emptyList()
}

### Acknowledging Messages
Create a function to acknowledge the message. Acknowledging tells Redis that the message has been processed successfully, which removes it from the consumer group's pending entries list so that it will not be redelivered or claimed by another consumer in the group.

This is done by using the XACK command:

`XACK streamName groupName id`

The command will be encapsulated in a function that returns a lambda. The lambda takes the stream name, consumer group name, and entry as parameters and acknowledges the entry by calling the XACK command.

In [7]:
fun ackFunction(): (streamName: String, consumerGroup: String, entry: StreamEntry) -> Unit {
    return { streamName, consumerGroup, entry ->
        jedisPooled.xack(
            streamName,
            consumerGroup,
            entry.id
        )
    }
}

### Consuming the Stream
Create a reusable function to consume the stream.

This function implements a pipeline pattern where each event is processed sequentially by a series of handlers. If any handler returns false, the processing stops for that event.

After processing the event, the function acknowledges the message using the ack function.
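
The short-circuit behaviour of the pipeline can be sketched without Redis. The helper `runPipeline` and the two sample handlers below are hypothetical names used only for illustration; the handler type matches the `(Event) -> Pair<Boolean, String>` shape used in this notebook, with `String` standing in for `Event`.

```kotlin
// Runs handlers in order. Returns the message of the handler that stopped
// the pipeline, or null if every handler asked to continue.
fun <T> runPipeline(item: T, handlers: List<(T) -> Pair<Boolean, String>>): String? {
    for (handler in handlers) {
        val (shouldContinue, message) = handler(item)
        if (!shouldContinue) return message // later handlers never see the item
    }
    return null
}

// Two sample handlers that each return (shouldContinue, message).
val notBlank: (String) -> Pair<Boolean, String> = { Pair(it.isNotBlank(), "blank text") }
val shortEnough: (String) -> Pair<Boolean, String> = { Pair(it.length <= 10, "too long") }
```

Because a failing handler ends the loop, cheap checks are best placed first and expensive or side-effecting steps, such as storing the event, last.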

In [8]:
import kotlinx.coroutines.*

fun consumeStream(
    streamName: String,
    consumerGroup: String,
    consumer: String,
    handlers: List<(Event) -> Pair<Boolean, String>>,
    ackFunction: ((String, String, StreamEntry) -> Unit),
    count: Int = 1,
    limit: Int = 5
) {
    var lastMessageTime = System.currentTimeMillis()
    var consumed = 0

    while (consumed < limit) {
        val entries = readFromStream(streamName, consumerGroup, consumer, count)
        val allEntries = entries.flatMap { it.value }
        allEntries.forEach { entry ->
            consumed++
            val event = Event.fromMap(entry)

            // Run the handlers in order; stop at the first one that returns false.
            for (handler in handlers) {
                val (shouldContinue, message) = handler(event)

                if (!shouldContinue) {
                    println("$consumer: Handler stopped processing: $message")
                    break
                }
            }

            // Acknowledge once per entry, after the handler pipeline has finished.
            ackFunction(streamName, consumerGroup, entry)
        }

        if (allEntries.isEmpty()) {
            val now = System.currentTimeMillis()
            if (now - lastMessageTime >= 2_000) {
                println("$consumer: No new messages for 2 seconds. Stopping.")
                break
            }
        } else {
            // Reset the idle timer whenever messages arrive.
            lastMessageTime = System.currentTimeMillis()
        }
    }

}

To test the consumeStream function, we'll create a simple handler that prints the event's URI.

In [9]:
val printUri: (Event) -> Pair<Boolean, String> = {
    println("Got event from ${it.uri}")
    Pair(true, "OK")
}

In [10]:
runBlocking {
    consumeStream(
        streamName = "jetstream",
        consumerGroup = "printer-example",
        consumer = "printer-1",
        handlers = listOf(printUri),
        ackFunction = ackFunction(),
        count = 1,
        limit = 100
    )
}

Got event from at://did:plc:ru55eyk6bq5uyrq7ggvckgs2/app.bsky.feed.post/3lpt7k4irbs2q
Got event from at://did:plc:4evan2whibyy5bjilipjmqtp/app.bsky.feed.post/3lpt7jthrac2o
Got event from at://did:plc:dvggsjvb56xsn7gp7yjmdpo4/app.bsky.feed.post/3lpt7jzodq22m
Got event from at://did:plc:piqjp53oxmn5temfwfudxu2r/app.bsky.feed.post/3lpt7kbwpos2b
Got event from at://did:plc:ubmc5ibjnrkkouuxu64n3m7s/app.bsky.feed.post/3lpt7kaawyc2b
Got event from at://did:plc:5ueg24jex7re2jckt3pt3mo6/app.bsky.feed.post/3lpt7kaqb7s2e
Got event from at://did:plc:aikbluml2j6tnude4u7scgcs/app.bsky.feed.post/3lpt7ka2zvo2k
Got event from at://did:plc:lyjajndtbldeawszgwvo6cea/app.bsky.feed.post/3lpt7k7kvvs2w
Got event from at://did:plc:da2lukuaakgrqm37bwn2bikh/app.bsky.feed.post/3lpt7kaqaak2d
Got event from at://did:plc:shbfwuygww3edmq3t6koev6a/app.bsky.feed.post/3lpt7k42cfc2s
Got event from at://did:plc:fnayimxlzx4xsi3lwetnzp67/app.bsky.feed.post/3lpt7k7xcfc2i
Got event from at://did:plc:xc6k4agoolf4k7kzru72tu4m/a

## Semantic Classifier
In this section, we'll use a vector similarity search to filter posts based on their content.
We'll embed a list of reference sentences and classify the posts based on their similarity to those references.
If a post is similar enough to the references, it will be classified as artificial-intelligence-related.
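
As a rough illustration of what "similarity" between two embeddings means, the sketch below computes cosine similarity for two small vectors. Cosine similarity is a common choice for comparing text embeddings; whether the vector store uses exactly this measure depends on how it is configured, so treat the hypothetical `cosineSimilarity` helper as an explanation of the concept rather than a description of the store's internals.

```kotlin
import kotlin.math.sqrt

// Cosine similarity: 1.0 for vectors pointing the same way, 0.0 for orthogonal ones.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Double {
    require(a.size == b.size) { "Vectors must have the same dimension" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    // A zero vector has no direction, so report no similarity.
    if (normA == 0.0 || normB == 0.0) return 0.0
    return dot / (sqrt(normA) * sqrt(normB))
}
```

Real embeddings have hundreds of dimensions, but the computation is the same: texts with similar meaning end up with vectors that point in similar directions.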

### Including Spring AI Redis Vector Store
The Redis Vector Store is a Spring AI module that allows you to store and retrieve vectors in Redis. It uses the Redis Vector Search feature to perform similarity searches on the vectors.

In [13]:
@file:DependsOn("org.springframework.ai:spring-ai-redis-store:1.0.0")
@file:DependsOn("org.springframework.ai:spring-ai-transformers:1.0.0")

### Initializing an embedding model
The embedding model is used to convert text into vectors. The Redis Vector Store uses the embedding model to store and retrieve vectors in Redis.

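By default, the Transformers Embedding Model provided by Spring AI uses the all-MiniLM-L6-v2 model, which produces 384-dimensional vectors. In this notebook we override the default with all-mpnet-base-v2, which produces 768-dimensional vectors.

The model and its tokenizer are downloaded from the Hugging Face Hub and used to convert text into vectors.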


In [164]:
import org.springframework.ai.transformers.TransformersEmbeddingModel

val embeddingModel = TransformersEmbeddingModel()
embeddingModel.setModelResource("https://huggingface.co/sentence-transformers/all-mpnet-base-v2/resolve/main/onnx/model.onnx?download=true")
embeddingModel.setTokenizerResource("https://huggingface.co/sentence-transformers/all-mpnet-base-v2/raw/main/tokenizer.json")
embeddingModel.afterPropertiesSet()

### Creating the store (and the index in Redis)
By creating an index in Redis, we can use the Redis Vector Search feature to perform similarity searches on the vectors.


In [202]:
import org.springframework.ai.vectorstore.redis.RedisVectorStore
import org.springframework.ai.vectorstore.redis.RedisVectorStore.MetadataField
import redis.clients.jedis.search.Schema.FieldType
import dev.raphaeldelio.jedisPooled

val redisVectorStore = RedisVectorStore.builder(jedisPooled, embeddingModel)
    .indexName("classifierIdx")
    .contentFieldName("text")
    .embeddingFieldName("textEmbedding")
    .prefix("classifier:")
    .initializeSchema(true)
    .vectorAlgorithm(RedisVectorStore.Algorithm.FLAT)
    .build()
redisVectorStore.afterPropertiesSet()

### Storing the references as vectors

The references are stored in Redis as vectors. The vectors are created using the embedding model and stored in the Redis Vector Store.

In [203]:
import kotlinx.serialization.json.Json
import org.springframework.ai.document.Document
import java.io.File
import java.util.UUID

val references = Json.decodeFromString<List<String>>(File("resources/filtering-examples.json").readText())

fun storeFilterDocumentsInRedis(references: List<String>) {
    val documents = references.map { text ->
        createFilterDocument(text)
    }

    redisVectorStore.add(documents)
}

fun createFilterDocument(text: String): Document {
    return Document(
        UUID.randomUUID().toString(),
        text,
        mapOf(
            "text" to text,
        )
    )
}

storeFilterDocumentsInRedis(references)

### Creating a Classification Function
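Now we'll create a function that scores text against the stored references.

The function removes URLs and mentions from the post, splits it into clauses, and searches the vector store for the closest reference to each clause. It returns one similarity score per clause; a higher score means the clause is closer to a reference.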


In [220]:
import org.springframework.ai.vectorstore.SearchRequest

fun classify(post: String): List<Double> {
    val cleanedPost = removeUrls(post)
    return breakSentenceIntoClauses(cleanedPost).map { clause ->
        println(clause)
        (redisVectorStore.similaritySearch(
            SearchRequest.builder()
                .topK(1)
                .query(clause)
                .build()
        )?.map {
            println("Matched sentence: ${it.text}")
            it.score ?: 0.0
        } ?: emptyList())
    }.flatten()
}

fun breakSentenceIntoClauses(sentence: String): List<String> {
    return sentence.split(Regex("""[!?,.:;]+"""))
        .filter { it.isNotBlank() }.map { it.trim() }
}

fun removeUrls(text: String): String {
    return text.replace(Regex("""(?:https?:\/\/)?(?:www\.)?[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}(\/\S*)?"""), "")
        .replace(Regex("""@\w+"""), "")
        .replace(Regex("""\s+"""), " ")
        .trim()
}


In [168]:
classify("Redis is a great tool for building applied AI systems because it's a Vector Database")

Redis is a great tool for building applied AI systems because it's a Vector Database
Matched sentence: best AI tools right now


[0.8059492111206055]

In [169]:
classify("I'm here at KotlinConf, I can't stand people talking about AI anymore.")

I'm here at KotlinConf
Matched sentence: Building a side project with LangChain + OpenAI APIs. Loving it so far.
I can't stand people talking about AI anymore
Matched sentence: why AI matters


[0.6877487897872925, 0.8056651949882507]

In [170]:
classify("Redis is a great tool for building distributed systems")

Redis is a great tool for building distributed systems
Matched sentence: RAG architectures are emerging as the go-to solution for enterprise search systems.


[0.6799676418304443]

In [171]:
classify("Santos is the best city of Brazil.")

Santos is the best city of Brazil
Matched sentence: Claude AI


[0.575647234916687]

In [172]:
classify("Brazil is a polarized country in terms of politics")

Brazil is a polarized country in terms of politics
Matched sentence: bias in AI models


[0.5600488185882568]

In [223]:
classify(
    """
Me and members of the Moderation Team @ Godot just stumbled upon this and we are absolutely blown away by what devs are creating with Godot. Incredible to see a recreation of WinAMP coming to life like this! 💙🔥 #GodotEngine #IndieDev
"""
)

Me and members of the Moderation Team @ Godot just stumbled upon this and we are absolutely blown away by what devs are creating with Godot
Matched sentence: Just tried out the new GPT model—it's wild how good it is at coding.
Incredible to see a recreation of WinAMP coming to life like this
Matched sentence: I didn’t think AI could do creative stuff, but I was wrong.
💙🔥 #GodotEngine #IndieDev
Matched sentence: Building a side project with LangChain + OpenAI APIs. Loving it so far.


[0.7738901376724243, 0.6401819586753845, 0.6505147218704224]

The scores above show that the choice of threshold matters. For an optimizer that searches for the best threshold for the classification function, see: https://github.com/redis-developer/redis-ai-resources/blob/main/python-recipes/semantic-router/01_routing_optimization.ipynb
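
The idea of tuning a threshold can be sketched with a small grid search. This is a much simpler stand-in for the linked optimizer, not a port of it: the `LabelledScore` type, the helper names, and the accuracy metric are assumptions, and the labelled samples would need to come from posts you have judged by hand.

```kotlin
// Hypothetical labelled sample: a similarity score plus a human judgement.
data class LabelledScore(val score: Double, val relevant: Boolean)

// Fraction of samples classified correctly when "score > threshold" means relevant.
fun accuracy(samples: List<LabelledScore>, threshold: Double): Double =
    samples.count { (it.score > threshold) == it.relevant }.toDouble() / samples.size

// Try each candidate threshold and keep the one with the highest accuracy.
fun bestThreshold(samples: List<LabelledScore>, candidates: List<Double>): Double =
    candidates.maxByOrNull { accuracy(samples, it) }
        ?: error("No candidate thresholds supplied")
```

Accuracy is only one possible objective. If missing a relevant post is worse than letting an irrelevant one through, a metric such as recall or F1 may be a better fit.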

### Creating a Filter Handler
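Now we'll create a handler that filters events based on their content.

The handler only considers non-blank English posts that are not delete operations. It uses the classification function to determine whether a post is related to artificial intelligence: a post passes if any of its clauses scores above 0.75.

If the post does not pass, the handler returns false, which stops the processing of the event.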


In [207]:
val filter: (Event) -> Pair<Boolean, String> = { event ->
    if (event.text.isNotBlank() && event.langs.contains("en") && event.operation != "delete") {
        val classification = classify(event.text)
        if (classification.any { it > 0.75 }) {
            Pair(true, "OK")
        } else {
            Pair(false, "Not a post related to artificial intelligence: Score: ${classification} \n${event.text}\n")
        }
    } else {
        Pair(false, "Text is blank, not in English, or a delete operation")
    }
}

## Storing Filtered Events
In this section, we'll store the filtered events in Redis for further processing.


### Storing Events in Redis
Now we'll create a handler that stores events in Redis. The handler stores the event as a hash in Redis. The key is `post:` followed by the event's URI with the `at://did:plc:` prefix removed.


In [54]:
val storeEvent: (Event) -> Pair<Boolean, String> = { event ->
    jedisPooled.hset("post:" + event.uri.replace("at://did:plc:", ""), event.toMap())
    Pair(true, "OK")
}
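
To make the key format concrete, the sketch below applies the same string expression to one of the URIs printed earlier. The function name `postKey` is a hypothetical helper introduced only for this example.

```kotlin
// Mirrors the key expression used by the storeEvent handler.
fun postKey(uri: String): String = "post:" + uri.replace("at://did:plc:", "")

// Key derived from a URI that appears in the output above.
val exampleKey = postKey("at://did:plc:ru55eyk6bq5uyrq7ggvckgs2/app.bsky.feed.post/3lpt7k4irbs2q")
```

Here `exampleKey` is `post:ru55eyk6bq5uyrq7ggvckgs2/app.bsky.feed.post/3lpt7k4irbs2q`, so all stored posts share the `post:` prefix and can be scanned or indexed by it.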

### Adding Filtered Events to a New Stream
Finally, we'll create a handler that adds filtered events to a new stream. This allows other consumers to process only the filtered events, rather than having to filter the events themselves.


In [46]:
import redis.clients.jedis.params.XAddParams

val addFilteredEventToStream: (Event) -> Pair<Boolean, String> = { event ->
    jedisPooled.xadd(
        "filtered-events",
        XAddParams.xAddParams().id(StreamEntryID.NEW_ENTRY),
        event.toMap()
    )
    Pair(true, "OK")
}

In [205]:
createConsumerGroup("jetstream", "store-example")

## Putting It All Together
Now we'll put all the pieces together to create a complete pipeline for filtering events from the Redis Stream.

In this example we create two consumers that will process the same stream.
- By doing that, we can scale the processing of the events by adding more consumers to the group.
- Redis will make sure that each consumer will receive different messages.


In [210]:
runBlocking {
        listOf(
            async(Dispatchers.IO) {
                consumeStream(
                    streamName = "jetstream",
                    consumerGroup = "store-example",
                    consumer = "store-1",
                    handlers = listOf(
                        filter,
                        printUri,
                        storeEvent,
                        addFilteredEventToStream
                    ),
                    ackFunction = ackFunction(),
                    count = 1,
                    limit = 1000
                )
            },
            async(Dispatchers.IO) {
                consumeStream(
                    streamName = "jetstream",
                    consumerGroup = "store-example",
                    consumer = "store-2", // Different consumer
                    handlers = listOf(
                        filter,
                        printUri,
                        storeEvent,
                        addFilteredEventToStream
                    ),
                    ackFunction = ackFunction(),
                    count = 1,
                    limit = 1000
                )
            }
        ).awaitAll()
}

An leftover POS
this is not how to live
Matched sentence: AI in everyday life
Matched sentence: runway gen-2
store-2: Handler stopped processing: Not a post related to artificial intelligence: Score: [0.5952517986297607] 
this is not how to live

store-1: Handler stopped processing: Not a post related to artificial intelligence: Score: [0.6166270971298218] 
An leftover POS

i can’t live like this i can’t take care of myself like this and fuck knows noone in the establishment would take me seriously if i asked for help
fuck
Matched sentence: Claude AI
my parents have openly said to my face that they’d put their lives before supporting me and looking after me during recovery if i got bottom surgery and i presume that extends to ANY sort of surgery or medical event
Matched sentence: AI helping me work
store-2: Handler stopped processing: Not a post related to artificial intelligence: Score: [0.5985085964202881] 
i can’t live like this i can’t take care of myself like this and fuck knows n

[kotlin.Unit, kotlin.Unit]

## Next Steps
In the next notebook, we'll enrich the filtered events with additional information, such as topic modeling and embeddings for semantic search.
