# Enriching Filtered Events

In this notebook, we'll enrich the filtered events from the previous notebook with additional information. We'll use a combination of techniques to enrich the events:

1. Topic modeling using a Large Language Model (LLM) to extract topics from the posts
2. Creating embeddings for semantic search using a transformer model
3. Storing the enriched events in Redis for querying



Embeddings are vector representations of text that capture semantic meaning. They allow us to perform semantic search, which is a search based on meaning rather than exact keyword matching. In this notebook, we'll create embeddings for posts and store them in Redis for later querying.


In [1]:
%use coroutines

## Functions from Previous Notebooks

### Creating a Redis Client

In [2]:
import redis.clients.jedis.JedisPooled

val jedisPooled = JedisPooled()

### Model Stream Events

In [3]:
import redis.clients.jedis.resps.StreamEntry

data class Event(
    val did: String,
    val rkey: String,
    val text: String,
    val timeUs: String,
    val operation: String,
    val uri: String,
    val parentUri: String,
    val rootUri: String,
    val langs: List<String>,
) {
    companion object {
        fun fromMap(entry: StreamEntry): Event {
            val fields = entry.fields
            return Event(
                did = fields["did"] ?: "",
                rkey = fields["rkey"] ?: "",
                text = fields["text"] ?: "",
                timeUs = fields["timeUs"] ?: "",
                operation = fields["operation"] ?: "",
                uri = fields["uri"] ?: "",
                parentUri = fields["parentUri"] ?: "",
                rootUri = fields["rootUri"] ?: "",
                langs = fields["langs"]?.replace("[", "")?.replace("]", "")?.split(", ") ?: emptyList()
            )
        }
    }
}

### Create Consumer Group

In [4]:
import redis.clients.jedis.StreamEntryID

fun createConsumerGroup(streamName: String, groupName: String) {
    try {
        jedisPooled.xgroupCreate(streamName, groupName, StreamEntryID("0-0"), true)
    } catch (e: Exception) {
        println("Consumer group already exists")
    }
}

### Create Bloom Filter

In [5]:
fun createBloomFilter(name: String) {
    try {
        jedisPooled.bfReserve(name, 0.01, 1000000)
    } catch (e: Exception) {
        println("Bloom filter already exists")
    }
}

### Read from Stream

In [6]:
import redis.clients.jedis.params.XReadGroupParams

fun readFromStream(
    streamName: String,
    consumerGroup: String,
    consumer: String, count: Int
): List<Map.Entry<String, List<StreamEntry>>> {
    return jedisPooled.xreadGroup(
        consumerGroup,
        consumer,
        XReadGroupParams().count(count),
        mapOf(
            streamName to StreamEntryID.XREADGROUP_UNDELIVERED_ENTRY
        )
    ) ?: emptyList()
}

### Consume Stream

In [7]:
import kotlinx.coroutines.*

fun consumeStream(
    streamName: String,
    consumerGroup: String,
    consumer: String,
    handlers: List<(Event) -> Pair<Boolean, String>>,
    ackFunction: ((String, String, StreamEntry) -> Unit),
    count: Int = 5,
    limit: Int = 5
) {
    var lastMessageTime = System.currentTimeMillis()
    var consumed = 0

    while (consumed < limit) {
        val entries = readFromStream(streamName, consumerGroup, consumer, count)
        val allEntries = entries.flatMap { it.value }
        allEntries.forEach { entry ->
            consumed++
            val event = Event.fromMap(entry)

            for (handler in handlers) {
                val (shouldContinue, message) = handler(event)
                if (!shouldContinue) {
                    println("$consumer: Handler stopped processing: $message")
                    break
                }
            }

            // Acknowledge once per entry, after the handler chain has finished or stopped
            ackFunction(streamName, consumerGroup, entry)
        }

        if (allEntries.isNotEmpty()) {
            // Reset the idle timer whenever messages arrive
            lastMessageTime = System.currentTimeMillis()
        } else if (System.currentTimeMillis() - lastMessageTime >= 2_000) {
            println("$consumer: No new messages for 2 seconds. Stopping.")
            break
        }
    }

}

### Deduplicate

In [8]:
fun deduplicate(bloomFilter: String): (Event) -> Pair<Boolean, String> {
    return { event ->
        if (jedisPooled.bfExists(bloomFilter, event.uri)) {
            Pair(false, "${event.uri} already processed")
        } else {
            Pair(true, "OK")
        }
    }
}

In [9]:
import redis.clients.jedis.Connection
import redis.clients.jedis.JedisPool
import redis.clients.jedis.Transaction

val jedisPool = JedisPool()

fun ackAndBfFn(bloomFilter: String):  (String, String, StreamEntry) -> Unit {
    return { streamName, consumerGroup, entry ->
        jedisPool.resource.use { jedis ->
            // Create a transaction
            val multi = jedis.multi()

            // Acknowledge the message
            multi.xack(
                streamName,
                consumerGroup,
                entry.id
            )

            // Add the URI to the bloom filter
            multi.bfAdd(bloomFilter, Event.fromMap(entry).uri)

            // Execute the transaction
            multi.exec()
        }
    }
}

In [10]:
val printUri: (Event) -> Pair<Boolean, String> = {
    println("Got event from ${it.uri}")
    Pair(true, "OK")
}
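
The handlers above share one contract: each takes an event and returns a pair of (continue?, message), and the chain stops at the first handler that returns `false`. A minimal sketch of that contract in plain Kotlin, with no Redis involved, might look like this. `SimpleEvent`, `runHandlers`, and the in-memory `seen` set are hypothetical stand-ins for `Event`, the loop inside `consumeStream`, and the Bloom filter.

```kotlin
// Hypothetical stand-in for the notebook's Event class
data class SimpleEvent(val uri: String, val text: String)

// Runs handlers in order and stops at the first one that returns false.
// Returns the message of the last handler that ran.
fun runHandlers(
    event: SimpleEvent,
    handlers: List<(SimpleEvent) -> Pair<Boolean, String>>
): String {
    var lastMessage = "OK"
    for (handler in handlers) {
        val (shouldContinue, message) = handler(event)
        lastMessage = message
        if (!shouldContinue) break
    }
    return lastMessage
}

// In-memory replacement for the Bloom filter check (exact, not probabilistic)
val seen = mutableSetOf<String>()

val dedupe: (SimpleEvent) -> Pair<Boolean, String> = { e ->
    if (e.uri in seen) Pair(false, "${e.uri} already processed") else Pair(true, "OK")
}

val markSeen: (SimpleEvent) -> Pair<Boolean, String> = { e ->
    seen.add(e.uri)
    Pair(true, "OK")
}

val sampleEvent = SimpleEvent("at://example/post/1", "hello")
val firstRun = runHandlers(sampleEvent, listOf(dedupe, markSeen))
val secondRun = runHandlers(sampleEvent, listOf(dedupe, markSeen))

println(firstRun)  // OK
println(secondRun) // at://example/post/1 already processed
```

In the notebook itself, marking an event as seen happens in the acknowledgement function rather than in a handler; the sketch folds it into the chain only to stay self-contained.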

## Topic Modeling with Large Language Models
Topic modeling is a technique used to discover abstract topics in a collection of documents. In this notebook, we'll use a Large Language Model to extract topics from posts. This will allow us to categorize posts and make them more searchable.

### Setting Up the Ollama API Client
We'll use the Spring AI Ollama client to interact with the Ollama API.

Ollama is a tool that allows us to run large language models locally.

In [11]:
@file:DependsOn("org.springframework.ai:spring-ai-ollama:1.0.0-RC1")

The prompt we'll use for the LLM is designed to extract software-related topics from posts. The prompt includes examples of how to format the output and what types of topics to include.

In [12]:
import java.io.File

val topicExtractorSystemPrompt = File("../resources/topic-extractor-prompt.txt").readText()

### Create the Ollama Chat Model

In [13]:
import org.springframework.ai.ollama.OllamaChatModel
import org.springframework.ai.ollama.api.OllamaApi
import org.springframework.ai.ollama.api.OllamaApi.ChatRequest
import org.springframework.ai.ollama.api.OllamaApi.Message
import org.springframework.ai.ollama.api.OllamaApi.Message.Role
import org.springframework.ai.ollama.api.OllamaOptions

val ollamaApi = OllamaApi.builder()
    .baseUrl("http://localhost:11434")
    .build()

val ollamaOptions = OllamaOptions.builder().model("deepseek-coder-v2").build()

val ollamaChatModel = OllamaChatModel.builder()
    .ollamaApi(ollamaApi)
    .defaultOptions(ollamaOptions)
    .build()

### Creating a Topic Modeling Function
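This function takes a post and a comma-separated list of already known topics, sends both to the model through the Ollama chat client, and returns the model's reply as a string of comma-separated topics. Passing the existing topics gives the model the chance to reuse topic names it has seen before. As the examples below show, the reply is an empty string when the model finds no matching topic.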

In [14]:
import org.springframework.ai.chat.messages.SystemMessage
import org.springframework.ai.chat.messages.UserMessage
import org.springframework.ai.chat.prompt.Prompt

fun extractTopics(post: String, existingTopics: String): String {
    // Build a chat message
    val messages = listOf(
        SystemMessage(topicExtractorSystemPrompt),
        UserMessage("Existing topics: $existingTopics"),
        UserMessage("Post: $post")
    )

    val response = ollamaChatModel.call(Prompt(messages))
    return response.result.output.text ?: ""
}

In [15]:
extractTopics("Kotlin is a great programming language for beginners who want to do Applied AI Engineering", "")

 "Programming Languages, Beginner Tutorials, AI Application Development"

In [16]:
extractTopics("Kotlin is a great programming language for beginners", "")

 ""

In [17]:
extractTopics("Brazilian samba is a great music genre for dancing", "")

 ""

### Counting how many times a topic appears

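To track which topics appear most often, we use Redis Top-K, a probabilistic data structure that estimates the most frequent items in a stream of data. Like the related Count-min sketch, it trades a small counting error for a fixed memory footprint.

This makes it particularly useful for ranking items in a large dataset without storing all the items explicitly. The `createTopK` function reserves one Top-K per hourly window, tracking the 15 most frequent topics, and returns its key.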
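
To make approximate counting concrete, here is a minimal Count-min sketch in plain Kotlin. It is an illustration only: the class name `CountMinSketch`, its `width` and `depth` defaults, and the hashing scheme are assumptions made for this sketch, not a description of how Redis implements its probabilistic data structures.

```kotlin
// Illustrative Count-min sketch: `depth` rows of `width` counters.
// Estimates can overcount because of hash collisions, but never undercount.
class CountMinSketch(private val width: Int = 2000, private val depth: Int = 5) {
    private val table = Array(depth) { LongArray(width) }

    // Pick one column per row by mixing the item's hash with a row-specific seed
    private fun column(item: String, row: Int): Int {
        var h = item.hashCode() xor (row * 0x9E3779B1.toInt())
        h = h xor (h ushr 16)
        h *= 0x85EBCA6B.toInt()
        h = h xor (h ushr 13)
        h *= 0xC2B2AE35.toInt()
        h = h xor (h ushr 16)
        return Math.floorMod(h, width)
    }

    // Increment one counter in every row
    fun add(item: String, count: Long = 1) {
        for (row in 0 until depth) table[row][column(item, row)] += count
    }

    // The smallest counter across rows is the least inflated estimate
    fun estimate(item: String): Long =
        (0 until depth).minOf { row -> table[row][column(item, row)] }
}

val sketch = CountMinSketch()
repeat(3) { sketch.add("AI Ethics") }
sketch.add("OpenAI")

println(sketch.estimate("AI Ethics")) // at least 3
println(sketch.estimate("OpenAI"))    // at least 1
```

The memory used depends only on `width` and `depth`, not on how many distinct topics are seen, which is the property that makes this family of structures attractive for streams.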

In [18]:
import redis.clients.jedis.exceptions.JedisDataException
import java.time.LocalDateTime

fun createTopK(): String {
    val windowBucket = LocalDateTime.now().withMinute(0).withSecond(0).withNano(0)
    try {
        jedisPooled.topkReserve("topics-topk:$windowBucket", 15, 3000, 10, 0.9)
    } catch (_: JedisDataException) {
        println("TopK already exists")
    }

    return "topics-topk:$windowBucket"
}
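
`createTopK` names each Top-K after the current hour, so every hour gets its own ranking. The key derivation can be tried on its own with fixed timestamps. `topKKeyFor` is a hypothetical helper introduced for this sketch, and `truncatedTo(ChronoUnit.HOURS)` is used here as an equivalent of the three `with...(0)` calls.

```kotlin
import java.time.LocalDateTime
import java.time.temporal.ChronoUnit

// Hypothetical helper: builds the Top-K key for the hour containing `time`
fun topKKeyFor(time: LocalDateTime): String =
    "topics-topk:${time.truncatedTo(ChronoUnit.HOURS)}"

val sameHourA = topKKeyFor(LocalDateTime.of(2025, 5, 22, 20, 5, 0))
val sameHourB = topKKeyFor(LocalDateTime.of(2025, 5, 22, 20, 59, 59))
val nextHour = topKKeyFor(LocalDateTime.of(2025, 5, 22, 21, 0, 0))

println(sameHourA)              // topics-topk:2025-05-22T20:00
println(sameHourA == sameHourB) // true
println(sameHourA == nextHour)  // false
```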

### Creating a Topic Extraction Handler
This handler extracts topics from an event's text and stores them in Redis in three places: as a pipe-separated string in the "topics" field of the event's hash, in the current hour's Top-K, and in the "topics" set that is passed back to the model as the list of existing topics.

In [19]:
val extractTopics: (Event) -> Pair<Boolean, String> = { event ->
    // Pass the known topics to the model, then clean up its comma-separated reply
    val existingTopics = jedisPooled.smembers("topics")
    val topics = extractTopics(event.text, existingTopics.joinToString(", "))
        .replace("\"", "")
        .replace("“", "")
        .replace("”", "")
        .split(",")
        .map { it.trim() }
        .filter { it.isNotBlank() }

    val topKKey = createTopK()
    if (topics.isNotEmpty()) {
        val postKey = "post:" + event.uri.replace("at://did:plc:", "")
        // Count the topics in the current hourly Top-K
        jedisPooled.topkAdd(topKKey, *topics.toTypedArray())
        // Store them on the post's hash, pipe-separated
        jedisPooled.hset(postKey, mapOf("topics" to topics.joinToString("|")))
        // Add them to the set that is sent to the model as existing topics
        jedisPooled.sadd("topics", *topics.toTypedArray())
    }
    Pair(true, "OK")
}
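
The clean-up applied to the model's reply can be tried without Redis or Ollama by pulling it into a pure function. `parseTopics` is a hypothetical name used only in this sketch; it mirrors the quote stripping, splitting, and trimming done by the handler.

```kotlin
// Hypothetical helper mirroring the handler's clean-up of the model reply
fun parseTopics(raw: String): List<String> =
    raw.replace("\"", "")
        .replace("“", "")
        .replace("”", "")
        .split(",")
        .map { it.trim() }
        .filter { it.isNotBlank() }

val parsedTopics = parseTopics("\"Programming Languages, Beginner Tutorials, AI Application Development\"")

println(parsedTopics)                   // [Programming Languages, Beginner Tutorials, AI Application Development]
println(parsedTopics.joinToString("|")) // Programming Languages|Beginner Tutorials|AI Application Development
println(parseTopics(""))                // []
```

An empty reply produces an empty list, which is why the handler only writes to Redis when at least one topic is left.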

In [20]:
createConsumerGroup("filtered-events", "topic-extraction-example")

Consumer group already exists


In [21]:
val bloomFilterName = "topic-extraction-bf"
createBloomFilter(bloomFilterName)

Bloom filter already exists


In [None]:
runBlocking {
    consumeStream(
        streamName = "filtered-events",
        consumerGroup = "topic-extraction-example",
        consumer = "topic-extraction-1",
        handlers = listOf(deduplicate(bloomFilterName), printUri, extractTopics),
        ackFunction = ackAndBfFn(bloomFilterName),
        count = 1,
        limit = 100
    )
}

Got event from at://did:plc:zc6bjaq5z5mzb7u36dtxfzw7/app.bsky.feed.post/3lprnpnymfh2x
TopK already exists
Got event from at://did:plc:uuykykkdbtq2plsexeztdhpf/app.bsky.feed.post/3lprnpor5vz2v
TopK already exists
Got event from at://did:plc:2llcip7wwsv66wu6vxrxfhsm/app.bsky.feed.post/3lprp5tj4662e
TopK already exists
Got event from at://did:plc:xbsy5qmxblcb2tgt3cmhwnld/app.bsky.feed.post/3lprp5vcdy226
TopK already exists
Got event from at://did:plc:xbsy5qmxblcb2tgt3cmhwnld/app.bsky.feed.post/3lprp5vcgvs26
TopK already exists
Got event from at://did:plc:6hxrccgkh23eeygeqqiqxb6t/app.bsky.feed.post/3lprp5woyvy2h
TopK already exists
Got event from at://did:plc:scy5fhrtswxpfuzchccp5swn/app.bsky.feed.post/3lprp63cam22t
TopK already exists
Got event from at://did:plc:dero4a3fkusyfpiuzc7ax7o5/app.bsky.feed.post/3lprp65wb3c2l
TopK already exists
Got event from at://did:plc:5wm7ascxvmho32pfdihilozg/app.bsky.feed.post/3lprp62epc22v
TopK already exists
Got event from at://did:plc:2zprqnchjss7abl6sv

## Creating Embeddings for Semantic Search
In this section, we'll create embeddings for posts. Embeddings are vector representations of text that capture semantic meaning. They allow us to perform semantic search, which is a search based on meaning rather than exact keyword matching.

For example, if we search for:

"Redis is a cool db for Python devs"

we can still match:

"Redis is a great database for Python developers"

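Closeness in meaning is usually measured by comparing embedding vectors, most commonly with cosine similarity. The sketch below uses tiny made-up 3-dimensional vectors rather than real model output (real embeddings have hundreds of dimensions), and `cosineSimilarity` is a helper written for this example.

```kotlin
import kotlin.math.sqrt

// Cosine similarity: dot(a, b) / (|a| * |b|); values near 1.0 mean similar direction
fun cosineSimilarity(a: FloatArray, b: FloatArray): Double {
    require(a.size == b.size) { "Vectors must have the same length" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

// Made-up vectors standing in for the embeddings of three sentences
val coolDb = floatArrayOf(0.9f, 0.1f, 0.2f)
val greatDatabase = floatArrayOf(0.8f, 0.2f, 0.25f)
val samba = floatArrayOf(0.1f, 0.9f, 0.1f)

println(cosineSimilarity(coolDb, greatDatabase) > cosineSimilarity(coolDb, samba)) // true
```
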
### Setting Up the Embedding Model
We'll use the Spring AI Transformers library to create embeddings for posts. This library provides a simple API for creating embeddings using transformer models.

In [23]:
@file:DependsOn("org.springframework.ai:spring-ai-transformers:1.0.0-RC1")
@file:DependsOn("ai.djl.huggingface:tokenizers:0.33.0")

In [24]:
import org.springframework.ai.transformers.TransformersEmbeddingModel

val embeddingModel = TransformersEmbeddingModel() // uses all-MiniLM-L6-v2 by default
embeddingModel.afterPropertiesSet()

### Creating an Embedding Handler
The helper function below turns a text into an embedding and serializes it as little-endian 32-bit floats. The handler then stores those bytes as binary data in the "textEmbedding" field of the event's hash.

In [25]:
import java.nio.ByteBuffer
import java.nio.ByteOrder

fun createEmbedding(input: String): ByteArray {
    val embedding = embeddingModel.embed(input)
    // 4 bytes per 32-bit float, written in little-endian order
    val embeddingBytes = ByteArray(Float.SIZE_BYTES * embedding.size)
    ByteBuffer.wrap(embeddingBytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().put(embedding)
    return embeddingBytes
}
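
To see what ends up in the hash field, the byte layout can be checked with a round trip in plain Kotlin. `toBytes` and `fromBytes` are hypothetical helpers for this sketch: the first repeats the serialization used in `createEmbedding`, the second reverses it.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// A 32-bit float occupies 4 bytes
val bytesPerFloat = 4

// Serialize floats as little-endian 32-bit values
fun toBytes(vector: FloatArray): ByteArray {
    val bytes = ByteArray(bytesPerFloat * vector.size)
    ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().put(vector)
    return bytes
}

// Read the same layout back into a FloatArray
fun fromBytes(bytes: ByteArray): FloatArray {
    val vector = FloatArray(bytes.size / bytesPerFloat)
    ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(vector)
    return vector
}

val originalVector = floatArrayOf(0.25f, -1.5f, 3.0f)
val vectorBytes = toBytes(originalVector)

println(vectorBytes.size)                                   // 12
println(fromBytes(vectorBytes).contentEquals(originalVector)) // true
```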

In [26]:
val createEmbedding: (Event) -> Pair<Boolean, String> = { event ->
    val key = "post:" + event.uri.replace("at://did:plc:", "")
    val embeddingBytes = createEmbedding(event.text)
    jedisPooled.hset(
        key.encodeToByteArray(),
        mapOf("textEmbedding".encodeToByteArray() to embeddingBytes)
    )
    Pair(true, "OK")
}

In [27]:
createConsumerGroup("filtered-events", "embedding-example")

Consumer group already exists


In [28]:
val bloomFilterName = "embedding-bf"
createBloomFilter(bloomFilterName)

Bloom filter already exists


In [29]:
runBlocking {
    consumeStream(
        streamName = "filtered-events",
        consumerGroup = "embedding-example",
        consumer = "embedding-1",
        handlers = listOf(deduplicate(bloomFilterName), printUri, createEmbedding),
        ackFunction = ackAndBfFn(bloomFilterName),
        count = 1,
        limit = 100
    )
}

Got event from at://did:plc:zc6bjaq5z5mzb7u36dtxfzw7/app.bsky.feed.post/3lprnpnymfh2x
Got event from at://did:plc:uuykykkdbtq2plsexeztdhpf/app.bsky.feed.post/3lprnpor5vz2v
Got event from at://did:plc:2llcip7wwsv66wu6vxrxfhsm/app.bsky.feed.post/3lprp5tj4662e
Got event from at://did:plc:xbsy5qmxblcb2tgt3cmhwnld/app.bsky.feed.post/3lprp5vcdy226
Got event from at://did:plc:xbsy5qmxblcb2tgt3cmhwnld/app.bsky.feed.post/3lprp5vcgvs26
Got event from at://did:plc:6hxrccgkh23eeygeqqiqxb6t/app.bsky.feed.post/3lprp5woyvy2h
Got event from at://did:plc:scy5fhrtswxpfuzchccp5swn/app.bsky.feed.post/3lprp63cam22t
Got event from at://did:plc:dero4a3fkusyfpiuzc7ax7o5/app.bsky.feed.post/3lprp65wb3c2l
Got event from at://did:plc:5wm7ascxvmho32pfdihilozg/app.bsky.feed.post/3lprp62epc22v
Got event from at://did:plc:2zprqnchjss7abl6sveqkx4v/app.bsky.feed.post/3lprp665inr2r
Got event from at://did:plc:mf7uetvngc5rahnfbkphher6/app.bsky.feed.post/3lprp65fkvk2q
Got event from at://did:plc:5vwjnzaibnwscbbcvkzhy57v/a

## Creating a Redis Search Index
In this section, we'll create a Redis Search index to make the enriched events searchable. Redis Search is a module that adds full-text search capabilities to Redis. It allows us to search for events based on their text, topics, and other fields.

### Creating the Index Schema in Code
Now we'll create the index schema in code. We'll use the Jedis client to create the schema and the index.

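The following schema defines the fields that will be indexed. The schema includes:
- Text fields for full-text search
- Tag fields for exact matching
- A numeric field for range queries

Note that the "textEmbedding" field is not part of this schema, so this index does not yet cover vector (semantic) search.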

```
FT.CREATE postIdx ON HASH PREFIX 1 post: SCHEMA
        parentUri     TEXT
        topics        TAG SEPARATOR "|"
        time_us       TEXT
        langs         TAG
        uri           TEXT
        operation     TAG
        did           TAG
        timeUs        NUMERIC
        rkey          TAG
        rootUri       TEXT
        text          TEXT
```

In [38]:
import redis.clients.jedis.search.IndexDefinition
import redis.clients.jedis.search.IndexOptions
import redis.clients.jedis.search.Schema
import redis.clients.jedis.search.schemafields.VectorField.VectorAlgorithm

val schema = Schema()
    .addTextField("parentUri", 1.0)
    .addTagField("topics", "|")
    .addTextField("time_us", 1.0)
    .addTagField("langs")
    .addTextField("uri", 1.0)
    .addTagField("operation")
    .addTagField("did")
    .addNumericField("timeUs")
    .addTagField("rkey")
    .addTextField("rootUri", 1.0)
    .addTextField("text", 1.0)

// Define index options (e.g., prefix)
val rule = IndexDefinition()
    .setPrefixes("post:")

// Create the index
try {
    jedisPooled.ftCreate("postIdx", IndexOptions.defaultOptions().setDefinition(rule), schema)
} catch (e: JedisDataException) {
    println("Index already exists")
}

### Searching the Index
Now that we have created the index, we can search for events based on their topics, text, and other fields. In this example, we'll search for events with the topic "ChatGPT".

Redis Search has its own query syntax. A tag field is matched exactly with `@field:{value}`, so to find events with the topic "ChatGPT" we use the query `@topics:{ChatGPT}`. A text field is searched with `@field:term`.

#### Exact Matching Search

In [31]:
// FT.SEARCH postIdx "@topics:{ChatGPT}"
val result = jedisPooled.ftSearch(
    "postIdx",
    "@topics:{ChatGPT}"
)

result.documents.forEach { post ->
    println(post.get("topics"))
    println(post.get("text"))
    println("\n")
}

ChatGPT|AI Agents|Prompt Engineering
ChatGPT referred to my OC lore as a "cosmic soap opera" and I'm now coping with the realization that this has been the best way all along to describe this story FUCKING HELP ME 😭


ChatGPT|AI Ethics
Because you posted screenshots of chatgpt as if it is a reliable source!

I am critiquing you legitimizing a tool with clear alternatives, that does not do what you claim it does, and that is killing untold numbers, right now.


ChatGPT|Prompt Engineering|AI Applications
Just to say I used ChatGPT today that saves me time from scrolling through various websites to find the answer. It was very helpful. I don't understand why using a tool is wrong - as long as the user knows the limits of the tools one uses.
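
Several of the stored topics contain spaces (for example "AI Ethics"). In a tag query, spaces and punctuation inside the value have to be escaped with a backslash. The helper below is a sketch of how such a query string could be built. `escapeTag` and `topicQuery` are hypothetical names, and the set of characters being escaped (anything other than letters, digits, and underscores) is a deliberately conservative assumption.

```kotlin
// Escape every character that is not a letter, digit, or underscore
fun escapeTag(value: String): String =
    value.replace(Regex("[^\\p{L}\\p{N}_]")) { "\\" + it.value }

// Build a tag query on the "topics" field; several values are OR-ed with "|"
fun topicQuery(vararg topics: String): String =
    "@topics:{" + topics.joinToString(" | ") { escapeTag(it) } + "}"

println(topicQuery("ChatGPT"))             // @topics:{ChatGPT}
println(topicQuery("AI Ethics"))           // @topics:{AI\ Ethics}
println(topicQuery("AI Ethics", "OpenAI")) // @topics:{AI\ Ethics | OpenAI}
```

The resulting string can be passed as the query argument of `jedisPooled.ftSearch("postIdx", ...)` in the same way as the literal query above.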




#### Full Text Search

In [32]:
// FT.SEARCH postIdx "@text:estimated to be"
val result = jedisPooled.ftSearch(
    "postIdx",
    "@text:estimated to be"
)

result.documents.forEach { post ->
    println(post.get("text"))
    println("\n")
}

“ChatGPT is now estimated to be the fifth-most visited website in the world, just after Instagram and ahead of X.”

www.technologyreview.com/2025/05/20/1...




### Listing the Most Frequent Topics with Top-K


In [37]:
import redis.clients.jedis.params.ScanParams

val jedisScanFn = { cursor: String ->
    jedisPooled.scan(cursor, ScanParams().match("topics-topk:*"), "TOPK-TYPE")
}

val keys = mutableListOf<String>()
var lastCursor = "0"
do {
    val result = jedisScanFn.invoke(lastCursor)
    lastCursor = result.cursor
    keys.addAll(result.result)
} while (lastCursor != "0")

keys.forEach {
    println(it)
    val topics = jedisPooled.topkList(it)
    println(topics)
}

topics-topk:2025-05-22T20:00
[AI Tooling, Generative Models, , Prompt Engineering, AI, AI Ethics, AI Applications, Generative AI, AI Research, Machine Learning, Anthropic, Azure ML, OpenAI, Cloud Services, Predictive Models]
topics-topk:2025-05-22T19:00
[Generative Models, AI Ethics, Prompt Engineering, , AI Applications, OpenAI, AI Tooling, AI, AI Agents, ChatGPT, Cloud Services, Generative AI, AI Research, Healthcare Technology, Google AI]
