# Enriching Filtered Events

In this notebook, we'll enrich the filtered events from the previous notebook with additional information. We'll use a combination of techniques to enrich the events:

1. Topic modeling using a Large Language Model (LLM) to extract topics from the posts
2. Creating embeddings for semantic search using a transformer model
3. Storing the enriched events in Redis for querying



Embeddings are vector representations of text that capture semantic meaning. They allow us to perform semantic search, which is a search based on meaning rather than exact keyword matching. In this notebook, we'll create embeddings for posts and store them in Redis for later querying.


In [27]:
import dev.raphaeldelio.*
%use coroutines

## Topic Modeling with Large Language Models
Topic modeling is a technique used to discover abstract topics in a collection of documents. In this notebook, we'll use a Large Language Model to extract topics from posts. This will allow us to categorize posts and make them more searchable.

### Setting Up the Ollama API Client
We'll use the Spring AI Ollama client to interact with the Ollama API.

Ollama is a tool that allows us to run large language models locally.

In [28]:
@file:DependsOn("org.springframework.ai:spring-ai-ollama:1.0.0-RC1")

The prompt we'll use for the LLM is designed to extract software-related topics from posts. The prompt includes examples of how to format the output and what types of topics to include.

In [29]:
import java.io.File

val topicModelingSystemPrompt = File("resources/topic-extractor-prompt.txt").readText()

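Next, we create the Ollama chat model. It points at a local Ollama server on its default port and uses the `deepseek-coder-v2` model.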

In [30]:
import org.springframework.ai.ollama.OllamaChatModel
import org.springframework.ai.ollama.api.OllamaApi
import org.springframework.ai.ollama.api.OllamaApi.ChatRequest
import org.springframework.ai.ollama.api.OllamaApi.Message
import org.springframework.ai.ollama.api.OllamaApi.Message.Role
import org.springframework.ai.ollama.api.OllamaOptions

val ollamaApi = OllamaApi.builder()
    .baseUrl("http://localhost:11434")
    .build()

val ollamaOptions = OllamaOptions.builder().model("deepseek-coder-v2").build()

val ollamaChatModel = OllamaChatModel.builder()
    .ollamaApi(ollamaApi)
    .defaultOptions(ollamaOptions)
    .build()

### Creating a Topic Modeling Function
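This function takes a post and a comma-separated string of already-known topics, sends both to the model through the Ollama API together with the system prompt, and returns the model's reply. The prompt asks for that reply to be a string of comma-separated topics; if the model returns no text, the function returns an empty string.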

In [31]:
import org.springframework.ai.chat.messages.SystemMessage
import org.springframework.ai.chat.messages.UserMessage
import org.springframework.ai.chat.prompt.Prompt

fun topicModeling(post: String, existingTopics: String): String {
    // Build a chat message
    val messages = listOf(
        SystemMessage(topicModelingSystemPrompt),
        UserMessage("Existing topics: $existingTopics"),
        UserMessage("Post: $post")
    )

    val response = ollamaChatModel.call(Prompt(messages))
    return response.result.output.text ?: ""
}

In [32]:
topicModeling("In 2021, Angela Merkel stepped down after 16 years as Germany’s Chancellor. In the same year, Joe Biden became the 46th U.S. President. Quiet transitions of power still matter.", "")

 ""

In [33]:
topicModeling("Brazilian samba is a great music genre for dancing", "")

 ""

### Counting how many times a topic appears

A Count-min sketch is a probabilistic data structure for estimating how often each item occurs in a stream of data.

It is particularly useful for counting occurrences in a large dataset without storing every item explicitly: it uses a fixed amount of memory, and in exchange its estimates can be higher than the true count, but never lower.
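
To make the idea concrete, here is a minimal in-memory sketch of the data structure, written as a standalone Kotlin file. It is only an illustration: the class name `TinyCountMinSketch` and the hash mixing are assumptions made for this example and are not how Redis implements it. Each item increments one counter per row, and the estimate is the smallest of those counters.

```kotlin
// Illustrative count-min sketch: `depth` rows, each with `width` counters.
class TinyCountMinSketch(private val width: Int, private val depth: Int) {
    private val counters = Array(depth) { LongArray(width) }

    // Integer mixer so that every row spreads items differently.
    private fun mix(x: Int): Int {
        var h = x
        h = h xor (h ushr 16)
        h *= 0x85ebca6bL.toInt()
        h = h xor (h ushr 13)
        h *= 0xc2b2ae35L.toInt()
        h = h xor (h ushr 16)
        return h
    }

    // Column of `item` in the given row.
    private fun column(item: String, row: Int): Int =
        Math.floorMod(mix(item.hashCode() xor mix(row + 1)), width)

    // Increment one counter per row.
    fun add(item: String, count: Long = 1) {
        for (row in 0 until depth) counters[row][column(item, row)] += count
    }

    // Collisions can only inflate counters, so the minimum is the best estimate.
    fun estimate(item: String): Long =
        (0 until depth).minOf { row -> counters[row][column(item, row)] }
}

fun main() {
    val sketch = TinyCountMinSketch(width = 3000, depth = 10)
    listOf("Kotlin", "Redis", "Kotlin", "ChatGPT", "Kotlin").forEach { sketch.add(it) }
    println(sketch.estimate("Kotlin"))   // at least 3
    println(sketch.estimate("Redis"))    // at least 1
}
```

The memory used depends only on `width` and `depth`, not on how many distinct topics are counted.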

In [34]:
import redis.clients.jedis.exceptions.JedisDataException
import java.time.LocalDateTime

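fun createCountMinSketch(): String {
    // One sketch per hour: truncate the current time to the start of the hour
    val windowBucket = LocalDateTime.now().withMinute(0).withSecond(0).withNano(0)
    val key = "topics-cms:$windowBucket"
    try {
        // 3000 counters per row (width) and 10 rows (depth)
        jedisPooled.cmsInitByDim(key, 3000, 10)
    } catch (_: JedisDataException) {
        // Expected on every call after the first one within the same hour
        println("Count-min sketch already exists")
    }

    return key
}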
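
The key naming above means that all events within the same hour share one sketch. The following standalone sketch shows the resulting key for a fixed timestamp; `hourlySketchKey` is a hypothetical helper introduced only for this illustration, and it uses `truncatedTo(ChronoUnit.HOURS)`, which is equivalent to zeroing minutes, seconds, and nanoseconds.

```kotlin
import java.time.LocalDateTime
import java.time.temporal.ChronoUnit

// Hypothetical helper: the sketch key for the hour that `now` falls into.
fun hourlySketchKey(now: LocalDateTime): String =
    "topics-cms:" + now.truncatedTo(ChronoUnit.HOURS)

fun main() {
    // Two timestamps in the same hour map to the same key
    println(hourlySketchKey(LocalDateTime.of(2025, 5, 22, 2, 5, 0)))    // topics-cms:2025-05-22T02:00
    println(hourlySketchKey(LocalDateTime.of(2025, 5, 22, 2, 37, 15)))  // topics-cms:2025-05-22T02:00
    // The next hour gets a new key, and therefore a new sketch
    println(hourlySketchKey(LocalDateTime.of(2025, 5, 22, 3, 0, 1)))    // topics-cms:2025-05-22T03:00
}
```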

### Creating a Topic Extraction Handler
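This handler extracts topics from an event's text and stores them in Redis. The topics are stored as a pipe-separated string in the "topics" field of the event's hash. The handler also increments each topic's counter in the current hour's Count-min sketch and adds the topics to the `topics` set, which is passed back to the model as the list of existing topics.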

In [35]:
val extractTopics: (Event) -> Pair<Boolean, String> = { event ->
    val existingTopics = jedisPooled.smembers("topics")
    val topics = topicModeling(event.text, existingTopics.joinToString(", "))
        .replace("\"", "")
        .replace("“", "")
        .replace("”", "")
        .split(",")
        .map { it.trim() }
        .filter { it.isNotBlank() }

    val cmsKey = createCountMinSketch()
    if (topics.isNotEmpty()) {
        jedisPooled.cmsIncrBy(cmsKey, topics.associateWith { 1 })
        jedisPooled.hset("post:" + event.uri.replace("at://did:plc:", ""), mapOf("topics" to topics.joinToString("|")))
        jedisPooled.sadd("topics", *topics.toTypedArray())
    }
    Pair(true, "OK")
}
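
The string clean-up inside the handler can be tried in isolation, without Redis or the model. In this standalone sketch, `parseTopics` is a hypothetical helper that repeats the same chain of calls, and the input string is a made-up model reply.

```kotlin
// Hypothetical helper mirroring the clean-up steps used in the handler above.
fun parseTopics(raw: String): List<String> =
    raw.replace("\"", "")       // drop straight quotes
        .replace("“", "")       // drop opening curly quotes
        .replace("”", "")       // drop closing curly quotes
        .split(",")
        .map { it.trim() }
        .filter { it.isNotBlank() }

fun main() {
    val topics = parseTopics("\"ChatGPT\", “AI Art”, Chatbots, ")
    println(topics)                    // [ChatGPT, AI Art, Chatbots]
    println(topics.joinToString("|"))  // ChatGPT|AI Art|Chatbots
    println(parseTopics(""))           // []
}
```

An empty reply yields an empty list, which is why the handler only writes to Redis when `topics.isNotEmpty()`.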

In [36]:
createConsumerGroup("filtered-events", "topic-extraction-example")

Group already exists


In [37]:
val bloomFilterName = "topic-extraction-bf"
createBloomFilter(bloomFilterName)

Bloom filter already exists


In [38]:
runBlocking {
    consumeStream(
        streamName = "filtered-events",
        consumerGroup = "topic-extraction-example",
        consumer = "topic-extraction-1",
        handlers = listOf(deduplicate(bloomFilterName), printUri, extractTopics),
        ackFunction = ackAndBfFn(bloomFilterName),
        count = 1,
        limit = 100
    )
}

Got event from at://did:plc:plmrzhbrzs5p3z5hq4ljwjhj/app.bsky.feed.post/3lppt46hvtc2k
Count-min sketch already exists
Got event from at://did:plc:bb4ohqqg3l5uqurpgvn7zaon/app.bsky.feed.post/3lppt4gxxr32m
Count-min sketch already exists
Got event from at://did:plc:lcsq3emslqunozqk4zeskggd/app.bsky.feed.post/3lppt4gog7s2x
Count-min sketch already exists
Got event from at://did:plc:hpp52ivrklmlxzotiurem2af/app.bsky.feed.post/3lppt4hwiig2t
Count-min sketch already exists
Got event from at://did:plc:e33j7rtgfl5nyre4zx4ai6zw/app.bsky.feed.post/3lppt4m2f5k2d
Count-min sketch already exists
Got event from at://did:plc:p3wtkvud3zazjpgysrxywcd5/app.bsky.feed.post/3lppt4nhz4c2y
Count-min sketch already exists
Got event from at://did:plc:2jt2hjhnujd57pr4uy3vh3mb/app.bsky.feed.post/3lppt4petfk25
Count-min sketch already exists
Got event from at://did:plc:zl7yevhfe4stpinqnb2jcmqq/app.bsky.feed.post/3lppt4qhirk2d
Count-min sketch already exists
Got event from at://did:plc:scguov34lrzd6hrorplttx2z/app
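
The `consumeStream` helper comes from an earlier notebook, so its internals are not shown here. As a rough sketch of how a handler list like the one above could be applied, the following standalone example assumes that handlers run in order and that processing of an event stops at the first handler returning `false`. That behaviour is an assumption, and `SketchEvent`, `EventHandler`, and `runHandlers` are hypothetical names used only for this illustration.

```kotlin
// Hypothetical stand-in for the notebook's Event type, with only two fields.
data class SketchEvent(val uri: String, val text: String)

typealias EventHandler = (SketchEvent) -> Pair<Boolean, String>

// Assumed behaviour: run handlers in order, stop at the first one that returns false.
fun runHandlers(event: SketchEvent, handlers: List<EventHandler>): Pair<Boolean, String> {
    for (handler in handlers) {
        val result = handler(event)
        if (!result.first) return result
    }
    return Pair(true, "OK")
}

fun main() {
    val seen = mutableSetOf<String>()   // exact set standing in for the Bloom filter
    val processed = mutableListOf<String>()

    val dedup: EventHandler = { e ->
        if (seen.add(e.uri)) Pair(true, "OK") else Pair(false, "Duplicate")
    }
    val record: EventHandler = { e ->
        processed.add(e.uri)
        Pair(true, "OK")
    }

    val event = SketchEvent("at://did:plc:example/app.bsky.feed.post/1", "hello")
    println(runHandlers(event, listOf(dedup, record)))  // (true, OK)
    println(runHandlers(event, listOf(dedup, record)))  // (false, Duplicate)
    println(processed.size)                             // 1
}
```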

## Creating Embeddings for Semantic Search
In this section, we'll create embeddings for posts so that we can run semantic search over them: search that matches on meaning rather than on exact keywords.

For example, if I search for:

"Redis is a cool db for Python devs"

I can still match:

"Redis is a great database for Python developers"

### Setting Up the Embedding Model
We'll use the Spring AI Transformers library to create embeddings for posts. This library provides a simple API for creating embeddings using transformer models.

In [39]:
@file:DependsOn("org.springframework.ai:spring-ai-transformers:1.0.0-RC1")
@file:DependsOn("ai.djl.huggingface:tokenizers:0.33.0")

In [40]:
import org.springframework.ai.transformers.TransformersEmbeddingModel

val embeddingModel = TransformersEmbeddingModel() // uses all-MiniLM-L6-v2 by default
embeddingModel.afterPropertiesSet()

### Creating an Embedding Handler
First, a helper function generates the embedding for a piece of text and serializes it as little-endian 32-bit floats. Then a handler applies that function to an event's text and stores the resulting bytes in the "textEmbedding" field of the event's hash.

In [41]:
import java.nio.ByteBuffer
import java.nio.ByteOrder

fun createEmbedding(input: String): ByteArray {
    val embedding = embeddingModel.embed(input)
    // 4 bytes per FLOAT32 value
    val embeddingBytes = ByteArray(Float.SIZE_BYTES * embedding.size)
    // Write the floats into the byte array in little-endian order
    ByteBuffer.wrap(embeddingBytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().put(embedding)
    return embeddingBytes
}
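
To see what the byte layout looks like, the serialization can be exercised on a tiny vector and reversed. This is a standalone sketch: `floatsToBytes` and `bytesToFloats` are hypothetical names, and the three-element vector is made up rather than produced by the embedding model.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Same serialization approach as above, applied to an arbitrary FloatArray.
fun floatsToBytes(values: FloatArray): ByteArray {
    val bytes = ByteArray(Float.SIZE_BYTES * values.size)
    ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().put(values)
    return bytes
}

// Inverse operation: read little-endian 32-bit floats back out of the bytes.
fun bytesToFloats(bytes: ByteArray): FloatArray {
    val floats = FloatArray(bytes.size / Float.SIZE_BYTES)
    ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(floats)
    return floats
}

fun main() {
    val original = floatArrayOf(1.0f, -0.5f, 0.25f)
    val bytes = floatsToBytes(original)
    println(bytes.size)                                             // 12
    println(bytes.take(4).joinToString(" ") { "%02x".format(it) })  // 00 00 80 3f
    println(bytesToFloats(bytes).contentEquals(original))           // true
}
```

By the same arithmetic, a 384-dimensional embedding serializes to 384 × 4 = 1536 bytes.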

In [42]:
val createEmbedding: (Event) -> Pair<Boolean, String> = { event ->
    val embeddingBytes = createEmbedding(event.text)
    jedisPooled.hset(("post:" + event.uri.replace("at://did:plc:", "")).encodeToByteArray(), mapOf("textEmbedding".encodeToByteArray() to embeddingBytes))
    Pair(true, "OK")
}
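
Both the topic handler and the embedding handler derive the hash key from the event URI in the same way, so topics and embedding end up in the same hash. A standalone sketch of that derivation, where `postKey` is a hypothetical helper name and the URI is taken from the output shown earlier:

```kotlin
// Hypothetical helper: same key derivation as used in the handlers.
fun postKey(uri: String): String = "post:" + uri.replace("at://did:plc:", "")

fun main() {
    val key = postKey("at://did:plc:plmrzhbrzs5p3z5hq4ljwjhj/app.bsky.feed.post/3lppt46hvtc2k")
    println(key)                      // post:plmrzhbrzs5p3z5hq4ljwjhj/app.bsky.feed.post/3lppt46hvtc2k
    println(key.startsWith("post:"))  // true
}
```

Every key starts with `post:`, which is the prefix a search index can use to pick up these hashes.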

In [43]:
createConsumerGroup("filtered-events", "embedding-example")

Group already exists


In [44]:
val bloomFilterName = "embedding-bf"
createBloomFilter(bloomFilterName)

Bloom filter already exists


In [45]:
runBlocking {
    consumeStream(
        streamName = "filtered-events",
        consumerGroup = "embedding-example",
        consumer = "embedding-1",
        handlers = listOf(deduplicate(bloomFilterName), printUri, createEmbedding),
        ackFunction = ackAndBfFn(bloomFilterName),
        count = 100,
        limit = 100
    )
}

Got event from at://did:plc:plmrzhbrzs5p3z5hq4ljwjhj/app.bsky.feed.post/3lppt46hvtc2k
Got event from at://did:plc:bb4ohqqg3l5uqurpgvn7zaon/app.bsky.feed.post/3lppt4gxxr32m
Got event from at://did:plc:lcsq3emslqunozqk4zeskggd/app.bsky.feed.post/3lppt4gog7s2x
Got event from at://did:plc:hpp52ivrklmlxzotiurem2af/app.bsky.feed.post/3lppt4hwiig2t
Got event from at://did:plc:e33j7rtgfl5nyre4zx4ai6zw/app.bsky.feed.post/3lppt4m2f5k2d
Got event from at://did:plc:p3wtkvud3zazjpgysrxywcd5/app.bsky.feed.post/3lppt4nhz4c2y
Got event from at://did:plc:2jt2hjhnujd57pr4uy3vh3mb/app.bsky.feed.post/3lppt4petfk25
Got event from at://did:plc:zl7yevhfe4stpinqnb2jcmqq/app.bsky.feed.post/3lppt4qhirk2d
Got event from at://did:plc:scguov34lrzd6hrorplttx2z/app.bsky.feed.post/3lppt4s6x6s2z
Got event from at://did:plc:5txuklpddqgrlxcozcq6r5da/app.bsky.feed.post/3lppt4zcqas2g
Got event from at://did:plc:b7bvfaw67cfhiygh7py6ld5w/app.bsky.feed.post/3lppt52umb52e
Got event from at://did:plc:xpvvqpow43ekyqmeuvutpbgl/a

## Creating a Redis Search Index
In this section, we'll create a Redis Search index to make the enriched events searchable. Redis Search is a module that adds secondary indexing and querying to Redis, including full-text, tag, numeric, and vector search. It allows us to search for events based on their text, topics, and other fields.

### Creating the Index Schema in Code
Now we'll create the index schema in code. We'll use the Jedis client to create the schema and the index.

The following schema defines the fields that will be indexed. The schema includes:
- Text fields for full-text search
- Tag fields for exact matching
- Vector fields for semantic search

```
FT.CREATE postIdx ON HASH PREFIX 1 post: SCHEMA
        parentUri     TEXT
        topics        TAG SEPARATOR "|"
        time_us       TEXT
        langs         TAG
        uri           TEXT
        operation     TAG
        did           TAG
        timeUs        NUMERIC
        rkey          TAG
        textEmbedding VECTOR HNSW 6
            DIM 384
            TYPE FLOAT32
            DISTANCE_METRIC COSINE
        rootUri       TEXT
        text          TEXT
```

In [46]:
import redis.clients.jedis.exceptions.JedisDataException
import redis.clients.jedis.search.IndexDefinition
import redis.clients.jedis.search.IndexOptions
import redis.clients.jedis.search.Schema

val schema = Schema()
    .addTextField("parentUri", 1.0)
    .addTagField("topics", "|")
    .addTextField("time_us", 1.0)
    .addTagField("langs")
    .addTextField("uri", 1.0)
    .addTagField("operation")
    .addTagField("did")
    .addNumericField("timeUs")
    .addTagField("rkey")
    .addHNSWVectorField(
        "textEmbedding",
        mapOf(
            "type" to "FLOAT32",
            "dim" to "384",
            "distance_metric" to "COSINE",
        )
    )
    .addTextField("rootUri", 1.0)
    .addTextField("text", 1.0)

// Define index options (e.g., prefix)
val rule = IndexDefinition()
    .setPrefixes("post:")

// Create the index
try {
    jedisPooled.ftCreate("postIdx", IndexOptions.defaultOptions().setDefinition(rule), schema)
} catch (e: JedisDataException) {
    println("Index already exists")
}

Index already exists


### Searching the Index
Now that we have created the index, we can search for events based on their topics, text, and other fields. In this example, we'll search for events with the topic "ChatGPT".

Redis Search has its own query syntax. For example, to search for events with the topic "ChatGPT", we use the query `@topics:{ChatGPT}`: the `@topics` prefix selects the field, and the curly braces mark a tag match.

#### Exact Matching Search

In [47]:
// FT.SEARCH postIdx "@topics:{ChatGPT}"
val result = jedisPooled.ftSearch(
    "postIdx",
    "@topics:{ChatGPT}"
)

result.documents.forEach { post ->
    println(post.get("topics"))
    println(post.get("text"))
    println("\n")
}

ChatGPT|AI Art|Chatbots
Chatgpt 畫“選腎與熊”


ChatGPT|AI Ethics
According to ChatGPT 4, it’s .13 gallons. So either ChatGPT is lying or God is lying.


ChatGPT|AI Ethics
"ChatGPT is now estimated to be the fifth-most visited website in the world, just after Instagram and ahead of X."

www.similarweb.com/top-websites/
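
Some of the stored topics, such as "AI Art", contain a space. In a tag query, spaces and punctuation inside the tag value generally need to be escaped with a backslash. The following standalone sketch builds such a query string; `escapeTag` and `topicQuery` are hypothetical helpers, and escaping every character other than letters, digits, and underscores is a deliberately broad assumption rather than the exact rule set of the query parser.

```kotlin
// Hypothetical helper: backslash-escape everything except letters, digits and underscores.
fun escapeTag(value: String): String =
    Regex("[^\\p{L}\\p{N}_]").replace(value) { match -> "\\" + match.value }

// Hypothetical helper: build a tag query on the "topics" field.
fun topicQuery(topic: String): String = "@topics:{${escapeTag(topic)}}"

fun main() {
    println(topicQuery("ChatGPT"))           // @topics:{ChatGPT}
    println(topicQuery("AI Art"))            // @topics:{AI\ Art}
    println(topicQuery("machine_learning"))  // @topics:{machine_learning}
}
```

The resulting string can then be passed as the query argument of `ftSearch`.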




#### Full Text Search

In [48]:
// FT.SEARCH postIdx "@text:(estimated to be)"
// The parentheses apply the @text modifier to every term, not only the first one
val result = jedisPooled.ftSearch(
    "postIdx",
    "@text:(estimated to be)"
)

result.documents.forEach { post ->
    println(post.get("text"))
    println("\n")
}

"ChatGPT is now estimated to be the fifth-most visited website in the world, just after Instagram and ahead of X."

www.similarweb.com/top-websites/




#### Vector Similarity Search

In [49]:
import redis.clients.jedis.search.Query

// Embed the search text with the same model that was used for the posts
val vector: ByteArray = createEmbedding("ChatGPT")

// "*" matches every document; KNN then keeps the $K nearest vectors
val queryString = "* =>[KNN \$K @textEmbedding \$BLOB AS similarityScore]"

val query = Query(queryString)
    .addParam("K", 1) // return only the nearest post
    .addParam("BLOB", vector)
    .returnFields("uri", "text", "similarityScore")
    // similarityScore holds the cosine distance, so ascending order puts the closest post first
    .setSortBy("similarityScore", true)
    .dialect(2)

val result = jedisPooled.ftSearch(
    "postIdx",
    query
)

result.documents.forEach { doc ->
    // "topics" is not listed in returnFields above, so this prints null
    println(doc.get("topics"))
    println(doc.get("text"))
    println("\n")
}

null
Chatgpt 畫“選腎與熊”




#### Pre-filtering

In [50]:
// FT.SEARCH postIdx "@topics:{ChatGPT}=>[KNN 5 @textEmbedding $BLOB AS similarityScore]"
val vector: ByteArray = createEmbedding("Open source is wizardry stuff")

// The tag filter runs first; KNN is then applied only to the posts that match it
val queryString = "@topics:{ChatGPT} =>[KNN \$K @textEmbedding \$BLOB AS similarityScore]"

val query = Query(queryString)
    .addParam("K", 5) // Top 5 results (fewer if the pre-filter matches fewer posts)
    .addParam("BLOB", vector)
    .returnFields("uri", "text", "similarityScore")
    // similarityScore holds the cosine distance, so ascending order puts the closest post first
    .setSortBy("similarityScore", true)
    .dialect(2)

val result = jedisPooled.ftSearch(
    "postIdx",
    query
)

result.documents.forEach { doc ->
    // "topics" is not listed in returnFields above, so this prints null
    println(doc.get("topics"))
    println(doc.get("text"))
    println("\n")
}

null
"ChatGPT is now estimated to be the fifth-most visited website in the world, just after Instagram and ahead of X."

www.similarweb.com/top-websites/


null
According to ChatGPT 4, it’s .13 gallons. So either ChatGPT is lying or God is lying.


null
Chatgpt 畫“選腎與熊”
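
The query above returned three posts even though K was 5, because only three posts passed the topic filter. The same filter-then-rank logic can be sketched in memory. Everything in this standalone example is illustrative: `ToyPost`, `cosineDistance`, and `filteredKnn` are hypothetical names, the two-dimensional vectors are made up, and a real index uses HNSW rather than a full sort.

```kotlin
import kotlin.math.sqrt

// Hypothetical in-memory stand-in for an indexed post.
class ToyPost(val text: String, val topics: Set<String>, val embedding: FloatArray)

// Cosine distance = 1 - cosine similarity; smaller means closer.
fun cosineDistance(a: FloatArray, b: FloatArray): Double {
    require(a.size == b.size) { "Vectors must have the same dimension" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        dot += a[i].toDouble() * b[i]
        normA += a[i].toDouble() * a[i]
        normB += b[i].toDouble() * b[i]
    }
    return 1.0 - dot / (sqrt(normA) * sqrt(normB))
}

// Keep only posts carrying the topic, then return the k closest ones.
fun filteredKnn(posts: List<ToyPost>, topic: String, query: FloatArray, k: Int): List<ToyPost> =
    posts.filter { topic in it.topics }
        .sortedBy { cosineDistance(it.embedding, query) }
        .take(k)

fun main() {
    val posts = listOf(
        ToyPost("a", setOf("ChatGPT"), floatArrayOf(1f, 0f)),
        ToyPost("b", setOf("ChatGPT", "AI Ethics"), floatArrayOf(0f, 1f)),
        ToyPost("c", setOf("Samba"), floatArrayOf(0.6f, 0.8f)),
    )
    // "c" is identical to the query vector but is excluded by the topic filter
    val nearest = filteredKnn(posts, "ChatGPT", floatArrayOf(0.6f, 0.8f), k = 5)
    println(nearest.map { it.text })  // [b, a]
}
```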




#### Counting Topic Occurrences with Count-min Sketch


In [51]:
import redis.clients.jedis.params.ScanParams

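// Scan for the hourly sketches, restricted to keys of the Count-min sketch type
val jedisScanFn = { cursor: String ->
    jedisPooled.scan(cursor, ScanParams().match("topics-cms:*"), "CMSk-TYPE")
}

// SCAN is cursor-based: keep going until the server returns cursor "0"
val keys = mutableListOf<String>()
var lastCursor = "0"
do {
    val result = jedisScanFn(lastCursor)
    lastCursor = result.cursor
    keys.addAll(result.result)
} while (lastCursor != "0")

// Print the estimated count of "ChatGPT" in each hourly window
keys.forEach {
    println(it)
    val count = jedisPooled.cmsQuery(it, "ChatGPT")
    println(count)
}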

topics-cms:2025-05-22T02:00
[1]
topics-cms:2025-05-22T01:00
[2]
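
Each hourly sketch returns a list with one estimate per queried item. To get an estimate for a longer period, the per-window values can be added up on the client. In this standalone sketch, `totalAcrossWindows` is a hypothetical helper, and the input map reuses the two results printed above.

```kotlin
// Hypothetical helper: add up one topic's estimate across all hourly sketches.
// Each list holds the estimates returned for a single key; only one item was queried.
fun totalAcrossWindows(perWindow: Map<String, List<Long>>): Long =
    perWindow.values.sumOf { it.firstOrNull() ?: 0L }

fun main() {
    val perWindow = mapOf(
        "topics-cms:2025-05-22T02:00" to listOf(1L),
        "topics-cms:2025-05-22T01:00" to listOf(2L),
    )
    println(totalAcrossWindows(perWindow))  // 3
}
```

Because each per-window value can only overestimate, the sum is an upper bound on the true total. Redis also provides `CMS.MERGE` for combining sketches of identical dimensions on the server side.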
