# Data Analysis with AI

In this notebook, we'll analyze the enriched events from the previous notebooks. We'll combine Redis, vector search, and a Large Language Model (LLM) to build a simple question-answering system over that data.

## Overview of Previous Notebooks

In the previous notebooks, we've built a pipeline for processing Bluesky posts:

1. **JetStream Consumer**: We consumed Bluesky's Jetstream WebSocket and inserted events into a Redis Stream.
2. **JetStream Filtering**: We filtered events using a Redis Bloom filter for deduplication and a machine learning model for content-based filtering.
3. **Events Enrichment**: We enriched the filtered events with topic modeling and embeddings for semantic search.

## What We'll Build in This Notebook

In this notebook, we'll build a simple question-answering system that can:

1. Identify trending topics in the posts
2. Perform semantic search to understand user queries
3. Summarize posts about specific topics using a Large Language Model (LLM)
4. Route different types of user queries to the appropriate handler

This demonstrates how to combine Redis, vector search, and LLMs to build an intelligent data analysis system.

## Setting Up the Environment

First, let's import the necessary libraries and set up our environment. We'll need:

1. Helper functions from previous notebooks
2. Ktor client for HTTP requests
3. Serialization for JSON parsing
4. Coroutines for asynchronous programming


In [1]:
%use ktor-client
%use serialization
%use coroutines

## Semantic Routing with Vector Search

To build our question-answering system, we need to understand what the user is asking. We'll use a technique called semantic routing to classify user queries into different categories.

For example, if a user asks "What's trending right now?", we want to route this to our trending topics handler. If they ask "What are people saying about Trump?", we want to route this to our summarization handler.

We'll use vector search to match user queries to predefined routes. First, let's define some example queries for the trending topics route:


In [2]:
val trendingTopicsRoute = listOf(
    "What are the most mentioned topics?",
    "What's trending right now?",
    "What’s hot in the network",
    "Top topics?",
)

## Setting Up the Vector Store

To route queries, we need somewhere to store the example queries and search them by similarity. A vector store is a database that stores vector embeddings and supports efficient similarity search.

We'll use Redis as our vector store and Spring AI to create the embeddings. First, let's set up the embedding model:


In [3]:
import org.springframework.ai.transformers.TransformersEmbeddingModel

val embeddingModel = TransformersEmbeddingModel()
embeddingModel.setModelResource("file:resources/model/bge-large-en-v1.5/model.onnx")
embeddingModel.setTokenizerResource("file:resources/model/bge-large-en-v1.5/tokenizer.json")
embeddingModel.afterPropertiesSet()

## Configuring the Redis Vector Store

Now, let's configure the Redis vector store. We'll use Spring AI's RedisVectorStore, which provides a high-level interface for storing and searching vector embeddings in Redis.

The configuration includes:
- The index name for our vector store
- The field names for content and embeddings
- Metadata fields for storing additional information
- The prefix for our keys in Redis
- The vector algorithm to use for similarity search (FLAT in this case)


In [4]:
import dev.raphaeldelio.*
import org.springframework.ai.vectorstore.redis.RedisVectorStore
import org.springframework.ai.vectorstore.redis.RedisVectorStore.MetadataField
import redis.clients.jedis.search.Schema.FieldType

val redisVectorStore = RedisVectorStore.builder(jedisPooled, embeddingModel)
    .indexName("routeIdx")
    .contentFieldName("text")
    .embeddingFieldName("textEmbedding")
    .metadataFields(
        MetadataField("route", FieldType.TEXT),
        MetadataField("minThreshold", FieldType.NUMERIC),
    )
    .prefix("route:")
    .initializeSchema(true)
    .vectorAlgorithm(RedisVectorStore.Algorithm.FLAT)
    .build()
redisVectorStore.afterPropertiesSet()
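
To make the FLAT choice concrete, here is a minimal in-memory sketch of what a flat (brute-force) index does: it scores the query vector against every stored vector and keeps the best `k`. It assumes cosine similarity as the metric. `StoredVector`, `cosineSimilarity`, and `flatSearch` are hypothetical names used only for illustration, not part of Spring AI or Redis, and the 3-dimensional vectors are toy values.

```kotlin
import kotlin.math.sqrt

// Cosine similarity between two equal-length vectors: dot(a, b) / (|a| * |b|)
fun cosineSimilarity(a: FloatArray, b: FloatArray): Double {
    require(a.size == b.size) { "Vectors must have the same dimension" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

// Hypothetical in-memory stand-in for a stored route document
class StoredVector(val text: String, val embedding: FloatArray)

// FLAT-style search: score every stored vector, then keep the k best
fun flatSearch(query: FloatArray, stored: List<StoredVector>, k: Int): List<Pair<String, Double>> =
    stored.map { it.text to cosineSimilarity(query, it.embedding) }
        .sortedByDescending { it.second }
        .take(k)

fun main() {
    // Toy 3-dimensional vectors; real bge-large-en-v1.5 embeddings have 1024 dimensions
    val stored = listOf(
        StoredVector("What's trending right now?", floatArrayOf(0.9f, 0.1f, 0.0f)),
        StoredVector("Any chatter about {topics}?", floatArrayOf(0.1f, 0.9f, 0.2f)),
    )
    val query = floatArrayOf(0.8f, 0.2f, 0.0f)
    flatSearch(query, stored, k = 1).forEach { (text, score) -> println("$text -> $score") }
}
```

A flat scan is exact but its cost grows linearly with the number of stored vectors, which is fine for a few dozen route samples like ours.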

## Creating and Storing Route Documents

Now that we have our vector store set up, we need to create documents for our routes and store them in the vector store. Each document represents a possible user query and contains:

- The route it belongs to (e.g., "trending_topics")
- The text of the query (e.g., "What's trending right now?")
- A minimum similarity score that a query must exceed to match this route (to avoid false positives)

We'll create a function to create these documents and another function to store them in Redis:


In [5]:
import org.springframework.ai.document.Document
import java.util.UUID

fun storeRouteDocumentsInRedis(routeName: String, minThreshold: Double, routeSamples: List<String>) {
    // One document per sample query, all tagged with the same route and threshold
    val routeDocuments = routeSamples.map { text ->
        createRouteDocument(routeName, text, minThreshold)
    }

    redisVectorStore.add(routeDocuments)
}

fun createRouteDocument(route: String, text: String, minThreshold: Double): Document {
    return Document(
        UUID.randomUUID().toString(),
        text,
        mapOf(
            "route" to route,
            "text" to text,
            "minThreshold" to minThreshold,
        )
    )
}

storeRouteDocumentsInRedis("trending_topics", 0.9, trendingTopicsRoute)

## Testing Vector Search

Let's test our vector store by searching for a query similar to the ones we've stored. We'll use the `similaritySearch` method to find the most similar document to our query:


In [7]:
import org.springframework.ai.vectorstore.SearchRequest

val query = "Hey Dev Bubble. What's trending today? Excited to hear the news!"

redisVectorStore.similaritySearch(
    SearchRequest.builder()
        .topK(1)
        .query(query)
        .build()
)?.forEach { document ->
    println("Matched route: " + document.metadata["route"])
    println("Matched text: " + document.text)
    println("Min threshold: " + document.metadata["minThreshold"])
    println("Score: " + document.score)
    println()
}

Matched route: trending_topics
Matched text: What's trending right now?
Min threshold: 0.9
Score: 0.9052984714508057



## Route Matching

Now that we have our vector store set up and tested, we need to create a function to match user queries to routes. This function will:

1. Break the user query into clauses (to handle complex queries)
2. For each clause, find the most similar document in our vector store
3. Check if the similarity score is above the minimum threshold
4. Return the set of matched routes


In [8]:
import redis.clients.jedis.search.FTSearchParams
import redis.clients.jedis.search.Query

fun matchRoute(query: String): Set<String> {
    return breakSentenceIntoClauses(query).flatMap { clause ->
        // similaritySearch may return null, so fall back to an empty list
        val result = redisVectorStore.similaritySearch(
            SearchRequest.builder()
                .topK(1)
                .query(clause)
                .build()
        ).orEmpty()

        result.forEach { document ->
            println(clause)
            println("Matched route: " + document.metadata["route"])
            println("Matched text: " + document.text)
            println("Min threshold: " + document.metadata["minThreshold"])
            println("Score: " + document.score)
            println()
        }

        // Keep a match only if its score exceeds the threshold stored with it
        result.filter { document ->
            val minThreshold = document.metadata["minThreshold"]?.toString()?.toDoubleOrNull()
            minThreshold != null && (document.score ?: 0.0) > minThreshold
        }.mapNotNull { it.metadata["route"] as? String }
    }.toSet()
}

fun breakSentenceIntoClauses(sentence: String): List<String> {
    return sentence.split(Regex("""[!?,.:;()"\[\]{}]+"""))
        .filter { it.isNotBlank() }.map { it.trim() }
}
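
To see what the clause splitter produces before any vector search happens, the snippet below runs the same regular expression on the sample query used in the next section. The function is repeated under the hypothetical name `splitIntoClauses` only so the snippet runs on its own.

```kotlin
// Same splitting logic as breakSentenceIntoClauses, repeated so the snippet is self-contained
fun splitIntoClauses(sentence: String): List<String> =
    sentence.split(Regex("""[!?,.:;()"\[\]{}]+"""))
        .filter { it.isNotBlank() }
        .map { it.trim() }

fun main() {
    println(splitIntoClauses("Hey DevBubble, what's trending in the network? Let me know!!"))
    // prints [Hey DevBubble, what's trending in the network, Let me know]
}
```

Each clause is searched separately, so a greeting or sign-off gets its own (low) score instead of being embedded together with the actual question.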

## Testing Route Matching

Let's test our route matching function with a sample query:


In [10]:
matchRoute("Hey DevBubble, what's trending in the network? Let me know!!")

Hey DevBubble
Matched route: trending_topics
Matched text: What’s hot in the network
Min threshold: 0.9
Score: 0.7936378717422485

what's trending in the network
Matched route: trending_topics
Matched text: What's trending right now?
Min threshold: 0.9
Score: 0.9403464198112488

Let me know
Matched route: trending_topics
Matched text: What's trending right now?
Min threshold: 0.9
Score: 0.8138612508773804



[trending_topics]
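
The threshold check behind this result can be sketched without Redis. The snippet below uses the scores from the output above (rounded) and a hypothetical `ClauseMatch` holder. Only the second clause clears the 0.9 threshold, which is why a single route comes back.

```kotlin
// Hypothetical holder for the nearest-neighbour result of one clause
data class ClauseMatch(val clause: String, val route: String, val score: Double, val minThreshold: Double)

// Keep only the routes whose best match scored above that route's own threshold
fun acceptedRoutes(matches: List<ClauseMatch>): Set<String> =
    matches.filter { it.score > it.minThreshold }
        .map { it.route }
        .toSet()

fun main() {
    // Scores copied (rounded) from the recorded output
    val matches = listOf(
        ClauseMatch("Hey DevBubble", "trending_topics", 0.7936, 0.9),
        ClauseMatch("what's trending in the network", "trending_topics", 0.9403, 0.9),
        ClauseMatch("Let me know", "trending_topics", 0.8139, 0.9),
    )
    println(acceptedRoutes(matches)) // prints [trending_topics]
}
```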

## Implementing Trending Topics

Now that we have our route matching function, let's implement the trending topics handler. This handler will:

1. Truncate the current time to the hour, which identifies the Top-K structure for the current time window
2. Read the list of top topics from that Top-K structure in Redis
3. Append an instruction telling the LLM to report the topics without guessing what is being said about them
4. Return everything as a set


In [11]:
import org.springframework.ai.chat.messages.SystemMessage
import org.springframework.ai.chat.messages.UserMessage
import org.springframework.ai.chat.prompt.Prompt
import java.time.LocalDateTime

fun trendingTopics(): Set<String> {
    // The key is built from the current time truncated to the hour
    val currentHour = LocalDateTime.now().withMinute(0).withSecond(0).withNano(0).toString()
    val topTopics = jedisPooled.topkList("topics-topk:$currentHour")
    // Appended so the LLM reports the topics without inventing details about them
    topTopics.add("These are the most mentioned topics. Don't try to guess what's being said in the topics. Just say that these are the most mentioned topics.")
    return topTopics.toSet()
}

## Testing Trending Topics

Let's test our trending topics function:


In [12]:
trendingTopics()

[Generative Models, Prompt Engineering, AI, OpenAI, AI Tooling, Azure Cloud, HPC, Web Application Development, Supercomputers, AI Model, These are the most mentioned topics. Don't try to guess what's being said in the topics. Just say that these are the most mentioned topics.]
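
The key passed to `topkList` is the current time truncated to the hour, so every request within the same hour reads the same Top-K structure. The sketch below shows how such a key is derived. `topKKeyFor` is a hypothetical helper and the timestamp is an arbitrary example.

```kotlin
import java.time.LocalDateTime
import java.time.temporal.ChronoUnit

// Builds the Top-K key for the hour that contains `time`
fun topKKeyFor(time: LocalDateTime): String =
    "topics-topk:" + time.truncatedTo(ChronoUnit.HOURS)

fun main() {
    println(topKKeyFor(LocalDateTime.of(2025, 5, 23, 14, 37, 12)))
    // prints topics-topk:2025-05-23T14:00
}
```

`truncatedTo(ChronoUnit.HOURS)` has the same effect as the `withMinute(0).withSecond(0).withNano(0)` chain used in `trendingTopics`.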

## Creating a Trending Topics Handler

Now that we have our trending topics function, let's create a handler that can be used by our query router. This handler will:

1. Take a route and a query as input
2. If the route is "trending_topics", call our trendingTopics function
3. Otherwise, return an empty list


In [16]:
import dev.raphaeldelio.*

val trendingTopicsHandler: (String, String) -> Iterable<String> = { route, _ ->
    when (route) {
        "trending_topics" -> trendingTopics()
        else -> emptyList()
    }
}

## Processing User Requests

Now that we have our trending topics handler, let's create a function to process user requests. This function will:

1. Take a user query and a handler function as input
2. Use our matchRoute function to determine which routes match the query
3. Call the handler function for each matched route to get the relevant data
4. Use a Large Language Model to generate a response based on the user query and the data

The LLM will help us generate a natural language response that summarizes the data in a concise way.


In [37]:
fun processUserRequest(
    query: String,
    routesHandler: (String, String) -> Iterable<String>
): String {
    val routes = matchRoute(query)
    println(routes)

    if (routes.isEmpty()) {
        return "Sorry, I couldn't find any relevant information from your post. Try asking what's trending or what people are saying about a specific topic."
    }

    val enrichedData = routes.map { route -> routesHandler(route, query) }
    println(enrichedData + "\n")

    val systemPrompt = "You are an AI assistant that analyzes social media posts about artificial intelligence. You may receive datasets to support your analysis. Respond in a single paragraph with a maximum of 300 characters—like a tweet. Your answer must be concise, informative, and context-aware. Include relevant insights, trends, or classifications, but never exceed 300 characters. Avoid filler, repetition, or unnecessary explanation. Prioritize clarity, accuracy, and relevance. If unsure, default to brief summaries or best-effort classification. Your goal is to help users quickly understand or categorize AI-related content."

    println("LLM Response:")
    return ollamaChatModel.call(
        Prompt(
            SystemMessage(systemPrompt),
            SystemMessage("Enriching data: $enrichedData"),
            UserMessage("User query: $query")
        )
    ).result.output.text ?: ""
}

## Testing User Requests

Let's test our processUserRequest function with a sample query:


In [19]:
processUserRequest("What's trending in bluesky now?", trendingTopicsHandler)

What's trending in bluesky now
Matched route: trending_topics
Matched text: What's trending right now?
Min threshold: 0.9
Score: 0.9147495031356812

[trending_topics]
[[Generative Models, Prompt Engineering, AI, OpenAI, AI Tooling, Azure Cloud, HPC, Web Application Development, Supercomputers, AI Model, These are the most mentioned topics. Don't try to guess what's being said in the topics. Just say that these are the most mentioned topics.], 
]
LLM Response:


 Bluesky is currently seeing a surge in mentions of generative models, prompt engineering, and AI tooling. These areas represent significant advancements and applications within the broader field of artificial intelligence, showcasing innovation and potential uses across various sectors from cloud computing to supercomputers, reflecting a dynamic shift in technology trends.

## Implementing Summarization

In addition to trending topics, we also want to be able to summarize posts about specific topics. For example, if a user asks "What are people saying about Trump?", we want to find posts about Trump and summarize them.

First, let's define some example queries for the summarization route:


In [20]:
val summarizationRoute = listOf(
    "What are people saying about {topics}?",
    "What’s the buzz around {topics}?",
    "Any chatter about {topics}?",
    "What are folks talking about regarding {topics}?",
    "What’s being said about {topics} lately?",
    "What have people been posting about {topics}?",
    "What's trending in conversations about {topics}?",
    "What’s the latest talk on {topics}?",
    "Any recent posts about {topics}?",
    "What's the sentiment around {topics}?",
    "What are people saying about {topic1} and {topic2}?",
    "What are folks talking about when it comes to {topic1}, {topic2}, or both?",
    "What’s being said about {topic1}, {topic2}, and others?",
    "Is there any discussion around {topic1} and {topic2}?",
    "How are people reacting to both {topic1} and {topic2}?",
    "What’s the conversation like around {topic1}, {topic2}, or related topics?",
    "Are {topic1} and {topic2} being discussed together?",
    "Any posts comparing {topic1} and {topic2}?",
    "What's trending when it comes to {topic1} and {topic2}?",
    "What are people saying about the relationship between {topic1} and {topic2}?"
)

## Storing Summarization Routes

Now that we've defined our summarization routes, let's store them in our vector store:


In [21]:
storeRouteDocumentsInRedis("summarization", 0.8, summarizationRoute)

## Implementing the Summarization Function

Now let's implement the summarization function. This function will:

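1. Extract topics from the user query using our topic modeling function
2. For each topic, fetch the 10 most recent posts tagged with it (or the 10 most recent posts overall if no topics were extracted)
3. Return the text of those posts, or an instruction for the LLM if nothing was found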


In [29]:
import org.springframework.ai.chat.messages.SystemMessage
import org.springframework.ai.chat.messages.UserMessage
import org.springframework.ai.chat.prompt.Prompt

fun summarization(userQuery: String): List<String> {
    // joinToString(", ") sets the separator; a trailing lambda would instead replace every element with ", "
    val existingTopics = jedisPooled.smembers("topics").joinToString(", ")
    val queryTopics = topicExtraction(userQuery, existingTopics)
        .replace("\"", "")
        .replace("“", "")
        .replace("”", "")
        .split(", ")
        .map { it.trim() }
        .filter { it.isNotBlank() } // an empty extraction would otherwise yield [""]
    println(queryTopics)

    // Returns the text of the 10 most recent posts matching a search query string
    fun searchPosts(queryString: String): List<String> {
        val query = Query(queryString)
            .returnFields("text")
            .setSortBy("time_us", false)
            .dialect(2)
            .limit(0, 10)

        return jedisPooled.ftSearch("postIdx", query).documents.map { document ->
            document.get("text").toString()
        }
    }

    val posts = if (queryTopics.isEmpty()) {
        // No topics extracted: fall back to the most recent posts overall
        searchPosts("*")
    } else {
        queryTopics.flatMap { topic -> searchPosts("@topics:{'$topic'}") }
    }

    return if (posts.isEmpty()) {
        listOf("Nothing was found. Say that nothing was mentioned about these topics.")
    } else {
        posts
    }
}
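
The topic-cleaning step can be tried in isolation. This sketch assumes, as the `split(", ")` call suggests, that `topicExtraction` returns a comma-separated string that may contain straight or curly quotes. `parseTopics` is a hypothetical helper, not part of the notebook's helper library.

```kotlin
// Turns a comma-separated topic string into a clean list of topic names
fun parseTopics(raw: String): List<String> =
    raw.replace("\"", "")
        .replace("“", "")
        .replace("”", "")
        .split(",")
        .map { it.trim() }
        .filter { it.isNotBlank() }

fun main() {
    println(parseTopics("\"Prompt Engineering\", “AI Strategy”")) // prints [Prompt Engineering, AI Strategy]
    println(parseTopics(""))                                       // prints []
}
```

Filtering out blank entries matters because splitting an empty string still yields one empty element, which would otherwise be treated as a topic to search for.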

## Testing the Summarization Function

Let's test our summarization function with a sample query:


In [38]:
summarization("What's being said about Prompt Engineering?")

[Prompt Engineering, AI Strategy]


[I guess you’re unaware of how AI art works. I created this piece using art tools in a system called #Midjourney7 from a text called the prompt and various previously made images. Everything is designed by me but I didn’t write the software that converts it all to the final image. It’s complicated., chatgpt.com/s/m_6830566a... AI, I'm gonna bring up this example when I get pitched by some vendor at Plork about spending money to show up on AI Overview, chatgpt.com/s/m_68305649... AI, did you seriously need chatgpt to make this image? the illustration with text on it? that's the level of laziness on display?, I think as AI helps us look at genes for depression, the line between our actual biology and these tech predictions could get really blurry. This overlap makes us question who we are and how private our data truly is., With AI becoming more entangled with more industries, Smartass Publishers want to be as transparent as possible. As we deem the written word as just an offshoot of th

## Creating a Multi-Handler

Now that we have both trending topics and summarization handlers, let's create a combined handler that can handle both types of queries:


In [32]:
val multiHandler: (String, String) -> Iterable<String> = { route, query ->
    when (route) {
        "trending_topics" -> trendingTopics()
        "summarization" -> summarization(query)
        else -> emptyList()
    }
}

## Testing the Complete System

Now that we have our complete system, let's test it with different types of queries:


In [40]:
processUserRequest("What's being said about ChatGPT?", multiHandler)

What's being said about ChatGPT
Matched route: summarization
Matched text: What are people saying about {topics}?
Min threshold: 0.8
Score: 0.8501898050308228

[summarization]
[ChatGPT, AI Conversations, Generative AI, OpenAI]
[[Vercel Releases v0 AI Model for Web Application Development, Compatible with OpenAI API Vercel, the company behind the vibe coding platform for web application development, v0, is now releasing an artificial intelligence (AI) model. Announced on Thursda...

| Details | Interest | Feed |, Wow Claude 4.0 😍 , we are in a different world !, OpenAI vs. R/Whereintheworld Article URL: https://www.whereisthisphoto.com/blog/openai-model-image-analysis Comments URL: https://news.ycombinator.com/item?id=44071449 Points: 2 # Comments: 1 

| Details | Interest | Feed |, did you seriously need chatgpt to make this image? the illustration with text on it? that's the level of laziness on display?, Arrivano Claude Sonnet 4 e Opus 4: gli LLM più potenti di Anthropic, ma meglio e

 People are excited about Claude 4.0, a powerful AI model from Anthropic. They also discuss the laziness of using ChatGPT for simple tasks and mention that ChatGPT has been used to create an image with text on it.

In [41]:
processUserRequest("What's trending now'?", multiHandler)

What's trending now'
Matched route: trending_topics
Matched text: What's trending right now?
Min threshold: 0.9
Score: 0.9872353076934814

[trending_topics]
[[Generative Models, Prompt Engineering, AI, OpenAI, AI Tooling, Azure Cloud, HPC, Web Application Development, Supercomputers, AI Model, These are the most mentioned topics. Don't try to guess what's being said in the topics.], 
]
LLM Response:


 The latest buzz in AI is about generative models and prompt engineering, with OpenAI leading the trend. Meanwhile, tech giants like Microsoft Azure Cloud are pushing boundaries with their HPC (High Performance Computing) solutions for supercomputers and advanced AI model development. Web application developers are also jumping on this bandwagon to create innovative tools leveraging these technologies seamlessly.

In [42]:
processUserRequest("What's on for lunch?", multiHandler)

What's on for lunch
Matched route: trending_topics
Matched text: What's trending right now?
Min threshold: 0.9
Score: 0.8346343040466309

[]


Sorry, I couldn't find any relevant information from your post. Try asking what's trending or what people are saying about a specific topic.