# Data Analysis with AI

In this notebook, we'll analyze the enriched events from the previous notebooks. We'll combine vector search, Redis data structures, and a Large Language Model (LLM) to build a simple question-answering system over that data.

## Overview of Previous Notebooks

In the previous notebooks, we've built a pipeline for processing Bluesky posts:

1. **JetStream Consumer**: We consumed Bluesky's Jetstream WebSocket and inserted events into a Redis Stream.
2. **JetStream Filtering**: We filtered events using Redis Bloom Filter for deduplication and a machine learning model for content-based filtering.
3. **Events Enrichment**: We enriched the filtered events with topic modeling and embeddings for semantic search.

## What We'll Build in This Notebook

In this notebook, we'll build a simple question-answering system that can:

1. Identify trending topics in the posts
2. Perform semantic search to understand user queries
3. Summarize posts about specific topics using a Large Language Model (LLM)
4. Route different types of user queries to the appropriate handler

This demonstrates how to combine Redis, vector search, and LLMs to build an intelligent data analysis system.

## Setting Up the Environment

First, let's import the necessary libraries and set up our environment. We'll need:

1. Helper functions from previous notebooks
2. Ktor client for HTTP requests
3. Serialization for JSON parsing
4. Coroutines for asynchronous programming


In [144]:
import dev.raphaeldelio.*

In [145]:
%use ktor-client
%use serialization
%use coroutines

## Semantic Routing with Vector Search

To build our question-answering system, we need to understand what the user is asking. We'll use a technique called semantic routing to classify user queries into different categories.

For example, if a user asks "What's trending right now?", we want to route this to our trending topics handler. If they ask "What are people saying about Trump?", we want to route this to our summarization handler.

We'll use vector search to match user queries to predefined routes. First, let's define some example queries for the trending topics route:


In [146]:
val trendingTopicsRoute = listOf(
    "What are the most mentioned topics?",
    "What's trending right now?",
    "What’s hot in the network",
    "Top topics?",
)
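To make the idea concrete, here is a minimal sketch of nearest-example routing. It assumes cosine similarity as the comparison metric and uses made-up 3-dimensional vectors; `cosineSimilarity`, `routeExamples`, and `queryVector` are hypothetical names for illustration, and a real embedding model produces far larger vectors.

```kotlin
import kotlin.math.sqrt

// Cosine similarity between two vectors of equal length.
fun cosineSimilarity(a: DoubleArray, b: DoubleArray): Double {
    require(a.size == b.size) { "Vectors must have the same dimension" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

// Hypothetical toy "embeddings": one stored example vector per route.
val routeExamples = mapOf(
    "trending_topics" to doubleArrayOf(0.9, 0.1, 0.0),
    "summarization" to doubleArrayOf(0.1, 0.9, 0.2),
)

// Hypothetical embedding of the incoming user query.
val queryVector = doubleArrayOf(0.8, 0.2, 0.1)

// Route the query to whichever stored example is most similar.
val bestRoute = routeExamples
    .maxByOrNull { (_, vector) -> cosineSimilarity(queryVector, vector) }
    ?.key
println(bestRoute) // trending_topics
```

In the notebook itself, the embedding model and Redis take over both jobs: producing the vectors and finding the nearest stored example.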

## Setting Up the Vector Store

To implement query routing, we need a vector store: a database that stores vector embeddings and supports efficient similarity search. We'll use Redis as the vector store and Spring AI to create the embeddings.

First, let's set up the embedding model:


In [147]:
@file:DependsOn("org.springframework.ai:spring-ai-redis-store:1.0.0-RC1")

In [148]:
import org.springframework.ai.transformers.TransformersEmbeddingModel

val embeddingModel = TransformersEmbeddingModel()
embeddingModel.setModelResource("file:resources/model/bge-large-en-v1.5/model.onnx")
embeddingModel.setTokenizerResource("file:resources/model/bge-large-en-v1.5/tokenizer.json")
embeddingModel.afterPropertiesSet()

## Configuring the Redis Vector Store

Now, let's configure the Redis vector store. We'll use Spring AI's RedisVectorStore, which provides a high-level interface for storing and searching vector embeddings in Redis.

The configuration includes:
- The index name for our vector store
- The field names for content and embeddings
- Metadata fields for storing additional information
- The prefix for our keys in Redis
- The vector algorithm to use for similarity search (FLAT in this case)


In [149]:
import org.springframework.ai.vectorstore.redis.RedisVectorStore
import org.springframework.ai.vectorstore.redis.RedisVectorStore.MetadataField
import redis.clients.jedis.search.Schema.FieldType

val redisVectorStore = RedisVectorStore.builder(jedisPooled, embeddingModel)
    .indexName("routeIdx")
    .contentFieldName("text")
    .embeddingFieldName("textEmbedding")
    .metadataFields(
        MetadataField("route", FieldType.TEXT),
        MetadataField("minThreshold", FieldType.NUMERIC),
    )
    .prefix("route:")
    .initializeSchema(true)
    .vectorAlgorithm(RedisVectorStore.Algorithm.FLAT)
    .build()
redisVectorStore.afterPropertiesSet()

## Creating and Storing Route Documents

Now that we have our vector store set up, we need to create documents for our routes and store them in the vector store. Each document represents a possible user query and contains:

- The route it belongs to (e.g., "trending_topics")
- The text of the query (e.g., "What's trending right now?")
- A minimum similarity score that a match must exceed to count (to avoid false positives)

We'll create a function to create these documents and another function to store them in Redis:


In [150]:
import org.springframework.ai.document.Document
import java.util.UUID

fun createRouteDocument(route: String, text: String, minThreshold: Double): Document {
    return Document(
        UUID.randomUUID().toString(),
        text,
        mapOf(
            "route" to route,
            "text" to text,
            "minThreshold" to minThreshold,
        )
    )
}

fun storeRouteDocumentsInRedis(routeName: String, minThreshold: Double, routeSamples: List<String>) {
    val routeDocuments = routeSamples.map { text ->
        createRouteDocument(routeName, text, minThreshold)
    }

    redisVectorStore.add(routeDocuments)
}

storeRouteDocumentsInRedis("trending_topics", 0.9, trendingTopicsRoute)

## Testing Vector Search

Let's test our vector store by searching for a query similar to the ones we've stored. We'll use the `similaritySearch` method to find the most similar document to our query:


In [151]:
import org.springframework.ai.vectorstore.SearchRequest

redisVectorStore.similaritySearch(
    SearchRequest.builder()
        .topK(1)
        .query("Hey Dev Bubble. What's trending today? Excited to hear the news!")
        .build()
)

[Document{id='b8198ac3-eba2-4f67-9cd0-5df85b889003', text='What's trending right now?', media='null', metadata={vector_score=0.09470153, minThreshold=0.9, route=trending_topics, distance=0.09470153}, score=0.9052984714508057}]
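In the result above, `score` and `distance` look complementary: 1 - 0.09470153 is roughly 0.9053, which matches the reported score. The sketch below assumes that relationship (the numbers suggest it, but the notebook does not state it) and shows how the stored `minThreshold` would be applied to this hit:

```kotlin
// Values copied from the search result above.
val distance = 0.09470153
val minThreshold = 0.9

// Assumption: score = 1 - distance, consistent with the reported score of ~0.9053.
val score = 1.0 - distance

// The hit only counts as a route match if its score clears the threshold.
println(score > minThreshold) // true
```

A stricter threshold such as 0.95 would reject this match, which is the trade-off to keep in mind when choosing a threshold per route.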

## Route Matching

Now that we have our vector store set up and tested, we need to create a function to match user queries to routes. This function will:

1. Break the user query into clauses (to handle complex queries)
2. For each clause, retrieve the closest documents in our vector store (the code requests the top two)
3. Keep only the matches whose similarity score exceeds the minimum threshold stored with the document
4. Return the set of matched routes


In [152]:
import redis.clients.jedis.search.FTSearchParams
import redis.clients.jedis.search.Query

fun breakSentenceIntoClauses(sentence: String): List<String> {
    return sentence.split(Regex("""[!?,.:;()"\[\]{}]+"""))
        .filter { it.isNotBlank() }.map { it.trim() }
}

fun matchRoute(query: String): Set<String> {
    return breakSentenceIntoClauses(query).flatMap { clause ->
        val results = redisVectorStore.similaritySearch(
            SearchRequest.builder()
                .topK(2)
                .query(clause)
                .build()
        ).orEmpty()

        results.mapNotNull { document ->
            // Read the route and threshold from each hit instead of reusing the first hit's values;
            // skip hits with missing metadata rather than failing on an unsafe cast
            val route = document.metadata["route"]?.toString() ?: return@mapNotNull null
            val minThreshold = document.metadata["minThreshold"]?.toString()?.toDoubleOrNull()
                ?: return@mapNotNull null
            val score = document.score ?: 0.0

            println(clause)
            println(route)
            println(score)
            println(minThreshold)
            println()

            // Keep the route only when the hit clears its own threshold
            route.takeIf { score > minThreshold }
        }
    }.toSet()
}
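The clause-splitting step can be checked on its own, without Redis. The snippet below is a standalone copy of the same splitting logic under a different, hypothetical name (`splitIntoClauses`) so that it can run in isolation:

```kotlin
// Standalone copy of the clause-splitting logic: split on punctuation, drop blanks, trim.
fun splitIntoClauses(sentence: String): List<String> {
    return sentence.split(Regex("""[!?,.:;()"\[\]{}]+"""))
        .filter { it.isNotBlank() }
        .map { it.trim() }
}

val clauses = splitIntoClauses("Hey DevBubble, what's trending today? Excited to hear the news!")
clauses.forEach { println(it) }
// Hey DevBubble
// what's trending today
// Excited to hear the news
```

Each clause is then searched separately, so a single message can match more than one route. Apostrophes are not in the split set, which keeps contractions like "what's" intact.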

## Testing Route Matching

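Let's test our route matching function with a sample query. Note that the output below already includes the `summarization` route, which is defined and stored later in this notebook; it was captured after those cells had been run: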


In [153]:
matchRoute("Hey DevBubble, what's trending today? Excited to hear the news!")

Hey DevBubble
summarization
0.808184027671814
0.8

Hey DevBubble
summarization
0.808184027671814
0.8

what's trending today
trending_topics
0.979021430015564
0.9

what's trending today
trending_topics
0.979021430015564
0.9

Excited to hear the news
summarization
0.8198691606521606
0.8

Excited to hear the news
summarization
0.8198691606521606
0.8



[summarization, trending_topics]

## Implementing Trending Topics

Now that we have our route matching function, let's implement the trending topics handler. This handler will:

1. Build the key for the current time window (the code truncates the timestamp to the start of the hour, even though the variable is named `currentMinute`)
2. Get all topics ever added to Redis
3. For each topic, get the count from the count-min sketch
4. Sort the topics by count (descending)
5. Take the top 5 topics
6. Append an instruction for the LLM and return everything as a set


In [154]:
import org.springframework.ai.chat.messages.SystemMessage
import org.springframework.ai.chat.messages.UserMessage
import org.springframework.ai.chat.prompt.Prompt
import java.time.LocalDateTime

fun trendingTopics(): Set<String> {
    val currentMinute = LocalDateTime.now().withMinute(0).withSecond(0).withNano(0).toString()
    val top5Topics = jedisPooled.smembers("topics")
        .map { it to jedisPooled.cmsQuery("topics-cms:$currentMinute", it).first() }
        .sortedByDescending { it.second }
        .take(5)
        .map { it.first }
        .toMutableSet()

    top5Topics.add("These are the most mentioned topics. Don't try to guess what's being said in the topics.")
    return top5Topics.toSet()
}
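For readers unfamiliar with the data structure behind `cmsQuery`, here is a toy, in-memory count-min sketch. It is an illustration only, not how Redis implements it: the `CountMinSketch` class, its `width` and `depth` defaults, and the string-based hashing are all assumptions made for this sketch.

```kotlin
// Toy count-min sketch: `depth` rows of `width` counters, one hash per row.
class CountMinSketch(private val width: Int = 64, private val depth: Int = 4) {
    private val table = Array(depth) { LongArray(width) }

    // Mixing the row number into the hashed string gives each row a different bucket layout.
    private fun bucket(item: String, row: Int): Int =
        Math.floorMod((item + "#" + row).hashCode(), width)

    fun add(item: String, count: Long = 1) {
        for (row in 0 until depth) table[row][bucket(item, row)] += count
    }

    // The minimum across rows is the estimate; collisions can inflate it but never shrink it.
    fun estimate(item: String): Long =
        (0 until depth).minOf { row -> table[row][bucket(item, row)] }
}

val sketch = CountMinSketch()
listOf("AI Ethics", "AI Agents", "AI Ethics", "Generative AI", "AI Ethics", "AI Agents")
    .forEach { sketch.add(it) }

// Same ranking pattern as trendingTopics(): query each known topic, sort by count, take the top N.
val topTopics = listOf("AI Ethics", "AI Agents", "Generative AI")
    .map { it to sketch.estimate(it) }
    .sortedByDescending { it.second }
    .take(2)
    .map { it.first }
println(topTopics)
```

Because the sketch stores only counters and not the items themselves, the caller has to supply the candidate topics, which is why `trendingTopics()` first reads the `topics` set and then queries the sketch for each member.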

## Testing Trending Topics

Let's test our trending topics function:


In [155]:
trendingTopics()

[AI Ethics, Generative AI, AI Tooling, AI Agents, Machine Learning, These are the most mentioned topics. Don't try to guess what's being said in the topics.]

## Creating a Trending Topics Handler

Now that we have our trending topics function, let's create a handler that can be used by our query router. This handler will:

1. Take a route and a query as input
2. If the route is "trending_topics", call our trendingTopics function
3. Otherwise, return an empty list


In [156]:
import dev.raphaeldelio.*

val trendingTopicsHandler: (String, String) -> Iterable<String> = { route, query ->
    when (route) {
        "trending_topics" -> trendingTopics()
        else -> emptyList()
    }
}

## Processing User Requests

Now that we have our trending topics handler, let's create a function to process user requests. This function will:

1. Take a user query and a handler function as input
2. Use our matchRoute function to determine which routes match the query
3. Call the handler function for each matched route to get the relevant data
4. Use a Large Language Model to generate a response based on the user query and the data

The LLM will help us generate a natural language response that summarizes the data in a concise way.


In [157]:
fun processUserRequest(
    query: String,
    handler: (String, String) -> Iterable<String>
): String {
    val routes = matchRoute(query)
    println(routes)

    val enrichedData = routes.map { route -> handler(route, query) }
    println(enrichedData + "\n")

    val systemPrompt = "You are a bot that helps users analyse posts about artificial intelligence. You may be given a data set to help you answer questions. Answer in a max of 300 chars. I MEAN IT. It's a TWEET. Don't write more than 300 chars. Respond in only ONE paragraph. Be as concise as possible"

    return ollamaChatModel.call(
        Prompt(
            SystemMessage(systemPrompt),
            SystemMessage("Enriching data: $enrichedData"),
            UserMessage("User query: $query")
        )
    ).result.output.text ?: ""
}

## Testing User Requests

Let's test our processUserRequest function with a sample query:


In [158]:
processUserRequest("What's trending right now?", trendingTopicsHandler)

What's trending right now
trending_topics
0.9956518411636353
0.9

What's trending right now
trending_topics
0.9956518411636353
0.9

[trending_topics]
[[AI Ethics, Generative AI, AI Tooling, AI Agents, Machine Learning, These are the most mentioned topics. Don't try to guess what's being said in the topics.], 
]


 The current trend in artificial intelligence discussions revolves around AI ethics, generative AI models, tooling for machine learning, AI agents, and advancements in general within the field of machine learning. These topics are prominent across various platforms where AI is being actively debated and developed, indicating a focus on responsible innovation, creative applications, and advanced technology tools to enhance efficiency and ethical considerations in AI usage.
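The reply above runs well past the 300 characters the system prompt asks for, so the prompt alone does not guarantee the limit. One option is a small post-processing guard. The sketch below assumes that cutting at the last complete sentence is acceptable; `enforceCharLimit` is a hypothetical helper and not part of the notebook's helper library.

```kotlin
// Hypothetical helper: trims a reply to at most `limit` characters,
// preferring to cut at the last sentence end inside the limit.
fun enforceCharLimit(reply: String, limit: Int = 300): String {
    val text = reply.trim()
    if (text.length <= limit) return text

    val window = text.take(limit)
    val lastSentenceEnd = window.lastIndexOfAny(charArrayOf('.', '!', '?'))

    // Fall back to a hard cut when no sentence boundary falls inside the limit.
    return if (lastSentenceEnd >= 0) window.take(lastSentenceEnd + 1) else window
}

println(enforceCharLimit("Short answer."))                               // Short answer.
println(enforceCharLimit("First sentence. " + "x".repeat(400)))          // First sentence.
println(enforceCharLimit("x".repeat(400)).length)                        // 300
```

It could wrap the value returned by `processUserRequest` before the reply is posted.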

## Implementing Summarization

In addition to trending topics, we also want to be able to summarize posts about specific topics. For example, if a user asks "What are people saying about Trump?", we want to find posts about Trump and summarize them.

First, let's define some example queries for the summarization route:


In [159]:
val summarizationRoute = listOf(
    "What are people saying about {topics}?",
    "What’s the buzz around {topics}?",
    "Any chatter about {topics}?",
    "What are folks talking about regarding {topics}?",
    "What’s being said about {topics} lately?",
    "What have people been posting about {topics}?",
    "What's trending in conversations about {topics}?",
    "What’s the latest talk on {topics}?",
    "Any recent posts about {topics}?",
    "What's the sentiment around {topics}?",
    "What are people saying about {topic1} and {topic2}?",
    "What are folks talking about when it comes to {topic1}, {topic2}, or both?",
    "What’s being said about {topic1}, {topic2}, and others?",
    "Is there any discussion around {topic1} and {topic2}?",
    "How are people reacting to both {topic1} and {topic2}?",
    "What’s the conversation like around {topic1}, {topic2}, or related topics?",
    "Are {topic1} and {topic2} being discussed together?",
    "Any posts comparing {topic1} and {topic2}?",
    "What's trending when it comes to {topic1} and {topic2}?",
    "What are people saying about the relationship between {topic1} and {topic2}?"
)

## Storing Summarization Routes

Now that we've defined our summarization routes, let's store them in our vector store:


In [161]:
storeRouteDocumentsInRedis("summarization", 0.8, summarizationRoute)

## Implementing the Summarization Function

Now let's implement the summarization function. This function will:

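1. Extract topics from the user query using our topic modeling function
2. For each topic, search Redis for the 10 most recent posts tagged with that topic (or, if no topics were extracted, fetch the 10 most recent posts overall)
3. Return the text of those posts, or a fallback message if nothing was found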


In [162]:
import org.springframework.ai.chat.messages.SystemMessage
import org.springframework.ai.chat.messages.UserMessage
import org.springframework.ai.chat.prompt.Prompt

fun summarization(userQuery: String): List<String> {
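    // Pass the separator as an argument; a trailing lambda would be used as a per-element transform
    val existingTopics = jedisPooled.smembers("topics").joinToString(", ")
    // Trim each topic and drop blanks so that an empty extraction yields an empty list
    val queryTopics = topicExtraction(userQuery, existingTopics)
        .replace("\"", "")
        .split(",")
        .map { it.trim() }
        .filter { it.isNotEmpty() }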
    println(queryTopics)

    val posts = if (queryTopics.isEmpty()) {
        val query = Query("*")
            .returnFields("text")
            .setSortBy("time_us", false)
            .dialect(2)
            .limit(0, 10)

        val result = jedisPooled.ftSearch(
            "postIdx",
            query
        )

        result.documents.map { document ->
            document.get("text").toString()
        }
    } else {
        queryTopics.map { topic ->
            val query = Query("@topics:{'$topic'}")
                .returnFields("text")
                .setSortBy("time_us", false)
                .dialect(2)
                .limit(0, 10)

            val result = jedisPooled.ftSearch(
                "postIdx",
                query
            )

            result.documents.map { document ->
                document.get("text").toString()
            }
        }.flatten()
    }

    return if (posts.isEmpty()) {
        listOf("Nothing was found. Say that nothing was mentioned about these topics.")
    } else {
        posts
    }
}
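One detail of the topic-parsing step is easy to miss: in Kotlin, splitting an empty string returns a list containing one empty string, so an `isEmpty()` check on the raw split result never fires. The sketch below shows a parsing approach that keeps the "no topics" fallback reachable. `parseTopics` is a hypothetical helper name, and the comma-separated input format is an assumption based on the output shown in this notebook.

```kotlin
// Hypothetical helper: turns the extractor's comma-separated reply into a clean topic list.
fun parseTopics(raw: String): List<String> =
    raw.replace("\"", "")
        .split(",")
        .map { it.trim() }
        .filter { it.isNotEmpty() }

println("".split(", ").isEmpty())              // false: the result is a list with one empty string
println(parseTopics(""))                       // []
println(parseTopics(" ChatGPT, Chatbots"))     // [ChatGPT, Chatbots]
```

Trimming also removes the leading space that otherwise ends up inside the first topic name.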

## Testing the Summarization Function

Let's test our summarization function with a sample query:


In [163]:
summarization("What's being said about ChatGPT and Chatbots?")

[ ChatGPT, Chatbots]


[You’re using ChatGPT to send me a 2 sentence email?, AI is taking over, but nobody can figure out how to stop bots from trying to follow us on every platform. The dumb, it overwhelms., Chatgpt 畫“選腎與熊”, Engineers are learning with us. Maybe not ripe for govt yet.

But the allure of a Clever Hans digital factotum that doesn't demand wages will greenlight adoption faster than any policy or due diligence will.

There's good usages, but a lot more bad. Endorsed chatbot ai for all employees is insane.]

## Creating a Multi-Handler

Now that we have both trending topics and summarization handlers, let's create a combined handler that can handle both types of queries:


In [164]:
val multiHandler: (String, String) -> Iterable<String> = { route, query ->
    when (route) {
        "trending_topics" -> trendingTopics()
        "summarization" -> summarization(query)
        else -> emptyList()
    }
}

## Testing the Complete System

Now that we have our complete system, let's test it with different types of queries:


In [165]:
processUserRequest("What's being said about ChatGPT and Chatbots?", multiHandler)

What's being said about ChatGPT and Chatbots
summarization
0.8352869153022766
0.8

What's being said about ChatGPT and Chatbots
summarization
0.8352869153022766
0.8

[summarization]
[ ChatGPT, Chatbots, AI Conversational Agents]
[[You’re using ChatGPT to send me a 2 sentence email?, AI is taking over, but nobody can figure out how to stop bots from trying to follow us on every platform. The dumb, it overwhelms., Chatgpt 畫“選腎與熊”, Engineers are learning with us. Maybe not ripe for govt yet.

But the allure of a Clever Hans digital factotum that doesn't demand wages will greenlight adoption faster than any policy or due diligence will.

There's good usages, but a lot more bad. Endorsed chatbot ai for all employees is insane.], 
]


 The discussion revolves around the implications of AI in various domains, especially within digital communications like emails and social media platforms. It highlights concerns over the increasing prevalence of bots attempting to follow users across different online channels. There is a debate on whether this trend could lead to unintended consequences or if it's just a part of technological advancement. The conversation also touches on potential misuse of AI tools, such as using ChatGPT for tasks not intended for its capabilities (e.g., generating emails), and the ethical considerations surrounding the endorsement of chatbot AIs across all organizational levels.

In [167]:
processUserRequest("What's trending now'?", multiHandler)

What's trending now'
trending_topics
0.9872353076934814
0.9

What's trending now'
trending_topics
0.9872353076934814
0.9

[trending_topics]
[[AI Ethics, Generative AI, AI Tooling, AI Agents, Machine Learning, These are the most mentioned topics. Don't try to guess what's being said in the topics.], 
]


 The latest trends in artificial intelligence discussions revolve around ethics, generative models, tool development for AI, and intelligent agents. Topics like machine learning advancements are also prominent in various conversations on social media platforms.

## Conclusion

In this notebook, we've built a simple question-answering system that can:

1. Identify trending topics in posts
2. Summarize posts about specific topics
3. Route different types of user queries to the appropriate handler
4. Generate natural language responses using a Large Language Model

This demonstrates how to combine Redis, vector search, and LLMs to build an intelligent data analysis system. The system can be extended to handle more types of queries and to provide more detailed analysis of the data.
