# Classification - Open Ended Questions

This notebook highlights the steps for extracting insights from the open-ended questions in the overall survey. Open-ended questions are those where respondents write their own answers. Per discussion with the survey team, these answers need context to be interpreted, and that context comes from the preceding set of questions.

The approach is to create a new table that holds both the context and the open-ended answer, adding further fields as needed. This table is then used to extract insights from the open-ended answers. The primary key for all tables is the field `ResponseId`.

## Classification approach - Using embeddings

Given a predefined list of classes and their definitions:

| Classification | Description |
|---|---|
| Integration | Key issues include difficulties integrating capabilities of cloud services into solutions, ensuring seamless interoperability. |
| Breadth | Key issues include navigating an overwhelming range of service options. |
| Containers | Key issues include challenges in container orchestration, ensuring compatibility in containerized environments. |
...

The list of classes can be hierarchical, with several levels. In our case there is a single level of classes. As an initial step, we embed each class definition and save the result to a `json` file.
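
For orientation, here is a minimal sketch of what one entry of the class file might look like once deserialized. The real type lives in `ClassificationNode.cs`, which is not shown in this notebook. The property names (`Topic`, `Definition`, `Embedded`, `ChildTopics`) are inferred from how later cells use the type, while the name `ClassificationNodeSketch` and the JSON key casing are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Illustrative JSON in the assumed shape of the class file (single level, no embeddings yet).
string sampleJson = @"[
  { ""topic"": ""Breadth"",
    ""definition"": ""Key issues include navigating an overwhelming range of service options."" }
]";

var jsonOptions = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
List<ClassificationNodeSketch> nodes =
    JsonSerializer.Deserialize<List<ClassificationNodeSketch>>(sampleJson, jsonOptions)
    ?? new List<ClassificationNodeSketch>();

// Embedded stays null until the embedding step fills it in.
Console.WriteLine($"{nodes[0].Topic}: has embedding = {nodes[0].Embedded != null}");

// Hypothetical stand-in for the class defined in ClassificationNode.cs.
public class ClassificationNodeSketch
{
    public string Topic { get; set; } = string.Empty;
    public string Definition { get; set; } = string.Empty;
    public float[] Embedded { get; set; }                           // filled by the embedding step
    public List<ClassificationNodeSketch> ChildTopics { get; set; } // null or empty for a single-level list
}
```

Keeping the vector on the node itself means the file written after embedding can be loaded back with the same type.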

For each row, we call a method that extracts insights using a few techniques:
- Word count - deterministic: we split on ' ' (space) and count the words. Note that LLMs are not good at counting words.
- Sentiment, keywords, themes - an LLM extracts this information as `json`.
- Model scores - using the verbatim text and the grounding information (attribution), we use embeddings and cosine similarity to assign a score to each class. We chose to keep every class whose score meets a threshold.
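
The first two techniques can be sketched as follows. The word count splits on spaces as described above, and the JSON part parses a hard-coded string standing in for a model reply. The helper names `CountWords` and `ParseInsights` and the field names `sentiment`, `keywords`, and `themes` are assumptions for illustration, since the actual prompt and response schema are not shown in this notebook.

```csharp
using System;
using System.Linq;
using System.Text.Json;

// Deterministic word count: split on spaces and drop the empty entries left by repeated spaces.
static int CountWords(string text) =>
    string.IsNullOrWhiteSpace(text)
        ? 0
        : text.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;

// Pull a few fields out of a JSON reply; missing fields fall back to empty values.
static (string Sentiment, string[] Keywords) ParseInsights(string json)
{
    using JsonDocument doc = JsonDocument.Parse(json);
    JsonElement root = doc.RootElement;

    string sentiment = root.TryGetProperty("sentiment", out JsonElement s)
        ? s.GetString() ?? ""
        : "";
    string[] keywords = root.TryGetProperty("keywords", out JsonElement k) && k.ValueKind == JsonValueKind.Array
        ? k.EnumerateArray().Select(e => e.GetString() ?? "").ToArray()
        : Array.Empty<string>();

    return (sentiment, keywords);
}

string sampleVerbatim = "Too many services  to choose from"; // note the double space
int wordCount = CountWords(sampleVerbatim);

// Stand-in for a JSON-mode model reply; this schema is assumed, not taken from the notebook.
string llmReply = @"{ ""sentiment"": ""negative"", ""keywords"": [""services"", ""choice""], ""themes"": [""Breadth""] }";
var insights = ParseInsights(llmReply);

Console.WriteLine($"words={wordCount}, sentiment={insights.Sentiment}, keywords={string.Join("|", insights.Keywords)}");
```

Using `TryGetProperty` rather than indexing directly keeps a reply with a missing field from throwing, which matters because model output is not guaranteed to follow the requested schema.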

## Process

### Step 1 - Load packages and libraries

In [None]:
#r "nuget: Azure.AI.OpenAI, 2.1.0"
#r "nuget: Azure.Identity, 1.8.0"
#r "nuget: DotNetEnv, 2.5.0"
#r "nuget: Microsoft.Data.Sqlite, 6.0.0"


using Azure.Identity;
using Azure;

using DotNetEnv;

using System.IO;
using System.Text.Json;
using System.ClientModel;

using Azure.AI.OpenAI;
using Azure.AI.OpenAI.Chat;

using OpenAI.Chat;

using OpenAI.Embeddings;

### Step 2 - Configure Azure OpenAI

In [None]:
string _configurationFile = @"../../../../configuration/.env";
Env.Load(_configurationFile);

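// Fail fast when a setting is missing; a placeholder string would only surface later as a confusing Uri or deployment error
string oAiEndpoint = Environment.GetEnvironmentVariable("AOAI_ENDPOINT")
    ?? throw new InvalidOperationException("AOAI_ENDPOINT not found");
string chatCompletionDeploymentName = Environment.GetEnvironmentVariable("CHATCOMPLETION_DEPLOYMENTNAME")
    ?? throw new InvalidOperationException("CHATCOMPLETION_DEPLOYMENTNAME not found");
string embeddingDeploymentName = Environment.GetEnvironmentVariable("EMBEDDING_DEPLOYMENTNAME")
    ?? throw new InvalidOperationException("EMBEDDING_DEPLOYMENTNAME not found");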
var credential = new DefaultAzureCredential();

// Now create the client using your identity:
AzureOpenAIClient openAIClient = new AzureOpenAIClient(
    new Uri(oAiEndpoint),
    credential
);

Console.WriteLine($"OpenAI Client created with user identity at: {oAiEndpoint}, using deployment: {chatCompletionDeploymentName}");

### Step 3 - Loading additional classes and helper functions

In [None]:
// Load local helper files: VectorMath (cosine similarity between two vectors),
// ClassificationNode (the class tree), and SQLiteHelper (database access)


#load "VectorMath.cs"
#load "ClassificationNode.cs"
#load "SQLiteHelper.cs"

In [None]:
async Task<float[]> GetEmbeddingAsync(AzureOpenAIClient _openAIClient, string textToVectorize)
{
    // Get an embedding client for the configured deployment and embed the text
    EmbeddingClient embeddingClient = _openAIClient.GetEmbeddingClient(embeddingDeploymentName);
    ClientResult<OpenAIEmbedding> embeddingResult = await embeddingClient.GenerateEmbeddingAsync(textToVectorize);
    // Return an empty array (never null) when no embedding comes back
    float[] response = embeddingResult?.Value?.ToFloats().ToArray() ?? new float[0];
    return response;
}

In [None]:
async Task<string> CallOpenAI(AzureOpenAIClient _openAIClient, string prompt, string systemMessage, bool jsonResponse = true)
{
    // Get the chat client (using your deployment or model name)
    ChatClient chatClient = _openAIClient.GetChatClient(chatCompletionDeploymentName);

    ChatCompletionOptions chatCompletionOptions = new ChatCompletionOptions()
    {
        MaxOutputTokenCount = 450,
        Temperature = 0.7f,
        TopP = 1.0f,
        FrequencyPenalty = 0.7f,
        PresencePenalty = 0.7f,
    };

    // Ask for a JSON object when the caller expects structured output, plain text otherwise
    chatCompletionOptions.ResponseFormat = jsonResponse ? ChatResponseFormat.CreateJsonObjectFormat() : ChatResponseFormat.CreateTextFormat();

    // Prepare your messages
    ChatMessage[] messages = new ChatMessage[]
    {
        new SystemChatMessage(systemMessage),
        new UserChatMessage(prompt)
    };

    // Call the chat completions endpoint with the messages and options
    ChatCompletion completions = await chatClient.CompleteChatAsync(
        messages: messages,
        options: chatCompletionOptions);

    // Get the text from the first completion choice
    // var resp = completions.Content[0];
    
    string result = completions.Content[0].Text;
    return result;
}

### Step 4 - Load and embed the classification classes

In [None]:

var inputFilePath = "cic_classes.json"; // Adjust path as needed
var jsonString = File.ReadAllText(inputFilePath);


var classifications = JsonSerializer.Deserialize<List<ClassificationNode>>(jsonString, new JsonSerializerOptions
{
    PropertyNameCaseInsensitive = true
});

if (classifications == null)
{
    Console.WriteLine("Failed to deserialize JSON. Check file format.");
    return;
}

**Helper Method:** EmbedClassificationDataAsync - calls the embedding model to embed each classification class

In [None]:
async Task<List<ClassificationNode>> EmbedClassificationDataAsync(
    AzureOpenAIClient _openAIClient,
    List<ClassificationNode> classificationNodes)
{
    foreach (var node in classificationNodes)
    {
        // Build the text to embed, e.g. "Topic: Definition"
        string textToEmbed = $"{node.Topic}: {node.Definition}";

        // Embed the combined text and store the vector on the node
        var embedding = await GetEmbeddingAsync(_openAIClient,textToEmbed);
        node.Embedded = embedding;

        Console.WriteLine($"Embedded => Topic: {node.Topic}");

        // Recursively embed all child topics
        if (node.ChildTopics != null && node.ChildTopics.Count > 0)
        {
            await EmbedClassificationDataAsync(_openAIClient,node.ChildTopics);
        }
    }

    // Return the updated list
    return classificationNodes;
}

**Calling for embedding:** call the helper method and save the result to a new file. This step only needs to be rerun when classes are updated or new classes are added.

In [None]:
var outputFilePath = "cic_classes_with_embeddings.json"; // Adjust path as needed

var updatedWafData = await EmbedClassificationDataAsync(openAIClient,classifications);

// Pretty-print for readability
var options = new JsonSerializerOptions
{
    WriteIndented = true
};

var updatedJson = JsonSerializer.Serialize(updatedWafData, options);
File.WriteAllText(outputFilePath, updatedJson);

Console.WriteLine($"Updated JSON with embeddings saved to {outputFilePath}.");

### Step 5 - Load the classification classes with embeddings into memory

In [None]:
// loading the previously generated classes with embeddings
var inputFilePath = "cic_classes_with_embeddings.json"; 
var jsonString = File.ReadAllText(inputFilePath);
// Console.WriteLine(jsonString); // Check structure
List<ClassificationNode> classifications = new List<ClassificationNode>();

try
{
    classifications = JsonSerializer.Deserialize<List<ClassificationNode>>(jsonString, new JsonSerializerOptions
    {
        PropertyNameCaseInsensitive = true
    });
}
catch (JsonException)
{
    Console.WriteLine("Failed to deserialize JSON. Check file format.");
    return;
}


if (classifications == null)
{
    Console.WriteLine("Failed to deserialize JSON. Check file format.");
}

**Helper Class:** ClassificationResult - holds the text that was classified (the attribution, although we also include the verbatim text) and a list of candidate classes with their similarity scores.
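
To illustrate the shape of the match list, the sketch below builds a list of single-entry dictionaries like `Matches` and serializes it with `System.Text.Json`. The labels, scores, and variable names are made up for illustration; this is one possible way to turn the matches into text and is not necessarily how the rest of the notebook formats them.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Same shape as ClassificationResult.Matches: one { label: score } pair per dictionary.
var matchesSketch = new List<Dictionary<string, float>>
{
    new Dictionary<string, float> { ["Integration"] = 0.85f },
    new Dictionary<string, float> { ["Breadth"] = 0.82f },
};

// JsonSerializer takes care of quoting and escaping the labels.
string matchesJson = JsonSerializer.Serialize(matchesSketch);
Console.WriteLine(matchesJson);

// Reading the text back yields the same labels and scores.
var roundTrip = JsonSerializer.Deserialize<List<Dictionary<string, float>>>(matchesJson)
    ?? new List<Dictionary<string, float>>();
Console.WriteLine($"entries read back: {roundTrip.Count}");
```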

In [None]:
public class ClassificationResult
{
    public string Attribution { get; set; } = string.Empty;

    // Each dictionary contains one key-value pair: { "label": score }
    public List<Dictionary<string, float>> Matches { get; set; } = new();
}

**Helper Method:** FlattenNodes - converting a nested list of classes to a flat list

In [None]:
public static List<(ClassificationNode Node, List<string> Path)> FlattenNodes(
    List<ClassificationNode> nodes,
    List<string> parentPath = null)
{
    var results = new List<(ClassificationNode, List<string>)>();

    foreach (var node in nodes)
    {
        var currentPath = (parentPath == null || parentPath.Count == 0)
            ? new List<string> { node.Topic }
            : new List<string>(parentPath) { node.Topic };

        // Add this node
        results.Add((node, currentPath));

        // Recurse if child topics exist
        if (node.ChildTopics != null && node.ChildTopics.Count > 0)
        {
            results.AddRange(FlattenNodes(node.ChildTopics, currentPath));
        }
    }

    return results;
}

**Helper Method:** ClassifyTopMatchesAsync - using the embeddings and cosine similarity, we classify the top matches
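
`VectorMath.cs` is loaded from a local file and is not shown in this notebook. As a rough sketch of what the scoring step computes: cosine similarity is the dot product of two vectors divided by the product of their lengths, and a class is kept when its score reaches the threshold. The function below is an assumed equivalent of `VectorMath.CosineSimilarity`, not its actual source, and the three-dimensional vectors are toy values rather than real embeddings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static float CosineSimilaritySketch(float[] a, float[] b)
{
    // Mismatched or empty vectors are treated as "no match" in this sketch.
    if (a.Length != b.Length || a.Length == 0) return 0f;

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    // A zero-length vector has no direction; avoid dividing by zero.
    if (normA == 0 || normB == 0) return 0f;
    return (float)(dot / (Math.Sqrt(normA) * Math.Sqrt(normB)));
}

// Toy class vectors standing in for the stored class embeddings.
var classVectors = new Dictionary<string, float[]>
{
    ["Integration"] = new[] { 1f, 0f, 0f },
    ["Breadth"]     = new[] { 0.9f, 0.1f, 0f },
    ["Containers"]  = new[] { 0f, 1f, 0f },
};
float[] textVector = { 1f, 0.05f, 0f };
float sketchThreshold = 0.8f;

// Score every class, keep those at or above the threshold, best match first.
var keptMatches = classVectors
    .Select(kv => (Label: kv.Key, Score: CosineSimilaritySketch(textVector, kv.Value)))
    .Where(m => m.Score >= sketchThreshold)
    .OrderByDescending(m => m.Score)
    .ToList();

foreach (var m in keptMatches) Console.WriteLine($"{m.Label}: {m.Score:F2}");
```

Because every class above the threshold is kept, a single answer can receive several labels; raising the threshold trades recall for precision.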

In [None]:
public async Task<ClassificationResult> ClassifyTopMatchesAsync(
    AzureOpenAIClient _openAIClient,
    string text2Classify,
    List<ClassificationNode> classificationNodes,
    float threshold = 0.8f)
{
    var allNodes = FlattenNodes(classificationNodes); // Flatten the tree
    var result = new ClassificationResult
    {
        Attribution = text2Classify
    };

    float[] embedding = await GetEmbeddingAsync(_openAIClient, text2Classify);
    if (embedding == null || embedding.Length == 0) // GetEmbeddingAsync returns an empty array on failure
    {
        Console.WriteLine("Failed to generate embedding.");
        return result;
    }

    var matches = new List<(string Label, float Score)>();

    foreach (var (node, path) in allNodes)
    {
        if (node.Embedded == null) continue;

        float sim = VectorMath.CosineSimilarity(embedding, node.Embedded);
        if (sim >= threshold)
        {
            // Drop the root level from nested paths, e.g. "Security > IAM" becomes "IAM".
            // Single-level classes, as used here, keep their own topic name.
            string label = path.Count > 1 
                    ? string.Join(" > ", path.Skip(1)) 
                    : path.First();
            matches.Add((label, sim));
        }
    }

    // Sort by similarity descending
    var sorted = matches.OrderByDescending(m => m.Score);

    result.Matches = sorted
        .Select(m => new Dictionary<string, float> { [m.Label] = m.Score })
        .ToList();

    return result;
}

## Loading the data from the SQLite database
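
`SQLiteHelper.cs` is a local file that is not shown in this notebook. A minimal sketch of what a query helper of this kind might do with `Microsoft.Data.Sqlite` (referenced in Step 1) is given below. The function name `QueryRowsSketch`, the in-memory database, the `Demo` table, and the sample row are assumptions for illustration; only the `ResponseId` and `Verbatim` column names come from this notebook.

```csharp
// Assumes the Microsoft.Data.Sqlite package from Step 1 is already referenced.
using System;
using System.Collections.Generic;
using Microsoft.Data.Sqlite;

// Run a query and return each row as a column-name -> value dictionary.
static List<Dictionary<string, object>> QueryRowsSketch(SqliteConnection connection, string sql)
{
    var rows = new List<Dictionary<string, object>>();
    using (SqliteCommand command = connection.CreateCommand())
    {
        command.CommandText = sql;
        using (SqliteDataReader reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                var row = new Dictionary<string, object>(StringComparer.OrdinalIgnoreCase);
                for (int i = 0; i < reader.FieldCount; i++)
                {
                    // Store SQL NULL as null so callers can filter it out.
                    row[reader.GetName(i)] = reader.IsDBNull(i) ? null : reader.GetValue(i);
                }
                rows.Add(row);
            }
        }
    }
    return rows;
}

// Throwaway in-memory database with one sample row.
var demoConnection = new SqliteConnection("Data Source=:memory:");
demoConnection.Open();

var demoSetup = demoConnection.CreateCommand();
demoSetup.CommandText = @"
    CREATE TABLE Demo (ResponseId TEXT PRIMARY KEY, Verbatim TEXT);
    INSERT INTO Demo VALUES ('R_1', 'Too many services to choose from');";
demoSetup.ExecuteNonQuery();

var demoRows = QueryRowsSketch(demoConnection, "SELECT * FROM Demo LIMIT 5;");
foreach (var demoRow in demoRows) Console.WriteLine($"{demoRow["ResponseId"]}: {demoRow["Verbatim"]}");
demoConnection.Close();
```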



In [None]:
var connection = SQLiteHelper.LoadDatabase("../fy25-raw.db");
var tableName = "OverallPailPoints";
var query2 = $@"
    SELECT * FROM {tableName}
    LIMIT 5;";
var results = SQLiteHelper.ExecuteQuery(connection, query2);

// Print
foreach (var row in results)
{
    Console.WriteLine(string.Join(", ", row));
}

### Step 6 - Classify the open-ended questions and attribution

In [None]:
foreach (var row in results)
{
    var attributionFields = row
        .Where(kvp => !kvp.Key.Equals("Verbatim", StringComparison.OrdinalIgnoreCase)
                      && kvp.Value != null 
                      && !string.IsNullOrWhiteSpace(kvp.Value.ToString()))
        .Select(kvp => kvp.Value.ToString());

    string attribution = string.Join(", ", attributionFields);
    string verbatim = row["Verbatim"]?.ToString() ?? "";

    // var classification = await ClassifyTopMatchesAsync(openAIClient, attribution, classifications, 0.823f);
    // Separate the verbatim text from the attribution with a space so the words do not run together
    var classification = await ClassifyTopMatchesAsync(openAIClient, $"{verbatim} {attribution}".Trim(), classifications, 0.82f);

    Console.WriteLine($"Attribution: {classification.Attribution}");
    // Format as a flat string manually, since default JSON serializer uses double-quotes
    var formatted = "[" + string.Join(", ", classification.Matches.Select(m =>
    {
        var kvp = m.First(); // only one entry per dict
        return $"{{'{kvp.Key}': {kvp.Value:F2}}}";
    })) + "]";

    Console.WriteLine(formatted); 
}

### Putting it together

- Load the data (classes and embeddings)
- Query the database
- Classify the open-ended questions & attribution
- Store the results back in the source table
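
For the final step, a parameterized statement is one way to write the scores back without worrying about quotes inside labels. The sketch below uses `Microsoft.Data.Sqlite` directly rather than `SQLiteHelper`, whose API is not shown; the `Demo` table, the sample row, the label, and the variable names are assumptions for illustration. Table and column names cannot be bound as parameters, so they still have to come from trusted code.

```csharp
// Assumes the Microsoft.Data.Sqlite package from Step 1 is already referenced.
using System;
using Microsoft.Data.Sqlite;

var db = new SqliteConnection("Data Source=:memory:");
db.Open();

var createDemo = db.CreateCommand();
createDemo.CommandText = @"
    CREATE TABLE Demo (ResponseId TEXT PRIMARY KEY, Verbatim TEXT, ModelClassificationScores TEXT);
    INSERT INTO Demo (ResponseId, Verbatim) VALUES ('R_1', 'Too many services to choose from');";
createDemo.ExecuteNonQuery();

// A label containing a single quote would break a hand-built SQL string, but not a bound value.
string scoresJson = "[{\"Vendor's breadth\": 0.85}]";

var updateScores = db.CreateCommand();
updateScores.CommandText = "UPDATE Demo SET ModelClassificationScores = $scores WHERE ResponseId = $id;";
updateScores.Parameters.AddWithValue("$scores", scoresJson);
updateScores.Parameters.AddWithValue("$id", "R_1");
int affected = updateScores.ExecuteNonQuery();

// Read the value back to confirm it was stored unchanged.
var readBack = db.CreateCommand();
readBack.CommandText = "SELECT ModelClassificationScores FROM Demo WHERE ResponseId = $id;";
readBack.Parameters.AddWithValue("$id", "R_1");
string stored = (string)readBack.ExecuteScalar();

Console.WriteLine($"rows updated: {affected}, stored: {stored}");
db.Close();
```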

In [None]:
public async Task ClassifyRowsFromDatabaseAsync(
    string dbPath,
    string tableName,
    int limit,
    AzureOpenAIClient _openAIClient,
    List<ClassificationNode> classifications,
    float threshold = 0.82f,
    bool persistClassification = true,
    string keyColumn = "ResponseId") // default PK column name
{
    var connection = SQLiteHelper.LoadDatabase(dbPath);

    // Ensure Classification column exists
    if (persistClassification)
    {
        var alter = $"ALTER TABLE {tableName} ADD COLUMN ModelClassificationScores TEXT;";
        try { SQLiteHelper.ExecuteNonQuery(connection, alter); }
        catch { /* Ignore if column already exists */ }
    }

    var query = $@"SELECT * FROM {tableName} LIMIT {limit};";
    var results = SQLiteHelper.ExecuteQuery(connection, query);

    foreach (var row in results)
    {
        var attributionFields = row
            .Where(kvp => !kvp.Key.Equals("Verbatim", StringComparison.OrdinalIgnoreCase)
                          && kvp.Value != null 
                          && !string.IsNullOrWhiteSpace(kvp.Value.ToString()))
            .Select(kvp => kvp.Value.ToString());

        string attribution = string.Join(", ", attributionFields);
        string verbatim = row.ContainsKey("Verbatim") ? row["Verbatim"]?.ToString() ?? "" : "";
        string combinedText = $"{verbatim} {attribution}".Trim();

        var classification = await ClassifyTopMatchesAsync(_openAIClient, combinedText, classifications, threshold);

        // Console.WriteLine($"Attribution: {classification.Attribution}");
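        // Format as JSON (double-quoted keys) so the stored text can be parsed later
        var formatted = "[" + string.Join(", ", classification.Matches.Select(m =>
        {
            var kvp = m.First();
            // Invariant culture keeps '.' as the decimal separator regardless of locale
            string score = kvp.Value.ToString("F2", System.Globalization.CultureInfo.InvariantCulture);
            return $"{{\"{kvp.Key}\": {score}}}";
        })) + "]";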

        Console.WriteLine(formatted);
        Console.WriteLine();

        // Optional: persist to database
        if (persistClassification && row.ContainsKey(keyColumn))
        {
            // Double any single quotes so both values are safe inside SQL string literals
            string idValue = row[keyColumn]?.ToString()?.Replace("'", "''") ?? "";
            string scoresValue = formatted.Replace("'", "''");
            var update = $@"UPDATE {tableName} 
                            SET ModelClassificationScores = '{scoresValue}' 
                            WHERE {keyColumn} = '{idValue}';";

            SQLiteHelper.ExecuteNonQuery(connection, update);
        }
    }

    connection.Close();
}

In [None]:
await ClassifyRowsFromDatabaseAsync(
    "../fy25-raw.db",
    "OverallPailPoints",
    5,
    openAIClient,
    classifications,
    0.82f,
    true,
    "ResponseId");