# Research grouping / clustering techniques

## WAF Classification

The baseline classification is based on the pillars of the Well-Architected Framework (WAF). The WAF is used in many scenarios and is a good starting point for classifying feedback. The WAF has 5 pillars:

1. Operational Excellence
2. Security
3. Reliability
4. Performance Efficiency
5. Cost Optimization

There are several techniques to classify feedback by WAF pillar. The main concerns are speed and the total number of tokens needed to classify the entire set, which currently holds ~50K feedback items. The approach is therefore multi-step. First, during the initial transformation of the feedback into user stories, we instruct the LLM to also assign a top-level classification (a WAF pillar) to the feedback. Second, we assign the feedback a subcategory within that pillar. This reduces the size of each request, allowing the LLM to classify the feedback efficiently.
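As a rough illustration of why the second step keeps requests small, the sketch below narrows the candidate subcategories to those of the pillar assigned in the first step. The map `subCategoriesByPillar` and the subcategory names are placeholders invented for this example; they are not the contents of `classes.json`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical pillar -> subcategory map; the names are placeholders only.
var subCategoriesByPillar = new Dictionary<string, List<string>>
{
    ["Security"] = new List<string> { "Security subcategory A", "Security subcategory B" },
    ["Reliability"] = new List<string> { "Reliability subcategory A", "Reliability subcategory B", "Reliability subcategory C" },
    ["Cost Optimization"] = new List<string> { "Cost subcategory A" },
};

// Step 1 (assumed to have happened already): the LLM assigned a pillar
// while transforming the feedback into a user story.
string assignedPillar = "Reliability";

// Step 2: only the subcategories of that pillar are candidates for the second request.
List<string> candidates = subCategoriesByPillar.ContainsKey(assignedPillar)
    ? subCategoriesByPillar[assignedPillar]
    : new List<string>();

int totalSubCategories = subCategoriesByPillar.Values.Sum(list => list.Count);
Console.WriteLine($"Candidates sent in step 2: {candidates.Count} of {totalSubCategories}");
// prints "Candidates sent in step 2: 3 of 6"
```

With the real hierarchy the saving depends on how evenly the subcategories are spread across the pillars.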

For each pillar, we created subcategories based on that pillar's recommendations. The file `classes.json` contains the mapping of the subcategories to the pillars.
The first step of the process is to generate an embedding for each class. The feedback items were already transformed into user stories, and an embedding was generated from each user story.

In [None]:
#r "nuget: Azure.AI.OpenAI, 1.0.0-beta.12"
#r "nuget: DotNetEnv, 2.5.0"
// Reference the compiled console project so its model classes can be used without redefining them locally
#r "../bin/Debug/net8.0/console.dll"

using Azure; 
using Azure.AI.OpenAI;
using DotNetEnv;
using System.IO;
using System.Text.Json; 
using ProductLeaders.console.Models;

var fed = new ProductLeaders.console.Models.FeedbackRecord();
Console.WriteLine(fed.ClassificationLevels);

## Creating OpenAI Client

In [None]:

static string _configurationFile = @"../../../configuration/.env";
Env.Load(_configurationFile);

// Read a required setting and fail fast with a clear message if it is missing
string GetRequiredEnv(string name) =>
    Environment.GetEnvironmentVariable(name) ?? throw new InvalidOperationException($"{name} not found");

string oAiApiKey = GetRequiredEnv("AOAI_APIKEY");
string oAiEndpoint = GetRequiredEnv("AOAI_ENDPOINT");
string chatCompletionDeploymentName = GetRequiredEnv("CHATCOMPLETION_DEPLOYMENTNAME");
string embeddingDeploymentName = GetRequiredEnv("EMBEDDING_DEPLOYMENTNAME");

AzureKeyCredential azureKeyCredential = new AzureKeyCredential(oAiApiKey);
OpenAIClient openAIClient = new OpenAIClient(new Uri(oAiEndpoint), azureKeyCredential);

Console.WriteLine($"OpenAI Client created: {oAiEndpoint} with: {chatCompletionDeploymentName} and {embeddingDeploymentName} deployments");

### Get Embeddings

A helper method to get embeddings from the OpenAI API.

In [None]:
async Task<float[]> GetEmbeddingAsync(string textToBeVectorized)
{
    // Prepare the embeddings options with the text to embed
    EmbeddingsOptions embeddingsOptions = new EmbeddingsOptions(embeddingDeploymentName, new List<string> { textToBeVectorized });
    var modelResponse = await openAIClient.GetEmbeddingsAsync(embeddingsOptions);

    // A single input was sent, so the first data item holds its embedding
    float[] response = modelResponse.Value.Data[0].Embedding.ToArray();
    return response;
}

## Classification using preset classes

The classes are based on the WAF pillars and their subcategories and are defined in the `classes.json` file. They are used to classify the feedback by WAF pillar.
`WafRoot` and `WafItem` are classes used in the process of generating embeddings. The embeddings are used to match the feedback to the classes.

## Generate embeddings for the classes

The classes are the subcategories of the WAF pillars. The embeddings are generated using the `ada` model. First we load the JSON file with the classes, then we iterate through it, creating an embedding from each class name and short description.
A generic structure is used to maintain the WAF topics.

In [None]:
var inputFilePath = "new_classification.json"; // Adjust path as needed
var jsonString = File.ReadAllText(inputFilePath);

// Deserialize into the generic ClassificationNode hierarchy (the commented-out line is the earlier WafRoot variant)
// var wafData = JsonSerializer.Deserialize<WafRoot>(jsonString);
var classifications = JsonSerializer.Deserialize<List<ClassificationNode>>(jsonString, new JsonSerializerOptions
{
    PropertyNameCaseInsensitive = true
});

if (classifications == null)
{
    Console.WriteLine("Failed to deserialize JSON. Check file format.");
    return;
}
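The real `ClassificationNode` type comes from the referenced console project and is not shown in this notebook. The sketch below is only an assumption about its shape, inferred from the members the code here uses (`Topic`, `Definition`, `Embedded`, `ChildTopics`). The type name `ClassificationNodeSketch`, the JSON property names, and the sample definitions are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Tiny sample in the assumed file layout: a pillar with one child topic.
string sampleJson = @"[
  {
    ""topic"": ""Reliability"",
    ""definition"": ""Placeholder definition of the pillar"",
    ""childTopics"": [
      { ""topic"": ""Simplicity and efficiency"", ""definition"": ""Placeholder definition of the subcategory"" }
    ]
  }
]";

var nodes = JsonSerializer.Deserialize<List<ClassificationNodeSketch>>(sampleJson,
    new JsonSerializerOptions { PropertyNameCaseInsensitive = true });

Console.WriteLine(nodes[0].Topic);                 // Reliability
Console.WriteLine(nodes[0].ChildTopics[0].Topic);  // Simplicity and efficiency
Console.WriteLine(nodes[0].Embedded == null);      // True (no embedding generated yet)

// Assumed shape only; the actual class lives in the console project.
public class ClassificationNodeSketch
{
    public string Topic { get; set; }
    public string Definition { get; set; }
    public float[] Embedded { get; set; }
    public List<ClassificationNodeSketch> ChildTopics { get; set; }
}
```

Because each node can hold child nodes of the same type, the hierarchy can be arbitrarily deep, which is why the embedding step that follows is written recursively.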

### Creating the embeddings

Using the helper method, the embeddings are added to the same class instances, which are then saved to a new file.

In [None]:
async Task<List<ClassificationNode>> EmbedClassificationDataAsync(
    List<ClassificationNode> classificationNodes)
{
    foreach (var node in classificationNodes)
    {
        // Build the text to embed, e.g. "Topic: Definition"
        string textToEmbed = $"{node.Topic}: {node.Definition}";

        // Generate the embedding with the GetEmbeddingAsync helper
        var embedding = await GetEmbeddingAsync(textToEmbed);
        node.Embedded = embedding;

        Console.WriteLine($"Embedded => Topic: {node.Topic}");

        // Recursively embed all child topics
        if (node.ChildTopics != null && node.ChildTopics.Count > 0)
        {
            await EmbedClassificationDataAsync(node.ChildTopics);
        }
    }

    // Return the updated list
    return classificationNodes;
}

Saving the enriched classes to a new file.

In [None]:
var outputFilePath = "classes_with_embeddings3.json"; // Adjust path as needed

var updatedWafData = await EmbedClassificationDataAsync(classifications);

// Pretty-print for readability
var options = new JsonSerializerOptions
{
    WriteIndented = true
};

var updatedJson = JsonSerializer.Serialize(updatedWafData, options);
File.WriteAllText(outputFilePath, updatedJson);

Console.WriteLine($"Updated JSON with embeddings saved to {outputFilePath}.");

In [None]:
// loading the previously generated classes with embeddings
var inputFilePath = "classes_with_embeddings3.json"; 
var jsonString = File.ReadAllText(inputFilePath);
var classifications = JsonSerializer.Deserialize<List<ClassificationNode>>(jsonString, new JsonSerializerOptions
{
    PropertyNameCaseInsensitive = true
});

if (classifications == null)
{
    Console.WriteLine("Failed to deserialize JSON. Check file format.");
    return;
}

## Creating groups / clusters

Now we iterate through the list of feedback items and find the most similar class for each one. Similarity is measured with cosine similarity, and each feedback item is assigned to the class with the highest similarity. The classified feedback is saved to a new file. An enhanced version of the `FeedbackRecord` class is created to store the class and the similarity.

A helper class to calculate the similarity between the feedback and the classes.

In [None]:
public static class VectorMath
{
    // Expected embedding size. The loops below use the actual array length,
    // so a vector of a different size does not cause an out-of-range access.
    public const int VectorDimension = 1536;

    // Euclidean length (magnitude) of the vector
    public static float Length(float[] vector)
    {
        float sum = 0;
        for (int i = 0; i < vector.Length; i++)
        {
            sum += vector[i] * vector[i];
        }
        return (float)Math.Sqrt(sum);
    }

    public static float DotProduct(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        float sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Standard Cosine Similarity: dot(a, b) / (|a| * |b|)
    public static float CosineSimilarity(float[] a, float[] b)
    {
        float dot = DotProduct(a, b);
        float magA = Length(a);
        float magB = Length(b);

        // Handle potential divide-by-zero if either vector is all zeros
        if (magA < 1e-8f || magB < 1e-8f) return 0f;

        return dot / (magA * magB);
    }
}
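To make the cosine similarity scores easier to interpret, here is a small worked example on two-dimensional vectors. The `Cosine` and `Show` helpers are illustrative stand-ins written for this example (they are not the notebook's `VectorMath`), and the vectors are toy values rather than real embeddings.

```csharp
using System;
using System.Globalization;

// Cosine similarity: dot(a, b) / (|a| * |b|), computed over the actual array length.
double Cosine(float[] a, float[] b)
{
    if (a.Length != b.Length) throw new ArgumentException("Vectors must have the same length.");
    double dot = 0, magA = 0, magB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    if (magA == 0 || magB == 0) return 0; // all-zero vector: treat as no similarity
    return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
}

// Format with four decimals, independent of the machine's culture settings.
string Show(double value) => value.ToString("F4", CultureInfo.InvariantCulture);

Console.WriteLine(Show(Cosine(new float[] { 1, 0 }, new float[] { 1, 1 }))); // 0.7071 (45 degrees apart)
Console.WriteLine(Show(Cosine(new float[] { 1, 0 }, new float[] { 0, 1 }))); // 0.0000 (orthogonal)
Console.WriteLine(Show(Cosine(new float[] { 1, 2 }, new float[] { 2, 4 }))); // 1.0000 (same direction)
```

Only the direction of the vectors matters, not their magnitude. A pair scoring 0.7071 would fall below the 0.7555 threshold used later in this notebook.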

## Load Data

Load the feedback items (with their embeddings) and the classes into memory.

In [None]:
// the classifications are already loaded
var feedbackJson = File.ReadAllText("adf.json");
var feedbackList = JsonSerializer.Deserialize<List<ProductLeaders.console.Models.FeedbackRecord>>(feedbackJson);

## Classify the feedback

In [None]:
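// Minimum cosine similarity for a match; feedback whose best score does not exceed it is labeled "Other - Unclassified"
float threshold = 0.7555f;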

In [None]:
public static List<(ClassificationNode Node, List<string> Path)> FlattenNodes(
    List<ClassificationNode> nodes,
    List<string> parentPath = null)
{
    var results = new List<(ClassificationNode, List<string>)>();

    foreach (var node in nodes)
    {
        var currentPath = (parentPath == null || parentPath.Count == 0)
            ? new List<string> { node.Topic }
            : new List<string>(parentPath) { node.Topic };

        // Add this node
        results.Add((node, currentPath));

        // Recurse if child topics exist
        if (node.ChildTopics != null && node.ChildTopics.Count > 0)
        {
            results.AddRange(FlattenNodes(node.ChildTopics, currentPath));
        }
    }

    return results;
}

In [None]:
public List<ProductLeaders.console.Models.FeedbackRecord> ClassifyItems(
    List<ProductLeaders.console.Models.FeedbackRecord> feedbackList,
    List<ClassificationNode> classificationNodes)
{
    if (classificationNodes == null || feedbackList == null)
    {
        Console.WriteLine("No classification nodes or feedback data provided.");
        return feedbackList;
    }

    // 1) Flatten the classification hierarchy
    var allNodes = FlattenNodes(classificationNodes); 
    // allNodes is List<(ClassificationNode Node, List<string> Path)>

    // 2) For each feedback, find the best match
    foreach (var feedback in feedbackList)
    {
        float bestSimilarity = threshold;
        (ClassificationNode bestNode, List<string> bestPath) = (null, null);

        // If feedback has no embedding, skip or handle
        if (feedback.Embedding == null)
        {
            Console.WriteLine($"Feedback {feedback.Id} has no embedding. Skipping classification.");
            continue;
        }

        // 3) Compare to each classification node that has an embedding
        foreach (var (node, path) in allNodes)
        {
            if (node.Embedded == null) 
                continue; // Node has no embedding => skip

            float sim = VectorMath.CosineSimilarity(feedback.Embedding, node.Embedded);
            
            if (sim > bestSimilarity)
            {
                bestSimilarity = sim;
                bestNode = node;
                bestPath = path; // e.g., ["Reliability", "Simplicity and efficiency"]
            }
        }

        // 4) Assign classification levels if we found something
        feedback.ClassificationLevels.Clear();
        if (bestNode != null && bestPath != null)
        {
            // e.g. bestPath = ["Reliability", "Simplicity and efficiency"]
            feedback.ClassificationLevels.AddRange(bestPath);
        }
        else
        {
            feedback.ClassificationLevels.Add("Other - Unclassified");
            // Could store an empty list or note "Unclassified"
        }
    }

    // Optionally, serialize updated feedback
    var outputJson = JsonSerializer.Serialize(feedbackList, new JsonSerializerOptions { WriteIndented = true });
    File.WriteAllText("feedback_classified.json", outputJson);

    Console.WriteLine("Classification complete! Output written to feedback_classified.json");
    return feedbackList;
}

In [None]:
feedbackList = ClassifyItems(feedbackList, classifications);

## Review classifications

In [None]:
using System;
using System.Collections.Generic;
using System.Linq;

// Assumes 'feedbackList' is the List<FeedbackRecord> whose ClassificationLevels were populated by ClassifyItems.

Console.WriteLine("-----------------------------------------------------------------------------------------------");
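// Header widths match the row columns below: 18, 24, and 28 characters
Console.WriteLine($"| {"ID",-18} | {"Topic",-24} | {"Sub Topic",-28} | {"User Story",-16} |");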
Console.WriteLine("-----------------------------------------------------------------------------------------------");

foreach (var feedback in feedbackList)
{
    // Truncate or safely shorten strings if needed:
    string idTrunc = (feedback.Id ?? "").PadRight(18).Substring(0, 18);

    // Extract first and second levels (if they exist)
    string pillar = feedback.ClassificationLevels.Count >= 1
        ? feedback.ClassificationLevels[0]
        : "";
    string subCat = feedback.ClassificationLevels.Count >= 2
        ? feedback.ClassificationLevels[1]
        : "";

    // Pad/Substring to maintain table alignment
    string pillarTrunc = pillar.PadRight(24).Substring(0, 24);
    string subCatTrunc = subCat.PadRight(28).Substring(0, 28);

    // Adjust user story snippet as you prefer
    string userStoryTrunc = feedback.UserStory ?? "";
    if (userStoryTrunc.Length > 80)
        userStoryTrunc = userStoryTrunc.Substring(0, 80) + "...";

    // Print row
    Console.WriteLine($"| {idTrunc} | {pillarTrunc} | {subCatTrunc} | {userStoryTrunc} |");
}

Console.WriteLine("-----------------------------------------------------------------------------------------------");

## Putting it together

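The classifier variant in the cell below applies no threshold and records every similarity score to `similarity_scores.json` for offline analysis. Check out the next notebook, [report.ipynb](./report.ipynb), for generating the report.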

In [None]:
public List<ProductLeaders.console.Models.FeedbackRecord> ClassifyItemsNRecord(
    List<ProductLeaders.console.Models.FeedbackRecord> feedbackList,
    List<ClassificationNode> classificationNodes)
{
    if (classificationNodes == null || feedbackList == null)
    {
        Console.WriteLine("No classification nodes or feedback data provided.");
        return feedbackList;
    }

    var allNodes = FlattenNodes(classificationNodes); // Flatten hierarchy

    List<float> allSimilarities = new List<float>(); // Collect similarity scores for analysis

    foreach (var feedback in feedbackList)
    {
        float bestSimilarity = float.NegativeInfinity;
        (ClassificationNode bestNode, List<string> bestPath) = (null, null);

        if (feedback.Embedding == null)
        {
            Console.WriteLine($"Feedback {feedback.Id} has no embedding. Skipping classification.");
            continue;
        }

        foreach (var (node, path) in allNodes)
        {
            if (node.Embedded == null) 
                continue; // Node has no embedding => skip

            float sim = VectorMath.CosineSimilarity(feedback.Embedding, node.Embedded);
            allSimilarities.Add(sim); // Store for analysis

            if (sim > bestSimilarity)
            {
                bestSimilarity = sim;
                bestNode = node;
                bestPath = path;
            }
        }

        feedback.ClassificationLevels.Clear();
        if (bestNode != null && bestPath != null)
        {
            feedback.ClassificationLevels.AddRange(bestPath);
        }
        else
        {
            feedback.ClassificationLevels.Add("Other - Unclassified"); // Default label when no node has an embedding
        }

        // Log best match for debugging
        Console.WriteLine($"Feedback {feedback.Id} - Best Similarity: {bestSimilarity}");
    }

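    // Min/Max/Average throw on an empty sequence, so stop here if nothing was compared
    if (allSimilarities.Count == 0)
    {
        Console.WriteLine("No similarity scores were computed.");
        return feedbackList;
    }

    // Save similarity scores for offline analysis
    var similarityStats = new
    {
        Min = allSimilarities.Min(),
        Max = allSimilarities.Max(),
        Avg = allSimilarities.Average(),
        Scores = allSimilarities
    };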
    File.WriteAllText("similarity_scores.json", JsonSerializer.Serialize(similarityStats, new JsonSerializerOptions { WriteIndented = true }));

    Console.WriteLine("Classification complete! Similarity values saved to similarity_scores.json.");
    return feedbackList;
}
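One possible use of the recorded scores is to check how many of them clear a candidate threshold before committing to a value. The sketch below shows this on made-up numbers. The `scores` list stands in for the `Scores` array of `similarity_scores.json`, and `candidateThreshold` is simply the value used earlier in this notebook.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up scores for illustration; real values would be read from similarity_scores.json.
var scores = new List<float> { 0.62f, 0.71f, 0.76f, 0.80f, 0.83f };

float candidateThreshold = 0.7555f;

// Count the scores that would pass the same "sim > threshold" test the classifier uses.
int above = scores.Count(s => s > candidateThreshold);

Console.WriteLine($"{above} of {scores.Count} scores exceed the candidate threshold");
// prints "3 of 5 scores exceed the candidate threshold"
```

Note that `ClassifyItemsNRecord` records the score of every feedback and class pair, not only the best match per feedback item, so the share computed this way describes the overall score distribution.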