# Service Summary & Clustering

This notebook shows the two main pre-processing activities performed on the data:
1. **Service Summary**: A summary of all feedback per service, with a few aggregate statistics calculated for each service.
2. **Service Clustering**: Using the embeddings of the normalized feedback, we cluster each service's feedback records into groups.

## Required packages

In [None]:
#r "nuget: System.Text.Json"
#r "nuget: Microsoft.ML"
#r "nuget: Azure.AI.OpenAI, 1.0.0-beta.12"
#r "nuget: DotNetEnv, 2.5.0"

In [None]:
using Microsoft.ML;
using Microsoft.ML.Data;
using System.Text.Json;
using System.IO;
using System.Text.Json.Serialization;
using System.Linq;
using Azure; 
using Azure.AI.OpenAI;
using DotNetEnv;

## Loading required types

In [None]:
// Load the feedback record and service cluster class definitions
#load "./FeedbackRecord.cs"
#load "./ServiceCluster.cs"

## OpenAI Client

Creating an AI client to interact with the OpenAI API.

In [None]:

static string _configurationFile = @"../../configuration/.env";
Env.Load(_configurationFile);

// Read a required setting and fail fast with a clear message if it is missing,
// instead of passing a placeholder string on to the client
string RequireEnv(string name) =>
    Environment.GetEnvironmentVariable(name)
    ?? throw new InvalidOperationException($"{name} not found; check {_configurationFile}");

string oAiApiKey = RequireEnv("AOAI_APIKEY");
string oAiEndpoint = RequireEnv("AOAI_ENDPOINT");
string chatCompletionDeploymentName = RequireEnv("CHATCOMPLETION_DEPLOYMENTNAME");
string embeddingDeploymentName = RequireEnv("EMBEDDING_DEPLOYMENTNAME");
string dataRoot = RequireEnv("DB_ROOT_FOLDER");

AzureKeyCredential azureKeyCredential = new AzureKeyCredential(oAiApiKey);
OpenAIClient openAIClient = new OpenAIClient(new Uri(oAiEndpoint), azureKeyCredential);

Console.WriteLine($"OpenAI Client created: {oAiEndpoint} with: {chatCompletionDeploymentName} and {embeddingDeploymentName} deployments");

## CallOpenAI

Helper method that calls the Azure OpenAI chat completions API. By default it requests a JSON object as the response. In this notebook, it is used to generate insights into each cluster.

In [None]:
async Task<string> CallOpenAI(string prompt, string systemMessage, bool JasonResponse = true)
{
    // Create ChatCompletionsOptions and set up the system and user messages
    ChatCompletionsOptions options = new ChatCompletionsOptions();
    
    // Add system message
    options.Messages.Add(new ChatRequestSystemMessage(systemMessage));
    
    // Add user message (the prompt generated from feedback)
    options.Messages.Add(new ChatRequestUserMessage(prompt));

    // Configure request properties
    options.MaxTokens = 4096;
    options.Temperature = 0.7f;
    options.NucleusSamplingFactor = 0.95f;
    options.FrequencyPenalty = 0.0f;
    options.PresencePenalty = 0.0f;
    // options.StopSequences.Add("\n"); 
    options.DeploymentName = chatCompletionDeploymentName;
    if (JasonResponse) options.ResponseFormat = ChatCompletionsResponseFormat.JsonObject;

    // Make the API request to get the chat completions
    Response<ChatCompletions> response = await openAIClient.GetChatCompletionsAsync(options);

    // Extract and return the first response from the choices
    ChatCompletions completions = response.Value;
    if (completions.Choices.Count > 0)
    {
        return completions.Choices[0].Message.Content;
    }
    else
    {
        return "No response generated.";
    }
}

## Loading the right data segment

The data covers several service types (the options are listed in the cell below). We load the data for one service type at a time and perform the pre-processing activities on it.

In [None]:
var servicename = "fabric"; // "aks"  or "cosmosdb" or "fabric" or "adf"
var jsonFilePath = $"{dataRoot}/{servicename}.json";
var jsonString = File.ReadAllText(jsonFilePath);
var feedbackRecords = JsonSerializer.Deserialize<List<FeedbackRecord>>(jsonString);
// Print number of feedback records
Console.WriteLine($"Number of feedback records for {servicename}: {feedbackRecords.Count}");

## Performing clustering on the data

The next cell clusters the feedback embeddings with KMeans. Here is a breakdown of what the code does:

1. **Data Preparation:**
   - The code creates an `embeddingData` list by mapping the embeddings of the feedback records into `EmbeddingData` objects.
   - The `embeddingData` is then converted into a `dataView`, which is required by ML.NET for processing.

2. **Clustering using KMeans:**
   - The `MLContext` object is used to set up a machine learning environment.
   - The pipeline uses the **KMeans** algorithm for clustering, where `featureColumnName` refers to the embeddings of the feedback records.
   - The `numberOfClusters` (in this case, 50) is passed to the KMeans algorithm, indicating the number of clusters you want the algorithm to fit the data into.

3. **Model Training:**
   - The `Fit` method trains the KMeans model on the provided `dataView` (which contains the feedback embeddings).

4. **Cluster Prediction:**
   - After training, the `Transform` method applies the clustering model to the `dataView` to predict the cluster assignments for each feedback record.
   - The resulting `predictions` contain the cluster number (label) for each embedding in `PredictedLabel`.

5. **Assigning Clusters:**
   - The code uses `Zip` to combine the original feedback records with the predicted cluster labels, creating `feedbackWithClusters` which pairs each feedback record with its assigned cluster.
   - The number of distinct clusters is calculated using `Distinct()` and printed.

### Summary:
- This code clusters the feedback data into the specified number of clusters (50 in this case).
- **KMeans** is used to group the feedback records into 50 clusters based on their embedding vectors.
- The code then assigns each feedback record to one of the predicted clusters (`PredictedCluster`).
- Finally, it prints the number of distinct clusters found in `feedbackWithClusters`. This normally equals the number configured for the model (50 in this case), but it can be lower if a cluster ends up with no records assigned.
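
The pairing in step 5 works because `Zip` matches items by position, so the predictions must stay in the same order as the feedback records. The following is a minimal sketch of that step using hypothetical stand-in data (`feedbackIds` and `predictedLabels` are illustrative names, not part of this notebook), and it assumes one label per record:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the feedback records and the KMeans output.
var feedbackIds = new List<string> { "fb-1", "fb-2", "fb-3", "fb-4", "fb-5" };
var predictedLabels = new List<uint> { 2, 1, 2, 3, 1 }; // one label per record, same order

// Zip pairs element i of the first list with element i of the second,
// so the pairing is only correct while both lists keep the original row order.
List<(string Id, uint Cluster)> paired = feedbackIds
    .Zip(predictedLabels, (id, label) => (Id: id, Cluster: label))
    .ToList();

// Count the labels that were actually assigned to at least one record.
int distinctClusters = paired.Select(p => p.Cluster).Distinct().Count();

// Cluster sizes, largest first; ties are ordered by cluster label.
var sizes = paired
    .GroupBy(p => p.Cluster)
    .OrderByDescending(g => g.Count())
    .ThenBy(g => g.Key)
    .Select(g => (Cluster: g.Key, Count: g.Count()))
    .ToList();

Console.WriteLine($"Distinct clusters: {distinctClusters}");
foreach (var (cluster, count) in sizes)
{
    Console.WriteLine($"Cluster {cluster}: {count} record(s)");
}
```

With this sample data, three distinct labels are in use even though a model could have been configured with more clusters, which is why the notebook prints the distinct count rather than assuming it.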



In [None]:
var clusterCount = 50;
var mlContext = new MLContext();
var embeddingData = feedbackRecords.Select(f => new EmbeddingData { Embedding = f.Embedding }).ToList();
var dataView = mlContext.Data.LoadFromEnumerable(embeddingData);

// Cluster the embeddings using KMeans (the number of clusters is set by clusterCount)
var pipeline = mlContext.Clustering.Trainers.KMeans(featureColumnName: "Embedding", numberOfClusters: clusterCount);
var model = pipeline.Fit(dataView);

// Predict the cluster for each feedback record
var predictions = model.Transform(dataView);
var clusters = mlContext.Data.CreateEnumerable<ClusterPrediction>(predictions, reuseRowObject: false).ToList();

// Console.WriteLine($"Number of clusters: {clusters.Count}");

public class ClusterPrediction
{
    [ColumnName("PredictedLabel")]
    public uint PredictedCluster { get; set; }  // Cluster number (1, 2, 3, etc.)
}

List<(FeedbackRecord Feedback, uint Cluster)> feedbackWithClusters = feedbackRecords
    .Zip(clusters, (feedback, cluster) => 
        (Feedback: feedback, Cluster: cluster.PredictedCluster)
    )
    .ToList();
    


// Print the number of distinct clusters that were actually assigned
var numberOfClusters = feedbackWithClusters.Select(f => f.Cluster).Distinct().Count();
Console.WriteLine($"Number of clusters: {numberOfClusters}");

## GenerateClusters 

The `GenerateClusters` method takes the feedback records paired with their cluster labels and groups them into `ServiceCluster` objects. Each object holds the cluster's feedback records, the number of similar feedback items, and the number of distinct customers. `CommonElement` and `Summary` are placeholders at this stage and are filled in later using OpenAI.

In [None]:

public List<ServiceCluster> GenerateClusters(List<(FeedbackRecord Feedback, uint Cluster)> feedbackWithClusters)
{
    var serviceClusters = feedbackWithClusters
        .GroupBy(fc => fc.Cluster) // Group by the predicted cluster
        .Select(clusterGroup =>
        {
            // Collect full FeedbackRecords for this cluster
            var feedbackRecords = clusterGroup
                .Select(fc => fc.Feedback)
                .ToList();

            // Calculate distinct customers
            var distinctCustomers = feedbackRecords
                .Select(f => f.CustomerName)
                .Distinct()
                .Count();

            // Create the service cluster object
            return new ServiceCluster
            {
                ClusterId = clusterGroup.Key.ToString(),
                CommonElement = "Common Theme Placeholder", // Replace with actual summarization from OpenAI
                SimilarFeedbacks = feedbackRecords.Count,
                DistinctCustomers = distinctCustomers,
                FeedbackRecords = feedbackRecords,  // Full feedback records
                Summary = "Cluster summary placeholder" // Use OpenAI for summarization
            };
        })
        .ToList();

    return serviceClusters;
}

## Examining the Clusters

High-level information about each cluster, before calling OpenAI.

In [None]:
var sortedByBoth = GenerateClusters(feedbackWithClusters)
    .OrderByDescending(cluster => cluster.DistinctCustomers)
    .ThenByDescending(cluster => cluster.SimilarFeedbacks)
    .ToList();
// print the cluster with its data
foreach (var cluster in sortedByBoth)
{
    Console.WriteLine($"Cluster {cluster.ClusterId}: {cluster.SimilarFeedbacks} similar feedbacks from {cluster.DistinctCustomers} customers");
}

In [None]:
string systemMessage = @"
Generate a JSON response with the following structure:
{
  ""CommonElement"": ""<A concise phrase describing the common theme>"",
  ""Summary"": ""<A detailed explanation summarizing the feedback>""
}
Make sure the common element is clear and concise, and the summary provides a comprehensive explanation.
Base your summary only on the user stories provided by the user.
";
public class OpenAIResponse
{
    public string CommonElement { get; set; }
    public string Summary { get; set; }
}

In [None]:
public async Task<List<ServiceCluster>> EnhanceClustersWithOpenAIAsync(List<ServiceCluster> sortedClusters)
{
    int count = 0; // for testing, limit to 3 clusters
    foreach (var cluster in sortedClusters)
    {
        // Prepare the prompt by concatenating the feedback user stories for each cluster
        count++;
        string prompt = string.Empty;
        foreach (var feedback in cluster.FeedbackRecords)
        {
            prompt += $"- {feedback.UserStory}\n";
        }
        
        // Call OpenAI to generate the common element and summary
        var openAIResponse = await CallOpenAI(prompt, systemMessage);
        Console.WriteLine($"Called OpenAI {cluster.ClusterId}");

        // Deserialize the JSON response from OpenAI
        try
        {
            var openAIResult = JsonSerializer.Deserialize<OpenAIResponse>(openAIResponse);

            if (openAIResult != null)
            {
                // Update the cluster with OpenAI results
                cluster.CommonElement = openAIResult.CommonElement;
                cluster.Summary = openAIResult.Summary;
            }
            else
            {
                Console.WriteLine("Failed to deserialize OpenAI response");
            }
        }
        catch (JsonException ex)
        {
            Console.WriteLine($"JSON deserialization error: {ex.Message}");
        }

        // Uncomment to stop after 3 clusters while testing
        // if (count == 3)
        // {
        //     break;
        // }
    }

    return sortedClusters;
}

In [None]:
var clusterList = await EnhanceClustersWithOpenAIAsync(sortedByBoth);

In [None]:
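// Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|)
public static double CalculateCosineSimilarity(float[] vectorA, float[] vectorB)
{
    if (vectorA.Length != vectorB.Length)
    {
        throw new ArgumentException("Vectors must have the same length.");
    }

    double dotProduct = 0;
    double magnitudeA = 0;
    double magnitudeB = 0;

    for (int i = 0; i < vectorA.Length; i++)
    {
        // Accumulate in double precision
        dotProduct += (double)vectorA[i] * vectorB[i];
        magnitudeA += (double)vectorA[i] * vectorA[i];
        magnitudeB += (double)vectorB[i] * vectorB[i];
    }

    // A zero vector has no direction; return 0 instead of NaN
    if (magnitudeA == 0 || magnitudeB == 0)
    {
        return 0;
    }

    return dotProduct / (Math.Sqrt(magnitudeA) * Math.Sqrt(magnitudeB));
}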

In [None]:
public List<List<FeedbackRecord>> CreateSubClustersWithKMeans(List<FeedbackRecord> feedbackRecords, int subClusterCount)
{
    var mlContext = new MLContext();
    var embeddingData = feedbackRecords.Select(f => new EmbeddingData { Embedding = f.Embedding }).ToList();
    var dataView = mlContext.Data.LoadFromEnumerable(embeddingData);

    // Use K-Means to create sub-clusters
    var pipeline = mlContext.Clustering.Trainers.KMeans(featureColumnName: "Embedding", numberOfClusters: subClusterCount);
    var model = pipeline.Fit(dataView);

    var predictions = model.Transform(dataView);
    var clusters = mlContext.Data.CreateEnumerable<ClusterPrediction>(predictions, reuseRowObject: false).ToList();

    // Assign feedback records to their sub-clusters
    return feedbackRecords
        .Zip(clusters, (feedback, cluster) => (Feedback: feedback, ClusterId: cluster.PredictedCluster))
        .GroupBy(fc => fc.ClusterId)
        .Select(g => g.Select(fc => fc.Feedback).ToList())
        .ToList();
}

In [None]:
public List<List<FeedbackRecord>> CreateSubClustersWithHAC(List<FeedbackRecord> feedbackRecords, double similarityThreshold)
{
    // Initially, each feedback record is its own cluster
    var clusters = feedbackRecords.Select(r => new List<FeedbackRecord> { r }).ToList();

    while (true)
    {
        double maxSimilarity = double.MinValue;
        int mergeIndex1 = -1;
        int mergeIndex2 = -1;

        // Find the most similar pair of clusters
        for (int i = 0; i < clusters.Count; i++)
        {
            for (int j = i + 1; j < clusters.Count; j++)
            {
                double similarity = CalculateAverageCosineSimilarity(clusters[i], clusters[j]);
                if (similarity > maxSimilarity)
                {
                    maxSimilarity = similarity;
                    mergeIndex1 = i;
                    mergeIndex2 = j;
                }
            }
        }

        // Stop merging if the highest similarity is below the threshold
        if (maxSimilarity < similarityThreshold)
        {
            break;
        }

        // Merge the two most similar clusters
        clusters[mergeIndex1].AddRange(clusters[mergeIndex2]);
        clusters.RemoveAt(mergeIndex2);
    }

    return clusters;
}

// Helper method to calculate the average similarity between two clusters
private double CalculateAverageCosineSimilarity(List<FeedbackRecord> cluster1, List<FeedbackRecord> cluster2)
{
    double totalSimilarity = 0;
    int comparisons = 0;

    foreach (var record1 in cluster1)
    {
        foreach (var record2 in cluster2)
        {
            totalSimilarity += CalculateCosineSimilarity(record1.Embedding, record2.Embedding);
            comparisons++;
        }
    }

    return totalSimilarity / comparisons;
}

In [None]:
// Greedy sub-clustering: each seed record collects every unassigned record
// whose cosine similarity to the seed meets the threshold
public List<List<FeedbackRecord>> CreateSubClusters(List<FeedbackRecord> feedbackRecords, double threshold)
{
    var subClusters = new List<List<FeedbackRecord>>();
    var unassignedRecords = new HashSet<FeedbackRecord>(feedbackRecords);

    while (unassignedRecords.Any())
    {
        var seed = unassignedRecords.First();
        unassignedRecords.Remove(seed);

        var currentCluster = new List<FeedbackRecord> { seed };

        foreach (var record in unassignedRecords.ToList())
        {
            double similarity = CalculateCosineSimilarity(seed.Embedding, record.Embedding);
            if (similarity >= threshold)
            {
                currentCluster.Add(record);
                unassignedRecords.Remove(record);
            }
        }

        subClusters.Add(currentCluster);
    }

    return subClusters;
}

In [None]:
// Load the full clusters from a file.
// Note: LoadClustersFromFile is defined in a later cell; run that cell first.
var fullClusterFilePath = $"{dataRoot}/{servicename}-clusters-full.json";
Console.WriteLine($"Loading full clusters from file: {fullClusterFilePath}");
var clusterList = await LoadClustersFromFile(fullClusterFilePath);

In [None]:
// Define your cosine similarity threshold for sub-clustering
double cosineThreshold = 0.862;

// Iterate over each main cluster in clusterList and apply sub-clustering
foreach (var cluster in clusterList)
{
    Console.WriteLine($"Processing sub-clustering for main cluster {cluster.ClusterId} with {cluster.FeedbackRecords.Count} items");

    // Run sub-clustering within each main cluster
    // var subClusters = CreateSubClusters(cluster.FeedbackRecords, cosineThreshold);
    // var kMeansSubClusters = CreateSubClustersWithKMeans(cluster.FeedbackRecords, 10);
    var hacSubClusters = CreateSubClustersWithHAC(cluster.FeedbackRecords, cosineThreshold);

    // Console.WriteLine($"Main Cluster {cluster.ClusterId} has {subClusters.Count} greedy-sub-clusters, and {kMeansSubClusters.Count} KMeans and {hacSubClusters.Count} HAC .");
    Console.WriteLine($"Main Cluster {cluster.ClusterId} has {hacSubClusters.Count} HAC-based sub-clusters.");

    // Enhance the main cluster with sub-clusters (if needed)
    cluster.SubClusters = hacSubClusters;
}

In [None]:
private double CalculateAverageSimilarity(List<FeedbackRecord> feedbackRecords)
{
    // Example calculation of average similarity between feedback items in a sub-cluster.
    // Adjust this to match how similarity is measured in your context.
    
    double totalSimilarity = 0;
    int count = 0;

    for (int i = 0; i < feedbackRecords.Count; i++)
    {
        for (int j = i + 1; j < feedbackRecords.Count; j++)
        {
            // Calculate similarity between two embeddings (e.g., cosine similarity)
            double similarity = CalculateCosineSimilarity(feedbackRecords[i].Embedding, feedbackRecords[j].Embedding);
            totalSimilarity += similarity;
            count++;
        }
    }
    return count > 0 ? totalSimilarity / count : 0;
}

In [None]:
string thematicMessage = @"
Given the following summary, create one concise, overarching statement that captures the main theme or purpose described.
Focus on summarizing the core idea in a single short sentence.";

In [None]:
// Filter main clusters with more than one sub-cluster
var clustersWithMultipleSubClusters = clusterList
    .Where(mainCluster => mainCluster.SubClusters != null && mainCluster.SubClusters.Count > 1)
    .ToList();

In [None]:
// Ask OpenAI for a one-sentence theme per main cluster and keep it for the report in the next cell
var clusterThemes = new Dictionary<string, string>();
foreach (var mainCluster in clustersWithMultipleSubClusters)
{
    var openAIResponse = await CallOpenAI(mainCluster.Summary, thematicMessage, JasonResponse: false);
    clusterThemes[mainCluster.ClusterId] = openAIResponse;
    Console.WriteLine($"{mainCluster.ClusterId} with summary {mainCluster.Summary} \n with: {openAIResponse}");
}

In [None]:
// Log information about each main cluster that has multiple sub-clusters
foreach (var mainCluster in clustersWithMultipleSubClusters)
{
    // Use the generated theme as the title; fall back to the cluster id if none was generated
    var theme = clusterThemes.TryGetValue(mainCluster.ClusterId, out var generatedTheme)
        ? generatedTheme
        : mainCluster.ClusterId;

    Console.WriteLine($"Main Cluster {theme} with {mainCluster.FeedbackRecords.Count} items and {mainCluster.SubClusters.Count} sub-clusters.");

    foreach (var subCluster in mainCluster.SubClusters)
    {
        Console.WriteLine("  ------------------");
        Console.WriteLine($"  - Sub-Cluster with {subCluster.Count} feedback items.");

        // Display the average pairwise similarity within the sub-cluster
        var avgSimilarity = CalculateAverageSimilarity(subCluster);
        Console.WriteLine($"    Average Similarity within Sub-Cluster: {avgSimilarity:F2}");

        // Print the feedback items only if the sub-cluster has more than one
        if (subCluster.Count > 1)
        {
            foreach (var feedback in subCluster)
            {
                Console.WriteLine($"    Sample Feedback:{feedback.Id}|| {feedback.UserStory} ||");
            }
        }
    }
}

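## Helpers for saving and loading

The embeddings are not needed when the clusters are displayed, so they can optionally be cleared before saving. The helpers in this section clear the embeddings, load clusters from a JSON file, and save clusters to a JSON file.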
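
As an illustration of why clearing the embeddings can shrink the saved file, the sketch below uses a hypothetical, trimmed-down `DemoFeedbackRecord` class (the real `FeedbackRecord` is defined in `FeedbackRecord.cs` and may be annotated differently). It assumes the `Embedding` property is marked so that `System.Text.Json` omits it when it is null; without such an attribute, a cleared embedding is still written as `"Embedding": null`.

```csharp
#nullable enable
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

var record = new DemoFeedbackRecord { Id = "fb-1", Embedding = new[] { 0.1f, 0.2f } };

// Clear the embedding, as CleanClusterList does for every record in a cluster.
record.Embedding = null;

// Because of the attribute below, the null property is left out of the output.
string json = JsonSerializer.Serialize(record);
Console.WriteLine(json); // prints {"Id":"fb-1"}

// Hypothetical stand-in for FeedbackRecord, reduced to two properties.
public class DemoFeedbackRecord
{
    public string Id { get; set; } = "";

    // Omit this property from the JSON whenever its value is null.
    [JsonIgnore(Condition = JsonIgnoreCondition.WhenWritingNull)]
    public float[]? Embedding { get; set; }
}
```

Whether to annotate the real class this way is a design choice: it keeps the saved file small when embeddings are cleared, while still writing them when they are present.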

In [None]:
public List<ServiceCluster> CleanClusterList(List<ServiceCluster> clusterList)
{
    foreach (var cluster in clusterList)
    {
        foreach (var feedback in cluster.FeedbackRecords)
        {
            // Set the Embedding field to null (or simply remove this line from the class definition if you don't need it)
            feedback.Embedding = null;
        }
    }

    return clusterList;
}

In [None]:
// Method to load clusters from a JSON file
public async Task<List<ServiceCluster>> LoadClustersFromFile(string filePath)
{
    try
    {
        if (!File.Exists(filePath))
        {
            throw new FileNotFoundException("File not found.", filePath);
        }

        // Read JSON from file and deserialize to List<ServiceCluster>
        string json = await File.ReadAllTextAsync(filePath);
        var clusters = JsonSerializer.Deserialize<List<ServiceCluster>>(json);

        Console.WriteLine($"Clusters loaded from {filePath}");
        return clusters ?? new List<ServiceCluster>();
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Error loading clusters from file: {ex.Message}");
        return new List<ServiceCluster>();  // Return empty list if an error occurs
    }
}

In [None]:
public async Task SaveClustersToJsonAsync(List<ServiceCluster> clusterList, string outputPath, bool cleanEmbeddings = true)
{
    try
    {
        // clean the cluster list
        if (cleanEmbeddings) clusterList = CleanClusterList(clusterList);
        // Serialize the cluster list to JSON
        var json = JsonSerializer.Serialize(clusterList, new JsonSerializerOptions { WriteIndented = true });

        // Write the JSON string to a file
        await File.WriteAllTextAsync(outputPath, json);

        Console.WriteLine($"Cluster data saved to {outputPath}");
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Error saving clusters to JSON: {ex.Message}");
    }
}

## Saving to a file

The last step is to save the clusters to a file, which is used to display the clusters in the UI.
An API endpoint, `[HttpGet("GetServiceClusters/{serviceName}")]`, reads the clusters from this file so that the UI can display them.

In [None]:
var outputPath = $"{dataRoot}/{servicename}-clusters-full.json";  // Define the output file path
await SaveClustersToJsonAsync(clusterList, outputPath, cleanEmbeddings: false);  // Save the full cluster data

In [None]:
string jsonString = File.ReadAllText(outputPath);

In [None]:
var options = new JsonSerializerOptions
{
    PropertyNameCaseInsensitive = true
};

List<ServiceCluster> clusters = JsonSerializer.Deserialize<List<ServiceCluster>>(jsonString, options);

// Process each cluster to generate initiative ideas
foreach (var cluster in clusters)
{
    // Extract necessary information
    string clusterId = cluster.ClusterId;
    string commonElement = cluster.CommonElement;
    int similarFeedbacks = cluster.SimilarFeedbacks;
    int distinctCustomers = cluster.DistinctCustomers;
    string summary = cluster.Summary;

    // Generate initiative idea
    string initiativeIdea = $"Initiative Idea for Cluster {clusterId}:\n" +
        $"- **Focus Area**: {commonElement}\n" +
        $"- **Description**: {summary}\n" +
        $"- **Potential Impact**: Addresses feedback from {similarFeedbacks} similar feedback items across {distinctCustomers} customers.\n";

    // Output the initiative idea
    Console.WriteLine(initiativeIdea);
    Console.WriteLine(new string('*', 50));
}
