In [None]:
#r "nuget: System.Text.Json"
#r "nuget: Microsoft.ML"
#r "nuget: Azure.AI.OpenAI, 1.0.0-beta.12"
#r "nuget: DotNetEnv, 2.5.0"



In [2]:
using Microsoft.ML;
using Microsoft.ML.Data;
using System.Text.Json;
using System.IO;

using System.Text.Json.Serialization;

using System.Linq;
using Azure; 
using Azure.AI.OpenAI;
using DotNetEnv;



## Load required types

In [3]:
// Load the FeedbackRecord and ServiceCluster type definitions
#load "./FeedbackRecord.cs"
#load "./ServiceCluster.cs"

## OpenAI Client

In [55]:

static string _configurationFile = @"../../configuration/.env";
Env.Load(_configurationFile);

string oAiApiKey = Environment.GetEnvironmentVariable("AOAI_APIKEY") ?? "AOAI_APIKEY not found";
string oAiEndpoint = Environment.GetEnvironmentVariable("AOAI_ENDPOINT") ?? "AOAI_ENDPOINT not found";
string chatCompletionDeploymentName = Environment.GetEnvironmentVariable("CHATCOMPLETION_DEPLOYMENTNAME") ?? "CHATCOMPLETION_DEPLOYMENTNAME not found";
string embeddingDeploymentName = Environment.GetEnvironmentVariable("EMBEDDING_DEPLOYMENTNAME") ?? "EMBEDDING_DEPLOYMENTNAME not found";
string dataRoot = Environment.GetEnvironmentVariable("DB_ROOT_FOLDER") ?? "DB_ROOT_FOLDER not found";

AzureKeyCredential azureKeyCredential = new AzureKeyCredential(oAiApiKey);
OpenAIClient openAIClient = new OpenAIClient(new Uri(oAiEndpoint), azureKeyCredential);

Console.WriteLine($"OpenAI Client created: {oAiEndpoint} with: {chatCompletionDeploymentName} and {embeddingDeploymentName} deployments");

OpenAI Client created: https://yd-openai-sweden.openai.azure.com/ with: yd-sweeden-40 and emedd-ada-002 deployments
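
The cell above falls back to a placeholder string such as `"AOAI_ENDPOINT not found"` when a variable is missing, and that placeholder would then be passed on to `new Uri(...)`, where it would likely fail with a less obvious error. A minimal sketch of a fail-fast alternative follows. The helper name `RequireEnv` and the `DEMO_*` variable names are hypothetical and are not part of the notebook.

```csharp
using System;

// Hypothetical helper: return a required setting or fail with a clear message.
string RequireEnv(string name)
{
    var value = Environment.GetEnvironmentVariable(name);
    if (string.IsNullOrWhiteSpace(value))
    {
        throw new InvalidOperationException($"Environment variable '{name}' is not set.");
    }
    return value;
}

// Demo value only; in the notebook the values come from the .env file.
Environment.SetEnvironmentVariable("DEMO_ENDPOINT", "https://example.invalid/");

string endpoint = RequireEnv("DEMO_ENDPOINT");
Console.WriteLine($"Endpoint: {endpoint}");
```

With this pattern a missing setting stops the notebook at the configuration cell, instead of surfacing later when the client is created or first used.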


## CallOpenAI

Helper method that calls the Azure OpenAI chat completions API. The response format is set to `JsonObject`, so the model returns a JSON object.

In [5]:
async Task<string> CallOpenAI(string prompt, string systemMessage)
{
    // Create ChatCompletionsOptions and set up the system and user messages
    ChatCompletionsOptions options = new ChatCompletionsOptions();
    
    // Add system message
    options.Messages.Add(new ChatRequestSystemMessage(systemMessage));
    
    // Add user message (the prompt generated from feedback)
    options.Messages.Add(new ChatRequestUserMessage(prompt));

    // Configure request properties
    options.MaxTokens = 500;
    options.Temperature = 0.7f;
    options.NucleusSamplingFactor = 0.95f;
    options.FrequencyPenalty = 0.0f;
    options.PresencePenalty = 0.0f;
    // options.StopSequences.Add("\n"); 
    options.DeploymentName = chatCompletionDeploymentName;
    options.ResponseFormat = ChatCompletionsResponseFormat.JsonObject;

    // Make the API request to get the chat completions
    Response<ChatCompletions> response = await openAIClient.GetChatCompletionsAsync(options);

    // Extract and return the first response from the choices
    ChatCompletions completions = response.Value;
    if (completions.Choices.Count > 0)
    {
        // output all choices
        // foreach (var choice in completions.Choices)
        // {
        //     Console.WriteLine($"in the loop: {choice.Message.Content}");
        // }
        return completions.Choices[0].Message.Content;
    }
    else
    {
        return "No response generated.";
    }
}

## Loading the right data segment

In [57]:
var servicename = "cosmosdb"; // "aks"  or "cosmosdb"
var jsonFilePath = $"{dataRoot}/{servicename}.json";
var jsonString = File.ReadAllText(jsonFilePath);
var feedbackRecords = JsonSerializer.Deserialize<List<FeedbackRecord>>(jsonString);
// Print number of feedback records
Console.WriteLine($"Number of feedback records for {servicename}: {feedbackRecords.Count}");

Number of feedback records for cosmosdb: 600


## Performing clustering on the data

The next cell clusters the feedback embeddings. Here is a breakdown of what the code does:

1. **Data Preparation:**
   - The code creates an `embeddingData` list by mapping the embeddings of the feedback records into `EmbeddingData` objects.
   - The `embeddingData` is then converted into a `dataView`, which is required by ML.NET for processing.

2. **Clustering using KMeans:**
   - The `MLContext` object is used to set up a machine learning environment.
   - The pipeline uses the **KMeans** algorithm for clustering, where `featureColumnName` refers to the embeddings of the feedback records.
   - The `numberOfClusters` (in this case, 50) is passed to the KMeans algorithm, indicating the number of clusters you want the algorithm to fit the data into.

3. **Model Training:**
   - The `Fit` method trains the KMeans model on the provided `dataView` (which contains the feedback embeddings).

4. **Cluster Prediction:**
   - After training, the `Transform` method applies the clustering model to the `dataView` to predict the cluster assignments for each feedback record.
   - The resulting `predictions` contain the cluster number (label) for each embedding in `PredictedLabel`.

5. **Assigning Clusters:**
   - The code uses `Zip` to combine the original feedback records with the predicted cluster labels, creating `feedbackWithClusters` which pairs each feedback record with its assigned cluster.
   - The number of distinct clusters is calculated using `Distinct()` and printed.

### Summary:
- This code clusters the feedback data into the specified number of clusters (50 in this case).
- **KMeans** is used to group the feedback records into 50 clusters based on their embedding vectors.
- The code then assigns each feedback record to one of the predicted clusters (`PredictedCluster`).
- Finally, it prints the number of distinct clusters found in `feedbackWithClusters`. This normally equals the number requested (50 here), but it can be lower if a cluster ends up with no records assigned to it.
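
To make the clustering idea concrete, here is a minimal sketch of the basic k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its members) on toy 2-D points. It is a plain C# illustration of the general technique, not ML.NET's actual implementation; the data, the fixed starting centroids, and the helper names are assumptions made for the example. Labels here are 0-based, whereas the notebook's `PredictedLabel` values start at 1.

```csharp
using System;
using System.Linq;

// Toy 2-D "embeddings"; real embeddings have many more dimensions.
double[][] points =
{
    new[] { 0.0, 0.0 }, new[] { 0.0, 1.0 }, new[] { 1.0, 0.0 },
    new[] { 10.0, 10.0 }, new[] { 10.0, 11.0 }, new[] { 11.0, 10.0 },
};

// Fixed starting centroids keep the demo deterministic (an assumption for this sketch).
double[][] centroids = { new[] { 0.0, 0.0 }, new[] { 10.0, 10.0 } };

double SquaredDistance(double[] a, double[] b) =>
    a.Zip(b, (x, y) => (x - y) * (x - y)).Sum();

int NearestCentroid(double[] p) =>
    Enumerable.Range(0, centroids.Length)
        .OrderBy(i => SquaredDistance(p, centroids[i]))
        .First();

int[] labels = new int[points.Length];
for (int iteration = 0; iteration < 10; iteration++)
{
    // Assignment step: each point takes the label of its nearest centroid.
    labels = points.Select(p => NearestCentroid(p)).ToArray();

    // Update step: each centroid moves to the mean of its members.
    for (int c = 0; c < centroids.Length; c++)
    {
        var members = points.Where((_, i) => labels[i] == c).ToArray();
        if (members.Length == 0) continue; // an empty cluster keeps its old centroid
        centroids[c] = Enumerable.Range(0, members[0].Length)
            .Select(d => members.Average(m => m[d]))
            .ToArray();
    }
}

Console.WriteLine(string.Join(",", labels)); // prints "0,0,0,1,1,1"
```

The `continue` branch shows why the number of distinct labels can be smaller than the requested number of clusters: a centroid that attracts no points never appears in the assignments.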



In [None]:
var clusterCount = 50;
var mlContext = new MLContext();
var embeddingData = feedbackRecords.Select(f => new EmbeddingData { Embedding = f.Embedding }).ToList();
var dataView = mlContext.Data.LoadFromEnumerable(embeddingData);

// Cluster the embeddings using KMeans (the number of clusters is set by clusterCount)
var pipeline = mlContext.Clustering.Trainers.KMeans(featureColumnName: "Embedding", numberOfClusters: clusterCount);
var model = pipeline.Fit(dataView);

// Predict the cluster for each feedback record
var predictions = model.Transform(dataView);
var clusters = mlContext.Data.CreateEnumerable<ClusterPrediction>(predictions, reuseRowObject: false).ToList();

// Console.WriteLine($"Number of clusters: {clusters.Count}");

public class ClusterPrediction
{
    [ColumnName("PredictedLabel")]
    public uint PredictedCluster { get; set; }  // Cluster number (1, 2, 3, etc.)
}

List<(FeedbackRecord Feedback, uint Cluster)> feedbackWithClusters = feedbackRecords
    .Zip(clusters, (feedback, cluster) => 
        (Feedback: feedback, Cluster: cluster.PredictedCluster)
    )
    .ToList();
    


// Print the number of distinct clusters that were actually assigned
var numberOfClusters = feedbackWithClusters.Select(f => f.Cluster).Distinct().Count();
Console.WriteLine($"Number of clusters: {numberOfClusters}");

## GenerateClusters 

The `GenerateClusters` method takes a list of feedback records that have been grouped into clusters and transforms it into `ServiceCluster` objects. Each object holds the cluster's summary, its common feedback theme, and the feedback records associated with that cluster. The summary and common element are placeholders at this stage; they are filled in later by OpenAI.

In [46]:

public List<ServiceCluster> GenerateClusters(List<(FeedbackRecord Feedback, uint Cluster)> feedbackWithClusters)
{
    var serviceClusters = feedbackWithClusters
        .GroupBy(fc => fc.Cluster) // Group by the predicted cluster
        .Select(clusterGroup =>
        {
            // Collect full FeedbackRecords for this cluster
            var feedbackRecords = clusterGroup
                .Select(fc => fc.Feedback)
                .ToList();

            // Calculate distinct customers
            var distinctCustomers = feedbackRecords
                .Select(f => f.CustomerName)
                .Distinct()
                .Count();

            // Create the service cluster object
            return new ServiceCluster
            {
                ClusterId = clusterGroup.Key.ToString(),
                CommonElement = "Common Theme Placeholder", // Replace with actual summarization from OpenAI
                SimilarFeedbacks = feedbackRecords.Count,
                DistinctCustomers = distinctCustomers,
                FeedbackRecords = feedbackRecords,  // Full feedback records
                Summary = "Cluster summary placeholder" // Use OpenAI for summarization
            };
        })
        .ToList();

    return serviceClusters;
}
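
The grouping step above can be tried in isolation. The following is a minimal sketch on hypothetical sample rows, using tuples in place of the notebook's `FeedbackRecord` and `ServiceCluster` types; the customer names and cluster numbers are made up for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical sample rows: (customer name, assigned cluster).
var rows = new (string Customer, uint Cluster)[]
{
    ("Contoso", 1u), ("Contoso", 1u), ("Fabrikam", 1u),
    ("Contoso", 2u), ("Northwind", 2u),
    ("Fabrikam", 3u),
};

// Group by cluster, then count the records and the distinct customers per group.
var summary = rows
    .GroupBy(r => r.Cluster)
    .Select(g => (
        ClusterId: g.Key.ToString(),
        SimilarFeedbacks: g.Count(),
        DistinctCustomers: g.Select(r => r.Customer).Distinct().Count()))
    .ToList();

foreach (var c in summary)
{
    Console.WriteLine($"Cluster {c.ClusterId}: {c.SimilarFeedbacks} feedbacks, {c.DistinctCustomers} customers");
}
```

Cluster 1 has three records but only two distinct customers, which is the distinction the `SimilarFeedbacks` and `DistinctCustomers` fields are meant to capture.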

## Examining the Clusters

High-level information about each cluster, before calling OpenAI.

In [54]:
var sortedByBoth = GenerateClusters(feedbackWithClusters)
    .OrderByDescending(cluster => cluster.DistinctCustomers)
    .ThenByDescending(cluster => cluster.SimilarFeedbacks)
    .ToList();
// print the cluster with its data
foreach (var cluster in sortedByBoth)
{
    Console.WriteLine($"Cluster {cluster.ClusterId}: {cluster.SimilarFeedbacks} similar feedbacks from {cluster.DistinctCustomers} customers");
}

Cluster 10: 51 similar feedbacks from 31 customers
Cluster 4: 35 similar feedbacks from 22 customers
Cluster 1: 34 similar feedbacks from 22 customers
Cluster 6: 31 similar feedbacks from 21 customers
Cluster 41: 23 similar feedbacks from 20 customers
Cluster 27: 18 similar feedbacks from 17 customers
Cluster 8: 19 similar feedbacks from 14 customers
Cluster 34: 16 similar feedbacks from 14 customers
Cluster 26: 13 similar feedbacks from 12 customers
Cluster 32: 32 similar feedbacks from 11 customers
Cluster 14: 15 similar feedbacks from 11 customers
Cluster 30: 12 similar feedbacks from 11 customers
Cluster 47: 15 similar feedbacks from 10 customers
Cluster 13: 23 similar feedbacks from 9 customers
Cluster 7: 10 similar feedbacks from 9 customers
Cluster 38: 9 similar feedbacks from 9 customers
Cluster 33: 11 similar feedbacks from 8 customers
Cluster 25: 11 similar feedbacks from 7 customers
Cluster 11: 10 similar feedbacks from 7 customers
Cluster 22: 8 similar feedbacks from 7 cust
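
The ordering in the listing above (distinct customers first, feedback count as the tie-breaker) can be sketched on a few hypothetical values. The ids and counts below are invented for illustration only.

```csharp
using System;
using System.Linq;

// Hypothetical clusters: (id, number of feedbacks, number of distinct customers).
var clusters = new (string Id, int Feedbacks, int Customers)[]
{
    ("A", 10, 5),
    ("B", 30, 5),
    ("C", 8, 7),
};

// Primary key: distinct customers. Tie-breaker: number of feedbacks.
var ranked = clusters
    .OrderByDescending(c => c.Customers)
    .ThenByDescending(c => c.Feedbacks)
    .ToList();

Console.WriteLine(string.Join(",", ranked.Select(c => c.Id))); // prints "C,B,A"
```

Cluster C ranks first despite having the fewest feedbacks because it reaches the most customers; A and B tie on customers, so the larger feedback count puts B ahead.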

In [48]:
string systemMessage = @"
Generate a JSON response with the following structure:
{
  ""CommonElement"": ""<A concise phrase describing the common theme>"",
  ""Summary"": ""<A detailed explanation summarizing the feedback>""
}
Make sure the common element is clear and concise, and the summary provides a comprehensive explanation.
You base your summary only on the provided user stories by the user.
";
public class OpenAIResponse
{
    public string CommonElement { get; set; }
    public string Summary { get; set; }
}
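
Model replies do not always match the requested shape exactly; key casing in particular can differ. Below is a minimal sketch of reading the two fields defensively with `JsonDocument`. The sample reply and the `ReadField` helper are hypothetical. When deserializing into a typed class such as `OpenAIResponse`, setting `PropertyNameCaseInsensitive = true` on `JsonSerializerOptions` addresses the same casing concern.

```csharp
using System;
using System.Text.Json;

// Hypothetical model reply; note the camelCase keys.
string reply = "{ \"commonElement\": \"Slow queries\", \"summary\": \"Several users report high latency.\" }";

// Hypothetical helper: find a string property by name, ignoring case.
string ReadField(JsonElement root, string name)
{
    foreach (JsonProperty property in root.EnumerateObject())
    {
        if (string.Equals(property.Name, name, StringComparison.OrdinalIgnoreCase)
            && property.Value.ValueKind == JsonValueKind.String)
        {
            return property.Value.GetString() ?? string.Empty;
        }
    }
    return string.Empty; // missing or non-string field
}

string commonElement;
string summary;
using (JsonDocument doc = JsonDocument.Parse(reply))
{
    commonElement = ReadField(doc.RootElement, "CommonElement");
    summary = ReadField(doc.RootElement, "Summary");
}

Console.WriteLine($"{commonElement}: {summary}");
```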

In [49]:
public async Task<List<ServiceCluster>> EnhanceClustersWithOpenAIAsync(List<ServiceCluster> sortedClusters)
{
    int count = 0; // for testing, limit to 3 clusters
    foreach (var cluster in sortedClusters)
    {
        // Prepare the prompt by concatenating the feedback user stories for each cluster
        count++;
        string prompt = string.Empty;
        foreach (var feedback in cluster.FeedbackRecords)
        {
            prompt += $"- {feedback.UserStory}\n";
        }
        
        // Call OpenAI to generate the common element and summary
        var openAIResponse = await CallOpenAI(prompt, systemMessage);
        Console.WriteLine($"Called OpenAI  {cluster.ClusterId}");
        // Deserialize the JSON response from OpenAI
        try
        {
            var openAIResult = JsonSerializer.Deserialize<OpenAIResponse>(openAIResponse);

            if (openAIResult != null)
            {
                // Update the cluster with OpenAI results
                cluster.CommonElement = openAIResult.CommonElement;
                cluster.Summary = openAIResult.Summary;
            }
            else
            {
                Console.WriteLine("Failed to deserialize OpenAI response");
            }
        }
        catch (JsonException ex)
        {
            Console.WriteLine($"JSON deserialization error: {ex.Message}");
        }
        // break after 3
        if (count == 3)
        {
            break;
        }
    }

    return sortedClusters;
}

In [None]:
var clusterList = await EnhanceClustersWithOpenAIAsync(sortedByBoth);

## Saving to a file

The embeddings do not need to be saved, so they are cleared from each feedback record before the file is written.

In [51]:
public List<ServiceCluster> CleanClusterList(List<ServiceCluster> clusterList)
{
    foreach (var cluster in clusterList)
    {
        foreach (var feedback in cluster.FeedbackRecords)
        {
            // Clear the embedding vector; it is not needed in the saved file
            feedback.Embedding = null;
        }
    }

    return clusterList;
}
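
Setting a property to null does not by itself remove it from the output: with default options, `System.Text.Json` still writes the property with a `null` value. A minimal sketch of omitting null properties with `DefaultIgnoreCondition` follows. The anonymous record is a hypothetical stand-in for `FeedbackRecord`, whose real definition is not shown in this notebook.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical stand-in for a cleaned feedback record (Embedding already cleared).
var record = new
{
    Id = "f1",
    UserStory = "As a user I want faster queries.",
    Embedding = default(float[]), // null
};

var options = new JsonSerializerOptions
{
    WriteIndented = true,
    // Skip any property whose value is null when writing JSON.
    DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull,
};

string withNulls = JsonSerializer.Serialize(record);
string withoutNulls = JsonSerializer.Serialize(record, options);

Console.WriteLine(withNulls);
Console.WriteLine(withoutNulls);
```

Whether to omit the property or keep it as an explicit `null` depends on what the consumer of the file expects; this sketch only shows the option.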

In [52]:
public async Task SaveClustersToJsonAsync(List<ServiceCluster> clusterList, string outputPath)
{
    try
    {
        // clean the cluster list
        clusterList = CleanClusterList(clusterList);
        // Serialize the cluster list to JSON
        var json = JsonSerializer.Serialize(clusterList, new JsonSerializerOptions { WriteIndented = true });

        // Write the JSON string to a file
        await File.WriteAllTextAsync(outputPath, json);

        Console.WriteLine($"Cluster data saved to {outputPath}");
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Error saving clusters to JSON: {ex.Message}");
    }
}

## Saving to a file

The last step is to save the clusters to a file, which the UI uses to display them.
An API endpoint, `[HttpGet("GetServiceClusters/{serviceName}")]`, would read the clusters from this file and return them to the UI.

In [None]:
// Path.Combine inserts the directory separator if dataRoot does not end with one
var outputPath = Path.Combine(dataRoot, $"{servicename}-clusters.json");
await SaveClustersToJsonAsync(clusterList, outputPath);
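
As a small illustration of building the output path and checking the result, here is a sketch that writes to a temporary folder. The folder name `clusters-demo` and the placeholder file content are assumptions for the example; the notebook itself writes under `DB_ROOT_FOLDER`.

```csharp
using System;
using System.IO;

// Demo root under the system temp folder (an assumption for this sketch).
string demoRoot = Path.Combine(Path.GetTempPath(), "clusters-demo");
Directory.CreateDirectory(demoRoot);

string demoService = "cosmosdb";

// Path.Combine adds the directory separator only when it is needed.
string demoPath = Path.Combine(demoRoot, $"{demoService}-clusters.json");

File.WriteAllText(demoPath, "[]"); // placeholder content: an empty JSON array
Console.WriteLine($"Exists: {File.Exists(demoPath)}, name: {Path.GetFileName(demoPath)}");
```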