# Creating Embeddings with OpenAI and Performing a Search in .NET

If you've looked at a lot of Jupyter Notebooks before and think mine is laid out stupidly, I'm sorry. It's my first day.

## NuGet packages & using statements

We'll be using the OpenAI-DotNet package to interact with OpenAI's API, CsvHelper to manage our test data, and System.Numerics.Tensors to calculate cosine similarity.

In [56]:
#r "nuget:OpenAI-DotNet, 7.7.7"
#r "nuget:CsvHelper, 31.0.4"
#r "nuget:System.Numerics.Tensors, 8.0.0"

In [57]:
using CsvHelper;
using CsvHelper.Configuration;
using OpenAI;
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Numerics.Tensors;
using System.Text;
using System.Threading.Tasks;

## Parse the CSV

Using CsvHelper, we'll parse the Amazon review file. We'll also map the two fields we'll be creating embeddings from ("Summary" and "Text") into a single new property, "Combined". We'll keep the other properties from the CSV, but "Combined" will be the only "searchable" one.


In [62]:
class AmazonReview
{
    public string Id { get; set; }
    public string ProductId { get; set; }
    public string ProfileName { get; set; }
    public string HelpfulnessNumerator { get; set; }
    public string HelpfulnessDenominator { get; set; }
    public int Score { get; set; }
    public int Time { get; set; }
    public string Summary { get; set; }
    public string Text { get; set; }

    // These won't be in the CSV.
    // We'll create "Combined" with a custom mapping, and then
    // we'll calculate "Embeddings" using the OpenAI API.
    public string Combined { get; set; }
    public float[] Embeddings { get; set; }
}

class AmazonReviewMap : ClassMap<AmazonReview>
{
    public AmazonReviewMap()
    {
        AutoMap(CultureInfo.InvariantCulture);
        Map(m => m.Embeddings).Ignore();

        Map(m => m.Combined).Convert(args =>
        {
            var combined = "Title: " 
              + args.Row.GetField<string>("Summary").Trim() 
              + " Content: "
              + args.Row.GetField<string>("Text").Trim();

            return combined;
        });
    }
}

AmazonReview[] reviews;

using (var reader = new StreamReader("./amazon_reviews_1000.csv"))
using (var csv = new CsvHelper.CsvReader(reader, new CsvHelper.Configuration.CsvConfiguration(CultureInfo.InvariantCulture) 
  {
    HasHeaderRecord = true, 
    HeaderValidated = null, 
    MissingFieldFound = null 
  }))
{
    csv.Context.RegisterClassMap<AmazonReviewMap>();
    reviews = csv.GetRecords<AmazonReview>().ToArray();

    // Print a sample
    reviews.Take(5).ToList().ForEach(review => Console.WriteLine(review.Combined));
}

Title: Good Quality Dog Food Content: I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.
Title: Not as Advertised Content: Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".
Title: "Delight" says it all Content: This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat 

## Create Embeddings

We're going to use the OpenAI SDK to request embeddings from the API. This turns each review's "Combined" property into a large vector of floats. The code makes one request per review, so it will take a few minutes to run.

By default, the `text-embedding-3-small` model returns 1536 dimensions, but OpenAI lets you pass a smaller `dimensions` parameter to save on compute time and resources. They say you can do this without a notable dip in quality ([see here](https://openai.com/blog/new-embedding-models-and-api-updates)).
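As I understand it, asking for fewer dimensions is roughly equivalent to cutting values off the end of the full vector and rescaling what's left to unit length. Here's a minimal sketch of that idea on a toy vector. `Shorten` is a hypothetical helper of mine, not part of any SDK, and in the notebook the API does this for us through the `dimensions` parameter.

```csharp
using System;
using System.Linq;

// Hypothetical helper (not part of any SDK): keep the first `dimensions`
// values of an embedding and rescale the result to unit length.
float[] Shorten(float[] embedding, int dimensions)
{
    var truncated = embedding.Take(dimensions).ToArray();
    var norm = MathF.Sqrt(truncated.Sum(v => v * v));

    // Avoid dividing by zero for an all-zero vector.
    return norm == 0f ? truncated : truncated.Select(v => v / norm).ToArray();
}

// Toy 4-dimensional "embedding", shortened to 2 dimensions.
var full = new float[] { 3f, 4f, 1f, 2f };
var small = Shorten(full, 2);

Console.WriteLine(string.Join(", ", small));
```

The rescaling step matters because cosine similarity and dot-product comparisons assume the vectors are on a comparable scale.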

One thing to note here: I'm not checking token counts ahead of time. In OpenAI's tutorials/notebooks, they use their Python package (`tiktoken`) to count tokens up front and filter out any review that exceeds a token limit. There are some 3rd-party C# packages that do this too, but that's an exercise left up to the reader.
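If you don't want to pull in a tokenizer, a crude stand-in is to estimate tokens from character count and drop anything that looks too long. The sketch below assumes about 4 characters per token, which is only a rule of thumb for English text, and the `MaxTokens` budget is a number I picked for illustration. A real tokenizer will give different counts.

```csharp
using System;
using System.Linq;

// Assumptions (not from any SDK): a rough 4-characters-per-token heuristic
// and an illustrative token budget. Neither replaces a real tokenizer.
const int MaxTokens = 8000;
const int CharsPerToken = 4;

// Round up so short strings still count as at least one token.
int EstimateTokens(string text) => (text.Length + CharsPerToken - 1) / CharsPerToken;

var texts = new[]
{
    "Title: Garbage Content: Both cases ended up in the garbage can.",
    new string('x', 40_000) // stand-in for an overly long review
};

// Keep only the texts whose estimated token count fits the budget.
var usable = texts.Where(t => EstimateTokens(t) <= MaxTokens).ToArray();

Console.WriteLine($"{usable.Length} of {texts.Length} texts kept");
```

In the notebook, the same filter could be applied to `reviews` using `review.Combined` before the embeddings loop runs.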

In [64]:
const string model = "text-embedding-3-small";
const string apiKey = "<your real key goes here>";

using (var api = new OpenAIClient(apiKey))
{
    foreach (var review in reviews)
    {
        var response = await api.EmbeddingsEndpoint.CreateEmbeddingAsync(review.Combined, model, dimensions: 512);
        review.Embeddings = response.Data.First().Embedding.Select(e => (float)e).ToArray();
    }
}


In [65]:
// Sanity check
System.Diagnostics.Debug.Assert(reviews.All(review => review.Embeddings.Length == 512));

## Searching with Embeddings

Now that we've converted our "Combined" text property into embeddings, we can search it. The steps to do so are:

1. Collect a search query (a string of text).
2. Convert that query into embeddings (using the OpenAI SDK).
3. Use math to calculate which "review embeddings" are near your "query embeddings". OpenAI recommends using [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity).
4. Grab some of the best matches.
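To make step 3 less magical, here's a hand-rolled sketch of cosine similarity on toy vectors. The notebook itself uses `TensorPrimitives.CosineSimilarity`. The `Cosine` function below is only an illustration of the formula (the dot product divided by the product of the two vector lengths), and the vectors are made up.

```csharp
using System;

// Hand-rolled cosine similarity: dot(a, b) / (|a| * |b|).
float Cosine(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same length.");

    float dot = 0f, normA = 0f, normB = 0f;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}

// Toy vectors standing in for real embeddings.
var query = new float[] { 1f, 0f };
var same = new float[] { 2f, 0f };       // same direction, different length
var orthogonal = new float[] { 0f, 3f }; // unrelated direction

Console.WriteLine(Cosine(query, same));       // prints 1
Console.WriteLine(Cosine(query, orthogonal)); // prints 0
```

A score near 1 means the two vectors point the same way regardless of their length, which is why it works as a "relatedness" score for ranking reviews against a query.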

In [60]:
class AmazonReviewQueryMatch
{
    public AmazonReview Review { get; set; }
    public double Relatedness { get; set; }
}

async Task<AmazonReviewQueryMatch[]> Search(string query, int topN = 5)
{
    float[] queryEmbeddings;

    using (var api = new OpenAIClient(apiKey))
    {
        var response = await api.EmbeddingsEndpoint.CreateEmbeddingAsync(query, model, dimensions: 512);

        // Convert to float once, so it matches the stored review embeddings
        // and we don't re-allocate the array for every review.
        queryEmbeddings = response.Data.First().Embedding.Select(e => (float)e).ToArray();
    }

    var matches = reviews.Select(review => new AmazonReviewQueryMatch
    {
        Review = review,
        // review.Embeddings is already a float[], so no conversion is needed.
        Relatedness = TensorPrimitives.CosineSimilarity(
          new ReadOnlySpan<float>(review.Embeddings),
          new ReadOnlySpan<float>(queryEmbeddings)
        )
    })
    .OrderByDescending(match => match.Relatedness)
    .Take(topN)
    .ToArray();

    return matches;
}

In [70]:
var results = await Search("garbage");

results.ToList().ForEach(match => Console.WriteLine($"{match.Relatedness} {match.Review.Combined}"));

0.4490898847579956 Title: Garbage Content: Don't waste your money on any of the Kettle brand potato chips.  I bought a case of these, and a case of the cheddar and sour cream.  Both cases ended up in the garbage can.
0.3766050636768341 Title: Disgusting Content: These chips are nasty.  I thought someone had spilled a drink in the bag, no the chips were just soaked with grease.  Nasty!!
0.3587847948074341 Title: HORRIBLE I CANT BELIEVE THIS! Content: Terrible! I cannot believe this, I received this item and EVERY SINGLE BAG WAS OPENED BUT 4!!!! I'm stationed in Afghanistan and this was gonna be a snack for my team while going out on missions. I was so embarrassed when the bags were opened and spilt out all in the box, gross! And the box is filled with grease stains. Idk if it was from the airplane ride all the way here, but the box should have been more insulated and bubble wrap should have been used instead of paper. I'm very unhappy with stale crusty chips out the bag and us soldiers 