# Creating Embeddings with OpenAI and Performing a Search in .NET

If you've looked at a lot of Jupyter Notebooks before and think mine is laid out stupidly, I'm sorry. It's my first day.

## NuGet packages & using statements

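We'll be using the OpenAI-DotNet package (a community-maintained client, not an official OpenAI SDK) to interact with OpenAI's API, CsvHelper to load our test data, and System.Numerics.Tensors for the cosine similarity math later on.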

In [2]:
#r "nuget:OpenAI-DotNet, 7.7.7"
#r "nuget:CsvHelper, 31.0.4"
#r "nuget:System.Numerics.Tensors, 8.0.0"

In [3]:
using CsvHelper;
using CsvHelper.Configuration;
using OpenAI;
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Numerics.Tensors;
using System.Text;
using System.Threading.Tasks;

## Parse the CSV

Using CsvHelper, we'll parse the Amazon review file. A custom mapping joins the two fields we want to create embeddings from ("Summary" and "Text") into a new property, "Combined". We'll keep the other properties from the CSV, but "Combined" is the only one we'll search against.
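
To make the "Combined" format concrete, here's a minimal standalone sketch of the string the mapping builds. `BuildCombined` is a hypothetical helper (the notebook does this inline in the class map), and treating a null field as empty is my own assumption about what you'd want, not something the notebook does.

```csharp
using System;

// Hypothetical helper, not part of the notebook: builds the same
// "Title: ... Content: ..." string as the class map, but treats a
// null field as empty instead of throwing.
string BuildCombined(string summary, string text) =>
    "Title: " + (summary ?? string.Empty).Trim()
    + " Content: " + (text ?? string.Empty).Trim();

Console.WriteLine(BuildCombined("  Not as Advertised ", " Product arrived labeled as Jumbo Salted Peanuts "));
// Title: Not as Advertised Content: Product arrived labeled as Jumbo Salted Peanuts
```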


In [4]:
class AmazonReview
{
    public string Id { get; set; }
    public string ProductId { get; set; }
    public string ProfileName { get; set; }
    public string HelpfulnessNumerator { get; set; }
    public string HelpfulnessDenominator { get; set; }
    public int Score { get; set; }
    public int Time { get; set; }
    public string Summary { get; set; }
    public string Text { get; set; }

    // These won't be in the CSV.
    // We'll create "Combined" with a custom mapping, and then
    // we'll calculate "Embeddings" using the OpenAI API.
    public string Combined { get; set; }
    public float[] Embeddings { get; set; }
}

class AmazonReviewMap : ClassMap<AmazonReview>
{
    public AmazonReviewMap()
    {
        AutoMap(CultureInfo.InvariantCulture);
        Map(m => m.Embeddings).Ignore();

        Map(m => m.Combined).Convert(args =>
        {
            var combined = "Title: " 
              + args.Row.GetField<string>("Summary").Trim() 
              + " Content: "
              + args.Row.GetField<string>("Text").Trim();

            return combined;
        });
    }
}

AmazonReview[] reviews;

using (var reader = new StreamReader("./amazon_reviews_1000.csv"))
using (var csv = new CsvReader(reader, new CsvConfiguration(CultureInfo.InvariantCulture)
    {
        HasHeaderRecord = true,
        HeaderValidated = null,   // don't throw when a property has no matching CSV column
        MissingFieldFound = null  // don't throw when a row is missing a field
    }))
{
    csv.Context.RegisterClassMap<AmazonReviewMap>();
    reviews = csv.GetRecords<AmazonReview>().ToArray();

    // Print a sample
    reviews.Take(5).ToList().ForEach(review => Console.WriteLine(review.Combined));
}

Title: Good Quality Dog Food Content: I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.
Title: Not as Advertised Content: Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".
Title: "Delight" says it all Content: This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat 

## Create Embeddings

We're going to use the OpenAI-DotNet client to request embeddings from the API. This turns each review's "Combined" property into a large vector of floats. It will take a few minutes to run, since we make one request per review.

By default, the `text-embedding-3-small` model returns 1536 dimensions, but OpenAI lets you pass a smaller `dimensions` parameter (we'll ask for 512) to save on storage and compute. They say you can do this without a notable dip in quality ([see here](https://openai.com/blog/new-embedding-models-and-api-updates)).
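
If you already have full-size embeddings and want shorter ones without calling the API again, my understanding of OpenAI's embeddings guide is that you can cut the vector down yourself, as long as you re-normalize it afterwards. Here's a rough sketch of that idea on a toy vector. `Shorten` is a hypothetical helper, not something from the SDK, and passing `dimensions` to the API (as this notebook does) is the simpler route.

```csharp
using System;
using System.Linq;

// Hypothetical helper (not in the notebook or the SDK): keep the first
// `dimensions` values of an embedding, then rescale to unit length.
float[] Shorten(float[] embedding, int dimensions)
{
    if (dimensions <= 0 || dimensions > embedding.Length)
        throw new ArgumentOutOfRangeException(nameof(dimensions));

    var truncated = embedding.Take(dimensions).ToArray();
    var norm = Math.Sqrt(truncated.Sum(v => (double)v * v));
    if (norm == 0) return truncated; // all-zero vector: nothing to rescale

    return truncated.Select(v => (float)(v / norm)).ToArray();
}

// Toy 4-dimensional "embedding", shortened to 2 dimensions.
// The first two values are 3 and 4, so the result is roughly 3/5 and 4/5.
var shortened = Shorten(new float[] { 3f, 4f, 1f, 2f }, 2);
Console.WriteLine(string.Join(" ", shortened));
```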

One thing to note here: I'm not checking token counts ahead of time. In OpenAI's tutorials/notebooks, they use their Python package (`tiktoken`) to count tokens up front and filter out any review that exceeds a token limit. There are some third-party C# packages that do this too, but that's an exercise left to the reader.
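
If you want a cheap guard without pulling in a tokenizer, you could filter on an estimate instead. This is only a sketch: the "about 4 characters per token" figure is a rule of thumb for English text, and `MaxTokens` is a placeholder budget I picked for illustration, so check the model's documented input limit before relying on either.

```csharp
using System;
using System.Linq;

// ASSUMPTION: ~4 characters per token is a rough rule of thumb for English.
// A real tokenizer (tiktoken or a C# port) gives exact, different counts.
const int ApproxCharsPerToken = 4;
const int MaxTokens = 8000; // hypothetical budget, not taken from the notebook

int EstimateTokens(string text) =>
    (int)Math.Ceiling((text?.Length ?? 0) / (double)ApproxCharsPerToken);

var samples = new[]
{
    "Title: Good Quality Dog Food Content: I have bought several of the Vitality canned dog food products",
    new string('x', 40_000), // stand-in for an unusually long review
};

// Keep only the texts whose estimated token count fits the budget.
var kept = samples.Where(s => EstimateTokens(s) <= MaxTokens).ToArray();
Console.WriteLine($"Kept {kept.Length} of {samples.Length}"); // Kept 1 of 2
```

In the notebook, the same `Where` filter could be applied to `reviews` using `review.Combined` before the embedding loop.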

In [5]:
const string model = "text-embedding-3-small";
const string apiKey = "<YOUR_API_KEY>";

using (var api = new OpenAIClient(apiKey))
{
    foreach (var review in reviews)
    {
        var response = await api.EmbeddingsEndpoint.CreateEmbeddingAsync(review.Combined, model, dimensions: 512);
        review.Embeddings = response.Data.First().Embedding.Select(e => (float)e).ToArray();
    }
}

In [6]:
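// Sanity check. Debug.Assert is compiled out unless DEBUG is defined, so throw explicitly.
if (!reviews.All(review => review.Embeddings?.Length == 512))
    throw new InvalidOperationException("Expected a 512-dimension embedding for every review.");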

## Searching with Embeddings

Now that we've converted our "Combined" text property into embeddings, we can search it. The steps to do so are:

1. Collect a search query (a string of text).
2. Convert that query into embeddings (using the same client, model, and dimensions as before).
3. Use math to calculate which "review embeddings" are nearest your "query embeddings". OpenAI recommends using [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity).
4. Grab some of the best matches.
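
Steps 3 and 4 are easier to see on tiny vectors. Here's a minimal hand-rolled sketch with made-up 2-dimensional "embeddings". The notebook itself uses `TensorPrimitives.CosineSimilarity` rather than this `Cosine` function, which is only here to show the arithmetic.

```csharp
using System;
using System.Linq;

// Cosine similarity: dot(a, b) / (|a| * |b|). Ranges from -1 to 1;
// higher means the vectors point in more similar directions.
// Note: a zero-length vector would produce NaN here.
double Cosine(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same length.");

    double dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Made-up vectors, purely for illustration.
var query = new float[] { 1f, 0f };
var documents = new (string Name, float[] Vector)[]
{
    ("orthogonal",     new float[] { 0f, 3f }),
    ("opposite",       new float[] { -1f, 0f }),
    ("same direction", new float[] { 2f, 0f }),
};

// Score every document against the query, then keep the best two.
var ranked = documents
    .Select(d => (Name: d.Name, Score: Cosine(d.Vector, query)))
    .OrderByDescending(m => m.Score)
    .Take(2)
    .ToArray();

foreach (var match in ranked)
    Console.WriteLine($"{match.Score} {match.Name}");
// 1 same direction
// 0 orthogonal
```

Notice that the length of a vector doesn't matter, only its direction: `{ 2, 0 }` scores a perfect 1 against `{ 1, 0 }`.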

In [10]:
class AmazonReviewQueryMatch
{
    public AmazonReview Review { get; set; }
    public double Relatedness { get; set; }
}

async Task<AmazonReviewQueryMatch[]> Search(string query, int topN = 5)
{
    float[] queryEmbeddings;

    using (var api = new OpenAIClient(apiKey))
    {
        var response = await api.EmbeddingsEndpoint.CreateEmbeddingAsync(query, model, dimensions: 512);
        queryEmbeddings = response.Data.First().Embedding.Select(e => (float)e).ToArray();
    }

    var matches = reviews.Select(review => new AmazonReviewQueryMatch
    {
        Review = review,
        Relatedness = TensorPrimitives.CosineSimilarity(
          new ReadOnlySpan<float>(review.Embeddings),
          new ReadOnlySpan<float>(queryEmbeddings)
        )
    })
    .OrderByDescending(match => match.Relatedness)
    .Take(topN)
    .ToArray();

    return matches;
}

In [14]:
var results = await Search("disgusting");

results.ToList().ForEach(match => Console.WriteLine($"{match.Relatedness} {match.Review.Combined}"));

0.48951831459999084 Title: Disgusting Content: These chips are nasty.  I thought someone had spilled a drink in the bag, no the chips were just soaked with grease.  Nasty!!
0.42020636796951294 Title: ABSOLUTELY VILE!!! Content: Imagine taking some kids' playground chalk, perhaps that tan colored one and grinding it into powder with plenty of chunks. Then scoop it up and put it in your mouth.<br /><br />That's how disgusting this bar is.<br /><br />I have no idea how and who is rating this as flavorful, IT HAS NONE. I've done nutritional coaching for 30 years and watched healthy foods progress from painfully bad to barely edible to excellent.<br /><br />This neanderthal-era gluten free bar is so bad I want to go to my local grocery store, take all the boxes off off the shelf and hide them in a different part of the store (perhaps, if there's a trash compactor?) just to protect people from the experience my daughter had with this product.<br /><br />Save your money for these ridiculously