Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.Net: IVectorStore implementation for Azure SQL #10623

Merged

Conversation

adamsitnik
Copy link
Member

@adamsitnik adamsitnik commented Feb 20, 2025

The PR is ready for review.

fixes #10416

…o sqlServerTests

# Conflicts:
#	dotnet/SK-dotnet.sln
The ToListAsync change is just a wokaround dotnet/runtime#79782 (comment)

namespace SqlServerIntegrationTests.VectorSearch;

public class SqlServerBasicVectorSearchTests_Hnsw(SqlServerFixture fixture)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this now with Hsnw being unsupported after all?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wanted to make sure it's actually covered with tests (so we keep throwing for non-supported filter names)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't they be covered with the other SqlServerBasicVectorSearchTests implementation anyway, i.e. the one using FLAT. Although I assume both are using FLAT since HNSW isn't supported.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SqlServerVectorSearchDistanceFunctionComplianceTests file does not set the index name, so it's using the default.

The _Hnsw does provide the index name and the tests ensure that we throw NotSupportedException on purpose.

{
await this.ExecuteAsync(async collection =>
{
Assert.Empty(await collection.UpsertBatchAsync([]).ToArrayAsync());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, during our discussion yesterday I misunderstood when you asked whether we should throw if we are upsetting with an empty collection. I thought you were referring to the storage collection, not the collection being passed into the method. Hence my confusion as to why this would be a problem 😄

That said, I don't have a strong opinion about whether to throw in this case. I tend to lean towards throwing, since it could help the user find bugs, but it's not a strong opinion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@roji we have agreed offline to keep it as is (don't throw) and wait until you are back since you most likely have designed a similar API in the past and have opinion about it

- add support for VectorStoreGenericDataModel
- respect IncludeVectors in Get* methods
- rename BasicVectorSearchTests to VectorSearchDistanceFunctionComplianceTests
…/VectorStoreRecordKeyProperty.cs

Co-authored-by: westey <164392973+westey-m@users.noreply.github.com>
Copy link
Member

@roji roji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adamsitnik great work, it's really nice to see this. Take a look at all the comments.

There's a lot that I think we could improve, but we also shouldn't get blocked too much on this. I do have a bit of a problem with introducing AutoGenerated as a public API into the abstraction (or at least want us to discuss this first). A lot of the rest is easy/simple, and I'm OK with deferring most of it to a future PR so we can release soon.

{
protected VectorStoreFixture Fixture { get; } = fixture;

protected virtual string GetUniqueCollectionName() => Guid.NewGuid().ToString();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some databases, creating collections is a heavy operation, so we should generally try to avoid creating a collection for each test unless it's necessary. This is why the filter tests have a single collection, and all tests reuse it. I think that should be possible for the new tests being introduced here.

var collection = this.Fixture.TestStore.DefaultVectorStore.GetCollection<TKey, TRecord>(collectionName,
this.GetRecordDefinition());

await collection.CreateCollectionAsync();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, creating a collection is a bit more complicated - in some databases this happens asynchronously, so CreateCollectionAsync returns before the collection is actually ready, and the test will fail.

See how this is done in BasicfilterTests - I've tried to design this in a way which is reusable across test suites (e.g. WaitForDataAsync).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I was not aware of that. Is it OK to fix it later, when we actually add some tests for such DBs?

Copy link
Member

@roji roji Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course. I really hope we'll be able to invest a bit in the integration test suite/infrastructure before the release to ensure we cover behavior properly across connectors.

adamsitnik and others added 4 commits March 6, 2025 11:17
Co-authored-by: Shay Rojansky <roji@roji.org>
…o sqlServerNew

# Conflicts:
#	dotnet/SK-dotnet.sln
#	dotnet/src/Connectors/Connectors.Memory.Postgres/PostgresVectorStoreCollectionSqlBuilder.cs
#	dotnet/src/Connectors/Connectors.Memory.Sqlite/SqliteVectorStoreRecordCollection.cs
- remove NONCLUSTERED used PRIMARY KEY
- remove COLLATE Latin1_General_100_BIN2 for VARCHAR columns
- remove TimeSpan => TIME mapping
- add support for TimeOnly (TIME)
- make Schema optional
- handle Where(x => x.Bool)
- use square brackets for escaping column names in WHERE clause
…o sqlServerNew

# Conflicts:
#	dotnet/src/Connectors/Connectors.Memory.Postgres/PostgresVectorStoreRecordCollection.cs
@adamsitnik adamsitnik requested a review from roji March 6, 2025 21:48
Copy link
Member

@roji roji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once the comments below have been addressed, thanks @adamsitnik!

this._sql.Append('\'').Append(s.Replace("'", "''")).Append('\'');
return;
case bool b:
this._sql.Append(b ? "TRUE" : "FALSE");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the representation of bool varies considerably across databases, this probably shouldn't be here (only in the overrides of this method in connector subtypes).

Type t when t == typeof(bool) => ("BIT", null),
Type t when t == typeof(DateTime) => ("DATETIME2", null),
#if NET
Type t when t == typeof(TimeOnly) => ("TIME", null),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need DateOnly here, mapped to DATE.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at implementing DateOnly and I don't see a way to easily do that as of now. The reason is that the SqlDataReader maps such an output as DateTime and reader.GetProviderSpecificFieldType(this._sqlDataReader.GetOrdinal(storageName)) also returns DateTime and we also support DATETIME2 which does the same.

For TIME it was similar, as it also returns TimeSpan rather than TimeOnly. But we don't support TimeSpan, so we can just map all TimeSpan instances into TimeOnly.

I know I could give the reader the property infos or the definitions to deal with that, I just don't believe it's worth it as of now, especially if we want to get rid of this layer (the need to represent SqlDataReader as IDictionary<string, object?>).

#if NET
// We don't support mapping TimeSpan to TIME on purpose
// See https://github.com/microsoft/semantic-kernel/pull/10623#discussion_r1980350721
typeof(TimeOnly), // TIME
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also DateOnly

/// <summary>
/// Gets or sets the database schema.
/// </summary>
public string? Schema { get; init; } = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am noting that in PG and Sqlite this isn't nullable, but rather initialized to the default; as discussed before, I think this means that all SQLs always contain the schema (e.g. public for PG), whereas we should just omit it like we do here.

@adamsitnik do you mind applying this change (turn Schema to nullable) for the other relation providers as well? Definitely doesn't have to be in this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I've logged an issue to make sure I don't forget to do that: #10853

Co-authored-by: Shay Rojansky <roji@roji.org>
@adamsitnik adamsitnik merged commit 7aa7d29 into microsoft:feature-vector-data-preb1 Mar 7, 2025
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kernel Issues or pull requests impacting the core kernel memory .NET Issue or Pull requests regarding .NET code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants