Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error detail regression #380

Open
finestructure opened this issue May 16, 2023 · 40 comments
Open

Error detail regression #380

finestructure opened this issue May 16, 2023 · 40 comments

Comments

@finestructure
Copy link
Member

finestructure commented May 16, 2023

I'm not sure if this is a fluent-postgres-driver issue or one in postgres-kit - but FWIW, I can avoid it by pinning to .package(url: "https://github.com/vapor/fluent-postgres-driver.git", exact: "2.6.2").

We have a test that asserts on the error description when violating a not-null constraint:

        do {  // commit unset
            let v = Version()
            v.commitDate = .distantPast
            v.reference = .branch("main")
            try await v.save(on: app.db)
            XCTFail("save must fail")
        } catch {
            // validation
            XCTAssertEqual(error.localizedDescription,
                           #"server: null value in column "commit" of relation "versions" violates not-null constraint (ExecConstraints)"#)
        }

With the latest version of fluent-postgres-driver this starts failing because the error message is now The operation couldn’t be completed. (PostgresNIO.PSQLError error 1.),

Of course the exact error message isn't important but the point of the test is to ensure that we have relevant information about an error in the logs.

We have another test that checks unique key violations:

            XCTAssert($0.localizedDescription.contains(
                #"duplicate key value violates unique constraint "idx_repositories_owner_name""#),
                      "was: \($0.localizedDescription)"
            )

That test is also failing due to that same new error message, The operation couldn’t be completed. (PostgresNIO.PSQLError error 1.).

Ideally, both violations would still be reported as such (with some detail as to which kind of constraint was violated) and also include the name of the constraint that was violated (if available).

Is there a way to recover this error information with the latest version of fluent-postgres-driver/postgres-kit?

@finestructure
Copy link
Member Author

finestructure commented May 16, 2023

We have another handful of other test that I believe are failing for the same underlying reason. We have logic that handles unique key violations as follows (I've noticed the clumsy catch and now replaced it with catch let error as ... where ... 🙂):

                    do {
                        try await Build(id: buildId,
                                        versionId: trigger.versionId,
                                        jobUrl: jobUrl,
                                        platform: pair.platform,
                                        status: .triggered,
                                        swiftVersion: pair.swiftVersion)
                        .create(on: database)
                    } catch {
                        if let error = error as? PostgresError,
                           error.code == .uniqueViolation {
                            if let oldBuild = try await Build.query(on: database,
                                                                    platform: pair.platform,
                                                                    swiftVersion: pair.swiftVersion,
                                                                    versionId: trigger.versionId) {
                                // Fluent doesn't allow modification of the buildId of an existing
                                // record, therefore we need to delete + create.
                                let newBuild = Build(id: buildId,
                                                     versionId: trigger.versionId,
                                                     buildCommand: oldBuild.buildCommand,
                                                     jobUrl: jobUrl,
                                                     platform: pair.platform,
                                                     status: .triggered,
                                                     swiftVersion: pair.swiftVersion)
                                try await database.transaction { tx in
                                    try await oldBuild.delete(on: tx)
                                    try await newBuild.create(on: tx)
                                }
                            }
                        } else {
                            throw error
                        }
                    }

This seems to be falling through now to throw error.

Is there another/better way to handle unique key violation errors in the latest fluent-postgres-driver/postgres-kit?

(I could of course avoid the UK violation here by checking for existance first but the error case is very rare and I don't want to have to run the extra query on the happy path for every request.)

@finestructure
Copy link
Member Author

I've just seen vapor/postgres-kit#244, which I'm guessing will address all this.

@0xTim
Copy link
Member

0xTim commented May 17, 2023

@finestructure Are we good to close this with the linked PR?

@finestructure
Copy link
Member Author

@0xTim I've just had a look and our errors are actually happening with that version (postgres-kit 2.11.2).

It seems that all errors are now PSQLErrors. Is there a way to recover the PostgresError we used to have in some way?

@finestructure
Copy link
Member Author

In particular the PostgresError.code property is important to handle unique key violation errors and there doesn't seem to be an equivalent in PSQLError.

@0xTim
Copy link
Member

0xTim commented May 18, 2023

@gwynne ping

@finestructure
Copy link
Member Author

finestructure commented Jun 1, 2023

There is an additional problem I've just run into while trying to switch over to PSQLError from PostgresError. We have the following function:

    func attach(to build: Build, on database: Database) async throws {
        $build.id = try build.requireID()
        build.$docUpload.id = try requireID()
        try await database.transaction {
            try await self.save(on: $0)
            try await build.save(on: $0)
        }
    }

It turns out that a PSQLError raised inside the transaction changes to ExpressivePSQLError outside:

    func attach(to build: Build, on database: Database) async throws {
        $build.id = try build.requireID()
        build.$docUpload.id = try requireID()
        do {
            try await database.transaction {
                do {
                    try await self.save(on: $0)
                    try await build.save(on: $0)
                } catch {
                    print("in transaction: \(type(of: error))")
                    throw error
                }
            }
        } catch {
            print("outside transaction: \(type(of: error))")
            throw error
        }
    }
Test Case '-[AppTests.DocUploadTests test_unique_constraint_doc_uploads_build_id]' started.
in transaction: PSQLError
outside transaction: ExpressivePSQLError

The data still seems to be there. In fact the error reports as PSQLError when dumped - even though type(of:) is ExpressivePSQLError:

▿ PSQLError(code: server, serverInfo: [message: duplicate key value violates unique constraint "uq:doc_uploads.build_id", severity: ERROR, sqlState: 23505, detail: Key (build_id)=(30447262-eee7-4837-93fa-347480da3b4a) already exists., file: nbtinsert.c, localizedSeverity: ERROR, schemaName: public, routine: _bt_check_unique, constraintName: uq:doc_uploads.build_id, line: 649, tableName: doc_uploads], triggeredFromRequestInFile: PostgresKit/PostgresDatabase+SQL.swift, line: 53, query: PostgresQuery(sql: "INSERT INTO \"doc_uploads\" (\"id\", \"created_at\", \"updated_at\", \"build_id\", \"error\", \"file_count\", \"log_url\", \"mb_size\", \"status\") VALUES ($1, $2, $3, $4, NULL, NULL, NULL, NULL, $5) RETURNING \"id\"", binds: PostgresNIO.PostgresBindings(metadata: [PostgresNIO.PostgresBindings.Metadata(dataType: UUID, format: binary), PostgresNIO.PostgresBindings.Metadata(dataType: TIMESTAMPTZ, format: binary), PostgresNIO.PostgresBindings.Metadata(dataType: TIMESTAMPTZ, format: binary), PostgresNIO.PostgresBindings.Metadata(dataType: UUID, format: binary), PostgresNIO.PostgresBindings.Metadata(dataType: TEXT, format: binary)], bytes: ByteBuffer { readerIndex: 0, writerIndex: 70, readableBytes: 70, capacity: 1024, storageCapacity: 1024, slice: _ByteBufferSlice { 0..<1024 }, storage: 0x0000000105b00800 (1024 bytes) }

@gwynne
Copy link
Member

gwynne commented Jun 1, 2023

ExpressivePSQLError is a wrapper type I added in PosgresKit as a temporary workaround while the underlying concern in PostgresNIO is addressed. You can get detailed error information in a way that will work correctly both now and later by asking for the error's debugging description, e.g. String(reflecting: error)

@gwynne
Copy link
Member

gwynne commented Jun 1, 2023

You can also explicitly check whether an error is a constraint violation (including a uniqueness violation) by using the DatabaseError conformance, e.g. error.isConstraintFailure

@finestructure
Copy link
Member Author

finestructure commented Jun 1, 2023

Thanks for taking a look, @gwynne !

Casting to DatabaseError only seems to work when the error is a PSQLError. When it's a ExpressivePSQLError I get nil and I can't find a way to catch it specifically:

            } catch {
                req.logger.critical("\(error)")
                dump(error)
                print("type(of: error)", type(of: error))
                print("error as? PSQLError:", error as? PSQLError)
                print("error as? DatabaseError:", error as? DatabaseError)
                throw error
            }
in transaction: PSQLError
outside transaction: ExpressivePSQLError
2023-06-01T13:01:50+0200 critical codes.vapor.request : request-id=CCE0BFBF-BA03-4530-99DE-C19E1B2BFA87 [App] Database error
▿ PSQLError(code: server, serverInfo: [schemaName: public, severity: ERROR, message: duplicate key value violates unique constraint "uq:builds.version_id+partial", tableName: builds, file: nbtinsert.c, line: 649, routine: _bt_check_unique, constraintName: uq:builds.version_id+partial, sqlState: 23505, localizedSeverity: ERROR, detail: Key (version_id)=(b8dfc86e-5675-4aca-ba1f-1a8561056f6f) already exists.], triggeredFromRequestInFile: PostgresKit/PostgresDatabase+SQL.swift, line: 53, query: PostgresQuery(sql: "UPDATE \"builds\" SET \"updated_at\" = $1, \"doc_upload_id\" = $2 WHERE \"builds\".\"id\" = $3", binds: PostgresNIO.PostgresBindings(metadata: [PostgresNIO.PostgresBindings.Metadata(dataType: TIMESTAMPTZ, format: binary), PostgresNIO.PostgresBindings.Metadata(dataType: UUID, format: binary), PostgresNIO.PostgresBindings.Metadata(dataType: UUID, format: binary)], bytes: ByteBuffer { readerIndex: 0, writerIndex: 52, readableBytes: 52, capacity: 512, storageCapacity: 512, slice: _ByteBufferSlice { 0..<512 }, storage: 0x0000000104c1ec00 (512 bytes) }
... <snip> ...
type(of: error) ExpressivePSQLError
error as? PSQLError: nil
error as? DatabaseError: nil

@gwynne
Copy link
Member

gwynne commented Jun 1, 2023

catch let error as DatabaseError where error.isConstraintFailure is supposed to work, but it appears I forgot to forward the DatabaseError conformance when I added the workaround 🤦‍♀️ I'll add a fix for that to FluentPostgresDriver later today.

@gwynne
Copy link
Member

gwynne commented Jun 1, 2023

Either:

req.logger.critical("\(String(reflecting: error))")

or

req.logger.report(error)

should at least get you the debug information.

The difference between the error type inside versus outside the transaction is an unfortunate side effect of the limitations of the workaround, although I think I can fix that specific limitation at the same time as fixing the DatabaseError conformance.

(of course, what we really should be doing is getting #360 taken care of, so I can ditch the workaround completely)

@finestructure
Copy link
Member Author

Thanks for the additional info, @gwynne ! I think I'll keep fluent-posgres-driver pinned to 2.6.2 for now until everything has settled down.

If it helps, I can clean up and point you to the branch where I'm testing the migration. Maybe the additional context would be useful?

@rausnitz
Copy link
Contributor

rausnitz commented Jun 5, 2023

@finestructure Thanks for reporting this. I have been struggling with a similar issue, seeing error PSQLError(backing: PostgresNIO.PSQLError.(unknown context at $_________).Backing) for various database-generated errors. Pinning to 2.6.2 looks like it'll work for me as a short-term fix too.

@rausnitz
Copy link
Contributor

rausnitz commented Jun 5, 2023

Pinning to 2.6.2 looks like it'll work for me as a short-term fix too.

Also had to do this, actually:

.package(url: "https://github.com/vapor/postgres-kit.git", .exact("2.10.1")),

@gwynne
Copy link
Member

gwynne commented Jun 9, 2023

This has finally been addressed by #372; update your dependencies (and don't forget to un-pin fluent-postgres-driver, postgres-kit, and/or postgres-nio if you pinned them!) to get the new logic.

A few critical notes:

  • If you're seeing PSQLErrors via Vapor's error reporting logic (such as logging from ErrorMiddleware in route handlers), you'll see detailed errors automatically without any additional action.
  • If you've been getting PSQLErrors from calling autoMigrate(), or from anything in configure(_:), have a look at this template update for how to adjust your entrypoint.swift (or, for older projects, main.swift - the change is the same either way).
  • Finally, if there are PSQLErrors from other sources which yield this message in your logs:
    PSQLError – Generic description to prevent accidental leakage of sensitive data. For debugging details, use `String(reflecting: error)`.
    
    Look for where the log message is coming from. Most often, this will be the result of doing something like logger.error("Got an error: \(error)"). You can either change it to logger.report(error: error) if you have imported Vapor, or, if Vapor is not available or there's additional text in the log message, you can do what the generic message says, i.e.: logger.error("Got an error: \(String(reflecting: error))").

@gwynne gwynne closed this as completed Jun 9, 2023
@finestructure
Copy link
Member Author

  • Look for where the log message is coming from. Most often, this will be the result of doing something like logger.error("Got an error: \(error)"). You can either change it to logger.report(error: error) if you have imported Vapor, or, if Vapor is not available or there's additional text in the log message, you can do what the generic message says, i.e.: logger.error("Got an error: \(String(reflecting: error))").

Is there perhaps a way to restore the original error reporting behaviour with this new error type?

The problem with logger.report(error: error) is that it doesn't allow setting of the error level. Moreover, it seems to have changed the error level from critical to warning for errors reported in this way.

For example, we have

XCTAssertEqual(log.level, .critical)
XCTAssertEqual(log.message,
               #"server: duplicate key value violates unique constraint "idx_repositories_owner_name" (_bt_check_unique)"#)

This now fails with

XCTAssertEqual failed: ("warning") is not equal to ("critical")
XCTAssertEqual failed: ("PSQLError(code: server, serverInfo: [sqlState: 23505, detail: Key (lower(owner), lower(name))=(owner, name) already exists., file: nbtinsert.c, line: 649, message: duplicate key value violates unique constraint "idx_repositories_owner_name", routine: _bt_check_unique, localizedSeverity: ERROR, severity: ERROR, constraintName: idx_repositories_owner_name, schemaName: public, tableName: repositories], triggeredFromRequestInFile: PostgresKit/PostgresDatabase+SQL.swift, line: 53, query: PostgresQuery(sql: INSERT INTO "repositories" ("id", "created_at", "updated_at", "package_id", "default_branch", "forks", "homepage_url", "is_archived", "is_in_organization", "keywords", "last_issue_closed_at", "last_pull_request_closed_at", "license", "license_url", "name", "open_issues", "open_pull_requests", "owner", "owner_name", "owner_avatar_url", "readme_url", "readme_html_url", "releases", "stars", "summary") VALUES ($1, $2, $3, $4, $5, $6, NULL, $7, $8, $9, NULL, NULL, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22) RETURNING "id", binds: [(****; UUID; format: binary), (****; TIMESTAMPTZ; format: binary), (****; TIMESTAMPTZ; format: binary), (****; UUID; format: binary), (****; TEXT; format: binary), (****; BIGINT; format: binary), (****; BOOLEAN; format: binary), (****; BOOLEAN; format: binary), (****; TEXT[]; format: binary), (****; TEXT; format: binary), (****; TEXT; format: binary), (****; TEXT; format: binary), (****; BIGINT; format: binary), (****; BIGINT; format: binary), (****; TEXT; format: binary), (****; TEXT; format: binary), (****; TEXT; format: binary), (****; TEXT; format: binary), (****; TEXT; format: binary), (****; JSONB[]; format: binary), (****; BIGINT; format: binary), (****; TEXT; format: binary)]))") is not equal to ("server: duplicate key value violates unique constraint "idx_repositories_owner_name" (_bt_check_unique)")

This is from an error reported via logger.report(error: error) - i.e. we were already adopting the suggested error reporting.

This is having quite an impact on our server logs and on downstream systems looking at these logs. I'm sure there were good reasons to make this change, I'm just wondering if it's feasible to restore the original behaviour with these changes in place?

It's going to be tricky for us to audit that we're reporting from everywhere in this new way and some of the behaviour we can't restore (critical vs warning) unless we move from logger.report(error:) to logger.critical("... \(String(reflecting:))") in select places, which is going to take a good amount of fiddling.

Plus the resulting errors would be much more verbose (but that's a change we can likely absorb ok, because in testing the original error is still contained within the bigger error message - that means matchers will still work 🤞).

@gwynne
Copy link
Member

gwynne commented Jun 18, 2023

@finestructure logger.report(error:) hasn't changed what level it reports errors at; I'm not sure why you're seeing a difference there 😕. As far as I know, we've never reported such errors at the critical level (at least, not that I can find anywhere in the Fluent or Vapor logic).

@0xTim
Copy link
Member

0xTim commented Jun 18, 2023

Yeah we should never report critical errors in Fluent, though we used to allow the log level to be set for SQL logging - is that what you meant @finestructure ?

@finestructure
Copy link
Member Author

@gwynne @0xTim It's entirely possible I'm messing that up somehow.

Let me try and distill this into one test that's failing depending on what fluent-postgres-driver we're running:

This is current main:

❯ git diff
❯ git rev-parse @
b8c0c79eb2c06aa296033a97c11933980820f750
❯ swift test --filter IngestorTests.test_ingest_unique_owner_name_violation
Test Suite 'Selected tests' started at 2023-06-19 11:02:50.269.
Test Suite 'SPI-ServerPackageTests.xctest' started at 2023-06-19 11:02:50.270.
Test Suite 'IngestorTests' started at 2023-06-19 11:02:50.270.
Test Case '-[AppTests.IngestorTests test_ingest_unique_owner_name_violation]' started.
Test Case '-[AppTests.IngestorTests test_ingest_unique_owner_name_violation]' passed (0.247 seconds).
Test Suite 'IngestorTests' passed at 2023-06-19 11:02:50.518.
	 Executed 1 test, with 0 failures (0 unexpected) in 0.247 (0.247) seconds
Test Suite 'SPI-ServerPackageTests.xctest' passed at 2023-06-19 11:02:50.518.
	 Executed 1 test, with 0 failures (0 unexpected) in 0.247 (0.248) seconds
Test Suite 'Selected tests' passed at 2023-06-19 11:02:50.518.
	 Executed 1 test, with 0 failures (0 unexpected) in 0.247 (0.248) seconds

And this is current main when I unpin fluent-postgres-driver:

❯ git diff
diff --git a/Package.resolved b/Package.resolved
index befc9a6e..1a4c077d 100644
--- a/Package.resolved
+++ b/Package.resolved
@@ -59,8 +59,8 @@
       "kind" : "remoteSourceControl",
       "location" : "https://github.com/vapor/fluent-postgres-driver.git",
       "state" : {
-        "revision" : "c7e39767bc2440461a37ecd6d1cdfb21c959c272",
-        "version" : "2.6.2"
+        "revision" : "bb3ab8e861152157f712cd08fc473e885bd0b4df",
+        "version" : "2.7.2"
       }
     },
     {
diff --git a/Package.swift b/Package.swift
index f5fc15a4..3722ff98 100644
--- a/Package.swift
+++ b/Package.swift
@@ -38,7 +38,7 @@ let package = Package(
         // FIXME: change this back to a release once Xcode 15 support has been tagged
         .package(url: "https://github.com/pointfreeco/swift-snapshot-testing.git", revision: "dc46eeb3928a75390651fac6c1ef7f93ad59a73b"),
         .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.3.2"),
-        .package(url: "https://github.com/vapor/fluent-postgres-driver.git", exact: "2.6.2"),
+        .package(url: "https://github.com/vapor/fluent-postgres-driver.git", from: "2.0.0"),
         .package(url: "https://github.com/vapor/fluent.git", from: "4.0.0"),
         .package(url: "https://github.com/vapor/vapor.git", from: "4.0.0"),
     ],
❯ swift test --filter IngestorTests.test_ingest_unique_owner_name_violation
Test Suite 'Selected tests' started at 2023-06-19 11:06:57.772.
Test Suite 'SPI-ServerPackageTests.xctest' started at 2023-06-19 11:06:57.773.
Test Suite 'IngestorTests' started at 2023-06-19 11:06:57.773.
Test Case '-[AppTests.IngestorTests test_ingest_unique_owner_name_violation]' started.
/Users/sas/Projects/SPI/spi-server/Tests/AppTests/IngestorTests.swift:369: error: -[AppTests.IngestorTests test_ingest_unique_owner_name_violation] : XCTAssertEqual failed: ("warning") is not equal to ("critical")
/Users/sas/Projects/SPI/spi-server/Tests/AppTests/IngestorTests.swift:370: error: -[AppTests.IngestorTests test_ingest_unique_owner_name_violation] : XCTAssertEqual failed: ("PSQLError(code: server, serverInfo: [sqlState: 23505, detail: Key (lower(owner), lower(name))=(owner, name) already exists., file: nbtinsert.c, line: 649, message: duplicate key value violates unique constraint "idx_repositories_owner_name", routine: _bt_check_unique, localizedSeverity: ERROR, severity: ERROR, constraintName: idx_repositories_owner_name, schemaName: public, tableName: repositories], triggeredFromRequestInFile: PostgresKit/PostgresDatabase+SQL.swift, line: 53, query: PostgresQuery(sql: INSERT INTO "repositories" ("id", "created_at", "updated_at", "package_id", "default_branch", "forks", "homepage_url", "is_archived", "is_in_organization", "keywords", "last_issue_closed_at", "last_pull_request_closed_at", "license", "license_url", "name", "open_issues", "open_pull_requests", "owner", "owner_name", "owner_avatar_url", "readme_url", "readme_html_url", "releases", "stars", "summary") VALUES ($1, $2, $3, $4, $5, $6, NULL, $7, $8, $9, NULL, NULL, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22) RETURNING "id", binds: [(****; UUID; format: binary), (****; TIMESTAMPTZ; format: binary), (****; TIMESTAMPTZ; format: binary), (****; UUID; format: binary), (****; TEXT; format: binary), (****; BIGINT; format: binary), (****; BOOLEAN; format: binary), (****; BOOLEAN; format: binary), (****; TEXT[]; format: binary), (****; TEXT; format: binary), (****; TEXT; format: binary), (****; TEXT; format: binary), (****; BIGINT; format: binary), (****; BIGINT; format: binary), (****; TEXT; format: binary), (****; TEXT; format: binary), (****; TEXT; format: binary), (****; TEXT; format: binary), (****; TEXT; format: binary), (****; JSONB[]; format: binary), (****; BIGINT; format: binary), (****; TEXT; format: binary)]))") is not equal to ("server: duplicate key value violates unique constraint "idx_repositories_owner_name" (_bt_check_unique)")
Test Case '-[AppTests.IngestorTests test_ingest_unique_owner_name_violation]' failed (0.271 seconds).
Test Suite 'IngestorTests' failed at 2023-06-19 11:06:58.044.
	 Executed 1 test, with 2 failures (0 unexpected) in 0.271 (0.271) seconds
Test Suite 'SPI-ServerPackageTests.xctest' failed at 2023-06-19 11:06:58.044.
	 Executed 1 test, with 2 failures (0 unexpected) in 0.271 (0.272) seconds
Test Suite 'Selected tests' failed at 2023-06-19 11:06:58.044.
	 Executed 1 test, with 2 failures (0 unexpected) in 0.271 (0.272) seconds

What the test does is configure a CapturingLogger which records all the logs. The essential bit of the test assertions is

XCTAssertEqual(logs.count, 1)
let log = try XCTUnwrap(logs.first)
XCTAssertEqual(log.level, .critical)
XCTAssertEqual(log.message, #"server: duplicate key value violates unique constraint "idx_repositories_owner_name" (_bt_check_unique)"#)

There are no other code changes that could affect the change in assert level we're seeing, as far as I can see, just the fluent-postgres-driver version.

Does help to explain the issue? Is there something wrong with the way we're using/testing this?

@gwynne
Copy link
Member

gwynne commented Jun 19, 2023

To be honest, I'm at a loss as far as the log level goes - even a code search across the entire Vapor org doesn't show anywhere that we were doing anything at the critical level for database errors, or any other kind really. Is it possible you have either an extension on PostgresError that conformed it to DebuggableError (which Vapor's ErrorMiddleware would then use to log it at the level that conformance specifies), or a catch clause specific to PostgresError somewhere?

@gwynne
Copy link
Member

gwynne commented Jun 19, 2023

@fabianfett BTW, told you the new conformance was a little verbose... 😜

@gwynne
Copy link
Member

gwynne commented Jun 19, 2023

@finestructure Oh, I keep forgetting to mention - for a check of this nature you can leverage DatabaseError conformance (which both the old PostgresError and the new PSQLError have) by checking error.isConstraintFailure rather than trying to parse a log message (though your test looks to be aimed more at ensuring that requisite details get logged for later investigation, which that won't help with).

@finestructure
Copy link
Member Author

or a catch clause specific to PostgresError somewhere

Oooh, thanks so much @gwynne , that was exactly it. We had

        case let .failure(error) where error is PostgresError:
            // Escalate database errors to critical
            Current.logger().critical("\(error)")
            try await recordError(database: database, error: error, stage: stage)

and that was now falling through to a more generic logging statement.

@finestructure
Copy link
Member Author

@finestructure Oh, I keep forgetting to mention - for a check of this nature you can leverage DatabaseError conformance (which both the old PostgresError and the new PSQLError have) by checking error.isConstraintFailure rather than trying to parse a log message (though your test looks to be aimed more at ensuring that requisite details get logged for later investigation, which that won't help with).

Yes, I've actually been using that already, and you're right - this test is just to ensure our logging assumptions are being met.

I do however have a to do to come back to error.isConstraintFailure: I'd like to add another variant of this, error.is UniqueViolation. That's what we're actually testing for and I'd like to preserve the level of granularity here if possible.

I'll have a look and open a PR, ok?

@gwynne
Copy link
Member

gwynne commented Jun 20, 2023

I do however have a to do to come back to error.isConstraintFailure: I'd like to add another variant of this, error.is UniqueViolation. That's what we're actually testing for and I'd like to preserve the level of granularity here if possible.

I'll have a look and open a PR, ok?

It'd be quite welcome! But be aware you'll have to add the requirement to the DatabaseError protocol itself in fluent-kit, and then also add the implementations to the separate individual conformances in fluent-postgres-driver, fluent-mysql-driver, fluent-sqlite-driver, and fluent-mongo-driver 🤦‍♀️ You can make the the appropriate check specifically for Postgres with something like (error as? PSQLError)?.serverInfo?[.sqlState] == PostgresError.Code.uniqueViolation.raw. (And wow, that last bit in particular is really ugly; apparently we didn't make the "well-known" SQL states available through a less awkward interface as of yet...)

EDIT: Come to think of it, if you added isUniqueViolation, you'd also have to add isForeignKeyViolation (since Postgres correctly reports many different kinds of constraint violations with different SQL states), and I'm not sure you can distinguish between even just those two kinds reliably with any of the other drivers (I'm almost sure it's not that simple with SQLite in particular).

@gwynne
Copy link
Member

gwynne commented Jun 20, 2023

@fabianfett Did I miss anything in the above re: an easier way to check for an appropriate PSQLError?

@finestructure
Copy link
Member Author

EDIT: Come to think of it, if you added isUniqueViolation, you'd also have to add isForeignKeyViolation (since Postgres correctly reports many different kinds of constraint violations with different SQL states), and I'm not sure you can distinguish between even just those two kinds reliably with any of the other drivers (I'm almost sure it's not that simple with SQLite in particular).

Ah missed this edit... I've got the four repos forked and with branches for the change. Is this worth PRing then?

I'd mentioned this to Fabian earlier that I miss the granularity we had from Postgres with the Code enum. I wonder if there'd be a way to preserve these codes in a generic fashion such that you can still test them depending on your backend. That could avoid having to come up with types of errors that don't exist universally.

@finestructure
Copy link
Member Author

finestructure commented Jun 21, 2023

I'm almost sure it's not that simple with SQLite in particular

Turns out I'd actually forgotten to do isUniqueViolation for sqlite and even with that it's not really possible to map it correctly I think. There's only Reason.constraint to compare with (which isConstraintFailure uses), which would be too wide.

I think some sort of DatabaseError.code would be best but I'm not sure how you'd define that on the protocol in a way that the codes could be type-defined by the various backends.

Perhaps some protocol ErrorCode { } which is returned from DatabaseError via var code: any ErrorCode? I suppose the problem with that is that all the error codes would need to be explicitly re-mapped and maintained in the backend specific packages because you wouldn't want to conform external types to an external protocol. Unless you argue ErrorCode isn't external because it's from the family of fluent-* packages?

@finestructure
Copy link
Member Author

finestructure commented Jun 21, 2023

Another thought: if isUniqueViolation (and others like it) returned a tri-state

enum State {
  case true
  case false
  case unknown // undetermined, undefined?
}

backends that can't resolve certain conditions would have a way to opt out.

@finestructure
Copy link
Member Author

@gwynne @fabianfett I'm tempted to reopen this because I'm still struggling to retain the original behaviour but I wanted to check if I'm perhaps missing something obvious first. The problem is that this now obscures errors during development. For instance just now, in a test, I'm getting this error:

/Users/sas/Projects/SPI/spi-server/Tests/AppTests/SearchTests.swift:0: error: -[AppTests.SearchTests test_productTypeFilter_macro] : 
failed: caught error: "PSQLError – Generic description to prevent accidental leakage of sensitive data. For debugging details, use `String(reflecting: error)`."

I suppose I could wrap the whole test each time this happens and do the casting but it feels like that's less than ideal. Would it be possible to restore the original test error reporting at least for DEBUG builds? Or is there another, simpler way to see what's happening during development?

@finestructure
Copy link
Member Author

Sorry, but I feed I need to reopen this issue to see if there's a better way of dealing with these error messages.

I just got the following running the tests where I'd forgotten to start the docker container with the database:

Test Case '-[AppTests.API_PackageControllerTests test_BuildInfo_query]' started.
2023-07-12T08:34:00+0200 error codes.vapor.application : component=server [Vapor] Application.shutdown() was not called before Application deinitialized.
/Users/sas/Library/Developer/Xcode/DerivedData/spi-server-gsbjdobzvfgnckdblpfcxbfdtnun/SourcePackages/<compiler-generated>:0: error: -[AppTests.API_PackageControllerTests test_BuildInfo_query] : failed: caught error: "PSQLError – Generic description to prevent accidental leakage of sensitive data. For debugging details, use `String(reflecting: error)`."
<unknown>:0: -[AppTests.API_PackageControllerTests test_BuildInfo_query] : Test skipped: threw error "The operation couldn’t be completed. (PostgresNIO.PSQLError error 1.)"

In this case it's very easy to find the problem but imagine it's a more complex problem or you're a new contributor. Surely there has to be a better way to surface errors?

The error message says the motivation for the initial change was to prevent leaking sensitive data. Would it not be possible to at least do this in release builds only, if it's not possible to restrict it only to errors that are actually sensitive?

@finestructure finestructure reopened this Jul 12, 2023
@finestructure
Copy link
Member Author

CleanShot 2023-07-13 at 10 36 39@2x

😢

@gwynne
Copy link
Member

gwynne commented Jul 13, 2023

@finestructure We're still trying to come up with a good answer to this that's both safe and helpful. Our current thinking is to add an environment variable that switches on verbose-always errors (something like POSTGRES_NIO_VERBOSE_ERRORS; bikeshedding on the name is welcome) and only works in Debug builds - would this be workable for your case?

@finestructure
Copy link
Member Author

finestructure commented Jul 13, 2023

Thanks @gwynne , that would indeed be very helpful. Anything that's a one-off setup that we can also bake into the project so that contributors don't trip over this is great 🤗

As to naming, maybe even POSTGRES_NIO_UNSAFE_ERRORS or POSTGRES_NIO_ALLOW_SENSITIVE_DATA_ERRORS to stress why it exists. VERBOSE alone might tempt people to use it carelessly.

Although in effect we will enable this in general for DEBUG builds so maybe that's not needed as a deterrent. I can't imagine not setting this in a DEBUG build to be honest.

@rausnitz
Copy link
Contributor

rausnitz commented Aug 2, 2023

I have this code:

if let databaseError = error as? PostgresError, databaseError.code == .uniqueViolation

which worked up through fluent-postgres-driver 2.6.2 and postgres-kit 2.10.1. It lets us know when a user tries to change their email address to an email address that is already claimed by someone else, violating a unique constraint on the table.

I updated to the latest everything (fluent-postgres-driver 2.8.0, postgres-kit 2.12.0, postgres-nio 1.17.0) and now the error is a PSQLError instead of a PostgresError, so if statement is no longer ever true.

It looks like I can resolve it by doing

if let databaseError = error as? PSQLError, databaseError.serverInfo?[.sqlState] == PostgresError.Code.uniqueViolation.raw

as a way to detect the code for a unique violation, though it feels a bit off to look up a value in PSQLError in the enum cases of PostgresError.

@gwynne I wonder why this change (throwing PSQLError instead of PostgresError) was not considered a semver major change, since it was a breaking change. Is there another way I can find out when breaking changes are introduced to packages I'm using? Should I generally not be relying on Error types remaining unchanged?

@gwynne
Copy link
Member

gwynne commented Aug 4, 2023

@rausnitz So there are a lot of factors in play here:

  • Fluent consists of lots of moving parts; it and SQLKit both try to be abstraction layers over multiple database backends (with mediocre at best success in many ways, it must be admitted). Most of those backends have been relatively stable for some time, so when change does happen, it almost can't help but be disruptive.
  • PostgresNIO has, to date, easily seen more changes and updates than any of the others, in no smart part because it's the most popular of Fluent's drivers. (Although the MySQL backend is about to undergo a similar upheaval.) During all the work done in its overhaul, PostgresNIO tried desperately to not break adopters. Fabian would be the first to admit that this wasn't always successful and that sometimes small patch releases were necessary towards this end, but he'd also point out that the intent was always there. I'd also add that there are very few people actively contributing code to these packages right now; we can only do so much at a time (hence how long it's taken me to respond to this properly!).
  • PostgresNIO introduced a number of new, enormously improved APIs in the last year, and started to deprecate the existing APIs. The new APIs, which FluentPostgresDriver and PostgresKit have been rewritten to take as much advantage of as is practical, return the new PSQLError because it is vastly better than the old PostgresError, containing considerably more information; Fabian and I both made the call that it was better to provide the new error type than not, despite the lack of any higher-level abstraction.
  • In an ideal world, Fluent would have never forwarded errors from the lower level database drivers verbatim at all; it would have wrapped them in a higher-level Fluent error type that gave access to the underlying error at need. Whether this would have allowed better abstraction for this change is unclear, but it probably would have at the very least resulted in fewer people feeling the side effects. Unfortunately, this is not an ideal world, and present-day Fluent 4 is still stuck with several design choices Tanner made several years ago.

Also, to specifically address this:

I wonder why this change (throwing PSQLError instead of PostgresError) was not considered a semver major change, since it was a breaking change. Is there another way I can find out when breaking changes are introduced to packages I'm using? Should I generally not be relying on Error types remaining unchanged?

Swift does not consider the specific errors thrown by a given API part of that API's contract (notwithstanding APIs which use Result); as such, it is not a semver-major break - although it's obviously more than a little inconvenient in cases like this.

That being said, it's easily arguable that the overhaul of PostgresNIO should have been a new major version. We went back and forth on this several times during the development, and at the end of the day the fact of the matter is that I probably made the wrong call when I talked Fabian into doing it the way we did; it was a decision made based on the lack of any consensus in Vapor as a whole (at least at the time) on how to handle versioning, and concern about how to leverage the new APIs in higher layers (PostgresKit and FluentPostgresDriver). Since that time there has been considerably more discussion and benefit of experience, and if we were doing it now, I think we'd go the other way. I probably will do exactly that with the MySQLNIO updates - but even if I don't, I can promise right now there will not be a repeat of this problem 😰.

As for solutions to what we have with PostgresNIO now, there's considerable reluctance to provide a way to change PostgresNIO's default behavior, but the current plan is actually better than that - we're going to make PSQLError.description show at least some information. Exactly what information hasn't quite been nailed down yet, but I'm going to transfer this issue over to postgres-nio, and we'll link it to the PR as soon as there is one.

@gwynne gwynne transferred this issue from vapor/fluent-postgres-driver Aug 4, 2023
@fabianfett
Copy link
Collaborator

fabianfett commented Aug 4, 2023

That being said, it's easily arguable that the overhaul of PostgresNIO should have been a new major version. We went back and forth on this several times during the development, and at the end of the day the fact of the matter is that I probably made the wrong call when I talked Fabian into doing it the way we did; it was a decision made based on the lack of any consensus in Vapor as a whole (at least at the time) on how to handle versioning, and concern about how to leverage the new APIs in higher layers (PostgresKit and FluentPostgresDriver).

I want to raise, that I disagree with this statement. Code that is not in use is very much useless. If we had tagged a major for PostgresNIO, we would have been on the hook to maintain two versions of the library. This is for a very simple reason: Most PostgresNIO users don't use it directly but indirectly through Fluent or SQLKit. Those users would have been stuck on the previous version of PostgresNIO until Fluent would have made a major release. Looking at it from today's perspective, this state would have been ongoing for two and a half years by now.

For us maintainers this would have meant, that we need to fix bugs in the PostgresNIO 1.x, while developing 2.x code that only a minority of PostgresNIO users benefited from.

With the approach of slowly changing the underpinnings of PostgresNIO we have enabled huge performance and stability improvements for all users. All Fluent users benefit from those changes. This is especially important as we have fixed tons of edge cases, that are extremely hard to debug for users, that are not comfortable with the Postgres wire protocol.

Thanks to the huge adoption that the rewrite has seen (thanks to us not tagging a major release), we are now extremely confident in which APIs work and which don't. Once we will tag the next major version, we will "just" remove all APIs that we don't want to maintain further. Very little actual code changes will be required besides this.


Regarding the description that PSQLError prints, please post here what information you would like to see. cc @finestructure

@rausnitz
Copy link
Contributor

rausnitz commented Aug 7, 2023

Thanks @gwynne and @fabianfett, that is helpful context. From my perspective, the most important thing is being prepared for changes like this going forward. It sounds like I should expect that error types may change. I can also try to keep up more with release notes, though Vapor has a lot of dependencies, so I don't know if that is feasible.

@finestructure
Copy link
Member Author

@fabianfett asked me to write up a summary of the "ideal" error interface from our point of view. Here's what I've come up with 🙂

The original interface worked just fine for us, so I think in terms of an API that was pretty much what we needed.

One thing that stood out (but I'm not sure how fixable that is), was that we were reaching beyond Fluent into a PG specific error. That's certainly fine and we're so tied to PG that this isn't really a problem. It's just slightly odd that for specific error handling you then need to import PG modules. A Fluent wrapper (like @gwynne suggested above) would be closer what I'd have wished for.

I haven't thought through how you'd pass along specific error codes but I think preserving any explicit codes as they are thrown by the underlying driver would be helpful. A "self-documenting" enum (or struct with static) that gives them labels would be great but getting at the underlying raw value also feels quite important for web searches etc when something goes wrong. Old PostgresError had all this and was pretty much what we needed.

My understanding is that new PSQLError's main raison d'être is to prevent leaking of sensitive data, which of course is desirable. But I wonder if there isn't a narrower way of addressing this than masking all errors?

I'm sure you've considered all this but just for the sake of thinking out loud it seems the options are:

  1. Mask all errors (new PSQLError)
  2. Mask no errors (old PostgresError)
  3. Audit sensitive errors and mask only those
  4. Make all of the above configurable via env variable and/or tied to build settings (release/debug)

I'm not sure if 3. is feasible but I think it'd be the ideal version of this.

Earlier, we discussed how an env variable could configure this and the suggestion was to tie this to debug mode. However, I can see that making trouble-shooting of live issues quite hard. I don't think that's a setting we'd want, having thought about it some more.

I can see how bigger orgs / setups than ours cannot afford to treat their logs as secret and would have a need for this, maybe even as the default. But turning every error into The operation couldn’t be completed. (PostgresNIO.PSQLError error 1.) is really problematic operationally, where clear messages help tremendously to diagnose and fix things quickly.

I'm wondering: How would I actually get an informative error message in case something goes wrong in prod? Re-deploy with a new error handler somewhere that does the String(describing:) logging? How would I even know where to insert that in a hurry? The info to guide me where to put it is exactly what's hidden 😧

And if it's tied to an env variable with a debug build, that would mean a ~30min turnaround and running a live system with -c debug. That doesn't feel like a great solution.

Also, there mere act of redeploying might make errors go away and now you might be constantly fiddling with the logging until you catch the error with the right setup.

So I think if at all possible, only masking the actually sensitive errors is the best option (i.e. 3.).

Does that description make sense and help?

I hope it does, and thanks a lot for your time looking into this and for your help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants