[CDCSDK] CDC service does not populate DDL record in case of cache miss error #20698

Closed
yugabyte-ci opened this issue Jan 19, 2024 · 0 comments
Labels
area/cdcsdk CDC SDK jira-originated kind/bug This issue is a bug priority/low Low priority

yugabyte-ci commented Jan 19, 2024

Jira Link: DB-9701

Upon investigation, we found that the error is caused by a flow similar to the following:

  1. Create YB instance
  2. Create enum type and table
  3. Start connector to stream data
  4. Everything will work fine
  5. Delete or stop connector
  6. Create a new enum type, OR drop the enum type and recreate the same type again (note that we’re still using the same YB server)
  7. On the service, the enum type was cached in step 3, and the code only checks whether an enum map is present in the cache. Because the map is already present from step 3, the service hits an error:
     vlog1: PopulateCDCSDKWriteRecordWithInvalidSchemaRetry: Recevied error status: Cache miss error (yb/docdb/docdb_pgapi.cc:1535): enum, while prcoessing WRITE_OP, with op_id: term: 1 index: 4, on tablet: 278f85f66fde4ad6997f22f8ee48a3ed
  8. While handling the error, the service clears all the processed records and forms the response again, but this time a DDL record is not populated: the schema details are already cached, so the service wrongly assumes that it has already sent the DDL record to the connector.
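The stale-cache condition in step 7 can be sketched roughly as follows. This is a minimal illustration, not the actual CDC service code: `EnumLabelMap`, `LookupLabel`, and `LookupLabelWithRefresh` are hypothetical names, and the sketch assumes that a recreated enum type gets new label OIDs, so a map cached for the old type no longer contains them.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins; these are not the real YugabyteDB types.
using EnumOid = uint32_t;
using EnumLabelMap = std::unordered_map<EnumOid, std::string>;

enum class LookupStatus { kOk, kCacheMiss };

struct LookupResult {
  LookupStatus status;
  std::string label;
};

// The cached map is used as-is whenever it exists. A label OID that was
// created after the map was filled is therefore reported as a cache miss.
LookupResult LookupLabel(const EnumLabelMap& cache, EnumOid oid) {
  auto it = cache.find(oid);
  if (it == cache.end()) {
    return {LookupStatus::kCacheMiss, ""};
  }
  return {LookupStatus::kOk, it->second};
}

// On a cache miss, replace the cached labels with the current catalog
// contents (assumed to be available to the caller) and retry once.
LookupResult LookupLabelWithRefresh(EnumLabelMap& cache,
                                    const EnumLabelMap& catalog,
                                    EnumOid oid) {
  LookupResult result = LookupLabel(cache, oid);
  if (result.status == LookupStatus::kCacheMiss) {
    cache = catalog;
    result = LookupLabel(cache, oid);
  }
  return result;
}
```

In this model, a cache holding `{100: "red"}` for the dropped type misses on OID 200 of the recreated type, and the refresh path then resolves it. The refresh itself is not the problem; the missing DDL record in step 8 comes from the separate schema cache.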
@yugabyte-ci yugabyte-ci added area/cdcsdk CDC SDK jira-originated kind/bug This issue is a bug priority/low Low priority status/awaiting-triage Issue awaiting triage labels Jan 19, 2024
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Apr 15, 2024
@yugabyte-ci yugabyte-ci changed the title from "[CDCSDK] Connector tests for ENUM throwing NullPointerException" to "[CDCSDK] CDC service does not populate DDL record in case of cache miss error" Apr 15, 2024
vaibhav-yb added a commit that referenced this issue Apr 18, 2024
…f needed

Summary:
When the service layer receives a `CacheMissError` while serving a `GetChanges` request (req_1), it refetches the enum labels and executes a new internal `GetChanges` request (req_2) to build a fresh `GetChangesResponse`.

Now suppose this is the connector's first `GetChanges` request, so it has not yet received the DDL record. After the service clears the response, it consults the `cached_schema_details` object while making `req_2` to decide whether to publish the DDL record. Because `cached_schema_details` was already populated while processing `req_1`, the service does not populate the DDL record. The client therefore never receives the DDL record in the `GetChangesResponse` and fails while decoding subsequent change events.

**Solution:**

This diff implements a simple solution: clear `cached_schema_details` while executing `req_2` if the connector/client has indicated that it needs the schema, i.e. if `req->need_schema_info() == true`.
Jira: DB-9701
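
The retry flow and the fix can be modelled with a small sketch. It is a simplified illustration under stated assumptions, not the code from this diff: `Request`, `Response`, `SchemaCache`, `ProcessOnce`, and the `apply_fix` switch are hypothetical, and only the `need_schema_info` and `cached_schema_details` concepts come from the summary above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the CDC service types.
struct Request {
  bool need_schema_info = false;
};
struct Response {
  std::vector<std::string> records;
};
struct SchemaCache {  // plays the role of cached_schema_details
  bool populated = false;
};

enum class Status { kOk, kCacheMiss };

// One pass over the pending records: emit the DDL record only when the
// schema is not cached yet, then emit the write record. The write fails
// with a cache miss while the enum labels are stale.
Status ProcessOnce(SchemaCache& cache, bool enum_labels_fresh, Response& resp) {
  if (!cache.populated) {
    resp.records.push_back("DDL");
    cache.populated = true;
  }
  if (!enum_labels_fresh) {
    return Status::kCacheMiss;
  }
  resp.records.push_back("WRITE");
  return Status::kOk;
}

// req_1 hits the cache miss, the response is cleared, and req_2 runs.
// With apply_fix set, the schema cache is cleared before req_2 whenever
// the client asked for schema info, so the DDL record is emitted again.
Response GetChanges(const Request& req, SchemaCache& cache, bool apply_fix) {
  Response resp;
  bool enum_labels_fresh = false;  // the first attempt sees stale labels
  if (ProcessOnce(cache, enum_labels_fresh, resp) == Status::kCacheMiss) {
    resp.records.clear();      // discard the partial response of req_1
    enum_labels_fresh = true;  // labels have been refetched
    if (apply_fix && req.need_schema_info) {
      cache.populated = false;
    }
    Status retry_status = ProcessOnce(cache, enum_labels_fresh, resp);
    assert(retry_status == Status::kOk);
    (void)retry_status;
  }
  return resp;
}
```

With `need_schema_info` set, calling `GetChanges` with `apply_fix` off yields only the write record, which mirrors the reported bug, while turning it on yields the DDL record followed by the write record.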

Test Plan:
```
./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestPopulationOfDDLRecordUponCacheMiss
```

Reviewers: skumar, asrinivasan, stiwary

Reviewed By: skumar

Subscribers: ycdcxcluster

Differential Revision: https://phorge.dev.yugabyte.com/D34107
vaibhav-yb added a commit that referenced this issue Apr 23, 2024
…late DDL record if needed

Summary:
Original commit: 5cdbe49 / D34107

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34274
vaibhav-yb added a commit that referenced this issue Apr 23, 2024
…te DDL record if needed

Summary:
Original commit: 5cdbe49 / D34107

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34275
vaibhav-yb added a commit that referenced this issue Apr 24, 2024
…te DDL record if needed

Summary:
Original commit: 5cdbe49 / D34107

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34370
svarnau pushed a commit that referenced this issue May 25, 2024
…f needed

Differential Revision: https://phorge.dev.yugabyte.com/D34107
siddharth2411 pushed a commit to siddharth2411/yugabyte-db that referenced this issue Jun 13, 2024
…o populate DDL record if needed

Summary:
Original commit: 5cdbe49 / D34107

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34275
ZhenYongFan pushed a commit to ZhenYongFan/yugabyte-db that referenced this issue Jun 15, 2024
… to populate DDL record if needed

Summary:
Original commit: 5cdbe49 / D34107

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34274