Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ceph: fix improper json parsing #7354

Merged

Conversation

satoru-takeuchi
Copy link
Member

Description of your changes:

Sometimes radosgw-admin succeed after showing error message.

Here is an example.

2021-02-26 04:10:44.190418 I | op-bucket-prov: creating Ceph user "ceph-user-aSzNqgE7"
E0226 04:11:37.901310       8 controller.go:199] error syncing 'logging/loki-bucket': error provisioning bucket: Provision: can't create ceph user: error creating ceph user "ceph-user-aSzNqgE7": failed to unmarshal json. 2021-02-26T04:11:21.425+0000 7f6714be4980  1 robust_notify: If at first you don't succeed: (110) Connection timed out
2021-02-26T04:11:21.426+0000 7f6714be4980  0 ERROR: failed to distribute cache for ceph-hdd-object-store.rgw.meta:users.uid:ceph-user-aSzNqgE7
2021-02-26T04:11:32.168+0000 7f6714be4980  1 robust_notify: If at first you don't succeed: (110) Connection timed out
2021-02-26T04:11:32.168+0000 7f6714be4980  0 ERROR: failed to distribute cache for ceph-hdd-object-store.rgw.meta:users.keys:23Z8GUEXR0TJDO86PSJR
{
    "user_id": "ceph-user-aSzNqgE7",
    "display_name": "ceph-user-aSzNqgE7",
...
    "mfa_ids": []
}: invalid character '-' after top-level value: failed to unmarshal json.

In this case, some messages like "robust_notify: If at first you don't succeed: (110) Connection timed out"
was shown in stderr before showing standard error that consists of json. Since stdout and stderr
are joined, the caller can't parse json properly.

Checklist:

  • [o] Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • [o] Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • [o] Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • [o] Reviewed the developer guide on Submitting a Pull Request
  • [o] Documentation has been updated, if necessary.
  • [o] Unit tests have been added, if necessary.
  • [o] Integration tests have been added, if necessary.
  • [o] Pending release notes updated with breaking and/or notable changes, if necessary.
  • [o] Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • [o] Code generation (make codegen) has been run to update object specifications, if necessary.

@@ -67,7 +67,7 @@ func RunAdminCommandNoMultisite(c *Context, args ...string) (string, error) {
}

// start the rgw admin command
output, err := c.Context.Executor.ExecuteCommandWithTimeout(timeout, command, args...)
output, err := c.Context.Executor.ExecuteCommandWithTimeout(timeout, true, command, args...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can instead call ExecuteCommandWithOutputFile() to have the output written to a file so the stdout and stderr won't conflict. Does that work for you? Then we wouldn't need the new flag.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good idea, I'll fix!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this function can't be used because radosgw-admin doesn't have an option to specify the output file. I introduced a new function that returns only stdout and can specify timeout.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that extracting json from combined output seems to be better. Some error handling routine s use whether the output contains some strings, to judge the type of errors. I'll fix my PR later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@satoru-takeuchi How did you repro this issue? I heard another instance of it for an external cluster scenario, but I can't get a repro yet. Anyway, looking forward to the fix, thanks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue happens in the CI of my k8s cluster. Then OB is created under heavy load consists of RBD creation and many applications' initialization. However, I don't know the reliable simple reproducer yet.

@travisn travisn added backport-release-1.5 ceph main ceph tag object Object protocol - S3 labels Mar 4, 2021
@satoru-takeuchi satoru-takeuchi changed the title WIP: ceph: fix improper json parsing Wceph: fix improper json parsing Mar 4, 2021
@satoru-takeuchi satoru-takeuchi changed the title Wceph: fix improper json parsing ceph: fix improper json parsing Mar 4, 2021
@satoru-takeuchi satoru-takeuchi force-pushed the ceph-fix-improper-json-parsing-in-radosgw-admin branch 5 times, most recently from 7287334 to 57a523c Compare March 9, 2021 05:45
@@ -75,6 +76,17 @@ func RunAdminCommandNoMultisite(c *Context, args ...string) (string, error) {
return output, nil
}

func ExtractJson(output string) (string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you thought about putting this inside the Executor directly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean that embedding this logic in Executor.Execute*() methods? If so, I didn't adopt this way because of two reasons. First, some callers don't parse the output as json. In addition, this behavior seems to be radosgw-admin specific. So I hesitate to put this logic in Executer which is for all commands.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but do the extract json in an executor that ran with json output only. I think it's fine for now and more an RGW bug so I'm fine with that!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call ExtractJson() from RunAdminCommandNoMultisite()? That seems like the right place instead of calling it everywhere else.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Year, it's nice idea. The new PR is far simpler than before.

I introduced a flag in runAdminCommand() to decide whether calling extractJSON() inside them. It's because some callers checking output before parsing JSON. In this case, that calllers directly call extractJSON() after checking output.

pkg/operator/ceph/object/admin_test.go Show resolved Hide resolved
@@ -75,6 +76,17 @@ func RunAdminCommandNoMultisite(c *Context, args ...string) (string, error) {
return output, nil
}

func ExtractJson(output string) (string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call ExtractJson() from RunAdminCommandNoMultisite()? That seems like the right place instead of calling it everywhere else.

@satoru-takeuchi satoru-takeuchi force-pushed the ceph-fix-improper-json-parsing-in-radosgw-admin branch from 57a523c to 21dbf97 Compare March 10, 2021 13:23
Copy link
Member

@travisn travisn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just some small suggestions

@@ -230,7 +230,7 @@ func (r *ReconcileObjectZone) createCephZone(zone *cephv1.CephObjectZone, realmN
}

// create zone
_, err = object.RunAdminCommandNoMultisite(objContext, "zone", "get", realmArg, zoneGroupArg, zoneArg)
_, err = object.RunAdminCommandNoMultisite(objContext, false, "zone", "get", realmArg, zoneGroupArg, zoneArg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though the result is ignored, how about still passing true whenever we have a get.

Suggested change
_, err = object.RunAdminCommandNoMultisite(objContext, false, "zone", "get", realmArg, zoneGroupArg, zoneArg)
_, err = object.RunAdminCommandNoMultisite(objContext, true, "zone", "get", realmArg, zoneGroupArg, zoneArg)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, fixed.

@@ -308,7 +308,7 @@ func (r *ReconcileObjectZone) reconcileCephZoneGroup(zone *cephv1.CephObjectZone
zoneGroupArg := fmt.Sprintf("--rgw-zonegroup=%s", zone.Spec.ZoneGroup)
objContext := object.NewContext(r.context, r.clusterInfo, zone.Name)

_, err := object.RunAdminCommandNoMultisite(objContext, "zonegroup", "get", realmArg, zoneGroupArg)
_, err := object.RunAdminCommandNoMultisite(objContext, false, "zonegroup", "get", realmArg, zoneGroupArg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_, err := object.RunAdminCommandNoMultisite(objContext, false, "zonegroup", "get", realmArg, zoneGroupArg)
_, err := object.RunAdminCommandNoMultisite(objContext, true, "zonegroup", "get", realmArg, zoneGroupArg)

@@ -230,7 +230,7 @@ func (r *ReconcileObjectZoneGroup) createCephZoneGroup(zoneGroup *cephv1.CephObj
}

// create zone group
output, err = object.RunAdminCommandNoMultisite(objContext, "zonegroup", "get", realmArg, zoneGroupArg)
output, err = object.RunAdminCommandNoMultisite(objContext, false, "zonegroup", "get", realmArg, zoneGroupArg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
output, err = object.RunAdminCommandNoMultisite(objContext, false, "zonegroup", "get", realmArg, zoneGroupArg)
output, err = object.RunAdminCommandNoMultisite(objContext, true, "zonegroup", "get", realmArg, zoneGroupArg)

@@ -279,7 +279,7 @@ func (r *ReconcileObjectZoneGroup) reconcileCephRealm(zoneGroup *cephv1.CephObje
realmArg := fmt.Sprintf("--rgw-realm=%s", zoneGroup.Spec.Realm)
objContext := object.NewContext(r.context, r.clusterInfo, zoneGroup.Name)

_, err := object.RunAdminCommandNoMultisite(objContext, "realm", "get", realmArg)
_, err := object.RunAdminCommandNoMultisite(objContext, false, "realm", "get", realmArg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_, err := object.RunAdminCommandNoMultisite(objContext, false, "realm", "get", realmArg)
_, err := object.RunAdminCommandNoMultisite(objContext, true, "realm", "get", realmArg)

@satoru-takeuchi satoru-takeuchi force-pushed the ceph-fix-improper-json-parsing-in-radosgw-admin branch 2 times, most recently from 38a8f02 to de5c40f Compare March 10, 2021 15:09
@travisn
Copy link
Member

travisn commented Mar 10, 2021

Looks good, there is just a unit test failure now.

Sometimes `radosgw-admin` succeeds after showing logs to stderr. We should
skip non-json strings if parsing output as json.

Here is an example.

```
2021-02-26 04:10:44.190418 I | op-bucket-prov: creating Ceph user "ceph-user-aSzNqgE7"
E0226 04:11:37.901310       8 controller.go:199] error syncing 'logging/loki-bucket': error provisioning bucket: Provision: can't create ceph user: error creating ceph user "ceph-user-aSzNqgE7": failed to unmarshal json. 2021-02-26T04:11:21.425+0000 7f6714be4980  1 robust_notify: If at first you don't succeed: (110) Connection timed out
2021-02-26T04:11:21.426+0000 7f6714be4980  0 ERROR: failed to distribute cache for ceph-hdd-object-store.rgw.meta:users.uid:ceph-user-aSzNqgE7
2021-02-26T04:11:32.168+0000 7f6714be4980  1 robust_notify: If at first you don't succeed: (110) Connection timed out
2021-02-26T04:11:32.168+0000 7f6714be4980  0 ERROR: failed to distribute cache for ceph-hdd-object-store.rgw.meta:users.keys:23Z8GUEXR0TJDO86PSJR
{
    "user_id": "ceph-user-aSzNqgE7",
    "display_name": "ceph-user-aSzNqgE7",
...
    "mfa_ids": []
}: invalid character '-' after top-level value: failed to unmarshal json.
```

In this case, some logs like "robust_notify:..." was shown in stderr.
Unmarsharing was failed due to tried to parse these logs as json.

Signed-off-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
@satoru-takeuchi satoru-takeuchi force-pushed the ceph-fix-improper-json-parsing-in-radosgw-admin branch from de5c40f to a89d35e Compare March 10, 2021 16:07
@satoru-takeuchi
Copy link
Member Author

satoru-takeuchi commented Mar 10, 2021

This failure was resolved by returning json from fake radosgw-admin ... get.

@travisn travisn merged commit c06a009 into rook:master Mar 10, 2021
mergify bot added a commit that referenced this pull request Mar 10, 2021
@satoru-takeuchi satoru-takeuchi deleted the ceph-fix-improper-json-parsing-in-radosgw-admin branch June 1, 2021 00:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ceph main ceph tag object Object protocol - S3
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants