Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AP => ATProto new user creation bottleneck on datastore #1376

Closed
snarfed opened this issue Oct 12, 2024 · 7 comments
Closed

AP => ATProto new user creation bottleneck on datastore #1376

snarfed opened this issue Oct 12, 2024 · 7 comments

Comments

@snarfed
Copy link
Owner

snarfed commented Oct 12, 2024

We got a bunch of new user signups from Flipboard this afternoon, ~9k during ~4:10-6:20p PT, ie roughly 1qps. Awesome!...and also a couple orders of magnitude higher than our usual incoming follow rate 😁, so it exposed a bottleneck in creating new ATProto users, specifically contention on the datastore transaction that creates the new repo and stores the initial commit and blocks. Example log below.

Task 39427228699028707821
AS2: {
  "actor": "https://flipboard.com/magazines/3zb0UW_iTeWO8Gz0um91Ew:m:2520005099",
  "id": "https://flipboard.com/70c2adc1-2719-4662-9401-ddf9dab03a0a",
  "object": "https://bsky.brid.gy/bsky.brid.gy",
  "type": "Follow",
  "url": "https://flipboard.com/magazines/3zb0UW_iTeWO8Gz0um91Ew:m:2520005099#followed-bsky.brid.gy"
}
Receiving activitypub follow https://flipboard.com/70c2adc1-2719-4662-9401-ddf9dab03a0a 
AS1: {
  "actor": "https://flipboard.com/magazines/3zb0UW_iTeWO8Gz0um91Ew:m:2520005099",
  "id": "https://flipboard.com/70c2adc1-2719-4662-9401-ddf9dab03a0a",
  "url": "https://flipboard.com/magazines/3zb0UW_iTeWO8Gz0um91Ew:m:2520005099#followed-bsky.brid.gy",
  "objectType": "activity",
  "verb": "follow",
  "object": "https://bsky.brid.gy/"
}
Follow of bot user, reloading https://flipboard.com/magazines/3zb0UW_iTeWO8Gz0um91Ew:m:2520005099
Signing with fed.brid.gy 's key
requests.get https://flipboard.com/magazines/3zb0UW_iTeWO8Gz0um91Ew:m:2520005099 {'data': None}
Received 200:
Received 200
Creating new did:plc for https://flipboard.com/magazines/3zb0UW_iTeWO8Gz0um91Ew:m:2520005099 woodworking-table-ideas-anikasdiylife.flipboard.com.ap.brid.gy https://atproto.brid.gy
...
adding GCP DNS TXT record for _atproto.woodworking-table-ideas-anikasdiylife.flipboard.com.ap.brid.gy. "did=did:plc:qynny5xtgq264h7q4nppzvoo"
Checking for existing record
done!
...
Storing ATProto app.bsky.actor.profile self
Storing ATProto chat declaration record

Traceback (most recent call last):
...
  File "oauth_dropins/webutil/flask_util.py", line 400, in decorator
    return fn(*args, **kwargs)
  File "/workspace/protocol.py", line 1679, in receive_task
    return PROTOCOLS[obj.source_protocol].receive(obj=obj, authed_as=authed_as,
  File "/workspace/protocol.py", line 995, in receive
    from_user.enable_protocol(proto)
  File "/workspace/models.py", line 555, in enable_protocol
    new_self = enable()
  File "google/cloud/ndb/_transaction.py", line 347, in transactional_inner_wrapper
    return transaction(
...
  File "/workspace/models.py", line 551, in enable
    to_proto.create_for(user)
  File "google/cloud/ndb/_transaction.py", line 347, in transactional_inner_wrapper
    return transaction(
...
  File "/workspace/atproto.py", line 428, in create_for
    repo = Repo.create(
  File "arroba/repo.py", line 216, in create
    return cls.create_from_commit(storage, commit_data, signing_key=signing_key,
  File "arroba/repo.py", line 178, in create_from_commit
    storage.apply_commit(commit_data)
  File "arroba/datastore_storage.py", line 416, in decorated
    ret = fn(self, *args, **kwargs)
  File "google/cloud/ndb/_transaction.py", line 347, in transactional_inner_wrapper
    return transaction(
...
  File "arroba/datastore_storage.py", line 564, in apply_commit
    repo = AtpRepo.get_by_id(commit['did'])
  File "google/cloud/ndb/_options.py", line 102, in wrapper
    return wrapped(*pass_args, **kwargs)
...
google.api_core.exceptions.RetryError: Maximum number of 3 retries exceeded while calling <function make_call.<locals>.rpc_call at 0x7f1219ed4220>, last exception: 409 too much contention on these datastore entities. please try again.
@snarfed snarfed added the now label Oct 12, 2024
@snarfed
Copy link
Owner Author

snarfed commented Oct 12, 2024

Ugh, ok, pretty obvious. We're unintentionally doing external network requests inside a datastore transaction here, specifically creating the new ATProto DID. OK, should at least be a straightforward fix.

snarfed added a commit that referenced this issue Oct 12, 2024
…s.enable

we were doing heavy work like generating RSA keys and external network requests for creating DIDs inside datastore txes, which led to bad contention. details in #1376
@snarfed
Copy link
Owner Author

snarfed commented Oct 12, 2024

^ Looks like this did the trick, datastore tx contention is down ~20x.

image

@snarfed snarfed closed this as completed Oct 12, 2024
@snarfed
Copy link
Owner Author

snarfed commented Oct 12, 2024

Total active user count over the last two days. Flipboard isn't messing around! Y axis is zoomed in so it's a bit misleading, but still 😆

image

@snarfed snarfed added the infra label Oct 12, 2024
@mackuba
Copy link

mackuba commented Oct 12, 2024

What is this triggered by? Did they add some new feature and people are enabling it, or is it automatically enabled, or what?

@snarfed
Copy link
Owner Author

snarfed commented Oct 12, 2024

They're automatically bridging publishers who are federated. It's not many, right now just 1k publishers and 15k "magazines," each of which is an AP actor. Still big for Bridgy Fed though!

@mackuba
Copy link

mackuba commented Oct 12, 2024

Yeah, I see around 20k new *.ap.brid.gy handles in my db, which is a double-digit % increase, and I think that more than doubled the total number of bsky.brid.gy repos?

@snarfed
Copy link
Owner Author

snarfed commented Oct 13, 2024

Hmm! I see a bit less than 10k new repos on my end. Remind me, are you looking at PLC directory operations and counting DIDs with atproto.brid.gy as their PDS? If you de-dupe by handle, does that number come down? This issue definitely made us retry and create extra DIDs that we didn't end up using. 10k extra DIDs sounds high, but not impossible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants