Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Support for Token Rotation #1244

Conversation

misscoded
Copy link
Contributor

@misscoded misscoded commented May 21, 2021

#1241

Although the original consensus was to tie the expiry check and subsequent refresh to making a call to the API, it was at some point decided that this should occur when events are received. The current state of this draft reflects that decision.

HOW IT WORKS

On OAuth installation (the only installation flow allowed when token rotation is enabled), new access tokens are stored as they have been, but now include an accompanying refresh_token and expires_in. The refresh_token (does not expire) is used to "trade-in" for a new access_token and refresh_token pair.

For each token (bot|user), we use the expires_in value to calculate and store a corresponding future expiry timestamp (UTC, seconds).

Each time an event is received, authorize gets run. Here we determine if token rotation is enabled. If so, we check the expiry of the token(s) and refresh those that have expired or will expire within 2 hours. We then update the returned authResult and the Installation.

NOTABLE CHANGES

  1. AuthorizationResult has four new keys: botRefreshToken, botTokenExpiresAt, userRefreshToken, and userTokenExpiresAt. This was to handle apps that have both User and Bot tokens while still keeping the shape flat.

  2. Installation has four new keys: a refreshToken and expiresAt for both the user and bot objects.

TO CONSIDER

2 hours is the current window of time we try to get ahead of an expiry. If customization of the TTL is introduced later on, this is likely too inflexible and will need to be revisited.

Refreshing on incoming events still feels strange, but it appears to work as expected thus far. Logic for the expiry checks are run on every incoming event, which is no problem with single user testing, but I'm unsure if results would be different with events constantly pouring in across many channels.

TODO / OUTSTANDING

  • Address TypeScript issues
  • Write tests - will write during beta
  • Retry on failure to refresh (discussion)
  • Revisit possibility of concurrent calls (discussion)
  • Introduce deleteInstallation (discussion)
  • Verify non-token rotation-enabled apps still work as expected
  • Stress testing

Requirements (place an x in each [ ])

@misscoded misscoded added semver:minor enhancement M-T: A feature request for new functionality pkg:web-api applies to `@slack/web-api` pkg:oauth applies to `@slack/oauth-helper` draft labels May 21, 2021
@misscoded misscoded self-assigned this May 21, 2021
@@ -91,7 +91,7 @@ export class InstallProvider {
}

/**
* Fetches data from the installationStore for non Org Installations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvements! As this pull request takes a bit long before merging into main branch, can you submit another pull request including some token rotation unrelated changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! The correction to the comment above is the only non-token rotation change included here. Since it's not time-sensitive and it's just a comment, I think it's fine to just wait until this PR goes in to see it fixed.

@@ -636,7 +739,11 @@ export type OrgInstallationQuery = InstallationQuery<true>;
// of Bolt.
export interface AuthorizeResult {
botToken?: string;
botRefreshToken?: string;
botTokenExpiresAt?: number;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about having a comment like // UTC time in seconds for the expiresAt properties?

const installationUpdates: any = { ...queryResult }; // TODO :: TS
client = new WebClient(undefined, this.clientOptions);

// TODO :: Concurrent Calls?
Copy link
Member

@seratch seratch May 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concurrent calls are fine as long as no call can be skipped. Specifically, we cannot have delayed calls here as the calls can be forcibly terminated when running on a FaaS runtime (AWS Lambda, Google Cloud Functions, Azure Functions). We may want to have a flag option like processBeforeResponse in Bolt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point. Since it's only ever two calls, do you think implementing a concurrent strategy is necessary? Also, thoughts on how to move forward with the retry logic if a call fails?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you think implementing a concurrent strategy is necessary?

Yes, I think that it's valuable for end-user experience and less 3 second timeouts. To avoid having an extra overhead, we would like to asynchronously do these refreshing API calls as it's not necessary for this incoming request handling. If the oauth.v2.access here takes 300-500 milliseconds in average, calling twice may take 1 second in a worst case scenario. This may not be a small overhead.

I haven't checked if this works but one possible idea would be having a private function like async function refreshExpiringToken(tokensToRefresh) and calling the method asynchronously (no await) by default. In addition to that, we can consider FaaS applications by calling the same method with await if processBeforeResponse is true.

Also, thoughts on how to move forward with the retry logic if a call fails?

It depends on the error. If the error is recoverable (e.g., connection failure, timeouts), we can retry with the same set of parameters. If the response has an error code like invalid parameters or invalid auth, of course, we don't need to retry the call anymore.


if (tokensToRefresh.length > 0) {
const installationUpdates: any = { ...queryResult }; // TODO :: TS
client = new WebClient(undefined, this.clientOptions);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This client can be const client as it's used only inside this code block (scope).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! This is a remnant from a change I made during cleanup. Will fix.

refresh_token: refreshToken,
}) as OAuthV2Response;

// TODO :: Sort out TS issues
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, as we use automatic code generator for the types, we are unable to have non-optional properties here.
When we can assume the property exists 100%, using ! can be okay but if I wrote this part, I would have if statements verifying the existence. Doing this satisfies the TS compiler plus may detect really unexpected situations (say, Slack server-side regressions / incidents).

            const tokenType = refreshResp.token_type;
            if (tokenType === undefined) {
              throw new Error("unexpectedly token_type is missing here!");
            }
            const expireIn = refreshResp.expires_in;
            if (expireIn === undefined) {
              throw new Error("unexpectedly expire_in is missing here!");
            }
            const expiresAt: number = currentUTCSec + expireIn;

            if (tokenType === 'bot') {
              authResult.botToken = refreshResp.access_token;
              authResult.botRefreshToken = refreshResp.refresh_token;
              authResult.botTokenExpiresAt = expiresAt;
            }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type is actually not using the generated code version of the response, but it brings up the greater question: should it? Are we throwing out all of our hand-rolled types and opting for the generated ones at this point?

Thanks for your suggestion for a resolution the current state of things -- I'll plan to incorporate it.

Copy link
Member

@seratch seratch May 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type is actually not using the generated code version of the response, but it brings up the greater question: should it?

Yes, the type will be updated once the properties are GAed (or at least in public beta). With the current approach, tall the properties in responses are optional. There is no way to exclude specific properties at this point. During the beta period, you can go with (=internally cast the type) an internal type in this source file.

Are we throwing out all of our hand-rolled types and opting for the generated ones at this point?

We don't have any hand written response types now. Also, we don’t have the bandwidth to guarantee the quality of it for all API responses and I have to say that the situation can be the same in the future. That's why we are auto generating the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For OAuth, we do have hand rolled types right now. See https://github.com/slackapi/node-slack-sdk/blob/main/packages/oauth/src/index.ts#L705-L726

But maybe we shouldn't have these and instead be switching over to using the generated types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every time we remove types that reliably indicate what properties are present I die a little inside, but I do acknowledge the motivation outlined above. 🙂

Just let me know if you two would like me to remove it when the generated types are updated to reflect the payload changes!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha I agree. Lets not remove anything right now. Maybe in the future we can get more reliable with the generated response types and then we can revist this.


if (authResult.botRefreshToken !== undefined && authResult.botTokenExpiresAt !== undefined) {
const botTokenExpiresIn = authResult.botTokenExpiresAt - currentUTCSec;
if (botTokenExpiresIn <= 7200) { // 2 hours
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are doing the same twice here. How about extracting the logic as a private function function isExpiring(expiresIn, currentUTCSeconds): boolean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After circling back to this, I've decided to pull out the bad use of a magic number into a const above. I don't think that creating a new function for a comparison operator is a good choice here, as it causes even more legwork for the reader.

Copy link
Member

@seratch seratch May 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. That's okay. Another factor to generally consider is that naming a chunk of logic may help us easily understand what the lines of code do but this time, having a method or not is not a big deal at all.

On another note, while checking this pull request again, I noticed that we can have two groups of tokens here.

  1. the tokens already expired
  2. the tokens that are going to be expired in 2 hours

In the case of 1. , we always must immediately call oauth.v2.access API with a refresh token for following usage.

On the other hand, in the case of 2., we basically don't want to have the refresh API call in the synchronous operation for an incoming request as it's not yet expired. As I mentioned in my comment, we may want to have a new flag option similar to processBeforeResponse for FaaS users. If the flag is true, we have to call the refresh API synchronously.

I don't know having 1. and 2. in the tokensToRefresh array is intentional but a bit more consideration for 2. can improve app's response time and it would be valuable for user experience and less timeouts. Thoughts? If you want to see example code, I can come up with Python draft PR in a few days.

Copy link
Contributor Author

@misscoded misscoded May 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely agree re: the chunks of logic and their role in aiding readability, so I do understand your point. I'll take another pass at it and see if I can improve upon what's there.

Regarding the second part, you bring up excellent concerns. The use of tokensToRefresh was definitely intentional when I wrote it because we considered the situation either refresh vs. no refresh, but I agree with you that we can do better by drilling down. Given my lack of first-hand experience with the FaaS experience, I will rely on you heavily to ensure the inclusion of an additional option is being incorporated properly.

I'll work on the above over the next day or so and would like to compare notes with your implementation so that we can end up with the best solution. Thanks for your continued input!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding 2, I think for the beta we stick with the current approach (make the listener code wait until token refresh is done, even if token isn't expired). I'd like to see if people in the real world start getting 3 second timeout issues with this change. If the answer is no, than our solution is sufficient and we don't need to optimize this for now. If timeouts do increase, then we may want to look at moving token refresh code into web-api or possibly or even adding an entry point to run token refresh after the listener is completed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sharing thoughts, you both!

I agree that we can go with a simple approach first and see how it goes during the beta. The discussion here should be helpful when we optimize the part when it's needed. Let's go with this way for its beta releases.

Copy link
Member

@mwbrooks mwbrooks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @misscoded and thanks for the detailed write up about Token Rotation!

Two questions that come to mind are:

1. Tokens refreshed on events: When an event is received, the authorize is run. What happens when an app doesn't subscribe to any events? For example, the app may only want to use channels.list and chat.postMessage. Would token rotation not work in that situation?

2. Refreshing a token: When we refresh a token, is the previous token immediately invalidated?

I can imagine a scenario where multiple events trigger at once, causing the same refresh token to be refreshed by multiple event handlers at the same time. Each one will be written to the database and there's no guarantee it would be the most "fresh" token.

In this scenario, would any of the refreshed tokens be valid? Or is there a chance that we would save an "old" token to the database?

@misscoded
Copy link
Contributor Author

@mwbrooks

1. Tokens refreshed on events: When an event is received, the authorize is run. What happens when an app doesn't subscribe to any events? For example, the app may only want to use channels.list and chat.postMessage. Would token rotation not work in that situation?

This is a very valid point, IMO, and was my biggest push back when Steve/Kaz hit me with the redirection to event-based refresh rather than API-call-based. I can't speak in depth about the rationale beyond what I understand to be considering it an edge case and unlikely to happen, but I could be misremembering that so I'll tap @stevengill to fill in the details of the conversation where it was decided.

2. Refreshing a token: When we refresh a token, is the previous token immediately invalidated?

So, refresh tokens actually expire 5 minutes after they're first used, meaning that the same refresh token could be used multiple times back-to-back. That said, only two access_tokens can exists at any given point, so a dev will have -- at most -- two unique, valid access_tokens and a single refresh_token. Once those two exist, any more refreshes beyond that would invalidate the older of the two access tokens.

I can imagine a scenario where multiple events trigger at once, causing the same refresh token to be refreshed by multiple event handlers at the same time. Each one will be written to the database and there's no guarantee it would be the most "fresh" token. In this scenario, would any of the refreshed tokens be valid? Or is there a chance that we would save an "old" token to the database?

In the case of multiple events coming in at once, assuming each is written to the DB, I've been assured it's been deemed safe to assume that the last one (or two) would be valid.

@stevengill
Copy link
Member

This is a very valid point, IMO, and was my biggest push back when Steve/Kaz hit me with the redirection to event-based refresh rather than API-call-based. I can't speak in depth about the rationale beyond what I understand to be considering it an edge case and unlikely to happen, but I could be misremembering that so I'll tap @stevengill to fill in the details of the conversation where it was decided.

I think the thinking here is that the main way to interact with client in bolt is to use a listener. Before any listener gets executed, authorize will run (which means the token can get refreshed).

If they use client outside of the listener, that client will potentially have the wrong token attached. We may want to see if we can update the token on the top level app.client property (or remove this property?). Issue is the way webclient initializes, it takes in a token. We don't have a mechanism to change the token on a web client instance as far as I know.

I wouldn't be surprised if people file an issue for this. Lets see how it goes. We very well may have to do some refactoring to web-api.

@@ -506,6 +506,7 @@ export abstract class Methods extends EventEmitter<WebClientEvent> {
access: bindApiCall<OAuthAccessArguments, OauthAccessResponse>(this, 'oauth.access'),
v2: {
access: bindApiCall<OAuthV2AccessArguments, OauthV2AccessResponse>(this, 'oauth.v2.access'),
exchange: bindApiCall<OAuthV2AccessArguments, OauthV2AccessResponse>(this, 'oauth.v2.exchange'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Need to have OAuthV2ExchangeArguments for this. The arguments are not the same with oauth.v2.access. It has token. Regarding the response, it's fine to use WebAPICallResult for now.

@misscoded misscoded force-pushed the 1241-add-token-rotation-support branch from a5ecedd to 33d1c6d Compare June 3, 2021 04:06
}).catch(e => e) as OAuthV2ExchangeResponse;
});

return Promise.all(refreshPromises);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent a good amount of time trying to best solution this, but would love input.

With our current tsconfig, we don't have the ability to use Promise.allSettled (we use es2017 and would need es2020), which is what we want in this case (to wait until all Promises are settled, despite outcome of each).

Promise.all will fail fast if a rejection occurs, and I think under normal circumstances that would be handled fine with the additional .catch(e => e) that's included for each individual Promise, but in the case of a failure, the WebAPI actually throws an error before that catch can happen (I think that's what I've been observing while testing. Feel free to challenge that).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we dropped Node 10 support, perhaps, Promise.allSettled should be available in all supported Node runtimes.

const currentUTCSec = Math.floor(Date.now() / 1000); // seconds
const tokensToRefresh: string[] = this.checkTokenExpiry(authResult, currentUTCSec);

if (tokensToRefresh.length > 0) {
Copy link
Contributor Author

@misscoded misscoded Jun 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pulled out some logic (checkTokenExpiry and refreshExpiringTokens) in an attempt to clean up and separate concerns, but stopped short of going beyond this point because I couldn't figure out a clean way to do so without also mutating authResult within the confines of the helper function and not hindering readability. Happy to take someone else's attempt and incorporate it.

@@ -725,6 +859,20 @@ interface OAuthV2Response extends WebAPICallResult {
};
}

export interface OAuthV2ExchangeResponse extends WebAPICallResult {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to add this because of conflicts created in other, pre-existing parts of the code. If someone else wants to take a stab at how to resolve this in another way, again, more than happy to accept a better solution!

team: { id: string, name: string };
enterprise: { name: string, id: string } | null;
is_enterprise_install: boolean;
response_metadata: {}; // TODO
Copy link
Contributor Author

@misscoded misscoded Jun 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find what the actual shape of this property is, neither in the docs nor in conversations. If anyone happens to know, would like to update this.

@misscoded misscoded changed the base branch from main to feat-token-rotation June 3, 2021 04:29
Copy link
Member

@seratch seratch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't run the code yet but left a few comments

*
* The return value is an Array of expired or soon-to-expire access tokens.
*/
private checkTokenExpiry(authResult: AuthorizeResult, currentUTCSec: number): string[] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think check is generally a bit unclear naming. How about naming the method using detect or extract? The method's responsibility is to return expired/expiring tokens from the auth result.

Suggested change
private checkTokenExpiry(authResult: AuthorizeResult, currentUTCSec: number): string[] {
private detectExpiredOrExpiringTokens(authResult: AuthorizeResult, currentUTCSec: number): string[] {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On another note, these new private methods are stateless (=they don't access the internal state of this object and probably they should not). You may want to place them at top level. You can call them without self..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both great suggestions, I'll make the changes!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I went to actually make this change, it became apparent that the second function does rely on the instance (client, clientId and clientSecret). If there is a way to do as you've indicated that I'm not seeing, happy to hear it and adjust, but for the time being, I'll just make the name change and move the detection function to the top level.

}).catch(e => e) as OAuthV2ExchangeResponse;
});

return Promise.all(refreshPromises);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we dropped Node 10 support, perhaps, Promise.allSettled should be available in all supported Node runtimes.

@misscoded misscoded force-pushed the 1241-add-token-rotation-support branch from 33d1c6d to 4c55fbf Compare June 3, 2021 21:34
@misscoded misscoded removed the draft label Jun 4, 2021
@misscoded misscoded marked this pull request as ready for review June 4, 2021 17:36
@misscoded misscoded merged commit 2f3ae3e into slackapi:feat-token-rotation Jun 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement M-T: A feature request for new functionality pkg:oauth applies to `@slack/oauth-helper` pkg:web-api applies to `@slack/web-api` semver:minor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants