Spinning up several clusters at the same time creates multiple SystemsManager entries #14

Open
fedianine-statpro opened this issue Feb 25, 2020 · 9 comments
Labels: bug, module/sys-mgr-ext, p1, queued

Comments

@fedianine-statpro

I found that if you spin up a multi-node cluster that uses the AWS Data Protection Provider and all nodes are starting at the exact same time, then you might encounter an issue where more than one "/MyApplication/DataProtection" entry is created in AWS Systems Manager Parameter Store. As a result, these clusters end up not sharing the AWS Systems Manager entry which causes problems.

The only solution is to get a single node spun up first, and only when it's fully loaded, to spin up other nodes, which will use the same Systems Manager entry as the first node.

@klaytaybai

Thanks for reporting this.

@harnocz

harnocz commented May 5, 2020

Is there a timeline or a recommended workaround for this? We've run into the same issue.

@ashishdhingra added the bug, module/sys-mgr-ext, and needs-triage labels Aug 12, 2020
@StummeJ

StummeJ commented Aug 14, 2020

Once the key is generated, can you remove the older ones and restart the cluster?

@benoram

benoram commented Sep 28, 2020

This particular problem makes the data protection provider difficult to use in AWS Lambda

@ashishdhingra added the B label and removed the needs-triage label Nov 5, 2020
@raRaRa

raRaRa commented Dec 9, 2020

What would be a good way to solve this issue?

I previously used the package AspNetCore.DataProtection.Aws which writes it to S3, perhaps there's something there that could help: https://github.com/hotchkj/AspNetCore.DataProtection.Aws/blob/master/src/AspNetCore.DataProtection.Aws.S3/S3XmlRepository.cs

If multiple instances are simultaneously reading and writing to the Parameter Store, then one way to solve this would be to write a temp value to a temp parameter to indicate that the work is in progress. If other instances read that temp value and detect that it's in progress, then they should wait 50-100ms and retry reading the data protection value.

Case 1:

  1. Check if the temp parameter exists
  2. If the temp parameter doesn't exist then write a GUID value to it.
  3. Read the temp parameter value again and check if the GUID is still the same.
  4. If the GUID is still the same, then write to the data protection value.
  5. If it isn't the same GUID, wait 50-100ms and load the data protection value.

Case 2:

  1. Check if the temp parameter exists
  2. If the temp parameter exists, wait 50-100ms and load the data protection value. Retry if it doesn't exist.

The idea of using GUID is to make sure that multiple instances haven't written to the temp parameter, e.g. if multiple instances read the temp parameter at the same time and write to it. Then only one instance will be in charge of writing the data protection value, whoever was last to write its GUID.

I would personally use DynamoDB for this, but I wouldn't want to make it a dependency of this package.

I'm pretty sure there's an easier way to solve this, but this comes to mind.
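
The two cases above could be sketched roughly as follows. This is a minimal illustration, not code from the package: `IParameterStore`, `InMemoryParameterStore`, `KeyBootstrap`, and the parameter names are hypothetical, and the in-memory store only stands in for Parameter Store so the sketch runs without AWS. Note that the re-read in step 3 narrows the race window but does not close it, because a second instance can still overwrite the GUID after the first instance has re-read it.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical abstraction over Parameter Store; not an AWS SDK type.
public interface IParameterStore
{
    string? Get(string name);            // returns null when the parameter does not exist
    void Put(string name, string value); // creates or overwrites the parameter
}

// In-memory stand-in so the sketch runs without AWS.
public sealed class InMemoryParameterStore : IParameterStore
{
    private readonly ConcurrentDictionary<string, string> _values = new();
    public string? Get(string name) => _values.TryGetValue(name, out var v) ? v : null;
    public void Put(string name, string value) => _values[name] = value;
}

public static class KeyBootstrap
{
    public static string GetOrCreate(
        IParameterStore store, string lockName, string valueName,
        Func<string> createValue, int maxAttempts = 50, int delayMs = 100)
    {
        // Case 1, steps 1-2: no temp parameter yet, so claim it with our GUID.
        if (store.Get(lockName) is null)
        {
            var myId = Guid.NewGuid().ToString();
            store.Put(lockName, myId);

            // Steps 3-4: if our GUID survived, we write the data protection value.
            if (store.Get(lockName) == myId)
            {
                var value = createValue();
                store.Put(valueName, value);
                return value;
            }
        }

        // Step 5 and case 2: another instance is writing, so wait and retry the read.
        for (var attempt = 0; attempt < maxAttempts; attempt++)
        {
            var existing = store.Get(valueName);
            if (existing is not null) return existing;
            Thread.Sleep(delayMs);
        }
        throw new TimeoutException("The data protection value was never written.");
    }
}
```

With a single store, the first caller creates the value and every later caller reads it back instead of creating its own.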

@schmitty1970

Is there an ETA for this issue or a workaround?

@ghost

ghost commented Feb 2, 2022

@raRaRa, there is no easier way to solve this. It is a distributed-locking problem, and the scenario you have described is still not bulletproof. In Azure, you can get a distributed lock from a Blob lease, but that would introduce a dependency on another service. In my personal opinion, the best option is for developers to implement a custom data protector themselves. After all, one would have to implement only two methods:

  byte[] Protect(byte[] plaintext);
  byte[] Unprotect(byte[] protectedData);

That could be done with AES-128 or AES-256 behind the scenes, and the crypto key could be stored in AWS Secrets Manager. Creating an AesManaged instance on every call carries a performance penalty, so you would probably want a way to reuse cached instances; multiple instances, because web applications are multithreaded.
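
A rough sketch of such a Protect/Unprotect pair is shown below, under several assumptions. It uses `AesGcm` (authenticated encryption) instead of the `AesManaged` class mentioned above, and it assumes .NET 8 or later for the `AesGcm(key, tagSize)` constructor. The class name `AesGcmProtector` and the payload layout are hypothetical, and fetching the key from AWS Secrets Manager is left out. Note also that the real `IDataProtector` interface inherits `CreateProtector(string purpose)` from `IDataProtectionProvider`, so a full implementation needs that method too.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical helper; not part of ASP.NET Core or any AWS package.
public sealed class AesGcmProtector
{
    private const int NonceSize = 12; // bytes, the standard AES-GCM nonce size
    private const int TagSize = 16;   // bytes, the full-length authentication tag
    private readonly byte[] _key;     // 16 bytes for AES-128, 32 bytes for AES-256

    public AesGcmProtector(byte[] key) => _key = (byte[])key.Clone();

    public byte[] Protect(byte[] plaintext)
    {
        // Assumed payload layout: nonce || tag || ciphertext.
        var output = new byte[NonceSize + TagSize + plaintext.Length];
        var nonce = output.AsSpan(0, NonceSize);
        var tag = output.AsSpan(NonceSize, TagSize);
        var ciphertext = output.AsSpan(NonceSize + TagSize);

        RandomNumberGenerator.Fill(nonce); // fresh random nonce for every message
        using var aes = new AesGcm(_key, TagSize);
        aes.Encrypt(nonce, plaintext, ciphertext, tag);
        return output;
    }

    public byte[] Unprotect(byte[] protectedData)
    {
        if (protectedData.Length < NonceSize + TagSize)
            throw new CryptographicException("Payload is too short.");

        var nonce = protectedData.AsSpan(0, NonceSize);
        var tag = protectedData.AsSpan(NonceSize, TagSize);
        var ciphertext = protectedData.AsSpan(NonceSize + TagSize);

        var plaintext = new byte[ciphertext.Length];
        using var aes = new AesGcm(_key, TagSize);
        aes.Decrypt(nonce, ciphertext, tag, plaintext); // throws if the data was tampered with
        return plaintext;
    }
}
```

This sketch creates a new `AesGcm` instance per call for simplicity, which sidesteps the thread-safety question at the cost of the allocation overhead discussed above.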

@damianh

damianh commented Mar 31, 2022

We are also using Lambda and multiple nodes (Fargate), and we've taken the following approach...

ASP.NET Core Data Protection is designed to initialize a key on first use if none exists, and to rotate keys as they approach their expiration date. I guess it "just works" for most of the scenarios envisaged by the ASP.NET Core team.

We've decided to not allow our services that handle application HTTP requests to generate/rotate keys automatically via DisableAutomaticKeyGeneration:

services
  .AddDataProtection()
  .SetApplicationName(appName)
  .PersistKeysToAWSSystemsManager("/MyApplication/DataProtection")
  .DisableAutomaticKeyGeneration(); // Key generation is handled separately.

... and we set the IAM policy so that these services are NOT allowed to write to SSM Parameter Store.

And instead we have this separate code:

var services = new ServiceCollection();
services.AddAWSService<IAmazonSimpleSystemsManagement>(awsOptions);
services.AddLogging(...);
services
  .AddDataProtection()
  .SetApplicationName(appName)
  .PersistKeysToAWSSystemsManager("/MyApplication/DataProtection");
var serviceProvider = services.BuildServiceProvider(); // Build the container before resolving services
var dataProtectionProvider = serviceProvider.GetRequiredService<IDataProtectionProvider>();
dataProtectionProvider.CreateProtector("doesntmatter"); // Initializes a key if none exists, will rotate if approaching expiration date

There are a couple of ways of running this code to do the key management activity:

  1. Put it in a Lambda function and have a CloudWatch schedule invoke it daily.
  2. Run it on a schedule on your CD platform of choice.

I would argue this is a better and simpler approach than dealing with race conditions and distributed locking.

@ashishdhingra added the p2 and queued labels and removed the B label Nov 2, 2022
@gouriu

gouriu commented Jun 17, 2024

I am using @damianh's approach above, but it is not creating a key when none exists. Any ideas why?
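
One possible cause, offered as an assumption rather than a confirmed diagnosis: in ASP.NET Core's key-ring-based implementation, `CreateProtector` only constructs a protector object, and the key ring is resolved lazily on the first `Protect` or `Unprotect` call. If that is what is happening here, the key management job needs to call `Protect` once. The sketch below mirrors the code above; `appName` and `awsOptions` are placeholders, the purpose string and payload are arbitrary, and the job's IAM role is assumed to allow writes to the parameter path.

```csharp
using Amazon.Extensions.NETCore.Setup;
using Amazon.SimpleSystemsManagement;
using Microsoft.AspNetCore.DataProtection;
using Microsoft.Extensions.DependencyInjection;

var appName = "MyApplication";     // placeholder; must match the name used by the web services
var awsOptions = new AWSOptions(); // placeholder; falls back to default credentials and region

var services = new ServiceCollection();
services.AddLogging();
services.AddAWSService<IAmazonSimpleSystemsManagement>(awsOptions);
services
    .AddDataProtection()
    .SetApplicationName(appName)
    .PersistKeysToAWSSystemsManager("/MyApplication/DataProtection");

using var serviceProvider = services.BuildServiceProvider();
var protector = serviceProvider
    .GetRequiredService<IDataProtectionProvider>()
    .CreateProtector("key-bootstrap"); // arbitrary purpose string

// Protect forces the key ring to load, which is the point where a missing
// key would be generated and persisted (assumed behavior, see the note above).
_ = protector.Protect("warm-up");
```

If a key still does not appear, the logging output and the job's IAM permissions are the next things to check.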

@bhoradc added the p1 label and removed the p2 label Sep 13, 2024