
MLOps modules #62

Merged (70 commits) on Apr 28, 2020

Conversation

@jacksandom (Contributor) commented Mar 6, 2020

  • @aaronsteers: Fix issue with unknown count of resources, driven by dynamically-calculated s3_triggers on python-lambda module. (P1)
  • Auto-build and auto-upload whl file for glue transformation dependency. (P2)
  • Use aws credential helper for ECR login.

@aaronsteers (Contributor) left a comment

I just did a quick prelim review. More to follow in later discussions but two quick items:

  1. I've called out a few places where you can use name_prefix to avoid potential naming collisions (a minimal sketch follows below).
  2. I understand the csv files are valuable in git as a training dataset, but it would be good to replace the zip (binary) files with their code equivalent if and when possible. As a general best practice, it's a good idea to avoid binary files since their diffs can't be readily evaluated by humans.

If these zips are for lambda functions, #2 will get a lot easier after refactoring onto the existing lambda-python module - since the zipping and packaging become automated during the terraform apply.
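
For context, here is a minimal sketch of the name_prefix pattern from item 1 above. All resource and variable names are hypothetical, not taken from this repo:

# Hypothetical example: Terraform appends a random suffix to name_prefix at
# create time, so two deployments of the same module won't collide on the role name.
resource "aws_iam_role" "step_functions" {
  name_prefix        = "${var.name_prefix}state-machine-"
  assume_role_policy = data.aws_iam_policy_document.assume_role.json
}

data "aws_iam_policy_document" "assume_role" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["states.amazonaws.com"]
    }
  }
}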

Resolved review threads on components/aws/step-functions/iam.tf (2, outdated)
@jacksandom (Contributor, Author)

Hi @aaronsteers,

I've added in the name prefixes. Yes, the ZIPs are the Lambda functions (not CSVs), so I hope that will become easier with the lambda module!

One question: are there any plans to add "force destroy" to the S3 mod? SageMaker training jobs will store their models in S3 buckets, which makes clean-up harder if we want to destroy said bucket.

Jack

@aaronsteers (Contributor)

I would like to avoid using force_destroy - but I have been thinking about this a bunch. The challenge is that we have to be very careful not to inadvertently delete the data lake due to a bucket rename or some other Terraform code change. I just added a feature to catalog/aws/data-lake on master: the option to provide a data_bucket_override input variable. This is useful in cases where you already have a data lake bucket and want to add the surrounding features around an existing bucket. I believe you should be able to use this BYOB (bring-your-own-bucket) feature to pass in a bucket that does have the force_destroy property set. This should open up test/development patterns while not creating a dangerous default for actual production usage.

What do you think? Do you think this would work for what you are looking for?
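
A rough sketch of how the BYOB pattern could look, assuming the data_bucket_override variable described above; the bucket name and module source path are illustrative only:

# Hypothetical BYOB example: point the module at a pre-existing bucket that
# already has force_destroy enabled (managed outside this configuration), so
# test/dev stacks can be torn down without making force_destroy a default.
data "aws_s3_bucket" "scratch" {
  bucket = "my-existing-mlops-scratch-bucket" # illustrative name
}

module "data_lake" {
  source               = "../../catalog/aws/data-lake" # illustrative path
  data_bucket_override = data.aws_s3_bucket.scratch.id

  # ...other module inputs as required...
}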

@jacksandom (Contributor, Author) commented Mar 8, 2020

@aaronsteers Completely get the rationale, and thinking about it, it probably makes sense for me not to use force_destroy on my buckets either. I tested the feature and it works fine. However, it would only work with an existing bucket, right? Is there a way to create a dependency if you wanted to create that bucket in the same Terraform apply?

jacksandom and others added 7 commits March 9, 2020 18:41
* replace zip files with py

* move lambda files to catalog

* remove extra comments

* reference arns from lambda module

* refactor vars to insulate lambda defs from s3_triggers

* get arns from lambda module

* refactor lambda function definitions

* python cleanup and auto-formatting (using black)

* fix source path

* move lambda functions to ml ops

* fix errors from refactoring

* fix missing requirements.txt

* Attempted bugfix: accessing non-existent pip[0]

* typo

* improved examples

* Update components/aws/lambda-python/outputs.tf

per suggested change

Co-Authored-By: Jack Sandom <60360603+jacksandom@users.noreply.github.com>

* updated variable name

* updated variable name

* output iam roles for ecs-task and lambda

* Lambda IAM SageMaker policy attachment

Co-authored-by: Jack Sandom <60360603+jacksandom@users.noreply.github.com>
Co-authored-by: jacksandom <jack.sandom@slalom.com>
@aaronsteers (Contributor) commented Mar 12, 2020

@jacksandom - When I merged in master, it started including documentation metadata in the CI/CD tests. Here's the output.

[screenshot: CI/CD test output]

Basically, this just means we need to add a new comment block at the top of the main.tf file. You can copy-paste the below and then customize the description as needed.

/*
* This is my short description about the module and how to use it. 
* _Markdown_ formatting *is* supported.
*
*/

Do you mind taking a stab at this?

If you are interested, you can also test the auto-documentation by running the following:

# for Windows:
choco install terraform-docs  
# for Mac:
brew install terraform-docs 

cd docs
python build.py

The above will update the two *_index.md files in the docs folder, and will also update the README.md files in each component and catalog module.

@aaronsteers aaronsteers linked an issue Mar 12, 2020 that may be closed by this pull request
@aaronsteers (Contributor)

@jacksandom - I went ahead and added the comment header in catalog/aws/ml-ops-on-aws and updated the docs using the new auto-docs feature. You can check out the new README.md files and the two *_index.md files in the docs folder.

@aaronsteers (Contributor) left a comment

@jacksandom - I am almost done with the full review, and I wanted to send this over since I'm already late sending you this feedback. I may have a few other smaller comments but I think this covers the bulk of it. Thanks!

Review threads on:
  • .gitignore (resolved)
  • catalog/aws/ml-ops-on-aws/README.md (outdated, resolved)
  • catalog/aws/ml-ops-on-aws/ecr-image.tf (outdated, resolved)
  • catalog/aws/ml-ops-on-aws/lambda-python/unique_job_name.py (outdated, resolved)
  • catalog/aws/ml-ops-on-aws/outputs.tf (outdated, resolved)
  • catalog/aws/ml-ops-on-aws/variables.tf (2 threads, outdated, resolved)
@aaronsteers (Contributor) left a comment

Round 2 of feedback. I think this is everything. Thanks!

Review threads on:
  • samples/ml-ops-on-aws/01_ml-ops.tf (5 threads, resolved; 2 outdated)
  • catalog/aws/ml-ops-on-aws/variables.tf (outdated, resolved)
@aaronsteers aaronsteers self-assigned this Apr 22, 2020
@aaronsteers (Contributor)

@jacksandom - I have good news! I believe I've resolved the resource count issue in this commit: a65ac67

The problem area was a distinct list of bucket names that was being used to drive the count of resource permission objects created for IAM policies. Instead, there is now only one IAM policy object of each type. While the resources the policy refers to are still driven by the distinct list of buckets, that count no longer affects the number of resources Terraform is creating.

I also migrated a policy JSON string into the aws_iam_policy_document data source, which improves readability and reduces the need for escaping when building dynamic strings. More info on that here if you are interested: https://www.terraform.io/docs/providers/aws/d/iam_policy_document.html
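
A minimal sketch of that pattern, with hypothetical variable and resource names: one IAM policy whose statement covers every distinct bucket, built from the aws_iam_policy_document data source rather than an escaped JSON string:

# Hypothetical example: the distinct bucket list only feeds the policy's
# resources list, so the number of Terraform-managed policy objects stays at one.
locals {
  distinct_bucket_arns = distinct([
    for name in var.bucket_names : "arn:aws:s3:::${name}"
  ])
}

data "aws_iam_policy_document" "s3_access" {
  statement {
    actions = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
    resources = concat(
      local.distinct_bucket_arns,
      [for arn in local.distinct_bucket_arns : "${arn}/*"],
    )
  }
}

resource "aws_iam_policy" "s3_access" {
  name_prefix = "${var.name_prefix}s3-access-"
  policy      = data.aws_iam_policy_document.s3_access.json
}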

@aaronsteers (Contributor)

@jacksandom - I sent you a direct message regarding the ECR login error. I would like to use the AWS Credential Helper if possible - here: https://github.com/awslabs/amazon-ecr-credential-helper

I remember getting this working before, but I don't see where (or if) I logged any documentation on that here in this repo. The desired behavior would be that we could use the credential helper instead of having to run aws ecr get-login.

@jacksandom (Contributor, Author)

@aaronsteers - thanks for reviewing - all my changes are in. Two main things left:

  • Zip up Glue function
  • AWS credential helper for ECR login

@aaronsteers (Contributor)

Regarding Glue jobs:

  • This might be better handled as a follow-up Glue PR after this one is merged.

Complexities:

  • The zip method puts the custom code in the same zip as the packaged dependencies.
  • The whl method keeps dependencies separate from the custom source code (see the sketch below).
  • At runtime, I think the platform/architecture needs to match the AWS (Linux) environment.
    • It's unclear whether building this on Windows and then uploading it to run under AWS (Linux) will work.
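
A rough sketch of what the whl route could look like inside the terraform apply. Every path, name, and the artifacts bucket variable here is hypothetical, and this assumes the dependency is pure Python so the Windows-vs-Linux build question above is moot:

# Hypothetical sketch: build the Glue transformation dependency wheel during
# terraform apply and upload it to S3, where the Glue job can pick it up as an
# additional Python library. Assumes a local package at ./glue/transforms with
# a setup.py declaring name="glue_transforms", version="0.1.0", so the wheel
# filename is predictable.
resource "null_resource" "glue_wheel" {
  triggers = {
    setup_hash = filemd5("${path.module}/glue/transforms/setup.py")
  }

  provisioner "local-exec" {
    command = "pip wheel --no-deps -w ${path.module}/glue/dist ${path.module}/glue/transforms"
  }
}

resource "aws_s3_bucket_object" "glue_wheel" {
  bucket     = var.artifacts_bucket # hypothetical input variable
  key        = "glue/glue_transforms-0.1.0-py3-none-any.whl"
  source     = "${path.module}/glue/dist/glue_transforms-0.1.0-py3-none-any.whl"
  depends_on = [null_resource.glue_wheel]
}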

Successfully merging this pull request may close these issues.

Feature Request: MLOps Solution in Infra Catalog