
MLOps modules #62

Merged (70 commits) on Apr 28, 2020

Conversation

@jacksandom (Contributor) commented Mar 6, 2020

  • @aaronsteers: Fix issue with unknown count of resources, driven by dynamically-calculated s3_triggers on python-lambda module. (P1)
  • Auto-build and auto-upload whl file for glue transformation dependency. (P2)
  • Use aws credential helper for ECR login.

@aaronsteers (Contributor) left a comment

I just did a quick prelim review. More to follow in later discussions but two quick items:

  1. I've called out a few places where you can use name_prefix to avoid potential naming collisions (a minimal sketch follows below).
  2. I understand the csv files are valuable in git as a training dataset, but it would be good to replace the zip (binary) files with their code equivalent if and when possible. As a general best practice, it's a good idea to avoid binary files since their diffs can't be readily evaluated by humans.

If these zips are for lambda functions, #2 will get a lot easier after refactoring onto the existing lambda-python module - since the zipping and packaging become automated during the terraform apply.
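
For context, here is a minimal sketch of the name_prefix pattern from item 1 above. All resource and variable names are hypothetical, not taken from this repo:

# Hypothetical example: Terraform appends a random suffix to name_prefix at
# create time, so two deployments of the same module won't collide on the role name.
resource "aws_iam_role" "step_functions" {
  name_prefix        = "${var.name_prefix}state-machine-"
  assume_role_policy = data.aws_iam_policy_document.assume_role.json
}

data "aws_iam_policy_document" "assume_role" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["states.amazonaws.com"]
    }
  }
}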

Resolved review threads on components/aws/step-functions/iam.tf (2, outdated)
@jacksandom (Contributor, Author)

Hi @aaronsteers,

I've added in the name prefixes. Yes, the ZIPs are the Lambda functions (not CSVs), so I hope that will become easier with the lambda module!

One question: are there any plans to add "force destroy" to the S3 mod? SageMaker training jobs will store their models in S3 buckets, which makes clean-up harder if we want to destroy said bucket.

Jack

@aaronsteers (Contributor)

I would like to avoid using force_destroy - but I have been thinking about this a bunch. The challenge is that we have to be very careful not to inadvertently delete the data lake due to a bucket rename or some other Terraform code change. I just added a feature to catalog/aws/data-lake on master: the option to provide a data_bucket_override input variable. This is useful in cases where you already have a data lake bucket and want to add the surrounding features around an existing bucket. I believe you should be able to use this BYOB (bring-your-own-bucket) feature to pass in a bucket that does have the force_destroy property set. This should open up test/development patterns while not creating a dangerous default for actual production usage.

What do you think? Do you think this would work for what you are looking for?
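
A rough sketch of how the BYOB pattern could look, assuming the data_bucket_override variable described above; the bucket name and module source path are illustrative only:

# Hypothetical BYOB example: point the module at a pre-existing bucket that
# already has force_destroy enabled (managed outside this configuration), so
# test/dev stacks can be torn down without making force_destroy a default.
data "aws_s3_bucket" "scratch" {
  bucket = "my-existing-mlops-scratch-bucket" # illustrative name
}

module "data_lake" {
  source               = "../../catalog/aws/data-lake" # illustrative path
  data_bucket_override = data.aws_s3_bucket.scratch.id

  # ...other module inputs as required...
}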

@jacksandom (Contributor, Author) commented Mar 8, 2020

@aaronsteers Completely get the rationale, and thinking about it, it probably makes sense for me not to use force_destroy on my buckets either. I tested the feature and it works fine. However, it would only work with an existing bucket, right? Is there a way to create a dependency if you wanted to create that bucket in the same Terraform apply?

jacksandom and others added 7 commits March 9, 2020 18:41
* replace zip files with py

* move lambda files to catalog

* remove extra comments

* reference arns from lambda module

* refactor vars to insulate lambda defs from s3_triggers

* get arns from lambda module

* refactor lambda function definitions

* python cleanup and auto-formatting (using black)

* fix source path

* move lambda functions to ml ops

* fix errors from refactoring

* fix missing requirements.txt

* Attempted bugfix: accessing non-existent pip[0]

* typo

* improved examples

* Update components/aws/lambda-python/outputs.tf

per suggested change

Co-Authored-By: Jack Sandom <60360603+jacksandom@users.noreply.github.com>

* updated variable name

* updated variable name

* output iam roles for ecs-task and lambda

* Lambda IAM SageMaker policy attachment

Co-authored-by: Jack Sandom <60360603+jacksandom@users.noreply.github.com>
Co-authored-by: jacksandom <jack.sandom@slalom.com>
@aaronsteers (Contributor) commented Mar 12, 2020

@jacksandom - When I merged in master, it started including documentation metadata in the CI/CD tests. Here's the output.

[screenshot: CI/CD test output]

Basically, this just means we need to add a new comment block at the top of the main.tf file. You can copy-paste the below and then customize the description as needed.

/*
* This is my short description about the module and how to use it. 
* _Markdown_ formatting *is* supported.
*
*/

Do you mind taking a stab at this?

If you are interested, you can also test the auto-documentation by running the following:

# for Windows:
choco install terraform-docs  
# for Mac:
brew install terraform-docs 

cd docs
python build.py

The above will update the two *_index.md files in the docs folder, and will also update the README.md files in each component and catalog module.

@aaronsteers aaronsteers linked an issue Mar 12, 2020 that may be closed by this pull request
@aaronsteers (Contributor)

@jacksandom - I went ahead and added the comment header in catalog/aws/ml-ops-on-aws and updated the docs using the new auto-docs feature. You can check out the new README.md files and the two *_index.md files in the docs folder.

@aaronsteers (Contributor) left a comment

@jacksandom - I am almost done with the full review, and I wanted to send this over since I'm already late sending you this feedback. I may have a few other smaller comments but I think this covers the bulk of it. Thanks!

Review threads on:
  • .gitignore (resolved)
  • catalog/aws/ml-ops-on-aws/README.md (outdated, resolved)
  • catalog/aws/ml-ops-on-aws/ecr-image.tf (outdated, resolved)
  • catalog/aws/ml-ops-on-aws/lambda-python/unique_job_name.py (outdated, resolved)
  • catalog/aws/ml-ops-on-aws/outputs.tf (outdated, resolved)
  • catalog/aws/ml-ops-on-aws/variables.tf (2 threads, outdated, resolved)
@aaronsteers (Contributor) left a comment

Round 2 of feedback. I think this is everything. Thanks!

Review threads on:
  • samples/ml-ops-on-aws/01_ml-ops.tf (5 threads, resolved; 2 outdated)
  • catalog/aws/ml-ops-on-aws/variables.tf (outdated, resolved)
@aaronsteers aaronsteers self-assigned this Apr 22, 2020
@aaronsteers (Contributor)

@jacksandom - I have good news! I believe I've resolved the resource count issue in this commit: a65ac67

The problem area was a distinct list of bucket names that was being used to drive the count of resource permission objects created for IAM policies. Instead, there is now only one IAM policy object of each type. While the resources the policy refers to are still driven by the distinct list of buckets, that count no longer affects the number of resources Terraform is creating.

I also migrated a policy JSON string into the aws_iam_policy_document data source, which improves readability and reduces the need for escaping when building dynamic strings. More info on that here if you are interested: https://www.terraform.io/docs/providers/aws/d/iam_policy_document.html
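
A minimal sketch of that pattern, with hypothetical variable and resource names: one IAM policy whose statement covers every distinct bucket, built from the aws_iam_policy_document data source rather than an escaped JSON string:

# Hypothetical example: the distinct bucket list only feeds the policy's
# resources list, so the number of Terraform-managed policy objects stays at one.
locals {
  distinct_bucket_arns = distinct([
    for name in var.bucket_names : "arn:aws:s3:::${name}"
  ])
}

data "aws_iam_policy_document" "s3_access" {
  statement {
    actions = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
    resources = concat(
      local.distinct_bucket_arns,
      [for arn in local.distinct_bucket_arns : "${arn}/*"],
    )
  }
}

resource "aws_iam_policy" "s3_access" {
  name_prefix = "${var.name_prefix}s3-access-"
  policy      = data.aws_iam_policy_document.s3_access.json
}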

@aaronsteers (Contributor)

@jacksandom - I sent you a direct message regarding the ECR login error. I would like to use the AWS Credential Helper if possible - here: https://github.com/awslabs/amazon-ecr-credential-helper

I remember getting this working before, but I don't see where (or if) I logged any documentation on that here in this repo. The desired behavior would be that we could use the credential helper instead of having to run aws ecr get-login.

@jacksandom (Contributor, Author)

@aaronsteers - thanks for reviewing - all my changes are in. Two main things left:

  • Zip up Glue function
  • AWS credential helper for ECR login

@aaronsteers (Contributor)

Regarding Glue jobs:

  • This might be better handled as a follow-up Glue PR after this one is merged.

Complexities:

  • The zip method puts the custom code in the same zip as the packaged dependencies.
  • The whl method keeps dependencies separate from the custom source code (see the sketch below).
  • At runtime, I think the platform/architecture needs to match the AWS (Linux) environment.
    • It's unclear whether building this on Windows and then uploading it to run under AWS (Linux) will work.
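
A rough sketch of what the whl route could look like inside the terraform apply. Every path, name, and the artifacts bucket variable here is hypothetical, and this assumes the dependency is pure Python so the Windows-vs-Linux build question above is moot:

# Hypothetical sketch: build the Glue transformation dependency wheel during
# terraform apply and upload it to S3, where the Glue job can pick it up as an
# additional Python library. Assumes a local package at ./glue/transforms with
# a setup.py declaring name="glue_transforms", version="0.1.0", so the wheel
# filename is predictable.
resource "null_resource" "glue_wheel" {
  triggers = {
    setup_hash = filemd5("${path.module}/glue/transforms/setup.py")
  }

  provisioner "local-exec" {
    command = "pip wheel --no-deps -w ${path.module}/glue/dist ${path.module}/glue/transforms"
  }
}

resource "aws_s3_bucket_object" "glue_wheel" {
  bucket     = var.artifacts_bucket # hypothetical input variable
  key        = "glue/glue_transforms-0.1.0-py3-none-any.whl"
  source     = "${path.module}/glue/dist/glue_transforms-0.1.0-py3-none-any.whl"
  depends_on = [null_resource.glue_wheel]
}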

Successfully merging this pull request may close these issues.

Feature Request: MLOps Solution in Infra Catalog