Update comprehend detection filters #1090

Merged


@swhite-oreilly swhite-oreilly commented Aug 25, 2023

This PR includes three new modules and four updated modules for AWS Comprehend.

Adding Completed to the job statuses that are ignored during cleanup, for all Comprehend job types.

Adding support for the following job types: ComprehendEventsDetectionJob, ComprehendPiiEntititesDetectionJob, and ComprehendTargetedSentimentDetectionJob
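Since these handlers work by stopping jobs, a job that has already finished has to be recognized and skipped rather than stopped. The actual filters live in the Go modules; the bash sketch below (with a hypothetical $JOB_ID, using the events job type as the example) just illustrates the skip logic:

# Illustration only: a rough mirror of the status filter, not part of the PR.
JOB_STATUS=$(aws comprehend describe-events-detection-job \
  --job-id "$JOB_ID" \
  --query "EventsDetectionJobProperties.JobStatus" --output text)
case "$JOB_STATUS" in
  STOPPED|COMPLETED) echo "Skipping $JOB_ID: job already finished" ;;
  *) aws comprehend stop-events-detection-job --job-id "$JOB_ID" ;;
esac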

Testing

Comprehend resources were created using the setup code below, and then AWS Nuke was used to clean these resources up, targeting the following resource types (a sample config sketch follows the list):
"ComprehendDominantLanguageDetectionJob"
"ComprehendEntitiesDetectionJob"
"ComprehendEventsDetectionJob"
"ComprehendKeyPhrasesDetectionJob"
"ComprehendPiiEntititesDetectionJob"
"ComprehendSentimentDetectionJob"
"ComprehendTargetedSentimentDetectionJob"

Resources in the Completed state no longer cause Nuke to error. These jobs aren't actually deleted, as they are just metadata pointing at result files in S3. The original goal of these functions was to stop in-progress jobs, but they missed filtering out jobs in the Completed status, so Nuke errored against these resource types because the jobs were in the wrong state to be stopped.
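
One way to verify the new behavior after a nuke run is to list jobs by status: completed jobs should still be present, while previously in-progress jobs should show as stopped. For example (sentiment jobs shown; the other job types have analogous list commands):

aws comprehend list-sentiment-detection-jobs \
  --filter JobStatus=COMPLETED \
  --query "SentimentDetectionJobPropertiesList[].JobId" --output text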

Setup code

#!/bin/bash

# Generate a random string to use as a bucket name and classifier name suffix
RANDOM_STRING=$(openssl rand -hex 20)
# Generate a random string for shorter names
SHORT_RANDOM_STRING=$(openssl rand -hex 10)

# Set your preferred bucket names
INPUT_BUCKET="input-bucket-$RANDOM_STRING"
OUTPUT_BUCKET="output-bucket-$RANDOM_STRING"

# Get AWS account ID
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
echo "AWS Account ID: $AWS_ACCOUNT_ID"

# Create input bucket
aws s3api create-bucket --bucket $INPUT_BUCKET
echo "Input bucket created: s3://$INPUT_BUCKET"

# Create output bucket
aws s3api create-bucket --bucket $OUTPUT_BUCKET
echo "Output bucket created: s3://$OUTPUT_BUCKET"

# Create IAM Role for Comprehend access
ROLE_NAME="comprehend-access-role-$RANDOM_STRING"
aws iam create-role \
  --role-name $ROLE_NAME \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": "comprehend.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }
    ]
  }'

# Attach AmazonS3FullAccess managed policy to IAM role (for demo purposes)
aws iam attach-role-policy \
  --role-name $ROLE_NAME \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

echo "AmazonS3FullAccess policy attached to IAM role"

# Create a sample input file
echo "This is an example text for analysis." > sample-input.txt

# Upload the input file to the input bucket
aws s3 cp sample-input.txt s3://$INPUT_BUCKET/

# Create a folder in the input bucket for entity recognizer data
aws s3api put-object --bucket $INPUT_BUCKET --key documents/
aws s3api put-object --bucket $INPUT_BUCKET --key annotations/

# Upload entity recognizer data to the input bucket
echo "This is a person's name." > input-doc.txt

# Create annotation files for three samples
echo "File,Line,Begin Offset,End Offset,Type" > input-ann-1.csv
echo "input-doc.txt,0,0,18,PERSON" >> input-ann-1.csv

echo "File,Line,Begin Offset,End Offset,Type" > input-ann-2.csv
echo "input-doc.txt,0,0,5,PERSON" >> input-ann-2.csv

echo "File,Line,Begin Offset,End Offset,Type" > input-ann-3.csv
echo "input-doc.txt,0,6,18,PERSON" >> input-ann-3.csv

aws s3 cp input-doc.txt s3://$INPUT_BUCKET/documents/
aws s3 cp input-ann-1.csv s3://$INPUT_BUCKET/annotations/
aws s3 cp input-ann-2.csv s3://$INPUT_BUCKET/annotations/
aws s3 cp input-ann-3.csv s3://$INPUT_BUCKET/annotations/

# Generate a larger corpus file for the entity recognizer
LARGER_CORPUS_FILE="larger-entity-corpus.txt"
for i in {1..500}; do
  echo "This is a larger entity corpus." >> $LARGER_CORPUS_FILE
done

# Upload the larger corpus file to the input bucket
aws s3 cp $LARGER_CORPUS_FILE s3://$INPUT_BUCKET/documents/

# Create a folder in the output bucket
aws s3api put-object --bucket $OUTPUT_BUCKET --key output-folder/

# Create a bucket policy to grant read and write access to the output folder
BUCKET_POLICY='{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowComprehendAccess",
      "Effect": "Allow",
      "Principal": {
        "Service": "comprehend.amazonaws.com"
      },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::'$OUTPUT_BUCKET'/output-folder/*"
    }
  ]
}'
aws s3api put-bucket-policy --bucket $OUTPUT_BUCKET --policy "$BUCKET_POLICY"
echo "Bucket policy updated to grant Comprehend access to write to output folder"

# Create a CSV file for document classifier training data
echo "label,text" > training-data.csv
for i in {1..20}; do
  echo "category,$(cat sample-input.txt)" >> training-data.csv
done

# Upload the training data to the input bucket
aws s3 cp training-data.csv s3://$INPUT_BUCKET/

# Create an entity recognizer with a unique name
ENTITY_RECOGNIZER_NAME="test-$RANDOM_STRING"
aws comprehend create-entity-recognizer \
  --language-code en \
  --recognizer-name $ENTITY_RECOGNIZER_NAME \
  --data-access-role-arn "arn:aws:iam::$AWS_ACCOUNT_ID:role/$ROLE_NAME" \
  --input-data-config "EntityTypes=[{Type=PERSON}],Documents={S3Uri=s3://$INPUT_BUCKET/documents/},Annotations={S3Uri=s3://$INPUT_BUCKET/annotations/}" \
  --region us-east-1

echo "Entity recognizer $ENTITY_RECOGNIZER_NAME created"

# Create the taskConfig.json file
TASK_CONFIG_FILE="taskConfig.json"
echo '{
    "LanguageCode": "en",
    "DocumentClassificationConfig": {
        "Mode": "MULTI_LABEL",
        "Labels": ["optimism", "anger"]
    }
}' > $TASK_CONFIG_FILE

# Create a flywheel for the document classifier model type
# NOTE: this assumes an IAM role named testFlywheelDataAccess already exists
# with the data-lake permissions Comprehend flywheels require; it is not
# created by this script.
FLYWHEEL_ROLE_ARN="arn:aws:iam::$AWS_ACCOUNT_ID:role/testFlywheelDataAccess"
FLYWHEEL_NAME="myFlywheel-$RANDOM_STRING"
aws comprehend create-flywheel \
  --flywheel-name $FLYWHEEL_NAME \
  --data-access-role-arn $FLYWHEEL_ROLE_ARN \
  --model-type "DOCUMENT_CLASSIFIER" \
  --data-lake-s3-uri "s3://$INPUT_BUCKET/documents" \
  --task-config file://$TASK_CONFIG_FILE

echo "Flywheel $FLYWHEEL_NAME created"

# Start Sentiment Detection Job
echo "Starting sentiment detection job for sample-input.txt"
aws comprehend start-sentiment-detection-job \
  --input-data-config S3Uri=s3://$INPUT_BUCKET/sample-input.txt \
  --output-data-config S3Uri=s3://$OUTPUT_BUCKET/output-folder/ \
  --data-access-role-arn arn:aws:iam::$AWS_ACCOUNT_ID:role/$ROLE_NAME \
  --job-name "sentiment-analysis-$RANDOM_STRING" \
  --language-code en


# Start Entities Detection Job
echo "Starting entities detection job for sample-input.txt"
aws comprehend start-entities-detection-job \
  --input-data-config S3Uri=s3://$INPUT_BUCKET/sample-input.txt \
  --output-data-config S3Uri=s3://$OUTPUT_BUCKET/output-folder/ \
  --data-access-role-arn arn:aws:iam::$AWS_ACCOUNT_ID:role/$ROLE_NAME \
  --job-name "entities-analysis-$RANDOM_STRING" \
  --language-code en

# Start Dominant Language Detection Job (no --language-code here: this job's
# purpose is to detect the language)
echo "Starting dominant language detection job for sample-input.txt"
aws comprehend start-dominant-language-detection-job \
  --input-data-config S3Uri=s3://$INPUT_BUCKET/sample-input.txt \
  --output-data-config S3Uri=s3://$OUTPUT_BUCKET/output-folder/ \
  --data-access-role-arn arn:aws:iam::$AWS_ACCOUNT_ID:role/$ROLE_NAME \
  --job-name "dominant-language-analysis-$RANDOM_STRING"


# Start Key Phrases Detection Job
echo "Starting key phrases detection job for sample-input.txt"
aws comprehend start-key-phrases-detection-job \
  --input-data-config S3Uri=s3://$INPUT_BUCKET/sample-input.txt \
  --output-data-config S3Uri=s3://$OUTPUT_BUCKET/output-folder/ \
  --data-access-role-arn arn:aws:iam::$AWS_ACCOUNT_ID:role/$ROLE_NAME \
  --job-name "key-phrases-analysis-$RANDOM_STRING" \
  --language-code en

# Create and Upload Sample Text Files for Events Detection Job
echo "Creating and uploading sample text files for events detection job"
echo "\"Company AnyCompany grew by increasing sales and through acquisitions. After purchasing competing firms in 2020, AnyBusiness, a part of the AnyBusinessGroup, gave Jane Does firm a going rate of one cent a gallon or forty-two cents a barrel.\"" > SampleText1.txt
echo "\"In 2021, AnyCompany officially purchased AnyBusiness for 100 billion dollars, surprising and exciting the shareholders.\"" > SampleText2.txt
echo "\"In 2022, AnyCompany stock crashed 50. Eventually later that year they filed for bankruptcy.\"" > SampleText3.txt

aws s3 cp SampleText1.txt s3://$INPUT_BUCKET/EventsData/
aws s3 cp SampleText2.txt s3://$INPUT_BUCKET/EventsData/
aws s3 cp SampleText3.txt s3://$INPUT_BUCKET/EventsData/

# Start Events Detection Job
echo "Starting events detection job"
aws comprehend start-events-detection-job \
  --job-name "events-detection-$SHORT_RANDOM_STRING" \
  --input-data-config "S3Uri=s3://$INPUT_BUCKET/EventsData" \
  --output-data-config "S3Uri=s3://$OUTPUT_BUCKET/output-folder/" \
  --data-access-role-arn arn:aws:iam::$AWS_ACCOUNT_ID:role/$ROLE_NAME \
  --language-code en \
  --target-event-types "BANKRUPTCY" "EMPLOYMENT" "CORPORATE_ACQUISITION" "CORPORATE_MERGER" "INVESTMENT_GENERAL"

echo "Events detection job started"

# Create and Upload Sample Text Files for PII Entities Detection Job
echo "Creating and uploading sample text files for PII entities detection job"
echo "\"Hello Zhang Wei, I am John. Your AnyCompany Financial Services, LLC credit card account 1111-XXXX-1111-XXXX has a minimum payment of $24.53 that is due by July 31st.\"" > SampleText1.txt
echo "\"Dear Max, based on your autopay settings for your account Internet.org account, we will withdraw your payment on the due date from your bank account number XXXXXX1111 with the routing number XXXXX0000. \"" > SampleText2.txt
echo "\"Jane, please submit any customer feedback from this weekend to Sunshine Spa, 123 Main St, Anywhere and send comments to Alice at AnySpa@example.com.\"" > SampleText3.txt

aws s3 cp SampleText1.txt s3://$INPUT_BUCKET/
aws s3 cp SampleText2.txt s3://$INPUT_BUCKET/
aws s3 cp SampleText3.txt s3://$INPUT_BUCKET/

# Start PII Entities Detection Job
echo "Starting PII entities detection job"
aws comprehend start-pii-entities-detection-job \
  --job-name "pii-detecdtion-$SHORT_RANDOM_STRING" \
  --input-data-config "S3Uri=s3://$INPUT_BUCKET/" \
  --output-data-config "S3Uri=s3://$OUTPUT_BUCKET/testfolder/" \
  --data-access-role-arn arn:aws:iam::$AWS_ACCOUNT_ID:role/$ROLE_NAME \
  --language-code en \
  --mode ONLY_OFFSETS

# Create and Upload Sample Movie Review Text Files for Targeted Sentiment Detection Job
echo "Creating and uploading sample movie review text files for targeted sentiment detection job"
echo "\"The film, AnyMovie, is fairly predictable and just okay.\"" > SampleMovieReview1.txt
echo "\"AnyMovie is the essential sci-fi film that I grew up watching when I was a kid. I highly recommend this movie.\"" > SampleMovieReview2.txt
echo "\"Don't get fooled by the 'awards' for AnyMovie. All parts of the film were poorly stolen from other modern directors.\"" > SampleMovieReview3.txt

aws s3 cp SampleMovieReview1.txt s3://$INPUT_BUCKET/MovieData/
aws s3 cp SampleMovieReview2.txt s3://$INPUT_BUCKET/MovieData/
aws s3 cp SampleMovieReview3.txt s3://$INPUT_BUCKET/MovieData/

# Start Targeted Sentiment Detection Job
echo "Starting targeted sentiment detection job"
aws comprehend start-targeted-sentiment-detection-job \
  --job-name "targeted_sentiment-$SHORT_RANDOM_STRING" \
  --input-data-config "S3Uri=s3://$INPUT_BUCKET/MovieData" \
  --output-data-config "S3Uri=s3://$OUTPUT_BUCKET/testfolder/" \
  --data-access-role-arn arn:aws:iam::$AWS_ACCOUNT_ID:role/$ROLE_NAME \
  --language-code en

echo "Targeted sentiment detection job started"

# Start Topics Detection Job
echo "Starting topics detection job"
aws comprehend start-topics-detection-job \
  --job-name example_topics_detection_job \
  --input-data-config "S3Uri=s3://$INPUT_BUCKET/" \
  --output-data-config "S3Uri=s3://$OUTPUT_BUCKET/testfolder/" \
  --data-access-role-arn arn:aws:iam::$AWS_ACCOUNT_ID:role/$ROLE_NAME

echo "Topics detection job started"

# Create Document Classifier Model
CLASSIFIER_MODEL_NAME="my-doc-classifier-$SHORT_RANDOM_STRING"
CLASSIFIER_MODEL_RESULT=$(aws comprehend create-document-classifier \
  --document-classifier-name $CLASSIFIER_MODEL_NAME \
  --language-code en \
  --input-data-config S3Uri=s3://$INPUT_BUCKET/training-data.csv \
  --data-access-role-arn arn:aws:iam::$AWS_ACCOUNT_ID:role/$ROLE_NAME)

# Extract the Model ARN from the response
MODEL_ARN=$(echo $CLASSIFIER_MODEL_RESULT | jq -r '.DocumentClassifierArn')

echo "Document classifier model $CLASSIFIER_MODEL_NAME created with ARN: $MODEL_ARN"

# After creating the document classifier, it goes through a multi-step process to train the model.
# Creating a Comprehend endpoint requires a fully trained model, so we wait until the classifier is trained
# before creating the endpoint.
echo "Waiting for the classifier to be trained..."
while true; do
  CLASSIFIER_STATUS=$(aws comprehend describe-document-classifier \
    --document-classifier-arn $MODEL_ARN \
    --query "DocumentClassifierProperties.Status" \
    --output text)

  echo "Classifier status: $CLASSIFIER_STATUS"

  if [ "$CLASSIFIER_STATUS" = "TRAINED" ]; then
    echo "Classifier is trained!"
    break
  else
    echo "Classifier is still training. Waiting..."
    sleep 30  # Wait for 30 seconds before checking again
  fi
done

# Create and Upload Sample SMS Text Files for Document Classification Job
echo "Creating and uploading sample SMS text files for document classification job"
echo "\"CONGRATULATIONS! TXT 2155550100 to win $5000\"" > SampleSMStext1.txt
echo "\"Hi, when do you want me to pick you up from practice?\"" > SampleSMStext2.txt
echo "\"Plz send bank account # to 2155550100 to claim prize!!\"" > SampleSMStext3.txt

aws s3 cp SampleSMStext1.txt s3://$INPUT_BUCKET/
aws s3 cp SampleSMStext2.txt s3://$INPUT_BUCKET/
aws s3 cp SampleSMStext3.txt s3://$INPUT_BUCKET/

# Start Document Classification Job
echo "Starting document classification job for SMS texts"
aws comprehend start-document-classification-job \
  --input-data-config S3Uri=s3://$INPUT_BUCKET/ \
  --output-data-config S3Uri=s3://$OUTPUT_BUCKET/testfolder/ \
  --data-access-role-arn arn:aws:iam::$AWS_ACCOUNT_ID:role/$ROLE_NAME \
  --document-classifier-arn $MODEL_ARN \
  --job-name "document-classification-sms-$SHORT_RANDOM_STRING"

echo "Document classification job for SMS texts started"

# Create Comprehend Endpoint
ENDPOINT_NAME="comprehend-endpoint-$SHORT_RANDOM_STRING"
DESIRED_INFERENCE_UNITS=5

aws comprehend create-endpoint \
  --endpoint-name $ENDPOINT_NAME \
  --model-arn $MODEL_ARN \
  --desired-inference-units $DESIRED_INFERENCE_UNITS

echo "Comprehend endpoint $ENDPOINT_NAME created with $DESIRED_INFERENCE_UNITS inference units"

# Clean up - remove sample input file, training data, entity recognizer data,
# larger entity corpus file, and the sample text files created above
rm sample-input.txt input-doc.txt input-ann-1.csv input-ann-2.csv input-ann-3.csv $LARGER_CORPUS_FILE SampleText1.txt SampleText2.txt SampleText3.txt training-data.csv SampleMovieReview1.txt SampleMovieReview2.txt SampleMovieReview3.txt SampleSMStext1.txt SampleSMStext2.txt SampleSMStext3.txt


## Note: The following commands are not part of the test script, but are useful for listing the resources created by this script
# aws comprehend list-document-classifiers
# aws comprehend list-entity-recognizers
# aws comprehend list-flywheels
# aws comprehend list-sentiment-detection-jobs
# aws comprehend list-entities-detection-jobs
# aws comprehend list-dominant-language-detection-jobs
# aws comprehend list-key-phrases-detection-jobs
# aws comprehend list-events-detection-jobs
# aws comprehend list-pii-entities-detection-jobs
# aws comprehend list-targeted-sentiment-detection-jobs
# aws comprehend list-topics-detection-jobs
# aws comprehend list-document-classification-jobs
# aws comprehend list-endpoints
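
## Note: likewise not part of the test script - a rough sketch of manual teardown
## for the non-job resources this script creates (substitute the ARNs returned
## by the create calls above; the endpoint must be deleted before the flywheel
## and classifier it references):
# aws comprehend delete-endpoint --endpoint-arn <endpoint-arn>
# aws comprehend delete-flywheel --flywheel-arn <flywheel-arn>
# aws comprehend delete-document-classifier --document-classifier-arn <classifier-arn>
# aws comprehend delete-entity-recognizer --entity-recognizer-arn <recognizer-arn>
# aws iam detach-role-policy --role-name $ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
# aws iam delete-role --role-name $ROLE_NAME
# aws s3 rb s3://$INPUT_BUCKET --force
# aws s3 rb s3://$OUTPUT_BUCKET --force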

@swhite-oreilly swhite-oreilly requested a review from a team as a code owner August 25, 2023 19:49
@bjoernhaeuser bjoernhaeuser enabled auto-merge (squash) August 29, 2023 10:02
@bjoernhaeuser bjoernhaeuser enabled auto-merge (squash) August 29, 2023 11:00
@der-eismann der-eismann merged commit 36a47fe into rebuy-de:main Aug 29, 2023
2 checks passed