S3 Data Cleaner Lambda is an AWS Lambda function, written in Go, that cleans and validates data files stored in Amazon S3. It supports CSV and JSONL files, infers a data type for each field, and flags rows with parsing issues rather than discarding them. The cleaned output is written back to an S3 bucket.
- File Format Support: Supports both CSV and JSONL formats.
- Data Type Inference: Automatically infers data types for each field.
- Error Handling: Rows with parsing issues are flagged, and error details are stored in special columns (_error and has_error).
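The type inference and error flagging described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `inferType` and `flagRow`, the type labels ("int", "float", "bool", "string"), and the idea of inferring types from a sample of values before validating each row are all assumptions. Only the `_error` and `has_error` column names come from the feature list.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// inferType returns the narrowest type label that fits every non-empty
// value in a column sample. The labels and their priority are assumptions.
func inferType(values []string) string {
	isInt, isFloat, isBool, seen := true, true, true, false
	for _, v := range values {
		v = strings.TrimSpace(v)
		if v == "" {
			continue // empty cells do not influence inference
		}
		seen = true
		if _, err := strconv.ParseInt(v, 10, 64); err != nil {
			isInt = false
		}
		if _, err := strconv.ParseFloat(v, 64); err != nil {
			isFloat = false
		}
		if _, err := strconv.ParseBool(v); err != nil {
			isBool = false
		}
	}
	switch {
	case !seen:
		return "string"
	case isInt:
		return "int"
	case isFloat:
		return "float"
	case isBool:
		return "bool"
	}
	return "string"
}

// checkValue reports an error if v cannot be parsed as typ.
// Empty values and the "string" type always pass.
func checkValue(v, typ string) error {
	v = strings.TrimSpace(v)
	if v == "" {
		return nil
	}
	var err error
	switch typ {
	case "int":
		_, err = strconv.ParseInt(v, 10, 64)
	case "float":
		_, err = strconv.ParseFloat(v, 64)
	case "bool":
		_, err = strconv.ParseBool(v)
	}
	return err
}

// flagRow validates row against the expected column types and returns a copy
// with two extra cells appended: the _error details and the has_error flag.
func flagRow(row, types []string) []string {
	var problems []string
	if len(row) != len(types) {
		problems = append(problems,
			fmt.Sprintf("expected %d fields, got %d", len(types), len(row)))
	}
	for i, v := range row {
		if i >= len(types) {
			break
		}
		if err := checkValue(v, types[i]); err != nil {
			problems = append(problems,
				fmt.Sprintf("column %d: cannot parse %q as %s", i, v, types[i]))
		}
	}
	out := append([]string{}, row...)
	return append(out, strings.Join(problems, "; "),
		strconv.FormatBool(len(problems) > 0))
}

func main() {
	// Infer the first column's type from a sample, then validate two rows.
	types := []string{inferType([]string{"1", "2", "3"}), "string"}
	fmt.Println(flagRow([]string{"42", "ok"}, types))
	fmt.Println(flagRow([]string{"4x2", "ok"}, types))
}
```

Keeping the flagged row in the output, with the failure reason next to it, lets downstream consumers filter on `has_error` without losing the original values.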
- Go 1.16 or higher
- AWS CLI configured with appropriate permissions
- AWS Lambda Go library (github.com/aws/aws-lambda-go)
- AWS SDK for Go
Clone the repository:
git clone https://github.com/stancsz/s3-data-cleaner-lambda.git
Navigate to the project directory and download the dependencies:
cd s3-data-cleaner-lambda
go mod tidy
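Compile the Go code into a Linux binary. The go1.x runtime runs on x86_64 only, so set the target architecture explicitly when building on another platform:
GOOS=linux GOARCH=amd64 go build -o main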
Package the binary into a ZIP file:
zip function.zip main
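- Navigate to the AWS IAM console and create a new role with Lambda as the trusted service.
- Attach the AWSLambdaExecute managed policy to the role. It grants access to CloudWatch Logs plus read and write access to S3 objects.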
Upload the ZIP file to create a new Lambda function:
aws lambda create-function \
--function-name S3DataCleanerLambda \
--zip-file fileb://function.zip \
--handler main \
--runtime go1.x \
--role arn:aws:iam::[your-account-id]:role/[your-execution-role]
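Replace [your-account-id] and [your-execution-role] with your AWS account ID and the name of the execution role you created. Note that AWS has since deprecated the go1.x runtime; newer deployments use an OS-only runtime such as provided.al2023, which expects the binary to be named bootstrap.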
To update the function code:
aws lambda update-function-code \
--function-name S3DataCleanerLambda \
--zip-file fileb://function.zip
You can set up an S3 trigger in the Lambda console to invoke this function whenever a new file is uploaded to a specific bucket.
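When the trigger fires, Lambda passes the function an S3 event notification in JSON. The sketch below shows one way to pull the bucket and key out of that payload using only the standard library. The `s3Event` and `objectRef` types and the `uploadedObjects` function are hypothetical names introduced for illustration; a real handler would more likely use the `events.S3Event` type from github.com/aws/aws-lambda-go, and nothing here is confirmed to match the project's implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// s3Event mirrors only the fields of an S3 event notification used here.
type s3Event struct {
	Records []struct {
		S3 struct {
			Bucket struct {
				Name string `json:"name"`
			} `json:"bucket"`
			Object struct {
				Key string `json:"key"`
			} `json:"object"`
		} `json:"s3"`
	} `json:"Records"`
}

// objectRef identifies one uploaded object (hypothetical helper type).
type objectRef struct {
	Bucket string
	Key    string
}

// uploadedObjects decodes an S3 event payload into bucket/key pairs.
func uploadedObjects(payload []byte) ([]objectRef, error) {
	var ev s3Event
	if err := json.Unmarshal(payload, &ev); err != nil {
		return nil, fmt.Errorf("decode S3 event: %w", err)
	}
	refs := make([]objectRef, 0, len(ev.Records))
	for _, r := range ev.Records {
		// Object keys arrive URL-encoded, for example a space becomes "+".
		key, err := url.QueryUnescape(r.S3.Object.Key)
		if err != nil {
			return nil, fmt.Errorf("decode key %q: %w", r.S3.Object.Key, err)
		}
		refs = append(refs, objectRef{Bucket: r.S3.Bucket.Name, Key: key})
	}
	return refs, nil
}

func main() {
	payload := []byte(`{"Records":[{"s3":{"bucket":{"name":"input-bucket"},` +
		`"object":{"key":"raw/my+file.csv"}}}]}`)
	refs, err := uploadedObjects(payload)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range refs {
		fmt.Printf("s3://%s/%s\n", r.Bucket, r.Key)
	}
}
```

If the function writes its output to the same bucket that triggers it, restrict the trigger to a key prefix so that cleaned files do not invoke the function again.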
Specify input and output S3 paths, the S3 bucket name, and the file type (CSV or JSONL) as environment variables or directly in the Lambda function's configuration.
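As a rough sketch of the environment-variable route, the settings could be read and validated at startup as shown below. The variable names (`S3_BUCKET`, `INPUT_S3_PATH`, `OUTPUT_S3_PATH`, `FILE_TYPE`) and the `config` type are assumptions made for this example; check the project source for the names it actually reads.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// config holds the settings listed above. Field and variable names are
// hypothetical, chosen only for this example.
type config struct {
	Bucket     string
	InputPath  string
	OutputPath string
	FileType   string
}

// loadConfig reads settings through getenv so it can be exercised
// without modifying the process environment.
func loadConfig(getenv func(string) string) (config, error) {
	cfg := config{
		Bucket:     getenv("S3_BUCKET"),
		InputPath:  getenv("INPUT_S3_PATH"),
		OutputPath: getenv("OUTPUT_S3_PATH"),
		FileType:   strings.ToLower(getenv("FILE_TYPE")),
	}
	if cfg.Bucket == "" || cfg.InputPath == "" || cfg.OutputPath == "" {
		return config{}, fmt.Errorf("S3_BUCKET, INPUT_S3_PATH and OUTPUT_S3_PATH must be set")
	}
	if cfg.FileType != "csv" && cfg.FileType != "jsonl" {
		return config{}, fmt.Errorf("FILE_TYPE must be csv or jsonl, got %q", cfg.FileType)
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Println("configuration error:", err)
		return
	}
	fmt.Printf("cleaning %s files from %s/%s into %s\n",
		cfg.FileType, cfg.Bucket, cfg.InputPath, cfg.OutputPath)
}
```

Validating the configuration once at startup means a misconfigured function fails with a clear message instead of partway through a file.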
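To invoke the function manually, run the command below. With AWS CLI v2, also pass --cli-binary-format raw-in-base64-out so that the JSON payload is not interpreted as base64: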
aws lambda invoke \
--function-name S3DataCleanerLambda \
--payload '{"inputS3Path": "s3://input-bucket/file.csv", "outputS3Path": "s3://output-bucket/cleaned_file.csv", "fileType": "csv"}' \
output.txt
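On the function side, a payload of this shape could be decoded and validated as in the sketch below. The JSON field names come from the invoke command above; the `request` type and the `parseRequest` and `splitS3Path` helpers are hypothetical names used for illustration, not the project's confirmed implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// request matches the JSON payload passed to "aws lambda invoke".
type request struct {
	InputS3Path  string `json:"inputS3Path"`
	OutputS3Path string `json:"outputS3Path"`
	FileType     string `json:"fileType"`
}

// splitS3Path breaks "s3://bucket/key" into its bucket and key.
func splitS3Path(p string) (bucket, key string, err error) {
	const prefix = "s3://"
	if !strings.HasPrefix(p, prefix) {
		return "", "", fmt.Errorf("path %q does not start with %s", p, prefix)
	}
	rest := p[len(prefix):]
	i := strings.IndexByte(rest, '/')
	if i <= 0 || i == len(rest)-1 {
		return "", "", fmt.Errorf("path %q is missing a bucket or key", p)
	}
	return rest[:i], rest[i+1:], nil
}

// parseRequest decodes the payload and checks every field before any
// S3 call is made.
func parseRequest(payload []byte) (request, error) {
	var req request
	if err := json.Unmarshal(payload, &req); err != nil {
		return request{}, fmt.Errorf("decode payload: %w", err)
	}
	req.FileType = strings.ToLower(req.FileType)
	if req.FileType != "csv" && req.FileType != "jsonl" {
		return request{}, fmt.Errorf("unsupported fileType %q", req.FileType)
	}
	for _, p := range []string{req.InputS3Path, req.OutputS3Path} {
		if _, _, err := splitS3Path(p); err != nil {
			return request{}, err
		}
	}
	return req, nil
}

func main() {
	payload := []byte(`{"inputS3Path": "s3://input-bucket/file.csv", ` +
		`"outputS3Path": "s3://output-bucket/cleaned_file.csv", "fileType": "csv"}`)
	req, err := parseRequest(payload)
	if err != nil {
		fmt.Println("invalid request:", err)
		return
	}
	bucket, key, _ := splitS3Path(req.InputS3Path) // already validated above
	fmt.Printf("read %s from bucket %s as %s\n", key, bucket, req.FileType)
}
```

The path is split by hand rather than with `net/url` because S3 keys may contain characters such as `?` or `#` that a URL parser would treat as query or fragment delimiters.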
Feel free to open issues or submit pull requests. Contributions are welcome!
This project is released under the MIT License.