Welcome to the carbon data lake AWS Cloud Development Kit (CDK) application

The carbon data lake guidance with sample code implements a foundational data lake (and ingestion and processing framework) using the AWS Cloud Development Kit (AWS CDK). The deployed asset provides the base infrastructure for customers and partners to build their carbon accounting use cases.

Note: This solution by itself will not make a customer compliant with any end-to-end carbon accounting standard. It provides the foundational infrastructure with which additional complementary solutions can be integrated.

The carbon data lake reduces the undifferentiated heavy lifting of ingesting, standardizing, transforming, and calculating greenhouse gas emission data in carbon dioxide equivalent (CO2eq). Customers can use this guidance with sample code to advance their starting point for building decarbonization reporting, forecasting, and analytics solutions and/or products. The carbon data lake includes a purpose-built data pipeline, data quality module, data lineage module, emissions calculator microservice, business intelligence services, prebuilt forecasting machine learning notebook and compute service, GraphQL API, and sample web application.

Customer emissions data (such as databases, historians, existing data lakes, internal/external APIs, images, CSVs, JSON, IoT/sensor data, and third-party applications including CRMs, ERPs, MES, and more) can be mapped to the standard CSV format to support centralization and processing of customer carbon data. Carbon data is ingested through the carbon data lake landing zone and can be ingested from any service within or connected to the AWS Cloud. The calculator can be deployed with a sample emissions factor model, or it can be modified or augmented with your own standards, lookup tables, and calculator logic.

What it does

This guidance with sample code provides core functionality to accelerate data ingestion, processing, calculation, storage, analytics and insights. The following list outlines the current capabilities of the carbon data lake. Please submit a PR to request additional capabilities and features. We appreciate your feedback as we continue to improve this offering.

Capabilities

The following list covers current capabilities as of today:

  • Accepts a standard CSV formatted data input as S3 upload to the carbon data lake Landing Bucket
  • Accepts multi-part and standard upload via S3 CLI, Console, and other programmatic means
  • Accepts single file upload via AWS Amplify console with optional web application
  • Provides daily data compaction at midnight in local time
  • Performs calculation using sample calculator lookup table
  • Can accept a new calculator lookup table and data model, with required updates to the JSON files for both data quality and the calculator.

🛠 What you will build

Deploying this repository with default parameters builds the following carbon data lake environment in the AWS Cloud.

Figure 1: Solution Architecture Diagram

As shown in Figure 1, this guidance with sample code sets up the following application stacks:

  1. Customer emissions data from various sources is mapped to a standard CSV upload template. The CSV is uploaded either directly to the Amazon Simple Storage Service (Amazon S3) landing bucket or through the user interface.
  2. The Amazon S3 landing bucket provides a single landing zone for all ingested emissions data. Data ingress to the landing zone bucket triggers the data pipeline.
  3. An AWS Step Functions workflow orchestrates the data pipeline, including data quality check, data compaction, transformation, standardization, and enrichment with an emissions calculator AWS Lambda function.
  4. AWS Glue DataBrew provides data quality auditing and an alerting workflow, and Lambda functions provide integration with Amazon Simple Notification Service (Amazon SNS) and the AWS Amplify web application.
  5. AWS Lambda functions provide data lineage processing, queued by Amazon Simple Queue Service (Amazon SQS). Amazon DynamoDB provides NoSQL pointer storage for the data ledger, and a Lambda function provides data lineage audit functionality, tracing all data transformations for a given record.
  6. An AWS Lambda function outputs calculated CO2-equivalent emissions by referencing a DynamoDB table with customer-provided emissions factors.
  7. The Amazon S3 enriched bucket provides data object storage for analytics workloads, and the DynamoDB calculated emissions table provides storage for the GraphQL API (a query language for APIs).
  8. Optionally deployable artificial intelligence and machine learning (AI/ML) and business intelligence stacks provide customers with options to deploy a prebuilt Amazon SageMaker notebook and a prebuilt Amazon QuickSight dashboard. Deployments come with prebuilt Amazon Athena queries that can be used to query data stored in Amazon S3. Each service is pre-integrated with Amazon S3 enriched object storage.
  9. An optionally deployable web application stack uses AWS AppSync to provide a GraphQL API backend for integration with web applications and other data consumer applications. AWS Amplify provides a serverless, pre-configured management application that includes basic data browsing, data visualization, data upload, and application configuration.

Application Stacks

Shared Resource Stack

The shared resource stack deploys all cross-stack referenced resources, such as S3 buckets and Lambda functions, that are built as dependencies.

Review the Shared Resources Stack and Stack Outputs

Data Pipeline

The carbon data lake data pipeline is an event-driven Step Functions workflow triggered by each upload to the carbon data lake landing zone S3 bucket. The data pipeline performs the following functions (a minimal sketch of the event trigger wiring follows the list):

  1. AWS Glue DataBrew Data Quality Check: If the data quality check passes, the data moves to the next step. If it fails, the admin user receives an Amazon Simple Notification Service alert via email.
  2. Data Transformation Glue Workflow: Batch records are transformed and prepared for the carbon data lake calculator microservice.
  3. Data Compaction: Nightly data compaction jobs prepare data for analytics and machine learning workloads.
  4. Emissions Calculator AWS Lambda Microservice: An AWS Lambda function performs the emissions factor database lookup and calculation, outputting records to an Amazon DynamoDB table and to an S3 bucket for analytics and AI/ML applications.
  5. Data Transformation Ledger: Each transformation of data is recorded to a ledger using Amazon Simple Queue Service, AWS Lambda, and Amazon DynamoDB.
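
The snippet below is a minimal CDK sketch of how such an event-driven trigger can be wired up: an S3 landing bucket whose object-created events invoke a kickoff Lambda function that starts a Step Functions state machine. It is illustrative only; the construct names, the placeholder state machine definition, and the Lambda asset path (lambda/pipeline-kickoff) are assumptions and do not reflect the actual stacks in this repository.

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as s3n from 'aws-cdk-lib/aws-s3-notifications';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';

export class LandingZoneTriggerSketch extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Landing zone bucket: every uploaded object should start the pipeline
    const landingBucket = new s3.Bucket(this, 'LandingBucket');

    // Placeholder workflow standing in for the quality-check/transform/calculate steps
    const dataPipeline = new sfn.StateMachine(this, 'DataPipeline', {
      definition: new sfn.Pass(this, 'DataQualityCheckPlaceholder'),
    });

    // Kickoff Lambda that starts one execution per new object (handler path is hypothetical)
    const kickoffFn = new lambda.Function(this, 'PipelineKickoff', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda/pipeline-kickoff'),
      environment: { STATE_MACHINE_ARN: dataPipeline.stateMachineArn },
    });
    dataPipeline.grantStartExecution(kickoffFn);

    // S3 object-created events invoke the kickoff Lambda
    landingBucket.addEventNotification(
      s3.EventType.OBJECT_CREATED,
      new s3n.LambdaDestination(kickoffFn)
    );
  }
}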

Review the Data Pipeline Stack, README, and Stack Outputs

Emissions Factor Reference Database Sample

The carbon emissions calculator microservice comes with a pre-seeded Amazon DynamoDB reference table. This data model directly references the sample emissions factor model provided for development purposes. The sample data model is adapted from the World Resources Institute (WRI) GHG Protocol guidance. Please consult the WRI guidance to confirm the most up-to-date information and versions.

The sample provided is for development purposes only, and it is recommended that carbon data lake users modify this JSON document and/or create their own using a similar format. Please modify the provided data model when deploying your own application using the instructions found in the Setup section.

AWS AppSync GraphQL API

A pre-built AWS AppSync GraphQL API provides flexible querying for application integration. This GraphQL API is authorized using Amazon Cognito User Pools and comes with a predefined Admin and Basic User role. This GraphQL API is used for integration with the carbon data lake AWS Amplify Sample Web Application.

Review the AppSync GraphQL API Stack, Documentation, and Stack Outputs
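
If you want to call the API from your own code rather than through the Amplify web application, the sketch below shows one way to do it with plain fetch on Node.js 18+. It assumes you already have the cdlApiEndpoint stack output and a Cognito ID token (JWT) for a user in the deployed user pool; the environment variable names and the selected fields are illustrative.

const endpoint = process.env.CDL_API_ENDPOINT!; // e.g. the cdlApiEndpoint stack output
const idToken = process.env.COGNITO_ID_TOKEN!;  // JWT for an authorized Cognito user

async function queryAll(): Promise<void> {
  const response = await fetch(endpoint, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      // AppSync APIs secured with Cognito User Pools accept the user's JWT here
      Authorization: idToken,
    },
    body: JSON.stringify({
      query: 'query MyQuery { all { items { activity_event_id emissions_output } } }',
    }),
  });
  console.log(JSON.stringify(await response.json(), null, 2));
}

queryAll().catch(console.error);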

Optional: AWS Amplify Sample Web Application

An AWS Amplify application can optionally be deployed and hosted via Amazon CloudFront and AWS Amplify. To review the deployment steps, first complete a successful carbon data lake application deployment; the AWS Amplify web application depends on the core carbon data lake components.

Review the Web Application Stack and Stack Outputs.

Optional: Amazon QuickSight Module with prebuilt visualizations and analysis

An Amazon QuickSight stack can optionally be deployed with pre-built visualizations for Scope 1, 2, and 3 emissions. This stack requires additional manual setup in the AWS console, detailed in this guide.

Review the Amazon QuickSight Stack

Optional: SageMaker Notebook Instance with pre-built machine learning notebook

A pre-built machine learning notebook (.ipynb) with pre-built prompts and functions is deployed on an Amazon SageMaker notebook instance.

Review the SageMaker Notebook Instance Stack.

Sample Data Collection for Testing

The carbon data lake guidance with sample code comes with sample data for testing a successful deployment of the application; it can be found in the sample-data directory.

💲 Cost and Licenses

You are responsible for the cost of the AWS services used while running this reference deployment. There is no additional cost for using the guidance with sample code itself.

The AWS CDK stacks for this repository include configuration parameters that you can customize. Some of these settings, such as instance type, affect the cost of deployment. For cost estimates, see the pricing pages for each AWS service you use. Prices are subject to change.

Tip: After you deploy the repository, create AWS Cost and Usage Reports to track costs associated with the guidance with sample code. These reports deliver billing metrics to an S3 bucket in your account. They provide cost estimates based on usage throughout each month and aggregate the data at the end of the month. For more information, see What are AWS Cost and Usage Reports?

This application doesn't require any software license or AWS Marketplace subscription.

🚀 How to Deploy

You can deploy the carbon data lake guidance with sample code through the manual setup process using the AWS CDK. We recommend using an AWS Cloud9 instance in your AWS account, or VS Code and the AWS CLI. We also generally recommend a fresh AWS account that can be integrated with your existing infrastructure using AWS Organizations.

🎒 Pre-requisites

  • The aws-cli must be installed -and- configured with an AWS account on the deployment machine (see https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html for instructions on how to do this on your preferred development platform).

  • This project requires Node.js. To make sure you have it available on your machine, try running the following command.

    node -v
  • For the best experience, we recommend installing the AWS CDK globally: npm install -g aws-cdk

🔐 Security Note

This repository has been developed using architectural and security best practices as defined by AwsSolutions CDK Nag Pack. CDK Nag provides integrated tools for automatically reviewing infrastructure for common security, business, and architectural best practices.

This repository comes with AwsSolutions CDK Nag Pack pre-configured and enabled by default. This means that any changes to existing code or deployments will be automatically checked for architectural and development best practices as defined by the AwsSolutions CDK Nag Pack. You can disable this feature in cdk.context.json by switching the nagEnabled flag to false.
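
For reference, applying the AwsSolutions pack with cdk-nag generally looks like the sketch below; the exact wiring of the nagEnabled context flag in this repository may differ.

import * as cdk from 'aws-cdk-lib';
import { AwsSolutionsChecks } from 'cdk-nag';

const app = new cdk.App();

// Apply the AwsSolutions rule pack to every stack in the app unless the
// nagEnabled context flag has been switched off
const nagEnabled = app.node.tryGetContext('nagEnabled');
if (nagEnabled !== false && nagEnabled !== 'false') {
  cdk.Aspects.of(app).add(new AwsSolutionsChecks({ verbose: true }));
}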

As part of the shared responsibility model for security we recommend taking additional steps within your AWS account to secure this application. We recommend you implement the following AWS services once your application is in production:

🚀 Setup

0/ Use git to clone this repository to your local environment

git clone #insert-http-or-ssh-for-this-repository

1/ Set up your AWS environment

2/ Prepare your CDK environment

  1. Navigate to CDK Directory
  2. Set up your emissions factor document (see Set up your emissions factor document below)
  3. Copy cdk.context.template.json to cdk.context.json (or rename it by removing .template)
  4. Enter your parameters in cdk.context.json (see Context Parameters below)

Set up your emissions factor document

  1. Modify and/or replace the existing emissions factor sample document.
  2. Alternatively, make a copy and point the emissions factor lookup table component to the new filename by editing the Calculator Construct.
  3. If you are making substantial changes beyond category and/or emissions factor coefficients, you may have to edit the Calculator Microservice stack to reflect changes in input category headers. This will include editing the generateItem method and the IDdbEmissionFactor interface found in the Calculator Construct.

Context Parameters

Before deployment, navigate to cdk.context.json and update the required context parameters: adminEmail and repoBranch. Review the optional and required context variables below; an example cdk.context.json follows the list.

  • Required: adminEmail The email address for the administrator of the app
  • Required: repoBranch The branch to deploy in your pipeline (default is /main)
  • Optional: quicksightUserName Username for access to the carbon emissions dataset and dashboard.
  • Optional: deployQuicksightStack Determines whether this stack is deployed. Default is false; change to true if you want to deploy this stack.
  • Optional: deploySagemakerStack Determines whether this stack is deployed. Default is false; change to true if you want to deploy this stack.
  • Optional: deployWebStack Determines whether this stack is deployed. Default is false; change to true if you want to deploy this stack.
  • Optional: nagEnabled Enables the cdk_nag audit tool. Default is true; change to false if you want to disable it.
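
For illustration, a filled-in cdk.context.json might look like the following. The values are placeholders only; check cdk.context.template.json for the authoritative keys and whether each value is expected as a string or a boolean.

{
    "adminEmail": "admin@example.com",
    "repoBranch": "main",
    "quicksightUserName": "quicksight-admin",
    "deployQuicksightStack": "false",
    "deploySagemakerStack": "false",
    "deployWebStack": "true",
    "nagEnabled": "true"
}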

QuickSight Note: If you choose to deploy the optional QuickSight module, make sure you review the QuickSight setup instructions.

Web Application Note: If you choose to deploy the optional web module, make sure you review the web application setup instructions.

3/ Install dependencies, build, and synthesize the CDK app

  • Install dependencies:

    npm ci

  • Build your Node application and environment:

    npm run build

  • Make sure that you have assumed an AWS profile or credentials through aws configure or some other means
  • Get your AWS account number:

    aws sts get-caller-identity

  • Bootstrap the CDK so that you can build CDK assets:

    cdk bootstrap aws://ACCOUNT-NUMBER/REGION

    or

    cdk bootstrap # if you are authenticated through `aws configure`

  • Synthesize the CDK application:

    cdk synth

4/ Deploy the application

  • ✅ Recommended: deploy for local development
cdk deploy --all

5/ Optional: Set up the Amplify Web Application

If you are reading this it is because you deployed the carbon data lake guidance with sample code Web Application by setting deployWebStack: true in the cdk.context.json file. Your application is already up and running in the AWS Cloud and there are a few simple steps to begin working with and editing your application.

  1. Visit the AWS Amplify Console by navigating to the AWS Console and searching for Amplify. Make sure you are in the same region that you just selected to deploy your application.

  2. Visit your live web application by clicking the link in the Amplify console. When you open the web application in your browser you should see a Cognito login page with input fields for an email address and password. Enter your email address and the temporary password sent to your email when you created your carbon data lake guidance with sample code CDK application. After changing your password, you should be able to sign in successfully.

    NOTE: The sign-up functionality is disabled intentionally to help secure your application. You may change this and add the UI elements back, or manually add the necessary users in the Cognito console while following the principle of least privilege (recommended).

  3. Learn more about working with AWS Amplify CLI or the AWS Amplify Console.

  4. Make the web application your own and let us know what you choose to do with it.

Success! At this point, you should successfully have the Amplify app working.

Optional A/ Manually enable & set up the Amazon QuickSight Stack

If you choose to deploy the Amazon QuickSight business intelligence stack, it will include prebuilt data visualizations that leverage Amazon Athena to query your processed data. If you elect to deploy this stack, you will need to remove the comments.

Before you proceed, you need to set up your QuickSight account and user. This needs to be done manually in the console, so please open this link and follow the instructions here.

To deploy this stack, navigate to cdk.context.json, change the deployQuicksightStack value to true, and redeploy the application by running cdk deploy --all

Optional B/ Manually enable & set up the Forecast stack

The forecast stack includes a pre-built SageMaker notebook instance running an .ipynb with embedded machine learning tools and prompts.

To deploy this stack, navigate to cdk.context.json, change the deploySagemakerStack value to true, and redeploy the application by running cdk deploy --all

🗑 How to Destroy

You can destroy all stacks included in the carbon data lake guidance with sample code with cdk destroy --all. You can destroy an individual stack with cdk destroy StackName. By default, cdk destroy will destroy EVERYTHING. Use this with caution! We strongly recommend that you modify this behavior by applying retain (no-delete) removal policies within your CDK constructs, as shown in the sketch after this list. Some stacks and constructs that we recommend revising include:

  • DynamoDB Tables
  • S3 Buckets
  • Cognito User Pools
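
For example, a retain policy on an S3 bucket or DynamoDB table looks roughly like the sketch below. The construct IDs are illustrative, not the actual names used in the carbon data lake stacks.

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';

export class RetainExampleStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Keep the bucket (and its data) even if the stack is destroyed
    new s3.Bucket(this, 'EnrichedDataBucket', {
      removalPolicy: cdk.RemovalPolicy.RETAIN,
    });

    // Same idea for a DynamoDB table
    new dynamodb.Table(this, 'CalculatedEmissionsTable', {
      partitionKey: { name: 'activity_event_id', type: dynamodb.AttributeType.STRING },
      removalPolicy: cdk.RemovalPolicy.RETAIN,
    });
  }
}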

Work with outputs

The CDK stacks by default export all stack outputs to cdk-outputs.json at the top level of the directory. You can disable this feature by removing "outputsFile": "cdk-outputs.json" from cdk.json, but we recommend leaving it enabled, as it is a requirement for some other features. By default this file is ignored via .gitignore, so outputs will not be committed to a version control repository. Below is a short sketch of reading the outputs file programmatically, followed by a guide to the standard outputs.
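
As a sketch, a script or test can read the file like this; the stack-name filter and output key are illustrative, so check your own cdk-outputs.json for the exact names.

import { readFileSync } from 'fs';

// cdk-outputs.json maps each deployed stack name to its output key/value pairs
type CdkOutputs = Record<string, Record<string, string>>;

const outputs: CdkOutputs = JSON.parse(readFileSync('cdk-outputs.json', 'utf-8'));

// Find the shared resources stack (name match is hypothetical) and print one output
const shared = Object.entries(outputs).find(([stackName]) =>
  stackName.toLowerCase().includes('shared')
)?.[1];

console.log('Enriched data bucket:', shared?.cdlEnrichedDataBucket);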

Shared Resources Stack Outputs

Shared resource stack outputs include:

  • cdlAwsRegion: Region of CDK Application AWS Deployment.
  • cdlEnrichedDataBucket: Enriched data bucket with outputs from calculator service.
  • cdlEnrichedDataBucketUrl: URL for enriched data bucket with outputs from calculator service
  • cdlDataLineageBucket: Data lineage S3 bucket
  • cdlDataLineageBucketUrl: Data lineage S3 bucket URL

API Stack Outputs

  • cdluserPoolId: Cognito user pool ID for authentication
  • CLQidentityPoolId: Cognito identity pool ID for authentication
  • cdluserPoolClientId: Cognito user pool client ID for authentication
  • cdlcdlAdminUserRoleOutput: Admin user role output
  • cdlcdlStandardUserRoleOutput: Standard user role output
  • cdlApiEndpoint: GraphQL API endpoint
  • cdlApiUsername: GraphQL API admin username
  • cdlGraphQLTestQueryURL: GraphQL test query URL (takes you to the AWS console if you are signed in)

Data Pipeline Stack Outputs

  • LandingBucketName: S3 landing zone bucket name for data ingestion to the carbon data lake guidance with sample code data pipeline
  • cdlLandingBucketUrl: S3 landing zone bucket URL for data ingestion to the carbon data lake guidance with sample code data pipeline
  • cdlGlueDataBrewURL: URL for Glue DataBrew in the AWS console
  • cdlDataPipelineStateMachineUrl: URL to open the cdl state machine to view Step Functions workflow status

Web Stack Outputs

  • cdlWebAppRepositoryLink: Amplify web application CodeCommit repository link
  • cdlWebAppId: Amplify web application ID
  • cdlAmplifyLink: Amplify web application AWS console URL
  • cdlWebAppDomain: Amplify web application live web URL

Quicksight Stack Outputs

  • QuickSightDataSource: ID of the QuickSight data source connector for the Athena emissions dataset. Use this connector to create additional QuickSight datasets based on the Athena dataset.
  • QuickSightDataSet: ID of the pre-created QuickSight dataset, based on the Athena emissions dataset. Use this pre-created dataset to create new dynamic analyses and dashboards.
  • QuickSightDashboard: ID of the pre-created QuickSight dashboard, based on the Athena emissions dataset. Embed this pre-created dashboard directly into your user-facing applications.
  • cdlQuicksightUrl: URL of the QuickSight dashboard

Sagemaker Notebook Stack Outputs

  • cdlSagemakerRepository: CodeCommit repository of the SageMaker notebook
  • cdlSagemakerNotebookUrl: AWS console URL for the SageMaker notebook ML instance

Test Stack Outputs

  • e2eTestLambdaFunctionName: Name of the carbon data lake Lambda test function
  • e2eTestLambdaConsoleLink: URL to open and invoke the calculator test function in the AWS console

🛠 Usage

Time to get started using carbon data lake guidance with sample code! Follow the steps below to see if everything is working and get familiar with this solution.

1/ Make sure all the infrastructure deployed properly

In your command line shell you should see confirmation of all resources deploying. Did they deploy successfully? Any errors or issues? If all is successful, you should see an indication that the CDK deployed. You can also verify this by navigating to the CloudFormation service in the AWS console. Visually check the series of stacks that all begin with CLQS to see that they deployed successfully. You can also search for the tag:

"application": "carbon-data-lake"

2/ Drop some synthetic test data into the carbon data lake landing zone S3 Bucket

Time to test some data out and see if everything is working. This section assumes basic prerequisite knowledge of how to manually upload an object to S3 with the AWS console. For more on this please review how to upload an object to S3.

  • Go to the S3 console and locate your carbon data lake landing zone bucket; it will be named cdlpipelinestack-cdllandingbucket with a unique identifier appended to it
  • Upload carbon data lake synthetic input data to the S3 bucket manually (or programmatically, as shown in the sketch after this list)
  • This will trigger the pipeline kickoff Lambda function and start the data pipeline Step Functions workflow -- continue!
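
If you prefer to upload programmatically, the sketch below uses the AWS SDK for JavaScript v3. The bucket name and file path are placeholders; use the landing bucket name from your own deployment outputs and a file from the sample-data directory.

import { readFileSync } from 'fs';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({});

async function uploadSample(bucketName: string, filePath: string): Promise<void> {
  // Write the local CSV to the landing zone bucket, which triggers the pipeline
  await s3.send(new PutObjectCommand({
    Bucket: bucketName,
    Key: filePath.split('/').pop() ?? filePath,
    Body: readFileSync(filePath),
  }));
  console.log(`Uploaded ${filePath} to s3://${bucketName}`);
}

// Replace both arguments with values from your own deployment and repository
uploadSample('<your-cdl-landing-bucket>', 'sample-data/<sample-file>.csv').catch(console.error);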

3/ Take a look at the step functions workflow

  • Navigate to the Step Functions service in the AWS console
  • Select the step function named cdlPipeline with an appended UUID
  • Select the recent active workflow from the "executions" list
  • Have a quick look and familiarize yourself with the workflow graph inspector
  • The workflow will highlight green for each passed step. See the two image examples below.

Figure: In-progress Step Functions workflow

Figure: Completed Step Functions workflow

4/ Review your calculated outputs

The calculator writes the emissions outputs described in the Data Model section below. Outputs are written to Amazon DynamoDB and Amazon S3. You can review the outputs using the AWS console or AWS CLI:

  • Amazon DynamoDB: Navigate to Amazon DynamoDB in the AWS console. Look for a database called DataBase and a table called Table.
  • Amazon S3: Navigate to S3 in the console and look for a bucket called BucketName. This bucket contains all calculator outputs.

You can also query this data using the GraphQL API detailed below.
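
For a quick programmatic check, a sketch with the AWS SDK for JavaScript v3 Document Client is shown below; the table name is a placeholder for the calculated-emissions table in your deployment.

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, ScanCommand } from '@aws-sdk/lib-dynamodb';

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}));

async function previewCalculatedEmissions(tableName: string): Promise<void> {
  // Scan a handful of calculated records; prefer Query with pagination for real workloads
  const result = await client.send(new ScanCommand({ TableName: tableName, Limit: 10 }));
  for (const item of result.Items ?? []) {
    console.log(item.activity_event_id, JSON.stringify(item.emissions_output));
  }
}

// Replace with the calculated-emissions table name from your deployment outputs
previewCalculatedEmissions('<your-calculated-emissions-table>').catch(console.error);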

5/ Query your GraphQL API endpoint

  • Navigate to AWS AppSync in the console
  • In AWS AppSync, choose the Queries tab and enter the following text in the editor
  • Run the following query and hit "run"

This one will get all of the records (with a default limit of 10)

query MyQuery {
  all {
    items {
      activity
      activity_event_id
      asset_id
      category
      emissions_output
      geo
      origin_measurement_timestamp
      raw_data
      units
      source
      scope
    }
  }
}

Did that all work? Continue...

6/ Take a look at the Amplify Sample Web Application

If you have not deployed the sample web application yet, this is a great time to do so. Once you've run some data through the pipeline, you should see it successfully populating in the application. Remember that to deploy the web application you will need to set "deployWebStack": "true" in cdk.context.json.

7/ Try dropping some other sample data into the landing zone

  • Generate or select some additional data (it can be anything really, but carbon emissions data is good)
  • Test out the data quality module by dropping different data into the bucket. Does it run through? Do you get a notification if it does not?

8/ Start connecting your own data to the carbon data lake landing zone

  • Connect other data sources such as IoT, Streaming Data, Database Migration Workloads, or other applications to the S3 landing zone bucket. Try something out and let us know how it goes.

🧪 Tests

This application currently includes unit tests, infrastructure tests, and deployment tests. We are working on an end-to-end testing solution as well. Read on for the test details:

Pipeline Tests

For GitLab users only -- the GitLab CI runs each time you commit to remote and/or merge to main. This runs automatically and does the following:

Static Tests

  • npm ci installs all dependencies from package-lock.json
  • npm run build builds the JavaScript from TypeScript and makes sure everything works
  • cdk synth synthesizes all CDK stacks in the application
  • Runs bandit security tests for common vulnerabilities in Python
  • Runs ESLint for common formatting issues in JavaScript and TypeScript

Security Tests

  • cdk_nag
  • git-secrets
  • Checkov
  • semgrep
  • python bandit

Deployment Tests

  • Runs CDKitten deployment tests -- these deploy your CDK app in several major AWS regions, checking that it builds and deploys successfully, and then destroying those stacks after confirming that they build.
  • Runs the e2e data integration test -- an end-to-end test that drops data into the pipeline and queries the GraphQL API output. If the test is successful it returns Success

Manual Tests

You can run several of these tests manually on your local machine to check that everything is working as expected.

  • sh test-deployment.sh runs CDKitten locally using your assumed AWS role
  • sh test-e2e.sh runs an end-to-end test by dropping data into the pipeline and querying the GraphQL API output. If the test is successful it returns Success
  • npm run lint tests your code locally with the prebuilt linter configuration

Extending carbon data lake

If you are looking to utilize existing features of carbon data lake while integrating your own features, modules, or applications this section provides details for how to ingest your data to the carbon data lake data pipeline, how to connect data outputs, how to integrate other applications, and how to integrate other existing AWS services. As we engage with customers this list of recommendations will grow with customer use-cases. Please feel free to submit issues that describe use-cases you would like to be documented.

Ingesting data into carbon data lake

To ingest data into the carbon data lake, you can use various inputs to get data into the carbon data lake landing zone S3 bucket. This bucket can be found via the AWS console or AWS CLI under the name bucketName. It can also be accessed as a public read-only stack output via props stackOutputName. There are several methods for bringing data into an S3 bucket to start an event-driven pipeline. This article is a helpful resource as you explore options. Once your data is in S3, it will kick off the pipeline and the data quality check will begin.
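
One common pattern, sketched below, is to hand an external producer a time-limited presigned upload URL so it can push a CSV into the landing zone without holding AWS credentials. The bucket name and object key are placeholders.

import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';

const s3 = new S3Client({});

async function createUploadUrl(bucketName: string, key: string): Promise<string> {
  // Anyone holding this URL can PUT one object to the given key until it expires
  const command = new PutObjectCommand({ Bucket: bucketName, Key: key });
  return getSignedUrl(s3, command, { expiresIn: 900 }); // valid for 15 minutes
}

createUploadUrl('<your-cdl-landing-bucket>', 'uploads/emissions.csv')
  .then((url) => console.log(url))
  .catch(console.error);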

Integrating carbon data lake data outputs

General Guide to adding features

To add additional features to carbon data lake we recommend developing your own stack that integrates with the existing carbon data lake stack inputs and outputs. We recommend starting by reviewing the concepts of application, stack, and construct in AWS CDK. Adding a stack is the best way to add functionality to carbon data lake.

  1. Start by adding your own stack directory to lib/stacks

    mkdir lib/stacks/stack-title
  2. Add a stack file to this directory

    touch lib/stacks/stack-title/stack-title.ts
  3. Use basic CDK stack starter code to formulate your own stack. See example below:

    import * as cdk from 'aws-cdk-lib';
    import { Construct } from 'constructs';
    // import * as sqs from 'aws-cdk-lib/aws-sqs';
    
    export class ExampleStack extends cdk.Stack {
        constructor(scope: Construct, id: string, props?: cdk.StackProps) {
        super(scope, id, props);
    
        // The code that defines your stack goes here
    
        // example resource
        // const queue = new sqs.Queue(this, 'ExampleStackQueue', {
        //   visibilityTimeout: cdk.Duration.seconds(300)
        // });
        }
    }
  4. We recommend using a single stack, and integrating additional submodular components as constructs. Constructs are logical groupings of AWS resources with "sane" defaults. In many cases the CDK team has already created a reusable construct and you can simply work with that, but in specific cases you may want to create your own. You can create a construct using the commands and example below:

    mkdir lib/constructs/construct-title
    touch lib/constructs/construct-title/title-construct.ts
  5. If you have integrated your stack successfully you should see it build when you run cdk synth. For development purposes we recommend deploying your stack in isolation before you deploy with the full application. You can run cdk deploy YourStackName to deploy in isolation.

  6. Integrate your stack with the full application by importing it to bin/main.ts and bin/cicd.ts if you have chosen to deploy it.

    #open the file main.ts
    open main.ts
    // Import your stack at the top of the file
    import {YourStackName} from './stacks/stack-title/your-stack'
    
    // Now create a new stack to deploy within the application
    const stackName = new YourStackName(app, "YourStackTitle", {
        // these are props that serve as an input to your stack
        // these are optional, but could include things like S3 bucket names or other outputs of other stacks.
        // For more on this see the stack output section above.
        yourStackProp1: prop1,
        yourStackProp2: prop2,
        env: appEnv // be sure to include this environment prop
    })

Working with Stack Outputs

You can access the outputs of application stacks by adding them as props to your stack inputs. For example, you can access the myVpc output by adding networkStack.myVpc as props to your own stack. It is best practice to add this as props at the application level, and then as an interface at the stack level. Finally, you can access it via props.myVpc (or whatever you call it) within your stack. Below is an example.

// Start by passing the output in when you instantiate your stack 👇
new MyFirstStack(app, 'MyFirstStack', {
    vpc: networkStack.myVpc
});

// Now define this as a props interface within that stack 👇
export interface MyFirstStackProps extends StackProps {
    vpc: ec2.IVpc
}

// Now access it as a prop where you need it within the stack 👇
this.myStackObject = new ec2.SecurityGroup(this, 'ec2SecurityGroup', {
    vpc: props.vpc,
    allowAllOutbound: true,
});

The above is a theoretical example. We recommend reviewing the CDK documentation and the existing stacks to see more examples.

Integrating with existing AWS Services

The carbon data lake calculator microservice expects input that matches the Calculator Data Input Model, described in the Data Model section below. Use that schema when integrating other AWS services with the data pipeline.

📚 Reference & Resources

Helpful Commands for CDK

  • npm run build compiles TypeScript to JS
  • npm run watch watches for changes and compiles
  • npm run test performs the Jest unit tests
  • cdk diff compares the deployed stack with the current state
  • cdk synth emits the synthesized CloudFormation template
  • cdk deploy --all deploys this stack to your default AWS account/region without the CI/CD pipeline
  • npm run deploy:cicd deploys this application's CI/CD stack and then links your repo for the automated pipeline

Data Model

Calculator Input Model

The model below describes the required schema for input to the carbon data lake calculator microservice. This is the Calculator Data Input Model.

{
    "activity_event_id": "customer-carbon-data-lake-12345",
    "asset_id": "vehicle-1234",
    "geo": {
        "lat": 45.5152,
        "long": 122.6784
    },
    "origin_measurement_timestamp":"2022-06-26 02:31:29",
    "scope": 1,
    "category": "mobile-combustion",
    "activity": "Diesel Fuel - Diesel Passenger Cars",
    "source": "company_fleet_management_database",
    "raw_data": 103.45,
    "units": "gal"
}
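
For integration code written in TypeScript, the sample above maps onto an interface roughly like the following (field names and types are inferred from the sample record):

interface CalculatorInput {
  activity_event_id: string;
  asset_id: string;
  geo: {
    lat: number;
    long: number;
  };
  origin_measurement_timestamp: string;
  scope: number;
  category: string;
  activity: string;
  source: string;
  raw_data: number;
  units: string;
}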

Calculator Output Model

The model below describes the standard output model from the carbon data lake emissions calculator microservice.

Calculator Output Model

{
    "activity_event_id": "customer-CarbonLake-12345",
    "asset_id": "vehicle-1234",
    "activity": "Diesel Fuel - Diesel Passenger Cars",
    "category": "mobile-combustion",
    "scope": 1,
    "emissions_output": {
        "calculated_emissions": {
            "co2": {
                "amount": 0.024,
                "unit": "tonnes"
            },
            "ch4": {
                "amount": 0.00001,
                "unit": "tonnes"
            },
            "n2o": {
                "amount": 0.00201,
                "unit": "tonnes"
            },
            "co2e": {
                "ar4": {
                    "amount": 0.2333,
                    "unit": "tonnes"
                },
                "ar5": {
                    "amount": 0.2334,
                    "unit": "tonnes"
                }
            }
        },
        "emissions_factor": {
            "ar4": {
                "amount": 8.812,
                "unit": "kgCO2e/unit"
            },
            "ar5": {
                "amount": 8.813,
                "unit": "kgCO2e/unit"
            }
        }
    },
    "geo": {
        "lat": 45.5152,
        "long": 122.6784
    },
    "origin_measurement_timestamp": "2022-06-26 02:31:29",
    "raw_data": 103.45,
    "source": "company_fleet_management_database",
    "units": "gal"
}
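
Inferred from the sample above, the output record maps onto a TypeScript shape roughly like this (field names and types are taken from the sample only):

interface GasAmount {
  amount: number;
  unit: string;
}

interface CalculatorOutput {
  activity_event_id: string;
  asset_id: string;
  activity: string;
  category: string;
  scope: number;
  emissions_output: {
    calculated_emissions: {
      co2: GasAmount;
      ch4: GasAmount;
      n2o: GasAmount;
      co2e: {
        ar4: GasAmount;
        ar5: GasAmount;
      };
    };
    emissions_factor: {
      ar4: GasAmount;
      ar5: GasAmount;
    };
  };
  geo: {
    lat: number;
    long: number;
  };
  origin_measurement_timestamp: string;
  raw_data: number;
  source: string;
  units: string;
}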

Sample and bring your own emissions factor models

The JSON document below is a sample emissions factor model for testing and development purposes only. To use this solution or develop your own related solution, please customize and update your own emissions factor models to represent your reporting requirements.

Sample Emissions Factor Model. This is the lookup table used for coefficient inputs to the calculator microservice.

Calculation methodologies reflected in this solution are aligned with the sample model, and this calculator stack may require modification if a new model is applied. To review calculation methodology and lookup tables please review the carbon data lake Emissions Calculator Stack.

👀 See also

🔐 Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Appendix

Troubleshooting

For users with an Apple M1 chip, you may run into the following error when executing npm commands: "no matching version found for node-darwin-amd64@16.4.0" or similar terminal error output depending on the version of node you are running. If this happens, execute the following commands from your terminal in order (this fix assumes you have node version manager (nvm) installed). In this example, we will use node version 16.4.0. Replace the node version in these commands with the version you are running:

nvm uninstall 16.4.0
arch -x86_64 zsh
nvm install 16.4.0
nvm alias default 16.4.0
