Amazon Security Lake integration - Architecture and requirements #113

Closed
Tracked by #128
AlexRuiz7 opened this issue Jan 9, 2024 · 4 comments
Labels: level/task (Task issue), type/research (Research issue)

Comments


AlexRuiz7 commented Jan 9, 2024

Description

Related issue: #113

To develop an integration that acts as a custom source for Amazon Security Lake, we need to investigate and understand the architecture and the requirements that such a source must meet. This issue therefore aims to answer what the integration will look like and how it will be carried out.

Requirements and good practices

  • The custom source must be able to write data to Security Lake as a set of S3 objects.
  • The custom source must be compatible with OCSF Schema 1.0.0-rc.2.
  • The custom source data must be formatted as an Apache Parquet file.
  • The same OCSF event class should apply to each record within a Parquet-formatted object.
  • For sources that contain multiple categories of data, deliver each unique Open Cybersecurity Schema Framework (OCSF) event class as a separate source.

Source: https://docs.aws.amazon.com/security-lake/latest/userguide/custom-sources.html
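
As a rough illustration of these requirements, a minimal Python sketch could group OCSF-shaped records by their class_uid and write each group as a separate Parquet object to the custom-source S3 location. The bucket name, prefix and record fields below are placeholder assumptions, not the final layout.

```python
# Illustrative sketch only: groups OCSF-shaped records by event class and
# uploads one Parquet object per class, as Security Lake custom sources
# require. Bucket, prefix and record fields are placeholder assumptions.
from collections import defaultdict
from io import BytesIO

import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")
BUCKET = "security-lake-custom-source"  # placeholder name
PREFIX = "ext/wazuh"                    # placeholder prefix

def upload_ocsf_batch(records: list[dict]) -> None:
    # Each Parquet object must contain a single OCSF event class,
    # so partition the batch by class_uid first.
    by_class = defaultdict(list)
    for record in records:
        by_class[record["class_uid"]].append(record)

    for class_uid, group in by_class.items():
        table = pa.Table.from_pylist(group)
        buffer = BytesIO()
        pq.write_table(table, buffer, compression="snappy")
        key = f"{PREFIX}/class_uid={class_uid}/events.parquet"
        s3.put_object(Bucket=BUCKET, Key=key, Body=buffer.getvalue())
```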

Architecture

Overview of Security Lake
[Image: Amazon Security Lake conceptual diagram]
Source: https://docs.aws.amazon.com/security-lake/latest/userguide/what-is-security-lake.html

Looking at the conceptual diagram of Amazon Security Lake above, it is clear that our integration as a source has to go through an Amazon S3 bucket. In particular, we are looking at the relation between Amazon S3 and "Data from SaaS application, partner solutions, cloud providers and your customer data converted to OCSF".


In order to push the data from wazuh-indexer (OpenSearch) to Amazon S3, we can use either Logstash or Data Prepper. Both tools have the input and output plugins required to read data from OpenSearch and send it to an Amazon S3 bucket.
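
For orientation only, the snippet below sketches that plumbing in plain Python: read documents from a wazuh-indexer index with opensearch-py and push them to S3 with boto3. The endpoint, credentials, index pattern and bucket name are placeholder assumptions; in practice Logstash or Data Prepper handle this loop through their plugins.

```python
# Orientation only: the read-from-OpenSearch / write-to-S3 loop that the
# Logstash or Data Prepper plugins implement for us. Endpoint, credentials,
# index pattern and bucket name are placeholder assumptions.
import json

import boto3
from opensearchpy import OpenSearch

indexer = OpenSearch(
    hosts=[{"host": "wazuh-indexer", "port": 9200}],  # placeholder endpoint
    http_auth=("admin", "admin"),                     # placeholder credentials
    use_ssl=True,
    verify_certs=False,
)
s3 = boto3.client("s3")

# Pull a batch of alerts from the indexer...
response = indexer.search(
    index="wazuh-alerts-*",
    body={"query": {"match_all": {}}, "size": 1000},
)
alerts = [hit["_source"] for hit in response["hits"]["hits"]]

# ...and deliver them to an S3 bucket as newline-delimited JSON.
s3.put_object(
    Bucket="wazuh-raw-events",  # placeholder bucket
    Key="alerts/batch-0001.json",
    Body="\n".join(json.dumps(alert) for alert in alerts).encode(),
)
```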

Logstash vs Data Prepper

Both tools provide the input and output plugins needed for this integration: an OpenSearch input and an S3 output.

Comparing the two, Logstash stands out as the better choice, for the following reasons:

  • Larger set of input & output plugins: a wider plugin ecosystem results in a more flexible, scalable and evolvable integration.
  • Larger adoption and documentation: Logstash's larger community and more extensive documentation will make the integration easier to develop and maintain.
  • Maturity: compared with Data Prepper, which is a recent project, Logstash has been developed, used and evolved for longer, making it arguably more stable.

OCSF compliant data as Apache Parquet

As Amazon Security Lake requires the data to use the OCSF schema and the Parquet encoding, we need to find a way to transform our data before delivering it to Amazon Security Lake.

Several proposals have been generated:

  1. Use an auxiliary S3 bucket to store unprocessed data (as-is), transform it using an AWS Lambda function, and send it to the Amazon Security Lake S3 bucket (a sketch of such a function follows the comparison below).
  2. Pipe the Logstash pipeline to a script that transforms and uploads the data to the Amazon Security Lake S3 bucket.
  3. Implement a Logstash output plugin or codec to transform and upload the data to the Amazon Security Lake S3 bucket.

These proposals have their advantages and disadvantages.

| Proposal | Resources required |
|----------|--------------------|
| 1 | Logstash (opensearch-input + s3-output plugins) + AWS S3 bucket + AWS Lambda function + Amazon Security Lake S3 bucket |
| 2 | Logstash (opensearch-input + pipe-output plugins) + Amazon Security Lake S3 bucket |
| 3 | Logstash (opensearch-input + s3-output plugins + custom codec) + Amazon Security Lake S3 bucket |

While proposal nr.1 is the most feasible, it is also the most expensive. Proposal nr.3, on the other hand, is the least feasible, given our limited knowledge of Ruby and of Logstash's plugin ecosystem, but it is the cheapest for the end user. Proposal nr.2 is a middle ground between the two.

We will explore proposals nr.1 and nr.2, with future plans to explore proposal nr.3, depending on our success with the other two.
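
As a rough sketch of proposal nr.1, the Lambda function below is triggered by object-created events on the auxiliary bucket, maps each raw alert to an OCSF-shaped record, and delivers the batch as Parquet to the Security Lake custom-source bucket. The field mapping, event class and bucket name are placeholder assumptions, not the final implementation.

```python
# Sketch of the Lambda function in proposal nr.1: triggered when raw data
# lands in the auxiliary bucket, it maps each event to an OCSF-shaped record
# and delivers the batch as Parquet to the Security Lake custom-source
# bucket. The field mapping and bucket name are placeholder assumptions.
import json
import urllib.parse
from io import BytesIO

import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")
DESTINATION_BUCKET = "security-lake-custom-source"  # placeholder name

def to_ocsf(alert: dict) -> dict:
    # Placeholder mapping: the real one must follow the OCSF 1.0.0-rc.2
    # schema of the chosen event class (e.g. Security Finding, class_uid 2001).
    rule = alert.get("rule", {})
    return {
        "class_uid": 2001,
        "time": alert.get("timestamp"),
        "message": rule.get("description"),
        "severity_id": rule.get("level"),
        "raw_data": json.dumps(alert),
    }

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Read the raw, newline-delimited JSON object dropped by Logstash.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        alerts = [json.loads(line) for line in body.splitlines() if line.strip()]

        # Convert to OCSF and re-encode as Parquet for Security Lake.
        table = pa.Table.from_pylist([to_ocsf(a) for a in alerts])
        buffer = BytesIO()
        pq.write_table(table, buffer, compression="snappy")
        s3.put_object(
            Bucket=DESTINATION_BUCKET,
            Key=key.rsplit(".", 1)[0] + ".parquet",
            Body=buffer.getvalue(),
        )
    return {"status": "ok"}
```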

Conclusions

  • The latest version of Logstash (8.12.0 as of January 30th, 2024), together with the logstash-input-opensearch plugin, will be used to implement the integration.
  • Proposal nr.1 is the most promising and will be our focus. We know of existing integrations from other companies that use this method, such as PingOne's.


AlexRuiz7 commented Apr 24, 2024

Architecture diagram of Wazuh's integration with Amazon Security Lake.

[Image: wazuh-amazon-security-lake architecture diagram]

kclinden commented

@AlexRuiz7 did you consider using Kinesis Data Firehose with the Lambda as its data transformation? This would let you skip the raw-events S3 bucket and have Firehose write directly to the Security Lake custom-source bucket.

AlexRuiz7 commented

Hi @kclinden

Not really. I'm no expert in AWS, so I went for the easiest path. I remember reading about it briefly, but IIRC it would have increased the maintenance costs. Maybe I'm wrong.

How would it work in that case? Does the data flow through Kinesis Firehose straight into the Security Lake bucket? And how do you define the OCSF class of the events in that case?

kclinden commented


Firehose would send the data to the same Lambda function that you have already put together. The benefit is that it lets you skip the intermediate S3 bucket and have Logstash write directly to Firehose. Firehose handles the data transformation by invoking the Lambda, and then Firehose itself, instead of the Lambda, writes the result to the bucket.
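
For reference, a Firehose data-transformation Lambda follows a fixed contract: it receives base64-encoded records and must return each one with its recordId, a result status and the transformed base64 data. A minimal sketch, with the OCSF mapping left as a placeholder and the Parquet conversion omitted:

```python
# Minimal sketch of a Kinesis Data Firehose transformation Lambda: decode
# each incoming record, map it to OCSF (placeholder call) and return it
# re-encoded, so that Firehose writes the transformed batch to the
# destination bucket.
import base64
import json

def to_ocsf(alert: dict) -> dict:
    # Placeholder for the same OCSF mapping discussed above.
    return alert

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        transformed = json.dumps(to_ocsf(payload)) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(transformed.encode()).decode(),
        })
    return {"records": output}
```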

For the Data Prepper solution I would probably try to accomplish it all in the pipeline definition similar to this -
https://github.com/ocsf/examples/blob/main/mappings/dataprepper/AWS/v1.1.0/VPC%20Flow/pipeline.yaml
