Handle parsing of CloudTrail logs from S3 #4908

Closed
jszwedko opened this issue Nov 6, 2020 · 8 comments
Labels: domain: parsing, domain: vrl, provider: aws, type: enhancement

Comments

@jszwedko
Member

jszwedko commented Nov 6, 2020

With the addition of the aws_s3 source in #4779, we'd like to make it easy to parse CloudTrail events out of each S3 object.

An example object looks like this (formatted here for readability; in AWS it is a single line in the object):

{
  "Records": [
    {
      "eventVersion": "1.05",
      "userIdentity": {
        "type": "IAMUser",
        "principalId": "AIDAJDPLRKLG7UEXAMPLE",
        "arn": "arn:aws:iam::123456789012:user/Mary_Major",
        "accountId": "123456789012",
        "accessKeyId": "AKIAIOSFODNN7EXAMPLE",
        "userName": "Mary_Major",
        "sessionContext": {
          "sessionIssuer": {},
          "webIdFederationData": {},
          "attributes": {
            "mfaAuthenticated": "false",
            "creationDate": "2019-06-18T22:28:31Z"
          }
        },
        "invokedBy": "signin.amazonaws.com"
      },
      "eventTime": "2019-06-19T00:18:31Z",
      "eventSource": "cloudtrail.amazonaws.com",
      "eventName": "StartLogging",
      "awsRegion": "us-east-2",
      "sourceIPAddress": "203.0.113.64",
      "userAgent": "signin.amazonaws.com",
      "requestParameters": {
        "name": "arn:aws:cloudtrail:us-east-2:123456789012:trail/My-First-Trail"
      },
      "responseElements": null,
      "requestID": "ddf5140f-EXAMPLE",
      "eventID": "7116c6a1-EXAMPLE",
      "readOnly": false,
      "eventType": "AwsApiCall",
      "recipientAccountId": "123456789012"
    },
    {
      "eventVersion": "1.05",
      "userIdentity": {
        "type": "IAMUser",
        "principalId": "AIDAJDPLRKLG7UEXAMPLE",
        "arn": "arn:aws:iam::123456789012:user/Mary_Major",
        "accountId": "123456789012",
        "accessKeyId": "AKIAIOSFODNN7EXAMPLE",
        "userName": "Mary_Major",
        "sessionContext": {
          "sessionIssuer": {},
          "webIdFederationData": {},
          "attributes": {
            "mfaAuthenticated": "false",
            "creationDate": "2019-06-18T22:28:31Z"
          }
        },
        "invokedBy": "signin.amazonaws.com"
      },
      "eventTime": "2019-06-19T00:18:31Z",
      "eventSource": "cloudtrail.amazonaws.com",
      "eventName": "StartLogging",
      "awsRegion": "us-east-2",
      "sourceIPAddress": "203.0.113.64",
      "userAgent": "signin.amazonaws.com",
      "requestParameters": {
        "name": "arn:aws:cloudtrail:us-east-2:123456789012:trail/My-First-Trail"
      },
      "responseElements": null,
      "requestID": "ddf5140f-EXAMPLE",
      "eventID": "7116c6a1-EXAMPLE",
      "readOnly": false,
      "eventType": "AwsApiCall",
      "recipientAccountId": "123456789012"
    }
  ]
}

The aws_s3 source will just read this as raw text. This can then be passed through the json_parser transform, but I don't know of a good way to emit multiple events (one for each element in Records) without using the lua transform. I believe the remap transform could aid here as well.
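
To make the desired fan-out concrete, here is a minimal sketch in Python rather than Vector config. It only illustrates the intended behavior (one output event per element of Records); the function name and the trimmed-down sample object are hypothetical and not part of Vector.

```python
import json


def split_cloudtrail_records(raw_object: str) -> list:
    """Parse one CloudTrail S3 object and return one dict per "Records" entry."""
    document = json.loads(raw_object)
    # Each element of "Records" is meant to become its own event downstream.
    return list(document["Records"])


# Hypothetical, trimmed-down object; real objects carry the full record fields.
sample = '{"Records": [{"eventName": "StartLogging"}, {"eventName": "StopLogging"}]}'
events = split_cloudtrail_records(sample)
for event in events:
    print(event["eventName"])
```

In Vector terms, the single raw-text event read by the source would be replaced by as many events as the list contains.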

@jszwedko jszwedko added type: enhancement A value-adding code change that enhances its existing functionality. provider: aws Anything `aws` service provider related labels Nov 6, 2020
@binarylogic binarylogic added domain: parsing Anything related to parsing within Vector domain: vrl Anything related to the Vector Remap Language labels Dec 3, 2020
@binarylogic
Contributor

I'm curious where @lukesteensen thinks this fits in? My inclination is a codec at the source-level that would emit each event individually.

@jamtur01 jamtur01 removed this from the 2020-12-07: Nanite Repair System milestone Dec 5, 2020
@lukesteensen
Member

Yes, this can be written as a function transform that gets embedded in the S3 source Pipeline via a codec config. We can choose whether or not it's worth exposing as a user-level transform independently.

@zsherman
Contributor

Attaching an example parser that was used to get logs into the right format: https://gist.github.com/zsherman/6732e09e5a0e1f78fea5fb9e59d90655

@larszen

larszen commented Apr 1, 2021

> Attaching an example parser that was used to get logs into the right format https://gist.github.com/zsherman/6732e09e5a0e1f78fea5fb9e59d90655

In that example, I think what happened for us is that the incoming message was not entirely well-formed JSON, so the parseJson() and split() functions did not produce the expected results. That is why the case statements at the bottom try to deal with the missing data.
In general, making sure that the JSON in the message is well-formed might be enough for parseJson() and split() to work as expected, in which case the case statements could be eliminated.
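
As a rough illustration of checking the input up front instead of patching missing data afterwards, the following Python sketch rejects malformed objects before any splitting happens. It is not the parser from the gist; the function name and error messages are hypothetical.

```python
import json


def try_parse_records(raw_object: str):
    """Return (records, error); exactly one of the two is None."""
    try:
        document = json.loads(raw_object)
    except json.JSONDecodeError as err:
        # Malformed input is reported instead of being silently worked around.
        return None, f"invalid JSON: {err.msg} at position {err.pos}"
    if not isinstance(document, dict) or not isinstance(document.get("Records"), list):
        return None, 'missing or non-list "Records" key'
    return document["Records"], None


records, error = try_parse_records('{"Records": [')  # truncated object
print(records, error)
```

With a check like this in front, the later stages can assume every record came from a well-formed object.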

@jszwedko
Member Author

Reference: https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-event-reference-record-contents.html

@pablosichert
Contributor

As discussed with @jszwedko and in @StephenWakely's comment here, we decided that having an extra transformation doesn't justify the maintenance burden right now.

From user feedback, we figured out that malformed JSON triggered the request for handling CloudTrail logs, rather than a need for any CloudTrail-specific handling.

In practice, the parse_json function should be used here.

@dmitrii-didenko

@pablosichert @jszwedko Sorry for tagging you both. Could you please help me figure out how to parse a CloudTrail JSON document with the parse_json function? In particular, I don't know the best way to emit one message per entry in the Records list. Currently, I have this configured with two transforms: the first parses the object and emits Records, and the second handles each message individually:

        cloudtrail_s3_data_events_transform_production_split:
          inputs:
          - cloudtrail_s3_data_events_production
          source: |-
            . = parse_json!(.message)
            . = .Records
          type: remap
        cloudtrail_s3_data_events_transform_production:
          inputs:
          - cloudtrail_s3_data_events_transform_production_split
          source: |-
            .timestamp = parse_timestamp!(.eventTime, "%Y-%m-%dT%H:%M:%SZ")
          type: remap

Is there a better way to handle this, within one transform block instead of two?

@jszwedko
Member Author

That approach should work and is a sensible way to do it. An alternative that combines the two transforms would be to use the map_values function in VRL, but that's really up to personal preference.
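
For readers comparing the two options, this Python sketch models what a single combined step would do: parse the object, then map over Records and attach a parsed timestamp to each one. It is an illustration of the logic only, not VRL and not Vector's implementation; the function and constant names are hypothetical. The format string is the one from the config above, and because its trailing "Z" is matched as a literal, the sketch assumes UTC and sets it explicitly.

```python
import json
from datetime import datetime, timezone

# Same strftime-style format as in the remap config above.
EVENT_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def records_with_timestamp(raw_object: str) -> list:
    """Parse the object and return each record with an added "timestamp" key."""
    records = json.loads(raw_object)["Records"]
    events = []
    for record in records:
        parsed = datetime.strptime(record["eventTime"], EVENT_TIME_FORMAT)
        # The literal "Z" carries no offset for strptime, so UTC is assumed here.
        events.append({**record, "timestamp": parsed.replace(tzinfo=timezone.utc)})
    return events


sample = '{"Records": [{"eventName": "StartLogging", "eventTime": "2019-06-19T00:18:31Z"}]}'
events = records_with_timestamp(sample)
print(events[0]["timestamp"].isoformat())
```

Whether the split and the per-record work live in one step or two does not change the resulting events, which is why the choice comes down to preference.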
