SNR-1102: Improve HTML link regex by tsellers-r7 · Pull Request #84 · rapid7/dap

tsellers-r7 · 2020-06-18T15:56:44Z

This PR updates the regex in the HTML link extraction code to better handle data that consists of large numbers of repeated < characters (e.g. <<<<). The updated regex stops immediately when it encounters another <, rather than continuing on looking for >.
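As a rough illustration of the difference, the sketch below runs the old and new patterns from this PR against a made-up string that begins with a run of < characters. The constant names and the sample input are assumptions for the example only and are not taken from the dap code base.

```ruby
# Illustrative comparison only. OLD_TAG and NEW_TAG are hypothetical names;
# the patterns themselves are the ones shown in this PR's diff.
OLD_TAG = /<([^>]+)>/m            # before: anything up to the next '>'
NEW_TAG = /<([^<>]{1,4096})>/m    # after: gives up as soon as another '<' appears

# Made-up sample: a run of stray '<' followed by a real anchor tag.
html = '<<<<<a href="http://example.com/">link</a>'

old_matches = html.scan(OLD_TAG).flatten
new_matches = html.scan(NEW_TAG).flatten

# The old pattern swallows the stray '<' characters into the capture.
puts old_matches.inspect  # ["<<<<a href=\"http://example.com/\"", "/a"]

# The new pattern fails fast on each stray '<' and captures only the tag body.
puts new_matches.inspect  # ["a href=\"http://example.com/\"", "/a"]
```

With the old pattern, an input made up of a long run of < and no closing > lets the engine scan to the end of the string from every starting position before giving up, so the work can grow roughly quadratically with input length. The new character class rejects each stray < after a single step.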

It also bumps the version number in preparation for a release.

This was tested using rake tests under Ruby 2.4.5 and 2.6.3.

Dap::Filter::FilterDecodeGquicVersionsResult
  .decode
    testing gquic valid input base64 encoded output from the real world
      returns an hash w/ versions as list of versions
    testing gquic valid input artifical example
      returns an hash w/ versions as list of versions
    testing gquic valid versions with invalid versions
      returns an hash w/ versions as list of versions
    testing valid string but not gquic versions
      returns nil
    testing valid string with Q in it but not gquic versions
      returns nil
    testing gquic empty string input
      returns nil
    testing gquic nil input
      returns nil

Dap::Filter::FilterDecodeHTTPReply
  .decode
    decoding non-HTTP response
      returns an empty hash
    decoding uncompressed response
      correctly sets status code
      correctly sets status message
      correctly sets body
      correctly extracts http_raw_headers
      extracts Date http header
      extracts Last-Modified http header
    decoding binary response
      correctly sets http_raw_body base64
    decoding gzip compressed response
      correctly decompresses body
    decoding valid chunked responses
      correctly dechunks body
      finds normal headers
      finds trailing headers
    decoding bogus chunked responses
Skipping impossibly large 255-byte #2 chunk, at offset 14/35
      reads the partial body
Skipping impossibly large 255-byte #2 chunk, at offset 14/35
      finds normal headers
    decoding truncated, chunked responses
Skipping impossibly large 6-byte #3 chunk, at offset 35/35
      reads the partial body
Skipping impossibly large 6-byte #3 chunk, at offset 35/35
      finds normal headers
    decoding responses that are missing the "reason phrase", an RFC anomaly
      decodes anyway

Dap::Filter::FilterHTMLLinks
  .process
    lowercase
      extracted the correct links
    uppercase
      extracted the correct links
    scattercase
      extracted the correct links
    repeated less than symbol
      extracted the correct links

Dap::Filter::FilterDecodeLdapSearchResult
  .decode
    testing full ldap response message
      returns Hash as expected
      returns expected value
    testing invalid ldap response message
      returns error message as expected

Dap::Filter::FilterCopy
  .process
    copy one json field to another
      copies and leaves the original field

Dap::Filter::FilterFlatten
  .process
    flatten nested json
      has new flattened nested document keys
    ignore unnested keys
      is the same as the original document

Dap::Filter::FilterExpand
  .process
    expand unnested json
      has new expanded keys
    ignore all but specified unnested json
      has new expanded keys
    ignore nested json
      is the same as the original document

Dap::Filter::FilterRenameSubkeyMatch
  .process
    with subkeys
      renames keys as expected
    without subkeys
      produces unchanged output without errors

Dap::Filter::FilterMatchRemove
  .process
    with similar keys
      removes the expected keys

Dap::Filter::FilterMatchSelect
  .process
    with similar keys
      selects the expected keys

Dap::Filter::FilterSelect
  .process
    with similar keys
      selects the expected keys

Dap::Filter::FilterMatchSelectKey
  .process
    with similar keys
      selects the expected keys

Dap::Filter::FilterMatchSelectValue
  .process
    with similar keys
      selects the expected keys

Dap::Filter::FilterTransform
  .process
    invalid transform
      fails
    reverse
      ASCII
        is reversed
      UTF-8
        is reversed
    int default
      valid int
        is the correct int
      invalid int
        is the correct int
    int different base
      is the correct int
    float
      valid float
        is the correct float
      invalid float
        is the correct float
    json
      valid json
        is the correct JSON
      invalid json
        raises on invalid JSON
    stripping
      lstrip
        lstripped
      rstrip
        rstripped
      strip
        stripped

Dap::Filter::FilterFieldReplace
  .process
    replaced correctly

Dap::Filter::FilterFieldReplaceAll
  .process
    replaced correctly

Dap::Filter::FilterFieldSplitPeriod
  .process
    splitting on period boundary
      splits correctly

Dap::Filter::FilterFieldSplitLine
  .process
    splitting on newline boundary
      splits correctly

Dap::Filter::FilterDecodeDNSVersionReply
  .decode
    parsing empty string
      returns an empty hash
    parsing a partial response
      returns an empty hash
    parsing TCP DNS response
      returns the correct version
    parsing UDP DNS response
      returns the correct version

Dap::Input::InputJSON
  .read_record
    decoding input json
      parses values starting with a colon (:) as a string

Dap::Proto::IPMI::Channel_Auth_Reply
  .valid?
    testing with valid rmcp version and message length
      returns true as expected
    testing with invalid data
      returns false as expected

Dap::Proto::LDAP
  .decode_elem_length
    testing lengths shorter than 128 bits
      returns a Fixnum
      returns value correctly
    testing lengths greater than 128 bits
      returns a Fixnum
      returns value correctly
    testing with 3 byte length
      returns a Fixnum
      returns value correctly
    testing invalid length
      returns nil as expected
  .split_messages
    testing full message
      returns Array as expected
      returns SearchResultEntry value as expected
      returns SearchResultDone value as expected
    testing invalid message
      returns Array as expected
    testing short message
      returns Array as expected
    testing message length greater than total data length
      returns Array as expected
      returns empty Array as expected
    testing empty ASN.1 Sequence
      returns Array as expected
      returns empty Array as expected
  .parse_ldapresult
    testing valid data
      returns Hash as expected
      returns results as expected
    testing invalid data
      returns Hash as expected
      returns empty Hash as expected
  .parse_messages
    testing SearchResultEntry
      returns Array as expected
      returns SearchResultEntry value as expected
    testing SearchResultDone
      returns Array as expected
      returns SearchResultDone value as expected
    testing SearchResultDone - edge case #1
      returns Array as expected
      returns operationsError as expected
    testing UnhandledTag
      returns Array as expected
      returns UnhandledTag value as expected
    testing empty ASN.1 Sequence
      returns Array as expected
      returns error value as expected

Dap::Utils::Misc
  .flatten_hash
    with mixed nested data
      flattens properly

Finished in 0.02675 seconds (files took 0.3554 seconds to load)
99 examples, 0 failures

tsellers-r7 · 2020-06-18T16:00:54Z

lib/dap/filter/http.rb

      to_s.
      encode('UTF-8', invalid: :replace, undef: :replace, replace: '').
-      scan(/<([^>]+)>/m).each do |e|
+      scan(/<([^<>]{1,4096})>/m).each do |e|


Technically, {1,4096} isn't needed to solve the immediate problem. I've included it here to put an upper limit on how long the regex engine will spend on a particular string in the event that it is run against data constructed in a particular way. 4096 should be enough for us to extract links in real-world situations.
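A small sketch of what the {1,4096} bound means in practice, using made-up inputs: a tag body of exactly 4096 characters still matches, while one of 4097 characters is skipped. The constant and variable names are assumptions for illustration.

```ruby
# Illustrative only. TAG is a hypothetical name for the pattern added in this PR.
TAG = /<([^<>]{1,4096})>/m

# Made-up inputs: tag bodies at and just over the 4096-character limit.
at_limit   = '<' + ('a' * 4096) + '>'
over_limit = '<' + ('a' * 4097) + '>'

at_limit_count   = at_limit.scan(TAG).length
over_limit_count = over_limit.scan(TAG).length

puts at_limit_count    # 1, the body fits within the bound
puts over_limit_count  # 0, the body is one character too long to match
```

The trade-off is that any real tag whose body exceeds 4096 characters will no longer be extracted, which the comment above judges acceptable for link extraction.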

cbarnard-r7

LGTM

tsellers-r7 added 2 commits June 18, 2020 10:48

SNR-1102: Improve HTML link regex

77c03d0

SNR-1102: Improve HTML link regex

7b40a3e

tsellers-r7 requested review from cbarnard-r7, dabdine-r7 and pdeardorff-r7 June 18, 2020 15:57

tsellers-r7 commented Jun 18, 2020

cbarnard-r7 approved these changes Jun 18, 2020

tsellers-r7 merged commit e92f850 into rapid7:master Jun 18, 2020

tsellers-r7 deleted the SNR-1102_http_links_guardrail branch June 18, 2020 16:11

tsellers-r7 merged 2 commits into rapid7:master from tsellers-r7:SNR-1102_http_links_guardrail
