
Adding results.JSONReader #435

Closed
wants to merge 6 commits into from

Conversation

@yaleman commented Feb 12, 2022

results.ResultsReader is slow because it iterates byte-by-byte through the stream, massaging the XML into a form the chosen parser will accept. I've added JSONResultsReader to provide a much more performant option.

Benefits:

  • Less data transferred over the wire.
  • ~99% faster on most workloads I've tested.
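
The speed-up comes from the JSON export stream being newline-delimited: each line is a complete JSON object, so the reader can hand whole lines to `json.loads` instead of walking the stream byte-by-byte. A minimal stdlib-only sketch of that approach, using invented sample data rather than a real Splunk response:

```python
import io
import json

# Hypothetical sample of a newline-delimited JSON export stream.
sample = io.BytesIO(
    b'{"preview": true, "result": {"host": "web-1", "count": "3"}}\n'
    b'{"preview": false, "result": {"host": "web-2", "count": "7"}}\n'
)

results = []
for line in sample:
    event = json.loads(line)      # one parse call per line, not per byte
    if not event.get("preview"):  # keep final (non-preview) results only
        results.append(event["result"])

print(results)  # [{'host': 'web-2', 'count': '7'}]
```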

Other changes in this commit

  • Removed unnecessary imports.
  • Fixed some places where indenting was 2 spaces, not 4.
  • Renamed some input variables that aren't referenced by name, to match Python naming conventions.

Test Data

Running the tests from https://github.com/yaleman/splunk-sdk-games/

(you can just clone the repo, configure it and run ./run_tests.sh)

Generating the local files, so that we're not testing the response of my Splunk instance:

➜ ./run_tests.sh
Generating local JSON data

real	0m25.155s
user	0m0.532s
sys	0m0.661s

Generating local XML data

real	0m29.173s
user	0m0.991s
sys	0m1.462s

File sizes

-rw-r--r--  1 yaleman  staff   161M 12 Feb 13:17 outputfile.json
-rw-r--r--  1 yaleman  staff   374M 12 Feb 13:17 outputfile.xml

Using results.ResultsReader and the job in XML output format:

Running test_file_resultsreader.py
message:  "INFO: Your timerange was substituted based on your search string"
Results: 113625
Preview results: 9963

real	2m2.416s
user	2m0.997s
sys	0m0.405s

Using results.JSONResultsReader and the job in JSON output format:

Running test_file_jsonreader.py
Results: 113624
Preview results: 12500

real	0m1.023s
user	0m0.856s
sys	0m0.090s

The 1 "missing" result is the Message that the JSON export endpoint doesn't return.

@ashah-splunk ashah-splunk changed the base branch from master to develop February 15, 2022 16:13
@AndyXan commented Feb 18, 2022

Hmm. Just for clarification:

If I'm just issuing regular (non-export) queries via the Splunk SDK and getting a job back:

job = self.service.jobs.create(search_query, **search_params)
current_results = job.results(output_mode='json', count=50000, offset=0)
reader = results.JSONResultsReader(current_results)

The API returns a dict containing messages = [...], results = [...], is_preview, and so on.

Your implementation of

    def _parse_results(self, stream: BufferedReader):
        """Parse results and messages out of *stream*."""
        for line in stream.readlines():

            event = json_loads(line)
            if "preview" in event:
                self.is_preview = event["preview"]
            if "msg" in event:
                msg_type = event.get("type", "Unknown Message Type")
                text = event.get("text")
                yield Message(msg_type, text)
            yield event

would not return single events. The "single" event would consist of arrays (results) containing the 50,000 events. This is rather different from

current_results = job.results(output_mode='xml', count=50000, offset=0)
reader = results.ResultsReader(io.BufferedReader(ResponseReaderWrapper(current_results)))

for result in reader:
    if isinstance(result, dict):
        log.info("Result: %s" % result)
    elif isinstance(result, results.Message):
        log.info("Message: %s" % result)
log.info("is_preview = %s " % reader.is_preview)

which would print 50,000 log lines. The solution above would log once: one giant dict whose results key contains the 50,000 events.

@AndyXan commented Feb 18, 2022

This is more of what I would expect:

    def _parse_results(self, stream: BufferedReader):
        """Parse results and messages out of *stream*."""
        for line in stream.readlines():

            bulk = json_loads(line)
            if "preview" in bulk:
                self.is_preview = bulk["preview"]

            for message in bulk.get('messages', []):
                msg_type = message.get("type", "Unknown Message Type")
                text = message.get("text")
                yield Message(msg_type, text)

            for event in bulk.get('results', []):
                yield event
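
A standalone, runnable version of the parser above (the stand-in Message type and sample data are mine, for illustration) shows it yielding individual messages and results instead of one bulk dict:

```python
import io
import json
from collections import namedtuple

# Stand-in for splunklib's Message type.
Message = namedtuple("Message", ["type", "message"])

def parse_results(stream):
    """Yield Message objects and individual result dicts from *stream*."""
    for line in stream:
        bulk = json.loads(line)
        for message in bulk.get("messages", []):
            yield Message(message.get("type", "Unknown Message Type"),
                          message.get("text"))
        for event in bulk.get("results", []):
            yield event

# Invented sample mimicking the non-export JSON response shape.
stream = io.BytesIO(
    b'{"preview": false, "messages": [{"type": "INFO", "text": "ok"}], '
    b'"results": [{"a": "1"}, {"a": "2"}]}\n'
)

items = list(parse_results(stream))
for item in items:
    print(item)  # one message, then each result on its own
```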

@yaleman (Author) commented Feb 19, 2022

I think 78079ae updates the code to something that should work.

I've updated my test code over in https://github.com/yaleman/splunk-sdk-games so that seems to work too.

Running test_file_jsonreader.py
RESULT_COUNT=113624
MESSAGE_COUNT=0
PREVIEW_COUNT=12500

real	0m1.034s
user	0m0.873s
sys	0m0.089s

Running test_file_jsonreader_create.py
RESULT_COUNT=113624
MESSAGE_COUNT=12
PREVIEW_COUNT=0

real	0m1.035s
user	0m0.837s
sys	0m0.140s

Running test_file_resultsreader.py
RESULT_COUNT=113624
MESSAGE_COUNT=1
PREVIEW_COUNT=25000

real	2m17.480s
user	2m16.443s
sys	0m0.706s

If someone can provide a way of testing it in docker or something that'd be cool but I think it works.

I can't run the built-in tests because Python 3.7 on an M1 MacBook doesn't have _ctypes, and I really don't want to spend more hours building VMs to test a library I don't even use.

@ashah-splunk ashah-splunk changed the base branch from develop to json-resultreader March 4, 2022 09:05
@ashah-splunk (Contributor)

@yaleman thanks for the PR. We are considering the suggested JSONResultsReader approach and exploring it further.

@ashah-splunk (Contributor)

@yaleman we have incorporated your JSONResultsReader changes, with some modifications, and they are now available in the latest Python SDK release, 1.6.19.

@yaleman yaleman deleted the jsonreader branch April 10, 2022 05:27