Python ResultsReader class is incredibly slow #223

@mew1033

It's possible that I don't understand what all the ResultsReader class is doing, so take this issue with a grain of salt.

When using the ResultsReader class to get results from a Splunk search, as indicated here: https://github.com/splunk/splunk-sdk-python/blob/master/splunklib/results.py#L173-L181, it takes an incredibly long time to get all the results. Calling the job's results function directly and parsing the JSON output myself is orders of magnitude faster.

For example, on a search with 175k results, it takes 4+ minutes to get the results with ResultsReader objects, and 3.7 seconds with the results function. The following snippet shows what I'm talking about:

import splunklib.results as results
import splunklib.client as client
from datetime import datetime
import json


splunk_object = client.connect(
    host="host",
    port="port",
    username="username",
    password="password",
    app="app",
    verify=True,
    autologin=True)

spl = '| makeresults count=175000'

splunk_search_kwargs = {"exec_mode": "blocking",
                        "earliest_time": "-48h",
                        "latest_time": "now",
                        "enable_lookups": "true"}

splunk_search_job = splunk_object.jobs.create(spl, **splunk_search_kwargs)


start_time_json = datetime.now()
# Get the results from the Splunk search
search_results_json = []
# log_general.debug("Getting Splunk search results.")
get_offset = 0
max_get = 49000
result_count = int(splunk_search_job['resultCount'])
while (get_offset < result_count):
    r = splunk_search_job.results(**{"count": max_get, "offset": get_offset, "output_mode": "json"})
    obj = json.loads(r.read())
    search_results_json.extend(obj['results'])
    get_offset += max_get
# log_general.debug("Found %d results" % len(search_results))

end_time_json = datetime.now()


start_time = datetime.now()
# Get the results from the Splunk search
search_results = []
# log_general.debug("Getting Splunk search results.")
get_offset = 0
max_get = 49000
result_count = int(splunk_search_job['resultCount'])
while (get_offset < result_count):
    rr = results.ResultsReader(splunk_search_job.results(**{"count": max_get, "offset": get_offset}))
    for result in rr:
        if isinstance(result, results.Message):
            # Diagnostic messages may be returned in the results
            print('%s: %s' % (result.type, result.message))
        elif isinstance(result, dict):
            # Normal events are returned as dicts
            search_results.append(result)
    get_offset += max_get
# log_general.debug("Found %d results" % len(search_results))

end_time = datetime.now()

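# total_seconds() keeps the fractional part; .seconds would truncate to whole seconds
print("ResultsReader time: %s" % (end_time - start_time).total_seconds())
print("json_results time: %s" % (end_time_json - start_time_json).total_seconds())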

Is ResultsReader doing anything special that I miss out on by just getting the results in JSON mode directly? I know that ResultsReader uses XML under the hood, but that doesn't really matter to me; at the end of the day, I just need the results in a Python object.
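
For reference, the JSON path from the snippet can be wrapped in a small generator so the calling code looks much like iterating a ResultsReader. This is only a sketch: `iter_json_results` is a name I made up, it assumes the job object behaves like the one in the snippet (`job['resultCount']` plus `job.results(**kwargs)` returning something with `read()`), and the `"messages"` key with `type`/`text` fields is my assumption about the JSON body, not something I have confirmed against the SDK.

```python
import json


def iter_json_results(job, page_size=49000):
    """Yield result dicts from a finished search job, one page at a time.

    Hypothetical helper: `job` is assumed to expose job['resultCount'] and
    job.results(**kwargs) returning a file-like object, as in the snippet.
    """
    total = int(job['resultCount'])
    offset = 0
    while offset < total:
        response = job.results(count=page_size, offset=offset,
                               output_mode="json")
        page = json.loads(response.read())
        # Assumption: diagnostic messages, if any, sit under a "messages"
        # key next to "results"; .get() keeps this safe if they do not.
        for message in page.get("messages", []):
            print("%s: %s" % (message.get("type"), message.get("text")))
        for row in page.get("results", []):
            yield row
        offset += page_size
```

Usage would then be `search_results = list(iter_json_results(splunk_search_job))`, which keeps the paging out of the calling code.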
