How to enumerate IDs of documents in Watson Discovery service #314
There is a limit of 10,000 results at this time. As a workaround, is there any other value you can use to "page"? For example, if your IDs are sequential you could do a filter for a range of IDs.
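The suggested workaround could be sketched as below: generate non-overlapping range filters over a sequential field, each window sized to stay under the 10,000-result cap. The field name `seq` and the comma-as-AND filter syntax are illustrative assumptions; check them against the Discovery query-language reference.

```python
def range_filters(field, total, page_size):
    """Build filter strings that page over a numeric field in
    non-overlapping windows of `page_size` documents each.

    Assumes (illustratively) that the Discovery filter syntax supports
    `field>=N`, `field<N`, and comma as logical AND.
    """
    filters = []
    lo = 0
    while lo < total:
        hi = lo + page_size
        filters.append(f"{field}>={lo},{field}<{hi}")
        lo = hi
    return filters

print(range_filters("seq", 25000, 10000))
# → ['seq>=0,seq<10000', 'seq>=10000,seq<20000', 'seq>=20000,seq<30000']
```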
There isn't a natural sequence ordering defined for my dataset. I would need to go back and augment the schema of the source data to include a sequence identifier, or try to derive one from the source data, but I'm not entirely sure how to derive such an identifier deterministically in a way that would allow for bounded range queries. A derived sequence identifier would need to be deterministic and based only on the contents of an individual document; otherwise I won't be able to compute the sha1 checksum of the document to compare against the extracted_metadata.sha1 field. Is it possible to somehow mark fields in the document for exclusion from the hash algorithm? EDIT: I suppose I can include my own hash value in the document body rather than relying on the server-generated hash.
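The idea in the EDIT, hashing only author-controlled fields and storing the digest in the document body, can be sketched like this. The excluded field names are illustrative assumptions, not part of any Discovery API; the point is that server-populated fields are dropped before hashing, so the digest depends only on the source content.

```python
import hashlib
import json

def content_sha1(doc, exclude=("extracted_metadata", "enriched_text")):
    """Deterministic SHA1 over a document's author-controlled fields.

    `exclude` names fields to drop before hashing (illustrative here),
    so server-added metadata does not perturb the digest. Serializing
    with sorted keys and fixed separators makes the result stable
    across runs and key orderings.
    """
    stable = {k: v for k, v in doc.items() if k not in exclude}
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()
```

Storing `content_sha1(doc)` in a field of the uploaded document would let later syncs compare checksums without caring what the server adds.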
Closing as this isn't an issue with the Python SDK.
@breathe I have been struggling with much the same problem. I did come up with a workable solution. This code depends on two libraries: the Watson Developer Cloud Python SDK (for `DiscoveryV1`) and one providing the parallel-map function `pmap` used below. Note that what I wanted was the SHA1 values themselves, but one could change this code to return document ids or anything else instead. The code still has to do its partitioning of the collection based on the SHA1 values (because the SHA1 values have a known, limited character set), but the values returned can be anything found in the collection.

```python
def existing_sha1s(discovery, environment_id, collection_id):
    """
    Return a list of all of the extracted_metadata.sha1 values found in a
    Watson Discovery collection.

    The arguments to this function are:
    discovery      - an instance of DiscoveryV1
    environment_id - an environment id found in your Discovery instance
    collection_id  - a collection id found in the environment above
    """
    sha1s = []
    alphabet = "0123456789abcdef"  # Hexadecimal digits, lowercase
    chunk_size = 10000

    def maybe_some_sha1s(prefix):
        """
        A helper function that does the query and returns either:
        1) A list of SHA1 values
        2) The `prefix` that needs to be subdivided into more focused queries
        """
        response = discovery.query(environment_id,
                                   collection_id,
                                   {"count": chunk_size,
                                    "filter": "extracted_metadata.sha1::"
                                              + prefix + "*",
                                    "return": "extracted_metadata.sha1"})
        if response["matching_results"] > chunk_size:
            return prefix
        else:
            return [item["extracted_metadata"]["sha1"]
                    for item in response["results"]]

    prefixes_to_process = [""]
    while prefixes_to_process:
        prefix = prefixes_to_process.pop(0)
        prefixes = [prefix + letter for letter in alphabet]
        # `pmap` here does the requests to Discovery concurrently to save time.
        results = pmap(maybe_some_sha1s, prefixes, threads=len(prefixes))
        for result in results:
            if isinstance(result, list):
                sha1s += result
            else:
                prefixes_to_process.append(result)
    return sha1s
```
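The prefix-subdivision control flow above can be exercised without a live Discovery instance by stubbing out the query. This sequential sketch (the names are mine, not part of any SDK) returns the same set of values the concurrent version would: any prefix whose match count exceeds the chunk size is split into longer prefixes until every query fits.

```python
def enumerate_values(query, chunk_size, alphabet="0123456789abcdef"):
    """Breadth-first walk over hex prefixes.

    `query(prefix)` returns a pair (matching_results, values); the
    values are only trusted when matching_results <= chunk_size,
    mirroring the Discovery result cap.
    """
    found = []
    todo = [""]
    while todo:
        prefix = todo.pop(0)
        for letter in alphabet:
            candidate = prefix + letter
            total, values = query(candidate)
            if total > chunk_size:
                todo.append(candidate)  # too many hits: subdivide further
            else:
                found.extend(values)
    return found

# Offline demonstration with a fake collection and a cap of 2.
data = ["aa", "ab", "ac", "ba"]

def fake_query(prefix):
    hits = [v for v in data if v.startswith(prefix)]
    return len(hits), (hits if len(hits) <= 2 else [])

print(sorted(enumerate_values(fake_query, chunk_size=2)))
# → ['aa', 'ab', 'ac', 'ba']
```

The prefix "a" matches three values, more than the cap of 2, so it is subdivided into "aa", "ab", "ac", each of which then fits in a single query.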
Count and offset are not usable to enumerate the (document id, checksum) pairs inside a Discovery service collection.
https://www.ibm.com/watson/developercloud/discovery/api/v1/#query-collection
I want to synchronize a dataset of ~20k documents to Discovery service, periodically updating the subset of (infrequently changing) documents whose source-of-truth checksum differs from the corresponding checksum in Discovery. How can this task be accomplished without separately tracking/storing the state of mutations performed against Discovery service? (That would be error prone and unnecessarily complicated for my scenario.)
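A minimal sketch of the sync step being described, assuming the checksums already in the collection have been enumerated (e.g. by prefix-partitioned queries) and local checksums computed from the source of truth. All names here are hypothetical; only documents whose current checksum is absent remotely need (re)uploading, so no separate mutation log is required.

```python
def plan_sync(local_checksums, remote_sha1s):
    """Decide which documents need (re)uploading.

    `local_checksums` maps doc_id -> sha1 computed from the source of
    truth; `remote_sha1s` is the set of sha1 values already stored in
    the collection. A document is stale iff its checksum is not found
    remotely, so the collection itself acts as the sync state.
    """
    return {doc_id for doc_id, sha1 in local_checksums.items()
            if sha1 not in remote_sha1s}

print(plan_sync({"doc1": "aaa", "doc2": "bbb"}, {"aaa"}))
# → {'doc2'}
```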