How to enumerate id's of documents in watson discovery service #314

breathe · 2017-11-27T18:11:41Z

Count and offset are not useable to enumerate the (document id, checksum) pairs inside a discovery service collection.

https://www.ibm.com/watson/developercloud/discovery/api/v1/#query-collection

The number of query results to omit from the start of the output. For example, if the count parameter is set to 10 and the offset parameter is set to 8, the query returns only the last two results. Do not use this parameter for deep pagination, as it impedes performance. The default is 0. The maximum for the count and offset values together in any one query is 10000. Note: The maximum number of results returned for a Watson Discovery News query is 50. Use additional queries and the offset parameter to return more than 50 results.
--

I want to synchronize a dataset of ~20k documents to discovery service, periodically updating the subset of (infrequently changing) documents whose source-of-truth checksum differs from the corresponding checksum in discovery service. How can this be accomplished without separately tracking/storing the state of mutations performed against discovery service? (That would be error prone and unnecessarily complicated for my scenario.)
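To make the comparison step concrete, here is a minimal sketch. It assumes both sides can be reduced to a mapping of document id to checksum; the name `plan_sync` and the dict shapes are illustrative only, and how to obtain the remote mapping from the collection is the open question here.

```python
def plan_sync(local, remote):
    """Compare two {document_id: checksum} mappings.

    Returns three sorted lists: ids to add, ids to update, ids to delete.
    """
    # Present locally but missing from the collection.
    to_add = sorted(doc_id for doc_id in local if doc_id not in remote)
    # Present on both sides with a different checksum.
    to_update = sorted(doc_id for doc_id in local
                       if doc_id in remote and local[doc_id] != remote[doc_id])
    # Present in the collection but no longer in the source of truth.
    to_delete = sorted(doc_id for doc_id in remote if doc_id not in local)
    return to_add, to_update, to_delete


local = {"a": "111", "b": "222", "c": "333"}
remote = {"b": "222", "c": "999", "d": "444"}
print(plan_sync(local, remote))
```

With this in place, no mutation log is needed: each run recomputes the plan from the two current states.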

ry0ohki · 2017-11-29T01:57:50Z

There is a limit of 10,000 results at this time. As a workaround, is there any other value you can use to "page"? For example, if your IDs are sequential, you could filter on a range such as id:0 > && < id:500, or use a date or some other field that allows a range.
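A rough sketch of the range idea, assuming the documents carry a numeric field (called `seq` here, a hypothetical name) and that the filter language accepts comparison operators joined by a comma for AND. Both are assumptions to check against the Discovery query reference.

```python
def range_filters(field, start, stop, step):
    """Yield one filter string per half-open window [lo, hi)."""
    for lo in range(start, stop, step):
        hi = min(lo + step, stop)
        # Assumed filter syntax: "," combines the two comparisons with AND.
        yield f"{field}>={lo},{field}<{hi}"


filters = list(range_filters("seq", 0, 1200, 500))
for f in filters:
    print(f)
```

Each window should be small enough that the number of matches stays under the 10,000 result limit.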

breathe · 2017-11-29T18:01:07Z

There isn't a natural sequence ordering defined for my dataset ... I would need to go back and augment the schema of the source data to include a sequence identifier -- or try to derive one from the source data ... But I'm not entirely sure how to derive such an identifier deterministically in a way that would allow for bounded range queries ...

A derived sequence identifier would need to be deterministic and based only on the contents of an individual document -- otherwise I won't be able to compute the sha1 checksum of the document to compare against the extracted_metadata.sha1 field ... Is it possible to somehow mark fields in the document for exclusion from the hash algorithm ...?

EDIT: I suppose I can include my own hash value in the document body rather than relying on the server generated hash ...
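A minimal sketch of that idea: compute a SHA1 over a canonical JSON form of the document, skipping the field that stores the hash itself. The field name `content_sha1` and the helper names are made up for illustration, and only top-level fields are excluded.

```python
import hashlib
import json

HASH_FIELD = "content_sha1"  # hypothetical field name, not part of Discovery


def content_sha1(document, exclude=(HASH_FIELD,)):
    """SHA1 of the document's JSON form, ignoring the excluded top-level keys."""
    kept = {key: value for key, value in document.items() if key not in exclude}
    # sort_keys and fixed separators make the serialization deterministic.
    canonical = json.dumps(kept, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def with_hash(document):
    """Return a copy of the document with its own content hash attached."""
    stamped = dict(document)
    stamped[HASH_FIELD] = content_sha1(document)
    return stamped


doc = {"title": "x", "body": "y"}
stamped = with_hash(doc)
print(stamped[HASH_FIELD] == content_sha1(stamped))
```

Because the hash field is excluded from its own input, extra derived fields (such as a sequence identifier) can be added to the exclusion tuple without changing the checksum.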

jsstylos · 2017-12-01T17:22:01Z

Closing as this isn't an issue with the Python SDK.

bruceadams · 2017-12-03T20:08:33Z

@breathe I have been struggling with much the same problem. I did come up with a workable solution.

This code depends on two libraries: pip install python-pmap watson-developer-cloud

Note that what I wanted was the SHA1 values themselves, but one could change this code to return document ids or anything else instead. This code still needs to do its partitioning of the collection based on the SHA1 values (because the SHA1 values have a known, limited character set), but the values returned can be anything found in the collection.

from pmap import pmap  # assumed import path for the python-pmap package


def existing_sha1s(discovery,
                   environment_id,
                   collection_id):
    """
    Return a list of all of the extracted_metadata.sha1 values found in a
    Watson Discovery collection.

    The arguments to this function are:
    discovery      - an instance of DiscoveryV1
    environment_id - an environment id found in your Discovery instance
    collection_id  - a collection id found in the environment above
    """
    sha1s = []
    alphabet = "0123456789abcdef"   # Hexadecimal digits, lowercase
    chunk_size = 10000

    def maybe_some_sha1s(prefix):
        """
        A helper function that does the query and returns either:
        1) A list of SHA1 values
        2) The `prefix` that needs to be subdivided into more focused queries
        """
        response = discovery.query(environment_id,
                                   collection_id,
                                   {"count": chunk_size,
                                    "filter": "extracted_metadata.sha1::"
                                              + prefix + "*",
                                    "return": "extracted_metadata.sha1"})
        if response["matching_results"] > chunk_size:
            return prefix
        else:
            return [item["extracted_metadata"]["sha1"]
                    for item in response["results"]]

    prefixes_to_process = [""]
    while prefixes_to_process:
        prefix = prefixes_to_process.pop(0)
        prefixes = [prefix + letter for letter in alphabet]
        # `pmap` here does the requests to Discovery concurrently to save time.
        results = pmap(maybe_some_sha1s, prefixes, threads=len(prefixes))
        for result in results:
            if isinstance(result, list):
                sha1s += result
            else:
                prefixes_to_process.append(result)

    return sha1s
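For trying the same partitioning idea without a Discovery instance, here is a standalone sketch. It swaps `pmap` for the standard library's `ThreadPoolExecutor` and replaces the service call with an in-memory stand-in; `enumerate_by_prefix`, `fake_query`, and `STORE` are illustrative names, not part of any SDK.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

ALPHABET = "0123456789abcdef"  # hexadecimal digits, lowercase


def enumerate_by_prefix(query, limit):
    """Collect every value by subdividing hex prefixes.

    `query(prefix)` must return (matching_count, values), where `values`
    holds at most `limit` of the matching values.
    """
    found = []
    pending = [""]
    with ThreadPoolExecutor(max_workers=len(ALPHABET)) as pool:
        while pending:
            prefix = pending.pop(0)
            children = [prefix + digit for digit in ALPHABET]
            # pool.map preserves input order, so zip pairs results correctly.
            for child, (count, values) in zip(children, pool.map(query, children)):
                if count > limit:
                    pending.append(child)  # too many matches: split further
                else:
                    found.extend(values)
    return found


# In-memory stand-in for a collection of 2000 distinct SHA1 values.
STORE = [hashlib.sha1(str(n).encode()).hexdigest() for n in range(2000)]


def fake_query(prefix, limit=50):
    """Mimic a capped query: total match count plus at most `limit` values."""
    matches = [sha1 for sha1 in STORE if sha1.startswith(prefix)]
    return len(matches), matches[:limit]


result = enumerate_by_prefix(fake_query, limit=50)
print(len(result) == len(STORE))
```

The prefixes are disjoint, so no value is returned twice. The sketch does not handle the case where more than `limit` documents share one identical value.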

jsstylos closed this as completed Dec 1, 2017
