How to enumerate id's of documents in watson discovery service #314

Closed
breathe opened this issue Nov 27, 2017 · 4 comments

Comments

@breathe

breathe commented Nov 27, 2017

Count and offset cannot be used to enumerate all of the (document id, checksum) pairs inside a Discovery service collection.

https://www.ibm.com/watson/developercloud/discovery/api/v1/#query-collection

The number of query results to omit from the start of the output. For example, if the count parameter is set to 10 and the offset parameter is set to 8, the query returns only the last two results. Do not use this parameter for deep pagination, as it impedes performance. The default is 0. The maximum for the count and offset values together in any one query is 10000. Note: The maximum number of results returned for a Watson Discovery News query is 50. Use additional queries and the offset parameter to return more than 50 results.
--
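To make the limit concrete, here is a small sketch of offset paging under the documented cap. The function name `offset_pages` and the page size are illustrative assumptions, and no Discovery call is made; it only computes which (offset, count) windows are allowed.

```python
MAX_WINDOW = 10000  # documented cap on count + offset in a single query


def offset_pages(total, page_size, max_window=MAX_WINDOW):
    """Yield (offset, count) pairs that stay within the count + offset cap."""
    offset = 0
    while offset < total and offset < max_window:
        # Never ask for more than remains, and never cross the cap.
        count = min(page_size, total - offset, max_window - offset)
        yield offset, count
        offset += count


# Hypothetical collection of 20,000 documents, paged 4,000 at a time.
pages = list(offset_pages(total=20000, page_size=4000))
reachable = sum(count for _, count in pages)
```

With these numbers only the first 10,000 documents are reachable, which is why count and offset alone cannot cover a ~20k collection.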

I want to synchronize a dataset of ~20k documents to the Discovery service, periodically updating the subset of (infrequently changing) documents whose source-of-truth checksum differs from the corresponding checksum in the Discovery service. How can this be accomplished without separately tracking and storing the state of every mutation performed against the Discovery service? (That would be error prone and unnecessarily complicated for my scenario.)
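As a rough sketch of the comparison step, assuming both sides can be reduced to a `{document_id: sha1}` mapping (the function name `plan_sync` and the sample ids are hypothetical, and getting the remote mapping is exactly the open question here):

```python
def plan_sync(source, remote):
    """
    Compare two {document_id: sha1} mappings and return three sorted lists:
    ids to add, ids to update (checksum differs), and ids to delete.
    """
    source_ids = set(source)
    remote_ids = set(remote)
    to_add = sorted(source_ids - remote_ids)
    to_delete = sorted(remote_ids - source_ids)
    to_update = sorted(doc_id for doc_id in source_ids & remote_ids
                       if source[doc_id] != remote[doc_id])
    return to_add, to_update, to_delete


# Illustrative data only.
source = {"a": "111", "b": "222", "c": "333"}
remote = {"b": "222", "c": "999", "d": "444"}
to_add, to_update, to_delete = plan_sync(source, remote)
```

Because the plan is derived entirely from the two mappings, nothing about earlier uploads has to be stored on the client side.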

@ry0ohki

ry0ohki commented Nov 29, 2017

There is a limit of 10,000 results at this time. As a workaround, is there any other value you can use to "page"? For example, if your IDs are sequential, you could filter on a range such as id:0 > && < id:500, or use a date or some other field that allows a range.
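A minimal sketch of generating such range filters, assuming a numeric field (here hypothetically named `id`) and assuming the Discovery Query Language comparison operators `>=` and `<` with `,` meaning "and". Check the exact filter syntax against the Discovery documentation before relying on it.

```python
def range_filters(field, start, stop, step):
    """Yield filter strings that cover [start, stop) in windows of `step`."""
    for low in range(start, stop, step):
        high = min(low + step, stop)
        # Assumed syntax: "field>=low,field<high" (comma acts as AND).
        yield "{0}>={1},{0}<{2}".format(field, low, high)


filters = list(range_filters("id", 0, 1200, 500))
```

Each string would be passed as the `filter` of a separate query, with `step` chosen so that no window can match more than 10,000 documents.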

@breathe
Author

breathe commented Nov 29, 2017

There isn't a natural sequence ordering defined for my dataset ... I would need to go back and augment the schema of the source data to include a sequence identifier -- or try to derive one from the source data ... But I'm not entirely sure how to derive such an identifier deterministically in a way that would allow for bounded range queries ...

A derived sequence identifier would need to be deterministic and based only on the contents of an individual document -- otherwise I won't be able to compute the sha1 checksum of the document to compare against the extracted_metadata.sha1 field ... Is it possible to somehow mark fields in the document for exclusion from the hash algorithm ...?

EDIT: I suppose I can include my own hash value in the document body rather than relying on the server-generated hash ...
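One way that idea might look, as a sketch only: hash a canonical JSON form of the document while skipping the bookkeeping fields, then store the hash (and a bucket number derived from it) in the document itself. The field names `content_sha1` and `bucket` are hypothetical, not anything Discovery defines.

```python
import hashlib
import json


def content_sha1(document, exclude=("content_sha1", "bucket")):
    """SHA1 of the document's canonical JSON, ignoring the excluded fields."""
    payload = {k: v for k, v in document.items() if k not in exclude}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def with_checksum(document):
    """Return a copy of the document carrying its own hash and a bucket."""
    digest = content_sha1(document)
    stamped = dict(document)
    stamped["content_sha1"] = digest
    # First two hex digits as an integer: 0..255, usable in range filters.
    stamped["bucket"] = int(digest[:2], 16)
    return stamped


doc = {"title": "example", "body": "some text"}
stamped = with_checksum(doc)
```

Since the excluded fields never feed into the hash, the value can be recomputed from a stamped document and compared directly, and the bucket gives a deterministic number to range over.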

@jsstylos
Contributor

jsstylos commented Dec 1, 2017

Closing as this isn't an issue with the Python SDK.

@jsstylos jsstylos closed this as completed Dec 1, 2017
@bruceadams
Contributor

@breathe I have been struggling with much the same problem. I did come up with a workable solution.

This code depends on two libraries: pip install python-pmap watson-developer-cloud

Note that what I wanted was the SHA1 values themselves, but one could change this code to return document ids or anything else instead. This code still needs to do its partitioning of the collection based on the SHA1 values (because the SHA1 values have a known, limited character set), but the values returned can be anything found in the collection.

from pmap import pmap  # provided by the python-pmap package


def existing_sha1s(discovery,
                   environment_id,
                   collection_id):
    """
    Return a list of all of the extracted_metadata.sha1 values found in a
    Watson Discovery collection.

    The arguments to this function are:
    discovery      - an instance of DiscoveryV1
    environment_id - an environment id found in your Discovery instance
    collection_id  - a collection id found in the environment above
    """
    sha1s = []
    alphabet = "0123456789abcdef"   # Hexadecimal digits, lowercase
    chunk_size = 10000

    def maybe_some_sha1s(prefix):
        """
        A helper function that does the query and returns either:
        1) A list of SHA1 values
        2) The `prefix` that needs to be subdivided into more focused queries
        """
        response = discovery.query(environment_id,
                                   collection_id,
                                   {"count": chunk_size,
                                    "filter": "extracted_metadata.sha1::"
                                              + prefix + "*",
                                    "return": "extracted_metadata.sha1"})
        if response["matching_results"] > chunk_size:
            return prefix
        else:
            return [item["extracted_metadata"]["sha1"]
                    for item in response["results"]]

    prefixes_to_process = [""]
    while prefixes_to_process:
        prefix = prefixes_to_process.pop(0)
        prefixes = [prefix + letter for letter in alphabet]
        # `pmap` here does the requests to Discovery concurrently to save time.
        results = pmap(maybe_some_sha1s, prefixes, threads=len(prefixes))
        for result in results:
            if isinstance(result, list):
                sha1s += result
            else:
                prefixes_to_process.append(result)

    return sha1s
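The subdivision logic above can be tried without a Discovery instance. The following is a sequential sketch of the same prefix-splitting idea, with the query replaced by an injected `search` callable; `enumerate_by_prefix` and `fake_search_factory` are hypothetical names, and the in-memory list only imitates the "too many matches" behaviour.

```python
def enumerate_by_prefix(search, alphabet="0123456789abcdef", limit=10000):
    """
    Collect every value by splitting on ever longer prefixes.

    `search(prefix)` must return (total_matches, values), where `values`
    holds at most `limit` of the matches.
    """
    found = []
    pending = [""]
    while pending:
        prefix = pending.pop()
        total, values = search(prefix)
        if total > limit:
            # Too many matches for one query: split into narrower prefixes.
            pending.extend(prefix + letter for letter in alphabet)
        else:
            found.extend(values)
    return found


def fake_search_factory(values, limit):
    """Build a stand-in for the Discovery query over an in-memory list."""
    def search(prefix):
        matches = [v for v in values if v.startswith(prefix)]
        return len(matches), matches[:limit]
    return search


# 300 zero-padded hex strings, enumerated with at most 50 results per query.
values = ["%04x" % i for i in range(300)]
found = enumerate_by_prefix(fake_search_factory(values, 50), limit=50)
```

Unlike the function above, this sketch queries the empty prefix first and runs one request at a time, which keeps it easy to test at the cost of speed.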

4 participants