
Continuous operation use case #133

Closed
rljonesiii opened this issue Jan 17, 2022 · 2 comments
Labels
question Further information is requested

Comments

@rljonesiii

  1. Once a Zingg model is trained, I think I can safely assume that the computational complexity (in big-O) is the same whether doing linking or further de-duplicating, yes?
    1. Even if the complexity is the same, which do you consider the more performant mode of operation, fuzzy matching for deduplication or linking, given the underlying machinery of search/match with blocking and classification on feature vectors?
    2. This assumes that Zingg is searching/matching only one or a few records against a trained model. Does that make sense?
    3. This leads me to . . .
  2. After mastering data into the "Master Database" with Zingg, is there also a practical and viable use case for Zingg where we merely want to match a single record (likely noisy, and almost certainly missing some field values) against the Master Database? The problems that I see with that are:
    1. Zingg seems to be meant for mastering in bulk (en masse) with at least two large datasets (files or databases), rather than finding a match for a single record against one large dataset (the Master Database). For instance, is there a more convenient way of inputting the single record, such as JSON, streaming, or a run-time argument, as opposed to writing it to a file?
    2. For this use case it would be desirable for Zingg to run continuously, attaching to the "Master Database" once, in either case:
      • making a connection once at start-up, then issuing queries in an (infinite) sequence,
      • or, if a big Parquet/CSV file serves as the reference, reading that file only once and accepting the new one- or two-record arguments as they arrive.
    3. If I'm not mistaken, as of now we have to invoke Zingg as a separate Spark job for each single record as it arrives, so the big reference file is opened and read each time, along with re-initializing everything else,
      • as opposed to invoking it once and continuously executing on each single record in turn (a rough sketch of what I have in mind follows this list).
  3. Do you have any recommendations as to best practices for linking records when one (perhaps the only) feature is a geohash or geocode? These are alphanumeric identifiers (strings) that have an intrinsic hierarchical spatial structure. A similarity function could, and should, exploit that hierarchy, as opposed to Zingg's current built-in similarity functions. Do you have a recommendation for a similarity function in your repertoire, other than our writing a custom function?
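
To make the continuous case concrete, here is a rough sketch of what I have in mind. None of this is Zingg's actual API: the path, the block_key rule, and the field-equality score are illustrative placeholders for Zingg's learned blocking tree and trained classifier.

```python
# Rough sketch (not Zingg's API): a long-running matcher that reads the
# reference data once and serves single-record lookups from memory.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("continuous-matcher").getOrCreate()

# Read the "Master Database" once at start-up and keep it cached.
master = spark.read.parquet("/data/master.parquet")            # illustrative path
master = master.withColumn("block_key", F.substring("geohash", 1, 4)).cache()
master.count()                                                 # force materialization

def match_one(record: dict, top_n: int = 5):
    """Match a single, possibly noisy record against the cached master."""
    candidates = master.filter(F.col("block_key") == record["geohash"][:4])
    # Placeholder score: number of exactly matching fields. A real deployment
    # would apply the trained pairwise classifier to each candidate instead.
    score = sum(
        (F.when(F.col(c) == F.lit(v), 1).otherwise(0)
         for c, v in record.items()
         if c in master.columns and v is not None),
        F.lit(0),
    )
    return (candidates.withColumn("score", score)
                      .orderBy(F.col("score").desc())
                      .limit(top_n)
                      .collect())

# Records can then arrive in an (infinite) loop, e.g. from a queue or a REST
# endpoint, without re-reading the reference file or re-initializing Spark.
```

Because the blocking key is precomputed on the cached master, each lookup touches only a small slice of the data; avoiding the per-record Spark job start-up cost is the whole point.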
@rljonesiii rljonesiii added the question Further information is requested label Jan 17, 2022
@sonalgoyal
Member

  • Linking is more performant than matching: you match against a master list, so graph computations are not done and all links are simply rolled up to the master.
  • Zingg can be enhanced to support the incremental use case. For that, we need to understand the deployment patterns a bit more: where is the master saved? What happens to the updates, etc.?
  • If you can explain the geocode/geohash details more, we are happy to provide the custom functions.
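
One simple candidate for such a custom function, sketched below with no claim that it is the right choice, is a prefix-based similarity: geohashes are hierarchical, so geohashes of nearby points usually share a long common prefix. The caveat is that points just across a cell boundary can be spatially close while sharing only a short prefix.

```python
# Rough sketch of a prefix-based geohash similarity (not a Zingg built-in).
# Each extra geohash character refines the cell, so the fraction of leading
# characters two geohashes share is a simple hierarchical similarity in [0, 1].

def geohash_similarity(a: str, b: str) -> float:
    """Fraction of leading characters shared by two geohashes."""
    if not a or not b:
        return 0.0
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b))

# Example: two points in the same ~5 km cell share a 5-character prefix.
print(geohash_similarity("9q8yyk8", "9q8yyjx"))   # 5/7 ~= 0.71
print(geohash_similarity("9q8yyk8", "dr5regw"))   # 0.0
```

The same prefix idea could also drive blocking, comparing only records whose geohashes share, say, the first four characters, so that the classifier only sees spatially plausible pairs.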

@sonalgoyal
Member

@rljonesiii any comments here? Need further help?
