Dedup ideas #122

Closed
6 tasks
sckott opened this issue Jun 4, 2015 · 3 comments

Comments

@sckott
Contributor

sckott commented Jun 4, 2015

  • Match on dataset identifiers - possible only in some cases, e.g., if the data come from both VertNet and iDigBio, there may be problems
  • Match on all fields given? E.g., run an ordination-type analysis and return records that cluster together
  • Match on similar lat/long pairs
  • Cluster simply by taxonomy, then refine further, e.g., the same taxon names from different providers are more likely to be dups than those from within the same provider
  • ...
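
A rough sketch of how the lat/long and taxonomy ideas above could be combined, written in Python purely for illustration (not spocc code). The record fields `name`, `latitude`, `longitude` and `provider`, the function name `candidate_duplicates`, and the rounding precision are all assumptions, not anything spocc defines. It groups records by taxon name plus rounded coordinates, then keeps only groups that span more than one provider, since those are the ones suspected to be dups.

```python
from collections import defaultdict


def candidate_duplicates(records, decimals=3):
    """Group records sharing a taxon name and a rounded lat/long pair.

    `records` is a list of dicts with hypothetical keys:
    name, latitude, longitude, provider.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (
            rec["name"].strip().lower(),          # normalize the taxon name
            round(rec["latitude"], decimals),     # "similar" = same rounded value
            round(rec["longitude"], decimals),
        )
        groups[key].append(rec)

    # Keep only groups whose records come from more than one provider,
    # following the idea that cross-provider matches are likelier dups.
    return [
        grp for grp in groups.values()
        if len({r["provider"] for r in grp}) > 1
    ]
```

Rounding is a crude notion of "similar": two nearby points can still land in different bins, so a distance threshold might be a better fit if this gets used for real.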

Related

  • Some may not care about dups
  • Some may want to only collapse certain dups, and not others
  • ...

Discussion with alex/matt

  • Dedup based on institution code + catalog number (if I'm remembering correctly)
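
A minimal sketch of what keying on institution code plus catalog number might look like, again in Python for illustration only. The field names `institutioncode` and `catalognumber` and the function name `dedup_by_catalog_key` are assumptions; records missing either field are kept as-is, since no key can be built for them.

```python
def dedup_by_catalog_key(records):
    """Keep the first record seen for each (institution code, catalog number).

    `records` is a list of dicts; the keys `institutioncode` and
    `catalognumber` are hypothetical field names.
    """
    seen = set()
    unique = []
    for rec in records:
        inst = rec.get("institutioncode")
        cat = rec.get("catalognumber")
        if not inst or not cat:
            # No usable key, so this record cannot be matched; keep it.
            unique.append(rec)
            continue
        key = (str(inst).strip().upper(), str(cat).strip())
        if key in seen:
            continue  # same specimen already kept from an earlier record
        seen.add(key)
        unique.append(rec)
    return unique
```

Keeping the first record is an arbitrary choice here; which provider's copy should win is left open.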
@sckott sckott added the dedup label Jun 4, 2015
@sckott sckott self-assigned this Jun 4, 2015
@sckott sckott added this to the v0.3 milestone Jun 4, 2015
@sckott
Contributor Author

sckott commented Jun 4, 2015

paging @mjcollin

@sckott
Contributor Author

sckott commented Jun 5, 2015

Should do internal dedup in spocc to at least have something

@sckott sckott modified the milestones: v0.3, v0.4 Jul 2, 2015
@sckott
Contributor Author

sckott commented Jul 2, 2015

closing, moved to spoccutils#5

@sckott sckott closed this as completed Jul 2, 2015