Scalable identity resolution, entity resolution, data mastering and deduplication using ML
Updated Nov 15, 2024 - Java
Entity resolution (also known as data matching, data linkage, record linkage, and many other terms) is the task of finding records in a dataset that refer to the same entity across different data sources (e.g., data files, books, websites, and databases). Entity resolution is necessary when joining different datasets based on entities that may or may not share a common identifier (e.g., database key, URI, national identification number). A shared identifier may be missing because of differences in record shape, storage location, or curator style or preference.
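To make the idea of matching without a common identifier concrete, here is a minimal Java sketch. It does not reflect how any of the listed projects work: the class name `RecordLinkageSketch`, its helper methods, and the 0.85 threshold are all hypothetical choices for illustration. It normalizes two name fields, compares them with Levenshtein edit distance, and treats the pair as the same entity when the similarity passes the threshold.

```java
import java.util.Locale;

// Hypothetical illustration: fuzzy matching of a single name field.
final class RecordLinkageSketch {

    // Normalize a free-text field: lowercase, keep only letters and digits.
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]", "");
    }

    // Classic dynamic-programming Levenshtein edit distance (two rolling rows).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]; 1.0 means identical after normalization.
    static double similarity(String a, String b) {
        String x = normalize(a);
        String y = normalize(b);
        int longest = Math.max(x.length(), y.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(x, y) / longest;
    }

    // Assumed decision rule: a pair matches when similarity reaches the threshold.
    static boolean sameEntity(String a, String b, double threshold) {
        return similarity(a, b) >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(sameEntity("Jon Smith", "John  Smith", 0.85));
        System.out.println(sameEntity("Jon Smith", "Mary Jones", 0.85));
    }
}
```

Real systems usually compare several fields, weight them, and use blocking to avoid comparing every pair of records. This single-field comparison only shows the core step.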
Idempotent, deduplicating message consumer for RocketMQ. Supports MySQL or Redis as the idempotency table and works out of the box.
Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
Java DSL for (online) deduplication
WInte.r is a Java framework for end-to-end data integration. The WInte.r framework implements well-known methods for data pre-processing, schema matching, identity resolution, data fusion, and result evaluation.
A general purpose deduplication framework
🖇️Record Linkage tool used by https://cidacs.bahia.fiocruz.br/
PRIMAT - Private Matching Toolbox
Mirror of https://bitbucket.org/resteorts/smered
A Java-based, database-driven backup tool with multi-storage support and other nice things
Data bus based on Apache Kafka and consisting of separate components [copied from own private repos]
A UI application for File Deduplication using Hashing
Project for helping my brother find duplicates in his photos directory.
A web application for two-phase deduplication that combines intra-user and inter-user deduplication techniques by introducing deduplication proxies (DPs) between the clients and the storage server (SS).
This repository contains the source code of two applications: the Crime Ingestion App aims at extracting, geolocalizing and deduplicating crime-related news articles from online newspapers and the Crime Visualization App allows visualizing crime-related data in a web application.
Java (Hadoop) implementation of the Nubeam-dedup deduplication algorithm from the paper by Hang Dai and Yongtao Guan, "Nubeam-dedup: a fast and RAM-efficient tool to de-duplicate sequencing reads without mapping".
Utility for automatic Git repository deduplication
Created by Halbert L. Dunn
Released 1946