Aim: an open-source library that helps compare datasets (CSV files, database tables, classes) in a memory-limited environment.
License: BSD 2-Clause
This project is a pure C# port of the very useful Python package recordlinkage. In addition, it tries to take advantage of C# features such as LINQ and Dataflow.
- string comparison with multiple string metrics
- uses a scoring method to calculate overall similarity
- uses its own data table structure to reduce the memory footprint (compared with System.Data.DataTable)
- uses Dataflow to reduce the memory footprint
- uses parallelism to reduce runtime
- limitation: right now every data cell is a string
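The Dataflow and parallelism points above can be illustrated with a minimal sketch. It is not the library's actual pipeline: `PipelineSketch`, `CountMatchesAsync`, and the capacity of 1000 are hypothetical, and a case-insensitive equality check stands in for a real comparison. It only shows how a bounded TPL Dataflow block keeps a limited number of record pairs in memory while comparing them in parallel.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical helper, not part of RecordLinkageNet.
public static class PipelineSketch
{
    public static async Task<int> CountMatchesAsync(IEnumerable<(string A, string B)> pairs)
    {
        int matches = 0;

        // Compare pairs in parallel; BoundedCapacity caps how many pairs are buffered at once.
        var compare = new TransformBlock<(string A, string B), bool>(
            p => string.Equals(p.A, p.B, StringComparison.OrdinalIgnoreCase),
            new ExecutionDataflowBlockOptions
            {
                BoundedCapacity = 1000, // assumed value, tune to the memory budget
                MaxDegreeOfParallelism = Environment.ProcessorCount
            });

        // ActionBlock runs with a degree of parallelism of 1 by default,
        // so incrementing the counter here needs no lock.
        var collect = new ActionBlock<bool>(isMatch => { if (isMatch) matches++; });

        compare.LinkTo(collect, new DataflowLinkOptions { PropagateCompletion = true });

        // SendAsync waits while the block is full, which throttles the producer.
        foreach (var pair in pairs)
        {
            await compare.SendAsync(pair);
        }

        compare.Complete();
        await collect.Completion;
        return matches;
    }
}
```

Because the producer awaits `SendAsync`, a large input is streamed through the pipeline instead of being loaded into memory all at once.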
All platforms that support .NET 6.0, namely:
- Linux
- macOS
- Windows
This project should look and feel like using the Python equivalent:
// create some test data (see UnitTest.TestDataPerson)
List<TestDataPerson> testDataPeopleA = new List<TestDataPerson>
{
new TestDataPerson("Thomas", "Mueller", "Lindetrasse", "Testhausen", "12345"),
new TestDataPerson("Thomas", "Mueller", "Lindenstrasse", "Testcity", "012345"),
new TestDataPerson("Thomas", "Müller", "Lindenstrasse", "Testcity", "012345"),
new TestDataPerson("Tomas", "Müller", "Lindenstroad", "Testhausen", "012342"),
new TestDataPerson("Tomas", "Müller", "Lindenstroad", "Dorf", "012342")
};
DataTableFeather tabA = TableConverter.CreateTableFeatherFromDataObjectList(testDataPeopleA);
// load some data from a SQLite file
DataTableFeather tabB = RecordLinkageNet.Util.SqliteReader.ReadTableFromSqliteFile("filenameof.sqlite","testtablename");
ConditionList conList = new ConditionList();
Condition.StringMethod testMethod = Condition.StringMethod.JaroWinklerSimilarity;
conList.String("NameFirst", "NameFirst", testMethod);
conList.String("Street", "Street", testMethod);
conList.String("PostalCode", "PostalCode", Condition.StringMethod.Exact);
conList.String("NameLast", "NameLast", testMethod);
//configure comparison
Configuration config = Configuration.Instance;
config.AddIndex(new IndexFeather().Create(tabB, tabA));
config.AddConditionList(conList);
config.SetStrategy(Configuration.CalculationStrategy.WeightedConditionSum);
config.SetNumberTransposeModus(NumberTransposeHelper.TransposeModus.LOG10);
//we init a worker
WorkScheduler workScheduler = new WorkScheduler();
var pipeLineCancellation = new CancellationTokenSource();//for optional cancellation
var resultTask = workScheduler.Compare(pipeLineCancellation.Token);
await resultTask;
int amount = resultTask.Result.Count();
More details can be found in the Examples repository.
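To make the idea of a weighted condition sum more concrete, here is a minimal sketch of how per-field similarities could be combined into one overall score. It is an illustration only: `ScoreSketch`, `WeightedSum`, and the weights are hypothetical, and the library's `WeightedConditionSum` strategy and `LOG10` transpose mode may work differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of RecordLinkageNet.
public static class ScoreSketch
{
    // Each condition contributes its similarity (0..1) scaled by its weight;
    // the sum is normalized by the total weight so the result stays in 0..1.
    public static double WeightedSum(IReadOnlyList<(double Similarity, double Weight)> conditions)
    {
        double totalWeight = conditions.Sum(c => c.Weight);
        if (totalWeight <= 0)
        {
            return 0.0; // no usable weights, treat as "no similarity"
        }
        return conditions.Sum(c => c.Similarity * c.Weight) / totalWeight;
    }
}
```

For example, an exact postal code match with weight 2, a street similarity of 0.5 with weight 1, and a first-name similarity of 0 with weight 1 give (2 + 0.5 + 0) / 4 = 0.625.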
The project implements multiple metrics for string comparison as extension methods:
- HammingDistance
- DamerauLevenshteinDistance
- JaroDistance
- JaroWinklerSimilarity
- ShannonEntropyDistance
using RecordLinkageNet.Core.Distance;
var result1 = "foo".HammingDistance("bar");//3
var result2 = "foo".DamerauLevenshteinDistance("bar");//3
var result3 = "foo".JaroWinklerSimilarity("bar");//0
The distance metrics are well tested against results from the Python library jellyfish.
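As an illustration of what such an extension method can look like, here is a minimal sketch of a Hamming distance. It is not the library's implementation: the names `StringMetricSketch` and `HammingSketch` are hypothetical, and counting a length difference as additional mismatches is an assumed convention.

```csharp
using System;

// Hypothetical helper, not part of RecordLinkageNet.
public static class StringMetricSketch
{
    // Counts positions at which the two strings differ.
    // Assumption: extra characters of the longer string each count as one mismatch.
    public static int HammingSketch(this string a, string b)
    {
        if (a is null) throw new ArgumentNullException(nameof(a));
        if (b is null) throw new ArgumentNullException(nameof(b));

        int shared = Math.Min(a.Length, b.Length);
        int distance = Math.Abs(a.Length - b.Length);

        for (int i = 0; i < shared; i++)
        {
            if (a[i] != b[i])
            {
                distance++;
            }
        }
        return distance;
    }
}
```

With this sketch, `"foo".HammingSketch("bar")` returns 3, which matches the value shown for `HammingDistance` above.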
folder | description |
---|---|
RecordLinkageNet | C# library code |
UnitTest | tests for the library |
- jamesturk for jellyfish and his C implementation of string metrics
- jeff-atwood for Shannon Entropy
- wickedshimmy and joannaksk for basic Damerau Levenshtein Distance