Skip to content

t2solve/recordlinkagenet

Repository files navigation

dotnet 6.0 build code quality check code coverage Code Coverage

Overview

aim: opensource library which offers help to compare datasets (csv, database tables,classes) in a memory-limited environment

license BSD 2-Clause

This project is a pure c# port of the super useful python package recordlinkage. Besides it tries to use the effective parts of the c# language (e.g. linq, dataflow).

features

  • string comparision with multiple string metrics
  • uses scoring method to calculate overall similarity
  • uses own datatable struture to reduce memory footprint (in comparsison to system.data.datatable)
  • uses dataflow to reduce memory footprint
  • uses parallelism to reduce runtime
  • limits: right now every datacell is string

plattforms:

all plattform which supports .NET 6.0 so:

  • Linux
  • MacOs
  • Windows

minimal examples

This project should look and feel like using the pyhton equivalent:

//we create some testdata //see UnitTest.TestDataPerson
List<TestDataPerson> testDataPeopleA = new List<TestDataPerson>
{
    new TestDataPerson("Thomas", "Mueller", "Lindetrasse", "Testhausen", "12345"),
    new TestDataPerson("Thomas", "Mueller", "Lindenstrasse", "Testcity", "012345"),
    new TestDataPerson("Thomas", "Müller", "Lindenstrasse", "Testcity", "012345"),
    new TestDataPerson("Tomas", "Müller", "Lindenstroad", "Testhausen", "012342"),
    new TestDataPerson("Tomas", "Müller", "Lindenstroad", "Dorf", "012342")
};
DataTableFeather tabA = TableConverter.CreateTableFeatherFromDataObjectList(testDataPeopleA);

//we load some data from sqlite file
DataTableFeather tabB = RecordLinkageNet.Util.SqliteReader.ReadTableFromSqliteFile("filenameof.sqlite","testtablename");

ConditionList conList = new ConditionList();
Condition.StringMethod testMethod = Condition.StringMethod.JaroWinklerSimilarity;
conList.String("NameFirst", "NameFirst", testMethod);
conList.String("Street", "Street", testMethod);
conList.String("PostalCode", "PostalCode", Condition.StringMethod.Exact);
conList.String("NameLast", "NameLast", testMethod);

//configure comparison
Configuration config = Configuration.Instance;
config.AddIndex(new IndexFeather().Create(tabB, tabA));
config.AddConditionList(conList);
config.SetStrategy(Configuration.CalculationStrategy.WeightedConditionSum);
config.SetNumberTransposeModus(NumberTransposeHelper.TransposeModus.LOG10); ;

//we init a worker
WorkScheduler workScheduler = new WorkScheduler();
var pipeLineCancellation = new CancellationTokenSource();//for optional cancellation
var resultTask = workScheduler.Compare(pipeLineCancellation.Token);

await resultTask;

int amount = resultTask.Result.Count();

More Details could be found at Examples Repository

The project implements mutliple metrics for string comparision as extensions:

  • HammingDistance
  • DamerauLevenshteinDistance
  • JaroDistance
  • JaroWinklerSimilarity
  • ShannonEntropyDistance
using RecordLinkageNet.Core.Distance;
 
var result1 = "foo".HammingDistance("bar");//3
var result2 = "foo".DamerauLevenshteinDistance("bar");//3
var result3 = "foo".JaroWinklerSimilarity("bar");//0

The distances metrics are well tested with results from python lib jellyfish.

structure:

folder description
RecordLinkageNet c# library code
UnitTest test for the lib

thanks to