Merge pull request #624 from zinggAI/obviousDupes
Obvious dupes condition in config
Showing 10 changed files with 425 additions and 17 deletions.
@@ -0,0 +1,29 @@
---
description: >-
  Defining which field(s), when matched, classify two records as an exact match
---

# Obvious Duplicates

* Certain fields, or combinations of fields, may indicate an exact match between two records.

* By configuring the obvious dupes condition, the user ensures that such records always result in a match and are placed together in a cluster.

* This also gives better performance and score.

# Configuration

* In the config.json file, add a JSON element like this:

"obviousDupeCondition" : "field1 & field2 | field3 | field4 & field5 & field6"

| => OR condition
& => AND condition

* The two records in the above example will be considered an exact match if:

the values of both field1 and field2 are exactly the same in both records and neither is null
OR
the value of field3 is not null and is exactly the same in both records (e.g. something like an SSN, which can't be the same for two people)
OR
the values of all three fields field4, field5, and field6 are exactly the same in both records and none of them is null
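
The rule described above (OR across `|` groups, AND across `&` fields, with nulls never counting as equal) can be sketched in a few lines of Python. This is only an illustration of the documented semantics, not Zingg's implementation: the function names `parse_condition` and `is_obvious_dupe` are hypothetical, records are assumed to be plain dictionaries, and `None` is assumed to stand in for a null value.

```python
from typing import Any, List, Mapping


def parse_condition(condition: str) -> List[List[str]]:
    """Split 'a & b | c' into OR-groups of AND-ed field names: [['a', 'b'], ['c']]."""
    groups = []
    for clause in condition.split("|"):
        fields = [name.strip() for name in clause.split("&") if name.strip()]
        if fields:
            groups.append(fields)
    return groups


def is_obvious_dupe(rec_a: Mapping[str, Any], rec_b: Mapping[str, Any], condition: str) -> bool:
    """Return True if any OR-group has all of its fields non-null and equal in both records."""
    for fields in parse_condition(condition):
        if all(
            rec_a.get(f) is not None and rec_a.get(f) == rec_b.get(f)
            for f in fields
        ):
            return True
    return False


# Hypothetical records used only to exercise the sketch.
condition = "field1 & field2 | field3 | field4 & field5 & field6"
rec_1 = {"field1": "Ann", "field2": "Lee", "field3": None, "field4": "a", "field5": "b", "field6": "c"}
rec_2 = {"field1": "Ann", "field2": "Lee", "field3": None, "field4": "x", "field5": "b", "field6": "c"}
rec_3 = {"field1": "Bob", "field2": "Lee", "field3": None, "field4": "y", "field5": "b", "field6": "c"}

same_name = is_obvious_dupe(rec_1, rec_2, condition)   # field1 & field2 group matches
null_only = is_obvious_dupe(rec_1, rec_3, condition)   # field3 is null in both, so it does not count
```

Here `rec_1` and `rec_2` qualify through the first group, while `rec_1` and `rec_3` do not qualify: their only shared "equal" value is a null `field3`, which the documented rule excludes.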
@@ -0,0 +1,96 @@
{
    "fieldDefinition":[
        {
            "fieldName" : "recId",
            "matchType" : "dont_use",
            "fields" : "recId",
            "dataType": "string"
        },
        {
            "fieldName" : "fname",
            "matchType" : "fuzzy",
            "fields" : "fname",
            "dataType": "string"
        },
        {
            "fieldName" : "lname",
            "matchType" : "fuzzy",
            "fields" : "lname",
            "dataType": "string"
        },
        {
            "fieldName" : "stNo",
            "matchType": "fuzzy",
            "fields" : "stNo",
            "dataType": "string"
        },
        {
            "fieldName" : "add1",
            "matchType": "fuzzy",
            "fields" : "add1",
            "dataType": "string"
        },
        {
            "fieldName" : "add2",
            "matchType": "fuzzy",
            "fields" : "add2",
            "dataType": "string"
        },
        {
            "fieldName" : "city",
            "matchType": "fuzzy",
            "fields" : "city",
            "dataType": "string"
        },
        {
            "fieldName" : "areacode",
            "matchType": "fuzzy",
            "fields" : "areacode",
            "dataType": "string"
        },
        {
            "fieldName" : "state",
            "matchType": "fuzzy",
            "fields" : "state",
            "dataType": "string"
        },
        {
            "fieldName" : "dob",
            "matchType": "fuzzy",
            "fields" : "dob",
            "dataType": "string"
        },
        {
            "fieldName" : "ssn",
            "matchType": "fuzzy",
            "fields" : "ssn",
            "dataType": "string"
        }
    ],
    "output" : [{
        "name":"output",
        "format":"csv",
        "props": {
            "location": "/tmp/zinggOutput",
            "delimiter": ",",
            "header":true
        }
    }],
    "data" : [{
        "name":"test",
        "format":"csv",
        "props": {
            "location": "examples/febrl/test.csv",
            "delimiter": ",",
            "header":false
        },
        "schema": "recId string, fname string, lname string, stNo string, add1 string, add2 string, city string, state string, areacode string, dob string, ssn string"
        }
    ],
    "obviousDupeCondition" : "FNAME & STNO & ADD1",
    "labelDataSampleSize" : 0.5,
    "numPartitions":4,
    "modelId": 100,
    "zinggDir": "models"


}
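
A config like the one above can be sanity-checked before a run by confirming that every name used in `obviousDupeCondition` also appears as a `fieldName`. The sketch below is a hypothetical helper, not part of Zingg. It embeds a trimmed copy of the config for self-containment and compares names case-insensitively, which is an assumption: the condition here is written in upper case (`FNAME & STNO & ADD1`) while the field definitions use `fname`, `stNo`, and `add1`, and the source does not say how Zingg itself resolves case.

```python
import json
import re
from typing import List

# Trimmed copy of the example config, keeping only the keys this check needs.
CONFIG_TEXT = """
{
    "fieldDefinition": [
        {"fieldName": "fname", "matchType": "fuzzy", "fields": "fname", "dataType": "string"},
        {"fieldName": "stNo", "matchType": "fuzzy", "fields": "stNo", "dataType": "string"},
        {"fieldName": "add1", "matchType": "fuzzy", "fields": "add1", "dataType": "string"}
    ],
    "obviousDupeCondition": "FNAME & STNO & ADD1"
}
"""


def unknown_condition_fields(config: dict) -> List[str]:
    """List names in obviousDupeCondition that match no fieldName (case-insensitive, by assumption)."""
    defined = {fd["fieldName"].lower() for fd in config.get("fieldDefinition", [])}
    # Both & and | separate field names, so split on either operator.
    used = [name.strip() for name in re.split(r"[&|]", config.get("obviousDupeCondition", ""))]
    return [name for name in used if name and name.lower() not in defined]


config = json.loads(CONFIG_TEXT)
missing = unknown_condition_fields(config)
if missing:
    print("Fields in obviousDupeCondition with no fieldDefinition:", missing)
```

With the trimmed config, `missing` is empty under the case-insensitive assumption; a strict, case-sensitive comparison would instead flag all three names.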