-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BugFix - Incorrect calculation of Error % for ColumnBasedChecks #43
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A List[(String, String)]
is a smell generally indicating that a Map[String, String]
is in order if there won't be duplicate keys. If you don't have a reason to accommodate duplicate keys, refactor from the former to the latter.
fields.append(("type", Json.fromString("columnBasedCheckEvent"))) | ||
fields.append(("failed", Json.fromBoolean(cbvce.failed))) | ||
fields.append(("message", Json.fromString(cbvce.msg))) | ||
cbvce.data.foreach(x => fields.append((x._1, Json.fromString(x._2)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should follow suit from the case below and put the data in a property on its own. This would make this partial function cleaner and follow the pattern used in the others in this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/main/scala/com/target/data_validator/validator/ColumnBased.scala
Outdated
Show resolved
Hide resolved
|
||
var pctError = "0.00" | ||
var errorMsg = "" | ||
val data = new ListBuffer[Tuple2[String, String]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
val data = new ListBuffer[Tuple2[String, String]] | |
val data = new ListBuffer[(String, String)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to LinkedHashMap.
case DoubleType => num.forall(_.toDouble != row.getDouble(idx)) | ||
case ut => | ||
logger.error(s"quickCheck for type: $ut, Row: $row not Implemented! Please file this as a bug.") | ||
case StringType => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could extract these partial functions to a variable in order to shorten this very long block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Not sure if what I did is the same as exactly what you had in mind. Let me know.
src/main/scala/com/target/data_validator/validator/ColumnBased.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/target/data_validator/validator/ColumnBased.scala
Outdated
Show resolved
Hide resolved
|
||
// calculates and returns the pct error as a string | ||
def calculatePctError(expected: Double, actual: Double, formatStr: String = "%4.2f%%"): String = { | ||
val pct = abs(((expected - actual) * 100.0) / expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few places where you're doing something to guard against div/0. I think you should move that guard here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The guards that I was using was to check for validation status (pass/fail), and not for div/0. Hence I have not removed them.
But thanks for catching the div/0 issue. Now I have handled the issue in the method calculatePctError
.
src/main/scala/com/target/data_validator/validator/ColumnBased.scala
Outdated
Show resolved
Hide resolved
@colindean Thanks for reviewing this. I guess I have addressed all your comments. I have switched from List/ListBuffer to Map/LinkedHashMap as I wanted to make sure that the fields are logged in the reports in the same order in which they were inserted. |
@phpisciuneri Please share your thoughts on this PR. |
See #41 |
src/main/scala/com/target/data_validator/validator/ColumnBased.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/target/data_validator/validator/ColumnBased.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/target/data_validator/validator/ColumnBased.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/target/data_validator/validator/ColumnBased.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, pending review from @phpisciuneri
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments, but the major blocker is that there is an additional column based check that is not yet supported: columnSumCheck
Here is an example email notification with all the column based checks failing:
And check config section if you want to replicate:
tables:
- db: census_income
table: adult
condition: educationNum >= 5
checks:
- type: rowCount
minNumRows: 50000
- type: columnMaxCheck
column: age
value: 10
- type: columnSumCheck
column: age
minValue: 0
maxValue: 10000
columnMaxCheck supports lower and upper bounds. So I think that calculating the relative error in this case should be dependent on which bound was violated.
So in the example corresponding to the config above. If the actual is -10
then calculate relative error based off of the minValue
. If the actual is > 10000
then calculate relative to the maxValue
.
override def failed: Boolean = failure | ||
|
||
override def toHTML: Text.all.Tag = { | ||
div(cls:="checkEvent")(failedHTML, s" - ${msg}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Braces are redundant here.
div(cls:="checkEvent")(failedHTML, s" - ${msg}") | |
div(cls:="checkEvent")(failedHTML, s" - $msg") |
addEvent(ValidatorCounter("rowCount", count)) | ||
addEvent(ValidatorCheckEvent(failed, s"MinNumRowCheck $minNumRows ", count, 1)) | ||
val msg = s"MinNumRowsCheck Expected: ${minNumRows} Actual: ${count} Error %: ${pctError}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the redundant braces. Also, I'd like to be specific that the error here is relative, and the percent symbol is redundant with the formatting of the pctError
value. See the screen shot in my summary.
val msg = s"MinNumRowsCheck Expected: ${minNumRows} Actual: ${count} Error %: ${pctError}" | |
val msg = s"MinNumRowsCheck Expected: $minNumRows Actual: $count Relative Error: $pctError" |
@@ -62,41 +83,61 @@ case class ColumnMaxCheck(column: String, value: Json) | |||
|
|||
override def configCheck(df: DataFrame): Boolean = checkTypes(df, column, value) | |||
|
|||
// scalastyle:off |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove... Suggestions are easy to fix, see below.
// scalastyle:off |
} | ||
failed | ||
} | ||
// scalastyle:on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// scalastyle:on |
} | ||
|
||
failed = cmp_params._1 != cmp_params._2 | ||
val pctError = if(failed) calculatePctError(cmp_params._1, cmp_params._2) else "0.00%" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adhere to scalastyle:
val pctError = if(failed) calculatePctError(cmp_params._1, cmp_params._2) else "0.00%" | |
val pctError = if (failed) calculatePctError(cmp_params._1, cmp_params._2) else "0.00%" |
} | ||
|
||
def resultForOther(): Unit = { | ||
logger.error(s"ColumnMaxCheck for type: $dataType, Row: $row not implemented! Please open a bug report on the data-validator issue tracker.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adhere to scalastyle:
logger.error(s"ColumnMaxCheck for type: $dataType, Row: $row not implemented! Please open a bug report on the data-validator issue tracker.") | |
logger.error( | |
s"""ColumnMaxCheck for type: $dataType, Row: $row not implemented! | |
|Please open a bug report on the data-validator issue tracker.""".stripMargin | |
) |
true // Fail check! | ||
|
||
var errorMsg = "" | ||
val data = LinkedHashMap.empty[String, String] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels a bit heavy handed to me, for ultimately what is just a couple of entries. Maybe ListMap
instead?
https://www.scala-lang.org/api/current/scala/collection/immutable/ListMap.html
|
||
failed = expected != actual | ||
data += ("expected" -> expected, "actual" -> actual) | ||
errorMsg = s"ColumnMaxCheck $column[StringType]: Expected: $expected, Actual: $actual" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this ought to be the format for when the relative error is undefined
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thinks for the same (check, dataType) combination, the message format should be same. Hence I did not change this.
failed = cmp_params._1 != cmp_params._2 | ||
val pctError = if(failed) calculatePctError(cmp_params._1, cmp_params._2) else "0.00%" | ||
data += ("expected" -> num.toString, "actual" -> rMax.toString, "error_percent" -> pctError) | ||
errorMsg = s"ColumnMaxCheck $column[$dataType]: Expected: $num, Actual: $rMax. Error %: ${pctError}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, the redundant percentage symbol and let's be specific that this is relative error. Also, this is basically the duplicate format for MinNumRowsCheck
with some slight differences (the period and comma here). This can be defined once and then used both cases.
errorMsg = s"ColumnMaxCheck $column[$dataType]: Expected: $num, Actual: $rMax. Error %: ${pctError}" | |
errorMsg = s"ColumnMaxCheck $column[$dataType]: Expected: $num, Actual: $rMax Relative Error: ${pctError}" |
assert(columnMaxCheck.getEvents contains | ||
ColumnBasedValidatorCheckEvent(true, | ||
LinkedHashMap("expected" -> "100", "actual" -> "3", "error_percent" -> "97.00%").toMap, | ||
"ColumnMaxCheck number[IntegerType]: Expected: 100, Actual: 3. Error %: 97.00%")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you incorporate my suggestions above the formatting here and below will need updated.
@phpisciuneri Thanks for reviewing this. For columnSumCheck, I think another question that needs to be answered is how to calculate the relative error in case of non-inclusive intervals. For example, if the interval is (0.0, 1.0) and the actual value is 1.1, then the relative error should be calculated w.r.t to what? 0.9? 0.99? Please share your thoughts. I also saw that the |
An oversight! I've taken care of it in #45. |
I think for the example you shared it can just be calculated with respect to the upper bound so 10%. The more troublesome scenario is when the actual value is the same value as a boundary value in the non-inclusive case. Then the relative error would be 0, but the test has failed. One way to work around this is how you suggest. But the implementation seems like it would be messy and require a bit of work. You would probably need to choose the precision that you increment or decrement based on the type of the result and value and the least significant digit in either. Maybe for the case of non-inclusive and actual value is the same as a boundary value we just treat the error as undefined. What do you think of that? |
@phpisciuneri I think we could go with your suggestion of 'undefined' relative error, but we must document it. I will put that in README. |
@phpisciuneri I have added the fix for And since I had to change the same files, I have included @colindean 's #45 changes in this PR. |
Please do not copy code from another PR into yours. It is not a good practice and we lose the history of who originally wrote the code. I am going to go ahead and remove the copied code from this branch. Instead please provide your review on #45. See About Pull Request Reviews. If you approve, we will merge that branch into master, and then we can update this PR accordingly & address any merge conflicts at that point. See Addressing Merge Conflicts. There are occasions where it may be appropriate to copy commits from elsewhere. In that case use cherry-picking. |
…idator into BugFix-IncorrectErrorPct Remove code copied from #45
@phpisciuneri removed #45 changes from this PR. |
OK, we are getting closer. Thanks for being patient. One problem that I have now is that the structure of the checks in the report now contains a lot of duplication of the parameters. See the snippet below. FWIW: I used jq to parse the report. In my case, the report is named jq .tables[0].checks report.json [
{
"type": "rowCount",
"minNumRows": 50000,
"failed": true,
"events": [
{
"type": "counter",
"name": "rowCount",
"value": 31363
},
{
"type": "columnBasedCheckEvent",
"failed": true,
"message": "MinNumRowsCheck Expected: 50000 Actual: 31363 Relative Error: 37.27%",
"data": {
"relative_error": "37.27%",
"expected": "50000",
"actual": "31363"
}
}
]
},
{
"type": "columnMaxCheck",
"column": "age",
"value": 0,
"failed": true,
"events": [
{
"type": "columnBasedCheckEvent",
"failed": true,
"message": "ColumnMaxCheck age[IntegerType]: Expected: 0, Actual: 90. Relative Error: undefined",
"data": {
"relative_error": "undefined",
"expected": "0",
"actual": "90"
}
}
]
},
{
"type": "columnSumCheck",
"failed": true,
"events": [
{
"type": "columnBasedCheckEvent",
"failed": true,
"message": "columnSumCheck on age[LongType]: Expected Range: (0 , 1e4) Actual: 1200747 Relative Error: 11907.47%",
"data": {
"inclusive": "false",
"upper_bound": "1e4",
"relative_error": "11907.47%",
"actual": "1200747",
"lower_bound": "0"
}
}
],
"column": "age",
"minValue": 0,
"maxValue": 10000,
"inclusive": null
}
] @samratmitra-0812 was there a particular reason for having all the error related terms inside of the Can/Should we simplify here or make that an issue going forward? |
@phpisciuneri Initially (5fdcc76), this additional nesting for data was not there. @colindean suggested that we put the data in a separate object of its own as this would make the partial function consistent with the others.
I guess we will need more time to refactor the structure of the json output. If I am not wrong, the duplication issue exists for the RowBased checks as well. We will need to decide what information we show in the report, and how this information is structured. We may also need to think about making the json output a little less verbose, and show only information that is actually useful. For example, it it necessary to write I think it is better if we have a create an issue for this and work on it separately. |
I'm OK with the duplication of parameters in each check's output. It's sensible for each check to report how it was configured. It's also sensible for the check's events to have a human-readable message as well as the data used in that message as a queriable object. I'd drop the human-readable message before the error object. |
This PR fixes the issue of incorrect calculation of Error % for
ColumnBasedChecks
.The result of a
RowBased
check can be completely defined in terms of total rows, error rows, and % of error rows w.r.t total rows. The message shown in the HTML and JSON reports can also be based on a common template for RowBased checks. This is being done inValidatorCheckEvent
.But the above does not hold true for
ColumnBased
checks.For
ColumnBased
checks, the calculation of Error % should be done based on the expectations specified in the yaml conf file, and the actual value calculated by the validator. The count of rows should not be a part of this calculation.The fields needed to completely describe the result of a
ColumnBased
check may vary from one check to another. Even for the same check, it may vary with the data type of the column that we are validating. For example, withinColumnMaxCheck
, the Error % field does not make sense for aString
column, but is required forNumeric
columns. This also implies that the message used in the reports to describe the results of aColumnBased
check will vary from one check to another.This changes in this PR were made with the above points in mind. Some unit tests have also been added.