BugFix - Incorrect calculation of Error % for ColumnBasedChecks #43

samratmitra-0812 · 2020-05-15T15:39:25Z

This PR fixes the issue of incorrect calculation of Error % for ColumnBasedChecks.

The result of a RowBased check can be completely defined in terms of total rows, error rows, and % of error rows w.r.t total rows. The message shown in the HTML and JSON reports can also be based on a common template for RowBased checks. This is being done in ValidatorCheckEvent.

But the above does not hold true for ColumnBased checks.

For ColumnBased checks, the calculation of Error % should be done based on the expectations specified in the yaml conf file, and the actual value calculated by the validator. The count of rows should not be a part of this calculation.
The fields needed to completely describe the result of a ColumnBased check may vary from one check to another. Even for the same check, it may vary with the data type of the column that we are validating. For example, within ColumnMaxCheck, the Error % field does not make sense for a String column, but is required for Numeric columns. This also implies that the message used in the reports to describe the results of a ColumnBased check will vary from one check to another.

The changes in this PR were made with the above points in mind. Some unit tests have also been added.

colindean

A List[(String, String)] is a smell generally indicating that a Map[String, String] is in order if there won't be duplicate keys. If you don't have a reason to accommodate duplicate keys, refactor from the former to the latter.

colindean · 2020-05-15T16:36:00Z

src/main/scala/com/target/data_validator/JsonEncoders.scala

+        fields.append(("type", Json.fromString("columnBasedCheckEvent")))
+        fields.append(("failed", Json.fromBoolean(cbvce.failed)))
+        fields.append(("message", Json.fromString(cbvce.msg)))
+        cbvce.data.foreach(x => fields.append((x._1, Json.fromString(x._2))))


This should follow suit from the case below and put the data in a property on its own. This would make this partial function cleaner and follow the pattern used in the others in this method.

colindean · 2020-05-15T16:46:45Z

src/main/scala/com/target/data_validator/validator/ColumnBased.scala

+
+    var pctError = "0.00"
+    var errorMsg = ""
+    val data = new ListBuffer[Tuple2[String, String]]


Suggested change

-    val data = new ListBuffer[Tuple2[String, String]]
+    val data = new ListBuffer[(String, String)]

Switched to LinkedHashMap.

colindean · 2020-05-15T16:48:40Z

src/main/scala/com/target/data_validator/validator/ColumnBased.scala

-      case DoubleType => num.forall(_.toDouble != row.getDouble(idx))
-      case ut =>
-        logger.error(s"quickCheck for type: $ut, Row: $row not Implemented! Please file this as a bug.")
+      case StringType => {


You could extract these partial functions to a variable in order to shorten this very long block.

Done. Not sure if what I did is the same as exactly what you had in mind. Let me know.
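For readers following along, here is a minimal sketch of what extracting the per-type handlers into named values could look like. It matches on plain Scala values rather than Spark data types, and the names `forString`, `forNumber`, `forOther`, and `describe` are hypothetical, not taken from this PR.

```scala
object PartialFunctionSketch {
  // Handler for string values only.
  val forString: PartialFunction[Any, String] = {
    case s: String => s"string:$s"
  }

  // Handler for the numeric values covered by this sketch.
  val forNumber: PartialFunction[Any, String] = {
    case i: Int    => s"number:$i"
    case d: Double => s"number:$d"
  }

  // Fallback for everything else, analogous to the "not implemented" branch.
  val forOther: PartialFunction[Any, String] = {
    case other => s"unsupported:$other"
  }

  // The long match block collapses into a chain of named handlers.
  val describe: PartialFunction[Any, String] =
    forString orElse forNumber orElse forOther
}
```

Each handler can then be read and tested on its own, and the combined function stays a single short line.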

colindean · 2020-05-15T17:30:16Z

src/main/scala/com/target/data_validator/validator/ColumnBased.scala

+
+  // calculates and returns the pct error as a string
+  def calculatePctError(expected: Double, actual: Double, formatStr: String = "%4.2f%%"): String = {
+    val pct = abs(((expected - actual) * 100.0) / expected)


There are a few places where you're doing something to guard against div/0. I think you should move that guard here.

The guards that I was using were there to check the validation status (pass/fail), not to protect against div/0. Hence I have not removed them.
But thanks for catching the div/0 issue. Now I have handled the issue in the method calculatePctError.
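A minimal sketch of one way the div/0 guard might sit inside `calculatePctError`. The signature comes from the diff above. Returning the string "undefined" for a zero expected value is an assumption based on the report output shown later in this thread, and the wrapping object `PctErrorSketch` and the `Locale.ROOT` formatting are illustrative only.

```scala
import java.util.Locale
import scala.math.abs

object PctErrorSketch {
  // Returns the relative error as a formatted percentage string.
  // Assumption: a zero expected value yields "undefined" instead of dividing by zero.
  def calculatePctError(expected: Double, actual: Double, formatStr: String = "%4.2f%%"): String =
    if (expected == 0.0) {
      "undefined"
    } else {
      val pct = abs(((expected - actual) * 100.0) / expected)
      // Locale.ROOT keeps the decimal separator stable across environments.
      formatStr.formatLocal(Locale.ROOT, pct)
    }
}
```

With the guard inside the method, callers only need to decide pass or fail and never have to check the denominator themselves.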

samratmitra-0812 · 2020-05-18T15:26:50Z

@colindean Thanks for reviewing this.

I guess I have addressed all your comments. I have switched from List/ListBuffer to Map/LinkedHashMap as I wanted to make sure that the fields are logged in the reports in the same order in which they were inserted.

samratmitra-0812 · 2020-05-18T15:29:34Z

@phpisciuneri Please share your thoughts on this PR.

phpisciuneri · 2020-05-20T15:24:10Z

See #41

colindean

LGTM, pending review from @phpisciuneri

phpisciuneri

Some minor comments, but the major blocker is that there is an additional column based check that is not yet supported: columnSumCheck

Here is an example email notification with all the column based checks failing:

And check config section if you want to replicate:

tables:
  - db: census_income
    table: adult
    condition: educationNum >= 5
    checks:
      - type: rowCount
        minNumRows: 50000

      - type: columnMaxCheck
        column: age
        value: 10

      - type: columnSumCheck
        column: age
        minValue: 0
        maxValue: 10000

columnSumCheck supports lower and upper bounds, so I think the relative error in this case should depend on which bound was violated.

So, in the example corresponding to the config above: if the actual is -10, calculate the relative error based on the minValue; if the actual is > 10000, calculate it relative to the maxValue.

phpisciuneri · 2020-05-22T10:05:02Z

src/main/scala/com/target/data_validator/ValidatorEvent.scala

+  override def failed: Boolean = failure
+
+  override def toHTML: Text.all.Tag = {
+    div(cls:="checkEvent")(failedHTML, s" - ${msg}")


Braces are redundant here.

Suggested change

-    div(cls:="checkEvent")(failedHTML, s" - ${msg}")
+    div(cls:="checkEvent")(failedHTML, s" - $msg")

phpisciuneri · 2020-05-22T10:10:56Z

src/main/scala/com/target/data_validator/validator/ColumnBased.scala

    addEvent(ValidatorCounter("rowCount", count))
-    addEvent(ValidatorCheckEvent(failed, s"MinNumRowCheck $minNumRows ", count, 1))
+    val msg = s"MinNumRowsCheck Expected: ${minNumRows} Actual: ${count} Error %: ${pctError}"


Please remove the redundant braces. Also, I'd like to be specific that the error here is relative, and the percent symbol is redundant with the formatting of the pctError value. See the screen shot in my summary.

Suggested change

-    val msg = s"MinNumRowsCheck Expected: ${minNumRows} Actual: ${count} Error %: ${pctError}"
+    val msg = s"MinNumRowsCheck Expected: $minNumRows Actual: $count Relative Error: $pctError"

phpisciuneri · 2020-05-22T10:14:05Z

src/main/scala/com/target/data_validator/validator/ColumnBased.scala

@@ -62,41 +83,61 @@ case class ColumnMaxCheck(column: String, value: Json)

  override def configCheck(df: DataFrame): Boolean = checkTypes(df, column, value)

+  // scalastyle:off


Please remove... Suggestions are easy to fix, see below.

Suggested change

-  // scalastyle:off

phpisciuneri · 2020-05-22T10:14:18Z

src/main/scala/com/target/data_validator/validator/ColumnBased.scala

    }
    failed
  }
+  // scalastyle:on


Suggested change

-  // scalastyle:on

phpisciuneri · 2020-05-22T10:15:10Z

src/main/scala/com/target/data_validator/validator/ColumnBased.scala

+      }
+
+      failed = cmp_params._1 != cmp_params._2
+      val pctError = if(failed) calculatePctError(cmp_params._1, cmp_params._2) else "0.00%"


adhere to scalastyle:

Suggested change

-      val pctError = if(failed) calculatePctError(cmp_params._1, cmp_params._2) else "0.00%"
+      val pctError = if (failed) calculatePctError(cmp_params._1, cmp_params._2) else "0.00%"

phpisciuneri · 2020-05-22T10:27:22Z

src/main/scala/com/target/data_validator/validator/ColumnBased.scala

+    }
+
+    def resultForOther(): Unit = {
+      logger.error(s"ColumnMaxCheck for type: $dataType, Row: $row not implemented! Please open a bug report on the data-validator issue tracker.")


adhere to scalastyle:

Suggested change

-      logger.error(s"ColumnMaxCheck for type: $dataType, Row: $row not implemented! Please open a bug report on the data-validator issue tracker.")
+      logger.error(
+        s"""ColumnMaxCheck for type: $dataType, Row: $row not implemented!
+           |Please open a bug report on the data-validator issue tracker.""".stripMargin
+      )

phpisciuneri · 2020-05-22T11:20:50Z

src/main/scala/com/target/data_validator/validator/ColumnBased.scala

-        true // Fail check!
+
+    var errorMsg = ""
+    val data = LinkedHashMap.empty[String, String]


This feels a bit heavy handed to me, for ultimately what is just a couple of entries. Maybe ListMap instead?

https://www.scala-lang.org/api/current/scala/collection/immutable/ListMap.html
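To illustrate the ordering point, here is a small sketch showing that an immutable `ListMap` iterates its entries in insertion order, which is what the report fields rely on. The object name `ListMapSketch` and the sample values are illustrative, not code from this PR.

```scala
import scala.collection.immutable.ListMap

object ListMapSketch {
  // Entries are given in the order they should appear in the report.
  val data: ListMap[String, String] =
    ListMap("expected" -> "100", "actual" -> "3", "error_percent" -> "97.00%")

  // Iteration follows insertion order.
  val fieldOrder: List[String] = data.keys.toList

  // Adding a new key returns a new map with that key placed last.
  val extended: ListMap[String, String] = data + ("column" -> "number")
}
```

For a handful of entries this keeps the ordering guarantee without needing a mutable collection.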

phpisciuneri · 2020-05-22T11:40:20Z

src/main/scala/com/target/data_validator/validator/ColumnBased.scala

+
+      failed = expected != actual
+      data += ("expected" -> expected, "actual" -> actual)
+      errorMsg = s"ColumnMaxCheck $column[StringType]: Expected: $expected, Actual: $actual"


Maybe this ought to be the format for when the relative error is undefined?

I think that for the same (check, dataType) combination, the message format should be the same. Hence I did not change this.

phpisciuneri · 2020-05-22T11:46:35Z

src/main/scala/com/target/data_validator/validator/ColumnBased.scala

+      failed = cmp_params._1 != cmp_params._2
+      val pctError = if(failed) calculatePctError(cmp_params._1, cmp_params._2) else "0.00%"
+      data += ("expected" -> num.toString, "actual" -> rMax.toString, "error_percent" -> pctError)
+      errorMsg = s"ColumnMaxCheck $column[$dataType]: Expected: $num, Actual: $rMax. Error %: ${pctError}"


Again, the redundant percentage symbol and let's be specific that this is relative error. Also, this is basically the duplicate format for MinNumRowsCheck with some slight differences (the period and comma here). This can be defined once and then used both cases.

Suggested change

-      errorMsg = s"ColumnMaxCheck $column[$dataType]: Expected: $num, Actual: $rMax. Error %: ${pctError}"
+      errorMsg = s"ColumnMaxCheck $column[$dataType]: Expected: $num, Actual: $rMax Relative Error: ${pctError}"
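As a rough illustration of defining the format once, the sketch below builds both messages from one helper. The helper name `checkMessage`, its parameters, and the exact punctuation are hypothetical choices made for this example, not what the PR settled on.

```scala
object CheckMessageSketch {
  // One shared template for checks that report expected, actual, and relative error.
  // `prefix` carries the check-specific part, e.g. "MinNumRowsCheck" or
  // "ColumnMaxCheck age[IntegerType]:".
  def checkMessage(prefix: String, expected: String, actual: String, relativeError: String): String =
    s"$prefix Expected: $expected, Actual: $actual Relative Error: $relativeError"
}
```

Both MinNumRowsCheck and ColumnMaxCheck could then call the same helper, so the punctuation cannot drift between them.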

phpisciuneri · 2020-05-22T11:50:57Z

src/test/scala/com/target/data_validator/validator/ColumnBasedSpec.scala

+      assert(columnMaxCheck.getEvents contains
+        ColumnBasedValidatorCheckEvent(true,
+          LinkedHashMap("expected" -> "100", "actual" -> "3", "error_percent" -> "97.00%").toMap,
+          "ColumnMaxCheck number[IntegerType]: Expected: 100, Actual: 3. Error %: 97.00%"))


If you incorporate my suggestions above, the formatting here and below will need to be updated.

samratmitra-0812 · 2020-05-22T14:34:18Z

@phpisciuneri Thanks for reviewing this.

For columnSumCheck, I think another question that needs to be answered is how to calculate the relative error in the case of non-inclusive intervals. For example, if the interval is (0.0, 1.0) and the actual value is 1.1, then the relative error should be calculated w.r.t. what? 0.9? 0.99? Please share your thoughts.

I also saw that the columnSumCheck is not supported for Byte type, but the columnMaxCheck does support it. Any specific reasons for this?

colindean · 2020-05-22T14:50:05Z

I also saw that the columnSumCheck is not supported for Byte type, but the columnMaxCheck does support it. Any specific reasons for this?

An oversight! I've taken care of it in #45.

phpisciuneri · 2020-05-22T15:38:48Z

@samratmitra-0812

For columnSumCheck, I think another question that needs to be answered is how to calculate the relative error in the case of non-inclusive intervals. For example, if the interval is (0.0, 1.0) and the actual value is 1.1, then the relative error should be calculated w.r.t. what? 0.9? 0.99? Please share your thoughts.

I think for the example you shared it can just be calculated with respect to the upper bound so 10%.

The more troublesome scenario is when the actual value is the same value as a boundary value in the non-inclusive case. Then the relative error would be 0, but the test has failed. One way to work around this is how you suggest. But the implementation seems like it would be messy and require a bit of work. You would probably need to choose the precision that you increment or decrement based on the type of the result and value and the least significant digit in either.

Maybe for the case of non-inclusive and actual value is the same as a boundary value we just treat the error as undefined. What do you think of that?
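The proposal in this exchange can be sketched roughly as follows: compute the error relative to whichever bound was violated, and report it as undefined when a non-inclusive check fails exactly on a boundary. The function name `relativeErrorForRange`, its parameters, and the "undefined" result for a zero bound are assumptions made for illustration, not the merged implementation.

```scala
import java.util.Locale

object RangeErrorSketch {
  def relativeErrorForRange(lower: Double, upper: Double, actual: Double, inclusive: Boolean): String = {
    // Percentage error of `actual` relative to the violated bound.
    // Assumption: a zero bound yields "undefined" rather than dividing by zero.
    def pct(bound: Double): String =
      if (bound == 0.0) "undefined"
      else "%4.2f%%".formatLocal(Locale.ROOT, math.abs((actual - bound) * 100.0 / bound))

    val onBoundary = actual == lower || actual == upper

    if (actual < lower) pct(lower)                      // lower bound violated
    else if (actual > upper) pct(upper)                 // upper bound violated
    else if (!inclusive && onBoundary) "undefined"      // failed, but the error would be 0
    else "0.00%"                                        // within range, check passes
  }
}
```

For the interval (0.0, 1.0) with an actual value of 1.1, this gives 10.00% relative to the upper bound, matching the figure mentioned above.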

samratmitra-0812 · 2020-05-22T16:14:19Z

@phpisciuneri I think we could go with your suggestion of 'undefined' relative error, but we must document it. I will put that in README.

samratmitra-0812 · 2020-05-26T07:44:37Z

@phpisciuneri I have added the fix for columnSumCheck, and addressed other comments. Please take a look.

And since I had to change the same files, I have included @colindean 's #45 changes in this PR.

phpisciuneri · 2020-05-27T15:01:08Z

@samratmitra-0812

And since I had to change the same files, I have included @colindean 's #45 changes in this PR.

Please do not copy code from another PR into yours. It is not a good practice and we lose the history of who originally wrote the code. I am going to go ahead and remove the copied code from this branch.

Instead please provide your review on #45. See About Pull Request Reviews. If you approve, we will merge that branch into master, and then we can update this PR accordingly & address any merge conflicts at that point. See Addressing Merge Conflicts.

There are occasions where it may be appropriate to copy commits from elsewhere. In that case use cherry-picking.

samratmitra-0812 · 2020-05-27T16:20:23Z

@phpisciuneri removed #45 changes from this PR.

phpisciuneri · 2020-05-27T21:44:27Z

OK, we are getting closer. Thanks for being patient. One problem that I have now is that the structure of the checks in the report now contains a lot of duplication of the parameters. See the snippet below.

FWIW: I used jq to parse the report. In my case, the report is named report.json:

jq .tables[0].checks report.json

[
  {
    "type": "rowCount",
    "minNumRows": 50000,
    "failed": true,
    "events": [
      {
        "type": "counter",
        "name": "rowCount",
        "value": 31363
      },
      {
        "type": "columnBasedCheckEvent",
        "failed": true,
        "message": "MinNumRowsCheck Expected: 50000 Actual: 31363 Relative Error: 37.27%",
        "data": {
          "relative_error": "37.27%",
          "expected": "50000",
          "actual": "31363"
        }
      }
    ]
  },
  {
    "type": "columnMaxCheck",
    "column": "age",
    "value": 0,
    "failed": true,
    "events": [
      {
        "type": "columnBasedCheckEvent",
        "failed": true,
        "message": "ColumnMaxCheck age[IntegerType]: Expected: 0, Actual: 90. Relative Error: undefined",
        "data": {
          "relative_error": "undefined",
          "expected": "0",
          "actual": "90"
        }
      }
    ]
  },
  {
    "type": "columnSumCheck",
    "failed": true,
    "events": [
      {
        "type": "columnBasedCheckEvent",
        "failed": true,
        "message": "columnSumCheck on age[LongType]: Expected Range: (0 , 1e4) Actual: 1200747 Relative Error: 11907.47%",
        "data": {
          "inclusive": "false",
          "upper_bound": "1e4",
          "relative_error": "11907.47%",
          "actual": "1200747",
          "lower_bound": "0"
        }
      }
    ],
    "column": "age",
    "minValue": 0,
    "maxValue": 10000,
    "inclusive": null
  }
]

@samratmitra-0812 was there a particular reason for having all the error related terms inside of the data object?

Can/Should we simplify here or make that an issue going forward?

samratmitra-0812 · 2020-05-28T07:03:37Z

@phpisciuneri Initially (5fdcc76), this additional nesting for data was not there. @colindean suggested that we put the data in a separate object of its own as this would make the partial function consistent with the others.

This should follow suit from the case below and put the data in a property of its own. This would make this partial function cleaner and follow the pattern used in the others in this method.

I guess we will need more time to refactor the structure of the json output. If I am not wrong, the duplication issue exists for the RowBased checks as well. We will need to decide what information we show in the report, and how this information is structured.

We may also need to think about making the json output a little less verbose, and show only information that is actually useful. For example, is it necessary to write ValidationTimer events to the report?

I think it is better if we create an issue for this and work on it separately.

colindean · 2020-05-28T17:19:45Z

I'm OK with the duplication of parameters in each check's output. It's sensible for each check to report how it was configured. It's also sensible for the check's events to have a human-readable message as well as the data used in that message as a queriable object. I'd drop the human-readable message before the error object.

BugFix - Incorrect calculation of Error % for ColumnBasedChecks

5fdcc76

samratmitra-0812 requested review from colindean and phpisciuneri May 15, 2020 15:39

colindean requested changes May 15, 2020

Address Review Comments

28ede34

samratmitra-0812 requested review from jebs139 and colindean May 18, 2020 16:36

colindean requested changes May 20, 2020

Minor review comments

64da1eb

samratmitra-0812 requested a review from colindean May 21, 2020 13:17

colindean approved these changes May 21, 2020

phpisciuneri suggested changes May 22, 2020

SamratMitra added 2 commits May 26, 2020 12:48

Incorrect error pct fix for columnSumCheck and other minor review com…

dd8738e

…ments

Minor change in README

51b30f8

samratmitra-0812 mentioned this pull request May 26, 2020

Adds support for Bytes to ColumnSumCheck #45

Merged

samratmitra-0812 requested a review from phpisciuneri May 26, 2020 07:49

phpisciuneri reviewed May 27, 2020

README.md

phpisciuneri and others added 3 commits May 27, 2020 11:29

spelling in README.md

e8f681d

Removed #45 changes and correct spelling mistake

954c18f

Merge branch 'BugFix-IncorrectErrorPct' of github.com:target/data-val…

772061b

…idator into BugFix-IncorrectErrorPct Remove code copied from #45

samratmitra-0812 requested a review from phpisciuneri May 27, 2020 16:20

phpisciuneri approved these changes May 30, 2020

phpisciuneri reviewed May 30, 2020

src/main/scala/com/target/data_validator/validator/ColumnSumCheck.scala

phpisciuneri reviewed May 30, 2020

src/main/scala/com/target/data_validator/validator/ColumnSumCheck.scala

phpisciuneri added 2 commits May 30, 2020 07:53

remove mutable usage & formatting

6e704b5

Merge branch 'master' into BugFix-IncorrectErrorPct

3f424d0

phpisciuneri merged commit 3822238 into master May 30, 2020

phpisciuneri deleted the BugFix-IncorrectErrorPct branch June 1, 2020 15:52

colindean mentioned this pull request Jun 8, 2022

Error % not calculated correctly for ColumnBased checks #41

Closed
