Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BugFix - Incorrect calculation of Error % for ColumnBasedChecks #43

Merged
merged 10 commits into from
May 30, 2020

Conversation

samratmitra-0812
Copy link
Collaborator

This PR fixes the issue of incorrect calculation of Error % for ColumnBasedChecks.

The result of a RowBased check can be completely defined in terms of total rows, error rows, and % of error rows w.r.t total rows. The message shown in the HTML and JSON reports can also be based on a common template for RowBased checks. This is being done in ValidatorCheckEvent.

But the above does not hold true for ColumnBased checks.

  1. For ColumnBased checks, the calculation of Error % should be done based on the expectations specified in the yaml conf file, and the actual value calculated by the validator. The count of rows should not be a part of this calculation.

  2. The fields needed to completely describe the result of a ColumnBased check may vary from one check to another. Even for the same check, it may vary with the data type of the column that we are validating. For example, within ColumnMaxCheck, the Error % field does not make sense for a String column, but is required for Numeric columns. This also implies that the message used in the reports to describe the results of a ColumnBased check will vary from one check to another.

This changes in this PR were made with the above points in mind. Some unit tests have also been added.

Copy link
Collaborator

@colindean colindean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A List[(String, String)] is a smell generally indicating that a Map[String, String] is in order if there won't be duplicate keys. If you don't have a reason to accommodate duplicate keys, refactor from the former to the latter.

fields.append(("type", Json.fromString("columnBasedCheckEvent")))
fields.append(("failed", Json.fromBoolean(cbvce.failed)))
fields.append(("message", Json.fromString(cbvce.msg)))
cbvce.data.foreach(x => fields.append((x._1, Json.fromString(x._2))))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should follow suit from the case below and put the data in a property on its own. This would make this partial function cleaner and follow the pattern used in the others in this method.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


var pctError = "0.00"
var errorMsg = ""
val data = new ListBuffer[Tuple2[String, String]]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
val data = new ListBuffer[Tuple2[String, String]]
val data = new ListBuffer[(String, String)]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to LinkedHashMap.

case DoubleType => num.forall(_.toDouble != row.getDouble(idx))
case ut =>
logger.error(s"quickCheck for type: $ut, Row: $row not Implemented! Please file this as a bug.")
case StringType => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could extract these partial functions to a variable in order to shorten this very long block.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Not sure if what I did is the same as exactly what you had in mind. Let me know.


// calculates and returns the pct error as a string
def calculatePctError(expected: Double, actual: Double, formatStr: String = "%4.2f%%"): String = {
val pct = abs(((expected - actual) * 100.0) / expected)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few places where you're doing something to guard against div/0. I think you should move that guard here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The guards that I was using was to check for validation status (pass/fail), and not for div/0. Hence I have not removed them.
But thanks for catching the div/0 issue. Now I have handled the issue in the method calculatePctError.

@samratmitra-0812
Copy link
Collaborator Author

@colindean Thanks for reviewing this.

I guess I have addressed all your comments. I have switched from List/ListBuffer to Map/LinkedHashMap as I wanted to make sure that the fields are logged in the reports in the same order in which they were inserted.

@samratmitra-0812
Copy link
Collaborator Author

@phpisciuneri Please share your thoughts on this PR.

@phpisciuneri
Copy link
Contributor

See #41

Copy link
Collaborator

@colindean colindean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pending review from @phpisciuneri

Copy link
Contributor

@phpisciuneri phpisciuneri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments, but the major blocker is that there is an additional column based check that is not yet supported: columnSumCheck

Here is an example email notification with all the column based checks failing:

Screen Shot 2020-05-22 at 7 26 31 AM

And check config section if you want to replicate:

tables:
  - db: census_income
    table: adult
    condition: educationNum >= 5
    checks:
      - type: rowCount
        minNumRows: 50000

      - type: columnMaxCheck
        column: age
        value: 10

      - type: columnSumCheck
        column: age
        minValue: 0
        maxValue: 10000

columnMaxCheck supports lower and upper bounds. So I think that calculating the relative error in this case should be dependent on which bound was violated.

So in the example corresponding to the config above. If the actual is -10 then calculate relative error based off of the minValue. If the actual is > 10000 then calculate relative to the maxValue.

override def failed: Boolean = failure

override def toHTML: Text.all.Tag = {
div(cls:="checkEvent")(failedHTML, s" - ${msg}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Braces are redundant here.

Suggested change
div(cls:="checkEvent")(failedHTML, s" - ${msg}")
div(cls:="checkEvent")(failedHTML, s" - $msg")

addEvent(ValidatorCounter("rowCount", count))
addEvent(ValidatorCheckEvent(failed, s"MinNumRowCheck $minNumRows ", count, 1))
val msg = s"MinNumRowsCheck Expected: ${minNumRows} Actual: ${count} Error %: ${pctError}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the redundant braces. Also, I'd like to be specific that the error here is relative, and the percent symbol is redundant with the formatting of the pctError value. See the screen shot in my summary.

Suggested change
val msg = s"MinNumRowsCheck Expected: ${minNumRows} Actual: ${count} Error %: ${pctError}"
val msg = s"MinNumRowsCheck Expected: $minNumRows Actual: $count Relative Error: $pctError"

@@ -62,41 +83,61 @@ case class ColumnMaxCheck(column: String, value: Json)

override def configCheck(df: DataFrame): Boolean = checkTypes(df, column, value)

// scalastyle:off
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove... Suggestions are easy to fix, see below.

Suggested change
// scalastyle:off

}
failed
}
// scalastyle:on
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// scalastyle:on

}

failed = cmp_params._1 != cmp_params._2
val pctError = if(failed) calculatePctError(cmp_params._1, cmp_params._2) else "0.00%"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adhere to scalastyle:

Suggested change
val pctError = if(failed) calculatePctError(cmp_params._1, cmp_params._2) else "0.00%"
val pctError = if (failed) calculatePctError(cmp_params._1, cmp_params._2) else "0.00%"

}

def resultForOther(): Unit = {
logger.error(s"ColumnMaxCheck for type: $dataType, Row: $row not implemented! Please open a bug report on the data-validator issue tracker.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adhere to scalastyle:

Suggested change
logger.error(s"ColumnMaxCheck for type: $dataType, Row: $row not implemented! Please open a bug report on the data-validator issue tracker.")
logger.error(
s"""ColumnMaxCheck for type: $dataType, Row: $row not implemented!
|Please open a bug report on the data-validator issue tracker.""".stripMargin
)

true // Fail check!

var errorMsg = ""
val data = LinkedHashMap.empty[String, String]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a bit heavy handed to me, for ultimately what is just a couple of entries. Maybe ListMap instead?

https://www.scala-lang.org/api/current/scala/collection/immutable/ListMap.html


failed = expected != actual
data += ("expected" -> expected, "actual" -> actual)
errorMsg = s"ColumnMaxCheck $column[StringType]: Expected: $expected, Actual: $actual"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this ought to be the format for when the relative error is undefined?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thinks for the same (check, dataType) combination, the message format should be same. Hence I did not change this.

failed = cmp_params._1 != cmp_params._2
val pctError = if(failed) calculatePctError(cmp_params._1, cmp_params._2) else "0.00%"
data += ("expected" -> num.toString, "actual" -> rMax.toString, "error_percent" -> pctError)
errorMsg = s"ColumnMaxCheck $column[$dataType]: Expected: $num, Actual: $rMax. Error %: ${pctError}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, the redundant percentage symbol and let's be specific that this is relative error. Also, this is basically the duplicate format for MinNumRowsCheck with some slight differences (the period and comma here). This can be defined once and then used both cases.

Suggested change
errorMsg = s"ColumnMaxCheck $column[$dataType]: Expected: $num, Actual: $rMax. Error %: ${pctError}"
errorMsg = s"ColumnMaxCheck $column[$dataType]: Expected: $num, Actual: $rMax Relative Error: ${pctError}"

assert(columnMaxCheck.getEvents contains
ColumnBasedValidatorCheckEvent(true,
LinkedHashMap("expected" -> "100", "actual" -> "3", "error_percent" -> "97.00%").toMap,
"ColumnMaxCheck number[IntegerType]: Expected: 100, Actual: 3. Error %: 97.00%"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you incorporate my suggestions above the formatting here and below will need updated.

@samratmitra-0812
Copy link
Collaborator Author

@phpisciuneri Thanks for reviewing this.

For columnSumCheck, I think another question that needs to be answered is how to calculate the relative error in case of non-inclusive intervals. For example, if the interval is (0.0, 1.0) and the actual value is 1.1, then the relative error should be calculated w.r.t to what? 0.9? 0.99? Please share your thoughts.

I also saw that the columSumCheck is not supported for Byte type, but the columnMaxCheck does support it. Any specific reasons for this?

@colindean
Copy link
Collaborator

I also saw that the columSumCheck is not supported for Byte type, but the columnMaxCheck does support it. Any specific reasons for this?

An oversight! I've taken care of it in #45.

@phpisciuneri
Copy link
Contributor

@samratmitra-0812

For columnSumCheck, I think another question that needs to be answered is how to calculate the relative error in case of non-inclusive intervals. For example, if the interval is (0.0, 1.0) and the actual value is 1.1, then the relative error should be calculated w.r.t to what? 0.9? 0.99? Please share your thoughts.

I think for the example you shared it can just be calculated with respect to the upper bound so 10%.

The more troublesome scenario is when the actual value is the same value as a boundary value in the non-inclusive case. Then the relative error would be 0, but the test has failed. One way to work around this is how you suggest. But the implementation seems like it would be messy and require a bit of work. You would probably need to choose the precision that you increment or decrement based on the type of the result and value and the least significant digit in either.

Maybe for the case of non-inclusive and actual value is the same as a boundary value we just treat the error as undefined. What do you think of that?

@samratmitra-0812
Copy link
Collaborator Author

@phpisciuneri I think we could go with your suggestion of 'undefined' relative error, but we must document it. I will put that in README.

@samratmitra-0812
Copy link
Collaborator Author

@phpisciuneri I have added the fix for columnSumCheck, and addressed other comments. Please take a look.

And since I had to change the same files, I have included @colindean 's #45 changes in this PR.

@phpisciuneri
Copy link
Contributor

@samratmitra-0812

And since I had to change the same files, I have included @colindean 's #45 changes in this PR.

Please do not copy code from another PR into yours. It is not a good practice and we lose the history of who originally wrote the code. I am going to go ahead and remove the copied code from this branch.

Instead please provide your review on #45. See About Pull Request Reviews. If you approve, we will merge that branch into master, and then we can update this PR accordingly & address any merge conflicts at that point. See Addressing Merge Conflicts.

There are occasions where it may be appropriate to copy commits from elsewhere. In that case use cherry-picking.

README.md Outdated Show resolved Hide resolved
@samratmitra-0812
Copy link
Collaborator Author

@phpisciuneri removed #45 changes from this PR.

@phpisciuneri
Copy link
Contributor

OK, we are getting closer. Thanks for being patient. One problem that I have now is that the structure of the checks in the report now contains a lot of duplication of the parameters. See the snippet below.

FWIW: I used jq to parse the report. In my case, the report is named report.json:

jq .tables[0].checks report.json
[
  {
    "type": "rowCount",
    "minNumRows": 50000,
    "failed": true,
    "events": [
      {
        "type": "counter",
        "name": "rowCount",
        "value": 31363
      },
      {
        "type": "columnBasedCheckEvent",
        "failed": true,
        "message": "MinNumRowsCheck Expected: 50000 Actual: 31363 Relative Error: 37.27%",
        "data": {
          "relative_error": "37.27%",
          "expected": "50000",
          "actual": "31363"
        }
      }
    ]
  },
  {
    "type": "columnMaxCheck",
    "column": "age",
    "value": 0,
    "failed": true,
    "events": [
      {
        "type": "columnBasedCheckEvent",
        "failed": true,
        "message": "ColumnMaxCheck age[IntegerType]: Expected: 0, Actual: 90. Relative Error: undefined",
        "data": {
          "relative_error": "undefined",
          "expected": "0",
          "actual": "90"
        }
      }
    ]
  },
  {
    "type": "columnSumCheck",
    "failed": true,
    "events": [
      {
        "type": "columnBasedCheckEvent",
        "failed": true,
        "message": "columnSumCheck on age[LongType]: Expected Range: (0 , 1e4) Actual: 1200747 Relative Error: 11907.47%",
        "data": {
          "inclusive": "false",
          "upper_bound": "1e4",
          "relative_error": "11907.47%",
          "actual": "1200747",
          "lower_bound": "0"
        }
      }
    ],
    "column": "age",
    "minValue": 0,
    "maxValue": 10000,
    "inclusive": null
  }
]

@samratmitra-0812 was there a particular reason for having all the error related terms inside of the data object?

Can/Should we simplify here or make that an issue going forward?

@samratmitra-0812
Copy link
Collaborator Author

samratmitra-0812 commented May 28, 2020

@phpisciuneri Initially (5fdcc76), this additional nesting for data was not there. @colindean suggested that we put the data in a separate object of its own as this would make the partial function consistent with the others.

This should follow suit from the case below and put the data in a property of its own. This would make this partial function cleaner and follow the pattern used in the others in this method.

I guess we will need more time to refactor the structure of the json output. If I am not wrong, the duplication issue exists for the RowBased checks as well. We will need to decide what information we show in the report, and how this information is structured.

We may also need to think about making the json output a little less verbose, and show only information that is actually useful. For example, it it necessary to write ValidationTimer events to the report?

I think it is better if we have a create an issue for this and work on it separately.

@colindean
Copy link
Collaborator

I'm OK with the duplication of parameters in each check's output. It's sensible for each check to report how it was configured. It's also sensible for the check's events to have a human-readable message as well as the data used in that message as a queriable object. I'd drop the human-readable message before the error object.

@phpisciuneri phpisciuneri merged commit 3822238 into master May 30, 2020
@phpisciuneri phpisciuneri deleted the BugFix-IncorrectErrorPct branch June 1, 2020 15:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants