validation errors Extra(nan) or Invalid(nan) #49

upretip · 2019-07-08T20:13:08Z

Shawn,
I am trying your package to see if I can validate a CSV file after reading it with pandas. I am getting Extra(nan) from dt.validate.superset() or Invalid(nan) from dt.validate(). Is there a way I can include those NaN values in my validation sets?

The error looks like this:

E     ValidationError: may contain only elements of given superset (10000 differences): [
            Extra(nan),
            Extra(nan),
            Extra(nan),

Note: I am reading this particular column as str

E       ValidationError: does not satisfy 'str' (10000 differences): [
            Invalid(nan),
            Invalid(nan),
            Invalid(nan),
            Invalid(nan),

Let me know if you find a solution or can help me debug
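
For context on the second error, here is a minimal sketch of why a column read "as str" can still contain NaN. It assumes the file is loaded with pandas.read_csv; the CSV text and the variable names are made up for illustration and are not taken from the actual file.

```python
import io
import math

import pandas as pd

# Made-up sample data: the second row has an empty "A" field.
csv_text = "A,B\nx,1\n,2\ny,3\n"

# dtype=str turns the values that are present into strings,
# but an empty field is still loaded as a missing value (NaN).
df = pd.read_csv(io.StringIO(csv_text), dtype=str)
missing = df["A"].iloc[1]
is_float_nan = isinstance(missing, float) and math.isnan(missing)

# With keep_default_na=False, the empty field is kept as an empty string.
df_raw = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)
kept_as_empty_string = df_raw["A"].iloc[1] == ""
```

Whether an empty string is preferable to NaN depends on what the validation is meant to catch, so this is only one possible way to avoid NaN at load time.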


shawnbrown · 2019-07-10T03:59:39Z

I've been looking at this closely and discovered a handful of unhandled corner cases related to NaN values. Until I get this sorted, NaN values will have to be handled using a workaround, e.g., using the fillna() method to replace them with a proxy value.

As a stopgap, you could do the following:

from datatest import validate, accepted, Invalid

NAN = object()  # Proxy value that stands in for NaN.

# Include NAN in the validation set.
data = df['A'].fillna(NAN)
validate.superset(data, {'x', 'y', 'z', NAN})

# Accept NAN as a difference.
data = df['A'].fillna(NAN)
with accepted(Invalid(NAN)):
    validate(data, str)
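
As background on why a proxy value helps, the following plain-Python sketch shows how NaN behaves in comparisons and set membership. It does not show datatest's internals; the replace_nan helper and the sample column are hypothetical and only mirror the fillna() idea above.

```python
import math

nan = float("nan")

# NaN never compares equal, not even to itself.
self_equal = (nan == nan)

# Set membership checks identity before equality, so the very same NaN
# object is found, while a different NaN object is not.
other_nan = float("nan")
found_same_object = nan in {"x", "y", "z", nan}
found_other_object = other_nan in {"x", "y", "z", nan}

# A single shared sentinel compares equal to itself, so replacing
# every NaN with it restores ordinary matching.
NAN = object()  # hypothetical sentinel, as in the workaround above


def replace_nan(values, sentinel):
    """Return a copy of values with every float NaN swapped for sentinel."""
    return [
        sentinel if isinstance(v, float) and math.isnan(v) else v
        for v in values
    ]


column = ["x", float("nan"), "y", float("nan")]  # made-up sample data
cleaned = replace_nan(column, NAN)
all_in_superset = all(v in {"x", "y", "z", NAN} for v in cleaned)
```

Because membership can depend on object identity, a NaN check may pass or fail depending on where each NaN value came from, which is the kind of corner case a proxy value sidesteps.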

Going forward, I will file a related issue/bug for this with the goal of allowing the use of NaN values directly:

# Include NaN in the validation set.
validate.superset(data, {'x', 'y', 'z', np.nan})

# Accept NaN as a difference.
with accepted(Invalid(np.nan)):
    validate(data, str)

I'll post a follow-up to this issue once I have patched this behavior.

upretip · 2019-07-11T21:56:45Z

Thanks. I will follow this.

shawnbrown · 2019-07-28T07:02:39Z

This is done:
ce71b34: Update predicate handling to better support NaN values.
bee6aa8: Add NaN handling idioms to test_usecases.py.
32d3bb9: Add test_numbers_equal() to verify numeric comparison.
e8435b1: Update difference behavior to support tuples containing NaNs.
c962e04: Change RequiredInterval to fail if arguments are NaN.
c78f390: Fix RequiredInterval to properly handle NaN differences.
4995510: Update NaN use cases to highlight recommended pattern.
fa2646e: Add how-to documentation for working with NaN values.

shawnbrown · 2019-07-28T07:02:54Z

@upretip, I've just pushed some new "how to" docs that give detail regarding NaN validation and behavior. You can view it in the latest docs here:

How to Deal With NaN Values
https://datatest.readthedocs.io/en/latest/how-to/nan-values.html

upretip · 2019-08-05T16:45:02Z

Thanks for the help. Closing this issue now!

upretip closed this as completed Aug 5, 2019
