validation errors Extra(nan) or Invalid(nan) #49

Closed
upretip opened this issue Jul 8, 2019 · 5 comments

@upretip

upretip commented Jul 8, 2019

Shawn,
I am trying out your package to validate a CSV file that I read with pandas. I am getting Extra(nan) from dt.validate.superset() and Invalid(nan) from dt.validate(). Is there a way to include those NaN values in my validation sets?

The error looks like this:

E     ValidationError: may contain only elements of given superset (10000 differences): [
            Extra(nan),
            Extra(nan),
            Extra(nan),

Note: I am reading this particular column as str

E       ValidationError: does not satisfy 'str' (10000 differences): [
            Invalid(nan),
            Invalid(nan),
            Invalid(nan),
            Invalid(nan),

Let me know if you find a solution or can help me debug this.
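For context, both messages are consistent with how NaN behaves in plain Python: pandas typically represents a missing CSV cell as a float NaN, which is not a str and does not compare equal to anything, itself included. The sketch below uses only the standard library (no datatest or pandas), and its variable names are illustrative assumptions rather than anything from the report.

```python
import math

# A missing cell read by pandas usually shows up as a float NaN.
missing = float("nan")

# NaN is a float, so a type check against `str` fails for it.
is_str = isinstance(missing, str)

# NaN never compares equal, not even to another NaN.
equal_to_nan = missing == float("nan")

# Set membership falls back on equality for distinct objects, so this
# NaN is not found in a set that holds a different NaN object.
allowed = {"x", "y", "z", float("nan")}
found_in_set = missing in allowed

# math.isnan() is the reliable way to detect a NaN value.
detected = math.isnan(missing)

print(is_str, equal_to_nan, found_in_set, detected)
```

This is why each NaN row would be reported as a separate Extra(nan) or Invalid(nan) difference instead of matching a NaN placed in the requirement.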

@shawnbrown
Owner

I've been looking at this closely and discovered a handful of unhandled corner cases related to NaN values. Until I get this sorted out, NaN values will have to be handled with a workaround, e.g., using the fillna() method to replace them with a proxy value.

As a stopgap, you could do the following:

from datatest import validate, accepted, Invalid

NAN = object()  # Unique proxy value that stands in for NaN.

# Include NAN in the validation set.
data = df['A'].fillna(NAN)
validate.superset(data, {'x', 'y', 'z', NAN})

# Accept NAN as a difference.
data = df['A'].fillna(NAN)
with accepted(Invalid(NAN)):
    validate(data, str)
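To illustrate why the proxy value helps, here is a minimal standard-library sketch that mimics the effect of fillna(NAN) on a plain list. The replace_nan helper and the sample data are hypothetical, and the superset check is written by hand; it is not how datatest implements validation.

```python
import math

NAN = object()  # Unique sentinel standing in for NaN.


def replace_nan(values, sentinel):
    """Return a list with every float NaN swapped for the sentinel."""
    return [
        sentinel if isinstance(v, float) and math.isnan(v) else v
        for v in values
    ]


raw = ["x", float("nan"), "y", float("nan")]
data = replace_nan(raw, NAN)

# Unlike NaN, the sentinel compares equal to itself, so ordinary
# membership tests against the superset work as expected.
superset = {"x", "y", "z", NAN}
extras = [v for v in data if v not in superset]

print(extras)
```

Because the sentinel is an ordinary object, it matches itself by identity, so no element is left over as an extra.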

Going forward, I will file a related issue/bug for this with the goal of allowing the use of NaN values directly:

# Include NaN in the validation set.
validate.superset(data, {'x', 'y', 'z', np.nan})

# Accept NaN as a difference.
with accepted(Invalid(np.nan)):
    validate(data, str)

I'll post a follow-up to this issue once I have patched this behavior.

@upretip
Author

upretip commented Jul 11, 2019

Thanks. I will follow this.

@shawnbrown
Owner

This is done:
ce71b34: Update predicate handling to better support NaN values.
bee6aa8: Add NaN handling idioms to test_usecases.py.
32d3bb9: Add test_numbers_equal() to verify numeric comparison.
e8435b1: Update difference behavior to support tuples containing NaNs.
c962e04: Change RequiredInterval to fail if arguments are NaN.
c78f390: Fix RequiredInterval to properly handle NaN differences.
4995510: Update NaN use cases to highlight recommended pattern.
fa2646e: Add how-to documentation for working with NaN values.

@shawnbrown
Owner

@upretip, I've just pushed some new "how to" docs that go into detail about NaN validation and behavior. You can view them in the latest docs here:

How to Deal With NaN Values
https://datatest.readthedocs.io/en/latest/how-to/nan-values.html

@upretip
Author

upretip commented Aug 5, 2019

Thanks for the help. Closing this issue now!

upretip closed this as completed Aug 5, 2019