Improve datetime handling in ScalarInequality and ScalarRange #853

pvk-developer
Member

Resolves #819

@pvk-developer pvk-developer force-pushed the issue-819-improve-datetime-handling-in-scalarinequality-and-scalarrange branch from c5206b5 to 2a7b48d Compare June 23, 2022 13:49
@codecov-commenter

codecov-commenter commented Jun 23, 2022

Codecov Report

Merging #853 (be3e27c) into master (6939278) will increase coverage by 0.49%.
The diff coverage is 100.00%.

❗ Current head be3e27c differs from pull request most recent head 8594098. Consider uploading reports for the commit 8594098 to get more accurate results

@@            Coverage Diff             @@
##           master     #853      +/-   ##
==========================================
+ Coverage   67.65%   68.15%   +0.49%     
==========================================
  Files          38       38              
  Lines        2823     2867      +44     
==========================================
+ Hits         1910     1954      +44     
  Misses        913      913              

Impacted Files               Coverage Δ
sdv/constraints/tabular.py   99.74% <100.00%> (+0.01%) ⬆️
sdv/constraints/utils.py     100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6939278...8594098. Read the comment docs.

@pvk-developer pvk-developer marked this pull request as ready for review June 23, 2022 14:40
@pvk-developer pvk-developer requested a review from a team as a code owner June 23, 2022 14:40
@pvk-developer pvk-developer requested review from fealho and amontanez24 and removed request for a team June 23, 2022 14:40
Contributor

@amontanez24 amontanez24 left a comment

Looking good! Mostly minor comments about restructuring

@@ -424,10 +425,17 @@ def _validate_inputs(column_name, value, relation):
        if not (isinstance(value, (int, float)) or is_datetime_type(value)):
            raise ValueError('`value` must be a number or datetime.')

        if is_datetime_type(value):
Contributor

Since is_datetime_type(value) is already called above, can we capture the result in a variable?

Comment on lines 432 to 434
            return cast_to_datetime64(value)

        return value
Contributor

I don't think the _validate_inputs method should return a value. If we want to group the logic, we can add a parse_value method that calls _validate_inputs and then converts the value to datetime if necessary.
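
A minimal sketch of the split suggested above, assuming standalone functions rather than the actual SDV methods. `is_datetime_like` is a hypothetical stand-in for the PR's `is_datetime_type` helper, and the `column_name` and `relation` checks and their messages are assumptions added for illustration; only the `value` error message comes from the diff.

```python
from datetime import datetime

import numpy as np
import pandas as pd


def is_datetime_like(value):
    """Hypothetical stand-in for the PR's ``is_datetime_type`` helper."""
    if isinstance(value, (datetime, np.datetime64, pd.Timestamp)):
        return True
    if isinstance(value, str):
        try:
            pd.to_datetime(value)
            return True
        except (ValueError, TypeError):
            return False
    return False


def validate_inputs(column_name, value, relation):
    """Validate only: raise on bad input and return nothing."""
    # The first two checks are assumptions for illustration.
    if not isinstance(column_name, str):
        raise ValueError('`column_name` must be a string.')
    if relation not in ('>', '>=', '<', '<='):
        raise ValueError('`relation` must be one of ">", ">=", "<", "<=".')
    if not (isinstance(value, (int, float)) or is_datetime_like(value)):
        raise ValueError('`value` must be a number or datetime.')


def parse_value(column_name, value, relation):
    """Validate first, then convert datetime-like values to ``numpy.datetime64``."""
    validate_inputs(column_name, value, relation)
    if is_datetime_like(value):
        return pd.to_datetime(value).to_datetime64()
    return value
```

With this split, the validator keeps a single responsibility and callers that need the converted value use `parse_value` instead.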

sdv/constraints/tabular.py (resolved)
@@ -531,6 +541,11 @@ def reverse_transform(self, table_data):
            original_column = self._value - diff_column

        table_data[self._column_name] = pd.Series(original_column).astype(self._dtype)
        if self._is_datetime and self._datetime_format:
Contributor

Is there ever a case where self._datetime_format is set but self._is_datetime is False? There shouldn't be, right? The former is only set if the latter is True.

Member Author

If self._is_datetime is True but we are not able to detect the datetime format, then self._datetime_format is None, and we can't format with None.

Contributor

@amontanez24 amontanez24 Jun 27, 2022

I mean the other way around. Couldn't the if statement just be if self._datetime_format? Why do we need to check the self._is_datetime as well?

sdv/constraints/tabular.py (resolved)
@amontanez24
Contributor

Has the test coverage been updated? If not, it seems like some cases are missing.

Member

@fealho fealho left a comment

Looking good, I just had a few questions + some typos.

@@ -531,6 +541,11 @@ def reverse_transform(self, table_data):
            original_column = self._value - diff_column

        table_data[self._column_name] = pd.Series(original_column).astype(self._dtype)
        if self._is_datetime and self._datetime_format:
Member

Is the self._datetime_format check really necessary? It's only falsy when the data is empty or filled with NaNs, but in those cases self._is_datetime should be False, right?

Member Author

Sometimes self._datetime_format cannot be learned (detect_datetime_format can return None), because format detection does not support fractional seconds:

In [32]: detect_datetime_format('2021-02-02T00:00:00.000000')
In [33]: detect_datetime_format('2021-02-02T00:00:00')
Out[33]: '%Y-%m-%dT%H:%M:%S'
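
To illustrate why the format check matters, here is a hedged sketch of applying a learned format only when detection succeeded. `format_datetime_column` is a hypothetical helper name, not part of SDV; it only assumes that the learned format is either a `strftime` pattern or `None`.

```python
import pandas as pd


def format_datetime_column(column, datetime_format=None):
    """Return datetimes as strings only when a format was learned.

    ``datetime_format`` is ``None`` when detection failed. In that case the
    column is returned as datetime64 values instead of being formatted.
    """
    column = pd.to_datetime(column)
    if datetime_format:
        return column.dt.strftime(datetime_format)
    return column


dates = pd.Series(['2021-02-02T00:00:00', '2021-03-04T05:06:07'])
formatted = format_datetime_column(dates, '%Y-%m-%dT%H:%M:%S')
unformatted = format_datetime_column(dates, None)
```

Here `formatted` holds strings in the learned format, while `unformatted` stays a datetime64 column because there is no format to apply.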


         if self._is_datetime:
-            table_data[self.column_name] = pd.to_datetime(data)
+            table_data[self._column_name] = pd.to_datetime(data)
             if self._datetime_format:
Member

Same as above, do we really need this check?

        ``numpy.datetime64`` value or values.
    """
    if isinstance(value, str):
        value = pd.to_datetime(value).to_datetime64()
Member

Why do we need to do to_datetime and to_datetime64?

Member Author

I used both to ensure the value ends up as numpy.datetime64[ns], so that all the datetimes share the same unit.
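
A small sketch of the unit normalization described in this reply. `to_datetime64_ns` is a hypothetical helper, and the explicit `astype('datetime64[ns]')` is an addition of this sketch (the PR code relies on `pd.to_datetime(value).to_datetime64()` alone), included because the resulting unit may depend on the pandas version.

```python
from datetime import datetime

import numpy as np
import pandas as pd


def to_datetime64_ns(value):
    """Convert a datetime-like value to ``numpy.datetime64`` with nanosecond unit."""
    return pd.to_datetime(value).to_datetime64().astype('datetime64[ns]')


# Inputs of different types and units end up with the same dtype.
from_string = to_datetime64_ns('2021-02-02')
from_datetime = to_datetime64_ns(datetime(2021, 2, 2))
from_day_unit = to_datetime64_ns(np.datetime64('2021-02-02', 'D'))
```

Once every value shares the nanosecond unit, subtracting or comparing them yields results in a single, predictable unit.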



def detect_datetime_format(value):
    """Detect if possible the datetime format.
Member

Docstring missing a word?

    def test__validate_inputs_with_numerical_value(self):
        """Test the ``_validate_inputs`` method.

        Ensure the method crashes when the column name is not a string.
Member

Incorrect, method doesn't crash.

    def test__validate_inputs_with_datetime_value(self):
        """Test the ``_validate_inputs`` method.

        Ensure the method crashes when the column name is not a string.
Member

Incorrect, method doesn't crash.

def test_is_datetime_type_with_datetime_str():
    """Test the ``is_datetime_type`` function when an valid datetime string is passed.

    Expect to return False when an int variable is passed.
Member

Should be expected to return True.

Input:
- string
Output:
- True
Member

False

def test_is_datetime_type_with_invalid_str():
    """Test the ``is_datetime_type`` function when an invalid string is passed.

    Expect to return False when an int variable is passed.
Member

Str, not int


# Assert
pd.testing.assert_frame_equal(table_data, out)
Member

Why did you delete this check?

Contributor

@amontanez24 amontanez24 left a comment

Just a couple small comments and address Felipe's comment on the test file and then this is good to go!

Comment on lines 429 to 430
        if value_is_datetime:
            if not isinstance(value, str):
Contributor

This can just be an `and` on one level, right? I.e.:

if value_is_datetime and not isinstance(value, str):

@pvk-developer pvk-developer force-pushed the issue-819-improve-datetime-handling-in-scalarinequality-and-scalarrange branch 2 times, most recently from 942bd90 to be3e27c Compare June 28, 2022 19:14
raise ValueError('`value` must be a number or datetime.')

if value_is_datetime not isinstance(value, str):
Contributor

I think this should have an and in between

@pvk-developer pvk-developer force-pushed the issue-819-improve-datetime-handling-in-scalarinequality-and-scalarrange branch from 2413f01 to 8594098 Compare June 28, 2022 19:29
@fealho fealho self-requested a review June 28, 2022 20:34
@pvk-developer pvk-developer force-pushed the issue-819-improve-datetime-handling-in-scalarinequality-and-scalarrange branch from 8594098 to aa33b3e Compare June 29, 2022 08:57
@pvk-developer pvk-developer merged commit c2c2c0d into master Jun 29, 2022
@pvk-developer pvk-developer deleted the issue-819-improve-datetime-handling-in-scalarinequality-and-scalarrange branch June 29, 2022 14:14
Development

Successfully merging this pull request may close these issues.

Improve datetime handling in ScalarInequality and ScalarRange constraints
4 participants