Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Options for how r.parsedate() should handle invalid dates #416

Closed
mattkiefer opened this issue Mar 17, 2022 · 11 comments
Closed

Options for how r.parsedate() should handle invalid dates #416

mattkiefer opened this issue Mar 17, 2022 · 11 comments
Labels
enhancement New feature or request

Comments

@mattkiefer
Copy link

Exceptions are normal expected behavior when typecasting an invalid format. However, r.parsedate() is really just re-formatting strings and keeping the type as text. So it may be better to print-and-pass on exception so the user can see a complete list of invalid values -- while also allowing for the parser to reformat the remaining valid values.

sqlite-utils convert idfpr.db license "Expiration Date" "r.parsedate(value)"
  [#######-----------------------------]   21%  00:01:57Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/sqlite_utils/db.py", line 2336, in convert_value
    return fn(v)
  File "<string>", line 2, in fn
  File "/usr/local/lib/python3.9/dist-packages/sqlite_utils/recipes.py", line 8, in parsedate
    parser.parse(value, dayfirst=dayfirst, yearfirst=yearfirst).date().isoformat()
  File "/usr/lib/python3/dist-packages/dateutil/parser/_parser.py", line 1374, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/usr/lib/python3/dist-packages/dateutil/parser/_parser.py", line 652, in parse
    raise ParserError("String does not contain a date: %s", timestr)
dateutil.parser._parser.ParserError: String does not contain a date:   /  /    

In this case, I had just one variation of an invalid date: ' / / '. But theoretically there could be many values that would have to be fixed one at a time with the current exception handling.

@simonw simonw added the enhancement New feature or request label Mar 18, 2022
@simonw
Copy link
Owner

simonw commented Mar 18, 2022

Good call-out: right now the parsedate() and parsedatetime() functions both terminate with an exception if they hit something invalid: https://sqlite-utils.datasette.io/en/stable/cli.html#sqlite-utils-convert-recipes

It would be better if this was configurable by the user (and properly documented) - options could include "set null if date is invalid" and "leave the value as it is if invalid" in addition to throwing an error.

@simonw simonw changed the title r.parsedate() chokes on invalid dates Options for how r.parsedate() should handle invalid dates Mar 18, 2022
@simonw simonw changed the title Options for how r.parsedate() should handle invalid dates Options for how r.parsedate() should handle invalid dates Mar 18, 2022
@simonw
Copy link
Owner

simonw commented Mar 18, 2022

Python's str.encode() method has a errors= parameter that does something along these lines: https://docs.python.org/3/library/stdtypes.html#str.encode

errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error(),

Imitating this might be the way to go.

@simonw
Copy link
Owner

simonw commented Mar 21, 2022

Generating a test database using a pattern from https://www.geekytidbits.com/date-range-table-sqlite/

sqlite-utils create-database test-dates.db
sqlite-utils create-table test-dates.db dates id integer date text --pk id
sqlite-utils test-dates.db "WITH RECURSIVE
  cnt(x) AS (
     SELECT 0
     UNION ALL
     SELECT x+1 FROM cnt
      LIMIT (SELECT ((julianday('2016-04-01') - julianday('2016-03-15'))) + 1)
  )
insert into dates (date) select date(julianday('2016-03-15'), '+' || x || ' days') as date FROM cnt;"

After running that:

% sqlite-utils rows test-dates.db dates
[{"id": 1, "date": "2016-03-15"},
 {"id": 2, "date": "2016-03-16"},
 {"id": 3, "date": "2016-03-17"},
 {"id": 4, "date": "2016-03-18"},
 {"id": 5, "date": "2016-03-19"},
 {"id": 6, "date": "2016-03-20"},
 {"id": 7, "date": "2016-03-21"},
 {"id": 8, "date": "2016-03-22"},
 {"id": 9, "date": "2016-03-23"},
 {"id": 10, "date": "2016-03-24"},
 {"id": 11, "date": "2016-03-25"},
 {"id": 12, "date": "2016-03-26"},
 {"id": 13, "date": "2016-03-27"},
 {"id": 14, "date": "2016-03-28"},
 {"id": 15, "date": "2016-03-29"},
 {"id": 16, "date": "2016-03-30"},
 {"id": 17, "date": "2016-03-31"},
 {"id": 18, "date": "2016-04-01"}]

Then to make one of them invalid:

sqlite-utils test-dates.db "update dates set date = '//' where id = 10"

@simonw
Copy link
Owner

simonw commented Mar 21, 2022

Then I ran this to convert 2016-03-27 etc to 2016/03/27 so I could see which ones were later converted:

sqlite-utils convert test-dates.db dates date 'value.replace("-", "/")'

@simonw
Copy link
Owner

simonw commented Mar 21, 2022

I confirmed that if it fails for any value ALL values are left alone, since it runs in a transaction.

Here's the code that does that:

with self.db.conn:
self.db.execute(sql, where_args or [])
if drop:
self.transform(drop=columns)

@simonw
Copy link
Owner

simonw commented Mar 21, 2022

I think the options here should be:

  • On error, raise an exception and revert the transaction (the current default)
  • On error, leave the value as-is
  • On error, set the value to None

These need to be indicated by parameters to the r.parsedate() function.

Some design options:

  • ignore=True to ignore errors - but how does it know if it should leave the value or set it to None? This is similar to other ignore=True parameters elsewhere in the Python API.
  • errors="ignore", errors="set-null" - I don't like magic string values very much, but this is similar to Python's str.encode(errors=) mechanism
  • errors=r.IGNORE - using constants, which at least avoids magic strings. The other one could be errors=r.SET_NULL
  • error=lambda v: None or error=lambda v: v - this is a bit confusing though, introducing another callback that gets to have a go at converting the error if the first callback failed? And what happens if that lambda itself raises an error?

@simonw
Copy link
Owner

simonw commented Mar 21, 2022

I'm going to try the errors=r.IGNORE option and see what that looks like once implemented.

@simonw
Copy link
Owner

simonw commented Mar 21, 2022

This is quite nice:

% sqlite-utils convert test-dates.db dates date "r.parsedate(value, errors=r.IGNORE)"
  [####################################]  100%
% sqlite-utils rows test-dates.db dates                                           
[{"id": 1, "date": "2016-03-15"},
 {"id": 2, "date": "2016-03-16"},
 {"id": 3, "date": "2016-03-17"},
 {"id": 4, "date": "2016-03-18"},
 {"id": 5, "date": "2016-03-19"},
 {"id": 6, "date": "2016-03-20"},
 {"id": 7, "date": "2016-03-21"},
 {"id": 8, "date": "2016-03-22"},
 {"id": 9, "date": "2016-03-23"},
 {"id": 10, "date": "//"},
 {"id": 11, "date": "2016-03-25"},
 {"id": 12, "date": "2016-03-26"},
 {"id": 13, "date": "2016-03-27"},
 {"id": 14, "date": "2016-03-28"},
 {"id": 15, "date": "2016-03-29"},
 {"id": 16, "date": "2016-03-30"},
 {"id": 17, "date": "2016-03-31"},
 {"id": 18, "date": "2016-04-01"}]
% sqlite-utils convert test-dates.db dates date "r.parsedate(value, errors=r.SET_NULL)"
  [####################################]  100%
% sqlite-utils rows test-dates.db dates                                                
[{"id": 1, "date": "2016-03-15"},
 {"id": 2, "date": "2016-03-16"},
 {"id": 3, "date": "2016-03-17"},
 {"id": 4, "date": "2016-03-18"},
 {"id": 5, "date": "2016-03-19"},
 {"id": 6, "date": "2016-03-20"},
 {"id": 7, "date": "2016-03-21"},
 {"id": 8, "date": "2016-03-22"},
 {"id": 9, "date": "2016-03-23"},
 {"id": 10, "date": null},
 {"id": 11, "date": "2016-03-25"},
 {"id": 12, "date": "2016-03-26"},
 {"id": 13, "date": "2016-03-27"},
 {"id": 14, "date": "2016-03-28"},
 {"id": 15, "date": "2016-03-29"},
 {"id": 16, "date": "2016-03-30"},
 {"id": 17, "date": "2016-03-31"},
 {"id": 18, "date": "2016-04-01"}]

@simonw
Copy link
Owner

simonw commented Mar 21, 2022

Prototype:

diff --git a/sqlite_utils/cli.py b/sqlite_utils/cli.py
index 8255b56..0a3693e 100644
--- a/sqlite_utils/cli.py
+++ b/sqlite_utils/cli.py
@@ -2583,7 +2583,11 @@ def _generate_convert_help():
     """
     ).strip()
     recipe_names = [
-        n for n in dir(recipes) if not n.startswith("_") and n not in ("json", "parser")
+        n
+        for n in dir(recipes)
+        if not n.startswith("_")
+        and n not in ("json", "parser")
+        and callable(getattr(recipes, n))
     ]
     for name in recipe_names:
         fn = getattr(recipes, name)
diff --git a/sqlite_utils/recipes.py b/sqlite_utils/recipes.py
index 6918661..569c30d 100644
--- a/sqlite_utils/recipes.py
+++ b/sqlite_utils/recipes.py
@@ -1,17 +1,38 @@
 from dateutil import parser
 import json
 
+IGNORE = object()
+SET_NULL = object()
 
-def parsedate(value, dayfirst=False, yearfirst=False):
+
+def parsedate(value, dayfirst=False, yearfirst=False, errors=None):
     "Parse a date and convert it to ISO date format: yyyy-mm-dd"
-    return (
-        parser.parse(value, dayfirst=dayfirst, yearfirst=yearfirst).date().isoformat()
-    )
+    try:
+        return (
+            parser.parse(value, dayfirst=dayfirst, yearfirst=yearfirst)
+            .date()
+            .isoformat()
+        )
+    except parser.ParserError:
+        if errors is IGNORE:
+            return value
+        elif errors is SET_NULL:
+            return None
+        else:
+            raise
 
 
-def parsedatetime(value, dayfirst=False, yearfirst=False):
+def parsedatetime(value, dayfirst=False, yearfirst=False, errors=None):
     "Parse a datetime and convert it to ISO datetime format: yyyy-mm-ddTHH:MM:SS"
-    return parser.parse(value, dayfirst=dayfirst, yearfirst=yearfirst).isoformat()
+    try:
+        return parser.parse(value, dayfirst=dayfirst, yearfirst=yearfirst).isoformat()
+    except parser.ParserError:
+        if errors is IGNORE:
+            return value
+        elif errors is SET_NULL:
+            return None
+        else:
+            raise
 
 
 def jsonsplit(value, delimiter=",", type=str):

@simonw
Copy link
Owner

simonw commented Mar 21, 2022

Needs tests and documentation.

@simonw simonw closed this as completed in 878d5f5 Mar 21, 2022
simonw added a commit that referenced this issue Apr 13, 2022
@mattkiefer
Copy link
Author

Thanks for addressing this @simonw! However, I just reinstalled sqlite-utils 3.26.1 and get an ParserError: Unknown string format: None:

sqlite-utils --version
sqlite-utils, version 3.26.1
sqlite-utils convert idfpr.db license "Original Issue Date" "r.parsedate(value)"
Traceback (most recent call last):
  File "/home/matt/.local/lib/python3.9/site-packages/sqlite_utils/db.py", line 2514, in convert_value
    return fn(v)
  File "<string>", line 2, in fn
  File "/home/matt/.local/lib/python3.9/site-packages/sqlite_utils/recipes.py", line 19, in parsedate
    parser.parse(value, dayfirst=dayfirst, yearfirst=yearfirst)
  File "/usr/lib/python3/dist-packages/dateutil/parser/_parser.py", line 1374, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/usr/lib/python3/dist-packages/dateutil/parser/_parser.py", line 649, in parse
    raise ParserError("Unknown string format: %s", timestr)
dateutil.parser._parser.ParserError: Unknown string format: None
Traceback (most recent call last):
  File "/home/matt/.local/bin/sqlite-utils", line 8, in <module>
    sys.exit(cli())
  File "/usr/lib/python3/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/matt/.local/lib/python3.9/site-packages/sqlite_utils/cli.py", line 2707, in convert
    db[table].convert(
  File "/home/matt/.local/lib/python3.9/site-packages/sqlite_utils/db.py", line 2530, in convert
    self.db.execute(sql, where_args or [])
  File "/home/matt/.local/lib/python3.9/site-packages/sqlite_utils/db.py", line 463, in execute
    return self.conn.execute(sql, parameters)
sqlite3.OperationalError: user-defined function raised exception

I definitely have some invalid data in the db. Happy to send a copy if it's helpful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants