Bad escape characters trigger an exception #1045

Closed
Etirf opened this issue Mar 16, 2022 · 17 comments · Fixed by #1095

Comments

@Etirf

Etirf commented Mar 16, 2022

Note: As a workaround for this issue, we have pinned regex, which makes Python 3.11 support either impossible or uncomfortable. The goal now is to remove that version pin on regex without making this issue resurface.

Hello everyone,

I tried parsing a date under Python 3.7.5 and 3.9:

dateparser.parse('12/12/12')

The same error is raised for any "valid" input shown in the docs:

dateparser.parse('Fri, 12 Dec 2014 10:55:50')
dateparser.parse('22 Décembre 2010', date_formats=['%d %B %Y'])
...

Here's the error:


---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
Input In [46], in <cell line: 1>()
----> 1 dateparser.parse("12/12/12")

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\conf.py:92, in apply_settings.<locals>.wrapper(*args, **kwargs)
     89 if not isinstance(kwargs['settings'], Settings):
     90     raise TypeError("settings can only be either dict or instance of Settings class")
---> 92 return f(*args, **kwargs)

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\__init__.py:61, in parse(date_string, date_formats, languages, locales, region, settings, detect_languages_function)
     57 if languages or locales or region or detect_languages_function or not settings._default:
     58     parser = DateDataParser(languages=languages, locales=locales,
     59                             region=region, settings=settings, detect_languages_function=detect_languages_function)
---> 61 data = parser.get_date_data(date_string, date_formats)
     63 if data:
     64     return data['date_obj']

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:428, in DateDataParser.get_date_data(self, date_string, date_formats)
    425 date_string = sanitize_date(date_string)
    427 for locale in self._get_applicable_locales(date_string):
--> 428     parsed_date = _DateLocaleParser.parse(
    429         locale, date_string, date_formats, settings=self._settings)
    430     if parsed_date:
    431         parsed_date['locale'] = locale.shortname

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:178, in _DateLocaleParser.parse(cls, locale, date_string, date_formats, settings)
    175 @classmethod
    176 def parse(cls, locale, date_string, date_formats=None, settings=None):
    177     instance = cls(locale, date_string, date_formats, settings)
--> 178     return instance._parse()

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:182, in _DateLocaleParser._parse(self)
    180 def _parse(self):
    181     for parser_name in self._settings.PARSERS:
--> 182         date_data = self._parsers[parser_name]()
    183         if self._is_valid_date_data(date_data):
    184             return date_data

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:196, in _DateLocaleParser._try_freshness_parser(self)
    194 def _try_freshness_parser(self):
    195     try:
--> 196         return freshness_date_parser.get_date_data(self._get_translated_date(), self._settings)
    197     except (OverflowError, ValueError):
    198         return None

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:234, in _DateLocaleParser._get_translated_date(self)
    232 def _get_translated_date(self):
    233     if self._translated_date is None:
--> 234         self._translated_date = self.locale.translate(
    235             self.date_string, keep_formatting=False, settings=self._settings)
    236     return self._translated_date

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\languages\locale.py:131, in Locale.translate(self, date_string, keep_formatting, settings)
    128 dictionary = self._get_dictionary(settings)
    129 date_string_tokens = dictionary.split(date_string, keep_formatting)
--> 131 relative_translations = self._get_relative_translations(settings=settings)
    133 for i, word in enumerate(date_string_tokens):
    134     word = word.lower()

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\languages\locale.py:158, in Locale._get_relative_translations(self, settings)
    155 if settings.NORMALIZE:
    156     if self._normalized_relative_translations is None:
    157         self._normalized_relative_translations = (
--> 158             self._generate_relative_translations(normalize=True))
    159     return self._normalized_relative_translations
    160 else:

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\languages\locale.py:172, in Locale._generate_relative_translations(self, normalize)
    170     value = list(map(normalize_unicode, value))
    171 pattern = '|'.join(sorted(value, key=len, reverse=True))
--> 172 pattern = DIGIT_GROUP_PATTERN.sub(r'?P<n>\d+', pattern)
    173 pattern = re.compile(r'^(?:{})$'.format(pattern), re.UNICODE | re.IGNORECASE)
    174 relative_dictionary[pattern] = key

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\regex\regex.py:700, in _compile_replacement_helper(pattern, template)
    695     break
    696 if ch == "\\":
    697     # '_compile_replacement' will return either an int group reference
    698     # or a string literal. It returns items (plural) in order to handle
    699     # a 2-character literal (an invalid escape sequence).
--> 700     is_group, items = _compile_replacement(source, pattern, is_unicode)
    701     if is_group:
    702         # It's a group, so first flush the literal.
    703         if literal:

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\regex\_regex_core.py:1736, in _compile_replacement(source, pattern, is_unicode)
   1733         if value is not None:
   1734             return False, [value]
-> 1736     raise error("bad escape \\%s" % ch, source.string, source.pos)
   1738 if isinstance(source.sep, bytes):
   1739     octal_mask = 0xFF

error: bad escape \d at position 7

How to reproduce:
Environment: Windows 10

  • Fresh install of Python 3.7.5 or 3.9
  • Create a simple Python file containing these two lines:
import dateparser
dateparser.parse("12/12/12")
@souch3

souch3 commented Mar 16, 2022

I am seeing the exact same behavior with code that worked just 2 hours ago. This is on macOS. I tested with Python 3.8.2, 3.8.5, and 3.10.2.

@daisieh

daisieh commented Mar 16, 2022

Same here. Python 3.7.12, macOS.

@tomasvotava

Same here, Python 3.9-slim and 3.10-slim docker images, sample code:

from dateparser import parse
parse("7 days ago")

Output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/site-packages/dateparser/conf.py", line 92, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/dateparser/__init__.py", line 61, in parse
    data = parser.get_date_data(date_string, date_formats)
  File "/usr/local/lib/python3.10/site-packages/dateparser/date.py", line 428, in get_date_data
    parsed_date = _DateLocaleParser.parse(
  File "/usr/local/lib/python3.10/site-packages/dateparser/date.py", line 178, in parse
    return instance._parse()
  File "/usr/local/lib/python3.10/site-packages/dateparser/date.py", line 182, in _parse
    date_data = self._parsers[parser_name]()
  File "/usr/local/lib/python3.10/site-packages/dateparser/date.py", line 196, in _try_freshness_parser
    return freshness_date_parser.get_date_data(self._get_translated_date(), self._settings)
  File "/usr/local/lib/python3.10/site-packages/dateparser/date.py", line 234, in _get_translated_date
    self._translated_date = self.locale.translate(
  File "/usr/local/lib/python3.10/site-packages/dateparser/languages/locale.py", line 131, in translate
    relative_translations = self._get_relative_translations(settings=settings)
  File "/usr/local/lib/python3.10/site-packages/dateparser/languages/locale.py", line 158, in _get_relative_translations
    self._generate_relative_translations(normalize=True))
  File "/usr/local/lib/python3.10/site-packages/dateparser/languages/locale.py", line 172, in _generate_relative_translations
    pattern = DIGIT_GROUP_PATTERN.sub(r'?P<n>\d+', pattern)
  File "/usr/local/lib/python3.10/site-packages/regex/regex.py", line 700, in _compile_replacement_helper
    is_group, items = _compile_replacement(source, pattern, is_unicode)
  File "/usr/local/lib/python3.10/site-packages/regex/_regex_core.py", line 1736, in _compile_replacement
    raise error("bad escape \\%s" % ch, source.string, source.pos)
regex._regex_core.error: bad escape \d at position 7

We were using dateparser==1.0.0, upgrading to dateparser==1.1.0 didn't solve the issue.

@xiaopc

xiaopc commented Mar 16, 2022

This is probably caused by the dependency regex==2022.3.15.
Rolling back to regex==2022.1.18 may help.

update: this commit
mrabarnett/mrab-regex@138970b

@dmoklaf

dmoklaf commented Mar 16, 2022

I can confirm that deploying regex==2022.1.18 instead (through conda in my case) makes the bug disappear.
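
For anyone who wants to check an environment before deciding whether the pin is needed, here is a minimal sketch. It assumes that regex version strings keep their date-style `YYYY.M.D` form and that 2022.3.15 is the first release with the stricter behaviour, as reported in this thread. The helper names are made up for illustration, and the check only matters for dateparser releases that still contain the unescaped template.

```python
from importlib.metadata import PackageNotFoundError, version

# First regex release reported in this thread to reject "\d" in
# replacement templates (an assumption taken from the comments above).
FIRST_STRICT_REGEX = (2022, 3, 15)


def parse_version(text):
    """Turn a date-style version such as '2022.3.15' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".")[:3])


def is_strict_regex(version_text):
    """Return True if this regex version is at or after the strict release."""
    return parse_version(version_text) >= FIRST_STRICT_REGEX


def installed_regex_is_strict():
    """Check the regex package in the current environment, if it is installed."""
    try:
        return is_strict_regex(version("regex"))
    except PackageNotFoundError:
        return False  # regex is not installed, so there is nothing to pin
```

Comparing tuples of integers rather than raw strings matters here, because as strings "2022.10.1" would sort before "2022.3.15".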

@EpicWink

EpicWink commented Mar 16, 2022

This is caused by a behaviour change introduced in mrabarnett/mrab-regex@138970b (released as regex v2022.3.15). Installing any earlier version (e.g. v2022.3.2) should fix it.

The change makes regex raise on invalid ASCII escape characters in both pattern compilation and substitution. I'm not sure whether this is a bug in dateparser or in regex.

This will be a problem on all supported platforms and environments (Linux, macOS, Windows; Python 3.6 to 3.10).
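
The rejection can be reproduced without regex installed, because the standard library `re` module treats replacement templates the same way on Python 3.7 and later. The sketch below is only an illustration: `DIGIT_GROUP` is an assumed stand-in for dateparser's `DIGIT_GROUP_PATTERN`, and the source string is a made-up locale pattern.

```python
import re

# Assumed stand-in for dateparser's DIGIT_GROUP_PATTERN:
# it matches the literal four characters "\d+".
DIGIT_GROUP = re.compile(r"\\d\+")


def template_error(template, source=r"(\d+) days ago"):
    """Return the error message if the replacement template is rejected, else None."""
    try:
        DIGIT_GROUP.sub(template, source)
    except re.error as exc:
        return str(exc)
    return None


# The template from the traceback: "\d" is not a valid template escape.
print(template_error(r"?P<n>\d+"))
# A template without any backslash escape is accepted.
print(template_error(r"?P<n>[0-9]+"))
```

The exact position reported in the message may differ between `re` and regex, so the helper returns the whole message instead of parsing it.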

adbar added a commit to adbar/htmldate that referenced this issue Mar 16, 2022
bdura added a commit to aphp/edsnlp that referenced this issue Mar 16, 2022
@mycaule

mycaule commented Mar 16, 2022

This breaks CI/CD when installing the latest version. Please update the PyPI package too, thanks a lot.

@tducret

tducret commented Mar 16, 2022

Hi. I also ran into this problem (and at first thought it was a Mac M1 problem with the regex lib).
It turns out to be related to regex dropping support for Python versions older than 3.6:

Since Python 3.6, the re module has been rejecting unknown escape sequences such as \q in patterns and escape sequences including \d in replacement templates.

As the regex module no longer supports versions of Python <3.6, I've brought the regex module into line with re.

Your code should now read:

pattern = DIGIT_GROUP_PATTERN.sub(r'?P<n>\\d+', pattern)

More info in mrabarnett/mrab-regex/issues/459

Here is one problematic pattern, but there may be more.
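
As a rough sketch of what the suggested change does, the snippet below applies the doubled-backslash template with the standard library `re` module and compares it with an alternative that is not from this thread: passing a callable, whose return value is inserted verbatim. `DIGIT_GROUP` and `SOURCE` are assumed stand-ins, not dateparser's actual data.

```python
import re

# Assumed stand-in for dateparser's DIGIT_GROUP_PATTERN:
# it matches the literal four characters "\d+".
DIGIT_GROUP = re.compile(r"\\d\+")
SOURCE = r"(\d+) days ago"  # hypothetical locale pattern, for illustration only

# Option 1 (from the thread): double the backslash so the template
# produces a literal "\d" in the output.
escaped = DIGIT_GROUP.sub(r"?P<n>\\d+", SOURCE)

# Option 2 (an alternative, not proposed in the thread): a callable's
# return value is used as-is, so no template escapes are processed.
verbatim = DIGIT_GROUP.sub(lambda match: r"?P<n>\d+", SOURCE)

# Both options yield the same named-group pattern, which then compiles
# the same way as in the traceback above.
compiled = re.compile(r"^(?:{})$".format(escaped), re.UNICODE | re.IGNORECASE)
number = compiled.match("7 days ago").group("n")
print(escaped, verbatim, number)
```

The doubled backslash keeps the diff small, while the callable avoids template syntax entirely if more replacements of this kind turn up.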

xiaopc added a commit to xiaopc/qdii-value that referenced this issue Mar 17, 2022
1. fix gfinance exceptions
2. temporary fix for scrapinghub/dateparser#1045
3. readme update
@miguelbalmeida

I can confirm that this issue is NOT specific to macOS: our CI/CD uses Linux machines and was affected by this. My local machine, running Ubuntu, was also affected.

Explicitly pinning regex==2022.1.18 as suggested by @xiaopc fixed it for us.

@saemideluxe

saemideluxe commented Mar 17, 2022

Thanks for the fix and for writing the library in the first place. This seems to me to be one of the best date parsing libraries; we use it for a lot of data imports. Hoping for a pip release soon as well. Keep up the good work 👍

asadurski added a commit that referenced this issue Mar 17, 2022
@asadurski
Member

Many thanks for the thorough investigation!
For now I'll make a quick fix by pinning the regex version, but in the long run we should follow @tducret's suggestion (#1045 (comment)) and fix the regexes.

If anyone's up for a PR with the fix, please go ahead!
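
Since there may be more affected templates, a small audit helper could be a starting point for such a PR. This is only a sketch using the standard library `re` module (Python 3.7 or later): the helper name and the candidate list are made up, and a real audit would have to collect the replacement templates from the code base. Group references such as `\1` are validated against the probe pattern, so the helper only makes sense for templates without them.

```python
import re


def bad_templates(templates):
    """Map each template that re.sub rejects to its error message."""
    probe = re.compile("x")  # trivial pattern, only used to trigger template parsing
    rejected = {}
    for template in templates:
        try:
            probe.sub(template, "x")
        except re.error as exc:
            rejected[template] = str(exc)
    return rejected


# Hypothetical candidates, for illustration only.
candidates = [r"?P<n>\d+", r"?P<n>\\d+", r"\s*", "plain text"]
report = bad_templates(candidates)
for template, message in report.items():
    print(repr(template), "->", message)
```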

@rerowep

rerowep commented Mar 17, 2022

Is it possible to push version 1.1.1 to PyPI, please?

@asadurski
Member

Thank you for raising that, @rerowep. It seems like the PyPI publish action got stuck. It's published now 👍

rerowep added a commit to rerowep/rero-ils that referenced this issue Mar 17, 2022
* Fixes problem with regex in datetime:
scrapinghub/dateparser#1045
* Fixes flask_celerxext mappings for scheduler.

Co-Authored-by:  Peter Weber <Peter.Weber@rero.ch>
asaggi-supportlogic added a commit to asaggi-supportlogic/dateparser that referenced this issue Jun 30, 2022
goodspark added a commit to goodspark/dateparser that referenced this issue Aug 11, 2022
mweinelt added a commit to NixOS/nixpkgs that referenced this issue Sep 15, 2022
Patch by thmo (THomas Moschny) taken from
scrapinghub/dateparser#1045 (comment).

Also relaxes the regex version constraint.
@Gallaecio Gallaecio reopened this Oct 15, 2022
@Gallaecio
Member

Reopening until we fix it properly.

@Gallaecio Gallaecio changed the title from "Can't parse anything" to "Bad escape characters trigger an exception" Oct 21, 2022
duncanmmacleod added a commit to duncanmmacleod/gwpy that referenced this issue Nov 7, 2022
@DJRHails
Contributor

DJRHails commented Nov 8, 2022

I independently arrived at the same solution as the PR; explanation for the bug here.

@thebaptiste

Fine. Now expecting a new release on PyPI!
