-
Notifications
You must be signed in to change notification settings - Fork 466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature iso date format with non english language #86
Feature iso date format with non english language #86
Conversation
Fixed github issue#45
Fixes issue#45
Fixed wrong test
@@ -24,40 +26,51 @@ def __init__(self, shortname, language_info): | |||
if isinstance(value, int): | |||
simplification[key] = str(value) | |||
|
|||
self._cached = self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we aliasing self
as self._cached
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did that because we generate a new instance of Language whenever a new skip token is added and identified like here. Is there a better practice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But now original methods are never called, they all proxying to self._cached
. Is there a point?
How do you feel about instead of implementing this on a Language
class level go level above and try to implement this on LanguageDataLoader
level, or BaseLanguageDetector
level or even DateDataParser
level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure if direct assignments to self
like self = SomeInstance()
a good practice or not. Hence, used _cached
attrib. I tried implementing it on almost all levels. Higher levels slowed down the speed and had other issues but due to speed imperative I didn't look into fixing those issues. Tests took about 16 seconds to finish when I tried implementing it on DateDataParser level in comparison with ~ 6 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What have caused those speed issues? I mean we are generating updated languages once and then just reusing them, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we would customize settings before importing parse
, then it would work alright. But if we'd customize settings after importing parse
, it won't work hence requiring generation of languages again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can address this isue when we'd have more configuration changes to work with.
What about using LanguageDataLoader
for now then?
Reverted changes to a test
return ( | ||
tz_obj[0], | ||
{ | ||
'regex': re.compile(re.sub(repl, replw, regex % tz_obj[0]), re.IGNORECASE), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't it be clearer to pass regex
with other function parameters?
@Allactaga please take a look again. |
global default_language_loader | ||
if settings.SKIP_TOKENS != self._skip_tokens: | ||
default_language_loader = LanguageDataLoader() | ||
self = DateDataParser(*self._default_args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think is happening in this block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If DateDataParser
instance's _skip_tokens
differ from settings.SKIP_TOKENS
it creates a new instance of LanguageDataLoader
and assigns it to default_language_loader
which is then used in DataDataParser.__init__
while creating a new instance for DateDataParser
with originally provided arguments and updating reference to self. This ensures new skip tokens are accounted for while detecting language and subsequently parsing date string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You understand that this is happening every time you call get_date_data
because you updating local variable self
and not the instance itself, right?
Also, as for now we don't expose our settings interface to user, let's assume it's not changing and move this check to __init__
method, getting rid of global
in the same time. Instead, create a class method get_language_loader
and use class attribute to lazy instantiate default language loader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made changes accordingly, please take a look again.
('zzz', 0) | ||
{ | ||
'regex_patterns': | ||
[r'(^\w|\w$|\s|\d)\(?%s\)?$'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\b
wasn't suitable because it would also match parenthesis and hence not omit them failing case likes #45
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. Can you please give me a code example. I can't think of the valid case. It is working like this though:
>>> re.sub(r'(\b|\d)\(?UTC\)?$', r'\1', "19:00(UTC)")
'19:00'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here:
In [14]: re.sub(r'(\b|\d)\(?UTC\)?$', r'\1', '19:00 (UTC)')
Out[14]: '19:00 ('
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, i see. So what are ^\w
and \w$
there for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also thinking in the direction of (\W|\d|_)
Refactored code to remove skip tokens check
Removed SKIP_TOKENS tests
]) | ||
language_loader._data = ordered_languages | ||
self.add_patch(patch('dateparser.date.default_language_loader', new=language_loader)) | ||
self.parser = date.DateDataParser(languages=restrict_to_languages, **params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is not restricting languages in loader to a limited subset anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
…non-english-language Feature iso date format with non english language
Fixes iso datestamp format parsing issue.
parse('2015-05-02T10:20:19+0000', languages=['fr'])
is not parsed, while
parse('2015-05-02T10:20:19+0000', languages=['en'])
has good result.Fixes #45