Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature iso date format with non english language #86

Merged
merged 32 commits into from
Jul 13, 2015

Conversation

waqasshabbir
Copy link
Contributor

Fixes iso datestamp format parsing issue.

parse('2015-05-02T10:20:19+0000', languages=['fr'])
is not parsed, while
parse('2015-05-02T10:20:19+0000', languages=['en']) has good result.

Fixes #45

@@ -24,40 +26,51 @@ def __init__(self, shortname, language_info):
if isinstance(value, int):
simplification[key] = str(value)

self._cached = self
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we aliasing self as self._cached ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did that because we generate a new instance of Language whenever a new skip token is added and identified like here. Is there a better practice?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But now original methods are never called, they all proxying to self._cached. Is there a point?
How do you feel about instead of implementing this on a Language class level go level above and try to implement this on LanguageDataLoader level, or BaseLanguageDetector level or even DateDataParser level?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure if direct assignments to self like self = SomeInstance() a good practice or not. Hence, used _cached attrib. I tried implementing it on almost all levels. Higher levels slowed down the speed and had other issues but due to speed imperative I didn't look into fixing those issues. Tests took about 16 seconds to finish when I tried implementing it on DateDataParser level in comparison with ~ 6 seconds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What have caused those speed issues? I mean we are generating updated languages once and then just reusing them, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we would customize settings before importing parse, then it would work alright. But if we'd customize settings after importing parse, it won't work hence requiring generation of languages again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can address this isue when we'd have more configuration changes to work with.
What about using LanguageDataLoader for now then?

Reverted changes to a test
return (
tz_obj[0],
{
'regex': re.compile(re.sub(repl, replw, regex % tz_obj[0]), re.IGNORECASE),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it be clearer to pass regex with other function parameters?

@waqasshabbir
Copy link
Contributor Author

@Allactaga please take a look again.

global default_language_loader
if settings.SKIP_TOKENS != self._skip_tokens:
default_language_loader = LanguageDataLoader()
self = DateDataParser(*self._default_args)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think is happening in this block?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If DateDataParser instance's _skip_tokens differ from settings.SKIP_TOKENS it creates a new instance of LanguageDataLoader and assigns it to default_language_loader which is then used in DataDataParser.__init__ while creating a new instance for DateDataParser with originally provided arguments and updating reference to self. This ensures new skip tokens are accounted for while detecting language and subsequently parsing date string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You understand that this is happening every time you call get_date_data because you updating local variable self and not the instance itself, right?
Also, as for now we don't expose our settings interface to user, let's assume it's not changing and move this check to __init__ method, getting rid of global in the same time. Instead, create a class method get_language_loader and use class attribute to lazy instantiate default language loader.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made changes accordingly, please take a look again.

('zzz', 0)
{
'regex_patterns':
[r'(^\w|\w$|\s|\d)\(?%s\)?$'],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\b wasn't suitable because it would also match parenthesis and hence not omit them failing case likes #45

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. Can you please give me a code example. I can't think of the valid case. It is working like this though:

>>> re.sub(r'(\b|\d)\(?UTC\)?$', r'\1', "19:00(UTC)")
'19:00'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here:

In [14]: re.sub(r'(\b|\d)\(?UTC\)?$', r'\1', '19:00 (UTC)')
Out[14]: '19:00 ('

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, i see. So what are ^\w and \w$ there for?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also thinking in the direction of (\W|\d|_)

Removed SKIP_TOKENS tests
])
language_loader._data = ordered_languages
self.add_patch(patch('dateparser.date.default_language_loader', new=language_loader))
self.parser = date.DateDataParser(languages=restrict_to_languages, **params)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is not restricting languages in loader to a limited subset anymore

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

waqasshabbir added a commit that referenced this pull request Jul 13, 2015
…non-english-language

Feature iso date format with non english language
@waqasshabbir waqasshabbir merged commit 1718d73 into master Jul 13, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dates "08/17/14 17:00 PM (PDT)" not parsed.
2 participants