Feature iso date format with non english language #86

waqasshabbir · 2015-05-20T11:27:06Z

Fixes an ISO datestamp parsing issue with non-English languages.

parse('2015-05-02T10:20:19+0000', languages=['fr'])
is not parsed, while
parse('2015-05-02T10:20:19+0000', languages=['en']) returns the correct result.

Fixes #45
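
For reference, the timestamp in the description is plain ISO 8601 with a numeric UTC offset, so its meaning does not depend on any language setting. The sketch below uses only the standard library to show the value a language-independent parse should yield; `parse_iso_with_offset` is a hypothetical helper for illustration and is not how dateparser implements it.

```python
from datetime import datetime, timedelta, timezone

# strptime format for 'YYYY-MM-DDTHH:MM:SS+HHMM'; %z accepts offsets like +0000.
ISO_WITH_OFFSET = '%Y-%m-%dT%H:%M:%S%z'


def parse_iso_with_offset(text):
    """Hypothetical helper: parse an ISO datestamp with a UTC offset."""
    return datetime.strptime(text, ISO_WITH_OFFSET)


parsed = parse_iso_with_offset('2015-05-02T10:20:19+0000')
expected = datetime(2015, 5, 2, 10, 20, 19, tzinfo=timezone.utc)

# A non-zero offset is kept as timezone information on the result.
shifted = parse_iso_with_offset('2015-05-02T10:20:19+0530')
```

Nothing in the string is language-specific, which is why the 'fr' and 'en' calls would be expected to agree.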


Allactaga · 2015-06-25T15:42:01Z

dateparser/languages/language.py

@@ -24,40 +26,51 @@ def __init__(self, shortname, language_info):
            if isinstance(value, int):
                simplification[key] = str(value)

+        self._cached = self


Why are we aliasing self as self._cached?

I did that because we generate a new instance of Language whenever a new skip token is added and identified, like here. Is there a better practice?

But now the original methods are never called; they all proxy to self._cached. Is there a point?
How do you feel about implementing this one level above the Language class instead: at the LanguageDataLoader level, the BaseLanguageDetector level, or even the DateDataParser level?

I wasn't sure whether direct assignment to self, like self = SomeInstance(), is good practice, hence the _cached attribute. I tried implementing it at almost all levels. The higher levels were slower and had other issues, but because speed was imperative I didn't look into fixing them. The tests took about 16 seconds to finish when I implemented it at the DateDataParser level, compared with ~6 seconds.

What caused those speed issues? I mean, we generate the updated languages once and then just reuse them, right?

If we customize settings before importing parse, it works all right. But if we customize settings after importing parse, it won't work, so the languages have to be generated again.

We can address this issue once we have more configuration changes to work with.
What about using LanguageDataLoader for now then?
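
One way to picture the loader-level suggestion is a cache keyed by language and skip-token set, so that an updated language object is built once and then reused. This is a minimal sketch under stated assumptions: `SketchLanguage` and `SketchLoader` are hypothetical stand-ins, not dateparser's `Language` or `LanguageDataLoader`.

```python
class SketchLanguage:
    """Hypothetical stand-in for a language object built from skip tokens."""

    def __init__(self, shortname, skip_tokens):
        self.shortname = shortname
        self.skip_tokens = frozenset(skip_tokens)


class SketchLoader:
    """Hypothetical loader that caches languages instead of aliasing self."""

    def __init__(self):
        self._cache = {}
        self.builds = 0  # counts how often a language is actually constructed

    def get(self, shortname, skip_tokens=()):
        # frozenset makes the key hashable and independent of token order.
        key = (shortname, frozenset(skip_tokens))
        if key not in self._cache:
            self.builds += 1
            self._cache[key] = SketchLanguage(shortname, skip_tokens)
        return self._cache[key]


loader = SketchLoader()
first = loader.get('fr', ['t'])
again = loader.get('fr', ['t'])        # served from the cache
other = loader.get('fr', ['t', 'z'])   # new skip-token set, new instance
```

With this shape the language class keeps its original methods, and the decision to rebuild lives in one place.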


Allactaga · 2015-06-25T16:37:52Z

dateparser/timezone_parser.py

+        return (
+            tz_obj[0],
+            {
+                'regex': re.compile(re.sub(repl, replw, regex % tz_obj[0]), re.IGNORECASE),


Wouldn't it be clearer to pass regex along with the other function parameters?


waqasshabbir · 2015-06-25T20:21:02Z

@Allactaga please take a look again.

Allactaga · 2015-06-28T14:12:54Z

dateparser/date.py

+        global default_language_loader
+        if settings.SKIP_TOKENS != self._skip_tokens:
+            default_language_loader = LanguageDataLoader()
+            self = DateDataParser(*self._default_args)


What do you think is happening in this block?

If the DateDataParser instance's _skip_tokens differ from settings.SKIP_TOKENS, it creates a new LanguageDataLoader instance and assigns it to default_language_loader, which DateDataParser.__init__ then uses when a new DateDataParser instance is created with the originally provided arguments and the reference to self is updated. This ensures that new skip tokens are accounted for when detecting the language and subsequently parsing the date string.

You understand that this happens every time you call get_date_data, because you are updating the local variable self and not the instance itself, right?
Also, since for now we don't expose our settings interface to the user, let's assume it doesn't change and move this check to the __init__ method, getting rid of the global at the same time. Instead, create a class method get_language_loader and use a class attribute to lazily instantiate the default language loader.

Made changes accordingly, please take a look again.
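
The two points in this thread can be sketched in isolation: assigning to self inside a method only rebinds a local name, and a class attribute can hold a lazily created default loader. The class names below are hypothetical, and `object()` stands in for the real loader; only the method name `get_language_loader` comes from the review comment.

```python
class Rebinder:
    """Shows that `self = ...` inside a method does not change the caller's object."""

    def __init__(self, value):
        self.value = value

    def replace_self(self, value):
        # Rebinds the local name `self` only; the original instance is untouched.
        self = Rebinder(value)
        return self


class SketchParser:
    """Hypothetical parser with a lazily instantiated, shared default loader."""

    _default_loader = None  # class attribute, filled on first use

    @classmethod
    def get_language_loader(cls):
        if cls._default_loader is None:
            cls._default_loader = object()  # stand-in for a real loader
        return cls._default_loader


original = Rebinder(1)
replacement = original.replace_self(2)

loader_a = SketchParser.get_language_loader()
loader_b = SketchParser.get_language_loader()
```

Because the caller still holds `original`, a check written this way would run again on every call, which is the behaviour the review points out.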

waqasshabbir · 2015-06-30T15:09:31Z

dateparser/timezones.py

-    ('zzz', 0)
+    {
+        'regex_patterns':
+            [r'(^\w|\w$|\s|\d)\(?%s\)?$'],


\b wasn't suitable because it also matches at a parenthesis, so the parentheses would not be stripped, failing cases like #45.

Hm. Can you please give me a code example? I can't think of a valid case. It works like this, though:

>>> re.sub(r'(\b|\d)\(?UTC\)?$', r'\1', "19:00(UTC)")
'19:00'

Here:

In [14]: re.sub(r'(\b|\d)\(?UTC\)?$', r'\1', '19:00 (UTC)')
Out[14]: '19:00 ('

Okay, I see. So what are ^\w and \w$ there for?

I am also thinking in the direction of (\W|\d|_)
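
The difference between the two prefixes discussed in this thread can be reproduced with a small script. This is a sketch for comparison only: `strip_tz` is a hypothetical helper, the pattern from the diff is filled in with 'UTC' by hand, and the real code may substitute differently.

```python
import re

# Prefix based on a word boundary, as tried in the thread.
WORD_BOUNDARY = r'(\b|\d)\(?UTC\)?$'

# Pattern from the diff, with the timezone name substituted for %s.
FROM_DIFF = r'(^\w|\w$|\s|\d)\(?%s\)?$' % 'UTC'


def strip_tz(pattern, text):
    """Hypothetical helper: drop a trailing timezone, keeping the captured prefix."""
    return re.sub(pattern, r'\1', text)


no_space_boundary = strip_tz(WORD_BOUNDARY, '19:00(UTC)')    # '19:00'
space_boundary = strip_tz(WORD_BOUNDARY, '19:00 (UTC)')      # '19:00 (' keeps the parenthesis
no_space_diff = strip_tz(FROM_DIFF, '19:00(UTC)')            # '19:00'
space_diff = strip_tz(FROM_DIFF, '19:00 (UTC)')              # '19:00 ' keeps only the space
```

In the spaced input, \b matches between "(" and "U", so the opening parenthesis is left behind, while the \s alternative starts the match at the space and consumes the parenthesis.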


Allactaga · 2015-07-09T10:17:20Z

tests/test_date.py

-            ])
-            language_loader._data = ordered_languages
-            self.add_patch(patch('dateparser.date.default_language_loader', new=language_loader))
+        self.parser = date.DateDataParser(languages=restrict_to_languages, **params)


This change no longer restricts the languages in the loader to a limited subset.


waqasshabbir added 13 commits May 18, 2015 11:54

Fixes iso format issue with non-english languages

4604134

Fixed utc offset (+/-HHMM) parsing

e64e97d

Added timezone parsing support for utcoffset format

cd97f12

Fixed github issue#45

Added tests for dates with utc offset

368e385

Fixes issue#45

Added source for the list of utc offsets

b8d5885

Fixed tests failing due to timezone differences

c41df69

Fixed indentation.

ddc9019

Fixed wrong test

Merge branch 'master' into feature-iso-date-format-with-non-english-l…

8831e09

…anguage

Check on translation. Has issues with language detection

68d46d5

Moved logic to add new skip tokens to is_applicable method

82903f2

Added tests for SKIP_TOKENS configuration

da3bad3

Updated docs accordingly

26a9448

Fixed a minor mistake

e1984a6

Allactaga reviewed Jun 25, 2015

Fixed typo. Removed irrelevant comment.

6118177

Reverted changes to a test

Allactaga reviewed Jun 25, 2015

waqasshabbir added 7 commits June 25, 2015 22:48

Fixed inside-method signatures for better readability. Fixed a regex

96ad833

Dynamic skip tokens additions, LanguageDataLoader approach

fe67b6a

Revert "Fixed a minor mistake"

ef73451

This reverts commit e1984a6.

Revert "Moved logic to add new skip tokens to is_applicable method"

3e5ce0e

This reverts commit 82903f2.

Revert "Check on translation. Has issues with language detection"

2f90553

This reverts commit 68d46d5. Conflicts: dateparser/conf.py dateparser/languages/loader.py

Merge branch 'language-data-loader' into feature-iso-date-format-with…

bf95a94

…-non-english-language

Moved tests to proper file

ad73410

waqasshabbir added 2 commits June 26, 2015 13:52

Removed 't' from base skip

4a24537

Added default 't' to skip tokens tests

f149f11

Allactaga reviewed Jun 28, 2015

waqasshabbir added 2 commits June 29, 2015 11:05

Accessing _raw_data attrib as a static variable

5ac821c

Changes according to notes by Eugene

4b7c860

waqasshabbir reviewed Jun 30, 2015

waqasshabbir added 4 commits July 1, 2015 18:07

Fixed regex for stripping timezone

07c36bc

Commented failing test

8329812

Uncommented failing test.

47080cf

Refactored code to remove skip tokens check

Fixed failing tests

e2d8967

Allactaga mentioned this pull request Jul 2, 2015

Stop parsing invalid dates #66

Closed

Made lazy loader a static method.

b29a3d9

Removed SKIP_TOKENS tests

Allactaga reviewed Jul 9, 2015

waqasshabbir added 2 commits July 9, 2015 18:31

Fixed given_parser test method

7736534

Removed unused method

e4d4ff8

waqasshabbir added a commit that referenced this pull request Jul 13, 2015

Merge pull request #86 from scrapinghub/feature-iso-date-format-with-…

1718d73

…non-english-language Feature iso date format with non english language

waqasshabbir merged commit 1718d73 into master Jul 13, 2015


waqasshabbir commented May 20, 2015

Allactaga Jun 25, 2015

waqasshabbir Jun 25, 2015

Allactaga Jun 25, 2015

waqasshabbir Jun 25, 2015

Allactaga Jun 25, 2015

waqasshabbir Jun 25, 2015

Allactaga Jun 25, 2015

Allactaga Jun 25, 2015

waqasshabbir commented Jun 25, 2015

Allactaga Jun 28, 2015

waqasshabbir Jun 29, 2015

Allactaga Jun 29, 2015

waqasshabbir Jun 30, 2015

waqasshabbir Jun 30, 2015

Allactaga Jun 30, 2015

waqasshabbir Jul 1, 2015

Allactaga Jul 1, 2015

Allactaga Jul 1, 2015

Allactaga Jul 9, 2015

waqasshabbir Jul 9, 2015
