[WiP] new number-parser library - incorporation #711

arnavkapoor · 2020-06-17T12:38:11Z

Hi everyone, for the past couple of weeks I have been working on a number-parser library, whose goal is to parse numbers written in natural language. A basic English-only version is available here. I will keep improving the library over the coming months; a detailed plan is available here.

The current goals for the library include:

Handling ordinal numbers (third, seventh, five thousand and sixth).
Support for multiple languages (similar to how date-parser lets users contribute language-specific nuances).
Supporting year-specific lingo (saying "nineteen eighty" as opposed to "one thousand nine hundred and eighty").

The current PR removes the need for hard-coded number values in the data translation files. It also allows dates like 'thirty May two thousand and nine' to parse correctly.
Currently, the number-parser is called inside the 'get_date_data' function of 'DateDataParser', as that seemed to be called prior to parsing by all features (except search?).

I would appreciate any inputs with regards to incorporation and bugs/features/ideas that you can think of for the number-parser library. Please feel free to raise issues in the library.

To test this update you will need to clone number-parser from the GitHub link and install it using 'pip install -e', as it hasn't been added to the dependencies yet.

Gallaecio · 2020-06-17T15:02:01Z

I guess related issues would be: #651, #239, and #46.

noviluni · 2020-06-17T18:08:25Z

Hi @arnavkapoor !

I think we can move your addition to another place. Let me explain it.

Inside the get_date_data() method we find:

checking for the correct type
(your addition)
trying with provided date formats (parse_with_formats)
sanitization (sanitize_date)
for loop trying applicable locales

I would move it at least after the sanitization part. However, as number-parser will support different locales, it could be a good idea to add it inside the for loop, where we can pass directly the locale to number-parser.

This approach has some advantages and some disadvantages. Summarizing:

Pros:

If we don't implement any lang detection to number-parser, this will allow us to directly pass the desired locale.
In case we implement a sort of lang detection in number-parser, passing the correct language will save time.
We will be translating to the expected locale, so no mixing logic from different locales.

Cons:

This approach will call number-parser multiple times, even if there isn't anything to "parse", so it could be more expensive.

As the disadvantage is not very important, I think I would proceed with the second approach.

We could add:

...
for locale in self._get_applicable_locales(date_string):
    date_string = parser.parse(date_string, locales=[locale.shortname])
    ...

If you like it, go ahead and add the locales parameter to the parse method in number-parser. (Don't worry if it doesn't work, just add it even if it's not used). I added a list to allow us to pass multiple possible locales, but you can also implement a locale parameter, up to you.
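To make the proposal concrete, here is a minimal, self-contained sketch of translating number words once per candidate locale. Everything in it is an assumption for illustration: NUMBER_WORDS, transform_numbers() and candidates() are hypothetical stand-ins, not the number-parser or dateparser API, and the word tables are toy data.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical per-locale word tables; real data would be far larger.
NUMBER_WORDS: Dict[str, Dict[str, int]] = {
    "en": {"one": 1, "two": 2, "thirty": 30},
    "es": {"uno": 1, "dos": 2, "treinta": 30},
}


def transform_numbers(text: str, locales: Iterable[str]) -> str:
    """Replace known number words with digits, using only the given locales."""
    words: Dict[str, int] = {}
    for code in locales:
        words.update(NUMBER_WORDS.get(code, {}))
    # Tokens that are not number words in these locales pass through unchanged.
    return " ".join(str(words.get(tok.lower(), tok)) for tok in text.split())


def candidates(date_string: str, applicable_locales: List[str]) -> List[Tuple[str, str]]:
    """Translate the original string once per locale, as in the proposed loop."""
    results = []
    for code in applicable_locales:
        # Always start from the untouched input so that one locale's
        # rewrite never leaks into the attempt for the next locale.
        translated = transform_numbers(date_string, locales=[code])
        results.append((code, translated))
    return results


print(candidates("thirty May", ["en", "es"]))
```

One design point the sketch makes explicit: it keeps the original string and stores each translation separately, rather than reassigning date_string inside the loop, so a translation made for one locale cannot affect the next iteration.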

Let me know what you think about this.

On the other hand, we should add some tests to dateparser to ensure that this integration works, so you could also add some tests to this PR. Those tests should be as easy as testing different sentences with "natural written numbers" in different languages.

In any case, good job on this @arnavkapoor ! The progress is quite amazing! 🚀 😄

noviluni · 2020-06-17T18:47:14Z

Ok, I've been looking at the code, and it seems that dateparser lang detection doesn't work when passing for example "one of january of two thousand eight", as "one" is not recognised as English (I know it should be "first", but as number-parser doesn't work with "ordinals" I used "one").

We have here two options:

Improve the lang detection in dateparser
Make number-parser to autodetect the language and put it before the for loop.

Either option could work, but I think the second option would be better for dateparser, as this way we could avoid mixing logic from different languages (imagine we "translate" a sentence coming from Spanish, for example, and then when we parse it in dateparser we use the wrong locale).

To implement the second option we could implement a function like is_a_number() in number-parser and use it inside dateparser (should be applied in the are_tokens_valid "any" check). This function could add an optional locale argument to ensure that we are using the expected locale. This function would check the passed string against the numbers (ordinal and cardinal) for that language.
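A rough sketch of what such a helper could look like, assuming small hard-coded word sets. The NUMBER_WORDS table and the exact signature are hypothetical; number-parser does not necessarily expose anything like this.

```python
from typing import Dict, FrozenSet, Optional

# Hypothetical cardinal and ordinal words per locale (toy data).
NUMBER_WORDS: Dict[str, FrozenSet[str]] = {
    "en": frozenset({"one", "two", "three", "first", "second", "third"}),
    "es": frozenset({"uno", "una", "dos", "tres", "primero", "segundo"}),
}


def is_a_number(token: str, locale: Optional[str] = None) -> bool:
    """Return True if the token is a digit string or a known number word.

    With a locale, only that locale's words are checked; without one,
    every known locale is tried.
    """
    word = token.strip().lower()
    if word.isdigit():
        return True
    if locale is not None:
        return word in NUMBER_WORDS.get(locale, frozenset())
    return any(word in words for words in NUMBER_WORDS.values())


print(is_a_number("one", locale="en"))   # True
print(is_a_number("one", locale="es"))   # False
print(is_a_number("january"))            # False
```

Passing the locale keeps the check strict, so an English word is not accepted while a Spanish locale is being tried.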

Let me know if you have any other ideas.

noviluni · 2020-06-18T10:38:07Z

By the way, number-parser is not supporting Python 2 (and I don't think it should be supporting it), so I think that this integration could come at the same time as a major release of dateparser without Python 2 support 🎉 .

@Gallaecio what do you think?

Gallaecio · 2020-06-18T10:43:15Z

It’s about time! 😛

noviluni · 2020-08-19T09:52:06Z

Hi @arnavkapoor!

Do you need help with the tests/pipeline?

The Python 3.5 pipeline is expected to fail because of the f-strings. Don't worry about it because we will remove the Python 3.5 support in the next version, at the same time we release this change.

The other Python versions are failing because of an issue with Arabic, if I'm not wrong. Could you check it? Let me know if you need help or feedback.

arnavkapoor · 2020-08-19T10:49:16Z

Hi @noviluni, yes, I was investigating the failing test cases and it does seem that the failure wasn't due to number-parser, though I am surprised that all cases with "second" are passing. I will look into this.
I was also removing the other hard-coded values for es, which were basically un and una. Doing this caused a test case to fail in test_languages: the string "hace un horas". However, dateparser.parse("hace un horas") seems to work correctly and returns a time one hour ago, so I guess some intermediate translation is not being performed.

Apart from this, I tried out other strings like "May twenty seventh, two thousand and seventeen" and similar cases, and they seem to be working fine. So I will also create a new tests/test_number_parser file (better name?) to add these cases. Other than that, I don't see any other major changes needed for incorporation.

noviluni · 2020-08-19T10:56:58Z

@arnavkapoor

Great! Let me know when you are ready with the tests. I think a better name for the file would be test_natural_language_numbers or something similar, as we are not going to test the number_parser package but the result of using it.

Try to use real date examples in different languages, but you can also try with some polemic/weird cases and discuss them here if they are failing or you have any doubt.

If you check other tests in this package, you will see that they are written with the nose package (which is compatible with pytest), but I would prefer if you don't use those examples and you use pytest instead, as we migrated the test suite to it and it's better to write the new tests using pytest.

When I have a moment I will check the failing tests; it is possible that we have to change them 👍 .

arnavkapoor · 2020-08-20T16:28:06Z

Hi @noviluni, I have added the test files. I did try using pytest, but there seemed to be some error (I will try fixing it). For the time being I have created the file with a similar structure to the other test files.
With regard to the actual test cases:

I did try looking out for actual instances of such dates online but couldn't find a large source. Mostly these cases arise for spoken language. I have added some cases based on what I could find. Would appreciate if others with the specific language knowledge can also contribute.
For Spanish there seems to be an "el" preceding the dates; however, with this "el" the dates aren't accurately detected, so I have removed it for the time being.
I detected one error in number-parser's whitespace handling for Hindi; for the other languages it works fine and returns the correct output, '1999 11:08'. I will create a new issue in number-parser and resolve it.

>>> number_parser.parse("1999 11:08", "hi")
'199911:08'

Apart from this currently there are no tests for how this affects search functionality (and other features ?). So I will add them too.

Would appreciate all opinions and advice.

noviluni

I checked the failing tests.

param('es', "hace un horas", "1 hour ago"),
This is failing because it checks that the old "un" --> 1 and "hace (\\d+) horas" --> "\\1 hour ago" simplifications are working, but we changed the way it works (the "translation" is now performed earlier), and dateparser.parse("hace un horas") works as expected, so I think it doesn't make sense to keep the test. (In fact, "hace un horas" is grammatically incorrect; it should be "hace una hora", and "hace una hora" is tested when testing the freshness parser.) @Gallaecio another opinion regarding this would be nice :)
param('तेरह जनवरी 1997 11:08', datetime(1997, 1, 13, 11, 8))
You mentioned that this is caused by an issue with whitespace, and I think it happens because the tokens are broken up using split instead of a regex (if I remember correctly). We should probably fix it and release a new minor version (number-parser == 0.2.1).
param('fa', 'نگ جهانی دوم جنگ جدی بین سپتامبر 1939 و 2 سپتامبر 1945 بود.', ...
To be totally honest, I'm not sure why this is failing... It seems that it's taking the دوم ("second") from "second world war" as a date, but I don't fully understand how this could be related to the new incorporation. The search_dates() function has a lot of open issues and it's something that I would like to check after releasing the new version. I will try to look into this further and give you more insights.
@Gallaecio any idea about this?
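Regarding the whitespace point, one way to tokenize without losing the original separators is to split with a capturing regex group and join the parts back unchanged. This is only an illustrative sketch: it is not number-parser's actual code, and NUMBER_WORDS and translate_keep_whitespace() are hypothetical names with toy data.

```python
import re
from typing import Dict

# Hypothetical word table; "तेरह" is the Hindi word used in the failing test.
NUMBER_WORDS: Dict[str, str] = {"thirteen": "13", "तेरह": "13"}


def translate_keep_whitespace(text: str) -> str:
    """Replace number words while leaving every whitespace run untouched."""
    # The capturing group makes re.split() return the separators as well,
    # so joining the parts reproduces the original spacing exactly.
    parts = re.split(r"(\s+)", text)
    return "".join(NUMBER_WORDS.get(part.lower(), part) for part in parts)


print(translate_keep_whitespace("1999 11:08"))
print(translate_keep_whitespace("तेरह जनवरी 1997 11:08"))
```

With this approach a string that contains no number words, such as "1999 11:08", comes back byte-for-byte identical.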

noviluni · 2020-08-23T11:41:18Z

setup.py

@@ -31,6 +31,7 @@
    install_requires=[
        'python-dateutil',
        'pytz',
+        'number_parser',


I would prefer number-parser (with a hyphen), as it is the "official" name (https://pypi.org/project/number-parser/).

noviluni · 2020-08-23T15:18:55Z

dateparser/date.py

@@ -11,6 +11,7 @@
 from dateparser.timezone_parser import pop_tz_offset_from_string
 from dateparser.utils import apply_timezone_from_settings, \
    set_correct_day_from_settings
+from number_parser import parser


When using parser.parse(...) below, it's confusing what we are doing ("transforming" numbers). I'm not sure of the best way to do it, but what do you think about using from number_parser import parse as transform_numbers and then transform_numbers(...), or something like that, to make it less confusing?

I'm not totally sure if this is good practice, but leaving "parser.parse()" seems really confusing to me.

@Gallaecio any feedback on this?

noviluni · 2020-08-23T15:37:16Z

tests/test_natural_language_numbers.py

+from dateparser.date import DateDataParser, date_parser
+from dateparser.date_parser import DateParser
+from dateparser.timezone_parser import StaticTzInfo
+from dateparser.utils import normalize_unicode


I think that

from dateparser.timezone_parser import StaticTzInfo
from dateparser.utils import normalize_unicode

as well as

import pytest

are not used.

Hi @noviluni, I have pushed some of the changes and will make the others too, based on the feedback. Meanwhile, I have started looking into the failing Hindi test case.

noviluni · 2020-09-11T10:22:29Z

As this is not a breaking change but an improvement, I changed the milestone from 1.0.0 to 1.1.0.

We need to improve number-parser by adding support for more languages as well as improving the coverage before integrating it with dateparser.

arnavkapoor added 3 commits May 21, 2020 14:09

Merge pull request #1 from scrapinghub/master

8b598b4

Updating forked repo to current version of date_parser

using number-parser

eaa6652

removing hard-coded numerals for english

4034cea

noviluni added this to the v1.0.0 milestone Jun 22, 2020

This was referenced Jul 3, 2020

added NumberParser(InternationalSystemEnglish) #647

Closed

Years spelt out e.g. "Two Thousand Eight" #534

Open

arnavkapoor added 3 commits August 18, 2020 22:35

Merge pull request #2 from scrapinghub/master

f884c35

Updating code-base

Merge branch 'master' of https://github.com/arnavkapoor/dateparser in…

19104c6

…to number_parser_incorporation

incorporating number_parser

ac94ef8

removing hard-coded data for spanish

5e1ac94

adding tests for natural language parsing

11f3b82

noviluni reviewed Aug 23, 2020

View reviewed changes

arnavkapoor added 3 commits August 24, 2020 17:43

minor changes , fixes

eca11bd

test_case_passing_comment_updated

3c5f8c2

minimum v0.2.1 of number-parser needed

5e415c3

noviluni mentioned this pull request Sep 11, 2020

remove numeral translation_data #782

Merged

noviluni modified the milestones: v1.0.0, 1.1.0 Sep 11, 2020

noviluni mentioned this pull request Oct 26, 2020

Extend the data of the Russian language #815

Merged

Gallaecio mentioned this pull request Mar 17, 2021

Integrate number-parser into dateparser and price-parser scrapinghub/number-parser#61

Open

noviluni mentioned this pull request Aug 19, 2021

French date parsed wrongly #960

Closed
