[WiP] new number-parser library - incorporation #711

Draft
wants to merge 11 commits into
base: master

arnavkapoor
Contributor

Hi everyone, for the past couple of weeks I have been working on a number-parser library (the goal is to parse numbers written in natural language). A basic English-only version is available here. I will keep improving the library over the coming months. A detailed plan is available here.

The current goals for the library include:

  1. Handling ordinal numbers (third, seventh, five thousand and sixth)
  2. Support for multiple languages (similar to how date-parser allows users to contribute language-specific nuances)
  3. Supporting year-specific phrasing (saying nineteen eighty as opposed to one thousand nine hundred and eighty)

The current PR removes the need for hard-coded number values in the data translation files. It also allows dates like 'thirty May two thousand and nine' to parse correctly.
Currently, number-parser is called inside the 'get_date_data' function of 'DateDataParser', as that seemed to be called prior to parsing by all features (except search?).
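To make the idea concrete, here is a minimal sketch of the kind of preprocessing involved: number words in a date string are replaced by digits before the date is parsed. It is a toy, English-only illustration and not number-parser's actual implementation; the names `NUMBER_WORDS`, `words_to_int` and `replace_number_words` are made up for this example.

```python
# Toy English-only normalizer (illustrative, not number-parser's real code).
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
NUMBER_WORDS = {**UNITS, **TENS, "hundred": 100, "thousand": 1000}


def words_to_int(words):
    """Combine a run of cardinal number words into a single integer."""
    total = current = 0
    for word in words:
        value = NUMBER_WORDS[word]
        if value == 100:
            current = max(current, 1) * 100
        elif value == 1000:
            total += max(current, 1) * 1000
            current = 0
        else:
            current += value
    return total + current


def replace_number_words(text):
    """Replace each run of number words in text with its digits."""
    tokens = text.split()
    result, run = [], []
    for i, token in enumerate(tokens):
        word = token.lower()
        following = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        if word in NUMBER_WORDS:
            run.append(word)
        elif word == "and" and run and following in NUMBER_WORDS:
            continue  # "and" used as a connector inside a number
        else:
            if run:
                result.append(str(words_to_int(run)))
                run = []
            result.append(token)
    if run:
        result.append(str(words_to_int(run)))
    return " ".join(result)


print(replace_number_words("thirty May two thousand and nine"))
```

The output string contains only digits and the month name, which the existing date parsing can already handle. Ordinals and year-style readings such as nineteen eighty are deliberately left out of this sketch.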

I would appreciate any input regarding the incorporation, as well as any bugs/features/ideas that you can think of for the number-parser library. Please feel free to raise issues in the library.

To test this update you will need to clone it from the GitHub link and install it using 'pip install -e', as it hasn't been updated in the dependencies yet.

@Gallaecio
Member

I guess related issues would be: #651, #239, and #46.

@noviluni
Collaborator

noviluni commented Jun 17, 2020

Hi @arnavkapoor !

I think we can move your addition to another place. Let me explain it.

Inside the get_date_data() method we find:

  1. checking for the correct type
  2. (your addition)
  3. trying with provided date formats (parse_with_formats)
  4. sanitization (sanitize_date)
  5. for loop trying applicable locales

I would move it at least after the sanitization part. However, as number-parser will support different locales, it could be a good idea to add it inside the for loop, where we can pass directly the locale to number-parser.

This approach has some advantages and some disadvantages. Summarizing:

Pros:

  • If we don't implement any language detection in number-parser, this will allow us to pass the desired locale directly.
  • In case we implement some sort of language detection in number-parser, passing the correct language will save time.
  • We will be translating to the expected locale, so no mixing logic from different locales.

Cons:

  • This approach will call number-parser multiple times, even if there isn't anything to "parse", so it could be more expensive.

As the disadvantage is minor, I would proceed with the second approach.

We could add:

...
for locale in self._get_applicable_locales(date_string):
    date_string = parser.parse(date_string, locales=[locale.shortname])
    ...

If you like it, go ahead and add the locales parameter to the parse method in number-parser. (Don't worry if it doesn't work, just add it even if it's not used). I added a list to allow us to pass multiple possible locales, but you can also implement a locale parameter, up to you.
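A rough sketch of what a `parse` function with a `locales` parameter could look like, so that callers can restrict the translation to the locale being tried. Everything here is hypothetical: the `WORDS_BY_LOCALE` tables are tiny stand-ins, and number-parser's real signature and data may differ.

```python
# Hypothetical per-locale word tables; real data would live in number-parser.
WORDS_BY_LOCALE = {
    "en": {"one": "1", "two": "2", "three": "3"},
    "es": {"uno": "1", "dos": "2", "tres": "3"},
}


def parse(text, locales=None):
    """Replace number words, using only the tables of the given locales.

    With locales=None every known table is used (no restriction).
    """
    selected = locales if locales is not None else list(WORDS_BY_LOCALE)
    table = {}
    for code in selected:
        table.update(WORDS_BY_LOCALE.get(code, {}))
    return " ".join(table.get(token.lower(), token) for token in text.split())


print(parse("dos de enero", locales=["es"]))  # Spanish table applied
print(parse("dos de enero", locales=["en"]))  # left untouched
```

Passing the locale explicitly means a Spanish word is never translated while an English locale is being tried, which is the "no mixing logic" advantage listed above.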

Let me know what you think about this.

On the other hand, we should add some tests to dateparser to ensure that this integration works, so you could also add some tests to this PR. Those tests should be as easy as testing different sentences with "natural written numbers" in different languages.

In any case, good job on this @arnavkapoor ! The progress is quite amazing! 🚀 😄

@noviluni
Collaborator

noviluni commented Jun 17, 2020

Ok, I've been looking at the code, and it seems that dateparser lang detection doesn't work when passing for example "one of january of two thousand eight", as "one" is not recognised as English (I know it should be "first", but as number-parser doesn't work with "ordinals" I used "one").

We have here two options:

  1. Improve the lang detection in dateparser
  2. Make number-parser autodetect the language and call it before the for loop.

Even if the second option could be good, in dateparser I think that the second option would be better, as this way we could avoid mixing language logic (imagine we "translate" a sentence coming from Spanish, for example, and then when we parse it in dateparser we use a wrong locale).

To implement the second option we could implement a function like is_a_number() in number-parser and use it inside dateparser (should be applied in the are_tokens_valid "any" check). This function could add an optional locale argument to ensure that we are using the expected locale. This function would check the passed string against the numbers (ordinal and cardinal) for that language.
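A minimal sketch of such a helper, assuming a small per-language set of number words. The function name comes from the suggestion above, but the `NUMBER_WORDS` data and the exact behaviour are assumptions for illustration only.

```python
# Hypothetical word sets (cardinals and ordinals) per language.
NUMBER_WORDS = {
    "en": {"one", "two", "first", "second"},
    "es": {"uno", "dos", "primero", "segundo"},
}


def is_a_number(token, locale=None):
    """Return True if token is digits or a known number word.

    When locale is given, only that language's words are accepted;
    otherwise any known language matches.
    """
    word = token.strip().lower()
    if word.isdigit():
        return True
    if locale is not None:
        return word in NUMBER_WORDS.get(locale, set())
    return any(word in words for words in NUMBER_WORDS.values())


print(is_a_number("one", locale="en"))
print(is_a_number("one", locale="es"))
```

With the optional `locale` argument, a token like "one" counts as valid only while the English locale is being checked, so it cannot make a Spanish locale look applicable.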

Let me know if you have any other ideas.

@noviluni
Collaborator

By the way, number-parser is not supporting Python 2 (and I don't think it should be supporting it), so I think that this integration could come at the same time as a major release of dateparser without Python 2 support 🎉 .

@Gallaecio what do you think?

@Gallaecio
Member

It’s about time! 😛

@noviluni
Collaborator

Hi @arnavkapoor!

Do you need help with the tests/pipeline?

The Python 3.5 pipeline is expected to fail because of the f-strings. Don't worry about it because we will remove the Python 3.5 support in the next version, at the same time we release this change.
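For context, f-strings were added in Python 3.6 (PEP 498), so a module that contains one cannot even be compiled on 3.5. The snippet below is only an illustration with made-up variable names; it shows an f-string next to the `str.format` spelling that older interpreters accept.

```python
year = 2009

# Python 3.6+ only: this line is a SyntaxError on Python 3.5.
with_fstring = f"parsed year: {year}"

# Equivalent spelling that also works on Python 3.5 and earlier.
with_format = "parsed year: {}".format(year)

print(with_fstring == with_format)
```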

The other Python versions are failing because of an issue with Arabic, if I'm not mistaken. Could you check it? Let me know if you need help or feedback.

@arnavkapoor
Contributor Author

Hi @noviluni, yes, I was investigating the failing test cases and it does seem that the failure wasn't due to number-parser, though I am surprised that all cases with second are passing. I will look into this.
Also, I was removing other hard-coded values for es, which basically were un and una. Doing this did cause a test case to fail in test_languages: the string hace un horas. However, dateparser.parse("hace un horas") seems to be working correctly and returns a time one hour ago, so I guess it's some intermediate translation that is not being performed.
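One way such a test can break is an ordering problem: a digit-based simplification rule only fires once the number word has already been turned into digits. The sketch below assumes a rule shaped like `hace (\d+) horas` and uses a stand-in `words_to_digits` step; neither is taken from the actual dateparser code.

```python
import re

# Assumed simplification rule: it only matches digits, not number words.
PATTERN = re.compile(r"hace (\d+) horas")
REPLACEMENT = r"\1 hour ago"


def simplify(text):
    """Apply the digit-based simplification rule."""
    return PATTERN.sub(REPLACEMENT, text)


def words_to_digits(text):
    """Stand-in for the number-word translation step (only handles 'un')."""
    return re.sub(r"\bun\b", "1", text)


# Without the translation step the rule never matches.
print(simplify("hace un horas"))
# With the translation applied first, the rule fires.
print(simplify(words_to_digits("hace un horas")))
```

A test that exercises only the simplification step, without the number translation running first, would therefore fail even though the full `dateparser.parse` call still works.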

Apart from this, I tried out other strings like May twenty seventh, two thousand and seventeen and similar cases, and they seem to be working fine. So I will also create a new tests/test_number_parser file (better name?) to add these cases. Apart from these, I don't see any other major changes needed for incorporation.

@noviluni
Collaborator

@arnavkapoor

Great! Let me know when you are ready with the tests. I think a better name for the file would be test_natural_language_numbers or something similar, as we are not going to test the number_parser package but the result of using it.

Try to use real date examples in different languages, but you can also try some controversial/weird cases and discuss them here if they are failing or you have any doubts.

If you check other tests in this package, you will see that they are written with the nose package (which is compatible with pytest), but I would prefer if you don't use those examples and you use pytest instead, as we migrated the test suite to it and it's better to write the new tests using pytest.
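As a rough idea of the pytest style, a parametrized test could look like the sketch below. The `parse_date` function is a stand-in with canned results so the example runs on its own; in the real suite the test body would call `dateparser.parse`, and the expected values here are assumptions about what the integration should return.

```python
from datetime import datetime

import pytest


def parse_date(date_string):
    """Stand-in for dateparser.parse so this sketch runs on its own."""
    # Hypothetical canned results; the real test would call dateparser.parse.
    canned = {
        "thirty May two thousand and nine": datetime(2009, 5, 30),
        "May twenty seventh, two thousand and seventeen": datetime(2017, 5, 27),
    }
    return canned.get(date_string)


@pytest.mark.parametrize("date_string, expected", [
    ("thirty May two thousand and nine", datetime(2009, 5, 30)),
    ("May twenty seventh, two thousand and seventeen", datetime(2017, 5, 27)),
])
def test_natural_language_numbers(date_string, expected):
    assert parse_date(date_string) == expected
```

Each tuple becomes its own test case in the pytest report, so adding a new language is just a matter of adding rows.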

When I have a moment, I will check the failing tests; it is possible that we will have to change them 👍.

@arnavkapoor
Contributor Author

Hi @noviluni, I have added the test files. I did try using pytest; however, there seemed to be some error (I will try fixing it). For the time being I have created them with a structure similar to the other test files.
Regarding the actual test cases:

  • I did try looking for actual instances of such dates online but couldn't find a large source. These cases mostly arise in spoken language. I have added some cases based on what I could find. I would appreciate it if others with knowledge of the specific languages could also contribute.
  • For Spanish there seems to be an el preceding the dates; however, with this el the dates aren't accurately detected, so I have removed it for the time being.
  • There is one error in number-parser's whitespace handling with Hindi that I detected; for the other languages it works fine and returns the correct output '1999 11:08'. I will create a new issue in number-parser and resolve it.
>>> number_parser.parse("1999 11:08", "hi")
'199911:08'
  • Apart from this, there are currently no tests for how this affects the search functionality (and other features?), so I will add them too.

Would appreciate all opinions and advice.

Collaborator

@noviluni noviluni left a comment

I checked the failing tests.

  1. param('es', "hace un horas", "1 hour ago"),
    This is failing because it tries to check that the old "un" --> 1 and the "hace (\\d+) horas" --> "\\1 hour ago" simplifications are working, but we changed the way it works (the "translation" is performed before) and dateparser.parse("hace un horas") works as expected, so I think it doesn't make sense to keep it. (In fact, "hace un horas" is grammatically incorrect; it should be "hace una hora", and hace una hora is tested when testing the freshness parser.) @Gallaecio another opinion regarding this would be nice :)

  2. param('तेरह जनवरी 1997 11:08', datetime(1997, 1, 13, 11, 8))
    You mentioned that it is because there is an issue with the whitespace, and I think it is because it breaks the tokens by using split instead of regex (if I remember correctly). We should probably fix it and release a new minor version (number-parser == 0.2.1).

  3. param('fa', 'نگ جهانی دوم جنگ جدی بین سپتامبر 1939 و 2 سپتامبر 1945 بود.', ...
    To be totally honest I'm not sure why this is failing... It seems that it's taking the دوم ("second") from the "second world war" as a date, but I don't fully understand how this could be related to the new incorporation... The search_dates() function has a lot of open issues and it's something that I would like to check after releasing the new version. I will try to look into this more and give you more insights.
    @Gallaecio any idea about this?
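Regarding the second item, the sketch below shows how tokenizing with `str.split` can produce exactly the reported symptom, while a regex split that captures the separators keeps the original spacing. It is only an illustration of the suspected cause; the function names are made up and number-parser's actual tokenizer may differ.

```python
import re


def tokens_with_split(text):
    """Tokenize with str.split: the whitespace itself is thrown away."""
    return text.split()


def tokens_with_regex(text):
    """Tokenize with a capturing group so whitespace tokens are kept."""
    return [token for token in re.split(r"(\s+)", text) if token]


text = "1999 11:08"
# Re-joining without separators loses the space between the tokens.
print("".join(tokens_with_split(text)))
# Keeping the whitespace tokens round-trips the original string.
print("".join(tokens_with_regex(text)))
```

If the tokens are processed one by one and concatenated afterwards, only the second approach can reproduce the input when nothing needs translating.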

setup.py Outdated
@@ -31,6 +31,7 @@
install_requires=[
'python-dateutil',
'pytz',
'number_parser',
Collaborator

I prefer if you put number-parser (with hyphen) as it is the "official" name (https://pypi.org/project/number-parser/)

Contributor Author

sure
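As background on the hyphen versus underscore question: package indexes compare project names after the normalization described in PEP 503, so both spellings refer to the same project, while the import name stays `number_parser`. A small sketch of that normalization rule (the helper name `normalize_name` is just for this example):

```python
import re


def normalize_name(name):
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


print(normalize_name("number_parser"))
print(normalize_name("number-parser"))
```

Using the hyphenated form in install_requires is therefore a matter of matching the official name rather than a functional change.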

@@ -11,6 +11,7 @@
from dateparser.timezone_parser import pop_tz_offset_from_string
from dateparser.utils import apply_timezone_from_settings, \
set_correct_day_from_settings
from number_parser import parser
Collaborator

When using parser.parse(...) below, it's confusing what we are doing ("transforming" numbers). I'm not sure of the best way to do it, but what do you think about using from number_parser import parse as transform_numbers and then transform_numbers(...), or something like that, to make it less confusing?

I'm not totally sure if this is good practice, but leaving "parser.parse()" seems really confusing to me.

@Gallaecio any feedback on this?

from dateparser.date import DateDataParser, date_parser
from dateparser.date_parser import DateParser
from dateparser.timezone_parser import StaticTzInfo
from dateparser.utils import normalize_unicode
Collaborator

I think that

from dateparser.timezone_parser import StaticTzInfo
from dateparser.utils import normalize_unicode

as well as

import pytest

are not used.

Contributor Author

Hi @noviluni, I have pushed some of the changes and will make the others too based on the feedback. Meanwhile, I have started looking into the failing Hindi test case.

@noviluni noviluni modified the milestones: v1.0.0, 1.1.0 Sep 11, 2020
@noviluni
Collaborator

As this is not a breaking change but an improvement, I changed the milestone from 1.0.0 to 1.1.0.

We need to improve number-parser by adding support for more languages as well as improving the coverage before integrating it with dateparser.

3 participants