-
Notifications
You must be signed in to change notification settings - Fork 468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WiP] new number-parser library - incorporation #711
base: master
Are you sure you want to change the base?
[WiP] new number-parser library - incorporation #711
Conversation
Updating forked repo to current version of date_parser
Hi @arnavkapoor ! I think we can move your addition to another place. Let me explain it. Inside the
I would move it at least after the sanitization part. However, as This approach has some advantages and some disadvantages. Summarizing: Pros:
Cons:
As the disadvantage is not quite important, I think that I would proceed with the second approach. We could add: ...
for locale in self._get_applicable_locales(date_string):
date_string = parser.parse(date_string, locales=[locale.shortname])
... If you like it, go ahead and add the Let me know what you think about this. On the other hand, we should add some tests to In any case, good job on this @arnavkapoor ! The progress is quite amazing! 🚀 😄 |
Ok, I've been looking at the code, and it seems that We have here two options:
Even the second option could be good, in To implement the second option we could implement a function like Let me know if you have any other ideas. |
By the way, @Gallaecio what do you think? |
It’s about time! 😛 |
Updating code-base
…to number_parser_incorporation
Hi @arnavkapoor! Do you need help with the tests/pipeline? The Python 3.5 pipeline is expected to fail because of the f-strings. Don't worry about it because we will remove the Python 3.5 support in the next version, at the same time we release this change. The other python versions are failing because of an issue with Arabian if I'm not wrong. Could you check it? Let me know if you need help or feedback. |
Hi @noviluni , Yes I was investigating a bit about the failing test cases and it does seem that the failure wasn't due to number-parser. Though I am surprised that all cases with Apart from this I tried out other strings like |
Great! Let me know when you are ready with the tests. I think a better name for the file would be Try to use real date examples in different languages, but you can also try with some polemic/weird cases and discuss them here if they are failing or you have any doubt. If you check other tests in this package, you will see that they are written with the When having a moment I will check the failing tests, it is possible that we have to change them 👍 . |
Hi @noviluni , I have added the test files. I did try using pytest , however there seemed to be some error. (will try fixing it) For the time being I have created it with a similar structure as the other test files.
Would appreciate all opinions and advice. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked the failing tests.
-
param('es', "hace un horas", "1 hour ago"),
This is failing because it tries to check that the old"un" --> 1
and the"hace (\\d+) horas" --> "\\1 hour ago" simplifications are working, but we changed they way it works (the "translation" is performed before) and the result when doing
dateparser.parse("hace un horas")work as expected, so I think it doesn't make sense to leave it. (And in fact, "hace un horas" is gramatically incorrect, it should be "hace una hora", and
hace una horais tested when testing the
freshness parser`). @Gallaecio another opinion regarding this would be nice :) -
param('तेरह जनवरी 1997 11:08', datetime(1997, 1, 13, 11, 8))
You mentioned that it is because there is an issue with the whitespaces and I think it is because it breaks the tokens by using split instead of regex (if I'm not remembering wrong). We should probably fix it and release a new minor version (number-parser == 0.2.1) -
param('fa', 'نگ جهانی دوم جنگ جدی بین سپتامبر 1939 و 2 سپتامبر 1945 بود.', ...
To be totally honest I'm not sure why is this failing... It seems that it's taking theدوم
("second") from the "second world war" as a date, but I don't fully understand how could this be related to the new incorporation... thesearch_dates()
function has a lot of open issues and it's something that I would like to check after releasing the new version. I will try to check this more and give you more insights.
@Gallaecio any idea about this?
setup.py
Outdated
@@ -31,6 +31,7 @@ | |||
install_requires=[ | |||
'python-dateutil', | |||
'pytz', | |||
'number_parser', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer if you put number-parser
(with hyphen) as it is the "official" name (https://pypi.org/project/number-parser/)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
@@ -11,6 +11,7 @@ | |||
from dateparser.timezone_parser import pop_tz_offset_from_string | |||
from dateparser.utils import apply_timezone_from_settings, \ | |||
set_correct_day_from_settings | |||
from number_parser import parser |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when using parser.parse(...)
below, it's confusing what we are doing ("transforming" numbers). I'm not sure the best way to do it, but what do you think if we use from number_parser import parse as transform_numbers
and then transform_numbers(...)
or something like that to make it less confusing?
I'm not totally sure if this is a good practice, bet leaving "parser.parse()
" seem really confusing for me.
@Gallaecio any feedback on this?
from dateparser.date import DateDataParser, date_parser | ||
from dateparser.date_parser import DateParser | ||
from dateparser.timezone_parser import StaticTzInfo | ||
from dateparser.utils import normalize_unicode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that
from dateparser.timezone_parser import StaticTzInfo
from dateparser.utils import normalize_unicode
as well as
import pytest
Are not used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @noviluni I have updated with some of the changes. Will make the others too based on the feedback. Meanwhile have started looking into the failing hindi test-case.
As this is not a breaking change but an improvement, I changed the milestone from 1.0.0 to 1.1.0. We need to improve |
Hi everyone, for the past couple of weeks I have been working on a number-parser library. (the goal is to parse numbers written in natural language). A basic English only version is available here. I will constantly improve the library in the coming months. A detailed plan for the same is available here.
The current goals for the library include:-
The current PR removes away the need to have the hard-coded values for numbers in the data translation files. It also allows dates like 'thirty May two thousand and nine' to parse correctly.
Currently, the number-parser is called inside the 'get_date_data' function of 'DateDataParser' as that seemed to be called prior to parsing by all features ( except search ? ).
I would appreciate any inputs with regards to incorporation and bugs/features/ideas that you can think of for the number-parser library. Please feel free to raise issues in the library.
To test this update you will need to clone it from the GitHub link and install using 'pip install -e' , as it hasn't been updated in the dependency.