Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Edge cases that are not working #4

Open
noviluni opened this issue Jun 17, 2020 · 4 comments
Open

Edge cases that are not working #4

noviluni opened this issue Jun 17, 2020 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@noviluni
Copy link
Contributor

Some edge cases that should be fixed:

On the other side, when having "5 hundred" I think it should return "500". What do you think?

Example:

parse("one million")                                                              
>>> '1000000'
parse("1 million")                                                              
>>> '1 1000000'

Please, add a test with each of these strings when fixing it :)

On the other hand, I think there are some currently working cases that should be added to the tests:

  • word that includes a number but it's not a number and shouldn't be translated. Examples: "weight" (includes "eight") or "network" (includes "two").
  • "twelve hundred" notation (returns "1200" which is correct).

Let me know if you have any doubt. 🙂

@arnavkapoor arnavkapoor added the bug Something isn't working label Jun 17, 2020
@arnavkapoor arnavkapoor self-assigned this Jun 17, 2020
@arnavkapoor
Copy link
Collaborator

I was thinking a bit about these edge cases

  1. twentyone - This seems to be a specific alternate form of the number thus we should probably a dictionary of hard-coded exceptions that would be language specific to handle such cases.

  2. 5 hundred - I do agree that 500 is the more natural and expected parsing, This might involve a lot of major code changes so I might push it for a bit later, with first working on incorporating languages because as discussed it shouldn't be too customized to English.

I feel that we should only handle base_number (in digits) multiplier (in words) in this way.

5 hundred = 500
267 thousands = 267000
3 million = 3000000

(These treated normally)
hundred 73 = 100 73
sixty 5 = 60 5
5 nine thousand = 5 9000 

Opinions ?

  1. The usage of 'a' , I am not very sure about the correct usage of 'a' before a number.
He has a hundred percent chance of succeeding

Do we remove the 'a' or not ?. Also is it the same behavior with other numbers and the position of 'a' in the beginning of the sentence or middle impacts parsing?

a one in six chance 

@noviluni
Copy link
Contributor Author

  1. Agree
  2. Agree, I like what I see in your examples. If anyone has another idea just raise it.
  3. I think that we should "remove" the "a" in the same cases we omit the "one". That is: "a hundred", "a million", "a billion", etc. Removing the "a" from "a one" doesn't have any sense for me.

However, I think that this ticket shouldn't be prioritized as we should first accept other languages (#10) to avoid adapting this library too much to the English language.

@Gallaecio
Copy link
Member

Regarding

He has a hundred percent chance of succeeding

I think the right wording would be:

He has a one hundred percent chance of succeeding

Where removing the ‘a’ makes sense.

@dhananjaypai08
Copy link
Contributor

dhananjaypai08 commented Nov 27, 2021

Hey @noviluni @Gallaecio

The issue mentioned about parsing '1 million' to '1000000'

Previously
parse('1 million')
>>> '1 1000000'

so to change this to '1000000' we can make few changes to the algorithm.

My Approach for this issue

1.)According to me,
In the number_parser/parser.py while checking the validity(function name: _check_validity) of the current token with the previous one , we can check if the previous token is a numeric one(using isnumeric() ) and the previous token's integer value is less than the current token's (mapped value in the dictionary) mapped integer value. Then we can continue to build the current token with the previous number. And then the actual value resulting till now would be multiplying both integer values (previous and current one).

Intuition

a.) '1 million' should be parsed as '1000000'.
b.) 'million 1' or '1000000 one' should and will parse as '1000000 1'.
c.) Also keeping in mind that 'million one' (both require lookups to its dictionary values) will be parsed as '1000001' (already handled previously)

2.)
The reason we check if the previous token's integer value is less than the current one's is because if the both are string values (ex:- "one"), the previous token's integer value is bigger than current one's and both are in string ordinals without any numeric values in it, they both will add up .for ex: 'million one' will parse '1000001'. So we will only be moving forward for the one where the previous one is numeric and current is string and previous is smaller. Rest all case handlings are all managed previously.

I know this needs a more efficient and deeper approach.
I am a beginner, so please give your valuable views. Also am thinking to take part in GSoC next year through this sub-org project . Please let me know what you think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants