Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"." incorrectly interpreted as a thousands separator #46

Closed
AntonGsv opened this issue Feb 3, 2021 · 15 comments · Fixed by #59
Closed

"." incorrectly interpreted as a thousands separator #46

AntonGsv opened this issue Feb 3, 2021 · 15 comments · Fixed by #59

Comments

@AntonGsv
Copy link

AntonGsv commented Feb 3, 2021

from price_parser import Price
print(Price.fromstring('3.350').amount_float)
print(Price.fromstring('3.3500').amount_float)
print(Price.fromstring('3.35').amount_float)
print(Price.fromstring('3.355').amount_float)

Results:

3350.0
3.35
3.35
3355.0
  • python 3.8.6
  • price-parser 0.3.4
@lopuhin
Copy link
Member

lopuhin commented Feb 3, 2021

There are web-sites which use "." as thousands separator, so if the number of digits after "." is 3, then it's interpreted as thousands. If you know the decimal separator (e.g. from locale / language), you can pass it, see https://github.com/scrapinghub/price-parser#decimal-separator

@AntonGsv AntonGsv closed this as completed Feb 4, 2021
@lopuhin
Copy link
Member

lopuhin commented Feb 4, 2021

@AntonGsv btw if you have some examples from prices in the wild which have "." as decimal separator with 3 digits after it, that would be useful for our future improvements.

@lopuhin
Copy link
Member

lopuhin commented Feb 8, 2021

Thanks @AntonGsv , great examples. I'll re-open the ticket, we'd like to be able to handle such cases better. Screenshots attached below.

Screenshot 2021-02-08 at 11 11 49

Screenshot 2021-02-08 at 11 12 10

@lopuhin lopuhin reopened this Feb 8, 2021
@lopuhin lopuhin changed the title False price extraction "." incorrectly interpreted as a thousands separator Feb 8, 2021
@Manish-210
Copy link

Manish-210 commented Mar 16, 2021

I think we can make a modify amount_float function something like amount_float(thousand_with_dot=false) (name should be something different ) which can resolve this
Example:

Price(amount=Decimal('22.900'), currency='€')
>>> price.amount_float( thousand_with_dot = false)    # price amount as float with thousand detection off
'22.900'

@lopuhin
Copy link
Member

lopuhin commented Mar 16, 2021

sorry @Manish-210 I'm not sure I understand your proposal - currently amount_float is a property which would return a float amount. But the problem here is that we don't exactly know how to interpret "22.900", and ideally we'd like to be able to make this decision without requiring extra input from the user.

@Manish-210
Copy link

Sorry for the unclear explanation, here is what I mean,

The current amount_float property changes the "." like a thousand separators when there are 3 digits after it, but what I mean is adding an optional input from the user say thousand_with_dot which by default is true, but if a user turns thousand_with_dot = false then "." will not act as a thousand separator when there are 3 digits after it.

Example:


>>>from price_parser import Price
>>>print(Price.fromstring('3.350').amount_float)
3350.0         # "." will act as thousand separator because bydefault thousand_with_dot is true
>>>print(Price.fromstring('3.350').amount_float(thousand_with_dot=false))
3.350           # "." will not act as thousand separator because thousand_with_dot is false
>>>print(Price.fromstring('3.35').amount_float)
3.35             # bydefault thousand_with_dot is true will have no effect on numbers which does not have 3 digits after "."
>>>print(Price.fromstring('3.35').amount_float(thousand_with_dot=false))
3.35             # thousand_with_dot is false will have no effect on numbers which does not have 3 digits after "."

ideally we'd like to be able to make this decision without requiring extra input from the user.

It is true that this will add a little work for the user but guessing it from the string provided by the user raises the possibility of error. Let me know your thoughts.

@lopuhin
Copy link
Member

lopuhin commented Mar 17, 2021

Thanks for feedback @Manish-210 , I see what you mean. Yes, if someone knows that dots are unlikely to be a thousands separator, then it makes sense to be able to communicate it to the library via some option 👍

@Manish-210
Copy link

Should I give a try to implement it?

@Gallaecio
Copy link
Member

I would first try to agree on an interface. I think this should be an argument to fromstring, not to amount_float, because I think this makes sense as part of the input, and it could help improve the parsing in some scenarios, while it would not make sense I think for someone to use amount_float with False and True for the same price object.

@lopuhin
Copy link
Member

lopuhin commented Mar 17, 2021

+1 to make it an argument to fromstring`. Also I think we should allow both None, TrueandFalse, with None`` performing a default.

Another approach here could be to approach this from locales standpoint - e.g. different locales might have preferences from different formats. But we'd first need to check if this makes sense - does the way dot is interpreted also depend on locale.

@Gallaecio
Copy link
Member

We could change the default behavior based on a locale, but I think allowing the flexibility to override the behavior would be best.

@PyExplorer
Copy link

Additional examples for the issue with wrong recognition thousands.

  1. https://www.xcite.com/samsung-galaxy-z-fold-3-5g-256gb-phone-black-1/p Price. The price 423.900 KD is recognized as 423900.
  2. https://enzahome-oman.com/product/burgundy-sofa-three-seater-black-leg-oman/. The price 577.500 OMR is recognized as 577500.

There is an idea to check how many digits after decimal separator this currency has based on this info https://en.wikipedia.org/wiki/ISO_4217. It can help to avoid some issues with familiar currencies with exactly 3 digits (as only this number price_parser considers as separator https://github.com/scrapinghub/price-parser/blob/master/price_parser/parser.py#L230C34-L230C80 )
@Gallaecio @kmike what do you think?

@kmike
Copy link
Member

kmike commented Jul 21, 2023

@PyExplorer doing it based on currency makes sense, if different currencies have different rules. In addition to that,
it also could make sense to provide an explicit parameter, to override any heuristics.

@emarondan
Copy link
Contributor

emarondan commented Sep 14, 2023

Hey guys. I just created a PR for this issue. If someone can give me a feedback I will appreciate it.
I used the idea proposed by @Manish-210 and also added a logic to check for specific rules to a currency. #59

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants