New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved price extraction #103
Conversation
scrapely/extractors.py
Outdated
if decimalpart[0] == "," and len(decimalpart) <= 3: | ||
decimalpart = decimalpart.replace(",", ".") | ||
value = "".join(parts + [decimalpart]).replace(",", "") | ||
decimalSeparator = 'point' # defaults to point |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole section is not very pythonic. You should use x in y
instead of y.__contains__(x)
. You're using camelcase instead of snake case.
This implements the same logic but it should be faster (avoids unnecessary copying caused by slicing and does int comparisons instead of string comparisons where needed):
POINT, COMMA = 0, 1
decimal_separator = POINT
last_point_idx = value.rfind('.')
last_comma_idx = value.rfind(',')
if last_point_idx > 0 and last_comma_idx > 0:
# If a number has both separators take the last one
if last_comma_idx > last_point_idx:
decimal_separator = COMMA
elif last_comma_idx > 0:
# If a number has only commas check the last one
first_comma_idx = value.find(',')
if (first_comma_idx == last_comma_idx and
len(value) - last_comma_idx != 4):
decimal_separator = COMMA
if decimal_separator == POINT:
return value.replace(',', '')
return value.replace('.', '').replace(',', '.')
Please be sure to add tests for this PR too
@ruairif Completed the changes, please review. |
scrapely/extractors.py
Outdated
@@ -16,8 +16,7 @@ | |||
_NUMERIC_ENTITIES = re.compile("&#([0-9]+)(?:;|\s)", re.U) | |||
_PRICE_NUMBER_RE = re.compile('(?:^|[^a-zA-Z0-9])(\d+(?:\.\d+)?)(?:$|[^a-zA-Z0-9])') | |||
_NUMBER_RE = re.compile('(-?\d+(?:\.\d+)?)') | |||
_DECIMAL_RE = re.compile(r'(\d[\d\,]*(?:(?:\.\d+)|(?:)))', re.U | re.M) | |||
_VALPARTS_RE = re.compile("([\.,]?\d+)") | |||
_DECIMAL_RE = re.compile(r'(-?\d[\d\,\.]*(?:(?:\.\d+)|(?:)))', re.U | re.M) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this regex will work if it's just:
_DECIMAL_RE = re.compile(r'(-?\d[\d\,\.]*)', re.U | re.M)
If if needs to stay as the current regex can you add a test case for where it's needed
@ruairif Yes, the extra part in regex was indeed not required. Fixed. Please check. |
@ruairif I have made the required changes, please check. |
scrapely/extractors.py
Outdated
|
||
if decimal_separator == POINT: | ||
value = value.replace(',', '') | ||
if decimal_separator == COMMA: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be an else here. The decimal_separator only has 2 states so if it's not point it's going to be comma
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, please review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry, I forgot to force push
; now it's done. Please check.
The prices are now handled using regexp, strings and loops instead of only regex which was inaccurate in some cases. Added test cases. Removed unnecessary regexp part Fixes scrapinghub/portia#212
The prices are now handled using regexp, strings and loops instead of only regex
which was inaccurate in some cases.
Fixes scrapinghub/portia#212