Improved price extraction #103

hackrush01 · 2017-03-06T21:42:21Z

Prices are now parsed with a combination of regular expressions, string operations and loops,
rather than with a regex alone, which was inaccurate in some cases.

ruairif · 2017-03-07T10:12:59Z

scrapely/extractors.py

-        if decimalpart[0] == "," and len(decimalpart) <= 3:
-            decimalpart = decimalpart.replace(",", ".")
-        value = "".join(parts + [decimalpart]).replace(",", "")
+        decimalSeparator = 'point'  # defaults to point


This whole section is not very Pythonic. You should use x in y instead of y.__contains__(x), and you're using camelCase instead of snake_case.
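As a small illustration of the two style points in this comment (the variable names and the sample string below are made up for the example, they are not taken from the PR):

```python
# Hypothetical sample input, only for illustration.
price_text = "1.234,56"

# Idiomatic membership test: the `in` operator.
has_comma = "," in price_text

# Equivalent result, but calling the dunder method directly is harder to read.
has_comma_dunder = price_text.__contains__(",")

# snake_case name (PEP 8 style) rather than camelCase `decimalSeparator`.
decimal_separator = "point"  # defaults to point
```

Both membership checks give the same result; the operator form is simply the conventional spelling.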

This implements the same logic but it should be faster (avoids unnecessary copying caused by slicing and does int comparisons instead of string comparisons where needed):

    POINT, COMMA = 0, 1
    decimal_separator = POINT
    last_point_idx = value.rfind('.')
    last_comma_idx = value.rfind(',')
    if last_point_idx > 0 and last_comma_idx > 0:
        # If a number has both separators take the last one
        if last_comma_idx > last_point_idx:
            decimal_separator = COMMA
    elif last_comma_idx > 0:
        # If a number has only commas check the last one
        first_comma_idx = value.find(',')
        if (first_comma_idx == last_comma_idx and
                len(value) - last_comma_idx != 4):
            decimal_separator = COMMA
    if decimal_separator == POINT:
        return value.replace(',', '')
    return value.replace('.', '').replace(',', '.')
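To make the suggestion above easier to try out, here is a minimal sketch that wraps the same logic in a standalone function. The name normalize_decimal and the sample inputs are assumptions made for this example; the actual function in scrapely/extractors.py may be structured differently.

```python
POINT, COMMA = 0, 1


def normalize_decimal(value):
    """Return `value` with grouping separators removed and '.' as the decimal mark.

    Hypothetical wrapper around the logic suggested in the review comment.
    """
    decimal_separator = POINT
    last_point_idx = value.rfind('.')
    last_comma_idx = value.rfind(',')
    if last_point_idx > 0 and last_comma_idx > 0:
        # Both separators present: the one that appears last is the decimal mark.
        if last_comma_idx > last_point_idx:
            decimal_separator = COMMA
    elif last_comma_idx > 0:
        # Only commas present: a single comma that is NOT followed by exactly
        # three characters is treated as a decimal comma.
        first_comma_idx = value.find(',')
        if (first_comma_idx == last_comma_idx and
                len(value) - last_comma_idx != 4):
            decimal_separator = COMMA
    if decimal_separator == POINT:
        return value.replace(',', '')
    return value.replace('.', '').replace(',', '.')


# Sample inputs chosen for illustration only.
for raw in ["1,234.56", "1.234,56", "12,50", "1,234", "1,234,567"]:
    print(raw, "->", normalize_decimal(raw))
```

Note that an input such as 1,234 is ambiguous; this logic reads it as a thousands-grouped integer because exactly three digits follow the single comma.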

Please be sure to add tests for this PR too.

hackrush01 · 2017-03-07T18:21:51Z

@ruairif Completed the changes, please review.
https://github.com/scrapy/scrapely/pull/103/files

ruairif · 2017-03-08T12:08:22Z

scrapely/extractors.py

@@ -16,8 +16,7 @@
 _NUMERIC_ENTITIES = re.compile("&#([0-9]+)(?:;|\s)", re.U)
 _PRICE_NUMBER_RE = re.compile('(?:^|[^a-zA-Z0-9])(\d+(?:\.\d+)?)(?:$|[^a-zA-Z0-9])')
 _NUMBER_RE = re.compile('(-?\d+(?:\.\d+)?)')
-_DECIMAL_RE = re.compile(r'(\d[\d\,]*(?:(?:\.\d+)|(?:)))', re.U | re.M)
-_VALPARTS_RE = re.compile("([\.,]?\d+)")
+_DECIMAL_RE = re.compile(r'(-?\d[\d\,\.]*(?:(?:\.\d+)|(?:)))', re.U | re.M)


It looks like this regex will work if it's just:

_DECIMAL_RE = re.compile(r'(-?\d[\d\,\.]*)', re.U | re.M)

If it needs to stay as the current regex, can you add a test case for where it's needed?
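A quick way to spot-check that the shorter pattern behaves the same is to run both over a few inputs. This is only a sketch: the names OLD_DECIMAL_RE, NEW_DECIMAL_RE and first_match, and the sample strings, are assumptions for the example; the two patterns are copied from the diff and the comment above. Because the character class is greedy and already consumes every digit, comma and point, the optional trailing group can only ever match the empty string.

```python
import re

# Pattern from the diff under review (hypothetical name).
OLD_DECIMAL_RE = re.compile(r'(-?\d[\d\,\.]*(?:(?:\.\d+)|(?:)))', re.U | re.M)
# Simplified pattern proposed in the review comment (hypothetical name).
NEW_DECIMAL_RE = re.compile(r'(-?\d[\d\,\.]*)', re.U | re.M)

# Sample inputs chosen for illustration only.
SAMPLES = ["1,234.56", "EUR 1.234,56", "-12.50 USD", "price: 99", "3.", "1,000,000.00"]


def first_match(pattern, text):
    """Return the first captured number-like substring, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


# Collect any sample on which the two patterns disagree.
mismatches = [s for s in SAMPLES
              if first_match(OLD_DECIMAL_RE, s) != first_match(NEW_DECIMAL_RE, s)]
print(mismatches)
```

A spot check like this is not a proof, but an empty list here is consistent with the reasoning that the extra group is redundant.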

hackrush01 · 2017-03-08T18:45:42Z

@ruairif Yes, the extra part in regex was indeed not required. Fixed. Please check.

hackrush01 · 2017-03-09T10:27:02Z

@ruairif I have made the required changes, please check.
https://github.com/scrapy/scrapely/pull/103/files

ruairif · 2017-03-09T10:32:12Z

scrapely/extractors.py

+
+        if decimal_separator == POINT:
+            value = value.replace(',', '')
+        if decimal_separator == COMMA:


There should be an else here. The decimal_separator only has 2 states so if it's not point it's going to be comma
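A minimal sketch of the if/else shape being asked for. The function name strip_grouping is hypothetical, and the body of the comma branch is not visible in the hunk above, so it is borrowed here from the earlier review suggestion as an assumption.

```python
POINT, COMMA = 0, 1


def strip_grouping(value, decimal_separator):
    """Hypothetical helper: normalise `value` given an already detected separator."""
    if decimal_separator == POINT:
        # Commas are grouping separators: drop them.
        value = value.replace(',', '')
    else:
        # Only two states exist, so anything that is not POINT is COMMA.
        # Assumed body: drop grouping points, then turn the decimal comma into a point.
        value = value.replace('.', '').replace(',', '.')
    return value
```

Using else instead of a second if avoids a redundant comparison and makes it explicit that exactly one branch runs.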

Done, please review.

I am sorry, I forgot to force push; now it's done. Please check.

The prices are now handled using regexp, strings and loops instead of only regex, which was inaccurate in some cases. Added test cases. Removed unnecessary regexp part. Fixes scrapinghub/portia#212

ruairif requested changes Mar 7, 2017

hackrush01 force-pushed the master branch from 817e6dc to 8788a7f Compare March 7, 2017 17:44

ruairif reviewed Mar 8, 2017

hackrush01 force-pushed the master branch from 8788a7f to 65bef75 Compare March 8, 2017 18:34

ruairif reviewed Mar 9, 2017

Improved price extraction

0f300c4

hackrush01 force-pushed the master branch from 65bef75 to 0f300c4 Compare March 9, 2017 10:52

ruairif merged commit f0b4777 into scrapy:master Mar 9, 2017
