Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved price extraction #103

Merged
merged 1 commit into from Mar 9, 2017
Merged

Improved price extraction #103

merged 1 commit into from Mar 9, 2017

Conversation

hackrush01
Copy link
Contributor

The prices are now handled using regexp, strings and loops instead of only regex
which was inaccurate in some cases.

Fixes scrapinghub/portia#212

if decimalpart[0] == "," and len(decimalpart) <= 3:
decimalpart = decimalpart.replace(",", ".")
value = "".join(parts + [decimalpart]).replace(",", "")
decimalSeparator = 'point' # defaults to point
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole section is not very pythonic. You should use x in y instead of y.__contains__(x). You're using camelcase instead of snake case.

This implements the same logic but it should be faster (avoids unnecessary copying caused by slicing and does int comparisons instead of string comparisons where needed):

    POINT, COMMA = 0, 1
    decimal_separator = POINT
    last_point_idx = value.rfind('.')
    last_comma_idx = value.rfind(',')
    if last_point_idx > 0 and last_comma_idx > 0:
        # If a number has both separators take the last one
        if last_comma_idx > last_point_idx:
            decimal_separator = COMMA
    elif last_comma_idx > 0:
        # If a number has only commas check the last one 
        first_comma_idx = value.find(',')
        if (first_comma_idx == last_comma_idx and
                len(value) - last_comma_idx != 4):
            decimal_separator = COMMA

    if decimal_separator == POINT:
        return value.replace(',', '')
    return value.replace('.', '').replace(',', '.')

Please be sure to add tests for this PR too

@hackrush01
Copy link
Contributor Author

@ruairif Completed the changes, please review.
https://github.com/scrapy/scrapely/pull/103/files

@@ -16,8 +16,7 @@
_NUMERIC_ENTITIES = re.compile("&#([0-9]+)(?:;|\s)", re.U)
_PRICE_NUMBER_RE = re.compile('(?:^|[^a-zA-Z0-9])(\d+(?:\.\d+)?)(?:$|[^a-zA-Z0-9])')
_NUMBER_RE = re.compile('(-?\d+(?:\.\d+)?)')
_DECIMAL_RE = re.compile(r'(\d[\d\,]*(?:(?:\.\d+)|(?:)))', re.U | re.M)
_VALPARTS_RE = re.compile("([\.,]?\d+)")
_DECIMAL_RE = re.compile(r'(-?\d[\d\,\.]*(?:(?:\.\d+)|(?:)))', re.U | re.M)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this regex will work if it's just:

_DECIMAL_RE = re.compile(r'(-?\d[\d\,\.]*)', re.U | re.M)

If if needs to stay as the current regex can you add a test case for where it's needed

@hackrush01
Copy link
Contributor Author

@ruairif Yes, the extra part in regex was indeed not required. Fixed. Please check.

@hackrush01
Copy link
Contributor Author

@ruairif I have made the required changes, please check.
https://github.com/scrapy/scrapely/pull/103/files


if decimal_separator == POINT:
value = value.replace(',', '')
if decimal_separator == COMMA:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be an else here. The decimal_separator only has 2 states so if it's not point it's going to be comma

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, please review.

Copy link
Contributor Author

@hackrush01 hackrush01 Mar 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry, I forgot to force push; now it's done. Please check.

The prices are now handled using regexp, strings and loops instead of only regex
which was inaccurate in some cases.

Added test cases.

Removed unnecessary regexp part

Fixes scrapinghub/portia#212
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants