Skip to content

Conversation

@ruairif
Copy link
Collaborator

@ruairif ruairif commented Jun 21, 2016

No description provided.

@kmike kmike merged commit 91da733 into master Jun 24, 2016
(self.state == 6 and c == u'i') or
(self.state == 7 and c == u'p') or
(self.state == 8 and c == u't') or
(self.state == 3 and c == u's' or c == u'S') or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If change to c in [u's', u'S'] the developer will not to have to care about remember the priority of the and and the or

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the generated Cython code will be slower in this case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After doing some speed tests the difference is negligible.

I would be open to changing it to:

        if ((self.state == 1 and c == u'<') or
            (self.state == 2 and c == u'/') or
            (self.state == 3 and c in u'sS') or
            (self.state == 4 and c in u'cC') or
            (self.state == 5 and c in u'rR') or
            (self.state == 6 and c in u'iI') or
            (self.state == 7 and c in u'pP') or
            (self.state == 8 and c in u'tT') or
            (self.state == 9 and c == u'>')):
            self.state += 1

Here's the generated C code:

// Current
  __pyx_t_2 = ((__pyx_v_self->state == 3) != 0);
  if (!__pyx_t_2) {
    goto __pyx_L11_next_or;
  } else {
  }
  __pyx_t_2 = ((__pyx_v_c == 0x73) != 0);
  if (!__pyx_t_2) {
  } else {
    __pyx_t_1 = __pyx_t_2;
    goto __pyx_L5_bool_binop_done;
  }
  __pyx_L11_next_or:;
  __pyx_t_2 = ((__pyx_v_c == 83) != 0);
  if (!__pyx_t_2) {
  } else {
    __pyx_t_1 = __pyx_t_2;
    goto __pyx_L5_bool_binop_done;
  }


// Proposed (Same code generated when using u'sS' and [u's', u'S']
  __pyx_t_2 = ((__pyx_v_self->state == 3) != 0);
  if (!__pyx_t_2) {
    goto __pyx_L10_next_or;
  } else {
  }
  switch (__pyx_v_c) {
    case 0x73:
    case 83:
    __pyx_t_2 = 1;
    break;
    default:
    __pyx_t_2 = 0;
    break;
  }
  __pyx_t_3 = (__pyx_t_2 != 0);
  if (!__pyx_t_3) {
  } else {
    __pyx_t_1 = __pyx_t_3;
    goto __pyx_L5_bool_binop_done;
  }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ruairif as you're checking it, is stdlib regex module really a bottleneck? why does scrapely need a custom state machine - regexes should do the same internally?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New scrapely:

test_extraction_speed.TestExtractionSpeed.test_parsel_parse_and_extract: 4.1588s
test_extraction_speed.TestExtractionSpeed.test_parsel_extract: 3.6258s
test_extraction_speed.TestExtractionSpeed.test_slybot_parse_and_extract: 3.1787s
test_extraction_speed.TestExtractionSpeed.test_slybot_extract: 2.3742s

Scrapely 0.12.0

test_extraction_speed.TestExtractionSpeed.test_slybot_parse_and_extract: 5.3518s
test_extraction_speed.TestExtractionSpeed.test_parsel_parse_and_extract: 4.1480s
test_extraction_speed.TestExtractionSpeed.test_parsel_extract: 3.5887s
test_extraction_speed.TestExtractionSpeed.test_slybot_extract: 2.8221s

https://github.com/scrapinghub/portia/blob/master/slybot/slybot/tests/test_extraction_speed.py

@ruairif ruairif deleted the handle_uppercase_script_tags branch December 21, 2016 09:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants