Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 6 additions & 7 deletions scrapely/_htmlpage.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -139,15 +139,14 @@ cdef class ScriptParser:
cdef int parse(self, Py_UCS4 c, int i):
if self.state == 10:
self.state = 1

if ((self.state == 1 and c == u'<') or
(self.state == 2 and c == u'/') or
(self.state == 3 and c == u's') or
(self.state == 4 and c == u'c') or
(self.state == 5 and c == u'r') or
(self.state == 6 and c == u'i') or
(self.state == 7 and c == u'p') or
(self.state == 8 and c == u't') or
(self.state == 3 and c == u's' or c == u'S') or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If change to c in [u's', u'S'] the developer will not to have to care about remember the priority of the and and the or

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the generated Cython code will be slower in this case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After doing some speed tests the difference is negligible.

I would be open to changing it to:

        if ((self.state == 1 and c == u'<') or
            (self.state == 2 and c == u'/') or
            (self.state == 3 and c in u'sS') or
            (self.state == 4 and c in u'cC') or
            (self.state == 5 and c in u'rR') or
            (self.state == 6 and c in u'iI') or
            (self.state == 7 and c in u'pP') or
            (self.state == 8 and c in u'tT') or
            (self.state == 9 and c == u'>')):
            self.state += 1

Here's the generated C code:

// Current
  __pyx_t_2 = ((__pyx_v_self->state == 3) != 0);
  if (!__pyx_t_2) {
    goto __pyx_L11_next_or;
  } else {
  }
  __pyx_t_2 = ((__pyx_v_c == 0x73) != 0);
  if (!__pyx_t_2) {
  } else {
    __pyx_t_1 = __pyx_t_2;
    goto __pyx_L5_bool_binop_done;
  }
  __pyx_L11_next_or:;
  __pyx_t_2 = ((__pyx_v_c == 83) != 0);
  if (!__pyx_t_2) {
  } else {
    __pyx_t_1 = __pyx_t_2;
    goto __pyx_L5_bool_binop_done;
  }


// Proposed (Same code generated when using u'sS' and [u's', u'S']
  __pyx_t_2 = ((__pyx_v_self->state == 3) != 0);
  if (!__pyx_t_2) {
    goto __pyx_L10_next_or;
  } else {
  }
  switch (__pyx_v_c) {
    case 0x73:
    case 83:
    __pyx_t_2 = 1;
    break;
    default:
    __pyx_t_2 = 0;
    break;
  }
  __pyx_t_3 = (__pyx_t_2 != 0);
  if (!__pyx_t_3) {
  } else {
    __pyx_t_1 = __pyx_t_3;
    goto __pyx_L5_bool_binop_done;
  }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ruairif as you're checking it, is stdlib regex module really a bottleneck? why does scrapely need a custom state machine - regexes should do the same internally?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New scrapely:

test_extraction_speed.TestExtractionSpeed.test_parsel_parse_and_extract: 4.1588s
test_extraction_speed.TestExtractionSpeed.test_parsel_extract: 3.6258s
test_extraction_speed.TestExtractionSpeed.test_slybot_parse_and_extract: 3.1787s
test_extraction_speed.TestExtractionSpeed.test_slybot_extract: 2.3742s

Scrapely 0.12.0

test_extraction_speed.TestExtractionSpeed.test_slybot_parse_and_extract: 5.3518s
test_extraction_speed.TestExtractionSpeed.test_parsel_parse_and_extract: 4.1480s
test_extraction_speed.TestExtractionSpeed.test_parsel_extract: 3.5887s
test_extraction_speed.TestExtractionSpeed.test_slybot_extract: 2.8221s

https://github.com/scrapinghub/portia/blob/master/slybot/slybot/tests/test_extraction_speed.py

(self.state == 4 and c == u'c' or c == u'C') or
(self.state == 5 and c == u'r' or c == u'R') or
(self.state == 6 and c == u'i' or c == u'I') or
(self.state == 7 and c == u'p' or c == u'P') or
(self.state == 8 and c == u't' or c == u'T') or
(self.state == 9 and c == u'>')):
self.state += 1
else:
Expand Down