kmike commented Jun 4, 2019

This is an unfinished fix for #113; please don't merge. I've opened the PR just to share the initial code. It has these problems:

  • cleaning can be too aggressive, e.g. <style> elements are removed. This may cause missed extractions if microdata is set on these elements.
  • tests are failing (I haven't looked at them yet; the fix is probably not correct at all :)
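To make the first problem concrete, here is a minimal sketch using only the Python standard library. It is not the code from this PR: the snippet, `itemprops`, and `strip_tags` are hypothetical names introduced for illustration, and `strip_tags` only stands in for whatever cleaner the real code uses. It shows how removing <style> elements can silently drop a property that was set on one.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet: microdata is attached to a <style> element.
SNIPPET = """
<div itemscope="" itemtype="http://schema.org/Product">
  <span itemprop="name">Widget</span>
  <style itemprop="description">.x { color: red; }</style>
</div>
""".strip()


def itemprops(root):
    """Return the itemprop names found anywhere in the tree, in document order."""
    return [el.get("itemprop") for el in root.iter() if el.get("itemprop") is not None]


def strip_tags(root, tags):
    """Remove every element whose tag is in `tags` (a stand-in for an aggressive cleaner)."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in tags:
                parent.remove(child)
    return root


before = itemprops(ET.fromstring(SNIPPET))
after = itemprops(strip_tags(ET.fromstring(SNIPPET), {"style", "script"}))
print(before)  # ['name', 'description']
print(after)   # ['name']
```

In this toy case the `description` property disappears after cleaning, which is the kind of missed extraction described above.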

The problem with cleaning the HTML tree more than once, instead of a single time up front, is that it could make the algorithm O(N^2) - but maybe that's fine.
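A rough way to see where the quadratic cost could come from, assuming (this is an assumption, not something stated in the PR) that cleaning is repeated on the subtree of every item and that items can be nested: count how many nodes get visited in each strategy. The helper names below are hypothetical and the "cleaning" is reduced to a plain traversal.

```python
import xml.etree.ElementTree as ET


def nested_items(depth):
    """Build a chain of `depth` nested <div itemscope> elements (worst case)."""
    root = ET.Element("div", {"itemscope": ""})
    node = root
    for _ in range(depth - 1):
        node = ET.SubElement(node, "div", {"itemscope": ""})
    return root


def visits_clean_once(root):
    """Nodes touched when the whole tree is traversed a single time."""
    return sum(1 for _ in root.iter())


def visits_clean_per_item(root):
    """Nodes touched when each item's subtree is traversed separately."""
    return sum(
        sum(1 for _ in item.iter())
        for item in root.iter()
        if item.get("itemscope") is not None
    )


tree = nested_items(100)
print(visits_clean_once(tree))      # 100
print(visits_clean_per_item(tree))  # 5050, i.e. 100 * 101 / 2
```

With N nested items the per-item strategy touches N(N+1)/2 nodes, while a single pass touches N. Real pages are rarely nested this deeply, which may be why the quadratic behaviour is acceptable in practice.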
