
Remove control characters before cleanhtml step

- control characters were breaking the soupparser html parsing step
1 parent b763a2d commit 611226f0eabf4ae844c5c4c2809e890c172749c0 David Powell committed Aug 16, 2011
Showing with 5 additions and 1 deletion.
  1. +5 −1 r2/r2/lib/filters.py
r2/r2/lib/filters.py
@@ -195,8 +195,12 @@ def killhtml(html=''):
     cleaned_html = ' '.join([fragment.strip() for fragment in text])
     return cleaned_html
+control_chars = re.compile('[\x00-\x08\x0b\0xc\x0e-\x1f]') # Control characters *except* \t \r \n
+def remove_control_chars(text):
+    return control_chars.sub('',text)
+
 def cleanhtml(html='', cleaner=None):
-    html_doc = soupparser.fromstring(html)
+    html_doc = soupparser.fromstring(remove_control_chars(html))
     if not cleaner:
         cleaner = sanitizer
     cleaned_html = cleaner.clean_html(html_doc)

2 comments on commit 611226f

@ArisKatsaris

The regular expression in this commit is wrong -- where it says "\0xc" it should have said "\x0c". As a result it removes the letters 'x' and 'c'.
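The effect described above can be reproduced with a minimal sketch. It assumes Python 3, where `"\0"` is still read as an octal escape for NUL, so `"\0xc"` becomes NUL followed by the literal letters `x` and `c`. The names `buggy_control_chars`, `fixed_control_chars`, and `sample` are hypothetical and appear only in this illustration, not in filters.py.

```python
import re

# Pattern as committed: "\0xc" parses as "\0" (NUL) plus the literal
# letters "x" and "c", so both letters end up in the character class
# and the form feed (\x0c) is NOT in it.
buggy_control_chars = re.compile('[\x00-\x08\x0b\0xc\x0e-\x1f]')

# Pattern as suggested in the comment: "\x0c" is the form feed character.
fixed_control_chars = re.compile('[\x00-\x08\x0b\x0c\x0e-\x1f]')

# Hypothetical input: a form feed, a NUL, plus tab and newline,
# which both patterns are meant to keep.
sample = 'exact\x0c text\x00\twith tab\n'

buggy_result = buggy_control_chars.sub('', sample)
fixed_result = fixed_control_chars.sub('', sample)

print(repr(buggy_result))  # letters x and c are gone, form feed survives
print(repr(fixed_result))  # only the form feed and NUL are removed
```

With the committed pattern the letters `x` and `c` are stripped from ordinary text and the form feed stays in place; with the corrected escape only the control characters are removed, and tab and newline are left alone.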

@drpowell

Ugh! Thanks, that'll teach me to push before the tests pass
