Skip to content
This repository

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Simplify HTML space compression (this fixes issue 304) #16

Closed
wants to merge 2 commits into from

2 participants

John Simon Wesley Moore
John Simon

Fix for issue 304.

I replaced what looked to be over-engineered code with a naive but conservative regex. I did a quick unscientific test on one page, and even after sending it through gzip it's still a 10% improvement over no space removal, so it's probably worth the CPU.

Wesley Moore wezm commented on the diff
r2/r2/lib/filters.py
((16 lines not shown))
61   - _between_tags1 = re.compile('> +')
62   - _between_tags2 = re.compile(' +<')
63   - _spaces = re.compile('[\s]+')
64   - _ignore = re.compile('(' + MD_START + '.*?' + MD_END + ')', re.S | re.I)
65   - def spaceCompress(content):
66   - res = ''
67   - for p in _ignore.split(content):
68   - if not p.startswith(MD_START) and not p.endswith(MD_END):
69   - p = _spaces.sub(' ', p)
70   - p = _between_tags1.sub('>', p)
71   - p = _between_tags2.sub('<', p)
72   - res += p
73   - return res
  49 +# There is an out-of-date, currently unused C implementation of this in Cfilters.
  50 +c_websafe = python_websafe
  51 +c_websafe_json = python_websafe_json
5
Wesley Moore Owner
wezm added a note

The old code attempted to import the C implementations on line 50. If that failed then it defined the functions with there Python implementation. Your changes prevent any of the C functions from being imported. Particularly c_websafe and uwebsafe_json which were used later on the file. I think that the previous try/catch construct needs to remain but without the import of uspace_compress and definition of spaceCompress inside the try block.

John Simon
johnsoft added a note

There are three symbols which are defined in both halves of the try block - c_websafe, c_websafe_json, and spaceCompress. uspace_compress, which is defined only in the top half, isn't referenced anywhere else in the file. The handful of c_* functions are assigned to python_* equivalents, so in the end all the same names have a valid definition.

Would you rather I use the C implementations of everything but spaceCompress? I thought it would make more sense to deprecate filters.c all-at-once instead of function-by-function. I could also replace the 'try...else...' with 'if 0...else...' while keeping the note about the C code being outdated, so the code keeps the same structure and can be easily reverted if the C is updated.

(By the way, I'm operating under the assumption that you're not desperately hurting for CPU time, as I imagine Reddit was when they wrote this in C, and it's okay to move this code back up to Python - let me know if that's off-base)

Wesley Moore Owner
wezm added a note

I'd very much like to keep the C implementations. The LW server(s) do run into CPU limitations. This is mitigated by autoscaling in AWS and memcached but each server that gets started costs money. There are currently three servers running, each one costs over $100 a month.

I would suggest you leave the try/catch block, remove the import of uspace_wrapped and include you note about it. Remove both definitions of spaceCompress, replacing it with your one.

Wesley Moore Owner
wezm added a note

One other thing. Be sure to test your implementation on text that isn't valid utf-8. We encounter this surprising often. This is why the original spaceCompress has the try/catch block:

    def spaceCompress(text):
        try:
            text = unicode(text, 'utf-8')
        except TypeError:
            text = unicode(text)
John Simon
johnsoft added a note

Ah, that changes things. I'll rework this patch, then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
John Simon

See other pull request

John Simon johnsoft closed this
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
This page is out of date. Refresh to see the latest.

Showing 2 changed files with 10 additions and 25 deletions. Show diff stats Hide diff stats

  1. +4 0 r2/r2/lib/c/filters.c
  2. +6 25 r2/r2/lib/filters.py
4 r2/r2/lib/c/filters.c
@@ -24,6 +24,10 @@
24 24 #include <stdio.h>
25 25 #include <string.h>
26 26
  27 +/* This code (specifically, filters_u?space_compress) is out-of-date and not currently
  28 + * used. To fix the issue in http://code.google.com/p/lesswrong/issues/detail?id=304 ,
  29 + * I updated the pure Python code in r2/r2/lib/filters.py, and left the C code behind.
  30 + */
27 31
28 32 PyObject *unicode_arg(PyObject *args) {
29 33 PyObject * com;
31 r2/r2/lib/filters.py
@@ -46,31 +46,12 @@ def python_websafe(text):
46 46 def python_websafe_json(text):
47 47 return text.replace('&', "&amp;").replace("<", "&lt;").replace(">", "&gt;")
48 48
49   -try:
50   - from Cfilters import uwebsafe as c_websafe, uspace_compress, \
51   - uwebsafe_json as c_websafe_json
52   - def spaceCompress(text):
53   - try:
54   - text = unicode(text, 'utf-8')
55   - except TypeError:
56   - text = unicode(text)
57   - return uspace_compress(text)
58   -except ImportError:
59   - c_websafe = python_websafe
60   - c_websafe_json = python_websafe_json
61   - _between_tags1 = re.compile('> +')
62   - _between_tags2 = re.compile(' +<')
63   - _spaces = re.compile('[\s]+')
64   - _ignore = re.compile('(' + MD_START + '.*?' + MD_END + ')', re.S | re.I)
65   - def spaceCompress(content):
66   - res = ''
67   - for p in _ignore.split(content):
68   - if not p.startswith(MD_START) and not p.endswith(MD_END):
69   - p = _spaces.sub(' ', p)
70   - p = _between_tags1.sub('>', p)
71   - p = _between_tags2.sub('<', p)
72   - res += p
73   - return res
  49 +# There is an out-of-date, currently unused C implementation of this in Cfilters.
  50 +c_websafe = python_websafe
  51 +c_websafe_json = python_websafe_json
  52 +_spaces = re.compile(r'(\s)\s+', re.VERBOSE)
  53 +def spaceCompress(content):
  54 + return _spaces.sub(r'\1', content.strip())
74 55
75 56 class _Unsafe(unicode): pass
76 57

Tip: You can add notes to lines in a file. Hover to the left of a line to make a note

Something went wrong with that request. Please try again.