Pipe #33

Closed
wants to merge 24 commits into
from

Conversation

Projects
None yet
7 participants
Collaborator

tnm commented Jul 8, 2012

This adds a full implementation of the Pipe Based Method of talking to Pygments. It maintains full API compatibility with the existing pygments.rb library.

The basic data flow:

  1. Rubyland opens a pipe to the mentos.py process if one does not already exist
  2. mentos listens for a fixed 32-length string, representing an integer, which represents header length
  3. Rubyland sends over a JSON header of the 'RPC'-ish method name, args, kwargs, and length of text to come (if any)
  4. Rubyland sends over the text, if any
  5. mentos returns the requested data

This pull also includes handling for dealing with dead child processes, catching SIGCHLD and then issuing the subsequent wait. We are also able to re-spawn child processes if they die. We also catch relevant errors involved in the syscalls.

This also adds a benchmark tool allowing for arbitrary iterations as well as increasing the length of the input data itself.

All existing tests are green, and I've also added several more unit tests. There are now also some test data files (from Redis and Gunicorn) in test/

@tmm1 tmm1 commented on an outdated diff Jul 8, 2012

lib/pygments/mentos.py
+ Highlight the relevant code, and return a result string.
+ The default formatter is html, but alternate formatters can be passed in via
+ the formatter_name argument. Additional paramters can be passed as args
+ or kwargs.
+ """
+ # Default to html if we don't have the formatter name.
+ if formatter_name:
+ _format_name = str(formatter_name)
+ else:
+ _format_name = "html"
+
+ # Return a lexer object
+ lexer = self.return_lexer(args, kwargs)
+
+ # Make sure we sucessfuly got a lexer
+ if lexer:
@tmm1

tmm1 Jul 8, 2012

Owner

What happens in the else case here? Do we need to send back an error response of some sort?

@tmm1 tmm1 and 1 other commented on an outdated diff Jul 8, 2012

lib/pygments/mentos.py
+
+ The header is of form:
+ { "method": "highlight", "args": [], "kwargs": {"arg1": "v"}, "bytes": 128, "fd": "8"}
+ """
+
+ while True:
+ res = None
+
+ # The loop begins by reading off a simple 32-arity string representing
+ # an integer of 32 bits. This is the length of our JSON header. Using
+ # this method allows to avoid worrying about newlines.
+ size = sys.stdin.read(32)
+
+ # Read from stdin the amount of bytes we were told to expect.
+ header_bytes = int(size, 2)
+ line = sys.stdin.read(header_bytes)
@tmm1

tmm1 Jul 8, 2012

Owner

We've had issues in the past sending utf-8 over stdin. I think we did something like

sys.stdin  = codecs.getreader('UTF-8')(sys.stdin)
@brynary

brynary Jul 23, 2012

@tmm1 -- What issues did you see with utf-8 over stdin in the past?

@tmm1 tmm1 commented on an outdated diff Jul 8, 2012

lib/pygments/popen.rb
+ # The #start method also includes logic for dealing with signals from the
+ # child.
+ #
+ def start(pygments_path = File.expand_path('../../../vendor/pygments-main/', __FILE__))
+ ENV['PYGMENTS_PATH'] = pygments_path
+
+ # Make sure we kill off the child when we're done
+ at_exit { stop }
+
+ # A pipe to the mentos python process. #POSIX::Spawn#popen4 gives us
+ # the pid and three IO objects to write and read..
+ @pid, @in, @out, @err = popen4(File.expand_path('../mentos.py', __FILE__))
+
+ # Deal with dying child processes.
+ Signal.trap('CHLD') do
+
@tmm1

tmm1 Jul 8, 2012

Owner

I think you can do something like this to make sure other existing CHLD handlers are not overwritten:

old = trap('CHLD') do
  old.call
end

I'm not sure we want to in this case though... since the other handler is probably also going to call waitpid and would block.

@tmm1 tmm1 and 1 other commented on an outdated diff Jul 8, 2012

lib/pygments/popen.rb
+
+ # A pipe to the mentos python process. #POSIX::Spawn#popen4 gives us
+ # the pid and three IO objects to write and read..
+ @pid, @in, @out, @err = popen4(File.expand_path('../mentos.py', __FILE__))
+
+ # Deal with dying child processes.
+ Signal.trap('CHLD') do
+
+ # Once waitpid() returns the pid (i.e., the child process exited),
+ # we can safely set our pid variable to nil. Next time a Pygments.rb
+ # method gets called, the child will be spawned again, so we don't
+ # need to spawn a new child in this block right now. For extra safety,
+ # if an ECHILD (no children) is set by waitpid(), don't die horribly;
+ # still set the @pid to nil.
+ begin
+ @pid = nil if Process.waitpid == @pid
@tmm1

tmm1 Jul 8, 2012

Owner

Does it makes sense to use Process::WNOHANG here as well? To avoid any blocking in case the process has already been reaped.

Also might be worth using Process.waitpid(@pid) instead, but I'm not convinced.

@tnm

tnm Jul 8, 2012

Collaborator

Yeah totally, I can add in the WNOHANG. I do think plain Process.waitpid is fine here.

Also I think once we get the start/stop process exactly how we want, I'll be finnin go extract it out since it seems generic and re-usable.

@tmm1 tmm1 and 3 others commented on an outdated diff Jul 8, 2012

lib/pygments/popen.rb
+ Process.waitpid(@pid)
+ rescue Errno::ESRCH, Errno::ECHILD
+ end
+ end
+
+ @pid = nil
+ end
+
+ # Check for a @pid variable, and then hit `kill -0` with the pid to
+ # check if the pid is still in the process table. If this function
+ # gives us an ENOENT or ESRCH, we can also safely return false (no process
+ # to worry about).
+ #
+ # Returns true if the child is alive.
+ def alive?
+ return true if @pid && Process.kill(0, @pid)
@tmm1

tmm1 Jul 8, 2012

Owner

I used this kill(0, pid) trick but I'm wondering if there's a better solution. I guess I'm concerned it might be expensive, but it doesn't really look that way.

/cc @rtomayko @scottjg

@scottjg

scottjg Jul 9, 2012

i'm not sure offhand how expensive it is, but i'm wondering if it's really reliable/necessary? if the process is dead, won't the pipe be broken - and that gets caught below anyway?

@tmm1

tmm1 Jul 9, 2012

Owner

Good point, if we catch the EPIPE when writing the json payload and respawn/retry that'll probably be good enough.

@rtomayko

rtomayko Jul 9, 2012

My understanding is that kill(0, pid) is fast and reliable and the best way to do a pure process exists check. Use it if you need it. Agreed EPIPE looks like it makes it unnecessary though.

@tnm

tnm Jul 9, 2012

Collaborator

Yep, the EPIPE should take care of it, although I suppose it's harmless enough to leave this check in. I'll try to rig up some tests anyway to get some data on the perf implications, if any.

@tmm1 tmm1 commented on an outdated diff Jul 8, 2012

test/test_pygments.rb
-class PygmentsLexerTest < Test::Unit::TestCase
- include Pygments
+P = Pygments
@tmm1

tmm1 Jul 8, 2012

Owner

This whole P. thing is kind of stupid, but the include Pygments into all the test classes and subclasses was causing weird inheritance behaviors and making the tests unreliable. Maybe there's a better solution, but I couldn't come up with anything..

Owner

tmm1 commented Jul 8, 2012

🔥 This looks great. If the simplejson issue is resolved and CI is green, we should branch deploy this to an fe or two and see what happens.

@tnm tnm commented on an outdated diff Jul 8, 2012

lib/pygments/popen.rb
@@ -0,0 +1,261 @@
+# coding: utf-8
+
+require 'posix/spawn'
+require 'yajl'
+
+# Error class
+class MentosError < IOError
@tnm

tnm Jul 8, 2012

Collaborator

I'll add in the code so we make full use of this error class (getting a Python trace, etc)

Has this been tested on Windows?

@postmodern postmodern and 1 other commented on an outdated diff Jul 9, 2012

lib/pygments/popen.rb
@@ -0,0 +1,261 @@
+# coding: utf-8
+
+require 'posix/spawn'
+require 'yajl'
@postmodern

postmodern Jul 9, 2012

Why yajl instead of json or multi_json?

@tmm1

tmm1 Jul 9, 2012

Owner

I like yajl, and it provides a fast symbolize_keys option.

@postmodern

postmodern Jul 9, 2012

The fewer dependencies and C-extensions the better.

Owner

tmm1 commented Jul 9, 2012

Has this been tested on Windows?

No. Likely will not work, although posix-spawn kinda works on 1.9 in windows.

Either way, it requires python which is not bundled on windows. win32 is not a priority right now. You can continue to use pygments 0.2.x on windows if that works, since it's API compatible.

rtomayko commented Jul 9, 2012

Man this looks awesome. Good shit.

brynary commented Jul 17, 2012

Hey @tmm1, @tnm...

I tried this branch out yesterday, and I think I ran into an issue where Pygments returned the wrong result for a request. Haven't been able to reproduce yet. Is there any chance that a bug in this code might cause Pygments to incorrectly return the result of a previous highlight? (Or perhaps if it's used incorrectly -- maybe somehow something is sharing a socket which shouldn't be, etc.)

Thoughts?

-Bryan

Collaborator

tnm commented Jul 17, 2012

Hey @brynary,

I'm still finishing this up, so I wouldn't use it quite yet. Once the release is done, a bug like that won't be possible :)

-t

brynary commented Jul 18, 2012

@tnm -- Thanks for the heads up. Does the behavior I describe sound like a possible or known bug?

tnm added some commits Aug 4, 2012

@tnm tnm Update core mentos operations, adding pipe checks.
This also updates tests, improves the shell lexer recognition, and allows for
logging to be set via the ENV.
0b2049c
@tnm tnm Update with latest error handling, timeouts, and local lexers 02c05ff

tnm closed this Sep 25, 2012

cespare commented Oct 1, 2012

Hey, I'm thinking of writing something like pygments.rb for Go, and I'm curious why you guys switched approaches from the embedded python to a child process. I looked through this ticket and others and couldn't see any mention of the reasoning. Does this approach afford some kind of speedup, maybe? Perhaps using rubypython to embed a python runtime has higher memory usage than a separate process, or can cause instabilities? I'm just guessing here.

Collaborator

tnm commented Oct 1, 2012

Hey @cespare — There were reliability issues with rubypython and the embedded approach; too-frequent segfaults, mostly, as well as difficulty in debugging them. #13 was an example that would occur sometimes in tests.

cespare commented Oct 1, 2012

@tnm Thanks, appreciate it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment