Make it easier to work with encoded (non-ascii) strings #1

rwstauner opened this Issue Nov 4, 2011 · 3 comments



RJBS recently made this change:

Here is his report, excerpted from IRC:
should look like

Things to report:

  1. T::VC expects "string" to be a byte string, not text. I had to encode it first.
  2. T::VC returns a byte string, not a text string. I had to decode it afterward.
  3. I had to communicate the character encoding of the bytestream to Vim, which meant passing '+set fenc=utf-8', but I could not do this without copying and pasting the default options as well.
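The byte/text distinction behind points 1 and 2 can be seen with Perl's core Encode module alone. This is a minimal sketch; the sample string is chosen for illustration and does not come from the report.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A text (character) string: four characters, one of them non-ASCII.
my $text = "caf\x{e9}";

# Encoding yields a byte string; U+00E9 occupies two bytes in UTF-8.
# LEAVE_SRC keeps Encode from clobbering the source when CHECK is set.
my $octets = encode('UTF-8', $text, Encode::FB_CROAK | Encode::LEAVE_SRC);

# Decoding the bytes recovers the original characters.
my $round_trip = decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);

printf "characters: %d, bytes: %d, round trip ok: %s\n",
    length($text), length($octets), ($round_trip eq $text ? 'yes' : 'no');
# prints "characters: 4, bytes: 5, round trip ok: yes"
```

A caller that hands the four-character string to code expecting bytes, or treats the five returned bytes as text, gets the mismatch described above.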

Thoughts on addressing this:

  1. Either document that it expects a byte string or fix it to expect unicode, which it will then encode, and document that it expects text, not bytes. I suggest the latter. People passing a string should be passing unencoded text. It also eliminates the question of passing the encoding, because if you pass a byte string, you will also need to allow for the encoding to be passed.
  2. I strongly suggest returning a character string, but otherwise document that an encoded byte string is returned. In either case, you probably need to pass an argument to ensure that it is always returned in one encoding, so it can be reliably decoded.
  3. Provide access to @VIM_OPTIONS (or whatever it was called) as ->default_vim_options so that callers can ask for "those options plus more", as in (options => [ T::VC->default_vim_options, ... ]).

Looking at usage by reverse deps I found this:

    # any encoding will do if vim automatically detects it
    my $vim_encoding = 'utf-8';
    my $BOM = "\x{feff}";
    my $syn = Text::VimColor->new(
            filetype    => $lang,
            string      => encode($vim_encoding, $BOM . $str),
    );
    $str = decode($vim_encoding, $syn->html);
    $str =~ s/^$BOM//;
    return $str;

For reference, rjbs's similar code is here:

    sub build_html {
      my ($self, $str, $param) = @_;

      # Text::VimColor expects bytes, so encode the text first.
      my $octets = Encode::encode('utf-8', $str, Encode::FB_CROAK);

      my $vim = Text::VimColor->new(
        string   => $octets,
        filetype => $param->{filetype},

        # Copy of the default options, plus the file encoding for Vim.
        vim_options => [
          qw( -RXZ -i NONE -u NONE -N -n ), "+set nomodeline", '+set fenc=utf-8',
        ],
      );

      # The HTML comes back as bytes; decode it into text.
      my $html_bytes = $vim->html;
      my $html = Encode::decode('utf-8', $html_bytes);

      return $html;
    }

rwstauner added a commit that referenced this issue Jan 18, 2012
Test methods of specifying encoding used by other modules
Ensure that we don't break backward compatibility
with other modules that have already implemented workarounds.

This is the beginning of addressing gh-1.

I have added extra_vim_options => [] for an easy way to append options to the list after the defaults.
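With extra_vim_options, the workaround from the report no longer needs a copy of the default option list. The sketch below assumes a Text::VimColor release that has extra_vim_options but still works on byte strings. The helper names utf8_vim_options and highlight_text are hypothetical and used only for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Options appended after Text::VimColor's defaults; only the file
# encoding hint from the report is needed.
sub utf8_vim_options { return ['+set fenc=utf-8'] }

# Hypothetical helper: take a character string, return highlighted HTML
# as a character string, doing the encode/decode steps by hand.
sub highlight_text {
    my ($text, $filetype) = @_;
    require Text::VimColor;    # loaded lazily; also needs vim installed

    my $octets = Encode::encode('UTF-8', $text);
    my $vim    = Text::VimColor->new(
        string            => $octets,
        filetype          => $filetype,
        extra_vim_options => utf8_vim_options(),
    );
    return Encode::decode('UTF-8', $vim->html);
}
```

A call such as highlight_text($source, 'perl') then keeps the encoding details in one place, and the default options stay under the module's control.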

rwstauner added a commit that closed this issue Feb 2, 2013
Accept (and return) character strings
closes gh-1.

Thanks, RJBS!
rwstauner closed this in e227c46 Feb 2, 2013
rwstauner added a commit that referenced this issue Feb 2, 2013
v0.23
  - Attempt to do the right thing with character strings:
    Encode them in UTF-8, tell vim the file encoding (UTF-8),
    and return a (decoded) character string.
    Thanks to Ricardo Signes for the very helpful report (gh-1).
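Going by the changelog entry above, a caller on v0.23 or later should be able to drop the manual encode and decode steps. This is a sketch under that assumption; highlight_characters is a hypothetical wrapper name, not part of the module.

```perl
use strict;
use warnings;

# Hypothetical wrapper: pass text in, get text back, and leave the
# UTF-8 handling to Text::VimColor (assumed to be v0.23 or later).
sub highlight_characters {
    my ($text, $filetype) = @_;
    require Text::VimColor;    # loaded lazily; also needs vim installed

    my $vim = Text::VimColor->new(
        string   => $text,        # a character string, not bytes
        filetype => $filetype,
    );
    return $vim->html;            # per the changelog, a decoded string
}
```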