Raise on character encoding errors #73

Open
burlesona opened this issue May 29, 2018 · 2 comments

Comments

burlesona commented May 29, 2018

I've been using Reverse Markdown and it works great most of the time. I've run into one issue that I thought I'd get your opinion on.

Sometimes the HTML documents I'm converting have character encoding problems, leading to the dreaded ArgumentError: invalid byte sequence in UTF-8.

In other places I'm fixing this by coercing the lines of a file to UTF-8 as I read them. I've discovered that when you parse a line you can generally just call force_encoding on it, and that handles typographic marks and whatnot pretty well, but occasionally it's not enough and you have to be more aggressive, i.e. the following:

def clean_line(line)
  # The encoding must be UTF-8; if non-UTF-8 bytes are encountered we remove them.
  # Weirdly though, force_encoding itself never fails (it only relabels the bytes),
  # so nothing blows up until you call something else on the string...
  line.force_encoding("UTF-8").strip # strip will make this raise if it didn't work
rescue ArgumentError
  # ... in that case we want to remove the offending characters.
  # Note: with 'binary' as the source encoding, every byte >= 0x80 is dropped.
  line.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
end

I end up using this same code to scrub HTML before I pass it to ReverseMarkdown, but it would probably be more efficient to handle it inside the gem, and it would save other people from this same headache.
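
For comparison, here is a minimal sketch of that pre-conversion scrub using `String#valid_encoding?` and `String#scrub`, assuming Ruby 2.1 or later. `scrub_html` is a hypothetical helper name for illustration only; it is not part of ReverseMarkdown and not necessarily how the gem would handle this.

```ruby
# Hypothetical helper (not part of ReverseMarkdown): returns a valid UTF-8
# copy of +html+, removing only the bytes that do not form valid UTF-8.
def scrub_html(html)
  # dup first so the caller's string (which may be frozen) is not mutated
  utf8 = html.dup.force_encoding(Encoding::UTF_8)

  # force_encoding only relabels the bytes; valid_encoding? actually checks them
  return utf8 if utf8.valid_encoding?

  # String#scrub replaces each invalid byte sequence; '' drops it entirely
  utf8.scrub('')
end

# A stray Latin-1 byte (\xE9) next to a valid multibyte character (\u00EF)
broken = "caf\xE9 " + "na\u00EFve"
cleaned = scrub_html(broken)
puts cleaned.valid_encoding?
```

One possible reason to prefer this shape: transcoding from 'binary' drops every non-ASCII byte, while `scrub` removes only the invalid sequences, so valid characters such as the ï above survive. The cleaned string could then be handed to `ReverseMarkdown.convert`.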

Are you interested in handling encoding errors inside the gem? If yes, you can use that code, or I can try to circle back with a PR. If not, no worries, just thought it might be worth considering.

Thanks for a great gem!

xijo (Owner) commented Apr 5, 2019

Hey @burlesona,

Sorry for the late response!

It sure does sound like an interesting issue and might be worth solving within the gem, maybe with a flag to trigger it. Can you provide an example document that triggers the problem?

Thanks,
Jo

xijo (Owner) commented Oct 2, 2019

Hello @burlesona,

Please have a look at the PR and let me know what you think!

xijo added a commit that referenced this issue Oct 2, 2019
Handle force_encoding issue, according to #73