The markdown output files that the script creates from my wordpress.xml all wrap normal text paragraphs at 80 characters (or fewer if a word would cross the boundary). Unfortunately, when I import these into my Jekyll blog (or any other markdown interpreter), the newlines are preserved. So, rather than paragraphs naturally following the screen width when converted from markdown to HTML, I get text that is hard-wrapped at the same point.
The only exception seems to be image URLs, which are not wrapped.
Can the script be updated to not wrap lines at 80 characters?
It really shouldn't do this. Strange.
Do you see it?
It made me wind up using a more convoluted (and less useful) method to get the data, because I had no means of differentiating "wraps at 80" from wanted line breaks.
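For anyone stuck with already-wrapped output, here is a minimal post-processing sketch, not part of exitwp or html2text. It assumes that a blank line separates paragraphs, that two trailing spaces mark an intended hard break, and that headings, list items, blockquotes, indented code and reference-link definitions should be left alone. The function name `unwrap_paragraphs` and the `_BLOCK_START` pattern are hypothetical, and the heuristic will not cover every markdown construct.

```python
import re

# Lines that start a markdown block and should not be merged into a paragraph:
# indented code, headings, list items, blockquotes, reference-link definitions.
_BLOCK_START = re.compile(r"^(\s{4,}|\t|#{1,6}\s|[-*+]\s|\d+\.\s|>|\[[^\]]+\]:\s)")


def unwrap_paragraphs(markdown_text):
    """Join hard-wrapped lines inside each paragraph into a single line."""
    out = []
    buffer = []  # lines of the paragraph currently being collected

    def flush():
        # Emit the collected paragraph (if any) as one line.
        if buffer:
            out.append(" ".join(buffer))
            buffer.clear()

    for line in markdown_text.splitlines():
        if not line.strip():
            # Blank line: paragraph boundary, keep it.
            flush()
            out.append("")
        elif _BLOCK_START.match(line):
            # Block-level line: emit unchanged.
            flush()
            out.append(line)
        elif line.endswith("  "):
            # Two trailing spaces: treat as a wanted line break and keep it.
            buffer.append(line.rstrip())
            out.append(" ".join(buffer) + "  ")
            buffer.clear()
        else:
            buffer.append(line.strip())
    flush()
    return "\n".join(out)


if __name__ == "__main__":
    wrapped = "This paragraph was\nwrapped by the converter.\n\n- item one\n- item two"
    print(unwrap_paragraphs(wrapped))
```

This only guesses at which newlines were wanted, so fixing the wrapping at conversion time is still the better option.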
I'm having this same problem.
I was able to fix this by switching to html2text_file instead of html2text.
@jamesward could you be more specific about what you've changed? Simply changing the method call doesn't work for me.
OK, I've solved it. First off, the latest html2text doesn't have html2text_file and also causes a whole bunch of additional problems (all the links are inline instead of reference style, for example). So you have to get a pre-3.x version. Then you also have to use html2text_file, since that solves the 80-character wrapping problem even with the reference links. I'll submit a pull request, of course ;-).
You can see my fork here: https://github.com/jamesward/exitwp
@jamesward thanks! Your shift to html2text_file and exception handling was exactly what I needed.
Pandoc is one of the best parsers I've tried. I'm thinking of bringing it back instead of html2text.