Mix Microsoft namespace and paragraph elements #31

Merged
merged 1 commit into from Jun 9, 2016

@thetaylor82
Contributor

Microsoft Exchange emails often include undocumented, non-standard HTML. DOMDocument treats the <o:p> tags as paragraphs, which produces a large number of line returns. These are then mopped up by $output = preg_replace("/\n\n\n*/im", "\n\n", $output); which leaves double line returns everywhere. To fix this, the o:p elements should be removed, and any element with a class name of MsoNormal (the standard class name in any Microsoft export or Outlook email for a paragraph that behaves like a line return) should be changed to a line with a single break after it. This makes these emails display correctly without affecting emails that do not use this markup.
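To make the symptom concrete, here is a minimal sketch of what the quoted cleanup step does on its own. The input string is an invented illustration of text after each paragraph and <o:p> element has emitted its own newlines; it is not captured from real library output.

```php
<?php
// Illustrative input only: text with long runs of newlines, roughly what
// nested <p> and <o:p> elements might leave behind.
$output = "Dear html2text,\n\n\n\n\nThis is an example email.\n\n\n\nAndrew";

// The cleanup step quoted above: any run of two or more newlines
// is collapsed to exactly two.
$output = preg_replace("/\n\n\n*/im", "\n\n", $output);

echo $output;
```

Every run collapses to a blank line, so lines that were meant to be separated by a single line return still end up double spaced. That is why the cleanup alone cannot fix Outlook-style markup.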

@thetaylor82 thetaylor82 Mix Microsoft namespace and paragraph elements
6ba6ce5
@collizo4sky collizo4sky added a commit to collizo4sky/html2text that referenced this pull request May 24, 2016
@collizo4sky collizo4sky Mix Microsoft namespace and paragraph elements #31 69b5812
@soundasleep
Owner

I love this change idea! Thank you so much for the PR. Is there any chance you can include a test file from MS Word so we can verify it keeps working in the future?

@thetaylor82
Contributor

This is the source from an email I sent to test functionality:

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii"><meta name=Generator content="Microsoft Word 15 (filtered medium)"><style><!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4;} @font-face {font-family:Calibri; panose-1:2 15 5 2 2 2 4 3 2 4;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {margin:0cm; margin-bottom:.0001pt; font-size:11.0pt; font-family:"Calibri",sans-serif; mso-fareast-language:EN-US;} a:link, span.MsoHyperlink {mso-style-priority:99; color:#0563C1; text-decoration:underline;} a:visited, span.MsoHyperlinkFollowed {mso-style-priority:99; color:#954F72; text-decoration:underline;} span.EmailStyle17 {mso-style-type:personal-compose; font-family:"Calibri",sans-serif; color:windowtext;} .MsoChpDefault {mso-style-type:export-only; font-family:"Calibri",sans-serif; mso-fareast-language:EN-US;} @page WordSection1 {size:612.0pt 792.0pt; margin:72.0pt 72.0pt 72.0pt 72.0pt;} div.WordSection1 {page:WordSection1;} --></style><!--[if gte mso 9]><xml> <o:shapedefaults v:ext="edit" spidmax="1026" /> </xml><![endif]--><!--[if gte mso 9]><xml> <o:shapelayout v:ext="edit"> <o:idmap v:ext="edit" data="1" /> </o:shapelayout></xml><![endif]--></head><body lang=EN-GB link="#0563C1" vlink="#954F72"><div class=WordSection1><p class=MsoNormal>Dear html2text,<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>This is an example email that can be used to test html2text conversion of outlook / exchange emails.<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>The addition of &lt;o:p&gt; tags is very annoying!<o:p></o:p></p><p class=MsoNormal>This is a single 
line return<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal><b>This is bold<o:p></o:p></b></p><p class=MsoNormal><i>This is italic<o:p></o:p></i></p><p class=MsoNormal><u>This is underline<o:p></o:p></u></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>Andrew<o:p></o:p></p></div></body></html>
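
For illustration, here is a simplified, string-based sketch of how markup like the sample above could be reduced to single-spaced lines. The function name msoParagraphsToLines is hypothetical and this is not the code merged in this pull request (the library itself works through DOMDocument); it only demonstrates the idea of dropping the o:p tags and treating each MsoNormal paragraph as one line followed by a single break. It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper, not part of html2text: a simplified illustration
// of the approach described in this pull request.
function msoParagraphsToLines(string $html): string
{
    // 1. Drop the Office namespace paragraph tags, keeping whatever they wrap.
    $html = str_replace(array("<o:p>", "</o:p>"), "", $html);

    // 2. Treat each MsoNormal paragraph as one line followed by a single break.
    $html = preg_replace(
        '/<p\s+class="?MsoNormal"?[^>]*>(.*?)<\/p>/is',
        "$1\n",
        $html
    );

    // 3. Strip the remaining tags first, then decode entities such as
    //    &nbsp; and &lt; so that escaped text like &lt;o:p&gt; survives.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

    // Non-breaking spaces left over from spacer paragraphs become plain spaces.
    $text = str_replace("\u{00A0}", " ", $text);

    // Trim trailing whitespace so spacer paragraphs become empty lines.
    return preg_replace('/[ \t]+$/m', '', $text);
}

// A shortened excerpt of the sample email above.
$sample = '<p class=MsoNormal>Dear html2text,<o:p></o:p></p>'
    . '<p class=MsoNormal><o:p>&nbsp;</o:p></p>'
    . '<p class=MsoNormal>The addition of &lt;o:p&gt; tags is very annoying!<o:p></o:p></p>'
    . '<p class=MsoNormal><b>This is bold<o:p></o:p></b></p>';

echo msoParagraphsToLines($sample);
```

In this sketch, consecutive MsoNormal paragraphs come out on adjacent lines, and only the explicit spacer paragraph (the one containing &nbsp;) produces a blank line.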

@soundasleep soundasleep merged commit 6ba6ce5 into soundasleep:master Jun 9, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
@soundasleep
Owner

Thank you! I've put that into a new test file and added some more checks and code comments, and this is now part of 0.3.4. Does this change work for you?

@thetaylor82
Contributor

Tested and confirmed working. Thank you very much!
