outlook #326

uhuntu · 2023-04-09T09:41:49Z

When I was trying to translate some Chinese letters, seems with encoding, it will come out wrong characters.

ojwb · 2023-04-09T19:25:34Z

Removing escaping here is not the right thing to do - it will result in text emails containing characters such as <, > and & getting mangled.

What problem are you actually trying to solve with this change?

uhuntu · 2023-04-10T00:34:26Z

When I was trying to translate some Chinese letters, seems with encoding, it will come out wrong characters.

ojwb · 2023-04-11T03:02:04Z

I tried with some sample .msg files and I think I can reproduce:

./outlookmsg2html tst_unicode.msg
Use of uninitialized value in string eq at /usr/share/perl5/Email/Outlook/Message/AddressInfo.pm line 76.
<head>
<title>Testcase</title>
<meta name="author" content="">
<meta name="created" content="2019-10-21T11:29:02">
</head>
<pre>-/-
Char-&Atilde;&yen;-Char
-/-
Char-&Atilde;&#133;-Char
-/-
Char-&Atilde;&cedil;-Char
-/-
Char-&Atilde;&#152;-Char
-/-
Char-&Atilde;&brvbar;-Char
-/-
Char-&Atilde;&#134;-Char
 
</pre>

Looks like the escaping is treating the string as ISO-8859-1 rather than UTF-8. I think the appropriate fix is something along the lines of this, which solves the problem for my example at least:

diff --git a/xapian-applications/omega/outlookmsg2html.in b/xapian-applications/omega/outlookmsg2html.in
index 86c1a430b648..d96f318b192c 100644
--- a/xapian-applications/omega/outlookmsg2html.in
+++ b/xapian-applications/omega/outlookmsg2html.in
@@ -24,6 +24,7 @@
 # IN THE SOFTWARE.
 
 use strict;
+use Encode 'decode';
 eval {
     require Email::Outlook::Message;
     require HTML::Entities;
@@ -62,7 +63,7 @@ handle_mimepart($mime);
 
 sub do_header {
     my ($msg, $header) = @_;
-    my $s = $msg->header_str($header);
+    my $s = decode('utf-8', $msg->header_str($header));
     return HTML::Entities::encode_entities($s);
 }
 
@@ -85,7 +86,8 @@ sub handle_mimepart {
 	}
     } elsif ($type eq 'text') {
 	if ($sub eq 'plain') {
-	    print "<pre>", HTML::Entities::encode_entities($e->body), "</pre>\n";
+	    my $body = decode('utf-8', $e->body);
+	    print "<pre>", HTML::Entities::encode_entities($body), "</pre>\n";
 	    return 1;
 	} elsif ($sub eq 'html') {
 	    my $m = $e->body;

Does that solve it for you?

uhuntu · 2023-04-11T03:45:08Z

Unfortunately no.

ojwb · 2023-04-11T20:17:42Z

@uhuntu Can you provide a testcase?

ojwb · 2023-04-12T02:59:51Z

I found some more examples I already had including one with Chinese characters, and this patch seems to solve Unicode-related issues for everything I have to hand:

diff --git a/xapian-applications/omega/outlookmsg2html.in b/xapian-applications/omega/outlookmsg2html.in
index 86c1a430b648..85224ec2b633 100644
--- a/xapian-applications/omega/outlookmsg2html.in
+++ b/xapian-applications/omega/outlookmsg2html.in
@@ -85,10 +85,10 @@ sub handle_mimepart {
 	}
     } elsif ($type eq 'text') {
 	if ($sub eq 'plain') {
-	    print "<pre>", HTML::Entities::encode_entities($e->body), "</pre>\n";
+	    print "<pre>", HTML::Entities::encode_entities($e->body_str), "</pre>\n";
 	    return 1;
 	} elsif ($sub eq 'html') {
-	    my $m = $e->body;
+	    my $m = $e->body_str;
 	    $m =~ s!</?body[^>]*>!\n!gi;
 	    print $m, "\n";
 	    return 1;

I'm not sure about the second change as none of the example .msg files I have seem to have Unicode in HTML parts, but it seems logical that we want to use body_str there too.

If that doesn't help for your situation then I really am going to need a reproducer.

uhuntu · 2023-04-12T09:14:02Z

@ojwb Hi Olly, I leave you a message by your email, there is a test case I can give.

ojwb · 2023-04-12T19:20:37Z

Thanks - changing to body_str seems to fix your testcase.

There are 4 warning messages still:

Value for 'Subject' header with wide characters at /usr/share/perl5/Email/Outlook/Message.pm line 320.
Value for 'From' header with wide characters at /usr/share/perl5/Email/Outlook/Message.pm line 320.
Value for 'To' header with wide characters at /usr/share/perl5/Email/Outlook/Message.pm line 320.
Value for 'Cc' header with wide characters at /usr/share/perl5/Email/Outlook/Message.pm line 320.

That looks like a bug in Email::Outlook - I think it should call header_str_set rather than header_set there, or maybe check whether the string to set it Unicode or raw bytes and call different methods depending. I'll see if I can submit a patch for that upstream, but it doesn't seem to cause problems with the output of our script at least.

uhuntu · 2023-04-13T01:59:24Z

@ojwb Hello Olly, sorry I can not agree to close this issue, I pulled the latest master with your commit, but seems the issue did not gone. In my case, I just use /usr/local/lib/xapian-omega/bin/outlookmsg2html that living in my machine to determine the issue.

For example, that email I sent you outlookmsg2html 'Redmin - RFQ Catalog少 B8 這個欄位 .msg', I just check the message output, if just apply your patch, seems issue still exist.

ojwb · 2023-04-13T05:02:16Z

With the commit before the fix (0029277) I get:

<head>
<title>&lt;Redmin - RFQ / Catalog&#x5C11; B8 &#x9019;&#x500B;&#x6B04;&#x4F4D;&gt; </title>
<meta name="author" content="&quot;Judy.Lin &#x6797;&#x6DD1;&#x8C9E;&quot; &lt;Judy.Lin@tes-tec.com&gt;">
<meta name="created" content="2023-04-10T02:40:52">
</head>
<pre>Dear Hunt:

 

Redmin -  RFQ / Catalog : &aring;&deg;&#145; B8 &eacute;&#128;&#153;&aring;&#128;&#139;&aelig;&not;&#132;&auml;&frac12;&#141;, &egrave;&laquo;&#139;&aelig;&#130;&uml;&aring;&sup1;&laquo;&aring;&iquest;&#153;&aring;&cent;&#158;&aring;&#138;&nbsp;, &egrave;&not;&#157;&egrave;&not;&#157;.

 



 



 

 

 

Best regards!

Judy.Lin 

02-77228751 (&aring;&deg;&#136;&ccedil;&middot;&#154;)

</pre>

With the fix (6dc668a) I get:

<head>
<title>&lt;Redmin - RFQ / Catalog&#x5C11; B8 &#x9019;&#x500B;&#x6B04;&#x4F4D;&gt; </title>
<meta name="author" content="&quot;Judy.Lin &#x6797;&#x6DD1;&#x8C9E;&quot; &lt;Judy.Lin@tes-tec.com&gt;">
<meta name="created" content="2023-04-10T02:40:52">
</head>
<pre>Dear Hunt:

 

Redmin -  RFQ / Catalog : &#x5C11; B8 &#x9019;&#x500B;&#x6B04;&#x4F4D;, &#x8ACB;&#x60A8;&#x5E6B;&#x5FD9;&#x589E;&#x52A0;, &#x8B1D;&#x8B1D;.

 



 



 

 

 

Best regards!

Judy.Lin 

02-77228751 (&#x5C08;&#x7DDA;)

</pre>

I'm not fluent in Chinese, but decoding the HTML from the second one and pasting it into google translate strongly suggested it has decoded correctly.

If the second output is not correct, you'll need to pin-point exactly which part is still wrong.

If you're getting different output with the fixed outlookmsg2html to me then I'd suggest trying with the latest Email::Outlook Perl module - e.g. 0.920 and 0.921 both include Unicode-related fixes. I'm using 0.921 (as packaged for Debian unstable).

uhuntu · 2023-04-13T09:41:23Z

Thanks for explain, it works as expect.

Codepoints <= U+00FF seemed to be OK, but anything higher resulted in individual bytes of the UTF-8 encoding being treated as separate characters. Fixes #326, reported by uhuntu. (cherry picked from commit 6dc668a)

documentation: * Improve documentation for OmegaScript numerical and logical operators. Patch from Vaibhav Kansagara. * Improve documentation for DATEVALUE, xFILTERS and $filters. indexers: * omindex: + Handle XPS files with multiple FixedDocument parts better. Previously we only extracted text from the first FixedDocument part. + Prefer latter subparts of multipart/alternative which is what RFC2046 (and earlier RFCs which that obsoletes) say, but previously we used the first subpart that we could get text from. + Prefer latter subparts of multipart/alternative when indexing Outlook .msg files too. + Fix obscure bug in --mimetype option. We keep track of the length of the longest extension we have a mapping for, but this was being updated using the length of the MIME type rather than the length of the extension. Theoretically this could have led to us effectively ignoring a --mimetype option, but in the real world the MIME type will probably always be longer so this just results in us testing long extensions unnecessarily. omega: * Ignore DATEVALUE CGI parameter if START.n, etc is specified on the same slot. We explicitly document not to do this, but if that advice is ignored it's more helpful to at least preserve the property that we only have one date range per value slot. * Add flag_ngrams as a preferred new alias for flag_cjk_ngram. In the next release series this feature has been expanded to cover many more languages so the "cjk" in the name has become inaccurate as it stands for "Chinese, Japanese and Korean"). * Fix handling of Outlook .msg containing Unicode. Codepoints <= U+00FF appear to have been handled correctly, but anything higher resulted in individual bytes of the UTF-8 encoding being treated as separate characters. Fixes xapian/xapian#326, reported by uhuntu. portability: * Fix compatibility code for old libmagic versions. The code we were using seems like it would never have worked. Nobody's reported this (it was spotted while looking at the code) so we could just require libmagic >= 4.22, but it's trivial to actually handle so we've fixed the fallback code. * Remove lingering traces of IRIX support as it's been dead for many years.

outlook

2dc7871

ojwb closed this in 6dc668a Apr 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

outlook #326

outlook #326

uhuntu commented Apr 9, 2023 •

edited

ojwb commented Apr 9, 2023

uhuntu commented Apr 10, 2023

ojwb commented Apr 11, 2023

uhuntu commented Apr 11, 2023

ojwb commented Apr 11, 2023

ojwb commented Apr 12, 2023

uhuntu commented Apr 12, 2023

ojwb commented Apr 12, 2023

uhuntu commented Apr 13, 2023

ojwb commented Apr 13, 2023

uhuntu commented Apr 13, 2023

outlook #326

outlook #326

Conversation

uhuntu commented Apr 9, 2023 • edited

ojwb commented Apr 9, 2023

uhuntu commented Apr 10, 2023

ojwb commented Apr 11, 2023

uhuntu commented Apr 11, 2023

ojwb commented Apr 11, 2023

ojwb commented Apr 12, 2023

uhuntu commented Apr 12, 2023

ojwb commented Apr 12, 2023

uhuntu commented Apr 13, 2023

ojwb commented Apr 13, 2023

uhuntu commented Apr 13, 2023

uhuntu commented Apr 9, 2023 •

edited