Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

outlook #326

Closed
wants to merge 1 commit into from
Closed

outlook #326

wants to merge 1 commit into from

Conversation

uhuntu
Copy link

@uhuntu uhuntu commented Apr 9, 2023

When I was trying to translate some Chinese letters, seems with encoding, it will come out wrong characters.

@ojwb
Copy link
Contributor

ojwb commented Apr 9, 2023

Removing escaping here is not the right thing to do - it will result in text emails containing characters such as <, > and & getting mangled.

What problem are you actually trying to solve with this change?

@uhuntu
Copy link
Author

uhuntu commented Apr 10, 2023

When I was trying to translate some Chinese letters, seems with encoding, it will come out wrong characters.

@ojwb
Copy link
Contributor

ojwb commented Apr 11, 2023

I tried with some sample .msg files and I think I can reproduce:

./outlookmsg2html tst_unicode.msg
Use of uninitialized value in string eq at /usr/share/perl5/Email/Outlook/Message/AddressInfo.pm line 76.
<head>
<title>Testcase</title>
<meta name="author" content="">
<meta name="created" content="2019-10-21T11:29:02">
</head>
<pre>-/-
Char-&Atilde;&yen;-Char
-/-
Char-&Atilde;&#133;-Char
-/-
Char-&Atilde;&cedil;-Char
-/-
Char-&Atilde;&#152;-Char
-/-
Char-&Atilde;&brvbar;-Char
-/-
Char-&Atilde;&#134;-Char
 
</pre>

Looks like the escaping is treating the string as ISO-8859-1 rather than UTF-8. I think the appropriate fix is something along the lines of this, which solves the problem for my example at least:

diff --git a/xapian-applications/omega/outlookmsg2html.in b/xapian-applications/omega/outlookmsg2html.in
index 86c1a430b648..d96f318b192c 100644
--- a/xapian-applications/omega/outlookmsg2html.in
+++ b/xapian-applications/omega/outlookmsg2html.in
@@ -24,6 +24,7 @@
 # IN THE SOFTWARE.
 
 use strict;
+use Encode 'decode';
 eval {
     require Email::Outlook::Message;
     require HTML::Entities;
@@ -62,7 +63,7 @@ handle_mimepart($mime);
 
 sub do_header {
     my ($msg, $header) = @_;
-    my $s = $msg->header_str($header);
+    my $s = decode('utf-8', $msg->header_str($header));
     return HTML::Entities::encode_entities($s);
 }
 
@@ -85,7 +86,8 @@ sub handle_mimepart {
 	}
     } elsif ($type eq 'text') {
 	if ($sub eq 'plain') {
-	    print "<pre>", HTML::Entities::encode_entities($e->body), "</pre>\n";
+	    my $body = decode('utf-8', $e->body);
+	    print "<pre>", HTML::Entities::encode_entities($body), "</pre>\n";
 	    return 1;
 	} elsif ($sub eq 'html') {
 	    my $m = $e->body;

Does that solve it for you?

@uhuntu
Copy link
Author

uhuntu commented Apr 11, 2023

Unfortunately no.

@ojwb
Copy link
Contributor

ojwb commented Apr 11, 2023

@uhuntu Can you provide a testcase?

@ojwb
Copy link
Contributor

ojwb commented Apr 12, 2023

I found some more examples I already had including one with Chinese characters, and this patch seems to solve Unicode-related issues for everything I have to hand:

diff --git a/xapian-applications/omega/outlookmsg2html.in b/xapian-applications/omega/outlookmsg2html.in
index 86c1a430b648..85224ec2b633 100644
--- a/xapian-applications/omega/outlookmsg2html.in
+++ b/xapian-applications/omega/outlookmsg2html.in
@@ -85,10 +85,10 @@ sub handle_mimepart {
 	}
     } elsif ($type eq 'text') {
 	if ($sub eq 'plain') {
-	    print "<pre>", HTML::Entities::encode_entities($e->body), "</pre>\n";
+	    print "<pre>", HTML::Entities::encode_entities($e->body_str), "</pre>\n";
 	    return 1;
 	} elsif ($sub eq 'html') {
-	    my $m = $e->body;
+	    my $m = $e->body_str;
 	    $m =~ s!</?body[^>]*>!\n!gi;
 	    print $m, "\n";
 	    return 1;

I'm not sure about the second change as none of the example .msg files I have seem to have Unicode in HTML parts, but it seems logical that we want to use body_str there too.

If that doesn't help for your situation then I really am going to need a reproducer.

@uhuntu
Copy link
Author

uhuntu commented Apr 12, 2023

@ojwb Hi Olly, I leave you a message by your email, there is a test case I can give.

@ojwb
Copy link
Contributor

ojwb commented Apr 12, 2023

Thanks - changing to body_str seems to fix your testcase.

There are 4 warning messages still:

Value for 'Subject' header with wide characters at /usr/share/perl5/Email/Outlook/Message.pm line 320.
Value for 'From' header with wide characters at /usr/share/perl5/Email/Outlook/Message.pm line 320.
Value for 'To' header with wide characters at /usr/share/perl5/Email/Outlook/Message.pm line 320.
Value for 'Cc' header with wide characters at /usr/share/perl5/Email/Outlook/Message.pm line 320.

That looks like a bug in Email::Outlook - I think it should call header_str_set rather than header_set there, or maybe check whether the string to set it Unicode or raw bytes and call different methods depending. I'll see if I can submit a patch for that upstream, but it doesn't seem to cause problems with the output of our script at least.

@ojwb ojwb closed this in 6dc668a Apr 12, 2023
@uhuntu
Copy link
Author

uhuntu commented Apr 13, 2023

@ojwb Hello Olly, sorry I can not agree to close this issue, I pulled the latest master with your commit, but seems the issue did not gone. In my case, I just use /usr/local/lib/xapian-omega/bin/outlookmsg2html that living in my machine to determine the issue.

For example, that email I sent you outlookmsg2html 'Redmin - RFQ Catalog少 B8 這個欄位 .msg', I just check the message output, if just apply your patch, seems issue still exist.

@ojwb
Copy link
Contributor

ojwb commented Apr 13, 2023

With the commit before the fix (0029277) I get:

<head>
<title>&lt;Redmin - RFQ / Catalog&#x5C11; B8 &#x9019;&#x500B;&#x6B04;&#x4F4D;&gt; </title>
<meta name="author" content="&quot;Judy.Lin &#x6797;&#x6DD1;&#x8C9E;&quot; &lt;Judy.Lin@tes-tec.com&gt;">
<meta name="created" content="2023-04-10T02:40:52">
</head>
<pre>Dear Hunt:

 

Redmin -  RFQ / Catalog : &aring;&deg;&#145; B8 &eacute;&#128;&#153;&aring;&#128;&#139;&aelig;&not;&#132;&auml;&frac12;&#141;, &egrave;&laquo;&#139;&aelig;&#130;&uml;&aring;&sup1;&laquo;&aring;&iquest;&#153;&aring;&cent;&#158;&aring;&#138;&nbsp;, &egrave;&not;&#157;&egrave;&not;&#157;.

 



 



 

 

 

Best regards!

Judy.Lin 

02-77228751 (&aring;&deg;&#136;&ccedil;&middot;&#154;)

</pre>

With the fix (6dc668a) I get:

<head>
<title>&lt;Redmin - RFQ / Catalog&#x5C11; B8 &#x9019;&#x500B;&#x6B04;&#x4F4D;&gt; </title>
<meta name="author" content="&quot;Judy.Lin &#x6797;&#x6DD1;&#x8C9E;&quot; &lt;Judy.Lin@tes-tec.com&gt;">
<meta name="created" content="2023-04-10T02:40:52">
</head>
<pre>Dear Hunt:

 

Redmin -  RFQ / Catalog : &#x5C11; B8 &#x9019;&#x500B;&#x6B04;&#x4F4D;, &#x8ACB;&#x60A8;&#x5E6B;&#x5FD9;&#x589E;&#x52A0;, &#x8B1D;&#x8B1D;.

 



 



 

 

 

Best regards!

Judy.Lin 

02-77228751 (&#x5C08;&#x7DDA;)

</pre>

I'm not fluent in Chinese, but decoding the HTML from the second one and pasting it into google translate strongly suggested it has decoded correctly.

If the second output is not correct, you'll need to pin-point exactly which part is still wrong.

If you're getting different output with the fixed outlookmsg2html to me then I'd suggest trying with the latest Email::Outlook Perl module - e.g. 0.920 and 0.921 both include Unicode-related fixes. I'm using 0.921 (as packaged for Debian unstable).

@uhuntu
Copy link
Author

uhuntu commented Apr 13, 2023

Thanks for explain, it works as expect.

ojwb added a commit that referenced this pull request Apr 13, 2023
Codepoints <= U+00FF seemed to be OK, but anything higher resulted
in individual bytes of the UTF-8 encoding being treated as separate
characters.

Fixes #326, reported by uhuntu.

(cherry picked from commit 6dc668a)
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Jul 10, 2023
documentation:

* Improve documentation for OmegaScript numerical and logical operators.  Patch
  from Vaibhav Kansagara.

* Improve documentation for DATEVALUE, xFILTERS and $filters.

indexers:

* omindex:

  + Handle XPS files with multiple FixedDocument parts better.  Previously we
    only extracted text from the first FixedDocument part.

  + Prefer latter subparts of multipart/alternative which is what RFC2046 (and
    earlier RFCs which that obsoletes) say, but previously we used the first
    subpart that we could get text from.

  + Prefer latter subparts of multipart/alternative when indexing Outlook
    .msg files too.

  + Fix obscure bug in --mimetype option.  We keep track of the length of the
    longest extension we have a mapping for, but this was being updated using
    the length of the MIME type rather than the length of the extension.
    Theoretically this could have led to us effectively ignoring a --mimetype
    option, but in the real world the MIME type will probably always be longer
    so this just results in us testing long extensions unnecessarily.

omega:

* Ignore DATEVALUE CGI parameter if START.n, etc is specified on the same
  slot.  We explicitly document not to do this, but if that advice is ignored
  it's more helpful to at least preserve the property that we only have
  one date range per value slot.

* Add flag_ngrams as a preferred new alias for flag_cjk_ngram.  In the next
  release series this feature has been expanded to cover many more languages
  so the "cjk" in the name has become inaccurate as it stands for
  "Chinese, Japanese and Korean").

* Fix handling of Outlook .msg containing Unicode.  Codepoints <= U+00FF appear
  to have been handled correctly, but anything higher resulted in individual
  bytes of the UTF-8 encoding being treated as separate characters.

  Fixes xapian/xapian#326, reported by uhuntu.

portability:

* Fix compatibility code for old libmagic versions.  The code we were using
  seems like it would never have worked.  Nobody's reported this (it was
  spotted while looking at the code) so we could just require libmagic >= 4.22,
  but it's trivial to actually handle so we've fixed the fallback code.

* Remove lingering traces of IRIX support as it's been dead for many years.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants