outlook #326
Removing escaping here is not the right thing to do - it will result in text emails containing characters such as [...] What problem are you actually trying to solve with this change?
When I tried to translate some Chinese characters, something seems to go wrong with the encoding and the wrong characters come out.
I tried with some sample .msg files and I think I can reproduce:
Looks like the escaping is treating the string as ISO-8859-1 rather than UTF-8. I think the appropriate fix is something along the lines of this, which solves the problem for my example at least:
Does that solve it for you?
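For readers unfamiliar with this failure mode, here is a minimal Python sketch of what happens when UTF-8 bytes are escaped as though they were ISO-8859-1. It is not the Perl change discussed in this thread; the function names and the sample string are made up for illustration.

```python
import html


def escape_assuming_latin1(raw: bytes) -> str:
    """Buggy path: every byte is treated as one ISO-8859-1 character."""
    return html.escape(raw.decode("iso-8859-1"))


def escape_assuming_utf8(raw: bytes) -> str:
    """Intended path: decode the bytes as UTF-8 first, then escape."""
    return html.escape(raw.decode("utf-8"))


# Hypothetical sample: two Chinese characters followed by markup-like text.
raw = "中文 <b>".encode("utf-8")

wrong = escape_assuming_latin1(raw)
right = escape_assuming_utf8(raw)

print(repr(wrong))  # each UTF-8 byte shows up as its own character
print(repr(right))  # the two Chinese characters survive, markup is escaped
```

In both cases the markup characters are escaped identically; only the non-ASCII text differs, which is why the problem is easy to miss when testing with ASCII-only messages.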
Unfortunately no.
@uhuntu Can you provide a testcase?
I found some more examples I already had including one with Chinese characters, and this patch seems to solve Unicode-related issues for everything I have to hand:
I'm not sure about the second change as none of the example [...] If that doesn't help for your situation then I really am going to need a reproducer.
@ojwb Hi Olly, I left you a message by email; there is a test case I can give you.
Thanks - changing to [...] There are 4 warning messages still:
That looks like a bug in Email::Outlook - I think it should call [...]
@ojwb Hello Olly, sorry, I can't agree to closing this issue. I pulled the latest master with your commit, but the issue does not seem to have gone away. In my case, I just use [...] For example, that email I sent you [...]
With the commit before the fix (0029277) I get:
With the fix (6dc668a) I get:
I'm not fluent in Chinese, but decoding the HTML from the second one and pasting it into Google Translate strongly suggested it has decoded correctly. If the second output is not correct, you'll need to pinpoint exactly which part is still wrong. If you're getting different output with the fixed [...]
Thanks for explaining, it works as expected.
documentation:

* Improve documentation for OmegaScript numerical and logical operators. Patch from Vaibhav Kansagara.

* Improve documentation for DATEVALUE, xFILTERS and $filters.

indexers:

* omindex:

  + Handle XPS files with multiple FixedDocument parts better. Previously we only extracted text from the first FixedDocument part.

  + Prefer latter subparts of multipart/alternative which is what RFC2046 (and earlier RFCs which that obsoletes) say, but previously we used the first subpart that we could get text from.

  + Prefer latter subparts of multipart/alternative when indexing Outlook .msg files too.

  + Fix obscure bug in --mimetype option. We keep track of the length of the longest extension we have a mapping for, but this was being updated using the length of the MIME type rather than the length of the extension. Theoretically this could have led to us effectively ignoring a --mimetype option, but in the real world the MIME type will probably always be longer so this just results in us testing long extensions unnecessarily.

omega:

* Ignore DATEVALUE CGI parameter if START.n, etc is specified on the same slot. We explicitly document not to do this, but if that advice is ignored it's more helpful to at least preserve the property that we only have one date range per value slot.

* Add flag_ngrams as a preferred new alias for flag_cjk_ngram. In the next release series this feature has been expanded to cover many more languages so the "cjk" in the name has become inaccurate (it stands for "Chinese, Japanese and Korean").

* Fix handling of Outlook .msg containing Unicode. Codepoints <= U+00FF appear to have been handled correctly, but anything higher resulted in individual bytes of the UTF-8 encoding being treated as separate characters. Fixes xapian/xapian#326, reported by uhuntu.

portability:

* Fix compatibility code for old libmagic versions. The code we were using seems like it would never have worked. Nobody's reported this (it was spotted while looking at the code) so we could just require libmagic >= 4.22, but it's trivial to actually handle so we've fixed the fallback code.

* Remove lingering traces of IRIX support as it's been dead for many years.
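The multipart/alternative entry in the notes above can be illustrated with a small Python sketch using the standard `email` module. This is not omindex's implementation: the `HANDLED` set, the `pick_alternative` helper and the sample message are assumptions made for illustration. RFC 2046 orders alternatives from least to most faithful, so the sketch walks the subparts in reverse and takes the first one it is able to handle.

```python
from email import message_from_string

# Hypothetical two-part message: plain text first, HTML last.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="sep"

--sep
Content-Type: text/plain

plain version
--sep
Content-Type: text/html

<p>html version</p>
--sep--
"""

# Assumed set of types this sketch knows how to extract text from.
HANDLED = {"text/plain", "text/html"}


def pick_alternative(msg):
    """Return the last subpart with a handled type, or None if there is none."""
    for part in reversed(msg.get_payload()):
        if part.get_content_type() in HANDLED:
            return part
    return None


chosen = pick_alternative(message_from_string(RAW))
print(chosen.get_content_type())
```

Taking the first usable subpart instead would pick the plain-text part here, which corresponds to the older behaviour the notes describe.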