Issue reading content encoded as Windows-1258 #141

Closed
Lepelley opened this issue Oct 1, 2020 · 12 comments

@Lepelley

Lepelley commented Oct 1, 2020

I retrieve mails from the Gmail API using your library. In a few cases (fewer than 3%), it returns only some of the characters on PHP 5.4.16, but returns everything on 7.4.6.

<?php
use ZBateson\MailMimeParser\Message;

$decodedMail = "mime string"; // the raw MIME message text
$mime = Message::from($decodedMail);
echo $mime->getHtmlContent();

Do you have any idea what could cause that difference? We will have to upgrade our PHP version eventually, but I'm not sure I can push my boss to do that yet.

@zbateson
Owner

zbateson commented Oct 2, 2020

Hi @Lepelley

Could you confirm which version of mail-mime-parser you're using? Some old versions had an issue with base64 decoding that caused this when using PHP's built-in decoding, so I switched to my own implementation based on PSR-7 streams with guzzlehttp... I don't think that issue was specific to 5.4, and unfortunately I can't think of anything else that may be causing it.

Otherwise -- if it's not a version issue... it would be very helpful if you could narrow it down to an email and see if a test could be written based on it so we can fix it.

@Lepelley
Author

Lepelley commented Oct 2, 2020

I was using version 1.2.0, but I also tried with 1.2.3. Here is the email I got the error with (some data anonymised):

Deleted

@zbateson
Owner

zbateson commented Oct 2, 2020

Can you confirm it's the base64 encoded image part that the issue happens on?

@Lepelley
Author

Lepelley commented Oct 2, 2020

My problem is that the content of the mail is truncated; I'm not sure if it's because of the base64 image.

@zbateson
Owner

zbateson commented Oct 2, 2020

Hmm, so it could be an issue with quoted-printable... by "the content", do you mean specifically the text part, the html part, or both?

@Lepelley
Author

Lepelley commented Oct 2, 2020

Both

@zbateson
Owner

Hi @Lepelley

Sorry for the delay in looking at this. This is actually happening to me on PHP 7.4.3 as well, but what I've noticed is that it specifies an unusual charset for the content: "windows-1258", which according to Wikipedia is "a code page used in Microsoft Windows to represent Vietnamese texts".

Using your attached example, if I manually update the charsets to iso-8859-1, I'm able to see the entire content for both the text/plain and text/html parts. I'm not sure if this is an issue on my end (or with zbateson/mb-wrapper), with php, or with the incorrect charset specified... any ideas?

@Lepelley
Author

Hello @zbateson,
Well... I have no idea, but it's strange that PHP 7.4.6 returns a good result while 7.4.3 does not.
I'm getting the mail directly from the Gmail API, so the wrong encoding must have been set when the mail was sent.

@zbateson
Owner

I've narrowed it down to an iconv function, so this could be system-specific, potentially down to the version of iconv being used (or an issue in PHP's implementation of the function that calls iconv, lol).

In zbateson/mb-wrapper, I end up calling:

iconv_substr($decodedText, 0, 2037, 'CP1258');

Here $decodedText contains the html or text part after quoted-printable decoding. Unfortunately iconv_substr is only returning 11 characters, and I'm not sure why. iconv seems to convert successfully from CP1258 to UTF-8, and calling iconv_strlen on $decodedText also returns 2037 in this case.
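
To check whether a given system shows the same mismatch, something like the following could be used. This is only a rough sketch: `describeIconvSubstr` and the sample string are hypothetical and not part of mb-wrapper, and the result depends on the local iconv implementation, so no output is claimed here.

```php
<?php
// Hypothetical diagnostic helper (not part of mb-wrapper): compares the
// character count iconv_strlen reports with the character count of what
// iconv_substr actually returns when asked for the whole string.
function describeIconvSubstr($bytes, $charset)
{
    $length = iconv_strlen($bytes, $charset);
    if ($length === false) {
        // The charset is unsupported or the input is invalid for it.
        return array('strlen' => false, 'substr_length' => false);
    }
    $substr = iconv_substr($bytes, 0, $length, $charset);
    return array(
        'strlen' => $length,
        'substr_length' => ($substr === false) ? false : iconv_strlen($substr, $charset),
    );
}

// Assumed sample: "Xin chào" as Windows-1258 bytes (0xE0 is "à" in that code page).
var_dump(describeIconvSubstr("Xin ch\xE0o", 'CP1258'));
```

If 'strlen' and 'substr_length' differ, the system is affected by the same truncation.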

I noticed that converting to UTF-8 and then calling mb_substr seems to work (mb_substr doesn't support these Windows charsets and some others, which is why it's using iconv). Unfortunately that's extra work to get correct results, but I've had to do the same elsewhere anyway.
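
A minimal sketch of that workaround, assuming the iconv and mbstring extensions are both available. `substrViaUtf8` is a hypothetical name used for illustration, not the actual mb-wrapper code, and it returns the result as UTF-8 rather than converting back to the source charset.

```php
<?php
// Hypothetical helper (not mb-wrapper's implementation): take a substring of
// text in a charset mb_substr does not support by converting to UTF-8 first.
function substrViaUtf8($bytes, $charset, $start, $length = null)
{
    // Convert the whole string to UTF-8 with iconv.
    $utf8 = iconv($charset, 'UTF-8', $bytes);
    if ($utf8 === false) {
        return false; // unsupported charset or invalid input
    }
    // mb_substr counts characters, not bytes; the result stays in UTF-8.
    return mb_substr($utf8, $start, $length, 'UTF-8');
}

// Assumed sample: Windows-1258 bytes, first 8 characters returned as UTF-8.
echo substrViaUtf8("Xin ch\xE0o", 'CP1258', 0, 8), "\n";
```

Note that a null $length only means "to the end of the string" from PHP 5.4.8 onward, so on older versions an explicit length would be needed.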

@zbateson
Owner

Oh! I went in to create a test and it seems I was kind of aware of this:

https://github.com/zbateson/mb-wrapper/blob/718f357861735d463afd9ebf38c002b08d06dcea/tests/MbWrapper/MbWrapperTest.php#L156

I have a comment that reads "// seems to fail only on CP1258, returns incorrect number of characters with iconv_substr". Aah well, I guess time to work that out ;)

zbateson changed the title from "Differents results for getHtmlContent() between PHP 5.4.16 and 7.4.6" to "Issue reading content encoded as Windows-1258" on Oct 21, 2020

@zbateson
Owner

This is fixed in zbateson/mb-wrapper 1.0.1. I released a new mail-mime-parser version 1.3.0 which requires that version, but just updating your dependencies in 1.x will also work.

If you get a chance, please have a look and make sure all is well for you now :)

zbateson added the bug label on Oct 21, 2020

@Lepelley
Author

Works perfectly, thank you!
