Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix empty msgid #2063

Closed
wants to merge 12 commits into from
Closed

Fix empty msgid #2063

wants to merge 12 commits into from

Conversation

dsukhin
Copy link

@dsukhin dsukhin commented Jun 8, 2018

When messages come in with an empty or missing Message-Id header, its trivial to generate a unique one from the hash of the RFC822 message body. This is important to prevent duplicate imports via the imap driver if a message is ever looked at twice.

Happy to discuss and answer questions.

@dsukhin
Copy link
Author

dsukhin commented Jun 16, 2018

Additionally, this also fixes an issue where the IMAP gem and the Mail gem are inconsistent in how they parse fields like the message-id. The IMAP gem sometimes leaves whitespace around the message-id leading to different MD5 results when checking for duplicate imports. Adding strip in all the right places fixes this easily.

Happy to answer questions as always!

@martini
Copy link
Collaborator

martini commented Jul 4, 2018

@dsukhin thanks for you PR. Can you explain me, what's happen if there is no message id in the incoming email?

By the way, normally every MTA adds one if there is none.

Thanks for feedback!

@dsukhin
Copy link
Author

dsukhin commented Jul 7, 2018

Hi @martini. Sorry for the slow reply.
Yes, in general I would also expect most messages will have a message id added by the sender or an MTA. However, when I was importing messages, I found plenty of messages without a message-id. By rfc 2822, message id is still optional despite being highly recommended. Since Zammad relies heavily on the message-id (instead of imap' s uidvalidity+uid) to uniquely identify messages, we have to generate a message id to make sure Zammad has an indentifier for the message.

I do that by taking the rfc 822 string of the message (which has a fixed set of headers in a deterministic order) and taking the md5 of this to generate a message hash. I piece that together with a "gen-" to identify which ids are generated by Zammad and the fqdn of the sender to add a little more uniqueness. The code to generate this id is present both in the imap driver (which checks for message duplicates) and the email parser which creates the id for when a message is being added to the database. This prevents duplicate imports even in the case of messages without a message id.

The second fix in this PR is actually for a bug. The imap driver and mail parser strip whitespace differently when parsing headers. Normally this is fine, but in the case of the message id which is run through MD5, an extra space in the front of the id changes the hash completely. Therefore it's important to strip any leading/lagging whitespace from the message id before computing the md5 in both places. Otherwise duplicates may be imported if you run the imap import twice on a message. I can provide an email test case in a bit to show this behavior more clearly. Please let me know of any questions. Happy to answer.

@dsukhin
Copy link
Author

dsukhin commented Jul 14, 2018

This is a test message that gets imported multiple times with the whitespace differences of message id parsing in the Mail gem vs the IMAP gem. The line break on the message id line is fully compliant with the IMAP spec for long lines.

Return-Path: <user.name@zammad.com>
Delivered-To: user@zammad.com
Received: from host.zammad.com
	by host.zammad.com with LMTP id kFSRC5u/I1tJHBAAz53O+w
	for <user@zammad.com>; Fri, 15 Jun 2018 09:31:07 -0400
Return-path: <user.name@zammad.com>
Envelope-to: user@zammad.com
Delivery-date: Fri, 15 Jun 2018 09:31:07 -0400
Received: from mail-do1ram03ok0780.emails.users.zammad.com ([0.0.0.0]:15241 helo=do1ram03ok0780.emails.users.zammad.com)
	by host.zammad.com with esmtps (TLSv1.2:ECDHE-RSA-AES256-SHA384:256)
	(Exim 4.91)
	(envelope-from <user.name@zammad.com>)
	id 1dFool-1004vt-2z
	for user@zammad.com; Fri, 15 Jun 2018 09:31:07 -0400
From: "Name R. John" <user.name@zammad.com>
To: To This Person <user@zammad.com>
Subject: Subject of message
Thread-Topic: Subject of message
Thread-Index: AdQDSn2ZjObO2qLmRLWd8iIbgua/6gBYoT0g
Sender: "Name R. John" <user.name@zammad.com>
Date: Fri, 15 Jun 2018 13:30:23 +0000
Message-ID:
 <DM6PR02MF1425372C1E5187B91210554G127D5@do1ram03ok0780.emails.users.zammad.com>
Accept-Language: en-US
Content-Language: en-US

Message text

@rlue
Copy link
Contributor

rlue commented Aug 16, 2018

@dsukhin, thanks for your contribution and sorry to delay on getting it merged. We're currently in the process of incorporating / refactoring your changes, and I wanted to confirm something with you:

The second fix in this PR is actually for a bug. The imap driver and mail parser strip whitespace differently when parsing headers... it's important to strip any leading/lagging whitespace from the message id before computing the md5 in both places.

In your refactoring, the part that gets MD5-ed is the raw text of the email. In the IMAP driver, that looks like this:

261 msg = @imap.fetch(message_id, 'RFC822')[0].attr['RFC822']
    # ...
270 local_message_id = '<gen-' + Digest::MD5.hexdigest(msg) + '@' + fqdn.strip + '>'

and in the EmailParser, it looks like this:

478 '<gen-' + Digest::MD5.hexdigest(msg) + '@' + fqdn.strip + '>'

In neither of these cases do you strip whitespace from the raw msg before computing the MD5 digest.

Have I misunderstood you, or is there something wrong with this patch?


EDIT

Whoops, it looks like you were talking about fqdn.strip. Is that right?

If so, false alarm. Sorry for the confusion!

@dsukhin
Copy link
Author

dsukhin commented Aug 17, 2018

Hey @rlue,

Yup, that's right - I strip the fqdn (the full domain of the from address) in both places (imap driver and email parser) but the really important place to do this is here in the IMAP driver: local_message_id.strip.

This is for the case (for which I provided the test email above) where the message id is present in the message but has leading whitespace. The NET::IMAP module doesn't strip whitespace on its own the way the Mail gem parser does (see the example message above) so I made sure to add a strip in the imap driver. The Mail gem does strip on its own so it's not strictly necessary.

This is the original commit where I added the fix: 8604bb2 (it's prior to the refactor tho). You are right that I left the strip out in the email_parser during the refactor but only because it's only really needed in the IMAP driver. Might be a good idea to put it back in the email parser file as well just to future proof and make sure everything is consistent.

It may be worth it to see for yourself how the imap driver parses the message-id with whitespace (but you will have to load the test message above into a test imap server). Ultimately, I feel this is a bug with NET::IMAP not Zammad but the .strip is an easy fix on our end.

Let me know if that helps!
David

@dsukhin
Copy link
Author

dsukhin commented Aug 18, 2018

Hey @rlue -- was auditing my server today and found yet another interesting case that the IMAP gem and Mail gem deal with differently...

Message-ID: <5B30C269022FE1D0@zammad.com> (added by postmaster@zammad.com)

the message-id part is only the part in the angle brackets... comments like this are technically allowed by the spec and the Mail gem parser properly returns only <5B30C269022FE1D0@zammad.com> as the message-id while the IMAP gem returns the whole line after the colon including the portion in parenthesis.

Thinking about the design a little more - it seems that the only way to do clean message-id checks would be to use the same parser in both cases. In other words, use the Mail gem parser inside the imap module to extract the message-id. It's either this or using the uidvalidity+uid for unique identification of messages which would require a much larger refactor.

Wanted to share this interesting new finding as we work to integrate this and think about the design so that its properly robust. Let me know if you have any questions! Happy to help.

@martini martini closed this Aug 7, 2019
@martini
Copy link
Collaborator

martini commented Aug 7, 2019

closed by accident - re-open

@martini martini reopened this Aug 7, 2019
@dsukhin dsukhin mentioned this pull request Jan 29, 2020
@thorsteneckel
Copy link
Contributor

Hey @dsukhin - any chance to have some tests for this? Let me know if we can help!

@dsukhin
Copy link
Author

dsukhin commented Jun 23, 2020

Hi @thorsteneckel, sorry for the late reply - missed this message.
The test message above in #2063 (comment) is a test message which demonstrates the whitespace issues.

The only fix needed for that is local_message_id = local_message_id.strip in the IMAP driver.

To test messages without a message id, simply remove the message id line and try to import and make sure it generates the "<gen-#hash#@fqdn|zammad-generated>" id in both the email parser and the IMAP driver.

@thorsteneckel
Copy link
Contributor

Hi @dsukhin - thanks for your feedback. I was thinking about tests in terms of code coverage/test files as part of this contribution. Do you mind adding them? Otherwise please let us know. It just might take longer though.

@mantas mantas assigned mantas and unassigned rlue Sep 7, 2020
@mantas
Copy link
Collaborator

mantas commented Dec 29, 2020

Hi @dsukhin

This part of code has changed quite a bit since the time this PR started. A part of the issue (message-ID on new line) was already fixed. However, parsing emails without message ID at all was still broken. This is now fixed with the commit above.

I did use your core idea to use fingerprint of the body, but refactored it a bit . So commit shows up under my name, but I added credit to commit message. Fingers crossed it works for you.

@martini martini added this to the 3.7.0 milestone Dec 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants