Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MailHeader::get_value is adding spaces where it should not #19

Closed
gagath opened this issue Apr 2, 2018 · 6 comments
Closed

MailHeader::get_value is adding spaces where it should not #19

gagath opened this issue Apr 2, 2018 · 6 comments

Comments

@gagath
Copy link
Contributor

gagath commented Apr 2, 2018

Hello,

The comment over the get_value function says that the parser should get rid of the extra whitespace introduced by MIME for multiline headers. However I think there is a problem in the implementation. Parsing around 2000 mails with this nice library made me find this problem when using multiline UTF-8 subject:

extern crate mailparse;

use mailparse::parse_header;


fn main() {
    let raw = b"Subject: =?utf-8?q?this_is_?=\n\t=?utf-8?q?a_test?=";
    let (parsed, _) = parse_header(raw).unwrap();

    let key = parsed.get_key().unwrap();
    let value = parsed.get_value().unwrap();

    println!("{}: {}", key, value);

    assert_eq!(key, "Subject");
    assert_eq!(value, "this is a test");
}

And the output is:

Subject: this is  a test
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `"this is  a test"`,
 right: `"this is a test"`', src/main.rs:16:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.

An extra whitespace is added, where it should not.

I can propose patch if you indicate me what to change. I am not very familiar with the RFC but I think this is a bug because a lot of mails I parse have this extra-whitespace problem in subjects.

@staktrace
Copy link
Owner

Hm, it looks like in your example, there is a space before the newline, and a tab after the newline. So we keep the space from the newline, and collapse the newline/tab into another space. This results in two spaces. So the behaviour matches what the get_value documentation says (it only talks about whitespace following newlines).

That being said, it's possible that this behaviour is wrong and that it should be collapsing all the whitespace surrounding the newline instead of just the whitespace following the newline. I'll take another look at the RFC.

@staktrace
Copy link
Owner

https://tools.ietf.org/html/rfc822#section-3.1.1 says:

Unfolding is accomplished by regarding CRLF immediately followed by a LWSP-char as equivalent to the LWSP-char.

@gagath
Copy link
Contributor Author

gagath commented Apr 4, 2018

OK so I think we should add a state to the state machine in order to not add a space if the character followed by the (CR) LF is the start of an LSPW-char. This would fix the issue and change well be small (+ add test for it).

I can try a PR.

@staktrace
Copy link
Owner

@MicroJoe Any progress with the PR?

@gagath
Copy link
Contributor Author

gagath commented Apr 17, 2018

Hey, I was waiting to know if the proposed fix would be fine. I will reserve some time in order to make a PR draft then.

@staktrace
Copy link
Owner

Fixed by #21.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants