Improve detection of reply quotations #3

Open
Soreine opened this issue Dec 31, 2019 · 3 comments
Labels
enhancement New feature or request

Comments

@Soreine
Member

Soreine commented Dec 31, 2019

We should improve the existing logic to detect the replied messages. We can use blockquotes as indicators, or common strings like "On Friday, 27 November 2015, Your Tempo <contact@yourtempo.co> wrote".

Here are some useful regexes for such messages in several languages
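To make the idea concrete, here is a minimal sketch of detecting an "On <date>, <sender> wrote" line and cutting the message there. The patterns and the names `REPLY_HEADER_RE`, `find_reply_header` and `strip_quoted_reply` are illustrative assumptions, not taken from any existing library, and the list of languages is far from complete. It only handles headers that fit on one line, and it will produce false positives on ordinary sentences such as "On Monday I wrote".

```python
import re
from typing import List, Optional

# Hypothetical, non-exhaustive patterns for "On <date>, <sender> wrote:" lines.
_PATTERNS = [
    r"On .+ wrote:?",        # English
    r"Le .+ a écrit ?:?",    # French
    r"Am .+ schrieb .+:",    # German
    r"El .+ escribió:?",     # Spanish
]
REPLY_HEADER_RE = re.compile(
    r"^\s*(?:%s)\s*$" % "|".join(_PATTERNS), re.IGNORECASE
)


def find_reply_header(lines: List[str]) -> Optional[int]:
    """Return the index of the first line that looks like a reply header."""
    for index, line in enumerate(lines):
        if REPLY_HEADER_RE.match(line):
            return index
    return None


def strip_quoted_reply(body: str) -> str:
    """Keep only the text that precedes the first reply header, if any."""
    lines = body.splitlines()
    index = find_reply_header(lines)
    if index is None:
        return body
    return "\n".join(lines[:index]).rstrip()
```

A real implementation would also need to join headers that wrap across two lines before matching.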

@Soreine
Member Author

Soreine commented Jan 6, 2020

Why is parsing email bodies hard?

  • Signature identification
  • Various formats for headers
  • On Fri, Nov 19th…
  • On 10/9/2018
  • Headers that wrap across lines
  • From:, To:, Date: style headers
  • Reply chains indicated by > or multiple >>>
  • Some lines look like signatures but aren’t
  • Corrupted email headers
  • Common for plain text emails to split reply headers
  • Multi-language support if required
  • Header formats change over time

Due to this, we suggest not coding your own signature parsing algorithm. It is non-trivial.

Biased source: SigParser, a paid service for email parsing
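As a small illustration of just one item from the list above, reply chains marked by > or multiple >>>, the sketch below measures the quote depth of each line and separates quoted lines from the author's own text. The names `QUOTE_PREFIX_RE`, `quote_depth` and `split_quoted` are made up for this example. It deliberately ignores every other difficulty in the list (signatures, wrapped headers, From:/To: blocks), which is where the real work is.

```python
import re
from typing import List, Tuple

# One or more ">" markers at the start of a line, each optionally followed by
# a space, e.g. ">", ">>>" or "> > ".
QUOTE_PREFIX_RE = re.compile(r"^\s*((?:>\s?)+)")


def quote_depth(line: str) -> int:
    """Return how many ">" markers prefix the line (0 if it is not quoted)."""
    match = QUOTE_PREFIX_RE.match(line)
    return match.group(1).count(">") if match else 0


def split_quoted(body: str) -> Tuple[List[str], List[str]]:
    """Split a plain-text body into (own lines, quoted lines)."""
    own: List[str] = []
    quoted: List[str] = []
    for line in body.splitlines():
        (quoted if quote_depth(line) > 0 else own).append(line)
    return own, quoted
```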

Existing libraries

I have found https://github.com/mailgun/talon (in Python) which is interesting for its quotation detection for Text and HTML, and its basic text signature detection (forget about the signature detection with machine learning). They also have a lot of real-world fixtures, which is invaluable.

There is a JS port of it, made by people from Front, who I believe are great engineers: https://github.com/quentez/talonjs/. The repo is not documented, but it is recent and maintained.

There is also another port https://github.com/lever/planer which is older and seems less complete.

Both planer and talonjs require a DOM implementation to work (xmldom or jsdom for example). talonjs also uses cheerio to clean up the input document a bit.

@Soreine
Member Author

Soreine commented Jan 6, 2020

For information, below is the algorithm used by Talon for HTML messages:

# Extract actual message from provided html message body
# using tags and plain text algorithm.
#
# Cut out the 'blockquote', 'gmail_quote' tags.
# Cut out Microsoft (Outlook, Windows mail) quotations.
#
# Then use plain text algorithm to cut out splitter or
# leftover quotation.
# This works by adding checkpoint text to all html tags,
# then converting html to text,
# then extracting quotations from text,
# then checking deleted checkpoints,
# then deleting necessary tags.
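
The first step described above (cutting out 'blockquote' and 'gmail_quote' elements) can be sketched with the Python standard library alone. This is not Talon's code: the `QuoteStripper` class and `strip_html_quotations` function are hypothetical names, the Outlook rules and the checkpoint mechanism are not covered, and the depth counter assumes that non-void tags inside a quotation are properly closed. Comments and doctype declarations are dropped from the output.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class QuoteStripper(HTMLParser):
    """Re-emit the HTML it is fed, minus quotation elements and their content."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.output = []
        self.skip_depth = 0  # > 0 while inside a quotation element

    @staticmethod
    def _is_quote(tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return tag == "blockquote" or "gmail_quote" in classes

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if not self.skip_depth:
                self.output.append(self.get_starttag_text())
            return
        if self.skip_depth or self._is_quote(tag, attrs):
            self.skip_depth += 1  # entering, or nesting deeper into, a quotation
            return
        self.output.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.output.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.output.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.output.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.output.append(f"&#{name};")


def strip_html_quotations(html: str) -> str:
    """Return the HTML with blockquote / gmail_quote elements removed."""
    parser = QuoteStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.output)
```

The remaining steps, running the plain-text algorithm on the converted text and mapping the result back to tags through checkpoints, are what make the real implementation substantially more involved.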

@Soreine Soreine mentioned this issue Jan 6, 2020
@Soreine Soreine added the enhancement New feature or request label Jan 9, 2020
@Soreine Soreine changed the title from "Improve detection of replied messages to hide them" to "Improve detection of reply quotations" Jan 12, 2020
@Soreine
Member Author

Soreine commented Jan 12, 2020

Things we could take from Mailspring:

Things we could take from TalonJS:

  • International detection of "On date, somebody wrote:" lines
