Email reply text is not always extracted successfully #1096
We use the https://github.com/mailgun/talon library to separate the reply content from the rest of the message when you reply by email to a post/comment.
It has two modes. We currently use only the quick method.
It seems to fail more often when the language is not English, so we could implement the machine learning part of it to improve on that. It's a lot more involved, but it could be fun (and we might even learn something about machine learning along the way)!
I think the machine learning code is only available for signature extraction - that's the conclusion I came to when looking at the talon source code a week ago.
Ah, the regular expressions seem to support some Swedish already: https://github.com/mailgun/talon/blob/a7404afbcb67e66aa13ff8917df4dcbbf3534624/talon/quotations.py#L44
But still, the extraction in the Solikyl group is not reliable...
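To illustrate why language coverage matters for the quick method, here is a minimal sketch of header-based reply extraction. The patterns and the `extract_reply` function are hypothetical and much simpler than talon's actual regular expressions; they only show that a reply in a language without a matching header pattern keeps its quoted text.

```python
import re

# Hypothetical, simplified header patterns; NOT talon's actual expressions.
# Each one matches a full line such as "On <date>, <sender> wrote:".
REPLY_HEADER_PATTERNS = [
    r"^On .+ wrote:[ \t]*$",        # English
    r"^Am .+ schrieb .+:[ \t]*$",   # German
    r"^Den .+ skrev .+:[ \t]*$",    # Swedish
]
REPLY_HEADER_RE = re.compile(
    "|".join(REPLY_HEADER_PATTERNS), re.IGNORECASE | re.MULTILINE
)


def extract_reply(body: str) -> str:
    """Return the text above the first recognized quotation header."""
    match = REPLY_HEADER_RE.search(body)
    if match is None:
        # No known header found: the quoted text stays in the result.
        return body.strip()
    return body[: match.start()].strip()


if __name__ == "__main__":
    swedish = (
        "Tack, vi ses!\n\n"
        "Den mån 22 okt. 2018 kl 10:00 skrev Anna <anna@example.com>:\n"
        "> Kommer du?\n"
    )
    print(extract_reply(swedish))
```

A header in a language that is missing from the pattern list falls through to the `match is None` branch, which is one plausible way for extraction to fail on non-English emails.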
This demo can be used to try out if a quotation would be extracted correctly: http://talon.mailgun.net/
There's also the option to try out other libraries, like this Python one: https://github.com/zapier/email-reply-parser
added a commit
Oct 22, 2018
As a first step, I added a table to the database where we store incoming emails. Previously they were discarded after the conversation message had been created. Once we have gathered some emails, we can compare different parsing libraries or extend existing ones.
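A rough sketch of how the stored emails could be used for such a comparison. It uses the standard library `sqlite3` module as a stand-in for the real database, and the table name `incoming_email`, the `expected_reply` column, and all function names are assumptions for illustration, not the schema from the commit.

```python
import sqlite3
from typing import Callable, Dict, Optional


def create_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) email table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS incoming_email ("
        " id INTEGER PRIMARY KEY,"
        " received_at TEXT DEFAULT CURRENT_TIMESTAMP,"
        " body TEXT NOT NULL,"
        " expected_reply TEXT)"  # filled in by hand when labelling samples
    )
    return conn


def store_email(
    conn: sqlite3.Connection, body: str, expected_reply: Optional[str] = None
) -> int:
    """Keep the raw email body instead of discarding it."""
    cur = conn.execute(
        "INSERT INTO incoming_email (body, expected_reply) VALUES (?, ?)",
        (body, expected_reply),
    )
    conn.commit()
    return cur.lastrowid


def compare_parsers(
    conn: sqlite3.Connection, parsers: Dict[str, Callable[[str], str]]
) -> Dict[str, int]:
    """Count, per parser, how many labelled emails are extracted exactly."""
    rows = conn.execute(
        "SELECT body, expected_reply FROM incoming_email"
        " WHERE expected_reply IS NOT NULL"
    ).fetchall()
    return {
        name: sum(
            1 for body, expected in rows
            if parse(body).strip() == expected.strip()
        )
        for name, parse in parsers.items()
    }
```

Each entry in `parsers` would wrap one candidate library behind a plain `str -> str` function, so talon, email-reply-parser, or an extended version of either could be scored on the same stored samples.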
I looked a bit more into https://github.com/zapier/email-reply-parser, but they seem to support English only: https://github.com/zapier/email-reply-parser/blob/master/email_reply_parser/__init__.py#L42