Email reply text is not always extracted successfully #1096

Open
nicksellen opened this Issue Sep 10, 2018 · 6 comments


nicksellen commented Sep 10, 2018

We use the talon library (https://github.com/mailgun/talon) to separate the reply content from the rest of the email when you reply by email to a post/comment.

It has two modes and says:

"Quick and works like a charm 90% of the time. For other 10% you can use the power of machine learning algorithms"

We currently just use the quick method.

It seems to fail more often when the language is not English, so we could implement the machine learning part to improve on that. It's a lot more involved, but it could be fun (and we might even learn something about machine learning along the way)!

nicksellen added the backend label Sep 10, 2018

tiltec commented Sep 12, 2018

I think the machine learning code is only available for signature extraction. That's the conclusion I came to when looking at the talon source code a week ago.
The quotation extraction is always done via regular expressions, with special cases for some languages. I think we could simply add some more of these special cases for the languages we have in karrot, e.g. Swedish.
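To make the idea of per-language special cases concrete, here is a minimal sketch of regex-based quotation stripping. It is not talon's code: the pattern table, the name `extract_reply`, and the three example phrasings are assumptions for illustration only, and real mail clients produce many more variants (for example quote headers wrapped across two lines, which this sketch would miss).

```python
import re

# Hypothetical per-language patterns for the "On <date>, <someone> wrote:" line
# that mail clients put above quoted text. Illustrative only, not talon's list.
QUOTE_HEADER_PATTERNS = {
    "en": r"On\s.+\swrote:",
    "de": r"Am\s.+\sschrieb\s.+:",
    "sv": r"Den\s.+\sskrev\s.+:",
}

# One combined regex that matches a whole line consisting of a quote header.
QUOTE_HEADER_RE = re.compile(
    r"^\s*(?:" + "|".join(QUOTE_HEADER_PATTERNS.values()) + r")\s*$",
    re.IGNORECASE,
)


def extract_reply(body: str) -> str:
    """Return the text above the first quote header or '>'-quoted line."""
    kept = []
    for line in body.splitlines():
        if QUOTE_HEADER_RE.match(line) or line.lstrip().startswith(">"):
            break  # everything from here on is treated as quoted text
        kept.append(line)
    return "\n".join(kept).strip()


email_sv = (
    "Hej!\n"
    "Jag kommer imorgon.\n"
    "\n"
    "Den mån 10 sep. 2018 kl 12:00 skrev Anna <anna@example.com>:\n"
    "> Kan du hämta maten?\n"
)
print(extract_reply(email_sv))
```

Supporting another language in a scheme like this would mean adding one entry to the table, which is roughly the "add more special cases" route suggested above.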

tiltec commented Sep 13, 2018

Ah, the regular expressions seem to support some Swedish already: https://github.com/mailgun/talon/blob/a7404afbcb67e66aa13ff8917df4dcbbf3534624/talon/quotations.py#L44

But still, the extraction in the Solikyl group is not reliable...

This demo can be used to check whether a quotation would be extracted correctly: http://talon.mailgun.net/

tiltec commented Sep 13, 2018

tiltec added a commit to yunity/karrot-backend that referenced this issue Oct 22, 2018

keep incoming emails in database
should aid debugging and testing of the reply parser

Related to yunity/karrot-frontend#1096

tiltec commented Oct 22, 2018

Discourse seems to have comprehensive support for handling incoming emails. They have their own receiver code and maintain their own reply parser that handles multiple languages.

As a first step, I added a table to the database where we store incoming emails. Previously they were discarded once the conversation message had been created. Once we have gathered some emails, we can compare different parsing libraries or extend existing ones.
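The comparison could look roughly like the sketch below: run each candidate parser over recorded emails and count how often it returns exactly the expected reply. Everything here is an assumption for illustration: `compare_parsers`, the two stand-in parsers, and the hand-written samples are hypothetical, and real samples would come from the stored emails with the expected reply labelled by hand.

```python
from typing import Callable, Dict, List, Tuple

Parser = Callable[[str], str]


def compare_parsers(
    parsers: Dict[str, Parser],
    samples: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Return, per parser name, the fraction of samples extracted exactly."""
    scores = {}
    for name, parse in parsers.items():
        hits = sum(
            1 for raw, expected in samples
            if parse(raw).strip() == expected.strip()
        )
        scores[name] = hits / len(samples) if samples else 0.0
    return scores


# Stand-in parsers (hypothetical); real candidates would wrap actual libraries.
def keep_everything(body: str) -> str:
    return body


def cut_at_first_quoted_line(body: str) -> str:
    kept = []
    for line in body.splitlines():
        if line.startswith(">"):
            break
        kept.append(line)
    return "\n".join(kept)


# (raw email body, expected reply) pairs; made up for this example.
samples = [
    ("Thanks!\n> original message", "Thanks!"),
    ("No quote here", "No quote here"),
]

scores = compare_parsers(
    {"keep_everything": keep_everything, "cut_at_quote": cut_at_first_quoted_line},
    samples,
)
print(scores)
```

Exact string matching is a strict measure. A looser one (for example, ignoring trailing whitespace per line) might be fairer once real emails are involved.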

I looked a bit more into https://github.com/zapier/email-reply-parser, but they seem to support English only: https://github.com/zapier/email-reply-parser/blob/master/email_reply_parser/__init__.py#L42

nicksellen commented Oct 22, 2018

The line you linked looks like it only refers to headers in the metadata, which would be in English because of the protocol.

This stuff seems to be at the boundaries of human knowledge!

tiltec commented Oct 23, 2018

Ah right! But I didn't find any other languages in their project, and the test cases are all in English too. I'm happy to try it out on the recorded emails though :)

Talon seems to be the best solution for Python. Or we could set up a Ruby microservice for email trimming :D
