How to use IMAP function 📧 #15

Closed
krillin666 opened this issue Oct 18, 2021 · 15 comments

Comments

@krillin666

Hello,

Thanks for this amazing extension of HPI, which I just discovered. I was trying to set this up with Promnesia for my emails, but nothing is being indexed:

[INFO    2021-10-18 23:17:26 promnesia extract.py:49] extracting via promnesia_sean.sources.imap:index () {} ... ...
[INFO    2021-10-18 23:17:26 promnesia extract.py:82] extracting via promnesia_sean.sources.imap:index () {} ...: got 0 visits

I am using this Thunderbird add-on, and I've tried exporting with each of the following options:

  • Export whole folder
  • Export whole folder with structure
  • Export all emails in EML format
  • Export all email in TXT format

All of these exports are in my .local/share/mail path. I even wrote the path literally in the __init__.py file instead of using path.join, but nothing works.

Thank you so much !

@seanbreckenridge
Owner

seanbreckenridge commented Oct 19, 2021

Hmm -- like most modules in HPI, this supports any type of path (absolute, a Path object, a string, something like ~/.local/share/mail); mine looks like this. That shouldn't matter much, since the imap module drills down into any folder you give it and searches everything, unless your mail is in hidden folders. Just to compare though, my folder looks like this:

$ pwd
/home/sean/.local/share/mail
$ ls -1
seanbrecke@gmail.com

The part which determines which files are used is a recursive glob, so it should search every folder listed in your configuration and try every file. You could also try the following to confirm whether it's matching anything:

$ python3
Python 3.9.7 (default, Aug 31 2021, 13:28:12)
[GCC 11.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import my.imap
>>> list(my.imap.mailboxes())
>>> list(my.imap.files())

Those should tell you what it has computed as the target files.
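
To make the "recursive glob" idea concrete, here is a minimal sketch of how such a file search could work. It is not the actual my.imap code: the function name candidate_mail_files is hypothetical, and the rule of skipping dot-files and dot-folders is an assumption based on the remark above about hidden folders.

```python
from pathlib import Path
from typing import Iterator


def candidate_mail_files(root: Path) -> Iterator[Path]:
    """Yield every regular file below root, skipping hidden files and folders."""
    root = root.expanduser()
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Ignore anything that lives inside a dot-folder, or is a dot-file itself
        relative_parts = path.relative_to(root).parts
        if any(part.startswith(".") for part in relative_parts):
            continue
        yield path
```

Calling list(candidate_mail_files(Path("~/.local/share/mail"))) on your own mail folder should give a rough idea of which files a search like this would pick up.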

I'm not sure what format the Thunderbird export tool uses, but it seems to be EML, which I don't think is a raw email file, as far as I understand. I'm not totally sure; email formats have always been confusing to me. I do know you can sync over IMAP with Thunderbird, but I'm not sure if it stores all your mail locally -- I can install it later today to see if I can figure that out.

As a visual comparison, here is what one of my locally synced IMAP files looks like

If you have something similar to that, pointing it at the top folder which contains all of those files should work.

@seanbreckenridge
Owner

I tested out the add-on, and I think I've got it to work. Using the 'Plain Text Format' option, export your mail to a folder somewhere.

It takes a while to do so.

I just put that in ~/Downloads/mailexport for this demo.

In my config, I put:

# locally synced IMAP mailboxes using mbsync
class imap:
    # path[s]/glob to the mailboxes/IMAP files
    mailboxes = "~/Downloads/mailexport/"

And then:

>>> import my.imap
>>> next(my.imap.mail())._serialize()
{'filepath': PosixPath('/home/sean/Downloads/mailexport/Inbox_20211019-0046/messages/20211011-Re_[ActivityWatch_aw-watcher-window] Update macOS window title logic (#49)-13430.txt'), 'bcc': [], ...

If the imap.mail function works, promnesia should work fine, since it's a thin wrapper around that function which extracts info.

This does mean you'd have to periodically do an export, but there isn't a great way around that with Thunderbird. For context, I use mutt-wizard, which uses mbsync under the hood, so my mail gets synced with a local folder once every 5 minutes.

@seanbreckenridge
Owner

Ah - there may also be some issue with the different date format that the Thunderbird add-on uses; looking into that.

@seanbreckenridge
Owner

seanbreckenridge commented Oct 19, 2021

Alright yeah -- the dates in the emails that the Thunderbird add-on created weren't RFC 2822 compliant, so I created a wrapper that parses them manually when the standard parser can't: c4d87b7

Also updated the promnesia module, so you may have to git pull/reinstall that, in addition to this repo

Added dateparser to the deps, so pip install dateparser
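
The "try the standard parser, then fall back" approach described above can be sketched roughly as follows. This is an illustration only, not the code from the commit: the real wrapper uses dateparser, whereas this sketch sticks to the stdlib and assumes a single day-first fallback format such as 06/10/2021, 09:28. The names parse_mail_date and FALLBACK_FORMAT are hypothetical.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional

# Assumption: the export writes day-first dates such as "06/10/2021, 09:28"
FALLBACK_FORMAT = "%d/%m/%Y, %H:%M"


def parse_mail_date(raw: str) -> Optional[datetime]:
    """Parse an RFC 2822 date, falling back to the assumed export format."""
    try:
        return parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError
        pass
    try:
        return datetime.strptime(raw.strip(), FALLBACK_FORMAT)
    except ValueError:
        # Neither format matched; let the caller decide what to do
        return None
```

Note that the fallback returns a naive datetime (no timezone), since the exported header carries no offset.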

Using the add-on export, I now get visits from promnesia:

$ hpi query promnesia_sean.sources.imap | jq length
105

If mail was parsing for you before, it may have actually been this line causing the issues: since the mail objects didn't have any datetimes, promnesia would ignore them. Hopefully that's fixed.

@seanbreckenridge
Owner

seanbreckenridge commented Oct 19, 2021

Ah, I also just remembered, since it takes about 30 minutes to run on my machine, I cache this once per month, so it picks up new URLs periodically. If you want me to make that configurable, let me know
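
As a rough illustration of what "cache this once per month" can mean, here is a sketch where the cache key is derived from the current year and month, so the expensive computation reruns only when the month changes. This is not how the module configures cachew; the in-memory dict stands in for the sqlite cache file, and monthly_key and cached_index are hypothetical names.

```python
from datetime import date
from typing import Callable, Dict, List

# Stand-in for the on-disk cache: maps a "YYYY-MM" key to the cached result
_cache: Dict[str, List[str]] = {}


def monthly_key(today: date) -> str:
    """Return a key that only changes when the calendar month changes."""
    return f"{today.year:04d}-{today.month:02d}"


def cached_index(compute: Callable[[], List[str]], today: date) -> List[str]:
    """Return the cached result for this month, recomputing on a new month."""
    key = monthly_key(today)
    if key not in _cache:
        _cache.clear()  # drop the previous month's result
        _cache[key] = compute()
    return _cache[key]
```

With a scheme like this, results stay stale until the key changes, which is why deleting the cache file by hand is needed while testing.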

So if it still doesn't seem to be working for you, you may have to delete the sqlite cachew file between runs when testing whether promnesia is working. That'd be in ~/.cache/my or ~/.cache/cachew. For me, that's:

rm ~/.cache/cachew/promnesia_sean.sources.imap:index

To figure out where that is, run:

python3 -c 'from my.core.cachew import cache_dir; print(cache_dir())'

Let me know if you have any other issues, hopefully this isn't all too confusing

@krillin666
Author

krillin666 commented Oct 19, 2021

Wow that was fast 😅. First of all, let me thank you for fixing this and showing the appropriate steps.

I've now been able to index one of my inboxes to test the display in Promnesia.
However, the result was not what I was expecting, and maybe you can improve it (I can try too, but I'm not a coder).

Before using your IMAP source I was using the plaintext source from Promnesia and it was at least displaying surrounding text of the email body next to the URL. With your source I just get the file name.

Nevertheless, yours has the advantage of having the email date! What I was trying to accomplish before (and thought that yours implemented) was to display the surrounding text but also the Date, From, Subject, To fields in the promnesia plugin.

I have two more comments:

  1. When I installed your Promnesia package, the sources folder was not installed, and I had to run svn checkout https://github.com/seanbreckenridge/promnesia/trunk/promnesia_sean/sources inside the promnesia_sean folder.
  2. I've come up with a question pertaining to security. Maybe I should direct it to karli, but I think you are in a position to answer it too: when using sources that index sensitive information (like email), is it possible to add a prompt option to promnesia index for a password to encrypted folders, which could then be passed to some sources (like IMAP, or plaintext/auto)? We could put the password in the config.py file, but that defeats the purpose, because an attacker could then scrape it from there.

Thank you so much again for your help and work !

@seanbreckenridge
Owner

seanbreckenridge commented Oct 19, 2021

displaying surrounding text of the email body next to the URL

There's an option in my promnesia module to display the text as the body, but with 8000 emails, the sqlite database tends to grow pretty fast (it was something like 30GB on my end, since it copies the text for every URL it finds).

See https://github.com/seanbreckenridge/promnesia/blob/master/promnesia_sean/sources/imap.py#L30

To enable that, in your config you can do something like:

Source(imap, body_as_context=True) instead of just the imap like in my config

Perhaps extracting a few lines around the message is preferable? Would increase the complexity/runtime a bit though. Will think about this
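
For reference, extracting a few lines around each URL could look something like the sketch below. It is an assumption about one possible approach, not what the promnesia plaintext/markdown sources or this module do; the URL regex is deliberately simple and urls_with_context is a hypothetical name.

```python
import re
from typing import Iterator, Tuple

# Deliberately simple URL pattern; real-world extraction needs more care
URL_RE = re.compile(r"https?://[^\s<>]+")


def urls_with_context(body: str, lines_around: int = 1) -> Iterator[Tuple[str, str]]:
    """Yield (url, context), where context is the URL's line plus its neighbours."""
    lines = body.splitlines()
    for i, line in enumerate(lines):
        for match in URL_RE.finditer(line):
            start = max(0, i - lines_around)
            end = min(len(lines), i + lines_around + 1)
            yield match.group(0), "\n".join(lines[start:end])
```

Storing only this window instead of the whole body keeps the per-URL context small, at the cost of one extra pass over the lines of each email.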

svn checkout

Hmm, am not sure if this is related to it being a namespace package, but I sorta doubt it. Don't have any experience with svn

question pertaining to security

Yeah, I've thought about this a bit as well. The best solution I've come across for something like this is to use something like pass, which PGP-encrypts your passwords, and then use that to store a decryption key for some local zip. The PGP key can typically be kept in a keychain while the computer is active, so it acts as an initial barrier and everything isn't just plaintext. But then again, this could also be solved by encrypting your main drive. It adds a bunch of complexity, but core.structure already exists, which abstracts away some of the unzipping/extraction.

Since everything is local-first, I sort of don't see a huge issue, but if you want to bring it up, best place would probably be here

@krillin666
Author

Thanks for the guidance on the security part!
As for the svn checkout, I just used it to pull the sources folder from your git repo. The important thing here is that, when following your install procedure, the promnesia_sean folder installed on my PC does not contain the sources folder 😅

Thanks for the tip on IMAP. I understand now how your database grew so large: it is displaying not only the text of the whole email but also all the text from threads (when they exist). Is it not possible to index just the surrounding text, as promnesia.auto, promnesia.markdown, promnesia.plaintext (etc.) do? That would prevent the database from growing so huge and would show only the relevant text in the Promnesia plugin, so the sidebar isn't cluttered.

As for extracting and displaying the From, To and Subject fields, do you have any idea how to implement this? I think maybe a simple open(), then iterating through the lines (line.strip()) with conditionals to store each field, would suffice?

@krillin666
Author

krillin666 commented Oct 19, 2021

Since all emails in plaintext begin like so:

Subject: Promnesia
From: John Doe
Date: 06/10/2021, 09:28
To: Someone

It would suffice (???) to use something simple like this for each email file:

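# email_file is the path to one exported plain-text email
headers = {"From": "", "To": "", "Subject": "", "Date": ""}
with open(email_file) as in_file:
    for line in in_file:
        # Split on the first colon only, so values that contain colons (like the date) stay intact
        name, sep, value = line.partition(":")
        name = name.strip()
        # Keep the first occurrence of each header; this also ignores matching lines later in the body
        if sep and name in headers and not headers[name]:
            headers[name] = value.strip()
        # The headers sit at the top of the file, so stop once all four are found
        if all(headers.values()):
            break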

I'm sure you'll know the best way though !

@seanbreckenridge
Owner

seanbreckenridge commented Oct 19, 2021

relation to extracting and displaying the From, To, Subject

I think it's already doing this?

Relevant code is here

It displays that as the Locator description, not the body -- I don't think that should make a difference though; I think that's always shown.

I'll take a look at the markdown/plaintext modules from promnesia to see how they do it and see if I can figure out surrounding text; will leave this issue open for that purpose

@seanbreckenridge
Owner

seanbreckenridge commented Oct 19, 2021

It would suffice (???) to use something simple like this for each email file :

Just as an FYI, mail-parser (a library which wraps the stdlib email lib) is what my.imap uses, and it already parses all that info out nicely, so there's no need to do it manually:

https://github.com/seanbreckenridge/HPI/blob/master/my/imap.py#L72-L95
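
To show what "already parses all that info" buys you, here is a small sketch that reads the same headers using only the stdlib email package that mail-parser wraps. It does not use mail-parser's own API, and header_summary is a hypothetical helper, not something from my.imap.

```python
from email import message_from_string
from email.message import Message


def header_summary(raw_email: str) -> str:
    """Return the From, To and Subject headers as display lines."""
    msg: Message = message_from_string(raw_email)
    wanted = ("From", "To", "Subject")
    # Header lookup is case-insensitive and returns None for missing headers
    lines = [f"{name}: {msg[name]}" for name in wanted if msg[name] is not None]
    return "\n".join(lines)
```

Because the parser understands header syntax, there is no need to scan lines for "From" by hand or worry about the word appearing in the body.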

@krillin666
Author

relation to extracting and displaying the From, To, Subject

I think it's already doing this?

Relevant code is here

It displays that as the Locator description, not the body -- I don't think that should make a difference though; I think that's always shown.

I'll take a look at the markdown/plaintext modules from promnesia to see how they do it and see if I can figure out surrounding text; will leave this issue open for that purpose

OK, I see. But what I was trying to achieve was something like:


From: Sean
To: Krillin
Subject: IMAP issue

"Relevant text around url from email body here"

I see that the date is nicely displayed already on the Promnesia addon in the same location as for browser history, so no need to extract that line 👍

Thank you again, and I'm looking forward to seeing how this goes 😀

@seanbreckenridge
Owner

seanbreckenridge commented Oct 19, 2021

Since I think the config issues around IMAP/loading the text files have generally been solved here, I'm going to move this to an issue on the promnesia repo.

@krillin666
Author

I tested out the add-on, and I think I've got it to work. Using the 'Plain Text Format', export that to a folder somewhere.

It takes a while to do so.

I just put that in ~/Downloads/mailexport for this demo.

This does mean you'd have to periodically do an export, but there isn't a great way around that with Thunderbird. For context, I use mutt-wizard, which uses mbsync under the hood, so my mail gets synced with a local folder once every 5 minutes.

Sorry for commenting on a closed issue. I've found a way to automate this process for Thunderbird users; however, it relies on MBOX files.
The Thunderbird add-on mentioned above (ImportExportTools) now supports periodic backups, but only in MBOX format.
I've seen this solution for parsing MBOX files: https://github.com/chronicle-app/chronicle-email/blob/master/lib/chronicle/email/mbox_extractor.rb

Maybe this could be ported to promnesia; I don't know how laborious that would be.
Thank you!

@seanbreckenridge
Owner

seanbreckenridge commented Mar 15, 2022

Fine for you to comment here, all good

https://github.com/seanbreckenridge/HPI/blob/master/CHANGELOG.md

Ah I see, I updated the module names/structure here, so the imap file is now my.mail.imap

I think it would make sense to add something like my.mail.mbox and parse those files there? It's probably a totally different format, though maybe some of the code could be reused.
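
As a starting point for that idea, the Python stdlib already ships an MBOX reader in the mailbox module. The sketch below is only an assumption about how a my.mail.mbox module could begin; iter_mbox_headers is a hypothetical name and nothing here exists in HPI yet.

```python
import mailbox
from pathlib import Path
from typing import Dict, Iterator


def iter_mbox_headers(path: Path) -> Iterator[Dict[str, str]]:
    """Yield the basic headers of every message stored in an MBOX file."""
    # create=False makes a missing file an error instead of creating an empty mailbox
    box = mailbox.mbox(str(path), create=False)
    try:
        for message in box:
            yield {
                "from": message.get("From", ""),
                "to": message.get("To", ""),
                "subject": message.get("Subject", ""),
                "date": message.get("Date", ""),
            }
    finally:
        box.close()
```

Each item yielded by the mailbox is a regular email message object, so the existing header and body handling could probably be reused once the messages are read out of the file.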

I can create an issue to track it 👍
