(Feature request) Option to search specified headers #13

Open
camwebb opened this issue Mar 5, 2015 · 8 comments

Comments

@camwebb

camwebb commented Mar 5, 2015

It would be very helpful to have an option to specify (in the config file) particular headers, other than To, From, etc., whose content would also be included when building the index. For example, in .mairixrc:

header=User-Agent:X-Foo:X-Bar

Thanks for this great tool! I use it every day (with gnus).

@ericpruitt

I'm going to give implementing this a shot. Per my comment on Kim Vandry's repository, I'm going to create a new table that will include all tokens from any of the headers that don't have dedicated tables in the database. The entries in the new table will be in the form of $HEADER_NAME:$TOKEN, which will make it possible to re-use the existing search query syntax. Here's an example of what searching for User-Agent headers containing "mutt" will look like: mairix o:^User-Agent:,mutt . The "o" is short for "other (read: miscellaneous) headers." The downside to using $HEADER_NAME:$TOKEN is that it will result in a lot of redundant data being stored. Once I'm more familiar with the mairix codebase, I'll reconsider how I store the tokens for arbitrary headers.
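
To make the $HEADER_NAME:$TOKEN idea concrete, here is a minimal Python sketch of how such entries could be generated from a message. It is not mairix code: the `MAJOR_HEADERS` set, the function name, and the word-splitting rule are all assumptions made for illustration, and mairix's real tokenizer is written in C and behaves differently.

```python
import re
from email import message_from_string

# Assumption: headers that already have dedicated tables. Illustrative only.
MAJOR_HEADERS = {"to", "cc", "from", "subject", "message-id"}


def minor_header_tokens(raw_message):
    """Return a set of 'Header-Name:token' strings for non-major headers."""
    message = message_from_string(raw_message)
    tokens = set()
    for name, value in message.items():
        if name.lower() in MAJOR_HEADERS:
            continue
        # Hypothetical tokenizer: lowercase runs of ASCII letters and digits.
        for word in re.findall(r"[A-Za-z0-9]+", value):
            tokens.add(f"{name}:{word.lower()}")
    return tokens


sample = "From: a@example.org\nUser-Agent: Mutt/1.5.23\nX-Foo: bar baz\n\nbody\n"
for entry in sorted(minor_header_tokens(sample)):
    print(entry)
```

Every entry repeats the header name, which is the redundancy the comment above points out.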

Even if I don't end up changing how the records are stored, I don't think its impact on the size of the mairix database will be all that bad provided I didn't make a mistake below:

  • My mairix database has ~57,000 messages indexed.
  • The database weighs in at ~50 MB.
  • Based on the output of mairix -d, I think the tokens from the body of my emails, of which there are ~450,000, account for 40 MB of that space.
  • The average length of my email header names, colon included, is ~11 bytes (a copy of the ad hoc script is included at the end of this message).

For the sake of estimating, I'll pretend that the tokens account for all of the space being used and that the number of unique tokens in the headers will be equal to the number of unique tokens in the body. The average space used per body token is ~90B (40MB / 450,000 tokens), so my database grows from 50MB to 95MB (50MB + 40MB * 100B / 90B) in this scenario. Since I already have 57,000 emails taking up just under 2GB of space, an extra 45MB is a non-issue for me. Indexing arbitrary headers could be toggled with a flag if it turns out that my estimate is too optimistic. As long as search queries are still performant, I personally would gladly sacrifice a couple of hundred megs of space to support searching through arbitrary headers.
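
The arithmetic behind that estimate can be replayed in a few lines of Python. This sketch reads the 100B figure as the ~90B per-token cost plus the ~11-byte header name; that reading is an interpretation, not something the comment states outright. Decimal megabytes are assumed, and the variable names are made up for the example.

```python
# Figures quoted in the comment above.
db_size_mb = 50             # current database size
body_tokens_mb = 40         # space attributed to body tokens
body_token_count = 450_000  # number of body tokens
header_name_bytes = 11      # average header name length, colon included

# Average bytes stored per body token (1 MB taken as 1,000,000 bytes).
bytes_per_body_token = body_tokens_mb * 1_000_000 / body_token_count

# Assumption from the estimate: as many header tokens as body tokens,
# each carrying a header-name prefix on top of the per-token cost.
bytes_per_header_token = bytes_per_body_token + header_name_bytes
extra_mb = body_token_count * bytes_per_header_token / 1_000_000

print(f"per body token:   {bytes_per_body_token:.0f} B")
print(f"per header token: {bytes_per_header_token:.0f} B")
print(f"estimated size:   {db_size_mb + extra_mb:.0f} MB")
```

With these inputs the last line rounds to 95 MB, matching the figure in the estimate.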

Questions and constructive criticism welcomed.

Script I used for computing the average header length:

messages$ (export LC_ALL=C;
          find -type f -name '*:*' -exec sh -c 'formail -X "" < "{}"' \; \
          | egrep -o '^[^: \t]{2,}:' \
          | awk '{ sum += length() }
                 END { print "N:", NR, "Avg:", sum / NR}')
N: 1511863 Avg: 10.543

CC: @vandry

@ericpruitt

ericpruitt commented Dec 24, 2017

My estimate was off; the database with the re-indexed messages takes up ~220 MB (vs ~95 MB expected) with my changes. Aside from that, the preliminary code seems to work, and there's no appreciable difference in speed for my test queries. I still plan on looking into normalizing the data to reduce the size of the database.

TODO:

  • Consider a heuristic for choosing how the miscellaneous headers are indexed. In the current implementation, all of the miscellaneous headers are indexed using both the default tokenization and email address tokenization, but most of the tokens will not be email addresses. Ideas:
    • Could check whether the value contains an "@" or matches /<[^ ]+@[^ ]+>/.
    • Analyze all of my emails to see which headers typically have email addresses or message IDs and create a hard-coded list of headers that are address-tokenized.
    • Allow the user to specify which headers are indexed as email addresses.
    • A more radical alternative would be to simply treat all minor headers as a single string. The downside is that it would no longer be possible to search for text in a particular header, but it would be trivial to make this behavior configurable.
    • Email header names may only contain ASCII values 33 through 126, so the X most common header names could be abbreviated to a single byte using a lookup table.
  • Support printing which headers matched when using o:... with "--excerpt-output".
  • Update documentation.
  • I changed the value of UI_MSGID_BASE because adding entry 37 after it caused mairix to segfault. I'm guessing the message ID information is larger than 3 bytes to support threading, but I need to do more digging to be sure.
    When attachment names were added to the database, Curnow inserted them before the message IDs, so that's what I'm doing.
  • Figure out what types of mairix database changes do / do not require users to re-index their messages.
    Attachment names were the last thing added to the database schema (6d67f46), and Curnow just incremented HEADER_MAGIC3 by 1.
  • Change verbiage from "miscellaneous headers" to "minor headers" since "mairix --help" refers to the headers that are normally indexed as "major headers."
  • Update tests? On my machine, only a handful of tests passed without my changes. Are these still maintained, or did I do something wrong? I ran make && (cd /test && make -j1):
    ========================================================
    
      Total # of tests           : 36
      Total # of succeeded tests : 5
      Total # of failed tests    : 31
    
    ========================================================
    
    Makefile:74: recipe for target 'check' failed
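
The first TODO item floats two checks for deciding whether a minor header deserves address-style tokenization. A minimal Python sketch of both is shown here for comparison. The function names are hypothetical and this is not how mairix decides anything today; only the regular expression is taken from the list above.

```python
import re

# Pattern quoted in the TODO list: an angle-bracketed token containing "@".
ANGLE_ADDRESS = re.compile(r"<[^ ]+@[^ ]+>")


def has_at_sign(value):
    """Loose check: any '@' anywhere in the header value."""
    return "@" in value


def has_bracketed_address(value):
    """Stricter check: something shaped like <local@domain>."""
    return ANGLE_ADDRESS.search(value) is not None


for value in ("Mutt/1.5.23 (2014-03-12)",
              "<20171224.abc@example.org>",
              "see you @ noon"):
    print(repr(value), has_at_sign(value), has_bracketed_address(value))
```

The loose check also fires on values like the third sample, so the stricter pattern would likely send fewer non-address values through the address tokenizer.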
    

@ericpruitt

ericpruitt commented Dec 24, 2017

Normalizing the header terms to cut down on the size of the database is going to be a fairly involved task, so I'm not going to try to do that for the first implementation of this feature. I will be adding a new configuration option that will let users toggle this feature if the disk space is a concern.
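
For a sense of what one small piece of that normalization could look like, the sketch below illustrates the lookup-table idea from the TODO list: the most common header names are replaced by a single byte. Using byte values 128 to 255 as codes is an assumption chosen because header names themselves stay within ASCII 33 through 126. The function names are hypothetical, names are lowercased because header names are case-insensitive, and none of this reflects the actual mairix database format.

```python
from collections import Counter


def build_abbreviations(header_names, limit=128):
    """Map the most common header names to single bytes in the range 128-255."""
    limit = min(limit, 128)  # only 128 byte values are available above ASCII
    counts = Counter(name.lower() for name in header_names)
    common = [name for name, _ in counts.most_common(limit)]
    return {name: bytes([128 + i]) for i, name in enumerate(common)}


def encode_entry(name, token, table):
    """Encode 'name:token', swapping the name for its one-byte code if known."""
    prefix = table.get(name.lower(), name.encode("ascii") + b":")
    return prefix + token.encode("utf-8")


seen = ["User-Agent", "X-Foo", "user-agent"]
table = build_abbreviations(seen)
print(encode_entry("User-Agent", "mutt", table))  # abbreviated name
print(encode_entry("X-Bar", "baz", table))        # name kept in full
```

Because a stored entry then starts either with a code byte or with an ordinary ASCII name, a reader can tell the two forms apart from the first byte alone.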

@camwebb
Author

camwebb commented Apr 19, 2018

Just caught up on your work @ericpruitt . Awesome! As per the original feature request, one solution to lighten the load is to only activate alt. header indexing when requested in .mairixrc, and only for the headers requested. Then the o: flag could be used without specifying a particular header, which might often be helpful if the data being searched for could occur in several headers.

@ericpruitt

ericpruitt commented Apr 20, 2018

> Then the o: flag could be used without specifying a particular header, which might often be helpful if the data being searched for could occur in several headers.

The code I wrote does just that; it supports either searching all headers or a specific set of headers as documented in mairix.c in my pull request:

+         "    h:word        : match word in the value of any minor header\n"
+         "    h:X:Y:word    : match word in the value of minor headers named \"X\" or \"Y\"\n"

I included some statistics comparing the size of the database before and after my change in vandry#12.

EDIT: Actually, my change doesn't work exactly as you described it because it wouldn't search the other headers like "From", "To", etc., but that could be changed easily enough, or an additional operator (maybe "H") could be added.
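
The two query forms quoted from mairix.c can be modelled with a short Python sketch. This is only an illustration of the syntax, not the parser in the pull request. The function names are hypothetical, and matching is done by case-insensitive substring, which is a simplification of mairix's token matching.

```python
def parse_minor_header_query(expression):
    """Split 'h:word' or 'h:X:Y:word' into (header_names, word)."""
    prefix, _, rest = expression.partition(":")
    if prefix != "h" or not rest:
        raise ValueError(f"not a minor-header query: {expression!r}")
    # Everything before the last colon is a header name; the rest is the word.
    *names, word = rest.split(":")
    return names, word


def matches(minor_headers, expression):
    """Check a list of (name, value) minor headers against a query."""
    names, word = parse_minor_header_query(expression)
    wanted = {name.lower() for name in names}
    for name, value in minor_headers:
        if wanted and name.lower() not in wanted:
            continue  # query names specific headers and this is not one
        if word.lower() in value.lower():  # simplification: substring match
            return True
    return False


headers = [("User-Agent", "Mutt/1.5.23"), ("X-Foo", "bar")]
print(matches(headers, "h:mutt"))
print(matches(headers, "h:X-Foo:mutt"))
```

An empty name list means "any minor header". Major headers such as From and To are never consulted unless the caller passes them in, which mirrors the limitation noted in the EDIT above.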

@camwebb
Author

camwebb commented Apr 20, 2018

I tried out your fork. Works as expected. The database is large, but not unfeasibly large. Great! Thanks!

@xandro0777

Seeing the size increase and wondering - wouldn't "naive full text indexing" of the headers result in smaller database size? Of course the queries would give less precise results, but that may be a good enough tradeoff in many cases. Also, it might be possible to do naive full text indexing but store the results for message bodies and headers in different tables to improve specificity of the results.

@ericpruitt

> Seeing the size increase and wondering - wouldn't "naive full text indexing" of the headers result in smaller database size?

Potentially, but I have no intention of spending any more time on this. You're welcome to build off my changes in vandry#12.
