(Feature request) Option to search specified headers #13

camwebb · 2015-03-05T16:20:14Z

It would be very helpful to have the option to specify (in the config file) particular headers, other than To, From, etc., in which content would also be included in the building of the index. I.e., in .mairixrc:

header=User-Agent:X-Foo:X-Bar

Thanks for this great tool! I use it every day (with gnus).

ericpruitt · 2017-12-24T03:33:22Z

I'm going to give implementing this a shot. Per my comment on Kim Vandry's repository, I'm going to create a new table that will include all tokens from any of the headers that don't have dedicated tables in the database. The entries in the new table will be in the form of $HEADER_NAME:$TOKEN, which will make it possible to re-use the existing search query syntax. Here's an example of what searching for User-Agent headers containing "mutt" will look like: mairix o:^User-Agent:,mutt . The "o" is short for "other (read: miscellaneous) headers." The downside to using $HEADER_NAME:$TOKEN is that it will result in a lot of redundant data being stored. Once I'm more familiar with the mairix codebase, I'll reconsider how I store the tokens for arbitrary headers.
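
As a rough illustration of the $HEADER_NAME:$TOKEN scheme, here is a minimal Python sketch (mairix itself is written in C). The function name, the set of "major" headers, and the tokenizing regular expression are assumptions made for this example and are not taken from the mairix source.

```python
import re
from email import message_from_string

# Assumption: headers that already have dedicated tables in the database.
# The real list is defined by mairix, not by this sketch.
MAJOR_HEADERS = {"to", "cc", "from", "subject", "message-id"}


def minor_header_tokens(raw_message):
    """Return the set of "header-name:token" strings for all non-major headers."""
    msg = message_from_string(raw_message)
    tokens = set()
    for name, value in msg.items():
        if name.lower() in MAJOR_HEADERS:
            continue
        # Hypothetical tokenizer: runs of ASCII letters and digits.
        for word in re.findall(r"[A-Za-z0-9]+", value):
            tokens.add(f"{name.lower()}:{word.lower()}")
    return tokens


sample = (
    "From: a@example.com\n"
    "User-Agent: Mutt/1.9.2 (2017-12-15)\n"
    "X-Foo: bar\n"
    "\n"
    "body\n"
)
tokens = minor_header_tokens(sample)
```

Every stored entry repeats the header name, which is the redundancy mentioned above: a header with many distinct tokens pays for its name once per token.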

Even if I don't end up changing how the records are stored, I don't think its impact on the size of the mairix database will be all that bad provided I didn't make a mistake below:

My mairix database has ~57,000 messages indexed.
The database weighs in at ~50 MB.
Based on the output of mairix -d, I think the tokens -- of which there are ~450,000 -- from the body of my emails account for 40 MB of that space.
The average length of my email header names with the colon included is ~11 bytes (a copy of the ad hoc script is included at the end of this message).

For the sake of estimating, I'll pretend that the tokens account for all of the space being used and that the number of unique tokens in the headers will be equal to the number of unique tokens in the body. The average length of the body tokens is ~90B (40MB / 450,000 tokens), and prefixing each header token with an ~11-byte header name brings that to ~100B, so my database grows from 50MB to 95MB (50MB + 40MB * 100B / 90B) in this scenario. Since I already have 57,000 emails taking up just under 2GB of space, an extra 45MB is a non-issue for me. Indexing arbitrary headers could be toggled with a flag if it turns out that my estimate is too optimistic. As long as search queries are still performant, I personally would gladly sacrifice a couple of hundred megs of space to support searching through arbitrary headers.
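
The estimate can be reproduced with a few lines of Python. This sketch assumes that the 100B figure is the ~90B average body entry plus the ~11-byte header-name prefix, and that "MB" means 10^6 bytes; the variable names are made up for the example.

```python
MB = 1_000_000  # assumption: decimal megabytes

db_size = 50 * MB            # current database size
body_token_space = 40 * MB   # space attributed to body tokens
body_token_count = 450_000   # number of body tokens
avg_header_name_len = 11     # average header name length, colon included

# Average space per body token entry (~90 bytes).
avg_body_entry = body_token_space / body_token_count

# Each header token entry also carries its header name (~100 bytes total).
avg_header_entry = avg_body_entry + avg_header_name_len

# Assume as many unique header tokens as body tokens.
extra_space = body_token_space * avg_header_entry / avg_body_entry
estimated_size = db_size + extra_space

print(f"extra: {extra_space / MB:.1f} MB, total: {estimated_size / MB:.1f} MB")
```

The result lands at roughly 95 MB, matching the figure in the text. The calculation is only as good as its two assumptions about token counts and per-entry size.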

Questions and constructive criticism welcomed.

Script I used for computing the average header length:

messages$ (export LC_ALL=C;
          find -type f -name '*:*' -exec sh -c 'formail -X "" < "{}"' \; \
          | egrep -o '^[^: \t]{2,}:' \
          | awk '{ sum += length() }
                 END { print "N:", NR, "Avg:", sum / NR}')
N: 1511863 Avg: 10.543

CC: @vandry

ericpruitt · 2017-12-24T08:07:33Z

My estimate was off; the database with the re-indexed messages takes up ~220 MB (vs ~95 MB expected) with my changes. Aside from that, the preliminary code seems to work, and there's no appreciable difference in speed for my test queries. I still plan on looking into normalizing the data to reduce the size of the database.

TODO:

- Consider heuristic for choosing how the miscellaneous headers are indexed. In the current implementation, all of the miscellaneous headers are indexed using both the default tokenization and email address tokenization, but most of the tokens will not be email addresses. Ideas:
  - Could check to see if there's an "@" value ~~or if the value matches /<[^ ]+@[^ ]+>/~~.
  - ~~Analyze all of my emails to see which headers typically have email addresses or message IDs and create a hard-coded list of headers that are address-tokenized.~~
  - Allow the user to specify which headers are indexed ~~as email addresses~~.
  - A more radical alternative would be to simply treat all minor headers as a single string. The downside to this is that it no longer becomes possible to search for text in a particular header, but it would be trivial to make this behavior configurable.
  - Email header names may only contain ASCII values 33 through 126, so the X most common header names could be abbreviated to a single byte using a lookup table.
- Support printing which headers matched when using o:... with "--excerpt-output".
- Update documentation.
- I changed the value of UI_MSGID_BASE because adding entry 37 after it caused mairix to segfault. I'm guessing the message ID information is larger than 3 bytes to support threading, but I need to do more digging to be sure.
  - When attachment names were added to the database, Curnow inserted them before the message IDs, so that's what I'm doing.
- Figure out what types of mairix database changes do / do not require users to re-index their messages.
  - Attachment names were the last thing added to the database schema (6d67f46), and Curnow just incremented HEADER_MAGIC3 by 1.
- Change verbiage from "miscellaneous headers" to "minor headers" since "mairix --help" refers to the headers that are normally indexed as "major headers."
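
The lookup-table idea from the list above could look something like the following Python sketch. It relies only on the stated fact that header names use ASCII 33 through 126, which leaves byte values 0x80 to 0xFF free to act as single-byte codes. The function names and the on-disk entry layout are hypothetical and do not reflect how mairix stores tokens.

```python
from collections import Counter


def build_table(header_names, size=128):
    """Map the `size` most common header names to single bytes 0x80..0xFF."""
    if not 0 < size <= 128:
        raise ValueError("size must be between 1 and 128")
    common = [name for name, _ in Counter(header_names).most_common(size)]
    return {name: bytes([0x80 + i]) for i, name in enumerate(common)}


def encode_entry(table, name, token):
    """Encode "name:token", replacing a known name with its one-byte code."""
    prefix = table.get(name, name.encode("ascii") + b":")
    return prefix + token.encode("utf-8")


def decode_entry(table, entry):
    """Invert encode_entry for entries produced with the same table."""
    reverse = {code: name for name, code in table.items()}
    first = entry[:1]
    if first in reverse:
        return reverse[first], entry[1:].decode("utf-8")
    # No code byte: the name is spelled out up to the first colon.
    name, _, token = entry.partition(b":")
    return name.decode("ascii"), token.decode("utf-8")


seen = ["user-agent"] * 3 + ["x-foo"] * 2 + ["x-bar"]
table = build_table(seen, size=2)
short = encode_entry(table, "user-agent", "mutt")
plain = encode_entry(table, "x-bar", "baz")
```

Because a spelled-out header name can never start with a byte above 126, the first byte of an entry is enough to tell an abbreviated name from a full one.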

Update tests? On my machine, only a handful of tests passed without my changes. Are these maintained anymore or did I do something wrong? I ran make && (cd /test && make -j1):

========================================================

  Total # of tests           : 36
  Total # of succeeded tests : 5
  Total # of failed tests    : 31

========================================================

Makefile:74: recipe for target 'check' failed

ericpruitt · 2017-12-24T23:05:32Z

Normalizing the header terms to cut down on the size of the database is going to be a fairly involved task, so I'm not going to try to do that for the first implementation of this feature. I will be adding a new configuration option that will let users toggle this feature if the disk space is a concern.

camwebb · 2018-04-19T16:43:25Z

Just caught up on your work @ericpruitt . Awesome! As per the original feature request, one solution to lighten the load is to only activate alt. header indexing when requested in .mairixrc, and only for the headers requested. Then the o: flag could be used without specifying a particular header, which might often be helpful if the data being searched for could occur in several headers.

ericpruitt · 2018-04-20T00:52:22Z

> Then the o: flag could be used without specifying a particular header, which might often be helpful if the data being searched for could occur in several headers.

The code I wrote does just that; it supports either searching all headers or a specific set of headers as documented in mairix.c in my pull request:

+         "    h:word        : match word in the value of any minor header\n"
+         "    h:X:Y:word    : match word in the value of minor headers named \"X\" or \"Y\"\n"

I included some statistics comparing the size of the database before and after my change in vandry#12.

EDIT: Actually, my change doesn't work exactly as you described it because it wouldn't search the other headers like "From", "To", etc., but that could be changed easily enough, or an additional operator (maybe "H") could be added.
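
To make the two query forms from the help text concrete, here is a minimal Python sketch of how "h:word" and "h:X:Y:word" could be interpreted. It is not the parser from the pull request. The function names, the case-insensitive matching, and the tokenizing regular expression are assumptions for illustration. Like the change described above, it only looks at minor headers.

```python
import re


def parse_minor_header_query(term):
    """Split "h:word" or "h:X:Y:word" into (header_names, word)."""
    if not term.startswith("h:"):
        raise ValueError(f"not a minor-header query: {term!r}")
    # Everything before the last colon-separated field is a header name.
    *names, word = term[2:].split(":")
    return [name.lower() for name in names], word.lower()


def query_matches(minor_headers, term):
    """Return True if the word occurs in any (or any named) minor header."""
    names, word = parse_minor_header_query(term)
    for name, value in minor_headers.items():
        if names and name.lower() not in names:
            continue  # a header list was given and this header is not in it
        if word in re.findall(r"[a-z0-9]+", value.lower()):
            return True
    return False


headers = {"User-Agent": "Mutt/1.9.2", "X-Foo": "bar"}
any_header = query_matches(headers, "h:mutt")
wrong_header = query_matches(headers, "h:X-Foo:mutt")
either_header = query_matches(headers, "h:User-Agent:X-Foo:bar")
```

An empty header list means "any minor header", which corresponds to the first help line. Naming one or more headers restricts the search as in the second.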

camwebb · 2018-04-20T21:15:01Z

I tried out your fork. Works as expected. The database is large, but not unfeasibly large. Great! Thanks!

xandro0777 · 2019-11-23T20:35:23Z

Seeing the size increase and wondering - wouldn't "naive full text indexing" of the headers result in smaller database size? Of course the queries would give less precise results but may be a good enough tradeoff in many cases. Also, it might be possible to do naive full text indexing but store the results for message bodies and headers in different tables to improve specificity of the results.

ericpruitt · 2019-11-23T20:44:24Z

> Seeing the size increase and wondering - wouldn't "naive full text indexing" of the headers result in smaller database size?

Potentially, but I have no intention of spending any more time on this. You're welcome to build off my changes in vandry#12.

vandry mentioned this issue Jul 30, 2017

Tagging a release vandry/mairix#1

Closed

This was referenced Dec 27, 2017

Generate a warning when a search query includes a colon for something other than an attachment vandry/mairix#11

Closed

Support searching arbitrary headers vandry/mairix#12

Closed
