[papercut] ignore rich text mark-up in title searches #81

Open
adam3smith opened this Issue Mar 13, 2012 · 8 comments

@adam3smith
Contributor

"For example, the title stored as:

Nanosecond electron tunneling between the hemes in cytochrome [i]bo[/i][sub]3[/sub]
[I replaced < by brackets to preserve syntax. as.]

is not found if searching simply for, say, cytochrome bo3. "

http://forums.zotero.org/discussion/3875/2/rich-text-in-titles/#Item_18

@rmzelle
Contributor
rmzelle commented Mar 13, 2012

And, related, it would be really nice if the center column (and the word processor dialog windows) would show the parsed and formatted title.

@aurimasv
Contributor
aurimasv commented Mar 7, 2013

If I'm understanding the search code correctly, the searches are performed directly using SQL statements, which is why these titles are not picked up.

If we want to stick to SQL-based searching (which I think is quite efficient), then the most reasonable thing would probably be to store the formatted and unformatted (as well as de-accented) titles in the database. The formatted title could be left blank where the title contains no formatting, so it doesn't waste space in the database.
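To make the proposal concrete, here is a minimal sketch of how the stored values could be derived from a raw title. The function and field names (`stripMarkup`, `removeAccents`, `titleColumns`, `formatted`, `plain`, `searchable`) are hypothetical and not part of Zotero's code, and the tag-stripping regex is deliberately simple.

```javascript
// Hypothetical helpers for illustration only; none of these names exist in Zotero.

function stripMarkup(title) {
  // Remove anything that looks like an opening or closing HTML tag.
  return title.replace(/<\/?[a-zA-Z][^>]*>/g, "");
}

function removeAccents(text) {
  // Decompose accented characters (NFD), then drop the combining marks
  // in the range U+0300 to U+036F.
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

function titleColumns(title) {
  const plain = stripMarkup(title);
  return {
    // Left empty when the title has no markup, so nothing is stored twice.
    formatted: plain === title ? "" : title,
    plain: plain,
    // De-accented, lower-cased variant intended for matching.
    searchable: removeAccents(plain).toLowerCase(),
  };
}
```

A search would then compare the query against `searchable` (or `plain`), while the display code would fall back to `plain` whenever `formatted` is empty.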

I was thinking that to further conserve space, we could just store the formatting that was stripped off and the position where it's supposed to be re-inserted. This makes things quite a bit more complicated though, and it probably does not save that much space overall. So if space is not super important, this could be avoided.
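For reference, the space-saving variant could look roughly like the sketch below: the tags are pulled out of the title and recorded together with their offsets in the plain text, and the formatted title is rebuilt from those records on demand. `extractMarkup` and `restoreMarkup` are hypothetical names, and the `{pos, tag}` record shape is an assumption made for illustration.

```javascript
// Hypothetical sketch; the names and the record layout are assumptions.

// Split a title into plain text plus a list of {pos, tag} records.
function extractMarkup(title) {
  const tagPattern = /<\/?[a-zA-Z][^>]*>/g;
  const marks = [];
  let plain = "";
  let last = 0;
  let match;
  while ((match = tagPattern.exec(title)) !== null) {
    plain += title.slice(last, match.index);
    // pos is an offset into the plain text, not into the original title.
    marks.push({ pos: plain.length, tag: match[0] });
    last = match.index + match[0].length;
  }
  plain += title.slice(last);
  return { plain, marks };
}

// Rebuild the formatted title by re-inserting each tag at its recorded offset.
function restoreMarkup(plain, marks) {
  let out = "";
  let last = 0;
  for (const { pos, tag } of marks) {
    out += plain.slice(last, pos) + tag;
    last = pos;
  }
  return out + plain.slice(last);
}
```

Even in this small form, every edit to a title has to keep the offsets in sync with the plain text, which is the extra complexity mentioned above.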

Is this an acceptable approach?

@dstillman
Member

We can start by just adding a UDF to do the stripping to db.js and searching via that. ("Premature optimization…") Most searches we do aren't left-bound anyway, since we need to find words in the middle of fields, so stripping the markup dynamically may not add much time.

Also, we might be able to get accent-insensitive searches just by using COLLATE locale, though I haven't tested that.

"I was thinking that to further conserve space, we could just store the formatting that was stripped off and the position where it's supposed to be re-inserted."

Definitely not necessary. Disk space is cheap.

@aurimasv
Contributor
aurimasv commented Mar 7, 2013

"We can start by just adding a UDF to do the stripping to db.js and searching via that."

I have to admit that my SQL is not that great, but is SQL up to the task of stripping off nested HTML tags that may have attributes? I suppose the same tag should never be nested within itself, which would make it simpler to strip the tags off with regular expressions. I'll have to look into SQL regex capabilities.

@dstillman
Member

The UDF is written in JS. Search for "udf" in db.js for examples.
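As a rough idea of what the body of such a UDF might do, here is a sketch of a plain JavaScript function that strips tags, including nested tags and tags with attributes. The name `stripTitleMarkup` is hypothetical, and registering the function with the database layer is left out on purpose because it depends on how db.js wires up its existing UDFs.

```javascript
// Hypothetical stripping function; how it gets registered as a UDF is not shown.
function stripTitleMarkup(title) {
  // Match "<", an optional "/", a tag name, then any mix of quoted attribute
  // values and other characters up to the closing ">". Quoted values are
  // consumed as a unit so a ">" inside an attribute does not end the tag early.
  const tag = /<\/?[a-zA-Z][a-zA-Z0-9]*(?:"[^"]*"|'[^']*'|[^>"'])*>/g;
  return title.replace(tag, "");
}
```

Because each tag is matched on its own, nesting needs no special handling: `<i>a<b>c</b></i>` simply loses four separate tags. A bare `<` that is not followed by a letter, as in `a < b`, is left alone.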

@aurimasv
Contributor
aurimasv commented Mar 7, 2013

Alright, thanks. That actually looks promising. I'll see what I can cook up.

@dstillman
Member

I would hold off on this. JavaScript UDFs may not be usable with async DB queries. Still investigating.

@aurimasv
Contributor

@dstillman, did you find out if UDFs will be possible? As an alternative, we could build a temp table on Zotero startup with titles stripped of HTML. This would also involve updating this table on sync and item edits.
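A minimal sketch of that alternative, using an in-memory `Map` as a stand-in for the temp table: it is filled once at startup and refreshed whenever an item is edited, synced, or deleted. The class name `StrippedTitleIndex`, its methods, and the `{id, title}` item shape are all hypothetical.

```javascript
// Hypothetical stand-in for the temp table; not Zotero code.
class StrippedTitleIndex {
  constructor() {
    this.titles = new Map(); // item ID -> markup-free, lower-cased title
  }

  static normalize(title) {
    // Drop anything that looks like an HTML tag, then lower-case for matching.
    return title.replace(/<\/?[a-zA-Z][^>]*>/g, "").toLowerCase();
  }

  // Called once at startup with every item that has a title.
  rebuild(items) {
    this.titles.clear();
    for (const { id, title } of items) {
      this.update(id, title);
    }
  }

  // Called after an item edit or a sync that changed the title.
  update(id, title) {
    this.titles.set(id, StrippedTitleIndex.normalize(title));
  }

  // Called when an item is deleted.
  remove(id) {
    this.titles.delete(id);
  }

  // Return the IDs of items whose stripped title contains the query.
  search(query) {
    const needle = query.toLowerCase();
    const hits = [];
    for (const [id, title] of this.titles) {
      if (title.includes(needle)) hits.push(id);
    }
    return hits;
  }
}
```

The cost is the bookkeeping: every code path that changes a title has to call `update` or `remove`, otherwise the index drifts out of step with the database.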
