[papercut] ignore rich text mark-up in title searches #81

Open
adam3smith opened this Issue · 8 comments

4 participants

@adam3smith

"For example, the title stored as:

Nanosecond electron tunneling between the hemes in cytochrome [i]bo[/i][sub]3[/sub]

[I replaced < by brackets to preserve syntax. as.]

is not found if searching simply for, say, cytochrome bo3. "

http://forums.zotero.org/discussion/3875/2/rich-text-in-titles/#Item_18

@rmzelle

And, related, it would be really nice if the center column (and the word processor dialog windows) would show the parsed and formatted title.

@aurimasv

If I'm understanding the search code correctly, the searches are performed directly using SQL statements, which is why these titles are not picked up.

If we want to stick to SQL-based searching (which I think is quite efficient), then probably the most reasonable thing would be to store the formatted and unformatted (as well as de-accented) titles in the database. The formatted title could be left blank where the title contains no formatting, so it doesn't waste space in the database.

I was thinking that to further conserve space, we could just store the formatting that was stripped off and the position where it's supposed to be re-inserted. This makes things quite a bit more complicated though, and it probably does not save that much space overall. So if space is not super important, this could be avoided.
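For illustration, the "store the stripped formatting plus its position" idea could look roughly like the sketch below. The function names `extractMarkup` and `restoreMarkup` and the `{ pos, tag }` record shape are hypothetical, not anything in Zotero, and the tag regex is a simplification that assumes titles only contain simple HTML-style tags.

```javascript
// Hypothetical sketch: split a title into plain text plus a list of
// (position, tag) records, and rebuild the formatted title from them.
// None of these names come from Zotero's code base.

function extractMarkup(title) {
  // Matches opening and closing tags, with or without attributes.
  const tagPattern = /<\/?[a-zA-Z][^>]*>/g;
  const marks = [];
  let plain = "";
  let last = 0;
  let match;
  while ((match = tagPattern.exec(title)) !== null) {
    plain += title.slice(last, match.index);
    // pos is an offset into the plain text, not into the original title.
    marks.push({ pos: plain.length, tag: match[0] });
    last = match.index + match[0].length;
  }
  plain += title.slice(last);
  return { plain, marks };
}

function restoreMarkup(plain, marks) {
  let out = "";
  let last = 0;
  // marks are assumed to be in the order extractMarkup produced them.
  for (const { pos, tag } of marks) {
    out += plain.slice(last, pos) + tag;
    last = pos;
  }
  return out + plain.slice(last);
}

const title = "cytochrome <i>bo</i><sub>3</sub>";
const { plain, marks } = extractMarkup(title);
console.log(plain);                                 // cytochrome bo3
console.log(restoreMarkup(plain, marks) === title); // true
```

Even in this small form, every edit to the plain title would have to shift the stored positions, which is the extra complexity mentioned above.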

Is this an acceptable approach?

@dstillman
Owner

We can start by just adding a UDF to do the stripping to db.js and searching via that. ("Premature optimization…") Most searches we do aren't left-bound anyway, since we need to find words in the middle of fields, so stripping the markup dynamically may not add much time.
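As a rough idea of what the stripping function behind such a UDF might do, here is a minimal sketch in plain JavaScript. The names `stripTitleMarkup` and `titleMatches` are hypothetical, the regex is an assumption about which markup appears in titles, and registering the function with the database layer in db.js is deliberately left out.

```javascript
// Hypothetical helper, not taken from db.js: removes anything that looks
// like an HTML tag. Each tag is removed independently, so nesting and
// attributes need no special handling.
function stripTitleMarkup(title) {
  return title.replace(/<\/?[a-zA-Z][^>]*>/g, "");
}

// Stand-in for a "contains" search on the stripped value: an unanchored,
// case-insensitive substring match.
function titleMatches(title, query) {
  return stripTitleMarkup(title).toLowerCase().includes(query.toLowerCase());
}

const stored =
  "Nanosecond electron tunneling between the hemes in cytochrome <i>bo</i><sub>3</sub>";

console.log(stripTitleMarkup(stored));
// Nanosecond electron tunneling between the hemes in cytochrome bo3
console.log(titleMatches(stored, "cytochrome bo3")); // true
console.log(titleMatches('<span class="x"><i>nested</i></span> title', "nested title")); // true
```

A literal `<` followed by a space or digit (as in "p < 0.05") is left alone by this pattern, but text such as `a <b and c> d` would be treated as a tag, so a real implementation might prefer a whitelist of the tags Zotero actually supports.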

Also, we might be able to get accent-insensitive searches just by using COLLATE locale, though I haven't tested that.

> I was thinking that to further conserve space, we could just store the formatting that was stripped off and the position where it's supposed to be re-inserted.

Definitely not necessary. Disk space is cheap.

@aurimasv

> We can start by just adding a UDF to do the stripping to db.js and searching via that.

I have to admit that my SQL is not that great, but is SQL up to the task of stripping off nested HTML tags that may have attributes? I suppose the same tag should never be nested inside itself, which would make it simpler to strip them off with regular expressions. I'll have to look into SQL regex capabilities.

@dstillman
Owner

The UDF is written in JS. Search for "udf" in db.js for examples.

@aurimasv

Alright, thanks. That actually looks promising. I'll see what I can cook up.

@dstillman
Owner

I would hold off on this. JavaScript UDFs may not be usable with async DB queries. Still investigating.

@aurimasv

@dstillman, did you find out if UDFs will be possible? As an alternative, we could build a temp table on Zotero startup with titles stripped of HTML. This would also involve updating this table on sync and item edits.
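To make the temp-table alternative concrete, here is a sketch that uses an in-memory `Map` as a stand-in for the table. The class name `StrippedTitleIndex`, its methods, and the `{ id, title }` item shape are all hypothetical; a real version would live in the database and hook into Zotero's actual sync and edit notifications.

```javascript
// Hypothetical sketch: an index of markup-free titles, built once at
// startup and kept current whenever an item is edited, synced, or deleted.
const stripMarkup = (title) => title.replace(/<\/?[a-zA-Z][^>]*>/g, "");

class StrippedTitleIndex {
  // items: iterable of { id, title } objects (assumed shape).
  constructor(items) {
    this.titles = new Map();
    for (const item of items) this.update(item);
  }

  // Call on item edit or after a sync brings in a changed item.
  update(item) {
    this.titles.set(item.id, stripMarkup(item.title).toLowerCase());
  }

  // Call when an item is deleted.
  remove(id) {
    this.titles.delete(id);
  }

  // Returns the ids of items whose stripped title contains the query.
  search(query) {
    const needle = query.toLowerCase();
    return [...this.titles]
      .filter(([, title]) => title.includes(needle))
      .map(([id]) => id);
  }
}

const index = new StrippedTitleIndex([
  { id: 1, title: "Tunneling in cytochrome <i>bo</i><sub>3</sub>" },
  { id: 2, title: "An unrelated title" },
]);
console.log(index.search("cytochrome bo3")); // [ 1 ]

index.update({ id: 2, title: "More on cytochrome <i>bo</i><sub>3</sub>" });
console.log(index.search("cytochrome bo3")); // [ 1, 2 ]
```

The cost of this approach is the bookkeeping: every code path that changes a title has to remember to call the update hook, or searches will return stale results.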
