Add CJK word boundary iterator to query parser and term generator #114

Closed
wants to merge 8 commits into from

Contributor

@rsto rsto commented Jul 29, 2016

The CJK token iterator has been refactored to break CJK text either into ngrams or words. To enable the latter, a new flag FLAG_CJK_WORDS has been added to both QueryParser and TermGenerator.

Summary and caveats:

  • CJK ngrams remain default for backwards-compatibility (including the XAPIAN_CJK_NGRAM environment variable).
  • Word boundary analysis builds on the ICU - International Components for Unicode library. The autoconf builder has been adapted to require libicu, but this could be made optional.
  • Word boundaries might differ from other text analysis tools. Notably, for Japanese the ICU library results have been found to break some input text into more words than the Kuromoji morphological analyzer.

Unit tests pass for both ngrams and word boundaries:

Running tests with backend "none"...
./apitest backend none: All 128 tests passed.
Contributor Author

@rsto rsto commented Jan 9, 2017

Rebased to latest master following the snippet enhancement 4e56983

Contributor Author

@rsto rsto commented Jan 19, 2017

This will most probably need some indentation cleanup before merging upstream. Apart from that, it's what we currently use in production. I'd be happy to get this into Xapian's upstream version as well.

Review comment on xapian-core/Makefile.am (outdated, resolved)
Contributor

@ojwb ojwb commented Jan 20, 2017

I've not forgotten about this, but I still think it's something for master rather than 1.4.x, and merging patches and PRs which are suitable for 1.4.x seems more sensible to prioritise at this point in the release cycle.

The current ABI incompatibility could be addressed, but fundamentally it's a significant change on a hot code path so needs care to ensure it doesn't slow down indexing for everyone. There's also a longer-term strategy to decide - do we want to ditch our current Unicode support and use ICU everywhere? Or have a --with-icu/--without-icu configure option? ICU is rather large (it seems to be an order of magnitude larger than xapian-core), so not ideal for users in resource-constrained environments, but configure-time options are unhelpful for packagers - they have to make a single choice for all users of their package, or else maintain two variants of the package.

Contributor Author

@rsto rsto commented Jan 26, 2017

Thanks for the review. I've updated the library flags and autoconf setup for ICU. I also reformatted the code to make it pass xapian-check-patch (thanks for that!).

In terms of strategy on how/if to add ICU to Xapian, I'd be tempted to make it an optional dependency (which admittedly this patch doesn't support yet). Currently, it's only used for a niche feature along a specific code path. This might not warrant the effort of reworking Xapian's own Unicode support, and users of CJK word break support could build Xapian from source.

Plus: most of the core string routines in ICU operate on UTF-16, which might add overhead to indexing.

Contributor

@ojwb ojwb commented Jan 27, 2017

Thanks for the updates.

Not sure I'd agree that CJK is niche. Ideally we should support CJK well out of the box, and there's certainly wider interest in that - for example, there's the n-gram code we already have, and a patch to add support for the SCWS Chinese segmentation library (https://trac.xapian.org/ticket/594).

I still don't think this is suitable for 1.4.x though - even if use of ICU were disabled by default, the patch makes changes to the existing CJK support - when indexing CJK as n-grams two method calls per token are now virtual calls - unless the compiler can see enough information and is smart enough to devirtualise those calls, that'll prevent inlining and so be slower. Probably not by a lot per call, but we're talking about a lot of calls.
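
To illustrate the devirtualisation point, here is a minimal sketch. It is not code from this PR: `TokenIterator`, `ByteIterator` and the two counting functions are hypothetical names invented for the example. Calling through a base-class reference generally costs an indirect call per token, whereas calling through a concrete type marked `final` lets the compiler bind the call statically and possibly inline it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical interface standing in for a tokenizer base class.
struct TokenIterator {
    virtual ~TokenIterator() = default;
    // Returns true and advances while tokens remain.
    virtual bool next() = 0;
};

// Hypothetical concrete iterator: one "token" per byte of input.
// `final` guarantees that nothing can override next().
struct ByteIterator final : TokenIterator {
    explicit ByteIterator(const std::string& s) : left(s.size()) {}
    bool next() override {
        if (left == 0) return false;
        --left;
        return true;
    }
    std::size_t left;
};

// Call through the base class: typically an indirect (virtual) call
// per token unless the compiler can prove the dynamic type.
std::size_t count_via_base(TokenIterator& it) {
    std::size_t n = 0;
    while (it.next()) ++n;
    return n;
}

// Call through the final type: the target is known at compile time,
// so the call can be bound statically and inlined.
std::size_t count_via_final(ByteIterator& it) {
    std::size_t n = 0;
    while (it.next()) ++n;
    return n;
}
```

Both functions return the same count for the same input. The difference is only in how the call to `next()` can be compiled, which matters when it happens once or twice per token.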

And if people have to build from source anyway, having to build from a different source tree is not a big deal.

Xapian's current Unicode support is pretty much iterating over UTF-8 strings, and character categorisation and case conversion - it looks like ICU has such functionality which works without having to round-trip the strings via UTF-16 (http://userguide.icu-project.org/strings/utf-8). If they're significantly slower, that's not so good (though they may be significantly faster, which would be great). But having to round-trip to use most of the functionality does make ICU less appealing as an answer to general Unicode needs.

Contributor Author

@rsto rsto commented Feb 21, 2017

Getting @elliefm in the loop on this PR. Looking forward to making CJK word segmentation happen :)

Contributor

@ojwb ojwb commented Feb 23, 2017

> Xapian's current Unicode support is pretty much [...]

There's also converting to UTF-8 (appending a given code point to a char[] buffer or std::string).

If you want to help make this happen sooner, a useful thing to investigate would be speed. Ideally adding this support shouldn't make things slower for existing use cases (and I can't see a reason why it should inherently need to).

A simple test would be comparing the time taken for an unpatched version to index non-CJK text with a patched version indexing the same text. (The speed of indexing CJK text also matters, but the patched version is doing more so it's reasonable for it to take longer).

Contributor Author

@rsto rsto commented Mar 1, 2017

> If you want to help make this happen sooner, a useful thing to investigate would be speed.

I've run tests to compare performance for English and CJK text indexing.

Summary

The performance for non-CJK text is not impacted (the instruction counts differ by roughly 0.6‰). For CJK text, word segmentation requires roughly 63% more CPU instructions than ngrams. Looking at the code it's clear why non-CJK text performance stays the same: the virtual functions only come into play when CJK text is encountered.

But of course, this is just one test. I haven't looked for opportunities to optimise the word segmentation code.

How I measured

I'm measuring instruction counts as reported by Valgrind's callgrind module.

E.g. I ran callgrind as

$ valgrind --tool=callgrind --callgrind-out-file=callgrind-master-cjk.out --simulate-cache=yes simpleindex-1.5 db-master-cjk

and inspect the results as

$ callgrind_annotate --inclusive=yes callgrind-master-cjk.out
--------------------------------------------------------------------------------
Profile data file 'callgrind-master-cjk.out' (creator: callgrind-3.10.0)
--------------------------------------------------------------------------------
[...]
7,153,136,191 [...]  PROGRAM TOTALS

That is, I measure the cumulative (inclusive) Valgrind instruction counts for functions and their descendants.

What I measured

I adapted Xapian's examples/simpleindex.cc command line tool.

In contrast to the normal implementation, my instrumented version reads all its input into memory, then indexes it as a single document. A gist of the adapted version is here. For the master branch I chose FLAG_CJK_NGRAM, for the pull request FLAG_CJK_WORDS as flags.

For completeness, I verified that both branches produce the same index for the English text (only the database UUID differs):

$ xapian-delve-1.5 db-master-en > db-master-en.dump
$ xapian-delve-1.5 db-cjkwords-en > db-cjkwords-en.dump
$ diff db-master-en.dump db-cjkwords-en.dump
1c1
< UUID = 1556ea32-8de1-4c33-be00-c2de37f240e6
---
> UUID = 4765fe7f-649e-4719-8e54-665b0177e2dc

What I used as input

For non-CJK text I used Mark Twain's Huckleberry Finn:

613K Aug 17  2016 twain-76-0.txt

For CJK text I used a sample of Chinese texts, all concatenated into one stream read from stdin:

 33K Nov 22 2007 buddha-23585-0.txt
 87K Jan 27 2014 confucius-23839-0.txt
 65K May  7 2008 daoshen-25366-0.txt
815K Apr 25 2008 shizhenwang-25162-0.txt
 42K Dec 15 2007 sunzi-23864-0.txt

All are UTF-8 encoded texts from Project Gutenberg.

What I observed

I compared the cumulative instruction counts for the index_text method of the TermGenerator::Internal class.

The callgrind logs for the non-CJK text produced

$ callgrind_annotate --inclusive=yes callgrind-master-en.out | grep internal.cc:Xapian::TermGenerator::Internal::index_text
736,988,581 [...]  
$ callgrind_annotate --inclusive=yes callgrind-cjkwords-en.out | grep internal.cc:Xapian::TermGenerator::Internal::index_text
737,422,216 [...]

and for CJK text

$ callgrind_annotate --inclusive=yes callgrind-cjkwords-cjk.out | grep internal.cc:Xapian::TermGenerator::Internal::index_text
3,495,988,130 [...]
$ callgrind_annotate --inclusive=yes callgrind-master-cjk.out | grep internal.cc:Xapian::TermGenerator::Internal::index_text
2,143,674,128 [...]
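
The percentages quoted in the summary can be re-derived from the four counts above. The following is a minimal sketch of that arithmetic; `relative_increase` and the constant names are introduced here purely for illustration.

```cpp
#include <cstdint>

// Relative increase of `patched` over `baseline`, as a fraction
// (0.63 means 63% more instructions than the baseline).
double relative_increase(std::uint64_t baseline, std::uint64_t patched) {
    return (static_cast<double>(patched) - static_cast<double>(baseline)) /
           static_cast<double>(baseline);
}

// Inclusive instruction counts for index_text() as quoted above.
constexpr std::uint64_t master_en    = 736988581ULL;   // master, English
constexpr std::uint64_t cjkwords_en  = 737422216ULL;   // PR branch, English
constexpr std::uint64_t master_cjk   = 2143674128ULL;  // master, CJK ngrams
constexpr std::uint64_t cjkwords_cjk = 3495988130ULL;  // PR branch, CJK words
```

With these inputs, `relative_increase(master_en, cjkwords_en)` comes out at about 0.00059 (roughly 0.59‰) and `relative_increase(master_cjk, cjkwords_cjk)` at about 0.63, which matches the 63% figure for CJK text.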

I can share the callgrind log files, just let me know.

Contributor Author

@rsto rsto commented Mar 1, 2017

Addendum: of course this doesn't tell us anything about performance if we replace the UTF-8 implementation in Xapian with ICU. That's next on my list.

Contributor

@ojwb ojwb commented May 19, 2017

That's generally promising.

I've used cachegrind for this sort of profiling before, as it gives you estimated cycles taking into account CPU caches - does callgrind with --simulate-cache=yes do the same thing (and so when you say "CPU instructions" above I can read that as "cycles")?

> For CJK text, word segmentation requires roughly 63% more CPU instructions than ngrams.

That 63% is against a baseline of FLAG_CJK_NGRAM on master, if I follow the above.

It seems reasonable for segmentation to be a bit more work, as it's presumably inherently harder than generating every ngram.

Segmentation should mean fewer terms get generated which should balance the extra work at least somewhat. If I follow, you're only comparing the time to generate terms (and add them to a Document object since that's done by index_text()). So we see the benefits from fewer terms for the Document object to handle, but not the savings in the database backend from the reduced number of terms generated.

If you still have the callgrind logs around, it would be interesting to also check the cycles taken by WritableDatabase::add_document() and WritableDatabase::commit() with CJK data.

Did you compare FLAG_CJK_NGRAM on master and the branch?

> Addendum: of course this doesn't tell anything about performance if we replace the UTF8 implementation in Xapian with ICU. That's next on my list.

Did you get a chance to look at that yet?

Contributor Author

@rsto rsto commented Jun 9, 2017

I haven't had a chance to replace the Xapian UTF8 implementation with ICU, yet. I am working on it right now.

Meanwhile, I've rebased against current master and reran tests. I've used both callgrind's --cache-sim=yes and --branch-sim=yes options to enable cache and branch prediction simulation. According to the Valgrind manual, Callgrind's cache simulation is based on that of Cachegrind (source).

Callgrind reports a substantial performance improvement in terms of instruction count for CJK word segmentation, thanks to the lower number of terms:

Master branch

Totals                              7,259,187,162
WritableDatabase::commit            1,448,765,395
WritableDatabase::add_document      3,259,209,541
TermGenerator::Internal::index_text 2,142,284,579

CJKWord branch

Totals                              4,877,429,310
WritableDatabase::commit              542,824,865
WritableDatabase::add_document        491,925,449
TermGenerator::Internal::index_text 3,488,865,005

I can send you the log files and test setup for this benchmark and the previous round. Just let me know if you would like to have a play with it.

Next step

I'll post more results, when I've prototyped UTF-8 support using ICU in Xapian.

(Edit: removed section "Are my tests broken")

Contributor Author

@rsto rsto commented Jun 23, 2017

I've finally managed to replace Xapian's internal Unicode implementation with one backed by ICU. It's in a separate branch: https://github.com/rsto/xapian/tree/utf8itor

I can't reproduce the dramatic performance increase of the CJK word segmentation branch from my previous post. This is puzzling and I am looking further into the cause.

Still, I now consistently get cycle counts showing that CJK word segmentation beats ngrams, whereas the choice of Unicode implementation doesn't make much of a difference.

Branches:

  • master
  • cjkwords (with Xapian Unicode implementation)
  • utf8itor (with CJK word segmentation and ICU Unicode implementation).

Here are the results for English text

=== en
master:   1,043,095,910
cjkwords: 1,034,099,514
utf8itor: 1,049,401,671

CJK text

=== cjk
master:   5,047,169,100
cjkwords: 4,566,416,004
utf8itor: 4,574,959,785

and other languages (German, Norwegian, Russian)

=== misc
master:   1,531,033,367
cjkwords: 1,521,115,944
utf8itor: 1,537,557,810
Contributor

@ojwb ojwb commented Jun 27, 2017

To answer a question I think got deleted - in case it's still relevant, document length is the sum of the wdf for terms in the document, so I would expect segmentation to result in lower values for document length than n-grams.
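
A toy sketch of why segmentation should give lower document lengths. It assumes that ngram mode emits every unigram plus every overlapping bigram for a run of CJK characters (an assumption about the ngram scheme, worth checking against the tokenizer source) and that word mode emits one term per segmented word. Each term occurrence adds 1 to that term's wdf, so the document length equals the number of term occurrences. The function names are made up for the example.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Term occurrences for a run of `n_chars` CJK characters under the
// assumed ngram scheme: n unigrams plus (n - 1) overlapping bigrams.
std::size_t doclen_ngrams(std::size_t n_chars) {
    return n_chars == 0 ? 0 : n_chars + (n_chars - 1);
}

// Term occurrences when the same run is split into words: one term per
// word, whatever its length. `word_lengths` holds characters per word.
std::size_t doclen_words(const std::vector<std::size_t>& word_lengths) {
    return word_lengths.size();
}

// Sanity helper: total number of characters covered by a segmentation.
std::size_t total_chars(const std::vector<std::size_t>& word_lengths) {
    return std::accumulate(word_lengths.begin(), word_lengths.end(),
                           std::size_t{0});
}
```

For a five-character run split into words of lengths {1, 1, 2, 1}, `doclen_ngrams(5)` is 9 while `doclen_words` is 4, so the same text contributes less than half as much to the document length under these assumptions.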

Seems odd that the cjkwords branch would make much of a difference when handling only non-CJK, especially that it would be faster. I wonder what's going on there.

It seems the ICU UTF8 handling is consistently slower, though perhaps that's fixable. E.g. the lookup table for character categories can be eliminated if we just make sure the constant values are aligned.

Contributor Author

@rsto rsto commented Feb 14, 2018

I've just pushed an updated version of this PR that's rebased on latest master, but it's not going to pass Travis because xapian-check-patch complains about a couple of issues. I updated the code to comply with the coding conventions in most places, but the check still fails for the following reasons:

Function naming convention of an external library

xapian-core/queryparser/cjk-tokenizer.cc:158: error: camelCase identifier
   'createWordInstance' - Xapian coding convention is to use lower case and
   underscores for variables and functions, and CamelCase for class names: +
   brk = icu::BreakIterator::createWordInstance(0/*unknown locale*/, err);

The createWordInstance function is part of the ICU library, so I don't know how to make this pass xapian-check-patch?

Unit tests exceeding the 80-character line limit

There are a bunch of errors like:

xapian-core/tests/api_queryparser.cc:723: error: Line extends beyond column 80
    (to column 93): +    { "title:久有 归 天愿", "(((XT久@1 AND XT有@1) OR 归@2) OR (天@3 AND 愿@3))" }

Should I line-break the unit tests of this PR, even if all lines above and below violate the same rule?

Contributor

@ojwb ojwb commented Feb 15, 2018

Contributor Author

@rsto rsto commented Feb 16, 2018

I've rewritten the history of this branch to make tracking the individual changes more straightforward. If this lands on master we could squash the commits again.

The code now passes xapian-check-patch on my workstation, after I added the following proposed changes to the patch checker:

  • Whitelisted the ICU-library camel-cased functions that the CJK word iterator requires.
  • Allowed excessively long lines in the term generator and query-parser test files. IMHO the test fixtures in these two files are quite text-centric, and breaking up lines for the sake of the 80 character line limit might make them less readable. Of course, I'll be happy to reformat the CJK tests if you decide otherwise.

I'm waiting for Travis to provide feedback but this should be all fine now.

Contributor Author

@rsto rsto commented Feb 16, 2018

Meh, Travis fails during the OS X checks; all other builds look fine. It looks as if the Homebrew package for libicu is broken:

libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I./common -I./include -I/usr/local/Cellar/icu4c/59.1/include -I/tmp/xapian-libsvm-fixed-include -Wall -W -Wredundant-decls -Wpointer-arith -Wcast-qual -Wcast-align -Wformat-security -fno-gnu-keywords -Wundef -Woverloaded-virtual -Wshadow -Wstrict-overflow=1 -Wmissing-declarations -Winit-self -Werror -fvisibility=hidden -fvisibility-inlines-hidden -Wno-error=reserved-user-defined-literal -std=gnu++11 -MT queryparser/cjk-tokenizer.lo -MD -MP -MF queryparser/.deps/cjk-tokenizer.Tpo -c queryparser/cjk-tokenizer.cc  -fno-common -DPIC -o queryparser/.libs/cjk-tokenizer.o
In file included from queryparser/cjk-tokenizer.cc:30:
In file included from queryparser/cjk-tokenizer.h:39:
In file included from /usr/local/Cellar/icu4c/59.1/include/unicode/unistr.h:32:
In file included from /usr/local/Cellar/icu4c/59.1/include/unicode/utypes.h:38:
In file included from /usr/local/Cellar/icu4c/59.1/include/unicode/umachine.h:46:
In file included from /usr/local/Cellar/icu4c/59.1/include/unicode/ptypes.h:52:
In file included from /usr/local/Cellar/icu4c/59.1/include/unicode/platform.h:25:
/usr/local/Cellar/icu4c/59.1/include/unicode/uvernum.h:128:5: error: 'U_PLATFORM_HAS_WINUWP_API' is not defined, evaluates to 0 [-Werror,-Wundef]
#if U_PLATFORM_HAS_WINUWP_API == 0
    ^
1 error generated.
make[3]: *** [queryparser/cjk-tokenizer.lo] Error 1

This and the OS X build error on my other PR make me wonder how best to get this through the Travis OS X builds.

Contributor

@ojwb ojwb commented Feb 17, 2018

Just going from that output, the issue appears to be that a preprocessor conditional uses a macro which isn't defined in the configuration of the ICU build we're using. That triggers a warning with -Wundef, and since we want to be warning-free we build with -Werror here, which turns the warning into an error.

Even if U_PLATFORM_HAS_WINUWP_API not being defined here is an expected situation, I'd argue this is a bug in ICU's headers - headers intended to be included from other people's code really should compile without warnings with common compiler warning flags enabled.

One common idiom to use instead is:

#if U_PLATFORM_HAS_WINUWP_API-0 == 0

Or the more obvious:

#if !defined U_PLATFORM_HAS_WINUWP_API || U_PLATFORM_HAS_WINUWP_API == 0

Or just ensure that the macro is actually always defined.

I don't really see a good workaround. Adding -Wno-undef to the compiler flags would mean we could miss warnings in our own code. Perhaps we can temporarily disable it with a #pragma while we include libicu, for compilers which support such a #pragma.
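
A minimal sketch of the #pragma idea, using the GCC/Clang diagnostic pragmas. `STANDIN_UNDEFINED_MACRO` is a made-up macro playing the role of `U_PLATFORM_HAS_WINUWP_API`; in real code the region between push and pop would hold the `#include` of the ICU header instead. One caveat to verify before relying on this: older g++ releases reportedly did not apply diagnostic pragmas to warnings raised by the preprocessor, so this may only help with Clang and recent GCC.

```cpp
// Suppress -Wundef only for the region that pulls in third-party code.
#if defined(__GNUC__) || defined(__clang__)
# pragma GCC diagnostic push
# pragma GCC diagnostic ignored "-Wundef"
#endif

// Stand-in for the offending header: tests a macro that is not defined.
// An undefined identifier in #if evaluates to 0, so the first branch is
// taken, but -Wundef would normally warn about it.
#if STANDIN_UNDEFINED_MACRO == 0
constexpr bool standin_macro_is_zero = true;
#else
constexpr bool standin_macro_is_zero = false;
#endif

// Restore the previous warning state so our own code is still checked.
#if defined(__GNUC__) || defined(__clang__)
# pragma GCC diagnostic pop
#endif
```

The push/pop pair keeps the suppression local, so an undefined macro used in the project's own sources after the pop would still be reported.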

Contributor Author

@rsto rsto commented Feb 19, 2018

I added a work-around for the bogus pre-processor definitions in ICU 59.1. I see that other BSD-based systems also have issues with this setting (e.g. here).

The build now fails with the same OS X build error in Travis as my other PR.

rsto added 4 commits Jan 26, 2017
The previous CJK token iterator has been refactored into CJK text
iterators to either break CJK text into ngrams or use word boundary
analysis. To enable the latter, the new flag FLAG_CJK_WORDS has been
added to both QueryParser and TermGenerator.

CJK ngrams remain default and are backwards compatible.

The word boundary analysis uses the ICU - International Components for
Unicode library (http://site.icu-project.org/).
Contributor

@ojwb ojwb commented Feb 16, 2019

I've rebuilt this branch onto current master at https://github.com/ojwb/xapian/tree/cjk_words_rebuilt and managed to eliminate both the virtual method call and use of dynamic_cast that was happening for each CJK token - that will probably speed things up a bit.

Currently the (new) emscripten CI build fails because ICU isn't available. I think we want to make it an optional dependency anyway, so I'll adjust that.

Contributor

@ojwb ojwb commented Feb 17, 2019

> I think we want to make it an optional dependency anyway, so I'll adjust that.

I've pushed a change to do that so CI will probably now pass.

@rsto Two questions:

  • The patch adds a cjk_flags parameter to MSet::snippet() - is there a reason I'm missing not to fold this into the existing flags parameter on that method? (Maybe that wasn't even there when you originally wrote this!)

  • Your commits use three different email addresses (paranoia.at, fastmail.com and fastmailteam.com) - I was thinking I should squash the commits which are fixing up earlier merge problems onto the original commit before merging this as having them separate seems more confusing than useful to the reader (history from development is arguably potentially useful, but these aren't that). However I wasn't sure which address you'd prefer to have on the squashed commit.

Contributor

@ojwb ojwb commented Feb 17, 2019

Just noticed this in the ICU docs:

> UText can be used with BreakIterator APIs (character/word/sentence/... segmentation). utext_openUTF8() creates a read-only UText for a UTF-8 string.

There's a caveat:

> Note: In ICU 4.4 and before, BreakIterator only works with UTF-8 [or any other charset with non-1:1 index conversion to UTF-16] if no dictionary is supported. This excludes Thai word break. See ticket #5532. No fix is currently scheduled.

However from the referenced ticket it seems the "No fix" part is out of date and this only applies to ICU 4.4 and earlier.

I've adjusted the code locally to take advantage of this and the testsuite passes. I'll benchmark to see how it actually compares - conceivably it could turn out to be slower.

Contributor

@ojwb ojwb commented Feb 17, 2019

(And ICU 4.6 was released in 2010 so requiring at least that seems entirely reasonable, especially as this is an optional dependency).

Contributor

@ojwb ojwb commented Feb 17, 2019

Testing with the same collection of CJK texts from Project Gutenberg as you were above, I seem to get a slight slow-down using UTF-8:

$ callgrind_annotate --inclusive=yes callgrind-master-cjkwords-utf16.out | grep internal.cc:Xapian::TermGenerator::Internal::index_text
2,425,500,348   635,318,113 242,984,577 40,865,191 19,189,150 1,098,391  1,734  22,493  75,775  /home/olly/git/xapian/xapian-core/queryparser/termgenerator_internal.cc:Xapian::TermGenerator::Internal::index_text(Xapian::Utf8Iterator, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) [/home/olly/git/xapian/xapian-core/.libs/libxapian-1.5.so.0.0.0]
$ callgrind_annotate --inclusive=yes callgrind-master-cjkwords-utf8.out | grep internal.cc:Xapian::TermGenerator::Internal::index_text
2,504,992,987   662,017,909 258,537,886 44,926,817 20,790,072 1,614,697  1,782  22,480  75,791  /home/olly/git/xapian/xapian-core/queryparser/termgenerator_internal.cc:Xapian::TermGenerator::Internal::index_text(Xapian::Utf8Iterator, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) [/home/olly/git/xapian/xapian-core/.libs/libxapian-1.5.so.0.0.0]

Will double-check to make sure.

Contributor

@ojwb ojwb commented Feb 17, 2019

Reading the linked ticket more carefully, a fix to disable use of the dictionary (and avoid the previous crash) was in ICU 4.8 (not 4.6) but a fix to support dictionaries went into ICU 54.1, released 2014-10-01. I think that's still fine for an optional dependency (if it was a required dependency I think we'd want to look at what ICU versions are packaged for popular platforms).

Contributor

@ojwb ojwb commented Feb 18, 2019

After some other optimisations, the UTF-8 change actually results in a slight speedup, so I've committed and pushed the branch.

CI is currently failing on Linux, probably because we're using Ubuntu trusty which has ICU 52.1:

Get:5 http://us-east-1.ec2.archive.ubuntu.com/ubuntu trusty-updates/main amd64 libicu52 amd64 52.1-3ubuntu0.8 [6,751 kB]

CI passes on macOS (the tests also pass locally).

We've stuck with trusty as it has approximately the oldest GCC we aim to support, so it is good for catching changes which stop us building with older GCC.

We could turn off libicu for the trusty builds and add a separate build on xenial with libicu enabled. Or maybe review the GCC baseline version and bump travis to use xenial - looks like trusty EOL is April 2019 so travis probably isn't going to keep supporting builds on trusty for more than a few months more anyway.

Contributor Author

@rsto rsto commented Feb 18, 2019

@ojwb Great to hear that this PR got traction!

> @rsto Two questions:
>
>   • The patch adds a cjk_flags parameter to MSet::snippet() - is there a reason I'm missing not to fold this into the existing flags parameter on that method? (Maybe that wasn't even there when you originally wrote this!)

I think the flags argument got added after I had written the patch. Let's just use that argument and remove cjk_flags. I guess that will require adding flags SNIPPET_CJK_NGRAM and SNIPPET_CJK_WORDS to the snippet enums? Do you want me to update the patch or will you take care of it?

>   • Your commits use three different email addresses (paranoia.at, fastmail.com and fastmailteam.com) - [...] However I wasn't sure which address you'd prefer to have on the squashed commit.

Let's please use the fastmailteam.com address, it's the one I use for maintaining our Xapian branch at FastMail.

Contributor

@ojwb ojwb commented Feb 19, 2019

> Let's please use the fastmailteam.com address, it's the one I use for maintaining our Xapian branch at FastMail.

OK, done and force pushed the rebased branch with fixups squashed. git diff of the branch head before and after the rebase showed no changes.

> I think the flags argument got added after I had written the patch. Let's just use that argument and remove cjk_flags. I guess that will require adding flags SNIPPET_CJK_NGRAM and SNIPPET_CJK_WORDS to the snippet enums? Do you want me to update the patch or will you take care of it?

I'm happy to do it. It probably does need appropriate flags adding.

I'm also thinking we should deprecate the XAPIAN_CJK_NGRAM environment variable - it was only intended as a temporary quick hack to allow enabling the CJK ngram code, and we've had a flag to replace it since 1.2.22 (released 2015-12-29).

Contributor

@ojwb ojwb commented Feb 23, 2019

While adding the flags, I found an existing bug.

Currently in theory you can set $XAPIAN_CJK_NGRAM and get snippet highlighting with CJK ngram terms. The snippet code checks that variable and does something different, but what it does is not right - the highlighting added is all empty start/end pairs at the end of the span of CJK characters containing the CJK ngram terms. To the user it'll typically look like it's selecting the end of the text and not highlighting anything.

I have a fix which I'm going to apply to master first, since it'll want backporting for 1.4.x whereas we concluded not to backport the CJK segmentation code.

Then I'll rebase the branch and resolve any clashes, then finish off the flags and get this merged.

Contributor

@ojwb ojwb commented Feb 24, 2019

Flags sorted; configure now checks the ICU version; added a note to INSTALL about ICU.

Still need to sort out CI, but I think that's all.

Contributor

@ojwb ojwb commented Feb 24, 2019

I was testing that building without ICU works, but found some of the cjk words tests weren't throwing FeatureUnavailableError as expected without ICU, but instead were returning the result you'd expect to get with ICU - that's highly suspect.

Investigating I found we're missing special handling inside quoted phrases - for ngram mode this splits the CJK into unigrams which happens to be also correct for FLAG_CJK_WORDS mode for the testcase currently on the branch, but is wrong in general and a tweaked version shows this:

Query: "久有归天愿"
api_queryparser.cc:801: ((parsed) == (expect))
Expected parsed and expect to be equal, were:
"Query((久@1 PHRASE 5 有@1 PHRASE 5 归@1 PHRASE 5 天@1 PHRASE 5 愿@1))"
"Query((久@1 PHRASE 4 有@1 PHRASE 4 归天@1 PHRASE 4 愿@1))"

I think this should be easy to fix.

@ojwb ojwb closed this in f881f0b Feb 24, 2019
Contributor

@ojwb ojwb commented Feb 24, 2019

I was looking at making codepoint_is_cjk() more efficient and noticed that this function omits some codepoint ranges which maybe ought to be included.

It is unchanged since it was first added in b578641 on 2011-08-24, which is back when Unicode 6.0.0 was the current version, and some of the missing ranges have been added in newer versions of Unicode (e.g. https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_Extension_F was added in 10.0), but https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_Extension_C was added in Unicode 5.2 and isn't included (but extensions A and B are). Possibly the list was actually prepared pre-5.2.

https://en.wikipedia.org/wiki/CJK_Unified_Ideographs lists all the Unified and compatibility blocks, and more vaguely says "Apart from the seven blocks of "Unified Ideographs," Unicode has about a dozen more blocks with not-unified CJK-characters. These are mainly CJK radicals, strokes, punctuation, marks, symbols and compatibility characters."

Here's an annotated version of the comment from the source:

//   2E80..2EFF; CJK Radicals Supplement
//   3000..303F; CJK Symbols and Punctuation
//   3040..309F; Hiragana
//   30A0..30FF; Katakana
//   3100..312F; Bopomofo
//   3130..318F; Hangul Compatibility Jamo
//   3190..319F; Kanbun
//   31A0..31BF; Bopomofo Extended
//   31C0..31EF; CJK Strokes
//   31F0..31FF; Katakana Phonetic Extensions
//   3200..32FF; Enclosed CJK Letters and Months
// C 3300..33FF; CJK Compatibility
// U 3400..4DBF; CJK Unified Ideographs Extension A
//   4DC0..4DFF; Yijing Hexagram Symbols
// U 4E00..9FFF; CJK Unified Ideographs
//   A700..A71F; Modifier Tone Letters
//   AC00..D7AF; Hangul Syllables
// C F900..FAFF; CJK Compatibility Ideographs
// C FE30..FE4F; CJK Compatibility Forms
//   FF00..FFEF; Halfwidth and Fullwidth Forms
// U 20000..2A6DF; CJK Unified Ideographs Extension B
// C 2F800..2FA1F; CJK Compatibility Ideographs Supplement

And possible additions:

// ? 1100..11FF; Hangul Jamo
// ? 2F00..2FDF; Kangxi Radicals
// ? A960..A97F; Hangul Jamo Extended-A
// ? D7B0..D7FF; Hangul Jamo Extended-B
// ? 1B000..1B0FF; Kana Supplement
// ? 1B100..1B12F; Kana Extended-A
// ? 1F200..1F2FF; Enclosed Ideographic Supplement
// + 2A700..2B73F; CJK Unified Ideographs Extension C
// + 2B740..2B81F; CJK Unified Ideographs Extension D
// + 2B820..2CEAF; CJK Unified Ideographs Extension E
// + 2CEB0..2EBEF; CJK Unified Ideographs Extension F

Key:

  • U - existing Unified block (based on the link above)
  • C - existing Compatibility block (based on the link above)
  • + - block we should probably add (based on the link above)
  • ? - blocks we should perhaps add (based on similarity to blocks already included)
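
One way such a check could be written is as a binary search over a sorted range table. This is a sketch only, not Xapian's actual codepoint_is_cjk() implementation: `CodepointRange`, `cjk_ranges_sketch` and `is_cjk_sketch` are hypothetical names. The table holds the blocks from the existing comment plus the four "+" blocks, with adjacent blocks merged into single ranges; the "?" blocks are left out.

```cpp
#include <algorithm>
#include <iterator>

struct CodepointRange {
    unsigned first;  // first code point in the range (inclusive)
    unsigned last;   // last code point in the range (inclusive)
};

// Sorted by `first`, non-overlapping. Adjacent blocks are merged, e.g.
// 3000..9FFF covers CJK Symbols and Punctuation up to and including
// CJK Unified Ideographs.
static const CodepointRange cjk_ranges_sketch[] = {
    {0x2E80, 0x2EFF},    // CJK Radicals Supplement
    {0x3000, 0x9FFF},    // 3000..303F through 4E00..9FFF (contiguous)
    {0xA700, 0xA71F},    // Modifier Tone Letters
    {0xAC00, 0xD7AF},    // Hangul Syllables
    {0xF900, 0xFAFF},    // CJK Compatibility Ideographs
    {0xFE30, 0xFE4F},    // CJK Compatibility Forms
    {0xFF00, 0xFFEF},    // Halfwidth and Fullwidth Forms
    {0x20000, 0x2A6DF},  // CJK Unified Ideographs Extension B
    {0x2A700, 0x2EBEF},  // Extensions C to F (the "+" blocks, contiguous)
    {0x2F800, 0x2FA1F},  // CJK Compatibility Ideographs Supplement
};

// Returns true if `ch` falls inside one of the ranges above.
bool is_cjk_sketch(unsigned ch) {
    // Find the first range starting after ch; the candidate range, if
    // any, is the one just before it.
    auto it = std::upper_bound(
        std::begin(cjk_ranges_sketch), std::end(cjk_ranges_sketch), ch,
        [](unsigned c, const CodepointRange& r) { return c < r.first; });
    if (it == std::begin(cjk_ranges_sketch)) return false;
    --it;
    return ch <= it->last;
}
```

With this table, U+4E45 (久) and U+2A700 are reported as CJK, while 'A', U+1100 (Hangul Jamo, a "?" block) and the unassigned gap at U+2A6E0 are not. Adding a "?" block would mean inserting its range at the right sorted position.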

@rsto Since you're actually using this in production, I wondered if you had any useful insights here? My knowledge of CJK is fairly limited.

Contributor Author

@rsto rsto commented Feb 26, 2019

@ojwb thanks for landing CJK word segmentation on master! Unfortunately I'm not much into CJK languages either. We very rarely get support questions for CJK search with the existing patch. I would add the additional Unicode blocks, they all look reasonable from their Wikipedia definitions.

Contributor

@ojwb ojwb commented Feb 27, 2019

OK, added in 0fe1cf6.


@lockywolf lockywolf commented Sep 30, 2020

Does this feature "just work" now (in xapian-core 1.4.17), without any special environment variables and such?

I have just built xapian-core 1.4.17 (2020-08-21), and find -exec ldd doesn't show any references to icu/icu4c or anything like that.

Contributor

@ojwb ojwb commented Sep 30, 2020

@lockywolf It's only on git master - I'm afraid we decided it wasn't really suitable for backporting (there's some discussion above).
