New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CJK word boundary iterator to query parser and term generator #114

Open
wants to merge 8 commits into
base: master
from

Conversation

Projects
None yet
2 participants
@rsto
Copy link
Contributor

rsto commented Jul 29, 2016

The CJK token iterator has been refactored to break CJK text either into ngrams or words. To enable the latter, a new flag FLAG_CJK_WORDS has been added to both QueryParser and TermGenerator.

Summary and caveats:

  • CJK ngrams remain default for backwards-compatibility (including the XAPIAN_CJK_NGRAM environment variable).
  • Word boundary analysis builds on the ICU - International Components for Unicode library. The autoconf builder has been adapted to require libicu, but this could be made optional.
  • Word boundaries might differ from other text analysis tools. Notably, for Japanese the ICU library results have been found to break some input text into more words than the Kuromoji morphological analyzer.

Unit tests pass for both ngrams and word boundaries:

Running tests with backend "none"...
./apitest backend none: All 128 tests passed.

@rsto rsto force-pushed the rsto:cjk_words branch from c7a0d41 to 9d76738 Oct 10, 2016

@rsto rsto force-pushed the rsto:cjk_words branch 2 times, most recently from 7068668 to 555392e Jan 9, 2017

@rsto

This comment has been minimized.

Copy link
Contributor

rsto commented Jan 9, 2017

Rebased to latest master following the snippet enhancement 4e56983

@rsto rsto force-pushed the rsto:cjk_words branch 7 times, most recently from ee54b3a to 8d2e548 Jan 18, 2017

@rsto

This comment has been minimized.

Copy link
Contributor

rsto commented Jan 19, 2017

This most probably will require some code indentation cleaning before merging into upstream. Except for that it's what we currently are using in production. Would be happy to get this into Xapian's upstream version as well.

@@ -105,8 +105,7 @@ lib_LTLIBRARIES = libxapian-1.5.la
libxapian_1_5_la_SOURCES = $(lib_src)
libxapian_1_5_la_LIBADD = $(XAPIAN_LIBS)
libxapian_1_5_la_LDFLAGS = \
$(XAPIAN_LDFLAGS) $(NO_UNDEFINED) -version-info $(LIBRARY_VERSION_INFO) $(lib_ldflags)

$(XAPIAN_LDFLAGS) $(NO_UNDEFINED) -version-info $(LIBRARY_VERSION_INFO) $(lib_ldflags) $(ICU_LIBS)

This comment has been minimized.

@ojwb

ojwb Jan 20, 2017

Contributor

Unless ICU_LIBS is misleadingly named, it ought to go in LIBADD not LDFLAGS.

PKG_CHECK_MODULES([ICU], [icu-uc])
AC_SUBST([ICU_CFLAGS])
AC_SUBST([ICU_LIBS])

AC_SUBST([XAPIAN_LDFLAGS])
AC_SUBST([XAPIAN_LIBS])

This comment has been minimized.

@ojwb

ojwb Jan 20, 2017

Contributor

Bad place to insert this - you've split up a logical group of lines...

@ojwb

This comment has been minimized.

Copy link
Contributor

ojwb commented Jan 20, 2017

I've not forgotten about this, but I still think it's something for master rather than 1.4.x, and merging patches and PRs which are suitable for 1.4.x seems more sensible to prioritise at this point in the release cycle.

The current ABI incompatibility could be addressed, but fundamentally it's a significant change on a hot code path so needs care to ensure it doesn't slow down indexing for everyone. There's also a longer-term strategy to decide - do we want to ditch our current Unicode support and use ICU everywhere? Or have a --with-icu/--without-icu configure option? ICU is rather large (it seems to be an order of magnitude larger than xapian-core), so not ideal for users in resource-constrained environments, but configure-time options are unhelpful for packagers - they have to make a single choice for all users of their package, or else maintain two variants of the package.

@rsto rsto force-pushed the rsto:cjk_words branch 3 times, most recently from 2225357 to b3e562c Jan 26, 2017

@rsto

This comment has been minimized.

Copy link
Contributor

rsto commented Jan 26, 2017

Thanks for review, I've updated the library flags and autoconf setup for ICU. I also reformatted the code to make it pass xapian-check-patch (thanks for that!).

In terms of strategy on how/if to add ICU to Xapian, I'd be tempted to make it an optional dependency (which admitedly this patch doesn't support, yet). Currently, it's only used for a niche feature along a specific code path. This might not warrant the efforts of reworking Xapian's own Unicode support, and users of CJK word break support could build Xapian from source.

Plus: most of the core string routines in ICU operate on UTF-16, which might add overhead to indexing.

@ojwb

This comment has been minimized.

Copy link
Contributor

ojwb commented Jan 27, 2017

Thanks for the updates.

Not sure I'd agree that CJK is niche. Ideally we should support CJK well out of the box, and there's certainly wider interest in that - for example, there's the n-gram code we already have, and a patch to add support for the SCWS Chinese segmentation library (https://trac.xapian.org/ticket/594).

I still don't think this is suitable for 1.4.x though - even if use of ICU were disabled by default, the patch makes changes to the existing CJK support - when indexing CJK as n-grams two method calls per token are now virtual calls - unless the compiler can see enough information and is smart enough to devirtualise those calls, that'll prevent inlining and so be slower. Probably not by a lot per call, but we're talking about a lot of calls.

And if people have to build from source anyway, having to build from a different source tree is not a big deal.

Xapian's current Unicode support is pretty much iterating over UTF-8 strings, and character categorisation and case conversion - it looks like ICU has such functionality which works without having to round-trip the strings via UTF-16 (http://userguide.icu-project.org/strings/utf-8). If they're significantly slower, that's not so good (though they may be significantly faster, which would be great). But having to round-trip to use most of the functionality does make ICU less appealing as an answer to general Unicode needs.

@rsto

This comment has been minimized.

Copy link
Contributor

rsto commented Feb 21, 2017

Getting @elliefm in the loop on this PR. Looking forward to make CJK-word segmentation happen :)

@ojwb

This comment has been minimized.

Copy link
Contributor

ojwb commented Feb 23, 2017

Xapian's current Unicode support is pretty much [...]

There's also converting to UTF-8 (append a given code-point to a char[] buffer or std::string.

If you want to help make this happen sooner, a useful thing to investigate would be speed. Ideally adding this support shouldn't make things slower for existing use cases (and I can't see a reason why it should inherently need to).

A simple test would be comparing the time taken for an unpatched version to index non-CJK text with a patched version indexing the same text. (The speed of indexing CJK text also matters, but the patched version is doing more so it's reasonable for it to take longer).

@rsto

This comment has been minimized.

Copy link
Contributor

rsto commented Mar 1, 2017

If you want to help make this happen sooner, a useful thing to investigate would be speed.

I've ran tests to compare performance for English and CJK text indexing.

Summary

The performance for non-CJK text is not impacted (the runtime performance differs by 0.5‰). For CJK text, word segmentation requires roughly 63% more CPU instructions than ngrams. Looking at the code it's clear why non-CJK text performance stays the same: the virtual functions only come into play, when CJK text is encountered!

But of course, this is just one test. I haven't looked for opportunities to optimise the word segmentation code.

How I measured

I'm measuring instruction counts as reported by Valgrind's callgrind module.

E.g. I ran callgrind as

$ valgrind --tool=callgrind --callgrind-out-file=callgrind-master-cjk.out --simulate-cache=yes simpleindex-1.5 db-master-cjk

and inspect the results as

$ callgrind_annotate --inclusive=yes callgrind-master-cjk.out
--------------------------------------------------------------------------------
Profile data file 'callgrind-master-cjk.out' (creator: callgrind-3.10.0)
--------------------------------------------------------------------------------
[...]
7,153,136,191 [...]  PROGRAM TOTALS

That is, I measure the cumulated Valgrind instruction counts for functions and descendants.

What I measured

I adapted Xapian's example/simpleindex.cc command line tool.

In contrast to the normal implementation, my instrumented version reads all its input into memory, then indexes it as a single document. A gist of the adapted version is here. For the master branch I chose FLAG_CJK_NGRAM, for the pull request FLAG_CJK_WORDS as flags.

For completeness, I asserted that both branches produce the same index:

$ xapian-delve-1.5 db-master-en > db-master-en.dump
$ xapian-delve-1.5 db-cjkwords-en > db-cjkwords-en.dump
$ diff db-master-en.dump db-cjkwords-en.dump
1c1
< UUID = 1556ea32-8de1-4c33-be00-c2de37f240e6
---
> UUID = 4765fe7f-649e-4719-8e54-665b0177e2dc

What I used as input

For non-CJK text I used Mark Twain's Huckleberry Fin:

613K Aug 17  2016 twain-76-0.txt

For CJK text I used a sample of Chinese texts, all concatenated into one stream read from stdin

 33K Nov 22 2007 buddha-23585-0.txt
 87K Jan 27 2014 confucius-23839-0.txt
 65K May  7 2008 daoshen-25366-0.txt
815K Apr 25 2008 shizhenwang-25162-0.txt
 42K Dec 15 2007 sunzi-23864-0.txt

All are UTF-8 encoded from Project Gutenberg

What I observed

I compared the cumulated instruction counts for the index_text method of the TermGenerator::Internal class.

The callgrind logs for the non-CJK text produced

$ callgrind_annotate --inclusive=yes callgrind-master-en.out | grep internal.cc:Xapian::TermGenerator::Internal::index_text
736,988,581 [...]  
$ callgrind_annotate --inclusive=yes callgrind-cjkwords-en.out | grep internal.cc:Xapian::TermGenerator::Internal::index_text
737,422,216 [...]

and for CJK text

$ callgrind_annotate --inclusive=yes callgrind-cjkwords-cjk.out | grep internal.cc:Xapian::TermGenerator::Internal::index_text
3,495,988,130 [...]
$ callgrind_annotate --inclusive=yes callgrind-master-cjk.out | grep internal.cc:Xapian::TermGenerator::Internal::index_text
2,143,674,128 [...]

I can share the callgrind log files, just let me know.

@rsto rsto force-pushed the rsto:cjk_words branch from b3e562c to 6e7e484 Mar 1, 2017

@rsto

This comment has been minimized.

Copy link
Contributor

rsto commented Mar 1, 2017

Addendum: of course this doesn't tell anything about performance if we replace the UTF8 implementation in Xapian with ICU. That's next on my list.

@ojwb

This comment has been minimized.

Copy link
Contributor

ojwb commented May 19, 2017

That's generally promising.

I've used cachegrind for this sort of profiling before, as it gives you estimated cycles taking into account CPU caches - does callgrind with --simulate-cache=yes do the same thing (and so when you say "CPU instructions" above I can read that as "cycles")?

For CJK text, word segmentation requires roughly 63% more CPU instructions than ngrams.

That 63% is against a baseline of FLAG_CJK_NGRAM on master, if I follow the above.

It seems reasonable for segmentation to be a bit more work, as it's presumably inherently harder than generating every ngram.

Segmentation should mean fewer terms get generated which should balance the extra work at least somewhat. If I follow, you're only comparing the time to generate terms (and add them to a Document object since that's done by index_text()). So we see the benefits from fewer terms for the Document object to handle, but not the savings in the database backend from the reduced number of terms generated.

If you still have the callgrind logs around, it would be interesting to also check the cycles taken by WritableDatabase::add_document() and WritableDatabase::commit() with CJK data.

Did you compare FLAG_CJK_NGRAM on master and the branch?

Addendum: of course this doesn't tell anything about performance if we replace the UTF8 implementation in Xapian with ICU. That's next on my list.

Did you get a chance to look at that yet?

@rsto rsto force-pushed the rsto:cjk_words branch from 6e7e484 to 54368e2 Jun 9, 2017

@rsto

This comment has been minimized.

Copy link
Contributor

rsto commented Jun 9, 2017

I haven't had a chance to replace the Xapian UTF8 implementation with ICU, yet. I am working on it right now.

Meanwhile, I've rebased against current master and reran tests. I've used both callgrind's --cache-sim=yes and --branch-sim=yes options to enable cache and branch prediction simulation. According to the Valgrind manual, Callgrind's cache simulation is based on that of Cachegrind (source).

Callgrind reports a substantial performance improvement in terms of instruction count for CJK word segmentation, thanks to the lower number of terms:

Master branch

Totals                              7,259,187,162
WritableDatabase::commit            1,448,765,395
WritableDatabase::add_document      3,259,209,541
TermGenerator::Internal::index_text 2,142,284,579

CJKWord branch

Totals                              4,877,429,310
WritableDatabase::commit              542,824,865
WritableDatabase::add_document        491,925,449
TermGenerator::Internal::index_text 3,488,865,005

I can send you the log files and test setup for this benchmark and the previous round. Just let me know if you would like to have a play with it.

Next step

I'll post more results, when I've prototyped UTF-8 support using ICU in Xapian.

(Edit: removed section "Are my tests broken")

@rsto

This comment has been minimized.

Copy link
Contributor

rsto commented Jun 23, 2017

I've finally managed to replace Xapian's internal Unicode implementation with one backed by ICU. It's in in separate branch: https://github.com/rsto/xapian/tree/utf8itor

I can't reproduce the dramatic performance increase of the CJK word segmentation branch from my previous post. This is puzzling and I am looking further into what's the cause.

Still, I now consistently get cycle counts that show that CJK word segmentation trumps over ngrams, whereas the Unicode implementations don't make much of a difference.

Branches:

  • master
  • cjkwords (with Xapian Unicode implementation)
  • utf8itor (with CJK word segmentation and ICU Unicode implementation).

Here are the results for English text

=== en
master:   1,043,095,910
cjkwords: 1,034,099,514
utf8itor: 1,049,401,671

CJK text

=== cjk
master:   5,047,169,100
cjkwords: 4,566,416,004
utf8itor: 4,574,959,785

and other languages (German, Norwegian, Russian)

=== misc
master:   1,531,033,367
cjkwords: 1,521,115,944
utf8itor: 1,537,557,810
@ojwb

This comment has been minimized.

Copy link
Contributor

ojwb commented Jun 27, 2017

To answer a question I think got deleted - in case it's still relevant, document length is the sum of the wdf for terms in the document, so I would expect segmentation to result in lower values for document length than n-grams.

Seems odd that the cjkwords branch would make much of a difference when handling only non-CJK, especially that it would be faster. I wonder what's going on there.

It seems the ICU UTF8 handling is consistently slower, though perhaps that's fixable. E.g. the lookup table for character categories can be eliminated if we just make sure the constant values are aligned.

@rsto rsto force-pushed the rsto:cjk_words branch 2 times, most recently from 7a84636 to 15c2735 Feb 14, 2018

@rsto

This comment has been minimized.

Copy link
Contributor

rsto commented Feb 14, 2018

I've just pushed an updated version of this PR that's rebased on latest master, but it's not going to pass Travis because xapian-check-patch complains about a couple of issues. I updated the code to comply with the coding conventions at most of the places, but the check currently still breaks due to the following reasons:

Function naming convention of an external library

xapian-core/queryparser/cjk-tokenizer.cc:158: error: camelCase identifier
   'createWordInstance' - Xapian coding convention is to use lower case and
   underscores for variables and functions, and CamelCase for class names: +
   brk = icu::BreakIterator::createWordInstance(0/*unknown locale*/, err);

The createWordInstance function is part of the ICU library, so I don't know how to make this pass xapian-check-patch?

Unit tests exceeding the 80-character line limit

There's a bunch of errors like

xapian-core/tests/api_queryparser.cc:723: error: Line extends beyond column 80
    (to column 93): +    { "title:久有 归 天愿", "(((XT久@1 AND XT有@1) OR 归@2) OR (天@3 AND 愿@3))" }

Should I line-break the unit tests of this PR, even if all lines above and below violate the same rule?

@ojwb

This comment has been minimized.

Copy link
Contributor

ojwb commented Feb 15, 2018

@rsto rsto force-pushed the rsto:cjk_words branch from 15c2735 to 4192c23 Feb 16, 2018

@rsto

This comment has been minimized.

Copy link
Contributor

rsto commented Feb 16, 2018

I've rewritten history of this branch to make tracking the individual changes more straight-forward. If this lands on master we could squash the commits again.

The code now passes xapian-check-patch on my workstation, after I added the following proposed changes to the patch checker:

  • Whitelisted the ICU-library camel-cased functions that the CJK word iterator requires.
  • Allowed excessive long lines in the term generator and query-parser test files. IMHO the test fixtures in these two files are quite text-centric, and breaking up lines for sake of the 80 character line limit might make them less readable. Of course, I'll be happy reformat the CJK tests if you decide otherwise.

I'm waiting for Travis to provide feedback but this should be all fine now.

@rsto

This comment has been minimized.

Copy link
Contributor

rsto commented Feb 16, 2018

Meh, Travis fails during the OS X checks, all other builds look fine. It looks as if the homebrew package for libicu is broken

libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I./common -I./include -I/usr/local/Cellar/icu4c/59.1/include -I/tmp/xapian-libsvm-fixed-include -Wall -W -Wredundant-decls -Wpointer-arith -Wcast-qual -Wcast-align -Wformat-security -fno-gnu-keywords -Wundef -Woverloaded-virtual -Wshadow -Wstrict-overflow=1 -Wmissing-declarations -Winit-self -Werror -fvisibility=hidden -fvisibility-inlines-hidden -Wno-error=reserved-user-defined-literal -std=gnu++11 -MT queryparser/cjk-tokenizer.lo -MD -MP -MF queryparser/.deps/cjk-tokenizer.Tpo -c queryparser/cjk-tokenizer.cc  -fno-common -DPIC -o queryparser/.libs/cjk-tokenizer.o
In file included from queryparser/cjk-tokenizer.cc:30:
In file included from queryparser/cjk-tokenizer.h:39:
In file included from /usr/local/Cellar/icu4c/59.1/include/unicode/unistr.h:32:
In file included from /usr/local/Cellar/icu4c/59.1/include/unicode/utypes.h:38:
In file included from /usr/local/Cellar/icu4c/59.1/include/unicode/umachine.h:46:
In file included from /usr/local/Cellar/icu4c/59.1/include/unicode/ptypes.h:52:
In file included from /usr/local/Cellar/icu4c/59.1/include/unicode/platform.h:25:
/usr/local/Cellar/icu4c/59.1/include/unicode/uvernum.h:128:5: error: 'U_PLATFORM_HAS_WINUWP_API' is not defined, evaluates to 0 [-Werror,-Wundef]
#if U_PLATFORM_HAS_WINUWP_API == 0
    ^
1 error generated.
make[3]: *** [queryparser/cjk-tokenizer.lo] Error 1

This and the OS X build error on my other PR makes me wonder how to best get this through the Travis OS X builds.

@ojwb

This comment has been minimized.

Copy link
Contributor

ojwb commented Feb 17, 2018

Just going from that output, it appears the issue is that a preprocessor conditional uses a macro which is not defined in the configuration used in the ICU build we're using, resulting in a warning with -Wunused, and we want to be warning clear so we use -Werror here so that's an error.

Even if U_PLATFORM_HAS_WINUWP_API not being defined here is an expected situation, I'd argue this is a bug in ICU's headers - headers intended to be included from other people's code really should compile without warnings with common compiler warning flags enabled.

One common idiom to use instead is:

#if U_PLATFORM_HAS_WINUWP_API-0 == 0

Or the more obvious:

#if !defined U_PLATFORM_HAS_WINUWP_API || U_PLATFORM_HAS_WINUWP_API == 0

Or just ensure that the macro is actually always defined.

I don't really see a good workaround. Adding -Wno-unused to the compiler flags would mean we could miss warnings in our own code. Perhaps we can temporarily disable it with a #pragma while we include libicu, for compilers which support such a #pragma.

@rsto

This comment has been minimized.

Copy link
Contributor

rsto commented Feb 19, 2018

I added a work-around for the bogus pre-processor definitions in ICU 59.1. I see that other BSD-based systems also have issues with this setting (e.g. here).

The build now fails with the same OS X build error in Travis as my other PR.

// work around missing U_PLATFORM_HAS_WINUWP_API on libicu MacOSX builds.
#ifndef U_PLATFORM_HAS_WINUWP_API
#define U_PLATFORM_HAS_WINUWP_API 0
#endif

This comment has been minimized.

@ojwb

ojwb Feb 19, 2018

Contributor

This seems dubious - on platforms which would define this, it seems it gets defined by <unicode/platform.h> but we haven't included that yet. So this is going to lead to a macro redefinition error on platforms which define this.

I think it would be best to just disable -Wunused while included these headers, like already happens for -Wold-style-cast above (and report this to ICU if it's not already been reported apparently already fixed upstream: http://bugs.icu-project.org/trac/ticket/13175).

This comment has been minimized.

@rsto

rsto Feb 21, 2018

Contributor

Fixed in a858fcb

@@ -36,6 +36,11 @@
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wold-style-cast"

This comment has been minimized.

@ojwb

ojwb Feb 19, 2018

Contributor

These should be conditional on #ifdef __GNUC__ to avoid "unknown pragma" warnings with other compilers. Looks like support for these were added in GCC 4.6 so we don't need to worry about what happens with older GCC versions.

This comment has been minimized.

@rsto

rsto Feb 21, 2018

Contributor

Fixed in a858fcb

#ifndef U_PLATFORM_HAS_WINUWP_API
#define U_PLATFORM_HAS_WINUWP_API 0
#endif

#include <unicode/unistr.h>
#include <unicode/brkiter.h>

This comment has been minimized.

@ojwb

ojwb Feb 19, 2018

Contributor

Generally we sort groups of includes (unless ICU requires this order, in which case "ick").

This comment has been minimized.

@rsto

rsto Feb 21, 2018

Contributor

Fixed in a858fcb

@rsto rsto force-pushed the rsto:cjk_words branch 2 times, most recently from 9aaf2e6 to a858fcb Feb 20, 2018

@rsto rsto force-pushed the rsto:cjk_words branch from a858fcb to 5400bb4 May 21, 2018

rsto added some commits Jan 26, 2017

Add CJK word boundary iterator to query parser and term generator
The previous CJK token iterator has been refactored into CJK text
iterators to either break CJK text into ngrams or use word boundary
analysis. To enable the latter, the new flag FLAG_CJK_WORDS has been
added to both QueryParser and TermGenerator.

CJK ngrams remain default and are backwards compatible.

The word boundary analysis uses the ICU - International Components for
Unicode library (http://site.icu-project.org/).

@rsto rsto force-pushed the rsto:cjk_words branch from 2965d58 to c3c6c67 Oct 30, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment