
A partial patch for better tokenisation of mixed Chinese numbers #220

Open: wants to merge 15 commits into master
Conversation


@outdream outdream commented Mar 8, 2019

This is a partial patch that tries to implement better tokenisation of mixed Chinese numbers,
so that a whole mixed Chinese number is treated as a single token.
It partly fixes the problem of 'mixed CJK numbers' mentioned in https://trac.xapian.org/ticket/699.
Some commented-out code remains for test cases I'm unsure about.

add some test cases about mixed CJK
numbers to the dataset test_simple;
the test in termgen1 would fail now
it is a partial patch for ticket #699 (mixed CJK numbers),
implemented in operator++ of CJKNgramIterator,
and it works only for mixed numbers that start with a Latin digit
where the previous character is a CJK character.
These limits stem from uncertainty about the requirements
and from performance considerations, so they may change quickly.
fix the indent style in file cjk-tokenizer.cc
replace the Chinese Unicode characters with
hex escapes, as an attempt to fix C2015 'too many characters in constant'
reported by AppVeyor
rollback the changes to cjk-tokenizer.cc
Add is_Chinese_digit() to check
whether a Unicode character is a Chinese digit.
The modification is placed within parse_term,
and a new test case is enabled.
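
(For illustration, such a helper might look roughly like the following; this is just a sketch, not necessarily the patch's actual code, and the set of characters covered is an assumption.)

```cpp
// Hypothetical sketch: check whether a Unicode code point is a common
// Chinese digit or power-of-ten character.
static bool
is_chinese_digit(unsigned ch)
{
    switch (ch) {
        case 0x3007: // 〇 zero
        case 0x96F6: // 零 zero
        case 0x4E00: // 一 one
        case 0x4E8C: // 二 two
        case 0x4E09: // 三 three
        case 0x56DB: // 四 four
        case 0x4E94: // 五 five
        case 0x516D: // 六 six
        case 0x4E03: // 七 seven
        case 0x516B: // 八 eight
        case 0x4E5D: // 九 nine
        case 0x5341: // 十 ten
        case 0x767E: // 百 hundred
        case 0x5343: // 千 thousand
        case 0x4E07: // 万 ten thousand
            return true;
    }
    return false;
}
```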

ojwb commented Mar 13, 2019

Aside from the issues noted above (which are mostly stylistic), this seems very promising.

The QueryParser code needs updating similarly (as you're probably aware, since you marked this as a "partial patch").

I wonder if we should also handle "pure" Chinese numbers as a single token in the same code? It seems like splitting them into n-grams (as we currently do) will result in a lot of false matches without adding useful extra matches. E.g. a Chinese document talking about 1919 and 2011 will match a query for 2019.


ojwb commented Mar 14, 2019

Also, a bit of admin - are you happy to license your contributions as described here?

https://trac.xapian.org/browser/git/xapian-core/HACKING#L1403


outdream commented Mar 14, 2019

Thanks for your code review.

I will first fix the stylistic issues and the misuse of is_digit(),
then update the code in QueryParser,
and finally try to add support for pure Chinese numbers.

I'm glad to license my contributions :).

Test cases like '二千3百' were added,
to check cases that are not expected to be affected.
is_chinese_digit() has been moved to cjk-tokenizer.
Add test cases to queryparse1, which are similar to the ones in termgen1.
Modify queryparser.lemony to extract mixed Chinese numbers as one token.

ojwb commented Apr 1, 2019

Thanks for the updates, and sorry for not getting back to this for a while.

This looks complete to me for handling these mixed numbers (which is what the ticket was about), but I'm not sure if we want to merge as-is, or try to come up with a broader solution.

I've been trying to get a better understanding of numbers in Chinese by reading https://en.wikipedia.org/wiki/Chinese_numerals#Whole_numbers and reviewing the IRC discussion we had.

The examples you gave of the digit and power-of-ten characters in contexts where they have other meanings (such as 九九归一) seem to be patterns which aren't valid as numbers, except for single-character numbers (which we can probably leave to ICU's word boundary finding code to discriminate).

The valid pattern for a number looks like it can be expressed as a fairly simple state machine - in regexp terms it's along the lines of this (but extended for more powers of 10), with the additional requirement that the matched string be at least two characters long:

(<digit>?<1000>)?(<digit>?<100>)?(<digit>?<10>)<digit>?

And we should be able to convert that to a common form for indexing (so e.g. 四十五 is indexed as the term 45, and a search for 45 or 四十五 will match documents containing either, and 4十5 could be handled in the same way for little extra work).
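
For concreteness, the conversion step might look something like the sketch below (just a rough illustration assuming only the units 十/百/千 and simplified characters, not a proposed implementation):

```cpp
#include <map>
#include <vector>

// Sketch: convert a sequence of code points already matched by the pattern
// above into its Arabic value, e.g. 四十五 -> 45, 二百零五 -> 205.
static int
chinese_to_arabic(const std::vector<unsigned>& codepoints)
{
    static const std::map<unsigned, int> digit = {
        {0x96F6, 0}, {0x4E00, 1}, {0x4E8C, 2}, {0x4E09, 3}, {0x56DB, 4},
        {0x4E94, 5}, {0x516D, 6}, {0x4E03, 7}, {0x516B, 8}, {0x4E5D, 9}
    };
    static const std::map<unsigned, int> unit = {
        {0x5341, 10}, {0x767E, 100}, {0x5343, 1000}
    };
    int total = 0, pending = 0;
    for (unsigned ch : codepoints) {
        auto d = digit.find(ch);
        if (d != digit.end()) {
            pending = d->second;
            continue;
        }
        auto u = unit.find(ch);
        if (u != unit.end()) {
            // A bare unit means one of that unit, e.g. 十五 = 15.
            total += (pending ? pending : 1) * u->second;
            pending = 0;
        }
    }
    return total + pending; // any trailing digit is the ones place
}
```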

But I may well be missing something that means this just isn't going to work that easily.

Thoughts?


outdream commented Apr 1, 2019

Never mind. And I have to say, the wiki article surprises me a lot; it seems pretty complete and wonderful!
The GFW really keeps me away from such great resources :(

However, I think maybe the pattern is not enough, as some important cases don't match:

  1. Interior zeroes
    For example, 205 | [2] [100] [0] [5] | 二百零五,
    where the interior [0] has no following unit like [10] or [100].

  2. Year numbers
    As in the example you gave before,
    "1919 and 2011 will match a query for 2019",
    year numbers in Chinese are written as a run of plain digits, e.g. 1919 -> [1][9][1][9].
    I'm not sure (<digit>+)<year> could handle most cases.

These are also mentioned in the wiki, with more details. And I'm afraid the pattern would get even more complex if we want to support fractional values or other languages like Japanese and Korean.

For the common form of numbers, if we decide to implement it, I suggest indexing a number under both its original form and its common form (combined with OR in the query?) when they differ. That way, the language the number is written in also contributes to the final weight.
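
To illustrate the dual-form idea with the public Xapian API (just a sketch, not part of this patch; the terms shown are examples):

```cpp
#include <xapian.h>

// Indexing: add both the surface form and the normalised common form as terms.
void index_number(Xapian::Document& doc) {
    doc.add_term("四十五"); // original form as written in the document
    doc.add_term("45");     // common (normalised) form
}

// Query: search for either form, so a query in one script matches both.
Xapian::Query number_query() {
    return Xapian::Query(Xapian::Query::OP_OR,
                         Xapian::Query("四十五"),
                         Xapian::Query("45"));
}
```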


outdream commented Apr 1, 2019

Besides, could you tell me whether Xapian supports custom tokenizers? I think that would be a way to solve many cases like this. (Or does such a facility already exist? Sorry in advance if I carelessly missed it; if that's the case, please ignore my verbose text below...)

Below are my thoughts on this question. I'm not very confident about them, so please share your comments and forgive me if I'm wrong:

I think the tokenisation of CJK languages is pretty complex, and I have found many libraries that concentrate on this problem for Chinese (I know little about the situation for Japanese and Korean...). ICU gives a universal solution to the problem, but within a specific language there may be other special cases like the numbers.

I think fixing these cases patch by patch is laborious and unsuitable, since Xapian is more concerned with search technology than with language processing. Allowing users to plug in a localised tokenizer library can't solve every problem, but it could improve things in more cases.


ojwb commented Apr 4, 2019

A custom tokeniser is already possible - both TermGenerator and QueryParser could be implemented using only public API functions (I'm not sure if the current implementations actually are but there's nothing special they use which isn't available via public APIs). So someone could write a CJKQueryParser which parses a query string in some way and builds a Query object tree from it, and a CJKTermGenerator to index text in a compatible way.

So far for the API we've aimed to instead have a single class with flags to control behaviour, which seems more flexible as we can automatically switch mode to handle a document which is partly in Chinese and partly in English - with separate classes the user would need to scan the text first and split it up, then pass the different parts to different classes.
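
For example, with the current API that single-class approach looks like this in use (a sketch assuming the FLAG_CJK_NGRAM flag available in recent releases):

```cpp
#include <xapian.h>
#include <string>

// One TermGenerator handles text that mixes English and Chinese once the
// CJK n-gram flag is enabled; no pre-splitting of the text is needed.
void index_mixed_text(Xapian::WritableDatabase& db, const std::string& text) {
    Xapian::TermGenerator termgen;
    termgen.set_flags(Xapian::TermGenerator::FLAG_CJK_NGRAM);
    Xapian::Document doc;
    termgen.set_document(doc);
    termgen.index_text(text);
    db.add_document(doc);
}
```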

I'm not sure a custom tokeniser really makes the problem easier to solve though - someone still needs to write the tokenisation code, whether it's in QueryParser and TermGenerator or in a custom tokeniser.


outdream commented Apr 6, 2019

Thanks, I got it.

For the custom tokenizer, sorry for my ambiguous words. I meant extracting a framework and public interfaces to let users easily use other libraries like ICU, but ones more localised to a specific language, to support that language better. But considering the adaptation code would also be hard work, maybe that's not a good idea... It isn't strictly related to this issue, my fault.

And for the API, is what you want a common number tokenizer which would handle Chinese numbers when the CJK flag is on? As the current code doesn't match that design, I don't know whether I should close this branch or leave it open for further discussion. And maybe ticket #699 on trac should also be updated.


ojwb commented May 3, 2019

Sorry for the slow response - I've been a bit backlogged with patches lately.

> For the custom tokenizer, sorry for my ambiguous words. I meant extracting a framework and public interfaces to let users easily use other libraries like ICU, but ones more localised to a specific language, to support that language better. But considering the adaptation code would also be hard work, maybe that's not a good idea... It isn't strictly related to this issue, my fault.

Oh I see. No, that's a good question.

I think the hard part is probably doing it in a way which is sufficiently flexible but doesn't impose any significant overhead. If turning text into terms during indexing is slow, that makes indexing slower, which potentially limits the amount of data that can be searched. It's probably less of a concern during query parsing, just because queries are typically much shorter.

> And for the API, is what you want a common number tokenizer which would handle Chinese numbers when the CJK flag is on? As the current code doesn't match that design, I don't know whether I should close this branch or leave it open for further discussion. And maybe ticket #699 on trac should also be updated.

The ideal of turning different representations of a number into a common form for indexing still seems appealing, though it seems quite complex for Chinese. But your idea of indexing the original form too would mean that it would be OK if the common-form handling didn't handle all cases (so long as it didn't mishandle them in problematic ways). And over time people can add versions for other scripts with their own way of writing numbers (like Arabic). Even Roman numerals could be handled, e.g. MCMLXXIV -> 1974.
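
(As an illustrative sketch of that kind of normalisation, and not something in this patch, converting a Roman numeral is straightforward:)

```cpp
#include <map>
#include <string>

// Sketch: map a Roman numeral such as "MCMLXXIV" to its value, 1974.
static int
roman_to_arabic(const std::string& s)
{
    static const std::map<char, int> value = {
        {'I', 1}, {'V', 5}, {'X', 10}, {'L', 50},
        {'C', 100}, {'D', 500}, {'M', 1000}
    };
    int total = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        int v = value.at(s[i]);
        // A smaller value before a larger one is subtractive (IV = 4, CM = 900).
        if (i + 1 < s.size() && v < value.at(s[i + 1]))
            total -= v;
        else
            total += v;
    }
    return total;
}
```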

I'm leaning towards thinking we should merge what we have here for now, as it seems an improvement over the current behaviour even if we can dream of grander solutions.

And yes, #699 should be updated once we decide about this patch.

@outdream
Author

Sorry, I've been busy with my courses these days and didn't get back to this in time...
Thanks for your approval; I'm glad my efforts will be helpful.
