Wrong ids returned for docs using non-latin chars as id #120

heshiming opened this Issue May 28, 2011 · 2 comments

I discovered that in certain situations couchdb-lucene does not return the correct document id. The problem seems to occur specifically when document ids contain non-Latin characters.

I've created a minimal script to reproduce the issue. It is written in Python and depends on couchdb-python 0.8. The script will:

  1. delete and recreate a database named 'couchdb-lucene-test' at localhost:5984 with no password
  2. upload a design document with one "fulltext" index function that indexes doc.tag, which is an array of strings
  3. create sample records carrying the tags, with ids built from a fixed template
  4. query couchdb-lucene for each tag and check that every returned id exists in the database

Here is the source:

# -*- coding: utf-8 -*-

import random
import couchdb
from couchdb.mapping import Document

# setup database
print 'Setting up database ...'
server = couchdb.client.Server()
if 'couchdb-lucene-test' in server:
    del server['couchdb-lucene-test']
db = server.create('couchdb-lucene-test')

# design document for the sample
design_doc = {
    "_id": "_design/sample",
    "fulltext": {
        "by.tag": {
            "index": "function(doc) { var ret = new Document(); for (var i = 0; i < doc.tag.length; i++) ret.add(doc.tag[i], {'field': 'tag'}); return ret; }"
        }
    },
    "views": {
        "enumerate.tags": {
            "map": "function(doc) { if (doc.tag) { for (var i = 0; i < doc.tag.length; i++) emit(doc.tag[i], 1); } }",
            "reduce": "function(keys, values) { return sum(values); }"
        }
    }
}
# upload the design document
db.save(design_doc)

print '-'*40
print 'Populating sample records ...'
# populate some records
tags = [
     [u'\u6807\u51c6\u957f\u5ea6', u'\u6807\u51c6\u957f\u5ea6\u6807', u'\u6807\u51c6\u957f\u5ea6\u6807\u51c6', u'\u6807\u51c6\u957f\u5ea6\u6807\u51c6\u957f', u'\u6807\u51c6\u957f\u5ea6\u6807\u51c6\u957f\u5ea6']
]

passes = 10
for count in range(1, passes+1):
    for tag in tags:
        doc = {
            "_id": 'psst'+str(count) + '-' + '-'.join(tag) + '-' + str(random.randint(10000, 99999)),
            "tag": tag
        }
        # store the sample record
        db.save(doc)

print '-'*40
print 'Enumerating tags ...'
enumed_tags = []
for row in db.view('sample/enumerate.tags', group=True):
    # each row key is one distinct tag
    enumed_tags.append(row.key)
print len(enumed_tags), 'tags'

print '-'*40
print 'Running couchdb-lucene query ...'

# index cleanups
(status, headers, results) = \
    db.resource('_fti','_design', 'sample', 'by.tag', '_optimize').post_json()

for tag in enumed_tags:
    query = 'tag:"' + tag + '"'
    (status, headers, results) = \
        db.resource('_fti','_design','sample','by.tag').get_json(q=query, limit=1000)
    if results['total_rows'] > 0:
        print query + ' -> ' + str(results['total_rows']) + ' / ' + str(len(results['rows']))
        for row in results['rows']:
            if row['id'] not in db:
                print '* Error *, ' + row['id'] + ' is not found in db'

The script uses Chinese characters in the document ids. I ran it on Ubuntu Server 10.04 with Python 2.6.5 (via aptitude), CouchDB 1.0.1 (also via aptitude), and the current trunk of couchdb-lucene.

Here is the output I got:

Setting up database ...
Populating sample records ...
Enumerating tags ...
5 tags
Running couchdb-lucene query ...
tag:"标准长度" -> 10 / 10
tag:"标准长度标" -> 10 / 10
tag:"标准长度标准" -> 10 / 10
* Error *, psst9-标准长度-标准长度标标-标准长度标准-标准长度标准长-标准长度标准长度-41958 is not found in db
tag:"标准长度标准长" -> 10 / 10
* Error *, psst9-标准长度-标准长长度标-标准长度标准-标准长度标准长-标准长度标准长度-41958 is not found in db
tag:"标准长度标准长度" -> 10 / 10
* Error *, psst9-标准长度-标准长度标-标标准长度标准-标准长度标准长-标准长度标准长度-41958 is not found in db

Sorry for the Chinese characters. The * Error * line is printed only when an id returned by couchdb-lucene cannot be found in the database. There is a pattern in the errors: each bad id has one (and only one) character repeated once in the non-Latin part. The 3 errors contain 3 different repeated characters, and they always involve this particular id. It's not random.
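
The "one character repeated once" pattern can be checked mechanically instead of by eye. Below is a minimal sketch, written for Python 3 and independent of CouchDB; the helper name find_doubled_char is hypothetical, and the ids are copied from the output above. It reports the position and character of the single duplication, or None if the returned id differs from the stored id in some other way.

```python
# -*- coding: utf-8 -*-

def find_doubled_char(expected_id, returned_id):
    """Return (index, char) if returned_id equals expected_id with exactly
    one character repeated once; return None otherwise."""
    if len(returned_id) != len(expected_id) + 1:
        return None
    for i, ch in enumerate(expected_id):
        # Rebuild the id with the character at position i duplicated.
        candidate = expected_id[:i] + ch + expected_id[i:]
        if candidate == returned_id:
            return i, ch
    return None


# The id that was actually stored, per the template used by the script.
tags = ['标准长度', '标准长度标', '标准长度标准', '标准长度标准长', '标准长度标准长度']
expected = 'psst9-' + '-'.join(tags) + '-41958'

# The three ids reported as "not found in db" in the output above.
returned_ids = [
    'psst9-标准长度-标准长度标标-标准长度标准-标准长度标准长-标准长度标准长度-41958',
    'psst9-标准长度-标准长长度标-标准长度标准-标准长度标准长-标准长度标准长度-41958',
    'psst9-标准长度-标准长度标-标标准长度标准-标准长度标准长-标准长度标准长度-41958',
]

for bad_id in returned_ids:
    print(find_doubled_char(expected, bad_id))
```

Running a check like this over every row that fails the lookup would show whether all failures fit the same single-duplication pattern.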

If you look closely, the script writes 10 documents whose ids all follow the same template, but only 1 id ever fails. I tried this many times.

The script is very close to the development setup where I first discovered this problem. I tried a few combinations, and it appears that the document id is the problem, not the searching (indexing?) itself.

Across all the combinations I tried, failures like the one this script reproduces occur in less than 5% of cases. Not all non-Latin character combinations fail, only a small portion of them, and the ids don't have to be this long. So far, I cannot reproduce this with Latin-character doc ids.

I'm wondering whether this is a bug or something I did wrong. I'm not that familiar with Lucene; which module should I look into to find the cause of this problem?


rnewson commented May 28, 2011

This could be related to utf-8 bugs in couchdb itself. I'll investigate, thanks.

@rnewson rnewson closed this in 0f64894 May 28, 2011

Wow, very sorry about that. I thought db.resource would take care of encoding automatically. Thank you very much for your time!

@mmm444 mmm444 pushed a commit to mmm444/couchdb-lucene that referenced this issue Jul 19, 2011

Robert Newson encode POST and PUT bodies in UTF-8 (closes #120) 53f99c1
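
The commit message above mentions encoding request bodies in UTF-8. As a general illustration only (this is not the actual patch, and the thread does not spell out the exact failure), the Python 3 sketch below shows why explicit encoding matters for non-Latin ids: the character count and the UTF-8 byte count differ, so a body or length computed from characters does not match what is sent on the wire. The helper name encode_json_body and the {'keys': [...]} payload are hypothetical.

```python
# -*- coding: utf-8 -*-
import json


def encode_json_body(payload):
    """Serialize payload to JSON and encode it explicitly as UTF-8.

    Returns (body_bytes, headers); Content-Length counts bytes, not characters.
    """
    body = json.dumps(payload, ensure_ascii=False).encode('utf-8')
    headers = {
        'Content-Type': 'application/json; charset=utf-8',
        'Content-Length': str(len(body)),
    }
    return body, headers


# Hypothetical payload: a short id in the same style as the ids in the report.
doc_id = 'psst9-\u6807\u51c6\u957f\u5ea6'
body, headers = encode_json_body({'keys': [doc_id]})

# Characters versus bytes: each of these Chinese characters is 3 bytes in UTF-8.
print(len(doc_id), len(doc_id.encode('utf-8')))
print(headers['Content-Length'])
```

Decoding the body with the same encoding on the receiving side should round-trip the id unchanged, which gives a quick sanity check for either end of the connection.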