
documents are silently dropped under parallel load #16

@ghost

Description

This may ultimately be a redis issue, not a redisearch issue, but...

Short version

(see below for full details and self-contained example)

When indexing tens of thousands of documents batched in XML files serially, indexing proceeds as expected, like this:

ls *.xml | xargs -n1 -P1 ../myIndexingScript.py 

But when run in parallel like this, some documents (about one per thousand) are silently dropped:

ls *.xml | xargs -n1 -P4 ../myIndexingScript.py 

Details

Given this indexing script:

#!/usr/bin/python

from redis import ResponseError
from redisearch import Client, TextField
import sys
import re

# connect to the index (the name matches the FT.INFO/FT.DROP commands below)
client = Client('bugdemo')
indexer = client.batch_indexer(chunk_size=500)  # created but not used below; documents are added one at a time
count = 0

# create index if it does not already exist
try:
    client.create_index([TextField('abstract')])
except ResponseError:
    pass

# load the datafile
with open(sys.argv[1], 'r') as f:
    data = f.read()
recs = data.split("<PubmedArticle>")
recs = recs[1:]  # discard preamble

for r in recs:
    # extract the ID and abstract
    pmid = re.findall('<PMID Version="1">(.*?)</PMID>', r)[0]
    abstract = re.findall(r'<Abstract>([\s\S]*?)</Abstract>', r)
    if abstract:
        abstract = abstract[0]
    else:
        abstract = ""

    # index the document
    res = client.add_document(pmid, abstract=abstract)
    if res == "OK":
        count = count+1

print(str(count) + " records (ostensibly) indexed.")
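
To rule out the client silently swallowing an error, I would also try a noisier variant along these lines (a sketch only; it assumes the same 'bugdemo' index and the same redisearch-py calls used above, and records every add_document call that raises or does not return "OK"):

#!/usr/bin/python
# Sketch: same indexing loop as above, but per-document failures are
# collected and printed instead of passing unnoticed.
# (Assumes the index already exists, as created by the script above.)

from redis import ResponseError
from redisearch import Client
import sys
import re

client = Client('bugdemo')

def index_record(rec):
    """Index one <PubmedArticle> record; return (pmid, error or None)."""
    pmid = re.findall('<PMID Version="1">(.*?)</PMID>', rec)[0]
    found = re.findall(r'<Abstract>([\s\S]*?)</Abstract>', rec)
    abstract = found[0] if found else ""
    try:
        res = client.add_document(pmid, abstract=abstract)
        return (pmid, None if res == "OK" else repr(res))
    except ResponseError as e:  # e.g. "Document already in index"
        return (pmid, str(e))

with open(sys.argv[1], 'r') as f:
    recs = f.read().split("<PubmedArticle>")[1:]

failures = [x for x in (index_record(rec) for rec in recs) if x[1] is not None]
print("%d records processed, %d reported failures" % (len(recs), len(failures)))
for pmid, err in failures:
    print("  %s: %s" % (pmid, err))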

I can successfully index tens of thousands of medline records batched in XML files (each XML file contains 30,000 records). For example, if we fetch some medline XML files:

# The following assumes the above script is saved as bugdemo.py in the current directory and is executable

# create a working directory
mkdir sandbox
cd sandbox

# fetch (partial) filelist
ROOT=ftp.ncbi.nlm.nih.gov/pubmed/baseline
wget $ROOT
grep -o '>medline.*gz' baseline | uniq |  grep -o medline.* | head -n4 | xargs -n1 -I% echo $ROOT/% > filelist

# fetch files 
wget -i filelist
gunzip *.gz

I can then index them:

ls *.xml | xargs -n1 -P1 ../bugdemo.py 
redis-cli FT.info bugdemo | grep num_docs -A1 

And this produces 120,000 documents as expected. However, most of the compute time is actually spent reading and parsing the documents prior to indexing, so although redis is single-threaded, I get a big performance boost by running the indexing in parallel, like this (note the -P4):

redis-cli FT.drop bugdemo
ls *.xml | xargs -n1 -P4 ../bugdemo.py 
redis-cli FT.info bugdemo | grep num_docs -A1 

Unfortunately, this results in almost, but not quite, 120,000 indexed documents (I typically end up 100-200 documents short).
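
The shortfall can be quantified exactly, rather than eyeballed from the FT.INFO output, with a check along these lines (a sketch; it assumes the *.xml files are still in the current directory and a default localhost redis connection, and it issues FT.INFO through redis-py's generic execute_command):

#!/usr/bin/python
# Sketch: compare the number of <PubmedArticle> records in the source XML
# files against num_docs as reported by FT.INFO for the 'bugdemo' index.
import glob
import redis

expected = 0
for path in glob.glob('*.xml'):
    with open(path, 'r') as f:
        expected += f.read().count('<PubmedArticle>')

r = redis.StrictRedis()
reply = r.execute_command('FT.INFO', 'bugdemo')
info = dict(zip(reply[0::2], reply[1::2]))  # FT.INFO replies with alternating names/values
print("expected %d, num_docs %s" % (expected, info['num_docs']))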

It seems to me that (1) this should not be the case and (2) if it IS the case, an error should be thrown. Any insights into why this is happening?
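
To see which documents are actually being dropped, one approach is to check every PMID against redis after the parallel run. This is only a sketch: it assumes the RediSearch 0.x default of storing each added document as a hash whose key is the document id (here the PMID), so a missing key means a dropped document; it would not apply if NOSAVE were used.

#!/usr/bin/python
# Sketch: list PMIDs that appear in the source XML files but have no
# corresponding document hash in redis after indexing.
import glob
import re
import redis

r = redis.StrictRedis()
missing = []
for path in glob.glob('*.xml'):
    with open(path, 'r') as f:
        recs = f.read().split('<PubmedArticle>')[1:]
    for rec in recs:
        pmid = re.findall('<PMID Version="1">(.*?)</PMID>', rec)[0]
        if not r.exists(pmid):
            missing.append(pmid)

print("%d documents missing from the index" % len(missing))
for pmid in missing:
    print(pmid)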

Environment details
(Note that the unstable version of redis is used, as advocated by the redisearch Quick Start.)
Ubuntu 16.04.2 LTS (running on an r4.2xlarge AWS EC2 instance, 60GB RAM, 100GB root disk)
Python 2.7.12
Redis server v=999.999.999 sha=00000000:0 malloc=jemalloc-4.0.3 bits=64 build=523270dd92165bcf
redisearch 0.21.4
redisearch-py 0.6.3
