Skip to content
/ Qchunk Public

Source code for Qchunk work that published in the SIGMOD 2011 and TODS 2015

License

Notifications You must be signed in to change notification settings

qinbill/Qchunk

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 

Repository files navigation

Qchunk similarity search source code

Basic information

  • Source code author: Jianbin Qin
  • Version: 0.1
  • Contact: jqin@inf.ed.ac.uk
  • More information: http://qinjianbin.com/

Introduction:

Verdict:

This work pionerred the trend of using Q-Chunk, fixed length non-overlapping gram, as the signatures (features) generator to perform efficient string similarity query processing. The fixed length non-overlapping scheme has been proven effective by many researchers. There are two versions of QChunk papers and six versions of algorithms:

  • In SIGMOD version, we only studied the basis IndexChunk and IndexGram.
  • In the Journal version, we further improve this IndexChunk and IndexGram to Turbo and Super charged version by constraint the QChunk selection space.
  • Super charged IndexGram is the default version in this source code. Please compare this version, i.e. IndexGramSuper

Journel Version:

  • Title: Asymmetric signature schemes for efficient exact edit similarity query processing
  • Authors: Asymmetric signature schemes for efficient exact edit similarity query processing
  • Published in TODS, 2013

Abstract:

Given a query string Q, an edit similarity search finds all strings in a database whose edit distance with Q is no more than a given threshold t. Most existing methods answering edit similarity queries employ schemes to generate string subsequences as signatures and generate candidates by set overlap queries on query and data signatures.

In this article, we show that for any such signature scheme, the lower bound of the minimum number of signatures is t + 1, which is lower than what is achieved by existing methods. We then propose several asymmetric signature schemes, that is, extracting different numbers of signatures for the data and query strings, which achieve this lower bound. A basic asymmetric scheme is first established on the basis of matching q-chunks and q-grams between two strings. Two efficient query processing algorithms (IndexGram and IndexChunk) are developed on top of this scheme. We also propose novel candidate pruning methods to further improve the efficiency. We then generalize the basic scheme by incorporating novel ideas of floating q-chunks, optimal selection of q-chunks, and reducing the number of signatures using global ordering. As a result, the Super and Turbo families of schemes are developed together with their corresponding query processing algorithms. We have conducted a comprehensive experimental study using the six asymmetric algorithms and nine previous state-of-the-art algorithms. The experiment results clearly showcase the efficiency of our methods and demonstrate space and time characteristics of our proposed algorithms.

Conference Version:

  • Title: Efficient exact edit similarity query processing with the asymmetric signature scheme
  • Authors: Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, Xuemin Lin
  • Published in SIGMOD, 2011

Abstract:

“Given a query string Q, an edit similarity search finds all strings in a database whose edit distance with Q is no more than a given threshold t. Most existing method answering edit similarity queries rely on a signature scheme to generate candidates given the query string. We observe that the number of signatures generated by existing methods is far greater than the lower bound, and this results in high query time and index space complexities.

In this paper, we show that the minimum signature size lower bound is t + 1. We then propose asymmetric signature schemes that achieve this lower bound. We develop efficient query processing algorithms based on the new scheme. Several dynamic programming-based candidate pruning methods are also developed to further speed up the performance. We have conducted a comprehensive experimental study in- volving nine state-of-the-art algorithms. The experiment results clearly demonstrate the efficiency of our methods.”

Please Cite this paper:

@inproceedings{DBLP:conf/sigmod/QinWLXL11,
  author    = {Jianbin Qin and
               Wei Wang and
               Yifei Lu and
               Chuan Xiao and
               Xuemin Lin},
  title     = {Efficient exact edit similarity query processing with the asymmetric
               signature scheme},
  booktitle = {Proceedings of the {ACM} {SIGMOD} International Conference on Management
               of Data, {SIGMOD} 2011, Athens, Greece, June 12-16, 2011},
  pages     = {1033--1044},
  year      = {2011},
  url       = {http://doi.acm.org/10.1145/1989323.1989431},
  doi       = {10.1145/1989323.1989431},
  timestamp = {Mon, 14 May 2012 08:45:16 +0200},
  biburl    = {https://dblp.org/rec/bib/conf/sigmod/QinWLXL11},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Related Papers

  1. Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, Xuemin Lin. Efficient Exact Edit Similarity Query Processing with Asymmetric Signature Schemes. SIGMOD 2011.
  2. Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu: Efficient Similarity Joins For Near Duplicate Detection. WWW 2008: 131-140.
  3. Chuan Xiao, Wei Wang, Xuemin Lin: Ed-Join: An Efficient Algorithm for Similarity Join with Edit Distance Constraints. VLDB 2008.
  4. Roberto J. Bayardo, Yiming Ma, Ramakrishnan Srikant: Scaling up all pairs similarity search. WWW 2007: 131-140.

Package Manual

Installation

  • code, say
$ git clone https://github.com/qinbill/Qchunk.git
$ cd Qchunk/src/
$ make

Overview of Programs

Executables

Program Name Description

  • preproc Preprocess the text file and output a index.
  • search Perform the search queries.

Preprocessing

The preprocessing dose three things:

  1. Sort the input data.
  2. Generate qchunk/qgram set.
  3. Calculate statistics.

Usage:

$ cat <textfile> | ./preproc -o <output_prefix> -q <qgram size> 

The process will genearte a set of files that named with <output_prefix>

Query processing

The query processing part takes input text from standard input. Usage:

$ cat <query_file> | ./search -i <input_prefix> -t <threshold> -c/G/C/g/b

$ ./search -h
Need input file name
usage: -t <Max Edit Distance>    :Edit distance threshold for index building.>
    -i <input file name>      :input binary file prefix
    -G Use Algoritm indexGramSuper
    -C Use Algoritm indexChunkSuper 
    -g Use Algoritm indexGramTurbo (Default)
    -c Use Algoritm indexChunkTurbo 
    -b Use Algoritm indexChunkTurbine 

Result interpretation:

# Q: 8       // Length of Q
# Tau: 5     // The threshold.
# DataDucNum: 2000000  // Number of data.
# DataUnderflow: 0     // Number of data dumped because length less that Q*tau+1
# IndexedTokenNum: 284771740  // Unique indexed token number
# SkipListNum: 0           // Use skip list size. 
# QueryNum: 1000     // Number of queries.      
# UderflowQuery: 0    // Number of queries is too short. 
# CandOneNum: 2710686   // Number of Candidate one. Just past prefix filtering.
# FinalResultsNum: 3618  // Number of results. 
# IndexTotalTime: 17.862824   // Time used in indexing. 
# SearchTotalTime: 2.907339

Last Modified: <2018-03-26 Mon> by Jianbin Qin

About

Source code for Qchunk work that published in the SIGMOD 2011 and TODS 2015

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published