
DeepLSH: Deep Locality-Sensitive Hash Learning for Fast and Efficient Near-Duplicate Crash Report Detection

Overview

Automatic crash bucketing is a critical step in the software development process for efficiently analyzing and triaging bug reports. In this work, we aim to detect, for a given crash report, its candidate near-duplicates (i.e., similar crashes that are likely induced by the same software bug) in a large database of historical crashes, given any similarity measure dedicated to comparing stack traces. To this end, we propose DeepLSH, a deep Siamese hash-coding neural network based on the Locality-Sensitive Hashing (LSH) property, which produces binary hash codes that place the most similar stack traces into the same hash buckets (a minimal sketch of this bucketing follows the list below). DeepLSH has been evaluated on a large stack-trace dataset with state-of-the-art similarity measures proposed to tackle the crash-deduplication problem:

  • Jaccard coefficient [Ref]
  • Cosine similarity [Ref]
  • Lucene TF-IDF [Ref]
  • Edit distance [Ref]
  • Brodie et al. [Paper]
  • PDM-Rebucket [Paper]
  • DURFEX [Paper]
  • Lerch and Mezini [Paper]
  • Moroo et al. [Paper]
  • TraceSIM [Paper]
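
The bucketing idea can be summarized in a few lines of Python. In the sketch below, `hash_code` is a hypothetical stand-in for a trained DeepLSH model that maps a stack trace to a binary code; it is not the repository's actual API:

```python
# Minimal sketch of LSH-style bucketing with binary hash codes.
# `hash_code` is a placeholder for a learned hashing model.
from collections import defaultdict

def to_key(bits):
    """Pack a binary code (list of 0/1) into a hashable bucket key."""
    return tuple(bits)

def build_index(stack_traces, hash_code):
    """Group stack traces into buckets keyed by their binary hash code."""
    buckets = defaultdict(list)
    for trace in stack_traces:
        buckets[to_key(hash_code(trace))].append(trace)
    return buckets

def query(index, trace, hash_code):
    """Candidate near-duplicates = traces that collide in the same bucket."""
    return index.get(to_key(hash_code(trace)), [])

# Toy demo with a made-up 2-bit hash (a real system would use the model):
def toy(t):
    return [int("a" in t), int("x" in t)]

idx = build_index(["a;b;c", "a;b;d", "x;y;z"], toy)
print(query(idx, "a;b;e", toy))  # -> ['a;b;c', 'a;b;d']
```

Because candidates are fetched from a single bucket, query time does not grow with the size of the database, which is what makes hashing scale compared to a pairwise (KNN-style) scan.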

Contributions

Our contribution is threefold.

  • To overcome the difficulty of deriving LSH functions for arbitrary stack-trace similarity measures, we propose a generic approach, dubbed DeepLSH, that learns a family of binary hash functions approximating the locality-sensitive property, so that near-duplicate stack traces can be retrieved efficiently and rapidly. (The classical property that DeepLSH approximates is illustrated right after this bullet.)

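For intuition, here is the locality-sensitive property in its classical form, using MinHash for the Jaccard coefficient: the probability that two sets collide under a random min-wise hash equals their Jaccard similarity. This is a standard textbook illustration, not code from the repository:

```python
# For a random permutation h, P[minhash(A) == minhash(B)] = Jaccard(A, B).
import random

def minhash(s, perm):
    return min(perm[x] for x in s)

A = {"read", "parse", "alloc", "free"}
B = {"read", "parse", "alloc", "write"}
universe = sorted(A | B)
jaccard = len(A & B) / len(A | B)  # 3 / 5 = 0.6

collisions, trials = 0, 100_000
for _ in range(trials):
    # Draw a fresh random permutation of the universe.
    ranks = random.sample(range(len(universe)), len(universe))
    perm = dict(zip(universe, ranks))
    collisions += minhash(A, perm) == minhash(B, perm)

print(f"Jaccard(A, B) = {jaccard:.2f}")
print(f"Empirical collision rate = {collisions / trials:.2f}")  # ~0.60
```

DeepLSH aims to obtain the same collision behavior for measures (e.g., TraceSIM or edit distance) for which no such closed-form hash family is known.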

  • Technically, we design a deep Siamese neural network architecture that performs end-to-end hashing with an original objective function based on preserving the locality-sensitive property, along with appropriate regularizations to cope with the binarization problem of optimizing non-smooth loss functions (see the sketch after this list).
  • We demonstrate through our experimental study the effectiveness and scalability of DeepLSH for retrieving near-duplicate crash reports under a dozen similarity metrics. We compare favorably against standard LSH techniques (MinHash and SimHash) and the most relevant deep-hashing baseline on a large real-world dataset that we make available.
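
The following PyTorch sketch illustrates the kind of Siamese setup described above: two weight-shared branches output relaxed codes in (-1, 1), the loss pushes the fraction of agreeing bits toward the target similarity, and a regularizer drives outputs toward ±1. The layer sizes, the loss form, and the `alpha` weight are all illustrative assumptions, not the paper's exact architecture or objective:

```python
import torch
import torch.nn as nn

class HashNet(nn.Module):
    """One branch of the Siamese pair; both branches share these weights."""
    def __init__(self, in_dim=512, code_bits=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, code_bits), nn.Tanh(),  # relaxed bits in (-1, 1)
        )

    def forward(self, x):
        return self.net(x)

def deep_lsh_loss(u, v, sim, alpha=0.1):
    """u, v: relaxed codes of a trace pair; sim: target similarity in [0, 1]."""
    # Relaxed fraction of agreeing bits: 1 when u == v, 0 when u == -v.
    match = (1.0 + (u * v).mean(dim=1)) / 2.0
    lsh_term = ((match - sim) ** 2).mean()  # collision prob. ~ similarity
    binarize = ((u.abs() - 1) ** 2 + (v.abs() - 1) ** 2).mean()  # push to ±1
    return lsh_term + alpha * binarize

model = HashNet()
x1, x2 = torch.randn(8, 512), torch.randn(8, 512)  # pair of trace embeddings
sim = torch.rand(8)  # e.g., precomputed Jaccard or TraceSIM scores
loss = deep_lsh_loss(model(x1), model(x2), sim)
loss.backward()
```

At inference time, the relaxed codes are thresholded at zero to obtain the binary hash codes used for bucketing.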


How to use this code?

  1. Clone this repository
  2. Install the required Python packages: pip install -r ./code/requirements.txt
  3. Run the notebooks in code/notebooks/measure:
    • measure-DeepLSH.ipynb: the end-to-end procedure to train, test, and validate DeepLSH
    • measure-Baseline.ipynb: the end-to-end procedure to train, test, and validate the CNNH+LSH baseline
    • measure-MinHash.ipynb (for Jaccard [BoW / N-grams]): the end-to-end procedure to run MinHash
    • measure-SimHash.ipynb (for Cosine [BoW / N-grams / TF-IDF]): the end-to-end procedure to run SimHash (the random-hyperplane idea behind SimHash is sketched after this list)
    • measure-Runtime.ipynb: runtime comparison of DeepLSH vs. CNNH+LSH vs. a KNN-based approach
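
For reference, the random-hyperplane hashing behind SimHash for the cosine similarity fits in a few lines of NumPy: each bit is the sign of a projection onto a random direction, and two vectors agree on a bit with probability 1 - angle(x, y) / pi. The vectors and dimensions below are made up for illustration; this is not the notebook's implementation:

```python
import numpy as np

def simhash(x, planes):
    """Binary code: sign of the projection onto each random hyperplane."""
    return (planes @ x > 0).astype(np.uint8)

rng = np.random.default_rng(0)
dim, bits = 300, 64
planes = rng.standard_normal((bits, dim))  # one random hyperplane per bit

x = rng.standard_normal(dim)
y = x + 0.3 * rng.standard_normal(dim)  # a near-duplicate of x

agree = np.mean(simhash(x, planes) == simhash(y, planes))
cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
expected = 1 - np.arccos(cos) / np.pi
print(f"cosine = {cos:.2f}, expected agreement = {expected:.2f}, observed = {agree:.2f}")
```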
