
Multi-set Bloom Filter built according to the MapReduce paradigm using Hadoop and Spark


terranovafr/MultisetBloomFilterHadoopSpark


Multiset Bloom Filter: Hadoop and Spark implementations

This project proposes a solution to the multiple-set matching problem by defining a multi-set Bloom Filter: a space-efficient probabilistic data structure that checks the membership of an element in several sets at once.
The data structure associates one Bloom Filter with each set and builds all of the filters through a single, efficient construction operation.
The IMDb dataset of movie ratings is used as a reference for building the filters and assessing their performance.
The implementation follows the MapReduce paradigm and is provided for both the Hadoop and Spark frameworks.
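To illustrate the idea, a minimal in-memory multi-set Bloom Filter might look as follows. This is a hedged sketch, not the repository's implementation: the md5-based hashing, per-rating labels, and method names are illustrative choices.

```python
import hashlib

class MultiSetBloomFilter:
    """Sketch: one bit array per set (here, one per movie-rating class)."""

    def __init__(self, sizes, num_hashes):
        # sizes: set label -> bit-array length m_i
        # num_hashes: set label -> number of hash functions k_i
        self.bits = {label: [0] * m for label, m in sizes.items()}
        self.k = num_hashes

    def _positions(self, label, item):
        # Derive k_i deterministic bit positions by salting an md5 digest.
        m = len(self.bits[label])
        for i in range(self.k[label]):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % m

    def add(self, label, item):
        for pos in self._positions(label, item):
            self.bits[label][pos] = 1

    def might_contain(self, label, item):
        # May return a false positive, never a false negative.
        return all(self.bits[label][pos] for pos in self._positions(label, item))

    def lookup(self, item):
        # Multiple-set matching: every set the item may belong to.
        return [label for label in self.bits if self.might_contain(label, item)]
```

A lookup probes all filters at once, which is what distinguishes the multi-set variant from querying a single Bloom Filter per set.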

Project Structure

The project is divided into two main modules:

  • BloomFilterHadoop: containing the implementation of the multi-set Bloom Filter in Java using the Hadoop Framework
  • BloomFilterSpark: containing the implementation of the multi-set Bloom Filter in Python using the Spark Framework

Hadoop

The Bloom Filter construction can be decomposed into two stages, suitable for a MapReduce implementation:

  1. Parameter Calculation Stage, in which the parameters {m_i, k_i} of each Bloom Filter are computed from the number of elements with a given rating in the dataset.

  2. Construction Stage, in which each Bloom Filter is filled according to the outputs of its hash functions.

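The parameter-calculation stage can be sketched with the textbook Bloom Filter sizing formulas; this is an assumption about the exact formulas the project uses, and the target false-positive rate `p` is an illustrative input.

```python
import math

def bloom_parameters(n, p):
    """Textbook Bloom Filter sizing (an assumed stand-in for the project's formulas).
    n: expected number of elements in the set, p: target false-positive rate."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # bits in the filter
    k = max(1, round(m / n * math.log(2)))                # number of hash functions
    return m, k

# One (m_i, k_i) pair per rating class, driven by the per-rating counts n_i
# that the first MapReduce job would produce (counts here are made up).
counts = {7: 250_000, 8: 120_000}
params = {rating: bloom_parameters(n, 0.01) for rating, n in counts.items()}
```

The reducer of the first job only needs to emit these per-rating counts; the sizing itself is a cheap, per-key computation.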

Spark

Starting from the solutions outlined in the Hadoop design, the corresponding lineage graphs illustrate the computation workflow in the Spark framework.
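The two stages map naturally onto RDD transformations (map, then reduceByKey). As a hedged stand-in that does not require a Spark installation, the following plain-Python sketch mirrors that lineage; the tab-separated record format and k = 3 are illustrative assumptions.

```python
from collections import Counter, defaultdict

# Stage 1 lineage: textFile -> map(parse rating) -> reduceByKey(count)
records = ["tt0000001\t5.4", "tt0000002\t6.0", "tt0000003\t6.2"]

def parse_rating(line):
    # Round the average rating to its integer class (illustrative parsing).
    movie_id, avg = line.split("\t")
    return round(float(avg)), movie_id

pairs = [parse_rating(line) for line in records]   # map
counts = Counter(rating for rating, _ in pairs)    # reduceByKey-style count

# Stage 2 lineage: map(emit hash positions) -> reduceByKey(OR of partial filters)
filters = defaultdict(set)  # rating -> set bit positions (stand-in for a bit array)
for rating, movie_id in pairs:
    for i in range(3):  # k = 3 hash functions, illustrative
        filters[rating].add(hash((i, movie_id)) % 1024)
```

In actual Spark the second stage would combine per-partition bit arrays with a bitwise OR in the reduce step, which is associative and commutative and therefore safe to parallelize.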

Complexity

Our solution has a space complexity of O(m · l), where m is the maximum size among all Bloom Filters and l is the total number of possible ratings.
The insertion operation has a time complexity of O(k · n), where k is the maximum number of hash functions among all Bloom Filters and n is the total number of elements.
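To make the bounds concrete, here is a small worked example with illustrative numbers (not taken from the project):

```python
# Worked example of the complexity bounds above (all figures are made up):
l = 10          # rating classes 1..10
m = 9_585_059   # largest single-filter size in bits (e.g. n = 10^6, p = 1%)
k = 7           # largest number of hash functions among the filters
n = 5_000_000   # total elements across all sets

space_bits = m * l   # O(m * l) space bound: 95,850,590 bits
insert_ops = k * n   # O(k * n) hash evaluations to build all filters
print(space_bits / 8 / 2**20)  # ≈ 11.4 MiB upper bound on filter storage
```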
