In this project, we propose a solution to the multiple-set matching problem by defining a multi-set Bloom Filter.
A multi-set Bloom Filter is a space-efficient probabilistic data structure that allows checking whether an element belongs to any of multiple sets.
The data structure presented in this project associates a separate filter with each set and supports building all the filters through a single efficient construction pass over the data.
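To make the idea concrete, here is a minimal sketch of a multi-set Bloom Filter in Python: one bit array per set (one per rating in this project), with `k` bit positions derived per element via double hashing. The class and method names are illustrative only and do not reflect the project's actual API.

```python
import hashlib

class MultiSetBloomFilter:
    """Illustrative multi-set Bloom Filter: one (bit array, m, k) per set id."""

    def __init__(self, params):
        # params: {set_id: (m, k)} with m = bit-array size, k = number of hashes
        self.filters = {s: (bytearray((m + 7) // 8), m, k)
                        for s, (m, k) in params.items()}

    def _positions(self, item, m, k):
        # Derive k positions from one SHA-256 digest (double-hashing scheme)
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % m for i in range(k)]

    def add(self, set_id, item):
        bits, m, k = self.filters[set_id]
        for p in self._positions(item, m, k):
            bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, set_id, item):
        # False positives are possible; false negatives are not
        bits, m, k = self.filters[set_id]
        return all((bits[p // 8] >> (p % 8)) & 1
                   for p in self._positions(item, m, k))
```

For example, `add("8.0", "tt0111161")` followed by `might_contain("8.0", "tt0111161")` returns `True`, while an empty filter always answers `False`.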
The IMDb dataset containing movie ratings has been used as a reference for building the Bloom Filters and assessing their performance.
An implementation based on the MapReduce paradigm is presented, specifically employing the Hadoop and Spark frameworks.
The project is divided into two main modules:
- BloomFilterHadoop: containing the implementation of the multi-set Bloom Filter in Java using the Hadoop Framework
- BloomFilterSpark: containing the implementation of the multi-set Bloom Filter in Python using the Spark Framework
The Bloom Filter construction can be decomposed into two stages, well suited to a MapReduce implementation:
- Parameter Calculation Stage, where the parameters {m_i, k_i} of each Bloom Filter are computed from the number of elements with a given rating in the dataset.
- Construction Stage, where the Bloom Filters are filled based on the results of the hash functions.
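The Parameter Calculation Stage can be sketched with the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 and k = (m/n) ln 2, given a per-rating element count n and a target false-positive rate p. The counts and the chosen p below are illustrative, not taken from the IMDb dataset.

```python
import math

def bloom_parameters(n, p):
    """Standard Bloom filter sizing: bit-array size m and hash count k
    for n expected elements and target false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k

# Stage 1: per-rating counts -> per-filter parameters {m_i, k_i}
# (counts here are made up for illustration)
counts = {"7": 120_000, "8": 45_000}
params = {rating: bloom_parameters(n, p=0.01) for rating, n in counts.items()}
```

For instance, n = 1000 and p = 0.01 yield roughly m ≈ 9,585 bits and k = 7 hash functions.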
Starting from the solutions highlighted in the Hadoop design, the following lineage graphs show the computation workflow in the Spark framework.
Our solution has a space complexity of O(m * l), where m is the maximum size among all Bloom Filters and l is the total number of possible ratings.
Moreover, the insertion operation has a complexity of O(k * n), where k is the maximum number of hash functions among all Bloom Filters and n is the total number of elements.