In this project, we propose a solution to the multiple-set matching problem by defining a multi-set Bloom Filter.
A multi-set Bloom Filter is a space-efficient probabilistic data structure that allows checking whether an element belongs to any of multiple sets.
The data structure presented in this project associates a separate filter with each set and supports building all the filters through a single efficient construction pass over the data.
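To make the idea concrete, here is a minimal sketch of a multi-set Bloom Filter in Python: one bit array per set (one per rating in this project), with `k` bit positions derived per element via double hashing. The class and method names are illustrative only and do not reflect the project's actual API.

```python
import hashlib

class MultiSetBloomFilter:
    """Illustrative multi-set Bloom Filter: one (bit array, m, k) per set id."""

    def __init__(self, params):
        # params: {set_id: (m, k)} with m = bit-array size, k = number of hashes
        self.filters = {s: (bytearray((m + 7) // 8), m, k)
                        for s, (m, k) in params.items()}

    def _positions(self, item, m, k):
        # Derive k positions from one SHA-256 digest (double-hashing scheme)
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % m for i in range(k)]

    def add(self, set_id, item):
        bits, m, k = self.filters[set_id]
        for p in self._positions(item, m, k):
            bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, set_id, item):
        # False positives are possible; false negatives are not
        bits, m, k = self.filters[set_id]
        return all((bits[p // 8] >> (p % 8)) & 1
                   for p in self._positions(item, m, k))
```

For example, `add("8.0", "tt0111161")` followed by `might_contain("8.0", "tt0111161")` returns `True`, while an empty filter always answers `False`.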
The IMDb dataset containing movie ratings has been used as a reference for building the Bloom Filters and assessing their performance.
An implementation based on the MapReduce paradigm is presented, specifically employing the Hadoop and Spark frameworks.
The project is divided into two main modules:
- BloomFilterHadoop: containing the implementation of the multi-set Bloom Filter in Java using the Hadoop Framework
- BloomFilterSpark: containing the implementation of the multi-set Bloom Filter in Python using the Spark Framework
The Bloom Filter construction can be decomposed into two stages, well suited to a MapReduce implementation:
- Parameter Calculation Stage, where the parameters {m_i, k_i} of each Bloom Filter are computed from the number of elements with a given rating in the dataset.
- Construction Stage, where the Bloom Filters are filled based on the results of the hash functions.
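The Parameter Calculation Stage can be sketched with the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 and k = (m/n) ln 2, given a per-rating element count n and a target false-positive rate p. The counts and the chosen p below are illustrative, not taken from the IMDb dataset.

```python
import math

def bloom_parameters(n, p):
    """Standard Bloom filter sizing: bit-array size m and hash count k
    for n expected elements and target false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k

# Stage 1: per-rating counts -> per-filter parameters {m_i, k_i}
# (counts here are made up for illustration)
counts = {"7": 120_000, "8": 45_000}
params = {rating: bloom_parameters(n, p=0.01) for rating, n in counts.items()}
```

For instance, n = 1000 and p = 0.01 yield roughly m ≈ 9,585 bits and k = 7 hash functions.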
Starting from the solutions highlighted in the Hadoop design, the following lineage graphs show the computation workflow in the Spark framework.
Our solution has a space complexity of O(m * l), where m is the maximum size among all Bloom Filters and l is the total number of possible ratings.
Moreover, the insertion operation has a complexity of O(k * n), where k is the maximum number of hash functions among all Bloom Filters and n is the total number of elements.