# Transformations on Sets:
Set transformations in Spark are conceptually similar to mathematical set operations. Each one
operates on two RDDs and returns a single RDD. Consider the Venn diagram in the figure below, which
shows a set of odd integers and a subset of the Fibonacci numbers. The following sections use these
two sets to demonstrate the set transformations available in the Spark API.

![Transformations-on-Sets](img/sets-tranformation.png "Set Transformation")

# union():
    Syntax: RDD.union(<otherRDD>)
The union() transformation takes one RDD and appends another RDD to it, resulting in a
combined output RDD. The RDDs are not required to have the same schema or structure. For
instance, the first RDD can have five fields, whereas the second can have more or fewer than five
fields.

The union() transformation does not filter duplicates: if the two RDDs contain identical
records, every copy appears in the output RDD. To remove duplicates, follow union() with the
distinct() transformation discussed previously.

The RDD that results from a union() operation is not sorted either, but you could sort it by
following union() with a sortBy() function.

In [1]:
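from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Configure and start a local SparkContext for the RDD examples.
configure = SparkConf().setAppName("Transformations-on-Sets").setMaster("local")
sc = SparkContext(conf=configure)

# getOrCreate() reuses the SparkContext created above.
spark = SparkSession.builder \
        .appName("Transformations-on-Sets") \
        .getOrCreate()

# Show the effective configuration of the running application.
spark.sparkContext.getConf().getAll()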

[('spark.master', 'local'),
 ('spark.app.id', 'local-1590561599286'),
 ('spark.rdd.compress', 'True'),
 ('spark.driver.port', '61336'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.app.name', 'Transformations-on-Sets'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.driver.host', 'DESKTOP-I7971JS')]

In [2]:
odds = sc.parallelize([1,3,5,7,9])
fibonacci = sc.parallelize([0,1,2,3,5,8])
odds.union(fibonacci).collect()

[1, 3, 5, 7, 9, 0, 1, 2, 3, 5, 8]

# intersection():
    Syntax: RDD.intersection(<otherRDD>)
The intersection() transformation returns elements that are present in both RDDs. In other
words, it returns the overlap between two sets. The elements or records must be identical in both
sets, with each respective record’s data structure and all of its fields matching in both RDDs.

In [3]:
odds.intersection(fibonacci).collect()

[1, 3, 5]

# subtract():
    Syntax: RDD.subtract(<otherRDD>, numPartitions=None)
The subtract() transformation returns all elements from the first RDD that are not present in
the second RDD. This is an implementation of mathematical set subtraction (set difference).

In [4]:
odds.subtract(fibonacci).collect()

[7, 9]

# subtractByKey():
    Syntax: RDD.subtractByKey(<otherRDD>, numPartitions=None)
The subtractByKey() transformation is a set operation similar to subtract(), except that it
compares only keys. It returns the key/value pair elements of an RDD whose keys are not present
in the key/value pair elements of otherRDD; the values in otherRDD are ignored.

The numPartitions argument specifies how many output partitions are created in the resultant
RDD. When it is omitted, PySpark uses the spark.default.parallelism value if that property is
set; otherwise the partition count is derived from the input RDDs.


In [5]:
cities1 = sc.parallelize([
    ('Hayward', (37.668819, -122.080795)),
    ('Baumholder', (49.6489, 7.3975)),
    ('Alexandria', (38.820450, -77.050552)),
    ('Melbourne', (37.663712, 144.844788)),
])

cities2 = sc.parallelize([
    ('Boulder Creek', (64.0708333, -148.2236111)),
    ('Hayward', (37.668819, -122.080795)),
    ('Alexandria', (38.820450, -77.050552)),
    ('Arlington', (38.878337, -77.100703)),
])

cities1.subtractByKey(cities2).collect()

[('Baumholder', (49.6489, 7.3975)), ('Melbourne', (37.663712, 144.844788))]