# cogroup():
                        Syntax: RDD.cogroup(<otherRDD>, numPartitions=None)
The cogroup() transformation groups multiple key/value pair datasets by a key. It is somewhat
similar conceptually to a fullOuterJoin(), but there are a few key differences in its
implementation:
<ul>
    <li>
 For each key, the cogroup() transformation returns a tuple of iterable objects, each similar
to the iterable returned by the groupByKey() function you saw earlier.
    </li>
    <li>
 The cogroup() transformation groups multiple elements from both RDDs into iterable
objects, whereas fullOuterJoin() creates separate output elements for the same key.
    </li>
    <li>
 The cogroup() transformation can group three or more RDDs using the Scala API or the
groupWith() function alias.
    </li>
</ul>

The resultant RDD output from a cogroup() operation of two RDDs (A, B) with a key K could be
summarized as:

                        (K, (Iterable(VA, …), Iterable(VB, …)))
                        
If an RDD does not have elements for a given key that is present in the other RDD, the corresponding
iterable is empty. Let's see this using the example from the previous notebook:

In [1]:
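from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Configure and start a local SparkContext for the RDD examples.
configure = SparkConf().setAppName("Cogroup-Cartesian").setMaster("local")
sc = SparkContext(conf=configure)

# getOrCreate() reuses the SparkContext created above.
spark = SparkSession.builder \
        .appName("cogroup-cartesian") \
        .getOrCreate()

# List the effective configuration as (key, value) pairs.
spark.sparkContext.getConf().getAll()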

[('spark.master', 'local'),
 ('spark.driver.port', '59186'),
 ('spark.app.name', 'cogroup-cartesian'),
 ('spark.rdd.compress', 'True'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.app.id', 'local-1590555337997'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.driver.host', 'DESKTOP-I7971JS')]

In [2]:
stores = sc.parallelize(
                        [
                        (100, 'Boca Raton'),
                        (101, 'Columbia'),
                        (102, 'Cambridge'),
                        (103, 'Naperville')
                        ]
)
# stores schema (store_id, store_location)
salespeople = sc.parallelize(
                              [
                                (1, 'Henry', 100),
                                (2, 'Karen', 100),
                                (3, 'Paul', 101),
                                (4, 'Jimmy', 102),
                                (5, 'Janice', None)
                              ]
)
# salespeople schema (salesperson_id, salesperson_name, store_id)

In [3]:
salespeople.keyBy(lambda x: x[2]) \
          .cogroup(stores).take(2)

[(100,
  (<pyspark.resultiterable.ResultIterable at 0x220caa26588>,
   <pyspark.resultiterable.ResultIterable at 0x220caa26a88>)),
 (102,
  (<pyspark.resultiterable.ResultIterable at 0x220caa24988>,
   <pyspark.resultiterable.ResultIterable at 0x220caa26b88>))]
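
The ResultIterable objects above hide their contents. A minimal sketch of one way to make them readable while keeping the two sides separate is shown here; the helper name `materialize` is hypothetical, and the `sample` tuple is a plain-Python stand-in for a single cogroup() value, so the block runs without Spark.

```python
def materialize(groups):
    """Turn each iterable in a cogroup() value tuple into a plain list."""
    return tuple(list(group) for group in groups)

# Stand-in for one cogroup() value: a tuple of two iterables
# (salespeople side first, stores side second).
sample = (iter([(1, 'Henry', 100), (2, 'Karen', 100)]), iter(['Boca Raton']))

readable = materialize(sample)
print(readable)

# With the RDDs defined above, the same function could be passed to mapValues():
# salespeople.keyBy(lambda x: x[2]).cogroup(stores).mapValues(materialize).take(2)
```

Unlike a flattening step, this keeps the salespeople and the store locations in separate lists, so an empty list still shows which side had no match.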

In [4]:
# Flatten the two iterables for each key into a single list so the contents are visible.
salespeople.keyBy(lambda x: x[2]) \
            .cogroup(stores) \
            .mapValues(lambda x: [item for sublist in x for item in sublist]) \
            .collect()

[(100, [(1, 'Henry', 100), (2, 'Karen', 100), 'Boca Raton']),
 (102, [(4, 'Jimmy', 102), 'Cambridge']),
 (None, [(5, 'Janice', None)]),
 (101, [(3, 'Paul', 101), 'Columbia']),
 (103, ['Naperville'])]
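
To make the contrast with fullOuterJoin() described earlier more concrete, here is a plain-Python sketch of the two result shapes. It does not use Spark. The functions `cogroup_local` and `full_outer_join_local` are hypothetical helpers written only to model the semantics, and the data is a reduced version of the example above.

```python
from collections import defaultdict
from itertools import product

def cogroup_local(left, right):
    """Model of cogroup(): one record per key, values grouped per side."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return dict(groups)

def full_outer_join_local(left, right):
    """Model of fullOuterJoin(): one record per value pair, None for a missing side."""
    joined = []
    for key, (lvals, rvals) in cogroup_local(left, right).items():
        # An empty side contributes a single None so the key is still emitted.
        joined.extend((key, pair) for pair in product(lvals or [None], rvals or [None]))
    return joined

salespeople = [(100, 'Henry'), (100, 'Karen'), (101, 'Paul')]
stores = [(100, 'Boca Raton'), (103, 'Naperville')]

grouped = cogroup_local(salespeople, stores)
joined = full_outer_join_local(salespeople, stores)

print(grouped[100])                               # one grouped record for key 100
print([row for row in joined if row[0] == 100])   # two separate records for key 100
print(grouped[103])                               # empty list on the salespeople side
```

In this model, key 100 yields a single cogroup record holding both salespeople, while the join yields one record per salesperson. Key 103 has an empty list in the cogroup result and a None in the join result.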

# cartesian():
                    Syntax:  RDD.cartesian(<otherRDD>)
The cartesian() transformation, sometimes referred to by its colloquial name, cross join, generates
every possible combination of records from both RDDs. The number of records produced by
this transformation is equal to the number of records in the first RDD multiplied by the number
of records in the second RDD.

Let's see it with an example:

In [5]:
salespeople.keyBy(lambda x: x[2]) \
          .cartesian(stores).take(1)
# returns:
# [((100, (1, 'Henry', 100)), (100, 'Boca Raton'))]


[((100, (1, 'Henry', 100)), (100, 'Boca Raton'))]

In [6]:
salespeople.keyBy(lambda x: x[2]) \
            .cartesian(stores).count()
# returns 20 as there are 5 x 4 = 20 records

20
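
The record count can be cross-checked without Spark. The sketch below uses `itertools.product` from the Python standard library as a stand-in for cartesian() on the same data; it only models the pairing logic and says nothing about how Spark orders or partitions the result.

```python
from itertools import product

stores = [
    (100, 'Boca Raton'),
    (101, 'Columbia'),
    (102, 'Cambridge'),
    (103, 'Naperville'),
]
salespeople = [
    (1, 'Henry', 100),
    (2, 'Karen', 100),
    (3, 'Paul', 101),
    (4, 'Jimmy', 102),
    (5, 'Janice', None),
]

# Equivalent of keyBy(lambda x: x[2]): pair each record with its store_id.
keyed_salespeople = [(person[2], person) for person in salespeople]

# Every keyed salesperson is paired with every store, matching keys or not.
pairs = list(product(keyed_salespeople, stores))

print(len(pairs))   # 5 x 4 = 20
print(pairs[0])
```

Note that the pairing ignores keys entirely: Janice, whose store_id is None, is still combined with all four stores.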

# Use the cartesian() Transformation Cautiously
Cartesian, or cross-product, operations can yield excessively large amounts of data. Although
this is a useful function for testing multiple combinations of items for machine learning, you
could create a Big Data problem where one otherwise did not exist!
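
One way to apply this caution is to estimate the output size before running the cross join. The following is a minimal sketch, not a Spark feature: `guarded_cartesian` and its `max_pairs` threshold are hypothetical names, and the function assumes only that both inputs expose the `count()` and `cartesian()` methods that RDDs provide.

```python
def guarded_cartesian(left, right, max_pairs=1_000_000):
    """Return left.cartesian(right) only if the expected output is small enough."""
    # count() is an action, so each call runs a job over its RDD.
    expected = left.count() * right.count()
    if expected > max_pairs:
        raise ValueError(
            f"cartesian() would produce {expected} records (limit: {max_pairs})"
        )
    return left.cartesian(right)
```

With the RDDs from the earlier cells, a call such as `guarded_cartesian(salespeople.keyBy(lambda x: x[2]), stores, max_pairs=100)` would pass, because 5 x 4 = 20 records is under the limit. The extra count() jobs are the price of the check, so it is most useful when the inputs are cheap to count or already cached.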