MET CS 777 - Big Data Analytics, prof. Dimitar Trajanov
# PySpark RDD Basics
This notebook is a simple introduction to the Spark RDD API.  It uses learning by example to demonstrate how the RDD API works and how to use it. For a comprehensive guide to the RDD API, please see the [Spark RDD API Documentation](https://spark.apache.org/docs/latest/api/python/reference/pyspark.html#rdd-apis).

*) Visual diagrams depicting the Spark API created by Jeff Thompson, https://github.com/jkthompson/pyspark-pictures.

In [1]:
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext("local")
spark = SparkSession.builder.getOrCreate()

In [2]:
# Before restarting the kernel (Notebook), stop the spark context
# sc.stop()

# Parallelizing and Collecting the data

**parallelize(c, numSlices=None)**

The `parallelize()` function is used to create an RDD from a collection or an iterable object by distributing the data across multiple partitions in a parallel manner.

**Parameters:**
- `c`: The collection or iterable object to be parallelized into an RDD.
- `numSlices` (optional): The number of partitions to split the data into. The default is `None`, in which case Spark uses `sc.defaultParallelism`, which is derived from the cluster configuration.

**Returns:**
An RDD representing the distributed data.

**Note:**
- The `parallelize()` function is a method available in PySpark's `SparkContext` class.
- The data in the collection or iterable object is partitioned and distributed across multiple partitions, allowing for parallel computation on a cluster.
- The resulting RDD is immutable and can be operated upon using various transformation and action operations available in PySpark.
- It is generally recommended to have a sufficient number of partitions to fully utilize the available cluster resources and enable parallel processing.
- `parallelize()` is a `SparkContext` method that creates an RDD; it is neither a transformation nor an action. No Spark job runs until an action is triggered on the resulting RDD.
- The `parallelize()` function is commonly used when working with small to medium-sized datasets that can fit into memory on a single machine.
- The `parallelize()` function provides a convenient way to create an RDD from in-memory data, but for larger datasets, it is often more efficient to read the data from external storage systems using input operations like `sc.textFile()` or, for CSV files, the DataFrame reader `spark.read.csv()`.

In [3]:
rdd = sc.parallelize([1,2,3])

<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.collect">
<img align=left src="images/pyspark-page22.svg" width=360 height=203 />
</a>

**collect()**

The `collect()` method returns a list that contains all the elements in the Resilient Distributed Dataset (RDD). This operation transfers the data from the executors, through the JVM, to the Python driver process, so it can be expensive.

**Note**: Use this method only when the resulting list is expected to be small, because all the data is loaded into the driver's memory. Calling `collect()` on a large dataset can exhaust driver memory and degrade performance.

In [4]:
# collect
y = rdd.collect()
print(rdd)  # distributed object
print(y)  # not distributed, local data

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274
[1, 2, 3]


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.take">
<img align=left src="images/pyspark-page39.svg" width=360 height=203 />
</a>

**take(num)**

The `take()` method retrieves the first `num` elements from an RDD. It works by initially scanning one partition and using the results from that partition to estimate the number of additional partitions needed to satisfy the limit.

**Parameters:**
- `num`: The number of elements to retrieve from the RDD.

**Returns:**
A list containing the first `num` elements of the RDD.

**Note:**
- It is important to note that the order in which elements are retrieved is not guaranteed unless the RDD has been sorted in a specific order.
- If `num` is larger than the total number of elements in the RDD, it will return all the elements available in the RDD.
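
The scanning behaviour described above can be pictured with a small pure-Python model. This is a simplified sketch, not Spark's implementation: the partitions are plain lists, `take_model` is a hypothetical helper name, and it reads one partition at a time instead of estimating how many additional partitions to scan.

```python
def take_model(partitions, num):
    """Collect up to `num` elements, reading partitions in order and stopping early."""
    taken = []
    scanned = 0  # how many partitions had to be read
    for part in partitions:
        if len(taken) >= num:
            break  # enough elements, remaining partitions are never touched
        scanned += 1
        taken.extend(part[:num - len(taken)])
    return taken, scanned


partitions = [[1, 3], [1], [2, 3]]  # assumed layout of [1, 3, 1, 2, 3]

first_three, scanned = take_model(partitions, 3)
print(first_three, scanned)

# Asking for more elements than exist simply returns everything.
everything, _ = take_model(partitions, 10)
print(everything)
```

The point of the model is that a small `num` can usually be satisfied without reading every partition, which is why `take()` is much cheaper than `collect()` on a large RDD.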

In [5]:
# take
x = sc.parallelize([1,3,1,2,3])
y = x.take(num = 3)
print("x=",x.collect())
print(y)

x= [1, 3, 1, 2, 3]
[1, 3, 1]


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.takeOrdered">
<img align=left src="images/pyspark-page38.svg" width=360 height=203 />
</a>

**takeOrdered(num, key=None)**

The `takeOrdered()` method retrieves the `num` smallest elements from an RDD, in ascending order or in the order defined by an optional key function.

**Parameters:**
- `num`: The number of elements to retrieve from the RDD.
- `key` (optional): A function that specifies the sorting criteria. If provided, the elements will be sorted based on this key.

**Returns:**
A list containing the `num` smallest elements from the RDD, in ascending order or in the order defined by the key function.

In [6]:
# takeOrdered
x = sc.parallelize([1,3,1,2,3])
y = x.takeOrdered(num = 3)
print("x=",x.collect())
print(y)

x= [1, 3, 1, 2, 3]
[1, 1, 2]
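
The `key` parameter is not exercised in the cell above. As a mental model only, the standard library's `heapq.nsmallest` behaves the same way on a local list; the list `data` below stands in for the RDD, and the corresponding Spark call would be `x.takeOrdered(3, key=lambda v: -v)`.

```python
import heapq

data = [1, 3, 1, 2, 3]  # local stand-in for the RDD contents

# No key: the three smallest elements in ascending order.
smallest = heapq.nsmallest(3, data)

# Negating key: elements are ranked by -v, so the largest values come first.
largest_first = heapq.nsmallest(3, data, key=lambda v: -v)

print(smallest)
print(largest_first)
```

The key only controls the ranking; the returned values are the original elements, not the key results.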


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.first">
<img align=left src="images/pyspark-page40.svg" width=360 height=203 />
</a>

**first()**

The `first()` method returns the first element in the Resilient Distributed Dataset (RDD).

**Returns:**
The first element in the RDD.

**Note:**
- `first()` returns the first element of the first non-empty partition. Which element that is depends on how the RDD was built, so do not rely on it unless the RDD has a defined order (for example, after sorting).
- If the RDD is empty, calling `first()` raises `ValueError: RDD is empty`. Guard against this by checking `isEmpty()` first, or by using `take(1)`, which returns an empty list instead of raising.

In [7]:
# first
x = sc.parallelize([1,3,1,2,3])
y = x.first()
print("x=",x.collect())
print('The first element is',y)

x = sc.parallelize([])
# this will return an error
y = x.first()
print("x=",x.collect())
print(y)

x= [1, 3, 1, 2, 3]
The first element is 1


ValueError: RDD is empty
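
The error above can be avoided with a small guard built on `take(1)`. The sketch below is illustrative: `first_or_default` is a hypothetical helper, and `FakeRDD` is a stand-in that exposes only `take()` so the example runs without a Spark session. With a real RDD you would pass the RDD itself.

```python
def first_or_default(rdd, default=None):
    """Return the first element of `rdd`, or `default` when it is empty."""
    head = rdd.take(1)  # an empty RDD yields [] here instead of raising
    return head[0] if head else default


class FakeRDD:
    """Hypothetical stand-in with just take(); not part of PySpark."""

    def __init__(self, items):
        self._items = list(items)

    def take(self, num):
        return self._items[:num]


print(first_or_default(FakeRDD([1, 3, 1, 2, 3])))
print(first_or_default(FakeRDD([]), default="empty"))
```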

<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.top">
<img align=left src="images/pyspark-page37.svg" width=360 height=203 />
</a>

**top(num, key=None)**

The `top()` method retrieves the top N elements from an RDD. It returns a list sorted in descending order by default.

**Parameters:**
- `num`: The number of elements to retrieve from the RDD.
- `key` (optional): A function that specifies the sorting criteria. If provided, the elements will be sorted based on this key.

**Returns:**
A list containing the top N elements from the RDD, sorted in descending order.

**Note:**
- If `num` is larger than the total number of elements in the RDD, it will return all the elements in descending order.
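
One way to picture how a distributed `top()` can avoid sorting the whole dataset: each partition keeps only its own `num` best candidates, and the candidate lists are then merged. The following is a pure-Python sketch of that idea under assumed partition contents; `top_model` is a hypothetical name and the code is not taken from Spark.

```python
import heapq


def top_model(partitions, num, key=None):
    """Per-partition top-`num`, followed by a pairwise merge of the candidates."""
    per_partition = [heapq.nlargest(num, part, key=key) for part in partitions]
    merged = []
    for candidates in per_partition:
        merged = heapq.nlargest(num, merged + candidates, key=key)
    return merged


partitions = [[1, 3], [1, 2, 4]]  # assumed layout of [1, 3, 1, 2, 4]

print(top_model(partitions, 3))                     # largest three
print(top_model(partitions, 3, key=lambda v: -v))   # smallest three via a negating key
print(top_model(partitions, 10))                    # num larger than the data
```

Only `num` elements per partition ever need to leave a partition, which keeps the amount of data sent to the driver small.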

In [8]:
# top
x = sc.parallelize([1,3,1,2,4])
y = x.top(num = 3)
print("x=",x.collect())
print("Biggest elements",y)

# top with a negating key function returns the 3 smallest elements
y = x.top(num = 3, key = lambda x: -x)
print("Smallest elements",y)

# If `num` is larger than the total number of elements in the RDD, it will return all the elements in descending order.
y = x.top(num = 10)
print("All elements",y)

x= [1, 3, 1, 2, 4]
Biggest elements [4, 3, 2]
Smallest elements [1, 1, 2]
All elements [4, 3, 2, 1, 1]


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.collectAsMap">
<img align=left src="images/pyspark-page41.svg" width=360 height=203 />
</a>

**collectAsMap()**

The `collectAsMap()` method returns the key-value pairs in the Resilient Distributed Dataset (RDD) to the driver as a dictionary.

**Returns:**
A dictionary containing the key-value pairs from the RDD.

**Note:**
- The `collectAsMap()` operation transfers the data from the distributed RDD to the driver and represents it as a dictionary.
- Each key-value pair in the RDD is mapped to a corresponding entry in the dictionary, with the keys being unique.
- Make sure the resulting dictionary fits into the driver's memory, because all the data is loaded there.
- If there are duplicate keys in the RDD, the final dictionary will contain the value corresponding to the last occurrence of each key.
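
The duplicate-key rule mirrors how Python builds a dictionary from a list of pairs, so it can be checked locally. The sketch below uses a plain list in place of the RDD; it also shows one way to keep every value per key when silently dropping duplicates is not acceptable (in Spark, a grouping step such as `groupByKey()` before collecting would play that role).

```python
from collections import defaultdict

pairs = [('C', 3), ('A', 1), ('B', 2), ('A', 4), ('B', 5)]

# Building a dict from pairs keeps the last value seen for each key.
last_wins = dict(pairs)

# Alternative: group all values per key so nothing is lost.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)
all_values = dict(grouped)

print(last_wins)
print(all_values)
```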

In [9]:
# collectAsMap
x = sc.parallelize([('C',3),('A',1),('B',2), ('D', 4), ('E', 5)])
y = x.collectAsMap()
print("x=", x.collect())
print("The dictionary of {key:value} pairs:",y)

# If there are duplicate keys, the value of the last occurrence is retained.
x = sc.parallelize([('C',3),('A',1),('B',2), ('A', 4), ('B', 5)])
y = x.collectAsMap()
print("x=",x.collect())
print("Result if there are duplicates:",y)

x= [('C', 3), ('A', 1), ('B', 2), ('D', 4), ('E', 5)]
The dictionary of {key:value} pairs: {'C': 3, 'A': 1, 'B': 2, 'D': 4, 'E': 5}
x= [('C', 3), ('A', 1), ('B', 2), ('A', 4), ('B', 5)]
Result if there are duplicates: {'C': 3, 'A': 4, 'B': 5}


# Mapping

<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.map">
<img align=left src="images/pyspark-page3.svg" width=360 height=203 />
</a>

**map(f, preservesPartitioning=False)**

The `map()` method returns a new distributed dataset formed by passing each element of the source RDD through a function `f`.

**Parameters:**
- `f`: The function to apply to each element of the RDD.
- `preservesPartitioning` (optional): A boolean flag indicating whether the new RDD should preserve the original partitioning. The default value is `False`.

**Returns:**
A new distributed dataset (RDD) where each element is the result of applying the function `f` to the corresponding element of the source RDD.

**Note:**
- The function `f` can be any Python function, lambda function, or a callable object that accepts an element of the RDD as input and produces a transformed output.
- The `map()` operation is a transformation operation in PySpark, meaning it is **lazily evaluated**. It will not be executed until an action is triggered on the resulting RDD.
- By default, the new RDD does not preserve the original partitioning. If `preservesPartitioning` is set to `True`, the resulting RDD will have the same partitioning as the source RDD, assuming the transformation does not change the keys of the elements.
- The `map()` operation is commonly used for element-wise transformations, such as applying mathematical operations, data cleaning, feature extraction, or any other custom logic on each element of the RDD.
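
Lazy evaluation is easy to observe without a cluster, because Python's built-in `map()` is lazy in a comparable way: it records the function and the input, and the function only runs when the result is consumed. The sketch below is an analogy, not Spark code; consuming the iterator with `list()` plays the role of an action such as `collect()`.

```python
calls = []  # records every element the function actually processed


def square(v):
    calls.append(v)
    return v * v


lazy = map(square, [1, 2, 3])   # nothing has been computed yet
calls_before = list(calls)

result = list(lazy)             # consuming the iterator triggers the work
calls_after = list(calls)

print(calls_before)
print(result)
print(calls_after)
```

A practical consequence in Spark is that an error inside the mapped function surfaces only when an action runs, not on the line where `map()` is called.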

In [57]:
x = sc.parallelize(["b", "a", "c"])
y = x.map(lambda x: (x, 1))
print("x=",x.collect())  # collect copies RDD elements to a list on the driver
print(y.collect())

x= ['b', 'a', 'c']
[('b', 1), ('a', 1), ('c', 1)]


In [11]:
# map
x = sc.parallelize([1,2,3]) # sc = spark context, parallelize creates an RDD from the passed object
y = x.map(lambda x: (x,x**2))
print("x=",x.collect())  # collect copies RDD elements to a list on the driver
print(y.collect())

x= [1, 2, 3]
[(1, 1), (2, 4), (3, 9)]


In [12]:
# map
x = sc.parallelize([1,2,3]) # sc = spark context, parallelize creates an RDD from the passed object
y = x.map(lambda x: x**2)
print("x=",x.collect())  # collect copies RDD elements to a list on the driver
print(y.collect())

x= [1, 2, 3]
[1, 4, 9]


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.flatMap">
<img align=left src="images/pyspark-page4.svg" width=360 height=203 /></a>
<br>

**flatMap(f, preservesPartitioning=False)**

The `flatMap()` method returns a new RDD by first applying a function `f` to all elements of the source RDD, and then flattening the resulting sequences or collections.

**Parameters:**
- `f`: The function to apply to each element of the RDD.
- `preservesPartitioning` (optional): A boolean flag indicating whether the new RDD should preserve the original partitioning. The default value is `False`.

**Returns:**
A new RDD resulting from applying the function `f` to each element of the source RDD and flattening the results.

**Note:**
- The function `f` can be any Python function, lambda function, or a callable object that accepts an element of the RDD as input and returns an iterable (e.g., a list, tuple, set) of elements.
- The `flatMap()` operation is a transformation operation in PySpark, meaning it is lazily evaluated. It will not be executed until an action is triggered on the resulting RDD.
- The resulting RDD may have more or fewer elements than the source RDD, depending on the transformation function `f`.
- By default, the new RDD does not preserve the original partitioning. If `preservesPartitioning` is set to `True`, the resulting RDD will have the same partitioning as the source RDD, assuming the transformation does not change the keys of the elements.
- The `flatMap()` operation is commonly used when each input element of the RDD is mapped to multiple output elements, such as when exploding nested structures, tokenizing text, or performing any operation that expands or flattens the data.
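
On a local list, the "map, then flatten" definition can be written with `itertools.chain.from_iterable`. This is a model of the semantics only; `flat_map_model` is a hypothetical helper and a plain list stands in for the RDD.

```python
from itertools import chain


def flat_map_model(f, elements):
    """Apply `f` to every element, then concatenate the returned iterables."""
    return list(chain.from_iterable(map(f, elements)))


# Each element expands into several outputs.
print(flat_map_model(lambda v: range(1, v), [2, 3, 4]))
print(flat_map_model(lambda v: (v, 100 * v, v ** 2), [1, 2, 3]))

# Returning an empty iterable drops the element, so the result can also shrink.
print(flat_map_model(lambda v: [v] if v % 2 else [], [1, 2, 3]))
```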

In [58]:
# Map
x = sc.parallelize([(1,2,3),(2,3,4),(3,4,5)])
y = x.map(lambda x: x)
print("x=",x.collect())
print(y.collect())

x= [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]


In [59]:
# flatMap
x = sc.parallelize([(1,2,3),(2,3,4),(3,4,5)])
y = x.flatMap(lambda x: x)
print("x=",x.collect())
print(y.collect())

x= [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
[1, 2, 3, 2, 3, 4, 3, 4, 5]


In [60]:
# flatMap
x = sc.parallelize([1,2,3])
y = x.flatMap(lambda x: (x, 100*x, x**2))
print("x=",x.collect())
print(y.collect())

x= [1, 2, 3]
[1, 100, 1, 2, 200, 4, 3, 300, 9]


In [61]:
# Map
x = sc.parallelize([1,2,3])
y = x.map(lambda x: (x, 100*x, x**2))
print("x=",x.collect())
print(y.collect())

x= [1, 2, 3]
[(1, 100, 1), (2, 200, 4), (3, 300, 9)]


In [62]:
x = sc.parallelize([2, 3, 4])
y = x.flatMap(lambda x: range(1, x))
print("x=",x.collect())
print(y.collect())

x= [2, 3, 4]
[1, 1, 2, 1, 2, 3]


In [18]:
# Split sentence into words
lines = sc.parallelize([
    "Apache Spark is a unified analytics engine for large-scale data processing.",
    "It provides high-level APIs in Java, Scala, Python and R",
    "It also supports a rich set of higher-level tools including Spark SQL",
    "MLlib for machine learning",
    "GraphX for graph processing",
    "Structured Streaming for incremental computation and stream processing"
 ])
words = lines.flatMap(lambda x: x.split(' '))
print(lines.collect())
print(words.collect())

['Apache Spark is a unified analytics engine for large-scale data processing.', 'It provides high-level APIs in Java, Scala, Python and R', 'It also supports a rich set of higher-level tools including Spark SQL', 'MLlib for machine learning', 'GraphX for graph processing', 'Structured Streaming for incremental computation and stream processing']
['Apache', 'Spark', 'is', 'a', 'unified', 'analytics', 'engine', 'for', 'large-scale', 'data', 'processing.', 'It', 'provides', 'high-level', 'APIs', 'in', 'Java,', 'Scala,', 'Python', 'and', 'R', 'It', 'also', 'supports', 'a', 'rich', 'set', 'of', 'higher-level', 'tools', 'including', 'Spark', 'SQL', 'MLlib', 'for', 'machine', 'learning', 'GraphX', 'for', 'graph', 'processing', 'Structured', 'Streaming', 'for', 'incremental', 'computation', 'and', 'stream', 'processing']


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.mapValues">
<img align=left src="images/pyspark-page56.svg" width=360 height=203 />
</a>

**mapValues(f)**

The `mapValues()` method applies a map function `f` to each value in a key-value pair RDD, while retaining the original keys and the partitioning of the RDD.

**Parameters:**
- `f`: The function to apply to each value in the key-value pairs.

**Returns:**
A new key-value pair RDD with the same keys as the original RDD, where each value has been transformed by the function `f`.

**Note:**
- The `mapValues()` operation only applies the provided function `f` to the values of the key-value pairs, keeping the keys unchanged.
- The function `f` can be any Python function, lambda function, or a callable object that accepts a value from the key-value pairs as input and returns a transformed value.
- The `mapValues()` operation is a transformation operation in PySpark, meaning it is lazily evaluated. It will not be executed until an action is triggered on the resulting RDD.
- The resulting RDD will have the same partitioning as the original RDD, preserving the partitioning scheme of the key-value pair RDD.
- The `mapValues()` operation is useful when you want to apply a transformation to the values of a key-value pair RDD while keeping the keys intact. It is commonly used for value-specific operations, such as mathematical transformations, data cleaning, or feature extraction on the values of the RDD.

In [63]:
# mapValues
x = sc.parallelize([('A',(1,2,3)),('B',(4,5))])
y = x.mapValues(lambda x: [i**2 for i in x]) # function is applied to entire value
print("x=",x.collect())
print(y.collect())

x= [('A', (1, 2, 3)), ('B', (4, 5))]
[('A', [1, 4, 9]), ('B', [16, 25])]


In [64]:
# the same result as mapValues, written with map
x = sc.parallelize([('A',(1,2,3)),('B',(4,5))])
y = x.map(lambda x: (x[0], [i**2 for i in x[1]])) # function receives the whole (key, value) pair and must rebuild it
print("x=",x.collect())
print(y.collect())

x= [('A', (1, 2, 3)), ('B', (4, 5))]
[('A', [1, 4, 9]), ('B', [16, 25])]


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.filter">
<img align=left src="images/pyspark-page8.svg" width=360 height=203 />
</a>

**filter(func)**

The `filter()` method returns a new dataset formed by selecting the elements from the source dataset for which the function `func` returns `True`.

**Parameters:**
- `func`: The function that determines the filtering condition for each element.

**Returns:**
A new dataset (RDD) that contains the elements from the source dataset for which the function `func` returns `True`.

**Note:**
- The function `func` can be any Python function, lambda function, or a callable object that accepts an element of the dataset as input and returns a Boolean value indicating whether the element should be included (`True`) or excluded (`False`).
- The `filter()` operation is a transformation operation in PySpark, meaning it is lazily evaluated. It will not be executed until an action is triggered on the resulting RDD.
- The `filter()` operation is commonly used to perform data filtering or selection based on certain criteria. It allows you to include or exclude specific elements from the dataset based on a custom filtering logic.

In [65]:
# filter
x = sc.parallelize([1,2,3])
y = x.filter(lambda x: x%2 == 1)  # keeps the odd elements
print("x=",x.collect())
print(y.collect())

x= [1, 2, 3]
[1, 3]


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.distinct">
<img align=left src="images/pyspark-page9.svg" width=360 height=203 />
</a>

**distinct(numPartitions=None)**

The `distinct()` method returns a new RDD that contains only the distinct elements from the source RDD, removing any duplicate elements.

**Parameters:**
- `numPartitions` (optional): The number of partitions to use for the resulting RDD. If not specified, the default partitioning scheme will be used.

**Returns:**
A new RDD containing only the distinct elements from the source RDD.

**Note:**
- The order of elements in the resulting RDD may not be preserved.
- The `distinct()` operation is a transformation operation in PySpark, meaning it is lazily evaluated. It will not be executed until an action is triggered on the resulting RDD.
- The `distinct()` operation is useful for eliminating duplicate elements in a dataset, ensuring that each element appears only once. It is commonly used for data deduplication or to extract unique values from a dataset.

In [66]:
# distinct
x = sc.parallelize(['A','A','B'])
y = x.distinct()
print("x=",x.collect())
print(y.collect())

x= ['A', 'A', 'B']
['A', 'B']


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.keys">
<img align=left src="images/pyspark-page42.svg" width=360 height=203 />
</a>

**keys()**

The `keys()` method returns a new RDD that contains only the keys of each tuple in a key-value pair RDD.

**Returns:**
An RDD containing only the keys of each tuple in the key-value pair RDD.

**Note:**
- The resulting RDD will have the same number of elements as the original RDD, with each element representing a key from the tuples.
- The `keys()` operation is a transformation operation in PySpark, meaning it is lazily evaluated. It will not be executed until an action is triggered on the resulting RDD.
- The `keys()` operation is commonly used when you need to perform operations specifically on the keys of a key-value pair RDD, such as filtering or joining based on the keys.

In [23]:
# keys
x = sc.parallelize([('C',3),('A',1),('B',2)])
y = x.keys()
print("x=",x.collect())
print(y.collect())

x= [('C', 3), ('A', 1), ('B', 2)]
['C', 'A', 'B']


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.values">
<img align=left src="images/pyspark-page43.svg" width=360 height=203 />
</a>

**values()**

The `values()` method returns a new RDD that contains only the values of each tuple in a key-value pair RDD.

**Returns:**
An RDD containing only the values of each tuple in the key-value pair RDD.

**Note:**
- The resulting RDD will have the same number of elements as the original RDD, with each element representing a value from the tuples.
- The `values()` operation is a transformation operation in PySpark, meaning it is lazily evaluated. It will not be executed until an action is triggered on the resulting RDD.
- The `values()` operation is commonly used when you need to perform operations specifically on the values of a key-value pair RDD, such as aggregations, calculations, or transformations on the values.

In [24]:
# values
x = sc.parallelize([('C',3),('A',1),('B',2)])
y = x.values()
print("x=",x.collect())
print(y.collect())

x= [('C', 3), ('A', 1), ('B', 2)]
[3, 1, 2]


# Partitions
Partitions in Spark refer to the fundamental units of parallelism in distributed processing. When you work with large datasets in Spark, they are divided into smaller, more manageable chunks called partitions. Each partition contains a subset of the data and can be processed independently on different executor nodes in a cluster.

Here are some key points about partitions in Spark:

- **Parallel Processing**: Spark performs computations in parallel by dividing the data into partitions. Each partition is processed independently by one task, and tasks run concurrently on the cores of the available executors, enabling parallelism and distributed computing.
- **Data Distribution**: Partitions help distribute the data across the nodes in a cluster. By dividing the data into smaller partitions, Spark can achieve load balancing and utilize the available resources efficiently.
- **Partitioning Schemes**: Spark provides various partitioning schemes, such as hash partitioning and range partitioning, to determine how the data is divided among partitions. The choice of partitioning scheme can impact data distribution and performance.
- **Transformation and Actions**: Transformations in Spark, such as `map()` or `filter()`, are applied on a per-partition basis. Actions like `reduce()` or `collect()` operate on the data across all partitions, leveraging parallelism.
- **Control and Optimization**: Partitions provide fine-grained control over data processing. Developers can control the number of partitions, repartition data, or perform custom partitioning to optimize performance and resource usage.
- **Data Locality**: Spark tries to achieve data locality, where partitions are processed on the same nodes where the data resides or is cached. This reduces data transfer across the network and improves performance.
- **Shuffling**: Certain operations, like `groupBy()` or `join()`, may require data to be shuffled across partitions. Shuffling involves redistributing data based on specific keys or criteria, which can incur additional overhead.
- **Partition Size**: The optimal partition size depends on factors like the available resources, data characteristics, and the specific workload. Choosing an appropriate partition size helps balance data distribution, minimize data skew, and avoid memory or performance issues.

Understanding and managing partitions in Spark is crucial for efficient data processing and performance optimization. By appropriately configuring and utilizing partitions, you can leverage the distributed processing capabilities of Spark to handle large-scale datasets effectively.
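
Hash partitioning, mentioned in the list above, can be sketched in a few lines: a key's partition is its hash modulo the number of partitions, so equal keys always land together. The code is a pure-Python illustration with a hypothetical `hash_partition` helper; integer keys are used because Python's string hashes can differ between processes.

```python
def hash_partition(pairs, num_partitions):
    """Route each (key, value) pair to partition hash(key) % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


pairs = [(1, 'a'), (2, 'b'), (3, 'c'), (1, 'd'), (4, 'e'), (2, 'f')]
parts = hash_partition(pairs, 3)
for index, part in enumerate(parts):
    print(index, part)
```

Because all pairs with the same key end up in one partition, per-key operations can then run without further data movement. The same property explains data skew: a very frequent key makes its partition much larger than the others.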

<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.getNumPartitions">
<img align=left src="images/pyspark-page7.svg" width=360 height=203 />
</a>

**getNumPartitions()**

The `getNumPartitions()` method returns the number of partitions in an RDD.

**Returns:**
The number of partitions in the RDD.

**Note:**
- Partitions in an RDD represent the division of data into smaller, manageable chunks for distributed processing.
- The number of partitions affects parallelism and the degree of concurrency during RDD processing.
- The actual number of partitions in an RDD depends on factors such as the input data, data sources, transformations applied, and the cluster configuration.
- The `getNumPartitions()` operation is a metadata operation in Spark and does not trigger any computation.
- By knowing the number of partitions in an RDD, you can optimize transformations, resource allocation, and workload management accordingly.

In [25]:
# getNumPartitions
x = sc.parallelize([1,2,3,4,5,6,7,8,9,10,11,12,13,14], 5)
y = x.getNumPartitions()
print(x.glom().collect())
print(y)

[[1, 2], [3, 4], [5, 6, 7, 8], [9, 10], [11, 12, 13, 14]]
5
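
The partition sizes in the output above (2, 2, 4, 2, 4) are uneven even though 14 elements could be split more evenly over 5 partitions. One plausible explanation, stated here as an assumption about an implementation detail that may differ between versions and serializers, is that PySpark first groups the elements into batches and then distributes whole batches over the partitions. The pure-Python model below (`model_parallelize` is a hypothetical name) reproduces that output.

```python
def model_parallelize(data, num_slices):
    """Model: group elements into batches, then assign whole batches to slices."""
    data = list(data)
    # Assumed batch size rule: roughly len/num_slices, at least 1, capped at 1024.
    batch_size = max(1, min(len(data) // num_slices, 1024))
    batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    count = len(batches)
    partitions = []
    for i in range(num_slices):
        start = i * count // num_slices
        end = (i + 1) * count // num_slices
        partitions.append([x for batch in batches[start:end] for x in batch])
    return partitions


print(model_parallelize(range(1, 15), 5))
print(model_parallelize(['C', 'B', 'A'], 2))
```

If balanced partitions matter, inspect the real layout with `glom().collect()` rather than relying on this model.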


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.repartition">
<img align=left src="images/pyspark-page63.svg" width=360 height=203 />
</a>

**repartition(numPartitions)**

The `repartition()` method returns a new RDD that has exactly `numPartitions` partitions. This operation can increase or decrease the level of parallelism in the RDD by redistributing the data using a shuffle.

**Parameters:**
- `numPartitions`: The desired number of partitions for the resulting RDD.

**Returns:**
A new RDD with exactly `numPartitions` partitions.

**Note:**
- The `repartition()` operation reshuffles the data in the RDD to create a new RDD with the specified number of partitions.
- `repartition()` always performs a full shuffle, whether the number of partitions is increased or decreased. This can be an expensive operation.
- If the number of partitions is decreased, it is recommended to use the `coalesce()` operation instead of `repartition()`. `coalesce()` can avoid a full shuffle by merging partitions without redistributing the data randomly.
- The resulting RDD may have a different distribution of data across partitions compared to the original RDD.
- The `repartition()` operation is a transformation operation in PySpark and is lazily evaluated. It will not be executed until an action is triggered on the resulting RDD.
- Use `repartition()` when you explicitly need to change the number of partitions in an RDD, such as to increase parallelism or balance data distribution. If you want to decrease the number of partitions without performing a full shuffle, consider using `coalesce()`.

In [26]:
# repartition
x = sc.parallelize([1,2,3,4,5],2)
y = x.repartition(numPartitions=3)
print(x.glom().collect())
print(y.glom().collect())

[[1, 2], [3, 4, 5]]
[[], [1, 2], [3, 4, 5]]


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.coalesce">
<img align=left src="images/pyspark-page64.svg" width=360 height=203 />
</a>

**coalesce(numPartitions, shuffle=False)**

The `coalesce()` method returns a new RDD that is reduced into `numPartitions` partitions.

**Parameters:**
- `numPartitions`: The number of partitions to reduce the RDD into.
- `shuffle` (optional): A boolean flag indicating whether to shuffle the data during the coalesce operation. The default value is `False`.

**Returns:**
A new RDD that has been reduced into `numPartitions` partitions.

**Note:**
- The `coalesce()` operation reduces the number of partitions in an RDD to the specified `numPartitions`. With the default `shuffle=False` it cannot increase the partition count; requesting a larger `numPartitions` leaves the RDD at its current number of partitions.
- If `shuffle` is set to `False`, the coalesce operation tries to minimize data movement and avoids a full shuffle. It merges whole existing partitions into fewer, larger ones.
- If `shuffle` is set to `True`, the coalesce operation performs a full shuffle, redistributing the data across partitions randomly. This can be more expensive in terms of performance and resource usage.
- The resulting RDD may have fewer partitions than the original RDD, but it does not guarantee a balanced distribution of data across the partitions.
- The `coalesce()` operation is a transformation operation in PySpark and is lazily evaluated. It will not be executed until an action is triggered on the resulting RDD.
- The `coalesce()` operation can be useful for reducing the number of partitions to optimize resource usage, improve data locality, or prepare the data for subsequent operations that require a specific partitioning scheme.
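
Coalescing without a shuffle can be modelled as concatenating neighbouring partitions. The sketch below is a simplification under stated assumptions: `coalesce_model` is a hypothetical helper, partitions are plain lists, and it always groups adjacent partitions, whereas Spark may also take data locality into account when choosing which partitions to merge.

```python
def coalesce_model(partitions, num_partitions):
    """Merge adjacent partitions until only `num_partitions` remain (no shuffle)."""
    parent_count = len(partitions)
    if num_partitions >= parent_count:
        # Without a shuffle the partition count cannot grow.
        return [list(part) for part in partitions]
    merged = []
    for i in range(num_partitions):
        start = i * parent_count // num_partitions
        end = (i + 1) * parent_count // num_partitions
        merged.append([x for part in partitions[start:end] for x in part])
    return merged


before = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12, 13, 14]]
print(coalesce_model(before, 2))
print(coalesce_model(before, 8))  # larger request: layout stays at 4 partitions
```

Since whole partitions are merged and never split, an uneven input layout stays uneven after coalescing.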

In [27]:
# coalesce
x = sc.parallelize([1,2,3,4,5,6,7,8,9,10,11,12,13,14], 4)
y = x.coalesce(numPartitions=2)
print(x.glom().collect())
print(y.glom().collect())

[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12, 13, 14]]
[[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12, 13, 14]]


**The difference between repartition and coalesce**

The main difference between `repartition()` and `coalesce()` in PySpark is how they handle the number of partitions and data shuffling:

1. **Number of Partitions:**
   - `repartition(numPartitions)` explicitly sets the exact number of partitions for the resulting RDD. It can increase or decrease the number of partitions by performing a full shuffle of the data.
   - `coalesce(numPartitions)` with the default `shuffle=False` can only decrease the number of partitions. It tries to minimize data movement by merging partitions without a full shuffle.

2. **Data Shuffling:**
   - `repartition(numPartitions)` always performs a shuffle operation, redistributing the data across the new partitions. It is an expensive operation as it involves data movement and network communication.
   - `coalesce(numPartitions)` avoids a shuffle when `shuffle=False` (the default). It merges whole existing partitions into fewer, larger ones, which is usually cheaper than a full shuffle. If `shuffle=True` is set explicitly, it performs a shuffle.

Considerations:
- If you need to increase or explicitly set the number of partitions or perform a random redistribution of data, use `repartition()`.
- If you want to decrease the number of partitions without a shuffle, use `coalesce()` to minimize data movement.
- `coalesce()` is more efficient than `repartition()` when reducing the number of partitions, as it avoids the costly shuffle operation. However, it may result in an uneven data distribution across partitions.
- If you are uncertain about whether to use `repartition()` or `coalesce()`, consider factors such as the desired level of parallelism, data skew, available resources, and the cost of shuffling.
- Both operations are transformation operations and are lazily evaluated, meaning they won't be executed until an action is triggered on the resulting RDD.

<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.glom">
<img align=left src="images/pyspark-page16.svg" width=360 height=203 />
</a>

**glom()**

The `glom()` method returns a new RDD created by coalescing all elements within each partition into a list.

**Returns:**
An RDD where each partition is represented as a single list containing all the elements of that partition.

**Note:**
- The `glom()` operation is a transformation operation in PySpark that restructures the RDD by combining all elements within each partition into a single list.
- Each partition in the resulting RDD is represented as a list containing all the elements from that partition.
- The order of elements within each list is the same as the original order of elements within the partition.
- The `glom()` operation is useful when you need to process the entire partition as a whole, rather than individual elements. It can be beneficial for certain types of computations or operations that require aggregating or analyzing data within each partition.
- The `glom()` operation is lazily evaluated and will not be executed until an action is triggered on the resulting RDD.
- It is important to consider the size of partitions and memory constraints when using `glom()`, as coalescing all elements into a single list within each partition can increase memory requirements.

In [28]:
# glom
x = sc.parallelize(['C','B','A'], 2)
y = x.glom()
print("x=",x.collect()) 
print(y.collect())

x= ['C', 'B', 'A']
[['C'], ['B', 'A']]
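
The partition layout that `glom()` reveals above can be reproduced with a plain-Python sketch. The slicing rule in the hypothetical helper `model_slices` is an assumption chosen because it matches the output shown; it is not taken from the Spark source.

```python
def model_slices(data, num_slices):
    """Split a list into num_slices contiguous chunks (assumed slicing rule)."""
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

partitions = model_slices(['C', 'B', 'A'], 2)
print(partitions)                     # [['C'], ['B', 'A']]
print([len(p) for p in partitions])   # per-partition sizes: [1, 2]
```

On a real RDD, a common way to inspect partition sizes in the same spirit is `rdd.glom().map(len).collect()`.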


# Sampling

<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.takeSample">
<img align=left src="images/pyspark-page11.svg" width=360 height=203 />
</a>

**takeSample(withReplacement, num, seed=None)**

The `takeSample()` method returns a fixed-size sampled subset of the RDD as a list on the driver, so it should only be used when the requested sample is small enough to fit in the driver's memory.

**Parameters:**
- `withReplacement`: A boolean flag indicating whether sampling should be done with replacement (`True`) or without replacement (`False`).
- `num`: The number of elements to sample from the RDD.
- `seed` (optional): The seed value for the random number generator used for sampling. Providing a seed ensures reproducibility of the sampled subset.

**Returns:**
A list containing a fixed-size sampled subset of the RDD.

**Note:**
- The `takeSample()` operation randomly selects a fixed-size subset of elements from the RDD.
- If `withReplacement` is set to `True`, the same element can be sampled multiple times, allowing duplicates in the resulting subset. If `False`, each element can be selected at most once, ensuring distinct elements in the subset.
- The `num` parameter specifies the size of the sampled subset.
- The `seed` parameter is used to initialize the random number generator for reproducibility. If it is not provided, the sample may differ each time the operation is executed.
- Recent PySpark releases do not need the `numpy` library for `takeSample()`; early releases used it when available and otherwise fell back to Python's built-in random generator.
- When sampling without replacement, the result contains fewer than `num` elements if the RDD itself has fewer than `num` elements.
- The `takeSample()` operation is an action in PySpark, and it triggers the execution of the sampling and returns the sampled subset as a list.
- The `takeSample()` operation is commonly used when you need a random subset of elements from an RDD for analysis, testing, or sampling purposes.

In [67]:
# takeSample
x = sc.parallelize(range(7))
ylist = [x.takeSample(withReplacement=False, num=3) for _ in range(5)]  # call takeSample 5 times
print('x = ' + str(x.collect()))
for cnt, y in enumerate(ylist):
    print('sample:' + str(cnt) + ' y = ' +  str(y))  # y is already a list, so no collect is needed

x = [0, 1, 2, 3, 4, 5, 6]
sample:0 y = [1, 2, 4]
sample:1 y = [1, 2, 4]
sample:2 y = [4, 1, 2]
sample:3 y = [5, 6, 2]
sample:4 y = [5, 2, 3]
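
The effect of `seed` and `withReplacement` can be illustrated without Spark. The sketch below uses Python's `random` module in a hypothetical helper `model_take_sample`; it does not reproduce Spark's sampling algorithm, so the sampled values will differ from what `takeSample()` returns for the same seed.

```python
import random

data = list(range(7))

def model_take_sample(values, with_replacement, num, seed=None):
    """Draw a fixed-size sample, with or without replacement."""
    rng = random.Random(seed)
    if with_replacement:
        return [rng.choice(values) for _ in range(num)]
    return rng.sample(values, min(num, len(values)))

a = model_take_sample(data, False, 3, seed=42)
b = model_take_sample(data, False, 3, seed=42)
print(a == b)                                    # True: same seed, same sample
print(len(model_take_sample(data, False, 10)))   # 7: cannot exceed the data size
print(len(model_take_sample(data, True, 10)))    # 10: duplicates are allowed
```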


# Set operations

<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.union">
<img align=left src="images/pyspark-page12.svg" width=360 height=203 />
</a>

**union(other)**

The `union()` method returns a new RDD that represents the union of the elements in the source RDD and another RDD.

**Parameters:**
- `other`: The RDD to be combined with the source RDD.

**Returns:**
A new RDD that contains all the elements from the source RDD and the `other` RDD.

**Note:**
- The resulting RDD contains all the elements from both the source RDD and the `other` RDD, without eliminating duplicates. If an element appears in both RDDs, it will be included twice in the resulting RDD.
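- PySpark does not enforce a common element type for the two RDDs, but in practice both should hold comparable data so that later transformations can treat every element the same way.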
- The `union()` operation is a transformation operation in PySpark, meaning it is lazily evaluated. It will not be executed until an action is triggered on the resulting RDD.
- The resulting RDD typically has as many partitions as the two input RDDs combined, because `union()` concatenates their partitions without shuffling any data.
- The `union()` operation is useful for combining the data from two RDDs into a single RDD, enabling you to perform operations or analysis on the combined dataset.

In [30]:
# union
x = sc.parallelize(['A','A','B'])
y = sc.parallelize(['D','C','A'])
z = x.union(y)
print("x=",x.collect())
print("y=",y.collect())
print(z.collect())

x= ['A', 'A', 'B']
y= ['D', 'C', 'A']
['A', 'A', 'B', 'D', 'C', 'A']
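
Because `union()` keeps duplicates, it behaves like list concatenation rather than a mathematical set union. A plain-Python sketch of the difference, using the same data as above (the variable names are illustrative only):

```python
x = ['A', 'A', 'B']
y = ['D', 'C', 'A']

bag_union = x + y                    # what union() returns: duplicates kept
set_union = sorted(set(bag_union))   # the distinct elements, sorted here for display

print(bag_union)   # ['A', 'A', 'B', 'D', 'C', 'A']
print(set_union)   # ['A', 'B', 'C', 'D']
```

On RDDs, the set-style result is obtained by chaining `distinct()` after `union()`; unlike this sketch, `distinct()` does not sort its output.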


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.intersection">
<img align=left src="images/pyspark-page13.svg" width=360 height=203 />
</a>

**intersection(other)**

The `intersection()` method returns a new RDD that represents the intersection of the elements in the source RDD and another RDD. The resulting RDD contains only the distinct elements that are common to both RDDs.

**Parameters:**
- `other`: The RDD to find the intersection with.

**Returns:**
A new RDD that contains the distinct elements common to both the source RDD and the `other` RDD.

**Note:**
- The `intersection()` operation finds the common elements between the source RDD and the `other` RDD, removing any duplicates in the process.
- The resulting RDD contains only the distinct elements that are present in both RDDs. If an element appears multiple times in either RDD, it will appear only once in the resulting RDD.
- The `intersection()` operation performs a shuffle internally to identify the common elements across partitions of the RDDs. This shuffle can incur additional overhead.
- PySpark does not enforce a common element type for the two RDDs, but the elements must be hashable and comparable for equality, because they are matched by value.
- The `intersection()` operation is a transformation operation in PySpark, meaning it is lazily evaluated. It will not be executed until an action is triggered on the resulting RDD.
- The resulting RDD will have a number of partitions based on the partitioning scheme of the source RDD and the `other` RDD.
- The `intersection()` operation is useful for finding common elements between two RDDs, such as identifying shared data points, performing set operations, or filtering datasets based on common attributes.

In [31]:
# intersection
x = sc.parallelize(['A','A','B'])
y = sc.parallelize(['A','C','D'])
z = x.intersection(y)
print("x=",x.collect())
print(y.collect())
print(z.collect())

x= ['A', 'A', 'B']
['A', 'C', 'D']
['A']
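
One way to picture why `intersection()` needs a shuffle is that every value has to be brought together with its matches from the other RDD. The following plain-Python sketch groups by value and keeps the values seen on both sides; `model_intersection` is a hypothetical helper and only approximates what Spark does internally.

```python
def model_intersection(left, right):
    """Keep each value that occurs at least once on both sides, exactly once."""
    groups = {}                       # value -> [seen on left?, seen on right?]
    for v in left:
        groups.setdefault(v, [False, False])[0] = True
    for v in right:
        groups.setdefault(v, [False, False])[1] = True
    return [v for v, seen in groups.items() if all(seen)]

print(model_intersection(['A', 'A', 'B'], ['A', 'C', 'D']))  # ['A']
```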


# Aggregate

<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.reduce">
<img align=left src="images/pyspark-page23.svg" width=360 height=203 />
</a>

**reduce(f)**

The `reduce()` method reduces the elements of the RDD using the specified **commutative and associative binary operator** `f`. It first reduces the elements within each partition locally and then combines the partial results.

**Parameters:**
- `f`: The commutative and associative binary operator function to be applied to the elements.

**Returns:**
The result of reducing the elements of the RDD using the binary operator `f`.

**Note:**
- The `reduce()` operation applies the binary operator `f` to the elements of the RDD in a cumulative manner, combining them to produce a single result.
- The binary operator `f` must be commutative and associative, meaning the order of applying the operator does not affect the result, and the grouping of elements does not impact the final output.
- The reduction is performed independently within each partition of the RDD, resulting in partial results for each partition.
- The partial results from each partition are then combined together using the same binary operator `f` to produce the final result.
- The `reduce()` operation is useful for aggregating the elements of an RDD into a single value, such as calculating sums, maximum or minimum values, or any other operation that can be expressed as a commutative and associative binary operation.

In [3]:
# reduce
x = sc.parallelize([1,2,3])
y = x.reduce(lambda a, b: a + b)  # computes the sum of all elements
print("x=",x.collect())
print(y)

x= [1, 2, 3]
6
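
The requirement that `f` be commutative and associative can be seen in a plain-Python model of the two-stage reduction described above. `model_reduce` is a hypothetical helper, and the explicit partition lists stand in for however Spark happens to split the data.

```python
from functools import reduce

def model_reduce(partitions, f):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(f, part) for part in partitions if part]
    return reduce(f, partials)

add = lambda a, b: a + b
sub = lambda a, b: a - b

print(model_reduce([[1, 2, 3]], add), model_reduce([[1], [2, 3]], add))  # 6 6
print(model_reduce([[1, 2, 3]], sub), model_reduce([[1], [2, 3]], sub))  # -4 2
```

Addition gives the same answer for both layouts, while subtraction depends on where the partition boundary falls, which is why it is not a safe operator for `reduce()`.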


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.fold">
<img align=left src="images/pyspark-page24.svg" width=360 height=203 />
</a>

**fold(zeroValue, op)**

The `fold()` method aggregates the elements of each partition in an RDD and then combines the results of all partitions using a given associative function and a neutral "zero value". The `op` function is used to perform the aggregation and must adhere to certain requirements.

**Parameters:**
- `zeroValue`: The initial or neutral value for the aggregation operation.
- `op`: The associative function used for aggregating the elements.

**Returns:**
The result of aggregating the elements of the RDD using the `op` function and the `zeroValue`.

**Note:**
- The `fold()` operation applies the `op` function to each partition of the RDD to aggregate the elements within that partition.
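- The `op` function must be associative, meaning the way the applications are grouped does not affect the final result. Because partitions are folded individually and their results are then folded together, a function that is not also commutative may give a different result than a sequential fold over a local collection.
- The `zeroValue` is the starting value for the fold within every partition and again for combining the partition results, so it must be a neutral element that does not change the result when combined with any other element using the `op` function.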
- The `op(t1,t2)` function is allowed to modify the `t1` parameter and return it as the result value, avoiding object allocation. However, it should not modify the `t2` parameter.
- The aggregation is performed independently within each partition, resulting in partial results for each partition.
- The partial results from each partition are then combined together using the `op` function to produce the final result.
- The `fold()` operation is useful for aggregating data in RDDs by applying a user-defined associative function. Because `op` may update and return its first argument, it can avoid allocating a new object at every step.

In [77]:
# fold
x = sc.parallelize([1,2,3])
neutral_zero_value = 0  # 0 for sum, 1 for multiplication
y = x.fold(neutral_zero_value, lambda accumulated, obj: accumulated + obj)  # computes the sum of all elements
print("x=",x.collect())
print(y)

x= [1, 2, 3]
6
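
Why the zero value must be neutral becomes visible in a plain-Python model of `fold()`. The sketch assumes the zero value is used once per partition and once more when the partition results are combined; `model_fold` is a hypothetical helper, and the partition layout is chosen by hand.

```python
from functools import reduce

def model_fold(partitions, zero_value, op):
    """Fold each partition from zero_value, then fold the partials from zero_value."""
    partials = [reduce(op, part, zero_value) for part in partitions]
    return reduce(op, partials, zero_value)

add = lambda acc, el: acc + el
parts = [[1], [2, 3]]

print(model_fold(parts, 0, add))   # 6: 0 is neutral for addition
print(model_fold(parts, 1, add))   # 9: the 1 is added once per partition and once more
```

With a non-neutral zero value the result depends on the number of partitions, so the same code could return different answers on different cluster configurations.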


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.aggregate">
<img align=left src="images/pyspark-page25.svg" width=360 height=203 />
</a>

**aggregate(zeroValue, seqOp, combOp)**

The `aggregate()` method aggregates the elements of each partition in an RDD and then combines the results of all partitions using the given combine functions and a neutral "zero value". The result type can differ from the type of the RDD's elements, which is why two functions are needed: one to fold an element into a partial result and one to merge two partial results.

**Parameters:**
- `zeroValue`: The initial or neutral value for the aggregation operation.
- `seqOp`: The sequential operation function used for aggregating the elements within each partition.
- `combOp`: The combined operation function used for merging the results of different partitions.

**Returns:**
The result of aggregating the elements of the RDD using the `seqOp` and `combOp` functions and the `zeroValue`.

**Note:**
- The `aggregate()` operation applies the `seqOp` function to each partition of the RDD to aggregate the elements within that partition. The result type of `seqOp` can be different from the RDD's element type.
- The `combOp` function is used to merge the results of different partitions, combining them into a single result.
- The `zeroValue` is the starting value for the aggregation within each partition and for merging the partition results, so it should be neutral for both `seqOp` and `combOp`.
- Both `seqOp` and `combOp` may modify their first argument and return it as the result to avoid object allocation. However, they should not modify their second argument.
- The aggregation is performed independently within each partition, resulting in partial results for each partition.
- The partial results from each partition are then combined together using the `combOp` function to produce the final result.
- The `aggregate()` operation is useful for performing custom aggregation operations on RDDs, allowing for flexibility in the result types and providing efficient ways to combine the results of different partitions.

In [78]:
# aggregate
x = sc.parallelize([2,3,4])
neutral_zero_value = (0,1) # sum: x+0 = x, product: 1*x = x
seqOp = (lambda aggregated, el: (aggregated[0] + el, aggregated[1] * el)) 
combOp = (lambda aggregated1, aggregated2: (aggregated1[0] + aggregated2[0], aggregated1[1] * aggregated2[1]))
y = x.aggregate(neutral_zero_value,seqOp,combOp)  # computes (cumulative sum, cumulative product)
print("x=",x.collect())
print(y)

x= [2, 3, 4]
(9, 24)
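
A typical use of a result type that differs from the element type is computing a mean from a `(sum, count)` pair. The plain-Python sketch below models the two stages with a hypothetical helper `model_aggregate`; the partition layout is chosen by hand for illustration.

```python
from functools import reduce

def model_aggregate(partitions, zero_value, seq_op, comb_op):
    """Apply seq_op inside each partition, then comb_op across the partial results."""
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    return reduce(comb_op, partials, zero_value)

seq_op = lambda acc, el: (acc[0] + el, acc[1] + 1)     # (sum, count) + element
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])      # (sum, count) + (sum, count)

total, count = model_aggregate([[2], [3, 4]], (0, 0), seq_op, comb_op)
print(total, count, total / count)   # 9 3 3.0
```

Here the elements are numbers while the accumulator is a tuple, which is exactly the case that `fold()` cannot express and `aggregate()` can.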


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.reduceByKey">
<img align=left src="images/pyspark-page44.svg" width=360 height=203 />
</a>

**reduceByKey(func, numPartitions=None)**

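The `reduceByKey()` method merges the values for each key in an RDD using an associative and commutative reduce function. Values are merged locally within each partition before the partial results are shuffled, much like a combiner in MapReduce.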

**Parameters:**
- `func`: The associative and commutative reduce function used to merge the values for each key.
- `numPartitions` (optional): The number of partitions to use for the resulting RDD. If not specified, the default partitioning scheme will be used.

**Returns:**
A new RDD with the values for each key merged using the reduce function.

**Note:**
- The reduce function `func` must be associative and commutative, meaning that neither the grouping nor the order of the values affects the result.
- The `reduceByKey()` operation is a transformation operation in PySpark, meaning it is lazily evaluated. It will not be executed until an action is triggered on the resulting RDD.
- If `numPartitions` is specified, it determines the number of partitions for the resulting RDD. Otherwise, the default partitioning scheme will be used.
- The `reduceByKey()` operation is commonly used for aggregation tasks, such as summing values for each key, finding maximum or minimum values, or any other reduction operation that combines values based on the key.
- It is important to choose an appropriate reduce function that can handle the merging of values for each key efficiently and correctly.

In [35]:
# reduceByKey
x = sc.parallelize([('B',1),('B',2),('A',3),('A',4),('A',5)])
y = x.reduceByKey(lambda agg, obj: agg + obj)
print("x=",x.collect())
print(y.collect())

x= [('B', 1), ('B', 2), ('A', 3), ('A', 4), ('A', 5)]
[('B', 3), ('A', 12)]
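
The per-key merging can be modelled in plain Python as a local combine inside each partition followed by a merge of the per-partition results. This is a simplified sketch with hypothetical helpers (`combine_locally`, `model_reduce_by_key`) and a hand-picked partition layout, not Spark's implementation.

```python
def combine_locally(partition, func):
    """Merge values per key inside one partition (the local combine step)."""
    merged = {}
    for key, value in partition:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

def model_reduce_by_key(partitions, func):
    """Combine per partition first, then merge the per-partition results by key."""
    result = {}
    for partition in partitions:
        for key, value in combine_locally(partition, func).items():
            result[key] = func(result[key], value) if key in result else value
    return result

parts = [[('B', 1), ('B', 2), ('A', 3)], [('A', 4), ('A', 5)]]
print(model_reduce_by_key(parts, lambda a, b: a + b))   # {'B': 3, 'A': 12}
```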


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.foldByKey">
<img align=left src="images/pyspark-page53.svg" width=360 height=203 />
</a>

**foldByKey(zeroValue, func, numPartitions=None)**

The `foldByKey()` method merges the values for each key in an RDD using an associative function `func` and a neutral `zeroValue`. The `zeroValue` can be added to the result an arbitrary number of times and should not affect the final outcome.

**Parameters:**
- `zeroValue`: The neutral value that can be added to the result an arbitrary number of times.
- `func`: The associative function used to merge values for each key.
- `numPartitions` (optional): The number of partitions to use for the resulting RDD. If not specified, the default partitioning scheme will be used.

**Returns:**
A new RDD with the values for each key merged using the fold function.

**Note:**
- The `func` function must be associative, meaning that the way the values are grouped, and how many times the `zeroValue` is folded in, does not affect the result.
- The `zeroValue` serves as a neutral element that does not change the result when combined with any other element using the `func` function.
- The `foldByKey()` operation is a transformation operation in PySpark, meaning it is lazily evaluated.
- The resulting RDD will have the keys from the original RDD and the merged values based on the fold function and the `zeroValue`.
- If `numPartitions` is specified, it determines the number of partitions for the resulting RDD. Otherwise, the default partitioning scheme will be used.
- The `foldByKey()` operation is commonly used for tasks where values for each key need to be aggregated or combined, such as calculating sums, products, or any other operation that can be expressed as an associative function with a neutral element.
- It is important to choose an appropriate `func` function and `zeroValue` that can handle the merging of values for each key correctly and ensure the neutral element does not affect the final result.

In [36]:
# foldByKey
x = sc.parallelize([('B',1),('B',2),('A',3),('A',4),('A',5)])
zeroValue = 1 # one is 'zero value' for multiplication
y = x.foldByKey(zeroValue,lambda agg,x: agg*x )  # computes cumulative product within each key
print("x=",x.collect())
print(y.collect())

x= [('B', 1), ('B', 2), ('A', 3), ('A', 4), ('A', 5)]
[('B', 2), ('A', 60)]
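
The statement that the zero value may be applied an arbitrary number of times can be made concrete with a plain-Python model. The sketch assumes the zero value is folded in once per key in every partition where that key occurs; `model_fold_by_key` is a hypothetical helper and the partition layout is chosen by hand.

```python
def model_fold_by_key(partitions, zero_value, func):
    """Start each key from zero_value in every partition, then merge across partitions."""
    result = {}
    for partition in partitions:
        local = {}
        for key, value in partition:
            local[key] = func(local.get(key, zero_value), value)
        for key, value in local.items():
            result[key] = func(result[key], value) if key in result else value
    return result

mul = lambda agg, el: agg * el
parts = [[('B', 1), ('B', 2), ('A', 3)], [('A', 4), ('A', 5)]]

print(model_fold_by_key(parts, 1, mul))   # {'B': 2, 'A': 60}
print(model_fold_by_key(parts, 2, mul))   # {'B': 4, 'A': 240}
```

With the non-neutral value 2, key `'A'` picks up the extra factor twice because its values sit in two partitions, while key `'B'` picks it up only once.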


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.aggregateByKey">
<img align=left src="images/pyspark-page52.svg" width=360 height=203 />
</a>

**aggregateByKey(zeroValue, seqFunc, combFunc, numPartitions=None)**

The `aggregateByKey()` method aggregates the values of each key in an RDD using given combine functions and a neutral "zero value". This function allows for a different result type `U` than the type of the values in the RDD `V`. It requires two operations: one for merging a `V` into a `U` within a partition and another for merging two `U` values between partitions. These operations can modify and return their first argument to avoid memory allocation.

**Parameters:**
- `zeroValue`: The neutral value or zero value for the aggregation operation.
- `seqFunc`: The function used to merge a value `V` into an intermediate result `U` within each partition.
- `combFunc`: The function used to merge two intermediate results `U` between partitions.
- `numPartitions` (optional): The number of partitions to use for the resulting RDD. If not specified, the default partitioning scheme will be used.

**Returns:**
A new RDD with the values for each key aggregated using the provided combine functions.

**Note:**
- The `zeroValue` serves as the initial or neutral element for the aggregation operation and is used when merging values within a partition or between partitions.
- The `seqFunc` function is used to merge a value `V` into an intermediate result `U` within each partition. It may modify and return its first argument to avoid memory allocation.
- The `combFunc` function is used to merge two intermediate results `U` between partitions. It may also modify and return its first argument to avoid memory allocation.
- The `combFunc` function must be associative, and both functions must produce the same result regardless of how the values of a key are split across partitions.
- The `aggregateByKey()` operation is a transformation operation in PySpark, meaning it is lazily evaluated. It will not be executed until an action is triggered on the resulting RDD.
- The resulting RDD will have the keys from the original RDD and the aggregated values based on the provided combine functions.
- If `numPartitions` is specified, it determines the number of partitions for the resulting RDD. Otherwise, the default partitioning scheme will be used.
- The `aggregateByKey()` operation is commonly used for tasks where values for each key need to be aggregated or combined using custom combine functions, such as calculating sums, averages, or any other operation that can be expressed as associative functions with a neutral element.
- It is important to choose appropriate `seqFunc` and `combFunc` functions and a suitable `zeroValue` to handle the merging of values for each key correctly and efficiently; where it helps, the functions can modify and return their first argument to avoid unnecessary object creation.

In [37]:
# aggregateByKey
x = sc.parallelize([('B',1),('B',2),('A',3),('A',4),('A',5)])
zeroValue = [] # empty list is 'zero value' for append operation
mergeVal = (lambda aggregated, el: aggregated + [(el,el**2)])
mergeComb = (lambda agg1,agg2: agg1 + agg2 )
y = x.aggregateByKey(zeroValue,mergeVal,mergeComb)
print("x=",x.collect())
print(y.collect())

x= [('B', 1), ('B', 2), ('A', 3), ('A', 4), ('A', 5)]
[('B', [(1, 1), (2, 4)]), ('A', [(3, 9), (4, 16), (5, 25)])]
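
A common use of the separate types `U` and `V` is a per-key average, where `V` is a number and `U` is a `(sum, count)` pair. The plain-Python sketch below models this with a hypothetical helper `model_aggregate_by_key` and a hand-picked partition layout; it is not Spark's implementation.

```python
def model_aggregate_by_key(partitions, zero_value, seq_func, comb_func):
    """seq_func folds values into U per partition; comb_func merges U across partitions."""
    result = {}
    for partition in partitions:
        local = {}
        for key, value in partition:
            local[key] = seq_func(local.get(key, zero_value), value)
        for key, acc in local.items():
            result[key] = comb_func(result[key], acc) if key in result else acc
    return result

seq_func = lambda acc, v: (acc[0] + v, acc[1] + 1)     # U = (sum, count), V = number
comb_func = lambda a, b: (a[0] + b[0], a[1] + b[1])    # merge two (sum, count) pairs

parts = [[('B', 1), ('B', 2), ('A', 3)], [('A', 4), ('A', 5)]]
sums_counts = model_aggregate_by_key(parts, (0, 0), seq_func, comb_func)
averages = {key: s / c for key, (s, c) in sums_counts.items()}
print(averages)   # {'B': 1.5, 'A': 4.0}
```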


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.groupByKey">
<img align=left src="images/pyspark-page54.svg" width=360 height=203 />
</a>

**groupByKey()**

The `groupByKey()` method groups the values of each key in an RDD, returning a new RDD where each unique key is associated with a sequence of its corresponding values.

**Returns:**
A new RDD where each unique key is associated with an iterable sequence of its corresponding values.

**Note:**
- The resulting RDD is a key-value pair RDD, where the keys are the unique keys from the original RDD, and the values are sequences (iterables) containing all the corresponding values for each key.
- The `groupByKey()` operation is a transformation operation in PySpark, meaning it is lazily evaluated.
- The `groupByKey()` operation moves every value across the network and holds all values of a key together, so keys with a large number of associated values can cause data skew and memory pressure. If the goal is an aggregation such as a sum or an average per key, `reduceByKey()` or `aggregateByKey()` is usually more efficient because values are combined before they are shuffled.
- The `groupByKey()` operation is useful when you need to gather all the values for each unique key, such as when you want to perform further computations or analysis on a per-key basis. However, be cautious with large datasets and with keys that have very many values, as this can impact performance and memory usage.

In [38]:
# groupByKey
x = sc.parallelize([('B',5),('B',4),('A',3),('A',2),('A',1)])
y = x.groupByKey()
print("x=",x.collect())
print([(key, list(values)) for key, values in y.collect()])

x= [('B', 5), ('B', 4), ('A', 3), ('A', 2), ('A', 1)]
[('B', [5, 4]), ('A', [3, 2, 1])]
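
The cost difference between grouping and reducing can be estimated with a rough plain-Python count of how many values have to leave each partition. Both helpers are hypothetical and the model is deliberately simplified: grouping must ship every value, while a reduction can ship one combined value per key per partition.

```python
def values_shuffled_group_by_key(partitions):
    """Grouping cannot merge values away, so every value is shipped."""
    return sum(len(part) for part in partitions)

def values_shuffled_reduce_by_key(partitions):
    """Reducing pre-combines, so one value per key leaves each partition."""
    return sum(len({key for key, _ in part}) for part in partitions)

parts = [[('B', 5), ('B', 4), ('A', 3)], [('A', 2), ('A', 1)]]
print(values_shuffled_group_by_key(parts))    # 5
print(values_shuffled_reduce_by_key(parts))   # 3
```

The gap grows with the number of values per key, which is why an aggregation written with `reduceByKey()` usually scales better than `groupByKey()` followed by a local reduction.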


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.countByKey">
<img align=left src="images/pyspark-page46.svg" width=360 height=203 />
</a>

**countByKey()**

The `countByKey()` method counts the number of elements for each key in an RDD and returns the result as a dictionary to the driver program.

**Returns:**
A dictionary where each unique key is mapped to the count of elements associated with it.

**Note:**
- The `countByKey()` operation is an action: it triggers the execution of the RDD and returns the counts to the driver program as a dictionary. If the number of unique keys is large, the dictionary may not fit in the driver's memory, so use it only when the set of keys is reasonably small.
- The `countByKey()` operation is useful when you need to determine the count of elements associated with each key in an RDD, such as in frequency analysis, data profiling, or for generating summary statistics based on the key-value pairs.

In [39]:
# countByKey
x = sc.parallelize([('B',1),('B',2),('A',3),('A',4),('A',5)])
y = x.countByKey()
print("x=",x.collect())
print(y)

x= [('B', 1), ('B', 2), ('A', 3), ('A', 4), ('A', 5)]
defaultdict(<class 'int'>, {'B': 2, 'A': 3})


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.countByValue">
<img align=left src="images/pyspark-page36.svg" width=360 height=203 />
</a>

**countByValue()**

The `countByValue()` method counts the occurrences of each unique value in an RDD and returns the result as a dictionary where the keys are the unique values and the values are the counts.

**Returns:**
A dictionary containing the count of each unique value in the RDD.

**Note:**
- The `countByValue()` operation is an action in PySpark, meaning it triggers the execution of the RDD and collects the counts.
- The `countByValue()` operation can be useful for analyzing the distribution or frequency of values in an RDD, such as when working with categorical or discrete data.
- It is important to note that the `countByValue()` operation collects the counts to the driver program, so the resulting dictionary should fit into memory. For RDDs with a large number of unique values, consider using other methods like `reduceByKey()` or `aggregateByKey()` to perform distributed counting and aggregation.
- The `countByValue()` operation does not guarantee a specific order of the values in the resulting dictionary.
- The keys in the resulting dictionary correspond to the unique values in the RDD, and the values represent the count of each unique value.

In [40]:
# countByValue
x = sc.parallelize([1,3,1,2,3])
y = x.countByValue()
print("x=",x.collect())
print(y)

x= [1, 3, 1, 2, 3]
defaultdict(<class 'int'>, {1: 2, 3: 2, 2: 1})
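
Counting by value can be pictured as counting inside each partition and then adding the partial counts. The plain-Python sketch below uses `collections.Counter` in a hypothetical helper `model_count_by_value`, with a hand-picked partition layout.

```python
from collections import Counter

def model_count_by_value(partitions):
    """Count values inside each partition, then add the per-partition counts."""
    total = Counter()
    for partition in partitions:
        total += Counter(partition)   # partial counts, merged by value
    return dict(total)

print(model_count_by_value([[1, 3], [1, 2, 3]]))   # {1: 2, 3: 2, 2: 1}
```

When the number of distinct values is too large for the driver, the same counts can be kept distributed with `x.map(lambda v: (v, 1)).reduceByKey(lambda a, b: a + b)`, which returns an RDD instead of a dictionary.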


# Statistics

<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.count">
<img align=left src="images/pyspark-page29.svg" width=360 height=203 />
</a>

**count()**

The `count()` method returns the number of elements in an RDD.

**Returns:**
The number of elements in the RDD.

**Note:**
- The operation is performed on the RDD and returns the count as an integer value.
- The `count()` operation is an action in PySpark, meaning it triggers the execution of the RDD and collects the count on the driver program.
- If the RDD is empty, the `count()` operation will return 0.
- The `count()` operation can be useful for tasks such as calculating the size of an RDD.
- Keep in mind that invoking `count()` on a very large RDD can be time-consuming and resource-intensive. In such cases, consider using approximate methods like `countApprox()` or sampling techniques to estimate the count without processing the entire RDD.

In [41]:
# count
x = sc.parallelize([1,3,2])
y = x.count()
print("x=",x.collect())
print(y)

x= [1, 3, 2]
3


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.max">
<img align=left src="images/pyspark-page26.svg" width=360 height=203 />
</a>

**max(key=None)**

The `max()` method is used to find the maximum item in an RDD based on the elements' natural order or a custom key function.

**Parameters:**
- `key` (optional): A function used to generate a key for comparing the elements. By default, the elements' natural order is used.

**Returns:**
The maximum item in the RDD.

**Note:**
- If the RDD contains elements with a natural order (e.g., numeric or string values), the maximum item is determined based on that order.
- If a `key` function is provided, it is applied to each element to generate a key for comparison.
- The `max()` operation is an action in PySpark, meaning it triggers the execution of the RDD to find the maximum item.
- If the RDD is empty, the `max()` operation will throw an exception. Ensure that the RDD has at least one element before using `max()`.
- If multiple elements have the maximum value, `max()` will return one of them, but the specific element chosen may not be deterministic.
- The `max()` operation can be useful for finding the maximum value in an RDD or determining the maximum element based on a specific attribute or key.
- If you want to find the maximum item based on a custom key function, provide the `key` parameter to transform the elements before comparison. The `key` function should return the attribute or value to be used for comparison.
- Keep in mind that the `max()` operation requires comparing all elements in the RDD and can be computationally expensive for large datasets.

In [42]:
# max
x = sc.parallelize([1,3,2,11])
y = x.max()
z = x.max(key=str)
print("x=",x.collect())
print(y)
print(z)

x= [1, 3, 2, 11]
11
3
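
The second result above follows from comparing the elements as text instead of as numbers. Python's built-in `max()` accepts the same kind of `key` function, so the effect can be checked without Spark:

```python
values = [1, 3, 2, 11]

print(max(values))                     # 11: numeric comparison
print(max(values, key=str))            # 3: '3' sorts after '11' as text
print(max(values, key=lambda v: -v))   # 1: negating the key turns max into min
```

In every case the original element is returned, not the key computed from it.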


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.min">
<img align=left src="images/pyspark-page27.svg" width=360 height=203 />
</a>

**min(key=None)**

The `min()` method is used to find the minimum item in an RDD based on the elements' natural order or a custom key function.

**Parameters:**
- `key` (optional): A function used to generate a key for comparing the elements. By default, the elements' natural order is used.

**Returns:**
The minimum item in the RDD.

**Note:**
- If the RDD contains elements with a natural order (e.g., numeric or string values), the minimum item is determined based on that order.
- If a `key` function is provided, it is applied to each element to generate a key for comparison.
- The `min()` operation is an action in PySpark, meaning it triggers the execution of the RDD to find the minimum item.
- If the RDD is empty, the `min()` operation will throw an exception. Ensure that the RDD has at least one element before using `min()`.
- If multiple elements have the minimum value, `min()` will return one of them, but the specific element chosen may not be deterministic.
- The `min()` operation can be useful for finding the minimum value in an RDD or determining the minimum element based on a specific attribute or key.
- If you want to find the minimum item based on a custom key function, provide the `key` parameter to transform the elements before comparison. The `key` function should return the attribute or value to be used for comparison.
- Keep in mind that the `min()` operation requires comparing all elements in the RDD and can be computationally expensive for large datasets.

In [43]:
# min
x = sc.parallelize([1,3,2])
y = x.min()
print("x=",x.collect())
print(y)

x= [1, 3, 2]
1


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.sum">
<img align=left src="images/pyspark-page28.svg" width=360 height=203 />
</a>

**sum()**

The `sum()` method is used to add up the elements in an RDD.

**Returns:**
The sum of the elements in the RDD.

**Note:**
- The operation is performed on the RDD and returns the sum as a numerical value.
- The `sum()` operation is an action in PySpark, meaning it triggers the execution of the RDD and aggregates the elements to calculate the sum.
- If the RDD is empty, the `sum()` operation will return `0`.
- The `sum()` operation can be used with RDDs containing numerical values, such as integers or floating-point numbers.
- The `sum()` operation has to read every element of the RDD, which can take time for large datasets. The work is distributed: each partition is summed where it resides and only the partial sums are combined, so the data does not need to fit in the memory of a single machine.
- If the RDD contains non-numeric elements or elements that cannot be added together, the `sum()` operation will throw an exception. Make sure that the RDD contains elements that can be summed or use appropriate transformations to filter or convert the elements before applying `sum()`.

In [44]:
# sum
x = sc.parallelize([1,3,2])
y = x.sum()
print("x=",x.collect())
print(y)

x= [1, 3, 2]
6


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.mean">
<img align=left src="images/pyspark-page31.svg" width=360 height=203 />
</a>

**mean()**

The `mean()` method computes the mean (average) of the elements in an RDD.

**Returns:**
The mean of the RDD's elements as a floating-point value.

**Note:**
- The operation is performed on the RDD and returns the mean as a floating-point value.
- The `mean()` operation is an action in PySpark, meaning it triggers the execution of the RDD and collects the necessary statistics to compute the mean.
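- The `mean()` operation does not return `None` for an empty RDD, and the value it does return is not a meaningful average. If the RDD may be empty, check `isEmpty()` before calling `mean()`.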
- The `mean()` operation can be useful for calculating the average of numerical values in an RDD, such as when working with datasets that represent measurements, statistics, or numerical features.
- The `mean()` operation must read every element of the RDD, which can take a long time on large datasets. The work is distributed: each partition contributes only a small summary (such as its element count and running mean), so the RDD does not need to fit in memory on a single machine. Sampling is an option when an approximate mean is good enough.

In [45]:
# mean
x = sc.parallelize([1,3,2])
y = x.mean()
print("x=",x.collect())
print(y)

x= [1, 3, 2]
2.0
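
The following plain-Python sketch shows one way a mean can be computed from per-partition summaries, and why simply averaging the partition means is not enough. It is an illustration under assumptions, not PySpark's implementation; `partitions` is a hypothetical stand-in for an RDD with partitions of unequal size.

```python
# Hypothetical partitions of unequal size.
partitions = [[1, 3, 2], [10]]

# Per-partition (count, sum) pairs can be computed independently.
stats = [(len(part), sum(part)) for part in partitions]

# Merging the pairs gives the exact mean of all elements.
count = sum(n for n, _ in stats)
total = sum(s for _, s in stats)
exact_mean = total / count

# Averaging the partition means ignores partition sizes
# and is wrong in general.
partition_means = [s / n for n, s in stats]
naive_mean = sum(partition_means) / len(partition_means)

print(exact_mean, naive_mean)
```

In PySpark, the same (count, sum) pattern could be expressed with `aggregate()`, although `mean()` already handles this for you.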


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.variance">
<img align=left src="images/pyspark-page32.svg" width=360 height=203 />
</a>

**variance()**

The `variance()` method computes the variance of the elements in an RDD. The variance is a measure of how spread out the values in the RDD are from the mean. It quantifies the average squared difference between each element and the mean of the RDD.

**Returns:**
The variance of the RDD's elements as a floating-point value.

**Note:**
- The `variance()` operation is an action in PySpark, meaning it triggers the execution of the RDD and collects the necessary statistics to compute the variance.
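- If the RDD is empty, the `variance()` operation returns `nan`, because the variance is undefined. If the RDD contains a single element, it returns `0.0`. The method computes the population variance (dividing by N); use `sampleVariance()` to divide by N-1 instead.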
- The `variance()` operation can be useful for analyzing the distribution and variability of numerical values in an RDD, such as when working with datasets that represent measurements, statistics, or numerical features.
- The variance is sensitive to outliers and can be strongly influenced by extreme values in the RDD. Consider preprocessing or filtering the data if outliers or extreme values are present and affecting the variance calculation.

In [46]:
# variance
x = sc.parallelize([1,3,2])
y = x.variance()  # divides by N
print("x=",x.collect())
print(y)

x= [1, 3, 2]
0.6666666666666666
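
The "divides by N" comment in the cell above refers to the population variance. As a quick cross-check outside Spark, the sketch below uses Python's standard `statistics` module to contrast the population variance (the quantity `variance()` reports) with the sample variance (the quantity `sampleVariance()` reports). The local list `data` is a stand-in for the RDD contents.

```python
import math
import statistics

data = [1.0, 3.0, 2.0]

# Population variance: sum of squared deviations divided by N.
population_var = statistics.pvariance(data)

# Sample variance: the same sum divided by N - 1.
sample_var = statistics.variance(data)

# Spelled out by hand for the population case.
mean = sum(data) / len(data)
by_hand = sum((v - mean) ** 2 for v in data) / len(data)

print(population_var, sample_var, math.isclose(population_var, by_hand))
```

The two definitions differ noticeably for small datasets and converge as N grows.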


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.stdev">
<img align=left src="images/pyspark-page33.svg" width=360 height=203 />
</a>

**stdev()**

The `stdev()` method computes the standard deviation of the elements in an RDD. The standard deviation is a measure of the spread or dispersion of the values in the RDD. It is the square root of the variance, so it is expressed in the same units as the elements themselves.

**Returns:**
The standard deviation of the RDD's elements as a floating-point value.

**Note:**
- The `stdev()` operation is an action in PySpark, meaning it triggers the execution of the RDD and collects the necessary statistics to compute the standard deviation.
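- If the RDD is empty, the `stdev()` operation returns `nan`, because the standard deviation is undefined. If the RDD contains a single element, it returns `0.0`. The method computes the population standard deviation (dividing by N); use `sampleStdev()` to divide by N-1 instead.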
- The `stdev()` operation can be useful for analyzing the dispersion and variability of numerical values in an RDD, such as when working with datasets that represent measurements, statistics, or numerical features.
- The standard deviation is influenced by outliers and extreme values in the RDD. Consider preprocessing or filtering the data if outliers or extreme values are present and affecting the standard deviation calculation.

In [47]:
# stdev
x = sc.parallelize([1,3,2])
y = x.stdev()  # divides by N
print("x=",x.collect())
print(y)

x= [1, 3, 2]
0.816496580927726
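
The sketch below, written in plain Python with the standard `statistics` module, illustrates two points from the notes: the standard deviation is the square root of the variance, and a single extreme value can inflate it considerably. The outlier value `100.0` is made up for illustration.

```python
import math
import statistics

data = [1.0, 3.0, 2.0]

# Standard deviation is the square root of the (population) variance.
population_sd = math.sqrt(statistics.pvariance(data))

# One extreme value inflates the result considerably.
with_outlier = data + [100.0]
outlier_sd = statistics.pstdev(with_outlier)

print(population_sd, outlier_sd)
```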


# Join and combine RDDs

<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.join">
<img align=left src="images/pyspark-page47.svg" width=360 height=203 />
</a>

**join(other, numPartitions=None)**

The `join()` method returns an RDD containing all pairs of elements with matching keys in the source RDD (`self`) and another RDD (`other`). It performs a hash join across the cluster.

**Parameters:**
- `other`: The other RDD to join with.
- `numPartitions` (optional): The number of partitions to use for the resulting RDD. If not specified, the default partitioning scheme will be used.

**Returns:**
An RDD containing all pairs of elements with matching keys, represented as tuples `(k, (v1, v2))`, where `(k, v1)` is in the source RDD and `(k, v2)` is in the other RDD.

**Note:**
- The `join()` operation combines elements from the source RDD and the other RDD based on matching keys.
- It performs a hash join, which is a type of join that utilizes hashing techniques to efficiently match elements with the same keys across partitions in a distributed manner.
- The resulting RDD contains tuples where the key `k` is the common key and `(v1, v2)` represents the values associated with that key, with `v1` from the source RDD (`self`) and `v2` from the other RDD.
- The `join()` operation is a transformation operation in PySpark, meaning it is lazily evaluated. It will not be executed until an action is triggered on the resulting RDD.
- If `numPartitions` is specified, it determines the number of partitions for the resulting RDD. Otherwise, the default partitioning scheme will be used.
- The `join()` operation can be used to combine datasets based on common keys, such as merging data from different sources, performing relational-style joins, or joining datasets for subsequent analysis or processing.
- It is important to consider the data distribution and partitioning scheme of the source and other RDDs to ensure efficient execution of the join operation.

In [48]:
# join
x = sc.parallelize([('C',4),('B',3),('A',2),('A',1)])
y = sc.parallelize([('A',8),('B',7),('A',6),('D',5)])
z = x.join(y)
print("x=",x.collect())
print("y=",y.collect())
print(z.collect())

x= [('C', 4), ('B', 3), ('A', 2), ('A', 1)]
y= [('A', 8), ('B', 7), ('A', 6), ('D', 5)]
[('B', (3, 7)), ('A', (2, 8)), ('A', (2, 6)), ('A', (1, 8)), ('A', (1, 6))]
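
To see why the key `'A'` produces four output rows, here is a minimal plain-Python model of an inner join on `(key, value)` pairs. The function `inner_join` is a hypothetical helper written for this illustration; it mimics the result of `join()` on local lists and says nothing about how Spark distributes the work.

```python
from collections import defaultdict

def inner_join(left, right):
    """Plain-Python model of an inner join on (key, value) pairs."""
    # Group the right-hand values by key (the "hash" side of a hash join).
    right_by_key = defaultdict(list)
    for k, w in right:
        right_by_key[k].append(w)
    # Emit one output pair for every matching (v, w) combination.
    return [(k, (v, w)) for k, v in left for w in right_by_key.get(k, [])]

x = [('C', 4), ('B', 3), ('A', 2), ('A', 1)]
y = [('A', 8), ('B', 7), ('A', 6), ('D', 5)]
z = inner_join(x, y)
print(sorted(z))
```

Two `'A'` rows on each side give 2 × 2 = 4 combinations, while `'C'` and `'D'` have no partner and are dropped.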


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.leftOuterJoin">
<img align=left src="images/pyspark-page48.svg" width=360 height=203 />
</a>

**leftOuterJoin(other, numPartitions=None)**

The `leftOuterJoin()` method performs a left outer join between the source RDD (`self`) and another RDD (`other`). It combines the elements based on their keys, including all pairs from the source RDD and matching pairs from the other RDD. For each element `(k, v)` in the source RDD, the resulting RDD will contain pairs `(k, (v, w))` for `w` in the other RDD, or the pair `(k, (v, None))` if no elements in the other RDD have the key `k`.

**Parameters:**
- `other`: The RDD to join with the source RDD.
- `numPartitions` (optional): The number of partitions to use for the resulting RDD. If not specified, the default partitioning scheme will be used.

**Returns:**
An RDD containing the joined pairs `(k, (v, w))` for elements `(k, v)` in the source RDD and elements `(k, w)` in the other RDD, or `(k, (v, None))` if no elements in the other RDD have the key `k`.

**Note:**
- The `leftOuterJoin()` operation is a transformation operation in PySpark, meaning it is lazily evaluated.
- If `numPartitions` is specified, it determines the number of partitions for the resulting RDD. Otherwise, the default partitioning scheme will be used.
- The `leftOuterJoin()` operation is useful when you want to combine elements from two RDDs based on their keys, while preserving all elements from the source RDD and including `None` for keys that do not exist in the other RDD.
- Matching in `leftOuterJoin()` is based on keys only; the values play no part in it. To filter on values as well, apply `filter()` to either input before the join or to the result afterwards.

In [49]:
# leftOuterJoin
x = sc.parallelize([('C',4),('B',3),('A',2),('A',1)])
y = sc.parallelize([('A',8),('B',7),('A',6),('D',5)])
z = x.leftOuterJoin(y)
print("x=",x.collect())
print(y.collect())
print(z.collect())

x= [('C', 4), ('B', 3), ('A', 2), ('A', 1)]
[('A', 8), ('B', 7), ('A', 6), ('D', 5)]
[('C', (4, None)), ('B', (3, 7)), ('A', (2, 8)), ('A', (2, 6)), ('A', (1, 8)), ('A', (1, 6))]
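
The following plain-Python sketch models the left outer join semantics on local lists. `left_outer_join` is a hypothetical helper for illustration, not part of PySpark.

```python
from collections import defaultdict

def left_outer_join(left, right):
    """Plain-Python model of a left outer join on (key, value) pairs."""
    right_by_key = defaultdict(list)
    for k, w in right:
        right_by_key[k].append(w)
    result = []
    for k, v in left:
        # Keys missing on the right still produce one row, padded with None.
        matches = right_by_key.get(k, [None])
        result.extend((k, (v, w)) for w in matches)
    return result

x = [('C', 4), ('B', 3), ('A', 2), ('A', 1)]
y = [('A', 8), ('B', 7), ('A', 6), ('D', 5)]
z = left_outer_join(x, y)
print(z)
```

Every element of `x` appears at least once in the result; `('D', 5)` from `y` does not, because `'D'` is not a key in `x`.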


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.rightOuterJoin">
<img align=left src="images/pyspark-page49.svg" width=360 height=203 />
</a>

**rightOuterJoin(other, numPartitions=None)**

The `rightOuterJoin()` method performs a right outer join between the source RDD (`self`) and another RDD (`other`). It combines the elements based on their keys, including all pairs from the other RDD and matching pairs from the source RDD. For each element `(k, w)` in the other RDD, the resulting RDD will contain pairs `(k, (v, w))` for `v` in the source RDD, or the pair `(k, (None, w))` if no elements in the source RDD have the key `k`.

**Parameters:**
- `other`: The RDD to join with the source RDD.
- `numPartitions` (optional): The number of partitions to use for the resulting RDD. If not specified, the default partitioning scheme will be used.

**Returns:**
An RDD containing the joined pairs `(k, (v, w))` for elements `(k, w)` in the other RDD and elements `(k, v)` in the source RDD, or `(k, (None, w))` if no elements in the source RDD have the key `k`.

**Note:**
- The `rightOuterJoin()` operation is a transformation operation in PySpark, meaning it is lazily evaluated.
- If `numPartitions` is specified, it determines the number of partitions for the resulting RDD. Otherwise, the default partitioning scheme will be used.
- The `rightOuterJoin()` operation is useful when you want to combine elements from two RDDs based on their keys, while preserving all elements from the other RDD and including `None` for keys that do not exist in the source RDD.
- Matching in `rightOuterJoin()` is based on keys only; the values play no part in it. To filter on values as well, apply `filter()` to either input before the join or to the result afterwards.

In [50]:
# rightOuterJoin
x = sc.parallelize([('C',4),('B',3),('A',2),('A',1)])
y = sc.parallelize([('A',8),('B',7),('A',6),('D',5)])
z = x.rightOuterJoin(y)
print("x=",x.collect())
print(y.collect())
print(z.collect())

x= [('C', 4), ('B', 3), ('A', 2), ('A', 1)]
[('A', 8), ('B', 7), ('A', 6), ('D', 5)]
[('B', (3, 7)), ('A', (2, 8)), ('A', (2, 6)), ('A', (1, 8)), ('A', (1, 6)), ('D', (None, 5))]
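
A right outer join is a left outer join with the roles of the two inputs swapped. The plain-Python sketch below makes that relationship explicit; both helper functions are hypothetical and operate on local lists only.

```python
from collections import defaultdict

def left_outer_join(left, right):
    """Plain-Python model of a left outer join on (key, value) pairs."""
    right_by_key = defaultdict(list)
    for k, w in right:
        right_by_key[k].append(w)
    result = []
    for k, v in left:
        matches = right_by_key.get(k, [None])
        result.extend((k, (v, w)) for w in matches)
    return result

def right_outer_join(left, right):
    """Swap the inputs, run a left outer join, then swap each value pair back."""
    return [(k, (v, w)) for k, (w, v) in left_outer_join(right, left)]

x = [('C', 4), ('B', 3), ('A', 2), ('A', 1)]
y = [('A', 8), ('B', 7), ('A', 6), ('D', 5)]
z = right_outer_join(x, y)
print(z)
```

The value from the source RDD stays in the first slot of each pair, so an unmatched key from `y` shows up as `(None, w)`.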


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.cartesian">
<img align=left src="images/pyspark-page17.svg" width=360 height=203 />
</a>

**cartesian(other)**

The `cartesian()` method returns an RDD representing the Cartesian product of the elements in the source RDD and another RDD. It generates pairs of all possible combinations where an element `a` is from the source RDD and an element `b` is from the other RDD.

**Parameters:**
- `other`: The other RDD to form the Cartesian product with.

**Returns:**
An RDD containing all pairs of elements `(a, b)` where `a` is in the source RDD and `b` is in the other RDD.

**Note:**
- The `cartesian()` operation generates pairs of all possible combinations between the elements in the source RDD and the elements in the other RDD.
- The resulting RDD contains tuples `(a, b)` where `a` is an element from the source RDD and `b` is an element from the other RDD.
- The Cartesian product generates every possible combination of elements, so the resulting RDD can be large and memory-intensive, especially if the source and other RDDs have many elements.
- The `cartesian()` operation is a transformation operation in PySpark, meaning it is lazily evaluated. It will not be executed until an action is triggered on the resulting RDD.
- The resulting RDD has one partition for each pair of input partitions, so its partition count is the product of the partition counts of the source RDD and the other RDD.
- The `cartesian()` operation is useful when you need to generate all possible pairs of elements between two RDDs, such as for cross-referencing, combining data from different sources, or performing extensive data exploration.
- Care should be taken when using `cartesian()` on large RDDs: the number of output pairs is the product of the two input sizes, so the result can be expensive to compute and store even for moderately sized inputs.

In [51]:
# cartesian
x = sc.parallelize(['A','B'])
y = sc.parallelize(['C','D'])
z = x.cartesian(y)
print("x=",x.collect())
print(y.collect())
print(z.collect())

x= ['A', 'B']
['C', 'D']
[('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')]
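
For small local collections, `itertools.product` from the Python standard library produces the same pairs, which makes the size of the result easy to reason about. The sketch below is a local illustration only; the 10,000-element figure is a made-up example.

```python
from itertools import product

x = ['A', 'B']
y = ['C', 'D']

# All (a, b) pairs with a from x and b from y.
pairs = list(product(x, y))
print(pairs)

# The output size is the product of the input sizes.
print(len(pairs) == len(x) * len(y))

# Two inputs of 10,000 elements each would already yield 100 million pairs.
hypothetical_pairs = 10_000 * 10_000
print(hypothetical_pairs)
```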


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.subtract">
<img align=left src="images/pyspark-page61.svg" width=360 height=203 />
</a>

**subtract(other, numPartitions=None)**

The `subtract()` method returns an RDD containing the values from the source RDD (`self`) that are not present in another RDD (`other`).

**Parameters:**
- `other`: The RDD to subtract from the source RDD.
- `numPartitions` (optional): The number of partitions to use for the resulting RDD. If not specified, the default partitioning scheme will be used.

**Returns:**
An RDD containing the values from the source RDD that are not present in the other RDD.

**Note:**
- The `subtract()` operation compares the values of the source RDD with the values of the other RDD and returns the values that are present in the source RDD but not in the other RDD.
- The comparison is performed based on the equality of the elements in both RDDs.
- The `subtract()` operation is a transformation operation in PySpark, meaning it is lazily evaluated.
- If `numPartitions` is specified, it determines the number of partitions for the resulting RDD. Otherwise, the default partitioning scheme will be used.
- The `subtract()` operation can be useful for filtering out values that also appear in another RDD. It does not deduplicate: duplicates in the source RDD that are absent from the other RDD are all kept.
- The performance of the `subtract()` operation depends on the data distribution and partitioning scheme of the RDDs.

In [52]:
# subtract
x = sc.parallelize([('C',4),('B',3),('A',2),('A',1)])
y = sc.parallelize([('C',8),('A',2),('D',2)])
z = x.subtract(y)
print("x=",x.collect())
print("y=",y.collect())
print(z.collect())

x= [('C', 4), ('B', 3), ('A', 2), ('A', 1)]
y= [('C', 8), ('A', 2), ('D', 2)]
[('C', 4), ('B', 3), ('A', 1)]
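
The example above removes only `('A', 2)`, because whole elements are compared rather than keys. The plain-Python sketch below models that behaviour on local lists; the function `subtract` here is a hypothetical stand-in, not the PySpark method.

```python
def subtract(source, other):
    """Keep the elements of source that do not appear in other."""
    # Whole elements are compared, so ('C', 4) and ('C', 8) are different.
    excluded = set(other)
    return [el for el in source if el not in excluded]

x = [('C', 4), ('B', 3), ('A', 2), ('A', 1)]
y = [('C', 8), ('A', 2), ('D', 2)]
z = subtract(x, y)
print(z)

# Duplicates in the source that are absent from the other side are kept.
print(subtract([1, 1, 2], [2]))
```

To remove elements by key alone, PySpark offers `subtractByKey()` for RDDs of `(key, value)` pairs.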


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.zip">
<img align=left src="images/pyspark-page65.svg" width=360 height=203 />
</a>

**zip(other)**

The `zip()` method zips together two RDDs, returning an RDD of key-value pairs in which the first element of this RDD is paired with the first element of the other RDD, the second with the second, and so on. This operation assumes that both RDDs have the same number of partitions and the same number of elements in each partition.

**Parameters:**
- `other`: The other RDD to zip with.

**Returns:**
An RDD of key-value pairs where the elements from each RDD are paired together.

**Note:**
- Both RDDs should have the same number of partitions and the same number of elements in each partition. Elements are paired by position: the first element of this RDD with the first element of the other RDD, the second with the second, and so on.
- The resulting RDD contains key-value pairs, where the key is an element from the first RDD and the value is the corresponding element from the other RDD.
- The `zip()` operation is a transformation operation in PySpark, meaning it is lazily evaluated. It will not be executed until an action is triggered on the resulting RDD.
- The resulting RDD will have the same number of partitions as the input RDDs.
- The `zip()` operation is useful when two RDDs are aligned element by element, for example when one was derived from the other with `map()`. It pairs elements by position, not by key; to combine RDDs on a common key, use `join()` instead.

In [53]:
# zip
x = sc.parallelize(['B','A','A'])
y = x.map(lambda x: ord(x))  # zip expects x and y to have same #partitions and #elements/partition
z = x.zip(y)
print("x=",x.collect())
print(y.collect())
print(z.collect())

x= ['B', 'A', 'A']
[66, 65, 65]
[('B', 66), ('A', 65), ('A', 65)]
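
The partition requirement is easier to see with the partitions written out. The plain-Python sketch below models zipping partition by partition and rejects inputs whose layouts differ. `zip_partitions` is a hypothetical helper, and its error messages are illustrative rather than the ones Spark produces.

```python
def zip_partitions(parts_a, parts_b):
    """Pair elements by position, one partition at a time."""
    if len(parts_a) != len(parts_b):
        raise ValueError("different number of partitions")
    zipped = []
    for pa, pb in zip(parts_a, parts_b):
        if len(pa) != len(pb):
            raise ValueError("different number of elements in a partition")
        zipped.append(list(zip(pa, pb)))
    return zipped

# Hypothetical two-partition layout of ['B', 'A', 'A'].
letters = [['B'], ['A', 'A']]
# Deriving one side from the other keeps the layouts aligned.
codes = [[ord(c) for c in part] for part in letters]
result = zip_partitions(letters, codes)
print(result)
```

Deriving `y` from `x` with `map()`, as the cell above does, is a simple way to guarantee matching layouts.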


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.zipWithIndex">
<img align=left src="images/pyspark-page66.svg" width=360 height=203 />
</a>

**zipWithIndex()**

The `zipWithIndex()` method zips an RDD with its element indices, creating a new RDD where each element is paired with its corresponding index. The ordering of the elements is based on the partition index and the ordering within each partition.

**Returns:**
An RDD where each element is paired with its index.

**Note:**
- The `zipWithIndex()` operation pairs each element in the RDD with its corresponding index.
- The resulting RDD contains tuples `(element, index)` where the element is an element from the original RDD, and the index represents the position of the element in the RDD.
- The ordering of the elements is determined first by the partition index and then by the ordering of items within each partition. Elements within the same partition will have contiguous indices, and the indices will be assigned in increasing order across partitions.
- The `zipWithIndex()` operation is a transformation, but it is not fully lazy: when the RDD has more than one partition, it triggers a Spark job to count the elements in each partition so that the starting index of each partition is known.
- The resulting RDD will have the same number of partitions as the original RDD.
- The `zipWithIndex()` operation is useful when you need to associate an index with each element in an RDD. It can be helpful for tasks such as ranking elements, creating unique identifiers, or tracking the order of elements in the RDD.
- Be aware that `zipWithIndex()` adds overhead on a large RDD with more than one partition: the partition sizes must be counted before indices can be assigned, so the RDD may be computed twice unless it is cached.

In [54]:
# zipWithIndex
x = sc.parallelize(['B','A','A'],2)
y = x.zipWithIndex()
print(x.glom().collect())
print(y.collect())

[['B'], ['A', 'A']]
[('B', 0), ('A', 1), ('A', 2)]
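
The index assignment described above can be sketched in plain Python as two passes: one to count the elements per partition and one to number them. `zip_with_index` is a hypothetical helper that works on a list of lists, such as the output of `glom().collect()`.

```python
from itertools import accumulate

def zip_with_index(partitions):
    """Assign global indices, ordered by partition and then by position."""
    # Pass 1: count the elements in each partition.
    sizes = [len(part) for part in partitions]
    # Starting index of a partition = total size of the partitions before it.
    offsets = [0] + list(accumulate(sizes))[:-1]
    # Pass 2: number the elements within each partition from its offset.
    return [
        [(el, offset + i) for i, el in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]

partitions = [['B'], ['A', 'A']]
indexed = zip_with_index(partitions)
print(indexed)
```

The second partition cannot start numbering until the size of the first is known, which is why a counting step comes first.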


# Other functions

<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.sortByKey">
<img align=left src="images/pyspark-page14.svg" width=360 height=203 />
</a>

**sortByKey(ascending=True, numPartitions=None, keyfunc=lambda x: x)**

The `sortByKey()` method sorts an RDD that consists of `(key, value)` pairs based on the keys. The sort order can be specified as ascending or descending. 

**Parameters:**
- `ascending`: A Boolean value indicating whether the sorting should be in ascending order (`True`) or descending order (`False`). Default is `True`.
- `numPartitions` (optional): The number of partitions to use for the resulting RDD. If not specified, the default partitioning scheme will be used.
- `keyfunc` (optional): A function to extract a comparison key from each element in the RDD. This function is applied to the keys of the `(key, value)` pairs.

**Returns:**
An RDD containing the `(key, value)` pairs sorted by the keys.

**Note:**
- By default, the sorting is done in ascending order, but you can specify `ascending=False` to sort in descending order.
- The `numPartitions` parameter determines the number of partitions for the resulting RDD. If not specified, the default partitioning scheme will be used.
- The `keyfunc` parameter allows you to provide a custom function to extract a comparison key from each element in the RDD. This function is applied to the keys of the `(key, value)` pairs. The default behavior uses the keys as is.
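- The `sortByKey()` operation is a transformation, so the sort itself is lazily evaluated. Note, however, that when the result has more than one partition, PySpark runs a job up front to count and sample the keys in order to choose the partition boundaries.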
- Sorting is performed based on the keys, while preserving the association with the corresponding values.
- If two keys are equal, do not rely on the relative order of the corresponding pairs: Spark does not document a guarantee that it is preserved.
- The `sortByKey()` operation can be useful for tasks where you need to sort an RDD of key-value pairs, such as finding the top values per key, performing range queries, or preparing data for further analysis or processing based on key order.
- Depending on the data distribution and the size of the RDD, the `sortByKey()` operation can be computationally expensive: the data is shuffled so that each partition holds a contiguous range of keys, and each partition is then sorted. Heavily repeated keys can lead to skewed partitions.

In [55]:
# sortByKey
x = sc.parallelize([('B',1),('A',2),('C',3)])
y = x.sortByKey()
print("x=",x.collect())
print(y.collect())

x= [('B', 1), ('A', 2), ('C', 3)]
[('A', 2), ('B', 1), ('C', 3)]
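
The effect of the `ascending` and `keyfunc` parameters can be previewed on a local list with Python's built-in `sorted`. The function `sort_by_key` below is a hypothetical local model that ignores partitioning entirely; the mixed-case keys are made up for illustration.

```python
def sort_by_key(pairs, ascending=True, keyfunc=lambda k: k):
    """Local model: sort (key, value) pairs by keyfunc(key)."""
    return sorted(pairs, key=lambda kv: keyfunc(kv[0]), reverse=not ascending)

x = [('B', 1), ('A', 2), ('C', 3)]
print(sort_by_key(x))
print(sort_by_key(x, ascending=False))

# With mixed-case keys, the default order puts uppercase letters first;
# keyfunc=str.lower compares the keys case-insensitively instead.
mixed = [('b', 1), ('A', 2), ('C', 3)]
print(sort_by_key(mixed))
print(sort_by_key(mixed, keyfunc=str.lower))
```

The corresponding PySpark call for the last line would be along the lines of `rdd.sortByKey(keyfunc=lambda k: k.lower())`.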


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.foreach">
<img align=left src="images/pyspark-page20.svg" width=360 height=203 />
</a>

**foreach(f)**

The `foreach()` method applies a function `f` to each element of the RDD. It executes the provided function on each element of the RDD in a distributed manner.

**Parameters:**
- `f`: The function to be applied to each element of the RDD.

**Note:**
- The `foreach()` operation applies the function `f` to each element of the RDD, allowing for custom processing or side effects on the elements.
- The return value of `f` is ignored, so `f` is called only for its side effects.
- The `foreach()` operation is an action in PySpark, and it triggers the execution of the provided function on each element of the RDD.
- The execution of `foreach()` is distributed across the worker nodes in the cluster, applying the function in parallel to each element.
- The order of execution of the provided function on elements is not guaranteed, as it depends on the distributed processing and the available resources.
- The `foreach()` operation does not return any result or new RDD. It is primarily used for performing operations or side effects on each element of the RDD, such as writing to an external system, updating shared variables, or performing other custom actions.
- The provided function `f` should be carefully designed to ensure it is idempotent and does not have any dependencies on the order or specific execution of elements.
- It is important to consider the potential side effects and the function's execution time when using `foreach()`, as it directly operates on the RDD elements and can impact the performance and behavior of the system.

In [56]:
# foreach
x = sc.parallelize([1,2,3])
def f(el):
    '''side effect: append the current RDD element to a file'''
    # the with-block closes the file, so each write is flushed
    with open("./foreachExample.txt", 'a+') as f1:
        print(el,file=f1)

open('./foreachExample.txt', 'w').close()  # first clear the file contents

y = x.foreach(f) # writes into foreachExample.txt

print("x=",x.collect())
print(y) # foreach returns 'None'
# print the contents of foreachExample.txt
with open("./foreachExample.txt", "r") as foreachExample:
    print (foreachExample.read())

x= [1, 2, 3]
None
1
2
3
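
The advice about idempotent functions matters because a task may be executed more than once, for example after a failure. The plain-Python sketch below simulates such a retry; `run_task`, `log`, and `store` are hypothetical names used only for this illustration.

```python
def run_task(elements, f):
    """Apply f to every element for its side effect; nothing is returned."""
    for el in elements:
        f(el)

elements = [1, 2, 3]

# Non-idempotent side effect: appending duplicates data if the task reruns.
log = []
run_task(elements, log.append)
run_task(elements, log.append)  # simulated retry

# Idempotent side effect: writing by key gives the same state after a retry.
store = {}
def save(el):
    store[el] = el * 10

run_task(elements, save)
run_task(elements, save)  # simulated retry

print(log)
print(store)
```

The file-append example in the cell above falls into the first category: a retried task would write its elements to the file a second time.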

