# PySpark RDD Operations Example

This notebook demonstrates basic and advanced operations on an RDD (Resilient Distributed Dataset) in PySpark. It covers creating an RDD from a sample list, performing various transformations and actions, and applying set operations on RDDs.


#### Initializing Spark Session
To begin working with PySpark, a Spark session needs to be initialized. This session acts as the entry point for interacting with Spark.

```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
```

#### Creating an RDD from a List
An RDD (Resilient Distributed Dataset) is the fundamental data structure in Spark, representing a distributed collection of elements that can be processed in parallel. Create an RDD from a list:

```python
sampleList = [10, 11, 22, 23, 34, 35, 40, 45, 60]
sampleRdd = spark.sparkContext.parallelize(sampleList)
```

#### Counting Elements in the RDD
The `count()` action returns the number of elements in the RDD.

```python
sampleRdd.count()
```

#### Collecting All Elements of the RDD
The `collect()` action returns every element of the RDD to the driver as a Python list. *Be cautious with large RDDs: the entire dataset has to fit in the driver's memory.*

```python
sampleRdd.collect()
```

#### Retrieving the First Element of the RDD
The `first()` action returns the first element of the RDD.

```python
sampleRdd.first()
```

#### Taking the First 3 Elements of the RDD
The `take(n)` action returns the first `n` elements from the RDD.

```python
sampleRdd.take(3)
```

#### Summing All Elements in the RDD
The `sum()` action computes the sum of all elements in the RDD.

```python
sampleRdd.sum()
```

#### Finding the Minimum Value in the RDD
The `min()` action returns the smallest element in the RDD.

```python
sampleRdd.min()
```

#### Finding the Maximum Value in the RDD
The `max()` action returns the largest element in the RDD.

```python
sampleRdd.max()
```

#### Calculating the Mean of the RDD Elements
The `mean()` action computes the average of the elements in the RDD.

```python
sampleRdd.mean()
```

#### Calculating the Variance of the RDD Elements
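The `variance()` action returns the population variance of the elements, that is, the mean of their squared deviations from the mean (dividing by N). For the sample variance (dividing by N - 1), use `sampleVariance()`.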

```python
sampleRdd.variance()
```

#### Calculating the Standard Deviation of the RDD Elements
The `stdev()` action returns the population standard deviation, which is the square root of the variance returned by `variance()`. The sample counterpart is `sampleStdev()`.

```python
sampleRdd.stdev()
```
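As a cross-check that needs no Spark installation, the sketch below recomputes both values with Python's standard `statistics` module. It assumes that `variance()` and `stdev()` are the population versions (dividing by N) and that `sampleVariance()` and `sampleStdev()` are the N - 1 counterparts; the variable names are illustrative only.

```python
import statistics

sample_list = [10, 11, 22, 23, 34, 35, 40, 45, 60]

# Population statistics (divide by N): the values that
# sampleRdd.variance() and sampleRdd.stdev() are expected to match.
pop_variance = statistics.pvariance(sample_list)
pop_stdev = statistics.pstdev(sample_list)

# Sample statistics (divide by N - 1): the counterparts of
# sampleRdd.sampleVariance() and sampleRdd.sampleStdev().
sample_variance = statistics.variance(sample_list)
sample_stdev = statistics.stdev(sample_list)

print("population:", pop_variance, pop_stdev)
print("sample:    ", sample_variance, sample_stdev)
```

With only nine elements the two versions differ noticeably, so it is worth checking which one a downstream calculation expects.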

#### Getting RDD Statistics
The `stats()` action returns a `StatCounter` object holding the count, mean, standard deviation, maximum, and minimum of the elements, all computed in a single pass over the data.

```python
sampleRdd.stats()
```
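To give an intuition for how several statistics can be gathered in one pass over partitioned data, here is a minimal plain-Python sketch of a mergeable running-statistics accumulator. The `RunningStats` class and the three-way split of the list are hypothetical and written for illustration; this is not Spark's implementation, only one common way (an online update plus a pairwise merge) to get the same numbers.

```python
from dataclasses import dataclass
import statistics


@dataclass
class RunningStats:
    """Hypothetical accumulator: count, mean, and sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def add(self, x):
        # Online update for a single value.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return self

    def merge(self, other):
        # Combine two partial results without revisiting the data.
        if other.n == 0:
            return self
        if self.n == 0:
            self.n, self.mean, self.m2 = other.n, other.mean, other.m2
            return self
        total = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / total
        self.mean += delta * other.n / total
        self.n = total
        return self

    def variance(self):
        # Population variance (divide by N).
        return self.m2 / self.n if self.n else float("nan")


# Assumed split of the sample data into three partitions.
partitions = [[10, 11, 22], [23, 34, 35], [40, 45, 60]]

# Each partition is summarized independently...
partials = []
for part in partitions:
    acc = RunningStats()
    for value in part:
        acc.add(value)
    partials.append(acc)

# ...and the partial summaries are then merged into one result.
merged = RunningStats()
for acc in partials:
    merged.merge(acc)

flat = [x for part in partitions for x in part]
print(merged.n, merged.mean, merged.variance())
print(len(flat), statistics.mean(flat), statistics.pvariance(flat))
```

The two printed lines should agree up to floating-point rounding, which shows that per-partition summaries are enough to recover the global statistics.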

#### Applying a map Transformation
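The `map()` transformation applies a function to each element in the RDD, returning a new RDD with the transformed elements. Transformations are lazy, so `collect()` is called here to trigger the computation and return the result. In this example each element is paired with the value 1.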

```python
sampleRdd.map(lambda x: (x, 1)).collect()
```

#### Mapping to a Tuple with a String
Use the `map()` transformation to pair each element with a string.

```python
sampleRdd.map(lambda x: (x, "Hello")).collect()
```

#### Mapping to a Complex Tuple
Use the `map()` transformation to create a complex tuple with multiple values derived from each element.

```python
sampleRdd.map(lambda x: (x, "Hello", x+10, x*x)).collect()
```
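Because the function passed to `map()` is an ordinary Python callable, it can be tried on a plain list before it is handed to Spark. The sketch below does that with a named function instead of a lambda; `to_record` is a hypothetical helper name chosen for this example.

```python
sample_list = [10, 11, 22, 23, 34, 35, 40, 45, 60]


def to_record(x):
    """Build the same tuple as the lambda in the map() example above."""
    return (x, "Hello", x + 10, x * x)


# Local equivalent of sampleRdd.map(to_record).collect()
local_result = [to_record(x) for x in sample_list]

print(local_result[:2])  # [(10, 'Hello', 20, 100), (11, 'Hello', 21, 121)]
```

A named function like this can be passed directly, as in `sampleRdd.map(to_record)`, which tends to be easier to test and reuse than a long lambda.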

#### Reducing the RDD Elements
The `reduce()` action combines elements of the RDD using a specified binary operation, returning a single result.

```python
sampleRdd.reduce(lambda x, y: x+y)
```
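Spark reduces each partition separately and then combines the partial results, so the function passed to `reduce()` should be commutative and associative. The plain-Python sketch below imitates that two-stage process; the partition layout and the `partitioned_reduce` helper are assumptions made for illustration, not Spark internals.

```python
from functools import reduce

# Assumed split of the sample data into three partitions.
partitions = [[10, 11, 22], [23, 34, 35], [40, 45, 60]]


def partitioned_reduce(func, parts):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(func, part) for part in parts if part]
    return reduce(func, partials)


flat = [x for part in partitions for x in part]


def add(x, y):
    return x + y


def sub(x, y):
    return x - y


# Addition is associative and commutative: both routes agree.
print(partitioned_reduce(add, partitions), reduce(add, flat))  # 280 280

# Subtraction is neither: the result depends on how the data is split.
print(partitioned_reduce(sub, partitions), reduce(sub, flat))  # 88 -260
```

With an operation like subtraction, the answer would change with the number of partitions, which is why such functions are unsuitable for `reduce()`.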

#### Filtering Odd Elements
The `filter()` transformation returns a new RDD containing only the elements that satisfy a specified condition.

```python
sampleRdd.filter(lambda x: x % 2 == 1).collect()
```

#### Reducing Filtered Odd Elements
Filter the RDD to keep only odd elements and then sum them using `reduce()`.

```python
sampleRdd.filter(lambda x: x % 2 == 1).reduce(lambda x, y: x+y)
```

#### Mapping Filtered Elements
Filter the RDD to keep only odd elements, then triple each one using `map()`.

```python
sampleRdd.filter(lambda x: x % 2 == 1).map(lambda x: x+x+x).collect()
```

#### Reducing the Triple Value of Filtered Elements
Filter the RDD to keep only odd elements, triple each one, and then sum the results using `reduce()`.

```python
sampleRdd.filter(lambda x: x % 2 == 1).map(lambda x: x+x+x).reduce(lambda x, y: x+y)
```
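The expected results of the filter, map, and reduce chains above can be worked out locally. The sketch below mirrors each step on the plain Python list; the variable names are illustrative, and unlike Spark's lazy transformations these list comprehensions run immediately.

```python
from functools import reduce

sample_list = [10, 11, 22, 23, 34, 35, 40, 45, 60]

# filter(lambda x: x % 2 == 1)
odds = [x for x in sample_list if x % 2 == 1]

# filter(...).reduce(lambda x, y: x + y)
odd_sum = reduce(lambda x, y: x + y, odds)

# filter(...).map(lambda x: x + x + x)
tripled = [x + x + x for x in odds]

# filter(...).map(...).reduce(lambda x, y: x + y)
tripled_sum = reduce(lambda x, y: x + y, tripled)

print(odds)         # [11, 23, 35, 45]
print(odd_sum)      # 114
print(tripled)      # [33, 69, 105, 135]
print(tripled_sum)  # 342
```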

#### Set Operations on RDDs
RDDs support set-like operations such as `union()`, `intersection()`, and `subtract()`. Create two overlapping RDDs to try them on:

```python
rdd1 = spark.sparkContext.parallelize([1,2,3,4,5])
rdd2 = spark.sparkContext.parallelize([3,4,5,6,7])
```

#### Union of Two RDDs
The `union()` transformation combines the elements of two RDDs, returning a new RDD containing all elements from both. Duplicates are kept, so 3, 4, and 5 each appear twice in the result here.

```python
rdd1.union(rdd2).collect()
```

#### Distinct Union of Two RDDs
The `distinct()` transformation removes duplicates from an RDD. Apply it after `union()` to get distinct elements from both RDDs.

```python
rdd1.union(rdd2).distinct().collect()
```

#### Intersection of Two RDDs
The `intersection()` transformation returns an RDD containing the elements common to both RDDs, with duplicates removed.

```python
rdd1.intersection(rdd2).collect()
```

#### Subtraction of Two RDDs
The `subtract()` transformation returns an RDD containing elements in the first RDD but not in the second.

```python
rdd1.subtract(rdd2).collect()
```
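For reference, the four set operations above can be mirrored with plain Python lists and sets. This is a local sketch under the assumption that Spark may return elements in a different order, so results are sorted before comparing; the variable names are illustrative.

```python
list1 = [1, 2, 3, 4, 5]
list2 = [3, 4, 5, 6, 7]

# union(): plain concatenation, duplicates are kept.
union_all = list1 + list2

# union().distinct(): duplicates removed.
union_distinct = sorted(set(union_all))

# intersection(): elements present in both inputs, without duplicates.
common = sorted(set(list1) & set(list2))

# subtract(): elements of the first input that are absent from the second.
lookup = set(list2)
only_in_first = sorted(x for x in list1 if x not in lookup)

print(sorted(union_all))  # [1, 2, 3, 3, 4, 4, 5, 5, 6, 7]
print(union_distinct)     # [1, 2, 3, 4, 5, 6, 7]
print(common)             # [3, 4, 5]
print(only_in_first)      # [1, 2]
```

When checking Spark output against these values, comparing `sorted(rdd.collect())` rather than the raw list avoids spurious mismatches caused by ordering.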

#### Reflections

- **RDD Basics in PySpark**: How to create an RDD from a list using `spark.sparkContext.parallelize()` and perform fundamental operations.
- **Counting and Collecting**: Techniques for counting elements in an RDD and collecting them into a list.
- **Retrieving Elements**: How to retrieve the first element and a subset of elements from an RDD.
- **Computing Statistics**: Methods for computing the sum, minimum, maximum, mean, variance, and standard deviation of RDD elements.
- **RDD Transformations and Actions**: How to use `map()`, `filter()`, and `reduce()` to manipulate and process data in an RDD.
- **Combining and Comparing RDDs**: Techniques for performing set operations like `union()`, `intersection()`, and `subtract()` on RDDs.
- **Applying Functions to RDD Elements**: Use of lambda functions within RDD transformations to create complex mappings and filters.