In [1]:
from pyspark.sql import SparkSession


### Summary of Key Classes:
1. **RDD** – Base class for all RDDs.
2. **ParallelCollectionRDD** – RDD created from in-memory collections (via `sc.parallelize`).
3. **HadoopRDD** – RDD created by reading from Hadoop-based file systems (e.g., HDFS, S3).
4. **UnionRDD** – RDD created by applying `union()` on two or more RDDs.
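5. **MapPartitionsRDD** – RDD created by transformations like `map()`, `filter()`, and `mapPartitions()` in the Scala API; in PySpark the lineage shows `PythonRDD` for Python functions.
6. **PairRDD** – RDD of key-value pairs, supporting operations like `reduceByKey()` and `groupByKey()`. This is a usage pattern rather than a separate class.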
7. **CoalescedRDD** – RDD created by applying `coalesce()` to reduce the number of partitions.
8. **CheckpointRDD** – RDD that has been checkpointed to fault-tolerant storage.
9. **WholeTextFileRDD** – RDD created by reading whole text files (using `sc.wholeTextFiles()`).
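10. **CachedRDD** – RDD that has been cached or persisted (via `cache()` or `persist()`). Caching sets a storage level on the existing RDD rather than creating a new class.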

In [2]:
spark = SparkSession.builder.appName("rddTests").getOrCreate()

In [3]:
sc = spark.sparkContext

In [4]:
def show_plan(rdd):
    # toDebugString() returns bytes in PySpark, so decode it and print the lineage line by line.
    for x in rdd.toDebugString().decode().split('\n'):
        print(x)
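
The lineage text printed by `show_plan` names each RDD class followed by its id in square brackets. As a minimal sketch (the helper name `lineage_classes` and the regular expression are assumptions for illustration, not part of Spark), the class names can be pulled out of that text with plain Python:

```python
import re

# Matches tokens such as "HadoopRDD[0]" and captures the class name and the RDD id.
_RDD_TOKEN = re.compile(r"(\w+RDD)\[(\d+)\]")


def lineage_classes(debug_text):
    """Return (class_name, rdd_id) pairs in the order they appear in the text."""
    return [(name, int(rdd_id)) for name, rdd_id in _RDD_TOKEN.findall(debug_text)]


# Sample text copied from the textFile lineage printed later in this notebook.
sample = (
    "(2) ../data/company_data/companies.csv MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []\n"
    " |  ../data/company_data/companies.csv HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []"
)
print(lineage_classes(sample))
# [('MapPartitionsRDD', 1), ('HadoopRDD', 0)]
```

With a live RDD, the same helper could be called as `lineage_classes(rdd.toDebugString().decode())`.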

### **RDD** (Base Class)
   - This is the fundamental abstraction representing a distributed collection of data in Spark.
   - It provides methods for various transformations (like `map`, `filter`, `flatMap`) and actions (like `collect`, `count`, `saveAsTextFile`).
   - The `RDD` class itself is generic and can be used for any distributed collection.

   Example:
   ```scala
   val rdd = sc.parallelize(Seq(1, 2, 3))

   ```

### **HadoopRDD**
   - This subclass is used when reading data from external storage systems such as Hadoop’s HDFS, Amazon S3, or any Hadoop-compatible file system.
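   - When you use methods like `sc.textFile()` or `sc.sequenceFile()`, Spark creates a `HadoopRDD`. For `textFile()` it is wrapped in a `MapPartitionsRDD` that converts each record to a string, which is why both classes appear in the lineage.
   - Its partitions correspond to the input splits of the underlying files, so large files can be read in parallel across the cluster.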
   
   Example:
   ```scala
   val rdd = sc.textFile("hdfs://path/to/data")
   
   ```


In [5]:
rdd = sc.textFile("../data/company_data/companies.csv")

In [6]:
show_plan(rdd)

(2) ../data/company_data/companies.csv MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []
 |  ../data/company_data/companies.csv HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []


### **ParallelCollectionRDD**
   - This is the subclass of `RDD` used when you create an RDD from a parallelized collection (like a list or an array) via `sc.parallelize()`.
   - It represents an RDD that is made from an existing in-memory collection, distributed across the cluster.
   
   Example:
   ```scala
   val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
   ```

In [7]:
rdd = sc.parallelize([1, 2, 3, 4, 5])

In [8]:
show_plan(rdd)

(8) ParallelCollectionRDD[2] at readRDDFromFile at PythonRDD.scala:287 []


In [9]:
map_rdd = rdd.map(lambda x: x*2)
map_rdd.collect()

[2, 4, 6, 8, 10]

In [10]:
show_plan(map_rdd)

(8) PythonRDD[3] at collect at C:\Users\shahv\AppData\Local\Temp\ipykernel_10484\1503817425.py:2 []
 |  ParallelCollectionRDD[2] at readRDDFromFile at PythonRDD.scala:287 []


### **MapPartitionsRDD**
   - In the Scala API this subclass is created by narrow transformations such as `map()`, `filter()`, and `mapPartitions()`. In PySpark the function runs in Python worker processes, so the lineage shows `PythonRDD` instead.
   - `mapPartitions()` passes an iterator over a whole partition to the function rather than one element at a time, which can be more efficient when per-partition setup (for example, opening a connection) is expensive.
   
   Example:
   ```scala
   val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
   val mapPartitionsRdd = rdd.mapPartitions(iterator => iterator.map(x => x * 2))
   ```

In [11]:
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
rdd.glom().collect()

[[1], [2], [3], [4, 5], [6], [7], [8], [9, 10]]

In [12]:
def sum_partition(iterator):
    yield sum(iterator)

result_rdd = rdd.mapPartitions(sum_partition)
result_rdd.glom().collect()

[[1], [2], [3], [9], [6], [7], [8], [19]]
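
The per-partition sums above can be reproduced without Spark. The sketch below is a local model only: `local_map_partitions` is a hypothetical helper that calls the function once per partition with an iterator, which is the contract `mapPartitions()` follows.

```python
def sum_partition(iterator):
    # Called once per partition; yields a single value for the whole partition.
    yield sum(iterator)


def local_map_partitions(partitions, func):
    """Apply func to an iterator over each partition and collect each result as a list."""
    return [list(func(iter(part))) for part in partitions]


# Partition layout copied from the glom() output above.
partitions = [[1], [2], [3], [4, 5], [6], [7], [8], [9, 10]]
print(local_map_partitions(partitions, sum_partition))
# [[1], [2], [3], [9], [6], [7], [8], [19]]
```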

In [13]:
show_plan(result_rdd)

(8) PythonRDD[7] at RDD at PythonRDD.scala:53 []
 |  ParallelCollectionRDD[4] at readRDDFromFile at PythonRDD.scala:287 []


### **UnionRDD**
   - This class represents the union of two or more RDDs. It is produced when you use the `union()` transformation.
   - It combines the data from multiple RDDs into one RDD.
   
   Example:
   ```scala
   val rdd1 = sc.parallelize(Seq(1, 2))
   val rdd2 = sc.parallelize(Seq(3, 4))
   val unionRdd = rdd1.union(rdd2)
   ```


In [14]:
rdd1 = sc.parallelize([1, 2])
rdd2 = sc.parallelize([4, 3])
union_rdd = rdd1.union(rdd2)
union_rdd.collect()

[1, 2, 4, 3]

In [15]:
show_plan(union_rdd)

(16) UnionRDD[10] at union at NativeMethodAccessorImpl.java:0 []
 |   ParallelCollectionRDD[8] at readRDDFromFile at PythonRDD.scala:287 []
 |   ParallelCollectionRDD[9] at readRDDFromFile at PythonRDD.scala:287 []
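
The `(16)` in the lineage is the partition count: `union()` keeps every partition of both parents instead of merging them. A rough local model, assuming each parent has 8 partitions with the element placement shown (the placement is an illustrative assumption, and `local_union` is a hypothetical helper):

```python
def local_union(*partitioned):
    """Concatenate the partition lists of several datasets without merging or deduplicating."""
    result = []
    for partitions in partitioned:
        result.extend(partitions)
    return result


# Assumed layouts: 8 partitions per parent, most of them empty.
parts1 = [[], [], [], [1], [], [], [], [2]]
parts2 = [[], [], [], [4], [], [], [], [3]]

merged = local_union(parts1, parts2)
print(len(merged))
# 16
print([x for part in merged for x in part])
# [1, 2, 4, 3]
```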


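### **PairRDD**
   - A pair RDD is any `RDD` whose elements are key-value pairs (2-tuples in Python). It is not a separate subclass: in Scala the key-based methods come from `PairRDDFunctions`, and in PySpark they are defined directly on `RDD`, so the lineage still shows the underlying class.
   - Pair RDDs allow key-based transformations like `reduceByKey()`, `groupByKey()`, `join()`, and `cogroup()`.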
   
   Example:
   ```scala
   val pairRdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
   ```


In [16]:
pair_rdd = sc.parallelize((("a", 1), ("b", 2), ("a", 3)))
pair_rdd.collect()

[('a', 1), ('b', 2), ('a', 3)]

In [17]:
show_plan(pair_rdd)

(8) ParallelCollectionRDD[11] at readRDDFromFile at PythonRDD.scala:287 []
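
To make the key-based operations concrete, here is a local model of what `reduceByKey()` computes. It is not how Spark executes it (Spark combines values per partition and then shuffles by key); `local_reduce_by_key` is a hypothetical helper, and the result is sorted only to make the output deterministic.

```python
def local_reduce_by_key(pairs, func):
    """Combine values that share a key with func, mimicking the result of reduceByKey."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return sorted(merged.items())


pairs = [("a", 1), ("b", 2), ("a", 3)]
print(local_reduce_by_key(pairs, lambda x, y: x + y))
# [('a', 4), ('b', 2)]
```

The PySpark counterpart would be `pair_rdd.reduceByKey(lambda x, y: x + y).collect()`, where the order of the returned pairs is not guaranteed.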


### **CoalescedRDD**
   - This subclass is used when applying a `coalesce()` transformation, which is designed to reduce the number of partitions in an RDD without causing a full shuffle of the data.
   - This is especially useful when you want to optimize the number of partitions, typically for operations like writing data to disk.
   
   Example:
   ```scala
   val coalescedRdd = rdd.coalesce(2)
   
   ```


In [18]:
rdd = sc.parallelize((("a", 1), ("b", 2), ("a", 3)))
rdd.glom().collect()

[[], [], [('a', 1)], [], [], [('b', 2)], [], [('a', 3)]]

In [19]:
coalesce_rdd = rdd.coalesce(2)
coalesce_rdd.glom().collect()

[[('a', 1)], [('b', 2), ('a', 3)]]

In [20]:
show_plan(coalesce_rdd)

(2) CoalescedRDD[14] at coalesce at NativeMethodAccessorImpl.java:0 []
 |  ParallelCollectionRDD[12] at readRDDFromFile at PythonRDD.scala:287 []
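
The result above is consistent with `coalesce()` merging contiguous groups of parent partitions instead of redistributing individual records. The sketch below is a simplified local model under that assumption; Spark's actual coalescer also takes data locality into account, and `local_coalesce` is a hypothetical helper.

```python
def local_coalesce(partitions, num_partitions):
    """Merge contiguous ranges of parent partitions into at most num_partitions groups."""
    n = len(partitions)
    target = min(num_partitions, n)  # without a shuffle, the count can only shrink
    result = []
    for i in range(target):
        start = i * n // target
        end = (i + 1) * n // target
        # Concatenate the records of parent partitions start..end-1 in order.
        result.append([item for part in partitions[start:end] for item in part])
    return result


# Parent layout copied from the glom() output above.
parents = [[], [], [("a", 1)], [], [], [("b", 2)], [], [("a", 3)]]
print(local_coalesce(parents, 2))
# [[('a', 1)], [('b', 2), ('a', 3)]]
```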


### **CheckpointRDD**
   - This subclass represents an RDD that has been checkpointed to storage, typically HDFS.
   - Checkpointing saves the RDD's data (not its lineage) to fault-tolerant storage and truncates the lineage, so Spark can recover from failures without recomputing the full chain of transformations.

   Example:
   ```scala
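   sc.setCheckpointDir("hdfs://path/to/checkpoints") // placeholder directory; must be set before checkpoint()
   rdd.checkpoint()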
   ```


In [21]:
sc

In [22]:
# rdd = sc.parallelize((("Alice", 1), ("Brian", 2), ("Claire", 3)))
# rdd.checkpoint()

### **WholeTextFileRDD**
   - This class is used when reading a directory of text files into an RDD with `sc.wholeTextFiles()`.
   - Each element in this RDD is a tuple, where the first element is the file path, and the second is the contents of the file.
   
   Example:
   ```scala
   val wholeTextFileRdd = sc.wholeTextFiles("hdfs://path/to/directory")
   
   ```

In [23]:
whole_text_file_rdd = sc.wholeTextFiles("../data/company_data/companies.csv")
whole_text_file_rdd.collect()

[('file:/D:/BigData/Spark/data/company_data/companies.csv',
  'id,company,country_id\r\n1,Mybuzz,11\r\n2,Chatterbridge,3\r\n3,Skyble,7\r\n4,Brainverse,4\r\n5,Jabbertype,7\r\n6,Zoombeat,12\r\n7,Tanoodle,8\r\n8,Feedmix,13\r\n9,Meembee,20\r\n10,Riffpath,7\r\n11,Dynabox,19\r\n12,Browsetype,3\r\n13,Dynazzy,20\r\n14,Demizz,19\r\n15,Riffpedia,18\r\n16,Zava,13\r\n17,Pixonyx,20\r\n18,Yambee,15\r\n19,Yombu,7\r\n20,Voomm,14\r\n21,Skilith,12\r\n22,Ooba,11\r\n23,Oyoyo,2\r\n24,Avavee,3\r\n25,Livepath,13\r\n26,Meedoo,12\r\n27,Dynabox,13\r\n28,Skipfire,13\r\n29,Flashdog,2\r\n30,Twimm,12\r\n31,Tagfeed,14\r\n32,Teklist,11\r\n33,Tanoodle,6\r\n34,Linkbuzz,14\r\n35,Jaxbean,9\r\n36,Babblestorm,2\r\n37,Wikizz,4\r\n38,Quatz,5\r\n39,Bubbletube,12\r\n40,Dazzlesphere,18\r\n41,Centimia,17\r\n42,Thoughtbeat,16\r\n43,Roombo,17\r\n44,Shuffledrive,8\r\n45,Roodel,17\r\n46,Twitterworks,8\r\n47,Thoughtsphere,8\r\n48,Meejo,16\r\n49,Divavu,9\r\n50,Yamia,10\r\n51,Meezzy,2\r\n52,Thoughtmix,17\r\n53,Quire,20\r\n54,Babblestor

In [24]:
show_plan(whole_text_file_rdd)

(1) ../data/company_data/companies.csv MapPartitionsRDD[17] at wholeTextFiles at NativeMethodAccessorImpl.java:0 []
 |  WholeTextFileRDD[16] at wholeTextFiles at NativeMethodAccessorImpl.java:0 []
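
Because each element is a `(path, content)` pair, the file content usually needs a further parsing step. A minimal sketch, assuming the content is CSV with a header row as in the output above (`parse_whole_file` is a hypothetical helper, and the sample is shortened to the first two data rows):

```python
import csv
import io


def parse_whole_file(path_and_content):
    """Turn one (path, content) pair into a list of row dicts keyed by the CSV header."""
    _path, content = path_and_content
    # newline="" lets the csv module handle the \r\n line endings itself.
    reader = csv.DictReader(io.StringIO(content, newline=""))
    return list(reader)


# Shortened sample built from the first rows of the output above.
sample = (
    "file:/D:/BigData/Spark/data/company_data/companies.csv",
    "id,company,country_id\r\n1,Mybuzz,11\r\n2,Chatterbridge,3\r\n",
)
rows = parse_whole_file(sample)
print(rows[0]["company"])
# Mybuzz
```

In PySpark the same function could be passed to `whole_text_file_rdd.flatMap(parse_whole_file)` to get one element per CSV row.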


### **CachedRDD**
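   - Spark has no separate `CachedRDD` class. Calling `cache()` or `persist()` marks the existing RDD with a storage level and returns that same RDD; the data is stored the first time an action computes it, which makes later operations faster.
   - `cache()` uses the default storage level, while `persist()` accepts a storage level explicitly. The lineage therefore keeps the original class name and adds the storage level in brackets.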
   
   Example:
   ```scala
   val cachedRdd = rdd.cache()
   
   ```

In [25]:
rdd = sc.parallelize((("Alice", 1), ("Brian", 2), ("Claire", 3)))
cache_rdd = rdd.cache()

In [26]:
show_plan(cache_rdd)

(8) ParallelCollectionRDD[18] at readRDDFromFile at PythonRDD.scala:287 [Memory Serialized 1x Replicated]
