**Table of contents**<a id='toc0_'></a>    
- [Create an RDD from a list](#toc1_)    
- [Create an RDD from a list of tuples](#toc2_)    
- [Create an RDD from a file](#toc3_)    
  - [Using textFile](#toc3_1_)    
  - [Using wholeTextFiles()](#toc3_2_)    
- [Create an RDD from the Existing RDD with a constant column](#toc4_)    
- [Create an Empty RDD](#toc5_)    
- [Create a Key-Value RDD](#toc6_)    
- [Create an RDD with Random numbers](#toc7_)    


In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("RDD-Examples").getOrCreate()
sc = spark.sparkContext

# <a id='toc1_'></a>[Create an RDD from a list](#toc0_)

In [2]:
numbers = list(range(10))
rdd = spark.sparkContext.parallelize(numbers)
rdd.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# <a id='toc2_'></a>[Create an RDD from a list of tuples](#toc0_)

In [3]:
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35), ("Alice", 40)]
rdd_1 = spark.sparkContext.parallelize(data)
rdd_1.collect()

[('Alice', 25), ('Bob', 30), ('Charlie', 35), ('Alice', 40)]

# <a id='toc3_'></a>[Create an RDD from a file](#toc0_)

## <a id='toc3_1_'></a>[Using textFile](#toc0_)

- **`textFile()`**:
  - Reads a text file line by line, treating each line as an individual element in the RDD.
  - Suitable for large, line-oriented text files (e.g., logs, CSVs).
  - **Output**: Each line is a separate element (String) in the RDD.

In [5]:
rdd_2 = spark.sparkContext.textFile("./data/data1.csv").map(lambda a: a.split(","))
rdd_2.collect()

# [['rechargeid', 'rechargedate', 'remainingdays', 'validity'],
#  ['r201235', '20200511', '1', 'online'],
#  ['r201236', '20210315', '3', 'offline'],
#  ['r201237', '20220101', '5', 'online'],
#  ['r201238', '20211225', '7', 'offline'],
#  ['r201239', '20221010', '2', 'online']]

# Create a DataFrame from the RDD
header = rdd_2.first()  # First row holds the column names
rdd_no_header = rdd_2.filter(lambda row: row != header)  # Drop the header row

# Use the header values as column names; every column is inferred as a string
df = rdd_no_header.toDF(header)
df.show()

+----------+------------+-------------+--------+
|rechargeid|rechargedate|remainingdays|validity|
+----------+------------+-------------+--------+
|   r201235|    20200511|            1|  online|
|   r201236|    20210315|            3| offline|
|   r201237|    20220101|            5|  online|
|   r201238|    20211225|            7| offline|
|   r201239|    20221010|            2|  online|
+----------+------------+-------------+--------+



In [None]:
# create rdd from text file
rdd_text = spark.sparkContext.textFile("files/rdd_output.txt")
rdd_text.collect()

## <a id='toc3_2_'></a>[Using wholeTextFiles()](#toc0_)

- Reads each file as a single record: a pair of file path and file content.
- Best suited to many small files that must each be processed whole, since every file's content is loaded as one string.
- **Output**: Each element is a tuple of `(path, content)`; the path is the full file URI (e.g., `file:/...`), not just the file name.

In [None]:
rdd = spark.sparkContext.wholeTextFiles("path/to/files/*")
# Output: RDD of (str, str) pairs, where the first string is the file path and the second is the file content


# [('file:/f:/Pyspark/3. Pyspark RDD Ops/files/textFile.txt',
#   'seller_id,seller_name,daily_target\r\n0,seller_0,25000\r\n1,seller_1,176\r\n2,seller_2,173\r\n3,seller_3,186\r\n4,seller_4,145\r\n5,seller_5,129\r\n6,seller_6,194\r\n7,seller_7,142\r\n8,seller_8,154\r\n9,seller_9,168\r\n')]

# <a id='toc4_'></a>[Create an RDD from the Existing RDD with a constant column](#toc0_)

In [6]:
rdd_3 = rdd_2.map(lambda a: a + ['constant'])

rdd_3.collect()

[['rechargeid', 'rechargedate', 'remainingdays', 'validity', 'constant'],
 ['r201235', '20200511', '1', 'online', 'constant'],
 ['r201236', '20210315', '3', 'offline', 'constant'],
 ['r201237', '20220101', '5', 'online', 'constant'],
 ['r201238', '20211225', '7', 'offline', 'constant'],
 ['r201239', '20221010', '2', 'online', 'constant']]

# <a id='toc5_'></a>[Create an Empty RDD with no partitions](#toc0_)

In [11]:
rdd_4 = spark.sparkContext.emptyRDD()  # emptyRDD is a method; without () this only references the bound method
print(rdd_4.getNumPartitions())
rdd_4.collect()

0


[]


In [8]:
rdd_5 = spark.sparkContext.parallelize([])  # empty, but still split into the default number of partitions
print(rdd_5)
rdd_5.collect()

ParallelCollectionRDD[18] at readRDDFromFile at PythonRDD.scala:289


[]

# Create an Empty RDD with partitions

In [None]:
rdd8 = spark.sparkContext.parallelize([], 10)  # Creates an empty RDD with 10 partitions

# <a id='toc6_'></a>[Create a Key-Value RDD](#toc0_)

In [12]:
rdd_6 = spark.sparkContext.parallelize(["hello","world","good","hello"])
rdd_6.map(lambda w: (w,1)).collect()

[('hello', 1), ('world', 1), ('good', 1), ('hello', 1)]

# <a id='toc7_'></a>[Create an RDD with Random numbers](#toc0_)

In [4]:
import numpy as np

lst = np.random.randint(0, 10, 20)
print(lst)

# Parallelize the values to create an RDD; convert the NumPy array to a plain Python list first
A = sc.parallelize(lst.tolist())

[3 8 2 5 1 1 8 8 0 8 9 1 4 2 6 5 8 8 7 4]
