**Table of contents**<a id='toc0_'></a>    
- [Create an RDD from a list](#toc1_)    
- [Create an RDD from a list of tuples](#toc2_)    
- [Create an RDD from a text file (CSV)](#toc3_)    
- [Create an RDD from an existing RDD with a constant column](#toc4_)    
- [Create an Empty RDD](#toc5_)    
- [Create a Key-Value RDD](#toc6_)    


In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("RDD-Examples").getOrCreate()

# <a id='toc1_'></a>[Create an RDD from a list](#toc0_)

In [2]:
numbers = list(range(10))
rdd = spark.sparkContext.parallelize(numbers)
rdd.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# <a id='toc2_'></a>[Create an RDD from a list of tuples](#toc0_)

In [3]:
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35), ("Alice", 40)]
rdd_1 = spark.sparkContext.parallelize(data)
rdd_1.collect()

[('Alice', 25), ('Bob', 30), ('Charlie', 35), ('Alice', 40)]

# <a id='toc3_'></a>[Create an RDD from a text file (CSV)](#toc0_)

In [5]:
rdd_2 = spark.sparkContext.textFile("./data/data1.csv").map(lambda a: a.split(","))
rdd_2.collect()

# [['rechargeid', 'rechargedate', 'remainingdays', 'validity'],
#  ['r201235', '20200511', '1', 'online'],
#  ['r201236', '20210315', '3', 'offline'],
#  ['r201237', '20220101', '5', 'online'],
#  ['r201238', '20211225', '7', 'offline'],
#  ['r201239', '20221010', '2', 'online']]

# Create a DataFrame from the RDD
header = rdd_2.first()  # The first row holds the column names
rdd_no_header = rdd_2.filter(lambda row: row != header)  # Drop the header row

# Use the header fields as column names; every column is inferred as a string
df = rdd_no_header.toDF(header)
df.show()

+----------+------------+-------------+--------+
|rechargeid|rechargedate|remainingdays|validity|
+----------+------------+-------------+--------+
|   r201235|    20200511|            1|  online|
|   r201236|    20210315|            3| offline|
|   r201237|    20220101|            5|  online|
|   r201238|    20211225|            7| offline|
|   r201239|    20221010|            2|  online|
+----------+------------+-------------+--------+
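Splitting on `","` works for this file because no field contains a comma or a quote character. For CSV files with quoted fields, a minimal sketch using Python's standard `csv` module could look like the following. The helper name `parse_csv_line` is hypothetical and is not part of PySpark.

```python
import csv

def parse_csv_line(line):
    """Parse one CSV line, keeping commas that appear inside quoted fields."""
    return next(csv.reader([line]))

# A plain row parses the same way as line.split(",")
plain = parse_csv_line("r201235,20200511,1,online")

# A quoted field with an embedded comma stays in one piece
quoted = parse_csv_line('r201240,20230101,4,"online,promo"')

print(plain)
print(quoted)

# Assumed usage with the session created earlier:
# rdd_2 = spark.sparkContext.textFile("./data/data1.csv").map(parse_csv_line)
```

The second sample row is made up for illustration; it does not come from `data1.csv`.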



# <a id='toc4_'></a>[Create an RDD from an existing RDD with a constant column](#toc0_)

In [6]:
rdd_3 = rdd_2.map(lambda a: a + ['constant'])

rdd_3.collect()

[['rechargeid', 'rechargedate', 'remainingdays', 'validity', 'constant'],
 ['r201235', '20200511', '1', 'online', 'constant'],
 ['r201236', '20210315', '3', 'offline', 'constant'],
 ['r201237', '20220101', '5', 'online', 'constant'],
 ['r201238', '20211225', '7', 'offline', 'constant'],
 ['r201239', '20221010', '2', 'online', 'constant']]
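Note that the `map` above appends `'constant'` to the header row as well as to the data rows. If the header should carry a column name instead, one possible sketch is shown below. The helper `add_constant_column` and the column name `'source'` are hypothetical choices for illustration.

```python
def add_constant_column(row, header, column_name, value):
    """Append column_name to the header row and value to every other row."""
    if row == header:
        return row + [column_name]
    return row + [value]

# Small local sample mirroring the first two rows of rdd_2
header = ['rechargeid', 'rechargedate', 'remainingdays', 'validity']
rows = [header, ['r201235', '20200511', '1', 'online']]

result = [add_constant_column(r, header, 'source', 'constant') for r in rows]
print(result)

# Assumed usage with the RDD from the previous section:
# header = rdd_2.first()
# rdd_3 = rdd_2.map(lambda a: add_constant_column(a, header, 'source', 'constant'))
```

The function is plain Python, so it can be checked locally before it is passed to `map`.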

# <a id='toc5_'></a>[Create an Empty RDD](#toc0_)

In [11]:
rdd_4 = spark.sparkContext.emptyRDD()  # emptyRDD is a method, so it must be called
rdd_4.collect()

[]


In [8]:
rdd_5 = spark.sparkContext.parallelize([])
print(rdd_5)
rdd_5.collect()

ParallelCollectionRDD[18] at readRDDFromFile at PythonRDD.scala:289


[]

# <a id='toc6_'></a>[Create a Key-Value RDD](#toc0_)

In [12]:
rdd_6 = spark.sparkContext.parallelize(["hello","world","good","hello"])
rdd_6.map(lambda w: (w,1)).collect()

[('hello', 1), ('world', 1), ('good', 1), ('hello', 1)]
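A common next step with `(word, 1)` pairs is to sum the values per key, which in Spark is usually done with `reduceByKey`. The sketch below is a local, Spark-free stand-in that shows what such a reduction computes for the words above. The helpers `to_pair` and `reduce_by_key_local` are hypothetical names, not PySpark functions.

```python
from operator import add

def to_pair(word):
    """Turn a word into a (key, value) pair, as in the map above."""
    return (word, 1)

def reduce_by_key_local(pairs, func):
    """Local stand-in for reduceByKey: merge the values of each key with func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

words = ["hello", "world", "good", "hello"]
counts = reduce_by_key_local(map(to_pair, words), add)
print(counts)

# Assumed Spark equivalent with the RDD above:
# rdd_6.map(to_pair).reduceByKey(add).collect()
```

In Spark the order of the collected pairs is not guaranteed, which is why the local version returns a dictionary rather than a list.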