In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("RDD basics").getOrCreate()

## Creating RDDs in PySpark

<img src="ways-to-create-rdd-in-spark-1.png">

#### 1. Using parallelize method

In [3]:
num = [1, 2, 3, 4, 5, 6, 2]
RDD = spark.sparkContext.parallelize(num)
RDD

ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195

#### 2. From existing RDDs

In [4]:
RDD2 = RDD.filter(lambda x: x % 2 == 0)
RDD2

PythonRDD[1] at RDD at PythonRDD.scala:53

#### 3. From external sources (CSV, JSON, etc.)

In [7]:
RDD3 = spark.sparkContext.textFile('README.md')
RDD3

README.md MapPartitionsRDD[7] at textFile at NativeMethodAccessorImpl.java:0

### RDD Transformations

### Map

In [8]:
RDD.map(lambda x: x + 2)

PythonRDD[8] at RDD at PythonRDD.scala:53

### FlatMap

In [9]:
# Split each line into individual words
RDD3.flatMap(lambda x: x.split())

PythonRDD[9] at RDD at PythonRDD.scala:53
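
The difference between `map` and `flatMap` is easier to see on a tiny input. The cell below is a plain-Python sketch, not Spark code: the `lines` list is a hypothetical stand-in for the lines of a text file, and list comprehensions plus `itertools.chain` only imitate what the two transformations would produce.

```python
from itertools import chain

# Hypothetical stand-in for the lines of a text file (not read from README.md)
lines = ["hello spark", "rdd basics", ""]

# map-like: exactly one output element per input line (a list of words each)
mapped = [line.split() for line in lines]

# flatMap-like: apply the same function, then flatten all results into one sequence
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(mapped)
print(flat_mapped)
```

Note that the empty line contributes an empty list under `map` but nothing at all under `flatMap`.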

### Distinct

In [10]:
# Keep one copy of each distinct element
RDD.distinct()

PythonRDD[14] at RDD at PythonRDD.scala:53
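
The transformation cells above display objects such as `PythonRDD[8]` rather than data because transformations are lazy: they only describe a computation, and nothing runs until an action is called. A rough plain-Python analogy, using the built-in lazy `map` iterator instead of Spark (the `add_two` helper and the `calls` list are hypothetical names used only to observe when work happens):

```python
calls = []  # records every time the mapped function actually runs

def add_two(x):
    calls.append(x)
    return x + 2

num = [1, 2, 3, 4, 5, 6, 2]

# Building the pipeline runs nothing, much like RDD.map(...) returning an RDD object
pipeline = map(add_two, num)
calls_before = len(calls)

# Consuming the pipeline plays the role of an action such as collect()
result = list(pipeline)
calls_after = len(calls)

print(calls_before, calls_after)
print(result)
```

The analogy is loose: a Python iterator can be consumed only once, whereas an RDD can be evaluated again by each new action.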

### RDD Actions

### Collect

In [11]:
# collect() returns every element to the driver; avoid it on large RDDs
RDD.collect()
RDD2.collect()
RDD3.collect()

[1, 2, 3, 4, 5, 6, 2]

[2, 4, 6, 2]

['# pysparkdemo',
 'RDD (Resilient Distributed Dataset) is the fundamental data structure of Apache Spark which are an immutable collection of objects which computes on the different node of the cluster. Each and every dataset in Spark RDD is logically partitioned across many servers so that they can be computed on different nodes of the cluster.',
 '',
 'In this blog, we are going to get to know about what is RDD in Apache Spark. What are the features of RDD, What is the motivation behind RDDs, RDD vs DSM? We will also cover Spark RDD operation i.e. transformations and actions, various limitations of RDD in Spark and how RDD make Spark feature rich in this Spark tutorial. ']

### Take

In [12]:
# take(n) and first() return only a few elements, so prefer them to collect()
RDD.take(2)
RDD2.take(3)
RDD3.first()

[1, 2]

[2, 4, 6]

'# pysparkdemo'

### Reduce

In [13]:
# Reduce action: sum all elements (the function should be commutative and associative)
RDD.reduce(lambda x, y: x + y)

23
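
`reduce` is applied within each partition first, and the partial results are then combined, which is why the function passed to it should be commutative and associative. The sketch below imitates this in plain Python with `functools.reduce`. The three-way split in `partitions` is an assumption for illustration; Spark chooses the real layout.

```python
from functools import reduce

num = [1, 2, 3, 4, 5, 6, 2]

# Hypothetical split into 3 partitions; Spark decides the real layout
partitions = [num[0:3], num[3:5], num[5:7]]

add = lambda x, y: x + y
partials = [reduce(add, p) for p in partitions]  # one partial sum per partition
total = reduce(add, partials)                    # combine the partial sums

# Subtraction is neither commutative nor associative, so the result depends on the split
sub = lambda x, y: x - y
sequential_sub = reduce(sub, num)
partitioned_sub = reduce(sub, [reduce(sub, p) for p in partitions])

print(partials, total)
print(sequential_sub, partitioned_sub)
```

With addition, the partitioned result matches the value returned by `RDD.reduce` above. With subtraction, the two strategies disagree, so such a function would give partition-dependent answers.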

### Count

In [16]:
RDD.count()

7

### Paired RDD Transformations

In [17]:
# Basic reduceByKey example in Python
# Create a pair RDD x of (key, value) tuples, split across 3 partitions
x = spark.sparkContext.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1),
                                    ("b", 1), ("b", 1), ("b", 1), ("b", 1)], 3)

# Sum the values for each key
y = x.reduceByKey(lambda accum, n: accum + n)
y.collect()

[('b', 5), ('a', 3)]
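
Conceptually, `reduceByKey` merges values per key inside each partition and then merges those partial results across partitions. The following plain-Python sketch mimics that two-stage idea. It is not how Spark is implemented: the `combine` helper and the way the pairs are split into `partitions` are hypothetical.

```python
pairs = [("a", 1), ("b", 1), ("a", 1), ("a", 1),
         ("b", 1), ("b", 1), ("b", 1), ("b", 1)]

# Hypothetical 3-way split; Spark decides the real layout
partitions = [pairs[0:3], pairs[3:6], pairs[6:8]]

def combine(items, func):
    """Merge the values of equal keys with func, returning a dict."""
    acc = {}
    for key, value in items:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc

func = lambda accum, n: accum + n

# Stage 1: combine within each partition
local = [combine(p, func) for p in partitions]

# Stage 2: merge the per-partition results
merged = combine([item for d in local for item in d.items()], func)

print(local)
print(sorted(merged.items()))
```

The merged totals agree with the `y.collect()` output above. The order of keys in Spark's result is not something to rely on.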

### Word count example

In [25]:
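# Read README.md as an RDD of lines, requesting at least 4 partitions
RDD3 = spark.sparkContext.textFile('README.md', 4)
# Split into words, count each word, swap to (count, word), then sort by count, descending
words = RDD3.flatMap(lambda x: x.split()) \
        .map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y) \
        .map(lambda x: (x[1], x[0])).sortByKey(ascending=False)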
for w in words.take(10):
    print(w)

(7, 'RDD')
(6, 'of')
(6, 'the')
(6, 'Spark')
(4, 'in')
(4, 'is')
(3, 'are')
(3, 'and')
(2, 'Apache')
(2, 'different')
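
For comparison, the same count-then-sort logic can be sketched on a single machine with `collections.Counter`. The `text` string is a hypothetical sample, not the contents of README.md, and the sort key is the count only, as in the `sortByKey` call above.

```python
from collections import Counter

# Hypothetical sample text standing in for the file contents
text = "RDD is the core of Spark and the RDD API is the base of Spark"

# Count each word (the flatMap + map + reduceByKey steps)
counts = Counter(text.split())

# Swap to (count, word) and sort by count only, descending (the map + sortByKey steps)
by_count = sorted(((n, w) for w, n in counts.items()),
                  key=lambda pair: pair[0], reverse=True)

for pair in by_count[:3]:
    print(pair)
```

Because only the count is used as the sort key, words with equal counts have no meaningful order among themselves. The same likely applies to tied entries such as the three words with count 6 in the Spark output.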


In [26]:
# Save the output; Spark creates a directory named "test.txt" with one part file per partition
words.saveAsTextFile("test.txt")

In [27]:
# Check the number of partitions
RDD3.getNumPartitions()

4

In [30]:
# Coalesce to a single partition so the "output" directory holds a single part file
words.coalesce(1).saveAsTextFile("output")