# Learning Objectives

- Array and string manipulation in PySpark
- Resilient Distributed Datasets (RDD) in PySpark
- Difference between Python map-reduce and PySpark map-reduce

# Resilient Distributed Datasets (RDD) 

- Spark uses Resilient Distributed Datasets (RDDs) to process data in parallel, either across the nodes of a cluster or across the processor cores of a single machine

- RDDs are immutable, distributed collections of objects of any type

- Apache Spark RDD Basics: https://www.youtube.com/watch?v=NRo8TluH7KI

- PySpark RDD Tutorial: https://www.youtube.com/watch?v=e5ol7oyKV0A

In [38]:
from pyspark import SparkContext
sc = SparkContext()

ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=pyspark-shell, master=local[*]) created by __init__ at /var/folders/fj/r2kb_f4d3k1gxmcwsdc81_1r0000gn/T/ipykernel_13888/414631668.py:2 

In [39]:
# If the line above raises an error, try the following:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]").appName("SparkByExamples.com").getOrCreate() 
sc = spark.sparkContext

In [40]:
pythonList = [2.3,3.4,4.3,2.4,2.3,4.0]

In [41]:
parPythonData = sc.parallelize(pythonList,2)

### What is the type of parPythonData?

## Question: Can we get the first element of parPythonData with parPythonData[0]?

- Answer: No. `parPythonData` is no longer a Python list but an RDD (a PySpark data structure whose elements are distributed across partitions), so it does not support indexing. The cells below show how to access the elements of an RDD

In [42]:
type(parPythonData)

pyspark.rdd.RDD

In [43]:
parPythonData.first()

2.3

In [44]:
parPythonData.collect()

[2.3, 3.4, 4.3, 2.4, 2.3, 4.0]

In [45]:
parPythonData.take(2)

[2.3, 3.4]

In [46]:
parPythonData.getNumPartitions()

2

In [47]:
# glom() groups the elements of each partition into a list, which shows how Spark split the RDD
parPythonData.glom().collect()

[[2.3, 3.4, 4.3], [2.4, 2.3, 4.0]]

In [48]:
a = "my name is rishabh sharma"

a = a.split(' ')

In [49]:
a = sc.parallelize(a, 2)

In [50]:
a.collect()

['my', 'name', 'is', 'rishabh', 'sharma']

In [51]:
a.getNumPartitions()

2

In [52]:
a.glom().collect()

[['my', 'name'], ['is', 'rishabh', 'sharma']]
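The two `glom()` results above suggest that the elements are cut into contiguous, nearly equal chunks. The following plain-Python sketch (no Spark involved) reproduces those two splits. The function name `split_into_partitions` and the boundary rule are assumptions made for illustration; the real split happens inside Spark and may differ in detail.

```python
def split_into_partitions(items, num_slices):
    """Split a list into num_slices contiguous chunks (plain Python, no Spark)."""
    n = len(items)
    # Assumed boundary rule: chunk i covers positions [i*n // k, (i+1)*n // k).
    return [
        items[(i * n) // num_slices:((i + 1) * n) // num_slices]
        for i in range(num_slices)
    ]


number_parts = split_into_partitions([2.3, 3.4, 4.3, 2.4, 2.3, 4.0], 2)
word_parts = split_into_partitions("my name is rishabh sharma".split(" "), 2)
print(number_parts)
print(word_parts)
```

Every element lands in exactly one chunk, and the chunks keep the original order, which is also what `collect()` shows for the RDDs above.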

## Question:

- We split pythonList into 2 (or, in general, $n$) partitions
- Should the number of partitions, 2 (or $n$), equal the number of nodes in the cluster? If yes, why? If not, why not?

<img src="RDD_partitions_number_of_nodes.png" width="600" height="600">

In the figure above, the number of partitions (5 here) differs from the number of nodes (3 here), so the two do not have to be equal
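To make the partition/node distinction concrete, here is a small plain-Python sketch that hands 5 partitions to 3 nodes. The node names and the round-robin policy are hypothetical and chosen only for illustration; in a real cluster Spark's scheduler decides where each partition is processed.

```python
from collections import defaultdict

num_partitions = 5                      # as in the figure
nodes = ["node-0", "node-1", "node-2"]  # hypothetical node names

# Assumed policy, for illustration only: hand out partitions round-robin.
assignment = defaultdict(list)
for partition_id in range(num_partitions):
    node = nodes[partition_id % len(nodes)]
    assignment[node].append(partition_id)

for node in nodes:
    print(node, assignment[node])
```

With more partitions than nodes, some nodes simply process more than one partition.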

## The following list of temperatures is given

In [53]:
tempData = [59,57.2,53.6,55.4,51.8,53.6,55.4]

## Activity: Convert all of the temperatures in tempData to Centigrade

In [54]:
def farToCent(temp):
    return (temp - 32)*(5/9)

list(map(lambda x: farToCent(x), tempData))

[15.0,
 14.000000000000002,
 12.000000000000002,
 13.0,
 10.999999999999998,
 12.000000000000002,
 13.0]

In [68]:
temp = sc.parallelize(tempData, 2)

In [69]:
temp.collect()

[59, 57.2, 53.6, 55.4, 51.8, 53.6, 55.4]

In [70]:
# Call farToCent(x) inside the lambda; returning farToCent alone would yield the function object itself
centTemp = temp.map(lambda x: farToCent(x))

In [72]:
centTemp.collect()

[15.0,
 14.000000000000002,
 12.000000000000002,
 13.0,
 10.999999999999998,
 12.000000000000002,
 13.0]

def fahrenheitToCentigrade(temperature):
    centigrade = (temperature-32)*5/9
    return centigrade

## Create an RDD and do the same thing in PySpark

In [61]:
parTempData = sc.parallelize(tempData,2)

In [62]:
parTempData.collect()

[59, 57.2, 53.6, 55.4, 51.8, 53.6, 55.4]

In [66]:
parCentigradeData = parTempData.map(lambda x: farToCent(x))

In [73]:
parCentigradeData.collect()

[15.0,
 14.000000000000002,
 12.000000000000002,
 13.0,
 10.999999999999998,
 12.000000000000002,
 13.0]
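One way to see why the RDD `map` gives the same values as Python's `map` is to imitate the partitions by hand. The sketch below is plain Python, with the names `far_to_cent`, `temp_data`, and the cut point chosen for illustration: it converts each hand-made partition independently and then concatenates the results.

```python
def far_to_cent(temp):
    # Same formula as farToCent in the notebook.
    return (temp - 32) * (5 / 9)


temp_data = [59, 57.2, 53.6, 55.4, 51.8, 53.6, 55.4]

# Two hand-made "partitions"; the cut point is an arbitrary choice.
partitions = [temp_data[:3], temp_data[3:]]

# Convert each partition independently, as separate workers could.
mapped_partitions = [[far_to_cent(t) for t in part] for part in partitions]

# Concatenate the per-partition results in order, as collect() conceptually does.
collected = [t for part in mapped_partitions for t in part]

print(collected == list(map(far_to_cent, temp_data)))
```

Because `map` treats every element independently, the result does not depend on where the partition boundaries fall.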

## Activity: Filter the Centigrade temperatures, keeping those warmer than or equal to 13, in PySpark

In [74]:
def tempMoreThanThirteen(temperature):
    return temperature >=13

In [75]:
filtered_temp = parCentigradeData.filter(lambda x: tempMoreThanThirteen(x))

In [76]:
filtered_temp.collect()

[15.0, 14.000000000000002, 13.0, 13.0]

In [77]:
def tempMoreThan30inCent(temp):
    cent = (temp - 32)*(5/9)
    return cent >= 30

In [78]:
filterTemp = parTempData.filter(lambda x: tempMoreThan30inCent(x))

In [80]:
filterTemp.collect()

[]

## Activity: Rewrite the computation below as a plain Python function (without map/reduce)

In [81]:
import math

nums = sc.parallelize(range(100000), numSlices=100)
doubled = nums.map(lambda n: n*2)
total = doubled.filter(lambda n: n%4==0).reduce(lambda a,b: a+b)
print(math.sqrt(total))

70709.97100833799

In [None]:
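def f(ls):
    # Same steps as the PySpark pipeline, written as a plain loop:
    # double each number, keep the multiples of 4, sum them, take the square root
    s = 0
    for i in ls:
        if (i*2)%4 == 0:
            s += (i*2)
    return math.sqrt(s)


print(f(range(100000)))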

70709.97100833799
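For comparison, the same pipeline can also be written with Python's built-in `map` and `filter` plus `functools.reduce`, which mirrors the PySpark version step by step. This is a sketch in plain Python; the variable name `multiples_of_four` is introduced here for readability.

```python
import math
from functools import reduce

nums = range(100000)
doubled = map(lambda n: n * 2, nums)                       # lazy iterator
multiples_of_four = filter(lambda n: n % 4 == 0, doubled)  # still lazy
total = reduce(lambda a, b: a + b, multiples_of_four)      # consumes the iterators
print(math.sqrt(total))
```

The structure is the same as in PySpark, except that here the function comes first and the data second, and everything runs in a single process.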


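## Difference Between map and flatMap in PySpark

- `map` applies the function to each element and returns exactly one output element per input element, so a function that returns a list produces a list of lists. `flatMap` applies the same function and then flattens the results, so the output is a single flat list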

In [82]:
values = sc.parallelize([1, 2, 3, 4], 2)
print(values.map(lambda x: [i for i in range(x)]).collect())
# [[0], [0, 1], [0, 1, 2], [0, 1, 2, 3]]
print(values.flatMap(lambda x: [i for i in range(x)]).collect())
# [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]

[[0], [0, 1], [0, 1, 2], [0, 1, 2, 3]]
[0, 0, 1, 0, 1, 2, 0, 1, 2, 3]
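The same contrast can be reproduced in plain Python, which may help when reasoning about what `flatMap` does. In this sketch a list comprehension stands in for `map`, and `itertools.chain.from_iterable` stands in for the flattening step; the variable names are illustrative.

```python
from itertools import chain

values = [1, 2, 3, 4]

# Like map: one output (here, a list) per input element.
mapped = [list(range(x)) for x in values]

# Like flatMap: the per-element lists are concatenated into one flat list.
flattened = list(chain.from_iterable(mapped))

print(mapped)
print(flattened)
```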


## String (text file) manipulation in PySpark

- Reminder: to get the number of characters in a string (including spaces), we can use `len`

In [87]:
s1 = 'this is a book'
len(s1)

14

In [88]:
s2 = 'this book is about DS'
len(s2)

21

## We can open a text file directly as an RDD
- Let's then obtain the number of characters in the whole text (here, we pretend the text file is big data)

In [89]:
lines = sc.textFile("for_pyspark.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)

In [90]:
lineLengths.collect()

[14, 21]

In [91]:
totalLength

35


The first line defines a base RDD from an external file. This dataset is not loaded in memory or otherwise acted on: `lines` is merely a pointer to the file. The second line defines `lineLengths` as the result of a `map` transformation. Again, `lineLengths` is not immediately computed, due to laziness. Finally, we run `reduce`, which is an action. At this point Spark breaks the computation into tasks to run on separate machines, and each machine runs both its part of the map and a local reduction, returning only its answer to the driver program.
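Python's own `map` is lazy too, which gives a rough single-machine analogy for the transformation/action split. In the sketch below, the helper `counted_len` and the list `calls` are hypothetical names used only to record when the function actually runs, and the two strings are assumed to be the lines of `for_pyspark.txt`.

```python
from functools import reduce

calls = []  # records every string the mapping function is actually applied to


def counted_len(s):
    calls.append(s)
    return len(s)


# Assumed contents of for_pyspark.txt, one string per line.
lines = ["this is a book", "this book is about DS"]

line_lengths = map(counted_len, lines)  # lazy: nothing has been computed yet
calls_before = len(calls)

total_length = reduce(lambda a, b: a + b, line_lengths)  # forces the computation
calls_after = len(calls)

print(calls_before, calls_after, total_length)
```

The analogy is only partial: Spark additionally distributes the work across partitions and machines, which plain Python does not.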

## Activity: Add all the second elements in a list of tuples

In [None]:
sample_rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 4), ("c", 7)])
sample_rdd.map(lambda x: x[1]).reduce(lambda x, y: x + y)

13

## Activity: Add all the second elements in a list of tuples if they have the same first element

In [None]:
sample_rdd.reduceByKey(lambda x, y: x + y).collect()

[('a', 5), ('b', 1), ('c', 7)]
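As a mental model, `reduceByKey` groups the pairs by their first element and folds the values of each group with the given function. A plain-Python sketch of that idea, using a dictionary and illustrative names such as `totals` and `add`, might look like this:

```python
pairs = [("a", 1), ("b", 1), ("a", 4), ("c", 7)]


def add(x, y):
    return x + y


totals = {}
for key, value in pairs:
    # First value for a key is stored as is; later values are folded in with add.
    totals[key] = add(totals[key], value) if key in totals else value

print(sorted(totals.items()))
```

In Spark the folding happens per partition first and the partial results are then merged, which is why the function passed to `reduceByKey` should be associative and commutative.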

## Activity: Obtain the histogram of the words we have in `for_pyspark.txt`

In [None]:
lines = sc.textFile("for_pyspark.txt")
words = lines.flatMap(lambda x: x.split(" "))
result = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)

In [None]:
words.collect()

['this', 'is', 'a', 'book', 'this', 'book', 'is', 'about', 'DS']

In [None]:
result.collect()

[('this', 2), ('is', 2), ('a', 1), ('book', 2), ('about', 1), ('DS', 1)]
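A quick way to cross-check the histogram on a small input is `collections.Counter` in plain Python. The sketch assumes the two lines below are the contents of `for_pyspark.txt`, as suggested by the `words.collect()` output above.

```python
from collections import Counter

# Assumed contents of for_pyspark.txt, one string per line.
lines = ["this is a book", "this book is about DS"]

# Equivalent of flatMap: split every line and flatten into one word list.
words = [word for line in lines for word in line.split(" ")]

# Equivalent of map to (word, 1) followed by reduceByKey with addition.
histogram = Counter(words)
print(histogram)
```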

## Explore the difference between map and flatMap on the text example

In [None]:
lines = sc.textFile("for_pyspark.txt")
words = lines.map(lambda x: x.split(" "))
words.collect()

[['this', 'is', 'a', 'book'], ['this', 'book', 'is', 'about', 'DS']]

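## Write down the syntax difference between Python map-reduce and PySpark map-reduce

In [None]:
# in Python: map is a built-in that takes the function first, then the iterable
list(map(lambda x: x * 2, [1, 2, 3]))
# in PySpark: map is a method of the RDD and takes only the function
sc.parallelize([1, 2, 3]).map(lambda x: x * 2).collect()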

In [None]:
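# in Python: reduce lives in functools and takes the function first, then the iterable
from functools import reduce
reduce(lambda x, y: x + y, [1, 2, 3])
# in PySpark: reduce is a method of the RDD and takes only the function
sc.parallelize([1, 2, 3]).reduce(lambda x, y: x + y)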

## PySpark Cheat Sheet

- Open `PySpark_Cheat_Sheet_Python.pdf`

## Different ways to create PySpark DataFrames

In [None]:
## Prerequisite
s_rdd = sc.parallelize([('a',7),('a',2),('b',2)])
# s_rdd itself cannot be indexed with [0], [1], ..., but each of its elements is a tuple with indices 0 and 1
print(s_rdd.map(lambda x:x[0]).collect())
# print(s_rdd.map(lambda x:x[0]).distinct().collect())

['a', 'a', 'b']


### Create a PySpark DataFrame from Row

- Row is a basic data structure in PySpark. Think of it as one row of a database table

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession.builder.appName('SparkRowExamples.com').getOrCreate()

In [None]:
data_sample = [Row(name="James,,Smith",lang=["Java","Scala","C++"],state="CA"), 
        Row(name="Michael,Rose,",lang=["Spark","Java","C++"],state="NJ"),
        Row(name="Robert,,Williams",lang=["CSharp","VB"],state="NV")]

#Create RDD from data_sample
sample_rdd = spark.sparkContext.parallelize(data_sample)

In [None]:
sample_rdd.take(1)

[Row(name='James,,Smith', lang=['Java', 'Scala', 'C++'], state='CA')]

In [None]:
sample_rdd.collect()

[Row(name='James,,Smith', lang=['Java', 'Scala', 'C++'], state='CA'),
 Row(name='Michael,Rose,', lang=['Spark', 'Java', 'C++'], state='NJ'),
 Row(name='Robert,,Williams', lang=['CSharp', 'VB'], state='NV')]

In [None]:
sample_rdd.map(lambda x:x[0]).take(3)

['James,,Smith', 'Michael,Rose,', 'Robert,,Williams']

In [None]:
type(sample_rdd.map(lambda x:x[0]))

pyspark.rdd.PipelinedRDD

In [None]:
sample_rdd.map(lambda x:x[1]).take(3)

[['Java', 'Scala', 'C++'], ['Spark', 'Java', 'C++'], ['CSharp', 'VB']]

In [None]:
sample_rdd.map(lambda x:x[2]).take(3)

['CA', 'NJ', 'NV']
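The index-based access above works because a `Row` behaves like a tuple with named fields. As a rough plain-Python analogy, `collections.namedtuple` offers the same two ways of reading a field; the class name `Person` below is a hypothetical stand-in and not part of PySpark.

```python
from collections import namedtuple

# Hypothetical stand-in for Row, with the same three fields as data_sample.
Person = namedtuple("Person", ["name", "lang", "state"])

people = [
    Person(name="James,,Smith", lang=["Java", "Scala", "C++"], state="CA"),
    Person(name="Michael,Rose,", lang=["Spark", "Java", "C++"], state="NJ"),
    Person(name="Robert,,Williams", lang=["CSharp", "VB"], state="NV"),
]

# Access by position and by field name give the same values.
states_by_index = [p[2] for p in people]
states_by_name = [p.state for p in people]
print(states_by_index, states_by_name)
```

Reading fields by name is usually easier to follow than positional indices once a record has more than two or three fields.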

In [None]:
df = spark.createDataFrame(data_sample)
df.show()

+----------------+------------------+-----+
|            name|              lang|state|
+----------------+------------------+-----+
|    James,,Smith|[Java, Scala, C++]|   CA|
|   Michael,Rose,|[Spark, Java, C++]|   NJ|
|Robert,,Williams|      [CSharp, VB]|   NV|
+----------------+------------------+-----+



In [None]:
df[['name']].rdd.collect()

[Row(name='James,,Smith'),
 Row(name='Michael,Rose,'),
 Row(name='Robert,,Williams')]

In [None]:
df[['name']].rdd.take(1)

[Row(name='James,,Smith')]

### Another way to create a PySpark DataFrame: from an RDD

In [None]:
dept = [("Finance",10), 
        ("Marketing",20), 
        ("Sales",30), 
        ("IT",40) 
      ]
sample_rdd2 = spark.sparkContext.parallelize(dept)

deptColumns = ["dept_name","dept_id"]

df2 = sample_rdd2.toDF(deptColumns)
df2.show()

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|  Finance|     10|
|Marketing|     20|
|    Sales|     30|
|       IT|     40|
+---------+-------+



In [None]:
df2[['dept_name']].rdd.collect()

[Row(dept_name='Finance'),
 Row(dept_name='Marketing'),
 Row(dept_name='Sales'),
 Row(dept_name='IT')]

In [None]:
df2[['dept_name']].rdd.map(lambda x: x[0]).collect()

['Finance', 'Marketing', 'Sales', 'IT']

In [None]:
df2[['dept_name']].rdd.map(lambda x: x[0]).filter(lambda x:x[0]=='F').collect()

['Finance']