# **Running Pyspark in Colab**

To run Spark in Colab, we first need to install its dependencies in the Colab environment: Apache Spark 3.0.0 with Hadoop 3.2, Java 11, and findspark, which locates the Spark installation on the system. The installation can be carried out inside the Colab notebook itself.
Follow these steps to install the dependencies:

In [0]:
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
!tar xf spark-3.0.0-bin-hadoop3.2.tgz 
!pip install -q findspark

Now that you have installed Spark and Java in Colab, set the environment variables that let PySpark find them. Set the locations of Java and Spark by running the following code:

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"
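
A wrong path in either variable may only surface later, as a less obvious error when the session starts. As a minimal sketch (the helper name `missing_dirs` is hypothetical and not part of Spark or findspark), the two locations can be checked up front:

```python
import os

def missing_dirs(env, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variable names whose value is unset or not an existing directory."""
    return [n for n in names if not os.path.isdir(env.get(n, ""))]

# In the notebook, an empty list means both locations exist.
print(missing_dirs(os.environ))
```

If the list is not empty, the install cell above probably did not finish, or the version in the path does not match the downloaded archive.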

Run a local Spark session to test your installation:

In [0]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark import SparkContext
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = SparkContext.getOrCreate()

In [0]:
nums = sc.parallelize([1,2,3,4])
squared = nums.map(lambda x: x * x).collect()
for num in squared:
  print(num)

1
4
9
16


# **First App**

First Spark program

Upload a file to Colab. The cells below read both movietweets.csv and auto-data.csv, so upload both.

In [0]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving auto-data.csv to auto-data.csv
User uploaded file "auto-data.csv" with length 11549 bytes


Create an RDD by loading from a file

In [0]:
tweetsRDD = sc.textFile("movietweets.csv")

Show top 5 records

In [0]:
tweetsRDD.take(5)

['positive,The Da Vinci Code book is just awesome.',
 'positive,i liked the Da Vinci Code a lot.',
 'positive,i liked the Da Vinci Code a lot.',
 "positive,I liked the Da Vinci Code but it ultimatly didn't seem to hold it's own.",
 "positive,that's not even an exaggeration ) and at midnight we went to Wal-Mart to buy the Da Vinci Code"]

Transform data - change to upper case

In [0]:
ucRDD = tweetsRDD.map( lambda x : x.upper() )
ucRDD.take(5)

['POSITIVE,THE DA VINCI CODE BOOK IS JUST AWESOME.',
 'POSITIVE,I LIKED THE DA VINCI CODE A LOT.',
 'POSITIVE,I LIKED THE DA VINCI CODE A LOT.',
 "POSITIVE,I LIKED THE DA VINCI CODE BUT IT ULTIMATLY DIDN'T SEEM TO HOLD IT'S OWN.",
 "POSITIVE,THAT'S NOT EVEN AN EXAGGERATION ) AND AT MIDNIGHT WE WENT TO WAL-MART TO BUY THE DA VINCI CODE"]

Action - Count the number of tweets

In [0]:
tweetsRDD.count()

100
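
Each record above is a `label,text` pair, and the tweet text may itself contain commas. A small sketch of a parser that splits on the first comma only (the name `parse_tweet` is hypothetical, and it assumes every line contains at least one comma):

```python
def parse_tweet(line):
    """Split a 'label,text' record on the first comma only."""
    label, text = line.split(",", 1)
    return label, text

sample = "positive,The Da Vinci Code book is just awesome."
print(parse_tweet(sample))
```

Because it is an ordinary Python function, it could be passed to the RDD as `tweetsRDD.map(parse_tweet)`, and `countByKey()` on the result would then tally the tweets per label.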

# **Spark Operations**

Load from a collection

In [0]:
collData = sc.parallelize([4,3,8,5,8])
collData.collect()

[4, 3, 8, 5, 8]

Load the file. Loading is lazy: nothing is read until an action runs.

In [0]:
autoData = sc.textFile("auto-data.csv")
autoData.cache()
# The file is actually read only now, when count() triggers evaluation.
print(autoData.count())
print(autoData.first())
print(autoData.take(5))
new_rdd = sc.parallelize(autoData.take(5))

198
MAKE,FUELTYPE,ASPIRE,DOORS,BODY,DRIVE,CYLINDERS,HP,RPM,MPG-CITY,MPG-HWY,PRICE
['MAKE,FUELTYPE,ASPIRE,DOORS,BODY,DRIVE,CYLINDERS,HP,RPM,MPG-CITY,MPG-HWY,PRICE', 'subaru,gas,std,two,hatchback,fwd,four,69,4900,31,36,5118', 'chevrolet,gas,std,two,hatchback,fwd,three,48,5100,47,53,5151', 'mazda,gas,std,two,hatchback,fwd,four,68,5000,30,31,5195', 'toyota,gas,std,two,hatchback,fwd,four,62,4800,35,39,5348']
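
The lazy behaviour can be illustrated without Spark. The sketch below is only an analogy that uses Python's built-in `map`, which is also lazy; the `calls` list and the `load` function are made up for the demonstration:

```python
calls = []

def load(line):
    """Record every invocation so we can see when work actually happens."""
    calls.append(line)
    return line.upper()

lines = ["subaru,gas", "mazda,gas"]

lazy = map(load, lines)      # builds a recipe; nothing has run yet
print(len(calls))            # 0
result = list(lazy)          # consuming the iterator triggers the work
print(len(calls), result)    # 2 ['SUBARU,GAS', 'MAZDA,GAS']
```

In the Spark cell above, `sc.textFile` and `cache()` play the role of building the recipe, and `count()` is the first step that forces the work.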


In [0]:
for line in new_rdd.collect():
    print(line)

MAKE,FUELTYPE,ASPIRE,DOORS,BODY,DRIVE,CYLINDERS,HP,RPM,MPG-CITY,MPG-HWY,PRICE
subaru,gas,std,two,hatchback,fwd,four,69,4900,31,36,5118
chevrolet,gas,std,two,hatchback,fwd,three,48,5100,47,53,5151
mazda,gas,std,two,hatchback,fwd,four,68,5000,30,31,5195
toyota,gas,std,two,hatchback,fwd,four,62,4800,35,39,5348


Save to a local file. First collect the RDD to the driver, then write it out as a local file.

In [0]:
with open("auto-data-saved.csv", "w") as autoDataFile:
    autoDataFile.write("\n".join(new_rdd.collect()))

In [0]:
!cat auto-data-saved.csv

MAKE,FUELTYPE,ASPIRE,DOORS,BODY,DRIVE,CYLINDERS,HP,RPM,MPG-CITY,MPG-HWY,PRICE
subaru,gas,std,two,hatchback,fwd,four,69,4900,31,36,5118
chevrolet,gas,std,two,hatchback,fwd,three,48,5100,47,53,5151
mazda,gas,std,two,hatchback,fwd,four,68,5000,30,31,5195
toyota,gas,std,two,hatchback,fwd,four,62,4800,35,39,5348

# **Transformations**

Map and create a new RDD

In [0]:
tsvData=autoData.map(lambda x : x.replace(",","\t"))
tsvData.take(5)

['MAKE\tFUELTYPE\tASPIRE\tDOORS\tBODY\tDRIVE\tCYLINDERS\tHP\tRPM\tMPG-CITY\tMPG-HWY\tPRICE',
 'subaru\tgas\tstd\ttwo\thatchback\tfwd\tfour\t69\t4900\t31\t36\t5118',
 'chevrolet\tgas\tstd\ttwo\thatchback\tfwd\tthree\t48\t5100\t47\t53\t5151',
 'mazda\tgas\tstd\ttwo\thatchback\tfwd\tfour\t68\t5000\t30\t31\t5195',
 'toyota\tgas\tstd\ttwo\thatchback\tfwd\tfour\t62\t4800\t35\t39\t5348']

Filter and create a new RDD

In [0]:
toyotaData=autoData.filter(lambda x: "toyota" in x)
toyotaData.count()

32
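
The filter above matches "toyota" anywhere in the line. If only the MAKE column should be compared, a stricter predicate can be sketched as follows (the name `is_make` is hypothetical, and it assumes MAKE is the first field, as in the header shown earlier):

```python
def is_make(line, make):
    """True when the first CSV field (MAKE) equals `make` exactly."""
    return line.split(",", 1)[0] == make

print(is_make("toyota,gas,std,two,hatchback,fwd,four,62,4800,35,39,5348", "toyota"))
print(is_make("mazda,gas,std,two,hatchback,fwd,four,68,5000,30,31,5195", "toyota"))
```

With the RDD above this would be used as `autoData.filter(lambda x: is_make(x, "toyota"))`.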

FlatMap

In [0]:
words=toyotaData.flatMap(lambda line: line.split(","))
words.count()
words.take(20)

['toyota',
 'gas',
 'std',
 'two',
 'hatchback',
 'fwd',
 'four',
 '62',
 '4800',
 '35',
 '39',
 '5348',
 'toyota',
 'gas',
 'std',
 'two',
 'hatchback',
 'fwd',
 'four',
 '62']
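
The difference between `map` and `flatMap` is easiest to see on a tiny input. The following plain-Python sketch (no Spark involved, and the two sample lines are shortened for illustration) mimics both:

```python
from itertools import chain

lines = ["toyota,gas,std", "toyota,diesel,std"]

# Like map: one list of fields per input line
mapped = [line.split(",") for line in lines]
# Like flatMap: the per-line lists are flattened into a single sequence
flat = list(chain.from_iterable(mapped))

print(mapped)
print(flat)
```

That is why `words` above is a flat sequence of fields rather than one list per Toyota record.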

Distinct

In [0]:
for numbData in collData.distinct().collect():
    print(numbData)

4
8
3
5


Set operations

In [0]:
words1 = sc.parallelize(["hello","war","peace","world"])
words2 = sc.parallelize(["war","peace","universe"])

for unions in words1.union(words2).distinct().collect():
    print(unions)
print("------------------")
for intersects in words1.intersection(words2).collect():
    print(intersects)

hello
peace
world
universe
war
------------------
peace
war


Using functions for transformation: cleanse and transform an RDD

In [0]:
def cleanseRDD(autoStr):
    # Pass through values that are already numeric
    if isinstance(autoStr, int):
        return autoStr
    attList = autoStr.split(",")
    # Convert the DOORS field from a word to a digit string
    if attList[3] == "two":
        attList[3] = "2"
    elif attList[3] == "four":
        attList[3] = "4"
    # Convert the DRIVE field to uppercase
    attList[5] = attList[5].upper()
    return ",".join(attList)
    
cleanedData=autoData.map(cleanseRDD)

for line in autoData.take(5):
  print(line)
print("------------------")
for line in cleanedData.take(5):
  print(line)

MAKE,FUELTYPE,ASPIRE,DOORS,BODY,DRIVE,CYLINDERS,HP,RPM,MPG-CITY,MPG-HWY,PRICE
subaru,gas,std,two,hatchback,fwd,four,69,4900,31,36,5118
chevrolet,gas,std,two,hatchback,fwd,three,48,5100,47,53,5151
mazda,gas,std,two,hatchback,fwd,four,68,5000,30,31,5195
toyota,gas,std,two,hatchback,fwd,four,62,4800,35,39,5348
------------------
MAKE,FUELTYPE,ASPIRE,DOORS,BODY,DRIVE,CYLINDERS,HP,RPM,MPG-CITY,MPG-HWY,PRICE
subaru,gas,std,2,hatchback,FWD,four,69,4900,31,36,5118
chevrolet,gas,std,2,hatchback,FWD,three,48,5100,47,53,5151
mazda,gas,std,2,hatchback,FWD,four,68,5000,30,31,5195
toyota,gas,std,2,hatchback,FWD,four,62,4800,35,39,5348
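
A function passed to `map` is ordinary Python, so it can be tried on a single line before it is run over the whole RDD. As a sketch, here is a variant of the cleansing step that uses a lookup table for DOORS (the names `DOORS` and `cleanse_line` are hypothetical):

```python
DOORS = {"two": "2", "four": "4"}  # word -> digit string

def cleanse_line(auto_str):
    """Variant of the cleansing step using a lookup table for DOORS."""
    fields = auto_str.split(",")
    fields[3] = DOORS.get(fields[3], fields[3])  # leave unknown values untouched
    fields[5] = fields[5].upper()                # DRIVE to uppercase
    return ",".join(fields)

print(cleanse_line("subaru,gas,std,two,hatchback,fwd,four,69,4900,31,36,5118"))
```

The header row passes through unchanged, since "DOORS" is not in the lookup table and "DRIVE" is already uppercase.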


# **Actions**

Reduce - compute the sum

In [0]:
collData.reduce(lambda x,y: x+y)

28

Find the shortest line

In [0]:
autoData.reduce(lambda x,y: x if len(x) < len(y) else y)

'bmw,gas,std,two,sedan,rwd,six,182,5400,16,22,41315'

Use a function to perform reduce

In [0]:
def getMPG( autoStr) :
    if isinstance(autoStr, int) :
        return autoStr
    attList=autoStr.split(",")
    if attList[9].isdigit() :
        return int(attList[9])
    else:
        return 0

Find the average MPG-City for all cars

In [0]:
autoData.reduce(lambda x,y : getMPG(x) + getMPG(y)) \
    / (autoData.count()-1.0)  # account for header line

25.15228426395939
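
The `isinstance(autoStr, int)` check in `getMPG` matters because `reduce` feeds its own intermediate result back in: after the first step the left argument is already an integer total, not a line. The sketch below replays this with `functools.reduce` on the first five lines of the file, with no Spark involved:

```python
from functools import reduce

def getMPG(autoStr):
    if isinstance(autoStr, int):      # already a running total
        return autoStr
    attList = autoStr.split(",")
    return int(attList[9]) if attList[9].isdigit() else 0  # header yields 0

lines = [
    "MAKE,FUELTYPE,ASPIRE,DOORS,BODY,DRIVE,CYLINDERS,HP,RPM,MPG-CITY,MPG-HWY,PRICE",
    "subaru,gas,std,two,hatchback,fwd,four,69,4900,31,36,5118",
    "chevrolet,gas,std,two,hatchback,fwd,three,48,5100,47,53,5151",
    "mazda,gas,std,two,hatchback,fwd,four,68,5000,30,31,5195",
    "toyota,gas,std,two,hatchback,fwd,four,62,4800,35,39,5348",
]

total = reduce(lambda x, y: getMPG(x) + getMPG(y), lines)
alt_total = sum(getMPG(line) for line in lines)   # convert first, then add
print(total, alt_total, total / (len(lines) - 1.0))  # 143 143 35.75
```

The second form, converting every line first and then summing, avoids the mixed string/integer arguments. Its Spark counterpart would be `autoData.map(getMPG).sum()`.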

# **Working with Key/Value RDDs**

Create a KV RDD of auto Brand and Horsepower

In [0]:
cylData = autoData.map( lambda x: ( x.split(",")[0], \
    x.split(",")[7]))

for line in cylData.take(5):
  print(line)
print("-------------------")
for line in cylData.keys().take(10):
  print(line)

('MAKE', 'HP')
('subaru', '69')
('chevrolet', '48')
('mazda', '68')
('toyota', '62')
-------------------
MAKE
subaru
chevrolet
mazda
toyota
mitsubishi
honda
nissan
dodge
plymouth


Remove header row

In [0]:
header = cylData.first()
cylHPData = cylData.filter(lambda line: line != header)
cylHPData.take(5)

[('subaru', '69'),
 ('chevrolet', '48'),
 ('mazda', '68'),
 ('toyota', '62'),
 ('mitsubishi', '68')]

Find average HP by Brand

Add a count of 1 to each record, then reduce by key to find the HP total and the record count for each brand

In [0]:
addOne = cylHPData.mapValues(lambda x: (x,1))

for line in addOne.take(5):
  print(line)
  
print("-----------------------")
brandValues= addOne \
    .reduceByKey(lambda x, y: (int(x[0]) + int(y[0]), \
    x[1] + y[1])) 

for line in brandValues.collect():
  print(line)

('subaru', ('69', 1))
('chevrolet', ('48', 1))
('mazda', ('68', 1))
('toyota', ('62', 1))
('mitsubishi', ('68', 1))
-----------------------
('chevrolet', (188, 3))
('mazda', (1390, 16))
('mitsubishi', (1353, 13))
('nissan', (1846, 18))
('dodge', (675, 8))
('plymouth', (607, 7))
('saab', (760, 6))
('volvo', (1408, 11))
('alfa-romero', (376, 3))
('mercedes-benz', (1170, 8))
('jaguar', (614, 3))
('subaru', (1035, 12))
('toyota', (2969, 32))
('honda', (1043, 13))
('isuzu', (168, 2))
('volkswagen', (973, 12))
('peugot', (1098, 11))
('audi', (687, 6))
('bmw', (1111, 8))
('mercury', ('175', 1))
('porsche', (764, 4))
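
In the output above, `mercury` keeps the string `'175'` while every other brand has an integer total. The reduce function is only called when a key has at least two values to combine, so a brand with a single record never passes through `int()`. The plain-Python model below illustrates this; `reduce_by_key` is a hypothetical stand-in for Spark's `reduceByKey`, and the two subaru values are illustrative rather than taken from the file:

```python
def reduce_by_key(pairs, func):
    """Plain-Python model: func only runs when a key already has a value."""
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out

def add(x, y):
    return (int(x[0]) + int(y[0]), x[1] + y[1])

records = [("subaru", "69"), ("subaru", "70"), ("mercury", "175")]

# Convert inside the reduce function (as above): single-record keys stay strings
late = reduce_by_key([(k, (v, 1)) for k, v in records], add)
# Convert before reducing: every value is an int from the start
early = reduce_by_key([(k, (int(v), 1)) for k, v in records], add)

print(late)   # {'subaru': (139, 2), 'mercury': ('175', 1)}
print(early)  # {'subaru': (139, 2), 'mercury': (175, 1)}
```

In the notebook, the equivalent change would be `cylHPData.mapValues(lambda x: (int(x), 1))`, after which the reduce function no longer needs its `int()` calls.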


Find average by dividing HP total by count total

In [0]:
brandValues.mapValues(lambda x: int(x[0])/int(x[1])). \
    collect()

[('chevrolet', 62.666666666666664),
 ('mazda', 86.875),
 ('mitsubishi', 104.07692307692308),
 ('nissan', 102.55555555555556),
 ('dodge', 84.375),
 ('plymouth', 86.71428571428571),
 ('saab', 126.66666666666667),
 ('volvo', 128.0),
 ('alfa-romero', 125.33333333333333),
 ('mercedes-benz', 146.25),
 ('jaguar', 204.66666666666666),
 ('subaru', 86.25),
 ('toyota', 92.78125),
 ('honda', 80.23076923076923),
 ('isuzu', 84.0),
 ('volkswagen', 81.08333333333333),
 ('peugot', 99.81818181818181),
 ('audi', 114.5),
 ('bmw', 138.875),
 ('mercury', 175.0),
 ('porsche', 191.0)]

# **Advanced Spark : Accumulators & Broadcast Variables**

The function below splits each line and, in the same pass, counts sedans and hatchbacks.

Counting inside the split avoids a second pass over the data (a speed optimization).

Initialize accumulator

In [0]:
sedanCount = sc.accumulator(0)
hatchbackCount = sc.accumulator(0)

Set Broadcast variable

In [0]:
sedanText=sc.broadcast("sedan")
hatchbackText=sc.broadcast("hatchback")

In [0]:
def splitLines(line) :

    global sedanCount
    global hatchbackCount

    #Use broadcast variable to do comparison and set accumulator
    if sedanText.value in line:
        sedanCount +=1
    if hatchbackText.value in line:
        hatchbackCount +=1
        
    return line.split(",")

Do the map

In [0]:
splitData=autoData.map(splitLines)

Run an action to make the map execute (evaluation is lazy)

In [0]:
splitData.count()
print(sedanCount, hatchbackCount)

92 67
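
One caveat: Spark only guarantees exactly-once accumulator updates inside actions. Updates made in a transformation such as `map` can be applied again whenever that transformation is recomputed, and `splitData` is not cached. The plain-Python model below (no Spark; the `counts` dictionary stands in for the two accumulators and `run_action` is a hypothetical helper) shows the effect of running an action twice:

```python
counts = {"sedan": 0, "hatchback": 0}   # stands in for the two accumulators

def split_line(line):
    for body in counts:
        if body in line:
            counts[body] += 1
    return line.split(",")

lines = [
    "subaru,gas,std,two,hatchback,fwd,four,69,4900,31,36,5118",
    "chevrolet,gas,std,two,hatchback,fwd,three,48,5100,47,53,5151",
    "mazda,gas,std,two,hatchback,fwd,four,68,5000,30,31,5195",
    "toyota,gas,std,two,hatchback,fwd,four,62,4800,35,39,5348",
    "bmw,gas,std,two,sedan,rwd,six,182,5400,16,22,41315",
]

def run_action():
    """Model of an action on an uncached RDD: the map is recomputed each time."""
    return len([split_line(line) for line in lines])

run_action()
print(counts)   # {'sedan': 1, 'hatchback': 4}
run_action()
print(counts)   # {'sedan': 2, 'hatchback': 8}
```

In the notebook, re-running the `splitData.count()` cell would likewise be expected to increase both counters again, so read the accumulators after a single action or reset them first.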


# **Advanced Spark : Partitions**

In [0]:
collData.getNumPartitions()

2

Specify the number of partitions.

In [0]:
collData=sc.parallelize([3,5,4,7,4],4)
collData.cache()
collData.count()


5

In [0]:
collData.getNumPartitions()

4

# **Practice**

1. Your course resources include a CSV file, "iris.csv".
   Load that file into an RDD called irisRDD.
   Cache the RDD and count the number of lines.

2. Create a new RDD from irisRDD with the following changes:
     - The name of the flower should be in all capitals
     - The numeric values should be rounded to integers

3. Filter irisRDD for lines that contain "versicolor" and count them.

4. Find the average Sepal.Length for all flowers in irisRDD.

5. Convert irisRDD into a key-value RDD with Species as the key and
   Sepal.Length as the value.
   Then find the maximum Sepal.Length for each Species.

6. Find the number of records in irisRDD whose Sepal.Length is
   greater than the average Sepal.Length found in exercise 4.
   Note: Use broadcast and accumulator variables for this exercise.