<a href="https://colab.research.google.com/github/tijazz/Big-Data/blob/main/SparkNotebooks/helloworld_spark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Setting up PySpark in Colab
Spark is written in Scala and runs on the Java Virtual Machine (JVM). Therefore, our first task is to install Java.



In [1]:
!apt-get install openjdk-8-jdk-headless

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libnvidia-common-460 nsight-compute-2020.2.0
Use 'apt autoremove' to remove them.
The following additional packages will be installed:
  openjdk-8-jre-headless
Suggested packages:
  openjdk-8-demo openjdk-8-source libnss-mdns fonts-dejavu-extra
  fonts-ipafont-gothic fonts-ipafont-mincho fonts-wqy-microhei
  fonts-wqy-zenhei fonts-indic
The following NEW packages will be installed:
  openjdk-8-jdk-headless openjdk-8-jre-headless
0 upgraded, 2 newly installed, 0 to remove and 42 not upgraded.
Need to get 36.5 MB of archives.
After this operation, 143 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 openjdk-8-jre-headless amd64 8u312-b07-0ubuntu1~18.04 [28.2 MB]
Get:2 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 openjdk-8-jdk-headless

Next, we will download Apache Spark 3.2.1 prebuilt for Hadoop 2.7.


In [2]:
#!wget https://apache.mirrors.nublue.co.uk/spark/spark-3.0.2/spark-3.0.2-bin-hadoop2.7.tgz
!wget https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop2.7.tgz

--2022-05-16 22:37:45--  https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop2.7.tgz
Resolving archive.apache.org (archive.apache.org)... 138.201.131.134, 2a01:4f8:172:2ec5::2
Connecting to archive.apache.org (archive.apache.org)|138.201.131.134|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 272637746 (260M) [application/x-gzip]
Saving to: ‘spark-3.2.1-bin-hadoop2.7.tgz’


2022-05-16 22:37:54 (28.7 MB/s) - ‘spark-3.2.1-bin-hadoop2.7.tgz’ saved [272637746/272637746]



Now, we just need to extract that archive.


In [3]:
!tar xf /content/spark-3.2.1-bin-hadoop2.7.tgz


There is one last thing that we need to install, and that is the findspark library. It locates Spark on the system and makes PySpark importable as a regular library.



In [4]:
!pip install -q findspark


Now that we have installed all the necessary dependencies in Colab, it is time to set the environment variables that tell PySpark where Java and Spark live. This will enable us to run PySpark in the Colab environment.


In [5]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop2.7"
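
A wrong path in either variable usually only shows up later, when the session fails to start. As a minimal sketch, the check below reports which of the two variables is unset or does not point at an existing directory. The helper name `missing_dirs` is hypothetical and not part of Spark or findspark.

```python
import os

def missing_dirs(env, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variable names that are unset or do not point at a directory."""
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

problems = missing_dirs(os.environ)
if problems:
    print("Check these variables:", ", ".join(problems))
else:
    print("JAVA_HOME and SPARK_HOME both point at existing directories.")
```

If a name is reported, compare its value with the directory created by the tar step and with the contents of /usr/lib/jvm.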


We need to locate Spark in the system. For that, we import findspark and call findspark.init(). The findspark.find() call then returns the Spark location that was found.

In [6]:
import findspark
findspark.init()
findspark.find()

'/content/spark-3.2.1-bin-hadoop2.7'

Now, we can import SparkSession from pyspark.sql and create a SparkSession, which is the entry point to Spark.

You can give a name to the session using appName() and add some configurations with config() if you wish.

In [8]:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

Finally, display the SparkSession variable to confirm that the session was created.

In [9]:
spark


#Loading data into PySpark
The SparkContext, sc, is the entry point to Spark's RDD API in Python. The textFile() method reads the file into a Resilient Distributed Dataset (RDD), with each line in the file becoming an element of the RDD. The path /content/shakespeare.txt specifies the location of the file on the Colab filesystem.



In [11]:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
# Download the file from the link below, then upload it to Colab and save it as "shakespeare.txt":
# https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
lines = sc.textFile("/content/shakespeare.txt")

We can verify the file was successfully loaded by calling the count() method, which returns the number of elements in the RDD:

In [12]:
lines.count()

124456
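
The count is the number of lines in the file, blank lines included, because every line becomes one element. The plain-Python sketch below mimics that behaviour on a small temporary file. It does not use Spark, and the sample text and file name are made up for illustration.

```python
import os
import tempfile

# Three lines, the second of which is blank
sample = "first line\n\nthird line\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w") as f:
        f.write(sample)
    with open(path) as f:
        # One element per line, with the line terminator removed
        elements = f.read().splitlines()

print(elements)
print(len(elements))  # the blank line is counted as an element too
```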

#Split each line into words.
Next, we will split each line into a list of words and store the result in an RDD called words. The flatMap() method iterates over every line in the RDD, and lambda line : line.split(" ") is executed on each line. A lambda is an anonymous function in Python, i.e., a function defined without a name. In this case, the anonymous function takes a single argument, line, and calls split(" "), which splits the line into a list of words. To create the words RDD, run:

In [13]:
words = lines.flatMap(lambda line : line.split(" "))
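
To see what the flattening does without starting Spark, here is a plain-Python sketch. The helper `flat_map` and the sample lines are hypothetical stand-ins for the RDD method and the Shakespeare text. It also shows a side effect of split(" "): blank lines and runs of spaces produce empty strings, which are then counted as words.

```python
from itertools import chain

sample_lines = ["to be  or", "", "not to be"]

def flat_map(func, items):
    """Apply func to each item and concatenate the resulting lists."""
    return list(chain.from_iterable(func(item) for item in items))

# map keeps one output per line, so the result is a list of lists
mapped = [line.split(" ") for line in sample_lines]

# flat_map flattens those lists into a single list of words
sample_words = flat_map(lambda line: line.split(" "), sample_lines)

# split() with no argument splits on any whitespace and skips empty strings
non_empty_words = flat_map(lambda line: line.split(), sample_lines)

print(mapped)
print(sample_words, len(sample_words))
print(non_empty_words, len(non_empty_words))
```

Whether empty strings should count as words depends on the analysis. The notebook keeps split(" "), so its word count includes them.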


In [20]:
#count of words
words.count()


1418390

#Assign initial count value to each word. 
Next, we will create a tuple for each word with an initial count of 1.
The map() method iterates over every word in the words RDD, and the lambda expression creates a tuple with the word and a value of 1.
Note that in the previous step we used flatMap, but here we use map. In this step, we want to create a tuple for every word, i.e., we have a one-to-one mapping between the input words and output tuples. In the previous step, we wanted to split each line into a list of words, i.e., there is a one-to-many mapping between input lines and output words. In general, use map when each input produces exactly one output, and flatMap when an input can produce many outputs (or none).


In [21]:
tuples = words.map(lambda word : (word,1))
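
The one-to-one and one-to-none cases can be sketched in plain Python. The sample list is made up, and the list comprehensions only imitate what map and flatMap would do on an RDD.

```python
from itertools import chain

sample_words = ["to", "", "be", "to"]

# map-style: exactly one tuple per input word, empty strings included
pairs = [(word, 1) for word in sample_words]

# flatMap-style: each input yields a list of zero or one tuples,
# so returning an empty list drops that input from the result
kept_pairs = list(
    chain.from_iterable([(word, 1)] if word else [] for word in sample_words)
)

print(len(sample_words), len(pairs))  # same length: one-to-one
print(kept_pairs)                     # the empty string is gone
```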

#Sum all word count values.
We can sum all the counts in the tuples for each word into a new RDD counts:

In [22]:
counts = tuples.reduceByKey(lambda a,b: (a + b))

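The reduceByKey() method groups the tuples by key (here, the word) and combines the values of each key by calling the lambda expression repeatedly. The lambda expression has two arguments, a and b, which are two count values for the same word: either the initial 1s or partial sums from earlier calls.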
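
The per-key reduction can be imitated with a dictionary. This is a single-machine sketch under the assumption that values are combined in list order, and `reduce_by_key` is a hypothetical helper, not the Spark method. Spark combines values within and across partitions in no guaranteed order, which is why the function passed to reduceByKey() should be associative and commutative, as addition is.

```python
def reduce_by_key(func, pairs):
    """Combine the values of each key with func, one pair at a time."""
    totals = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are folded in
        totals[key] = func(totals[key], value) if key in totals else value
    return totals

sample_tuples = [("to", 1), ("be", 1), ("to", 1), ("or", 1), ("to", 1)]
sample_counts = reduce_by_key(lambda a, b: a + b, sample_tuples)

print(sample_counts)
```

On the third "to", the lambda receives a = 2 (a partial sum) and b = 1, so a is not always the count from a single tuple.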

#Write word counts to text file.
We can write the counts RDD to a directory named outputDir in the current folder:

In [23]:
counts.coalesce(1).saveAsTextFile("outputDir")

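The coalesce(1) call combines all the RDD partitions into a single partition, since we want a single output file, and saveAsTextFile() writes the RDD to the specified location. That location is a directory, and the data itself ends up in a part file inside it, here outputDir/part-00000.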
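
Each line of the saved file is the text form of a Python tuple, such as ('hermits', 1). As a sketch, such lines can be parsed back with ast.literal_eval and sorted by count. The sample string below reuses a few lines from this notebook's output, and in practice the lines would be read from the part file instead.

```python
import ast

sample_output = """('hermits', 1)
('Follows', 2)
("Tully's", 1)
('niece;', 6)"""

# Turn each text line back into a (word, count) tuple
parsed = [ast.literal_eval(line) for line in sample_output.splitlines()]

# Sort by count, highest first, and keep the two most frequent words
top = sorted(parsed, key=lambda pair: pair[1], reverse=True)[:2]

print(top)
```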

#View result

In [24]:
!cat "/content/outputDir/part-00000"

Streaming output truncated to the last 5000 lines.
('hermits', 1)
('alphabet,', 1)
('laments;', 1)
('sapling;', 1)
('[MARCUS', 1)
("'But!'", 1)
('reprehending', 1)
('insult', 1)
('substances.', 1)
('dazzle.', 1)
('Follows', 2)
('signs?', 1)
('Somewhither', 1)
('Cornelia', 2)
("Tully's", 1)
('Orator.', 1)
('Extremity', 1)
('Although,', 1)
('Causeless,', 1)
('aunt;', 1)
('Lavinia!', 2)
('these?-', 1)
('boy.-', 1)
("skill'd;", 1)
('library,', 1)
('lifts', 2)
('fact;', 1)
('tosseth', 1)
('Grandsire,', 1)
("Ovid's", 1)
('Metamorphoses;', 1)
('leaves!', 1)
('find?', 1)
("Tereus'", 1)
('rape;', 1)
('rape,', 2)
('annoy.', 1)
('quotes', 1)
('leaves.', 1)
("Ravish'd", 1)
('Philomela', 1)
('woods?', 1)
('hunt-', 1)
('there!-', 1)
("Pattern'd", 1)
('describes,', 1)
('rapes.', 1)
('tragedies?', 1)
('slunk', 1)
('erst,', 1)
("Lucrece'", 2)
('niece;', 6)
('Pallas,', 1)
('find!', 1)
('mouth]', 1)
('shift!', 1)
('stumps,', 1)
('writes]', 1)
("'Stuprum-", 1)
('Chiron-', 1)
("Demetrius.'", 