# Setup

Let's set up Spark on your Colab environment.  Run the cell below!

In [1]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.1-py2.py3-none-any.whl size=281845513 sha256=d0c48566e3d37d80a86af44aebb4d73f02255c365a54bfd0d5a541d090dcd898
  Stored in directory: /root/.cache/pip/wheels/42/59/f5/79a5bf931714dcd201b26025347785f087370a10a3329a899c
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.1
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove

Now we import some of the libraries usually needed by our workload.





In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import pyspark
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf

# Downloading Data

In [3]:
!wget https://raw.githubusercontent.com/gogundur/Pyspark-WordCount/master/romeojuliet.txt

--2022-11-09 19:25:29--  https://raw.githubusercontent.com/gogundur/Pyspark-WordCount/master/romeojuliet.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 173028 (169K) [text/plain]
Saving to: ‘romeojuliet.txt’


2022-11-09 19:25:30 (7.29 MB/s) - ‘romeojuliet.txt’ saved [173028/173028]



# RDD (Resilient Distributed Dataset)


<font color='red'>Note that before Spark 2.0, the main programming interface of Spark was the Resilient Distributed Dataset (RDD). Since Spark 2.0, the Dataset has taken over that role: it is strongly typed like an RDD but benefits from richer optimizations under the hood. The RDD interface is still supported, but we highly recommend switching to Dataset, which performs better than RDD. In PySpark the corresponding API is the DataFrame, because the typed Dataset API is only available in Scala and Java.</font>

RDD Programming Guide: https://spark.apache.org/docs/latest/rdd-programming-guide.html

An RDD is a parallelized data structure whose workload is distributed across the worker nodes. RDDs are the basic units of Spark programming.

- The fundamental abstraction of Apache Spark is a read-only, parallel, distributed, fault-tolerant collection called a resilient distributed dataset (RDD).
- RDDs behave a bit like Python collections (e.g. lists).
- When working with Apache Spark we iteratively apply functions to every item of these collections in parallel to produce *new* RDDs.
- The data is distributed across nodes in a cluster of computers.
- Functions implemented in Spark can work in parallel across elements of the collection.
- The Spark framework allocates data and processing to different nodes, without any intervention from the programmer.
- RDDs are automatically rebuilt on machine failure.

## Lifecycle of a Spark Program


1. Create some input RDDs from external data or parallelize a collection in your driver program.
2. Lazily transform them to define new RDDs using transformations like `filter()` or `map()`.
3. Ask Spark to `cache()` any intermediate RDDs that will need to be reused.
4. Launch actions such as `count()` and `collect()` to kick off a parallel computation, which is then optimized and executed by Spark.

## Operations on Distributed Data

* There are two types of operations: ***transformations*** and ***actions***.
>* Transformations are lazy (not computed immediately).
>* Transformations are executed only when an action is run.
* A Spark program consists of a sequence of steps, each of which typically applies some function to an RDD to produce another RDD. Such operations are called transformations.
* It is also possible to take data from the surrounding file system, such as HDFS, and turn it into an RDD, and to take an RDD and either write it back to the surrounding file system or produce a result that is passed back to the application that called the Spark program. These latter operations, which write out or return a result, are called actions.
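
As an analogy only, the lazy behaviour described above can be observed in plain Python with generator expressions. This is not Spark code: the names `square`, `calls` and `data` are made up for illustration, and generators merely mimic the "define now, compute on demand" pattern that transformations and actions follow.

```python
calls = []  # records every time the mapping function actually runs


def square(x):
    calls.append(x)
    return x * x


data = [1, 2, 3, 4]

# "Transformations": build a lazy pipeline; square() has not run yet.
squared = (square(x) for x in data)
even_squares = (v for v in squared if v % 2 == 0)
calls_before_action = len(calls)

# "Action": materializing the result forces the whole pipeline to run.
result = list(even_squares)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, result)
```

Nothing is computed until `list()` asks for the values, in the same way that an RDD pipeline does no work until an action such as `count()` or `collect()` is called.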


### [Transformations](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations) (lazy)

```
map() flatMap()
filter() 
mapPartitions() mapPartitionsWithIndex() 
sample()
union() intersection() distinct()
groupBy() groupByKey()
reduceByKey()
sortBy() sortByKey()
join()
cogroup()
cartesian()
pipe()
coalesce()
repartition()
partitionBy()
...
```


### [Actions](https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions)

```
reduce()
collect()
count()
first()
take()
takeSample()
saveToCassandra()   (provided by the Spark Cassandra Connector, not core Spark)
takeOrdered()
saveAsTextFile()
saveAsSequenceFile()
saveAsObjectFile()
countByKey()
foreach()
```

# Initializing Spark

In order to work with RDDs, we need to create a SparkContext.

The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application.

In [4]:
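# create the configuration
# conf = SparkConf().set("spark.ui.port", "4050")
conf = SparkConf().setMaster("local").setAppName("My app")

# create the context (getOrCreate reuses a running context, so this cell can be re-run)
sc = SparkContext.getOrCreate(conf=conf)
# create the session on top of the existing context
spark = SparkSession.builder.getOrCreate()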

# Word Count

In [5]:
!rm -rf output

In [6]:
filename = 'romeojuliet.txt'

# read data from text file and split each line into words
words_rdd = sc.textFile(filename).flatMap(lambda line: line.split(" "))
words_rdd.take(5)

['', '', '', '', '']
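
The empty strings in the output above are a side effect of `split(" ")`, which returns an empty string for every extra space and for blank lines. A small plain-Python sketch of the difference between `split(" ")` and `split()` follows; the sample line is hypothetical and is not copied from `romeojuliet.txt`.

```python
# Hypothetical line with leading and repeated spaces.
line = "  WILLIAM  SHAKESPEARE'S ROMEO"

# Splitting on a single space keeps an empty string for each extra space.
with_sep = line.split(" ")

# Splitting with no argument collapses runs of whitespace and drops empties.
no_sep = line.split()

# A blank line behaves differently too.
blank_with_sep = "".split(" ")
blank_no_sep = "".split()

print(with_sep)
print(no_sep)
print(blank_with_sep, blank_no_sep)
```

Using `line.split()` inside `flatMap` would therefore avoid most empty tokens up front, although the explicit filter used in this notebook works just as well.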

In [7]:
# Exclude the empty strings produced by splitting on single spaces
words_rdd = words_rdd.filter(lambda x: x != '')
words_rdd.take(5)

['WILLIAM', "SHAKESPEARE'S", 'ROMEO', '&', 'JULIET']

In [8]:
# map each word to a (word, 1) pair
wordCounts_rdd = words_rdd.map(lambda word: (word, 1))
wordCounts_rdd.take(5)

[('WILLIAM', 1), ("SHAKESPEARE'S", 1), ('ROMEO', 1), ('&', 1), ('JULIET', 1)]

In [9]:
# count the occurrence of each word
wordCounts_rdd = wordCounts_rdd.reduceByKey(lambda a,b:a+b)
wordCounts_rdd.take(5)

[('WILLIAM', 1),
 ("SHAKESPEARE'S", 1),
 ('ROMEO', 145),
 ('&', 1),
 ('JULIET', 91)]
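
To make the semantics of `reduceByKey` concrete, here is a plain-Python sketch of what it computes: values are grouped by key and each group is folded with the supplied function. The helper `reduce_by_key` and the toy input are hypothetical; this illustrates the result only, not how Spark distributes or executes the work.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Group values by key, then fold each group with func (illustration only)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


# Toy input in the same (word, 1) shape as wordCounts_rdd.
pairs = [("ROMEO", 1), ("JULIET", 1), ("ROMEO", 1), ("&", 1), ("ROMEO", 1)]
counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)
```

In Spark the function passed to `reduceByKey` should be associative and commutative, because partial results are combined across partitions in no guaranteed order; addition satisfies both.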

In [10]:
# sort by word in descending alphabetical order (the counts were already reduced in the previous cell)
wordCounts_rdd = wordCounts_rdd.sortByKey(ascending=False)
wordCounts_rdd.take(5)

[('zillion', 1),
 ('youth.', 1),
 ('youth', 1),
 ('yourself.', 2),
 ('yourself', 1)]

In [11]:
# save the counts to output
wordCounts_rdd.saveAsTextFile("output")

In [12]:
!ls -l ./output/


In [13]:
!cat output/part-00000 | head

('zillion', 1)
('youth.', 1)
('youth', 1)
('yourself.', 2)
('yourself', 1)
('yours?', 1)
('your', 39)
('younger', 1)
('young?', 1)
('young;', 1)


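If you run the application again, `saveAsTextFile("output")` will fail, because the `output` folder was created during the first run and Spark refuses to write into an output directory that already exists. Delete the folder explicitly before each new run, as the `!rm -rf output` cell at the start of this section does.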
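
A minimal sketch of doing the same cleanup from Python rather than the shell, assuming the output lives on the local file system (as it does in this Colab setup). The helper name `remove_output_dir` is hypothetical, and the demonstration works in a temporary directory so that it does not touch the real `output` folder.

```python
import os
import shutil
import tempfile


def remove_output_dir(path):
    """Delete path and everything under it; do nothing if it is absent."""
    shutil.rmtree(path, ignore_errors=True)


# Demonstration in a throwaway directory.
base = tempfile.mkdtemp()
out_dir = os.path.join(base, "output")
os.makedirs(out_dir)
with open(os.path.join(out_dir, "part-00000"), "w") as f:
    f.write("('zillion', 1)\n")

existed_before = os.path.exists(out_dir)
remove_output_dir(out_dir)
existed_after = os.path.exists(out_dir)

# Calling it again on a missing folder is harmless.
remove_output_dir(out_dir)
shutil.rmtree(base)

print(existed_before, existed_after)
```

Calling `remove_output_dir("output")` right before `saveAsTextFile("output")` would make the save cell safe to re-run. For output stored on HDFS or another remote file system a different mechanism is needed.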

# Find the most frequent words

As a first step, we swap each (key, val) pair to (val, key) so that the count becomes the key. We can then sort by key in descending order.

In [14]:
wordCounts_reversed_rdd = wordCounts_rdd.map(lambda x:(x[1],x[0]))
wordCounts_reversed_rdd.take(5)

[(1, 'zillion'),
 (1, 'youth.'),
 (1, 'youth'),
 (2, 'yourself.'),
 (1, 'yourself')]

In [15]:
wordCounts_reversed_rdd = wordCounts_reversed_rdd.sortByKey(ascending=False)
wordCounts_reversed_rdd.take(5)

[(1120, 'the'), (460, 'of'), (435, 'a'), (384, 'to'), (381, 'and')]
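
The swap-then-sort idea can be checked on a small in-memory list. The sketch below is plain Python, not Spark, and uses a handful of (word, count) pairs taken from the outputs shown in this notebook. It also shows an alternative that selects the largest counts without swapping, using `heapq.nlargest`.

```python
import heapq

# (word, count) pairs taken from the outputs shown in this notebook.
word_counts = [("zillion", 1), ("the", 1120), ("yourself.", 2),
               ("of", 460), ("a", 435), ("to", 384), ("and", 381)]

# Approach used above: swap to (count, word), then sort in descending order.
swapped = [(n, word) for word, n in word_counts]
top3_by_sort = sorted(swapped, reverse=True)[:3]

# Alternative: keep (word, count) and pick the 3 largest counts directly.
top3_by_heap = heapq.nlargest(3, word_counts, key=lambda pair: pair[1])

print(top3_by_sort)
print(top3_by_heap)
```

On an RDD, the `takeOrdered()` action from the list of actions plays a similar role; for example, `wordCounts_rdd.takeOrdered(5, key=lambda x: -x[1])` should return the five most frequent words without the swap.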

# Excluding stopwords

A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. 

In [16]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [17]:
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
stopwords[1:10]

['me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [18]:
wordCounts_reversed_rdd = wordCounts_reversed_rdd.filter(lambda x: x[1] not in stopwords).sortByKey(False)
wordCounts_reversed_rdd.take(5)

[(246, 'The'), (219, 'Romeo'), (206, 'I'), (145, 'ROMEO'), (116, 'A')]
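
The output above still contains `The`, `I` and `A`. The membership test is case-sensitive, and the stop word list appears to hold lowercase entries only, judging by the sample printed earlier. The plain-Python sketch below reproduces the effect with a hypothetical five-word subset of the list and a few (count, word) pairs from this notebook's outputs.

```python
# Hypothetical small subset of the stop word list (all lowercase).
stopwords_sample = {"i", "me", "my", "the", "a"}

# (count, word) pairs taken from the outputs shown in this notebook.
pairs = [(1120, "the"), (246, "The"), (219, "Romeo"), (206, "I"), (116, "A")]

# Case-sensitive test: capitalized stop words slip through.
case_sensitive = [p for p in pairs if p[1] not in stopwords_sample]

# Lowercasing before the test removes them as well.
case_insensitive = [p for p in pairs if p[1].lower() not in stopwords_sample]

print(case_sensitive)
print(case_insensitive)
```

The sample is stored as a `set` so that each lookup takes constant time; converting the full stop word list to a set is a cheap optimization when filtering many words.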

# Remove Punctuation and Transform All Words to Lowercase


In [19]:
def lower_clean_str(x):
  punc='!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~-'
  lowercased_str = x.lower()
  for ch in punc:
    lowercased_str = lowercased_str.replace(ch, '')
  return lowercased_str
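
As an optional alternative, the same cleaning can be sketched with `str.translate`, which removes all punctuation characters in one pass instead of one `replace` call per character. The name `lower_clean_str_translate` is hypothetical. The sketch assumes that `string.punctuation` contains the same characters as the `punc` string above, and the final line checks that assumption.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def lower_clean_str_translate(text):
    """Lowercase text and strip ASCII punctuation in a single pass."""
    return text.lower().translate(PUNCT_TABLE)


# The punctuation string used by lower_clean_str in the notebook.
notebook_punc = '!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~-'
same_characters = set(notebook_punc) == set(string.punctuation)

print(lower_clean_str_translate("SHAKESPEARE'S"), same_characters)
```

Because it takes a single string and returns a string, it could be passed to `map` in place of `lower_clean_str` without further changes.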

In [20]:
words_rdd = sc.textFile(filename).flatMap(lambda line: line.split(" "))
words_rdd = words_rdd.map(lower_clean_str)                    # Convert to lowercase and remove punctuation
words_rdd = words_rdd.flatMap(lambda satir: satir.split(" ")) # Re-split on spaces (a no-op here, since tokens contain none)
words_rdd = words_rdd.filter(lambda x:x!='')                  # Remove empty strings
words_rdd = words_rdd.filter(lambda x:x not in stopwords)     # Remove stop words
wordCounts_rdd = words_rdd.map(lambda word:(word,1))
wordCounts_rdd = wordCounts_rdd.reduceByKey(lambda x,y:(x+y))
wordCounts_reversed_rdd = wordCounts_rdd.map(lambda x:(x[1],x[0]))
wordCounts_reversed_rdd = wordCounts_reversed_rdd.sortByKey(ascending=False)
wordCounts_reversed_rdd.take(5)

[(464, 'romeo'),
 (251, 'juliet'),
 (143, 'mercutio'),
 (133, 'capulet'),
 (114, 'thou')]