# Lab 1. PySpark and Big Data Processing

(DS702, Zhiqiang Xu, MBZUAI) 

It’s becoming more common to face situations where the amount of data is simply too big to handle on a single machine. Luckily, technologies such as Apache Spark, Hadoop, and others have been developed to solve this exact problem. The power of those systems can be tapped into directly from Python using PySpark!

In this lab, you'll learn: 
    <ul>
    <li> What Python concepts can be applied to Big Data </li>
    <li> How to use Apache Spark and PySpark </li>
    <li> How to write basic PySpark programs </li>
    <li> How to run PySpark programs on small datasets locally </li>
    </ul>

## Big Data Concepts in Python

Despite its popularity as just a scripting language, Python exposes several programming paradigms like array-oriented programming, object-oriented programming, asynchronous programming, and many others. One paradigm that is of particular interest for aspiring Big Data professionals is [functional programming](https://en.wikipedia.org/wiki/Functional_programming).

Functional programming is a common paradigm when you are dealing with Big Data. Code written in a functional style is often embarrassingly parallel: because each function call depends only on its inputs, the calls can run independently. This makes it easier to take your code and run it on several CPUs or even on entirely different machines. You can work around the physical memory and CPU restrictions of a single workstation by running on multiple systems at once.

This is the power of the PySpark ecosystem, allowing you to take functional code and automatically distribute it across an entire cluster of computers.

The core idea of functional programming is that data should be manipulated by functions without maintaining any external state. This means that your code avoids global variables and always returns new data instead of manipulating the data in-place.
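
To make the "no external state, return new data" idea concrete, here is a minimal sketch. The function names `add_suffix_in_place` and `add_suffix` are made up for this illustration; they are not part of any library.

```python
# Impure style: mutates the list it receives, so every caller holding
# a reference to that list sees the change.
def add_suffix_in_place(words, suffix):
    for i in range(len(words)):
        words[i] = words[i] + suffix
    return words

# Functional style: leaves the input untouched and returns a new list.
def add_suffix(words, suffix):
    return [word + suffix for word in words]

original = ['Big', 'data']
result = add_suffix(original, '!')

print(original)  # ['Big', 'data']
print(result)    # ['Big!', 'data!']
```

Because `add_suffix()` never touches its input, two workers could safely call it on the same data at the same time, which is the property distributed frameworks rely on.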

### Lambda Functions

[Lambda functions](https://en.wikipedia.org/wiki/Anonymous_function) in Python are defined inline and are limited to a single expression. You’ve likely seen lambda functions when using the built-in `sorted()` function:

In [1]:
x = ['Big', 'data', 'processing', 'is', 'awesome!']
print(sorted(x))

['Big', 'awesome!', 'data', 'is', 'processing']


In [2]:
print(sorted(x, key = lambda arg: arg.lower()))

['awesome!', 'Big', 'data', 'is', 'processing']


The function passed as the `key` parameter of `sorted()` is called once for each item in the iterable. Here it makes the sorting case-insensitive by comparing lowercase versions of the strings; the strings themselves are returned unchanged.

This is a common use-case for lambda functions, small anonymous functions that maintain no external state.
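
As a quick check, the sketch below sorts the same list twice, once with a `lambda` and once with an equivalent named function. The helper name `lowercase_key` is chosen just for this example.

```python
x = ['Big', 'data', 'processing', 'is', 'awesome!']

# A named function that does exactly what the lambda does.
def lowercase_key(arg):
    return arg.lower()

by_lambda = sorted(x, key=lambda arg: arg.lower())
by_def = sorted(x, key=lowercase_key)

print(by_lambda)            # ['awesome!', 'Big', 'data', 'is', 'processing']
print(by_lambda == by_def)  # True
print(x)                    # the original list is unchanged
```

The two calls are interchangeable; the `lambda` simply saves you from naming a function you only use once.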

Other common functional programming functions exist in Python as well, such as `filter()`, `map()`, and `reduce()`. All these functions can make use of lambda functions or standard functions defined with `def` in a similar manner.

### `sorted()`, `filter()`, `map()`, and `reduce()`

The `filter()`, `map()`, and `reduce()` functions are all common in functional programming. `filter()` and `map()` are built in, while in Python 3 `reduce()` lives in the `functools` module. You’ll soon see that these concepts can make up a significant portion of the functionality of a PySpark program.

It’s important to understand these functions in a core Python context. Then, you’ll be able to translate that knowledge into PySpark programs and the Spark API.

`filter()` filters items out of an iterable based on a condition, typically expressed as a lambda function:

In [3]:
x = ['Big', 'data', 'processing', 'is', 'awesome!']
list(filter(lambda arg: len(arg) < 8, x))

['Big', 'data', 'is']

`filter()` takes an iterable, calls the `lambda` function on each item, and returns the items where the `lambda` returned True.

You can imagine using `filter()` to replace a common `for` loop pattern like the following:

In [4]:
def is_less_than_8_characters(item):
    return len(item) < 8

x = ['Big', 'data', 'processing', 'is', 'awesome!']
results = []

for item in x:
    if is_less_than_8_characters(item):
        results.append(item)

print(results)

['Big', 'data', 'is']


This code collects all the strings that have fewer than 8 characters. The code is more verbose than the `filter()` example, but it performs the same function with the same results.

Another, less obvious benefit of `filter()` is that it returns an iterator that produces items lazily, one at a time. This means `filter()` doesn’t require that your computer have enough memory to hold all the results at once. This is increasingly important with Big Data sets that can quickly grow to several gigabytes in size.
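
The memory benefit is easiest to see with a source that could never fit in memory. The following sketch filters an endless counter from `itertools.count()` and only pulls the first five matches with `itertools.islice()`; the choice of even numbers is arbitrary.

```python
from itertools import count, islice

# count() yields 0, 1, 2, ... forever, so it can never be held in a list.
evens = filter(lambda n: n % 2 == 0, count())  # nothing is computed yet

# Items are produced only when something asks for them.
first_five = list(islice(evens, 5))
print(first_five)  # [0, 2, 4, 6, 8]
```

Calling `list(evens)` directly would never finish, which shows that `filter()` itself did not try to process the whole input up front.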

`map()` is similar to `filter()` in that it applies a function to each item in an iterable, but it always produces a 1-to-1 mapping of the original items. The **new** iterable that `map()` returns will always have the same number of elements as the original iterable, which was not the case with `filter()`:

In [5]:
x = ['Big', 'data', 'processing', 'is', 'awesome!']
print(list(map(lambda arg: arg.upper(), x)))

['BIG', 'DATA', 'PROCESSING', 'IS', 'AWESOME!']


`map()` automatically calls the `lambda` function on all the items, effectively replacing a `for` loop like the following:

In [6]:
results = []

x = ['Big', 'data', 'processing', 'is', 'awesome!']
for item in x:
    results.append(item.upper())

print(results)

['BIG', 'DATA', 'PROCESSING', 'IS', 'AWESOME!']


The `for` loop has the same result as the `map()` example, which collects all items in their upper-case form. However, as with the `filter()` example, `map()` returns a lazy iterator, which again makes it possible to process large sets of data that are too big to fit entirely in memory.
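
Because both functions are lazy, they can be chained into a small pipeline without building intermediate lists. This sketch reuses the list from above; the variable names `short`, `shouted`, `first`, and `rest` are only for illustration.

```python
x = ['Big', 'data', 'processing', 'is', 'awesome!']

# Chain the two steps; no item has been processed at this point.
short = filter(lambda arg: len(arg) < 8, x)
shouted = map(lambda arg: arg.upper(), short)

print(type(shouted).__name__)  # map

first = next(shouted)  # pulls exactly one item through both steps
rest = list(shouted)   # pulls whatever is left

print(first)  # BIG
print(rest)   # ['DATA', 'IS']
```

This "describe the pipeline first, run it later" pattern is a useful mental model for the PySpark code later in this lab.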

Finally, the last of the functional trio in the Python standard library is `reduce()`. As with `filter()` and `map()`, `reduce()` applies a function to elements in an iterable.

Again, the function being applied can be a standard Python function created with the `def` keyword or a `lambda` function.

However, `reduce()` doesn’t return a new iterable. Instead, `reduce()` applies the function cumulatively to reduce the iterable to a single value:

In [7]:
from functools import reduce
x = ['Big', 'data', 'processing', 'is', 'awesome!']
reduce(lambda val1, val2: val1 + val2, x)

'Bigdataprocessingisawesome!'

This code combines all the items in the iterable, from left to right, into a single item. There is no call to `list()` here because `reduce()` already returns a single item.
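
`reduce()` also accepts an optional third argument, an initial value for the accumulator. The sketch below uses it to total the lengths of the strings, which is a case where the result type (an integer) differs from the item type (a string).

```python
from functools import reduce

x = ['Big', 'data', 'processing', 'is', 'awesome!']

# Start from 0 and add the length of each word to the running total.
total_chars = reduce(lambda total, word: total + len(word), x, 0)
print(total_chars)  # 27

# With an initial value, an empty input is handled without an error.
print(reduce(lambda total, word: total + len(word), [], 0))  # 0
```

Without the initial value, calling `reduce()` on an empty iterable raises a `TypeError`, so supplying one is a common defensive habit.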

## Hello World in PySpark

First, let's install the `pyspark` library in your virtual environment if you're working in a Jupyter notebook or [Google Colab](https://colab.research.google.com/) (preferred).

In [8]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.0.tar.gz (281.3 MB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting py4j==0.10.9.2
  Downloading py4j-0.10.9.2-py2.py3-none-any.whl (198 kB)
Using legacy 'setup.py install' for pyspark, since package 'wheel' is not installed.
Installing collected packages: py4j, pyspark
    Running setup.py install for pyspark: started
    Running setup.py install for pyspark: finished with status 'done'
Successfully installed py4j-0.10.9.2 pyspark-3.2.0

In [8]:
import pyspark
pyspark.__version__

'3.2.0'


Now, let's download this text file and name it `copyright.txt` to work further:

In [9]:
!wget -O copyright.txt 'https://raw.githubusercontent.com/ialbert/booleannet/master/COPYRIGHT.txt'

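If `wget` is not available on your system (for example, on Windows, where the command fails with `'wget' is not recognized as an internal or external command`), download the file from the URL above with your browser and save it as `copyright.txt` in your working directory.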


As in any programming language/framework, you’ll want to get started with a `Hello World` example. Below is the PySpark equivalent:

In [None]:
import pyspark
sc = pyspark.SparkContext('local[*]')

In [None]:
txt = sc.textFile('copyright.txt')
print(txt.count())

python_lines = txt.filter(lambda line: 'python' in line.lower())
print(python_lines.count())

28
1


In [None]:
sc.stop()

You’ll learn all the details of this program soon, but take a good look. The program counts the total number of lines and the number of lines that have the word `python` in a file named `copyright.txt`.

There can be a lot of things happening behind the scenes that distribute the processing across multiple nodes if you’re on a cluster. However, for now, think of the program as a Python program that uses the PySpark library.
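
For comparison, here is a rough plain-Python counterpart of the program above, using only the standard library. The three sample lines are made up for this sketch and written to a temporary file; they are not the contents of `copyright.txt`.

```python
import os
import tempfile

# Made-up sample text standing in for a real input file.
sample = "First line\nPython is mentioned here\nNothing to see\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')
    with open(path, 'w') as handle:
        handle.write(sample)
    with open(path) as handle:
        lines = handle.read().splitlines()

# Same two questions as the PySpark program: how many lines in total,
# and how many of them mention "python" (case-insensitively)?
total = len(lines)
python_lines = list(filter(lambda line: 'python' in line.lower(), lines))

print(total)              # 3
print(len(python_lines))  # 1
```

The `filter()` call has the same shape as `txt.filter(...)` in the PySpark version; the difference is that this version runs in a single process and reads the whole file into memory.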

Now that you’ve seen some common functional concepts that exist in Python as well as a simple PySpark program, it’s time to dive deeper into Spark and PySpark.

### What Is Spark?

Apache Spark is made up of several components, so describing it can be difficult. At its core, Spark is a generic engine for processing large amounts of data.

Spark is written in Scala and runs on the JVM. Spark has built-in components for processing streaming data, machine learning, graph processing, and even interacting with data via SQL.

### What Is PySpark?

Spark is implemented in Scala, a language that runs on the JVM, so how can you access all that functionality via Python?

PySpark is the answer.

At the time of writing, the current version of PySpark is 3.2.0, which requires Python >= 3.6.

PySpark is a Python-based wrapper on top of the Scala API. PySpark communicates with the Spark Scala-based API via the [Py4J library](https://www.py4j.org/). Py4J isn’t specific to PySpark or Spark. Py4J allows any Python program to talk to JVM-based code.

## PySpark API and Data Structures

### RDD Creation

To interact with PySpark, you create specialized data structures called **Resilient Distributed Datasets** (RDDs). If you’re running on a cluster, RDDs hide the complexity of transforming your data and distributing it across multiple nodes; a scheduler handles this automatically.

To create an RDD, you first need a `SparkSession`, which is the entry point to a PySpark application. You can create one with `SparkSession.builder` (followed by `getOrCreate()`) or derive one from an existing session with its `newSession()` method.

A `SparkSession` internally creates a `SparkContext`, which is available as its `sparkContext` attribute. You can create multiple `SparkSession` objects but only one `SparkContext` per JVM. If you want to create a new `SparkContext`, stop the existing one first (using `stop()`).

In [None]:
spark = pyspark.sql.SparkSession.builder.master("local[1]").appName("SparkExample").getOrCreate()

#### using `parallelize()`

SparkContext has several methods for working with RDDs. For example, its `parallelize()` method creates an RDD from a list.

In [None]:
# Create RDD from parallelize    
lst = [("Python", 100000), ("Java", 20000), ("Scala", 3000)]
rdd = spark.sparkContext.parallelize(lst)
rdd.take(2) # takes first two elements

[('Python', 100000), ('Java', 20000)]

#### using `textFile()`

An RDD can also be created from a text file using the `textFile()` method of the SparkContext.

In [None]:
# Create RDD from external data source
rdd2 = spark.sparkContext.textFile("copyright.txt")
rdd2.take(2)

['',
 'Note there are two third party packages included with BooleanNet, these are goverened by ']

Once you have an RDD, you can perform transformation and action operations on it. Any operation you perform on an RDD runs in parallel.

### RDD Operations

You can perform two kinds of operations on a PySpark RDD.

**RDD transformations** – operations that return a new RDD instead of modifying the current one. Transformations are lazy: they do not execute until you call an action on the RDD. Some transformations are `flatMap()`, `map()`, `reduceByKey()`, `filter()`, and `sortByKey()`.

**RDD actions** – operations that trigger computation and return values from the RDD to the driver node. In other words, any RDD function that returns something other than an RDD is considered an action. Some actions are `count()`, `collect()`, `first()`, `max()`, `reduce()`, and more.
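
The split between lazy transformations and eager actions can be imitated in a few lines of plain Python. The `ToyDataset` class below is a hypothetical teaching aid written for this lab; it is not how Spark is implemented and it does nothing in parallel. It only mirrors the rule that transformations return a new object and actions run the recorded steps.

```python
class ToyDataset:
    """Hypothetical stand-in for an RDD: records steps, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new ToyDataset, compute nothing.
    def map(self, func):
        return ToyDataset(self._source, self._steps + (('map', func),))

    def filter(self, predicate):
        return ToyDataset(self._source, self._steps + (('filter', predicate),))

    def _run(self):
        items = iter(self._source)
        for kind, func in self._steps:
            items = map(func, items) if kind == 'map' else filter(func, items)
        return items

    # Actions: execute the recorded steps and return a plain Python value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every item the traced function actually sees

def traced_square(n):
    calls.append(n)
    return n * n

data = ToyDataset([1, 2, 3, 4])
big_squares = data.map(traced_square).filter(lambda n: n > 4)

calls_before_action = list(calls)
print(calls_before_action)  # [] because no action has been called yet

collected = big_squares.collect()
print(collected)  # [9, 16]
print(calls)      # [1, 2, 3, 4]
```

Note that `data` itself is never modified: each transformation returns a new object, just as the text above describes for RDDs.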

### PySpark DataFrame

DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.

If you are coming from a Python background, you may already know the pandas DataFrame. A PySpark DataFrame is broadly similar, with one key difference: a PySpark DataFrame is distributed across the cluster (its data is stored on different machines) and operations on it execute in parallel on those machines, whereas a pandas DataFrame is stored and processed on a single machine.

If you have no Python background, we recommend learning some Python basics first. For now, just know that the data in a PySpark DataFrame can be stored on different machines in a cluster.

### DataFrame Creation

The simplest way to create a DataFrame is from a Python list of data. A DataFrame can also be created from an RDD or by reading files from several sources.

#### using `createDataFrame()`

In [None]:
data = [('James','','Smith','1991-04-01','M',3000),
  ('Michael','Rose','','2000-05-19','M',4000),
  ('Robert','','Williams','1978-09-05','M',4000),
  ('Maria','Anne','Jones','1967-12-01','F',4000),
  ('Jen','Mary','Brown','1980-02-17','F',-1)
]

columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)

Since a DataFrame is a structured format with named, typed columns, we can inspect its schema using `df.printSchema()`:

In [None]:
df.printSchema()

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)



`df.show()` displays the first 20 rows of the DataFrame by default.

In [None]:
df.show()

+---------+----------+--------+----------+------+------+
|firstname|middlename|lastname|       dob|gender|salary|
+---------+----------+--------+----------+------+------+
|    James|          |   Smith|1991-04-01|     M|  3000|
|  Michael|      Rose|        |2000-05-19|     M|  4000|
|   Robert|          |Williams|1978-09-05|     M|  4000|
|    Maria|      Anne|   Jones|1967-12-01|     F|  4000|
|      Jen|      Mary|   Brown|1980-02-17|     F|    -1|
+---------+----------+--------+----------+------+------+



#### DataFrame from external data sources

In real-world applications, DataFrames are created from external sources such as files on the local system, HDFS, S3, Azure, HBase, MySQL tables, etc. Below is an example of how to read a CSV file from the local system.

In [None]:
!wget -O zipcodes.csv 'https://raw.githubusercontent.com/btucker/thisweknow/master/script/zipcodes.csv'

--2022-01-01 14:59:34--  https://raw.githubusercontent.com/btucker/thisweknow/master/script/zipcodes.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 784128 (766K) [text/plain]
Saving to: ‘zipcodes.csv’


2022-01-01 14:59:35 (26.0 MB/s) - ‘zipcodes.csv’ saved [784128/784128]



In [None]:
df = spark.read.csv("zipcodes.csv", header=True, inferSchema=True)
df.printSchema()

root
 |-- ZIP Code: integer (nullable = true)
 |-- City: string (nullable = true)
 |-- State Abbreviation: string (nullable = true)



In [None]:
df.show()

+--------+----------+------------------+
|ZIP Code|      City|State Abbreviation|
+--------+----------+------------------+
|     210|PORTSMOUTH|                NH|
|     211|PORTSMOUTH|                NH|
|     212|PORTSMOUTH|                NH|
|     213|PORTSMOUTH|                NH|
|     214|PORTSMOUTH|                NH|
|     215|PORTSMOUTH|                NH|
|     501|HOLTSVILLE|                NY|
|     544|HOLTSVILLE|                NY|
|     601|  ADJUNTAS|                PR|
|     602|    AGUADA|                PR|
|     603| AGUADILLA|                PR|
|     604| AGUADILLA|                PR|
|     605| AGUADILLA|                PR|
|     606|   MARICAO|                PR|
|     610|    ANASCO|                PR|
|     611|   ANGELES|                PR|
|     612|   ARECIBO|                PR|
|     613|   ARECIBO|                PR|
|     614|   ARECIBO|                PR|
|     616|  BAJADERO|                PR|
+--------+----------+------------------+
only showing top 20 rows

### PySpark SQL

To use SQL, first create a temporary view of the DataFrame using the `createOrReplaceTempView()` method.

In [None]:
df.createOrReplaceTempView("ZIPCODES_DB")

Once created, this view can be queried with `sql()` for as long as the `SparkSession` that created it is active; it is dropped when that session ends.

In [None]:
spark.sql("SELECT * from ZIPCODES_DB").show()

+--------+----------+------------------+
|ZIP Code|      City|State Abbreviation|
+--------+----------+------------------+
|     210|PORTSMOUTH|                NH|
|     211|PORTSMOUTH|                NH|
|     212|PORTSMOUTH|                NH|
|     213|PORTSMOUTH|                NH|
|     214|PORTSMOUTH|                NH|
|     215|PORTSMOUTH|                NH|
|     501|HOLTSVILLE|                NY|
|     544|HOLTSVILLE|                NY|
|     601|  ADJUNTAS|                PR|
|     602|    AGUADA|                PR|
|     603| AGUADILLA|                PR|
|     604| AGUADILLA|                PR|
|     605| AGUADILLA|                PR|
|     606|   MARICAO|                PR|
|     610|    ANASCO|                PR|
|     611|   ANGELES|                PR|
|     612|   ARECIBO|                PR|
|     613|   ARECIBO|                PR|
|     614|   ARECIBO|                PR|
|     616|  BAJADERO|                PR|
+--------+----------+------------------+
only showing top 20 rows

Next, we can inspect the schema of the `ZIPCODES_DB` view by creating a new DataFrame from a query.

In [None]:
df2 = spark.sql("SELECT * from ZIPCODES_DB")
df2.printSchema()

root
 |-- ZIP Code: integer (nullable = true)
 |-- City: string (nullable = true)
 |-- State Abbreviation: string (nullable = true)



Here, we count the number of rows for each value of the `City` column and show the 5 cities with the most entries.

In [None]:
spark.sql("SELECT City, COUNT(*) AS ENTRIES from ZIPCODES_DB GROUP BY City ORDER BY ENTRIES DESC").show(5)

+----------+-------+
|      City|ENTRIES|
+----------+-------+
|WASHINGTON|    301|
|   HOUSTON|    190|
|  NEW YORK|    162|
|   EL PASO|    158|
|    DALLAS|    130|
+----------+-------+
only showing top 5 rows
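
If you want to experiment with the shape of this query without starting Spark, the sketch below runs the same `GROUP BY` / `ORDER BY` statement with Python's built-in `sqlite3` module. SQLite is only a stand-in here (Spark SQL and SQLite differ in many details), `LIMIT 5` plays the role of `.show(5)`, and the twelve rows are a hand-picked subset of the `df.show()` output above, so the counts differ from the full dataset.

```python
import sqlite3

# A small subset of the rows displayed by df.show() earlier in this lab.
rows = [
    (210, 'PORTSMOUTH', 'NH'), (211, 'PORTSMOUTH', 'NH'),
    (212, 'PORTSMOUTH', 'NH'), (213, 'PORTSMOUTH', 'NH'),
    (214, 'PORTSMOUTH', 'NH'), (215, 'PORTSMOUTH', 'NH'),
    (501, 'HOLTSVILLE', 'NY'), (544, 'HOLTSVILLE', 'NY'),
    (601, 'ADJUNTAS', 'PR'),
    (603, 'AGUADILLA', 'PR'), (604, 'AGUADILLA', 'PR'),
    (605, 'AGUADILLA', 'PR'),
]

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
conn.execute(
    'CREATE TABLE ZIPCODES_DB '
    '("ZIP Code" INTEGER, City TEXT, "State Abbreviation" TEXT)'
)
conn.executemany('INSERT INTO ZIPCODES_DB VALUES (?, ?, ?)', rows)

query = (
    'SELECT City, COUNT(*) AS ENTRIES FROM ZIPCODES_DB '
    'GROUP BY City ORDER BY ENTRIES DESC LIMIT 5'
)
top_cities = conn.execute(query).fetchall()
conn.close()

print(top_cities)
# [('PORTSMOUTH', 6), ('AGUADILLA', 3), ('HOLTSVILLE', 2), ('ADJUNTAS', 1)]
```

The subset was chosen so that no two cities have the same count; with ties, the relative order of the tied rows is not guaranteed unless you add a second sort column.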



In [None]:
spark.stop()

## Word Count

In [None]:
from pyspark import SparkContext

sc = SparkContext("local", "Word Count")
txt = sc.textFile('copyright.txt')  # RDD with one element per line of the file
words = txt.flatMap(lambda line: line.split())  # split each line on whitespace into words
# pair each lowercased word with 1, then sum the 1s for each word
wordCounts = words.map(lambda word: (word.lower(), 1)).reduceByKey(lambda a, b: a + b)

wordCounts.collect()


[('note', 1),
 ('there', 1),
 ('are', 2),
 ('two', 1),
 ('third', 1),
 ('party', 1),
 ('packages', 1),
 ('included', 2),
 ('with', 3),
 ('booleannet,', 1),
 ('these', 1),
 ('goverened', 1),
 ('by', 1),
 ('the', 17),
 ('licenses', 1),
 ('below', 1),
 ('and', 5),
 ('have', 1),
 ('different', 1),
 ('copyrights.', 1),
 ('ply:', 1),
 ('gnu', 1),
 ('lesser', 1),
 ('general', 1),
 ('public', 1),
 ('license,', 2),
 ('copyright', 5),
 ('(c)', 3),
 ('2001-2007,', 1),
 ('david', 1),
 ('m.', 1),
 ('beazley', 1),
 ('functional:', 1),
 ('python', 2),
 ('software', 8),
 ('foundation', 2),
 ('2004-2006', 1),
 ('rest', 1),
 ('of', 9),
 ('is', 4),
 ('licensed', 1),
 ('mit', 1),
 ('open', 1),
 ('source', 1),
 ('license', 1),
 ('2007,', 1),
 ('istvan', 1),
 ('albert', 1),
 ('permission', 2),
 ('hereby', 1),
 ('granted,', 1),
 ('free', 1),
 ('charge,', 1),
 ('to', 8),
 ('any', 3),
 ('person', 1),
 ('obtaining', 1),
 ('a', 2),
 ('copy', 1),
 ('this', 2),
 ('associated', 1),
 ('documentation', 1),
 ('files',

Here is a brief walkthrough of the steps.
We created a `SparkContext` connected to a driver that runs locally.

`sc = SparkContext("local", "Word Count")`

Next, we read the input text file using the `SparkContext` variable and applied `flatMap()` to turn the lines into a flat collection of words. `words` is of type PythonRDD.

`txt = sc.textFile('copyright.txt')`

`words = txt.flatMap(lambda line: line.split())`
Because `split()` is called without arguments, each line is split on any run of whitespace, not just on a single space.

Then we map each word to a key-value pair `(word, 1)`, where 1 is the count for a single occurrence.

`words.map(lambda word: (word.lower(), 1))` 

The result is then reduced by key, which is the word, and the values are added.

`reduceByKey(lambda a, b: a + b)`
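
The same three steps can be traced in plain Python, which may help when debugging the logic on a tiny input. In this sketch, `itertools.chain.from_iterable` stands in for `flatMap()`, the built-in `map()` for the RDD `map()`, and a `reduce()` over a dictionary for `reduceByKey()`. The two input lines and the helper name `add_pair` are made up for the example.

```python
from functools import reduce
from itertools import chain

# Made-up input standing in for the lines of a text file.
lines = ['the quick brown fox', 'The lazy dog and the fox']

# flatMap step: split every line and flatten the results into one stream of words.
words = chain.from_iterable(line.split() for line in lines)

# map step: pair each lowercased word with the count 1.
pairs = map(lambda word: (word.lower(), 1), words)

# reduceByKey step: add up the counts of pairs that share the same key.
def add_pair(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

word_counts = reduce(add_pair, pairs, {})
print(word_counts)
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1, 'and': 1}
```

Unlike the Spark version, this runs in one process and keeps a single dictionary in memory; Spark's `reduceByKey()` combines values per key across partitions instead.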

In [None]:
# to stop SparkContext
sc.stop()