# Spark Basics

This notebook walks through some simple PySpark tasks.

We start a Spark session named "MyPySpark".

We also give the Spark driver process 15 GB of memory. By default, `spark.driver.memory` is only 1 GB.

The SparkContext is obtained from the `spark` object and stored as `sc`. We set its log level to ERROR so that only error messages are shown; otherwise you may see a lot of INFO and WARN messages.


In [1]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
import pyspark
import numpy as np

spark = SparkSession.builder.appName("MyPySpark").config("spark.driver.memory", "15g").getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("ERROR")

spark


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/06/28 20:56:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# Create a Random Number List

Next, we'll create a random number list with 200 numbers and save the list to the variable `num_data`.


The `spark` object displayed by the first cell shows the details of the session.
A master of `local[*]` means Spark will use all the CPU cores on the current compute node.

In [2]:
np.random.seed(42)

num_data = [np.random.randint(1,100) for _ in range(200)]

# Print first 5 values from list
num_data[:5]


[52, 93, 15, 72, 61]
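
As a small aside on the seed used above, the sketch below illustrates why `np.random.seed(42)` makes the list reproducible, and that `np.random.randint(1, 100)` draws from 1 to 99 because the upper bound is exclusive. The helper `make_numbers` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

def make_numbers(seed, n=200):
    """Hypothetical helper: reseed NumPy's global generator, then draw n integers."""
    np.random.seed(seed)
    # randint(1, 100) returns values in 1..99; the upper bound 100 is excluded.
    return [int(np.random.randint(1, 100)) for _ in range(n)]

first_run = make_numbers(42)
second_run = make_numbers(42)

# Reseeding with the same value replays the same sequence of draws.
print(first_run == second_run)
print(min(first_run) >= 1 and max(first_run) <= 99)
```

Rerunning the notebook cell without reseeding would continue the random stream and give a different list.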

# Create a Resilient Distributed Dataset (RDD)

We'll create an RDD object from the `num_data` list. This RDD object will be saved as `num_rdd`.


In [3]:
num_rdd = sc.parallelize(num_data)
type(num_rdd), num_rdd.count()


(pyspark.rdd.RDD, 200)
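
To give a rough picture of what "distributed" means here, the plain-Python sketch below splits a list into contiguous, roughly equal slices, which is conceptually how a local list is spread over partitions. It does not use Spark, the function `split_into_partitions` is a hypothetical name, and the exact slicing Spark performs may differ. On the real RDD, `num_rdd.getNumPartitions()` and `num_rdd.glom().collect()` can be used to inspect the actual layout.

```python
def split_into_partitions(items, num_partitions):
    """Split a list into contiguous, roughly equal slices (illustrative only)."""
    n = len(items)
    # Slice boundaries: partition i covers items[bounds[i]:bounds[i + 1]].
    bounds = [(i * n) // num_partitions for i in range(num_partitions + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]

partitions = split_into_partitions(list(range(10)), 3)
print(partitions)

# A count over the whole dataset is the sum of the per-partition counts.
total_count = sum(len(part) for part in partitions)
print(total_count)
```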

# Map Transformation

We'll use the `.map()` method on the RDD. This returns a new RDD that records the function to apply, here x^2, without running it. Because of this "lazy evaluation", the new RDD does NOT hold the final values yet; it only holds the "task" of computing x^2, which runs when an action asks for results. The `.map()` call itself therefore returns almost immediately. The new RDD is saved as `num_map_rdd`.


In [4]:
num_map_rdd = num_rdd.map(lambda x: x * x)
num_map_rdd


PythonRDD[2] at RDD at PythonRDD.scala:53
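
Lazy evaluation can be illustrated without Spark. The sketch below is only an analogy: Python's built-in `map` is also lazy, so the squaring function is not called until values are requested. The list `call_log` and the function `square` are hypothetical names used to record when the work actually happens.

```python
from itertools import islice

call_log = []  # records every value that square() is actually called with

def square(x):
    call_log.append(x)
    return x * x

# First five values come from the notebook output; the rest are arbitrary.
numbers = [52, 93, 15, 72, 61, 1, 2, 3]

lazy_squares = map(square, numbers)   # builds the "task", computes nothing
calls_before = len(call_log)

# Asking for the first three results triggers exactly three calls.
first_three = list(islice(lazy_squares, 3))
calls_after = len(call_log)

print(calls_before, first_three, calls_after)
```

In Spark the same idea applies at a larger scale: transformations such as `.map()` describe the work, and actions such as `.take()` or `.count()` trigger it.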

# Take Action

Now we'll ask for the first 5 values of x^2. Using `.take(N)` returns a Python list with the first N elements. Spark computes the x^2 values now because an action requested them. On this node the work can be spread over 36 cores, although `.take()` only evaluates as many partitions as it needs to return N elements.


In [5]:
num_map_rdd.take(5)

[2704, 8649, 225, 5184, 3721]

# Filter Transformation

Next, let's use `.filter()` on the RDD to return a new RDD with only the numbers that pass the condition.


In [6]:
num_filter_rdd = num_rdd.filter(lambda x: x < 10)
print("Number of values in new filter RDD: ", str(num_filter_rdd.count()))
print("First 5 values in the new RDD: ", str(num_filter_rdd.take(5)))

Number of values in new filter RDD:  28
First 5 values in the new RDD:  [3, 2, 2, 3, 7]


# Collect Action

Calling `.collect()` on an RDD returns all of its elements to the driver as a normal Python list.


In [7]:
# Collect the original (unfiltered) RDD into a Python list
num_array = num_rdd.collect()
type(num_array), num_array[:5]


(list, [52, 93, 15, 72, 61])

---

# Working with Text Data

In this example, we will use data from Project Gutenberg: "The Hound of the Baskervilles, by Arthur Conan Doyle". 

The downloaded text file is saved as "3070.txt".

We will load this file into an RDD with `.textFile()`, which reads the text line by line, so each RDD element is one line.


In [8]:
path = "3070.txt"
book_rdd = sc.textFile(path)

# Print the first 10 lines of the Text
book_rdd.take(10)


["Project Gutenberg's The Hound of the Baskervilles, by Arthur Conan Doyle",
 '',
 'This eBook is for the use of anyone anywhere at no cost and with',
 'almost no restrictions whatsoever.  You may copy it, give it away or',
 're-use it under the terms of the Project Gutenberg License included',
 'with this eBook or online at www.gutenberg.org',
 '',
 '',
 'Title: The Hound of the Baskervilles',
 '']

# Count Action

We'll count the number of lines in the text; each RDD element is one line.


In [9]:
book_rdd.count()

7729

# FlatMap Transformation

Let's split the lines into individual words. Unlike `.map()`, which returns exactly one output element per input element, the function passed to `.flatMap()` can return several values per element, and these are flattened into a single RDD.


In [10]:
words_rdd = book_rdd.flatMap(lambda x: x.split())
num_words = words_rdd.count()
print("Number of words: " + str(num_words))

num_distinct_words = words_rdd.distinct().count()
print("Number of distinct words: " + str(num_distinct_words))

Number of words: 62248
Number of distinct words: 9885
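
The difference between mapping and flat-mapping a split can be sketched in plain Python. This is an analogy only and does not use Spark; the sample `lines` are illustrative rather than the actual file contents.

```python
from itertools import chain

# Illustrative sample lines, including one blank line.
lines = ["Title: The Hound of the Baskervilles", "", "by Arthur Conan Doyle"]

# map-style: one output element per line, so the result is a list of lists.
mapped = [line.split() for line in lines]

# flatMap-style: the per-line lists are flattened into one list of words.
flat = list(chain.from_iterable(line.split() for line in lines))

print(len(mapped))  # one entry per line, including the blank one
print(len(flat))    # one entry per word
print(flat[:3])
```

Because `str.split()` without arguments returns an empty list for a blank line, blank lines contribute no words after flattening.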


# Map and ReduceByKey Transformations

We'll create a new "pair" RDD, `key_value_rdd`, that holds a ("word", 1) pair for each word. Then `.reduceByKey()` adds up the values of identical words, giving one (word, count) pair per distinct word. Finally, we flip each pair to (count, word) so that the count becomes the key.


In [11]:
key_value_rdd = words_rdd.map(lambda x: (x,1))
word_kv_rdd = key_value_rdd.reduceByKey(lambda x,y: x+y)
flip_word_kv_rdd = word_kv_rdd.map(lambda x: (x[1],x[0]))

flip_word_kv_rdd.take(5)


[(78, 'Project'), (217, 'The'), (11, 'Hound'), (1694, 'of'), (4, 'Arthur')]
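
What `.reduceByKey()` does conceptually can be sketched in plain Python: values that share a key are combined pairwise with the supplied function. The function `reduce_by_key` below is a hypothetical stand-in written for illustration, not Spark's implementation, and the sample words are made up.

```python
def reduce_by_key(pairs, func):
    """Combine the values of equal keys with func (illustrative, single machine)."""
    totals = {}
    for key, value in pairs:
        if key in totals:
            totals[key] = func(totals[key], value)
        else:
            totals[key] = value
    return list(totals.items())

words = ["the", "hound", "of", "the", "moor", "the"]

pairs = [(word, 1) for word in words]               # like .map(lambda x: (x, 1))
counts = reduce_by_key(pairs, lambda x, y: x + y)   # like .reduceByKey(...)
flipped = [(count, word) for word, count in counts] # like .map(lambda x: (x[1], x[0]))

print(counts)
print(flipped)
```

In Spark the combining function is applied within each partition first and then across partitions, which is why it should be associative and commutative, as addition is.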

# SortByKey Transformation

We'll use `.sortByKey()` to sort the key/value pairs by decreasing word count. Passing `False` as the first argument requests descending order.


In [12]:
word_results_rdd = flip_word_kv_rdd.sortByKey(False)
word_results_rdd.take(5)


[(3230, 'the'), (1694, 'of'), (1562, 'and'), (1449, 'to'), (1279, 'a')]
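
The descending sort by key can be mirrored in plain Python with `sorted`. The sketch below reuses the five flipped pairs printed earlier in the notebook and is only a single-machine analogy.

```python
# (count, word) pairs taken from the earlier notebook output.
flipped = [(78, "Project"), (217, "The"), (11, "Hound"), (1694, "of"), (4, "Arthur")]

# Sort on the key (the count) only, largest first, like sortByKey(False).
by_count_desc = sorted(flipped, key=lambda pair: pair[0], reverse=True)
print(by_count_desc[:2])

# The same ordering can be reached without flipping, by sorting on the value.
# In PySpark, RDD.sortBy(lambda kv: kv[1], ascending=False) plays this role.
unflipped = [(word, count) for count, word in flipped]
by_value_desc = sorted(unflipped, key=lambda kv: kv[1], reverse=True)
print(by_value_desc[:2])
```

Flipping the pairs is needed for `.sortByKey()` because it orders by the first element of each pair.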

---

# Conclusion

Many further text-processing steps, such as removing "stop" words and "stemming", can be applied to an RDD with the same kinds of transformations. This notebook gave a brief introduction to the basic operations you can perform with PySpark and RDDs.
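
As one possible next step, the word counts above treat 'The' (217) and 'the' (3230) as different words, and the most frequent words are all common function words. The plain-Python sketch below shows one way to normalize words and drop stop words. The names `STOP_WORDS` and `normalize` are hypothetical, and the tiny stop-word set is just the top five words from the output above, not a real stop-word list.

```python
import string

# Hypothetical, deliberately tiny stop-word set (the notebook's top five words).
STOP_WORDS = {"the", "of", "and", "to", "a"}

def normalize(word):
    """Strip leading/trailing punctuation and lower-case the word."""
    return word.strip(string.punctuation).lower()

# Made-up sample words for illustration.
raw_words = ["The", "Hound", "of", "the", "Baskervilles,", "and", "the", "moor."]

cleaned = [normalize(word) for word in raw_words]
kept = [word for word in cleaned if word and word not in STOP_WORDS]

print(cleaned)
print(kept)
```

Assuming the same helpers are defined in the notebook, the equivalent on the RDD would be along the lines of `words_rdd.map(normalize).filter(lambda w: w and w not in STOP_WORDS)`.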
