# RDDs from Parallelized collections

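A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark: an immutable, distributed collection of objects. Since the RDD is Spark's fundamental data type, it is important to understand how to create one. The simplest way is to parallelize an existing Python collection with `sc.parallelize()`.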

In [1]:
from pyspark import SparkConf, SparkContext
# Create a SparkConf object to configure the SparkContext
conf = SparkConf().setAppName("YourAppName").setMaster("local[*]")

# Create a SparkContext with the configured SparkConf object
sc = SparkContext(conf=conf)

# Create an RDD from a list of words
RDD = sc.parallelize(["Spark", "is", "a", "framework", "for", "Big Data processing"])

# Print out the type of the created object
print("The type of RDD is", type(RDD))

The type of RDD is <class 'pyspark.rdd.RDD'>


# RDDs from External Datasets

PySpark can easily create RDDs from files stored in external storage systems such as HDFS (Hadoop Distributed File System) or Amazon S3 buckets. However, the most common way to create RDDs is from files stored in your local file system. SparkContext's `textFile()` method takes a file path and reads the file as a collection of lines.

In [2]:
file_path = "dataset/ham.txt"
# Print the file_path
print("The file_path is", file_path)

# Create a fileRDD from file_path
fileRDD = sc.textFile(file_path)

# Check the type of fileRDD
print("The file type of fileRDD is", type(fileRDD))

The file_path is dataset/ham.txt
The file type of fileRDD is <class 'pyspark.rdd.RDD'>
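
As a rough local analogy (plain Python, no Spark involved), "reading a file as a collection of lines" can be sketched as below. The temporary file and its contents are made up for illustration, and the result is an ordinary list rather than a distributed RDD.

```python
import os
import tempfile

# Hypothetical sample content; it stands in for a file such as dataset/ham.txt.
sample_text = "first line\nsecond line\nthird line\n"

# Write the sample to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample_text)
    tmp_path = tmp.name

# Read the file back as a list of lines with the line terminators removed.
with open(tmp_path) as handle:
    lines = [line.rstrip("\n") for line in handle]

os.remove(tmp_path)
print(lines)
```

Each element is one line of text, which is the shape of data the later `filter()` and `flatMap()` exercises work on.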


# Partitions in your data

SparkContext's `textFile()` method takes an optional second argument, `minPartitions`, that specifies the minimum number of partitions. In this exercise, you'll create an RDD named `fileRDD_part` with 5 partitions and compare it with the `fileRDD` you created in the previous exercise.

In [3]:
# Check the number of partitions in fileRDD
print("Number of partitions in fileRDD is", fileRDD.getNumPartitions())

# Create a fileRDD_part from file_path with 5 partitions
fileRDD_part = sc.textFile(file_path, minPartitions = 5)

# Check the number of partitions in fileRDD_part
print("Number of partitions in fileRDD_part is", fileRDD_part.getNumPartitions())

Number of partitions in fileRDD is 2
Number of partitions in fileRDD_part is 5
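
To build intuition for what a partition is, here is a minimal plain-Python sketch that cuts a list into contiguous, near-even slices. The helper `split_into_partitions` is hypothetical and only a mental model: `textFile()` splits its input by byte ranges, and `minPartitions` is a lower bound, so the real partition count and contents may differ.

```python
def split_into_partitions(items, num_partitions):
    """Split a list into contiguous, near-even slices (hypothetical helper)."""
    n = len(items)
    return [
        items[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]


# Twelve made-up lines spread over five partitions.
lines = ["line {}".format(i) for i in range(12)]
partitions = split_into_partitions(lines, 5)

# Every line lands in exactly one partition; sizes differ by at most one.
sizes = [len(part) for part in partitions]
print(sizes)  # [2, 2, 3, 2, 3]
```

More partitions allow more tasks to run in parallel, at the cost of more scheduling overhead per task.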


# Map and Collect

The main way to manipulate data in PySpark is with `map()`. The `map()` transformation takes a function and applies it to each element of the RDD. It can be used for any number of things, from fetching the website associated with each URL in a collection to simply squaring numbers. Because `map()` is a transformation, nothing is computed until an action such as `collect()` asks for the results.

In [4]:
# from pyspark.sql import SparkSession

# # Create a SparkSession
# spark = SparkSession.builder \
#     .appName("example") \
#     .getOrCreate()
# sc = spark.sparkContext
# type(sc)

In [6]:
# numbRDD = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# # Create map() transformation to cube numbers
# cubedRDD = numbRDD.map(lambda x: x ** 3)

# # Collect the results
# numbers_all = cubedRDD.collect()

# # Print the numbers from numbers_all
# for numb in numbers_all:
# 	print(numb)
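
The transformation/action split has a loose counterpart in plain Python, sketched here without Spark: the built-in `map()` is lazy, and `list()` forces the result in the way `collect()` does for an RDD. This is an analogy only; nothing in it is distributed.

```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Built-in map() is lazy, loosely like an RDD transformation: nothing runs yet.
cubed_lazy = map(lambda x: x ** 3, numbers)

# list() forces evaluation, playing the role that collect() plays for an RDD.
cubed = list(cubed_lazy)
print(cubed)  # [1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]
```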

# Filter and Count

The RDD transformation `filter()` returns a new RDD containing only the elements for which a given function returns true. It is useful for filtering large datasets based on a keyword. For this exercise, you'll select the lines containing the keyword `Spark` from `fileRDD`, which consists of the lines of text in the file you loaded earlier.

In [7]:
fileRDD

dataset/ham.txt MapPartitionsRDD[2] at textFile at NativeMethodAccessorImpl.java:0

In [9]:
# # Filter the fileRDD to select lines with Spark keyword
# fileRDD_filter = fileRDD.filter(lambda line: 'Spark' in line)

# # How many lines contain the keyword Spark?
# print("The total number of lines with the keyword Spark is", fileRDD_filter.count())

# # Print the first four lines of fileRDD_filter
# for line in fileRDD_filter.take(4): 
#   print(line)
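
The same filter, count, and take steps can be sketched locally in plain Python. The sample lines are invented for illustration; note that the `in` test is case-sensitive, so `spark` in lowercase does not match.

```python
from itertools import islice

# Hypothetical sample lines standing in for the contents of fileRDD.
lines = [
    "Spark is fast",
    "Hadoop stores data",
    "PySpark wraps Spark",
    "nothing relevant here",
    "spark in lowercase does not match",
]

# Keep only the lines for which the predicate is True (case-sensitive match).
matching = [line for line in lines if "Spark" in line]

# count() corresponds to the length; take(n) to the first n elements.
match_count = len(matching)
first_two = list(islice(matching, 2))

print(match_count)
for line in first_two:
    print(line)
```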

# ReduceByKey and Collect

One of the most popular pair RDD transformations is `reduceByKey()` which operates on key, value (k,v) pairs and merges the values for each key. In this exercise, you'll first create a pair RDD from a list of tuples, then combine the values with the same key and finally print out the result.

In [11]:
# # Create PairRDD Rdd with key value pairs
# Rdd = sc.parallelize([(1,2),(3,4),(3,6),(4,5)])

# # Apply reduceByKey() operation on Rdd
# Rdd_Reduced = Rdd.reduceByKey(lambda x, y: x+y)

# # Iterate over the result and print the output
# for num in Rdd_Reduced.collect(): 
#   print("Key {} has {} Counts".format(num[0], num[1]))
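
What "merging the values for each key" means can be shown with a small plain-Python sketch. The helper `reduce_by_key` is hypothetical and runs on a single machine; Spark applies the function within and across partitions, which is why the function should be associative and commutative.

```python
def reduce_by_key(pairs, func):
    """Merge the values of each key with func (hypothetical local helper)."""
    merged = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are folded in.
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


pairs = [(1, 2), (3, 4), (3, 6), (4, 5)]
reduced = reduce_by_key(pairs, lambda x, y: x + y)

for key, total in reduced.items():
    print("Key {} has {} Counts".format(key, total))
```

Key 3 appears twice, so its values 4 and 6 are combined into 10, while keys 1 and 4 keep their single values.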

# SortByKey and Collect

Many times it is useful to sort the pair RDD based on the key (for example word count which you'll see later in the chapter). In this exercise, you'll sort the pair RDD `Rdd_Reduced` that you created in the previous exercise into descending order and print the final output.

In [14]:
# # Sort the reduced RDD with the key by descending order
# Rdd_Reduced_Sort = Rdd_Reduced.sortByKey(ascending=False)

# # Iterate over the result and retrieve all the elements of the RDD
# for num in Rdd_Reduced_Sort.collect():
#   print("Key {} has {} Counts".format(num[0], num[1]))
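
For comparison, the same descending sort on a local list of pairs looks like the sketch below. The input is what the reduce step yields for the sample pairs `[(1,2),(3,4),(3,6),(4,5)]`, written out by hand here so the snippet stands alone.

```python
# Reduced pairs for the sample data, typed in directly for this sketch.
reduced_pairs = [(1, 2), (3, 10), (4, 5)]

# Sort on the key (first tuple element), largest key first.
sorted_desc = sorted(reduced_pairs, key=lambda kv: kv[0], reverse=True)

for key, total in sorted_desc:
    print("Key {} has {} Counts".format(key, total))
```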

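# CountingByKeys

For many datasets, it is important to count how many times each key occurs in a key/value dataset, for example the number of sales per country or the number of occurrences of each baby name. In this simple exercise, you'll use the `Rdd` that you created earlier and count the number of elements for each key with `countByKey()`. Because `countByKey()` is an action that returns its result to the driver as a dictionary, use it only when the number of distinct keys is small enough to fit in memory.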

In [16]:
# Rdd = sc.parallelize([(1, 2), (3, 4), (3, 6), (4, 5)])
# # Count the number of elements for each key
# total = Rdd.countByKey()

# # What is the type of total?
# print("The type of total is", type(total))

# # Iterate over the total and print the output
# for k, v in total.items(): 
#   print("key", k, "has", v, "counts")
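
A local sketch of the per-key count, in plain Python, is shown below. It uses a `defaultdict(int)` so that a key seen for the first time starts at zero; treat this as an illustration of the result's shape rather than of Spark's internals.

```python
from collections import defaultdict

pairs = [(1, 2), (3, 4), (3, 6), (4, 5)]

# Count how many elements share each key; the values themselves are ignored.
counts = defaultdict(int)
for key, _value in pairs:
    counts[key] += 1

for k, v in counts.items():
    print("key", k, "has", v, "counts")

# The number of distinct keys is simply the size of the dictionary.
distinct_keys = len(counts)
```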

# Create a base RDD and transform it

The volume of unstructured data (log lines, images, binary files) is growing dramatically, and PySpark is an excellent framework for analyzing this type of data through RDDs. In this three-part exercise, you will write code that finds the most common words in the Complete Works of William Shakespeare.

In [18]:
# # Create a baseRDD from the file path
# baseRDD = sc.textFile("dataset/Complete_Shakespeare.txt")

# # Split the lines of baseRDD into words
# splitRDD = baseRDD.flatMap(lambda x: x.split())

# # Count the total number of words
# print("Total number of words in splitRDD:", splitRDD.count())
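
The reason `flatMap()` is used here instead of `map()` can be seen in a small plain-Python sketch with two made-up lines: mapping `split()` gives one list per line, while flat-mapping gives a single flat sequence of words.

```python
lines = ["to be or not to be", "that is the question"]

# map-style: one result per input line, so the output is a list of lists.
mapped = [line.split() for line in lines]

# flatMap-style: the per-line results are flattened into one list of words.
flat_mapped = [word for line in lines for word in line.split()]

print(len(mapped))       # 2
print(len(flat_mapped))  # 10
```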

# Remove stop words and reduce the dataset

In this exercise, you'll remove stop words from your data. Stop words are common words that are often uninteresting, for example "I", "the", and "a". You can remove many obvious stop words with a list of your own, but for this exercise you will use a curated list, `stop_words`, loaded here from NLTK's English stop word corpus.

In [20]:
import nltk
from nltk.corpus import stopwords

# Download NLTK stop words (if not already downloaded)
nltk.download('stopwords')

# Get stop words list
stop_words = stopwords.words('english')

# print(stop_words)

# # Lowercase each word for the comparison and drop words found in the stop_words list
# splitRDD_no_stop = splitRDD.filter(lambda x: x.lower() not in stop_words)

# # Create a tuple of the word and 1
# splitRDD_no_stop_words = splitRDD_no_stop.map(lambda w: (w, 1))

# # Count the number of occurrences of each word
# resultRDD = splitRDD_no_stop_words.reduceByKey(lambda x, y: x + y)
# resultRDD.collect()


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\88016\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
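
The filter-then-count logic can be tried locally with the plain-Python sketch below. The word list and the tiny `stop_words_sample` set are invented for illustration and are not the NLTK list. The sketch also shows a side effect of lowercasing only inside the filter: words that differ in case, such as `king` and `King`, are counted as separate keys.

```python
from collections import Counter

# Hypothetical miniature stop word set; a set makes membership tests fast.
stop_words_sample = {"the", "a", "i", "to", "and"}

words = ["The", "king", "and", "the", "queen", "I", "saw", "the", "King"]

# Lowercase only for the comparison; the kept words retain their original case.
kept = [w for w in words if w.lower() not in stop_words_sample]

# Count occurrences of each remaining word (the (word, 1) + reduceByKey step).
word_counts = Counter(kept)
print(word_counts)
```

If case-insensitive counts are wanted, one option is to lowercase the words before pairing them with 1.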


# Print word frequencies

After combining the values (counts) that share the same key (word), you'll return the first 10 word frequencies in this exercise. You could retrieve all the elements at once using `collect()`, but that is bad practice: an RDD can be huge, and pulling all of it to the driver may exhaust memory and crash your computer. Use `take()` to fetch only as many elements as you need.

In [21]:
# # Display the first 10 words and their frequencies from the input RDD
# for word in resultRDD.take(10):
# 	print(word)

# # Swap the keys and values from the input RDD
# resultRDD_swap = resultRDD.map(lambda x: (x[1], x[0]))

# # Sort the keys in descending order
# resultRDD_swap_sort = resultRDD_swap.sortByKey(ascending=False)

# # Show the top 10 most frequent words and their frequencies from the sorted RDD
# for word in resultRDD_swap_sort.take(10):
#     print("{},{}".format(word[1], word[0]))
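
The swap-and-sort pattern is easier to follow on a small local example. The sketch below is plain Python with made-up word counts: swapping puts the count in the key position, so sorting by key orders the words by frequency.

```python
# Hypothetical (word, count) pairs standing in for resultRDD.
word_counts = [("king", 12), ("queen", 7), ("sword", 3), ("crown", 9)]

# Swap each pair so that the count becomes the key.
swapped = [(count, word) for word, count in word_counts]

# Sort by the count, highest first, then keep only the top entries.
swapped_sorted = sorted(swapped, key=lambda kv: kv[0], reverse=True)
top_two = swapped_sorted[:2]

for count, word in top_two:
    print("{},{}".format(word, count))
```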