# The goal of these exercises is to get familiar with the PySpark APIs

# Cheat Sheet
**Transformation operations:**
- map: Takes a function as input and applies it to each element in the source RDD to create a new RDD
- flatMap: Takes a function that returns a sequence for each input element, and returns a new RDD formed by flattening those sequences into a single collection of elements
- filter: Takes a Boolean function as input and applies it to each element in the source RDD to create a new RDD by selecting only those elements for which the input Boolean function returned true
- distinct: The distinct method of an RDD returns a new RDD containing the distinct elements in the source RDD.
- zip: Takes an RDD as input and returns an RDD of pairs, where the first element in a pair is from the source RDD and the second element is from the input RDD. Both RDDs must have the same number of partitions and the same number of elements in each partition.
- groupBy: Takes a function that computes a key for each element and groups the elements of the RDD by that key. It returns an RDD of pairs: the first item is a key and the second item is a collection of the elements the input function mapped to that key.
- sortBy: Returns an RDD with the elements of the source RDD sorted. The first argument is a function that generates a sort key for each element. The optional second argument, `ascending`, selects ascending (the default) or descending order.
- sample: Returns a sampled subset of the source RDD. It takes three input parameters. The first (`withReplacement`) specifies whether an element may be sampled more than once. The second (`fraction`) specifies the expected ratio of the sample size to the source RDD size; the actual sample size can vary. The third (`seed`) is an optional seed for the random number generator.
- union: Returns the union of this RDD and another one. Duplicate elements are kept.
- intersection: Returns the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did.
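
The differences between some of these transformations, map versus flatMap in particular, are easiest to see on a tiny input. The following is a plain-Python sketch of what each operation computes, run on ordinary lists so that it needs no Spark installation; the corresponding RDD call is named in the comment above each statement. The sample `lines` data and all variable names are made up for illustration. Unlike this sketch, Spark evaluates transformations lazily and does not guarantee element order for operations such as distinct and groupBy.

```python
from itertools import chain

# Hypothetical sample data standing in for the elements of an RDD
lines = ["to be or", "not to be"]

# rdd.map(f): exactly one output element per input element (here, a list per line)
mapped = [line.split() for line in lines]

# rdd.flatMap(f): f returns a sequence per element and the sequences are flattened
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

# rdd.filter(pred): keep only the elements for which pred returns True
long_words = [w for w in flat_mapped if len(w) >= 3]

# rdd.distinct(): drop duplicates (sorted here only to get a stable order)
distinct_words = sorted(set(flat_mapped))

# rdd.zip(other): pair up elements position by position
zipped = list(zip(flat_mapped, [len(w) for w in flat_mapped]))

# rdd.groupBy(f): collect the elements that share the same key f(element)
grouped = {}
for w in flat_mapped:
    grouped.setdefault(len(w), []).append(w)

# rdd.sortBy(keyfunc, ascending=False): sort by a computed key, descending
sorted_desc = sorted(flat_mapped, key=len, reverse=True)

print(mapped)
print(flat_mapped)
print(grouped)
```

Here `mapped` keeps one list per line, while `flat_mapped` is a single flat list of six words, which is the shape most word-level exercises need.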

**Action operations:**
- collect: Returns all the elements in the source RDD to the driver as a list. **It can crash the driver program if called on a very large RDD.**
- count: The count method returns a count of the elements in the source RDD.
- countByValue: Returns the count of each unique element in the source RDD as a dictionary mapping each value to its count.
- first: Returns the first element in the source RDD.
- max: Returns the largest element in the RDD. min works the same way for the smallest element.
- stdev: Computes the population standard deviation of this RDD’s elements (sampleStdev gives the sample standard deviation).
- take: Takes an integer N as input and returns a list containing the first N elements in the source RDD.
- takeOrdered: Takes an integer N as input and returns a list containing the N smallest elements in the source RDD, in ascending order.
- top: Takes an integer N as input and returns a list containing the N largest elements in the source RDD, in descending order.
- reduce: Aggregates the elements of the source RDD using an associative and commutative binary operator provided to it.
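
As a rough illustration of what these actions return, here is a plain-Python sketch on ordinary lists, with the matching RDD call named in each comment. It uses the five values printed by `RDD.take(5)` in this notebook; the variable names are made up, and the last part is only an assumed two-partition split meant to show why reduce asks for an associative and commutative operator.

```python
from collections import Counter
from functools import reduce
from operator import add, sub
from statistics import pstdev

# The five values printed by RDD.take(5) in this notebook
nums = [3, 91, 77, 24, 33]

n = len(nums)                                 # rdd.count()
value_counts = Counter(nums + [3])            # rdd.countByValue(), with an extra 3 added
smallest_two = sorted(nums)[:2]               # rdd.takeOrdered(2)
largest_two = sorted(nums, reverse=True)[:2]  # rdd.top(2)
total = reduce(add, nums)                     # rdd.reduce(add)
spread = pstdev(nums)                         # rdd.stdev(): population standard deviation

# Spark reduces each partition separately and then combines the partial
# results, so the grouping of elements is not fixed. Subtraction is neither
# associative nor commutative, and its result depends on that grouping.
one_partition = reduce(sub, [1, 2, 3, 4])                       # ((1 - 2) - 3) - 4
two_partitions = sub(reduce(sub, [1, 2]), reduce(sub, [3, 4]))  # (1 - 2) - (3 - 4)

print(n, smallest_two, largest_two, total)
print(one_partition, two_partitions)
```

The two subtraction results differ (-8 versus 0), whereas addition gives the same total however the elements are grouped.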

In [1]:
# Start the SparkContext
import findspark
findspark.init()
from pyspark import SparkContext
# getOrCreate() reuses a running context, so re-running this cell does not fail
sc = SparkContext.getOrCreate()

# 1. Basic stuff

In [2]:
RDD_text = sc.textFile('Numbers.csv') # reads as text
RDD = RDD_text.map(int) # parse each line as an integer
print(RDD.take(5))
# How many numbers are there

# Max value

# Min value

# Mean

# Standard deviation


[3, 91, 77, 24, 33]


In [3]:
# Count of even numbers

# Count of numbers greater than 80

# Sum of odd numbers

# Number of unique elements in the RDD

# Sum of (x^2 + 5) over all elements x


# 2. Working with Text

In [4]:
RDD_words = sc.textFile('sonnetWords.txt')
RDD_words = RDD_words.filter(lambda x: x!='')
# How many words are there

# How many unique words

# How many words have at least 4 characters

# Average number of characters per word

# Number of unique words case-insensitive

# Convert the words to UPPERCASE and show a few samples

# Print the longest word (the one with the most characters)


In [5]:
%%time
RDD_texts = sc.textFile('Moby-Dick.txt')
RDD_texts = RDD_texts.filter(lambda line: line.strip())
print(RDD_texts.take(3))
# How many sentences are there

# Show a sample sentence containing the word 'delights'

# Average number of words per sentence

# Replace all 'crazy' with 'genius' and show 3 examples

# Shortest sentence (in terms of number of words) in the text


['CHAPTER 1. Loomings.', 'Call me Ishmael. Some years ago--never mind how long precisely--having', 'little or no money in my purse, and nothing particular to interest me on']
Wall time: 1.13 s
