# The goal of these exercises is to get familiar with key-value RDDs

# Cheat Sheet
**Transformation operations:**
- map: Takes a function as input and applies it to each element in the source RDD to create a new RDD
- flatMap: Takes a function that returns a sequence for each input element, and returns a new RDD formed by flattening the resulting sequences
- filter: Takes a Boolean function as input and applies it to each element in the source RDD to create a new RDD by selecting only those elements for which the input Boolean function returned true
- distinct: The distinct method of an RDD returns a new RDD containing the distinct elements in the source RDD.
- zip: takes an RDD as input and returns an RDD of pairs, where the first element in a pair is from the source RDD and the second element is from the input RDD. Both RDDs must have the same number of partitions and the same number of elements in each partition.
- groupBy: Groups the elements of an RDD according to a user-specified criterion. In each returned pair, the first item is a key and the second item is a collection of the elements mapped to that key by the input function to the groupBy method.
- sortBy: returns an RDD with sorted elements from the source RDD. It takes two input parameters. The first input is a function that generates a key for each element in the source RDD. The second argument allows you to specify ascending or descending order for sort.
- sample: Returns a sampled subset of the source RDD. It takes three input parameters. The first parameter specifies whether to sample with replacement. The second parameter specifies the ratio of the sample size to the source RDD size. The third, optional parameter is the seed for the random number generator.
- union: Returns the union of this RDD and another one. Duplicate elements are kept.
- intersection: Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did.
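
The difference between map and flatMap is easiest to see on a tiny input. The sketch below models a few of the transformations above with plain Python lists so that it runs without a Spark installation; the sample lines are made up, and the RDD calls named in the comments are the operations being modelled, not calls the code executes.

```python
from itertools import chain

# Hypothetical input standing in for the elements of an RDD of text lines.
lines = ["to be or", "not to be"]

# Models rdd.map(lambda line: line.split(" ")):
# one output element (here a list of words) per input element.
mapped = [line.split(" ") for line in lines]

# Models rdd.flatMap(lambda line: line.split(" ")):
# the per-line word lists are flattened into a single collection of words.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

# Models rdd.filter(lambda w: w != "to"): keep elements where the predicate is true.
filtered = [w for w in flat_mapped if w != "to"]

# Models rdd.distinct(). Spark does not promise any order for the result,
# so it is sorted here only to make the printout stable.
distinct = sorted(set(flat_mapped))

print(mapped)
print(flat_mapped)
print(filtered)
print(distinct)
```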

**Transformation operations on (key, value) RDDs:**
- keys: returns an RDD of only the keys in the source RDD.
- values: returns an RDD of only the values in the source RDD.
- mapValues: takes a function as input and applies it to each value in the source RDD.
- join: takes an RDD of key-value pairs as input and performs an inner join on the source and input RDDs. It returns an RDD of pairs, where the first element in a pair is a key found in both the source and input RDD and the second element is a tuple containing the values mapped to that key in the source and input RDD. See also rightOuterJoin, leftOuterJoin, and fullOuterJoin.
- sampleByKey: returns a subset of the source RDD sampled by key. It takes the sampling rate for each key as input and returns a sample of the source RDD.
- subtractByKey: takes an RDD of key-value pairs as input and returns an RDD of key-value pairs containing only those keys that exist in the source RDD, but not in the input RDD.
- groupByKey: returns an RDD of pairs, where the first element in a pair is a key from the source RDD and the second element is a collection of all the values that have the same key.
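- reduceByKey: takes an associative and commutative binary operator as input and reduces values with the same key to a single value using the specified binary operator.
- sortByKey: returns an RDD of key-value pairs sorted by key. An optional argument selects ascending or descending order.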
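
To make the semantics of groupByKey, reduceByKey, and mapValues concrete, here is a minimal plain-Python model of the three. The helper names `group_by_key` and `reduce_by_key` and the sample pairs are illustrative assumptions, not part of PySpark; the sketch only mirrors what the RDD methods return.

```python
from collections import defaultdict
from functools import reduce
from operator import add

# Hypothetical (key, value) pairs standing in for a pair RDD.
pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)]


def group_by_key(kv_pairs):
    """Model of rdd.groupByKey(): map each key to the list of all its values."""
    groups = defaultdict(list)
    for key, value in kv_pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(kv_pairs, op):
    """Model of rdd.reduceByKey(op): fold the values of each key with op."""
    return {key: reduce(op, values)
            for key, values in group_by_key(kv_pairs).items()}


grouped = group_by_key(pairs)
summed = reduce_by_key(pairs, add)

# Model of rdd.mapValues(lambda v: v * 10): only the values change.
scaled = [(key, value * 10) for key, value in pairs]

print(grouped)
print(summed)
print(scaled)
```

In Spark itself, reduceByKey combines values inside each partition before shuffling, which is why it is usually preferred over groupByKey followed by a separate reduction.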

**Action operations:**
- collect: Returns the elements in the source RDD as a list (an array in Scala). **It can crash the driver program if called on a very large RDD.**
- count: The count method returns a count of the elements in the source RDD.
- countByValue: The countByValue method returns a count of each unique element in the source RDD.
- first: The first element in the source RDD.
- max: Returns the largest element in an RDD. min works the same way for the smallest element.
- stdev: Compute the standard deviation of this RDD’s elements.
- take: takes an integer N as input and returns an array containing the first N elements in the source RDD.
- takeOrdered: takes an integer N as input and returns an array containing the N smallest elements in the source RDD.
- top: takes an integer N as input and returns an array containing the N largest elements in the source RDD.
- reduce: aggregates the elements of the source RDD using an associative and commutative binary operator provided to it.

**Action operations for (key, value) RDDs:**
- countByKey: counts the occurrences of each unique key in the source RDD.
- lookup: takes a key as input and returns a sequence of all the values mapped to that key in the source RDD.
- collectAsMap: like collect, but returns the key-value pairs as a map (a dict in PySpark). If a key occurs more than once, only one of its values is kept.
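
The three key-value actions can be modelled in a few lines of plain Python. This is a sketch on made-up pairs, intended only to show the shape of each result; which duplicate survives in the last step is a property of this plain-Python dict, so treat it as an assumption when reasoning about a real RDD.

```python
from collections import Counter

# Hypothetical (key, value) pairs; the key "a" appears twice on purpose.
pairs = [("a", 1), ("b", 2), ("a", 3)]

# Models rdd.countByKey(): number of pairs stored under each key.
count_by_key = Counter(key for key, _ in pairs)

# Models rdd.lookup("a"): every value stored under the given key.
lookup_a = [value for key, value in pairs if key == "a"]

# Models rdd.collectAsMap(): a dict holds one value per key, so duplicates
# collapse. In this plain-Python model the last pair seen for a key wins.
as_map = dict(pairs)

print(dict(count_by_key))
print(lookup_a)
print(as_map)
```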

In [2]:
#start the SparkContext
import findspark
findspark.init()
from pyspark import SparkContext
sc=SparkContext(master="local[4]")

# 1. Word Count

In [10]:
%%time
text_file = sc.textFile("sonnet.txt") # Moby-Dick.txt

Wall time: 35.2 ms


In [14]:
%%time
# split line by spaces.

#remove extra blank spaces (lambda x: x!='')

# map word to (word,1)

# count the number of occurrences of each word.


Wall time: 35.2 ms
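
One possible shape for the four steps in the cell above, modelled in plain Python so that it can be checked without Spark. The two sample lines are a made-up stand-in for `sonnet.txt`, and the RDD calls in the comments are one way to express each step (an assumption about the intended solution, not the author's code).

```python
from collections import Counter

# Hypothetical stand-in for the lines of sonnet.txt. Note the double space,
# which produces an empty string when splitting on a single space.
lines = ["shall I  compare thee", "thee to a summer"]

# Step 1, split lines by spaces:
#   words = text_file.flatMap(lambda line: line.split(" "))
words = [w for line in lines for w in line.split(" ")]

# Step 2, drop the empty strings left by repeated spaces:
#   words = words.filter(lambda x: x != '')
words = [w for w in words if w != ""]

# Step 3, map each word to (word, 1):
#   pairs = words.map(lambda word: (word, 1))
pairs = [(w, 1) for w in words]

# Step 4, add up the ones per word:
#   counts = pairs.reduceByKey(lambda a, b: a + b)
counts = Counter()
for word, one in pairs:
    counts[word] += one

print(sorted(counts.items()))
```

In Spark these four steps are transformations, so they only build an execution plan, which is consistent with the short wall time reported for this cell; the work happens when an action such as count is called.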


In [15]:
%%time
#execute the plan to count number of different words
Count= None
# find total words in the text file
Sum=None
print('Different words=%5.0f, total words=%6.0f, mean no. occurrences per word=%4.2f'%(Count,Sum,float(Sum)/Count))

Different words= 4904, total words= 17667, mean no. occurrences per word=3.60
Wall time: 5.98 s
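
The two placeholders `Count` and `Sum` correspond to two actions on the (word, count) RDD. The sketch below computes the same quantities on a small made-up list; the RDD name `word_counts_rdd` in the comments is hypothetical, and the comments show one way the actions could be written.

```python
# Hypothetical (word, count) pairs, shaped like the output of the word-count step.
word_counts = [("thee", 2), ("shall", 1), ("I", 1), ("summer", 4)]

# Number of different words:
#   Count = word_counts_rdd.count()
Count = len(word_counts)

# Total number of words, i.e. the sum of all per-word counts:
#   Sum = word_counts_rdd.values().sum()
Sum = sum(count for _, count in word_counts)

mean = float(Sum) / Count
print('Different words=%5.0f, total words=%6.0f, mean=%4.2f' % (Count, Sum, mean))
```

Because count and sum are actions, this is the point where Spark actually reads the file and runs the plan, which matches the much longer wall time of this cell.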


# 2. Finding the most common words

In [17]:
%%time
# map words to (word,frequency) pair

Wall time: 992 µs


In [18]:
%%time
# Count frequency of each word


Wall time: 30.3 ms


In [19]:
%%time
# sort the RDD by frequency of a word as key and not the word itself


Wall time: 5.92 s


In [21]:
%%time
# print the 5 most common words

most common words
356:	the
349:	of
341:	my
331:	to
326:	I
Wall time: 3.33 s
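
The steps of this section (swap to frequency-first pairs, sort by key in descending order, take five) can be sketched in plain Python as follows. The first five counts are taken from the output above; the entry for "love" is invented so that the cut-off at five has something to drop.

```python
# (word, count) pairs in arbitrary order. The ("love", 50) entry is made up.
word_counts = [("to", 331), ("love", 50), ("the", 356),
               ("I", 326), ("of", 349), ("my", 341)]

# Swap so that the frequency becomes the key:
#   by_freq = word_counts_rdd.map(lambda kv: (kv[1], kv[0]))
by_freq = [(count, word) for word, count in word_counts]

# Sort by the new key, largest first:
#   by_freq = by_freq.sortByKey(ascending=False)
by_freq.sort(key=lambda pair: pair[0], reverse=True)

# Keep the first five:
#   top5 = by_freq.take(5)
top5 = by_freq[:5]

print("most common words")
for count, word in top5:
    print("%d:\t%s" % (count, word))
```

If only the top few entries are needed, `top(5, key=lambda kv: kv[1])` or `takeOrdered(5, key=lambda kv: -kv[1])` on the original (word, count) RDD gives the same words without sorting the whole RDD.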


# 3. Extras
You can try other stuff like:
- Most common words having at least 4 characters
- Least common words starting with character 'a'
- Most common words ending with 'ing'
- Count words after removing stop words such as ['the', 'a', 'an', 'I', 'he', 'she', 'they', 'to', 'of', 'it', 'from', 'and', 'his', 'her'] and punctuation such as ['.', ',', '(', ')', '"', "'", ':', ';', '?', '!', '-']
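
As a starting point for these extras, here is a minimal plain-Python sketch of stop-word and punctuation removal followed by two of the filters. The sample line and the variable names are assumptions made for illustration; the comments name the RDD operations each step would map to.

```python
from collections import Counter

STOP_WORDS = {'the', 'a', 'an', 'I', 'he', 'she', 'they', 'to',
              'of', 'it', 'from', 'and', 'his', 'her'}
PUNCTUATION = '.,()"\':;?!-'

# Hypothetical stand-in for the lines of the text file.
lines = ["I sing of the sea, and of the singing wind!"]

# flatMap to words, then map(lambda w: w.strip(PUNCTUATION)) to remove
# punctuation from both ends of each word.
words = [w.strip(PUNCTUATION) for line in lines for w in line.split(" ")]

# filter(lambda w: w != '' and w not in STOP_WORDS)
# The comparison is case-sensitive, so "The" would not be removed.
words = [w for w in words if w != "" and w not in STOP_WORDS]

# map to (word, 1) and reduceByKey, modelled here with a Counter.
counts = Counter(words)

# filter(lambda kv: kv[0].endswith("ing")) on the (word, count) pairs.
ing_words = {w: c for w, c in counts.items() if w.endswith("ing")}

# filter(lambda kv: len(kv[0]) >= 4) on the (word, count) pairs.
long_words = {w: c for w, c in counts.items() if len(w) >= 4}

print(dict(counts))
print(ing_words)
print(long_words)
```

Lower-casing the words before the stop-word check would also catch capitalised forms, at the cost of merging words such as "I" and "i".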