## Actions

1. **reduce**
1. **collect**
1. **count**
1. **first**
1. **take**
1. **takeSample**
1. **countByKey**
1. **saveAsTextFile**

In [0]:
sc

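### reduce
Reduce the elements of the RDD to a single value using a binary function. The function should be commutative and associative so that the result does not depend on how the data is partitioned.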

In [0]:
reduceRdd1 = sc.parallelize(range(1,10),3)

In [0]:
reduceRdd1.reduce(lambda t1, t2: t1+t2)

45
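
Why the function passed to reduce should be commutative and associative can be seen in a plain-Python sketch. It assumes the nine elements are split into three equal partitions (the split Spark actually chooses may differ), and `partitioned_reduce` is a hypothetical helper, not part of Spark: each partition is reduced on its own and the partial results are then merged.

```python
from functools import reduce


def partitioned_reduce(partitions, f):
    """Hypothetical helper: reduce each partition, then merge the partial results."""
    partials = [reduce(f, part) for part in partitions if part]
    return reduce(f, partials)


data = list(range(1, 10))
three_parts = [data[0:3], data[3:6], data[6:9]]  # assumed split into 3 partitions
one_part = [data]                                # same data in a single partition

add = lambda a, b: a + b
sub = lambda a, b: a - b  # neither commutative nor associative

print(partitioned_reduce(three_parts, add))  # 45
print(partitioned_reduce(one_part, add))     # 45
print(partitioned_reduce(three_parts, sub))  # 13
print(partitioned_reduce(one_part, sub))     # -43
```

Addition gives 45 regardless of the partitioning, while subtraction gives a different answer for each layout.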

In [0]:
vehicleRdd = sc.parallelize(["car", "bus", "bike"])

In [0]:
reduceRdd2 = vehicleRdd.map(lambda i: [i, len(i)])
reduceRdd2.collect()

[['car', 3], ['bus', 3], ['bike', 4]]

In [0]:
reduceRdd2.flatMap(lambda r: [r[1]]).collect()

[3, 3, 4]

In [0]:
reduceRdd2.flatMap(lambda r: [r[1]]).reduce(lambda t1, t2: t1+t2)

10

### collect
Return all elements of the RDD to the driver program as a list. Use it only when the result is small enough to fit in the driver's memory.

In [0]:
sc.parallelize([1,2,3]).flatMap(lambda x: [x,x,x]).collect()

[1, 1, 1, 2, 2, 2, 3, 3, 3]

### count
Return the number of elements in the RDD.

In [0]:
vehicleRdd.count()

3

### first
Return the first element in the RDD.

In [0]:
vehicleRdd.first()

'car'

### take
Take the first n elements of the RDD.

In [0]:
vehicleRdd.take(2)

['car', 'bus']

### takeSample
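Similar to take, but returns a random sample of n elements. The first argument is a boolean that selects sampling with or without replacement, the second is the sample size n, and an optional third argument sets the random generator seed, which defaults to None.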

In [0]:
sc.parallelize(range(1,11)).takeSample(True, 3)

[5, 9, 6]
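
The difference between sampling with and without replacement can be sketched in plain Python. `take_sample` below is a hypothetical analogue built on the standard `random` module; it mirrors the argument order of takeSample but is not Spark's sampling algorithm, and its results will not match Spark's for the same seed.

```python
import random


def take_sample(items, with_replacement, num, seed=None):
    """Hypothetical plain-Python analogue of takeSample; not Spark's algorithm."""
    rng = random.Random(seed)  # seed=None draws a fresh random state each call
    items = list(items)
    if not items:
        return []
    if with_replacement:
        # The same element may be picked more than once.
        return rng.choices(items, k=num)
    # Each element is picked at most once; cap the size at the population.
    return rng.sample(items, min(num, len(items)))


numbers = range(1, 11)
with_repl = take_sample(numbers, True, 3, seed=42)
without_repl = take_sample(numbers, False, 3, seed=42)
print(with_repl, without_repl)
```

Passing a fixed seed makes the sample repeatable; leaving it as None gives a different sample on each run.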

### countByKey
Count the number of elements for each key and return the result to the driver as a dictionary.

In [0]:
vehicles = sc.parallelize(["car", "bus", "bike", "bike", "cycle", "car", "ship", "truck", "jeep"]).map(lambda i: (i, 1))

In [0]:
vehicles.countByKey().items()

dict_items([('car', 2), ('bus', 1), ('bike', 2), ('cycle', 1), ('ship', 1), ('truck', 1), ('jeep', 1)])
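
As a rough plain-Python analogy (not how Spark computes it), countByKey behaves like a `collections.Counter` over the keys of the pairs. The sketch below reuses the vehicle list from the cell above and shows that only the keys matter; the values attached to them are ignored.

```python
from collections import Counter

names = ["car", "bus", "bike", "bike", "cycle", "car", "ship", "truck", "jeep"]
pairs = [(name, 1) for name in names]

# Count how many pairs carry each key; the value part is never read.
counts = Counter(key for key, _ in pairs)
print(dict(counts))
# {'car': 2, 'bus': 1, 'bike': 2, 'cycle': 1, 'ship': 1, 'truck': 1, 'jeep': 1}

# Different values, same keys: the counts are unchanged.
other_pairs = [(name, 100) for name in names]
other_counts = Counter(key for key, _ in other_pairs)
print(counts == other_counts)  # True
```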

### saveAsTextFile
Save the RDD as a text file, using the string representation of each element. The path names an output directory; Spark writes one part file per partition into it.

In [0]:
action_dir = "/FileStore/rdd/action/"

In [0]:
vehicles.saveAsTextFile(action_dir + "vehicles.txt")
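
The layout that saveAsTextFile produces can be imitated in plain Python. The sketch below is only a model under stated assumptions: `save_as_text_dir` is a hypothetical helper, the data is assumed to sit in two partitions, and the bookkeeping files that a real Spark job also writes are omitted.

```python
import os
import tempfile


def save_as_text_dir(partitions, path):
    """Hypothetical helper: write one part file per partition under path."""
    os.makedirs(path)  # raises FileExistsError if the directory already exists
    for index, part in enumerate(partitions):
        with open(os.path.join(path, f"part-{index:05d}"), "w") as handle:
            for element in part:
                # Each element is stored as its string representation, one per line.
                handle.write(str(element) + "\n")


partitions = [[("car", 1), ("bus", 1)], [("bike", 1)]]  # assumed 2 partitions
out_dir = os.path.join(tempfile.mkdtemp(), "vehicles.txt")
save_as_text_dir(partitions, out_dir)

part_files = sorted(os.listdir(out_dir))
print(part_files)  # ['part-00000', 'part-00001']

with open(os.path.join(out_dir, "part-00000")) as handle:
    first_part_lines = handle.read().splitlines()
print(first_part_lines)  # ["('car', 1)", "('bus', 1)"]
```

Despite its name, `vehicles.txt` is a directory here, and the tuples are stored as text, so reading the files back yields strings rather than tuples.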

### Example

In [0]:
baseRdd = sc.textFile(action_dir + "pg100.txt")
baseRdd.collect()

['The Project Gutenberg eBook of The Complete Works of William Shakespeare',
 '    ',
 'This ebook is for the use of anyone anywhere in the United States and',
 'most other parts of the world at no cost and with almost no restrictions',
 'whatsoever. You may copy it, give it away or re-use it under the terms',
 'of the Project Gutenberg License included with this ebook or online',
 'at www.gutenberg.org. If you are not located in the United States,',
 'you will have to check the laws of the country where you are located',
 'before using this eBook.',
 '',
 'Title: The Complete Works of William Shakespeare',
 '',
 '',
 'Author: William Shakespeare',
 '',
 'Release date: January 1, 1994 [eBook #100]',
 '                Most recently updated: January 18, 2024',
 '',
 'Language: English',
 '',
 '',
 '',
 '*** START OF THE PROJECT GUTENBERG EBOOK THE COMPLETE WORKS OF WILLIAM SHAKESPEARE ***',
 '\ufeffThe Complete Works of William Shakespeare',
 '',
 'by William Shakespeare',
 '',
 '',
 '',

In [0]:
stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

In [0]:
splitRdd = baseRdd.flatMap(lambda x: x.split()).filter(lambda x: x.isalpha() and ((x.lower()) not in stop_words))
splitRdd.take(5)

['Project', 'Gutenberg', 'eBook', 'Complete', 'Works']
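
What the split-and-filter step keeps and drops is easier to see on a tiny input. The sketch below applies the same test (`isalpha()` plus a lowercase stop-word lookup) in plain Python; the two sample lines and the two-word stop list are hypothetical stand-ins for `pg100.txt` and `stop_words`.

```python
lines = [
    "Enter thou, and thy friend shall follow.",
    "Thou art good; thy heart is good",
]  # hypothetical sample lines
stop_words = {"and", "is"}  # small stand-in for the full stop-word list

# Same steps as the RDD pipeline: split on whitespace, then filter.
tokens = [word for line in lines for word in line.split()]
kept = [w for w in tokens if w.isalpha() and w.lower() not in stop_words]
dropped = [w for w in tokens if w not in kept]

print(kept)
# ['Enter', 'thy', 'friend', 'shall', 'Thou', 'art', 'thy', 'heart', 'good']
print(dropped)
# ['thou,', 'and', 'follow.', 'good;', 'is']
```

Tokens with attached punctuation such as `'thou,'` fail `isalpha()` and are dropped entirely, and the surviving words keep their original case, so `'Thou'` and `'thou'` would be counted as different words.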

In [0]:
print(f'No. of words : {splitRdd.count()}')

No. of words : 305300


In [0]:
pairRdd = splitRdd.map(lambda x: (x,1))
pairRdd.take(5)

[('Project', 1), ('Gutenberg', 1), ('eBook', 1), ('Complete', 1), ('Works', 1)]

In [0]:
wordCountRdd = pairRdd.groupByKey().map(lambda x: (len(x[1]), x[0]))

In [0]:
wordCountRdd.sortByKey(False).take(5)

[(4498, 'thou'),
 (3906, 'thy'),
 (3246, 'shall'),
 (2267, 'Enter'),
 (2166, 'good')]
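
Per-key counting can follow two patterns, sketched here in plain Python on a hypothetical list of pairs: collect every value for a key and take the length of the group, or keep a running sum per key. In Spark the second pattern corresponds to `reduceByKey`, which can combine values within each partition before shuffling; the sketch only models the logic, not the distributed execution.

```python
from collections import defaultdict

pairs = [("thou", 1), ("thy", 1), ("thou", 1),
         ("good", 1), ("thou", 1), ("thy", 1)]  # hypothetical sample pairs

# Pattern 1: group all values per key, then count the group size.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)
grouped_counts = {key: len(values) for key, values in groups.items()}

# Pattern 2: keep a running sum per key without storing the groups.
summed_counts = defaultdict(int)
for key, value in pairs:
    summed_counts[key] += value

print(grouped_counts == dict(summed_counts))  # True

# Swap to (count, word) and sort in descending order, as in the notebook.
ranked = sorted(((n, w) for w, n in summed_counts.items()), reverse=True)
print(ranked)  # [(3, 'thou'), (2, 'thy'), (1, 'good')]
```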