## Getting Spark ready

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

In [None]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.repl.eagerEval.enabled", True) # Property used to format output tables better
spark

In [None]:
spark.sparkContext.version

'3.1.1'

## Part 0



First, the text file is read into an RDD, requesting 4 partitions.

Then it is preprocessed: each element of the 'book' RDD is one cleaned line of the file.

flatMap splits the lines into words, so 'words' is an RDD of all the words in the book.

Finally, the total number of words in the text is returned: 5020.

In [None]:
import re

# Preprocessing text (lowercase, remove punctuation, and strip leading/trailing whitespace)
def removePunctuation(text):
  text = text.lower()
  # Raw string so that \s and \d reach the regex engine as escapes
  text = re.sub(r'[^a-z\s\d]', "", text)
  return text.strip()

book = spark.sparkContext.textFile('input.txt', 4).map(removePunctuation)
words = book.flatMap(lambda line: line.split(" ")) 

# Number of words
all_count = words.count()
print(all_count)

5020
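
To make the preprocessing concrete, here is a minimal plain-Python sketch of the same cleaning and splitting steps, run on two hypothetical sample lines rather than on input.txt. The name remove_punctuation and the sample text are assumptions for illustration. It also shows that splitting on a single space yields empty strings for blank lines and for doubled spaces, so a count obtained this way may include empty tokens.

```python
import re

def remove_punctuation(text):
    """Lowercase, drop everything except a-z, digits and whitespace, then trim."""
    text = text.lower()
    text = re.sub(r"[^a-z\s\d]", "", text)
    return text.strip()

# Hypothetical input lines (not taken from input.txt)
sample_lines = ["Hello, World!  It's 2021.", ""]
cleaned = [remove_punctuation(line) for line in sample_lines]

# split(" ") keeps empty strings; split() with no argument discards them
tokens_space = [w for line in cleaned for w in line.split(" ")]
tokens_any = [w for line in cleaned for w in line.split()]

print(cleaned)
print(len(tokens_space), len(tokens_any))
```

If empty tokens are not wanted, splitting with no argument (or filtering out empty strings) is one way to avoid them.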


## Part 1

This part is a simple map-reduce operation: the mapper assigns 1 to each word, and the reducer sums the assigned values for each key.

The first 10 words returned by take(10) and their numbers of occurrences are shown in the output.

The full result is saved in the '1_1_results' folder as 4 files (one per partition).

In [None]:
word_counts = words.map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

# Occurrences of each word
print(word_counts.take(10))

# Save to file
word_counts.saveAsTextFile("1_1_results")

[('are', 32), ('learning', 3), ('in', 128), ('them', 33), ('work', 3), ('concepts', 4), ('of', 162), ('perfect', 2), ('negotiation', 2), ('persuasion', 4)]
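
The map and reduce steps can be mimicked without Spark. The sketch below is a plain-Python analogue, assuming a small hypothetical token list: the list comprehension plays the role of the mapper, and the loop that adds values per key plays the role of reduceByKey with addition.

```python
from collections import defaultdict

# Hypothetical tokens standing in for the 'words' RDD
words = ["the", "cat", "and", "the", "hat", "and", "the"]

# Map step: pair every word with the value 1
pairs = [(word, 1) for word in words]

# Reduce step: sum the values that share the same key
counts = defaultdict(int)
for key, value in pairs:
    counts[key] = counts[key] + value

word_counts = dict(counts)
print(word_counts)
```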


## Part 2

Because all words are already lowercased, it is enough to keep the words that start with a lowercase 'm'. Ten of them are shown in the output.

In [None]:
# All words were lowercased, so we only need to check for a lowercase 'm'
m_s = words.filter(lambda x: x.startswith("m"))
m_s.take(10)

['make', 'make', 'mastery', 'many', 'me', 'make', 'me', 'me', 'my', 'more']

## Part 3

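First, only 5-character words are kept by filtering. There are 576 of them, and the first five are shown in the first part of the output.

A second filter then keeps only the words that do not start with a vowel.

Finally, the words are sorted and deduplicated with distinct(), and ten of them are shown in the output. Because distinct() runs after sortBy here and shuffles the data, the result is not guaranteed to be globally sorted: 'class' and 'games', for example, do not appear among the ten words listed.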

In [None]:
words_5 = words.filter(lambda x: len(x) == 5)
print("5 character words:")
print(words_5.count())
print(words_5.take(5))
print(20*"*")

not_vowel_5 = words_5.filter(lambda x: not x.startswith(("a", "e", "i", "o", "u")))
not_vowel_5.sortBy(lambda x: x).distinct().take(10)

5 character words:
576
['games', 'happy', 'these', 'games', 'class']
********************


['bonus',
 'buddy',
 'cards',
 'faith',
 'goods',
 'haven',
 'likes',
 'magog',
 'money',
 'novel']
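
When a globally sorted list of distinct words is wanted, deduplicating first and sorting last is the safer order. The sketch below shows that order in plain Python on a hypothetical word list. In PySpark the analogous change would be to call distinct() before sortBy(...), which is an assumption about the intent here, not what the cell above does.

```python
# Hypothetical 5-character words standing in for the 'words_5' RDD
words_5 = ["games", "happy", "these", "games", "class", "about", "money"]

vowels = ("a", "e", "i", "o", "u")

# Keep words that do not start with a vowel
not_vowel_5 = [w for w in words_5 if not w.startswith(vowels)]

# Deduplicate first, then sort, so the final order is dependable
distinct_sorted = sorted(set(not_vowel_5))
print(distinct_sorted)
```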

## Part 4

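Stop words are found by sorting Part 1's output by count in descending order and keeping the first all_count // 10 = 502 entries, that is, a number of words equal to 10% of the total word count. They are simply the most frequent words together with their counts.

The 10 most frequent words are shown in the output.

The counts are discarded at the end, leaving a plain list of stop_words.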

In [None]:
# Sort by number of occurrences in descending order, then take the first all_count // 10 entries
stop_words = word_counts.sortBy(lambda x: -x[1]).take(all_count//10)

print("10 most repeated words:")
print(stop_words[:10])
print(10*"*")

# Only keep words (remove counts)
stop_words = [i[0] for i in stop_words]


10 most repeated words:
[('the', 335), ('and', 238), ('of', 162), ('to', 153), ('in', 128), ('a', 106), ('is', 76), ('this', 74), ('people', 62), ('that', 60)]
**********
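
The cutoff used for the stop-word list depends on what the 10% refers to. The following plain-Python sketch, built on a hypothetical token list, contrasts a cutoff of one tenth of the total token count (as in the cell above) with one tenth of the number of distinct words. The variable names n_by_tokens and n_by_vocab are illustrative only.

```python
from collections import Counter

# Hypothetical tokens: 30 in total, 13 distinct
words = (["the"] * 9 + ["and"] * 6 + ["of"] * 5
         + "cat dog sun sky sea map pen cup box key".split())

counts = Counter(words)

n_by_tokens = len(words) // 10   # tenth of all tokens
n_by_vocab = len(counts) // 10   # tenth of the distinct words

# most_common(n) returns the n highest-count (word, count) pairs
stop_by_tokens = [word for word, _ in counts.most_common(n_by_tokens)]
stop_by_vocab = [word for word, _ in counts.most_common(n_by_vocab)]

print(n_by_tokens, stop_by_tokens)
print(n_by_vocab, stop_by_vocab)
```

The token-based cutoff is the more aggressive of the two, since the number of tokens is never smaller than the number of distinct words.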


In this part, the text file is read again but passed to a different preprocessing function.

This function lowercases each line, removes every character that is not a letter, digit, or whitespace, and then removes the stop words directly from the line. The whole process works at line level, not word level.

'book' is an RDD of lines that contain no stop words.

The full result is saved in the '1_4_results' folder, and a partial result is shown in the output below.

In [None]:

# Remove non-alphanumeric characters and stop words from each line
def remove_stopwords_keep_alphabetic(text):
  # Lowercase, keep only letters, digits and whitespace, and strip surrounding spaces
  text = text.lower()
  text = re.sub(r'[^a-z\s\d]', "", text)
  text = text.strip()
  # Remove stopwords
  words = [word for word in text.split(" ") if word not in stop_words]
  text = " ".join(words)
  return text

# Read lines and apply preprocessing
book = spark.sparkContext.textFile('input.txt', 4).map(remove_stopwords_keep_alphabetic)

# Save to file
book.saveAsTextFile("1_4_results")

book.take(10)


['involved happy loosened learn reinforcing talked familiar lectures spin test our peers able connection framing extent subject practice mastery rufo shown us anything total car salesman goes names purposes referred win lose precursor debate off count several overlapped discuss strongest admitting straight stronger persuading easier competition excel played strong focused',
 'instantly facet exceed exactly truth question saying conversation passionate nod affirm slip table winning card everybody wins gonna x minute write y reap rewards seem exaggeration secondly lied expect sort expectation later outlandish believing',
 'leads competitors huge leg group slowly suggesting moves explaining sway results favour cards too won practise ganged graded round kicked sure isnt tarnished reason thats gotten reputation accurate burned classmates wolf blend audible members speaking quiet classmate wolves acted try swaying kill victory townspeople pretty appropriately',
 'defeat collaborating faster 
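
Membership tests against a Python list scan it element by element, while a set lookup takes constant time on average. The sketch below is a plain-Python variant of the line-level filter that takes the stop words as a set. The function name remove_stopwords, the tiny stop-word set, and the sample line are hypothetical.

```python
import re

def remove_stopwords(text, stop_words):
    """Clean one line and drop every token found in stop_words."""
    text = re.sub(r"[^a-z\s\d]", "", text.lower()).strip()
    kept = [word for word in text.split(" ") if word not in stop_words]
    return " ".join(kept)

# Hypothetical stop words, stored as a set for fast lookups
stop_words = {"the", "and", "of"}

line = "The wolf and the townspeople of Soho!"
filtered = remove_stopwords(line, stop_words)
print(filtered)
```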

## Part 5

The text file is read and passed to the preprocessing function from Part 0.

'book' is an RDD of lines.

First, book is flatMapped to bigrams by a user-defined function that takes a line and returns a list of tuples of neighboring words.

Then the bigram counts are calculated by a simple map-reduce, and the bigrams occurring more than once are kept, sorted by count in descending order, and shown.

In [None]:
# Split a line into words and return its bigrams (pairs of adjacent words)
def split(line):
    words = line.split(" ")
    return [(words[i], words[i+1]) for i in range(len(words)-1)]

book = spark.sparkContext.textFile('input.txt', 4).map(removePunctuation)

bigrams = book.flatMap(lambda line: split(line)) \
              .map(lambda bigram: (bigram, 1)) \
              .reduceByKey(lambda a, b: a+b) \
              .filter(lambda x: x[1] > 1) \
              .sortBy(lambda x: -x[1])

bigrams.take(10)

[(('of', 'the'), 56),
 (('in', 'the'), 53),
 (('and', 'the'), 15),
 (('this', 'is'), 15),
 (('all', 'of'), 13),
 (('it', 'is'), 13),
 (('on', 'the'), 12),
 (('to', 'the'), 11),
 (('soho', 'square'), 11),
 (('of', 'this'), 10)]
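
Because bigrams are built one line at a time, no pair spans a line break. The sketch below illustrates this in plain Python with two hypothetical lines, using zip to pair each token with its successor. Note that it splits on any whitespace with split(), which differs slightly from the split(" ") used in the cell above.

```python
from collections import Counter

def line_bigrams(line):
    """Return the adjacent word pairs of a single line."""
    tokens = line.split()
    return list(zip(tokens, tokens[1:]))

# Hypothetical lines standing in for the 'book' RDD
lines = ["of the people", "in the end of the day"]

counts = Counter(bg for line in lines for bg in line_bigrams(line))

# Keep bigrams seen more than once, highest count first
repeated = [(bg, n) for bg, n in counts.most_common() if n > 1]
print(repeated)
```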