Implement a Spark program that performs the following:
a. Reads the posted text file for a book named “Applied Data Science.txt”
b. Reads the text file into an RDD, and then performs actions and transformations on the RDD
c. Displays the 5 most used words of length greater than 5 characters in the file (ensure your result is not case sensitive, so the words “Data” and “data” are the same and should be counted as two occurrences of the same word)


In [1]:
import findspark
findspark.init()

In [2]:
import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
spark = SparkSession.builder.getOrCreate()

In [3]:
conf = spark.sparkContext._conf.getAll()

In [4]:
spark.sparkContext.stop()

In [5]:
spark = SparkSession.builder.master("local[1]").appName("Spark_Examples").getOrCreate()

In [6]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords 
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tikle\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tikle\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
# Clean the text: remove URLs, phone numbers, punctuation, digits, and stop words.
def clean_data(data):
    data = data.lower()
    # Remove URLs
    data = re.sub(r'(https?:\/\/)(\s)*(www\.)?(\s)*((\w|\s)+\.)*([\w\-\s]+\/)*([\w\-]+)((\?)?[\w\s]*=\s*[\w\%&]*)*', '', data)
    # Remove a US-style phone number at the very end of the text (the pattern is anchored with $)
    data = re.sub(r'((1-\d{3}-\d{3}-\d{4})|(\(\d{3}\) \d{3}-\d{4})|(\d{3}-\d{3}-\d{4}))$', '', data)
    # Remove punctuation, then digits
    data = re.sub(r'[^\w\s]', '', data)
    data = re.sub(r'\d+', '', data)
    # Tokenize and drop English stop words (filtered_data is defined in the next cell)
    data = filtered_data(data)
    return data
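
To see what the punctuation and digit steps in clean_data do on their own, here is a minimal sketch on made-up sample strings. The helper name strip_noise is hypothetical and covers only the lowercase, punctuation, and digit steps, not the URL, phone-number, or stop-word handling.

```python
import re

def strip_noise(text):
    """Lowercase the text, then drop punctuation and digits (hypothetical helper)."""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)   # delete anything that is not a word character or whitespace
    text = re.sub(r'\d+', '', text)       # delete runs of digits
    return text

print(strip_noise("Data, data 42!").split())       # ['data', 'data']
print(strip_noise("A well-known result").split())  # ['a', 'wellknown', 'result']
```

Because punctuation is deleted rather than replaced with a space, hyphenated words such as "well-known" merge into a single token, which can push them over the 5-character threshold.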

In [8]:
# Tokenize the text and remove English stop words.
def filtered_data(data):
    word_tokens = word_tokenize(data)
    stop_words = set(stopwords.words('english'))
    filtered_list = [w for w in word_tokens if w not in stop_words]
    return filtered_list

In [9]:
# Read the text file from the local machine and clean it
with open("Applied_Data_Science.txt", mode="r",
          encoding="utf8", errors="ignore") as f:
    textfile = f.read()
cleaned_data = clean_data(textfile)


In [10]:
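# Create an RDD, then apply transformations and an action
word_rdds = spark.sparkContext.parallelize(cleaned_data)
# Keep words longer than 5 characters and pair each with a count of 1
word_counts = word_rdds.filter(lambda w: len(w) > 5).map(lambda w: (w, 1))
# Sum the counts per word, then take the 5 largest (the negated key sorts descending)
Top_5_words = word_counts.reduceByKey(lambda x, y: x + y).takeOrdered(5, key=lambda x: -x[1])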
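
The transformations in this cell can be checked without Spark on a small list. The following is a rough plain-Python equivalent, assuming a made-up token list: Counter plays the role of map plus reduceByKey, and heapq.nsmallest with a negated count mirrors takeOrdered with key=lambda x: -x[1]. The word added to the sort key is an assumption here to break ties alphabetically; the notebook's key does not do that, so the order of tied words is not guaranteed there.

```python
import heapq
from collections import Counter

# Made-up tokens standing in for cleaned_data (already lowercased)
tokens = ["python", "data", "python", "regression", "example",
          "regression", "python", "chapter", "model", "example"]

# filter + map + reduceByKey: count only words longer than 5 characters
counts = Counter(w for w in tokens if len(w) > 5)

# takeOrdered(n, key) returns the n smallest items under the key, so negate the count
top = heapq.nsmallest(3, counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(top)  # [('python', 3), ('example', 2), ('regression', 2)]
```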



In [11]:
# Display the 5 most used words in the book
print('The most used words in the Applied Data Science textbook are:')
for x in Top_5_words:
    print(x[0], 'occurred', x[1], 'times')

The most used words in the Applied Data Science textbook are:
python occurred 92 times
example occurred 87 times
regression occurred 83 times
chapter occurred 79 times
problem occurred 73 times
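
Step b of the task asks for the file itself to be read into an RDD, whereas the cells above read it with Python's open and parallelize the cleaned word list. A minimal sketch of an RDD-first variant follows, assuming the same file name and the spark session from In [5]. The names tokenize and top_long_words are hypothetical, and the regex tokenizer is a simplification that replaces word_tokenize and splits on hyphens and apostrophes, so counts may differ somewhat from the output above.

```python
import re

def tokenize(line):
    """Lowercase a line and return its alphabetic runs as words."""
    return re.findall(r"[a-z]+", line.lower())

def top_long_words(sc, path, n=5, min_len=5, stop_words=frozenset()):
    """Return the n most frequent words longer than min_len characters.

    sc is an existing SparkContext, e.g. spark.sparkContext.
    """
    return (sc.textFile(path)                       # RDD of lines
              .flatMap(tokenize)                    # RDD of lowercase words
              .filter(lambda w: len(w) > min_len and w not in stop_words)
              .map(lambda w: (w, 1))
              .reduceByKey(lambda a, b: a + b)      # total count per word
              .takeOrdered(n, key=lambda kv: -kv[1]))

# Usage inside the notebook (not run here):
# top_long_words(spark.sparkContext, "Applied_Data_Science.txt",
#                stop_words=frozenset(stopwords.words('english')))
```

Passing the SparkContext in as an argument keeps the helper independent of how the session was created, and doing the lowercasing inside flatMap means the case-insensitive requirement is handled by an RDD transformation rather than before the data reaches Spark.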
