### Question 1: Find the top N most frequent words in a large text file

#### Sample Data
Let’s assume you have a text file named sample_1.txt with the following content:

```text
Hello world
Hello from PySpark
PySpark is awesome
Hello PySpark world
```

- **Load the Data:** Read the text file into a DataFrame or RDD.
- **Tokenize the Text:** Split the text into words.
- **Count Word Frequencies:** Count the occurrences of each word.
- **Sort and Extract Top N Words:** Sort the words by frequency and extract the top N.
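
As a point of reference, the four steps above can be sketched in plain Python on the sample lines. This is only a minimal, single-machine illustration for cross-checking results on small inputs, not the PySpark solution itself. The function name `top_n_words` and the tie-breaking rule (alphabetical among equal counts) are assumptions made for this sketch.

```python
from collections import Counter

# Sample lines, copied from the sample data above
lines = [
    "Hello world",
    "Hello from PySpark",
    "PySpark is awesome",
    "Hello PySpark world",
]


def top_n_words(lines, n):
    """Return the n most frequent lowercased words as (word, count) pairs."""
    # Tokenize: split each line on whitespace and lowercase every word
    words = [word.lower() for line in lines for word in line.split()]
    # Count word frequencies
    counts = Counter(words)
    # Sort by descending count, then alphabetically (assumed tie-break), keep n
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


print(top_n_words(lines, 3))  # [('hello', 3), ('pyspark', 3), ('world', 2)]
```

This approach holds every word in memory on one machine, which is why a distributed engine such as Spark is the better fit once the file is genuinely large.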

In [0]:
from pyspark.sql.functions import col, split, explode, lower

# Load the data: spark.read.text returns one row per line in a single string column named "value"
file_path = "dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/sample_1.text"
df = spark.read.text(file_path)
df.show()

+-------------------+
|              value|
+-------------------+
|        Hello world|
| Hello from PySpark|
| PySpark is awesome|
|Hello PySpark world|
+-------------------+



In [0]:
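# Tokenize the text into words.
# split() on its own yields one array of words per input line
words_df = df.select(split(col("value"), " ").alias("word"))
words_df.show(truncate=False)

# explode() then flattens each array into one row per word
words_df = df.select(explode(split(col("value"), " ")).alias("word"))
words_df.show(truncate=False)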

+-----------------------+
|word                   |
+-----------------------+
|[Hello, world]         |
|[Hello, from, PySpark] |
|[PySpark, is, awesome] |
|[Hello, PySpark, world]|
+-----------------------+

+-------+
|word   |
+-------+
|Hello  |
|world  |
|Hello  |
|from   |
|PySpark|
|PySpark|
|is     |
|awesome|
|Hello  |
|PySpark|
|world  |
+-------+
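
One caveat with splitting on a single space: repeated spaces produce empty tokens, and punctuation stays attached to the words. The sample file above triggers neither problem, but real text usually does. The plain-Python sketch below illustrates the effect and one possible alternative, extracting runs of word characters with a regular expression. The input string and the helper name `tokenize` are made up for illustration.

```python
import re


def tokenize(line):
    """Return lowercased runs of word characters (letters, digits, underscore)."""
    return re.findall(r"\w+", line.lower())


# Hypothetical input with a comma, a double space, and an exclamation mark
messy = "Hello,  world!"

# Splitting on a single space keeps punctuation and yields an empty token
print(messy.split(" "))  # ['Hello,', '', 'world!']

# Extracting word characters drops both the punctuation and the empty token
print(tokenize(messy))   # ['hello', 'world']
```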



In [0]:
# Convert to lowercase so that counting is case-insensitive
words_df = words_df.withColumn("word", lower(col("word")))

# Count the occurrences of each word.
word_counts_df = words_df.groupBy(col("word")).count()
word_counts_df.show()

+-------+-----+
|   word|count|
+-------+-----+
|  hello|    3|
|     is|    1|
|pyspark|    3|
|   from|    1|
|  world|    2|
|awesome|    1|
+-------+-----+



In [0]:
# Sort the words by frequency and extract the top N; here N = 3.
top_n = 3
# Break ties on the word itself so the order among equal counts is deterministic
sorted_word_count_df = word_counts_df.sort(col("count").desc(), col("word").asc())
top_words_df = sorted_word_count_df.limit(top_n)

# Show the results
top_words_df.show()

+-------+-----+
|   word|count|
+-------+-----+
|  hello|    3|
|pyspark|    3|
|  world|    2|
+-------+-----+



- **explode():** Transforms an array or map column into multiple rows: one row per array element, or one row per key-value pair for a map. Here it flattens each line's array of words into one row per word, which is what makes the per-word `groupBy` possible.
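
To make the flattening concrete, here is a rough plain-Python analogue of what `explode()` does to an array column. It is not Spark code: the helper name `explode_rows` is hypothetical, and rows are modelled as plain lists.

```python
def explode_rows(rows):
    """Flatten a list of lists into one output item per inner element."""
    return [element for row in rows for element in row]


# The arrays produced by split() on the sample data
arrays = [
    ["Hello", "world"],
    ["Hello", "from", "PySpark"],
    ["PySpark", "is", "awesome"],
    ["Hello", "PySpark", "world"],
]

words = explode_rows(arrays)
print(len(words))  # 11, one item per word
print(words[:3])   # ['Hello', 'world', 'Hello']
```

As in this sketch, an empty array contributes no rows to the result.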