In [4]:
# Import necessary modules from PySpark
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col

In [5]:
# Create a Spark session named "WordCount"
spark = SparkSession.builder.appName("WordCount").getOrCreate()

# ------------------------------------------
# To read from a text file instead, this line would load it as a DataFrame
# (one row per line of the file, in a single column named "value"):
# df = spark.read.text("/content/drive/MyDrive/word_count.txt")
# ------------------------------------------

# Instead of reading from a file, we create a DataFrame from a hardcoded input string
# The DataFrame will have one column named "value" and one row containing the sentence
df = spark.createDataFrame([("I love you. You are the love of my life",)], ["value"])

# Perform the word count:
word_counts = (
    # Split the sentence into words using whitespace and explode into multiple rows (one word per row)
    df.select(explode(split(col("value"), "\\s+")).alias("word"))
    # Group by each unique word
    .groupBy("word")
    # Count the occurrences of each word
    .count()
    # Order the results in descending order of count
    .orderBy(col("count").desc())
)

# Show the final word count result
word_counts.show()

+----+-----+
|word|count|
+----+-----+
|love|    2|
|you.|    1|
|life|    1|
| You|    1|
| the|    1|
|  my|    1|
| are|    1|
|  of|    1|
|   I|    1|
+----+-----+
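
The output above counts "you." and "You" as separate words, because the split is on whitespace only and the grouping is case-sensitive. The following is a minimal plain-Python sketch of one way to normalize tokens before counting. The helper name normalized_word_count is hypothetical, and the punctuation rule assumes ASCII text. In PySpark the same idea would typically be expressed with lower and regexp_replace from pyspark.sql.functions applied to the column before split.

```python
import re
from collections import Counter


def normalized_word_count(text):
    """Count words after lowercasing and stripping punctuation (illustrative helper)."""
    # Lowercase so that "You" and "you" compare equal
    lowered = text.lower()
    # Drop every character that is not a lowercase letter, digit, or whitespace
    # (assumption: ASCII input; other alphabets would need a different pattern)
    cleaned = re.sub(r"[^a-z0-9\s]", "", lowered)
    # str.split() with no arguments splits on runs of whitespace
    # and never produces empty tokens
    return Counter(cleaned.split())


counts = normalized_word_count("I love you. You are the love of my life")
print(counts["you"], counts["love"])
```

With this normalization, "you." and "You" fall into the same bucket, so "you" ties with "love" instead of appearing twice with a count of 1.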



In [6]:
# Stop the Spark session to release resources
spark.stop()

In [8]:
"""
🔹 1. PySpark
PySpark is the Python API for Apache Spark, an open-source distributed computing framework for big data processing.

It enables scalable data analysis using Python.

spark.createDataFrame([...], ["value"]) creates a DataFrame with one row and one column (value) containing the input string.

This is useful for testing logic without loading an external file.

🔹 4. Functions from pyspark.sql.functions

from pyspark.sql.functions import explode, split, col
These functions build column expressions that are used inside DataFrame transformations:

✅ split(col("value"), "\\s+")
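Splits the string in the value column on runs of one or more whitespace characters (the regex \\s+ matches spaces, tabs, newlines, etc.). Punctuation stays attached to its word, so "you." and "You" are distinct tokens.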

Returns an array of words.

✅ explode(...)
Takes an array column (such as a list of words) and returns a new row for each element in the array.

So, one sentence becomes multiple rows with one word each.

✅ col("column_name")
Refers to a column in a DataFrame by name.

Used for selecting or manipulating DataFrame columns.

🔹 5. DataFrame Operations
✅ .select(...)
Selects columns or column expressions from the DataFrame.

Here it selects the exploded words and names the resulting column "word".

✅ .groupBy("word").count()
Groups rows by the "word" column.

.count() computes how many times each word occurs and returns a DataFrame with the columns "word" and "count".

✅ .orderBy(col("count").desc())
Orders the word counts in descending order, so the most frequent words appear first. Rows with equal counts appear in no guaranteed order.
"""
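
The stages described in the notes above (split, explode, groupBy/count, orderBy) can be mirrored in plain Python to make the shape of each intermediate result visible. This is only an analogy that assumes a single in-memory row. It does not reflect how Spark plans or distributes the query, and the variable names are illustrative.

```python
import re
from collections import Counter

# One "row" with a single "value" column, like the DataFrame in the notebook
rows = [{"value": "I love you. You are the love of my life"}]

# split: each row's string becomes a list of tokens (an array column in Spark)
arrays = [re.split(r"\s+", row["value"]) for row in rows]

# explode: emit one output row per array element
exploded = [{"word": word} for tokens in arrays for word in tokens]

# groupBy("word").count(): tally the occurrences of each distinct word
counts = Counter(row["word"] for row in exploded)

# orderBy(col("count").desc()): sort by count, highest first
ordered = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

print(len(exploded), len(counts), ordered[0])
```

One input row turns into ten exploded rows, which collapse to nine groups because only "love" repeats.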
