# WordCount using Spark DataFrame API

- **SparkSession**: The entry point to Spark for reading data and working with DataFrames.
- **builder**: Starts the configuration for Spark.
- **.appName("DF-WordCount")**: Gives your Spark application a name (you'll see this in the Spark UI).
- **.getOrCreate()**: Returns the existing Spark session if there is one; otherwise creates a new one.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode, col

spark = SparkSession.builder.appName("DF-WordCount").getOrCreate()

## Input
The input is a plain text file named `samplefile1.txt`. `spark.read.text` loads it as a DataFrame with one row per line and a single string column named `value`.

In [None]:
df = spark.read.text("samplefile1.txt")

In [None]:
df.show(5, truncate=False)

## WordCount
`split` breaks each line on single spaces into an array of words. `explode` then expands that array into multiple rows (one row per word), and `groupBy("word").count()` counts how many rows belong to each word. Finally, the result is sorted by count in descending order.

In [None]:
word_count_df = (
    df
    .withColumn("word", explode(split(col("value"), " ")))
    .groupBy("word")
    .count()
    .orderBy(col("count").desc())
)
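
To make the individual steps concrete, here is a minimal plain-Python sketch of the same logic, with no Spark involved. The two sample lines are hypothetical and only stand in for rows of the `value` column; each step is labelled with the DataFrame call it mirrors.

```python
from collections import Counter

# Hypothetical sample lines standing in for rows of the `value` column.
lines = ["spark makes big data simple", "big data with spark"]

# split: each line becomes a list of words (split on a single space).
arrays = [line.split(" ") for line in lines]

# explode: flatten the lists so there is one entry per word.
words = [word for array in arrays for word in array]

# groupBy("word").count(): tally how many times each word occurs.
counts = Counter(words)

# orderBy(col("count").desc()): highest counts first.
ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for word, count in ordered:
    print(word, count)
```

Words with the same count appear here in first-seen order because Python's sort is stable. Spark does not guarantee any particular order among ties.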

## Transformations vs Actions
The steps that build `word_count_df` are transformations (lazy). The computation is triggered only when `show()` is called, which is an action.

In [None]:

word_count_df.show(truncate=False)
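
As a loose analogy only, Python generators show the same "describe now, run later" behaviour. The sketch below is not how Spark works internally (Spark builds and optimizes a query plan, which generators do not), and the names `to_words` and `log` are made up for illustration.

```python
log = []  # records when work actually happens


def to_words(lines):
    # Generator: the body does not run until something iterates over it.
    for line in lines:
        log.append(f"splitting: {line}")
        yield from line.split(" ")


lines = ["spark is lazy", "actions trigger work"]

# "Transformation": this only builds the pipeline; no line has been split yet.
pipeline = to_words(lines)
assert log == []

# "Action": consuming the pipeline forces the work to run.
words = list(pipeline)
print(words)
print(log)
```

The `log` list stays empty until `list(pipeline)` runs, much as `word_count_df` does no counting until `show()` is called.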

## Notes
The count is case-sensitive and punctuation is not removed, so `Spark` and `spark` are counted separately, as are `word` and `word,`, unless the text is cleaned first.
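
One possible cleaning step is sketched below in plain Python, under the assumption that words contain only ASCII letters and digits. The helper `normalize` and the sample words are hypothetical. It lowercases each word, strips every other character, and drops the empty strings that are left over (which can also come from splitting on a single space when a line contains consecutive spaces).

```python
import re
from collections import Counter


def normalize(word):
    # Lowercase, then remove everything except ASCII letters and digits.
    # Assumption: non-ASCII letters are not needed; widen the pattern otherwise.
    return re.sub(r"[^a-z0-9]", "", word.lower())


# Hypothetical raw tokens, including the cases mentioned in the note above.
raw_words = ["Spark", "spark", "word", "word,", ""]

cleaned = [normalize(w) for w in raw_words]
cleaned = [w for w in cleaned if w]  # drop empty strings

counts = Counter(cleaned)
print(counts)
```

In the Spark pipeline itself, a similar effect could likely be achieved by applying `lower` and `regexp_replace` from `pyspark.sql.functions` to the column before grouping, then filtering out empty words.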
