
## Objective
This notebook demonstrates how to implement the **WordCount** program using **PySpark RDDs (Resilient Distributed Datasets)**.

The **WordCount** example is a classic introduction to distributed data processing in Spark: it reads text, transforms it into key-value pairs, and aggregates the counts per word.

---

## Concept Overview
**WordCount** involves reading text data, splitting lines into words, and counting how many times each word appears.

### Steps:
1. Initialize a Spark Session  
2. Load the text file into an RDD  
3. Split lines into words using `flatMap()`  
4. Map words to `(word, 1)` pairs using `map()`  
5. Aggregate word counts using `reduceByKey()`  
6. Collect and display results  

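| Operation | Description |
|-----------|-------------|
| `textFile()` | Reads a text file into an RDD with one element per line |
| `flatMap()` | Transformation: splits each line into words and flattens the result |
| `map()` | Transformation: creates key-value pairs `(word, 1)` |
| `reduceByKey()` | Transformation: combines the values of each key (here, by summing) |
| `collect()` | Action: triggers the computation and returns all results to the driver |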
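
The steps above can be previewed without a cluster. The following is a minimal plain-Python sketch of the same logic, not Spark code: the `lines` list is made-up sample data standing in for the text file, and each stage only imitates what the corresponding RDD operation does.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical sample lines standing in for the contents of a text file
lines = ["the quick brown fox", "the lazy dog", "the fox"]

# flatMap-like step: split every line and flatten into one sequence of words
words = list(chain.from_iterable(line.split() for line in lines))

# map-like step: pair each word with an initial count of 1
pairs = [(word, 1) for word in words]

# reduceByKey-like step: combine the values of equal keys with x + y
counts = defaultdict(int)
for word, one in pairs:
    counts[word] = counts[word] + one

word_counts = dict(counts)
print(word_counts)
```

In Spark the same work is spread across partitions, which is why the function passed to `reduceByKey()` should be associative and commutative, as addition is.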

---

### Why Use `flatMap()`?
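- `map()` → returns exactly one output element per input line, so splitting inside `map()` yields an RDD of word lists (nested structure)
- `flatMap()` → flattens those per-line lists and returns a single RDD of individual words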
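
The difference is easiest to see on a tiny input. This sketch uses plain Python lists to imitate the two behaviours. It is an illustration under that assumption and not Spark code, and the two sample lines are made up.

```python
from itertools import chain

# Hypothetical two-line input
lines = ["hello world", "hello spark"]

# map-like: one output element per input line, so the result is nested
mapped = [line.split() for line in lines]

# flatMap-like: the per-line lists are flattened into one list of words
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(mapped)       # [['hello', 'world'], ['hello', 'spark']]
print(flat_mapped)  # ['hello', 'world', 'hello', 'spark']
```

WordCount needs the flattened form, because the next step pairs each individual word with a count of 1.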

---


In [None]:
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.getOrCreate()

In [None]:
# Load the text file into an RDD
filePath = 'samplefile.txt'
linesRdd = spark.sparkContext.textFile(filePath)

# Display the contents of the RDD
# (collect() returns every element to the driver, so use it only on small data)
linesRdd.collect()

In [None]:
# Split each line into words using flatMap()
wordsRdd = linesRdd.flatMap(lambda x: x.split())
wordsRdd.collect()

In [None]:
# Map each word to a key-value pair (word, 1)
wordsMapRdd = wordsRdd.map(lambda x: (x, 1))
wordsMapRdd.collect()

In [None]:
# Reduce by key to count occurrences of each word
wordCountRdd = wordsMapRdd.reduceByKey(lambda x, y: x + y)
wordCountRdd.collect()

In [None]:
# Combine all transformations into a single chain
# (parentheses allow line breaks; a backslash followed by trailing spaces is a syntax error)
(
    spark.sparkContext.textFile('samplefile.txt')
    .flatMap(lambda x: x.split())
    .map(lambda x: (x, 1))
    .reduceByKey(lambda x, y: x + y)
    .collect()
)