<a href="https://colab.research.google.com/github/urmilapol/Blockchain-ligh-/blob/main/pysparkwordcount.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. Setup: Install PySpark:

!pip install pyspark -q: This command installs the PySpark library quietly, which is necessary to use Spark functionalities in Python.
from pyspark.sql import SparkSession: Imports the SparkSession class, the entry point to programming Spark with the Dataset and DataFrame API.
import re: Imports the regular expression module, used for text processing.
2. Initialize Spark Session:

spark = SparkSession.builder.master("local[*]").appName("WordCount").getOrCreate(): This line creates or gets an existing SparkSession.
.master("local[*]"): Configures Spark to run locally, using all available CPU cores on your machine.
.appName("WordCount"): Sets a name for your Spark application.
sc = spark.sparkContext: Retrieves the SparkContext from the SparkSession. The SparkContext is the main entry point for Spark's RDD (Resilient Distributed Dataset) API.
3. INPUT: Create a dummy text file:

with open("sample_text.txt", "w") as f: f.write("Hadoop is good for storage. Spark is good for speed. Spark and Hadoop work together."): This creates a simple text file named sample_text.txt with some example content that will be used for the word count.
4. EXTRACT: Read the file into an RDD:

text_rdd = sc.textFile("sample_text.txt"): This reads the content of sample_text.txt into a Spark RDD. Each line of the file becomes an element in the RDD.
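One detail worth checking is how many elements this RDD will hold. The sample text is written with no line breaks, so the file contains a single line. The following is a plain-Python sketch of that check, not Spark code: it writes the same text to a temporary file (the temporary directory is an assumption made here to avoid overwriting sample_text.txt) and splits it on line breaks, which is roughly what textFile does.

```python
import os
import tempfile

text = ("Hadoop is good for storage. Spark is good for speed. "
        "Spark and Hadoop work together.")

# Write the sample text to a throwaway file (illustrative location only).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_text.txt")
    with open(path, "w") as f:
        f.write(text)

    # Split on line breaks: one list element per line of the file.
    with open(path) as f:
        lines = f.read().splitlines()

print(len(lines))  # number of lines, i.e. number of elements to expect
```

With only one line, all of the splitting work in this example falls to the flatMap step that follows.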
5. TRANSFORM: The working logic:

This section performs the actual word count using a series of transformations on the RDD:
.flatMap(lambda line: re.findall(r'\w+', line.lower())): This transformation first converts each line to lowercase (line.lower()) and then uses a regular expression (re.findall(r'\w+', ...)) to extract every run of word characters (letters, digits, and underscores), which drops punctuation. flatMap then flattens the per-line lists of words into a single RDD of individual words.
.map(lambda word: (word, 1)): This transformation takes each word and pairs it with the number 1, creating a (word, 1) tuple for each occurrence of a word.
.reduceByKey(lambda a, b: a + b): This transformation groups all tuples by their key (the word) and then applies a reduction function (summing the values) to get the total count for each unique word.
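The difference between a map-style and a flatMap-style result can be seen without Spark. The sketch below uses ordinary Python lists as stand-ins for RDDs; the helper name tokenize and the two sample lines are chosen here for illustration only.

```python
import re

lines = ["Spark is good for speed.", "Spark and Hadoop work together."]

def tokenize(line):
    # Same expression as in the notebook: lowercase, then runs of word characters.
    return re.findall(r"\w+", line.lower())

# map-like result: one list of words per input line (nested).
nested = [tokenize(line) for line in lines]

# flatMap-like result: the per-line lists flattened into one list of words.
flat = [word for line in lines for word in tokenize(line)]

# The next step in the chain: pair every word with the number 1.
pairs = [(word, 1) for word in flat]

print(nested)
print(flat)
print(pairs[:3])
```

Because \w does not match apostrophes or hyphens, a word such as "don't" would be split into "don" and "t" by this pattern. That is acceptable for the sample text but worth knowing for real input.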
6. OUTPUT (Action): Trigger the computation and print results:

results = word_counts.collect(): collect() is an action that triggers the execution of all the previous transformations. It brings all the results from the distributed Spark environment back to the driver program (your Colab notebook) as a list of (word, count) tuples.
The print statements then iterate through these results and display each word along with its count.
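The list returned by collect() is not sorted, so the printed order can look arbitrary. One possible follow-up, sketched here in plain Python on a hard-coded copy of this notebook's output (the variable name ranked is introduced for illustration), is to sort on the driver by descending count and then alphabetically:

```python
# Copy of the (word, count) pairs printed by the notebook, in the order shown.
results = [
    ("hadoop", 2), ("good", 2), ("for", 2), ("storage", 1), ("speed", 1),
    ("and", 1), ("work", 1), ("is", 2), ("spark", 2), ("together", 1),
]

# Highest count first; ties broken alphabetically by word.
ranked = sorted(results, key=lambda pair: (-pair[1], pair[0]))

for word, count in ranked:
    print(f"{word}: {count}")
```

Sorting on the driver is reasonable only because the collected list is tiny. For large results the sort would more sensibly happen inside Spark before collecting.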
7. CLEANUP:

spark.stop(): This gracefully stops the SparkSession and releases all associated resources.

**This script extracts text, transforms it into individual words, counts the frequency of each word, and returns the results to the driver program, which prints them.**

In [1]:
# 1. Setup: Install PySpark
!pip install pyspark -q

from pyspark.sql import SparkSession
import re

# 2. Initialize Spark Session
# 'local[*]' tells Spark to use all available CPU cores on the machine
spark = SparkSession.builder.master("local[*]").appName("WordCount").getOrCreate()
sc = spark.sparkContext # Access SparkContext for RDD operations

# 3. INPUT: Create a dummy text file
with open("sample_text.txt", "w") as f:
    f.write("Hadoop is good for storage. Spark is good for speed. Spark and Hadoop work together.")

# 4. EXTRACT: Read the file into an RDD
text_rdd = sc.textFile("sample_text.txt")

# 5. TRANSFORM: The working logic
# We use flatMap to split lines into words, map to create pairs, and reduceByKey to sum
word_counts = text_rdd.flatMap(lambda line: re.findall(r'\w+', line.lower())) \
                     .map(lambda word: (word, 1)) \
                     .reduceByKey(lambda a, b: a + b)

# 6. OUTPUT (Action): Trigger the computation and print results
results = word_counts.collect()

print("--- Word Count Results ---")
for word, count in results:
    print(f"{word}: {count}")

# 7. CLEANUP
spark.stop()

--- Word Count Results ---
hadoop: 2
good: 2
for: 2
storage: 1
speed: 1
and: 1
work: 1
is: 2
spark: 2
together: 1


To understand WordCount, think of it as a distributed assembly line. Instead of one person counting every word in a huge book, Spark splits the book into pages and gives them to different workers to count simultaneously.
Here is the simple data flow of how Spark processes the sentence: "Spark is fast. Hadoop is reliable."
Step 1: The Input (Loading Data)
Spark reads the raw text file from storage (like your Local Disk or HDFS) into an RDD or DataFrame.

Data State: ["Spark is fast. Hadoop is reliable."]
Step 2: Tokenization (Splitting)
The flatMap operation breaks the sentences into individual words.

Data State: ["Spark", "is", "fast", "Hadoop", "is", "reliable"]
Step 3: Mapping (Key-Value Pairs)
The map operation turns each word into a pair with the number 1. This is like giving each word a "vote".

Data State: [("Spark", 1), ("is", 1), ("fast", 1), ("Hadoop", 1), ("is", 1), ("reliable", 1)]
Step 4: Shuffling (Grouping)
Spark moves all pairs with the same word to the same worker node. This shuffle is triggered by reduceByKey and is the only transformation step here where data may move across the network.

Data State:
* Worker A: ("is", 1), ("is", 1)
* Worker B: ("Spark", 1), ("Hadoop", 1), ("fast", 1), ("reliable", 1)
Step 5: Reducing (Aggregating)
The reduceByKey operation sums the "1s" for each word.

Data State: [("is", 2), ("Spark", 1), ("Hadoop", 1), ("fast", 1), ("reliable", 1)]
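The five steps can be imitated in plain Python to make the shuffle visible. This is only a simulation under stated assumptions: NUM_WORKERS and worker_for are hypothetical names, and a CRC32 checksum stands in for the partitioning hash. Spark's default partitioner also assigns keys by hashing, but with its own hash function, so the real assignment of words to workers will differ.

```python
import re
import zlib
from collections import defaultdict

NUM_WORKERS = 2  # hypothetical cluster size for the simulation

def worker_for(word):
    # Deterministic stand-in for a hash partitioner: same word, same worker.
    return zlib.crc32(word.encode("utf-8")) % NUM_WORKERS

sentence = "Spark is fast. Hadoop is reliable."        # Step 1: input

words = re.findall(r"\w+", sentence)                    # Step 2: tokenization
pairs = [(word, 1) for word in words]                   # Step 3: mapping

# Step 4: shuffle. Route every pair to the worker that owns its key.
buckets = defaultdict(list)
for word, one in pairs:
    buckets[worker_for(word)].append((word, one))

# Step 5: reduce. Each worker sums the 1s for the words it received.
counts = {}
for worker_pairs in buckets.values():
    for word, one in worker_pairs:
        counts[word] = counts.get(word, 0) + one

for worker, worker_pairs in sorted(buckets.items()):
    print(f"Worker {worker}: {worker_pairs}")
print(counts)
```

Whichever worker a word lands on, every copy of that word lands on the same one, which is what lets each worker produce a final count without consulting the others.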
Summary Table for Students
| Stage | Operation | What happens to the data? |
| --- | --- | --- |
| Input | textFile | Each line of the file becomes an RDD element (read lazily, once an action runs). |
| Split | flatMap | Sentences become individual words. |
| Map | map | Each word is paired with a count of 1. |
| Reduce | reduceByKey | Identical words are grouped and their counts are summed. |
| Result | collect | The final list is sent back to the Driver program. |



Why is this "Spark Style"?
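In a classic Hadoop MapReduce flow, intermediate data is written to disk between the Map and Reduce phases, and a multi-step job writes its output to HDFS between steps. Spark pipelines flatMap and map in memory within a single stage and keeps intermediate data in RAM where it can, so a job like this word count typically finishes much faster, and the gap widens as more steps are chained.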


