Python Spark example
When you save PySpark code as a Python script and run it, you don't need to set up Spark itself directly, because PySpark scripts implicitly assume the necessary Spark context will be available when the script is executed with spark-submit or within a PySpark environment.
Here’s a complete example script with proper imports and setup that you can save and run as a Python file (e.g., example.py):
```python
from pyspark import SparkConf, SparkContext

# Initialize Spark context
conf = SparkConf().setAppName("ExampleApp")
sc = SparkContext(conf=conf)

# Sample data: a list of numbers
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Parallelize the data
distData = sc.parallelize(data)

# Filter the data to keep only even numbers
evenNumbers = distData.filter(lambda x: x % 2 == 0)

# Sum the filtered even numbers
sumEvenNumbers = evenNumbers.reduce(lambda x, y: x + y)

# Collect the filtered even numbers
collectedEvenNumbers = evenNumbers.collect()

# Print the results
print("Even Numbers: ", collectedEvenNumbers)
print("Sum of Even Numbers: ", sumEvenNumbers)

# Stop the Spark context
sc.stop()
```
To run this script, use the spark-submit command, which handles setting up the necessary Spark context:

```
spark-submit example.py
```
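The exact invocation depends on your setup; for example, for local testing you can select the master explicitly with spark-submit's standard --master option (local[*] runs Spark locally using all available cores; substitute your cluster's master URL as appropriate):

```
spark-submit --master local[*] example.py
```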
- Imports: Import `SparkConf` and `SparkContext` from `pyspark`.
- SparkConf: Create a Spark configuration object and set the application name.
- SparkContext: Initialize the Spark context with the configuration.
- Stop the Spark context: Always stop the Spark context at the end of the script to free up resources (see the try/finally sketch after this list).
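As a defensive variant (a minimal sketch, not part of the original example), wrapping the job in try/finally guarantees the context is stopped even if a transformation or action raises an error:

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("ExampleApp")
sc = SparkContext(conf=conf)

try:
    # Same pipeline as above: keep the even numbers and sum them
    distData = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    evenNumbers = distData.filter(lambda x: x % 2 == 0)
    print("Sum of Even Numbers: ", evenNumbers.reduce(lambda x, y: x + y))
finally:
    # Runs even if the job fails, so cluster resources are released
    sc.stop()
```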
This script initializes the Spark context, performs the transformations and actions, and then prints the results. Running it with spark-submit ensures that Spark is properly initialized and that the script executes in the correct environment.
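For reference, alongside Spark's own log output, the script's two print statements should produce:

```
Even Numbers:  [2, 4, 6, 8, 10]
Sum of Even Numbers:  30
```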