# PySpark Example Notebook

This notebook demonstrates how to use Apache Spark with Python (PySpark) in Jupyter.

## Checking Versions

First, let's check which versions of Python and Spark we're using:

In [4]:
# Print Python and Spark versions
import sys
from pyspark.sql import SparkSession

# Create a SparkSession first
spark = SparkSession.builder.appName("VersionCheck").getOrCreate()

print(f"Python version: {sys.version}")
print(f"Spark version: {spark.version}")

Python version: 3.11.6 | packaged by conda-forge | (main, Oct  3 2023, 10:40:35) [GCC 12.3.0]
Spark version: 3.5.0


## Creating a DataFrame

Let's create a simple DataFrame with some sample data:

In [5]:
# Create a simple DataFrame
data = [
    (1, "John", 25),
    (2, "Alice", 30),
    (3, "Bob", 35),
    (4, "Sarah", 28)
]

# Define the column names (Spark infers the column types from the data)
columns = ["id", "name", "age"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Show the DataFrame
df.show()

+---+-----+---+
| id| name|age|
+---+-----+---+
|  1| John| 25|
|  2|Alice| 30|
|  3|  Bob| 35|
|  4|Sarah| 28|
+---+-----+---+



## Transforming Data

Now let's apply a transformation to our DataFrame by filtering rows on a column condition:

In [6]:
from pyspark.sql.functions import col

# Filter for people older than 25
older_than_25 = df.filter(col("age") > 25)
older_than_25.show()

+---+-----+---+
| id| name|age|
+---+-----+---+
|  2|Alice| 30|
|  3|  Bob| 35|
|  4|Sarah| 28|
+---+-----+---+



## Using SQL with Spark

Once a DataFrame is registered as a temporary view, you can query it with SQL:

In [7]:
# Register as a temporary view to use SQL
df.createOrReplaceTempView("people")

# Run SQL query
sql_result = spark.sql("SELECT * FROM people WHERE age > 25 ORDER BY age")
sql_result.show()

+---+-----+---+
| id| name|age|
+---+-----+---+
|  4|Sarah| 28|
|  2|Alice| 30|
|  3|  Bob| 35|
+---+-----+---+



## Working with RDDs

DataFrames are the recommended API for most work, but you can still drop down to RDDs (Resilient Distributed Datasets), the lower-level abstraction underneath them:

In [8]:
# Create a simple RDD
rdd = spark.sparkContext.parallelize(range(1, 101))

# Keep the even numbers, double them, then sum (reduce is an action, so it triggers the computation)
result = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * 2).reduce(lambda x, y: x + y)
print(f"Sum of doubled even numbers from 1 to 100: {result}")

Sum of doubled even numbers from 1 to 100: 5100
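For a small input like this, the result is easy to cross-check without Spark. The sketch below recomputes the value in plain Python and with a closed-form sum; the variable names are illustrative.

```python
# Plain-Python version of the same filter -> map -> reduce pipeline
evens = [x for x in range(1, 101) if x % 2 == 0]
expected = sum(x * 2 for x in evens)

# Closed form: the evens 2..100 sum to 2 * (1 + ... + 50) = 50 * 51,
# and doubling each term doubles the total
closed_form = 2 * (50 * 51)

print(expected, closed_form)
```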


## More Complex Example: Word Count

Let's implement the classic word count example:

In [9]:
# Sample text
text = """Apache Spark is an open-source unified analytics engine for large-scale data processing.
Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation,
which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance."""

# Create an RDD from the text
import re
text_rdd = spark.sparkContext.parallelize(text.split("\n"))

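# Lowercase each line, strip non-letter characters, split into words, and count.
# Note: stripping hyphens merges "open-source" into "opensource" and "fault-tolerance" into "faulttolerance".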
word_counts = text_rdd \
    .flatMap(lambda line: re.sub(r'[^a-zA-Z ]', '', line.lower()).split(" ")) \
    .filter(lambda word: len(word) > 0) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda x: x[1], ascending=False)

# Show the top 10 most frequent words
for word, count in word_counts.take(10):
    print(f"{word}: {count}")

spark: 4
the: 3
an: 3
for: 3
data: 3
apache: 2
implicit: 2
and: 2
provides: 2
programming: 2
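Because the sample text is tiny, the counts can be verified locally with `collections.Counter`. This is only a sanity check under the same cleaning rules (lowercase, keep letters and spaces), not a replacement for the distributed version; the name `local_counts` is an assumption.

```python
import re
from collections import Counter

text = """Apache Spark is an open-source unified analytics engine for large-scale data processing.
Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation,
which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance."""

local_counts = Counter()
for line in text.split("\n"):
    # Same cleaning as the RDD version: lowercase, drop everything but letters and spaces
    cleaned = re.sub(r"[^a-zA-Z ]", "", line.lower())
    # split() with no argument also discards empty strings
    local_counts.update(cleaned.split())

# most_common() breaks ties by first appearance, so tied words may be
# listed in a different order than in the Spark output
for word, n in local_counts.most_common(5):
    print(f"{word}: {n}")
```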


## Using DataFrame API for Word Count

The same word count can be expressed more declaratively with the DataFrame API, using built-in column functions instead of Python lambdas:

In [10]:
# Create a DataFrame from the text
from pyspark.sql.functions import explode, split, lower, regexp_replace, count

# Split the text into lines
lines_df = spark.createDataFrame(text.split("\n"), "string").toDF("line")

# Process the text and count words
word_counts_df = lines_df \
    .select(explode(split(regexp_replace(lower("line"), "[^a-zA-Z ]", ""), " ")).alias("word")) \
    .filter("length(word) > 0") \
    .groupBy("word") \
    .agg(count("*").alias("count")) \
    .orderBy("count", ascending=False)

# Show the top 10 words
word_counts_df.show(10)

+-----------+-----+
|       word|count|
+-----------+-----+
|      spark|    4|
|        for|    3|
|       data|    3|
|         an|    3|
|        the|    3|
|     apache|    2|
|   provides|    2|
|programming|    2|
|       with|    2|
|parallelism|    2|
+-----------+-----+
only showing top 10 rows



## Summary

In this notebook, we've explored:

1. Creating and manipulating DataFrames with PySpark
2. Using SQL queries with Spark
3. Working with RDDs in Python
4. Implementing a word count algorithm using both RDD and DataFrame APIs

PySpark provides a Python interface to Spark's distributed computing engine, making it accessible to data scientists and engineers who already work in Python.