# Spark with Scala Example Notebook

This notebook demonstrates how to use Apache Spark with Scala in Jupyter.

## Checking Versions

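First, let's check which versions of Scala and Spark the kernel is running. The cells in this notebook use `spark` (the `SparkSession`) and `sc` (the `SparkContext`) without defining them, so both are assumed to be pre-created by the notebook kernel: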

In [1]:
// Print Scala and Spark versions
println(s"Scala version: ${scala.util.Properties.versionString}")
println(s"Spark version: ${spark.version}")

Scala version: version 2.12.18
Spark version: 3.5.0
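
The cell above relies on a `spark` value that the kernel provides. Outside the notebook, for example in a standalone script, the session would have to be created explicitly. A minimal sketch, assuming Spark is on the classpath and a local master is acceptable; the application name `version-check` is a hypothetical placeholder:

```scala
import org.apache.spark.sql.SparkSession

// Reuse the active session if one exists (as in a notebook), otherwise start a local one.
val spark = SparkSession.builder()
  .appName("version-check")   // hypothetical application name
  .master("local[*]")         // assumption: run locally on all available cores
  .getOrCreate()

// The SparkContext used for RDD work is reachable from the session.
val sc = spark.sparkContext

println(s"Scala version: ${scala.util.Properties.versionString}")
println(s"Spark version: ${spark.version}")
```

If a session is already active, `getOrCreate()` returns it, so running this inside the notebook should not start a second one.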


## Creating a DataFrame

Let's create a simple DataFrame with some sample data:

In [2]:
// Create a simple DataFrame
val data = Seq(
  (1, "John", 25),
  (2, "Alice", 30),
  (3, "Bob", 35),
  (4, "Sarah", 28)
)

val df = spark.createDataFrame(data).toDF("id", "name", "age")

// Show the DataFrame
df.show()

+---+-----+---+
| id| name|age|
+---+-----+---+
|  1| John| 25|
|  2|Alice| 30|
|  3|  Bob| 35|
|  4|Sarah| 28|
+---+-----+---+



data = List((1,John,25), (2,Alice,30), (3,Bob,35), (4,Sarah,28))
df = [id: int, name: string ... 1 more field]


[id: int, name: string ... 1 more field]
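
The echoed summary `[id: int, name: string ... 1 more field]` truncates the schema, and the column types above are inferred from the tuples. When the types need to be stated rather than inferred, the schema can be spelled out. A sketch under the assumption that a `SparkSession` is available; the names `schema`, `rows`, and `people` are introduced here for illustration:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Explicit schema: column names, types, and nullability are stated up front.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = false)
))

// Same sample data as above, expressed as generic rows.
val rows = Seq(
  Row(1, "John", 25),
  Row(2, "Alice", 30),
  Row(3, "Bob", 35),
  Row(4, "Sarah", 28)
)

val people = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

// Print the full schema instead of relying on the truncated echo.
people.printSchema()
```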

## Transforming Data

Now let's apply a transformation to the DataFrame, keeping only the rows where `age` is greater than 25:

In [3]:
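// The $"..." column syntax comes from spark.implicits._
// (some kernels import it automatically; importing it again is harmless)
import spark.implicits._

// Filter for people older than 25
val olderThan25 = df.filter($"age" > 25)
olderThan25.show()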

+---+-----+---+
| id| name|age|
+---+-----+---+
|  2|Alice| 30|
|  3|  Bob| 35|
|  4|Sarah| 28|
+---+-----+---+



olderThan25 = [id: int, name: string ... 1 more field]


[id: int, name: string ... 1 more field]
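
The same filter can also be written as a SQL query once the DataFrame is registered as a temporary view. A minimal sketch, assuming a `SparkSession` is available; the view name `people` and the value `olderThan25Sql` are hypothetical names chosen for this example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Rebuild the sample DataFrame so the sketch is self-contained.
val df = spark.createDataFrame(Seq(
  (1, "John", 25),
  (2, "Alice", 30),
  (3, "Bob", 35),
  (4, "Sarah", 28)
)).toDF("id", "name", "age")

// Register the DataFrame under a hypothetical view name so SQL can refer to it.
df.createOrReplaceTempView("people")

// Same predicate as df.filter($"age" > 25), expressed in SQL.
val olderThan25Sql = spark.sql("SELECT id, name, age FROM people WHERE age > 25")
olderThan25Sql.show()
```

Both forms go through the same query planner, so the choice between them is largely a matter of readability.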

## Working with RDDs

DataFrames are the higher-level API used in the cells above, but you can also work directly with RDDs (Resilient Distributed Datasets), the lower-level distributed collections:

In [4]:
// Create a simple RDD
val rdd = sc.parallelize(1 to 100)

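// Keep the even numbers and double them (filter and map are lazy transformations),
// then sum the results (reduce is the action that actually runs the job)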
val result = rdd.filter(_ % 2 == 0).map(_ * 2).reduce(_ + _)
println(s"Sum of doubled even numbers from 1 to 100: $result")

Sum of doubled even numbers from 1 to 100: 5100


rdd = ParallelCollectionRDD[0] at parallelize at <console>:25
result = 5100


5100
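
The result can be cross-checked without Spark. The even numbers from 1 to 100 are 2, 4, ..., 100; there are 50 of them and their sum is 50 * 51 = 2550, so doubling each term gives 5100. A plain-Scala sketch of the same pipeline on a local collection (the names `local`, `n`, and `closedForm` are introduced here for illustration):

```scala
// Same filter/map steps as the RDD version, run on an ordinary Scala range.
val local = (1 to 100).filter(_ % 2 == 0).map(_ * 2).sum

// Closed form: the n = 50 even numbers sum to n * (n + 1); doubling gives 2 * n * (n + 1).
val n = 50
val closedForm = 2 * n * (n + 1)

println(s"local = $local, closedForm = $closedForm")
// prints "local = 5100, closedForm = 5100"
```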

## More Complex Example: Word Count

Let's implement the classic word count example:

In [5]:
// Sample text
val text = """Apache Spark is an open-source unified analytics engine for large-scale data processing.
Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation,
which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance."""

// Create an RDD from the text
val textRDD = sc.parallelize(text.split("\\n"))

// Lowercase each line, strip everything except letters and spaces,
// split on spaces, drop empty tokens, then count each word
val wordCounts = textRDD
  .flatMap(line => line.toLowerCase.replaceAll("[^a-zA-Z ]", "").split(" "))
  .filter(word => word.length > 0)
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)

// Show the 10 most frequent words (the order among words with equal counts is not guaranteed)
wordCounts.take(10).foreach(println)

(spark,4)
(for,3)
(the,3)
(data,3)
(an,3)
(interface,2)
(entire,2)
(clusters,2)
(programming,2)
(provides,2)


textRDD = ParallelCollectionRDD[3] at parallelize at <console>:31
wordCounts = MapPartitionsRDD[12] at sortBy at <console>:39


MapPartitionsRDD[12] at sortBy at <console>:39
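
One consequence of the cleaning step is worth seeing in isolation. Because hyphens are deleted rather than replaced with spaces, `open-source` becomes the single token `opensource`, and `fault-tolerance` is counted separately from the two words `fault tolerance`. A plain-Scala sketch of the per-line tokenization, where `tokenize` is a hypothetical helper name that does not appear in the notebook:

```scala
// Hypothetical helper mirroring the flatMap + filter steps of the word count.
def tokenize(line: String): Seq[String] =
  line.toLowerCase
    .replaceAll("[^a-zA-Z ]", "")   // drop everything except letters and spaces
    .split(" ")
    .filter(_.nonEmpty)             // discard empty tokens left by repeated spaces
    .toSeq

println(tokenize("open-source, fault tolerance and fault-tolerance.").mkString(", "))
// prints "opensource, fault, tolerance, and, faulttolerance"
```

If hyphenated words should instead be split into their parts, one option would be to replace non-letter characters with a space rather than an empty string; that would change the counts shown above.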

## Summary

In this notebook, we've explored:

1. Checking the Scala and Spark versions
2. Creating and filtering DataFrames
3. Working with RDDs
4. Implementing a word count algorithm

These are fundamental operations in Spark that form the building blocks for more complex data processing tasks.