# Apache Spark Streaming Notebook

### **1. Overview of Spark Streaming**
- Spark Streaming is an extension of the core Spark API for scalable, high-throughput, and fault-tolerant stream processing of live data streams.
- Data can be ingested from sources like Kafka, Flume, and HDFS, and can be processed using high-level functions like map, reduce, and join.

In [1]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

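# Create a local StreamingContext with two threads and a batch interval of 1 second.
# "local[2]" matters here: a socket receiver occupies one thread, so at least one
# more thread is needed to process the received data.
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)  # Batch interval is 1 second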


### **2. Creating a Streaming Context**
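- **Batch Interval**: The length of time over which Spark collects incoming data into a single batch before processing it. It is the second argument to `StreamingContext` (1 second in this notebook).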
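
To make the batch interval concrete, here is a minimal plain-Python sketch (not Spark code) of how timestamped records could be grouped into batches. The function name `discretize` and the sample arrival times are hypothetical, chosen only for illustration. Unlike Spark, this sketch does not emit empty batches for intervals with no data.

```python
from collections import defaultdict

def discretize(events, batch_interval):
    """Group (timestamp_seconds, record) pairs into batches by arrival time.

    Batch k holds the records with
    k * batch_interval <= timestamp < (k + 1) * batch_interval.
    """
    batches = defaultdict(list)
    for timestamp, record in events:
        batches[int(timestamp // batch_interval)].append(record)
    return dict(sorted(batches.items()))

# Hypothetical arrival times (seconds since the stream started) and text lines.
events = [(0.2, "a b"), (0.9, "b"), (1.1, "c"), (2.5, "a"), (2.7, "c c")]

print(discretize(events, batch_interval=1))
# {0: ['a b', 'b'], 1: ['c'], 2: ['a', 'c c']}
print(discretize(events, batch_interval=2))
# {0: ['a b', 'b', 'c'], 1: ['a', 'c c']}
```

Doubling the interval halves the number of batches but makes each record wait longer before its batch is complete.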

In [2]:
# Create a DStream that connects to hostname:port
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

### **3. Receiving Data Streams**
- **DStreams (Discretized Streams)**: The core abstraction representing a continuous stream of data divided into small batches. Internally, a DStream is a sequence of RDDs, one per batch interval.

In [3]:
# Count each word in each batch
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
wordCounts.pprint()  # Print the first ten elements of each RDD to the console

### **4. Transformations on DStreams**
#### Example: Word Count
- **Common Transformations**:
  - `map()`, `flatMap()`
  - `reduceByKey()`, `groupByKey()`
  - `window()`, `countByWindow()` (for time-based transformations)
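
The semantics of `flatMap()`, `map()`, and `reduceByKey()` on a single batch can be modeled in plain Python. This is only an illustrative sketch: the helpers `flat_map` and `reduce_by_key` are hypothetical stand-ins rather than Spark functions, and the sample batch is invented.

```python
from collections import defaultdict
from functools import reduce

def flat_map(func, records):
    """Apply func to every record and flatten the resulting lists."""
    return [out for record in records for out in func(record)]

def reduce_by_key(func, pairs):
    """Combine all values that share a key using the binary function func."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

# One hypothetical batch of text lines.
batch = ["to be or", "not to be"]

words = flat_map(lambda line: line.split(" "), batch)   # one word per element
pairs = [(word, 1) for word in words]                   # the map() step
counts = reduce_by_key(lambda a, b: a + b, pairs)       # sum the 1s per word

print(counts)
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark the same three steps run on every batch's RDD, with the data distributed across partitions.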

In [4]:
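# Window duration is 10 seconds, and sliding interval is 5 seconds.
# Both must be multiples of the batch interval (1 second here).
# Passing an inverse function (x - y) lets Spark update the previous window
# incrementally instead of recomputing it, but this form requires checkpointing:
# ssc.checkpoint(...) must be called before ssc.start().
windowedWordCounts = words.map(lambda word: (word, 1)).reduceByKeyAndWindow(
    lambda x, y: x + y,  # add counts from batches entering the window
    lambda x, y: x - y,  # subtract counts from batches leaving the window
    10,                  # window duration in seconds
    5                    # sliding interval in seconds
)
windowedWordCounts.pprint()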
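
The idea behind the inverse function can be sketched in plain Python: instead of re-summing every batch in the window, the previous result is updated by adding the batches that entered and subtracting the batches that left. This is a simplified model, not Spark's implementation. The function names are hypothetical, window and slide are measured in batches rather than seconds, and the sample counts are invented.

```python
from collections import Counter

def window_counts_naive(batches, window, slide):
    """Recompute every window from scratch by summing all batches inside it."""
    results = []
    for end in range(slide, len(batches) + 1, slide):
        total = Counter()
        for batch in batches[max(0, end - window):end]:
            total.update(batch)
        results.append(dict(total))
    return results

def window_counts_incremental(batches, window, slide):
    """Update the previous window: add entering batches, subtract leaving ones."""
    results = []
    current = Counter()
    for end in range(slide, len(batches) + 1, slide):
        for batch in batches[end - slide:end]:
            current.update(batch)        # plays the role of lambda x, y: x + y
        leaving = batches[max(0, end - slide - window):max(0, end - window)]
        for batch in leaving:
            current.subtract(batch)      # plays the role of lambda x, y: x - y
        # Drop keys whose count fell to zero so both approaches agree.
        current = Counter({k: v for k, v in current.items() if v != 0})
        results.append(dict(current))
    return results

# Hypothetical per-batch word counts, one dict per batch interval.
batches = [{"a": 1}, {"b": 2}, {"a": 1, "b": 1}, {"c": 1}, {"a": 2}, {}]

naive = window_counts_naive(batches, window=4, slide=2)
incremental = window_counts_incremental(batches, window=4, slide=2)
print(incremental)
# [{'a': 1, 'b': 2}, {'a': 2, 'b': 3, 'c': 1}, {'a': 3, 'c': 1, 'b': 1}]
```

The incremental version touches only `slide` entering and `slide` leaving batches per window, which pays off when the window is much longer than the slide.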

### **5. Managing Batch Intervals**
- **Guidelines**:
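  - A smaller batch interval lowers latency, but each batch must then finish processing within that shorter interval; otherwise batches queue up and the delay keeps growing.
  - A larger batch interval spreads the fixed per-batch scheduling overhead over more data, at the cost of higher end-to-end latency.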
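
A rough way to see why processing time must stay below the batch interval is to simulate the queue. The sketch below is plain Python under simplifying assumptions: batches are processed one at a time and in order, and the processing times are made-up numbers. `scheduling_delays` is a hypothetical helper, not a Spark API.

```python
def scheduling_delays(batch_interval, processing_times):
    """Return how long each batch waits before its processing starts.

    Batch i becomes ready at (i + 1) * batch_interval and is processed
    after the previous batch has finished.
    """
    delays = []
    free_at = 0.0  # time at which the single processing slot becomes free
    for i, processing_time in enumerate(processing_times):
        ready_at = (i + 1) * batch_interval
        start_at = max(ready_at, free_at)
        delays.append(start_at - ready_at)
        free_at = start_at + processing_time
    return delays

# Processing (0.5 s) is faster than the 1 s interval: no batch ever waits.
print(scheduling_delays(1, [0.5, 0.5, 0.5, 0.5]))  # [0, 0, 0, 0]

# Processing (1.5 s) is slower than the 1 s interval: the delay grows each batch.
print(scheduling_delays(1, [1.5, 1.5, 1.5, 1.5]))  # [0, 0.5, 1.0, 1.5]
```

In the second case the delay grows by 0.5 s per batch without bound, which is the unstable situation the guidelines warn about.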

In [5]:
# Write each batch of word counts to its own output directory.
# "output/wordcount" is a prefix; the batch time is appended to it for every batch.
wordCounts.saveAsTextFiles("output/wordcount")

### **6. Output Operations**
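- Output operations push the results of transformations out of a DStream: `pprint()` prints the first elements of each batch to the driver console, and `saveAsTextFiles()` writes each batch to files. Nothing is computed for a DStream unless an output operation is registered on it.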
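
As far as the Spark documentation describes it, `saveAsTextFiles(prefix, suffix)` names each batch's output `prefix-TIME_IN_MS[.suffix]`. The small plain-Python sketch below mimics that naming so the resulting layout is easier to picture. The helper `output_directory` and the timestamps are hypothetical, and the exact layout should be confirmed against the Spark version in use.

```python
def output_directory(prefix, batch_time_ms, suffix=None):
    """Build the per-batch output name following prefix-TIME_IN_MS[.suffix]."""
    name = f"{prefix}-{batch_time_ms}"
    return name if suffix is None else f"{name}.{suffix}"

# Two hypothetical consecutive batch times, 1 second (1000 ms) apart.
for batch_time_ms in (1736800000000, 1736800001000):
    print(output_directory("output/wordcount", batch_time_ms))
# output/wordcount-1736800000000
# output/wordcount-1736800001000
```

With a 1-second batch interval this creates one directory per second, so a short interval can produce a large number of small outputs.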

In [6]:
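# Monitor a directory and read each new file that appears in it as text lines.
# Files already present when the stream starts are not read.
fileStream = ssc.textFileStream("hdfs://localhost:9000/input")
fileStream.pprint()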

### **7. File Streams**
#### Example: Reading Data from a Directory

In [7]:
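# Example: Consuming Data from Kafka
# Note: pyspark.streaming.kafka exists only in Spark 2.x (together with the
# spark-streaming-kafka-0-8 package). It was removed in Spark 3.0, so on
# Spark 3.x this import fails with the ModuleNotFoundError shown below.
# On Spark 3.x, Kafka is read through Structured Streaming's "kafka" source instead.
from pyspark.streaming.kafka import KafkaUtils

# Arguments: ZooKeeper quorum, consumer group id, {topic name: partitions to consume}
kafkaStream = KafkaUtils.createStream(ssc, "zookeeper:2181", "spark-streaming", {"topic": 1})
# Each record is a (key, value) pair; keep only the message value
lines = kafkaStream.map(lambda x: x[1])
lines.pprint()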

ModuleNotFoundError: No module named 'pyspark.streaming.kafka'

### **8. Advanced Streaming Sources**
- Apache Kafka
- AWS Kinesis
- Flume

In [None]:
# Fault tolerance through checkpointing
ssc.checkpoint("checkpoint-directory")

### **9. Fault Tolerance**
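- **Checkpointing**: Periodically save DStream metadata and state to a fault-tolerant storage directory (for example on HDFS) so that the application can recover from failures. It is also required for stateful and inverse-function window operations.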
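
Setting a checkpoint directory only helps recovery if the application rebuilds its context from that directory on restart. A common pattern for this is `StreamingContext.getOrCreate`, sketched below. It assumes a fresh driver process with no active StreamingContext, and it reuses the master, app name, host, port, and checkpoint path from this notebook. The function names `create_context` and `recover_or_create` are hypothetical.

```python
CHECKPOINT_DIR = "checkpoint-directory"  # same path as used with ssc.checkpoint()

def create_context():
    """Build a brand-new StreamingContext; used only when no checkpoint exists."""
    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    conf = SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    sc = SparkContext.getOrCreate(conf)
    ssc = StreamingContext(sc, 1)  # 1-second batch interval
    ssc.checkpoint(CHECKPOINT_DIR)

    # The whole DStream graph must be defined inside this function so that
    # it can be restored from the checkpoint later.
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()
    return ssc

def recover_or_create():
    """Restore the context from CHECKPOINT_DIR if possible, else create it."""
    from pyspark.streaming import StreamingContext
    return StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
```

Calling `recover_or_create()` followed by `start()` and `awaitTermination()` would then either resume from the checkpoint or start from scratch, depending on whether checkpoint data is present.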

In [None]:
# Start the computation
ssc.start()

# Wait for the computation to terminate
ssc.awaitTermination()

### **10. Starting and Stopping the Streaming Context**
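
`ssc.start()` begins receiving and processing data, and `ssc.awaitTermination()` blocks until the context is stopped. For experiments it can be handy to run for a bounded time and then shut down cleanly. The helper below is a minimal sketch of that idea: `run_for` is a hypothetical name, and it expects `ssc` to be a PySpark `StreamingContext`, whose `awaitTerminationOrTimeout` and `stop` methods it calls.

```python
def run_for(ssc, timeout_seconds):
    """Start the context, wait up to timeout_seconds, then stop it if needed.

    Returns True if the context terminated on its own before the timeout.
    """
    ssc.start()
    finished = ssc.awaitTerminationOrTimeout(timeout_seconds)
    if not finished:
        # stopGraceFully=True lets already-received data finish processing;
        # stopSparkContext=True also shuts down the underlying SparkContext.
        ssc.stop(stopSparkContext=True, stopGraceFully=True)
    return finished
```

For example, `run_for(ssc, 30)` would process the stream for about 30 seconds. A stopped StreamingContext cannot be restarted, so a new one has to be created to run the computation again.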