## Module 11: Kafka & Streaming in Python

We can use Spark (through `pyspark`) to handle streaming data. Spark is a popular choice because it

1. Is fast, fault-tolerant, and works with lots of software
2. Can be used with Java, Scala, Python, and R
3. Handles big data
4. Runs machine learning models on big data

Spark SQL DataFrame functions can still be used with streaming to reduce the learning curve.

Spark has at least two different interfaces for handling streaming data.

1. Spark Streaming (more manual, customizable)
    - uses discretized streams, or `DStreams` (internally, a DStream is a sequence of RDDs, the base object Spark uses to hold data)
2. Spark Structured Streaming (much easier to use)
    - can use Spark DataFrames with SQL type functions

#### Spark SQL Recap (not pandas-on-spark)

- Use `pyspark.sql.SparkSession` to create a Spark instance. 
- DataFrames are created and implemented on top of RDDs (retrieve metadata about a DataFrame using `df.printSchema()`)
- DataFrames are stored across the cluster on RDDs
    - when _transformations_ are done (e.g. `.groupBy()`, `.filter()`, etc.), lazy evaluation is used: Spark only records the plan and nothing is executed yet. Common transformations include `.select()` to subset columns, `.withColumn()` to create a new column from another, and `.filter()` to subset rows via a condition. We also have summarizations (often after grouping with `.groupBy()`) such as `.avg()`, `.sum()`, `.count()`, etc.
    - when _actions_ (e.g. `show(n)`, `take(n)`, `collect()`) are done, computation starts and results are returned

Recall that DataFrame and Spark SQL share the same execution engine, so they can be used interchangeably. We can create temporary views using `df.createOrReplaceTempView("df")` and write SQL statements using `spark.sql("SELECT sex, age FROM df LIMIT 4")`.

#### Spark Structured Streaming

> Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once processing without the user having to reason about streaming.


Spark uses micro-batching, processing the stream in small batches at trigger intervals that you set (end-to-end latencies can be as low as roughly 100 milliseconds). The general process is

1. Create a Spark session (already available when running `pyspark`)
2. Read in a stream
    - stream from a file, terminal, or use something like Kafka
3. Set up transformations/aggregations to do (mostly using SQL type functions)
    - perhaps over windows
4. Write a query to implement the transformations and define output type
    - console (for debugging)
    - file (such as .csv)
    - database
5. The above won't process data until you `.start()` the query!
6. Continues for as long as specified or until you terminate it

### Example

Do a basic word count operation using Spark SQL. To prepare the data, we'll use

- `split(str, regex, limit)` - splits `str` around occurrences that match `regex` and returns an array with a length of at most `limit` (when `limit` is positive)
- `explode(expr)` - separates the elements of an array `expr` into multiple rows, or the elements of map `expr` into multiple rows and columns