# EX8-STREAM: Spark Structured Streaming

Your assignment: complete the `TODO`s and include the **output of each cell**.

#### You may need to read the [Structured Streaming API Documentation](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/index.html) to complete this lab.

### Step 1: **[PLAN A]** Start Spark Session

In [None]:
from pyspark.sql import SparkSession

try:
    spark.stop()
except NameError:
    print("SparkSession not defined")

# cluster mode
spark = SparkSession.builder \
            .appName("Spark SQL basic example") \
            .master("spark://spark:7077") \
            .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4") \
            .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
            .config("spark.hadoop.fs.s3a.access.key", "pdm_minio") \
            .config("spark.hadoop.fs.s3a.secret.key", "pdm_minio") \
            .config("spark.hadoop.fs.s3a.path.style.access", "true") \
            .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
            .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") \
            .getOrCreate()

### Step 1: **[PLAN B]** Start Spark Session

In [None]:
from pyspark.sql import SparkSession

try:
    spark.stop()
except NameError:
    print("SparkSession not defined")

# local mode
spark = SparkSession.builder \
            .appName("Spark SQL basic example") \
            .master("local[*]") \
            .config("spark.some.config.option", "some-value") \
            .getOrCreate()

### Step 2: Static DataFrame of words

In [None]:
words_df = spark.read.csv("s3a://public/100words.txt.gz") # plan A
#words_df = spark.read.csv("data/100words.txt.gz") # plan B
words_df = words_df.withColumnRenamed("_c0", "word")
words_df.show()

### Step 3: Get meaning for each word (use [Free Dictionary API](https://dictionaryapi.dev/))

In [None]:
from pyspark.sql.functions import *
import requests_ratelimiter

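def get_word_meaning(word, session):
    """Return the first definition of `word`, or "__NOT_FOUND__" if there is none."""
    url = f"https://api.dictionaryapi.dev/api/v2/entries/en/{word}"
    response = session.get(url)
    if response.status_code == 404:
        # The API answers 404 for words it has no entry for
        return "__NOT_FOUND__"
    response.raise_for_status()  # Fail loudly on any other HTTP error
    json_data = response.json()

    try:
        meaning = json_data[0]['meanings'][0]['definitions'][0]['definition']
    except (IndexError, KeyError, TypeError):
        # Unexpected payload shape: treat it as a missing definition
        meaning = "__NOT_FOUND__"

    return meaning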


try:
    words_with_meaning_df.cache()
    words_with_meaning_df.show()
except NameError:
    print("words_with_meaning_df not defined")
    meanings = []
    session = requests_ratelimiter.LimiterSession(per_second=1)
    for word in [r.word for r in words_df.collect()]:
        meanings.append((word, get_word_meaning(word, session)))
        print(word)
    words_with_meaning_df = spark.createDataFrame(meanings, ["word", "meaning"])
    words_with_meaning_df.cache()
    words_with_meaning_df.show()
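
The lookup in `get_word_meaning` relies on the nesting of the JSON response. A minimal sketch of that extraction, run against a hand-written payload instead of the live API, may help when debugging. The payload values and the helper name `first_definition` are assumptions made for illustration; they are not real API output.

```python
# Hand-written payload that mimics the nesting get_word_meaning indexes into.
# The values are invented for illustration; they are not real API output.
sample_payload = [
    {
        "word": "park",
        "meanings": [
            {"definitions": [{"definition": "an example definition"}]}
        ],
    }
]


def first_definition(payload):
    """Return the first definition in `payload`, or "__NOT_FOUND__"."""
    try:
        return payload[0]["meanings"][0]["definitions"][0]["definition"]
    except (IndexError, KeyError, TypeError):
        # Empty list, missing key, or a payload that is not a list of entries
        return "__NOT_FOUND__"


print(first_definition(sample_payload))
print(first_definition([]))  # no entries at all
```

Catching only `IndexError`, `KeyError` and `TypeError` keeps unrelated failures (for example a network error) visible instead of silently turning them into `__NOT_FOUND__`.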

### Step 4: **[PLAN A]** Create a stream of sentences using existing socket stream (LAB)

In [None]:
words_stream = spark \
    .readStream.format("socket") \
    .option("host", "socketstreamserver") \
    .option("port", 12345) \
    .load()

### Step 4: **[PLAN B]** Create a socket stream and build a stream of sentences from it (NOTEBOOK LOCAL)

1. Before running the cell below, start the socket stream with the existing script `hostdir/bin/cmd.sh` from a notebook terminal.
2. Make sure it is running properly.
3. Create a Spark stream using the command below.

In [None]:
words_stream = spark \
    .readStream.format("socket") \
    .option("host", "localhost") \
    .option("port", 12345) \
    .load()

### Step 5: Start stream just to visualize some of its values (for 10 seconds)

In [None]:
words_stream_writer = words_stream.writeStream.format("console").outputMode("append")
words_stream_writer = words_stream_writer.trigger(processingTime="1 second")
words_stream_query = words_stream_writer.start()
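words_stream_query.awaitTermination(10)  # block for up to 10 seconds
words_stream_query.stop()  # the timeout only ends the wait; stop the query explicitly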

-------------------------------------------
Batch: 42
-------------------------------------------
+--------------------+
|               value|
+--------------------+
|18 vacation park ...|
+--------------------+



### Step 6: Transform the stream as requested `#TODO`

1. Each line of the stream starts with a number, let us call this number `user_id`. The rest of the line comprises a set of words generated by this user.
2. For each user request, take the corresponding words, look up the meaning of each word in the static dataframe, and return the responses as a new stream of `user_id, [<meaning of word 1>, <meaning of word 2>, ... ]`.
3. Let the stream run on the console for 10 seconds.
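
The expected shape of the result can be sketched in plain Python, without Spark, for a single line of the stream. This is only an illustration of the input and output format, not the streaming solution. It assumes lines are whitespace-separated, as in the sample output of Step 5; the `meanings` dictionary is a hypothetical stand-in for the static dataframe, and reusing `__NOT_FOUND__` for unknown words is also an assumption.

```python
# Hypothetical stand-in for the static word/meaning dataframe
meanings = {
    "vacation": "meaning of vacation",
    "park": "meaning of park",
}


def answer_request(line, meanings):
    """Split '<user_id> <word> <word> ...' and look up each word's meaning."""
    user_id, *words = line.split()
    return int(user_id), [meanings.get(w, "__NOT_FOUND__") for w in words]


print(answer_request("18 vacation park", meanings))
# → (18, ['meaning of vacation', 'meaning of park'])
```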

In [None]:
# code here: TODO

### Step 7: Transform the stream as requested `#TODO`

1. Start again from the stream of lines `words_stream`.
2. Map each line to rows of `word, user_id` (hint: use `explode` and `split`).
3. From this new stream, group by word and aggregate the set of user IDs that asked for that specific word.
4. Generate a stream of `<list of user IDs> <word> <meaning of word>`.
5. Let the resulting stream run for 20 seconds.
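
The grouping described above can be sketched in plain Python on a small batch of lines. This only illustrates the target output; it is not the Spark solution. The sample lines, the `meanings` dictionary and the helper name `users_per_word` are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for the static word/meaning dataframe
meanings = {
    "vacation": "meaning of vacation",
    "park": "meaning of park",
}


def users_per_word(lines, meanings):
    """Return (sorted user IDs, word, meaning) for every word seen in `lines`."""
    users = defaultdict(set)
    for line in lines:
        user_id, *words = line.split()
        for word in words:  # one (word, user_id) pair per word, like explode
            users[word].add(int(user_id))
    return [
        (sorted(ids), word, meanings.get(word, "__NOT_FOUND__"))
        for word, ids in sorted(users.items())
    ]


print(users_per_word(["18 vacation park", "7 park"], meanings))
# → [([7, 18], 'park', 'meaning of park'), ([18], 'vacation', 'meaning of vacation')]
```

In the streaming version, keep in mind that an aggregation without a watermark cannot be written in `append` output mode; `complete` or `update` is the usual choice.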

In [None]:
# code here: TODO