<a href="https://colab.research.google.com/github/lucprosa/dataeng-basic-course/blob/main/spark_streaming/challenges/challenges.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setting up PySpark

In [None]:
%pip install pyspark



# Context
Message events arrive from the platform's message broker (Kafka, Pub/Sub, Kinesis, ...).
You need to process the data according to the requirements below.

Message schema:
- timestamp
- value
- event_type
- message_id
- channel
- country_id
- user_id



# Challenge 1

Step 1
- Change the existing producer
	- Change the parquet location to "/content/lake/bronze/messages/data"
	- Add a checkpoint (/content/lake/bronze/messages/checkpoint)
	- Delete /content/lake/bronze/messages and reprocess the data
	- For reprocessing, run the stream for at least 1 minute, then stop it

Step 2
- Implement a new streaming job that reads the messages from the bronze layer and splits the result into two locations
	- "messages_corrupted"
		- logic: event_type is null, empty or equal to "NONE"
		- extra logic: add the country name by joining the messages with the countries dataset
		- partition by "date" (extracted from timestamp)
		- location: /content/lake/silver/messages_corrupted/data

	- "messages"
		- logic: every row that is not corrupted
		- extra logic: add the country name by joining the messages with the countries dataset
		- partition by "date" (extracted from timestamp)
		- location: /content/lake/silver/messages/data

	- technical requirements
		- add a checkpoint (choose the location)
		- define the schema explicitly with StructType
		- set the trigger interval to 5 seconds
		- run the stream for at least 20 seconds, then stop it

	- alternatives
		- implement a single streaming job with foreach/foreachBatch logic that writes into two locations
		- implement two streaming jobs, one for messages and another for messages_corrupted
		- (pay attention to the paths and checkpoints)


	- Check results:
		- the number of messages in the bronze layer should match the sum of messages + messages_corrupted in the silver layer

In [103]:
%pip install faker



In [147]:
!rm -rf content/lake

# Producer

In [148]:
import pyspark.sql.functions as F
from pyspark.sql import DataFrame
from faker import Faker
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Test streaming').getOrCreate()
sc = spark.sparkContext

fake = Faker()
messages = [fake.uuid4() for _ in range(50)]

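def enrich_data(df, messages=messages):
  # Each value is drawn once per call and attached with F.lit, so all rows
  # of the same micro-batch share the same event_type, message_id, etc.
  fake = Faker()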
  new_columns = {
      'event_type': F.lit(fake.random_element(elements=('OPEN', 'RECEIVED', 'SENT', 'CREATED', 'CLICKED', '', 'NONE'))),
      'message_id': F.lit(fake.random_element(elements=messages)),
      'channel': F.lit(fake.random_element(elements=('CHAT', 'EMAIL', 'SMS', 'PUSH', 'OTHER'))),
      'country_id': F.lit(fake.random_int(min=2000, max=2015)),
      'user_id': F.lit(fake.random_int(min=1000, max=1050)),
  }
  df = df.withColumns(new_columns)
  return df

def insert_messages(df: DataFrame, batch_id):
  # foreachBatch sink: enrich the micro-batch and append it to the bronze layer
  enrich = enrich_data(df)
  enrich.write.mode("append").format("parquet").save("content/lake/bronze/messages")

# read stream
df_stream = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

# write stream
query = (df_stream.writeStream
.outputMode('append')
.trigger(processingTime='1 seconds')
.foreachBatch(insert_messages)
.start()
)

query.awaitTermination(60)


False

In [149]:
query.stop()

In [150]:
df = spark.read.format("parquet").load("content/lake/bronze/messages/*")
df.show()

+--------------------+-----+----------+--------------------+-------+----------+-------+
|           timestamp|value|event_type|          message_id|channel|country_id|user_id|
+--------------------+-----+----------+--------------------+-------+----------+-------+
|2024-12-03 21:01:...|  135|  RECEIVED|3de01bc0-78f4-401...|  EMAIL|      2000|   1025|
|2024-12-03 21:00:...|  114|  RECEIVED|5be8f6f1-dedd-4fe...|  EMAIL|      2003|   1033|
|2024-12-03 21:00:...|  119|  RECEIVED|bf899e5d-582e-4b0...|  EMAIL|      2006|   1000|
|2024-12-03 20:59:...|   72|  RECEIVED|cd29b0fc-4416-428...|  EMAIL|      2013|   1035|
|2024-12-03 21:00:...|   78|  RECEIVED|a6bd0277-f664-483...|  OTHER|      2012|   1013|
|2024-12-03 21:00:...|  131|  RECEIVED|dfe8f040-1f4a-42f...|  OTHER|      2006|   1008|
|2024-12-03 21:00:...|   86|  RECEIVED|ce1960f4-db7f-4e8...|  OTHER|      2000|   1024|
|2024-12-03 20:59:...|   62|  RECEIVED|77f57a48-d7af-436...|  OTHER|      2005|   1026|
|2024-12-03 20:59:...|   32|   C

# Additional datasets

In [131]:
countries = [
    {"country_id": 2000, "country": "Brazil"},
    {"country_id": 2001, "country": "Portugal"},
    {"country_id": 2002, "country": "Spain"},
    {"country_id": 2003, "country": "Germany"},
    {"country_id": 2004, "country": "France"},
    {"country_id": 2005, "country": "Italy"},
    {"country_id": 2006, "country": "United Kingdom"},
    {"country_id": 2007, "country": "United States"},
    {"country_id": 2008, "country": "Canada"},
    {"country_id": 2009, "country": "Australia"},
    {"country_id": 2010, "country": "Japan"},
    {"country_id": 2011, "country": "China"},
    {"country_id": 2012, "country": "India"},
    {"country_id": 2013, "country": "South Korea"},
    {"country_id": 2014, "country": "Russia"},
    {"country_id": 2015, "country": "Argentina"}
]

countries = spark.createDataFrame(countries)

# Streaming Messages Corrupted

In [123]:
!rm -rf content/lake/silver/

In [151]:
from pyspark.sql.types import (
    IntegerType, LongType, StringType, StructField, StructType, TimestampType,
)

def insert_messages_silver(df: DataFrame, batch_id):
  # Corrupted rows: event_type is null, empty or "NONE"
  is_corrupted = F.col('event_type').isin('NONE', '') | F.col('event_type').isNull()
  corrupted = df.filter(is_corrupted)
  # The predicate never evaluates to null, so its negation selects every other row
  messages = df.filter(~is_corrupted)

  if corrupted.count() > 0:
    corrupted.write.mode("append").partitionBy("date").format("parquet").save("content/lake/silver/messages_corrupted")
  if messages.count() > 0:
    messages.write.mode("append").partitionBy("date").format("parquet").save("content/lake/silver/messages")

# Explicit schema of the bronze parquet files ("date" is derived below)
schema = StructType([
    StructField('timestamp', TimestampType(), True),
    StructField('value', LongType(), True),
    StructField('event_type', StringType(), True),
    StructField('message_id', StringType(), True),
    StructField('channel', StringType(), True),
    StructField('country_id', IntegerType(), True),
    StructField('user_id', IntegerType(), True),
])
# read stream
df_stream = spark.readStream.format("parquet").schema(schema).load("content/lake/bronze/messages/*")

# derive the partition column from the event timestamp
df_transformed = df_stream.withColumn("date", F.to_date(F.col("timestamp")))

# stream-static join: add the country name
df_joined = df_transformed.join(F.broadcast(countries), ["country_id"], "left")

# write stream
query = (df_joined.writeStream
.outputMode('append')
.option('checkpointLocation', 'content/lake/silver/checkpoint')  # location chosen for this exercise
.trigger(processingTime='5 seconds')
.foreachBatch(insert_messages_silver)
.start()
)

query.awaitTermination(20)

False

In [152]:
query.stop()

In [153]:
print(spark.read.format("parquet").load("content/lake/bronze/messages").count())
print(spark.read.format("parquet").load("content/lake/silver/messages").count())
print(spark.read.format("parquet").load("content/lake/silver/messages_corrupted").count())

142
102
40



# Challenge 2

- Run the business report
- But first: a bug in the system is causing some duplicated messages, and these rows must be excluded from the report

- Technical requirements:
  - Identify possible duplicates on message_id, event_type and channel
  - In case of duplicates, keep only the first message (earliest occurrence by timestamp)
  - Example:
    In the table below, the correct message to keep is the second row

```
    message_id | channel | event_type | timestamp
    123        | CHAT    | CREATED    | 10:10:01
    123        | CHAT    | CREATED    | 07:56:45 (first occurrence)
    123        | CHAT    | CREATED    | 08:13:33
```
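
The rule can be illustrated in plain Python on the three rows above. This is only a sketch of the logic, not a Spark implementation, and `keep_first_occurrence` is a hypothetical helper introduced for the illustration.

```python
from datetime import time

# Rows from the example table: (message_id, channel, event_type, timestamp)
rows = [
    ("123", "CHAT", "CREATED", time(10, 10, 1)),
    ("123", "CHAT", "CREATED", time(7, 56, 45)),
    ("123", "CHAT", "CREATED", time(8, 13, 33)),
]


def keep_first_occurrence(rows):
    """Keep the earliest row for each (message_id, channel, event_type) key."""
    first = {}
    for row in rows:
        key = row[:3]
        # Replace the stored row only when this one has an earlier timestamp
        if key not in first or row[3] < first[key][3]:
            first[key] = row
    return list(first.values())


deduplicated = keep_first_occurrence(rows)
print(deduplicated)
```

Only the 07:56:45 row survives, because all three rows share the same key and it carries the earliest timestamp.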




In [154]:
df = spark.read.format("parquet").load("content/lake/silver/messages")
df.show()

+----------+--------------------+-----+----------+--------------------+-------+-------+--------------+----------+
|country_id|           timestamp|value|event_type|          message_id|channel|user_id|       country|      date|
+----------+--------------------+-----+----------+--------------------+-------+-------+--------------+----------+
|      2006|2024-12-03 21:00:...|  119|  RECEIVED|bf899e5d-582e-4b0...|  EMAIL|   1000|United Kingdom|2024-12-03|
|      2006|2024-12-03 21:00:...|  131|  RECEIVED|dfe8f040-1f4a-42f...|  OTHER|   1008|United Kingdom|2024-12-03|
|      2006|2024-12-03 20:59:...|   28|   CREATED|77ee9a01-90c7-48c...|  EMAIL|   1027|United Kingdom|2024-12-03|
|      2006|2024-12-03 21:00:...|  132|  RECEIVED|5845fa6c-ded6-416...|   CHAT|   1014|United Kingdom|2024-12-03|
|      2007|2024-12-03 21:00:...|  109|   CREATED|58f825b3-7ed2-4d0...|  OTHER|   1022| United States|2024-12-03|
|      2007|2024-12-03 20:59:...|   66|   CREATED|79bc9ee4-6bd5-463...|   PUSH|   1004| 

In [155]:
df.select("message_id").distinct().count()

46

In [156]:
df.count()

102

In [157]:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# keep only the earliest row per (message_id, event_type, channel)
w = Window.partitionBy("message_id", "event_type", "channel").orderBy("timestamp")
df2 = (df.withColumn("row_number", F.row_number().over(w))
  .filter("row_number = 1").drop("row_number"))

In [158]:
df2.count()

98