In [1]:
%env SPARK_HOME=/usr/lib/spark
%env SPARK_KAFKA_VERSION=0.10

env: SPARK_HOME=/usr/lib/spark
env: SPARK_KAFKA_VERSION=0.10


In [2]:
from dotenv import load_dotenv

load_dotenv()

True

In [3]:
import findspark

findspark.init('/usr/lib/spark/')

## PySpark

You can find updated instructions on how to run Data Proc with Spark in the [`week_5_batch_processing`](https://github.com/vbugaevskii/data-engineering-zoomcamp-cohort2023/blob/main/cohorts/2023/week_5_batch_processing/README.md) directory.

In [4]:
import os

import pyspark

import pyspark.sql.types as T
import pyspark.sql.functions as F

from pyspark.sql import SparkSession

from pathlib import Path

In [5]:
# NOTE: This works properly for pyspark==3.0.3

from IPython.display import clear_output

!rm -r jars || true
!mkdir -p jars
!mvn dependency:copy-dependencies -DoutputDirectory=jars

# clear_output()

[INFO] Scanning for projects...
[INFO] 
[INFO] ------------------< com.dataclub.zoomcamp.de:pyspark >------------------
[INFO] Building pyspark 2.0
[INFO]   from pom.xml
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
[INFO] --- dependency:2.8:copy-dependencies (default-cli) @ pyspark ---
[INFO] Copying kafka-clients-3.2.0.jar to /home/vbugaevskii/pyspark/jars/kafka-clients-3.2.0.jar
[INFO] Copying unused-1.0.0.jar to /home/vbugaevskii/pyspark/jars/unused-1.0.0.jar
[INFO] Copying lz4-java-1.8.0.jar to /home/vbugaevskii/pyspark/jars/lz4-java-1.8.0.jar
[INFO] Copying scala-library-2.12.10.jar to /home/vbugaevskii/pyspark/jars/scala-library-2.12.10.jar
[INFO] Copying snappy-java-1.1.8.4.jar to /home/vbugaevskii/pyspark/jars/snappy-java-1.1.8.4.jar

In [6]:
# NOTE: jar_packages works properly for spark==3.0.3

jar_packages = [
    "org.apache.spark:spark-streaming-kafka-0-10_2.12:3.0.3",
    "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.3",
    "org.apache.spark:spark-avro_2.12:3.0.3",
    "org.apache.kafka:kafka-clients:3.2.0",
    # "org.apache.kafka:kafka-clients:0.10.0.1",
]

spark = (
    SparkSession.builder
        .master("yarn")
        # .config("spark.jars", ','.join(map(str, Path("jars").glob("*.jar"))))
        .config("spark.jars.packages", ','.join(jar_packages))
        .config("spark.executor.cores", 2)
        .config("spark.executor.instances", 4)
        .config("spark.executor.memory", "2G")
        .getOrCreate()
)

sc = spark.sparkContext
sc

In [7]:
def read_from_kafka(topic: str) -> pyspark.sql.DataFrame:
    servers = [
        "rc1a-cfsongcosevdstr7.mdb.yandexcloud.net:9092",
        "rc1a-dpcbjr36v0m3hqij.mdb.yandexcloud.net:9092",
        "rc1a-rb5l0smprrcqojlp.mdb.yandexcloud.net:9092",
        "rc1a-up0snrtao9kga6n8.mdb.yandexcloud.net:9092",
    ]

    stream = (
        spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", ",".join(servers))
            .option("kafka.sasl.mechanism", "SCRAM-SHA-256")
            .option("kafka.security.protocol", "SASL_PLAINTEXT")
            .option("kafka.sasl.jaas.config", f"org.apache.kafka.common.security.scram.ScramLoginModule required username=\"{os.environ['KAFKA_USER']}\" password=\"{os.environ['KAFKA_PASS']}\";")
            .option("kafka.partition.assignment.strategy", "range")
            .option("subscribe", topic)
            .option("startingOffsets", "earliest")
            # NOTE: checkpointLocation is normally a writeStream option; set here, it is most likely ignored
            .option("checkpointLocation", "checkpoint")
            .load()
    )

    return stream
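
The `kafka.sasl.jaas.config` value in the cell above is one long f-string with hand-escaped quotes. A minimal sketch of a helper that builds the same string is shown here; the name `scram_jaas_config` is hypothetical, and rejecting double quotes (instead of escaping them) is a conservative assumption, not something the Kafka client requires.

```python
def scram_jaas_config(username: str, password: str) -> str:
    """Build the value for the `kafka.sasl.jaas.config` option (SCRAM login module)."""
    for name, value in (("username", username), ("password", password)):
        # Assumption: refuse embedded double quotes rather than guess the escaping rules.
        if '"' in value:
            raise ValueError(f"{name} must not contain a double quote")

    return (
        "org.apache.kafka.common.security.scram.ScramLoginModule required "
        f'username="{username}" password="{password}";'
    )


# Intended use inside read_from_kafka (not executed here):
#   .option("kafka.sasl.jaas.config",
#           scram_jaas_config(os.environ["KAFKA_USER"], os.environ["KAFKA_PASS"]))
```

This keeps the credentials lookup (`os.environ`) at the call site and makes the string format easy to check in isolation.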

In [12]:
df_taxi_green = read_from_kafka("dev-topic").selectExpr("CAST(value AS STRING)")
df_taxi_green

DataFrame[value: string]

In [13]:
df_taxi_green.isStreaming

True

In [14]:
df_taxi_green.writeStream \
    .format("memory") \
    .queryName("rides_green_table") \
    .start()

spark.sql("select * from rides_green_table").show(10)

+-----+
|value|
+-----+
+-----+



In [15]:
df_taxi_green.writeStream \
    .outputMode("append") \
    .trigger(processingTime="5 seconds") \
    .format("console") \
    .option("truncate", False) \
    .start()

<pyspark.sql.streaming.StreamingQuery at 0x7ff21813aa90>

### Note:

At first I struggled to get PySpark connected to Kafka, working from these guides:
- https://spark.apache.org/docs/2.4.6/streaming-kafka-0-8-integration.html
- https://spark.apache.org/docs/3.0.3/structured-streaming-kafka-integration.html

Finally, I managed to run Spark by supplying the jars via `spark.jars.packages` or Maven. The solution with the `PYSPARK_SUBMIT_ARGS` environment variable from [the example](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_6_stream_processing/python/streams-example/pyspark/streaming-notebook.ipynb) didn't work for me. I suppose the problem is that my virtual machine is not part of the Data Proc cluster (it is not the master machine).

Then I struggled with the [missing `partition.assignment.strategy` parameter](https://stackoverflow.com/questions/65890891/kafka-partition-assignment-strategy-in-pyspark). Finally, I understood that this parameter is a consumer option, so it should be set on the `readStream` operator with the `kafka.` prefix.
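
To make the prefix rule concrete, here is a small sketch that separates Kafka client properties (which get the `kafka.` prefix) from options that belong to the Spark source itself (`subscribe`, `startingOffsets`). The helper name `kafka_reader_options` and the `host:9092` address are hypothetical placeholders.

```python
from typing import Dict


def kafka_reader_options(consumer_props: Dict[str, str], **source_options: str) -> Dict[str, str]:
    """Merge Kafka client properties (prefixed with 'kafka.') with Spark source options."""
    # Client properties such as bootstrap.servers must carry the "kafka." prefix.
    options = {f"kafka.{key}": value for key, value in consumer_props.items()}
    # Source options such as subscribe or startingOffsets are passed unprefixed.
    options.update(source_options)
    return options


options = kafka_reader_options(
    {
        "bootstrap.servers": "host:9092",  # placeholder address
        "partition.assignment.strategy": "range",
    },
    subscribe="dev-topic",
    startingOffsets="earliest",
)
```

The resulting dictionary can be passed in one call as `spark.readStream.format("kafka").options(**options).load()`.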

The final output of my program is shown above. It is empty because of the following error in Spark:

```
java.lang.NoSuchMethodError: 'void org.apache.kafka.clients.consumer.KafkaConsumer.subscribe(java.util.Collection)'
```

I suppose this error occurs because of a library incompatibility. I spent the whole weekend trying to solve this problem, but I failed.
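
One direction that might be worth trying, untested here: build the package list from a single Spark version and leave out the explicit `kafka-clients` pin, so that dependency resolution for `spark.jars.packages` picks the client version the connector itself declares. The helper name `kafka_packages` is hypothetical, and whether this removes the `NoSuchMethodError` on this cluster is an assumption; jars already installed on the cluster could still conflict.

```python
from typing import List


def kafka_packages(spark_version: str, scala_version: str = "2.12") -> List[str]:
    """Maven coordinates for the Spark Kafka/Avro connectors, all at one Spark version.

    kafka-clients is deliberately not listed: it is expected to arrive as a
    transitive dependency of the connectors (assumption, not verified here).
    """
    artifacts = [
        "spark-streaming-kafka-0-10",
        "spark-sql-kafka-0-10",
        "spark-avro",
    ]
    return [
        f"org.apache.spark:{artifact}_{scala_version}:{spark_version}"
        for artifact in artifacts
    ]


jar_packages = kafka_packages("3.0.3")
```

In the notebook, `kafka_packages(pyspark.__version__)` would keep the connector versions in step with the installed PySpark, and the result can be joined with `','.join(...)` for `spark.jars.packages` as before.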