In [1]:
# Create a local Spark session that uses all available cores
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("spark_on_docker")
    .getOrCreate()
)

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/01/03 05:33:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/01/03 05:33:54 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
22/01/03 05:33:54 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


In [2]:
static = spark.read.json("work/TheDefinitiveGuide/Spark-The-Definitive-Guide/data/activity-data/")
dataSchema = static.schema


In [3]:
static.show(3)


+-------------+-------------------+--------+-----+------+----+-----+------------+------------+------------+
| Arrival_Time|      Creation_Time|  Device|Index| Model|User|   gt|           x|           y|           z|
+-------------+-------------------+--------+-----+------+----+-----+------------+------------+------------+
|1424686735090|1424686733090638193|nexus4_1|   18|nexus4|   g|stand| 3.356934E-4|-5.645752E-4|-0.018814087|
|1424686735292|1424688581345918092|nexus4_2|   66|nexus4|   g|stand|-0.005722046| 0.029083252| 0.005569458|
|1424686735500|1424686733498505625|nexus4_1|   99|nexus4|   g|stand|   0.0078125|-0.017654419| 0.010025024|
+-------------+-------------------+--------+-----+------+----+-----+------------+------------+------------+
only showing top 3 rows



In [4]:
static.printSchema()

root
 |-- Arrival_Time: long (nullable = true)
 |-- Creation_Time: long (nullable = true)
 |-- Device: string (nullable = true)
 |-- Index: long (nullable = true)
 |-- Model: string (nullable = true)
 |-- User: string (nullable = true)
 |-- gt: string (nullable = true)
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)



Create Streaming DataFrames 

Streaming DataFrames are largely the same as static DataFrames.

In [5]:
streaming = spark.readStream.schema(dataSchema).option("maxFilesPerTrigger", 1)\
.json("work/TheDefinitiveGuide/Spark-The-Definitive-Guide/data/activity-data")
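
The maxFilesPerTrigger option caps how many new files are considered in each trigger, so with a value of 1 the directory is consumed one file at a time. The following plain-Python sketch (not Spark code) only illustrates that chunking idea; the function name and file names are hypothetical, and Spark's actual file selection involves more than slicing a list.

```python
def files_per_trigger(files, max_files_per_trigger=1):
    """Split pending files into per-trigger chunks of at most the given size."""
    return [
        files[i:i + max_files_per_trigger]
        for i in range(0, len(files), max_files_per_trigger)
    ]

# Hypothetical file names; the real directory listing will differ.
pending = ["part-0.json", "part-1.json", "part-2.json"]
print(files_per_trigger(pending, 1))  # one file per trigger
print(files_per_trigger(pending, 2))  # up to two files per trigger
```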


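Next, we apply transformations to get our data into the correct format.

As with static DataFrames, we specify the transformations on our streaming DataFrame first and only then call an action to start the stream. Here we count the rows for each value of the gt (activity) column.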

In [6]:
activityCounts = streaming.groupBy("gt").count()

Because this code runs in local mode on a small machine, we set the number of
shuffle partitions to a small value (the default is 200) to avoid creating too many of them:

In [6]:
spark.conf.set("spark.sql.shuffle.partitions", 5)

Now we only need to specify an action to start the query.

We specify an output destination, or output sink, for the result of this query. For this basic example, we write to a memory sink, which keeps an in-memory table of the results.

In the process of specifying this sink, we also need to define how Spark will output the data. In this example, we use the complete output mode, which rewrites all of the keys along with their counts after every trigger.

The cell below writes out our stream. Notice that we set a unique query name to represent
this stream, in this case activity_counts, specify the format as an in-memory table, and
set the output mode:

In [None]:
activityQuery = activityCounts.writeStream.queryName("activity_counts")\
.format("memory").outputMode("complete")\
.start()
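
To make the complete output mode more concrete, here is a plain-Python sketch (not Spark code) of the behavior described above: after every trigger, the whole table of keys and counts is emitted again, not just the keys that changed. The function name and the sample micro-batches are hypothetical.

```python
from collections import Counter

def complete_mode_outputs(micro_batches):
    """Yield the full key -> count table after each micro-batch (trigger)."""
    running = Counter()
    for batch in micro_batches:
        running.update(batch)   # fold the new rows into the running counts
        yield dict(running)     # complete mode: emit every key, changed or not

# Hypothetical micro-batches of "gt" values, one list per trigger.
batches = [["stand", "stand", "walk"], ["walk", "sit"]]
outputs = list(complete_mode_outputs(batches))
for trigger, table in enumerate(outputs):
    print(trigger, table)
```

Note that the "stand" count appears in the second output even though no new "stand" rows arrived in that trigger.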

After this code is executed, the streaming computation will have started in the background.

The query object is a handle to that active streaming query. In a standalone application, we must
wait for the termination of the query using activityQuery.awaitTermination() to prevent
the driver process from exiting while the query is active. In a notebook the driver stays alive on
its own, and this call blocks the cell until the query stops or fails.

In [12]:
activityQuery.awaitTermination()
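
In a notebook it can be more convenient to wait for a bounded time. The sketch below assumes the query handle exposes awaitTermination with a timeout in seconds (returning whether the query terminated within that time) and stop(), as PySpark's StreamingQuery does; the helper name run_for is hypothetical.

```python
def run_for(query, seconds):
    """Wait up to `seconds` for the query, then stop it if it is still running.

    Returns True if the query terminated on its own within the timeout.
    """
    finished = query.awaitTermination(seconds)  # timeout is in seconds
    if not finished:
        query.stop()  # the stream was still active, so shut it down
    return finished
```

For example, run_for(activityQuery, 30) would let the stream run for at most 30 seconds and then stop it.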

Spark lists this stream, and other active ones, under the active streams in our SparkSession. We
can see a list of those streams by running the following:

In [None]:
spark.streams.active

Spark also assigns each stream a UUID, so if need be you could iterate through the list of
running streams and select the above one. In this case, we assigned it to a variable, so that’s not
necessary.
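
If the handle had not been kept in a variable, iterating over the active streams could look like the following sketch. It relies only on spark.streams.active and the name attribute of each query; the helper name find_query is hypothetical.

```python
def find_query(spark, name):
    """Return the first active streaming query with the given name, or None."""
    for query in spark.streams.active:  # list of active query handles
        if query.name == name:
            return query
    return None
```

Calling find_query(spark, "activity_counts") would return the same handle as activityQuery while the stream is running, and its id attribute holds the UUID mentioned above.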

Now that this stream is running, we can experiment with the results by querying the in-memory
table it is maintaining of the current output of our streaming aggregation. This table will be
called activity_counts, the same as the stream. To see the current data in this output table, we
simply need to query it. We’ll do this in a simple loop that prints the first rows of the result and
then sleeps. The cell below runs a single iteration with a 10-second pause; increase the range to
poll the table repeatedly:

In [None]:
from time import sleep
for x in range(1):
    spark.sql("SELECT * FROM activity_counts").show(3)
    sleep(10)


In [7]:
from pyspark.sql.functions import expr
simpleTransform = streaming.withColumn("stairs", expr("gt like '%stairs%'"))\
    .where("stairs")\
    .where("gt is not null")\
    .select("gt", "model", "arrival_time", "creation_time")\
    .writeStream\
    .queryName("simple_transform")\
    .format("memory")\
    .outputMode("append")\
    .start()

22/01/03 05:23:33 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-b03a4426-b2a4-4d8c-8a3d-e9bf73b5b6ab. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
22/01/03 05:23:33 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


In [9]:
spark.sql("SELECT * FROM simple_transform").show(3)

22/01/03 05:24:08 WARN TaskSetManager: Stage 20 contains a task of very large size (2419 KiB). The maximum recommended task size is 1000 KiB.


+--------+------+-------------+-------------------+
|      gt| model| arrival_time|      creation_time|
+--------+------+-------------+-------------------+
|stairsup|nexus4|1424687983719|1424687981726802718|
|stairsup|nexus4|1424687984000|1424687982009853255|
|stairsup|nexus4|1424687984404|1424687982411977009|
+--------+------+-------------+-------------------+
only showing top 3 rows




Aggregations

Structured Streaming supports aggregations much as the static Structured APIs do. You can specify
arbitrary aggregations, as you saw in the Structured APIs. For example, you can use a more exotic
aggregation, like a cube, on the activity, phone model, and device, and take the average x, y, z
accelerations of our sensor (jump back to Chapter 7 to see the aggregations that you can run on
your stream). The cube query follows the py4j troubleshooting note below.

Note: py4j error at the container terminal

▶ Solution 1:
    SOLVED: py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM
    Source: <https://sparkbyexamples.com/pyspark/pyspark-py4j-protocol-py4jerror-org-apache-spark-api-python-pythonutils-jvm/>

    $ pip install findspark

    import findspark
    findspark.init()

▶ Solution 2:
    Source: https://dhkdn9192.github.io/apache-spark/pyspark-py4j-error/

    $ pip install py4j==0.10.9.2

In [None]:
import findspark
findspark.init()  # locate the Spark installation before importing pyspark modules
from pyspark.sql.functions import expr

In [7]:
deviceModelStats = streaming.cube('gt', 'Model', 'Device').avg()\
    .drop("avg(Arrival_Time)")\
    .drop("avg(Creation_Time)")\
    .drop("avg(Index)")\
    .writeStream.queryName("device_counts").format("memory")\
    .outputMode("complete")\
    .start()

22/01/03 05:35:57 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-739b9aa6-6bb4-42ed-a51e-c84d758e6985. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
22/01/03 05:35:57 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


In [None]:
spark.sql("SELECT * FROM device_counts").show()    
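
A cube aggregates over every combination of the grouping columns, from all of them down to none (the grand total); in the result, a column that has been rolled up shows as null. The plain-Python sketch below (not Spark code) only enumerates those combinations; the function name is hypothetical.

```python
from itertools import combinations

def cube_groupings(columns):
    """Return every subset of `columns`, from the full set down to the empty set."""
    return [
        list(subset)
        for size in range(len(columns), -1, -1)
        for subset in combinations(columns, size)
    ]

groupings = cube_groupings(["gt", "Model"])
print(groupings)
# Three grouping columns, as in the cell above, give 2**3 = 8 combinations.
print(len(cube_groupings(["gt", "Model", "Device"])))
```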

Joins

As of Apache Spark 2.2, Structured Streaming supports joining streaming DataFrames to static
DataFrames. Spark 2.3 added the ability to join multiple streams together. You can join on
multiple columns and supplement streaming data with data from static sources:

In [11]:
historicalAgg = static.groupBy("gt", "model").avg()

22/01/03 05:44:46 WARN FileStreamSource: Listed 80 file(s) in 2685 ms
22/01/03 05:44:48 WARN FileStreamSource: Listed 80 file(s) in 2014 ms


In [14]:
deviceModelStats = streaming.drop("Arrival_Time", "Creation_Time", "Index")\
.cube("gt", "model").avg()\
.join(historicalAgg, ["gt", "model"])\
.writeStream.queryName("device_counts2").format("memory")\
.outputMode("complete")\
.start()

22/01/03 05:47:21 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-eb03e93b-45d8-4e35-8bb0-2f1b75d993c7. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
22/01/03 05:47:21 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
22/01/03 05:47:22 WARN FileStreamSource: Listed 80 file(s) in 2351 ms
22/01/03 05:47:23 WARN FileStreamSource: Listed 80 file(s) in 2059 ms
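
Conceptually, a stream-to-static join matches the rows of every micro-batch against the same fixed table. The plain-Python sketch below (not Spark code) shows that idea as an inner join on raw rows. It is a simplification of the cell above, which joins an aggregated stream, and the function name, the hist_avg_x field, and all sample values are hypothetical.

```python
def join_with_static(micro_batches, static_lookup, keys=("gt", "model")):
    """Inner-join each micro-batch of row dicts against a fixed lookup table."""
    for batch in micro_batches:
        joined = []
        for row in batch:
            key = tuple(row[k] for k in keys)
            if key in static_lookup:  # inner join: drop rows with no match
                joined.append({**row, **static_lookup[key]})
        yield joined

# Hypothetical values, for illustration only.
static_lookup = {("stand", "nexus4"): {"hist_avg_x": 0.1}}
batches = [
    [{"gt": "stand", "model": "nexus4", "x": 0.3}],
    [{"gt": "walk", "model": "nexus4", "x": 0.5}],
]
results = list(join_with_static(batches, static_lookup))
print(results)
```

The second micro-batch yields an empty list because the static table has no ("walk", "nexus4") entry.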
