# Chapter 21

## Structured Streaming: Basics

In [40]:
static = spark.read.json("gs://reddys-data-for-experimenting/activity-data/")
dataSchema = static.schema
static.count()

6240991

In [2]:
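from pyspark.sql.types import StructField, TimestampType
# Override the inferred type of Creation_Time (field index 1) for the streaming read.
# This only changes dataSchema; the static DataFrame keeps its inferred long type.
dataSchema.fields[1] = StructField("Creation_Time", TimestampType(), True)
static.printSchema()
print(dataSchema)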

root
 |-- Arrival_Time: long (nullable = true)
 |-- Creation_Time: long (nullable = true)
 |-- Device: string (nullable = true)
 |-- Index: long (nullable = true)
 |-- Model: string (nullable = true)
 |-- User: string (nullable = true)
 |-- gt: string (nullable = true)
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)

StructType(List(StructField(Arrival_Time,LongType,true),StructField(Creation_Time,TimestampType,true),StructField(Device,StringType,true),StructField(Index,LongType,true),StructField(Model,StringType,true),StructField(User,StringType,true),StructField(gt,StringType,true),StructField(x,DoubleType,true),StructField(y,DoubleType,true),StructField(z,DoubleType,true)))


### Reading a stream

In [3]:
streaming = spark.readStream.schema(dataSchema).option("maxFilesPerTrigger", 1)\
  .json("gs://reddys-data-for-experimenting/activity-data/")

In [4]:
transformedStream = streaming.selectExpr("User", "gt", "Model")

In [5]:
transformedStreamWriter = transformedStream.writeStream \
    .option("checkpointLocation", "gs://reddys-data-for-experimenting/output/chkpnt") \
    .queryName("transformedStream") \
    .outputMode("append") \
    .start(path="gs://reddys-data-for-experimenting/output/streaming", format="parquet")
# start() returns a StreamingQuery that runs in the background;
# call transformedStreamWriter.awaitTermination() to block until it stops.

### Validating that the stream works as expected

In [20]:
spark.streams.active

[<pyspark.sql.streaming.StreamingQuery at 0x7fc375e78190>]
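An entry in `spark.streams.active` only shows that the query is registered; it does not show whether the query has caught up with the input. Below is a minimal sketch of a polling check, assuming the query object exposes the `status` dictionary that PySpark's `StreamingQuery` provides (with the boolean keys `isDataAvailable` and `isTriggerActive`). The helper name `wait_until_idle` and its parameters are hypothetical and not part of Spark.

```python
import time


def wait_until_idle(query, poll_seconds=1.0, timeout_seconds=600.0):
    """Hypothetical helper: poll `query.status` until no work is pending.

    Returns True once the query reports neither available data nor an
    active trigger, and False if the timeout elapses first.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = query.status  # dict with isDataAvailable / isTriggerActive
        if not status["isDataAvailable"] and not status["isTriggerActive"]:
            return True
        time.sleep(poll_seconds)
    return False
```

In this notebook the call would look like `wait_until_idle(transformedStreamWriter)`. A query that has only just started may also report an idle status for a moment, so treat the result as a rough signal. For interactive testing, `StreamingQuery.processAllAvailable()` is a built-in alternative that blocks until the data available at call time has been processed.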

In [45]:
staticTransformedData = spark.read.format("parquet").load("gs://reddys-data-for-experimenting/output/streaming")

In [46]:
staticTransformedData.printSchema()

root
 |-- User: string (nullable = true)
 |-- gt: string (nullable = true)
 |-- Model: string (nullable = true)



In [47]:
staticTransformedData.count()

6240991

In [48]:
staticTransformedData.describe().show()

+-------+-------+-------+-------+
|summary|   User|     gt|  Model|
+-------+-------+-------+-------+
|  count|6240991|6240991|6240991|
|   mean|   null|   null|   null|
| stddev|   null|   null|   null|
|    min|      a|   bike| nexus4|
|    max|      i|   walk| nexus4|
+-------+-------+-------+-------+



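### Summary

The output statistics match the input statistics: the Parquet output holds the same 6240991 rows as the JSON input. The stream converted the data format to `Parquet` and applied a stateless transformation (a projection onto `User`, `gt`, and `Model`) without dropping records.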
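The comparison above was done by eye across two cells. As a rough sketch, the same check can be wrapped in a small function, assuming both arguments behave like Spark DataFrames (a `count()` method and a `columns` list). The name `validate_sink` and the shape of its result are hypothetical.

```python
def validate_sink(source, sink, expected_columns):
    """Hypothetical helper: compare a source DataFrame with the sink output.

    Checks that the row counts agree and that every expected column
    is present in the sink.
    """
    missing = [c for c in expected_columns if c not in sink.columns]
    source_rows = source.count()
    sink_rows = sink.count()
    return {
        "source_rows": source_rows,
        "sink_rows": sink_rows,
        "counts_match": source_rows == sink_rows,
        "missing_columns": missing,
    }
```

Here it would be called as `validate_sink(static, staticTransformedData, ["User", "gt", "Model"])`. A matching row count is a necessary check rather than proof that the contents are identical.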