# Inference of Room Occupancy using Environment Sensors
## Part II. Predicting Occupancy on Real-Time Data

In [Part I](./occupancy_detection_model.snb.ipynb#), we trained and validated a logistic regression model
on labeled data from rooms fitted with ambient sensors.
That process produced a trained model, which we saved to disk for later use.

In this notebook, we use Structured Streaming to apply that trained model to streaming data and predict occupancy as the data arrives.

In [ ]:
// These are shared definitions to locate the model
val baseDir = "/tmp"  // Change this to an appropriate location
val dataDir = s"$baseDir/data"
val modelDir = s"$baseDir/model"
val modelFile = s"$modelDir/occupancy-lg.model"

baseDir: String = /tmp
dataDir: String = /tmp/data
modelDir: String = /tmp/model
modelFile: String = /tmp/model/occupancy-lg.model


### Load the previously trained model

In [ ]:
import org.apache.spark.ml._
val model = PipelineModel.read.load(modelFile)

import org.apache.spark.ml._
model: org.apache.spark.ml.PipelineModel = pipeline_026fbfe8ca4c


In [ ]:
model.stages

res14: Array[org.apache.spark.ml.Transformer] = Array(vecAssembler_530fd58b21e5, logreg_9b02575b4033)


### Load the Data Stream
For this example, we simulate a data stream with a trick.
We use the built-in `rate` source to generate events at 1-second intervals and join those 'ticks' with data from our downloaded _dataset_.

This results in a regular stream of sample values. For practical applications, the simulated stream can easily be replaced with a _Kafka_ or _File_ source.

>Note: We use this method because it is self-contained and does not require any additional setup of external dependencies.
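
As a rough sketch of such a replacement, the function below reads the same columns from CSV files as they arrive in a directory, using the built-in file source. It assumes files with the same layout as the dataset loaded in this notebook. Unlike the batch reader, a streaming file source expects the schema up front (unless `spark.sql.streaming.schemaInference` is enabled), so it is spelled out here. The names `sensorSchema`, `fileSensorStream`, and `incomingDir` are illustrative assumptions, not part of the original notebook.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

// Schema mirroring the columns of the dataset used in this notebook.
// Streaming file sources need it declared explicitly by default.
val sensorSchema: StructType = new StructType()
  .add("id", LongType)
  .add("date", TimestampType)
  .add("Temperature", DoubleType)
  .add("Humidity", DoubleType)
  .add("Light", DoubleType)
  .add("CO2", DoubleType)
  .add("HumidityRatio", DoubleType)
  .add("Occupancy", IntegerType)

// Hypothetical helper: streams every CSV file that lands in `incomingDir`.
def fileSensorStream(spark: SparkSession, incomingDir: String): DataFrame =
  spark.readStream
    .schema(sensorSchema)
    .option("header", true)
    .csv(incomingDir)
```

With such a function in place, `fileSensorStream(sparkSession, s"$dataDir/incoming")` could take the place of the simulated stream as input to the model; the `incoming` directory is again an assumption.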

In [ ]:
val rateSource = sparkSession.readStream.format("rate").load()

rateSource: org.apache.spark.sql.DataFrame = [timestamp: timestamp, value: bigint]


In [ ]:
val dataSource = sparkSession.read
        .option("header",true)
        .option("inferSchema", true)
        .csv(s"$dataDir/datatest2.txt")

dataSource: org.apache.spark.sql.DataFrame = [id: int, date: timestamp ... 6 more fields]


In [ ]:
dataSource.printSchema()

root
 |-- id: integer (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- Temperature: double (nullable = true)
 |-- Humidity: double (nullable = true)
 |-- Light: double (nullable = true)
 |-- CO2: double (nullable = true)
 |-- HumidityRatio: double (nullable = true)
 |-- Occupancy: integer (nullable = true)



In [ ]:
val dataSize = dataSource.count

dataSize: Long = 9752


In [ ]:
// With the modulo operation, we replay the data circularly, creating a continuous data stream
// for as long as the process runs
val sensorDataStream = rateSource
        .select(($"value" % dataSize).as("id"), $"timestamp")
        .join(dataSource, "id")

sensorDataStream: org.apache.spark.sql.DataFrame = [id: bigint, timestamp: timestamp ... 7 more fields]
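
To see what the modulo does to the tick counter, here is a minimal plain-Scala sketch with a made-up dataset size of 5. The names `sampleSize`, `ticks`, and `replayedIds` are illustrative only. The counter keeps growing, while the derived id wraps around:

```scala
// Illustrative values only; the notebook uses the real row count in `dataSize`.
val sampleSize = 5L
val ticks = 0L until 12L // values emitted by the rate source: 0, 1, 2, ...
val replayedIds = ticks.map(_ % sampleSize).toList

println(replayedIds.mkString(", "))
// prints: 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1
```

The derived ids always fall between 0 and `sampleSize - 1`, so it is worth checking that this range lines up with the `id` values in the dataset: an id outside that range is never replayed, and a tick whose id has no matching row is dropped by the inner join.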


### Predict occupancy on the sensor data using the trained model
The `PipelineModel` that we loaded can be applied directly to a streaming `DataFrame` with its `transform` method, just as it would be applied to a static one.

In [ ]:
val scoredStream = model.transform(sensorDataStream)

scoredStream: org.apache.spark.sql.DataFrame = [id: bigint, timestamp: timestamp ... 11 more fields]


In [ ]:
scoredStream.printSchema

root
 |-- id: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- Temperature: double (nullable = true)
 |-- Humidity: double (nullable = true)
 |-- Light: double (nullable = true)
 |-- CO2: double (nullable = true)
 |-- HumidityRatio: double (nullable = true)
 |-- Occupancy: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



### Consume the predictions
The final step in our streaming prediction service is to do something with the prediction data.
In this notebook, we limit this step to querying the data.

In a real-world application, we would typically offer this service to other applications, for example through an HTTP-based API or through pub/sub messaging.
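
As one possible shape for the pub/sub option, the sketch below publishes predictions to a Kafka topic. It assumes the `spark-sql-kafka-0-10` connector is on the classpath and that a broker is reachable. The function names and the `bootstrapServers`, `topic`, and `checkpointDir` parameters are placeholders introduced for illustration, not part of the original notebook.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, struct, to_json}
import org.apache.spark.sql.streaming.StreamingQuery

// The Kafka sink expects a `value` column (and optionally a `key`),
// so the fields of interest are packed into a JSON string.
def toKafkaRecords(scored: DataFrame): DataFrame =
  scored.select(
    col("id").cast("string").as("key"),
    to_json(struct(col("id"), col("timestamp"), col("prediction"))).as("value"))

// Hypothetical helper: starts a query that writes the records to `topic`.
def publishPredictions(
    scored: DataFrame,
    bootstrapServers: String,
    topic: String,
    checkpointDir: String): StreamingQuery =
  toKafkaRecords(scored).writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("topic", topic)
    // Needed unless a default checkpoint location is configured
    .option("checkpointLocation", checkpointDir)
    .start()
```

In this notebook it could be called as `publishPredictions(scoredStream, "localhost:9092", "occupancy-predictions", s"$baseDir/checkpoints")`, where all three argument values are assumptions.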

In [ ]:
import org.apache.spark.sql.streaming.Trigger
val query = scoredStream.writeStream
        .format("memory")
        .queryName("memory_predictions")
        .start()

import org.apache.spark.sql.streaming.Trigger
query: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@5bb26e2


In [ ]:
sparkSession.sql("select id, timestamp, occupancy, prediction from memory_predictions")
.show(15, false)

+---+---------+---------+----------+
|id |timestamp|occupancy|prediction|
+---+---------+---------+----------+
+---+---------+---------+----------+



Note that because we are using a test dataset for the predictions, we also have access to the actual
occupancy data in the `Occupancy` field.
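
Having the label and the prediction side by side makes a quick sanity check possible. As a minimal sketch (the helper name `predictionAccuracy` is an assumption made for illustration), the fraction of rows where the prediction matches the label can be computed like this:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, count, when}

// Counts the rows and averages a 1.0/0.0 indicator of whether
// the predicted class equals the recorded occupancy.
def predictionAccuracy(predictions: DataFrame): DataFrame =
  predictions.agg(
    count("*").as("samples"),
    avg(when(col("Occupancy") === col("prediction"), 1.0).otherwise(0.0)).as("accuracy"))
```

Running `predictionAccuracy(sparkSession.table("memory_predictions")).show()` evaluates it over whatever rows the memory sink has collected so far.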

In [ ]:
query.status

res19: org.apache.spark.sql.streaming.StreamingQueryStatus =
{
  "message" : "Waiting for data to arrive",
  "isDataAvailable" : false,
  "isTriggerActive" : false
}


In [ ]:
query.lastProgress

res19: org.apache.spark.sql.streaming.StreamingQueryProgress =
{
  "id" : "e62135ba-2bb9-41cf-83cb-8b7638592d8b",
  "runId" : "014d2004-f9ad-40f4-867b-2ad77300ce4d",
  "name" : "memory_predictions",
  "timestamp" : "2018-08-19T10:42:33.496Z",
  "batchId" : 428,
  "numInputRows" : 0,
  "inputRowsPerSecond" : 0.0,
  "durationMs" : {
    "getOffset" : 0,
    "triggerExecution" : 0
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "RateSource[rowsPerSecond=1, rampUpTimeSeconds=0, numPartitions=8]",
    "startOffset" : 427,
    "endOffset" : 427,
    "numInputRows" : 0,
    "inputRowsPerSecond" : 0.0
  } ],
  "sink" : {
    "description" : "MemorySink"
  }
}
