# Inference of Room Occupancy using Environment Sensors
## Part I. Training a Model

In this notebook, we explore the use of machine learning techniques 
to estimate the occupancy of rooms based on the environment sensors present in them.

For this exercise we are going to use the dataset available at: https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+

First, we will build a model that learns the relation between the environment factors and the occupancy state by using a dataset of labeled observations.
This is also known as _supervised learning_.

Our dataset consists of timestamped records with the _temperature_, _humidity_, _light levels_, _CO2 levels_ and _humidity ratio_. 
Our training dataset also contains a label that indicates whether the room was occupied or not at the moment of the measurements.

### Loading and Parsing the Data

Before using this notebook, download the zip file containing the dataset and extract it to a local folder on your machine. 

We will call this folder `dataDir` in the notebook.

In [ ]:
val baseDir = "/tmp"  // Change this to an appropriate location
val dataDir = s"$baseDir/data"
val modelDir = s"$baseDir/model"
val modelFile = s"$modelDir/occupancy-lg.model"

baseDir: String = /tmp
dataDir: String = /tmp/data
modelDir: String = /tmp/model
modelFile: String = /tmp/model/occupancy-lg.model


Looking at the first lines of the data, we can see that it is in CSV format and has a header, so we can use the CSV reader to load it.
```
"id","date","Temperature","Humidity","Light","CO2","HumidityRatio","Occupancy"
"1","2015-02-04 17:51:00",23.18,27.272,426,721.25,0.00479298817650529,1
```
(Note that the header in the original dataset is missing the "id" field. To make the process easier, edit the file and add `"id",` at the beginning of the header line.)

In [ ]:
val sensorData = sparkSession.read
        .option("header",true)
        .option("inferSchema", true)
        .csv(s"$dataDir/datatraining.txt")

sensorData: org.apache.spark.sql.DataFrame = [id: int, date: timestamp ... 6 more fields]


In [ ]:
// check that the inferred schema corresponds to the expected types
sensorData.printSchema

root
 |-- id: integer (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- Temperature: double (nullable = true)
 |-- Humidity: double (nullable = true)
 |-- Light: double (nullable = true)
 |-- CO2: double (nullable = true)
 |-- HumidityRatio: double (nullable = true)
 |-- Occupancy: integer (nullable = true)
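
Schema inference requires an extra pass over the data and depends on what the reader happens to find in the file. As an alternative, the sketch below declares the schema explicitly, assuming the column names and types printed above. The names `sensorSchema` and `typedSensorData` and the `timestampFormat` pattern are our own choices for illustration and are not part of the original notebook.

```scala
import org.apache.spark.sql.types._

// Explicit schema mirroring the inferred one shown above.
val sensorSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("date", TimestampType),
  StructField("Temperature", DoubleType),
  StructField("Humidity", DoubleType),
  StructField("Light", DoubleType),
  StructField("CO2", DoubleType),
  StructField("HumidityRatio", DoubleType),
  StructField("Occupancy", IntegerType)
))

// Same file as before, read with the declared schema instead of inferSchema.
val typedSensorData = sparkSession.read
  .option("header", true)
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") // assumed to match values such as "2015-02-04 17:51:00"
  .schema(sensorSchema)
  .csv(s"$dataDir/datatraining.txt")
```

With a declared schema, a column whose type differs from what we expect shows up as a parsing problem rather than as a silently different type.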



## Building a Logistic Regression Model

To train our model, we are going to build an ML Pipeline. 

In [ ]:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.1)
  .setElasticNetParam(0.8)
  


import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
lr: org.apache.spark.ml.classification.LogisticRegression = logreg_9b02575b4033
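
The three settings above control the optimizer and the regularization: `setMaxIter` caps the number of iterations, `setRegParam` sets the overall regularization strength, and `setElasticNetParam` mixes the L1 and L2 penalties (0 is pure L2, 1 is pure L1). As a rough sketch of the arithmetic behind the elastic-net penalty described in the Spark documentation, here is a plain Scala version. The function name `elasticNetPenalty` and the sample weights are made up for illustration.

```scala
// Elastic-net penalty: regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)
def elasticNetPenalty(weights: Seq[Double], regParam: Double, alpha: Double): Double = {
  val l1 = weights.map(w => math.abs(w)).sum
  val l2Squared = weights.map(w => w * w).sum
  regParam * (alpha * l1 + (1 - alpha) / 2 * l2Squared)
}

// Hypothetical weight vector, used only to show the arithmetic.
val sampleWeights = Seq(0.5, -2.0, 0.0)

// Same regParam (0.1) and elasticNetParam (0.8) as in the cell above.
val penalty = elasticNetPenalty(sampleWeights, regParam = 0.1, alpha = 0.8) // approximately 0.2425
```

With `alpha` at 0.8 most of the penalty comes from the L1 term, which tends to push the weights of uninformative features to exactly zero.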


In [ ]:
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
    .setInputCols(Array("Temperature", "Humidity", "Light", "CO2", "HumidityRatio"))
    .setOutputCol("features")

import org.apache.spark.ml.feature.VectorAssembler
assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_530fd58b21e5


In [ ]:
val labeledData = sensorData.withColumn("label", $"Occupancy".cast("Double"))

labeledData: org.apache.spark.sql.DataFrame = [id: int, date: timestamp ... 7 more fields]


In [ ]:
labeledData.printSchema

root
 |-- id: integer (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- Temperature: double (nullable = true)
 |-- Humidity: double (nullable = true)
 |-- Light: double (nullable = true)
 |-- CO2: double (nullable = true)
 |-- HumidityRatio: double (nullable = true)
 |-- Occupancy: integer (nullable = true)
 |-- label: double (nullable = true)



We define the Pipeline as a sequence of stages. 
In our case, these are the assembler, which brings the features together into a `Vector`, and the parameterized _Logistic Regression_ `Estimator` that we instantiated earlier.

In [ ]:
import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(assembler, lr))

import org.apache.spark.ml.Pipeline
pipeline: org.apache.spark.ml.Pipeline = pipeline_026fbfe8ca4c


The `fit` method of a `Pipeline` trains its stages on a dataset and produces
a fitted `PipelineModel` that we can use to make predictions on new data.

In [ ]:
val model = pipeline.fit(labeledData)

model: org.apache.spark.ml.PipelineModel = pipeline_026fbfe8ca4c
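
To see what the model has learned, we can look inside the fitted pipeline. The following is a minimal sketch, assuming the `model` and `assembler` values defined above; `lrModel` is a name introduced here.

```scala
import org.apache.spark.ml.classification.LogisticRegressionModel

// The last stage of the fitted pipeline is the trained logistic regression model.
val lrModel = model.stages.last.asInstanceOf[LogisticRegressionModel]

// Pair each input column with its learned coefficient. With an L1 component
// in the penalty, some coefficients may be exactly zero.
assembler.getInputCols.zip(lrModel.coefficients.toArray).foreach {
  case (name, weight) => println(f"$name%-15s $weight%.6f")
}
println(s"intercept: ${lrModel.intercept}")
```

The coefficients are not directly comparable across features, because the inputs are on very different scales (for example, `CO2` versus `HumidityRatio`).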


## Validating the Model

To validate the model, we use data for which we know the expected outcome.

That way, we can compare the actual value with the predicted one and evaluate how well our model is performing.

In [ ]:
val testData = sparkSession.read
        .option("header",true)
        .option("inferSchema", true)
        .csv(s"$dataDir/datatest.txt")
        .withColumn("label", $"Occupancy".cast("Double"))

testData: org.apache.spark.sql.DataFrame = [id: int, date: timestamp ... 7 more fields]


In [ ]:
val predictions = model.transform(testData)

predictions: org.apache.spark.sql.DataFrame = [id: int, date: timestamp ... 11 more fields]


In [ ]:
predictions.select($"Occupancy", $"rawPrediction", $"probability", $"prediction").show(10, truncate = false)

+---------+----------------------------------------+----------------------------------------+----------+
|Occupancy|rawPrediction                           |probability                             |prediction|
+---------+----------------------------------------+----------------------------------------+----------+
|1        |[-1.165992741474785,1.165992741474785]  |[0.2375800787231101,0.76241992127689]   |1.0       |
|1        |[-1.134749462216978,1.134749462216978]  |[0.24328566985645464,0.7567143301435453]|1.0       |
|1        |[-1.1084831249923008,1.1084831249923008]|[0.24815378909083458,0.7518462109091655]|1.0       |
|1        |[-0.5854424488314132,0.5854424488314132]|[0.35768125007101514,0.6423187499289849]|1.0       |
|1        |[-0.5536855193052468,0.5536855193052468]|[0.3650097638103235,0.6349902361896765] |1.0       |
|1        |[-1.1082268289859558,1.1082268289859558]|[0.24820161021665613,0.7517983897833439]|1.0       |
|1        |[-0.9054518379194303,0.9054518379194303]|[0.
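
In the output above, `rawPrediction` holds the raw margins for classes 0 and 1, and `probability` is obtained by applying the logistic (sigmoid) function to those margins. A small plain Scala check, using the margin from the first row; the helper name `sigmoid` and the value names are ours:

```scala
// Logistic (sigmoid) function: maps a raw margin to a probability in (0, 1).
def sigmoid(margin: Double): Double = 1.0 / (1.0 + math.exp(-margin))

// Margin for class 1, taken from the first row of the output above.
val margin = 1.165992741474785

val pOccupied = sigmoid(margin)  // close to the 0.7624 shown in the probability column
val pEmpty = sigmoid(-margin)    // complementary probability for class 0

// The predicted class is 1.0 when the class-1 probability exceeds the default threshold of 0.5.
val predicted = if (pOccupied > 0.5) 1.0 else 0.0
```

This also explains why the two entries of `rawPrediction` differ only in sign and why the two entries of `probability` sum to 1.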

### Model Evaluation

In [ ]:
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
val evaluator = new BinaryClassificationEvaluator()
        .setLabelCol("label")
        .setRawPredictionCol("rawPrediction")
        .setMetricName("areaUnderROC")
// Evaluates the predictions and returns the AUC (area under the ROC curve; larger is better, 1 is perfect).
// Note that AUC measures how well the scores rank the two classes; it is not classification accuracy.
val areaUnderROC = evaluator.evaluate(predictions)

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
evaluator: org.apache.spark.ml.evaluation.BinaryClassificationEvaluator = binEval_4164a6bf8fcc
areaUnderROC: Double = 0.9917154635767224
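
One way to read the area under the ROC curve is as the probability that a randomly chosen occupied observation receives a higher score than a randomly chosen unoccupied one. The plain Scala sketch below computes it that way on a handful of made-up `(score, label)` pairs. The function name `pairwiseAuc` and the data are illustrative only, and Spark itself arrives at the value by integrating the ROC curve rather than by enumerating pairs.

```scala
// AUC as the fraction of (positive, negative) pairs in which the positive
// example receives the higher score; ties count as half.
def pairwiseAuc(scored: Seq[(Double, Double)]): Double = {
  val positives = scored.collect { case (score, label) if label == 1.0 => score }
  val negatives = scored.collect { case (score, label) if label == 0.0 => score }
  val wins = for (p <- positives; n <- negatives) yield {
    if (p > n) 1.0 else if (p == n) 0.5 else 0.0
  }
  wins.sum / (positives.size * negatives.size)
}

// Made-up (score, label) pairs, for illustration only.
val toyScores = Seq((0.9, 1.0), (0.8, 1.0), (0.7, 0.0), (0.6, 1.0), (0.2, 0.0))

// 5 of the 6 positive/negative pairs are ordered correctly, so the result is about 0.833.
val toyAuc = pairwiseAuc(toyScores)
```

A value of 0.5 corresponds to scores that rank the two classes no better than chance.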


## Storing the Model for Later Use
We store the trained model on disk.
It can be read back from disk and applied at a later stage.

In [ ]:
model.write.overwrite.save(modelFile)
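
Reading the model back could look like the sketch below, which assumes the `modelFile` path and the `testData` DataFrame defined earlier in this notebook. The names `restoredModel` and `restoredPredictions` are introduced here, and `testData` is reused only as an example of input that has the expected columns.

```scala
import org.apache.spark.ml.PipelineModel

// Restore the fitted pipeline (assembler + logistic regression) from disk.
val restoredModel = PipelineModel.load(modelFile)

// Apply it to data that has the same input columns as the training data.
val restoredPredictions = restoredModel.transform(testData)
restoredPredictions.select("Occupancy", "probability", "prediction").show(5, truncate = false)
```

Because the whole pipeline is saved, the restored model takes the raw sensor columns as input and performs the feature assembly itself.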