## Feature generation

### Extract and summarize the columns needed for training

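For each device and period, we compute the average value of each selected sensor together with the RUL (remaining useful life, computed here as the maximum `Cycle` value, i.e. the service life), so that the model can learn the relationship between the sensor averages and the RUL.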

In [4]:
pydf = spark.sql("""
    SELECT
        DeviceId,
        Period,
        max(Cycle) AS RUL,
        round(avg(Sensor11),2) AS avgSensor11,
        round(avg(Sensor14),2) AS avgSensor14,
        round(avg(Sensor15),2) AS avgSensor15,
        round(avg(Sensor9),2) AS avgSensor9
    FROM 
        sensortablespark
    WHERE
        endofperiod = 1 
    GROUP BY 
        DeviceId,
        Period
    """)
pydf.show(10)

+---------+------+---+-----------+-----------+-----------+----------+
| DeviceId|Period|RUL|avgSensor11|avgSensor14|avgSensor15|avgSensor9|
+---------+------+---+-----------+-----------+-----------+----------+
|N1172FJ-2|    16|172|      46.17|    8121.61|       8.78|   8782.62|
|N3172FJ-1|     5|164|       44.3|    8080.08|       9.27|   8720.57|
|N1172FJ-1|    52|149|      45.87|    8122.38|       8.75|    8778.9|
|N1172FJ-1|    35|134|      46.17|    8106.67|       8.72|    8764.4|
|N4172FJ-1|     6|168|      41.91|    8101.64|       9.32|   8338.59|
|N3172FJ-1|    24|203|      44.55|    8068.05|       9.21|   8728.73|
|N1172FJ-1|    25|228|      46.18|    8125.25|       8.74|    8778.9|
|N4172FJ-2|    15|177|       42.5|    8102.25|       9.43|   8346.38|
|N1172FJ-1|    43|339|      48.06|    8121.72|       8.56|   9053.16|
|N3172FJ-2|    48|242|      42.32|    8156.91|       9.37|   8396.64|
+---------+------+---+-----------+-----------+-----------+----------+
only showing top 10 rows

In [5]:
# Shape
print((pydf.count(), len(pydf.columns)))

(436, 7)
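
To make the aggregation concrete, here is a minimal pure-Python sketch of what the query computes for each (DeviceId, Period) group: the maximum `Cycle` as the RUL label and a rounded sensor average. The input rows are hypothetical toy values rather than data from `sensortablespark`, only one sensor is shown, and the `summarize` helper is an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical toy rows; the real table has more columns and sensors.
rows = [
    {"DeviceId": "A", "Period": 1, "Cycle": 1, "Sensor11": 46.0, "endofperiod": 1},
    {"DeviceId": "A", "Period": 1, "Cycle": 2, "Sensor11": 46.5, "endofperiod": 1},
    {"DeviceId": "A", "Period": 2, "Cycle": 1, "Sensor11": 47.0, "endofperiod": 0},
    {"DeviceId": "B", "Period": 1, "Cycle": 1, "Sensor11": 44.0, "endofperiod": 1},
]

def summarize(rows):
    """Group by (DeviceId, Period); return max Cycle and mean Sensor11 per group."""
    groups = defaultdict(list)
    for r in rows:
        if r["endofperiod"] == 1:  # mirrors the WHERE clause of the query
            groups[(r["DeviceId"], r["Period"])].append(r)
    out = {}
    for key, members in groups.items():
        avg = sum(m["Sensor11"] for m in members) / len(members)
        out[key] = {
            "RUL": max(m["Cycle"] for m in members),
            "avgSensor11": round(avg, 2),
        }
    return out

summary = summarize(rows)
for key, value in sorted(summary.items()):
    print(key, value)
```

Each output entry corresponds to one row of the aggregated DataFrame, which is why the row count drops to one per device and period.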

## Feature conversion

Convert the selected features into the vector format that Spark ML expects, using `VectorAssembler` from the Spark ML library.

In [6]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
# Feature conversion
vectorAssembler = VectorAssembler(inputCols = ['avgSensor11','avgSensor14','avgSensor15','avgSensor9'], outputCol = 'features')
# Simplified variant with fewer features, for the T-SQL PREDICT version
# vectorAssembler = VectorAssembler(inputCols = ['Sensor11'], outputCol = 'features')

vdf = vectorAssembler.transform(pydf)
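
Conceptually, `VectorAssembler` concatenates the listed input columns, in order, into one numeric vector per row and leaves the other columns in place. The sketch below imitates that idea in plain Python using the first row of the summary output shown earlier. The `assemble` helper is hypothetical and is not part of Spark ML.

```python
# Column order matters: it fixes which coefficient later applies to which sensor.
input_cols = ["avgSensor11", "avgSensor14", "avgSensor15", "avgSensor9"]

# First row of the summary output shown earlier.
row = {
    "DeviceId": "N1172FJ-2", "Period": 16, "RUL": 172,
    "avgSensor11": 46.17, "avgSensor14": 8121.61,
    "avgSensor15": 8.78, "avgSensor9": 8782.62,
}

def assemble(row, input_cols, output_col="features"):
    """Hypothetical stand-in for VectorAssembler: copy the row and add a feature list."""
    assembled = dict(row)
    assembled[output_col] = [float(row[c]) for c in input_cols]
    return assembled

assembled_row = assemble(row, input_cols)
print(assembled_row["features"])
```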

## Splitting the dataset

Split the dataset so that roughly 70% is used for training and roughly 30% for testing/validation.


In [7]:
trainingFraction = 0.7
testingFraction = (1-trainingFraction)
seed = 42

# Split the dataframe into test and training dataframes
df_train, df_test = vdf.randomSplit([trainingFraction, testingFraction], seed=seed)
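
`randomSplit` assigns rows to splits at random, so the fractions are targets rather than exact sizes. The sketch below imitates that behavior in plain Python on 436 row indices (the row count reported earlier). It is a simplified illustration with a hypothetical `random_split` helper, not Spark's implementation, and it will not select the same rows as Spark for the same seed.

```python
import random

def random_split(items, train_fraction, seed):
    """Assign each item independently to train or test with the given probability."""
    rng = random.Random(seed)
    train, test = [], []
    for item in items:
        (train if rng.random() < train_fraction else test).append(item)
    return train, test

train, test = random_split(range(436), train_fraction=0.7, seed=42)
print(len(train), len(test))  # close to, but not necessarily exactly, a 70/30 split
```

Fixing the seed makes the split reproducible across runs, which is the reason the notebook passes `seed=seed`.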

## Model learning

This example uses linear regression. Choose an algorithm that suits your time series data and your requirements.

In [8]:
# Modeling
lin_reg = LinearRegression(featuresCol = 'features', labelCol='RUL', maxIter = 10, regParam=0.3)
model = lin_reg.fit(df_train)
print("Coefficients: " + str(model.coefficients))
print("Intercept: " + str(model.intercept))

Coefficients: [48.60422924904571,0.7054445822887963,228.64868661178608,-0.13947212394329359]
Intercept: -8580.95906729857

## Reviewing the inference results


In [9]:
# Run inference on the test dataset
prediction = model.transform(df_test)

In [10]:
# View sample data
display(prediction.select("features","RUL","prediction"))
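
Beyond inspecting sample rows, it can help to summarize prediction quality with a metric such as RMSE or R². Spark ML provides `RegressionEvaluator` for computing these on a DataFrame. The plain-Python sketch below only shows what the two metrics measure. The label and prediction values are hypothetical and are not output from this notebook.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error: typical size of the prediction error, in RUL units."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))

def r2(labels, predictions):
    """Coefficient of determination: 1.0 is perfect, 0.0 matches predicting the mean."""
    mean = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

# Hypothetical values for illustration only.
labels = [150.0, 200.0, 250.0]
predictions = [160.0, 190.0, 250.0]
print(rmse(labels, predictions), r2(labels, predictions))
```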

## Loading to SQL pool

Load the Spark table as is, along with the latest predicted RUL values, into the SQL pool.

The data can be easily loaded using the `sqlanalytics` API.

In [11]:
%%spark
val sqlDF = spark.sql("SELECT * FROM sensortablespark")

sqlDF: org.apache.spark.sql.DataFrame = [Cycle: bigint, DeviceId: string ... 9 more fields]

In [12]:
%%spark
import org.apache.spark.sql.SqlAnalyticsConnector._
import com.microsoft.spark.sqlanalytics.utils.Constants

val sql_pool_name = "aiaddw" // SQL pool name
// Load to SQL pool
sqlDF.write.sqlanalytics(s"$sql_pool_name.dbo.Sensor", Constants.INTERNAL)

import org.apache.spark.sql.SqlAnalyticsConnector._
import com.microsoft.spark.sqlanalytics.utils.Constants
sql_pool_name: String = aiaddw

## View sensor averages for the most recent date


In [18]:
aggdf = spark.sql("""
    SELECT
        DeviceId,
        date_pst,
        round(avg(Sensor11),2) AS avgSensor11,
        round(avg(Sensor14),2) AS avgSensor14,
        round(avg(Sensor15),2) AS avgSensor15,
        round(avg(Sensor9),2) AS avgSensor9
    FROM 
        sensortablespark
    WHERE
        date_pst = (select max(date_pst) from sensortablespark)
    GROUP BY 
        DeviceId,date_pst
    """)

## Inferencing


In [20]:
# Pre-processing for inference
vdf2 = vectorAssembler.transform(aggdf)

# Scoring
predictdf = model.transform(vdf2)\
    .drop("features")\
    .withColumnRenamed("prediction","RUL")

# Display sample
predictdf.show()

# Register a temp view so the result can be accessed from Scala
predictdf.createOrReplaceTempView("tempPredict")

+---------+----------+-----------+-----------+-----------+----------+------------------+
| DeviceId|  date_pst|avgSensor11|avgSensor14|avgSensor15|avgSensor9|               RUL|
+---------+----------+-----------+-----------+-----------+----------+------------------+
|N3172FJ-1|2020-04-30|      43.03|    8065.95|       9.28|   8524.81| 133.4481006445294|
|N1172FJ-2|2020-04-30|      42.92|    8060.53|       9.38|   8505.68|149.81109618334267|
|N2172FJ-1|2020-04-30|      42.99|     8066.1|       9.34|   8518.46|146.21431734565886|
|N3172FJ-2|2020-04-30|      42.58|    8058.95|       9.36|   8491.88| 129.5227973768324|
|N4172FJ-2|2020-04-30|      42.92|    8072.92|       9.27|   8530.38|129.95523756920375|
|N1172FJ-1|2020-04-30|      43.07|    8071.34|       9.33|   8534.42|149.28672333252143|
|N4172FJ-1|2020-04-30|      43.25|    8063.68|       9.29|   8539.14| 142.8275232075339|
|N2172FJ-2|2020-04-30|      43.04|    8061.54|       9.35|   8517.79|147.80763470203237|
+---------+----------+-----------+-----------+-----------+----------+------------------+
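
For a linear regression model, each scored value is the intercept plus the dot product of the coefficients and the feature vector. As a sanity check, the sketch below recomputes the first row of the output above by hand, using the coefficients and intercept printed when the model was fitted. The result should agree with the reported RUL of about 133.45, up to floating-point rounding. The `predict` helper is a hypothetical name used only for this illustration.

```python
# Values printed after fitting the model (order follows the VectorAssembler inputCols).
coefficients = [48.60422924904571, 0.7054445822887963,
                228.64868661178608, -0.13947212394329359]
intercept = -8580.95906729857

# First scored row (N3172FJ-1): avgSensor11, avgSensor14, avgSensor15, avgSensor9.
features = [43.03, 8065.95, 9.28, 8524.81]

def predict(features, coefficients, intercept):
    """Linear model: intercept + sum(coefficient_i * feature_i)."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

rul = predict(features, coefficients, intercept)
print(round(rul, 2))
```

The same arithmetic is what `model.transform` applies to every row, so the order of `inputCols` must match between training and scoring.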

## Loading to SQL pool (2)


In [22]:
%%spark
val sql_pdf = spark.sql("select * from tempPredict")

sql_pdf: org.apache.spark.sql.DataFrame = [DeviceId: string, date_pst: date ... 5 more fields]

In [23]:
%%spark
// Load to SQL pool
sql_pdf.write.sqlanalytics(s"$sql_pool_name.dbo.PREDICT_SensorRUL", Constants.INTERNAL)