<a href="https://colab.research.google.com/github/werowe/HypatiaAcademy/blob/master/spark/sparkLR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spark ML Example




## Table of Contents


1. [Spark ML Example: Preventive Maintenance](#1)
2. [Business Assessment: Use Case Background](#2)
3. [Vehicle Fleets and Analytics](#3)
4. [Brake Failure Prediction](#4)
5. [Brake Pad Maintenance](#5)
6. [Ingest data](#6)
7. [Prepare data](#7)
8. [Train the model](#8)
9. [Test the model](#9)
10. [Expose the Model as a Web Service](#10)
11. [Serve the Model](#11)
12. [Complete Code](#12)


## <a name="1"></a>Spark ML Example: Preventive Maintenance

This example comes from the field of preventive maintenance (PM).  We first describe the use case that motivates it and then walk through the code in depth.


## <a name="2"></a>Business Assessment: Use Case Background
PM was one of the early adopters of big data analytics, machine learning, and IoT (Internet of Things) because the use case is simple to conceive and implement.  Calculating when a machine needs maintenance fits neatly into a predictive algorithm, because machine wear is a function of time and usage.


## <a name="3"></a>Vehicle Fleets and Analytics
IoT-equipped trucks send data over a cellular or satellite link, either as a continuous stream or in bursts.  The trucks are fitted with sensors and GPS trackers that measure heat, vibration, distance travelled, speed, etc.  These sensors are attached to the engine, brakes, transmission, refrigerated trailer, etc.


Companies gather and study this data to operate their vehicles as safely and cheaply as possible.  For example, sensors on the engine can tell whether the engine has a problem.  The goal of PM is to fix a device before it breaks.  Waiting until it breaks is expensive: the engine, brake assembly, or drive train can be destroyed, and the vehicle is taken out of service for longer than it would be for regular maintenance.


## <a name="4"></a>Brake Failure Prediction
A heavy truck with 18 wheels has a particular preventive maintenance problem: knowing when to change the brakes.  Operators need to know when to replace the brake pads so that the truck does not have an accident or destroy the brake rotor, which is the metal disk in the assembly.  If they wait too long, the worn pad will destroy the rotor as metal rubs against metal.


The driver cannot be expected to check every brake every time they stop.  And if the company changes brakes on a preset schedule, it may be wasting money by changing them too often.  It is better to build a mathematical or statistical model that predicts when the brakes should be changed.


## <a name="5"></a>Brake Pad Maintenance
Brake pads are metal shavings held together by a resin. The brake applies pressure to the pad to force it down on the rotor, which is a metal disk connected to a truck’s axles.  The pad is designed to wear out over time.  It has to be softer than the rotor, so that it does not damage the rotor.   When the brake pad wears down, heat will go up because there is more friction.  And the further a vehicle has been driven the more its brakes will have worn down.


We contacted an engineer from Volvo and he verified that this model would work as a teaching exercise, as it seems reasonable to correlate heat and distance driven with wear.  To get a more accurate model we would have to use something like data from the [IDA Industrial Challenge](https://ida2016.blogs.dsv.su.se/?page_id=1387), which was a competition made by the Scania truck company.


There are lots of factors that impact brake wear.  For example, brakes will wear out faster for vehicles that drive down steep hills.   


We do not have any actual sample data, so we generated some sample data using this rough model:


`z = wear_rate = (0.003 * heat) + (0.004 * kilometers)-78`


Here z is a wear score computed from the kilometers driven and the maximum heat recorded while the sample was gathered.  The larger z is, the more likely it is that the brakes are worn out.


We plug that value into the logistic probability function:


`pr = 1 / (1 + e**-z)`




The binary logistic model generates a binary output, which we will call worn. So if pr > 50% then worn = 1. Otherwise worn = 0. If worn = 1 then it is time to change brake pads.
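
As a quick check, the two formulas above can be evaluated directly in plain Python.  This is a minimal sketch, not part of the notebook's pipeline: the function names `wear_score`, `worn_probability`, and `is_worn` are our own labels for illustration, and the inputs (20,000 km, heat 240) are the values from the first row of the sample data.

```python
import math

def wear_score(km, heat):
    """z from the rough model: 0.003 * heat + 0.004 * km - 78."""
    return 0.003 * heat + 0.004 * km - 78

def worn_probability(z):
    """Logistic function: maps z to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))

def is_worn(km, heat):
    """Return 1 when the probability exceeds 0.5, otherwise 0."""
    return 1 if worn_probability(wear_score(km, heat)) > 0.5 else 0

z = wear_score(20000, 240)
pr = worn_probability(z)
print(round(z, 3), pr, is_worn(20000, 240))
```

Because the logistic function equals 0.5 exactly at z = 0, the rule pr > 50% is the same as z > 0.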



## <a name="6"></a>Ingest Data
  


The sample data is [here](https://raw.githubusercontent.com/werowe/mist_preventive_maintenance_ml/master/brakedata.csv).  Below is the first line.

<table>
<tr>
<td>worn</td><td>km</td><td>heat</td><td>z</td><td>pr</td>
</tr>

<tr>
<td>1</td><td>20,000</td><td>240</td><td>2.72</td><td>0.938197</td>
</tr>

</table>



In [29]:
!wget https://raw.githubusercontent.com/werowe/mist_preventive_maintenance_ml/master/brakedata.csv

--2024-12-26 09:28:51--  https://raw.githubusercontent.com/werowe/mist_preventive_maintenance_ml/master/brakedata.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 397 [text/plain]
Saving to: ‘brakedata.csv.2’


2024-12-26 09:28:51 (7.45 MB/s) - ‘brakedata.csv.2’ saved [397/397]



In [30]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("predict-brakes") \
    .getOrCreate()

sc = spark.sparkContext

We read this data into a Spark DataFrame.  Later we use only the first three columns: whether the brake is worn, kilometers, and brake rotor heat.



In [31]:
# brake train

import pandas as pd
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS


import math


df = spark.read.csv(
    "brakedata.csv",
    header=True,        # Use the first row as column names
    inferSchema=True,   # Automatically infer data types
    sep=",",            # Specify delimiter (default is ',')
    encoding="UTF-8"    # Handle encoding
)



Take a look at the data:

In [32]:
df.show()

+----+------+----+-------+--------+
|worn|    km|heat|      z|      pr|
+----+------+----+-------+--------+
|   1|20,000| 240|   2.72|0.938197|
|   0| 5,000|  98|-57.706|     0.0|
|   1|50,000| 140| 122.42|     1.0|
|   0| 8,000| 260| -45.22|     0.0|
|   1|23,790| 225| 17.835|     1.0|
|   1|24,644| 245| 21.311|     1.0|
|   1|29,934| 195| 42.321|     1.0|
|   0|14,045| 153|-21.361|     0.0|
|   0| 8,000| 222|-45.334|     0.0|
|   0| 9,855| 149|-38.133|     0.0|
|   1|24,633| 271| 21.345|     1.0|
|   1|20,753| 209|  5.639|0.996456|
+----+------+----+-------+--------+



## <a name="7"></a>Prepare Data
The `LogisticRegressionWithLBFGS` algorithm, from Spark's RDD-based MLlib API (`pyspark.mllib`), expects its training data as an RDD of `LabeledPoint` objects.  We first build a Python list of them and turn it into an RDD in the training step.  Each `LabeledPoint` pairs a label with a list of features.  The label is the outcome the model learns to predict.  In this case it indicates whether the brake is worn (1) or not (0).  The features are the kilometers (km) and temperature (heat).

In [33]:
from pyspark.sql.functions import regexp_replace, col
from pyspark.sql.types import IntegerType


a = []

cols = ["worn", "km", "heat"]

for c in cols:
  df = df.withColumn(c, regexp_replace(col(c), ",", "").cast(IntegerType()))


# Build a LabeledPoint from the function's own arguments (label first, then features)
def parsePoint(worn, km, heat):
    return LabeledPoint(worn, [km, heat])


for row in df.collect():

    worn = row["worn"]
    km = row["km"]
    heat = row["heat"]

    lp = parsePoint (worn, km, heat)

    a.append(lp)
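
The `km` column arrives as text such as `20,000`, which is why the cell above strips the commas before casting to an integer.  The same cleaning step can be sketched in plain Python, without Spark.  `to_int` and `raw_row` are hypothetical names used only for illustration.

```python
def to_int(value):
    """Remove thousands separators, then convert the text to an int."""
    return int(str(value).replace(",", ""))

# First row of the sample data, as text before any cleaning
raw_row = {"worn": "1", "km": "20,000", "heat": "240"}

clean_row = {name: to_int(text) for name, text in raw_row.items()}
print(clean_row)
```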

## <a name="8"></a>Train the Model
Now we train the model by converting that list into an RDD with `sc.parallelize` and passing it to `LogisticRegressionWithLBFGS.train`.

Once the model lrm is created, we can call the lrm.predict() method.

In [34]:
lrm = LogisticRegressionWithLBFGS.train(sc.parallelize(a))
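
To make the prediction step less of a black box, here is a rough plain-Python sketch of how a binary logistic model turns a feature list into a 0 or 1: a weighted sum, the logistic function, then a threshold.  This illustrates the general technique, not Spark's code.  The `predict` function is our own, and the coefficients are assumptions borrowed from the data-generating formula, not the values learned by `lrm`.

```python
import math

def predict(weights, intercept, features, threshold=0.5):
    """Weighted sum of the features, squashed by the logistic function, then thresholded."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    probability = 1 / (1 + math.exp(-margin))
    return 1 if probability > threshold else 0

# Assumed coefficients, taken from the rough model used to generate the data
weights = [0.004, 0.003]   # per kilometer, per unit of heat
intercept = -78

print(predict(weights, intercept, [23790, 225]))  # features in [km, heat] order
print(predict(weights, intercept, [5000, 98]))
```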

## <a name="9"></a>Test the Model
To test the model we take the training data and run the prediction over each data point.  We then count the correct predictions and divide by the sample size, which gives the model accuracy.  Because these are the same points the model was trained on, this measures how well the model fits the sample rather than how well it predicts unseen data.

In [35]:
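# Distribute the labeled points as an RDD
p = sc.parallelize(a)

# Pair each true label with the model's prediction for that point
valuesAndPreds = p.map(lambda lp: (lp.label, lrm.predict(lp.features)))

# Accuracy = 1 - (number of wrong predictions / number of samples)
accurate = 1 - valuesAndPreds.map(lambda vp: math.fabs(vp[0] - vp[1])).reduce(lambda x, y: x + y) / valuesAndPreds.count()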
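
The accuracy expression above works because labels and predictions are both 0 or 1, so the absolute difference is 1 for every wrong prediction and 0 for every correct one.  Here is a plain-Python sketch of the same arithmetic.  The `accuracy` helper is our own, and the label and prediction pairs are made-up values, not output from the model.

```python
import math

def accuracy(values_and_preds):
    """1 minus the fraction of (label, prediction) pairs that disagree."""
    errors = sum(math.fabs(label - pred) for label, pred in values_and_preds)
    return 1 - errors / len(values_and_preds)

# Hypothetical (label, prediction) pairs: 3 of the 4 predictions match the label
pairs = [(1, 1), (0, 0), (1, 0), (0, 0)]
print(accuracy(pairs))  # 0.75
```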

Here we pick a random row.  `collect()` returns a list of rows, and because `limit(1)` leaves only one row, `[0]` takes the first and only item in that list.

Then we run the prediction on that row.

In [58]:

import re


from pyspark.sql.functions import rand

random_row = df.withColumn("random", rand()).orderBy("random").limit(1).collect()[0]

print(random_row , "\n")

km = random_row['km']
heat = random_row['heat']



print("heat %i km %i" % ( heat,km))


worn = lrm.predict([km,heat])
print("\nbrake is worn=", worn)


Row(worn=1, km=23790, heat=225, z=17.835, pr=1.0, random=0.00015032723787544722) 

heat 225 km 23790

brake is worn= 1
