# Spark MLlib

In this lab, you will take your first steps with Spark's machine learning library, MLlib.  Specifically, you will use the RDD-based `pyspark.mllib` API to perform a k-means cluster analysis of transaction data.

## Objectives


1. Use the MLlib library to produce a k-means-based model.
2. Use the generated model to predict the cluster to which given test transactions belong.

## Instructions

### Overview

We're going to add a tool to our fraud prevention tool belt and see if we can identify high-value transactions based on a person's transaction history.  In particular, we're going to identify 4 clusters of transactions:  low debits, high debits, low credits and high credits.  To be suspicious, any transaction amount must be either a high debit or a high credit.

To keep it simple, we're only going to use a single feature of each transaction:  the amount.  In a real-world system, there could be anywhere from a few to a few hundred different features (merchant id, transaction description, transaction time of day, even dollar amounts, country of transaction origin, etc.), which could be used to characterize suspicious activity.


### Machine learning process

The machine learning process typically has the following phases:

* **Featurization**:  identifying & quantifying the features that will be studied
* **Training**:  a diverse enough data set is used to train a model to be used for future predictions
* **Testing**:  a set of known data is presented to the trained model to see if it predicts well enough
* **Production**:  the trained & tested model is used on real incoming data to make predictions

Keep this process in mind as we begin our trip through machine learning land!
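
As a rough illustration of how the training and testing phases relate, here is a minimal pure-Python sketch that holds back part of a data set for testing. The function name `split_train_test`, the 80/20 ratio, and the sample amounts are all assumptions made for this example; they are not part of the lab's data or code.

```python
import random

def split_train_test(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and hold back a fraction for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_fraction)
    # The first test_size records form the test set, the rest the training set
    return shuffled[test_size:], shuffled[:test_size]

# Hypothetical transaction amounts, for illustration only
amounts = [-12.5, 30.0, -800.0, 45.25, 1500.0, -3.99, 220.0, -60.0, 9.99, -150.0]

train, test = split_train_test(amounts)
print(len(train), "training records,", len(test), "test records")
```

On an RDD, `randomSplit` plays a similar role by splitting the data according to a list of weights.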


### Import the required packages

First things first. Let's get some imports we'll need out of the way. Execute the following commands in your cell:

In [None]:
from numpy import array
from math import pow

from pyspark.mllib.clustering import KMeans, KMeansModel


### Transformation

After reading data into an RDD, the first step in machine learning is "featurizing" the data.

> Note:  In almost all machine learning environments, any data that will be used, regardless of its type, must be "featurized", which means to convert it into numeric values, one per "feature".  This gives rise to multidimensional spaces having anywhere from one to millions of dimensions.

> Note:  Depending on the language used for the Spark APIs, features could be represented by a `Vector` (Scala, Java) or an array (Python).

In our case we want to cluster transaction amounts so the features will be represented by an array of a single element only.
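
To make "one numeric value per feature" concrete, here is a minimal pure-Python sketch that featurizes a richer transaction. The dictionary keys (`amount`, `hour`, `country`), the `featurize` helper, and the country list are hypothetical and chosen only for illustration; the lab itself uses the amount alone.

```python
def featurize(txn, countries):
    """Turn one transaction dict into a list of numbers, one per feature."""
    amount = float(txn["amount"])
    hour = float(txn["hour"])
    # 1.0 when the amount has no cents, 0.0 otherwise
    is_whole_dollar = 1.0 if amount == int(amount) else 0.0
    # One-hot encoding: one 0/1 feature per known country
    country_flags = [1.0 if txn["country"] == c else 0.0 for c in countries]
    return [amount, hour, is_whole_dollar] + country_flags

# Hypothetical inputs, for illustration only
countries = ["US", "GB", "DE"]
txn = {"amount": "-250.00", "hour": 23, "country": "GB"}

features = featurize(txn, countries)
print(features)

# The single-feature case used in this lab keeps only the amount
single_feature = [float(txn["amount"])]
print(single_feature)
```

Each extra feature adds one dimension to the space in which the clustering runs, which is why the single-feature case here is one-dimensional.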

Let's start by obtaining a `SparkSession` and reading the transactions in, then transform the transaction amounts into features.

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Python Clustering").getOrCreate()

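# path to the transactions CSV file used in this lab
file = "/home/jovyan/Resources/tx.csv"
text = spark.sparkContext.textFile(file)

# split each line on commas; the third field is the transaction amount
txns = text.map(lambda st: st.split(",")).map(lambda el: (el[0], el[1], float(el[2])))

# wrap each amount in a single-element array: one feature per transaction
amnts = txns.map(lambda v: array([v[2]]))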


### Train on the data

Now that our data is "featurized", we can train a model on it.  It's as simple as this:


In [None]:
clusterCount = 10
iterationCount = 20

model = KMeans.train(amnts, clusterCount, iterationCount)


where `clusterCount` is the number of clusters into which we want our data set to be partitioned and `iterationCount` is the maximum number of iterations we want the algorithm to run.
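
To show what the cluster count and the iteration limit control, here is a minimal pure-Python sketch of the classic k-means loop (Lloyd's algorithm) on one-dimensional data. It is a conceptual illustration only and not Spark's implementation: Spark chooses its own starting centers and distributes the work, whereas the hypothetical `kmeans_1d` below takes the starting centers as an argument so that the result is reproducible. The sample points are made up.

```python
def kmeans_1d(points, initial_centers, max_iterations):
    """Alternate between assigning points to centers and re-averaging centers."""
    centers = list(initial_centers)
    for _ in range(max_iterations):
        # Assignment step: put each point in the cluster of its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center)
        new_centers = [
            sum(members) / len(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged before reaching the iteration limit
        centers = new_centers
    return centers

# Hypothetical amounts, for illustration only
points = [-900.0, -800.0, -20.0, -10.0, 15.0, 25.0, 1400.0, 1600.0]

centers = kmeans_1d(points, initial_centers=[-1000.0, 0.0, 1000.0], max_iterations=20)
print(centers)
```

The number of starting centers fixes the number of clusters, and the loop stops either when the centers no longer move or when the iteration limit is reached.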

### Test the model

At this point, according to the canonical machine learning process, we'd use some carefully crafted test data to confirm that our model was adequately trained.  In lieu of that, we're just going to have a look at what the model came up with, which, in this case, is a collection of cluster centers based on our one-dimensional training data.  Each center is a point that represents the middle of a cluster that the model fit.

Execute the following code in your cell:

In [None]:
print("Cluster centers")

for c, value in enumerate(model.centers):
    print(c, value[0])


Here, we're listing the cluster centers along with their ids. The ids will be useful later on to find out which cluster a given value belongs to.

The output should look something like the following:

````
Cluster centers
0 -5.50525267994
1 1440.24
2 -819.5775
3 825.062
4 259.11030303
5 -192.826097561
6 2209.03
7 -36.3006958763
8 -362.944444444
9 -89.8272093023
````


The first value on each line is the index of the cluster center in `model.centers`, which is also the cluster id.  The second value is the center itself.  In the sample output above, the largest debits are grouped around -\$820 with cluster id `2`, the smallest amounts around -\$5.51 with id `0`, and the largest credits around \$1,440 and \$2,209 with ids `1` and `6`, which, given the 1200+ transactions, makes a reasonable amount of sense.  Your own ids and centers may differ from run to run, because k-means starts from randomly chosen centers.
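
Because the ids carry no meaning of their own, it can help to rank the centers by value. The following pure-Python sketch does that for the centers copied from the sample output above; the variable names are assumptions made for this example, and your own run will produce different numbers.

```python
# Centers copied from the sample output above (list index = cluster id)
sample_centers = [
    -5.50525267994, 1440.24, -819.5775, 825.062, 259.11030303,
    -192.826097561, 2209.03, -36.3006958763, -362.944444444, -89.8272093023,
]

# Pair each center with its id and sort from largest debit to largest credit
ranked = sorted(enumerate(sample_centers), key=lambda pair: pair[1])

highest_debit_id = ranked[0][0]
highest_credit_id = ranked[-1][0]

for cluster_id, center in ranked:
    print(cluster_id, round(center, 2))

print("largest debits: cluster", highest_debit_id)
print("largest credits: cluster", highest_credit_id)
```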

Here are the raw transaction amounts with cluster centers overlaid (blue diamonds are raw transaction amounts, red squares are cluster centers):



![Cluster Centers with Raw Transaction Data](../Resources/img/centersx.png)


Here's another visualization taking into account the cluster ids.  Raw transaction amounts (again, blue diamonds) are along the x-axis at y = 0, and cluster centers (red squares) are offset along the y-axis at each cluster center's id value.

![Cluster Centers and Cluster IDs with Raw Transaction Data](../Resources/img/centersy.png)

The above visualization not only lets you know the cluster center values, but also their ids for later use.

### Use the model

Now that our inspection suggests the model is reasonable, let's throw some data at it to see what it says.  Execute the following code, which creates some extremely varied transaction amounts from -\$5,000 to \$5,000, then presents them to the model, showing the resulting predictions:


In [None]:
# generate the amounts for clustering: +- 5*10^i, i=0..3
vals = spark.sparkContext.parallelize(range(4)).map(lambda i: pow(10, i)).flatMap(lambda x: array([-5*x, 5*x])).map(lambda v: array([v]))

# predict the clusters for the amounts
clsd = vals.map(lambda s: (s, model.predict(s)))

print("Predictions")
for value, c in clsd.collect():
    print(value[0], c)

Your results should look like the following:

```
Predictions
-5.0 0
5.0 0
-50.0 7
50.0 0
-500.0 8
500.0 4
-5000.0 2
5000.0 6
```
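
A prediction is simply the id of the nearest cluster center. As a sanity check, this pure-Python sketch applies that rule to the centers from the sample output shown earlier and arrives at the same cluster ids as the sample predictions. The `predict` helper here is a hypothetical stand-in written for this example, not the MLlib method.

```python
# Centers copied from the sample "Cluster centers" output (list index = cluster id)
sample_centers = [
    -5.50525267994, 1440.24, -819.5775, 825.062, 259.11030303,
    -192.826097561, 2209.03, -36.3006958763, -362.944444444, -89.8272093023,
]

def predict(amount, centers):
    """Return the index of the center closest to the given amount."""
    return min(range(len(centers)), key=lambda i: (amount - centers[i]) ** 2)

# The same test amounts as in the lab: +- 5*10^i, i=0..3
amounts = [sign * 5 * 10 ** i for i in range(4) for sign in (-1, 1)]

predictions = [(float(a), predict(a, sample_centers)) for a in amounts]
for value, cluster_id in predictions:
    print(value, cluster_id)
```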


Here, we can see that a debit of \$5 and credits of \$5 and \$50 all belong to cluster id `0`, whose center is a small debit.  Interesting.  Maybe we should think of that cluster as "normal transaction" rather than "low debit", since its members are not all debits.  Thus is the way of data analysis.  Next, our model says that a debit of \$5,000 belongs to cluster id `2`, the cluster with the largest debits.  A HA!  Fraud alert!  Our model, rightfully so, suggests that we should investigate this transaction.  Our model also placed the \$5,000 credit in cluster id `6`, the cluster with the largest credits, which should also be investigated.  The \$500 debit and the \$500 credit landed in the more moderate clusters `8` and `4`.


> Note:  Printing transactions to the screen is not a real-world way of notifying a company of fraudulent transactions.  However, since the predictions are held in an `RDD`, we can do whatever we want with them, including sending them to another system, writing them to a file that some other fraud investigation process may be monitoring, and so on.
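
As one possible way to hand results to another process, here is a minimal pure-Python sketch that writes only the flagged predictions as CSV, assuming they have already been collected to the driver. The `write_flagged` helper, the column names, and the choice of clusters `2` and `6` as suspicious are assumptions made for this example; an in-memory buffer stands in for a real file.

```python
import csv
import io

def write_flagged(predictions, suspicious_ids, out):
    """Write (amount, cluster id) rows whose cluster is considered suspicious."""
    writer = csv.writer(out)
    writer.writerow(["amount", "cluster_id"])
    count = 0
    for amount, cluster_id in predictions:
        if cluster_id in suspicious_ids:
            writer.writerow([amount, cluster_id])
            count += 1
    return count

# A few (amount, cluster id) pairs taken from the sample predictions
predictions = [(-5.0, 0), (500.0, 4), (-5000.0, 2), (5000.0, 6)]

buffer = io.StringIO()  # stands in for a file opened for writing
flagged = write_flagged(predictions, suspicious_ids={2, 6}, out=buffer)
print(flagged, "transactions flagged")
print(buffer.getvalue())
```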

## Conclusion

In this lab, you saw how you can leverage Spark's sophisticated machine learning processing. While we used k-means for this example, there are many other machine learning algorithms included in Spark MLlib out of the box.  All of a sudden, becoming a bona fide data scientist seems within reach!

## Challenge

#### Find the "optimal" number of clusters for transaction amounts

## Complete Solution

In [None]:
from numpy import array
from math import pow

from pyspark.mllib.clustering import KMeans, KMeansModel

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Python Clustering").getOrCreate()


file = "/home/jovyan/Resources/tx.csv"
text = spark.sparkContext.textFile(file)

txns = text.map(lambda st: st.split(",")).map(lambda el: (el[0], el[1], float(el[2])))

# get the amounts for clustering training
amnts = txns.map(lambda v: array([v[2]]))

clusterCount = 10
iterationCount = 20
model = KMeans.train(amnts, clusterCount, iterationCount)

print("Cluster centers")
for c, value in enumerate(model.centers):
    print(c, value[0])

# generate the amounts for clustering: +- 5*10^i, i=0..3
vals = spark.sparkContext.parallelize(range(4)).map(lambda i: pow(10, i)).flatMap(lambda x: array([-5*x, 5*x])).map(lambda v: array([v]))

# predict the clusters for the amounts
clsd = vals.map(lambda s: (s, model.predict(s)))

print("Predictions")
for value, c in clsd.collect():
    print(value[0], c)

## Solution to Challenge

In the solution above the number of desired clusters (`k`) is passed to the algorithm.

How do we know that this number is optimal? Well, we don't, but what we can do is assess how good it is by summing the squared distances of all clustered points from their nearest cluster centers and comparing the result across different values of `k`. This quantity is called the `Within Set Sum of Squared Errors (WSSSE)`, and it is the objective that k-means tries to minimize for a given `k`. You can easily compute `WSSSE` yourself, or you can obtain it by calling the `computeCost` function on the model.

So, the lower the `WSSSE` the better?  Not quite, as you can easily see that `WSSSE` is 0 when the number of desired clusters is the same as the number of distinct clustered points. You cannot beat that!
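
Computing `WSSSE` by hand takes only a few lines. The pure-Python sketch below does so for one-dimensional data and also shows the degenerate case in which every point is its own center. The `wssse` helper, the sample points, and the candidate centers are hypothetical and serve only as an illustration.

```python
def wssse(points, centers):
    """Sum, over all points, of the squared distance to the nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)

# Hypothetical amounts, for illustration only
points = [-900.0, -800.0, -20.0, -10.0, 15.0, 25.0, 1400.0, 1600.0]

one_center = wssse(points, [0.0])
three_centers = wssse(points, [-850.0, 2.5, 1500.0])
every_point_a_center = wssse(points, points)

print(one_center)
print(three_centers)
print(every_point_a_center)
```

More centers can only bring points closer to their nearest center, so the value shrinks as centers are added and bottoms out at zero, which is why the lowest value alone is not a useful criterion.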

What we're looking for, instead, is the point where `WSSSE` starts decreasing only slowly as `k` increases.

When you plot `WSSSE` against `k`, the optimal `k` is where the curve has an "elbow". In our case, that is 6.
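
Reading the elbow off a plot is a visual judgment, but it can be approximated numerically. The sketch below uses one simple heuristic among several: pick the `k` at which the curve bends most sharply, measured by the second difference of the cost. The `find_elbow` helper and the cost values are hypothetical; they are not the results of this lab, and the heuristic assumes consecutive values of `k`.

```python
def find_elbow(costs_by_k):
    """Return the k with the largest second difference of the cost curve."""
    ks = sorted(costs_by_k)
    best_k, best_bend = None, float("-inf")
    # Look at each k together with its two neighbours
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = costs_by_k[prev_k] - 2 * costs_by_k[k] + costs_by_k[next_k]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Hypothetical WSSSE values per k, for illustration only
hypothetical_costs = {1: 9000.0, 2: 6000.0, 3: 3000.0, 4: 400.0, 5: 300.0, 6: 250.0, 7: 220.0}

elbow = find_elbow(hypothetical_costs)
print("elbow at k =", elbow)
```

With these made-up numbers the cost drops steeply up to `k = 4` and flattens afterwards, so the heuristic picks 4. On real data the curve is usually noisier, and checking the plot remains worthwhile.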


In [None]:
from numpy import array
from math import pow

from pyspark.mllib.clustering import KMeans, KMeansModel

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Python Clustering").getOrCreate()


file = "/home/jovyan/Resources/tx.csv"
text = spark.sparkContext.textFile(file)

txns = text.map(lambda st: st.split(",")).map(lambda el: (el[0], el[1], float(el[2])))

# get the amounts for clustering training
amnts = txns.map(lambda v: array([v[2]]))

iterationCount = 20

print("Within Set Sum of Squared Errors")

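for clusterCount in range(1, 21):
    # train a model with the current number of clusters
    model = KMeans.train(amnts, clusterCount, iterationCount)

    # sum of squared distances of every amount from its nearest center
    wssse = model.computeCost(amnts)
    print(str(clusterCount) + "\t" + str(round(wssse)))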


![Optimal Cluster Count](../Resources/img/optimal_cluster_count.png)
