In [None]:
# Setup
import os
import pyspark
from splicemachine.spark.context import PySpliceContext
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
splice = PySpliceContext(spark)


In [None]:
%%scala
%%spark --start
SparkSession.builder

# KMeans
KMeans is an unsupervised clustering algorithm that groups similar datapoints within a given dataset.
It is an iterative process: the user chooses a number of clusters, K, and the cluster assignments are continually recomputed until they converge or another stopping condition is met.

## Setting up a KMeans
To implement KMeans, you will need two things:

* A dataset of structured data. Learn about structured data [here](https://www.quora.com/What-are-Structured-semi-structured-and-unstructured-data-in-Big-Data/answer/Manoj-R-Patil?srid=33JGI).

* A value for K. K can be chosen in a number of ways, and no single method is always correct; the best choice depends on the specific dataset you are working with. A good starting point is to plot your data and run trials with multiple values of K. Learn more about choosing a good K [here](https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set).


## Computing KMeans
A KMeans algorithm is computed in four main steps:

1. K cluster centers are created and assigned initial locations, either randomly generated or randomly taken from K datapoints.

2. For each datapoint in your dataset, the squared Euclidean distance to every cluster center is computed. The datapoint is assigned to the cluster whose center is closest.

3. After all datapoints are assigned, each cluster center is moved to the mean of its assigned datapoints.

4. Repeat steps 2 and 3 until one of the following occurs:

   a. A set number of iterations completes.
   
   b. No datapoints are reassigned to new clusters.
   
   c. The cluster centers move less than a minimum distance between iterations.
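
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation Spark uses: the function names, the `tolerance` parameter, and the sample points are assumptions made for this example, and the initial centers are simply sampled from the datapoints.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=100, tolerance=1e-9, seed=1):
    """Illustrative KMeans; returns (centers, assignments)."""
    rng = random.Random(seed)
    # Step 1: take K datapoints as the initial cluster centers.
    centers = rng.sample(points, k)
    assignments = [None] * len(points)

    for _ in range(max_iterations):  # stopping rule (a)
        # Step 2: assign each datapoint to its closest center.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 3: move each center to the mean of its assigned datapoints.
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, new_assignments) if a == c]
            if members:
                new_centers.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # an empty cluster keeps its old center
        largest_shift = max(
            squared_distance(old, new) for old, new in zip(centers, new_centers)
        )
        no_reassignment = new_assignments == assignments
        centers, assignments = new_centers, new_assignments
        # Step 4: stop on rule (b) or rule (c).
        if no_reassignment or largest_shift <= tolerance:
            break
    return centers, assignments


# Hypothetical sample data: two well-separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, assignments = kmeans(points, k=2)
print(centers)
print(assignments)
```

With groups this far apart, the first three points should end up in one cluster and the last three in the other, whichever datapoints are drawn as the initial centers.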

You can learn more about the KMeans algorithm here:

* [KMeans Algorithm](https://www.datascience.com/blog/introduction-to-k-means-clustering-algorithm-learn-data-science-tutorials)

* [KMeans Clustering](http://scikit-learn.org/stable/modules/clustering.html#k-means)

## Create a Simple KMeans Example

We use a US weather dataset, engineering the features and plotting our results.

* To follow along with this dataset, [download this data](https://raw.githubusercontent.com/fivethirtyeight/data/master/us-weather-history/KMDW.csv).

* For more KMeans examples written in Scala, [visit this site](https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples/mllib).

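* To learn more about the Sum of Squares for Errors, a common measure of how tightly datapoints sit around their cluster centers, [visit this site](http://www.wikihow.com/Calculate-the-Sum-of-Squares-for-Error). The evaluation cells later in this notebook report the Silhouette score rather than this measure.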
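
As a rough illustration of the Sum of Squares for Errors in a clustering setting, the sketch below adds up the squared Euclidean distance from each datapoint to its assigned cluster center. The function name and the sample values are assumptions made for this example; it is not part of the Spark API.

```python
def sum_of_squared_errors(points, centers, assignments):
    """Total squared Euclidean distance from each point to its assigned center."""
    total = 0.0
    for point, cluster in zip(points, assignments):
        center = centers[cluster]
        total += sum((a - b) ** 2 for a, b in zip(point, center))
    return total


# Hypothetical data: four points on a line, split into two clusters.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
two_clusters = sum_of_squared_errors(points, [(1.0, 0.0), (11.0, 0.0)], [0, 0, 1, 1])
one_cluster = sum_of_squared_errors(points, [(6.0, 0.0)], [0, 0, 0, 0])
print(two_clusters, one_cluster)
```

A lower value means the datapoints sit closer to their centers. The value always shrinks as K grows, so it is usually compared across several values of K rather than minimized outright.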

In [None]:
%%scala 
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.sql.Row
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.{PCA, StandardScaler}
import org.apache.spark.ml.{Pipeline, PipelineModel}

//Create the data schema
import org.apache.spark.sql.types.StructType
val schema = new StructType()
    .add("date", "string")
    .add("actual_mean_temp", "float")
    .add("actual_min_temp", "float")
    .add("actual_max_temp", "float")
    .add("average_min_temp", "float")
    .add("average_max_temp", "float")
    .add("record_min_temp", "float")
    .add("record_max_temp", "float")
    .add("record_min_temp_year", "float")
    .add("record_max_temp_year", "float")
    .add("actual_precipitation", "float")
    .add("average_precipitation", "float")
    .add("record_precipitation", "float")


//Grabbing data
val datas = sc.textFile("s3a://splice-demo/weather_data.csv")
val rdd = sc.parallelize(datas.take(165))
val header = rdd.first()
// Drop the header and any rows without exactly 13 fields before converting values to floats
val rdd2 = rdd.filter(x => x != header).map(_.split(",")).filter(p => p.length == 13)
    .map(p => Row(p(0),p(1).toFloat,p(2).toFloat,p(3).toFloat,p(4).toFloat,p(5).toFloat,p(6).toFloat,p(7).toFloat,p(8).toFloat,p(9).toFloat,p(10).toFloat,p(11).toFloat,p(12).toFloat))


val df = spark.createDataFrame(rdd2, schema)

val features = header.split(",").filter(_ != "date")

// Assemble our feature vector
val assembler = new VectorAssembler()
  .setInputCols(features)
  .setOutputCol("features")

//Normalize all features to the same scale
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

// PCA to reduce dimensionality
val pca = new PCA()
  .setInputCol("scaledFeatures")
  .setOutputCol("pcaFeatures")
  .setK(2)

// Assemble our Pipeline for proper parallelism
val pipeline = new Pipeline()
  .setStages(Array(assembler, scaler, pca))

val df2 = pipeline.fit(df).transform(df)


// Display feature columns (head plus tail, so the first column is not selected twice)
df2.select(features.head, features.tail:_*).display(10)

// Inspect the schema
println("Schema:")
df2.schema.foreach(i => println(i))
println()
// Inspect the features column
println("Features:")
df2.select("features", "pcaFeatures").show(10, truncate=false)

### Plotting Out PCA Features

Run the following cell to plot out PCA features. The generated plot is interactive; try zooming in and out to inspect your data.

In [None]:
%%scala
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.functions.col

val x = df2.select(col("pcaFeatures")).collect()

// Get x and y values for plotting: the first and second principal components
val xVals = x.map(i => i(0).asInstanceOf[DenseVector].values(0))
val yVals = x.map(i => i(0).asInstanceOf[DenseVector].values(1))

// Generate our plot
val p = new Plot {
    title = "PCA Features"
    labelStyle = "font-size:32px; font-weight: bold; font-family: courier; fill: green;"
    gridLineStyle = "stroke: purple; stroke-width: 3;"
    titleStyle = "color: green;"
}

p.add(new Points {
    x = xVals
    y = yVals
})

### Generate Our Clusters

Run the following cell to:

* cluster the data into groups
* train a k-means model
* make predictions
* evaluate clustering
* display the results

In [None]:
%%scala 

//Clustering data into 4 groups
val K = 4
val maxIterations = 5000

// Trains a k-means model.
val kmeans = new KMeans().setK(K).setMaxIter(maxIterations).setSeed(1L)
val model = kmeans.setFeaturesCol("pcaFeatures").fit(df2)

// Make predictions
val predictions = model.transform(df2)

// Evaluate clustering by computing Silhouette score on the same features used for clustering
val evaluator = new ClusteringEvaluator().setFeaturesCol("pcaFeatures")

val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")

// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)


### Evaluate the Clustering Algorithm 

We plot our datapoints with our cluster centers to evaluate how well the algorithm worked:

In [None]:
%%scala

val clusterX = model.clusterCenters.map(i => i(0))
val clusterY = model.clusterCenters.map(i => i(1))

// Generate our plot with cluster centers
val p = new Plot {
    title = "PCA Features with Clusters"
    labelStyle = "font-size:32px; font-weight: bold; font-family: courier; fill: green;"
    gridLineStyle = "stroke: purple; stroke-width: 3;"
    titleStyle = "color: green;"
}

p.add(new Points {
    x = clusterX
    y = clusterY
    color = Color.red
    size = 10
})
p.add(new Points {
    x = xVals
    y = yVals
    color = Color.blue
    size = 5
})

Our algorithm didn't do too badly: the fourth cluster handled the outliers. Now try going back and editing the number of clusters (in the `K` variable), then re-plot and re-evaluate the output.
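
When comparing runs with different values of K, it helps to know what the Silhouette score printed by the clustering cell measures. The sketch below follows the textbook definition with squared Euclidean distance in plain Python. It is an illustration under stated assumptions, not Spark's implementation: the function names and sample data are made up for this example, and points that are alone in their cluster are given a score of 0 by convention.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette_score(points, labels):
    """Mean silhouette over all points, using squared Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1 or len(clusters) == 1:
            scores.append(0.0)  # convention when the score is otherwise undefined
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the distance to itself is 0, so divide by len(own) - 1).
        a = sum(squared_distance(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(squared_distance(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical data: two tight pairs of points on a line.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(points, [0, 0, 1, 1])  # clusters match the pairs
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters split the pairs
print(good, bad)
```

Scores near 1 indicate compact, well-separated clusters, while scores near 0 or below suggest that datapoints sit between clusters or in the wrong one.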

## PySpark Example

For our PySpark example, we use the well-known Iris dataset to show off some of the great plotting features built into [BeakerX](http://beakerx.com) and created by [Plotly](https://plot.ly/python/):

In [None]:
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
import plotly.express as px

data = spark.createDataFrame(px.data.iris()).drop('species_id')

# Convert the species column into a numeric index
si = StringIndexer(inputCol='species', outputCol='species_vec')

# Create a vector of features
cols = [c for c in data.columns if c != 'species' and c != 'petal_length']
va = VectorAssembler(inputCols=cols, outputCol='features')

# Define stages of a Pipeline for Spark
pipeline = Pipeline(stages = [si, va])

data = pipeline.fit(data).transform(data)

# Show the final dataset
data.orderBy('sepal_width').show()

### Build the KMeans Algorithm and Display Results

Run the following cell to run the algorithm and display results:

In [None]:
# Trains a k-means model.
kmeans = KMeans().setK(3).setSeed(1)
model = kmeans.fit(data)

# Make predictions
predictions = model.transform(data)

# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

### Combine the Cluster Data for Plotting

Now we combine the cluster data with the dataset for plotting:

In [None]:
import pandas as pd
centers = []
for center in model.clusterCenters():
    centers.append(center)
# Match the schema of the centers dataframe to the predictions dataframe
cents = pd.DataFrame(centers,columns=['sepal_length','sepal_width','petal_width'])
# The 3 labels we are trying to cluster on (this order is assumed; cluster indices are not guaranteed to match it)
cents['species'] = ['setosa','virginica','versicolor']
# Add a column with the value 10 to all cluster center datapoints. This is for plotting purposes
cents.insert(0,'center',[10]*len(cents))
cents

In [None]:
# Add the center datapoints to the predictions dataframe
preds = predictions.toPandas()
# Add a column with the value of 2 to all non-cluster-center datapoints. Again, for plotting purposes
preds.insert(0, "center", [2]*len(preds), True)
# Insert the cluster center data (DataFrame.append was removed in pandas 2.0, so use pd.concat)
preds = pd.concat([preds, cents], sort=False)
preds[:10]

### Plot the Data Points

Now we can plot the datapoints, colored by cluster, with the cluster centers as the large, diamond datapoints in the *center*.

You'll notice that the cluster center datapoints are larger, due to the data formatting we performed in the cells above.

Again, this plot is interactive: try moving the plot around to understand the datapoints.

In [None]:
# We use plotly to visualize our data
px.scatter_3d(preds, x='sepal_length', y='sepal_width', z='petal_width',
              color='species', symbol='center', size='center')
