# Workshop Azure Databricks
## 04. Clustering
<img src="https://raw.githubusercontent.com/retkowsky/images/master/AzureDatabricksLogo.jpg"><br>

# Documentation
Overview: https://azure.microsoft.com/fr-fr/services/databricks/

Documentation Azure Databricks : https://docs.microsoft.com/fr-fr/azure/databricks/

Documentation Azure ML : https://docs.microsoft.com/en-us/azure/machine-learning/

GitHub: https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/azure-databricks

## Clustering
In this exercise, you will use K-Means clustering to segment customer data into five clusters.
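
As a rough illustration of what K-Means does, the sketch below alternates between assigning each point to its nearest center and moving each center to the mean of its points. It is plain Python on made-up 2-D points, not the Spark implementation used in this lab (Spark's `KMeans` runs distributed and defaults to `k-means||` initialization); the function names and sample points are assumptions for illustration only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Toy K-Means: returns (centers, labels) for a list of numeric vectors."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct points as initial centers
    for _ in range(iterations):
        # Assignment step: group every point with its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        new_centers = []
        for i, group in enumerate(groups):
            if group:
                new_centers.append([sum(col) / len(group) for col in zip(*group)])
            else:
                new_centers.append(centers[i])  # keep the center of an empty group
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    labels = [min(range(k), key=lambda i: squared_distance(p, centers[i]))
              for p in points]
    return centers, labels


# Hypothetical data: two well-separated groups of three points each.
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
centers, labels = kmeans(points, k=2)
print(labels)
```

The cluster numbers themselves carry no meaning; only which points share a number matters.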

### Import the Libraries
You will use the **KMeans** class to create your model. This will require a vector of features, so you will also use the **VectorAssembler** class.

In [0]:
import datetime
now = datetime.datetime.now()
print(now)

In [0]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
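
To make "a vector of features" concrete, here is a minimal plain-Python sketch of the idea behind **VectorAssembler**: take the chosen columns of a row, in a fixed order, and pack them into one numeric vector. The `assemble` helper and the three-column subset are assumptions for illustration; the real class works on Spark DataFrames and produces a Spark vector column.

```python
# Columns to combine, in the order they should appear in the vector.
FEATURE_COLUMNS = ["Age", "MaritalStatus", "IncomeRange"]


def assemble(row, columns):
    """Return the selected columns of a row as a list of floats."""
    return [float(row[c]) for c in columns]


# One illustrative row; non-feature columns such as the name are ignored.
row = {"CustomerName": "Aaron Adams", "Age": 42, "MaritalStatus": 0, "IncomeRange": 50000}
features = assemble(row, FEATURE_COLUMNS)
print(features)  # [42.0, 0.0, 50000.0]
```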

### Load Source Data
The source data for your clusters is in a comma-separated values (CSV) file and includes the following features:
- CustomerName: The customer's name
- Age: The customer's age in years
- MaritalStatus: The customer's marital status (1 = married, 0 = unmarried)
- IncomeRange: The upper bound of the customer's income range (for example, a value of 25,000 means the customer earns up to 25,000)
- Gender: A numeric value indicating gender (1 = female, 2 = male)
- TotalChildren: The total number of children the customer has
- ChildrenAtHome: The number of children the customer has living at home
- Education: A numeric value indicating the highest level of education the customer has attained (1 = Started High School to 5 = Post-Graduate Degree)
- Occupation: A numeric value indicating the type of occupation of the customer (0 = Unskilled manual work to 5 = Professional)
- HomeOwner: A numeric code indicating home ownership (1 = home owner, 0 = not a home owner)
- Cars: The number of cars owned by the customer
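
Before loading the file into Spark, it can help to check a few rows against the coding scheme above. The sketch below does this with the standard `csv` module on three rows copied from the sample output shown later in this notebook; the helper names and the `ALLOWED` table are assumptions for illustration. Gender is left out of the check because the sample rows contain the value 0, which the description above does not list.

```python
import csv
import io

# Three rows taken from the sample output of this notebook.
SAMPLE_CSV = """CustomerName,Age,MaritalStatus,IncomeRange,Gender,TotalChildren,ChildrenAtHome,Education,Occupation,HomeOwner,Cars
Aaron Adams,42,0,50000,0,0,0,3,2,1,1
Aaron Alexander,40,1,50000,0,0,0,2,2,1,2
Aaron Allen,63,0,25000,0,2,1,2,1,1,2
"""

# Allowed codes per column, following the feature descriptions.
ALLOWED = {
    "MaritalStatus": {0, 1},
    "HomeOwner": {0, 1},
    "Education": set(range(1, 6)),   # 1..5
    "Occupation": set(range(0, 6)),  # 0..5
}


def load_customers(text):
    """Parse CSV text into dicts, converting every column except the name to int."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        row = {"CustomerName": record["CustomerName"]}
        for name, value in record.items():
            if name != "CustomerName":
                row[name] = int(value)
        rows.append(row)
    return rows


def find_violations(rows):
    """Return (customer, column, value) for every value outside its allowed codes."""
    problems = []
    for row in rows:
        for column, allowed in ALLOWED.items():
            if row[column] not in allowed:
                problems.append((row["CustomerName"], column, row[column]))
    return problems


customers_sample = load_customers(SAMPLE_CSV)
violations = find_violations(customers_sample)
print(len(customers_sample), "rows,", len(violations), "violations")
```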

> Import the CSV file into DBFS

In [0]:
file_location = "/FileStore/tables/customers.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
customers = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(customers)

CustomerName,Age,MaritalStatus,IncomeRange,Gender,TotalChildren,ChildrenAtHome,Education,Occupation,HomeOwner,Cars
Aaron Adams,42,0,50000,0,0,0,3,2,1,1
Aaron Alexander,40,1,50000,0,0,0,2,2,1,2
Aaron Allen,63,0,25000,0,2,1,2,1,1,2
Aaron Baker,56,1,50000,0,4,2,2,2,1,2
Aaron Bryant,72,0,75000,0,4,0,4,4,1,2
Aaron Butler,42,1,75000,0,0,0,3,5,1,2
Aaron Campbell,49,0,75000,0,0,0,5,5,1,1
Aaron Carter,42,0,50000,0,0,0,3,2,0,1
Aaron Chen,57,1,75000,0,4,3,4,5,1,0
Aaron Coleman,42,0,50000,0,0,0,3,2,1,1


In [0]:
train_rows = customers.count()

In [0]:
print(train_rows)

In [0]:
customers.describe("Age").show()

### Create the K-Means Model
You will use the features in the customer data to create a K-Means model with a k value of 5. This will be used to generate 5 clusters.

In [0]:
assembler = VectorAssembler(inputCols = ["Age", "MaritalStatus", "IncomeRange", "Gender", "TotalChildren", "ChildrenAtHome", "Education", "Occupation", "HomeOwner", "Cars"], outputCol="features")
train = assembler.transform(customers)

kmeans = KMeans(featuresCol=assembler.getOutputCol(), predictionCol="cluster", k=5, seed=0)

model = kmeans.fit(train)

print("OK")
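
One thing worth keeping in mind about this model: Spark's `KMeans` uses Euclidean distance by default, and the features are assembled unscaled, so a column with large values such as IncomeRange can outweigh all the others. The plain-Python sketch below checks this on two rows from the sample data; the variable names are assumptions for illustration.

```python
COLUMNS = ["Age", "MaritalStatus", "IncomeRange", "Gender", "TotalChildren",
           "ChildrenAtHome", "Education", "Occupation", "HomeOwner", "Cars"]

# Feature values of two customers from the sample data (Aaron Adams, Aaron Allen).
adams = [42, 0, 50000, 0, 0, 0, 3, 2, 1, 1]
allen = [63, 0, 25000, 0, 2, 1, 2, 1, 1, 2]

# Per-column contribution to the squared Euclidean distance.
contributions = {c: (a - b) ** 2 for c, a, b in zip(COLUMNS, adams, allen)}
total = sum(contributions.values())
income_share = contributions["IncomeRange"] / total

print(f"IncomeRange share of squared distance: {income_share:.6f}")
```

For this pair, IncomeRange accounts for more than 99.9% of the squared distance. If that is not the intent, one common option is to standardize the features before fitting (for example with `pyspark.ml.feature.StandardScaler`); this exercise does not do so.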

### Get the Cluster Centers
The cluster centers are indicated as vector coordinates.

In [0]:
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
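
Each center is a point in the same 10-dimensional feature space as the customers, and a customer belongs to the cluster whose center is closest. The sketch below shows that assignment rule in plain Python; the two-feature centers (Age, IncomeRange) are made-up values for illustration, not output from the model trained above.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


# Hypothetical centers over (Age, IncomeRange) only; not real model output.
hypothetical_centers = [[60.0, 25000.0], [45.0, 50000.0], [50.0, 75000.0]]

customer = [42.0, 50000.0]  # Age and IncomeRange of one sample customer
cluster = nearest_center(customer, hypothetical_centers)
print(cluster)  # 1
```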

### Predict Clusters
Now that you have trained the model, you can use it to segment the customer data into 5 clusters and show each customer with their allocated cluster.

In [0]:
classif = model.transform(train)
classif.groupBy("cluster").count().orderBy("cluster").show()

In [0]:
display(classif)

CustomerName,Age,MaritalStatus,IncomeRange,Gender,TotalChildren,ChildrenAtHome,Education,Occupation,HomeOwner,Cars,features,cluster
Aaron Adams,42,0,50000,0,0,0,3,2,1,1,"Map(vectorType -> dense, length -> 10, values -> List(42.0, 0.0, 50000.0, 0.0, 0.0, 0.0, 3.0, 2.0, 1.0, 1.0))",2
Aaron Alexander,40,1,50000,0,0,0,2,2,1,2,"Map(vectorType -> dense, length -> 10, values -> List(40.0, 1.0, 50000.0, 0.0, 0.0, 0.0, 2.0, 2.0, 1.0, 2.0))",2
Aaron Allen,63,0,25000,0,2,1,2,1,1,2,"Map(vectorType -> dense, length -> 10, values -> List(63.0, 0.0, 25000.0, 0.0, 2.0, 1.0, 2.0, 1.0, 1.0, 2.0))",0
Aaron Baker,56,1,50000,0,4,2,2,2,1,2,"Map(vectorType -> dense, length -> 10, values -> List(56.0, 1.0, 50000.0, 0.0, 4.0, 2.0, 2.0, 2.0, 1.0, 2.0))",2
Aaron Bryant,72,0,75000,0,4,0,4,4,1,2,"Map(vectorType -> dense, length -> 10, values -> List(72.0, 0.0, 75000.0, 0.0, 4.0, 0.0, 4.0, 4.0, 1.0, 2.0))",3
Aaron Butler,42,1,75000,0,0,0,3,5,1,2,"Map(vectorType -> dense, length -> 10, values -> List(42.0, 1.0, 75000.0, 0.0, 0.0, 0.0, 3.0, 5.0, 1.0, 2.0))",3
Aaron Campbell,49,0,75000,0,0,0,5,5,1,1,"Map(vectorType -> dense, length -> 10, values -> List(49.0, 0.0, 75000.0, 0.0, 0.0, 0.0, 5.0, 5.0, 1.0, 1.0))",3
Aaron Carter,42,0,50000,0,0,0,3,2,0,1,"Map(vectorType -> sparse, length -> 10, indices -> List(0, 2, 6, 7, 9), values -> List(42.0, 50000.0, 3.0, 2.0, 1.0))",2
Aaron Chen,57,1,75000,0,4,3,4,5,1,0,"Map(vectorType -> dense, length -> 10, values -> List(57.0, 1.0, 75000.0, 0.0, 4.0, 3.0, 4.0, 5.0, 1.0, 0.0))",3
Aaron Coleman,42,0,50000,0,0,0,3,2,1,1,"Map(vectorType -> dense, length -> 10, values -> List(42.0, 0.0, 50000.0, 0.0, 0.0, 0.0, 3.0, 2.0, 1.0, 1.0))",2


In [0]:
classif.select("CustomerName", "cluster").show(10)

> You can open Lab05