<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMSkillsNetworkBD0231ENCoursera2789-2023-01-01">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


## Clustering using SparkML


Estimated time needed: **30** minutes


<p style='color: red'>The purpose of this lab is to show you how to use SparkML to cluster data.</p>


## __Table of Contents__
<ol>
  <li>
    <a href="#Objectives">Objectives
    </a>
  </li>
  <li>
    <a href="#Datasets">Datasets
    </a>
  </li>
  <li>
    <a href="#Setup">Setup
    </a>
    <ol>
      <li>
        <a href="#Installing-Required-Libraries">Installing Required Libraries
        </a>
      </li>
      <li>
        <a href="#Importing-Required-Libraries">Importing Required Libraries
        </a>
      </li>
    </ol>
  </li>
  <li>
    <a href="#Examples">Examples
    </a>
    <ol>
      <li>
        <a href="#Task-1---Create-a-spark-session">Task 1 - Create a spark session
        </a>
      </li>
      <li>
        <a href="#Task-2---Load-the-data-in-a-csv-file-into-a-dataframe">Task 2 - Load the data in a csv file into a dataframe
        </a>
      </li>
      <li>
        <a href="#Task-3---Create-a-feature-vector">Task 3 - Create a feature vector
        </a>
      </li>
      <li>
        <a href="#Task-4---Create-a-clustering-model">Task 4 - Create a clustering model
        </a>
      </li>
      <li>
        <a href="#Task-5---Print-Cluster-Details">Task 5 - Print Cluster Details
        </a>
      </li>
    </ol>
  </li>
  <li>
    <a href="#Exercises">Exercises
    </a>
    <ol>
      <li>
        <a href="#Exercise-1---Create-a-spark-session">Exercise 1 - Create a spark session
        </a>
      </li>
      <li>
        <a href="#Exercise-2---Load-the-data-in-a-csv-file-into-a-dataframe">Exercise 2 - Load the data in a csv file into a dataframe
        </a>
      </li>
      <li>
        <a href="#Exercise-3---Create-a-feature-vector">Exercise 3 - Create a feature vector
        </a>
      </li>
      <li>
        <a href="#Exercise-4---Create-a-clustering-model">Exercise 4 - Create a clustering model
        </a>
      </li>
      <li>
        <a href="#Exercise-5---Print-Cluster-Details">Exercise 5 - Print Cluster Details
        </a>
      </li>
    </ol>
  </li>
</ol>


## Objectives

After completing this lab you will be able to:

 - Use PySpark to connect to a Spark cluster.
 - Create a Spark session.
 - Read a CSV file into a DataFrame.
 - Use the KMeans algorithm to cluster the data.
 - Stop the Spark session.
 
 
 


## Datasets

In this lab you will be using the following datasets:

 - Modified version of Wholesale customers dataset. Original dataset available at https://archive.ics.uci.edu/ml/datasets/Wholesale+customers 
 - Seeds dataset. Available at https://archive.ics.uci.edu/ml/datasets/seeds


----


## Setup


For this lab, we will be using the following libraries:

*   [`PySpark`](https://spark.apache.org/docs/latest/api/python/index.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMSkillsNetworkBD0231ENCoursera2789-2023-01-01) for connecting to the Spark Cluster


### Installing Required Libraries

A Spark cluster is pre-installed in the Skills Network Labs environment. However, you need libraries such as pyspark and findspark to connect to it.

If you wish to download this Jupyter notebook and run it on your local computer, follow the instructions <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/labs/Connecting_to_spark_cluster_using_Skills_Network_labs.ipynb">here</a>.



The following required libraries are __not__ pre-installed in the Skills Network Labs environment. __You will need to run the following cell__ to install them:


In [ ]:
!pip install pyspark==3.1.2 -q
!pip install findspark -q

### Importing Required Libraries

_We recommend you import all required libraries in one place (here):_


In [ ]:
# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

# FindSpark simplifies the process of using Apache Spark with Python

import findspark
findspark.init()

#import functions/Classes for sparkml

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

from pyspark.sql import SparkSession
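
The cell above silences warnings for the whole notebook. If you would rather silence them only around a specific call, the standard library offers `warnings.catch_warnings()`. The following is a minimal sketch of that alternative; the function `noisy` is a hypothetical stand-in for any call that emits warnings.

```python
import warnings

def noisy():
    # Hypothetical function that emits a warning before returning a value.
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42

# Suppress warnings only inside this block; the previous filters
# are restored automatically when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy()

print(result)
```

Scoping the suppression this way keeps warnings visible elsewhere in the notebook, which can help when debugging.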


## Examples


### Task 1 - Create a spark session


In [ ]:
#Create SparkSession
#Ignore any warnings by SparkSession command

spark = SparkSession.builder.appName("Clustering using SparkML").getOrCreate()

### Task 2 - Load the data in a csv file into a dataframe


Download the data file


In [ ]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/customers.csv


Load the dataset into the spark dataframe


In [ ]:
# Using the spark.read.csv function, we load the data into a dataframe.
# header=True indicates that there is a header row in our csv file.
# inferSchema=True tells Spark to automatically work out the data types of the columns.

# Load customers dataset
customer_data = spark.read.csv("customers.csv", header=True, inferSchema=True)
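
To illustrate what `header=True` and `inferSchema=True` mean, the following plain-Python sketch (not Spark code) parses a small CSV string with the standard `csv` module. The sample rows and the helper name `infer_type` are hypothetical and used only for illustration; Spark's own schema inference supports more types and works on distributed data.

```python
import csv
import io

# Made-up sample data; the real customers.csv has different values and more rows.
sample_csv = "Fresh_Food,Milk,Grocery,Frozen_Food\n100,20,30,5\n250,40,10,8\n"

def infer_type(values):
    """Return int, float, or str: the narrowest type that fits every value."""
    for candidate in (int, float):
        try:
            for v in values:
                candidate(v)
            return candidate
        except ValueError:
            continue
    return str

reader = csv.reader(io.StringIO(sample_csv))
header = next(reader)   # header=True: the first row holds the column names
rows = list(reader)

# inferSchema=True: inspect the data to choose a type for each column
columns = list(zip(*rows))
schema = {name: infer_type(col) for name, col in zip(header, columns)}
typed_rows = [[schema[name](v) for name, v in zip(header, row)] for row in rows]

print({name: t.__name__ for name, t in schema.items()})
print(typed_rows)
```

Without schema inference every value would stay a string, and the numeric columns could not be used directly as features.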


Print the schema of the dataset


In [ ]:
# Each row in this dataset describes one customer. The columns indicate the orders placed
# by that customer for Fresh_Food, Milk, Grocery and Frozen_Food.

In [ ]:
customer_data.printSchema()

Show top 5 rows from the dataset


In [ ]:
customer_data.show(n=5, truncate=False)

### Task 3 - Create a feature vector


In [ ]:
# Assemble the features into a single vector column
feature_cols = ['Fresh_Food', 'Milk', 'Grocery', 'Frozen_Food']
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
customer_transformed_data = assembler.transform(customer_data)
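
Conceptually, `VectorAssembler` keeps the existing columns and adds one new column that packs the chosen input columns into a single numeric vector. The plain-Python sketch below mimics that idea for one row; the row values and the helper `assemble` are made up for illustration and are not part of the Spark API.

```python
feature_cols = ['Fresh_Food', 'Milk', 'Grocery', 'Frozen_Food']

# Made-up values for a single customer row.
row = {'Fresh_Food': 100, 'Milk': 20, 'Grocery': 30, 'Frozen_Food': 5}

def assemble(row, input_cols, output_col="features"):
    """Return a copy of the row with the chosen columns packed into one list."""
    assembled = dict(row)  # original columns are kept
    assembled[output_col] = [float(row[c]) for c in input_cols]
    return assembled

assembled_row = assemble(row, feature_cols)
print(assembled_row['features'])  # [100.0, 20.0, 30.0, 5.0]
```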


You must tell the KMeans algorithm how many clusters to create from your data.


In [ ]:
number_of_clusters = 3

### Task 4 - Create a clustering model


Create a KMeans clustering model


In [ ]:
kmeans = KMeans(k = number_of_clusters)


Train (fit) the model on the dataset.


In [ ]:
model = kmeans.fit(customer_transformed_data)
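
To give an intuition for what fitting a k-means model involves, here is a plain-Python sketch of the classic k-means loop (Lloyd's algorithm): assign every point to its nearest center, then move each center to the mean of its points, and repeat until nothing changes. This is only an illustration under simplifying assumptions, not Spark's implementation: Spark runs the computation in a distributed fashion and by default uses a different initialization scheme (`k-means||`). The sample points and the function names are hypothetical.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def closest(point, centers):
    """Index of the center nearest to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

def kmeans(points, k, max_iter=20, seed=1):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # simple random initialization (assumption)
    for _ in range(max_iter):
        # Assignment step: group points by their nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest(p, centers)].append(p)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        new_centers = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers

# Made-up two-dimensional points forming two obvious groups.
points = [[1.0, 1.0], [1.0, 2.0], [10.0, 10.0], [10.0, 11.0]]
centers = kmeans(points, k=2)
print(sorted(centers))  # [[1.0, 1.5], [10.0, 10.5]]
```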


### Task 5 - Print Cluster Details


Your model is now trained. Use it to assign each customer to a cluster.


In [ ]:
# Make predictions on the dataset
predictions = model.transform(customer_transformed_data)

In [ ]:
# Display the results
predictions.show(5)

Display how many customers are in each cluster.


In [ ]:
predictions.groupBy('prediction').count().show()
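
The two steps above (predict a cluster for each row, then count rows per cluster) can be mimicked in plain Python. In this sketch the cluster centers and points are made up, and `predict` is a hypothetical helper that returns the index of the nearest center, which is the idea behind the `prediction` column.

```python
from collections import Counter

# Made-up cluster centers and data points.
centers = [[1.0, 1.5], [10.0, 10.5]]
points = [[1.0, 1.0], [1.0, 2.0], [10.0, 10.0], [10.0, 11.0], [9.0, 9.0]]

def predict(point, centers):
    """Return the index of the nearest center (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    return distances.index(min(distances))

predictions = [predict(p, centers) for p in points]
counts = Counter(predictions)  # plays the role of groupBy('prediction').count()

print(predictions)             # [0, 0, 1, 1, 1]
print(sorted(counts.items()))  # [(0, 2), (1, 3)]
```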

In [ ]:
#stop spark session
spark.stop()

## Exercises


### Exercise 1 - Create a spark session


Create a SparkSession with the app name "Seed Clustering".


In [ ]:
spark = #TODO

<details>
    <summary>Click here for a Hint</summary>
    
Use `SparkSession.builder`.

</details>


<details>
    <summary>Click here for Solution</summary>

```python
spark = SparkSession.builder.appName("Seed Clustering").getOrCreate()
```

</details>


### Exercise 2 - Load the data in a csv file into a dataframe


In [ ]:
#download seed dataset
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/seeds.csv


Load the seed dataset


In [ ]:

seed_data =  #TODO


<details>
    <summary>Click here for a Hint</summary>
    
Use `spark.read.csv`.

</details>


<details>
    <summary>Click here for Solution</summary>

```python
seed_data = spark.read.csv("seeds.csv", header=True, inferSchema=True)
```

</details>


Print the schema of the dataset


In [ ]:
seed_data.printSchema()

Show top 5 rows of the data set


In [ ]:
seed_data.show(n=5, truncate=False, vertical=True)

### Exercise 3 - Create a feature vector


Assemble all columns into a single vector


In [ ]:
feature_cols =  #TODO
assembler =  #TODO
seed_transformed_data =  #TODO


<details>
    <summary>Click here for a Hint</summary>
    
Refer to Task 3.
</details>


<details>
    <summary>Click here for Solution</summary>

```python
feature_cols = ['area',
 'perimeter',
 'compactness',
 'length of kernel',
 'width of kernel',
 'asymmetry coefficient',
 'length of kernel groove']

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
seed_transformed_data = assembler.transform(seed_data)

```

</details>


### Exercise 4 - Create a clustering model


Create 7 clusters


In [ ]:
number_of_clusters =  #TODO
kmeans =  #TODO
model =  #TODO

<details>
    <summary>Click here for a Hint</summary>
    
Use the `kmeans.fit()` method.
</details>


<details>
    <summary>Click here for Solution</summary>

```python
number_of_clusters = 7
kmeans = KMeans(k = number_of_clusters)
model = kmeans.fit(seed_transformed_data)

```

</details>


### Exercise 5 - Print Cluster Details


In [ ]:
predictions =  #TODO

<details>
    <summary>Click here for a Hint</summary>
    
Use the `transform()` method.
</details>


<details>
    <summary>Click here for Solution</summary>

```python
predictions = model.transform(seed_transformed_data)
```

</details>


In [ ]:
predictions.show(n=5, truncate=False, vertical=True)

In [ ]:
predictions.groupBy('prediction').count().show()

In [ ]:
#stop spark session
spark.stop()

Congratulations! You have completed this lab.<br>
You are encouraged to try different numbers of clusters on the same dataset.
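
When you try different numbers of clusters, one common way to compare the results is the within-cluster sum of squared distances: for each point, take the squared distance to its nearest center, and add these up. The plain-Python sketch below computes that quantity for made-up points and made-up centers; the function name `clustering_cost` is hypothetical and this is not Spark code.

```python
def clustering_cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total

# Made-up points forming two obvious groups.
points = [[1.0, 1.0], [1.0, 2.0], [10.0, 10.0], [10.0, 11.0]]

one_center = [[5.5, 6.0]]                  # mean of all four points
two_centers = [[1.0, 1.5], [10.0, 10.5]]   # mean of each group

print(clustering_cost(points, one_center))   # 163.0
print(clustering_cost(points, two_centers))  # 1.0
```

The cost usually keeps falling as you add clusters, so look for the point where the improvement levels off rather than for the smallest value.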


## Authors


[Ramesh Sannareddy](https://www.linkedin.com/in/rsannareddy/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMBD0231ENSkillsNetwork866-2023-01-01)




Copyright © 2023 IBM Corporation. All rights reserved.


<!--## Change Log
-->


<!--
|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2023-05-01|0.1|Ramesh Sannareddy|Initial Version Created|
-->
