<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMSkillsNetworkBD0231ENCoursera2789-2023-01-01">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


## Clustering using SparkML


Estimated time needed: **30** minutes


<p style='color: red'>The purpose of this lab is to show you how to use SparkML to cluster data.</p>


## __Table of Contents__
<ol>
  <li>
    <a href="#Objectives">Objectives
    </a>
  </li>
  <li>
    <a href="#Datasets">Datasets
    </a>
  </li>
  <li>
    <a href="#Setup">Setup
    </a>
    <ol>
      <li>
        <a href="#Installing-Required-Libraries">Installing Required Libraries
        </a>
      </li>
      <li>
        <a href="#Importing-Required-Libraries">Importing Required Libraries
        </a>
      </li>
    </ol>
  </li>
  <li>
    <a href="#Examples">Examples
    </a>
    <ol>
      <li>
        <a href="#Task-1---Create-a-spark-session">Task 1 - Create a spark session
        </a>
      </li>
      <li>
        <a href="#Task-2---Load-the-data-in-a-csv-file-into-a-dataframe">Task 2 - Load the data in a csv file into a dataframe
        </a>
      </li>
      <li>
        <a href="#Task-3---Create-a-feature-vector">Task 3 - Create a feature vector
        </a>
      </li>
      <li>
        <a href="#Task-4---Create-a-clustering-model">Task 4 - Create a clustering model
        </a>
      </li>
      <li>
        <a href="#Task-5---Print-Cluster-Details">Task 5 - Print Cluster Details
        </a>
      </li>
    </ol>
  </li>
  <li>
    <a href="#Exercises">Exercises
    </a>
    <ol>
      <li>
        <a href="#Exercise-1---Create-a-spark-session">Exercise 1 - Create a spark session
        </a>
      </li>
      <li>
        <a href="#Exercise-2---Load-the-data-in-a-csv-file-into-a-dataframe">Exercise 2 - Load the data in a csv file into a dataframe
        </a>
      </li>
      <li>
        <a href="#Exercise-3---Create-a-feature-vector">Exercise 3 - Create a feature vector
        </a>
      </li>
      <li>
        <a href="#Exercise-4---Create-a-clustering-model">Exercise 4 - Create a clustering model
        </a>
      </li>
      <li>
        <a href="#Exercise-5---Print-Cluster-Details">Exercise 5 - Print Cluster Details
        </a>
      </li>
    </ol>
  </li>
</ol>


## Objectives

After completing this lab you will be able to:

 - Use PySpark to connect to a Spark cluster.
 - Create a Spark session.
 - Read a CSV file into a DataFrame.
 - Use the KMeans algorithm to cluster the data.
 - Stop the Spark session.
 
 
 


## Datasets

In this lab you will use the following datasets:

 - Modified version of Wholesale customers dataset. Original dataset available at https://archive.ics.uci.edu/ml/datasets/Wholesale+customers 
 - Seeds dataset. Available at https://archive.ics.uci.edu/ml/datasets/seeds


----


## Setup


For this lab, we will be using the following libraries:

*   [`PySpark`](https://spark.apache.org/docs/latest/api/python/index.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMSkillsNetworkBD0231ENCoursera2789-2023-01-01) for connecting to the Spark Cluster


### Installing Required Libraries

Spark Cluster is pre-installed in the Skills Network Labs environment. However, you need libraries like pyspark and findspark to connect to this cluster.

If you wish to download this Jupyter notebook and run it on your local computer, follow the instructions <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/labs/Connecting_to_spark_cluster_using_Skills_Network_labs.ipynb">here.</a>



The following required libraries are __not__ pre-installed in the Skills Network Labs environment. __You will need to run the following cell__ to install them:


In [None]:
!pip install pyspark==3.1.2 -q
!pip install findspark -q

### Importing Required Libraries

_We recommend you import all required libraries in one place (here):_


In [15]:
# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

# FindSpark simplifies the process of using Apache Spark with Python

import findspark
findspark.init()

#import functions/Classes for sparkml

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

from pyspark.sql import SparkSession


## Examples


## Task 1 - Create a spark session


In [16]:
#Create SparkSession
#Ignore any warnings by SparkSession command

spark = SparkSession.builder.appName("Clustering using SparkML").getOrCreate()

24/10/04 08:29:06 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/10/04 08:29:06 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


## Task 2 - Load the data in a csv file into a dataframe


Download the data file


In [2]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/customers.csv


--2024-10-04 08:27:48--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/customers.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8909 (8.7K) [text/csv]
Saving to: ‘customers.csv’


2024-10-04 08:27:49 (2.77 GB/s) - ‘customers.csv’ saved [8909/8909]



Load the dataset into the spark dataframe


In [17]:
# using the spark.read.csv function we load the data into a dataframe.
# header=True indicates that there is a header row in our csv file
# the inferSchema = True, tells spark to automatically find out the data types of the columns.

# Load customers dataset
customer_data = spark.read.csv("customers.csv", header=True, inferSchema=True)
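
To make `inferSchema=True` more concrete, here is a minimal plain-Python sketch of column type inference. It is an illustration only, not Spark's implementation: the `infer_type` and `infer_schema` helpers and the sample text are made up for this example, and Spark's reader handles more types and edge cases. Spark needs an extra pass over the data to infer types, which the sketch mimics by scanning every value in a column.

```python
import csv
import io

def infer_type(values):
    """Return the narrowest type name that fits every value in a column."""
    for type_name, cast in (("integer", int), ("double", float)):
        try:
            for value in values:
                cast(value)
            return type_name
        except ValueError:
            continue
    # Fall back to string when neither numeric cast works for all values.
    return "string"

def infer_schema(csv_text):
    """Read CSV text with a header row and guess one type per column."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = zip(*data)  # transpose rows into columns
    return {name: infer_type(column) for name, column in zip(header, columns)}

# Made-up sample: one integer column, one double column, one string column.
sample = "Fresh_Food,area,note\n12669,15.26,a\n7057,14.88,b\n"
schema = infer_schema(sample)
print(schema)
# {'Fresh_Food': 'integer', 'area': 'double', 'note': 'string'}
```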


Print the schema of the dataset


In [None]:
# Each row in this dataset is about a customer. The columns indicate the orders placed
# by a customer for Fresh_food, Milk, Grocery and Frozen_Food

In [18]:
customer_data.printSchema()

root
 |-- Fresh_Food: integer (nullable = true)
 |-- Milk: integer (nullable = true)
 |-- Grocery: integer (nullable = true)
 |-- Frozen_Food: integer (nullable = true)



Show top 5 rows from the dataset


In [19]:
customer_data.show(n=5, truncate=False)

+----------+----+-------+-----------+
|Fresh_Food|Milk|Grocery|Frozen_Food|
+----------+----+-------+-----------+
|12669     |9656|7561   |214        |
|7057      |9810|9568   |1762       |
|6353      |8808|7684   |2405       |
|13265     |1196|4221   |6404       |
|22615     |5410|7198   |3915       |
+----------+----+-------+-----------+
only showing top 5 rows



## Task 3 - Create a feature vector


In [20]:
# Assemble the features into a single vector column
feature_cols = ['Fresh_Food', 'Milk', 'Grocery', 'Frozen_Food']
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
customer_transformed_data = assembler.transform(customer_data)
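
Conceptually, the assembler adds one new column that packs the chosen input columns, in order, into a single numeric vector per row. The plain-Python sketch below mimics that for one row; the `assemble` helper is hypothetical and only illustrates the idea, while the real `VectorAssembler` produces a Spark ML vector column rather than a Python list.

```python
def assemble(row, input_cols, output_col="features"):
    """Return a copy of row with an extra column holding the inputs as a list of floats."""
    features = [float(row[col]) for col in input_cols]
    return {**row, output_col: features}

feature_cols = ['Fresh_Food', 'Milk', 'Grocery', 'Frozen_Food']

# First row of the customers dataset as shown earlier in this lab.
row = {'Fresh_Food': 12669, 'Milk': 9656, 'Grocery': 7561, 'Frozen_Food': 214}

assembled = assemble(row, feature_cols)
print(assembled["features"])
# [12669.0, 9656.0, 7561.0, 214.0]
```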


You must tell the KMeans algorithm how many clusters to create out of your data


In [21]:
number_of_clusters = 3

## Task 4 - Create a clustering model


Create a KMeans clustering model


In [22]:
kmeans = KMeans(k = number_of_clusters)


Train/Fit the model on the dataset<br>


In [23]:
model = kmeans.fit(customer_transformed_data)
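
What fitting a KMeans model involves can be sketched in plain Python. The code below is a minimal version of Lloyd's algorithm on made-up 2-D points, with a naive initialization that takes the first `k` points as starting centers. It is an assumption-laden illustration, not what Spark runs: Spark's `KMeans` is distributed and, by default, uses the `k-means||` initialization, so its results can differ. The function names here are hypothetical.

```python
import math

def closest(point, centers):
    """Index of the center nearest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def mean(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def kmeans(points, k, max_iter=20):
    """Lloyd's algorithm with a naive start: the first k points are the initial centers."""
    centers = [list(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        labels = [closest(p, centers) for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(mean(members) if members else centers[i])
        if new_centers == centers:
            break  # nothing moved, so the assignment is stable
        centers = new_centers
    return centers, labels

# Made-up points forming two obvious groups.
points = [(1, 1), (9, 9), (1, 2), (8, 9), (2, 1), (9, 8)]
centers, labels = kmeans(points, k=2)
print(labels)
# [0, 1, 0, 1, 0, 1]
```

The two steps inside the loop (assign, then update) repeat until the centers stop moving or `max_iter` is reached.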


## Task 5 - Print Cluster Details


Your model is now trained. Use it to assign each customer to a cluster.


In [25]:
# Make predictions on the dataset
predictions = model.transform(customer_transformed_data)

In [26]:
# Display the results
predictions.show(5)

+----------+----+-------+-----------+--------------------+----------+
|Fresh_Food|Milk|Grocery|Frozen_Food|            features|prediction|
+----------+----+-------+-----------+--------------------+----------+
|     12669|9656|   7561|        214|[12669.0,9656.0,7...|         0|
|      7057|9810|   9568|       1762|[7057.0,9810.0,95...|         0|
|      6353|8808|   7684|       2405|[6353.0,8808.0,76...|         0|
|     13265|1196|   4221|       6404|[13265.0,1196.0,4...|         0|
|     22615|5410|   7198|       3915|[22615.0,5410.0,7...|         1|
+----------+----+-------+-----------+--------------------+----------+
only showing top 5 rows



Display how many customers there are in each cluster.


In [27]:
predictions.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|   60|
|         2|   44|
|         0|  336|
+----------+-----+
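
A cluster center in KMeans is the mean of the feature vectors assigned to that cluster. In PySpark, the fitted model exposes the centers through `model.clusterCenters()`. As a plain-Python illustration of the idea, the sketch below averages only the five rows displayed by `predictions.show(5)` above, so the numbers it prints are means of that small sample and not the model's actual centers. The variable names are made up for this example.

```python
from collections import defaultdict

# (features, prediction) pairs copied from the five rows shown above.
rows = [
    ([12669, 9656, 7561, 214], 0),
    ([7057, 9810, 9568, 1762], 0),
    ([6353, 8808, 7684, 2405], 0),
    ([13265, 1196, 4221, 6404], 0),
    ([22615, 5410, 7198, 3915], 1),
]

# Group the feature vectors by their predicted cluster.
groups = defaultdict(list)
for features, prediction in rows:
    groups[prediction].append(features)

# Average each feature column within a cluster.
sample_centers = {
    cluster: [sum(column) / len(members) for column in zip(*members)]
    for cluster, members in groups.items()
}

for cluster, center in sorted(sample_centers.items()):
    print(cluster, center)
# 0 [9836.0, 7367.5, 7258.5, 2696.25]
# 1 [22615.0, 5410.0, 7198.0, 3915.0]
```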



In [28]:
#stop spark session
spark.stop()

## Exercises


### Exercise 1 - Create a spark session


Create a SparkSession with the app name "Seed Clustering"


In [29]:
spark = SparkSession.builder.appName("Seed Clustering").getOrCreate()

24/10/04 08:29:47 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/10/04 08:29:47 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


<details>
    <summary>Click here for a Hint</summary>
    
Use the SparkSession.builder

</details>


<details>
    <summary>Click here for Solution</summary>

```python
spark = SparkSession.builder.appName("Seed Clustering").getOrCreate()
```

</details>


### Exercise 2 - Load the data in a csv file into a dataframe


In [30]:
#download seed dataset
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/seeds.csv


--2024-10-04 08:29:50--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/seeds.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8973 (8.8K) [text/csv]
Saving to: ‘seeds.csv’


2024-10-04 08:29:53 (4.18 GB/s) - ‘seeds.csv’ saved [8973/8973]



Load the seed dataset


In [31]:

seed_data = spark.read.csv("seeds.csv", header=True, inferSchema=True)


<details>
    <summary>Click here for a Hint</summary>
    
Use the spark.read.csv

</details>


<details>
    <summary>Click here for Solution</summary>

```python
seed_data = spark.read.csv("seeds.csv", header=True, inferSchema=True)
```

</details>


Print the schema of the dataset


In [32]:
seed_data.printSchema()

root
 |-- area: double (nullable = true)
 |-- perimeter: double (nullable = true)
 |-- compactness: double (nullable = true)
 |-- length of kernel: double (nullable = true)
 |-- width of kernel: double (nullable = true)
 |-- asymmetry coefficient: double (nullable = true)
 |-- length of kernel groove: double (nullable = true)



Show top 5 rows of the data set


In [33]:
seed_data.show(n=5, truncate=False, vertical=True)

-RECORD 0-------------------------
 area                    | 15.26  
 perimeter               | 14.84  
 compactness             | 0.871  
 length of kernel        | 5.763  
 width of kernel         | 3.312  
 asymmetry coefficient   | 2.221  
 length of kernel groove | 5.22   
-RECORD 1-------------------------
 area                    | 14.88  
 perimeter               | 14.57  
 compactness             | 0.8811 
 length of kernel        | 5.554  
 width of kernel         | 3.333  
 asymmetry coefficient   | 1.018  
 length of kernel groove | 4.956  
-RECORD 2-------------------------
 area                    | 14.29  
 perimeter               | 14.09  
 compactness             | 0.905  
 length of kernel        | 5.291  
 width of kernel         | 3.337  
 asymmetry coefficient   | 2.699  
 length of kernel groove | 4.825  
-RECORD 3-------------------------
 area                    | 13.84  
 perimeter               | 13.94  
 compactness             | 0.8955 
 length of kernel   

### Exercise 3 - Create a feature vector


Assemble all columns into a single vector


In [35]:
feature_cols = ['area', 'perimeter', 'compactness', 'length of kernel', 'width of kernel', 'asymmetry coefficient', 'length of kernel groove']  
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
seed_transformed_data = assembler.transform(seed_data)


<details>
    <summary>Click here for a Hint</summary>
    
Refer to Task 3
</details>


<details>
    <summary>Click here for Solution</summary>

```python
feature_cols = ['area',
 'perimeter',
 'compactness',
 'length of kernel',
 'width of kernel',
 'asymmetry coefficient',
 'length of kernel groove']

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
seed_transformed_data = assembler.transform(seed_data)

```

</details>


### Exercise 4 - Create a clustering model


Create 7 clusters


In [36]:
number_of_clusters = 7  
kmeans = KMeans(k = number_of_clusters)
model = kmeans.fit(seed_transformed_data)

<details>
    <summary>Click here for a Hint</summary>
    
Use the kmeans.fit() method
</details>


<details>
    <summary>Click here for Solution</summary>

```python
number_of_clusters = 7
kmeans = KMeans(k = number_of_clusters)
model = kmeans.fit(seed_transformed_data)

```

</details>


### Exercise 5 - Print Cluster Details


In [37]:
predictions = model.transform(seed_transformed_data)

<details>
    <summary>Click here for a Hint</summary>
    
Use the transform() method
</details>


<details>
    <summary>Click here for Solution</summary>

```python
predictions = model.transform(seed_transformed_data)
```

</details>


In [38]:
predictions.show(n=5, truncate=False, vertical=True)

-RECORD 0---------------------------------------------------------------
 area                    | 15.26                                        
 perimeter               | 14.84                                        
 compactness             | 0.871                                        
 length of kernel        | 5.763                                        
 width of kernel         | 3.312                                        
 asymmetry coefficient   | 2.221                                        
 length of kernel groove | 5.22                                         
 features                | [15.26,14.84,0.871,5.763,3.312,2.221,5.22]   
 prediction              | 2                                            
-RECORD 1---------------------------------------------------------------
 area                    | 14.88                                        
 perimeter               | 14.57                                        
 compactness             | 0.8811                  

In [39]:
predictions.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|   42|
|         6|   14|
|         3|   18|
|         5|   44|
|         4|   15|
|         2|   47|
|         0|   30|
+----------+-----+



In [42]:
#stop spark session
spark.stop()

Congratulations, you have completed this lab.<br>
You are encouraged to try different numbers of clusters on the same dataset.
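
One common way to compare different numbers of clusters is to compute, for each `k`, the sum of squared distances from every point to its nearest center and watch where the improvement levels off. The sketch below does this in plain Python on made-up 1-D data with a simple, deterministic k-means; `kmeans_1d` and `cost` are hypothetical helpers, and the initialization differs from Spark's. In PySpark, the fitted model's `summary.trainingCost` should report a comparable quantity.

```python
def kmeans_1d(points, k, max_iter=50):
    """Lloyd's algorithm on 1-D data; initial centers are evenly spaced picks from the sorted points."""
    ordered = sorted(points)
    centers = [float(ordered[i * len(ordered) // k]) for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its members; keep it if the cluster is empty.
        new_centers = [
            sum(members) / len(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)

# Made-up data with three obvious groups around 10, 50 and 90.
points = [10, 12, 8, 50, 52, 48, 90, 92, 88]
costs = {k: cost(points, kmeans_1d(points, k)) for k in range(1, 5)}
print(costs)
# {1: 9624.0, 2: 2424.0, 3: 24.0, 4: 18.0}
```

The cost drops sharply up to `k = 3` and barely improves at `k = 4`, which suggests three clusters for this toy data. This is often called the elbow heuristic.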


## Authors


[Ramesh Sannareddy](https://www.linkedin.com/in/rsannareddy/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMBD0231ENSkillsNetwork866-2023-01-01)




Copyright © 2023 IBM Corporation. All rights reserved.


<!--## Change Log
-->


<!--
|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2023-05-01|0.1|Ramesh Sannareddy|Initial Version Created|
-->
